## Import packages

In [4]:
!pip install bstrap
import pandas as pd
import scipy.stats as st
import numpy as np
from bstrap import bootstrap

Collecting bstrap
  Downloading bstrap-0.0.9-py3-none-any.whl (6.6 kB)
Installing collected packages: bstrap
Successfully installed bstrap-0.0.9




## Read data

In [5]:
data = pd.read_csv('kinopoisk rating.csv', sep=';', encoding='windows-1251')

In [7]:
data.head()

Unnamed: 0,num,name_rus,rating_new,origin,genre,rating_old,qty_views
0,1,Зеленая миля,9.1,США,фэнтези/ драма,8.9,692418
1,2,Побег из Шоушенка,9.1,США,драма,8.9,784326
2,3,Властелин колец: Возвращение короля,8.6,Новая Зеландия/ США,фэнтези/ приключения,8.8,481829
3,4,Властелин колец: Две крепости,8.6,Новая Зеландия/ США,фэнтези/ приключения,8.8,467607
4,5,Властелин колец: Братство Кольца,8.6,Новая Зеландия/ США,фэнтези/ приключения,8.8,516856


In [12]:
data.shape

(250, 7)

## Analysis

### Mann-Whitney U-test for all data

In [9]:
st.mannwhitneyu(data.rating_old, data.rating_new)

MannwhitneyuResult(statistic=31324.5, pvalue=0.9629567921262221)

Since the p-value (0.963) is greater than 0.05, the Mann-Whitney U-test gives no evidence of a statistically significant difference between the old and new ratings, so H0 is not rejected.
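The decision rule used here (compare the p-value with 0.05) can be written down explicitly. The sketch below is only an illustration: the helper name `mann_whitney_decision`, the `alpha` parameter and the synthetic ratings are assumptions, not part of the original notebook. It passes `alternative='two-sided'` explicitly so the result does not depend on the SciPy version's default.

```python
import numpy as np
import scipy.stats as st


def mann_whitney_decision(old, new, alpha=0.05):
    """Run a two-sided Mann-Whitney U-test and report whether H0 is rejected.

    Hypothetical helper: H0 is "both samples come from the same distribution".
    """
    result = st.mannwhitneyu(old, new, alternative='two-sided')
    return {
        'statistic': float(result.statistic),
        'pvalue': float(result.pvalue),
        'reject_h0': bool(result.pvalue < alpha),
    }


# Synthetic ratings for illustration only (not the Kinopoisk data)
rng = np.random.default_rng(0)
demo_old = rng.normal(8.2, 0.3, size=250).round(1)
demo_new = rng.normal(8.2, 0.3, size=250).round(1)
print(mann_whitney_decision(demo_old, demo_new))
```

On the real data the call would presumably be `mann_whitney_decision(data.rating_old, data.rating_new)`.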

### Additional information about the data

Let's find the most frequently occurring values in the origin and genre columns of the dataset.

In [43]:
data.groupby(['genre'])['genre'].count().sort_values(ascending=False)

genre
фантастика/ боевик     19
мультфильм/ фэнтези    13
драма/ мелодрама       13
драма                  12
триллер/ драма         11
                       ..
биография/ драма        1
криминал/ биография     1
криминал/ боевик        1
боевик/ приключения     1
комедия/ мелодрама      1
Name: genre, Length: 76, dtype: int64
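The counts above treat each combination such as `фэнтези/ драма` as one category. Since the later filters use `str.contains` on single genres, it may also help to count individual genres. The sketch below assumes the `genre` column always separates genres with `/`; the small `demo` frame is made up for illustration.

```python
import pandas as pd

# Tiny hypothetical frame that mimics the format of the 'genre' column
demo = pd.DataFrame({'genre': ['фэнтези/ драма', 'драма', 'фэнтези/ приключения']})

# Counts per genre combination, equivalent to the groupby/count/sort above
combo_counts = demo['genre'].value_counts()

# Split 'a/ b' into separate genres, put each on its own row, then count
single_counts = (
    demo['genre']
    .str.split('/')
    .explode()
    .str.strip()
    .value_counts()
)
print(single_counts)
```

The same chain could be applied to `data['genre']` or `data['origin']`.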

In [42]:
data.groupby(['origin'])['origin'].count().sort_values(ascending=False)

origin
США                         110
СССР                         31
Великобритания/ США          17
США/ Германия                10
Япония                        9
США/ Великобритания           9
Новая Зеландия/ США           6
США/ Канада                   5
Италия                        4
Россия                        4
Германия                      3
Франция                       3
Австралия/ США                2
США/ Мексика                  2
Великобритания/ Франция       2
США/ Япония                   2
Корея Южная                   2
Франция/ США                  2
Франция/ Великобритания       1
Франция/ Бельгия              1
США/ Италия                   1
Франция/ Канада               1
Франция/ Польша               1
США/ Франция                  1
США/ Новая Зеландия           1
Швеция                        1
США/ Мальта                   1
США/ Китай                    1
Бразилия/ Франция             1
Германия/ Италия              1
Великобритания                1
Г

### Mann-Whitney U-test for data with similar origin, genre or origin/genre

In [15]:
data_USA = data[data.origin.str.contains('США')]
data_USA.shape

(174, 7)

In [16]:
st.mannwhitneyu(data_USA.rating_old, data_USA.rating_new)

MannwhitneyuResult(statistic=16482.5, pvalue=0.1461184817005307)

In [32]:
data_fantasy = data[data.genre.str.contains('фэнтези')]
data_fantasy.shape

(38, 7)

In [33]:
st.mannwhitneyu(data_fantasy.rating_old, data_fantasy.rating_new)

MannwhitneyuResult(statistic=868.5, pvalue=0.12298584810308144)

In [11]:
data_USA_fantasy = data[data.origin.str.contains('США') & data.genre.str.contains('фэнтези')]
data_USA_fantasy.shape

(33, 7)

In [13]:
st.mannwhitneyu(data_USA_fantasy.rating_old, data_USA_fantasy.rating_new)

MannwhitneyuResult(statistic=672.0, pvalue=0.09864439629244537)

In [35]:
data_action = data[data.genre.str.contains('боевик')]
data_action.shape

(41, 7)

In [37]:
st.mannwhitneyu(data_action.rating_old, data_action.rating_new)

MannwhitneyuResult(statistic=1094.0, pvalue=0.015909008539090632)

In [39]:
data_USA_action = data[data.origin.str.contains('США') & data.genre.str.contains('боевик')]
data_USA_action.shape

(33, 7)

In [40]:
st.mannwhitneyu(data_USA_action.rating_old, data_USA_action.rating_new)

MannwhitneyuResult(statistic=699.0, pvalue=0.042728150663774264)

### Conclusion for the Mann-Whitney U-test

For most samples the p-value is greater than 0.05, which means there is no statistically significant difference between the old and new ratings. The exceptions are the action sample (p ≈ 0.016) and the USA/action sample (p ≈ 0.043), where the p-value is below 0.05. For these two samples we reject H0; for all the others we do not reject H0.
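The per-sample tests in this section can also be collected into one table, so that every p-value is compared with the same threshold in one place. This is a sketch under assumptions: `mann_whitney_table` is a hypothetical helper, and the demo frame is synthetic and only reuses the column names `origin`, `genre`, `rating_old` and `rating_new` from the dataset.

```python
import numpy as np
import pandas as pd
import scipy.stats as st


def mann_whitney_table(subsets, alpha=0.05):
    """Run a two-sided Mann-Whitney U-test on each named subset (hypothetical helper)."""
    rows = []
    for name, frame in subsets.items():
        res = st.mannwhitneyu(frame['rating_old'], frame['rating_new'],
                              alternative='two-sided')
        rows.append({
            'sample': name,
            'n': len(frame),
            'pvalue': res.pvalue,
            'reject_h0': res.pvalue < alpha,
        })
    return pd.DataFrame(rows).set_index('sample')


# Synthetic data for illustration only (not the Kinopoisk ratings)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'origin': rng.choice(['США', 'СССР'], size=200),
    'genre': rng.choice(['боевик', 'драма'], size=200),
    'rating_old': rng.normal(8.2, 0.3, size=200).round(1),
    'rating_new': rng.normal(8.2, 0.3, size=200).round(1),
})
subsets = {
    'all': demo,
    'USA': demo[demo.origin.str.contains('США')],
    'action': demo[demo.genre.str.contains('боевик')],
}
summary = mann_whitney_table(subsets)
print(summary)
```

With the real data, the dictionary would hold `data`, `data_USA`, `data_fantasy` and the other subsets defined above.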

## Bootstrap

### Bootstrap for all data

In [50]:
old, new, p = bootstrap(np.mean, data.rating_old, data.rating_new, nbr_runs=2000)
old, new, p

({'avg_metric': 8.1797266,
  'metric_ci_lb': 8.159600000000001,
  'metric_ci_ub': 8.2},
 {'avg_metric': 8.1845764, 'metric_ci_lb': 8.156, 'metric_ci_ub': 8.2128},
 0.814)

The bootstrap p-value (0.814) is greater than 0.05, which means there is no statistically significant difference.
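To make the idea behind the bootstrap more concrete, here is a minimal hand-rolled percentile bootstrap in NumPy. It is not necessarily how `bstrap` works internally; the function name `bootstrap_mean_diff`, its parameters, and the choice to resample whole rows (so the old and new rating of the same film stay together) are assumptions made for this sketch.

```python
import numpy as np


def bootstrap_mean_diff(old, new, n_runs=2000, ci=95, seed=0):
    """Percentile bootstrap for mean(new) - mean(old), resampling paired rows."""
    old = np.asarray(old, dtype=float)
    new = np.asarray(new, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(old)
    diffs = np.empty(n_runs)
    for i in range(n_runs):
        idx = rng.integers(0, n, size=n)  # draw row indices with replacement
        diffs[i] = new[idx].mean() - old[idx].mean()
    tail = (100 - ci) / 2
    lower, upper = np.percentile(diffs, [tail, 100 - tail])
    return {'mean_diff': diffs.mean(), 'ci_lb': lower, 'ci_ub': upper}


# Synthetic ratings for illustration only (not the Kinopoisk data)
rng = np.random.default_rng(42)
demo_old = rng.normal(8.2, 0.3, size=250)
demo_new = demo_old + rng.normal(0.0, 0.1, size=250)
print(bootstrap_mean_diff(demo_old, demo_new))
```

If the confidence interval for the difference contains 0, the resampled data are consistent with no change in the mean rating, which is the same reading as a large bootstrap p-value.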

### Bootstrap for data with the same origin, genre or origin/genre

In [51]:
old, new, p = bootstrap(np.mean, data_USA.rating_old, data_USA.rating_new, nbr_runs=2000)
old, new, p

({'avg_metric': 8.178181034482758,
  'metric_ci_lb': 8.154022988505748,
  'metric_ci_ub': 8.204022988505747},
 {'avg_metric': 8.154612356321838,
  'metric_ci_lb': 8.120689655172415,
  'metric_ci_ub': 8.189655172413794},
 0.3475)

In [55]:
old, new, p = bootstrap(np.mean, data_fantasy.rating_old, data_fantasy.rating_new, nbr_runs=2000)
old, new, p

({'avg_metric': 8.224065789473686,
  'metric_ci_lb': 8.160526315789474,
  'metric_ci_ub': 8.292105263157897},
 {'avg_metric': 8.142288157894738,
  'metric_ci_lb': 8.071052631578947,
  'metric_ci_ub': 8.215789473684211},
 0.1875)

In [53]:
old, new, p = bootstrap(np.mean, data_USA_fantasy.rating_old, data_USA_fantasy.rating_new, nbr_runs=2000)
old, new, p

({'avg_metric': 8.24436515151515,
  'metric_ci_lb': 8.16969696969697,
  'metric_ci_ub': 8.324242424242424},
 {'avg_metric': 8.148693939393938,
  'metric_ci_lb': 8.069696969696968,
  'metric_ci_ub': 8.236363636363635},
 0.175)

In [54]:
old, new, p = bootstrap(np.mean, data_USA_action.rating_old, data_USA_action.rating_new, nbr_runs=2000)
old, new, p

({'avg_metric': 8.162513636363638,
  'metric_ci_lb': 8.11818181818182,
  'metric_ci_ub': 8.212121212121213},
 {'avg_metric': 8.089484848484847,
  'metric_ci_lb': 8.015151515151516,
  'metric_ci_ub': 8.16969696969697},
 0.1775)

For all of these samples the p-value is greater than 0.05, so there is no statistically significant difference.

## Conclusion

We analysed the data in two ways: with the Mann-Whitney U-test and with the bootstrap. Overall, the old and new ratings show no statistically significant difference. The Mann-Whitney U-test gave p < 0.05 for the action and USA/action samples, but the bootstrap did not confirm a difference for the USA/action sample (the action-only sample was not bootstrapped). So our analysis gives no grounds to reject H0.