> Igor Sorochan
# "Statistical tests practice"

We have a [video games](https://github.com/obulygin/pyda_homeworks/blob/master/stat_case_study/vgsales.csv) dataset.

Questions to ask:

1. Do critics like sports games?
1. Which video platforms do critics prefer (PC or PS4)?
1. Do critics prefer shooters or strategy games?

### Prepare

In [1]:
#Dependencies
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

import scipy.stats as stats

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

In [2]:
# uncomment to load dataset:
# df_raw = pd.read_csv('https://raw.githubusercontent.com/obulygin/pyda_homeworks/master/stat_case_study/vgsales.csv')

# local source
df_raw = pd.read_csv('/Users/velo1/SynologyDrive/GIT_syno/data/vgsales.csv') 

df_raw

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16714,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,Tecmo Koei,0.00,0.00,0.01,0.00,0.01,,,,,,
16715,LMA Manager 2007,X360,2006.0,Sports,Codemasters,0.00,0.01,0.00,0.00,0.01,,,,,,
16716,Haitaka no Psychedelica,PSV,2016.0,Adventure,Idea Factory,0.00,0.00,0.01,0.00,0.01,,,,,,
16717,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,0.00,0.00,0.00,0.01,,,,,,


### Process

In [3]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB


In [4]:
# checking for duplicates and NaNs
df_raw.duplicated().sum(), df_raw.isna().sum()

(0,
 Name                  2
 Platform              0
 Year_of_Release     269
 Genre                 2
 Publisher            54
 NA_Sales              0
 EU_Sales              0
 JP_Sales              0
 Other_Sales           0
 Global_Sales          0
 Critic_Score       8582
 Critic_Count       8582
 User_Score         6704
 User_Count         9129
 Developer          6623
 Rating             6769
 dtype: int64)

In [5]:
# drop observations with a missing Genre or Critic_Score, as both are essential for the analysis
df = df_raw.dropna(subset=['Critic_Score', 'Genre'])

df

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8,322.0,Nintendo,E
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8,192.0,Nintendo,E
6,New Super Mario Bros.,DS,2006.0,Platform,Nintendo,11.28,9.14,6.50,2.88,29.80,89.0,65.0,8.5,431.0,Nintendo,E
7,Wii Play,Wii,2006.0,Misc,Nintendo,13.96,9.18,2.93,2.84,28.92,58.0,41.0,6.6,129.0,Nintendo,E
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16700,Breach,PC,2011.0,Shooter,Destineer,0.01,0.00,0.00,0.00,0.01,61.0,12.0,5.8,43.0,Atomic Games,T
16701,Bust-A-Move 3000,GC,2003.0,Puzzle,Ubisoft,0.01,0.00,0.00,0.00,0.01,53.0,4.0,tbd,,Taito Corporation,E
16702,Mega Brain Boost,DS,2008.0,Puzzle,Majesco Entertainment,0.01,0.00,0.00,0.00,0.01,48.0,10.0,tbd,,Interchannel-Holon,E
16706,STORM: Frontline Nation,PC,2011.0,Strategy,Unknown,0.00,0.01,0.00,0.00,0.01,60.0,12.0,7.2,13.0,SimBin,E10+


In [6]:
# keep only the columns relevant to the analysis
df = df[['Genre','Critic_Score', 'Platform']]
df.duplicated().sum(), df.isna().sum()

(4048,
 Genre           0
 Critic_Score    0
 Platform        0
 dtype: int64)

In [7]:
df.shape

(8137, 3)

In [8]:
df.Genre.unique(), df.Genre.nunique()

(array(['Sports', 'Racing', 'Platform', 'Misc', 'Action', 'Puzzle',
        'Shooter', 'Fighting', 'Simulation', 'Role-Playing', 'Adventure',
        'Strategy'], dtype=object),
 12)

In [9]:
df.Platform.unique(), df.Platform.nunique()

(array(['Wii', 'DS', 'X360', 'PS3', 'PS2', '3DS', 'PS4', 'PS', 'XB', 'PC',
        'PSP', 'WiiU', 'GC', 'GBA', 'XOne', 'PSV', 'DC'], dtype=object),
 17)

So we have 12 genres and 17 gaming platforms.

### Analyze

#### Do critics like sports games?

In [10]:
df.groupby('Genre').mean(numeric_only = True).sort_values(by='Critic_Score',ascending= False).style.bar(align='left',color='yellow')

Genre,Critic_Score
Role-Playing,72.652646
Strategy,72.086093
Sports,71.968174
Shooter,70.181144
Fighting,69.217604
Simulation,68.619318
Platform,68.05835
Racing,67.963612
Puzzle,67.424107
Action,66.629101


In [11]:
fig = px.box(df, x='Genre', y='Critic_Score', notched=True, color='Genre')
fig.show()

#### Compare median scores

In [12]:
df_nonsports= df[df.Genre != 'Sports'].Critic_Score
df_sports= df[df.Genre == 'Sports'].Critic_Score

print(f'Median score of Sports games: {df_sports.median()}')
print(f'Median score of Other  games: {df_nonsports.median()}')

Median score of Sports games: 75.0
Median score of Other  games: 70.0


In [13]:
# visualisation of the corresponding scores
fig = go.Figure()
fig.add_trace(go.Box(x=df_nonsports, notched= True, name= 'Other',marker_color='green'))
fig.add_trace(go.Box(x=df_sports, notched= True, name= 'Sports',marker_color='yellow'))
fig.update_layout(title="Sports and Other Genres Critic's Scores", xaxis_title="Critic's Scores")
fig.show()

Notches display a confidence interval around the median.  
The interval is computed as  
$median \pm 1.57 \cdot \frac{IQR}{\sqrt{N}}$, where  
* IQR is the interquartile range  
* and N is the sample size.  

If the notches of two boxes do not overlap, their medians differ at roughly the 95% confidence level. 
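
The notch formula can be reproduced by hand. The block below is a minimal sketch: the helper names `notch_interval` and `notches_overlap` are hypothetical, the toy arrays are made up, and the quartiles come from NumPy's default linear interpolation, which may differ slightly from the quartile method the plotting library uses.

```python
import numpy as np

def notch_interval(values):
    """Return (low, high) = median -/+ 1.57 * IQR / sqrt(N)."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]          # ignore missing scores
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(values.size)
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True if the two notch intervals share at least one point."""
    low_a, high_a = notch_interval(a)
    low_b, high_b = notch_interval(b)
    return low_a <= high_b and low_b <= high_a

# Toy data (not from the dataset): two samples shifted by 20 points
toy_a = np.arange(1, 101)
toy_b = np.arange(21, 121)
print(notch_interval(toy_a), notch_interval(toy_b))
print("overlap:", notches_overlap(toy_a, toy_b))
```

With the notebook's data the same call would be `notches_overlap(df_sports, df_nonsports)`.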

Let's check this with a statistical test.
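
The notches compare medians, whereas a t-test compares means. A rank-based check such as the Mann-Whitney U test makes no normality assumption and could serve as a cross-check. This test is not part of the original analysis; the sketch below uses synthetic samples, and the helper name `rank_test` is hypothetical.

```python
import numpy as np
from scipy import stats

def rank_test(a, b, alpha=0.05):
    """Two-sided Mann-Whitney U test; returns (statistic, p-value, reject H0?)."""
    result = stats.mannwhitneyu(a, b, alternative='two-sided')
    return result.statistic, result.pvalue, bool(result.pvalue < alpha)

# Synthetic samples standing in for two groups of critic scores
rng = np.random.default_rng(0)
group_a = rng.normal(loc=75, scale=10, size=300)
group_b = rng.normal(loc=70, scale=10, size=300)

statistic, pvalue, reject = rank_test(group_a, group_b)
print(f"U = {statistic:.1f}, p-value = {pvalue:.3g}, reject H0: {reject}")
```

On the notebook's data the call would be `rank_test(df_sports, df_nonsports)`.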

#### Compare mean scores

In [14]:
print(stats.shapiro(df_nonsports), stats.shapiro(df_sports))

ShapiroResult(statistic=0.9778159856796265, pvalue=1.5516293858467083e-31) ShapiroResult(statistic=0.940378725528717, pvalue=1.7022328894188197e-21)



p-value may not be accurate for N > 5000.



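The Shapiro-Wilk p-values are far below 0.05, so normality is rejected for both samples (for N > 5000 the p-value is only approximate, as the warning notes).  
The samples are independent and each contains many observations, so by the central limit theorem a t-test for the means of *two independent* samples is still a reasonable choice.  
We use Welch's version of the test, which does not assume that the populations have identical variances.  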


H0:   $CS.mean{_{Sports}} = CS.mean{_{Others}} $  

H1:   $CS.mean{_{Sports}} \ne CS.mean{_{Others}} $

$\alpha = 0.05$ (confidence level $0.95$)
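
For reference, Welch's statistic and its degrees of freedom can be computed directly from the sample means and variances. The block below is a minimal sketch on synthetic data; the helper name `welch_t` is hypothetical, and the result is compared with `scipy.stats.ttest_ind(..., equal_var=False)`.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's t statistic, Welch-Satterthwaite degrees of freedom, two-sided p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # squared standard errors of the two sample means
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p

# Synthetic samples with unequal sizes and variances
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=50)
y = rng.normal(loc=0.3, scale=2.0, size=80)

t, dof, p = welch_t(x, y)
reference = stats.ttest_ind(x, y, equal_var=False)
print(t, dof, p)
print(reference.statistic, reference.pvalue)
```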

In [15]:
# Sanity check: testing the non-sports sample against its own mean
# must give statistic = 0 and p-value = 1
popmean_notsports = df_nonsports.mean()
print(popmean_notsports)
stats.ttest_1samp(df_nonsports, popmean=popmean_notsports, nan_policy='omit')

68.4516779490134


Ttest_1sampResult(statistic=0.0, pvalue=1.0)

In [16]:
# One-sample test: Sports scores against the overall mean Critic_Score
popmean = df.Critic_Score.mean()
print(popmean)
stats.ttest_1samp(df_sports, popmean= popmean, nan_policy= 'omit')

68.96767850559173


Ttest_1sampResult(statistic=7.470587451672033, pvalue=1.538088875231057e-13)

In [17]:
# equal_var=False performs Welch's t-test, which does not assume equal population variances
stats.ttest_ind(df_sports, df_nonsports, equal_var=False, nan_policy='omit')

Ttest_indResult(statistic=8.08698828481822, pvalue=1.181171308320441e-15)

$p\text{-value} < 0.05 \Rightarrow$  
we reject the null hypothesis at the 5% significance level.  

`Sports games receive higher critic scores on average than all other genres combined.`

#### Which video platforms do critics prefer (PC or PS4)?

In [18]:
df.groupby('Platform').mean(numeric_only= True).sort_values(by='Critic_Score',ascending= False).style.bar(align='left', color='coral')

Platform,Critic_Score
DC,87.357143
PC,75.928671
XOne,73.325444
PS4,72.09127
PS,71.515
PSV,70.791667
WiiU,70.733333
PS3,70.382927
XB,69.85931
GC,69.488839


In [19]:
y_pc = df[df.Platform == 'PC'].Critic_Score
y_ps4 = df[df.Platform == 'PS4'].Critic_Score

fig = go.Figure()
fig.add_trace(go.Box(x=y_ps4, notched= True, name= 'PS4', marker_color='darkblue'))
fig.add_trace(go.Box(x=y_pc, notched= True, name= 'PC', marker_color='#FF4136'))

fig.update_layout(title="PC and PS4 Critic's Scores", xaxis_title="Critic's Scores")
fig.show()

The notches of the two boxes do not overlap,  
so their medians differ at roughly the 95% confidence level. 

Let's check the means with a t-test.

In [20]:
print(stats.shapiro(y_pc), stats.shapiro(y_ps4))

ShapiroResult(statistic=0.9565241932868958, pvalue=1.0608874889683761e-13) ShapiroResult(statistic=0.9328337907791138, pvalue=2.690704770103025e-09)


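The Shapiro-Wilk p-values are far below 0.05, so normality is rejected for both samples.  
The samples are independent and each contains many observations, so a t-test for the means is still a reasonable choice.  
We use Welch's version of the test, which does not assume that the populations have identical variances.  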


H0:   $CS.mean{_{PC}} = CS.mean{_{PS4}} $  

H1:   $CS.mean{_{PC}} \ne CS.mean{_{PS4}} $

$\alpha = 0.05$ (confidence level $0.95$)

In [21]:
# equal_var=False performs Welch's t-test, which does not assume equal population variances
stats.ttest_ind(y_pc, y_ps4, equal_var=False, nan_policy='omit')

Ttest_indResult(statistic=4.3087588262138725, pvalue=2.067249157283479e-05)

$p\text{-value} < 0.05 \Rightarrow$  
we reject the null hypothesis at the 5% significance level.  

`Critics score PC games higher on average than PS4 games.`

#### Do critics prefer shooters or strategy games?

In [22]:
df.groupby('Genre').mean(numeric_only= True).sort_values(by='Critic_Score',ascending= False).style.bar(align='left', color='grey')

Genre,Critic_Score
Role-Playing,72.652646
Strategy,72.086093
Sports,71.968174
Shooter,70.181144
Fighting,69.217604
Simulation,68.619318
Platform,68.05835
Racing,67.963612
Puzzle,67.424107
Action,66.629101


In [23]:
y_rpg = df[df.Genre == 'Role-Playing'].Critic_Score
y_str = df[df.Genre == 'Strategy'].Critic_Score

fig = go.Figure()
fig.add_trace(go.Box(x=y_rpg, notched= True, name= 'Role-Playing', marker_color='red'))
fig.add_trace(go.Box(x=y_str, notched= True, name= 'Strategy', marker_color='black'))
fig.add_vline(x=y_rpg.median(), line_color='red')
fig.update_layout(title="Role-Playing and Strategy Critic's Scores", xaxis_title="Critic's Scores")
fig.show()

The notches of the two boxes overlap, so we **cannot claim with 95% confidence** that their medians differ. 

Let's check the means with a t-test.

In [24]:
print(stats.shapiro(y_rpg), stats.shapiro(y_str))

ShapiroResult(statistic=0.9816334843635559, pvalue=5.457165030975375e-08) ShapiroResult(statistic=0.9744413495063782, pvalue=3.258884316892363e-05)


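The Shapiro-Wilk p-values are far below 0.05, so normality is rejected for both samples.  
The samples are independent and each contains many observations, so a t-test for the means is still a reasonable choice.  
We use Welch's version of the test, which does not assume that the populations have identical variances.  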


H0:   $CS.mean{_{rpg}} = CS.mean{_{strategy}} $  

H1:   $CS.mean{_{rpg}} \ne CS.mean{_{strategy}} $

$\alpha = 0.05$ (confidence level $0.95$)

In [25]:
# equal_var=False performs Welch's t-test, which does not assume equal population variances
stats.ttest_ind(y_rpg, y_str, equal_var=False, nan_policy='omit')

Ttest_indResult(statistic=0.698083061405362, pvalue=0.4854113519174341)

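$p\text{-value} > 0.05$, so we have no reason to reject the null hypothesis.  
We `don't have statistically significant results` to conclude that critics prefer RPGs over Strategy games or vice versa.  
Note that this test compares Role-Playing, the top-ranked genre, with Strategy; the Shooter genre named in the question is not tested here.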
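
The question as posed mentions shooters, while the cells above compare Role-Playing with Strategy. A small reusable helper would let the same Welch test be run for any pair of groups, including Shooter against Strategy. The block below is a sketch only: the function name `compare_scores` is hypothetical, and the toy frame contains made-up scores purely to show the call. No result for the real dataset is implied.

```python
import pandas as pd
from scipy import stats

def compare_scores(frame, column, first, second, alpha=0.05):
    """Welch's t-test on Critic_Score between two values of a grouping column."""
    a = frame.loc[frame[column] == first, 'Critic_Score'].dropna()
    b = frame.loc[frame[column] == second, 'Critic_Score'].dropna()
    result = stats.ttest_ind(a, b, equal_var=False)
    return {
        'mean_first': a.mean(),
        'mean_second': b.mean(),
        'statistic': result.statistic,
        'pvalue': result.pvalue,
        'reject_h0': bool(result.pvalue < alpha),
    }

# Made-up scores, only to demonstrate the call signature
toy = pd.DataFrame({
    'Genre': ['Shooter'] * 5 + ['Strategy'] * 5,
    'Critic_Score': [70, 72, 68, 74, 71, 80, 82, 79, 85, 81],
})
summary = compare_scores(toy, 'Genre', 'Shooter', 'Strategy')
print(summary)
```

With the notebook's frame the call would be `compare_scores(df, 'Genre', 'Shooter', 'Strategy')`; the same helper also covers the platform question via `compare_scores(df, 'Platform', 'PC', 'PS4')`.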