# Statistical Testing on Video Game Ratings

This notebook performs hypothesis testing to understand patterns in user ratings.  

We compare user and critic ratings, examine the influence of maturity and platform, and identify statistically significant differences using non-parametric tests where needed.

In [14]:
# Import required libraries
import pandas as pd
import numpy as np
import sys
from scipy import stats

sys.path.append("..")
# Forward slashes keep the path portable across operating systems
games = pd.read_csv("../data/games_data_cleaned.csv")
games.columns

Index(['id', 'name', 'avg_user_rating', 'user_ratings_count', 'reviews_count',
       'added', 'avg_playtime', 'esrb_rating', 'release_year', 'release_month',
       'release_quarter', 'critic_rating', 'rating_gap', 'genre_count',
       'is_action', 'is_adventure', 'is_arcade', 'is_board_games', 'is_card',
       'is_casual', 'is_educational', 'is_family', 'is_fighting', 'is_indie',
       'is_massively_multiplayer', 'is_platformer', 'is_puzzle', 'is_rpg',
       'is_racing', 'is_shooter', 'is_simulation', 'is_sports', 'is_strategy',
       'platform_type', 'platform_count', 'store_count'],
      dtype='object')

### Are User and Critic Ratings Statistically Different? (Supplementary, not included in the report)

**Hypothesis:**  
H₀: There is no difference between the average user rating and average critic rating.

H₁: There is a significant difference between the two.

**Test Used:** Paired t-test

**Why:** Both ratings refer to the same games, so they are dependent samples. We use the paired t-test to compare the mean difference.
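
To make the "mean difference" idea concrete, here is a minimal sketch of the statistic a paired t-test is built on: the mean of the per-game differences divided by its standard error. The helper name `paired_t_statistic` and the five rating pairs are hypothetical and only for illustration; they are not taken from the dataset.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Return (t, degrees of freedom) for two paired samples."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample std uses n - 1)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / se, n - 1


# Hypothetical ratings for five games (illustrative values only)
user = [3.1, 3.8, 2.9, 4.2, 3.5]
critic = [3.6, 4.0, 3.5, 4.1, 4.0]

t_stat, dof = paired_t_statistic(user, critic)
print(f"t = {t_stat:.3f}, dof = {dof}")
```

A negative t here means the first sample (user ratings) is lower on average than the second (critic ratings).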

In [15]:
games['avg_user_rating'].mean(), games['critic_rating'].mean()

(np.float64(3.2629980300418615), np.float64(3.734782608695652))

In [16]:
paired = games[['avg_user_rating', 'critic_rating', 'rating_gap']].dropna().copy()

# Check normality of the paired differences (Shapiro-Wilk test)
stat, p = stats.shapiro(paired['rating_gap'])
print(f"p-value: {p:.4f}")

p-value: 0.0000


The paired differences are not normally distributed (Shapiro-Wilk p < 0.0001), but the paired t-test is robust to minor violations of normality when the sample is large: https://pythonfordatascienceorg.wordpress.com/paired-samples-t-test-python/#test-assumptions
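
If the normality violation is a concern, a distribution-free cross-check is possible. The sketch below uses an exact two-sided sign test, which is a different technique from the paired t-test used in this notebook: it asks whether positive and negative differences are equally likely rather than whether the mean difference is zero. The function name `sign_test_p_value` and the sample differences are hypothetical.

```python
import math


def sign_test_p_value(diffs):
    """Exact two-sided sign test on paired differences."""
    # Zero differences carry no sign information and are dropped
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        raise ValueError("all differences are zero")
    positives = sum(d > 0 for d in nonzero)
    tail = min(positives, n - positives)
    # P(X <= tail) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    one_sided = sum(math.comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)


# Hypothetical rating gaps (user minus critic), illustrative values only
gaps = [-0.5, -0.2, -0.6, 0.1, -0.5, -0.3, -0.4, -0.1, -0.7, -0.2]
print(f"Sign test p-value: {sign_test_p_value(gaps):.4f}")
```

Agreement between this check and the paired t-test would support the conclusion despite the non-normal differences.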

In [17]:
# Run paired t-test
t_stat, p_val = stats.ttest_rel(paired['avg_user_rating'], paired['critic_rating'])
print(f"Paired t-test: p = {p_val:.4f}")

Paired t-test: p = 0.0000


### Do ESRB Maturity Ratings Affect User Ratings?

**Hypothesis:**  
H₀: The average user rating does not differ across the ESRB maturity categories (Teen, Mature, and Everyone).  

H₁: At least one ESRB category differs significantly in average user ratings.

**Test Used:** Kruskal-Wallis test with Dunn's post-hoc test

**Why:** The normality assumption fails for all three groups. Kruskal-Wallis is a non-parametric alternative to ANOVA.

In [18]:
esrb = ['Teen', 'Mature', 'Everyone']

# D'Agostino-Pearson normality test for each ESRB group
for e in esrb:
    e_rating = games[games['esrb_rating'] == e]['avg_user_rating']
    p = stats.normaltest(e_rating).pvalue
    print(f"{e}: p value ={p:.4f}")

Teen: p value =0.0000
Mature: p value =0.0000
Everyone: p value =0.0000


In [19]:
# Split user ratings by ESRB category
teen = games[games['esrb_rating'] == 'Teen']['avg_user_rating']
mature = games[games['esrb_rating'] == 'Mature']['avg_user_rating']
every = games[games['esrb_rating'] == 'Everyone']['avg_user_rating']

# Levene's test for equal variances
print("Levene's test p-value:", stats.levene(teen, mature, every).pvalue)

# Kruskal-Wallis test across the three groups
print("Kruskal-Wallis p-value:", stats.kruskal(teen, mature, every).pvalue)

Levene's test p-value: 0.015083017164754815
Kruskal-Wallis p-value: 2.279578976747048e-12
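
With samples this large, a very small p-value does not by itself say how big the difference is. One way to gauge that is an effect size derived from the Kruskal-Wallis H statistic. The sketch below uses two commonly cited formulas, epsilon-squared, H / (n - 1), and eta-squared based on H, (H - k + 1) / (n - k). The notebook does not print H, so the inputs below are hypothetical placeholders, as are the function names.

```python
def epsilon_squared(h_stat, n_total):
    """Epsilon-squared effect size: H / (n - 1)."""
    return h_stat / (n_total - 1)


def eta_squared_h(h_stat, n_groups, n_total):
    """Eta-squared based on H: (H - k + 1) / (n - k)."""
    return (h_stat - n_groups + 1) / (n_total - n_groups)


# Hypothetical inputs, NOT values from this notebook
h_example = 10.0
n_example = 101
print(f"epsilon^2 = {epsilon_squared(h_example, n_example):.3f}")
print(f"eta^2[H]  = {eta_squared_h(h_example, 3, n_example):.3f}")
```

In this notebook, the H statistic would come from the `.statistic` attribute of the `stats.kruskal` result, and `n_total` would be the combined size of the three groups.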


In [20]:
import scikit_posthocs as sp

# Keep only the three ESRB categories under test
maturity = games[['esrb_rating', 'avg_user_rating']].dropna().copy()
maturity = maturity[maturity['esrb_rating'].isin(esrb)]

# Pairwise Dunn's test with Bonferroni-adjusted p-values
posthoc = sp.posthoc_dunn(maturity, val_col='avg_user_rating', group_col='esrb_rating', p_adjust='bonferroni')
posthoc


| | Everyone | Mature | Teen |
|---|---|---|---|
| Everyone | 1.0 | 1.317894e-05 | 1.0 |
| Mature | 1.3e-05 | 1.0 | 4.470862e-12 |
| Teen | 1.0 | 4.470862e-12 | 1.0 |
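
The off-diagonal 1.0 for Everyone vs. Teen is a consequence of the Bonferroni adjustment, which multiplies each raw p-value by the number of comparisons and caps the result at 1. A minimal sketch of that standard correction follows; the function name and the raw p-values are hypothetical, and this is not the code scikit-posthocs runs internally.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values for three pairwise comparisons
raw = [0.01, 0.5, 0.04]
adjusted = bonferroni_adjust(raw)
print(adjusted)
```

Any raw p-value above 1/3 is reported as exactly 1.0 when three comparisons are made.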


### Do Multi-Platform (Console and PC) and PC-Only Games Have Different User Ratings?

**Hypothesis:**  
H₀: There is no difference in user ratings between multi-platform and PC-only games.  

H₁: User ratings differ between multi-platform and PC-only games.

**Test Used:** Mann-Whitney U test

**Why:** The normality assumption fails for both groups, and Levene's test rejects equal variances; thus, non-parametric testing is more appropriate.

In [21]:
games['platform_type'].value_counts()

platform_type
multi_platform    3768
pc_only           3690
console_only       594
other               70
Name: count, dtype: int64

In [22]:
# Group ratings by platform type
multi = games[games['platform_type'] == 'multi_platform']['avg_user_rating']
pc = games[games['platform_type'] == 'pc_only']['avg_user_rating']


# Normality check per platform group
print("Multi platform normality: ",stats.normaltest(multi).pvalue)
print("PC platform normality: ", stats.normaltest(pc).pvalue)

# Levene’s test for equal variances
print("Levene's test p-value:", stats.levene(multi, pc).pvalue)

Multi platform normality:  8.924437713928959e-35
PC platform normality:  1.623991834224775e-107
Levene's test p-value: 8.6533494793048e-78


In [24]:
print("Mann Whitney U p-value: ", stats.mannwhitneyu(multi, pc, alternative='two-sided').pvalue)

Mann Whitney U p-value:  6.21920733118219e-181
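
The p-value shows that the two groups differ, but not by how much or in which direction. A rank-biserial correlation computed from the U statistic can fill that gap. The sketch below assumes `u_stat` is the U statistic for the first sample, which is what we understand recent SciPy versions return as `.statistic` from `stats.mannwhitneyu`. The function name and the inputs are hypothetical.

```python
def rank_biserial(u_stat, n1, n2):
    """Return (rank-biserial r, common-language effect size) from U."""
    # Share of (x, y) pairs in which x ranks higher, ties counted as half
    cles = u_stat / (n1 * n2)
    return 2 * cles - 1, cles


# Hypothetical inputs, NOT values from this notebook
r, cles = rank_biserial(u_stat=75.0, n1=10, n2=10)
print(f"rank-biserial r = {r:.2f}, P(x > y) = {cles:.2f}")
```

A value of r near 0 means neither group tends to rank higher, while values near +1 or -1 mean one group almost always does.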
