#### Statistical Data Analysis
Dataset: 

- _videogames_clean.csv_

Author: Luis Sergio Pastrana Lemus  
Date: 2025-05-29

# Statistical Data Analysis – Video Games Dataset

## __1. Libraries__

In [1]:
from pathlib import Path
import sys

# Define the project root dynamically: take the notebook's working directory and move one level up
project_root = Path.cwd().parent

# Add the project root to sys.path if it is not already there, so that `src` can be imported
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import only the helpers this notebook uses (more controlled than import *)
from src import load_dataset_from_csv, cast_datatypes


from IPython.display import display, HTML
import numpy as np
import os
import pandas as pd
import scipy.stats as st
from scipy.stats import ttest_ind

## __2. Path to Data file__

In [2]:
# Build the path to the data folder and load the dataset
data_file_path = project_root / "data" / "processed" / "clean"
df_vg = load_dataset_from_csv(data_file_path, "videogames_clean.csv", sep=',', header='infer')

In [3]:
df_vg = cast_datatypes(df_vg)
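
The two helpers used above come from the project's `src` package and are not shown in this notebook. As a rough illustration only, the sketch below shows what such helpers might do, assuming they wrap `pandas.read_csv` and coerce the score columns to numeric. The names `load_dataset_sketch` and `cast_scores_sketch` are hypothetical and are not the project's actual functions.

```python
from pathlib import Path

import pandas as pd


def load_dataset_sketch(folder, file_name, sep=",", header="infer"):
    """Hypothetical stand-in: read folder/file_name into a DataFrame."""
    file_path = Path(folder) / file_name
    if not file_path.exists():
        raise FileNotFoundError(f"Data file not found: {file_path}")
    return pd.read_csv(file_path, sep=sep, header=header)


def cast_scores_sketch(df):
    """Hypothetical stand-in: coerce score columns to numeric when present."""
    df = df.copy()
    for col in ("critic_score", "user_score"):
        if col in df.columns:
            # Non-numeric entries become NaN instead of raising an error
            df[col] = pd.to_numeric(df[col], errors="coerce")
    return df
```

Coercing with `errors="coerce"` matters for the tests below, because the t-test inputs are filtered with `notna()` and must be numeric.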

## __3. Statistical Data Analysis__

### 3.1  Inferential Tests

Hypotheses: 

- The average user ratings for the Xbox One and PC platforms are the same
- The average user ratings for the Action and Sports genres are different

#### 3.1.1  Hypotheses testing: _"The average user ratings for the Xbox One and PC platforms are the same"_

In [4]:
df_vg

| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wii_sports | wii | 2006.0 | sports | 41.36 | 28.96 | 3.77 | 8.45 | 76.0 | 8.00 | E |
| 1 | super_mario... | nes | 1985.0 | platform | 29.08 | 3.58 | 6.81 | 0.77 | | | |
| 2 | mario_kart_wii | wii | 2008.0 | racing | 15.68 | 12.76 | 3.79 | 3.29 | 82.0 | 8.30 | E |
| 3 | wii_sports_... | wii | 2009.0 | sports | 15.61 | 10.93 | 3.28 | 2.95 | 80.0 | 8.00 | E |
| 4 | pokemon_red... | gb | 1996.0 | role_playing | 11.27 | 8.89 | 10.22 | 1.00 | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16710 | samurai_war... | ps3 | 2016.0 | action | 0.00 | 0.00 | 0.01 | 0.00 | 73.0 | 6.95 | M |
| 16711 | lma_manager... | x360 | 2006.0 | sports | 0.00 | 0.01 | 0.00 | 0.00 | 72.0 | 7.30 | E |
| 16712 | haitaka_no_... | psv | 2016.0 | adventure | 0.00 | 0.00 | 0.01 | 0.00 | 72.0 | 7.80 | M |
| 16713 | spirits_spells | gba | 2003.0 | platform | 0.01 | 0.00 | 0.00 | 0.00 | 69.0 | 7.85 | E |


In [13]:
# Hypothesis: ...

# 1. Hypotheses H0, H1
# H0: The average user ratings for the Xbox One and PC platforms are the same (==)
# H1: The average user ratings for the Xbox One and PC platforms are different (!=)

# Prepare data by platform: keep only non-missing user scores
df_vg_xone_platform = df_vg.loc[(df_vg['platform'] == 'xone') & (df_vg['user_score'].notna()), 'user_score']
df_vg_pc_platform = df_vg.loc[(df_vg['platform'] == 'pc') & (df_vg['user_score'].notna()), 'user_score']

# 2. Specify Significance or Confidence
# alpha = 5%
# confidence = 95%

alpha = 0.05

In [14]:
# Levene's test checks whether the two samples have equal variances.
# Its result decides between Student's t-test (equal variances) and Welch's t-test (unequal variances).

levene_stat, levene_p = st.levene(df_vg_xone_platform, df_vg_pc_platform)
display(HTML(
    f"<b>Levene's Test</b> – Statistic: {levene_stat:.4f}, P-value: {levene_p:.4f}"))

# Determine whether equal variances can be assumed
if levene_p < alpha:
    equal_var = False
    display(HTML("<i>Null Hypothesis H₀ is rejected: the variances are different → use equal_var=False</i>"))
else:
    equal_var = True
    display(HTML("<i>Null Hypothesis H₀ is not rejected: no evidence of unequal variances → use equal_var=True</i>"))

In [15]:
# 3. Calculate critical and test values, define acceptance and rejection zones

# Pass the equal_var flag set by Levene's test instead of hard-coding it
t_stat, p_val = ttest_ind(df_vg_xone_platform, df_vg_pc_platform, equal_var=equal_var)

display(HTML(f"T-statistic: <b>{t_stat:.15f}</b>"))
display(HTML(f"P-value: <b>{p_val:.15f}</b>"))

# 4. Decision and Conclusion

if p_val < alpha:
    display(HTML("The <i>null hypothesis is rejected</i> in favor of the <b>alternative hypothesis</b>: there is sufficient statistical evidence that <b>the average user ratings for the Xbox One and PC platforms differ</b>."))
else:
    display(HTML("The <i>null hypothesis is not rejected</i>: there is insufficient evidence to conclude that <b>the average user ratings for the Xbox One and PC platforms differ</b>."))

#### Hypothesis Test validation

In [17]:
df_vg_xone_platform = df_vg_xone_platform.to_frame()

In [18]:
display(HTML("User ratings for the <b>Xbox One</b> platform:"))
print(df_vg_xone_platform["user_score"].describe())

count    247.000000
mean       6.651093
std        1.306885
min        1.600000
25%        5.900000
50%        6.900000
75%        7.700000
max        9.200000
Name: user_score, dtype: float64


In [19]:
df_vg_pc_platform = df_vg_pc_platform.to_frame()

In [20]:
display(HTML("User ratings for the <b>PC</b> platform:"))
print(df_vg_pc_platform["user_score"].describe())

count    962.000000
mean       7.092162
std        1.412612
min        1.400000
25%        6.412500
50%        7.400000
75%        8.100000
max        9.300000
Name: user_score, dtype: float64
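
The summary statistics above are enough to re-derive the t-test as a cross-check. The sketch below feeds the printed means, standard deviations, and counts into `scipy.stats.ttest_ind_from_stats`. Because the Levene result is not shown in this output, it runs both the pooled and the Welch variant. The variable names (`xone_stats`, `pc_stats`, `platform_check`) are illustrative and not part of the notebook.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the describe() outputs above (std uses ddof=1)
xone_stats = {"mean": 6.651093, "std": 1.306885, "n": 247}
pc_stats = {"mean": 7.092162, "std": 1.412612, "n": 962}

platform_check = {}
for assume_equal in (True, False):  # True: Student's t-test, False: Welch's t-test
    res = ttest_ind_from_stats(
        mean1=xone_stats["mean"], std1=xone_stats["std"], nobs1=xone_stats["n"],
        mean2=pc_stats["mean"], std2=pc_stats["std"], nobs2=pc_stats["n"],
        equal_var=assume_equal,
    )
    platform_check[assume_equal] = (res.statistic, res.pvalue)
    print(f"equal_var={assume_equal}: t = {res.statistic:.4f}, p = {res.pvalue:.6f}")
```

If the values printed here agree with the t-test cell above, the test was run on the same two samples that the descriptive statistics summarize.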


#### 3.1.2  Hypotheses testing: _"The average user ratings for the Action and Sports genres are different"_

In [25]:
# Hypothesis: ...

# 1. Hypotheses H0, H1
# H0: The average user ratings for the Action and Sports genres are the same (==)
# H1: The average user ratings for the Action and Sports genres are different (!=)

# Prepare data by genre: keep only non-missing user scores
df_vg_action_genre = df_vg.loc[(df_vg['genre'] == 'action') & (df_vg['user_score'].notna()), 'user_score']
df_vg_sports_genre = df_vg.loc[(df_vg['genre'] == 'sports') & (df_vg['user_score'].notna()), 'user_score']

# 2. Specify Significance or Confidence
# alpha = 5%
# confidence = 95%

alpha = 0.05

In [26]:
# Levene's test checks whether the two samples have equal variances.
# Its result decides between Student's t-test (equal variances) and Welch's t-test (unequal variances).

levene_stat, levene_p = st.levene(df_vg_action_genre, df_vg_sports_genre)
display(HTML(
    f"<b>Levene's Test</b> – Statistic: {levene_stat:.4f}, P-value: {levene_p:.4f}"))

# Determine whether equal variances can be assumed
if levene_p < alpha:
    equal_var = False
    display(HTML("<i>Null Hypothesis H₀ is rejected: the variances are different → use equal_var=False</i>"))
else:
    equal_var = True
    display(HTML("<i>Null Hypothesis H₀ is not rejected: no evidence of unequal variances → use equal_var=True</i>"))

In [27]:
# 3. Calculate critical and test values, define acceptance and rejection zones

# Pass the equal_var flag set by Levene's test instead of hard-coding it
t_stat, p_val = ttest_ind(df_vg_action_genre, df_vg_sports_genre, equal_var=equal_var)

display(HTML(f"T-statistic: <b>{t_stat:.15f}</b>"))
display(HTML(f"P-value: <b>{p_val:.15f}</b>"))

# 4. Decision and Conclusion

if p_val < alpha:
    display(HTML("The <i>null hypothesis is rejected</i> in favor of the <b>alternative hypothesis</b>: there is sufficient statistical evidence that <b>the average user ratings for the Action and Sports genres differ</b>."))
else:
    display(HTML("The <i>null hypothesis is not rejected</i>: there is insufficient evidence to conclude that <b>the average user ratings for the Action and Sports genres differ</b>."))

#### Hypothesis Test validation

In [28]:
df_vg_action_genre = df_vg_action_genre.to_frame()

In [29]:
df_vg_action_genre.describe()

| | user_score |
|---|---|
| count | 3264.0 |
| mean | 7.132371 |
| std | 1.188063 |
| min | 0.3 |
| 25% | 6.8 |
| 50% | 7.2 |
| 75% | 7.9 |
| max | 9.5 |


In [30]:
df_vg_sports_genre = df_vg_sports_genre.to_frame()

In [31]:
df_vg_sports_genre.describe()

| | user_score |
|---|---|
| count | 2019.0 |
| mean | 7.045295 |
| std | 1.365061 |
| min | 0.2 |
| 25% | 6.5 |
| 50% | 7.3 |
| 75% | 8.0 |
| max | 9.5 |
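
With samples this large, a statistically significant result can coexist with a small practical difference, so it may help to look at an effect size alongside the p-value. The sketch below recomputes Welch's t-test from the summary statistics above and adds Cohen's d based on the pooled standard deviation. `cohens_d_from_stats` is a hypothetical helper written for this illustration, and choosing Cohen's d as the effect-size measure is an assumption, not something the notebook prescribes.

```python
import math

from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the describe() outputs above (std uses ddof=1)
action_stats = {"mean": 7.132371, "std": 1.188063, "n": 3264}
sports_stats = {"mean": 7.045295, "std": 1.365061, "n": 2019}


def cohens_d_from_stats(a, b):
    """Cohen's d using the pooled standard deviation (hypothetical helper)."""
    pooled_var = (
        (a["n"] - 1) * a["std"] ** 2 + (b["n"] - 1) * b["std"] ** 2
    ) / (a["n"] + b["n"] - 2)
    return (a["mean"] - b["mean"]) / math.sqrt(pooled_var)


# Welch's t-test (equal_var=False) from summary statistics only
genre_check = ttest_ind_from_stats(
    mean1=action_stats["mean"], std1=action_stats["std"], nobs1=action_stats["n"],
    mean2=sports_stats["mean"], std2=sports_stats["std"], nobs2=sports_stats["n"],
    equal_var=False,
)
genre_effect = cohens_d_from_stats(action_stats, sports_stats)

print(f"t = {genre_check.statistic:.4f}, p = {genre_check.pvalue:.4f}")
print(f"Cohen's d = {genre_effect:.4f}")
```

By the usual rule of thumb, values of d below 0.2 are read as a small effect, which is useful context when interpreting the genre comparison.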


##### `LSPL`

**_Note_:**

`Justification for the 'equal_var' parameter`

In hypothesis tests with `ttest_ind()`, the `equal_var` argument defines whether or not we assume equality of variances.
This parameter should NOT be decided based on sample sizes, but on the result of a formal test.
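
For reference, the two variants differ only in how the standard error is built. The symbols $\bar{x}_i$, $s_i$, and $n_i$ (sample mean, sample standard deviation, and size of group $i$) are notation introduced here, not taken from the notebook. With `equal_var=True` the variances are pooled:

$$
t_{\text{Student}} = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

With `equal_var=False` each group keeps its own variance (Welch's t-test):

$$
t_{\text{Welch}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$

The pooled statistic is compared against a t distribution with $n_1 + n_2 - 2$ degrees of freedom, while Welch's version uses the Welch–Satterthwaite approximation for the degrees of freedom.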

Therefore, we use Levene's test (`scipy.stats.levene`) to compare the variances of both groups:
- H₀: The variances of the groups are equal.
- H₁: The variances of the groups are different.

- If the p-value < 0.05 → we reject H₀ → the variances are treated as unequal → `equal_var=False` (Welch's t-test)
- If the p-value >= 0.05 → we do not reject H₀ → the variances are treated as equal → `equal_var=True` (standard Student's t-test)

This validation ensures that the hypothesis test is based on sound statistical assumptions.
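
The decision rule in this note can be sketched as a single function that runs Levene's test and passes its outcome to `ttest_ind`. This is a minimal sketch under stated assumptions: `compare_means` is a hypothetical helper name, and the two groups are synthetic data generated only to make the example runnable, not values from the dataset.

```python
import numpy as np
from scipy import stats


def compare_means(sample_a, sample_b, alpha=0.05):
    """Run Levene's test, then a two-sample t-test that honours its outcome."""
    levene_stat, levene_p = stats.levene(sample_a, sample_b)
    # Equal variances are assumed only when Levene's test does not reject H0
    equal_var = bool(levene_p >= alpha)
    t_stat, p_val = stats.ttest_ind(sample_a, sample_b, equal_var=equal_var)
    return {
        "levene_p": levene_p,
        "equal_var": equal_var,
        "t_stat": t_stat,
        "p_val": p_val,
        "reject_h0": bool(p_val < alpha),
    }


# Synthetic groups with different means and clearly different spreads
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=6.0, scale=1.0, size=200)
group_b = rng.normal(loc=7.0, scale=2.0, size=300)

outcome = compare_means(group_a, group_b)
print(outcome)
```

Keeping both tests in one function means the `equal_var` flag cannot drift out of sync with the Levene result.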

## 4. Conclusion of Statistical Data Analysis

Based on the hypothesis tests above (two-sample t-tests at a 5% significance level), there is sufficient evidence to conclude that average user ratings differ both between the two platforms and between the two genres compared. Specifically:

- User ratings on Xbox One and PC differ significantly: the mean user score is 6.65 on Xbox One (n = 247) versus 7.09 on PC (n = 962), suggesting that platform-specific factors may shape user perception or satisfaction.

- User ratings for the Action and Sports genres also show a statistically significant difference, although a small one: the mean user score is 7.13 for Action (n = 3,264) versus 7.05 for Sports (n = 2,019).

These findings support considering platform and genre when analyzing user feedback and predicting game performance. They establish an association between these factors and user ratings, not a causal effect.

Initial descriptive statistics suggested differences between groups, particularly in the average user score. These differences were then tested formally through hypothesis testing to confirm their statistical significance.