#### Statistical Data Analysis
Dataset: _music_clean.csv_  
Author: Luis Sergio Pastrana Lemus  
Date: 2025-04-23

## __1. Libraries__

In [1]:
from IPython.display import display, HTML
import os
import pandas as pd
from pathlib import Path
import scipy.stats as st
import statsmodels.api as sm
from statsmodels.formula.api import ols
import sys


# Define the project root dynamically: take the notebook's current directory and move one level up
project_root = Path.cwd().parent

# Add src to sys.path if it is not already
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import only the helper used in this notebook (more controlled than import *)
from src import load_dataset_from_csv

## __2. Path to Data file__

In [2]:
# Build the path to the data directory and load the dataset
data_file_path = project_root / "data" / "processed"
df_music = load_dataset_from_csv(data_file_path, "music_clean.csv", sep=',', header='infer', keep_default_na=False)

## __3. Statistical Data Analysis__

### 3.1  Inferential Tests

Hypothesis: User activity varies by day of the week and city

#### 3.1.1  Hypothesis testing: User activity varies by city

In [3]:
# Hypothesis: User activity varies by city.

# 1. Propose Hypotheses H0, H1
# H0: User activity does not vary by city, user activity is the same (==)
# H1: User activity varies by city, user activity is different (!=)

# Prepare data by city for t-test
df_music['track_count'] = 1
city_grouped = df_music.groupby(['city', 'userid'])['track_count'].sum().reset_index()

springfield_music_activity = city_grouped.loc[(city_grouped['city'] == 'springfield'), 'track_count']
shelbyville_music_activity = city_grouped.loc[(city_grouped['city'] == 'shelbyville'), 'track_count']

# 2. Specify Significance or Confidence
# alpha = 5%
# confidence = 95%

alpha = 0.05

In [4]:
# Levene's test checks whether the two samples have equal variances.
# The result decides between Student's t-test (equal variances) and Welch's t-test (unequal variances).

levene_stat, levene_p = st.levene(springfield_music_activity, shelbyville_music_activity)
display(HTML(f"<b>Levene's Test</b> – Statistic: {levene_stat:.4f}, P-value: {levene_p:.4f}"))

# Decide whether equal variances can be assumed, using the same alpha as the main test
if levene_p < alpha:
    equal_var = False
    display(HTML("<i>Levene's H₀ is rejected: the variances differ → use equal_var=False (Welch's t-test)</i>"))
else:
    equal_var = True
    display(HTML("<i>Levene's H₀ is not rejected: no evidence that the variances differ → use equal_var=True (Student's t-test)</i>"))

In [5]:
# 3. Calculate the test statistic and p-value, using the variance assumption chosen by Levene's test

t_stat_city, p_val_city = st.ttest_ind(springfield_music_activity, shelbyville_music_activity, equal_var=equal_var)

display(HTML(f"T-statistic: <b>{t_stat_city:.4f}</b>"))
display(HTML(f"P-value: <b>{p_val_city:.4f}</b>"))

# 4. Decision and Conclusion

if p_val_city < alpha:
    display(HTML("The <i>null hypothesis is rejected</i> in favor of the <b>alternative hypothesis</b>: there is sufficient statistical evidence that <b>user music activity differs by city</b>."))
else:
    display(HTML("The <i>null hypothesis is not rejected</i>: there is insufficient evidence to conclude that <b>user music activity differs by city</b>."))

#### 3.1.2 Hypothesis testing: User activity varies by day of week and city

In [6]:
# Hypothesis: User activity varies by day of week and city.

# 1. Propose Hypotheses H0, H1
# H0: Mean user activity is the same across cities and across days of the week, and there is no city × day interaction
# H1: Mean user activity differs by city, by day of the week, or through a city × day interaction

# Prepare data by city and day for ANOVA
day_city_grouped = df_music.groupby(['city', 'day', 'userid'])['track_count'].sum().reset_index()

# 2. Specify Significance or Confidence
# alpha = 5%
# confidence = 95%

alpha = 0.05

In [7]:
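# 3. Calculate the F statistics and p-values for each effect

# Run two-way ANOVA with main effects for city and day plus their interaction
model = ols('track_count ~ C(city) + C(day) + C(city):C(day)', data=day_city_grouped).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Show the full table so the F statistics and p-values behind the decisions are visible
display(anova_table)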

# 4. Decision and Conclusion

# City effect
if anova_table.loc['C(city)', 'PR(>F)'] < alpha:
    display(HTML("The <i>null hypothesis is rejected</i> in favor of the <b>alternative hypothesis</b>: there is sufficient statistical evidence that <b>user music activity differs by city</b>."))
else:
    display(HTML("The <i>null hypothesis is not rejected</i>: there is insufficient evidence to conclude that <b>user music activity differs by city</b>."))

# Day effect
if anova_table.loc['C(day)', 'PR(>F)'] < alpha:
    display(HTML("The <i>null hypothesis is rejected</i> in favor of the <b>alternative hypothesis</b>: there is sufficient statistical evidence that <b>user music activity differs by day of the week</b>."))
else:
    display(HTML("The <i>null hypothesis is not rejected</i>: there is insufficient evidence to conclude that <b>user music activity differs by day of the week</b>."))

# Interaction effect
if anova_table.loc['C(city):C(day)', 'PR(>F)'] < alpha:
    display(HTML("The <i>null hypothesis is rejected</i> in favor of the <b>alternative hypothesis</b>: there is sufficient statistical evidence of <b>an interaction effect between city and day on user music activity</b>."))
else:
    display(HTML("The <i>null hypothesis is not rejected</i>: there is insufficient evidence to conclude that <b>an interaction effect exists between city and day on user music activity</b>."))

## __4. Conclusion of Statistical Data Analysis – Music Activity__

This analysis aimed to assess user music activity patterns across different cities and days of the week.

Descriptive statistics highlighted clear differences in user behavior, with Springfield consistently showing higher user engagement metrics than Shelbyville.

A t-test on per-user track counts confirmed that the difference between Springfield and Shelbyville is statistically significant, indicating that user activity is not equal across cities.

A two-way ANOVA on the user music activity data then assessed the influence of two categorical variables, city and day of the week, along with their interaction.

The statistical results indicated that city has a significant effect on user activity (measured through track counts).

However, the variation across days of the week and the interaction between city and day did not yield statistically significant differences.

These findings seem to contrast slightly with the exploratory data analysis (EDA), which visually suggested small differences in user activity between days (e.g., Friday vs. Wednesday). This apparent discrepancy can be explained by the fact that visual analysis captures perceived trends, while statistical tests quantify whether such trends are strong enough to rule out random variation.

Levene’s test for equality of variances was not run for the ANOVA model. The equal-variance assumption still matters in two-way ANOVA, but checking it is more involved because two factors interact. Unlike one-way ANOVA (with one grouping variable), two-way ANOVA compares multiple group combinations (e.g., each city × day cell). That means:

- Levene’s test would need to compare the variances of all combinations of the two categorical variables.
- If variances differ significantly across these subgroups, a method that tolerates unequal variances, such as Welch’s ANOVA or a generalized linear model (GLM), is preferred.
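
The check described in the list above could be sketched as follows. This is a minimal illustration, not part of the original analysis: `demo` is a hypothetical, randomly generated stand-in for `day_city_grouped`, and the day labels and sample sizes are assumptions chosen only to make the example self-contained.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Hypothetical stand-in for day_city_grouped: one row per user within a city × day cell
rng = np.random.default_rng(0)
cities = ["springfield", "shelbyville"]
days = ["Monday", "Wednesday", "Friday"]  # assumed labels, for illustration only
rows = [
    {"city": city, "day": day, "track_count": int(count)}
    for city in cities
    for day in days
    for count in rng.poisson(lam=5, size=30) + 1
]
demo = pd.DataFrame(rows)

# Build one sample of track counts per city × day combination
cell_samples = [
    group["track_count"].to_numpy()
    for _, group in demo.groupby(["city", "day"])
]

# Levene's test across all cells at once (scipy centers on the median by default)
cell_stat, cell_p = st.levene(*cell_samples)
print(f"Cells: {len(cell_samples)}, statistic: {cell_stat:.4f}, p-value: {cell_p:.4f}")
```

With the real data, replacing `demo` with `day_city_grouped` would test all city × day cells together; a p-value below `alpha` would suggest that the equal-variance assumption behind the two-way ANOVA is questionable.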