# Sheth L.U.J. & Sir M.V. College Of Arts, Science & Commerce
#### Ghanshyam Kanojiya | T087 
Practical No. 05    
Aim: ANOVA (Analysis of Variance)  

Perform one-way ANOVA to compare means across multiple groups.     
Conduct post-hoc tests to identify significant differences between group means.

In [5]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.formula.api import ols

sns.set_theme()

# Load dataset
df = pd.read_csv("stranger_things_all_dialogue.csv")
print("Shape:", df.shape)
df.iloc[8:20]

Shape: (32519, 8)


| | season | episode | line | raw_text | stage_direction | dialogue | start_time | end_time |
|---|---|---|---|---|---|---|---|---|
| 8 | 1 | 1 | 9 | [Mike] Something is coming. Something hungry f... | [Mike] | Something is coming. Something hungry for blood. | 00:01:44 | 00:01:48 |
| 9 | 1 | 1 | 10 | A shadow grows on the wall behind you, swallow... | NaN | A shadow grows on the wall behind you, swallow... | 00:01:48 | 00:01:52 |
| 10 | 1 | 1 | 11 | -It is almost here. -What is it? | NaN | It is almost here. What is it? | 00:01:52 | 00:01:54 |
| 11 | 1 | 1 | 12 | What if it's the Demogorgon? | NaN | What if it's the Demogorgon? | 00:01:54 | 00:01:56 |
| 12 | 1 | 1 | 13 | Oh, Jesus, we're so screwed if it's the Demogo... | NaN | Oh, Jesus, we're so screwed if it's the Demogo... | 00:01:56 | 00:01:59 |
| 13 | 1 | 1 | 14 | It's not the Demogorgon. | NaN | It's not the Demogorgon. | 00:01:59 | 00:02:00 |
| 14 | 1 | 1 | 15 | An army of troglodytes charge into the chamber! | NaN | An army of troglodytes charge into the chamber! | 00:02:00 | 00:02:02 |
| 15 | 1 | 1 | 16 | -Troglodytes? -Told ya. [chuckling] | [chuckling] | Troglodytes? Told ya. | 00:02:02 | 00:02:05 |
| 16 | 1 | 1 | 17 | -[snorts] -[all chuckling] | [snorts] [all chuckling] | NaN | 00:02:05 | 00:02:06 |
| 17 | 1 | 1 | 18 | [softly] Wait a minute. | [softly] | Wait a minute. | 00:02:08 | 00:02:09 |


In [6]:
# Keep only rows where dialogue text exists (spoken lines, not only stage directions).
# Rows without spoken text are already NaN after read_csv, so no string round-trip is needed.
df['dialogue_clean'] = df['dialogue']

# Dialogue length in words; rows without spoken dialogue stay NaN
df['dialogue_length'] = df['dialogue_clean'].str.split().str.len()

# Indicator: does this row contain spoken dialogue?
df['has_dialogue'] = ~df['dialogue_clean'].isna()

# Episode group factor for ANOVA: early vs late episodes.
# Note: the split uses the median episode number pooled over all rows, not a per-season median.
df['episode_group'] = np.where(df['episode'] <= df['episode'].median(), 'early', 'late')

df[['season','episode','line','dialogue_clean','dialogue_length','has_dialogue','episode_group']].head()

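| | season | episode | line | dialogue_clean | dialogue_length | has_dialogue | episode_group |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | NaN | NaN | False | early |
| 1 | 1 | 1 | 2 | NaN | NaN | False | early |
| 2 | 1 | 1 | 3 | NaN | NaN | False | early |
| 3 | 1 | 1 | 4 | NaN | NaN | False | early |
| 4 | 1 | 1 | 5 | NaN | NaN | False | early |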
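
The `episode_group` split above uses one median for the whole dataset. If the intent is "early vs late within each season", a per-season median is one option. The sketch below shows the difference on a made-up episode listing; the frame `toy` and the column names `episode_group_within` and `episode_group_pooled` are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up episode listing (hypothetical): season 1 has 8 episodes, season 2 has 4
toy = pd.DataFrame({
    "season":  [1] * 8 + [2] * 4,
    "episode": list(range(1, 9)) + list(range(1, 5)),
})

# Median episode number computed separately inside each season
season_median = toy.groupby("season")["episode"].transform("median")
toy["episode_group_within"] = np.where(toy["episode"] <= season_median, "early", "late")

# Pooled split (one median for all rows), shown for comparison
toy["episode_group_pooled"] = np.where(
    toy["episode"] <= toy["episode"].median(), "early", "late"
)

print(toy)
```

When seasons have different episode counts, the two definitions can label the same episode differently, so the choice should be stated alongside the ANOVA results.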


---

## 1. One-way ANOVA (F-test)  

ANOVA compares the means of **more than two groups**.  
Here we test whether the mean dialogue length is the same across all seasons.

- Groups: Season 1, Season 2, Season 3, ...  
- Response: `dialogue_length`

Hypotheses:

- **H₀:** All seasons have the **same mean dialogue length**.  
- **H₁:** At least one season has a **different mean**.
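
To make the F-test concrete, the sketch below computes the one-way F statistic by hand as the ratio of the between-group mean square to the within-group mean square. The helper `one_way_f` and the numbers in `toy_groups` are made up for illustration and are not taken from the dataset.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up word counts for three hypothetical groups (NOT from the dataset)
toy_groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
f_stat, df_b, df_w = one_way_f(toy_groups)
print(f"F = {f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

A large F means the group means are spread out relative to the variation inside each group, which is evidence against H₀.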



In [7]:
# Collect dialogue length values for each season
season_groups = []
labels = []

for s, g in df[df['dialogue_length'].notna()].groupby('season'):
    season_groups.append(g['dialogue_length'].values)
    labels.append(f"Season {s}")

print("Seasons included:", labels)

F_stat, p_value = stats.f_oneway(*season_groups)

print("\nOne-way ANOVA: dialogue length across seasons")
print("F-statistic:", F_stat)
print("p-value:", p_value)

Seasons included: ['Season 1', 'Season 2', 'Season 3', 'Season 4']

One-way ANOVA: dialogue length across seasons
F-statistic: 27.636518112177217
p-value: 7.768156159566041e-18


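**Interpretation:**

- Here F ≈ 27.64 with p ≈ 7.8 × 10⁻¹⁸, far below the usual 0.05 threshold, so we reject H₀:  
  at least one season's mean dialogue length differs from the others.  
- The F-test does not say *which* seasons differ; that requires a post-hoc comparison of season pairs.  
- Had the p-value been large, the data would not have provided strong evidence that seasons differ in average line length.

ANOVA is a generalisation of the pooled-variance two-sample t-test to more than two groups.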
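
The aim also calls for post-hoc tests. One common choice is Tukey's HSD, which compares every pair of groups while controlling the family-wise error rate. The sketch below is a minimal illustration on made-up numbers, assuming SciPy 1.8 or later (where `scipy.stats.tukey_hsd` is available); the groups `group_a`, `group_b`, `group_c` are hypothetical and not from the dataset.

```python
from scipy import stats

# Made-up word counts for three hypothetical groups (NOT taken from the dataset)
group_a = [3, 4, 5, 4, 3, 5]
group_b = [4, 5, 4, 3, 5, 4]
group_c = [9, 10, 11, 10, 9, 11]

# Omnibus F-test first: is there any difference at all?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3g}")

# Tukey's HSD: pairwise comparisons with family-wise error control
tukey = stats.tukey_hsd(group_a, group_b, group_c)

names = ["A", "B", "C"]
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        diff = tukey.statistic[i, j]   # mean of group i minus mean of group j
        p = tukey.pvalue[i, j]
        verdict = "differ" if p < 0.05 else "no clear difference"
        print(f"{names[i]} vs {names[j]}: mean diff = {diff:.2f}, p = {p:.4f} -> {verdict}")
```

On the real data, the same call would presumably take the per-season arrays, for example `stats.tukey_hsd(*season_groups)`. `pairwise_tukeyhsd` from `statsmodels.stats.multicomp` is an alternative that prints a labelled summary table.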


---

## 2. Two-way ANOVA  

Two-way ANOVA lets us test the effect of **two categorical factors** simultaneously,  
as well as their **interaction**.

Here we consider:

- Factor A: `season`  
- Factor B: `episode_group` (early vs late episodes)  

Response: `dialogue_length`

Questions:

1. Does season affect dialogue length?  
2. Does episode group (early vs late) affect dialogue length?  
3. Is there an interaction between season and episode group?
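
An interaction means the early-vs-late difference is not the same in every season. The sketch below illustrates this with a "difference of differences" on made-up cell means; the values in `cell_means` and the helper `late_minus_early` are hypothetical and only meant to show the idea.

```python
# Made-up mean dialogue lengths (in words) per season x episode-group cell.
# Hypothetical values, NOT computed from the dataset.
cell_means = {
    ("Season 1", "early"): 5.0,  ("Season 1", "late"): 5.5,
    ("Season 2", "early"): 5.25, ("Season 2", "late"): 4.25,
}

def late_minus_early(season):
    """Effect of episode group within one season (late mean minus early mean)."""
    return cell_means[(season, "late")] - cell_means[(season, "early")]

effect_s1 = late_minus_early("Season 1")
effect_s2 = late_minus_early("Season 2")

# If there were no interaction, the two effects would be equal and this contrast would be 0
interaction_contrast = effect_s2 - effect_s1

print(f"late - early in Season 1: {effect_s1:+.2f}")
print(f"late - early in Season 2: {effect_s2:+.2f}")
print(f"interaction contrast:     {interaction_contrast:+.2f}")
```

In this toy example late episodes have longer lines in one season and shorter lines in the other, so a single "early vs late" main effect would hide what is going on.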



In [8]:
# Create a clean subset with non-missing dialogue length
anova_df = df[['dialogue_length', 'season', 'episode_group']].dropna().copy()

# Convert factors to categorical
anova_df['season'] = anova_df['season'].astype('category')
anova_df['episode_group'] = anova_df['episode_group'].astype('category')

# Fit two-way ANOVA model with interaction
model = ols('dialogue_length ~ C(season) * C(episode_group)', data=anova_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

anova_table

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(season) | 961.24158 | 3.0 | 27.325318 | 1.230343e-17 |
| C(episode_group) | 401.425397 | 1.0 | 34.23409 | 4.944134e-09 |
| C(season):C(episode_group) | 274.047528 | 3.0 | 7.790379 | 3.395303e-05 |
| Residual | 309880.271014 | 26427.0 | NaN | NaN |


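**Interpretation:**

From the ANOVA table (Type II sums of squares):

- `C(season)` tests whether mean dialogue length differs across seasons: F ≈ 27.33, p ≈ 1.2 × 10⁻¹⁷.  
- `C(episode_group)` tests whether early and late episodes differ in average dialogue length: F ≈ 34.23, p ≈ 4.9 × 10⁻⁹.  
- `C(season):C(episode_group)` tests whether the effect of episode group  
  depends on the season (interaction): F ≈ 7.79, p ≈ 3.4 × 10⁻⁵.

All three p-values are below 0.05, so both main effects and the interaction are statistically significant.  
Because the interaction is significant, the early-vs-late difference is not the same in every season, and the main effects should be read with that in mind.  
With more than 26,000 spoken lines, even small differences in means reach significance, so a small p-value here does not by itself imply a large effect.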
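
As a rough check on practical importance, the F values can be recomputed from the sums of squares in the table, and a partial eta squared (SS_effect / (SS_effect + SS_residual)) gives a simple effect-size measure. The sketch below uses only the numbers printed in the ANOVA table; the choice of partial eta squared is an assumption made here for illustration, not something the original analysis reports.

```python
# Sums of squares and degrees of freedom copied from the two-way ANOVA table
table = {
    "C(season)":                  (961.24158, 3),
    "C(episode_group)":           (401.425397, 1),
    "C(season):C(episode_group)": (274.047528, 3),
}
ss_resid, df_resid = 309880.271014, 26427

ms_resid = ss_resid / df_resid  # residual mean square

results = {}
for term, (ss, dof) in table.items():
    f_value = (ss / dof) / ms_resid          # F = MS_effect / MS_residual
    partial_eta_sq = ss / (ss + ss_resid)    # share of (effect + residual) variation
    results[term] = (f_value, partial_eta_sq)
    print(f"{term:<28} F = {f_value:6.2f}   partial eta^2 = {partial_eta_sq:.4f}")
```

The recomputed F values should match the table, and each partial eta squared comes out below 0.01, which suggests that the effects, although statistically significant, account for a very small share of the variation in dialogue length.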


---

## Summary

In this notebook we:

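- Used **real dialogue data from *Stranger Things*** instead of artificial numbers.  
- Ran a **one-way ANOVA** of dialogue length across the four seasons (F ≈ 27.64, p ≈ 7.8 × 10⁻¹⁸).  
- Ran a **two-way ANOVA** with `season`, `episode_group` and their interaction; all three terms were significant at the 0.05 level.  
- Reported both **numerical results** (test statistics and p-values) and  
  **verbal interpretations** for each test.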


