# Sheth L.U.J. & Sir M.V. College Of Arts, Science & Commerce
## Ghanshyam Kanojiya | T087
### Practical No. 04
### Aim: Hypothesis Testing
Formulate null and alternative hypotheses for a given problem.  
Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).  
Interpret the results and draw conclusions based on the test outcomes.  
### Hypothesis Testing on *Stranger Things* Dialogue Dataset  


We will perform and interpret:

1. One-sample t-test  
2. Two-sample independent t-test  
3. Paired t-test  
4. Chi-square test of independence  
5. One-way ANOVA (F-test)  
6. Two-way ANOVA  

All examples will be based on meaningful questions about the dialogue data.


In [19]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.formula.api import ols

sns.set_theme()

# Load dataset
df = pd.read_csv("stranger_things_all_dialogue.csv")
print("Shape:", df.shape)
df.iloc[8:20]

Shape: (32519, 8)


|    | season | episode | line | raw_text | stage_direction | dialogue | start_time | end_time |
|----|--------|---------|------|----------|-----------------|----------|------------|----------|
| 8  | 1 | 1 | 9  | [Mike] Something is coming. Something hungry f... | [Mike] | Something is coming. Something hungry for blood. | 00:01:44 | 00:01:48 |
| 9  | 1 | 1 | 10 | A shadow grows on the wall behind you, swallow... |  | A shadow grows on the wall behind you, swallow... | 00:01:48 | 00:01:52 |
| 10 | 1 | 1 | 11 | -It is almost here. -What is it? |  | It is almost here. What is it? | 00:01:52 | 00:01:54 |
| 11 | 1 | 1 | 12 | What if it's the Demogorgon? |  | What if it's the Demogorgon? | 00:01:54 | 00:01:56 |
| 12 | 1 | 1 | 13 | Oh, Jesus, we're so screwed if it's the Demogo... |  | Oh, Jesus, we're so screwed if it's the Demogo... | 00:01:56 | 00:01:59 |
| 13 | 1 | 1 | 14 | It's not the Demogorgon. |  | It's not the Demogorgon. | 00:01:59 | 00:02:00 |
| 14 | 1 | 1 | 15 | An army of troglodytes charge into the chamber! |  | An army of troglodytes charge into the chamber! | 00:02:00 | 00:02:02 |
| 15 | 1 | 1 | 16 | -Troglodytes? -Told ya. [chuckling] | [chuckling] | Troglodytes? Told ya. | 00:02:02 | 00:02:05 |
| 16 | 1 | 1 | 17 | -[snorts] -[all chuckling] | [snorts] [all chuckling] |  | 00:02:05 | 00:02:06 |
| 17 | 1 | 1 | 18 | [softly] Wait a minute. | [softly] | Wait a minute. | 00:02:08 | 00:02:09 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32519 entries, 0 to 32518
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   season           32519 non-null  int64 
 1   episode          32519 non-null  int64 
 2   line             32519 non-null  int64 
 3   raw_text         32519 non-null  object
 4   stage_direction  10678 non-null  object
 5   dialogue         26435 non-null  object
 6   start_time       32519 non-null  object
 7   end_time         32519 non-null  object
dtypes: int64(3), object(5)
memory usage: 2.0+ MB


In [4]:
# Rows without spoken text have NaN in 'dialogue' (they hold only stage directions).
# Keep the column as-is: converting it with astype(str) would turn NaN into the
# string 'nan', which then has to be undone again.
df['dialogue_clean'] = df['dialogue']

# Dialogue length in words; rows without spoken dialogue stay NaN
df['dialogue_length'] = df['dialogue_clean'].str.split().str.len()

# Indicator: does this row contain spoken dialogue?
df['has_dialogue'] = df['dialogue_clean'].notna()

# Episode group factor for ANOVA: 'early' if the episode number is at or below the
# median episode number over all rows (all seasons pooled, not per season), else 'late'
df['episode_group'] = np.where(df['episode'] <= df['episode'].median(), 'early', 'late')

df[['season','episode','line','dialogue_clean','dialogue_length','has_dialogue','episode_group']].head()

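|   | season | episode | line | dialogue_clean | dialogue_length | has_dialogue | episode_group |
|---|--------|---------|------|----------------|-----------------|--------------|---------------|
| 0 | 1 | 1 | 1 |  |  | False | early |
| 1 | 1 | 1 | 2 |  |  | False | early |
| 2 | 1 | 1 | 3 |  |  | False | early |
| 3 | 1 | 1 | 4 |  |  | False | early |
| 4 | 1 | 1 | 5 |  |  | False | early |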
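
Because the first five rows contain no spoken dialogue, the preview above does not show the word count at work. The following is a minimal sketch, assuming a toy three-row frame (`toy` is a hypothetical name; the two spoken lines are taken from the sample rows shown earlier), of one vectorized way to count words while leaving rows without dialogue as missing:

```python
import pandas as pd

# Toy stand-in for the 'dialogue' column; None marks a row that has
# only a stage direction and no spoken text.
toy = pd.DataFrame(
    {"dialogue": ["Wait a minute.", None, "It's not the Demogorgon."]}
)

# Split each line on whitespace and count the pieces; a missing
# dialogue value propagates as a missing length.
toy["dialogue_length"] = toy["dialogue"].str.split().str.len()

# Flag rows that actually contain spoken text.
toy["has_dialogue"] = toy["dialogue"].notna()

print(toy)
```

Splitting on whitespace treats punctuation as part of the neighbouring word, so "Demogorgon." counts as one word.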


---

## 1. One-Sample t-test  

**Goal:** Test whether the **average length of a spoken line (in words)** is equal to some reference value.

Example hypothesis:

- **Null hypothesis (H₀):** The mean dialogue length is **5 words**.  
- **Alternative hypothesis (H₁):** The mean dialogue length is **different from 5 words**.
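
To make the test statistic concrete, the sketch below computes the one-sample t-statistic by hand, t = (x̄ − μ₀) / (s / √n), and compares it with `scipy.stats.ttest_1samp`. It assumes synthetic Poisson word counts (`sample`, `mu0`, and the Poisson rate are illustrative choices, not values from the dataset):

```python
import numpy as np
from scipy import stats

# Synthetic word counts standing in for dialogue lengths (an assumption).
rng = np.random.default_rng(0)
sample = rng.poisson(lam=5.5, size=200).astype(float)

mu0 = 5.0          # reference mean under H0
n = sample.size

# t = (sample mean - mu0) / (sample std / sqrt(n)), with ddof=1 for the std.
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's implementation should agree with the manual calculation.
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```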


In [5]:
# Drop missing dialogue lengths
lengths = df['dialogue_length'].dropna()

print("Number of spoken lines:", len(lengths))
print("Sample mean dialogue length:", lengths.mean())

# Perform one-sample t-test against population mean = 5
t_stat, p_value = stats.ttest_1samp(lengths, popmean=5.0)

print("\nOne-sample t-test for mean dialogue length = 5 words")
print("t-statistic:", t_stat)
print("p-value:", p_value)

Number of spoken lines: 26435
Sample mean dialogue length: 5.4943446188764895

One-sample t-test for mean dialogue length = 5 words
t-statistic: 23.41267519963462
p-value: 5.290617511338829e-120


**Interpretation:**

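- The p-value (≈ 5.3 × 10⁻¹²⁰) is far below 0.05, so we reject H₀ and conclude that  
  the average dialogue length in *Stranger Things* is **significantly different** from 5 words.  
- The sample mean is about 5.49 words, so the gap from the reference value is roughly half a word per line.  
  With 26,435 spoken lines, even a gap this small is detected with very high confidence.  
- Had the p-value been large (≥ 0.05), we would **not** have had enough evidence to say that the true mean  
  is different from 5 words.

This test uses every spoken line in the dataset, so the result describes the typical length of a single line of dialogue across the whole show.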
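
A p-value alone does not say how large the difference is. The sketch below derives a standardized effect size (Cohen's d) and a 95% confidence interval for the mean using only the numbers printed in the output above; the variable names are illustrative and the calculation assumes the usual t-based interval:

```python
import numpy as np
from scipy import stats

# Values copied from the notebook output above.
n = 26435
mean = 5.4943446188764895
t_stat = 23.41267519963462
mu0 = 5.0

# The t-statistic is (mean - mu0) / se, so the standard error can be recovered.
se = (mean - mu0) / t_stat

# Sample standard deviation implied by the standard error.
sd = se * np.sqrt(n)

# Cohen's d for a one-sample test: mean difference in units of the std.
cohens_d = (mean - mu0) / sd

# 95% confidence interval for the true mean dialogue length.
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se

print("Cohen's d:", cohens_d)
print("95% CI for the mean:", (ci_low, ci_high))
```

A d of roughly 0.14 is conventionally read as a small effect, which fits a statistically clear but modest departure from 5 words.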


---

## 2. Two-Sample Independent t-test  

Now we compare dialogue length between **two seasons**.

Example hypothesis:

- **H₀:** The mean dialogue length in Season 1 and Season 2 is **the same**.  
- **H₁:** The mean dialogue length in Season 1 and Season 2 is **different**.
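
Welch's version of the two-sample test does not assume equal variances in the two groups. As a minimal sketch on synthetic data (the arrays `a` and `b` and their Poisson rates are assumptions, not the real season data), the statistic and the Welch–Satterthwaite degrees of freedom can be computed by hand and checked against `scipy.stats.ttest_ind`:

```python
import numpy as np
from scipy import stats

# Synthetic word counts for two groups of different size (an assumption).
rng = np.random.default_rng(1)
a = rng.poisson(lam=5.5, size=300).astype(float)
b = rng.poisson(lam=5.9, size=350).astype(float)

# Squared standard error of each group mean.
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch t-statistic: difference in means over the combined standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))

# Two-sided p-value.
p_manual = 2 * stats.t.sf(abs(t_manual), df=df_welch)

# SciPy's Welch test (equal_var=False) should give the same numbers.
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)

print("manual:", t_manual, p_manual, "df:", df_welch)
print("scipy: ", t_scipy, p_scipy)
```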


In [6]:
# Filter for Season 1 and Season 2 only
s1 = df[(df['season'] == 1) & df['dialogue_length'].notna()]['dialogue_length']
s2 = df[(df['season'] == 2) & df['dialogue_length'].notna()]['dialogue_length']

print("Season 1: lines:", len(s1), "mean length:", s1.mean())
print("Season 2: lines:", len(s2), "mean length:", s2.mean())

# Two-sample t-test with unequal variances (Welch's t-test)
t_stat, p_value = stats.ttest_ind(s1, s2, equal_var=False)

print("\nTwo-sample t-test: Season 1 vs Season 2 dialogue length")
print("t-statistic:", t_stat)
print("p-value:", p_value)

Season 1: lines: 5025 mean length: 5.530746268656716
Season 2: lines: 5203 mean length: 5.847203536421295

Two-sample t-test: Season 1 vs Season 2 dialogue length
t-statistic: -4.579864738065234
p-value: 4.707750980937495e-06


**Interpretation:**

- The p-value (≈ 4.7 × 10⁻⁶) is below 0.05, so we reject H₀: Season 1 (mean ≈ 5.53 words) and  
  Season 2 (mean ≈ 5.85 words) have **different average dialogue lengths**.  
- A larger p-value (≥ 0.05) would have suggested that the difference in sample means could be due to random variation.

Lines in Season 2 are about 0.32 words longer on average. The shift in writing style (in terms of line length) is statistically significant, but small in absolute terms.


---

## 3. Paired t-test  

A paired t-test compares two **related** measurements for the **same units**.

We construct a meaningful pairing as follows:

- For each episode in Season 1, order the spoken lines by line number, split them into two halves by count, and compute:
  - Average dialogue length in the **first half** of the episode  
  - Average dialogue length in the **second half** of the episode  

Then test:

- **H₀:** There is **no difference** in average dialogue length between the first and second half.  
- **H₁:** There **is a difference** (writers may write differently toward the end of episodes).
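
A paired t-test is equivalent to a one-sample t-test on the per-unit differences, tested against zero. The sketch below illustrates this on synthetic per-episode averages (the arrays `first` and `second` and their distribution parameters are assumptions, not values from the dataset):

```python
import numpy as np
from scipy import stats

# Synthetic per-episode averages for 8 episodes (an assumption).
rng = np.random.default_rng(2)
first = rng.normal(loc=5.8, scale=0.4, size=8)
second = first - rng.normal(loc=0.4, scale=0.3, size=8)

# Paired t-test on the two related samples.
t_rel, p_rel = stats.ttest_rel(first, second)

# One-sample t-test on the differences against a mean of zero.
t_diff, p_diff = stats.ttest_1samp(first - second, popmean=0.0)

print("paired:        ", t_rel, p_rel)
print("on differences:", t_diff, p_diff)
```

Because only the differences matter, the pairing removes episode-to-episode variation that would otherwise inflate the standard error.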


In [7]:
# Restrict to Season 1 with non-missing dialogue
s1_df = df[(df['season'] == 1) & df['dialogue_length'].notna()].copy()

episode_first = []
episode_second = []

for ep, ep_df in s1_df.groupby('episode'):
    ep_df = ep_df.sort_values('line')
    n = len(ep_df)
    if n < 4:
        # skip episodes with too few lines
        continue
    mid = n // 2
    first_half = ep_df['dialogue_length'].iloc[:mid].mean()
    second_half = ep_df['dialogue_length'].iloc[mid:].mean()
    episode_first.append(first_half)
    episode_second.append(second_half)

episode_first = np.array(episode_first)
episode_second = np.array(episode_second)

print("Number of episodes used:", len(episode_first))
print("Mean length (first half):", episode_first.mean())
print("Mean length (second half):", episode_second.mean())

# Paired t-test
t_stat, p_value = stats.ttest_rel(episode_first, episode_second)

print("\nPaired t-test: first half vs second half of Season 1 episodes")
print("t-statistic:", t_stat)
print("p-value:", p_value)

Number of episodes used: 8
Mean length (first half): 5.762639949157283
Mean length (second half): 5.2812053494707305

Paired t-test: first half vs second half of Season 1 episodes
t-statistic: 4.414132045457073
p-value: 0.0031032478428301073


**Interpretation:**

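- The p-value (≈ 0.003) is below 0.05, so we reject H₀: across the 8 Season 1 episodes, lines in the  
  first half (mean ≈ 5.76 words) are systematically longer than lines in the second half (mean ≈ 5.28 words).  
- A large p-value (≥ 0.05) would have suggested no strong evidence of such a pattern.

This is one way to see whether lines tend to get longer or shorter as an episode progresses. The test measures words per line, not the total amount of dialogue, and with only 8 pairs the result should be read with some caution.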


---

## 4. Chi-square Test of Independence  

The Chi-square test checks whether **two categorical variables** are independent.

Here we test whether having spoken dialogue (vs only a stage direction)  
depends on the **season**.

- Variable 1: `season`  
- Variable 2: `has_dialogue` (True = spoken line, False = only stage direction)

Hypotheses:

- **H₀:** `season` and `has_dialogue` are **independent**.  
- **H₁:** `season` and `has_dialogue` are **associated**.


In [9]:
from scipy.stats import chi2_contingency

# Build contingency table
contingency = pd.crosstab(df['season'], df['has_dialogue'])
print("Contingency table:")
print(contingency)

chi2_stat, p_value, dof, expected = chi2_contingency(contingency)

print("\nChi-square test of independence: season vs has_dialogue")
print("Chi-square statistic:", chi2_stat)
print("Degrees of freedom:", dof)
print("p-value:", p_value)

Contingency table:
has_dialogue  False  True 
season                    
1               725   5025
2               710   5203
3              1866   6232
4              2783   9975

Chi-square test of independence: season vs has_dialogue
Chi-square statistic: 496.16660047819096
Degrees of freedom: 3
p-value: 3.231604076159981e-107


**Interpretation:**

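- The p-value (≈ 3.2 × 10⁻¹⁰⁷) is far below 0.05, so we reject independence: the proportion of rows  
  without spoken dialogue differs by season. From the contingency table, it is about 12.6% in Season 1  
  and 12.0% in Season 2, but about 23.0% in Season 3 and 21.8% in Season 4.  
- Had the p-value been large, we would have concluded that the pattern of spoken vs non-spoken lines is similar across seasons.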
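
The sketch below re-enters the contingency table printed above as a NumPy array, so it runs without the CSV file. It computes the share of rows without dialogue per season and Cramér's V as a rough measure of association strength; `observed` and the other variable names are illustrative:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the contingency table above.
# Rows: seasons 1 to 4. Columns: has_dialogue = False, True.
observed = np.array([
    [725, 5025],
    [710, 5203],
    [1866, 6232],
    [2783, 9975],
])

# Share of rows per season that contain no spoken dialogue.
share_no_dialogue = observed[:, 0] / observed.sum(axis=1)

# Chi-square test of independence on the same table.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Cramer's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), between 0 and 1.
n = observed.sum()
cramers_v = np.sqrt(chi2_stat / (n * (min(observed.shape) - 1)))

print("Share without dialogue per season:", share_no_dialogue.round(3))
print("Chi-square:", chi2_stat, "dof:", dof, "p-value:", p_value)
print("Cramer's V:", cramers_v)
```

The statistic should match the one reported above, and a Cramér's V of roughly 0.12 suggests the association, while highly significant, is fairly weak.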

This helps us see whether the structure of episodes (in terms of stage directions vs spoken lines) changes from season to season.
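
For the one-way ANOVA (F-test) listed in the overview, a minimal sketch could look like the following. It assumes four synthetic groups of word counts standing in for the four seasons (the Poisson rates are made up for illustration), and it checks `scipy.stats.f_oneway` against a hand-computed F-statistic:

```python
import numpy as np
from scipy import stats

# Four synthetic groups, one per hypothetical season (rates are assumptions).
rng = np.random.default_rng(3)
groups = [
    rng.poisson(lam=lam, size=200).astype(float)
    for lam in (5.5, 5.8, 5.4, 5.3)
]

# H0: all group means are equal. H1: at least one group mean differs.
f_stat, p_value = stats.f_oneway(*groups)

# Manual F: between-group mean square over within-group mean square.
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
k, n_total = len(groups), all_values.size
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

print("scipy F:", f_stat, "p-value:", p_value)
print("manual F:", f_manual)
```

On the real data, the groups would be the `dialogue_length` values of each season with missing values dropped.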


---

## Summary

In this notebook we:

- Used **real dialogue data from *Stranger Things*** instead of artificial numbers.  
- Applied classic hypothesis tests (t-tests, Chi-square, ANOVA) to answer  
  questions about how dialogue length and structure vary across seasons and episodes.  
- Demonstrated both **numerical results** (test statistics and p-values) and  
  **clear verbal interpretations** for each test.

You can now modify the hypotheses (for example, compare specific episodes or seasons)  
and re-run the same tests to explore other questions about the show's writing style.
