# Chi-Square Test of Independence: Sex vs. Survived

This notebook performs a **Chi-Square Test of Independence (χ²)** to determine whether there is a statistically significant association between a passenger's **sex** and whether they **survived** the Titanic disaster.

## 1. Setup, Data Loading, and Preparation

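We load the Titanic dataset (`titanic.csv`), keep only the `Sex` and `Survived` columns, drop rows with a missing value in either column, and cast both columns to categorical types. If the file is not found, the code falls back to a simulated dataset in which sex perfectly predicts survival, so results from that fallback are illustrative only. The outputs shown in this notebook come from the real file (N = 891).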

In [4]:

import pandas as pd
from scipy.stats import chi2_contingency, chi2

# Load Titanic dataset
try:
    df = pd.read_csv("titanic.csv")
except FileNotFoundError:
    print("'titanic.csv' not found. Using sample simulated data.")
    data = {
        'Sex': ['male','female','female','female','male','male','male','male','female','female']*3000,
        'Survived': [0,1,1,1,0,0,0,0,1,1]*3000
    }
    df = pd.DataFrame(data)

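# Keep only the two columns under test and drop rows with missing values.
# .copy() makes df_clean an independent frame, so the column assignments
# below cannot trigger pandas' SettingWithCopyWarning.
df_clean = df[['Sex', 'Survived']].dropna().copy()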
df_clean['Sex'] = df_clean['Sex'].astype('category')
df_clean['Survived'] = df_clean['Survived'].astype('category')

print(f"Sample Size for Analysis (N): {len(df_clean)}")

Sample Size for Analysis (N): 891


## 2. Observed Frequencies and Expected Counts

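We build a contingency table of **Sex vs. Survived** to obtain the observed frequencies.  
We then obtain the expected count for each cell under the null hypothesis of independence, which is (row total × column total) / N. `chi2_contingency` returns these expected counts together with the test statistic, p-value, and degrees of freedom.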

In [5]:
# Contingency Table (Observed Counts)
contingency = pd.crosstab(df_clean['Sex'], df_clean['Survived'])

# Chi-Square Test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency)

# Format Expected Counts
expected_df = pd.DataFrame(expected, index=contingency.index, columns=contingency.columns).round(2)

# Display Results
print("--- Observed Frequencies ---")
print(contingency.to_string(justify='center'))

print("\n--- Expected Counts ---")
print(expected_df.to_string(justify='center'))

print("\nTotal Sample Size (N):", contingency.sum().sum())

--- Observed Frequencies ---
Survived   0    1 
Sex               
female     81  233
male      468  109

--- Expected Counts ---
Survived     0       1  
Sex                     
female    193.47  120.53
male      355.53  221.47

Total Sample Size (N): 891
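As a cross-check, the expected counts can be reproduced by hand from the observed table. The sketch below is an illustration rather than part of the original notebook: it hard-codes the observed counts printed above, and the names `observed` and `expected_manual` are introduced here for the example.

```python
import numpy as np

# Observed counts copied from the table above
# (rows: female, male; columns: Survived = 0, Survived = 1)
observed = np.array([[81, 233],
                     [468, 109]])

row_totals = observed.sum(axis=1)  # passengers per sex
col_totals = observed.sum(axis=0)  # passengers per survival outcome
n = observed.sum()                 # total sample size

# Under independence: E_ij = (row total i * column total j) / N
expected_manual = np.outer(row_totals, col_totals) / n

print(expected_manual.round(2))
```

The rounded values should match the expected counts reported by `chi2_contingency` above.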


## 3. Hypothesis Test: Decision

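The null hypothesis (H₀) is that sex and survival are independent; the alternative (H₁) is that they are associated.  
We compare the chi-square statistic computed in the previous cell with the critical value at **α = 0.05** for the table's degrees of freedom, and reject H₀ if the statistic is at least as large as the critical value. For a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default, so the reported statistic is the corrected one.  
Finally, we state whether **sex and survival** are statistically associated.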
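The critical-value rule and the p-value rule lead to the same decision, because the critical value is the point whose upper-tail probability equals α. The following minimal sketch illustrates this for one degree of freedom. The variable names are chosen for the example, and the statistic `260.7170` is the value reported in this notebook's output.

```python
from scipy.stats import chi2

alpha = 0.05
dof = 1  # (rows - 1) * (columns - 1) for a 2x2 table

# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical = chi2.ppf(1 - alpha, df=dof)

# The upper-tail probability at the critical value is alpha itself
tail = chi2.sf(critical, df=dof)

# For any statistic, "stat >= critical" and "p <= alpha" agree
stat = 260.7170
p_from_stat = chi2.sf(stat, df=dof)
reject_by_critical = stat >= critical
reject_by_p = p_from_stat <= alpha

print(f"Critical value: {critical:.4f}")
print(f"Decisions agree: {reject_by_critical == reject_by_p}")
```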

In [6]:
# Significance Level
alpha = 0.05

# Critical Value
chi2_critical = chi2.ppf(1 - alpha, df=dof)

print("--- Chi-Square Test Summary ---")
print(f"Chi-square Statistic (Calculated): {chi2_stat:.4f}")
print(f"Degrees of Freedom: {dof}")
print(f"P-value: {p_value:.4f}")
print(f"Chi-square Critical Value (α={alpha}): {chi2_critical:.4f}")

print("\n--- Hypothesis Decision ---")
if chi2_stat >= chi2_critical:
    print("Decision: REJECT H0")
    print(f"Conclusion: Calculated χ² ({chi2_stat:.4f}) >= Critical χ² ({chi2_critical:.4f}). There is a statistically significant association between Sex and Survived.")
else:
    print("Decision: FAIL TO REJECT H0")
    print(f"Conclusion: Calculated χ² ({chi2_stat:.4f}) < Critical χ² ({chi2_critical:.4f}). No statistically significant association between Sex and Survived.")


--- Chi-Square Test Summary ---
Chi-square Statistic (Calculated): 260.7170
Degrees of Freedom: 1
P-value: 0.0000
Chi-square Critical Value (α=0.05): 3.8415

--- Hypothesis Decision ---
Decision: REJECT H0
Conclusion: Calculated χ² (260.7170) >= Critical χ² (3.8415). There is a statistically significant association between Sex and Survived.
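The p-value prints as `0.0000` only because it is rounded to four decimal places; it is extremely small but not zero. The sketch below, which is an addition and not part of the original analysis, prints the p-value in scientific notation. It also shows the effect of Yates' continuity correction, which `chi2_contingency` applies by default when the table has one degree of freedom. The observed counts are hard-coded from the table in Section 2.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from Section 2 (rows: female, male; columns: 0, 1)
observed = np.array([[81, 233],
                     [468, 109]])

# Default behaviour: Yates' continuity correction is applied for dof == 1
stat_yates, p_yates, dof, _ = chi2_contingency(observed)

# Same test without the continuity correction
stat_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

# Scientific notation keeps very small p-values from rounding to zero
print(f"With Yates' correction:    chi2 = {stat_yates:.4f}, p = {p_yates:.3e}")
print(f"Without Yates' correction: chi2 = {stat_plain:.4f}, p = {p_plain:.3e}")
```

With a sample this large, the correction changes the statistic only slightly and does not affect the decision to reject H₀.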
