## 📊 Question 04: Titanic Hypothesis Testing with Pandas

**Goal:** Test two hypotheses about the Titanic disaster using the cleaned data and Pandas grouping and merging capabilities.
**Topics:** Pandas Merging, Grouping (`groupby` or `pivot_table`), Binning (`pd.cut`, `pd.qcut`), Hypothesis Conclusion.

### Hypotheses to Test:
1.  "Women and children first."
2.  "The wealthy had a higher survival rate."

### Your Task (using `titanic_cleaned.csv` and `ticket_fares.csv`):


In [2]:
import pandas as pd


1.  **Load Data:** Load both **`titanic_cleaned.csv`** and **`ticket_fares.csv`** into two separate DataFrames.


In [3]:
input_titanic = 'titanic_cleaned.csv'
titanic_cleaned_df = pd.read_csv(input_titanic)

input_ticket = 'ticket_fares.csv'
tickets_df = pd.read_csv(input_ticket)

2.  **Merge Data:** Perform a **merge operation** to combine the two DataFrames, using the **`'Ticket'` column as the common key**.

In [4]:
df = pd.merge(
    titanic_cleaned_df,
    tickets_df,
    on='Ticket',
    how='inner'
)
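Several passengers can share one ticket, so `'Ticket'` is not guaranteed to be unique in either file, and a merge on a non-unique key repeats rows (the DataFrame printed later in this notebook has 1,593 rows and shows `PassengerId` 4 and 889 twice). The sketch below uses small made-up frames, not the real CSV files, to show how such duplication arises and one possible way to guard against it. Whether dropping duplicate fare rows is appropriate for `ticket_fares.csv` is an assumption to verify against the actual data.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (made-up values, not the real data).
passengers = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Ticket': ['A1', 'B2', 'B2'],   # two passengers share ticket B2
})
fares = pd.DataFrame({
    'Ticket': ['A1', 'B2', 'B2'],   # ticket B2 is listed twice
    'Fare': [7.25, 53.10, 53.10],
})

# A plain merge pairs every B2 passenger with every B2 fare row.
naive = pd.merge(passengers, fares, on='Ticket', how='inner')
print(len(naive))    # 5 rows from 3 passengers

# Keep one fare row per ticket, then let pandas verify the key relationship.
fares_unique = fares.drop_duplicates(subset='Ticket')
checked = pd.merge(
    passengers,
    fares_unique,
    on='Ticket',
    how='inner',
    validate='many_to_one',  # raises MergeError if Ticket repeats on the right
)
print(len(checked))  # 3 rows, one per passenger
```

Duplicated passengers are counted more than once in every survival rate computed afterwards, so a quick `df['PassengerId'].is_unique` check after merging is worth considering.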


3.  **Test Hypothesis 1 (Women and Children):**
    * Create a new categorical column **`AgeGroup`**. **Bin the `'Age'` column** into logical groups: `['Child', 'Teen', 'Adult', 'Senior']`. (You decide the age brackets, e.g., 0-12, 13-19, 20-59, 60+).
    

In [7]:
# Age brackets (inclusive): child 0-12, teen 13-19, adult 20-59, senior 60-120
# With right=False each bin is [left, right), so the edges are 0, 13, 20, 60, 121.
bins = [0, 13, 20, 60, 121]
labels = ['child', 'teen', 'adult', 'senior']

df['AgeGroup'] = pd.cut(
    df['Age'],
    bins=bins,
    labels=labels,
    right=False
)
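To see how `right=False` treats the bracket boundaries, here is a minimal sketch on a handful of made-up ages (not the Titanic data), using the same bin edges and labels as the cell above.

```python
import pandas as pd

# Boundary ages chosen to sit on or next to each bin edge (illustrative only).
ages = pd.Series([0, 12, 13, 19, 20, 59, 60])
bins = [0, 13, 20, 60, 121]
labels = ['child', 'teen', 'adult', 'senior']

# right=False makes each interval [left, right): 13 falls in 'teen', not 'child'.
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(groups.tolist())
# ['child', 'child', 'teen', 'teen', 'adult', 'adult', 'senior']
```

With the default `right=True`, the same edges would put age 13 in `'child'` and leave age 0 unassigned (`NaN`), which is why the edges and the `right` flag need to be chosen together.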

* Use `groupby()` or a `pivot_table` to calculate the **mean survival rate (`Survived` column)** broken down by both **`'Sex'` and `'AgeGroup'`**.


In [10]:
survival_rate_pivot = df.pivot_table(
    values='Survived',  # column to aggregate
    index='AgeGroup',   # one row per age group
    columns='Sex',      # one column per sex (female, male)
    aggfunc='mean',     # mean of the 0/1 Survived flag = survival rate
    observed=False      # keep every AgeGroup category, even if empty
)

print(survival_rate_pivot)

Sex         female      male
AgeGroup                    
child     0.432432  0.379562
teen      0.753247  0.086957
adult     0.699561  0.202624
senior    1.000000  0.111111
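A mean alone does not show how many passengers stand behind it; a rate such as the 1.000000 for senior females could rest on a very small group. The sketch below, on made-up rows rather than the real data, shows one way to get the rate and the group size side by side with `groupby`.

```python
import pandas as pd

# Toy rows (made-up values) with the same column names as the notebook.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'AgeGroup': ['adult', 'adult', 'adult', 'child', 'child', 'senior'],
    'Survived': [1, 0, 0, 1, 0, 1],
})

# Named aggregation: survival rate and number of passengers per group.
summary = (
    toy.groupby(['AgeGroup', 'Sex'])['Survived']
       .agg(rate='mean', n='count')
)
print(summary)
```

In this toy table the senior female rate is 1.0 but `n` is 1, which is the kind of caveat worth mentioning in the written conclusion.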


* Write a **one-paragraph conclusion** to a **`report.txt`** file: does the data support the "women and children first" hypothesis? Justify with your numbers.

In [25]:
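with open('report.txt', 'w') as obj:
    obj.write('The data largely support the hypothesis: females out-survived males in every age group (for example, adults 70.0% vs 20.3%), and male children (38.0%) out-survived adult males (20.3%), although female children (43.2%) had the lowest rate among females.')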


4.  **Test Hypothesis 2 (Wealth):**
    * **Method A (Class):** Calculate the **mean survival rate grouped by `'Pclass'`** (1st, 2nd, 3rd).

In [11]:
survival_rate_class_pivot = df.pivot_table(
    values='Survived',
    index='Pclass',
    aggfunc='mean',
    observed=False
)
print(survival_rate_class_pivot)

        Survived
Pclass          
1       0.689055
2       0.480000
3       0.239057


* **Method B (Fare):** Bin the `'Fare'` column into **4 equal-sized quantile groups** (e.g., 'Low', 'Medium', 'High', 'VeryHigh') using **`pd.qcut()`**. Name this new column **`FareBin`**.


In [None]:
labels = ['low','medium','high','veryHigh']

df['FareBin'] = pd.qcut(
    df['Fare_x'],   # both files have a 'Fare' column, so the merge suffixed them; Fare_x is the one from titanic_cleaned.csv
    q=4,            # n quantiles
    labels= labels,
    duplicates='drop'
)

# print(df['FareBin'])

print(df)

      PassengerId  Survived  Pclass  \
0               1         0       3   
1               2         1       1   
2               3         1       3   
3               4         1       1   
4               4         1       1   
...           ...       ...     ...   
1588          888         1       1   
1589          889         0       3   
1590          889         0       3   
1591          890         1       1   
1592          891         0       3   

                                                   Name     Sex  Age  SibSp  \
0                               Braund, Mr. Owen Harris    male   22      1   
1     Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                                Heikkinen, Miss. Laina  female   26      0   
3          Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1   
4          Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1   
...                                                 ...  
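As a small illustration of what `pd.qcut` does, the sketch below bins twelve made-up fares (not the real `Fare_x` values) into quartiles and counts the members of each bin.

```python
import pandas as pd

# Twelve made-up fares, strongly right-skewed like real ticket prices tend to be.
fares = pd.Series([5, 7, 8, 10, 13, 20, 26, 30, 50, 80, 150, 500])
labels = ['low', 'medium', 'high', 'veryHigh']

# qcut picks the edges from the data's quantiles, so each bin gets
# (roughly) the same number of rows, regardless of how wide the bin is.
fare_bin = pd.qcut(fares, q=4, labels=labels)
print(fare_bin.value_counts(sort=False))
```

Each bin holds three fares here even though `'veryHigh'` spans 80 to 500. This is the difference from `pd.cut`, which fixes the edges and lets the counts vary.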

* Calculate the **mean survival rate grouped by your new `FareBin` column**.


In [17]:
survival_rate_fare_pivot = df.pivot_table(
    values='Survived',
    index='FareBin',
    aggfunc='mean',
    observed=False
)

print(survival_rate_fare_pivot)

          Survived
FareBin           
low       0.229426
medium    0.441687
high      0.336449
veryHigh  0.609418
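Fare and class measure wealth in different ways, so the two breakdowns need not agree bin for bin. One way to compare them is to cross-tabulate the fare bins against `Pclass`. The sketch below does this on made-up rows; it only illustrates the technique and says nothing about what the real table looks like.

```python
import pandas as pd

# Toy rows (made-up values) with the notebook's column names.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'FareBin':  ['veryHigh', 'high', 'high', 'medium',
                 'medium', 'low', 'low', 'high'],
    'Survived': [1, 1, 1, 0, 0, 0, 0, 0],
})

# Rows: fare bin, columns: class, cells: number of passengers.
counts = pd.crosstab(toy['FareBin'], toy['Pclass'])
print(counts)

# Survival rate for each (FareBin, Pclass) combination.
rates = toy.pivot_table(
    values='Survived', index='FareBin', columns='Pclass', aggfunc='mean'
)
print(rates)
```

If one fare bin mixes passengers from several classes, its overall survival rate blends groups that fared very differently, which could be one reason a fare trend is not monotonic.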


* Append a **second paragraph** to `report.txt`: does the data support the "wealth" hypothesis? Compare the results from Method A and Method B.

In [None]:
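with open('report.txt', 'a') as obj:
    obj.write("\nThe data support the wealth hypothesis overall. Survival falls by class from 1st (68.9%) to 3rd (23.9%), and the top fare quartile (60.9%) far exceeds the bottom one (22.9%), but the fare trend is not monotonic: high (33.6%) is below medium (44.2%).")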
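Typing the numbers into the report by hand invites copy errors. As a hedged alternative, the sketch below formats the sentence from a Series of rates and shows the difference between `'w'` (overwrite) and `'a'` (append). The rates are copied from the class table above, and the file name `report_demo.txt` is hypothetical; the file lives in a temporary directory so the real `report.txt` is not touched.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Survival rates copied from the Pclass table printed earlier in the notebook.
rates = pd.Series({1: 0.689055, 2: 0.480000, 3: 0.239057}, name='Survived')

with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / 'report_demo.txt'  # hypothetical file name

    # 'w' creates the file (or overwrites it) with the first paragraph.
    with open(report, 'w') as f:
        f.write('First paragraph.\n')

    # 'a' appends, so the first paragraph is preserved.
    line = ', '.join(f'class {c}: {r:.1%}' for c, r in rates.items())
    with open(report, 'a') as f:
        f.write(f'Survival by class: {line}.\n')

    text = report.read_text()

print(text)
```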