
## 🚢 Question 03: Pandas Data Cleaning and Feature Engineering

**Goal:** Clean and transform the Titanic dataset (`Titanic-Dataset.csv`) using Pandas, handling missing values and creating new features.
**Topics:** Pandas DataFrame operations, Missing Data Imputation (`fillna`), Grouping, Feature Engineering.

### Your Task:



1.  **Load and Inspect:**
    * Load `Titanic-Dataset.csv` into a DataFrame.
    * Programmatically generate a report that lists **each column**, its **data type** (e.g., `int64`, `object`), and the **percentage of missing (null) values**.

In [10]:
#%pip install pandas

In [11]:
import pandas as pd

file = 'Titanic-Dataset.csv'

df = pd.read_csv(file)

# Print the first five records
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
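
Step 1 also asks for a per-column report of data types and missing-value percentages. One way to build it is sketched below. The helper name `missing_report` and the toy frame `sample` are illustrative assumptions, not part of the original notebook; with the real data you would call `missing_report(df)`.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and percentage of nulls."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        # isna().mean() is the fraction of nulls per column
        "missing_pct": (df.isna().mean() * 100).round(2),
    })

# Small stand-in for the Titanic data (hypothetical values)
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Pclass": [3, 1, 3, 1],
})

report = missing_report(sample)
print(report)
```

Because `df.dtypes` and `df.isna().mean()` are both indexed by column name, the two pieces line up without any explicit join.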



2.  **Handle Missing Data (Imputation):**
    * **'Age'**: Fill the missing `'Age'` values using the **median age** of passengers in the **same 'Pclass' (Passenger Class) and 'Sex' group**.
    * **'Embarked'**: Fill the two missing values with the **mode** (most common value) of the `'Embarked'` column.
    * **'Cabin'**: This column is missing over 70% of its data. **Drop the entire 'Cabin' column** from the DataFrame.

In [12]:
df['Age'] = df['Age'].fillna(
    df.groupby(['Pclass','Sex'])['Age'].transform('median')
)

print(df['Age'])

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    21.5
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
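
To see why `groupby(...).transform('median')` works as a fill value, it can help to run the same pattern on a tiny frame. The sketch below uses made-up rows (the `toy` frame is an assumption for illustration). `transform` returns a Series aligned to the original index, so each missing age receives the median of its own `Pclass`/`Sex` group.

```python
import pandas as pd

# Hypothetical rows: two groups, each with one missing age
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, None, 20.0, None, 24.0],
})

# One median per row, computed within that row's (Pclass, Sex) group;
# NaN values are skipped when the median is computed
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Only the missing entries are replaced; existing ages are untouched
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```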


In [13]:
mode_embarked = df['Embarked'].mode().iloc[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

In [14]:
df = df.drop(columns = ['Cabin'])

In [15]:
print(df)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ... 


3.  **Feature Engineering:**
    * Create a new column **`FamilySize`** by adding the `'SibSp'` (siblings/spouses) and `'Parch'` (parents/children) columns together.
    * Create a second new column **`IsAlone`** that contains a 1 if `FamilySize` is 0, and a 0 otherwise.

In [16]:
df['FamilySize'] = df['Parch'] + df['SibSp']

import numpy as np
# np.where(cond,val if true, val if false)
# Column name matches the task statement: IsAlone (capital I)
df['IsAlone'] = np.where(df['FamilySize'] == 0, 1, 0)
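
As a quick check of the logic, the same two features can be derived on a small made-up frame. This sketch casts the boolean comparison to `int` instead of calling `np.where`; that is an equivalent alternative, not the notebook's own method, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical passengers: one with a sibling, one alone, one with two children
toy = pd.DataFrame({"SibSp": [1, 0, 0], "Parch": [0, 0, 2]})

toy["FamilySize"] = toy["SibSp"] + toy["Parch"]
# True/False becomes 1/0 when cast to int
toy["IsAlone"] = (toy["FamilySize"] == 0).astype(int)
print(toy)
```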


4.  **Type Conversion:** Convert the `'Age'` column to an **integer type (`int64`)**.

In [18]:
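# Floor fractional ages (e.g. 21.5), then cast to NumPy int64 as the task requests.
# Assumes no NaN remains in 'Age' after the imputation in step 2.
df['Age'] = np.floor(df['Age']).astype('int64')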
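
pandas has two similarly named integer types: NumPy-backed `int64`, which the task names, and the nullable extension type `Int64`. The sketch below uses a made-up `ages` Series to show the practical difference: `Int64` tolerates missing values, while `int64` requires that they be filled first.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, one of them missing
ages = pd.Series([22.0, 21.5, np.nan])

# Nullable extension dtype: the missing value survives as <NA>
nullable = np.floor(ages).astype("Int64")

# Plain NumPy dtype: casting fails while a NaN is present
try:
    ages.astype("int64")
    raised = False
except ValueError:
    raised = True

# After filling the gap, the cast to int64 succeeds
filled = np.floor(ages.fillna(ages.median())).astype("int64")

print(nullable.dtype, raised, filled.dtype)
```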


5.  **Save Clean Data:** Save your cleaned, imputed, and feature-engineered DataFrame to a new file, **`titanic_cleaned.csv`**.

In [19]:
output = 'titanic_cleaned.csv'

df.to_csv(output,index=False)
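
A round-trip check is a cheap way to confirm that the saved file matches the DataFrame. The sketch below writes a small made-up frame to an in-memory buffer rather than to disk (the `cleaned` frame and its values are assumptions). With the real data you would read `titanic_cleaned.csv` back with `pd.read_csv` and compare it to `df` in the same way.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
cleaned = pd.DataFrame({
    "PassengerId": [1, 2],
    "Age": [22, 38],
    "IsAlone": [0, 0],
})

buffer = io.StringIO()
# index=False keeps the row index out of the file, so no extra
# "Unnamed: 0" column appears on reload
cleaned.to_csv(buffer, index=False)

buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.equals(cleaned))
```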