In [2]:
import pandas as pd
df=pd.read_csv("../datasets/Titanic-Dataset.csv")


# Titanic Dataset Analysis  
**Goal**: Explore survival patterns using passenger demographics.  
**Dataset**: Titanic-Dataset.csv (891 rows)  
**Tools Used**: Pandas, NumPy  

In [None]:
df.info()
df.describe()
df.isna().sum()


## 1. Initial Data Inspection  
Key observations:  
- Missing values in `Age` (177), `Cabin` (687), `Embarked` (2)  
- Numeric columns: `Age`, `Fare`  
- Categorical columns: `Pclass`, `Sex`, `Embarked`  

In [None]:
df['Age']=pd.to_numeric(df['Age'], errors='coerce')
#adding medeian to missing values
df['Age'].fillna(df['Age'].median(),inplace=True)

df['Fare']=pd.to_numeric(df['Fare'], errors='coerce')

df = df.drop(columns=['Cabin'])

df = df.dropna(subset=['Embarked'])

print("Missing values after cleaning:")
print(df.isna().sum())

## 2. Data Cleaning  
Actions taken:  
- Filled missing `Age` values with median (28.0)  
- Dropped `Cabin` column (too many missing values)  
- Removed 2 rows with missing `Embarked` values  

In [None]:
df[df['Fare'] > 50]
df.groupby('Pclass')['Age'].mean()
df['Survived'].value_counts()


## 3. Feature Engineering  
- Created `AgeGroup` column (Child: <18, Adult: ≥18)  

In [3]:
df['AgeGroup'] = df['Age'].apply(lambda x: 'Child' if x < 18 else 'Adult')
df.to_csv("../datasets/cleaned_titanic.csv", index=False)


In [None]:
## 4. Key Findings  
1. **Class Disparity**:  
   - Pclass 1 avg age: 38.8 vs Pclass 3: 25.1  
2. **Survival Rate**:  
   - 342 survived (38.4%)  
3. **High-Fare Passengers**: 102 paid > $50  


In [None]:
## Business Insights  
1. 1st class passengers were older (avg 38 vs 25 in 3rd class)  
2. Children had 54% survival vs 36% for adults  
3. 58% of high-fare passengers survived  

## Limitations  
- No visualization yet (Week 2)  
- Could explore gender bias (known that women survived more)  