# Day 7: Titanic Survival Prediction - End-to-End ML Project

**Welcome to your first complete machine learning project!** Today we'll apply everything you've learned in the past 6 days to solve a real-world problem: predicting passenger survival on the Titanic. This is a classic binary classification problem that will help you understand the complete ML workflow.

---

**Goal:** Build a complete ML pipeline to predict Titanic passenger survival using real data and multiple algorithms.

**Skills You'll Practice:**
- Exploratory Data Analysis (EDA)
- Data preprocessing and feature engineering
- Multiple ML algorithms (Logistic Regression, Random Forest, etc.)
- Model evaluation and comparison
- End-to-end project workflow


---

## 1. Project Overview & Setup

### The Titanic Challenge
On April 15, 1912, the RMS Titanic sank after colliding with an iceberg. This tragedy resulted in the deaths of over 1,500 passengers and crew. Using machine learning, we can analyze patterns in the data to predict which passengers were more likely to survive.

### What We'll Build
A binary classifier that predicts passenger survival (0 = died, 1 = survived) based on features like age, ticket class, gender, and fare paid.

### Key ML Concepts We'll Practice
- **Binary Classification**: Predicting one of two outcomes (survived/died)
- **Feature Engineering**: Creating new features from existing ones
- **Model Comparison**: Testing multiple algorithms to find the best performer
- **Cross-Validation**: Ensuring our model generalizes well to new data


In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")
print("Ready to start the Titanic project!")


### Exercise 1: Load and Initial Exploration

**Task:** Load the Titanic dataset and get your first look at the data.

**Questions to think about:**
- How many passengers are in the dataset?
- What features do we have available?
- Are there any missing values?
- What does the target variable (Survived) look like?


In [None]:
# TODO: Load the Titanic dataset
# Hint: Use pd.read_csv() to load 'titanic.csv'

# Your code here:
df = pd.read_csv('titanic.csv')  # adjust the path if the file lives elsewhere

print(f"Dataset shape: {df.shape}")
print(f"\nColumn names:")
print(df.columns.tolist())
print(f"\nFirst few rows:")
print(df.head())


In [None]:
# TODO: Get basic information about the dataset
# Check data types, missing values, and basic statistics
# Hint: Use df.info(), df.isnull().sum(), and df['Survived'].value_counts()

# Your code here:


print("Dataset Info:")
print(df.info())

print("\nMissing Values:")
print(df.isnull().sum())

print("\nTarget Variable Distribution:")
print(df['Survived'].value_counts())
print(f"\nSurvival Rate: {df['Survived'].mean():.2%}")
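
The checks in this exercise can be tried out on a tiny hand-made table first. The sketch below is illustrative only: the `toy` DataFrame and its values are invented stand-ins, not rows from `titanic.csv`. It shows one way to summarize missing values as percentages and to read the survival rate off a 0/1 target.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, "E46", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "male"],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isnull().sum()
missing_pct = (missing / len(toy) * 100).round(1)
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary[summary["missing"] > 0])

# For a 0/1 target, the mean equals the share of positives (the survival rate).
survival_rate = toy["Survived"].mean()
print(f"Survival rate: {survival_rate:.2%}")

# Separate numeric columns from everything else (categorical or text).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print("Numeric:", numeric_cols)
print("Non-numeric:", other_cols)
```

Seeing missing values as percentages makes it easier to decide later whether a column should be imputed or reduced to a simpler indicator.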


**Reflection Questions:**
1. What percentage of passengers survived?
2. Which features have missing values?
3. What data types do we have? (numeric, categorical, text)

---

## 2. Exploratory Data Analysis (EDA)

Now let's dive deeper into the data to understand patterns and relationships that might help predict survival.


### Exercise 2: Survival by Key Features

**Task:** Analyze how different features relate to survival rates.

**Key questions:**
- Did gender affect survival chances?
- How did ticket class (Pclass) influence survival?
- What about age groups?
- Did passengers with family members have better survival rates?


In [None]:
# TODO: Create visualizations to explore survival patterns
# Create a 2x2 subplot layout to compare different features

# 1. Survival by Gender
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
# TODO: Create a countplot showing survival by gender
# Hint: Use sns.countplot with data=df, x='Sex', hue='Survived'
# Your code here:

plt.title('Survival Count by Gender')

# 2. Survival by Passenger Class
plt.subplot(2, 2, 2)
# TODO: Create a countplot showing survival by passenger class
# Hint: Use sns.countplot with data=df, x='Pclass', hue='Survived'
# Your code here:

plt.title('Survival Count by Passenger Class')

# 3. Age distribution by survival
plt.subplot(2, 2, 3)
# TODO: Create a histogram showing age distribution by survival
# Hint: Use sns.histplot with data=df, x='Age', hue='Survived', alpha=0.7
# Your code here:

plt.title('Age Distribution by Survival')

# 4. Fare distribution by survival
plt.subplot(2, 2, 4)
# TODO: Create a histogram showing fare distribution by survival
# Hint: Use sns.histplot with data=df, x='Fare', hue='Survived', alpha=0.7
# Your code here:

plt.title('Fare Distribution by Survival')

plt.tight_layout()
plt.show()


In [None]:
# TODO: Calculate survival rates by different features
# Hint: Use df.groupby() to calculate mean survival rates for different features

print("Survival Rates by Feature:")
print("=" * 40)

# TODO: Calculate survival rate by gender
# Hint: Use df.groupby('Sex')['Survived'].mean()
# Your code here:


# TODO: Calculate survival rate by passenger class
# Hint: Use df.groupby('Pclass')['Survived'].mean()
# Your code here:


# TODO: Calculate survival rate by embarkation port
# Hint: Use df.groupby('Embarked')['Survived'].mean()
# Your code here:


print(f"\nBy Gender:")
print("Your results here...")

print(f"\nBy Passenger Class:")
print("Your results here...")

print(f"\nBy Embarkation Port:")
print("Your results here...")
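
As a rough illustration of the `groupby` pattern from the hints, here is a sketch on an invented `toy` table (the values are made up, so the rates mean nothing historically). It also reports the group size next to each rate, since a rate computed from a handful of passengers is noisy.

```python
import pandas as pd

# Invented toy data; not taken from titanic.csv.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2, 2, 1],
    "Survived": [0, 1, 0, 0, 0, 1, 1, 1],
})

# Mean of a 0/1 column within each group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)

# Showing the count alongside the mean reveals how many rows back each rate.
by_class = toy.groupby("Pclass")["Survived"].agg(["mean", "count"])
print(by_class)

# Two features at once: rows are Sex, columns are Pclass.
rate_table = toy.pivot_table(index="Sex", columns="Pclass",
                             values="Survived", aggfunc="mean")
print(rate_table)
```

The same three calls should work on `df` once it is loaded, with `'Embarked'` substituted where needed.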


### Exercise 3: Feature Engineering

**Task:** Create new features that might improve our model's performance.

**Ideas to explore:**
- Family size (SibSp + Parch + 1)
- Age groups (child, adult, senior)
- Title extraction from names (Mr., Mrs., Miss, etc.)
- Fare per person (Fare / Family Size)


In [None]:
# TODO: Create new features to improve model performance

# 1. Family Size
# TODO: Create a FamilySize feature by adding SibSp + Parch + 1
# Hint: df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
# Your code here:


# 2. Is Alone (no family members)
# TODO: Create an IsAlone feature (1 if FamilySize == 1, 0 otherwise)
# Hint: df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
# Your code here:


# 3. Age Groups
# TODO: Create an AgeGroup column using pd.cut()
# Hint: df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100], labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior'])
# Your code here:


# 4. Extract Title from Name
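# TODO: Extract titles from names into a Title column using string methods
# Hint: Names look like "Surname, Title. Given names", so one option is
#       df['Title'] = df['Name'].str.extract(r',\s*([^.]+)\.', expand=False)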
# Your code here:


# 5. Fare per Person
# TODO: Create FarePerPerson by dividing Fare by FamilySize
# Your code here:


print("New features created!")
print(f"\nFamily Size distribution:")
print("Your results here...")

print(f"\nTitle distribution:")
print("Your results here...")
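
The feature ideas above can be sketched on a few rows written in the style of the dataset. This is one possible approach, not the only one: the `toy` frame is a stand-in, and the regular expression assumes names follow the "Surname, Title. Given names" pattern.

```python
import pandas as pd

# A few rows in the style of the dataset, used only for illustration.
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ],
    "Age": [22.0, 38.0, 26.0, 2.0],
    "SibSp": [1, 1, 0, 3],
    "Parch": [0, 0, 0, 1],
    "Fare": [7.25, 71.2833, 7.925, 21.075],
})

# Family size counts siblings/spouses, parents/children, and the passenger.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# pd.cut assigns each age to a right-closed interval, e.g. (0, 12] is 'Child'.
toy["AgeGroup"] = pd.cut(
    toy["Age"],
    bins=[0, 12, 18, 35, 60, 100],
    labels=["Child", "Teen", "Young Adult", "Adult", "Senior"],
)

# Capture the text between the comma and the first period (assumed name format).
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# FamilySize is at least 1, so this division never hits zero.
toy["FarePerPerson"] = toy["Fare"] / toy["FamilySize"]

print(toy[["FamilySize", "IsAlone", "AgeGroup", "Title", "FarePerPerson"]])
```

Rows with a missing `Age` get a missing `AgeGroup` as well, so the age groups are most useful after the imputation step in Section 3.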


In [None]:
# TODO: Visualize new features to understand their relationship with survival

plt.figure(figsize=(15, 5))

# Family Size vs Survival
plt.subplot(1, 3, 1)
# TODO: Create a barplot showing survival rate by family size
# Hint: Use sns.barplot(data=df, x='FamilySize', y='Survived')
# Your code here:

plt.title('Survival Rate by Family Size')
plt.ylabel('Survival Rate')

# Age Group vs Survival
plt.subplot(1, 3, 2)
# TODO: Create a barplot showing survival rate by age group
# Hint: Use sns.barplot(data=df, x='AgeGroup', y='Survived')
# Your code here:

plt.title('Survival Rate by Age Group')
plt.xticks(rotation=45)
plt.ylabel('Survival Rate')

# Title vs Survival (top 5 titles)
plt.subplot(1, 3, 3)
# TODO: Filter to top 5 titles and create barplot
# Hint: First get top_titles = df['Title'].value_counts().head(5).index
# Then filter: title_data = df[df['Title'].isin(top_titles)]
# Finally: sns.barplot(data=title_data, x='Title', y='Survived')
# Your code here:

plt.title('Survival Rate by Title (Top 5)')
plt.xticks(rotation=45)
plt.ylabel('Survival Rate')

plt.tight_layout()
plt.show()


**Reflection Questions:**
1. Which features show the strongest relationship with survival?
2. What patterns did you discover that might help predict survival?
3. Are there any surprising findings?

---

## 3. Data Preprocessing

Now let's prepare our data for machine learning by handling missing values, encoding categorical variables, and scaling features.


### Exercise 4: Handle Missing Values and Encode Categorical Variables

**Task:** Clean the data for machine learning.

**Steps:**
1. Handle missing values in Age, Embarked, Fare, and Cabin
2. Encode categorical variables (Sex, Embarked, Title)
3. Select relevant features for modeling


In [None]:
# TODO: Handle missing values strategically
# Think about which method makes sense for each feature

# 1. Age - Fill with median age grouped by Pclass and Sex
# TODO: Use groupby and transform to fill Age missing values
# Hint: df['Age'] = df.groupby(['Pclass', 'Sex'])['Age'].transform(lambda x: x.fillna(x.median()))
# Your code here:


# 2. Embarked - Fill with mode (most common value)
# TODO: Fill missing Embarked values with the most frequent value
# Hint: df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Your code here:


# 3. Fare - Fill with median fare by Pclass
# TODO: Use groupby to fill missing Fare values
# Hint: Similar to Age but group by 'Pclass' only
# Your code here:


# 4. Cabin - Create a binary feature indicating if passenger had a cabin
# TODO: Create HasCabin feature (1 if cabin exists, 0 if not)
# Hint: df['HasCabin'] = df['Cabin'].notna().astype(int)
# Your code here:


print("Missing values after preprocessing:")
print("Your results here...")

print(f"\nAge statistics after filling:")
print("Your results here...")
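
To make the imputation strategy concrete, here is a small sketch on invented data (the `toy` frame is hypothetical). The point to notice is that `fillna` and `transform` return new objects, so the result has to be assigned back to the column for the fill to take effect.

```python
import numpy as np
import pandas as pd

# Invented toy data with gaps in Age, Embarked, and Cabin.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, np.nan, 40.0, 20.0, 24.0, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "S", "Q"],
    "Cabin": ["C85", np.nan, "E46", np.nan, np.nan, np.nan],
})

# Group-wise median: each missing Age is filled from passengers
# sharing the same class and sex, then assigned back to the column.
toy["Age"] = toy.groupby(["Pclass", "Sex"])["Age"].transform(
    lambda s: s.fillna(s.median())
)

# Mode imputation for a categorical column; mode() returns a Series, so take [0].
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Cabin is mostly empty here, so keep only whether a value was recorded.
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

print(toy)
print(toy[["Age", "Embarked"]].isnull().sum())
```

If an entire group has no known ages, its median is also missing and the gap survives, so it is worth re-checking `isnull().sum()` after the fill.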


In [None]:
# TODO: Encode categorical variables for machine learning

# Label encoding for binary variables (Sex)
# TODO: Use LabelEncoder to encode Sex column
# Hint: le_sex = LabelEncoder() then df['Sex_encoded'] = le_sex.fit_transform(df['Sex'])
# Your code here:


# One-hot encoding for multi-class variables
# TODO: Use pd.get_dummies for Embarked and Title columns
# Hint: df = pd.get_dummies(df, columns=['Embarked', 'Title'], prefix=['Emb', 'Title'])
# Your code here:


print("Categorical variables encoded!")
print(f"\nSex mapping:")
print("Your results here...")

print(f"\nNew columns created:")
print("Your results here...")
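
The two encoding styles can be compared side by side on a small invented table. This sketch assumes nothing beyond pandas and scikit-learn; the `toy` frame and the `sex_mapping` helper name are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented toy data for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# LabelEncoder sorts the classes and assigns 0, 1, ... in that order.
le_sex = LabelEncoder()
toy["Sex_encoded"] = le_sex.fit_transform(toy["Sex"])

# classes_ holds the original labels; the index of each label is its code.
sex_mapping = {label: code for code, label in enumerate(le_sex.classes_)}
print(sex_mapping)

# One-hot encoding creates one 0/1 column per category.
toy = pd.get_dummies(toy, columns=["Embarked"], prefix=["Emb"], dtype=int)

# Collect the dummy columns by prefix, as the feature-selection step needs.
dummy_cols = [col for col in toy.columns if col.startswith("Emb_")]
print(dummy_cols)
print(toy)
```

An integer code is harmless for a two-valued column like `Sex`, while one-hot columns avoid implying an order among ports or titles.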


In [None]:
# TODO: Select features for modeling
# Choose which features to use in your model

# TODO: Define your feature columns list
# Include: 'Pclass', 'Sex_encoded', 'Age', 'SibSp', 'Parch', 'Fare', 
#          'FamilySize', 'IsAlone', 'HasCabin', 'FarePerPerson'
# Plus any columns that start with 'Emb_' or 'Title_'
# Hint: Use list comprehension to get dummy columns
# Your code here:


# TODO: Create feature matrix X and target vector y
# Hint: X = df[feature_columns], y = df['Survived']
# Your code here:


print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"\nSelected features:")
print("Your feature list here...")


---

## 4. Model Building and Evaluation

Time to build and compare multiple machine learning models!


### Exercise 5: Train-Test Split and Model Training

**Task:** Split the data and train multiple models to compare their performance.

**Models to try:**
- Logistic Regression
- Random Forest Classifier
- Support Vector Machine (bonus)


In [None]:
# TODO: Split the data into training and testing sets
# Hint: Use train_test_split with test_size=0.2, random_state=42, stratify=y
# Your code here:


# TODO: Scale the features for models that need it (like Logistic Regression)
# Hint: Use StandardScaler() to fit on training data and transform both sets
# Your code here:


print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"\nTraining set survival rate: {y_train.mean():.2%}")
print(f"Test set survival rate: {y_test.mean():.2%}")
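
A minimal sketch of the split-then-scale order, using randomly generated numbers in place of the Titanic features (the arrays here are synthetic). The scaler learns its mean and standard deviation from the training rows only and then applies them to the test rows, which keeps test information out of preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 rows, 3 features, 40% positive labels.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = np.array([1] * 80 + [0] * 120)

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training data only, then reuse those statistics for the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Train positives: {y_train.mean():.2%}, test positives: {y_test.mean():.2%}")
print("Train means after scaling:", X_train_scaled.mean(axis=0).round(3))
print("Test means after scaling: ", X_test_scaled.mean(axis=0).round(3))
```

The scaled test means are close to zero but not exactly zero, which is expected because the test rows played no part in fitting the scaler.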


In [None]:
# TODO: Train and evaluate multiple models

# TODO: Create a dictionary of models to compare
# Include LogisticRegression and RandomForestClassifier
# Hint: models = {'Logistic Regression': LogisticRegression(random_state=42), ...}
# Your code here:


# TODO: Initialize results dictionary to store model performance
# Your code here:


for name, model in models.items():
    print(f"\n{'='*50}")
    print(f"Training {name}")
    print(f"{'='*50}")
    
    # TODO: Train the model
    # Hint: Use scaled data for Logistic Regression, original data for Random Forest
    # Your code here:
    
    
    # TODO: Make predictions on test set
    # Your code here:
    
    
    # TODO: Calculate accuracy score
    # Hint: Use accuracy_score(y_test, y_pred)
    # Your code here:
    
    
    # TODO: Perform cross-validation
    # Hint: Use cross_val_score on the training data with cv=5, leaving the test set untouched
    # Your code here:
    
    
    # TODO: Store results in dictionary
    # Hint: Later cells read results[name]['accuracy'] and results[best_model]['predictions']
    # Your code here:
    
    
    print(f"Test Accuracy: {accuracy:.4f}")
    print(f"CV Mean: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    
    # TODO: Print classification report
    # Hint: Use classification_report(y_test, y_pred)
    # Your code here:


In [None]:
# TODO: Compare model performance and find the best one

print("\n" + "="*60)
print("MODEL COMPARISON SUMMARY")
print("="*60)

# TODO: Loop through results and print performance metrics
# Your code here:


# TODO: Find the best model based on accuracy
# Hint: Use max() with a key function
# Your code here:


print(f"\nBest Model: {best_model} with {results[best_model]['accuracy']:.4f} accuracy")
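
One way to structure the training and comparison loop is sketched below on synthetic data from `make_classification`, so the numbers it prints say nothing about the Titanic. As a variation on the hint, the scaler is wrapped in a pipeline with Logistic Regression so that each cross-validation fold fits its own scaler. That is a design choice of this sketch rather than a requirement of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the prepared features.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    # The pipeline scales inside each CV fold before fitting the classifier.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=42)
    ),
    # Tree ensembles do not need scaled inputs.
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    # Cross-validate on the training portion only.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    # Refit on the full training set, then score once on the held-out test set.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "cv_mean": cv_scores.mean(),
        "cv_std": cv_scores.std(),
        "predictions": y_pred,
    }
    print(f"{name}: test={results[name]['accuracy']:.4f}, "
          f"cv={cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# max() over the dictionary keys, ranked by each model's test accuracy.
best_model = max(results, key=lambda name: results[name]["accuracy"])
print(f"Best model: {best_model}")
```

When two models score within a standard deviation of each other, the cross-validation mean is usually a steadier guide than a single test split.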


### Exercise 6: Feature Importance Analysis

**Task:** Understand which features are most important for predictions.


In [None]:
# TODO: Analyze feature importance (for Random Forest)

# Check that a Random Forest model was trained before reading its importances
if 'Random Forest' in models:


    # TODO: Create a DataFrame with feature names and importance scores
    # Hint: Use pd.DataFrame with 'feature' and 'importance' columns
    # Your code here:
    
    
    # TODO: Create a barplot of top 10 most important features
    # Hint: Use sns.barplot with the top 10 features
    # Your code here:
    
    
    print("Top 10 Most Important Features:")
    # TODO: Print the top 10 features and their importance scores
    # Your code here:
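
As a sketch of the importance table described above, the example below fits a Random Forest on synthetic data with placeholder column names (`feature_0` and so on are hypothetical) and ranks the features by `feature_importances_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
feature_names = [f"feature_{i}" for i in range(6)]
X_arr, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                               n_redundant=0, random_state=42)
X = pd.DataFrame(X_arr, columns=feature_names)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Pair each column with its impurity-based importance and sort descending.
importance_df = (
    pd.DataFrame({"feature": X.columns, "importance": rf.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)

print("Top features:")
print(importance_df.head(10))
```

These scores are normalized to sum to 1 and reflect how often and how usefully a feature was used for splits. They do not indicate the direction of the effect.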


### Exercise 7: Model Interpretation

**Task:** Create visualizations to understand model predictions.


In [None]:
# TODO: Create confusion matrix for best model

# TODO: Get predictions from the best model
# Hint: Use results[best_model]['predictions']
# Your code here:


# TODO: Create confusion matrix visualization
# Hint: Use confusion_matrix() and sns.heatmap()
# Your code here:


# TODO: Print confusion matrix details
# Hint: Use cm.ravel() to get tn, fp, fn, tp
# Your code here:


print(f"\nConfusion Matrix Details:")
print("Your results here...")
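
To see what `cm.ravel()` returns, here is a small worked example with hand-written labels (invented, not model output). For binary labels 0 and 1, scikit-learn orders the flattened matrix as true negatives, false positives, false negatives, true positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration: 5 actual non-survivors, 5 survivors.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted survivors, how many survived
recall = tp / (tp + fn)      # of actual survivors, how many were found

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}")
```

Passing `cm` to `sns.heatmap(cm, annot=True, fmt='d')` gives the visual version of the same table.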


---

## 5. Project Summary and Insights

### Exercise 8: Key Insights and Learnings

**Task:** Reflect on what you've learned and the insights gained.


### Key Findings from Our Analysis:

1. **Gender was the strongest predictor**: Women had a much higher survival rate (~74%) compared to men (~19%)
2. **Passenger class mattered**: First-class passengers had better survival rates than third-class passengers
3. **Age played a role**: Children had higher survival rates, possibly due to the "women and children first" evacuation policy
4. **Family size had mixed effects**: Very large families (7+ members) had lower survival rates

### Model Performance:
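- Models built with this workflow typically reach roughly 80-85% test accuracy; your exact numbers will depend on the features you chose and the train-test split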
- Random Forest typically performs well on this dataset
- Feature engineering (creating new features) often improves model performance

### What This Teaches Us:
- **Domain knowledge matters**: Understanding the historical context (women and children first) helps interpret results
- **Feature engineering is crucial**: Creating meaningful features can significantly improve model performance
- **Multiple models should be compared**: Different algorithms may perform better on different types of data
- **Model interpretation is important**: Understanding which features matter helps build trust in the model


### Challenge Questions (Optional):

Try these extensions to deepen your understanding:

1. **Feature Engineering Challenge**: Create more sophisticated features like:
   - Age categories based on survival patterns you observed
   - Fare categories (low, medium, high)
   - Family size categories

2. **Model Tuning**: Try hyperparameter tuning for your best model:
   ```python
   from sklearn.model_selection import GridSearchCV
   # Tune Random Forest parameters like n_estimators, max_depth, etc.
   ```

3. **Additional Models**: Try other algorithms:
   - Support Vector Machine
   - Gradient Boosting
   - Naive Bayes

4. **Advanced Evaluation**: Implement more sophisticated evaluation:
   - ROC curves and AUC scores
   - Precision-Recall curves
   - Learning curves

5. **Business Insights**: Think about how this model could be used:
   - What would you tell a cruise line about passenger safety?
   - How might this analysis inform emergency procedures?
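
For the model-tuning challenge, a minimal grid search might look like the sketch below. It runs on synthetic data, and the grid values are arbitrary starting points chosen for this example rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared training features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=42)

# Every combination is tried: 2 x 3 = 6 candidates, each scored with 5-fold CV.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")

# By default the search refits the best candidate on all the data it was given.
tuned_model = search.best_estimator_
```

In the project itself, the search would be fit on the training split only, with the test set kept for a single final check.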


---

## 6. Next Steps and Further Learning

### Congratulations!
You've completed your first end-to-end machine learning project! You've learned how to:
- Perform exploratory data analysis
- Engineer features to improve model performance
- Train and evaluate multiple machine learning models
- Interpret results and draw meaningful insights

### Recommended Next Steps:

1. **Kaggle Competitions**: 
   - Submit your predictions to the actual Titanic competition
   - Explore other beginner-friendly competitions
   - Learn from top solutions and discussions

2. **Advanced Techniques to Explore**:
   - **Ensemble Methods**: Combine multiple models for better predictions
   - **Cross-Validation**: Go beyond the 5-fold setup used here, for example with repeated k-fold, for more robust evaluation
   - **Feature Selection**: Learn techniques to select the most important features
   - **Hyperparameter Tuning**: Optimize model parameters for better performance

3. **Related Datasets to Practice On**:
   - **House Prices**: Predict house prices (regression problem)
   - **Customer Churn**: Predict which customers will leave (classification)
   - **Spam Detection**: Classify emails as spam or not spam
   - **Loan Default**: Predict loan default risk

4. **Skills to Develop Next**:
   - **Model Deployment**: Learn to deploy models as web applications
   - **MLOps**: Learn about model versioning and monitoring
   - **Deep Learning**: Explore neural networks for more complex problems
   - **Time Series**: Learn to handle temporal data

### Key Takeaways:

- **Data understanding is crucial**: Spend time exploring your data before modeling
- **Feature engineering matters**: Often more important than choosing the "best" algorithm
- **Domain knowledge helps**: Understanding the problem context improves your analysis
- **Multiple models should be compared**: Don't rely on just one algorithm
- **Interpretability is important**: Being able to explain your model builds trust

### Useful Resources:

- **Kaggle Learn**: Free micro-courses on data science and machine learning
- **Scikit-learn Documentation**: Comprehensive guide to ML algorithms
- **Towards Data Science**: Medium publication with practical ML articles
- **Machine Learning Mastery**: Jason Brownlee's excellent ML tutorials

**Keep practicing and building projects!** The more you work with real data and solve real problems, the better you'll become at machine learning.

---
## 📫 Let's Connect
- 💼 **LinkedIn:** [hashirahmed07](https://www.linkedin.com/in/hashirahmed07/)
- 📧 **Email:** [Hashirahmad330@gmail.com](mailto:Hashirahmad330@gmail.com)
- 🐙 **GitHub:** [CodeByHashir](https://github.com/CodeByHashir)

---

*Happy Learning!*
