# Titanic Survival Prediction - Model Development

This notebook develops a machine learning model to predict passenger survival on the Titanic.

**Selected Features:**
- Pclass (Passenger Class)
- Sex (Gender)
- Age
- Fare
- Embarked (Port of Embarkation)

**Algorithm:** Random Forest Classifier

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import joblib
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## 2. Load Dataset

In [None]:
# Load the Titanic dataset from a public GitHub mirror
# The data is also available from Kaggle: https://www.kaggle.com/c/titanic/data
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

## 3. Exploratory Data Analysis

In [None]:
# Check dataset info
print("Dataset Information:")
df.info()

print("\n" + "="*50)
print("Missing Values:")
print(df.isnull().sum())

print("\n" + "="*50)
print("Statistical Summary:")
df.describe()

In [None]:
# Survival rate
print("Survival Rate:")
print(df['Survived'].value_counts())
print(f"\nSurvival Percentage: {df['Survived'].mean()*100:.2f}%")
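
The overall survival rate can hide large differences between groups. As a minimal sketch, the snippet below computes per-group rates with `groupby`. The `toy` frame and its six rows are invented for illustration and are not taken from the dataset; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the Titanic frame; these six rows are invented for illustration.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 3, 1, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```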

## 4. Data Preprocessing

### 4.1 Feature Selection

In [None]:
# Select the 5 features + target variable
selected_features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Survived']
df_selected = df[selected_features].copy()

print(f"Selected features shape: {df_selected.shape}")
print("\nMissing values in selected features:")
print(df_selected.isnull().sum())

### 4.2 Handle Missing Values

In [None]:
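# Fill missing Age values with the median.
# The result is assigned back to the column: calling fillna(inplace=True) on a
# column selection is deprecated in recent pandas and may not update df_selected.
df_selected['Age'] = df_selected['Age'].fillna(df_selected['Age'].median())

# Fill missing Embarked values with the mode (most common value)
df_selected['Embarked'] = df_selected['Embarked'].fillna(df_selected['Embarked'].mode()[0])

# Fill missing Fare values with the median
df_selected['Fare'] = df_selected['Fare'].fillna(df_selected['Fare'].median())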

print("Missing values after handling:")
print(df_selected.isnull().sum())
print("\n✓ All missing values handled!")

### 4.3 Encode Categorical Variables

In [None]:
# Create label encoders
le_sex = LabelEncoder()
le_embarked = LabelEncoder()

# Encode Sex: male=1, female=0
df_selected['Sex'] = le_sex.fit_transform(df_selected['Sex'])

# Encode Embarked: C=0, Q=1, S=2
df_selected['Embarked'] = le_embarked.fit_transform(df_selected['Embarked'])

print("Encoding mappings:")
print(f"Sex: {dict(zip(le_sex.classes_, le_sex.transform(le_sex.classes_)))}")
print(f"Embarked: {dict(zip(le_embarked.classes_, le_embarked.transform(le_embarked.classes_)))}")

print("\n✓ Categorical variables encoded!")
df_selected.head()
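
`LabelEncoder` sorts the distinct labels and numbers them from 0, which is why the mappings come out as female=0, male=1 and C=0, Q=1, S=2. For `Embarked`, those integers suggest an ordering among ports that has no real meaning. Tree ensembles usually tolerate this, but one-hot encoding is a common alternative. The sketch below shows both on a toy series (the four values are invented for illustration).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")  # toy values

# LabelEncoder sorts the distinct labels, then numbers them from 0.
le = LabelEncoder()
codes = le.fit_transform(ports)
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# One-hot alternative: one indicator column per port, with no implied order.
one_hot = pd.get_dummies(ports, prefix="Embarked", dtype=int)
print(one_hot)
```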

### 4.4 Prepare Features and Target

In [None]:
# Separate features (X) and target (y)
X = df_selected.drop('Survived', axis=1)
y = df_selected['Survived']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print("\nFeature columns:")
print(X.columns.tolist())

### 4.5 Feature Scaling

In [None]:
# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the features
X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame for better visualization
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

print("Features after scaling:")
print(X_scaled_df.head())
print("\n✓ Feature scaling completed!")
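
`StandardScaler` replaces each value `x` in a column with `z = (x - mean) / std`, where `mean` and `std` are that column's mean and population standard deviation. The sketch below checks this by hand on four toy fare values (invented for illustration). Random forests split on thresholds and are generally insensitive to this kind of rescaling, so the step matters mostly if a scale-sensitive model is swapped in later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

fares = np.array([[8.0], [16.0], [24.0], [100.0]])  # toy values

mu = fares.mean(axis=0)
sigma = fares.std(axis=0)  # NumPy's default ddof=0 is the population std
z_manual = (fares - mu) / sigma

z_sklearn = StandardScaler().fit_transform(fares)

# The manual formula and StandardScaler should agree up to floating-point error.
print(np.allclose(z_manual, z_sklearn))
```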

### 4.6 Train-Test Split

In [None]:
# Split the data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print(f"\nTraining set survival rate: {y_train.mean()*100:.2f}%")
print(f"Testing set survival rate: {y_test.mean()*100:.2f}%")
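
The cells above compute the imputation values and fit the scaler on all rows before splitting, so the test rows influence the preprocessing. One common way to avoid this is to split first and wrap the preprocessing in a scikit-learn `Pipeline` that is fitted on the training rows only. The sketch below is an assumption about how that could look, not the notebook's method: the data is synthetic, the target is a toy rule, and it swaps in one-hot encoding for the categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the raw Titanic columns (random values, not real data).
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["male", "female"], n),
    "Age": rng.uniform(1, 80, n),
    "Fare": rng.uniform(5, 200, n),
    "Embarked": rng.choice(["C", "Q", "S"], n),
})
raw.loc[::10, "Age"] = np.nan                  # inject some missing ages
target = (raw["Sex"] == "female").astype(int)  # toy target for the demo

numeric = ["Pclass", "Age", "Fare"]
categorical = ["Sex", "Embarked"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    raw, target, test_size=0.2, random_state=42, stratify=target
)
model.fit(X_tr, y_tr)  # medians, scaling statistics and categories come from X_tr only
print(f"Test accuracy on toy data: {model.score(X_te, y_te):.2f}")
```

Because the pipeline learns every preprocessing statistic inside `fit`, the same object can later be applied to new rows with `model.predict` without repeating the steps by hand.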

## 5. Model Training

### 5.1 Initialize and Train Random Forest Classifier

In [None]:
# Initialize Random Forest Classifier
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)

print("Training Random Forest Classifier...")
rf_model.fit(X_train, y_train)
print("✓ Model training completed!")

## 6. Model Evaluation

In [None]:
# Make predictions
y_train_pred = rf_model.predict(X_train)
y_test_pred = rf_model.predict(X_test)

# Calculate accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("="*60)
print("MODEL PERFORMANCE")
print("="*60)
print(f"Training Accuracy: {train_accuracy*100:.2f}%")
print(f"Testing Accuracy: {test_accuracy*100:.2f}%")
print("="*60)
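
A single 80/20 split yields one accuracy figure that can shift with `random_state`. As a hedged cross-check, k-fold cross-validation reports a mean and spread over several splits. The sketch below uses synthetic data from `make_classification` as a stand-in for the five Titanic features; the `X_demo`, `y_demo`, and `clf` names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary data standing in for the five Titanic features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Same hyperparameters as the notebook's model.
clf = RandomForestClassifier(
    n_estimators=100, max_depth=10, min_samples_split=5,
    min_samples_leaf=2, random_state=42,
)

# Five stratified folds: each fold keeps roughly the same class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```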

In [None]:
# Classification Report
print("\nCLASSIFICATION REPORT (Test Set):")
print("="*60)
print(classification_report(y_test, y_test_pred, target_names=['Did Not Survive', 'Survived']))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Did Not Survive', 'Survived'],
            yticklabels=['Did Not Survive', 'Survived'])
plt.title('Confusion Matrix - Random Forest Classifier', fontsize=14, fontweight='bold')
plt.ylabel('Actual', fontsize=12)
plt.xlabel('Predicted', fontsize=12)
plt.tight_layout()
plt.show()

print(f"\nTrue Negatives: {cm[0,0]}")
print(f"False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}")
print(f"True Positives: {cm[1,1]}")
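
The four counts printed above determine the headline metrics in the classification report. The worked example below uses invented counts (not results from this model) in the same row/column layout as scikit-learn's `confusion_matrix`.

```python
import numpy as np

# Invented counts in scikit-learn's layout: rows = actual, columns = predicted.
cm_demo = np.array([[90, 10],
                    [20, 40]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)   # of predicted survivors, the share who survived
recall = tp / (tp + fn)      # of actual survivors, the share that was found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```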

In [None]:
# Feature Importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='Importance', y='Feature', palette='viridis')
plt.title('Feature Importance - Random Forest', fontsize=14, fontweight='bold')
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.tight_layout()
plt.show()

print("\nFeature Importance Ranking:")
print(feature_importance.to_string(index=False))
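
The `feature_importances_` values are impurity-based and normalized to sum to 1, and they are computed from the training data. As a possible cross-check, permutation importance measures how much held-out accuracy drops when one column is shuffled. The sketch below runs both on synthetic data; the data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the five Titanic features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances, normalized to sum to 1.
impurity_importance = rf.feature_importances_
print(impurity_importance.round(3), impurity_importance.sum())

# Permutation importance: mean drop in test accuracy when one column is shuffled.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(perm.importances_mean.round(3))
```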

## 7. Save the Model and Preprocessing Objects

In [None]:
# Save the trained model
joblib.dump(rf_model, 'titanic_survival_model.pkl')
print("✓ Model saved as 'titanic_survival_model.pkl'")

# Save the scaler
joblib.dump(scaler, 'scaler.pkl')
print("✓ Scaler saved as 'scaler.pkl'")

# Save the label encoders
joblib.dump(le_sex, 'label_encoder_sex.pkl')
joblib.dump(le_embarked, 'label_encoder_embarked.pkl')
print("✓ Label encoders saved")

print("\n" + "="*60)
print("All model artifacts saved successfully!")
print("="*60)
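
Saving the model, scaler, and encoders as four files means they must be kept in sync at load time. One alternative, sketched below under the assumption that the preprocessing lives in a scikit-learn pipeline, is to persist a single object. The data is synthetic and the file name is hypothetical. Note that `joblib.load` relies on pickle, so only files from a trusted source should be loaded.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with five numeric features and a simple toy target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Scaler and model bundled into one object.
bundle = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, random_state=42),
)
bundle.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "titanic_pipeline_demo.pkl")  # hypothetical file name
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions exactly.
same = np.array_equal(bundle.predict(X_demo), restored.predict(X_demo))
print(f"Predictions identical after reload: {same}")
```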

## 8. Demonstrate Model Reload and Prediction

In [None]:
# Load the saved model
loaded_model = joblib.load('titanic_survival_model.pkl')
loaded_scaler = joblib.load('scaler.pkl')
loaded_le_sex = joblib.load('label_encoder_sex.pkl')
loaded_le_embarked = joblib.load('label_encoder_embarked.pkl')

print("✓ Model and preprocessing objects loaded successfully!")
print("\nDemonstrating prediction with loaded model...\n")

In [None]:
# Test Case 1: High-class female passenger (likely to survive)
test_passenger_1 = {
    'Pclass': 1,
    'Sex': 'female',
    'Age': 29,
    'Fare': 100,
    'Embarked': 'C'
}

# Test Case 2: Low-class male passenger (likely not to survive)
test_passenger_2 = {
    'Pclass': 3,
    'Sex': 'male',
    'Age': 25,
    'Fare': 8,
    'Embarked': 'S'
}

def predict_survival(passenger_data):
    """Return the predicted class (0 or 1) and the class probabilities for one passenger."""
    # Encode categorical variables
    sex_encoded = loaded_le_sex.transform([passenger_data['Sex']])[0]
    embarked_encoded = loaded_le_embarked.transform([passenger_data['Embarked']])[0]
    
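    # Build a one-row DataFrame with the same column names and order that the
    # scaler was fitted on, so scikit-learn can match the features by name
    features = pd.DataFrame([{
        'Pclass': passenger_data['Pclass'],
        'Sex': sex_encoded,
        'Age': passenger_data['Age'],
        'Fare': passenger_data['Fare'],
        'Embarked': embarked_encoded
    }])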
    
    # Scale features
    features_scaled = loaded_scaler.transform(features)
    
    # Make prediction
    prediction = loaded_model.predict(features_scaled)[0]
    probability = loaded_model.predict_proba(features_scaled)[0]
    
    return prediction, probability

# Test predictions
print("="*60)
print("TEST CASE 1: First-class Female Passenger")
print("="*60)
print(f"Details: {test_passenger_1}")
pred1, prob1 = predict_survival(test_passenger_1)
print(f"\nPrediction: {'SURVIVED' if pred1 == 1 else 'DID NOT SURVIVE'}")
print(f"Probability: {prob1[1]*100:.2f}% chance of survival")

print("\n" + "="*60)
print("TEST CASE 2: Third-class Male Passenger")
print("="*60)
print(f"Details: {test_passenger_2}")
pred2, prob2 = predict_survival(test_passenger_2)
print(f"\nPrediction: {'SURVIVED' if pred2 == 1 else 'DID NOT SURVIVE'}")
print(f"Probability: {prob2[1]*100:.2f}% chance of survival")

print("\n" + "="*60)
print("✓ Model reload and prediction demonstration completed!")
print("="*60)
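
`LabelEncoder.transform` raises a `ValueError` when it receives a label it did not see during fitting, so a request with an unexpected `Sex` or `Embarked` value would fail inside `predict_survival`. A minimal sketch of validating input first is shown below. The `encode_or_fail` helper and the `le_embarked_demo` encoder are hypothetical and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in for the reloaded Embarked encoder.
le_embarked_demo = LabelEncoder().fit(["C", "Q", "S"])

def encode_or_fail(encoder, value, field):
    """Encode one label, raising a readable error for unknown values.

    Hypothetical helper, not part of the notebook above.
    """
    known = [str(c) for c in encoder.classes_]
    if value not in known:
        raise ValueError(f"{field}={value!r} is not one of {known}")
    return int(encoder.transform([value])[0])

print(encode_or_fail(le_embarked_demo, "S", "Embarked"))

try:
    encode_or_fail(le_embarked_demo, "X", "Embarked")
except ValueError as err:
    print(f"Rejected: {err}")
```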

## Summary

### Model Details:
- **Algorithm:** Random Forest Classifier
- **Features Used:** Pclass, Sex, Age, Fare, Embarked
- **Preprocessing:** Missing value imputation, label encoding, standard scaling
- **Persistence Method:** Joblib

### Model Performance:
- Training and test accuracy are reported in Section 6
- The classification report gives precision, recall, and F1-score for each class
- The model and preprocessing objects were saved with Joblib and reloaded to make predictions

### Next Steps:
1. Deploy the model in a web application
2. Create a user-friendly interface for predictions
3. Host the application online