# Titanic Survival Prediction

This mini-project uses the Titanic dataset to build basic classification models that predict passenger survival from features such as age, passenger class, and fare. The focus is on quick data cleaning, simple categorical encoding, and training two baseline models: Logistic Regression and Random Forest.


## Importing Libraries

In [1]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning and model evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


## Data Loading
We will load the Titanic dataset from seaborn's built-in datasets for ease of access.


In [6]:
# Load Titanic dataset from seaborn
df = sns.load_dataset('titanic')

# Preview
print(f"Dataset shape: {df.shape}")
df.head()


Dataset shape: (891, 15)


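| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |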


## Initial Exploration and Data Cleaning
We will check for missing values, drop irrelevant columns, and prepare the dataset for modeling.


In [7]:
# Check for missing values
print("Missing values per column:\n")
print(df.isnull().sum())

Missing values per column:

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


In [8]:
# Drop columns that are redundant or not useful for modeling
df.drop(['deck', 'embark_town', 'alive', 'who', 'class', 'adult_male'], axis=1, inplace=True)

# Fill missing age values with median
df['age'] = df['age'].fillna(df['age'].median())

# Fill missing embarked values with mode
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Confirm no missing values left
print("\nMissing values after cleaning:\n")
print(df.isnull().sum())

# Preview cleaned dataset
df.head()


Missing values after cleaning:

survived    0
pclass      0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    0
alone       0
dtype: int64


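| | survived | pclass | sex | age | sibsp | parch | fare | embarked | alone |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | True |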


### Data Cleaning Summary

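- Dropped columns that are redundant or mostly missing: `deck` (missing for 688 of 891 rows), `embark_town` and `class` (text duplicates of `embarked` and `pclass`), `alive` (a text copy of the target `survived`, which would leak the label), and `who` and `adult_male` (derived from `sex` and `age`).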
- Filled missing `age` values using the median (simple and robust against outliers).
- Filled missing `embarked` values using the mode (most common value).
- After cleaning, there are no missing values remaining in the dataset.
- Final set of core features includes:  
  - `survived` (target)  
  - `pclass`, `sex`, `age`, `sibsp`, `parch`, `fare`, `embarked`, `alone`
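
To illustrate why the median is described as robust, here is a minimal sketch of the same two imputation steps on a small made-up DataFrame. The `toy` data and its values are hypothetical and are not taken from the Titanic dataset; only the `fillna`, `median`, and `mode` calls mirror the cleaning cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one extreme age (80), two missing ages, one missing port
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 80.0, np.nan],
    "embarked": ["S", "C", "S", None, "Q", "S"],
})

# The median ignores how extreme the largest value is; the mean does not
age_median = toy["age"].median()
age_mean = toy["age"].mean()

# mode() returns a Series (there may be ties), so take the first entry
embarked_mode = toy["embarked"].mode()[0]

# Fill the gaps without modifying the original frame
cleaned = toy.assign(
    age=toy["age"].fillna(age_median),
    embarked=toy["embarked"].fillna(embarked_mode),
)

print(f"median age: {age_median}, mean age: {age_mean}")
print(cleaned.isnull().sum())
```

In this toy example the single age of 80 pulls the mean well above the median, which is the effect the median-based fill is meant to avoid.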


## Feature Encoding
We will encode categorical features into numeric format to prepare the dataset for machine learning.


In [9]:
# Encode 'sex' and 'embarked' using Label Encoding
le = LabelEncoder()

df['sex'] = le.fit_transform(df['sex'])           # classes are sorted alphabetically: female = 0, male = 1
df['embarked'] = le.fit_transform(df['embarked']) # C = 0, Q = 1, S = 2 (refitting replaces the 'sex' mapping stored in le)

# Double-check dataset
df.head()


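| | survived | pclass | sex | age | sibsp | parch | fare | embarked | alone |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 | 2 | False |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 | False |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 | 2 | True |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 | 2 | False |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 | 2 | True |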
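
The label encoding above assigns integers in alphabetical order of the class labels, which implies an ordering of the ports that has no real meaning. The sketch below shows how to inspect that mapping and, as one possible alternative that this notebook does not use, how one-hot encoding with `pd.get_dummies` avoids the implied order. The `ports` series is a small hypothetical sample, not the full column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of embarkation codes
ports = pd.Series(["S", "C", "S", "Q", "C"], name="embarked")

# LabelEncoder stores the sorted unique labels in classes_
le_port = LabelEncoder()
codes = le_port.fit_transform(ports)
mapping = {label: int(code) for code, label in enumerate(le_port.classes_)}
print(mapping)

# Alternative (assumption, not the notebook's method): one indicator column per port
one_hot = pd.get_dummies(ports, prefix="embarked", dtype=int)
print(one_hot.columns.tolist())
```

Tree-based models such as Random Forest are usually insensitive to this choice, while linear models such as Logistic Regression treat the integer codes as a numeric scale, so one-hot columns are often the safer input for them.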


## Model Building and Evaluation
We will train both a Logistic Regression model and a Random Forest Classifier to predict survival, and compare their performance.


In [10]:
# Final preprocessing: convert 'alone' from boolean to int
df['alone'] = df['alone'].astype(int)

# Define features and target
X = df.drop('survived', axis=1)
y = df['survived']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic Regression
lr = LogisticRegression(max_iter=500)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluate both models
print("Logistic Regression Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.2f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))

print("\nRandom Forest Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.2f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))


Logistic Regression Performance:
Accuracy: 0.80

Confusion Matrix:
[[89 16]
 [20 54]]

Random Forest Performance:
Accuracy: 0.83

Confusion Matrix:
[[91 14]
 [17 57]]


### Model Evaluation Summary

- **Logistic Regression:**
  - Accuracy: 80%
  - Confusion Matrix:
    - True Negatives: 89
    - False Positives: 16
    - False Negatives: 20
    - True Positives: 54

- **Random Forest Classifier:**
  - Accuracy: 83%
  - Confusion Matrix:
    - True Negatives: 91
    - False Positives: 14
    - False Negatives: 17
    - True Positives: 57
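
The counts listed above are enough to recompute accuracy and to derive precision and recall for the survivor class. The following is a small sketch using only NumPy; the helper name `summarize` is hypothetical, and the matrices are copied from the output above in scikit-learn's layout `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

def summarize(cm):
    """Derive accuracy, precision, and recall from a 2x2 confusion matrix."""
    # ravel() flattens row by row: [[TN, FP], [FN, TP]] -> TN, FP, FN, TP
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": tp / (tp + fp),  # of predicted survivors, share that survived
        "recall": tp / (tp + fn),     # of actual survivors, share that was found
    }

lr_metrics = summarize([[89, 16], [20, 54]])
rf_metrics = summarize([[91, 14], [17, 57]])

for name, metrics in [("Logistic Regression", lr_metrics), ("Random Forest", rf_metrics)]:
    print(name, {key: round(float(value), 3) for key, value in metrics.items()})
```

Recall for survivors works out to roughly 0.73 for Logistic Regression (54 of 74) and 0.77 for Random Forest (57 of 74), which puts a number on how many actual survivors each model misses.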

### Key Observations:
- Both models perform reasonably well, correctly classifying about 80–83% of the 179 test passengers.
- Random Forest slightly outperforms Logistic Regression, by about 3 percentage points (148 vs. 143 correct predictions).
- Both models produce more false negatives (actual survivors predicted as non-survivors) than false positives: 20 vs. 16 for Logistic Regression and 17 vs. 14 for Random Forest.
- Random Forest's ability to model non-linear interactions may explain its slight edge, although a gap of five predictions on a single train/test split could also reflect split-to-split variation.
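
One way to check whether a gap of this size between the two models holds up is to compare them with k-fold cross-validation instead of a single split. The sketch below is an illustration under stated assumptions: it uses a synthetic dataset from `make_classification` with the same shape as the feature matrix (891 rows, 8 features) so that it runs without downloading anything, and it adds a `StandardScaler` step in front of Logistic Regression, which the notebook itself does not do. To apply it to the real data, `X_demo` and `y_demo` would be replaced with `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data, NOT the Titanic dataset
X_demo, y_demo = make_classification(
    n_samples=891, n_features=8, n_informative=5, random_state=42
)

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=500)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# One accuracy score per fold for each model
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold standard deviation is comparable to the difference between the mean scores, the ranking of the two models should be treated as tentative.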


## Final Conclusion

In this mini-project, we built two classification models to predict passenger survival on the Titanic dataset.  
Both Logistic Regression and Random Forest performed well, with Random Forest achieving slightly higher accuracy (83%).

Key steps included:
- Light data cleaning and missing value handling
- Encoding categorical variables
- Training and evaluating two baseline models

This mini-project demonstrates a simple but complete machine learning workflow, from raw data to model evaluation.
