# Problem Definition & Setup

### 🎯 Objective
The goal of this project is to build a **machine learning model** that predicts whether a passenger survived the Titanic disaster. This is a **classification problem** because the output (**Survived**) has only two possible categories:

* **1** → Passenger survived
* **0** → Passenger did not survive

### 📌 Problem Type
problem_type = "classification"

### 🎯 Target Variable
target_column = "Survived"


In [3]:
import pandas as pd

# Load dataset
df = pd.read_csv("titanic.csv")

# Basic checks
print("Shape of dataset:", df.shape)
print("\nFirst 5 rows:\n", df.head())
print("\nMissing values:\n", df.isnull().sum())


Shape of dataset: (887, 8)

First 5 rows:
    Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0    male  22.0                        1                        0   7.2500  
1  female  38.0                        1                        0  71.2833  
2  female  26.0                        0                        0   7.9250  
3  female  35.0                        1                        0  53.1000  
4    male  35.0                        0                        0   8.0500  

Missing values:
 Survived                   0
Pcl

In [4]:
df

| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.0000 |
| 883 | 1 | 1 | Miss. Margaret Edith Graham | female | 19.0 | 0 | 0 | 30.0000 |
| 884 | 0 | 3 | Miss. Catherine Helen Johnston | female | 7.0 | 1 | 2 | 23.4500 |
| 885 | 1 | 1 | Mr. Karl Howell Behr | male | 26.0 | 0 | 0 | 30.0000 |


# Dataset Description

The dataset contains information about **887 Titanic passengers** with the following key features:

* **Pclass** → Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
* **Name** → Passenger name
* **Sex** → Gender
* **Age** → Age in years
* **Siblings/Spouses Aboard** → Number of siblings/spouses aboard
* **Parents/Children Aboard** → Number of parents/children aboard
* **Fare** → Ticket price
* **Survived** → Target label (0 = No, 1 = Yes)

### Dataset Shape
There are:
* **887** rows (passengers)
* **8** columns

In [11]:
# 1. Fill missing Age with median
df['Age'] = df['Age'].fillna(df['Age'].median())

# 2. Create FamilySize
df['FamilySize'] = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard'] + 1

# 3. Create IsAlone feature
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

# 4. Encode Sex
df['Sex_encoded'] = df['Sex'].map({'male':0, 'female':1})

# 5. Extract Title: the first run of letters followed by a period (e.g. "Mr", "Miss")
df['Title'] = df['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

# 6. Group rare titles
rare_titles = ['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 
               'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']

df['Title'] = df['Title'].replace({
    'Mlle': 'Miss',
    'Ms': 'Miss',
    'Mme': 'Mrs'
})

df['Title'] = df['Title'].replace(rare_titles, 'Rare')

# 7. Encode Title
title_mapping = {'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4}
df['Title_encoded'] = df['Title'].map(title_mapping)

# 8. Confirm
print(df[['Name','Title','Title_encoded']].head())
print("\nMissing values:\n", df.isnull().sum())
print("\nFinal shape:", df.shape)


                                                Name Title  Title_encoded
0                             Mr. Owen Harris Braund    Mr              0
1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   Mrs              2
2                              Miss. Laina Heikkinen  Miss              1
3        Mrs. Jacques Heath (Lily May Peel) Futrelle   Mrs              2
4                            Mr. William Henry Allen    Mr              0

Missing values:
 Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
FamilySize                 0
IsAlone                    0
Sex_encoded                0
Title                      0
Title_encoded              0
dtype: int64

Final shape: (887, 13)
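
One thing to watch in the title encoding step: `Series.map` returns `NaN` for any title that is missing from `title_mapping`. That did not happen here (the missing-value check shows 0 for `Title_encoded`), but it could on other data. Below is a minimal sketch of a guard, using three names from the dataset plus one hypothetical name. Sending unknown titles to `Rare` is an assumption made for illustration, not something the notebook does.

```python
import pandas as pd

names = pd.Series([
    "Mr. Owen Harris Braund",
    "Miss. Laina Heikkinen",
    "Rev. Juozas Montvila",
    "Mx. Example Passenger",  # hypothetical name with a title the mapping does not know
])

# Same extraction and grouping logic as the cell above, on a tiny sample
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
titles = titles.replace(['Rev'], 'Rare')

title_mapping = {'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4}
encoded = titles.map(title_mapping)

# Report titles that the mapping missed before deciding how to handle them
unmapped = titles[encoded.isna()].unique().tolist()
print("Unmapped titles:", unmapped)

# Assumption: treat unknown titles as 'Rare' rather than leaving NaN
encoded = encoded.fillna(title_mapping['Rare']).astype(int)
print(encoded.tolist())
```

Printing the unmapped titles first makes the fallback visible instead of silently changing the data.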


# Train / Test Split (Critical Step)

We'll split the data so that:

* **80%** → **Training data** (the model learns here)
* **20%** → **Testing data** (used **ONLY** to evaluate)

We also pass `stratify=y` so that the training and test sets keep the same proportion of survivors as the full dataset.

> **Note:** We will only use **numerical model-ready features**, not raw text columns.
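
To illustrate what `stratify=y` does, here is a minimal sketch on synthetic labels. The 39% positive rate and the `*_demo` names are assumptions for illustration, not the Titanic data. It compares the positive-class share in each split with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1000 rows, 390 positives (illustrative, not the Titanic data)
y_demo = np.array([1] * 390 + [0] * 610)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Stratified split: both parts keep the overall positive rate
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Plain random split: the two rates can drift apart by chance
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("overall positive rate:  ", y_demo.mean())
print("stratified train/test:  ", y_tr_s.mean(), y_te_s.mean())
print("unstratified train/test:", y_tr_r.mean(), y_te_r.mean())
```

With a small test set, keeping the class ratio fixed makes accuracy and F1 less sensitive to which rows happen to land in the test split.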

In [12]:
from sklearn.model_selection import train_test_split

# Select final features
features = [
    'Pclass', 'Sex_encoded', 'Age',
    'Siblings/Spouses Aboard', 'Parents/Children Aboard',
    'Fare', 'FamilySize', 'IsAlone', 'Title_encoded'
]

X = df[features]
y = df['Survived']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y
)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])
print("\nX_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)


Training samples: 709
Testing samples: 178

X_train shape: (709, 9)
X_test shape: (178, 9)


# Train 2 Models (Baseline + Powerful)

We'll compare two models using the same workflow:

* **✔ Model-1: Logistic Regression**
  *(Simple baseline: fast and interpretable)*

* **✔ Model-2: Random Forest**
  *(More flexible: often performs better on tabular data, though this is not guaranteed)*

### 📋 We'll:
1. **Train** the model on *training data only*
2. **Predict** on *test data only*
3. **Measure** Accuracy & F1-Score

| Step | Meaning |
| :--- | :--- |
| **`fit()`** | Model learns patterns from training data |
| **`predict()`** | Model predicts survival on unseen test data |
| **`accuracy_score`** | Fraction of predictions that are correct |
| **`f1_score`** | Harmonic mean of precision and recall (by default for the positive class, `Survived = 1`) |
| **`classification_report`** | Shows precision, recall, and F1 per class |

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

# -------- Logistic Regression --------
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)

log_pred = log_model.predict(X_test)

log_acc = accuracy_score(y_test, log_pred)
log_f1 = f1_score(y_test, log_pred)

print("Logistic Regression Results")
print("Accuracy:", round(log_acc, 4))
print("F1 Score:", round(log_f1, 4))
print(classification_report(y_test, log_pred))


# -------- Random Forest --------
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)

rf_acc = accuracy_score(y_test, rf_pred)
rf_f1 = f1_score(y_test, rf_pred)

print("\nRandom Forest Results")
print("Accuracy:", round(rf_acc, 4))
print("F1 Score:", round(rf_f1, 4))
print(classification_report(y_test, rf_pred))


Logistic Regression Results
Accuracy: 0.8034
F1 Score: 0.7407
              precision    recall  f1-score   support

           0       0.83      0.85      0.84       109
           1       0.76      0.72      0.74        69

    accuracy                           0.80       178
   macro avg       0.79      0.79      0.79       178
weighted avg       0.80      0.80      0.80       178


Random Forest Results
Accuracy: 0.7584
F1 Score: 0.695
              precision    recall  f1-score   support

           0       0.81      0.79      0.80       109
           1       0.68      0.71      0.70        69

    accuracy                           0.76       178
   macro avg       0.75      0.75      0.75       178
weighted avg       0.76      0.76      0.76       178
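
As a cross-check, the printed scores can be reproduced by hand from confusion-matrix counts. The counts below are not printed by the notebook. They are inferred from the per-class recall and support values in the two reports, so treat them as an assumption. The helper name `metrics_from_counts` is illustrative.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts inferred from the reports above
# (test set: 109 non-survivors, 69 survivors)
inferred_counts = {
    "Logistic Regression": dict(tn=93, fp=16, fn=19, tp=50),
    "Random Forest": dict(tn=86, fp=23, fn=20, tp=49),
}

computed = {}
for name, counts in inferred_counts.items():
    computed[name] = metrics_from_counts(**counts)
    acc, prec, rec, f1 = computed[name]
    print(f"{name}: accuracy={acc:.4f} precision={prec:.2f} "
          f"recall={rec:.2f} f1={f1:.4f}")
```

Under these counts, Logistic Regression makes 35 errors out of 178 and Random Forest makes 43, which is the whole gap between the two accuracy figures.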



# Feature Importance & Model Interpretation

The goal of this step is to understand **which features contributed the most** to predicting survival on the Titanic. This helps convert model results into **real-world insights**, rather than just numbers.

I analysed feature importance from:
* **✔ Logistic Regression** (coefficients)
* **✔ Random Forest Classifier** (feature importance scores)

In [27]:
import numpy as np

coefficients = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': log_model.coef_[0]
}).sort_values(by='Coefficient', ascending=False)

print(coefficients)


                   Feature  Coefficient
1              Sex_encoded     2.158903
8            Title_encoded     0.470261
4  Parents/Children Aboard     0.036287
5                     Fare     0.006089
2                      Age    -0.046345
6               FamilySize    -0.287426
3  Siblings/Spouses Aboard    -0.321162
7                  IsAlone    -0.362250
0                   Pclass    -0.928149


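## Logistic Regression – Feature Influence

In Logistic Regression, **positive coefficients** increase the probability of survival, while **negative coefficients** decrease it. Each coefficient is the change in log-odds per one-unit increase in that feature. The features were not standardized, so coefficient sizes are not directly comparable across features on different scales (for example, Fare in currency units versus the 0/1 Sex_encoded).

### 📊 From my model: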

| Feature | Effect | Impact on Survival |
| :--- | :---: | :--- |
| **Sex_encoded** (female = 1) | 🔺 | **Strong positive**: females much more likely to survive |
| **Title_encoded** | 🔺 | Higher title codes (Miss, Mrs, Master, Rare) associated with higher survival than Mr (code 0) |
| **Fare** | 🔺 | Higher fare → slightly higher survival (possibly because higher fares meant better-placed cabins) |
| **Parents/Children Aboard** (Parch) | 🔺 | Small positive effect |
| **Age** | 🔻 | Older passengers slightly less likely to survive |
| **FamilySize** | 🔻 | Larger families had lower survival odds |
| **Siblings/Spouses Aboard** (SibSp) | 🔻 | More siblings/spouses aboard → lower survival chance |
| **IsAlone** | 🔻 | Passengers travelling alone slightly less likely to survive |
| **Pclass** (1 = 1st, 3 = 3rd) | 🔻 | **Strong negative**: each step from 1st toward 3rd class lowers survival odds |
### ✔ Key Interpretation (Simple & Clear)
* **Gender is the strongest predictor**: females survived far more often than males.
* **Passenger class matters a lot**: 1st class had a clear survival advantage.
* **Higher ticket fare** is associated with a slightly higher survival probability.
* **Travelling with very large families** reduced survival chances.
* **Age** has a negative effect of about -0.05 in log-odds per year, which becomes moderate over large age differences.
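
One caveat when reading the family-related coefficients: `FamilySize` is built as `Siblings/Spouses Aboard + Parents/Children Aboard + 1`, so together with the intercept it is an exact linear combination of the other two columns. A small sketch with made-up passenger counts shows the resulting rank deficiency. The numbers are illustrative only.

```python
import numpy as np

# Made-up counts for five passengers (illustrative only)
sibsp = np.array([1, 0, 3, 0, 2])
parch = np.array([0, 0, 2, 1, 1])
family_size = sibsp + parch + 1  # same formula as the notebook

# Design matrix with an intercept column, as logistic regression uses
design = np.column_stack([np.ones_like(sibsp), sibsp, parch, family_size])

rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)
```

Because the rank is lower than the number of columns, the individual weights on these three features are not uniquely determined by the data. scikit-learn's default L2 penalty picks one solution, so their combined effect is easier to interpret than any single coefficient.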

In [16]:

rf_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print(rf_importances)


                   Feature  Importance
5                     Fare    0.249942
2                      Age    0.240470
1              Sex_encoded    0.178903
8            Title_encoded    0.142018
0                   Pclass    0.082390
6               FamilySize    0.049359
3  Siblings/Spouses Aboard    0.032306
4  Parents/Children Aboard    0.015426
7                  IsAlone    0.009185


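# Random Forest – Feature Importance Ranking

Random Forest ranks features by how much they reduce impurity across its trees, and it puts Fare and Age at the top, ahead of Sex_encoded, Title_encoded, and Pclass. Importance scores do not indicate the direction of an effect, so the directions in the Insight column should be read together with the logistic regression coefficients. Impurity-based importances also tend to favour continuous features with many distinct values, such as Fare and Age.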

| Rank | Feature | Insight |
| :--- | :--- | :--- |
| **1** | **Fare** | Higher fare = better survival |
| **2** | **Age** | Younger passengers had better survival rates |
| **3** | **Sex_encoded** | Females more likely to survive |
| **4** | **Title_encoded** | Social status strongly linked to survival |
| **5** | **Pclass** | 1st class > 2nd > 3rd survival |
| **6–9** | FamilySize, SibSp, Parch, IsAlone | Smaller influence |
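
Impurity-based scores from `feature_importances_` can favour continuous features with many distinct values, so a common cross-check is scikit-learn's `permutation_importance`, measured on held-out data. Below is a minimal sketch on synthetic data. The feature names and the data-generating rule are assumptions for illustration, not the Titanic features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 600

# Synthetic features: one binary signal and one continuous pure-noise column
signal = rng.integers(0, 2, size=n)  # binary, drives the label
noise = rng.normal(size=n)           # continuous, unrelated to the label
X_demo = np.column_stack([signal, noise])

# The label follows the signal about 90% of the time
flip = rng.random(n) < 0.1
y_demo = np.where(flip, 1 - signal, signal)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle each column of the held-out data and record the drop in accuracy
result = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=42
)

for name, imp, perm in zip(
    ["signal", "noise"], forest.feature_importances_, result.importances_mean
):
    print(f"{name}: impurity={imp:.3f} permutation={perm:.3f}")
```

With the notebook's objects, the same call would take `rf_model`, `X_test`, and `y_test` in place of the synthetic ones.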

### ✔ Combined Interpretation
The two models weight the features differently, but they point to the same broad picture:

* ⭐ **Gender, Fare, Title, and Passenger Class** are among the most useful predictors of survival.
* ⭐ **Women and higher-class passengers** survived at clearly higher rates.
* ⭐ **Wealth (Fare)** also correlates with survival.
* ⭐ **Travelling alone or with many dependents** is associated with lower survival in the logistic regression, while these features carry little weight in the Random Forest.

In [29]:
results = pd.DataFrame({
    'Model':['Logistic Regression','Random Forest'],
    'Accuracy':[log_acc, rf_acc],
    'F1 Score':[log_f1, rf_f1]
})

results


| | Model | Accuracy | F1 Score |
| :--- | :--- | :--- | :--- |
| 0 | Logistic Regression | 0.803371 | 0.740741 |
| 1 | Random Forest | 0.758427 | 0.695035 |


## Model Performance Comparison

| Model | Accuracy | F1-Score | Notes |
| :--- | :--- | :--- | :--- |
| **Logistic Regression** | **80.3%** | **0.74** | ✔ Best performer |
| **Random Forest** | 75.8% | 0.70 | Lower on both metrics |
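
A useful reference point for the accuracies in the table is a majority-class baseline. The quick calculation below uses the test-set class counts from the `support` column of the classification reports (109 non-survivors, 69 survivors).

```python
# Test-set class counts taken from the 'support' column of the reports
n_not_survived = 109
n_survived = 69
n_total = n_not_survived + n_survived

# Accuracy of always predicting the majority class ("did not survive")
baseline_accuracy = max(n_not_survived, n_survived) / n_total

# Accuracies reported by the notebook
reported = {"Logistic Regression": 0.803371, "Random Forest": 0.758427}

print(f"Majority-class baseline: {baseline_accuracy:.3f}")
for model, acc in reported.items():
    print(f"{model}: {acc:.3f} (+{acc - baseline_accuracy:.3f} over baseline)")
```

Both models clear the baseline of about 61% by a wide margin, which is a better yardstick than the raw accuracy figure.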

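### 🎯 Conclusion
The **Logistic Regression** model performed best in my experiment.

* The model predicts survival with **~80% accuracy** on the 178-passenger test set, compared with about 61% for always predicting "did not survive" (109 of 178).
* The result comes from a single train/test split and a Random Forest with default settings, so the gap between the two models may not hold on other splits.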
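
Since the comparison rests on one test split, k-fold cross-validation is one way to check how stable the ranking is. Below is a minimal sketch on synthetic data from `make_classification`, which is a stand-in and not the Titanic features. With the notebook's data, `X` and `y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: 9 features, like the notebook's feature list)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=4, random_state=42
)

# Five stratified folds, shuffled with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold spread is larger than the gap between the two means, the ranking from a single split should be treated as tentative.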