#  Titanic Classification: Logistic Regression vs Random Forest

**Student Name:** Mhle L.  
**Student Number:** 22322987  
**Module:** Technical Programming 2  
**Assessment:** Classification Models — Logistic Regression vs Random Forest  
**Due Date:** 8 August 2025  
**Collaborator Email:** xpiyose@gmail.com  

---

This notebook explores binary classification using two models:
- **Logistic Regression**
- **Random Forest**

We apply these models to the Titanic dataset and compare their performance using:
- Accuracy
- Confusion Matrix
- F1 Score
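
For reference, these metrics can be written in terms of the four confusion-matrix counts. The symbols below are notation introduced here (not taken from the notebook): $TP$, $FP$, $TN$ and $FN$ are the true/false positives and negatives, with "survived" treated as the positive class.

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}

F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
```

The F1 score is the harmonic mean of precision and recall, so it is only high when both are high.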

**Dataset Source:**  
[Titanic – Machine Learning from Disaster (Kaggle)](https://www.kaggle.com/datasets/dwiuzila/titanic-machine-learning-from-disaster)


First, install and import the required libraries.

In [1]:
# Install Kaggle API (if needed) and import necessary libraries
!pip install kaggle --quiet

import pandas as pd
import numpy as np
import zipfile
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


##  Upload Titanic Dataset (Manual Upload)

Upload the `archive (8).zip` file, which contains `train.csv` and `test.csv`.  
The cell below unzips it into the `dataset` folder and lists the extracted files.



In [2]:
from google.colab import files
import zipfile
import os

# Upload file manually
uploaded = files.upload()

# Get the uploaded ZIP filename
zip_filename = next(iter(uploaded))  # Gets the first uploaded file name

# Extract the ZIP file
with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
    zip_ref.extractall("dataset")

# List extracted files
# (loop variables are named so they do not shadow the imported `files` module)
print("📁 Extracted files:")
for root, _dirs, filenames in os.walk("dataset"):
    for filename in filenames:
        print(os.path.join(root, filename))


Saving archive (8).zip to archive (8).zip
📁 Extracted files:
dataset/test.csv
dataset/train.csv


In [3]:
# Load train and test sets
train_df = pd.read_csv("dataset/train.csv")
test_df = pd.read_csv("dataset/test.csv")

# Display basic information
print("Training Set:")
print(train_df.info())
display(train_df.head())

print("\nTest Set:")
print(test_df.info())
display(test_df.head())


Training Set:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |



Test Set:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
 11  Survived     418 non-null    int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
None


| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S | 1 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 1 |
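
The non-null counts in the training set's `info()` output above already show how much data is missing per column. The short sketch below just does that subtraction; the variable names (`n_rows`, `non_null`, `missing`, `missing_pct`) are introduced here for illustration, and the counts are copied from the printed output rather than recomputed from the files.

```python
# Non-null counts copied from the train_df.info() output above.
n_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889, "Fare": 891}

# Missing count and percentage per column.
missing = {col: n_rows - count for col, count in non_null.items()}
missing_pct = {col: round(100 * n / n_rows, 1) for col, n in missing.items()}

for col in non_null:
    print(f"{col:<9} missing={missing[col]:>3}  ({missing_pct[col]}%)")
```

With the data loaded, `train_df.isna().sum()` gives the same counts directly. `Cabin` is missing for roughly three quarters of the training rows, while `Age` and `Embarked` have far fewer gaps.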


## Data Preprocessing

We'll clean and prepare the dataset for modeling by:
- Handling missing values
- Encoding categorical variables
- Dropping irrelevant columns
- Scaling numerical features


In [6]:
# Combine train and test for consistent preprocessing
# (ignore_index avoids duplicate row labels after concatenation)
train_df['is_train'] = 1
test_df['is_train'] = 0
combined_df = pd.concat([train_df, test_df], sort=False, ignore_index=True)

# Fill missing values (median/mode are computed over train and test together)
combined_df['Age'] = combined_df['Age'].fillna(combined_df['Age'].median())
combined_df['Embarked'] = combined_df['Embarked'].fillna(combined_df['Embarked'].mode()[0])
combined_df['Fare'] = combined_df['Fare'].fillna(combined_df['Fare'].median())

# Encode categorical variables as integer codes (a fresh encoder per column)
for col in ['Sex', 'Embarked']:
    if col in combined_df.columns:
        combined_df[col] = LabelEncoder().fit_transform(combined_df[col])

# Drop irrelevant/text-based columns only if they exist
columns_to_drop = ['Name', 'Ticket', 'Cabin']
existing_columns = [col for col in columns_to_drop if col in combined_df.columns]
combined_df.drop(columns=existing_columns, inplace=True)

# Separate back to train and test
train_df = combined_df[combined_df['is_train'] == 1].drop(columns=['is_train'])
test_df = combined_df[combined_df['is_train'] == 0].drop(columns=['is_train', 'Survived'], errors='ignore')

# Prepare features and target
X = train_df.drop(columns=['Survived', 'PassengerId'], errors='ignore')
y = train_df['Survived']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split training data for model validation
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
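
The cell above imputes and scales using statistics from all rows before the split, so the validation rows influence the preprocessing. One possible alternative is to wrap the preprocessing and the model in a scikit-learn `Pipeline`, so that medians, scaling parameters and encodings are learned from the training split only. The sketch below is not the notebook's method: it runs on a small made-up frame (`demo_df`, with hypothetical Titanic-like columns and random values), and it swaps `LabelEncoder` for `OneHotEncoder` so that `Embarked` is not treated as an ordered number.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with Titanic-like columns (NOT the real dataset).
rng = np.random.default_rng(42)
n = 200
demo_df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["male", "female"], n),
    "Age": rng.normal(30, 12, n).clip(1, 80),
    "Fare": rng.gamma(2.0, 15.0, n),
    "Embarked": rng.choice(["S", "C", "Q"], n),
})
demo_df.loc[rng.choice(n, 30, replace=False), "Age"] = np.nan  # simulate missing ages

# Made-up target: mostly determined by Sex, with 20% of labels flipped.
is_female = (demo_df["Sex"] == "female").to_numpy()
y_demo = (is_female ^ (rng.random(n) < 0.2)).astype(int)

numeric_cols = ["Pclass", "Age", "Fare"]
categorical_cols = ["Sex", "Embarked"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Split first; fit() then learns every preprocessing statistic from X_tr only.
X_tr, X_va, y_tr, y_va = train_test_split(demo_df, y_demo, test_size=0.3, random_state=42)
model.fit(X_tr, y_tr)
val_accuracy = model.score(X_va, y_va)
print(f"Validation accuracy on toy data: {val_accuracy:.3f}")
```

Because the whole pipeline is a single estimator, the same object can also be passed to cross-validation utilities without leaking information between folds.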



## Training the Models

In [7]:
# Logistic Regression
lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_val)

# Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_val)


##  Model Evaluation


In [8]:
def evaluate(name, y_true, y_pred):
    print(f"--- {name} ---")
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("Classification Report:\n", classification_report(y_true, y_pred))

evaluate("Logistic Regression", y_val, y_pred_lr)
evaluate("Random Forest", y_val, y_pred_rf)


--- Logistic Regression ---
Accuracy: 0.8134328358208955
Confusion Matrix:
 [[137  20]
 [ 30  81]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.82      0.87      0.85       157
         1.0       0.80      0.73      0.76       111

    accuracy                           0.81       268
   macro avg       0.81      0.80      0.80       268
weighted avg       0.81      0.81      0.81       268

--- Random Forest ---
Accuracy: 0.7835820895522388
Confusion Matrix:
 [[132  25]
 [ 33  78]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.80      0.84      0.82       157
         1.0       0.76      0.70      0.73       111

    accuracy                           0.78       268
   macro avg       0.78      0.77      0.77       268
weighted avg       0.78      0.78      0.78       268
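
The reported scores can be reproduced by hand from the two confusion matrices above. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`); the helper name `summarize` is introduced here for illustration.

```python
def summarize(cm):
    """Return accuracy, precision, recall and F1 for the positive class (label 1).

    cm follows scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Confusion matrices copied from the output above.
lr_cm = [[137, 20], [30, 81]]
rf_cm = [[132, 25], [33, 78]]

for name, cm in [("Logistic Regression", lr_cm), ("Random Forest", rf_cm)]:
    acc, prec, rec, f1 = summarize(cm)
    print(f"{name}: accuracy={acc:.3f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# Logistic Regression: accuracy=0.813 precision=0.80 recall=0.73 f1=0.76
# Random Forest: accuracy=0.784 precision=0.76 recall=0.70 f1=0.73
```

These values match the `1.0` rows of the two classification reports: Logistic Regression classifies 218 of 268 validation passengers correctly, Random Forest 210.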



##  Discussion & Insights

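- **Logistic Regression** achieved a higher validation accuracy (**81.3%**, 218 of 268 correct) than Random Forest (**78.4%**, 210 of 268 correct) on this split.
- The gap is only 8 passengers on a single 70/30 split, so it is suggestive rather than conclusive. It is consistent with a largely linear relationship between the selected features and survival, which would make Logistic Regression a suitable model here.
- **Precision, recall and F1** for class `1.0` (survived) were also higher for Logistic Regression (0.80 / 0.73 / 0.76) than for Random Forest (0.76 / 0.70 / 0.73).
- **Random Forest** was run with default hyperparameters. Tuning (for example `n_estimators` or `max_depth`) may narrow the gap, and the model tends to have an advantage when the data contain strong non-linear effects or feature interactions.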
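
Because the comparison above rests on a single train/validation split, part of the accuracy gap may be due to chance. One common check is k-fold cross-validation, which reports one score per held-out fold. The sketch below uses synthetic stand-in data from `make_classification` (an assumption for illustration, not the Titanic set); the names `X_demo`, `y_demo`, `models` and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (illustration only, not the Titanic dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy: one score per held-out fold.
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```

If the difference between the two mean scores is small relative to the fold-to-fold standard deviation, the ranking of the models should be read with caution.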

### When to Choose Each Model:

-  **Logistic Regression**
  - When interpretability is important
  - When the dataset is not very large or complex
  - When speed and simplicity are priorities

-  **Random Forest**
  - When data contains non-linear relationships and complex feature interactions
  - When maximizing accuracy is more important than understanding how features impact predictions
  - When a single decision tree overfits and averaging many trees is needed to reduce variance
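
The interpretability trade-off shows up in what each fitted model exposes. Logistic Regression has one signed coefficient per feature (`coef_`), whereas Random Forest reports non-negative `feature_importances_` that sum to 1 and carry no direction. A minimal sketch on synthetic data follows; the data and the placeholder feature names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder feature names (illustration only).
feature_names = [f"feature_{i}" for i in range(5)]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)  # puts coefficients on a comparable scale

lr_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
rf_demo = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

# Signed coefficients show direction; importances show only relative contribution.
for name, coef, imp in zip(feature_names, lr_demo.coef_[0], rf_demo.feature_importances_):
    print(f"{name}: coefficient={coef:+.2f}  importance={imp:.2f}")
```

In this notebook the same attributes are available on the fitted `lr` and `rf` objects, and `X.columns` would supply the feature names.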


##  Conclusion

I implemented and compared two classification models, **Logistic Regression** and **Random Forest**, on the Titanic dataset.

🔍 **Summary of Findings**:
- Logistic Regression: Accuracy = **81.3%**
- Random Forest: Accuracy = **78.4%**
- Logistic Regression slightly outperformed Random Forest in this task.

Both models are valid tools depending on the nature of the data and the goal of the analysis.



