# Task 4 – Problem 2
## Machine Learning Model Development
Dataset: Titanic Dataset
Target Variable: Survived

**Name: Pranjal Godse, Batch: 6**

## Objective
Build and compare multiple machine learning models:
- Logistic Regression
- Decision Tree
- Random Forest

Evaluate each model using accuracy and a confusion matrix, inspect the Random Forest feature importances,
and tune hyperparameters using GridSearchCV.

In [1]:

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, confusion_matrix


## Load Dataset

In [3]:
!wget https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv

--2026-02-21 11:14:18--  https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘titanic.csv’


2026-02-21 11:14:18 (4.54 MB/s) - ‘titanic.csv’ saved [60302/60302]



In [4]:
import pandas as pd
df = pd.read_csv("titanic.csv")
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Data Preprocessing

In [5]:

# Fill missing values by assignment; chained fillna(..., inplace=True) is deprecated in pandas
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df.drop("Cabin", axis=1, inplace=True)

# Family-size features: IsAlone is 1 when the passenger travelled with no relatives
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Encode categorical columns as numbers
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df = pd.get_dummies(df, columns=["Embarked"], drop_first=True)

# Drop identifier-like columns that are not used as features
df.drop(["PassengerId", "Name", "Ticket"], axis=1, inplace=True)


## Train-Test Split

In [6]:

X = df.drop("Survived", axis=1)
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


## Logistic Regression

In [7]:

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

lr_pred = lr.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, lr_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, lr_pred))


Logistic Regression Accuracy: 0.8044692737430168
Confusion Matrix:
 [[90 15]
 [20 54]]
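
Accuracy alone does not show how the errors are distributed between the two classes. As a rough sketch, the confusion matrix printed above can be unpacked into precision and recall for the Survived class. This assumes scikit-learn's layout (rows are true labels, columns are predicted labels, class 0 first); the variable names are illustrative and not part of the original notebook.

```python
# Confusion matrix reported for Logistic Regression above.
# Assumed layout: rows = true class, columns = predicted class, class 0 first.
cm = [[90, 15],
      [20, 54]]

tn, fp = cm[0]  # true class 0: correctly rejected, falsely flagged as survived
fn, tp = cm[1]  # true class 1: missed survivors, correctly found survivors

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)  # of predicted survivors, the fraction who survived
recall = tp / (tp + fn)     # of actual survivors, the fraction the model found

print(f"n={total} accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The recomputed accuracy matches the value printed by `accuracy_score`, which is a quick check that the matrix is being read in the right orientation.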


## Decision Tree

In [8]:

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

dt_pred = dt.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, dt_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, dt_pred))


Decision Tree Accuracy: 0.7988826815642458
Confusion Matrix:
 [[85 20]
 [16 58]]


## Random Forest

In [9]:

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

rf_pred = rf.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, rf_pred))


Random Forest Accuracy: 0.8268156424581006
Confusion Matrix:
 [[91 14]
 [17 57]]


## Feature Importance (Random Forest)

In [10]:

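# Impurity-based importances from the fitted forest, labelled by feature name
# (pandas is already imported as pd in the first cell)
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values(ascending=False)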


| Feature | Importance |
|---|---|
| Sex | 0.269306 |
| Fare | 0.264815 |
| Age | 0.246012 |
| Pclass | 0.076197 |
| FamilySize | 0.051687 |
| SibSp | 0.029636 |
| Embarked_S | 0.023586 |
| Parch | 0.019965 |
| Embarked_Q | 0.009432 |
| IsAlone | 0.009364 |
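
scikit-learn normalizes a forest's impurity-based importances so that they sum to 1, which means each value can be read as a share of the total. The sketch below copies the values reported above into a plain dictionary (the dictionary and variable names are illustrative) and checks the sum and the combined share of the three leading features.

```python
# Importances copied from the output above (rounded to six decimals).
importances = {
    "Sex": 0.269306,
    "Fare": 0.264815,
    "Age": 0.246012,
    "Pclass": 0.076197,
    "FamilySize": 0.051687,
    "SibSp": 0.029636,
    "Embarked_S": 0.023586,
    "Parch": 0.019965,
    "Embarked_Q": 0.009432,
    "IsAlone": 0.009364,
}

total = sum(importances.values())

# Rank features by importance and add up the three largest shares.
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
top3_names = [name for name, _ in ranked[:3]]
top3_share = sum(value for _, value in ranked[:3])

print(f"sum of importances: {total:.4f}")
print(f"top 3 ({', '.join(top3_names)}): {top3_share:.4f}")
```

Sex, Fare, and Age together account for roughly 78% of the total importance, with the engineered `FamilySize` and `IsAlone` features contributing comparatively little.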


## Hyperparameter Tuning – Decision Tree

In [11]:

param_grid_dt = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10]
}

grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=42),
                       param_grid_dt,
                       cv=5)

grid_dt.fit(X_train, y_train)

print("Best Parameters:", grid_dt.best_params_)

dt_best = grid_dt.best_estimator_
dt_tuned_pred = dt_best.predict(X_test)

print("Tuned Decision Tree Accuracy:",
      accuracy_score(y_test, dt_tuned_pred))


Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Tuned Decision Tree Accuracy: 0.7988826815642458
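
The tuned tree scores the same on the test set as the untuned one, so it can help to look at the quantity GridSearchCV actually optimizes: the mean cross-validated score on the training folds, exposed as `best_score_`. The following is a minimal sketch on synthetic data; `make_classification` stands in for the Titanic features, and the `_demo` names are assumptions made so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same grid as in the cell above: 4 x 3 = 12 parameter combinations.
grid_demo = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5, 10, None], "min_samples_split": [2, 5, 10]},
    cv=5,
)
grid_demo.fit(X_tr, y_tr)

cv_score = grid_demo.best_score_         # mean accuracy over the 5 validation folds
test_score = grid_demo.score(X_te, y_te)  # accuracy of the refit best model on held-out data

print("Best parameters:", grid_demo.best_params_)
print(f"CV accuracy: {cv_score:.4f}  test accuracy: {test_score:.4f}")
```

Comparing the two numbers shows whether the parameters chosen by cross-validation carry over to the held-out set; a gap between them on a small test set is not unusual.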


## Hyperparameter Tuning – Random Forest

In [12]:

param_grid_rf = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5]
}

grid_rf = GridSearchCV(RandomForestClassifier(random_state=42),
                       param_grid_rf,
                       cv=5)

grid_rf.fit(X_train, y_train)

print("Best Parameters:", grid_rf.best_params_)

rf_best = grid_rf.best_estimator_
rf_tuned_pred = rf_best.predict(X_test)

print("Tuned Random Forest Accuracy:",
      accuracy_score(y_test, rf_tuned_pred))


Best Parameters: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 100}
Tuned Random Forest Accuracy: 0.8156424581005587


## Conclusion
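- Logistic Regression provides a strong baseline at about 80.4% test accuracy.
- The default Decision Tree reaches about 79.9%; tuning selects a shallower tree (`max_depth=3`) but leaves test accuracy unchanged.
- Random Forest performs best, at about 82.7% with default settings; the tuned forest scores slightly lower on the test set (about 81.6%).
- On this single 80/20 split, hyperparameter tuning did not improve test accuracy. The gaps between models amount to a few passengers out of 179, so they should be read with caution.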
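
To make the comparison concrete, the test accuracies printed in the cells above can be collected and ranked. The numbers are copied from the notebook output; the dictionary and variable names are illustrative.

```python
# Test accuracies as printed by the cells above.
reported = {
    "Logistic Regression": 0.8044692737430168,
    "Decision Tree": 0.7988826815642458,
    "Random Forest": 0.8268156424581006,
    "Decision Tree (tuned)": 0.7988826815642458,
    "Random Forest (tuned)": 0.8156424581005587,
}

# Sort from highest to lowest accuracy.
ranked = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<24} {acc:.4f}")

best_name = ranked[0][0]

# Change in test accuracy attributable to tuning, per model family.
dt_gain = reported["Decision Tree (tuned)"] - reported["Decision Tree"]
rf_gain = reported["Random Forest (tuned)"] - reported["Random Forest"]
print(f"Decision Tree gain: {dt_gain:+.4f}  Random Forest gain: {rf_gain:+.4f}")
```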