## Business Problem

Predict whether a passenger survived the Titanic disaster
based on demographic and travel information.

Target: `Survived` (0 = No, 1 = Yes)
Task: binary classification


In [1]:
# Import Libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Data Loading
df = pd.read_csv("train (6).csv")

## Data Cleaning

In [3]:
# Count missing values per column
df.isnull().sum()


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
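
The raw counts above are easier to act on as shares of all rows, since that is what separates a column worth imputing from one worth dropping. The sketch below is only illustrative: it reuses the three non-zero counts from the output and assumes the standard Kaggle training file with 891 rows, a figure the notebook itself never prints.

```python
import pandas as pd

# Non-zero missing-value counts copied from the output above.
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Assumption: the standard Kaggle Titanic training file has 891 rows.
n_rows = 891

# Share of rows with a missing value, in percent.
missing_share = (missing / n_rows * 100).round(1)
print(missing_share.sort_values(ascending=False))
```

On the real DataFrame, `df.isnull().mean() * 100` gives the same shares without hard-coding the row count.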

In [4]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which operates on a copy and stops working in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].median())


In [5]:
# Fill the two missing ports with the most frequent value, assigning the result back.
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])


In [6]:
df.drop('Cabin', axis=1, inplace=True)


## Feature Engineering

In [7]:
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1


In [8]:
# Raw string so that \. is passed to the regex engine as an escaped literal dot
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)


In [9]:
df['Title'] = df['Title'].replace(
    ['Lady', 'Countess','Capt','Col','Don','Dr',
     'Major','Rev','Sir','Jonkheer','Dona'],
    'Rare'
)
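
To see what the title extraction and grouping do, here is a small sketch on four sample names written in the dataset's `Surname, Title. Given names` format. The `sample` frame and the choice of names are illustrative assumptions. The regular expression and the rare-title list are the ones used in the cells above.

```python
import pandas as pd

# Illustrative names in the "Surname, Title. Given names" format.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Uruchurtu, Don. Manuel E",
    ]
})

# Capture the first run of letters that follows a space and ends with a period.
sample["Title"] = sample["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Collapse infrequent titles into a single 'Rare' category.
rare_titles = ["Lady", "Countess", "Capt", "Col", "Don", "Dr",
               "Major", "Rev", "Sir", "Jonkheer", "Dona"]
sample["Title"] = sample["Title"].replace(rare_titles, "Rare")

print(sample["Title"].tolist())  # ['Mr', 'Mrs', 'Miss', 'Rare']
```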


In [10]:
df.drop(['Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)


In [11]:
df = pd.get_dummies(df, drop_first=True)
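
`pd.get_dummies` replaces every text column with indicator columns, and `drop_first=True` drops the first category of each, so the original `Sex` and `Embarked` columns no longer exist afterwards. A minimal sketch on a toy frame (the `toy` data is made up for illustration):

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Categories are sorted, so 'female' and 'C' become the dropped baselines.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
# ['Pclass', 'Sex_male', 'Embarked_Q', 'Embarked_S']
```

Any later code that needs passenger sex has to use `Sex_male` rather than `Sex`.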

In [12]:
from sklearn.model_selection import train_test_split

X = df.drop('Survived', axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


In [13]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)


In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rf = RandomForestClassifier(random_state=42)
dt = DecisionTreeClassifier(random_state=42)

rf.fit(X_train, y_train)
dt.fit(X_train, y_train)


In [15]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, rf.predict(X_test)))


              precision    recall  f1-score   support

           0       0.84      0.88      0.86       110
           1       0.80      0.74      0.77        69

    accuracy                           0.83       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.83      0.83      0.83       179
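
Only the random forest is scored above, although a logistic regression and a decision tree were fitted as well. One way to compare all three on the same held-out split is sketched below. It runs on synthetic data from `make_classification`, so the printed numbers say nothing about the Titanic models, and the `models` and `results` names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature matrix and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = accuracy_score(y_te, pred)
    print(name, round(results[name], 3))
    # Rows are true classes, columns are predicted classes.
    print(confusion_matrix(y_te, pred))
```

In the notebook itself the same loop could run over the already fitted `lr`, `dt` and `rf` with `X_test` and `y_test`.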



In [16]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(rf, X, y, cv=5)
print(cv_scores.mean())


0.8069612704789405
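
One caveat with this score: the median used to fill `Age` was computed on the full dataset before cross-validation, so each validation fold contributed to its own imputation value. A common way around this is to put the imputer inside a pipeline so it is refit on the training part of every fold. The sketch below shows the pattern on synthetic data, so the `X_demo` and `y_demo` values are made up and the resulting scores are not comparable to the one above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with missing ages (assumption, not the Titanic file).
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
    "Pclass": rng.integers(1, 4, n),
})
X_demo.loc[rng.choice(n, 40, replace=False), "Age"] = np.nan
y_demo = (X_demo["Fare"] > 25).astype(int)

# The imputer is fitted only on the training rows of each fold.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv)
print(scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean also shows how much the score varies between folds.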


In [17]:
# Impurity-based feature importances from the fitted random forest
importances = rf.feature_importances_
features = X.columns

importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

importance_df.head()


     Feature  Importance
4       Fare    0.243486
1        Age    0.215152
6   Sex_male    0.124865
12  Title_Mr    0.121958
0     Pclass    0.075992


In [18]:
# Survival rate by passenger class and sex.
# 'Sex' was one-hot encoded by pd.get_dummies, so the column is now 'Sex_male'.
plt.figure(figsize=(10, 5))
sns.barplot(x='Pclass', y='Survived', hue='Sex_male', data=df)
plt.title("Survival Rate by Class and Sex")
plt.show()

# Correlation heatmap to check for multicollinearity
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature Correlation Heatmap")
plt.show()