## Supervised ML Model ( Regression & Classification ) TITANIC DATASET

In [2]:
# Importing Required Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score, f1_score

# Loading the Dataset ( train.xlxs is the titanic dataset)

data = pd.read_excel('train.xlsx')

# Data Preprocessing
# Handle missing values
age_imputer = SimpleImputer(strategy='mean')
data['Age'] = age_imputer.fit_transform(data[['Age']])

embarked_imputer = SimpleImputer(strategy='most_frequent')
data['Embarked'] = embarked_imputer.fit_transform(data[['Embarked']]).ravel()

# Winsorization (treating outliers) on Age and Fare columns
def winsorize_series(series, lower_quantile=0.05, upper_quantile=0.95):
    lower = series.quantile(lower_quantile)
    upper = series.quantile(upper_quantile)
    return series.clip(lower, upper)

data['Age'] = winsorize_series(data['Age'])
data['Fare'] = winsorize_series(data['Fare'])

# Droping the Cabin column
data = data.drop('Cabin', axis=1)

# Encode categorical variables
le_sex = LabelEncoder()
data['Sex'] = le_sex.fit_transform(data['Sex'])
data = pd.get_dummies(data, columns=['Embarked'], drop_first=True)

# Droping un-needed columns
data = data.drop(['PassengerId', 'Name', 'Ticket'], axis=1)

data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S
0,0,3,1,22.0,1,0,7.25,False,True
1,1,1,0,38.0,1,0,71.2833,False,False
2,1,3,0,26.0,0,0,7.925,False,True
3,1,1,0,35.0,1,0,53.1,False,True
4,0,3,1,35.0,0,0,8.05,False,True


In [3]:
# Spliting dataset into Features and Targets

# Regression Model : Predict Fare
# Classification Model: Predict Survived

y_reg = data['Fare']
y_clf = data['Survived']
X_reg = data.drop(['Fare', 'Survived'], axis=1)
X_clf = data.drop(['Survived', 'Fare'], axis=1)

# Splitting into train/test

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=100)

X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X_clf, y_clf, test_size=0.3, random_state=100)


###  Regresion model

In [5]:
# Regression Models: Linear, Ridge, Lasso

lr = LinearRegression().fit(X_train_reg, y_train_reg)
ridge = Ridge(alpha=1.0).fit(X_train_reg, y_train_reg)
lasso = Lasso(alpha=0.1).fit(X_train_reg, y_train_reg)

# Performance
mse_lr = mean_squared_error(y_test_reg, lr.predict(X_test_reg))
mse_ridge = mean_squared_error(y_test_reg, ridge.predict(X_test_reg))
mse_lasso = mean_squared_error(y_test_reg, lasso.predict(X_test_reg))

print('MSE Linear:', mse_lr)
print('MSE Ridge:', mse_ridge)
print('MSE Lasso:', mse_lasso)


MSE Linear: 314.9972156917778
MSE Ridge: 314.853486247438
MSE Lasso: 314.5550688367272


### Classification model

# Classification Models: Logistic Regression, SVM, KNN, Naive Bayes

logreg = LogisticRegression(max_iter=1000).fit(X_train_clf, y_train_clf)
svm = SVC().fit(X_train_clf, y_train_clf)
knn = KNeighborsClassifier().fit(X_train_clf, y_train_clf)
nb = GaussianNB().fit(X_train_clf, y_train_clf)

models = ['Logistic Regression', 'SVM', 'KNN', 'Naive Bayes']
preds = [
    logreg.predict(X_test_clf),
    svm.predict(X_test_clf),
    knn.predict(X_test_clf),
    nb.predict(X_test_clf)
]
for model, pred in zip(models, preds):
    print(model, 'Accuracy:', accuracy_score(y_test_clf, pred),'\n'
          'Precision:', precision_score(y_test_clf, pred),'\n'
          'Recall:', recall_score(y_test_clf, pred),'\n'
          'F1:', f1_score(y_test_clf, pred),'\n\n')


## Best Performing Models: 

Regression Models : 
Lasso Regression has the lowest Mean Squared Error (MSE = 314.56), making it the best-performing regression model among Linear, Ridge, and Lasso

Classification Models : 
Logistic Regression has the highest balance of accuracy = 79% , F1 score = 0.72 , and strong precision/recall.

Naive Bayes is a close second, with slightly lower accuracy and F1.

## Conclusion :

This analysis evaluated multiple supervised machine learning models on the Titanic dataset for both regression and classification tasks:

**Key Findings:**

1. **Regression Task (Predicting Fare):** Lasso Regression emerged as the best model with the lowest MSE of 314.56, outperforming Linear and Ridge regression models.

2. **Classification Task (Predicting Survival):** Logistic Regression achieved the highest performance with 79% accuracy, F1 score of 0.72, and strong precision (78.5%) and recall (67%). Naive Bayes was a close second with slightly lower metrics.

3. **Model Comparison:** Among classification models, SVM showed perfect precision but lower recall, while KNN demonstrated balanced performance across metrics.
