# dataset = https://www.kaggle.com/datasets/yasserh/titanic-dataset?select=Titanic-Dataset.csv

Task 3 - Classification Models

1. Use a dataset (like Titanic) to build a classification model that predicts a categorical target variable (e.g., survived/not survived).

2. Compare at least two classifiers (e.g., Logistic Regression vs Decision Tree).

3. Evaluate your model using a confusion matrix, accuracy, precision, recall, and F1-score.

In [12]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier


In [2]:
# Step 1: Load Titanic dataset
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print("Shape:", df.shape)
df.head()

Shape: (891, 12)


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Step 2: Define X and y
y = df['Survived']   # target variable
X = df.drop(['Survived'], axis=1)  # features

In [5]:
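# Step 2.5: Convert categorical columns to numeric
# Note: astype(str) turns missing values into the string "nan", so gaps in
# Cabin and Embarked become their own category instead of remaining null.
# Name and Ticket are mostly unique per passenger, so their integer codes
# act like identifiers rather than meaningful features.
cat_cols = X.select_dtypes(include='object').columns

for col in cat_cols:
    X[col] = LabelEncoder().fit_transform(X[col].astype(str))

X.head()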


,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,108,1,22.0,1,0,523,7.25,147,2
1,2,1,190,0,38.0,1,0,596,71.2833,81,0
2,3,3,353,0,26.0,0,0,669,7.925,147,2
3,4,1,272,0,35.0,1,0,49,53.1,55,2
4,5,3,15,1,35.0,0,0,472,8.05,147,2
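
Label-encoding every object column, as done above, also turns Name, Ticket, and Cabin into arbitrary integer codes that a model may treat as ordered quantities. One possible alternative is to drop the identifier-like columns and one-hot encode the low-cardinality ones. The sketch below runs on a tiny made-up frame; the rows and the `prepare_features` helper are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def prepare_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop identifier-like columns, one-hot encode the rest."""
    id_like = ["PassengerId", "Name", "Ticket", "Cabin"]
    out = frame.drop(columns=[c for c in id_like if c in frame.columns])
    # One indicator column per category; drop_first removes the redundant one.
    return pd.get_dummies(out, columns=["Sex", "Embarked"], drop_first=True)

# Made-up rows standing in for the Titanic data.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

features = prepare_features(toy)
print(features.columns.tolist())
```

Whether this helps on the real data is not tested here; it only shows the mechanics of the encoding.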


In [6]:
# Step 3: Check and handle nulls / duplicates
print("Null values per column:\n", X.isnull().sum())
print("Duplicates:", X.duplicated().sum())

# Fill missing numeric values with the column mean (only Age has nulls here)
X = X.fillna(X.mean())

# Drop duplicate rows, removing the same rows from y so X and y stay aligned
dup_mask = X.duplicated()
X = X[~dup_mask].reset_index(drop=True)
y = y[~dup_mask].reset_index(drop=True)

print("After cleaning:", X.shape)


Null values per column:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         0
dtype: int64
Duplicates: 0
After cleaning: (891, 11)
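
Filling Age with the mean of the full dataset lets the rows that later end up in the test set influence the fill value. A minimal sketch of one alternative, assuming the train/test split is done first, computes the mean on the training rows only and reuses it for the test rows. The two small frames below are made-up stand-ins for the Age column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Age column after splitting.
train = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 60.0]})

# Learn the fill value from the training rows only (NaN is skipped by default).
train_mean = train["Age"].mean()

# Apply that same value to both splits.
train_filled = train.fillna({"Age": train_mean})
test_filled = test.fillna({"Age": train_mean})

print(train_filled["Age"].tolist())
print(test_filled["Age"].tolist())
```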


In [13]:
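# Step 4: Train/test split and Logistic Regression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling: fit the scaler on the training split only, then apply it to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# NOTE: the model is fit and evaluated on the unscaled X_train / X_test, so the
# scaled arrays above are unused and the results below reflect unscaled features.
# Passing X_train_scaled / X_test_scaled instead should address the convergence warning.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred_log = log_reg.predict(X_test)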

# Evaluation
cm_log = confusion_matrix(y_test, y_pred_log)
acc_log = accuracy_score(y_test, y_pred_log)
prec_log = precision_score(y_test, y_pred_log)
rec_log = recall_score(y_test, y_pred_log)
f1_log = f1_score(y_test, y_pred_log)

print("Logistic Regression Results:")
print("Confusion Matrix:\n", cm_log)
print(f"Accuracy: {acc_log:.3f}")
print(f"Precision: {prec_log:.3f}")
print(f"Recall: {rec_log:.3f}")
print(f"F1 Score: {f1_log:.3f}")


Logistic Regression Results:
Confusion Matrix:
 [[89 16]
 [17 57]]
Accuracy: 0.816
Precision: 0.781
Recall: 0.770
F1 Score: 0.776


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
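
The warning above is consistent with the model having been fit on the unscaled `X_train`: `X_train_scaled` and `X_test_scaled` are computed but never passed to `fit` or `predict`. One way to make sure scaling is actually applied is to wrap both steps in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, so its score is not comparable to the results reported in this notebook; the `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training split only and applies the
# same transformation automatically inside predict() and score().
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

demo_accuracy = scaled_log_reg.score(X_te, y_te)
print(f"Accuracy on synthetic data: {demo_accuracy:.3f}")
```

With a pipeline there is no separate scaled array to forget, which removes the class of mistake seen in Step 4.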


In [14]:
# Step 5: Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)

# Evaluation
cm_dt = confusion_matrix(y_test, y_pred_dt)
acc_dt = accuracy_score(y_test, y_pred_dt)
prec_dt = precision_score(y_test, y_pred_dt)
rec_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)

print("Decision Tree Results:")
print("Confusion Matrix:\n", cm_dt)
print(f"Accuracy: {acc_dt:.3f}")
print(f"Precision: {prec_dt:.3f}")
print(f"Recall: {rec_dt:.3f}")
print(f"F1 Score: {f1_dt:.3f}")


Decision Tree Results:
Confusion Matrix:
 [[86 19]
 [18 56]]
Accuracy: 0.793
Precision: 0.747
Recall: 0.757
F1 Score: 0.752


In [16]:
# Step 6: Compare both models
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree'],
    'Accuracy': [acc_log, acc_dt],
    'Precision': [prec_log, prec_dt],
    'Recall': [rec_log, rec_dt],
    'F1 Score': [f1_log, f1_dt]
})

print("\nComparison of Models:")
print(results)

print("Confusion Matrix:\n", cm_log)
print("Confusion Matrix:\n", cm_dt)




Comparison of Models:
                 Model  Accuracy  Precision    Recall  F1 Score
0  Logistic Regression  0.815642   0.780822  0.770270  0.775510
1        Decision Tree  0.793296   0.746667  0.756757  0.751678
Confusion Matrix:
 [[89 16]
 [17 57]]
Confusion Matrix:
 [[86 19]
 [18 56]]
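
One rough way to judge how much weight the gap between the two models can bear is to express it in test rows and compare it with the sampling error of a single accuracy estimate. The sketch below uses only the counts from the confusion matrices above and a normal-approximation standard error, which assumes the test rows are independent draws; the variable names are illustrative.

```python
import math

# Counts copied from the two confusion matrices above.
n_test = 89 + 16 + 17 + 57          # 179 test rows
correct_lr = 89 + 57                # Logistic Regression: TN + TP
correct_tree = 86 + 56              # Decision Tree: TN + TP

acc_lr = correct_lr / n_test
acc_tree = correct_tree / n_test

# Binomial standard error of one accuracy estimate (normal approximation).
se_lr = math.sqrt(acc_lr * (1 - acc_lr) / n_test)

print(f"Difference: {correct_lr - correct_tree} passengers ({acc_lr - acc_tree:.3f})")
print(f"Std. error of Logistic Regression accuracy: {se_lr:.3f}")
```

Here the difference (4 passengers, about 0.022) is smaller than one standard error (about 0.029), so this single split does not separate the two models clearly.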


# Observations

- Both Logistic Regression and Decision Tree performed reasonably well, but Logistic Regression achieved slightly higher test accuracy (81.6%) than the Decision Tree (79.3%).

- Logistic Regression also has higher precision (0.78 vs 0.75) and F1-score (0.78 vs 0.75), meaning it strikes a slightly better balance between false positives and false negatives.

- Recall is similar for both models (0.77 vs 0.76), indicating both detect survivors at a similar rate.

- The confusion matrices show that Logistic Regression produces more true negatives (89 vs 86), and therefore fewer false positives (16 vs 19), as well as one more true positive (57 vs 56).
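
Each reported metric can be recomputed by hand from a confusion matrix, which is a useful check on the figures quoted in these observations. The sketch below hard-codes the Logistic Regression matrix from the output and assumes scikit-learn's layout (rows are true labels, columns are predicted labels, class 0 first); the variable names are illustrative.

```python
# Logistic Regression confusion matrix copied from the output above.
# Assumed layout: [[TN, FP], [FN, TP]]
tn, fp = 89, 16
fn, tp = 17, 57

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted survivors who survived
recall = tp / (tp + fn)                      # share of actual survivors who were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")   # 0.816
print(f"Precision: {precision:.3f}")  # 0.781
print(f"Recall:    {recall:.3f}")     # 0.770
print(f"F1 Score:  {f1:.3f}")         # 0.776
```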

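Overall, Logistic Regression performs slightly better on this single 80/20 split, although the gap is small (146 vs 142 correct predictions out of 179). Training-set scores are not reported, so these results alone do not show whether the Decision Tree overfits; an unpruned tree often does, and comparing its train and test accuracy would be a useful check.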
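
Whether a tree overfits can be checked by comparing its accuracy on the training split with its accuracy on the test split. The sketch below shows the comparison on synthetic data from `make_classification`, used here as a stand-in because the notebook does not report training scores; its numbers say nothing about the Titanic results, and the `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption: not the Titanic features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# An unpruned tree (no max_depth) is free to memorise the training rows.
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree_demo.score(X_tr, y_tr)
test_acc = tree_demo.score(X_te, y_te)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
print(f"Gap:            {train_acc - test_acc:.3f}")
```

A large gap between the two scores points to overfitting; limiting `max_depth` or `min_samples_leaf` is a common way to narrow it.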