The deadline for this homework is **07.03.2025 18:29** (right before the practice session). After completing the exercises, you should:

1. Download this file to your computer (`File` $\to$ `Download .ipynb`)

2. Name the file using the pattern *HWx_NameSurname* (for example `HW2_NshanPotikyan.ipynb`)

3. Submit the file via the e-learning environment.

**Note:** If you fail to meet any of the above conditions, your homework will not be graded.

**Problem.** During the practice session we tried to build a binary classifier on the Titanic dataset that would predict whether a passenger survives or not.

* In this homework, you need to take the same dataset, but this time try three different algorithm families on the given problem:
  * KNN
  * Naive Bayes
  * Decision Trees

* Split the training dataset into train/test parts, so that you can evaluate the performance of the best approach at the end (use `random_state=42`, train=80%, test=20%). **You should not use the test set when looking for the best algorithm/hyper-parameters.**

* Try leaving out unimportant features from the data (use feature importances returned from the decision tree).

* Make use of sklearn [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) to construct the different approaches.

* Use hyper-parameter tuning (GridSearchCV) to find the best combination of parameters for each algorithm.

* Evaluate the model performance in terms of the accuracy score.

* Report the accuracy score of the best approach on the test dataset.

Your grade will be based on

* whether you have done all the modelling steps correctly
* how many things you have tried
* how well your final model performs on the test set.
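
One way to combine the pipeline and grid-search requirements above is to pass the whole pipeline to `GridSearchCV`, so that the preprocessing is refit inside every cross-validation fold. The following is only a minimal sketch on synthetic data: the `make_classification` data and the grid values are placeholders chosen for illustration, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# make_pipeline names each step after its lowercased class name,
# so KNN parameters are addressed as "kneighborsclassifier__<param>".
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 7],
    "kneighborsclassifier__weights": ["uniform", "distance"],
}

# Cross-validation only sees the training part; the test part stays untouched.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"CV accuracy: {search.best_score_:.2f}")
print(f"Test accuracy: {search.score(X_te, y_te):.2f}")
```

Because the scaler is part of the estimator being searched, its statistics are computed only from the training folds, which avoids leaking validation data into the preprocessing.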

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import graphviz   # will be used to visualize the trees

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.tree import export_text
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
RANDOM_STATE = 42

In [3]:
url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
titanic_df = pd.read_csv(url)
titanic_df.drop(['Name'], axis=1, inplace=True)
titanic_df_ = pd.get_dummies(titanic_df, columns=['Sex'], drop_first=True)
X = titanic_df_.drop('Survived', axis=1)
y = titanic_df_['Survived']
print(titanic_df.head())

   Survived  Pclass     Sex   Age  Siblings/Spouses Aboard  \
0         0       3    male  22.0                        1   
1         1       1  female  38.0                        1   
2         1       3  female  26.0                        0   
3         1       1  female  35.0                        1   
4         0       3    male  35.0                        0   

   Parents/Children Aboard     Fare  
0                        0   7.2500  
1                        0  71.2833  
2                        0   7.9250  
3                        0  53.1000  
4                        0   8.0500  


In [4]:
X_train_valid, X_test, y_train_valid, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_valid, y_train_valid, test_size=0.1, random_state=RANDOM_STATE
)

In [5]:
print("Class distribution in the train set:\n", y_train.value_counts())
print("Class distribution in the test set:\n", y_test.value_counts())

Class distribution in the train set:
 Survived
0    393
1    245
Name: count, dtype: int64
Class distribution in the test set:
 Survived
0    111
1     67
Name: count, dtype: int64
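
The splits above are not stratified, so the class ratio can drift slightly between the parts. If the ratio should be preserved exactly, `train_test_split` accepts a `stratify` argument. A small sketch with made-up labels (the `X_demo` and `y_demo` arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 negatives and 40 positives.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep the 60/40 class ratio.
print(np.bincount(y_tr).tolist())  # [48, 32]
print(np.bincount(y_te).tolist())  # [12, 8]
```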


In [6]:
model = RandomForestClassifier(random_state=RANDOM_STATE)
model.fit(X_train, y_train)

importances = model.feature_importances_
print(importances)

feature_importance_df = pd.DataFrame({'Feature': X.columns,
                                      'Importance': importances})

feature_importance_df = feature_importance_df.sort_values(by='Importance',
                                                          ascending=False)

print(feature_importance_df)

[0.096246   0.25610992 0.05063575 0.03162374 0.2755896  0.28979498]
                   Feature  Importance
5                 Sex_male    0.289795
4                     Fare    0.275590
1                      Age    0.256110
0                   Pclass    0.096246
2  Siblings/Spouses Aboard    0.050636
3  Parents/Children Aboard    0.031624
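
The importances above come from a random forest, whereas the task statement asks for importances from a decision tree. A single `DecisionTreeClassifier` exposes the same `feature_importances_` attribute. A minimal sketch on synthetic data (the data and the feature names `f0`..`f4` are placeholders, not the Titanic columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1.
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```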


In [7]:
pipline = make_pipeline(StandardScaler(), DecisionTreeClassifier())
pipline.fit(X_train, y_train)
y_pred = pipline.predict(X_valid)

accuracy_train = accuracy_score(y_train, pipline.predict(X_train))
accuracy_valid = accuracy_score(y_valid, y_pred)
print(f'Accuracy on the train set: {accuracy_train:.2f}')
print(f'Accuracy on the validation set: {accuracy_valid:.2f}')

Accuracy on the train set: 0.99
Accuracy on the validation set: 0.73


In [8]:
# Drop the least important feature according to the importances above.
X_train.drop(columns=['Parents/Children Aboard'], inplace=True)
X_valid.drop(columns=['Parents/Children Aboard'], inplace=True)
X_test.drop(columns=['Parents/Children Aboard'], inplace=True)


In [9]:
pipline = make_pipeline(StandardScaler(), DecisionTreeClassifier())
pipline.fit(X_train, y_train)
y_pred = pipline.predict(X_valid)

accuracy_train = accuracy_score(y_train, pipline.predict(X_train))
accuracy_valid = accuracy_score(y_valid, y_pred)
print(f'Accuracy on the train set: {accuracy_train:.2f}')
print(f'Accuracy on the validation set: {accuracy_valid:.2f}')

Accuracy on the train set: 0.99
Accuracy on the validation set: 0.70


# Since the validation accuracy is worse with the feature removed, we keep all features.

In [10]:
# Re-create the splits from the full feature matrix to restore the dropped column.
X_train_valid, X_test, y_train_valid, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_valid, y_train_valid, test_size=0.1, random_state=RANDOM_STATE
)

In [11]:
pipline_nv = make_pipeline(MultinomialNB(alpha=0.2, fit_prior=False))
pipline_nv.fit(X_train, y_train)
y_pred = pipline_nv.predict(X_valid)

accuracy_train = accuracy_score(y_train, pipline_nv.predict(X_train))
accuracy_valid = accuracy_score(y_valid, y_pred)
print(f'Accuracy on the train set: {accuracy_train:.2f}')
print(f'Accuracy on the validation set: {accuracy_valid:.2f}')

Accuracy on the train set: 0.68
Accuracy on the validation set: 0.72


In [12]:
from sklearn.model_selection import GridSearchCV
model = MultinomialNB

param_grid = {
    'alpha': [0.1, 0.2, 0.3],
    'fit_prior': [True, False],
}
grid_search = GridSearchCV(model(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

pipline_nv = make_pipeline(model(**grid_search.best_params_))
pipline_nv.fit(X_train, y_train)
y_pred = pipline_nv.predict(X_valid)

accuracy_train = accuracy_score(y_train, pipline_nv.predict(X_train))
accuracy_valid = accuracy_score(y_valid, y_pred)
print(f'Accuracy on the train set: {accuracy_train:.2f}')
print(f'Accuracy on the validation set: {accuracy_valid:.2f}')

{'alpha': 0.1, 'fit_prior': True}
Accuracy on the train set: 0.68
Accuracy on the validation set: 0.72


In [13]:

model = KNeighborsClassifier

param_grid = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}
grid_search = GridSearchCV(model(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

pipline = make_pipeline(model(**grid_search.best_params_))
pipline.fit(X_train, y_train)
y_pred = pipline.predict(X_valid)

accuracy_train = accuracy_score(y_train, pipline.predict(X_train))
accuracy_valid = accuracy_score(y_valid, y_pred)
print(f'Accuracy on the train set: {accuracy_train:.2f}')
print(f'Accuracy on the validation set: {accuracy_valid:.2f}')

{'n_neighbors': 7, 'p': 1, 'weights': 'distance'}
Accuracy on the train set: 0.99
Accuracy on the validation set: 0.76
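
The near-perfect training accuracy is expected rather than informative. With `weights='distance'`, every training point is its own nearest neighbour at distance zero, so it is effectively predicted from its own label (unless identical rows carry different labels). A small sketch on synthetic data, which is an assumption for illustration, shows the effect:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data: flip_y assigns random labels to 20% of the samples.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, flip_y=0.2, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=7, weights="distance")
knn.fit(X_tr, y_tr)

# Each training point votes for itself with overwhelming weight.
train_acc = knn.score(X_tr, y_tr)
valid_acc = knn.score(X_va, y_va)
print(f"Train accuracy: {train_acc:.2f}")
print(f"Validation accuracy: {valid_acc:.2f}")
```

For this reason the validation (or cross-validation) score is the number to compare across KNN settings, not the training score.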


In [14]:
model = DecisionTreeClassifier

param_grid = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'splitter': ['random', 'best'],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 4, 6],
}
grid_search = GridSearchCV(model(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

pipline = make_pipeline(model(**grid_search.best_params_))
pipline.fit(X_train, y_train)
y_pred = pipline.predict(X_valid)

accuracy_train = accuracy_score(y_train, pipline.predict(X_train))
accuracy_valid = accuracy_score(y_valid, y_pred)
print(f'Accuracy on the train set: {accuracy_train:.2f}')
print(f'Accuracy on the validation set: {accuracy_valid:.2f}')

{'criterion': 'log_loss', 'max_depth': 10, 'min_samples_split': 2, 'splitter': 'best'}
Accuracy on the train set: 0.93
Accuracy on the validation set: 0.75


In [15]:
from sklearn.ensemble import VotingClassifier

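# Majority-vote ensemble of the three model families.
# Note: the tree settings below differ from the grid-search result printed in cell 14.
ensemble = VotingClassifier(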
    estimators=[
        ("KNN", KNeighborsClassifier(n_neighbors=7, p=1, weights="distance")),
        ("Naive Bayes", MultinomialNB(alpha=0.1, fit_prior=True)),
        (
            "Decision tree",
            DecisionTreeClassifier(
                criterion="log_loss",
                max_depth=5,
                min_samples_split=4,
                splitter="random",
            ),
        ),
    ]
)
ensemble.fit(X_train, y_train)

print("Valid score: ", ensemble.score(X_valid, y_valid))
print("Test score: ", ensemble.score(X_test, y_test))

Valid score:  0.8169014084507042
Test score:  0.7303370786516854
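
`VotingClassifier` defaults to `voting='hard'`: each fitted model casts one vote per sample and the majority label wins. The rule can be sketched in plain Python. The `majority_vote` helper and the sample predictions below are hypothetical and serve only as an illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label per sample across models.

    `predictions` is a list of per-model label lists of equal length.
    """
    voted = []
    for sample_votes in zip(*predictions):
        # With three voters and two classes a tie cannot occur.
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical predictions from three models for four passengers.
knn_pred = [1, 0, 1, 0]
nb_pred = [1, 1, 0, 0]
tree_pred = [0, 1, 1, 0]

print(majority_vote([knn_pred, nb_pred, tree_pred]))  # [1, 1, 1, 0]
```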
