# Data Science - Assignment 5 - Comparative Experimentation

Dataset: <b>Heart Failure</b>     (small dataset)

Thomas Br√ºndl

se21m032

<br>

### Approach

In this exercise I experimented with three different algorithms (KNN, Perceptron, Decission Tree).
The input parameters of the respective algorithms were varied to determine how this would affect Effectiveness and Efficiency.


### Evaluation methods

For each dataset, I investigated 2 parameters to determine Efficiency:

1. Training time
2. Testing time

To determine Effectiveness I took a look at 4 different parameters:

1. Accuracy score
2. Jaccard score
3. f1 score
4. Precision score




# Imports

In [50]:
import pandas as pd
import numpy as np
import time as time
import statistics
from tabulate import tabulate
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import jaccard_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

# Load Data

In [51]:
data = pd.read_csv('heart_failure.csv')
data.describe()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,60.833893,0.431438,581.839465,0.41806,38.083612,0.351171,263358.029264,1.39388,136.625418,0.648829,0.32107,130.26087,0.32107
std,11.894809,0.496107,970.287881,0.494067,11.834841,0.478136,97804.236869,1.03451,4.412477,0.478136,0.46767,77.614208,0.46767
min,40.0,0.0,23.0,0.0,14.0,0.0,25100.0,0.5,113.0,0.0,0.0,4.0,0.0
25%,51.0,0.0,116.5,0.0,30.0,0.0,212500.0,0.9,134.0,0.0,0.0,73.0,0.0
50%,60.0,0.0,250.0,0.0,38.0,0.0,262000.0,1.1,137.0,1.0,0.0,115.0,0.0
75%,70.0,1.0,582.0,1.0,45.0,1.0,303500.0,1.4,140.0,1.0,1.0,203.0,1.0
max,95.0,1.0,7861.0,1.0,80.0,1.0,850000.0,9.4,148.0,1.0,1.0,285.0,1.0


# Hold Out Method

The following columns are used as independent varaibles (x):

1. age
2. anaemia
3. creatinine_phosphokinase
4. diabetes
5. ejection_fraction
6. high_blood_pressure
7. platelets
8. serum_creatinine
9. serum_sodium
10. sex
11. smoking

to predict the dependent variable "time" (y).

The data (rows) is split into train and test data with a ratio of 66/33. 

In [52]:
X_train, X_test, y_train, y_test = train_test_split(data.loc[:,:'smoking':], data.loc[:,'DEATH_EVENT':], test_size=0.33, random_state=1524401)

# KNN (k-nearest neighbors)

KNN was tested with the kd-tree algorithm and three different input parameters were chosen (neighbors).

In [53]:
train_times = []
test_times = []

accuracy_scores = []
jaccard_scores = []
f1_scores = []
precision_scores = []

neighbors = [3, 5, 8]

for n_neighbors in neighbors:
    # print("--[KNN]----[n_neighbors: " + str(n_neighbors) + "]---------------------------------------------")

    algo = KNeighborsClassifier(n_neighbors=n_neighbors, algorithm='kd_tree')

    # train ----------------------------------------------------
    start_training = time.time()
    model = algo.fit(X=X_train, y=y_train.values.ravel())
    training_time = time.time() - start_training
    # print("training_time: " + str(training_time))
    train_times.append(training_time)

    # predict ----------------------------------------------------
    start_testing = time.time()
    y_pred = model.predict(X=X_test)
    test_time = time.time() - start_testing
    # print("test_time: " + str(test_time))
    test_times.append(test_time)

    # --- accuracy -------------------------------------------------------------------
    accuracy_score_result = accuracy_score(y_true=y_test, y_pred=y_pred)
    # print("accuracy: " + str(accuracy_score_result))
    accuracy_scores.append(accuracy_score_result)

    # --- jaccard -------------------------------------------------------------------
    jaccard_score_result = jaccard_score(y_true=y_test, y_pred=y_pred, average='weighted')
    # print("jaccard: " + str(jaccard_score_result))
    jaccard_scores.append(jaccard_score_result)

    # --- f1 -------------------------------------------------------------------
    f1_score_result = f1_score(y_true=y_test, y_pred=y_pred, average='weighted')
    # print("f1_score: " + str(f1_score_result))
    f1_scores.append(f1_score_result)

    # --- precision -------------------------------------------------------------------
    precision_score_result = precision_score(y_true=y_test, y_pred=y_pred, average='weighted')
    # print("precision_score: " + str(precision_score_result))
    precision_scores.append(precision_score_result)
    

print("--[KNN]----[Mean Results]---------------------------------------------")

mean_training_time = statistics.mean(train_times)
mean_testing_time = statistics.mean(test_times)

print("mean training time: " + str(mean_training_time))
print("mean testing time: " + str(mean_testing_time))

mean_accuracy_score = statistics.mean(accuracy_scores)
mean_jaccard_score = statistics.mean(jaccard_scores)
mean_f1_score = statistics.mean(f1_scores)
mean_precision_score = statistics.mean(precision_scores)

print("mean accuracy score: " + str(mean_accuracy_score))
print("mean jaccard score: " + str(mean_jaccard_score))
print("mean f1 score: " + str(mean_f1_score))
print("mean precision score: " + str(mean_precision_score))

knn_mean_training_time = mean_training_time
knn_mean_testing_time = mean_testing_time

knn_mean_accuracy_score = mean_accuracy_score
knn_mean_jaccard_score = mean_jaccard_score
knn_mean_f1_score = mean_f1_score
knn_mean_precision_score = mean_precision_score

--[KNN]----[Mean Results]---------------------------------------------
mean training time: 0.0016782283782958984
mean testing time: 0.004324913024902344
mean accuracy score: 0.6565656565656566
mean jaccard score: 0.4667498714762524
mean f1 score: 0.5853028649924437
mean precision score: 0.5585386091027569


## KNN - Analyze the results based on different input parameters 

In [54]:
eval_criteria = [train_times, test_times, accuracy_scores, jaccard_scores, f1_scores, precision_scores]
eval_criteria_name = ["train_times", "test_times", "accuracy_scores", "jaccard_scores", "f1_scores", "precision_scores"]

i = 0
for criteria in eval_criteria:
    print("\n " + eval_criteria_name[i])

    headers = ["neighbors", "3", "5", "8"]
    table_data = [[""]]

    for idx, neighbor in enumerate(neighbors):
        table_data[0].append(eval_criteria[i][idx])
    print(tabulate(table_data, headers=headers, tablefmt="grid"))
    i += 1
    


 train_times
+-------------+------------+-----------+------------+
| neighbors   |          3 |         5 |          8 |
|             | 0.00202274 | 0.0020113 | 0.00100064 |
+-------------+------------+-----------+------------+

 test_times
+-------------+------------+------------+------------+
| neighbors   |          3 |          5 |          8 |
|             | 0.00497508 | 0.00298715 | 0.00501251 |
+-------------+------------+------------+------------+

 accuracy_scores
+-------------+----------+----------+----------+
| neighbors   |        3 |        5 |        8 |
|             | 0.636364 | 0.676768 | 0.656566 |
+-------------+----------+----------+----------+

 jaccard_scores
+-------------+----------+----------+----------+
| neighbors   |        3 |        5 |        8 |
|             | 0.460969 | 0.494938 | 0.444342 |
+-------------+----------+----------+----------+

 f1_scores
+-------------+----------+----------+----------+
| neighbors   |        3 |        5 |        8 |


Using different input parameters does not significantly influence the results.

# Perceptron

The perceptron was tested with different alphas (0.0001, 0.00001, 0.001) and penalties (l2, l1, elasticnet).
I could not find significant differences in Effectivenes nor in Efficiency by using different input parameters. 
(At least with the small dataset. 
At the large dataset differences could be found.)

In [55]:
train_times = []
test_times = []

accuracy_scores = []
jaccard_scores = []
f1_scores = []
precision_scores = []

alphas = [0.0001, 0.00001, 0.001]
penalties = ['l2', 'l1', 'elasticnet']

for alpha in alphas:
    for penalty in penalties:

        # print("--[Perceptron]----[alpha: " + str(alpha) + "]-----[penalty: " + str(penalty) + "]----------------------------------------")

        algo = Perceptron(alpha=alpha, penalty=penalty, random_state=1524401)

        # train ----------------------------------------------------
        start_training = time.time()
        model = algo.fit(X=X_train, y=y_train.values.ravel())
        training_time = time.time() - start_training
        # print("training_time: " + str(training_time))
        train_times.append(training_time)

        # predict ----------------------------------------------------
        start_testing = time.time()
        y_pred = model.predict(X=X_test)
        test_time = time.time() - start_testing
        # print("test_time: " + str(test_time))
        test_times.append(test_time)

        # --- accuracy -------------------------------------------------------------------
        accuracy_score_result = accuracy_score(y_true=y_test, y_pred=y_pred)
        # print("accuracy: " + str(accuracy_score_result))
        accuracy_scores.append(accuracy_score_result)

        # --- jaccard -------------------------------------------------------------------
        jaccard_score_result = jaccard_score(y_true=y_test, y_pred=y_pred, average='weighted')
        # print("jaccard: " + str(jaccard_score_result))
        jaccard_scores.append(jaccard_score_result)

        # --- f1 -------------------------------------------------------------------
        f1_score_result = f1_score(y_true=y_test, y_pred=y_pred, average='weighted')
        # print("f1_score: " + str(f1_score_result))
        f1_scores.append(f1_score_result)

        # --- precision -------------------------------------------------------------------
        precision_score_result = precision_score(y_true=y_test, y_pred=y_pred, average='weighted')
        # print("precision_score: " + str(precision_score_result))
        precision_scores.append(precision_score_result)
    

print("--[Perceptron]----[Mean Results]---------------------------------------------")


# print("Take only the first element of the train_times and the test_times list due to highly volatile behaviour of the train_time when it comes to alpha (0.00001, 0.001).")
# print("This means when we take a alpha of 0.00001 or 0.001 then the train_time is increased substantially.")
# print("I choose to take only the first instance into account that is performed with a alpha of 0.0001 to not produce a misleading training result.")
mean_training_time = statistics.mean(train_times)
mean_testing_time = statistics.mean(test_times)

print("mean training time: " + str(mean_training_time))
print("mean testing time: " + str(mean_testing_time))

mean_accuracy_score = statistics.mean(accuracy_scores)
mean_jaccard_score = statistics.mean(jaccard_scores)
mean_f1_score = statistics.mean(f1_scores)
mean_precision_score = statistics.mean(precision_scores)

print("mean accuracy score: " + str(mean_accuracy_score))
print("mean jaccard score: " + str(mean_jaccard_score))
print("mean f1 score: " + str(mean_f1_score))
print("mean precision score: " + str(mean_precision_score))

perceptron_mean_training_time = mean_training_time
perceptron_mean_testing_time = mean_testing_time

perceptron_mean_accuracy_score = mean_accuracy_score
perceptron_mean_jaccard_score = mean_jaccard_score
perceptron_mean_f1_score = mean_f1_score
perceptron_mean_precision_score = mean_precision_score

--[Perceptron]----[Mean Results]---------------------------------------------
mean training time: 0.0016645060645209418
mean testing time: 0.001111639870537652
mean accuracy score: 0.6767676767676768
mean jaccard score: 0.4580144883175186
mean f1 score: 0.5463064378727029
mean precision score: 0.4580144883175186


## Perceptron - Analyze the results based on different input parameters 

In [56]:
eval_criteria = [train_times, test_times, accuracy_scores, jaccard_scores, f1_scores, precision_scores]
eval_criteria_name = ["train_times", "test_times", "accuracy_scores", "jaccard_scores", "f1_scores", "precision_scores"]
i = 0
for criteria in eval_criteria:
    print("\n " + eval_criteria_name[i])
    headers = ["penalty\\alpha", "0.0001", "0.00001", "0.001"]
    table_data = []
    for idy, y in enumerate(penalties):
        table_data.append([penalties[idy]])
        for idx, x in enumerate(alphas):
            table_data[idy].append(eval_criteria[i][len(alphas)*idy+idx])
    
    print(tabulate(table_data, headers=headers, tablefmt="grid"))
    i += 1


 train_times
+-----------------+-------------+------------+------------+
| penalty\alpha   |      0.0001 |    0.00001 |      0.001 |
| l2              | 0.00199842  | 0.00198364 | 0.00199938 |
+-----------------+-------------+------------+------------+
| l1              | 0.000999928 | 0.0010004  | 0.00200152 |
+-----------------+-------------+------------+------------+
| elasticnet      | 0.00100017  | 0.00199723 | 0.00199986 |
+-----------------+-------------+------------+------------+

 test_times
+-----------------+-------------+-------------+-------------+
| penalty\alpha   |      0.0001 |     0.00001 |       0.001 |
| l2              | 0.00200176  | 0.00100088  | 0.000999451 |
+-----------------+-------------+-------------+-------------+
| l1              | 0.00100017  | 0.000999689 | 0.000998735 |
+-----------------+-------------+-------------+-------------+
| elasticnet      | 0.000999689 | 0.00100017  | 0.00100422  |
+-----------------+-------------+-------------+------------

# Decision Tree

The decision tree was tested with different min_samples_splits (2, 50, 100, 500, 1000) and min_samples_leafs (1, 50, 100, 500, 1000).

In [57]:
train_times = []
test_times = []

accuracy_scores = []
jaccard_scores = []
f1_scores = []
precision_scores = []

min_samples_splits = [2, 50, 100, 500, 1000]
min_samples_leafs = [1, 50, 100, 500, 1000]


for min_samples_split in min_samples_splits:
    for min_samples_leaf in min_samples_leafs:
        # print("--[DecisionTree]----[min_samples_splits: " + str(min_samples_split) + "]-----[min_samples_leafs: " + str(min_samples_leaf) + "]----------------------------------------")

        algo = DecisionTreeClassifier(criterion='gini', splitter='best', min_samples_split=min_samples_split, random_state=1524401)

        # train ----------------------------------------------------
        start_training = time.time()
        model = algo.fit(X=X_train, y=y_train.values.ravel())
        training_time = time.time() - start_training
        # print("training_time: " + str(training_time))
        train_times.append(training_time)

        # predict ----------------------------------------------------
        start_testing = time.time()
        y_pred = model.predict(X=X_test)
        test_time = time.time() - start_testing
        # print("test_time: " + str(test_time))
        test_times.append(test_time)

        # --- accuracy -------------------------------------------------------------------
        accuracy_score_result = accuracy_score(y_true=y_test, y_pred=y_pred)
        # print("accuracy: " + str(accuracy_score_result))
        accuracy_scores.append(accuracy_score_result)

        # --- jaccard -------------------------------------------------------------------
        jaccard_score_result = jaccard_score(y_true=y_test, y_pred=y_pred, average='weighted')
        # print("jaccard: " + str(jaccard_score_result))
        jaccard_scores.append(jaccard_score_result)

        # --- f1 -------------------------------------------------------------------
        f1_score_result = f1_score(y_true=y_test, y_pred=y_pred, average='weighted')
        # print("f1_score: " + str(f1_score_result))
        f1_scores.append(f1_score_result)

        # --- precision -------------------------------------------------------------------
        precision_score_result = precision_score(y_true=y_test, y_pred=y_pred, average='weighted')
        # print("precision_score: " + str(precision_score_result))
        precision_scores.append(precision_score_result)


print("--[DecisionTree]----[Mean Results]---------------------------------------------")

mean_training_time = statistics.mean(train_times)
mean_testing_time = statistics.mean(test_times)

print("mean training time: " + str(mean_training_time))
print("mean testing time: " + str(mean_testing_time))

mean_accuracy_score = statistics.mean(accuracy_scores)
mean_jaccard_score = statistics.mean(jaccard_scores)
mean_f1_score = statistics.mean(f1_scores)
mean_precision_score = statistics.mean(precision_scores)

print("mean accuracy score: " + str(mean_accuracy_score))
print("mean jaccard score: " + str(mean_jaccard_score))
print("mean f1 score: " + str(mean_f1_score))
print("mean precision score: " + str(mean_precision_score))

decisionTree_mean_training_time = mean_training_time
decisionTree_mean_testing_time = mean_testing_time

decisionTree_mean_accuracy_score = mean_accuracy_score
decisionTree_mean_jaccard_score = mean_jaccard_score
decisionTree_mean_f1_score = mean_f1_score
decisionTree_mean_precision_score = mean_precision_score  

--[DecisionTree]----[Mean Results]---------------------------------------------
mean training time: 0.0017145729064941405
mean testing time: 0.0009657573699951171
mean accuracy score: 0.7333333333333334
mean jaccard score: 0.5645705513885477
mean f1 score: 0.679151997440421
mean precision score: 0.6453556897935424


## Decision Tree - Analyze the results based on different input parameters 

In [58]:
eval_criteria = [train_times, test_times, accuracy_scores, jaccard_scores, f1_scores, precision_scores]
eval_criteria_name = ["train_times", "test_times", "accuracy_scores", "jaccard_scores", "f1_scores", "precision_scores"]
i = 0
for criteria in eval_criteria:
    print("\n " + eval_criteria_name[i])
    headers = ["leafs\splits", "2", "50", "100", "500", "1000"]
    table_data = []
    for idy, y in enumerate(min_samples_leafs):
        table_data.append([min_samples_leafs[idy]])
        for idx, x in enumerate(min_samples_splits):
            table_data[idy].append(eval_criteria[i][len(min_samples_leafs)*idy+idx])
    
    print(tabulate(table_data, headers=headers, tablefmt="grid"))
    i += 1


 train_times
+----------------+-------------+-------------+------------+------------+-------------+
|   leafs\splits |           2 |          50 |        100 |        500 |        1000 |
|              1 | 0.00199962  | 0.00397635  | 0.00301409 | 0.00196028 | 0.00199866  |
+----------------+-------------+-------------+------------+------------+-------------+
|             50 | 0.00202727  | 0.00195622  | 0.00104046 | 0.00195956 | 0.00200176  |
+----------------+-------------+-------------+------------+------------+-------------+
|            100 | 0.0010016   | 0.00100136  | 0.00199628 | 0.00200129 | 0.000974178 |
+----------------+-------------+-------------+------------+------------+-------------+
|            500 | 0.000936747 | 0.00100136  | 0.0019908  | 0.00196815 | 0.00200009  |
+----------------+-------------+-------------+------------+------------+-------------+
|           1000 | 0.00104094  | 0.000999689 | 0.00199747 | 0.00101924 | 0.00100088  |
+----------------+-----------

# Results (KNN, Perceptron, Decision Tree)

In [59]:
headers = ["", "Train time", "Test time", "Accuracy", "Jaccard", "f1", "Precision"]

table_data = [
    ["K-NN", str(knn_mean_training_time), str(knn_mean_testing_time), str(knn_mean_accuracy_score),  str(knn_mean_jaccard_score), str(knn_mean_f1_score), str(knn_mean_precision_score)],
    ["Perceptron",  str(perceptron_mean_training_time), str(perceptron_mean_testing_time), str(perceptron_mean_accuracy_score),  str(perceptron_mean_jaccard_score), str(perceptron_mean_f1_score), str(perceptron_mean_precision_score)],
    ["Decision Tree",str(decisionTree_mean_training_time), str(decisionTree_mean_testing_time), str(decisionTree_mean_accuracy_score),  str(decisionTree_mean_jaccard_score), str(decisionTree_mean_f1_score), str(decisionTree_mean_precision_score)],
]

print(tabulate(table_data, headers=headers, tablefmt="grid"))

+---------------+--------------+-------------+------------+-----------+----------+-------------+
|               |   Train time |   Test time |   Accuracy |   Jaccard |       f1 |   Precision |
| K-NN          |   0.00167823 | 0.00432491  |   0.656566 |  0.46675  | 0.585303 |    0.558539 |
+---------------+--------------+-------------+------------+-----------+----------+-------------+
| Perceptron    |   0.00166451 | 0.00111164  |   0.676768 |  0.458014 | 0.546306 |    0.458014 |
+---------------+--------------+-------------+------------+-----------+----------+-------------+
| Decision Tree |   0.00171457 | 0.000965757 |   0.733333 |  0.564571 | 0.679152 |    0.645356 |
+---------------+--------------+-------------+------------+-----------+----------+-------------+


The prceptron and the decision tree achieved better results regarding efficiency (test time) than KNN. 

Besides that there could not be found significat differences in Effectivenes nor in Efficiency when conducting the test with the small dataset.

However, the large dataset produces different results when it comes to Effectivenes and Efficiency.