In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

## Loading Covid Dataset, first column is the index (0 based)

In [2]:
data = pd.read_csv('COVID_Dataset.csv', index_col=0)

### Printing summary of dataset
Showing the number of rows and columns, the column types, and the overall size of the dataset in memory.
This also gives an overall look at the dataset.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80000 entries, 0 to 79999
Columns: 106 entries, 24 to Infection
dtypes: float64(104), int64(2)
memory usage: 65.3 MB


### Input-Output Split
The last column of the dataset is the output; all other columns are the inputs.

In [4]:
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]

The output is binary, so this is a binary classification task.

In [5]:
y.unique()

array([1, 0], dtype=int64)

### Normalization
To test the effect of data scaling on classifier quality, we create a second dataset from this one and scale its values using Standard Scaler.

#### Standard Scaler: Standardize features by removing the mean and scaling to unit variance.
<img src="z_score.svg">
After applying the above formula, each feature will have a mean of 0 and a standard deviation of 1.

In [6]:
X_scaled = StandardScaler().fit_transform(X)
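
As a small check of the formula above, the following sketch compares a manual z-score with the output of StandardScaler. The array `X_toy` is a made-up example, not data from COVID_Dataset.csv.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values, not taken from the real dataset)
X_toy = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0],
                  [4.0, 800.0]])

# Manual z-score: subtract the column mean, divide by the column std.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of numpy's std.
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

z_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))
```

Both columns end up on the same scale even though the second one was originally 200 times larger.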

### Train-Test split
We need to split our data into train and test sets, and we have to do it twice: once for the normalized data and once for the raw data. To do so, we use sklearn's train_test_split with test_size set to 0.25, which means 25% of the dataset is used for testing and the rest for training. It also shuffles the rows before splitting (the shuffle option is on by default). Because both calls use the same random_state, both splits select the same rows.<br>
random_state is set to get the same results in every experiment during code debugging (1379 is an arbitrary number).

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1379)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_scaled, y, test_size=0.25, random_state=1379)
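
This notebook scales the whole dataset before splitting, so the scaler also sees the test rows. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data; `X_demo`, `y_demo` and the other names are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1379)

# ... then learn the mean and std from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)  # test rows reuse the training statistics

print(X_tr_s.shape, X_te_s.shape)
```

For the decision trees used here the difference should not matter much, but for scale-sensitive models this ordering keeps the test set fully unseen.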

### DeTree
The first algorithm is a decision tree. We use this algorithm twice: first without pruning and then with pruning. <br>
Pruning is the method of removing unnecessary nodes, which helps prevent overfitting.

In [8]:
# DeTree without pruning
clf_tree = DecisionTreeClassifier()
clf_tree_s = DecisionTreeClassifier()

Fitting the decision tree on the training data (once for normalized and once for unnormalized data)

In [9]:
clf_tree.fit(X_train, y_train)
clf_tree_s.fit(X_train_s, y_train_s)


Get predictions of the classifiers

In [None]:
p_train = clf_tree.predict(X_train)
p_train_s = clf_tree_s.predict(X_train_s)

p_test = clf_tree.predict(X_test)
p_test_s = clf_tree_s.predict(X_test_s)

Calculating classification metrics for each case

In [None]:
print('---- Raw Data ----')
print(f'[Train]:')
print(classification_report(y_train, p_train))
print('[Test]:')
print(classification_report(y_test, p_test))

---- Raw Data ----
[Train]:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     39968
           1       1.00      1.00      1.00     20032

    accuracy                           1.00     60000
   macro avg       1.00      1.00      1.00     60000
weighted avg       1.00      1.00      1.00     60000

[Test]:
              precision    recall  f1-score   support

           0       0.80      0.80      0.80     13286
           1       0.61      0.61      0.61      6714

    accuracy                           0.74     20000
   macro avg       0.71      0.71      0.71     20000
weighted avg       0.74      0.74      0.74     20000



In [None]:
print('---- Normalized Data ----')
print(f'[Train]:')
print(classification_report(y_train_s, p_train_s))
print('[Test]:')
print(classification_report(y_test_s, p_test_s))

---- Normalized Data ----
[Train]:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     39968
           1       1.00      1.00      1.00     20032

    accuracy                           1.00     60000
   macro avg       1.00      1.00      1.00     60000
weighted avg       1.00      1.00      1.00     60000

[Test]:
              precision    recall  f1-score   support

           0       0.80      0.80      0.80     13286
           1       0.61      0.61      0.61      6714

    accuracy                           0.74     20000
   macro avg       0.70      0.71      0.70     20000
weighted avg       0.74      0.74      0.74     20000



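Based on the above results, we can see that normalization had essentially no effect on the classifier's performance. Only the macro-averaged test precision and f1-score differ, by 0.01, which is probably due to randomness in tree construction, since the trees were created without a fixed random_state. This is completely expected because 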
"The algorithm is based on partitioning the data to make predictions, therefore, it does not require normalization. For example, a decision tree splits a node on a feature, where this feature is not influenced by another feature and neither influences another feature."<br>
Source: https://www.kdnuggets.com/2022/07/random-forest-algorithm-need-normalization.html
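
The same effect can be reproduced on a small synthetic problem. The sketch below is an illustration under assumptions: the data comes from `make_classification`, and both trees get the same `random_state` so that tie-breaking between features is identical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_demo_s = StandardScaler().fit_transform(X_demo)

# Same random_state, so both splits pick the same rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1379)
X_tr_s, X_te_s, _, _ = train_test_split(
    X_demo_s, y_demo, test_size=0.25, random_state=1379)

# Fix the tree seed so both trees visit features in the same order
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_tr_s, y_tr)

agreement = np.mean(tree_raw.predict(X_te) == tree_scaled.predict(X_te_s))
print(f'Fraction of identical test predictions: {agreement:.3f}')
```

Standardization is a monotonic transformation of each feature, so the ordering of the samples along every feature, and therefore the set of possible splits, stays the same.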

### DeTree with Pruning
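Now let's see the results after pruning the trees. <br>
For the pruning part, we use the "Minimal Cost-Complexity Pruning" algorithm, which needs a value alpha.<br>
$R_\alpha(T) = R(T) + \alpha|\widetilde{T}|$<br>
Here $R(T)$ is the total impurity (traditionally the misclassification rate) of the terminal nodes of tree $T$, and $|\widetilde{T}|$ is the number of terminal nodes, so a larger alpha penalizes larger trees more heavily.<br>
If we set alpha = 0, no pruning is done. We test pruning with 4 different non-zero alphas and keep alpha = 0 as the baseline.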
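
Instead of guessing alpha values, scikit-learn can list the alphas at which the tree actually changes, via `cost_complexity_pruning_path`. The following is a minimal sketch on synthetic data (the data and variable names are assumptions) that shows the number of leaves shrinking as alpha grows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data as a stand-in for the real dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=0)

# Effective alphas at which successive subtrees get pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)

# Clip tiny negative values that can appear from floating-point error,
# because ccp_alpha must be non-negative
alphas = np.clip(path.ccp_alphas, 0.0, None)

# Refit for a handful of alphas along the path and count the leaves
step = max(1, len(alphas) // 5)
leaf_counts = []
for alpha in alphas[::step]:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_demo, y_demo)
    leaf_counts.append(clf.get_n_leaves())
    print(f'alpha={alpha:.5f}  leaves={clf.get_n_leaves()}')
```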

In [None]:
list_alpha = [0.0, 0.00001, 0.0001, 0.001, 0.01]
for alpha in list_alpha:
    clf = DecisionTreeClassifier(ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    # clf.score returns the mean accuracy, so separate predict calls are not needed
    print(f'---- Alpha: {alpha} ----')
    print(f'Train: {clf.score(X_train, y_train)}')
    print(f'Test: {clf.score(X_test, y_test)}')

---- Alpha: 0.0 ----
Train: 0.9994833333333333
Test: 0.73825


---- Alpha: 1e-05 ----
Train: 0.9994833333333333
Test: 0.7387


---- Alpha: 0.0001 ----
Train: 0.8118333333333333
Test: 0.77265


---- Alpha: 0.001 ----
Train: 0.7618666666666667
Test: 0.75815


---- Alpha: 0.01 ----
Train: 0.7087333333333333
Test: 0.7064


So, based on the pruning results, we can see that by eliminating some extra nodes we can increase the test accuracy at the cost of some training accuracy (alpha = 0.0001 and alpha = 0.001 raise test accuracy from about 0.738 to 0.773 and 0.758, respectively).<br>
But this improvement stops at some point: with alpha = 0.01 we remove too many nodes and the model becomes too simple, which lowers both train and test accuracy (to about 0.71). So we need to test different alphas to find the sweet spot.
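
One way to search for that sweet spot without consulting the test set is cross-validation on the training data. The sketch below uses `GridSearchCV` with the same alpha grid on synthetic data; the data and the 5-fold setting are assumptions, not what this notebook ran.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train, y_train (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, flip_y=0.1, random_state=0)

# Same candidate alphas as in the loop above
param_grid = {'ccp_alpha': [0.0, 0.00001, 0.0001, 0.001, 0.01]}

# 5-fold cross-validation: each alpha is scored on held-out folds
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

best_alpha = search.best_params_['ccp_alpha']
print(f'Best alpha: {best_alpha}, CV accuracy: {search.best_score_:.3f}')
```

The test set can then be used once, at the end, to report the accuracy of the chosen alpha.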

### Ada Boost
The second algorithm is AdaBoost. AdaBoost needs a base classifier; here we use the default one, which is also a decision tree, but limited to a depth of 1 (a decision stump) rather than the fully grown tree used above. <br>
We test 4 different numbers of base estimators.
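
To see which base estimator AdaBoostClassifier builds when none is passed, one can inspect the fitted ensemble. In scikit-learn the default is a DecisionTreeClassifier with max_depth=1. The sketch below checks this on synthetic data (the data is an assumption).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Small synthetic problem, only used to get a fitted ensemble (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

clf = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Collect the depth of every fitted base estimator
depths = {est.get_depth() for est in clf.estimators_}
print(f'Depths of the fitted base estimators: {depths}')
```

Each base estimator is therefore a very weak learner, and the ensemble gains its accuracy from combining many of them.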

In [None]:
list_n = [50, 100, 500, 1000]

In [None]:
for n in list_n:
    clf = AdaBoostClassifier(n_estimators = n)
    clf.fit(X_train, y_train)
    p_train = clf.predict(X_train)
    p_test = clf.predict(X_test)
    print(f'---- Number of Estimators: {n} ----')
    print(f'[Train]:')
    print(classification_report(y_train, p_train))
    print('[Test]:')
    print(classification_report(y_test, p_test))
    

---- Number of Estimators: 50 ----
[Train]:
              precision    recall  f1-score   support

           0       0.79      0.88      0.83     39968
           1       0.68      0.52      0.59     20032

    accuracy                           0.76     60000
   macro avg       0.73      0.70      0.71     60000
weighted avg       0.75      0.76      0.75     60000

[Test]:
              precision    recall  f1-score   support

           0       0.78      0.88      0.83     13286
           1       0.68      0.51      0.58      6714

    accuracy                           0.75     20000
   macro avg       0.73      0.69      0.70     20000
weighted avg       0.74      0.75      0.74     20000



---- Number of Estimators: 100 ----
[Train]:
              precision    recall  f1-score   support

           0       0.79      0.89      0.84     39968
           1       0.70      0.53      0.60     20032

    accuracy                           0.77     60000
   macro avg       0.75      0.71      0.72     60000
weighted avg       0.76      0.77      0.76     60000

[Test]:
              precision    recall  f1-score   support

           0       0.78      0.88      0.83     13286
           1       0.69      0.52      0.59      6714

    accuracy                           0.76     20000
   macro avg       0.74      0.70      0.71     20000
weighted avg       0.75      0.76      0.75     20000



---- Number of Estimators: 500 ----
[Train]:
              precision    recall  f1-score   support

           0       0.81      0.88      0.84     39968
           1       0.72      0.58      0.64     20032

    accuracy                           0.78     60000
   macro avg       0.76      0.73      0.74     60000
weighted avg       0.78      0.78      0.78     60000

[Test]:
              precision    recall  f1-score   support

           0       0.80      0.87      0.83     13286
           1       0.69      0.56      0.62      6714

    accuracy                           0.77     20000
   macro avg       0.74      0.72      0.73     20000
weighted avg       0.76      0.77      0.76     20000



---- Number of Estimators: 1000 ----
[Train]:
              precision    recall  f1-score   support

           0       0.82      0.89      0.85     39968
           1       0.73      0.60      0.66     20032

    accuracy                           0.79     60000
   macro avg       0.77      0.74      0.75     60000
weighted avg       0.79      0.79      0.79     60000

[Test]:
              precision    recall  f1-score   support

           0       0.80      0.87      0.83     13286
           1       0.69      0.57      0.63      6714

    accuracy                           0.77     20000
   macro avg       0.74      0.72      0.73     20000
weighted avg       0.76      0.77      0.76     20000



Increasing the number of estimators slightly improves the metrics. This is important because we want to increase the recall of the positive class as much as possible, which means reducing false negatives: not detecting a COVID-19 case is more dangerous than classifying a healthy person as a positive case.

The problem is that to increase test recall by 6 percentage points (from 0.51 to 0.57) we have to use 950 more estimators, which increases the training time by roughly a factor of 20. It also makes the prediction phase slower.
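
The recall-versus-size trade-off can be explored with a single fit instead of one fit per setting, because `staged_predict` yields the ensemble's predictions after each boosting round. This is a minimal sketch on synthetic, imbalanced data; the data, the 100-estimator limit and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 2:1 class ratio (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.67, 0.33],
    flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1379)

# Fit once with the largest number of estimators of interest
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n boosting rounds
recalls = [recall_score(y_te, p) for p in clf.staged_predict(X_te)]

for n in (10, 50, 100):
    if n <= len(recalls):
        print(f'{n:>4} estimators: positive-class recall = {recalls[n - 1]:.3f}')
```

Reading the recall curve this way shows where extra estimators stop paying off, without retraining for every candidate size.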