## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). 

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [1]:
AUTHOR = "Dimitrios Sourlantzis"
COURSE = "Machine Learning - CSD AUTH"

---

# Assignment 2 - Decision Trees #

Welcome to your second assignment. This exercise gives you an introduction to [scikit-learn](https://scikit-learn.org/stable/), a simple but efficient machine learning library in Python. It also gives you a broad understanding of how decision trees work. 

After this assignment you will:
- Be able to use the scikit-learn library and train your own model from scratch.
- Be able to train and understand decision trees.

In [2]:
# Always run this cell
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# USE THIS RANDOM VARIABLE TO PRODUCE THE SAME RESULTS
RANDOM_VARIABLE = 42

## 1. Scikit-Learn and Decision Trees ##

You are going to use the scikit-learn library to train a model for detecting breast cancer using the [Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) (+ [Additional information](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset)) by training a model using [decision trees](https://scikit-learn.org/stable/modules/tree.html).

**1.1** Load the breast cancer dataset using the scikit-learn library and split it into a train set and a test set using the appropriate function. Use 33% of the dataset as the test set. Define as X the attributes and as y the target values. Do not forget to set the random_state parameter to the *RANDOM_VARIABLE* defined above. Use this variable for all the random_state parameters in this assignment.

In [3]:
# BEGIN CODE HERE
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_VARIABLE)

#END CODE HERE

In [4]:
print("Size of train set:{}".format(len(y_train)))
print("Size of test set:{}".format(len(y_test)))
print("Unique classes:{}".format(len(set(y_test))))

Size of train set:381
Size of test set:188
Unique classes:2


**Expected output**:  

```
Size of train set:381  
Size of test set:188  
Unique classes:2
```
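
As a quick check of where these sizes come from: the dataset has 569 samples, and `train_test_split` rounds a fractional test size up, so 33% gives 188 test samples and leaves 381 for training. A minimal sketch on placeholder data (the array `indices` is only a stand-in for the real rows, not part of the assignment):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 569                 # number of rows in the breast cancer dataset
indices = np.arange(n_samples)  # placeholder data, one entry per row

train_idx, test_idx = train_test_split(indices, test_size=0.33, random_state=42)

# A fractional test size is rounded up; the training set gets the remainder.
expected_test = math.ceil(0.33 * n_samples)
print(len(train_idx), len(test_idx), expected_test)
```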



**1.2** Train two DecisionTree classifiers and report the F1 score. Use the information gain (entropy criterion) for one classifier and the Gini impurity for the other.

In [5]:
# BEGIN CODE HERE
classifier_gini = DecisionTreeClassifier(random_state = RANDOM_VARIABLE)
classifier_igain = DecisionTreeClassifier(criterion = 'entropy', random_state = RANDOM_VARIABLE)

classifier_gini = classifier_gini.fit(X_train, y_train)
classifier_igain = classifier_igain.fit(X_train, y_train)

prediction_gini = classifier_gini.predict(X_test)
prediction_igain = classifier_igain.predict(X_test)

# f1_score expects (y_true, y_pred); binary F1 is symmetric, so the values are unchanged
f_measure_gini = f1_score(y_test, prediction_gini)
f_measure_igain = f1_score(y_test, prediction_igain)

#END CODE HERE

In [6]:
print("F-Measure Gini: {}".format(f_measure_gini))
print("F-Measure Information Gain: {}".format(f_measure_igain))

F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386


**Expected output**:  

```
F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386
```
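
To make the two split criteria concrete, here is a minimal sketch of the impurity measures behind them, computed with NumPy on toy label vectors. The helper names `gini_impurity` and `entropy` are hypothetical and written for illustration; they are not scikit-learn functions.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


balanced = [0, 0, 1, 1]  # two classes in equal proportion
pure = [1, 1, 1, 1]      # a single class

print(gini_impurity(balanced), entropy(balanced))  # 0.5 1.0
print(gini_impurity(pure), abs(entropy(pure)))
```

Both measures are zero for a pure node and largest for an evenly mixed one; a split is chosen to reduce the weighted impurity of the child nodes as much as possible.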

**1.3** Find the maximum depth reached by the tree that used the Gini impurity. Train multiple classifiers by modifying the max_depth within the range from 1 to maximum depth and save the f1 scores to the corresponding list of the *fscores* dictionary (one list for training set and one for test set). Before appending the scores to the corresponding list, multiply them by 100, and round the values to 2 decimals.

In [7]:
# BEGIN CODE HERE
depth = classifier_gini.tree_.max_depth
fscores = {}
fscores['train'] = []
fscores['test'] = []

for i in range(1, depth + 1):
    model = DecisionTreeClassifier(max_depth=i, random_state=RANDOM_VARIABLE)
    model.fit(X_train, y_train)
    # F1 as a percentage on each set; f1_score expects (y_true, y_pred)
    train_score = 100 * f1_score(y_train, model.predict(X_train))
    test_score = 100 * f1_score(y_test, model.predict(X_test))
    fscores['train'].append(round(train_score, 2))
    fscores['test'].append(round(test_score, 2))
    
#END CODE HERE

In [8]:
print("Fscores Train: {}".format(fscores['train']))
print("Fscores Test:  {}".format(fscores['test']))


Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]


**Expected output**:  
```
Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]
```

**1.4** Compare the results from the train set with the results from the test set. What do you notice? How are you going to choose the max_depth of your model?

# ANSWERS
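* Fscore Train: The F1 score increases steadily with depth and reaches 100% at the maximum depth (7), meaning the deepest tree fits the training set perfectly.

* Fscore Test: The F1 score increases up to a point (96.64 at depth 3) and then fluctuates below that peak, ending at 93.72 at depth 7. The growing gap between the train and test scores indicates overfitting.
We should choose the max_depth that maximizes the F1 score on held-out data (here, depth 3). Ideally this choice is made on a validation set or with cross-validation rather than on the test set, so that the test score remains an honest estimate of performance.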
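
One way to pick the depth without touching the test set is cross-validation on the training data. The following is a minimal sketch under that assumption; the names `SEED`, `cv_scores` and `best_depth` are introduced here for illustration, and the depth range 1 to 7 mirrors the one explored above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # same value as RANDOM_VARIABLE above

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=SEED)

cv_scores = {}
for depth in range(1, 8):  # depths 1..7
    model = DecisionTreeClassifier(max_depth=depth, random_state=SEED)
    # Mean F1 over 5 folds of the training data only
    cv_scores[depth] = cross_val_score(model, X_tr, y_tr, cv=5, scoring="f1").mean()

best_depth = max(cv_scores, key=cv_scores.get)
print("Depth with the best cross-validated F1:", best_depth)
```

The test set is then used once, at the end, to report the score of the model trained with `best_depth`.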

## 2.0 Pipelines ##

**2.1** In this part of the exercise you are going to build a pipeline from scratch for a classification problem. Load the **income.csv** file and train a DecisionTree model that will predict the *income* variable. This dataset is a modification of the original Adult Income dataset found [here](http://archive.ics.uci.edu/ml/datasets/Adult). Report the f1-score and accuracy score of the test set found in **income_test.csv**. Your pipeline should be able to handle missing values and categorical features (scikit-learn's decision trees do not handle categorical values). You can preprocess the dataset as you like in order to achieve higher scores.  

In [9]:
# BEGIN CODE HERE
train_set = pd.read_csv('income.csv')
X_train = train_set.drop('income',axis=1)
y_train = train_set['income']
# any other code you need


test_set = pd.read_csv('income_test.csv')
X_test = test_set.drop('income',axis=1)
y_test = test_set['income']
# any other code you need

# any other code you need
# End CODE HERE

**2.2** Create and test your pipeline

In [10]:
#Your pipeline

# Non-numeric columns (stored as strings) are treated as categorical
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()

categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), 
           ('encoder', OneHotEncoder())]
)

# Every remaining column is numeric
numeric_features = [col for col in X_train.columns if col not in categorical_features]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
])


preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer, numeric_features),
        ("categorical", categorical_transformer, categorical_features),
    ]
)

#VANILLA CLF WITH EVERYTHING ON DEFAULT
clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", DecisionTreeClassifier())])

#CLF AFTER HYPERPARAMETER TUNING WITH GRID SEARCH
#clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", DecisionTreeClassifier(criterion= 'gini', max_depth= 10, max_leaf_nodes= 18, min_samples_leaf= 2, splitter= 'best'))])
clf.fit(X_train, y_train)
y_predict =  clf.predict(X_test)

In [11]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test, y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test, y_predict,average='weighted'))

Model score Accuracy: 0.808
Model score F1 Weighted: 0.809
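
The weighted F1 reported above is the average of the per-class F1 scores, each weighted by the number of true samples of that class. A small sketch with invented labels (not the income data) that reproduces the weighting by hand:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels, invented for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

per_class = f1_score(y_true, y_pred, average=None)  # one F1 per class, in label order
support = np.bincount(y_true)                       # number of true samples per class
manual_weighted = np.sum(per_class * support) / support.sum()

print(per_class, manual_weighted)
print(f1_score(y_true, y_pred, average="weighted"))
```

Because the majority class carries most of the weight, a weighted F1 can look good even when the minority class is predicted poorly.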


**2.3** Perform a good grid search to find the best parameters for your pipeline

In [12]:
param_grid = [{'classifier__criterion': ['gini', 'entropy'],
               'classifier__splitter': ['best', 'random'],
               'classifier__max_depth': [5, 10, 15],
               'classifier__min_samples_leaf': [2, 5, 8],
               'classifier__max_leaf_nodes': [12, 16, 18],
               'preprocessor__numeric__imputer__strategy': ['mean', 'median', 'most_frequent']
               }]


grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
y_predict =  grid_search.predict(X_test)

print("Best params:")
print(grid_search.best_params_)

Best params:
{'classifier__criterion': 'gini', 'classifier__max_depth': 10, 'classifier__max_leaf_nodes': 18, 'classifier__min_samples_leaf': 2, 'classifier__splitter': 'best', 'preprocessor__numeric__imputer__strategy': 'mean'}


In [13]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test,y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test,y_predict,average='weighted'))

Model score Accuracy: 0.850
Model score F1 Weighted: 0.838
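
To get a feel for the cost of this search, the number of candidate combinations can be counted with `ParameterGrid`. The sketch below repeats the grid from the cell above; the variable names `n_candidates` and `n_fits` are introduced for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above: keys follow the <step>__<parameter> naming of the pipeline
param_grid = {
    "classifier__criterion": ["gini", "entropy"],
    "classifier__splitter": ["best", "random"],
    "classifier__max_depth": [5, 10, 15],
    "classifier__min_samples_leaf": [2, 5, 8],
    "classifier__max_leaf_nodes": [12, 16, 18],
    "preprocessor__numeric__imputer__strategy": ["mean", "median", "most_frequent"],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 2 * 3 * 3 * 3 * 3
n_fits = n_candidates * 5                      # cv=5 folds per candidate
print(n_candidates, n_fits)  # 324 1620
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training data.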


**2.4** Describe the process you followed to achieve the results above. Your description should include, but is not limited to the following 
- How do you handle missing values and why
- How do you handle categorical variables and why
- Any further preprocessing steps
- How do you evaluate your model and how did you choose its parameters 
- Report any additional results and comments on your approach.

You should achieve at least 85% accuracy score and 84% f1 score.

# ANSWERS
* When handling missing values, one can drop the column, drop the affected rows, or impute the missing entries. Here we impute: a missing categorical value is filled with the most frequent category of its column, and a missing numeric value with the mean of its column. This keeps every row and column available for training.
* Categorical variables are encoded with OneHotEncoder, which creates one binary column per category. Assigning arbitrary integer labels to the categories instead would impose an artificial ordering on them, which could have damaged the performance of the model.
* Further preprocessing: the numerical and categorical columns are identified, and each group gets its own transformer with the required steps: 
  * Numeric values: missing entries are filled with the column mean, as described above.
  * Categorical values: missing entries are filled with the most frequent category, as described above. Then the values are encoded using OneHotEncoder.
* Evaluation: we predict the test labels (`y_predict`) and compare them with `y_test` using accuracy and the weighted F1 score. We first ran training and evaluation with the default `DecisionTreeClassifier` parameters, scoring **0.808** accuracy and **0.809** F1. We then ran a grid search with 5-fold cross-validation over the classifier hyperparameters and the numeric imputation strategy, and used the best combination found. (Grid search is very slow on a home computer, so it was run on Google Colab.) Accuracy with tuning: **0.850**, F1 score with tuning: **0.838**.
* The overall structure is worth noting. Two pipelines were created, one for the numerical features and one for the categorical features, each with its own steps. These two pipelines were then combined into a single pipeline that both preprocesses the data and trains the model. The same approach scales to larger and more complicated problems: more steps will be needed, but composing pipelines keeps model building, training and testing manageable.
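
The preprocessing described above can be sketched on a tiny, invented data frame. The column names only mimic the income data, and `handle_unknown="ignore"` is an extra setting assumed here so that categories unseen during training do not raise an error at prediction time.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame invented for illustration, with one missing value per column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 55.0],
    "workclass": pd.Series(["Private", np.nan, "Private", "State-gov"], dtype=object),
})

numeric = Pipeline([("imputer", SimpleImputer(strategy="mean"))])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer([
    ("numeric", numeric, ["age"]),
    ("categorical", categorical, ["workclass"]),
])

out = pre.fit_transform(toy)
# The result may be sparse or dense depending on its density; make it dense for printing
out = out.toarray() if sparse.issparse(out) else np.asarray(out)
print(out)
```

The second row shows both imputations at work: the missing age becomes the column mean (40.0) and the missing work class becomes the most frequent category, `Private`.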

## 3.0 Common Issues ## 

**3.0** Run the following code to define a DecisionTreeModel and load the **income** dataset only with the numerical variables. Then, answer the following questions. 

In [14]:
# Load Data
columns = ['age','fnlwgt','education_num','hours-per-week',"capital-loss","capital-gain","income"]
data = pd.read_csv('income.csv',usecols=columns)
data_test = pd.read_csv('income_test.csv',usecols=columns)

# Convert target variable to 0 and 1
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })
# Create X and y
X_train = data.drop(["income"],axis=1)
y_train = data['income'].values
X_test = data_test.drop(["income"],axis=1)
y_test = data_test['income'].values
# Classifier
classifier = DecisionTreeClassifier(min_samples_leaf=4)
classifier.fit(X_train,y_train)
y_predict = classifier.predict(X_test)
accuracy_scores = accuracy_score(y_test,y_predict)
print("Model score accuracy: %.3f" % accuracy_scores)

Model score accuracy: 0.791


**3.1** Evaluate the classifier using at least three evaluation metrics except accuracy_score and f1 (weighted).

In [15]:
from sklearn.metrics import balanced_accuracy_score, average_precision_score, f1_score, brier_score_loss, jaccard_score
y_predict = classifier.predict(X_test)

# BEGIN CODE HERE
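# Note: all three metrics are computed on the hard 0/1 predictions here;
# average_precision_score and brier_score_loss are normally given scores or probabilities.
metric1 = average_precision_score(y_test,y_predict)  # average precision
metric2 = jaccard_score(y_test,y_predict)            # Jaccard similarity of the positive class
metric3 = brier_score_loss(y_test,y_predict)         # Brier score (lower is better)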
#END CODE HERE

In [None]:
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)
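
Two of the metrics above, average precision and the Brier score, are usually computed from predicted probabilities rather than hard labels. The following sketch illustrates the difference on synthetic stand-in data (generated with `make_classification`, not the income dataset); the variable names are introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, average_precision_score, brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 75% / 25%)
X, y = make_classification(n_samples=1000, weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

tree = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(X_tr, y_tr)
hard = tree.predict(X_te)               # hard 0/1 labels
proba = tree.predict_proba(X_te)[:, 1]  # estimated probability of class 1

# With hard labels the Brier score reduces to the error rate (1 - accuracy)
brier_hard = brier_score_loss(y_te, hard)
print(brier_hard, 1 - accuracy_score(y_te, hard))

# With probabilities, both metrics also reflect how confident each prediction is
brier_proba = brier_score_loss(y_te, proba)
print(brier_proba, average_precision_score(y_te, proba))
```

This also explains why, in this notebook, the Brier score on hard predictions (0.209) is exactly one minus the accuracy (0.791).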

**3.2** Do you notice any problems with the classifier? If so, what can you do to fix them?

# ANSWERS

What is wrong with the classifier?
---
  The classifier performs poorly, and a closer look at the code shows why this is to be expected. The model is trained on raw data: no analysis has been done, we know nothing about the distribution of the data, we do not check for missing values, and the categorical features are not encoded but dropped altogether. Furthermore, nothing is done about class balance before fitting. The only constraint on the tree is `min_samples_leaf=4`, an arbitrary value that was not tuned; no other pre-pruning (such as a limit on depth) is applied. All of the above leaves the model prone to overfitting or underfitting.

What can we do to increase the performance of the classifier?
---
 - Regarding the data, as mentioned before, we should first analyze the dataset to learn about its distribution and uncover patterns. We can drop columns with missing values or fill them in (with the mean, the most frequent value, etc.). We can separate the numerical features from the categorical ones and encode the latter into meaningful values, although in this case all categorical features are dropped from the dataset. Last but not least, we can merge and re-split the data using stratification, so that the train and test sets keep the same class proportions.
 - Regarding the model, we can apply pre-pruning by setting hyperparameters such as the tree's maximum depth, the minimum number of samples per leaf, the maximum number of leaf nodes, and the minimum number of samples required to split a node. Tree models overfit or underfit easily, as shown before, so assigning arbitrary values to these parameters is not recommended. Instead, we can perform a grid search to tune them and pick the combination that performs best under cross-validation.
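
The effect of stratification mentioned above can be checked directly. This minimal sketch uses synthetic labels with an assumed 75/25 class ratio (not the income data) and shows that a stratified split keeps that ratio in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 75/25 class ratio (an assumption, for illustration only)
y = np.array([0] * 750 + [1] * 250)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# Share of the positive class in each part; both stay close to 0.25
print(round(float(y_tr.mean()), 3), round(float(y_te.mean()), 3))
```

Stratification does not remove the imbalance itself; it only guarantees that the train and test sets are imbalanced in the same way.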

**3.3** Implement your solution using the cells below. Report your results and the process you followed. You are recommended to use stratification and grid search. You should only need to improve the metrics you calculated above slightly, and also reach an accuracy score higher than 82%!

In [17]:
# BEGIN CODE HERE
columns = ['age','fnlwgt','education_num','hours-per-week',"capital-loss","capital-gain","income"]
data = pd.read_csv('income.csv', usecols=columns) #LOAD TRAIN DATASET
data_test = pd.read_csv('income_test.csv', usecols=columns) #LOAD TEST DATASET
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 }) #ENCODE INCOME - TRAIN DATASET
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 }) #ENCODE INCOME - TEST DATASET
# data_all = pd.concat([data, data_test], ignore_index=True) #MERGE DATASETS

#CONFIGURE INPUT-OUTPUT
y = data['income']
X = data.drop('income',axis=1)

#SPLIT AGAIN USING STRATIFICATION
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_VARIABLE, stratify=y)

#BUILD THE USUAL PIPELINE AND EVALUATE THE TUNED MODEL

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer, X_train.columns)
    ]
)

new_clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", DecisionTreeClassifier())])

param_grid = [{'classifier__criterion': ['gini', 'entropy'],
               'classifier__splitter': ['best', 'random'],
               'classifier__max_depth': [5, 10, 15],
               'classifier__min_samples_leaf': [2, 5, 8],
               'classifier__max_leaf_nodes': [12, 16, 18],
               'preprocessor__numeric__imputer__strategy': ['mean', 'median', 'most_frequent']
               }]

grid_search = GridSearchCV(new_clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
y_predict =  grid_search.predict(X_test)

metric1 = average_precision_score(y_test,y_predict)
metric2 = jaccard_score(y_test,y_predict)
metric3 = brier_score_loss(y_test,y_predict)

final_score = ""

#END CODE HERE

In [None]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test,y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test,y_predict,average='weighted'))
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)

Results:
---
We can see that model accuracy went up to 83.3%, which satisfies the requirement of more than 82%, while the weighted F1 score is 81.6%. Metric 1 (average precision) went from 0.414 up to 0.473, Metric 2 (Jaccard score, a similarity measure) went from 0.363 up to 0.390 and, finally, Metric 3 (Brier score loss, a mean squared error, so lower is better) dropped from 0.209 to 0.167.

What did we do differently?
---
First of all, we re-split the data using stratification, so that the train and test parts keep the same class proportions. (The merge of the two files is commented out in the code, so the split is drawn from income.csv only, and the scores above are measured on a held-out 33% of that file.) Missing values are imputed inside the pipeline. We kept only the numerical features, as asked, although we could have encoded the categorical features into meaningful values using OneHotEncoder. Last but not least, we performed a grid search to tune every chosen hyperparameter. Together, these steps improved the performance of the classifier.