## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). 

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [1]:
NAME = "Αλέξανδρος Τσίγγος"
AEM = "3690"

---

# Assignment 2 - Decision Trees #

Welcome to your second assignment. This exercise introduces [scikit-learn](https://scikit-learn.org/stable/), a simple but efficient machine learning library for Python. It also gives you a broad understanding of how decision trees work.

After this assignment you will:
- Be able to use the scikit-learn library and train your own model from scratch.
- Be able to train and understand decision trees.

In [2]:
# Always run this cell
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# USE THIS RANDOM VARIABLE TO PRODUCE THE SAME RESULTS
RANDOM_VARIABLE = 42

## 1. Scikit-Learn and Decision Trees ##

You are going to use the scikit-learn library to train a model for detecting breast cancer using the [Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) (+ [Additional information](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset)) by training a model using [decision trees](https://scikit-learn.org/stable/modules/tree.html).

**1.1** Load the breast cancer dataset using the scikit learn library and split the dataset into train and test set using the appropriate function. Use 33% of the dataset as the test set. Define as X the attributes and as y the target values. Do not forget to set the random_state parameter as the *RANDOM_VARIABLE* defined above. Use this variable for all the random_state parameters in this assignment.

In [3]:
# BEGIN CODE HERE
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_VARIABLE)

#END CODE HERE

In [4]:
print("Size of train set:{}".format(len(y_train)))
print("Size of test set:{}".format(len(y_test)))
print("Unique classes:{}".format(len(set(y_test))))

Size of train set:381
Size of test set:188
Unique classes:2


**Expected output**:  

```
Size of train set:381  
Size of test set:188  
Unique classes:2
```



**1.2** Train two DecisionTree classifiers and report the F1 score. Use information gain for one classifier and Gini impurity for the other.

In [5]:
# BEGIN CODE HERE
classifier_gini = DecisionTreeClassifier(criterion="gini", random_state=RANDOM_VARIABLE)
classifier_igain = DecisionTreeClassifier(criterion="entropy",random_state=RANDOM_VARIABLE)

classifier_gini.fit(X_train,y_train)
classifier_igain.fit(X_train,y_train)

prediction_gini = classifier_gini.predict(X_test)
prediction_igain = classifier_igain.predict(X_test)

f_measure_gini = f1_score(y_pred=prediction_gini, y_true=y_test)
f_measure_igain = f1_score(y_pred=prediction_igain, y_true=y_test)

#END CODE HERE

In [6]:
print("F-Measure Gini: {}".format(f_measure_gini))
print("F-Measure Information Gain: {}".format(f_measure_igain))

F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386


**Expected output**:  

```
F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386
```
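
To make the two splitting criteria concrete, here is a minimal sketch of how Gini impurity and entropy can be computed for a node from its class counts. The helper names `gini_impurity`, `entropy` and `information_gain` and the example counts are illustrative assumptions, not part of scikit-learn or of the dataset. With `criterion="entropy"`, a split is judged by how much it reduces entropy, which is the information gain.

```python
import math

def gini_impurity(counts):
    """Gini impurity of a node given the number of samples per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node given the number of samples per class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 50 samples of each class, split into two children.
parent = [50, 50]
children = [[40, 10], [10, 40]]
print(gini_impurity(parent))   # 0.5
print(entropy(parent))         # 1.0
print(information_gain(parent, children))
```

A perfectly mixed binary node has the maximum impurity under both criteria (0.5 for Gini, 1 bit for entropy), and a pure node has impurity 0 under both.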

**1.3** Find the maximum depth reached by the tree that used the Gini impurity. Train multiple classifiers by modifying the max_depth within the range from 1 to maximum depth and save the f1 scores to the corresponding list of the *fscores* dictionary (one list for training set and one for test set). Before appending the scores to the corresponding list, multiply them by 100, and round the values to 2 decimals.

In [7]:
# BEGIN CODE HERE
depth = classifier_gini.get_depth()
fscores = {}
fscores['train'] = []
fscores['test'] = []

for i in range(1, depth + 1):
    classifier = DecisionTreeClassifier(criterion="gini", random_state=RANDOM_VARIABLE, max_depth=i)
    classifier.fit(X_train, y_train)
    prediction_train = classifier.predict(X_train)
    prediction_test = classifier.predict(X_test)
    fscores['train'].append(round(f1_score(y_pred=prediction_train,y_true=y_train)*100,2))
    fscores['test'].append(round(f1_score(y_pred=prediction_test,y_true=y_test)*100,2))

#END CODE HERE

In [8]:
print("Fscores Train: {}".format(fscores['train']))
print("Fscores Test:  {}".format(fscores['test']))


Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]


**Expected output**:  
```
Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]
```

**1.4** Compare the results from the train set with the results from the test set. What do you notice? How are you going to choose the max_depth of your model?

<ul>
    <li>First, we observe that the F1 score on the training set is higher than on the test set at every depth, which is expected since the model was trained on the training set.</li>
    <li>Next, on the training set the F1 score keeps improving as we increase the max_depth hyperparameter, until it reaches 100 (that is, the model classifies the examples it was given perfectly), which means the model has overfit. The overfitting at larger max_depth values is confirmed by the test-set F1 score, which rises up to a point and then starts to fall (96.64 --> 94.12).</li>
    <li>Therefore, we would look for the depth up to which the F1 score increases on both the training and the test set and choose that as max_depth. In our example we would choose max_depth = 3, since beyond that depth the F1 score on the test set decreases, which indicates overfitting.</li>
</ul>
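
The answer above picks max_depth by looking at the test-set scores. A common alternative, sketched here as an assumption rather than as the approach used in this notebook, is to compare depths with cross-validation on the training set only, so that the test set stays untouched until the final evaluation. The 5 folds and the candidate depths 1 to 7 are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

RANDOM_VARIABLE = 42

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=RANDOM_VARIABLE
)

# Mean 5-fold cross-validated F1 on the training set for each candidate depth.
cv_scores = {}
for depth in range(1, 8):
    tree = DecisionTreeClassifier(
        criterion="gini", max_depth=depth, random_state=RANDOM_VARIABLE
    )
    scores = cross_val_score(tree, X_train, y_train, cv=5, scoring="f1")
    cv_scores[depth] = scores.mean()

# Pick the depth with the highest mean validation score.
best_depth = max(cv_scores, key=cv_scores.get)
print("Best max_depth by cross-validation:", best_depth)
```

The depth selected this way may differ from the one suggested by the test-set scores, because it is based on held-out folds of the training data.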

## 2.0 Pipelines ##

**2.1** In this part of the exercise you are going to build a pipeline from scratch for a classification problem. Load the **income.csv** file and train a DecisionTree model that will predict the *income* variable. This dataset is a modification of the original Adult Income dataset found [here](http://archive.ics.uci.edu/ml/datasets/Adult). Report the f1-score and accuracy score of the test set found in **income_test.csv**. Your pipeline should be able to handle missing values and categorical features (scikit-learn's decision trees do not handle categorical values). You can preprocess the dataset as you like in order to achieve higher scores.

In [9]:
# BEGIN CODE HERE
data = pd.read_csv('income.csv')
train_set = data[['age','workclass','fnlwgt','education','education_num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week']]
y_train = data['income'].map({ "<=50K": 0, ">50K": 1 })
# any other code you need

test_data = pd.read_csv('income_test.csv')
test_set = test_data[['age','workclass','fnlwgt','education','education_num','marital-status','occupation','relationship','race','sex','capital-gain','capital-loss','hours-per-week']]
y_test = test_data['income'].map({ "<=50K": 0, ">50K": 1 })
# any other code you need
# End CODE HERE

**2.2** Create and test your pipeline

In [10]:
#Your pipeline
numeric_features = ["age", "fnlwgt","education_num","capital-gain", "capital-loss","hours-per-week"]
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
categorical_features = ["workclass", "education","marital-status","occupation","relationship","race","sex"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("one_hot_cat", categorical_transformer, categorical_features)
    ]
)
clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", DecisionTreeClassifier(random_state=RANDOM_VARIABLE))])
clf.fit(train_set,y_train)
y_predict =  clf.predict(test_set)

In [11]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test, y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test, y_predict,average='weighted'))

Model score Accuracy: 0.807
Model score F1 Weighted: 0.808


**2.3** Perform a good grid search to find the best parameters for your pipeline.

In [13]:
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [2, 5, 10],
    "classifier__criterion": ["gini","entropy"],
    "classifier__max_features": [3, 7, 9, 11, None],
    "classifier__min_samples_leaf": [1,10,20,50],
}

grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1)
grid_search.fit(train_set,y_train)
y_predict =  grid_search.predict(test_set)

print("Best params:")
print(grid_search.best_params_)

Best params:
{'classifier__criterion': 'gini', 'classifier__max_depth': 10, 'classifier__max_features': None, 'classifier__min_samples_leaf': 10, 'preprocessor__num__imputer__strategy': 'mean'}



In [None]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test,y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test,y_predict,average='weighted'))
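
An exhaustive grid search gets expensive quickly, because every combination in the grid is fitted once per cross-validation fold. The following sketch simply counts the work implied by the grid from 2.3. It uses plain Python arithmetic rather than any scikit-learn API, and the variable names `n_candidates` and `n_fits` are illustrative.

```python
import math

param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [2, 5, 10],
    "classifier__criterion": ["gini", "entropy"],
    "classifier__max_features": [3, 7, 9, 11, None],
    "classifier__min_samples_leaf": [1, 10, 20, 50],
}

# Number of hyperparameter combinations: product of the list lengths.
n_candidates = math.prod(len(values) for values in param_grid.values())
cv_folds = 5
# Each candidate is fitted once per fold (a final refit on all data comes on top).
n_fits = n_candidates * cv_folds

print(n_candidates)  # 240
print(n_fits)        # 1200
```

The keys follow the `step__parameter` convention: each double underscore descends one level, from the pipeline step name into the nested transformer or estimator.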

**2.4** Describe the process you followed to achieve the results above. Your description should include, but is not limited to, the following:
- How do you handle missing values and why
- How do you handle categorical variables and why
- Any further preprocessing steps
- How do you evaluate your model and how did you choose its parameters 
- Report any additional results and comments on your approach.

You should achieve at least 85% accuracy score and 84% f1 score.

<ul>
    <li>How do you handle missing values and why</li>
        <ul>
            <li>Our dataset has missing values in the "workclass" feature (categorical) and in the "occupation" feature (categorical). We could delete the examples that have missing values; however, to avoid losing information from the dataset, we chose to impute the missing values. For numerical features we used the median strategy, while for categorical features we used most_frequent, which is meaningful for categorical data.</li>
        </ul>
    <li>How do you handle categorical variables and why</li>
        <ul>
            <li>For the categorical variables we first imputed the missing values to eliminate them from the dataset, and then used the one-hot encoder to encode the categorical features as numerical ones, since scikit-learn's CART implementation accepts only numerical features.</li>
        </ul>
    <li>Any further preprocessing steps</li>
        <ul><li>We did nothing beyond what is already mentioned. No normalization/standardization was applied, because we are working with tree-based models, whose threshold splits are insensitive to feature scaling.</li></ul>
    <li>How do you evaluate your model and how did you choose its parameters</li>
        <ul>
            <li>Our model is evaluated with the Accuracy and F1 score (weighted) metrics, while the model's hyperparameters were selected with GridSearchCV: all combinations of the hyperparameters defined in the param_grid variable are tried, and the combination that yields the highest cross-validated score is selected.</li>
        </ul>
     <li>Report any additional results and comments on your approach.</li>
        <ul><li>Our model achieved Accuracy: 0.857 and F1 Weighted: 0.850.</li></ul>
</ul>
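
The imputation and encoding steps described above can be illustrated on a tiny example. This is a sketch on a hypothetical toy column (the values are not taken from income.csv), and it assumes that missing entries are represented as NaN. It shows that `most_frequent` fills the gap with the most common category and that `handle_unknown="ignore"` encodes a category unseen during training as an all-zero row.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column; the values are illustrative, not taken from income.csv.
train = pd.DataFrame(
    {"workclass": ["Private", "Private", np.nan, "State-gov"]}, dtype=object
)
test = pd.DataFrame({"workclass": ["Private", "Never-worked"]}, dtype=object)

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Fit on the training column only, then reuse the fitted steps on the test column.
encoded_train = categorical_transformer.fit_transform(train).toarray()
encoded_test = categorical_transformer.transform(test).toarray()

print(encoded_train)  # the NaN row is encoded like "Private"
print(encoded_test)   # "Never-worked" was never seen, so its row is all zeros
```

Fitting the transformer on the training data alone keeps information from the test set out of the preprocessing, which is what wrapping the steps in a `Pipeline` achieves automatically.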

## 3.0 Common Issues ## 

**3.0** Run the following code to define a DecisionTreeModel and load the **income** dataset only with the numerical variables. Then, answer the following questions. 

In [63]:
# Load Data
columns = ['age','fnlwgt','education_num','hours-per-week',"capital-loss","capital-gain","income"]
data = pd.read_csv('income.csv',usecols=columns)
data_test = pd.read_csv('income_test.csv',usecols=columns)
# Convert target variable to 0 and 1
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })
# Create X and y
X_train = data.drop(["income"],axis=1)
y_train = data['income'].values
X_test = data_test.drop(["income"],axis=1)
y_test = data_test['income'].values
# Classifier
classifier = DecisionTreeClassifier(min_samples_leaf=4)
classifier.fit(X_train,y_train)
y_predict = classifier.predict(X_test)  # added by me; otherwise the next line would use the y_predict from exercise 2.3
accuracy = accuracy_score(y_test,y_predict)  # renamed the variable from accuracy_score to accuracy so the imported function is not shadowed
print("Model score accuracy: %.3f" % accuracy)

Model score accuracy: 0.790


**3.1** Evaluate the classifier using at least three evaluation metrics except accuracy_score and f1 (weighted).

In [64]:
from sklearn.metrics import balanced_accuracy_score, average_precision_score, f1_score
y_predict = classifier.predict(X_test)

# BEGIN CODE HERE
metric1 = f1_score(y_test,y_predict,average='macro')
metric2 = balanced_accuracy_score(y_test,y_predict)
metric3 = average_precision_score(y_test,y_predict)
#END CODE HERE

In [65]:
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)

Model score Metric 1: 0.698
Model score Metric 2: 0.687
Model score Metric 3: 0.413
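
A note on Metric 3: `average_precision_score` is designed for continuous scores, and passing hard 0/1 predictions collapses the precision-recall curve to a single threshold. The sketch below contrasts the two usages on synthetic, imbalanced stand-in data generated with `make_classification`. The data, the class weights and the random seeds are assumptions for illustration only, so the numbers do not correspond to the income dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption; not the income dataset).
X, y = make_classification(
    n_samples=2000, n_features=6, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42
)

classifier = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
classifier.fit(X_train, y_train)

# Hard 0/1 predictions: only one threshold is available.
ap_from_labels = average_precision_score(y_test, classifier.predict(X_test))
# Positive-class probabilities: the metric can rank the samples.
ap_from_scores = average_precision_score(
    y_test, classifier.predict_proba(X_test)[:, 1]
)

print("AP from hard labels:   %.3f" % ap_from_labels)
print("AP from probabilities: %.3f" % ap_from_scores)
```

The two values generally differ, so it is worth stating which input was used whenever this metric is reported.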


**3.2** Do you notice any problems with the classifier? If so, what can you do to change this.

The problem we observe with the classifier is that only one hyperparameter is set (min_samples_leaf), so the decision tree has no significant constraint during training. In this case there is therefore a risk of overfitting. To address this, we will run a stratified grid search to find the combination of hyperparameters that leads to better model performance.

**3.3** Implement your solution using the cells below. Report your results and the process you followed. You are recommended to use stratification and grid search. You should only need to increase the metrics you calculated above slightly, and also reach an accuracy score higher than 82%!

In [66]:
from sklearn.model_selection import StratifiedShuffleSplit

# BEGIN CODE HERE
final_score = ""
stratification = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=RANDOM_VARIABLE)
param_grid = [
    {'criterion': ["gini","entropy"],
     'max_features': [2, 4, 5, None],
     'min_samples_leaf': [1,10,20,50],
     'max_depth': [2,5,10]}
]
grid_search = GridSearchCV(classifier, param_grid, cv=stratification,
                           scoring='accuracy', return_train_score=True)
grid_search.fit(X_train,y_train)
y_predict = grid_search.predict(X_test)
accuracy = accuracy_score(y_test,y_predict)
metric1 = f1_score(y_test,y_predict,average='macro')
metric2 = balanced_accuracy_score(y_test,y_predict)
metric3 = average_precision_score(y_test,y_predict)
#END CODE HERE

In [67]:
print("Model score accuracy: %.3f" % accuracy)  # renamed the variable from accuracy_score to accuracy
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)

Model score accuracy: 0.824
Model score Metric 1: 0.717
Model score Metric 2: 0.689
Model score Metric 3: 0.462


Above, we applied a stratified grid search over four parameters (criterion, max_features, min_samples_leaf, max_depth) and trained our model with the combination of hyperparameters that the grid search returned. The result was a small increase in metrics 1, 2 and 3, and an accuracy score of 0.824.
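
To illustrate what the stratification contributes, the following sketch checks that `StratifiedShuffleSplit` keeps the class ratio the same in every training and validation split. The labels are hypothetical (75% class 0, 25% class 1) and are not the income data. The splitter settings mirror the ones used in the search.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 75% class 0, 25% class 1.
y = np.array([0] * 750 + [1] * 250)
X = np.zeros((len(y), 1))  # the features are irrelevant for the split itself

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Fraction of positive labels in each training and validation split.
ratios = []
for train_idx, val_idx in splitter.split(X, y):
    ratios.append((y[train_idx].mean(), y[val_idx].mean()))

for fold, (train_ratio, val_ratio) in enumerate(ratios):
    print(fold, round(train_ratio, 3), round(val_ratio, 3))
```

Because every validation split has the same share of positives as the full training set, the scores of different hyperparameter combinations are compared on folds with a consistent class balance.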