## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). 

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

---

# Assignment 2 - Decision Trees #

Welcome to your second assignment. This exercise introduces [scikit-learn](https://scikit-learn.org/stable/), a simple but efficient machine learning library for Python. It also gives you a broad understanding of how decision trees work.

After this assignment you will:
- Be able to use the scikit-learn library and train your own model from scratch.
- Be able to train and understand decision trees.

In [2]:
# Always run this cell
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# USE THIS RANDOM VARIABLE TO PRODUCE THE SAME RESULTS
RANDOM_VARIABLE = 42

## 1. Scikit-Learn and Decision Trees ##

You are going to use the scikit-learn library to train a [decision tree](https://scikit-learn.org/stable/modules/tree.html) model that detects breast cancer, using the [Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) (+ [Additional information](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset)).

**1.1** Load the breast cancer dataset using the scikit-learn library and split it into a train set and a test set using the appropriate function. Use 33% of the dataset as the test set. Define X as the attributes and y as the target values. Do not forget to set the random_state parameter to the *RANDOM_VARIABLE* defined above. Use this variable for all the random_state parameters in this assignment.

In [3]:
# BEGIN CODE HERE
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state= RANDOM_VARIABLE)

#END CODE HERE

In [4]:
print("Size of train set:{}".format(len(y_train)))
print("Size of test set:{}".format(len(y_test)))
print("Unique classes:{}".format(len(set(y_test))))

Size of train set:381
Size of test set:188
Unique classes:2


**Expected output**:  

```
Size of train set:381  
Size of test set:188  
Unique classes:2
```
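
The sizes above follow from the dataset size and the requested test share. Here is a minimal sketch of the arithmetic. It assumes 569 samples in total (381 + 188) and assumes that the test share is rounded up to a whole number of samples, with the train set receiving the remainder; both assumptions are consistent with the expected output.

```python
import math

n_samples = 569    # 381 + 188, the totals reported above
test_size = 0.33   # test share requested in 1.1

# Assumption: the test share is rounded up and the train set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train:", n_train, "test:", n_test)
```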



**1.2** Train two DecisionTree classifiers and report the F1 score. Use the information gain for one classifier and the Gini impurity for the other.

In [5]:
# BEGIN CODE HERE
classifier_gini = DecisionTreeClassifier(random_state= RANDOM_VARIABLE) # criterion= "gini" by default
classifier_igain = DecisionTreeClassifier(criterion="entropy", random_state= RANDOM_VARIABLE)

classifier_gini.fit(X_train,y_train)
classifier_igain.fit(X_train,y_train)

prediction_gini = classifier_gini.predict(X_test)
prediction_igain = classifier_igain.predict(X_test)

f_measure_gini = f1_score(y_test, prediction_gini)
f_measure_igain = f1_score(y_test, prediction_igain)

#END CODE HERE

In [6]:
print("F-Measure Gini: {}".format(f_measure_gini))
print("F-Measure Information Gain: {}".format(f_measure_igain))

F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386


**Expected output**:  

```
F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386
```
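
The two criteria differ only in how they measure the impurity of a node. The sketch below computes both from per-class sample counts, along with the information gain of a split (parent entropy minus the size-weighted entropy of the children). The function names and the example node are hypothetical and only illustrate the formulas; they are not scikit-learn's implementation.

```python
import math

def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node: 30 samples of one class and 10 of the other.
node = [30, 10]
print("gini:   ", round(gini(node), 4))
print("entropy:", round(entropy(node), 4))

# A split that separates the two classes perfectly recovers the full entropy.
print("gain:   ", round(information_gain(node, [[30, 0], [0, 10]]), 4))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why the two trees often pick similar, though not identical, splits.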

**1.3** Find the maximum depth reached by the tree that used the Gini impurity. Train multiple classifiers by modifying the max_depth within the range from 1 to maximum depth and save the f1 scores to the corresponding list of the *fscores* dictionary (one list for training set and one for test set). Before appending the scores to the corresponding list, multiply them by 100, and round the values to 2 decimals.

In [7]:
# BEGIN CODE HERE
depth = classifier_gini.get_depth()
fscores = {}
fscores['train'] = []
fscores['test'] = []

for i in range(1,depth+1):
    clf = DecisionTreeClassifier(max_depth=i, random_state=RANDOM_VARIABLE).fit(X_train, y_train)
    pred_train = clf.predict(X_train)
    pred_test = clf.predict(X_test)
    f1_train = round(f1_score(y_train, pred_train)*100, 2)
    f1_test = round(f1_score(y_test, pred_test)*100, 2)
    fscores['train'].append(f1_train)
    fscores['test'].append(f1_test)
#END CODE HERE

In [8]:
print("Fscores Train: {}".format(fscores['train']))
print("Fscores Test:  {}".format(fscores['test']))


Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]


**Expected output**:  
```
Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]
```

**1.4** Compare the results from the train set with the results from the test set. What do you notice? How are you going to choose the max_depth of your model?

On the train set, the F1 score improves as max_depth grows. At max_depth=7 the train F1 score reaches 100, which means that the model classifies every observation of the training set correctly. On the test set, however, the best F1 score is obtained at max_depth=3, and for max_depth > 3 the scores are lower. This is because the model starts overfitting the training data, so it performs worse on the test set. In my opinion, the best choice is max_depth=3, because the model performs almost equally well on the train and the test set and therefore generalizes better.
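
The comparison can be made explicit with a short sketch that uses the scores printed in 1.3. It lists the train/test gap per depth and picks the depth with the best test score. Note that selecting on the test set, as done here for illustration, gives an optimistic estimate; cross-validation on the training data would be a cleaner way to choose.

```python
# F1 scores (in %) copied from the output of 1.3; index 0 is max_depth=1.
train = [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
test = [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]

# A growing gap between train and test scores is a sign of overfitting.
for depth, (tr, te) in enumerate(zip(train, test), start=1):
    print(f"max_depth={depth}: train={tr:6.2f}  test={te:6.2f}  gap={tr - te:5.2f}")

# Depth whose test score is highest.
best_depth = max(range(1, len(test) + 1), key=lambda d: test[d - 1])
print("best max_depth on the test set:", best_depth)
```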

## 2.0 Pipelines ##

**2.1** In this part of the exercise you are going to build a pipeline from scratch for a classification problem. Load the **income.csv** file and train a DecisionTree model that will predict the *income* variable. This dataset is a modification of the original Adult Income dataset found [here](http://archive.ics.uci.edu/ml/datasets/Adult). Report the f1-score and accuracy score of the test set found in **income_test.csv**. Your pipeline should be able to handle missing values and categorical features (scikit-learn's decision trees do not handle categorical values). You can preprocess the dataset as you like in order to achieve higher scores.  

In [9]:
# BEGIN CODE HERE
data=pd.read_csv('./income.csv')
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
train_set = data.iloc[:,:-1]
y_train = data["income"].values
# any other code you need

data_test = pd.read_csv('./income_test.csv')
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })
test_set = data_test.iloc[:,:-1]
y_test = data_test['income'].values

# any other code you need
#END CODE HERE

**2.2** Create and test your pipeline

In [10]:
#Your pipeline
numeric_columns= ['age', 'fnlwgt', 'education_num', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_columns= ['workclass','race','marital-status','occupation','relationship']
ordinal1= OrdinalEncoder(categories=[['Preschool','Dropout', 'HS-grad', 'CommunityCollege', 'Bachelors', 'Masters', 'Doctorate']])
ordinal2= OrdinalEncoder(categories= 'auto')
standard = StandardScaler()
column_transforms = ColumnTransformer([('OneHot', OneHotEncoder(handle_unknown='ignore'), cat_columns), ('Ordinal', ordinal1, ['education']),
                                         ('Ordinal2', ordinal2, ['sex']), ('Standard', standard, numeric_columns)])


imputer= SimpleImputer(strategy= 'mean')
tree = DecisionTreeClassifier(criterion='entropy' , max_features= None , random_state= RANDOM_VARIABLE)

clf = Pipeline([('preprocessing', column_transforms),('imputer', imputer), ('classifier', tree)]) 
clf.fit(train_set,y_train)
y_predict =  clf.predict(test_set)


In [11]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test, y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test, y_predict,average='weighted'))

Model score Accuracy: 0.811
Model score F1 Weighted: 0.811


**2.3** Perform a good grid search to find the best parameters for your pipeline

In [12]:
param_grid = {
    'imputer__strategy':['mean', 'median'],
    'classifier__max_depth':[12, 17, 30],
    'classifier__criterion':['gini', 'entropy'],
    'classifier__min_samples_leaf':[5, 17, 21],
    'classifier__max_features': [None, 0.25, 0.5, 0.75]
}

grid_search = GridSearchCV( clf, param_grid, cv = 5, verbose=1, scoring= 'accuracy', n_jobs= 8)
grid_search.fit(train_set, y_train)
y_predict_grid =  grid_search.predict(test_set)

print("Best params:")
print(grid_search.best_params_)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best params:
{'classifier__criterion': 'gini', 'classifier__max_depth': 12, 'classifier__max_features': 0.5, 'classifier__min_samples_leaf': 21, 'imputer__strategy': 'mean'}
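
The log line "Fitting 5 folds for each of 144 candidates, totalling 720 fits" follows directly from the size of the grid. This sketch enumerates every combination of the grid above with `itertools.product`; it only counts candidates and does not train anything.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'imputer__strategy': ['mean', 'median'],
    'classifier__max_depth': [12, 17, 30],
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__min_samples_leaf': [5, 17, 21],
    'classifier__max_features': [None, 0.25, 0.5, 0.75],
}
cv = 5  # number of cross-validation folds

# One candidate per combination of parameter values.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * cv

print(len(candidates), "candidates,", n_fits, "fits")
```

Because the count is the product of the list lengths (2 x 3 x 2 x 3 x 4), adding one more value to any list multiplies the runtime accordingly.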


In [13]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test,y_predict_grid))
print("Model score F1 Weighted: %.3f" % f1_score(y_test,y_predict_grid,average='weighted'))

Model score Accuracy: 0.851
Model score F1 Weighted: 0.844


**2.4** Describe the process you followed to achieve the results above. Your description should include, but is not limited to, the following:
- How do you handle missing values and why
- How do you handle categorical variables and why
- Any further preprocessing steps
- How do you evaluate your model and how did you choose its parameters 
- Report any additional results and comments on your approach.

You should achieve at least 85% accuracy score and 84% f1 score.

First, I split the columns of the dataset into numeric and categorical columns and left out the 'education' and 'sex' columns in order to handle them separately. For the categorical columns, I used OneHotEncoder, because scikit-learn's DecisionTreeClassifier cannot handle categorical data. For the 'education' column I used an OrdinalEncoder and specified the categories explicitly so that they are encoded in the right order. For the 'sex' column I also used an OrdinalEncoder with categories set to 'auto' (the order makes no difference here). I also used a StandardScaler on the numeric columns to bring them to zero mean and unit variance. To apply these separate steps to different columns, I used a ColumnTransformer. Once all the data is numeric, a SimpleImputer handles the missing values. Finally, I defined a DecisionTreeClassifier and set up the Pipeline.
To find the best parameters for the model, I used GridSearchCV and tried out different strategies for the imputer and different parameters for the classifier. I tried many different combinations, but to keep the runtime relatively low, I kept only a subset of those options in the param_grid. The highest accuracy and weighted F1 score my model achieved were 85.1% and 84.4% respectively.
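
To make the two encodings described above concrete, here is a minimal pure-Python sketch of what they compute. It is not scikit-learn's implementation: the helper names `ordinal_encode` and `one_hot_encode` and the sample `workclass` values are hypothetical, while the education order is the one passed to `OrdinalEncoder` in 2.2.

```python
# Category order copied from the OrdinalEncoder in 2.2.
EDUCATION_ORDER = ['Preschool', 'Dropout', 'HS-grad', 'CommunityCollege',
                   'Bachelors', 'Masters', 'Doctorate']

def ordinal_encode(value, categories):
    """Map a category to its position in an explicitly ordered list."""
    return categories.index(value)

def one_hot_encode(value, categories):
    """One indicator per known category; an unseen value yields all zeros,
    mimicking the effect of handle_unknown='ignore'."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical category values seen while fitting.
workclass_seen = ['Private', 'Self-emp', 'Gov']

print(ordinal_encode('Bachelors', EDUCATION_ORDER))    # position in the ordered list
print(one_hot_encode('Gov', workclass_seen))           # known category
print(one_hot_encode('Never-worked', workclass_seen))  # unseen category
```

The ordinal encoding keeps the ranking of education levels in a single column, whereas one-hot encoding avoids imposing an artificial order on categories such as `workclass`.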

## 3.0 Common Issues ## 

**3.0** Run the following code to define a DecisionTreeModel and load the **income** dataset only with the numerical variables. Then, answer the following questions. 

In [14]:
# Load Data
columns = ['age','fnlwgt','education_num','hours-per-week',"capital-loss","capital-gain","income"]
data = pd.read_csv('income.csv',usecols=columns)
data_test = pd.read_csv('income_test.csv',usecols=columns)
# Convert target variable to 0 and 1
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })
# Create X and y
X_train = data.drop(["income"],axis=1)
y_train = data['income'].values
X_test = data_test.drop(["income"],axis=1)
y_test = data_test['income'].values
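# Classifier
classifier = DecisionTreeClassifier(min_samples_leaf=4)
classifier.fit(X_train,y_train)
# Predict with this classifier (not the earlier pipeline) and keep the
# accuracy_score function usable by storing the result under another name
y_predict = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_predict)
print("Model score accuracy: %.3f" % accuracy)

Model score accuracy: 0.791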


**3.1** Evaluate the classifier using at least three evaluation metrics other than accuracy_score and f1 (weighted).

In [15]:
from sklearn.metrics import balanced_accuracy_score, average_precision_score, f1_score, confusion_matrix
y_predict = classifier.predict(X_test) 

# BEGIN CODE HERE
metric1 = balanced_accuracy_score(y_test,y_predict)
metric2 = average_precision_score(y_test,y_predict)
metric3 = f1_score(y_test,y_predict )
metric4 = confusion_matrix(y_test, y_predict)
print(metric4)
#END CODE HERE

[[10295  1248]
 [ 1957  1815]]


In [16]:
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)

Model score Metric 1: 0.687
Model score Metric 2: 0.413
Model score Metric 3: 0.531


**3.2** Do you notice any problems with the classifier? If so, what can you do to fix them?

Problems:
The model has a seemingly good accuracy score (roughly 80%), but the balanced accuracy score is substantially worse (68.7%). The other metrics (average precision score, F1 score) also indicate that the model is not performing well. This suggests that the dataset is imbalanced, and printing the confusion matrix makes this clearer. The training set has 7841 samples in one class (class 1) and 24727 in the other (class 0). The test set is similar (3772 samples in class 1 and 11543 samples in class 0). Decision tree classifiers do not handle imbalanced datasets very well. The conclusion is that the model benefits from the imbalance: it appears to be doing well according to the accuracy score, when in reality it is only doing well at classifying samples from the majority class.
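
The per-class picture can be read directly off the confusion matrix printed in 3.1. The sketch below recomputes the recall of each class, the balanced accuracy and the F1 score of class 1 from those four counts, and adds the accuracy of a trivial model that always predicts the majority class for comparison. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed in 3.1: rows are true classes, columns are predictions.
tn, fp = 10295, 1248   # true class 0
fn, tp = 1957, 1815    # true class 1

recall_0 = tn / (tn + fp)       # share of class 0 samples classified correctly
recall_1 = tp / (tp + fn)       # share of class 1 samples classified correctly
precision_1 = tp / (tp + fp)    # share of class 1 predictions that are correct

balanced_accuracy = (recall_0 + recall_1) / 2
f1_class_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy of a trivial model that always predicts the majority class (0).
majority_baseline = (tn + fp) / (tn + fp + fn + tp)

print(f"recall class 0:     {recall_0:.3f}")
print(f"recall class 1:     {recall_1:.3f}")
print(f"balanced accuracy:  {balanced_accuracy:.3f}")
print(f"F1 (class 1):       {f1_class_1:.3f}")
print(f"majority baseline:  {majority_baseline:.3f}")
```

Fewer than half of the class 1 samples are recognised, and always predicting class 0 would already reach about 75% accuracy, which is why plain accuracy is a misleading summary here.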

Solutions:
Two techniques we could use are Synthetic Minority Oversampling (SMOTE) of the minority class (class 1) or random undersampling of the majority class (class 0). A better solution, because it uses all the data we are given (without leaving valuable samples unused and without adding synthetic samples), is to use weights for the two classes by providing the class_weight argument of the DecisionTreeClassifier.
Finally, we could use stratification. To implement this technique, I joined the two datasets and used sklearn's train_test_split() with the 'stratify' parameter. I passed test_size = 0.32 so that the train and test sets have the same sizes as they did originally.
I also ran a grid search to find the best parameters for the classifier. In the end, I can see a small improvement in the metrics I used, and the accuracy achieved is higher than 82%.
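
As an illustration of the class-weight option mentioned above, this sketch computes the weights given by the "balanced" heuristic, `n_samples / (n_classes * class_count)`, from the training class counts quoted in 3.2. The resulting dictionary could be passed as `class_weight` to `DecisionTreeClassifier`, or `class_weight='balanced'` could be used directly. This option was not part of the run reported in this notebook, so its effect on the scores is not shown.

```python
# Training class counts quoted in 3.2.
class_counts = {0: 24727, 1: 7841}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
class_weight = {c: n_samples / (n_classes * n) for c, n in class_counts.items()}

for c, w in class_weight.items():
    print(f"class {c}: weight {w:.3f}, weighted count {w * class_counts[c]:.1f}")
```

With these weights both classes contribute the same total weight, so errors on the minority class cost as much as errors on the majority class.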

**3.3** Implement your solution using the cells below. Report your results and the process you followed. You are recommended to use stratification and grid search. You only need to improve the metrics you calculated above slightly, and also reach an accuracy score higher than 82%!

In [17]:
# BEGIN CODE HERE
columns1 = ['age','fnlwgt','education_num','hours-per-week',"capital-loss","capital-gain","income"]
data = pd.read_csv('income.csv',usecols=columns1)
data_test = pd.read_csv('income_test.csv',usecols=columns1)
# Convert target variable to 0 and 1
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })


from sklearn.metrics import accuracy_score
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
data_full = pd.concat([data, data_test], ignore_index=True)
y = data_full["income"]
X = data_full.drop(["income"],axis=1)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.32, shuffle= True ,stratify = y)


tree_classifier = DecisionTreeClassifier(min_samples_leaf=4)


param_grid1 = {
    'criterion':['gini', 'entropy'],
    'max_depth':[10, 17, 25, 30],
    'min_samples_leaf':[4, 12, 15, 20],
    'max_features':[None, 0.25, 0.5]
    
}

grid_search1 = GridSearchCV( tree_classifier, param_grid1, cv = 5)
grid_search1.fit(X_train, y_train)
final_prediction =  grid_search1.predict(X_test)

print("Best params:")
print(grid_search1.best_params_)
print(accuracy_score(y_test,final_prediction))
metric1 = balanced_accuracy_score(y_test,final_prediction)
metric2 = average_precision_score(y_test,final_prediction)
metric3 = f1_score(y_test,final_prediction)
#END CODE HERE

Best params:
{'criterion': 'entropy', 'max_depth': 10, 'max_features': None, 'min_samples_leaf': 4}
0.8364336531557992


In [18]:
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)


Model score Metric 1: 0.698
Model score Metric 2: 0.484
Model score Metric 3: 0.559
