## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). 

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [1]:
NAME = "Σοφία Κατσάκη"
AEM = "3656"

---

# Assignment 2 - Decision Trees #

Welcome to your second assignment. This exercise gives you an introduction to [scikit-learn](https://scikit-learn.org/stable/), a simple but efficient machine learning library in Python. It also gives you a broad understanding of how decision trees work.

After this assignment you will:
- Be able to use the scikit-learn library and train your own model from scratch.
- Be able to train and understand decision trees.

In [2]:
# Always run this cell
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# USE THIS RANDOM VARIABLE TO PRODUCE THE SAME RESULTS
RANDOM_VARIABLE = 42

## 1. Scikit-Learn and Decision Trees ##

You are going to use the scikit-learn library to train a [decision tree](https://scikit-learn.org/stable/modules/tree.html) model for detecting breast cancer, using the [Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) (+ [Additional information](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset)).

**1.1** Load the breast cancer dataset using the scikit-learn library and split it into a train set and a test set using the appropriate function. Use 33% of the dataset as the test set. Define X as the attributes and y as the target values. Do not forget to set the random_state parameter to the *RANDOM_VARIABLE* defined above. Use this variable for all the random_state parameters in this assignment.

In [3]:
# BEGIN CODE HERE
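# Load the attributes (X) and the target values (y) in a single call
X, y = load_breast_cancer(return_X_y=True)
# Hold out 33% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_VARIABLE)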

#END CODE HERE

In [4]:
print("Size of train set:{}".format(len(y_train)))
print("Size of test set:{}".format(len(y_test)))
print("Unique classes:{}".format(len(set(y_test))))

Size of train set:381
Size of test set:188
Unique classes:2


**Expected output**:  

```
Size of train set:381  
Size of test set:188  
Unique classes:2
```
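The sizes in the expected output can be checked by hand. The sketch below assumes that, for a fractional `test_size`, the test set size is rounded up and the remaining samples go to the training set, which matches the sizes printed above. The variable names are illustrative only.

```python
import math

n_samples = 569    # number of samples in the breast cancer dataset
test_size = 0.33   # fraction requested in 1.1

# Assumption: the test size is rounded up and the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("Size of train set:{}".format(n_train))
print("Size of test set:{}".format(n_test))
```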



**1.2** Train two DecisionTree classifiers and report the F1 score. Use information gain for one classifier and Gini impurity for the other.

In [5]:
# BEGIN CODE HERE
classifier_gini = DecisionTreeClassifier(random_state = RANDOM_VARIABLE) #the 'gini' criterion is the default one
classifier_igain = DecisionTreeClassifier(criterion='entropy',random_state = RANDOM_VARIABLE)

classifier_gini.fit(X_train,y_train)
classifier_igain.fit(X_train,y_train)

#predictions
prediction_gini = classifier_gini.predict(X_test)
prediction_igain = classifier_igain.predict(X_test)

#f_scores 
f_measure_gini = f1_score(y_test,prediction_gini)
f_measure_igain = f1_score(y_test,prediction_igain)
#END CODE HERE

In [6]:
print("F-Measure Gini: {}".format(f_measure_gini))
print("F-Measure Information Gain: {}".format(f_measure_igain))

F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386


**Expected output**:  

```
F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386
```
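To make the two criteria from 1.2 concrete, here is a minimal sketch that computes Gini impurity and entropy for a single node and the improvement of one candidate split. The class counts are made up for illustration and the helper names (`gini`, `entropy`, `weighted_impurity`) are not part of scikit-learn.

```python
import math

def gini(counts):
    """Gini impurity of a node, given the class counts in it."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node, given the class counts in it."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_impurity(children, impurity):
    """Impurity of the children, weighted by the share of samples in each child."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * impurity(child) for child in children)

# Hypothetical node: 40 samples of class 0 and 60 of class 1,
# split into two children
parent = [40, 60]
children = [[30, 10], [10, 50]]

# Improvement of the split under each criterion
gini_decrease = gini(parent) - weighted_impurity(children, gini)
information_gain = entropy(parent) - weighted_impurity(children, entropy)

print("Gini decrease: %.4f" % gini_decrease)
print("Information gain: %.4f" % information_gain)
```

Both criteria reward splits that make the children purer than the parent. They can rank candidate splits differently, which is one reason the two trees in 1.2 end up with different F1 scores.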

**1.3** Find the maximum depth reached by the tree that used the Gini impurity. Train multiple classifiers by modifying the max_depth within the range from 1 to maximum depth and save the f1 scores to the corresponding list of the *fscores* dictionary (one list for training set and one for test set). Before appending the scores to the corresponding list, multiply them by 100, and round the values to 2 decimals.

In [7]:
# BEGIN CODE HERE
depth = classifier_gini.tree_.max_depth
fscores = {}
fscores['train'] = []
fscores['test'] = []

for i in range(1,depth+1,1):
    classifier_gini = DecisionTreeClassifier(max_depth = i,random_state = RANDOM_VARIABLE)
    classifier_gini.fit(X_train,y_train)

    #Train set f1_score
    prediction_gini = classifier_gini.predict(X_train)
    f_measure_gini = round(100*f1_score(y_train,prediction_gini),2)
    fscores['train'].append(f_measure_gini)

    #Test set f1_score
    prediction_gini = classifier_gini.predict(X_test)
    f_measure_gini = round(100*f1_score(y_test,prediction_gini),2)
    fscores['test'].append(f_measure_gini)
  
#END CODE HERE

In [8]:
print("Fscores Train: {}".format(fscores['train']))
print("Fscores Test:  {}".format(fscores['test']))

Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]


**Expected output**:  
```
Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]
```

**1.4** Compare the results from the train set with the results from the test set. What do you notice? How are you going to choose the max_depth of your model?

The F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. Comparing the results from the train set with the results from the test set, we notice that the model reaches very high F1 scores on the training data, up to 100% when the depth is equal to 7. At that point the tree has fit the training data so closely that it has also learned the noise in the train set. The score on the test set improves as well, but not as much as on the train set, because the test set is unseen data and the noise learned from the train set does not carry over to it. The test score improves until max_depth is equal to 3 and then mostly decreases, apart from a small increase from depth 4 to depth 5. So, the ideal max_depth for the tree is 3, as it is the peak of the F1 score on the test set. In such cases we can also use grid search to find the ideal parameters.
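The reasoning above can be sketched in a few lines. The scores are copied from the output of 1.3; the helper `f1_from_precision_recall` is a hypothetical name used only to show how F1 combines precision and recall.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# F1 scores copied from the output of 1.3 (index 0 corresponds to max_depth=1)
fscores = {
    "train": [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0],
    "test": [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72],
}

# Depth with the highest test score
best_index = max(range(len(fscores["test"])), key=lambda i: fscores["test"][i])
best_depth = best_index + 1

# Gap between train and test score at each depth; a growing gap suggests overfitting
gaps = [round(tr - te, 2) for tr, te in zip(fscores["train"], fscores["test"])]

print("Best max_depth on the test set:", best_depth)
print("Train/test gap per depth:", gaps)
```

Picking the depth on the test set mirrors the argument above. A cross-validated search on the training set would avoid tuning on test data.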

## 2.0 Pipelines ##

**2.1** In this part of the exercise you are going to build a pipeline from scratch for a classification problem. Load the **income.csv** file and train a DecisionTree model that will predict the *income* variable. This dataset is a modification of the original Adult Income dataset found [here](http://archive.ics.uci.edu/ml/datasets/Adult). Report the f1-score and accuracy score of the test set found in **income_test.csv**. Your pipeline should be able to handle missing values and categorical features (scikit-learn's decision trees do not handle categorical values). You can preprocess the dataset as you like in order to achieve higher scores.  

In [9]:
# BEGIN CODE HERE
# The training features are the first 13 columns of the file, without the 'education' column
train_set = pd.read_csv('income.csv', usecols=list(range(13)))  # columns 0 to 12
train_set = train_set.drop(['education'], axis=1)

# y_train contains only the 'income' column, mapped to 0 and 1
y_train = pd.read_csv('income.csv', usecols=['income'])
y_train['income'] = y_train['income'].map({"<=50K": 0, ">50K": 1})

# The test features are the same columns, read from the test file
test_set = pd.read_csv('income_test.csv', usecols=list(range(13)))  # columns 0 to 12
test_set = test_set.drop(['education'], axis=1)

# y_test contains only the 'income' column, mapped to 0 and 1
y_test = pd.read_csv('income_test.csv', usecols=['income'])
y_test['income'] = y_test['income'].map({"<=50K": 0, ">50K": 1})
# End CODE HERE

**2.2** Create and test your pipeline

In [10]:
#Simple Imputer strategy for numerical values
numeric_features = ['age','fnlwgt','education_num','capital-gain','capital-loss','hours-per-week']
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean'))])

# The categorical features 'workclass' and 'occupation' have some missing values, which are not imputed here.
# handle_unknown='ignore' encodes categories that were not seen during fit as all zeros instead of raising an error.
categorical_features = ['workclass','marital-status','occupation','relationship','race','sex']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[('numeric',numeric_transformer,numeric_features),
                                               ('one_hot', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', DecisionTreeClassifier(random_state=RANDOM_VARIABLE))]) #gini is the default criterion
clf.fit(train_set,y_train)
y_predict = clf.predict(test_set)

In [11]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test, y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test, y_predict,average='weighted'))

Model score Accuracy: 0.807
Model score F1 Weighted: 0.808


**2.3** Perform a good grid search to find the best parameters for your pipeline.

In [12]:
param_grid = {
    "preprocessor__numeric__imputer__strategy": ["mean", "median","most_frequent"],
    "classifier__max_depth": [2, 5, 7, 10, 13],
    "classifier__criterion": ["gini","entropy"],
    "classifier__max_features": [0.25, 0.5, 0.75, None],
    "classifier__min_samples_leaf": [1,10,20,30,40,50],
}

grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(train_set, y_train)
y_predict =  grid_search.predict(test_set)

print("Best params:")
print(grid_search.best_params_)

Best params:
{'classifier__criterion': 'gini', 'classifier__max_depth': 10, 'classifier__max_features': 0.75, 'classifier__min_samples_leaf': 30, 'preprocessor__numeric__imputer__strategy': 'mean'}


In [13]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test,y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test,y_predict,average='weighted'))

Model score Accuracy: 0.851
Model score F1 Weighted: 0.841
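To get a feel for the cost of the grid search in 2.3, the number of parameter combinations can be counted directly. This is a small sketch that copies the grid from the cell above; the names `n_candidates` and `n_fits` are illustrative.

```python
import math

# Same grid as in 2.3
param_grid = {
    "preprocessor__numeric__imputer__strategy": ["mean", "median", "most_frequent"],
    "classifier__max_depth": [2, 5, 7, 10, 13],
    "classifier__criterion": ["gini", "entropy"],
    "classifier__max_features": [0.25, 0.5, 0.75, None],
    "classifier__min_samples_leaf": [1, 10, 20, 30, 40, 50],
}
cv = 5  # number of cross-validation folds

# Every combination of values is one candidate; each candidate is fitted once per fold
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv

print("Candidates:", n_candidates)
print("Fits during cross-validation:", n_fits)
```

Adding one more value to any list multiplies the number of fits, so the grid grows quickly with every extra parameter.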


**2.4** Describe the process you followed to achieve the results above. Your description should include, but is not limited to, the following:
- How do you handle missing values and why
- How do you handle categorical variables and why
- Any further preprocessing steps
- How do you evaluate your model and how did you choose its parameters 
- Report any additional results and comments on your approach.

You should achieve at least 85% accuracy score and 84% f1 score.

- Missing values in numerical features are replaced with the mean of the feature, using SimpleImputer. This data set happens to have no missing numerical values, but the code that handles them is still included, as it applies to most cases (the majority of data sets are expected to have some missing values). We do not drop the examples with missing values, because that would decrease the number of examples available (there are not many examples with missing values, though). The mean is a good estimator of the missing values, because it takes all the other values into account and is not extremely high or low.

- Categorical variables need to be converted to numbers, because scikit-learn's decision trees do not handle categorical values. The different values of each categorical variable need to be distinguished, which is accomplished through the use of OneHotEncoder. Missing values in the categorical variables are not imputed. Note that `handle_unknown='ignore'` only applies to categories that were not seen during fitting, which are encoded as all 0's; depending on the scikit-learn version, missing values present in the training data may instead be treated as a category of their own.

- For the train_set and test_set we use the loaded feature columns, without the column that we want to predict, the 'income' column. This column makes up the y_train and y_test sets. Additionally, in train_set and test_set we drop the column 'education', as the column 'education_num' describes its content well and is numerical. We may need to drop other columns as well, such as 'marital-status' or 'race', which seem less relevant to the income, but here we do not try different column combinations. Also, in the y_train and y_test data, the "<=50K" and ">50K" values were mapped to 0's and 1's, in order for them to become numerical and therefore easier to handle.

- In 2.2, a simple Pipeline was used: the data were fitted and a prediction was made, with the SimpleImputer and OneHotEncoder configured as described above. In 2.3, a grid search with 5-fold cross-validation (4 parts for training and 1 for validation each time) was used. For the imputer strategy for numerical values, "mean", "median" and "most_frequent" were tried. For max_depth, the values 2, 5, 7, 10 and 13 were tried. For the split criterion, "gini" and "entropy" were tried. For max_features, the values tried were 0.25, 0.5, 0.75 and None, and for the minimum number of samples in a leaf the values tried were 1, 10, 20, 30, 40 and 50. Several values were tried for each parameter, so that the search had a variety of combinations to choose from.
  

- The best result was obtained with 'gini' as the criterion, 10 as max_depth, 0.75 as max_features, 30 as the minimum number of samples in a leaf and 'mean' as the numeric imputer strategy. The results reach the expected scores (around 85% accuracy and 84% weighted F1), but we can always try different parameters, different combinations of columns, etc.
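The preprocessing described above can be tried on a tiny example. This is a minimal sketch with made-up rows that only loosely resemble the income data; `sparse_threshold=0` is used here only to get a dense array that is easy to print and is not part of the solution above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows, only loosely modelled on the income data
train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "workclass": ["Private", "State-gov", "Private"],
})
test = pd.DataFrame({
    "age": [30.0],
    "workclass": ["Never-worked"],  # category not present in `train`
})

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", SimpleImputer(strategy="mean"), ["age"]),
        ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_encoded = preprocessor.fit_transform(train)
test_encoded = preprocessor.transform(test)

print(train_encoded)  # the missing age is replaced by the mean of 25 and 40
print(test_encoded)   # the unseen workclass is encoded as all zeros
```

The output columns are the imputed `age` followed by one column per `workclass` value seen during fitting.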

## 3.0 Common Issues ## 

**3.0** Run the following code to define a DecisionTreeModel and load the **income** dataset only with the numerical variables. Then, answer the following questions. 

In [14]:
# Load Data
columns = ['age','fnlwgt','education_num','hours-per-week',"capital-loss","capital-gain","income"]
data = pd.read_csv('income.csv',usecols=columns)
data_test = pd.read_csv('income_test.csv',usecols=columns)
# Convert target variable to 0 and 1
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })
# Create X and y
X_train = data.drop(["income"],axis=1)
y_train = data['income'].values
X_test = data_test.drop(["income"],axis=1)
y_test = data_test['income'].values
# Classifier
classifier = DecisionTreeClassifier(min_samples_leaf=4)
classifier.fit(X_train,y_train)
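# Note: y_predict still holds the grid search predictions from 2.3, not predictions of this classifier
accuracy = accuracy_score(y_test,y_predict)  # separate name, so the accuracy_score function is not overwritten
print("Model score accuracy: %.3f" % accuracy)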

Model score accuracy: 0.851


**3.1** Evaluate the classifier using at least three evaluation metrics other than accuracy_score and f1 (weighted).

In [15]:
from sklearn.metrics import balanced_accuracy_score, average_precision_score, f1_score
y_predict = classifier.predict(X_test)

# BEGIN CODE HERE
metric1 = balanced_accuracy_score(y_test, y_predict) 
metric2 = average_precision_score(y_test, y_predict) 
metric3 = f1_score(y_test, y_predict, average='micro') 
#END CODE HERE

In [16]:
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)

Model score Metric 1: 0.687
Model score Metric 2: 0.413
Model score Metric 3: 0.790


**3.2** Do you notice any problems with the classifier? If so, what can you do to change this?

The additional evaluation metrics show that the model is less accurate in its predictions than the accuracy of 0.851 reported in 3.0 suggests: balanced accuracy is 0.687 and average precision is 0.413. This is due to the fact that we take into account only the numeric columns in the data set. We could try using OneHotEncoder to include the categorical variables as well, or using grid search to find the best parameters for our model.
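Why accuracy alone can look better than the other metrics is easiest to see on a small, imbalanced example. The labels below are made up; the sketch computes accuracy, balanced accuracy and micro-averaged F1 by hand rather than with scikit-learn.

```python
# Made-up labels: 8 negatives and 2 positives; the model misses one positive
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

classes = sorted(set(y_true))

# Accuracy: share of correct predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Balanced accuracy: mean of the per-class recalls
recalls = []
for c in classes:
    n_c = sum(t == c for t in y_true)
    hits = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    recalls.append(hits / n_c)
balanced_accuracy = sum(recalls) / len(recalls)

# Micro-averaged F1: pool true positives, false positives and false negatives over all classes
tp = sum(t == p for t, p in zip(y_true, y_pred))
fp = sum(t != p for t, p in zip(y_true, y_pred))  # each error is a false positive for the predicted class
fn = fp                                            # and a false negative for the true class
micro_f1 = 2 * tp / (2 * tp + fp + fn)

print("Accuracy: %.3f" % accuracy)
print("Balanced accuracy: %.3f" % balanced_accuracy)
print("F1 micro: %.3f" % micro_f1)
```

In single-label classification the micro-averaged F1 coincides with accuracy, so it adds little information. Balanced accuracy, in contrast, drops as soon as the minority class is predicted poorly.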

**3.3** Implement your solution using the cells below. Report your results and the process you followed. You are recommended to use stratification and grid search. You should only have to increase the metrics you calculated above a little, and also reach an accuracy score higher than 82%!

In [17]:
# BEGIN CODE HERE
final_score = ""
#imports
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

RANDOM_VARIABLE = 42
# Load Data
columns = ['age','fnlwgt','education_num','hours-per-week','capital-loss','capital-gain','income']
data = pd.read_csv('income.csv',usecols=columns)
data_test = pd.read_csv('income_test.csv',usecols=columns)
# Convert target variable to 0 and 1
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })
# Create X and y
X_train = data.drop(["income"],axis=1)
y_train = data['income'].values
X_test = data_test.drop(["income"],axis=1)
y_test = data_test['income'].values

#numeric_features = ['age','fnlwgt','education_num','capital-gain','capital-loss','hours-per-week']
numeric_features = X_train.columns
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean'))])

preprocessor = ColumnTransformer(transformers=[('numeric',numeric_transformer, numeric_features)])

classifier = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', DecisionTreeClassifier(random_state=RANDOM_VARIABLE))]) #gini is the default criterion
classifier.fit(X_train,y_train)
y_predict = classifier.predict(X_test)
param_grid = {
    "preprocessor__numeric__imputer__strategy": ["mean","median","most_frequent"],
    "classifier__max_depth": [2, 5, 7, 10],
    "classifier__criterion": ["gini","entropy"],
    "classifier__max_features": [0.25, 0.5, 0.75, None],
    "classifier__min_samples_leaf": [1,10,20,30,40,50],
}

grid_search = GridSearchCV(classifier, param_grid, cv=5)
grid_search.fit(X_train, y_train)
y_predict =  grid_search.predict(X_test)

print("Best params:")
print(grid_search.best_params_)

accuracy_score = accuracy_score(y_test,y_predict)
metric1 = balanced_accuracy_score(y_test, y_predict) 
metric2 = average_precision_score(y_test, y_predict) 
metric3 = f1_score(y_test, y_predict, average='micro') 


#END CODE HERE

Best params:
{'classifier__criterion': 'gini', 'classifier__max_depth': 7, 'classifier__max_features': 0.75, 'classifier__min_samples_leaf': 10, 'preprocessor__numeric__imputer__strategy': 'mean'}


In [18]:
print("Model score accuracy: %.3f" % accuracy_score)
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)

Model score accuracy: 0.825
Model score Metric 1: 0.697
Model score Metric 2: 0.467
Model score Metric 3: 0.825


The process followed was:
- The data were loaded. X_train and X_test include only the numeric columns (without the 'income' column). y_train and y_test consist only of the 'income' column.
- The "<=50K" and ">50K" values of the y sets were mapped to 0 and 1 for each example.
- A "preprocessor" (a ColumnTransformer with a SimpleImputer) was built for the numeric features and combined with a DecisionTreeClassifier in a pipeline, which was fitted on X_train and y_train.
- The grid search parameters were the same as in Exercise 2, except that the value 13 was not tried for max_depth.
- A 5-fold cross-validation was used, which may not be the best choice (10-fold may be better), but it is quick and generally accurate.
- The scores of accuracy, balanced accuracy, average precision and f1 micro were calculated on the test set.

The results are: 
- accuracy: 82.5%
- balanced accuracy: 69.7%
- average precision: 46.7%
- f1 micro: 82.5%

The best parameters, according to Grid Search are:
- 'gini' for the criterion
- 7 for max_depth
- 0.75 for max_features
- 10 for min_samples_leaf
- 'mean' for the numeric imputer strategy


The scores are indeed a little higher now that we found better parameters with grid search. The accuracy is higher than 82% as expected.
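The task also recommends stratification, i.e. keeping the class proportions the same in every split. The solution above does not set it explicitly, although `GridSearchCV` with an integer `cv` already uses stratified folds when the estimator is a classifier. A minimal sketch of the idea with made-up, imbalanced labels, using the `stratify` argument of `train_test_split`:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 samples of class 0 and 20 of class 1
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

# stratify=y keeps the 80/20 class ratio in both parts of the split
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

train_counts = Counter(y_tr)
val_counts = Counter(y_val)

print("Train class counts:", dict(train_counts))
print("Validation class counts:", dict(val_counts))
```

Without `stratify`, a random split of a small or imbalanced data set can leave one part with noticeably fewer positives than the other, which makes the scores harder to compare.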