# Assignment 2 - Decision Trees #

Welcome to your second assignment. This exercise gives you an introduction to [scikit-learn](https://scikit-learn.org/stable/), a simple but efficient machine learning library for Python. It also gives you a broad understanding of how decision trees work.

After this assignment you will:
- Be able to use the scikit-learn library and train your own model from scratch.
- Be able to train and understand decision trees.

In [2]:
# Always run this cell
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
import unittest

# USE THIS RANDOM VARIABLE TO PRODUCE THE SAME RESULTS
RANDOM_VARIABLE = 42

## 1. Scikit-Learn and Decision Trees ##

You are going to use the scikit-learn library to train a [decision tree](https://scikit-learn.org/stable/modules/tree.html) model that detects breast cancer, using the [Breast Cancer Wisconsin (Diagnostic) dataset](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-wisconsin-diagnostic-dataset).

**1.1** Load the breast cancer dataset using the scikit-learn library and split it into a train set and a test set using the appropriate function. Use 30% of the dataset as the test set. Define X as the attributes and y as the target values. Do not forget to set the random_state parameter to the *RANDOM_VARIABLE* defined above. Use this variable for all the random_state parameters in this assignment.

In [None]:
# BEGIN CODE HERE
X,y = load_breast_cancer(return_X_y=True)
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.3, random_state=RANDOM_VARIABLE)

#END CODE HERE

In [None]:
print("Size of train set:{}".format(len(y_train)))
print("Size of test set:{}".format(len(y_test)))
print("Unique classes:{}".format(len(set(y_test))))

**Expected output**:
Size of train set:398
Size of test set:171
Unique classes:2

**1.2** Train two DecisionTree classifiers and report the F1 score. Use information gain (entropy) for one classifier and Gini impurity for the other.

In [None]:
# BEGIN CODE HERE
classifier_gini = DecisionTreeClassifier(criterion="gini", random_state=RANDOM_VARIABLE).fit(X_train, y_train)
classifier_igain = DecisionTreeClassifier(criterion="entropy", random_state=RANDOM_VARIABLE).fit(X_train, y_train)

prediction_gini = classifier_gini.predict(X_test)
prediction_igain = classifier_igain.predict(X_test)

f_measure_gini = f1_score(y_true=y_test, y_pred=prediction_gini)
f_measure_igain = f1_score(y_true=y_test, y_pred=prediction_igain)

#END CODE HERE

In [None]:
print("F-Measure Gini:{}".format(f_measure_gini))
print("F-Measure Information Gain:{}".format(f_measure_igain))

**Expected output**:
F-Measure Gini:0.9528301886792453
F-Measure Information Gain:0.9724770642201834
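
To make the two criteria concrete, here is a minimal sketch that computes both impurity measures for a single node from its class counts. The helper names `gini`, `entropy` and `information_gain` and the example counts are illustrative assumptions, not part of scikit-learn or of the assignment data. Information gain is the reduction in entropy achieved by a split.

```python
import numpy as np

def gini(counts):
    """Gini impurity 1 - sum(p_k^2) of a node with the given class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy -sum(p_k * log2(p_k)) in bits; empty classes contribute 0."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # drop empty classes to avoid log2(0)
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = np.sum(parent)
    weighted = sum(np.sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical node with 50 samples per class, split into two purer children
parent = [50, 50]
children = [[40, 10], [10, 40]]

print("Gini of parent:", gini(parent))        # 0.5
print("Entropy of parent:", entropy(parent))  # 1.0
print("Information gain of split:", information_gain(parent, children))
```

Both measures are zero for a pure node and largest for a perfectly mixed one, which is why the two classifiers often, but not always, pick similar splits.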


**1.3** Find the maximum depth reached by the tree that used the Gini impurity. Train multiple classifiers by varying max_depth from 1 to that maximum depth, and save the F1 scores on the train and test sets to lists.

In [None]:
# BEGIN CODE HERE
depth = classifier_gini.get_depth()
fscores_train = []
fscores_test = []

for i in range(1, depth + 1):
    clf = DecisionTreeClassifier(criterion="gini", random_state=RANDOM_VARIABLE, max_depth=i).fit(X_train, y_train)
    fscores_train.append(f1_score(y_true=y_train, y_pred=clf.predict(X_train)))
    fscores_test.append(f1_score(y_true=y_test, y_pred=clf.predict(X_test)))

#END CODE HERE

In [None]:
print("Fscores Train:{}".format(fscores_train))
print("Fscores Test:{}".format(fscores_test))


**Expected output**:
Fscores Train:[0.9392712550607287, 0.9533468559837729, 0.9761904761904762, 0.996, 0.996, 0.9979959919839679, 1.0]
Fscores Test:[0.9150943396226415, 0.9444444444444444, 0.9724770642201834, 0.9629629629629629, 0.9629629629629629, 0.9674418604651163, 0.9528301886792453]



**1.4** Compare the results from the train set with the results from the test set. What do you notice? Explain your findings. How are you going to choose the max_depth of your model?

As the depth increases, the training F1 score rises steadily until the tree fits the training data perfectly (1.0 at
depth 7). The test F1 score improves only up to depth 3 (0.972); beyond that it drops and fluctuates between 0.953 and
0.967. This widening gap between training and test performance is overfitting: deeper trees memorize the training data
instead of learning patterns that generalize. The goal is a model that scores well on the training data and also
generalizes to unseen data. The best choice for max_depth therefore appears to be 3, where the decision tree achieves
an F1 score of about 0.97 on both the training and the test set.
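
One way to pick max_depth without consulting the test set is cross-validation on the training set. The sketch below is an illustration under stated assumptions, not the method used above: it assumes 5-fold cross-validation, reuses the value 42 from RANDOM_VARIABLE, and scans depths 1 to 7 to mirror the seven scores reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Mean cross-validated F1 score on the training set for each candidate depth
cv_scores = {}
for depth in range(1, 8):
    clf = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=42)
    cv_scores[depth] = cross_val_score(clf, X_train, y_train, cv=5, scoring="f1").mean()

# The depth with the highest mean score is the candidate to keep
best_depth = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("Depth with the highest mean CV F1:", best_depth)
```

The test set is then used once, at the end, to report the score of the chosen depth.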

## 2.0 Pipelines ##

**2.1** In this part of the exercise you are going to build a pipeline from scratch for a classification problem. Load the **income.csv** file and train a DecisionTree model that will predict the *income* variable. This dataset is a modification of the original Adult Income dataset found [here](http://archive.ics.uci.edu/ml/datasets/Adult). Report the f1-score and accuracy score of the test set found in **income_test.csv**. Your pipeline should be able to handle missing values and categorical features (scikit-learn's decision trees do not handle categorical values). You can preprocess the dataset as you like in order to achieve higher scores.

In [None]:
# BEGIN CODE HERE

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Loading Datasets
training_set = pd.read_csv("income.csv", header=0)
testing_set = pd.read_csv("income_test.csv", header=0)

cols = ['age','workclass','fnlwgt','education','education_num','marital-status','occupation',
        'relationship','race','sex','capital-gain','capital-loss','hours-per-week','income']
num_cols = ['age','fnlwgt','education_num',"capital-gain","capital-loss",'hours-per-week']
categorical_cols = ['workclass','marital-status','occupation','relationship','race','sex']
target_feature = 'income'

# Missing Values

# Option 2: Drop the samples that have missing values.
# training_set = training_set.dropna()
# testing_set = testing_set.dropna()

# Option 3: Fill the missing values with the most frequent one per column.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
training_set = imputer.fit_transform(training_set)
training_set = pd.DataFrame(training_set, columns=cols)

# Convert target variable to 0 and 1
training_set[target_feature] = training_set[target_feature].map({'<=50K': 0, '>50K': 1})
testing_set[target_feature] = testing_set[target_feature].map({'<=50K': 0, '>50K': 1})

training_set_labels = training_set[target_feature]
testing_set_labels = testing_set[target_feature]

training_set = training_set.drop(target_feature, axis=1)
testing_set = testing_set.drop(target_feature, axis=1)

num_pipeline = Pipeline([
    ('normalization', MinMaxScaler())
    # ('standardization', StandardScaler())
])

full_pipeline = ColumnTransformer([
    ('num_pipeline', num_pipeline, num_cols),
    ('1hot', OneHotEncoder(), categorical_cols)
])

transformer = full_pipeline.fit(training_set)

X_train = pd.DataFrame(transformer.transform(training_set).toarray())
X_test = pd.DataFrame(transformer.transform(testing_set).toarray())

y_train = pd.DataFrame(training_set_labels)
y_test = pd.DataFrame(testing_set_labels)

# param_grid = [{'criterion': ["gini", "entropy"],
#                'max_depth': list(range(1, 45)),
#                'min_samples_split': [2, 3, 4, 5]
#                }]
#
# grid_search = GridSearchCV(DecisionTreeClassifier(random_state = RANDOM_VARIABLE), param_grid, cv=10, scoring='f1', return_train_score=True, verbose=3)
# grid_search.fit(X_train, y_train)
# print(grid_search.best_params_)
# CV Result: {'criterion': 'entropy', 'max_depth': 13, 'min_samples_split': 2}

# Using the CV best parameters
clf = DecisionTreeClassifier(random_state=RANDOM_VARIABLE, criterion='entropy', max_depth=13, min_samples_split=2)
clf.fit(X_train, y_train)

ftrScore = f1_score(y_true=y_train, y_pred=clf.predict(X_train))
fteScore = f1_score(y_true=y_test, y_pred=clf.predict(X_test))
acctrScore = accuracy_score(y_true=y_train, y_pred=clf.predict(X_train))
accteScore = accuracy_score(y_true=y_test, y_pred=clf.predict(X_test))

# print("Accuracy Train:{}".format(acctrScore))
# print("Accuracy Test:{}".format(accteScore))
# print("Fscores Train:{}".format(ftrScore))
# print("Fscores Test:{}".format(fteScore))

# Report the test-set scores computed above
# (0.6744918408245062 and 0.8515181194906954 in the original run)
fScore = fteScore
accScore = accteScore

#END CODE HERE

**2.2** Describe the process you followed to achieve the results above. Your description should include, but is not limited to the following
- How do you handle missing values and why
- How do you handle categorical variables and why
- Any further preprocessing steps
- How do you evaluate your model and how did you choose its parameters
- Report any additional results and comments on your approach.

For the handling of missing values, there are three main options:

> Option 1: Drop the features that have missing values.

> Option 2: Drop the samples that have missing values.

> Option 3: Fill the missing values with a constant value.

Option 1 seems too extreme for our case, since only 5% of the training data has missing values. Option 2 could be a
viable solution if losing that 5% of the training data is considered acceptable. Option 3 is the one used here: since
the missing values occur in categorical features, they are filled with the most frequent value of each column.

Categorical variables are handled with one-hot encoding, because scikit-learn's decision trees need numeric inputs and
one-hot encoding does not impose an artificial order on the categories.

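For the numerical features, both normalization (MinMaxScaler) and standardization (StandardScaler) were tried. The
difference between the two is negligible, which is expected because decision tree splits depend only on the ordering
of feature values, not on their scale.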

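To evaluate the model and select its parameters, GridSearchCV was used with 10-fold cross-validation and the F1 score
as the scoring metric. The grid covered the split criterion (gini, entropy), max_depth (1 to 44) and min_samples_split
(2 to 5); the best combination was criterion='entropy', max_depth=13 and min_samples_split=2. With these parameters the
model reaches an F1 score of 0.674 and an accuracy of 0.852 on the test set.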
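
The preprocessing steps described above can also be bundled with the classifier into a single Pipeline, so that imputation and encoding are fitted on the training data and then applied unchanged to the test data. The sketch below is a hedged illustration, not the code used for the reported scores. The tiny DataFrames are hypothetical stand-ins for income.csv and income_test.csv, and `handle_unknown="ignore"` is an added assumption so that categories unseen during training do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for income.csv / income_test.csv
train = pd.DataFrame({
    "age": [25, 38, 52, 41, 29, 60],
    "workclass": ["Private", np.nan, "Self-emp", "Private", "Private", "Gov"],
    "income": [0, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({
    "age": [33, 57],
    "workclass": [np.nan, "Never-worked"],  # a missing value and an unseen category
})

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", "passthrough", ["age"]),
    ("cat", categorical, ["workclass"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
])

# Fitting learns the imputation values and categories from the training data only
model.fit(train[["age", "workclass"]], train["income"])
predictions = model.predict(test)
print(predictions)
```

A pipeline like this can be passed directly to GridSearchCV, which keeps the preprocessing inside each cross-validation fold.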

## 3.0 Common Issues ##

**3.0** Run the following code to define a DecisionTreeModel and load the **income** dataset only with the numerical variables. Then, answer the following questions.

In [None]:
# Load Data
columns = ['age','fnlwgt','education_num','hours-per-week',"capital-loss","capital-gain","income"]
data = pd.read_csv('income.csv',usecols=columns)
data_test = pd.read_csv('income_test.csv',usecols=columns)
# Convert target variable to 0 and 1
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })
# Create X and y
X_train = data.drop(["income"],axis=1)
y_train = data['income'].values
X_test = data_test.drop(["income"],axis=1)
y_test = data_test['income'].values
# Classifier
classifier = DecisionTreeClassifier(min_samples_leaf=4)

**3.1** Draw a learning curve for the classifier for the train and test set loaded above.

In [None]:
# BEGIN CODE HERE
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

def learning_curve(X_train,y_train,X_test,y_test,clf,metric): #Use any parameters you need
    # Map the metric name to its scoring function and fail early on an unknown name
    metrics = {'f1': f1_score, 'accuracy': accuracy_score}
    if metric not in metrics:
        raise ValueError("Wrong metric parameter used.")
    score = metrics[metric]

    # Despite their names, these lists hold scores (higher is better), not errors
    trainErrs = []
    testErrs = []
    # Ten training-set sizes: 10%, 20%, ..., 100% of the samples
    slices = np.linspace(0, X_train.shape[0], 11, dtype=int)[1:]
    for sliceIdx in slices:
        # Refit on the first sliceIdx samples, then score on that subset and on the full test set
        clf.fit(X_train[:sliceIdx], y_train[:sliceIdx])
        trainErrs.append(score(y_true=y_train[:sliceIdx], y_pred=clf.predict(X_train[:sliceIdx])))
        testErrs.append(score(y_true=y_test, y_pred=clf.predict(X_test)))

    plt.plot(np.arange(10, 101, 10), trainErrs, label='training set')
    plt.plot(np.arange(10, 101, 10), testErrs, label='test set')
    plt.xlabel('Percentage of training set used')
    plt.ylabel(metric)
    plt.legend(loc='best')
    plt.xlim([0, 110])
    plt.ylim([min(min(testErrs), min(trainErrs)) - 0.2, 0.2 + max(max(testErrs), max(trainErrs))])
    plt.show()
    return trainErrs, testErrs


f1Errs = learning_curve(X_train,y_train,X_test,y_test,classifier,'f1')
accErrs = learning_curve(X_train,y_train,X_test,y_test,classifier,'accuracy')

#END CODE HERE
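
As a cross-check on the hand-written function above, scikit-learn provides `sklearn.model_selection.learning_curve`, which scores each training size with cross-validation instead of a fixed test set. The sketch below imports it under the alias `sk_learning_curve` to avoid clashing with the function defined above. The synthetic data from `make_classification` is an assumption so that the example runs on its own; it is not the income dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve as sk_learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with the real X_train, y_train
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=42)

# Ten training sizes from 10% to 100% of each training fold, 5-fold CV, F1 scoring
sizes, train_scores, val_scores = sk_learning_curve(
    DecisionTreeClassifier(min_samples_leaf=4, random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="f1",
)

# Each score array has one row per training size and one column per fold
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:4d} samples  train F1={tr:.3f}  validation F1={va:.3f}")
```

A persistent gap between the training and validation columns is the same high-variance signature that the plotted curves show.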

**3.2** Do you notice any problems with the classifier? If so, what can you do to fix them?

The learning curves show that the trained model suffers from high variance, i.e. overfitting, visible as the gap
between the training and test scores. To overcome this problem, the model should be regularized by tuning its
parameters. A DecisionTreeClassifier is prone to overfitting: if the tree is allowed to grow without limits, it adapts
too closely to the training data. To combat this, growth has to be stopped early by limiting the depth of the tree
(max_depth) and the number of leaves (max_leaf_nodes). With max_leaf_nodes set, the tree is grown in best-first
fashion, where the best nodes are those with the largest relative reduction in impurity.
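
The effect of these two limits can be seen in a small, self-contained sketch. Everything here is an illustrative assumption: the noisy synthetic data from `make_classification`, and the values `max_depth=5` and `max_leaf_nodes=20`, which are not tuned for the income dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (assumption), so a fully grown tree can overfit
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=3, flip_y=0.2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# One tree grown without limits, one with limited depth and number of leaves
unconstrained = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
constrained = DecisionTreeClassifier(
    max_depth=5, max_leaf_nodes=20, random_state=42
).fit(X_tr, y_tr)

for name, tree in [("unconstrained", unconstrained), ("constrained", constrained)]:
    train_f1 = f1_score(y_tr, tree.predict(X_tr))
    test_f1 = f1_score(y_te, tree.predict(X_te))
    print(f"{name:13s} depth={tree.get_depth():2d} leaves={tree.get_n_leaves():4d} "
          f"train F1={train_f1:.3f} test F1={test_f1:.3f}")
```

Comparing the train and test columns for the two trees shows how much of the training score comes from memorizing noise.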

**3.3** Implement your solution using the cells below. Report your results and the process you followed. 

In [None]:
# BEGIN CODE HERE
# param_grid = [{'criterion': ["gini", "entropy"],
#                'max_depth': list(range(1, 45)),
#                'max_leaf_nodes': [50, 150, 250, 350]
#                }]
#
# grid_search = GridSearchCV(DecisionTreeClassifier(random_state = RANDOM_VARIABLE), param_grid, cv=10, scoring='f1', return_train_score=True, verbose=3)
# grid_search.fit(X_train, y_train)
# print(grid_search.best_params_)
# CV Result: {'criterion': 'gini', 'max_depth': 14, 'max_leaf_nodes': 150}

best_params = {'criterion': "gini", 'max_depth': 14, 'max_leaf_nodes': 150}

# Using the CV best parameters
classifier.set_params(**best_params)

f1Errs = learning_curve(X_train,y_train,X_test,y_test,classifier,'f1')
accErrs = learning_curve(X_train,y_train,X_test,y_test,classifier,'accuracy')

# The final test f1 score for the classifier
final_score = f1Errs[1][-1]
print(final_score)

#END CODE HERE

A grid search with 10-fold cross-validation (GridSearchCV, scored by F1) is used to find the best parameters for the
model: criterion='gini', max_depth=14 and max_leaf_nodes=150. The effect is immediately visible in the learning
curves: the tuned model has less variance and achieves a final test F1 score of 0.5750767738807175.
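
For reference, the grid search that is commented out in the cell above follows this general shape. The sketch below uses a much smaller grid, 5 folds and synthetic data so that it runs quickly. These choices are assumptions for illustration and do not reproduce the parameters or the score reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with the income features and labels
X, y = make_classification(n_samples=600, n_features=6, n_informative=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Reduced grid over the same three parameters tuned above
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8],
    "max_leaf_nodes": [10, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

# best_params_ is chosen by mean CV score; score() refits on all training data and uses F1
print(search.best_params_)
print("Mean CV F1 of best setting:", search.best_score_)
print("Held-out F1:", search.score(X_te, y_te))
```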

