If you are using Google Colab, run the cell below to mount your Google Drive in your Colab instance. Adjust the path if your files are stored elsewhere in your Google Drive.

If you are not using Google Colab, the cell does nothing, so you can safely run it anyway.

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd 'drive/My Drive/Colab Notebooks/05_Classification'
except ImportError:
    # Not running on Google Colab, so there is nothing to mount.
    pass

## Exercise 5: Classification

### 5.1. Learning a classifier for the Iris Data Set

In the last exercise, you learned lazy classification models for the Iris data set. Now try a decision-tree-based approach with 10-fold cross-validation. Use a pipeline to perform some preprocessing before learning or applying the decision tree classifier.

#### 5.1.1. Discretise the Iris data set into three bins. Then use the DecisionTreeClassifier with a 10-fold stratified cross-validation and compute the accuracy. Afterwards, plot the decision tree.

In [None]:
import pandas as pd
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

## Windows users: either add the path to Graphviz's dot.exe to your PATH variable
## OR uncomment the 2 lines below (you may have to change the path):

# import os
# os.environ['PATH'] += ';C:\\Program Files (x86)\\Graphviz2.38\\bin'

iris = pd.read_csv("iris.csv")
iris_data = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
iris_target = iris['Name']

# TODO
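
A minimal sketch of one way to approach 5.1.1, not a reference solution. It assumes scikit-learn's bundled copy of the Iris data (`load_iris`) in place of `iris.csv`, and it uses `KBinsDiscretizer` with equal-width bins as the discretisation step. The step names, the bin strategy and the random seed are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Iris data stands in for iris.csv.
X, y = load_iris(return_X_y=True)

binned_tree = Pipeline([
    # Three equal-width bins per feature, encoded as the integers 0, 1, 2.
    ('discretise', KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')),
    ('estimator', DecisionTreeClassifier(random_state=42)),
])

# Stratified folds keep the class proportions similar in every fold.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(binned_tree, X, y, cv=folds, scoring='accuracy')
print("Mean accuracy over 10 folds: {:.3f}".format(scores.mean()))
```

Because the discretiser is part of the pipeline, the bin edges are learned on the training folds only. After `binned_tree.fit(X, y)`, calling `sklearn.tree.plot_tree(binned_tree.named_steps['estimator'])` draws the fitted tree with matplotlib as an alternative to graphviz.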

#### 5.1.2. Remove the discretisation and adjust the max_depth parameter of DecisionTreeClassifier to increase the accuracy. Does the accuracy change? Compare the complexity of the two models. Which model should be preferred according to Occam’s razor?

In [None]:
# TODO
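
The comparison asked for in 5.1.2 could be sketched as follows. This is an illustration under assumptions: the bundled Iris data replaces `iris.csv`, the list of depths is an arbitrary choice, and the number of tree nodes is used as a rough stand-in for model complexity.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Iris data stands in for iris.csv.
X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

results = {}
for depth in [1, 2, 3, None]:
    tree_clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    accuracy = cross_val_score(tree_clf, X, y, cv=folds, scoring='accuracy').mean()
    # Refit on all data only to count nodes as a rough complexity measure.
    node_count = tree_clf.fit(X, y).tree_.node_count
    results[depth] = (accuracy, node_count)
    print("max_depth={}: accuracy={:.3f}, nodes={}".format(depth, accuracy, node_count))
```

If two settings reach a similar accuracy, Occam's razor favours the one with fewer nodes.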

### 5.2. Parameter optimization
In Exercise 4.1, we used the German credit data set from the UCI data set library (http://archive.ics.uci.edu/ml/index.html), which describes the customers of a bank with respect to whether they should get a bank credit or not. The data set is provided as the file credit-g.arff in ILIAS.

#### 5.2.1. (recap) Go back to the results of Exercise 4.1.4. Re-run the classifiers with their default parameter settings.
In that exercise, you:
- used 10-fold cross-validation,
- balanced the training set by multiplying the “bad customer” examples,
- evaluated the results with the cost matrix ((0,100)(1,0)); that is, you assumed that you lose 1 unit if you refuse a credit to a good customer, but 100 units if you give a credit to a bad customer.

Rerun your process to get the performance results. Now additionally use a Decision Tree classifier. How does it perform? What are the default parameters of the Decision Tree?
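
To make the cost matrix concrete, the sketch below applies it to a made-up confusion matrix. The counts are invented purely for illustration. Rows are true labels and columns are predictions, both ordered (bad, good), which is the layout produced by scikit-learn's `confusion_matrix`.

```python
import numpy as np

# Rows: true class, columns: predicted class, both ordered (bad, good).
# Accepting a bad customer costs 100, refusing a good customer costs 1.
cost_matrix = np.array([[0, 100],
                        [1,   0]])

# Hypothetical confusion matrix, invented for illustration only.
confusion = np.array([[250,  50],   # 50 bad customers were granted a credit
                      [200, 500]])  # 200 good customers were refused

# Multiply element-wise and sum to get the total misclassification cost.
total_cost = int((cost_matrix * confusion).sum())
print(total_cost)  # 50 * 100 + 200 * 1 = 5200
```

With these weights, a classifier that refuses many good customers can still be cheaper than one that accepts a few bad ones, so accuracy alone is a poor guide here.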

In [None]:
import itertools
import matplotlib.pyplot as plt
import numpy as np

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

First, we load the dataset:

In [None]:
import pandas as pd
from scipy.io import arff
from sklearn.preprocessing import LabelEncoder

# loadarff accepts a file name directly, so no file handle is left open
credit_arff_data, credit_arff_meta = arff.loadarff('credit-g.arff')
credit_data = pd.DataFrame(credit_arff_data)

# select all columns of type object
columns_with_binary_strings = credit_data.select_dtypes('object').columns.values

# decode the values of these columns using utf-8
credit_data[columns_with_binary_strings] = credit_data[columns_with_binary_strings].apply(lambda x: x.str.decode("utf-8"))
credit_target = credit_data['class']
credit_data = credit_data.drop(columns='class')

label_encoder = LabelEncoder()
credit_target = label_encoder.fit_transform(credit_target)
# LabelEncoder sorts the classes alphabetically, so 'bad' -> 0 and 'good' -> 1
label_names = ['bad', 'good']
label_order = label_encoder.transform(label_names)

credit_data.head()
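
The label encoding used above can be checked on a toy example. The four input strings are made up for illustration; the point is that `LabelEncoder` assigns integers in sorted class order.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Toy labels, invented for illustration only.
encoded = label_encoder.fit_transform(['good', 'bad', 'good', 'good'])

print(encoded.tolist())                                   # [1, 0, 1, 1]
print(label_encoder.classes_.tolist())                    # ['bad', 'good']
print(label_encoder.transform(['bad', 'good']).tolist())  # [0, 1]
```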

Then, we set up a pipeline and evaluate it using cross validation:

In [None]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix

numeric_features = ['duration', 'credit_amount', 'installment_commitment', 'residence_since', 'age', 'existing_credits', 'num_dependents']
categorical_features = ['credit_history', 'purpose', 'personal_status', 'other_parties', 'property_magnitude', 'other_payment_plans', 'housing', 'job', 'own_telephone', 'foreign_worker']
ordinal_features = [ 'checking_status', 'savings_status', 'employment']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features),
        ('ord', OrdinalEncoder(categories=[
            [ 'no checking', '<0', '0<=X<200', '>=200' ],
            [ 'no known savings', '<100', '100<=X<500', '500<=X<1000', '>=1000' ],
            [ 'unemployed', '<1', '1<=X<4', '4<=X<7', '>=7' ]
        ]), ordinal_features)])

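def cost_function(y_true, y_pred):
    # Rows of cm are true labels, columns are predictions, both ordered (bad, good)
    cm = confusion_matrix(y_true, y_pred, labels=label_order)
    # 100 units per bad customer predicted as good, 1 unit per good customer predicted as bad
    return cm[0][1] * 100 + cm[1][0] * 1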

# TODO

Finally, we fit the pipeline to the dataset and plot the decision tree:

In [None]:
import graphviz 
from sklearn import tree
from sklearn.utils.multiclass import unique_labels

## Windows users: either add the path to Graphviz's dot.exe to your PATH variable
## OR uncomment the 2 lines below (you may have to change the path):

# import os
# os.environ['PATH'] += ';C:\\Program Files (x86)\\Graphviz2.38\\bin'

# TODO
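
A sketch of how a fitted tree can be turned into Graphviz DOT source. It assumes the bundled Iris data and a shallow tree as stand-ins for the fitted credit pipeline, so the feature and class names here are not those of the credit data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Assumption: a small tree on the bundled Iris data stands in for the credit model.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# out_file=None makes export_graphviz return the DOT source as a string.
dot_data = export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=list(iris.target_names),
                           filled=True, rounded=True)
print(dot_data[:40])
```

In a notebook, `graphviz.Source(dot_data)` renders this string as an image, provided the Graphviz binaries are on the PATH. For a pipeline, the tree has to be taken from the fitted final step first, for example via `named_steps`.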

#### 5.2.2. Now, try to find a more appropriate configuration for the Decision Tree classifier. Use GridSearchCV from scikit-learn.

Try the following parameters of the Decision Tree:
- 'criterion': ['gini', 'entropy']
- 'max_depth': [2, 3, 4, 5, None]
- 'min_samples_split': [2, 3, 4, 5]

With these values, the grid contains 40 (2 x 5 x 4) combinations.

What is the best configuration for the data set and the classification approach? 

Note: The grid search can take some time. You can set ```n_jobs=-1``` on ```GridSearchCV``` (and on ```cross_val_predict()```) to enable parallel processing (all CPU cores will be used).

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# define the parameter grid
parameters = {
    'estimator__criterion':['gini', 'entropy'],
    #TODO: add more parameters
}

# define the folds for the cross validation

# create a scorer for the grid search
cost_score = make_scorer(cost_function, greater_is_better=False)

# create the grid search estimator

# cross-validate

# calculate costs

# fit the grid search (= determine the optimal parameters)
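
A sketch of how the steps outlined above could fit together, not a reference solution. It assumes synthetic data from `make_classification` in place of the credit data, a one-step pipeline whose only step is named `estimator` (so the `estimator__` prefix matches the parameter grid), and a simplified cost function in which label 0 plays the role of "bad" and label 1 of "good". Preprocessing and oversampling are left out.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic, imbalanced data stands in for the credit data set.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7],
                           random_state=42)

def cost_function(y_true, y_pred):
    # Assumption: label 0 is "bad", label 1 is "good".
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return cm[0][1] * 100 + cm[1][0] * 1

parameters = {
    'estimator__criterion': ['gini', 'entropy'],
    'estimator__max_depth': [2, 3, 4, 5, None],
    'estimator__min_samples_split': [2, 3, 4, 5],
}
n_combinations = len(ParameterGrid(parameters))
print(n_combinations)  # 2 * 5 * 4 = 40

pipeline = Pipeline([('estimator', DecisionTreeClassifier(random_state=42))])
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# greater_is_better=False negates the cost, so the search minimises it.
cost_score = make_scorer(cost_function, greater_is_better=False)

grid_search = GridSearchCV(pipeline, parameters, scoring=cost_score, cv=folds)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(-grid_search.best_score_)  # mean cost per test fold of the best configuration
```

Because the scorer returns negated costs, `best_score_` is negative or zero, and flipping the sign recovers the mean cost. The fitted `grid_search.best_estimator_` is the pipeline refitted on all data with the best parameters.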

#### 5.2.3. How does the optimal decision tree differ from the one you learned in 5.2.1?
Plot the optimised tree!

In [None]:
#TODO