# Data Mining and Probabilistic Reasoning, WS18/19


Dr. Gjergji Kasneci, The University of Tübingen

-----
## Introduction to binary classification
-----

###### Date 19/11/2018

Teaching assistants:

 - Vadim Borisov (vadim.borisov@uni-tuebingen.de)

 - Johannes Haug (johannes-christian.haug@uni-tuebingen.de)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn import tree
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
plt.style.use('ggplot')

-----
# Rescaling & Data Partitioning

In [2]:
credit_unscaled = pd.read_csv('./data/cleaned_german_credit_data.csv', index_col=0)
credit_unscaled.head()

Unnamed: 0,Age,Sex,Job,Credit amount,Duration,Risk,Housing_own,Housing_free,Housing_rent,Purpose_radio/TV,...,Purpose_domestic appliances,Purpose_repairs,Purpose_vacation/others,Saving accounts_little,Saving accounts_quite rich,Saving accounts_rich,Saving accounts_moderate,Checking account_little,Checking account_moderate,Checking account_rich
0,67,0,2,1169,6,0,1,0,0,1,...,0,0,0,1,0,0,0,1,0,0
1,22,1,2,5951,48,1,1,0,0,1,...,0,0,0,1,0,0,0,0,1,0
2,49,0,1,2096,12,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,0
3,45,0,2,7882,42,0,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0
4,53,0,2,4870,24,1,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0


The features in our dataset are measured in different units and on very different scales (e.g. age in years vs. credit amount), so rescaling them may improve classification performance, particularly for distance-based models such as k-nearest neighbours. The most common ways of rescaling are:
- **Normalization:** rescale each feature so that its values lie in the range from 0 to 1
- **Standardization:** rescale each feature so that it has a mean of 0 and a standard deviation of 1

**Note:** Do not rescale the target variable, since our classification models expect it to be a binary categorical variable.
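
To make the two definitions concrete, here is a minimal sketch on a made-up two-feature array (the values are illustrative and not taken from the credit data). It applies `preprocessing.minmax_scale` for normalization and `preprocessing.scale` for standardization, both per column. For contrast it also calls `preprocessing.normalize`, which works per row and rescales each sample to unit norm, so it does not map features to the range from 0 to 1.

```python
import numpy as np
from sklearn import preprocessing

# Toy data (hypothetical): column 0 is an age in years, column 1 an amount
toy = np.array([[20.0, 1000.0],
                [40.0, 3000.0],
                [60.0, 5000.0]])

# Normalization (min-max): (x - min) / (max - min), computed per column
toy_normalized = preprocessing.minmax_scale(toy)

# Standardization (z-score): (x - mean) / std, computed per column
toy_standardized = preprocessing.scale(toy)

# Row-wise unit-norm scaling: each sample is divided by its own L2 norm
toy_row_normalized = preprocessing.normalize(toy)

print(toy_normalized)                             # every column spans 0 to 1
print(toy_standardized.mean(axis=0))              # approximately 0 per column
print(toy_standardized.std(axis=0))               # approximately 1 per column
print(np.linalg.norm(toy_row_normalized, axis=1)) # approximately 1 per row
```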

In [3]:
feature_cols = credit_unscaled.columns.drop('Risk')  # every column except the target variable

credit_standardized = credit_unscaled.copy()  # .copy() so that the original DataFrame stays unscaled
credit_standardized[feature_cols] = preprocessing.scale(credit_unscaled[feature_cols])

# Caution: preprocessing.normalize rescales each row (sample) to unit L2 norm;
# per-feature scaling to the range [0, 1] would be preprocessing.minmax_scale
credit_normalized = credit_unscaled.copy()
credit_normalized[feature_cols] = preprocessing.normalize(credit_unscaled[feature_cols])

credit_data = {'unscaled': credit_unscaled,
              'standardized': credit_standardized,
              'normalized': credit_normalized}

  


To evaluate the classification performance of our models, we apply them to data they have not seen during training. We therefore split the initial dataset into a training set and a test set. The training set is used to fit the models. Afterwards we apply the models to the test set and evaluate the classification accuracy of each model. More on data partitioning follows in the exercise about cross-validation.
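
As a small illustration of such a split, the sketch below calls `train_test_split` on made-up data (the arrays `X_toy` and `y_toy` are hypothetical). The `random_state` argument is set only to make this sketch reproducible; the exercise code does not fix it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, binary target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Hold out 20% of the rows as the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# No row appears in both partitions
train_rows = {tuple(row) for row in X_tr}
test_rows = {tuple(row) for row in X_te}
print(train_rows.isdisjoint(test_rows))  # True
```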

In [4]:
def partition_data(data, test_percent):
    X = data.loc[:, data.columns != 'Risk']
    y = data['Risk']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_percent)
    data_partitioned = {'X_train': X_train,
                'X_test': X_test,
                'y_train': y_train,
                'y_test': y_test}
    
    return data_partitioned

-----
# K Nearest Neighbour

In [5]:
def train_knn(k_neighbours, X_train, y_train, X_test):
    # weight neighbours by their distance to the datapoint: closer neighbours get higher weights
    knn_model = neighbors.KNeighborsClassifier(n_neighbors=k_neighbours, weights='distance')
    knn_model.fit(X_train, y_train)

    y_hat = knn_model.predict(X_test)
    return y_hat
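
With `weights='distance'`, scikit-learn weights each neighbour by the inverse of its distance. The following is a minimal sketch of that voting rule in plain NumPy, not the library's implementation; the helper `weighted_vote` and the example neighbourhood are hypothetical, and the sketch assumes all distances are strictly positive.

```python
import numpy as np

def weighted_vote(distances, labels):
    """Return the label with the largest sum of inverse-distance weights."""
    weights = 1.0 / np.asarray(distances, dtype=float)  # assumes all distances > 0
    labels = np.asarray(labels)
    classes = np.unique(labels)
    totals = [weights[labels == c].sum() for c in classes]
    return classes[int(np.argmax(totals))]

# Hypothetical neighbourhood: one very close neighbour of class 1, two farther ones of class 0
print(weighted_vote([0.5, 2.0, 2.0], [1, 0, 0]))  # 1, since weight 2.0 beats 0.5 + 0.5
```

An unweighted majority vote over the same three neighbours would pick class 0, which is the difference the weighting makes.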

-----
# Decision Tree

In [6]:
def train_dctree(X_train, y_train, X_test):
    dctree_model = tree.DecisionTreeClassifier()
    dctree_model.fit(X_train, y_train)
    
    y_hat = dctree_model.predict(X_test)
    return y_hat

-----
# Model evaluation

In [7]:
for dataset in credit_data:
    knn_acc = []
    dctree_acc = []
    
    for i in range(20): # average the classification accuracy of 20 runs to get a more stable result
        d = partition_data(credit_data[dataset], 0.2)
        
        knn_pred = train_knn(10, d['X_train'], d['y_train'], d['X_test'])
        knn_acc.append(accuracy_score(d['y_test'], knn_pred))
        
        dctree_pred = train_dctree(d['X_train'], d['y_train'], d['X_test'])
        dctree_acc.append(accuracy_score(d['y_test'], dctree_pred))
    
    
    print('------')
    print('Classification accuracy for {} dataset with KNN: {}%'.format(dataset,round(np.mean(knn_acc) * 100, 2)))
    print('Classification accuracy for {} dataset with Decision Tree: {}%'.format(dataset,round(np.mean(dctree_acc) * 100, 2)))

------
Classification accuracy for unscaled dataset with KNN: 66.14%
Classification accuracy for unscaled dataset with Decision Tree: 61.11%
------
Classification accuracy for standardized dataset with KNN: 70.47%
Classification accuracy for standardized dataset with Decision Tree: 62.59%
------
Classification accuracy for normalized dataset with KNN: 69.51%
Classification accuracy for normalized dataset with Decision Tree: 61.84%
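
Accuracy values are easier to interpret next to a trivial baseline. As a sketch that is not part of the original exercise, the following uses `DummyClassifier` from `sklearn.dummy` to always predict the most frequent class. The labels here are made up; running the same baseline on the `Risk` column of the credit data would show how much the models above gain over simply guessing the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (hypothetical): 7 samples of class 0 and 3 samples of class 1
y_toy = np.array([0] * 7 + [1] * 3)
X_toy = np.zeros((10, 1))  # this baseline ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)  # 0.7
```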
