### Teil 3 (7 + 5 P): 
Nutzen Sie drei Klassifikationsverfahren, um vorherzusagen, ob es eine Klimaanlage gibt (CentralAir). Eine davon soll ein Entscheidungsbaum sein, der auch grafisch ausgegeben werden kann (ggfs. separat in einem pdf). Gehen Sie analog zu Teil 2 vor, mit folgenden Unterschieden (siehe *): A, Statt Inferenz – Entscheidungsbaum beschreiben, B, Statt der o.g. Metriken (statt R2 etc.): Korrektklassifikationsrate (Accuracy), False-Positive und False-Negative-Rate.

We used the following three classification methods: 
- k-nearest-neighbors
- SVM
- ?

In [92]:
import sys
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation

from sklearn.model_selection import train_test_split # for splitting the data
from sklearn.preprocessing import StandardScaler  # feature scaling
from sklearn.neighbors import KNeighborsClassifier  # knn model from sklearn
from sklearn.metrics import classification_report, confusion_matrix  # evaluation of model
from sklearn import svm

import pickle  # conservation of model 

#import matplotlib
%matplotlib inline 

# Seaborn customice visualisation
sns.set(style="white", palette="muted", color_codes=True) 

print('Python Verison: ', sys.version)  # parentheses necessary in python 3.
print('Pandas Version: ', pd.__version__)
print('Numpy Version: ', np.__version__)
print('Seaborn Version: ', sns.__version__)
#print('Matplotlib Version: ', matplotlib.__version__)

Python Verison:  3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
Pandas Version:  0.25.1
Numpy Version:  1.16.5
Seaborn Version:  0.10.0


## K-nearest-neighbors classification
Für die knn Klassifikation ist es sinnvoll, wenn nur numerische Spalten in dem Datensatz enthalten sind. Somit müssen die nicht numerischen Spalten gelöschte werden.

In [56]:
# gets dataFrame from file
def read_data(path):
    df = pd.read_csv(path, sep=";")
    
    # renaming columns 
    df = df.rename(columns={"HeatingQC": "Heating Quality and Condition","BldgType": "Dwelling Type", "RoofStyle":"Roof Style", "MSZoning":"Zoning Classification", "LotArea": "Lot Area", "OverallQual":"Overall Quality", "OverallCond":"Overall Condition","YearBuilt":"Year Built", "TotalBsmtSF":"Basement Area SF", "YearRemodAdd": "Remodel Date", "GrLivArea": "Ground Area SF", "TotRmsAbvGrd": "Total Rooms above Grade", "YrSold": "Year Sold", "GarageCars": "Garage Cars", "SalePrice": "Sale Price"})
    df.head()
    
    return df


# seperates data by datatype
def seperate_columns_by_datatype(df):
    categoricalCols = []
    for i in range(len(df.columns)):
        if (df.dtypes[i] == object):
            categoricalCols.append(df.columns[i])

    numericalCols = []
    for i in range(len(df.columns)):
        if (df.dtypes[i] == "int64" ):
            numericalCols.append(df.columns[i])

    boolCols = []
    for i in range(len(df.columns)):
        if (df.dtypes[i] == "bool" ):
            boolCols.append(df.columns[i])


    print('categoricalCols:', categoricalCols)
    print('numericalCols:', numericalCols)
    print('boolCols:', boolCols)
    
    return categoricalCols, numericalCols, boolCols


def remove_non_numerical_columns(df):
    
    categoricalCols, numericalCols, boolCols = seperate_columns_by_datatype(df)

    # Remove non numerical columns and reaadd the column CentralAir
    dfnumerical = df.drop(columns=categoricalCols)
    dfnumerical = dfnumerical.join(df['CentralAir'])
    
    return dfnumerical


def preprocessing_and_scaling(dfnumerical):
    
    # safe the columns and labels in two different variables
    X = dfnumerical.iloc[:, :-1].values  # X contains all columns of the dataset but cantralAir
    y = dfnumerical.iloc[:, 11].values   # y contains the labels for each row (Y or N), so if it has central air or not

    # Splitting the dataset into a 80% train set and a 20% test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

    # Feature scaling for normalizing so that each feature contributes approximately proportionately to the final distance.
    scaler = StandardScaler()
    scaler.fit(X_train)

    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    
    return X_train, X_test, y_train, y_test


def train_model(X_train, X_test, y_train, y_test):
    
    # initiate model with parameter K = 5 
    knn_classifier = KNeighborsClassifier(n_neighbors=1)

    # fit the model to the data
    knn_classifier.fit(X_train, y_train)
    
    return knn_classifier


def evaluate_model(model):

    # make prediction on test data
    y_pred = model.predict(X_test)
    
    # evaluating the trained model
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

    # calculate TP, TN, FP, FN
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    
    # FP, FN, TN and TP in percent
    FP_Perc = fp / (tp + fp)
    FN_Perc = fn / (tn + fn)
    TP_Perc = tp / (tp + fp)
    TN_Perc = tn / (tn + fn)

    dataPercent = np.array([[TP_Perc, FP_Perc], [FN_Perc, TN_Perc]])
    confusion_Matric_Percent = pd.DataFrame(data=dataPercent, index=['Positive', 'Negative'], columns=['Positive', 'Negative'])

    return confusion_Matric_Percent

In [127]:
path = 'cleanedSet.csv'
#path = 'SetFiltered.csv'

#read the data
df = read_data(path)

#remove non numerical data
dfnumerical = remove_non_numerical_columns(df)
dfnumerical.head()

# split and process data
X_train, X_test, y_train, y_test = preprocessing_and_scaling(dfnumerical)

# build model
model = train_model(X_train, X_test, y_train, y_test)

# evaluate model
confusion_Matric_Percent = evaluate_model(model)
confusion_Matric_Percent.head()

categoricalCols: ['Zoning Classification', 'Neighborhood', 'Dwelling Type', 'Roof Style', 'Heating Quality and Condition', 'CentralAir']
numericalCols: ['Lot Area', 'Overall Quality', 'Overall Condition', 'Year Built', 'Remodel Date', 'Basement Area SF', 'Ground Area SF', 'Total Rooms above Grade', 'Garage Cars', 'Year Sold', 'Sale Price']
boolCols: []
[[  4   1]
 [  2 134]]
              precision    recall  f1-score   support

           N       0.67      0.80      0.73         5
           Y       0.99      0.99      0.99       136

    accuracy                           0.98       141
   macro avg       0.83      0.89      0.86       141
weighted avg       0.98      0.98      0.98       141



Unnamed: 0,Positive,Negative
Positive,0.992593,0.007407
Negative,0.333333,0.666667


Es ist zu sehen, dass der knn Klassifikator bereits sehr gute Ergebnisse lierfert ohne das hierfür eine aufwändige Datenverarbeitung stattfinden musste. Die Accuracy beträgt für das angepasste Modell die accuracy 98%.

### Saving the generated model

In [128]:
def save_model(model):
    # specifying filename for model
    filename='knn_model.sav'

    # save the model in the created file only executed once
    pickle.dump(model, open(filename, 'wb'))

In [129]:
#save_model(model)

In [130]:
# load the model
def load_knn_model():
    load_knn_classification_model = pickle.load(open(filename, 'rb'))
    return load_knn_classification_model


# prepare data for validation
def prepare_data(dfnumerical):
    # safe the columns and labels in two different variables
    X = dfnumerical.iloc[:, :-1].values  # X contains all columns of the dataset but cantralAir
    y = dfnumerical.iloc[:, 11].values   # y contains the labels for each row (Y or N), so if it has central air or not

    # Feature scaling for normalizing so that each feature contributes approximately proportionately to the final distance.
    scaler = StandardScaler()
    scaler.fit(X)

    X = scaler.transform(X)
    
    return X, y


def test_model(model, X, y):

    # make prediction on test data
    y_pred = model.predict(X)
    
    # evaluating the trained model
    print(confusion_matrix(y, y_pred))
    print(classification_report(y, y_pred))

    # calculate TP, TN, FP, FN
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    
    # FP, FN, TN and TP in percent
    FP_Perc = fp / (tp + fp)
    FN_Perc = fn / (tn + fn)
    TP_Perc = tp / (tp + fp)
    TN_Perc = tn / (tn + fn)

    dataPercent = np.array([[TP_Perc, FP_Perc], [FN_Perc, TN_Perc]])
    confusion_Matric_Percent_Test = pd.DataFrame(data=dataPercent, index=['Positive', 'Negative'], columns=['Positive', 'Negative'])

    return confusion_Matric_Percent_Test

Eine Vorhersage zusätzlicher Daten kann im folgenden Codeteil vorgenommen werden. Hierfür muss nur die folgende path Variable angepasst werden.

In [132]:
path = ''

#read the data
df = read_data(path)

#remove non numerical data
dfnumerical = remove_non_numerical_columns(df)

# seperate labels
X, y = prepare_data(dfnumerical)

# load saved model
model = load_knn_model()

# test model
confusion_Matric_Percent_Test = test_model(model, X, y)
confusion_Matric_Percent_Test.head()

categoricalCols: ['Zoning Classification', 'Neighborhood', 'Dwelling Type', 'Roof Style', 'Heating Quality and Condition', 'CentralAir']
numericalCols: ['Lot Area', 'Overall Quality', 'Overall Condition', 'Year Built', 'Remodel Date', 'Basement Area SF', 'Ground Area SF', 'Total Rooms above Grade', 'Garage Cars', 'Year Sold', 'Sale Price']
boolCols: []
[[ 31   1]
 [  1 670]]
              precision    recall  f1-score   support

           N       0.97      0.97      0.97        32
           Y       1.00      1.00      1.00       671

    accuracy                           1.00       703
   macro avg       0.98      0.98      0.98       703
weighted avg       1.00      1.00      1.00       703



Unnamed: 0,Positive,Negative
Positive,0.99851,0.00149
Negative,0.03125,0.96875


## SVM Classification
SVM folgt der selben Struktur wie die knn Klassifikation.

In [160]:
def train_model_SVM(X_train, X_test, y_train, y_test):
    
    # initiate model with parameter K = 5 
    svm_classifier = svm.SVC(kernel='sigmoid') # Linear Kernel

    # fit the model to the data
    svm_classifier.fit(X_train, y_train)
    
    return svm_classifier


def save_model(model):
    # specifying filename for model
    filename='svm_model.sav'

    # save the model in the created file only executed once
    pickle.dump(model, open(filename, 'wb'))
    
    
# load the model
def load_svm_model():
    load_svm_classification_model = pickle.load(open(filename, 'rb'))
    return load_svm_classification_model

In [156]:
path = 'cleanedSet.csv'
#path = 'SetFiltered.csv'

#read the data
df = read_data(path)

#remove non numerical data
dfnumerical = remove_non_numerical_columns(df)
dfnumerical.head()

# split and process data
X_train, X_test, y_train, y_test = preprocessing_and_scaling(dfnumerical)

# build model
model = train_model(X_train, X_test, y_train, y_test)

# evaluate model
confusion_Matric_Percent = evaluate_model(model)
confusion_Matric_Percent.head()

categoricalCols: ['Zoning Classification', 'Neighborhood', 'Dwelling Type', 'Roof Style', 'Heating Quality and Condition', 'CentralAir']
numericalCols: ['Lot Area', 'Overall Quality', 'Overall Condition', 'Year Built', 'Remodel Date', 'Basement Area SF', 'Ground Area SF', 'Total Rooms above Grade', 'Garage Cars', 'Year Sold', 'Sale Price']
boolCols: []
[[  2   4]
 [  1 134]]
              precision    recall  f1-score   support

           N       0.67      0.33      0.44         6
           Y       0.97      0.99      0.98       135

    accuracy                           0.96       141
   macro avg       0.82      0.66      0.71       141
weighted avg       0.96      0.96      0.96       141



Unnamed: 0,Positive,Negative
Positive,0.971014,0.028986
Negative,0.333333,0.666667


In [157]:
# save_model(model)

Eine Vorhersage zusätzlicher Daten kann im folgenden Codeteil vorgenommen werden. Hierfür muss nur die folgende path Variable angepasst werden.

In [161]:
path = ''

#read the data
df = read_data(path)

#remove non numerical data
dfnumerical = remove_non_numerical_columns(df)

# seperate labels
X, y = prepare_data(dfnumerical)

# load saved model
model = load_svm_model()

# test model
confusion_Matric_Percent_Test = test_model(model, X, y)
confusion_Matric_Percent_Test.head()

categoricalCols: ['Zoning Classification', 'Neighborhood', 'Dwelling Type', 'Roof Style', 'Heating Quality and Condition', 'CentralAir']
numericalCols: ['Lot Area', 'Overall Quality', 'Overall Condition', 'Year Built', 'Remodel Date', 'Basement Area SF', 'Ground Area SF', 'Total Rooms above Grade', 'Garage Cars', 'Year Sold', 'Sale Price']
boolCols: []
[[ 28   4]
 [  2 669]]
              precision    recall  f1-score   support

           N       0.93      0.88      0.90        32
           Y       0.99      1.00      1.00       671

    accuracy                           0.99       703
   macro avg       0.96      0.94      0.95       703
weighted avg       0.99      0.99      0.99       703



Unnamed: 0,Positive,Negative
Positive,0.994056,0.005944
Negative,0.066667,0.933333
