# Selecting between models.
---
Jose Miguel Contreras Mantilla

Undergraduate Math Student, Universidad Nacional de Colombia.

Mathematics for Machine Learning.

---


In thi notebook we will work with the datasets from:

https://archive.ics.uci.edu/ml/datasets/banknote+authentication

https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+

And will try to select a learning model between a set of 4 or 5 five of them, which we hope, is the best to solve each problem. Finally we will test our selected Model.

First we import the libraries we will use.

In [1]:
# Basic libraries.
import numpy as np
import pandas as pd
import zipfile
from pathlib import Path
import urllib.request
import numpy as np
from datetime import datetime

# Optimization
from scipy.optimize import linprog

# Machine Learning Models and Metrics.
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron
from sklearn import svm
from sklearn import neighbors
from sklearn import tree

# Visualization libraries.
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import plotly.express as px
from plotly.offline import plot
from plotly.subplots import make_subplots

Then we save the data from the two problems we want to check, and split it between different kind of sets. Here we will consider three type of sets for boh problems: 
* A training set: We will use the data from this set to train our models. (All models will use this set)
* A first test set: We will use the data from this set to compare the accuracy of our different models. (All models will use this set)
* A second test set: We will use the data from this set to check the accuracy of the model we selected using the information of the previous test. Only one model will use this data in each problem.

In [2]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt", 
                 sep = ',', 
                 header = None, 
                 names=["variance_of_Wavelet","skewness_of_Wavelet",
                        "curtosis_of_Wavelet","entropy",
                        "class"],
                 thousands = ',')
variables=["date","temperature", "curtosis_of_Wavelet","entropy"]


train_set_df, test_set_df= train_test_split(df, test_size=0.3,random_state=42)

test_set_df1,test_set_df2=train_test_split(test_set_df,test_size=0.5,random_state=42)

#Definition of the features X and the labels y for training
X=train_set_df[['variance_of_Wavelet', 'skewness_of_Wavelet', 'curtosis_of_Wavelet','entropy']]
y=train_set_df["class"]
#df.keys()

 #Definition of the features X_test and the labels y_test for testing 1
X_test1=test_set_df1[['variance_of_Wavelet', 'skewness_of_Wavelet', 'curtosis_of_Wavelet','entropy']]
y_test1=test_set_df1['class']

X_test2=test_set_df2[['variance_of_Wavelet', 'skewness_of_Wavelet', 'curtosis_of_Wavelet','entropy']]
y_test2=test_set_df2['class']


In [6]:
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

def load_occupancy_data():
    tarball_path = Path("datasets/occupancy_data.zip")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00357/occupancy_data.zip"
        urllib.request.urlretrieve(url, tarball_path)
        with zipfile.ZipFile(tarball_path) as occupancy_tarball:
           # open the csv file in the dataset
           occupancy_tarball.extractall(path="datasets")
    list_df =[pd.read_csv(Path("datasets/datatraining.txt"),parse_dates=['date'],date_parser=dateparse),
              pd.read_csv(Path("datasets/datatest.txt"),parse_dates=['date'],date_parser=dateparse),
              pd.read_csv(Path("datasets/datatest2.txt"),parse_dates=['date'],date_parser=dateparse),]
    return list_df

train, test1, test2= load_occupancy_data()

train['date_numeric'] = train['date'].apply(lambda time: time.year+time.month/12+ time.day/365 + time.hour/8760+time.minute/525600)
test1['date_numeric'] = test1['date'].apply(lambda time: time.year+time.month/12+ time.day/365 + time.hour/8760+time.minute/525600)
test2['date_numeric'] = test2['date'].apply(lambda time: time.year+time.month/12+ time.day/365 + time.hour/8760+time.minute/525600)

X_train_o=train[['date_numeric', 'Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio']]
y_train_o=train['Occupancy']

X_test1_o=test1[['date_numeric', 'Temperature', 'Humidity', 'Light', 
              'CO2', 'HumidityRatio']]
y_test1_o=test1['Occupancy']

X_test2_o=test2[['date_numeric', 'Temperature', 'Humidity', 'Light', 
              'CO2', 'HumidityRatio']]
y_test2_o=test2['Occupancy']



Now we will train our learning models using the training set. Here we will use: SVM, perceptron, K-nearest neighborhoods (with uniform and distance related weights) and decision-trees.

Now we will work with de first problem:

In [3]:
clf_p = Perceptron(tol=1e-3, random_state=0)
clf= svm.SVC()
clf_t=tree.DecisionTreeClassifier()
n_neighbors=15

#For the banknote problem
print("Working with the banknote dataset:")
print("Using the perceptron model:")
clf_p.fit(X, y)
print('Accuracy training set:')
clf_p.score(X, y)

predicted_perceptron_test1=clf_p.predict(X_test1)
print('Accuracy testing set 1:')
clf_p.score(X_test1, y_test1)
print(accuracy_score(y_test1, predicted_perceptron_test1).round(2))
conf_matrix_test1= confusion_matrix(y_test1, predicted_perceptron_test1)
#print('Acuracy:',accuracy_test1_o)
print('Confusion matrix:\n', conf_matrix_test1)

print("Using the SVM model:")
clf.fit(X, y)
print('Accuracy training set:')
clf.score(X, y)

predicted_svm_test1=clf.predict(X_test1)
print('Accuracy testing set 1:')
clf.score(X_test1, y_test1)
print(accuracy_score(y_test1, predicted_svm_test1).round(2))
conf_matrix_test1= confusion_matrix(y_test1, predicted_svm_test1)
#print('Acuracy:',accuracy_test1_o)
print('Confusion matrix:\n', conf_matrix_test1)


for weights in ["uniform", "distance"]:
    print(" Using the K-nearest neighborhood with ", weights , "weights") 
    # we create an instance of Neighbours Classifier and fit the data.
    clf_k = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf_k.fit(X, y)
    print('Accuracy training set:')
    clf_k.score(X, y)

    predicted_k_test1=clf_k.predict(X_test1)
    print('Accuracy testing set 1:')
    clf_k.score(X_test1, y_test1)
    print(accuracy_score(y_test1, predicted_k_test1).round(2))
    conf_matrix_test1= confusion_matrix(y_test1, predicted_k_test1)
    #print('Acuracy:',accuracy_test1_o)
    print('Confusion matrix:\n', conf_matrix_test1)

print("Using the Decision tree model:")
clf_t.fit(X, y)
print('Accuracy training set:')
clf_t.score(X, y)

predicted_tree_test1=clf_t.predict(X_test1)
print('Accuracy testing set 1:')
clf_t.score(X_test1, y_test1)
print(accuracy_score(y_test1, predicted_tree_test1).round(2))
conf_matrix_test1= confusion_matrix(y_test1, predicted_tree_test1)
#print('Acuracy:',accuracy_test1_o)
print('Confusion matrix:\n', conf_matrix_test1)


Working with the banknote dataset:
Using the perceptron model:
Accuracy training set:
Accuracy testing set 1:
0.99
Confusion matrix:
 [[121   1]
 [  2  82]]
Using the SVM model:
Accuracy training set:
Accuracy testing set 1:
1.0
Confusion matrix:
 [[122   0]
 [  0  84]]
 Using the K-nearest neighborhood with  uniform weights
Accuracy training set:
Accuracy testing set 1:
1.0
Confusion matrix:
 [[122   0]
 [  0  84]]
 Using the K-nearest neighborhood with  distance weights
Accuracy training set:
Accuracy testing set 1:
1.0
Confusion matrix:
 [[122   0]
 [  0  84]]
Using the Decision tree model:
Accuracy training set:
Accuracy testing set 1:
0.99
Confusion matrix:
 [[122   0]
 [  2  82]]


We can conclude form the information, that we get from the code that the perceptron model and the decision tree model aren't the ones we are looking for. Then we have to choose between svm, and k-nearest neighborhood. Trying to find a way to decide betwwen them we look in their behavior with the trainind data:

In [4]:
predicted_svm= clf.predict(X)
print("Acurracy with the trainin data for svm:", accuracy_score(y, predicted_svm).round(2))
for weights in ["uniform", "distance"]:
    print(" Using the K-nearest neighborhood with ", weights , "weights") 
    # we create an instance of Neighbours Classifier and fit the data.
    clf_k = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf_k.fit(X, y)
    predicted_k=clf_k.predict(X)
    print('Accuracy training set:', accuracy_score(y, predicted_k).round(2))
    


Acurracy with the trainin data for svm: 0.99
 Using the K-nearest neighborhood with  uniform weights
Accuracy training set: 1.0
 Using the K-nearest neighborhood with  distance weights
Accuracy training set: 1.0


Finally we observe that the models which predicted better the test1 data and fit better the training data where the k-nearest neighborhood models. Finally we will select the k-nearest neighborhood model using uniform weights considering the problem we are working in. We will test how good was our selection with the test2 dataset:

In [5]:
#train2
clf_k= neighbors.KNeighborsClassifier(n_neighbors, weights="uniform")
predicted_k_test2=clf_p.predict(X_test2)
print('Accuracy testing set 2:')
clf_p.score(X_test2, y_test2)

#accuracy_test2_o= accuracy_score(y_test2_o, predicted_perceptron_test2_o).round(2)
conf_matrix_test2= confusion_matrix(y_test2, predicted_k_test2)
#print('Acuracy:',accuracy_test2_o)
print('Confusion matrix:\n', conf_matrix_test2)
print('Accuracy testing set 2:')
clf_t.score(X_test2, y_test2)
print(accuracy_score(y_test2, predicted_k_test2).round(2))

Accuracy testing set 2:
Confusion matrix:
 [[103   4]
 [  1  98]]
Accuracy testing set 2:
0.98


Finally the model we stay with it is the K-nearest neighborhood with uniform weights, and we can observe that ir has an acurracy of 98% in the test 2 dataset, which it is pretty good.


Now we will repeat this process with the dataset from the second problem.

In [7]:
clf_p = Perceptron(tol=1e-3, random_state=0)
clf= svm.SVC()
clf_t=tree.DecisionTreeClassifier()
n_neighbors=15

#For the banknote problem
print("Working with the banknote dataset:")
print("Using the perceptron model:")
clf_p.fit(X_train_o, y_train_o)
print('Accuracy training set:')
clf_p.score(X_train_o, y_train_o)

predicted_perceptron_test1_o=clf_p.predict(X_test1_o)
print('Accuracy testing set 1:')
clf_p.score(X_test1_o, y_test1_o)
print(accuracy_score(y_test1_o, predicted_perceptron_test1_o).round(2))
conf_matrix_test1= confusion_matrix(y_test1_o, predicted_perceptron_test1_o)
#print('Acuracy:',accuracy_test1_o)
print('Confusion matrix:\n', conf_matrix_test1)

print("Using the SVM model:")
clf.fit(X_train_o, y_train_o)
print('Accuracy training set:')
clf.score(X_train_o, y_train_o)

predicted_svm_test1=clf.predict(X_test1_o)
print('Accuracy testing set 1:')
clf.score(X_test1_o, y_test1_o)
print(accuracy_score(y_test1_o, predicted_svm_test1).round(2))
conf_matrix_test1= confusion_matrix(y_test1_o, predicted_svm_test1)
#print('Acuracy:',accuracy_test1_o)
print('Confusion matrix:\n', conf_matrix_test1)


for weights in ["uniform", "distance"]:
    print(" Using the K-nearest neighborhood with ", weights , "weights") 
    # we create an instance of Neighbours Classifier and fit the data.
    clf_k = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf_k.fit(X_train_o, y_train_o)
    print('Accuracy training set:')
    clf_k.score(X_train_o, y_train_o)

    predicted_k_test1=clf_k.predict(X_test1_o)
    print('Accuracy testing set 1:')
    clf_k.score(X_test1_o, y_test1_o)
    print(accuracy_score(y_test1_o, predicted_k_test1).round(2))
    conf_matrix_test1= confusion_matrix(y_test1_o, predicted_k_test1)
    #print('Acuracy:',accuracy_test1_o)
    print('Confusion matrix:\n', conf_matrix_test1)

print("Using the Decision tree model:")
clf_t.fit(X_train_o, y_train_o)
print('Accuracy training set:')
clf_t.score(X_train_o, y_train_o)

predicted_tree_test1=clf_t.predict(X_test1_o)
print('Accuracy testing set 1:')
clf_t.score(X_test1_o, y_test1_o)
print(accuracy_score(y_test1_o, predicted_tree_test1).round(2))
conf_matrix_test1= confusion_matrix(y_test1_o, predicted_tree_test1)
#print('Acuracy:',accuracy_test1_o)
print('Confusion matrix:\n', conf_matrix_test1)


Working with the banknote dataset:
Using the perceptron model:
Accuracy training set:
Accuracy testing set 1:
0.95
Confusion matrix:
 [[1648   45]
 [  79  893]]
Using the SVM model:
Accuracy training set:
Accuracy testing set 1:
0.98
Confusion matrix:
 [[1641   52]
 [  13  959]]
 Using the K-nearest neighborhood with  uniform weights
Accuracy training set:
Accuracy testing set 1:
0.97
Confusion matrix:
 [[1643   50]
 [  31  941]]
 Using the K-nearest neighborhood with  distance weights
Accuracy training set:
Accuracy testing set 1:
0.96
Confusion matrix:
 [[1644   49]
 [  50  922]]
Using the Decision tree model:
Accuracy training set:
Accuracy testing set 1:
0.87
Confusion matrix:
 [[1668   25]
 [ 323  649]]


In this case there is a higher difference among the values of the accuracy in each model for the data set of test 1. However we can see that the one, which predicts the new data with better precision is the model of SVM. We will select this one as our final model, and check how well is his behavior usinf the dataset Test 2.

In [9]:
predicted_svm_test2=clf.predict(X_test2_o)
print('Accuracy testing set 2:')
clf_p.score(X_test2_o, y_test2_o)

#accuracy_test2_o= accuracy_score(y_test2_o, predicted_perceptron_test2_o).round(2)
conf_matrix_test2= confusion_matrix(y_test2_o, predicted_svm_test2)
#print('Acuracy:',accuracy_test2_o)
print('Confusion matrix:\n', conf_matrix_test2)
print('Accuracy testing set 2:')
clf_t.score(X_test2_o, y_test2_o)
print(accuracy_score(y_test2_o, predicted_svm_test2).round(2))

Accuracy testing set 2:
Confusion matrix:
 [[7566  137]
 [   8 2041]]
Accuracy testing set 2:
0.99


We can see that we ended up with an even better accuracy in the dataset of test 2, using the SVM than the accuracy in the dataset of test 1 with the same model. That really good, and give us certain degree of confidence in our selection.

#References:


*   https://scikit-learn.org/stable/modules/tree.html
*   https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
*   https://scikit-learn.org/stable/modules/neighbors.html
*  https://scikit-learn.org/stable/modules/svm.html
