# **AFI Escuela de finanzas**

![alt text](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTRsvzArVKQ5jGTVEqwdNneQFIgYVvjLPbYNvxAfFV_iktBaf9u&s)

## **Máster Executive en Data Science y Big Data en Finanzas**

**January 17, 2020**

# **IoT Use cases**




# **Practical Session: Classification methods**

This session works with a dataset of water consumption events, which you can find in dataset_eventos.csv.


In this lab session we will deepen our knowledge of classifiers by working with the best-known classification algorithms. We will also review some useful techniques, such as the cross validation process, which lets us adjust the free parameters of a classifier.

#### **During this lab we will cover:**

#### *Part 1: Linear models*
#### *Part 2: K-Nearest Neighbours (K-NN)*
#### *Part 3: Support Vector Machines (SVMs) with different kernel functions*
#### *Part 4: Tree based algorithms*
#### *Part 5: Neural Networks*


As in the previous lab session, we will rely on the [Scikit-Learn](http://scikit-learn.org/stable/) Python toolbox to implement the different approaches.


### **Part 0: Load and prepare the data**

This dataset consists of 6 classes of water consumption events (tap, toilet, shower, ...).

The next code cell loads the dataset, creates the training and test partitions, and normalizes them.

Useful functions: [make_classification( )](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html), [train_test_split( )](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html) and [StandardScaler( )](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).


In [1]:
%matplotlib inline

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.preprocessing import StandardScaler

from sklearn import preprocessing

np.random.seed(12)

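def ReadEvents(file):
    """Load the features (first 37 columns) and the encoded labels (column 38) from a ';'-separated file."""
    data = np.loadtxt(file,skiprows=1,delimiter=';',usecols=range(0,37))
    labels = np.loadtxt(file,skiprows=1,delimiter=';',usecols=(37,),dtype='str')
    (nSamples,nFeatures)=data.shape
    # Shuffle the samples and their labels with the same random permutation
    randomPermutation = np.random.permutation(nSamples)
    data=data[randomPermutation,:]
    labels=labels[randomPermutation]
    # Encode the string labels as integers 0..n_classes-1
    le = preprocessing.LabelEncoder()
    le.fit(np.unique(labels))
    labels = le.transform(labels)
    return data,labels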
    

###############################################################################
# Load the data from disk as numpy arrays
print('The first time that you downlaod the data it can take a while...')
import numpy as np
#dataset = np.loadtxt('./dataPrepared.csv', delimiter=',',skiprows=1)
#X = dataset[:,:-1]
#Y=dataset[:,-1]
X,Y = ReadEvents('./dataset_eventos.csv')

# each event is described by a vector of numeric features
n_features = X.shape[1]

# the label to predict is the type of water consumption event
n_classes = np.unique(Y).shape[0]

print("Dataset size information:")
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)


###############################################################################
# Preparing the data

# Initialize the random generator seed to compare results
np.random.seed(1)

# Split into a training set and a test set (random 50/50 split, not stratified)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)

# Normalizing the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Binarize the labels for some feature selection methods
set_classes = np.unique(Y)
Y_train_bin = label_binarize(Y_train, classes=set_classes)

print("Number of training samples: %d" % X_train.shape[0])
print("Number of test samples: %d" % X_test.shape[0])

The first time that you downlaod the data it can take a while...
Dataset size information:
n_features: 37
n_classes: 6
Number of training samples: 5492
Number of test samples: 5492


### **Part 1: Linear models**

Include the necessary code to train and test a classifier based on:
1. A logistic regression model: in this case, adjust the C parameter by CV
2. Linear Discriminant Analysis



In [13]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

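# Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # needed for the CV search below
import warnings
warnings.filterwarnings("ignore")  # silence library warnings during the grid search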

rang_C = np.logspace(-3, 3, 10)
tuned_parameters = [{'C': rang_C}]
nfold = 10

# Train a LR model and adjust by CV the parameter C
clf_LR  = GridSearchCV(LogisticRegression(),
                   tuned_parameters, cv=nfold)
clf_LR.fit(X_train, Y_train)# <FILL IN> 
acc_test_LR=clf_LR.score(X_test,Y_test)# <FILL IN> 

# LDA 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
clf_LDA = LDA()
clf_LDA.fit(X_train,Y_train)# <FILL IN> 
acc_test_LDA=clf_LDA.score(X_test,Y_test)# <FILL IN> 

print("The test accuracy of LR is %2.2f" %(100*acc_test_LR))
print("The test accuracy of LDA is %2.2f" %(100*acc_test_LDA))

The test accuracy of LR is 65.33
The test accuracy of LDA is 60.67


### **Part 2: K nearest neighbors**

A K-NN approach classifies each new sample by searching for its K nearest neighbors (among the training data) and assigning the majority class among these neighbors. As expected, its performance depends on the number of neighbors (K) used.

To get started, let's analyze the K-NN performance for different values of K, over both the training and test sets. Use the [KNeighborsClassifier()](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) class for this.
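The sweep over K described above can be sketched as follows. This is a minimal, self-contained illustration rather than the lab solution: it assumes a synthetic 6-class dataset generated with `make_classification` as a stand-in for the water consumption data, and the names `X_demo`, `y_demo` and `k_values` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 6 classes, as in the lab dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=8, n_classes=6,
                                     n_clusters_per_class=1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.5,
                                      random_state=0)

# Normalize with statistics estimated on the training set only
scaler = StandardScaler().fit(Xtr)
Xtr, Xte = scaler.transform(Xtr), scaler.transform(Xte)

# Train one K-NN classifier per value of K and record both accuracies
k_values = [1, 3, 5, 11, 21, 41]
acc_train, acc_test = [], []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    acc_train.append(knn.score(Xtr, ytr))
    acc_test.append(knn.score(Xte, yte))
    print("K = %2d: train accuracy %.3f, test accuracy %.3f"
          % (k, acc_train[-1], acc_test[-1]))
```

With K = 1 every training sample is its own nearest neighbor, so the training accuracy is perfect by construction and says nothing about generalization.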


Comparing the training and test accuracy across values of K shows the need to select an adequate value of K. As expected, selecting K with the training error would generalize poorly.

#### **Selecting the number of neighbors of a K-NN classifier**

Therefore, the next step is to apply a cross validation (CV) process to select the optimum value of K. You can use the [GridSearchCV( )](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) function to implement it. 

In [41]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

from sklearn import neighbors
from sklearn.model_selection import GridSearchCV

# Parameters
K_max = 50
rang_K = np.arange(1, K_max+1)
nfold = 10
# Define a dictionary with the names of the parameters to explore as keys and the ranges to explore as values
tuned_parameters = [{'n_neighbors': rang_K}]


# Cross validation process
clf_base = neighbors.KNeighborsClassifier( )
# Define the classifier with the CV process (use GridSearchCV here!!!)
clf =  GridSearchCV(clf_base, tuned_parameters, cv = nfold, scoring = 'accuracy')#<FILL IN>
# Train it (this executes the CV)
clf.fit(X_train,Y_train)#<FILL IN>

print('CV process successfully finished')

CV process successfully finished


After running the CV process, the classifier object contains the information of the CV process (the next cell explores the attribute ".cv\_results\_" to obtain this information).

In [42]:
# Printing results
print("Cross validation results:")

paramsFolds = clf.cv_results_['params']
meanScoreFolds = clf.cv_results_['mean_test_score']
stdScoreFolds = clf.cv_results_['std_test_score']

# One entry per explored value of K (not per fold)
for params, mean_score, std_score in zip(paramsFolds, meanScoreFolds, stdScoreFolds):
    # The reported spread is half the standard deviation across folds
    print("For K = %d, validation accuracy is %2.2f (+/-%1.3f)%%" 
          % (params['n_neighbors'], 100*mean_score, 100*std_score / 2))



Cross validation results:
For K = 1, validation accuracy is 56.77 (+/-3.938)%
For K = 2, validation accuracy is 62.49 (+/-2.736)%
For K = 3, validation accuracy is 62.58 (+/-3.195)%
For K = 4, validation accuracy is 63.58 (+/-2.952)%
For K = 5, validation accuracy is 62.43 (+/-2.889)%
For K = 6, validation accuracy is 64.71 (+/-2.369)%
For K = 7, validation accuracy is 64.53 (+/-2.287)%
For K = 8, validation accuracy is 63.36 (+/-2.778)%
For K = 9, validation accuracy is 65.49 (+/-1.842)%
For K = 10, validation accuracy is 64.49 (+/-2.345)%
For K = 11, validation accuracy is 64.58 (+/-2.362)%
For K = 12, validation accuracy is 65.69 (+/-2.060)%
For K = 13, validation accuracy is 65.73 (+/-1.922)%
For K = 14, validation accuracy is 65.60 (+/-1.992)%
For K = 15, validation accuracy is 65.46 (+/-2.044)%
For K = 16, validation accuracy is 66.42 (+/-0.788)%
For K = 17, validation accuracy is 65.48 (+/-2.165)%
For K = 18, validation accuracy is 66.39 (+/-0.787)%
For K = 19, validation accura

Examine the fields ".best\_estimator\_" and ".best\_params\_" of the classifier generated by the CV process:
* ".best\_estimator\_" contains the final classifier trained with the selected value.
* ".best\_params\_" is a dictionary with the selected parameters. In our example, "best\_params\_['n\_neighbors']" provides the selected value of K.

Save the selected value of K in a variable named "K_opt" and compute the test error of the final classifier.
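As a hedged illustration of these two fields, the sketch below runs a small grid search on synthetic data (an assumption, generated with `make_classification`; the names `search`, `K_demo` and `err_test` are hypothetical) and computes the test error in two equivalent ways: through the fitted search object and through its ".best\_estimator\_".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), not the lab dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.5,
                                      random_state=0)

# 5-fold CV over K = 1..15 (a smaller grid than in the lab, to keep it fast)
search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': np.arange(1, 16)},
                      cv=5, scoring='accuracy')
search.fit(Xtr, ytr)

# Selected value of K and the classifier refitted with it on all training data
K_demo = search.best_params_['n_neighbors']
best_knn = search.best_estimator_

# Both calls evaluate the same refitted classifier on the test set
acc_via_search = search.score(Xte, yte)
acc_via_best = best_knn.score(Xte, yte)
err_test = 1.0 - acc_via_best

print("Selected K: %d" % K_demo)
print("Test error: %.3f" % err_test)
```

The two accuracies coincide because GridSearchCV, with its default `refit=True`, retrains the best configuration on the whole training set and delegates `score` to that estimator.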

In [43]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

# Assign to K_opt the value of K selected by CV
K_opt = clf.best_params_['n_neighbors']# <FILL IN>
print("The optimum value of K is %d" %(K_opt))



The optimum value of K is 16


Note that you can also compute the test accuracy (and hence the test error) directly with the classifier object returned by the CV process:

In [40]:
KNN_acc_test = clf.score(X_test, Y_test)
print("The test accuracy is %2.2f" %(100*KNN_acc_test))


### **Part 3: SVM**

SVM is one of the best-known classifiers thanks to its good generalization properties in many different applications. Besides, by means of the kernel trick, its linear formulation can easily be extended to the nonlinear case.
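To make the effect of the kernel trick concrete, here is a minimal sketch comparing a linear SVM with a Gaussian-kernel SVM on a toy problem that is not linearly separable. The data come from `make_circles` (an assumption chosen for illustration, unrelated to the lab dataset), and the values of C and gamma are fixed by hand rather than selected by CV.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: no straight line can separate the classes
X_demo, y_demo = make_circles(n_samples=400, noise=0.1, factor=0.3,
                              random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.5,
                                      random_state=0)

# Same SVM formulation, two kernels (C and gamma fixed by hand here)
linear_svc = SVC(kernel='linear', C=1.0).fit(Xtr, ytr)
rbf_svc_demo = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(Xtr, ytr)

acc_linear = linear_svc.score(Xte, yte)
acc_rbf = rbf_svc_demo.score(Xte, yte)

print("Linear SVM test accuracy: %.3f" % acc_linear)
print("RBF SVM test accuracy:    %.3f" % acc_rbf)
```

Only the `kernel` argument changes between the two classifiers, which is what makes the nonlinear extension so convenient in practice.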

Here, we will test its performance when different kernel functions are used. For this purpose, we can use the [SVC( )](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) class, which lets you select the kernel function to be used, and GridSearchCV( ) to adjust the different free parameters (C and the kernel parameter). 

Complete the following cells, where required, to train in each case a linear SVM (setting kernel='linear' in SVC( )), an SVM with Gaussian kernel (kernel='rbf') and an SVM with polynomial kernel (kernel='poly'). 

For each method, adjust the corresponding free parameters with a 10-fold CV process (the ranges to explore are defined at the beginning of each cell). Return the values of the selected parameters and the accuracy of the final SVM.

#### **SVM with Gaussian kernel**

In [12]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
from sklearn import svm
rang_g=np.array([10, 100])
# rang_C is the range of C values defined in the logistic regression cell
tuned_parameters = [{'C': rang_C, 'gamma': rang_g}]

# Train an SVM with gaussian kernel and adjust by CV the parameters C and gamma
clf_base = svm.SVC(kernel='rbf')
# Subset of feature indices used to train and test this SVM
selection = np.array([2,9,0,6,4,10,1,7,3])
rbf_svc  =  GridSearchCV(clf_base, tuned_parameters, cv = nfold, scoring = 'accuracy')# <FILL IN> 
rbf_svc.fit(X_train[:,selection],Y_train) # <FILL IN> 
# Save the values of C and gamma selected and compute the final accuracy
C_opt = rbf_svc.best_params_['C']# <FILL IN> 
g_opt = rbf_svc.best_params_['gamma']# <FILL IN> 


print("The C value selected is " + str(C_opt))
print("The gamma value selected is " + str(g_opt))
acc_rbf_svc = rbf_svc.score(X_test[:,selection], Y_test)
print("The test accuracy of the RBF SVM is %2.2f" %(100*acc_rbf_svc))

The C value selected is 46.41588833612773
The gamma value selected is 10
The test accuracy of the RBF SVM is 66.35


#### **SVM with polynomial kernel**

In [None]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

rang_d=np.arange(1,5)
tuned_parameters = [{'C': rang_C, 'degree': rang_d}]

# Train an SVM with polynomial kernel and adjust by CV the parameters C and degree
clf_base =  svm.SVC(kernel='poly')
poly_svc  = GridSearchCV(clf_base, tuned_parameters, cv = nfold, scoring = 'accuracy')# <FILL IN> 
poly_svc.fit(X_train,Y_train)# <FILL IN> 

# Save the values of C and degree selected and compute the final accuracy
C_opt = poly_svc.best_params_['C']# <FILL IN> 
d_opt = poly_svc.best_params_['degree']# <FILL IN> 


print("The C value selected is " + str(C_opt))
print("The degree value selected is " + str(d_opt))
acc_poly_svc = poly_svc.score(X_test, Y_test)
print("The test accuracy of the polynomial SVM is %2.2f" %(100*acc_poly_svc))

### **Part 4: Trees**

**Training a Random Forest**

A Random Forest (RF) trains several decision tree classifiers, each on different samples and features of the training data, and averages their outputs to improve the final accuracy.
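The averaging step can be checked directly in scikit-learn. The sketch below is an illustration on synthetic data (an assumption, generated with `make_classification`; the names `forest` and `proba_trees` are hypothetical): it fits a small forest and compares its predicted class probabilities with the mean of the probabilities predicted by its individual trees, which are exposed in the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption), not the lab dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# A small forest of 9 trees, each trained on a bootstrap sample
forest = RandomForestClassifier(n_estimators=9, random_state=0)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by the individual trees
proba_trees = np.mean([tree.predict_proba(X_demo)
                       for tree in forest.estimators_], axis=0)

# The forest output is that same average
proba_forest = forest.predict_proba(X_demo)
print("Max difference: %.2e" % np.max(np.abs(proba_trees - proba_forest)))
```

The predicted class is then the one with the highest averaged probability, so adding trees smooths out the errors of any single tree.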

Use the RandomForestClassifier( ) class to train an RF classifier and select by cross validation the number of trees to use. The remaining parameters, such as the number of subsampled data or features, can be left at their default values. Return the optimal number of trees and the final accuracy of the RF classifier.


In [6]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rang_n_trees=np.arange(1,10)
tuned_parameters = [{'n_estimators': rang_n_trees}]
nfold = 10

clf_RF  = GridSearchCV(RandomForestClassifier(), tuned_parameters, cv = nfold, scoring = 'accuracy')#<FILL IN>
clf_RF.fit(X_train, Y_train)
n_trees_opt = clf_RF.best_params_['n_estimators']#<FILL IN>
acc_RF = clf_RF.score(X_test,Y_test)#<FILL IN>

print("The number of selected trees is " + str(n_trees_opt))
print("The test accuracy of the RF is %2.2f" %(100*acc_RF))

The number of selected trees is 9
The test accuracy of the RF is 80.88


### **Part 5: Neural Networks**

