# Classification example

In this example we will work through a binary classification exercise, using logistic regression to estimate whether a room is occupied or not, based on parameters measured in it.
The implementation of logistic regression with the gradient descent algorithm shares many similarities with that of linear regression, explained in the last unit. In this unit we will rely on the implementation offered by sklearn.
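
To make the similarity with linear regression concrete, here is a minimal NumPy sketch of logistic regression trained by batch gradient descent. It is only an illustration: the function names, the learning rate, the iteration count, and the toy data are assumptions made for this example, and sklearn's own solvers work differently.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, learning_rate=0.1, n_iterations=2000):
    """Fit logistic regression by batch gradient descent (illustrative only)."""
    n_samples, n_features = X.shape
    # Prepend a column of ones so that theta[0] acts as the intercept.
    X_b = np.hstack([np.ones((n_samples, 1)), X])
    theta = np.zeros(n_features + 1)
    for _ in range(n_iterations):
        predictions = sigmoid(X_b @ theta)
        # Gradient of the mean cross-entropy loss with respect to theta.
        # It has the same form as in linear regression, with the sigmoid applied.
        gradient = X_b.T @ (predictions - y) / n_samples
        theta -= learning_rate * gradient
    return theta

# Toy, made-up data: one feature, class 1 becomes more likely as x grows.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(float)

theta = fit_logistic_gd(X_toy, y_toy)

# Training accuracy of the fitted toy model, thresholding at 0.5.
X_toy_b = np.hstack([np.ones((200, 1)), X_toy])
toy_accuracy = np.mean((sigmoid(X_toy_b @ theta) > 0.5) == y_toy)
print(theta, toy_accuracy)
```

The only change with respect to a linear regression update is that the prediction passes through the sigmoid before the error is computed.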


## 1) Reading and inspecting the data
For this example we will use the Occupancy Detection Dataset obtained here: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+

The dataset is described here:
Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Véronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39


In [12]:
%matplotlib inline

import pandas as pd #used for reading/writing data 
import numpy as np #numeric library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library

occupancyData = pd.read_csv('data/occupancy_data/datatraining.txt')

We can inspect summary statistics of its contents:

In [13]:
occupancyData.describe()

| | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy |
|---|---|---|---|---|---|---|
| count | 8143.0 | 8143.0 | 8143.0 | 8143.0 | 8143.0 | 8143.0 |
| mean | 20.619084 | 25.731507 | 119.519375 | 606.546243 | 0.003863 | 0.21233 |
| std | 1.016916 | 5.531211 | 194.755805 | 314.320877 | 0.000852 | 0.408982 |
| min | 19.0 | 16.745 | 0.0 | 412.75 | 0.002674 | 0.0 |
| 25% | 19.7 | 20.2 | 0.0 | 439.0 | 0.003078 | 0.0 |
| 50% | 20.39 | 26.2225 | 0.0 | 453.5 | 0.003801 | 0.0 |
| 75% | 21.39 | 30.533333 | 256.375 | 638.833333 | 0.004352 | 0.0 |
| max | 23.18 | 39.1175 | 1546.333333 | 2028.5 | 0.006476 | 1.0 |


In [14]:
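# numeric_only=True skips the non-numeric 'date' column (needed on recent pandas versions)
occupancyData.groupby('Occupancy').mean(numeric_only=True)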

| Occupancy | Temperature | Humidity | Light | CO2 | HumidityRatio |
|---|---|---|---|---|---|
| 0 | 20.334931 | 25.349685 | 27.776442 | 490.320312 | 0.00373 |
| 1 | 21.673192 | 27.147938 | 459.854347 | 1037.704786 | 0.004355 |


In [15]:
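# numeric_only=True skips the non-numeric 'date' column (needed on recent pandas versions)
occupancyData.groupby('Occupancy').std(numeric_only=True)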

| Occupancy | Temperature | Humidity | Light | CO2 | HumidityRatio |
|---|---|---|---|---|---|
| 0 | 0.909973 | 5.294887 | 89.598692 | 152.919609 | 0.000753 |
| 1 | 0.622891 | 6.128497 | 42.286862 | 377.603278 | 0.001006 |


A priori we can see that Light and CO2 differ strongly between the occupied and unoccupied states. We will see whether these parameters play an important role in the classification.

To continue, we split the data into input and output parameters:

In [16]:
occupancyDataInput = occupancyData.drop(['Occupancy', 'date'], axis=1)
occupancyDataOutput = occupancyData['Occupancy']

As we saw in the last unit, we usually normalize the input parameters in order to improve convergence speed and accuracy:

In [27]:
occupancyDataInput = (occupancyDataInput - occupancyDataInput.mean())/ occupancyDataInput.std()
occupancyDataInput.describe()

| | Temperature | Humidity | Light | CO2 | HumidityRatio |
|---|---|---|---|---|---|
| count | 8143.0 | 8143.0 | 8143.0 | 8143.0 | 8143.0 |
| mean | -2.37342e-15 | -2.848104e-15 | -1.354246e-15 | 5.793938e-16 | -3.294866e-15 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | -1.59215 | -1.624691 | -0.6136884 | -0.6165554 | -1.39427 |
| 25% | -0.9037947 | -1.000054 | -0.6136884 | -0.533042 | -0.920091 |
| 50% | -0.2252728 | 0.08876767 | -0.6136884 | -0.4869108 | -0.07243307 |
| 75% | 0.7580921 | 0.8681329 | 0.7027037 | 0.1027202 | 0.5742176 |
| max | 2.518315 | 2.420084 | 7.326169 | 4.523892 | 3.066304 |
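
If the model is later applied to new measurements, they need to be scaled with the same mean and standard deviation that were estimated here. The following is a small sketch of that idea with plain NumPy arrays; the readings and variable names are made up for illustration and are not part of the dataset.

```python
import numpy as np

# Hypothetical readings (temperature, CO2); values are made up for illustration.
train = np.array([[19.0, 420.0], [20.5, 450.0], [21.0, 900.0], [22.5, 1100.0]])
new = np.array([[20.0, 500.0]])

# Statistics are estimated once, on the data used for fitting.
# ddof=1 matches the default of pandas' DataFrame.std().
mean = train.mean(axis=0)
std = train.std(axis=0, ddof=1)

train_scaled = (train - mean) / std
new_scaled = (new - mean) / std  # reuse the stored statistics, do not recompute them

print(train_scaled.mean(axis=0))         # approximately 0 for each column
print(train_scaled.std(axis=0, ddof=1))  # approximately 1 for each column
print(new_scaled)
```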


## 2) Applying Logistic regression on the whole data

We are now ready to instantiate the logistic regression from sklearn and to learn parameters $\Theta$ to optimally map input parameters to output class.

In [28]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [29]:
lr.fit(occupancyDataInput, occupancyDataOutput)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We can see how this system performs on the whole data, either by implementing the comparison ourselves or by using the built-in `score` method:

In [32]:
predictedOccupancy = lr.predict(occupancyDataInput)
comparison = np.logical_xor(occupancyDataOutput, predictedOccupancy)
(occupancyDataOutput.shape[0] - np.sum(comparison))/occupancyDataOutput.shape[0]

0.9860002456097261

In [33]:
lr.score(occupancyDataInput, occupancyDataOutput)

0.9860002456097261

Is this a good score? We check what fraction of the output data belongs to class 1 (occupied):

In [34]:
occupancyDataOutput.mean()

0.21232960825248681

This means that by always answering "not occupied" we would already get about 79% accuracy. Reaching 98.6% is therefore roughly 20 percentage points above this majority-class baseline, which is not bad.
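
The baseline can be verified with a few lines of arithmetic. This sketch only reuses the two numbers printed above; the variable names are chosen for illustration.

```python
# Fraction of occupied samples, taken from occupancyDataOutput.mean() above.
occupied_fraction = 0.21232960825248681
# Accuracy of the fitted model on the whole data, from lr.score(...) above.
model_accuracy = 0.9860002456097261

# A classifier that always predicts the most frequent class is correct
# exactly as often as that class occurs.
baseline_accuracy = max(occupied_fraction, 1.0 - occupied_fraction)
gain_over_baseline = model_accuracy - baseline_accuracy

print(round(baseline_accuracy, 4))   # 0.7877
print(round(gain_over_baseline, 4))  # 0.1983
```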

Now, which features are most important in the classification? We can see this by looking at the estimated values of the $\Theta$ parameters:

In [35]:
pd.DataFrame(list(zip(occupancyDataInput.columns, np.transpose(lr.coef_))))

| | 0 | 1 |
|---|---|---|
| 0 | Temperature | [-1.24840035829] |
| 1 | Humidity | [0.170937181511] |
| 2 | Light | [3.83643435106] |
| 3 | CO2 | [1.88258341568] |
| 4 | HumidityRatio | [-0.382370112498] |


As expected, Light and CO2 are the most relevant variables (they have the largest absolute coefficients), followed by Temperature. Note that we can compare these values only because we normalized the input features; otherwise the individual $\theta$ values would not be comparable.

## 3) Train-test sets
Evaluating a machine learning model on the same data it was trained on is a bad idea, as we are looking at predictions over data the model has already seen. This hides overfitting and can give us a misleadingly optimistic score.

To solve this, let's do a proper train/test split of our data: we will train on one set and test on the other. If we ever need to set metaparameters (hyperparameters) of the model, we will usually define a third set (usually called the validation or development set) which is independent of both the training and the test sets.

In [36]:
from sklearn.model_selection import train_test_split

occupancyDataInput_train, occupancyDataInput_test, occupancyDataOutput_train, occupancyDataOutput_test = train_test_split(occupancyDataInput, occupancyDataOutput, test_size=0.3, random_state=0)
lr2 = LogisticRegression()
lr2.fit(occupancyDataInput_train, occupancyDataOutput_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
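
The three-way split mentioned above is not used in this example, but it can be obtained by calling `train_test_split` twice. The sketch below uses made-up data and illustrative proportions (60/20/20); the variable names and the `stratify` option are choices made for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data standing in for the normalized inputs and the labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

# First hold out 20% of the samples as the final test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
# ...then split the remainder: 0.25 * 0.8 = 0.2 of the total goes to validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Metaparameters would then be chosen by comparing models on the validation set, keeping the test set untouched until the very end.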

We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.

In [37]:
predicted = lr2.predict(occupancyDataInput_test)
print(predicted)
probs = lr2.predict_proba(occupancyDataInput_test)
print(probs)

[1 1 0 ..., 1 0 1]
[[  5.37686998e-02   9.46231300e-01]
 [  1.48170513e-01   8.51829487e-01]
 [  9.98808887e-01   1.19111304e-03]
 ..., 
 [  4.25577666e-02   9.57442233e-01]
 [  9.99838352e-01   1.61648431e-04]
 [  1.48112490e-01   8.51887510e-01]]


The model assigns class 1 ("occupied") whenever the value in the second column (the probability of "occupied") is greater than 0.5.
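
The 0.5 threshold can be checked, or replaced by a different one, by thresholding the class-1 probabilities ourselves. The sketch below uses made-up data, and the stricter threshold of 0.8 is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up two-feature data with a noisy linear decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)

# Thresholding the class-1 probability at 0.5 reproduces predict().
manual = (probs[:, 1] > 0.5).astype(int)
matches_default = np.array_equal(manual, clf.predict(X))

# A stricter threshold labels fewer samples as class 1.
strict = (probs[:, 1] > 0.8).astype(int)
print(matches_default, manual.sum(), strict.sum())
```

Raising the threshold typically trades recall for precision on the positive class, which can matter when false alarms are costly.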

Let us now see some evaluation metrics:

In [68]:
# generate evaluation metrics
from sklearn import metrics

print(metrics.accuracy_score(occupancyDataOutput_test, predicted))
print(metrics.roc_auc_score(occupancyDataOutput_test, probs[:, 1]))
print(metrics.confusion_matrix(occupancyDataOutput_test, predicted))
print(metrics.classification_report(occupancyDataOutput_test, predicted))


0.984445354073
0.993319289501
[[1887   28]
 [  10  518]]
             precision    recall  f1-score   support

          0       0.99      0.99      0.99      1915
          1       0.95      0.98      0.96       528

avg / total       0.98      0.98      0.98      2443
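
The numbers in the report for class 1 can be recomputed by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. This short sketch only reuses the four counts printed above.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 1887, 28   # true class 0: predicted 0, predicted 1
fn, tp = 10, 518    # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the samples predicted occupied, the share that really were
recall = tp / (tp + fn)     # of the occupied samples, the share that was detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.9844 precision=0.95 recall=0.98 f1=0.96
```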



## 4) Cross-validation datasets
Not to be confused with the validation subset we can use to set some metaparameters, the cross-validation technique (closely related to the jackknife technique) is useful when we do not have much data overall and giving up some of it for testing is not a good idea. We normally split the data into 10 parts (folds) and, for each fold in turn, train on the other 9 and test on the one left out.


In [70]:
# evaluate the model using 10-fold cross-validation
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer versions

scores = cross_val_score(LogisticRegression(), occupancyDataInput, occupancyDataOutput, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())

[ 0.9791411   0.92147239  0.9791411   0.99877301  0.99385749  1.          0.9987715
  1.          0.97542998  0.96555966]
0.981214623102


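We see that the average accuracy over the 10 folds (0.981) is very close to the 0.984 obtained with the single train/test split above, although individual folds range from 0.92 to 1.0. All good to go.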
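
To make the procedure concrete, the following sketch spells out roughly what `cross_val_score` does for a classifier when `cv=10`: it builds 10 stratified folds without shuffling and, for each one, trains on the other 9 and scores on the fold left out. The data here is made up, and the loop is an illustration rather than the library's actual implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Made-up three-feature data with a noisy linear decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Manual 10-fold loop: train on 9 folds, score on the remaining one.
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

# The library helper should produce the same per-fold accuracies.
library_scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(np.mean(manual_scores), library_scores.mean())
```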