# Logistic Regression Model

### Dataset Information
No. of Features: 10  
No. of Instances: 4492  

### Table of Contents<a name='table of contents'></a>

1. [Data Ingestion](#data ingestion)
2. [Features & Target Arrays](#features and target arrays)
3. [Logistic Regression Model](#logreg)  
    a. [Scale Data](#scale data)  
    b. [Hyperparameter Tuning](#hyperparameter tuning)  
    c. [Classification Report](#classification report)  
    d. [Confusion Matrix](#confusion matrix)   
4. [Save Model](#pickle)

In [1]:
%matplotlib inline

import os
import json
import time
import pickle
import requests
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import yellowbrick as yb
sns.set_palette('RdBu', 10)

## Data Ingestion<a name='data ingestion'></a>

In [2]:
URL = 'https://raw.githubusercontent.com/georgetown-analytics/classroom-occupancy/master/models/sensor_data_ml.csv'

def fetch_data(fname='sensor_data_ml.csv'):
    response = requests.get(URL)
    response.raise_for_status()  # Stop early if the download failed
    outpath  = os.path.abspath(fname)
    with open(outpath, 'wb') as f:
        f.write(response.content)
    
    return outpath

# Fetch the data from the URL and keep the local file path
DATA = fetch_data()

In [3]:
# Load into a pandas DataFrame with a DatetimeIndex: df
df = pd.read_csv(DATA, index_col='datetime', parse_dates=True)

In [4]:
# Rename columns
df.columns = ['temp', 'humidity', 'co2', 'light', 'light_st', 'noise',
              'bluetooth', 'images', 'door', 'occupancy_count', 'occupancy_level']

In [5]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4492 entries, 2017-03-25 09:05:00 to 2017-06-10 16:47:00
Data columns (total 11 columns):
temp               4492 non-null float64
humidity           4492 non-null float64
co2                4492 non-null float64
light              4492 non-null float64
light_st           4492 non-null float64
noise              4492 non-null float64
bluetooth          4492 non-null float64
images             4492 non-null float64
door               4492 non-null float64
occupancy_count    4492 non-null float64
occupancy_level    4492 non-null object
dtypes: float64(10), object(1)
memory usage: 421.1+ KB


datetime,temp,humidity,co2,light,light_st,noise,bluetooth,images,door,occupancy_count,occupancy_level
2017-03-25 09:05:00,22.6,36.9,781.0,430.0,1.0,511.0,1.0,15.242697,0.0,0.0,empty
2017-03-25 09:06:00,23.8,38.954167,765.465279,428.533744,1.0,503.515931,11.399457,15.242697,0.0,0.0,empty
2017-03-25 09:07:00,23.85,38.9,768.458333,423.5765,1.0,510.548913,19.916667,15.242697,0.083333,4.416667,low
2017-03-25 09:08:00,23.9,38.766667,777.791667,423.053571,1.0,506.50463,29.75,15.242697,0.0,23.416667,mid-level
2017-03-25 09:09:00,23.908333,38.733333,770.864583,438.607904,1.0,500.092672,35.860577,15.242697,0.0,30.0,high


## Features & Target Arrays<a name='features and target arrays'></a>

In [6]:
# Breakdown of classroom occupancy levels
df.occupancy_level.value_counts()

high         2881
mid-level     781
empty         482
low           348
Name: occupancy_level, dtype: int64

In [7]:
# Encode multiclass target variable (the result is only displayed here, not stored)
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit_transform(df['occupancy_level'])

array([0, 0, 2, ..., 2, 2, 2], dtype=int64)
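
The integer codes above follow the sorted order of the class labels, which is how `LabelEncoder` assigns them. The following is a minimal pure-Python sketch of that mapping, using the five labels from the `df.head()` output as sample input. The helper names `encode_labels` and `decode_labels` are hypothetical and are not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code in sorted order."""
    classes = sorted(set(labels))
    lookup = {label: code for code, label in enumerate(classes)}
    return classes, [lookup[label] for label in labels]

def decode_labels(classes, codes):
    """Turn integer codes back into their string labels."""
    return [classes[code] for code in codes]

# Labels of the first five rows shown by df.head()
sample = ['empty', 'empty', 'low', 'mid-level', 'high']
classes, codes = encode_labels(sample)
print(classes)  # ['empty', 'high', 'low', 'mid-level']
print(codes)    # [0, 0, 2, 3, 1]
```

On the fitted encoder, `encoder.classes_` holds the same sorted label list and `encoder.inverse_transform` performs the decoding.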

In [8]:
# Use TimeSeriesSplit to create training and test set split indices
from sklearn.model_selection import TimeSeriesSplit

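X = df.drop('occupancy_level', axis=1).values
y = df['occupancy_level'].values  # NumPy array, so the positional split indices apply directly

# Each iteration overwrites the previous split, so only the final (12th) split is kept
tscv = TimeSeriesSplit(n_splits=12)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]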
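
To see which rows end up in the training and test sets, the split sizes can be reproduced by hand. This is a rough pure-Python sketch of the expanding-window scheme that `TimeSeriesSplit` applies, assuming its default settings (no gap and no cap on the training size). The function name `expanding_window_splits` is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train, test) index ranges with a growing training window."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        # Train on everything before the test block, test on the block itself
        yield range(0, test_start), range(test_start, test_start + test_size)

splits = list(expanding_window_splits(n_samples=4492, n_splits=12))
for i, (train, test) in enumerate(splits, start=1):
    print('Split {:2d}: train={:4d} rows, test={:3d} rows'.format(i, len(train), len(test)))
```

Under these assumptions the final split trains on the first 4147 rows and tests on the last 345, which agrees with the support total of 345 in the classification reports.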

## Logistic Regression Model<a name='logreg'></a>

In [9]:
# Initial cross-validation scores
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Fit logistic regression classifier onto the training data: logreg
logreg = LogisticRegression(solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)

# Compute the 12-fold cross-validation scores on the training data
cv_scores = cross_val_score(logreg, X_train, y_train, cv=tscv)

print('Logistic Regression Cross-Validation Scores')
print(cv_scores)
print('Average 12-Fold CV Score: {:.4f}'.format(np.mean(cv_scores)))

Logistic Regression Cross-Validation Scores
[ 0.79937304  0.85579937  0.95924765  0.81818182  0.64890282  0.830721
  0.53291536  0.94357367  0.80877743  0.7492163   0.67711599  0.94357367]
Average 12-Fold CV Score: 0.7973


In [10]:
# Initial classification report
from sklearn.metrics import classification_report

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the classification report and training and test scores
print('Logistic Regression Model')
print(classification_report(y_test, y_pred))
print('Training set score: {:.4f}'.format(logreg.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(logreg.score(X_test, y_test)))

Logistic Regression Model
             precision    recall  f1-score   support

      empty       1.00      0.87      0.93        61
       high       0.97      0.99      0.98       198
        low       0.68      1.00      0.81        41
  mid-level       0.96      0.60      0.74        45

avg / total       0.94      0.92      0.92       345

Training set score: 0.8949
Test set score: 0.9217


### Scale Data<a name='scale data'></a>

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Setup the pipeline with RobustScaler: steps
steps = [('scaler', RobustScaler()),
         ('logreg', LogisticRegression(solver='lbfgs', multi_class='multinomial'))]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Fit the pipeline to the training set: logreg_scaled
logreg_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a logreg classifier to the unscaled data: logreg_unscaled
logreg_unscaled = LogisticRegression(solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {:.4f}'.format(logreg_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {:.4f}'.format(logreg_unscaled.score(X_test, y_test)))

Accuracy with Scaling: 0.9913
Accuracy without Scaling: 0.9217
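
For intuition about the scaling step, the sketch below applies the same kind of transform to a single feature: subtract the median, then divide by the interquartile range. It assumes the default 25th to 75th percentile range with linear interpolation. The function `robust_scale` and the sample readings are hypothetical and are not taken from the dataset.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

readings = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical values with one outlier
print(robust_scale(readings))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier stays large after scaling, but it does not distort the scale of the other values, because the median and IQR are barely affected by it.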


In [12]:
# Cross-validation scores for scaled data
cv_scores = cross_val_score(logreg_scaled, X_train, y_train, cv=tscv)
print('Logistic Regression Cross-Validation Scores: Scaled')
print(cv_scores)
print('Average 12-Fold CV Score: {:.4f}'.format(np.mean(cv_scores)))

Logistic Regression Cross-Validation Scores: Scaled
[ 0.86833856  0.91222571  0.93103448  0.85893417  0.76175549  0.72727273
  0.61442006  0.99059561  0.98746082  0.99059561  0.90909091  0.99373041]
Average 12-Fold CV Score: 0.8788


In [13]:
# Predict the labels of the test set: y_pred
y_pred = logreg_scaled.predict(X_test)

# Compute and print the classification report and training and test scores
print('Logistic Regression Model: Scaled')
print(classification_report(y_test, y_pred))
print('Training set score: {:.4f}'.format(logreg_scaled.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(logreg_scaled.score(X_test, y_test)))

Logistic Regression Model: Scaled
             precision    recall  f1-score   support

      empty       1.00      1.00      1.00        61
       high       1.00      1.00      1.00       198
        low       0.93      1.00      0.96        41
  mid-level       1.00      0.93      0.97        45

avg / total       0.99      0.99      0.99       345

Training set score: 0.9891
Test set score: 0.9913


### Hyperparameter Tuning<a name='hyperparameter tuning'></a>

In [14]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(RobustScaler(),
                     LogisticRegression(solver='lbfgs', multi_class='multinomial'))

param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100, 110, 120]}

grid = GridSearchCV(pipe, param_grid, cv=tscv)

logreg_clf = grid.fit(X_train, y_train)

print('Best estimator:\n{}'.format(logreg_clf.best_estimator_))

Best estimator:
Pipeline(steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('logisticregression', LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])


In [16]:
print('Logistic Regression Model (Hypertuned)')
print('Best cross-validation accuracy: {:.4f}'.format(logreg_clf.best_score_))
print('Test set score: {:.4f}'.format(logreg_clf.score(X_test, y_test)))
print('Best parameters: {}'.format(logreg_clf.best_params_))

Logistic Regression Model (Hypertuned)
Best cross-validation accuracy: 0.9378
Test set score: 0.9942
Best parameters: {'logisticregression__C': 100}


In [19]:
print('Logistic regression coefficients:\n{}'.format(logreg_clf.best_estimator_.named_steps['logisticregression'].coef_))

Logistic regression coefficients:
[[  1.34545398e-01  -3.22510416e-01  -1.17065056e+00   5.28899360e-01
    3.53873454e+00  -1.10973335e+00  -5.29392868e-01   2.82714770e-01
   -1.11748443e-01  -3.55673512e+01]
 [ -1.77820099e-01   5.24521962e-01   1.83219648e+00  -2.46018015e-01
   -2.32100742e+00   2.44918998e+00   9.68101884e-03  -4.93399108e-01
    1.11157086e-01   7.24321416e+01]
 [ -8.07061869e-02  -1.95847907e-01   6.07258288e-02  -1.26072416e-01
    1.12450200e-01  -3.92625574e-01   1.63648787e-01   7.18097248e-02
   -1.20948344e-01  -2.59175950e+01]
 [  1.23980888e-01  -6.16363920e-03  -7.22271743e-01  -1.56808929e-01
   -1.33017731e+00  -9.46831053e-01   3.56063062e-01   1.38874613e-01
    1.21539701e-01  -1.09471954e+01]]


### Classification Report<a name='classification report'></a>

In [17]:
# Predict the labels of the test set: y_pred
y_pred = logreg_clf.predict(X_test)

# Compute and print the classification report
print('Logistic Regression Model')
print(classification_report(y_test, y_pred))

Logistic Regression Model
             precision    recall  f1-score   support

      empty       1.00      1.00      1.00        61
       high       1.00      1.00      1.00       198
        low       0.95      1.00      0.98        41
  mid-level       1.00      0.96      0.98        45

avg / total       0.99      0.99      0.99       345



### Confusion Matrix <a name='confusion matrix'></a>

In [21]:
from sklearn.metrics import confusion_matrix

print('Logistic Regression Confusion Matrix')
print(confusion_matrix(y_test, y_pred))

Logistic Regression Confusion Matrix
[[ 61   0   0   0]
 [  0 198   0   0]
 [  0   0  41   0]
 [  0   0   2  43]]
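
The confusion matrix ties directly to the classification report. Rows are true labels and columns are predicted labels, both in sorted label order. As a small worked example, the sketch below recomputes precision, recall, and accuracy from the matrix printed above. The helper `per_class_metrics` is a hypothetical name.

```python
labels = ['empty', 'high', 'low', 'mid-level']
matrix = [[61, 0, 0, 0],
          [0, 198, 0, 0],
          [0, 0, 41, 0],
          [0, 0, 2, 43]]

def per_class_metrics(matrix):
    """Return (precision, recall) lists; rows are true labels, columns are predictions."""
    n = len(matrix)
    col_totals = [sum(row[j] for row in matrix) for j in range(n)]
    precision = [matrix[i][i] / col_totals[i] for i in range(n)]
    recall = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    return precision, recall

precision, recall = per_class_metrics(matrix)
# Accuracy is the diagonal total divided by the number of test rows
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))

for name, p, r in zip(labels, precision, recall):
    print('{:>10}  precision={:.2f}  recall={:.2f}'.format(name, p, r))
print('Accuracy: {:.4f}'.format(accuracy))  # Accuracy: 0.9942
```

The two `mid-level` rows predicted as `low` account for the 0.96 recall on `mid-level` and the 0.95 precision on `low`.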


## Save Model<a name='pickle'></a>

In [24]:
import pickle

logreg_model = 'logreg_model.sav'

# Save fitted model to disk, closing the file handle when done
with open(logreg_model, 'wb') as f:
    pickle.dump(logreg_clf, f)
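
Reading the model back follows the same pattern in reverse. The sketch below shows the round trip with a plain dictionary standing in for the fitted model, so that it runs on its own. The stand-in object and the temporary path are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable object round-trips the same way
stand_in_model = {'name': 'logreg', 'C': 100}

path = os.path.join(tempfile.mkdtemp(), 'logreg_model.sav')

# Write the object to disk
with open(path, 'wb') as f:
    pickle.dump(stand_in_model, f)

# Read it back into a new object
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

With the real file, `pickle.load` returns the fitted `GridSearchCV` object, so `restored.predict(X_test)` works as before. It is safest to load it with the same scikit-learn version that saved it, and to unpickle only files from a trusted source.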

### [Return to Table of Contents](#table of contents)