# Research Practicum

This notebook contains a model that predicts whether a room is Empty or Full based on the number of devices connected to wifi.

This model uses the median number of devices connected in a room per hour per day.

<b> GET DATA </b>

- import any required libraries
- read in data from csv files and put into a dataframe
- check data is loaded correctly into dataframe

In [115]:
#import pandas package to read and merge csv files
import pandas as pd
#import csv package for reading from and writing to csv files
import csv
# Import package numpy for numeric computing
import numpy as np
# Import package matplotlib for visualisation/plotting
import matplotlib.pyplot as plt
%matplotlib inline

In [116]:
# check current directory
%ls

 Volume in drive C is OS_Install
 Volume Serial Number is F05F-E1DD

 Directory of C:\Users\Elayne Ruane\Documents\CSI MA\Research Practicum

22/07/2016  14:19    <DIR>          .
22/07/2016  14:19    <DIR>          ..
21/07/2016  17:54    <DIR>          .ipynb_checkpoints
06/07/2016  00:28            86,091 Data_Understanding.ipynb
04/07/2016  15:00           135,768 pauline_linear_regression_wifi_experiment.ipynb
05/07/2016  11:39           139,687 pauline_linear_regression_wifi_max.ipynb
05/07/2016  14:24           136,426 pauline_linear_regression_wifi_mean.ipynb
05/07/2016  14:57           119,047 pauline_logistic_reg_1.ipynb
05/07/2016  15:43           141,350 pauline_mean_updated.ipynb
21/07/2016  17:26           127,767 wifi_lin_reg_median_nointercept.ipynb
21/07/2016  17:11           139,193 wifi_lin_reg_median_nointercept_3cat.ipynb
22/07/2016  14:19           149,222 wifi_lin_reg_median_nointercept_binary_test&training.ipynb
05/07/2016  20:39           170,179 wifi_log_model

In [170]:
# read data from csv file into a data frame
# this code is OS agnostic

import os

a = '..' # removed slash
b = 'cleaned_data' # removed slash
c = 'full.csv'

path = os.path.join(a, b, c)

print(path)
wifi_df = pd.read_csv(path, names=['room', 'event_time', 'ass', 'auth'])

..\cleaned_data\full.csv


In [118]:
# May have to use this method to read the csv into a dataframe;
# it uses double backslashes to prevent a unicode error from '\U' characters
wifi_df = pd.read_csv("D:\\Users\\Elayne Ruane\\Documents\\CSI MA\\research_practicum\\cleaned_data\\full.csv", names=['room', 'event_time', 'ass', 'auth'])

In [119]:
# check data loaded into data frame correctly
wifi_df.head()

Unnamed: 0,room,event_time,ass,auth
0,Belfield > Computer Science > B-002,Mon Nov 02 20:32:06 GMT+00:00 2015,0,0
1,Belfield > Computer Science > B-002,Mon Nov 02 20:37:10 GMT+00:00 2015,0,0
2,Belfield > Computer Science > B-002,Mon Nov 02 20:42:12 GMT+00:00 2015,0,0
3,Belfield > Computer Science > B-002,Mon Nov 02 20:47:14 GMT+00:00 2015,0,0
4,Belfield > Computer Science > B-002,Mon Nov 02 20:52:11 GMT+00:00 2015,0,0


<b> CLEAN DATA </b>

Need to clean the data for use in the model:
- convert values in 'event_time' column from timestamp to epoch time
- add a 'building' column (in this case all values will be 'school of computer science')
- convert values in 'room' column from the format 'campus > building > room' to just the integer room ID
- create two new columns that contain the 'day' and the 'hour' with values derived from the 'event_time'
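
The cleaning steps listed above can also be sketched with vectorised pandas operations instead of row-by-row loops. This is a minimal sketch, not the notebook's implementation: the `sample` dataframe and the `clean_wifi` helper are hypothetical, and it assumes that every timestamp carries the `GMT+00:00` offset and that the room ID is a single trailing digit.

```python
import pandas as pd

def clean_wifi(df):
    """Hypothetical helper: vectorised version of the cleaning steps."""
    out = df.copy()
    # Assumption: all timestamps are GMT+00:00, so the offset text can be dropped.
    stamps = out['event_time'].str.replace(' GMT+00:00', '', regex=False)
    times = pd.to_datetime(stamps, format='%a %b %d %H:%M:%S %Y')
    # Epoch time: whole seconds since 1970-01-01.
    out['event_time'] = (times - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
    out['building'] = 'school of computer science'
    # Assumption: the room ID is the last character, e.g. 'B-002' -> 2.
    out['room'] = out['room'].str[-1].astype(int)
    out['event_hour'] = times.dt.hour
    out['event_day'] = times.dt.day
    return out

# Hypothetical sample rows in the same format as the raw wifi log.
sample = pd.DataFrame({
    'room': ['Belfield > Computer Science > B-002',
             'Belfield > Computer Science > B-004'],
    'event_time': ['Mon Nov 02 20:32:06 GMT+00:00 2015',
                   'Tue Nov 03 09:05:00 GMT+00:00 2015'],
})
cleaned = clean_wifi(sample)
print(cleaned)
```

Working on a copy keeps the raw dataframe unchanged, which makes it easier to re-run a cell without converting the same column twice.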

In [121]:
import time
from dateutil.parser import parse

def convert_to_epoch(df, column):
    '''function that reads in a dataframe with a column containing values in timestamp format and converts those values to epoch format
    
    requires module time and parse function from dateutil.parser
    
    parameters
    ----------
    df is a dataframe
    column is a string that denotes the name of the column containing values in timestamp format
    '''
    
    #for loop that iterates through each row in the dataframe
    for i in range(df.shape[0]):
        # variable 'x' is assigned the value from the column and row 'i'
        x = df[column][i]
        # variable 'y' is assigned the result of variable 'x' passed through the parse method 
        y = parse(x)
        # variable 'epoch' is assigned 'y' value converted to epoch time
        epoch = int(time.mktime(y.timetuple()))
        # set column value to value of variable 'epoch'
        df.at[i, column] = epoch
    return df

In [122]:
convert_to_epoch(wifi_df, 'event_time')

Unnamed: 0,room,event_time,ass,auth
0,Belfield > Computer Science > B-002,1446496326,0,0
1,Belfield > Computer Science > B-002,1446496630,0,0
2,Belfield > Computer Science > B-002,1446496932,0,0
3,Belfield > Computer Science > B-002,1446497234,0,0
4,Belfield > Computer Science > B-002,1446497531,0,0
5,Belfield > Computer Science > B-002,1446497831,0,0
6,Belfield > Computer Science > B-002,1446498031,0,0
7,Belfield > Computer Science > B-002,1446498439,0,0
8,Belfield > Computer Science > B-002,1446498740,0,0
9,Belfield > Computer Science > B-002,1446499040,0,0


Clean Room Identifiers

In [124]:
def room_number(df, room_column):
    '''function that reads in a dataframe with a column containing room information in the format 'campus > building > roomcode-xxx' 
    and replaces the values in the column with just the room ID which is the last character of the string in that column.    
    '''
    # for loop that iterates through each row in the df
    for i in range(df.shape[0]):
        # selects last character of the string in the room_column which is the room ID
        df.at[i, room_column] = df[room_column][i][-1:]
    return df

In [125]:
room_number(wifi_df, 'room')

Unnamed: 0,room,event_time,ass,auth
0,2,1446496326,0,0
1,2,1446496630,0,0
2,2,1446496932,0,0
3,2,1446497234,0,0
4,2,1446497531,0,0
5,2,1446497831,0,0
6,2,1446498031,0,0
7,2,1446498439,0,0
8,2,1446498740,0,0
9,2,1446499040,0,0


In [126]:
wifi_df.head()

Unnamed: 0,room,event_time,ass,auth
0,2,1446496326,0,0
1,2,1446496630,0,0
2,2,1446496932,0,0
3,2,1446497234,0,0
4,2,1446497531,0,0


Add building.

In [127]:
wifi_df['building'] = 'school of computer science'

In [128]:
wifi_df.head()

Unnamed: 0,room,event_time,ass,auth,building
0,2,1446496326,0,0,school of computer science
1,2,1446496630,0,0,school of computer science
2,2,1446496932,0,0,school of computer science
3,2,1446497234,0,0,school of computer science
4,2,1446497531,0,0,school of computer science


Clean Occupancy Data

In [129]:
# put survey data in a dataframe

a = '..' # removed slash
b = 'cleaned_data' # removed slash
c = 'survey_data.csv'

path = os.path.join(a, b, c)
print(path)

occupancy_df = pd.read_csv(path)

..\cleaned_data\survey_data.csv


OSError: File b'..\\cleaned_data\\survey_data.csv' does not exist

In [None]:
# Have to use this method with double backslashes to prevent a unicode error from '\U' characters
occupancy_df = pd.read_csv("D:\\Users\\Elayne Ruane\\Documents\\CSI MA\\research_practicum\\cleaned_data\\survey_data.csv")

In [None]:
occupancy_df.head()

In [None]:
# delete column 'Unnamed: 0'
del occupancy_df['Unnamed: 0']

In [131]:
occupancy_df.head()

Unnamed: 0_level_0,room,occupancy,building,event_hour,event_day
event_time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2015-11-02 09:00:00,4,0.25,school of computer science,9,2
2015-11-02 09:00:00,2,0.25,school of computer science,9,2
2015-11-02 09:00:00,3,0.25,school of computer science,9,2
2015-11-02 10:00:00,4,0.5,school of computer science,10,2
2015-11-02 10:00:00,2,0.5,school of computer science,10,2


Convert epoch time into human-readable format.

In [132]:
# convert 'event_time' values from EPOCH to DATETIME
wifi_df['event_time'] = pd.to_datetime(wifi_df.event_time, unit='s')
# use event_time as dataframe index 
wifi_df.set_index('event_time', inplace=True)

In [133]:
wifi_df.head()

Unnamed: 0_level_0,room,ass,auth,building
event_time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2015-11-02 20:32:06,2,0,0,school of computer science
2015-11-02 20:37:10,2,0,0,school of computer science
2015-11-02 20:42:12,2,0,0,school of computer science
2015-11-02 20:47:14,2,0,0,school of computer science
2015-11-02 20:52:11,2,0,0,school of computer science


In [134]:
# create two new columns, event_hour and event_day
wifi_df['event_hour'] = wifi_df.index.hour
wifi_df['event_day'] = wifi_df.index.day

In [135]:
wifi_df.head()

Unnamed: 0_level_0,room,ass,auth,building,event_hour,event_day
event_time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2015-11-02 20:32:06,2,0,0,school of computer science,20,2
2015-11-02 20:37:10,2,0,0,school of computer science,20,2
2015-11-02 20:42:12,2,0,0,school of computer science,20,2
2015-11-02 20:47:14,2,0,0,school of computer science,20,2
2015-11-02 20:52:11,2,0,0,school of computer science,20,2


In [136]:
# convert 'event_time' values from EPOCH to DATETIME
occupancy_df['event_time'] = pd.to_datetime(occupancy_df.event_time, unit='s')
# use event_time as dataframe index 
occupancy_df.set_index('event_time', inplace=True)

AttributeError: 'DataFrame' object has no attribute 'event_time'

In [137]:
occupancy_df.head()

Unnamed: 0_level_0,room,occupancy,building,event_hour,event_day
event_time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2015-11-02 09:00:00,4,0.25,school of computer science,9,2
2015-11-02 09:00:00,2,0.25,school of computer science,9,2
2015-11-02 09:00:00,3,0.25,school of computer science,9,2
2015-11-02 10:00:00,4,0.5,school of computer science,10,2
2015-11-02 10:00:00,2,0.5,school of computer science,10,2


In [138]:
# create two new columns, event_hour and event_day
occupancy_df['event_hour'] = occupancy_df.index.hour
occupancy_df['event_day'] = occupancy_df.index.day

In [139]:
occupancy_df.head()

Unnamed: 0_level_0,room,occupancy,building,event_hour,event_day
event_time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2015-11-02 09:00:00,4,0.25,school of computer science,9,2
2015-11-02 09:00:00,2,0.25,school of computer science,9,2
2015-11-02 09:00:00,3,0.25,school of computer science,9,2
2015-11-02 10:00:00,4,0.5,school of computer science,10,2
2015-11-02 10:00:00,2,0.5,school of computer science,10,2


<b> DATA ANALYSIS </b>

Survey data contains one recorded value per room, per day, per hour. As such, we take the median reading from the wifi logs per hour, per day, per room.

In [140]:
# take the median of the numeric connection counts only ('building' is text)
df_med_conn = wifi_df.groupby(['room', 'event_day', 'event_hour'], as_index=False)[['ass', 'auth']].median()

In [141]:
df_med_conn.tail()

Unnamed: 0,room,event_day,event_hour,ass,auth
1051,4,17,7,0,0
1052,4,17,8,0,0
1053,4,17,9,72,72
1054,4,17,10,85,85
1055,4,17,11,39,39


In [142]:
# merge data into single dataframe
df_med_conn['room'] = df_med_conn['room'].astype(int)
df = pd.merge(df_med_conn, occupancy_df, on=['room', 'event_day', 'event_hour'], how='inner')

df.head(15)

Unnamed: 0,room,event_day,event_hour,ass,auth,occupancy,building
0,2,3,9,2.0,2.0,0.0,school of computer science
1,2,3,10,29.0,29.0,0.5,school of computer science
2,2,3,11,27.0,27.0,0.5,school of computer science
3,2,3,12,16.0,16.0,0.5,school of computer science
4,2,3,13,13.0,13.0,0.0,school of computer science
5,2,3,14,47.0,47.0,0.75,school of computer science
6,2,3,15,35.0,35.0,0.25,school of computer science
7,2,3,16,36.5,36.5,0.25,school of computer science
8,2,4,9,14.0,14.0,0.25,school of computer science
9,2,4,10,15.0,15.0,0.25,school of computer science


We create a new column for the estimated number of occupants based on the survey data. This is calculated as room capacity * occupancy rate.
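
The same calculation can be sketched without a row loop by mapping each room to its capacity. This is only an illustrative sketch: `ROOM_CAPACITY`, `estimate_occupants` and the `survey` dataframe are hypothetical names, and the capacities (90 for rooms 2 and 3, 220 for room 4) are the values used in this notebook's `estimate_occ` function.

```python
import pandas as pd

# Capacities as used in the notebook's estimate_occ function.
ROOM_CAPACITY = {2: 90, 3: 90, 4: 220}

def estimate_occupants(df, room='room', occupancy_rate='occupancy'):
    """Hypothetical vectorised helper: room capacity * occupancy rate per row."""
    capacity = df[room].map(ROOM_CAPACITY)
    # Rooms missing from the mapping come back as NaN; treat them as an error.
    if capacity.isna().any():
        unknown = df.loc[capacity.isna(), room].unique().tolist()
        raise ValueError('Incorrect room number: {}'.format(unknown))
    return df[occupancy_rate] * capacity

# Hypothetical survey rows.
survey = pd.DataFrame({'room': [2, 3, 4], 'occupancy': [0.5, 0.0, 0.25]})
survey['est_occupants'] = estimate_occupants(survey)
print(survey)
```

Keeping the capacities in one dictionary means any other function that needs them can read the same values.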

In [171]:

def estimate_occ(df,room, occupancy_rate):
    '''function that calculates the estimated number of room occupants
    
    parameters
    ----------
    df is a dataframe with columns room and occupancy_rate
    room is a string denoting a column in df that contains INT values representing room IDs
    occupancy_rate is a string denoting a column in df that contains DECIMAL values that represent the estimated room occupancy rate
    
    '''
    #for loop that iterates through each row of the df
    for i in range(df.shape[0]):
        
        #rooms two and three have a capacity of 90
        if df[room][i] == 2 or df[room][i] == 3:
            # calculate estimated occupants for row, assign to variable 'est'
            est = df[occupancy_rate][i] * 90
            #set value in new column
            df.loc[i, 'est_occupants'] = est
        
        #room four has a capacity of 220
        elif df[room][i] == 4:
            est = df[occupancy_rate][i] * 220
            df.loc[i, 'est_occupants'] = est
        
        else:
            raise ValueError('Incorrect room number:', df[room][i])
            

In [144]:
estimate_occ(df, 'room', 'occupancy')

<b>CREATE TEST AND TRAINING SET</b>

We separate the data into a training set (the first 70% of rows) and a test set (the remaining 30%).
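
A positional split keeps the rows in their existing order, so if the dataframe is sorted by room the test set may be drawn mostly from one room. One alternative, which is not what this notebook does, is a shuffled split. The sketch below assumes a hypothetical `split_train_test` helper, a toy dataframe and an arbitrary seed; the explicit `.copy()` calls give independent dataframes, which avoids the "value is trying to be set on a copy of a slice" warning when columns are added afterwards.

```python
import pandas as pd

def split_train_test(df, train_frac=0.7, seed=0):
    """Hypothetical helper: shuffled train/test split with a fixed seed."""
    # Draw a random 70% of the rows for training.
    train = df.sample(frac=train_frac, random_state=seed)
    # Everything not drawn for training becomes the test set.
    test = df.drop(train.index)
    return train.copy(), test.copy()

# Toy dataframe ordered by room, as a merged dataframe might be.
toy = pd.DataFrame({'room': [2] * 4 + [3] * 3 + [4] * 3, 'auth': range(10)})
train, test = split_train_test(toy)
print(len(train), len(test))
```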

In [145]:
df.head()

Unnamed: 0,room,event_day,event_hour,ass,auth,occupancy,building,est_occupants
0,2,3,9,2,2,0.0,school of computer science,0
1,2,3,10,29,29,0.5,school of computer science,45
2,2,3,11,27,27,0.5,school of computer science,45
3,2,3,12,16,16,0.5,school of computer science,45
4,2,3,13,13,13,0.0,school of computer science,0


In [146]:
df_train = df[:int(0.7 * df.shape[0])]
df_test = df[int(0.7 * df.shape[0]):]

df_train.head()

Unnamed: 0,room,event_day,event_hour,ass,auth,occupancy,building,est_occupants
0,2,3,9,2,2,0.0,school of computer science,0
1,2,3,10,29,29,0.5,school of computer science,45
2,2,3,11,27,27,0.5,school of computer science,45
3,2,3,12,16,16,0.5,school of computer science,45
4,2,3,13,13,13,0.0,school of computer science,0


<b>TRAIN THE MODEL ON THE TRAINING SET</b>

In [147]:
import statsmodels.formula.api as sm

In [148]:
# 'ass' (associated) could also be used, but 'auth' (authenticated) has the higher correlation
lm = sm.ols(formula='est_occupants ~ auth', data=df_train).fit()

In [149]:
print(lm.params)

Intercept    2.375927
auth         0.923753
dtype: float64
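
As a worked example of what these parameters mean, the fitted line is est_occupants = Intercept + auth coefficient * auth. The sketch below hard-codes the two printed values and uses a hypothetical `predict_occupants` helper. Note that the prediction cells in this notebook multiply by the `auth` coefficient only, which corresponds to `include_intercept=False` here.

```python
# Parameter values copied from the lm.params output above.
INTERCEPT = 2.375927
AUTH_COEF = 0.923753

def predict_occupants(auth, include_intercept=True):
    """Hypothetical helper: predicted occupants for a number of authenticated devices."""
    base = INTERCEPT if include_intercept else 0.0
    return base + AUTH_COEF * auth

for devices in (2, 29):
    with_intercept = predict_occupants(devices)
    slope_only = predict_occupants(devices, include_intercept=False)
    print(devices, round(with_intercept, 4), round(slope_only, 4))
```

For 29 devices the slope-only value is 0.923753 * 29 = 26.788837, which agrees with the 26.7888 shown in the prediction column for that row.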


In [150]:
print(lm.summary())

                            OLS Regression Results                            
Dep. Variable:          est_occupants   R-squared:                       0.609
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     231.6
Date:                Fri, 22 Jul 2016   Prob (F-statistic):           3.75e-32
Time:                        14:19:46   Log-Likelihood:                -638.23
No. Observations:                 151   AIC:                             1280.
Df Residuals:                     149   BIC:                             1287.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      2.3759      2.101      1.131      0.2

In [172]:
df_train['prediction_med'] = None

for i in range(df_train.shape[0]):
    # the prediction uses the 'auth' coefficient only; the fitted intercept is not added
    df_train.at[i, 'prediction_med'] = df_train['auth'][i] * lm.params['auth']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [173]:
# add column to dataframe for prediction category
df_train['cat_predict'] = None

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [174]:
# check column added
df_train.head()

Unnamed: 0,room,event_day,event_hour,ass,auth,occupancy,building,est_occupants,prediction_max,cat_predict,accurate,prediction_med
0,2,3,9,2,2,0.0,school of computer science,0,1.84751,,1,1.84751
1,2,3,10,29,29,0.5,school of computer science,45,26.7888,,1,26.7888
2,2,3,11,27,27,0.5,school of computer science,45,24.9413,,1,24.9413
3,2,3,12,16,16,0.5,school of computer science,45,14.78,,1,14.78
4,2,3,13,13,13,0.0,school of computer science,0,12.0088,,0,12.0088


The model predicts the number of people in a room based on the number of connected devices. We want to categorise these predictions into Empty (0) and Full (1).
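
One way to sketch this categorisation is to divide each prediction by the room capacity and compare the resulting ratio with a threshold. The `categorise` helper, the toy dataframe and the 5% threshold default are illustrative assumptions, and the capacities passed in (90, 90, 220) are taken from this notebook's `estimate_occ` function.

```python
import numpy as np
import pandas as pd

def categorise(df, capacities, room='room', prediction='prediction_med', threshold=0.05):
    """Hypothetical helper: 0 (Empty) if predicted occupancy ratio <= threshold, else 1."""
    capacity = df[room].map(capacities)
    if capacity.isna().any():
        raise ValueError('Room without a known capacity')
    # Predicted occupancy as a fraction of room capacity.
    ratio = df[prediction] / capacity
    return pd.Series(np.where(ratio <= threshold, 0, 1), index=df.index)

# Toy predictions for two rooms.
preds = pd.DataFrame({'room': [2, 2, 4],
                      'prediction_med': [1.84751, 26.7888, 5.54252]})
preds['cat_predict'] = categorise(preds, {2: 90, 3: 90, 4: 220})
print(preds)
```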

In [175]:
def set_occupancy_category(df, room, linear_predict, cat_predict):
    '''function that converts linear predictions to a defined category and updates the dataframe passed through
    
    Parameters
    ----------
    df: a dataframe
    room: a string that is the column in df containing room id values of type INT
    linear_predict: a string that is the column in df containing linear predictions
    cat_predict: a string that is the column in df that will contain category predictions
    
    Assumptions
    -----------
    a predicted occupancy of <= 5% of room capacity is considered an empty room
    
    '''
    
    for i in range(df.shape[0]):
        
        # assign room capacity
        if df[room][i] == 2 or df[room][i] == 3:
            cap = 90
        elif df[room][i] == 4:
            # NOTE: estimate_occ uses a capacity of 220 for room four; 200 is used here
            cap = 200
            
        # calculate the occupancy rate and assign to variable 'ratio'
        ratio = df[linear_predict][i]/ cap
        
        # assign category based on ratio
        if ratio <= 0.05:
            cat = 0 # Empty
        else:
            cat =  1 # Not Empty
        
        # set category value in df
        df.at[i, cat_predict] = cat

In [155]:
set_occupancy_category(df_train, 'room', 'prediction_max', 'cat_predict')

In [156]:
df_train.head()

Unnamed: 0,room,event_day,event_hour,ass,auth,occupancy,building,est_occupants,prediction_max,cat_predict
0,2,3,9,2,2,0.0,school of computer science,0,1.84751,0
1,2,3,10,29,29,0.5,school of computer science,45,26.7888,1
2,2,3,11,27,27,0.5,school of computer science,45,24.9413,1
3,2,3,12,16,16,0.5,school of computer science,45,14.78,1
4,2,3,13,13,13,0.0,school of computer science,0,12.0088,1


Check the accuracy of the model against the survey data.
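
The accuracy check compares the predicted category with whether the survey recorded any occupancy at all. When the predicted category is always 0 or 1, that rule reduces to "survey occupancy is non-zero" matching "predicted category is 1". A minimal sketch, using a hypothetical `binary_accuracy` helper and toy rows:

```python
import pandas as pd

def binary_accuracy(df, occupancy='occupancy', cat_predict='cat_predict'):
    """Hypothetical helper: share of rows where the predicted category matches the survey."""
    # Any non-zero surveyed occupancy counts as 'not empty' (1).
    actual = (df[occupancy] != 0).astype(int)
    predicted = df[cat_predict].astype(int)
    return (actual == predicted).mean()

# Toy rows: the last one is surveyed as empty but predicted as occupied.
rows = pd.DataFrame({'occupancy': [0.0, 0.5, 0.5, 0.5, 0.0],
                     'cat_predict': [0, 1, 1, 1, 1]})
print(binary_accuracy(rows))
```

The helper takes the dataframe as an argument, so the same call can score either the training or the test set.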

In [157]:
df_train['accurate'] = None

for i in range(df_train.shape[0]):
    occ = df_train['occupancy'][i]
    cat = df_train['cat_predict'][i]
    df_train.at[i, 'accurate'] = 1 if (occ == cat) or (occ != 0 and cat == 1) else 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [158]:
df_train.head(15)

Unnamed: 0,room,event_day,event_hour,ass,auth,occupancy,building,est_occupants,prediction_max,cat_predict,accurate
0,2,3,9,2.0,2.0,0.0,school of computer science,0.0,1.84751,0,1
1,2,3,10,29.0,29.0,0.5,school of computer science,45.0,26.7888,1,1
2,2,3,11,27.0,27.0,0.5,school of computer science,45.0,24.9413,1,1
3,2,3,12,16.0,16.0,0.5,school of computer science,45.0,14.78,1,1
4,2,3,13,13.0,13.0,0.0,school of computer science,0.0,12.0088,1,0
5,2,3,14,47.0,47.0,0.75,school of computer science,67.5,43.4164,1,1
6,2,3,15,35.0,35.0,0.25,school of computer science,22.5,32.3313,1,1
7,2,3,16,36.5,36.5,0.25,school of computer science,22.5,33.717,1,1
8,2,4,9,14.0,14.0,0.25,school of computer science,22.5,12.9325,1,1
9,2,4,10,15.0,15.0,0.25,school of computer science,22.5,13.8563,1,1


In [159]:
accuracy = df_train['accurate'].sum()/df_train.shape[0]
accuracy

0.7880794701986755

<b>TEST ON THE TEST SET</b>

In [161]:
df_test.head()

Unnamed: 0,room,event_day,event_hour,ass,auth,occupancy,building,est_occupants
151,4,3,16,6,6.0,0.0,school of computer science,0
152,4,4,9,32,32.0,0.25,school of computer science,55
153,4,4,10,5,4.5,0.0,school of computer science,0
154,4,4,11,61,61.0,0.25,school of computer science,55
155,4,4,12,70,70.0,0.5,school of computer science,110


In [162]:
# add prediction column to the dataframe and set default value to 'None'
df_test['prediction_med'] = None

df_test = df_test.reset_index()

for i in range(df_test.shape[0]):
    df_test.at[i, 'prediction_med'] = df_test['auth'][i] * lm.params['auth']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [163]:
df_test.head()

Unnamed: 0,index,room,event_day,event_hour,ass,auth,occupancy,building,est_occupants,prediction_med
0,151,4,3,16,6,6.0,0.0,school of computer science,0,5.54252
1,152,4,4,9,32,32.0,0.25,school of computer science,55,29.5601
2,153,4,4,10,5,4.5,0.0,school of computer science,0,4.15689
3,154,4,4,11,61,61.0,0.25,school of computer science,55,56.3489
4,155,4,4,12,70,70.0,0.5,school of computer science,110,64.6627


In [165]:
# add column to dataframe for prediction category
df_test['cat_predict'] = None

set_occupancy_category(df_test, 'room', 'prediction_med', 'cat_predict')

In [167]:
df_test['accurate'] = None

for i in range(df_test.shape[0]):
    occ = df_test['occupancy'][i]
    cat = df_test['cat_predict'][i]
    df_test.at[i, 'accurate'] = 1 if (occ == cat) or (occ != 0 and cat == 1) else 0

In [168]:
df_test.head()

Unnamed: 0,index,room,event_day,event_hour,ass,auth,occupancy,building,est_occupants,prediction_med,cat_predict,accurate
0,151,4,3,16,6,6.0,0.0,school of computer science,0,5.54252,0,1
1,152,4,4,9,32,32.0,0.25,school of computer science,55,29.5601,1,1
2,153,4,4,10,5,4.5,0.0,school of computer science,0,4.15689,0,1
3,154,4,4,11,61,61.0,0.25,school of computer science,55,56.3489,1,1
4,155,4,4,12,70,70.0,0.5,school of computer science,110,64.6627,1,1


In [169]:
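# NOTE: this cell scores df_train, so the value below repeats the training accuracy;
# the test set would be scored with df_test['accurate'].sum()/df_test.shape[0]
accuracy = df_train['accurate'].sum()/df_train.shape[0]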
accuracy

0.7880794701986755