Contributing Members: Shane Davey 0885534

Data Curation and Conditioning

In [3]:
#connect to my google drive to access csv file
import pandas as pd 
from google.colab import drive 
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
room_df = pd.read_csv('/content/drive/MyDrive/ESOF - 4011/Project/Occupancy_Estimation.csv') #read in data set

In [5]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from pandas.core.internals.array_manager import NullArrayProxy
from scipy.stats import zscore
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.tree import DecisionTreeRegressor 
from sklearn.neighbors import KNeighborsRegressor
from matplotlib.colors import ListedColormap
import time
import seaborn as sns 

Beginning of Data Analysis:

The first thing to check is whether the data set is formatted correctly and can be opened.

In [6]:
room_df #display data set

Unnamed: 0,Date,Time,S1_Temp,S2_Temp,S3_Temp,S4_Temp,S1_Light,S2_Light,S3_Light,S4_Light,S1_Sound,S2_Sound,S3_Sound,S4_Sound,S5_CO2,S5_CO2_Slope,S6_PIR,S7_PIR,Room_Occupancy_Count
0,2017/12/22,10:49:41,24.94,24.75,24.56,25.38,121,34,53,40,0.08,0.19,0.06,0.06,390,0.769231,0,0,1
1,2017/12/22,10:50:12,24.94,24.75,24.56,25.44,121,33,53,40,0.93,0.05,0.06,0.06,390,0.646154,0,0,1
2,2017/12/22,10:50:42,25.00,24.75,24.50,25.44,121,34,53,40,0.43,0.11,0.08,0.06,390,0.519231,0,0,1
3,2017/12/22,10:51:13,25.00,24.75,24.56,25.44,121,34,53,40,0.41,0.10,0.10,0.09,390,0.388462,0,0,1
4,2017/12/22,10:51:44,25.00,24.75,24.56,25.44,121,34,54,40,0.18,0.06,0.06,0.06,390,0.253846,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10124,2018/01/11,08:58:07,25.06,25.13,24.69,25.31,6,7,33,22,0.09,0.04,0.06,0.08,345,0.000000,0,0,0
10125,2018/01/11,08:58:37,25.06,25.06,24.69,25.25,6,7,34,22,0.07,0.05,0.05,0.08,345,0.000000,0,0,0
10126,2018/01/11,08:59:08,25.13,25.06,24.69,25.25,6,7,34,22,0.11,0.05,0.06,0.08,345,0.000000,0,0,0
10127,2018/01/11,08:59:39,25.13,25.06,24.69,25.25,6,7,34,22,0.08,0.08,0.10,0.08,345,0.000000,0,0,0


In [7]:
room_df.shape #check shape of database

(10129, 19)

Getting the shape of the dataframe, we can see that it matches what is described in the repository: 10129 entries with the 19 necessary columns.

After loading the csv file into a dataframe and running these checks, we can see that the data is present and there were no initial problems with it.

Then we want to check and see that the data provided is complete and there are no values that are missing.

In [None]:
if room_df.isnull().values.any(): #check to see if there are any empty entries in the dataframe
  print ("Some data may be missing") #if there is data missing notify user
else: #if all the data is there
  print ("All data is complete") #notify user


After execution we can see that the data set is complete and we can move on to the next step. It is important to confirm that the data is usable before preprocessing, as any problems here would greatly affect the final product.
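The check above gives a single yes/no answer. A per-column count can make the same check easier to audit. The following is a minimal sketch on a small made-up frame; the name `toy_df` and its values are illustrative assumptions, not taken from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only
toy_df = pd.DataFrame({"S1_Temp": [24.94, np.nan, 25.00],
                       "S5_CO2": [390, 390, 390]})

missing_per_column = toy_df.isnull().sum()     # number of missing entries in each column
total_missing = int(missing_per_column.sum())  # total number of missing entries

print(missing_per_column)
print(total_missing)
```

The same two lines could be run on `room_df` to list which columns, if any, contain gaps.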

Beginning of Data Preprocessing:

We can start by removing the date column, as it is not very useful for predicting room occupancy. Time, however, will be important, since the time of day may give a good indication of how many people are present.

In [None]:
corr = room_df.corr(numeric_only=True) #let corr be the pairwise correlation of all numeric columns in room_df (the Date and Time strings are skipped)
mask = np.zeros_like(corr, dtype = bool) #create an array mask of booleans the same size as the correlation (the np.bool alias is not available in every NumPy version)
mask[np.triu_indices_from(mask)] = True #get the indices of the upper triangle and set them all to true
cmap = sns.diverging_palette(220, 10, as_cmap = True) #create the color map for the correlation, define hues for negative and positive points
plt.figure(figsize = (12,12)) #set the size of the graph to 12 x 12
sns.heatmap(corr, mask = mask, cmap = cmap, center = 0, square = True, linewidths = 0.5) #generate the final heatmap with the correlation the mask and the color map
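The mask in the cell above hides the upper triangle of the correlation matrix, since it mirrors the lower one. As a small sketch of what that mask looks like, here it is built for a 3 x 3 placeholder matrix; `corr_demo` and `mask_demo` are made-up names for illustration.

```python
import numpy as np

corr_demo = np.ones((3, 3))                         # placeholder for a 3 x 3 correlation matrix
mask_demo = np.zeros_like(corr_demo, dtype=bool)    # start with nothing hidden
mask_demo[np.triu_indices_from(mask_demo)] = True   # hide the upper triangle, diagonal included

print(mask_demo)
```

Only the cells where the mask is False are drawn, so each pair of columns appears once in the heatmap.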

In [None]:
room_df = room_df.drop(columns="Date") #remove the date column from the dataframe

Now we still have one important column that is not numeric, the time of day, so that will need to be converted next. Converting all of the variables to numeric is important because it makes the data far simpler to work with: since every column is of the same data type, a single operation can easily be performed over the entire dataframe.

In [None]:
room_df['Time'] = pd.to_timedelta(room_df['Time']).dt.total_seconds() #convert the time string to seconds

In [None]:
room_df #check change in dataframe
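To make the conversion concrete, here is a small sketch that applies the same call to two timestamps. The first is the time shown in row 0 of the data set; the second is a made-up value added for illustration.

```python
import pandas as pd

# First value is taken from row 0 of the data set; the second is made up
sample_times = pd.Series(["10:49:41", "00:00:30"])

# Same conversion as above: parse "HH:MM:SS" and express it as seconds since midnight
seconds = pd.to_timedelta(sample_times).dt.total_seconds()

print(seconds.tolist())  # [38981.0, 30.0]
```

Here 10:49:41 becomes 10 * 3600 + 49 * 60 + 41 = 38981 seconds.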

Now that all of the important data is in the right format, we can remove the outliers to clean up the data. This step is important because outliers left in the data can skew it and reduce the accuracy and consistency of the final system.

In [None]:
zScores = np.abs(zscore(room_df)) #get the absolute z scores of all the data in the dataframe
Keep = (zScores < 3).all(axis=1) #mark the rows where every column is within 3 standard deviations
refined_df = room_df[Keep].reset_index(drop=True) #keep only those rows; drop=True stops the old index from being added as a new column
refined_df #view refined data frame

Now we can see that any entries more than 3 standard deviations away from the mean have been removed, and we are down to 8516 rows of cleaner data.
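The z-score filter can be illustrated on a tiny made-up frame. This is a sketch under stated assumptions: `toy_df` and its values are invented for the example, and the data is passed to `zscore` as a NumPy array so that the result is a plain array.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical data: column "a" has one extreme value, column "b" has none
toy_df = pd.DataFrame({
    "a": [0.0] * 20 + [100.0],
    "b": np.arange(21, dtype=float),
})

z = np.abs(zscore(toy_df.to_numpy(), axis=0))    # absolute z score of every entry, column by column
keep = (z < 3).all(axis=1)                       # True for rows with no entry beyond 3 standard deviations
filtered = toy_df[keep].reset_index(drop=True)   # drop the flagged rows and renumber

print(len(toy_df), len(filtered))
```

Note that a row is removed if any single column exceeds the threshold, which is why one extreme value in "a" is enough to drop the last row.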

Now that the outliers are gone we can normalize the data. The purpose of this is to bring the ranges of all the columns onto a common scale while still maintaining the differences in values. The only column that will not be changed is the room occupancy, as that is how we categorize each output.

In [None]:
#here we will also need to find the min because of negative values
Mins = refined_df.min() #get the mins for all the columns
Maxes = refined_df.max() #get the maxes for all the columns
global_min = Mins.min() #get the smallest value in the whole dataframe (named so the built-in min is not shadowed)
global_max = Maxes.max() #get the largest value in the whole dataframe (named so the built-in max is not shadowed)
normalized_df = (refined_df - global_min) / (global_max - global_min) #normalize the dataframe with one global min and max
normalized_df['Room_Occupancy_Count'] = refined_df['Room_Occupancy_Count'] #restore the room occupancy count for readability
normalized_df #view the new normalized dataframe
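The cell above rescales every column with one global minimum and maximum. A common alternative is to scale each column by its own range, so that a large-valued column such as Time does not compress the others. The following is a sketch of that alternative, not the method used in this notebook; the helper `min_max_per_column` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def min_max_per_column(df):
    """Scale each column to [0, 1] independently; constant columns become 0."""
    col_min = df.min()
    col_range = df.max() - col_min
    col_range = col_range.replace(0, 1)  # avoid division by zero for constant columns
    return (df - col_min) / col_range

# Hypothetical values chosen to make the result easy to check by hand
toy_df = pd.DataFrame({"Time": [0.0, 43200.0, 86400.0],
                       "S1_Temp": [24.0, 25.0, 26.0],
                       "S6_PIR": [0, 0, 0]})

scaled = min_max_per_column(toy_df)
print(scaled)
```

The guard for constant columns matters here because a column can end up with a single value after outlier removal, and dividing by a zero range would otherwise produce NaN.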

The dataframe in its current state is now ready to be used in our machine learning model and based on the changes made it should perform much better than it would have before as well as be much easier to work with.

Link to Data Repository: http://archive.ics.uci.edu/ml/datasets/Room+Occupancy+Estimation


In [None]:
#Linear regression
X = normalized_df.drop('Room_Occupancy_Count', axis=1) #create independent variables
Y = normalized_df['Room_Occupancy_Count'] #create dependent variable

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state = 0) #split up the training and testing data

LRM = linear_model.LinearRegression().fit(X_train, Y_train) #create and fit linear regression model

prediction = LRM.predict(X_test) #make a prediction with LRM and testing data

test_MAE = mean_absolute_error(Y_test, prediction) #calculate the MAE
test_MSE = mean_squared_error(Y_test, prediction) #calculate the MSE
test_r2 = r2_score(Y_test, prediction) #calculate goodness of fit

print(test_MAE) #return MAE
print(test_MSE) #return MSE
print(test_r2) #return goodness of fit
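For reference, the three metrics printed above can be reproduced by hand. This is a minimal sketch on made-up numbers (`y_true` and `y_pred` are illustrative, not results from the model) that checks the manual formulas against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([0.0, 1.0, 2.0, 3.0])  # made-up occupancy counts
y_pred = np.array([0.0, 1.5, 2.0, 2.5])  # made-up predictions

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mean_absolute_error(y_true, y_pred))
print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

MAE and MSE are in units of people and people squared, so lower is better, while an r2 closer to 1 indicates a better fit.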

In [None]:
start = time.process_time() 
result = LRM.predict(X_test)
end = time.process_time()

total = end - start
per_prediction = total/len(X_test) #average time per prediction over the test set
print(per_prediction)

In [None]:
results = "\n".join("{} {}".format(x, y) for x, y in zip(Y_test, prediction)) #compare the predicted values with the actual values
print(results) #print results

In [None]:
#Decision tree regression
X = normalized_df.drop('Room_Occupancy_Count', axis=1) #create independent variables
Y = normalized_df['Room_Occupancy_Count'] #create dependent variable

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state = 0) #split up the training and testing data

DTM = DecisionTreeRegressor().fit(X_train, Y_train) #create and fit the decision tree regressor model

prediction = DTM.predict(X_test) #make a prediction with the regressor and the testing data

test_MAE = mean_absolute_error(Y_test, prediction) #calculate the MAE
test_MSE = mean_squared_error(Y_test, prediction) #calculate the MSE
test_r2 = r2_score(Y_test, prediction) #calculate goodness of fit

print(test_MAE) #return MAE
print(test_MSE) #return MSE
print(test_r2) #return goodness of fit

In [None]:
results = "\n".join("{} {}".format(x, y) for x, y in zip(Y_test, prediction)) #compare the predicted values with the actual values
print(results) #print results

In [None]:
#Decision tree hyperparameter tuning
from sklearn.model_selection import GridSearchCV
#create all the hyperparameters that will be tested
parameters={"splitter":["best","random"],
            "max_depth" : [1,3,5,7,9,11,12],
           "min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
           "min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5],
           "max_features":["log2","sqrt",None], #"auto" is no longer accepted by recent scikit-learn; for regressors it meant the same as None
           "max_leaf_nodes":[None,10,20,30,40,50] }

tuning_model=GridSearchCV(DTM, param_grid=parameters,cv=10,verbose=3) #create the tuning model

tuning_model.fit(X_train,Y_train)

In [None]:
print(tuning_model.best_params_) #return the parameters for the best tuning
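Instead of copying the printed parameters into a new model by hand, the fitted search object also exposes the refit model directly. The following is a sketch on synthetic data, with a deliberately tiny grid; `X_demo`, `y_demo`, and `search` are made-up names, and the data is random rather than the occupancy data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = (X_demo[:, 0] > 0.5).astype(float)

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"max_depth": [1, 3, 5]},
                      cv=5)
search.fit(X_demo, y_demo)

# With the default refit=True, the best model is retrained on all the data passed to fit
best_tree = search.best_estimator_
demo_prediction = best_tree.predict(X_demo[:5])

print(search.best_params_)
print(demo_prediction)
```

Using `best_estimator_` avoids transcription mistakes between the printed parameters and the tuned model.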

In [None]:
tuned_DTM = DecisionTreeRegressor(max_depth=3,max_features=None,max_leaf_nodes=None,min_samples_leaf=1,min_weight_fraction_leaf=0.1,splitter='best', random_state=0) #create the new tuned model (max_features=None is what 'auto' meant for regressors)
tuned_DTM.fit(X_train, Y_train) #fit the model with the training data

tuned_prediction = tuned_DTM.predict(X_test) #test the new tuned model

tuned_MAE = mean_absolute_error(Y_test, tuned_prediction) #calculate the MAE
tuned_MSE = mean_squared_error(Y_test, tuned_prediction) #calculate the MSE
tuned_r2 = r2_score(Y_test, tuned_prediction) #calculate goodness of fit

print(tuned_MAE) #return MAE
print(tuned_MSE) #return MSE
print(tuned_r2) #return goodness of fit

In [None]:
results = "\n".join("{} {}".format(x, y) for x, y in zip(Y_test, tuned_prediction)) #compare the predicted values with the actual values
print(results) #print results

In [None]:
start = time.process_time() 
result = tuned_DTM.predict(X_test)
end = time.process_time()

total = end - start
per_prediction = total/len(X_test) #average time per prediction over the test set
print(per_prediction)

In [None]:
#K nearest neighbor regression

X = normalized_df.loc[:, normalized_df.columns != 'Room_Occupancy_Count'] #create independent variables
Y = normalized_df['Room_Occupancy_Count'] #create dependent variable

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state = 0) #split up the training and testing data

KNN = KNeighborsRegressor().fit(X_train, Y_train) #create and fit the K nearest neighbors regressor model

prediction = KNN.predict(X_test) #make a prediction with the regressor and the testing data

test_MAE = mean_absolute_error(Y_test, prediction) #calculate the MAE
test_MSE = mean_squared_error(Y_test, prediction) #calculate the MSE
test_r2 = r2_score(Y_test, prediction) #calculate goodness of fit

print(test_MAE) #return MAE
print(test_MSE) #return MSE
print(test_r2) #return goodness of fit

In [None]:
results = "\n".join("{} {}".format(x, y) for x, y in zip(Y_test, prediction)) #compare the predicted values with the actual values
print(results) #print results

In [None]:
#K nearest neighbor hyperparameter tuning
from sklearn.model_selection import GridSearchCV
#create all the hyperparameters that will be tested
parameters={"leaf_size" : [10,20,30,40,50],
           "n_neighbors":[5,10,15,20,25,30],
           "p":[1,2] }

tuning_model=GridSearchCV(KNN, param_grid=parameters, cv=10, verbose=3) #create the tuning model

tuning_model.fit(X_train,Y_train) #fit the new tuning model

In [None]:
print(tuning_model.best_params_) #get the best parameters for the KNN

In [None]:
tuned_KNN = KNeighborsRegressor(leaf_size=10, n_neighbors=5, p=1) #create the new tuned model
tuned_KNN.fit(X_train, Y_train) #fit the model with the training data

tuned_prediction = tuned_KNN.predict(X_test) #test the new tuned model

tuned_MAE = mean_absolute_error(Y_test, tuned_prediction) #calculate the MAE
tuned_MSE = mean_squared_error(Y_test, tuned_prediction) #calculate the MSE
tuned_r2 = r2_score(Y_test, tuned_prediction) #calculate goodness of fit

print(tuned_MAE) #return MAE
print(tuned_MSE) #return MSE
print(tuned_r2) #return goodness of fit

In [None]:
results = "\n".join("{} {}".format(x, y) for x, y in zip(Y_test, tuned_prediction)) #compare the predicted values with the actual values
print(results) #print results

In [None]:
start = time.process_time() 
result = tuned_KNN.predict(X_test)
end = time.process_time()

total = end - start
per_prediction = total/len(X_test) #average time per prediction over the test set
print(per_prediction)