#Introduction + Problem Definition
The data used for this coursework is a dataset of bike rentals, recorded between 2011 and 2012, for a bike sharing system that allows users to rent a bike and return it to a particular station. There are multiple such systems, and they have many real-world benefits and applications; the data that can be (and has been) collected can be used to analyse the conditions under which people rent these systems' bikes.

In this instance of the dataset, one group of columns/attributes records the total number of bike rentals, as well as that total split by the type of renter (a casual user or a member of the system).

This information alongside the day and hour columns allows the analyst to identify the peak rental times of the bikes and allows for the analyst to forecast the number of bikes in use at a given time.

Beyond the season, date, year, month and hour, the dataset also records the day of the week and whether each day is a working day or a holiday. This gives the analyst an extra level of exploration, as they can focus on a particular kind of day and use it as another dimension when analysing the peak times.

The dataset also contains information on the type of weather for the particular day and hour. Furthermore, it contains the normalized temperature and the normalized 'feels like' temperature (both in Celsius), as well as the humidity and windspeed. These columns allow the analyst to explore the effects of the weather on the number of bike rentals.

The prediction of the number of bike rentals is a regression problem because the target variable (bike rentals) is a numeric quantity rather than a category. The models that make the predictions identify the relationships between the input variables and the target variable. These relationships are then used to predict the number of bike rentals through the model's linear equation (Linear Regression) or learned conditions (Decision Tree).

The question being asked is how the two temperatures, whether it is a holiday or a working day, whether it is a peak rental time and whether it is night time affect the number of bike rentals.

#Data Ingestion + Importing Libraries
Import the necessary libraries and mount the Google Drive containing the data.

In [None]:
import numpy as np
import math
import pandas as pd
import sklearn.preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error as mse
from google.colab import drive
drive.mount('/content/gdrive/')
path = "/content/gdrive/My Drive/DW_data/"

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


After the Google Drive has been mounted, we will need to load the dataset into a DataFrame.

In [None]:
#Read the bike data from the csv file.
df = pd.read_csv(path+"bike-dataset hour.csv")

#Data Preparation
Before the data is prepared, the statistical data type of each column in the dataset is described. Firstly, the temp, atemp, hum(idity) and windspeed columns are all continuous, as they hold decimal values between 0 and 1.

Next, the dteday, season, year, month, hr and instant columns are all Ordinal data types, as they are discretized (distinct labels/buckets) and have some notion of order. For example, the hr column's order runs from the first hour (0) to the last hour (23).

The weathersit, holiday, weekday and workingday columns are all Nominal data types, as each value/label represents a category. One example of this is the workingday column's values of No and Yes. These values label whether or not it is a day that people work.

The casual, registered and cnt (total number of rentals) columns are Numerical data types. They describe how many rentals there have been in total, as well as how many were made by casual and by registered users. These values are discrete counts rather than continuous values: they cannot be decimals because a user cannot have a fraction of a rental.

The first step is to find appropriate replacement values for the nan values in temp and atemp. To do this, we will compute the average of the valid values for each hour of the day and then replace each nan value with the average for its hour.

The average is taken per hour, rather than over the whole column, so that the time of day (and therefore the peak rental times) is accounted for in the replacement values.

In [None]:
#Iterate over the dataset and add the values together (categorized by the hour they are in).
hs_temp, hs_count = [0.0 for i in range(24)], [0 for i in range(24)] 
for h, t in zip(df['hr'], df['temp']):
  if not math.isnan(t):
    hs_temp[h] += t
    hs_count[h] += 1

#Divide hs_temp by hs_count.
hs_temp, hs_count = np.array(hs_temp), np.array(hs_count)
hs_temp_mean = hs_temp / hs_count

#Do the same as the above but for the atemp attribute of the dataset.
hs_atemp, hs_count = [0.0 for i in range(24)], [0 for i in range(24)] 
for h, t in zip(df['hr'], df['atemp']):
  if not math.isnan(t):
    hs_atemp[h] += t
    hs_count[h] += 1

#Divide hs_atemp by hs_count.
hs_atemp, hs_count = np.array(hs_atemp), np.array(hs_count)
hs_atemp_mean = hs_atemp / hs_count

Using the mean values of temp and atemp (split across each hour), we will now replace each nan value with the mean for its hour.

In [None]:
#This replaces the nan values with the average for the row's hour.
#df.loc[i, column] writes to the DataFrame itself; chained indexing (df['temp'][i] = ...) may write to a temporary copy.
for i in range(len(df['temp'])):
  #Handle the temp attribute if the value is nan.
  if math.isnan(df['temp'][i]):
    df.loc[i, 'temp'] = hs_temp_mean[df['hr'][i]]

  #Handle the atemp attribute if the value is nan.
  if math.isnan(df['atemp'][i]):
    df.loc[i, 'atemp'] = hs_atemp_mean[df['hr'][i]]
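
The per-hour averaging and replacement described above can also be expressed with a pandas groupby. The following is only a minimal sketch on a made-up toy frame: the name `toy` and its values are assumptions for illustration and are not taken from the bike dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bike data (values are made up for illustration).
toy = pd.DataFrame({
    "hr":   [0, 0, 0, 1, 1, 1],
    "temp": [0.2, np.nan, 0.4, 0.6, 0.8, np.nan],
})

# Mean temp per hour, aligned back to every row; nan values are skipped by the mean.
hourly_mean = toy.groupby("hr")["temp"].transform("mean")

# Fill only the missing entries, each with the mean of its own hour.
toy["temp"] = toy["temp"].fillna(hourly_mean)
print(toy)
```

The same two lines could be repeated for atemp; the loop in the notebook and this sketch are intended to produce the same replacement values.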



The next step is to remove the redundant attributes like dteday, yr and day.

In [None]:
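#Keep only the columns that are needed. copy() makes df an independent DataFrame rather than a slice of the original.
df = df[['hr', 'holiday', 'workingday', 'cnt', 'weathersit', 'temp', 'atemp']].copy()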

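In this section, we will label the hours that are peak rental times and the hours that are night time. On working days the peak times are taken to be hours 7 to 9 and 16 to 19, on non-working days they are hours 10 to 16, and the night runs from hour 22 to hour 4.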

In [None]:
#Create these columns within the dataframe so the peaktimes and nighttimes can be identified and stored.
df['peaktimes'] = 0
df['nighttime'] = 0

#Find the appropriate labels for the new columns for each entry.
#df.loc[i, column] writes each value to the DataFrame itself rather than to a temporary copy.
for i in range(len(df['workingday'])):
  hr = df['hr'][i]

  #This checks for the peak times for the working days (hours 7 to 9 and 16 to 19).
  if df['workingday'][i] == "Yes":
    df.loc[i, 'peaktimes'] = 1 if (7 <= hr <= 9) or (16 <= hr <= 19) else 0

  #This checks for the peak times when they are not working (hours 10 to 16).
  if df['workingday'][i] == "No":
    df.loc[i, 'peaktimes'] = 1 if 10 <= hr <= 16 else 0

  #This checks the periods that are considered to be the night (hour 22 through to hour 4).
  df.loc[i, 'nighttime'] = 1 if hr >= 22 or hr <= 4 else 0
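
The same labelling rules can be applied to whole columns at once with boolean masks, which avoids the row-by-row loop. This is a sketch on a made-up toy frame (the name `toy` and its rows are assumptions); it also assumes that any workingday value other than "Yes" means a non-working day.

```python
import numpy as np
import pandas as pd

# Toy frame with the two columns the rules depend on (values are made up).
toy = pd.DataFrame({
    "hr":         [8, 12, 17, 23, 12, 3],
    "workingday": ["Yes", "Yes", "Yes", "Yes", "No", "No"],
})

# Boolean masks for each rule; Series.between is inclusive on both ends.
working = toy["workingday"] == "Yes"
commute_peak = toy["hr"].between(7, 9) | toy["hr"].between(16, 19)
leisure_peak = toy["hr"].between(10, 16)

# Use the commute rule on working days and the leisure rule otherwise.
toy["peaktimes"] = np.where(working, commute_peak, leisure_peak).astype(int)

# Night time is hour 22 through to hour 4, regardless of the kind of day.
toy["nighttime"] = ((toy["hr"] >= 22) | (toy["hr"] <= 4)).astype(int)
print(toy)
```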


Data Binning is the process of discretizing a variable into individual 'buckets'. An example of this would be taking a continuous value such as age and placing it into buckets/brackets such as 18-24, 25-34, all the way to 60+.
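
The age example can be sketched with `pandas.cut`. The text only names the 18-24, 25-34 and 60+ brackets, so the single 35-59 bracket below, the sample ages and the variable names are assumptions made purely for illustration.

```python
import pandas as pd

# Made-up ages to be placed into brackets.
ages = pd.Series([18, 24, 25, 40, 60, 73])

# Bracket edges and labels; the 35-59 bracket is a hypothetical stand-in for the
# intermediate brackets, which the text does not list.
edges = [18, 25, 35, 60, float("inf")]
labels = ["18-24", "25-34", "35-59", "60+"]

# right=False makes each bracket include its lower edge and exclude its upper edge.
brackets = pd.cut(ages, bins=edges, labels=labels, right=False)
print(brackets.tolist())
```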

It is appropriate to apply Data Binning to the temp and atemp columns of the dataset. This is because it allows the dataset to be split by a Decision Tree whilst also retaining enough of its linear origin that it can be used by Linear Regression and other equation-based regression models.

In the following section the data will be encoded to their appropriate types. The temp and atemp columns will be discretized (Data Binning). Each 'bucket' or discretized value for these columns will receive its own column, reducing these variables to binary labels, which is intended to make the models more accurate.

In [None]:
#Discretize the temp and atemp columns into ten equal-width bins labelled 0 to 9.
#Bin 0 covers [0.0, 0.1] and bin i covers (i/10, (i+1)/10]. pd.cut bins every value in one pass,
#so a value that has already been replaced by its label cannot be binned a second time.
edges = [i/10 for i in range(11)]
df['temp'] = pd.cut(df['temp'], bins=edges, labels=False, include_lowest=True).astype(int)
df['atemp'] = pd.cut(df['atemp'], bins=edges, labels=False, include_lowest=True).astype(int)

#Convert the workingday column to an integer-based binary label
df['workingday'] = sklearn.preprocessing.LabelEncoder().fit_transform(df['workingday'])

#Given each index (between 0 and 9), split each label into its own binary column.
n=10
for i in range(n):
  df['temp_{idx}'.format(idx=i)] = (df['temp'] == i).astype(int)
  df['atemp_{idx}'.format(idx=i)] = (df['atemp'] == i).astype(int)

#Now we just need to drop the hr column from the DataFrame.
df = df.drop(columns=['hr'])
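
Giving each bucket its own binary column is commonly called one-hot encoding, and pandas offers `get_dummies` for it. The sketch below works on a made-up toy column; the names `toy`, `temp_bin` and `one_hot` are assumptions for illustration, and it is an alternative way of writing the step rather than the notebook's own code.

```python
import pandas as pd

# Toy normalized temperatures (made-up values between 0 and 1).
toy = pd.DataFrame({"temp": [0.05, 0.10, 0.11, 0.55, 1.00]})

# Ten equal-width bins labelled 0 to 9; the first bin includes 0.0.
edges = [i / 10 for i in range(11)]
toy["temp_bin"] = pd.cut(toy["temp"], bins=edges, labels=False, include_lowest=True)

# Declaring all ten categories guarantees ten columns even if a bin is empty.
categories = pd.Series(pd.Categorical(toy["temp_bin"], categories=list(range(10))))
one_hot = pd.get_dummies(categories, prefix="temp").astype(int)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column of the bin that the row's temperature falls into.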

#Data Segregation
The dataset will have an 80/0/20% (training/validation/testing) split. This split is used because the chosen methods benefit from being trained on a larger share of the dataset, which should give a more accurate model.

No separate validation set is held out for the same reason; the grid search used later applies 5-fold cross-validation to the training set instead.

However, 100% of the dataset is not used for training because the model needs to be tested on unseen data to ensure it is of a good enough standard in terms of the given metric (which is specified in the Model Training and Evaluation section).

In [None]:
#Shuffle the dataset and then Split the dataset into the training and the testing sets.
df = df.sample(frac=1).reset_index(drop=True)
train_df, test_df = df[:int(0.8*df.shape[0])], df[int(0.8*df.shape[0]):]

#Select the target variable for the regression model.
names = [t for t in df if t != 'cnt' and t != 'temp' and t != 'atemp']
test_input, test_target = test_df[names], test_df['cnt']
train_input, train_target = train_df[names], train_df['cnt']
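
A shuffled 80/20 split can also be made with scikit-learn's `train_test_split`, where fixing `random_state` makes the split repeatable between runs. This is a sketch on a made-up toy frame; the name `toy`, the `feature` column and the seed value 0 are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: one input column and a target column named cnt (values are made up).
toy = pd.DataFrame({"feature": np.arange(10), "cnt": np.arange(10) * 2})

# Shuffle and split: 80% of the rows for training, 20% for testing.
train_input, test_input, train_target, test_target = train_test_split(
    toy[["feature"]], toy["cnt"], test_size=0.2, shuffle=True, random_state=0
)
print(len(train_input), len(test_input))
```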

#Model Training and Evaluation
In the next Code Block, a Decision Tree model as well as a Lasso (Linear Regression) model will be implemented. These are used as a baseline to compare to a version of each model with tuned parameters.

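The Decision Tree model has been chosen because the model predicts through a set of learned conditions, so it can split the dataset on the binary and binned columns prepared above.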

The Lasso model has been chosen because the target variable (cnt) is numeric and has a linear trend. Lasso is a linear regression model with an L1 penalty on its coefficients, which also allows sparse coefficients to be identified: columns/attributes that contribute little are given a coefficient of zero.
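
The sparse coefficients mentioned above can be seen on a small synthetic problem. In this sketch the data is generated, not taken from the bike dataset: only the first of three input columns drives the target, and the names `X`, `y` and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: three input columns, of which only the first affects the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Fit Lasso with the same alpha that the notebook's baseline uses by default.
model = Lasso(alpha=1.0).fit(X, y)

# The coefficients of the two irrelevant columns are driven to zero,
# while the first coefficient is shrunk towards zero but kept.
print(model.coef_)
```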

The Root Mean Square Error (RMSE) has been chosen because it averages the squared error of the model and then takes the square root, which expresses the error in the same units as the target variable. For the purpose of having the error on a scale between 0 and 1, the error is then normalized using test_target's minimum and maximum.
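
The metric can be written out as two small functions. This is a sketch with hypothetical function names and made-up numbers; the scaling mirrors the formula used in this notebook, which subtracts the minimum of the targets before dividing by their range.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root of the mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def min_max_scaled_rmse(y_true, y_pred):
    """RMSE scaled with the minimum and maximum of the targets, as in this notebook."""
    y_true = np.asarray(y_true, dtype=float)
    return (rmse(y_true, y_pred) - y_true.min()) / (y_true.max() - y_true.min())

# Made-up targets and predictions: every prediction is off by 2.
y_true = [0, 10, 20, 30]
y_pred = [2, 8, 22, 28]
print(rmse(y_true, y_pred))
print(min_max_scaled_rmse(y_true, y_pred))
```

When the smallest target is 0, as in this example, the scaled value is simply the RMSE divided by the range of the targets.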

In [None]:
base_error = []
for m in [Lasso, DecisionTreeClassifier]:
  #Build and train the model.
  model = m()
  model.fit(train_input, train_target)

  #Calculate the root mean square error (RMSE) using the mean squared error metric from sklearn.
  results = model.predict(test_input)
  results = mse(test_target, results)
  results = pow(results, 1/2)

  #Store the normalized error (using the minimum and maximum values of the test targets).
  base_error.append((results - test_target.min())/ (test_target.max() - test_target.min()))
  print(base_error[len(base_error)-1])

0.13298189905639743
0.16414371209892342


In the next Code Block, the above models will be tuned automatically using a grid search so that they can be compared against their baseline models. This method of optimizing the hyperparameters has been chosen because it takes the specified values for each parameter, such as the max depth of the Decision Tree, and tries every combination of them, scoring each one with cross-validation. Another method such as RandomizedSearchCV only samples a fixed number of combinations, so it may miss the best one in the grid.

In [None]:
#Run the parameter tuning on the DecisionTreeClassifier object.
tuned_parameters = [{"criterion": ["gini", "entropy"], "max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10]},]
model = DecisionTreeClassifier()

grid_search = GridSearchCV(model, tuned_parameters, scoring="neg_root_mean_squared_error", cv=5)
grid_search.fit(train_input, train_target)

print("Best parameters (Decision Tree):", grid_search.best_params_)

#Run the parameter tuning on the Lasso object.
tuned_parameters = [{"alpha": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},]
model = Lasso()

grid_search = GridSearchCV(model, tuned_parameters, scoring="neg_root_mean_squared_error", cv=5)
grid_search.fit(train_input, train_target)

print("Best parameters (Lasso):", grid_search.best_params_)



Best parameters (Decision Tree): {'criterion': 'gini', 'max_depth': 4}
Best parameters (Lasso): {'alpha': 1}
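
Rather than copying the best parameters by hand, a fitted `GridSearchCV` exposes the winning model directly as `best_estimator_`, which is refit on the full training data by default. The sketch below uses synthetic data and `DecisionTreeRegressor`, scikit-learn's regression counterpart of the classifier used in this notebook; swapping it in, the data and all names are assumptions for illustration and not what was run above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: the target depends on the first column plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = 100 * X[:, 0] + rng.normal(scale=5, size=300)

# Try every max_depth in the grid, scoring each with 5-fold cross-validation.
param_grid = {"max_depth": [2, 3, 4, 5, 6]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

# best_estimator_ is the model rebuilt with the best parameters and refit on X, y.
best_model = search.best_estimator_
print("Best parameters:", search.best_params_)
print("Predictions:", best_model.predict(X[:3]))
```

Using `best_estimator_` keeps the evaluated model in step with whatever the search selected, so the parameters never need to be retyped.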


Using the best parameters for each regression technique, the following code block will construct and train the models and then calculate the root mean squared error (RMSE) to compare them to the baselines of these models.

In [None]:
#This stores the average errors of the "optimized" model's errors.
optimized_error = []

#Generate, train and test the Lasso method.
model = Lasso(alpha=1)
model.fit(train_input, train_target)

#Calculate the root mean square error (RMSE) using the mean squared error metric from sklearn.
results = model.predict(test_input)
results = mse(test_target, results)
results = pow(results, 1/2)

#Store the normalized error (using the minimum and maximum values of the test targets).
optimized_error.append((results - test_target.min())/ (test_target.max() - test_target.min()))
print(optimized_error[len(optimized_error)-1])

#Generate, train and test the Decision Tree method.
model = DecisionTreeClassifier(criterion="gini", max_depth=4)
model.fit(train_input, train_target)

#Calculate the root mean square error (RMSE) using the mean squared error metric from sklearn.
results = model.predict(test_input)
results = mse(test_target, results)
results = pow(results, 1/2)

#Store the normalized error (using the minimum and maximum values of the test targets).
optimized_error.append((results - test_target.min())/ (test_target.max() - test_target.min()))
print(optimized_error[len(optimized_error)-1])

0.13298189905639743
0.15150251384750202


In the following Code Block, the baseline models will be compared to their optimized counterparts.

In [None]:
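#A lower error is better, so the optimized model wins when its error is smaller than the baseline's.
if optimized_error[0] < base_error[0]:
  print("Optimized Lasso is better by {diff}.".format(diff=abs(optimized_error[0] - base_error[0])))
elif optimized_error[0] > base_error[0]:
  print("Baseline Lasso is better by {diff}.".format(diff=abs(optimized_error[0] - base_error[0])))
else:
  print("Baseline and optimized Lasso have the same error.")

if optimized_error[1] < base_error[1]:
  print("Optimized Decision Tree is better by {diff}.".format(diff=abs(optimized_error[1] - base_error[1])))
elif optimized_error[1] > base_error[1]:
  print("Baseline Decision Tree is better by {diff}.".format(diff=abs(optimized_error[1] - base_error[1])))
else:
  print("Baseline and optimized Decision Tree have the same error.")

Baseline and optimized Lasso have the same error.
Optimized Decision Tree is better by 0.012641198251421404.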


#Conclusion
Looking at the results of the models, it can be seen that the baseline Lasso (Linear) regression model has the same error as its optimized counterpart. This is because the grid search selected alpha=1, which is also the default value used by the baseline model, so the two models are identical. This shows that, in this particular instance, there is no need to optimize this hyperparameter over the values tried, as the baseline model is just as good at predicting the number of bike rentals.

As for the Decision Tree, it can be seen that the optimized decision tree is marginally better than the baseline model (a difference of roughly 0.013 in normalized error). This suggests that there is no reason to spend more computational resources on optimizing this model for such a small improvement.

One potential improvement would be a wider comparison of models. This would allow the relationships between the inputs and the target(s) to be viewed from a variety of angles.
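
A wider comparison could score several models with the same cross-validated RMSE. The sketch below runs on synthetic data; the choice of candidate models (including Ridge and a Decision Tree regressor), their settings and all names are hypothetical and only illustrate the shape such a comparison might take.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared bike data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 50 * X[:, 0] + 20 * X[:, 1] + rng.normal(scale=2, size=200)

# Hypothetical set of candidate models to compare.
candidates = {
    "Lasso": Lasso(alpha=1.0),
    "Ridge": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Score each candidate with 5-fold cross-validated RMSE (lower is better).
rmse_by_model = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=5)
    rmse_by_model[name] = -scores.mean()
    print(name, rmse_by_model[name])
```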

Another potential improvement would be to vary the target variable being used. In particular, the variables that record whether the renters are members of the service or not (casual and registered) could be used as targets.