# INFO 251: Final project

#### Members: Marius Brogaard Lerstein and Tuva Cornelia Oppenhagen

## Introduction

Bike sharing is a growing mode of transportation in many cities around the world, with over 500 rental programs and more than 500 thousand bikes in total. An important insight for these businesses is how many bikes are rented per day. Because factors such as weather, weekends, temperature, and wind affect the number of rentals, the data gives a good foundation for a machine learning classification problem. In our research project, we want to try out different classification algorithms to predict the number of bikes rented on a given day. The models we want to try are logistic regression, Naive Bayes, K-Nearest Neighbors, Decision Trees, and Random Forests.

The dataset we are going to use contains the count of rental bikes between 2011 and 2012 in Washington, DC, including information about the weather conditions and season. We use the hourly version of the dataset so that we have more data for training and testing our models.

<b>Link to dataset:</b> https://www.kaggle.com/datasets/marklvl/bike-sharing-dataset

As the different classification models require different data preprocessing, we plan to finish one model before starting on the next. We therefore suggest these milestones for our deliverables:

<b>Before 04/15/2023</b>: 
Implement models for logistic regression, Naive Bayes, and K-Nearest Neighbors, with the required preprocessing.

<b>Before 03/05/2023</b>: 
Implement the decision tree and random forest algorithms, with their required preprocessing. This milestone should also include a comparison of the accuracy values, to sum up the results from all of our algorithms.


## EDA

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

# Fix the seeds so that results are reproducible
seed = 99
random.seed(seed)
np.random.seed(seed)

In [None]:
data = pd.read_csv('data/hour.csv')

In [None]:
data.head()

In [None]:
data.dtypes

In [None]:
print('Contains missing values:', data.isnull().values.any())

In [None]:
print('Data shape:', data.shape)
data.describe()

In the table above, we can see descriptive statistics of the 16 numerical features (of 17 total features).

### Distribution of features
Visualize the data to improve insight. Try to identify patterns, trends, outliers, etc.

In [None]:
import seaborn as sns

fig, axes = plt.subplots(nrows=4, ncols=5, figsize=(15, 10))
axes = axes.flatten()

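# Plot the distribution of every column except the date string
columns = data.drop(['dteday'], axis=1).columns
for i, col in enumerate(columns):
    sns.histplot(data[col], ax=axes[i])
    axes[i].set_title(col)

# Hide the axes left unused in the 4x5 grid
for ax in axes[len(columns):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()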

From the descriptive statistics we see that the distributions of the features vary. To avoid scaling issues that can hurt the performance of some models, it might be a good idea to normalize or standardize the features.
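
The difference between the two rescaling options can be sketched on a small example. The toy frame below uses made-up values (it is not read from `hour.csv`), and the choice of scikit-learn's `MinMaxScaler` and `StandardScaler` is an assumption about how the scaling might be done, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy frame standing in for two numeric columns (values are made up)
toy = pd.DataFrame({
    "hum": [0.20, 0.50, 0.80, 1.00],
    "cnt": [10.0, 40.0, 250.0, 900.0],
})

# Normalization: rescale each column to the [0, 1] range
normalized = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Standardization: shift and scale each column to zero mean and unit variance
standardized = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(normalized.round(3))
print(standardized.round(3))
```

When a train/test split is used, the scaler would normally be fit on the training split only and then applied to the test split, so that no information from the test data leaks into the model.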

## Data cleaning

In [None]:
data = data.drop(columns=['dteday', 'casual', 'registered'])
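
One reason to drop `casual` and `registered` is that, in this dataset, `cnt` is the sum of the two, so keeping them as features would reveal the target. A minimal sketch of how that relationship could be checked, using made-up toy rows rather than the real file:

```python
import pandas as pd

# Toy rows with made-up values, mimicking the three count columns
toy = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "cnt": [16, 40, 32],
})

# If this holds, casual and registered together give away the target exactly
leaks_target = (toy["casual"] + toy["registered"] == toy["cnt"]).all()
print("casual + registered == cnt for every row:", leaks_target)
```

The same comparison could be run on the full dataframe before the columns are dropped.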

In [None]:
data['weathersit'].value_counts()

## Feature Engineering

#### Encoding categorical variables to numerical values

In [None]:
data.head()

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
trans_season = ohe.fit_transform(data[['season']])
data[ohe.categories_[0]] = trans_season.toarray()
data = data.drop(columns=['season'])
data = data.rename(columns={1:'season_1', 2:'season_2', 3:'season_3', 4:'season_4'})

In [None]:
ohe = OneHotEncoder()
trans_yr = ohe.fit_transform(data[['yr']])
data[ohe.categories_[0]] = trans_yr.toarray()
data = data.drop(columns=['yr'])
data = data.rename(columns={0:'year_0', 1:'year_1'})

In [None]:
ohe = OneHotEncoder()
trans_mnth = ohe.fit_transform(data[['mnth']])
data[ohe.categories_[0]] = trans_mnth.toarray()
data = data.drop(columns=['mnth'])
data = data.rename(columns={1:'mnth_1', 2:'mnth_2', 3:'mnth_3', 4:'mnth_4', 5:'mnth_5', 6:'mnth_6', 7:'mnth_7', 8:'mnth_8', 9:'mnth_9', 10:'mnth_10', 11:'mnth_11', 12:'mnth_12'})

In [None]:
ohe = OneHotEncoder()
trans_hr = ohe.fit_transform(data[['hr']])
data[ohe.categories_[0]] = trans_hr.toarray()
data = data.drop(columns=['hr'])
data = data.rename(columns={0:'hr_0', 1:'hr_1', 2:'hr_2', 3:'hr_3', 4:'hr_4', 5:'hr_5', 6:'hr_6', 7:'hr_7', 8:'hr_8', 9:'hr_9', 10:'hr_10', 11:'hr_11', 12:'hr_12', 13:'hr_13', 14:'hr_14', 15:'hr_15', 16:'hr_16', 17:'hr_17', 18:'hr_18', 19:'hr_19', 20:'hr_20', 21:'hr_21', 22:'hr_22', 23:'hr_23'})

In [None]:
ohe = OneHotEncoder()
trans_holiday = ohe.fit_transform(data[['holiday']])
data[ohe.categories_[0]] = trans_holiday.toarray()
data = data.drop(columns=['holiday'])
data = data.rename(columns={0:'holiday_0', 1:'holiday_1'})

In [None]:
ohe = OneHotEncoder()
trans_weekday = ohe.fit_transform(data[['weekday']])
data[ohe.categories_[0]] = trans_weekday.toarray()
data = data.drop(columns=['weekday'])
data = data.rename(columns={0:'weekday_0', 1:'weekday_1', 2:'weekday_2', 3:'weekday_3', 4:'weekday_4', 5:'weekday_5', 6:'weekday_6'})

In [None]:
ohe = OneHotEncoder()
trans_workingday = ohe.fit_transform(data[['workingday']])
data[ohe.categories_[0]] = trans_workingday.toarray()
data = data.drop(columns=['workingday'])
data = data.rename(columns={0:'workingday_0', 1:'workingday_1'})

In [None]:
ohe = OneHotEncoder()
trans_weathersit = ohe.fit_transform(data[['weathersit']])
data[ohe.categories_[0]] = trans_weathersit.toarray()
data = data.drop(columns=['weathersit'])
data = data.rename(columns={1:'weathersit_1', 2:'weathersit_2', 3:'weathersit_3', 4:'weathersit_4'})
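
The per-column encoding cells above all follow the same pattern, so they could probably be collapsed into a single call. The sketch below uses `pd.get_dummies` instead of `OneHotEncoder`, which is a swapped-in technique and not the approach used in this notebook. The toy frame holds made-up values, and only two of the categorical columns are shown.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the dataset
toy = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "yr": [0, 1, 0, 1],
    "temp": [0.24, 0.50, 0.72, 0.30],
})

# Map each categorical column to the prefix used for its new columns
prefixes = {"season": "season", "yr": "year"}

# get_dummies drops the original columns and names the new ones <prefix>_<value>
encoded = pd.get_dummies(
    toy,
    columns=list(prefixes),
    prefix=prefixes,
    dtype=float,
)

print(encoded.columns.tolist())
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so the encoder-based approach is the safer choice if new data has to be transformed later with the same columns.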

In [None]:
print(data.columns)
print(len(data.columns))

### Store the dataframe as CSV to avoid repeating the preprocessing

In [None]:
data.to_csv('data/rental.csv', index=False)