# Bike Sharing Demand
<hr style="border:2px solid black">

This notebook is a semi-guided, open-ended mini project in which the reader revisits concepts introduced earlier.

<img src="../data/capital_bikeshare.png" width="800"/>

## 1. Introduction

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people can rent a bike at one location and return it at another on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems make them attractive to researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network that can be used to study mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

### 1.1 Load Packages

In [None]:
# data analysis stack
import numpy as np
import pandas as pd

# data visualization stack
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

In [None]:
# machine-learning stack
from sklearn.preprocessing import (
    MinMaxScaler,
    PolynomialFeatures
)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split
# cross_val_score
from sklearn.linear_model import LinearRegression
# from sklearn.neighbors import KNeighborsRegressor

In [None]:
# math and stat
import scipy.stats as ss

In [None]:
# miscellaneous
import time
import joblib
import warnings
warnings.filterwarnings("ignore")

### 1.2 Data Description

We are provided hourly rental data spanning two years. The dataset covers the first 19 days of each month.

#### Data Fields

|column|description|
|:--------:|:-------------------------:|
|`datetime`| hourly date + timestamp|
|`season`|  1 = spring, 2 = summer, 3 = fall, 4 = winter| 
|`holiday`| whether the day is considered a holiday|
|`workingday`| whether the day is neither a weekend nor holiday|
|`weather`| 1: Clear, Few clouds, Partly cloudy|
||2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist|
||3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds|
||4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog|
|`temp`|temperature in Celsius|
|`atemp`| "feels like" temperature in Celsius|
|`humidity`| relative humidity|
|`windspeed`| wind speed|
|`casual`| number of non-registered user rentals initiated|
|`registered`|number of registered user rentals initiated|
|`count`|number of total rentals|

### 1.3 User Defined Functions

In [None]:
def cramers_corrected_stat(df,cat_col1,cat_col2):
    """
    Return the bias-corrected Cramer's V statistic, a measure of association
    between two categorical columns of a dataframe (0 = none, 1 = perfect).
    """
    crosstab = pd.crosstab(df[cat_col1],df[cat_col2])
    chi_sqr = ss.chi2_contingency(crosstab)[0]
    n = crosstab.sum().sum()
    r,k = crosstab.shape
    phi_sqr_corr = max(0, chi_sqr/n - ((k-1)*(r-1))/(n-1))    
    r_corr = r - ((r-1)**2)/(n-1)
    k_corr = k - ((k-1)**2)/(n-1)
    
    result = np.sqrt(phi_sqr_corr / min( (k_corr-1), (r_corr-1)))
    return round(result,3)

In [None]:
def anova_pvalue(df,cat_col,num_col):
    """
    Return the one-way ANOVA p-value for a categorical column and a
    numerical column of a dataframe. A small p-value is evidence against
    the null hypothesis that all category means are equal.
    """
    CategoryGroupLists = df.groupby(cat_col)[num_col].apply(list)
    AnovaResults = ss.f_oneway(*CategoryGroupLists)
    p_value = round(AnovaResults[1],3)
    return p_value
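As a quick sanity check on how to read the ANOVA p-value, the sketch below applies the same `ss.f_oneway` call to made-up data. The frame `toy`, its columns, and the helper `toy_anova_pvalue` are hypothetical names introduced only for this illustration. One numerical column depends on the group and the other does not.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

rng = np.random.default_rng(0)

# toy data (made up): 'group' shifts the mean of 'shifted' but not of 'noise'
toy = pd.DataFrame({
    'group': np.repeat(['a', 'b', 'c'], 200),
    'noise': rng.normal(loc=0.0, scale=1.0, size=600),
})
toy['shifted'] = toy['noise'] + toy['group'].map({'a': 0.0, 'b': 2.0, 'c': 4.0})

def toy_anova_pvalue(df, cat_col, num_col):
    # one list of values per category, passed to the one-way ANOVA F-test
    groups = df.groupby(cat_col)[num_col].apply(list)
    return ss.f_oneway(*groups).pvalue

p_shifted = toy_anova_pvalue(toy, 'group', 'shifted')
p_noise = toy_anova_pvalue(toy, 'group', 'noise')
print(f'p-value when group means differ: {p_shifted:.3g}')
print(f'p-value when there is no group effect: {p_noise:.3g}')
```

A p-value close to zero indicates that the category means differ. It does not measure how strong the relationship is, so it is best read alongside the bar plots.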

## 2. Get Data

In [None]:
# read training data from file
df = pd.read_csv('../data/bike_sharing_data.csv', parse_dates=[0])

### Data Quick Check

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.isna().sum()

Observation: Explicit missing values appear only in the `temp` column. Some missing values in other columns may, however, be recorded as zeros.

### Train-Test Split

In [None]:
train,test = train_test_split(df,test_size=0.2,random_state=42)
train.shape, test.shape

## 3. Exploratory Data Analysis

### 3.1 Target Properties

In [None]:
# target
target = 'count'

In [None]:
target_dist= train[target].value_counts()
sns.lineplot(
    x=target_dist.index,
    y=target_dist.values
);

In [None]:
sns.histplot(train[target], bins=30);

- Observation: The count distribution is right-skewed.

### 3.2 Categorical Features

#### Create new relevant temporal features

In [None]:
# create 'year', 'month', 'weekday','hour' columns from 'datetime'
train['year'] = train['datetime'].dt.year
train['month'] = train['datetime'].dt.month
train['weekday'] = train['datetime'].dt.weekday
train['hour'] = train['datetime'].dt.hour

#### count vs year

In [None]:
df_year = train.groupby('year')['count'].mean().reset_index()
sns.barplot(data=df_year,x='year',y='count');
anova_pvalue(train,'year','count')

#### count vs season

In [None]:
df_season = train.groupby(['year','season'])['count'].mean().reset_index()
sns.barplot(data=df_season,x='season',y='count');
anova_pvalue(train,'season','count')

#### count vs month

In [None]:
df_month = train.groupby(['year','month'])['count'].mean().reset_index()
sns.barplot(data=df_month,x='month',y='count');
anova_pvalue(train,'month','count')

#### count vs hour

In [None]:
df_hour = train.groupby(['year','hour'])['count'].mean().reset_index()
sns.barplot(data=df_hour,x='hour',y='count');
anova_pvalue(train,'hour','count')

#### count vs weekday

In [None]:
df_weekday = train.groupby(['year','weekday'])['count'].mean().reset_index()
sns.barplot(data=df_weekday,x='weekday',y='count');
anova_pvalue(train,'weekday','count')

#### count vs workingday

In [None]:
df_work = train.groupby(['year','workingday'])['count'].mean().reset_index()
sns.barplot(data=df_work,x='workingday',y='count');
anova_pvalue(train,'workingday','count')

#### count vs holiday

In [None]:
df_holiday = train.groupby(['year','holiday'])['count'].mean().reset_index()
sns.barplot(data=df_holiday,x='holiday',y='count');
anova_pvalue(train,'holiday','count')

#### count vs weather

In [None]:
df_weather = train.groupby(['year','weather'])['count'].mean().reset_index()
sns.barplot(data=df_weather,x='weather',y='count');
anova_pvalue(train,'weather','count')

#### Correlation among features (Cramer's V)

In [None]:
cat_feat = ['year','season','month','hour','weekday','workingday','holiday','weather']
cramer_v_corr = {}
for feat1 in cat_feat:
    cramer_v_corr[feat1] = [cramers_corrected_stat(train,feat1,feat2) for feat2 in cat_feat]
cat_corr = pd.DataFrame(index=cramer_v_corr.keys(),data=cramer_v_corr)

plt.figure(figsize=(6,6),dpi=100)
sns.heatmap(data=cat_corr, cmap='coolwarm', linecolor='white', linewidth=1, annot=True);

### 3.3 Numerical Features

#### count vs temperature

In [None]:
sns.displot(train['temp'], kde=True, color='blue');
corr_ = round(train['temp'].corr(train['count']),3)
print(f'correlation value: {corr_}')

#### count vs feels-like temperature

In [None]:
sns.displot(train['atemp'], kde=True, color='green');
corr_ = round(train['atemp'].corr(train['count']),3)
print(f'correlation value: {corr_}')

#### count vs humidity

In [None]:
sns.displot(train['humidity'], kde=True, color='blue');
corr_ = round(train['humidity'].corr(train['count']),3)
print(f'correlation value: {corr_}')

#### count vs windspeed

In [None]:
sns.displot(train['windspeed'], kde=True, color='green');
corr_ = round(train['windspeed'].corr(train['count']),3)
print(f'correlation value: {corr_}')

Observation: The windspeed distribution has a pronounced spike at 0, which suggests that missing values were recorded as 0. These values should be imputed.

#### cross correlation of features

In [None]:
num_feat = ['temp', 'atemp','humidity','windspeed']
plt.figure(figsize=(4,4),dpi=100)
sns.heatmap(data=train[num_feat].corr(), cmap='coolwarm', linecolor='white', linewidth=1, annot=True);

## 4. Feature Engineering

#### Target variable

Because the target variable is skewed, we apply a logarithmic transformation. Many ML models tend to perform better when the target distribution is closer to symmetric.

In [None]:
train['log_count'] = np.log1p(train['count'])

In [None]:
target_dist= train['log_count'].value_counts()
sns.lineplot(
    x=target_dist.index,
    y=target_dist.values
);

In [None]:
sns.histplot(train['log_count'], bins=30);
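A model trained on `log_count` predicts on the log scale, so its output has to be mapped back to rental counts. `np.expm1` is the inverse of `np.log1p`. The sketch below checks the round trip on a few made-up counts (the array `counts` is illustrative, not taken from the dataset).

```python
import numpy as np

counts = np.array([0, 1, 10, 100, 977])   # made-up rental counts
log_counts = np.log1p(counts)             # log(1 + count), well defined at count = 0
recovered = np.expm1(log_counts)          # exp(x) - 1 undoes log1p

print(np.allclose(recovered, counts))
```

Using `log1p` rather than `log` matters here because hours with zero rentals would otherwise map to negative infinity.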

#### New column: 'day_type'

In [None]:
# weekends are non-working days that are not holidays either
# workingdays and holidays do not overlap
train.groupby(['workingday','holiday','weekday'])[['count']].count()

Comment: Create a new categorical column with three categories: working day, weekend, and holiday.

In [None]:
# new column 'day_type': 0 = working day, 1 = weekend, 2 = holiday
# conditions are checked in order; the first match wins
train['day_type'] = np.select(
    [train['workingday'] == 1, train['holiday'] == 0],
    [0, 1],
    default=2
)

In [None]:
# count vs day type
df_day = train.groupby(['year','day_type'])['count'].mean().reset_index()
sns.barplot(data=df_day,x='day_type',y='count');
anova_pvalue(train,'day_type','count')

#### Imputation: temperature

In [None]:
temp_imputer = SimpleImputer(strategy='mean')

In [None]:
temp_imputer.fit(train[['temp']])

In [None]:
train['temp'] = temp_imputer.transform(train[['temp']])

#### Imputation: windspeed

In [None]:
# impute zero values of windspeed as missing values
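One possible way to carry out this step is sketched below on a made-up column. It assumes that every zero in `windspeed` is a disguised missing value and that the median is an acceptable fill value. Both are assumptions, not choices stated in this notebook. The names `toy` and `wind_imputer` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy stand-in for train[['windspeed']]; zeros play the role of disguised missing values
toy = pd.DataFrame({'windspeed': [0.0, 6.0, 8.0, 0.0, 12.0, 16.0]})

# step 1: mark zeros as missing
toy['windspeed'] = toy['windspeed'].replace(0, np.nan)

# step 2: fill missing values with the median of the observed values
wind_imputer = SimpleImputer(strategy='median')
toy['windspeed'] = wind_imputer.fit_transform(toy[['windspeed']]).ravel()

print(toy['windspeed'].tolist())
```

In the notebook itself, the imputer would be fitted on `train` only and then reused to transform `test`, mirroring the treatment of `temp` above.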

## 5. Model Building

### 5.1 Feature Selection

### 5.2 Train Model
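This section is left open for the reader. As one possible starting point, the sketch below combines the preprocessing classes imported in section 1.1 into a single pipeline and fits a linear regression on the log-scale target. The data are synthetic, the feature lists `num_cols` and `cat_cols` are assumed rather than selected, and `OneHotEncoder` is an extra import that the notebook does not use.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PolynomialFeatures

rng = np.random.default_rng(42)
n = 500

# synthetic stand-in for the engineered training frame (all values are made up)
toy = pd.DataFrame({
    'hour': rng.integers(0, 24, size=n),
    'day_type': rng.integers(0, 3, size=n),
    'temp': rng.uniform(0, 40, size=n),
    'humidity': rng.uniform(0, 100, size=n),
})
toy.loc[toy.index[:10], 'temp'] = np.nan   # a few missing temperatures
toy['log_count'] = (
    1.0 + 0.05 * toy['temp'].fillna(20) + 0.1 * toy['hour']
    + rng.normal(0, 0.1, size=n)
)

num_cols = ['temp', 'humidity']   # assumed numerical features
cat_cols = ['hour', 'day_type']   # assumed categorical features

# numerical branch: impute, scale to [0, 1], then add degree-2 terms
num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])
model = Pipeline([
    ('prep', preprocessor),
    ('reg', LinearRegression()),
])

model.fit(toy[num_cols + cat_cols], toy['log_count'])
r2_train = model.score(toy[num_cols + cat_cols], toy['log_count'])
print(f'training R^2 on toy data: {r2_train:.3f}')
```

Keeping the imputation and scaling inside the pipeline means the same fitted transformations are applied to the test set when calling `model.predict`.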

### 5.3 Model Validation
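A common way to validate a model before touching the test set is k-fold cross-validation with `cross_val_score`, which appears as a commented-out import in section 1.1. The sketch below is a minimal example on synthetic data. The arrays `X` and `y`, the choice of five folds, and the RMSE scoring are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# synthetic features and log-scale target (made up for illustration)
X = rng.uniform(0, 1, size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=300)

# 5-fold cross-validation; scikit-learn returns the negated RMSE
# so that higher scores are always better
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring='neg_root_mean_squared_error'
)
rmse_per_fold = -scores

print(f'RMSE per fold: {np.round(rmse_per_fold, 3)}')
print(f'mean RMSE: {rmse_per_fold.mean():.3f}')
```

A full scikit-learn `Pipeline` can be passed in place of `LinearRegression()`, in which case the preprocessing is refitted inside each fold.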

## 6. Model Evaluation
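For a model trained on a log-transformed count, a natural metric is the root mean squared logarithmic error (RMSLE). The sketch below assumes RMSLE as the evaluation metric. The helper `rmsle` and the arrays are hypothetical and use made-up numbers, not results from this dataset.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original count scale."""
    y_true = np.asarray(y_true, dtype=float)
    # counts cannot be negative, so clip predictions at zero before taking logs
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# made-up test counts and log-scale predictions,
# as a model trained on log_count would produce
y_test = np.array([10, 50, 200, 0])
log_pred = np.log1p(np.array([12, 45, 180, 1]))

y_pred = np.expm1(log_pred)   # back to the count scale
print(f'RMSLE: {rmsle(y_test, y_pred):.3f}')
```

Before evaluating, the test set needs the same feature engineering as the training set (temporal columns, `day_type`, and imputers fitted on `train` only).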