## Bike Sharing Regression Assignment
Given the bike sharing dataset, we will build a regression model to predict the variable cnt, the total number of bikes rented on a given day.

#### Notebook sections

    1. Exploratory Data Analysis
    2. Data Preprocessing
    3. Model implementation
    4. Model assessment
    5. Final outcomes

In [None]:
#Importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

In [None]:
# A function to retrieve the data types, null counts and number of unique values for each column in a Pandas DataFrame

def get_metadata(df):
    columns = []
    dtypes = []
    nulls = []
    unique_count = []
    for (col, dtype, null_count) in zip(df.columns, df.dtypes, df.isnull().sum()):
        columns.append(col)
        dtypes.append(dtype)
        nulls.append(null_count)
        unique_count.append(df[col].nunique())
    
    data = {"column_name":columns,"data_type":dtypes, "null_count":nulls, "unique_count":unique_count}
    df_metadata = pd.DataFrame(data)
    return df_metadata

In [None]:
#Reading the data
df = pd.read_csv("data/day.csv")

### Exploratory Data Analysis
1. Changing data types where required
2. Dealing with null/missing values
3. Univariate analysis of numerical columns
4. Bivariate analysis of numerical columns
5. Univariate and bivariate analysis of categorical columns

In [None]:
df.head()

In [None]:
df_meta = get_metadata(df)
df_meta

The columns instant and dteday will be dropped. instant is only a row identifier and therefore has no bearing on the analysis, while dteday has already been split into its components (yr, mnth and weekday).
The columns casual and registered relate to our target variable through the equation casual + registered = cnt. Since neither count is available to us on the day for which we need to make the prediction, we cannot use them to build our model.
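The identity casual + registered = cnt can be checked before the columns are dropped. The sketch below assumes a small toy frame with made-up values and a hypothetical helper named `count_identity_holds`; in the notebook the same check would be run on df itself.

```python
import pandas as pd

def count_identity_holds(frame):
    """Return True when casual + registered equals cnt on every row."""
    return bool((frame["casual"] + frame["registered"]).eq(frame["cnt"]).all())

# Toy data with made-up values, used only to illustrate the check
toy = pd.DataFrame({
    "casual": [120, 80, 45],
    "registered": [900, 1100, 760],
    "cnt": [1020, 1180, 805],
})

identity_ok = count_identity_holds(toy)
print(identity_ok)
```

If the check passes on the real data, the two columns carry exactly the information in cnt, which is a second reason to keep them out of the feature set.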

In [None]:
cols_to_drop = ['instant','dteday','casual','registered']
df = df.drop(cols_to_drop, axis = 1)

The columns weathersit, weekday, season and mnth will be changed back to the string values to which they were originally mapped, because this makes the dummy variables created from them later easier to identify. yr has not been mapped back because it is binary. If data for more years is added, it will be mapped appropriately.
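One thing to keep in mind with this approach is that `Series.map` with a dictionary returns NaN for any value that is not a key of the dictionary. The sketch below uses a made-up series in which code 4 is deliberately left out of the mapping, to show how unmapped codes could be detected after the conversion.

```python
import pandas as pd

weather_mapping = {1: "Clear", 2: "Mist", 3: "Light_Snow"}

# Made-up codes for illustration; 4 is intentionally missing from the mapping
codes = pd.Series([1, 2, 3, 4], name="weathersit")
labels = codes.map(weather_mapping)

# Codes that had no entry in the mapping end up as NaN in labels
unmapped = codes[labels.isna()].unique().tolist()
print(unmapped)
```

Running the same check on each mapped column of df would confirm that no nulls were introduced by the mappings.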

In [None]:
weather_mapping = {1:'Clear', 2:'Mist', 3:'Light_Snow'}
df['weathersit'] = df['weathersit'].map(weather_mapping)

In [None]:
weekday_mapping = {0:"Sunday", 1:"Monday", 2:"Tuesday", 3:"Wednesday", 4:"Thursday", 5:"Friday", 6:"Saturday"}
df['weekday'] = df['weekday'].map(weekday_mapping)

In [None]:
season_mapping = {1:"spring", 2:"summer", 3:"fall", 4:"winter"}
df['season'] = df['season'].map(season_mapping)

In [None]:
month_mapping = {1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun", 7:"Jul", 8:"Aug", 9:"Sept", 10:"Oct", 11:"Nov", 12:"Dec"}
df['mnth'] = df['mnth'].map(month_mapping)

In [None]:
df.head()

In [None]:
df_meta = get_metadata(df)
df_meta

In [None]:
categorical = df_meta.loc[df_meta['unique_count'] <= 12, 'column_name'].to_list()
df_categorical = df[categorical]
df_categorical.head()

In [None]:
numerical = df_meta.loc[df_meta['unique_count'] > 12, 'column_name'].to_list()
df_numerical = df[numerical]
df_numerical.head()

In [None]:
sns.pairplot(df_numerical)
plt.show()

In [None]:
sns.heatmap(df_numerical.corr(), cmap = 'Greens', annot = True)
plt.show()

In [None]:
ohe = OneHotEncoder()

In [None]:
df_cat_meta = get_metadata(df_categorical)
df_cat_meta

In [None]:
columns_to_encode = df_cat_meta.loc[df_cat_meta['unique_count'] > 2, 'column_name'].to_list()
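# cnt is numerical, so it is normally absent from this list; guard the removal to avoid a ValueError
if 'cnt' in columns_to_encode:
    columns_to_encode.remove('cnt')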

In [None]:
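# dtype=int keeps the dummies numeric (recent pandas versions default to bool, which statsmodels OLS may reject)
dummy_cols = pd.get_dummies(df_categorical, drop_first=True, columns=columns_to_encode, dtype=int)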

In [None]:
dummy_cols.head()

In [None]:
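# cnt lives in df_numerical, so it is usually not present here; errors='ignore' avoids a KeyError
dummy_cols = dummy_cols.drop(columns='cnt', errors='ignore')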

In [None]:
df_encoded = pd.concat([df_numerical,dummy_cols],axis=1)

In [None]:
df_encoded.head()

In [None]:
df_meta = get_metadata(df_encoded)
df_meta

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(df_encoded.corr(), cmap='Blues',annot=True)
plt.show()

In [None]:
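# Fix the seed so the split is reproducible (the value 42 is an arbitrary choice)
df_train, df_test = train_test_split(df_encoded, test_size=0.3, random_state=42)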

In [None]:
scaler = MinMaxScaler()

In [None]:
columns_to_scale = df_meta.loc[df_meta['unique_count'] > 2, 'column_name'].to_list()
columns_to_scale

In [None]:
df_train[columns_to_scale] = scaler.fit_transform(df_train[columns_to_scale])

In [None]:
df_train.describe()

In [None]:
y_train = df_train['cnt']
cols = df_train.columns.to_list()
cols.remove('cnt')
X_train = df_train[cols]

In [None]:
X_train.head()

In [None]:
X_train_sm = sm.add_constant(X_train)

In [None]:
estimator = LinearRegression()

In [None]:
rfe = RFE(estimator, n_features_to_select=15)

In [None]:
rfe.fit(X_train, y_train)

In [None]:
# rfe.support_ is a boolean mask over the columns of X_train
selected_features = rfe.support_
selected_feature_indices = [idx for idx, selected in enumerate(selected_features) if selected]

print("Selected feature indices:")
print(selected_feature_indices)

In [None]:
selected_cols = [X_train.columns[idx] for idx in selected_feature_indices]
print(selected_cols)

In [None]:
X_train_sel = X_train[selected_cols]
X_train_sm = sm.add_constant(X_train_sel)

In [None]:
lr = sm.OLS(y_train,X_train_sm)
lr = lr.fit()

In [None]:
lr.summary()

In [None]:
vif = pd.DataFrame()
vif["Variable"] = X_train_sm.columns
# Compute each VIF on the same design matrix whose columns are listed above
vif["VIF"] = [variance_inflation_factor(X_train_sm.values, i) for i in range(X_train_sm.shape[1])]

In [None]:
vif

In [None]:
X_train_sm.drop('windspeed', axis = 1, inplace=True)