## Bike Sharing Regression Assignment
Given a bike-sharing dataset, we build a regression model to predict `cnt`, the total number of bikes rented on a given day.

#### Notebook sections

1. Exploratory Data Analysis
2. Feature selection
3. Model implementation
4. Model assessment
5. Final outcomes

In [1]:
#Importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder

In [2]:
# Return the data type, null count and number of unique values for each column in a pandas DataFrame

def get_metadata(df):
    # One row per column of df; each list below is in the same order as df.columns
    df_metadata = pd.DataFrame({
        "column_name": df.columns.to_list(),
        "data_type": df.dtypes.to_list(),
        "null_count": df.isnull().sum().to_list(),
        "unique_count": df.nunique().to_list(),
    })
    return df_metadata

In [3]:
#Reading the data
df = pd.read_csv("data/day.csv")

### Exploratory Data Analysis
1. Changing data types where required
2. Dealing with null/missing values
3. Univariate analysis of numerical columns
4. Bivariate analysis of numerical columns
5. Univariate and bivariate analysis of categorical columns

In [4]:
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,01-01-2018,1,0,1,0,6,0,2,14.110847,18.18125,80.5833,10.749882,331,654,985
1,2,02-01-2018,1,0,1,0,0,0,2,14.902598,17.68695,69.6087,16.652113,131,670,801
2,3,03-01-2018,1,0,1,0,1,1,1,8.050924,9.47025,43.7273,16.636703,120,1229,1349
3,4,04-01-2018,1,0,1,0,2,1,1,8.2,10.6061,59.0435,10.739832,108,1454,1562
4,5,05-01-2018,1,0,1,0,3,1,1,9.305237,11.4635,43.6957,12.5223,82,1518,1600


In [5]:
df['dteday'] = pd.to_datetime(df['dteday'], format='%d-%m-%Y')
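
The explicit `format='%d-%m-%Y'` matters because the dates are written day first (the first rows run `01-01-2018`, `02-01-2018`, `03-01-2018`). A minimal sketch on a few hypothetical date strings, showing that with this format the first field is read as the day:

```python
import pandas as pd

# Hypothetical day-first date strings in the same layout as the dteday column
dates = pd.Series(["01-01-2018", "02-01-2018", "13-01-2018"])

# With an explicit format, the first field is always parsed as the day
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

days = parsed.dt.day.tolist()
months = parsed.dt.month.tolist()
print(days, months)
```

Every parsed date falls in January, which is what consecutive daily records should look like.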

In [6]:
df_meta = get_metadata(df)
df_meta

Unnamed: 0,column_name,data_type,null_count,unique_count
0,instant,int64,0,730
1,dteday,datetime64[ns],0,730
2,season,int64,0,4
3,yr,int64,0,2
4,mnth,int64,0,12
5,holiday,int64,0,2
6,weekday,int64,0,7
7,workingday,int64,0,2
8,weathersit,int64,0,3
9,temp,float64,0,498


In [7]:
categorical = df_meta[df_meta['unique_count'] <= 12]
categorical = categorical['column_name'].to_list()
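
The cell above treats any column with at most 12 distinct values as categorical. The same heuristic can be sketched on a toy frame; the helper name `split_by_cardinality` and the toy values are hypothetical, not part of the notebook:

```python
import pandas as pd

def split_by_cardinality(frame, max_unique=12):
    """Split column names into (categorical, other) by number of distinct values."""
    counts = frame.nunique()
    categorical_cols = counts[counts <= max_unique].index.to_list()
    other_cols = [c for c in frame.columns if c not in categorical_cols]
    return categorical_cols, other_cols

# Toy data: 'season' has 4 distinct values, 'temp' has 14
toy = pd.DataFrame({
    "season": [1, 1, 2, 2, 3, 3, 4, 4, 1, 2, 3, 4, 1, 2],
    "temp": [float(i) + 0.5 for i in range(14)],
})

cat_cols, other_cols = split_by_cardinality(toy)
print(cat_cols, other_cols)
```

A threshold like this is only a rule of thumb: a numeric column that happens to have few distinct values would also be flagged, so the resulting list is worth reviewing by eye.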

In [8]:
numerical = [x for x in df.columns if x not in categorical]
numerical

['instant',
 'dteday',
 'temp',
 'atemp',
 'hum',
 'windspeed',
 'casual',
 'registered',
 'cnt']

In [9]:
numerical.remove('instant')
numerical.remove('dteday')
numerical

['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']

In [None]:
df_categorical = df[categorical]
df_numerical = df[numerical]

In [None]:
df_numerical.head()

In [None]:
# Cast the categorical columns to strings so they are treated as labels rather than numbers
for col in categorical:
    df[col] = df[col].astype(str)

In [None]:
df_meta = get_metadata(df)
df_meta

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df_numerical.corr(), cmap='YlGnBu', annot = True)
plt.show()
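
In the five rows printed by `df.head()`, `casual + registered` equals `cnt`. If that holds for the whole dataset (an assumption to verify on the full frame), both columns will correlate strongly with `cnt` in the heatmap and would leak the target if kept as predictors. A quick check, sketched here on those five rows only:

```python
import pandas as pd

# The five rows shown in the df.head() output above
sample = pd.DataFrame({
    "casual": [331, 131, 120, 108, 82],
    "registered": [654, 670, 1229, 1454, 1518],
    "cnt": [985, 801, 1349, 1562, 1600],
})

# True only if cnt is exactly the sum of the two rider counts in every row
is_sum = bool((sample["casual"] + sample["registered"]).eq(sample["cnt"]).all())
print(is_sum)
```

Running the same comparison on `df` itself would confirm whether the relationship holds for every day.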

In [None]:
df_pivot = df.pivot_table(values = 'cnt', index = 'season', columns = 'yr', aggfunc = 'mean')

In [None]:
# Year-over-year percent change in mean cnt per season (yr '1' relative to yr '0')
df_pivot['perc_change'] = (df_pivot['1'] / df_pivot['0'] - 1) * 100

In [None]:
df_pivot
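
A ratio expressed in percent and a percent change differ by exactly 100 points, which is easy to mix up when naming a column. A minimal sketch on hypothetical counts (the numbers are made up for illustration; the column labels are strings, as they are after the `astype(str)` cast):

```python
import pandas as pd

# Hypothetical mean counts for two seasons across two years
toy = pd.DataFrame({
    "season": ["1", "1", "2", "2"],
    "yr": ["0", "1", "0", "1"],
    "cnt": [100, 150, 200, 250],
})

pivot = toy.pivot_table(values="cnt", index="season", columns="yr", aggfunc="mean")

# Second year as a percentage of the first year
pivot["ratio_pct"] = pivot["1"] / pivot["0"] * 100
# Growth from the first year to the second, in percent
pivot["perc_change"] = (pivot["1"] / pivot["0"] - 1) * 100

print(pivot)
```

For season `1`, the second-year mean is 150% of the first-year mean, which corresponds to a 50% increase.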

In [None]:
#plt.figure(figsize=(20,20))
#plt.plot(df['dteday'], df['temp'], label = 'temp')
#plt.plot(df['dteday'], df['atemp'], label = 'atemp')
#plt.plot(df['dteday'], df['cnt'], label = 'cnt')
#plt.legend()
#plt.show()