<a href="https://www.kaggle.com/code/fakihakhan999/bike-sharing-eda-linear-regression?scriptVersionId=257490397" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Multiple Linear Regression
## Bike Sharing Assignment

#### Problem Statement:

A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, either for a price or for free. Many bike-share systems let people borrow a bike from a "dock", which is usually computer-controlled: the user enters payment information and the system unlocks the bike. The bike can then be returned to another dock belonging to the same system.


A US bike-sharing provider, BikeIndia, has recently suffered considerable dips in revenue due to the ongoing COVID-19 pandemic. The company is finding it very difficult to sustain itself in the current market. It has therefore decided to develop a considered business plan to accelerate its revenue as soon as the ongoing lockdown ends and the economy returns to a healthy state.


To that end, **BikeIndia** wants to understand the demand for shared bikes once the nationwide COVID-19 quarantine ends. The aim is to be ready to meet people's needs as the situation improves, stand out from other service providers, and make large profits.


They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

- Which variables are significant in predicting the demand for shared bikes.
- How well those variables describe the bike demand.

Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors.

#### Business Goal:

We are required to model the demand for shared bikes with the available independent variables. Management will use the model to understand how exactly demand varies with different features. They can then adjust the business strategy to meet demand levels and customers' expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

## Reading and Understanding the Data


In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("/kaggle/input/bike-sharing/day.csv")

In [None]:
df.head()

In [None]:
# Check for null values & data type
df.info()

In [None]:
num_col_Numbers=list(df.select_dtypes('number').columns)
print(f'The number type features in data are : \t {num_col_Numbers}')
num_col_object=list(df.select_dtypes('object').columns)
print(f'The object type features in data are : \t {num_col_object}')

In [None]:
for col in num_col_Numbers:
    print(f'\n{col} has {df[col].nunique()} Unique Values')
    if df[col].nunique()< 20:
        print(f'{df[col].unique()}')

`instant` is an index value and is not required for the analysis. `yr` is dropped as well, since the year is re-derived from `dteday` later in the notebook.

In [None]:
df.drop(['instant','yr'],axis=1,inplace = True)
num_col_Numbers=list(df.select_dtypes('number').columns)

# DATA QUALITY CHECK

In [None]:
# percentage of missing values in each column
missing_val=df.isnull().sum()
missing_prct=(missing_val/len(df))*100
missing_prct = missing_prct[missing_prct > 0]
print(f'\n Percentage of Missing values is : \n{missing_prct}')

In [None]:
df.drop_duplicates(subset=None, inplace=True)
print(f'Shape of Data Frame is {df.shape}')

In [None]:
color_list = ['#A4BD84', '#D3DC92', '#FDF1A8', '#FCAB92', '#B17C82']
plt.figure(figsize=(20,15))
# Figure-level title; plt.title would attach to a single axes instead
plt.suptitle('Distribution of Numerical Features')
inx=1
for col in num_col_Numbers:
    colori = color_list[(inx - 1) % len(color_list)]
    plt.subplot(4,4,inx)
    plt.hist(df[col],bins=20,color=colori,edgecolor='black')
    plt.xlabel(f'Data Distribution of {col}')
    plt.xticks(rotation=60)
    inx=inx+1

plt.tight_layout()
plt.show()

In [None]:
df.dteday = pd.to_datetime(df.dteday,format="%d-%m-%Y")
df['year'] = df.dteday.dt.year.astype(int)

In [None]:
df.head()

# Understanding Relationship between Features

In [None]:
# scale numerical columns into (0-1) range 
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
df[['temp', 'atemp','hum','windspeed','casual','registered','cnt']] = min_max_scaler.fit_transform(df[['temp', 'atemp','hum','windspeed','casual','registered','cnt']])
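
Min-max scaling maps each column to the 0-1 range via `(x - min) / (max - min)`. The sketch below checks that formula against `MinMaxScaler` on a small made-up frame; the `toy` frame and its values are hypothetical and are not taken from `day.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({"temp": [10.0, 20.0, 30.0], "hum": [40.0, 50.0, 80.0]})

# Manual min-max scaling, column by column: (x - min) / (max - min)
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

print(manual)
print(np.allclose(manual.to_numpy(), scaled))
```

A common alternative is to fit the scaler on the training split only and reuse it on the test split, so that test-set ranges do not influence the transformation.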

In [None]:
# Encode weekday, month, and season as sine/cosine pairs, since they are cyclical
df['day_sin'] = np.sin(2 * np.pi * df['weekday'] / 7)
df['day_cos'] = np.cos(2 * np.pi * df['weekday'] / 7)
df['month_sin'] = np.sin(2 * np.pi * df['mnth'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['mnth'] / 12)
df['season_sin'] = np.sin(2 * np.pi * df['season'] / 4)
df['season_cos'] = np.cos(2 * np.pi * df['season'] / 4)
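
The point of the sine/cosine encoding is that the last value of a cycle ends up next to the first. A minimal sketch, assuming a hypothetical helper `cyclical_encode` that is not part of the notebook, compares distances between encoded months:

```python
import numpy as np

def cyclical_encode(values, period):
    """Return (sin, cos) coordinates of values placed on a circle of the given period."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

months = np.array([1, 6, 12])  # January, June, December
m_sin, m_cos = cyclical_encode(months, 12)
points = np.column_stack([m_sin, m_cos])

# Euclidean distance between encoded months
dist_dec_jan = np.linalg.norm(points[2] - points[0])
dist_jun_jan = np.linalg.norm(points[1] - points[0])
print(dist_dec_jan < dist_jun_jan)
```

With the raw integers, December (12) is far from January (1); on the circle it is closer to January than June is.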

In [None]:
df.drop(['weekday','mnth','season','dteday'],axis=1,inplace=True)

In [None]:
df.info()

In [None]:
corrmat=df.corr()
plt.figure(figsize=(14,10))
sns.heatmap(
    corrmat,
    cmap=color_list,       # custom color list defined earlier
    annot=True,            # show correlation values
    fmt=".2f",             # 2 decimal places
    linewidths=0.5,        # line between cells
    cbar_kws={"shrink": 0.8}  # colorbar size
)

In [None]:
df.drop(['atemp'],axis=1,inplace=True)
df.columns

In [None]:
num_col_Numbers=list(df.select_dtypes('number').columns)
print(f'There are {len(num_col_Numbers)} features : \n {num_col_Numbers}')

In [None]:
color_list = ['#A4BD84', '#D3DC92', '#FDF1A8', '#FCAB92', '#B17C82']
plt.figure(figsize=(16,12))
# Figure-level title; plt.title would attach to a single axes instead
plt.suptitle('Distribution of Features')
inx=1
for col in num_col_Numbers:
    colori = color_list[(inx - 1) % len(color_list)]
    plt.subplot(4,4,inx)
    plt.hist(df[col],bins=20,color=colori,edgecolor='black')
    plt.xlabel(f'Data Distribution of {col}')
    plt.xticks(rotation=60)
    inx=inx+1

plt.tight_layout()
plt.show()

Only `year` and `weathersit` are not yet scaled. The next cells rescale `year` and the cyclical features to the 0-1 range and one-hot encode `weathersit`.

In [None]:
df[[ 'year', 'day_sin', 'day_cos', 'month_sin',
       'month_cos', 'season_sin', 'season_cos']] = min_max_scaler.fit_transform(df[[ 'year', 'day_sin', 'day_cos', 'month_sin',
       'month_cos', 'season_sin', 'season_cos']])

In [None]:
df = pd.get_dummies(df, columns=['weathersit'], prefix='weather')
print(df['year'].value_counts())
print(df.columns)
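
One-hot encoding with all levels kept produces dummy columns that always sum to 1 in each row, which makes them perfectly collinear with an intercept. A minimal sketch on a hypothetical `toy` frame (values are illustrative, not from `day.csv`) shows the `drop_first=True` option of `pd.get_dummies`, which drops one level as the baseline:

```python
import pandas as pd

# Hypothetical toy column with three weather categories
toy = pd.DataFrame({"weathersit": [1, 2, 3, 1, 2]})

# All levels kept: three dummy columns
full = pd.get_dummies(toy, columns=["weathersit"], prefix="weather")

# First level dropped: category 1 becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["weathersit"], prefix="weather", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Whether to drop a level depends on the model; for a plain linear regression with an intercept it is a common choice.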

In [None]:
print(f'{df.weather_1.value_counts()}\n{df.weather_2.value_counts()}\n{df.weather_3.value_counts()}')

In [None]:
num_col_Numbers=list(df.select_dtypes('number').columns)

In [None]:
# Min and max of the numeric columns only
numeric_df = df.select_dtypes(include='number')
summary = pd.DataFrame({
    "Min": numeric_df.min(),
    "Max": numeric_df.max()
})

print(summary)

## Correlation Matrix

In [None]:
# Check the correlation coefficients to see which variables are highly correlated.
# Only the variables retained in df for the analysis are considered here.

plt.figure(figsize = (25,20))
sns.heatmap(df.corr(), annot = True, cmap="RdBu")
plt.show()

In [None]:
color_list = ['#A4BD84', '#D3DC92', '#FDF1A8', '#FCAB92', '#B17C82']
plt.figure(figsize=(16,12))
# Figure-level title; plt.title would attach to a single axes instead
plt.suptitle('Relationship of features with target feature')
inx=1
for col in list(df.columns):
    colori = color_list[(inx - 1) % len(color_list)]
    plt.subplot(4,5,inx)
    sns.scatterplot(x=col,y='cnt',color=colori,data=df)
    inx=inx+1

plt.tight_layout()
plt.show()

In [None]:
# Suppose X contains only independent variables
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = df.drop(columns=['cnt'])  # drop target
X = X.astype(float)  # ensure all numeric

# Add constant for intercept
X_const = add_constant(X)

# Compute VIF
vif = pd.DataFrame()
vif["Feature"] = X_const.columns
vif["VIF"] = [variance_inflation_factor(X_const.values, i)
              for i in range(X_const.shape[1])]

print(vif)
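
The VIF of a feature is `1 / (1 - R²)`, where R² comes from regressing that feature on all the other features. The sketch below computes it by hand on synthetic data; `manual_vif` is a hypothetical helper written for illustration, and the columns `a`, `b`, `c` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def manual_vif(frame):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    vifs = {}
    for col in frame.columns:
        others = frame.drop(columns=[col])
        # LinearRegression fits an intercept by default
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# "c" is almost a copy of "a", so both should show a high VIF
toy = pd.DataFrame({"a": a, "b": b, "c": a + 0.1 * rng.normal(size=200)})

vifs = manual_vif(toy)
print(vifs)
```

Features that are close to linear combinations of others get a large VIF, while an unrelated feature such as `b` stays near 1.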

# BUILDING A LINEAR MODEL

In [None]:
x=df[['holiday','workingday','hum','windspeed','casual','year','day_sin','day_cos']]

In [None]:
y=df['cnt']
from sklearn.model_selection import train_test_split
x_train,x_test, y_train, y_test  = train_test_split(x,y, test_size = 0.2, random_state = 42)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def error_metrics(y_test,y_pred):
    """Print RMSE and R² for the test-set predictions."""
    print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
    print("R² test Score:", r2_score(y_test, y_pred))

In [None]:
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
ytrain_calc = regressor.predict(x_train)
trainr2=r2_score(y_train, ytrain_calc)
print(f'Training R2 score is : {trainr2}')
error_metrics(y_test,y_pred)

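The regressor performs well: the training and test R² scores are nearly equal (about 0.81 vs. 0.799), which suggests that the model is not overfitting.
Note that `casual` is one of the predictors and is a component of `cnt` (in this dataset, `cnt` = `casual` + `registered`), so part of the fit comes from the target itself.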
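
A single train/test split gives one estimate of the gap between training and test R². As a rough cross-check, k-fold cross-validation averages the score over several splits. The sketch below uses synthetic data (`X_syn`, `y_syn`, and the coefficients are assumptions made for illustration), not the bike-sharing features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic linear data with a little noise (illustrative only)
rng = np.random.default_rng(42)
X_syn = rng.uniform(size=(300, 3))
y_syn = X_syn @ np.array([0.5, -0.2, 0.3]) + 0.05 * rng.normal(size=300)

# Single split: compare training and test R²
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)

# 5-fold cross-validated R² on the full synthetic set
cv_r2 = cross_val_score(LinearRegression(), X_syn, y_syn, cv=5, scoring="r2")

print(f"train R2={train_r2:.3f}, test R2={test_r2:.3f}, CV mean={cv_r2.mean():.3f}")
```

The same pattern could be applied to `x` and `y` from the cells above; a small spread across folds would support the reading that the model generalizes.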