<a href="https://colab.research.google.com/github/MominAhmedShaikh/Ride-Sharing-Demand/blob/main/Ride_Sharing_Demand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing dependencies and importing libraries

### Importing Libraries

In [1]:
!pip install -U feature-engine -q
!pip install mlxtend -U -q

[K     |████████████████████████████████| 276 kB 29.1 MB/s 
[K     |████████████████████████████████| 1.3 MB 16.7 MB/s 
[?25h

In [2]:
from sklearn.preprocessing import MinMaxScaler,StandardScaler,RobustScaler,Normalizer
from sklearn.metrics import mean_squared_error,mean_absolute_error
from feature_engine.creation import MathematicalCombination,CombineWithReferenceFeature
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import train_test_split,GridSearchCV
from zipfile import ZipFile
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression,SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings
import plotly.express as px
from sklearn.decomposition import PCA
warnings.filterwarnings('ignore')
sns.set_style("whitegrid")

### Downloading dataset from kaggle

In [3]:
! pip install kaggle -q
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions download -c bike-sharing-demand

Downloading bike-sharing-demand.zip to /content
  0% 0.00/189k [00:00<?, ?B/s]
100% 189k/189k [00:00<00:00, 61.3MB/s]


# Function Definations

In [4]:
def bar_uni_plot(*args,col,df,rot):
  '''
  Description : This Function creates a bar plot with specific colors and highlight the bar with most values

  parameters : 
  * args : Takes labels for bars on xticks
    col  : Column from dataframe
    df   : Dataframe
    rot. : Rotation for xticks labels

  returns a fig in sns 
  '''
  fig = sns.countplot(df[col],
                      saturation = 1,  
                      palette=sns.color_palette(['#cfe2f3' if i != df[col].value_counts().max() else '#6fa8dc' for i in df[col].value_counts().sort_index() ]),
                      )
  fig.set_xticklabels(
    labels=[i for i in args], rotation=rot)
  return fig

In [5]:
def bar_plot(df,col,xticks):
  '''
  Description : This Function creates a bar plot with specific colors and highlight the bar with most values

  parameters : 
    col  : Column from dataframe
    df   : Dataframe
    xticks : xticks labels in a list

  returns a fig in plotly 

  '''
  fig = px.bar(df,
              x = [i for i in df[col].value_counts().index] ,
              y = [i for i in df[col].value_counts().values] ,
              width=500,
              height=500,
              color_discrete_sequence =[['blue']],
              labels= {'x':f'{col}', 'y': 'Count'},
              text_auto = True,
              opacity = 0.8,
              template='plotly_white'
       )
  fig.update_traces(textfont_size=12, textangle=0, textposition="auto")
  fig.update_layout(
      xaxis = dict(
          tickmode = 'array',
          tickvals = [i for i in df[col].value_counts().index],
          ticktext = xticks
      )
  )
  return fig

In [6]:
def actual_vs_predicted_plot(y_true,y_pred,display_upto = 50):

  '''
  Description : This Function creates line plot with marker 'o' for actual (y_true) vs predicted values (y_pred)

  parameters : 
    y_true : y_true from test dataframe
    y_pred  : Predicted value with dataframe
    display_upto  : Takes number of rows we want to predict upto. Default = 50 (int)

  returns a line chart in plotly 
  '''
  dataframe = pd.DataFrame(y_true).reset_index()
  dataframe.drop(['index'],axis = 'columns',inplace = True)
  dataframe['Predicted'] = y_pred
  dataframe.columns = ['Actual', 'Predicted']
  fig = px.line(data_frame=dataframe.iloc[:display_upto,:],y=[i for i in dataframe.columns],markers='o',height=400,width = 700)
  return fig

In [7]:
def FI_by_RF(x,feature_imp):
  '''
  Description : This Function creates a bar plot for feature importance produced by RandomForestAlgorithm

  parameters : 
  * args : Takes labels for bars on xticks
    col  : Column from dataframe
    x   : features in x when we splitted x
    feature_imp : Takes feature_importnace_ as an argument from RandomForestAlgorithm

  returns a fig in plotly
  '''
  importances = feature_imp
  fig = px.bar(x = importances , y = x.columns ,labels = {'y':'Features','x':'Importance'},height=500,width = 700)
  return fig

# Data Exploration

## Data Extraction and Reading

In [8]:
file_name = "/content/bike-sharing-demand.zip"
  
with ZipFile(file_name, 'r') as zips:
    zips.extractall()

## Reading

In [9]:
train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')

## Data Dictionary
- `datetime` - hourly date + timestamp  
- `season` -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
- `holiday` - whether the day is considered a holiday
- `workingday` - whether the day is neither a weekend nor holiday
- `weather`

  1. Clear, Few clouds, Partly cloudy, Partly cloudy
  2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  4. Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog 
- `temp` - temperature in Celsius
- `atemp` - "feels like" temperature in Celsius
- `humidity` - relative humidity
- `windspeed` - wind speed
- `casual` - number of non-registered user rentals initiated
- `registered` - number of registered user rentals initiated
- `count` - number of total rentals

## .head()

1. `Train` dataset has 12 columns.
 * `Assumption 1` - Datetime column looks different in both the datasets.
 In train dataset datetime starts with `2011-01-01 00:00:00`, meaning 1st january 2011

2. `Test` dataset has 9 columns.
 * casual, registered, count these columns are not present.
 * In test dataset datetime starts with `2011-01-20 00:00:00`, meaning 20th january 2011

In [10]:
train.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [11]:
test.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
0,2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027
1,2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,0.0
2,2011-01-20 02:00:00,1,0,1,1,10.66,13.635,56,0.0
3,2011-01-20 03:00:00,1,0,1,1,10.66,12.88,56,11.0014
4,2011-01-20 04:00:00,1,0,1,1,10.66,12.88,56,11.0014


## .shape()
- `Train` dataset has 10886 rows and 12 columns
- `Test` dataset has 6493 rows and 9 columns


In [12]:
train.shape , test.shape

((10886, 12), (6493, 9))

## .info()

- Count is our target column
- There are 4 discrete variable and 8 continuous variable
- We can reduce dtype of some column for fewer memory space but this is not needed here, as we have less amount of data.

***


> **`Types of variable`**

| Predictor | Target |
| :---:     | :---: | 
| datetime  | casual |
| season    | registered | 
| holiday   |  <mark> count </mark> | 
| workingday|  | 
| weather   |  |
| temp      |  |  
| atemp     |  |
| humidity  |  | 
| windspeed |  |     



***

> **`Data types`**

| variable | dtype | needs_conversion | convert into     |
| :---:    | :---: |  :---:           | :---:            |
| datetime | object|   1              | datetime         |
| season   | int64 | 1                | category/object  |
| holiday  | int64 | 1                | bool             |
|workingday|int64  | 1                | bool             |
| weather  | int64 | 1                |category/object   |
| temp     |float64| 0                |float32 or lesser |
| atemp    |float64| 0                |float32 or lesser |
| humidity | int64 | 0                |int32 or lesser   |
| windpeed | int64 | 0                |float32 or lesser |
| casual   | int64 | 0                |int32 or lesser   |
|registered| int64 | 0                |int32 or lesser   | 
| count    | int64 | 0                |int32 or lesser   |

***

> **`Variable Category`**

| Discrete | Continuous |
| :---:     | :---:     | 
| season    | registered | 
| holiday   |   count    | 
| workingday|  casual    | 
| weather   | datetime   |
|           | temp       |
|           | atemp      |
|           | windspeed  |
|           | humidity   |


In [13]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB


<!-- ### .info()
**Type of Variable**
1. Predictor (datetime, season,holiday, workingday, weather , temp , atemp , humidity , windspeed)
[count (8)]
2. Target (casual , registered , count)
[count (3)]
in this project we will use count.

**Data types**
1. datetime dtype(object) /// to_convert(datetime) ///Represents Dates

2. season dtype(int64) /// to_convert(category) /// seasons (nominal data)

3. holiday dtype(int64) /// to_convert(bool)

4. working day dtype(int64) /// to_convert(bool)

5. weather dtype(int64) /// to_convert(category) /// weather (nominal data)

6. temp dtype(float64) /// to_convert(float32 or lesser) /// temp (though it is float but it looks like nominal)

7. atemp dtype(float64) /// to_convert(float32 or lesser) /// atemp (though it is float but it looks like nominal)

8. humidity dtype(intt64) /// to_convert(int32 or lesser) /// humidity (though it is int but it looks like nominal)

9. windspeed dtype(float64) /// to_convert(float32 or lesser) /// windspeed (though it is float but it looks like nominal)

10. causual,registred,count /// to_convert(int32 or lesser) /// humidity (though it is int but it looks like nominal)

**Variable Category**

1. Categorical
- season , holiday , working day, weather,temp,atemp, humidity 
2. Continuous
- datetime ,windspeed ,casual ,registred ,count -->

In [14]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    6493 non-null   object 
 1   season      6493 non-null   int64  
 2   holiday     6493 non-null   int64  
 3   workingday  6493 non-null   int64  
 4   weather     6493 non-null   int64  
 5   temp        6493 non-null   float64
 6   atemp       6493 non-null   float64
 7   humidity    6493 non-null   int64  
 8   windspeed   6493 non-null   float64
dtypes: float64(3), int64(5), object(1)
memory usage: 456.7+ KB


## .describe()
- `Assumption 1 Proved:` - Top value in train's datetime column is `2011-01-01 00:00:00`.
- Top value in test's datetime column is `2011-01-20 00:00:00`.
- <mark>Count</mark>, `Registered,Casual` column's mean and standard devaition looks non-guassian we can prove this with z-score test for normality / Plot histogram or qqplot.
- Max values on <mark>Count</mark> looks an outlier as it may be impossible to have such large % of people booking for bike. Their may contain some marketing events hosted by companyor other event such as world evironment day because of that such % of people used bike sharing service.

In [15]:
train.describe(include = 'all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
datetime,10886.0,10886.0,2011-01-01 00:00:00,1.0,,,,,,,
season,10886.0,,,,2.506614,1.116174,1.0,2.0,3.0,4.0,4.0
holiday,10886.0,,,,0.028569,0.166599,0.0,0.0,0.0,0.0,1.0
workingday,10886.0,,,,0.680875,0.466159,0.0,0.0,1.0,1.0,1.0
weather,10886.0,,,,1.418427,0.633839,1.0,1.0,1.0,2.0,4.0
temp,10886.0,,,,20.23086,7.79159,0.82,13.94,20.5,26.24,41.0
atemp,10886.0,,,,23.655084,8.474601,0.76,16.665,24.24,31.06,45.455
humidity,10886.0,,,,61.88646,19.245033,0.0,47.0,62.0,77.0,100.0
windspeed,10886.0,,,,12.799395,8.164537,0.0,7.0015,12.998,16.9979,56.9969
casual,10886.0,,,,36.021955,49.960477,0.0,4.0,17.0,49.0,367.0


In [16]:
test.describe(include = 'all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
datetime,6493.0,6493.0,2011-01-20 00:00:00,1.0,,,,,,,
season,6493.0,,,,2.4933,1.091258,1.0,2.0,3.0,3.0,4.0
holiday,6493.0,,,,0.029108,0.168123,0.0,0.0,0.0,0.0,1.0
workingday,6493.0,,,,0.685815,0.464226,0.0,0.0,1.0,1.0,1.0
weather,6493.0,,,,1.436778,0.64839,1.0,1.0,1.0,2.0,4.0
temp,6493.0,,,,20.620607,8.059583,0.82,13.94,21.32,27.06,40.18
atemp,6493.0,,,,24.012865,8.782741,0.0,16.665,25.0,31.06,50.0
humidity,6493.0,,,,64.125212,19.293391,16.0,49.0,65.0,81.0,100.0
windspeed,6493.0,,,,12.631157,8.250151,0.0,7.0015,11.0014,16.9979,55.9986


## .isnull()
- There are `no Null values` in both train and test datasets

In [17]:
train.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64

In [18]:
test.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
dtype: int64

## .unique() & .nunique()
- Datetime column from both the datasets have almost equal no. of unique values as rows in these columns so we will have to remove this column after `feature extraction`.

In [19]:
train_nunique = {i : train[i].nunique() for i in train.columns}
train_nunique

{'datetime': 10886,
 'season': 4,
 'holiday': 2,
 'workingday': 2,
 'weather': 4,
 'temp': 49,
 'atemp': 60,
 'humidity': 89,
 'windspeed': 28,
 'casual': 309,
 'registered': 731,
 'count': 822}

In [20]:
test_nunique = {i : test[i].nunique() for i in test.columns}
test_nunique

{'datetime': 6493,
 'season': 4,
 'holiday': 2,
 'workingday': 2,
 'weather': 4,
 'temp': 49,
 'atemp': 65,
 'humidity': 79,
 'windspeed': 27}

In [21]:
for i in train.columns:
  print(i , '---',train[i].unique())

datetime --- ['2011-01-01 00:00:00' '2011-01-01 01:00:00' '2011-01-01 02:00:00' ...
 '2012-12-19 21:00:00' '2012-12-19 22:00:00' '2012-12-19 23:00:00']
season --- [1 2 3 4]
holiday --- [0 1]
workingday --- [0 1]
weather --- [1 2 3 4]
temp --- [ 9.84  9.02  8.2  13.12 15.58 14.76 17.22 18.86 18.04 16.4  13.94 12.3
 10.66  6.56  5.74  7.38  4.92 11.48  4.1   3.28  2.46 21.32 22.96 23.78
 24.6  19.68 22.14 20.5  27.06 26.24 25.42 27.88 28.7  30.34 31.16 29.52
 33.62 35.26 36.9  32.8  31.98 34.44 36.08 37.72 38.54  1.64  0.82 39.36
 41.  ]
atemp --- [14.395 13.635 12.88  17.425 19.695 16.665 21.21  22.725 21.97  20.455
 11.365 10.605  9.85   8.335  6.82   5.305  6.06   9.09  12.12   7.575
 15.91   3.03   3.79   4.545 15.15  18.18  25.    26.515 27.275 29.545
 23.485 25.76  31.06  30.305 24.24  18.94  31.82  32.575 33.335 28.79
 34.85  35.605 37.12  40.15  41.665 40.91  39.395 34.09  28.03  36.365
 37.88  42.425 43.94  38.635  1.515  0.76   2.275 43.18  44.695 45.455]
humidity --- [ 81  80 

In [22]:
for i in test.columns:
  print(i , '---',test[i].unique())

datetime --- ['2011-01-20 00:00:00' '2011-01-20 01:00:00' '2011-01-20 02:00:00' ...
 '2012-12-31 21:00:00' '2012-12-31 22:00:00' '2012-12-31 23:00:00']
season --- [1 2 3 4]
holiday --- [0 1]
workingday --- [1 0]
weather --- [1 2 3 4]
temp --- [10.66  9.84  9.02 11.48 12.3  13.12  8.2   6.56  5.74  4.92  4.1   3.28
  2.46  1.64  0.82  7.38 13.94 14.76 17.22 15.58 16.4  21.32 22.14 22.96
 18.86 18.04 19.68 20.5  23.78 25.42 27.06 28.7  30.34 31.16 27.88 24.6
 26.24 29.52 31.98 33.62 32.8  35.26 36.08 36.9  34.44 37.72 38.54 39.36
 40.18]
atemp --- [11.365 13.635 12.88  10.605 16.665 14.395 15.15  15.91  12.12   9.85
  9.09   8.335  7.575  6.06   6.82   5.305  3.79   1.515  2.275  0.
  0.76   3.03   4.545 17.425 18.18  21.21  19.695 20.455 25.    25.76
 26.515 22.725 21.97  23.485 24.24  27.275 30.305 31.06  32.575 33.335
 31.82  29.545 28.79  28.03  34.09  34.85  37.12  38.635 37.88  36.365
 35.605 40.15  39.395 41.665 40.91  42.425 43.18  44.695 46.21  45.455
 47.725 49.24  50.    43.94

## .duplicated()
- There are `no duplicate values` in both train and test datasets

In [23]:
train.duplicated().sum()

0

In [24]:
test.duplicated().sum()

0

# Data Preprocessing

### Datatype Conversion
- We have reduced the size of int64 and float64 to `int16 and float16` resspectively.
- We have changed types of object variable to `category` datatype.
- We have chnaged dtypes of some column contains 0/1 as `bool` resspectively for EDA.


In [25]:
train['datetime'] = pd.to_datetime(train['datetime'])
train['season'] = train['season'].astype('category')
train['holiday'] = train['holiday'].astype('bool')
train['workingday'] = train['workingday'].astype('bool')
train['weather'] = train['weather'].astype('category')
train['temp'] = train['temp'].astype('float16')
train['atemp'] = train['atemp'].astype('float16')
train['windspeed'] = train['windspeed'].astype('float16')
train['humidity'] = train['humidity'].astype('int16')
train['casual'] = train['casual'].astype('int16')
train['registered'] = train['registered'].astype('int16')
train['count'] = train['count'].astype('int16')

### Datetime column feature extraction
- We have extracted `Year, Quarters, Month, WeekType, Day, Hours` from `datetime` column.

In [26]:
train['Year'] = train['datetime'].dt.year
train['Month'] = train['datetime'].dt.month
train['Date'] = train['datetime'].dt.day
train['Hour'] = train['datetime'].dt.hour
train['Month_Name'] = train['datetime'].dt.month_name()
train['Day_Name'] = train['datetime'].dt.day_name()
train['Quarter_Of_Year'] = train['datetime'].dt.quarter
train['Week_Type'] = np.where(train['Day_Name'].isin(['Friday','Saturday','Sunday']),'Weekend','Weekday')

In [27]:
train.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,Year,Month,Date,Hour,Month_Name,Day_Name,Quarter_Of_Year,Week_Type
0,2011-01-01 00:00:00,1,False,False,1,9.84375,14.398438,81,0.0,3,13,16,2011,1,1,0,January,Saturday,1,Weekend
1,2011-01-01 01:00:00,1,False,False,1,9.023438,13.632812,80,0.0,8,32,40,2011,1,1,1,January,Saturday,1,Weekend
2,2011-01-01 02:00:00,1,False,False,1,9.023438,13.632812,80,0.0,5,27,32,2011,1,1,2,January,Saturday,1,Weekend
3,2011-01-01 03:00:00,1,False,False,1,9.84375,14.398438,75,0.0,3,10,13,2011,1,1,3,January,Saturday,1,Weekend
4,2011-01-01 04:00:00,1,False,False,1,9.84375,14.398438,75,0.0,0,1,1,2011,1,1,4,January,Saturday,1,Weekend


### Converting data types of Extracted Column

In [28]:
train['Year'] = train['Year'].astype('int16')
train['Month'] = train['Month'].astype('int16')
train['Hour'] = train['Hour'].astype('int16')
train['Month_Name'] = train['Month_Name'].astype('category')
train['Day_Name'] = train['Day_Name'].astype('category')
train['Date'] = train['Date'].astype('int16')
train['Quarter_Of_Year'] = train['Quarter_Of_Year'].astype('int16')
train['Week_Type'] = train['Week_Type'].astype('category')

In [29]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   datetime         10886 non-null  datetime64[ns]
 1   season           10886 non-null  category      
 2   holiday          10886 non-null  bool          
 3   workingday       10886 non-null  bool          
 4   weather          10886 non-null  category      
 5   temp             10886 non-null  float16       
 6   atemp            10886 non-null  float16       
 7   humidity         10886 non-null  int16         
 8   windspeed        10886 non-null  float16       
 9   casual           10886 non-null  int16         
 10  registered       10886 non-null  int16         
 11  count            10886 non-null  int16         
 12  Year             10886 non-null  int16         
 13  Month            10886 non-null  int16         
 14  Date             10886 non-null  int16

# Exploratory Data Analysis
- Helps understand Data well. And let's us analyze some data before making any conclusion.

In [30]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   datetime         10886 non-null  datetime64[ns]
 1   season           10886 non-null  category      
 2   holiday          10886 non-null  bool          
 3   workingday       10886 non-null  bool          
 4   weather          10886 non-null  category      
 5   temp             10886 non-null  float16       
 6   atemp            10886 non-null  float16       
 7   humidity         10886 non-null  int16         
 8   windspeed        10886 non-null  float16       
 9   casual           10886 non-null  int16         
 10  registered       10886 non-null  int16         
 11  count            10886 non-null  int16         
 12  Year             10886 non-null  int16         
 13  Month            10886 non-null  int16         
 14  Date             10886 non-null  int16

In [31]:
train.drop(columns = ['datetime'],inplace = True)

### Univariate Analysis
- We would use univariate data analysis for a `descriptive study` on how one characteristic or attribute varies or to examine how each characteristic or attribute varies before including that variable in a study with two or more variables.

#### Categorical Column
- We have taken columns with `category / bool` data type

##### Count Plot
- Count Plot show the counts of observations in each categorical bin using bars
- <b>Insights</b>
- least favored season is spring or season 1.
- People tend to book more bikes during clear weather conditions.
- Less prefered month was January and most favored months were May, June, July, August, December.
- Most of the bikes were booked on weekends as compare to weekday.
- About 50% of bikes were booked on non-Working day as compare to Working day. 

In [32]:
from plotly.subplots import make_subplots
fig1 = bar_plot(train,'season',['Winter','Fall','Summer','Spring'])
fig2 = bar_plot(train,'weather',['1','2','3','4'])
fig3 = bar_plot(train,'Month_Name',['August','December','July','June','May','November','October','April','September','February','March','January'])
fig4 = bar_plot(train,'Day_Name',['Saturday','Sunday','Thursday','Monday','Wednesday','Tuesday','Friday'])
fig5 = bar_plot(train,'Week_Type',['Weekday','Weekend'])
fig6 = bar_plot(train,'holiday',['No','Yes'])
fig7 = bar_plot(train,'workingday',['Yes','No'])

figure1_traces = [fig1["data"][trace] for trace in range(len(fig1["data"]))]
figure2_traces = [fig2["data"][trace] for trace in range(len(fig2["data"]))]
figure3_traces = [fig3["data"][trace] for trace in range(len(fig3["data"]))]
figure4_traces = [fig4["data"][trace] for trace in range(len(fig4["data"]))]
figure5_traces = [fig5["data"][trace] for trace in range(len(fig5["data"]))]
figure6_traces = [fig6["data"][trace] for trace in range(len(fig6["data"]))]
figure7_traces = [fig7["data"][trace] for trace in range(len(fig7["data"]))]

this_figure = make_subplots(rows = 4, cols = 2,row_titles=['Count']*4)
this_figure.update_layout(height = 1500, width = 1200, title_text = 'Univariate Categorical Columns Analysis', title_font_size = 25)

this_figure.update_xaxes(title_text="Season", row=1, col=1)
this_figure.update_xaxes(title_text="Weather", row=1, col=2)
this_figure.update_xaxes(title_text="Month Name", row=2, col=1)
this_figure.update_xaxes(title_text="Day Name", row=2, col=2)
this_figure.update_xaxes(title_text="Week type (Weekend/Weekday)", row=3, col=1)
this_figure.update_xaxes(title_text="Holiday (Yes/No)", row=3, col=2)
this_figure.update_xaxes(title_text="Working day (Yes/No)", row=4, col=1)



for traces in figure1_traces:
  this_figure.append_trace(traces, row = 1, col = 1)

for traces in figure2_traces:
  this_figure.append_trace(traces, row = 1, col = 2)

for traces in figure3_traces:
  this_figure.append_trace(traces, row = 2, col = 1)

for traces in figure4_traces:
    this_figure.append_trace(traces, row = 2, col = 2)

for traces in figure5_traces:
    this_figure.append_trace(traces, row = 3, col = 1)

for traces in figure6_traces:
    this_figure.append_trace(traces, row = 3, col = 2)

for traces in figure7_traces:
    this_figure.append_trace(traces, row = 4, col = 1)


this_figure.show()

##### Box plot
- Box Plots Enable us to quickly visualize descriptive stats and display outliers.
- <b>insights</b>
- Almost all the Box plots shows presence of outliers above upper fence.
- Season 1 or Spring has lower median relatively.
- Box plot of weather 4 is not displayed properly as it contains only one rare value.
- Median levels of Quarter 2 and 3 proves high usage of bikes.
- Generally Weekday contains more outliers.

In [33]:
cat_col = ['season','holiday','workingday','weather','Month_Name','Day_Name','Week_Type']

In [34]:
from plotly.subplots import make_subplots
fig1 = px.box(train, x='season',y='count')
fig2 = px.box(train, x='holiday',y='count')
fig3 = px.box(train, x='workingday',y='count')
fig4 = px.box(train, x='weather',y='count')
fig5 = px.box(train, x='Month_Name',y='count')
fig6 = px.box(train, x='Day_Name',y='count')
fig7 = px.box(train, x='Week_Type',y='count')

figure1_traces = [fig1["data"][trace] for trace in range(len(fig1["data"]))]
figure2_traces = [fig2["data"][trace] for trace in range(len(fig2["data"]))]
figure3_traces = [fig3["data"][trace] for trace in range(len(fig3["data"]))]
figure4_traces = [fig4["data"][trace] for trace in range(len(fig4["data"]))]
figure5_traces = [fig5["data"][trace] for trace in range(len(fig5["data"]))]
figure6_traces = [fig6["data"][trace] for trace in range(len(fig6["data"]))]
figure7_traces = [fig7["data"][trace] for trace in range(len(fig7["data"]))]

this_figure = make_subplots(rows = 4, cols = 2,row_titles=['Count']*4)
this_figure.update_layout(height = 1500, width = 1000, title_text = 'Univariate Categorical Columns Analysis (Box Plot)', title_font_size = 25)

this_figure.update_xaxes(title_text="Season", row=1, col=1)
this_figure.update_xaxes(title_text="holiday", row=1, col=2)
this_figure.update_xaxes(title_text="workingday", row=2, col=1)
this_figure.update_xaxes(title_text="weather", row=2, col=2)
this_figure.update_xaxes(title_text="Month_Name", row=3, col=1)
this_figure.update_xaxes(title_text="Day_Name", row=3, col=2)
this_figure.update_xaxes(title_text="Week_Type", row=4, col=1)



for traces in figure1_traces:
  this_figure.append_trace(traces, row = 1, col = 1)

for traces in figure2_traces:
  this_figure.append_trace(traces, row = 1, col = 2)

for traces in figure3_traces:
  this_figure.append_trace(traces, row = 2, col = 1)

for traces in figure4_traces:
    this_figure.append_trace(traces, row = 2, col = 2)

for traces in figure5_traces:
    this_figure.append_trace(traces, row = 3, col = 1)

for traces in figure6_traces:
    this_figure.append_trace(traces, row = 3, col = 2)

for traces in figure7_traces:
    this_figure.append_trace(traces, row = 4, col = 1)


this_figure.show()

#### Continuous Column
- We have taken columns with `int / float` data type

##### Scatter Matrix
- Scatter plot shows distribution of points and how dependent variables are correlated with independent variable.
- **insights**
- Point Clusters on temp and atemp shows they are positively correlated.
- Count with registered shows strong postive correlation.
- Count with casual shows weak postive correlation.

In [35]:
num_col = ['temp','atemp','humidity','windspeed','casual','registered','count']

In [36]:
fig = px.scatter_matrix(train,dimensions=[i for i in num_col])
fig.update_layout(
    title='Univariate Numerical Column Analysis (Scatter Matrix) ',
    width=1500,
    height=1500
)
fig.update_traces(diagonal_visible=False)
fig.show()

##### Density Plot
- Density plots are used to observe the distribution of a variable in a dataset. Density plots are a variation of Histograms.
- **insights**
- Count , Registered and Casual columns have right skewed distribution.
- It is required to perform feature transformation on these columns as most of the machine learning model require normally distributed dependent variable.
- Other Columns are slightly normally distributed.

In [37]:
import plotly.figure_factory as ff

def Density_Plot(df,col):
  x = df[col]
  hist_data = [x]
  group_labels = [f'{col}']
  fig = ff.create_distplot(hist_data, group_labels,show_rug=False,show_hist=True)
  fig.update_layout(width=500,height = 500,showlegend = False)
  fig.update_xaxes(title_text=f"{col}")
  fig.update_yaxes(title_text="Density")
  return fig

In [38]:
fig1 = Density_Plot(train,'temp')
fig2 = Density_Plot(train,'atemp')
fig3 = Density_Plot(train,'humidity')
fig4 = Density_Plot(train,'windspeed')
fig5 = Density_Plot(train,'casual')
fig6 = Density_Plot(train,'registered')
fig7 = Density_Plot(train,'count')

figure1_traces = [fig1["data"][trace] for trace in range(len(fig1["data"]))]
figure2_traces = [fig2["data"][trace] for trace in range(len(fig2["data"]))]
figure3_traces = [fig3["data"][trace] for trace in range(len(fig3["data"]))]
figure4_traces = [fig4["data"][trace] for trace in range(len(fig4["data"]))]
figure5_traces = [fig5["data"][trace] for trace in range(len(fig5["data"]))]
figure6_traces = [fig6["data"][trace] for trace in range(len(fig6["data"]))]
figure7_traces = [fig7["data"][trace] for trace in range(len(fig7["data"]))]

this_figure = make_subplots(rows = 3, cols = 3,row_titles=['Density']*4)
this_figure.update_layout(height = 1500, width = 1000, title_text = 'Univariate Continuous Variable Analysis (Density Plot)', title_font_size = 25)

this_figure.update_xaxes(title_text="temp", row=1, col=1)
this_figure.update_xaxes(title_text="atemp", row=1, col=2)
this_figure.update_xaxes(title_text="humidity", row=1, col=3)
this_figure.update_xaxes(title_text="windspeed", row=2, col=1)
this_figure.update_xaxes(title_text="casual", row=2, col=2)
this_figure.update_xaxes(title_text="registered", row=2, col=3)
this_figure.update_xaxes(title_text="count", row=3, col=1)



for traces in figure1_traces:
  this_figure.append_trace(traces, row = 1, col = 1)
for traces in figure2_traces:
  this_figure.append_trace(traces, row = 1, col = 2)

for traces in figure3_traces:
  this_figure.append_trace(traces, row = 1, col = 3)

for traces in figure4_traces:
    this_figure.append_trace(traces, row = 2, col = 1)

for traces in figure5_traces:
    this_figure.append_trace(traces, row = 2, col = 2)

for traces in figure6_traces:
    this_figure.append_trace(traces, row = 2, col = 3)

for traces in figure7_traces:
    this_figure.append_trace(traces, row = 3, col = 1)

this_figure.update_layout(showlegend=False)
this_figure.show()

##### Box Plot
- **Insights**
- Most of the values on continuous columns are clustered around IQR.

In [39]:
fig1 = px.box(train, y="temp",points="all",width=500,height=400)
fig2 = px.box(train, y="atemp",points="all",width=500,height=400)
fig3 = px.box(train, y="humidity",points="all",width=500,height=400)
fig4 = px.box(train, y="windspeed",points="all",width=500,height=400)
fig5 = px.box(train, y="casual",points="all",width=500,height=400)
fig6 = px.box(train, y="registered",points="all",width=500,height=400)
fig7 = px.box(train, y="count",points="all",width=500,height=400)

figure1_traces = [fig1["data"][trace] for trace in range(len(fig1["data"]))]
figure2_traces = [fig2["data"][trace] for trace in range(len(fig2["data"]))]
figure3_traces = [fig3["data"][trace] for trace in range(len(fig3["data"]))]
figure4_traces = [fig4["data"][trace] for trace in range(len(fig4["data"]))]
figure5_traces = [fig5["data"][trace] for trace in range(len(fig5["data"]))]
figure6_traces = [fig6["data"][trace] for trace in range(len(fig6["data"]))]
figure7_traces = [fig7["data"][trace] for trace in range(len(fig7["data"]))]

this_figure = make_subplots(rows = 3, cols = 3,row_titles=['Density']*4)
this_figure.update_layout(height = 1500, width = 1000, title_text = 'Univariate Continuous Variable Analysis (Box Plot)', title_font_size = 25)

this_figure.update_xaxes(title_text="temp", row=1, col=1)
this_figure.update_xaxes(title_text="atemp", row=1, col=2)
this_figure.update_xaxes(title_text="humidity", row=1, col=3)
this_figure.update_xaxes(title_text="windspeed", row=2, col=1)
this_figure.update_xaxes(title_text="casual", row=2, col=2)
this_figure.update_xaxes(title_text="registered", row=2, col=3)
this_figure.update_xaxes(title_text="count", row=3, col=1)



for traces in figure1_traces:
  this_figure.append_trace(traces, row = 1, col = 1)
for traces in figure2_traces:
  this_figure.append_trace(traces, row = 1, col = 2)

for traces in figure3_traces:
  this_figure.append_trace(traces, row = 1, col = 3)

for traces in figure4_traces:
    this_figure.append_trace(traces, row = 2, col = 1)

for traces in figure5_traces:
    this_figure.append_trace(traces, row = 2, col = 2)

for traces in figure6_traces:
    this_figure.append_trace(traces, row = 2, col = 3)

for traces in figure7_traces:
    this_figure.append_trace(traces, row = 3, col = 1)


this_figure.show()

### Bivariate Analysis
- We perform bivariate analysis to compare two features wrt each other.

#### Group By Analysis

##### Month vs Count, Registered, Casual
- June shows highest spike in count of rentals while january shows least value.
- June, July, August, September and October's average count is greater whole year.
- Company should increase workforce during these months.

In [40]:
monthly_avg = pd.DataFrame(train.groupby("Month")['count','registered','casual'].mean()).reset_index()
monthly_avg_fig = px.line(monthly_avg,x='Month',y=['count','registered','casual'],markers=True)
monthly_avg_fig.show()

##### Hour vs Count, Registered, Casual
- Morning 6 to 9 and evening 16 to 18 is most prefered time.

In [41]:
hourly_avg = pd.DataFrame(train.groupby('Hour')['count','registered','casual'].mean()).reset_index()
hourly_avg_fig = px.line(hourly_avg,x='Hour',y=['count','registered','casual'],markers=True)
hourly_avg_fig.show()

##### Day vs Count, Registered, Casual
- Saturday and Thursday shows highest mean count of rentals.

In [42]:
Weekly_avg = pd.DataFrame(train.groupby('Day_Name')['count','registered','casual'].mean()).reset_index()
Weekly_avg_fig = px.line(Weekly_avg,x='Day_Name',y=['count','registered','casual'],markers=True)
Weekly_avg_fig.show()

##### Quarter vs Count, Registered, Casual
- As expected Quarters 2 and 3 are outperforming Q1 and Q4

In [43]:
Quarterly_avg = pd.DataFrame(train.groupby('Quarter_Of_Year')['count','registered','casual'].mean()).reset_index()
Quarterly_avg_fig = px.line(Quarterly_avg,x='Quarter_Of_Year',y=['count','registered','casual'],markers=True)
Quarterly_avg_fig.show()

##### Day vs Count, Registered, Casual
- Normally distributed over 19 days of month.

In [44]:
Daily_avg = pd.DataFrame(train.groupby('Date')['count','registered','casual'].mean()).reset_index()
Daily_avg_fig = px.line(Daily_avg,x='Date',y=['count','registered','casual'],markers=True)
Daily_avg_fig.show()

##### Year vs Count, Registered, Casual
- Year 2012 shows around 65% spike in registered users and 48.27% increase in casual users as compare to that of year 2011


In [45]:
Yearly_avg = pd.DataFrame(train.groupby("Year")['count','registered','casual'].mean()).reset_index()
Yearly_avg_fig = px.bar(Yearly_avg,x='Year',y=['count','registered','casual'],barmode='group')
Yearly_avg_fig.show()

### Multivariate Analysis
- Multivariate analysis shows, how one or more varaiables are dependent to each other.

#### Correlation Matrix
- Correlation plot shows how variables are correlated with each other.
- **insights**
- We will only see which variables are correlated with our target variable (Count).
- temp and atemp shows signs of Multicollinearity.(We will only use one during model building).
- Holiday, Working day and day shows very less collinearity.(we will avoid this during model building after finalizing with feature selection technique)
- Humidity shows weak negative corr with count.
- Casual and registered shows strong positive correlation as Count column itself is combination of these features.
- Year, Month ,Hour, Quarter of Year shows weak positive correlation with Count.

In [46]:
x = list(train.corr().columns)
y = list(train.corr().index)
z = np.array(train.corr())

fig = ff.create_annotated_heatmap(
    z,
    x = x,
    y = y ,
    annotation_text = np.around(z, decimals=2),
    hoverinfo='z',
    colorscale='Blues',showscale = True
    )
fig.show()

#### Distribution Plot
- We will plot distribution of values with Rug plot below.
- Insights
- Count features contains most outliers followed by registered and casual.
- We will be required to remove these outliers before feature transformations.

In [47]:
hist_data = [train[i].values for i in num_col]
group_labels = [i for i in num_col]
fig1 = ff.create_distplot(hist_data, group_labels, curve_type='normal',show_rug=True,show_hist=False)
fig1.show()

# Model Building
- We will select <mark>Mean absolute error (MAE)</mark> as error metric. Which will give us absolute errors first and as a final metric mean of absolute error. We have taken this metric which makes it easy to understand the use of error becuase mean absolute error will give actual error.

- <mark>R2 score</mark> to check `Overfitting` and `Underfitting`. R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable


### Model Selection
- We will use six algorhithms viz. `RandomForestRegressor`, `DecisionTreeRegressor`, `LinearRegression`, `SGDRegressor`,`KNeighborsRegressor`, `SupportVectorRegressor (SVR)`.
- The algorithm which outperformed everyother algo in the list is `RandomForestRegressor` because random forest is ensemble technique which runs multiple decision tree and provide the most voted answer as an output.
- eg. Suppose you want to watch a web series on Netflix. Will you just log in to your account and watch the first webisode that pops up or will you browse a few web pages, compare the ratings and then make a decision. Yes. It’s highly likely that you will go for the second option and instead of making a direct conclusion you will consider other options as well.
- The <mark>MAE with random forest was ~25</mark> which means we have +25 or -25 error while predicting on train dataset

In [48]:
LE_features =  ['holiday','workingday','Year','Week_Type']
OHE_features =  ['season','weather','Day_Name','Quarter_Of_Year']
drop_features = ['casual','registered','count','Month_Name','atemp']

LE = LabelEncoder()
ONE = OneHotEncoder()

for cols in LE_features:
  train[cols] = LE.fit_transform(train[cols])

new_df = pd.get_dummies(train,columns=OHE_features)

#seperating x and y
x = new_df.drop(columns = drop_features)
y = new_df[['count']]
x.shape,y.shape

x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1)
x_train.shape,x_test.shape,y_train.shape,y_test.shape

# Scaling
scaler = MinMaxScaler()
x_train,x_test = scaler.fit_transform(x_train),scaler.fit_transform(x_test)

In [49]:
models=[RandomForestRegressor(),DecisionTreeRegressor(),LinearRegression(),SGDRegressor(),SVR(),KNeighborsRegressor()]
model_names=['RandomForestRegressor','DecisionTreeRegressor','LinearRegression','SGDRegressor','SVR','KNeighborsRegressor']
mae=[]
d={}
for model in range (len(models)):
    reg=models[model]
    reg.fit(x_train,y_train)
    y_pred=reg.predict(x_test)
    mae.append(mean_absolute_error(y_test,y_pred))
d={'Algorithm':model_names,'MAE':mae}   
mae_frame=pd.DataFrame(d)
mae_frame.sort_values(by = 'MAE',ascending = True)

Unnamed: 0,Algorithm,MAE
0,RandomForestRegressor,24.698688
1,DecisionTreeRegressor,35.792432
5,KNeighborsRegressor,86.332403
2,LinearRegression,105.701874
3,SGDRegressor,105.770668
4,SVR,115.228988


In [50]:
Random_Forest_Regressor = RandomForestRegressor(random_state = 1)
Random_Forest_Regressor.fit(x_train,y_train)
y_pred = Random_Forest_Regressor.predict(x_test)


no_of_estimator=[10,50,100,500]
params_dict={'n_estimators':no_of_estimator,'n_jobs':[-1],'max_features':["auto",'sqrt','log2']}
reg_rf=GridSearchCV(estimator=RandomForestRegressor(random_state=1),param_grid=params_dict,scoring='neg_mean_absolute_error')
reg_rf.fit(x_train,y_train)
pred=reg_rf.predict(x_test)
print(mean_absolute_error(y_test,y_pred))

24.853640705363702


### Model without Datetime  <mark>MAE (108.84)</mark>
- We will make this model as a baseline for our project, this model is just without datetime columns to see the imapact of this column alone.

In [51]:
train_df_copy = pd.read_csv('/content/train.csv')

In [52]:
train_df_copy.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [53]:
train_df_copy.drop(columns = ['datetime'],inplace = True)

In [54]:
#seperating x and y
x = train_df_copy.drop(columns = ['count','casual','registered'])
y = train_df_copy[['count']]
x.shape,y.shape

((10886, 8), (10886, 1))

In [55]:
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1)
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((8164, 8), (2722, 8), (8164, 1), (2722, 1))

In [56]:
Random_Forest_Regressor = RandomForestRegressor(random_state=1)
Random_Forest_Regressor.fit(x_train,y_train)
y_pred = np.round(Random_Forest_Regressor.predict(x_test))
print('MAE:  ',mean_absolute_error(y_true = y_test , y_pred = y_pred))
print('R2_Score:  ',r2_score(y_true = y_test , y_pred = y_pred))

MAE:   108.84202792064659
R2_Score:   0.305321403330012


In [57]:
Random_Forest_Regressor.feature_importances_

array([0.07027309, 0.00621917, 0.04231251, 0.05319853, 0.14065389,
       0.23546542, 0.26098355, 0.19089384])

In [58]:
FI_by_RF(x,Random_Forest_Regressor.feature_importances_)

In [59]:
actual_vs_predicted_plot(y_test,y_pred)

### Model with datetime extracted features without scaling <mark>MAE (24.36)</mark>
- Through this model we can conclude that datetime column have sugnificant impact on Count of users booking for ride sharing.

In [60]:
train_df_copy = pd.read_csv('/content/train.csv')

In [61]:
train_df_copy.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [62]:
train_df_copy.shape

(10886, 12)

In [63]:
train_df_copy['datetime'] = pd.to_datetime(train_df_copy['datetime'])
train_df_copy['Year'] = train_df_copy['datetime'].dt.year
train_df_copy['Month'] = train_df_copy['datetime'].dt.month
train_df_copy['Date'] = train_df_copy['datetime'].dt.day
train_df_copy['Hour'] = train_df_copy['datetime'].dt.hour
train_df_copy['Month_Name'] = train_df_copy['datetime'].dt.month_name()
train_df_copy['Day_Name'] = train_df_copy['datetime'].dt.day_name()
train_df_copy['Quarter_Of_Year'] = train_df_copy['datetime'].dt.quarter
train_df_copy['Week_Type'] = np.where(train_df_copy['Day_Name'].isin(['Friday','Saturday','Sunday']),'Weekend','Weekday')

In [64]:
train_df_copy.shape

(10886, 20)

In [65]:
train_df_copy.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,Year,Month,Date,Hour,Month_Name,Day_Name,Quarter_Of_Year,Week_Type
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,2011,1,1,0,January,Saturday,1,Weekend
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,2011,1,1,1,January,Saturday,1,Weekend


In [66]:
LE_features =  ['holiday','workingday','Year','Week_Type']
OHE_features =  ['season','weather','Day_Name','Quarter_Of_Year']
drop_features = ['datetime','casual','registered','count','Month_Name','atemp']

In [67]:
LE = LabelEncoder()
ONE = OneHotEncoder()

In [68]:
for cols in LE_features:
  train_df_copy[cols] = LE.fit_transform(train_df_copy[cols])

In [69]:
new_df = pd.get_dummies(train_df_copy,columns=OHE_features)

In [70]:
new_df.head(3)

Unnamed: 0,datetime,holiday,workingday,temp,atemp,humidity,windspeed,casual,registered,count,...,Day_Name_Monday,Day_Name_Saturday,Day_Name_Sunday,Day_Name_Thursday,Day_Name_Tuesday,Day_Name_Wednesday,Quarter_Of_Year_1,Quarter_Of_Year_2,Quarter_Of_Year_3,Quarter_Of_Year_4
0,2011-01-01 00:00:00,0,0,9.84,14.395,81,0.0,3,13,16,...,0,1,0,0,0,0,1,0,0,0
1,2011-01-01 01:00:00,0,0,9.02,13.635,80,0.0,8,32,40,...,0,1,0,0,0,0,1,0,0,0
2,2011-01-01 02:00:00,0,0,9.02,13.635,80,0.0,5,27,32,...,0,1,0,0,0,0,1,0,0,0


In [71]:
#seperating x and y
x = new_df.drop(columns = drop_features)
y = new_df[['count']]
x.shape,y.shape

((10886, 29), (10886, 1))

In [72]:
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1)
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((8164, 29), (2722, 29), (8164, 1), (2722, 1))

In [73]:
Random_Forest_Regressor = RandomForestRegressor(random_state = 1)
Random_Forest_Regressor.fit(x_train,y_train)
y_pred = Random_Forest_Regressor.predict(x_test)
print('MAE:  ',mean_absolute_error(y_true = y_test , y_pred = y_pred))
print('R2_Score:  ',r2_score(y_true = y_test , y_pred = y_pred))

MAE:   24.364393828067595
R2_Score:   0.9537931478236782


In [74]:
FI_by_RF(x,Random_Forest_Regressor.feature_importances_)

In [75]:
actual_vs_predicted_plot(y_test,y_pred)

In [76]:
sfs1 = SFS(RandomForestRegressor(), 
           k_features=8, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='neg_mean_absolute_error',
           cv=5)

sfs1 = sfs1.fit(x_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  29 out of  29 | elapsed:   23.3s finished

[2022-10-18 09:41:39] Features: 1/8 -- score: -86.60661437971514[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  28 out of  28 | elapsed:   39.8s finished

[2022-10-18 09:42:19] Features: 2/8 -- score: -70.84683482369822[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:   43.1s finished

[2022-10-18 09:43:02] Features: 3/8 -- score: -59.36022961680993[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 

### Model with datetime extracted features with scaling <mark>MAE (Min Max - 24.85 , Std - 24.45)</mark>
- Using `Standard scaler` instead of `Min Max Scaler` has little to no impact on our model.

In [77]:
train_df_copy = pd.read_csv('/content/train.csv')

In [78]:
train_df_copy['datetime'] = pd.to_datetime(train_df_copy['datetime'])
train_df_copy['Year'] = train_df_copy['datetime'].dt.year
train_df_copy['Month'] = train_df_copy['datetime'].dt.month
train_df_copy['Date'] = train_df_copy['datetime'].dt.day
train_df_copy['Hour'] = train_df_copy['datetime'].dt.hour
train_df_copy['Month_Name'] = train_df_copy['datetime'].dt.month_name()
train_df_copy['Day_Name'] = train_df_copy['datetime'].dt.day_name()
train_df_copy['Quarter_Of_Year'] = train_df_copy['datetime'].dt.quarter
train_df_copy['Week_Type'] = np.where(train_df_copy['Day_Name'].isin(['Friday','Saturday','Sunday']),'Weekend','Weekday')


train_df_copy['Year'] = train_df_copy['Year'].astype('int16')
train_df_copy['Month'] = train_df_copy['Month'].astype('int16')
train_df_copy['Hour'] = train_df_copy['Hour'].astype('int16')
train_df_copy['Month_Name'] = train_df_copy['Month_Name'].astype('category')
train_df_copy['Day_Name'] = train_df_copy['Day_Name'].astype('category')
train_df_copy['Date'] = train_df_copy['Date'].astype('int16')
train_df_copy['Quarter_Of_Year'] = train_df_copy['Quarter_Of_Year'].astype('int16')
train_df_copy['Week_Type'] = train_df_copy['Week_Type'].astype('category')

In [79]:
LE_features =  ['holiday','workingday','Year','Week_Type']
OHE_features =  ['season','weather','Day_Name','Quarter_Of_Year']
drop_features = ['datetime','casual','registered','count','Month_Name','atemp']

In [80]:
LE = LabelEncoder()
ONE = OneHotEncoder()
for cols in LE_features:
  train_df_copy[cols] = LE.fit_transform(train_df_copy[cols])
new_df = pd.get_dummies(train_df_copy,columns=OHE_features)

In [81]:
scalerrrr = [MinMaxScaler(),StandardScaler(),RobustScaler(),Normalizer()]
scaler_score_mae = {}
scaler_score_r2 = {}

for i in range (len(scalerrrr)):
  #seperating x and y
  x = new_df.drop(columns = drop_features)
  y = new_df[['count']]
  x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1)
  scaler = scalerrrr[i]
  x_train,x_test = scaler.fit_transform(x_train),scaler.fit_transform(x_test)
  Random_Forest_Regressor = RandomForestRegressor(random_state = 1)
  Random_Forest_Regressor.fit(x_train,y_train)
  y_pred = Random_Forest_Regressor.predict(x_test)
  scaler_score_mae[i] = mean_absolute_error(y_true = y_test , y_pred = y_pred)
  scaler_score_r2[i] = r2_score(y_true = y_test , y_pred = y_pred)

In [82]:
scaler_score_mae

{0: 24.853640705363702,
 1: 24.444819985304925,
 2: 58.795723732549604,
 3: 53.989614254224826}

### Model with datetime extracted features with scaling and <mark>PCA (n_components = 16)</mark> <mark>MAE (53.26)</mark>
- We have performed PCA with components from 1-30 and least MAE was seen at n_components --> 16

In [83]:
train_df_copy = pd.read_csv('/content/train.csv')

train_df_copy['datetime'] = pd.to_datetime(train_df_copy['datetime'])
train_df_copy['Year'] = train_df_copy['datetime'].dt.year
train_df_copy['Month'] = train_df_copy['datetime'].dt.month
train_df_copy['Date'] = train_df_copy['datetime'].dt.day
train_df_copy['Hour'] = train_df_copy['datetime'].dt.hour
train_df_copy['Month_Name'] = train_df_copy['datetime'].dt.month_name()
train_df_copy['Day_Name'] = train_df_copy['datetime'].dt.day_name()
train_df_copy['Quarter_Of_Year'] = train_df_copy['datetime'].dt.quarter
train_df_copy['Week_Type'] = np.where(train_df_copy['Day_Name'].isin(['Friday','Saturday','Sunday']),'Weekend','Weekday')


train_df_copy['Year'] = train_df_copy['Year'].astype('int16')
train_df_copy['Month'] = train_df_copy['Month'].astype('int16')
train_df_copy['Hour'] = train_df_copy['Hour'].astype('int16')
train_df_copy['Month_Name'] = train_df_copy['Month_Name'].astype('category')
train_df_copy['Day_Name'] = train_df_copy['Day_Name'].astype('category')
train_df_copy['Date'] = train_df_copy['Date'].astype('int16')
train_df_copy['Quarter_Of_Year'] = train_df_copy['Quarter_Of_Year'].astype('int16')
train_df_copy['Week_Type'] = train_df_copy['Week_Type'].astype('category')

LE_features =  ['holiday','workingday','Year','Week_Type']
OHE_features =  ['season','weather','Day_Name','Quarter_Of_Year']
drop_features = ['datetime','casual','registered','count','Month_Name','atemp']

LE = LabelEncoder()
ONE = OneHotEncoder()
for cols in LE_features:
  train_df_copy[cols] = LE.fit_transform(train_df_copy[cols])
new_df = pd.get_dummies(train_df_copy,columns=OHE_features)

In [84]:
pca_score_mae = {}
for i in range(1,30):
  x = new_df.drop(columns = drop_features)
  y = new_df[['count']]


  x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1)
  x_train.shape,x_test.shape,y_train.shape,y_test.shape

  scaler = MinMaxScaler()
  x_train,x_test = scaler.fit_transform(x_train),scaler.fit_transform(x_test)
  pca = PCA(n_components = i)
  pca = pca.fit(x_train,y_train)
  x_train_pca = pca.transform(x_train)
  x_test_pca = pca.transform(x_test)
  Random_Forest_Regressor = RandomForestRegressor()
  Random_Forest_Regressor.fit(x_train_pca,y_train)
  y_pred = Random_Forest_Regressor.predict(x_test_pca)
  pca_score_mae[i] = mean_absolute_error(y_true = y_test , y_pred = y_pred)

pca_score_mae

{1: 157.89283486407052,
 2: 127.22905584129316,
 3: 110.19389419544451,
 4: 101.7303159441587,
 5: 97.22423585598824,
 6: 93.06526818515799,
 7: 92.64345701689933,
 8: 90.39643277002205,
 9: 89.97662747979426,
 10: 89.21001836884643,
 11: 89.99135194709773,
 12: 89.64585598824394,
 13: 58.111818515797204,
 14: 57.25141072740631,
 15: 55.408324761205,
 16: 53.24391991182954,
 17: 54.28026818515798,
 18: 53.99081925055107,
 19: 54.36081190301249,
 20: 56.06856355620867,
 21: 55.87058780308596,
 22: 55.91572740631888,
 23: 55.93234753857458,
 24: 55.59859294636297,
 25: 55.55919544452608,
 26: 56.0228912564291,
 27: 55.90592946362968,
 28: 55.533868479059514,
 29: 55.93282880235121}

In [85]:
np.argmin(list(pca_score_mae.values())) #this value is 16

15

In [86]:
FI_by_RF(x,Random_Forest_Regressor.feature_importances_)

### Model with datetime extracted features with Scaling + k-clusters of numerical variables <mark>MAE (26.03)</mark>
- In this model we have used k-Nearest Bins for our numerical column as a part of feature engineering. Those columns include `temp`,`humidity` and `windspeed`.

In [87]:
train_df_copy = pd.read_csv('/content/train.csv')
train_df_copy.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [88]:
train_df_copy = pd.read_csv('/content/train.csv')

train_df_copy['datetime'] = pd.to_datetime(train_df_copy['datetime'])
train_df_copy['Year'] = train_df_copy['datetime'].dt.year
train_df_copy['Month'] = train_df_copy['datetime'].dt.month
train_df_copy['Date'] = train_df_copy['datetime'].dt.day
train_df_copy['Hour'] = train_df_copy['datetime'].dt.hour
train_df_copy['Month_Name'] = train_df_copy['datetime'].dt.month_name()
train_df_copy['Day_Name'] = train_df_copy['datetime'].dt.day_name()
train_df_copy['Quarter_Of_Year'] = train_df_copy['datetime'].dt.quarter
train_df_copy['Week_Type'] = np.where(train_df_copy['Day_Name'].isin(['Friday','Saturday','Sunday']),'Weekend','Weekday')


train_df_copy['Year'] = train_df_copy['Year'].astype('int16')
train_df_copy['Month'] = train_df_copy['Month'].astype('int16')
train_df_copy['Hour'] = train_df_copy['Hour'].astype('int16')
train_df_copy['Month_Name'] = train_df_copy['Month_Name'].astype('category')
train_df_copy['Day_Name'] = train_df_copy['Day_Name'].astype('category')
train_df_copy['Date'] = train_df_copy['Date'].astype('int16')
train_df_copy['Quarter_Of_Year'] = train_df_copy['Quarter_Of_Year'].astype('int16')
train_df_copy['Week_Type'] = train_df_copy['Week_Type'].astype('category')

LE_features =  ['Year','Week_Type','holiday','workingday']
OHE_features =  ['season','weather','Day_Name','Quarter_Of_Year']
drop_features = ['datetime','casual','registered','count','Month_Name','atemp','temp','humidity','windspeed']

LE = LabelEncoder()
ONE = OneHotEncoder()
for cols in LE_features:
  train_df_copy[cols] = LE.fit_transform(train_df_copy[cols])
new_df = pd.get_dummies(train_df_copy,columns=OHE_features)

from sklearn.cluster import KMeans
cluster_cols = ['temp','humidity','windspeed']
for i in cluster_cols:
  cluster_df = new_df[[i]]
  clusters_plus = KMeans(n_clusters=10).fit(cluster_df)
  new_df[i] = pd.DataFrame(clusters_plus.labels_)

x = new_df.drop(columns = drop_features)
y = new_df[['count']]


x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1)
x_train.shape,x_test.shape,y_train.shape,y_test.shape


scaler = MinMaxScaler()
x_train,x_test = scaler.fit_transform(x_train),scaler.fit_transform(x_test)

Random_Forest_Regressor = RandomForestRegressor(random_state = 1)
Random_Forest_Regressor.fit(x_train,y_train)
y_pred = Random_Forest_Regressor.predict(x_test)
print('MAE:  ',mean_absolute_error(y_true = y_test , y_pred = y_pred))
print('R2_Score:  ',r2_score(y_true = y_test , y_pred = y_pred))

MAE:   26.036914033798677
R2_Score:   0.9444715675632406


In [89]:
x.columns

Index(['holiday', 'workingday', 'Year', 'Month', 'Date', 'Hour', 'Week_Type',
       'season_1', 'season_2', 'season_3', 'season_4', 'weather_1',
       'weather_2', 'weather_3', 'weather_4', 'Day_Name_Friday',
       'Day_Name_Monday', 'Day_Name_Saturday', 'Day_Name_Sunday',
       'Day_Name_Thursday', 'Day_Name_Tuesday', 'Day_Name_Wednesday',
       'Quarter_Of_Year_1', 'Quarter_Of_Year_2', 'Quarter_Of_Year_3',
       'Quarter_Of_Year_4'],
      dtype='object')

In [90]:
actual_vs_predicted_plot(y_test,y_pred)

In [91]:
FI_by_RF(x,Random_Forest_Regressor.feature_importances_)

In [92]:
sfs1 = SFS(RandomForestRegressor(), 
           k_features=8, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='neg_mean_absolute_error',
           cv=5)

sfs1 = sfs1.fit(x_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed:   20.0s finished

[2022-10-18 09:59:58] Features: 1/8 -- score: -86.6166669535489[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:   33.1s finished

[2022-10-18 10:00:31] Features: 2/8 -- score: -70.84335209490015[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:   35.3s finished

[2022-10-18 10:01:07] Features: 3/8 -- score: -59.66488430341212[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 o

### Feature Engine MathematicalCombination <mark>MAE (28)</mark>
- We have used Open source library for Feature Generation (Feature Engine)
- We used our numerical variables `temp`,`humidity`,`windspeed` and we have used permutation and combination with there `sum`,`min`,`max`,`std`,`prod`,`mean`, and generated new columns.

In [93]:
train_df_copy = pd.read_csv('/content/train.csv')

train_df_copy['datetime'] = pd.to_datetime(train_df_copy['datetime'])
train_df_copy['Year'] = train_df_copy['datetime'].dt.year
train_df_copy['Month'] = train_df_copy['datetime'].dt.month
train_df_copy['Date'] = train_df_copy['datetime'].dt.day
train_df_copy['Hour'] = train_df_copy['datetime'].dt.hour
train_df_copy['Month_Name'] = train_df_copy['datetime'].dt.month_name()
train_df_copy['Day_Name'] = train_df_copy['datetime'].dt.day_name()
train_df_copy['Quarter_Of_Year'] = train_df_copy['datetime'].dt.quarter
train_df_copy['Week_Type'] = np.where(train_df_copy['Day_Name'].isin(['Friday','Saturday','Sunday']),'Weekend','Weekday')


train_df_copy['Year'] = train_df_copy['Year'].astype('int16')
train_df_copy['Month'] = train_df_copy['Month'].astype('int16')
train_df_copy['Hour'] = train_df_copy['Hour'].astype('int16')
train_df_copy['Month_Name'] = train_df_copy['Month_Name'].astype('category')
train_df_copy['Day_Name'] = train_df_copy['Day_Name'].astype('category')
train_df_copy['Date'] = train_df_copy['Date'].astype('int16')
train_df_copy['Quarter_Of_Year'] = train_df_copy['Quarter_Of_Year'].astype('int16')
train_df_copy['Week_Type'] = train_df_copy['Week_Type'].astype('category')

MF = MathematicalCombination(variables_to_combine=["temp","humidity",'windspeed'],math_operations = ["sum", "min", "max", "std","prod","mean"])

MF.fit(train_df_copy)

# Transform the data
train_df_copy = MF.transform(train_df_copy)

LE_features =  ['holiday','workingday','Year','Week_Type']
OHE_features =  ['season','weather','Day_Name','Quarter_Of_Year']
drop_features = ['datetime','casual','registered','count','Month_Name','atemp',"temp","humidity",'windspeed']

LE = LabelEncoder()
ONE = OneHotEncoder()
for cols in LE_features:
  train_df_copy[cols] = LE.fit_transform(train_df_copy[cols])
new_df = pd.get_dummies(train_df_copy,columns=OHE_features)

#seperating x and y
x = new_df.drop(columns = drop_features)
y = new_df[['count']]


x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1)
x_train.shape,x_test.shape,y_train.shape,y_test.shape


scaler = MinMaxScaler()
x_train,x_test = scaler.fit_transform(x_train),scaler.fit_transform(x_test)

Random_Forest_Regressor = RandomForestRegressor()
Random_Forest_Regressor.fit(x_train,y_train)
y_pred = Random_Forest_Regressor.predict(x_test)
print('MSE:  ',mean_absolute_error(y_true = y_test , y_pred = y_pred))
print('R2_Score:  ',r2_score(y_true = y_test , y_pred = y_pred))

MSE:   27.85634092578986
R2_Score:   0.9421906868449988


In [94]:
train_df_copy.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,...,Month_Name,Day_Name,Quarter_Of_Year,Week_Type,sum(temp-humidity-windspeed),min(temp-humidity-windspeed),max(temp-humidity-windspeed),std(temp-humidity-windspeed),prod(temp-humidity-windspeed),mean(temp-humidity-windspeed)
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,...,January,Saturday,1,1,90.84,0.0,81.0,44.199493,0.0,30.28
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,...,January,Saturday,1,1,89.02,0.0,80.0,43.816893,0.0,29.673333
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,...,January,Saturday,1,1,89.02,0.0,80.0,43.816893,0.0,29.673333
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,...,January,Saturday,1,1,84.84,0.0,75.0,40.758744,0.0,28.28
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,...,January,Saturday,1,1,84.84,0.0,75.0,40.758744,0.0,28.28


In [95]:
x.head()

Unnamed: 0,holiday,workingday,Year,Month,Date,Hour,Week_Type,sum(temp-humidity-windspeed),min(temp-humidity-windspeed),max(temp-humidity-windspeed),...,Day_Name_Monday,Day_Name_Saturday,Day_Name_Sunday,Day_Name_Thursday,Day_Name_Tuesday,Day_Name_Wednesday,Quarter_Of_Year_1,Quarter_Of_Year_2,Quarter_Of_Year_3,Quarter_Of_Year_4
0,0,0,0,1,1,0,1,90.84,0.0,81.0,...,0,1,0,0,0,0,1,0,0,0
1,0,0,0,1,1,1,1,89.02,0.0,80.0,...,0,1,0,0,0,0,1,0,0,0
2,0,0,0,1,1,2,1,89.02,0.0,80.0,...,0,1,0,0,0,0,1,0,0,0
3,0,0,0,1,1,3,1,84.84,0.0,75.0,...,0,1,0,0,0,0,1,0,0,0
4,0,0,0,1,1,4,1,84.84,0.0,75.0,...,0,1,0,0,0,0,1,0,0,0


In [96]:
FI_by_RF(x,Random_Forest_Regressor.feature_importances_)

In [97]:
actual_vs_predicted_plot(y_test,y_pred)

In [98]:
train_df_copy = pd.read_csv('/content/train.csv')

train_df_copy['datetime'] = pd.to_datetime(train_df_copy['datetime'])
train_df_copy['Year'] = train_df_copy['datetime'].dt.year
train_df_copy['Month'] = train_df_copy['datetime'].dt.month
train_df_copy['Date'] = train_df_copy['datetime'].dt.day
train_df_copy['Hour'] = train_df_copy['datetime'].dt.hour
train_df_copy['Month_Name'] = train_df_copy['datetime'].dt.month_name()
train_df_copy['Day_Name'] = train_df_copy['datetime'].dt.day_name()
train_df_copy['Quarter_Of_Year'] = train_df_copy['datetime'].dt.quarter
train_df_copy['Week_Type'] = np.where(train_df_copy['Day_Name'].isin(['Friday','Saturday','Sunday']),'Weekend','Weekday')


train_df_copy['Year'] = train_df_copy['Year'].astype('int16')
train_df_copy['Month'] = train_df_copy['Month'].astype('int16')
train_df_copy['Hour'] = train_df_copy['Hour'].astype('int16')
train_df_copy['Month_Name'] = train_df_copy['Month_Name'].astype('category')
train_df_copy['Day_Name'] = train_df_copy['Day_Name'].astype('category')
train_df_copy['Date'] = train_df_copy['Date'].astype('int16')
train_df_copy['Quarter_Of_Year'] = train_df_copy['Quarter_Of_Year'].astype('int16')
train_df_copy['Week_Type'] = train_df_copy['Week_Type'].astype('category')

RF = CombineWithReferenceFeature(variables_to_combine=["temp", "atemp","humidity",'windspeed'],reference_variables=["temp", "atemp","humidity",'windspeed'],operations = ["sub","add","mul"])

RF.fit(train_df_copy)

# Fit and transform 
train_df_copy = RF.transform(train_df_copy)

LE_features =  ['holiday','workingday','Year','Week_Type']
OHE_features =  ['season','weather','Day_Name','Quarter_Of_Year']
drop_features = ['datetime','casual','registered','count','Month_Name','atemp',"temp","humidity",'windspeed']

LE = LabelEncoder()
ONE = OneHotEncoder()
for cols in LE_features:
  train_df_copy[cols] = LE.fit_transform(train_df_copy[cols])
new_df = pd.get_dummies(train_df_copy,columns=OHE_features)

#seperating x and y
x = new_df.drop(columns = drop_features)
y = new_df[['count']]


x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1)
x_train.shape,x_test.shape,y_train.shape,y_test.shape


scaler = MinMaxScaler()
x_train,x_test = scaler.fit_transform(x_train),scaler.fit_transform(x_test)

Random_Forest_Regressor = RandomForestRegressor(random_state = 1)
Random_Forest_Regressor.fit(x_train,y_train)
y_pred = Random_Forest_Regressor.predict(x_test)
print('MSE:  ',mean_absolute_error(y_true = y_test , y_pred = y_pred))
print('R2_Score:  ',r2_score(y_true = y_test , y_pred = y_pred))

MSE:   27.989044819985303
R2_Score:   0.9427182715439031


In [99]:
FI_by_RF(x,Random_Forest_Regressor.feature_importances_)

In [100]:
actual_vs_predicted_plot(y_test,y_pred)

### Model with selected features <mark>MAE (~30)</mark>
- We have selected features based on `feature_importance_` method on RandomForestRegressor.
- Most significant features were `Hour` followed by `temp` > `Year` > `Month` > `working day` > `humidity`

In [101]:
train_df_copy = pd.read_csv('/content/train.csv')

In [102]:
train_df_copy['datetime'] = pd.to_datetime(train_df_copy['datetime'])
train_df_copy['Year'] = train_df_copy['datetime'].dt.year
train_df_copy['Month'] = train_df_copy['datetime'].dt.month
train_df_copy['Date'] = train_df_copy['datetime'].dt.day
train_df_copy['Hour'] = train_df_copy['datetime'].dt.hour
train_df_copy['Month_Name'] = train_df_copy['datetime'].dt.month_name()
train_df_copy['Day_Name'] = train_df_copy['datetime'].dt.day_name()
train_df_copy['Quarter_Of_Year'] = train_df_copy['datetime'].dt.quarter
train_df_copy['Week_Type'] = np.where(train_df_copy['Day_Name'].isin(['Friday','Saturday','Sunday']),'Weekend','Weekday')


In [103]:
train_df_copy['Year'] = train_df_copy['Year'].astype('int16')
train_df_copy['Month'] = train_df_copy['Month'].astype('int16')
train_df_copy['Hour'] = train_df_copy['Hour'].astype('int16')
train_df_copy['Month_Name'] = train_df_copy['Month_Name'].astype('category')
train_df_copy['Day_Name'] = train_df_copy['Day_Name'].astype('category')
train_df_copy['Date'] = train_df_copy['Date'].astype('int16')
train_df_copy['Quarter_Of_Year'] = train_df_copy['Quarter_Of_Year'].astype('int16')
train_df_copy['Week_Type'] = train_df_copy['Week_Type'].astype('category')


In [104]:
LE_features =  ['holiday','workingday','Year','Week_Type']
OHE_features =  ['season','weather','Day_Name','Quarter_Of_Year']

In [105]:
LE = LabelEncoder()
ONE = OneHotEncoder()
for cols in LE_features:
  train_df_copy[cols] = LE.fit_transform(train_df_copy[cols])
new_df = pd.get_dummies(train_df_copy,columns=OHE_features)

In [106]:
drop_features = [i for i in new_df if i not in ['Year','Month','Hour','workingday','temp','humidity']]

In [107]:
new_df.columns

Index(['datetime', 'holiday', 'workingday', 'temp', 'atemp', 'humidity',
       'windspeed', 'casual', 'registered', 'count', 'Year', 'Month', 'Date',
       'Hour', 'Month_Name', 'Week_Type', 'season_1', 'season_2', 'season_3',
       'season_4', 'weather_1', 'weather_2', 'weather_3', 'weather_4',
       'Day_Name_Friday', 'Day_Name_Monday', 'Day_Name_Saturday',
       'Day_Name_Sunday', 'Day_Name_Thursday', 'Day_Name_Tuesday',
       'Day_Name_Wednesday', 'Quarter_Of_Year_1', 'Quarter_Of_Year_2',
       'Quarter_Of_Year_3', 'Quarter_Of_Year_4'],
      dtype='object')

In [108]:
#seperating x and y
x = new_df.drop(columns = drop_features)
y = train_df_copy[['count']]

In [109]:
x.head()

Unnamed: 0,workingday,temp,humidity,Year,Month,Hour
0,0,9.84,81,0,1,0
1,0,9.02,80,0,1,1
2,0,9.02,80,0,1,2
3,0,9.84,75,0,1,3
4,0,9.84,75,0,1,4


In [110]:
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1)
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((8164, 6), (2722, 6), (8164, 1), (2722, 1))

In [111]:
x.head()

Unnamed: 0,workingday,temp,humidity,Year,Month,Hour
0,0,9.84,81,0,1,0
1,0,9.02,80,0,1,1
2,0,9.02,80,0,1,2
3,0,9.84,75,0,1,3
4,0,9.84,75,0,1,4


In [112]:
Random_Forest_Regressor = RandomForestRegressor(random_state=1)
Random_Forest_Regressor.fit(x_train,y_train)
y_pred = Random_Forest_Regressor.predict(x_test)
print('MSE:  ',mean_absolute_error(y_true = y_test , y_pred = y_pred))
print('R2_Score:  ',r2_score(y_true = y_test , y_pred = y_pred))

MSE:   29.288627011826037
R2_Score:   0.9314376140007726


In [113]:
FI_by_RF(x,Random_Forest_Regressor.feature_importances_)

In [114]:
actual_vs_predicted_plot(y_test,y_pred)

In [115]:
x_train.shape

(8164, 6)

In [116]:

sfs1 = SFS(RandomForestRegressor(), 
           k_features=6, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='neg_mean_absolute_error',
           cv=5)

sfs1 = sfs1.fit(x_train, y_train)


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    6.0s finished

[2022-10-18 10:08:46] Features: 1/6 -- score: -86.61581788332498[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    9.6s finished

[2022-10-18 10:08:56] Features: 2/6 -- score: -70.84742538867981[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   10.2s finished

[2022-10-18 10:09:06] Features: 3/6 -- score: -59.36799079493812[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 

In [117]:
pd.DataFrame.from_dict(sfs1.get_metric_dict()).T

Unnamed: 0,feature_idx,cv_scores,avg_score,feature_names,ci_bound,std_dev,std_err
1,"(5,)","[-90.18837057801568, -83.69087268932456, -87.0...",-86.615818,"(Hour,)",2.702342,2.102514,1.051257
2,"(0, 5)","[-73.31740552203459, -71.12764467190738, -70.9...",-70.847425,"(workingday, Hour)",2.287629,1.779853,0.889926
3,"(0, 1, 5)","[-58.86108371236819, -59.69706718976923, -61.5...",-59.367991,"(workingday, temp, Hour)",2.181293,1.69712,0.84856
4,"(0, 1, 3, 5)","[-42.880644545613734, -42.26957910512715, -44....",-42.457517,"(workingday, temp, Year, Hour)",1.945017,1.513289,0.756644
5,"(0, 1, 3, 4, 5)","[-36.714336994484725, -36.08126834635392, -35....",-35.530853,"(workingday, temp, Year, Month, Hour)",1.292707,1.00577,0.502885
6,"(0, 1, 2, 3, 4, 5)","[-31.691141180415826, -30.252302935973326, -30...",-30.294521,"(workingday, temp, humidity, Year, Month, Hour)",1.346855,1.047899,0.52395


# Summary
- We first used a model with and without datetime feature. The one with datetime feature was found to be more effective then without datetime feature.
- We can say that Min Max and Standard Scalers had almost same effect we can use either of the two for our model development.
- We found around 16 components or feature from our dataset to be effective for model. We have again cut down the features into 6 features which are the most important features.
- K-Nearest Neighbours for binning for numerical variables was not the best approach.
- On later stage, we used open source library (Feature Engine) for Feature Generation. It was found to be effective few feature like temp add atemp was one of the contributing feature for our model.
- Lastly, We made model with selected features with turned out to give around 30 MAE. Which can be useful in this case as we are not developing model for competition.
- Most significant features were `Hour` followed by `temp` > `Year` > `Month` > `working day` > `humidity`