### Goal

- EASY: ~1.1
- MEDIUM: <0.9
- MEDIUM-HARD: <0.7
- HARD: <0.5
- VERY HARD: <0.45
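
The thresholds above do not name a metric. Assuming they refer to RMSLE (root mean squared logarithmic error), which the Kaggle Bike Sharing Demand competition uses for scoring, a minimal sketch of the computation could look like this. The helper name `rmsle` and the predicted values are made up for illustration; the actual values are the first five counts of the training data.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so zero counts do not cause log(0)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

actual = [16, 40, 32, 13, 1]      # first five counts in train.csv
predicted = [20, 35, 30, 10, 2]   # hypothetical model output
score = rmsle(actual, predicted)
print(f"RMSLE: {score:.3f}")
```

Because the error is taken on the log scale, being off by 10 rentals at a quiet hour costs more than being off by 10 at a busy hour.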

### Data Fields

- datetime - hourly date + timestamp  
- season -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather 
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals
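
The last three fields are related: `count` is described as the total, so it should equal `casual + registered`. A quick sanity check, sketched here on the first five rows of the training data (values copied from the `train.head()` output further down), might be:

```python
import pandas as pd

# First five rows of train.csv, rental columns only
head = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

# count should be the sum of the two user groups in every row
consistent = bool((head["casual"] + head["registered"] == head["count"]).all())
print(consistent)
```

The same expression can be run on the full `train` DataFrame once it is loaded.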

## Importing packages

In [1]:
# data analysis stack
import numpy as np
import pandas as pd
import datetime as dt
from datetime import datetime, timedelta

# data visualization stack
from matplotlib import pyplot
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

# machine learning stack
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    MinMaxScaler
)
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

In [2]:
train = pd.read_csv('train.csv')
kaggle_test = pd.read_csv('test.csv')

In [3]:
# parse_dates=True would only try to parse the index; name the column to parse it
train = pd.read_csv('train.csv', parse_dates=['datetime'])
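
In `pd.read_csv`, `parse_dates=True` asks pandas to try parsing the index only, so a regular `datetime` column stays as text unless it is named explicitly. The sketch below illustrates the difference on a two-row in-memory CSV; the variable names are illustrative and the reduced column set is an assumption made to keep the example short.

```python
import io
import pandas as pd

# Two rows in the same layout as train.csv, with most columns left out
csv_text = (
    "datetime,season,count\n"
    "2011-01-01 00:00:00,1,16\n"
    "2011-01-01 01:00:00,1,40\n"
)

# parse_dates=True only tries the index, so the column is left unparsed
as_text = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# Naming the column converts it to datetime64 on load
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"])

text_is_datetime = pd.api.types.is_datetime64_any_dtype(as_text["datetime"])
dates_is_datetime = pd.api.types.is_datetime64_any_dtype(as_dates["datetime"])
print(text_is_datetime, dates_is_datetime)
```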

In [4]:
train.shape

(10886, 12)

In [5]:
train.head()

,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [6]:
train.tail()

,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.91,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129
10885,2012-12-19 23:00:00,4,0,1,1,13.12,16.665,66,8.9981,4,84,88


In [7]:
train.describe()

,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
count,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0,10886.0
mean,2.506614,0.028569,0.680875,1.418427,20.23086,23.655084,61.88646,12.799395,36.021955,155.552177,191.574132
std,1.116174,0.166599,0.466159,0.633839,7.79159,8.474601,19.245033,8.164537,49.960477,151.039033,181.144454
min,1.0,0.0,0.0,1.0,0.82,0.76,0.0,0.0,0.0,0.0,1.0
25%,2.0,0.0,0.0,1.0,13.94,16.665,47.0,7.0015,4.0,36.0,42.0
50%,3.0,0.0,1.0,1.0,20.5,24.24,62.0,12.998,17.0,118.0,145.0
75%,4.0,0.0,1.0,2.0,26.24,31.06,77.0,16.9979,49.0,222.0,284.0
max,4.0,1.0,1.0,4.0,41.0,45.455,100.0,56.9969,367.0,886.0,977.0


### EDA

Define the categorical and numerical features.

In [8]:
num_features = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']
cat_features = ['season', 'holiday', 'workingday', 'weather']
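
Note that `num_features` includes `count` (the quantity to predict) and `casual` and `registered`, which add up to it. For EDA that is fine, but as model inputs they would leak the target. One possible way to derive the input columns, with `target`, `leaky`, and `model_features` as illustrative names not used elsewhere in this notebook:

```python
num_features = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']
cat_features = ['season', 'holiday', 'workingday', 'weather']

target = 'count'
# casual + registered = count, so all three are kept out of the inputs
leaky = {'casual', 'registered', target}

model_num_features = [col for col in num_features if col not in leaky]
model_features = model_num_features + cat_features
print(model_features)
```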

## Project Milestone: Create time-related features


In [9]:
# convert the datetime column to timestamps
ts = pd.to_datetime(train['datetime']) 

Extract features such as the hour, month, and weekday from the datetime column.

In [10]:
ts

0       2011-01-01 00:00:00
1       2011-01-01 01:00:00
2       2011-01-01 02:00:00
3       2011-01-01 03:00:00
4       2011-01-01 04:00:00
                ...        
10881   2012-12-19 19:00:00
10882   2012-12-19 20:00:00
10883   2012-12-19 21:00:00
10884   2012-12-19 22:00:00
10885   2012-12-19 23:00:00
Name: datetime, Length: 10886, dtype: datetime64[ns]

In [11]:
ts.dt.year

0        2011
1        2011
2        2011
3        2011
4        2011
         ... 
10881    2012
10882    2012
10883    2012
10884    2012
10885    2012
Name: datetime, Length: 10886, dtype: int64

In [12]:
ts.dt.month

0         1
1         1
2         1
3         1
4         1
         ..
10881    12
10882    12
10883    12
10884    12
10885    12
Name: datetime, Length: 10886, dtype: int64

In [13]:
ts.dt.month_name()


0         January
1         January
2         January
3         January
4         January
           ...   
10881    December
10882    December
10883    December
10884    December
10885    December
Name: datetime, Length: 10886, dtype: object

In [14]:
train['month'] = ts.dt.month_name()

In [15]:
train

,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,month
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16,January
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40,January
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32,January
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13,January
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1,January
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336,December
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241,December
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168,December
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129,December


In [16]:
ts.dt.day

0         1
1         1
2         1
3         1
4         1
         ..
10881    19
10882    19
10883    19
10884    19
10885    19
Name: datetime, Length: 10886, dtype: int64

In [17]:
train['weekday'] = ts.dt.day_name()
train

,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,month,weekday
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16,January,Saturday
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40,January,Saturday
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32,January,Saturday
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13,January,Saturday
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1,January,Saturday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336,December,Wednesday
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241,December,Wednesday
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168,December,Wednesday
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129,December,Wednesday
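
The `.dt` calls explored above can be collected into one helper so the same features are created for both the training and test data. This is a minimal sketch; `add_time_features` is a hypothetical name, and the choice of columns (including `hour`, which the cells above do not create) is an assumption.

```python
import pandas as pd

def add_time_features(df, column="datetime"):
    """Return a copy of df with calendar features derived from `column`."""
    out = df.copy()
    ts = pd.to_datetime(out[column])
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month_name()
    out["day"] = ts.dt.day
    out["weekday"] = ts.dt.day_name()
    out["hour"] = ts.dt.hour
    return out

# First and last rows of train.csv, timestamp and count only
sample = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2012-12-19 23:00:00"],
    "count": [16, 88],
})
features = add_time_features(sample)
print(features)
```

Working on a copy keeps the original DataFrame unchanged, so the helper can be re-run without stacking duplicate columns.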


Plot small sections of the data (one day, one week, etc.).

In [18]:
train.set_index('datetime')  # returns a new DataFrame; train itself is unchanged

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,month,weekday
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16,January,Saturday
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40,January,Saturday
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32,January,Saturday
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13,January,Saturday
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1,January,Saturday
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336,December,Wednesday
2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241,December,Wednesday
2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168,December,Wednesday
2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129,December,Wednesday
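
`set_index` returns a new DataFrame rather than modifying the one it is called on, so the result has to be assigned to be kept. A small sketch of the difference, using two rows of the training data and illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-01 00:00:00", "2011-01-01 01:00:00"]),
    "count": [16, 40],
})

df.set_index("datetime")            # result is discarded; df is unchanged
still_a_column = "datetime" in df.columns

indexed = df.set_index("datetime")  # keep the result to use the new index
has_datetime_index = isinstance(indexed.index, pd.DatetimeIndex)
print(still_a_column, has_datetime_index)
```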


In [19]:
train['ts'] = pd.to_datetime(train['datetime']) 

In [20]:
train

,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,month,weekday,ts
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16,January,Saturday,2011-01-01 00:00:00
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40,January,Saturday,2011-01-01 01:00:00
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32,January,Saturday,2011-01-01 02:00:00
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13,January,Saturday,2011-01-01 03:00:00
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1,January,Saturday,2011-01-01 04:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336,December,Wednesday,2012-12-19 19:00:00
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241,December,Wednesday,2012-12-19 20:00:00
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168,December,Wednesday,2012-12-19 21:00:00
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129,December,Wednesday,2012-12-19 22:00:00


In [25]:
train.set_index('datetime')  # returns a new DataFrame; train itself is unchanged

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,month,weekday,ts
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16,January,Saturday,2011-01-01 00:00:00
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40,January,Saturday,2011-01-01 01:00:00
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32,January,Saturday,2011-01-01 02:00:00
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13,January,Saturday,2011-01-01 03:00:00
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1,January,Saturday,2011-01-01 04:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336,December,Wednesday,2012-12-19 19:00:00
2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241,December,Wednesday,2012-12-19 20:00:00
2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168,December,Wednesday,2012-12-19 21:00:00
2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129,December,Wednesday,2012-12-19 22:00:00


In [32]:
# open-ended slice: selects everything from this label onward, not a single week
sample_week = train.loc['2011-01-10':]

In [33]:
sample_week

,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,month,weekday,ts
2012,2011-05-10 18:00:00,2,0,1,1,26.24,31.060,29,8.9981,70,480,550,May,Tuesday,2011-05-10 18:00:00
2013,2011-05-10 19:00:00,2,0,1,1,24.60,31.060,43,15.0013,69,365,434,May,Tuesday,2011-05-10 19:00:00
2014,2011-05-10 20:00:00,2,0,1,1,22.14,25.760,60,8.9981,50,241,291,May,Tuesday,2011-05-10 20:00:00
2015,2011-05-10 21:00:00,2,0,1,1,22.14,25.760,52,0.0000,30,173,203,May,Tuesday,2011-05-10 21:00:00
2016,2011-05-10 22:00:00,2,0,1,1,21.32,25.000,55,7.0015,29,121,150,May,Tuesday,2011-05-10 22:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336,December,Wednesday,2012-12-19 19:00:00
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241,December,Wednesday,2012-12-19 20:00:00
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168,December,Wednesday,2012-12-19 21:00:00
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129,December,Wednesday,2012-12-19 22:00:00
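
Selecting a single day or week by date string works when the DataFrame has a `DatetimeIndex` and the slice has both a start and an end. The sketch below shows the idea on synthetic hourly data; the `demo` frame and its counts are placeholders, not real rentals.

```python
import numpy as np
import pandas as pd

# Synthetic hourly timestamps covering January 2011
hours = pd.date_range("2011-01-01", periods=31 * 24, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"count": np.arange(len(hours))}, index=hours)

one_day = demo.loc["2011-01-10"]                 # all 24 hours of that date
one_week = demo.loc["2011-01-10":"2011-01-16"]   # both end dates are included
print(len(one_day), len(one_week))
```

With the real data, the same pattern would apply after something like `train.set_index('ts')`, and a call such as `one_week['count'].plot()` would then draw the selected section.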
