## Error Metrics on a Naive Forecast

For the bikes dataset, a 'naive' forecast uses the count from the same period one year earlier as the prediction. Use the hourly counts from May 2011 as a forecast for May 2012 and check how far off the predictions are:

1. Filter the dataset for May 2011 and select the count column.
2. Filter the dataset for May 2012 and select the count column.
3. Use the two resulting series as the inputs for `rmse`.
4. How far off, on average, was this naive prediction?
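
For reference, the quantity computed in step 3 is the root mean squared error. The notation below is an assumption added for illustration: $\hat{y}_i$ stands for the forecast (the May 2011 count in hourly slot $i$), $y_i$ for the actual May 2012 count in the same slot, and $n$ for the number of hourly slots being compared.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
$$

Because the differences are squared before averaging, a few large misses raise the RMSE more than many small ones.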

### Bonus:

How could the results be improved?

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.tools.eval_measures as sm

In [3]:
df = pd.read_csv('../data/bike_dataset_new.csv', parse_dates=['datetime'])

In [4]:
df

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,hour,month,weekday,day,year,part_of_day
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3.0,13.0,16,0,1,5,1,2011,morning
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8.0,32.0,40,1,1,5,1,2011,morning
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5.0,27.0,32,2,1,5,1,2011,morning
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3.0,10.0,13,3,1,5,1,2011,morning
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0.0,1.0,1,4,1,5,1,2011,morning
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17374,2012-12-31 19:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,156,19,12,0,31,2012,evening
17375,2012-12-31 20:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,104,20,12,0,31,2012,evening
17376,2012-12-31 21:00:00,1,0,1,1,10.66,12.880,60,11.0014,,,67,21,12,0,31,2012,night
17377,2012-12-31 22:00:00,1,0,1,1,10.66,13.635,56,8.9981,,,43,22,12,0,31,2012,night


In [5]:
df_may_2011 = df[(df['year'] == 2011) & (df['month'] == 5)]
df_may_2011

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,hour,month,weekday,day,year,part_of_day
2786,2011-05-01 00:00:00,2,0,0,1,17.22,21.210,67,6.0032,19.0,77.0,96,0,5,6,1,2011,morning
2787,2011-05-01 01:00:00,2,0,0,1,17.22,21.210,69,7.0015,9.0,50.0,59,1,5,6,1,2011,morning
2788,2011-05-01 02:00:00,2,0,0,1,17.22,21.210,77,7.0015,7.0,43.0,50,2,5,6,1,2011,morning
2789,2011-05-01 03:00:00,2,0,0,1,16.40,20.455,82,7.0015,8.0,15.0,23,3,5,6,1,2011,morning
2790,2011-05-01 04:00:00,2,0,0,1,16.40,20.455,76,7.0015,6.0,11.0,17,4,5,6,1,2011,morning
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3525,2011-05-31 19:00:00,2,0,1,1,31.98,37.120,62,7.0015,,,407,19,5,1,31,2011,evening
3526,2011-05-31 20:00:00,2,0,1,1,31.98,37.120,62,7.0015,,,310,20,5,1,31,2011,evening
3527,2011-05-31 21:00:00,2,0,1,1,31.16,36.365,66,7.0015,,,224,21,5,1,31,2011,night
3528,2011-05-31 22:00:00,2,0,1,1,30.34,34.850,70,11.0014,,,160,22,5,1,31,2011,night


In [6]:
count_2011 = df_may_2011['count']
count_2011

2786     96
2787     59
2788     50
2789     23
2790     17
       ... 
3525    407
3526    310
3527    224
3528    160
3529     98
Name: count, Length: 744, dtype: int64

In [7]:
df_may_2012 = df[(df['year'] == 2012) & (df['month'] == 5)]
df_may_2012

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,hour,month,weekday,day,year,part_of_day
11539,2012-05-01 00:00:00,2,0,1,2,20.50,24.240,59,12.9980,7.0,28.0,35,0,5,1,1,2012,morning
11540,2012-05-01 01:00:00,2,0,1,2,20.50,24.240,63,8.9981,0.0,21.0,21,1,5,1,1,2012,morning
11541,2012-05-01 02:00:00,2,0,1,2,20.50,24.240,72,6.0032,1.0,7.0,8,2,5,1,1,2012,morning
11542,2012-05-01 03:00:00,2,0,1,2,20.50,24.240,77,0.0000,1.0,2.0,3,3,5,1,1,2012,morning
11543,2012-05-01 04:00:00,2,0,1,2,21.32,25.000,72,6.0032,1.0,7.0,8,4,5,1,1,2012,morning
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12278,2012-05-31 19:00:00,2,0,1,1,29.52,32.575,34,11.0014,,,420,19,5,3,31,2012,evening
12279,2012-05-31 20:00:00,2,0,1,1,27.88,31.820,44,11.0014,,,336,20,5,3,31,2012,evening
12280,2012-05-31 21:00:00,2,0,1,1,27.88,31.820,44,11.0014,,,249,21,5,3,31,2012,night
12281,2012-05-31 22:00:00,2,0,1,1,27.06,31.060,50,11.0014,,,188,22,5,3,31,2012,night


In [8]:
count_2012 = df_may_2012['count']
count_2012

11539     35
11540     21
11541      8
11542      3
11543      8
        ... 
12278    420
12279    336
12280    249
12281    188
12282    135
Name: count, Length: 744, dtype: int64
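
The two count series have the same length (744) but different row labels (2786 onward for 2011, 11539 onward for 2012). That matters if the error is ever computed with plain pandas arithmetic, because pandas aligns on labels rather than on position. The sketch below uses small toy series as stand-ins (the variable names `last_year` and `this_year` are illustrative, and the values are just the first three rows shown above):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for count_2011 and count_2012: same length, different row labels.
last_year = pd.Series([96, 59, 50], index=[2786, 2787, 2788])
this_year = pd.Series([35, 21, 8], index=[11539, 11540, 11541])

# pandas aligns on labels, so no rows match and every difference is NaN.
label_diff = last_year - this_year
print(label_diff.isna().all())

# Converting to NumPy arrays compares the values by position instead.
positional_diff = last_year.to_numpy() - this_year.to_numpy()
print(positional_diff)
```

Comparing by position is what this exercise needs, since row k of May 2011 is meant to forecast row k of May 2012.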

In [9]:
# sns.heatmap(data=df_may_2011,
#                 x='month',y='count')

# sns.barplot(data=df_may_2012,
#                 x='month',y='count');

In [10]:
count_2011.mean(), count_2012.mean()

(202.3252688172043, 252.5766129032258)

In [11]:
rmse = sm.rmse(count_2011, count_2012)
rmse

168.69935321213757

The RMSE is the square root of the mean squared difference between the forecast and the actual values. Here it is about 168.7 rentals per hour, so the naive forecast is typically off by roughly that much, with large misses weighted more heavily than small ones.
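
To make the metric concrete, here is a minimal sketch that computes RMSE by hand with NumPy and compares it with the mean absolute error (MAE) and the mean error (bias). The arrays are made-up illustrative values, not data from the bikes dataset.

```python
import numpy as np

# Illustrative values only, not taken from the bikes dataset.
forecast = np.array([10.0, 20.0, 30.0, 40.0])
actual = np.array([12.0, 18.0, 30.0, 50.0])

errors = forecast - actual            # [-2, 2, 0, -10]
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt((4 + 4 + 0 + 100) / 4) = sqrt(27)
mae = np.mean(np.abs(errors))         # (2 + 2 + 0 + 10) / 4 = 3.5
bias = np.mean(errors)                # (-2 + 2 + 0 - 10) / 4 = -2.5

print(rmse, mae, bias)
```

RMSE is never smaller than MAE; the single large miss of 10 pulls it well above the MAE here. A negative bias means the forecast is too low on average.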

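To improve the prediction accuracy, we can account for other factors that affect bike rentals, such as the year-over-year trend (the mean hourly count rose by roughly 25% between May 2011 and May 2012), seasonality, day of the week, holidays, and weather conditions.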
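
One simple way to account for a trend is to scale last year's values by a growth factor estimated from an earlier period. The sketch below is only an illustration under stated assumptions: the numbers are invented, the helper `rmse` is defined locally, and using the preceding month to estimate growth is one possible choice, not the notebook's method.

```python
import numpy as np

def rmse(forecast, actual):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(forecast) - np.asarray(actual)
    return float(np.sqrt(np.mean(diff ** 2)))

# Illustrative hourly counts, not taken from the bikes dataset.
april_last_year = np.array([80.0, 100.0, 120.0])
april_this_year = np.array([100.0, 125.0, 150.0])
may_last_year = np.array([100.0, 200.0, 300.0])
may_this_year = np.array([120.0, 260.0, 370.0])

# Plain naive forecast: reuse last year's values unchanged.
naive = may_last_year

# Growth-adjusted forecast: scale by the year-over-year ratio seen in the
# preceding month, so nothing from the target month is used.
growth = april_this_year.mean() / april_last_year.mean()  # 125 / 100 = 1.25
adjusted = may_last_year * growth                         # [125, 250, 375]

print(rmse(naive, may_this_year), rmse(adjusted, may_this_year))
```

In this toy example the adjusted forecast has a much lower RMSE than the plain naive one. Whether the same holds for the real data would need to be checked.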