# Kaggle Project - Bike Sharing Demand

Source: https://www.kaggle.com/c/bike-sharing-demand
        
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

Dataset

datetime - hourly date + timestamp

season - 1 = spring, 2 = summer, 3 = fall, 4 = winter

holiday - whether the day is considered a holiday

workingday - whether the day is neither a weekend nor holiday

weather

    1: Clear, Few clouds, Partly cloudy
        
    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
        
    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

    4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
        
temp - temperature in Celsius

atemp - "feels like" temperature in Celsius

humidity - relative humidity

windspeed - wind speed

casual - number of non-registered user rentals initiated

registered - number of registered user rentals initiated

count - number of total rentals

Evaluation metrics
$$
\sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }
$$

$n$ is the number of hours in the test set

$p_i$ is your predicted count

$a_i$ is the actual count

$\log(x)$ is the natural logarithm
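As a quick sanity check of the formula, here is a minimal sketch that evaluates the metric on three made-up hours. The function name `rmsle_example` and the sample counts are illustrative assumptions, not part of the competition data.

```python
import numpy as np

def rmsle_example(predicted, actual):
    """Root mean squared logarithmic error, following the formula above."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # log1p(x) computes log(x + 1)
    diff = np.log1p(predicted) - np.log1p(actual)
    return np.sqrt(np.mean(diff ** 2))

# Made-up counts for three hours (illustrative only)
perfect = rmsle_example([10, 20, 30], [10, 20, 30])
# Here every p_i + 1 is exactly twice a_i + 1, so each log difference is log(2)
off_by_factor = rmsle_example([1, 3, 7], [0, 1, 3])
print(perfect, off_by_factor)
```

Because the error is measured on a log scale, the second case scores log(2) ≈ 0.693 regardless of how large the individual counts are: the metric penalizes relative rather than absolute errors.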


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
# parse_dates: convert the string column to datetime data
test_df=pd.read_csv("./bike-sharing-demand/test.csv", parse_dates=["datetime"])
train_df=pd.read_csv("./bike-sharing-demand/train.csv", parse_dates=["datetime"])

In [5]:
all_df=pd.concat((train_df, test_df), axis=0).reset_index() # remove duplicate index values
all_df.head()

Unnamed: 0,index,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3.0,13.0,16.0
1,1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8.0,32.0,40.0
2,2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5.0,27.0,32.0
3,3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3.0,10.0,13.0
4,4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0.0,1.0,1.0


In [6]:
train_index=list(range(len(train_df)))
train_index[-1]

10885

In [10]:
test_index=list(range(len(train_df), len(all_df)))
test_index[0], test_index[-1]

(10886, 17378)
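The two index lists make it possible to recover the original train and test rows from the combined frame later on. A minimal sketch of that idea on a toy frame follows; the frames `toy_train` and `toy_test` and their values are made up for illustration, and `ignore_index=True` is shown as one possible alternative to `reset_index()`, not as the notebook's own approach.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are illustrative only)
toy_train = pd.DataFrame({"temp": [9.84, 9.02, 9.02], "count": [16, 40, 32]})
toy_test = pd.DataFrame({"temp": [10.66, 11.48]})

# ignore_index=True renumbers rows 0..n-1 without adding an extra "index" column
toy_all = pd.concat((toy_train, toy_test), axis=0, ignore_index=True)

# Positional indices of the train rows and the test rows in the combined frame
toy_train_index = list(range(len(toy_train)))
toy_test_index = list(range(len(toy_train), len(toy_all)))

# Recover the two parts by position
train_part = toy_all.iloc[toy_train_index]
test_part = toy_all.iloc[toy_test_index]
print(len(train_part), len(test_part))
```

In the toy frame, the `count` column is NaN for every test row, which mirrors why the target columns show missing values only for the test portion of the combined data.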

In [13]:
all_df.isnull().sum() # casual, registered, and count are missing only for the 6493 test rows => no other null values

index            0
datetime         0
season           0
holiday          0
workingday       0
weather          0
temp             0
atemp            0
humidity         0
windspeed        0
casual        6493
registered    6493
count         6493
dtype: int64

In [14]:
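def rmsle(y,y_):
    # Root mean squared logarithmic error between two count arrays
    # nan_to_num replaces NaN/inf produced by invalid inputs with finite numbers
    log1=np.nan_to_num(np.log(y+1))
    log2=np.nan_to_num(np.log(y_+1))
    calc=(log1-log2)**2
    return np.sqrt(np.mean(calc))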

In [16]:
submission_df=pd.read_csv("./bike-sharing-demand/sampleSubmission.csv")
submission_df.head()

Unnamed: 0,datetime,count
0,2011-01-20 00:00:00,0
1,2011-01-20 01:00:00,0
2,2011-01-20 02:00:00,0
3,2011-01-20 03:00:00,0
4,2011-01-20 04:00:00,0


In [18]:
rmsle(submission_df["count"].values,
     np.random.randint(0,100,size=len(submission_df)))

3.7733143985464284

In [19]:
del all_df["casual"]
del all_df["registered"]
del all_df["index"]

In [20]:
pre_df=all_df.merge(pd.get_dummies(
    all_df["season"], prefix="season"),
                    left_index=True, right_index=True)
pre_df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,season_1,season_2,season_3,season_4
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,16.0,1,0,0,0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,40.0,1,0,0,0
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,32.0,1,0,0,0
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,13.0,1,0,0,0
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,1.0,1,0,0,0


In [21]:
pre_df=pre_df.merge(pd.get_dummies(
    all_df["weather"], prefix="weather"),
                    left_index=True, right_index=True)
pre_df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,season_1,season_2,season_3,season_4,weather_1,weather_2,weather_3,weather_4
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,16.0,1,0,0,0,1,0,0,0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,40.0,1,0,0,0,1,0,0,0
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,32.0,1,0,0,0,1,0,0,0
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,13.0,1,0,0,0,1,0,0,0
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,1.0,1,0,0,0,1,0,0,0
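The two merge steps above attach one-hot columns by aligning on the row index. The same result can be sketched with `pd.concat` along `axis=1`; the toy frame `toy_df` and its values below are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with the two categorical columns (values are illustrative only)
toy_df = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "weather": [1, 1, 2, 3],
    "temp": [9.84, 15.58, 28.7, 12.3],
})

# One dummy column per category value, named <prefix>_<value>
season_dummies = pd.get_dummies(toy_df["season"], prefix="season")
weather_dummies = pd.get_dummies(toy_df["weather"], prefix="weather")

# concat along axis=1 aligns on the index, like merge(left_index=True, right_index=True)
toy_pre = pd.concat([toy_df, season_dummies, weather_dummies], axis=1)
print(toy_pre.columns.tolist())
```

Note that a dummy column is created only for values that actually occur, so this toy frame has no `weather_4` column. In recent pandas versions `get_dummies` returns boolean columns by default; passing `dtype=int` gives 0/1 values like those in the output above.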


In [22]:
pre_df["datetime"].unique()

array(['2011-01-01T00:00:00.000000000', '2011-01-01T01:00:00.000000000',
       '2011-01-01T02:00:00.000000000', ...,
       '2012-12-31T21:00:00.000000000', '2012-12-31T22:00:00.000000000',
       '2012-12-31T23:00:00.000000000'], dtype='datetime64[ns]')
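Since `datetime` was parsed into `datetime64[ns]` values, calendar components can be read through the `.dt` accessor. The sketch below shows one possible way to derive such columns on a toy frame; the column names `year`, `month`, `hour`, and `weekday` are assumptions, and the notebook itself does not confirm that these features are used.

```python
import pandas as pd

# Two timestamps taken from the first and last values shown above
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-01 00:00:00", "2012-12-31 23:00:00"])
})

# Hypothetical feature columns derived from the timestamp
toy["year"] = toy["datetime"].dt.year
toy["month"] = toy["datetime"].dt.month
toy["hour"] = toy["datetime"].dt.hour
toy["weekday"] = toy["datetime"].dt.weekday  # Monday=0, Sunday=6

print(toy)
```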