<a href="https://colab.research.google.com/github/JPrier/TorontoBikeShare/blob/master/BikeShare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notes

#### TODO


* Add station IDs so that stations can be mapped to real-time data later on
* Create a new DataFrame that holds all the needed data from both the bike and weather datasets
  * It should be easy to add more data on top of it
* Run the profiling report on the new DataFrame
* Begin feature engineering
* Build the benchmark model
* Build the RNN-LSTM model
* Look at the real-time data and format the training data to match it
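
One possible shape for the combined DataFrame is one row per hour, with trip counts joined to that hour's weather. The sketch below is only an illustration under stated assumptions: the helper name `build_hourly_frame`, the `hour` and `departures` columns, and all toy values are hypothetical, and the weather table is assumed to have one row per hour.

```python
import pandas as pd

# Toy stand-ins for the cleaned trip and weather tables (values are made up).
trips = pd.DataFrame({
    "trip_start_time": pd.to_datetime([
        "2017-07-01 08:05", "2017-07-01 08:40", "2017-07-01 09:15",
    ]),
    "from_station_name": ["Union Station", "Union Station", "King St W"],
})
weather = pd.DataFrame({
    "date_time_local": pd.to_datetime(["2017-07-01 08:00", "2017-07-01 09:00"]),
    "temperature": [21.5, 23.0],
    "wind_speed": [12, 9],
})

def build_hourly_frame(trips, weather):
    """Count departures per hour and attach that hour's weather row."""
    hourly = (
        trips.assign(hour=trips["trip_start_time"].dt.floor("60min"))
             .groupby("hour")
             .size()
             .rename("departures")
             .reset_index()
    )
    weather = weather.rename(columns={"date_time_local": "hour"})
    # A left join keeps every hour that has trips, even if weather is missing.
    return hourly.merge(weather, on="hour", how="left")

combined = build_hourly_frame(trips, weather)
print(combined)
```

Because everything is keyed on `hour`, another hourly table (holidays or events, for example) could be merged onto `combined` in the same way.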



## NN Notes



* Need more features; 4 is not enough
* Use XGBoost to improve the result
* Get a baseline performance with a simple NN and try to improve on it



## RNN Notes



* For the RNN, use 24 nodes, one for each hour
* Use a bag-of-words encoding on the stations, since the problem is about the frequency of station usage
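
A bag-of-words encoding of stations can be read as one count vector per hour, with one column per station. The sketch below is a minimal illustration with made-up trips; the variable names `hour` and `station_counts` are hypothetical.

```python
import pandas as pd

# Toy trips (values are made up).
trips = pd.DataFrame({
    "trip_start_time": pd.to_datetime([
        "2017-07-01 08:05", "2017-07-01 08:40",
        "2017-07-01 08:55", "2017-07-01 09:15",
    ]),
    "from_station_name": [
        "Union Station", "King St W", "Union Station", "King St W",
    ],
})

# Truncate each start time to the hour, then count departures per
# (hour, station) pair. Each row is the "bag" of stations used in that hour.
hour = trips["trip_start_time"].dt.floor("60min").rename("hour")
station_counts = pd.crosstab(hour, trips["from_station_name"])
print(station_counts)
```

Hours with no trips do not appear in the result, so a full day would need to be reindexed to 24 rows before being fed to the RNN as a sequence.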



# Google Drive Setup

In [0]:
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate the Colab user and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
folders = ["1FywO6-NIKvfXZ3LJ08fdU_S4F6QdBOOr",
           "17524iO2kiuU4_MBalRoBbYe8kpA6oVtn"]

for folder_id in folders:
  # List every non-trashed file directly inside this Drive folder
  file_list = drive.ListFile({'q': "'{}' in parents and trashed=false".format(
      folder_id)}).GetList()
  # Only the CSV files are downloaded, so the progress counter counts those alone
  csv_files = sorted((f for f in file_list if f['title'].endswith(".csv")),
                     key=lambda f: f['title'])
  for i, drive_file in enumerate(csv_files, start=1):
    print("Downloading {} {}/{} in folder {}".format(
        drive_file['title'], i, len(csv_files), folder_id))
    # Save the file into the working directory under its Drive title
    drive.CreateFile({'id': drive_file['id']}).GetContentFile(drive_file['title'])


# Setup

In [0]:
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('ggplot')

# Variable Identification

### Predictor Variables:
* trip_start_time
* from_station_name
* user_type
* date_time_local / unixtime
* pressure_station
* pressure_sea
* wind_dir_10s
* wind_speed
* relative_humidity
* dew_point
* temperature
* windchill
* visibility
* health_index



### Target Variable:
* bikes_used: a vector giving, for each station, the number of bikes that have left (-) or arrived (+)
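
One way to read this target is as an hourly net flow per station: each departure counts as -1 and each arrival as +1. The sketch below is an assumption about how `bikes_used` could be derived, not a confirmed implementation; the helper `net_flow`, the hourly bucketing, and the toy trips are all hypothetical.

```python
import pandas as pd

# Toy trips (values are made up).
trips = pd.DataFrame({
    "trip_start_time": pd.to_datetime(
        ["2017-07-01 08:05", "2017-07-01 08:20", "2017-07-01 08:50"]),
    "trip_stop_time": pd.to_datetime(
        ["2017-07-01 08:25", "2017-07-01 08:45", "2017-07-01 09:10"]),
    "from_station_name": ["Union Station", "King St W", "Union Station"],
    "to_station_name": ["King St W", "Union Station", "King St W"],
})

def net_flow(trips):
    """Hourly arrivals minus departures (rows: hour, columns: station)."""
    departures = pd.DataFrame({
        "hour": trips["trip_start_time"].dt.floor("60min"),
        "station": trips["from_station_name"],
        "change": -1,  # a bike left this station
    })
    arrivals = pd.DataFrame({
        "hour": trips["trip_stop_time"].dt.floor("60min"),
        "station": trips["to_station_name"],
        "change": 1,  # a bike arrived at this station
    })
    events = pd.concat([departures, arrivals], ignore_index=True)
    # Sum the +1/-1 events per hour and station; missing pairs become 0.
    return events.pivot_table(index="hour", columns="station",
                              values="change", aggfunc="sum", fill_value=0)

bikes_used = net_flow(trips)
print(bikes_used)
```

A trip that starts in one hour and ends in the next contributes its departure and its arrival to different rows, which is why the two event tables are bucketed separately.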

<br />

### Data_Types:
#### Bike Share
| Numerical             	| Character         	| DateTime        	|
|-----------------------	|-------------------	|-----------------	|
| trip_id               	| from_station_name 	| trip_start_time 	|
| trip_duration_seconds 	| to_station_name   	| trip_stop_time  	|
| bikes_used            	| user_type         	|                 	|

#### Weather
| Numerical         	| Character 	| DateTime        	|
|-------------------	|-----------	|-----------------	|
| unixtime          	| wind_dir  	| date_time_local 	|
| pressure_station  	|           	|                 	|
| pressure_sea      	|           	|                 	|
| wind_dir_10s      	|           	|                 	|
| wind_speed        	|           	|                 	|
| relative_humidity 	|           	|                 	|
| dew_point         	|           	|                 	|
| temperature       	|           	|                 	|
| windchill         	|           	|                 	|
| visibility        	|           	|                 	|
| health_index      	|           	|                 	|

<br /><br />

### Variable Category
#### Bike Share

|    Categorical    | Continuous            |
|:-----------------:|-----------------------|
| trip_id           | trip_start_time       |
| from_station_name | trip_stop_time        |
| to_station_name   | trip_duration_seconds |
| user_type         | bikes_used            |

#### Weather

| Categorical 	| Continuous        	|
|-------------	|-------------------	|
| wind_dir    	| date_time_local   	|
|             	| unixtime          	|
|             	| pressure_station  	|
|             	| pressure_sea      	|
|             	| wind_dir_10s      	|
|             	| wind_speed        	|
|             	| relative_humidity 	|
|             	| dew_point         	|
|             	| temperature       	|
|             	| windchill         	|
|             	| visibility        	|
|             	| health_index      	|


# Helper Functions

In [0]:
'''
Issues with the BikeShare dataset:
  - Q1 and Q2 differ from Q3 and Q4 in these ways:
      - station ids are missing from Q3 and Q4 only
      - the date format switches from d-m-y (Q1, Q2) to m-d-y in Q3, and to yet another format in Q4
      - Q4 may use a datetime variable; Excel displays a different format than what is actually stored
'''
bs_files = ["Bikeshare Ridership (2017 Q1).csv",
            "Bikeshare Ridership (2017 Q2).csv",
            "Bikeshare Ridership (2017 Q3).csv",
            "Bikeshare Ridership (2017 Q4).csv"]

# Map a quarter index (0-3) to the list of files up to and including that quarter.
# Every value must be a list, so QX[0] is a slice rather than a bare string.
QX = {0: bs_files[:1], 1: bs_files[:2], 2: bs_files[:3], 3: bs_files}

def read_bikeshare_data(quarters):
  files = QX[quarters]
  frames = []
  for i, filename in enumerate(files):
    # Drop incomplete rows, then parse the timestamps with that quarter's format
    quarter_df = pd.read_csv(filename).dropna()
    frames.append(format_time(i, quarter_df))

  # DataFrame.append was removed in pandas 2.0; pd.concat stacks the quarters instead
  df = pd.concat(frames, sort=False)

  # Remove station ids as Q3 and Q4 do not have them (still have station names)
  df = df.drop(columns=['from_station_id', 'to_station_id'], errors='ignore')

  # Strip periods and apostrophes from station names. regex=False treats "."
  # as a literal character instead of the regex wildcard.
  for col in ["from_station_name", "to_station_name"]:
    df[col] = df[col].str.replace(".", "", regex=False).str.replace("'", "", regex=False)

  return df

def format_time(quarter, df):
  # quarter is a 0-based index: 0 = Q1, 1 = Q2, 2 = Q3, 3 = Q4
  if quarter == 0 or quarter == 1:
    # Q1 and Q2: day/month/4-digit year, no seconds
    time_format = '%d/%m/%Y %H:%M'
  elif quarter == 3:
    # Q4: month/day/2-digit year, with seconds
    time_format = '%m/%d/%y %H:%M:%S'
  else:
    # Q3: month/day/4-digit year, no seconds
    time_format = '%m/%d/%Y %H:%M'
  for col in ['trip_start_time', 'trip_stop_time']:
    df[col] = pd.to_datetime(df[col], format=time_format)
  return df

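def read_weather_data(quarters):
  # NOTE: quarters is currently unused; the same time window is returned for any value
  df = pd.read_csv('weatherstats_toronto_hourly.csv')
  # Keep rows whose Unix timestamp (seconds) lies between
  # 2017-02-01 03:00 UTC and 2018-01-01 04:00 UTC.
  # between() includes both endpoints by default; the boolean
  # inclusive=True form was removed in pandas 2.0.
  df = df[df['unixtime'].between(1485918000, 1514779200)]
  return df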

def read_data(quarters):
  df_bikes = read_bikeshare_data(quarters)
  df_weather = read_weather_data(quarters)
  return df_bikes, df_weather
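
The per-quarter formats used in `format_time` can be checked in isolation with the standard library, which uses the same format directives. This is a minimal sketch: the `QUARTER_FORMATS` table and `parse_trip_time` helper are hypothetical names, and the sample strings are made up rather than taken from the CSV files.

```python
from datetime import datetime

# Same format strings as format_time, keyed by 0-based quarter index.
QUARTER_FORMATS = {
    0: "%d/%m/%Y %H:%M",     # Q1: day first
    1: "%d/%m/%Y %H:%M",     # Q2: day first
    2: "%m/%d/%Y %H:%M",     # Q3: month first
    3: "%m/%d/%y %H:%M:%S",  # Q4: month first, 2-digit year, seconds
}

def parse_trip_time(quarter, text):
    """Parse one timestamp string using the format for its quarter."""
    return datetime.strptime(text, QUARTER_FORMATS[quarter])

print(parse_trip_time(0, "13/01/2017 08:05"))
print(parse_trip_time(2, "07/13/2017 08:05"))
print(parse_trip_time(3, "10/13/17 08:05:00"))

# A day above 12 fails loudly when parsed with the wrong quarter's format...
try:
    parse_trip_time(2, "13/01/2017 08:05")
except ValueError as err:
    print("rejected:", err)

# ...but a day of 12 or less parses silently with day and month swapped.
print(parse_trip_time(0, "03/04/2017 08:05"))  # 3 April
print(parse_trip_time(2, "03/04/2017 08:05"))  # 4 March
```

The last two lines are the reason the format has to be chosen per file rather than inferred per row: an ambiguous date gives no error to signal the wrong choice.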

# Run EDA

In [0]:
'''
-----TODO-----
- Read in and manipulate/clean data
- Visualize data
- Perform more EDA
- Predict rider usage on a given day of the year? (would need another dataset, possibly weather, holidays, and events)
'''
# Index of the last quarter to read in (0 = Q1 only, 3 = all of 2017);
# lower it to speed up debugging
quarters = 3
df_bikes, df_weather = read_data(quarters)

In [0]:
report = pandas_profiling.ProfileReport(df_bikes)
# Path passed positionally: the keyword is outputfile in pandas_profiling 1.x
# and output_file in 2.x
report.to_file('bikesReport.html')

In [0]:
report = pandas_profiling.ProfileReport(df_weather)
# Path passed positionally: the keyword is outputfile in pandas_profiling 1.x
# and output_file in 2.x
report.to_file('weatherReport.html')

# Benchmark Model

A basic NN built as a benchmark for the later models, to check whether the more complex models actually improve on a simpler one.

In [0]:
# TODO: go from pandas df to sklearn training/Testing set
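
One way to approach this TODO is a chronological split, so that the test set always lies after the training set in time. The sketch below uses only pandas and NumPy arrays; the helper `chronological_split`, the column names, and the toy values are hypothetical. scikit-learn's `train_test_split(X, y, test_size=0.2, shuffle=False)` should give an equivalent unshuffled split on already-sorted data.

```python
import pandas as pd

# Toy hourly frame (values are made up).
frame = pd.DataFrame({
    "hour": pd.date_range("2017-07-01", periods=10, freq="60min"),
    "temperature": [20, 21, 22, 23, 24, 25, 24, 23, 22, 21],
    "departures": [3, 5, 8, 12, 15, 14, 10, 7, 4, 2],
})

def chronological_split(frame, feature_cols, target_col,
                        time_col="hour", test_fraction=0.2):
    """Split into train/test arrays without shuffling across time."""
    frame = frame.sort_values(time_col)
    n_test = int(round(len(frame) * test_fraction))
    split = len(frame) - n_test
    X = frame[feature_cols].to_numpy()
    y = frame[target_col].to_numpy()
    # The earliest rows train the model; the latest rows are held out.
    return X[:split], X[split:], y[:split], y[split:]

X_train, X_test, y_train, y_test = chronological_split(
    frame, feature_cols=["temperature"], target_col="departures")
print(X_train.shape, X_test.shape)
```

A shuffled split would let the model see hours from the same days it is later tested on, which tends to overstate performance for time-dependent data like ridership.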