##### Ross Brown The Data Incubator Challenge Question 3 Notebook

Rideshare platforms have negatively affected public transit systems and cities by drawing passengers away from public transit and increasing auto traffic and gridlock. The aim of this project was to use rideshare trip data from the city of Chicago to explore whether cities and riders could know in advance when rides were likely to be lengthy and costly. Cities could use this information to make public transit more appealing during periods of costly or lengthy rideshare trips. Passengers could use it to explore alternatives to rideshare.

I investigated whether hourly precipitation data could predict rideshare trip duration or cost. The City of Chicago data website promises rideshare trip data starting from November 1, 2018. Regrettably, the website does not make it apparent that the data is only available through December 31, 2018; I realized the data was incomplete only after noticing that the rideshare companies' data submission deadlines fall long after the trips are taken. The precipitation data for November and December 2018 was scant, because precipitation mostly falls as snow during those months in Chicago, and weather reporting agencies do not report snow as precipitation amounts. For these reasons, the hypothesized relationship was not borne out by the scatterplots.
Nonetheless, my data wrangling work, including working with 17 million rideshare trips, is shown in this notebook.


Step 1: Select daytime rides and calculate mean cost and duration by hour

In [1]:
import pandas as pd
# Display settings (max_rows was previously set twice; 999 was the value in effect)
pd.set_option('display.max_rows', 999)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import warnings
warnings.filterwarnings("ignore")


In [2]:
df = pd.read_csv('C:/Users/rmbrm/Documents/TDI_challenge/trips.csv', low_memory=False)

In [3]:
trips = df[['Trip Start Timestamp', 'Trip Seconds', 'Trip Total']].copy()

In [4]:
working = trips.rename(index=str, columns={'Trip Start Timestamp': 'start', 'Trip Seconds': 'duration', 'Trip Total': 'cost'})

In [5]:
df = working.sort_values(by=['start'])

In [6]:
from datetime import datetime

In [7]:
# Vectorized parsing with an explicit format; much faster than a row-wise strptime on millions of rows
df['start_dt'] = pd.to_datetime(df['start'], format='%m/%d/%Y %I:%M:%S %p')

In [8]:
df.to_pickle('C:/Users/rmbrm/Documents/TDI_challenge/trips_with_DT', compression='infer', protocol=4)

In [11]:
df.head()

Unnamed: 0,start,duration,cost,start_dt
5138900,11/01/2018 01:00:00 AM,1092.0,12.5,2018-11-01 01:00:00
2158163,11/01/2018 01:00:00 AM,357.0,7.5,2018-11-01 01:00:00
9224043,11/01/2018 01:00:00 AM,1624.0,7.5,2018-11-01 01:00:00
14290593,11/01/2018 01:00:00 AM,953.0,5.0,2018-11-01 01:00:00
15539658,11/01/2018 01:00:00 AM,792.0,7.5,2018-11-01 01:00:00


In [12]:
df['start_dt'] = pd.to_datetime(df['start_dt'])   
df.set_index('start_dt',inplace=True)

In [13]:
daytime = df.between_time('6:00', '18:59')

In [20]:
daytime.index = daytime.index.floor('H')

In [21]:
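# Average only the numeric columns; the raw 'start' strings cannot be averaged
means = daytime[['duration', 'cost']].resample('H').mean()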

In [22]:
means = means.dropna()

In [24]:
means.head()

start_dt,duration,cost
2018-11-01 06:00:00,1204.709212,16.499463
2018-11-01 07:00:00,1262.57694,14.657461
2018-11-01 08:00:00,1276.139724,14.110814
2018-11-01 09:00:00,1152.094102,14.92212
2018-11-01 10:00:00,1103.610976,15.655609


In [25]:
means.to_pickle('C:/Users/rmbrm/Documents/TDI_challenge/trips_final', compression='infer', protocol=4)
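
The cells in Step 1 can be condensed into a single function. The following is a minimal sketch on a toy frame, not the notebook's exact code: the sample rows and the helper name `hourly_daytime_means` are hypothetical, and it groups on the floored hour instead of resampling, so hours without trips never appear and no `dropna` is needed.

```python
import pandas as pd

# Toy trips (made-up values) in the same raw timestamp format as the Chicago export.
raw = pd.DataFrame({
    "start": [
        "11/01/2018 06:15:00 AM",
        "11/01/2018 06:45:00 AM",
        "11/01/2018 07:30:00 AM",
        "11/01/2018 11:00:00 PM",  # outside the daytime window
    ],
    "duration": [600.0, 1200.0, 900.0, 300.0],
    "cost": [10.0, 20.0, 15.0, 5.0],
})


def hourly_daytime_means(trips, start="6:00", end="18:59"):
    """Return mean duration and cost per hour for trips starting between start and end."""
    indexed = trips.set_index(
        pd.to_datetime(trips["start"], format="%m/%d/%Y %I:%M:%S %p")
    )
    daytime = indexed.between_time(start, end)
    # Group on the hour in which each trip started.
    return daytime.groupby(daytime.index.floor("h"))[["duration", "cost"]].mean()


toy_means = hourly_daytime_means(raw)
print(toy_means)
```

The late-night trip is excluded by `between_time`, and the two trips in the 6 AM hour are averaged into one row.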

Step 2: Prepare hourly precipitation data

In [27]:
df = pd.read_csv('C:/Users/rmbrm/Documents/TDI_challenge/1724467.csv')

In [28]:
weather = df[['DATE', 'HourlyPrecipitation']].copy()

In [29]:
import datetime as dt

In [30]:
weather['DATE'] = pd.to_datetime(df['DATE'])   
weather.set_index('DATE',inplace=True)

In [31]:
daytime = weather.between_time('6:00', '19:00')

In [None]:
df = daytime.fillna(0)

In [32]:
df.rename(index=str, columns={"HourlyPrecipitation": "precip"}, inplace=True)

In [33]:
df.precip = df.precip.replace({"T": .01})

In [34]:
cols1 = ['precip']
df[cols1] = df[cols1].replace({'s': ''}, regex=True)
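
The cleaning in the cells above (missing values to 0, trace readings marked "T" to 0.01, stripping the "s" suffix) leaves the column as strings. A minimal sketch of doing the same cleanup and the numeric conversion in one place follows. The helper name `clean_precip` and the sample values are hypothetical, and the sketch assumes the "s" is a flag appended to an otherwise numeric reading.

```python
import pandas as pd


def clean_precip(values, trace_amount=0.01):
    """Convert raw hourly precipitation strings to floats.

    Assumptions: "T" marks a trace amount, a trailing "s" is a flag on an
    otherwise numeric reading, and missing values mean no precipitation.
    """
    text = values.astype(str).str.strip()
    text = text.str.replace(r"s$", "", regex=True)  # drop the flag suffix
    text = text.replace({"T": str(trace_amount)})   # trace -> small number
    # Anything still non-numeric (including missing values) becomes NaN, then 0.
    return pd.to_numeric(text, errors="coerce").fillna(0.0)


raw_precip = pd.Series(["0.00", "T", "0.05s", None, "0.12"])
precip = clean_precip(raw_precip)
print(precip.tolist())
```

Converting to floats here means the later per-hour `sum()` adds numbers instead of concatenating strings.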

In [35]:
df.to_excel('C:/Users/rmbrm/Documents/TDI_challenge/data.xlsx')

In [36]:
df = pd.read_csv('C:/Users/rmbrm/Documents/TDI_challenge/data4.csv')

In [37]:
df.head()

Unnamed: 0,hour,ID,precip
0,2018-11-01 06,2018110106,0.0
1,2018-11-01 07,2018110107,0.0
2,2018-11-01 08,2018110108,0.0
3,2018-11-01 09,2018110109,0.0
4,2018-11-01 10,2018110110,0.0


In [38]:
dfw = df[['hour', 'ID']].copy()

In [39]:
dfw.drop_duplicates(subset ="hour", keep = 'first', inplace = True) 

In [40]:
dfb = dfw.drop(dfw.index[2351])

In [42]:
dfb = dfb.reset_index(drop=True)

In [43]:
df = pd.read_csv('C:/Users/rmbrm/Documents/TDI_challenge/data4.csv')

In [45]:
hourly = df[['ID', 'precip']].copy()

In [46]:
grouped = hourly['precip'].groupby(df['ID'])

In [47]:
final = grouped.sum()

In [48]:
df = final.to_frame().reset_index()

In [49]:
dfa = df.drop(df.index[2351])

In [51]:
last = pd.merge(dfa, dfb, on='ID', how='outer')

In [56]:
rain = last.iloc[0:794]

In [54]:
trips = means

In [59]:
rain_fixed=rain.drop([195], axis=0)

In [60]:
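# reset_index must be called; without the parentheses 'rain' is bound to the method itself
rain = rain_fixed.reset_index(drop=True)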

In [61]:
trips = trips.reset_index()

In [None]:
final = rain.join(trips)
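
`rain.join(trips)` aligns the two frames by row position, which is why mismatched rows had to be dropped by hand beforehand. A sketch of an alternative that matches rows on the hour itself is shown below. The toy frames are hypothetical stand-ins; only the `hour` string format is taken from the `data4.csv` output shown earlier.

```python
import pandas as pd

# Hypothetical stand-ins for the hourly precipitation and trip frames.
rain_toy = pd.DataFrame({
    "hour": ["2018-11-01 06", "2018-11-01 07", "2018-11-01 08"],
    "precip": [0.0, 0.02, 0.0],
})
trips_toy = pd.DataFrame({
    "start_dt": pd.to_datetime(["2018-11-01 06:00", "2018-11-01 08:00"]),
    "duration": [1204.7, 1276.1],
    "cost": [16.5, 14.1],
})

# Parse the hour strings into timestamps so both frames share one key type.
rain_toy["start_dt"] = pd.to_datetime(rain_toy["hour"], format="%Y-%m-%d %H")

# An inner merge keeps only hours present in both frames; validate raises
# if either side contains a duplicated hour.
merged = rain_toy.merge(trips_toy, on="start_dt", how="inner", validate="one_to_one")
print(merged[["start_dt", "precip", "duration", "cost"]])
```

With a key-based merge, an hour missing from either source is simply dropped instead of shifting every later row out of alignment.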

In [65]:
final = pd.read_pickle('C:/Users/rmbrm/Documents/TDI_challenge/final')

In [66]:
import plotly.graph_objs as go
import plotly.offline as pyo
# Only offline plotting is used below, so no Plotly account credentials are required
pyo.init_notebook_mode(connected=True)

In [67]:
trace = go.Scatter(
    x = final['precip'],
    y = final['duration'],
    mode = 'markers'
)
layout= go.Layout(
    title= 'Hourly Precipitation and Average Duration of Rideshare Ride',
    hovermode= 'closest',
    xaxis= dict(
        title= 'Precipitation',
        ticklen= 5,
        zeroline= False,
        gridwidth= 2,
    ),
    yaxis=dict(
        title= 'Ride Duration',
        ticklen= 5,
        gridwidth= 2,
    ),
    showlegend= False
)
#data = [trace]
fig= go.Figure(data=[trace], layout=layout)

# Plot and embed in ipython notebook!
pyo.iplot(fig)

In [69]:
trace = go.Scatter(
    x = final['precip'],
    y = final['cost'],
    mode = 'markers'
)
layout= go.Layout(
    title= 'Hourly Precipitation and Average Cost of Rideshare Ride',
    hovermode= 'closest',
    xaxis= dict(
        title= 'Precipitation',
        ticklen= 5,
        zeroline= False,
        gridwidth= 2,
    ),
    yaxis=dict(
        title= 'Ride Cost',
        ticklen= 5,
        gridwidth= 2,
    ),
    showlegend= False
)
#data = [trace]
fig= go.Figure(data=[trace], layout=layout)

# Plot and embed in ipython notebook!
pyo.iplot(fig)
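
A correlation coefficient can put a number on what the scatterplots show. This is a minimal sketch with made-up hourly values, not results from the notebook; in practice the columns would come from `final['precip']`, `final['duration']` and `final['cost']`.

```python
import pandas as pd

# Hypothetical hourly values for illustration only.
toy = pd.DataFrame({
    "precip": [0.0, 0.0, 0.01, 0.05, 0.10],
    "duration": [1100.0, 1250.0, 1180.0, 1210.0, 1150.0],
    "cost": [15.0, 14.5, 16.0, 15.2, 14.8],
})

# Pearson correlation of precipitation with each outcome; values near 0
# indicate no linear relationship.
correlations = toy[["duration", "cost"]].corrwith(toy["precip"])
print(correlations)
```

Because most hours in this period recorded zero precipitation, any coefficient computed on the real data would rest on very few non-zero points and should be read with caution.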