# Chicago Rideshare Services Open Data Source Project

## Description

#### Rideshare services have become increasingly popular. Nearly 36% of Americans reported using a ridesharing service in 2018, an increase of 15% from 2015, amounting to over 13 billion dollars in revenue for major players in the market. Rideshares offer affordability and convenience not typically matched by traditional taxi services. Nevertheless, rideshare services occupy a very competitive market and must address the challenges facing both of their user bases: the driver and the rider.

#### A great deal of thought and resources has gone into optimizing the rider experience and trip routes, but comparatively little has been done to improve the driver experience. Rideshare drivers turn over almost completely every two years. Retaining trained, well-qualified, experienced drivers goes a long way toward a satisfied rider base. While some companies have launched programs to provide more robust support to their drivers (e.g. college tuition, insurance, vehicle maintenance, phone services), rideshare drivers are first and foremost interested in maximizing their pay while minimizing their time on the road. Generally, drivers rely on instinct and surge notifications to dictate their trip habits. Moreover, they are often penalized for skipping prospective rides. But what if they could plan ahead, incorporating their own needs, preferences, and plans using predictive models?

### Imports

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
from bokeh.plotting import figure, show, output_notebook
import matplotlib.pyplot as plt
from datetime import datetime
import numpy as np
import pandas as pd
from sodapy import Socrata

output_notebook()

Image(url= "https://upload.wikimedia.org/wikipedia/commons/2/24/Map_of_the_Community_Areas_and_%27Sides%27_of_the_City_of_Chicago.svg")





### Access API to explore data 

In [2]:
client = Socrata("data.cityofchicago.org", None)

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("m6dm-c72p", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
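
#### The call above retrieves a single page of 2000 rows. A minimal sketch of how more rows could be collected through `limit` and `offset` paging follows. The helper name `fetch_all` and its parameters are hypothetical, and the page fetcher is passed in as a function so the logic can be tried without network access.

```python
def fetch_all(get_page, page_size=1000, max_rows=5000):
    """Collect rows page by page until a short page or max_rows is reached."""
    rows = []
    offset = 0
    while len(rows) < max_rows:
        limit = min(page_size, max_rows - len(rows))
        page = get_page(limit=limit, offset=offset)
        rows.extend(page)
        if len(page) < limit:  # a short page means the dataset is exhausted
            break
        offset += len(page)
    return rows


# Hypothetical usage with the client defined above (requires network access):
# rows = fetch_all(lambda limit, offset: client.get("m6dm-c72p", limit=limit, offset=offset))
```

#### Paging is only repeatable if the rows come back in a fixed order, so in practice an explicit sort order would likely be needed as well.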



In [3]:
results_df.head()

Unnamed: 0,additional_charges,dropoff_census_tract,dropoff_centroid_latitude,dropoff_centroid_location,dropoff_centroid_longitude,dropoff_community_area,fare,pickup_census_tract,pickup_centroid_latitude,pickup_centroid_location,...,pickup_community_area,shared_trip_authorized,tip,trip_end_timestamp,trip_id,trip_miles,trip_seconds,trip_start_timestamp,trip_total,trips_pooled
0,0.72,17031061901.0,41.9432371225,"{'type': 'Point', 'coordinates': [-87.64347095...",-87.6434709559,6,0,17031243500.0,41.8926581076,"{'type': 'Point', 'coordinates': [-87.65253448...",...,24,True,0,2019-03-20T18:00:00.000,741c619085e3ac071b903633a589ac8808edc613,5.657066252,1560,2019-03-20T17:30:00.000,0.72,4
1,2.55,17031071000.0,41.9217014922,"{'type': 'Point', 'coordinates': [-87.65591184...",-87.6559118484,7,5,17031842200.0,41.9049353016,"{'type': 'Point', 'coordinates': [-87.64990722...",...,8,False,3,2019-02-03T14:30:00.000,741c62a99f2eaeea5f1ddfe8f5eec47d7699f402,0.96383631696,363,2019-02-03T14:30:00.000,10.55,1
2,2.55,,41.8390869059,"{'type': 'Point', 'coordinates': [-87.71400380...",-87.714003807,30,5,,41.8390869059,"{'type': 'Point', 'coordinates': [-87.71400380...",...,30,False,0,2019-02-24T13:00:00.000,741c62b7feccf70a5e5ce59a4b8faa41f38a5342,1.85831278388658,459,2019-02-24T13:00:00.000,7.55,1
3,2.55,17031062300.0,41.9416281,"{'type': 'Point', 'coordinates': [-87.66144336...",-87.6614433685,6,5,17031070300.0,41.9290469366,"{'type': 'Point', 'coordinates': [-87.65131087...",...,7,False,0,2019-02-24T03:30:00.000,741c633a886b101acf0edb42a39bb8d98dbc0eb5,1.86034464177145,545,2019-02-24T03:30:00.000,7.55,1
4,2.55,,41.899602111,"{'type': 'Point', 'coordinates': [-87.63330803...",-87.6333080367,8,10,,41.9012069941,"{'type': 'Point', 'coordinates': [-87.67635598...",...,24,False,0,2019-01-03T16:00:00.000,741c63492d401d5bc7bc3ffa439e55bcc26b9ee5,3.90897899237622,1001,2019-01-03T15:45:00.000,12.55,1
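
#### The records returned by the API appear to arrive as strings, so numeric and timestamp columns probably need converting before any arithmetic. The sketch below assumes that is the case. The helper name `coerce_trip_types` is hypothetical, and the toy frame only imitates a few of the columns shown above.

```python
import pandas as pd


def coerce_trip_types(df):
    """Return a copy with numeric and timestamp columns converted from strings."""
    out = df.copy()
    numeric_cols = ["fare", "tip", "additional_charges", "trip_total",
                    "trip_miles", "trip_seconds", "trips_pooled"]
    time_cols = ["trip_start_timestamp", "trip_end_timestamp"]
    for col in numeric_cols:
        if col in out.columns:
            # Unparseable values become NaN instead of raising
            out[col] = pd.to_numeric(out[col], errors="coerce")
    for col in time_cols:
        if col in out.columns:
            out[col] = pd.to_datetime(out[col], errors="coerce")
    return out


# Toy frame mimicking the string-valued records shown above
sample = pd.DataFrame({
    "fare": ["5", "10"],
    "trip_miles": ["0.96383631696", "3.90897899237622"],
    "trip_start_timestamp": ["2019-02-03T14:30:00.000", "2019-01-03T15:45:00.000"],
})
typed = coerce_trip_types(sample)
print(typed.dtypes)
```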


### Downloaded csv file to access larger dataset

In [4]:
filename = 'TDI/Transportation_Network_Providers_Trips.csv'

# Load only the first 5 million rows of the downloaded file
raw_df = pd.read_csv(filename, nrows=5000000)
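
#### Reading only the first 5 million rows leaves the rest of the file unused. If only aggregate counts are needed, one option is to stream the whole file in chunks. This is a sketch under that assumption: the helper name `count_pickups_in_chunks` is hypothetical, and the toy CSV stands in for the real file.

```python
import io
import pandas as pd


def count_pickups_in_chunks(csv_source, chunksize=1_000_000):
    """Count trips per pickup community area while streaming the CSV in chunks."""
    col = "Pickup Community Area"
    total = None
    for chunk in pd.read_csv(csv_source, usecols=[col], chunksize=chunksize):
        counts = chunk[col].value_counts()
        # Add this chunk's counts to the running total, treating missing areas as 0
        total = counts if total is None else total.add(counts, fill_value=0)
    if total is None:
        return pd.Series(dtype="int64")
    return total.astype(int).sort_values(ascending=False)


# Toy CSV with made-up rows standing in for the downloaded file
csv_text = (
    "Trip ID,Pickup Community Area,Dropoff Community Area\n"
    "a,8,32\nb,8,8\nc,32,8\nd,8,28\n"
)
pickup_counts = count_pickups_in_chunks(io.StringIO(csv_text), chunksize=2)
print(pickup_counts)
```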

### Code to determine the most common pair of pickup and drop off combinations

In [5]:
# Count trips for each (pickup, dropoff) community area pair
pickup_df = raw_df[['Trip ID', 'Pickup Community Area','Dropoff Community Area']]\
.groupby(['Pickup Community Area', 'Dropoff Community Area']).count()

pickup_df.reset_index(inplace=True)
pickup_df.rename(columns={'Trip ID':'counts'}, inplace=True)
   

In [6]:
pickup_df.head()

Unnamed: 0,Pickup Community Area,Dropoff Community Area,counts
0,1.0,1.0,10324
1,1.0,2.0,4660
2,1.0,3.0,3197
3,1.0,4.0,1581
4,1.0,5.0,876


In [7]:
npickup_df = pickup_df.sort_values(by='counts', ascending=False)
npickup_df.head()

Unnamed: 0,Pickup Community Area,Dropoff Community Area,counts
537,8.0,8.0,208973
561,8.0,32.0,117199
2234,32.0,8.0,96982
557,8.0,28.0,90423
1928,28.0,8.0,85851
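
#### The group, count, and sort steps above can also be written as a single chain. This is a sketch rather than the notebook's own method. The helper name `top_trip_pairs` is hypothetical, and it uses `size()`, which counts rows, whereas `count()` on `Trip ID` counts non-null trip IDs.

```python
import pandas as pd


def top_trip_pairs(trips, n=5):
    """Return the n most frequent (pickup, dropoff) pairs with their trip counts."""
    cols = ["Pickup Community Area", "Dropoff Community Area"]
    return (trips.groupby(cols).size()
                 .nlargest(n)
                 .rename("counts")
                 .reset_index())


# Toy data with made-up trips
toy_trips = pd.DataFrame({
    "Trip ID": list("abcdef"),
    "Pickup Community Area": [8, 8, 8, 8, 8, 32],
    "Dropoff Community Area": [8, 8, 8, 32, 32, 8],
})
top_pairs = top_trip_pairs(toy_trips, n=2)
print(top_pairs)
```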


In [8]:
# Build "(pickup,dropoff)" labels in descending order of trip count
x_list = [
    '(' + str(pickup) + ',' + str(dropoff) + ')'
    for pickup, dropoff in zip(npickup_df['Pickup Community Area'],
                               npickup_df['Dropoff Community Area'])
]

In [9]:
x_values = x_list[:11]
y_values = list(npickup_df.counts)[:11]

p3 = figure(x_range=x_values, height=400, width=1000, title="Most Common Trips by Region", tools="")

p3.vbar(x=x_values, top=y_values, width=0.9)

p3.xgrid.grid_line_color = None
p3.y_range.start = 0

p3.yaxis.axis_label = 'Number of Trips'
p3.xaxis.axis_label = '(Pickup Region, Dropoff Region)'

show(p3)

#### This figure shows the most common trips (combinations of pickup and dropoff locations) by city region. Notably, region 8, the Near North Side, is one of three regions that constitute Central Chicago and is home to the Magnificent Mile. Another popular region, region 32 (the Loop), is also part of Central Chicago and contains the city's commercial core. Region 28, the Near West Side, is adjacent to the Loop and boasts the United Center arena and the University of Illinois at Chicago. Region 6, Lake View, the largest region by population, also makes the list. It contains Wrigley Field, home of the Chicago Cubs.
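
#### The numeric axis labels could be made easier to read by mapping region numbers to names. The sketch below covers only the four regions named in the paragraph above; `AREA_NAMES` and `pair_label` are hypothetical names, and any other region falls back to a generic label.

```python
# Region numbers and names taken from the description above; other regions are not listed
AREA_NAMES = {8: "Near North Side", 32: "Loop", 28: "Near West Side", 6: "Lake View"}


def pair_label(pickup, dropoff, names=AREA_NAMES):
    """Build a readable label for a (pickup, dropoff) pair of region numbers."""
    def name(area):
        code = int(area)  # region numbers are stored as floats such as 8.0
        return names.get(code, f"Area {code}")
    return f"{name(pickup)} -> {name(dropoff)}"


print(pair_label(8.0, 32.0))  # Near North Side -> Loop
```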

### Code to determine most popular pickup spots by time of day

In [10]:
# Parse the 12-hour timestamp strings so the hour of day (0-23) can be extracted
time_df = raw_df[['Trip ID', 'Pickup Community Area','Trip Start Timestamp']].copy()
time_df['new_timestamp'] = time_df['Trip Start Timestamp'].apply(lambda x : datetime.strptime(x,"%m/%d/%Y %I:%M:%S %p"))
time_df.reset_index(inplace=True, drop=True)                                                              
len(time_df)
    

5000000

In [11]:
time_df['hour'] = [x.hour for x in time_df.new_timestamp]
len(time_df)

5000000
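
#### Parsing 5 million strings one at a time with `apply` can be slow. A vectorized alternative, sketched here on made-up sample timestamps in the same format, is to let pandas parse the whole column and read the hour through the `.dt` accessor.

```python
import pandas as pd

# Made-up sample values in the same "%m/%d/%Y %I:%M:%S %p" format as the CSV
stamps = pd.Series([
    "11/01/2018 12:15:00 AM",
    "11/01/2018 01:30:00 PM",
    "12/31/2018 11:45:00 PM",
])

# Parse the whole column at once, then extract the hour of day (0-23)
parsed = pd.to_datetime(stamps, format="%m/%d/%Y %I:%M:%S %p")
hours = parsed.dt.hour
print(hours.tolist())  # [0, 13, 23]
```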

In [12]:
# Count trips by pickup community area and hour of day

new_time_df = time_df[['Trip ID','Pickup Community Area','hour']].groupby(['Pickup Community Area','hour']).count()
new_time_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Trip ID
Pickup Community Area,hour,Unnamed: 2_level_1
1.0,0,1808
1.0,1,1298
1.0,2,1167
1.0,3,915
1.0,4,1066


In [13]:
new_time_df.reset_index(inplace=True)

In [14]:
def get_time_series(df, comm_area):
    ts = df[df['Pickup Community Area'] == comm_area]
    return ts

In [15]:
sum_df = time_df[['Trip ID','Pickup Community Area']].groupby(['Pickup Community Area']).count()
sum_df.reset_index(inplace=True)
sum_df.sort_values(by='Trip ID', ascending=False, inplace=True)
top_comms = list(sum_df['Pickup Community Area'][:10])
sum_df[:10]

Unnamed: 0,Pickup Community Area,Trip ID
7,8.0,791327
27,28.0,434843
31,32.0,420547
5,6.0,338549
23,24.0,319906
6,7.0,272805
21,22.0,194594
75,76.0,154989
2,3.0,106992
32,33.0,88572


In [16]:
from bokeh.palettes import Category20c as palette
from bokeh.models import HoverTool
import itertools

colors = itertools.cycle(palette[20])

p = figure(width=1000, height=300, x_axis_type="linear")

# One line per top pickup community area; `name` lets the hover tool identify each line
for comm in top_comms:
    ts = get_time_series(new_time_df, comm)
    p.line(ts.hour, ts['Trip ID'], color=next(colors),
           legend_label=str(comm), name=str(comm))

p.xaxis.axis_label = 'Hour of Day (Military Time)'
p.yaxis.axis_label = 'Number of Trips Starting in Region'

# $name resolves to the hovered line's name, so each line reports its own community area
p.add_tools(HoverTool(tooltips=[
        ('Community area', "$name"),
        ('Hour', "@x"),
        ('Number of trips', "@y")
    ]))

show(p)

#### This figure shows the total number of rides originating in particular regions by hour of day. This could alert drivers to popular regions for picking up riders over the course of the day. Combined with data predicting dropoff location, as proposed in the larger project, it could help drivers plan rides in advance or suggest an immediate direction based on popular regions or predicted destinations.
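
#### One way to turn the hourly counts into something a driver could act on is to look up the busiest hour for each region. The sketch below assumes a frame shaped like `new_time_df` after `reset_index`. The helper name `peak_hours` is hypothetical and the counts are made up.

```python
import pandas as pd


def peak_hours(hourly_counts):
    """Return the hour with the most pickups for each community area."""
    table = hourly_counts.pivot(index="Pickup Community Area",
                                columns="hour",
                                values="Trip ID").fillna(0)
    # idxmax along the columns gives the hour label of the largest count per area
    return table.idxmax(axis=1)


# Made-up counts in the same layout as new_time_df
toy_hourly = pd.DataFrame({
    "Pickup Community Area": [8.0, 8.0, 8.0, 32.0, 32.0],
    "hour": [8, 17, 23, 8, 17],
    "Trip ID": [100, 250, 180, 300, 120],
})
peaks = peak_hours(toy_hourly)
print(peaks)
```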

## Conclusion and Next Steps

#### As described on Data is Plural, earlier this year Chicago became the first city to publish detailed data from rideshare services such as Uber and Lyft, which it terms Transportation Network Providers. Though this trip dataset covers only November and December 2018, it includes more than 17 million rides across Chicago's 77 regions. Exploratory analysis was conducted to determine the most frequent region-to-region trips. Further analysis revealed the most active pickup regions over the course of a day. Drivers could use this information to make better use of their time on the road by being selective about regions, and to chart a course with a higher likelihood of reaching a desired destination region.

#### In the proposed project, more specific longitude and latitude data for pickup and drop-off points will be used for an in-depth look at travel patterns. Additionally, a predictive model built on a regression algorithm would yield more accurate predictions of trip trends, giving rideshare drivers greater control over their time on the road.
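
#### The regression algorithm is not specified here. Purely as an illustration of what a first baseline might look like, the sketch below fits ordinary least squares to hourly trip counts using sine and cosine features of the hour. The function names are hypothetical, the data is synthetic, and this is not the proposed model.

```python
import numpy as np


def hour_features(hours):
    """Encode hour of day as an intercept plus one sine/cosine pair (24-hour cycle)."""
    hours = np.asarray(hours, dtype=float)
    angle = 2 * np.pi * hours / 24
    return np.column_stack([np.ones_like(hours), np.sin(angle), np.cos(angle)])


def fit_hourly_baseline(hours, counts):
    """Fit least-squares coefficients mapping hour features to trip counts."""
    coef, *_ = np.linalg.lstsq(hour_features(hours),
                               np.asarray(counts, dtype=float), rcond=None)
    return coef


def predict_hourly(coef, hours):
    """Predict trip counts for the given hours from fitted coefficients."""
    return hour_features(hours) @ coef


# Synthetic counts generated from a known daily cycle, not real trip data
hours = np.arange(24)
synthetic_counts = (100 + 30 * np.sin(2 * np.pi * hours / 24)
                    + 10 * np.cos(2 * np.pi * hours / 24))
coef = fit_hourly_baseline(hours, synthetic_counts)
print(coef)
```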