<a name="top"></a>
# Indego City Bike Trip Data Analysis with Google Maps APIs

Indego publishes quarterly data on city bike usage. The following analysis looks at data for Q3 2021.

Some questions this analysis will aim to answer:
- Which neighborhoods have the highest number of Indego stations?
- Which stations are most / least utilized?
- Are trip distances longer with electric bikes than standard bikes?
- How much revenue was generated from electric bike fees?

Google Maps APIs will be used to enhance the data set with the following:
- Neighborhood information for each bike station
- Distance between stations, following bike routes


**Jump to** *(if viewing through GitHub, jump-to links will not work)*
- [Load data and initialize API](#dataload)
- [Add Neighborhood from Google Maps](#addn)
- [Analyze station data](#stationdata)
- [Join trip data with station data](#jointripdata)
- [Clean trip data](#cleantripdata)
- [Analyze trip data](#analyzetripdata)
- [Add trip distance using Google Maps APIs](#adddistancedata)

[View source data from Indego](https://www.rideindego.com/about/data/)

## <a name="dataload"></a> Load data and initialize API
[Return to Top](#top)

Import the necessary libraries:
- dotenv to store API key
- googlemaps library to access Maps APIs (the original plan was to call the APIs directly, but Google has developed a set of Python libraries that make its APIs easier to work with)
- pandas and numpy for data analysis

In [6]:
# Install if not already present
# pip install -U googlemaps
# pip install python-dotenv 

import googlemaps
from datetime import datetime

import os
from dotenv import load_dotenv
import pandas as pd
import numpy as np

import json

Load data stored in the env file. It is primarily used to hold the Google API key, so that the key itself does not appear in this file

Load data stored in csv files, that include trip data and basic station data

In [7]:
# Load env 
load_dotenv("indego.env")

# Load data from trip and station csv files
trip_data = pd.read_csv("indego-trips-2021-q3.csv", low_memory=False)
#trip_data.index.name = None
station_data = pd.read_csv("indego-stations-2021-10-01.csv")

Initialize Google Maps

In [8]:
# Get the API key from the env file
api_key = os.getenv('API_KEY')

# Initialize google maps API
gmaps = googlemaps.Client(key=api_key)
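
`os.getenv` returns `None` when the variable is missing, for example if `indego.env` was not found. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` that is not part of the original notebook:

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if unset or empty.

    `require_env` is a hypothetical helper, shown for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check that indego.env was loaded")
    return value


# Usage in the notebook could look like:
#   gmaps = googlemaps.Client(key=require_env("API_KEY"))
```

Failing at this point gives a clearer message than an error raised later by the client or by the first API request.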

## <a name="addn"></a> Add Neighborhood to station data, using Geocode API
[Return to Top](#top)

- Get an overview of station data
- Add a Neighborhood attribute to station data, based on Google Maps geocode data

Print some basic info about station data

In [9]:
print("Rows and columns:")
print(station_data.shape)
print()
station_data.head(5)

Rows and columns:
(179, 4)



Unnamed: 0,Station_ID,Station_Name,Day of Go_live_date,Status
0,3000,Virtual Station,4/23/2015,Active
1,3004,Municipal Services Building Plaza,4/23/2015,Active
2,3005,"Welcome Park, NPS",4/23/2015,Active
3,3006,40th & Spruce,4/23/2015,Active
4,3007,"11th & Pine, Kahn Park",4/23/2015,Active


<br>
Add a Full_Name column that combines Station_Name with ', Philadelphia, PA'. This string will be sent as the address in the geocode request

In [10]:
station_data["Full_Name"] = station_data["Station_Name"] + (", Philadelphia, PA")
station_data.head(2)

Unnamed: 0,Station_ID,Station_Name,Day of Go_live_date,Status,Full_Name
0,3000,Virtual Station,4/23/2015,Active,"Virtual Station, Philadelphia, PA"
1,3004,Municipal Services Building Plaza,4/23/2015,Active,"Municipal Services Building Plaza, Philadelphi..."


<br>
Use the Google Maps geocode API to assign a Neighborhood to each station

- For each station, call the API and pass the full station name

- Parse the response to find the neighborhood value and assign it back to the station_data df


In [18]:
# Iterate over each station
for index, row in station_data.iterrows():
    # Call the geocode API for the current station
    # geocode returns a list of result dicts (empty if nothing matched)
    geocode = gmaps.geocode(row["Full_Name"])

    # Get the address_components list from the first result, if there is one
    gr = geocode[0]["address_components"] if geocode else []
    found_neighborhood = False

    # Iterate through address_components, looking for "neighborhood"
    # If found, assign the value into the df
    for r in gr:
        if "neighborhood" in r["types"]:
            station_data.loc[index, "Neighborhood"] = r["long_name"]
            found_neighborhood = True
    # If no neighborhood is found, assign NaN in the df
    if not found_neighborhood:
        print("Neighborhood not found")
        station_data.loc[index, "Neighborhood"] = np.nan

Neighborhood not found
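
The parsing step in the loop above can be pulled into a small function and checked without calling the API. This is a sketch only: `extract_neighborhood` is a hypothetical name, the sample response is hand-written with illustrative values, and unlike the loop above it returns the first matching component rather than the last.

```python
def extract_neighborhood(results):
    """Return the first neighborhood long_name in a geocode result list, else None."""
    if not results:
        return None
    for component in results[0].get("address_components", []):
        if "neighborhood" in component.get("types", []):
            return component.get("long_name")
    return None


# Hand-written sample shaped like a geocode result; the values are illustrative.
sample = [{
    "address_components": [
        {"long_name": "Philadelphia", "types": ["locality", "political"]},
        {"long_name": "Example Neighborhood", "types": ["neighborhood", "political"]},
    ]
}]

print(extract_neighborhood(sample))  # Example Neighborhood
print(extract_neighborhood([]))      # None
```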


<br>
Review Neighborhood data for any issues

In [19]:
# Find rows where Neighborhood == NaN
missing_neighborhood = station_data.loc[station_data["Neighborhood"].isna()]
print(missing_neighborhood.to_string())

     Station_ID  Station_Name Day of Go_live_date  Status                       Full_Name Neighborhood
145        3204  17th & Green          11/14/2019  Active  17th & Green, Philadelphia, PA          NaN


<br>
Assign a value manually for the row with a missing neighborhood

In [20]:
station_data.loc[145, "Neighborhood"] = "North Philadelphia"

## <a name="stationdata"></a> Analyze station data
[Return to Top](#top)

- View station data stats by neighborhood
- View active / inactive station stats
- View top neighborhoods with active Indego stations

<br>

Print a sorted list of neighborhoods by count of Indego stations

In [21]:
grouped = station_data.groupby("Neighborhood")
grouped.size().sort_values(ascending=False)

Neighborhood
North Philadelphia         35
University City            25
Center City                20
Center City East           12
Center City West           11
West Philadelphia           8
Rittenhouse Square          6
Graduate Hospital           6
Point Breeze                6
Washington Square West      4
South Philadelphia East     4
South Philadelphia          4
Bella Vista                 3
Queen Village               3
West Poplar                 3
Grays Ferry                 3
Society Hill                2
Old City                    2
Olde Kensington             2
Chinatown                   2
Mantua                      2
Gayborhood                  2
West Parkside               2
East Passyunk Crossing      2
Devil's Pocket              2
South Philadelphia West     2
Dickinson Narrows           1
North Philadelphia West     1
East Parkside               1
Melrose                     1
Pennsport                   1
Northern Liberties          1
dtype: int64

*It would be more helpful here to have a visualization of distribution across the city - to be added*

<br>
Are any stations inactive?

In [22]:
grouped = station_data.groupby("Status")
print(grouped.size().sort_values(ascending=False))

Status
Active      166
Inactive     13
dtype: int64


<br>
Find the top ten neighborhoods with active stations

In [24]:
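# Filter station_data directly, so this cell does not depend on the previous cell's grouped variable
grouped = station_data.loc[station_data["Status"] == "Active"].groupby("Neighborhood")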
grouped.size().sort_values(ascending=False).head(10)

Neighborhood
North Philadelphia        30
University City           22
Center City               19
Center City East          12
Center City West          11
West Philadelphia          8
Point Breeze               6
Graduate Hospital          5
Rittenhouse Square         5
Washington Square West     4
dtype: int64

## <a name="jointripdata"></a> Join trip data with station data
[Return to Top](#top)

In [25]:
# Run as needed to re-init trip_data from the csv file
#trip_data = pd.read_csv("indego-trips-2021-q3.csv", low_memory=False)

Print some basic information on trip data

In [26]:
print("Rows and columns:")
print(trip_data.shape)
print()
trip_data.info()
trip_data.head(5)

Rows and columns:
(300432, 15)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300432 entries, 0 to 300431
Data columns (total 15 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   trip_id              300432 non-null  int64  
 1   duration             300432 non-null  int64  
 2   start_time           300432 non-null  object 
 3   end_time             300432 non-null  object 
 4   start_station        300432 non-null  int64  
 5   start_lat            300412 non-null  float64
 6   start_lon            300412 non-null  float64
 7   end_station          300432 non-null  int64  
 8   end_lat              296273 non-null  float64
 9   end_lon              296273 non-null  float64
 10  bike_id              300432 non-null  object 
 11  plan_duration        300432 non-null  int64  
 12  trip_route_category  300432 non-null  object 
 13  passholder_type      300432 non-null  object 
 14  bike_type            300432 non-null 

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,start_lat,start_lon,end_station,end_lat,end_lon,bike_id,plan_duration,trip_route_category,passholder_type,bike_type
0,398698761,11,7/1/2021 0:00,7/1/2021 0:11,3045,39.947922,-75.162369,3030,39.93935,-75.157158,3360,30,One Way,Indego30,standard
1,398698759,4,7/1/2021 0:02,7/1/2021 0:06,3052,39.947319,-75.156952,3238,39.946281,-75.151382,5420,30,One Way,Indego30,standard
2,398698757,56,7/1/2021 0:03,7/1/2021 0:59,3192,39.96207,-75.141113,3161,39.954861,-75.180908,18450,30,One Way,Indego30,electric
3,398698755,55,7/1/2021 0:04,7/1/2021 0:59,3192,39.96207,-75.141113,3161,39.954861,-75.180908,16508,30,One Way,Indego30,electric
4,398698753,5,7/1/2021 0:08,7/1/2021 0:13,3052,39.947319,-75.156952,3046,39.950119,-75.144722,3475,365,One Way,Indego365,standard


Join station data with trip data

In [27]:
# Join start station data
ts_data = pd.merge(left=trip_data, right=station_data[["Station_ID", "Full_Name", "Neighborhood"]], how="left", left_on="start_station", right_on="Station_ID").drop(columns=["Station_ID"])

# Join end station data
ts_data = pd.merge(left=ts_data, right=station_data[["Station_ID", "Full_Name", "Neighborhood"]], how="left", left_on="end_station", right_on="Station_ID").drop(columns=["Station_ID"])

# Rename merged columns 
ts_data = ts_data.rename(columns={"Full_Name_x":"start_name", "Neighborhood_x":"start_neighborhood"})
ts_data = ts_data.rename(columns={"Full_Name_y":"end_name", "Neighborhood_y":"end_neighborhood"})

ts_data.head(2)

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,start_lat,start_lon,end_station,end_lat,end_lon,bike_id,plan_duration,trip_route_category,passholder_type,bike_type,start_name,start_neighborhood,end_name,end_neighborhood
0,398698761,11,7/1/2021 0:00,7/1/2021 0:11,3045,39.947922,-75.162369,3030,39.93935,-75.157158,3360,30,One Way,Indego30,standard,"13th & Locust, Philadelphia, PA",Washington Square West,"Darien & Catharine, Philadelphia, PA",Bella Vista
1,398698759,4,7/1/2021 0:02,7/1/2021 0:06,3052,39.947319,-75.156952,3238,39.946281,-75.151382,5420,30,One Way,Indego30,standard,"9th & Locust, Philadelphia, PA",Washington Square West,"6th & S Washington Square, Philadelphia, PA",Society Hill
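
A left join leaves NaN in the joined columns for any trip whose station ID is missing from the station table. One way to surface such rows is the `indicator` option of `pd.merge`. The sketch below uses made-up toy frames, not the Indego files:

```python
import pandas as pd

# Toy frames for illustration; the IDs and neighborhood labels are made up.
trips = pd.DataFrame({"trip_id": [1, 2, 3], "start_station": [3004, 3005, 9999]})
stations = pd.DataFrame({"Station_ID": [3004, 3005], "Neighborhood": ["A", "B"]})

# indicator=True adds a _merge column: "both" or "left_only" for a left join
checked = pd.merge(
    left=trips,
    right=stations,
    how="left",
    left_on="start_station",
    right_on="Station_ID",
    indicator=True,
)

# Trips whose start station has no row in the station table
unmatched = checked.loc[checked["_merge"] == "left_only", "trip_id"].tolist()
print(unmatched)  # [3]
```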


## <a name="cleantripdata"></a> Clean trip data
[Return to Top](#top)

Check how many trips start or end at the Virtual Station (ID 3000) - these will be removed from the dataset

In [28]:
ts_data.loc[(ts_data['start_station'] == 3000) | (ts_data['end_station'] == 3000)]

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,start_lat,start_lon,end_station,end_lat,end_lon,bike_id,plan_duration,trip_route_category,passholder_type,bike_type,start_name,start_neighborhood,end_name,end_neighborhood
74,398758095,15,7/1/2021 5:29,7/1/2021 5:44,3125,39.943909,-75.167351,3000,,,3638,30,One Way,Indego30,standard,"15th & South, Philadelphia, PA",Rittenhouse Square,"Virtual Station, Philadelphia, PA",Center City
315,398768712,15,7/1/2021 8:09,7/1/2021 8:24,3049,39.945091,-75.142502,3000,,,19813,365,One Way,Indego365,electric,"Foglietta Plaza, Philadelphia, PA",Center City,"Virtual Station, Philadelphia, PA",Center City
333,399033738,10,7/1/2021 8:20,7/1/2021 8:30,3170,39.944260,-75.181343,3000,,,18160,30,One Way,Indego30,electric,"Grays Ferry & Pemberton, Philadelphia, PA",Devil's Pocket,"Virtual Station, Philadelphia, PA",Center City
359,398777836,6,7/1/2021 8:34,7/1/2021 8:40,3032,39.945271,-75.179710,3000,,,17200,30,One Way,Indego30,electric,"23rd & South, Philadelphia, PA",South Philadelphia,"Virtual Station, Philadelphia, PA",Center City
539,398786572,14,7/1/2021 9:44,7/1/2021 9:58,3008,39.979439,-75.151138,3000,,,19793,30,One Way,Indego30,electric,"Temple University Station, Philadelphia, PA",North Philadelphia,"Virtual Station, Philadelphia, PA",Center City
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299977,428447431,16,9/30/2021 20:11,9/30/2021 20:27,3052,39.947319,-75.156952,3000,,,11757,365,One Way,Indego365,standard,"9th & Locust, Philadelphia, PA",Washington Square West,"Virtual Station, Philadelphia, PA",Center City
300012,428340686,1,9/30/2021 20:23,9/30/2021 20:24,3185,39.951691,-75.158882,3000,,,17727,30,One Way,Indego30,electric,"11th & Market, Philadelphia, PA",Center City East,"Virtual Station, Philadelphia, PA",Center City
300052,428477694,15,9/30/2021 20:38,9/30/2021 20:53,3114,39.937752,-75.180122,3000,,,18797,30,One Way,Indego30,electric,"22nd & Federal, Philadelphia, PA",Point Breeze,"Virtual Station, Philadelphia, PA",Center City
300211,428462247,10,9/30/2021 21:41,9/30/2021 21:51,3046,39.950119,-75.144722,3000,,,19845,30,One Way,Indego30,electric,"2nd & Market, Philadelphia, PA",Center City East,"Virtual Station, Philadelphia, PA",Center City


<br><br>
Drop trips that include a Virtual location

In [29]:
ts_data = ts_data.loc[(ts_data['start_station'] != 3000) & (ts_data['end_station'] != 3000)]
print("Updated row and column count:")
print(ts_data.shape)

Updated row and column count:
(296268, 19)


<br><br>
Check that original row count - virtual row count = current row count

In [30]:
300432-4164 == len(ts_data)

True

<br><br>
Drop trips longer than 3 hours - for this analysis, assume these are a result of user error

In [31]:
# Only keep trips where duration <= 180 minutes
ts_data = ts_data.loc[ts_data["duration"]<= 180]

# Output number of trips
len(ts_data)

294912

<br><br>
Drop round trips of 5 minutes or less - for this analysis, assume these are a result of user error

In [32]:
# Keep round trips where duration > 5 minutes, as well as all one way trips 
ts_data = ts_data.loc[((ts_data["duration"] > 5) & (ts_data["trip_route_category"] == "Round Trip")) | ((ts_data["trip_route_category"] == "One Way"))]

# Output number of trips 
len(ts_data)

284458
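
The three cleaning rules above can also be bundled into one function that reports how many rows each rule removes, which avoids hard-coding counts in the row-count check. This is a sketch on made-up toy data; `clean_trips` and `VIRTUAL_STATION_ID` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

VIRTUAL_STATION_ID = 3000  # ID of the Virtual Station in the station table


def clean_trips(df):
    """Apply the three cleaning rules and report the rows removed by each."""
    removed = {}

    # Rule 1: drop trips that start or end at the Virtual Station
    keep = (df["start_station"] != VIRTUAL_STATION_ID) & (df["end_station"] != VIRTUAL_STATION_ID)
    removed["virtual"] = int((~keep).sum())
    df = df.loc[keep]

    # Rule 2: drop trips longer than 180 minutes
    keep = df["duration"] <= 180
    removed["over_3_hours"] = int((~keep).sum())
    df = df.loc[keep]

    # Rule 3: keep all one way trips, and round trips longer than 5 minutes
    one_way = df["trip_route_category"] == "One Way"
    long_round = (df["trip_route_category"] == "Round Trip") & (df["duration"] > 5)
    keep = one_way | long_round
    removed["short_round_trip"] = int((~keep).sum())
    df = df.loc[keep]

    return df.copy(), removed


# Toy trips for illustration only
toy = pd.DataFrame({
    "start_station": [3004, 3000, 3005, 3006, 3007],
    "end_station": [3005, 3004, 3005, 3006, 3008],
    "duration": [12, 10, 3, 45, 200],
    "trip_route_category": ["One Way", "One Way", "Round Trip", "Round Trip", "One Way"],
})

cleaned, removed = clean_trips(toy)
print(removed)
# Same consistency check as above, without hard-coded numbers
print(len(toy) - sum(removed.values()) == len(cleaned))  # True
```

Returning a copy also means later column assignments operate on an independent DataFrame rather than on a slice of the original.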

## <a name="analyzetripdata"></a> Analyze trip data
[Return to Top](#top)

<br>

**Which pairs of neighborhoods have the most trips between them?**

In [33]:
# Group by start neighborhood and end neighborhood
grouped = ts_data.groupby(['start_neighborhood','end_neighborhood']).size().sort_values(ascending=False)

# Output the top 15 routes
grouped.head(15)

start_neighborhood  end_neighborhood  
University City     University City       15109
North Philadelphia  North Philadelphia    11359
Center City         Center City            9417
North Philadelphia  Center City            5895
Center City East    Center City East       5822
Center City         North Philadelphia     5753
University City     Center City West       5307
Center City West    University City        5036
                    Center City West       4490
                    Center City            3762
Center City         Center City West       3676
Center City East    Center City            3640
Center City         Center City East       3584
University City     Center City            3245
Center City         University City        3193
dtype: int64

<br>

**Which station has the fewest trips (start or end at that station)?**

In [50]:
# Create a df with count of trips started for each station
grouped = ts_data.groupby('start_name').size().reset_index(name='sum')

# Create a df with count of trips ended for each station
add_to_grouped = ts_data.groupby('end_name').size().reset_index(name='sum')

# Rename the end_name column to match start_name, so dfs can be concatenated
add_to_grouped = add_to_grouped.rename(columns={"end_name":"start_name"})

# Concatenate the dfs (DataFrame.append was removed in pandas 2.0, so use pd.concat)
grouped = pd.concat([grouped, add_to_grouped]).groupby('start_name').sum()

# Output the station with the lowest number of trips
grouped.nsmallest(1, "sum")

Unnamed: 0_level_0,sum
start_name,Unnamed: 1_level_1
"4th & Wood, Philadelphia, PA",20

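
An alternative to renaming and concatenating is to count starts and ends separately and add the two Series. The sketch below uses made-up station names. As in the approach above, a round trip counts twice for its station.

```python
import pandas as pd

# Toy trips for illustration; station names are made up.
toy = pd.DataFrame({
    "start_name": ["A", "A", "B", "C"],
    "end_name": ["B", "A", "A", "B"],
})

# Count starts and ends separately, then add them station by station.
# fill_value=0 covers stations that appear only as a start or only as an end.
totals = (
    toy["start_name"].value_counts()
    .add(toy["end_name"].value_counts(), fill_value=0)
    .astype(int)
)

print(totals.sort_values().head(1))  # station with the fewest trips
```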

<br>

**Which stations have the most trips?**

In [51]:
grouped.nlargest(5, "sum")

Unnamed: 0_level_0,sum
start_name,Unnamed: 1_level_1
"16th & Chestnut, Philadelphia, PA",10453
"15th & Spruce, Philadelphia, PA",9871
"34th & Spruce, Philadelphia, PA",9774
"17th & Locust, Philadelphia, PA",9581
"Schuylkill Banks Pergola, Philadelphia, PA",9571


**How much revenue was generated in electric bike fees?**
- Trips on an electric bike are charged an extra $0.15 per minute

In [69]:
# Set all charges to NaN, then populate charges as applicable
ts_data["electric_charges"] = np.nan

# Where (bike_type == electric), set electric_charges to duration * .15
ts_data.loc[ts_data["bike_type"] == "electric", "electric_charges"] = ts_data["duration"] * .15

total = ts_data["electric_charges"].sum()
print("Charges for electric bikes totaled ${:,.2f} in Q3 2021".format(total))

Charges for electric bikes totaled $249,957.75 in Q3 2021


**How much revenue was generated in overage charges in Q3 2021?** 
- Guest passes and walk up passes charge $0.15 per minute after the first 30 minutes
- 30 day passes and annual passes charge $0.15 per minute after the first hour

See how many trips of each passholder type are in the dataset

In [67]:
grouped = ts_data.groupby('passholder_type').size().sort_values(ascending=False)
grouped

passholder_type
Indego30     206220
Indego365     44077
Day Pass      34159
Walk-up           2
dtype: int64

Calculate the overages based on trip duration and passholder type

In [70]:
# Set all overages to NaN, then populate overages as applicable
ts_data["overage_charges"] = np.nan

# Where (passholder_type == Indego30 | Indego365) & duration > 60, set overage to (duration - 60) * .15
subscriber = ts_data["passholder_type"].isin(["Indego30", "Indego365"])
ts_data.loc[subscriber & (ts_data["duration"] > 60), "overage_charges"] = (ts_data["duration"] - 60) * .15

# Where (passholder_type == Day Pass | Walk-up) & duration > 30, set overage to (duration - 30) * .15
short_term = ts_data["passholder_type"].isin(["Day Pass", "Walk-up"])
ts_data.loc[short_term & (ts_data["duration"] > 30), "overage_charges"] = (ts_data["duration"] - 30) * .15

total = ts_data["overage_charges"].sum()
print("Overage charges totaled ${:,.2f} in Q3 2021".format(total))

Overage charges totaled $59,008.80 in Q3 2021


<br><br>
Looking at some sample data below, verify the calculation worked properly
- Row 12 was an 88 minute trip for a Day Pass user: (88-30)x(.15) = 8.70
- Row 34 was a 62 minute trip for a monthly subscriber: (62-60)x(.15) = .30

In [62]:
ts_data.loc[ts_data["overage_charges"] > 0].head()

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,start_lat,start_lon,end_station,end_lat,end_lon,bike_id,plan_duration,trip_route_category,passholder_type,bike_type,start_name,start_neighborhood,end_name,end_neighborhood,overage_charges
12,398698734,88,7/1/2021 0:20,7/1/2021 1:48,3049,39.945091,-75.142502,3049,39.945091,-75.142502,16719,1,Round Trip,Day Pass,electric,"Foglietta Plaza, Philadelphia, PA",Center City,"Foglietta Plaza, Philadelphia, PA",Center City,8.7
20,398698719,73,7/1/2021 0:35,7/1/2021 1:48,3026,39.941818,-75.1455,3049,39.945091,-75.142502,2688,1,One Way,Day Pass,standard,"2nd & Lombard, Philadelphia, PA",Society Hill,"Foglietta Plaza, Philadelphia, PA",Center City,6.45
34,398707712,62,7/1/2021 1:07,7/1/2021 2:09,3237,39.917171,-75.170959,3049,39.945091,-75.142502,19813,30,One Way,Indego30,electric,"Broad & Oregon, Philadelphia, PA",South Philadelphia West,"Foglietta Plaza, Philadelphia, PA",Center City,0.3
35,398708115,77,7/1/2021 1:10,7/1/2021 2:26,3047,39.950729,-75.149467,3049,39.945091,-75.142502,11727,30,One Way,Indego30,standard,"Independence Mall, NPS, Philadelphia, PA",Center City East,"Foglietta Plaza, Philadelphia, PA",Center City,2.55
36,398708114,76,7/1/2021 1:10,7/1/2021 2:26,3047,39.950729,-75.149467,3049,39.945091,-75.142502,14669,30,One Way,Indego30,standard,"Independence Mall, NPS, Philadelphia, PA",Center City East,"Foglietta Plaza, Philadelphia, PA",Center City,2.4
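
The two spot checks above can be reproduced with a small standalone function. This is a sketch: `overage_cents` and `FREE_MINUTES` are hypothetical names, and amounts are kept in whole cents to avoid floating point rounding.

```python
RATE_CENTS_PER_MINUTE = 15  # $0.15 per minute, kept in cents

# Free minutes per pass type, following the rules described for overage charges
FREE_MINUTES = {
    "Day Pass": 30,
    "Walk-up": 30,
    "Indego30": 60,
    "Indego365": 60,
}


def overage_cents(passholder_type, duration_minutes):
    """Return the overage charge in cents for one trip (0 if within the free window)."""
    free = FREE_MINUTES[passholder_type]
    return max(duration_minutes - free, 0) * RATE_CENTS_PER_MINUTE


print(overage_cents("Day Pass", 88) / 100)  # 8.7, matches row 12
print(overage_cents("Indego30", 62) / 100)  # 0.3, matches row 34
```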


**Analysis below to be added**

What percentage of trips use standard bikes vs electric bikes? Do certain stations favor electric bikes?

What is the average trip duration for standard bikes vs electric bikes?

Do riders with day passes more often use standard or electric bikes?
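
One possible starting point for these questions, sketched on made-up toy data rather than the Indego trips. The column names match `ts_data`, but the values are illustrative only.

```python
import pandas as pd

# Toy data for illustration only; not taken from the Indego files.
toy = pd.DataFrame({
    "passholder_type": ["Day Pass", "Day Pass", "Indego30", "Indego30", "Indego30", "Indego365"],
    "bike_type": ["electric", "standard", "standard", "standard", "electric", "standard"],
    "duration": [40, 20, 10, 12, 30, 8],
})

# Share of trips by bike type
bike_share = toy["bike_type"].value_counts(normalize=True)

# Average trip duration by bike type
avg_duration = toy.groupby("bike_type")["duration"].mean()

# Bike type mix within each pass type (each row sums to 1)
pass_mix = pd.crosstab(toy["passholder_type"], toy["bike_type"], normalize="index")

print(bike_share)
print(avg_duration)
print(pass_mix)
```

The same crosstab with `start_name` in place of `passholder_type` would show whether certain stations favor electric bikes.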

## <a name="adddistancedata"></a> To be added - Add trip distance using Google Maps APIs
[Return to Top](#top)

**For each start station, build a list of end stations that have an existing trip**

**Then use the Google Distance Matrix API to calculate the distance between these stations**

<br><br>
Select all station (start, end) combinations that include a trip in the data set

**Out of 31,506 possible combinations, 19,072 have a trip in the data set**


In [607]:
ts_data.groupby(["start_station", "end_station"]).size().reset_index(name='count')

Unnamed: 0,start_station,end_station,count
0,3004,3004,27
1,3004,3005,5
2,3004,3006,3
3,3004,3007,4
4,3004,3008,9
...,...,...,...
19067,3256,3212,1
19068,3256,3248,1
19069,3256,3249,2
19070,3256,3255,1
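
Since this step is still to be added, the following is only a sketch of how a Distance Matrix response could be turned into per-pair distances. `parse_distance_matrix` is a hypothetical helper, and the sample response is hand-written in the shape the API returns (rows of elements with a `status` and a `distance.value` in meters); its values are made up.

```python
def parse_distance_matrix(response, origin_ids, destination_ids):
    """Map (origin_id, destination_id) to distance in meters, or None if no route."""
    distances = {}
    for origin_id, row in zip(origin_ids, response["rows"]):
        for destination_id, element in zip(destination_ids, row["elements"]):
            if element.get("status") == "OK":
                distances[(origin_id, destination_id)] = element["distance"]["value"]
            else:
                distances[(origin_id, destination_id)] = None
    return distances


# Hand-written sample in the shape of a Distance Matrix response; values are made up.
sample_response = {
    "status": "OK",
    "rows": [
        {"elements": [
            {"status": "OK", "distance": {"text": "1.2 km", "value": 1200}},
            {"status": "ZERO_RESULTS"},
        ]},
    ],
}

result = parse_distance_matrix(sample_response, [3004], [3005, 3006])
print(result)  # {(3004, 3005): 1200, (3004, 3006): None}

# In the notebook, the response would come from a call such as:
#   gmaps.distance_matrix(origins, destinations, mode="bicycling")
# where origins and destinations are lists of addresses or (lat, lon) tuples.
```

Grouping the (start, end) pairs by start station, as described above, keeps each request to one origin and many destinations, so only pairs with an actual trip need to be requested.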
