<a name="top"></a>
# Part 1: Enhance Indego Data with Google Maps APIs

Indego publishes quarterly data on city bike usage. The following will clean the data, enhance it with additional information, and store it locally so it can be used for future analysis.

Google Maps APIs will be used to enhance the data set with the following:
- Neighborhood information for each bike station
- Distance between stations, following bike routes

This notebook walks through the following steps *(when viewed on GitHub, the jump links below will not work)*:<br>
- [Load station data](#loadstation)
- [Add Neighborhood to station data, using Geocode API](#addneighborhood)
- [Write enhanced station data to a csv](#writestationdata)
- [Load trip data](#loadtripdata)
- [Clean trip data](#cleantripdata)
- [Join trip data with station data](#jointripdata)
- [Calculate distance for each trip, using Distance Matrix API](#distancedata)
- [Write enhanced trip data to a csv](#writetripdata)
- [Store data as a SQL database](#storeassqldb)

<br><br>
## <a name="loadstation"></a> Load station data

[Return to Top](#top)

Data on Indego stations is stored in a single csv file

In [213]:
import os
import pandas as pd
import numpy as np

# Load from csv in the current working directory
station_data = pd.read_csv(os.getcwd() + "/source_data/indego-stations-2021-10-01.csv")

In [214]:
# Print the first few rows of data
station_data.head()

Unnamed: 0,Station_ID,Station_Name,Day of Go_live_date,Status
0,3000,Virtual Station,4/23/2015,Active
1,3004,Municipal Services Building Plaza,4/23/2015,Active
2,3005,"Welcome Park, NPS",4/23/2015,Active
3,3006,40th & Spruce,4/23/2015,Active
4,3007,"11th & Pine, Kahn Park",4/23/2015,Active


<br>
Clean up the header info

In [215]:
# Get current column names
header = station_data.columns
new_header = []

# For each column name, convert to lowercase and replace any spaces with underscores
for h in header:
    clean = h.lower()
    clean = clean.replace(" ", "_")
    new_header.append(clean)
    
station_data.columns = new_header
print(station_data.columns)

Index(['station_id', 'station_name', 'day_of_go_live_date', 'status'], dtype='object')
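
The same header cleanup can also be written without an explicit loop. The sketch below is only an illustration on a toy DataFrame (`df` is a stand-in, not the real station file), using the string methods that pandas exposes on the column index.

```python
import pandas as pd

# Toy frame with headers shaped like the raw station file
df = pd.DataFrame(columns=["Station_ID", "Station_Name", "Day of Go_live_date", "Status"])

# Lowercase every header and swap spaces for underscores in one pass
df.columns = df.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(df.columns))
```

Either form gives the same column names; the loop above is easier to extend if more cleanup rules are needed later.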


<br>
Some stations have the same name - these will be assumed to be at the same location

In [273]:
station_data.loc[station_data["station_name"].duplicated(keep=False)]

Unnamed: 0,station_id,station_name,day_of_go_live_date,status,neighborhood
154,3213,Broad & Carpenter,9/16/2020,Inactive,Graduate Hospital
155,3214,Broad & Cecil B Moore,9/22/2020,Active,North Philadelphia
166,3244,Broad & Carpenter,7/15/2021,Active,Graduate Hospital
167,3245,Broad & Cecil B Moore,9/21/2021,Active,North Philadelphia


<br><br>
## <a name="addneighborhood"></a> Add Neighborhood to station data, using Geocode API
[Return to Top](#top)

Add a Neighborhood attribute to station data, based on Google Maps geocode data

<br>Initialize Google Maps API

In [216]:
import googlemaps
from dotenv import load_dotenv

# Load env file
load_dotenv("indego.env")

# Get the Google Maps API key from the env file
api_key = os.getenv('API_KEY')

# Initialize Google Maps API
gmaps = googlemaps.Client(key=api_key)

<br>
Use the Google Maps geocode API to assign a Neighborhood to each station

- For each station, call the API and pass the full station name
- Parse the response to find the neighborhood value and assign it back to the station_data df


Define a get_neighborhood function (in the Trip data section, there is some missing data that requires use of this function)

In [217]:
# Add a column 'neighborhood' to station data
station_data["neighborhood"] = ''

In [218]:
# get_neighborhood receives an address as a string
# returns neighborhood as a string or NaN if not found
def get_neighborhood(addr):
    # Call the geocode API for the given address
    # geocode returns a list of result dicts
    geocode = gmaps.geocode(addr)

    # If the API returned no results, there is nothing to parse
    if not geocode:
        return np.nan

    # Get the address_components list from the first result
    gr = geocode[0]["address_components"]

    # Iterate through address_components, looking for "neighborhood"
    # If found, return its long name
    for r in gr:
        if "neighborhood" in r["types"]:
            return r["long_name"]
    
    # If no neighborhood found, return NaN        
    return np.nan
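
To check the parsing logic without spending an API call, it can be run against a hand-written response. The sketch below assumes the response is a list of results, each with an `address_components` list; `extract_neighborhood` and the `sample` values are hypothetical and exist only for this illustration.

```python
import math

def extract_neighborhood(geocode_results):
    """Return the first 'neighborhood' long_name, or NaN if none is present."""
    # An empty result list means the address could not be geocoded
    if not geocode_results:
        return math.nan
    for component in geocode_results[0].get("address_components", []):
        if "neighborhood" in component.get("types", []):
            return component["long_name"]
    return math.nan

# Hand-written sample in the shape of a geocode response (values are illustrative)
sample = [{
    "address_components": [
        {"long_name": "Philadelphia", "short_name": "Philadelphia",
         "types": ["locality", "political"]},
        {"long_name": "Washington Square West", "short_name": "Washington Square West",
         "types": ["neighborhood", "political"]},
    ]
}]

print(extract_neighborhood(sample))
print(extract_neighborhood([]))
```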

In [219]:
# Iterate over each station
for index, row in station_data.iterrows():
    # In case API calls run into an issue and this block of code has to be run multiple times, 
    # the check below avoids re-populating rows that have already been returned
    if station_data.loc[index,"neighborhood"] == "":
        # Get neighborhood for the current station
        neighborhood = get_neighborhood(station_data.loc[index,"station_name"] + ", Philadelphia, PA")
        
        # Assign the result into the dataset
        station_data.loc[index,"neighborhood"] = neighborhood
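
Since some stations share a name, one possible refinement is to look up each distinct name once and map the result back onto every row. This is a sketch only: `fake_lookup` stands in for `get_neighborhood` so that no API call is made, and the toy `stations` frame holds just the two duplicate rows.

```python
import pandas as pd

calls = []

def fake_lookup(address):
    # Stand-in for get_neighborhood; records each call instead of hitting the API
    calls.append(address)
    return "Graduate Hospital"

stations = pd.DataFrame({
    "station_id": [3213, 3244],
    "station_name": ["Broad & Carpenter", "Broad & Carpenter"],
})

# One lookup per distinct name, then map the results back onto every row
lookup = {name: fake_lookup(name + ", Philadelphia, PA")
          for name in stations["station_name"].unique()}
stations["neighborhood"] = stations["station_name"].map(lookup)

print(len(calls))
print(stations)
```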

<br>
Find any rows that did not return a neighborhood result

In [220]:
# Find rows where neighborhood is NaN
missing_neighborhood = station_data.loc[station_data["neighborhood"].isna()]
print(missing_neighborhood.to_string())

     station_id  station_name day_of_go_live_date  status neighborhood
145        3204  17th & Green          11/14/2019  Active          NaN


<br>
Manually populate the data for this item

In [221]:
station_data.loc[145, "neighborhood"] = "North Philadelphia"

<br><br>
## <a name="writestationdata"></a> Write enhanced station data to a csv

[Return to Top](#top)

Store this data in a new csv file, so it can be used later

In [222]:
def station_to_csv():
    station_data.to_csv(os.getcwd() + "/staged_data/indego-stations-2021-10-01-enhanced.csv", index=False)

station_to_csv()

<br><br>
## <a name="loadtripdata"></a> Load trip data
[Return to Top](#top)

Trip data is stored in a series of csv files that need to be loaded

In [363]:
# List includes end of filename for each csv file
csv_list = ['2019-q1', '2019-q2', '2019-q3', '2019-q4', 
            '2020-q1', '2020-q2', '2020-q3', '2020-q4', 
            '2021-q1', '2021-q2', '2021-q3']

# Read each csv into its own df, then concatenate them into a single df
frames = []
for row in csv_list:
    fpath = os.getcwd() + "/source_data/indego-trips-" + row + ".csv"
    frames.append(pd.read_csv(fpath, dtype={'bike_id':'object'}))
trip_data = pd.concat(frames)

In [364]:
trip_data.columns

Index(['trip_id', 'duration', 'start_time', 'end_time', 'start_station',
       'start_lat', 'start_lon', 'end_station', 'end_lat', 'end_lon',
       'bike_id', 'plan_duration', 'trip_route_category', 'passholder_type',
       'bike_type'],
      dtype='object')

In [365]:
print("Number of trips: {:,}".format(len(trip_data)))

Number of trips: 2,108,330


In [366]:
trip_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2108330 entries, 0 to 300431
Data columns (total 15 columns):
 #   Column               Dtype  
---  ------               -----  
 0   trip_id              int64  
 1   duration             int64  
 2   start_time           object 
 3   end_time             object 
 4   start_station        int64  
 5   start_lat            float64
 6   start_lon            float64
 7   end_station          int64  
 8   end_lat              float64
 9   end_lon              float64
 10  bike_id              object 
 11  plan_duration        float64
 12  trip_route_category  object 
 13  passholder_type      object 
 14  bike_type            object 
dtypes: float64(5), int64(4), object(6)
memory usage: 257.4+ MB


<br><br>
## <a name="cleantripdata"></a> Clean trip data
[Return to Top](#top)

Remove data that is likely incorrect and would skew analysis if not removed

<br><br>
Drop trips that start or end at a virtual station (station id = 3000)

In [369]:
# Save the original number of trips in dataset
orig_num_trips = len(trip_data)

# Keep only trips where neither the start nor the end station is 3000
trip_data = trip_data.loc[(trip_data['start_station'] != 3000) & (trip_data['end_station'] != 3000)]

# Output number of trips
print("Original number of trips: {:,}".format(orig_num_trips))
print("Current number of trips: {:,}".format(len(trip_data)))

Original number of trips: 2,108,330
Current number of trips: 2,072,116


<br>Drop trips longer than 3 hours - for this analysis, assume these are a result of user error

In [370]:
# Keep only trips where duration <= 180 minutes
trip_data = trip_data.loc[trip_data["duration"]<= 180]

print("Current number of trips: {:,}".format(len(trip_data)))

Current number of trips: 2,052,670


<br>
Drop round trips that are 5 minutes or shorter - for this analysis, assume these are a result of user error

In [371]:
# Keep round trips where duration > 5 minutes, as well as all one way trips 
trip_data = trip_data.loc[((trip_data["duration"] > 5) & (trip_data["trip_route_category"] == "Round Trip")) | ((trip_data["trip_route_category"] == "One Way"))]

print("Current number of trips: {:,}".format(len(trip_data)))

Current number of trips: 1,989,934
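
The three cleaning rules can also be expressed as named boolean masks and applied together, which makes each rule easy to inspect on its own. The sketch below uses a small made-up `trips` frame; the thresholds and category labels are the ones used in the steps above.

```python
import pandas as pd

trips = pd.DataFrame({
    "duration": [8, 200, 3, 3, 45],
    "start_station": [3004, 3005, 3006, 3007, 3000],
    "end_station": [3005, 3006, 3006, 3010, 3004],
    "trip_route_category": ["One Way", "One Way", "Round Trip", "One Way", "One Way"],
})

# Rule 1: neither end of the trip is the virtual station
not_virtual = (trips["start_station"] != 3000) & (trips["end_station"] != 3000)
# Rule 2: trip lasts at most 180 minutes
not_too_long = trips["duration"] <= 180
# Rule 3: keep all one way trips, and round trips longer than 5 minutes
valid_route = (trips["trip_route_category"] == "One Way") | (
    (trips["trip_route_category"] == "Round Trip") & (trips["duration"] > 5)
)

cleaned = trips.loc[not_virtual & not_too_long & valid_route]
print(cleaned)
```

Only the first and fourth toy rows survive: the others are a 200 minute trip, a 3 minute round trip, and a trip starting at the virtual station.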


<br>Calculate how many trips were removed from the original data

In [372]:
print("Removed {:,} of {:,} trips from original dataset. \n{:,} trips remain".format(orig_num_trips - len(trip_data), orig_num_trips, len(trip_data)))

Removed 118,396 of 2,108,330 trips from original dataset. 
1,989,934 trips remain


<br><br>
## <a name="jointripdata"></a> Join trip data with station data
[Return to Top](#top)

Join trip data with station data

In [373]:
# Join start station data
ts_data = pd.merge(left=trip_data, right=station_data[["station_id", "station_name", "neighborhood"]], how="left", left_on="start_station", right_on="station_id").drop(columns=["station_id"])

# Join end station data
ts_data = pd.merge(left=ts_data, right=station_data[["station_id", "station_name", "neighborhood"]], how="left", left_on="end_station", right_on="station_id").drop(columns=["station_id"])

# Rename merged columns 
ts_data = ts_data.rename(columns={"station_name_x":"start_name", "neighborhood_x":"start_neighborhood"})
ts_data = ts_data.rename(columns={"station_name_y":"end_name", "neighborhood_y":"end_neighborhood"})

ts_data.head(2)

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,start_lat,start_lon,end_station,end_lat,end_lon,bike_id,plan_duration,trip_route_category,passholder_type,bike_type,start_name,start_neighborhood,end_name,end_neighborhood
0,306773863,8,2019-01-01 00:19:00,2019-01-01 00:27:00,3049,39.945091,-75.142502,3007,39.945171,-75.159927,14495,30.0,One Way,Indego30,standard,Foglietta Plaza,Center City,"11th & Pine, Kahn Park",Washington Square West
1,306773862,7,2019-01-01 00:30:00,2019-01-01 00:37:00,3005,39.94733,-75.144028,3007,39.945171,-75.159927,5332,1.0,One Way,Day Pass,standard,"Welcome Park, NPS",Center City East,"11th & Pine, Kahn Park",Washington Square West


<br>Check if any trip data is missing start or end information

In [374]:
print(len(ts_data.loc[ts_data["start_name"].isnull()]))
print(len(ts_data.loc[ts_data["start_neighborhood"].isnull()]))
print(len(ts_data.loc[ts_data["end_name"].isnull()]))
print(len(ts_data.loc[ts_data["end_neighborhood"].isnull()]))

19
19
18
18
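
To see which station IDs are behind the unmatched rows, one option is a left merge with `indicator=True`, which adds a `_merge` column marking rows that found no match. A minimal sketch on toy data (the `trips` and `stations` frames here are made up):

```python
import pandas as pd

trips = pd.DataFrame({"trip_id": [1, 2, 3], "start_station": [3004, 3042, 3042]})
stations = pd.DataFrame({
    "station_id": [3004],
    "station_name": ["Municipal Services Building Plaza"],
})

merged = trips.merge(stations, how="left", left_on="start_station",
                     right_on="station_id", indicator=True)

# Rows flagged 'left_only' had no match in the station table
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "start_station"].unique().tolist()
print(unmatched_ids)
```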


<br>There is an additional station 3042 that is not included in the station data:

In [375]:
ts_data.loc[ts_data["start_name"].isnull()].head(2)

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,start_lat,start_lon,end_station,end_lat,end_lon,bike_id,plan_duration,trip_route_category,passholder_type,bike_type,start_name,start_neighborhood,end_name,end_neighborhood
550228,326839302,8,2019-10-01 17:38:10,2019-10-01 17:46:38,3042,39.949421,-75.16613,3073,39.96143,-75.15242,16389,365.0,One Way,Indego365,electric,,,9th & Spring Garden,Center City
553388,326970681,3,2019-10-02 17:08:20,2019-10-02 17:11:49,3042,39.949421,-75.16613,3088,39.969841,-75.1418,16682,365.0,One Way,Indego365,electric,,,3rd & Girard,Olde Kensington


In [376]:
ts_data.loc[ts_data["end_name"].isnull()].head(2)

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,start_lat,start_lon,end_station,end_lat,end_lon,bike_id,plan_duration,trip_route_category,passholder_type,bike_type,start_name,start_neighborhood,end_name,end_neighborhood
210216,317631215,56,2019-05-31 10:02:00,2019-05-31 10:58:00,3157,39.925449,-75.159538,3042,39.949421,-75.16613,14476,365.0,One Way,Indego365,standard,"8th & Mifflin, Bok Building",East Passyunk Crossing,,
220098,317939917,38,2019-06-04 07:50:00,2019-06-04 08:28:00,3024,39.948219,-75.209084,3042,39.949421,-75.16613,3723,365.0,One Way,Indego365,standard,"43rd & Chester, Clark Park",University City,,


<br>There used to be a station at 15th and Walnut that was not included in the station data. Add this station to the station dataset

In [None]:
station_id = 3042
station_name = "15th & Walnut"
# Use the get_neighborhood function to call the geocode API
neighborhood = get_neighborhood(station_name + ", Philadelphia, PA")

# Create a one-row df for the new station and concatenate it onto station_data
new_row = {'station_id':station_id, 'station_name': station_name, 'status':'Inactive', 'neighborhood': neighborhood}
station_data = pd.concat([station_data, pd.DataFrame([new_row])], ignore_index=True)

In [672]:
# Print to confirm the new data was added
station_data.tail(2)

Unnamed: 0,station_id,station_name,day_of_go_live_date,status,neighborhood
178,3256,23rd & Chestnut,9/21/2021,Active,Center City West
179,3042,15th & Walnut,,Inactive,Center City West


<br>Update the station data csv, so this station is included moving forward

In [253]:
station_to_csv()

<br>Update ts_data and check there are no more missing values for start or end stations or neighborhoods

In [377]:
ts_data.loc[ts_data["start_station"] == 3042, "start_name"] = station_name
ts_data.loc[ts_data["start_station"] == 3042, "start_neighborhood"] = neighborhood
ts_data.loc[ts_data["end_station"] == 3042, "end_name"] = station_name
ts_data.loc[ts_data["end_station"] == 3042, "end_neighborhood"] = neighborhood

print(len(ts_data.loc[ts_data["start_name"].isnull()]))
print(len(ts_data.loc[ts_data["start_neighborhood"].isnull()]))
print(len(ts_data.loc[ts_data["end_name"].isnull()]))
print(len(ts_data.loc[ts_data["end_neighborhood"].isnull()]))

0
0
0
0


<br><br>
## <a name="distancedata"></a> Calculate distance for each trip, using Distance Matrix API
[Return to Top](#top)

Since the Distance Matrix API charges $5 per 1,000 (origin, destination) combinations requested, find the smallest number of requests needed to calculate the distance for all trips in the dataset

<br>Check how many combinations are possible for all start and end stations

In [378]:
# Get the number of stations, subtracting 1 to account for the Virtual Station that will not need to be looked up
len_station_data = len(station_data) - 1

# Multiply len_station_data by (len_station_data - 1) to get the number of possible combinations
possible_combos = len_station_data*(len_station_data-1)

print(possible_combos)

31862
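
As a sanity check on that figure, the count of ordered pairs of distinct stations can be enumerated directly. The sketch assumes 179 stations, which is the value the output above implies (180 rows minus the Virtual Station).

```python
from itertools import permutations

n_stations = 179  # assumed: 180 rows in station_data minus the Virtual Station

# Closed form: each station can pair with every other station, in either direction
closed_form = n_stations * (n_stations - 1)

# Brute-force check by enumerating ordered pairs of distinct stations
enumerated = sum(1 for _ in permutations(range(n_stations), 2))

print(closed_form, enumerated)
```

Direction matters here because a bike route from A to B is not necessarily the same length as the route from B to A.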


<br>Check how many combinations actually exist in the trip data based on **station id**, excluding round trips 

In [379]:
station_id_combos = ts_data.loc[ts_data["start_station"] != ts_data["end_station"]].groupby(["start_station", "end_station"]).size().reset_index(name='count')

print(len(station_id_combos))

24176


<br>Check how many combinations actually exist in the trip data based on **station name**, excluding round trips <br>
*Earlier, it was noted that some stations have the same name and are the same location*

In [380]:
station_name_combos = ts_data.loc[ts_data["start_name"] != ts_data["end_name"]].groupby(["start_name", "end_name"]).size().reset_index(name='count')

print(len(station_name_combos))

23760


<br>Include long and lat in the table - these are needed for the Distance Matrix APIs

In [419]:
station_name_combos = ts_data.loc[ts_data["start_name"] != ts_data["end_name"]].groupby(["start_name", "start_lat", "start_lon", "end_name", "end_lat", "end_lon"]).size().reset_index(name='count')

print(len(station_name_combos))

26783


<br>The number of rows increased because some station pairs have multiple sets of coordinates:

In [402]:
temp = station_name_combos.groupby(["start_name", "end_name"]).size().reset_index(name='count')

temp.loc[temp["count"] > 3]

Unnamed: 0,start_name,end_name,count
19481,Frankford & Belgrade,Race Street Pier,4
20526,"Independence Mall, NPS",Race Street Pier,4
21606,Philadelphia Museum of Art,Race Street Pier,4
22151,Race Street Pier,Frankford & Belgrade,4
22157,Race Street Pier,"Independence Mall, NPS",4
22164,Race Street Pier,Philadelphia Museum of Art,4
22171,Race Street Pier,"Spring Garden Station, BSL",4
22778,"Spring Garden Station, BSL",Race Street Pier,4


<br>Average latlong coordinates for each start, end combination, so there is a single latlong key for each

In [421]:
station_name_combos = station_name_combos.groupby(["start_name", "end_name"]).mean().reset_index()

print(len(station_name_combos))

23760
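
The effect of the averaging step is easier to see on a tiny example. In this sketch the `pairs` frame is made up: the pair (A, B) appears with two slightly different start coordinates and collapses to a single row holding their mean.

```python
import pandas as pd

pairs = pd.DataFrame({
    "start_name": ["A", "A", "B"],
    "end_name": ["B", "B", "A"],
    "start_lat": [39.950, 39.952, 39.960],
    "start_lon": [-75.150, -75.152, -75.160],
})

# One row per (start_name, end_name); numeric columns are averaged within each group
averaged = pairs.groupby(["start_name", "end_name"]).mean(numeric_only=True).reset_index()
print(averaged)
```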


<br>Calculate how much it will cost to look up each pair of stations with a trip in the dataset

In [338]:
print("{:,} calls to the Distance Matrix API will cost ${:.2f} (of the $200 Google APIs monthly credit)".format(len(station_name_combos), len(station_name_combos)*.005))

23,760 calls to the Distance Matrix API will cost $118.80 (of the $200 Google APIs monthly credit)
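
For comparison, the same rate applied to every possible pair shows what restricting the lookups to observed pairs saves. `distance_matrix_cost` is a hypothetical helper, and the $5 per 1,000 rate is the one quoted earlier in this section, not a figure checked against current pricing.

```python
def distance_matrix_cost(n_elements, usd_per_thousand=5.0):
    """Estimated charge in USD at the rate quoted in this notebook."""
    return n_elements * usd_per_thousand / 1000

all_pairs = distance_matrix_cost(31862)       # every possible station pair
observed_pairs = distance_matrix_cost(23760)  # only pairs that appear in trips

print("All pairs: ${:.2f}".format(all_pairs))
print("Observed pairs: ${:.2f}".format(observed_pairs))
print("Saved: ${:.2f}".format(all_pairs - observed_pairs))
```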


<br>Test a Distance Matrix API call on the first row of data

In [443]:
ind_olat = 2
ind_olong = 3
ind_dlat = 4
ind_dlong = 5

origin = str(station_name_combos.iloc[0, ind_olat]) + ", " + str(station_name_combos.iloc[0,ind_olong])
destination = str(station_name_combos.iloc[0, ind_dlat]) + ", " + str(station_name_combos.iloc[0,ind_dlong])

print(origin)
print(destination)

result = gmaps.distance_matrix(origin, destination, units='imperial', mode='bicycling')
print("Result:")
print(result)
print("Distance:")
print(result['rows'][0]['elements'][0]['distance']['text'])

39.95005, -75.156723
39.934311, -75.160423
Result:
{'destination_addresses': ['10th & Federal, Philadelphia, PA 19147, USA'], 'origin_addresses': ['10th & Chestnut, Philadelphia, PA 19107, USA'], 'rows': [{'elements': [{'distance': {'text': '1.1 mi', 'value': 1797}, 'duration': {'text': '8 mins', 'value': 493}, 'status': 'OK'}]}], 'status': 'OK'}
Distance:
1.1 mi
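
The response also carries a numeric `value` next to the display `text`. Assuming that value is in meters, as the Distance Matrix documentation describes, it could be converted to miles directly, which would avoid parsing unit suffixes later. The sketch below reuses the response printed above rather than calling the API.

```python
METERS_PER_MILE = 1609.344

# Response copied from the test call above
result = {
    'destination_addresses': ['10th & Federal, Philadelphia, PA 19147, USA'],
    'origin_addresses': ['10th & Chestnut, Philadelphia, PA 19107, USA'],
    'rows': [{'elements': [{'distance': {'text': '1.1 mi', 'value': 1797},
                            'duration': {'text': '8 mins', 'value': 493},
                            'status': 'OK'}]}],
    'status': 'OK',
}

element = result['rows'][0]['elements'][0]

# Only read the distance when the element itself reports success
if element['status'] == 'OK':
    miles = element['distance']['value'] / METERS_PER_MILE
    print(round(miles, 1))
```

This notebook keeps the `text` field instead, so the conversion steps that follow still apply.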


In [444]:
# Create a new column 'distance'
station_name_combos["distance"] = ''

In [455]:
# Iterate over each combination of stations
for index, row in station_name_combos.iterrows():
    # In case API calls run into an issue and this block of code has to be run multiple times, 
    # the check below avoids re-populating rows that have already been returned
    if station_name_combos.loc[index,"distance"] == "":
        # Get distance for the current pair of stations
        origin = str(station_name_combos.iloc[index, ind_olat]) + ", " + str(station_name_combos.iloc[index,ind_olong])
        destination = str(station_name_combos.iloc[index, ind_dlat]) + ", " + str(station_name_combos.iloc[index,ind_dlong])
        result = gmaps.distance_matrix(origin, destination, units='imperial', mode='bicycling')
        distance = result['rows'][0]['elements'][0]['distance']['text']
                
        # Assign the result into the dataset
        station_name_combos.loc[index,"distance"] = distance

In [457]:
station_name_combos.head(2)

Unnamed: 0,start_name,end_name,start_lat,start_lon,end_lat,end_lon,count,distance
0,10th & Chestnut,10th & Federal,39.95005,-75.156723,39.934311,-75.160423,216.0,1.1 mi
1,10th & Chestnut,11th & Market,39.95005,-75.156723,39.951691,-75.158882,156.0,0.4 mi


<br>Add an additional station combination for each station, where start_name == end_name and distance = 0.0

In [528]:
# Build one round trip entry with distance 0 for each station
round_trips = pd.DataFrame({
    'start_name': station_data["station_name"],
    'end_name': station_data["station_name"],
    'distance': "0.0 mi"
})
# Add the round trip entries to station_name_combos
station_name_combos = pd.concat([station_name_combos, round_trips], ignore_index=True)

In [530]:
station_name_combos.tail()

Unnamed: 0,start_name,end_name,start_lat,start_lon,end_lat,end_lon,count,distance
23935,27th & Morris - Vare Recreation Center,27th & Morris - Vare Recreation Center,,,,,,0.0 mi
23936,16th & Montgomery,16th & Montgomery,,,,,,0.0 mi
23937,Broad & Chestnut,Broad & Chestnut,,,,,,0.0 mi
23938,23rd & Chestnut,23rd & Chestnut,,,,,,0.0 mi
23939,15th & Walnut,15th & Walnut,,,,,,0.0 mi


<br>Drop the 'mi' suffix from distance values and store only the float 

In [596]:
station_name_combos["distance"] = station_name_combos["distance"].str.replace(" mi", "")

<br>The Distance Matrix API returns any trips shorter than 0.1 mi in feet instead of miles
<br>For any rows with a 'ft' suffix, convert to miles

In [598]:
traverse = station_name_combos.loc[station_name_combos["distance"].str.contains("ft")]
for index, row in traverse.iterrows():
    rounded = round(int(station_name_combos.loc[index, "distance"].replace(" ft", ""))*1.0/5280, 2)
    station_name_combos.loc[index, "distance"] = str(rounded)
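
The two suffix rules can also be folded into one helper that takes the raw text and returns miles. `distance_text_to_miles` is a hypothetical name, and the sample strings are made up in the same shape as the API text values; the sketch assumes the only units that occur are `mi` and `ft`.

```python
def distance_text_to_miles(text):
    """Convert a string such as '1.1 mi' or '317 ft' to a float number of miles."""
    value, unit = text.split()
    value = float(value.replace(",", ""))
    if unit == "mi":
        return value
    if unit == "ft":
        # 5,280 feet per mile, rounded to two decimals as in the loop above
        return round(value / 5280, 2)
    raise ValueError("Unexpected distance unit: " + unit)

for sample in ["1.1 mi", "317 ft", "0.0 mi"]:
    print(sample, "->", distance_text_to_miles(sample))
```

Applied to the raw column, for example with `Series.map`, this would replace the separate suffix and type conversion steps.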

<br> Convert the cleaned column from a str (object) to a float

In [602]:
station_name_combos = station_name_combos.astype({"distance": float})

<br>
Check that trip distances that were measured in ft converted correctly

In [605]:
station_name_combos.loc[(station_name_combos["distance"]<.1) & (station_name_combos["distance"]>0)].head()

Unnamed: 0,start_name,end_name,start_lat,start_lon,end_lat,end_lon,count,distance
3760,17th & JFK,18th & JFK,39.954048,-75.167831,39.953899,-75.169022,105.0,0.06
5014,18th & JFK,17th & JFK,39.953899,-75.169022,39.954048,-75.167831,87.0,0.06
5022,18th & JFK,18th & JFK Curbside,39.953899,-75.169022,39.954102,-75.169647,70.0,0.03
5180,18th & JFK Curbside,18th & JFK,39.954102,-75.169647,39.953899,-75.169022,51.0,0.03


<br>Join distance data to trip data

In [606]:
tsd_data = pd.merge(left=ts_data, right=station_name_combos[["start_name", "end_name", "distance"]], how="left", left_on=["start_name", "end_name"], right_on=["start_name", "end_name"])

tsd_data.head(2)

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,start_lat,start_lon,end_station,end_lat,end_lon,bike_id,plan_duration,trip_route_category,passholder_type,bike_type,start_name,start_neighborhood,end_name,end_neighborhood,distance
0,306773863,8,2019-01-01 00:19:00,2019-01-01 00:27:00,3049,39.945091,-75.142502,3007,39.945171,-75.159927,14495,30.0,One Way,Indego30,standard,Foglietta Plaza,Center City,"11th & Pine, Kahn Park",Washington Square West,1.3
1,306773862,7,2019-01-01 00:30:00,2019-01-01 00:37:00,3005,39.94733,-75.144028,3007,39.945171,-75.159927,5332,1.0,One Way,Day Pass,standard,"Welcome Park, NPS",Center City East,"11th & Pine, Kahn Park",Washington Square West,1.2


<br>Confirm distance is not null for any trips

In [608]:
len(tsd_data.loc[tsd_data["distance"].isnull()])

0
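
A left join like this one only keeps the trip count unchanged if each (start_name, end_name) pair appears once in the distance table. One way to have pandas enforce that is the `validate` argument of `merge`; the sketch below uses toy frames, not the real data.

```python
import pandas as pd

trips = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "start_name": ["A", "A", "B"],
    "end_name": ["B", "B", "B"],
})
distances = pd.DataFrame({
    "start_name": ["A", "B"],
    "end_name": ["B", "B"],
    "distance": [1.1, 0.0],
})

# validate="many_to_one" raises MergeError if a pair is duplicated in `distances`
joined = trips.merge(distances, how="left", on=["start_name", "end_name"],
                     validate="many_to_one")
print(joined)
```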

<br><br>
## <a name="writetripdata"></a> Write enhanced trip data to a csv
[Return to Top](#top)

In [593]:
# Write the full data to a csv
tsd_data.to_csv(os.getcwd() + "/staged_data/indego-trips-enhanced.csv", index=False)

# Write a truncated version of the file including only 100 rows of data, since the full file is too large to upload to GitHub
tsd_data[:100].to_csv(os.getcwd() + "/staged_data/indego-trips-enhanced-truncated.csv", index=False)

In [571]:
# Write the station_names_combos data to a csv
station_name_combos.to_csv(os.getcwd() + "/staged_data/stations-with-distance.csv", index=False)

<br><br>
## <a name="storeassqldb"></a>Store data as a SQL database
[Return to Top](#top)

<br>By storing the data this way, it can be queried in the future using SQL, which will not require the full dataset to be loaded into memory and allows for faster query execution

<br> Prepare the data to be stored in a relational database, starting with station data

In [680]:
station_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 180 entries, 0 to 179
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   station_id           180 non-null    int64 
 1   station_name         180 non-null    object
 2   day_of_go_live_date  179 non-null    object
 3   status               180 non-null    object
 4   neighborhood         180 non-null    object
dtypes: int64(1), object(4)
memory usage: 12.5+ KB


<br>Add latitude and longitude coordinates to station data, where available. This information is currently stored for each individual trip, rather than once per station.

In [681]:
# Get the list of all stations for start or end of trip, with latlong data. Avg latlong if multiple entries exist
start_coords = trip_data[["start_station", "start_lat", "start_lon"]].groupby(["start_station"]).mean().reset_index()
end_coords = trip_data[["end_station", "end_lat", "end_lon"]].groupby(["end_station"]).mean().reset_index()

# Rename columns to match
start_coords = start_coords.rename(columns={"start_station":"station_id", "start_lat":"station_lat", "start_lon":"station_lon"})
end_coords = end_coords.rename(columns={"end_station":"station_id", "end_lat":"station_lat", "end_lon":"station_lon"})

# Merge the two lists. Avg latlong if multiple entries exist
coords = pd.concat([start_coords, end_coords]).groupby("station_id").mean().reset_index()

station_data_sql = pd.merge(left=station_data, right=coords[["station_id", "station_lat", "station_lon"]], how="left", left_on=["station_id"], right_on=["station_id"])

station_data_sql.head(5)

Unnamed: 0,station_id,station_name,day_of_go_live_date,status,neighborhood,station_lat,station_lon
0,3000,Virtual Station,4/23/2015,Active,Center City,,
1,3004,Municipal Services Building Plaza,4/23/2015,Active,Center City,39.953781,-75.163742
2,3005,"Welcome Park, NPS",4/23/2015,Active,Center City East,39.94733,-75.144028
3,3006,40th & Spruce,4/23/2015,Active,University City,39.952202,-75.20311
4,3007,"11th & Pine, Kahn Park",4/23/2015,Active,Washington Square West,39.945171,-75.159927


<br>Check if any station IDs are missing latlong data. The only stations without coordinates are inactive stations and the virtual station

In [682]:
station_data_sql.loc[station_data_sql["station_lat"].isnull() == True]

Unnamed: 0,station_id,station_name,day_of_go_live_date,status,neighborhood,station_lat,station_lon
0,3000,Virtual Station,4/23/2015,Active,Center City,,
20,3023,Rittenhouse Square,4/23/2015,Inactive,Rittenhouse Square,,
43,3048,Broad & Fitzwater,4/23/2015,Inactive,South Philadelphia,,
90,3109,Parkside & Girard,5/6/2016,Inactive,East Parkside,,
103,3122,"24th & Cecil B. Moore, Cecil B. Moore Library",4/27/2016,Inactive,North Philadelphia,,


<br>Convert the go live date to a datetime before storing in the SQL db

In [725]:
import datetime as dt

station_data_sql["day_of_go_live_date"] = pd.to_datetime(station_data_sql["day_of_go_live_date"])

station_data_sql.head(2)

Unnamed: 0,station_id,station_name,day_of_go_live_date,status,neighborhood,station_lat,station_lon
0,3000,Virtual Station,2015-04-23,Active,Center City,,
1,3004,Municipal Services Building Plaza,2015-04-23,Active,Center City,39.953781,-75.163742


<br>Store the combination of start & end stations with distance in between as its own table

In [611]:
station_name_combos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23940 entries, 0 to 23939
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   start_name  23940 non-null  object 
 1   end_name    23940 non-null  object 
 2   start_lat   23760 non-null  float64
 3   start_lon   23760 non-null  float64
 4   end_lat     23760 non-null  float64
 5   end_lon     23760 non-null  float64
 6   count       23760 non-null  float64
 7   distance    23940 non-null  float64
dtypes: float64(6), object(2)
memory usage: 1.5+ MB


<br>Drop the latlong data from this table - it is now stored in the station data table

In [688]:
station_combos_sql = station_name_combos.copy()

station_combos_sql = station_combos_sql.drop(columns=["start_lat", "start_lon", "end_lat", "end_lon", "count"])

station_combos_sql.head()

Unnamed: 0,start_name,end_name,distance
0,10th & Chestnut,10th & Federal,1.1
1,10th & Chestnut,11th & Market,0.4
2,10th & Chestnut,"11th & Pine, Kahn Park",0.6
3,10th & Chestnut,"11th & Poplar, John F. Street Community Center",1.6
4,10th & Chestnut,11th & Reed,1.3


In [612]:
trip_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1989934 entries, 0 to 300431
Data columns (total 15 columns):
 #   Column               Dtype  
---  ------               -----  
 0   trip_id              int64  
 1   duration             int64  
 2   start_time           object 
 3   end_time             object 
 4   start_station        int64  
 5   start_lat            float64
 6   start_lon            float64
 7   end_station          int64  
 8   end_lat              float64
 9   end_lon              float64
 10  bike_id              object 
 11  plan_duration        float64
 12  trip_route_category  object 
 13  passholder_type      object 
 14  bike_type            object 
dtypes: float64(5), int64(4), object(6)
memory usage: 242.9+ MB


<br>Drop the latlong data from this table - it is now stored in the station data table

In [720]:
trip_data_sql = trip_data.copy()

trip_data_sql = trip_data_sql.drop(columns=["start_lat", "start_lon", "end_lat", "end_lon"])

trip_data_sql.tail()

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,end_station,bike_id,plan_duration,trip_route_category,passholder_type,bike_type
300427,428365176,10,9/30/2021 23:57,10/1/2021 0:07,3009,3035,18791,30.0,One Way,Indego30,electric
300428,428365174,7,9/30/2021 23:57,10/1/2021 0:04,3047,3028,5263,30.0,One Way,Indego30,standard
300429,428365172,7,9/30/2021 23:58,10/1/2021 0:05,3046,3050,18675,30.0,One Way,Indego30,electric
300430,428365170,3,9/30/2021 23:58,10/1/2021 0:01,3115,3075,21618,30.0,One Way,Indego30,electric
300431,428365168,12,9/30/2021 23:59,10/1/2021 0:11,3012,3112,18807,30.0,One Way,Indego30,electric


<br>Convert the date columns to datetimes before storing in the SQL db

In [721]:
trip_data_sql["start_time"] = pd.to_datetime(trip_data_sql["start_time"])
trip_data_sql["end_time"] = pd.to_datetime(trip_data_sql["end_time"])

trip_data_sql.tail()

Unnamed: 0,trip_id,duration,start_time,end_time,start_station,end_station,bike_id,plan_duration,trip_route_category,passholder_type,bike_type
300427,428365176,10,2021-09-30 23:57:00,2021-10-01 00:07:00,3009,3035,18791,30.0,One Way,Indego30,electric
300428,428365174,7,2021-09-30 23:57:00,2021-10-01 00:04:00,3047,3028,5263,30.0,One Way,Indego30,standard
300429,428365172,7,2021-09-30 23:58:00,2021-10-01 00:05:00,3046,3050,18675,30.0,One Way,Indego30,electric
300430,428365170,3,2021-09-30 23:58:00,2021-10-01 00:01:00,3115,3075,21618,30.0,One Way,Indego30,electric
300431,428365168,12,2021-09-30 23:59:00,2021-10-01 00:11:00,3012,3112,18807,30.0,One Way,Indego30,electric


<br>Create a new SQLite database

In [726]:
import sqlite3

conn = sqlite3.connect('bike_trip_db')
c = conn.cursor()

In [749]:
station_data_sql.to_sql('stations', conn, if_exists='replace', index = False)
station_combos_sql.to_sql('station_combos', conn, if_exists='replace', index = False)
trip_data_sql.to_sql('trips', conn, if_exists='replace', index = False)
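
The write step can be tried out safely against an in-memory database before touching the real file. This sketch uses a two-row toy `stations` frame and `":memory:"` as the connection target, then reads the table back with `pd.read_sql_query` to confirm the round trip.

```python
import sqlite3
import pandas as pd

stations = pd.DataFrame({
    "station_id": [3004, 3005],
    "station_name": ["Municipal Services Building Plaza", "Welcome Park, NPS"],
})

# ":memory:" creates a throwaway database that disappears when the connection closes
conn = sqlite3.connect(":memory:")
stations.to_sql("stations", conn, if_exists="replace", index=False)

# Read the table back to confirm what was stored
round_trip = pd.read_sql_query("SELECT * FROM stations ORDER BY station_id", conn)
conn.close()

print(round_trip)
```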

<br>Check header info

In [750]:
c.execute('''  
            PRAGMA table_info(stations);
            ''')

for row in c.fetchall():
    print (row)

(0, 'station_id', 'INTEGER', 0, None, 0)
(1, 'station_name', 'TEXT', 0, None, 0)
(2, 'day_of_go_live_date', 'TIMESTAMP', 0, None, 0)
(3, 'status', 'TEXT', 0, None, 0)
(4, 'neighborhood', 'TEXT', 0, None, 0)
(5, 'station_lat', 'REAL', 0, None, 0)
(6, 'station_lon', 'REAL', 0, None, 0)


<br>Check data

In [751]:
c.execute('''  
            SELECT * 
            FROM stations
            LIMIT 10;
            ''')

for row in c.fetchall():
    print (row)

(3000, 'Virtual Station', '2015-04-23 00:00:00', 'Active', 'Center City', None, None)
(3004, 'Municipal Services Building Plaza', '2015-04-23 00:00:00', 'Active', 'Center City', 39.953781, -75.163742)
(3005, 'Welcome Park, NPS', '2015-04-23 00:00:00', 'Active', 'Center City East', 39.94733, -75.144028)
(3006, '40th & Spruce', '2015-04-23 00:00:00', 'Active', 'University City', 39.952202, -75.20311)
(3007, '11th & Pine, Kahn Park', '2015-04-23 00:00:00', 'Active', 'Washington Square West', 39.945171, -75.159927)
(3008, 'Temple University Station', '2015-04-23 00:00:00', 'Active', 'North Philadelphia', 39.98044755123083, -75.1506971812565)
(3009, '33rd & Market', '2015-04-23 00:00:00', 'Active', 'University City', 39.955761, -75.189819)
(3010, '15th & Spruce', '2015-04-23 00:00:00', 'Active', 'Rittenhouse Square', 39.947109, -75.166183)
(3011, '38th & Powelton', '2015-04-23 00:00:00', 'Active', 'University City', 39.95960020818546, -75.19664993981894)
(3012, '21st & Catharine', '2015-04-

<br>Check the number of rows

In [752]:
c.execute('''  
            SELECT COUNT(*) 
            FROM stations
            ''')

for row in c.fetchall():
    print (row)

(180,)


<br>The data has been stored in the 'bike_trip_db' database
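
As an illustration of the kind of query this layout supports, the sketch below counts trips by starting neighborhood with a join between trips and stations. It runs against a small in-memory database with made-up rows and a reduced set of columns, so it does not touch the real database file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reduced versions of the two tables, with only the columns the query needs
cur.execute("CREATE TABLE stations (station_id INTEGER, station_name TEXT, neighborhood TEXT)")
cur.execute("CREATE TABLE trips (trip_id INTEGER, start_station INTEGER, end_station INTEGER)")
cur.executemany("INSERT INTO stations VALUES (?, ?, ?)", [
    (3004, "Municipal Services Building Plaza", "Center City"),
    (3006, "40th & Spruce", "University City"),
])
cur.executemany("INSERT INTO trips VALUES (?, ?, ?)", [
    (1, 3004, 3006),
    (2, 3004, 3006),
    (3, 3006, 3004),
])

# Count trips by the neighborhood of the start station
cur.execute("""
    SELECT s.neighborhood, COUNT(*) AS n_trips
    FROM trips AS t
    JOIN stations AS s ON s.station_id = t.start_station
    GROUP BY s.neighborhood
    ORDER BY n_trips DESC, s.neighborhood
""")
rows = cur.fetchall()
conn.close()

for row in rows:
    print(row)
```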