<a href="https://colab.research.google.com/github/DonRomaniello/CitibikeDocks/blob/master/TripData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Note: While CitiBike has stations on both sides of the Hudson, few (if any) rides start in one state and end in the other. There is little incentive to attempt this beyond bragging rights, and CitiBike publishes separate trip data for each jurisdiction, which suggests that hardly anyone does it. As I live and work in New York City, I will focus only on New York.

Before training our model, it would be useful to learn how bikes flow between stations.

CitiBike publishes trip reports every month to an AWS S3 bucket. These reports contain a record of every trip taken
by CitiBike users, including the start time and location, the end time and location, and so on.

In [1]:
import requests
import pandas as pd

Unfortunately, some of the data are published as zip files that also contain macOS special files, which means pandas can't simply ingest the zip file as published.

We will use Requests to grab the file from the S3 bucket, BytesIO to keep the zip archive in memory, and ZipFile to open the archive and extract only the CSV.

In [2]:
from io import BytesIO
from zipfile import ZipFile

In [3]:
dirtyZipUrl = 'https://s3.amazonaws.com/tripdata/202108-citibike-tripdata.csv.zip'
dirtyZipFilename = requests.get(dirtyZipUrl).content
dirtyZipFile = ZipFile( BytesIO(dirtyZipFilename), 'r')

for item in dirtyZipFile.namelist():
  print("File in zip:" + item)

File in zip:202108-citibike-tripdata.csv
File in zip:__MACOSX/._202108-citibike-tripdata.csv


There it is, the stuff that pandas doesn't like. With the extra file in the "__MACOSX" directory, the archive contains more than one file, so the pandas read_csv() function cannot tell which one to read and throws an exception when given the zip file directly.

Not all of the published zip archives have this problem, but we should get rid of the extra file if it is in there.


In [4]:
justCSV = [cleanFilename for cleanFilename in dirtyZipFile.namelist() if "._" not in cleanFilename and ".csv" in cleanFilename][0]
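The filter above can be tried without downloading anything. The sketch below builds a small archive in memory with the same layout as the published one; the file names and CSV contents are invented for illustration.

```python
from io import BytesIO
from zipfile import ZipFile

# Build a small archive in memory that mimics the published layout:
# one real CSV plus a macOS "._" companion file (both invented here).
buffer = BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('sample-tripdata.csv', 'ride_id,member_casual\nA1,member\n')
    archive.writestr('__MACOSX/._sample-tripdata.csv', 'not a csv')

sampleZip = ZipFile(BytesIO(buffer.getvalue()), 'r')

# Same filter as above: skip "._" entries, keep names containing ".csv"
csvNames = [name for name in sampleZip.namelist() if "._" not in name and ".csv" in name]

print(csvNames)
print(sampleZip.open(csvNames[0]).read().decode('utf-8'))
```

Only the real CSV survives the filter, so taking the first element of the list is safe as long as the archive holds exactly one data file.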

And now we can load the data and make sure it is as expected.

In [5]:
tripData = pd.read_csv(dirtyZipFile.open(justCSV), low_memory=False)
tripData.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,FB6B89D05B67EBED,classic_bike,2021-08-24 15:59:57,2021-08-24 16:42:07,Broadway & E 21 St,6098.1,Central Park North & Adam Clayton Powell Blvd,7617.07,40.739888,-73.989586,40.799484,-73.955613,member
1,E13DA3E30CEF8DFC,classic_bike,2021-08-18 13:12:01,2021-08-18 13:21:26,E 13 St & 2 Ave,5820.08,Henry St & Grand St,5294.04,40.731539,-73.985302,40.714211,-73.981095,member
2,56617490AB8AE69C,classic_bike,2021-08-17 14:31:23,2021-08-17 14:35:34,E 95 St & 3 Ave,7365.13,E 84 St & Park Ave,7243.04,40.784903,-73.950503,40.778627,-73.957721,member
3,CA908B271C7D6663,classic_bike,2021-08-11 10:00:12,2021-08-11 10:31:01,Madison Ave & E 82 St,7188.13,E 84 St & Park Ave,7243.04,40.778131,-73.960694,40.778627,-73.957721,casual
4,3E170CE1F4FE179D,classic_bike,2021-08-12 19:28:38,2021-08-12 19:48:50,E 74 St & 1 Ave,6953.08,E 84 St & Park Ave,7243.04,40.768974,-73.954823,40.778627,-73.957721,casual


Great.

We should turn this process into a function that takes the URL of the S3 item as input and returns a pandas DataFrame, because we will be doing this many times.

In [20]:
def readDirtyZip(dirtyZipUrl):
  # Download the archive and keep it in memory
  dirtyZipFilename = requests.get(dirtyZipUrl).content
  dirtyZipFile = ZipFile( BytesIO(dirtyZipFilename), 'r')
  # Pick the first real CSV, skipping the macOS "._" files
  justCSV = [cleanFilename for cleanFilename in dirtyZipFile.namelist() if "._" not in cleanFilename and ".csv" in cleanFilename][0]
  tripData = pd.read_csv(dirtyZipFile.open(justCSV), low_memory=False)

  return tripData

# Legacy Data

Before going any further in creating our trip dataset, there is a slight wrinkle. At some point CitiBike changed the IDs for all the stations.

Graciously, they saw fit to include both the old IDs *and* the new IDs in the JSON feed that provides live information about the system.

This will allow us to construct a dictionary which we can use to rename the stations in the old trip data to reflect the current naming paradigm.

Notes:

* Station IDs that contain letters include the stations in New Jersey, so we will remove them when we make the dictionary.
* The legacy system used int64 as the datatype for station IDs. The new system uses strings. When constructing the dictionary, the legacy IDs need to be cast to int64.


In [97]:
stationLocationsRequest = requests.get('https://gbfs.citibikenyc.com/gbfs/en/station_information.json')
stationLocationData = stationLocationsRequest.json()
stationLocations = pd.DataFrame(stationLocationData['data']['stations'])
# Keep only stations whose short_name contains no letters (this drops New Jersey)
stationLocations = stationLocations[stationLocations['short_name'].str.contains('[a-zA-Z]+', regex=True)==False]
# Keys are legacy IDs cast to int64, values are the current string IDs
stationNameDictionary = dict(zip(stationLocations.legacy_id.astype('int64'), stationLocations.short_name))
print(stationNameDictionary)

{72: '6926.01', 79: '5430.08', 82: '5167.06', 83: '4354.07', 116: '6148.02', 119: '4700.06', 120: '4452.03', 127: '5805.05', 128: '5687.04', 143: '4605.04', 144: '4812.02', 146: '5359.10', 150: '5476.03', 151: '5492.05', 152: '5288.09', 153: '6474.11', 157: '4531.05', 161: '5721.14', 164: '6498.10', 168: '6064.08', 174: '6004.07', 212: '6233.05', 216: '4829.01', 217: '4903.08', 223: '6030.04', 224: '5137.10', 228: '6541.03', 229: '5636.11', 232: '4677.01', 236: '5669.10', 238: '5964.01', 239: '4628.05', 241: '4546.04', 242: '4732.08', 244: '4611.03', 245: '4659.02', 247: '5922.07', 248: '5539.06', 249: '5400.05', 250: '5561.06', 251: '5561.04', 252: '5797.01', 254: '5914.03', 257: '5391.06', 258: '4461.04', 259: '4846.01', 260: '4962.08', 261: '4668.08', 262: '4546.05', 264: '5065.10', 265: '5523.02', 266: '5506.10', 267: '6441.01', 268: '5422.04', 270: '4620.02', 274: '4395.04', 275: '4419.03', 276: '5400.08', 278: '4781.03', 281: '6839.10', 282: '5062.01', 284: '6072.06', 285: '5905.
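The two notes above (dropping IDs that contain letters and casting the legacy IDs to integers) can be checked on a handful of records in plain Python. This is only a sketch: `buildStationDictionary` is a hypothetical helper, the records are shaped like the two fields used here, and the letter-prefixed entry is invented.

```python
import re

# Sample records with only the two fields used above. The first two pairs
# appear in the printed dictionary; the third is an invented example of an
# ID that contains letters.
sampleStations = [
    {'legacy_id': '72', 'short_name': '6926.01'},
    {'legacy_id': '79', 'short_name': '5430.08'},
    {'legacy_id': '3185', 'short_name': 'JC003'},
]

def buildStationDictionary(stations):
    """Map integer legacy IDs to current string IDs, skipping IDs with letters."""
    return {
        int(station['legacy_id']): station['short_name']
        for station in stations
        if not re.search('[a-zA-Z]', station['short_name'])
    }

sampleDictionary = buildStationDictionary(sampleStations)
print(sampleDictionary)
```

The keys are integers so that they match the int64 station IDs in the legacy trip files, while the values stay as strings to match the current data.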

We don't need anything except the dictionary, so we will delete everything else that went into creating the dictionary.

In [19]:
del stationLocationsRequest, stationLocationData, stationLocations

Since we are trying to predict the availability of bikes and open docks in the current CitiBike system, the station IDs in the old trip data will be replaced with the new ones.

The last month that used the legacy IDs appears to be January 2021. We should test our renaming dictionary on this before proceeding.

In [116]:
legacyTrips = readDirtyZip('https://s3.amazonaws.com/tripdata/202101-citibike-tripdata.csv.zip')

legacyTrips['start station id'] = legacyTrips['start station id'].map(stationNameDictionary)
legacyTrips['end station id'] = legacyTrips['end station id'].map(stationNameDictionary)

legacyTrips.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,2513,2021-01-01 00:00:11.9020,2021-01-01 00:42:05.2260,4042.08,Underhill Ave & Lincoln Pl,40.674012,-73.967146,4042.08,Underhill Ave & Lincoln Pl,40.674012,-73.967146,47812,Customer,1969,0
1,2519,2021-01-01 00:00:15.0960,2021-01-01 00:42:14.9780,4042.08,Underhill Ave & Lincoln Pl,40.674012,-73.967146,4042.08,Underhill Ave & Lincoln Pl,40.674012,-73.967146,47571,Customer,1969,0
2,1207,2021-01-01 00:00:28.9300,2021-01-01 00:20:36.6510,7188.1,E 81 St & Park Ave,40.776777,-73.95901,6912.01,7 Ave & Central Park South,40.766741,-73.979069,37451,Subscriber,2002,1
3,2506,2021-01-01 00:00:32.7130,2021-01-01 00:42:19.3980,4042.08,Underhill Ave & Lincoln Pl,40.674012,-73.967146,4042.08,Underhill Ave & Lincoln Pl,40.674012,-73.967146,48884,Customer,2002,1
4,959,2021-01-01 00:00:35.3650,2021-01-01 00:16:34.6010,,Water - Whitehall Plaza,40.702551,-74.012723,5181.04,Cherry St,40.712199,-73.979481,26837,Customer,2002,1


Looks good. In fact, it looks great, because station IDs that are not in the dictionary of current stations are replaced with NaN. We can use the pandas dropna function to remove them... later. First we will do a little more processing and cleaning.

Some column names have changed in the new era. Spaces have been replaced with underscores in the new data, and the timestamp columns now end in a preposition (started_at and ended_at instead of starttime and stoptime).

We are only going to be using trip start times, end times, and the station IDs for the starting stations and end stations, so these are the only ones we will bother to rename.

In [117]:
legacyColumnRename = dict({'starttime': 'started_at', 'stoptime': 'ended_at', 'start station id': 'start_station_id', 'end station id': 'end_station_id'})
legacyTrips.rename(columns=legacyColumnRename, inplace=True)

Then we can use the column renaming dictionary to cull the unwanted columns from our DataFrame.

In [118]:
legacyTrips = legacyTrips[list(legacyColumnRename.values())]
legacyTrips.head()

And finally we drop the NaNs. Had we done this earlier, we might have lost rows because of missing information in columns that we aren't even going to be using.

In [122]:
legacyTrips.dropna(inplace=True)
legacyTrips.isna().sum()

started_at          0
ended_at            0
start_station_id    0
end_station_id      0
dtype: int64
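The legacy cleaning steps can be gathered into one function so they can be repeated for each legacy month. The sketch below is one possible arrangement, not the notebook's own code: `cleanLegacyTrips` is a hypothetical name, and the sample rows are made up to show that a row with a missing value in an unused column survives because dropna runs last.

```python
import pandas as pd

def cleanLegacyTrips(legacyTrips, stationNameDictionary):
    """Return legacy trips with current station IDs and current column names."""
    legacyColumnRename = {'starttime': 'started_at', 'stoptime': 'ended_at',
                          'start station id': 'start_station_id',
                          'end station id': 'end_station_id'}
    cleaned = legacyTrips.copy()
    # IDs that are missing from the dictionary become NaN
    cleaned['start station id'] = cleaned['start station id'].map(stationNameDictionary)
    cleaned['end station id'] = cleaned['end station id'].map(stationNameDictionary)
    # Rename, keep only the four renamed columns, then drop incomplete rows
    cleaned = cleaned.rename(columns=legacyColumnRename)[list(legacyColumnRename.values())]
    return cleaned.dropna()

# Made-up rows: the second has an unknown end station, the third lacks a birth year
sampleTrips = pd.DataFrame({
    'starttime': ['2021-01-01 00:00:11', '2021-01-01 00:05:00', '2021-01-01 00:10:00'],
    'stoptime': ['2021-01-01 00:42:05', '2021-01-01 00:15:00', '2021-01-01 00:30:00'],
    'start station id': [72, 79, 72],
    'end station id': [79, 9999, 79],
    'birth year': [1969, 2002, None],
})
sampleResult = cleanLegacyTrips(sampleTrips, {72: '6926.01', 79: '5430.08'})
print(sampleResult)
```

The second row is dropped because its end station is not in the dictionary. The third row is kept even though its birth year is missing, since that column has already been culled by the time dropna runs.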

We don't really want to do this manually for every monthly file, so a list of the contents of the bucket is in order. We will use Boto3 to get it, connecting to S3 without a signature to avoid having to configure any credentials.

In [None]:
!pip install boto3
import boto3
from botocore import UNSIGNED
from botocore.client import Config

Collecting boto3
  Downloading boto3-1.18.45-py3-none-any.whl (131 kB)
Collecting botocore<1.22.0,>=1.21.45
  Downloading botocore-1.21.45-py3-none-any.whl (7.9 MB)
Collecting jmespath<1.0.0,>=0.7.1
  Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting s3transfer<0.6.0,>=0.5.0
  Downloading s3transfer-0.5.0-py3-none-any.whl (79 kB)
Collecting urllib3<1.27,>=1.25.4
  Downloading urllib3-1.26.6-py2.py3-none-any.whl (138 kB)
Installing collected packages: urllib3, jmespath, botocore, s3transfer, boto3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
ERROR: pip's dependency resolver does not currently take into a

In [None]:
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
s3.list_objects(Bucket='tripdata')['Contents']
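Each entry in the returned list is a dictionary with a 'Key' field holding the object name, which can be turned into a download URL for readDirtyZip. The sketch below works on a hand-written listing rather than the live bucket: the 'JC-' and 'index.html' keys are assumptions about what else the bucket may hold, `monthlyTripUrls` is a hypothetical helper, and the pattern only matches names shaped like the two files used earlier, so other naming schemes would need their own pattern.

```python
import re

# Hand-written stand-in for the 'Contents' list; only the first two keys
# are taken from the URLs used earlier, the rest are assumed examples.
sampleContents = [
    {'Key': '202108-citibike-tripdata.csv.zip'},
    {'Key': '202101-citibike-tripdata.csv.zip'},
    {'Key': 'JC-202108-citibike-tripdata.csv.zip'},
    {'Key': 'index.html'},
]

bucketUrl = 'https://s3.amazonaws.com/tripdata/'
# Six digits (year and month) followed by the suffix seen in the URLs above
monthlyPattern = re.compile(r'^\d{6}-citibike-tripdata\.csv\.zip$')

def monthlyTripUrls(contents):
    """Return sorted download URLs for keys that match the monthly pattern."""
    keys = sorted(item['Key'] for item in contents if monthlyPattern.match(item['Key']))
    return [bucketUrl + key for key in keys]

sampleUrls = monthlyTripUrls(sampleContents)
for url in sampleUrls:
    print(url)
```

Sorting the keys puts the months in chronological order, because the year and month lead each file name.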