# Indexing by Station

While I am storing data on the level of *trips*, I need to access it on the level of *stations*. Since the search strategy for determining which bikes correspond with which stations is complex, and would require a full scan of the data, I need to speed this up by precomputing an index.

I already did that work in `08` and `09` while generating counts of things, so I just reuse that code here.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import requests
import io
import zipfile
from tqdm import tqdm

In [2]:
stations = pd.read_csv("../data/final/june_22_station_metadata.csv", index_col=0)

In [3]:
june_22 = pd.read_csv("../data/final/all_june_22_citibike_trips.csv", index_col=0)

## Outbound bike trips

The following cell generates a dict of stations by bikes which began the day there.

In [5]:
outbound_station_bikesets = {station_id: [] for station_id in stations.index}

for bike_id in tqdm(np.unique(june_22['bikeid'])):
    first_trip_station = june_22[june_22['bikeid'] == bike_id].sort_values(by='starttime').iloc[0]['start station id']
    outbound_station_bikesets[first_trip_station].append(bike_id)

100%|████████████████████████████████████| 6497/6497 [00:05<00:00, 1112.82it/s]


The index is structured like so:

    {<station id>: [<bike id>, <bike id>, ...]}

Ok great, but now we need to convert that set of bike ids to sets of trip ids.

In [32]:
import functools
outbound_bike_trip_index = dict()

for station_id, bike_id_list in tqdm(outbound_station_bikesets.items()):
    trip_set = []
    for bike_id in bike_id_list:
        trip_set.append(list(june_22[june_22['bikeid'] == bike_id].index.values))
    if len(trip_set) > 0:
        trip_set = functools.reduce(lambda x,y: x+y, trip_set)
    outbound_bike_trip_index[station_id] = trip_set

100%|███████████████████████████████████████| 474/474 [00:02<00:00, 164.29it/s]


And there you have it.

In [33]:
outbound_bike_trip_index[72]

[1005037,
 1005697,
 1034787,
 1035581,
 1005015,
 1006619,
 1007494,
 1010797,
 1012207,
 1013222,
 1013964,
 1015458,
 1023771,
 1024868,
 1028162,
 1030868,
 1032378,
 1047538,
 1049546,
 1051784,
 178,
 1016277,
 1027213,
 1032934,
 1036287]

## Inbound bike trips

Same process, just with slightly different inputs...

In [30]:
inbound_station_bikesets = {station_id: [] for station_id in stations.index}

for bike_id in tqdm(np.unique(june_22['bikeid'])):
    first_trip_station = june_22[june_22['bikeid'] == bike_id].sort_values(by='starttime', ascending=False).iloc[0]['start station id']
    inbound_station_bikesets[first_trip_station].append(bike_id)

100%|████████████████████████████████████| 6497/6497 [00:05<00:00, 1095.37it/s]


In [34]:
import functools
inbound_bike_trip_index = dict()

for station_id, bike_id_list in tqdm(inbound_station_bikesets.items()):
    trip_set = []
    for bike_id in bike_id_list:
        trip_set.append(list(june_22[june_22['bikeid'] == bike_id].index.values))
    if len(trip_set) > 0:
        trip_set = functools.reduce(lambda x,y: x+y, trip_set)
    inbound_bike_trip_index[station_id] = trip_set

100%|███████████████████████████████████████| 474/474 [00:02<00:00, 164.86it/s]


And there you have it.

In [35]:
inbound_bike_trip_index[72]

[1016255,
 1038188,
 1038854,
 1043492,
 1013541,
 1015353,
 1017461,
 1018528,
 1024885,
 1028258,
 1029102,
 1030061,
 1031714,
 1033776,
 1047879,
 1049421,
 1057621,
 1058834,
 1060118,
 1011219,
 1014152,
 1021996,
 1022568,
 1038434,
 1040910,
 1049130,
 1018994,
 1019842,
 1034390,
 1042124,
 1047090,
 1050550,
 1052587,
 1060234,
 1006463,
 1012314,
 1013020,
 1040282,
 1043457,
 1048515,
 1015346,
 1045405,
 1049009,
 1051135,
 1011334,
 1012160,
 1026052,
 1034899,
 1043827,
 1048595,
 1050838,
 1017156,
 1019042,
 1019892,
 1020737,
 1021258,
 1025341,
 1026295,
 1026743,
 1028075,
 1029722,
 1038587,
 1043004,
 632,
 1015940,
 1056313,
 1058086,
 1006717,
 1018260,
 1022519,
 1023644,
 1029823,
 1035821,
 1037143,
 1038424,
 1010022,
 1011908,
 1015132,
 1018522,
 1019648,
 1021225,
 1021613,
 1023084,
 1024161,
 1024974,
 1027652,
 1028521,
 1029330,
 1030048,
 1032184,
 1033673,
 1035669,
 1037259,
 1038567,
 1039852,
 1045137,
 900,
 1008543,
 1010409,
 1021198,
 1027133

## Trips from station

In [36]:
june_22.head(1)

Unnamed: 0,bikeid,birth year,end station id,end station latitude,end station longitude,end station name,gender,start station id,start station latitude,start station longitude,start station name,starttime,stoptime,tripduration,usertype
1004842,25585,1978.0,450,40.762272,-73.987882,W 49 St & 8 Ave,1,334,40.742388,-73.997262,W 20 St & 7 Ave,6/22/2016 00:00:11,6/22/2016 00:13:17,785,Subscriber


In [55]:
outbound_trip_index = dict()

for start_station_id in tqdm(np.unique(june_22['start station id'])):
    trip_ids = list(june_22[june_22['start station id'] == start_station_id].index.values)
    outbound_trip_index[start_station_id] = list(june_22[june_22['bikeid'] == bike_id].index.values)

100%|███████████████████████████████████████| 465/465 [00:00<00:00, 966.68it/s]


In [43]:
outbound_trip_index[72]

[1004900,
 1006445,
 1006968,
 1007743,
 1012627,
 1027981,
 1032330,
 1033799,
 1046768,
 1047218,
 1052136,
 1053102,
 1053676,
 1055161,
 1056179,
 1060215]

## Trips to station

In [41]:
inbound_trip_index = dict()

for end_station_id in tqdm(np.unique(june_22['end station id'])):
    trip_ids = list(june_22[june_22['end station id'] == end_station_id].index.values)
    inbound_trip_index[end_station_id] = list(june_22[june_22['bikeid'] == bike_id].index.values)

100%|██████████████████████████████████████| 471/471 [00:00<00:00, 1017.22it/s]


In [44]:
inbound_trip_index[72]

[1004900,
 1006445,
 1006968,
 1007743,
 1012627,
 1027981,
 1032330,
 1033799,
 1046768,
 1047218,
 1052136,
 1053102,
 1053676,
 1055161,
 1056179,
 1060215]

## Loading it into the database

Get that crap into the database!

In [45]:
from pymongo import MongoClient

In [46]:
client = MongoClient("mongodb://localhost:27017")

...right, once again there is weirdness with numpy stuff needing to be cast into Python-native int.

In [63]:
stations.sample(1)

Unnamed: 0_level_0,latitude,longitude,station name,incoming trips,outgoing trips,all trips,kind,bikes outbound,outbound trips,bikes inbound,inbound trips,delta bikes,delta trips
station id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
264,40.707065,-74.007319,Maiden Ln & Pearl St,100,98,198,active,3,46,14,103,11,57


In [67]:
# client['citibike']['station-indices'].insert_one({str(key): inbound_bike_trip_index[key] for key in inbound_bike_trip_index.keys()})
# for key in inbound_bike_trip_index.keys():
for station_id in tqdm(stations.index):
    client['citibike']['station-indices'].insert({
            'station id': str(station_id),
            'tripsets': {
                'inbound bike trip indices': [int(n) for n in inbound_bike_trip_index[station_id]]  if station_id in inbound_bike_trip_index else [],
                'outbound bike trip indices': [int(n) for n in outbound_bike_trip_index[station_id]] if station_id in outbound_bike_trip_index else [],
                'incoming trip indices': [int(n) for n in inbound_trip_index[station_id]] if station_id in inbound_trip_index else [],
                'outgoing trip indices': [int(n) for n in outbound_trip_index[station_id]]  if station_id in outbound_trip_index else []
            }
        })

100%|███████████████████████████████████████| 474/474 [00:00<00:00, 890.93it/s]


And...done!