A CSV file containing cleaned Uber data can be found here: 

https://drive.google.com/file/d/1pWGv84eaEZF495H7HVwDOFHnLj66-Nbh/view?usp=sharing

In [1]:
import pandas as pd

**Load the local csv files into dataframes.**

**"bus.csv" and "times.csv" should be in the GitHub repo, but uber.csv is not. Find that here: https://drive.google.com/file/d/1QaT4NrrkulKCc1OKrVxXukOJdUp8XiC8/view?usp=sharing**

In [232]:
uber = pd.read_csv('uber.csv', index_col = 0)
busDf = pd.read_csv('bus.csv', index_col = 0)
timesDf = pd.read_csv("times.csv", names=["start", "stop", "time"])


**Create a list of unique routes from the busDf**

In [78]:
lines = busDf.ROUTE_OR_LINE.unique()
lines.tolist()

**Sort the Uber dataframe by source and then destination**

In [233]:
uber = uber.sort_values(['sourceid', 'dstid'])

**Function to find, for each line, the days when its on-time percentage fell below 80% of that line's average.**

If we just look at days where a line was below 100% of its average, we get far too many days to look at. We need to be stricter with our analysis.

In [215]:
def lateDays(lines, busDf):
    
    linesSlowerThanAvg = {}
    
    for line in lines:
        #get the rows of the dataframe that belong to this line
        thisLine = busDf.loc[busDf['ROUTE_OR_LINE'] == line]
        
        averageOnTime = ((thisLine['PERCENT_ONTIME']).mean())*0.8
        
        slowerThanAverage = thisLine.loc[(thisLine['PERCENT_ONTIME'] < averageOnTime)]
        
        datesSlowerThanAverage = slowerThanAverage['SERVICE_DATE']
        
        percentOnTime = slowerThanAverage['PERCENT_ONTIME']
        
        listDates = datesSlowerThanAverage.tolist()
        
        listPercents = percentOnTime.tolist()
        
        concatenated = [] 
        
        for ii in range(len(listDates)):
            concatenated.append((listDates[ii], listPercents[ii]))
        
        linesSlowerThanAvg[line] = concatenated
        
    return linesSlowerThanAvg
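
The same per-line threshold can also be computed without an explicit loop over rows. The following is a minimal sketch, not the notebook's method: the function name `late_days_vectorized`, the `factor` parameter, and the toy dataframe are all made up for illustration. It assumes the same column names as `busDf`.

```python
import pandas as pd

def late_days_vectorized(bus_df, factor=0.8):
    """Map each line to [(date, percent_on_time), ...] for days below factor * that line's mean."""
    # Per-row threshold: each line's mean on-time share, scaled by `factor`.
    threshold = bus_df.groupby('ROUTE_OR_LINE')['PERCENT_ONTIME'].transform('mean') * factor
    late = bus_df[bus_df['PERCENT_ONTIME'] < threshold]
    # Start every line with an empty list so lines with no late days still appear.
    result = {line: [] for line in bus_df['ROUTE_OR_LINE'].unique()}
    for line, group in late.groupby('ROUTE_OR_LINE'):
        result[line] = list(zip(group['SERVICE_DATE'], group['PERCENT_ONTIME']))
    return result

# Made-up data, only to show the shape of the result.
toy = pd.DataFrame({
    'ROUTE_OR_LINE': ['1', '1', '1', '7', '7'],
    'SERVICE_DATE': ['d1', 'd2', 'd3', 'd1', 'd2'],
    'PERCENT_ONTIME': [0.9, 0.9, 0.3, 0.8, 0.8],
})
toy_late = late_days_vectorized(toy)
```

In the toy data, line '1' averages 0.7, so its threshold is 0.56 and only day 'd3' falls below it. Line '7' has no days below its threshold and maps to an empty list.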

Get a dictionary mapping each line to the days when it fell below the 80% threshold, along with the percentage of buses that were on time on each of those days.

In [216]:
daysSlowerThanAvg = lateDays(lines, busDf)

In [217]:
#check how many late days each line has
for key, value in daysSlowerThanAvg.items():
    print(key, len(value))

100 229
171 233
225 64
201 199
230 234
60 156
62 93
65 86
78 168
104 174
215 302
26 261
411 279
442 96
449 392
88 80
SL5 0
4 306
132 109
170 352
424 306
426 163
434 172
7 23
106 194
35 99
43 194
44 99
51 152
57A 81
76 124
CT3 268
SL1 0
111 0
114 314
9 33
116 11
222 144
236 276
326 231
39 0
435 252
119 214
21 63
216 39
428 321
451 169
456 126
459 236
503 172
80 188
87 156
SL2 5
8 247
137 329
14 206
220 116
24 187
351 260
554 197
91 325
99 285
501 66
221 277
431 71
448 304
553 202
69 62
92 151
455 204
68 159
746 24
75 124
5 296
11 93
112 288
16 206
22 0
34 63
430 205
502 82
89 136
90 229
109 170
192 194
194 115
214 96
240 242
245 195
441 109
55 156
77 0
117 9
131 221
17 201
211 228
350 133
36 160
436 109
10 98
105 343
108 215
23 0
29 289
31 50
32 0
352 225
45 111
47 159
93 108
210 287
465 144
505 123
71 1
79 174
59 228
72 81
136 253
504 210
558 295
70 116
86 93
202 294
439 163
50 208
57 0
84 266
19 182
30 224
85 162
CT1 270
101 51
217 240
33 183
450 190
212 309
354 162
70A 206
73 2
94 16

In [223]:
daysSlowerThanAvg['1']

[('11/24/16 0:00', 0.545144804088586),
 ('12/25/16 0:00', 0.5582978723404255),
 ('4/28/17 0:00', 0.5570175438596491),
 ('5/25/17 0:00', 0.5602836879432624),
 ('10/8/17 0:00', 0.5149384885764499),
 ('12/25/17 0:00', 0.5618789521228545)]

**Function to determine average travel times to and from each census tract in the Uber dataset.**

The output of this function has been saved in "times.csv", which is loaded above as timesDf.

In [28]:
def averageTravelTimes(sources, dests, dataset):
    fromtotimes = {}
    counter = 0
    for source in sources:
        sliced = dataset.loc[(dataset['sourceid'] == source)]
        for dest in dests:
            furthersliced = sliced.loc[(sliced['dstid'] == dest)]
            #skip source/destination pairs that have no rows in the dataset
            if len(furthersliced.index) != 0:
                mean = furthersliced['geometric_mean_travel_time'].mean()
                fromtotimes[(source, dest)] = mean
                counter+=1
                if (counter % 5000) == 0:
                    print(source, dest, mean)
    return fromtotimes
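
The nested loops above filter the dataframe once per source and destination pair, which can be slow on a large dataset. A possible alternative is a single `groupby`. This is only a sketch: the name `average_travel_times_grouped` and the toy data are hypothetical. Unlike `averageTravelTimes`, it covers every pair present in the data rather than only the pairs drawn from the `sources` and `dests` arguments.

```python
import pandas as pd

def average_travel_times_grouped(dataset):
    """Return {(sourceid, dstid): mean travel time} for every pair present in the data."""
    means = dataset.groupby(['sourceid', 'dstid'])['geometric_mean_travel_time'].mean()
    # A Series with a two-level index converts to a dict keyed by (source, dest) tuples.
    return means.to_dict()

# Made-up rows, only to illustrate the output format.
toy_uber = pd.DataFrame({
    'sourceid': [1, 1, 1, 2],
    'dstid': [2, 2, 3, 1],
    'geometric_mean_travel_time': [100.0, 200.0, 50.0, 80.0],
})
toy_times = average_travel_times_grouped(toy_uber)
```

Pairs that never occur in the data are simply absent from the result, which matches the non-empty check in the loop version.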

In [236]:
#times = averageTravelTimes(sources, dests, uber)

In [103]:
#timesSeries.index.name = 'To From Pair'

In [104]:
#timesSeries.reset_index()
#timesSeries.to_csv('times.csv')
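
The commented-out cells above do not show how `timesSeries` was built. One way a dictionary keyed by (source, dest) tuples could be written in a layout that the earlier `pd.read_csv("times.csv", names=["start", "stop", "time"])` call can read is sketched below. The dictionary contents are made up, and an in-memory buffer stands in for the real file.

```python
import io
import pandas as pd

# Made-up stand-in for the dictionary returned by averageTravelTimes.
toy_times = {(1, 2): 150.0, (1, 3): 50.0}

# Turn the tuple keys into a two-level index so each key becomes two CSV columns.
index = pd.MultiIndex.from_tuples(list(toy_times.keys()))
times_series = pd.Series(list(toy_times.values()), index=index)

# header=False writes only "source,dest,mean" rows, so the file can be
# read back with explicit column names.
buffer = io.StringIO()  # stands in for 'times.csv'
times_series.to_csv(buffer, header=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer, names=["start", "stop", "time"])
```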

In [240]:
#timesDf.head()

**To compare Uber and MBTA bus lines, we need to know what census tracts each line goes through. Here's the process:**

1. Find a map with census tracts. I use https://worldmap.harvard.edu/maps/3948. Uncheck everything on the left, and then under "Boundaries" check "Boston's Census Tracts".

2. Use the MBTA's website to get a map of the line route. https://www.mbta.com/schedules/bus

3. Click on the areas of the Harvard map that the line you're working on goes through, and record the ID of each tract.

In [None]:
zones_100 = ["837", "588", "838", "836", "203", "587"]
zones_7 = ["885", "883", "530", "425", "501"]
zones_1 = ["806", "804", "711", "709", "708", "10402", "10401", "10101"]
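
Once a line's zones are listed, they could be used to pull the matching Uber travel times out of `timesDf`. The sketch below is one possible approach, not the notebook's: the helper `mean_time_within_zones` and the toy data are hypothetical. It assumes the zone IDs use the same identifiers as the `start` and `stop` columns, which the notebook does not confirm. Because the zone lists hold strings, both columns are cast to strings before matching.

```python
import pandas as pd

def mean_time_within_zones(times_df, zones):
    """Mean travel time over all ordered pairs of distinct zones on one line."""
    zone_list = list(zones)
    # Cast to str so integer ids in the dataframe match the string ids in the zone list.
    start = times_df['start'].astype(str)
    stop = times_df['stop'].astype(str)
    mask = start.isin(zone_list) & stop.isin(zone_list) & (start != stop)
    return times_df.loc[mask, 'time'].mean()

# Made-up travel times, only to show how the lookup behaves.
toy_times_df = pd.DataFrame({
    'start': [837, 588, 837, 1],
    'stop': [588, 837, 1, 837],
    'time': [300.0, 500.0, 900.0, 700.0],
})
toy_mean = mean_time_within_zones(toy_times_df, ["837", "588"])
```

Only the two rows that start and stop inside the listed zones contribute to the mean. Rows touching zone 1 are ignored because it is not part of the line.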