A CSV file containing cleaned Uber data can be found here: 

https://drive.google.com/file/d/1pWGv84eaEZF495H7HVwDOFHnLj66-Nbh/view?usp=sharing

In [1]:
import pandas as pd

**Load the local csv files into dataframes.**

**"bus.csv" and "times.csv" should be in the GitHub repo, but uber.csv is not. Find that here: https://drive.google.com/file/d/1QaT4NrrkulKCc1OKrVxXukOJdUp8XiC8/view?usp=sharing**

In [232]:
uber = pd.read_csv('uber.csv', index_col = 0)
busDf = pd.read_csv('bus.csv', index_col = 0)
timesDf = pd.read_csv("times.csv", names=["start", "stop", "time"])


**Create a list of unique routes from the busDf**

In [78]:
lines = busDf.ROUTE_OR_LINE.unique()
lines.tolist()

**Sort the Uber dataframe by source and then destination**

In [233]:
uber = uber.sort_values(['sourceid', 'dstid'])

**Function to find the days where a line's on-time percentage fell below 30% of that line's average.**

If we just look at days where a line was below 100% of its average, we get far too many days to look at. We need to be stricter with our analysis.

In [275]:
def lateDays(lines, busDf):
    # maps each line to a list of (service date, percent on time) tuples
    linesSlowerThanAvg = {}

    for line in lines:
        # get the piece of the dataframe that has data for this line
        thisLine = busDf.loc[busDf['ROUTE_OR_LINE'] == line]

        # threshold: 30% of this line's average on-time percentage
        threshold = thisLine['PERCENT_ONTIME'].mean() * 0.3

        # keep only the days that fell below the threshold
        slowerThanAverage = thisLine.loc[thisLine['PERCENT_ONTIME'] < threshold]

        # pair each late date with its on-time percentage
        linesSlowerThanAvg[line] = list(zip(slowerThanAverage['SERVICE_DATE'],
                                            slowerThanAverage['PERCENT_ONTIME']))

    return linesSlowerThanAvg

Build a dictionary that maps each line to the days it was especially late, along with the fraction of buses that were on time on each of those days.

In [276]:
daysSlowerThanAvg = lateDays(lines, busDf)
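The same filter can also be written without an explicit loop over lines. The following is a minimal sketch, not the notebook's method: `late_days_vectorized` and its `fraction` parameter are hypothetical names, the column names are taken from `lateDays` above, and the toy dataframe is invented purely for illustration.

```python
import pandas as pd

def late_days_vectorized(bus_df, fraction=0.3):
    """Return rows whose on-time percentage is below `fraction` of their line's mean."""
    # Per-row copy of each line's mean on-time percentage.
    line_mean = bus_df.groupby('ROUTE_OR_LINE')['PERCENT_ONTIME'].transform('mean')
    late = bus_df[bus_df['PERCENT_ONTIME'] < fraction * line_mean]
    return late[['ROUTE_OR_LINE', 'SERVICE_DATE', 'PERCENT_ONTIME']]

# Invented toy data (not from bus.csv).
toy = pd.DataFrame({
    'ROUTE_OR_LINE': ['1', '1', '1', '7', '7'],
    'SERVICE_DATE': ['1/1/16', '1/2/16', '1/3/16', '1/1/16', '1/2/16'],
    'PERCENT_ONTIME': [0.9, 0.8, 0.1, 0.7, 0.6],
})
late = late_days_vectorized(toy)
print(late)
```

Unlike the dictionary returned by `lateDays`, this result simply omits lines that had no late days instead of listing them with an empty list.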

In [277]:
uber.loc[(uber['sourceid'] >= 2000)]

Unnamed: 0,sourceid,dstid,hod,geometric_mean_travel_time,geometric_standard_deviation_travel_time


In [278]:
#check how many late days each line has
for key, value in daysSlowerThanAvg.items():
    print(key, len(value))

100 7
171 107
225 0
201 22
230 0
60 0
62 0
65 0
78 0
104 4
215 0
26 6
411 4
442 1
449 25
88 0
SL5 0
4 40
132 0
170 63
424 9
426 0
434 30
7 0
106 0
35 0
43 3
44 0
51 0
57A 2
76 0
CT3 6
SL1 0
111 0
114 64
9 0
116 0
222 0
236 2
326 9
39 0
435 1
119 3
21 0
216 0
428 154
451 6
456 0
459 1
503 31
80 0
87 1
SL2 0
8 1
137 13
14 1
220 0
24 1
351 12
554 1
91 8
99 2
501 0
221 110
431 3
448 190
553 1
69 0
92 0
455 2
68 1
746 0
75 0
5 72
11 0
112 9
16 0
22 0
34 0
430 7
502 0
89 0
90 8
109 2
192 112
194 37
214 0
240 1
245 0
441 1
55 0
77 0
117 0
131 10
17 3
211 0
350 1
36 0
436 1
10 0
105 12
108 5
23 0
29 1
31 0
32 0
352 1
45 0
47 0
93 0
210 71
465 0
505 0
71 0
79 0
59 0
72 0
136 10
504 1
558 31
70 0
86 0
202 21
439 1
50 1
57 0
84 77
19 3
30 0
85 0
CT1 4
101 0
217 18
33 5
450 0
212 15
354 1
70A 0
73 0
94 0
95 0
74 1
429 0
110 0
40 1
556 6
97 16
CT2 4
1 0
27 17
37 0
64 0
18 7
67 0
SL4 0
9703 84
120 12
134 0
15 0
34E 0
608 0
83 1
193 32
38 3
191 46
325 44
41 0
42 0
96 0
28 0
66 0
238 0
9701 55
121 2
9
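Most lines in a listing like this have no late days at all, so it can help to drop the zeros and rank the rest. A small sketch of that idea follows; the dictionary `toy_days` is invented for illustration and only mimics the shape of `daysSlowerThanAvg` (line name mapped to a list of date and percent tuples).

```python
# Invented stand-in with the same shape as daysSlowerThanAvg.
toy_days = {
    '100': [('1/1/16 0:00', 0.10)],
    '171': [('1/1/16 0:00', 0.12), ('1/2/16 0:00', 0.08)],
    '225': [],
}

# Count late days per line, drop lines with none, and sort by count (highest first).
counts = {line: len(days) for line, days in toy_days.items()}
ranked = sorted(
    ((line, n) for line, n in counts.items() if n > 0),
    key=lambda pair: pair[1],
    reverse=True,
)

for line, n in ranked:
    print(line, n)
```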

In [279]:
daysSlowerThanAvg['72/75']

[('3/26/16 0:00', 0.18446601941747573)]

**Function to determine average travel times to and from each census tract in the Uber dataset.**

The output of this function was saved to "times.csv", which is loaded into timesDf above.

In [28]:
def averageTravelTimes(sources, dests, dataset):
    fromtotimes = {}
    counter = 0
    for source in sources:
        sliced = dataset.loc[(dataset['sourceid'] == source)]
        for dest in dests:
            furthersliced = sliced.loc[(sliced['dstid'] == dest)]
            if len(furthersliced.index) != 0:
                mean = furthersliced['geometric_mean_travel_time'].mean()
                fromtotimes[(source, dest)] = mean
                counter+=1
                if (counter % 5000) == 0:
                    print(source, dest, mean)
    return fromtotimes
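The nested loop above slices the dataframe once per source and destination pair, which is slow on a large dataset. A possible alternative is a single `groupby`. This is a sketch under assumptions: `average_travel_times` is a hypothetical name, the column names are taken from the Uber dataframe shown earlier, and the toy rows are invented.

```python
import pandas as pd

def average_travel_times(dataset):
    """Mean of geometric_mean_travel_time for every (sourceid, dstid) pair present."""
    return (dataset
            .groupby(['sourceid', 'dstid'])['geometric_mean_travel_time']
            .mean())

# Invented toy rows; hod is the hour of day, as in the Uber data.
toy = pd.DataFrame({
    'sourceid': [1, 1, 1, 2],
    'dstid': [2, 2, 3, 1],
    'hod': [8, 17, 8, 8],
    'geometric_mean_travel_time': [300.0, 500.0, 250.0, 400.0],
})
times = average_travel_times(toy)
print(times)
```

Note one difference: this version covers every pair that occurs in the data, whereas `averageTravelTimes` only considers the `sources` and `dests` passed to it.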

In [236]:
#times = averageTravelTimes(sources, dests, uber)

In [103]:
#timesSeries.index.name = 'To From Pair'

In [104]:
#timesSeries.reset_index()
#timesSeries.to_csv('times.csv')

In [240]:
#timesDf.head()

**To compare Uber and MBTA bus lines, we need to know what census tracts each line goes through. Here's the process:**

1. Find a map with census tracts. I use https://worldmap.harvard.edu/maps/3948. Uncheck everything on the left, and then under "Boundaries" check "Boston's Census Tracts".

2. Use the MBTA's website to get a map of the line route. https://www.mbta.com/schedules/bus

3. Click on areas of the Harvard Map that the line you're working on goes through. Write down the census tracts in a list. The Harvard website shows the tract under "Feature Details" and then "TRACT". 

4. Repeat for each line. There are 180 of them, so it may be best to start with the most popular lines: https://en.wikipedia.org/wiki/MBTA_key_bus_routes

In [None]:
zones_100 = ["837", "588", "838", "836", "203", "587"]
zones_7 = ["885", "883", "530", "425", "501"]
zones_1 = ["806", "804", "711", "709", "708", "105", "107", "108", "3531"]
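Once the tract lists exist, one way they might be used is to look up the Uber travel time between consecutive tracts on a line. The sketch below is only an illustration: `consecutive_times` is a hypothetical helper, `toy_times` is invented data with the same columns as timesDf (start, stop, time), and it assumes the tract labels match the start and stop ids directly, which the notebook does not confirm.

```python
import pandas as pd

# Tract lists copied from the cell above, keyed by line name.
zones_by_line = {
    '100': ["837", "588", "838", "836", "203", "587"],
    '7': ["885", "883", "530", "425", "501"],
}

# Invented stand-in for timesDf; real values would come from times.csv.
toy_times = pd.DataFrame({
    'start': [885, 883, 530],
    'stop': [883, 530, 425],
    'time': [120.0, 200.0, 150.0],
})

def consecutive_times(zones, times_df):
    """Return known travel times between consecutive tracts of one line."""
    # Plain dict keyed by (start, stop) for quick lookups.
    lookup = {(int(s), int(t)): float(v)
              for s, t, v in zip(times_df['start'], times_df['stop'], times_df['time'])}
    result = {}
    for a, b in zip(zones, zones[1:]):
        key = (int(a), int(b))  # assumption: tract labels equal the ids in times_df
        if key in lookup:
            result[key] = lookup[key]
    return result

line7 = consecutive_times(zones_by_line['7'], toy_times)
print(line7, sum(line7.values()))
```

Pairs with no entry in the times table, such as the last hop of line 7 in the toy data, are skipped rather than filled in.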