# Analysis of Massachusetts Bay Transportation Authority (MBTA) routes and Uber trips 2016-2018

### A CSV file containing cleaned Uber data can be found here: 

https://drive.google.com/file/d/1pWGv84eaEZF495H7HVwDOFHnLj66-Nbh/view?usp=sharing

You must download the dataset to continue.

In [2]:
import pandas as pd

#### 1. Load the local CSV files into dataframes.


In [3]:
#Uber Data
uber = pd.read_csv('uber.csv', index_col = 0)

#MBTA Bus data
busDf = pd.read_csv('bus.csv', index_col = 0)

#Average travel times from Uber dataset, calculated with averageTravelTimes() function at bottom of notebook
timesDf = pd.read_csv("times.csv", names=["start", "stop", "time"])



**2. Create a list of unique routes from busDf**

In [6]:
lines = busDf.ROUTE_OR_LINE.unique()
lines = lines.tolist()

**3. Sort the Uber dataframe by source and then destination**

In [7]:
uber = uber.sort_values(['sourceid', 'dstid'])

### Below, we find the days where each line was most delayed (at most 25 per line).

If we just look at days where a line's on-time percentage was below its average, we get far too many days to examine. We need to be stricter with our analysis, so we gradually lower the on-time threshold: it starts at the line's average and is multiplied by 0.999 at each step until at most 25 days fall below it.

In [15]:
def lateDays(lines, busDf):
    
    linesSlowerThanAvg = {} #dictionary of lines as keys, and lists of (line, date, percent on time) as values
    for line in lines:
        
        #get the piece of the dataframe that has data for this line
        thisLine = busDf.loc[busDf['ROUTE_OR_LINE'] == line]
        
        #find the average percent on time for this line
        averageOnTime = ((thisLine['PERCENT_ONTIME']).mean())
        
        #we want just a handful of delays that are significant relative to each line
        #looking at the averages, we get between 1-100 delays
        #let's make it a max of 25 for each line
        
        slowerThanAverage = thisLine.loc[(thisLine['PERCENT_ONTIME'] < averageOnTime)]
        
        #while we have more than 25 days of delays
        while (len(slowerThanAverage) > 25): 
            
            #lower our threshold for on-time percent, which makes the cutoff stricter
            averageOnTime *= 0.999 
            
            #pull days with on-time percents less than our new threshold
            slowerThanAverage = thisLine.loc[(thisLine['PERCENT_ONTIME'] < averageOnTime)]
            
            #stop if every remaining day has 0 percent on time, since the threshold can never drop below them
            if (slowerThanAverage['PERCENT_ONTIME'].mean() == 0):
                break
        
        #pull dates from the list of delayed days
        datesSlowerThanAverage = slowerThanAverage['SERVICE_DATE']
        
        #pull percent on time from list of delayed days
        percentOnTime = slowerThanAverage['PERCENT_ONTIME']
        
        #get a list of the days
        listDates = datesSlowerThanAverage.tolist()
        
        #get a list of the percents on time
        listPercents = percentOnTime.tolist()
        
        concatenated = []
        
        #for each delayed day
        for ii in range(len(listDates)):
            #add the line, date, and percent on time
            concatenated.append((line, listDates[ii], listPercents[ii]))
        
        #enter line, date, and percent on-time into the dictionary
        linesSlowerThanAvg[line] = concatenated
        
    return linesSlowerThanAvg
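
The threshold-tightening step inside `lateDays` can be checked in isolation on made-up numbers. The sketch below is illustrative only: `tighten_threshold`, its parameters, and the synthetic on-time values are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def tighten_threshold(percent_on_time, max_days=25, factor=0.999):
    """Lower the cutoff from the mean until at most max_days values fall below it."""
    threshold = percent_on_time.mean()
    below = percent_on_time[percent_on_time < threshold]
    while len(below) > max_days:
        # shrink the cutoff slightly, then re-select the days under it
        threshold *= factor
        below = percent_on_time[percent_on_time < threshold]
        # if every selected value is 0, the cutoff can never exclude them
        if below.mean() == 0:
            break
    return threshold, below

# synthetic on-time values for one hypothetical line (100 days)
sample = pd.Series([0.50 + 0.004 * i for i in range(100)])
threshold, worst = tighten_threshold(sample, max_days=5)
print(len(worst), round(threshold, 4))
```

Because the cutoff shrinks by only 0.1% per step, a line with many below-average days may need several hundred iterations before the cap is met.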

#### 4. Get a dictionary mapping each line to its most-delayed days, including the percent of buses that were on time.

In [11]:
daysSlowerThanAvg = lateDays(lines, busDf)
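
The dictionary returned by `lateDays` maps each line to a list of `(line, date, percent on time)` tuples, which is awkward to sort or filter. One possible way to work with it, sketched here on a toy dictionary of the same shape, is to flatten it into a dataframe and keep the five worst days per line. The line names, dates, and percentages below are invented, and the names `late_df` and `worst_five` are hypothetical.

```python
import pandas as pd

# toy stand-in with the same shape as the output of lateDays (values are made up)
days = {
    '1': [('1', '2017-01-05', 0.41), ('1', '2017-02-09', 0.38)],
    '28': [('28', '2017-01-05', 0.52)],
}

# flatten the per-line lists into one list of rows
rows = [entry for entries in days.values() for entry in entries]
late_df = pd.DataFrame(rows, columns=['line', 'date', 'percent_on_time'])

# sort from worst to best, then keep up to five rows for each line
worst_five = late_df.sort_values('percent_on_time').groupby('line').head(5)
print(worst_five)
```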

### Below, we find average travel times to and from each census tract in the Uber Boston dataset. 

#### The function _averageTravelTimes_ returns a dictionary keyed by (source, destination) pairs, whose values are the average travel times for each zone pair in the city.

The output of this function has been saved to "times.csv" and loaded into timesDf above. The function is provided for reference.

In [17]:
def averageTravelTimes(sources, dests, dataset):
    
    #dictionary of sources, destinations, and mean travel times
    fromtotimes = {}
    
    #counter for intermittent output (to ensure the code is working because it takes a long time to run)
    counter = 0
    
    for source in sources:
        
        #get the slice of the dataset from source
        sliced = dataset.loc[(dataset['sourceid'] == source)]
        
        for dest in dests:
            #further slice into source->dest data
            furthersliced = sliced.loc[(sliced['dstid'] == dest)]
            
            #calculate the mean travel time from source to dest
            if len(furthersliced.index) != 0:
                mean = furthersliced['geometric_mean_travel_time'].mean()
                fromtotimes[(source, dest)] = mean
                counter+=1
                
                #intermittent output to ensure the function is working
                if (counter % 5000) == 0:
                    print(source, dest, mean)
                    
    return fromtotimes
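
The nested loops above filter the dataframe once per source and once per destination, which is why the function takes a long time to run. A possible alternative, which is not what the notebook used, is a single pandas `groupby` over both ID columns. The sketch below assumes the same column names as the Uber dataframe; the function name `average_travel_times_grouped` and the toy data are hypothetical.

```python
import pandas as pd

def average_travel_times_grouped(dataset):
    """Mean travel time for every (sourceid, dstid) pair present in dataset."""
    means = dataset.groupby(['sourceid', 'dstid'])['geometric_mean_travel_time'].mean()
    # a Series with a two-level index converts to a dict keyed by (source, dest) tuples
    return means.to_dict()

# toy data with made-up IDs and travel times
toy = pd.DataFrame({
    'sourceid': [1, 1, 1, 2],
    'dstid': [2, 2, 3, 1],
    'geometric_mean_travel_time': [100.0, 200.0, 300.0, 400.0],
})
times = average_travel_times_grouped(toy)
print(times[(1, 2)])
```

Unlike `averageTravelTimes`, this version does not take `sources` and `dests` arguments: it returns every pair that appears in the data, so restricting to a subset would require filtering the dataframe first.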

In [18]:
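#The code below was used to build the average travel times CSV
#times = averageTravelTimes(sources, dests, uber)
#averageTravelTimes returns a dict, which has no to_csv method, so wrap it in a Series first
#pd.Series(times).to_csv('times.csv', header=False)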

**To compare Uber and MBTA bus lines, we need to know what census tracts each line goes through. Here's the process:**

1. Find a map with census tracts. I use https://worldmap.harvard.edu/maps/3948. Uncheck everything on the left, and then under "Boundaries" check "Boston's Census Tracts".

2. Use the MBTA's website to get a map of the line route. https://www.mbta.com/schedules/bus

3. Click on areas of the Harvard Map that the line you're working on goes through. Write down the census tracts in a list. The Harvard website shows the tract under "Feature Details" and then "TRACT". 

4. Repeat for each line. There are 180 of them, so it may be best to start with the most popular lines: https://en.wikipedia.org/wiki/MBTA_key_bus_routes
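
Once the tracts for a line have been written down, they could be stored in a dictionary and joined against the average travel times. The following is a rough sketch under a strong assumption: that the recorded IDs can be matched directly to the `start` and `stop` columns of timesDf. If the Uber zone IDs differ from the tract numbers, a lookup table between the two would be needed first. The names `line_tracts` and `line_travel_times`, and all IDs and times shown, are hypothetical.

```python
import pandas as pd
from itertools import permutations

# hypothetical mapping from a bus line to the zone IDs it passes through
line_tracts = {'1': [10, 11, 12]}

# toy stand-in for timesDf (columns start, stop, time), with made-up values
times_df = pd.DataFrame({
    'start': [10, 11, 10, 12],
    'stop': [11, 12, 12, 10],
    'time': [300.0, 420.0, 650.0, 700.0],
})

def line_travel_times(line, line_tracts, times_df):
    """Average Uber times between every ordered pair of zones on one line."""
    pairs = pd.DataFrame(list(permutations(line_tracts[line], 2)),
                         columns=['start', 'stop'])
    # the inner join keeps only the pairs that have a recorded average time
    return pairs.merge(times_df, on=['start', 'stop'], how='inner')

result = line_travel_times('1', line_tracts, times_df)
print(result)
```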