# Keolis delays
*Russell Goldenberg*

On the format: each question is followed by the code that generates the answer. This approach is known as [reproducible research](https://en.wikipedia.org/wiki/Reproducibility#Reproducible_research), a practice that news organizations are slowly adopting (e.g. FiveThirtyEight, The Upshot).


In [27]:
import re
import os
import agate

In [28]:
# Read the raw delay report from the current working directory
input_file = os.path.join(os.getcwd(), 'delays.txt')
with open(input_file) as f:
    lines = f.readlines()

In [29]:
delays = []
for line in lines:
    trim = line.strip()
    has_date = re.search(r'(\d+/\d+/\d+)', trim)
    has_time = re.search(r'(\d+:\d+)', trim)
    split = re.split(r' +', trim)

    # Keep only lines that start with a date, contain a time,
    # and have enough fields to index the train id and minutes
    if has_date and has_time and has_date.start() == 0 and len(split) > 6:
        date = split[0]
        time = split[1] + ' ' + split[2]
        timestamp = date + ' ' + time
        train_id = split[5]
        minutes = split[6]
        delays.append((date, time, timestamp, train_id, minutes))

data = agate.Table(delays, column_names=['date', 'time', 'datetime', 'id', 'minutes'], column_types=[agate.Text(), agate.Text(), agate.DateTime(), agate.Text(), agate.Number()])

In [30]:
print('Here is a sample of what the clean data looks like:')
data.print_table(max_rows=5)

Here is a sample of what the clean data looks like:
|---------+----------+---------------------+------+----------|
|  date   | time     |            datetime | id   | minutes  |
|---------+----------+---------------------+------+----------|
|  6/1/15 | 6:00 am  | 2015-06-01 06:00:00 | 2034 |      18  |
|  6/1/15 | 7:20 am  | 2015-06-01 07:20:00 | 1059 |      34  |
|  6/1/15 | 7:20 am  | 2015-06-01 07:20:00 | 1059 |      34  |
|  6/1/15 | 8:40 am  | 2015-06-01 08:40:00 | 010  |      15  |
|  6/1/15 | 10:39 am | 2015-06-01 10:39:00 | 2012 |      20  |
|  ...    | ...      |                 ... | ...  |     ...  |
|---------+----------+---------------------+------+----------|
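To make the parsing step easier to follow, here is a minimal standalone sketch of the same filter applied to a single line. The raw layout of `delays.txt` is not shown in this notebook, so the sample line, its two middle fields, and the helper name `parse_line` are all hypothetical. Only the field positions (date at 0, time at 1 and 2, train id at 5, minutes at 6) come from the code above.

```python
import re

# Hypothetical raw line: the two middle fields are placeholders,
# since the real contents of delays.txt are not shown here.
sample = "6/1/15   6:00 am  Worcester  P500  2034  18"

def parse_line(line):
    """Return (date, time, train_id, minutes), or None if the line is not a delay record."""
    trim = line.strip()
    has_date = re.match(r'\d+/\d+/\d+', trim)   # the date must start the line
    has_time = re.search(r'\d+:\d+', trim)
    fields = re.split(r' +', trim)
    # Require at least 7 fields so that fields[5] and fields[6] exist
    if not (has_date and has_time and len(fields) > 6):
        return None
    return fields[0], fields[1] + ' ' + fields[2], fields[5], int(fields[6])

print(parse_line(sample))           # a delay record
print(parse_line("Daily summary"))  # a non-record line is skipped (None)
```

Lines such as report headers or totals fail the date check and are dropped, which is how the loop separates records from the surrounding text.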


In [41]:
total_delays = len(data.rows)
print('There were %d total delays' % total_delays)

There were 861 total delays


### How many delays involved new trains?

In [42]:
# starts with 2 and >= 4 digits
with_new_trains = data.where(lambda row: row['id'].startswith('2') and len(row['id']) > 3)

In [43]:
total_delays_new_trains = len(with_new_trains.rows)
percent_new_trains = total_delays_new_trains / total_delays * 100
print('There were %d delays with new trains, or %.2f%% of all delays in this data set' % (total_delays_new_trains, percent_new_trains))

There were 276 delays with new trains, or 32.06% of all delays in this data set
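The new-train rule and the percentage can be checked without agate. This sketch applies the same predicate to the five train ids from the sample table earlier in the notebook, then reproduces the reported share from the two totals. The helper name `is_new_train` is made up for illustration, and the division assumes Python 3 (under Python 2, dividing two integers would truncate to 0).

```python
def is_new_train(train_id):
    # Same rule as the notebook: id starts with '2' and has at least 4 characters
    return train_id.startswith('2') and len(train_id) > 3

# Train ids taken from the five sample rows printed earlier
ids = ['2034', '1059', '1059', '010', '2012']
new = [i for i in ids if is_new_train(i)]
share = len(new) / len(ids) * 100
print(new, share)

# Reproduce the reported percentage from the notebook's totals
print('%.2f' % (276 / 861 * 100))  # 32.06
```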


### What is the median delay?

In [44]:
median_delay = data.aggregate(agate.Median('minutes'))
print('The median delay was %d minutes' % median_delay)

The median delay was 9 minutes
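The cell above reports a median rather than a mean. The two can differ noticeably when a few long delays pull the mean upward. As a small illustration, this sketch uses only the five `minutes` values from the sample table, not the full data set, so the numbers say nothing about the overall result.

```python
from statistics import mean, median

# Minutes from the five sample rows printed earlier (not the full data set)
minutes = [18, 34, 34, 15, 20]

print('median:', median(minutes))  # middle value after sorting
print('mean:  ', mean(minutes))    # pulled up by the two 34-minute delays
```

The median is less sensitive to outliers, which may be why it is used here to describe a typical delay.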


### How many delays per day on average?

In [45]:
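counts_date = data.counts('date')
# Days with no delays do not appear in the data, so this is the mean over days with at least one delay
delays_per_day = counts_date.aggregate(agate.Mean('count'))
# Use %.2f rather than %d, which would truncate the mean to a whole number
print('On average there were %.2f delays per day' % delays_per_day)

On average there were 4.95 delays per day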
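For readers unfamiliar with agate's `counts`, the same calculation can be sketched with the standard library: count the rows per date, then average the counts. The dates below are illustrative, not taken from the data. Note that a day with no recorded delay never appears in the list, so the result is an average over days that had at least one delay.

```python
from collections import Counter

# Illustrative dates, one entry per delay (6/3/15 has no delays and so never appears)
dates = ['6/1/15', '6/1/15', '6/1/15', '6/2/15', '6/4/15', '6/4/15']

per_day = Counter(dates)                      # delays on each date
avg = sum(per_day.values()) / len(per_day)    # mean over days with a delay
print(dict(per_day), avg)
```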


### How many days had a delay with a new train?

In [46]:
days_with_new_train_delay = len(with_new_trains.distinct('date').rows)
total_days = len(data.distinct('date').rows)
percent_days = days_with_new_train_delay / total_days * 100
print('%d days had a delay with a new train, or %.2f%% of all days with a delay in this data set' % (days_with_new_train_delay, percent_days))

113 days had a delay with a new train, or 64.94% of all days with a delay in this data set
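Counting days rather than delays comes down to de-duplicating dates before counting. A rough plain-Python equivalent of the `distinct('date')` calls uses sets. The rows below are made up for illustration, and the new-train test repeats the rule used earlier in the notebook.

```python
# Illustrative (date, train_id) rows, not taken from the data
rows = [
    ('6/1/15', '2034'),
    ('6/1/15', '1059'),
    ('6/2/15', '010'),
    ('6/3/15', '2012'),
    ('6/3/15', '2040'),
]

all_days = {date for date, _ in rows}
# Same new-train rule as above: id starts with '2' and has at least 4 characters
new_train_days = {date for date, tid in rows if tid.startswith('2') and len(tid) > 3}

percent_days = len(new_train_days) / len(all_days) * 100
print(len(new_train_days), len(all_days), '%.2f' % percent_days)
```

A date with two new-train delays, like 6/3/15 here, counts once, which is why this figure differs from the share of delays involving new trains.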
