# Keolis delays
*Russell Goldenberg*

On the format: each question is followed by the code that generates its answer. This practice is known as [reproducible research](https://en.wikipedia.org/wiki/Reproducibility#Reproducible_research), and newspapers are slowly adopting it (e.g. 538, The Upshot).


In [1]:
import re
import os
import agate

In [2]:
input_file = os.path.join(os.getcwd(), 'delays.txt')
with open(input_file) as f:
    lines = f.readlines()

In [9]:
delays = []

def line_to_list(arr):
    # Fields are positional: date, clock time, am/pm, ..., train id, minutes of delay
    date = arr[0]
    time = arr[1] + ' ' + arr[2]
    train_id = arr[5]
    minutes = arr[6]
    return (date, time, date + ' ' + time, train_id, minutes)

for line in lines:
    trim = line.strip()
    has_date = re.search(r'(\d+/\d+/\d+)', trim)
    has_time = re.search(r'(\d+:\d+)', trim)
    split = re.split(' +', trim)

    # Keep rows that start with a date and contain every field line_to_list reads (indices 0-6)
    if has_date and has_time and has_date.start() == 0 and len(split) > 6:
        new_data = line_to_list(split)
        delays.append(new_data)

col_types = [agate.Text(), agate.Text(), agate.DateTime(), agate.Text(), agate.Number()]
col_names = ['date', 'time', 'datetime', 'id', 'minutes']
data = agate.Table(delays, column_names=col_names, column_types=col_types)
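
To make the positional parsing concrete, here is a minimal sketch that applies the same checks to a single line. The sample line is made up for illustration (the real layout of `delays.txt` is not shown here), and `parse_line` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_line(line):
    """Return (date, time, train_id, minutes), or None if the line is not a delay row."""
    trim = line.strip()
    has_date = re.match(r'\d+/\d+/\d+', trim)  # the row must start with a date
    has_time = re.search(r'\d+:\d+', trim)
    parts = re.split(' +', trim)
    if not (has_date and has_time and len(parts) > 6):
        return None
    return (parts[0], parts[1] + ' ' + parts[2], parts[5], int(parts[6]))

# Hypothetical input: columns 3 and 4 are placeholders, their real meaning is unknown
sample = '6/1/15   6:00 am  X  Y  2034  18'
print(parse_line(sample))
print(parse_line('Keolis delay report'))
```

Lines that do not begin with a date, such as headers or blank lines, return `None` and would be skipped.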

In [10]:
print('Here is a sample of what the clean data looks like:')
data.print_table(max_rows=5)

Here is a sample of what the clean data looks like:
|---------+----------+---------------------+------+----------|
|  date   | time     |            datetime | id   | minutes  |
|---------+----------+---------------------+------+----------|
|  6/1/15 | 6:00 am  | 2015-06-01 06:00:00 | 2034 |      18  |
|  6/1/15 | 7:20 am  | 2015-06-01 07:20:00 | 1059 |      34  |
|  6/1/15 | 7:20 am  | 2015-06-01 07:20:00 | 1059 |      34  |
|  6/1/15 | 8:40 am  | 2015-06-01 08:40:00 | 010  |      15  |
|  6/1/15 | 10:39 am | 2015-06-01 10:39:00 | 2012 |      20  |
|  ...    | ...      |                 ... | ...  |     ...  |
|---------+----------+---------------------+------+----------|


In [15]:
# helper functions
def to_percent(n, d):
    return str(round(n/d * 100, 1)) + '%'

### How many delays involved new trains?

In [16]:
# new trains have ids that start with '2' and are at least 4 digits long
with_new_trains = data.where(lambda row: row['id'].startswith('2') and len(row['id']) > 3)
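
The filter's rule can be checked in isolation. Below is a minimal sketch in plain Python, using ids from the sample table above. `is_new_train` is a hypothetical name, and the rule itself (leading `2`, more than three characters) is taken directly from the lambda.

```python
def is_new_train(train_id):
    # Same test as the lambda: leading '2' and more than 3 characters
    return train_id.startswith('2') and len(train_id) > 3

ids = ['2034', '1059', '010', '2012']
new_ids = [i for i in ids if is_new_train(i)]
print(new_ids)
```

Keeping `id` as text rather than a number is what makes `startswith` and `len` usable here, and it preserves ids with a leading zero such as `010`.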

In [44]:
total = len(data.rows)
total_new_trains = len(with_new_trains.rows)
percent_new_trains = to_percent(total_new_trains, total)

In [45]:
min_date = data.aggregate(agate.Min('datetime'))
max_date = data.aggregate(agate.Max('datetime'))
min_date_display = min_date.date()
max_date_display = max_date.date()

In [48]:
print('This data looks at delays from %s to %s' % (min_date_display, max_date_display))
print('There were %d total delays' % total)
print('%d delays were with new trains' % total_new_trains)
print('%s of all delays involved new trains' % percent_new_trains)

This data looks at delays from 2015-06-01 to 2015-12-01
There were 861 total delays
276 delays were with new trains
32.1% of all delays involved new trains
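
As a sanity check, the reported share can be recomputed from the two counts printed above. This sketch repeats the notebook's `to_percent` helper so that it runs on its own, and it assumes Python 3 division.

```python
def to_percent(n, d):
    # n / d is true division in Python 3
    return str(round(n / d * 100, 1)) + '%'

total = 861             # total delays, from the output above
total_new_trains = 276  # delays involving new trains, from the output above
print(to_percent(total_new_trains, total))
```

Under Python 2, `n / d` on two integers would truncate to 0, so the helper would need `float(n) / d` instead.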


### How many days had a delay with a new train?

In [50]:
total_days = len(data.distinct('date').rows)
days_with_new_train = len(with_new_trains.distinct('date').rows)
percent_days_new_train = to_percent(days_with_new_train, total_days)

In [53]:
print('There were delays on %d days' % total_days)
print('On %d of those days there was at least one delay involving a new train' % days_with_new_train)
print('%s of all days with a delay involved a new train' % percent_days_new_train)

There were delays on 174 days
On 113 of those days there was at least one delay involving a new train
64.9% of all days with a delay involved a new train
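
Counting days rather than delays means collapsing rows by date first. Here is a rough plain-Python equivalent on a few made-up rows. The `rows` list and its values are illustrative only, not taken from the data, and `is_new_train` is a hypothetical helper that mirrors the filter used earlier.

```python
# (date, train id) pairs; hypothetical values
rows = [
    ('6/1/15', '2034'),
    ('6/1/15', '1059'),
    ('6/2/15', '1059'),
    ('6/3/15', '2012'),
    ('6/3/15', '2034'),
]

def is_new_train(train_id):
    return train_id.startswith('2') and len(train_id) > 3

# A set keeps each date once, however many delays happened that day
all_days = {date for date, _ in rows}
new_train_days = {date for date, train_id in rows if is_new_train(train_id)}

print(len(all_days), len(new_train_days))
```

A day with several new-train delays is counted once, which is why this share can be much higher than the share of individual delays.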


### How many unique new trains had issues?

In [54]:
unique_new_trains = len(with_new_trains.distinct('id').rows)

In [55]:
print('%d different new trains were involved in one or more delays' % unique_new_trains)

38 different new trains were involved in one or more delays


### What is the median delay?

In [56]:
median_delay = data.aggregate(agate.Median('minutes'))
print('The median delay was %d minutes' % median_delay)

The median delay was 9 minutes
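
The code reports the median rather than the mean, which matters for delay data: a few very long delays pull the mean up but barely move the median. A small sketch with made-up delay values (not from the dataset), using the standard `statistics` module:

```python
import statistics

delays = [5, 6, 8, 9, 10, 12, 90]  # hypothetical minutes, with one extreme value

print(statistics.median(delays))  # the middle value of the sorted list
print(statistics.mean(delays))    # the sum divided by the count
```

In this toy list the single 90-minute delay more than doubles the mean relative to the median.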


### How many delays per day on average?

In [58]:
counts_date = data.counts('date')
per_day = counts_date.aggregate(agate.Mean('count'))
print('On average there were %.1f delays per day' % per_day)

On average there were 4.9 delays per day
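
`counts('date')` produces one row per date that appears in the data, so the mean is taken over days that had at least one delay. Days with no delays are not counted as zeros. A plain-Python sketch of the same calculation on made-up dates:

```python
from collections import Counter

# Hypothetical dates, one entry per delay; note that 6/3/15 has no delays
dates = ['6/1/15', '6/1/15', '6/1/15', '6/2/15', '6/4/15', '6/4/15']
per_date = Counter(dates)

# Total delays divided by the number of days that appear at all
mean_per_day = sum(per_date.values()) / len(per_date)
print(per_date)
print(round(mean_per_day, 1))
```

Averaging over every calendar day in the period, including days without delays, would give a lower figure.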
