## TFL Oyster Card Journeys Analysis
London is a major world capital in many sectors, though perhaps most notably in finance. With a booming economy, people are flocking to the city from all around the world, attracted by the thriving job market and high standard of living.
This has caused its population to increase steadily and consistently since the 1970s, and most notably since the 1990s.
In 2011, the Current London Plan [[1]](https://www.london.gov.uk/what-we-do/planning/london-plan/current-london-plan/london-plan-chapter-one-context-and-strategy-0) predicted:
> ... London’s population rising from 8.2 million in 2011, to:
>
>9.20 million in 2021;
>9.54 million in 2026;
>9.84 million in 2031; and
>10.11 million in 2036.

So far their estimations have proved accurate; current estimates put the population of London at around 9.3 million. [[2]](https://worldpopulationreview.com/world-cities/london-population/)

With a growing population, London's roads are becoming more congested and the TFL network is struggling to cope. Never was this more obvious than at the height of the COVID-19 crisis in early March this year.

In this notebook I will analyse a dataset containing information on TFL Oyster card journeys across a 7-day period in November 2009 to draw some insights on the TFL network to highlight the key problem areas that need to be addressed.

The dataset was collected by a classmate of mine in the UCL Department of Physics and Astronomy and is available here: [TFL Oyster Journeys '09](https://www.kaggle.com/astronasko/transport-for-london-journey-information)

<figure>
   <a href="https://www.london.gov.uk/sites/default/files/styles/gla_large_unconstrained/public/figure_1.1_annual_pop_change_1971-2011.png?itok=tmztWl0e">
   <img src="https://www.london.gov.uk/sites/default/files/styles/gla_large_unconstrained/public/figure_1.1_annual_pop_change_1971-2011.png?itok=tmztWl0e" width="500" align="center"/></a>
   <figcaption> London population change between 1971 and 2011. Source: london.gov.uk
   </figcaption>
</figure>

In [None]:
# import modules
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read in dataset
df = pd.read_csv('../input/transport-for-london-journey-information/Nov09JnyExport.csv')

In [None]:
# Check the head of the dataset
df.head()

In [None]:
df.info()

# Step 1: Data Wrangling
* First off, I am going to clean up and reformat the data to make it suit my purposes.
* I'll start by renaming some columns to make them more descriptive and consistent.

In [None]:
df = df.rename(columns={'downo': 'DOWno', 'daytype': 'DOW', 'EndStation': 'EndStn',
                        'EXTimeHHMM': 'ExTimeHHMM', 'ZVPPT': 'Zones', 'JNYTYP': 'JourneyType',
                        'FFare': 'FullFare', 'DFare': 'DiscountFare', 'RouteID': 'BusRoute'})

In [None]:
df['EntTimeHHMM'].value_counts()

In [None]:
df['ExTimeHHMM'].value_counts()

* By far the most frequent value for both 'EntTimeHHMM' and 'ExTimeHHMM' is 00:00
* This appears erroneous, as one would expect that most passengers would travel at rush hour (before 9am, after 5pm)
* I will inspect this further and possibly drop these two columns, as the 'EntTime' and 'ExTime' columns supply sufficient information (time in minutes after midnight)

In [None]:
df[df['EntTimeHHMM'] != '00:00']['EntTimeHHMM'].value_counts().head(30)

In [None]:
df[df['EntTimeHHMM'] != '00:00']['EntTimeHHMM'].value_counts().tail(30)

In [None]:
df[df['EntTimeHHMM'] == '27:25']

In [None]:
df[(df['EntTimeHHMM'] == '26:36')]

### Some observations on 'EntTimeHHMM' and 'ExTimeHHMM'
* As expected, the vast majority of journeys (excluding the default time of 00:00) happen between 08:00-09:00 or 17:00-18:00, which is rush hour in London
* All erroneous entries where HH > 23 are bus journeys on night bus routes (including 24-hour routes such as route 83), as evidenced in the tables above (e.g. N295 at index 947857)
* These two columns do not provide any insight that I cannot derive from the 'EntTime' and 'ExTime' columns, and their format is harder to work with. I will therefore drop both columns.
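
To illustrate how a value such as '27:25' can arise, here is a minimal sketch that formats a minutes-after-midnight value as HH:MM without wrapping at 24 hours. It assumes, without confirmation from the dataset documentation, that the HHMM columns are simply a formatted copy of 'EntTime' and 'ExTime'. The helper name `minutes_to_hhmm` is hypothetical.

```python
def minutes_to_hhmm(minutes: int) -> str:
    """Format minutes after midnight as HH:MM without wrapping at 24 hours."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}"

# 1645 minutes after midnight is 03:25 on the next calendar day,
# but it reads as 27:25 when the hour counter is not wrapped.
print(minutes_to_hhmm(1645))        # 27:25
print(minutes_to_hhmm(8 * 60 + 30)) # 08:30
```

If that assumption holds, the minute-based columns carry the same information in a form that is easier to subtract and compare.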

In [None]:
df = df.drop(['EntTimeHHMM', 'ExTimeHHMM'], axis=1)

* Now I'll check the entries for the remaining categories

In [None]:
# Bus by far most popular mode of transport (for oyster card journeys at least)
df['SubSystem'].value_counts()

In [None]:
print(sorted(df['StartStn'].unique()))

In [None]:
print(sorted(df['EndStn'].unique()))

* Looking at the unique station names in 'StartStn', we can see that some passengers did not touch in, resulting in default station name 'Unstarted'
* Doing the same for 'EndStn', we find that journeys where the passenger did not touch out are given the default station name 'Unfinished'
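
As a rough sketch of how these placeholder names could be quantified, the snippet below counts missing touch-ins and touch-outs on a small made-up table. The rows in `toy` are invented for illustration; only the column names and the 'Unstarted' and 'Unfinished' labels come from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up).
toy = pd.DataFrame({
    'StartStn': ['Unstarted', 'Bank', 'Oxford Circus', 'Bus'],
    'EndStn':   ['Victoria', 'Unfinished', 'Bank', 'Bus'],
})

# Flag journeys with a missing touch in or a missing touch out.
no_touch_in = toy['StartStn'].eq('Unstarted')
no_touch_out = toy['EndStn'].eq('Unfinished')

print(f"Missing touch in:  {no_touch_in.sum()}")
print(f"Missing touch out: {no_touch_out.sum()}")
print(f"Share incomplete:  {(no_touch_in | no_touch_out).mean():.0%}")
```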

## Completed Journeys
* I am interested in exploring journey length and how it interacts with other variables.
* In order to do this, I will now create a new dataframe called 'complete', which is a subset of 'df', but only containing entries where the passenger touched both in and out (i.e. we know both their start and end station).

In [None]:
complete = df[(df['StartStn'] != 'Unstarted') & (df['EndStn'] != 'Unfinished')]

In [None]:
complete.head()

In [None]:
complete.sample()

* Looking at subsets of 'complete', it is apparent that all bus journeys have a default Exit Time of midnight. This makes sense, as passengers are not required to touch out on bus journeys, so there will be no information on the end time of their journey. For our purposes, this is not helpful, as we are trying to look at journey times.
* A further observation is that for some bus journeys, the Start station and End station are both 'Bus' by default.
* For this reason, I will drop all bus journeys from the 'complete' subset.
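
One way to check this kind of observation systematically, rather than by eye, is to compute the share of placeholder exit times per mode of transport. The sketch below uses made-up rows and assumes that the midnight default is stored as 0 in 'ExTime'.

```python
import pandas as pd

# Toy rows (made up); 'LTB' is the bus subsystem code used in the dataset.
toy = pd.DataFrame({
    'SubSystem': ['LTB', 'LTB', 'LUL', 'LUL'],
    'ExTime':    [0, 0, 545, 1082],
})

# Fraction of rows per mode whose exit time equals the assumed placeholder 0.
default_share = toy['ExTime'].eq(0).groupby(toy['SubSystem']).mean()
print(default_share)
```

A share close to 1.0 for a mode would suggest that its exit times carry no usable information.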

In [None]:
# drop all bus journeys from complete
complete = complete[complete['SubSystem'] != 'LTB']

* As we have dropped all bus journeys, the 'BusRoute' column is now redundant, as all remaining entries will be 'XX' by default. I will now drop this column.

In [None]:
complete = complete.drop('BusRoute', axis=1)

### Feature engineering
Next I create a new column called 'JourneyLength', the total journey time in minutes, computed as the difference between 'ExTime' and 'EntTime'.

In [None]:
complete['JourneyLength'] = complete['ExTime'] - complete['EntTime']

In [None]:
complete[complete['JourneyLength'] <= 0]

* Looking at 'JourneyLength', we find that all Tram journeys follow the same pattern as Bus journeys, defaulting to an Exit Time of midnight. This is not helpful for our purposes, so I will now drop all Tram journeys.

In [None]:
complete = complete[complete['SubSystem'] != 'TRAM']

In [None]:
complete[complete['JourneyLength'] <= 0]

In [None]:
complete[complete['JourneyLength'] <= 0].sample()

* Next, we find that there are 724 remaining entries where 'JourneyLength' is less than or equal to zero. These mostly appear to be cases where a passenger has touched in and out at the same station, so I now drop all entries where the Start Station and End Station are the same.

In [None]:
complete = complete[complete['StartStn'] != complete['EndStn']]

In [None]:
complete[complete['JourneyLength'] <= 0]

In [None]:
print(f"Remaining invalid journey times: {len(complete[complete['JourneyLength'] <= 0])}")

* Lastly, we have 46 remaining entries where 'JourneyLength' is less than or equal to zero. Since same-station journeys have already been removed, these entries have different start and end stations but still an invalid journey time, so I will drop them.
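
The cleaning steps in this section can be sketched as a single loop that reports how many rows each filter removes. This is an illustrative sketch on made-up rows, not the notebook's own cells; the `steps` dictionary and its labels are hypothetical.

```python
import pandas as pd

# Toy rows (made up) using the column names from the dataset.
toy = pd.DataFrame({
    'SubSystem': ['LUL', 'LUL', 'LUL', 'TRAM', 'LUL'],
    'StartStn':  ['Bank', 'Bank', 'Angel', 'Tram', 'Euston'],
    'EndStn':    ['Angel', 'Bank', 'Bank', 'Tram', 'Angel'],
    'EntTime':   [500, 600, 700, 800, 1430],
    'ExTime':    [520, 600, 690, 0, 10],
})
toy['JourneyLength'] = toy['ExTime'] - toy['EntTime']

# Each step is a boolean mask of rows to keep.
steps = {
    'drop trams': toy['SubSystem'] != 'TRAM',
    'drop same-station journeys': toy['StartStn'] != toy['EndStn'],
    'drop non-positive lengths': toy['JourneyLength'] > 0,
}

cleaned = toy
for name, mask in steps.items():
    before = len(cleaned)
    # Restrict the mask to the rows that survived the previous steps.
    cleaned = cleaned[mask.loc[cleaned.index]]
    print(f"{name}: removed {before - len(cleaned)} rows")
```

Logging the row count at every step makes it easier to spot a filter that removes far more, or far fewer, rows than expected.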

In [None]:
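# .copy() makes 'complete' an independent dataframe, so later column
# assignments do not trigger pandas' SettingWithCopyWarning
complete = complete[complete['JourneyLength'] > 0].copy()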

In [None]:
complete.describe()

# Step 2: EDA
Now let's explore the data with some visualisations.

In [None]:
complete['SubSystem'].value_counts()

In [None]:
fig = plt.figure(figsize=[12,8])
plt.title('Number of Journeys by Mode of Transport')
complete['SubSystem'].value_counts().plot(kind='bar')

## Important Observation on 'complete' Dataframe
* LUL (Underground) is by far the most prominent mode of transport in 'complete'. This tells us that the 'complete' dataframe is not representative of all Oyster journeys: complete journey information appears to be available mainly for LUL, and we dropped all entries for bus and tram, even though bus was by far the most popular mode of transport overall.
* It is therefore important to note that any predictions and conclusions we make from the 'complete' dataframe are only applicable to LUL journeys, and are not at all indicative of all Oyster journeys on the TFL network.
* For this reason, it is implied that any comments made henceforth on the 'complete' dataframe apply only to LUL journeys.

In [None]:
fig = plt.figure(figsize=[14,8])
plt.title('Most Popular Underground Start Stations')
complete['StartStn'].value_counts()[:20].plot(kind='bar')

In [None]:
fig = plt.figure(figsize=[14,8])
plt.title('Most Popular Underground End Stations')
complete['EndStn'].value_counts()[:20].plot(kind='bar')

In [None]:
complete[['StartStn','EndStn']].sample(10)

## Comparing Start and End Stations
* Now I create a frequency plot for all stations with a hue of Start vs End Station
* This is my favourite visualisation as it took quite a bit of tinkering to find a good solution to create the hue 
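
As an alternative sketch, and not a description of this notebook's own cells, a long-format table with a 'hue' column can also be built with `DataFrame.melt`. The rows in `toy` are made up; the column names 'hue' and 'Stations' mirror the ones used in this section.

```python
import pandas as pd

# Toy journeys (made up).
toy = pd.DataFrame({
    'StartStn': ['Bank', 'Angel', 'Bank'],
    'EndStn':   ['Angel', 'Bank', 'Euston'],
})

# Reshape to long format: one row per (journey, role) pair.
long_df = toy.melt(value_vars=['StartStn', 'EndStn'],
                   var_name='hue', value_name='Stations')
long_df['hue'] = long_df['hue'].map({'StartStn': 'Start', 'EndStn': 'End'})

# Counts per station and role: the numbers a count plot with hue would draw.
counts = long_df.groupby(['Stations', 'hue']).size().unstack(fill_value=0)
print(counts)
```

The resulting `long_df` could be passed to `sns.countplot` with `x='Stations'` and `hue='hue'` in the same way as a concatenated table.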

In [None]:
from collections import OrderedDict

startstn = list(complete['StartStn'])
endstn = list(complete['EndStn'])
df = pd.DataFrame(data={'StartStn':startstn,'EndStn':endstn})
df.head()

In [None]:
# Create 'hue' column to use for plot

df['hue'] = 'Start' # set 'hue' to 'Start' for all Start stations
df['Stations'] = df['StartStn']
df_start = df[['Stations','hue']]

df['hue'] = 'End'   # set 'hue' to 'End' for all End stations
df['Stations'] = df['EndStn']
df_end = df[['Stations','hue']]

In [None]:
# Create order by value count

orderstart = df['StartStn'].value_counts()
startstnlist = orderstart.index.tolist()

orderend = df['EndStn'].value_counts()
endstnlist = orderend.index.tolist()

order = startstnlist+endstnlist
order = list(OrderedDict.fromkeys(order))

df_concat = pd.concat([df_start,df_end],ignore_index=True)

In [None]:
plt.figure(figsize=[20,8])
fig = sns.countplot(data=df_concat,x='Stations',order=order[:20],hue='hue')
plt.title('Most Popular Underground Stations, compared as Start or End')
fig.set_xticklabels(fig.get_xticklabels(), rotation=90);

### Most popular Underground stations (Start or End)
* Now let's look at the most popular underground stations, whether they are the start or end points of a journey

In [None]:
# Get order of all stations from df_concat
allstnlist = df_concat['Stations'].value_counts().index.tolist()
orderall = list(OrderedDict.fromkeys(allstnlist))

plt.figure(figsize=[20,8])
plt.grid()
plt.title('Most Popular Underground Stations (Start or End)')
fig = sns.countplot(x='Stations',data=df_concat,order=orderall[:20],palette='viridis')
fig.set_xticklabels(fig.get_xticklabels(), rotation=90);

## Journey Length Distribution
* Now let's investigate the distribution of journey lengths on the network

In [None]:
plt.figure(figsize=[14,6])
plt.grid()
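# histplot replaces the deprecated distplot (needs seaborn >= 0.11.2);
# stat='percent' makes the y-axis match its label
ax = sns.histplot(complete['JourneyLength'], bins=40, stat='percent', kde=True)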
ax.set_xlabel('Journey Length / mins')
ax.set_ylabel('Percentage of total journeys')

* The distribution plot shows a positive skew
* We can see that the mean Journey Length is around 20 mins
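
Reading the mean and skew off a plot is approximate, so a quick numerical check can help. The sketch below uses made-up journey lengths with a long right tail; on the real data the same three calls would be applied to `complete['JourneyLength']`.

```python
import pandas as pd

# Made-up journey lengths in minutes, with a long right tail.
lengths = pd.Series([5, 10, 12, 15, 18, 20, 22, 25, 40, 90], name='JourneyLength')

print(f"mean:   {lengths.mean():.1f}")
print(f"median: {lengths.median():.1f}")
print(f"skew:   {lengths.skew():.2f}")
```

For a positively skewed distribution, the mean is pulled above the median by the long tail, and the sample skewness is greater than zero.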

## Compare days of week
* Now let's look at a frequency plot according to the day of the week

In [None]:
daysofweek = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

In [None]:
plt.figure(figsize=[14,8])
plt.grid()
sns.countplot(x='DOW',data=complete, order = daysofweek)
plt.title('Number of Underground Journeys by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Number of journeys')

As expected, we find most journeys happen on weekdays, with fewer on Saturdays and least on Sundays.

# Step 3: Preprocessing
* Now I will prepare data for modelling, creating logical variables and dropping redundant features
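
A logical variable of this kind can be sketched with a direct comparison. The example assumes that 'DailyCapping' holds the labels 'Y' and 'N' (only 'Y' is confirmed by the notebook's code); the rows are made up.

```python
import pandas as pd

# Toy column (made up); 'N' as the negative label is an assumption.
toy = pd.DataFrame({'DailyCapping': ['N', 'Y', 'N', 'N']})

# True where the journey was capped, False otherwise.
toy['DailyCapping'] = toy['DailyCapping'].eq('Y')
print(toy['DailyCapping'].value_counts())
```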

In [None]:
# Create logical variable of Daily Capping
complete['DailyCapping'] = pd.get_dummies(complete['DailyCapping'])['Y']

* Now I look at the different Final Product entries, to find outliers and erroneous entries

In [None]:
plt.figure(figsize=[14,8])
complete['FinalProduct'].value_counts().plot(kind='bar')

In [None]:
complete['FinalProduct'].value_counts()

In [None]:
complete[complete['FinalProduct'] == 'Tfl Travel - Free']

* As we can see, 'LUL Travelcard->Annual' is an inconsistent entry matching 'LUL Travelcard-Annual'. I now rename these entries to be consistent with the rest of the data.
* I drop the two outliers where the travelcard time was not captured ('LUL Travelcard-Time Not Captured').
* It is unclear what is meant by 'Tfl Travel - Free'. My intuition is that this represents free journeys where passengers used 5-10 Zip Oyster photocards. The Oyster card Wikipedia page confirms that these Oyster cards were available in 2009, when the data was collected:
> "On 7 January 2008, Transport for London unveiled the Zip card, an Oyster photocard to be used by young people aged 18 years or under who qualify for free bus and tram travel within the capital, with effect from 1 June 2008."
[ [3] ](https://en.wikipedia.org/wiki/Oyster_card#Oyster_photocards) 

In [None]:
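# Rename typo entries; assign the result back instead of using inplace=True
# on a column selection, which may not update 'complete' in newer pandas
complete['FinalProduct'] = complete['FinalProduct'].replace('LUL Travelcard->Annual', 'LUL Travelcard-Annual')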

In [None]:
# Drop outlier entries where travelcard time period was missing
complete = complete[complete['FinalProduct'] != 'LUL Travelcard-Time Not Captured']

In [None]:
complete['FinalProduct'].value_counts()

# Conclusion
* From my analysis I can conclude that certain stations are under higher stress to accommodate passengers during rush hour. One such example is Oxford Circus.
* Most stations that experience high congestion are in Zone 1.
* It is advisable to find a way of sharing the passenger load that these congested stations carry with other nearby stations, perhaps by providing incentives for passengers to exit at a less congested station and complete their journey on foot or by bus.
* I have cleaned up the data, performed EDA and some basic preprocessing to prepare it for modelling.
* I could now perform dimensionality reduction on the 'complete' dataframe, perhaps using PCA, to draw some more interesting conclusions. I may explore this possibility in future.
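
As a rough idea of what such a step might look like, here is a minimal PCA sketch using a singular value decomposition in NumPy. The numbers are made up, and the choice of columns (JourneyLength, EntTime, FullFare) is an assumption for illustration, not a result from this notebook.

```python
import numpy as np

# Made-up numeric features: rows are journeys, columns are
# (JourneyLength, EntTime, FullFare). The column choice is an assumption.
X = np.array([
    [12.0, 510.0, 160.0],
    [25.0, 525.0, 220.0],
    [40.0, 1020.0, 290.0],
    [18.0, 1050.0, 160.0],
    [33.0, 480.0, 290.0],
])

# Standardise each column so no feature dominates because of its units.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardised data.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)  # share of variance per component
scores = Z @ Vt.T                      # journeys projected onto the components

print(explained_ratio.round(3))
```

The `explained_ratio` values indicate how much of the total variance each component captures, which is one way to judge how many components are worth keeping.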

# References
[ [1] ](https://www.london.gov.uk/what-we-do/planning/london-plan/current-london-plan/london-plan-chapter-one-context-and-strategy-0) The London Plan 2011: https://www.london.gov.uk/what-we-do/planning/london-plan/current-london-plan/london-plan-chapter-one-context-and-strategy-0

[ [2] ](https://worldpopulationreview.com/world-cities/london-population/) 2020 London Population estimate according to Worldpopulationreview: https://worldpopulationreview.com/world-cities/london-population/

[ [3] ](https://en.wikipedia.org/wiki/Oyster_card#Oyster_photocards) Oyster photocards Wikipedia link: https://en.wikipedia.org/wiki/Oyster_card#Oyster_photocards