# **London Underground ETL**

## Objectives

* Clean and merge Footfall and Station Coordinates datasets

## Inputs

* To run this notebook the StationFootfall and Stations datasets are required 

## Outputs

* A cleaned, merged DataFrame (merged_ug) of Underground station footfall with line, network, zone and coordinate columns, indexed by Station




---

# Change working directory

* The notebooks are assumed to live in a subfolder of the project, so the working directory has to be changed before the notebook is run in the editor.

We need to change the working directory from its current folder to its parent folder.
* os.getcwd() returns the current directory

In [39]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\jackr\\OneDrive\\Desktop\\my_projects\\Project-1'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [40]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [41]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\jackr\\OneDrive\\Desktop\\my_projects'
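The recorded outputs show the notebook starting in Project-1 and ending one level above it, in my_projects. A minimal sketch of a more defensive alternative follows. It assumes the project root can be recognised by a marker folder (here the Dataset folder used later in this notebook). The helper name find_project_root is hypothetical, not part of the original notebook.

```python
import os
from pathlib import Path


def find_project_root(start, marker="Dataset"):
    """Walk upwards from `start` until a directory containing `marker` is found.

    `marker` is an assumption: any file or folder that exists only at the
    project root would work just as well.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker!r}")


# Usage (hypothetical): works from the project root or from any subfolder.
# os.chdir(find_project_root(os.getcwd()))
```

Because the search stops at the first folder that contains the marker, running the cell twice does not keep climbing the directory tree.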

# Extract

* Retrieving datasets and checking values

Importing necessary libraries

In [42]:
import pandas as pd

Parsing CSV files into DataFrames

In [43]:
Footfall = pd.read_csv(r'c:\Users\jackr\OneDrive\Desktop\my_projects\Project-1\Dataset\Dirty\StationFootfall_2024_2025 .csv')
Station_Coords = pd.read_csv(r'c:\Users\jackr\OneDrive\Desktop\my_projects\Project-1\Dataset\Dirty\Stations_20180921.csv')
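The absolute paths above only work on one machine. As a rough sketch, the same files could be addressed relative to the project root with pathlib. This assumes the working directory is the project root (Project-1), which is not the case in the recorded run above.

```python
from pathlib import Path

# Assumption: the working directory is the project root (Project-1).
DATA_DIR = Path("Dataset") / "Dirty"

# File names copied from the cell above, including the space before ".csv".
footfall_path = DATA_DIR / "StationFootfall_2024_2025 .csv"
stations_path = DATA_DIR / "Stations_20180921.csv"

# The reads would then become:
# Footfall = pd.read_csv(footfall_path)
# Station_Coords = pd.read_csv(stations_path)
```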

Checking variable types 

In [44]:
Footfall.info()
Station_Coords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249121 entries, 0 to 249120
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   TravelDate     249121 non-null  int64 
 1   DayOfWeek      249121 non-null  object
 2   Station        249121 non-null  object
 3   EntryTapCount  249121 non-null  int64 
 4   ExitTapCount   249121 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 9.5+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 479 entries, 0 to 478
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   FID       479 non-null    int64  
 1   OBJECTID  479 non-null    int64  
 2   NAME      479 non-null    object 
 3   EASTING   479 non-null    int64  
 4   NORTHING  479 non-null    int64  
 5   LINES     271 non-null    object 
 6   NETWORK   479 non-null    object 
 7   Zone      479 non-null    int64  
 8   x         479 non-null    float64

^ The 'x' and 'y' values are floats, which is what we need for the map scatter plots created later on.

Displaying head of each DataFrame

In [45]:
Footfall.head()

Unnamed: 0,TravelDate,DayOfWeek,Station,EntryTapCount,ExitTapCount
0,20240101,Monday,Abbey Road DLR,395,375
1,20240101,Monday,Abbey Wood,5898,5963
2,20240101,Monday,Acton Central,609,474
3,20240101,Monday,Acton Main Line,1717,1710
4,20240101,Monday,Acton Town,2928,3334
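TravelDate is stored as an int64 in YYYYMMDD form. The notebook leaves it that way, but if date arithmetic or resampling is needed later, it could be converted along these lines. The sample frame is a small stand-in built from the rows shown above.

```python
import pandas as pd

# Small stand-in for Footfall, using values from the head() output above.
sample = pd.DataFrame({
    "TravelDate": [20240101, 20240101],
    "Station": ["Abbey Road DLR", "Acton Town"],
})

# Convert YYYYMMDD integers to datetimes; a malformed value raises an error
# rather than being silently coerced.
sample["TravelDate"] = pd.to_datetime(
    sample["TravelDate"].astype(str), format="%Y%m%d"
)

# Weekday names derived from the parsed dates can be compared with DayOfWeek.
derived_days = sample["TravelDate"].dt.day_name()
```

Comparing derived_days with the existing DayOfWeek column is a cheap consistency check on the raw data.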


In [46]:
Station_Coords.head()

Unnamed: 0,FID,OBJECTID,NAME,EASTING,NORTHING,LINES,NETWORK,Zone,x,y
0,0,78,Temple,530959,180803,"District, Circle",London Underground,1,-0.112644,51.510474
1,1,79,Blackfriars,531694,180893,"District, Circle",London Underground,1,-0.10202,51.511114
2,2,80,Mansion House,532354,180932,"District, Circle",London Underground,1,-0.092495,51.511306
3,3,81,Cannon Street,532611,180900,"District, Circle",London Underground,1,-0.088801,51.510963
4,4,82,Monument,532912,180824,"District, Circle",London Underground,1,-0.084502,51.510209


---

# Transform

* Merging and Cleaning

Filtering non-underground stations

In [47]:
# .copy() avoids a SettingWithCopyWarning when columns are modified later
Station_Coords = Station_Coords.query("NETWORK == 'London Underground'").copy()

# Checking Networks have been filtered correctly
Station_Coords.query("NETWORK == ['London Overground', 'Tramlink', 'DLR', 'TfL Rail']")

Unnamed: 0,FID,OBJECTID,NAME,EASTING,NORTHING,LINES,NETWORK,Zone,x,y


^ No rows are returned when querying the filtered-out networks, so they are no longer in the DataFrame.

Checking for duplicate rows. Only the last expression in a cell is displayed, so the 0 below is the count for Station_Coords.

In [53]:
Footfall.duplicated().sum()
Station_Coords.duplicated().sum()

0
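A notebook cell displays only its last expression, so the single 0 above covers just one of the two DataFrames. One way to report both counts is sketched below. The helper duplicate_counts and the demo frames are illustrative, not part of the original notebook.

```python
import pandas as pd


def duplicate_counts(frames):
    """Return {name: number of fully duplicated rows} for each DataFrame."""
    return {name: int(df.duplicated().sum()) for name, df in frames.items()}


# Stand-in frames for illustration. In the notebook the call would be
# duplicate_counts({"Footfall": Footfall, "Station_Coords": Station_Coords})
demo = {
    "a": pd.DataFrame({"k": [1, 1, 2]}),
    "b": pd.DataFrame({"k": [1, 2, 3]}),
}
counts = duplicate_counts(demo)
print(counts)
```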

Cleaning whitespace in columns and values

In [48]:
Footfall.columns = Footfall.columns.str.strip()
Station_Coords.columns = Station_Coords.columns.str.strip()
Footfall['Station'] = Footfall['Station'].str.strip()
Station_Coords['NAME'] = Station_Coords['NAME'].str.strip()


Merging Underground datasets

In [49]:
merged_ug = pd.merge(
    Footfall,
    Station_Coords[['NAME', 'LINES', 'NETWORK', 'Zone', 'x', 'y']],
    left_on='Station',
    right_on='NAME',
    how='left'
)

merged_ug.head(10)

Unnamed: 0,TravelDate,DayOfWeek,Station,EntryTapCount,ExitTapCount,NAME,LINES,NETWORK,Zone,x,y
0,20240101,Monday,Abbey Road DLR,395,375,,,,,,
1,20240101,Monday,Abbey Wood,5898,5963,,,,,,
2,20240101,Monday,Acton Central,609,474,,,,,,
3,20240101,Monday,Acton Main Line,1717,1710,,,,,,
4,20240101,Monday,Acton Town,2928,3334,Acton Town,"District, Piccadilly",London Underground,3.0,-0.278433,51.502137
5,20240101,Monday,Aldgate,7223,7382,Aldgate,"Metropolitan, Circle",London Underground,1.0,-0.074236,51.513982
6,20240101,Monday,Aldgate East,10657,11723,Aldgate East,"Hammersmith & City, District",London Underground,1.0,-0.06954,51.514917
7,20240101,Monday,All Saints,641,578,,,,,,
8,20240101,Monday,Alperton,2117,2235,Alperton,Piccadilly,London Underground,4.0,-0.298361,51.540227
9,20240101,Monday,Amersham,793,815,Amersham,Metropolitan,London Underground,9.0,-0.606147,51.673662
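The left merge keeps every footfall row and fills the coordinate columns with NaN where no station name matches. To see which names failed to match, and to guard against duplicate names in the coordinates table, the merge could be run with the indicator and validate options of pd.merge. The frames below are small stand-ins using values from the outputs above.

```python
import pandas as pd

footfall = pd.DataFrame({
    "Station": ["Abbey Road DLR", "Acton Town", "Aldgate"],
    "EntryTapCount": [395, 2928, 7223],
})
coords = pd.DataFrame({
    "NAME": ["Acton Town", "Aldgate"],
    "Zone": [3, 1],
})

checked = footfall.merge(
    coords,
    left_on="Station",
    right_on="NAME",
    how="left",
    validate="many_to_one",  # raises MergeError if a NAME repeats in coords
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Station names with no counterpart in the coordinates table.
unmatched = sorted(
    checked.loc[checked["_merge"] == "left_only", "Station"].unique()
)
print(unmatched)
```

Reviewing the unmatched list helps separate genuinely non-Underground stations from Underground stations whose names are spelled differently in the two datasets.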


Dropping duplicate name column

In [57]:
merged_ug = merged_ug.drop(columns=['NAME'], errors='ignore')
merged_ug.head()

Unnamed: 0,TravelDate,DayOfWeek,Station,EntryTapCount,ExitTapCount,LINES,NETWORK,Zone,x,y
0,20240101,Monday,Abbey Road DLR,395,375,,,,,
1,20240101,Monday,Abbey Wood,5898,5963,,,,,
2,20240101,Monday,Acton Central,609,474,,,,,
3,20240101,Monday,Acton Main Line,1717,1710,,,,,
4,20240101,Monday,Acton Town,2928,3334,"District, Piccadilly",London Underground,3.0,-0.278433,51.502137


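Removing all rows with NaN values. After the left merge, these are footfall rows with no match among the Underground stations in Station_Coords, which in practice means non-Underground stations (an Underground station whose name is spelled differently in the two datasets would be dropped too).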

In [58]:
merged_ug = merged_ug.dropna()
merged_ug.head()

Unnamed: 0,TravelDate,DayOfWeek,Station,EntryTapCount,ExitTapCount,LINES,NETWORK,Zone,x,y
4,20240101,Monday,Acton Town,2928,3334,"District, Piccadilly",London Underground,3.0,-0.278433,51.502137
5,20240101,Monday,Aldgate,7223,7382,"Metropolitan, Circle",London Underground,1.0,-0.074236,51.513982
6,20240101,Monday,Aldgate East,10657,11723,"Hammersmith & City, District",London Underground,1.0,-0.06954,51.514917
8,20240101,Monday,Alperton,2117,2235,Piccadilly,London Underground,4.0,-0.298361,51.540227
9,20240101,Monday,Amersham,793,815,Metropolitan,London Underground,9.0,-0.606147,51.673662
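A bare dropna() removes a row if any column is missing, not just the coordinates. If the intent is only to discard rows the merge could not match, restricting the check to the coordinate columns is a little safer. The sketch below uses a stand-in frame in which Aldgate's LINES value has been blanked purely for the demonstration.

```python
import pandas as pd

merged = pd.DataFrame({
    "Station": ["Abbey Road DLR", "Acton Town", "Aldgate"],
    # Aldgate's LINES is blanked here for the demo only.
    "LINES": [None, "District, Piccadilly", None],
    "x": [None, -0.278433, -0.074236],
    "y": [None, 51.502137, 51.513982],
})

# Drop only the rows that have no coordinates, i.e. unmatched stations.
with_coords = merged.dropna(subset=["x", "y"])

# For comparison: a bare dropna() also drops Aldgate because LINES is missing.
strict = merged.dropna()
```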


Establishing new index

In [59]:
merged_ug = merged_ug.set_index('Station')
merged_ug.head()

Station,TravelDate,DayOfWeek,EntryTapCount,ExitTapCount,LINES,NETWORK,Zone,x,y
Acton Town,20240101,Monday,2928,3334,"District, Piccadilly",London Underground,3.0,-0.278433,51.502137
Aldgate,20240101,Monday,7223,7382,"Metropolitan, Circle",London Underground,1.0,-0.074236,51.513982
Aldgate East,20240101,Monday,10657,11723,"Hammersmith & City, District",London Underground,1.0,-0.06954,51.514917
Alperton,20240101,Monday,2117,2235,Piccadilly,London Underground,4.0,-0.298361,51.540227
Amersham,20240101,Monday,793,815,Metropolitan,London Underground,9.0,-0.606147,51.673662
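Zone was an int64 in Station_Coords but appears as 3.0, 1.0 and so on after the merge, because the NaN values introduced by the left join force the column to float. Once the NaN rows have been removed, the column can be cast back to an integer. The sketch below uses a stand-in column.

```python
import pandas as pd

# Stand-in for the Zone column after the merge and dropna().
demo = pd.DataFrame({"Zone": [3.0, 1.0, 9.0]})

# Safe only once no NaN values remain; astype raises otherwise.
demo["Zone"] = demo["Zone"].astype("int64")
```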


---


# Push files to Repo


In [50]:
import os
try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid until a folder name is filled in
except Exception as e:
    print(e)
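The template cell leaves the folder name blank and nothing is written to disk. As a rough sketch of what a load step could look like, the helper below creates an output folder and writes a DataFrame to CSV. The helper name save_clean and the folder and file names in the usage comment are hypothetical, not taken from the original notebook.

```python
from pathlib import Path

import pandas as pd


def save_clean(df, out_dir, filename):
    """Create `out_dir` if needed and write `df` there as a CSV file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_path = out_dir / filename
    df.to_csv(out_path)  # the index (Station) is written as the first column
    return out_path


# Usage (hypothetical folder and file names):
# save_clean(merged_ug, Path("Dataset") / "Clean", "merged_ug.csv")
```

Reading the file back with pd.read_csv(path, index_col="Station") restores Station as the index.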