
# Retrieving and Cleansing Fleet Data

The aim of this code is to web-scrape fleet data tables from https://data.freshaviation.co.uk/airlines/ and cleanse the data to minimise the size of the resulting dataframe. The data source lists every plane owned by each airline, but what I need is the number of aircraft of each type, so that I can compare this 'fleet portfolio' with the emissions data.

## Web Scraping Precautions

Before starting, I checked the data source website for a robots.txt file listing any bot limitations, but no file was present. Regardless, when scraping I will still include a 10-second wait before retrieving data to make sure I don't overload the site.
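
A minimal sketch of how such a check could be automated with the standard library's `urllib.robotparser` is shown below. The rules, the example.com URLs and the `is_allowed` helper are all made up for illustration and are not part of this notebook, since the real site had no robots.txt to parse.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

allowed_public = is_allowed(sample_rules, "https://example.com/view-airline/Test/")
allowed_private = is_allowed(sample_rules, "https://example.com/private/page")
print(allowed_public, allowed_private)
```

Parsing the rules from a list of lines keeps the sketch runnable offline. Against a live site, `RobotFileParser` can also download the file itself via `set_url` and `read`.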

## Initial Tests and Familiarisation with Web Scraping and BeautifulSoup

I ran the quick check below with requests to see if I was successfully retrieving the page from the site.

In [1]:
import requests
url = 'https://data.freshaviation.co.uk/view-airline/American%20Airlines/'

r = requests.get(url)

print(r.content[:100])

b'<!DOCTYPE html>\r\n<html lang="en">\r\n\r\n<head>\r\n\t<meta charset="utf-8">\r\n\t<meta name="viewport" content'


I also practised with BeautifulSoup to retrieve information from the right part of the relevant table. This took quite a bit of tweaking, as it was my first time using CSS selectors and BeautifulSoup syntax. In this code I am pulling the name and IDs for the airline. This will be useful for stitching my data together with the data produced by the rest of the team (if necessary).

In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html.parser')

# Collect every table cell once: cell 1 holds the airline name and
# cell 3 holds the codes in the form 'ICAO / IATA'.
cells = soup.select('table tr td')
Name = cells[1].text
IDs = cells[3].text.split(' / ')
ICAO = IDs[0]
IATA = IDs[1]
print('ICAO: ' + ICAO + '\n' + 'IATA: ' + IATA)

ICAO: AAL
IATA: AA
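
The indexing above assumes the cell text always has the form 'ICAO / IATA'. As a hedged sketch of how that split could be made more forgiving, the hypothetical helper `split_ids` below (not part of the original notebook) strips whitespace and returns `None` for a missing code instead of raising an `IndexError`.

```python
def split_ids(cell_text):
    """Split text such as 'AAL / AA' into an (ICAO, IATA) pair.

    Hypothetical helper: a missing or empty code is returned as None.
    """
    parts = [part.strip() for part in cell_text.split('/')]
    parts = [part or None for part in parts]
    # Pad to two items so there is always an ICAO and an IATA slot.
    parts = (parts + [None, None])[:2]
    return parts[0], parts[1]


print(split_ids('AAL / AA'))
print(split_ids('AAL'))
```

Whether any airline page on the site actually omits a code is an assumption I have not verified, so this is a precaution and not a fix for a known problem.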


## Final Code and Walkthrough

The code below builds the page URL for each airline, giving the list of URLs we will need to request.

In [5]:
airlines = ['American Airlines', 'Southwest Airlines', 'Delta Air Lines', 'United Airlines', 'Spirit Airlines', 'Alaska Airlines', 'JetBlue Airways', 'SkyWest Airlines', 'Frontier Airlines', 'Hawaiian Airlines']
urls = []

for airline in airlines:
    urls.append('https://data.freshaviation.co.uk/view-airline/' + airline.replace(' ','%20') + '/')
    
urls

['https://data.freshaviation.co.uk/view-airline/American%20Airlines/',
 'https://data.freshaviation.co.uk/view-airline/Southwest%20Airlines/',
 'https://data.freshaviation.co.uk/view-airline/Delta%20Air%20Lines/',
 'https://data.freshaviation.co.uk/view-airline/United%20Airlines/',
 'https://data.freshaviation.co.uk/view-airline/Spirit%20Airlines/',
 'https://data.freshaviation.co.uk/view-airline/Alaska%20Airlines/',
 'https://data.freshaviation.co.uk/view-airline/JetBlue%20Airways/',
 'https://data.freshaviation.co.uk/view-airline/SkyWest%20Airlines/',
 'https://data.freshaviation.co.uk/view-airline/Frontier%20Airlines/',
 'https://data.freshaviation.co.uk/view-airline/Hawaiian%20Airlines/']
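
Replacing spaces with `%20` works for these names because a space is the only character in them that needs encoding. A slightly more general sketch uses `quote` from the standard library's `urllib.parse`, which would also encode characters such as `&`. The shortened airline list and the `BASE_URL` name are illustrative, and I am assuming the site accepts standard percent-encoding for any other characters.

```python
from urllib.parse import quote

BASE_URL = 'https://data.freshaviation.co.uk/view-airline/'
airlines = ['American Airlines', 'Delta Air Lines', 'JetBlue Airways']

# quote() percent-encodes spaces and other characters that are unsafe in a URL path.
urls = [BASE_URL + quote(airline) + '/' for airline in airlines]
print(urls)
```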

Next, I am going to use a for loop to web-scrape the airline fleet data from each of the URLs. This loop is quite long because, by the end of it, I will have placed all the data I need from all the site pages into a few lists. Comments within the code provide further insight into the steps carried out. WARNING: This code block may take up to 10 mins to run.

In [6]:
import requests
from time import sleep
from bs4 import BeautifulSoup

#Some lists I will need later to collate the info into
airline_name = []
ICAO_ID = []
IATA_ID = []
aircraft_type_col = []
no_aircrafts = []

for url in urls:
    #First we use requests to get the HTML from the inserted url.
    r = requests.get(url)
    
    #Next we use BeautifulSoup to locate the Name and IDs of the airline from the HTML.
    soup = BeautifulSoup(r.content, 'html.parser')
    IDs = soup.select('table tr td')[3].text.split(' / ')
    Name = soup.select('table tr td')[1].text
    ICAO = IDs[0]
    IATA = IDs[1]
    
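    #Then we collect the fleet specific table from the HTML.
    #The 10 sec pause here adds extra spacing between page requests; no new request is made at this step.
    sleep(10)
    rows = soup.select('#cfleet table tr')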
    
    #As I only need the 'Aircraft Type' from this table I will collect this value in a list.
    all_aircraft_list = []
    for row in rows:
        r_data = row.select('td')
        Aircraft_Type = r_data[2].text
        all_aircraft_list.append(Aircraft_Type)
    
    #My list has many duplicate values. I want to know how many times each Aircraft type is listed.
    #To achieve this I will create a dictionary holding the count of each list item.
    all_aircraft_dict = dict.fromkeys(all_aircraft_list)
    for key in all_aircraft_dict:
        all_aircraft_dict[key] = all_aircraft_list.count(key)
        
    #Now I would like to append these values to lists ready to be converted later into a pandas dataframe.
    aircraft_type = list(all_aircraft_dict.keys())
    airline_name.extend([Name]*len(aircraft_type))
    ICAO_ID.extend([ICAO]*len(aircraft_type))
    IATA_ID.extend([IATA]*len(aircraft_type))
    aircraft_type_col.extend(aircraft_type)
    no_aircrafts.extend(list(all_aircraft_dict.values()))
    
    #Lastly I will have the code wait 10 secs before pulling from the next page
    sleep(10)
    
print('All done :-)')

All done :-)
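
The counting step in the loop calls `list.count` once per aircraft type, which rescans the whole list each time. As an alternative sketch, `collections.Counter` from the standard library produces the same tallies in a single pass and keeps the types in order of first appearance. The sample list below is made up and stands in for `all_aircraft_list` from one page.

```python
from collections import Counter

# Made-up sample standing in for all_aircraft_list from one page.
all_aircraft_list = [
    'Airbus A321-200',
    'Boeing B737-800',
    'Airbus A321-200',
    'Boeing B737-800',
    'Airbus A321-200',
]

# Counter tallies each aircraft type in a single pass over the list.
all_aircraft_dict = Counter(all_aircraft_list)

aircraft_type = list(all_aircraft_dict.keys())
counts = list(all_aircraft_dict.values())
print(aircraft_type, counts)
```

For fleets of a few hundred aircraft the speed difference is negligible, so this is mainly a readability choice.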


All of the lists created now need to be moved into a pandas dataframe, ready for further processing later. It's also a handy way for me to visualise the data I have produced and 'sanity check' it. The dataframe is produced by using Python's built-in zip function to combine the lists into rows, which puts them in the right format for the pd.DataFrame function.

In [7]:
import pandas as pd
fleet_pd = pd.DataFrame(list(zip(airline_name, ICAO_ID, IATA_ID, aircraft_type_col, no_aircrafts)),
                        columns=['Airline Name', 'ICAO ID', 'IATA ID', 'Aircraft Type', 'No Aircrafts'])
fleet_pd

| | Airline Name | ICAO ID | IATA ID | Aircraft Type | No Aircrafts |
|---|---|---|---|---|---|
| 0 | American Airlines | AAL | AA | Airbus A321-200 | 218 |
| 1 | American Airlines | AAL | AA | Airbus A320-200 | 48 |
| 2 | American Airlines | AAL | AA | Airbus A319-100 | 133 |
| 3 | American Airlines | AAL | AA | Boeing B757-200 | 30 |
| 4 | American Airlines | AAL | AA | Boeing B737-800 | 304 |
| ... | ... | ... | ... | ... | ... |
| 89 | Frontier Airlines | FFT | F9 | Airbus A319-100 | 6 |
| 90 | Hawaiian Airlines | HAL | HA | Airbus A321neo | 18 |
| 91 | Hawaiian Airlines | HAL | HA | Airbus A330-200 | 24 |
| 92 | Hawaiian Airlines | HAL | HA | Boeing B717-200 | 19 |
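
To see what zip contributes before pandas gets involved, the sketch below runs it on a few made-up values standing in for the scraped lists. It also shows a simple length check, since zip silently stops at the shortest list and a mismatch would otherwise drop rows without warning.

```python
# Made-up sample values standing in for the lists built during scraping.
airline_name = ['American Airlines', 'American Airlines']
ICAO_ID = ['AAL', 'AAL']
aircraft_type_col = ['Airbus A321-200', 'Boeing B737-800']
no_aircrafts = [3, 2]

# zip pairs up the items at the same position in each list, giving one tuple per row.
rows = list(zip(airline_name, ICAO_ID, aircraft_type_col, no_aircrafts))
for row in rows:
    print(row)

# Sanity check: every list should be the same length,
# because zip silently stops at the shortest one.
lengths = {len(airline_name), len(ICAO_ID), len(aircraft_type_col), len(no_aircrafts)}
all_same_length = len(lengths) == 1
print(all_same_length)
```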


I will then save the data as a CSV file to protect it from changes to, or loss of, this code, and from changes to the website the data is being scraped from.

In [10]:
from datetime import date
today = date.today()
file_name = 'fleetcsv_' + today.strftime("%d%m%y") + '.csv'
fleet_pd.to_csv(file_name)
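
One detail worth knowing about this step: by default `to_csv` also writes the dataframe's row index, which then shows up as an `Unnamed: 0` column when the file is read back with `pd.read_csv`. Passing `index=False` leaves it out. The sketch below wraps the file-name logic in a hypothetical helper, `build_file_name`, so it can be tried with a fixed date. The pandas call is shown only as a comment because it needs the dataframe from above.

```python
from datetime import date


def build_file_name(day, prefix='fleetcsv_'):
    """Return a CSV file name with the date appended as DDMMYY (hypothetical helper)."""
    return prefix + day.strftime('%d%m%y') + '.csv'


# A fixed example date so the result is reproducible.
example_name = build_file_name(date(2023, 1, 5))
print(example_name)

# With the dataframe from above, the save step could then look like this:
# fleet_pd.to_csv(build_file_name(date.today()), index=False)
```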