## Milestone 3 - Cleaning/Formatting Website Data

>Flat File - Website:
The flat file has a column that flags diverted flights but carries no information on the cause of a diversion. To look up the reason a flight was diverted, the website and the flat file can be matched on flight date, origin, and destination.
The flat file has a many-to-many relationship with the website, so the flight date and the origin and destination cities are all needed to get the diversion details for a particular date and route.
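
The matching described above can be sketched as a pandas merge on the three key columns. This is only an illustration: the frame names `flat_df` and `web_df` and the rows in them are hypothetical, and it assumes both sources already use the same date type and the same naming for origin and destination (in practice airport codes may need to be mapped to city names first).

```python
import pandas as pd

# Hypothetical flat-file rows: one row per flight, with a DIVERTED flag.
flat_df = pd.DataFrame({
    "FL_DATE": pd.to_datetime(["2022-05-30", "2022-05-30", "2022-05-31"]),
    "ORIGIN": ["Charlotte", "Atlanta", "Dublin"],
    "DEST": ["Rome", "Dallas", "Manchester"],
    "DIVERTED": [1, 0, 1],
})

# Hypothetical website rows: diversion details keyed by date and route.
web_df = pd.DataFrame({
    "FL_DATE": pd.to_datetime(["2022-05-30", "2022-05-31"]),
    "ORIGIN": ["Charlotte", "Dublin"],
    "DEST": ["Rome", "Manchester"],
    "DIVERTED_REASON": ["maintenance issue", "technical issue"],
})

# A left join keeps every flat-file row; flights with no match get NaN.
merged = flat_df.merge(web_df, on=["FL_DATE", "ORIGIN", "DEST"], how="left")
print(merged)
```

Because several flights can share the same date and route, a merge on these keys can produce more than one match per flight. Adding the flight number as a fourth key, where both sources carry it, would narrow the match.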

In [1]:
#Milestone 3 libraries
import pandas as pd
import numpy as np
from urllib.request import Request, urlopen 
from bs4 import BeautifulSoup

In [2]:
url = 'https://www.diverted.eu/' #Website with diverted flight information

In [3]:
# Parsing HTML using BeautifulSoup
html = urlopen(url) 
soup = BeautifulSoup(html, 'html.parser')
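
The notebook imports `Request` but never uses it. Some sites reject the default `urllib` client; whether this one does is an assumption. In that case, a request with an explicit User-Agent header can be passed to `urlopen` instead of the bare URL. The helper name `build_request` and the header value are hypothetical.

```python
from urllib.request import Request

def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that sends an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("https://www.diverted.eu/")
# html = urlopen(req)  # pass the Request object instead of the URL string
```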

In [4]:
#Parse HTML for the diverted data table (find_all is the current name for the deprecated findAll alias)
flight_diverted_table = soup.find_all("table", { 'id' : 'tablepress-current_month' })
flight_diverted_table

[<table class="tablepress tablepress-id-current_month tablepress-responsive" id="tablepress-current_month">
 <thead>
 <tr class="row-1 odd">
 <th class="column-1">Date</th><th class="column-2">Airlines/Operator</th><th class="column-3">Flight number</th><th class="column-4">Departure airport</th><th class="column-5">Destination airport</th><th class="column-6">Diverted to</th><th class="column-7">Emergency code</th><th class="column-8">Alleged reason</th><th class="column-9">Aircraft</th><th class="column-10">Registration</th>
 </tr>
 </thead>
 <tbody class="row-hover">
 <tr class="row-2 even">
 <td class="column-1">20.07.2022</td><td class="column-2">Go First</td><td class="column-3">G8151 / GOW151</td><td class="column-4">Delhi</td><td class="column-5">Guwahati</td><td class="column-6">Jaipur</td><td class="column-7"></td><td class="column-8">cracked windshield</td><td class="column-9">Airbus A320-271N</td><td class="column-10">VT-WGP</td>
 </tr>
 <tr class="row-3 odd">
 <td class="c

In [5]:
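#Load the data table into a dataframe (read_html returns a list of dataframes)
from io import StringIO #newer pandas versions expect a file-like object rather than a raw HTML string
flight_diverted_table = pd.read_html(StringIO(str(flight_diverted_table)))
flight_diverted_df = flight_diverted_table[0]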
flight_diverted_df

Unnamed: 0,Date,Airlines/Operator,Flight number,Departure airport,Destination airport,Diverted to,Emergency code,Alleged reason,Aircraft,Registration
0,20.07.2022,Go First,G8151 / GOW151,Delhi,Guwahati,Jaipur,,cracked windshield,Airbus A320-271N,VT-WGP
1,20.07.2022,Wizz Air,W65058 / WZZ101S,Bari,Krakow,Budapest,,bomb threat,Airbus A321-271NX,HA-LGA
2,19.07.2022,Go First,G8386 / GOW386,Mumbai,Leh,Delhi,,engine issue,Airbus A320-271N,VT-WGA
3,19.07.2022,Go First,G86202 / GOW6202,Srinagar,Delhi,Srinagar,,engine issue,Airbus A320-271N,VT-WJG
4,19.07.2022,LOT,LO6297 / LOT6297,Prague,Zanzibar,Warsaw,,brakes issue,Boeing 787-9 Dreamliner,SP-LSB
...,...,...,...,...,...,...,...,...,...,...
679,04.11.2021,Azul Linhas Aéreas,AD4327 / AZU4327,Goiani,Campinas,Brasilia,,technical issue,Embraer E195-E2,PS-AEF
680,02.11.2021,United Airlines,UA818 / UAL818,Buenos Aires,Houston,Buenos Aires,,pressurisation issue,Boeing 787-9 Dreamliner,N36962
681,01.11.2021,Delta Air Lines,DL9962 / DAL9962,Atlanta,Key West,Atlanta,,airspeed issue,Airbus A319-114,N364NB
682,01.11.2021,Delta Air Lines,DL365 / DAL365,Atlanta,Los Angeles,Dallas,,disruptive passenger,Airbus A321-211,N390DN


### Data Transformation 

#### String to Date conversion

In [6]:
flight_diverted_df.Date.dtype

dtype('O')

>Flight date is stored as a string (pandas dtype 'O' is the generic object dtype, which is how the string values are held here).

In [7]:
#Convert flight date from string to datetime (the source format is DD.MM.YYYY)
flight_diverted_df["Date"] = pd.to_datetime(flight_diverted_df["Date"], format='%d.%m.%Y')
flight_diverted_df.head(5)

Unnamed: 0,Date,Airlines/Operator,Flight number,Departure airport,Destination airport,Diverted to,Emergency code,Alleged reason,Aircraft,Registration
0,2022-07-20,Go First,G8151 / GOW151,Delhi,Guwahati,Jaipur,,cracked windshield,Airbus A320-271N,VT-WGP
1,2022-07-20,Wizz Air,W65058 / WZZ101S,Bari,Krakow,Budapest,,bomb threat,Airbus A321-271NX,HA-LGA
2,2022-07-19,Go First,G8386 / GOW386,Mumbai,Leh,Delhi,,engine issue,Airbus A320-271N,VT-WGA
3,2022-07-19,Go First,G86202 / GOW6202,Srinagar,Delhi,Srinagar,,engine issue,Airbus A320-271N,VT-WJG
4,2022-07-19,LOT,LO6297 / LOT6297,Prague,Zanzibar,Warsaw,,brakes issue,Boeing 787-9 Dreamliner,SP-LSB
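
With an explicit `format`, `pd.to_datetime` raises an error on any value that does not match. If the scraped table might contain a malformed date (an assumption; none is shown above), `errors="coerce"` turns such values into `NaT` so they can be inspected. A small sketch with made-up values:

```python
import pandas as pd

# Two well-formed DD.MM.YYYY strings and one deliberately bad value.
raw = pd.Series(["20.07.2022", "31.05.2022", "not a date"])

# Unparseable entries become NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%d.%m.%Y", errors="coerce")

# Recover the original strings that failed to parse.
bad_rows = raw[parsed.isna()]
print(parsed)
print(bad_rows)
```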


#### Filter Flights by Date

>Select only the data for May 2022, since our Excel data is for May 2022.

In [8]:
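#.copy() makes the subset an independent dataframe, so later updates do not trigger SettingWithCopyWarning
diverted_df = flight_diverted_df[(flight_diverted_df.Date >= '2022-05-01') & (flight_diverted_df.Date < '2022-06-01')].copy()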
diverted_df.head(5)

Unnamed: 0,Date,Airlines/Operator,Flight number,Departure airport,Destination airport,Diverted to,Emergency code,Alleged reason,Aircraft,Registration
147,2022-05-31,Virgin Australia,VA9223 / VOZ9223,Perth,Boolgeeda,Perth,,hydraulic issue,Airbus A320-232,VH-VNB
148,2022-05-31,Aer Lingus,EI3326 / EAI26MH,Dublin,Manchester,Dublin,7700.0,technical issue,ATR 72-600,EI-HDH
149,2022-05-30,American Airlines,AA720 / AAL720,Charlotte,Rome,Charlotte,,maintenance issue,Boeing 777-223(ER),N793AN
150,2022-05-29,Swiss,LX340 / SWR340V,Zurich,London,Zurich,,odor in cockpit,Airbus A220-100,HB-JBI
151,2022-05-29,Qantas,QF2008 / QLK8D,Sydney,Tamworth,Sydney,,hydraulic issue,De Havilland Canada Dash 8-400,VH-QOF
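
The filter above uses a half-open interval: the start date is included and the first day of the next month is excluded. A minimal sketch of the same pattern on made-up rows, with the subset copied so it can be modified without touching the original frame:

```python
import pandas as pd

# Hypothetical rows on either side of the May 2022 boundaries.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-04-30", "2022-05-01", "2022-05-31", "2022-06-01"]),
    "Alleged reason": ["bird strike", "engine issue", "hydraulic issue", "bomb threat"],
})

start, end = pd.Timestamp("2022-05-01"), pd.Timestamp("2022-06-01")
in_may = (df["Date"] >= start) & (df["Date"] < end)  # start inclusive, end exclusive

may_df = df.loc[in_may].copy()  # independent copy of the May rows
may_df["Alleged reason"] = may_df["Alleged reason"].str.title()
print(may_df)
```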


#### Replace Headers

In [9]:
#Columns before renaming
diverted_df.columns

Index(['Date', 'Airlines/Operator', 'Flight number', 'Departure airport',
       'Destination airport', 'Diverted to', 'Emergency code',
       'Alleged reason', 'Aircraft', 'Registration'],
      dtype='object')

In [10]:
#Renaming columns
diverted_df.columns = ['FL_DATE', 'OP_UNIQUE_CARRIER', 'FL_NUM', 'ORIGIN', 'DEST', 'DIVERTED_TO', 'EMERGENCY_CODE', 'DIVERTED_REASON', 'AIRCRAFT', 'AIRCRAFT_REGISTRATION'] 

In [11]:
#Columns after renaming
diverted_df.columns

Index(['FL_DATE', 'OP_UNIQUE_CARRIER', 'FL_NUM', 'ORIGIN', 'DEST',
       'DIVERTED_TO', 'EMERGENCY_CODE', 'DIVERTED_REASON', 'AIRCRAFT',
       'AIRCRAFT_REGISTRATION'],
      dtype='object')
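
Assigning a list to `.columns` depends on the column order staying the same. As an alternative sketch, `DataFrame.rename` with an explicit mapping ties each new name to its old one; the mapping below restates the rename shown above.

```python
import pandas as pd

# Old website header -> new name aligned with the flat file.
column_map = {
    "Date": "FL_DATE",
    "Airlines/Operator": "OP_UNIQUE_CARRIER",
    "Flight number": "FL_NUM",
    "Departure airport": "ORIGIN",
    "Destination airport": "DEST",
    "Diverted to": "DIVERTED_TO",
    "Emergency code": "EMERGENCY_CODE",
    "Alleged reason": "DIVERTED_REASON",
    "Aircraft": "AIRCRAFT",
    "Registration": "AIRCRAFT_REGISTRATION",
}

# Empty frame with the website headers, used only to demonstrate the rename.
df = pd.DataFrame(columns=list(column_map))
renamed = df.rename(columns=column_map)
print(list(renamed.columns))
```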

#### Drop rows 

###### Drop null rows, if any

In [12]:
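print('Data before dropping null rows : ',diverted_df.shape)
#dropna() returns a new dataframe, so assign the result; how='all' drops only rows where every column is null
diverted_df = diverted_df.dropna(how='all')
print('Data after dropping null rows : ', diverted_df.shape)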

Data before dropping null rows :  (87, 10)
Data after dropping null rows :  (87, 10)


>No fully null rows to drop. Missing EMERGENCY_CODE values are handled in the Fill NA/NaN section.
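
Two details of `dropna` matter for this check: it returns a new dataframe rather than modifying the original, and by default it drops a row if any column is null. A small sketch on made-up rows shows the difference between the default and `how="all"`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one with a missing code, one fully empty.
df = pd.DataFrame({
    "FL_DATE": ["2022-05-31", "2022-05-30", None],
    "EMERGENCY_CODE": [7700.0, np.nan, np.nan],
})

any_null = df.dropna()           # drops rows with at least one null value
all_null = df.dropna(how="all")  # drops only rows where every value is null

print(df.shape, any_null.shape, all_null.shape)
```

Since most flights here have no emergency code, the default `how="any"` would remove nearly every row, which is why `how="all"` is the safer choice for this table.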

###### Drop duplicates, if any

In [13]:
print('Dataframe before dropping duplicates :', diverted_df.shape)
diverted_df = diverted_df.drop_duplicates() #assign back so the second print reflects the deduplicated dataframe
print('Dataframe after dropping duplicates :',diverted_df.shape)
#No duplicates in the website data table

Dataframe before dropping duplicates : (87, 10)
Dataframe after dropping duplicates : (87, 10)
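
Comparing shapes only works if the second shape belongs to the deduplicated dataframe. A more direct check, sketched here on made-up rows, counts duplicates with `duplicated()` before dropping them:

```python
import pandas as pd

# Hypothetical rows with one exact duplicate.
df = pd.DataFrame({
    "FL_NUM": ["AA720 / AAL720", "AA720 / AAL720", "LX340 / SWR340V"],
    "ORIGIN": ["Charlotte", "Charlotte", "Zurich"],
})

n_dupes = int(df.duplicated().sum())                   # repeats of an earlier row
deduped = df.drop_duplicates().reset_index(drop=True)  # returns a new dataframe

print(n_dupes, len(df), len(deduped))
```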


#### Update rows 

>Look for rows with an inconsistent reason for diversion

In [14]:
diverted_df.groupby(['DIVERTED_REASON'])['DIVERTED_REASON'].count() 

DIVERTED_REASON
air conditioning issue          1
bird strike                     8
bomb threat                     1
brakes issue                    1
cracked windshield              1
disruptive passenger            5
engine issue                    4
hydraulic issue                 5
landing gear issue              3
maintenance issue               1
medical emergency              14
odor in cockpit                 1
odor on board                   2
operational reasons             1
possible landing gear issue     1
possible medical emergency      1
possible technical issue        2
pressurisation issue            6
smell on board                  3
smoke indication                1
smoke on board                  1
technical issue                 8
weather radar issue             1
winglet issue                   1
“rostering error”               1
Name: DIVERTED_REASON, dtype: int64

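>The rostering error entry is wrapped in curly quotes (“ ”). Replacing it with a plain, unquoted label for consistency. Note that the replacement uses an underscore (rostering_error), while the other reasons are separated by spaces.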

In [15]:
diverted_df.loc[diverted_df.DIVERTED_REASON == '“rostering error”', 'DIVERTED_REASON'] = 'rostering_error'

In [16]:
#Validate data after update
diverted_df.groupby(['DIVERTED_REASON'])['DIVERTED_REASON'].count() 

DIVERTED_REASON
air conditioning issue          1
bird strike                     8
bomb threat                     1
brakes issue                    1
cracked windshield              1
disruptive passenger            5
engine issue                    4
hydraulic issue                 5
landing gear issue              3
maintenance issue               1
medical emergency              14
odor in cockpit                 1
odor on board                   2
operational reasons             1
possible landing gear issue     1
possible medical emergency      1
possible technical issue        2
pressurisation issue            6
rostering_error                 1
smell on board                  3
smoke indication                1
smoke on board                  1
technical issue                 8
weather radar issue             1
winglet issue                   1
Name: DIVERTED_REASON, dtype: int64
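
Matching the exact quoted string works for a single known value. As a more general sketch (an alternative, not what the cell above does), the quote characters can be stripped from every value in the column, which also keeps the space-separated wording used by the other reasons. The sample values are illustrative.

```python
import pandas as pd

# Sample reasons: one clean, one in curly quotes, one in straight quotes.
reasons = pd.Series(["bird strike", "“rostering error”", '"engine issue"'])

# Strip curly quotes, straight quotes, and spaces from both ends of each value.
cleaned = reasons.str.strip("“”\"' ")
print(list(cleaned))
```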

#### Fill NA/NaN values 

In [17]:
print('EMERGENCY_CODE before updating NA/NAN : ',diverted_df.EMERGENCY_CODE.unique())

EMERGENCY_CODE before updating NA/NAN :  [  nan 7700.]


In [18]:
diverted_df = diverted_df.replace(np.nan,0)

In [19]:
print('EMERGENCY_CODE after updating NA/NAN : ',diverted_df.EMERGENCY_CODE.unique())

EMERGENCY_CODE after updating NA/NAN :  [   0. 7700.]
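
`replace(np.nan, 0)` on the whole dataframe fills missing values in every column, not just EMERGENCY_CODE. A sketch of a column-scoped alternative on made-up rows, which also stores the code as an integer instead of a float:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a missing emergency code and a missing reason.
df = pd.DataFrame({
    "EMERGENCY_CODE": [np.nan, 7700.0],
    "DIVERTED_REASON": ["hydraulic issue", None],
})

# Fill only the emergency code column; other columns keep their missing values.
df["EMERGENCY_CODE"] = df["EMERGENCY_CODE"].fillna(0).astype(int)
print(df)
```

Here 0 acts as a sentinel for "no emergency code reported". If missing values should stay distinguishable from real codes, pandas' nullable `Int64` dtype is another option.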


### Ethical implications:
>Website Data - 
The flat file comes from a genuine and reliable source (Bureau of Transportation). The website, however, may not hold accurate information because it is not a government or FAA-authorized source. The website does not mention where its data comes from, which makes the accuracy and legality of the data questionable, and its own disclaimer says as much. Even so, a high-level search on a couple of the diverted flights confirmed the website's details for those flights.

### Conclusion:
>As a part of this milestone, the following Data Transformation steps have been performed.

>1. Data Type conversion
>2. Renamed columns 
>3. Replaced values in a dataframe column
>4. Filtered data 
>5. Filled NA/NaN values
>6. Performed checks for duplicates and null rows

###### Following lines can be ignored. I will be reusing the same file for upcoming milestones