# Airlines On-Time Performance, Delays, Cancellations and Diversions

## Milestone 1
### Introduction: 
>Airline cancellations and delays are among the major causes of passenger inconvenience. Using a publicly available dataset and data science techniques, I hope to gain meaningful insights into the best-performing airlines and to understand the causes of delays, diversions and cancellations across different carriers. 
For the final project, I would like to analyze airline data to identify the factors that affect a carrier's performance. As performance measures, I would like to explore on-time arrivals and the number of cancellations by
carrier, along with the reasons for delays and diversions. Based on the outcome, carriers can take the necessary actions to focus on their problem areas.


### Data Source: 
-  Flat File: Excel files from the Bureau of Transportation Statistics (BTS). The data covers airline performance factors such as cancelled, diverted, delayed and on-time flights. The downloaded raw data has up to 34 columns.
https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?20=E (use the Download Raw Data link for the data).
- API: The Visual Crossing Weather API (accessed through RapidAPI) provides historical weather information. https://visual-crossing-weather.p.rapidapi.com/history?startDateTime={}&aggregateHours=24&location={}&endDateTime={}&unitGroup=us 
-  Website: The website lists diverted flights. https://www.diverted.eu/ 
        
    
### Relationships: 
> The Flat file is the main data source with scheduled flight information.

>   Flat File - API:

>   The flat file records cancellations and delays due to weather. I would like to look up the weather on the flight date at the origin/destination of flights cancelled or delayed due to bad weather. The Bureau data runs through January 2023, so looking up the weather for a past date requires historic weather data. The API returns historic
weather data for a location (origin or destination city name). This makes it possible to validate whether there really was bad weather when a flight was delayed or cancelled, and to identify the kind of bad weather, such as storms, snow or wind.
   
>   - The flat file has a many-to-many relationship with the API. We will need to pass the flight date and the origin or destination city to the API to get weather information for a particular date and place.
   
>   Flat File - Website:

>   The flat file has a column for diverted flights but no information on the cause of the diversion. I would like to look up the reason a flight was diverted. The website and flat file can be matched on flight date, origin and destination to look up diverted flight information.

>  - The flat file has a many-to-many relationship with the website. We will need to match on the flight date and the origin and destination city to get flight diversion details for a particular date and route.


### Project Subject Area: 
>The project aims to identify performance measures in airline operations. Statistical analysis can give insights into the best- and worst-performing airline carriers and the most common reasons for delays and cancellations.


### Challenges:
>The flight performance data (flat file) is very large. I will have to find ways to reduce it to
a reasonable size without losing meaningful information.
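
One possible way to keep the flat file manageable is to read only the columns that are needed and to store repeated text values compactly. The sketch below is an illustration only: the sample rows are made up, and `wanted_columns` is a hypothetical subset rather than the final column list for this project.

```python
import io
import pandas as pd

# Tiny made-up stand-in for the BTS download; the real file has up to 34 columns
sample_csv = io.StringIO(
    "YEAR,FL_DATE,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_WAC,ARR_DELAY,CANCELLED\n"
    "2022,5/1/2022 12:00:00 AM,AA,ABQ,86,-3.0,0.0\n"
    "2022,5/1/2022 12:00:00 AM,WN,DAL,74,22.0,0.0\n"
)

# Hypothetical subset of columns to keep
wanted_columns = ["FL_DATE", "OP_UNIQUE_CARRIER", "ORIGIN", "ARR_DELAY", "CANCELLED"]

small_df = pd.read_csv(
    sample_csv,
    usecols=wanted_columns,  # unused columns are skipped while parsing
    dtype={"OP_UNIQUE_CARRIER": "category", "ORIGIN": "category"},  # compact storage for repeated codes
)
print(small_df.dtypes)
```

Reading fewer columns reduces memory at load time, while dropping columns after loading only helps once the full file is already in memory.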

    
### Conclusion:
>For the first project milestone, I have identified data from different sources in different formats. I will be applying various data cleansing and visualization techniques on this dataset to gain meaningful insights in the upcoming project milestones.

## Milestone 2 - Cleaning/Formatting Flat File Source

>Flat File: Excel files from BTS. The Excel data has airline performance factors such as cancelled, diverted, delayed and on-time data. The downloaded raw data has up to 34 columns. https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?20=E (Download Raw Data link for data).
The Flat file is the main data source with scheduled flight information.

In [302]:
# Import necessary libraries

import pandas as pd
from datetime import datetime
import numpy as np

#Milestone 3 libraries
from urllib.request import Request, urlopen 
from bs4 import BeautifulSoup

#Milestone 4 libraries
import urllib.request, urllib.parse, urllib.error
import json  
import requests
import re

In [2]:
#Read flight data from "https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?20=E" into a dataframe

flight_data_df = pd.read_csv('T_ONTIME_MARKETING_May.csv')
flight_data_df.head(5)

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,MKT_UNIQUE_CARRIER,OP_UNIQUE_CARRIER,ORIGIN_AIRPORT_ID,ORIGIN,...,DIVERTED,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
0,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,10140,ABQ,...,0.0,104.0,71.0,1.0,569.0,,,,,
1,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,10140,ABQ,...,0.0,97.0,72.0,1.0,569.0,,,,,
2,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,10140,ABQ,...,0.0,98.0,73.0,1.0,569.0,,,,,
3,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,10140,ABQ,...,0.0,110.0,73.0,1.0,569.0,,,,,
4,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,10140,ABQ,...,0.0,93.0,72.0,1.0,569.0,,,,,


### Data Transformation

#### i. Drop  Columns

>Drop unwanted columns to reduce the data size and improve readability.
The columns I will not be using for this project are as follows:


>-    ORIGIN_AIRPORT_ID
>-    ACTUAL_ELAPSED_TIME
>-    AIR_TIME
>-    FLIGHTS
>-    ORIGIN_WAC
>-    DEST_AIRPORT_ID
>-    DEST_WAC

In [3]:
flight_data_df = flight_data_df.drop(columns=['ORIGIN_AIRPORT_ID','ACTUAL_ELAPSED_TIME','AIR_TIME','FLIGHTS',
                          'ORIGIN_WAC','DEST_AIRPORT_ID','DEST_WAC'])
flight_data_df.head(5)

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,MKT_UNIQUE_CARRIER,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_CITY_NAME,...,ARR_DELAY_NEW,CANCELLED,CANCELLATION_CODE,DIVERTED,DISTANCE,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
0,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,ABQ,"Albuquerque, NM",...,0.0,0.0,,0.0,569.0,,,,,
1,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,ABQ,"Albuquerque, NM",...,0.0,0.0,,0.0,569.0,,,,,
2,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,ABQ,"Albuquerque, NM",...,0.0,0.0,,0.0,569.0,,,,,
3,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,ABQ,"Albuquerque, NM",...,0.0,0.0,,0.0,569.0,,,,,
4,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,ABQ,"Albuquerque, NM",...,0.0,0.0,,0.0,569.0,,,,,


#### ii. Look for Duplicates

>Duplicate rows distort counts and percentages, so they are dropped here.

In [4]:
print('Dataframe before dropping duplicates :', flight_data_df.shape)
flight_data_df = flight_data_df.drop_duplicates() # 1,389 rows dropped
print('Dataframe after dropping duplicates :',flight_data_df.shape)

Dataframe before dropping duplicates : (602950, 32)
Dataframe after dropping duplicates : (601561, 32)


#### iii. Replace values in a column

> Cancellation code is represented as A, B, C and D, which is not very informative. 
The BTS website provided details on this code as follows:

    - A Carrier

    - B Weather

    - C National Air System

    - D Security

In [5]:
flight_data_df.CANCELLATION_CODE = np.where(flight_data_df.CANCELLATION_CODE=='A', 'Carrier',
                                 np.where(flight_data_df.CANCELLATION_CODE=='B', 'Weather',
                                          np.where(flight_data_df.CANCELLATION_CODE=='C', 'National Air System',
                                                   np.where(flight_data_df.CANCELLATION_CODE=='D', 'Security',''))))

flight_data_df.groupby(['CANCELLATION_CODE'])['CANCELLATION_CODE'].count().sort_index()

CANCELLATION_CODE
                       590957
Carrier                  4902
National Air System      1394
Security                    1
Weather                  4307
Name: CANCELLATION_CODE, dtype: int64
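
The same replacement can be written as a dictionary lookup, which may be easier to read and extend than nested `np.where` calls. This is a minimal sketch on made-up sample rows, not the code that produced the counts above; `CANCELLATION_LABELS` is a name introduced only for this example.

```python
import pandas as pd

# Lookup table built from the BTS code descriptions listed above
CANCELLATION_LABELS = {
    "A": "Carrier",
    "B": "Weather",
    "C": "National Air System",
    "D": "Security",
}

# Made-up sample rows; None means the flight was not cancelled
sample = pd.DataFrame({"CANCELLATION_CODE": ["A", None, "C", "B", "D", None]})

# Values missing from the lookup (including nulls) become '' to mirror the np.where version
sample["CANCELLATION_CODE"] = sample["CANCELLATION_CODE"].map(CANCELLATION_LABELS).fillna("")
print(sample["CANCELLATION_CODE"].value_counts())
```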

#### iv. Rename Column

>To make the information in CANCELLATION_CODE easier to understand, renaming the column to CANCELLATION_REASON. 

In [6]:
flight_data_df = flight_data_df.rename(columns={"CANCELLATION_CODE": "CANCELLATION_REASON"}, errors="raise")
flight_data_df.columns

Index(['YEAR', 'QUARTER', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE',
       'MKT_UNIQUE_CARRIER', 'OP_UNIQUE_CARRIER', 'ORIGIN', 'ORIGIN_CITY_NAME',
       'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM', 'DEST', 'DEST_CITY_NAME',
       'DEST_STATE_ABR', 'DEST_STATE_NM', 'DEP_DELAY', 'DEP_DELAY_NEW',
       'TAXI_OUT', 'TAXI_IN', 'ARR_TIME', 'ARR_DELAY', 'ARR_DELAY_NEW',
       'CANCELLED', 'CANCELLATION_REASON', 'DIVERTED', 'DISTANCE',
       'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY',
       'LATE_AIRCRAFT_DELAY'],
      dtype='object')

#### v. Add new columns

##### STATUS

In [7]:
#Adding a new column 'STATUS' that tells the status of a flight 
flight_data_df['STATUS'] = ''
 
flight_data_df.STATUS = np.where(flight_data_df.CANCELLED==1, 'Cancelled',
                                 np.where(flight_data_df.DIVERTED==1, 'Diverted',
                                          np.where(flight_data_df.ARR_DELAY<=15, 'On-Time',
                                                   np.where(flight_data_df.ARR_DELAY>15, 'Delayed',''))))
flight_data_df.groupby(['STATUS'])['STATUS'].count().sort_index()

STATUS
Cancelled     10604
Delayed      119624
Diverted       1581
On-Time      469752
Name: STATUS, dtype: int64
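
For reference, the same priority order (Cancelled, then Diverted, then On-Time or Delayed) can be expressed with `np.select`, which evaluates a list of conditions and takes the first one that matches. The rows below are made up to show the boundary at 15 minutes; this is a sketch of an alternative, not the cell that was run.

```python
import numpy as np
import pandas as pd

# Made-up rows: one cancelled, one diverted, and two arrivals at the 15-minute boundary
sample = pd.DataFrame({
    "CANCELLED": [1.0, 0.0, 0.0, 0.0],
    "DIVERTED": [0.0, 1.0, 0.0, 0.0],
    "ARR_DELAY": [np.nan, np.nan, 15.0, 16.0],
})

# Conditions are checked in order; the first match wins
conditions = [
    sample["CANCELLED"] == 1,
    sample["DIVERTED"] == 1,
    sample["ARR_DELAY"] <= 15,
    sample["ARR_DELAY"] > 15,
]
labels = ["Cancelled", "Diverted", "On-Time", "Delayed"]

sample["STATUS"] = np.select(conditions, labels, default="")
print(sample)
```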

##### DELAYED

>As a step toward data reduction, I will consider flights arriving more than 15 minutes late as delayed.

In [8]:
#Creating a new column 'DELAYED'. A flag that represents if a flight was delayed. Similar to CANCELLED and DIVERTED 

flight_data_df.loc[(flight_data_df['ARR_DELAY']>15), 'DELAYED'] = True
flight_data_df.loc[(flight_data_df['ARR_DELAY']<=15), 'DELAYED'] = False

flight_data_df.groupby(['DELAYED'])['DELAYED'].count().sort_index()

DELAYED
False    469752
True     119624
Name: DELAYED, dtype: int64

##### DELAY REASON

In [9]:
#Adding a new column 'DELAY_REASON' that tells the reason for a flight getting delayed 
#Using the newly created DELAYED flag and the available columns for each type of delay to create one column with the delay reason.

flight_data_df['DELAY_REASON'] = np.where(((flight_data_df.DELAYED==True) & (flight_data_df.CARRIER_DELAY != 0)), 'Carrier',
                                          np.where(((flight_data_df.DELAYED==True) & (flight_data_df.LATE_AIRCRAFT_DELAY != 0)), 'LateAircraft',
                                                   np.where(((flight_data_df.DELAYED==True) & (flight_data_df.WEATHER_DELAY != 0)), 'Weather',
                                                            np.where(((flight_data_df.DELAYED==True) & (flight_data_df.NAS_DELAY != 0)), 'NAS',
                                                                     np.where(((flight_data_df.DELAYED==True) & (flight_data_df.SECURITY_DELAY != 0)), 'Security','')))))

flight_data_df.groupby(['DELAY_REASON'])['DELAY_REASON'].count().sort_index()

DELAY_REASON
                481937
Carrier          72453
LateAircraft     25504
NAS              17384
Security           131
Weather           4152
Name: DELAY_REASON, dtype: int64
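
Note that the nested `np.where` above assigns the first non-zero cause in a fixed order (Carrier, LateAircraft, Weather, NAS, Security), so a flight with 5 minutes of carrier delay and 40 minutes of weather delay is labelled Carrier. If the cause with the most delay minutes is wanted instead, one possible alternative is sketched below on made-up rows. It is a different rule from the one used in this notebook, and `CAUSE_COLUMNS` is a name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Delay-minute columns mapped to the labels used in DELAY_REASON
CAUSE_COLUMNS = {
    "CARRIER_DELAY": "Carrier",
    "LATE_AIRCRAFT_DELAY": "LateAircraft",
    "WEATHER_DELAY": "Weather",
    "NAS_DELAY": "NAS",
    "SECURITY_DELAY": "Security",
}

# Made-up rows: two delayed flights with mixed causes and one on-time flight
sample = pd.DataFrame({
    "DELAYED": [True, True, False],
    "CARRIER_DELAY": [5.0, 0.0, np.nan],
    "LATE_AIRCRAFT_DELAY": [0.0, 0.0, np.nan],
    "WEATHER_DELAY": [40.0, 19.0, np.nan],
    "NAS_DELAY": [0.0, 6.0, np.nan],
    "SECURITY_DELAY": [0.0, 0.0, np.nan],
})

minutes = sample[list(CAUSE_COLUMNS)].fillna(0)          # treat missing minutes as zero
dominant = minutes.idxmax(axis=1).map(CAUSE_COLUMNS)     # cause with the most minutes per row
has_cause = sample["DELAYED"].eq(True) & minutes.gt(0).any(axis=1)
sample["DELAY_REASON"] = dominant.where(has_cause, "")   # blank when not delayed or no cause recorded
print(sample[["DELAYED", "DELAY_REASON"]])
```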

#### vi. Implementing arithmetic functions for statistical analysis 

In [10]:
# Create a new dataframe with total number of flights per operating carrier to calculate the % 

flight_totals = flight_data_df.value_counts(subset=['OP_UNIQUE_CARRIER']).reset_index() #Get total flights per operating carrier
flight_totals_df = pd.DataFrame(flight_totals) # Convert to dataframe
flight_totals_df.columns = ['OP_UNIQUE_CARRIER','TOTAL'] # Assign Column names
flight_totals_df['PERCENTAGE'] = round(flight_totals_df.TOTAL/flight_totals_df.TOTAL.sum()*100,2) #Calculate the percentage

flight_totals_df = flight_totals_df.sort_values('PERCENTAGE',ascending=False) #Sort by percentage (descending)
flight_totals_df.head(5)

Unnamed: 0,OP_UNIQUE_CARRIER,TOTAL,PERCENTAGE
0,WN,107950,17.94
1,DL,76021,12.64
2,AA,71471,11.88
3,OO,66615,11.07
4,UA,53535,8.9


In [11]:
# Calculate percentage by carrier and flight status   
flight_status = flight_data_df.value_counts(subset=['OP_UNIQUE_CARRIER','STATUS']).reset_index() #Get total flights per operating carrier and status 
flight_status_df = pd.DataFrame(flight_status) #create a dataframe
flight_status_df.columns = ['OP_UNIQUE_CARRIER','STATUS', 'COUNT'] #Add column names
flight_status_df = flight_status_df.sort_values('OP_UNIQUE_CARRIER') #Sort by operating carrier

flight_status_df['PERCENTAGE'] = ''
            
for index, row in flight_status_df.iterrows():
    tot = flight_totals_df.loc[flight_totals_df.OP_UNIQUE_CARRIER==row.OP_UNIQUE_CARRIER].TOTAL.values #Look up the total per operating carrier to get the status percentage 
    val = (row.COUNT/tot * 100)   
    flight_status_df.at[index,'PERCENTAGE'] = round(val[0].astype(float),2) #Calculate the percentage

flight_status_df.head(10)

Unnamed: 0,OP_UNIQUE_CARRIER,STATUS,COUNT,PERCENTAGE
33,9E,Delayed,3113,15.33
48,9E,Cancelled,542,2.67
74,9E,Diverted,35,0.17
8,9E,On-Time,16613,81.83
41,AA,Cancelled,973,1.36
56,AA,Diverted,215,0.3
3,AA,On-Time,55403,77.52
11,AA,Delayed,14880,20.82
47,AS,Cancelled,608,3.12
10,AS,On-Time,15502,79.49


In [12]:
#Create a new dataframe with the percentage by origin airport and status
flight_origin_totals = flight_data_df.value_counts(subset=['ORIGIN']).reset_index() #get the counts by origin
flight_origin_totals_df = pd.DataFrame(flight_origin_totals) #create a dataframe
flight_origin_totals_df.columns = ['ORIGIN','TOTAL'] #Add column names
flight_origin_totals_df['PERCENTAGE'] = round(flight_origin_totals_df.TOTAL/flight_origin_totals_df.TOTAL.sum()*100,2) #Calculate the percentage by origin airport
 

origin_airport_delays = flight_data_df.value_counts(subset=['ORIGIN','STATUS']).reset_index() #get counts by origin and status
origin_airport_df = pd.DataFrame(origin_airport_delays) #create a dataframe
origin_airport_df.columns = ['ORIGIN','STATUS', 'COUNT'] #add column names
origin_airport_df = origin_airport_df.sort_values('ORIGIN') #sort by origin
origin_airport_df['PERCENTAGE'] = ''
            
for index, row in origin_airport_df.iterrows():
    tot = flight_origin_totals_df.loc[flight_origin_totals_df.ORIGIN==row.ORIGIN].TOTAL.values #get the total per origin
    val = (row.COUNT/tot * 100)   
    origin_airport_df.at[index,'PERCENTAGE'] = round(val[0].astype(float),2) #calculate the percentage

origin_airport_df = origin_airport_df.sort_values('PERCENTAGE',ascending=False) #sort by percentage descending
 
origin_airport_df.head(10)

Unnamed: 0,ORIGIN,STATUS,COUNT,PERCENTAGE
770,GST,On-Time,12,100.0
1208,STC,On-Time,1,100.0
385,LWS,On-Time,95,96.94
623,BGM,On-Time,30,96.77
470,DRT,On-Time,60,96.77
517,PLN,On-Time,51,96.23
488,MCW,On-Time,55,94.83
490,FOD,On-Time,55,94.83
515,TBN,On-Time,51,94.44
529,LAR,On-Time,50,94.34


#### vii. NULL check

In [13]:
#Looking for null values to further reduce the data size.
flight_data_df.isnull().sum()

YEAR                        0
QUARTER                     0
MONTH                       0
DAY_OF_MONTH                0
DAY_OF_WEEK                 0
FL_DATE                     0
MKT_UNIQUE_CARRIER          0
OP_UNIQUE_CARRIER           0
ORIGIN                      0
ORIGIN_CITY_NAME            0
ORIGIN_STATE_ABR            0
ORIGIN_STATE_NM             0
DEST                        0
DEST_CITY_NAME              0
DEST_STATE_ABR              0
DEST_STATE_NM               0
DEP_DELAY               10201
DEP_DELAY_NEW           10201
TAXI_OUT                10558
TAXI_IN                 10769
ARR_TIME                10769
ARR_DELAY               12185
ARR_DELAY_NEW           12185
CANCELLED                   0
CANCELLATION_REASON         0
DIVERTED                    0
DISTANCE                    0
CARRIER_DELAY          477611
WEATHER_DELAY          477611
NAS_DELAY              477611
SECURITY_DELAY         477611
LATE_AIRCRAFT_DELAY    477611
STATUS                      0
DELAYED   

>Based on the above, there are no rows with null values that can be dropped as irrelevant. 
STATUS is the key column here: it has no nulls, so every flight is now categorized as On-Time, Delayed, Cancelled or Diverted.
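
One way to check where the nulls come from is to count them per STATUS value. The sketch below uses made-up rows with two of the notebook's columns; it only illustrates the check and is not output from the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows, one per status
sample = pd.DataFrame({
    "STATUS": ["On-Time", "Delayed", "Cancelled", "Diverted"],
    "ARR_DELAY": [-4.0, 32.0, np.nan, np.nan],
    "WEATHER_DELAY": [np.nan, 32.0, np.nan, np.nan],
})

# Null counts for every other column, broken down by flight status
null_counts_by_status = (
    sample.drop(columns="STATUS").isnull().groupby(sample["STATUS"]).sum()
)
print(null_counts_by_status)
```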

>The final flat file dataset is as follows:

In [14]:
print(flight_data_df.columns)
flight_data_df.head(5)

Index(['YEAR', 'QUARTER', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE',
       'MKT_UNIQUE_CARRIER', 'OP_UNIQUE_CARRIER', 'ORIGIN', 'ORIGIN_CITY_NAME',
       'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM', 'DEST', 'DEST_CITY_NAME',
       'DEST_STATE_ABR', 'DEST_STATE_NM', 'DEP_DELAY', 'DEP_DELAY_NEW',
       'TAXI_OUT', 'TAXI_IN', 'ARR_TIME', 'ARR_DELAY', 'ARR_DELAY_NEW',
       'CANCELLED', 'CANCELLATION_REASON', 'DIVERTED', 'DISTANCE',
       'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY',
       'LATE_AIRCRAFT_DELAY', 'STATUS', 'DELAYED', 'DELAY_REASON'],
      dtype='object')


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,MKT_UNIQUE_CARRIER,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_CITY_NAME,...,DIVERTED,DISTANCE,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,STATUS,DELAYED,DELAY_REASON
0,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,ABQ,"Albuquerque, NM",...,0.0,569.0,,,,,,On-Time,False,
1,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,ABQ,"Albuquerque, NM",...,0.0,569.0,,,,,,On-Time,False,
2,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,ABQ,"Albuquerque, NM",...,0.0,569.0,,,,,,On-Time,False,
3,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,ABQ,"Albuquerque, NM",...,0.0,569.0,,,,,,On-Time,False,
4,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,ABQ,"Albuquerque, NM",...,0.0,569.0,,,,,,On-Time,False,


### Ethical implications: 
>BTS data - Flat File
I do not see any ethical implications for this dataset, as it comes from a federal government source and is publicly accessible. My only concern is that the dataset is old, and the trends may have changed over time. I am using an older dataset because I need flight diversion information, which I was only able to find for the year 2022. 

### Conclusion:
> As a part of this milestone, the following Data Transformation steps have been performed.
>1. Dropped columns
>2. Dropped duplicate rows
>3. Replaced values in a dataframe column
>4. Renamed a column
>5. Added new columns to the dataframe
>6. Implemented arithmetic functions for statistical analysis 
>7. Performed a null check; no rows needed to be dropped. 

## Milestone 3 - Cleaning/Formatting Website Data

>Flat File - Website:
The flat file has a column for diverted flights but no information on the cause of the diversion. I would like to look up the reason a flight was diverted. The website and flat file can be matched on flight date, origin and destination to look up diverted flight information.
The flat file has a many-to-many relationship with the website. We will need to match on the flight date and the origin and destination city to get flight diversion details for a particular date and route.
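
A rough sketch of how that match could look is shown below. The rows are made up, and it assumes the website lists plain city names (for example "Atlanta") while the flat file stores "City, ST" in ORIGIN_CITY_NAME and DEST_CITY_NAME, so the state suffix is stripped before joining. The helper columns `ORIGIN_CITY` and `DEST_CITY` are hypothetical names used only here.

```python
import pandas as pd

# Made-up flat file rows for two diverted flights
flights = pd.DataFrame({
    "FL_DATE": ["5/2/2022 12:00:00 AM", "5/2/2022 12:00:00 AM"],
    "ORIGIN_CITY_NAME": ["Atlanta, GA", "Denver, CO"],
    "DEST_CITY_NAME": ["Los Angeles, CA", "Boise, ID"],
    "DIVERTED": [1.0, 1.0],
})

# Made-up website row, using the column names the website data is given after renaming
diversions = pd.DataFrame({
    "FL_DATE": pd.to_datetime(["2022-05-02"]),
    "ORIGIN": ["Atlanta"],
    "DEST": ["Los Angeles"],
    "DIVERTED_REASON": ["medical emergency"],
})

# Bring both sides to the same key format: real dates and bare city names
flights["FL_DATE"] = pd.to_datetime(flights["FL_DATE"], format="%m/%d/%Y %I:%M:%S %p")
flights["ORIGIN_CITY"] = flights["ORIGIN_CITY_NAME"].str.split(",").str[0]
flights["DEST_CITY"] = flights["DEST_CITY_NAME"].str.split(",").str[0]

# Left join keeps every flight; unmatched flights get a null DIVERTED_REASON
matched = flights.merge(
    diversions,
    how="left",
    left_on=["FL_DATE", "ORIGIN_CITY", "DEST_CITY"],
    right_on=["FL_DATE", "ORIGIN", "DEST"],
)
print(matched[["FL_DATE", "ORIGIN_CITY", "DEST_CITY", "DIVERTED_REASON"]])
```

Because the relationship is many-to-many, several flights on the same date and route would all receive the same diversion record, so a flight number would be needed to tell them apart.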

In [15]:
url = 'https://www.diverted.eu/' #Website with diverted flight information

In [16]:
# Parsing HTML using BeautifulSoup
html = urlopen(url) 
soup = BeautifulSoup(html, 'html.parser')

In [17]:
#Parse HTML for the diverted data table
flight_diverted_table = soup.find_all("table", { 'id' : 'tablepress-current_month' })
flight_diverted_table

[<table class="tablepress tablepress-id-current_month tablepress-responsive" id="tablepress-current_month">
 <thead>
 <tr class="row-1 odd">
 <th class="column-1">Date</th><th class="column-2">Airlines/Operator</th><th class="column-3">Flight number</th><th class="column-4">Departure airport</th><th class="column-5">Destination airport</th><th class="column-6">Diverted to</th><th class="column-7">Emergency code</th><th class="column-8">Alleged reason</th><th class="column-9">Aircraft</th><th class="column-10">Registration</th>
 </tr>
 </thead>
 <tbody class="row-hover">
 <tr class="row-2 even">
 <td class="column-1">20.07.2022</td><td class="column-2">Go First</td><td class="column-3">G8151 / GOW151</td><td class="column-4">Delhi</td><td class="column-5">Guwahati</td><td class="column-6">Jaipur</td><td class="column-7"></td><td class="column-8">cracked windshield</td><td class="column-9">Airbus A320-271N</td><td class="column-10">VT-WGP</td>
 </tr>
 <tr class="row-3 odd">
 <td class="c

In [18]:
#Load the data table to a dataframe
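from io import StringIO #newer pandas versions expect a file-like object rather than a raw HTML string
flight_diverted_table = pd.read_html(StringIO(str(flight_diverted_table)))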
flight_diverted_df = flight_diverted_table[0]
flight_diverted_df

Unnamed: 0,Date,Airlines/Operator,Flight number,Departure airport,Destination airport,Diverted to,Emergency code,Alleged reason,Aircraft,Registration
0,20.07.2022,Go First,G8151 / GOW151,Delhi,Guwahati,Jaipur,,cracked windshield,Airbus A320-271N,VT-WGP
1,20.07.2022,Wizz Air,W65058 / WZZ101S,Bari,Krakow,Budapest,,bomb threat,Airbus A321-271NX,HA-LGA
2,19.07.2022,Go First,G8386 / GOW386,Mumbai,Leh,Delhi,,engine issue,Airbus A320-271N,VT-WGA
3,19.07.2022,Go First,G86202 / GOW6202,Srinagar,Delhi,Srinagar,,engine issue,Airbus A320-271N,VT-WJG
4,19.07.2022,LOT,LO6297 / LOT6297,Prague,Zanzibar,Warsaw,,brakes issue,Boeing 787-9 Dreamliner,SP-LSB
...,...,...,...,...,...,...,...,...,...,...
679,04.11.2021,Azul Linhas Aéreas,AD4327 / AZU4327,Goiani,Campinas,Brasilia,,technical issue,Embraer E195-E2,PS-AEF
680,02.11.2021,United Airlines,UA818 / UAL818,Buenos Aires,Houston,Buenos Aires,,pressurisation issue,Boeing 787-9 Dreamliner,N36962
681,01.11.2021,Delta Air Lines,DL9962 / DAL9962,Atlanta,Key West,Atlanta,,airspeed issue,Airbus A319-114,N364NB
682,01.11.2021,Delta Air Lines,DL365 / DAL365,Atlanta,Los Angeles,Dallas,,disruptive passenger,Airbus A321-211,N390DN


### Data Transformation 

#### String to Date conversion

In [19]:
flight_diverted_df.Date.dtype

dtype('O')

>The flight date is stored as a string (pandas dtype 'O' is the object dtype, which is how pandas holds strings). 

In [20]:
#Format Flight date from string to Date 
flight_diverted_df.Date = pd.to_datetime(flight_diverted_df["Date"], format='%d.%m.%Y') 
flight_diverted_df.head(5)

Unnamed: 0,Date,Airlines/Operator,Flight number,Departure airport,Destination airport,Diverted to,Emergency code,Alleged reason,Aircraft,Registration
0,2022-07-20,Go First,G8151 / GOW151,Delhi,Guwahati,Jaipur,,cracked windshield,Airbus A320-271N,VT-WGP
1,2022-07-20,Wizz Air,W65058 / WZZ101S,Bari,Krakow,Budapest,,bomb threat,Airbus A321-271NX,HA-LGA
2,2022-07-19,Go First,G8386 / GOW386,Mumbai,Leh,Delhi,,engine issue,Airbus A320-271N,VT-WGA
3,2022-07-19,Go First,G86202 / GOW6202,Srinagar,Delhi,Srinagar,,engine issue,Airbus A320-271N,VT-WJG
4,2022-07-19,LOT,LO6297 / LOT6297,Prague,Zanzibar,Warsaw,,brakes issue,Boeing 787-9 Dreamliner,SP-LSB


#### Filter Flights by Date

>Select only the data for May 2022, since our Excel data is for May 2022.

In [21]:
diverted_df = flight_diverted_df[(flight_diverted_df.Date >= '2022-05-01') & (flight_diverted_df.Date < '2022-06-01')]
diverted_df.head(5)

Unnamed: 0,Date,Airlines/Operator,Flight number,Departure airport,Destination airport,Diverted to,Emergency code,Alleged reason,Aircraft,Registration
147,2022-05-31,Virgin Australia,VA9223 / VOZ9223,Perth,Boolgeeda,Perth,,hydraulic issue,Airbus A320-232,VH-VNB
148,2022-05-31,Aer Lingus,EI3326 / EAI26MH,Dublin,Manchester,Dublin,7700.0,technical issue,ATR 72-600,EI-HDH
149,2022-05-30,American Airlines,AA720 / AAL720,Charlotte,Rome,Charlotte,,maintenance issue,Boeing 777-223(ER),N793AN
150,2022-05-29,Swiss,LX340 / SWR340V,Zurich,London,Zurich,,odor in cockpit,Airbus A220-100,HB-JBI
151,2022-05-29,Qantas,QF2008 / QLK8D,Sydney,Tamworth,Sydney,,hydraulic issue,De Havilland Canada Dash 8-400,VH-QOF


#### Replace Headers

In [22]:
#Columns before renaming
diverted_df.columns

Index(['Date', 'Airlines/Operator', 'Flight number', 'Departure airport',
       'Destination airport', 'Diverted to', 'Emergency code',
       'Alleged reason', 'Aircraft', 'Registration'],
      dtype='object')

In [23]:
#Renaming columns
diverted_df.columns = ['FL_DATE', 'OP_UNIQUE_CARRIER', 'FL_NUM', 'ORIGIN', 'DEST', 'DIVERTED_TO', 'EMERGENCY_CODE', 'DIVERTED_REASON', 'AIRCRAFT', 'AIRCRAFT_REGISTRATION'] 

In [24]:
#Columns after renaming
diverted_df.columns

Index(['FL_DATE', 'OP_UNIQUE_CARRIER', 'FL_NUM', 'ORIGIN', 'DEST',
       'DIVERTED_TO', 'EMERGENCY_CODE', 'DIVERTED_REASON', 'AIRCRAFT',
       'AIRCRAFT_REGISTRATION'],
      dtype='object')

#### Drop rows 

###### Drop null rows, if any

In [25]:
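print('Data before dropping null rows : ',diverted_df.shape)
diverted_df = diverted_df.dropna(how='all') #dropna() returns a new dataframe; how='all' drops only rows where every column is null
print('Data after dropping null rows : ', diverted_df.shape)

Data before dropping null rows :  (87, 10)
Data after dropping null rows :  (87, 10)


>No fully null rows to drop. EMERGENCY_CODE has missing values, so a plain dropna() would also remove flights that are otherwise complete.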
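
As a reminder of how `dropna` behaves: it returns a new dataframe instead of changing the original, and its `how` parameter decides whether a row is dropped when any value is null or only when all values are null. A small sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows: one with a missing code, one complete, one entirely empty
sample = pd.DataFrame({
    "FL_NUM": ["AA720 / AAL720", "EI3326 / EAI26MH", None],
    "EMERGENCY_CODE": [np.nan, 7700.0, np.nan],
})

any_null_dropped = sample.dropna()           # drops rows with at least one null
all_null_dropped = sample.dropna(how="all")  # drops only rows where every value is null

print(sample.shape, any_null_dropped.shape, all_null_dropped.shape)
```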

###### Drop duplicates, if any

In [26]:
print('Dataframe before dropping duplicates :', diverted_df.shape)
diverted_df = diverted_df.drop_duplicates() 
print('Dataframe after dropping duplicates :',diverted_df.shape)
#No duplicates in the website data table

Dataframe before dropping duplicates : (87, 10)
Dataframe after dropping duplicates : (87, 10)


#### Update rows 

>Look for rows with inconsistent reason for diversion 

In [27]:
diverted_df.groupby(['DIVERTED_REASON'])['DIVERTED_REASON'].count() 

DIVERTED_REASON
air conditioning issue          1
bird strike                     8
bomb threat                     1
brakes issue                    1
cracked windshield              1
disruptive passenger            5
engine issue                    4
hydraulic issue                 5
landing gear issue              3
maintenance issue               1
medical emergency              14
odor in cockpit                 1
odor on board                   2
operational reasons             1
possible landing gear issue     1
possible medical emergency      1
possible technical issue        2
pressurisation issue            6
smell on board                  3
smoke indication                1
smoke on board                  1
technical issue                 8
weather radar issue             1
winglet issue                   1
“rostering error”               1
Name: DIVERTED_REASON, dtype: int64

>The rostering error entry is wrapped in curly quotes. Replacing it with an unquoted label for consistency.

In [28]:
diverted_df.loc[diverted_df.DIVERTED_REASON == '“rostering error”', 'DIVERTED_REASON'] = 'rostering_error'

In [29]:
#Validate data after update
diverted_df.groupby(['DIVERTED_REASON'])['DIVERTED_REASON'].count() 

DIVERTED_REASON
air conditioning issue          1
bird strike                     8
bomb threat                     1
brakes issue                    1
cracked windshield              1
disruptive passenger            5
engine issue                    4
hydraulic issue                 5
landing gear issue              3
maintenance issue               1
medical emergency              14
odor in cockpit                 1
odor on board                   2
operational reasons             1
possible landing gear issue     1
possible medical emergency      1
possible technical issue        2
pressurisation issue            6
rostering_error                 1
smell on board                  3
smoke indication                1
smoke on board                  1
technical issue                 8
weather radar issue             1
winglet issue                   1
Name: DIVERTED_REASON, dtype: int64
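
If more quoted values show up in other months, the quotes could be stripped from the whole column rather than fixed one value at a time. The sketch below uses made-up values; note that it keeps the space in "rostering error" instead of the underscore used above.

```python
import pandas as pd

# Made-up reasons, two of them wrapped in quotes
reasons = pd.Series(["bird strike", "“rostering error”", ' "technical issue" '])

# Trim whitespace, remove straight and curly double quotes from both ends, trim again
cleaned = reasons.str.strip().str.strip('“”"').str.strip()
print(cleaned.tolist())
```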

#### Fill NA/NaN values 

In [30]:
print('EMERGENCY_CODE before updating NA/NAN : ',diverted_df.EMERGENCY_CODE.unique())

EMERGENCY_CODE before updating NA/NAN :  [  nan 7700.]


In [31]:
diverted_df = diverted_df.replace(np.nan,0)

In [32]:
print('EMERGENCY_CODE after updating NA/NAN : ',diverted_df.EMERGENCY_CODE.unique())

EMERGENCY_CODE after updating NA/NAN :  [   0. 7700.]


### Ethical implications:
>Website Data - 
The data source of the flat file is genuine and reliable (Bureau of Transportation Statistics). However, the website may not hold accurate information because it is not a government or FAA-authorized source. The website does not mention the source of its data, which makes the accuracy and legality of the data questionable; the website states as much in its disclaimer. However, a high-level search on a couple of the diverted flights confirmed that their details were accurate.

### Conclusion:
>As a part of this milestone, the following Data Transformation steps have been performed.

>1. Data Type conversion
>2. Renamed columns 
>3. Replaced values in a dataframe column
>4. Filtered data 
>5. Filled NA/NAN values 
>6. Performed checks for duplicates and null rows

## Milestone 4 - Connecting to an API/Pulling in the Data and Cleaning/Formatting

>Data from the flat file has cancellations and delays due to weather. 
The API gets the historic weather data for a location (origin or destination city name). This will enable us to validate if there truly was a bad weather situation for a flight to be delayed or cancelled. With this, we can also identify the cause of bad weather like storms, snow, wind, etc.

In [323]:
#Working with weather delays. Creating a dataframe with only weather delays. 
weather_delay_df = flight_data_df[flight_data_df.DELAY_REASON=='Weather']
print(weather_delay_df.shape)
weather_delay_df.head(5)

(4152, 35)


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,MKT_UNIQUE_CARRIER,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_CITY_NAME,...,DIVERTED,DISTANCE,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,STATUS,DELAYED,DELAY_REASON
87,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,BHM,"Birmingham, AL",...,0.0,597.0,0.0,61.0,0.0,0.0,0.0,Delayed,True,Weather
191,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,CLE,"Cleveland, OH",...,0.0,1737.0,0.0,18.0,0.0,0.0,0.0,Delayed,True,Weather
227,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,CLT,"Charlotte, NC",...,0.0,361.0,0.0,19.0,6.0,0.0,0.0,Delayed,True,Weather
1962,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,PIT,"Pittsburgh, PA",...,0.0,1814.0,0.0,18.0,0.0,0.0,0.0,Delayed,True,Weather
2000,2022,2,5,1,7,5/1/2022 12:00:00 AM,AA,AA,RDU,"Raleigh/Durham, NC",...,0.0,587.0,0.0,47.0,12.0,0.0,0.0,Delayed,True,Weather


#### Function to make the API call to get historic weather data based on flight time and origin. 

In [306]:
import os

def get_historic_geo_data_by_zip(startDateTime, originCityName, endDateTime):
    #API key request headers. The key is read from an environment variable instead of being hard-coded.
    #The variable name RAPIDAPI_KEY is an assumption; use whichever name the key is exported under.
    headers = {"X-RapidAPI-Key": os.environ.get("RAPIDAPI_KEY", ""),
               "X-RapidAPI-Host": "visual-crossing-weather.p.rapidapi.com"}
    #URL
    geocode_request_url = "https://visual-crossing-weather.p.rapidapi.com/history?" 
    #Request Parameters
    parms = {'startDateTime': startDateTime, 'aggregateHours': 24, 'location': originCityName,
             'endDateTime': endDateTime, 'unitGroup': 'us'}
    response = None #Returned as-is if the API call fails
    try:
        #API call to get the weather data at the scheduled time of flight.
        response = requests.get(geocode_request_url, params=parms, headers=headers)
    except requests.exceptions.Timeout:
        #Timeout is a subclass of RequestException, so it has to be handled first
        print('The API call timed out. Please retry.')
    except requests.exceptions.RequestException as e:
        print('There was an error in the API call : ', e)
    return response

### Data Transformation 

#### 1. Data manipulation using regular expressions to convert response text into a list of keys and values

#### 2. Data transformation to parse the list of keys and values to form a dictionary (key-value pair) 

In [325]:
#Function to process the API response
def process_api_data(response, index):
    try:
        if response.status_code == 200: #OK
            # The API response is CSV text, not JSON: a header line followed by one data line.
            # Parse the two lines to create key-value pairs.
            csv_lines = response.text.splitlines()
            if len(csv_lines) == 2: #Exactly one header line and one data line are expected
                keys = re.split(',', csv_lines[0])
                # Replace ', ' with '|' so that commas inside quoted values (e.g. "Pittsburgh, PA") do not split a value
                values = (csv_lines[1].replace(', ','|').replace('"', '')).split(',')
                #print(len(keys), len(values))
                if len(keys) == len(values): #Converting to dict only when keys and values count match.
                    for i in range(len(keys)):
                        # historic_weather_data is the notebook-level dict defined before the API loop
                        historic_weather_data[keys[i]] = values[i]
                else:
                    print("Key value pair counts don't match")
        elif response.status_code == 404:
            print("Requested historic weather data not found for parms : Fl_Date - ",start_date_time, ' City - ',row.ORIGIN_CITY_NAME)
        else:
            print('Unable to get historic weather data for parms : Fl_Date - ',start_date_time, ' City - ',row.ORIGIN_CITY_NAME)
    except (AttributeError, IndexError) as ex:
        # AttributeError covers a missing response (None); IndexError covers an unexpected line layout
        print('There was an error in processing the API response : ', ex)

    return historic_weather_data
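
The string replacement above works around commas inside quoted values such as "Pittsburgh, PA". As an alternative, a minimal sketch using Python's `csv` module is shown below. It is not the notebook's method, the helper name `parse_weather_csv` is hypothetical, and the sample text is a shortened response with the same two-line shape. Note that quoted values keep their commas (for example `Pittsburgh, PA`) instead of being rewritten with `|`.

```python
import csv
import io

def parse_weather_csv(text):
    """Parse a two-line CSV response (header + one data row) into a dict."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) != 2 or len(rows[0]) != len(rows[1]):
        return {}  # unexpected shape: let the caller decide how to handle it
    header, values = rows
    return dict(zip(header, values))

# Shortened sample with the same shape as the API response (illustrative only)
sample = 'Address,Date time,Temperature,Conditions\n"Pittsburgh, PA",05/01/2022,60.8,"Rain, Overcast"\n'
weather = parse_weather_csv(sample)
print(weather["Address"])     # Pittsburgh, PA
print(weather["Conditions"])  # Rain, Overcast
```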

#### Function call to create the API request, get and process the response.

#### 3. Data transformation to convert the dictionary to a dataframe

In [326]:
index = 0 
historic_weather_data = {}
weather_data_df = pd.DataFrame()

for inx, row in weather_delay_df.iterrows():
    start_date_time = pd.to_datetime(row.FL_DATE).strftime("%Y-%m-%dT%H:%M:%S") #Fl start time
    end_date_time=start_date_time 

    #Limit the loop to 2 API calls on reruns to avoid reaching the API limit.
    #Since this is a public API, there is a limit to the number of calls I can make per month.
    #The loop has already been run for the entire dataframe, and weather_data_df was created for all rows in weather_delay_df.
    #This condition will be removed again for the final project submission.
    if index < 2: 
        #Call function to get the weather at the origin for the flight date
        response = get_historic_geo_data_by_zip(start_date_time,row.ORIGIN_CITY_NAME , end_date_time) 

        weather_dict={}
        weather_dict = process_api_data(response, index) #Get each API response in a dict

        df  = pd.DataFrame([weather_dict], columns=weather_dict.keys()) #Convert dict to a dataframe
        weather_data_df = pd.concat([weather_data_df, df], axis =0).reset_index(drop=True) #Append dict to dataframe rows

        index = index + 1         

25 25
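
Because `process_api_data` writes into a single shared dictionary and the loop calls `pd.concat` once per row, one possible restructuring is to collect an independent dictionary per response and build the dataframe once at the end. The sketch below assumes that idea. `build_weather_frame` and the sample dictionaries are hypothetical stand-ins for parsed API responses.

```python
import pandas as pd

def build_weather_frame(parsed_responses):
    """Build one dataframe from an iterable of per-response dicts."""
    # Copy each dict so later changes to a shared dict cannot alter earlier rows;
    # skip empty dicts, which stand for failed or unparseable responses.
    records = [dict(r) for r in parsed_responses if r]
    return pd.DataFrame.from_records(records).reset_index(drop=True)

parsed = [
    {"Address": "Birmingham|AL", "Date time": "05/01/2022", "Temperature": "73.0"},
    {},
    {"Address": "Cleveland|OH", "Date time": "05/01/2022", "Temperature": "62.2"},
]
frame = build_weather_frame(parsed)
print(frame.shape)  # (2, 3)
```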

In [327]:
# Final weather_data_df
weather_data_df

Unnamed: 0,Address,Date time,Minimum Temperature,Maximum Temperature,Temperature,Dew Point,Relative Humidity,Heat Index,Wind Speed,Wind Gust,...,Visibility,Cloud Cover,Sea Level Pressure,Weather Type,Latitude,Longitude,Resolved Address,Name,Info,Conditions
0,Birmingham|AL,05/01/2022,65.0,82.1,73.0,63.6,73.68,83.5,10.8,36.7,...,9.4,66.8,1016.0,Lightning Without Thunder|Mist|Thunderstorm|Ra...,33.5207,-86.8118,Birmingham|AL|United States,Birmingham|AL|United States,,Rain|Partially cloudy
1,Cleveland|OH,05/01/2022,53.1,74.0,62.2,45.6,60.2,,18.3,30.2,...,9.4,75.6,1010.3,Mist|Thunderstorm|Light Rain,41.5047,-81.6908,Cleveland|OH|United States,Cleveland|OH|United States,,Rain|Overcast
2,Charlotte|NC,05/01/2022,57.3,80.5,68.6,59.5,73.82,81.8,13.1,20.7,...,9.9,64.2,1017.5,Lightning Without Thunder,35.2229,-80.838,Charlotte|NC|United States,Charlotte|NC|United States,,Rain|Partially cloudy
3,Pittsburgh|PA,05/01/2022,53.5,70.4,60.8,46.6,64.71,,9.2,23.5,...,8.8,76.1,1013.4,Lightning Without Thunder|Mist|Thunderstorm|Ra...,40.4385,-79.9973,Pittsburgh|PA|United States,Pittsburgh|PA|United States,,Rain|Overcast
4,Raleigh/Durham|NC,05/01/2022,50.5,82.2,65.0,57.3,79.5,82.6,12.5,20.8,...,8.2,66.1,1018.2,Mist|Light Drizzle|Thunderstorm|Rain|Fog|Heavy...,35.9869,-78.6686,Durham Rd|Raleigh|NC 27614|United States,Durham Rd|Raleigh|NC 27614|United States,,Rain|Partially cloudy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4147,Savannah|GA,05/31/2022,70.5,84.0,76.6,73.2,89.54,91.7,8.8,,...,9.8,34.3,1018.7,Lightning Without Thunder|Mist|Rain|Thundersto...,32.0809,-81.0912,Savannah|GA|United States,Savannah|GA|United States,,Rain|Partially cloudy
4148,Dallas|TX,05/31/2022,76.0,92.3,84.0,69.4,63.24,96.7,18.3,31.6,...,9.9,59.3,1009.8,,32.7782,-96.7951,Dallas|TX|United States,Dallas|TX|United States,,Partially cloudy
4149,Kansas City|MO,05/31/2022,65.0,78.2,72.3,65.9,81.0,,16.1,42.5,...,8.4,87.7,1009.7,Lightning Without Thunder|Mist|Thunderstorm|Ra...,39.1034,-94.5831,Kansas City|MO|United States,Kansas City|MO|United States,,Rain|Overcast
4150,Tampa|FL,05/31/2022,72.7,91.2,80.3,69.3,71.22,95.5,15.3,24.2,...,9.9,43.1,1016.4,Lightning Without Thunder|Light Drizzle|Thunde...,27.9465,-82.4593,Tampa|FL|United States,Tampa|FL|United States,,Rain|Partially cloudy


>Now that we have the weather data in a dataframe, we'll perform the remaining data transformation steps.

#### 4. Look for empty rows and null values

In [328]:
weather_data_df.dropna()

Unnamed: 0,Address,Date time,Minimum Temperature,Maximum Temperature,Temperature,Dew Point,Relative Humidity,Heat Index,Wind Speed,Wind Gust,...,Visibility,Cloud Cover,Sea Level Pressure,Weather Type,Latitude,Longitude,Resolved Address,Name,Info,Conditions
0,Birmingham|AL,05/01/2022,65.0,82.1,73.0,63.6,73.68,83.5,10.8,36.7,...,9.4,66.8,1016.0,Lightning Without Thunder|Mist|Thunderstorm|Ra...,33.5207,-86.8118,Birmingham|AL|United States,Birmingham|AL|United States,,Rain|Partially cloudy
1,Cleveland|OH,05/01/2022,53.1,74.0,62.2,45.6,60.2,,18.3,30.2,...,9.4,75.6,1010.3,Mist|Thunderstorm|Light Rain,41.5047,-81.6908,Cleveland|OH|United States,Cleveland|OH|United States,,Rain|Overcast
2,Charlotte|NC,05/01/2022,57.3,80.5,68.6,59.5,73.82,81.8,13.1,20.7,...,9.9,64.2,1017.5,Lightning Without Thunder,35.2229,-80.838,Charlotte|NC|United States,Charlotte|NC|United States,,Rain|Partially cloudy
3,Pittsburgh|PA,05/01/2022,53.5,70.4,60.8,46.6,64.71,,9.2,23.5,...,8.8,76.1,1013.4,Lightning Without Thunder|Mist|Thunderstorm|Ra...,40.4385,-79.9973,Pittsburgh|PA|United States,Pittsburgh|PA|United States,,Rain|Overcast
4,Raleigh/Durham|NC,05/01/2022,50.5,82.2,65.0,57.3,79.5,82.6,12.5,20.8,...,8.2,66.1,1018.2,Mist|Light Drizzle|Thunderstorm|Rain|Fog|Heavy...,35.9869,-78.6686,Durham Rd|Raleigh|NC 27614|United States,Durham Rd|Raleigh|NC 27614|United States,,Rain|Partially cloudy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4147,Savannah|GA,05/31/2022,70.5,84.0,76.6,73.2,89.54,91.7,8.8,,...,9.8,34.3,1018.7,Lightning Without Thunder|Mist|Rain|Thundersto...,32.0809,-81.0912,Savannah|GA|United States,Savannah|GA|United States,,Rain|Partially cloudy
4148,Dallas|TX,05/31/2022,76.0,92.3,84.0,69.4,63.24,96.7,18.3,31.6,...,9.9,59.3,1009.8,,32.7782,-96.7951,Dallas|TX|United States,Dallas|TX|United States,,Partially cloudy
4149,Kansas City|MO,05/31/2022,65.0,78.2,72.3,65.9,81.0,,16.1,42.5,...,8.4,87.7,1009.7,Lightning Without Thunder|Mist|Thunderstorm|Ra...,39.1034,-94.5831,Kansas City|MO|United States,Kansas City|MO|United States,,Rain|Overcast
4150,Tampa|FL,05/31/2022,72.7,91.2,80.3,69.3,71.22,95.5,15.3,24.2,...,9.9,43.1,1016.4,Lightning Without Thunder|Light Drizzle|Thunde...,27.9465,-82.4593,Tampa|FL|United States,Tampa|FL|United States,,Rain|Partially cloudy


>No null rows to drop.
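
Since `process_api_data` builds every value by splitting text, missing fields would arrive as empty strings rather than `NaN`, and `dropna()` would not treat them as missing. A minimal sketch of that behavior, assuming string-valued columns and using two illustrative rows:

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "Address": ["Cleveland|OH", "Charlotte|NC"],
    "Heat Index": ["", "81.8"],   # empty string stands for a missing reading
})

# dropna() only removes real NaN/None values, so nothing is dropped here
print(len(sample_df.dropna()))   # 2

# Converting empty strings to NaN makes the missing value visible
cleaned = sample_df.replace("", np.nan)
print(int(cleaned["Heat Index"].isna().sum()))  # 1
print(len(cleaned.dropna()))                    # 1
```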

#### 5. Drop Columns

In [None]:
weather_data_df.groupby(['Info'])['Info'].count().sort_index()
#There doesn't seem to be any relevant information in the Info column.

In [342]:
weather_data_df[weather_data_df['Resolved Address'] == weather_data_df['Name']]

Unnamed: 0,Address,Date time,Minimum Temperature,Maximum Temperature,Temperature,Dew Point,Relative Humidity,Heat Index,Wind Speed,Wind Gust,...,Visibility,Cloud Cover,Sea Level Pressure,Weather Type,Latitude,Longitude,Resolved Address,Name,Info,Conditions
0,Birmingham|AL,05/01/2022,65.0,82.1,73.0,63.6,73.68,83.5,10.8,36.7,...,9.4,66.8,1016.0,Lightning Without Thunder|Mist|Thunderstorm|Ra...,33.5207,-86.8118,Birmingham|AL|United States,Birmingham|AL|United States,,Rain|Partially cloudy
1,Cleveland|OH,05/01/2022,53.1,74.0,62.2,45.6,60.2,,18.3,30.2,...,9.4,75.6,1010.3,Mist|Thunderstorm|Light Rain,41.5047,-81.6908,Cleveland|OH|United States,Cleveland|OH|United States,,Rain|Overcast
2,Charlotte|NC,05/01/2022,57.3,80.5,68.6,59.5,73.82,81.8,13.1,20.7,...,9.9,64.2,1017.5,Lightning Without Thunder,35.2229,-80.838,Charlotte|NC|United States,Charlotte|NC|United States,,Rain|Partially cloudy
3,Pittsburgh|PA,05/01/2022,53.5,70.4,60.8,46.6,64.71,,9.2,23.5,...,8.8,76.1,1013.4,Lightning Without Thunder|Mist|Thunderstorm|Ra...,40.4385,-79.9973,Pittsburgh|PA|United States,Pittsburgh|PA|United States,,Rain|Overcast
4,Raleigh/Durham|NC,05/01/2022,50.5,82.2,65.0,57.3,79.5,82.6,12.5,20.8,...,8.2,66.1,1018.2,Mist|Light Drizzle|Thunderstorm|Rain|Fog|Heavy...,35.9869,-78.6686,Durham Rd|Raleigh|NC 27614|United States,Durham Rd|Raleigh|NC 27614|United States,,Rain|Partially cloudy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4147,Savannah|GA,05/31/2022,70.5,84.0,76.6,73.2,89.54,91.7,8.8,,...,9.8,34.3,1018.7,Lightning Without Thunder|Mist|Rain|Thundersto...,32.0809,-81.0912,Savannah|GA|United States,Savannah|GA|United States,,Rain|Partially cloudy
4148,Dallas|TX,05/31/2022,76.0,92.3,84.0,69.4,63.24,96.7,18.3,31.6,...,9.9,59.3,1009.8,,32.7782,-96.7951,Dallas|TX|United States,Dallas|TX|United States,,Partially cloudy
4149,Kansas City|MO,05/31/2022,65.0,78.2,72.3,65.9,81.0,,16.1,42.5,...,8.4,87.7,1009.7,Lightning Without Thunder|Mist|Thunderstorm|Ra...,39.1034,-94.5831,Kansas City|MO|United States,Kansas City|MO|United States,,Rain|Overcast
4150,Tampa|FL,05/31/2022,72.7,91.2,80.3,69.3,71.22,95.5,15.3,24.2,...,9.9,43.1,1016.4,Lightning Without Thunder|Light Drizzle|Thunde...,27.9465,-82.4593,Tampa|FL|United States,Tampa|FL|United States,,Rain|Partially cloudy


>All rows have the same data in Resolved Address and Name, so one of these columns is a duplicate and can be dropped. Info has all NaN values. Dropping the two columns, Info and Resolved Address.
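
The filter above displays the rows where the two columns match. To confirm that every row matches, a direct check may be clearer. A minimal sketch on illustrative rows (the column names follow the dataframe above):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Resolved Address": ["Dallas|TX|United States", "Tampa|FL|United States"],
    "Name": ["Dallas|TX|United States", "Tampa|FL|United States"],
    "Info": ["", ""],
})

# True only if every row has the same value in both columns
same_everywhere = bool((sample_df["Resolved Address"] == sample_df["Name"]).all())

# True only if Info is empty (NaN or blank text) on every row
info_blank = sample_df["Info"].isna() | (sample_df["Info"].astype(str).str.strip() == "")
info_is_empty = bool(info_blank.all())

if same_everywhere and info_is_empty:
    sample_df = sample_df.drop(columns=["Info", "Resolved Address"])
print(list(sample_df.columns))  # ['Name']
```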

In [349]:
weather_data_df = weather_data_df.drop(columns=['Info', 'Resolved Address'])

In [356]:
weather_data_df.shape

(4152, 23)

#### 6. Drop Duplicates

In [350]:
#Dropping dups from the copy and retaining the original df,
#to avoid having to recreate the df with multiple hits to the API. 
weather_data_df_copy = weather_data_df.copy() #.copy() creates an independent dataframe; plain assignment would only alias the original

In [351]:
weather_data_df_copy.shape, weather_data_df.shape

((4152, 23), (4152, 23))
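
Plain assignment in pandas binds a second name to the same dataframe, whereas `.copy()` creates an independent one. A short sketch of the difference, using a small illustrative dataframe:

```python
import pandas as pd

original = pd.DataFrame({"Address": ["Dallas|TX", "Dallas|TX"]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

alias.loc[0, "Address"] = "Tampa|FL"
print(original.loc[0, "Address"])     # Tampa|FL  (changed through the alias)
print(independent.loc[0, "Address"])  # Dallas|TX (unaffected)
print(alias is original, independent is original)  # True False
```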

In [352]:
print('Dataframe before dropping duplicates :', weather_data_df_copy.shape)
weather_data_df_copy = weather_data_df_copy.drop_duplicates() # 2,776 rows dropped (4,152 - 1,376)
print('Dataframe after dropping duplicates :',weather_data_df_copy.shape)

Dataframe before dropping duplicates : (4152, 23)
Dataframe after dropping duplicates : (1376, 23)
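
The drop from 4,152 to 1,376 rows indicates that many flights share the same origin city and date. One way to stay within the API limit, sketched below under the assumption that `weather_delay_df` has the `ORIGIN_CITY_NAME` and `FL_DATE` columns used in the API loop, is to request each unique pair only once. The sample rows are illustrative.

```python
import pandas as pd

flights = pd.DataFrame({
    "ORIGIN_CITY_NAME": ["Pittsburgh, PA", "Pittsburgh, PA", "Raleigh/Durham, NC"],
    "FL_DATE": ["5/1/2022 12:00:00 AM", "5/1/2022 12:00:00 AM", "5/1/2022 12:00:00 AM"],
})

# One row per (city, date) pair: each pair needs only one API call
lookups = flights[["ORIGIN_CITY_NAME", "FL_DATE"]].drop_duplicates().reset_index(drop=True)
print(len(flights), len(lookups))  # 3 2
```

Iterating over `lookups` instead of the full flight table would make the later duplicate removal on the weather data unnecessary.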


#### 7. Replace column names

In [353]:
weather_data_df.columns

Index(['Address', 'Date time', 'Minimum Temperature', 'Maximum Temperature',
       'Temperature', 'Dew Point', 'Relative Humidity', 'Heat Index',
       'Wind Speed', 'Wind Gust', 'Wind Direction', 'Wind Chill',
       'Precipitation', 'Precipitation Cover', 'Snow Depth', 'Visibility',
       'Cloud Cover', 'Sea Level Pressure', 'Weather Type', 'Latitude',
       'Longitude', 'Name', 'Conditions'],
      dtype='object')

In [354]:
columns = ['ORIGIN', 'FL_DATE', 'MIN_TEMP', 'MAX_TEMP','TEMP', 'DEW_POINT', 'RELATIVE_HUMIDITY', 'HEAT_INDEX', 'WIND_SPEED',
           'WIND_GUST', 'WIND_DIRECTION', 'WIND_CHILL', 'PRECIPITATION', 'PRECIPITATION_COVER', 'SNOW_DEPTH', 
           'VISIBILITY','CLOUD_COVER', 'SEA_LEVEL_PRESSURE', 'WEATHER_TYPE', 'LATITUDE','LONGITUDE', 'CITY_NAME', 'CONDITIONS']

weather_data_df_copy.columns = columns

In [355]:
weather_data_df_copy.columns

Index(['ORIGIN', 'FL_DATE', 'MIN_TEMP', 'MAX_TEMP', 'TEMP', 'DEW_POINT',
       'RELATIVE_HUMIDITY', 'HEAT_INDEX', 'WIND_SPEED', 'WIND_GUST',
       'WIND_DIRECTION', 'WIND_CHILL', 'PRECIPITATION', 'PRECIPITATION_COVER',
       'SNOW_DEPTH', 'VISIBILITY', 'CLOUD_COVER', 'SEA_LEVEL_PRESSURE',
       'WEATHER_TYPE', 'LATITUDE', 'LONGITUDE', 'CITY_NAME', 'CONDITIONS'],
      dtype='object')
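
Assigning a list to `.columns` depends on the column order staying the same. A sketch of the same renaming with an explicit mapping, shown for a few of the columns on an illustrative dataframe:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Date time": ["05/01/2022"],
    "Address": ["Tampa|FL"],
    "Minimum Temperature": ["72.7"],
})

# Each old name is paired with its new name, so column order does not matter
rename_map = {
    "Address": "ORIGIN",
    "Date time": "FL_DATE",
    "Minimum Temperature": "MIN_TEMP",
}
renamed = sample_df.rename(columns=rename_map)
print(list(renamed.columns))  # ['FL_DATE', 'ORIGIN', 'MIN_TEMP']
```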

#### 8. Fill NA/NaN values, if any

In [363]:
#Print the rows with a missing value, one numeric column at a time
numeric_columns = ['LATITUDE', 'LONGITUDE', 'MIN_TEMP', 'MAX_TEMP', 'TEMP', 'DEW_POINT',
                   'RELATIVE_HUMIDITY', 'HEAT_INDEX', 'WIND_SPEED', 'WIND_GUST',
                   'WIND_DIRECTION', 'WIND_CHILL', 'PRECIPITATION', 'PRECIPITATION_COVER',
                   'SNOW_DEPTH', 'VISIBILITY', 'CLOUD_COVER', 'SEA_LEVEL_PRESSURE']
for column in numeric_columns:
    print(weather_data_df_copy[weather_data_df_copy[column].isna()])

Empty DataFrame
Columns: [ORIGIN, FL_DATE, MIN_TEMP, MAX_TEMP, TEMP, DEW_POINT, RELATIVE_HUMIDITY, HEAT_INDEX, WIND_SPEED, WIND_GUST, WIND_DIRECTION, WIND_CHILL, PRECIPITATION, PRECIPITATION_COVER, SNOW_DEPTH, VISIBILITY, CLOUD_COVER, SEA_LEVEL_PRESSURE, WEATHER_TYPE, LATITUDE, LONGITUDE, CITY_NAME, CONDITIONS]
Index: []

[0 rows x 23 columns]

>There are no NaN values in the numeric columns.
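
A more compact way to run the same check is to count missing values per column. The sketch below also converts text to numbers first, on the assumption that the values are strings parsed from the API response. The rows are illustrative.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "ORIGIN": ["Dallas|TX", "Kansas City|MO"],
    "TEMP": ["84.0", "72.3"],
    "HEAT_INDEX": ["96.7", ""],   # empty string stands for a missing reading
})

numeric_columns = ["TEMP", "HEAT_INDEX"]
# errors="coerce" turns values that are not numbers (including "") into NaN
numeric = sample_df[numeric_columns].apply(pd.to_numeric, errors="coerce")

# Count the missing values in each numeric column
missing_counts = {column: int(count) for column, count in numeric.isna().sum().items()}
print(missing_counts)  # {'TEMP': 0, 'HEAT_INDEX': 1}
```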

### Ethical implications:
>API Data - 
The API's data source is genuine and reliable, as stated in the terms of use on the website, and I do not see any legal concerns in using the data. However, I would like to validate the accuracy of the data by looking up the weather at a place and time known to have had bad weather.

### Conclusion:
>As a part of this milestone, the following Data Transformation steps have been performed.

>1. Data manipulation using regular expressions to convert response text into a list of keys and values
>2. Data transformation to parse the list of keys and values to form a dictionary (key-value pair) 
>3. Data transformation to convert the dictionary to a dataframe
>4. Look for empty rows and null values
>5. Drop Columns
>6. Drop Duplicates
>7. Replace column names
>8. Fill NA/NaN values, if any 

>During the final project analysis, if there was no concerning weather condition at the origin airport, the same lookup can be run for the destination airport to check whether the flight was delayed or cancelled due to bad weather there.
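
The origin-then-destination check described above could be sketched as follows. This is only an illustration: the keyword list, the helper names, and the `lookup` callable are hypothetical, and what counts as a concerning condition would need to be defined during the final analysis.

```python
# Hypothetical keywords treated as "concerning" weather in this sketch
BAD_WEATHER_KEYWORDS = ("thunderstorm", "snow", "fog", "heavy", "ice")

def is_bad_weather(weather_type):
    """Return True if the Weather Type text mentions any concerning keyword."""
    text = (weather_type or "").lower()
    return any(keyword in text for keyword in BAD_WEATHER_KEYWORDS)

def weather_delay_location(origin, destination, lookup):
    """Check the origin first, then the destination.

    lookup is any callable that maps a city name to its Weather Type text.
    Returns "origin", "destination", or None if neither shows bad weather.
    """
    if is_bad_weather(lookup(origin)):
        return "origin"
    if is_bad_weather(lookup(destination)):
        return "destination"
    return None

# Illustrative lookup table standing in for the API
sample_weather = {"Dallas, TX": "", "Tampa, FL": "Light Drizzle|Thunderstorm"}
print(weather_delay_location("Dallas, TX", "Tampa, FL", sample_weather.get))  # destination
```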

###### The following lines can be ignored. They contain practice code that I would like to refer to, if required.