# Airline On-Time Data
## Introduction
Have you ever been stuck in an airport because your flight was delayed or cancelled and wondered if you could have predicted it if you'd had more data? This is your chance to find out. 

The data: The dataset consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. It is a large dataset: there are nearly 120 million records in total, taking up 1.6 gigabytes of space compressed and 12 gigabytes uncompressed. The data comes originally from RITA, where it is described in detail. You can download the data [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7).

**However, only the data for 2007 is used in this project.**

Variable descriptions: 

S/N | Name | Description 
-- | -- | --
1 | `Year` | 1987-2008 (2007 for this analysis)
2 | `Month` | 1-12 
3 | `DayofMonth` | 1-31 
4 | `DayOfWeek` | 1 (Monday) - 7 (Sunday) 
5 | `DepTime` | actual departure time (local, hhmm) 
6 | `CRSDepTime` | scheduled departure time (local, hhmm) 
7 | `ArrTime` | actual arrival time (local, hhmm) 
8 | `CRSArrTime` | scheduled arrival time (local, hhmm) 
9 | `UniqueCarrier` | unique carrier code 
10 | `FlightNum` | flight number 
11 | `TailNum` | plane tail number 
12 | `ActualElapsedTime` | actual elapsed time in minutes 
13 | `CRSElapsedTime` | scheduled elapsed time in minutes 
14 | `AirTime` | the time from the moment an aircraft leaves the surface until it comes into contact with the surface at the next point of landing, in minutes 
15 | `ArrDelay` | arrival delay, in minutes 
16 | `DepDelay` | departure delay, in minutes 
17 | `Origin` | origin, IATA airport code 
18 | `Dest` | destination, IATA airport code 
19 | `Distance` | distance covered, in miles 
20 | `TaxiIn` | taxi in time, in minutes 
21 | `TaxiOut` | taxi out time, in minutes 
22 | `Cancelled` | was the flight cancelled? (1 = yes, 0 = no) 
23 | `CancellationCode` | reason for cancellation (A = carrier, B = weather, C = NAS, D = security) 
24 | `Diverted` | 1 = yes, 0 = no 
25 | `CarrierDelay` | delay within air carrier's control, in minutes 
26 | `WeatherDelay` | delay caused by extreme weather conditions, in minutes 
27 | `NASDelay` | delay within the NAS control, in minutes 
28 | `SecurityDelay` | delay due to security checks, breach or faulty security equipment, in minutes 
29 | `LateAircraftDelay` | delay due to the late arrival of the same aircraft at a previous airport, in minutes

* The International Air Transport Association's (IATA) Location Identifier is a unique 3-letter code (also commonly known as IATA code) used in aviation and also in logistics to identify an airport.
* A computer reservation system, or central reservation system (CRS), is web-based software used by travel agencies and travel management companies to retrieve and conduct transactions related to air travel, hotels, car rental, or other activities. In this dataset, the `CRS` prefix marks scheduled values (for example, `CRSDepTime` is the scheduled departure time).

**Questions of Interest for the 2007 data**
* When is the best time of day/day of week/time of year to fly to minimise delays?
* Do older planes suffer more delays?
* How does the number of people flying between different locations change over time?
* How well does weather predict plane delays?
* Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?

In [1]:
# import modules and libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

## Data Wrangling
### Gathering data

In [28]:
airline_df = pd.read_csv('2007.csv')
airports = pd.read_csv('airports.csv')
carriers = pd.read_csv('carriers.csv')
plane_data = pd.read_csv('plane-data.csv')

In [3]:
airline_df.shape

(7453215, 29)

The 2007 data is reasonably large, with about 7.45 million rows and 29 columns.
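
With the default 64-bit dtypes, a frame with 7.4 million rows and 29 columns can occupy well over a gigabyte of memory. One possible way to reduce that, sketched here on a tiny in-memory sample rather than the real `2007.csv`, is to pass narrower dtypes to `pd.read_csv`. The sample rows and the particular dtype choices below are assumptions for illustration, not settings used in this notebook.

```python
import io
import pandas as pd

# Tiny stand-in for 2007.csv (assumed sample; the real file has 29 columns).
sample_csv = io.StringIO(
    "Year,Month,DayofMonth,DayOfWeek,UniqueCarrier,Cancelled,Distance\n"
    "2007,1,1,1,WN,0,389\n"
    "2007,1,1,1,WN,0,479\n"
    "2007,8,23,4,MQ,1,331\n"
)

# Narrower integer dtypes than the int64 default (illustrative choices).
compact_dtypes = {
    "Year": "int16",
    "Month": "int8",
    "DayofMonth": "int8",
    "DayOfWeek": "int8",
    "Cancelled": "int8",
    "Distance": "int16",
}

sample_df = pd.read_csv(sample_csv, dtype=compact_dtypes)

# Re-read with default dtypes to compare the memory footprint.
sample_csv.seek(0)
default_df = pd.read_csv(sample_csv)

compact_bytes = sample_df.memory_usage(deep=True).sum()
default_bytes = default_df.memory_usage(deep=True).sum()
print(compact_bytes, default_bytes)
```

Columns that contain NaN (such as `DepTime`) cannot use plain integer dtypes, so they would need a float or nullable dtype instead.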

### Assessing the data

In [4]:
airline_df.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2007,1,1,1,1232.0,1225,1341.0,1340,WN,2891,...,4,11,0,,0,0,0,0,0,0
1,2007,1,1,1,1918.0,1905,2043.0,2035,WN,462,...,5,6,0,,0,0,0,0,0,0
2,2007,1,1,1,2206.0,2130,2334.0,2300,WN,1229,...,6,9,0,,0,3,0,0,0,31
3,2007,1,1,1,1230.0,1200,1356.0,1330,WN,1355,...,3,8,0,,0,23,0,0,0,3
4,2007,1,1,1,831.0,830,957.0,1000,WN,2278,...,3,9,0,,0,0,0,0,0,0


In [24]:
airline_df.iloc[:,:15].head(10)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay
0,2007,1,1,1,1232.0,1225,1341.0,1340,WN,2891,N351,69.0,75.0,54.0,1.0
1,2007,1,1,1,1918.0,1905,2043.0,2035,WN,462,N370,85.0,90.0,74.0,8.0
2,2007,1,1,1,2206.0,2130,2334.0,2300,WN,1229,N685,88.0,90.0,73.0,34.0
3,2007,1,1,1,1230.0,1200,1356.0,1330,WN,1355,N364,86.0,90.0,75.0,26.0
4,2007,1,1,1,831.0,830,957.0,1000,WN,2278,N480,86.0,90.0,74.0,-3.0
5,2007,1,1,1,1430.0,1420,1553.0,1550,WN,2386,N611SW,83.0,90.0,74.0,3.0
6,2007,1,1,1,1936.0,1840,2217.0,2130,WN,409,N482,101.0,110.0,89.0,47.0
7,2007,1,1,1,944.0,935,1223.0,1225,WN,1131,N749SW,99.0,110.0,86.0,-2.0
8,2007,1,1,1,1537.0,1450,1819.0,1735,WN,1212,N451,102.0,105.0,90.0,44.0
9,2007,1,1,1,1318.0,1315,1603.0,1610,WN,2456,N630WN,105.0,115.0,92.0,-7.0


In [23]:
airline_df.iloc[:,15:].head(10)

Unnamed: 0,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,7.0,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
1,13.0,SMF,PDX,479,5,6,0,,0,0,0,0,0,0
2,36.0,SMF,PDX,479,6,9,0,,0,3,0,0,0,31
3,30.0,SMF,PDX,479,3,8,0,,0,23,0,0,0,3
4,1.0,SMF,PDX,479,3,9,0,,0,0,0,0,0,0
5,10.0,SMF,PDX,479,2,7,0,,0,0,0,0,0,0
6,56.0,SMF,PHX,647,5,7,0,,0,46,0,0,0,1
7,9.0,SMF,PHX,647,4,9,0,,0,0,0,0,0,0
8,47.0,SMF,PHX,647,5,7,0,,0,20,0,0,0,24
9,3.0,SMF,PHX,647,5,8,0,,0,0,0,0,0,0


In [26]:
airline_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7453215 entries, 0 to 7453214
Data columns (total 29 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Year               int64  
 1   Month              int64  
 2   DayofMonth         int64  
 3   DayOfWeek          int64  
 4   DepTime            float64
 5   CRSDepTime         int64  
 6   ArrTime            float64
 7   CRSArrTime         int64  
 8   UniqueCarrier      object 
 9   FlightNum          int64  
 10  TailNum            object 
 11  ActualElapsedTime  float64
 12  CRSElapsedTime     float64
 13  AirTime            float64
 14  ArrDelay           float64
 15  DepDelay           float64
 16  Origin             object 
 17  Dest               object 
 18  Distance           int64  
 19  TaxiIn             int64  
 20  TaxiOut            int64  
 21  Cancelled          int64  
 22  CancellationCode   object 
 23  Diverted           int64  
 24  CarrierDelay       int64  
 25  WeatherDelay      

Percentage of missing values per column

In [32]:
percentNaN = airline_df.isnull().sum()/len(airline_df) * 100
percentNaN

Year                  0.000000
Month                 0.000000
DayofMonth            0.000000
DayOfWeek             0.000000
DepTime               2.156761
CRSDepTime            0.000000
ArrTime               2.387252
CRSArrTime            0.000000
UniqueCarrier         0.000000
FlightNum             0.000000
TailNum               0.000295
ActualElapsedTime     2.387252
CRSElapsedTime        0.013337
AirTime               2.387252
ArrDelay              2.387252
DepDelay              2.156761
Origin                0.000000
Dest                  0.000000
Distance              0.000000
TaxiIn                0.000000
TaxiOut               0.000000
Cancelled             0.000000
CancellationCode     97.843226
Diverted              0.000000
CarrierDelay          0.000000
WeatherDelay          0.000000
NASDelay              0.000000
SecurityDelay         0.000000
LateAircraftDelay     0.000000
dtype: float64

`CancellationCode` has by far the highest percentage of NaN values (about 97.8%). That's expected: a cancellation code is recorded only for cancelled flights, and the vast majority of flights weren't cancelled.
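
The reasoning above can be checked directly: if the code is missing exactly for the flights that were not cancelled, the two percentages should match. A minimal sketch on made-up rows (the `toy` frame is hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: three flights that operated and one cancelled for weather.
toy = pd.DataFrame({
    "Cancelled": [0, 0, 0, 1],
    "CancellationCode": [np.nan, np.nan, np.nan, "B"],
})

# Share of rows with no cancellation code, as a percentage.
pct_code_missing = toy["CancellationCode"].isnull().mean() * 100

# Share of flights that were not cancelled, as a percentage.
pct_not_cancelled = (toy["Cancelled"] == 0).mean() * 100

print(pct_code_missing, pct_not_cancelled)
```

On the real data, the same two expressions applied to `airline_df` would be expected to agree if the assumption holds.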

In [36]:
percentNaN[percentNaN != 0]

DepTime               2.156761
ArrTime               2.387252
TailNum               0.000295
ActualElapsedTime     2.387252
CRSElapsedTime        0.013337
AirTime               2.387252
ArrDelay              2.387252
DepDelay              2.156761
CancellationCode     97.843226
dtype: float64

Let's look at rows with missing values in the variables above, excluding `CancellationCode`.

In [38]:
airline_df[airline_df.DepTime.isnull()].sample(15)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
4771362,2007,8,23,4,,1350,,1455,MQ,3527,...,0,0,1,C,0,0,0,0,0,0
1104366,2007,2,15,4,,1535,,1715,AA,1902,...,0,0,1,A,0,0,0,0,0,0
522742,2007,1,20,6,,915,,1035,AA,1616,...,0,0,1,B,0,0,0,0,0,0
4289180,2007,7,12,4,,625,,917,B6,371,...,0,0,1,A,0,0,0,0,0,0
2950137,2007,5,12,6,,1210,,1450,AA,677,...,0,0,1,A,0,0,0,0,0,0
5191364,2007,9,12,3,,710,,810,OO,5780,...,0,0,1,B,0,0,0,0,0,0
1004147,2007,2,14,3,,1000,,1005,MQ,4359,...,0,0,1,B,0,0,0,0,0,0
2152451,2007,4,5,4,,1814,,2140,EV,4083,...,0,0,1,A,0,0,0,0,0,0
1289736,2007,3,5,1,,1635,,1808,XE,2996,...,0,0,1,C,0,0,0,0,0,0
4852617,2007,8,19,7,,1905,,2159,9E,5663,...,0,0,1,A,0,0,0,0,0,0


I suspect that most missing values in the other columns are the result of the flight being cancelled: a cancelled flight can't have a departure time or an arrival time. Let's look for missing data among the flights that weren't cancelled.

In [44]:
flights_NotCancelled = airline_df.query('Cancelled == 0')

In [45]:
# note: the denominator is len(airline_df), so these are percentages of all flights
percentNaN = flights_NotCancelled.isnull().sum()/len(airline_df) * 100
percentNaN[percentNaN != 0]

ArrTime               0.230491
ActualElapsedTime     0.230491
CRSElapsedTime        0.009754
AirTime               0.230491
ArrDelay              0.230491
CancellationCode     97.843226
dtype: float64

Apart from `CancellationCode`, only a small fraction of rows (about 0.23% of all flights) still contains NaN values. Let's have a look at them.

In [54]:
# select the flights that weren't cancelled but have a missing ArrDelay
flights_NotCancelled[flights_NotCancelled.ArrDelay.isnull()]

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
3340,2007,1,2,2,802.0,755,,1115,WN,837,...,0,8,0,,1,0,0,0,0,0
8000,2007,1,4,4,1627.0,1610,,1735,WN,2474,...,0,15,0,,1,0,0,0,0,0
8014,2007,1,4,4,1951.0,1905,,2005,WN,2860,...,0,9,0,,1,0,0,0,0,0
8447,2007,1,4,4,845.0,845,,1245,WN,76,...,0,7,0,,1,0,0,0,0,0
8827,2007,1,4,4,1052.0,840,,1000,WN,102,...,0,8,0,,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7450737,2007,12,13,4,1158.0,1200,,1427,DL,670,...,0,16,0,,1,0,0,0,0,0
7451337,2007,12,13,4,653.0,700,,920,DL,1465,...,0,38,0,,1,0,0,0,0,0
7451634,2007,12,13,4,1151.0,1155,,1500,DL,1777,...,0,12,0,,1,0,0,0,0,0
7451702,2007,12,13,4,1153.0,1200,,1507,DL,1844,...,0,12,0,,1,0,0,0,0,0


It looks like all the missing values, except those in `CancellationCode`, are the result of the flight being diverted. Let's check whether that holds.

In [55]:
flights_NotCancelled[flights_NotCancelled.ArrDelay.isnull()].Diverted.value_counts()

1    17179
Name: Diverted, dtype: int64

In [57]:
flights_NotCancelled[flights_NotCancelled.CRSElapsedTime.isnull()].Diverted.value_counts()

1    727
Name: Diverted, dtype: int64
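
The cancelled and diverted checks above can also be combined into a single table of missing-value counts per flight status. The sketch below uses a small made-up frame; the `status` labels and the `toy` rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one completed, one diverted, one cancelled, one completed.
toy = pd.DataFrame({
    "Cancelled": [0, 0, 1, 0],
    "Diverted": [0, 1, 0, 0],
    "DepTime": [1232.0, 802.0, np.nan, 1918.0],
    "ArrTime": [1341.0, np.nan, np.nan, 2043.0],
    "ArrDelay": [1.0, np.nan, np.nan, 8.0],
})

# Label each flight as cancelled, diverted, or completed.
status = pd.Series(
    np.select(
        [toy["Cancelled"] == 1, toy["Diverted"] == 1],
        ["cancelled", "diverted"],
        default="completed",
    ),
    index=toy.index,
    name="status",
)

# Count missing values per column within each status group.
missing_by_status = (
    toy[["DepTime", "ArrTime", "ArrDelay"]]
    .isnull()
    .groupby(status)
    .sum()
)
print(missing_by_status)
```

If every NaN is explained by a cancellation or a diversion, the `completed` row of this table should contain only zeros.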

In [59]:
# clear flights_NotCancelled from memory
del flights_NotCancelled

Check for duplicates

In [61]:
airline_df.duplicated().sum()

27

In [62]:
airline_df[airline_df.duplicated()]

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
354402,2007,1,14,7,35.0,35,618.0,605,F9,514,...,37,12,0,,0,0,0,0,0,0
356314,2007,1,21,7,32.0,35,621.0,605,F9,514,...,35,10,0,,0,0,0,16,0,0
4076805,2007,7,23,1,1905.0,1900,2020.0,2035,F9,419,...,6,16,0,,0,0,0,0,0,0
4076807,2007,7,23,1,2136.0,2130,2251.0,2255,F9,419,...,13,11,0,,0,0,0,0,0,0
4076809,2007,7,23,1,619.0,625,931.0,940,F9,222,...,4,12,0,,0,0,0,0,0,0
4076811,2007,7,23,1,1029.0,1025,1630.0,1600,F9,448,...,7,23,0,,0,4,0,26,0,0
4076813,2007,7,23,1,1714.0,1645,1845.0,1905,F9,449,...,6,14,0,,0,0,0,0,0,0
4076815,2007,7,23,1,2004.0,2005,2345.0,2335,F9,237,...,8,16,0,,0,0,0,0,0,0
4076817,2007,7,23,1,610.0,620,723.0,749,F9,378,...,10,9,0,,0,0,0,0,0,0
4076819,2007,7,23,1,2030.0,2030,2346.0,2345,F9,372,...,5,10,0,,0,0,0,0,0,0


In [66]:
airline_df.iloc[4076804:4076824,:15]

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay
4076804,2007,7,23,1,1905.0,1900,2020.0,2035,F9,419,N946FR,135.0,155.0,113.0,-15.0
4076805,2007,7,23,1,1905.0,1900,2020.0,2035,F9,419,N946FR,135.0,155.0,113.0,-15.0
4076806,2007,7,23,1,2136.0,2130,2251.0,2255,F9,419,N946FR,135.0,145.0,111.0,-4.0
4076807,2007,7,23,1,2136.0,2130,2251.0,2255,F9,419,N946FR,135.0,145.0,111.0,-4.0
4076808,2007,7,23,1,619.0,625,931.0,940,F9,222,N947FR,132.0,135.0,116.0,-9.0
4076809,2007,7,23,1,619.0,625,931.0,940,F9,222,N947FR,132.0,135.0,116.0,-9.0
4076810,2007,7,23,1,1029.0,1025,1630.0,1600,F9,448,N947FR,241.0,215.0,211.0,30.0
4076811,2007,7,23,1,1029.0,1025,1630.0,1600,F9,448,N947FR,241.0,215.0,211.0,30.0
4076812,2007,7,23,1,1714.0,1645,1845.0,1905,F9,449,N947FR,211.0,260.0,191.0,-20.0
4076813,2007,7,23,1,1714.0,1645,1845.0,1905,F9,449,N947FR,211.0,260.0,191.0,-20.0


In [67]:
airline_df.iloc[4076804:4076824,15:]

Unnamed: 0,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
4076804,5.0,MDW,DEN,895,6,16,0,,0,0,0,0,0,0
4076805,5.0,MDW,DEN,895,6,16,0,,0,0,0,0,0,0
4076806,6.0,DEN,LAX,862,13,11,0,,0,0,0,0,0,0
4076807,6.0,DEN,LAX,862,13,11,0,,0,0,0,0,0,0
4076808,-6.0,SMF,DEN,910,4,12,0,,0,0,0,0,0,0
4076809,-6.0,SMF,DEN,910,4,12,0,,0,0,0,0,0,0
4076810,4.0,DEN,PHL,1557,7,23,0,,0,4,0,26,0,0
4076811,4.0,DEN,PHL,1557,7,23,0,,0,4,0,26,0,0
4076812,29.0,PHL,DEN,1557,6,14,0,,0,0,0,0,0,0
4076813,29.0,PHL,DEN,1557,6,14,0,,0,0,0,0,0,0
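
In the output above, each duplicated row sits directly after an identical copy. A minimal sketch of how such pairs could be inspected and then removed, using a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical rows: the first two are exact copies of each other.
toy = pd.DataFrame({
    "FlightNum": [419, 419, 222, 448],
    "TailNum": ["N946FR", "N946FR", "N947FR", "N947FR"],
    "DepTime": [1905.0, 1905.0, 619.0, 1029.0],
})

# keep=False flags every member of a duplicate group, not just the repeats.
both_copies = toy[toy.duplicated(keep=False)]

# Drop the repeats, keeping the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(len(both_copies), len(deduped))
```

By default `duplicated()` marks only the second and later copies, which is why the earlier count reports repeats rather than pairs.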


In [69]:
airports

Unnamed: 0,iata,airport,city,state,country,lat,long
0,00M,Thigpen,Bay Springs,MS,USA,31.953765,-89.234505
1,00R,Livingston Municipal,Livingston,TX,USA,30.685861,-95.017928
2,00V,Meadow Lake,Colorado Springs,CO,USA,38.945749,-104.569893
3,01G,Perry-Warsaw,Perry,NY,USA,42.741347,-78.052081
4,01J,Hilliard Airpark,Hilliard,FL,USA,30.688012,-81.905944
...,...,...,...,...,...,...,...
3371,ZEF,Elkin Municipal,Elkin,NC,USA,36.280024,-80.786069
3372,ZER,Schuylkill Cty/Joe Zerbey,Pottsville,PA,USA,40.706449,-76.373147
3373,ZPH,Zephyrhills Municipal,Zephyrhills,FL,USA,28.228065,-82.155916
3374,ZUN,Black Rock,Zuni,NM,USA,35.083227,-108.791777


In [83]:
airports.sample(25)

Unnamed: 0,iata,airport,city,state,country,lat,long
3204,UKI,Ukiah Municipal,Ukiah,CA,USA,39.125957,-123.200855
1021,BVK,Buckland,Buckland,AK,USA,65.982286,-161.151978
1361,EEK,Eek,Eek,AK,USA,60.215904,-162.005609
1251,DBN,"W. H. \Bud\"" Barron """,Dublin,GA,USA,32.564458,-82.985256
1829,I78,Union County,Marysville,OH,USA,40.224694,-83.351611
1163,COM,Coleman Municipal,Coleman,TX,USA,31.841139,-99.403611
2545,OTG,Worthington Municipal,Worthington,MN,USA,43.655066,-95.579209
1433,EWB,New Bedford Municipal,New Bedford,MA,USA,41.676142,-70.956942
436,4I7,Putnam County,Greencastle,IN,USA,39.633596,-86.813833
2090,LNN,Lost Nation,Willoughby,OH,USA,41.683917,-81.390306


In [81]:
airports.state.unique()

array(['MS', 'TX', 'CO', 'NY', 'FL', 'AL', 'WI', 'OH', 'MO', 'MN', 'IN',
       'NV', 'IL', 'ND', 'MI', 'NE', 'GA', 'DC', 'TN', 'AK', 'ME', 'MA',
       'VT', 'SD', 'NM', 'OK', 'KS', 'KY', 'IA', 'AR', 'LA', 'CA', 'WA',
       'VA', 'AZ', 'PA', 'NJ', 'OR', 'NC', 'UT', 'MT', 'ID', 'CT', 'SC',
       'NH', 'MD', 'DE', 'WV', 'WY', 'PR', 'RI', nan, 'AS', 'CQ', 'GU',
       'HI', 'VI'], dtype=object)

In [91]:
airports.country.unique()

array(['USA', 'Thailand', 'Palau', 'N Mariana Islands',
       'Federated States of Micronesia'], dtype=object)

In [92]:
airports.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3376 entries, 0 to 3375
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   iata     3376 non-null   object 
 1   airport  3376 non-null   object 
 2   city     3364 non-null   object 
 3   state    3364 non-null   object 
 4   country  3376 non-null   object 
 5   lat      3376 non-null   float64
 6   long     3376 non-null   float64
dtypes: float64(2), object(5)
memory usage: 184.8+ KB


In [93]:
airports.describe()

Unnamed: 0,lat,long
count,3376.0,3376.0
mean,40.036524,-98.621205
std,8.329559,22.869458
min,7.367222,-176.646031
25%,34.688427,-108.761121
50%,39.434449,-93.599425
75%,43.372612,-84.137519
max,71.285448,145.621384


In [95]:
carriers.sample(20)

Unnamed: 0,Code,Description
409,CTA,Century Airlines
214,ARN,Arnold Aviation
620,HAQ,Hapag Lloyd Flug.
34,4E (1),British Airtours Limited
1283,TWA,Trans Western Airlines Utah
292,BHQ,Turks Air Ltd. (1)
833,MJ,Lapa-Lineas Aereas Privadas
49,5G,"Skyservice Airlines, Inc."
865,MX,Compania Mexicana De Aviaci
1172,SNB,Ccair


In [97]:
carriers.duplicated().sum()

0

In [100]:
carriers.Code.value_counts()

02Q    1
PHL    1
PLA    1
PL     1
PKQ    1
      ..
ENT    1
EMP    1
EME    1
EMA    1
ZYZ    1
Name: Code, Length: 1490, dtype: int64

In [106]:
plane_data.shape

(5029, 9)

In [103]:
plane_data.duplicated().sum()

0

In [108]:
plane_data

Unnamed: 0,tailnum,type,manufacturer,issue_date,model,status,aircraft_type,engine_type,year
0,N050AA,,,,,,,,
1,N051AA,,,,,,,,
2,N052AA,,,,,,,,
3,N054AA,,,,,,,,
4,N055AA,,,,,,,,
...,...,...,...,...,...,...,...,...,...
5024,N997DL,Corporation,MCDONNELL DOUGLAS AIRCRAFT CO,03/11/1992,MD-88,Valid,Fixed Wing Multi-Engine,Turbo-Fan,1992
5025,N998AT,Corporation,BOEING,01/23/2003,717-200,Valid,Fixed Wing Multi-Engine,Turbo-Fan,2002
5026,N998DL,Corporation,MCDONNELL DOUGLAS CORPORATION,04/02/1992,MD-88,Valid,Fixed Wing Multi-Engine,Turbo-Jet,1992
5027,N999CA,Foreign Corporation,CANADAIR,07/09/2008,CL-600-2B19,Valid,Fixed Wing Multi-Engine,Turbo-Jet,1998


In [107]:
plane_data.sample(15)

Unnamed: 0,tailnum,type,manufacturer,issue_date,model,status,aircraft_type,engine_type,year
3293,N689SW,Corporation,BOEING,03/11/1997,737-3Q8,Valid,Fixed Wing Multi-Engine,Turbo-Fan,1985.0
3610,N754UW,Corporation,AIRBUS INDUSTRIE,10/16/2000,A319-112,Valid,Fixed Wing Multi-Engine,Turbo-Jet,2000.0
4179,N843AS,Corporation,BOMBARDIER INC,04/22/2004,CL-600-2B19,Valid,Fixed Wing Multi-Engine,Turbo-Fan,1999.0
1074,N320AE,Corporation,SAAB-SCANIA,05/15/1998,SAAB 340B,Valid,Fixed Wing Multi-Engine,Turbo-Prop,
2118,N4YGAA,,,,,,,,
616,N202FR,Corporation,AIRBUS,03/06/2008,A320-214,Valid,Fixed Wing Multi-Engine,Turbo-Fan,2008.0
2488,N570AS,Corporation,BOEING,03/21/2007,737-890,Valid,Fixed Wing Multi-Engine,Turbo-Fan,2007.0
4746,N935EV,Corporation,BOMBARDIER INC,09/09/2005,CL-600-2B19,Valid,Fixed Wing Multi-Engine,Turbo-Fan,2005.0
4537,N914DL,Corporation,MCDONNELL DOUGLAS AIRCRAFT CO,06/13/1988,MD-88,Valid,Fixed Wing Multi-Engine,Turbo-Fan,1988.0
1602,N3CEAA,,,,,,,,


#### Issues
**`airline_df`**
1. The time variables (`DepTime`, `CRSDepTime`, `ArrTime`, `CRSArrTime`) are stored as numbers in hhmm form and should be converted to a datetime dtype.
2. Drop the 27 duplicate records.
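
For the time-variable issue, one possible conversion is to combine `Year`, `Month`, and `DayofMonth` with the hhmm value. The helper name `hhmm_to_datetime` and the `toy` rows are hypothetical, and this is only a sketch of one approach, not the cleaning code used later.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last one mimics a cancelled flight with no DepTime.
toy = pd.DataFrame({
    "Year": [2007, 2007, 2007],
    "Month": [1, 1, 8],
    "DayofMonth": [1, 1, 23],
    "DepTime": [1232.0, 831.0, np.nan],
})


def hhmm_to_datetime(df, col):
    """Combine the flight date with an hhmm column into a datetime Series."""
    date_parts = df[["Year", "Month", "DayofMonth"]].rename(
        columns={"Year": "year", "Month": "month", "DayofMonth": "day"}
    )
    dates = pd.to_datetime(date_parts)
    hours = pd.to_timedelta(df[col] // 100, unit="h")
    minutes = pd.to_timedelta(df[col] % 100, unit="m")
    # Missing hhmm values become NaT; an hour of 24 would roll to the next day.
    return dates + hours + minutes


dep_dt = hhmm_to_datetime(toy, "DepTime")
print(dep_dt)
```

This sketch attaches every time to the departure date, so an arrival after midnight would still need its date shifted by one day.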

**`plane_data`**
1. Drop records whose fields are all NaN apart from `tailnum`.
2. Convert `year` to a datetime dtype.
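
The two `plane_data` issues could be handled roughly as follows. The `toy_planes` rows are made up, and the choice to keep rows that have some details but no `year` is an assumption rather than a decision stated above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows modelled on the samples shown earlier.
toy_planes = pd.DataFrame({
    "tailnum": ["N050AA", "N997DL", "N320AE", "N998AT"],
    "manufacturer": [np.nan, "MCDONNELL DOUGLAS AIRCRAFT CO", "SAAB-SCANIA", "BOEING"],
    "year": [np.nan, "1992", np.nan, "2002"],
})

# Drop rows where every field other than tailnum is missing.
detail_cols = [c for c in toy_planes.columns if c != "tailnum"]
cleaned = toy_planes.dropna(subset=detail_cols, how="all").copy()

# Coerce non-numeric entries to NaN, then parse the year as January 1 of that year.
year_num = pd.to_numeric(cleaned["year"], errors="coerce")
year_str = year_num.astype("Int64").astype("string")
cleaned["year"] = pd.to_datetime(year_str, format="%Y", errors="coerce")

print(cleaned)
```

For the question about older planes, a plain integer year may be just as convenient as a datetime, since plane age is a simple subtraction from the flight year.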