# Aviation Accidents Analysis

## 1. Overview

A company I am working with plans to diversify its portfolio by entering new industries, starting with aviation. Leadership wants to understand the risks of operating aircraft, chiefly safety and reliability. As a senior data scientist at the company, I was tasked by management with analyzing accident data to determine which aircraft present the lowest operational risk, so that the investment decision is well informed.


## 2. Business Understanding
The company's goal is to identify the safest and most reliable aircraft to purchase, based on an analysis of accident and incident data collected over the years. Before making a purchase, the company wants to find aircraft that meet the criteria below and to answer the questions that follow them:

* The aircraft is not accident-prone, i.e. it has a low accident frequency.
* When an accident does occur, it results in few deaths and injuries.
* For aircraft that are frequently involved in accidents, who built their engines, and what models are they?
* Are the accident-prone aircraft professionally built or amateur-built?
* What are the leading causes of accidents? Are they attributed to the engine or to other factors, such as weather?

## 3. Data Science Understanding
Analyze accident data collected over the years to establish which aircraft carry the lowest risk, using indicators such as accident frequency, fatality rate, and injury severity. This will answer questions such as:
* Which aircraft models have the lowest number of reported accidents in the dataset?
* What is the accident rate per aircraft model, considering how often each model appears?
* For each aircraft model, what are the average and total numbers of fatalities and injuries per accident?
* Which aircraft models have the lowest injury and fatality rates in reported incidents?
* Among aircraft with high accident frequency, what are the common engine manufacturers and engine models?
* Is there a relationship between engine builder and accident frequency?
* What proportion of accident-prone aircraft were amateur-built vs. professionally built?
* How many accidents are attributed to engine problems, pilot error, weather conditions, or other external factors?
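
The indicators named above (accident frequency, fatalities per accident, share of fatal accidents) can be sketched as a single grouped aggregation. The block below is only an illustration on made-up records: the frame `toy` and the result columns `accidents`, `total_fatalities`, `mean_fatalities` and `fatal_accident_share` are hypothetical names, and the values are not taken from the dataset. Note that a raw accident count reflects how many aircraft of a model are flying as well as how risky the model is, so it is a proxy rather than a true rate.

```python
import pandas as pd

# Made-up records; only the column names echo the dataset used in this notebook.
toy = pd.DataFrame({
    "Make": ["Cessna", "Cessna", "Piper", "Piper", "Piper", "Boeing"],
    "Model": ["172", "172", "PA-28", "PA-28", "PA-28", "737"],
    "Total.Fatal.Injuries": [0, 2, 1, 0, 0, 0],
})

# One row per (Make, Model): number of records, total and mean fatalities,
# and the share of records with at least one fatality.
risk = (
    toy.assign(is_fatal=toy["Total.Fatal.Injuries"] > 0)
       .groupby(["Make", "Model"])
       .agg(
           accidents=("Total.Fatal.Injuries", "size"),
           total_fatalities=("Total.Fatal.Injuries", "sum"),
           mean_fatalities=("Total.Fatal.Injuries", "mean"),
           fatal_accident_share=("is_fatal", "mean"),
       )
       .sort_values("accidents", ascending=False)
)
print(risk)
```

Sorting by `accidents` answers the frequency question, while the other three columns describe how severe a typical accident is for that model.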

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [2]:
# Loading the accident data, using the first column (Event.Id) as the index
aviation_data = pd.read_csv("../Data/AviationData.csv", index_col=0, encoding="latin1", low_memory=False)

### 3.1 Exploring the data

In [3]:
aviation_data.head() # checking the first five rows of the dataframe

Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,Fatal(2),...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,Fatal(4),...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,Fatal(3),...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,Fatal(2),...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,Fatal(1),...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [4]:
aviation_data.tail() # checking the last five rows of the dataframe

Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
20221227106491,Accident,ERA23LA093,2022-12-26,"Annapolis, MD",United States,,,,,Minor,...,Personal,,0.0,1.0,0.0,0.0,,,,29-12-2022
20221227106494,Accident,ERA23LA095,2022-12-26,"Hampton, NH",United States,,,,,,...,,,0.0,0.0,0.0,0.0,,,,
20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,341525N,1112021W,PAN,PAYSON,Non-Fatal,...,Personal,,0.0,0.0,0.0,1.0,VMC,,,27-12-2022
20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,,,,,,...,Personal,MC CESSNA 210N LLC,0.0,0.0,0.0,0.0,,,,
20221230106513,Accident,ERA23LA097,2022-12-29,"Athens, GA",United States,,,,,Minor,...,Personal,,0.0,1.0,0.0,1.0,,,,30-12-2022


In [5]:
aviation_data.info() # getting concise information about the dataframe

<class 'pandas.core.frame.DataFrame'>
Index: 88889 entries, 20001218X45444 to 20221230106513
Data columns (total 30 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Investigation.Type      88889 non-null  object 
 1   Accident.Number         88889 non-null  object 
 2   Event.Date              88889 non-null  object 
 3   Location                88837 non-null  object 
 4   Country                 88663 non-null  object 
 5   Latitude                34382 non-null  object 
 6   Longitude               34373 non-null  object 
 7   Airport.Code            50132 non-null  object 
 8   Airport.Name            52704 non-null  object 
 9   Injury.Severity         87889 non-null  object 
 10  Aircraft.damage         85695 non-null  object 
 11  Aircraft.Category       32287 non-null  object 
 12  Registration.Number     87507 non-null  object 
 13  Make                    88826 non-null  object 
 14  Model                

All the columns have missing values except the first three.
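
Rather than reading this off the `info()` listing, the complete and incomplete columns can be listed directly. The following is a minimal sketch on a made-up three-column frame; `toy`, `complete_columns` and `incomplete_columns` are hypothetical names, and the same two expressions could be applied to `aviation_data`.

```python
import numpy as np
import pandas as pd

# Made-up frame; only the column names echo the real dataset.
toy = pd.DataFrame({
    "Investigation.Type": ["Accident", "Incident", "Accident"],
    "Event.Date": ["1948-10-24", "1962-07-19", "1974-08-30"],
    "Latitude": [np.nan, np.nan, 36.922223],
})

# Columns where every value is present, and columns with at least one gap.
complete_columns = toy.columns[toy.notnull().all()].tolist()
incomplete_columns = toy.columns[toy.isnull().any()].tolist()

print(complete_columns)
print(incomplete_columns)
```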

In [6]:
aviation_data.shape # checking the dataframe dimensions, i.e. the number of rows and columns in the dataframe

(88889, 30)

The dataset contains 88,889 rows and 30 columns.

## 4. Data Preparation and Cleaning

### 4.1 Checking for missing values

In [7]:
aviation_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 88889 entries, 20001218X45444 to 20221230106513
Data columns (total 30 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Investigation.Type      88889 non-null  object 
 1   Accident.Number         88889 non-null  object 
 2   Event.Date              88889 non-null  object 
 3   Location                88837 non-null  object 
 4   Country                 88663 non-null  object 
 5   Latitude                34382 non-null  object 
 6   Longitude               34373 non-null  object 
 7   Airport.Code            50132 non-null  object 
 8   Airport.Name            52704 non-null  object 
 9   Injury.Severity         87889 non-null  object 
 10  Aircraft.damage         85695 non-null  object 
 11  Aircraft.Category       32287 non-null  object 
 12  Registration.Number     87507 non-null  object 
 13  Make                    88826 non-null  object 
 14  Model                

In [8]:
# checking the number of missing values per column
aviation_data.isnull().sum() 

Investigation.Type            0
Accident.Number               0
Event.Date                    0
Location                     52
Country                     226
Latitude                  54507
Longitude                 54516
Airport.Code              38757
Airport.Name              36185
Injury.Severity            1000
Aircraft.damage            3194
Aircraft.Category         56602
Registration.Number        1382
Make                         63
Model                        92
Amateur.Built               102
Number.of.Engines          6084
Engine.Type                7096
FAR.Description           56866
Schedule                  76307
Purpose.of.flight          6192
Air.carrier               72241
Total.Fatal.Injuries      11401
Total.Serious.Injuries    12510
Total.Minor.Injuries      11933
Total.Uninjured            5912
Weather.Condition          4492
Broad.phase.of.flight     27165
Report.Status              6384
Publication.Date          13771
dtype: int64

In [9]:
# Percentage of missing values per column, highest first
missing_values_percentage = aviation_data.isnull().mean().sort_values(ascending=False) * 100
missing_values_percentage

Schedule                  85.845268
Air.carrier               81.271023
FAR.Description           63.974170
Aircraft.Category         63.677170
Longitude                 61.330423
Latitude                  61.320298
Airport.Code              43.601570
Airport.Name              40.708074
Broad.phase.of.flight     30.560587
Publication.Date          15.492356
Total.Serious.Injuries    14.073732
Total.Minor.Injuries      13.424608
Total.Fatal.Injuries      12.826109
Engine.Type                7.982990
Report.Status              7.181991
Purpose.of.flight          6.965991
Number.of.Engines          6.844491
Total.Uninjured            6.650992
Weather.Condition          5.053494
Aircraft.damage            3.593246
Registration.Number        1.554748
Injury.Severity            1.124999
Country                    0.254250
Amateur.Built              0.114750
Model                      0.103500
Make                       0.070875
Location                   0.058500
Accident.Number            0

In [10]:
# Columns with more than 60 percent missing values, selected by threshold rather than by position
columns_to_drop = missing_values_percentage[missing_values_percentage > 60]
columns_to_drop

Schedule             85.845268
Air.carrier          81.271023
FAR.Description      63.974170
Aircraft.Category    63.677170
Longitude            61.330423
Latitude             61.320298
dtype: float64
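
A similar cut-off can also be expressed with `DataFrame.dropna` and its `thresh` argument, which keeps a column only if it has at least that many non-null values. The sketch below uses a made-up frame; `toy`, `max_missing_share` and `min_non_null` are hypothetical names, and a column sitting exactly on the boundary may be treated differently from the percentage comparison used in this notebook.

```python
import math

import numpy as np
import pandas as pd

# Made-up frame; only the column names echo the real dataset.
toy = pd.DataFrame({
    "Make": ["Cessna", "Piper", None, "Boeing", "Bell"],   # 20% missing
    "Schedule": [None, None, None, None, "NSCH"],          # 80% missing
    "Latitude": [np.nan, np.nan, 36.9, 34.2, 40.1],        # 40% missing
})

max_missing_share = 0.60
# thresh is the minimum number of non-null values a column needs to be kept.
min_non_null = math.ceil((1 - max_missing_share) * len(toy))
trimmed = toy.dropna(axis=1, thresh=min_non_null)

print(list(trimmed.columns))
```

Here `Schedule` is removed and the other two columns survive, without an intermediate list of column names.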

In [11]:
# Dropping columns that have more than 60 percent missing values
aviation_data = aviation_data.drop(columns= columns_to_drop.index)
aviation_data

Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Registration.Number,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,Fatal(2),Destroyed,NC6404,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,Fatal(4),Destroyed,N5069P,...,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,,,Fatal(3),Destroyed,N5142R,...,Reciprocating,Personal,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,Fatal(2),Destroyed,N1168J,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,Fatal(1),Destroyed,N15NY,...,,Personal,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20221227106491,Accident,ERA23LA093,2022-12-26,"Annapolis, MD",United States,,,Minor,,N1867H,...,,Personal,0.0,1.0,0.0,0.0,,,,29-12-2022
20221227106494,Accident,ERA23LA095,2022-12-26,"Hampton, NH",United States,,,,,N2895Z,...,,,0.0,0.0,0.0,0.0,,,,
20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,PAN,PAYSON,Non-Fatal,Substantial,N749PJ,...,,Personal,0.0,0.0,0.0,1.0,VMC,,,27-12-2022
20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,,,,,N210CU,...,,Personal,0.0,0.0,0.0,0.0,,,,


In [12]:
aviation_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 88889 entries, 20001218X45444 to 20221230106513
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Investigation.Type      88889 non-null  object 
 1   Accident.Number         88889 non-null  object 
 2   Event.Date              88889 non-null  object 
 3   Location                88837 non-null  object 
 4   Country                 88663 non-null  object 
 5   Airport.Code            50132 non-null  object 
 6   Airport.Name            52704 non-null  object 
 7   Injury.Severity         87889 non-null  object 
 8   Aircraft.damage         85695 non-null  object 
 9   Registration.Number     87507 non-null  object 
 10  Make                    88826 non-null  object 
 11  Model                   88797 non-null  object 
 12  Amateur.Built           88787 non-null  object 
 13  Number.of.Engines       82805 non-null  float64
 14  Engine.Type          

### 4.2 Filling the missing values

In [13]:
categorical_columns = ["Investigation.Type", "Accident.Number", "Event.Date", "Location", "Country", "Airport.Code", "Airport.Name", "Injury.Severity", "Aircraft.damage", "Registration.Number", "Make", "Model", "Amateur.Built", "Purpose.of.flight", "Weather.Condition", "Broad.phase.of.flight", "Report.Status","Publication.Date"]
categorical_columns

['Investigation.Type',
 'Accident.Number',
 'Event.Date',
 'Location',
 'Country',
 'Airport.Code',
 'Airport.Name',
 'Injury.Severity',
 'Aircraft.damage',
 'Registration.Number',
 'Make',
 'Model',
 'Amateur.Built',
 'Purpose.of.flight',
 'Weather.Condition',
 'Broad.phase.of.flight',
 'Report.Status',
 'Publication.Date']

In [14]:
aviation_data[categorical_columns] = aviation_data[categorical_columns].fillna("unknown")
aviation_data

Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Registration.Number,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,unknown,unknown,Fatal(2),Destroyed,NC6404,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,unknown
20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,unknown,unknown,Fatal(4),Destroyed,N5069P,...,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,unknown,unknown,Fatal(3),Destroyed,N5142R,...,Reciprocating,Personal,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,unknown,unknown,Fatal(2),Destroyed,N1168J,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,unknown,unknown,Fatal(1),Destroyed,N15NY,...,,Personal,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20221227106491,Accident,ERA23LA093,2022-12-26,"Annapolis, MD",United States,unknown,unknown,Minor,unknown,N1867H,...,,Personal,0.0,1.0,0.0,0.0,unknown,unknown,unknown,29-12-2022
20221227106494,Accident,ERA23LA095,2022-12-26,"Hampton, NH",United States,unknown,unknown,unknown,unknown,N2895Z,...,,unknown,0.0,0.0,0.0,0.0,unknown,unknown,unknown,unknown
20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,PAN,PAYSON,Non-Fatal,Substantial,N749PJ,...,,Personal,0.0,0.0,0.0,1.0,VMC,unknown,unknown,27-12-2022
20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,unknown,unknown,unknown,unknown,N210CU,...,,Personal,0.0,0.0,0.0,0.0,unknown,unknown,unknown,unknown


In [15]:
aviation_data.sample(10)

Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Registration.Number,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
20001214X45331,Accident,LAX84LA092,1983-12-03,"BLACK CANYON, AZ",United States,unknown,unknown,Non-Fatal,Substantial,N7740S,...,Unknown,Other Work Use,0.0,0.0,0.0,1.0,IMC,Cruise,Probable Cause,unknown
20001206X01332,Accident,LAX94LA228,1994-05-30,"SONOMA, CA",United States,0Q9,SONOMA SKYPARK,Non-Fatal,Substantial,N44630,...,Reciprocating,Personal,0.0,0.0,0.0,2.0,VMC,Landing,Probable Cause,12-01-1995
20191015X65228,Accident,GAA20CA022,2019-10-14,"Douglas, GA",United States,DQH,Douglas Muni,Non-Fatal,Substantial,N8047R,...,Turbo Prop,Personal,0.0,0.0,0.0,1.0,VMC,unknown,The pilot's failure to extend the landing gear...,26-09-2020
20001212X19476,Accident,DEN00LA109,1999-08-03,"WHEATLAND, WY",United States,EAN,PHIFER,Non-Fatal,Substantial,N4789J,...,Reciprocating,Instructional,0.0,0.0,0.0,2.0,VMC,Landing,Probable Cause,05-12-2000
20030926X01606,Accident,CHI03LA321,2003-09-25,"NEW RICHMOND, WI",United States,RNH,New Richmond Municipal Airport,Non-Fatal,Substantial,N5237G,...,Reciprocating,Personal,,,,1.0,VMC,Landing,Probable Cause,02-06-2004
20001213X31520,Accident,FTW87LA163,1987-07-07,"TULSA, OK",United States,1H6,HARVEY YOUNG,Non-Fatal,Substantial,N1175C,...,Reciprocating,Personal,0.0,0.0,1.0,1.0,VMC,Takeoff,Probable Cause,25-10-1988
20060323X00332,Accident,DEN06LA050,2006-03-20,"EMPORIA, KS",United States,unknown,unknown,Non-Fatal,Substantial,N331FC,...,Reciprocating,unknown,,1.0,,,IMC,Cruise,Probable Cause,29-08-2006
20001214X39540,Accident,DEN84FA144,1984-05-04,"RUIDOSO, NM",United States,RUI,RUIDOSO,Fatal(1),Destroyed,N9137T,...,Reciprocating,Personal,1.0,3.0,0.0,0.0,VMC,Takeoff,Probable Cause,unknown
20001212X17454,Accident,ATL91LA149,1991-07-31,"SPARTANBURG, SC",United States,SPA,SPARTANBURG DOWNTOWN MEM.,Non-Fatal,Substantial,N300HF,...,Reciprocating,Business,0.0,0.0,0.0,1.0,VMC,Landing,Probable Cause,04-12-1992
20180215X95605,Incident,ENG18WA013,2018-02-08,"Paris, France",France,LFPO,Paris Orly,unknown,unknown,F-GZHO,...,,unknown,0.0,0.0,0.0,0.0,unknown,unknown,unknown,25-09-2020


In [16]:
numericals_columns = ["Total.Fatal.Injuries", "Total.Serious.Injuries", "Total.Minor.Injuries", "Total.Uninjured", "Number.of.Engines"]
numericals_columns

['Total.Fatal.Injuries',
 'Total.Serious.Injuries',
 'Total.Minor.Injuries',
 'Total.Uninjured',
 'Number.of.Engines']

In [17]:
aviation_data[numericals_columns] = aviation_data[numericals_columns].fillna(0)
aviation_data

Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Registration.Number,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,unknown,unknown,Fatal(2),Destroyed,NC6404,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,unknown
20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,unknown,unknown,Fatal(4),Destroyed,N5069P,...,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,unknown,unknown,Fatal(3),Destroyed,N5142R,...,Reciprocating,Personal,3.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,26-02-2007
20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,unknown,unknown,Fatal(2),Destroyed,N1168J,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,unknown,unknown,Fatal(1),Destroyed,N15NY,...,,Personal,1.0,2.0,0.0,0.0,VMC,Approach,Probable Cause,16-04-1980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20221227106491,Accident,ERA23LA093,2022-12-26,"Annapolis, MD",United States,unknown,unknown,Minor,unknown,N1867H,...,,Personal,0.0,1.0,0.0,0.0,unknown,unknown,unknown,29-12-2022
20221227106494,Accident,ERA23LA095,2022-12-26,"Hampton, NH",United States,unknown,unknown,unknown,unknown,N2895Z,...,,unknown,0.0,0.0,0.0,0.0,unknown,unknown,unknown,unknown
20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,PAN,PAYSON,Non-Fatal,Substantial,N749PJ,...,,Personal,0.0,0.0,0.0,1.0,VMC,unknown,unknown,27-12-2022
20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,unknown,unknown,unknown,unknown,N210CU,...,,Personal,0.0,0.0,0.0,0.0,unknown,unknown,unknown,unknown


In [18]:
aviation_data.sample(10)

Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Registration.Number,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
20020917X01928,Accident,DEN82DA035,1982-02-17,"7 MILES N. OF S, CO",United States,unknown,unknown,Non-Fatal,Substantial,N33NF,...,Reciprocating,Personal,0.0,0.0,0.0,1.0,VMC,Landing,Probable Cause,17-02-1983
20210811103675,Accident,ERA21LA322,2021-08-09,"Hiddenite, NC",United States,unknown,unknown,Non-Fatal,Substantial,N906ER,...,,Instructional,0.0,0.0,0.0,2.0,VMC,unknown,unknown,19-08-2021
20001212X24032,Accident,LAX90DVG18,1990-08-28,"VACAVILLE, CA",United States,NONE,VACAVILLE GLIDER PORT,Non-Fatal,Substantial,N132S,...,Unknown,Personal,0.0,0.0,0.0,1.0,VMC,Landing,Probable Cause,23-11-1992
20001226X45493,Accident,NYC01FA056,2000-12-14,"CHESTERFIELD, NH",United States,EEN,DILLANT-HOPKINS AIRPORT,Fatal(1),Destroyed,N55QS,...,Reciprocating,Unknown,1.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,17-07-2001
20180501X12617,Accident,WPR18LA130,2018-05-02,"Portland, OR",United States,unknown,unknown,Non-Fatal,Substantial,N7526S,...,Reciprocating,Personal,0.0,0.0,1.0,0.0,VMC,unknown,A hard landing for reasons that could not be d...,25-05-2021
20001214X36667,Accident,ATL85LA186,1985-06-15,"ABBEVILLE, SC",United States,A03,DAVIS FIELD,Non-Fatal,Substantial,N47071,...,Reciprocating,Personal,0.0,0.0,0.0,1.0,VMC,Landing,Probable Cause,unknown
20151021X11755,Accident,GAA16CA022,2015-10-19,"Sioux Falls, SD",United States,FSD,JOE FOSS FIELD,Non-Fatal,Substantial,N9780G,...,Reciprocating,Personal,0.0,0.0,0.0,2.0,VMC,unknown,The pilot's decision to land with a 20 knot di...,25-09-2020
20180926X65509,Accident,CEN18FA389,2018-09-26,"Austin, AR",United States,unknown,unknown,Fatal,Substantial,N534MM,...,Reciprocating,Personal,1.0,0.0,0.0,0.0,VMC,unknown,The pilot's failure to maintain airplane contr...,03-12-2020
20060829X01254,Accident,DFW06CA179,2006-07-10,"GEORGETOWN, TX",United States,GTU,Georgetown Municipal Airport,Non-Fatal,Substantial,N8164J,...,Reciprocating,Instructional,0.0,0.0,2.0,0.0,VMC,Approach,Probable Cause,31-10-2006
20050124X00100,Accident,LAX05LA069,2005-01-15,"CAMARILLO, CA",United States,CMA,Camarillo Airport,Fatal(1),Substantial,unknown,...,Reciprocating,Personal,1.0,0.0,0.0,0.0,VMC,Landing,Probable Cause,28-02-2006


### 4.3 Checking for Duplicates

In [19]:
duplicates = aviation_data[aviation_data.duplicated()]
print(len(duplicates))
duplicates.head()

0


Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Registration.Number,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
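
A count of zero here means that no two rows have identical column values; `DataFrame.duplicated()` does not look at the index, so repeated `Event.Id` labels would not be reported. The sketch below shows the difference on a made-up frame. The names `toy`, `full_row_duplicates`, `repeated_event_ids` and `repeated_accident_numbers` are hypothetical, and whether `Event.Id` actually repeats in the real data is an assumption that would need to be checked.

```python
import pandas as pd

# Made-up frame indexed by Event.Id, with one label used twice.
toy = pd.DataFrame(
    {
        "Accident.Number": ["A1", "A2", "A2", "A3"],
        "Make": ["Cessna", "Piper", "Beech", "Bell"],
    },
    index=pd.Index(["E1", "E2", "E2", "E3"], name="Event.Id"),
)

# duplicated() on the frame compares column values only and ignores the index.
full_row_duplicates = int(toy.duplicated().sum())

# Repeated index labels have to be checked separately.
repeated_event_ids = int(toy.index.duplicated(keep=False).sum())

# A narrower key, here the accident number, can repeat as well.
repeated_accident_numbers = int(
    toy.duplicated(subset=["Accident.Number"], keep=False).sum()
)

print(full_row_duplicates, repeated_event_ids, repeated_accident_numbers)
```

If repeats do show up, they are not necessarily errors: one event may involve more than one aircraft, so they are worth inspecting before anything is dropped.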


### 4.4 Checking for extraneous values

In [20]:
for col in aviation_data.columns:
    print(col, '\n', aviation_data[col].value_counts(normalize=True).head(), '\n\n')

Investigation.Type 
 Investigation.Type
Accident    0.956418
Incident    0.043582
Name: proportion, dtype: float64 


Accident.Number 
 Accident.Number
CEN22LA149    0.000022
WPR23LA041    0.000022
WPR23LA045    0.000022
DCA22WA214    0.000022
DCA22WA089    0.000022
Name: proportion, dtype: float64 


Event.Date 
 Event.Date
1984-06-30    0.000281
1982-05-16    0.000281
2000-07-08    0.000281
1983-08-05    0.000270
1984-08-25    0.000270
Name: proportion, dtype: float64 


Location 
 Location
ANCHORAGE, AK      0.004882
MIAMI, FL          0.002250
ALBUQUERQUE, NM    0.002205
HOUSTON, TX        0.002171
CHICAGO, IL        0.002070
Name: proportion, dtype: float64 


Country 
 Country
United States     0.925289
Brazil            0.004207
Canada            0.004039
Mexico            0.004027
United Kingdom    0.003870
Name: proportion, dtype: float64 


Airport.Code 
 Airport.Code
unknown    0.436016
NONE       0.016740
PVT        0.005456
APA        0.001800
ORD        0.001676
Name: pro
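
Listings like this one are mainly useful for spotting several spellings of the same value, for example the `UNK`, `Unknown` and `unknown` labels visible in the tables above. One possible way to merge such labels is sketched below on made-up values. The frame `toy` and the set `unknown_labels` are hypothetical, and the mixed casing shown for `Make` is an assumption for illustration, not something confirmed by the output above.

```python
import pandas as pd

# Made-up values showing the kind of inconsistency to look for.
toy = pd.DataFrame({
    "Weather.Condition": ["VMC", "UNK", "unknown", "IMC", "Unk"],
    "Make": ["Cessna", "CESSNA", " cessna ", "Piper", "PIPER"],
})

# Map the different spellings of "not known" onto a single label.
unknown_labels = {"unk", "unknown"}
weather = toy["Weather.Condition"].str.strip()
is_unknown = weather.str.lower().isin(unknown_labels)
toy["Weather.Condition"] = weather.where(~is_unknown, "Unknown")

# Normalise whitespace and casing so that one manufacturer is one category.
toy["Make"] = toy["Make"].str.strip().str.title()

print(toy["Weather.Condition"].value_counts().to_dict())
print(toy["Make"].value_counts().to_dict())
```

Title-casing is a blunt tool (it also rewrites names that are legitimately upper case), so the resulting categories should be reviewed before they are used for grouping.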

In [21]:
# confirming if there are still missing values in the data
aviation_data.isnull().sum().sort_values(ascending=False).head(10)

Engine.Type               7096
Investigation.Type           0
Accident.Number              0
Report.Status                0
Broad.phase.of.flight        0
Weather.Condition            0
Total.Uninjured              0
Total.Minor.Injuries         0
Total.Serious.Injuries       0
Total.Fatal.Injuries         0
dtype: int64

There are 7,096 missing values left in the `Engine.Type` column, because it was not included in either of the column lists filled above.
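
A column can slip through like this when the lists are typed by hand. One way to avoid it, sketched here on a made-up frame, is to derive the lists from the column dtypes with `select_dtypes`. The names `toy`, `text_cols` and `number_cols` are hypothetical, and the sketch assumes that every non-numeric column should receive the same placeholder.

```python
import numpy as np
import pandas as pd

# Made-up frame; only the column names echo the real dataset.
toy = pd.DataFrame({
    "Make": ["Cessna", None, "Piper"],
    "Engine.Type": ["Reciprocating", None, None],
    "Total.Fatal.Injuries": [2.0, np.nan, 0.0],
})

# Derive the column lists from the dtypes instead of typing them by hand.
number_cols = toy.select_dtypes(include="number").columns
text_cols = toy.select_dtypes(exclude="number").columns

toy[text_cols] = toy[text_cols].fillna("unknown")
toy[number_cols] = toy[number_cols].fillna(0)

# Total number of missing values left in the frame.
print(toy.isnull().sum().sum())
```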

In [22]:
# Handling the remaining missing values in the Engine.Type column.
# "Unknown" (capitalised) matches a label the column already uses.
aviation_data['Engine.Type'] = aviation_data['Engine.Type'].fillna("Unknown")
aviation_data


Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Registration.Number,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,unknown,unknown,Fatal(2),Destroyed,NC6404,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,unknown
20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,unknown,unknown,Fatal(4),Destroyed,N5069P,...,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,unknown,unknown,Fatal(3),Destroyed,N5142R,...,Reciprocating,Personal,3.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,26-02-2007
20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,unknown,unknown,Fatal(2),Destroyed,N1168J,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,unknown,unknown,Fatal(1),Destroyed,N15NY,...,Unknown,Personal,1.0,2.0,0.0,0.0,VMC,Approach,Probable Cause,16-04-1980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20221227106491,Accident,ERA23LA093,2022-12-26,"Annapolis, MD",United States,unknown,unknown,Minor,unknown,N1867H,...,Unknown,Personal,0.0,1.0,0.0,0.0,unknown,unknown,unknown,29-12-2022
20221227106494,Accident,ERA23LA095,2022-12-26,"Hampton, NH",United States,unknown,unknown,unknown,unknown,N2895Z,...,Unknown,unknown,0.0,0.0,0.0,0.0,unknown,unknown,unknown,unknown
20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,PAN,PAYSON,Non-Fatal,Substantial,N749PJ,...,Unknown,Personal,0.0,0.0,0.0,1.0,VMC,unknown,unknown,27-12-2022
20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,unknown,unknown,unknown,unknown,N210CU,...,Unknown,Personal,0.0,0.0,0.0,0.0,unknown,unknown,unknown,unknown


In [23]:
aviation_data.isnull().sum().sort_values(ascending=False)

Investigation.Type        0
Accident.Number           0
Report.Status             0
Broad.phase.of.flight     0
Weather.Condition         0
Total.Uninjured           0
Total.Minor.Injuries      0
Total.Serious.Injuries    0
Total.Fatal.Injuries      0
Purpose.of.flight         0
Engine.Type               0
Number.of.Engines         0
Amateur.Built             0
Model                     0
Make                      0
Registration.Number       0
Aircraft.damage           0
Injury.Severity           0
Airport.Name              0
Airport.Code              0
Country                   0
Location                  0
Event.Date                0
Publication.Date          0
dtype: int64

In [24]:
# Event.Id is the index, so keep it in the file (index=False would drop it)
aviation_data.to_csv("cleaned_data.csv")

In [33]:
aviation_data["Country"].value_counts()

Country
United States                       82248
Brazil                                374
Canada                                359
Mexico                                358
United Kingdom                        344
                                    ...  
Seychelles                              1
Palau                                   1
Libya                                   1
Saint Vincent and the Grenadines        1
Turks and Caicos Islands                1
Name: count, Length: 220, dtype: int64

Since the vast majority of the records (82,248 of 88,889, about 92.5%) are from the United States, I am going to concentrate on the United States only.

In [35]:
us_data = aviation_data[aviation_data["Country"] == "United States"].copy()  # copy so later edits do not touch aviation_data
us_data.head()

Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Registration.Number,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,unknown,unknown,Fatal(2),Destroyed,NC6404,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,unknown
20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,unknown,unknown,Fatal(4),Destroyed,N5069P,...,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,unknown,unknown,Fatal(3),Destroyed,N5142R,...,Reciprocating,Personal,3.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,26-02-2007
20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,unknown,unknown,Fatal(2),Destroyed,N1168J,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,unknown,unknown,Fatal(1),Destroyed,N15NY,...,Unknown,Personal,1.0,2.0,0.0,0.0,VMC,Approach,Probable Cause,16-04-1980


In [36]:
# checking for missing values and data types
us_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 82248 entries, 20001218X45444 to 20221230106513
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Investigation.Type      82248 non-null  object 
 1   Accident.Number         82248 non-null  object 
 2   Event.Date              82248 non-null  object 
 3   Location                82248 non-null  object 
 4   Country                 82248 non-null  object 
 5   Airport.Code            82248 non-null  object 
 6   Airport.Name            82248 non-null  object 
 7   Injury.Severity         82248 non-null  object 
 8   Aircraft.damage         82248 non-null  object 
 9   Registration.Number     82248 non-null  object 
 10  Make                    82248 non-null  object 
 11  Model                   82248 non-null  object 
 12  Amateur.Built           82248 non-null  object 
 13  Number.of.Engines       82248 non-null  float64
 14  Engine.Type          
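
The listing shows `Event.Date` stored as `object`, i.e. as text. If the analysis later needs years or trends over time, the date columns could be converted along the lines below. This is a sketch on two made-up rows: `toy` and the derived column `Event.Year` are hypothetical names, and the formats are inferred from the sample rows shown earlier (year-month-day for `Event.Date`, day-month-year for `Publication.Date`).

```python
import pandas as pd

# Two made-up rows in the same formats as the tables above.
toy = pd.DataFrame({
    "Event.Date": ["1948-10-24", "2022-12-26"],
    "Publication.Date": ["unknown", "29-12-2022"],
})

# Event.Date is written year-month-day.
toy["Event.Date"] = pd.to_datetime(toy["Event.Date"], format="%Y-%m-%d")

# Publication.Date is day-month-year; the "unknown" placeholder becomes NaT.
toy["Publication.Date"] = pd.to_datetime(
    toy["Publication.Date"], format="%d-%m-%Y", errors="coerce"
)

# With a datetime column, parts such as the year are directly available.
toy["Event.Year"] = toy["Event.Date"].dt.year
print(toy.dtypes)
```

Because `errors="coerce"` silently turns anything unparseable into `NaT`, it is worth counting the `NaT` values afterwards to confirm that only the placeholders were affected.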