# Project Overview

This project anylses the Aviation Accident dataset which contains accidents from 1962 and as recent as 2023. It contains over 80,000 records. This analysis can be used to see the safest airlines with the least accidents, fatalities occured and areas that can be improved to reduce such calamities.

# Business Understanding

We have been hired by Sky High Corp. They are interested in **purchasing and operating airplanes** for **commercial and private activities** and they want to know the **potential of risks involved in aviation**.

We have been tasked to find **which aircraft have the lowest risk** for the company to start with as they venture into this business venture.

In [296]:
# Importing the necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# To display all columns
pd.set_option('display.max_columns', 500)

# To ensure all visualizations stay within the notebook
%matplotlib inline

In [297]:
# Loading the dataset 
aviation_df = pd.read_csv("./data/AviationData.csv", encoding = 'latin-1',
                         dtype = {6: str, 7:str, 28: str})
aviation_df

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Aircraft.Category,Registration.Number,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,FAR.Description,Schedule,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1.0,Reciprocating,,,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1.0,Reciprocating,,,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,Fatal(1),Destroyed,,N15NY,Cessna,501,No,,,,,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88884,20221227106491,Accident,ERA23LA093,2022-12-26,"Annapolis, MD",United States,,,,,Minor,,,N1867H,PIPER,PA-28-151,No,,,091,,Personal,,0.0,1.0,0.0,0.0,,,,29-12-2022
88885,20221227106494,Accident,ERA23LA095,2022-12-26,"Hampton, NH",United States,,,,,,,,N2895Z,BELLANCA,7ECA,No,,,,,,,0.0,0.0,0.0,0.0,,,,
88886,20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,341525N,1112021W,PAN,PAYSON,Non-Fatal,Substantial,Airplane,N749PJ,AMERICAN CHAMPION AIRCRAFT,8GCBC,No,1.0,,091,,Personal,,0.0,0.0,0.0,1.0,VMC,,,27-12-2022
88887,20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,,,,,,,,N210CU,CESSNA,210N,No,,,091,,Personal,MC CESSNA 210N LLC,0.0,0.0,0.0,0.0,,,,


In [298]:
# Checking the percentage of null values in all columns
aviation_df.isna().mean()*100

Event.Id                   0.000000
Investigation.Type         0.000000
Accident.Number            0.000000
Event.Date                 0.000000
Location                   0.058500
Country                    0.254250
Latitude                  61.320298
Longitude                 61.330423
Airport.Code              43.469946
Airport.Name              40.611324
Injury.Severity            1.124999
Aircraft.damage            3.593246
Aircraft.Category         63.677170
Registration.Number        1.481623
Make                       0.070875
Model                      0.103500
Amateur.Built              0.114750
Number.of.Engines          6.844491
Engine.Type                7.961615
FAR.Description           63.974170
Schedule                  85.845268
Purpose.of.flight          6.965991
Air.carrier               81.271023
Total.Fatal.Injuries      12.826109
Total.Serious.Injuries    14.073732
Total.Minor.Injuries      13.424608
Total.Uninjured            6.650992
Weather.Condition          5

# Data Understanding

In [299]:
print(f"The Accident Aviation dataset contains {aviation_df.shape[0]} rows and {aviation_df.shape[1]} columns")

The Accident Aviation dataset contains 88889 rows and 31 columns


In [300]:
aviation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50249 non-null  object 
 9   Airport.Name            52790 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87572 non-null  object 
 14  Make                    88826 non-null

In [301]:
# Checking the percentage of null values in all columns
aviation_df.isna().mean()*100

Event.Id                   0.000000
Investigation.Type         0.000000
Accident.Number            0.000000
Event.Date                 0.000000
Location                   0.058500
Country                    0.254250
Latitude                  61.320298
Longitude                 61.330423
Airport.Code              43.469946
Airport.Name              40.611324
Injury.Severity            1.124999
Aircraft.damage            3.593246
Aircraft.Category         63.677170
Registration.Number        1.481623
Make                       0.070875
Model                      0.103500
Amateur.Built              0.114750
Number.of.Engines          6.844491
Engine.Type                7.961615
FAR.Description           63.974170
Schedule                  85.845268
Purpose.of.flight          6.965991
Air.carrier               81.271023
Total.Fatal.Injuries      12.826109
Total.Serious.Injuries    14.073732
Total.Minor.Injuries      13.424608
Total.Uninjured            6.650992
Weather.Condition          5

# Data Preparation

In [302]:
# Converting all column names to lower and replacing dots with underscores
aviation_df.columns = aviation_df.columns.str.lower().str.replace('.', "_", regex = False)
aviation_df.columns

Index(['event_id', 'investigation_type', 'accident_number', 'event_date',
       'location', 'country', 'latitude', 'longitude', 'airport_code',
       'airport_name', 'injury_severity', 'aircraft_damage',
       'aircraft_category', 'registration_number', 'make', 'model',
       'amateur_built', 'number_of_engines', 'engine_type', 'far_description',
       'schedule', 'purpose_of_flight', 'air_carrier', 'total_fatal_injuries',
       'total_serious_injuries', 'total_minor_injuries', 'total_uninjured',
       'weather_condition', 'broad_phase_of_flight', 'report_status',
       'publication_date'],
      dtype='object')

In [303]:
# To get only accidents that happen in US and US territories
us_territories = ["United States",'American Samoa','Guam',"Marshall Islands","Micronesia",
                  "Northern Marianas","Palau","Puerto Rico","Virgin Islands","Washington_DC",
                  "Gulf of mexico","Atlantic ocean","Pacific ocean"]
us_accidents_df = aviation_df[aviation_df['country'].isin(us_territories)]

In [304]:
us_accidents_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82372 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   event_id                82372 non-null  object 
 1   investigation_type      82372 non-null  object 
 2   accident_number         82372 non-null  object 
 3   event_date              82372 non-null  object 
 4   location                82359 non-null  object 
 5   country                 82372 non-null  object 
 6   latitude                32285 non-null  object 
 7   longitude               32275 non-null  object 
 8   airport_code            49247 non-null  object 
 9   airport_name            51714 non-null  object 
 10  injury_severity         82252 non-null  object 
 11  aircraft_damage         80370 non-null  object 
 12  aircraft_category       28194 non-null  object 
 13  registration_number     82321 non-null  object 
 14  make                    82351 non-null

In [305]:
us_accidents_df.isna().mean()*100

event_id                   0.000000
investigation_type         0.000000
accident_number            0.000000
event_date                 0.000000
location                   0.015782
country                    0.000000
latitude                  60.805856
longitude                 60.817996
airport_code              40.213908
airport_name              37.218958
injury_severity            0.145681
aircraft_damage            2.430438
aircraft_category         65.772350
registration_number        0.061914
make                       0.025494
model                      0.046132
amateur_built              0.025494
number_of_engines          2.321177
engine_type                3.717283
far_description           65.686155
schedule                  87.449619
purpose_of_flight          3.001020
air_carrier               82.360511
total_fatal_injuries      12.937649
total_serious_injuries    13.811732
total_minor_injuries      13.028699
total_uninjured            6.076094
weather_condition          0

In [306]:
us_accidents_df.head(10)

Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,latitude,longitude,airport_code,airport_name,injury_severity,aircraft_damage,aircraft_category,registration_number,make,model,amateur_built,number_of_engines,engine_type,far_description,schedule,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1.0,Reciprocating,,,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1.0,Reciprocating,,,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,Fatal(1),Destroyed,,N15NY,Cessna,501,No,,,,,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
5,20170710X52551,Accident,NYC79AA106,1979-09-17,"BOSTON, MA",United States,42.445277,-70.758333,,,Non-Fatal,Substantial,Airplane,CF-TLU,Mcdonnell Douglas,DC9,No,2.0,Turbo Fan,Part 129: Foreign,SCHD,,Air Canada,,,1.0,44.0,VMC,Climb,Probable Cause,19-09-2017
6,20001218X45446,Accident,CHI81LA106,1981-08-01,"COTTON, MN",United States,,,,,Fatal(4),Destroyed,,N4988E,Cessna,180,No,1.0,Reciprocating,,,Personal,,4.0,0.0,0.0,0.0,IMC,Unknown,Probable Cause,06-11-2001
7,20020909X01562,Accident,SEA82DA022,1982-01-01,"PULLMAN, WA",United States,,,,BLACKBURN AG STRIP,Non-Fatal,Substantial,Airplane,N2482N,Cessna,140,No,1.0,Reciprocating,Part 91: General Aviation,,Personal,,0.0,0.0,0.0,2.0,VMC,Takeoff,Probable Cause,01-01-1982
8,20020909X01561,Accident,NYC82DA015,1982-01-01,"EAST HANOVER, NJ",United States,,,N58,HANOVER,Non-Fatal,Substantial,Airplane,N7967Q,Cessna,401B,No,2.0,Reciprocating,Part 91: General Aviation,,Business,,0.0,0.0,0.0,2.0,IMC,Landing,Probable Cause,01-01-1982
9,20020909X01560,Accident,MIA82DA029,1982-01-01,"JACKSONVILLE, FL",United States,,,JAX,JACKSONVILLE INTL,Non-Fatal,Substantial,,N3906K,North American,NAVION L-17B,No,1.0,Reciprocating,,,Personal,,0.0,0.0,3.0,0.0,IMC,Cruise,Probable Cause,01-01-1982


In [307]:
# Dropping the Latitude, Longitude and Schedule and far_description columns
us_accidents = us_accidents_df.copy()
us_accidents.drop(['latitude',
                   'longitude', 
                   'schedule',
                   'far_description',
                   'airport_code',
                   'airport_name'], axis = 1, inplace = True)

In [308]:
# Filling the null values in air_carrier column with Unknown
us_accidents['air_carrier'] = us_accidents['air_carrier'].fillna('Unknown')

In [309]:
us_accidents.isna().mean()*100

event_id                   0.000000
investigation_type         0.000000
accident_number            0.000000
event_date                 0.000000
location                   0.015782
country                    0.000000
injury_severity            0.145681
aircraft_damage            2.430438
aircraft_category         65.772350
registration_number        0.061914
make                       0.025494
model                      0.046132
amateur_built              0.025494
number_of_engines          2.321177
engine_type                3.717283
purpose_of_flight          3.001020
air_carrier                0.000000
total_fatal_injuries      12.937649
total_serious_injuries    13.811732
total_minor_injuries      13.028699
total_uninjured            6.076094
weather_condition          0.832807
broad_phase_of_flight     25.656777
report_status              3.221969
publication_date          15.436071
dtype: float64

In [310]:
# Filling null values with 0 and changing data type to 0
# 0 becomes a placeholder
us_accidents['number_of_engines'] = us_accidents['number_of_engines'].fillna(0).astype(int)
us_accidents['total_fatal_injuries'] = us_accidents['total_fatal_injuries'].fillna(0).astype(int)
us_accidents['total_serious_injuries'] = us_accidents['total_serious_injuries'].fillna(0).astype(int)
us_accidents['total_minor_injuries'] = us_accidents['total_minor_injuries'].fillna(0).astype(int)
us_accidents['total_uninjured'] = us_accidents['total_uninjured'].fillna(0).astype(int)

# Converting to datetime format and to dd-mm-yyyy
us_accidents['event_date'] = pd.to_datetime(us_accidents['event_date'], format='%Y-%m-%d').dt.strftime('%d-%m-%Y')
us_accidents['publication_date'] = pd.to_datetime(us_accidents['publication_date'], format='%d-%m-%Y')

In [311]:
us_accidents

Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,injury_severity,aircraft_damage,aircraft_category,registration_number,make,model,amateur_built,number_of_engines,engine_type,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date
0,20001218X45444,Accident,SEA87LA080,24-10-1948,"MOOSE CREEK, ID",United States,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1,Reciprocating,Personal,Unknown,2,0,0,0,UNK,Cruise,Probable Cause,NaT
1,20001218X45447,Accident,LAX94LA336,19-07-1962,"BRIDGEPORT, CA",United States,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1,Reciprocating,Personal,Unknown,4,0,0,0,UNK,Unknown,Probable Cause,1996-09-19
2,20061025X01555,Accident,NYC07LA005,30-08-1974,"Saltville, VA",United States,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1,Reciprocating,Personal,Unknown,3,0,0,0,IMC,Cruise,Probable Cause,2007-02-26
3,20001218X45448,Accident,LAX96LA321,19-06-1977,"EUREKA, CA",United States,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1,Reciprocating,Personal,Unknown,2,0,0,0,IMC,Cruise,Probable Cause,2000-09-12
4,20041105X01764,Accident,CHI79FA064,02-08-1979,"Canton, OH",United States,Fatal(1),Destroyed,,N15NY,Cessna,501,No,0,,Personal,Unknown,1,2,0,0,VMC,Approach,Probable Cause,1980-04-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88884,20221227106491,Accident,ERA23LA093,26-12-2022,"Annapolis, MD",United States,Minor,,,N1867H,PIPER,PA-28-151,No,0,,Personal,Unknown,0,1,0,0,,,,2022-12-29
88885,20221227106494,Accident,ERA23LA095,26-12-2022,"Hampton, NH",United States,,,,N2895Z,BELLANCA,7ECA,No,0,,,Unknown,0,0,0,0,,,,NaT
88886,20221227106497,Accident,WPR23LA075,26-12-2022,"Payson, AZ",United States,Non-Fatal,Substantial,Airplane,N749PJ,AMERICAN CHAMPION AIRCRAFT,8GCBC,No,1,,Personal,Unknown,0,0,0,1,VMC,,,2022-12-27
88887,20221227106498,Accident,WPR23LA076,26-12-2022,"Morgan, UT",United States,,,,N210CU,CESSNA,210N,No,0,,Personal,MC CESSNA 210N LLC,0,0,0,0,,,,NaT


In [312]:
us_accidents.isna().mean()*100

event_id                   0.000000
investigation_type         0.000000
accident_number            0.000000
event_date                 0.000000
location                   0.015782
country                    0.000000
injury_severity            0.145681
aircraft_damage            2.430438
aircraft_category         65.772350
registration_number        0.061914
make                       0.025494
model                      0.046132
amateur_built              0.025494
number_of_engines          0.000000
engine_type                3.717283
purpose_of_flight          3.001020
air_carrier                0.000000
total_fatal_injuries       0.000000
total_serious_injuries     0.000000
total_minor_injuries       0.000000
total_uninjured            0.000000
weather_condition          0.832807
broad_phase_of_flight     25.656777
report_status              3.221969
publication_date          15.436071
dtype: float64

In [313]:
us_accidents[us_accidents.duplicated(subset = 'event_id', keep = False)].head(50)

Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,injury_severity,aircraft_damage,aircraft_category,registration_number,make,model,amateur_built,number_of_engines,engine_type,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date
117,20020917X01908,Accident,DCA82AA012B,19-01-1982,"ROCKPORT, TX",United States,Fatal(3),Destroyed,Airplane,N26660,Grumman,AA5A,No,1,Reciprocating,Personal,Unknown,3,0,0,0,IMC,Approach,Probable Cause,1983-01-19
118,20020917X01908,Accident,DCA82AA012A,19-01-1982,"ROCKPORT, TX",United States,Fatal(3),Destroyed,Airplane,N336SA,Swearingen,SA226-T(B),No,2,Turbo Prop,Executive/corporate,Unknown,3,0,0,0,IMC,Approach,Probable Cause,1983-01-19
153,20020917X02259,Accident,LAX82FA049A,23-01-1982,"VICTORVILLE, CA",United States,Fatal(2),Destroyed,Airplane,N7860V,Mooney,M20C,No,1,Reciprocating,Personal,Unknown,2,0,4,0,VMC,Unknown,Probable Cause,1983-01-23
158,20020917X02400,Accident,MIA82FA038B,23-01-1982,"NEWPORT RICHEY, FL",United States,Non-Fatal,Substantial,Airplane,N45453,Cessna,150M,No,1,Reciprocating,Personal,Unknown,0,0,0,3,VMC,Cruise,Probable Cause,1983-01-23
159,20020917X02400,Accident,MIA82FA038A,23-01-1982,"NEWPORT RICHEY, FL",United States,Non-Fatal,Substantial,Airplane,N32555,Piper,PA-34-200T,No,2,Reciprocating,Personal,Unknown,0,0,0,3,VMC,Approach,Probable Cause,1983-01-23
160,20020917X02259,Accident,LAX82FA049B,23-01-1982,"VICTORVILLE, CA",United States,Fatal(2),Substantial,Airplane,N32380,Piper,PA-28-235,No,1,Reciprocating,Personal,Unknown,2,0,4,0,VMC,Cruise,Probable Cause,1983-01-23
242,20020917X02585,Accident,SEA82DA028A,06-02-1982,"MEDFORD, OR",United States,Non-Fatal,Minor,Airplane,N56270,Boeing,A75N1,No,1,Reciprocating,Aerial Application,Unknown,0,0,0,3,VMC,Taxi,Probable Cause,1983-02-06
244,20020917X02173,Accident,LAX82DA065B,06-02-1982,"SAN JOSE, CA",United States,Non-Fatal,Minor,Airplane,N71681,Bellanca,7KCAB,No,1,Reciprocating,Personal,Unknown,0,0,0,3,VMC,Standing,Probable Cause,1983-02-06
245,20020917X02585,Accident,SEA82DA028B,06-02-1982,"MEDFORD, OR",United States,Non-Fatal,Substantial,Airplane,N95078,Taylorcraft,BC12-D,No,1,Reciprocating,Personal,Unknown,0,0,0,3,VMC,Taxi,Probable Cause,1983-02-06
248,20020917X02173,Accident,LAX82DA065A,06-02-1982,"SAN JOSE, CA",United States,Non-Fatal,Substantial,Airplane,N3343D,Cessna,180,No,1,Reciprocating,Personal,Unknown,0,0,0,3,VMC,Taxi,Probable Cause,1983-02-06


I was trying to check for duplicates in the `event_id` column to drop them. However, upon investigation, it was discovered that in those cases, two aircrafts were involved in the accident. They were both logged in one event_id but different accident number 

In [314]:
us_accidents[us_accidents['aircraft_category'].isna()].head(20)

Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,injury_severity,aircraft_damage,aircraft_category,registration_number,make,model,amateur_built,number_of_engines,engine_type,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date
0,20001218X45444,Accident,SEA87LA080,24-10-1948,"MOOSE CREEK, ID",United States,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1,Reciprocating,Personal,Unknown,2,0,0,0,UNK,Cruise,Probable Cause,NaT
1,20001218X45447,Accident,LAX94LA336,19-07-1962,"BRIDGEPORT, CA",United States,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1,Reciprocating,Personal,Unknown,4,0,0,0,UNK,Unknown,Probable Cause,1996-09-19
2,20061025X01555,Accident,NYC07LA005,30-08-1974,"Saltville, VA",United States,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1,Reciprocating,Personal,Unknown,3,0,0,0,IMC,Cruise,Probable Cause,2007-02-26
3,20001218X45448,Accident,LAX96LA321,19-06-1977,"EUREKA, CA",United States,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1,Reciprocating,Personal,Unknown,2,0,0,0,IMC,Cruise,Probable Cause,2000-09-12
4,20041105X01764,Accident,CHI79FA064,02-08-1979,"Canton, OH",United States,Fatal(1),Destroyed,,N15NY,Cessna,501,No,0,,Personal,Unknown,1,2,0,0,VMC,Approach,Probable Cause,1980-04-16
6,20001218X45446,Accident,CHI81LA106,01-08-1981,"COTTON, MN",United States,Fatal(4),Destroyed,,N4988E,Cessna,180,No,1,Reciprocating,Personal,Unknown,4,0,0,0,IMC,Unknown,Probable Cause,2001-11-06
9,20020909X01560,Accident,MIA82DA029,01-01-1982,"JACKSONVILLE, FL",United States,Non-Fatal,Substantial,,N3906K,North American,NAVION L-17B,No,1,Reciprocating,Personal,Unknown,0,0,3,0,IMC,Cruise,Probable Cause,1982-01-01
10,20020909X01559,Accident,FTW82DA034,01-01-1982,"HOBBS, NM",United States,Non-Fatal,Substantial,,N44832,Piper,PA-28-161,No,1,Reciprocating,Personal,Unknown,0,0,0,1,VMC,Approach,Probable Cause,1982-01-01
11,20020909X01558,Accident,ATL82DKJ10,01-01-1982,"TUSKEGEE, AL",United States,Non-Fatal,Substantial,,N4275S,Beech,V35B,No,1,Reciprocating,Personal,Unknown,0,0,0,1,VMC,Landing,Probable Cause,1982-01-01
84,20020917X01907,Accident,DCA82AA011,13-01-1982,"WASHINGTON, DC",United States,Fatal(78),Destroyed,,N62AF,Boeing,737-222,No,2,Turbo Fan,Unknown,"Air Florida, Inc",78,6,3,0,IMC,Takeoff,Probable Cause,1983-01-13


The client only wanted airplanes. The logical method would be to drop all rows whose accidents aren't labeled "Airplane" under the `aircraft_category` column. However, 65% of the column are null values and a couple of them are planes. The null values will thus count.

In [315]:
new_cols = us_accidents['location'].str.rsplit(',',n = 1, expand = True)
us_accidents['area'] = new_cols[0]
us_accidents['state_short_code'] = new_cols[1].str.strip()

In [316]:
# pd.set_option('display.max_rows', None)
us_accidents['state_short_code'].value_counts()

CA                                  8857
TX                                  5913
FL                                  5825
AK                                  5672
AZ                                  2834
                                    ... 
Micronesia (Federated States of)       2
Marshall Islands                       1
MARSHALL ISLANDS                       1
Palau                                  1
CB                                     1
Name: state_short_code, Length: 68, dtype: int64

In [317]:
# Renaming the short codes accordingly
us_accidents['state_short_code'] = us_accidents['state_short_code'].replace(["Virgin Islands (British)", 'CB'], 'VI')
us_accidents['state_short_code'] = us_accidents['state_short_code'].replace(["American Samoa","AMERICAN SAMOA"], 'AS')
us_accidents['state_short_code'] = us_accidents['state_short_code'].replace("Micronesia (Federated States of)", 'FM')
us_accidents['state_short_code'] = us_accidents['state_short_code'].replace(["Marshall Islands","MARSHALL ISLANDS"], 'MH')
us_accidents['state_short_code'] = us_accidents['state_short_code'].replace("Palau", 'PW')

# All Empty Values replaced with UN for Unknown
us_accidents['state_short_code'] = us_accidents['state_short_code'].replace("", 'UN')
us_accidents['state_short_code'] = us_accidents['state_short_code'].fillna('UN')

In [318]:
# pd.set_option('display.max_rows', None)
us_accidents['state_short_code'].value_counts()

CA    8857
TX    5913
FL    5825
AK    5672
AZ    2834
      ... 
GU       8
AS       4
MH       2
FM       2
PW       1
Name: state_short_code, Length: 63, dtype: int64

All good

In [319]:
us_accidents.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82372 entries, 0 to 88888
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   event_id                82372 non-null  object        
 1   investigation_type      82372 non-null  object        
 2   accident_number         82372 non-null  object        
 3   event_date              82372 non-null  object        
 4   location                82359 non-null  object        
 5   country                 82372 non-null  object        
 6   injury_severity         82252 non-null  object        
 7   aircraft_damage         80370 non-null  object        
 8   aircraft_category       28194 non-null  object        
 9   registration_number     82321 non-null  object        
 10  make                    82351 non-null  object        
 11  model                   82334 non-null  object        
 12  amateur_built           82351 non-null  object

In [320]:
us_accidents.isna().mean()*100

event_id                   0.000000
investigation_type         0.000000
accident_number            0.000000
event_date                 0.000000
location                   0.015782
country                    0.000000
injury_severity            0.145681
aircraft_damage            2.430438
aircraft_category         65.772350
registration_number        0.061914
make                       0.025494
model                      0.046132
amateur_built              0.025494
number_of_engines          0.000000
engine_type                3.717283
purpose_of_flight          3.001020
air_carrier                0.000000
total_fatal_injuries       0.000000
total_serious_injuries     0.000000
total_minor_injuries       0.000000
total_uninjured            0.000000
weather_condition          0.832807
broad_phase_of_flight     25.656777
report_status              3.221969
publication_date          15.436071
area                       0.015782
state_short_code           0.000000
dtype: float64

In [321]:
us_accidents[us_accidents['aircraft_category'].isna()]

Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,injury_severity,aircraft_damage,aircraft_category,registration_number,make,model,amateur_built,number_of_engines,engine_type,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date,area,state_short_code
0,20001218X45444,Accident,SEA87LA080,24-10-1948,"MOOSE CREEK, ID",United States,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1,Reciprocating,Personal,Unknown,2,0,0,0,UNK,Cruise,Probable Cause,NaT,MOOSE CREEK,ID
1,20001218X45447,Accident,LAX94LA336,19-07-1962,"BRIDGEPORT, CA",United States,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1,Reciprocating,Personal,Unknown,4,0,0,0,UNK,Unknown,Probable Cause,1996-09-19,BRIDGEPORT,CA
2,20061025X01555,Accident,NYC07LA005,30-08-1974,"Saltville, VA",United States,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1,Reciprocating,Personal,Unknown,3,0,0,0,IMC,Cruise,Probable Cause,2007-02-26,Saltville,VA
3,20001218X45448,Accident,LAX96LA321,19-06-1977,"EUREKA, CA",United States,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1,Reciprocating,Personal,Unknown,2,0,0,0,IMC,Cruise,Probable Cause,2000-09-12,EUREKA,CA
4,20041105X01764,Accident,CHI79FA064,02-08-1979,"Canton, OH",United States,Fatal(1),Destroyed,,N15NY,Cessna,501,No,0,,Personal,Unknown,1,2,0,0,VMC,Approach,Probable Cause,1980-04-16,Canton,OH
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88882,20221222106486,Accident,CEN23LA068,21-12-2022,"Reserve, LA",United States,Minor,,,N321GD,GRUMMAN AMERICAN AVN. CORP.,AA-5B,No,0,,Instructional,Unknown,0,1,0,1,,,,2022-12-27,Reserve,LA
88884,20221227106491,Accident,ERA23LA093,26-12-2022,"Annapolis, MD",United States,Minor,,,N1867H,PIPER,PA-28-151,No,0,,Personal,Unknown,0,1,0,0,,,,2022-12-29,Annapolis,MD
88885,20221227106494,Accident,ERA23LA095,26-12-2022,"Hampton, NH",United States,,,,N2895Z,BELLANCA,7ECA,No,0,,,Unknown,0,0,0,0,,,,NaT,Hampton,NH
88887,20221227106498,Accident,WPR23LA076,26-12-2022,"Morgan, UT",United States,,,,N210CU,CESSNA,210N,No,0,,Personal,MC CESSNA 210N LLC,0,0,0,0,,,,NaT,Morgan,UT


In [322]:
us_accidents['aircraft_category'].value_counts()

Airplane             24256
Helicopter            2735
Glider                 503
Balloon                229
Gyrocraft              172
Weight-Shift           161
Powered Parachute       90
Ultralight              25
WSFT                     9
Unknown                  4
Blimp                    4
Powered-Lift             3
Rocket                   1
ULTR                     1
UNK                      1
Name: aircraft_category, dtype: int64

In [323]:
us_accidents['make'].value_counts()

Cessna                21597
Piper                 11670
CESSNA                 4287
Beech                  4168
PIPER                  2509
                      ...  
Hallett                   1
Steven R. Jackson         1
Weste                     1
Arthur P. Matthews        1
ROYSE RALPH L             1
Name: make, Length: 8003, dtype: int64

In [324]:
us_accidents['make'] = us_accidents['make'].str.title()

In [325]:
#pd.set_option('display.max_rows', None)
us_accidents['make'].value_counts().head(56)

Cessna                            25884
Piper                             14179
Beech                              5061
Bell                               2296
Boeing                             1496
Mooney                             1294
Grumman                            1142
Bellanca                           1040
Robinson                            926
Hughes                              874
Schweizer                           745
Air Tractor                         645
Aeronca                             635
Maule                               577
Champion                            514
Mcdonnell Douglas                   467
Stinson                             439
Luscombe                            413
Aero Commander                      397
De Havilland                        386
Taylorcraft                         382
North American                      374
Aerospatiale                        351
Hiller                              345
Rockwell                            337


In [326]:
us_accidents.loc[us_accidents['make'] == 'Cessna', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Piper', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Beech', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Mooney', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'De Havilland', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Bell', 'aircraft_category'] = 'Helicopter'
us_accidents.loc[us_accidents['make'] == 'Hughes', 'aircraft_category'] = 'Helicopter'
us_accidents.loc[us_accidents['make'] == 'Bellanca', 'aircraft_category'] = 'Airplane'


us_accidents['make'] = us_accidents['make'].replace(['Robinson','Robinson Helicopter','Robinson Helicopter Company'], "Robinson Helicopter Company")
us_accidents.loc[us_accidents['make'] == 'Robinson Helicopter Company', 'aircraft_category'] = 'Helicopter'

us_accidents['make'] = us_accidents['make'].replace(['Grumman','Grumman American'], "Northrop Grumman")
us_accidents.loc[us_accidents['make'] == 'Northrop Grumman', 'aircraft_category'] = 'Airplane'

In [327]:
us_accidents['aircraft_category'].isna().mean()*100

20.293303549749915

In [328]:
us_accidents['make'].value_counts().head(56)

Cessna                            25884
Piper                             14179
Beech                              5061
Bell                               2296
Boeing                             1496
Northrop Grumman                   1366
Robinson Helicopter Company        1335
Mooney                             1294
Bellanca                           1040
Hughes                              874
Schweizer                           745
Air Tractor                         645
Aeronca                             635
Maule                               577
Champion                            514
Mcdonnell Douglas                   467
Stinson                             439
Luscombe                            413
Aero Commander                      397
De Havilland                        386
Taylorcraft                         382
North American                      374
Aerospatiale                        351
Hiller                              345
Rockwell                            337


In [329]:
subsets = us_accidents[(us_accidents['make'] == 'Schweizer') & 
                       (us_accidents['aircraft_category'].isna()) &
                       (us_accidents['aircraft_category'] == ())]
subsets

Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,injury_severity,aircraft_damage,aircraft_category,registration_number,make,model,amateur_built,number_of_engines,engine_type,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date,area,state_short_code
