# Aircraft Safety Analysis


**Author:** Noah Meakins
***

# Overview

This in-depth analysis is intended to guide our company's foray into the commercial and private aviation sectors. By identifying historical safety risks and the most common causes of serious accidents, we aim to implement best practices and safety standards that surpass industry norms. This proactive approach to safety and risk analysis will be foundational in establishing our company as a responsible and trustworthy player in the aviation industry.

### Business Problem

My company is expanding into new industries to diversify its portfolio. Specifically, it is interested in purchasing and operating airplanes for commercial and private enterprises but does not yet understand the potential risks of aircraft. I have been tasked with determining which aircraft pose the lowest risk for the company as it starts this new business endeavor. Using the attached dataset, I will provide insights that the head of the new aviation division can use to decide which aircraft to purchase. The fields I will analyze most closely are Country, Location, Make, Model, Number of Engines, Engine Type, Weather Conditions, Injury Severity, and Aircraft Damage. The questions guiding this analysis are:

1. For the columns I need in my analysis that have missing data, what methods should I use to handle the missing values?
2. Which charts will best present my analysis visually?
3. How will I determine the safest and most profitable aircraft to invest in? 

These questions are important from a business perspective because if the data in my analysis is skewed due to missing data or data that doesn’t pertain to this analysis, it can negatively influence business decisions.  
***

### Data Understanding

- The data for this project was sourced from the National Transportation Safety Board and covers civil aviation accidents and selected incidents in the United States and international waters from 1962 to 2023. It is relevant to our analysis because we aim to understand various aircraft models and the levels of safety associated with them.

- Each record in the dataset describes an individual aircraft accident or incident. Key variables include Country, Location, Make, Model, Number of Engines, Engine Type, Weather Conditions, Injury Severity, and Aircraft Damage.

- For this analysis, in line with our company's focus on safety and risk assessment, the primary target variable is 'total fatal injuries'. Analyzing this variable will help us assess the severity of accidents and understand the safety challenges inherent in the aviation industry, especially pertinent to commercial and private flight operations.
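
As a rough illustration of how the target variable could be summarized, the sketch below counts accidents and sums fatal injuries per make and model. The rows are invented for illustration, and the column names are assumed to match the raw NTSB headers (`Make`, `Model`, `Total.Fatal.Injuries`); it is a sketch under those assumptions, not the analysis itself.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the NTSB dataset.
sample = pd.DataFrame({
    "Make": ["Cessna", "Cessna", "Bell", "Bell"],
    "Model": ["501", "501", "206L-1", "206L-1"],
    "Total.Fatal.Injuries": [1.0, None, 0.0, 2.0],
})

# Count records and sum fatalities per make/model.
# size counts every row in the group; sum skips missing fatality values.
fatal_summary = (
    sample.groupby(["Make", "Model"])["Total.Fatal.Injuries"]
    .agg(accidents="size", fatalities="sum")
    .reset_index()
)
print(fatal_summary)
```

Keeping the record count next to the fatality total matters because a model with few recorded accidents can look safe simply because few of them are in the data.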


In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
pd.set_option('display.max_columns', 200) # I wanted to see all the columns in the dataset when I printed it out.
pd.set_option('display.max_rows', None) # I wanted to see all the rows in the dataset when I printed it out.
%matplotlib inline

In [2]:
aviation_data = pd.read_csv('/Users/unit66/Downloads/AviationData.csv', encoding='latin-1')



In [3]:
aviation_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50249 non-null  object 
 9   Airport.Name            52790 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87572 non-null  object 
 14  Make                    88826 non-null

In [4]:
aviation_data.shape

(88889, 31)

In [5]:
aviation_data.columns

Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code',
       'Airport.Name', 'Injury.Severity', 'Aircraft.damage',
       'Aircraft.Category', 'Registration.Number', 'Make', 'Model',
       'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',
       'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',
       'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',
       'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',
       'Publication.Date'],
      dtype='object')

In [6]:
aviation_data.dtypes

Event.Id                   object
Investigation.Type         object
Accident.Number            object
Event.Date                 object
Location                   object
Country                    object
Latitude                   object
Longitude                  object
Airport.Code               object
Airport.Name               object
Injury.Severity            object
Aircraft.damage            object
Aircraft.Category          object
Registration.Number        object
Make                       object
Model                      object
Amateur.Built              object
Number.of.Engines         float64
Engine.Type                object
FAR.Description            object
Schedule                   object
Purpose.of.flight          object
Air.carrier                object
Total.Fatal.Injuries      float64
Total.Serious.Injuries    float64
Total.Minor.Injuries      float64
Total.Uninjured           float64
Weather.Condition          object
Broad.phase.of.flight      object
Report.Status 

### Aviation Data 
The Aviation dataset consists of records from 1962 to 2023, and contains a large amount of information on Event Date, Location, Aircraft Damage, Injury Severity, and Total Fatal Injuries.

In [7]:
aviation_data.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Aircraft.Category,Registration.Number,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,FAR.Description,Schedule,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1.0,Reciprocating,,,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.9222,-81.8781,,,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1.0,Reciprocating,,,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,Fatal(1),Destroyed,,N15NY,Cessna,501,No,,,,,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [8]:
aviation_data.describe()

Unnamed: 0,Number.of.Engines,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured
count,82805.0,77488.0,76379.0,76956.0,82977.0
mean,1.146585,0.647855,0.279881,0.357061,5.32544
std,0.44651,5.48596,1.544084,2.235625,27.913634
min,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,0.0,0.0,1.0
75%,1.0,0.0,0.0,0.0,2.0
max,8.0,349.0,161.0,380.0,699.0


## Data Preparation

In preparation for this analysis, I will normalize the column names and drop any columns we won't need, which will make the data easier to analyze.


In [9]:
# Creating a function that takes a column name and returns a normalized version of it
def normalize_column_name(name):
    return name.strip().lower().replace(' ', '_').replace('.','_').replace('-','_')

In [10]:
# List comprehension to apply the function to all the column names
aviation_data.columns = [normalize_column_name(col) for col in aviation_data.columns]
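
To make the effect of the normalization concrete, here is a small standalone sketch that applies the same function to a few of the raw column names. The vectorized variant using the `Index.str` accessor is an alternative I am assuming behaves identically for these names; it is not part of the original notebook.

```python
import pandas as pd

def normalize_column_name(name):
    # Same logic as the notebook's function, repeated so this sketch runs on its own.
    return name.strip().lower().replace(' ', '_').replace('.', '_').replace('-', '_')

raw_names = ['Event.Id', 'Aircraft.damage', 'Broad.phase.of.flight']
normalized = [normalize_column_name(col) for col in raw_names]
print(normalized)

# An equivalent vectorized form: strip, lowercase, then replace spaces, dots, and hyphens.
vectorized = list(
    pd.Index(raw_names).str.strip().str.lower().str.replace(r'[ .\-]', '_', regex=True)
)
print(vectorized)
```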

In [11]:
# Verifying that the column names have been normalized
aviation_data.head()

Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,latitude,longitude,airport_code,airport_name,injury_severity,aircraft_damage,aircraft_category,registration_number,make,model,amateur_built,number_of_engines,engine_type,far_description,schedule,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1.0,Reciprocating,,,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.9222,-81.8781,,,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1.0,Reciprocating,,,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,Fatal(1),Destroyed,,N15NY,Cessna,501,No,,,,,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [12]:
# Dropping columns that won't be needed for the analysis
aviation_data.drop(columns = ['latitude', 'longitude', 'aircraft_category', 'far_description', 'schedule', 'air_carrier', 'airport_code', 'airport_name', 'event_id', 'accident_number', 'registration_number', 'report_status', 'publication_date'], inplace = True)

In [13]:
missing_values = aviation_data.isna().sum()
print(missing_values)

investigation_type            0
event_date                    0
location                     52
country                     226
injury_severity            1000
aircraft_damage            3194
make                         63
model                        92
amateur_built               102
number_of_engines          6084
engine_type                7077
purpose_of_flight          6192
total_fatal_injuries      11401
total_serious_injuries    12510
total_minor_injuries      11933
total_uninjured            5912
weather_condition          4492
broad_phase_of_flight     27165
dtype: int64


In [14]:
# Checking how much data is missing in each column as a percentage
missing_percent = aviation_data.isnull().sum() / len(aviation_data) * 100
print(missing_percent)

investigation_type         0.000000
event_date                 0.000000
location                   0.058500
country                    0.254250
injury_severity            1.124999
aircraft_damage            3.593246
make                       0.070875
model                      0.103500
amateur_built              0.114750
number_of_engines          6.844491
engine_type                7.961615
purpose_of_flight          6.965991
total_fatal_injuries      12.826109
total_serious_injuries    14.073732
total_minor_injuries      13.424608
total_uninjured            6.650992
weather_condition          5.053494
broad_phase_of_flight     30.560587
dtype: float64
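
The percentage calculation above can also be written with `isna().mean()`, since the mean of a boolean mask is the fraction of `True` values. The sketch below checks that the two forms agree on a small made-up frame and sorts the result so the most incomplete columns appear first. The toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only.
toy = pd.DataFrame({
    "make": ["Cessna", "Bell", None, "Piper"],
    "engine_type": ["Turbo Fan", None, None, "Turbo Prop"],
})

# Form used in the notebook: missing count divided by the number of rows.
pct_via_sum = toy.isnull().sum() / len(toy) * 100

# Equivalent form: the mean of the boolean mask is the missing fraction.
pct_via_mean = toy.isna().mean() * 100

# Sorting puts the columns with the most missing data at the top.
ranked = pct_via_mean.sort_values(ascending=False)
print(ranked)
```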


In [15]:
# Filtering out any rows that contain aircraft that are amateur built
aviation_data = aviation_data[aviation_data['amateur_built'] != 'Yes']

In [16]:
# Filtering out any rows that contain reciprocating-engine (piston) aircraft
aviation_data = aviation_data[aviation_data['engine_type'] != 'Reciprocating']
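
One detail of the `!=` filters above is worth spelling out: a comparison against a missing value evaluates to `True` for `!=`, so rows with a missing `engine_type` are kept rather than dropped. The sketch below shows this on a made-up column, along with a stricter variant that also excludes missing values. Whether to keep those rows is a judgment call; the stricter variant is only an assumption about what one might want.

```python
import numpy as np
import pandas as pd

# Hypothetical column for illustration only.
toy = pd.DataFrame({
    "engine_type": ["Reciprocating", "Turbo Fan", np.nan, "Turbo Shaft"],
})

# Same pattern as the notebook: NaN != 'Reciprocating' is True, so the NaN row survives.
kept = toy[toy["engine_type"] != "Reciprocating"]
print(kept["engine_type"].tolist())

# Stricter variant: keep only rows with a known, non-reciprocating engine type.
kept_known = toy[toy["engine_type"].notna() & (toy["engine_type"] != "Reciprocating")]
print(kept_known["engine_type"].tolist())
```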

In [17]:
aviation_data.head(3)

Unnamed: 0,investigation_type,event_date,location,country,injury_severity,aircraft_damage,make,model,amateur_built,number_of_engines,engine_type,purpose_of_flight,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight
4,Accident,1979-08-02,"Canton, OH",United States,Fatal(1),Destroyed,Cessna,501,No,,,Personal,1.0,2.0,,0.0,VMC,Approach
5,Accident,1979-09-17,"BOSTON, MA",United States,Non-Fatal,Substantial,Mcdonnell Douglas,DC9,No,2.0,Turbo Fan,,,,1.0,44.0,VMC,Climb
22,Accident,1982-01-02,"CHAMBLEE, GA",United States,Non-Fatal,Substantial,Bell,206L-1,No,1.0,Turbo Shaft,Unknown,0.0,0.0,0.0,1.0,VMC,Approach


In [18]:
aviation_data.columns

Index(['investigation_type', 'event_date', 'location', 'country',
       'injury_severity', 'aircraft_damage', 'make', 'model', 'amateur_built',
       'number_of_engines', 'engine_type', 'purpose_of_flight',
       'total_fatal_injuries', 'total_serious_injuries',
       'total_minor_injuries', 'total_uninjured', 'weather_condition',
       'broad_phase_of_flight'],
      dtype='object')

In [19]:
# List of columns to check for missing data
columns_to_check = ['location', 'country', 'make', 'model', 'amateur_built']

# Drop rows where any of the specified columns have missing data
aviation_data = aviation_data.dropna(subset=columns_to_check)

# Resetting the index after dropping rows
aviation_data.reset_index(drop=True, inplace=True)

In [23]:
# Replace 'Unavailable' with NaN first, to unify the missing data representation
aviation_data['injury_severity'] = aviation_data['injury_severity'].replace('Unavailable', pd.NA)

# Now replace all NaN values with 'Unknown'
aviation_data['injury_severity'] = aviation_data['injury_severity'].fillna('Unknown')
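
The two-step cleanup above folds both representations of missing data into a single `'Unknown'` label. A minimal sketch on made-up values, using `np.nan` as the intermediate missing marker instead of `pd.NA`, shows that the order of the two steps does not change the result:

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only.
severity = pd.Series(["Fatal(2)", "Unavailable", np.nan, "Non-Fatal"])

# Notebook order: mark 'Unavailable' as missing, then label all missing values.
two_step = severity.replace("Unavailable", np.nan).fillna("Unknown")

# Reverse order: label missing values first, then relabel 'Unavailable'.
one_label = severity.fillna("Unknown").replace("Unavailable", "Unknown")

print(two_step.tolist())
```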

In [27]:
# Replacing all NaN values with 'Unknown' for aircraft_damage
aviation_data['aircraft_damage'] = aviation_data['aircraft_damage'].fillna('Unknown')

In [32]:
# Looking into the rows that have a value of 0.0 in the 'number_of_engines' column
aviation_data[aviation_data['number_of_engines'] == 0.0].head()

Unnamed: 0,investigation_type,event_date,location,country,injury_severity,aircraft_damage,make,model,amateur_built,number_of_engines,engine_type,purpose_of_flight,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight
4,Accident,1982-01-09,"CALISTOGA, CA",United States,Non-Fatal,Substantial,Schleicher,ASW 20,No,0.0,Unknown,Personal,0.0,0.0,0.0,1.0,VMC,Landing
28,Accident,1982-02-06,"GLENDALE, AZ",United States,Non-Fatal,Substantial,Raven,S-55A,No,0.0,Unknown,Personal,0.0,0.0,0.0,2.0,VMC,Landing
39,Accident,1982-02-19,"PHOENIX, AZ",United States,Non-Fatal,Substantial,Balloon Works,FIREFLY,No,0.0,Unknown,Personal,0.0,0.0,0.0,3.0,VMC,Landing
46,Accident,1982-02-27,"CINCINNATI, OH",United States,Non-Fatal,Substantial,Barnes,FIREFLY-7,No,0.0,Unknown,Personal,0.0,1.0,1.0,2.0,VMC,Takeoff
47,Accident,1982-02-28,"NAPA, CA",United States,Non-Fatal,Destroyed,Barnes,BALLOON AX7,No,0.0,Unknown,Unknown,0.0,0.0,1.0,4.0,VMC,Landing


In [33]:
# Filtering out 'number_of_engines' that have a value of 0.0
aviation_data = aviation_data[aviation_data['number_of_engines'] != 0.0]

# Resetting the index after dropping rows
aviation_data.reset_index(drop=True, inplace=True)

In [37]:
aviation_data['number_of_engines'].value_counts()

1.0    6160
2.0    5050
3.0     470
4.0     369
8.0       3
6.0       1
Name: number_of_engines, dtype: int64

In [38]:
# Assigning the 'Unknown' value to the NaN values in the 'number_of_engines' column
aviation_data['number_of_engines'] = aviation_data['number_of_engines'].fillna('Unknown')
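
Filling a numeric column with the string `'Unknown'` has a side effect: the column becomes `object` dtype, so numeric operations such as `mean()` no longer work on it directly. The sketch below shows that effect on made-up values and, as one possible alternative that the notebook does not use, a nullable integer column that keeps the gaps as `<NA>`.

```python
import pandas as pd

# Hypothetical values for illustration only.
engines = pd.Series([1.0, 2.0, None, 4.0])

# Notebook approach: mixing floats and a string label yields an object column.
as_labels = engines.fillna("Unknown")
print(as_labels.dtype)

# Alternative (assumption, not the notebook's choice): nullable integers keep
# the column numeric while still marking the missing entry.
nullable = engines.astype("Int64")
print(nullable.dtype)
print(nullable.sum())
```

The string label is convenient for `value_counts()` and categorical plots, while the nullable integer form is easier to use when the engine count needs to be treated as a number.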

In [39]:
aviation_data.head()

Unnamed: 0,investigation_type,event_date,location,country,injury_severity,aircraft_damage,make,model,amateur_built,number_of_engines,engine_type,purpose_of_flight,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight
0,Accident,1979-08-02,"Canton, OH",United States,Fatal(1),Destroyed,Cessna,501,No,Unknown,,Personal,1.0,2.0,,0.0,VMC,Approach
1,Accident,1979-09-17,"BOSTON, MA",United States,Non-Fatal,Substantial,Mcdonnell Douglas,DC9,No,2,Turbo Fan,,,,1.0,44.0,VMC,Climb
2,Accident,1982-01-02,"CHAMBLEE, GA",United States,Non-Fatal,Substantial,Bell,206L-1,No,1,Turbo Shaft,Unknown,0.0,0.0,0.0,1.0,VMC,Approach
3,Accident,1982-01-06,"MAMMOTH LAKES, CA",United States,Non-Fatal,Substantial,Aerospatiale,SA-316B,No,1,Turbo Shaft,Business,0.0,0.0,0.0,6.0,VMC,Taxi
4,Incident,1982-01-12,"CHICAGO, IL",United States,Incident,Unknown,Lockheed,L-1011,No,3,Turbo Fan,Unknown,0.0,0.0,0.0,149.0,UNK,Cruise


In [57]:
# Removing rows with an 'engine_type' of 'None'
aviation_data = aviation_data[aviation_data['engine_type'] != 'None']

# Making sure to reset the index after dropping rows
aviation_data.reset_index(drop=True, inplace=True)

In [62]:
aviation_data['engine_type'].value_counts()

Turbo Shaft        3491
Turbo Prop         3302
Turbo Fan          2455
Unknown            1237
Turbo Jet           688
Geared Turbofan      12
Electric             10
LR                    2
UNK                   1
Hybrid Rocket         1
Name: engine_type, dtype: int64

In [68]:
# Change NaN in 'engine_type' to 'Unknown' only where 'number_of_engines' is 'Unknown'
aviation_data.loc[(aviation_data['number_of_engines'] == 'Unknown') & (aviation_data['engine_type'].isna()), 'engine_type'] = 'Unknown'
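
The conditional assignment above only touches rows where both conditions hold, which is why `engine_type` can still contain missing values afterwards. A small sketch on made-up rows makes the behavior visible:

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "number_of_engines": ["Unknown", 2.0, "Unknown", 1.0],
    "engine_type": [np.nan, np.nan, "Turbo Fan", "Turbo Shaft"],
})

# Both conditions must hold: the engine count is 'Unknown' AND the engine type is missing.
mask = (toy["number_of_engines"] == "Unknown") & toy["engine_type"].isna()
toy.loc[mask, "engine_type"] = "Unknown"

print(toy)
```

Only the first row is relabeled. The second row has a known engine count, so its missing `engine_type` is left as NaN.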

In [69]:
# Re-checking how much data is missing in each column as a percentage
missing_percent = aviation_data.isnull().sum() / len(aviation_data) * 100
print(missing_percent)

investigation_type         0.000000
event_date                 0.000000
location                   0.000000
country                    0.000000
injury_severity            0.000000
aircraft_damage            0.000000
make                       0.000000
model                      0.000000
amateur_built              0.000000
number_of_engines          0.000000
engine_type               13.180134
purpose_of_flight         30.180597
total_fatal_injuries      13.457976
total_serious_injuries    15.420236
total_minor_injuries      15.588099
total_uninjured            7.889558
weather_condition         22.325770
broad_phase_of_flight     56.037277
dtype: float64


In [73]:
aviation_data['purpose_of_flight'].unique()

array(['Personal', nan, 'Unknown', 'Business', 'Executive/corporate',
       'Ferry', 'Instructional', 'Aerial Application', 'Public Aircraft',
       'Positioning', 'Other Work Use', 'Skydiving', 'Aerial Observation',
       'Flight Test', 'Air Race/show', 'Public Aircraft - Federal',
       'Air Drop', 'Public Aircraft - Local', 'External Load',
       'Firefighting', 'Public Aircraft - State', 'Banner Tow',
       'Air Race show', 'Glider Tow', 'PUBS', 'ASHO', 'PUBL'],
      dtype=object)
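
The unique values above include near-duplicates (`'Air Race/show'` and `'Air Race show'`) and short codes (`'PUBS'`, `'ASHO'`, `'PUBL'`). One way to consolidate them is a replacement mapping, sketched below on made-up values. The expansions of the short codes in `purpose_map` are my assumptions and should be checked against the NTSB data dictionary before being applied.

```python
import pandas as pd

# Hypothetical values for illustration only.
purpose = pd.Series(
    ["Air Race/show", "Air Race show", "PUBS", "ASHO", "PUBL", "Personal", None]
)

# Assumed mapping: the meanings of the short codes are guesses, not confirmed by the source.
purpose_map = {
    "Air Race show": "Air Race/show",
    "ASHO": "Air Race/show",
    "PUBS": "Public Aircraft - State",
    "PUBL": "Public Aircraft - Local",
}

# Apply the mapping, then label the remaining missing values.
cleaned = purpose.replace(purpose_map).fillna("Unknown")
print(cleaned.value_counts())
```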

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***