# Aviation Accident Database & Synopses, up to 2023

## Overview

This project analyzes data from the "Aviation Accident Database & Synopses" dataset to identify the safest and most cost-effective aircraft for a company looking to expand into the aviation industry. By reviewing accident and incident data, aircraft safety records, maintenance issues, and operational contexts, the project aims to provide actionable insights for minimizing risks and optimizing operational efficiency.


## Business Understanding

As the company diversifies its portfolio into the aviation sector, it faces significant risks
associated with aircraft safety, maintenance costs, and operational efficiency. Selecting the right 
aircraft is crucial for ensuring the success of this new business endeavor. The aviation industry is 
highly competitive and regulated, requiring careful consideration of multiple factors to ensure safe,
reliable, and cost-effective operations.

## Data Understanding


The "Aviation Accident Database & Synopses, up to 2023" from Kaggle is a comprehensive dataset 
containing detailed information about aviation accidents and incidents. This dataset provides critical
insights into various aspects of aviation safety and operational efficiency. 
For our analysis, we focus on the following key components of the dataset:


- Accident and Incident Data: Detailed records of aviation accidents and incidents, including data points such as the date, location, and severity of each event.
- Aircraft Information: Details about the aircraft involved in each accident or incident, including model, manufacturer, and registration.
- Synopsis: Brief summaries of each accident or incident, providing context and initial findings.
- Cause and Contributing Factors: Information on the identified causes and contributing factors for each event.
- Human Factors: Data on pilot and crew performance, errors, and other human-related aspects.
- Weather Conditions: Weather data at the time of each accident or incident.
- Operational Context: Information about the flight phase (e.g., takeoff, cruise, landing) and operation type (e.g., commercial, private, cargo).

## IMPORTS AND DATA

In [2]:
# importing packages
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

In [3]:
# Loading data (this first attempt fails because the file is not valid UTF-8)
df = pd.read_csv('AviationData.csv')
df

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [4]:
# Loading data again, this time decoding the file as ISO-8859-1 (Latin-1)
df = pd.read_csv('AviationData.csv', encoding='ISO-8859-1')
df


Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,10/24/1948,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,7/19/1962,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,8/30/1974,"Saltville, VA",United States,36.9222,-81.8781,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,6/19/1977,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12/9/2000
4,20041105X01764,Accident,CHI79FA064,8/2/1979,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88884,2.02212E+13,Accident,ERA23LA093,12/26/2022,"Annapolis, MD",United States,,,,,...,Personal,,0.0,1.0,0.0,0.0,,,,29-12-2022
88885,2.02212E+13,Accident,ERA23LA095,12/26/2022,"Hampton, NH",United States,,,,,...,,,0.0,0.0,0.0,0.0,,,,
88886,2.02212E+13,Accident,WPR23LA075,12/26/2022,"Payson, AZ",United States,341525N,1112021W,PAN,PAYSON,...,Personal,,0.0,0.0,0.0,1.0,VMC,,,27-12-2022
88887,2.02212E+13,Accident,WPR23LA076,12/26/2022,"Morgan, UT",United States,,,,,...,Personal,MC CESSNA 210N LLC,0.0,0.0,0.0,0.0,,,,
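
The two load attempts above can be folded into a single helper that tries UTF-8 first and falls back to ISO-8859-1. This is a minimal sketch, not the notebook's own code: `read_csv_with_fallback` and its `encodings` parameter are hypothetical names, and the demo writes a tiny throwaway file instead of reading `AviationData.csv`.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in order and return the first DataFrame that decodes."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            # Remember the failure and move on to the next candidate encoding.
            last_error = err
    raise last_error


# Demo: a two-line CSV containing a Latin-1 encoded "é" (byte 0xe9),
# the same byte that triggered the UnicodeDecodeError above.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as fh:
    fh.write("Event.Id,Location\nA1,Montr\u00e9al\n".encode("ISO-8859-1"))

demo = read_csv_with_fallback(fh.name)
os.remove(fh.name)
print(demo)
```

Because ISO-8859-1 maps every possible byte to a character, it never raises a decode error, so it is best kept as the last entry in the list.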


In [5]:
# Summarizing the columns: names, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50249 non-null  object 
 9   Airport.Name            52790 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87572 non-null  object 
 14  Make                    88826 non-null

In [22]:
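# shape after the sparse columns were dropped (this cell was executed after In [20], hence 25 columns)
df.shape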

(88889, 25)

## Handling Missing Values in the dataframe

In [6]:
# counting the missing values in each column

df.isna().sum()

Event.Id                      0
Investigation.Type            0
Accident.Number               0
Event.Date                    0
Location                     52
Country                     226
Latitude                  54507
Longitude                 54516
Airport.Code              38640
Airport.Name              36099
Injury.Severity            1000
Aircraft.damage            3194
Aircraft.Category         56602
Registration.Number        1317
Make                         63
Model                        92
Amateur.Built               102
Number.of.Engines          6084
Engine.Type                7077
FAR.Description           56866
Schedule                  76307
Purpose.of.flight          6192
Air.carrier               72241
Total.Fatal.Injuries      11401
Total.Serious.Injuries    12510
Total.Minor.Injuries      11933
Total.Uninjured            5912
Weather.Condition          4492
Broad.phase.of.flight     27165
Report.Status              6381
Publication.Date          13771
dtype: int64

In [7]:
# finding the number of rows and columns
df.shape

(88889, 31)

In [8]:
# finding the percentage of missing values in each column
# (isna().mean() divides by the actual row count instead of a hard-coded 88889)
percent_missing = df.isna().mean() * 100
percent_missing

Event.Id                   0.000000
Investigation.Type         0.000000
Accident.Number            0.000000
Event.Date                 0.000000
Location                   0.058500
Country                    0.254250
Latitude                  61.320298
Longitude                 61.330423
Airport.Code              43.469946
Airport.Name              40.611324
Injury.Severity            1.124999
Aircraft.damage            3.593246
Aircraft.Category         63.677170
Registration.Number        1.481623
Make                       0.070875
Model                      0.103500
Amateur.Built              0.114750
Number.of.Engines          6.844491
Engine.Type                7.961615
FAR.Description           63.974170
Schedule                  85.845268
Purpose.of.flight          6.965991
Air.carrier               81.271023
Total.Fatal.Injuries      12.826109
Total.Serious.Injuries    14.073732
Total.Minor.Injuries      13.424608
Total.Uninjured            6.650992
Weather.Condition          5
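
The counts and percentages computed in the two cells above can also be shown side by side, sorted so the sparsest columns come first. The sketch below uses a small made-up frame; `missing_summary` and the column labels `n_missing` and `pct_missing` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return per-column missing counts and percentages, worst column first."""
    summary = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean() * 100,
    })
    return summary.sort_values("pct_missing", ascending=False)


# Toy data standing in for the real dataframe.
toy = pd.DataFrame({
    "Make": ["Cessna", "Piper", None, "Boeing"],
    "Latitude": [np.nan, np.nan, 36.9, np.nan],
})
summary = missing_summary(toy)
print(summary)
```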

In [20]:
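# dropping columns with more than 50% missing values;
# errors='ignore' keeps the cell safe to re-run once the columns are already gone

greater_50 = percent_missing[percent_missing > 50].index
df.drop(columns=greater_50, inplace=True, errors='ignore')

greater_50

Note: the recorded run of this cell called df.drop(labels=greater_50, inplace=True, axis=1) and raised
KeyError: "['Latitude' 'Longitude' 'Aircraft.Category' 'FAR.Description' 'Schedule' 'Air.carrier'] not found in axis"
because it was re-executed after those six columns had already been dropped in place.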
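
Dropping in place mutates `df`, so the outcome depends on how many times the cell has run. A non-mutating variant is sketched below on toy data, assuming the same 50% cutoff; `drop_sparse_columns` and `max_missing_pct` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_missing_pct=50.0):
    """Return a copy of `frame` without columns whose missing share exceeds the cutoff."""
    pct_missing = frame.isna().mean() * 100
    keep = pct_missing[pct_missing <= max_missing_pct].index
    return frame.loc[:, keep].copy()


toy = pd.DataFrame({
    "Make": ["Cessna", "Piper", "Boeing", None],    # 25% missing, kept
    "Schedule": [np.nan, np.nan, np.nan, "NSCH"],   # 75% missing, dropped
})
once = drop_sparse_columns(toy)
twice = drop_sparse_columns(once)  # applying it again changes nothing
print(list(once.columns))
```

Because the percentages are recomputed from the frame that is passed in, the function can be applied repeatedly without raising.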

In [21]:
# confirming that the columns have been dropped
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 25 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Airport.Code            50249 non-null  object 
 7   Airport.Name            52790 non-null  object 
 8   Injury.Severity         87889 non-null  object 
 9   Aircraft.damage         85695 non-null  object 
 10  Registration.Number     87572 non-null  object 
 11  Make                    88826 non-null  object 
 12  Model                   88797 non-null  object 
 13  Amateur.Built           88787 non-null  object 
 14  Number.of.Engines       82805 non-null

## EXPLORATORY DATA ANALYSIS (EDA)

### Investing in Aircraft with Proven Safety Records

Analyzing aircraft safety records is crucial to the safety, financial stability, reputation,
regulatory compliance, and long-term viability of the company's aviation division.

We therefore concentrate on the columns listed below to identify the aircraft with the best
safety records.

In [23]:
df.columns

Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'Location', 'Country', 'Airport.Code', 'Airport.Name',
       'Injury.Severity', 'Aircraft.damage', 'Registration.Number', 'Make',
       'Model', 'Amateur.Built', 'Number.of.Engines', 'Engine.Type',
       'Purpose.of.flight', 'Total.Fatal.Injuries', 'Total.Serious.Injuries',
       'Total.Minor.Injuries', 'Total.Uninjured', 'Weather.Condition',
       'Broad.phase.of.flight', 'Report.Status', 'Publication.Date'],
      dtype='object')

Next, we perform a descriptive analysis to understand the distribution of accidents, injury
severity, aircraft damage, and purpose of flight.


In [26]:
# counting how many rows share each Event.Id
df['Event.Id'].value_counts()

2.02207E+13       190
2.02208E+13       186
2.02206E+13       185
2.02106E+13       183
2.02107E+13       174
                 ... 
20180603X31024      1
20001212X16687      1
20081012X85752      1
20011016X02099      1
20001208X09169      1
Name: Event.Id, Length: 84468, dtype: int64
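
The most frequent values in the output above (for example `2.02207E+13`) look like identifiers that were converted to scientific notation, presumably by a spreadsheet, so many distinct events would share one label. The sketch below flags such values on toy data. The regular expression is an assumption about what the collapsed IDs look like, and using `Accident.Number` as an alternative key is likewise an assumption that should be checked against the real data.

```python
import pandas as pd

# Toy rows copied in shape from the outputs shown earlier.
toy = pd.DataFrame({
    "Event.Id": ["20001218X45444", "2.02212E+13", "2.02212E+13", "20061025X01555"],
    "Accident.Number": ["SEA87LA080", "ERA23LA093", "ERA23LA095", "NYC07LA005"],
})

# Flag IDs written in scientific notation, e.g. "2.02212E+13".
sci_notation = toy["Event.Id"].str.contains(r"^\d+(?:\.\d+)?E\+\d+$", na=False)
n_collapsed = int(sci_notation.sum())
print(n_collapsed, "rows have an Event.Id in scientific notation")

# Accident.Number is still distinct for those rows in this toy example.
n_accident_numbers = toy["Accident.Number"].nunique()
print(n_accident_numbers, "distinct accident numbers")
```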

In [37]:
# previewing the two identifier columns (this does not yet count accidents per aircraft)

relevant_columns = df[['Event.Id', 'Accident.Number']]
print(relevant_columns)

             Event.Id Accident.Number
0      20001218X45444      SEA87LA080
1      20001218X45447      LAX94LA336
2      20061025X01555      NYC07LA005
3      20001218X45448      LAX96LA321
4      20041105X01764      CHI79FA064
...               ...             ...
88884     2.02212E+13      ERA23LA093
88885     2.02212E+13      ERA23LA095
88886     2.02212E+13      WPR23LA075
88887     2.02212E+13      WPR23LA076
88888     2.02212E+13      ERA23LA097

[88889 rows x 2 columns]


In [38]:
# counting rows per Event.Id again (same result as In [26]); a count per aircraft type
# would need to group by 'Make' and 'Model' instead

df['Event.Id'].value_counts()

2.02207E+13       190
2.02208E+13       186
2.02206E+13       185
2.02106E+13       183
2.02107E+13       174
                 ... 
20180603X31024      1
20001212X16687      1
20081012X85752      1
20011016X02099      1
20001208X09169      1
Name: Event.Id, Length: 84468, dtype: int64
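
A count per aircraft type can be obtained by grouping on `Make` and `Model`, both of which remain in the dataframe. The following is a minimal sketch on toy data; it assumes that `Make` may be spelled with inconsistent case or stray whitespace, which should be verified on the real column before relying on the normalization step.

```python
import pandas as pd

toy = pd.DataFrame({
    "Make": ["Cessna", "CESSNA", "Piper", "cessna ", None],
    "Model": ["172", "172", "PA-28", "152", "X"],
})

# Assumed cleanup: trim whitespace and upper-case the manufacturer name.
make_clean = toy["Make"].str.strip().str.upper()

accidents_per_type = (
    toy.assign(Make=make_clean)
       .dropna(subset=["Make", "Model"])   # rows without a make or model cannot be attributed
       .groupby(["Make", "Model"])
       .size()
       .sort_values(ascending=False)
)
print(accidents_per_type)
```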

In [43]:
# counting the distinct Event.Id values; this gives the number of unique event IDs,
# not the aircraft with the highest number of accidents

unique_event_ids = df['Event.Id'].nunique()
print(unique_event_ids)

84468
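
To find the aircraft type with the most recorded accidents, the rows can be aggregated per `Make` and `Model` and the largest group selected. The sketch below works on toy data, and the output column names `accidents` and `fatalities` are hypothetical. Raw counts are not adjusted for fleet size or flight hours, which are not among the columns listed earlier, so a high count alone does not show that a type is less safe.

```python
import pandas as pd

toy = pd.DataFrame({
    "Make": ["CESSNA", "CESSNA", "PIPER", "CESSNA", "CESSNA"],
    "Model": ["172", "172", "PA-28", "152", "172"],
    "Total.Fatal.Injuries": [0.0, 2.0, None, 1.0, 0.0],
})

per_type = toy.groupby(["Make", "Model"]).agg(
    accidents=("Total.Fatal.Injuries", "size"),   # size counts rows, including missing values
    fatalities=("Total.Fatal.Injuries", "sum"),   # sum skips missing values
)

# Index label (Make, Model) of the group with the most rows.
most_accidents = per_type["accidents"].idxmax()
print(per_type)
print(most_accidents)
```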
