# Project Overview

This project analyses the Aviation Accident dataset, which contains over 80,000 accident records from 1962 to as recently as 2023. The analysis can be used to see the safest airlines with the fewest accidents, the fatalities that occurred, and the areas that can be improved to reduce such calamities.

# Business Understanding

We have been hired by Sky High Corp. They are interested in **purchasing and operating airplanes** for **commercial and private activities**, and they want to know the **potential risks involved in aviation**.

We have been tasked with finding **which aircraft have the lowest risk** for the company to start with as it enters this new business.

In [39]:
# Importing the necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# To display all columns
pd.set_option('display.max_columns', 500)

# To ensure all visualizations stay within the notebook
%matplotlib inline

In [None]:
# Loading the dataset 
aviation_df = pd.read_csv("./data/AviationData.csv", encoding = 'latin-1',
                         dtype = {6: str, 7:str, 28: str})
# dtype = {6: str, 7:str, 28: str}: was used to set the data type for those specific columns to str to avoid errors
aviation_df

# Data Understanding

In [None]:
print(f"The Accident Aviation dataset contains {aviation_df.shape[0]} rows and {aviation_df.shape[1]} columns")

In [None]:
aviation_df.info()

In [None]:
# Checking the percentage of null values in all columns
aviation_df.isna().mean()*100

# Data Preparation

In [None]:
# Converting all column names to lowercase and replacing dots with underscores
# regex = False treats '.' as a literal dot; as a regex, '.' would match every character
aviation_df.columns = aviation_df.columns.str.lower().str.replace('.', "_", regex = False)
aviation_df.columns

In [None]:
aviation_df.isna().sum()

In [None]:
# To get only accidents that happen in the United States and its territories
us_territories = ["United States",'American Samoa','Guam',"Marshall Islands","Micronesia",
                  "Northern Marianas","Palau","Puerto Rico","Virgin Islands","Washington_DC",
                  "Gulf of Mexico","Atlantic Ocean","Pacific Ocean"]
us_accidents_df = aviation_df[aviation_df['country'].isin(us_territories)]

In [None]:
us_accidents_df.info()

In [None]:
# Checking for the percentage of null values in each column
us_accidents_df.isna().mean()*100

In [None]:
# Dropping unnecessary columns
us_accidents = us_accidents_df.copy()
us_accidents.drop(['latitude',
                   'longitude', 
                   'schedule',
                   'far_description',
                   'airport_code',
                   'report_status',
                   'publication_date',
                   'air_carrier',
                   'airport_name',], axis = 1, inplace = True)

Some columns had to be changed to appropriate data types:
- `number_of_engines`, `total_fatal_injuries`, `total_serious_injuries`, `total_minor_injuries` and `total_uninjured` had to be changed to integers, as counts of people and engines cannot be fractional
- `event_date` had to be changed to a datetime format and the year extracted
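
The conversions listed above can be sketched on a toy frame. The values below are hypothetical; only the column names come from the dataset. Option B uses pandas' nullable `Int64` dtype, which is an alternative not used in this notebook, shown because filling with 0 makes a missing count indistinguishable from a true zero.

```python
import pandas as pd

# Toy frame with hypothetical values; the column names mirror the dataset.
toy = pd.DataFrame({
    "number_of_engines": [1.0, 2.0, None],
    "total_fatal_injuries": [0.0, None, 3.0],
    "event_date": ["1982-01-01", "2022-12-26", "2001-07-04"],
})
count_cols = ["number_of_engines", "total_fatal_injuries"]

# Option A (the approach taken here): 0 as a placeholder, then a plain integer cast.
filled = toy[count_cols].fillna(0).astype(int)

# Option B (assumption, not used in this notebook): the nullable integer
# dtype keeps missing counts as <NA> instead of turning them into 0.
nullable = toy[count_cols].astype("Int64")

# Extracting the year; strftime returns strings, not integers.
years = pd.to_datetime(toy["event_date"], format="%Y-%m-%d").dt.strftime("%Y")

print(filled)
print(nullable)
print(years.tolist())
```

With Option A, any later average of engine counts or injuries includes the placeholder zeros, which is worth keeping in mind during the analysis.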

In [None]:
# Filling null values with 0 and changing data type to int
# 0 becomes a placeholder
us_accidents['number_of_engines'] = us_accidents['number_of_engines'].fillna(0).astype(int)
us_accidents['total_fatal_injuries'] = us_accidents['total_fatal_injuries'].fillna(0).astype(int)
us_accidents['total_serious_injuries'] = us_accidents['total_serious_injuries'].fillna(0).astype(int)
us_accidents['total_minor_injuries'] = us_accidents['total_minor_injuries'].fillna(0).astype(int)
us_accidents['total_uninjured'] = us_accidents['total_uninjured'].fillna(0).astype(int)

# Only getting the Year the incident/accident happened
us_accidents['event_date'] = pd.to_datetime(us_accidents['event_date'], format='%Y-%m-%d').dt.strftime('%Y')
us_accidents.columns = us_accidents.columns.str.replace('event_date', "event_year")

In [None]:
# Checking for duplicates using the event_id
us_accidents[us_accidents.duplicated(subset = 'event_id', keep = False)].head(50)

While checking for duplicates in `event_id`, it was discovered that wherever an `event_id` was duplicated, two aircraft were involved in the accident. Both were logged under one `event_id` but with different `accident_number` values.
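
To illustrate why a repeated `event_id` is not a true duplicate, here is a minimal sketch with made-up records. The identifiers and makes are hypothetical.

```python
import pandas as pd

# Hypothetical records: one event involving two aircraft, one single-aircraft event.
toy = pd.DataFrame({
    "event_id": ["E001", "E001", "E002"],
    "accident_number": ["A100", "A101", "A102"],
    "make": ["Cessna", "Piper", "Bell"],
})

# keep=False flags every row that shares an event_id, not just the later repeats.
shared_event = toy[toy.duplicated(subset="event_id", keep=False)]

# Number of distinct accident numbers (aircraft) logged under each event.
aircraft_per_event = toy.groupby("event_id")["accident_number"].nunique()

print(shared_event)
print(aircraft_per_event)
```

Because each row describes a different aircraft, dropping on `event_id` alone would discard real records.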

### Aircraft Category Column

The `aircraft_category` column started with around 65% null values. Since our client is mostly interested in airplane data, we had to minimize the null values.

The following were done after a lot of research:
- The type of aircraft had to be identified using the `make` and `model` columns.
- Some duplicates were removed in the `make` and `model` columns by converting all values into Title case.
- Based on the aircraft in the dataset, we determined which were helicopters, which were airplanes and which were gliders.
- Some naming conventions were changed to ensure uniformity in the dataset.

Once most of the data in `aircraft_category` was cleaned, we were able to reduce the null values from 65% to 13%. The rest of the null values were then dropped.
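
The idea behind the imputation can be sketched with a lookup table. This is an illustrative alternative, not the cell used in this notebook: the `make_to_category` dictionary is hypothetical and far shorter than the real list, and unlike the cell that follows it only fills missing categories rather than overwriting existing ones.

```python
import pandas as pd

# Toy data with hypothetical rows.
toy = pd.DataFrame({
    "make": ["CESSNA", "cessna", "Bell", "Unlisted Maker"],
    "aircraft_category": [None, "Airplane", None, None],
})

# Title case collapses spelling variants such as "CESSNA" and "cessna".
toy["make"] = toy["make"].str.title()

# Hypothetical lookup; a real one would list every make that was researched.
make_to_category = {"Cessna": "Airplane", "Bell": "Helicopter"}

# Fill only the missing categories; makes absent from the lookup stay null.
toy["aircraft_category"] = toy["aircraft_category"].fillna(
    toy["make"].map(make_to_category)
)

remaining_null_pct = toy["aircraft_category"].isna().mean() * 100
print(toy)
print(f"{remaining_null_pct:.0f}% of categories are still missing")
```

Keeping the mapping in one dictionary makes it easier to review which makes were assigned to which category.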

In [None]:
us_accidents.isna().mean()*100

In [None]:
us_accidents['make'].value_counts()

In [None]:
# Converting all values into Title case
us_accidents['make'] = us_accidents['make'].str.title()
#pd.set_option('display.max_rows', None)
us_accidents['make'].value_counts().head(56)

In [None]:
# Imputing the appropriate aircraft category depending on make and model columns
us_accidents.loc[us_accidents['make'] == 'Cessna', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Piper', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Beech', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Mooney', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Bellanca', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Boeing', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'American Champion Aircraft', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Aeronca', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Maule', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Stinson', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Luscombe', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Aero Commander', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Taylorcraft', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Rockwell International', 'aircraft_category'] = 'Airplane'
# North American (North American Aviation) built fixed-wing aircraft, so it is an Airplane make
us_accidents.loc[us_accidents['make'] == 'North American', 'aircraft_category'] = 'Airplane'
us_accidents.loc[us_accidents['make'] == 'Hiller', 'aircraft_category'] = 'Helicopter'
us_accidents.loc[us_accidents['make'] == 'Bell', 'aircraft_category'] = 'Helicopter'
us_accidents.loc[us_accidents['make'] == 'Hughes', 'aircraft_category'] = 'Helicopter'

# Streamlining naming conventions for Robinson Helicopter Company
us_accidents['make'] = us_accidents['make'].replace(['Robinson','Robinson Helicopter','Robinson Helicopter Company'], "Robinson Helicopter Company")
us_accidents.loc[us_accidents['make'] == 'Robinson Helicopter Company', 'aircraft_category'] = 'Helicopter'

# Streamlining naming conventions for Northrop Grumman 
us_accidents['make'] = us_accidents['make'].replace(['Grumman','Grumman American','Grumman American Avn. Corp.'], "Northrop Grumman")
us_accidents.loc[us_accidents['make'] == 'Northrop Grumman', 'aircraft_category'] = 'Airplane'

# Streamlining naming conventions for De Havilland 
us_accidents['make'] = us_accidents['make'].replace(['Dehavilland','De Havilland'], "De Havilland")
us_accidents.loc[us_accidents['make'] == 'De Havilland', 'aircraft_category'] = 'Airplane'

# Streamlining naming conventions for Air Tractor Inc 
us_accidents['make'] = us_accidents['make'].replace(['Air Tractor','Air Tractor Inc'], "Air Tractor Inc")
us_accidents.loc[us_accidents['make'] == 'Air Tractor Inc', 'aircraft_category'] = 'Airplane'

# Streamlining naming conventions for American Champion Aircraft 
us_accidents['make'] = us_accidents['make'].replace(['American Champion Aircraft','Champion'], "American Champion Aircraft")

# Streamlining naming conventions for Rockwell International 
us_accidents['make'] = us_accidents['make'].replace(['Rockwell','Rockwell International'], "Rockwell International")

# Streamlining naming conventions for Cirrus Design Corp 
us_accidents['make'] = us_accidents['make'].replace(['Cirrus Design Corp','Cirrus'], "Cirrus Design Corp")

# Streamlining naming conventions for Aviat Aircraft Inc 
us_accidents['make'] = us_accidents['make'].replace(['Aviat Aircraft Inc','Aviat'], "Aviat Aircraft Inc")

# Streamlining naming conventions for Ayres Corporation 
us_accidents['make'] = us_accidents['make'].replace(['Ayres Corporation','Ayres'], "Ayres Corporation")

# Streamlining naming conventions for Diamond Aircraft Ind Inc 
us_accidents['make'] = us_accidents['make'].replace(['Diamond Aircraft Ind Inc','Diamond'], "Diamond Aircraft Ind Inc")

# Imputing the appropriate aircraft category depending on make and model columns for Schweizer
us_accidents.loc[(us_accidents['aircraft_category'].isna()) & 
                 (us_accidents['make'] == "Schweizer") & 
                 (us_accidents['model'].str.contains('269|300', na = False, case = False)), 'aircraft_category'] = "Helicopter"
us_accidents.loc[(us_accidents['aircraft_category'].isna()) & 
                 (us_accidents['make'] == "Schweizer") & 
                 (us_accidents['model'].str.contains('2-3|1-2|2-2|1-3|SGS', na = False, case = False)), 'aircraft_category'] = "Glider"
us_accidents.loc[(us_accidents['aircraft_category'].isna()) & 
                 (us_accidents['make'] == "Schweizer") & 
                 (us_accidents['model'].str.contains('164', na = False, case = False)), 'aircraft_category'] = "Airplane"

# Imputing the appropriate aircraft category depending on make and model columns for McDonnell Douglas
us_accidents.loc[(us_accidents['aircraft_category'].isna()) & 
                 (us_accidents['make'] == "Mcdonnell Douglas") & 
                 (us_accidents['model'].str.contains('DC|MD-8|MD-11|MD-9|MD8|MD-10|MD11', na = False, case = False)), 'aircraft_category'] = "Airplane"
us_accidents.loc[(us_accidents['aircraft_category'].isna()) & 
                 (us_accidents['make'] == "Mcdonnell Douglas") & 
                 (us_accidents['model'].str.contains('369|500|600|269|520|90', na = False, case = False)), 'aircraft_category'] = "Helicopter"
us_accidents['make'] = us_accidents['make'].replace('Mcdonnell Douglas', "McDonnell Douglas")

# Replacing UNK with Unknown
us_accidents['aircraft_category'] = us_accidents['aircraft_category'].replace('UNK', "Unknown")

# Dropping the rest of the values
us_accidents.dropna(subset = ['make', 'model','aircraft_category'], inplace = True)

In [None]:
# Final Results
us_accidents['aircraft_category'].value_counts()

### Location and State Columns

New columns had to be computed to capture the state and the area where each accident happened:
- `area` contains the general area where the accident occurred
- `state_abbrev` contains the abbreviation for the state or territory

Due to input errors, especially among the US territories, manual replacements had to be done to get the correct data. In cases where the state could not be extracted, **UN** is used to represent **Unknown**.
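
A minimal sketch of the split, using hypothetical location strings. Splitting from the right with `n = 1` means only the last comma separates the area from the state, so commas inside the area are preserved.

```python
import pandas as pd

# Hypothetical location strings covering the cases described above.
locations = pd.Series([
    "MOOSE CREEK, ID",
    "Pago Pago, American Samoa",
    "St. Thomas, Charlotte Amalie, VI",
    "SAN JUAN",  # no comma, so no state can be extracted
])

# Split once from the right: column 0 is the area, column 1 the state part.
parts = locations.str.rsplit(",", n=1, expand=True)
area = parts[0]
state_abbrev = parts[1].str.strip()

# Manual fix for a spelled-out territory, then UN for anything empty or missing.
state_abbrev = state_abbrev.replace({"American Samoa": "AS", "": "UN"}).fillna("UN")

print(area.tolist())
print(state_abbrev.tolist())
```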

In [None]:
# Creating and cleaning up the created columns
new_cols = us_accidents['location'].str.rsplit(',',n = 1, expand = True)
us_accidents['area'] = new_cols[0]
us_accidents['state_abbrev'] = new_cols[1].str.strip()

In [None]:
# pd.set_option('display.max_rows', None)
us_accidents['state_abbrev'].value_counts()

In [None]:
# Renaming the short codes accordingly
us_accidents['state_abbrev'] = us_accidents['state_abbrev'].replace(["Virgin Islands (British)", 'CB'], 'VI')
us_accidents['state_abbrev'] = us_accidents['state_abbrev'].replace(["American Samoa","AMERICAN SAMOA"], 'AS')
us_accidents['state_abbrev'] = us_accidents['state_abbrev'].replace("Micronesia (Federated States of)", 'FM')
us_accidents['state_abbrev'] = us_accidents['state_abbrev'].replace(["Marshall Islands","MARSHALL ISLANDS"], 'MH')
us_accidents['state_abbrev'] = us_accidents['state_abbrev'].replace("Palau", 'PW')

# All Empty Values replaced with UN for Unknown
us_accidents['state_abbrev'] = us_accidents['state_abbrev'].replace("", 'UN')
us_accidents['state_abbrev'] = us_accidents['state_abbrev'].fillna('UN')

In [None]:
# pd.set_option('display.max_rows', None)
us_accidents['state_abbrev'].value_counts()

All good

In [None]:
# This dictionary contains the long form of the state abbreviations
state_abbreviation = {
    "AL": "Alabama",
    "AK": "Alaska",
    "AZ": "Arizona",
    "AR": "Arkansas",
    "CA": "California",
    "CO": "Colorado",
    "CT": "Connecticut",
    "DE": "Delaware",
    "FL": "Florida",
    "GA": "Georgia",
    "HI": "Hawaii",
    "ID": "Idaho",
    "IL": "Illinois",
    "IN": "Indiana",
    "IA": "Iowa",
    "KS": "Kansas",
    "KY": "Kentucky",
    "LA": "Louisiana",
    "ME": "Maine",
    "MD": "Maryland",
    "MA": "Massachusetts",
    "MI": "Michigan",
    "MN": "Minnesota",
    "MS": "Mississippi",
    "MO": "Missouri",
    "MT": "Montana",
    "NE": "Nebraska",
    "NV": "Nevada",
    "NH": "New Hampshire",
    "NJ": "New Jersey",
    "NM": "New Mexico",
    "NY": "New York",
    "NC": "North Carolina",
    "ND": "North Dakota",
    "OH": "Ohio",
    "OK": "Oklahoma",
    "OR": "Oregon",
    "PA": "Pennsylvania",
    "RI": "Rhode Island",
    "SC": "South Carolina",
    "SD": "South Dakota",
    "TN": "Tennessee",
    "TX": "Texas",
    "UT": "Utah",
    "VT": "Vermont",
    "VA": "Virginia",
    "WA": "Washington",
    "WV": "West Virginia",
    "WI": "Wisconsin",
    "WY": "Wyoming",
    "AS": "American Samoa",
    "GU": "Guam",
    "MH": "Marshall Islands",
    "FM": "Micronesia",
    "MP": "Northern Marianas",
    "PW": "Palau",
    "PR": "Puerto Rico",
    "VI": "Virgin Islands",
    "DC": "Washington DC",
    "GM": "Gulf of Mexico",
    "AO": "Atlantic Ocean",
    "PO": "Pacific Ocean",
    "UN": "Unknown"
}

In [None]:
# Making a new column with the abbreviations
us_accidents['state'] = us_accidents['state_abbrev'].map(state_abbreviation)

In [None]:
us_accidents.info()

While investigating the null values in `state`, it was discovered that some rows had the **OF** abbreviation, which is not attached to any state or territory because those locations are not in the United States. They were thus dropped.
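
One way such unmapped codes can be surfaced is to look at the abbreviations whose mapped value is null. The sketch below uses a hypothetical series of codes and a deliberately shortened lookup.

```python
import pandas as pd

# Hypothetical abbreviations, including a code with no entry in the lookup.
codes = pd.Series(["CA", "OF", "TX", "OF", "UN"])

# Shortened stand-in for the full state dictionary.
lookup = {"CA": "California", "TX": "Texas", "UN": "Unknown"}

# map() returns NaN for any code that is missing from the lookup.
state = codes.map(lookup)

# Count the codes that failed to map, to decide whether to fix or drop them.
unmapped = codes[state.isna()].value_counts()
print(unmapped)
```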

In [None]:
us_accidents.drop(us_accidents[us_accidents['state'].isna()].index, inplace = True)

### Injury Columns

A new column, `'total_injured'`, is created. It contains the sum of the fatal, serious and minor injury columns.

In [None]:
us_accidents['total_injured'] = us_accidents[['total_fatal_injuries',
                                              'total_serious_injuries',
                                              'total_minor_injuries']].sum(axis = 1)
us_accidents

### Injury Severity Column

The `'injury_severity'` column has many fatal rows that also carry a count in parentheses. Let us keep only **Fatal** rather than Fatal(1), Fatal(4), etc.
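
A small sketch of the pattern on hypothetical values: `\(\d+\)` matches an opening parenthesis, one or more digits, and a closing parenthesis, so only the bracketed count is removed.

```python
import pandas as pd

# Hypothetical severity labels in the formats described above.
severity = pd.Series(["Fatal(1)", "Fatal(4)", "Non-Fatal", "Incident", "Fatal"])

# Strip a parenthesised number; labels without one are left unchanged.
cleaned = severity.str.replace(r"\(\d+\)", "", regex=True)

print(cleaned.value_counts())
```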

In [None]:
# Before
us_accidents['injury_severity'].value_counts()

In [None]:
us_accidents['injury_severity'] = us_accidents['injury_severity'].str.replace(r'\(\d+\)', '', regex = True)

In [None]:
# After
us_accidents['injury_severity'].value_counts()

### Model Column

The `model` column is mostly clean. The main concern is input error: hyphens and spaces appear where they should not, or are used interchangeably for the same model. Removing all hyphens and whitespace is a simple way to make these variants match and get a more accurate count per model.
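
A minimal sketch of the effect, using hypothetical spellings of one model designation:

```python
import pandas as pd

# Hypothetical variants of the same model plus one unrelated model.
models = pd.Series(["PA-28-140", "PA28 140", "PA 28-140", "172"])

# The character class [-\s] matches a hyphen or any whitespace character.
normalized = models.str.replace(r"[-\s]", "", regex=True)

print(normalized.value_counts())
```

One trade-off to keep in mind is that stripping separators could, in principle, merge designations that are genuinely different, so the resulting counts are worth a quick review.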

In [None]:
us_accidents['model'] = us_accidents['model'].str.replace(r"[-\s]", '', regex = True)
# -: matches hyphens
# \s: matches whitespace characters

In [None]:
#pd.set_option('display.max_rows', None)
us_accidents[['make','model']].value_counts()

In [None]:
# Dropping rows where injury_severity is NaN or "Unavailable",
# and rows where the aircraft is amateur-built or amateur_built is NaN
us_accidents.drop(us_accidents[(us_accidents['injury_severity'].isna()) | 
                  (us_accidents['injury_severity'] == "Unavailable") |
                   (us_accidents['amateur_built'] == "Yes") |
                    (us_accidents['amateur_built'].isna())].index, inplace = True)

#### Cleaning Up Null Values

In [None]:
# Filling NaN in aircraft_damage with Unknown
us_accidents['aircraft_damage'] =  us_accidents['aircraft_damage'].fillna('Unknown')

# Filling NaN in engine_type with Unknown
us_accidents['engine_type'] = us_accidents['engine_type'].replace("UNK", "Unknown")
us_accidents['engine_type'] = us_accidents['engine_type'].fillna('Unknown')

# Filling NaN in purpose_of_flight with Unknown
us_accidents['purpose_of_flight'] = us_accidents['purpose_of_flight'].fillna("Unknown")

# Dropping unnecessary Columns
us_accidents.drop(['location',
                   'broad_phase_of_flight',
                   'registration_number',
                   'area',
                   'amateur_built',
                   'accident_number',
                   'weather_condition'
                  ], axis = 1, inplace = True)

In [None]:
# Keeping only the rows whose aircraft category is Airplane or Helicopter
us_accidents = us_accidents[(us_accidents['aircraft_category'] == "Airplane") | 
                            (us_accidents['aircraft_category'] == "Helicopter")]

In [None]:
# Filtering us_accidents to keep only the 30 makes with the most records
considered_planes = list(us_accidents['make'].value_counts().head(30).index)

us_accidents = us_accidents[us_accidents['make'].isin(considered_planes)]

In [None]:
# Reordering Columns and resetting the index
col_order = ['event_id', 'investigation_type', 'event_year','state_abbrev', 'state','country',
             'aircraft_category', 'make','model', 'number_of_engines', 'engine_type','purpose_of_flight',
            'total_fatal_injuries', 'total_serious_injuries','total_minor_injuries', 'total_injured', 'total_uninjured',
            'injury_severity', 'aircraft_damage']
us_accidents = us_accidents[col_order].reset_index(drop = True)

This is our cleaned data

In [None]:
# Saving the file
us_accidents.to_csv("./data/US_Avaition_Accidents.csv", encoding = 'latin-1', index = False)

# Data Analysis