# Aviation Risk Analysis Project

Authors: Brian Woo, Evan Rosenbaum

## Overview

This project involves data cleaning, imputation, analysis, and visualization to generate insights for a business stakeholder. The goal is to determine the lowest risk aircraft for a company looking to expand into the aviation industry.

# Business Problem

### General Corporate Stakeholder

#### Diversification into the Aviation Industry

##### Goals

- Our client is looking to diversify their business by venturing into the aviation industry. In doing so, they are looking for new growth opportunities in an industry that has lagged behind other industries in its adoption of technology.

#### Priorities

##### Safety
- Safety is the foremost concern for our client. They are looking for models with impeccable safety records. Further, they have a preference for manufacturers with a proven track record of producing safe and reliable aircraft.

##### Durability
- Our client is looking for an aircraft that has a long operational life. They want to ensure that their investment pays off and that their aircraft can withstand the wear and tear they put on it. 

##### Lasting Value
- Our client wants to be sure that the aircraft they purchase will not be outdated soon after their purchase. Further, they would prefer for their aircraft to be able to meet the current demands and potential future demands of the industry. 



## Source of Data

The dataset used for this project is from the National Transportation Safety Board (NTSB) and includes aviation accident data from 1962 to 2023. This dataset covers civil aviation accidents and selected incidents in the United States and international waters.

# Understanding the Data Structure

In [1]:
# Import standard libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import chi2_contingency
from itertools import combinations
import re

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load the CSV files into DataFrames

aviation_df = pd.read_csv('./data_files/AviationData.csv', encoding='latin-1')

state_codes_df = pd.read_csv('./data_files/USState_Codes.csv', encoding='latin-1')

In [None]:
state_codes_df

In [None]:
# Changing the naming convention of columns
# regex=False makes '.' a literal dot rather than a regex wildcard
aviation_df.columns = aviation_df.columns.str.lower().str.replace('.', '_', regex=False)
aviation_df.columns
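
The renaming step can be checked on a few names in isolation. The sketch below is a plain-Python equivalent of the column normalization; `normalize_column_name` is a hypothetical helper introduced here, and the sample names are assumed to resemble the raw headers in the CSV.

```python
def normalize_column_name(name: str) -> str:
    """Lowercase a column name and replace literal dots with underscores."""
    # str.replace on a plain Python string is always a literal replacement
    return name.strip().lower().replace('.', '_')


# Assumed examples of raw header names
sample_names = ['Event.Id', 'Investigation.Type', 'Total.Fatal.Injuries']
normalized = [normalize_column_name(name) for name in sample_names]
print(normalized)
```

The same function could be passed to `DataFrame.rename(columns=...)` if a per-name transformation is preferred over the vectorized string methods.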

In [None]:
# Apply lambda function to remove whitespace from every element in the DataFrame
aviation_df = aviation_df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

In [None]:
# Drops duplicate rows
aviation_df.drop_duplicates(inplace=True)

In [None]:
aviation_df.shape

In [None]:
aviation_df

# Data Types and Missing Values

In [None]:
aviation_df.info()

### Missing values

In [None]:
# Percent of missing values in each column
aviation_df.isna().sum() * 100 / len(aviation_df)

In [None]:
# Number of unique and missing values
unique_missing_vals = {}
for i in aviation_df.columns:
    unique_missing_vals[i] = len(aviation_df[i].unique())

unique_values= pd.DataFrame(list(unique_missing_vals.items()), columns = ['Column', 'unique_val'])
unique_values["missing_values"] = aviation_df.isna().sum().values
unique_values

# Descriptive Statistics

In [None]:
aviation_df.describe()

### Analysis of Statistics

- There are large outliers in all of the injury metrics. However, these outliers are important and give us valuable information about the riskiness of the aircraft.

- We should not remove these outliers.


In [None]:
aviation_df.mode().T

# Data Cleaning

In [None]:
def column_info(dataframe, column):
    """
    Provides a view into the row information provided in each column
    -
    Input:
    dataframe : Pandas DataFrame
    column : str
    -
    Output:
        Prints:
            - A preview of the first 5 values in the column.
            - Value counts of the column.
            - The percentage of missing values in the column.
    """
    preview = dataframe[column].head()
    value_counts = dataframe[column].value_counts()
    percent_missing = dataframe[column].isna().sum() * 100 / len(dataframe)
    
    print("Preview of the first 5 rows in the column:")
    print(preview)
    print("\nValue counts of the column:")
    print(value_counts)
    print("\nPercentage of missing values in the column:")
    print(f"{percent_missing:.2f}%") 

## Normalizing Columns

For the following columns, we are normalizing the casing to account for duplicated entries that differ only by case.

In [None]:
# List of columns
columns_to_lowercase = [
    'investigation_type',
    'injury_severity',
    'aircraft_damage',
    'aircraft_category',
    'make',
    'amateur_built',
    'engine_type',
    'purpose_of_flight',
    'broad_phase_of_flight' 
]

# Convert the columns to lowercase
for column in columns_to_lowercase:
    aviation_df[column] = aviation_df[column].str.lower()
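
To see why the casing matters, the sketch below counts distinct values before and after lowercasing. The values in `raw_makes` are hypothetical and only illustrate the effect.

```python
from collections import Counter

# Hypothetical raw values for a 'make' column; the spellings differ only by case
raw_makes = ['Cessna', 'CESSNA', 'cessna', 'Piper', 'PIPER']

before = Counter(raw_makes)
after = Counter(make.lower() for make in raw_makes)

print(len(before), 'distinct values before lowercasing')
print(len(after), 'distinct values after lowercasing')
```

Without this step, value counts and group-by aggregations would treat each spelling as a separate manufacturer.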

## Filtering the Data 

### location and country

In [None]:
# Preview the row entries for the column
column_info(aviation_df,'location')

In [None]:
# Check for missing values
aviation_df['location'].isna().sum()

In [None]:
# Row entry review

# Copy the DataFrame so no data is lost
aviation_df_copy = aviation_df.copy()

# Fill NaN values with a blank string
aviation_df_copy['location'] = aviation_df_copy['location'].fillna('')

# Apply the filter
aviation_df_copy[aviation_df_copy['location'].str.contains('NEAR')]['location']

In [None]:
# Show the location values for rows where the country is not the United States
aviation_df['location'][aviation_df['country'] != 'United States'].value_counts()

**Review**

The location column, while providing more detail than the country column, has some issues.

First, not every entry has an accurate city location.

Second, for foreign countries, multiple comma delimiters are used to identify the city. Additionally, the country is listed inside the location.

**Examples**

- Los Mochis, Sinaloa, Mexico
- Ledbury, Herefordshire, United Kingdom
- Panama City, Panama
- Ji'an City Jiangxi Province, China

**Recommendation**

- Filter the data set so that we only have US flights.
- Create a state category so we can reliably use the consistent information for location.
- Drop the country and location columns. 

**Action**

In [None]:
# Filter for only the rows where the country is the United States
# (.copy() avoids modifying a view of the original DataFrame in the steps below)
aviation_df = aviation_df[aviation_df['country'] == 'United States'].copy()

# Drop NaN values from location
aviation_df.dropna(subset=['location'], inplace=True)

# Create a new column 'state' that holds the state information from the location column
aviation_df['state'] = aviation_df['location'].apply(lambda x: x.split(',')[-1].strip())

# Drop the 'location' column
aviation_df.drop(columns=['location'], inplace=True)

# Drop the 'country' column
aviation_df.drop(columns=['country'], inplace=True)

# Merge the csv DataFrames on the 'state' column in aviation_df and 'Abbreviation' column in state_codes_df
aviation_df = pd.merge(aviation_df, state_codes_df, left_on='state', right_on='Abbreviation', how='inner')

# Drop the 'Abbreviation' column as it's no longer needed
aviation_df.drop(columns=['Abbreviation'], inplace=True)

# Drop the 'US_State' column as it's no longer needed
aviation_df.drop(columns=['US_State'], inplace=True)
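
The state extraction and the effect of the inner merge can be sketched in plain Python. The location strings and the three-code set below are hypothetical; the notebook itself reads the full list of codes from `USState_Codes.csv`.

```python
# Illustrative subset of state codes (assumption, not the full list)
valid_state_codes = {'ID', 'CA', 'TX'}


def extract_state(location: str) -> str:
    """Return the text after the last comma, stripped of whitespace."""
    return location.split(',')[-1].strip()


# Hypothetical location strings
locations = ['MOOSE CREEK, ID', 'SALTVILLE, VA ', 'BRIDGEPORT,CA', 'EUREKA']
states = [extract_state(loc) for loc in locations]

# An inner merge keeps only rows whose extracted value matches a known code
kept = [state for state in states if state in valid_state_codes]
print(states)
print(kept)
```

A location without a comma yields the whole string as its "state", so rows like that are silently dropped by the inner merge rather than raising an error.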

### investigation_type

In [None]:
# Preview the row entries for the column
column_info(aviation_df,'investigation_type')

**Review**

Events are classified as either being accidents or incidents. 

According to the Code of Federal Regulations, "an accident is defined as an occurrence associated with the operation of an aircraft which takes place between the time any person boards the aircraft with the intention of flight and all such persons have disembarked, and in which any person suffers death or serious injury, or in which the aircraft receives substantial damage. For purposes of this part, the definition of “aircraft accident” includes “unmanned aircraft accident,” as defined herein."

An incident is defined as "an occurrence other than an accident, associated with the operation of an aircraft, which affects or could affect the safety of operations."

https://www.ecfr.gov/current/title-49/subtitle-B/chapter-VIII/part-830/subpart-A/section-830.2

**Recommendation**

Narrow our search to only include records that are labeled as accidents.
- This would result in about 5% of the total data being dropped.

**Action**

In [None]:
# Keep only accidents by excluding rows whose investigation_type is 'incident'
aviation_df = aviation_df[aviation_df['investigation_type'] != 'incident']
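
One detail of the filter above is worth checking: excluding `'incident'` is not the same as keeping only `'accident'` when some values are missing. The sketch below uses hypothetical values, with `None` standing in for a missing entry. In pandas a missing value also compares as not equal, so the same behavior is expected there.

```python
# Hypothetical investigation_type values after lowercasing
investigation_types = ['accident', 'incident', 'accident', None]

not_incident = [value for value in investigation_types if value != 'incident']
only_accident = [value for value in investigation_types if value == 'accident']

print(len(not_incident), 'rows kept by excluding incidents')
print(len(only_accident), 'rows kept by selecting accidents')
```

If the column has no missing values at this point, the two filters return the same rows.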

## Analysis for Columns to Drop

### Columns with Unique Identifiers

- event_id
    - The event_id serves as a unique identifier for each row entry. It is a reference for looking up the event in the NTSB aviation accident database.
- accident_number
    - The accident_number serves as a unique identifier for each row entry. It is a reference for looking up the event in the NTSB aviation accident database.
- registration_number
    - It is unclear why there are missing or duplicated values in this column, as one would assume that every aircraft's registration number is unique.
    - No documentation on this column could be located to provide further information.

### Columns with missing values above 50%

**Overview**

These columns contain too much missing data. Additionally, they are not pertinent to the business question. 

**Specific Column Information**

- latitude
- longitude
- far_description
    - The column contains references to the Federal Aviation Regulations (FARs) or similar regulatory categories. The FARs are a set of rules prescribed by the Federal Aviation Administration (FAA) governing all aviation activities in the United States.
    - The values are not normalized and contain duplicates.
     - Examples:
        - Part 91: General Aviation
            - Covers general operating and flight rules for all aircraft not governed by other specific parts (e.g., private pilots, corporate flights, etc.).
        - PUBU: Public Use
             - Refers to aircraft operated by government agencies or other public entities for official purposes.
        - https://www.faa.gov/hazmat/air_carriers/operations
        - https://www.ecfr.gov/current/title-14
- schedule
    - No documentation on this column could be located to provide further information.

### Columns with insignificant data in reference to business question

**Overview**

These columns, while providing interesting information, do not provide pertinent information in relation to the business question.

- airport_code
- airport_name
- air_carrier
- publication_date

### Action to Drop Columns

In [None]:
# drop the columns that are not useful for analysis
columns_to_drop = [
    'event_id',
    'accident_number',
    'latitude',
    'longitude',
    'airport_code',
    'airport_name',
    'registration_number',
    'far_description',
    'schedule',
    'air_carrier',
    'publication_date',
]

aviation_df.drop(columns=columns_to_drop, inplace=True)

## Analysis for Dropping Row NaN's

### make and model

**Overview**

We want to ensure there is a make and model for every row. If there is no make and model, we cannot give a recommendation to the business.

**Action**

In [None]:
# Drop rows with missing values in the 'make' or 'model' columns
aviation_df.dropna(subset=['make', 'model'], inplace=True)

## Analysis for Columns to Replace NaN's

### Columns with no pre-existing unknown category

For each of these columns, there are no pre-existing unknown or other categories. 

As such, we are filling the NaN values with an 'unknown' string. 

In [None]:
columns_to_replace_with_unknown = [
    'purpose_of_flight',
    'report_status',
    'aircraft_damage',
    'aircraft_category',
    'model',
    'amateur_built'
]

aviation_df[columns_to_replace_with_unknown] = aviation_df[columns_to_replace_with_unknown].fillna('unknown')

### Columns with pre-existing unknown category (UNK, Unknown, Other)

In [None]:
# find Columns with pre-existing unknown category (UNK, Unknown, Other, None) and replace them with 'unknown'
columns_to_replace_with_unknown = [
    'investigation_type',
    'event_date',
    'injury_severity', 
    'aircraft_damage',
    'aircraft_category',
    'make',
    'model',
    'amateur_built',
    'engine_type',
    'purpose_of_flight',
    'weather_condition',
    'broad_phase_of_flight',
    'report_status'
]

replace_values = ['unk', 'unknown', 'other', 'none', 'unavailable', np.nan]

aviation_df[columns_to_replace_with_unknown] = aviation_df[columns_to_replace_with_unknown].applymap(
    lambda x: 'unknown' if isinstance(x, str) and x.lower() in replace_values or pd.isna(x) else x)
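
The replacement rule can be read more easily as a named function. The sketch below mirrors the lambda in plain Python; `to_unknown` and `PLACEHOLDERS` are hypothetical names, and the sample values are made up for illustration.

```python
import math

# Placeholder strings that should collapse into a single 'unknown' category
PLACEHOLDERS = {'unk', 'unknown', 'other', 'none', 'unavailable'}


def to_unknown(value):
    """Map missing values and placeholder strings to 'unknown'."""
    if value is None:
        return 'unknown'
    if isinstance(value, float) and math.isnan(value):
        return 'unknown'
    if isinstance(value, str) and value.lower() in PLACEHOLDERS:
        return 'unknown'
    return value


samples = ['UNK', 'Other', float('nan'), None, 'VMC', 2]
cleaned = [to_unknown(value) for value in samples]
print(cleaned)
```

Note that folding `'other'` into `'unknown'` treats a known-but-unlisted category as missing, which is a modeling choice rather than a pure cleaning step.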


## Imputing Numerical Column Values

### total_fatal_injuries, total_serious_injuries, total_minor_injuries and total_uninjured

**Overview**

Missing values in the total_fatal_injuries, total_serious_injuries, total_minor_injuries, and total_uninjured columns are replaced with the median of each column.

In [None]:
injury_columns = [
    'total_fatal_injuries', 
    'total_serious_injuries', 
    'total_minor_injuries', 
    'total_uninjured'
]

In [None]:
# Create subplots with 2 rows and 2 columns
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Flatten the axes array for easy indexing
axes = axes.flatten()

# Iterate over each column and plot a histogram
for i, column in enumerate(injury_columns):
    # Drop NaN values before plotting
    data = aviation_df[column].dropna()
    
    # Plot histogram
    axes[i].hist(data, bins=100, color='skyblue', edgecolor='black')
    
    # Calculate mean and median
    mean_val = data.mean()
    median_val = data.median()
    
    # Draw vertical lines for mean and median
    axes[i].axvline(mean_val, color='red', linestyle='--', linewidth=1, label=f'Mean: {mean_val:.2f}')
    axes[i].axvline(median_val, color='blue', linestyle='-', linewidth=1, label=f'Median: {median_val:.2f}')
    
    # Set title for each subplot
    axes[i].set_title(f'Histogram of {column}')
    
    # Set labels for x and y axes
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Frequency')
    
    # Add legend
    axes[i].legend()

# Adjust layout to prevent overlap
plt.tight_layout()

# Show plot
plt.show()


**Overview**

Given the positive skew of each column, we will impute the missing values with the median.

**Action**

In [None]:
# Imputing the missing values with the median
aviation_df[injury_columns] = aviation_df[injury_columns].apply(lambda x: x.fillna(x.median()))
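
A small worked example shows why the median is the safer fill value for positively skewed counts. The numbers below are hypothetical; `None` marks a missing entry.

```python
import statistics

# Hypothetical injury counts with one extreme event and two missing entries
fatal_injuries = [0, 0, 0, 0, 1, 2, 150, None, None]

observed = [value for value in fatal_injuries if value is not None]
mean_value = statistics.mean(observed)
median_value = statistics.median(observed)

# Fill the missing entries with the median of the observed values
imputed = [median_value if value is None else value for value in fatal_injuries]

print(f'mean: {mean_value:.2f}, median: {median_value}')
print(imputed)
```

A single large event pulls the mean far above the typical value, so filling with the mean would assign injuries to records where none were most likely reported.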

In [None]:
# Creating 'total_injuries' column
aviation_df['total_injuries'] = aviation_df[
    ['total_fatal_injuries', 
     'total_serious_injuries', 
     'total_minor_injuries'
    ]
].sum(axis=1)

### number_of_engines

**Overview**

The number_of_engines column is handled differently from the injury columns. Imputing a value would misrepresent the data, because the number of engines cannot be reliably inferred from the other attributes of an aircraft.

For example, one large aircraft may have a single engine, while another aircraft of the same size has two or more engines.

In [None]:
# Get values for histogram
value_counts = aviation_df['number_of_engines'].value_counts().sort_index()

# Calculate mean and median
mean_num_engines = aviation_df['number_of_engines'].mean()
median_num_engines = aviation_df['number_of_engines'].median()

# Plot a histogram of the value counts
plt.figure(figsize=(10, 6))
value_counts.plot(kind='bar')
plt.xlabel('Number of Engines')
plt.ylabel('Frequency')
plt.title('Histogram of Number of Engines')
plt.xticks(rotation=0)

# Add vertical lines for mean and median
plt.axvline(x=mean_num_engines, color='red', linestyle='--', linewidth=1, label=f'Mean: {mean_num_engines:.2f}')
plt.axvline(x=median_num_engines, color='blue', linestyle='-', linewidth=1, label=f'Median: {median_num_engines:.2f}')

# Add legend
plt.legend()

# Show the plot
plt.show()

**Recommendation**

Unlike the injury columns, this column should be treated as a categorical column, since the number of engines describes the aircraft.

As such, we will convert the column type to object and fill the missing values with an unknown category.

**Action**

In [None]:
aviation_df['number_of_engines'] = aviation_df['number_of_engines'].fillna('unknown')

## Adding New Columns and Cleaning Individual Columns

### New Columns

In [None]:
# Adding 'total_injuries' column
aviation_df['total_injuries'] = (
    aviation_df['total_fatal_injuries'] +
    aviation_df['total_serious_injuries'] +
    aviation_df['total_minor_injuries']
)

# Adding 'year', 'month', and 'day' columns
aviation_df['event_date'] = pd.to_datetime(aviation_df['event_date'])
aviation_df['year'] = aviation_df['event_date'].dt.year
aviation_df['month'] = aviation_df['event_date'].dt.month
aviation_df['day'] = aviation_df['event_date'].dt.day

aviation_df.drop(columns=['event_date'], inplace=True)

### Cleaning Individual Columns

#### injury_severity

In [None]:
# Removing the count that follows the description of the injury in the 'injury_severity' column
aviation_df['injury_severity'] = aviation_df['injury_severity'].apply(lambda x: x.split('(')[0])
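
The effect of the split can be checked on a few values. The strings below are hypothetical examples of the lowercased column, and `strip_injury_count` is a name introduced here for illustration.

```python
def strip_injury_count(severity: str) -> str:
    """Keep only the text before the first opening parenthesis."""
    return severity.split('(')[0]


# Hypothetical values after lowercasing
samples = ['fatal(2)', 'fatal(14)', 'non-fatal', 'unknown']
stripped = [strip_injury_count(value) for value in samples]
print(stripped)
```

Values without a parenthesis pass through unchanged, so the step only merges the per-count variants into one category.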

#### report_status

In [None]:
column_info(aviation_df, 'report_status')

**Review:**

After reviewing the report status for accidents, there is a large number of records that mention the pilot as the source of the accident.

However, the most common entries in the column are:

- Probable Cause : 61754
- Foreign : 1999
- Factual : 167

**Recommendation:**

We create a column that flags explicit pilot error by searching the report status for the keyword "pilot". Although the keyword is general, it is a reasonable proxy for pilot error: each row represents an accident, and when a description is given, it describes the error that caused the accident.

The new column will be a boolean value.

Fill the NaN values with 'unknown'.

**Examples:**

- "The pilot’s decision to perform a takeoff from a perpendicular taxiway rather than the airport runway, which led to the airplane striking trees at the end of the departure path."

- "The pilot’s failure to maintain airplane control during the landing roll on a snow-covered runway surface."

- "The pilot’s failure to retract the flaps during a go-around from a bounced landing, which resulted in a collision with trees then terrain."

**Action**

In [None]:
# Create a new column 'human_error' that is True when 'report_status' mentions the pilot or an instructor
aviation_df['human_error'] = aviation_df['report_status'].str.contains('pilot|instructor', case=False, na=False)
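
The keyword search described in the recommendation can be sketched with the standard `re` module. This assumes the search runs over the report status text; `mentions_pilot` is a hypothetical helper, and the third sample sentence is invented for illustration.

```python
import re

# Case-insensitive match on either keyword
PILOT_PATTERN = re.compile(r'pilot|instructor', re.IGNORECASE)


def mentions_pilot(report_status) -> bool:
    """Return True when the text mentions a pilot or an instructor."""
    if not isinstance(report_status, str):
        return False
    return PILOT_PATTERN.search(report_status) is not None


reports = [
    "The pilot's failure to maintain airplane control during the landing roll.",
    'Probable Cause',
    "The flight instructor's delayed remedial action.",  # invented example
    None,
]
flags = [mentions_pilot(report) for report in reports]
print(flags)
```

A keyword match cannot tell whether the pilot is named as the cause or only mentioned, so the flag is best read as an approximate indicator.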

#### make

In [None]:
# Create a dictionary to store the mapping of individual make names to origin names
make_origin_mapping = {}

# Iterate over each make in aviation_df to extract the origin name
for make in aviation_df['make'].unique():
    origin_make = make.split()[0]  # Get the origin make ('mcdonnell' from 'mcdonnell douglas helicopter')
    make_origin_mapping[make] = origin_make
    
aviation_df['make'] = aviation_df['make'].map(make_origin_mapping)

#### Commercial Jet Cleaning

In [None]:
def create_comparison_bar_chart(dataframe, categorical_col, numerical_col, measure, sort_boolean, num_of_values_shown):
    # Calculate numerical measure for each categorical column
    num_by_cat = dataframe.groupby(categorical_col)[numerical_col].agg(measure)
    
    #Sort the data
    num_by_cat = num_by_cat.sort_values(ascending=sort_boolean).head(num_of_values_shown)
    
    # Plot the bar chart
    plt.figure(figsize=(12, 6))
    num_by_cat.plot(kind='bar', color='skyblue')
    plt.title(f'{numerical_col} by {categorical_col}')
    plt.xlabel(f'{categorical_col}')
    plt.ylabel(f'{numerical_col}')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()

In [None]:
create_comparison_bar_chart(aviation_df, 'purpose_of_flight', 'total_injuries', 'sum', False, 20)

Our stakeholder wants to decide whether to enter the commercial or the private segment of the aviation industry, based primarily on the safety of the aircraft.

At present, the data set does not have a classification for commercial flights in the 'purpose_of_flight' column.

Commercial passenger aircraft primarily use turbo fan engines; however, the data set also lists some of the same aircraft models under the turbo jet engine type. To account for this, we include both engine types and manually review the models to ensure that only commercial aircraft are included in the analysis.

In [None]:
# Filter the DataFrame for specific engine types and select relevant columns
filtered_df = aviation_df[
    aviation_df['engine_type'].isin(['turbo fan', 'turbo jet'])
][['make', 'model', 'engine_type', 'purpose_of_flight']]

# Filter the DataFrame to exclude erroneous plane addition
final_airbus_df = filtered_df[(filtered_df['model'] != 'F4-6')]

# Count occurrences and reset the index
value_counts_df = final_airbus_df.value_counts().reset_index()

# Filter for instances containing 'airbus'
airbus_commerical_planes = value_counts_df[value_counts_df['make'].str.contains('airbus', case=False)]

# Get unique models
unique_airbus_commerical_planes = airbus_commerical_planes['model'].unique()

##### Airbus

In [None]:
airbus_replace_dict = {}

for plane in unique_airbus_commerical_planes:
    if not plane[0].isalpha(): 
        # If the first character is not a letter, prefix the plane name with 'A' and take the first three characters of the original name
        airbus_replace_dict[plane] = 'A' + plane[:3]
    elif plane[1] == '-': 
        # If the second character is a hyphen, remove the hyphen and take the first four characters of the modified name
        airbus_replace_dict[plane] = plane.replace('-', '')[:4]
    else:
        # Take the first four characters of the plane name
        airbus_replace_dict[plane] = plane[:4]
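
The three branches are easier to follow when applied to concrete strings. The sketch below wraps the same rules in a function; `normalize_airbus_model` is a hypothetical name, and the model strings are assumed examples rather than values confirmed to be in the data set.

```python
def normalize_airbus_model(plane: str) -> str:
    """Reduce an Airbus model string to its family name, e.g. 'A320'."""
    if not plane[0].isalpha():
        # Starts with a digit: prefix 'A' and keep the first three characters
        return 'A' + plane[:3]
    if plane[1] == '-':
        # Second character is a hyphen: drop hyphens, keep four characters
        return plane.replace('-', '')[:4]
    # Otherwise keep the first four characters
    return plane[:4]


examples = ['320-214', 'A-300', 'A320-232']
normalized = {plane: normalize_airbus_model(plane) for plane in examples}
print(normalized)
```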

In [None]:
# Filter the DataFrame to exclude erroneous plane addition
final_boeing_df = filtered_df[(filtered_df['model'] != 'DC-10')]

# Count occurrences and reset the index
value_counts_df = final_boeing_df.value_counts().reset_index()

# Filter for instances containing 'boeing'
boeing_commercial_planes = value_counts_df[value_counts_df['make'].str.contains('boeing', case=False)]

# Get unique models
unique_boeing_commercial_planes = boeing_commercial_planes['model'].unique()

##### Boeing

In [None]:
boeing_replace_dict = {}
for plane in unique_boeing_commercial_planes:
    if plane[1] == '-':
        # If the second character is a hyphen (a 'B-737' style name), remove the hyphens
        # and take the three characters after the leading letter
        boeing_replace_dict[plane] = plane.replace('-', '')[1:4]
    elif "BOEING" in plane:
        # If the plane contains 'BOEING', like 'BOEING 777-236', remove 'BOEING' and take
        # the three characters that follow the remaining leading space
        boeing_replace_dict[plane] = plane.replace('BOEING', '')[1:4]
    elif "MD" in plane or "DC" in plane:
        # If the plane contains 'MD' or 'DC', like 'MD-11' or 'DC-10', take the first five characters of the plane
        boeing_replace_dict[plane] = plane[:5]
    elif plane[0].isalpha():
        # If the plane starts with an alpha character, take characters from index 1 to 4
        boeing_replace_dict[plane] = plane[1:4]
    else:
        # Take the first three characters
        boeing_replace_dict[plane] = plane[:3]
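
As a quick check of the branch order, the sketch below applies the same rules to one string per branch. `normalize_boeing_model` is a hypothetical wrapper, and the model strings are assumed examples.

```python
def normalize_boeing_model(plane: str) -> str:
    """Reduce a Boeing model string to its family number, e.g. '737'."""
    if plane[1] == '-':
        return plane.replace('-', '')[1:4]
    if 'BOEING' in plane:
        return plane.replace('BOEING', '')[1:4]
    if 'MD' in plane or 'DC' in plane:
        return plane[:5]
    if plane[0].isalpha():
        return plane[1:4]
    return plane[:3]


examples = ['B-737', 'BOEING 777-236', 'MD-11', 'B737-300', '747-400']
normalized = {plane: normalize_boeing_model(plane) for plane in examples}
print(normalized)
```

The order matters: the hyphen check runs first, so a name such as 'B-737' never reaches the later branches.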

##### McDonnell

In [None]:
# find the mcdonnell douglas planes
filter_mcdonnell_planes = aviation_df[
    (aviation_df['engine_type'].isin(['turbo fan', 'turbo jet'])) & 
    (aviation_df['purpose_of_flight'] == 'commercial flight') &
    (aviation_df['make'] == 'mcdonnell')
]['model'].unique()

# remove the rows with model 'RF-4C' and F/A-18C in filter_mcdonnell_planes
models_to_remove = ['RF-4C', 'F/A-18C']
unique_mcdonnell_commerical_planes = [model for model in filter_mcdonnell_planes if model not in models_to_remove]

In [None]:
mcdonnell_replace_dict = {}

for plane in unique_mcdonnell_commerical_planes:
    # Use regex to split the model
    parts = re.split(r'[- ]', plane)
    
    if len(parts) > 1:
        # Extract only numeric part from the second segment
        numeric_part = ''.join(filter(str.isdigit, parts[1]))
        if numeric_part:
            if plane == 'DC10-30F':
                mcdonnell_replace_dict[plane] = 'DC10'
            elif plane == 'DC8-63F':
                mcdonnell_replace_dict[plane] = 'DC8'
            else:
                mcdonnell_replace_dict[plane] = parts[0] + numeric_part
        else:
            mcdonnell_replace_dict[plane] = parts[0]
    else:
        
        mcdonnell_replace_dict[plane] = plane.replace(' ', '')
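
The sketch below applies the general regex rule without the two hard-coded exceptions, which makes it visible why those exceptions exist. `normalize_mcdonnell_model` is a hypothetical name, and the model strings are assumed examples.

```python
import re


def normalize_mcdonnell_model(plane: str) -> str:
    """Join the prefix with the digits of the second segment, e.g. 'MD88'."""
    parts = re.split(r'[- ]', plane)
    if len(parts) > 1:
        numeric_part = ''.join(filter(str.isdigit, parts[1]))
        return parts[0] + numeric_part if numeric_part else parts[0]
    return plane


examples = ['MD-88', 'DC-9-82', 'MD 83', 'DC10']
normalized = {plane: normalize_mcdonnell_model(plane) for plane in examples}
print(normalized)

# Without a special case, the variant suffix is glued onto the family name
print(normalize_mcdonnell_model('DC10-30F'))
```

When the family number is already attached to the prefix, the general rule appends the variant digits, which is the case that the explicit `DC10-30F` and `DC8-63F` checks handle.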

##### Embraer

In [None]:
# find the embraer planes
unique_embraer_commerical_planes = aviation_df[
    (aviation_df['engine_type'].isin(['turbo fan', 'turbo jet'])) & 
    (aviation_df['purpose_of_flight'] == 'commercial flight') &
    (aviation_df['make'] == 'embraer')
]['model'].unique()

In [None]:
# Build each name as 'ERJ' followed by the digits found in the first 7 characters of the model
embraer_replace_dict = {}
for plane in unique_embraer_commerical_planes:
    embraer_replace_dict[plane] = 'ERJ' + ''.join(filter(str.isdigit, plane[:7]))

##### Fokker

In [None]:
# find the fokker planes
unique_fokker_commercial_planes = aviation_df[
    (aviation_df['engine_type'].isin(['turbo fan', 'turbo jet'])) & 
    (aviation_df['purpose_of_flight'] == 'commercial flight') &
    (aviation_df['make'] == 'fokker')
]['model'].unique()

In [None]:
# Normalize each Fokker model toward 'F' plus its series number, e.g. 'F28'
fokker_replace_dict = {}
for plane in unique_fokker_commercial_planes:
    # print(plane)
    hypen_splitter = plane.split('-')
    space_splitter = plane.split(' ')
    if not plane[0].isalpha():
        fokker_replace_dict[plane] = 'F' + ''.join(filter(str.isdigit, plane[:2]))
    elif plane[2] == '-':
        value = hypen_splitter[0] + hypen_splitter[1]
        fokker_replace_dict[plane] = value.replace('K', '')
    elif space_splitter[0] == 'F28':
        fokker_replace_dict[plane] = 'F28'
    elif plane[1] == '-':
        fokker_replace_dict[plane] = space_splitter[0].replace('-', '')
    elif plane[1] == '.':
        fokker_replace_dict[plane] = space_splitter[0].replace('.', '')
    else:
        fokker_replace_dict[plane] = plane

##### Lockheed

In [None]:
# find the lockheed planes
unique_lockheed_commercial_planes = aviation_df[
    (aviation_df['engine_type'].isin(['turbo fan', 'turbo jet'])) & 
    (aviation_df['purpose_of_flight'] == 'commercial flight') &
    (aviation_df['make'] == 'lockheed')
]['model'].unique()

In [None]:
lockheed_replace_dict = {}
for plane in unique_lockheed_commercial_planes:
    hypen_splitter = plane.split('-')
    if plane[1] == '-':
        lockheed_replace_dict[plane] = hypen_splitter[0] + hypen_splitter[1]
    else:
        lockheed_replace_dict[plane] = hypen_splitter[0]

##### Ilyushin

In [None]:
# find the ilyushin planes
unique_ilyushin_commercial_plane = aviation_df[
    (aviation_df['engine_type'].isin(['turbo fan', 'turbo jet'])) & 
    (aviation_df['purpose_of_flight'] == 'commercial flight') &
    (aviation_df['make'] == 'ilyushin')
]['model'].unique()

In [None]:
# Remove hyphens from every Ilyushin model found by the filter above
# (the comprehension also works when the filter returns no models)
ilyushin_replace_dict = {
    plane: plane.replace('-', '') for plane in unique_ilyushin_commercial_plane
}

##### Apply replacement dictionaries

In [None]:
list_of_replacement_dictionaries = [
    airbus_replace_dict,
    boeing_replace_dict,
    mcdonnell_replace_dict,
    embraer_replace_dict,
    fokker_replace_dict,
    lockheed_replace_dict,
    ilyushin_replace_dict
]

# Apply the replacements
for replace_dict in list_of_replacement_dictionaries:
    aviation_df['model'] = aviation_df['model'].replace(replace_dict)

##### Fill 'purpose_of_flight' with commercial flight classification

In [None]:
makes_to_include = [
    'boeing', 
    'airbus', 
    'embraer', 
    'mcdonnell', 
    'lockheed', 
    'douglas', 
    'fokker', 
    'ilyushin'
]

aviation_df.loc[
    (
        ((aviation_df['engine_type'] == 'turbo fan') | (aviation_df['engine_type'] == 'turbo jet')) &
        (aviation_df['purpose_of_flight'] == 'unknown') &
        ((aviation_df['amateur_built'] == 'no') | (aviation_df['amateur_built'] == 'unknown')) &
        (aviation_df['make'].isin(makes_to_include))
    ),
    'purpose_of_flight'
] = 'commercial flight'
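
The combined condition can be read as a single predicate over one record. The sketch below restates it in plain Python; `is_presumed_commercial` and the sample records are hypothetical.

```python
MAKES_TO_INCLUDE = {
    'boeing', 'airbus', 'embraer', 'mcdonnell',
    'lockheed', 'douglas', 'fokker', 'ilyushin',
}
JET_ENGINE_TYPES = {'turbo fan', 'turbo jet'}


def is_presumed_commercial(record: dict) -> bool:
    """True when a record with unknown purpose looks like a commercial jet."""
    return (
        record['engine_type'] in JET_ENGINE_TYPES
        and record['purpose_of_flight'] == 'unknown'
        and record['amateur_built'] in {'no', 'unknown'}
        and record['make'] in MAKES_TO_INCLUDE
    )


# Hypothetical records
records = [
    {'make': 'boeing', 'engine_type': 'turbo fan', 'purpose_of_flight': 'unknown', 'amateur_built': 'no'},
    {'make': 'boeing', 'engine_type': 'turbo fan', 'purpose_of_flight': 'personal', 'amateur_built': 'no'},
    {'make': 'cessna', 'engine_type': 'turbo jet', 'purpose_of_flight': 'unknown', 'amateur_built': 'no'},
]
labels = [is_presumed_commercial(record) for record in records]
print(labels)
```

Only records whose purpose is `'unknown'` are relabeled, so an existing purpose such as personal or business is never overwritten.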

#### Make and model 

In [None]:
# Create 'make_model' column
aviation_df['make_model'] = aviation_df['make'] + ' ' + aviation_df['model']

## Cleaned Dataset

In [None]:
aviation_df.info()

# Business Question 1: Which area of the aviation industry should they get into?

With our stakeholder's first priority being safety, we want to determine which portion of the industry, commercial flights or private flights, is safer.

To measure safety, we are using the total number of injuries (fatal, serious, and minor).

In [None]:
create_comparison_bar_chart(aviation_df, 'purpose_of_flight', 'total_injuries', 'sum', False, 20)

**Review**

Overwhelmingly, the largest share of injuries on board an aircraft occurs on flights made for personal reasons. This could be due to a multitude of reasons, chief among them being flight hours.

Given the data, increasing the hourly requirement may be one way to limit the injuries caused by personal flights.

"A person applying for a private pilot certificate in airplanes, helicopters, and gyro-planes must log at least 40 hours of flight time, of which at least 20 hours are flight training from an authorized instructor and 10 hours of solo flight training in the appropriate areas of operation; three hours of cross country; three hours at night, three hours of instrument time; and other requirements specific to the category and class rating sought.

Private pilots in gliders and lighter-than-air aircraft must have logged from an authorized instructor a similar number of hours and/ or training flights, which include both cross country and solo according to category and class rating sought. Though the regulations require a minimum of 40 hours flight time, in the U. S. the average number of hours for persons without a hearing impairment completing the private pilot certification requirements is approximately 75 hours."

Importantly, commercial flights are slightly less risky than business flights. If we include executive/corporate flight purposes in with business flights, we see that commercial flights are the safer option between private or commercial flights. 

- https://www.faa.gov/faq/what-are-hourly-requirements-becoming-pilot

**Action**

Given that commercial flights have a higher degree of safety (fewer total injuries) than business/executive/corporate flights, we suggest that our stakeholder pursue commercial flights rather than private flights.

As such, we will trim our data set to view only commercial flights so that we can provide targeted insights. 

In [None]:
commercial_flights_df = aviation_df[(aviation_df['purpose_of_flight'] == 'commercial flight')]

In [None]:
commercial_flights_df.info()

# Outliers Detection

## Numerical Columns

In [None]:
def find_outliers_iqr(dataframe, column):
    """
    Find the outliers in the column using the IQR method
    -
    Input: 
        dataframe: Pandas DataFrame
        column: str
    -
    Output:
        Prints:
            - The number of outliers in the column.
            - The percentage of outliers in the column.
    """
    q1 = dataframe[column].quantile(0.25)
    q3 = dataframe[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    
    outliers = dataframe[(dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)]
    percent_outliers = len(outliers) * 100 / len(dataframe)
    
    print(f"Number of outliers: {len(outliers)}")
    print(f"Percentage of outliers: {percent_outliers:.2f}%")

In [None]:
numerical_columns = [
    'total_fatal_injuries',
    'total_serious_injuries',
    'total_minor_injuries',
    'total_uninjured',
    'total_injuries'
]

In [None]:
# Create subplots for box plots
fig, axes = plt.subplots(nrows=len(numerical_columns), ncols=1, figsize=(10, 6 * len(numerical_columns)))

# Plot each numerical column in a separate subplot
for ax, column in zip(axes, numerical_columns):
    sns.boxplot(data=commercial_flights_df, x=column, ax=ax)
    ax.set_title(f'Box Plot for {column}')
    ax.set_xlabel(column)
    ax.set_ylabel('Value')

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Printing out the number/percentage of outliers

for column in numerical_columns:
    print('\n', column)
    find_outliers_iqr(commercial_flights_df, column)

### Explanation of Data

The box and whisker plots above show a line rather than a rectangle for the IQR because the IQR of each column lies between 0 and 1. On the graph, the range covered by the outliers is far larger than the range of the IQR, so the box collapses visually.
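To see why the box collapses, here is a small sketch on made-up injury counts (`toy_counts` is hypothetical and not taken from the dataset). It applies the same 1.5 × IQR rule as `find_outliers_iqr` to a column that is mostly zeros with one large value.

```python
import pandas as pd

# Hypothetical injury counts: mostly zeros with a single large value
toy_counts = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 150])

q1 = toy_counts.quantile(0.25)
q3 = toy_counts.quantile(0.75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr

# Values above the upper fence are flagged as outliers
outliers = toy_counts[toy_counts > upper_bound].tolist()

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, upper bound={upper_bound}")
print(f"Outliers: {outliers}")
```

With an IQR this narrow, any accident with more than a couple of injuries falls outside the fence, which is why the box looks like a line next to the outliers.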

**Handling Outliers**

Even though the outliers skew the columns, we will keep all of the outliers that are in all of the numerical columns because the outliers are relevant data points that provide useful information for each column. 

# Correlation Analysis

## Numerical Columns Correlation

In [None]:
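# Restrict to numeric columns; recent pandas versions raise an error on text columns otherwise
commercial_flights_df.corr(numeric_only=True)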

In [None]:
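# Use correlation matrices and scatter plots to understand relationships between numerical variables
# numeric_only=True skips text columns, which recent pandas versions would otherwise reject
correlation_matrix = commercial_flights_df.corr(numeric_only=True)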

plt.figure(figsize=(12, 8))
plt.matshow(correlation_matrix, cmap='coolwarm', fignum=1)
plt.colorbar(label='correlation coefficient')
plt.xticks(range(len(correlation_matrix.columns)), correlation_matrix.columns, rotation='vertical')
plt.yticks(range(len(correlation_matrix.columns)), correlation_matrix.columns)
plt.title('Correlation Matrix')

plt.show()

In [None]:
def scatter_plot(dataframe, columns):
    """
    Create scatter plots for numerical columns.
    -
    Input:
        dataframe: Pandas DataFrame
        columns: list of column names (strings)
    -
    Output:
        Displays scatter plots for each column pair
    """
    for i in range(len(columns)):
        for j in range(i+1, len(columns)):
            plt.figure(figsize=(12, 8))
            plt.scatter(dataframe[columns[i]], dataframe[columns[j]], alpha=0.5)
            plt.xlabel(columns[i])
            plt.ylabel(columns[j])
            plt.title(f'{columns[i]} vs {columns[j]}')
            plt.grid(True)
            plt.show()

In [None]:
# List of numerical columns to compare
numerical_columns = ['total_fatal_injuries', 'total_serious_injuries', 'total_minor_injuries', 'total_uninjured']

# Call the function with the dataframe and the list of columns
scatter_plot(commercial_flights_df, numerical_columns)

## Categorical Columns Correlation

In [None]:
def chi_squared_test(dataframe, column1, column2):
    """
    Perform a chi-squared test for independence between two categorical variables.
    
    Input:
        dataframe: Pandas DataFrame
        column1: str
        column2: str
        
    Output:
        Returns chi-squared test statistic, p-value, and degrees of freedom.
    """
    contingency_table = pd.crosstab(dataframe[column1], dataframe[column2])
    chi2, p, dof, _ = chi2_contingency(contingency_table)
    return chi2, p, dof

In [None]:
def find_correlating_columns(dataframe, columns_to_compare, alpha=0.01):
    """
    Find the most correlating columns using the chi-squared test.
    
    Input:
        dataframe: Pandas DataFrame
        columns_to_compare: list of str, specific columns to include in the comparison
        alpha: float, significance level
        
    Output:
        DataFrame with pairs of columns and their chi-squared test results.
    """
    # Replace placeholders with NaNs (returns a new DataFrame so the caller's data is not modified)
    dataframe = dataframe.replace(['unknown', 'N/A', 'na', '', ' '], pd.NA)
    
    # Drop rows with NaNs in the specified columns
    dataframe = dataframe[columns_to_compare].dropna()
    
    # Generate pairs of specified columns
    column_pairs = list(combinations(columns_to_compare, 2))
    
    # Perform chi-squared test for each pair
    results = []
    for col1, col2 in column_pairs:
        chi2, p, dof = chi_squared_test(dataframe, col1, col2)
        # Append results if the p-value is less than or equal to the alpha level
        if p <= alpha:
            results.append((col1, col2, chi2, p, dof))
    
    # Create a DataFrame to store results
    results_df = pd.DataFrame(results, columns=['Column 1', 'Column 2', 'Chi-squared', 'P-value', 'Degrees of Freedom'])
    
    # Sort by P-value first and then by Chi-squared test statistic
    results_df.sort_values(by=['P-value', 'Chi-squared'], ascending=[True, False], inplace=True)
    
    return results_df

In [None]:
columns_to_compare = ['total_injuries', 
                      'aircraft_damage',
                      'weather_condition', 
                      'broad_phase_of_flight', 
                      'make_model', 
                      'state', 
                      'year',
                      'month'
]
chi_square_df = commercial_flights_df[columns_to_compare].copy()

correlating_columns = find_correlating_columns(chi_square_df, columns_to_compare, alpha=0.01)
correlating_columns.head(10)

## Results
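The table above lists the top 10 column pairs, sorted by p-value and then by chi-squared statistic.
- The p-value is the probability of seeing a chi-squared statistic at least this large if the two columns were independent.
    - Ranges from 0-1
    - Closer to 0 indicates stronger evidence that the columns are related
    - Closer to 1 indicates weaker evidence that the columns are related

- The chi-squared statistic measures the difference between the observed counts and the counts expected if the columns were independent
    - A higher statistic indicates a larger departure from independence
    - A lower statistic indicates a smaller departure from independence
    - Its size also depends on the number of rows and categories, so it is not directly comparable across column pairs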
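Because the chi-squared statistic grows with the number of rows, an effect-size measure such as Cramér's V can help compare column pairs on a common 0 to 1 scale. The sketch below is one possible way to compute it; `cramers_v` and `toy_df` are hypothetical names that are not part of this notebook, and the toy data is built to be perfectly associated.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(contingency_table):
    """Cramér's V effect size: 0 = no association, 1 = perfect association."""
    # Uncorrected chi-squared statistic, as used in the textbook formula
    chi2, _, _, _ = chi2_contingency(contingency_table, correction=False)
    n = contingency_table.to_numpy().sum()
    min_dim = min(contingency_table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))

# Hypothetical toy data in which weather fully determines damage
toy_df = pd.DataFrame({
    'weather_condition': ['vmc', 'vmc', 'imc', 'imc', 'vmc', 'imc', 'vmc', 'imc'],
    'aircraft_damage': ['minor', 'minor', 'destroyed', 'destroyed',
                        'minor', 'destroyed', 'minor', 'destroyed'],
})

table = pd.crosstab(toy_df['weather_condition'], toy_df['aircraft_damage'])
v = cramers_v(table)
print(f"Cramér's V: {v:.2f}")
```

On real data, a pair with a tiny p-value but a small V would be statistically detectable yet only weakly related in practice.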

# Data Visualization

## Which make and model is most dangerous?

In [None]:
# Sum total_injuries for each make (no amateur-built filter is applied in this cell)
create_comparison_bar_chart(commercial_flights_df, 'make', 'total_injuries', 'sum', False, 50)

**Review**



In [None]:
# Sum total_injuries for each make_model (the combined 'make' and 'model' column)
create_comparison_bar_chart(commercial_flights_df, 'make_model', 'total_injuries', 'sum', False, 50)

## Which state has the most accidents?

In [None]:
create_comparison_bar_chart(commercial_flights_df, 'state', 'total_injuries', 'sum', False, 100)

## Which weather pattern is most dangerous?

In [None]:
create_comparison_bar_chart(commercial_flights_df, 'weather_condition', 'total_injuries', 'sum', False, 10)

## Which phase of flight is most dangerous?

In [None]:
create_comparison_bar_chart(commercial_flights_df, 'broad_phase_of_flight', 'total_injuries', 'sum', False, 10)

## Which engine type is most dangerous?

In [None]:
create_comparison_bar_chart(commercial_flights_df, 'engine_type', 'total_injuries', 'sum', False, 10)

## Does the number of engines mitigate the damage done to the aircraft?

In [None]:
# Group the data by 'aircraft_damage' and 'number_of_engines' and count the frequency of each combination
grouped_data = commercial_flights_df.groupby(['aircraft_damage', 'number_of_engines']).size().unstack(fill_value=0)

# Plot the grouped bar chart
grouped_data.plot(kind='bar', figsize=(10, 6))
plt.title('Comparison of Aircraft Damage by Number of Engines')
plt.xlabel('Aircraft Damage')
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.legend(title='Number of Engines')
plt.tight_layout()
plt.show()

# Trend Analysis

## Number of Aviation Accidents

In [None]:
# Group the DataFrame to show pertinent values
yearly_injuries = commercial_flights_df.groupby('year')['total_injuries'].sum()

# Plot the results
yearly_injuries.plot(kind='line', figsize=(10, 6))
plt.title('Total Injuries by Year')
plt.xlabel('Year')
plt.ylabel('Total Injuries')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Calculate the total number of accidents and fatal injuries per year (commercial flights only)
accidents_per_year = commercial_flights_df.groupby('year').size()
fatal_injuries_per_year = commercial_flights_df.groupby('year')['total_fatal_injuries'].sum()

# Calculate the fatality rate per year (fatal injuries per accident)
fatality_rate_per_year = fatal_injuries_per_year / accidents_per_year

# Plot the fatality rate over the years
plt.figure(figsize=(10, 6))
fatality_rate_per_year.plot(kind='line', color='red', marker='o')
plt.title('Fatality Rate per Year')
plt.xlabel('Year')
plt.ylabel('Fatality Rate')
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Calculate the total number of fatal injuries per month
fatal_injuries_per_month = commercial_flights_df.groupby('month')['total_fatal_injuries'].sum()

# Plot the count of fatal injuries for each month
plt.figure(figsize=(10, 6))
fatal_injuries_per_month.plot(kind='bar', color='blue')
plt.title('Count of Fatal Injuries per Month')
plt.xlabel('Month')
plt.ylabel('Count of Fatal Injuries')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

## Commercial Aircraft

### Do certain commercial aircraft perform better in different weather conditions?

In [None]:
# Group the data by 'make_model' and 'weather_condition', count values, convert to single index DataFrame
weather_counts = commercial_flights_df.groupby(['make_model', 'weather_condition']).size().unstack()

# Sum the counts of the columns ('make_model' by 'weather_condition')
weather_counts_total = weather_counts.sum(axis=1)

# Sort the total counts in descending order
weather_counts_total_sorted = weather_counts_total.sort_values(ascending=False)

# Sorts the weather_counts df by using the indexes of the series weather_counts_total_sorted 
weather_counts_sorted = weather_counts.loc[weather_counts_total_sorted.index]

# Plotting
weather_counts_sorted.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Counts of Incidents by Aircraft Model and Weather Condition')
plt.xlabel('Aircraft Model')
plt.ylabel('Count of Accidents')
plt.xticks(rotation=45)
plt.legend(title='Weather Condition')
plt.tight_layout()
plt.show();

**Review**

- VMC
    - "...is an aviation flight category in which visual flight rules (VFR) flight is permitted—that is, conditions in which pilots have sufficient visibility to fly the aircraft maintaining visual separation from terrain and other aircraft. They are the opposite of instrument meteorological conditions (IMC)."
    - https://en.wikipedia.org/wiki/Visual_meteorological_conditions
        
- IMC
     - "...are weather conditions that require pilots to fly primarily by reference to flight instruments, and therefore under instrument flight rules (IFR), as opposed to flying by outside visual references under visual flight rules (VFR). Typically, this means flying in cloud or poor weather, where little or nothing can be seen or recognised when looking out of the window."
     - https://en.wikipedia.org/wiki/Instrument_meteorological_conditions

The data here shows that the majority of accidents occurred while there was sufficient visibility to fly the aircraft. It is worth looking into the data further to see if there is a way to determine if human error could have resulted in an accident in either IMC or VMC weather conditions. 
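Raw counts are dominated by the models that appear most often, so it can help to look at each model's share of accidents per weather condition instead. The sketch below assumes a made-up DataFrame (`toy_df`, `model_a`, and `model_b` are hypothetical) and uses `pd.crosstab` with row-wise normalization.

```python
import pandas as pd

# Hypothetical toy data; model names and counts are illustrative only
toy_df = pd.DataFrame({
    'make_model': ['model_a', 'model_a', 'model_a', 'model_a', 'model_b', 'model_b'],
    'weather_condition': ['vmc', 'vmc', 'vmc', 'imc', 'vmc', 'imc'],
})

# Each row sums to 1: the share of a model's accidents in each weather condition
weather_share = pd.crosstab(
    toy_df['make_model'], toy_df['weather_condition'], normalize='index'
)
print(weather_share)
```

Plotting `weather_share` as a stacked bar chart would give every model the same bar height, which makes the IMC/VMC mix easier to compare across models.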

### Are certain commercial aircraft more susceptible to human error? 

In [None]:
# Group the data by 'make_model' and 'human_error', count values, convert to single index DataFrame
human_error_counts = commercial_flights_df.groupby(['make_model', 'human_error']).size().unstack()

# Sum the counts of the columns ('make_model' by 'human_error')
human_error_counts_total = human_error_counts.sum(axis=1)

# Sort the total counts in descending order
human_error_counts_total_sorted = human_error_counts_total.sort_values(ascending=False)

# Sorts the human_error_counts df by using the indexes of the series human_error_counts_total_sorted 
human_error_counts_sorted = human_error_counts.loc[human_error_counts_total_sorted.index]

# Plotting
human_error_counts_sorted.plot(kind='bar', stacked=True, figsize=(12, 4))
plt.title('Counts of Accidents by Aircraft Model and Human Error')
plt.xlabel('Aircraft Model')
plt.ylabel('Count of Accidents')
plt.xticks(rotation=45)
plt.legend(title='Human Error')
plt.tight_layout()
plt.show();

**Review**

Fortunately, no accidents involving commercial aircraft were explicitly attributed to human error. It is worth noting that the 'report_status' column did have ~7% missing values, so some accidents may have involved human error that was not recorded. 

Regardless of the weather condition (IMC vs. VMC), the data does not identify pilots of commercial aircraft as the root cause of any accident. 

For our stakeholder, this is good news as it allows a more direct comparison of commercial aircraft: the accidents should be attributable to the aircraft itself rather than to the crew piloting it. 
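Since missing report text could hide human error, it is worth checking how much of the data the flag can actually speak for. The sketch below assumes, hypothetically, that `human_error` is derived from `report_status`; `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the notebook's real columns may be derived differently
toy_df = pd.DataFrame({
    'report_status': ['probable cause', None, 'factual', 'probable cause', None],
    'human_error': [False, False, False, False, False],
})

# Share of rows where the report text is missing
missing_share = toy_df['report_status'].isna().mean()

# Label each row by whether a report exists, then tabulate the human_error flag
report_presence = toy_df['report_status'].isna().map({True: 'missing', False: 'present'})
flag_counts = pd.crosstab(report_presence, toy_df['human_error'])

print(f"Share of missing reports: {missing_share:.0%}")
print(flag_counts)
```

Rows in the `missing` group carry no evidence either way, so a `False` flag there means "unknown" rather than "no human error".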

### Are certain commercial aircraft more durable than others?

In [None]:
# Group the data by 'make_model' and 'aircraft_damage', count values, convert to single index DataFrame
damage_counts = commercial_flights_df.groupby(['make_model', 'aircraft_damage']).size().unstack()

# Sum the counts of the columns ('make_model' by 'aircraft_damage')
damage_counts_total = damage_counts.sum(axis=1)

# Sort the total counts in descending order
damage_counts_total_sorted = damage_counts_total.sort_values(ascending=False)

# Sorts the damage_counts df by using the indexes of the series damage_counts_total_sorted 
damage_counts_sorted = damage_counts.loc[damage_counts_total_sorted.index]

# Plotting
damage_counts_sorted.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Counts of Incidents by Aircraft Model and Aircraft Damage')
plt.xlabel('Aircraft Model')
plt.ylabel('Count of Accidents')
plt.xticks(rotation=45)
plt.legend(title='Aircraft Damage')
plt.tight_layout()
plt.show();

**Review**

While there are a lot of unknown values, one thing is clear. When commercial aircraft get into accidents, they are rarely destroyed. Most of the time the damage is substantial but not beyond repair. 

### What is the safest commercial aircraft?

In [None]:
create_comparison_bar_chart(commercial_flights_df, 'make_model', 'total_injuries', 'sum', False, 50)

# Document Your Finding

## Conclusion:

Based on our findings, we have concluded that these are the top 3 choices for your commercial airline company:

1. **Boeing 787**

- Pros:
    - Advanced technology and fuel efficiency, with a long flying range (8,463 miles).
    - High passenger comfort with modern amenities.
    - No fatalities or hull losses reported.

- Cons:
    - Early operational issues with lithium-ion batteries.
    - Significant quality control issues and production slowdowns.

Verdict: Despite initial issues, the Boeing 787 is a reliable and modern aircraft suitable for long-haul flights, with a strong safety record post-battery redesign.

2. **Airbus A321**

- Pros:
    - Common type rating with other A320-family variants, reducing pilot training costs.
    - Still in production, ensuring parts availability and support.
    - Suitable for short to medium-haul routes (~4,000 miles).

- Cons:
    - Limited range operations, designed for short to medium flights.

Verdict: The Airbus A321 is a versatile, cost-effective option for medium-haul routes, benefiting from commonality with other A320-family aircraft.

3. **Airbus A330**

- Pros:
    - Extensive service history with more than 65 million flight hours.
    - Still in production, ensuring continued support and parts availability.
    - Well-suited for long flying ranges (~8,300 miles).

- Cons:
    - Less fuel efficient, which can result in high operational costs.
    - Older model plane, less advanced technology and lower overall efficiency.
    - Poor engine performance, which can decrease fuel efficiency and cause cabin vibrations.

Verdict: The Airbus A330 is a proven, reliable aircraft for medium to long-haul routes with a strong operational track record.