# PrimeAir Aircraft Purchase Recommendations 

Authors: [Catherine Langley](https://www.linkedin.com/in/catherine-langley-4b904a1ab/),  [Aung Si](https://www.linkedin.com/in/aungsi99/), and  [Sam Whitehurst](https://www.linkedin.com/in/sam-whitehurst23/) 

## Executive Summary 

This project analyzes the potential risks of three sizes of aircraft for private and commercial use by the company's new branch, PrimeAir.

* Descriptive analysis of aviation accidents produced a Danger Zone Scale based on historic accidents related to plane models. 
* Descriptive analysis of the commercial aircraft inventory found the most commonly used commercial planes in three size categories. 
* The Danger Zone Scale was used to identify the aircraft models with the lowest historic reported risk. 

PrimeAir can use this analysis to select the lower risk planes for both commercial and private use from the most popular aircraft in recent commercial inventory. 


## Business Problem

PrimeAir may be able to reduce risks in their industry expansion by including the results of this analysis in their purchase decisions. Following these recommendations will: 

- Improve day-to-day operations by reducing potential interruptions. 
- Reduce the probability of aircraft repair/loss. 
- Reduce the probability of human injury and loss of life. 

## Our Data


### Aviation Accident Database

Our two datasets originate from the National Transportation Safety Board (NTSB) and the Bureau of Transportation Statistics (BTS), respectively. 

The NTSB’s Aviation Accident Database lists reported aviation accidents in the United States and related areas. The NTSB is required by law to investigate all civil aviation accidents and to publicly report its findings. This database is a part of the mandated reporting process.

This database shares all reported accidents with the date of the accident, the aircraft make and model, degree of damage to the aircraft and degree of injury/loss to human life. 

* Basic Facts: 
    * Mandated reporting for accidents
    * Over 87,000 records 
    * 1962* - 2022
* Where we got it: 
    * Download from [Kaggle's Aviation Accident Database & Synopses, up to 2023](https://www.kaggle.com/datasets/khsamaha/aviation-accident-database-synopses). 
    * Includes  AviationData.csv and USState_Codes.csv.
* Original Data: 
    * Download from [NTSB](https://www.ntsb.gov/Pages/AviationQuery.aspx) website.  
* Explanatory Information: 
    * [GILS Aviation Accident Database](https://www.ntsb.gov/GILS/Pages/AviationAccident.aspx) webpage.
    * Includes a full definition of accidents vs. incidents.
* Data Limitations: 
    * Reports are primarily accidents: serious personal injury and serious aircraft damage. 
    * Reported aircraft incidents are in a separate Federal Aviation Administration [incident database](https://www.asias.faa.gov/).

### Inventory of Aircraft Database 

This BTS database lists the yearly aircraft inventory of large certified carriers in the United States.  Each entry includes the year of inventory, the plane's unique serial number, and the manufacturer and model of the aircraft. 

* Basic Facts: 
    * Mandated reporting for large airlines
    * Yearly count 
    * Over 120,000 records
    * 2006 - 2022
* Where we got it: 
    * The Bureau of Transportation Statistics’s (BTS) [Schedule B-43 Database: Annual Inventory of Airframe and Aircraft Engines](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=GEH).
    * Found in T_F41SCHEDULE_B43.csv.
    * We downloaded the data from the website, selecting all years, and all columns into the file. 
* Explanatory information: 
    * The database is part of the  [Form 41 Financial Database: Air Carrier Financial Reports](https://www.transtats.bts.gov/Tables.asp?QO_VQ=EGI&QO_anzr=Nv4%FDPn44vr4%FDSv0n0pvny%FDer21465%FD%FLS14z%FDHE%FDSv0n0pvny%FDQn6n%FM&QO_fu146_anzr=Nv4%FDPn44vr4%FDSv0n0pvny). 
    * Form 41 Financial Reports are collected by the Office of Airline Information of the Bureau of Transportation Statistics from large certified carriers.
    * More information on the BTS [TranStats](https://www.transtats.bts.gov/DatabaseInfo.asp?QO_VQ=EGI&Yv0x=D) webpage.   
* Data Limitations:
    * Only includes large certified carriers, which are carriers with annual operating revenues of 20 million USD or more.

### Special Feature: The Danger Zone Scale 

The Danger Zone Scale assigns a scalar value to each accident based on aircraft damage and human injury.

* Combines weighted values of aircraft damage with injury to human life.
* Aircraft damage is weighted 75% since aircraft malfunction involves potentially greater human injury.
  


## Method

We used descriptive analysis, which provides useful insight into the historic risk level of the ten most common models in recent inventory.

## Our Process

### I. Data Loading

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import utility as u

In [2]:
plane_accidents_raw = pd.read_csv('data/AviationData.csv', encoding='latin-1')
plane_inventory_raw = pd.read_csv('data/T_F41SCHEDULE_B43.csv')



#### A. Plane Accidents Data 
Our plane accidents dataset contains information such as the event ID, location, extent of aircraft damage, injury counts, and much more. The records range from the years 1962 - 2022. In its uncleaned form, this dataset has far more missing information and inconsistencies than our inventory dataset.

In [3]:
plane_accidents_raw.head(3)

|  | Event.Id | Investigation.Type | Accident.Number | Event.Date | Location | Country | Latitude | Longitude | Airport.Code | Airport.Name | ... | Purpose.of.flight | Air.carrier | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured | Weather.Condition | Broad.phase.of.flight | Report.Status | Publication.Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 20001218X45444 | Accident | SEA87LA080 | 1948-10-24 | MOOSE CREEK, ID | United States |  |  |  |  | ... | Personal |  | 2.0 | 0.0 | 0.0 | 0.0 | UNK | Cruise | Probable Cause |  |
| 1 | 20001218X45447 | Accident | LAX94LA336 | 1962-07-19 | BRIDGEPORT, CA | United States |  |  |  |  | ... | Personal |  | 4.0 | 0.0 | 0.0 | 0.0 | UNK | Unknown | Probable Cause | 19-09-1996 |
| 2 | 20061025X01555 | Accident | NYC07LA005 | 1974-08-30 | Saltville, VA | United States | 36.9222 | -81.8781 |  |  | ... | Personal |  | 3.0 |  |  |  | IMC | Cruise | Probable Cause | 26-02-2007 |


#### B. Plane Inventory Data
Our plane inventory dataset contains information such as the serial number of the plane, its number of seats, and its manufacturer and model. The records range from the years 2006 - 2022. Although this dataset has more records than our accidents dataset, it covers far fewer years.

In [4]:
plane_inventory_raw.head(3)

|  | YEAR | CARRIER | CARRIER_NAME | MANUFACTURE_YEAR | UNIQUE_CARRIER_NAME | SERIAL_NUMBER | TAIL_NUMBER | AIRCRAFT_STATUS | OPERATING_STATUS | NUMBER_OF_SEATS | MANUFACTURER | AIRCRAFT_TYPE | MODEL | CAPACITY_IN_POUNDS | ACQUISITION_DATE | AIRLINE_ID | UNIQUE_CARRIER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2006 | 16 | PSA Airlines Inc. | 2003.0 | PSA Airlines Inc. | 7858 | N202PS | B | Y | 50.0 | CANADAIR |  | CRJ-2/4 | 47000.0 | 10/28/2003 12:00:00 AM | 20397.0 | 16 |
| 1 | 2006 | 16 | PSA Airlines Inc. | 2003.0 | PSA Airlines Inc. | 7860 | N206PS | B | Y | 50.0 | CANADAIR |  | CRJ-2/4 | 47000.0 | 10/30/2003 12:00:00 AM | 20397.0 | 16 |
| 2 | 2006 | 16 | PSA Airlines Inc. | 2003.0 | PSA Airlines Inc. | 7873 | N207PS | B | Y | 50.0 | CANADAIR |  | CRJ-2/4 | 47000.0 | 11/26/2003 12:00:00 AM | 20397.0 | 16 |


### II. Data Preparation and Cleaning
We now clean the two datasets by normalizing our column names and only keeping the relevant columns.

#### A. Cleaning the Accidents Data
For our accidents dataset, we're only concerned with information that tells us the year, make, model, extent of aircraft damage and human injury (we will later engineer a score from each of these), and the count of the different types of injuries involved in the accident. To eventually be able to work with the two datasets in tandem, we transformed the make values to all lowercase and kept only the manufacturer names as part of the value. For the model values, we stripped out the non-numeric characters and kept only the first three digits (for example, `CRJ-2/4` becomes `24`).
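
The normalization described above can be sketched as follows. This is a minimal illustration, not the code in `utility.py`: the helper names `normalize_make` and `normalize_model` are hypothetical, and the sketch does not attempt the step that reduces a make to the manufacturer name alone.

```python
import re


def normalize_make(raw_make):
    """Lowercase a make and collapse extra whitespace (hypothetical helper)."""
    return " ".join(str(raw_make).lower().split())


def normalize_model(raw_model, max_digits=3):
    """Drop every non-digit character, then keep at most the first `max_digits` digits."""
    digits = re.sub(r"\D", "", str(raw_model))
    return digits[:max_digits]


print(normalize_make("  CANADAIR "))  # canadair
print(normalize_model("CRJ-2/4"))     # 24
print(normalize_model("737-800"))     # 737
```

Reducing models to a short numeric prefix merges variants of the same family (for example, every 737 variant), which makes the two datasets easier to match at the cost of some detail.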

In [5]:
plane_accidents = u.clean_data_PA(plane_accidents_raw)
plane_accidents.head()

|  | year | aircraft_damage | make | model | total_fatal_injuries | total_serious_injuries | total_minor_injuries |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1948 | Destroyed | stinson | 108 | 2.0 | 0.0 | 0.0 |
| 1 | 1962 | Destroyed | piper | 241 | 4.0 | 0.0 | 0.0 |
| 2 | 1974 | Destroyed | cessna | 172 | 3.0 |  |  |
| 3 | 1977 | Destroyed | rockwell | 112 | 2.0 | 0.0 | 0.0 |
| 4 | 1979 | Destroyed | cessna | 501 | 1.0 | 2.0 |  |


#### B. Cleaning the Inventory Data
For our inventory dataset, we keep the year, make, model, and number of seats columns (we will later engineer a plane size feature from the number of seats). We transformed the makes and models of this dataset the same way we did for our accidents dataset.

In [6]:
plane_inventory = u.clean_data_PI(plane_inventory_raw)
plane_inventory

|  | year | make | model | number_of_seats |
| --- | --- | --- | --- | --- |
| 0 | 2006 | canadair | 24 | 50.0 |
| 1 | 2006 | canadair | 24 | 50.0 |
| 2 | 2006 | canadair | 24 | 50.0 |
| 3 | 2006 | canadair | 24 | 50.0 |
| 4 | 2006 | canadair | 24 | 50.0 |
| ... | ... | ... | ... | ... |
| 124069 | 2022 | bombardier | 600 | 50.0 |
| 124070 | 2022 | bombardier | 600 | 50.0 |
| 124071 | 2022 | bombardier | 600 | 50.0 |
| 124072 | 2022 | bombardier | 600 | 50.0 |


Now that we have our datasets cleaned, we can engineer features that serve as the crux of our analysis.

### III. Feature Engineering - Accidents Data

#### A. Make/Model Feature 
The first feature we engineer is the `make_model` feature. We do this for both datasets so that the same make and model can be matched easily across them.

In [7]:
plane_accidents_1F = u.engineer_make_model_feature_PAPI(plane_accidents)
plane_accidents_1F

|  | year | aircraft_damage | make | model | total_fatal_injuries | total_serious_injuries | total_minor_injuries | make_model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1948 | Destroyed | stinson | 108 | 2.0 | 0.0 | 0.0 | stinson 108 |
| 1 | 1962 | Destroyed | piper | 241 | 4.0 | 0.0 | 0.0 | piper 241 |
| 2 | 1974 | Destroyed | cessna | 172 | 3.0 |  |  | cessna 172 |
| 3 | 1977 | Destroyed | rockwell | 112 | 2.0 | 0.0 | 0.0 | rockwell 112 |
| 4 | 1979 | Destroyed | cessna | 501 | 1.0 | 2.0 |  | cessna 501 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 80307 | 2022 |  | piper | 281 | 0.0 | 1.0 | 0.0 | piper 281 |
| 80308 | 2022 |  | bellanca | 7 | 0.0 | 0.0 | 0.0 | bellanca 7 |
| 80309 | 2022 | Substantial | american champion | 8 | 0.0 | 0.0 | 0.0 | american champion 8 |
| 80310 | 2022 |  | cessna | 210 | 0.0 | 0.0 | 0.0 | cessna 210 |


#### Human Injury and Aircraft Damage 

An accident can be broken down into two metrics: the extent of aircraft damage, and the extent of injury it inflicts upon the passengers involved. We quantified these two metrics below, in order to eventually quantify the danger level of each plane.

#### B. Human Injury Feature 
We now engineer a `human_injury` feature, which categorizes the accidents as such:
1. We categorize the accident as `'Fatal'` if it has $>1$ fatalities.
2. We categorize the accident as `'Serious'` if it has $>1$ serious injuries.
3. We categorize the accident as `'Minor'` if it has $>1$ minor injuries.
4. We categorize the accident as `'Unknown'` if the injury data is missing.
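
The categorization rules above can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation in `utility.py`: the function name `categorize_injury` is hypothetical, the rules are checked in the listed order so the most severe category wins, and the threshold follows the text literally ($>1$). Rows whose counts are present but do not exceed the threshold are not covered by the rules above, so the sketch assumes they also fall back to `'Unknown'`.

```python
import math


def _count(value):
    """Return the count as a float, or None when the value is missing (None or NaN)."""
    if value is None:
        return None
    value = float(value)
    return None if math.isnan(value) else value


def categorize_injury(fatal, serious, minor, threshold=1, fallback="Unknown"):
    """Label an accident by its most severe injury count above `threshold`.

    Assumption: anything that matches no rule (including missing data)
    receives the `fallback` label.
    """
    counts = [
        ("Fatal", _count(fatal)),
        ("Serious", _count(serious)),
        ("Minor", _count(minor)),
    ]
    for label, count in counts:
        if count is not None and count > threshold:
            return label
    return fallback


print(categorize_injury(2, 0, 0))                   # Fatal
print(categorize_injury(0, 3, 5))                   # Serious
print(categorize_injury(None, float("nan"), None))  # Unknown
```

Keeping `threshold` as a parameter makes it easy to check how sensitive the labels are to the cutoff, for example by passing `threshold=0` so that a single injury is enough to set the category.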

From the `human_injury` feature, we compute a raw score that quantifies the categories. The computation for this score is broken down as follows:
1. We give a score of 4 for all accidents labelled `'Fatal'`
2. We give a score of 3 for all accidents labelled `'Serious'`
3. We give a score of 2 for all accidents labelled `'Minor'`
4. We give a score of 1 for all accidents labelled `'Unknown'`

These raw scores are then transformed into our `human_injury_numeric` feature: the raw scores above are min-max scaled and multiplied by 10, giving a normalized score ranging from 0 to 10.

In [8]:
plane_accidents_2F = u.engineer_accident_features_PA(plane_accidents_1F)
plane_accidents_2F

#### C.  Aircraft Damage Feature 
The `'aircraft_damage'` column present in the accidents dataset already categorizes the damage experienced by the aircraft in a particular accident. Similar to how we scored the extent of human injury, we:
1. gave a score of 4 for all accidents labelled `'Destroyed'`
2. gave a score of 3 for all accidents labelled `'Substantial'`
3. gave a score of 2 for all accidents labelled `'Minor'`
4. gave a score of 1 for all missing labels within the `'aircraft_damage'` column.

These raw scores are then transformed into our `'aircraft_damage_numeric'` feature: the raw scores above are min-max scaled and multiplied by 10, giving a normalized score ranging from 0 to 10.

In [None]:
plane_accidents_3F = u.engineer_damage_feature_PA(plane_accidents_2F)
plane_accidents_3F
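
The scoring and scaling for this feature can be sketched as follows. This is an illustration only: `RAW_DAMAGE_SCORE`, `raw_damage_score`, and `min_max_scale_to_ten` are hypothetical names, and the sketch assumes that any label outside the three listed categories is treated like a missing label.

```python
# Raw scores taken from the list above; everything else is scored 1.
RAW_DAMAGE_SCORE = {"Destroyed": 4, "Substantial": 3, "Minor": 2}


def raw_damage_score(label):
    """Map an aircraft_damage label to its raw score (1 for missing/unlisted labels)."""
    return RAW_DAMAGE_SCORE.get(label, 1)


def min_max_scale_to_ten(raw_scores):
    """Min-max scale a list of raw scores, then multiply by 10."""
    lowest, highest = min(raw_scores), max(raw_scores)
    if highest == lowest:
        # All scores identical: avoid dividing by zero.
        return [0.0] * len(raw_scores)
    return [10 * (score - lowest) / (highest - lowest) for score in raw_scores]


labels = ["Destroyed", "Substantial", "Minor", None]
raw = [raw_damage_score(label) for label in labels]
scaled = min_max_scale_to_ten(raw)
print(raw)                             # [4, 3, 2, 1]
print([round(s, 2) for s in scaled])   # [10.0, 6.67, 3.33, 0.0]
```

With raw scores of 1 through 4, the scaled values land on 0, 3.33, 6.67, and 10, so a missing label contributes nothing to the final score.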

#### D. Danger Zone Score Feature 
Now that we have our two scores pertaining to human injuries and aircraft damage, respectively, we can aggregate them to get a danger zone score. The danger zone score is **computed as the weighted sum of `human_injury_numeric` and `aircraft_damage_numeric`**. We placed more weight on the `aircraft_damage_numeric`, as we believed the extent of plane damage is indicative of higher potential of lives lost.

In [None]:
plane_accidents_4F = u.engineer_danger_score_PA(plane_accidents_3F)
plane_accidents_4F
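
The weighted sum can be sketched as below. The 75% weight on aircraft damage comes from the Danger Zone Scale description earlier in this document; the 25% weight on human injury is an assumption (the remainder), and `danger_zone_score` is a hypothetical name rather than the function in `utility.py`.

```python
DAMAGE_WEIGHT = 0.75               # stated in the Danger Zone Scale description
INJURY_WEIGHT = 1 - DAMAGE_WEIGHT  # assumption: the two weights sum to 1


def danger_zone_score(human_injury_numeric, aircraft_damage_numeric):
    """Weighted sum of the two 0-10 scores; the result is also on a 0-10 scale."""
    return (DAMAGE_WEIGHT * aircraft_damage_numeric
            + INJURY_WEIGHT * human_injury_numeric)


print(danger_zone_score(10, 10))  # 10.0
print(danger_zone_score(0, 10))   # 7.5
print(danger_zone_score(10, 0))   # 2.5
```

Because both inputs are already normalized to 0-10, weights that sum to 1 keep the combined score on the same scale.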

### IV. Feature Engineering - Inventory Data

In [None]:
plane_inventory_1F = u.engineer_make_model_feature_PAPI(plane_inventory)

#### A. Plane Size Feature
To give a more granular account of danger levels across planes, we categorized the planes in our inventory dataset by size:
1. Planes with $3-20$ seats are categorized as `'small'`.
2. Planes with $21-100$ seats are categorized as `'medium'`.
3. Planes with $>100$ seats are categorized as `'large'`.

In [None]:
plane_inventory_2F = u.engineer_plane_size_feature_PI(plane_inventory_1F)
plane_inventory_2F
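
The size bins above can be sketched as a small function. The name `plane_size` is hypothetical, and returning `None` for missing seat counts or for planes with fewer than 3 seats is an assumption, since the bins above do not cover those cases.

```python
def plane_size(number_of_seats):
    """Return 'small', 'medium', or 'large' from the seat count, else None."""
    # None or NaN (NaN is the only value not equal to itself).
    if number_of_seats is None or number_of_seats != number_of_seats:
        return None
    if number_of_seats < 3:
        return None  # assumption: below the smallest bin, left uncategorized
    if number_of_seats <= 20:
        return "small"
    if number_of_seats <= 100:
        return "medium"
    return "large"


for seats in (9, 50.0, 180, float("nan")):
    print(seats, plane_size(seats))
```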

## Analysis
To understand where each plane stood relative to each other in terms of the human_injury scores, aircraft damage scores, and danger zone scores, we took the mean of each of these scores with the planes binned into their respective sizes. 

We took the `number_of_planes` column to be the number of occurrences of the plane make, model, and size within the inventory dataset, and the `recorded_accidents_for_plane_model` column to be the number of accidents recorded in the accidents dataset for planes that appear in the inventory dataset. We then computed a `record_accidents_per_plane_in_inventory` metric, which is the latter divided by the former. This estimates the frequency of accidents for a specific model relative to the number of planes in existence; had we not factored in the inventory count, the frequency of accidents for any given plane would be skewed toward the most common models.

In [None]:
final_results = u.ult_df(plane_accidents_4F, plane_inventory_2F)
final_results
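
The accidents-per-plane ratio can be sketched as follows. This is a minimal illustration, not the logic of `u.ult_df`: the function name `accidents_per_plane` is hypothetical, the input lists are made-up toy data, and the sketch assumes that each inventory row counts as one plane. Following the metric's name, the ratio is recorded accidents divided by planes in inventory.

```python
from collections import Counter


def accidents_per_plane(inventory_models, accident_models):
    """Recorded accidents divided by inventory count, for models in the inventory."""
    number_of_planes = Counter(inventory_models)
    recorded_accidents = Counter(accident_models)
    return {
        model: recorded_accidents[model] / count
        for model, count in number_of_planes.items()
    }


# Toy data for illustration only (not taken from either dataset).
inventory = ["boeing 737"] * 4 + ["embraer 175"] * 2
accidents = ["boeing 737", "boeing 737", "cessna 172"]

print(accidents_per_plane(inventory, accidents))
# {'boeing 737': 0.5, 'embraer 175': 0.0}
```

Models that appear only in the accidents data (here `cessna 172`) are dropped, which mirrors the restriction to planes present in the inventory dataset.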

## Results

## Recommendations

Our recommendations for each plane size follow the metrics we have developed and used in analysis: Popularity (Top Ten), Lowest Mean Danger Score, and Accident Ratio. Our recommendations followed by an asterisk exclude higher scoring planes that are discontinued by the manufacturer. 

1. Commercial Planes
    * Large
        * Most Popular Model: Boeing 737
        * Lowest Mean Danger Score Pick: Airbus 330
        * Lowest Accident Ratio Pick: Airbus 319
    * Regional
        * Most Popular Model: Embraer 175
        * Lowest Mean Danger Score Pick: Embraer 175
        * Lowest Accident Ratio Pick: Embraer 175
    * Small - We found we needed data and analysis beyond the scope of the present project to find the top picks.

2. Private Planes
    * Large - We recommend following the current practice of converting a large commercial plane for this purpose.

    * Regional - We found that you may wish to convert a regional commercial plane for this purpose. If so, we recommend the Embraer 175. However, we recommend further analysis of regional size private plane models.

    * Small - As with the commercial planes, finding the best options requires data and analysis beyond the scope of the present project.

## Next Steps 
Additional analysis could provide further depth of knowledge to support our current recommendations. Our recommended next steps are in three groups:

* Exploring this data further  
    * Further cleaning plane model types by removing planes no longer in production and combining similar model types would sharpen results, especially for small commercial planes and for regional and small private planes.
    * Zeroing in on the desired size of plane will give more relevant results. For example, the difference between 120 and 300 seats has a large effect on cost and purpose.
    * Investigating the cause of accidents would allow for further recommendations specific to locales. For instance, an investigation of the weather would support good local plane recommendations.

* Bringing in other aircraft data 
    * Looking into incident-level safety issues would fine-tune the safety score, and more specific analysis describing the cause of these accidents could provide better insight for our stakeholder. For example, if our stakeholder plans to only operate in one area of the country, then it would be valuable to identify typical weather conditions for that area and cross-examine the airplanes that appear to perform poorly in those conditions.
    * Investigating smaller air carrier and private plane inventory will permit accurate risk ratings for small aircraft.
    * Adding further analysis exploring lesser “incident” safety issues compiled by the FAA could further clarify the risk level for each of these aircraft.

* Other factors 
    * Adding a cost analysis for the airplane options we are recommending would offer an additional level of business insight that would be valuable for our stakeholders. 
    * Conducting thorough industry research would provide a stakeholder expanding into a brand new business segment the opportunity to learn from similar peers’ mistakes and ensure that they do not make those same mistakes. 
    
