# Business Problem

Your company is expanding into new industries to diversify its portfolio.

Specifically, it is interested in **purchasing and operating airplanes** for **commercial and private enterprises**, but does not know anything about the potential risks of aircraft.

You are charged with determining which aircraft are the **lowest risk** for the company to start this new business endeavor. 

You must then translate your findings into actionable insights that the head of the new aviation division can use to **help decide which aircraft to purchase**.

## How Do We Define Risk?

- Initial thought: define risk as the sum of the injury columns divided by the total uninjured column. We should also weight these values by the number of people on board (injured plus uninjured).

- Curious if 'Amateur.Built' flights are more dangerous.

- Are all injuries weighted the same? 
    - I would like my plane rides to be injury-free, but what constitutes a minor versus a major injury?
    - A stubbed toe recorded as a minor injury is unrelated to the quality of the plane.
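
One possible reading of the risk notes above is a per-event injury rate: people injured divided by everyone on board. The sketch below uses toy rows (not taken from AviationData.csv), a hypothetical helper named `injury_rate`, and assumes missing counts can be treated as 0. It weights all injury types equally, which the open questions above may argue against.

```python
import pandas as pd

# Toy rows for illustration only; not taken from AviationData.csv.
sample = pd.DataFrame({
    "Total.Fatal.Injuries": [2.0, 0.0, 1.0],
    "Total.Serious.Injuries": [0.0, 1.0, 2.0],
    "Total.Minor.Injuries": [0.0, 0.0, None],
    "Total.Uninjured": [0.0, 3.0, 0.0],
})

injury_cols = [
    "Total.Fatal.Injuries",
    "Total.Serious.Injuries",
    "Total.Minor.Injuries",
]

def injury_rate(frame):
    """Share of people on board who were injured, per event (hypothetical helper)."""
    # Assumption: a missing count means 0 people in that category.
    injured = frame[injury_cols].fillna(0).sum(axis=1)
    on_board = injured + frame["Total.Uninjured"].fillna(0)
    # Events with nobody recorded on board get NaN instead of a division by zero.
    return injured / on_board.where(on_board > 0)

sample["injury_rate"] = injury_rate(sample)
print(sample["injury_rate"])
```

Dividing by everyone on board keeps the value between 0 and 1, so a two-seat plane and a large airliner can be compared on the same scale.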

# Exploratory Data Analysis

## Understand the Data Structure

In [5]:
#Import pandas and load the csv
import pandas as pd
df = pd.read_csv('AviationData.csv', encoding='latin')



In [7]:
df.shape

(88889, 31)

In [10]:
df

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88884,20221227106491,Accident,ERA23LA093,2022-12-26,"Annapolis, MD",United States,,,,,...,Personal,,0.0,1.0,0.0,0.0,,,,29-12-2022
88885,20221227106494,Accident,ERA23LA095,2022-12-26,"Hampton, NH",United States,,,,,...,,,0.0,0.0,0.0,0.0,,,,
88886,20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,341525N,1112021W,PAN,PAYSON,...,Personal,,0.0,0.0,0.0,1.0,VMC,,,27-12-2022
88887,20221227106498,Accident,WPR23LA076,2022-12-26,"Morgan, UT",United States,,,,,...,Personal,MC CESSNA 210N LLC,0.0,0.0,0.0,0.0,,,,


## Data Types and Missing Values

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

- We should rename columns using underscores instead of periods for better readability
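
A minimal sketch of the rename, shown on an empty toy frame that borrows three of the column names. The result is stored in a new variable (`renamed`, a name chosen here) because the cells that follow still use the dotted names.

```python
import pandas as pd

# Empty toy frame that reuses a few of the dataset's column names.
sample = pd.DataFrame(columns=["Event.Id", "Purpose.of.flight", "Total.Fatal.Injuries"])

# Replace every literal period with an underscore; regex=False avoids treating "." as a wildcard.
renamed = sample.copy()
renamed.columns = renamed.columns.str.replace(".", "_", regex=False)

print(list(renamed.columns))
```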

In [12]:
#Percent of missing values in each column
df.isna().sum() * 100 / len(df)

Event.Id                   0.000000
Investigation.Type         0.000000
Accident.Number            0.000000
Event.Date                 0.000000
Location                   0.058500
Country                    0.254250
Latitude                  61.320298
Longitude                 61.330423
Airport.Code              43.601570
Airport.Name              40.708074
Injury.Severity            1.124999
Aircraft.damage            3.593246
Aircraft.Category         63.677170
Registration.Number        1.554748
Make                       0.070875
Model                      0.103500
Amateur.Built              0.114750
Number.of.Engines          6.844491
Engine.Type                7.982990
FAR.Description           63.974170
Schedule                  85.845268
Purpose.of.flight          6.965991
Air.carrier               81.271023
Total.Fatal.Injuries      12.826109
Total.Serious.Injuries    14.073732
Total.Minor.Injuries      13.424608
Total.Uninjured            6.650992
Weather.Condition          5

### Missing Values

- Many columns have missing values, but most of the missingness is in categorical variables. For these, we could fill the gaps with an 'unknown' category.

- For the float injury columns, we can fill missing values with 0.
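
The two fill strategies above could look like the sketch below. It runs on toy rows, and the label `"Unknown"` and the column groupings are assumptions made for illustration. Filling injuries with 0 also assumes that a blank count means nobody was injured, which is worth checking before relying on it.

```python
import pandas as pd

# Toy rows for illustration only.
sample = pd.DataFrame({
    "Weather.Condition": ["VMC", None, "IMC"],
    "Total.Fatal.Injuries": [1.0, None, 0.0],
    "Total.Minor.Injuries": [None, 2.0, 0.0],
})

categorical_cols = ["Weather.Condition"]
injury_cols = ["Total.Fatal.Injuries", "Total.Minor.Injuries"]

filled = sample.copy()
# Categorical gaps become an explicit "Unknown" category.
filled[categorical_cols] = filled[categorical_cols].fillna("Unknown")
# Assumption: a missing injury count means zero injuries.
filled[injury_cols] = filled[injury_cols].fillna(0)

print(filled)
```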

### Data Cleaning Exploration

In [55]:
#With risk defined as total injuries, create a new column called total_injuries
#Note: the sum is NaN whenever any of the three injury columns is missing
df['total_injuries'] = df['Total.Fatal.Injuries'] + df['Total.Serious.Injuries'] + df['Total.Minor.Injuries']

In [15]:
#Check whether rows with a null aircraft category still have identifying information
df[df['Aircraft.Category'].isnull()][['Make', 'Model']]

Unnamed: 0,Make,Model
0,Stinson,108-3
1,Piper,PA24-180
2,Cessna,172M
3,Rockwell,112
4,Cessna,501
...,...,...
88883,AIR TRACTOR,AT502
88884,PIPER,PA-28-151
88885,BELLANCA,7ECA
88887,CESSNA,210N


- The 'Aircraft.Category' column covers many types of aircraft, but we only care about airplanes. There are also a lot of missing values in this column.
- These rows still have a make and model, so we could assume they are airplanes. The safer route would be to label the missing values as unknown.
- Maybe join in another table that lists the makes and models of airplanes?
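
If such a table were available, the join might look like the sketch below. The `lookup` frame, its column names, and the `make_key` helper column are all hypothetical; no external table is part of this notebook. Makes are upper-cased before matching because the data mixes spellings such as 'Cessna' and 'CESSNA'.

```python
import pandas as pd

# Toy events with a missing category, as in the rows shown above.
events = pd.DataFrame({
    "Make": ["Cessna", "PIPER", "Stinson"],
    "Model": ["172M", "PA-28-151", "108-3"],
    "Aircraft.Category": [None, None, None],
})

# Hypothetical lookup table; in practice it would come from an external source.
lookup = pd.DataFrame({
    "make_key": ["CESSNA", "PIPER"],
    "Model": ["172M", "PA-28-151"],
    "lookup_category": ["Airplane", "Airplane"],
})

# Normalize the make so that 'Cessna' and 'CESSNA' match the same lookup row.
events["make_key"] = events["Make"].str.upper().str.strip()

merged = events.merge(lookup, on=["make_key", "Model"], how="left")
# Prefer the recorded category, then the lookup, then an explicit "Unknown".
merged["Aircraft.Category"] = (
    merged["Aircraft.Category"]
    .fillna(merged["lookup_category"])
    .fillna("Unknown")
)

print(merged[["Make", "Model", "Aircraft.Category"]])
```

A left join keeps every event, so rows without a match fall back to "Unknown" rather than being dropped.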

In [17]:
df['Amateur.Built'].value_counts()

Amateur.Built
No     80312
Yes     8475
Name: count, dtype: int64

- This column should be cast to boolean values
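
A sketch of the cast on a toy Series. It uses pandas' nullable `"boolean"` dtype so that the small share of missing values stays missing instead of being forced to True or False; that choice is an assumption, not something decided above.

```python
import pandas as pd

# Toy values for illustration; the real column holds 'Yes', 'No', and a few missing values.
sample = pd.Series(["No", "Yes", None, "No"], name="Amateur.Built")

# Map the text labels to booleans; anything unmapped (including missing) stays missing.
amateur_built = sample.map({"Yes": True, "No": False}).astype("boolean")

print(amateur_built)
```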

In [40]:
df['Location'].str.strip()

0        MOOSE CREEK, ID
1         BRIDGEPORT, CA
2          Saltville, VA
3             EUREKA, CA
4             Canton, OH
              ...       
88884      Annapolis, MD
88885        Hampton, NH
88886         Payson, AZ
88887         Morgan, UT
88888         Athens, GA
Name: Location, Length: 88889, dtype: object

- Split location into city and state
- We will need to handle the missing values in the Location column to do this
    - 0.058500% of values are missing
        - We could drop these rows since the impact is low, or label them as 'unknown, unknown'
        - That way, when the value is split like the others, 'unknown' is passed into both new columns
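
The split could be sketched as follows, using two locations from the output above plus one missing value. It assumes every location follows the 'City, ST' pattern; rows outside the United States may not, and would end up with a missing state. The `city` and `state` column names are chosen here for illustration.

```python
import pandas as pd

# Two locations from the output above plus one missing value.
sample = pd.Series(["MOOSE CREEK, ID", "Saltville, VA", None], name="Location")

# Label missing locations so the split yields 'Unknown' in both new columns.
cleaned = sample.fillna("Unknown, Unknown").str.strip()

# Split on the last comma only, in case a city name itself contains a comma.
parts = cleaned.str.rsplit(",", n=1, expand=True)
location = pd.DataFrame({
    "city": parts[0].str.strip(),
    "state": parts[1].str.strip(),
})

print(location)
```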

In [47]:
df['Schedule'][df['Schedule'].notnull()].value_counts()

Schedule
NSCH    4474
UNK     4099
SCHD    4009
Name: count, dtype: int64

- These values do not appear to be pertinent to determining the riskiness of a plane

In [48]:
df['Air.carrier'][df['Air.carrier'].notnull()].value_counts()

Air.carrier
Pilot                        258
American Airlines             90
United Airlines               89
Delta Air Lines               53
SOUTHWEST AIRLINES CO         42
                            ... 
WOODY CONTRACTING INC          1
Rod Aviation LLC               1
Paul D Franzon                 1
TRAINING SERVICES INC DBA      1
MC CESSNA 210N LLC             1
Name: count, Length: 13590, dtype: int64

In [74]:
df.loc[df['Air.carrier'].notnull(), ['Air.carrier', 'total_injuries']].sort_values(by='total_injuries', ascending= False)[:20]

Unnamed: 0,Air.carrier,total_injuries
75437,MALAYSIAN AIRLINES SYSTEM BERHAD,295.0
23534,United Airlines,283.0
74808,Malaysian Airlines,239.0
66465,Air France,228.0
73847,Asiana Airlines,190.0
84456,Pegasus Airlines,183.0
84369,Ukraine International Airlines,176.0
65201,Spanair,172.0
68171,Air India Charters,165.0
1871,Pan American World A/w,162.0


- The airline or individual company flying the aircraft should not affect the risk of the airplane itself.
- This column could tell us something about how complex a plane is to fly if non-major airlines showed up with high injury counts.
    - But the data does not reflect this. Larger airlines have more injuries because they operate a larger volume of flights.

### Potential Columns to Drop
- Latitude
- Longitude
- FAR.Description
- Schedule
- Air.carrier
- Airport.Code
- Airport.Name
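
A sketch of the drop on an empty toy frame. `errors="ignore"` is used so the call does not fail if a column has already been removed or renamed earlier in the notebook.

```python
import pandas as pd

columns_to_drop = [
    "Latitude",
    "Longitude",
    "FAR.Description",
    "Schedule",
    "Air.carrier",
    "Airport.Code",
    "Airport.Name",
]

# Empty toy frame with the candidate columns plus two columns we want to keep.
sample = pd.DataFrame(columns=columns_to_drop + ["Make", "Model"])

# Returns a new frame; the original is left untouched.
trimmed = sample.drop(columns=columns_to_drop, errors="ignore")

print(list(trimmed.columns))
```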

## Descriptive Statistics

In [8]:
df.describe()

Unnamed: 0,Number.of.Engines,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured
count,82805.0,77488.0,76379.0,76956.0,82977.0
mean,1.146585,0.647855,0.279881,0.357061,5.32544
std,0.44651,5.48596,1.544084,2.235625,27.913634
min,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,0.0,0.0,1.0
75%,1.0,0.0,0.0,0.0,2.0
max,8.0,349.0,161.0,380.0,699.0


### Outliers
- There are large outliers in all of the injury metrics. However, these outliers are important and give us valuable information about the riskiness of the aircraft.

- We should not remove these outliers. 