# Phase 1 Project Data Cleaning / EDA

## Business Problem

The project brief states the problem as follows:

`"Your company is expanding in to new industries to diversify its portfolio. Specifically, they are interested in purchasing and operating airplanes for commercial and private enterprises, but do not know anything about the potential risks of aircraft. You are charged with determining which aircraft are the lowest risk for the company to start this new business endeavor. You must then translate your findings into actionable insights that the head of the new aviation division can use to help decide which aircraft to purchase."`

In this EDA, I do the following:
- Clean the Aviation_Data.csv dataset by:
    - Removing irrelevant measures
    - Removing duplicated records
    - Carefully removing or replacing null values
- Run a few different analyses to determine the lowest risk aircraft

## Importing the Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('bmh')

The initial import ran into an issue, which Nick solved by specifying the data type for columns 6, 7, and 28.

In [2]:
# The dtype keys below are placeholders for the names of columns 6, 7, and 28.
with open('../data/Aviation_Data.csv') as f:
    dtypes = {'Column6Name': 'str', 'Column7Name': 'str', 'Column28Name': 'str'}
    df = pd.read_csv(f, dtype=dtypes, low_memory=False)
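
The keys in the `dtypes` dictionary above look like placeholders rather than real column names. As a minimal sketch, the names can be looked up from the header row by position instead of typed by hand. The helper name `read_csv_str_positions` and the sample data are hypothetical, and the sketch assumes the positions are 0-based, which the notebook does not state.

```python
import io

import pandas as pd


def read_csv_str_positions(path_or_buffer, positions):
    """Read a CSV, forcing the columns at the given 0-based positions to str."""
    # Read only the header row to learn the column names.
    header = pd.read_csv(path_or_buffer, nrows=0).columns
    # Rewind file-like objects so the second read starts from the top.
    if hasattr(path_or_buffer, "seek"):
        path_or_buffer.seek(0)
    dtypes = {header[i]: "str" for i in positions}
    return pd.read_csv(path_or_buffer, dtype=dtypes, low_memory=False)


# Made-up sample: the "code" column would lose its leading zeros if parsed as int.
sample = io.StringIO("id,code,value\n1,007,3.5\n2,012,4.0\n")
demo = read_csv_str_positions(sample, positions=[1])
print(demo["code"].tolist())
```

For the real file, the call would be something like `read_csv_str_positions('../data/Aviation_Data.csv', [6, 7, 28])`, again assuming 0-based positions.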

In [3]:
df.head(2)

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90348 entries, 0 to 90347
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      90348 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50249 non-null  object 
 9   Airport.Name            52790 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87572 non-null  object 
 14  Make                    88826 non-null

Okay, so we've got 90,348 records and 31 columns. Let's clean this up a bit.

## Cleaning the Data

#### Keeping / Removing Columns for These Reasons

Keeping
- Event ID and Accident Number: keeping for now to screen for duplicates in next step*
- Event.Date: might be useful
- Location and Country: might be useful
- Injury Severity, Aircraft Damage: will be useful
- Aircraft Category: Shows many records not involving airplanes, which we can remove later*
- Make and Model: will be useful
- Amateur Built: can probably use to filter records out. About 8.5k of the roughly 90k records are listed as amateur built*
- Number of Engines: will need further investigation. Some show 0 engines, most show one. I would imagine we would only be looking at 2+ engine planes for enterprise use*
- Engine Type: useful
- Purpose of Flight: useful for determining cause of accident
- Total Injuries columns: useful
- Weather Conditions: will help determine possible cause

*Return to these measures for further cleaning

Removing
- Investigation Type: running `.value_counts()` on it shows it is not useful
- Latitude and Longitude: mostly null, also redundant since we have Location (city, state)
- Airport.Code and Airport Name: about half null, also irrelevant to determining safety
- Registration Number: irrelevant
- FAR Description: Stands for Federal Aviation Regulation description. Not useful.
- Schedule: Mostly null, also irrelevant
- Air Carrier: mostly null, also irrelevant
- Broad Phase of Flight: Enough null values for a categorical data point that we should exclude
- Publication Date: when the data was published is irrelevant
- Report Status: Not helpful
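
Several of the removal reasons above rest on a column being "mostly null" or "about half null". A minimal sketch for putting a number on that, using a hypothetical helper `null_share` and made-up data:

```python
import numpy as np
import pandas as pd


def null_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, highest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Made-up frame standing in for the aviation data.
toy = pd.DataFrame({
    "Latitude": [np.nan, np.nan, np.nan, "36.9"],
    "Make": ["Cessna", "Piper", np.nan, "Boeing"],
    "Event.Id": ["a", "b", "c", "d"],
})
shares = null_share(toy)
print(shares)
```

Running `null_share(df)` before the drop would give a per-column figure to compare against whatever cutoff we choose.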


In [5]:
df.drop(['Investigation.Type','Latitude', 'Longitude', 'Airport.Code', 
         'Airport.Name', 'Registration.Number', 'FAR.Description', 
         'Schedule', 'Air.carrier', 'Broad.phase.of.flight', 
         'Publication.Date', 'Report.Status'], axis=1, inplace=True)

In [6]:
df.head(2)

Unnamed: 0,Event.Id,Accident.Number,Event.Date,Location,Country,Injury.Severity,Aircraft.damage,Aircraft.Category,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition
0,20001218X45444,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,Fatal(2),Destroyed,,Stinson,108-3,No,1.0,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK
1,20001218X45447,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,Fatal(4),Destroyed,,Piper,PA24-180,No,1.0,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK


#### Drop Null Event IDs

In [7]:
# Hiding output for space

df[df['Event.Id'].isna()].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 64030 to 90097
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                0 non-null      object 
 1   Accident.Number         0 non-null      object 
 2   Event.Date              0 non-null      object 
 3   Location                0 non-null      object 
 4   Country                 0 non-null      object 
 5   Injury.Severity         0 non-null      object 
 6   Aircraft.damage         0 non-null      object 
 7   Aircraft.Category       0 non-null      object 
 8   Make                    0 non-null      object 
 9   Model                   0 non-null      object 
 10  Amateur.Built           0 non-null      object 
 11  Number.of.Engines       0 non-null      float64
 12  Engine.Type             0 non-null      object 
 13  Purpose.of.flight       0 non-null      object 
 14  Total.Fatal.Injuries    0 non-null 

Every column is null in the rows where Event ID is null, so let's drop those records.
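
Rather than reading the `.info()` output by eye, the same check can be done in one expression. This is a minimal sketch on made-up data, and `rows_all_null` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd


def rows_all_null(frame: pd.DataFrame, key: str) -> bool:
    """True if every row with a missing `key` is missing in all columns."""
    missing_key = frame[frame[key].isna()]
    return bool(missing_key.isna().all().all())


# Made-up frame: two rows are completely empty.
toy = pd.DataFrame({
    "Event.Id": ["20001218X45444", np.nan, np.nan],
    "Make": ["Stinson", np.nan, np.nan],
    "Number.of.Engines": [1.0, np.nan, np.nan],
})
print(rows_all_null(toy, "Event.Id"))

# A row with a missing key but a known Make fails the check.
toy_partial = toy.copy()
toy_partial.loc[1, "Make"] = "Piper"
print(rows_all_null(toy_partial, "Event.Id"))
```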

In [8]:
df.dropna(subset=['Event.Id'], inplace=True)

####  Filter for US Data

As a new business, we will likely only be operating in the US, so let's only look at US data.

In [9]:
#Get non-US indexes and drop

foreign_indexes = df[df['Country'] != 'United States'].index

df.drop(index=foreign_indexes, inplace=True)
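
One thing to keep in mind with the filter above: a null `Country` is also unequal to `'United States'`, so those rows are dropped along with the foreign ones. A minimal sketch on made-up data (the `toy` frame is hypothetical) shows the behavior and an equivalent mask-based filter:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Country": ["United States", "Canada", np.nan]})

# NaN compares unequal to every string, so this mask flags the null row too.
not_us = toy["Country"] != "United States"
dropped_nulls = int((not_us & toy["Country"].isna()).sum())
print(dropped_nulls)

# Keeping only positive matches gives the same rows without an in-place drop.
us_only = toy[toy["Country"] == "United States"]
print(len(us_only))
```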

#### Look for duplicates

In [10]:
# New df containing duplicated Event ID's

df_duplicates1 = df[df.duplicated(subset=['Event.Id'], keep=False)].sort_values('Event.Id')

df_duplicates1

Unnamed: 0,Event.Id,Accident.Number,Event.Date,Location,Country,Injury.Severity,Aircraft.damage,Aircraft.Category,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition
45562,20001204X00086,LAX99LA086B,1999-01-29,"MADERA, CA",United States,Non-Fatal,Minor,,Coelho,RV4,Yes,1.0,Reciprocating,Personal,0.0,0.0,0.0,2.0,VMC
45564,20001204X00086,LAX99LA086A,1999-01-29,"MADERA, CA",United States,Non-Fatal,Substantial,,Cessna,150L,No,1.0,Reciprocating,Personal,0.0,0.0,0.0,2.0,VMC
45703,20001205X00276,CHI99IA100B,1999-03-02,"SALINA, KS",United States,Incident,,,Lockheed,L-1O11-385-1-15,No,3.0,Turbo Jet,Unknown,0.0,0.0,0.0,6.0,VMC
45704,20001205X00276,CHI99IA100A,1999-03-02,"SALINA, KS",United States,Incident,,,Mcdonnell Douglas,DC-10,No,3.0,Turbo Jet,Unknown,0.0,0.0,0.0,6.0,VMC
45716,20001205X00305,DEN99LA047B,1999-03-05,"DENVER, CO",United States,Non-Fatal,Minor,,Swearingen,SA226TC,No,2.0,Turbo Prop,Unknown,0.0,0.0,0.0,2.0,VMC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90236,20221112106276,CEN23MA034,2022-11-12,"Dallas, TX",United States,Fatal,Destroyed,Airplane,BOEING,B17,No,4.0,,ASHO,6.0,0.0,0.0,0.0,VMC
90254,20221121106336,WPR23LA041,2022-11-18,"Las Vegas, NV",United States,Non-Fatal,Minor,Helicopter,ROBINSON HELICOPTER,R44,No,1.0,,Instructional,0.0,0.0,0.0,3.0,VMC
90255,20221121106336,WPR23LA041,2022-11-18,"Las Vegas, NV",United States,Non-Fatal,Substantial,Airplane,CESSNA,172M,No,1.0,,Instructional,0.0,0.0,0.0,3.0,VMC
90272,20221123106354,WPR23LA045,2022-11-22,"San Diego, CA",United States,Non-Fatal,Substantial,Helicopter,SIKORSKY,UH-60A,No,2.0,,Instructional,0.0,0.0,0.0,4.0,VMC


At first we thought these were data-entry errors, but upon further inspection (Google), we realized the double listings are collisions, each involving two different aircraft. Let's leave these in.
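
To back up the manual check, we could separate events that list more than one aircraft from rows that repeat in every column. This is a minimal sketch on made-up rows, assuming that a different Make or Model under the same Event ID means a second aircraft:

```python
import pandas as pd

# Made-up rows: E1 is a two-aircraft event, E2 is an exact repeat.
toy = pd.DataFrame({
    "Event.Id": ["E1", "E1", "E2", "E2", "E3"],
    "Accident.Number": ["A1B", "A1A", "A2", "A2", "A3"],
    "Make": ["Coelho", "Cessna", "Piper", "Piper", "Boeing"],
    "Model": ["RV4", "150L", "PA24", "PA24", "B17"],
})

# Rows whose Event ID appears more than once.
shared_id = toy[toy.duplicated(subset=["Event.Id"], keep=False)]

# Count distinct (Make, Model) pairs per Event ID.
aircraft_per_event = (
    shared_id.drop_duplicates(subset=["Event.Id", "Make", "Model"])
    .groupby("Event.Id")
    .size()
)
multi_aircraft = aircraft_per_event[aircraft_per_event > 1].index.tolist()

# Rows that are identical in every column are true duplicates.
exact_repeats = toy[toy.duplicated(keep=False)]
print(multi_aircraft, len(exact_repeats))
```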

In [13]:
# Let's check along the Accident Number column

# df.duplicated(subset=['Accident.Number']).value_counts()

For cleanliness, we could drop the Accident Number column and keep the Event ID for indexing. The drop is commented out below, so Accident.Number still appears in the later outputs.

In [14]:
# df.drop(['Accident.Number'], axis=1, inplace=True)

####  Drop any non-airplane records

Now let's look at Aircraft Category and exclude any non-airplane records

In [11]:
# These are the records we need to drop, hiding for space

df[(df['Aircraft.Category'] != 'Airplane') & (df['Aircraft.Category'].notna())]

Unnamed: 0,Event.Id,Accident.Number,Event.Date,Location,Country,Injury.Severity,Aircraft.damage,Aircraft.Category,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition
16,20020917X01962,DEN82DTM08,1982-01-02,"MIDWAY, UT",United States,Non-Fatal,Destroyed,Helicopter,Enstrom,280C,No,1.0,Reciprocating,Personal,0.0,0.0,0.0,1.0,IMC
19,20020917X02339,MIA82DA028,1982-01-02,"MIAMI, FL",United States,Non-Fatal,Substantial,Helicopter,Smith,WCS-222 (BELL 47G),No,1.0,Reciprocating,Personal,0.0,0.0,0.0,2.0,VMC
22,20020917X01657,ATL82DA027,1982-01-02,"CHAMBLEE, GA",United States,Non-Fatal,Substantial,Helicopter,Bell,206L-1,No,1.0,Turbo Shaft,Unknown,0.0,0.0,0.0,1.0,VMC
46,20020917X02157,LAX82DA039,1982-01-06,"MAMMOTH LAKES, CA",United States,Non-Fatal,Substantial,Helicopter,Aerospatiale,SA-316B,No,1.0,Turbo Shaft,Business,0.0,0.0,0.0,6.0,VMC
62,20020917X02247,LAX82DVG13,1982-01-09,"CALISTOGA, CA",United States,Non-Fatal,Substantial,Glider,Schleicher,ASW 20,No,0.0,Unknown,Personal,0.0,0.0,0.0,1.0,VMC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90287,20221128106373,ERA23LA073,2022-11-27,"Titusville, FL",United States,Non-Fatal,Substantial,Helicopter,CHILDS MICHAEL A,ROTORWAY EXEC 162-F,No,1.0,,Instructional,0.0,0.0,0.0,2.0,VMC
90301,20221204106407,ERA23FA078,2022-12-04,"Beverly, MA",United States,Fatal,Substantial,Gyrocraft,ROTORSPORT UK LTD,CAVALON,No,1.0,,Personal,1.0,0.0,0.0,0.0,VMC
90313,20221209106435,WPR23LA061,2022-12-07,"Waimea, HI",United States,Minor,Substantial,Helicopter,EUROCOPTER,EC 130 B4,No,1.0,,,0.0,1.0,0.0,6.0,
90315,20221212106442,WPR23LA063,2022-12-08,"La Sal, UT",United States,Non-Fatal,Substantial,Helicopter,HUGHES,369D,No,1.0,,Other Work Use,0.0,0.0,0.0,2.0,VMC


In [12]:
# Get their indexes and drop

nonplane_indexes = df[(df['Aircraft.Category'] != 'Airplane') 
                      & (df['Aircraft.Category'].notna())].index

df.drop(index=nonplane_indexes, inplace=True)

In [13]:
# Now let's drop that column

df.drop(['Aircraft.Category'], axis=1, inplace=True)

####  Drop Engine Counts Below 2

Let's look further into Engine Counts. We can probably drop all records for single-engine aircraft, as our company probably will not be using prop planes. But first: what's going on with the 0 engines?

In [14]:
df['Number.of.Engines'].value_counts()

1.0    65412
2.0     9892
0.0      610
3.0      430
4.0      335
6.0        1
8.0        1
Name: Number.of.Engines, dtype: int64

In [15]:
df[df['Number.of.Engines'] == 0]

Unnamed: 0,Event.Id,Accident.Number,Event.Date,Location,Country,Injury.Severity,Aircraft.damage,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition
1505,20020917X03932,NYC82DA121,1982-06-09,"SHREWSBURY, PA",United States,Non-Fatal,Substantial,Scheicher,K8B,No,0.0,Unknown,Personal,0.0,0.0,0.0,1.0,VMC
3606,20001214X42064,MKC83LA051,1983-01-02,"INDIANOLA, IA",United States,Non-Fatal,,Balloon Works,FIREFLY 7B,No,0.0,Unknown,Personal,0.0,1.0,0.0,1.0,VMC
3659,20001214X42066,MKC83LA053,1983-01-08,"GREENWOOD, MO",United States,Non-Fatal,Substantial,Balloon Works,FIRE FLY 7-B,No,0.0,Unknown,Instructional,0.0,0.0,0.0,2.0,VMC
3951,20001214X42143,ATL83LA123,1983-02-21,"WOODBINE, MD",United States,Non-Fatal,Substantial,Scheibe Flugzeugbau,L SPATZ-55,No,0.0,Unknown,Personal,0.0,0.0,0.0,1.0,VMC
4093,20001214X42553,NYC83LA076,1983-03-12,"BUENA VISTA, PA",United States,Non-Fatal,Substantial,Burkhart Grob,G10Z ASTIR CS,No,0.0,Unknown,Personal,0.0,0.0,0.0,1.0,VMC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63301,20070907X01321,DEN07LA152,2007-09-02,"HUTCHINSON, KS",United States,Non-Fatal,Substantial,Slingsby,Swallow Type T.45,No,0.0,,Glider Tow,,1.0,,,VMC
63538,20071029X01673,DEN08LA004,2007-10-08,"ALBUQUERQUE, NM",United States,Fatal(1),Substantial,Aerostar,S-66A,No,0.0,,Personal,1.0,2.0,,2.0,VMC
63634,20071030X01689,DEN08LA017,2007-10-26,"Salida, CO",United States,Fatal(1),Destroyed,Schempp-hirth,Ventus B/16.6,No,0.0,,Personal,1.0,,,,VMC
64890,20080827X01334,CHI08CA202,2008-07-05,"Beloit, WI",United States,Non-Fatal,Substantial,AB Sportine Aviacija,Genesis 2,No,0.0,,Personal,0.0,0.0,0.0,1.0,VMC


With a bit of googling, we find that these are gliders and balloons. Exclude!

In [16]:
# Get their indexes and drop

engines_1and0_indexes = df[(df['Number.of.Engines'] == 0) 
                      | (df['Number.of.Engines'] == 1)].index

df.drop(index=engines_1and0_indexes, inplace=True)

####  Drop Amateur-Built Planes

Now let's drop any amateur-built planes. We certainly are not in the market for those. 

In [17]:
# Get their indexes and drop

amateur_indexes = df[(df['Amateur.Built'] == 'Yes')].index

df.drop(index=amateur_indexes, inplace=True)

In [18]:
# Now let's drop that column

df.drop(['Amateur.Built'], axis=1, inplace=True)

####  Filter on Injury Severity

Where Injury Severity is listed as 'Incident', Aircraft Damage is null or Minor and Injuries are null, zero, or low.

Where Injury Severity is null, Injuries are null, zero, or low. 

Let's drop these records.

In [19]:
df[(df['Injury.Severity'] == 'Incident') | (df['Injury.Severity'].isna())]

Unnamed: 0,Event.Id,Accident.Number,Event.Date,Location,Country,Injury.Severity,Aircraft.damage,Make,Model,Number.of.Engines,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition
79,20020917X01897,CHI82IA026,1982-01-12,"CHICAGO, IL",United States,Incident,,Lockheed,L-1011,3.0,Turbo Fan,Unknown,0.0,0.0,0.0,149.0,UNK
80,20020917X01765,ATL82IA034,1982-01-12,"CLARKSBURG, WV",United States,Incident,Minor,Embraer,EMB-110P1,2.0,Turbo Prop,Unknown,0.0,0.0,0.0,2.0,VMC
119,20020917X01766,ATL82IA038,1982-01-19,"WASHINGTON, DC",United States,Incident,Minor,De Havilland,DHC-6-300,2.0,Turbo Prop,Ferry,0.0,0.0,0.0,1.0,IMC
131,20020917X02334,LAX82IA044,1982-01-20,"SAN JOSE, CA",United States,Incident,Minor,Piper,PA-31-350,2.0,Reciprocating,Executive/corporate,0.0,0.0,0.0,2.0,VMC
149,20020917X01767,ATL82IA041,1982-01-22,"LOUISVILLE, KY",United States,Incident,,Dassault/sud,FALCON 20,2.0,Turbo Fan,Unknown,0.0,0.0,0.0,2.0,VMC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90321,20221212106440,ERA23LA084,2022-12-11,"Fox Island, NY",United States,,,ROBINSON HELICOPTER,R22 BETA,,,Instructional,0.0,0.0,0.0,0.0,
90333,20221215106462,CEN23LA064,2022-12-15,"Patterson, LA",United States,,,BELL,206-L4,,,,0.0,0.0,0.0,0.0,
90338,20221219106472,DCA23LA096,2022-12-18,"Kahului, HI",United States,,,AIRBUS,A330-243,,,,0.0,0.0,0.0,0.0,
90344,20221227106494,ERA23LA095,2022-12-26,"Hampton, NH",United States,,,BELLANCA,7ECA,,,,0.0,0.0,0.0,0.0,


In [20]:
# Get their indexes and drop

incident_null_indexes = df[(df['Injury.Severity'] == 'Incident') | (df['Injury.Severity'].isna())].index

df.drop(index=incident_null_indexes, inplace=True)
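
The outputs above show fatal records labeled both as `Fatal` and as `Fatal(n)` with a count in parentheses. If we later group by Injury Severity, it may help to collapse these into one label. A minimal sketch on made-up values, assuming the count always appears as trailing digits in parentheses:

```python
import pandas as pd

# Made-up values mirroring the label styles seen in the outputs above.
severity = pd.Series(["Fatal(2)", "Fatal", "Non-Fatal", "Fatal(1)", "Minor"])

# Strip a trailing "(digits)" suffix so "Fatal(2)" and "Fatal" share one label.
normalized = severity.str.replace(r"\(\d+\)$", "", regex=True)
print(normalized.tolist())
```

The fatality count itself is not lost, since it is also recorded in Total.Fatal.Injuries.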

####  Drop Null Makes / Models

Let's drop records where Make or Model is null.

In [21]:
null_make_model_index = df[(df['Make'].isna()) | (df['Model'].isna())].index

df.drop(index=null_make_model_index, inplace=True)

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10440 entries, 4 to 90347
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                10440 non-null  object 
 1   Accident.Number         10440 non-null  object 
 2   Event.Date              10440 non-null  object 
 3   Location                10435 non-null  object 
 4   Country                 10440 non-null  object 
 5   Injury.Severity         10440 non-null  object 
 6   Aircraft.damage         9619 non-null   object 
 7   Make                    10440 non-null  object 
 8   Model                   10440 non-null  object 
 9   Number.of.Engines       9092 non-null   float64
 10  Engine.Type             9877 non-null   object 
 11  Purpose.of.flight       9136 non-null   object 
 12  Total.Fatal.Injuries    9202 non-null   float64
 13  Total.Serious.Injuries  9055 non-null   float64
 14  Total.Minor.Injuries    8994 non-null 

####  Drop Null Aircraft Damage

In [None]:
# Get indexes and drop

damage_null_indexes = df[df['Aircraft.damage'].isna()].index

df.drop(index=damage_null_indexes, inplace=True)

####  Fill in Engine Counts and Types

Now, with a cleaner data set, I can wrap my head around some potential questions. In general, I'd like to know which Makes, Models, Engine Counts, and Engine Types are most/least involved in accidents, which I would like to measure by Injury Severity, Aircraft Damage, and Fatal and Serious Injury count.

Before going further, I'd like to try to fill in some null values for Engine Count and Engine Type by making a dictionary of existing counts and types by Model. Let's see if it works.

In [23]:
# Starting number

df['Number.of.Engines'].count()

9092

In [24]:
# Creating a dictionary using Model as keys and Number of Engines (non-null) as values

engine_count_dict = (df[df['Number.of.Engines'].notna()]
                     .set_index('Model')['Number.of.Engines'].to_dict())

In [25]:
# Using this dictionary to fill in some null values in Number of Engines

df['Number.of.Engines'] = df['Number.of.Engines'].fillna(df['Model'].map(engine_count_dict))

In [26]:
# Ending number

df['Number.of.Engines'].count()

9265

Got another 173 (9,092 to 9,265)! Now let's do this for Engine Type.

In [27]:
# Starting number

df['Engine.Type'].count()

9877

In [28]:
# {Model: Engine Type}, then use to fill in nulls. Got another 396!

engine_type_dict = (df[df['Engine.Type'].notna()]
                    .set_index('Model')['Engine.Type'].to_dict())

df['Engine.Type'] = df['Engine.Type'].fillna(df['Model'].map(engine_type_dict))

df['Engine.Type'].count()

10273
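
One caveat with the two fills above: `to_dict()` on a Series with repeated Model labels keeps only the last value seen for each Model, so a single odd record can decide the fill value. A minimal sketch of an alternative that uses the most frequent value per Model instead. The data is made up, and this is a swapped-in technique, not what the notebook ran:

```python
import numpy as np
import pandas as pd

# Made-up frame: the last known "737" row disagrees with the other two.
toy = pd.DataFrame({
    "Model": ["737", "737", "737", "737", "A320", "X1"],
    "Number.of.Engines": [2.0, 2.0, 4.0, np.nan, np.nan, np.nan],
})

known = toy.dropna(subset=["Number.of.Engines"])

# The dictionary approach keeps the last value per Model.
last_value = known.set_index("Model")["Number.of.Engines"].to_dict()

# Most frequent engine count per Model; ties resolve to the smallest value.
mode_by_model = known.groupby("Model")["Number.of.Engines"].agg(
    lambda s: s.mode().iloc[0]
)

# Models with no known count stay null.
filled = toy["Number.of.Engines"].fillna(toy["Model"].map(mode_by_model))
print(last_value, filled.tolist())
```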

#### Clean Up Makes

Get the values in the Makes column to match case.

In [48]:
# Convert Make column to title case

df['Make'] = df['Make'].str.title()

With several hundred distinct values in the Make column (counted below), we've got too many Makers to properly analyze. Let's select a group of the largest manufacturers, then filter out the rest. We used this resource (http://www.fi-aeroweb.com/US-Commercial-Aircraft-Fleet.html) to narrow the list down. We also added a few that have been consolidated into larger manufacturers, such as Learjet (bought by Bombardier).

In [53]:
current_makes_list = df['Make'].value_counts().index.tolist()
len(current_makes_list)

369

In [54]:
keep_makes_list = ['Boeing', 'Airbus', 'Bombardier', 'Embraer', 
                   'Cessna', 'Mcdonnell Douglas', 'ATR', 'Gulfstream', 
                   'Lockheed', 'Convair', 'Douglas', 'Dassault', 
                   'CASA', 'Hawker', 'Curtiss', 'Pilatus', 'Beech',
                   'Honda', 'Raytheon', 'Learjet']

Many of the Make values in our set have been entered inconsistently. This function returns the first manufacturer from our list whose name appears within a Make value, so we don't lose records that contain one of our manufacturers.

In [58]:
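def planemaker(maker, manufacturers):
    # Return the first listed manufacturer whose name appears in `maker`.
    # Returns None when nothing matches, which marks the row for dropping.
    for manufacturer in manufacturers:
        if manufacturer in maker:
            return manufacturer
    return None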

In [59]:
df['Manufacturer'] = df['Make'].apply(lambda x: planemaker(x, keep_makes_list))

In [61]:
# Get indexes and drop

manufacturer_null_indexes = df[df['Manufacturer'].isna()].index

df.drop(index=manufacturer_null_indexes, inplace=True)

In [65]:
df['Manufacturer'].value_counts()

Cessna               2211
Beech                1843
Boeing                338
Learjet               152
Mcdonnell Douglas     140
Douglas               139
Embraer                89
Bombardier             77
Airbus                 59
Lockheed               57
Gulfstream             55
Raytheon               48
Dassault               44
Hawker                 37
Convair                32
Curtiss                 7
Honda                   4
Pilatus                 3
Name: Manufacturer, dtype: int64
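
Note that ATR and CASA are in `keep_makes_list` but not in the counts above. Because Make was converted to title case, a value like `'ATR'` becomes `'Atr'`, so the case-sensitive `in` check can never match the all-caps list entries. A minimal sketch of a case-insensitive variant, where `match_manufacturer` and the sample names are hypothetical:

```python
def match_manufacturer(make, manufacturers):
    """Return the first manufacturer whose name occurs in `make`, ignoring case."""
    if not isinstance(make, str):
        return None
    lowered = make.lower()
    for manufacturer in manufacturers:
        if manufacturer.lower() in lowered:
            return manufacturer
    return None


# Order matters: "Mcdonnell Douglas" must come before "Douglas".
names = ["Mcdonnell Douglas", "Douglas", "ATR", "CASA", "Cessna"]
print(match_manufacturer("Atr", names))
print(match_manufacturer("Cessna Aircraft Co", names))
print(match_manufacturer("Mcdonnell Douglas Corp", names))
```

Substring matching on short names can produce false positives (any Make containing the letters "atr" or "casa" would match), so the matched rows are worth a spot check.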
