# MA0218 Mini Project 
---
#### Selected Dataset 4: `Aviation Accident Database`
#### Problem Statement: `Predicting the Probability of Fatality in Aviation Accident/Incident using classification modelling.`
#### Designed By: `MA8 - Group AngKuKueh`
> Ang Jun Jie U1822901A  
> Lewis Lee U1820229F  
> Ong Jun Yu U1920988L  
> Tan AIk Lim Philip U1821641K  
> Tay Song Heng Denzil U1823710F  

---
# Data Preparation

In [None]:
#Import the standard libraries
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

In [None]:
#Import csv file
aviationData = pd.read_csv("../input/dataset/AviationData.csv", encoding = 'iso-8859-1')

pd.set_option('display.max_columns',31)
aviationData.head()

In [None]:
aviationData.info()

In [None]:
#Check for significance of the null values

for i in aviationData.columns:
    data_missing = np.mean(aviationData[i].isnull())
    print('{} - {}% , {}'.format(i, round(data_missing*100), aviationData[i].isna().sum()))
    

#### Considering the significance of null values and our initial judgement on the importance of each variable, we proceed to clean the data with the following variables of interest:
`Event.Date`, `Latitude`, `Longitude`, `Injury.Severity`, `Aircraft.Damage`, `Make`, `Amateur.Built`, `Number.of.Engines`, `Engine.Type`, `Purpose.of.Flight`, `Weather.Condition`, `Broad.Phase.of.Flight`, `Report.Status`
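
The null-value check can also be written without an explicit loop. The sketch below is one possible way to do it; the helper name `missing_summary` and the toy dataframe are our own illustrations and are not part of the dataset.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of null values per column, worst first."""
    summary = pd.DataFrame({
        "missing_count": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })
    return summary.sort_values("missing_pct", ascending=False)

# Toy frame standing in for aviationData (values are made up)
toy = pd.DataFrame({
    "Latitude": [None, None, 34.1, None],
    "Make": ["CESSNA", None, "PIPER", "BOEING"],
    "Fatal": ["No", "Yes", "No", "No"],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(aviationData)` would give the same information as the loop above in a single sortable table.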

---
# Data Cleaning

In [None]:
#Understand the breakdown of the variables before cleaning
variables = ['Event.Date',
            'Latitude',
            'Longitude',
            'Injury.Severity',
            'Aircraft.Damage',
            'Make',
            'Amateur.Built',
            'Number.of.Engines',
            'Engine.Type',
            'Purpose.of.Flight',
            'Weather.Condition',
            'Broad.Phase.of.Flight',
            'Report.Status']

for x in variables:
    print(x, ': ', aviationData[x].value_counts(), '\n')

In [None]:
#Event.Date
#Split Event.Date into separate Month, Day and Year columns
aviationData['Month'] = aviationData['Event.Date'].str.split('/', expand=True)[0]
aviationData['Day'] = aviationData['Event.Date'].str.split('/', expand=True)[1]
aviationData['Year'] = aviationData['Event.Date'].str.split('/', expand=True)[2]

In [None]:
#Latitude and Longitude
#Around 64% of the values are missing, so imputing them with the mean or
#median of Latitude and Longitude would not be meaningful.
#The rows with null values are dropped later, in the Latitude/Longitude analysis.


In [None]:
#Injury.Severity
#Categorise Injury.Severity into Fatal ('Yes') and Non-Fatal ('No')
#Incident and Unavailable records are assumed to be Non-Fatal
#Note: any other value, including a null, falls through to 'Yes'
aviationData['Fatal'] = aviationData['Injury.Severity'].apply(lambda x: 'No' 
                                                              if x=='Non-Fatal' 
                                                              or x== 'Incident' 
                                                              or x=='Unavailable' 
                                                              else 'Yes')

In [None]:
#Aircraft.Damage
#Fill the null values of Aircraft.Damage with the most common recurring value
aviationData['Aircraft.Damage'] = aviationData['Aircraft.Damage'].fillna(aviationData['Aircraft.Damage'].mode()[0])

In [None]:
#Make
#Standardise Make to uppercase letters (to remove complications)
aviationData['Make'] = aviationData['Make'].str.upper()

#Fill null values with 'Others'
aviationData['Make'] = aviationData['Make'].fillna('Others')

#Group makes with an insignificant sample size (<1% of total, using a cut-off of 850 records) into 'Others'
make_others = aviationData["Make"].value_counts() < 850
aviationData["Make"] = aviationData["Make"].apply(lambda x: 'Others' if make_others.loc[x] else x)

In [None]:
#Amateur.Built
#Fill null values of Amateur.Built with the most common recurring data
aviationData['Amateur.Built'] = aviationData['Amateur.Built'].fillna(aviationData['Amateur.Built'].mode()[0])

In [None]:
#Number.of.Engines
#Fill null values of Number.of.Engines with the most common recurring data
aviationData['Number.of.Engines'] = aviationData['Number.of.Engines'].fillna(aviationData['Number.of.Engines'].mode()[0])

#Convert data type of Number.of.Engines to int64 for Exploratory Analysis
aviationData['Number.of.Engines'] = aviationData['Number.of.Engines'].astype('int64')

#Simplify dataset by representing 3 or more engines as 3
aviationData['Number.of.Engines'] = aviationData['Number.of.Engines'].replace(4, 3)
aviationData['Number.of.Engines'] = aviationData['Number.of.Engines'].replace(8, 3)
aviationData['Number.of.Engines'].value_counts()

In [None]:
#Engine.Type
#Fill null values of Engine.Type with the most common recurring data
aviationData['Engine.Type'] = aviationData['Engine.Type'].fillna(aviationData['Engine.Type'].mode()[0])

#Group engine types with an insignificant sample size (<1% of total) as 'Others'
type_others = aviationData['Engine.Type'].value_counts() < 850
aviationData['Engine.Type'] = aviationData['Engine.Type'].apply(lambda x: 'Others' if type_others.loc[x] else x)

In [None]:
#Purpose.of.Flight
#Fill null values of Purpose.of.Flight as 'Unknown'
aviationData['Purpose.of.Flight'] = aviationData['Purpose.of.Flight'].fillna('Unknown')

#Group similar purposes of flight together
#Public Aircraft
aviationData['Purpose.of.Flight'] = aviationData['Purpose.of.Flight'].replace(
    ['Public Aircraft - Local', 'Public Aircraft - Federal', 'Public Aircraft - State'],
    'Public Aircraft')

#Work use
aviationData['Purpose.of.Flight'] = aviationData['Purpose.of.Flight'].replace(
    ['Executive/Corporate', 'Other Work Use', 'Business'],
    'Work Use')

#Group purposes of flight with an insignificant sample size (<1% of total) as 'Others'
purpose_others = aviationData['Purpose.of.Flight'].value_counts() < 850
aviationData['Purpose.of.Flight'] = aviationData['Purpose.of.Flight'].apply(lambda x: 'Others' if purpose_others.loc[x] else x)

In [None]:
#Weather.Condition
#Fill null values of Weather.Condition with the most common recurring data
aviationData['Weather.Condition'] = aviationData['Weather.Condition'].fillna(aviationData['Weather.Condition'].mode()[0])

In [None]:
#Broad.Phase.of.Flight
#Fill null values of Broad.Phase.of.Flight as 'UNKNOWN'
aviationData['Broad.Phase.of.Flight'] = aviationData['Broad.Phase.of.Flight'].fillna('UNKNOWN')

In [None]:
#Report.Status
#No steps required for cleaning

In [None]:
#Breakdown of variables after cleaning
for x in variables:
    print(x, ': ', aviationData[x].value_counts(), '\n')

In [None]:
#Rearrange the order of columns
#Exclude 'Day' as it is unnecessary in analysis
#Exclude Latitude and Longitude as they will be put in a separate dataframe/analysis
clean_data = pd.DataFrame(aviationData[['Year',
                                        'Month',
                                        'Aircraft.Damage',
                                        'Make',
                                        'Amateur.Built',
                                        'Number.of.Engines',
                                        'Engine.Type',
                                        'Purpose.of.Flight',
                                        'Weather.Condition',
                                        'Broad.Phase.of.Flight',
                                        'Report.Status',
                                        'Fatal']])

In [None]:
#Latitude and Longitude dataframe
LatLong_data = pd.DataFrame(aviationData[['Latitude', 'Longitude', 'Fatal']])

In [None]:
#Check for null values(if any)
for i in clean_data.columns:
    data_missing = np.mean(clean_data[i].isnull())
    print('{} - {}% , {}'.format(i, round(data_missing*100), clean_data[i].isna().sum()))
    

In [None]:
clean_data.head()

### _Data Cleaning completed_ 
#### Datasets to be used for Data Exploration and Analysis:
`clean_data` and `LatLong_data`

---
# Data Exploration and Analysis

#### ANALYSIS OF ACCIDENTS OVER THE YEARS

In [None]:
#Arrange by years(ascending)
clean_data = clean_data.sort_values(by = 'Year', ascending = True)

#Count plot of number of accidents every year
f, axes = plt.subplots(1, 1, figsize=(24, 10))
ax = sb.countplot(x = 'Year', data = clean_data)
ax.axes.set_title("Total Accidents Each Year",fontsize=20)

#### YEARLY FATALITY COUNT

In [None]:
#Count plot of Fatal & Non-Fatal every year
f, axes = plt.subplots(1, 1, figsize=(30, 10))
ax = sb.countplot(x = 'Year', hue = 'Fatal', data = clean_data, palette = 'Set1')
ax.axes.set_title("Fatal:Non Fatal Ratio Over the Years",fontsize=20)

#### YEARLY PERCENTAGE OF FATALITY COUNT

In [None]:
#Percentage of fatality over the years
dataset = pd.DataFrame(clean_data.groupby('Year')['Fatal'].count())
dataset = dataset.rename(columns={'Fatal': "Total"})
dataset['Year'] = dataset.index
fatal_count = []
for yr in dataset['Year']:
    data1 = clean_data.loc[clean_data['Year']== yr]
    data1 = len(data1.loc[data1['Fatal'] =='Yes'])
    fatal_count.append(data1)
dataset['Fatal_Count'] = fatal_count
dataset['Percentage'] = dataset['Fatal_Count']/dataset['Total'] * 100
dataset = dataset.reset_index(drop=True) #Avoids hard-coding the number of years

f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.barplot(x = 'Percentage', y = 'Year', data = dataset, orient = 'h')

From the above plots, we observe that the number of reported accidents has been decreasing over the years. This is largely due to a decline in non-fatal accidents, as the number of fatal accidents remains approximately the same. The fatality rate from 1948 to 1979 appears very high, but this is an anomaly: very few accidents were reported during that period, as the aviation industry was still developing.
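
The fatality percentage per group is computed with the same loop for every variable in this section. As a sketch, the same table can be produced with a single `groupby`; the helper name `fatality_rate_by` and the toy data are hypothetical and only assume a `Fatal` column holding `'Yes'`/`'No'`, as in `clean_data`.

```python
import pandas as pd

def fatality_rate_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Total accidents, fatal accidents and fatality percentage per category."""
    is_fatal = df["Fatal"].eq("Yes")
    out = (
        is_fatal.groupby(df[column])
        .agg(Total="count", Fatal_Count="sum")
        .reset_index()
    )
    out["Percentage"] = out["Fatal_Count"] / out["Total"] * 100
    return out

# Toy data standing in for clean_data (values are made up)
toy = pd.DataFrame({
    "Year": ["1982", "1982", "1983", "1983", "1983"],
    "Fatal": ["Yes", "No", "No", "No", "Yes"],
})
rates = fatality_rate_by(toy, "Year")
print(rates)
```

Because the result is built from the groups themselves, no hard-coded number of categories is needed, and the same call works for `Month`, `Make` or any other column.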

#### TOTAL NUMBER OF ACCIDENTS SORTED BY MONTH

In [None]:
#Look deeper into the time series for possible patterns within a year
#Map the month numbers to abbreviated month names
month_names = {'1': 'Jan', '2': 'Feb', '3': 'Mar', '4': 'Apr',
               '5': 'May', '6': 'Jun', '7': 'Jul', '8': 'Aug',
               '9': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
clean_data['Month'] = clean_data['Month'].replace(month_names)

#Count plot of accidents over the months
f, axes = plt.subplots(1, 1, figsize=(24, 10))
ax = sb.countplot(x = 'Month', data = clean_data, order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
ax.axes.set_title("Total Accidents Each Month",fontsize=20)

#### FATALITY RATES SORTED BY MONTH

In [None]:
#Count plot of fatal and non-fatal accidents over the months
f, axes = plt.subplots(1, 1, figsize=(24, 10))
ax = sb.countplot(x = 'Month', data = clean_data, hue = 'Fatal', order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], palette = 'Set1')
ax.axes.set_title("Fatal:Non Fatal Ratio Each Month",fontsize=20)

#### FATALITY PERCENTAGE SORTED BY MONTH

In [None]:
dataset = pd.DataFrame(clean_data.groupby('Month')['Fatal'].count())
dataset = dataset.rename(columns={'Fatal': "Total"})
dataset['Month'] = dataset.index
fatal_count = []
for mon in dataset['Month']:
    data1 = clean_data.loc[clean_data['Month']== mon]
    data1 = len(data1.loc[data1['Fatal'] =='Yes'])
    fatal_count.append(data1)
dataset['Fatal_Count'] = fatal_count
dataset['Percentage'] = dataset['Fatal_Count']/dataset['Total'] * 100
dataset = dataset.reset_index(drop=True) #Avoids hard-coding the number of months

f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.barplot(x = 'Percentage', y = 'Month', data = dataset, order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

It is interesting to note that the majority of accidents occur in June and July, yet these months have the lowest fatality percentages.

The reverse holds for the months with fewer accidents, at the start and end of the year: they have higher fatality percentages.

#### ANALYSIS OF AIRCRAFT DAMAGE

In [None]:
f, axes = plt.subplots(1, 1, figsize=(24, 10))
ax = sb.countplot(x = 'Aircraft.Damage', hue = 'Fatal', data = clean_data, palette = 'Set1')
ax.axes.set_title("Fatal:Non Fatal Ratio of Aircraft Damage",fontsize=20)

#Show Percentages of the Total Count
AircraftDamage_data = clean_data['Aircraft.Damage']
total = float(len(AircraftDamage_data))
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2., height + 100, '{:1.2f}'.format(height*100/total),
            ha="center")

From the above plot, we observe that most aircraft suffer substantial damage in an accident. When the aircraft is destroyed, the fatality rate is higher, which is logical.

However, the fatality rate is also high when aircraft suffer minor damage, yet low when they suffer substantial damage. This is counterintuitive, as we would expect a higher rate of death with greater damage to the aircraft. Nevertheless, this might be a biased statistic due to the significant difference in counts between the categories.

#### ANALYSIS OF MAKE

In [None]:
f, axes = plt.subplots(1, 1, figsize=(24, 10))
ax = sb.countplot(x = 'Make', hue = 'Fatal', data = clean_data, palette = 'Set1')

#Show Percentages of the Total Count
make_data = clean_data['Make']
total = float(len(make_data))
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2., height + 100, '{:1.2f}'.format(height*100/total),
            ha="center")
    
#percentage fatality of Make
dataset = pd.DataFrame(clean_data.groupby('Make')['Fatal'].count())
dataset = dataset.rename(columns={'Fatal': "Total"})
dataset['Make'] = dataset.index
fatal_count = []
for i in dataset['Make']:
    data1 = clean_data.loc[clean_data['Make']== i]
    data1 = len(data1.loc[data1['Fatal'] =='Yes'])
    fatal_count.append(data1)
dataset['Fatal_Count'] = fatal_count
dataset['Percentage'] = dataset['Fatal_Count']/dataset['Total'] * 100
dataset = dataset.reset_index(drop=True) #Avoids hard-coding the number of makes

f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.barplot(x = 'Percentage', y = 'Make', data = dataset)

From the above plots, we observe that `CESSNA` has the highest accident count, with one of the lowest fatality percentages. `BOEING` has an exceptionally low fatality percentage, while `BEECH`, `MOONEY` and `ROBINSON` have the highest. However, the accident counts for `BOEING`, `BEECH`, `MOONEY` and `ROBINSON` are the lowest, so these statistics might be biased by the smaller sample sizes. It is still worth noting that `BOEING` has both a low accident count and a low fatality rate.
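
One way to make the sample-size caveat concrete is to attach an interval to each fatality percentage. The sketch below uses the Wilson score interval, which is our own choice of technique and not something computed elsewhere in this notebook; the counts are made up for illustration.

```python
import math

def wilson_interval(fatal, total, z=1.96):
    """Approximate 95% Wilson score interval for the proportion fatal/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = fatal / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Same observed rate (20%), very different sample sizes (hypothetical counts)
small = wilson_interval(20, 100)
large = wilson_interval(2000, 10000)
print("n=100:   %.3f to %.3f" % small)
print("n=10000: %.3f to %.3f" % large)
```

The interval for the smaller group is much wider, which is why the percentages of makes with few records should be compared with caution.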

#### ANALYSIS OF AMATEUR BUILT

In [None]:
f, axes = plt.subplots(1, 1, figsize=(24, 10))
ax = sb.countplot(x = 'Amateur.Built', hue = 'Fatal', data = clean_data, palette = 'Set1')

#Percentage fatality of amateur built
dataset = pd.DataFrame(clean_data.groupby('Amateur.Built')['Fatal'].count())
dataset = dataset.rename(columns={'Fatal': "Total"})
dataset['Amateur.Built'] = dataset.index
fatal_count = []
for i in dataset['Amateur.Built']:
    data1 = clean_data.loc[clean_data['Amateur.Built']== i]
    data1 = len(data1.loc[data1['Fatal'] =='Yes'])
    fatal_count.append(data1)
dataset['Fatal_Count'] = fatal_count
dataset['Percentage'] = dataset['Fatal_Count']/dataset['Total'] * 100
dataset = dataset.reset_index(drop=True) #Avoids hard-coding the number of categories

f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.barplot(x = 'Percentage', y = 'Amateur.Built', data = dataset, palette = 'RdBu_r')

From the above plots, we observe that `Non-amateur built` aircraft have a higher accident count, which may be due to the significantly larger number of such aircraft and the frequency with which they are flown. `Amateur built` aircraft have a higher fatality rate, which may be explained by their purpose of build and flight, but could also be a biased statistic due to the significant difference in sample size.

#### ANALYSIS OF NUMBER OF ENGINES

In [None]:
#countplot of number of engines
f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.countplot(x = 'Number.of.Engines', hue = 'Fatal', data = clean_data, palette = 'Set1')

#Percentage fatality of Number of engines
dataset = pd.DataFrame(clean_data.groupby('Number.of.Engines')['Fatal'].count())
dataset = dataset.rename(columns={'Fatal': "Total"})
dataset['Number.of.Engines'] = dataset.index
fatal_count = []
for i in dataset['Number.of.Engines']:
    data1 = clean_data.loc[clean_data['Number.of.Engines']== i]
    data1 = len(data1.loc[data1['Fatal'] =='Yes'])
    fatal_count.append(data1)
dataset['Fatal_Count'] = fatal_count
dataset['Percentage'] = dataset['Fatal_Count']/dataset['Total'] * 100
dataset = dataset.reset_index(drop=True) #Avoids hard-coding the number of categories

f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.barplot(x = 'Percentage', y = 'Number.of.Engines', data = dataset, orient = 'h', palette = 'RdBu_r')

From the above plots, we observe that aircraft with `1 engine` have the highest accident count. However, this may be because single-engine aircraft are the most common. Another observation is that as the `number of engines` increases from `0 to 2`, the fatality rate increases as well. However, this observation may be biased due to the significant difference in sample counts.

#### ANALYSIS OF ENGINE TYPE

In [None]:
#countplot of engine types
f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.countplot(x = 'Engine.Type', hue = 'Fatal', data = clean_data, palette = 'Set1')

#Percentage fatality of Engine type
dataset = pd.DataFrame(clean_data.groupby('Engine.Type')['Fatal'].count())
dataset = dataset.rename(columns={'Fatal': "Total"})
dataset['Engine.Type'] = dataset.index
fatal_count = []
for i in dataset['Engine.Type']:
    data1 = clean_data.loc[clean_data['Engine.Type']== i]
    data1 = len(data1.loc[data1['Fatal'] =='Yes'])
    fatal_count.append(data1)
dataset['Fatal_Count'] = fatal_count
dataset['Percentage'] = dataset['Fatal_Count']/dataset['Total'] * 100
dataset = dataset.reset_index(drop=True) #Avoids hard-coding the number of engine types

f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.barplot(x = 'Percentage', y = 'Engine.Type', data = dataset, palette = 'RdBu_r')

From the above plots, we observe that aircraft with the `Reciprocating` engine type have the highest accident count. However, this may be because `Reciprocating` engines are the most common in aircraft. `Turbo Fan` has an exceptionally low fatality rate while `Turbo Prop` has the highest. Nevertheless, this might be a biased statistic due to the small sample counts for `Turbo Fan` and `Turbo Prop`.

#### ANALYSIS OF PURPOSE OF FLIGHT

In [None]:
#countplot of purpose of flight
f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.countplot(x = 'Purpose.of.Flight', hue = 'Fatal', data = clean_data, palette = 'Set1')

#Percentage fatality of Purpose of Flight
dataset = pd.DataFrame(clean_data.groupby('Purpose.of.Flight')['Fatal'].count())
dataset = dataset.rename(columns={'Fatal': "Total"})
dataset['Purpose.of.Flight'] = dataset.index
fatal_count = []
for i in dataset['Purpose.of.Flight']:
    data1 = clean_data.loc[clean_data['Purpose.of.Flight']== i]
    data1 = len(data1.loc[data1['Fatal'] =='Yes'])
    fatal_count.append(data1)
dataset['Fatal_Count'] = fatal_count
dataset['Percentage'] = dataset['Fatal_Count']/dataset['Total'] * 100
dataset = dataset.reset_index(drop=True) #Avoids hard-coding the number of purposes

f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.barplot(x = 'Percentage', y = 'Purpose.of.Flight', data = dataset, palette = 'RdBu_r')

From the above plots, we observe that aircraft flown for `Personal` use have the highest accident count. However, this may be because `Personal` use is the most common purpose of flight. `Aerial Application` and `Instructional` flights have exceptionally low fatality rates, which could be due to the nature of the flights and the experience of the pilots. `Work Use` and `Others` have the highest fatality rates. Nevertheless, this might be a biased statistic due to the small sample sizes.

#### ANALYSIS OF WEATHER CONDITION

In [None]:
#countplot of weather condition
f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.countplot(x = 'Weather.Condition', hue = 'Fatal', data = clean_data, palette = 'Set1')

#Percentage fatality of Weather condition
dataset = pd.DataFrame(clean_data.groupby('Weather.Condition')['Fatal'].count())
dataset = dataset.rename(columns={'Fatal': "Total"})
dataset['Weather.Condition'] = dataset.index
fatal_count = []
for i in dataset['Weather.Condition']:
    data1 = clean_data.loc[clean_data['Weather.Condition']== i]
    data1 = len(data1.loc[data1['Fatal'] =='Yes'])
    fatal_count.append(data1)
dataset['Fatal_Count'] = fatal_count
dataset['Percentage'] = dataset['Fatal_Count']/dataset['Total'] * 100
dataset = dataset.reset_index(drop=True) #Avoids hard-coding the number of weather conditions

f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.barplot(x = 'Percentage', y = 'Weather.Condition', data = dataset, palette = 'RdBu_r')

##### VMC - Visual Meteorological Conditions
##### IMC - Instrument Meteorological Conditions
##### UNK - Unknown

From the above plots, we observe that flights in `VMC` have the highest accident count. However, this may be because `VMC` is the most common weather condition. `VMC` also has an exceptionally low fatality rate, as expected given the ease of flying and of managing adversity in milder, safer weather.

The fatality rate is higher in `IMC` or `UNK` weather conditions. Logically, flying and managing a crisis in worse weather is likely to be more challenging, which explains the higher fatality rate. The fatality rate can be observed to be highly related to the weather condition.
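
To read the relationship between a categorical variable and `Fatal` off a single table, a row-normalised cross-tabulation can be used. This is a minimal sketch on made-up data; on the notebook's data the two arguments would be `clean_data['Weather.Condition']` and `clean_data['Fatal']`.

```python
import pandas as pd

# Toy data standing in for clean_data (values are made up)
toy = pd.DataFrame({
    "Weather.Condition": ["VMC", "VMC", "VMC", "VMC", "IMC", "IMC", "UNK", "UNK"],
    "Fatal": ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"],
})

# Each row sums to 100: share of fatal / non-fatal accidents per weather condition
rates = pd.crosstab(toy["Weather.Condition"], toy["Fatal"], normalize="index") * 100
print(rates.round(1))
```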

#### ANALYSIS OF BROAD PHASE OF FLIGHT

In [None]:
#countplot of broad phase of flight
f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.countplot(x = 'Broad.Phase.of.Flight', hue = 'Fatal', data = clean_data, palette = 'Set1')

#Percentage fatality of Broad Phase of Flight
dataset = pd.DataFrame(clean_data.groupby('Broad.Phase.of.Flight')['Fatal'].count())
dataset = dataset.rename(columns={'Fatal': "Total"})
dataset['Broad.Phase.of.Flight'] = dataset.index
fatal_count = []
for i in dataset['Broad.Phase.of.Flight']:
    data1 = clean_data.loc[clean_data['Broad.Phase.of.Flight']== i]
    data1 = len(data1.loc[data1['Fatal'] =='Yes'])
    fatal_count.append(data1)
dataset['Fatal_Count'] = fatal_count
dataset['Percentage'] = dataset['Fatal_Count']/dataset['Total'] * 100
dataset = dataset.reset_index(drop=True) #Avoids hard-coding the number of phases

f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.barplot(x = 'Percentage', y = 'Broad.Phase.of.Flight', data = dataset, palette = 'RdBu_r')

From the above plots, we observe that the `Landing` and `Taxi` phases have exceptionally low fatality rates. The `Maneuvering` and `Unknown` phases of flight have the highest fatality rates.

#### ANALYSIS OF REPORT STATUS

In [None]:
#countplot of report status
f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.countplot(x = 'Report.Status', hue = 'Fatal', data = clean_data, palette='Set1')

#Percentage fatality of report status
dataset = pd.DataFrame(clean_data.groupby('Report.Status')['Fatal'].count())
dataset = dataset.rename(columns={'Fatal': "Total"})
dataset['Report.Status'] = dataset.index
fatal_count = []
for i in dataset['Report.Status']:
    data1 = clean_data.loc[clean_data['Report.Status']== i]
    data1 = len(data1.loc[data1['Fatal'] =='Yes'])
    fatal_count.append(data1)
dataset['Fatal_Count'] = fatal_count
dataset['Percentage'] = dataset['Fatal_Count']/dataset['Total'] * 100
dataset = dataset.reset_index(drop=True) #Avoids hard-coding the number of report statuses

f, axes = plt.subplots(1, 1, figsize=(24, 10))
sb.barplot(x = 'Percentage', y = 'Report.Status', data = dataset, palette = 'RdBu_r')

From the above plots, we observe that the `Probable Cause` report status has the highest accident count. However, this might be because `Probable Cause` is the most common report status. The `Factual` report status has the lowest fatality rate, while the `Foreign` report status has the highest. Nevertheless, this might be a biased statistic due to the small sample sizes for `Foreign` and `Factual` reports.

#### ANALYSIS OF LATITUDE AND LONGITUDE

In [None]:
#Check for null values
for i in LatLong_data.columns:
    latlong_missing = np.mean(LatLong_data[i].isnull())
    print('{} - {}% , {}'.format(i, round(latlong_missing*100), LatLong_data[i].isna().sum()))
    

In [None]:
#Drop the null values
LatLong_data = LatLong_data.dropna()

#check for null values (if any)
for i in LatLong_data.columns:
    latlong_missing = np.mean(LatLong_data[i].isnull())
    print('{} - {}% , {}'.format(i, round(latlong_missing*100), LatLong_data[i].isna().sum()))

In [None]:
LatLong_data.describe()

In [None]:
#Observe that max (Longitude) = 435.833334, a value beyond the possible range of Longitude
#Latitude: -90 to 90
#Longitude: -180 to 180

#Remove the anomalies by keeping only the rows within the valid ranges (bounds included)

valid_lat = LatLong_data['Latitude'].between(-90, 90)
valid_long = LatLong_data['Longitude'].between(-180, 180)
LatLong_data = LatLong_data[valid_lat & valid_long]

LatLong_data.describe() #Check

In [None]:
#scatter plot to visualize the location of the accidents/incidents

sb.set_style("darkgrid")
f, axes = plt.subplots(1, 1, figsize = (24, 14))
sb.scatterplot(x = 'Longitude', y = 'Latitude', data = LatLong_data)

In [None]:
#scatter plot to visualize the location of the accidents/incidents that are fatal and non-fatal

f, axes = plt.subplots(1, 1, figsize = (24, 14))
sb.scatterplot(x = 'Longitude', y = 'Latitude', hue = 'Fatal', data = LatLong_data)

From the plots above, we observe that the points trace the shape of the world map, as expected since the axes are longitude and latitude. The majority of reported accidents are in the United States. Another observation is that the majority of accidents outside the United States resulted in fatalities. This is consistent with our observation from the analysis of Report Status, where an accident with a `Foreign` report is likely to be fatal.

### Conclusion:  
#### The possible factors that are likely to affect the fatality of an accident/incident: 
`Aircraft Damage`, `Purpose of Flight`, `Weather Conditions`, `Broad phase of flight` and `Report Status`.

---
# Modelling and Predictions

Reference: <br>
https://machinelearningmastery.com/feature-selection-with-categorical-data/

### Categorical Feature Selection

Using `OrdinalEncoder` and `LabelEncoder` to encode each variable as integers.
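
As a small illustration of what the two encoders produce, the sketch below encodes a toy feature matrix and target. The toy values are made up; only the scikit-learn classes are the ones used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy features (weather, phase of flight) and target; values are made up
X_toy = np.array([["VMC", "Landing"], ["IMC", "Cruise"], ["VMC", "Takeoff"]])
y_toy = np.array(["No", "Yes", "No"])

oe = OrdinalEncoder()
X_enc = oe.fit_transform(X_toy)   # one integer code per category, per column

le = LabelEncoder()
y_enc = le.fit_transform(y_toy)   # classes are sorted, so 'No' -> 0 and 'Yes' -> 1

print(oe.categories_)
print(X_enc)
print(le.classes_, y_enc)
```

Note that the integer codes follow the sorted order of the category names, so the ordering they imply between categories is arbitrary.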

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif

In [None]:
#Allocating features / target variable
X = pd.DataFrame(clean_data[['Month',              
                             'Aircraft.Damage',     
                             'Make',
                             'Amateur.Built',
                             'Number.of.Engines',
                             'Engine.Type',
                             'Purpose.of.Flight',
                             'Weather.Condition',
                             'Broad.Phase.of.Flight',
                             'Report.Status',]])

y = pd.DataFrame(clean_data["Fatal"]) #Target Variable

# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train.values.ravel())
    y_train_enc = le.transform(y_train.values.ravel())
    y_test_enc = le.transform(y_test.values.ravel())
    return y_train_enc, y_test_enc

#Split randomly into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)

# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

#### 1. Chi-Squared Feature Selection
Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.
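
For intuition, the textbook statistic sums $(O - E)^2 / E$ over the cells of a contingency table, where $O$ is the observed count and $E$ the count expected under independence. The sketch below computes it by hand for made-up tables. As far as we understand, scikit-learn's `chi2` scorer computes a related statistic directly from the non-negative feature values and the class labels rather than from such a table, so its scores need not match this calculation.

```python
import numpy as np

def chi_squared_statistic(observed):
    """Pearson chi-squared statistic for a 2-D contingency table of counts."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if the row and column variables were independent
    expected = row_totals @ col_totals / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

# Rows: two categories of a feature, columns: Fatal = No / Yes (made-up counts)
dependent = chi_squared_statistic([[10, 20], [20, 10]])
independent = chi_squared_statistic([[10, 10], [20, 20]])
print(dependent, independent)
```

A larger statistic means the observed counts deviate more from what independence would predict.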

In [None]:
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=chi2, k='all')
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs

# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
    
# plot the scores  
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)

**Chi2:** it can be seen that Features 1,3,6 and 7 are the four best variables to choose from.

These features are: `Aircraft.Damage`, `Amateur.Built`, `Purpose.of.Flight`, `Weather.Condition` respectively.
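
Reading feature indices off the bar chart is error-prone, so the scores can be paired with the column names instead. The helper `rank_features` below is a hypothetical sketch, and the scores are placeholders, not the notebook's results; in the notebook one would pass `X.columns` and `fs.scores_`.

```python
def rank_features(names, scores, top_k=4):
    """Return the top_k (name, score) pairs, highest score first."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

feature_names = ['Month', 'Aircraft.Damage', 'Make', 'Amateur.Built',
                 'Number.of.Engines', 'Engine.Type', 'Purpose.of.Flight',
                 'Weather.Condition', 'Broad.Phase.of.Flight', 'Report.Status']

# Placeholder scores for illustration only
toy_scores = [1.0, 90.0, 5.0, 40.0, 3.0, 2.0, 60.0, 80.0, 4.0, 6.0]

top = rank_features(feature_names, toy_scores)
for name, score in top:
    print('%s: %.2f' % (name, score))
```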

#### 2. Mutual Information Feature Selection
Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.
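
For two discrete variables, mutual information can be computed directly from their joint distribution as $\sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$. The sketch below does this by hand for made-up count tables. It is meant for intuition only: according to the scikit-learn documentation, `mutual_info_classif` with the default `discrete_features='auto'` treats dense inputs as continuous features, so its estimates can differ from this calculation and from run to run.

```python
import numpy as np

def mutual_information(table):
    """Mutual information (in nats) of two discrete variables from a count table."""
    joint = np.asarray(table, dtype=float)
    joint = joint / joint.sum()                  # joint probabilities p(x, y)
    px = joint.sum(axis=1, keepdims=True)        # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)        # marginal p(y)
    nonzero = joint > 0                          # skip empty cells (0 * log 0 = 0)
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float((joint[nonzero] * np.log(ratio)).sum())

# Made-up tables: rows are feature categories, columns are Fatal = No / Yes
mi_dependent = mutual_information([[5, 0], [0, 5]])     # feature determines the target
mi_independent = mutual_information([[2, 2], [3, 3]])   # feature tells us nothing
print(mi_dependent, mi_independent)
```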

In [None]:
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=mutual_info_classif, k='all')
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs

# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
    
# plot the scores  
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)

**Mutual_Info_classif:** it can be seen that Features 1,7,8 and 9 are the four best variables to choose from.

These features are: `Aircraft.Damage`, `Weather.Condition`, `Broad.Phase.of.Flight`, `Report.Status` respectively.

### Logistic Regression
#### Modelling with ALL FEATURES

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
# fit the model
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train_enc, y_train_enc)

# evaluate the model
y_test_pred = model.predict(X_test_enc)
y_train_pred = model.predict(X_train_enc)

# evaluate predictions
accuracy_test = accuracy_score(y_test_enc, y_test_pred)
accuracy_train = accuracy_score(y_train_enc, y_train_pred)
print("Goodness of fit for model using ALL Features")
print("Classification Accuracy (train dataset) :\t %.2f" %(accuracy_train*100))
print("Classification Accuracy (test dataset) :\t %.2f" %(accuracy_test*100))

#Plotting a heatmap
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train_enc, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test_enc, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

axes[0].set_title("Train")
axes[1].set_title("Test")

#### Model Built Using Chi-Squared Features

We can use the chi-squared test to score the features and select the four most relevant features.

In [None]:
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=chi2, k=4)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs

# feature selection
X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)

#fit the model
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train_fs, y_train_enc)

# evaluate the model
y_test_pred = model.predict(X_test_fs)
y_train_pred = model.predict(X_train_fs)

# evaluate predictions
accuracy_test = accuracy_score(y_test_enc, y_test_pred)
accuracy_train = accuracy_score(y_train_enc, y_train_pred)
print("Goodness of fit for model using Chi-Squared selected features")
print("Classification Accuracy (train dataset) :\t %.2f" %(accuracy_train*100))
print("Classification Accuracy (test dataset) :\t %.2f" %(accuracy_test*100))

#Plotting a heatmap
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train_enc, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test_enc, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

axes[0].set_title("Train")
axes[1].set_title("Test")

#### Model Built Using Mutual Information Features
We can repeat the experiment and select the top four features using a mutual information statistic.

In [None]:
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=mutual_info_classif, k=4)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs

# feature selection
X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# fit the model
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train_fs, y_train_enc)

# evaluate the model
y_test_pred = model.predict(X_test_fs)
y_train_pred = model.predict(X_train_fs)

# evaluate predictions
accuracy_test = accuracy_score(y_test_enc, y_test_pred)
accuracy_train = accuracy_score(y_train_enc, y_train_pred)
print("Goodness of fit for model using Mutual Info selected features")
print("Classification Accuracy (train dataset) :\t %.2f" %(accuracy_train*100))
print("Classification Accuracy (test dataset) :\t %.2f" %(accuracy_test*100))

#Plotting a heatmap
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train_enc, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test_enc, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

axes[0].set_title("Train")
axes[1].set_title("Test")

#### Comments:

The classification accuracy of each of the 3 models above is ~87%, which suggests that all three are reasonably accurate at predicting fatality.

We observe that the accuracy of prediction using selected features, chosen either by `Chi-Squared` or `Mutual Info Classification`, is marginally higher than using `all features`. This might indicate that modelling with `all features` has a slight negative effect due to overfitting.
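
One way to put an accuracy figure in context is to compare it with a majority-class baseline. The sketch below uses made-up labels with an assumed 80/20 class split, not the aviation data, together with scikit-learn's `DummyClassifier`; the names `X_toy`, `y_toy` and `baseline` are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with an 80/20 class split (not the aviation data)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print("Baseline accuracy :\t %.2f" % (baseline_acc * 100))
```

On the real data, the same comparison could be made by fitting the baseline on `X_train_enc` and `y_train_enc` and scoring it on the test split.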

### Random Forest Regression

References: <br>
https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
pd.set_option('display.max_columns', 90)

forest_data = pd.DataFrame(clean_data[['Month', 
                                       'Aircraft.Damage', 
                                       'Make', 
                                       'Amateur.Built', 
                                       'Number.of.Engines',
                                       'Engine.Type', 
                                       'Purpose.of.Flight', 
                                       'Weather.Condition', 
                                       'Broad.Phase.of.Flight',
                                       'Report.Status', 
                                       'Fatal']])

features = pd.get_dummies(forest_data)
features

In [None]:
numeric_cols = [col for col in features if features[col].dtype.kind != 'O']
# Shift every numeric column by 1, so positive becomes 2 and negative becomes 1.
# This ensures no label is 0, which avoids dividing by zero when the
# percentage error is calculated later.
features[numeric_cols] += 1

#### Convert Data to Arrays

Label - Data we want to predict <br>
Features - Variables used for prediction <br>
Convert to NumPy arrays - required for this algorithm to work

In [None]:
# Labels are the values we want to predict
labels = np.array(features['Fatal_Yes'])

# Remove the labels from the features
# axis 1 refers to the columns
features = features.drop(['Fatal_No', 'Fatal_Yes'], axis = 1)

# Saving feature names for later use
feature_list = list(features.columns)

# Convert to numpy array
features = np.array(features)

#### Training and Testing Sets
We expect the training and testing features to have the same number of columns, and each set of features to have the same number of rows as its corresponding labels.

In [None]:
# Split the data into training and testing sets 
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, 
                                                                            test_size = 0.3, random_state = 42)

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)
                                                        


#### Creating and training the model

In [None]:
# Instantiate model with 100 decision trees 
regressor = RandomForestRegressor(n_estimators=100, random_state= 42)

# Train the model on training data
regressor.fit(train_features, train_labels)  

#### Use the forest's predict method on the train data
To put our predictions in perspective, we can calculate an accuracy as 100% minus the mean absolute percentage error (MAPE).
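
The arithmetic behind this score can be checked by hand on a few values. The sketch below uses made-up labels and predictions on the shifted scale from earlier (1 = not fatal, 2 = fatal); the `toy_` names are hypothetical and the numbers are not results from the model.

```python
import numpy as np

# Made-up labels after the +1 shift: 1 = not fatal, 2 = fatal
toy_labels = np.array([1, 1, 2, 2])
# Made-up regressor outputs on the same shifted scale
toy_predictions = np.array([1.0, 1.5, 2.0, 1.0])

# Absolute errors: [0, 0.5, 0, 1]
toy_errors = np.abs(toy_predictions - toy_labels)

# Percentage errors relative to the true label: [0, 50, 0, 50]
toy_mape = 100 * (toy_errors / toy_labels)

# Accuracy as 100 minus the mean percentage error: 100 - 25 = 75
toy_accuracy = 100 - np.mean(toy_mape)
print('Toy accuracy:', round(toy_accuracy, 2), '%')
```

Because each error is divided by the true label, a miss of one full unit costs 100 points when the label is 1 but only 50 when it is 2, so this score is not directly comparable to the classification accuracy reported for logistic regression.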

In [None]:
train_predictions = regressor.predict(train_features)

# Calculate the absolute errors
errors = abs(train_predictions - train_labels)

# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / train_labels)

# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Train Accuracy:', round(accuracy, 2), '%')

#### Use the forest's predict method on the test data
As with the train data, we calculate an accuracy as 100% minus the mean absolute percentage error (MAPE).

In [None]:
predictions = regressor.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)

# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Test Accuracy:', round(accuracy, 2), '%')

#### Variable Importances
The importances returned by scikit-learn represent how much including a particular variable improves the prediction.
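
Because `get_dummies` splits each categorical variable into several columns, it can help to add the importances back up per original variable. The sketch below is one possible way to do that on made-up numbers; `toy_feature_list` and `toy_importances` are hypothetical, and it assumes the text before the last underscore of a dummy column name is the original variable name, which would fail for category values that themselves contain underscores.

```python
from collections import defaultdict

# Hypothetical one-hot column names and made-up importances that sum to 1
toy_feature_list = ['Month',
                    'Aircraft.Damage_Minor',
                    'Aircraft.Damage_Substantial',
                    'Weather.Condition_IMC',
                    'Weather.Condition_VMC']
toy_importances = [0.10, 0.05, 0.55, 0.20, 0.10]

grouped = defaultdict(float)
for name, importance in zip(toy_feature_list, toy_importances):
    # Assumption: text before the last underscore is the original variable
    variable = name.rsplit('_', 1)[0]
    grouped[variable] += importance

# Print the original variables, most important first
for variable, total in sorted(grouped.items(), key=lambda x: x[1], reverse=True):
    print('Variable: {:20} Importance: {:.2f}'.format(variable, total))
```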


In [None]:
importances = list(regressor.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 4)) for feature, 
                       importance in zip(feature_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

In future implementations of the model, we can remove the variables that have no importance, and the performance should not suffer. Additionally, if we use a different model (e.g. a support vector machine), we could use the random forest feature importances as a feature selection method.
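
As a minimal sketch of that idea, scikit-learn's `SelectFromModel` can keep only the columns whose importance exceeds a threshold. The data below is made up so that only the first column matters; `X_toy`, `y_toy`, `forest` and `selector` are hypothetical names, and the `'mean'` threshold is an assumption rather than a tuned choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Made-up data: only column 0 drives the target
rng = np.random.RandomState(42)
X_toy = rng.rand(300, 5)
y_toy = (X_toy[:, 0] > 0.5).astype(int)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_toy, y_toy)

# Keep the columns whose importance is above the mean importance
selector = SelectFromModel(forest, prefit=True, threshold='mean')
X_selected = selector.transform(X_toy)

print('Kept columns:', selector.get_support(indices=True))
print('Reduced shape:', X_selected.shape)
```

The reduced array could then be passed to any other estimator in place of the full feature matrix.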

#### Visualization

Simple bar plot of the feature importances to illustrate the disparities in the relative significance of the variables.

In [None]:
# list of x locations for plotting
x_values = list(range(len(importances)))

# Make a bar chart
f, axes = plt.subplots(1, 1, figsize = (24,12))
plt.bar(x_values, importances, orientation = 'vertical')

# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')

# Axis labels and title
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')

#### Random forest with only the most important variable
We retrain the forest using only `Aircraft.Damage_Substantial` to see how the performance compares.

In [None]:
# New random forest with only the most important variable
regressor_most_important = RandomForestRegressor(n_estimators= 100, random_state=42)

# Extract the single most important feature
important_indices = [feature_list.index('Aircraft.Damage_Substantial')]
train_important = train_features[:, important_indices]
test_important = test_features[:, important_indices]

# Train the random forest
regressor_most_important.fit(train_important, train_labels)

# Make predictions on the test data
predictions = regressor_most_important.predict(test_important)

#### Most Important Variable to Predict - train data

In [None]:
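# Predict on the train data with the single-variable model
train_predictions = regressor_most_important.predict(train_important)

# Calculate the absolute errors
errors = abs(train_predictions - train_labels)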

# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / train_labels)
accuracy = 100 - np.mean(mape)
print('Train Accuracy:', round(accuracy, 2), '%')

#### Most Important Variable to Predict - test data

In [None]:
errors = abs(predictions - test_labels)

# Calculate mean absolute percentage error (MAPE) and display accuracy
mape = np.mean(100 * (errors / test_labels))
accuracy = 100 - mape
print('Test Accuracy:', round(accuracy, 2), '%')

#### Comments:

With only 1 variable, we were able to achieve an accurate result, slightly more accurate than if we were to use all the variables.

This means that if we were to operate with this model, using only the most important variables is sufficient to achieve optimal performance.

In a production setting, we would need to weigh this effect on accuracy against the number of variables and time required to obtain them. 

---

# `THE` `END` `. `