# AN ANALYSIS OF AVIATION ACCIDENTS (1919 TO 2023)

<img src="Images/Airplane_Analysis.jpeg" style="width: 1980px; height: 1020px;" alt="Image description">


## **Overview**


This analysis aims to identify the lowest-risk aircraft types so that the company can make a well-informed purchase based on historical accident data. It examines several variables, including accident frequency, fatality rates, aircraft type, operator and geographical distribution, to highlight the aircraft and operational environments associated with the highest and lowest levels of risk. The company can then use these findings to adjust capital investment, training programs and operational focus, improving resource allocation and minimizing financial exposure in this new business endeavor.

## **Business Problem**

The company wants to expand into new industries to diversify its portfolio. Because this involves purchasing and operating airplanes for commercial and private enterprises, the risks involved have to be reduced to a sustainable level. Doing so ensures that the investment not only yields profits but also allows the company to better serve its clients by expanding the scope of services it can offer. Using the NTSB Aviation Accidents data, I describe patterns in accident frequency and accident location to identify the lowest-risk aircraft to invest in.


## **Data Understanding**


The [dataset](https://www.kaggle.com/datasets/drealbash/aviation-accident-from-1919-2023/data) for this analysis is a compiled list of publicly available aviation accident reports spanning more than a century, from 1919 to 2023. It provides the date, location and aircraft type for each accident, as well as other characteristics (e.g. registration and operator). The dataset contains a total of 23,967 records and 9 fields.

In [5]:
# Importing the necessary libraries
import pandas as pd

In [9]:
# Loading the dataset
df = pd.read_csv('Data/aviation-accident-data-2023-05-16.csv')
df

| | date | type | registration | operator | fatalities | location | country | cat | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | date unk. | Antonov An-12B | T-1206 | Indonesian AF | | | Unknown country | U1 | unknown |
| 1 | date unk. | Antonov An-12B | T-1204 | Indonesian AF | | | Unknown country | U1 | unknown |
| 2 | date unk. | Antonov An-12B | T-1201 | Indonesian AF | | | Unknown country | U1 | unknown |
| 3 | date unk. | Antonov An-12BK | | Soviet AF | | Tiksi Airport (IKS) | Russia | A1 | unknown |
| 4 | date unk. | Antonov An-12BP | CCCP-11815 | Soviet AF | 0 | Massawa Airport ... | Eritrea | A1 | unknown |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23962 | 11-MAY-2023 | Hawker 900XP | PK-LRU | Angkasa Super Services | 0 | Maleo Airport (MOH) | Indonesia | A2 | 2023 |
| 23963 | 11-MAY-2023 | Cessna 208B Grand Caravan | PK-NGA | Nasional Global Aviasi | 0 | Fentheik Airstrip | Indonesia | A2 | 2023 |
| 23964 | 12-MAY-2023 | Cessna 208B Grand Caravan | 5X-RBR | Bar Aviation | 0 | Kampala-Kajjansi... | Uganda | A1 | 2023 |
| 23965 | 14-MAY-2023 | Boeing 747-4R7F | LX-OCV | Cargolux | 0 | Luxembourg-Finde... | Luxembourg | A2 | 2023 |


Initial exploration revealed that the accidents in this dataset span multiple countries, that most of them involved no or very few fatalities, and that each accident is assigned to a category.

### Data Quality

The dataset contains the following 9 columns:
* **Date of Accident**: This column contains the date of each aviation accident, ranging from 1919 to 2023. The dates are stored as strings rather than in a standardised datetime format, and some are effectively missing: they carry the placeholder `date unk.` instead of a null value.

* **Type**: This column indicates the model of the aircraft that was involved in the accident. This column has no missing values.

* **Registration**: This column contains the unique identification code that is usually assigned to each aircraft. It helps identify and track specific aircraft involved in incidents. This column has 1548 missing values.

* **Operator**: This column shows the operator (for example, an airline or an air force) of the aircraft that was involved in the accident. This column has 4 missing values.

* **Fatalities**: This column records the number of fatalities associated with each aviation accident, covering both fatalities on board the aircraft and fatalities on the ground. This column contains 3938 missing values. The values are stored as strings and require conversion to a numeric type.

* **Location**: This column shows the specific place within the country where each accident occurred. It can include details such as city names or airports. It contains 948 missing values.

* **Country**: This column indicates the country where each aviation accident took place. This column has no missing values.

* **Accident Category**: This column classifies each aviation accident into a category based on factors such as the cause, nature, or severity of the incident. This column has no missing values. Each code combines a letter (type of occurrence) with a digit (extent of damage):
    * A = Accident

    * I = Incident

    * H = Hijacking

    * C = Criminal occurrence (sabotage, shoot down)

    * O = Other occurrence (ground fire, sabotage)

    * U = type of occurrence unknown

    * 1 = hull-loss

    * 2 = repairable damage

    * E.g. the A1 category means an Accident resulting in a total loss of the plane.



* **Year**: This column contains the year extracted from the date column. It has no null values, but unknown years are labelled 'unknown' instead.
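
The two-character category codes described above can be split into an occurrence type and a damage level. The following is a minimal sketch, not part of the original notebook: the `OCCURRENCE` and `DAMAGE` dictionaries and the `sample` frame are illustrative names built from the legend in this section, and any code not covered by the legend falls back to "unknown".

```python
import pandas as pd

# Lookup tables built from the category legend above (names are illustrative).
OCCURRENCE = {
    "A": "Accident",
    "I": "Incident",
    "H": "Hijacking",
    "C": "Criminal occurrence",
    "O": "Other occurrence",
    "U": "Unknown occurrence type",
}
DAMAGE = {"1": "hull-loss", "2": "repairable damage"}

# Small illustrative sample; in the notebook this would be df["cat"].
sample = pd.DataFrame({"cat": ["A1", "A2", "U1"]})

# First character is the occurrence type, second character the damage level.
sample["occurrence"] = sample["cat"].str[0].map(OCCURRENCE).fillna("unknown")
sample["damage"] = sample["cat"].str[1].map(DAMAGE).fillna("unknown")

print(sample)
```

Keeping the two parts in separate columns would make it possible to filter on hull losses alone, regardless of the occurrence type.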

In [14]:
# Displaying summary about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23967 entries, 0 to 23966
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          23967 non-null  object
 1   type          23967 non-null  object
 2   registration  22419 non-null  object
 3   operator      23963 non-null  object
 4   fatalities    20029 non-null  object
 5   location      23019 non-null  object
 6   country       23967 non-null  object
 7   cat           23967 non-null  object
 8   year          23967 non-null  object
dtypes: object(9)
memory usage: 1.6+ MB
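
Since every column is read as `object`, the date, year and fatalities columns need converting before any numeric analysis. The sketch below shows one possible approach on a small hand-made `sample` frame rather than on `df` itself. It assumes dates follow the `DD-MON-YYYY` pattern visible in the output above. The `4+1` fatalities value (aircraft plus ground fatalities) is a hypothetical format used only for illustration, and `total_fatalities` is a hypothetical helper, not a function from the original notebook.

```python
import pandas as pd

# Illustrative rows modelled on the output above; in the notebook,
# the same steps would be applied to df instead.
sample = pd.DataFrame({
    "date": ["date unk.", "11-MAY-2023", "14-MAY-2023"],
    "fatalities": [None, "0", "4+1"],  # "4+1" is a hypothetical format
    "year": ["unknown", "2023", "2023"],
})

# Dates that do not match the pattern (e.g. "date unk.") become NaT.
sample["date"] = pd.to_datetime(sample["date"], format="%d-%b-%Y", errors="coerce")

# Non-numeric years (e.g. "unknown") become <NA>.
sample["year"] = pd.to_numeric(sample["year"], errors="coerce").astype("Int64")


def total_fatalities(value):
    """Sum the '+'-separated numeric parts of a string; <NA> if any part is not a number."""
    if pd.isna(value):
        return pd.NA
    parts = [part.strip() for part in str(value).split("+")]
    if not all(part.isdigit() for part in parts):
        return pd.NA
    return sum(int(part) for part in parts)


sample["fatalities"] = sample["fatalities"].map(total_fatalities).astype("Int64")

print(sample.dtypes)
```

Using `errors="coerce"` turns unparseable entries into missing values instead of raising, so the number of rows lost to conversion can be checked afterwards with `isna().sum()`.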


In [15]:
# Displaying the columns with null values
df.isnull().sum()

date               0
type               0
registration    1548
operator           4
fatalities      3938
location         948
country            0
cat                0
year               0
dtype: int64
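
The zero counts for `date`, `country` and `year` only reflect true nulls. Placeholder strings such as `date unk.`, `unknown` and `Unknown country`, all visible in the first rows of the data, are not counted. A minimal sketch of how they could be included follows. The `PLACEHOLDERS` list is an assumption based only on the values shown above and may not be exhaustive, and `sample` is a hand-made stand-in for `df`.

```python
import pandas as pd

# Placeholder strings seen in the output above (possibly not exhaustive).
PLACEHOLDERS = ["date unk.", "unknown", "Unknown country"]

# Small illustrative sample; in the notebook this would be df.
sample = pd.DataFrame({
    "date": ["date unk.", "11-MAY-2023"],
    "location": [None, "Maleo Airport (MOH)"],
    "country": ["Unknown country", "Indonesia"],
    "year": ["unknown", "2023"],
})

# Count true nulls and placeholder strings separately, then combine them.
null_counts = sample.isna().sum()
placeholder_counts = sample.isin(PLACEHOLDERS).sum()
total_missing = null_counts + placeholder_counts

print(total_missing)
```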

## **Data Preparation**