# Aircraft Risk Analysis Project - Phase 1

## Introduction


In today's fast-evolving business landscape, diversification is key to staying competitive. This project explores a scenario where a company is looking to expand into the aviation industry—an exciting but high-risk move. While the aviation sector holds strong potential for both commercial and private enterprise, safety remains a top concern, especially for new entrants with limited industry experience.

The goal of this project is to use data science techniques to help the company make informed, data-driven decisions as it considers which aircraft to invest in. By examining historical accident data, we can uncover patterns, evaluate safety records, and highlight aircraft models that pose the least operational risk.

This project serves as both a business case and a demonstration of key data science skills: data cleaning, missing value imputation, exploratory data analysis, and visualization. The final output will include actionable insights and an interactive dashboard designed to support decision-making by the company’s new aviation division.



## Objectives

- Understand trends in aircraft accidents over time  
- Identify aircraft with the lowest risk profiles  
- Handle missing data and perform necessary cleaning  
- Present insights using clear visualizations  
- Build an interactive dashboard for stakeholders

## Data Source

The dataset comes from the National Transportation Safety Board (NTSB) and includes records of civil aviation accidents and selected incidents in the United States and international waters from 1962 to 2023.

## Tools & Libraries

- Python (Pandas, NumPy, Matplotlib)
- Jupyter Notebook
- Data cleaning and wrangling
- Exploratory data analysis (EDA)

### Step 1 : Import Libraries

In [3]:
#Importing libraries using standard alias
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### Step 2 : Load and Inspect Data

In [9]:
#Loading the csv file into the notebook
#Adding low_memory=False to allow pandas to read the full file before deciding datatypes
#Telling pandas to treat ? , Unknown , N/A and blank spaces as missing values
data= pd.read_csv('data/Aviation_Data.csv', low_memory=False, na_values=['?', 'Unknown', 'N/A', ''])

#Inspect the first 5 rows of the dataset
data.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [10]:
# Checking the number of rows and columns in the dataset
data.shape

# Displaying the dataset's dimensions (rows, columns)
print(f"The dataset contains {data.shape[0]} rows and {data.shape[1]} columns.")


The dataset contains 90348 rows and 31 columns.


In [11]:
# Get a summary of the dataset:That is the  column names, data types, and missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90348 entries, 0 to 90347
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      90348 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88836 non-null  object 
 5   Country                 88660 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50249 non-null  object 
 9   Airport.Name            52783 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85576 non-null  object 
 12  Aircraft.Category       32273 non-null  object 
 13  Registration.Number     87569 non-null  object 
 14  Make                    88805 non-null

In [15]:
#Look at all columns present
data.columns

Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code',
       'Airport.Name', 'Injury.Severity', 'Aircraft.damage',
       'Aircraft.Category', 'Registration.Number', 'Make', 'Model',
       'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',
       'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',
       'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',
       'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',
       'Publication.Date'],
      dtype='object')

In [16]:
#Generate summary statistics for numerical columns in the dataset
data.describe()

Unnamed: 0,Number.of.Engines,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured
count,82805.0,77488.0,76379.0,76956.0,82977.0
mean,1.146585,0.647855,0.279881,0.357061,5.32544
std,0.44651,5.48596,1.544084,2.235625,27.913634
min,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,0.0,0.0,1.0
75%,1.0,0.0,0.0,0.0,2.0
max,8.0,349.0,161.0,380.0,699.0


### Step 3 : Data Cleaning

In [18]:
# Create a copy of the cleaned DataFrame to avoid modifying the original
cleaned_data = data.copy()

In [22]:
#Change the Event.Date and Publication.Date columns into datetime
df['Event.Date']=pd.to_datetime(df['Event.Date'])
df['Publication.Date']=pd.to_datetime(df['Publication.Date'])

#Check if it has been applied
cleaned_data[['Event.Date','Publication.Date']].head()

Unnamed: 0,Event.Date,Publication.Date
0,1948-10-24,
1,1962-07-19,19-09-1996
2,1974-08-30,26-02-2007
3,1977-06-19,12-09-2000
4,1979-08-02,16-04-1980


In [23]:
# Check if there are any duplicate rows
cleaned_data.duplicated().any()


True

In [26]:
# Drop duplicated rows and keep the first occurrence
cleaned_data = cleaned_data.drop_duplicates()

#Check dimensions of our cleaned data after duplicates have been dropped
cleaned_data.shape
print(f"The Cleaned dataset contains {cleaned_data.shape[0]} rows and {cleaned_data.shape[1]} columns.")


The Cleaned dataset contains 88958 rows and 31 columns.


In [27]:
# Calculate the threshold: keep columns with at least half non-null values
threshold = cleaned_data.shape[0] // 2

# Drop columns with more than half missing values
cleaned_data = cleaned_data.dropna(axis=1, thresh=threshold)


In [29]:
#Check how many columns have remained
cleaned_data.info()
cleaned_data.columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88958 entries, 0 to 90347
Data columns (total 25 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88958 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88836 non-null  object 
 5   Country                 88660 non-null  object 
 6   Airport.Code            50249 non-null  object 
 7   Airport.Name            52783 non-null  object 
 8   Injury.Severity         87889 non-null  object 
 9   Aircraft.damage         85576 non-null  object 
 10  Registration.Number     87569 non-null  object 
 11  Make                    88805 non-null  object 
 12  Model                   88796 non-null  object 
 13  Amateur.Built           88787 non-null  object 
 14  Number.of.Engines       82805 non-null

Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'Location', 'Country', 'Airport.Code', 'Airport.Name',
       'Injury.Severity', 'Aircraft.damage', 'Registration.Number', 'Make',
       'Model', 'Amateur.Built', 'Number.of.Engines', 'Engine.Type',
       'Purpose.of.flight', 'Total.Fatal.Injuries', 'Total.Serious.Injuries',
       'Total.Minor.Injuries', 'Total.Uninjured', 'Weather.Condition',
       'Broad.phase.of.flight', 'Report.Status', 'Publication.Date'],
      dtype='object')

In [39]:
#Group the numerical values and fill them with the mean
numeric_values = cleaned_data.select_dtypes(include=[float, int]).columns
cleaned_data[numeric_values] = cleaned_data[numeric_values].fillna(cleaned_data[numeric_values].mean())


In [40]:
#Check for any null values in the numerical values
cleaned_data[numeric_values].isna().sum()

Number.of.Engines         0
Total.Fatal.Injuries      0
Total.Serious.Injuries    0
Total.Minor.Injuries      0
Total.Uninjured           0
dtype: int64

In [50]:
#Group the categorical values and fill them with the mode
categorical_values = cleaned_data.select_dtypes(include=['object']).columns

for column in categorical_values:
    if not cleaned_data[column].mode().empty:
        mode_value = cleaned_data[column].mode().iloc[0]
        cleaned_data[column].fillna(mode_value, inplace=True)



In [52]:
#Check for any null values in the categorical values
cleaned_data[categorical_values].isna().sum()

Event.Id                 0
Investigation.Type       0
Accident.Number          0
Event.Date               0
Location                 0
Country                  0
Airport.Code             0
Airport.Name             0
Injury.Severity          0
Aircraft.damage          0
Registration.Number      0
Make                     0
Model                    0
Amateur.Built            0
Engine.Type              0
Purpose.of.flight        0
Weather.Condition        0
Broad.phase.of.flight    0
Report.Status            0
Publication.Date         0
dtype: int64

In [60]:
# Select numeric and categorical data by column names, then combine them back into one DataFrame
numerical_data = cleaned_data[numeric_values]
categorical_data = cleaned_data[categorical_values]
cleaned_data = pd.concat([numerical_data, categorical_data], axis=1)

#Check the 1st 5 rows of the cleaned data
cleaned_data.head()

Unnamed: 0,Number.of.Engines,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,...,Registration.Number,Make,Model,Amateur.Built,Engine.Type,Purpose.of.flight,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,1.0,2.0,0.0,0.0,0.0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",...,NC6404,Stinson,108-3,No,Reciprocating,Personal,UNK,Cruise,Probable Cause,25-09-2020
1,1.0,4.0,0.0,0.0,0.0,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",...,N5069P,Piper,PA24-180,No,Reciprocating,Personal,UNK,Landing,Probable Cause,19-09-1996
2,1.0,3.0,0.279881,0.357061,5.32544,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",...,N5142R,Cessna,172M,No,Reciprocating,Personal,IMC,Cruise,Probable Cause,26-02-2007
3,1.0,2.0,0.0,0.0,0.0,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",...,N1168J,Rockwell,112,No,Reciprocating,Personal,IMC,Cruise,Probable Cause,12-09-2000
4,1.146585,1.0,2.0,0.357061,0.0,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",...,N15NY,Cessna,501,No,Reciprocating,Personal,VMC,Approach,Probable Cause,16-04-1980


In [61]:
# Display the summary of the DataFrame to check data types, non-null counts, and memory usage
cleaned_data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 88958 entries, 0 to 90347
Data columns (total 25 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Number.of.Engines       88958 non-null  float64
 1   Total.Fatal.Injuries    88958 non-null  float64
 2   Total.Serious.Injuries  88958 non-null  float64
 3   Total.Minor.Injuries    88958 non-null  float64
 4   Total.Uninjured         88958 non-null  float64
 5   Event.Id                88958 non-null  object 
 6   Investigation.Type      88958 non-null  object 
 7   Accident.Number         88958 non-null  object 
 8   Event.Date              88958 non-null  object 
 9   Location                88958 non-null  object 
 10  Country                 88958 non-null  object 
 11  Airport.Code            88958 non-null  object 
 12  Airport.Name            88958 non-null  object 
 13  Injury.Severity         88958 non-null  object 
 14  Aircraft.damage         88958 non-null

'''
The dataset has been thoroughly cleaned, with missing values handled as previously discussed. Duplicate rows have been removed, and columns with more than 50% of their values missing have been excluded. Following these data cleaning steps, the dataset now consists of 25 columns and 88,895 rows. The next step involves visualizing the cleaned data.
'''