#  Aircraft Risk Analysis for Business Expansion  

## ✈️ Introduction  
As part of its diversification strategy, the company is exploring the aviation industry. However, operating aircraft comes with inherent risks. To make informed investment decisions, the company must assess which aircraft types pose the **lowest operational risk** based on historical accident data.  

In this project, we will analyze **aviation accident data (1962–2023)** from the National Transportation Safety Board (NTSB) to:  
- **Identify risk factors** associated with aircraft accidents.  
- **Determine the safest aircraft types** for commercial and private operations.  
- **Provide actionable insights** to the aviation division for selecting aircraft with minimal risk.  

Using **data cleaning, imputation, analysis, and visualization**, we aim to transform raw aviation data into meaningful business insights.  


## 🔍 Methodology: Approach to Aircraft Risk Analysis  

To ensure a **thorough and accurate** assessment of aircraft risk, this notebook follows a structured approach:  

### 1️⃣ Data Exploration  
- Load and inspect the dataset to understand its structure and key attributes.  
- Identify missing values, inconsistencies, and potential data quality issues.  

### 2️⃣ Data Cleaning & Preprocessing  
- Handle missing values through imputation or removal where necessary.  
- Standardize formats and correct inconsistencies.  
- Filter relevant data to focus on meaningful insights.  

### 3️⃣ Data Visualization  
- Use **charts, graphs, and interactive visualizations** to identify trends and patterns in accident occurrences.  
- Compare accident rates across different aircraft types, years, and other factors.  

### 4️⃣ Risk Analysis & Insights Generation  
- Perform basic **statistical analysis** to determine the most and least risky aircraft.  
- Identify key **risk factors** influencing accident likelihood.  
- Correlate accident severity with aircraft type, age, and operational conditions.  

### 5️⃣ Business Recommendations  
- Summarize key findings in a clear format.  
- Provide **data-driven recommendations** to the aviation division on the safest aircraft for investment.  

The data exploration stage runs through six checks:

- **Step 1:** Check dataset shape (`df.shape`)  
- **Step 2:** Get column info (`df.info()`)  
- **Step 3:** View summary statistics (`df.describe()`)  
- **Step 4:** Check missing values (`df.isnull().sum()`)  
- **Step 5:** Identify unique categories (`df['Aircraft.Category'].unique()`)  
- **Step 6:** Detect duplicate records (`df.duplicated().sum()`)  


### 【1】Data Exploration  

In [1]:
# Import essential libraries for data analysis and visualization  
import pandas as pd                 # Data manipulation and analysis  
import matplotlib.pyplot as plt     # Data visualization  
import seaborn as sns               # Statistical Data visualization  

In [None]:
# Load the dataset and inspect the first few rows
df = pd.read_csv("Data/Aviation_Data.csv")
df.head()



| | Event.Id | Investigation.Type | Accident.Number | Event.Date | Location | Country | Latitude | Longitude | Airport.Code | Airport.Name | ... | Purpose.of.flight | Air.carrier | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured | Weather.Condition | Broad.phase.of.flight | Report.Status | Publication.Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20001218X45444 | Accident | SEA87LA080 | 1948-10-24 | MOOSE CREEK, ID | United States | | | | | ... | Personal | | 2.0 | 0.0 | 0.0 | 0.0 | UNK | Cruise | Probable Cause | |
| 1 | 20001218X45447 | Accident | LAX94LA336 | 1962-07-19 | BRIDGEPORT, CA | United States | | | | | ... | Personal | | 4.0 | 0.0 | 0.0 | 0.0 | UNK | Unknown | Probable Cause | 19-09-1996 |
| 2 | 20061025X01555 | Accident | NYC07LA005 | 1974-08-30 | Saltville, VA | United States | 36.922223 | -81.878056 | | | ... | Personal | | 3.0 | | | | IMC | Cruise | Probable Cause | 26-02-2007 |
| 3 | 20001218X45448 | Accident | LAX96LA321 | 1977-06-19 | EUREKA, CA | United States | | | | | ... | Personal | | 2.0 | 0.0 | 0.0 | 0.0 | IMC | Cruise | Probable Cause | 12-09-2000 |
| 4 | 20041105X01764 | Accident | CHI79FA064 | 1979-08-02 | Canton, OH | United States | | | | | ... | Personal | | 1.0 | 2.0 | | 0.0 | VMC | Approach | Probable Cause | 16-04-1980 |


Loading the dataset raised a **DtypeWarning**, which means some columns contain **mixed data types** (e.g., numbers and text). Mixed types can cause issues during analysis, so we will address them in the data cleaning stage by inspecting these columns, converting them to appropriate types, and handling inconsistencies.  
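As a minimal sketch of how the mixed-type columns could be located, the snippet below scans each column for more than one Python type among its non-null values. The helper name `mixed_type_columns` and the toy values are assumptions made for illustration; they are not part of the notebook or the NTSB data.

```python
import pandas as pd

# Toy frame standing in for the real data. Column names mirror the dataset,
# but the values are invented for illustration.
sample = pd.DataFrame({
    "Latitude": [36.922223, "364010N", None],
    "Total.Fatal.Injuries": [2.0, 0.0, 1.0],
})


def mixed_type_columns(frame):
    """Return {column: set of type names} for columns holding more than one type.

    Hypothetical helper: missing values are ignored before the types are compared.
    """
    mixed = {}
    for col in frame.columns:
        kinds = {type(value).__name__ for value in frame[col].dropna()}
        if len(kinds) > 1:
            mixed[col] = kinds
    return mixed


mixed = mixed_type_columns(sample)
print(mixed)
```

Once the affected columns are known, one common option is to pass an explicit `dtype` mapping (or `low_memory=False`) to `pd.read_csv` so each column is parsed consistently. Which option suits this dataset is a judgment call left to the cleaning stage.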


As you can see from the output above, this dataset is **very extensive**, containing numerous columns with detailed aviation accident records. However, to effectively answer our **business question**, we must focus only on the most relevant data points. By narrowing down our selection, we ensure that our analysis remains **targeted, efficient, and data-driven** in making a well-informed decision.  

### **Key Columns We Will Use:**  

- **`Make` & `Model`** → Identifies aircraft manufacturers and specific models to compare safety records.  
- **`Event.Date`** → Allows us to analyze accident trends over time.  
- **`Purpose.of.flight`** → Filters for **business flights**, since our company is focused on commercial aviation.  
- **`Broad.phase.of.flight`** → Helps identify when accidents are most likely to occur (e.g., takeoff, cruise, landing).  
- **`Total.Fatal.Injuries`, `Total.Serious.Injuries`, `Total.Minor.Injuries`, `Total.Uninjured`** → Quantifies accident severity and survivability.  

### **Why These Columns?**  
These fields provide **direct insights into aircraft risk levels** by showing:  
✅ Which **aircraft types** are more prone to accidents.  
✅ The **phases of flight** where accidents occur most often.  
✅ How accident rates have **changed over time**.  
✅ Whether people **walk away unharmed** or if accidents are severe.  
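
To illustrate how the four injury columns could quantify severity, here is a small sketch that sums them per `Make` and derives the share of occupants who were fatally injured. The derived column names `People.Aboard` and `Fatality.Share`, and all values in the toy frame, are assumptions for illustration only. They are not results from the NTSB data.

```python
import pandas as pd

# Invented records using the dataset's column names.
records = pd.DataFrame({
    "Make": ["Cessna", "Cessna", "Piper", "Piper"],
    "Total.Fatal.Injuries": [2.0, 0.0, 1.0, 0.0],
    "Total.Serious.Injuries": [0.0, 0.0, 1.0, 0.0],
    "Total.Minor.Injuries": [0.0, 1.0, 0.0, 0.0],
    "Total.Uninjured": [0.0, 3.0, 0.0, 2.0],
})

injury_cols = [
    "Total.Fatal.Injuries",
    "Total.Serious.Injuries",
    "Total.Minor.Injuries",
    "Total.Uninjured",
]

# Sum each injury category per manufacturer.
totals = records.groupby("Make")[injury_cols].sum()

# Hypothetical derived columns: everyone recorded aboard, and the fatal share.
totals["People.Aboard"] = totals[injury_cols].sum(axis=1)
totals["Fatality.Share"] = totals["Total.Fatal.Injuries"] / totals["People.Aboard"]

print(totals[["People.Aboard", "Fatality.Share"]])
```

A share like this describes severity among recorded accidents only. It says nothing about how often a given make is involved in an accident per flight hour, since the dataset does not include exposure data.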

### **Note:**  
Since this is a **U.S.-based company**, we will focus only on accidents that occurred **within the United States**. As we progress through this notebook, we will filter our dataset accordingly to remove irrelevant data and improve analysis accuracy.  
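
A sketch of that filter, assuming the `Country` column uses the literal value `"United States"` as seen in the first rows of the dataset. The toy rows and the `key_columns` list are illustrative assumptions.

```python
import pandas as pd

# Invented rows using the dataset's column names.
sample = pd.DataFrame({
    "Country": ["United States", "Canada", None, "United States"],
    "Make": ["Cessna", "Piper", "Boeing", "Bell"],
    "Event.Date": ["1979-08-02", "1982-01-01", "1990-05-05", "2001-03-04"],
})

# Subset of the key columns, shortened for the example.
key_columns = ["Make", "Event.Date"]

# Keep only U.S. rows and the columns of interest.
# Rows with a missing Country fail the comparison and are dropped as well.
us_only = (
    sample.loc[sample["Country"] == "United States", key_columns]
    .reset_index(drop=True)
)

print(us_only)
```

Because rows with a missing `Country` are excluded by this comparison, it may be worth checking how many such rows exist before deciding whether dropping them is acceptable.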


In [4]:
# Check the dimensions of the dataset (rows, columns)
df.shape

(90348, 31)

In [None]:
# View a summary of the dataset columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90348 entries, 0 to 90347
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      90348 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

In [6]:
# Check for missing values in each column
df.isnull().sum()

Event.Id                   1459
Investigation.Type            0
Accident.Number            1459
Event.Date                 1459
Location                   1511
Country                    1685
Latitude                  55966
Longitude                 55975
Airport.Code              40216
Airport.Name              37644
Injury.Severity            2459
Aircraft.damage            4653
Aircraft.Category         58061
Registration.Number        2841
Make                       1522
Model                      1551
Amateur.Built              1561
Number.of.Engines          7543
Engine.Type                8555
FAR.Description           58325
Schedule                  77766
Purpose.of.flight          7651
Air.carrier               73700
Total.Fatal.Injuries      12860
Total.Serious.Injuries    13969
Total.Minor.Injuries      13392
Total.Uninjured            7371
Weather.Condition          5951
Broad.phase.of.flight     28624
Report.Status              7843
Publication.Date          16689
dtype: i

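From the output above, we can see that many columns contain a significant number of missing values. For example, `Schedule` is missing in 77,766 of the 90,348 rows (about 86%), `Air.carrier` in 73,700 (about 82%), and `Aircraft.Category` in 58,061. This presents a challenge because incomplete data can lead to inaccurate visualizations and misleading analysis.  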

### Why Is Cleaning Necessary?  
- **Ensures Data Accuracy** → Missing values can distort trends and patterns.  
- **Improves Visualization** → Charts and graphs require complete data for meaningful insights.   

### Next Steps
To ensure our **final insights are reliable and data-driven**, we need to properly handle these missing values.  

In the next section, we will focus on Stage 2: **Data Cleaning**, where we will:  
✅ Drop columns with excessive missing values.  
✅ Fill or impute missing values where necessary.  
✅ Filter and refine the dataset for meaningful analysis.  

By carefully cleaning the data, we set the foundation for accurate **visualization and decision-making**.
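
The cleaning steps listed above can be sketched as follows. This is only an outline under stated assumptions: the 50% threshold for dropping a column and the use of the median for imputation are illustrative choices, not decisions made in this notebook, and the toy values are invented.

```python
import pandas as pd

# Invented rows using the dataset's column names.
sample = pd.DataFrame({
    "Make": ["Cessna", "Piper", None, "Bell"],
    "Schedule": [None, None, None, "NSCH"],
    "Total.Uninjured": [0.0, None, 2.0, 1.0],
})

# Share of missing values per column, highest first.
missing_share = sample.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Assumed cutoff: drop any column with more than half of its values missing.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index
cleaned = sample.drop(columns=to_drop)

# Assumed imputation: fill the remaining numeric gaps with the column median.
median_uninjured = cleaned["Total.Uninjured"].median()
cleaned["Total.Uninjured"] = cleaned["Total.Uninjured"].fillna(median_uninjured)

print(cleaned)
```

Working with the share of missing values rather than raw counts makes the cutoff independent of dataset size, so the same threshold still applies after rows are filtered out.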


In [7]:
# View summary statistics
df.describe()

| | Number.of.Engines | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured |
|---|---|---|---|---|---|
| count | 82805.0 | 77488.0 | 76379.0 | 76956.0 | 82977.0 |
| mean | 1.146585 | 0.647855 | 0.279881 | 0.357061 | 5.32544 |
| std | 0.44651 | 5.48596 | 1.544084 | 2.235625 | 27.913634 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| max | 8.0 | 349.0 | 161.0 | 380.0 | 699.0 |
