#  Aviation Accident Analysis (1962–2023)

##  1. Business Understanding

###  Project Overview

A new company is preparing to enter the aviation industry and is currently in the **decision-making phase** regarding **fleet acquisition** and **operational planning**. While the industry presents immense opportunities for profit and expansion, **aviation safety** remains a primary concern. Aircraft accidents not only lead to the tragic loss of life but also cause severe reputational damage, legal implications, and financial losses. Therefore, minimizing the risk of such incidents is a **core strategic goal** for the company.

This project aims to analyze over **60 years of aviation accident data** sourced from the **National Transportation Safety Board (NTSB)** to derive **data-driven insights** into:

- The **root causes** of aviation accidents
- The **types of aircraft** most involved in accidents
- The **safest aircraft models and manufacturers**
- Accident trends by **location**, **weather**, **phase of flight**, and **purpose of flight**
- **Fatality rates** and the **severity** of different types of incidents

### 🛠 Why This Analysis Matters

Entering the aviation industry without understanding historical risks would be like flying blind. This analysis will help the business:

1. **Reduce Risk Exposure**  
   By identifying aircraft types or operational conditions that frequently result in accidents, the company can avoid investing in high-risk assets or routes.

2. **Improve Procurement Strategy**  
   With insights into the safest aircraft models, the business can make more **informed purchasing decisions** that prioritize **safety, performance, and reliability**.

3. **Enhance Operational Safety Protocols**  
   By understanding accident patterns across weather conditions, flight phases, and human factors, the company can design **training programs**, **safety checklists**, and **emergency response procedures** that mitigate risk.

4. **Build Customer and Investor Trust**  
   Demonstrating a strong, data-backed commitment to safety can **enhance the brand image**, foster **customer confidence**, and attract **investors or partners** seeking responsible operators.

###  Business Questions We Aim to Answer

- What are the **most common causes** of aviation accidents?
- Which aircraft **types** and **manufacturers** are most often involved in fatal incidents?
- Which **U.S. states** and **regions** have recorded the highest number of accidents?
- How do accident frequency and severity vary by **flight type** (private, commercial, military)?
- What are the trends in aviation safety over the years?
- Which aircraft or manufacturers have the **lowest fatality-to-accident ratio**, indicating better survivability?
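
To make the survivability question concrete, one possible metric is the number of fatalities per recorded accident for each manufacturer. The sketch below runs on a small made-up table: the column names `Make` and `Total.Fatal.Injuries` come from the dataset, while the helper name `fatalities_per_accident`, the makes "Alpha" and "Beta", and all values are illustrative assumptions, not results.

```python
import pandas as pd

def fatalities_per_accident(df: pd.DataFrame) -> pd.DataFrame:
    """Accidents, fatalities, and fatalities per accident for each make (hypothetical helper)."""
    summary = df.groupby("Make").agg(
        accidents=("Total.Fatal.Injuries", "size"),   # number of rows per make
        fatalities=("Total.Fatal.Injuries", "sum"),   # nulls are skipped by sum
    )
    summary["fatalities_per_accident"] = summary["fatalities"] / summary["accidents"]
    # Lowest ratio first: fewer fatalities per recorded accident
    return summary.sort_values("fatalities_per_accident")

# Made-up sample data for illustration only
toy = pd.DataFrame({
    "Make": ["Alpha", "Alpha", "Beta", "Beta", "Beta"],
    "Total.Fatal.Injuries": [0, 2, 0, 0, 1],
})
toy_summary = fatalities_per_accident(toy)
print(toy_summary)
```

A ratio like this is not normalized by exposure such as flight hours or fleet size, so it is best read as one signal among several rather than a ranking on its own.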

### Final Deliverables

The findings of this project will be communicated via:

- An **interactive dashboard** for stakeholder exploration
- A **summary presentation** tailored to non-technical executives
- A **recommendation report** on aircraft procurement
- A clean, well-documented **GitHub repository** for transparency

---

> This analysis goes beyond the numbers: it is a decision-making tool for the **future of aviation safety and investment**.


## 2. Data Understanding

In this section, we perform an initial exploration of the dataset to understand its structure, contents, and quality. This includes:
- Viewing column names and data types
- Previewing the first few records
- Identifying missing values
- Understanding the nature of each variable (categorical, numerical, textual)
- Flagging early data quality issues such as duplicates, inconsistent formats, or suspicious values
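
The variable-type step in the list above could be approximated with a simple heuristic, sketched here on a tiny hand-made table. The helper name `describe_variable_kinds` and the `max_categories` cutoff are assumptions for illustration; the sample values are made up apart from the locations, which appear in the data preview.

```python
import pandas as pd

def describe_variable_kinds(df: pd.DataFrame, max_categories: int = 20) -> pd.Series:
    """Label each column as numerical, categorical, or textual (rough heuristic)."""
    kinds = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            kinds[col] = "numerical"
        elif df[col].nunique(dropna=True) <= max_categories:
            # Few distinct values: treat as a category
            kinds[col] = "categorical"
        else:
            # Many distinct values: treat as free text or an identifier
            kinds[col] = "textual"
    return pd.Series(kinds, name="kind")

# Toy data for illustration only
toy = pd.DataFrame({
    "Total.Fatal.Injuries": [0.0, 2.0, None, 1.0],
    "Weather.Condition": ["VMC", "IMC", "VMC", None],
    "Location": ["MOOSE CREEK, ID", "BRIDGEPORT, CA", "Saltville, VA", "EUREKA, CA"],
})
kinds = describe_variable_kinds(toy, max_categories=3)
print(kinds)
```

The cutoff is arbitrary; on the full dataset it would need tuning, and identifier columns would still deserve a manual look.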


In [13]:
# Import essential libraries
import pandas as pd
import numpy as np

# Show every column when displaying a DataFrame
pd.set_option('display.max_columns', None)

# Try UTF-8 first, then fall back to Latin-1
df = pd.DataFrame()  # stays empty if loading fails, so the checks below don't crash
try:
    df = pd.read_csv('data/AviationData.csv', encoding='utf-8')
except UnicodeDecodeError:
    try:
        df = pd.read_csv('data/AviationData.csv', encoding='latin1')  # decodes bytes that are invalid UTF-8
    except Exception as e:
        print(" Still can't load the dataset:", e)
except Exception as e:
    # Covers non-encoding failures such as a missing file
    print(" Still can't load the dataset:", e)

# Check if data loaded
if not df.empty:
    print(f" Data loaded: {df.shape[0]} rows, {df.shape[1]} columns.")

    # Preview data
    print("\n First 5 records:")
    print(df.head())

    # Column types
    print("\n Column data types:")
    print(df.dtypes)

    # Missing values
    print("\n Missing values per column:")
    print(df.isnull().sum())

    # Duplicates
    print("\n Duplicate rows:", df.duplicated().sum())
else:
    print(" No data to explore.")




✅ Data loaded: 88889 rows, 31 columns.

🔍 First 5 records:
         Event.Id Investigation.Type Accident.Number  Event.Date  \
0  20001218X45444           Accident      SEA87LA080  1948-10-24   
1  20001218X45447           Accident      LAX94LA336  1962-07-19   
2  20061025X01555           Accident      NYC07LA005  1974-08-30   
3  20001218X45448           Accident      LAX96LA321  1977-06-19   
4  20041105X01764           Accident      CHI79FA064  1979-08-02   

          Location        Country Latitude Longitude Airport.Code  \
0  MOOSE CREEK, ID  United States      NaN       NaN          NaN   
1   BRIDGEPORT, CA  United States      NaN       NaN          NaN   
2    Saltville, VA  United States  36.9222  -81.8781          NaN   
3       EUREKA, CA  United States      NaN       NaN          NaN   
4       Canton, OH  United States      NaN       NaN          NaN   

  Airport.Name Injury.Severity Aircraft.damage Aircraft.Category  \
0          NaN        Fatal(2)       Destroyed   

## 3. Data Preparation
### Objective:
Clean and prepare the `AviationData.csv` dataset to ensure it's free of:

- Missing/null values
- Duplicates
- Incorrect data types

### (i) Check for Null Values
We begin by inspecting the dataset for missing values using .isnull().sum(). This helps us identify columns that may require imputation or removal, depending on the percentage of missing data and its relevance to our analysis.

In [14]:
#  Check for null values across all columns
null_counts = df.isnull().sum().sort_values(ascending=False)

#  Display columns with the highest number of missing values
null_counts[null_counts > 0]


Schedule                  76307
Air.carrier               72241
FAR.Description           56866
Aircraft.Category         56602
Longitude                 54516
Latitude                  54507
Airport.Code              38640
Airport.Name              36099
Broad.phase.of.flight     27165
Publication.Date          13771
Total.Serious.Injuries    12510
Total.Minor.Injuries      11933
Total.Fatal.Injuries      11401
Engine.Type                7077
Report.Status              6381
Purpose.of.flight          6192
Number.of.Engines          6084
Total.Uninjured            5912
Weather.Condition          4492
Aircraft.damage            3194
Registration.Number        1317
Injury.Severity            1000
Country                     226
Amateur.Built               102
Model                        92
Make                         63
Location                     52
dtype: int64
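
Because the decision to impute or remove a column depends on the share of missing data, it can help to view the counts as percentages too. The following is a minimal sketch on a made-up table; `missing_report` is a hypothetical helper name, and the sample values are placeholders.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Null count and percentage per column, highest first, columns with nulls only."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    "Schedule": [None, None, None, "x"],
    "Make": ["a", "b", None, "a"],
    "Event.Id": ["e1", "e2", "e3", "e4"],
})
report = missing_report(toy)
print(report)
```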

### Drop Columns with Too Many Nulls

In [15]:
# Drop columns with more than 50% missing values. They offer limited analytical
# value and could introduce noise or misleading patterns if imputed poorly.
threshold = len(df) * 0.5
cols_to_drop = df.columns[df.isnull().sum() > threshold]

df.drop(columns=cols_to_drop, inplace=True)
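
As a possible variation on the threshold rule, the sketch below returns a new DataFrame together with the names of the dropped columns, so the decision stays visible and the original data is left untouched. The helper `drop_sparse_columns` and its `max_missing` parameter are assumptions for illustration, demonstrated on made-up data.

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5):
    """Return a copy without columns whose null fraction exceeds max_missing,
    plus the list of dropped column names."""
    missing_fraction = df.isnull().mean()  # fraction of nulls per column
    dropped = missing_fraction[missing_fraction > max_missing].index.tolist()
    return df.drop(columns=dropped), dropped

# Toy data for illustration only
toy = pd.DataFrame({
    "Schedule": [None, None, None, "x"],   # 75% missing: dropped
    "Latitude": [None, None, 36.9, 40.1],  # exactly 50% missing: kept
    "Make": ["a", "b", "c", "d"],
})
cleaned, dropped = drop_sparse_columns(toy)
print(dropped)
```

Note that the comparison is strict, so a column that is exactly half empty survives, matching the "more than 50%" wording.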

### Fill Injury Columns with 0

In [16]:
# Fill injury-related nulls with 0, assuming a null means no injury was reported.
# This keeps totals and averages computable, but it may understate injuries
# wherever a null actually means the count is unknown.
injury_cols = [
    'Total.Fatal.Injuries',
    'Total.Serious.Injuries',
    'Total.Minor.Injuries',
    'Total.Uninjured'
]

df[injury_cols] = df[injury_cols].fillna(0)
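
Since filling with 0 rests on an assumption, one option is to record which rows were originally missing before filling, so later analysis can check how sensitive the results are. This is a minimal sketch on made-up data; the helper `fill_injuries_with_flag` and the column `injury_data_missing` are hypothetical names, not part of the dataset.

```python
import pandas as pd

def fill_injuries_with_flag(df: pd.DataFrame, injury_cols: list) -> pd.DataFrame:
    """Return a copy where injury nulls become 0 and a boolean column marks
    rows that originally had any injury count missing."""
    out = df.copy()
    out["injury_data_missing"] = out[injury_cols].isnull().any(axis=1)
    out[injury_cols] = out[injury_cols].fillna(0)
    return out

# Toy data for illustration only
toy = pd.DataFrame({
    "Total.Fatal.Injuries": [2.0, None, 0.0, None],
    "Total.Uninjured": [1.0, 3.0, 2.0, None],
})
filled = fill_injuries_with_flag(toy, ["Total.Fatal.Injuries", "Total.Uninjured"])

# The mean changes depending on whether nulls are skipped or treated as 0
mean_reported = toy["Total.Fatal.Injuries"].mean()      # nulls skipped
mean_filled = filled["Total.Fatal.Injuries"].mean()     # nulls counted as 0
print(mean_reported, mean_filled)
```

Comparing the two means on the real data would show how much the fill-with-0 assumption moves the fatality statistics.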