## Aviation Accident Risk Analysis: Data-Driven Recommendations for Safer Investments
This project explores historical aviation accident data to identify patterns, contributing factors, and risk profiles associated with various aircraft models, flight conditions, and operational phases. By integrating accident records with regulatory data, weather conditions, and aircraft registration details, I aim to uncover actionable insights that support strategic decision-making—particularly for stakeholders assessing aircraft safety before investment or deployment.

Through a combination of statistical techniques and visual analytics, this analysis reveals key trends spanning decades of incidents. The ultimate goal: to deliver **at least three concrete, data-backed business recommendations** that enhance aviation safety and reduce investment risk for operators, insurers, and aviation decision-makers.

## Guiding Questions for Analysis

To shape meaningful business recommendations and uncover the underlying factors contributing to aviation accidents, the following key questions will guide me in my analysis:

1. **Which aircraft models are associated with the highest and lowest accident rates, and how do these rates compare when normalized by fleet size or registration volume?**  
   *→ Informs investment risk by identifying safer aircraft models.*

2. **What role do weather conditions play in aviation accidents, and which specific weather types are most frequently linked to severe outcomes?**  
   *→ Supports operational planning and risk mitigation under adverse weather.*

3. **To what extent do regulatory or maintenance-related issues contribute to accident frequency or severity?**  
   *→ Informs policy adjustments and helps rank compliance risk across aircraft categories.*

4. **Have accident patterns shifted over time, and what does this reveal about the effectiveness of safety regulations or technological advancements?**  
   *→ Tracks progress and identifies areas needing continued focus.*


## PHASE ONE:  Data Understanding

In this section, i will dive into a comprehensive examination of all datasets i will use in the project. The goal is to assess their structure, contents, and quality  and begin identifying how they can be integrated to support meaningful analysis and actionable insights.

---

###  Objectives

- Understand the schema, variables, and value distributions in each dataset.
- Assess data quality: missing values, inconsistencies, encoding issues.
- Identify relationships and join keys across datasets.
- Define preprocessing needs for each dataset.

---

###  Approach

#### 1. **Main Exploration (Aviation DAta)**
- Load the aviation accident dataset.
- Inspect variable types and value ranges.
- Identify missing or inconsistent values.
- Explore time, location, aircraft model, and severity distributions.

#### 2. **Explore Supplementary Data**
- Review each FAA data:
  - Are the values well-formatted?
  - Any obvious missing or invalid entries?
  - What columns are useful?

#### 3. **Plan for Dataset Integration**
- Identify common keys for joining:
  - `Registration.Number` ↔ `N-Number` (FAA)
  - `Model` ↔ `MODEL` (FAA)
  - Date + Lat/Lon proximity ↔ GHCND Weather
- Consider transformations (e.g., date parsing, coordinate matching).

---



In [None]:
#importing standard libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore", category=pd.errors.DtypeWarning)

## PART ONE: Aviation Data
The primary dataset for this project consists of detailed records of aviation accidents, capturing various attributes such as accident number, date, aircraft model, flight phase, location, injury severity, and more. This dataset serves as the backbone of my analysis and will help me uncover core patterns in accident frequency, severity, and causes.

Before diving into analysis, i will begin by examining the structure and content of this dataset to understand its variables, detect missing or inconsistent data, and identify potential areas for transformation. This step is critical in ensuring that my insights are grounded in clean, reliable, and meaningful data.

**Objectives:**
- Get familiar with the features (columns) present in the dataset  
- Check the completeness and data types of each feature  
- Identify key columns that will drive our analysis.
- Detect potential issues such as missing values, formatting inconsistencies, or ambiguous entries  


In [None]:
aviation_data = pd.read_csv("Data/Aviation-data/AviationData.csv", encoding='latin1')

In [None]:
aviation_data.shape

In [None]:
aviation_data.head()

In [None]:
aviation_data.tail()

In [None]:
aviation_data.columns

In [None]:
aviation_data.info()

In [None]:
aviation_data.describe().T

In [None]:
aviation_data.describe(include='O').T

In [None]:
aviation_data.isna().any()



###  Findings
- The dataset successfully loaded using `latin1` encoding due to extended character sets in some fields.
- A preliminary inspection using `.head()` and `.tail()` confirms the structure is consistent across rows.

###  Columns & Features
- The dataset contains a wide range of features including:
  - Aircraft information (make, model, engine type, registration number, etc.)
  - Flight conditions (weather, phase of flight, purpose of flight)
  - Accident details (date, location, injury severity, aircraft damage, narrative)

- Column names are inconsistent and will require **standardization and renaming** for readability and usability in analysis.

###  Data Types and Initial Insights
- The `.info()` summary reveals a mixture of:
  - **Categorical features** such as `Injury.Severity`, `Weather.Condition`, and `Aircraft.Damage`
  - **Date fields** like `Event.Date`, which will be parsed into datetime format

### Missing Data
- A significant number of features contain **missing or null values**, particularly in:
  - latitude and longitude
  - Airport name and Code
  - Aircraft category
These issues will be addressed during the **Data Cleaning** phase.

---

This initial preview establishes a foundational understanding of the dataset. Further steps will involve cleaning, transforming, and preparing the data for analysis.


## PHASE TWO: DATA CLEANING & WRANGLING 

In this section, I begin the data cleaning and wrangling phase of my analysis. After getting an overview of all datasets, I will now dig deeper into the structure and contents of the aviation accident data — the main dataset powering my analysis.

My goal here is to:
- Understand the **meaning and relevance** of each column
- Decide which features are **critical for analysis**, and which can be **dropped or transformed**
- Handle **missing values** using clear logic
- Create a **clean, well-structured dataset** ready for Exploratory Data Analysis (EDA) and risk modeling

---

###  Why Focus on U.S. Data?

The aviation accident dataset spans both domestic and international incidents from **1962 to 2023**. After analyzing the `Country` column, I found:

- Total records in dataset: **88,889**
- Records with `Country == "United States"`: **82,248**
- Proportion of U.S. data: 92.5%


Given that **over 92% of the data is U.S.-based**, it is statistically sound to anchor my cleaning and initial analysis on this subset. This choice ensures:
- High-quality and consistent data (due to FAA reporting standards)
- Easier cross-referencing with other FAA and registration datasets
- A more stable foundation for accurate risk modeling and business recommendations

---

###  What About the Non-U.S. (Diaspora) Data?

While my focus will be on U.S.-based data for the purposes of cleaning, modeling, and initial business recommendations, I will **not discard the international data**.

Instead, I will:
- Preserve a cleaned version of non-U.S. (diaspora) data separately
- Consider adding it in later as a **secondary insight layer**
- Allow for potential **interactive filtering in dashboards** (e.g., U.S. vs. Global view)

This approach ensures that my analysis is both **deep (U.S. focus)** and **scalable (global relevance)**.

---

### Next Steps in Cleaning

I will now:
1. Filter and work with U.S. records only (`Country == "United States"`)
2. Examine each column in detail
3. Handle missing values logically
4. Clean inconsistencies (e.g., in aircraft model names, date formats, injury reports)
5. Save a clean version of the dataset for further analysis

Once complete, this cleaned dataset will form the foundation for:
- Exploratory Data Analysis
- Aircraft risk profiling
- Visualizations and business intelligence recommendations



In [None]:
aviation_data_copy = aviation_data.copy()

## Column Name Meanings

| Column Name              | Meaning                                                                       |
| ------------------------ | ----------------------------------------------------------------------------- |
| `Investigation.Type`     | Whether the event was an "Accident" or "Incident". Accidents are more severe. |
| `Accident.Number`        | Unique ID for each event. Serves as the primary key.                          |
| `Event.Date`             | Date the accident or incident occurred.                                       |
| `Location`               | General description of where the event happened (e.g., city, area).           |
| `Country`                | Country where the event occurred.                                             |
| `Latitude`               | Geographic coordinate (north-south) of the event.                             |
| `Longitude`              | Geographic coordinate (east-west) of the event.                               |
| `Airport.Code`           | FAA/IATA code of the airport involved (if any).                               |
| `Airport.Name`           | Full name of the airport involved (if any).                                   |
| `Injury.Severity`        | Summary of the severity of injuries (e.g., Fatal, Serious, Minor).            |
| `Aircraft.damage`        | Description of damage sustained by the aircraft.                              |
| `Aircraft.Category`      | General category of aircraft (e.g., airplane, rotorcraft).                    |
| `Registration.Number`    | Aircraft registration number (like a license plate).                          |
| `Make`                   | Manufacturer of the aircraft (e.g., Boeing, Cessna).                          |
| `Model`                  | Specific model of the aircraft.                                               |
| `Amateur.Built`          | Indicates if the aircraft was amateur-built ("Yes" or "No").                  |
| `Number.of.Engines`      | Number of engines the aircraft had.                                           |
| `Engine.Type`            | Description of the aircraft’s engine type.                                    |
| `FAR.Description`        | FAA regulatory category under which the aircraft was operating.               |
| `Schedule`               | Indicates if the flight was scheduled or unscheduled.                         |
| `Purpose.of.flight`      | Reason or purpose for the flight (e.g., personal, training).                  |
| `Air.carrier`            | Name of the air carrier, if applicable (commercial flights).                  |
| `Total.Fatal.Injuries`   | Total number of people who died in the event.                                 |
| `Total.Serious.Injuries` | Total number of people with serious injuries.                                 |
| `Total.Minor.Injuries`   | Total number of people with minor injuries.                                   |
| `Total.Uninjured`        | Total number of people who were not injured.                                  |
| `Weather.Condition`      | Weather during the event (e.g., VMC, IMC, UNK).                               |
| `Broad.phase.of.flight`  | Phase of flight during which the event occurred (e.g., landing, taxi).        |
| `Report.Status`          | Indicates if the report is preliminary or final.                              |
| `Publication.Date`       | Date the report was published.                                                |

----
The meanings provide description of what each column entails, thus expanding my domain knowledge on the data set

In [None]:
#cleaning and renaming the columns 
aviation_data_copy.rename(columns={
    'Investigation.Type': 'Investigation_Type',
    'Accident.Number': 'Accident_Number',
    'Event.Date': 'Event_Date',
    'Airport.Code': 'Airport_Code',
    'Airport.Name': 'Airport_Name',
    'Injury.Severity': 'Injury_Severity',
    'Aircraft.damage': 'Aircraft_Damage',
    'Aircraft.Category': 'Aircraft_Category',
    'Registration.Number': 'Registration_Number',
    'Make': 'Aircraft_Make',
    'Model': 'Aircraft_Model',
    'Amateur.Built': 'Amateur_Built',
    'Number.of.Engines': 'Number_of_Engines',
    'Engine.Type': 'Engine_Type',
    'FAR.Description': 'FAR_Description',
    'Schedule': 'Schedule_Type',
    'Purpose.of.flight': 'Purpose_of_Flight',
    'Air.carrier': 'Air_Carrier',
    'Total.Fatal.Injuries': 'Fatal_Injuries',
    'Total.Serious.Injuries': 'Serious_Injuries',
    'Total.Minor.Injuries': 'Minor_Injuries',
    'Total.Uninjured': 'Uninjured',
    'Weather.Condition': 'Weather_Condition',
    'Broad.phase.of.flight': 'Phase_of_Flight',
    'Report.Status': 'Report_Status',
    'Publication.Date': 'Publication_Date'
}, inplace=True)

aviation_data_copy.columns

In [None]:
def clean_column_names(df):
    df.columns = (
        df.columns
        .str.strip()
        .str.lower()
        .str.replace('.', '_', regex=False)
        .str.replace(' ', '_', regex=False)
    )
    return df

aviation_data_copy = clean_column_names(aviation_data_copy)

aviation_data_copy.columns


In [None]:
#copy of the US-subset
us_data = aviation_data_copy[aviation_data_copy['country'] == 'United States'].copy()
#copy for the diaspora data
diaspora_data = aviation_data_copy[aviation_data_copy['country'] != 'United States'].copy()

##  General Rules for Dropping Data

Dropping data—whether rows or columns—should be done cautiously, guided by domain knowledge and data quality goals. Below are standard, defensible rules that i will use in my analysis.

---

###  Dropping Columns

I will drop a column if:

- It has a **high percentage of missing values** (typically > 50–70%) and is not critical for analysis.
- It contains **only a single unique value** (i.e., zero variance — no information gain).
- It is a **duplicate of another column** (redundancy).
- The data is **irrelevant to the current analysis objectives** (e.g., IDs or metadata not used for joins or context).
- It is **impossible to interpret or decode** (e.g., poorly documented, encoded variables with no lookup).

---

###  Dropping Rows

I will  drop a row if:

- **Critical columns are missing**, especially where imputation is not appropriate (e.g., timestamps, unique identifiers, target variable).
- It contains **clearly erroneous or corrupted data** (e.g., wrong data types, impossible values like negative injuries or invalid dates).
- It is a **complete duplicate** of another row.
- It **violates integrity constraints**, such as conflicting values across dependent fields.

---

### Cautions

- Consider **imputation or transformation** before dropping — dropping should be the **last resort** if data is unrecoverable.
- **Document your rationale** for each drop, especially in sensitive or audit-heavy domains like aviation or healthcare.
- Consider the **impact on representativeness**: Dropping too many rows can introduce bias or reduce statistical power.

---

### Best Practice

I will use `.info()`, `.isnull().sum()`, and `.nunique()` early in EDA to assess the quality of each column and  back decisions with simple visuals (e.g., **missingness heatmaps** or **histograms**).


In [None]:

us_data.info()

In [None]:
us_data.isna().any()

## Findings ?
The only columns with no missing data are event_id,investigation_type,accident_number, event_date
i will go through each column one by one and keep relevant data and drop the rest, i can also use domain knowlege to fill some columns, either way i want to maintain minimal bias according to what i have.

## COLUMN BY COLUMN INVESTIGATION AND VERDICT
----


### 1.LOCATION

I noticed a pattern which might help me in one way or another fill missing data in my aviation data set, this is what i found

### The NTSB `accident_number` has a patter as shown and explained below:

According to the **NTSB Aviation Data Dictionary**, the structure of an `accident_number` follows a specific pattern:

###  Format Breakdown:

- **First 3 characters**: NTSB **office code**  
  *Example*: `MIA` = Miami Regional Office

- **Next 2 digits**: **Fiscal year** of the investigation  
  *Example*: `85` = Fiscal Year 1985

- **Next 2 letters**: **Investigation category and mode**  
  *Indicates whether the investigation involved airline, marine, etc.*

- **Next 3 digits**: A **sequential number** showing the order the case was opened in that fiscal year

- **Optional final letter**: Indicates **multiple aircraft** involved in the same event

---

### Example: `MIA85LAMS1`

This breaks down as:

- `MIA` → Miami NTSB Office  
- `85` → Fiscal year 1985  
- `L` → Likely a **major investigation** in **aviation** mode  
- `AMS` → Additional **category codes**  
- `1` → First in the sequence (possibly one of multiple aircraft)

##  Final Verdict on Missing `Location` Values (U.S. Data)

While analyzing the `Accident_Number` syntax, i came up with the following insights:

- The prefix (e.g., `MIA`, `FTW`, `LAX`) typically refers to the **NTSB regional office** that conducted the investigation — **not necessarily the accident location**.
- In some cases, the prefix aligns with the actual location.
- However, in other instances, the office may be **geographically distant** from where the accident occurred, making it **unreliable as a proxy** for true location.

---

### Conclusion

Although the `Accident_Number` can offer **hints**, it **cannot be consistently used** to infer accurate location data. thus the missing values i will use fillna with Unknown

---

In [None]:
us_data['location'] = us_data['location'].fillna("Unknown")


In [None]:
us_data[['city', 'state']] = (
    us_data['location']
    .str.split(',', n=1, expand=True)      
    .apply(lambda x: x.str.strip())        
)

In [None]:
us_data['city'] = us_data['city'].str.strip().str.upper()
us_data['city'] = us_data['city'].replace('', 'UNKNOWN')


In [None]:
state_fixes = {
    'HONOLULU, HI': 'HI',
    'OAHU, HI': 'HI',
    "MANU'A, HI": 'HI',
    'MAUI, HI': 'HI',
    'KAUAI, HI': 'HI',
    'MOLOKAI, HI': 'HI',
    'Oahu, HI': 'HI',
    'Kauai, HI': 'HI',
    'Maui, HI': 'HI',
    'NYC, NY': 'NY',
    'San Juan Is., WA': 'WA',
    'LA,': 'LA',
    ', NC': 'NC',
    ', WA': 'WA',
    'CO, CO': 'CO',
    'UN': 'UNKNOWN',
    'OF': 'UNKNOWN',
    'MG, OF': 'UNKNOWN',
    'CB': 'UNKNOWN',
    'GM': 'UNKNOWN',
    'AO': 'UNKNOWN',
    'PO': 'UNKNOWN',
    '': 'UNKNOWN',
    None: 'UNKNOWN'
}
us_data['state'] = us_data['state'].replace(state_fixes)
us_data['state'] = us_data['state'].str.strip().str.upper()


In [None]:
state_codes = pd.read_csv('Data/Aviation-data/USState_Codes.csv')
state_codes.rename(columns={
    'Abbreviation': 'state',
    'US_State': 'state_full'
}, inplace=True)

us_data = us_data.merge(state_codes, on='state', how='left')
us_data['state_full'] = us_data['state_full'].fillna('UNKNOWN').str.strip().str.upper()


At this stage i will split this column into two new columns before i proceed with the cleaning 

The `location` column combines city and state information in a single string ( `"COCOA, FL"`). To support more granular geographic analysis, i will split this column into two distinct fields:

- **`city`** – the name of the city, town, or municipality where the event occurred  
- **`state`** – the two-letter U.S. state abbreviation ( `FL`, `CA`)

-----

- **Missing or malformed entries**: If the `location` field is  missing or did not contain a comma, both `city` and `state` will be assigned `'Unknown'`.
- **Whitespace handling**: Leading and trailing whitespaces will be stripped from both city and state values for consistency.
- **State validation**: I will  U.S. state code reference provided  (`USState_Codes.csv`)  to map abbreviations to full state names.
- **New field – `state`**: This additional column improves interpretability and supports advanced analysis (e.g., aggregating by full state name).

By structuring the `location` data this way, we enable more precise regional breakdowns, simplify future joins with FAA and weather datasets, and enhance the overall analytical quality of the dataset.


## 2.Dropping Latitude and Longitude

The `latitude` and `longitude` columns represent the geographical coordinates of where each accident occurred. After evaluating their utility for this analysis, a decision was made to **drop both columns** based on the following rationale:

---

###  Reasons for Dropping:

- **Over 60% of values are missing** — specifically, `49,983` out of `82,248` us_data for the latitude column records have null entries, making reliable imputation impractical.
- The dataset lacks sufficient **contextual data** (e.g., accident causes, airport coordinates) needed to generate meaningful geospatial insights.
- The **focus of this analysis** is on **temporal, categorical, and severity-based trends**, rather than spatial or geographical mapping.
- Performing distance calculations (accident site to nearest airport) would require **external datasets and geolocation logic** not currently available in this phase.

---


In [None]:
us_data.drop(columns=['latitude', 'longitude'], inplace=True)


## 3.AIRPORT CODES AND NAMES

## Why I Considered Using Airport Data

At first, I thought about keeping `airport_code` and `airport_name` in the dataset. They could be useful **if** I were analyzing:

- Accidents by airport
- Geographic clustering
- Infrastructure-related risks at specific airports

But for that kind of analysis, I’d need **supporting geospatial data** like:

- Fuel logs or flight paths  
- Maintenance or repair history  
- Distances between the origin and crash site  
- Data on airport infrastructure or traffic density

---

## Why I’m Dropping It

After thinking it through, I decided to drop the airport data because:

- I’m **not analyzing airport-specific risks** in this project  
- I **don’t have the complementary data** needed to make airport-level insights meaningful  
- I already have **location data**, which offers better granularity — and I’ve taken the time to clean it  
- Plus, **airport names and codes are often inconsistent** or messy in large datasets — keeping them without a clear purpose would just add noise
- Lastly there mentioned columns have alot missing data

---



In [None]:
us_data.drop(columns=['airport_code', 'airport_name'], inplace=True)


## 4.INJURY SEVERITY

As I dug into the `injury_severity` column, I came across a mix of values like:
['Fatal(2)', 'Fatal(4)', 'Fatal(3)', 'Non-Fatal', 'Incident', 'Fatal(8)', ..., 'Minor', 'Serious'] after running `.unique()`
At first, I considered extracting the numbers inside the parentheses to quantify severity. But then I realized that the dataset already includes more precise injury data in the following columns:

- `total_fatal_injuries`
- `total_serious_injuries`
- `total_minor_injuries`
- `total_uninjured`

When I cross-checked, the numeric values embedded in the `total_injury_severity` strings (like `'Fatal(49)'`) matched the numbers in these dedicated columns — which are much **cleaner and more consistent**.

---

### My Decision

Rather than extracting and parsing those messy strings — which felt redundant and prone to errors — I decided to **normalize `injury_severity` into a new categorical column**.

This new column, `injury_severity_clean`, includes simple, consistent categories like:

- `Fatal`
- `Non-Fatal`
- `Incident`
- `Minor`
- `Serious`
- `Unavailable`
- `Unknown` (for missing values)

This makes the data much easier to group, analyze, and visualize — without worrying about inconsistent formatting.

---

### Final Verdict

-  I created a new column: `injury_severity_clean`
-  I kept the original `injury_severity` column for now (but may drop it later for tidiness)
-  All **quantitative injury analysis** will rely on the dedicated count columns (`fatal_injuries`, `serious_injuries`, etc.)

This approach gives me clarity and flexibility in both categorical and numeric injury analysis.


In [None]:
us_data['injury_severity'].isnull().sum()

In [None]:
us_data['injury_severity'].unique()

In [1]:
us_data['injury_severity'] = us_data['injury_severity'].fillna('Unknown')

us_data['injury_severity'] = us_data['injury_severity'].str.strip()

def normalize(value):
    value = value.upper().strip()
    if 'FATAL' in value:
        return 'FATAL'
    elif 'NON-FATAL' in value:
        return 'NON-FATAL'
    elif 'INCIDENT' in value:
        return 'INCIDENT'
    elif 'MINOR' in value:
        return 'MINOR'
    elif 'SERIOUS' in value:
        return 'SERIOUS'
    elif 'UNAVAILABLE' in value:
        return 'UNAVAILABLE'
    elif 'UNKNOWN' in value:
        return 'UNKNOWN'
    else:
        return 'OTHER'

us_data['injury_severity_clean'] = us_data['injury_severity'].apply(normalize)




NameError: name 'us_data' is not defined

### AIRCRAFT DAMAGE
The `aircraft_damage` Column

While reviewing the `aircraft_damage` column, I came across the following unique values:
`['Destroyed', 'Substantial', 'Minor', nan, 'Unknown']`
There were **1,979 missing values** (`NaN`), and I also noticed the string `'Unknown'` used as a category — making it a bit inconsistent. Since this column is **important for understanding the severity of aircraft damage**, I decided to retain it and clean it up for consistency.

---

To prepare the `aircraft_damage` column for analysis, I took the following steps:

- **Standardized formatting**: I converted all values to **uppercase** and **stripped whitespace** to ensure uniform text across entries.
- **Unified unknowns**: I treated both `NaN` values and the string `'Unknown'` as a single category — `'UNKNOWN'`. This makes it explicit in the analysis and prevents silent misclassification.

---

After cleaning, the column now contains just **four clear categories**:
`['DESTROYED', 'SUBSTANTIAL', 'MINOR', 'UNKNOWN']`


In [None]:
us_data['aircraft_damage'].unique()

In [None]:
us_data['aircraft_damage'].isna().sum()

In [None]:
us_data['aircraft_damage'] = (
    us_data['aircraft_damage']
    .fillna('UNKNOWN')
    .str.strip()
    .str.upper()
    .replace({'Unknown': 'UNKNOWN'})  
)

## 4. AIRCRAFT CATEGORY

## Imputing Missing `aircraft_category` Values

After noticing a large number of missing values in the `aircraft_category` column, I decided to explore whether these could be reasonably imputed based on other known attributes — specifically, the `aircraft_model`.

###  Domain Knowledge Insight

I leaned on some aviation domain knowledge to guide the imputation:

- A majority of the models with missing categories came from **Cessna** and **Piper**, such as **Cessna 152, 172**, and **Piper PA-28, PA-18**, etc.
- These models are widely recognized as **fixed-wing airplanes** commonly used in **general aviation**, **pilot training**, and **agricultural** operations.
-in as much as there might be other aircrafts that this assumption will overwrite the analyis even with the exact values the leading categories will be aircrafts and even the exact plot will follow my trend only that it has the  exact categories

**Sources** Wikipidia and NTSB and FAA

### Mode Imputation Evidence

To back up my approach with data:

- The dominant category in the dataset is **Airplane**, which accounts for **24,229 entries** — by far the majority.
- The sns histoplot also supports my assumption
- I examined the top 20 most frequent `aircraft_model` values where `aircraft_category` was missing, and **all of them** clearly mapped to the **"Airplane"** category.


Based on this evidence and domain context, I imputed the missing `aircraft_category` values using the following logic:

- If the `aircraft_model` belongs to a well-known **airplane family** — including **Cessna 150/152/172/180/182/206** or **Piper PA-series** — I confidently assigned it the category **"Airplane"**.


In [None]:
us_data['aircraft_category'].isna().sum()

In [None]:
us_data['aircraft_category'].unique()

In [None]:
us_data['aircraft_category'] = us_data['aircraft_category'].replace({
    'Unknown':'Airplane'
})
us_data['aircraft_category'] = us_data['aircraft_category'].fillna("Airplane")

In [None]:
plt.figure(figsize=(12, 6))
sns.histplot(data=us_data, x='aircraft_category', shrink=0.8, discrete=True, kde=False, color='skyblue')
plt.title('Histogram of Aircraft Categories', fontsize=16)
plt.xlabel('Aircraft Category', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 5. REGISTRATION NUMBER
The `registration_number` column turned out to be trickier and noisier than expected. Based on research and general aviation knowledge, every aircraft is supposed to have a unique registration number — similar to a license plate for cars.

---

##  Key Findings

### 1. **Duplicates Exist**
Some `registration_number` values appear more than once in the dataset. This could suggest:

- Multiple incidents involving the same aircraft over time (which is valid),
- Registration number reuse over time (common after an aircraft is decommissioned),
- Or possible data entry errors or truncation.

---

### 2. **Shared Registration Numbers Across Different Aircraft**
There are cases where:

- The same registration number is associated with different makes and models,
- Some entries even share similar event locations — possibly indicating recycled registrations, duplicate records, or poor documentation.

---

### 3. **Placeholder Values**
Many entries use generic or placeholder values like `"NONE"`, `"UNREG"`, and `"USAF"`. These typically indicate:

- Unregistered aircraft (especially experimental or military),
- Law enforcement or military aircraft with non-standard ID systems,
- Or situations where the actual registration wasn't available.

---

### 4. **Incomplete or Truncated IDs**
Some entries appear to be cut off or formatted inconsistently, making exact matching challenging without reliable external data like the FAA registry.

---

##  My Decision

Given that:

- The `registration_number` column is not essential to my core analysis goals,
- There is insufficient supporting data to clean or categorize entries properly,
- And time is limited,

 I chose to:

- Fill missing values with `"UNKNOWN"`:

In [None]:
us_data['registration_number'].unique()

In [None]:
us_data['registration_number'].isna().sum()

In [None]:

reg_counts = us_data[~us_data['registration_number'].isin(['NONE', 'UNREG'])]['registration_number'].value_counts()
reg_counts[reg_counts > 1].head(30)


In [None]:
us_data[us_data['registration_number'] == 'N20752'][['event_date', 'aircraft_make', 'aircraft_model', 'location']].sort_values('event_date')


In [None]:
us_data['registration_number']= us_data['registration_number'].fillna('UNKNOWN')

## 6. AIRCRAFT MAKE AND MODEL

This columns were kind to my sanity, with only 21 and 38 missing values respectively i will fill them with unknown, and standardize the colum removing white spaces and including hyphens in the relevant situation since on the model columns the numbers,casings and hyphens have meaning so i will just strip

In [None]:
us_data['aircraft_make'].value_counts()

In [None]:
us_data['aircraft_model'].value_counts()

In [None]:
us_data['aircraft_model'].isna().sum()

In [None]:
us_data['aircraft_make'].isna().sum()

In [None]:
us_data['aircraft_make'] = us_data['aircraft_make'].fillna('UNKNOWN')
us_data['aircraft_make'] = us_data['aircraft_make'].str.replace(r'\s+', '-', regex=True).astype('string').str.upper()


In [None]:
us_data['aircraft_model'] = us_data['aircraft_model'].fillna('UNKNOWN')
us_data['aircraft_model'] = us_data['aircraft_model'].str.strip()


## 7. AMATURE MODEL
**Amateur Built** ,also known as homebuilt, experimental, or kit aircraft refers to aircraft constructed by individuals, rather than certified aircraft manufacturers. These are usually built:

From kits sold by aviation companies

From plans using raw materials

Sometimes entirely custom-designed by the builder

### The FAA's View

In the U.S., the FAA classifies these as “Experimental - Amateur-Built” aircraft. The idea is that at least 51% of the aircraft must be built by amateurs for education, recreation, or personal use.

In my analysis i will fill the values with unknown and proceed to standardize everything

In [None]:
us_data['amateur_built'] = us_data['amateur_built'].fillna('UNKNOWN')
us_data['amateur_built'] = us_data['amateur_built'].str.strip().str.upper()


## 8.NUMBER OF ENGINES




I leveraged reliable information from the aircraft **aircraft_make** and **aircraft_odel** columns to accurately retrieve the number of engines for missing entries. While this was straightforward for well-known models, it required considerable manual mapping using domain knowledge and external research.

Eventually, this manual process plateaued — most of the remaining aircraft models were consistently single-engine types. Out of over 1,000 missing values, I had successfully mapped 400+, all of which had one engine. To validate this assumption, I examined the tail end of the unmapped entries and found the pattern persisted.

To save time and effort without compromising accuracy, I decided to impute the remaining missing values as **1**. This decision was further backed by a distribution plot showing a strong dominance of single-engine aircraft across the dataset.


In [None]:
engine_impute_map = {
    ('SCHWEIZER', 'SGS 2-33A'): 0,
    ('SCHWEIZER', 'SGS-2-33A'): 0,
    ('BALLOON-WORKS', 'FIREFLY 7'): 0,
    ('LET', 'L-13'): 0,
    ('LET', 'BLANIK L-13'): 0,
    ('SCHWEIZER', 'SGS 1-26E'): 0,
    ('I.C.A.-BRASOV', 'IS-28B2'): 0,
    ('BEECH', 'A36'): 1,
    ('CIRRUS', 'SR22'): 1,
    ('BEECH', '35'): 1,
    ('BURKHART-GROB', 'G103'): 0,
    ('CESSNA', '172'): 1,
    ('PIPER', 'PA28'): 1,
    ('BELL', '206'): 1,
    ('SCHWEIZER', 'SGS 1-34'): 0,
    ('CESSNA', 'A185'): 1,
    ('BALLOON-WORKS', 'FIREFLY 8-24'): 0,
    ('CAMERON', 'V-77'): 0,
    ('BOEING', '777'): 2, 
    ('PIPER', 'PA32R'): 1,
    ('BEECH', 'V35'): 1,
    ('RAVEN', 'RX-7'): 0,  
    ('SCHWEIZER', '2-33A'): 0,  
    ('CESSNA', '172N'): 1,
    ('AIR-TRACTOR-INC', 'AT 602'): 1,  
    ('SCHEMPP-HIRTH', 'STANDARD CIRRUS'): 0,
    ('HUGHES', '369D'): 1,  
    ('SCHWEIZER', 'SGS 2-33'): 0,
    ('PICCARD', 'AX-6'): 0,  
    ('ROLLADEN-SCHNEIDER', 'LS-4'): 0,  
    ('BEECH', '36'): 1, 
    ('PIPER', 'PA28R'): 1,  
    ('AEROSTAR', 'S-60A'): 0,  
    ('', 'S-77A'): 0,  
    ('SCHLEICHER', 'ASW-19'): 0,
    ('SCHEMPP-HIRTH', 'VENTUS-B'): 0,  
    ('AEROSTAR', 'RX-8'): 0,  
    ('PIPER', 'PA-18'): 1,
    ('WSK-PZL-KROSNO', 'KR-03A'): 0,  
    ('ADAMS', 'A55S'): 1,  
    ('AEROSTAR', 'S-77A'): 0,  
    ('SCHWEIZER', 'SGS 1-36'): 0,  
    ('BOEING', '737'): 2,
    ('BOEING', '757'): 2,
    ('LET', 'L 23 SUPER BLANIK'): 0,  
    ('LET', 'L-23'): 0,  
    ('RAVEN', 'S-60A'): 0,  
    ('AIRBUS', 'A320'): 2,
    ('CESSNA', 'T210'): 1,  
    ('AEROSTAR', 'S-66A'): 0,  
    ('PIPER', 'PA-28-181'): 1, 
    ('BURKHART-GROB', 'G102'): 0,  
    ('CESSNA', '182'): 1,
    ('BURKHART-GROB', '103'): 0,  
    ('AEROSTAR', 'RX 8'): 0,  
    ('SCHLEICHER', 'ASW-20'): 0, 
    ('BOEING', '737-300'): 2,
    ('UNKNOWN', 'UNKNOWN'): np.nan,
    ('RAVEN', 'S-66A'): 0,  
    ('SCHWEIZER', '269C'): 1,  
    ('BALLOON-WORKS', 'FIREFLY-7'): 0,
    ('ERCOUPE', '415'): 1, 
    ('BALLOON-WORKS', 'FIREFLY 8'): 0,
    ('AEROSTAR', 'RX-7'): 0,
    ('BALLOON-WORKS', 'FIREFLY 8B'): 0,
    ('SCHWEIZER', 'SGS 1-26B'): 0,  
    ('RAVEN', 'S60A'): 0,  
    ('SCHWEIZER', 'SGS 1-35'): 0,  
    ('GENERAL-BALLOON', 'AX-6'): 0,
    ('BELL', '206-L4'): 1,  
    ('CESSNA', 'T188C'): 1,  
    ('RAVEN', 'S55A'): 0,
    ('SCHWEIZER', 'SGS 1-26C'): 0,
    ('CHAMPION', '7ECA'): 1, 
    ('SCHWEIZER', 'SGS 2-32'): 0,
    ('GRUMMAN-ACFT-ENG-COR-SCHWEIZER', 'G 164A'): 1,  
    ('I.C.A.-BRASOV', 'IS-29D2'): 0,  
    ('RAVEN', 'S-66A'): 0, 
    ('SCHWEIZER', '269C'): 1, 
    ('BALLOON-WORKS', 'FIREFLY-7'): 0,
    ('ERCOUPE', '415'): 1, 
    ('BALLOON-WORKS', 'FIREFLY 8'): 0,
    ('AEROSTAR', 'RX-7'): 0,
    ('BALLOON-WORKS', 'FIREFLY 8B'): 0,
    ('SCHWEIZER', 'SGS 1-26B'): 0,  
    ('RAVEN', 'S60A'): 0,  
    ('SCHWEIZER', 'SGS 1-35'): 0,  
    ('GENERAL-BALLOON', 'AX-6'): 0,
    ('BELL', '206-L4'): 1, 
    ('CESSNA', 'T188C'): 1,  
    ('RAVEN', 'S55A'): 0,
    ('SCHWEIZER', 'SGS 1-26C'): 0,
    ('CHAMPION', '7ECA'): 1,  
    ('SCHWEIZER', 'SGS 2-32'): 0,  # Glider
    ('GRUMMAN-ACFT-ENG-COR-SCHWEIZER', 'G 164A'): 1,  
    ('I.C.A.-BRASOV', 'IS-29D2'): 0,
    ('CESSNA', 'TR182'): 1,
    ('CESSNA', '210'): 1,
    ('SCHWEIZER', 'SGS2-33A'): 0,
    ('SCHLEICHER', 'ASK-21'): 0,
    ('AIRBUS-INDUSTRIE', 'A320'): 2,
    ('CESSNA', '185'): 1,
    ('BARNES', 'AX-7'): 0,
    ('GLASFLUGEL', 'H-301'): 0,
    ('SCHEMPP-HIRTH', 'DISCUS A'): 0,
    ('SCHWEIZER', 'SGS 1-26A'): 0,
    ('BELLANCA', '8GCBC'): 1,
    ('LET', 'L-23 SUPER BLANIK'): 0,
    ('CESSNA', 'TU206'): 1,
    ('CESSNA', 'T210N'): 1,
    ('EMBRAER', 'ERJ170'): 2,
    ('GRUMMAN', 'G164'): 1,
    ('BALLOON-WORKS', 'FIREFLY 8B-15'): 0,
    ('AIR-TRACTOR', 'AT502'): 1,
    ('AIR-TRACTOR', 'AT802'): 1,
    ('CESSNA', 'A185F'): 1,
    ('HEAD-BALLOONS,-INC.', 'AX8-88'): 0,
    ('BOEING', '747'): 4,
    ('CESSNA', '180'): 1,
    ('SCHWEIZER', '2-32'): 0,
    ('BALLOON-WORKS', 'FIREFLY 7-B'): 0,
    ('BALLOON-WORKS', 'FIREFLY 7-15'): 0,
    ('ROLLADEN-SCHNEIDER', 'LS-4A'): 0,
    ('AIR-TRACTOR-INC', 'AT 502B'): 1,
    ('FLIGHT-DESIGN', 'CTLS'): 1,
    ('PIPER', 'PA-31-350'): 2,
    ('BOEING', '787'): 2,
    ('CESSNA', 'A188'): 1,
    ('TAYLORCRAFT', 'BC12 D'): 1,
    ('SCHWEIZER', 'SGS 1-26'): 0,
    ('AMERICAN-EUROCOPTER-CORP', 'AS350B3'): 1,
    ('EIRIAVION-OY', 'PIK 20D'): 0,
    ('ROLLADEN-SCHNEIDER', 'LS-6'): 0,
    ('ZENITH', 'CH 750'): 1,
    ('PZL-BIELSKO', 'SZD-59'): 1,
    ('CENTRAIR', '101A'): 1,
    ('DE-HAVILLAND', 'DHC-2'): 1,
    ('EMBRAER', 'EMB145'): 2,
    ('BALLOON-WORKS', 'AX-8B'): 1,
    ('SCHEMPP-HIRTH', 'DISCUS CS'): 1,
    ('BOMBARDIER-INC', 'CL-600-2B19'): 2,
    ('BALLOON-WORKS', 'FF-7'): 1,
    ('BALLOON-WORKS', 'FIRE FLY 7'): 1,
    ('BALLOON-WORKS', 'FIRE FLY 7-15'): 1,
    ('BALLOON-WORKS', 'FIRE FLY 8-24'): 1,
    ('BALLOON-WORKS', 'FIREFLY 11'): 1,
    ('CESSNA', 'P210N'): 1,
    ('SIKORSKY', 'S76'): 2,
    ('BALLOON-WORKS', 'FIREFLY 7B'): 1,
    ('CESSNA', 'T182T'): 1,
    ('BALLOON-WORKS', 'FIREFLY 9'): 1,
    ('EIRIAVION-OY', 'PIK-20B'): 1,
    ('BOMBARDIER', 'CL 600 2C10'): 2,
    ('GROB', 'G102'): 1,
    ('LINDSTRAND', '105A'): 1,
    ('BOEING', '757-200'): 2,
    ('PIPER', 'PA-28-161'): 1,
    ('CESSNA', '441'): 2,
    ('LITHUANIAN-FACTORY-OF-AVIATION', 'LAK-12'): 1,
    ('PIPER', 'PA-28-180'): 1,
    ('BURKHART-GROB', 'G 103 TWIN II'): 1,
    ('CHAMPION', '7EC'): 1,
    ('MCDONNELL-DOUGLAS-HELICOPTER', '369E'): 1,
    ('FLIGHT-DESIGN-GMBH', 'CTSW'): 1,
    ('BURKHART-GROB', 'G-102'): 1,
    ('BURKHART-GROB', 'G-103'): 1,
    ('BEECH', '95 B55 (T42A)'): 2,
    ('BEECH', '95 C55'): 2,
    ('ROLLADEN-SCHNEIDER', 'LS3-A'): 1,
    ('GLASFLUGEL', 'H 301 B LIBELLE'): 1,
    ('BURKHART-GROB', 'G-103-TWIN II'): 1,
    ('BURKHART-GROB', 'G-103A Twin II Acro'): 1,
    ('BELL', '430'): 2,
    ('THUNDER-AND-COLT', 'AX9-140'): 1,
    ('BOEING', '737-700'): 2,
    ('BURKHART-GROB', '102'): 1,
    ('BURKHART-GROB', '103A'): 1,
    ('BARNES', 'FIREFLY 7'): 1,
    ('CESSNA', '208B'): 1,
    ('LET', 'L13'): 1,
    ('BEECH', '200'): 2,
    ('BEECH', '23'): 1,
    ('LET', 'L 33 SOLO'): 1,
    ('LEARJET', '45'): 2,
    ('BEECH', '55'): 2,
    ('SCHEIBE-FLUGZEUGBAU', 'BERGFALKE II-55'): 1,
    ('SCHWEIZER', 'SGS-233A'): 1,
    ('BLANIK', 'L-13'): 1,
    ('BURKHART-GROB', 'G103 TWIN ASTIR'): 1,
    ('BURKHART-GROB', 'G103 TWIN II'): 1,
    ('SCHLEICHER', 'K8B'): 1,
    ('HILLER', 'UH 12D'): 1,
    ('CUB-CRAFTERS', 'CCK-1865'): 1,
    ('NORTH-AMERICAN', 'NAVION'): 1,
    ('BELL', 'UH 1H'): 1,
    ('SCHWEIZER', '2-33'): 0,
    ('SCHWEIZER', '2-33-A'): 0,
    ('BELL', '206B-III'): 1,
    ('BELL', '206-B3'): 1,
    ('SCHWEIZER', 'SGS 1 34'): 0,
    ('SCHWEIZER', 'SGS 1-26D'): 0,
    ('BEECH', 'F33'): 1,
    ('BEECH', 'E 55'): 2,
    ('COSTRUZIONI-AERONAUTICHE-TECNA', 'P92 EAGLET'): 1,
    ('SCHWEIZER', 'SGS-1-26E'): 0,
    ('ULTRAMAGIC', 'N250 - NO SERIES'): 0,
    ('MCDONNELL-DOUGLAS', 'MD-83'): 2,
    ('SCHWEIZER', 'SGS233'): 0,
    ('SCHWEIZER', 'SGU-2-22E'): 0,
    ('SCHWEIZER', 'SGU2-22E'): 0,
    ('CAMERON', 'A-140'): 0,
    ('SCHWEIZER', 'SGS-1-26B'): 0,
    ('SCHLEICHER', 'ASW 27'): 0,
    ('SCHLEICHER', 'ASW-20B'): 0,
    ('BELLANCA', '7GCAA'): 1,
    ('CESSNA', 'T240'): 1,
    ('CESSNA', 'T310R'): 2,
    ('CESSNA', 'U206F'): 1,
    ('DIAMOND', 'DA20'): 1,
    ('CESSNA', '525'): 2,
    ('AYRES', 'S2R'): 1,
    ('LUSCOMBE', '8E'): 1,
    ('SPROUL', '72K-TET'): 1,
    ('EMBRAER', 'ERJ190'): 2,
    ('EUROCOPTER', 'EC 130 B4'): 1,
    ('ULTRAMAGIC', 'N-250'): 0,
    ('CESSNA', 'A188B'): 1,
    ('BOEING', '767'): 2,
    ('VANS', 'RV6'): 1,
    ('RAVEN', 'S77A'): 0,
    ('AEROSTAR', 'RAVEN S57-A'): 0,
    ('WEATHERLY', '620'): 1,
    ('AEROSTAR', 'RX8'): 0,
    ('AEROSTAR', 'RXS-8'): 0,
    ('AEROSTAR', 'S-55A'): 0,
    ('GLASFLUGEL', 'STANDARD LIBELLE'): 0,
    ('VANS', 'RV8'): 1,
    ('ROBINSON', 'R44'): 1,
    ('AEROSTAR-INTERNATIONAL', 'RX8'): 0,
    ('ROCKWELL-INTERNATIONAL', 'S 2R'): 1,
    ('HUGHES', '369A'): 1,
    ('PIPER', 'PA32RT'): 1,
    ('BRANTLY', 'B 2B'): 1,
    ('CESSNA', '150'): 1,
    ('BOEING', 'E75'): 1,
    ('SCHEMPP-HIRTH', 'CIRRUS'): 0,
    ('CESSNA', '172P'): 1,
    ('PIPER', 'PA-22-150'): 1,
    ('GULFSTREAM', 'GIV'): 2,
    ('AERO-COMMANDER', 'S2R'): 1,
    ('AEROSPATIALE', 'AS350'): 1,
    ('AERONCA', '7AC'): 1,
    ('CAMERON', 'O-65'): 0,
    ('QUAD-CITY', 'CHALLENGER'): 1,
    ('GROB', 'G103'): 0,
    ('SCHEMPP-HIRTH', 'NIMBUS II'): 0,
    ('AEROSTAR', 'RAVEN S49A'): 0,
    ('RAVEN', 'S-55A'): 0,
    ('SCHEMPP-HIRTH', 'VENTUS A'): 0,
    ('SCHEMPP-HIRTH', 'VENTUS B/16.6'): 0,
    ('CAMERON', 'A-250'): 0,
    ('OTTERBACK', 'Lightning'): 1,
    ('NORTH-WING-UUM-INC', 'SPORT X2 912'): 1,
    ('O\'DELL', 'AEROMASTER'): 1,
    ('NORTHROP', 'N9M'): 4,
    ('NORTH-AMERICAN', 'SNJ'): 1,
    ('NORTH-AMERICAN', 'T28'): 1,
    ('1977-COLFER-CHAN', 'STEEN SKYBOLT'): 1,
    ('PIPER', 'PA 32R-300'): 1,
    ('NANCHANG', 'CJ 6'): 1,
    ('NANCHANG', 'CJ6'): 1,
    ('NATIONAL-BALLOON', 'AX-7'): 0,
    ('NATIONAL-BALLOONING-LTD', '858'): 0,
    ('NAVION', 'NAVION'): 1,
    ('NAVION', 'Navion A'): 1,
    ('NICKS', 'PW-5'): 0,
    ('NORTH-AMERICAN', 'AT6'): 1,
    ('NORTH-AMERICAN', 'AT6 - C'): 1,
    ('NORTH-AMERICAN', 'NAVION A'): 1,
    ('NORTH-AMERICAN', 'P 51D'): 1,
    ('NORTH-AMERICAN', 'SNJ-4'): 1,
    ('OWEN-KINGSLEY-B', 'VANS RV8'): 1,
    ('PADELT', 'PG37-1'): 0,
    ('PETRUS-DAVID-WAYNE', 'S90'): 1,
    ('PHOENIX-AIR-SRO', 'U-15 PHOENIX'): 1,
    ('PIAGGIO-INDUSTRIE', 'P180'): 2,
    ('AB-SPORTINE-AVIACIJA', 'GENESIS 2'): 0,
    ('PIETENPOL', 'AIRCAMPER'): 1,
    ('MONERAI', 'SAILPLANE'): 0,
    ('MONETT', 'MONARAI'): 0,
    ('MONOCOUPE', '110SP'): 1,
    ('MONTANA', 'Coyote'): 1,
    ('MOONEY', 'M20B'): 1,
    ('MOONEY', 'M20C'): 1,
    ('MOONEY', 'M20F'): 1,
    ('MOONEY', 'M20K'): 1,
    ('MOONEY', 'M20V'): 1,
    ('MOONEY-AIRCRAFT-CORP.', 'M20K'): 1,
    ('MX-AIRCRAFT-LLC', 'MXS'): 1,
    ('PICCARD', 'P-80'): 0,
    ('PIK', '20'): 0,
    ('PIPER', 'PA-24-260'): 1,
    ('PIK', 'PIK-20D'): 0,
    ('PILATUS', 'B-4'): 0,
    ('PILATUS', 'B4-PC11AF'): 0,
    ('ACRO', 'SUPER ACRO SPORT I'): 1,
    ('MCDONNELL-DOUGLAS-HELI-CO', '369FF'): 1,
    ('MEANS-ROBER-C', 'ROTORWAY EXEC'): 1,
    ('MEYERS', '200'): 1,
    ('MEYERS', 'MAC 145'): 1,
    ('MICHAEL-V-CRANFORD', 'VANS RV-4'): 1,
    ('MICHAEL-WILSON', 'MURPHY SPIRIT'): 1,
    ('MICROLITES-PTYLTD', 'Dragonfly B'): 1,
    ('MILLER,-TERRY-W.', 'TERN'): 0,
    ('MILLS-MICHAEL', 'S1L'): 1,
    ('MITSUBISHI', 'MU 300'): 2,
    ('MITSUBISHI', 'MU2B'): 2,
    ('MOLINO-OY', 'PIK-20'): 0,
    ('MOLINO-OY', 'PIK-20B'): 0,
    ('PIPER', 'J3C'): 1,
    ('PIPER', 'J5A'): 1,
    ('PIPER', 'PA 14'): 1,
    ('PIPER', 'PA 15'): 1,
    ('PIPER', 'PA 16'): 1,
    ('ADAMS', 'A-60'): 1,
    ('PIPER', 'PA-44 SEMINOLE'): 1,
    ('MCDONNELL-DOUGLAS', 'DC-9-83(MD-83)'): 2,
    ('MCDONNELL-DOUGLAS', 'MD 90-30'): 2,
    ('MCDONNELL-DOUGLAS', 'MD-80'): 2,
    ('MCDONNELL-DOUGLAS', 'MD11'): 3,
    ('MCDONNELL-DOUGLAS', 'MD80'): 2,
    ('MCDONNELL-DOUGLAS', 'MD82'): 2,
    ('MCDONNELL-DOUGLAS', 'MD88'): 2,
    ('MCDONNELL-DOUGLAS', 'OH-6A'): 1,
    ('MCDONNELL-DOUGLAS-AIRCRAFT-CO', 'MD 88'): 2,
    ('MD-HELICOPTER', '369'): 1,
    ('PIPER', 'PA 31P'): 2,
    ('PIPER', 'PA 32R-301'): 1,
    ('PIPER', 'PA 46-350P'): 1,
    ('PIPER', 'PA-18-150'): 1,
    ('PIPER', 'PA-22'): 1,
    ('PIPER', 'PA-24-250'): 1,
    ('PIPER', 'PA-25'): 1,
    ('ADAMS', 'A60S'): 1,
    ('PIPER', 'PA-28'): 1,
    ('MAULE', 'MX7'): 1,
    ('MAULE', 'MXT-7-180A'): 1,
    ('MAULE', 'MXT7'): 1,
    ('MBB', 'BK117'): 2,  
    ('MBB', 'BO-105'): 2, 
    ('MBB', 'PHOEBUS C'): 0,
    ('MCCUTCHAN', 'Glasair'): 1,
    ('MCDONNELL-DOUGLAS', '369E'): 1,
    ('MCDONNELL-DOUGLAS', '600'): 2, 
    ('MCDONNELL-DOUGLAS', 'DC 9 33F'): 2,
    ('MCDONNELL-DOUGLAS', 'DC-9-82 (MD-82)'): 2,
    ('PIPER', 'PA-28-140'): 1,
    ('MAULE', 'MX 7-180B'): 1,
    ('PIPER', 'PA-28-151'): 1,
    ('PIPER', 'PA-32-260'): 1,
    ('PIPER', 'PA-32-300'): 1,
    ('PIPER', 'PA-32R-300'): 1,
    ('ADAMS', 'AB'): 1,
    ('PIPER', 'PA-32R-301'): 1,
    ('MARINO', 'Benoist Type XIV'): 1,
    ('MARK-GOLDBERG', 'BEARHAWK PATROL'): 1,
    ('MARSH-TURNER', 'BG-12A'): 0, 
    ('MAS-EVENTS', 'NEMESIS'): 1,
    ('MASAK', 'SCIMITAR'): 0,  
    ('MATTHEWS-H-THOMAS', 'JODEL   F11 A'): 1,
    ('MAULE', 'M 6-235'): 1,
    ('MAULE', 'M4-220C'): 1,
    ('MAULE', 'M6'): 1,
    ('MAULE', 'M7'): 1,
    ('MAULE', 'MX 7-235'): 1,
    ('PIPER', 'PA-32RT'): 1,
    ('PIPER', 'PA36'): 1,
    ('PIPER', 'PA-32RT-300T'): 1,
    ('PIPER', 'PA-38-112'): 1,
    ('PIPER', 'PA-44-180'): 2,  
    ('PIPER', 'PA-46'): 1,
    ('PIPER', 'PA-46-500TP'): 1,
    ('PIPER', 'PA12'): 1,
    ('PIPER', 'PA18'): 1,
    ('PIPER', 'PA22'): 1,
    ('PIPER', 'PA23'): 2,
    ('ADAMS', 'AX-9'): 0,  
    ('PRUE-IRVING-OWEN', '160'): 0,  
    ('LINDSTRAND-BALLOONS-USA', '120A'): 0,  
    ('LINSTRAND', '240A'): 0,  
    ('LOCKHEED', 'C130'): 4, 
    ('LOCKHEED', 'L1011'): 3,  
    ('LOCKHEED', 'P2V-7'): 2,  
    ('LUDEMAN', 'HP-18'): 0,  
    ('LUSCOMBE', '8A'): 1,
    ('LUSCOMBE', '8B'): 1,
    ('LUSCOMBE', 'T-8F'): 1,
    ('LYONS-ROBERT', 'NAVAJO HKS'): 2, 
    ('M-SQUARED', 'Sport 1000'): 1,
    ('M-SQUARED-AIRCRAFT', 'SPRINT 1000'): 1,
    ('MAARTEN-H-VERSTEEG', 'ZENITH 601XL(B)'): 1,
    ('MACDONALD-CRAIG', 'MAC CUB'): 1,
    ('MAGNI', 'MAGNI M 16'): 1,  
    ('PIPER', 'PA31'): 2,
    ('PIPER', 'PA31T'): 2,
    ('PIPER', 'PA32'): 1,
    ('PIPER', 'PA34'): 2,
    ('PIPER', 'PA44'): 2,
    ('PIPER', 'PA46'): 1,
    ('PIPISTREL', 'Apis-Bee'): 0,
    ('AERO-COMMANDER', '200'): 2,
    ('LINDSTRAND-BALLOONS', '105A'): 0,
    ('LET', 'Blanik L-13'): 0,
    ('LET', 'L-23 SUPER BLANKIT'): 0,
    ('LET', 'L-33-SOLO'): 0,
    ('LET', 'L23'): 0,
    ('LET', 'SUPER BLANIK L-23'): 0,
    ('LET', 'SUPER BLANIK L-33'): 0,
    ('LIGHTNING-AVION-EAB-LLC', 'Arion Lightning'): 1,
    ('LINDSTRAND', '180A'): 0,
    ('LINDSTRAND', 'LBL-105G'): 0,
    ('LINDSTRAND', 'LBL69A'): 0,
    ('LINDSTRAND-BALLOONS', '150A'): 0,
    ('LEARJET', '35'): 2,
    ('LINDSTRAND-BALLOONS', '90A'): 0,
    ('LINDSTRAND-BALLOONS', 'LBL'): 0,
    ('POWRACHUTE', 'AIRWOLF'): 1,
    ('PRATT-READ', 'PRG-1'): 0,
    ('PROGRESSIVE-AERODYNE', 'SEAREY'): 1,
    ('PRUE-STANDARD', 'UNKNOWN'): 0,
    ('PURDY', 'HP-18'): 0,
    ('PZL-BIELSKO', 'JANTAR 2A'): 0,
    ('PZL-BIELSKO', 'SZD 50-3'): 0,
    ('PZL-BIELSKO', 'SZD 55-1'): 0,
    ('AERO-COMMANDER', '500'): 1,
    ('KOLB-COMPANY', 'FIRESTAR'): 1,
    ('KUBICEK', 'BB100'): 1,
    ('KUBICEK', 'BB30'): 1,
    ('KUBICEK', 'BB60'): 1,
    ('KUBICEK', 'BB85'): 1,
    ('LAISTER', 'LK10'): 1,
    ('LAISTER', 'LP-15'): 1,
    ('LAISTER', 'LP-49'): 1,
    ('LAKE', 'LA4'): 1,
    ('LANCAIR', '360'): 1,
    ('LANCAIR', 'IV'): 1,
    ('LANCAIR', 'LC41'): 1,
    ('LANCAIR', 'LEGACY RG'): 1,
    ('LARK-AVIATION', 'IS28B2'): 1,
    ('LET', 'Blanik'): 1,
    ('PZL-BIELSKO', 'SZD-48-3'): 1,
    ('PZL-BIELSKO', 'SZD-55-1'): 1,
    ('PZL-SWIDNIK', 'PW 5'): 1,
    ('QUICKIE', 'Q2'): 1,
    ('QUICKSILVER', 'MX II Sprint'): 1,
    ('QUICKSILVER', 'MXL II'): 1,
    ('QUICKSILVER', 'Sport'): 1,
    ('RAF', 'SE5A'): 1,
    ('AERO-COMMANDER', '680'): 1,
    ('RAVEN', 'AX-9'): 1,
    ('JOHNSON', 'Harmon Rocket'): 1,
    ('JONES', 'HATZ CB1'): 1,
    ('JONKER-SAILPLANES', 'JS1C'): 1,
    ('JONKER-SAILPLANES-(PTY)-LTD', 'JS1-C'): 1,
    ('JORDAN-JOHN', 'RV7'): 1,
    ('JUDD', 'Challenger II'): 1,
    ('JUST', 'JA30 SUPERSTOL'): 1,
    ('KENNETH-B-HINES', 'NIEUPORT 28'): 1,
    ('KITFOX', 'S7'): 1,
    ('KJOSTAD-JORGEN-A', 'WAGABOND'): 1,
    ('KNAPP', 'Easy Raider'): 1,
    ('KNELL', 'ASC SPIRIT'): 1,
    ('KOLB', 'FIRESTAR 2'): 1,
    ('KRELING', 'Supercat'): 1,
    ('RAINBOW-AIRCRAFT-(PTY)-LTD', 'AEROTRIKE'): 1,
    ('RANDY-WAYNE-MALONEY', 'M1'): 1,
    ('RANS', 'S12'): 1,
    ('RANS,-INC.', 'Rans S-6ES'): 1,
    ('RANS-S-12', 'Airaile'): 1,
    ('RATTE-JAMES', 'AVENTURA II'): 1,
    ('RAVEN', 'AERO STAR S-66A'): 1,
    ('RAVEN', 'AS-55A'): 1,
    ('AERO-TEK,-INC.', 'ZUNI'): 1,
    ('HUGHES', '369HS'): 1,
    ('HUGHES', '500D'): 1,
    ('HUGHES', 'OH 6A'): 1,
    ('HUGHES', 'TH 55A'): 1,
    ('I.C.A.-BRASOV', 'IS-26B2'): 1,
    ('I.C.A.-BRASOV', 'IS-28-B2'): 1,
    ('I.C.A.-BRASOV', 'IS-29D'): 1,
    ('I.C.A.-BRASOV', 'LARK I-28-B2'): 1,
    ('I.C.A.-BRASOV-(ROMANIA)', 'IS 29D'): 1,
    ('ICON', 'A5'): 1,
    ('ICP', 'Savannah'): 1,
    ('RAVEN', 'S-77A'): 1,
    ('JACK-MCDANIEL', 'Rans S-12'): 1,
    ('JAMES', 'Experimental Cub'): 1,
    ('JAVRON', 'PA-18 Replica'): 1,
    ('JEROME-A-BAAK', 'CH 601XL'): 1,
    ('RAVEN', 'AX-8'): 1,
    ('RAVEN', 'R-7'): 1,
    ('RAVEN', 'RALLEY RX7'): 1,
    ('RAVEN', 'RALLY II'): 1,
    ('RAVEN', 'RALLY RX7'): 1,
    ('RAVEN', 'RX-6'): 1,
    ('HUGHES', '369'): 1,
    ('HEAD', 'AX9-118'): 1,
    ('HEAD-BALLOONS,-INC.', 'AX9-118'): 1,
    ('HENRY-STEVEN-J', 'JUST ACFT SUPERSTOL'): 1,
    ('HI-MAX', 'HI-MAX'): 1,
    ('HILL-GROUP-LLC', 'CCX-2000'): 1,
    ('HILLER', 'UH 12E'): 1,
    ('HILLER', 'UH-12C'): 1,
    ('HILLER', 'UH-12E'): 1,
    ('HOGAN', 'Innovator'): 1,
    ('HOLMES', 'Challenger II'): 1,
    ('HOOVER-DAVID', 'ARNOLD AR 6'): 1,
    ('AERO-TEK-INC.', 'ZUNI'): 1,
    ('HPH-LTD', '304CZ'): 1,
    ('HUEBBE', 'Sonex HB'): 1,
    ('HUGHES', '269A'): 1,
    ('RAVEN', 'S-55A-707'): 1,
    ('RAVEN', 'S100A'): 1,
    ('RAVEN', 'S55A/AX7'): 1,
    ('RAYMOND-Z-BROWN', 'CONDOR'): 1,
    ('RAYTHEON', '58'): 1,
    ('RAYTHEON-AIRCRAFT-COMPANY', 'B200'): 1,
    ('RAYTHEON-CORPORATE-JETS', 'H25B'): 1,
    ('HEAD-BALLOONS,-INC.', 'AX7-77'): 1,
    ('HEAD', 'AX9 118'): 1
     
}

for (make, model), engine_count in engine_impute_map.items():
    us_data.loc[
        (us_data['aircraft_make'] == make) & 
        (us_data['aircraft_model'] == model) & 
        (us_data['number_of_engines'].isna()), 
        'number_of_engines'
    ] = engine_count


In [None]:
us_data['aircraft_make'].value_counts().head(20)

In [None]:
us_data['aircraft_model'].value_counts().head(20)

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(us_data['number_of_engines'], kde=True, color='skyblue', edgecolor='black')

mean_val = us_data['number_of_engines'].mean()
mode_val = us_data['number_of_engines'].mode()[0]

plt.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
plt.axvline(mode_val, color='green', linestyle='-', linewidth=2, label=f'Mode: {mode_val}')

plt.title('Distribution of Engine Number')
plt.xlabel('Engine count')
plt.ylabel('Aircrafts')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
us_data['number_of_engines'].isna().sum()

In [None]:
us_data['number_of_engines'] = us_data['number_of_engines'].fillna(1)

## 8. ENGINE TYPE

There were 3,042 missing values in the *Engine Type* column. To make a justifiable imputation, I examined the distribution of known engine types:

- **Reciprocating** – 68,507
- **Turbo Shaft** – 3,331
- **Turbo Prop** – 3,206
- **Turbo Fan** – 2,094
- **Unknown** – 1,385
- Others (Electric, Hybrid Rocket, LR, etc.) – rare (< 20 combined)

### Engine Type Descriptions:
| Engine Type      | Description |
|------------------|-------------|
| **Reciprocating** | Piston-driven engine similar to car engines. Common in small aircraft and general aviation. Typically low-power and uses 1–2 engines. |
| **Turbo Fan**     | Jet engine with a fan in front. Common in modern airliners and high-performance jets. Usually 2+ engines. |
| **Turbo Shaft**   | Jet engine used in helicopters. Drives a shaft to power rotors. |
| **Turbo Prop**    | Jet-powered engine that turns a propeller. Used in small commuter aircraft and bush planes. Typically 1 or 2 engines. |
| **Turbo Jet**     | Old-style pure jet engines, mostly found in vintage or military aircraft. Rare today. |
| **Electric**      | Found in experimental or very light sport aircraft. Emerging technology. |
| **Hybrid Rocket / LR / NONE / UNK** | Rare or placeholder values. Likely data inconsistencies or experimental craft. |

### Reasoning Behind Imputation:

The data shows that the vast majority of aircraft use **Reciprocating** engines. Given that most aircraft in the dataset also have a single engine and fall within the general aviation category, it is statistically and contextually reasonable to impute the missing values with **'Reciprocating'**.

This assumption avoids significant bias and aligns with the observed distribution, supported by both the dataset and basic domain knowledge about aircraft engine types.


In [None]:
us_data['engine_type'].unique()

In [None]:
us_data['engine_type'].isna().sum()

In [None]:
us_data['engine_type'].value_counts()

In [None]:

plt.figure(figsize=(12, 6))

sns.countplot(data=us_data, x='engine_type', order=us_data['engine_type'].value_counts().index,palette='Blues_d', edgecolor='black')    
          
mode_val = us_data['engine_type'].mode()[0]
plt.axhline(us_data['engine_type'].value_counts().max(), color='green', linestyle='--',label=f'Mode: {mode_val}', linewidth=2)
            
plt.title('Distribution of Engine Type')
plt.xlabel('Engine Type')
plt.ylabel('Number of Aircrafts')
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()


In [None]:
us_data['engine_type'] = us_data['engine_type'].fillna('Reciprocating')
us_data['engine_type'] = us_data['engine_type'].str.replace(r'\s+', '-', regex=True).astype('string').str.upper()

## 9. FAR DESCRIPTION

The `far_description` column shows the FAA regulation under which each flight was operated. It’s an important feature for identifying the operational context—whether the flight was general aviation, commercial, agricultural, or something else entirely.

###  Cleaning Process

This column was messy—some entries were full descriptions like `"Part 91: General Aviation"` while others were shorthand like `"091"`. I created a mapping that standardized everything into clean, consistent categories such as `"Part 91"`, `"Part 135"`, and so on.

For example:
- `"091"`, `"091K"`, and `"Part 91F: Special Flt Ops."` were all mapped to `"Part 91"`
- `"PUBU"`, `"Public Use"`, and `"Public Aircraft"` were grouped under `"Public Use"`
- Military designations like `"NUSC"`, `"NUSN"`, and `"ARMF"` became `"Military"`

I applied this mapping directly on the original `far_description` 

###  What about the missing values?

Originally, this column had **over 54,000 missing values**, which is a huge chunk out of the ~82,000 rows. I wanted to reduce that in a meaningful way without introducing bias. Here's what I did:

1. I built a reference map using the most frequent aircraft `make` and `model` combinations that already had valid `far_description` values.
2. Then, for each row that was still missing, I tried to infer the `far_description` based on its `make_model` or just the `make` if the combo wasn’t in the map.
3. If I couldn’t infer anything confidently, I left it as `'Unknown'`.

This brought the missing count down to **14,165**, which is a significant improvement.

### Further Explanations

To make this column easier to interpret, I added a new column called `far_description_explained` with short descriptions for each FAR Part. Here’s a quick breakdown of what each part means:

| FAR Part | Description |
|----------|-------------|
| Part 91  | General Operating and Flight Rules – covers non-commercial (general aviation) flights. |
| Part 121 | Air Carrier Operations – for scheduled commercial airlines and large aircraft. |
| Part 135 | Commuter and On-Demand Operations – includes air taxis, charter flights, small cargo. |
| Part 137 | Agricultural Aircraft Operations – crop dusting, aerial application, etc. |
| Part 129 | Foreign Air Carriers – operations of foreign airlines in U.S. airspace. |
| Part 133 | Rotorcraft External Load – operations involving lifting external loads with helicopters. |
| Part 125 | Large Aircraft Non-Commercial – 20+ passengers or over 6,000 lbs not under Part 121. |
| Part 103 | Ultralight Vehicles – very small, lightweight aircraft; no license required. |
| Part 107 | Commercial Drone Operations – rules for unmanned aerial systems (UAS). |
| Public Use | Government-operated or military aircraft not under civil FARs. |
| Military | Military aircraft operations (Navy, Army, or Defense-related). |
| Other | Other classifications or uncategorized FAA rules (e.g., Part 437 - experimental). |
| Unknown | No applicable FAR identified or insufficient data. |

### Conclusion 
This feature was tricky to clean, but using aircraft characteristics for guided imputation helped reduce the noise without making wild guesses. The added explanation column also helps give this variable real interpretive power in my analysis.


In [None]:
us_data['far_description'].isna().sum()

In [None]:
us_data['far_description'].unique()


In [None]:
far_description_map = {
    '091': 'Part 91',
    'Part 91: General Aviation': 'Part 91',
    '091K': 'Part 91',
    'Part 91 Subpart K: Fractional': 'Part 91',
    'Part 91F: Special Flt Ops.': 'Part 91',

    '135': 'Part 135',
    'Part 135: Air Taxi & Commuter': 'Part 135',

    '121': 'Part 121',
    'Part 121: Air Carrier': 'Part 121',

    '137': 'Part 137',
    'Part 137: Agricultural': 'Part 137',

    '129': 'Part 129',
    'Part 129: Foreign': 'Part 129',

    '133': 'Part 133',
    'Part 133: Rotorcraft Ext. Load': 'Part 133',

    '125': 'Part 125',
    'Part 125: 20+ Pax,6000+ lbs': 'Part 125',

    '103': 'Part 103',
    '107': 'Part 107',
    '437': 'Other',

    'PUBU': 'Public Use',
    'Public Use': 'Public Use',
    'Public Aircraft': 'Public Use',

    'NUSC': 'Military',
    'NUSN': 'Military',
    'ARMF': 'Military',

    'UNK': 'Unknown',
    'Unknown': 'Unknown',
    'nan': 'Unknown',
}

us_data['far_description'] = us_data['far_description'].replace(far_description_map)

In [None]:
far_part_explanations = {
    'Part 91': 'General Operating and Flight Rules – covers non-commercial (general aviation) flights.',
    'Part 121': 'Air Carrier Operations – for scheduled commercial airlines and large aircraft.',
    'Part 135': 'Commuter and On-Demand Operations – includes air taxis, charter flights, small cargo.',
    'Part 137': 'Agricultural Aircraft Operations – crop dusting, aerial application, etc.',
    'Part 129': 'Foreign Air Carriers – operations of foreign airlines in U.S. airspace.',
    'Part 133': 'Rotorcraft External Load – operations involving lifting external loads with helicopters.',
    'Part 125': 'Large Aircraft Non-Commercial – 20+ passengers or over 6,000 lbs not under Part 121.',
    'Part 103': 'Ultralight Vehicles – very small, lightweight aircraft; no license required.',
    'Part 107': 'Commercial Drone Operations – rules for unmanned aerial systems (UAS).',
    'Public Use': 'Government-operated or military aircraft not under civil FARs.',
    'Military': 'Military aircraft operations (Navy, Army, or Defense-related).',
    'Other': 'Other classifications or uncategorized FAA rules (e.g., Part 437 - experimental).',
    'Unknown': 'No applicable FAR identified or insufficient data.'
}



us_data['far_description_explained'] = us_data['far_description'].map(far_part_explanations)


In [None]:
far_infer_map = {
    'CESSNA': 'Part 91',
    'PIPER': 'Part 91',
    'BEECH': 'Part 91',
    'BOEING': 'Part 121',
    'MCDONNELL-DOUGLAS': 'Part 121',
    'BELL': 'Part 135',
    'AIR-TRACTOR': 'Part 137',
    'GRUMMAN_G-164A': 'Part 137',
    'PIPER_PA-18': 'Part 137',
    'ROBINSON': 'Part 91',
    'HUGHES': 'Part 91',
    'SCHWEIZER': 'Part 91',
    'MAULE': 'Part 91',
    'CHAMPION': 'Part 91',
    'MOONEY': 'Part 91',
    'STINSON': 'Part 91',
    'AERONCA': 'Part 91',
    'TAYLORCRAFT': 'Part 91',
    'LUSCOMBE': 'Part 91',
}

def infer_far(row):
    if pd.notna(row['far_description']):
        return row['far_description']
    
    key = f"{row['aircraft_make']}_{row['aircraft_model']}"
    if key in far_infer_map:
        return far_infer_map[key]
    
    if row['aircraft_make'] in far_infer_map:
        return far_infer_map[row['aircraft_make']]
    return 'Unknown'
 

us_data['far_description'] = us_data.apply(infer_far, axis=1)
us_data['far_description_explained'] = us_data['far_description'].map(far_part_explanations)


In [None]:
us_data.isna().sum()

## 10. SCHEDULE TYPE

While exploring the `schedule_type` column, I noticed that it had a few standardized values:

- `SCHD` — Scheduled  
- `NSCH` — Non-Scheduled  
- `UNK` — Unknown  
- And a significant number of **missing entries**

Since `schedule_type` is conceptually linked to the **Federal Aviation Regulations (FAR)** listed under the `far_description` column, I decided to bridge the two. I created a mapping (`far_to_schedule_map`) that assigns a typical scheduling type to each FAR part based on standard regulatory usage:

- **Part 121** generally corresponds to **scheduled commercial operations** → `SCHD`
- **Part 135** is typically used for **non-scheduled or charter services** → `NSCH`
- Other parts, like **Part 91** or **Public Use**, often involve **non-commercial general aviation**, so I defaulted them to → `UNK`

To impute the missing values while avoiding bias, I implemented a simple logic:

-  If the original `schedule_type` was present, I retained it.  
-  If it was missing, I inferred the value using `far_description`.  
-  If no FAR match was found, I conservatively assigned `UNK`.

This approach gave me a more complete and logically consistent understanding of the scheduling context behind each aircraft operation, while keeping the data as unbiased and explainable as possible.


In [None]:
us_data['schedule_type'].unique()

In [None]:
far_to_schedule_map = {
    'Part 121': 'SCHD',       
    'Part 135': 'NSCH',        
    'Part 91': 'NSCH',         
    'Part 137': 'NSCH',        
    'Part 129': 'SCHD',       
    'Part 133': 'NSCH',        
    'Part 125': 'SCHD',       
    'Part 103': 'NSCH',        
    'Part 107': 'NSCH',       
    'Public Use': 'NSCH',      
    'Military': 'NSCH',        
    'Other': 'UNK',
    'Unknown': 'UNK'
}
def get_schedule_type(row):
    if pd.notna(row['schedule_type']):
        return row['schedule_type']
    
    far = row['far_description']
    if far in far_to_schedule_map:
        return far_to_schedule_map[far]
    
    return 'UNK'

us_data['schedule_type'] = us_data.apply(get_schedule_type, axis=1)


## 11. PURPOSE OF FLIGHT

The `purpose_of_flight` column is quite self-explanatory — it describes why the aircraft was in operation during the incident (Personal, Business, Instructional, Aerial Application)

There were only **2,429 missing values**, which is relatively minor given the dataset size.

To prepare this column for analysis:

-  I **standardized** all string values to **uppercase** for consistency (e.g., "personal" → "PERSONAL").
- For the missing entries, I assigned the value **"UNKNOWN"** to maintain uniformity and avoid introducing assumptions.

This ensures cleaner grouping during analysis and avoids confusion caused by inconsistent casing or missing values.


In [None]:
us_data['purpose_of_flight'].unique()

In [None]:
us_data['purpose_of_flight'] = (
    us_data['purpose_of_flight']
    .fillna('UNKNOWN')
    .astype(str)
    .str.strip()
    .str.replace(r'[-\s]+', '-', regex=True)
    .str.upper()
)



## 12. AIR CARRIER 
In aviation datasets, the air_carrier field (when present) typically refers to the name of the airline or operating company
the column has nearly 60% + missing values and it little analytical value to my analysis
(i actually did not research much LOL!)
otherwise i will just drop it and move


In [None]:
us_data.drop(columns='air_carrier', inplace=True)


### Injury Data Analysis and Ethical Considerations

In handling the columns `total_fatal_injuries`, `total_serious_injuries`, `total_minor_injuries`, and `total_uninjured`, I exercised caution due to the sensitive nature of the data. These fields represent human outcomes from aviation accidents and align with NTSB and ICAO injury classification standards:

- **Fatal**: Death within 30 days of the accident  
- **Serious**: Hospitalization >48 hours, internal injuries, fractures, etc.  
- **Minor**: Injuries requiring treatment but not classified as serious  
- **Uninjured**: On board but unharmed

Given that many entries are missing or incomplete, I deliberately chose **not to impute or aggregate** values in these columns. Missing injury data is retained as `NaN`, preserving the original form of the dataset and avoiding any assumptions about real-world events.

This approach maintains analytical integrity, avoids introducing bias, and aligns with ethical best practices when working with data related to human lives.

Where needed, I created an auxiliary column to calculate the total number of people affected per incident, which will be useful for broader safety trend evaluations. I also included an optional flag column that indicates whether any injury data is available for each record, without modifying the original fields.


In [None]:
us_data['total_individuals_affected'] = us_data[
    ['fatal_injuries', 'serious_injuries', 'minor_injuries', 'uninjured']
].sum(axis=1, skipna=True)

us_data['injury_data_reported'] = us_data[
    ['fatal_injuries', 'serious_injuries', 'minor_injuries', 'uninjured']
].notna().any(axis=1)


## 13. Phase of Flight

The `phase_of_flight` column indicates the stage of the flight during which the accident occurred—such as takeoff, en route, landing, etc. While this data could offer valuable insights, I chose to exclude it from my analysis for several reasons:

- Over **25% of the values are missing** (~21,000 records), making it difficult to rely on this field without introducing bias or assumptions.
- I do **not have supporting data** such as mechanical logs, black box transcripts, or incident timelines that would allow for safe and meaningful inference of missing values.
- From an analytical standpoint, the **`aircraft_damage`** column already provides a clearer, more objective measure of the accident's severity, regardless of when it happened during the flight.
- Finally, drawing conclusions like "this model has frequent landing issues" would require a **much richer dataset** that includes flight hours, routes, maintenance reports, and contextual incident logs, none of which are present in this dataset.

Given these limitations, I dropped the column to keep the dataset clean and focused



In [None]:
us_data.drop(columns=['phase_of_flight'], inplace=True)

## 14. WEATHER CONDITIONS

The `weather_condition` column contains categorical values that describe the flight’s meteorological environment at the time of the accident. It uses the standard aviation terms:

- **VMC (Visual Meteorological Conditions)** – clear enough for visual navigation.
- **IMC (Instrument Meteorological Conditions)** – poor visibility, requiring instrument-based flight.
- **UNK (Unknown)** – weather conditions were not determined or recorded.

There were only a small number of missing values (645), and a few inconsistencies like `'Unk'` instead of `'UNK'`.

To prepare this column for analysis:
- I standardized all values to uppercase.
- I replaced inconsistent labels (like `'Unk'`) with `'UNK'`.
- I filled missing values with `'UNK'` to ensure consistency.



In [None]:
us_data['weather_condition'].unique()

In [None]:
us_data['weather_condition'] = us_data['weather_condition'].str.upper().replace({'UNK': 'UNK', 'UNKN': 'UNK'})
us_data['weather_condition'] = us_data['weather_condition'].fillna('UNK')

## 15. REPORT STATUS

The `report_status` column indicates the investigative stage or status of the NTSB accident report. Upon review, I discovered that while many values were valid (`"Probable Cause"`, `"Factual"`, `"Preliminary"`), others were improperly stored narrative summaries.

To clean this column:

- I retained only the known valid report stages, each with its respective meaning:

  - **Preliminary**: An early-stage summary issued shortly after the incident, based on initial findings.
  - **Factual**: A more detailed report outlining all the objective evidence without concluding on the cause.
  - **Probable Cause**: The final report identifying the likely cause(s) of the accident, issued after full investigation.
  - **Foreign**: Indicates that the investigation was conducted by a foreign authority, often for incidents outside U.S. jurisdiction.
  - **Pending**: A placeholder indicating that the investigation is ongoing and no preliminary or final report has yet been issued.

- Any entry not matching these values (including full accident narratives accidentally placed in this column) was classified as `Unknown`.

This cleaning step ensures consistency in the `report_status` classification and preserves the integrity of the analysis by eliminating misaligned or non-standard entries.


In [None]:
us_data['report_status'].unique()

In [None]:
valid_statuses = ['Probable Cause', 'Factual', 'Preliminary', 'Foreign', 'Pending']

us_data['report_status'] = us_data['report_status'].where(
    us_data['report_status'].isin(valid_statuses),
    'Unknown'
)

## 16. PUBLICATION DATE

did not research and i just dropped it

In [None]:
us_data.drop(columns=['publication_date'], inplace=True)

## 17. CLEANING THE DATE


In [None]:
us_data['event_date'] = pd.to_datetime(us_data['event_date'], errors='coerce')

## 18. CLEANING AND STANDARDIZING THE INVESTIGATION TYPE

In [None]:
us_data['investigation_type'] = us_data['investigation_type'].str.strip().str.upper()

## 19. CLEANING AND STANDARDIZING THE INVESTIGATION TYPE

In [None]:
us_data['country'] = us_data['country'].str.strip().str.upper()

## 20. Removing Duplicates

To ensure data quality and avoid skewing the analysis with redundant entries, I removed all exact duplicate rows from the dataset. This step guarantees that each incident is represented only once.

I used `drop_duplicates()` to identify and eliminate duplicate entries. A count of removed rows was displayed to track the cleanup process.


In [None]:
before = us_data.shape[0]
us_data.drop_duplicates(inplace=True)
after = us_data.shape[0]
print(f"Dropped {before - after} duplicate rows.")


In [None]:
us_data.columns

In [None]:
us_data.head()

## PHASE THREE: EXPLANATORY DATA ANALYSIS
The key objectives her is to identify patterns, trends, and risk indicators within aviation accident data to support data-driven decisions for safe and strategic aircraft investments.

In [None]:
us_data.columns

## 1. Univariate Analysis
**Goal**: Understand the distribution and composition of key individual features to establish baselines and detect anomalies.

**Numerical Columns**:
- `event_date`: When the accident occured
- `total_individuals_affected`: Assess severity of impact per incident.
- `number_of_engines`: Common engine configurations.

**Categorical Columns**:

- `aircraft_category`: Types of aircraft most frequently involved in incidents.
- `engine_type`: Dominant engine technologies and their prevalence.
- `schedule_type` & `purpose_of_flight`: Breakdown of usage contexts.
- `weather_condition`: Frequency of incidents under VMC vs IMC.
- `report_status`: Incident resolution trends.
- `amateur_built`: Proportion of non-commercial builds.
- `investigation_type`: nature of the accident
- `aircraft_damage`: extent of damage on the given craft
- `make & Model`: this is self explanatory basically description of the flight
- `engine_type`: basically the type of engine


----

## 1. `event_date`: *what are the trends over time?*

In [None]:
us_data['event_year'] = pd.to_datetime(us_data['event_date']).dt.year
yearly_counts = us_data['event_year'].value_counts().sort_index()
fig = px.line(x=yearly_counts.index, y=yearly_counts.values,
              labels={'x': 'Year', 'y': 'Number of Accidents'},
              title='Aviation Accidents per Year (All Years)')
fig.update_layout(
    template='plotly_white',
    title_font_size=20,
    font=dict(size=14),
    yaxis_gridcolor='lightgrey',
    xaxis_tickangle=-45,
    margin=dict(t=60, b=40, l=50, r=40)
)
fig.show()



In [None]:
us_data['event_decade'] = (us_data['event_year'] // 10) * 10
decade_counts = us_data.groupby('event_decade')['event_year'].count().reset_index()
decade_counts.columns = ['Decade', 'Total Accidents']

fig = px.bar(decade_counts, x='Decade', y='Total Accidents',
             title='Total Accidents by Decade',
             labels={'Decade': 'Decade', 'Total Accidents': 'Number of Accidents'},
             text='Total Accidents')
fig.update_layout(template='plotly_white')
fig.show()


In [None]:

recent_years = us_data[(us_data['event_year'] >= 2010) & (us_data['event_year'] <= 2022)]

recent_counts = recent_years['event_year'].value_counts().sort_index()
fig = px.line(x=recent_counts.index, y=recent_counts.values,
              labels={'x': 'Year', 'y': 'Number of Accidents'},
              title='Aviation Accidents (2010–2022)')
fig.update_layout(
    template='plotly_white',
    title_font_size=20,
    font=dict(size=14),
    yaxis_gridcolor='lightgrey',
    margin=dict(t=60, b=40, l=50, r=40)
)
fig.show()


##  Key Observations

###  Overall Downward Trend
There's a clear decline in the number of accidents over time, particularly from the 1980s onward.

### Possible Drivers
Several factors may have contributed to this positive trend:

- Advancements in aviation technology  
- Stricter safety regulations and enforcement  
- Improved pilot training and maintenance protocols  

### Decade-Level Subsets
When I analyzed the data by decade, the trend remained consistent—fewer accidents as time progressed. This supports the idea that safety standards have steadily improved over the years.

###  Recent Insights (2010–2022)

- A sharp decline is visible leading up to 2020, likely influenced by **COVID-19 restrictions** that significantly reduced flight volumes worldwide.
- There's a **slight uptick post-2020**, possibly due to the rebound in air travel. However, I would need more recent data to confirm if this is a sustained trend or just a temporary spike.

---

##  Interpretation

Although historical data provides important context, I find that **recent trends (2010–2022)** are more relevant for driving actionable business decisions today. By focusing on recent accident behavior, I can help the company align its strategies with **current risk levels**, rather than outdated patterns.

##  Business 1: Prioritize Modern Aircraft (2015 and Newer)

My analysis of accident trends over time reveals a notable decline in aviation incidents starting from 2015, with 2020 marking the lowest point in the dataset. While there is a minor rise in incidents post-2020, this is likely tied to the resurgence in air traffic following pandemic-induced slowdowns.



I recommend that the company **prioritize purchasing aircraft manufactured in 2010 or later**. These newer aircraft:

- Operate with **modern safety features** and **advanced avionics**
- Are governed by **stricter post-2000 FAA and international safety regulations**
- Show **lower historical accident rates**, indicating **reduced operational risk**
- Also enough data on performance of each aircraft and me

###  Strategic Value

Investing in newer aircraft not only reduces exposure to safety and maintenance issues, but also aligns the company with broader industry shifts toward:

- **Sustainability**
- **Automation**
- **Compliance with updated airworthiness standards**
------


## 2. `number_of_engines` *"What is the most common engine configuration among aircraft involved in accidents, and what might that imply for operational risk and strategic investment?"*

In [None]:
engine_counts = us_data['number_of_engines'].value_counts().sort_index().reset_index()
engine_counts.columns = ['number_of_engines', 'count']
fig = px.bar(
    engine_counts,
    x='number_of_engines',
    y='count',
    title='Distribution of Number of Engines in Aircrafts',
    labels={'number_of_engines': 'Number of Engines', 'count': 'Number of Aircraft'},
    color_discrete_sequence=['skyblue']
)


fig.update_layout(
    template='plotly_white',
    title_font_size=20,
    font=dict(size=14),
    yaxis_gridcolor='lightgrey',
    xaxis_tickangle=0,
    xaxis=dict(dtick=1),
    margin=dict(t=60, b=40, l=50, r=40)
)

fig.show()



### Observation & Interpretation

When I analyzed the distribution of aircraft accidents by the number of engines, a clear pattern emerged:

- **Single-engine aircraft** dominate the dataset, followed by **twin-engine aircraft**
- Aircraft with **three or more engines** are exceedingly rare
- Entries with **zero engines** likely represent anomalies such as gliders, balloons, or experimental aircraft used in extreme sports or scientific missions

This pattern makes practical sense—single- and twin-engine aircraft are widely favored due to:

- Lower acquisition and maintenance costs  
- Simplified operations and pilot training  
- Suitability for general aviation and small-scale commercial missions  

Additionally, as aviation technology advances, engine efficiency and power have improved significantly. Modern single turboprop or jet engines often deliver performance that previously required dual-engine setups. The industry is prioritizing **engine quality, design, and system redundancy** over sheer engine count.

---

### Recommendation

Unless the company plans to expand into **niche aviation segments** like extreme sports, research missions, or high-capacity commercial transport, I recommend focusing investment on **modern single- or twin-engine aircraft**.

These aircraft offer:

- A favorable balance of **performance and cost**
- Easier **regulatory compliance**
- Broad **availability of maintenance infrastructure** and parts

**Multi-engine aircraft (3+)** should be considered only if the business model includes high-capacity or international long-haul operations, which bring significant **financial, regulatory, and operational overhead**.

While this insight is valuable, I plan to validate it further through bivariate and multivariate analyses—particularly by cross-referencing engine count with **injury severity**, **purpose of flight**, and **aircraft make** for deeper operational risk insights.


## 3. `total_individuals_affected`: *How many individuals are typically affected in aircraft accidents*?
This variable captures the total number of people involved per incident, aggregating fatalities, serious injuries, minor injuries, and uninjured individuals making it a critical measure of impact severity.

In [None]:
top_counts = (
    us_data['total_individuals_affected']
    .value_counts()
    .sort_index()
    .reset_index()
)
top_counts.columns = ['total_individuals_affected', 'count']
top_counts = top_counts[top_counts['total_individuals_affected'] <= 30]

fig = px.bar(
    top_counts,
    x='total_individuals_affected',
    y='count',
    title='Distribution of Total Individuals Affected (0–20)',
    labels={
        'total_individuals_affected': 'Individuals Affected',
        'count': 'Number of Accidents'
    },
    color_discrete_sequence=['indianred'],
    text='count'
)

fig.update_layout(
    template='plotly_white',
    font=dict(size=14),
    xaxis=dict(dtick=1),
    yaxis_gridcolor='lightgrey'
)
fig.show()

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(
    data=us_data,
    x='total_individuals_affected',
    color='skyblue'
)

plt.title('Distribution of Total Individuals Affected per Accident', fontsize=14)
plt.xlabel('Individuals Affected', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()




###  Observation

The distribution of individuals affected per accident is **heavily right-skewed**. Most incidents involve **only one or two individuals**, which is expected, given that small general aviation aircraft typically carry minimal passengers or crew.

There is a **steep drop-off after two individuals**, suggesting:

- A predominance of smaller aircraft in the dataset  
- Potential **underreporting** or **data aggregation limitations**
- Differences in **operational patterns** between general and commercial aviation  

Notably, there are a few **significant outliers**, with some incidents affecting **over 700 people** — almost certainly commercial airline accidents. While rare, these events have high impact and signal a stark contrast in scale between aviation types.

---

###  Interpretation

This column was derived by aggregating the individual injury fields (fatal, serious, minor, and uninjured). Because imputation would compromise data integrity, the values reflect **only reported data**, which leans toward smaller counts due to:

- The nature of general aviation aircraft  
- Possible incomplete records in older or less formal incident reports  

Overall, this reinforces that **aircraft size correlates strongly with the number of individuals affected**. It's uncommon for large, commercial aircraft to have incidents involving only one or two people unless the event occurred **off-duty**, **on the ground**, or **outside of regular passenger operations**.

---

###  Recommendation 

#### For Risk-Aware Investment

While **smaller aircraft dominate accident counts**, this does **not necessarily equate to higher operational risk**. Their prevalence may stem from:

- **Wider use in general aviation**
- **Lighter regulatory oversight**
- Higher probability of **pilot deviations** or **non-commercial use cases**

#### For Operational Safety

Despite frequent incidents, the **human impact per accident is low** in general aviation. Still, it’s vital to:

- Ensure **strict enforcement** of **airworthiness**, **pilot licensing**, and **routine inspections**
- Focus safety protocols on **preventing small-scale but frequent incidents**

#### For Commercial Fleet Planning

The **rarity of high-casualty events** involving large aircraft supports the idea that **modern commercial aviation is extremely safe**, driven by:

- **Rigorous safety regulations**
- **Advanced technology**
- **Conservative and standardized operations**

---

###  Strategic Sweet Spot

I recommend investing in **well-maintained, modern small aircraft** built for **limited passenger loads**, while still acknowledging the **exceptional safety** of large commercial aircraft. Both categories offer viable opportunities — with the key being **strict adherence to safety standards and operational transparency**.


## 4. `injury_severity`: *In any accident whats the most common injury severity?*

In [None]:
us_data['injury_severity_clean'].unique()

In [None]:
injury_counts = us_data['injury_severity_clean'].value_counts().reset_index()
injury_counts.columns = ['Injury Severity', 'Count']

fig = px.bar(
    injury_counts,
    x='Injury Severity',
    y='Count',
    title='Distribution of Injury Severity in Aviation Accidents',
    labels={'Count': 'Number of Accidents'},
    color_discrete_sequence=['#636EFA']
)

fig.update_layout(
    template='plotly_white',
    title_font_size=20,
    font=dict(size=14),
    yaxis_gridcolor='lightgrey',
    xaxis_tickangle=-45,
    margin=dict(t=60, b=40, l=50, r=40)
)

fig.show()


## OBSERVATION 

The data classifies injury severity into six categories:

- **FATAL**
- **INCIDENT**
- **UNAVAILABLE**
- **UNKNOWN**
- **MINOR**
- **SERIOUS**

These are fairly self-explanatory, but one insight immediately stands out: **FATAL incidents are the most reported**.

This doesn’t imply that aviation accidents are frequent—rather, it reinforces the idea that **when they do occur, they can be catastrophically severe**.

---

### Interpretation

- The high number of **FATAL** labels suggests that aviation accidents are **low in frequency** but **high in severity**.
- The **UNAVAILABLE** and **UNKNOWN** categories likely represent **incomplete or poorly reported cases**, pointing to a need for better data management in aviation reporting systems.
- The relatively **low counts of MINOR and SERIOUS** may indicate **underreporting of non-fatal events**, or it might reflect a reality that **when aircraft fail, they tend to fail hard**.

---

###  Recommendation

While this insight may seem obvious, it carries serious operational implications. Based on these findings, I strongly recommend the following:

- **Invest in modern, well-maintained aircraft**
- **Ensure strict compliance** with all aviation safety protocols
- **Support continuous pilot training**, including **real-world emergency simulations**
- **Leverage predictive maintenance** and monitoring tools powered by **data science and machine learning**

...And yeah, maybe **cross your fingers**, too.

But jokes aside, aviation is statistically **very safe**. The fact that the most reported category is “FATAL” says **more about the severity when incidents happen** than about any suggestion of frequent danger.

**Focus on prevention, preparation, and precision**—that’s the real takeaway.


## 5. `aircraft_category`: *What kind of aircrafts dominate the skies?*

In [None]:

category_counts = us_data['aircraft_category'].value_counts().reset_index()
category_counts.columns = ['aircraft_category', 'count']

fig = px.bar(
    category_counts,
    x='count',
    y='aircraft_category',
    orientation='h',
    title='Distribution of Aircraft Categories',
    labels={'count': 'Number of Aircraft', 'aircraft_category': 'Aircraft Category'},
    color_discrete_sequence=['skyblue']
)

fig.update_layout(
    template='plotly_white',
    title_font_size=20,
    font=dict(size=14),
    yaxis=dict(categoryorder='total ascending'),
    margin=dict(t=60, b=40, l=100, r=40)
)

fig.show()





### Observation

The most common aircraft types involved in accidents are:

- **Airplanes** (by a large margin)
- **Helicopters**
- Less frequent types like **gliders**, **balloons**, and **gyroplanes**

This distribution isn’t surprising—**airplanes dominate aviation operations globally**, from commercial flights to general aviation. Helicopters are often used in **specialized roles** such as emergency response, military, or executive transport. The rarer categories are typically associated with **tourism, recreation, or niche operations**.

It's important to note that while this column was **imputed using the mode**, which introduces **potential bias**, the overall trend aligns with **real-world aviation dynamics**.

---

### Interpretation

- The dominance of airplanes and helicopters in the dataset **does not imply they're inherently more dangerous**—it simply reflects **how frequently they’re used**.
- **Higher operational exposure** naturally leads to **more recorded incidents**, even if the relative risk per flight hour is low.
- Less common categories (balloons, gliders) appear infrequently, likely due to their **limited usage scope**, **fewer flight hours**, or **underreporting** in official databases.

---

### Business Recommendation (Preliminary)

At this **univariate stage**, I will suggest:

- Focusing safety and operational investment **primarily on airplanes**, since they represent the **bulk of exposure and recorded incidents**.
- Maintaining awareness around helicopters, particularly due to their involvement in **high-risk or specialized missions**.

That said, more **actionable insights** will come during **bivariate and multivariate analysis**—especially when `aircraft_category` is examined alongside factors like:

- **Injury severity**
- **Engine count**
- **Purpose of flight**
- **Weather conditions**

This preliminary insight gives me a solid foundation, but there’s more nuance to uncover when I dig deeper into those relationships.


6. `Make And Model`: *Treating this as a couple, lets find out what makes and models are dominant*

In [None]:
us_data['aircraft_make'].unique()

In [None]:
us_data['make_model'] = us_data['aircraft_make'].str.strip() + " " + us_data['aircraft_model'].str.strip()
top_make_models = (
    us_data['make_model']
    .value_counts()[:20]
    .reset_index()
)
top_make_models.columns = ['Make & Model', 'Count']


fig = px.bar(
    top_make_models,
    x='Count',
    y='Make & Model',
    orientation='h',
    title='Top 30 Most Common Aircraft Make & Model',
    labels={'Count': 'Number of Aircraft', 'Make & Model': 'Aircraft Make & Model'},
    color_discrete_sequence=['skyblue'],
    text='Count'
)

fig.update_layout(
    template='plotly_white',
    title_font_size=20,
    font=dict(size=14),
    yaxis=dict(tickfont=dict(size=11)),
    margin=dict(t=60, b=40, l=280, r=40), 
    height=1000  
)

fig.update_traces(textposition='outside')

fig.show()




### Observation

The distribution of aircraft involved in aviation accidents is **overwhelmingly dominated by small aircraft**. When I examined the top 30 and top 50 most frequent make and model combinations, **general aviation manufacturers like Cessna, Piper, and Beech** were consistently at the top. These models are typically **light, single- or twin-engine aircraft** used for:

- Flight training
- Personal aviation
- Small business transport

This isn’t just a statistical blip—**small aircraft clearly form the core** of the general aviation landscape represented in this dataset.

---

###  Interpretation

This pattern strongly suggests that **small aircraft are more frequently involved in accidents**—not necessarily because they’re unsafe, but due to several key factors:

- **General aviation** has **less stringent oversight** compared to commercial aviation.
- These aircraft are often flown by **student pilots or hobbyists** with **fewer total flight hours**.
- **Maintenance quality varies** significantly across flight schools and private owners.
- Small aircraft make up the **majority of the active fleet**, so higher accident counts are statistically expected.

Even though the number of accidents is high, the **severity of these incidents is often lower** than in commercial aviation. That said, the presence of fatal outcomes reminds me that risk still needs to be taken seriously.

---

###  Business Recommendation

While this finding may not be flashy, it’s critically important for aviation strategy and risk management. Based on this insight, I recommend:

- **For aviation businesses:** Avoid **older or overused small aircraft models** with historically high accident counts—**unless** they come with **comprehensive maintenance records** and **well-documented pilot training protocols**.
- **For training institutions:** Invest in **modern versions** of high-usage models (especially **post-2015 builds**) and back them with **rigorous safety programs**.
- **For regulators and policymakers:** Strengthen **oversight in general aviation**, especially for **training flights** and **owner-operated aircraft**, where variability is highest.
- **For investors:** Consider diversifying into:
  - **Tech-enhanced small aircraft** with automation and advanced safety features
  - **Aviation technology firms**
  - **Maintenance-as-a-service models**
  - **Specialized insurance** products for high-risk general aviation segments

This segment might not grab headlines, but it offers **clear areas for safety improvement, innovation, and targeted investment**.


## 7. `amature_make`: *Are the aircrafts professionaly or 'Amature' built?*

In [None]:
us_data['amateur_built'].value_counts()

In [None]:

amateur_counts = us_data['amateur_built'].value_counts().reset_index()
amateur_counts.columns = ['Amateur Built', 'Count']

fig = px.pie(amateur_counts, 
             names='Amateur Built', 
             values='Count', 
             title='Proportion of Amateur-Built Aircraft Involved in Accidents',
             color_discrete_sequence=px.colors.qualitative.Set3)

fig.update_traces(textinfo='percent+label', pull=[0.1, 0, 0]) 
fig.update_layout(template='plotly_white', title_font_size=20)
fig.show()




### Observation and Interpretation

The **vast majority of aircraft involved in accidents are not amateur-built**, indicating that they are manufactured by **certified aviation companies** under established regulatory standards. That said, the **small slice of amateur-built aircraft** involved in incidents is worth highlighting.

These **homebuilt planes**, while a minority, come with **inherent risk factors**:

- Use of **non-standard parts** or **DIY assembly methods**.
- **Minimal oversight**, especially when compared to FAA-certified manufacturers.
- In many cases, the **builder is also the pilot**, which introduces **bias, overconfidence**, or insufficient testing.

Although the **"UNKNOWN"** values are negligible, I still note them as a reminder of the **importance of clean and complete reporting**.

---

### Business Recommendation

Based on this distribution, my recommendations are:

- **Avoid heavy investment** in amateur-built aircraft unless the aircraft and its builder have **clear, verifiable safety records** and **airworthiness documentation**.
- If amateur-built aircraft are part of the company’s portfolio or operations ( supporting sport aviation or hobbyist communities), then:
  - Enforce **stricter safety checks and inspection cycles**.
  - Require **pilot training and certification** tailored to amateur-built operation.
  - Consider **insuring these aircraft at higher risk premiums**, or **limiting their use** in high-stakes environments.

While amateur-built aviation can be innovative and community-driven, it needs **specialized handling** to ensure safety and operational integrity.


## 8. `engine_type` : *Whats the common engine type? what are the risks? are they safe? lets investigate!*

In [None]:

engine_type_counts = us_data['engine_type'].value_counts().reset_index()
engine_type_counts.columns = ['Engine Type', 'Count']

engine_type_counts = engine_type_counts.sort_values('Count', ascending=True)

fig = px.bar(
    engine_type_counts,
    x='Count',
    y='Engine Type',
    orientation='h',
    title='Distribution of Engine Types',
    labels={'Count': 'Number of Aircraft', 'Engine Type': 'Engine Type'},
    color_discrete_sequence=['lightskyblue']
)
fig.update_layout(
    template='plotly_white',
    title_font_size=20,
    font=dict(size=14),
    margin=dict(t=60, b=40, l=70, r=30),
    yaxis=dict(tickmode='linear'),
)
fig.show()




### Observation,Interpretation & Insights

The **reciprocating engine** clearly dominates the distribution, and that’s no surprise. It reflects the overwhelming presence of **small, general aviation aircraft** in the dataset. This **doesn’t mean reciprocating engines are inherently unsafe**—rather, it highlights how frequently they’re used in **private, instructional, or low-capacity flights**, which are more susceptible to accident reporting due to volume, looser oversight, and more varied maintenance practices.

Other engine types—**turboprop, turbo-shaft, turbo-fan, and turbo-jet**—typically power **larger or commercial aircraft**. These aircraft **adhere to stricter regulatory standards**, tend to be **maintained more rigorously**, and **operate in more structured environments**, which may explain their lower representation in the data.

Meanwhile, **niche engine categories** like *electric*, *hybrid-rocket*, or entries marked as **"LR" or "UNK"** are statistical outliers. These likely represent **experimental aircraft, prototype flights, or edge-case data entries**.

>  **Important Caveat**: This dataset captures **accidents**, not fleet inventory. That means the data is **biased toward smaller aircraft**, which have higher exposure and often face different operational conditions than their commercial counterparts.

---

### Business Recommendation

While this data doesn’t directly compare **engine safety or efficiency**, several useful takeaways emerge:

- **All certified engine types are designed for safety**; accident occurrences are far more influenced by **human error, poor maintenance**, or **procedural failures** than by engine type alone.
- **Reciprocating engines**, while accessible and common, come with higher operational variability. If cost-effective, consider upgrading to **turboprop or turbofan engines**, which are more **efficient**, **modern**, and **reliable**—especially in commercial or semi-commercial operations.
- When exploring **investments, partnerships, or procurement strategies**, prioritize metrics like:
  - **Fuel efficiency**
  - **Maintenance support infrastructure**
  - **Lifecycle costs**
  - **Operational reliability**

These insights will become even more powerful when I analyze `engine_type` in conjunction with **injury severity**, **aircraft make**, and **purpose of flight** during bivariate and multivariate phases.


## 9. `purpose_of_flight`: *under what purpose did the accident occur? and what inside can i get?*

In [None]:

purpose_counts = (
    us_data['purpose_of_flight']
    .value_counts()
    .sort_values(ascending=True)  
    .reset_index()
)
purpose_counts.columns = ['Purpose of Flight', 'Count']

fig = px.bar(
    purpose_counts,
    x='Count',
    y='Purpose of Flight',
    orientation='h',
    title='Distribution of Purpose of Flight in Aviation Accidents',
    labels={'Count': 'Number of Accidents', 'Purpose of Flight': 'Purpose'},
    color_discrete_sequence=['#1f77b4']
)

fig.update_layout(
    template='plotly_white',
    title_font_size=20,
    font=dict(size=14),
    xaxis_gridcolor='lightgrey',
    yaxis=dict(dtick=1), 
    margin=dict(t=60, b=40, l=180, r=40)  
)

fig.show()




## Interpretation & Insights

The **dominance of "PERSONAL" flights** in accident data is unsurprising given the dataset’s skew toward **small, privately operated aircraft**. These aircraft are typically flown by individuals or hobbyists and may not be subject to the same rigorous **maintenance**, **pilot certification**, or **operational oversight** as commercial aircraft.

The second most frequent category, **"INSTRUCTIONAL"**, is similarly intuitive. Flight training involves **novice pilots**, inherently increasing the chance of **operational mistakes**, **mishandled procedures**, or **decision-making delays**. 

As we move into more specialized categories—like **"BUSINESS"**, **"EXECUTIVE/CORPORATE"**, or **"PUBLIC-AIRCRAFT"**—we observe **notably lower accident counts**. These flights are typically operated by **experienced pilots** under **stricter safety protocols**, using **well-maintained aircraft** with **robust standard operating procedures**.

> 🔍 **Data Note**: Inconsistencies such as `"PUBS"` and `"PUBL"` likely reflect minor classification or entry errors. They are presumed to stand for variations of **public aircraft** categories (e.g., *state* vs. *local* government usage), but appear infrequently enough to have minimal statistical impact.

---

### Business Recommendation

- **Invest in executive, business, and corporate aviation segments**: These operations have **lower accident rates**, suggesting **better regulatory compliance**, **professional maintenance standards**, and **higher pilot proficiency**.
  
- **Use caution with personal and instructional aviation**: These segments dominate accident counts not necessarily due to higher inherent danger, but due to **sheer volume and lower oversight**. If engaging in this space:
  - Ensure strict **training**, **maintenance**, and **compliance protocols**.
  - Consider **modernizing fleets** with automation-assisted or safety-enhanced aircraft.
  
- **Policy Implication**: Encourage better **reporting standardization** across public-use categories to improve data clarity (e.g., consolidate "PUBS" and "PUBL" entries).

This variable will offer even more powerful insights when analyzed alongside **injury severity**, **engine type**, and **aircraft category** in later phases of the analysis.


## 10. `weather_conditions`: *How does weather relate to aviation acidents?*

In [None]:

weather_counts = us_data['weather_condition'].value_counts().reset_index()
weather_counts.columns = ['Weather Condition', 'Count']

fig = px.pie(
    weather_counts,
    names='Weather Condition',
    values='Count',
    title='Distribution of Weather Conditions During Accidents',
    color_discrete_sequence=px.colors.sequential.Bluered
)

fig.update_traces(
    textposition='inside',
    textinfo='percent+label'
)

fig.update_layout(
    template='plotly_white',
    title_font_size=20,
    font=dict(size=14)
)

fig.show()




### Observation

The distribution of weather conditions during aviation accidents is heavily dominated by **Visual Meteorological Conditions (VMC)**, followed by a much smaller proportion of **Instrument Meteorological Conditions (IMC)**, with a negligible portion labeled as **UNK (Unknown)**.

### Interpretation

At first glance, it's surprising that most accidents occur in "good weather" (VMC). However, this trend aligns with how aviation, especially general and instructional aviation, operates:

- **VMC** conditions are when the **vast majority of small, personal, and instructional flights occur**. More exposure = more incidents.
- **IMC** accidents are rarer, likely because:
  - Fewer flights occur in IMC, typically restricted to **certified pilots and aircraft**.
  - **Stricter dispatch protocols** reduce IMC risk.
  - **Modern avionics** (GPS, terrain awareness, autopilot) mitigate weather-related hazards.
- Many VMC accidents could reflect **deteriorating weather conditions mid-flight**, especially dangerous for **visually-reliant aircraft**.

The **UNK** label, while minor, points to **reporting gaps** and potential system-level improvements.

### Business Recommendation

To mitigate weather-related risk—particularly in small aircraft operations—stakeholders should:

- **Invest in real-time, low-cost weather monitoring systems** suitable for personal and instructional aircraft.
-  **Reinforce go/no-go decision-making protocols**, even in VMC, to avoid complacency.
-  **Enhance pilot training with simulations** of sudden weather transitions (VMC ➝ IMC).
-  **Promote post-flight reporting and debrief culture** to reduce "UNK" entries and improve future data quality.

> ✈️ While weather isn't the leading cause of aviation accidents, it's a **silent multiplier of risk**, especially for under-equipped or under-trained pilots.


In [None]:
top_states = us_data['state_full'].value_counts().head(25)

plt.figure(figsize=(12, 8))
sns.barplot(
    x=top_states.values,
    y=top_states.index,
    palette='coolwarm'
)
plt.title('Top 20 U.S. States by Number of Aviation Accidents', fontsize=16)
plt.xlabel('Number of Accidents')
plt.ylabel('State')
plt.tight_layout()
plt.show()




###  Observation

Certain U.S. states stand out with significantly higher counts of aviation accidents.

###  Interpretation

States like **California, Texas, Florida, and Alaska** frequently top the accident count rankings — not due to inherent danger, but because of:

- **Higher air traffic volumes**.
- A **greater concentration of general aviation and private aircraft**, particularly in remote or scenic regions (Alaska).
- **Favorable weather**, making year-round flying more common.
- Numerous **flight schools and tourism-related aviation** activity.

In short, **more flights = more reported incidents**, not necessarily more risk per flight.

### Business Recommendation

To improve aviation safety and manage exposure:

- **Prioritize safety programs and proactive inspections** in high-traffic states.
- Recognize that **incident volume often reflects activity volume**, not unsafe conditions.
- **Collaborate with local aviation bodies, schools, and businesses** in these states to:
  - Promote **routine maintenance**.
  - Deliver **targeted safety training**.
  - Improve **data collection and reporting standards**.

> Treat these states as **high-exposure zones**, ideal for piloting innovative safety technologies, maintenance tracking tools, and community outreach programs.


## 11. `far_description`: *What is far description? why do they matter?*

In [None]:
far_description_counts = us_data['far_description'].value_counts().reset_index()
far_description_counts.columns = ['far_description', 'count']

fig = px.bar(
    far_description_counts,
    x='far_description',
    y='count',
    text='count',
    title='Distribution of FAR Descriptions in Aviation Accidents',
    labels={'far_description': 'FAR Description', 'count': 'Number of Records'},
    color='far_description',
    color_discrete_sequence=px.colors.qualitative.Set3
)

fig.update_traces(textposition='outside')
fig.update_layout(
    uniformtext_minsize=8,
    uniformtext_mode='hide',
    xaxis_tickangle=-45,
    yaxis_title='Accident Count',
    xaxis_title='FAR Description',
    showlegend=True
)

The `far_description` field doesn't just categorize accident types — it reflects the **regulatory ecosystem** in which flights are conducted. Understanding it helps decode where **regulatory strength meets operational exposure**.

### Interpretation

#### **Part 91 – General Aviation (likely dominant)**
- Covers non-commercial operations: private flights, training, personal/recreational use.  
- Lower entry barriers, **minimal FAA oversight**, and widely varied pilot skill levels.  
- Many accidents are logged here not because it’s inherently unsafe, but because this is where **the most flying happens with the least systemic safety net**.  
- Also includes experimental/amateur-built aircraft and non-revenue ops — **known risk factors**.

#### **Part 135 – Commuter & On-Demand (Charter)**
- Sweet spot for **commercial viability and manageable risk**.  
- Includes air taxis, medical flights, and small regional carriers.  
- Requires stricter maintenance, crew training, and operational rules, but not as resource-intensive as Part 121.  
- **Lower incident rates (adjusted for volume)** suggest a more dependable segment for initial investment.

#### **Part 121 – Scheduled Commercial Carriers**
- Think major airlines.  
- Highly regulated, with the **best safety records in the industry**.  
- Very few accidents due to mature processes, professional crews, and real-time safety monitoring.  
- Entering this space is **capital-heavy**, requiring substantial compliance infrastructure.

#### **Other Descriptions (Public Use, Military, Foreign, Unknown)**
- Minimal in count.  
- Often fall outside FAA oversight or lack traceable FAR classification.  
- **Exclude from decision-critical dashboards**, but note in footnotes or filters for completeness.

---

### Business Recommendations

####  **Short-Term Strategy – Enter under Part 135**
- **Why**: Balanced operational freedom and safety oversight.  
- **Use Cases**: Charter flights, regional tourism, air ambulance, corporate shuttle.  
- **Action**: Build compliance capacity specific to 135 (training, aircraft logs, maintenance audits).

#### **Caution with Part 91**
Only pursue if:
- You control the training pipeline (e.g., flight school or personal aircraft program).  
- Aircraft are thoroughly vetted (no experimental or poorly documented builds).  
- You implement **voluntary safety programs**, like Safety Management Systems (SMS).

#### **Long-Term Vision – Grow into Part 121**
- **Why**: Safest and most respected segment, but expensive to enter.  
- **Action**: Use early experience to build safety credibility, then transition into regional commercial air services.

#### **Compliance First!**
- Set up an **in-house compliance team** before you even buy the first plane.  
- Adopt **electronic maintenance records**, **recurrent pilot training**, and **data reporting pipelines** from day one.  
- Leverage **FAA audits** and **voluntary disclosure programs** to future-proof operations.


## 12. `schedule type`: *Scheduled? unscheduled? or unknown what operating conditions are the best?*

In [None]:


schedule_counts = us_data['schedule_type'].value_counts().reset_index()
schedule_counts.columns = ['schedule_type', 'count']

fig = px.pie(
    schedule_counts,
    names='schedule_type',
    values='count',
    title='Distribution of Schedule Type in Aviation Accidents',
    color_discrete_sequence=px.colors.sequential.RdBu
)

fig.update_traces(textinfo='percent+label')
fig.update_layout(title_font_size=20)

fig.show()


The `schedule_type` field provides insight into the **operational discipline** of a flight. While it may appear like a mere formality, how scheduled a flight is often correlates with how **regulated, trained, and predictable** its conditions are.

### 
 Observation & Interpretation
#### **Non-Scheduled Flights Dominate**
- The majority of accidents occur in non-scheduled operations.
- These include **personal, instructional, positioning (ferry), and test flights** — all typically part of general aviation.
- These operations are often **less structured, less regulated, and flown by pilots with varying levels of experience**.
- Maintenance may also be done less frequently or less thoroughly, depending on the operator.

#### **Scheduled Flights Have Significantly Fewer Accidents**
- Scheduled flights (like commercial airlines) have lower accident counts despite far more frequent operations.
- This reflects the **power of structure, checklists, regulatory oversight, and professional training**.
- Scheduled carriers follow tight protocols on weather, maintenance, and crew duty limits — a **system that works**.

---

###  Business Recommendations

####  **Lean Into Structured Operations**
- **Why**: Accidents are more prevalent in loosely organized, ad-hoc flying.  
- For companies investing in aviation, prioritize **scheduled or semi-scheduled** flight models to benefit from safer operational patterns.

-----


#  BIVARIATE ANALYSIS

-----

### Objectives
Now that i understand each column individually, this phase focuses on exploring **how two variables interact** to reveal deeper insights. This allows us to:

- Identify combinations that contribute to **aircraft accident severity** or **scale**,  
- Spot patterns that point to **safer vs riskier aircraft types**,  
- Provide **business-driven recommendations** for investment and operations.

Our guiding business problem:

> _“Which aircraft are the lowest risk for our company to purchase and operate in commercial or private aviation?”_

---

### Key Analytical Themes
Each bivariate combination will aim to answer **actionable** questions like:

- _Are certain aircraft makes and models more likely to be involved in severe accidents?_
- _Do personal or instructional flights carry more risk than business or executive flights?_
- _Is accident severity influenced by aircraft category or engine configuration?_
- _Are non-scheduled flights significantly riskier than scheduled ones?_
- _Do regional patterns exist in injury severity or accident impact?_

We will **postpone deeper weather analysis** to the **multivariate phase**, where it will yield more value by being studied in conjunction with other features (e.g., **weather + flight purpose**, or **weather + aircraft type + severity**).

---

###  Key Variable Pairs for Bivariate Analysis

| Variable A              | Variable B              | Purpose |
|-------------------------|-------------------------|---------|
| `aircraft_make_model`   | `injury_severity_clean` | Link brands/models to risk level |
| `purpose_of_flight`     | `injury_severity_clean` | Measure risk by intent |
| `aircraft_category`     | `total_individuals_affected` | Evaluate impact by aircraft type |
| `state_full`            | `total_individuals_affected` | Identify regional trends |


---

###  Visualizations & Tools
I’ll use:
- **Grouped and stacked bar plots** for categorical-to-categorical relationships,
- **Box plots or strip plots** when exploring severity vs continuous variables,
- **Clear labeling**, **intuitive ordering**, and **insightful color schemes** for business-ready storytelling.

---

> The goal: keep it sharp, focused, and always steering toward our **aviation investment decision-making**.



In [None]:
us_data.columns

## 1. *What is the risk profile of different aircraft makes and models based on accident severity?*

In [None]:
top_models = us_data['make_model'].value_counts().head(20).index
df_top = us_data[us_data['make_model'].isin(top_models)]

fig = px.histogram(
    df_top,
    x='make_model',
    color='injury_severity_clean',
    barmode='group',
    category_orders={'make_model': top_models},
    title='Injury Severity Distribution by Aircraft Make & Model in the top 20',
    labels={'make_model': 'Aircraft Make & Model', 'count': 'Number of Accidents'},
    color_discrete_sequence=px.colors.qualitative.Safe
)

fig.update_layout(
    xaxis_tickangle=45,
    xaxis_title='Aircraft Make & Model',
    yaxis_title='Number of Accidents',
    legend_title='Injury Severity',
    bargap=0.2
)

fig.show()

In [None]:
top_models = us_data['make_model'].value_counts()[100:130].index
df_top = us_data[us_data['make_model'].isin(top_models)]

fig = px.histogram(
    df_top,
    x='make_model',
    color='injury_severity_clean',
    barmode='group',
    category_orders={'make_model': top_models},
    title='Injury Severity Distribution by Aircraft Make & Model between top 100 -130',
    labels={'make_model': 'Aircraft Make & Model', 'count': 'Number of Accidents'},
    color_discrete_sequence=px.colors.qualitative.Safe
)

fig.update_layout(
    xaxis_tickangle=45,
    xaxis_title='Aircraft Make & Model',
    yaxis_title='Number of Accidents',
    legend_title='Injury Severity',
    bargap=0.2
)

fig.show()

## 2. Aircraft Model vs. Injury Severity

###  Observation & Interpretation

This bivariate analysis uncovers a stark pattern:  
Accidents involving the **most frequent aircraft models** in this dataset overwhelmingly result in **fatal injuries**. Survival rates are minimal — particularly among small aircraft.

#### Key Contributing Factors:

- **Aircraft Size & Structure**  
  Most accidents involve **1–2 seaters** or small personal aircraft. In such cases, structural limits and lack of onboard safety features mean **crashes are often unsurvivable**.

- **Flight Purpose**  
  Many of these aircraft are used for **personal or instructional flights**, which are:
  - Less stringently regulated
  - Often piloted by less experienced operators
  - Subject to broader environmental and operational variability

- **Model Popularity vs. Safety**  
  Even among well-known models — like the *Boeing A75N1* — **fatality is the leading outcome**.  
  This suggests the nature of the crash, rather than the model’s build quality alone, drives survivability.

> **Notably**, commercial and corporate models are underrepresented in the dataset. This is not likely due to underreporting, but rather because:
> - These aircraft operate under **tight regulatory frameworks**
> - Pilots are **professionally trained**
> - Flight conditions are **tightly controlled**

---

### Business Recommendation

If your company seeks to **minimize investment risk** in aviation:

#### **Favor Commercial & Business-Class Aircraft**
- These aircraft show **lower accident rates**.
- Crashes are statistically rare and often more survivable.
- Ideal for **reliable, large-scale operations** such as logistics, charter, or regional transport.

####  **If Entering the Small Aircraft Market, Raise the Safety Bar:**
- Employ **highly trained pilots** with structured recurrent training.
- Equip aircraft with **modern avionics, weather monitoring, and emergency systems**.
- Enforce **stringent maintenance schedules** and operational protocols.
- Target **niche, high-value sectors** like:
  - Executive city-hopping
  - Scenic/tourism flights
  - Medical/emergency response (with proper infrastructure)

---

###  Strategic Tie-In:
This insight aligns with previous findings on **schedule_type** and **purpose_of_flight** —  
**Structured operations = safer flights**.

In multivariate dashboard, combining `make_model` with `injury_severity_clean`, `purpose_of_flight`, and `weather_condition` will spotlight which aircraft are **not only frequent**, but also **safe under specific conditions**.


 ## 2. *Engine-Type and Engine-Count: Why I Excluded from Bivariate*

When conducting a risk-focused bivariate analysis of aviation accidents, it's tempting to include technical specs like `engine_type` and `number_of_engines`. However, **these variables were intentionally excluded**, and for good reason — the goal of this project is to assess **human-centered aviation risk**, not mechanical performance.

### The Mission: Understand Human Risk, Not Machinery

This analysis is meant to guide investment decisions with a **people-first approach**. We prioritize metrics that reflect **injury and fatality outcomes**, because:
- **Aviation is about moving people and cargo** — but here, we focus on **human safety**.
- Variables like `injury_severity_clean` and `total_individuals_affected` directly capture the **true impact of each accident** on human lives.
- Engine-related fields lack the depth needed for **reliable, ethical conclusions** about safety.

---

###  Why Engine Variables Were Not Used

#### 1.  Lack of Technical Depth
- The dataset contains only broad engine categories (e.g., "RECIPROCATING", "TURBO-FAN").
- It provides **no engine performance data, failure logs, or maintenance records**.
- As such, **any conclusions drawn would be speculative and potentially misleading**.

#### 2. All Engines Are Certified for Safety
- Aviation engines must meet **strict FAA, EASA, and ICAO safety certifications**.
- By the time an engine is installed on an aircraft, it has undergone **years of testing**.
- **Engine type does not correlate directly with crash risk** unless coupled with real-world failure and maintenance data — which we don't have.

#### 3. Slow Tech Adoption in Aviation
- Aircraft engine innovation moves **very slowly** due to:
  - **Decades-long aircraft life cycles**
  - **High certification costs**
  - **Extreme focus on safety and reliability**
- Example: The **CFM56 engine**, developed in the 1970s, still powers thousands of planes in 2025.

#### 4. Low Engine Diversity
- Just a **handful of companies dominate** aircraft engine manufacturing:
  - **GE Aviation, Pratt & Whitney, Rolls-Royce, CFM International, Safran**
  - In general aviation: **Lycoming, Continental, Rotax, Jabiru**
- This makes `engine_type` a **poor proxy** for meaningful variability in safety outcomes.

#### 5.  Poor External Joinability
- Although `registration_number` exists, it’s not unique or clean enough to **reliably join with FAA engine data**.
- Attempting to enrich this field would lead to **inconsistent merges or forced imputations**, weakening analytical validity.

---

###  Focus on Human-Centric Variables

#### `injury_severity_clean`  
- Offers a **clear, categorical representation** of human harm: *Fatal*, *Serious*, *Minor*, *None*  
- Easy to aggregate and visualize across aircraft types and flight conditions.

#### `total_individuals_affected`  
- Captures **true accident scale**, regardless of how many people were onboard.
- Offers a **normalized, scalable lens on impact** — especially useful across different aircraft classes.

> Importantly, **both variables are clean and require no imputation**, making them ethically appropriate for risk assessment.

---

### Final Insight

While engines are mechanically vital, they are **not suitable for this dataset’s scope or depth of analysis**.

Instead, to assess aircraft safety risk:

- **Prioritize models with historically low fatality outcomes.**
- **Ensure top-tier maintenance, regardless of engine type.**
- **Invest in structured flight operations** (scheduled, commercial services).



## 3. *Injury severity vs States*

In [None]:

top_states = (
    us_data.groupby('state_full')
    .size()
    .sort_values(ascending=False)
    .head(30)
    .index
)

state_severity_counts = (
    us_data[us_data['state_full'].isin(top_states)]
    .groupby(['state_full', 'injury_severity_clean'], as_index=False)
    .size()
)

state_severity_counts.rename(columns={"size": "count"}, inplace=True)

fig = px.bar(
    state_severity_counts,
    x="state_full",
    y="count",
    color="injury_severity_clean",
    title="Injury Severity Distribution Across Top 30 States",
    labels={"state_full": "State", "count": "Number of Cases", "injury_severity_clean": "Injury Severity"},
    barmode="group"
)

fig.update_layout(
    xaxis_tickangle=-45,
    legend_title_text="Injury Severity",
    margin=dict(l=40, r=40, t=60, b=100)
)

fig.show()


### Observation

States like **California, Texas, Florida, and Arizona** consistently top the charts for both:
- **Accident frequency**
- **Fatal and serious injuries**

These states are responsible for a significant share of total aviation incidents — often involving small aircraft and personal or instructional flights.

---

###  Interpretation

Why these states?

They share common characteristics that amplify aviation activity:

-  **Favorable Weather:** Clear skies and mild winters make flying more consistent year-round
-  **High Volume of General Aviation:** These states are hubs for:
  - **Flight training schools**
  - **Tourism and air charters**
  - **Agricultural aviation**
  - **Private and business travel**
-  **Dense Airfield Networks:** Numerous private strips, FBOs, and local airports.
- **Geographical Size:** Larger territories = more aircraft ownership and intra-state travel.

More flights mean more **exposure**, not necessarily more **risk per flight**.

Severity remains constant: **Accidents in these areas still predominantly result in fatalities**, especially among non-commercial or unscheduled operations.

---

### Business Recommendation

If entering the aviation industry, **view these states as high-volume, not high-failure**:

####  Market Entry Strategy
- **Target** these states due to strong aviation culture and demand.
- **Focus on niches** like:
  - Commercial and scheduled operations (which have tighter regulations)
  - Business jets and high-net-worth clientele (often better maintained)

#### Safety-First Investment
- If pursuing general aviation (training, personal aircraft):
  - Invest in **top-tier safety systems**
  - Implement **rigorous maintenance programs**
  - Hire only **certified, experienced instructors**

#### Strategic Add-on: Risk Services
- Consider offering:
  - **Aviation risk consulting**
  - **Insurance products tailored for private and instructional operators**
- These services are likely in high demand where **aviation density is greatest**.

---



## 4. *Individuals affected vs Aircraft Category*

In [None]:

cat_impact = us_data.groupby('aircraft_category', as_index=False)['total_individuals_affected'].mean()

cat_impact = cat_impact.sort_values(by='total_individuals_affected', ascending=False)

fig = px.bar(
    cat_impact,
    x='aircraft_category',
    y='total_individuals_affected',
    color='aircraft_category',
    title=' Average Number of Individuals Affected per Aircraft Category',
    labels={
        'aircraft_category': 'Aircraft Category',
        'total_individuals_affected': 'Avg. Individuals Affected per Accident'
    },
    text='total_individuals_affected'
)

fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_layout(
    showlegend=False,
    yaxis=dict(title='Avg. Individuals Affected'),
    xaxis=dict(title='Aircraft Category'),
    uniformtext_minsize=8,
    uniformtext_mode='hide'
)

fig.show()


## 5. *Aircraft Category vs. Total Individuals Affected*

###  Observation

While the dataset is heavily skewed toward **small general aviation aircraft** (1–2 seaters), a surprising trend emerges:

- *Balloons** have the **highest average number of people affected per accident**, likely due to tourist operations where 6–20+ individuals fly together.
- **Single and multi-engine fixed-wing aircraft** dominate the dataset in volume, often involving solo or duo flights — so individual accident counts are lower, but **frequency is high**.
- **Commercial aircraft** like Boeing jets appear *infrequently* — but when they do, they show **very high passenger counts** (e.g., 700+ affected) due to their scale. Despite that, **fatalities are rare** in these records, highlighting strong safety systems.

---

###  Interpretation

We’re seeing **risk concentration** at both ends of the spectrum:

####  **Balloons**  
- High group exposure in a fragile, weather-sensitive platform.  
- Accidents, while rare, affect many due to tight passenger density and lack of crash protection.

####  **Small Aircraft (1–2 seaters)**  
- The majority of accidents.  
- Usually lower total impact per event — but **very high fatality rates**.  
- Many are used for personal, instructional, or ad-hoc operations — often with **less regulation, oversight, or pilot rigor**.

#### **Medium to Large Commercial Jets**  
- Few entries in the data — but crucial to interpret.  
- Despite carrying hundreds, these flights **very rarely result in fatalities**.  
- Reflect the power of **structured operations**, **rigorous regulation**, and **professional training**.

---


#### Insights

1. **Don't underestimate balloons** — the serene image hides real risk:
   - High group exposure + minimal control = disproportionate impact potential.
   - Require strict weather rules, expert piloting, and emergency protocols.

2. **Treat small aircraft with high caution**:
   - They dominate accident volume and have fatality-heavy outcomes.
   - Investments here must **prioritize airworthiness, pilot training, and oversight** — especially for instructional or recreational purposes.

3. **Commercial aviation remains the gold standard**:
   - Even with the largest number of people per flight, the data confirms its safety.
   - This suggests that **structure, standardization, and accountability are effective risk reducers**.

4. **Balance your risk strategy based on aircraft category**:
   - **High-volume, low-capacity flights** (e.g., flight schools) require systemic operational discipline.
   - **Low-frequency, high-capacity platforms** (e.g., commercial jets, tour balloons) require disaster planning and strong safety architecture.

---


## 6. *Injury severity Trends over the years*

In [None]:
year_severity_counts = us_data.groupby(['event_year', 'injury_severity_clean'], as_index=False).size()
year_severity_counts.rename(columns={"size": "count"}, inplace=True)
fig = px.bar(
    year_severity_counts,
    x="event_year",
    y="count",
    color="injury_severity_clean",
    title="Injury Severity Trends Over the Years",
    labels={
        "event_year": "Year",
        "count": "Number of Cases",
        "injury_severity_clean": "Injury Severity"
    },
    barmode="stack"
)

fig.update_layout(
    xaxis=dict(dtick=5),
    legend_title_text="Injury Severity",
    margin=dict(l=40, r=40, t=60, b=80),
    xaxis_tickangle=-45
)

fig.show()



### Observation & Interpretation

Analyzing `event_year` against `injury_severity_clean` reveals a compelling story:

-  **Steady decline in accidents and fatalities** leading up to 2020.
- **Sharp dip in 2020** — directly aligning with the global halt in air travel during the COVID-19 pandemic.
- **Sudden spike post-2020**, particularly in **fatal injuries** — likely due to:
  - Aircraft being rapidly brought out of storage.
  - Pilots returning after long inactivity (skill fade).
  - Ground crews and maintenance teams working through massive backlogs.
  - Operational inconsistencies in the early reopening phases.

Despite overall improvements in aviation safety across decades, one stubborn fact remains:
> “When accidents do happen — they are still overwhelmingly fatal.”

---

###  Insight

This pattern underscores the **dual nature of aviation risk**:
- **Frequency** can be reduced via smarter scheduling, oversight, and policy.
- But **severity**, when incidents occur, **remains catastrophic** — particularly for small, general aviation aircraft.

---

### Business Recommendations

#### 1. **Invest in Preventive Infrastructure**
- Shift focus from **reactive safety** to **proactive prevention**.
- Maintain readiness audits and technical assessments for any fleet that’s been grounded — especially after industry-wide disruptions.

#### 2. **Scrutinize Restart Phases**
- Avoid launching or scaling operations during unstable recovery periods (like post-2020) unless you have:
  - Fully re-certified pilots
  - Proven aircraft airworthiness
  - Updated risk assessments

#### 3. **Incorporate Historical Severity in Risk Modeling**
- When creating actuarial or operational models, factor in **long-term severity trends** — especially for insurance, underwriting, and investment decisions.
- Account for post-disruption volatility as a **separate risk class**.

#### 4. **Support Crew Reentry Programs**
- Partner with training institutions to provide **post-grounding recertification programs**.
- Consider subsidies or partnerships for simulation-based retraining to mitigate pilot and crew risk.

---
> The takeaway isn’t just that fewer flights mean fewer accidents — it’s that **restarting too quickly, without adequate preparation, is a risk multiplier**.

**Smart aviation businesses will plan for the next disruption, not just recover from the last.**


#  Multivariate Analysis

Now that **univariate** and **bivariate** explorations have illuminated the behavior of individual variables and their pairwise relationships, we deepen our investigation by examining how **three or more variables interact** simultaneously.

In a high-risk, tightly regulated domain like aviation, **complex variable interplays** often reveal the most actionable and strategic insights. This level of analysis moves us from "what happened" to "why it happens under specific operational contexts."

---

###  Objective

The goal of multivariate analysis is to **pinpoint combinations** of factors — including aircraft models, flight purposes, and injury outcomes — that drive **higher or lower operational risk**.

By doing so, we empower decision-makers to:

- Choose safer aircraft and operation types,
- Design smarter business strategies for aviation entry,
- Implement targeted safety and maintenance protocols,
- Build more nuanced, data-backed risk models.

---

### Focus Areas for Analysis

Multivariate combinations are chosen based on their relevance to **aviation use cases** and their impact on **human outcomes** — a core lens for ethical and actionable aviation analytics.

#### Key Axes of Exploration:

1. **Aircraft Usage Context**
   - `purpose_of_flight` (e.g., personal, instructional, business, commercial)
   - `schedule_type` (scheduled vs non-scheduled)

2. **Human Impact**
   - `injury_severity_clean` (Fatal, Serious, Minor, None)
   - `total_individuals_affected`

3. **Temporal and Geographic Trends**
   - `event_year`, `state_full`, `weather_condition`

4. **Aircraft-Specific Characteristics**
   - `aircraft_model`, `aircraft_category`, `aircraft_damage`, `amateur_built`

---

### Why Multivariate Matters

Simple correlations don’t always capture the **real-world complexity** of aviation incidents. For example:

> A small aircraft might seem risky overall, but when used for **instructional purposes in Arizona**, its fatality rate might rise sharply due to high flight school volume and less experienced pilots.

This type of **contextual insight** can only be uncovered through multivariate analysis — and it's exactly the kind of intelligence that drives **precision policy, investment, and operational decisions**.

---



## 1. *Injury Severity by Aircraft Model and Purpose of Flight*

In [None]:
top_models = (
   us_data['make_model']
    .value_counts()[:5]
    .index
)

filtered_df =us_data[us_data['make_model'].isin(top_models)]

grouped = (
    filtered_df
    .groupby(['make_model', 'purpose_of_flight', 'injury_severity_clean'], as_index=False)
    .size()
    .rename(columns={'size': 'count'})
)

fig = px.bar(
    grouped,
    x="make_model",
    y="count",
    color="injury_severity_clean",
    facet_col="purpose_of_flight",
    title="Injury Severity by Aircraft Model and Purpose of Flight",
    labels={
        "make_model": "Aircraft Model",
        "count": "Number of Cases",
        "injury_severity_clean": "Injury Severity"
    },
    barmode="stack",
    height=600
)

fig.update_layout(
    xaxis_tickangle=-45,
    margin=dict(l=40, r=40, t=60, b=120),
    legend_title_text="Injury Severity",
    showlegend=True
)

fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1])) 
fig.show()




### Observation & Interpretation

Multivariate analysis across `make_model`, `purpose_of_flight`, and `injury_severity_clean` reveals a clear and consistent pattern:

-  **Instructional and Personal flights** dominate the dataset in both accident volume and severity.
  - These categories report **significantly higher frequencies of fatal and serious injuries** across nearly all aircraft models.

####  Why This Makes Sense:
- **Instructional flights** often involve **student pilots or trainees**, naturally introducing **greater risk due to inexperience**.
- **Personal flights** are often less regulated, with **variable pilot skill levels** and **irregular maintenance standards**. These may include hobbyists, amateur-built aircraft, or non-commercial ventures with minimal oversight.

In contrast:
-  **Commercial**, **Business**, and **Executive** operations show **lower accident rates** and generally **safer profiles** — thanks to:
  - Certified, trained pilots
  - Regulated schedules
  - Standardized maintenance
  - Modern avionics and infrastructure

However, when these do experience incidents, **severity spikes** due to **higher altitudes, speeds**, and **passenger counts** — though this is rare.

---

### Strategic Recommendation

####  For Low-Risk Investment Strategy:
Focus on **commercial or business-class aviation**, where accident frequency is low and safety oversight is high. These operations typically offer:
- Strict regulatory compliance
- Experienced, certified pilots
- Proactive maintenance routines
- Structured, reliable scheduling systems

#### If Entering Personal or Instructional Markets:
While higher in risk, these markets offer **scalable opportunities** (e.g., tourism, flight schools, rental clubs). To mitigate risk:
-  Implement **rigorous pilot qualification criteria**
-  Require **flight simulator training and recurrent check-rides**
-  Enforce **aircraft maintenance audits**
-  Equip fleets with **modern safety tech** (e.g., ADS-B, stall protection, terrain warning systems)

---
> Purpose of flight is a leading risk indicator. The more casual or under-regulated the operation, the higher the risk. Smart aviation ventures must balance **opportunity volume with operational safety**.



## 2. Injury Severity Over Time by Weather Condition

In [None]:
weather_trend = (
    us_data
    .groupby(['event_year', 'weather_condition', 'injury_severity_clean'], as_index=False)
    .size()
    .rename(columns={"size": "count"})
)

weather_trend = weather_trend[weather_trend["weather_condition"] != "UNK"]

fig = px.bar(
    weather_trend,
    x="event_year",
    y="count",
    color="injury_severity_clean",
    facet_col="weather_condition",
    barmode="stack",
    labels={
        "event_year": "Year",
        "count": "Number of Accidents",
        "injury_severity_clean": "Injury Severity"
    },
    title="Injury Severity Over Time by Weather Condition"
)

fig.update_layout(
    legend_title_text="Injury Severity",
    margin=dict(l=40, r=40, t=60, b=80),
    xaxis_tickangle=-45
)

fig.show()


## 3. Total Individuals Affected by Aircraft Category Across States

In [None]:

grouped_df = us_data.groupby(['state_full', 'aircraft_category'], as_index=False)['total_individuals_affected'].sum()


fig = px.bar(
    grouped_df,
    x="state_full",
    y="total_individuals_affected",
    color="aircraft_category",
    title="Total Individuals Affected by Aircraft Category Across States",
    labels={
        "total_individuals_affected": "Total Individuals Affected",
        "state_full": "State",
        "aircraft_category": "Aircraft Category"
    },
    barmode="stack"
)

fig.update_layout(
    xaxis_tickangle=-45,
    legend_title_text="Aircraft Category",
    margin=dict(l=40, r=40, t=60, b=100)
)

fig.show()



### Observation

- **Airplanes overwhelmingly dominate** the count of individuals affected across all U.S. states — unsurprising, as fixed-wing aircraft are the most commonly used for both personal and commercial aviation.

- States such as **California, Texas, Florida, and Arizona** report the **highest number of individuals affected**. These states host major hubs for:
  - **Tourism and business aviation**,
  - **Flight schools and training programs**,
  - **Large general aviation fleets**.

- The dataset skews toward **small aircraft**, especially those used in private or instructional contexts. This naturally inflates incident frequency involving 1–6 individuals, contributing heavily to total affected counts.

---

### Interpretation

This pattern reflects real-world aviation exposure:

- **High-impact states** are also the busiest aviation corridors — more aircraft, more operations, more people on board.

- **Small aircraft** dominate incident counts and affect totals **not because they’re more dangerous per se**, but because:
  - They **make up the bulk of flights**,
  - Operate under **less stringent oversight**,
  - Often carry **less experienced pilots**,
  - And are used in **diverse, less-regulated missions** like training, recreation, and private use.

- Meanwhile, **commercial, business, and executive aircraft** — while underrepresented in this dataset — show:
  - **Fewer total individuals affected**, and
  - **Far lower incident frequency**, reflecting better structure, regulation, and crew proficiency.

---

### Business Recommendation


####  Aircraft Type
- **Prioritize aircraft categories aligned with structured operations** — such as commercial, business, and executive jets — where:
  - Maintenance cycles are enforced,
  - Flight planning is regulated,
  - Pilot certification is stringent.

- **Be cautious with small personal or instructional aircraft investments**, unless paired with:
  - High safety system standards,
  - Professional training programs,
  - Risk management or insurance strategies.

#### Location Strategy
- Consider operating in or near **aviation-dense states**, but don’t view that as increased risk.
  - These regions offer **infrastructure**, **talent pools**, and **market opportunities**.
  - Risk is mitigated when operations are structured and professionally managed.

>  **Insight**: Incidents in small aircrafts are a function of **volume**, not necessarily poor design — but they highlight the importance of **operational discipline** and **fleet choice** in minimizing impact.

---
