## Aviation Accident Risk Analysis: Data-Driven Recommendations for Safer Investments
This project explores historical aviation accident data to identify patterns, contributing factors, and risk profiles associated with various aircraft models, flight conditions, and operational phases. By integrating accident records with regulatory data, weather conditions, and aircraft registration details, I aim to uncover actionable insights that support strategic decision-making—particularly for stakeholders assessing aircraft safety before investment or deployment.

Through a combination of statistical techniques and visual analytics, this analysis reveals key trends spanning decades of incidents. The ultimate goal: to deliver **at least three concrete, data-backed business recommendations** that enhance aviation safety and reduce investment risk for operators, insurers, and aviation decision-makers.

## Guiding Questions for Analysis

To shape meaningful business recommendations and uncover the underlying factors contributing to aviation accidents, the following key questions will guide me in my analysis:

1. **Which aircraft models are associated with the highest and lowest accident rates, and how do these rates compare when normalized by fleet size or registration volume?**  
   *→ Informs investment risk by identifying safer aircraft models.*

2. **What role do weather conditions play in aviation accidents, and which specific weather types are most frequently linked to severe outcomes?**  
   *→ Supports operational planning and risk mitigation under adverse weather.*

3. **Are there identifiable trends in accidents across different phases of flight (e.g., takeoff, cruise, landing), and do these vary by aircraft type or operator category?**  
   *→ Guides targeted safety interventions at high-risk phases.*

4. **To what extent do regulatory or maintenance-related issues contribute to accident frequency or severity?**  
   *→ Informs policy adjustments and helps rank compliance risk across aircraft categories.*

5. **Have accident patterns shifted over time, and what does this reveal about the effectiveness of safety regulations or technological advancements?**  
   *→ Tracks progress and identifies areas needing continued focus.*

6. **Are there regional or geographical patterns in accident occurrence, especially in relation to weather or regulation enforcement?**  
   *→ Offers strategic insight for operators expanding into new territories.*

## PHASE ONE:  Data Understanding

In this section, I will examine every dataset used in the project. The goal is to assess their structure, contents, and quality, and to begin identifying how they can be integrated to support meaningful analysis and actionable insights.

---

###  Objectives

- Understand the schema, variables, and value distributions in each dataset.
- Assess data quality: missing values, inconsistencies, encoding issues.
- Identify relationships and join keys across datasets.
- Define preprocessing needs for each dataset.

---

###  Approach

#### 1. **Main Exploration (Aviation Data)**
- Load the aviation accident dataset.
- Inspect variable types and value ranges.
- Identify missing or inconsistent values.
- Explore time, location, aircraft model, and severity distributions.

#### 2. **Explore Supplementary Data**
- Review each FAA dataset:
  - Are the values well-formatted?
  - Any obvious missing or invalid entries?
  - What columns are useful?

#### 3. **Plan for Dataset Integration**
- Identify common keys for joining:
  - `Registration.Number` ↔ `N-Number` (FAA)
  - `Model` ↔ `MODEL` (FAA)
  - Date + Lat/Lon proximity ↔ GHCND Weather
- Consider transformations (e.g., date parsing, coordinate matching).
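
The join on registration numbers can be sketched as below. This is a minimal illustration on made-up rows, not the notebook's final merge logic. It assumes the accident data stores registrations with a leading `N` while the FAA file stores them without it; the helper `normalize_registration` and the `reg_key` column are hypothetical names introduced only for this example.

```python
import pandas as pd

# Toy stand-ins for the accident data and the FAA MASTER file (values are made up).
accidents = pd.DataFrame({
    "Registration.Number": ["N123AB", " n456cd", None],
    "Model": ["172", "PA-28", "737"],
})
faa_master = pd.DataFrame({
    "N-NUMBER": ["123AB", "456CD "],
    "YEAR MFR": [1978, 1985],
})

def normalize_registration(series):
    """Trim, upper-case, and drop a leading 'N' so both sources share one key format.

    Assumption: the part after the 'N' never starts with 'N' itself,
    so stripping is harmless on keys that are already bare.
    """
    cleaned = series.str.strip().str.upper()
    return cleaned.str.replace(r"^N", "", regex=True)

accidents["reg_key"] = normalize_registration(accidents["Registration.Number"])
faa_master["reg_key"] = normalize_registration(faa_master["N-NUMBER"])

# A left join keeps every accident row; `_merge` shows which rows found a match.
merged = accidents.merge(faa_master, on="reg_key", how="left", indicator=True)
print(merged[["Registration.Number", "YEAR MFR", "_merge"]])
```

Keeping the `_merge` indicator makes it easy to measure how many accident records fail to match before relying on the enriched columns.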

---



In [None]:
#importing standard libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=pd.errors.DtypeWarning)

## PART ONE: Core Data Set Understanding
The primary dataset for this project consists of detailed records of aviation accidents, capturing various attributes such as accident number, date, aircraft model, flight phase, location, injury severity, and more. This dataset serves as the backbone of my analysis and will help me uncover core patterns in accident frequency, severity, and causes.

Before diving into analysis, I will begin by examining the structure and content of this dataset to understand its variables, detect missing or inconsistent data, and identify potential areas for transformation. This step is critical in ensuring that my insights are grounded in clean, reliable, and meaningful data.

**Objectives:**
- Get familiar with the features (columns) present in the dataset  
- Check the completeness and data types of each feature  
- Identify key columns that will drive our analysis.
- Detect potential issues such as missing values, formatting inconsistencies, or ambiguous entries  

This understanding will guide the cleaning, enrichment, and merging steps that follow as I prepare this data for deeper analysis and cross-linking with the supplementary datasets.


In [None]:
#Loading the data
aviation_data = pd.read_csv("Data/Aviation-data/AviationData.csv", encoding='latin1')

In [None]:
#check the shape
aviation_data.shape

In [None]:
#preview of the first five rows
aviation_data.head()

In [None]:
#check the last five rows
aviation_data.tail()

In [None]:
#checking columns
aviation_data.columns

In [None]:
#quick view of the data set
aviation_data.info()

In [None]:
#Checking numerical data
aviation_data.describe().T

In [None]:
#Checking categorical Data
aviation_data.describe(include='O').T

In [None]:
#Checking for missing values
aviation_data.isna().any()



###  Findings
- The dataset loaded successfully using `latin1` encoding, which was needed because some fields contain extended characters.
- A preliminary inspection using `.head()` and `.tail()` confirms the structure is consistent across rows.

###  Columns & Features
- The dataset contains a wide range of features including:
  - Aircraft information (make, model, engine type, registration number, etc.)
  - Flight conditions (weather, phase of flight, purpose of flight)
  - Accident details (date, location, injury severity, aircraft damage, narrative)

- Column names are inconsistent and will require **standardization and renaming** for readability and usability in analysis.

###  Data Types and Initial Insights
- The `.info()` summary reveals a mixture of:
  - **Categorical features** such as `Injury.Severity`, `Weather.Condition`, and `Aircraft.Damage`
  - **Date fields** like `Event.Date`, which will be parsed into datetime format

### Missing Data
- A significant number of features contain **missing or null values**, particularly in:
  - Latitude and longitude
  - Airport name and code
  - Aircraft category

These issues will be addressed during the **Data Cleaning** phase.
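
Since `.isna().any()` only reports whether a column has any gaps, a percentage view helps rank the columns by how incomplete they are. The sketch below uses a small made-up frame in place of `aviation_data`; the variable names `toy` and `missing_pct` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for aviation_data (values are made up).
toy = pd.DataFrame({
    "Event.Date": ["2001-01-01", "2002-02-02", "2003-03-03", "2004-04-04"],
    "Latitude": [np.nan, np.nan, 34.1, np.nan],
    "Airport.Code": [None, "JFK", None, "LAX"],
})

# Share of missing values per column, as a percentage, worst columns first.
missing_pct = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)
```

The same expression applied to the real dataset would show which of the columns listed above are sparse enough to consider dropping.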

---

This initial preview establishes a foundational understanding of the dataset. Further steps will involve cleaning, transforming, and preparing the data for analysis.


## PART TWO: Supplementary Dataset(s) Understanding

To enrich the core dataset and support deeper, more actionable insights, I researched and found supplementary data from the Federal Aviation Administration (FAA), along with a U.S. State Codes dataset, to strengthen my analysis. Each dataset serves a specific analytical purpose and will be preprocessed accordingly.

---

### FAA Aircraft Registration Data

- **Files:** `MASTER.txt`, `ENGINE.txt`, `ACFTREF.txt`, `DEALER.txt`, `DEREG.txt`, `DOCINDEX.txt`
- **Purpose:** Provides detailed metadata about aircraft including model specifications, engine details, and ownership history.

---

### FAA Regulations and Incident Data

- **Files:** Cleaned regulation dataset (CSV)
- **Purpose:** Captures regulatory environment and safety measures in place during various incidents.

---

### U.S. State Codes Dataset

- **File:** `US_States_Codes.csv`
- **Purpose:** Translates state abbreviations to full names and standard codes.

---

These supplementary datasets will be cleaned, normalized, and merged with the main accident data using common identifiers such as `Registration.Number`, `Model`, and `Event.Date`. This integration will unlock multi-dimensional insights and strengthen the final recommendations.


##  FAA Aircraft Registration Data Overview

To enrich the aviation accident dataset and gain deeper insight into aircraft-specific characteristics, we incorporate supplementary data provided by the FAA. These files contain detailed registration, technical, and deregistration records for civil aircraft in the United States. Below is a description of each dataset and its intended use in the project:

### 1. `MASTER.txt`
- **Description**: This file includes comprehensive records of all currently registered aircraft, with details such as registration numbers, manufacturer info, year of manufacture, type of registrant (e.g., individual, corporation), aircraft type, engine type, and airworthiness certification dates.
- **Usage**: We will use `MASTER.txt` to extract key aircraft metadata and merge it with the main accident dataset using the `N-NUMBER` (which corresponds to `Registration.Number`). This will allow analysis of accident trends based on aircraft age, type, ownership category, and certification status.

### 2. `ACFTREF.txt`
- **Description**: A reference file mapping manufacturer and model codes to their descriptive names, including weight class and engine type.
- **Usage**: This will be used to decode the `MFR MDL CODE` in the `MASTER.txt` file, enabling us to identify specific aircraft makes and models in a readable format. This is essential for evaluating accident patterns associated with certain aircraft types.

### 3. `ENGINE.txt`
- **Description**: Contains technical specifications of various aircraft engines, linked by engine model codes.
- **Usage**: We can link this to the engine code field in the `MASTER.txt` file (`ENG MFR MDL`) to analyze whether engine type or engine-specific characteristics correlate with accident severity or frequency.

### 4. `DEREG.txt`
- **Description**: Records of deregistered aircraft, including reasons and dates of deregistration.
- **Usage**: This file may help in identifying aircraft that were involved in an accident and subsequently deregistered. We can use this to validate the aircraft's operational status post-accident and examine patterns in deregistration reasons.

---

By leveraging these datasets, we can build a richer, aircraft-level profile for each accident, supporting more robust analysis and stronger business recommendations.
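
As a rough illustration of the decoding step, the sketch below joins a toy registration table to a toy reference table on the model code. The codes and rows are invented for the example, and it assumes both code columns are read as strings (for instance with `dtype=str`) so that leading zeros are not lost.

```python
import pandas as pd

# Toy stand-ins for MASTER.txt and ACFTREF.txt; the codes are made up.
master_toy = pd.DataFrame({
    "N-NUMBER": ["100", "200", "300"],
    "MFR MDL CODE": ["A000001", "B000002", "Z999999"],
})
acftref_toy = pd.DataFrame({
    "CODE": ["A000001", "B000002"],
    "MFR": ["CESSNA", "PIPER"],
    "MODEL": ["172", "PA-28-140"],
})

# Many registrations can share one model code, but each code should appear once
# in the reference table; `validate` raises an error if that assumption is broken.
decoded = master_toy.merge(
    acftref_toy,
    left_on="MFR MDL CODE",
    right_on="CODE",
    how="left",
    validate="many_to_one",
)

# Registrations whose code has no reference entry end up with a missing MFR.
unmatched = int(decoded["MFR"].isna().sum())
print(decoded[["N-NUMBER", "MFR", "MODEL"]])
print("Unmatched codes:", unmatched)
```

The same pattern would apply to `ENGINE.txt`, joining `ENG MFR MDL` to its `CODE` column.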


 ------

## FAA Aircraft Registration Data (MASTER.txt)

The MASTER.txt file provides comprehensive registration information for aircraft in the United States. It includes ownership details, aircraft identifiers, location of registrants, certification statuses, and model references that can be linked to technical aircraft data from ACFTREF.txt.



In [None]:
#loading the master file
master = pd.read_csv("supplimentary-data/ReleasableAircraft/MASTER.txt", delimiter=',', low_memory=False)

In [None]:
master.shape

In [None]:
master.head()


In [None]:
master.tail()

In [None]:
master.columns

In [None]:
master.info()

In [None]:
master.describe().T

In [None]:
master.describe(include='O').T

In [None]:
master.isna().any()


####  Findings:

- The data loaded successfully using `pd.read_csv()` with `delimiter=','`.
- `.head()` and `.tail()` checks confirm consistent formatting and no structural corruption across rows.
- All **35 columns** were correctly recognized and parsed.

####  Key Columns:

- **`N-NUMBER`**: FAA registration number; serves as a unique aircraft ID.
- **`MFR MDL CODE`**: Manufacturer/model code — links to `ACFTREF.txt` for aircraft technical data.
- **`ENG MFR MDL`**: Engine model/manufacturer — links to `ENGINE.txt` for engine specifications.
- **`YEAR MFR`**: Aircraft manufacturing year — useful for age profiling.
- **`TYPE REGISTRANT`, `NAME`, `STREET`, `CITY`, `STATE`, `ZIP CODE`**: Registrant details for identifying aircraft ownership and geographic distribution.
- **`CERTIFICATION`, `TYPE AIRCRAFT`, `TYPE ENGINE`, `STATUS CODE`**: Technical and regulatory attributes.
- **`AIR WORTH DATE`, `EXPIRATION DATE`**: Aircraft certification and registration validity.

#### Data Quality:

- No missing values were observed in the sample preview.
- One column, **`Unnamed: 34`**, appears to be empty and will be dropped during cleaning.
- Some column names (e.g., `' KIT MODEL'`) include leading/trailing whitespace and will be standardized.
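
The two cleanup steps noted above could look roughly like this. It is a sketch on a made-up three-column frame, not the final cleaning code; dropping every fully empty column is an assumption that happens to cover `Unnamed: 34`.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the quirks seen in MASTER.txt (values are made up).
master_toy = pd.DataFrame({
    "N-NUMBER": ["100", "200"],
    " KIT MODEL": ["", "RV-7"],
    "Unnamed: 34": [np.nan, np.nan],
})

# Drop columns that contain no values at all, then trim whitespace from the names.
cleaned = master_toy.dropna(axis="columns", how="all")
cleaned = cleaned.rename(columns=str.strip)
print(list(cleaned.columns))
```

Because the drop is based on content rather than a hard-coded name, the same two lines would also handle the empty trailing columns in the other FAA files.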

---

####  Planned Usage:

This dataset will enhance the **main aviation accident dataset** by providing:

- Aircraft ownership and certification context.
- Insight into how factors like **aircraft age**, **registrant type**, or **certification status** relate to accident **frequency** or **severity**.
- Support for constructing **risk profiles** for different aircraft types based on their historical and regulatory data.



## FAA Aircraft Reference Data (ACFTREF.txt)

The ACFTREF.txt file contains structured reference data for aircraft, detailing the manufacturer, model, engine type, aircraft category, number of engines and seats, weight class, and certification information. This dataset is clean and consistent, with well-defined column names and no missing values, making it readily usable for merging and analysis.

In [None]:
#loading the acftref file
acftref = pd.read_csv("supplimentary-data/ReleasableAircraft/ACFTREF.txt", delimiter=',', low_memory=False)

In [None]:
acftref.shape

In [None]:
acftref.head()

In [None]:
acftref.tail()

In [None]:
acftref.columns

In [None]:
acftref.info()

In [None]:
acftref.describe().T

In [None]:
acftref.describe(include='O').T

In [None]:
acftref.isna().any()

 ---

#### Findings:

- Successfully loaded using `pd.read_csv()` with `delimiter=','`.
- Data is clean with clearly labeled columns and no immediate signs of missing or malformed values.
- Column names are structured and self-descriptive, requiring minimal preprocessing.

####  Key Columns:

- **`CODE`**: Unique identifier for each aircraft model — can be linked to `MASTER.txt` via `MFR MDL CODE`.
- **`MFR`, `MODEL`**: Aircraft manufacturer and model — provides context for identifying specific aircraft configurations.
- **`TYPE-ACFT`, `TYPE-ENG`**: Encoded aircraft and engine types — useful for categorizing incidents by type.
- **`AC-CAT`**: Aircraft category (e.g., airplane, rotorcraft) — helpful for grouping and comparative analysis.
- **`NO-ENG`, `NO-SEATS`**: Details on aircraft engine count and seating capacity — key for estimating potential occupancy and accident impact.
- **`AC-WEIGHT`**: Aircraft weight classification (e.g., Class 1, Class 3) — used in understanding accident risk per weight class.

---

####  Planned Usage:

This dataset will serve as a **technical reference** for enriching the main aviation accident dataset. By linking through keys like `MFR MDL CODE`, it enables:

- Assessment of **aircraft-specific risk factors**, such as engine type or seating capacity.
- Enhanced ability to generate **data-backed safety insights** and recommendations based on aircraft configuration.



## FAA Engine Reference Data (ENGINE.txt)

The ENGINE.txt file contains reference data about aircraft engines registered with the FAA. It supplements the main dataset by providing technical specifications related to engine make, model, and performance attributes.

In [None]:
engine = pd.read_csv("supplimentary-data/ReleasableAircraft/ENGINE.txt", delimiter=',', low_memory=False)

In [None]:
engine.shape

In [None]:
engine.head()

In [None]:
engine.columns

In [None]:
engine.tail()

In [None]:
engine.describe().T

In [None]:
engine.describe(include='O').T

In [None]:
engine.isna().any()

###  Engine Reference File (`ENGINE.txt`)

---

####  Findings:

- File successfully read using `pd.read_csv()` with `delimiter=','`.
- Columns are clean, consistently formatted, and intuitive.
- No missing values were identified in initial inspection.
- One extraneous column (`Unnamed: 6`) appears to be empty and will be dropped during preprocessing.

---

#### Columns Overview:

| Column Name   | Description |
|---------------|-------------|
| **`CODE`**         | Unique identifier for each engine model. Links to `ENG MFR MDL` in `MASTER.txt`. |
| **`MFR`**          | Engine manufacturer (e.g., Lycoming, Pratt & Whitney). |
| **`MODEL`**        | Engine model name/designation. |
| **`TYPE`**         | Numerical or coded value representing engine type. May require external decoding for interpretation. |
| **`HORSEPOWER`**   | Power output of the engine in horsepower. Useful for performance analysis. |
| **`THRUST`**       | Thrust power (likely in pounds-force) — relevant for jet and turbine engines. |
| **`Unnamed: 6`**   | Empty column (likely due to trailing delimiter in raw file); to be dropped. |

---

#### Usage Strategy:

This dataset will enhance the analysis by:

- **Profiling engine performance** (e.g., power-to-weight ratios, aircraft capability).
- Investigating **correlations between engine specs and accident frequency or severity**.
- Identifying **failure trends** across manufacturers and models for better safety recommendations.
- Supporting the development of **engine-specific risk metrics** for use in fleet management or policy planning.

After minor cleaning (dropping the empty column), this file is **analysis-ready**.


## FAA Deregistered Aircraft Data (`DEREG.txt`)

In [None]:
def handle_bad_line(bad_line):
    print("Bad line encountered:", bad_line)
    return None  # skip the bad line

dereg = pd.read_csv(
    "supplimentary-data/ReleasableAircraft/DEREG.txt",
    delimiter=',',
    engine='python',
    on_bad_lines=handle_bad_line
)




In [None]:
dereg.shape

In [None]:
dereg.head()

In [None]:
dereg.tail()

In [None]:
dereg.info()

In [None]:
dereg.describe(include='O').T

In [None]:
dereg.describe().T

In [None]:
dereg.columns

In [None]:
dereg.isna().any()

## Findings: FAA Deregistered Aircraft Data (`DEREG.txt`)

The `DEREG.txt` file presented initial loading challenges due to a malformed line in the dataset. To address this, a custom function was implemented to skip the corrupted row during file read-in using the Python engine. This allowed the dataset to load successfully without compromising the integrity of the rest of the data.
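
To make the bad-line mechanism concrete, here is a small self-contained sketch on an in-memory CSV. It assumes a pandas version that accepts a callable for `on_bad_lines` (1.4 or later, Python engine only). Unlike the loader above, the hypothetical `collect_bad_line` helper stores the skipped rows in a list so they can be counted and reviewed later.

```python
import io
import pandas as pd

# In-memory CSV with one malformed row (four fields under a three-column header).
raw = "A,B,C\n1,2,3\n4,5,6,7\n8,9,10\n"

skipped = []

def collect_bad_line(bad_line):
    """Record the malformed row (a list of strings) and tell pandas to skip it."""
    skipped.append(bad_line)
    return None

toy = pd.read_csv(io.StringIO(raw), engine="python", on_bad_lines=collect_bad_line)
print(len(toy), "rows kept;", len(skipped), "skipped")
```

Keeping the skipped rows makes it possible to confirm that only a single line was dropped, as reported in the findings.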

---

###  File Status

- **File successfully loaded** after handling a single bad line.
- **Delimiter:** `,`
- **Engine used:** `python` (to support custom bad-line handling)
- **Data Quality:** Relatively clean; most fields are well-structured and populated.
- **Next Steps:** Full inspection and cleaning will be performed during the data wrangling phase.

---

###  Columns Overview

The file contains detailed deregistration and historical aircraft information, including:

- `N-NUMBER`, `SERIAL-NUMBER`, `MFR-MDL-CODE`: Unique identifiers for aircraft tracking.
- `ENG-MFR-MDL`, `YEAR-MFR`, `CERTIFICATION`: Technical and regulatory aircraft details.
- `NAME`, `MAILING & PHYSICAL ADDRESSES`, `COUNTRY`, `STATE`: Owner/registrant contact information.
- `STATUS-CODE`, `CANCEL-DATE`, `AIR-WORTH-DATE`: Registration and airworthiness history.
- `MODE S CODE` & `HEX`: Avionics transponder identifiers.
- `OTHER-NAMES`: Additional ownership or alias records.
- `Unnamed: 38`: Appears to be empty and will likely be dropped during cleaning.

---

###  Usage in Analysis

This dataset will supplement the main aviation accident dataset by:

- Providing insights into **aircraft deregistration patterns**, potentially flagging risks for previously deregistered or non-airworthy aircraft.
- Supporting analysis of how **registration timelines and cancellation dates** correlate with accident occurrence.
- Enabling enhanced **ownership and certification history tracking**, useful for investigating compliance or systemic issues.


## FAA REGULATION DATA
This dataset contains information about changes to Federal Aviation Administration (FAA) regulations. It is structured and consistent, with minimal preprocessing required.

In [None]:
regulation = pd.read_csv('supplimentary-data/Regulation-data/all_current_ACs_as_of_2025-06-24.csv')

In [None]:
regulation.shape

In [None]:
regulation.head()

In [None]:
regulation.tail()

In [None]:
regulation.info()

In [None]:
regulation.describe().T

In [None]:
regulation.describe(include='O').T

In [None]:
regulation.columns

In [None]:
regulation.isna().any()

## Findings
Based on `.head()` and `.tail()`, the data is consistently structured throughout.

| Column Name     | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| `CHANGENUMBER`   | Unique identifier for the regulation change. Missing in some rows.          |
| `DATE`           | Date the regulation change was recorded or issued.                          |
| `DOCUMENTNUMBER` | Official document number for the regulatory record.                         |
| `OFFICE`         | FAA office responsible for the change or publication.                       |
| `TITLE`          | Title or summary of the regulatory change or document.                      |

---

### Data Quality

- The file is **relatively clean and analysis-ready**.
- The only notable missing values are in the `CHANGENUMBER` column.
- **No corrupted or malformed lines** observed.
- Column naming is already **consistent and descriptive**.

---

### Usage in Analysis

This dataset will be used to:

- **Overlay regulatory changes over time** with accident trends, helping identify correlations between new rules and safety outcomes.
- **Associate specific regulation documents** with incident dates or aircraft models where applicable.
- **Enrich the narrative** around FAA oversight, identifying whether accidents occurred before or after relevant safety regulations were enacted.
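
One possible way to line up regulatory records with accident counts is to aggregate both by year. The sketch below uses made-up dates and is only meant to show the shape of the resulting table; matching on calendar year is an assumption, and a finer-grained alignment may be more appropriate later.

```python
import pandas as pd

# Toy stand-ins for the accident and regulation tables (dates are made up).
accidents_toy = pd.DataFrame(
    {"Event.Date": ["1999-05-01", "1999-08-12", "2000-01-20", "2001-07-04"]}
)
regulation_toy = pd.DataFrame(
    {"DATE": ["1999-03-15", "2001-02-01", "2001-11-30", "not a date"]}
)

# Parse dates, discard values that cannot be parsed, and keep only the year.
accident_years = pd.to_datetime(accidents_toy["Event.Date"], errors="coerce").dropna().dt.year
regulation_years = pd.to_datetime(regulation_toy["DATE"], errors="coerce").dropna().dt.year

# One row per year with both counts side by side; years absent from one source get 0.
yearly = (
    pd.concat(
        [
            accident_years.value_counts().rename("accidents"),
            regulation_years.value_counts().rename("regulation_changes"),
        ],
        axis=1,
    )
    .fillna(0)
    .astype(int)
    .sort_index()
)
print(yearly)
```

A table in this shape can be plotted directly as two series over time, though any apparent relationship would be a correlation rather than evidence of cause.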


## PHASE TWO: DATA CLEANING & WRANGLING 

In this section, I begin the data cleaning and wrangling phase of my analysis. After getting an overview of all datasets, I will now dig deeper into the structure and contents of the aviation accident data — the main dataset powering my analysis.

My goal here is to:
- Understand the **meaning and relevance** of each column
- Decide which features are **critical for analysis**, and which can be **dropped or transformed**
- Handle **missing values** using clear logic
- Create a **clean, well-structured dataset** ready for Exploratory Data Analysis (EDA) and risk modeling

---

###  Why Focus on U.S. Data?

The aviation accident dataset spans both domestic and international incidents from **1962 to 2023**. After analyzing the `Country` column, I found:

- Total records in dataset: **88,889**
- Records with `Country == "United States"`: **82,248**
- Proportion of U.S. data: 92.5%
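
The proportion above follows directly from the two counts. The short sketch below reproduces the arithmetic and shows how the same share could be computed from a `Country` column; the four-row frame is made up for illustration.

```python
import pandas as pd

# Counts reported above.
total_records = 88_889
us_records = 82_248

us_share = us_records / total_records
print(f"U.S. share: {us_share:.1%}")

# The same idea on a column: the mean of a boolean mask is the share of True values.
toy = pd.DataFrame({"Country": ["United States", "United States", "Canada", None]})
toy_share = (toy["Country"] == "United States").mean()
print(f"Toy share: {toy_share:.1%}")
```

Note that rows with a missing country count as non-U.S. in this comparison.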


Given that **over 92% of the data is U.S.-based**, it is statistically sound to anchor my cleaning and initial analysis on this subset. This choice ensures:
- High-quality and consistent data (due to FAA reporting standards)
- Easier cross-referencing with other FAA and registration datasets
- A more stable foundation for accurate risk modeling and business recommendations

---

###  What About the Non-U.S. (Diaspora) Data?

While my focus will be on U.S.-based data for the purposes of cleaning, modeling, and initial business recommendations, I will **not discard the international data**.

Instead, I will:
- Preserve a cleaned version of non-U.S. (diaspora) data separately
- Consider adding it in later as a **secondary insight layer**
- Allow for potential **interactive filtering in dashboards** (e.g., U.S. vs. Global view)

This approach ensures that my analysis is both **deep (U.S. focus)** and **scalable (global relevance)**.

---

### Next Steps in Cleaning

I will now:
1. Filter and work with U.S. records only (`Country == "United States"`)
2. Examine each column in detail
3. Handle missing values logically
4. Clean inconsistencies (e.g., in aircraft model names, date formats, injury reports)
5. Save a clean version of the dataset for further analysis

Once complete, this cleaned dataset will form the foundation for:
- Exploratory Data Analysis
- Aircraft risk profiling
- Visualizations and business intelligence recommendations



In [None]:
#make a copy of the aviation dataset, since it is good practice not to work on the original data
aviation_data_copy = aviation_data.copy()


## Column Name Meanings

| Column Name              | Meaning                                                                       |
| ------------------------ | ----------------------------------------------------------------------------- |
| `Investigation.Type`     | Whether the event was an "Accident" or "Incident". Accidents are more severe. |
| `Accident.Number`        | Unique ID for each event. Serves as the primary key.                          |
| `Event.Date`             | Date the accident or incident occurred.                                       |
| `Location`               | General description of where the event happened (e.g., city, area).           |
| `Country`                | Country where the event occurred.                                             |
| `Latitude`               | Geographic coordinate (north-south) of the event.                             |
| `Longitude`              | Geographic coordinate (east-west) of the event.                               |
| `Airport.Code`           | FAA/IATA code of the airport involved (if any).                               |
| `Airport.Name`           | Full name of the airport involved (if any).                                   |
| `Injury.Severity`        | Summary of the severity of injuries (e.g., Fatal, Serious, Minor).            |
| `Aircraft.damage`        | Description of damage sustained by the aircraft.                              |
| `Aircraft.Category`      | General category of aircraft (e.g., airplane, rotorcraft).                    |
| `Registration.Number`    | Aircraft registration number (like a license plate).                          |
| `Make`                   | Manufacturer of the aircraft (e.g., Boeing, Cessna).                          |
| `Model`                  | Specific model of the aircraft.                                               |
| `Amateur.Built`          | Indicates if the aircraft was amateur-built ("Yes" or "No").                  |
| `Number.of.Engines`      | Number of engines the aircraft had.                                           |
| `Engine.Type`            | Description of the aircraft’s engine type.                                    |
| `FAR.Description`        | FAA regulatory category under which the aircraft was operating.               |
| `Schedule`               | Indicates if the flight was scheduled or unscheduled.                         |
| `Purpose.of.flight`      | Reason or purpose for the flight (e.g., personal, training).                  |
| `Air.carrier`            | Name of the air carrier, if applicable (commercial flights).                  |
| `Total.Fatal.Injuries`   | Total number of people who died in the event.                                 |
| `Total.Serious.Injuries` | Total number of people with serious injuries.                                 |
| `Total.Minor.Injuries`   | Total number of people with minor injuries.                                   |
| `Total.Uninjured`        | Total number of people who were not injured.                                  |
| `Weather.Condition`      | Weather during the event (e.g., VMC, IMC, UNK).                               |
| `Broad.phase.of.flight`  | Phase of flight during which the event occurred (e.g., landing, taxi).        |
| `Report.Status`          | Indicates if the report is preliminary or final.                              |
| `Publication.Date`       | Date the report was published.                                                |

----
These descriptions explain what each column contains and expand my domain knowledge of the dataset.

In [None]:
#cleaning and renaming the columns 
aviation_data_copy.rename(columns={
    'Investigation.Type': 'Investigation_Type',
    'Accident.Number': 'Accident_Number',
    'Event.Date': 'Event_Date',
    'Airport.Code': 'Airport_Code',
    'Airport.Name': 'Airport_Name',
    'Injury.Severity': 'Injury_Severity',
    'Aircraft.damage': 'Aircraft_Damage',
    'Aircraft.Category': 'Aircraft_Category',
    'Registration.Number': 'Registration_Number',
    'Make': 'Aircraft_Make',
    'Model': 'Aircraft_Model',
    'Amateur.Built': 'Amateur_Built',
    'Number.of.Engines': 'Number_of_Engines',
    'Engine.Type': 'Engine_Type',
    'FAR.Description': 'FAR_Description',
    'Schedule': 'Schedule_Type',
    'Purpose.of.flight': 'Purpose_of_Flight',
    'Air.carrier': 'Air_Carrier',
    'Total.Fatal.Injuries': 'Fatal_Injuries',
    'Total.Serious.Injuries': 'Serious_Injuries',
    'Total.Minor.Injuries': 'Minor_Injuries',
    'Total.Uninjured': 'Uninjured',
    'Weather.Condition': 'Weather_Condition',
    'Broad.phase.of.flight': 'Phase_of_Flight',
    'Report.Status': 'Report_Status',
    'Publication.Date': 'Publication_Date'
}, inplace=True)

aviation_data_copy.columns

In [None]:
def clean_column_names(df):
    df.columns = (
        df.columns
        .str.strip()
        .str.lower()
        .str.replace('.', '_', regex=False)
        .str.replace(' ', '_', regex=False)
    )
    return df

aviation_data_copy = clean_column_names(aviation_data_copy)

aviation_data_copy.columns


In [None]:
#make a copy of the US-subset
us_data = aviation_data_copy[aviation_data_copy['country'] == 'United States'].copy()
#make a copy for the diaspora data
diaspora_data = aviation_data_copy[aviation_data_copy['country'] != 'United States'].copy()

##  General Rules for Dropping Data

Dropping data, whether rows or columns, should be done cautiously, guided by domain knowledge and data quality goals. Below are standard, defensible rules that I will use in my analysis.

---

###  Dropping Columns

I will drop a column if:

- It has a **high percentage of missing values** (typically > 50–70%) and is not critical for analysis.
- It contains **only a single unique value** (i.e., zero variance — no information gain).
- It is a **duplicate of another column** (redundancy).
- The data is **irrelevant to the current analysis objectives** (e.g., IDs or metadata not used for joins or context).
- It is **impossible to interpret or decode** (e.g., poorly documented, encoded variables with no lookup).

---

###  Dropping Rows

I will drop a row if:

- **Critical columns are missing**, especially where imputation is not appropriate (e.g., timestamps, unique identifiers, target variable).
- It contains **clearly erroneous or corrupted data** (e.g., wrong data types, impossible values like negative injuries or invalid dates).
- It is a **complete duplicate** of another row.
- It **violates integrity constraints**, such as conflicting values across dependent fields.
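
The row rules above can be expressed as a single boolean mask. This is a sketch on made-up rows; which columns count as critical (here `accident_number` and `event_date`) is an assumption for the example, and the integrity-constraint rule is left out because it depends on field-specific logic.

```python
import pandas as pd

# Toy rows, each illustrating one of the rules above (values are made up).
toy = pd.DataFrame({
    "accident_number": ["A1", "A1", "A2", None, "A4"],
    "event_date": ["2001-01-01", "2001-01-01", "2001-13-45", "2002-02-02", "2003-03-03"],
    "fatal_injuries": [0, 0, 1, 2, -1],
})

# Invalid dates become NaT instead of raising an error.
parsed_dates = pd.to_datetime(toy["event_date"], errors="coerce")

drop_mask = (
    toy.duplicated()                   # complete duplicate of an earlier row
    | toy["accident_number"].isna()    # missing unique identifier
    | parsed_dates.isna()              # invalid or missing date
    | (toy["fatal_injuries"] < 0)      # impossible value
)

kept = toy[~drop_mask]
print(f"Dropping {int(drop_mask.sum())} of {len(toy)} rows")
```

Building the mask before filtering makes it easy to report how many rows each rule removes, which supports the documentation step described under Cautions.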

---

### Cautions

- Consider **imputation or transformation** before dropping — dropping should be the **last resort** if data is unrecoverable.
- **Document your rationale** for each drop, especially in sensitive or audit-heavy domains like aviation or healthcare.
- Consider the **impact on representativeness**: Dropping too many rows can introduce bias or reduce statistical power.

---

### Best Practice

I will use `.info()`, `.isnull().sum()`, and `.nunique()` early in EDA to assess the quality of each column and back decisions with simple visuals (e.g., **missingness heatmaps** or **histograms**).
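
A compact way to combine these checks is a per-column summary table that flags drop candidates. The sketch below uses made-up data and a 50% threshold taken from the lower end of the range mentioned earlier; the flag marks a column for review only and is not an automatic decision.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one sparse column, one constant column.
toy = pd.DataFrame({
    "event_date": ["2001-01-01", "2002-02-02", "2003-03-03", "2004-04-04"],
    "latitude": [np.nan, np.nan, np.nan, 34.1],
    "country": ["United States"] * 4,
})

MISSING_THRESHOLD = 50  # percent; an assumed cut-off for this example

summary = pd.DataFrame({
    "missing_pct": toy.isna().mean() * 100,
    "n_unique": toy.nunique(),
})

# Flag columns that are mostly empty or carry a single value.
summary["drop_candidate"] = (
    (summary["missing_pct"] > MISSING_THRESHOLD) | (summary["n_unique"] <= 1)
)
print(summary)
```

A flagged column may still be kept when it is needed for a join or for context, which is why the final call stays manual.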


In [None]:

us_data.info()

In [None]:
us_data.isna().any()

## Findings
The only columns with no missing data are `event_id`, `investigation_type`, `accident_number`, and `event_date`.
I will go through each column one by one, keep the relevant data, and drop the rest. I can also use domain knowledge to fill some columns. Either way, I want to keep bias to a minimum given the data I have.

## COLUMN BY COLUMN INVESTIGATION AND VERDICT
----


### COLUMN ONE: LOCATION

While exploring the dataset, I noticed a pattern in the `accident_number` column that might help me fill in missing location data. This is what I found:

### The NTSB `accident_number` pattern

According to the **NTSB Aviation Data Dictionary**, the structure of an `accident_number` follows a specific pattern:

###  Format Breakdown:

- **First 3 characters**: NTSB **office code**  
  *Example*: `MIA` = Miami Regional Office

- **Next 2 digits**: **Fiscal year** of the investigation  
  *Example*: `85` = Fiscal Year 1985

- **Next 2 letters**: **Investigation category and mode**  
  *Indicates whether the investigation involved airline, marine, etc.*

- **Next 3 digits**: A **sequential number** showing the order the case was opened in that fiscal year

- **Optional final letter**: Indicates **multiple aircraft** involved in the same event
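
The format breakdown above can be expressed as a regular expression. This is a minimal sketch under the assumption that every component is strictly fixed-width as listed; the function name `parse_accident_number` and the sample identifiers are hypothetical, and identifiers that deviate from the pattern simply return `None`.

```python
import re

# Strict pattern built from the format breakdown above (assumed fixed-width parts)
ACCIDENT_NUMBER_RE = re.compile(
    r"(?P<office>[A-Z]{3})"           # NTSB office code
    r"(?P<fiscal_year>\d{2})"         # two-digit fiscal year
    r"(?P<category_mode>[A-Z]{2})"    # investigation category and mode
    r"(?P<sequence>\d{3})"            # sequential case number
    r"(?P<aircraft_suffix>[A-Z])?"    # optional letter for multi-aircraft events
)

def parse_accident_number(value: str):
    """Return the parts of an accident number, or None if it does not fit the pattern."""
    match = ACCIDENT_NUMBER_RE.fullmatch(value.strip().upper())
    return match.groupdict() if match else None

# Hypothetical identifiers, used only to exercise the pattern
print(parse_accident_number("MIA85LA123"))
print(parse_accident_number("MIA85LA123A"))
print(parse_accident_number("MIA85"))
```

Counting how many identifiers return `None` is a quick way to see how often the real data departs from the documented format.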

---

### Example: `MIA85LAMS1`

This breaks down as:

- `MIA` → Miami NTSB Office  
- `85` → Fiscal year 1985  
- `LA` → Investigation category and mode  
- `MS1` → Fills the sequence position, but is not a plain three-digit number, so this record departs from the standard pattern above

## Final Verdict on Missing `location` Values (U.S. Data)

While analyzing the `accident_number` syntax, I came up with the following insights:

- The prefix (e.g., `MIA`, `FTW`, `LAX`) typically refers to the **NTSB regional office** that conducted the investigation — **not necessarily the accident location**.
- In some cases, the prefix aligns with the actual location.
- However, in other instances, the office may be **geographically distant** from where the accident occurred, making it **unreliable as a proxy** for true location.

---

### Conclusion

Although the `accident_number` can offer **hints**, it **cannot be consistently used** to infer accurate location data. I will therefore fill the missing `location` values with `"Unknown"` using `fillna`.

---

In [None]:
# Fill missing locations with a placeholder, then re-check the null counts per column
us_data['location'] = us_data['location'].fillna("Unknown")
us_data.isna().sum()


At this stage, I will split this column into two new columns before proceeding with the cleaning.

The `location` column combines city and state information in a single string (e.g., `"COCOA, FL"`). To support more granular geographic analysis, I will split this column into two distinct fields:

- **`city`** – the name of the city, town, or municipality where the event occurred  
- **`state`** – the two-letter U.S. state abbreviation (e.g., `FL`, `CA`)

-----

The split follows these rules:

- **Missing or malformed entries**: If the `location` field is missing, both `city` and `state` are assigned `'Unknown'`. If it contains no comma, the whole string is kept as `city` and `state` is assigned `'Unknown'`.
- **Whitespace handling**: Leading and trailing whitespace is stripped from both city and state values for consistency.
- **State validation**: I will use the provided U.S. state code reference (`USState_Codes.csv`) to map abbreviations to full state names.
- **New field – `state_full`**: This additional column holds the full state name, which improves interpretability and supports further analysis (e.g., aggregating by full state name).

By structuring the `location` data this way, we enable more precise regional breakdowns, simplify future joins with FAA and weather datasets, and enhance the overall analytical quality of the dataset.
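
The state-name lookup described above is a many-to-one left join, and it can be sanity-checked before being trusted. The sketch below uses toy stand-ins for both tables (the column names `state` and `state_full` and all values are assumptions for illustration) and relies on the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the accident records and the state lookup (illustrative values)
accidents = pd.DataFrame({"state": ["FL", "CA", "Unknown", "FL"]})
lookup = pd.DataFrame({
    "state": ["FL", "CA"],
    "state_full": ["Florida", "California"],
})

merged = accidents.merge(
    lookup,
    on="state",
    how="left",               # keep every accident record
    validate="many_to_one",   # raises MergeError if the lookup repeats an abbreviation
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

# Abbreviations with no match in the lookup, with their frequencies
unmatched = merged.loc[merged["_merge"] == "left_only", "state"].value_counts()
print(unmatched)
```

If `unmatched` contains anything other than the `Unknown` placeholder, those values are worth inspecting before aggregating by state.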


In [None]:
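# Split "CITY, ST" on the first comma into two columns and trim whitespace
us_data[['city', 'state']] = (
    us_data['location']
    .fillna('Unknown, Unknown')          # missing location -> both parts 'Unknown'
    .str.split(',', n=1, expand=True)    # split once, so only the first comma counts
    .apply(lambda x: x.str.strip())
)
# Entries without a comma have no state part, so the resulting NaN becomes 'Unknown'
us_data['city'] = us_data['city'].fillna('Unknown')
us_data['state'] = us_data['state'].fillna('Unknown')
us_data['state'] = us_data['state'].str.upper()

us_data.head()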

In [None]:
#loading the USState_Codes.csv
state_codes = pd.read_csv('Data/Aviation-data/USState_Codes.csv')
state_codes.info()
state_codes.head()

In [None]:
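# Align the lookup table's column names with us_data before joining
state_codes.rename(columns={
    'Abbreviation':'state',
    'US_State':'state_full'
}, inplace=True)
# Left join keeps every accident record; unmatched abbreviations get NaN in state_full
us_data = us_data.merge(state_codes, on='state', how='left')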


In [None]:
# Normalize the text columns to trimmed upper case for consistent grouping
us_data['city'] = us_data['city'].str.strip().str.upper()
us_data['state_full'] = us_data['state_full'].str.strip().str.upper()
us_data['state'] = us_data['state'].str.strip().str.upper()
us_data.head()

##  Dropping Latitude and Longitude

The `latitude` and `longitude` columns represent the geographical coordinates of where each accident occurred. After evaluating their utility for this analysis, I decided to **drop both columns** based on the following rationale:

---

###  Reasons for Dropping:

- **Over 60% of values are missing**: specifically, `49,983` of the `82,248` records in `us_data` have a null `latitude`, making reliable imputation impractical.
- The dataset lacks sufficient **contextual data** (e.g., accident causes, airport coordinates) needed to generate meaningful geospatial insights.
- The **focus of this analysis** is on **temporal, categorical, and severity-based trends**, rather than spatial or geographical mapping.
- Performing distance calculations (e.g., accident site to nearest airport) would require **external datasets and geolocation logic** not currently available in this phase.
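
The missingness figure quoted above follows directly from the two counts. A minimal sketch of the arithmetic, plus the equivalent pandas check on a toy frame (the toy coordinates are made up for illustration):

```python
import pandas as pd

# Counts reported above for the latitude column
missing_latitude = 49_983
total_records = 82_248
reported_share = missing_latitude / total_records
print(f"{reported_share:.1%}")

# The same check on a DataFrame: isna().mean() gives the null fraction per column
toy = pd.DataFrame({
    "latitude": [None, 28.4, None],
    "longitude": [None, -80.7, None],
})
share = toy[["latitude", "longitude"]].isna().mean()
print(share)
```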

---

###  Final Verdict:

To preserve dataset **cleanliness** and **analytical focus**, both `latitude` and `longitude` columns have been **dropped**.


In [None]:
print(us_data.columns)


In [None]:
us_data.drop(columns=['latitude', 'longitude'], inplace=True)


## AIRPORT CODES AND NAMES

## Why I Considered Using Airport Data

At first, I thought about keeping `airport_code` and `airport_name` in the dataset. They could be useful **if** I were analyzing:

- Accidents by airport
- Geographic clustering
- Infrastructure-related risks at specific airports

But for that kind of analysis, I’d need **supporting data** like:

- Fuel logs or flight paths  
- Maintenance or repair history  
- Distances between the origin and crash site  
- Data on airport infrastructure or traffic density

---

## Why I’m Dropping It

After thinking it through, I decided to drop the airport data because:

- I’m **not analyzing airport-specific risks** in this project  
- I **don’t have the complementary data** needed to make airport-level insights meaningful  
- I already have **location data**, which offers better granularity — and I’ve taken the time to clean it  
- Plus, **airport names and codes are often inconsistent** or messy in large datasets — keeping them without a clear purpose would just add noise
- Lastly, both columns have a lot of missing data

---

###  Final Decision

I’m dropping `airport_code` and `airport_name` to keep the dataset lean, focused, and clean.


In [70]:
us_data.drop(columns=['airport_code', 'airport_name'], inplace=True)
