# TeamX Nguyen Data Exploration Report
This Jupyter Notebook analyzes the FAA ASIAS Preliminary Accident/Incident dataset.
It integrates data engineering, visualization, and advanced data exploration.

## 1. Introduction
The FAA ASIAS Preliminary Accident/Incident dataset provides early safety notifications of aviation events across the United States. This project aims to create a reproducible pipeline that cleans, models, and explores this data to support Power BI dashboards for aviation safety insights.

**Data Source:** [FAA ASIAS Preliminary Accident and Incident Reports](https://www.asias.faa.gov/apex/f?p=100:93:::NO)

**Objective:** Identify key trends related to manufacturer, weather, and flight phase that contribute to aviation incident risk.

## 2. Overview of Data Engineering Efforts
The dataset was processed in multiple stages as outlined in the data dictionary and README:

| Stage | Description |
|--------|-------------|
| Ingest | Loaded raw FAA CSV using pandas for cleaning and type casting |
| Stage | Created SQLite schema `asias_prelim` with fact and lookup tables |
| Process | Normalized text (manufacturer, model, weather), validated primary keys |
| Curate | Exported cleaned results to Parquet for analytics |
| Serve | Integrated with Power BI dashboards for trend reporting |

**Figure 1.** Data Flow Diagram (see TeamX_Nguyen_Data_Dictionary.pdf).
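The Process and Stage rows of the table can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`Event_Id`, `Manufacturer`, `Weather`), the table name `fact_event`, and the sample rows are all hypothetical, and the text columns are assumed to hold strings.

```python
import sqlite3

import pandas as pd


def normalize_text(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Trim, collapse internal whitespace, and upper-case the given text columns."""
    out = df.copy()
    for col in columns:
        out[col] = (
            out[col]
            .str.strip()
            .str.replace(r"\s+", " ", regex=True)
            .str.upper()
        )
    return out


def validate_primary_key(df: pd.DataFrame, key: str) -> None:
    """Raise ValueError if the key column has missing or duplicate values."""
    if df[key].isna().any():
        raise ValueError(f"{key} contains missing values")
    if df[key].duplicated().any():
        raise ValueError(f"{key} contains duplicate values")


# Made-up sample rows standing in for the raw FAA CSV (hypothetical columns).
raw = pd.DataFrame(
    {
        "Event_Id": [1, 2, 3],
        "Manufacturer": [" cessna ", "Piper", "CESSNA  aircraft"],
        "Weather": ["vmc", "IMC ", None],
    }
)

# Process: normalize text and check the key before loading.
clean = normalize_text(raw, ["Manufacturer", "Weather"])
validate_primary_key(clean, "Event_Id")

# Stage: write to an in-memory SQLite database (hypothetical table name).
conn = sqlite3.connect(":memory:")
try:
    clean.to_sql("fact_event", conn, index=False, if_exists="replace")
    staged = pd.read_sql_query("SELECT COUNT(*) AS n FROM fact_event", conn)
finally:
    conn.close()

print(clean["Manufacturer"].tolist())
print(int(staged["n"].iloc[0]))
```

The Curate step could follow with `clean.to_parquet(...)`, which needs the optional `pyarrow` or `fastparquet` package; it is left out here so the sketch runs with pandas alone.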

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [None]:
# Load the FAA ASIAS dataset (replace with your actual file path)
df = pd.read_csv('FAA_ASIAS_Preliminary.csv')
df.head()
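Before plotting, it can help to check column types and missing values right after loading. The helper below is one possible sketch; `summarize_columns` and the sample frame are hypothetical and would be replaced by a call on the loaded `df`.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing percentage for each column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(1),
        }
    )


# Made-up sample; in the notebook this would be summarize_columns(df).
sample = pd.DataFrame(
    {
        "Manufacturer": ["CESSNA", None, "PIPER", "BOEING"],
        "Fatalities": [0, 1, None, 0],
    }
)
summary = summarize_columns(sample)
print(summary)
```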

## 3. Data Visualization Improvements
Below are refined graphics from the ASIAS curated dataset.

In [None]:
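# Ten most frequent manufacturers, in descending order of event count
top10 = df['Manufacturer'].value_counts().index[:10]
plt.figure(figsize=(10,5))
# A single color avoids the seaborn 0.13+ warning about `palette` without `hue`
sns.countplot(data=df, x='Manufacturer', order=top10, color='steelblue')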
plt.xticks(rotation=45, ha='right')
plt.title('Top 10 Aircraft Manufacturers by Preliminary Incident Count')
plt.xlabel('Manufacturer')
plt.ylabel('Number of Incidents')
plt.tight_layout()
plt.show()

In [None]:
# Unparseable dates become NaT and drop out of the monthly grouping
df['Event_Date'] = pd.to_datetime(df['Event_Date'], errors='coerce')
df['Month'] = df['Event_Date'].dt.to_period('M')
trend = df.groupby('Month').size()
plt.figure(figsize=(10,5))
trend.plot(kind='line', marker='o')
plt.title('Monthly Trend of FAA Preliminary Accidents and Incidents')
plt.xlabel('Month')
plt.ylabel('Incident Count')
plt.tight_layout()
plt.show()
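Grouping by month leaves out any month with no events, so a line plot silently bridges those gaps. One way to make empty months explicit is to reindex over the full month range, as in this sketch. The function name `monthly_counts` and the sample dates are hypothetical, and the input is assumed to contain at least one parseable date.

```python
import pandas as pd


def monthly_counts(dates: pd.Series) -> pd.Series:
    """Count events per calendar month, including months with zero events."""
    parsed = pd.to_datetime(dates, errors="coerce").dropna()
    months = parsed.dt.to_period("M")
    counts = months.value_counts().sort_index()
    # Build every month between the first and last event, then fill gaps with 0.
    full_range = pd.period_range(months.min(), months.max(), freq="M")
    return counts.reindex(full_range, fill_value=0)


# Made-up dates; February has no events and one entry is unparseable.
sample_dates = pd.Series(["2024-01-05", "2024-01-20", "2024-03-02", "not a date"])
counts = monthly_counts(sample_dates)
print(counts)
```

In the notebook, `monthly_counts(df['Event_Date'])` could stand in for the `groupby('Month').size()` step if zero-event months matter for the trend.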

## 4. Data Exploration Techniques
This section applies three exploration methods to uncover deeper insights.

### Technique 1: Correlation Analysis
**Why chosen:** Measures the strength of linear relationships between numeric fields, such as fatalities and severity.

**Relevance:** Indicates how strongly flight conditions are associated with outcome severity, without implying causation.

**Results:**

In [None]:
num_cols = df.select_dtypes('number')
corr = num_cols.corr()
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numeric ASIAS Variables')
plt.show()
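A heatmap can be hard to read when there are many columns, so listing the strongest pairs is a useful complement. The sketch below ranks column pairs by absolute correlation and uses Spearman rank correlation, which is an assumption on my part: it tends to be more robust than Pearson for skewed count data such as fatalities. The helper `top_correlations` and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, method: str = "spearman", n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    corr = df.select_dtypes("number").corr(method=method)
    # Keep the strict upper triangle so each pair appears once, without the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order].head(n)


# Made-up numeric columns; in the notebook this would be top_correlations(df).
sample = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],
        "c": [5, 3, 4, 1, 2],
    }
)
top = top_correlations(sample, n=3)
print(top)
```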

### Technique 2: Principal Component Analysis (PCA)
**Why chosen:** Reduces many correlated numeric variables to a few uncorrelated components.

**Relevance:** Helps identify which combinations of variables account for most of the variance in the numeric incident fields.

**Results:**

In [None]:
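# Keep only numeric columns with no missing values; PCA cannot handle NaN
features = num_cols.dropna(axis=1)
scaler = StandardScaler()
scaled = scaler.fit_transform(features)
# Project onto the first two principal components (needs at least two columns)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)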
plt.figure(figsize=(7,6))
plt.scatter(components[:,0], components[:,1], alpha=0.5)
plt.title('PCA Projection of FAA ASIAS Data')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% Variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% Variance)')
plt.tight_layout()
plt.show()
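The scatter plot shows where incidents fall in component space but not which variables drive each component. The loadings in `pca.components_` give that view. The sketch below is illustrative only: `pca_loadings` and the sample columns are hypothetical, and it fills missing values with the column median rather than dropping whole columns, which is a different choice from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_loadings(df: pd.DataFrame, n_components: int = 2):
    """Return (loadings table, explained variance ratios) for the numeric columns."""
    numeric = df.select_dtypes("number")
    # Assumption: median imputation, so columns with a few gaps are kept.
    numeric = numeric.fillna(numeric.median())
    scaled = StandardScaler().fit_transform(numeric)
    pca = PCA(n_components=n_components).fit(scaled)
    loadings = pd.DataFrame(
        pca.components_.T,
        index=numeric.columns,
        columns=[f"PC{i + 1}" for i in range(n_components)],
    )
    return loadings, pca.explained_variance_ratio_


# Made-up data: x and y move together, z is independent noise with one gap.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
sample = pd.DataFrame(
    {
        "x": x,
        "y": 2 * x + rng.normal(scale=0.1, size=200),
        "z": rng.normal(size=200),
    }
)
sample.loc[0, "z"] = np.nan

loadings, ratio = pca_loadings(sample)
print(loadings.round(2))
print(ratio.round(2))
```

Variables with large absolute loadings on a component contribute most to it; here `x` and `y` should dominate the first component.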

### Technique 3: K-Means Clustering
**Why chosen:** Groups similar incidents based on their characteristics.

**Relevance:** Helps discover natural groupings, such as clusters of minor vs. severe incidents.

**Results:**

In [None]:
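# Cluster on the full scaled feature set; the PCA axes are used only for plotting.
# n_init is set explicitly because its default changed across scikit-learn versions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(scaled)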
plt.figure(figsize=(7,6))
plt.scatter(components[:,0], components[:,1], c=kmeans_labels, cmap='viridis', alpha=0.7)
plt.title('K-Means Clustering of FAA Preliminary Incidents')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.tight_layout()
plt.show()
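The choice of three clusters is fixed in the cell above. One common way to check that choice is to compare silhouette scores across several values of k, as sketched here. The helper `silhouette_by_k` and the synthetic data are hypothetical; in the notebook, `X` would be the `scaled` array.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, random_state=42):
    """Fit K-Means for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    return scores


# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

scores = silhouette_by_k(X, range(2, 7))
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

A higher silhouette score means tighter, better-separated clusters. On real incident data the peak is usually less clear-cut than on this synthetic example, so the score is a guide rather than a decision rule.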

## 5. Conclusion
The analyses point to patterns linking manufacturer, flight phase, and injury severity.
Correlation analysis quantified the relationships between numeric fields, PCA condensed those fields
into two principal components, and K-means clustering separated incidents into three candidate safety
risk profiles. Together, these methods improve understanding of preliminary incident data and will
inform dashboard visualizations and predictive modeling.

## 6. Reproducibility
- Notebook dependencies: `pandas`, `matplotlib`, `seaborn`, `scikit-learn`.
- All cells are executable using the FAA curated dataset.
- Save the notebook as:
```
/Data_Exploration/TeamX_Nguyen_Data_Exploration.ipynb
```
- Commit to the GitHub `master` branch and upload the rendered HTML/PDF to Canvas.
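
To make the dependency list above easier to reproduce, the installed versions can be recorded at the top of the notebook. This is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["pandas", "matplotlib", "seaborn", "scikit-learn"])
for name, ver in versions.items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

The printed lines can be pasted into a `requirements.txt` so that others run the notebook against the same versions.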