# 🧠 01 – Feature Engineering: Enhancing the Crash Dataset

Welcome to the **Feature Engineering** stage of the project. Now that the raw airplane crash data is clean, the next step is to derive more meaningful variables: ones that expose patterns, help answer more complex questions, and can eventually feed predictive models.

Feature engineering is the art of transforming raw columns into informative, structured signals. It helps us move from messy data → valuable insights.

---

### ✨ Goals of This Notebook

- Create new features that will be useful for:
  - Exploratory Data Analysis (EDA)
  - Segmentation and comparison
  - Future machine learning or pattern detection
- Add semantic meaning to data using domain intuition

---

### 📦 What You'll Find Here

| Section | Purpose |
|--------|---------|
| Temporal Features | Extract `Year`, `Month`, `Weekday`, `Decade` from `date` |
| Military vs Civilian | Flag crashes with military involvement |
| Severity Indicators | Calculate `Fatality_Rate` and the binary `Is_Fatal` flag |
| Aircraft Grouping | Simplify aircraft types for easier comparison |
| Crash Geography | Tag locations as Water, Mountain, Island, or Land |

---

In [5]:
import pandas as pd
import numpy as np

df = pd.read_csv("cleaned_airplane_crashes.csv")
df.head()

Unnamed: 0,date,time,location,operator,route,ac_type,aboard,aboard_passangers,aboard_crew,fatalities,fatalities_passangers,fatalities_crew,ground,summary
0,1908-09-17,17:18:00,"Fort Myer, Virginia",Military - U.S. Army,Demonstration,Wright Flyer III,2.0,1.0,1.0,1.0,1.0,0.0,0.0,"During a demonstration flight, a U.S. Army fly..."
1,1912-07-12,06:30:00,"Atlantic City, New Jersey",Military - U.S. Navy,Test flight,Dirigible,5.0,0.0,5.0,5.0,0.0,5.0,0.0,First U.S. dirigible Akron exploded just offsh...
2,1913-08-06,Unknown,"Victoria, British Columbia, Canada",Private,Unknown,Curtiss seaplane,1.0,0.0,1.0,1.0,0.0,1.0,0.0,The first fatal airplane accident in Canada oc...
3,1913-09-09,18:30:00,Over the North Sea,Military - German Navy,Unknown,Zeppelin L-1 (airship),20.0,0.0,0.0,14.0,0.0,0.0,0.0,The airship flew into a thunderstorm and encou...
4,1913-10-17,10:30:00,"Near Johannisthal, Germany",Military - German Navy,Unknown,Zeppelin L-2 (airship),30.0,0.0,0.0,30.0,0.0,0.0,0.0,Hydrogen gas which was being vented was sucked...
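The `Unnamed: 0` column in the preview is typically the index that an earlier `to_csv` call wrote into the file. As a minimal sketch, assuming that is what happened here, the made-up two-row CSV below shows how `index_col=0` reads that column back as the index instead of as data:

```python
import io
import pandas as pd

# Made-up two-row CSV that mimics a file saved together with its index.
csv_text = (
    ",date,operator\n"
    "0,1908-09-17,Military - U.S. Army\n"
    "1,1912-07-12,Military - U.S. Navy\n"
)

default_read = pd.read_csv(io.StringIO(csv_text))
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))  # includes the "Unnamed: 0" leftover
print(list(indexed_read.columns))  # only the real data columns
```

The rest of this notebook keeps the column as loaded, so the outputs below still show it.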


In [None]:
# 1. Convert Date column to datetime
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# 2. Extract Year, Month, Weekday
df["Year"] = df["date"].dt.year
df["Month"] = df["date"].dt.month
df["Weekday"] = df["date"].dt.day_name()

# 3. Create Decade
df["Decade"] = (df["Year"] // 10) * 10
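Because `errors="coerce"` turns unparseable dates into `NaT`, the derived `Year` and `Decade` columns become floats as soon as one date is missing. The sketch below uses made-up dates (one deliberately invalid) to show one possible way to keep whole numbers, using the nullable `Int64` dtype. This is an optional variation, not what the cell above does.

```python
import pandas as pd

# Made-up dates; the second one is invalid on purpose.
toy = pd.DataFrame({"date": ["1908-09-17", "not a date", "2023-10-29"]})
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")

# A NaT makes .dt.year a float column; Int64 keeps integers plus <NA>.
toy["Year"] = toy["date"].dt.year.astype("Int64")
toy["Decade"] = (toy["Year"] // 10) * 10
toy["Weekday"] = toy["date"].dt.day_name()

print(toy)
```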

In [7]:
# 4. Is Military? -> If 'military' appears in operator name
df["is_Military"] = df["operator"].str.contains("military", case=False, na=False).astype(int)
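The flag relies on the operator string literally containing "military", as in the `Military - U.S. Army` entries shown earlier. A small sketch on made-up operator values (the last entry is invented for illustration) shows how case and missing values are handled:

```python
import pandas as pd

# Operator strings in the style of the dataset, plus one missing value.
operators = pd.Series(["Military - U.S. Army", "Private", None, "MILITARY - Test"])

# case=False ignores capitalisation; na=False treats missing names as non-military.
flag = operators.str.contains("military", case=False, na=False).astype(int)
print(flag.tolist())
```

An operator recorded without the word "military" in its name would not be flagged, so the result depends on how consistently the source data labels operators.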

In [8]:
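# 5. Fatality Rate (zero `aboard` is masked so the division cannot yield inf)
df["Fatality_Rate"] = df["fatalities"] / df["aboard"].replace(0, np.nan)
# Undefined rates (missing or zero `aboard`) are stored as 0
df["Fatality_Rate"] = df["Fatality_Rate"].fillna(0).round(2)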

# 6. Is Fatal? -> Binary flag for whether the crash had any deaths
df["Is_Fatal"] = (df["fatalities"] > 0).astype(int)
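Rows where `aboard` is 0 or missing deserve a quick check, because plain division yields `inf` or `NaN` there. The sketch below uses made-up rows to compare plain division with a version that masks zero denominators. Whether such rows exist in the real file is an assumption worth verifying.

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal crash, nobody recorded aboard, and a missing count.
toy = pd.DataFrame({
    "fatalities": [14.0, 2.0, 3.0],
    "aboard": [20.0, 0.0, np.nan],
})

naive = toy["fatalities"] / toy["aboard"]
# Masking zero denominators turns the division-by-zero case into NaN.
masked = toy["fatalities"] / toy["aboard"].replace(0, np.nan)

print(naive.tolist())
print(masked.tolist())
```

Filling the remaining `NaN` values with 0 keeps the column numeric, but it also makes "unknown" indistinguishable from "nobody died", which matters when averaging the rate.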

In [9]:
# 7. Aircraft Type Simplification
def simplify_aircraft_type(ac_type):
    if pd.isna(ac_type):
        return "Unknown"
    ac_type = ac_type.lower()
    if "zeppelin" in ac_type or "dirigible" in ac_type:
        return "Airship"
    elif "biplane" in ac_type:
        return "Biplane"
    elif "jet" in ac_type:
        return "Jet"
    elif "helicopter" in ac_type:
        return "Helicopter"
    elif "seaplane" in ac_type:
        return "Seaplane"
    else:
        return "Other"

df["Aircraft_Type_Simple"] = df["ac_type"].apply(simplify_aircraft_type)

In [10]:
# 8. Crash Location Type
def extract_location_type(location):
    if pd.isna(location):
        return "Unknown"
    location = location.lower()
    if any(word in location for word in ["sea", "ocean", "bay", "gulf"]):
        return "Water"
    elif any(word in location for word in ["mountain", "ridge", "hill", "peak"]):
        return "Mountain"
    elif any(word in location for word in ["island"]):
        return "Island"
    else:
        return "Land"

df["Crash_Location_Type"] = df["location"].apply(extract_location_type)
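One caveat with the `word in location` test is that it matches substrings, so a keyword such as "sea" also fires inside longer words. The sketch below compares it with a whole-word regular expression on two example strings. `mentions_water` and `WATER_WORDS` are hypothetical names used only for illustration and are not part of the notebook.

```python
import re

# Hypothetical helper: whole-word match, with an optional plural "s".
WATER_WORDS = re.compile(r"\b(?:sea|ocean|bay|gulf)s?\b")

def mentions_water(location: str) -> bool:
    """Return True if a water keyword appears as a whole word."""
    return bool(WATER_WORDS.search(location.lower()))

for place in ["Over the North Sea", "Seattle, Washington"]:
    # Substring test, as used in extract_location_type above.
    substring_hit = any(w in place.lower() for w in ["sea", "ocean", "bay", "gulf"])
    print(place, "| substring:", substring_hit, "| whole word:", mentions_water(place))
```

The same pattern could be applied to the mountain and island keywords if the substring behaviour turns out to mislabel rows in the real data.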

In [11]:
df

Unnamed: 0,date,time,location,operator,route,ac_type,aboard,aboard_passangers,aboard_crew,fatalities,...,summary,Year,Month,Weekday,Decade,is_Military,Fatality_Rate,Is_Fatal,Aircraft_Type_Simple,Crash_Location_Type
0,1908-09-17,17:18:00,"Fort Myer, Virginia",Military - U.S. Army,Demonstration,Wright Flyer III,2.0,1.0,1.0,1.0,...,"During a demonstration flight, a U.S. Army fly...",1908,9,Thursday,1900,1,0.5,1,Other,Land
1,1912-07-12,06:30:00,"Atlantic City, New Jersey",Military - U.S. Navy,Test flight,Dirigible,5.0,0.0,5.0,5.0,...,First U.S. dirigible Akron exploded just offsh...,1912,7,Friday,1910,1,1.0,1,Airship,Land
2,1913-08-06,Unknown,"Victoria, British Columbia, Canada",Private,Unknown,Curtiss seaplane,1.0,0.0,1.0,1.0,...,The first fatal airplane accident in Canada oc...,1913,8,Wednesday,1910,0,1.0,1,Seaplane,Land
3,1913-09-09,18:30:00,Over the North Sea,Military - German Navy,Unknown,Zeppelin L-1 (airship),20.0,0.0,0.0,14.0,...,The airship flew into a thunderstorm and encou...,1913,9,Tuesday,1910,1,0.7,1,Airship,Water
4,1913-10-17,10:30:00,"Near Johannisthal, Germany",Military - German Navy,Unknown,Zeppelin L-2 (airship),30.0,0.0,0.0,30.0,...,Hydrogen gas which was being vented was sucked...,1913,10,Friday,1910,1,1.0,1,Airship,Land
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4967,2022-11-21,10:15:00,"Medellín, Colombia",AeroPaca SAS,Medellín - Pizarro,Piper PA-31-350 Navajo Chieftain,8.0,6.0,2.0,8.0,...,The plane was chartered to carry a team of six...,2022,11,Monday,2020,0,1.0,1,Other,Land
4968,2023-01-15,10:50:00,"Pokhara, Nepal",Yeti Airlines,Kathmandu - Pokhara,ATR 72-500-72-212-A,72.0,68.0,4.0,72.0,...,"Before approach, the pilot requested a change ...",2023,1,Sunday,2020,0,1.0,1,Other,Land
4969,2023-09-16,Unknown,"Barcelos, Brazil",Manaus Aerotaxi,Unknown,Embraer EMB-110P1 Bandeirante,14.0,12.0,2.0,14.0,...,The air taxi crashed in heavy rain while attem...,2023,9,Saturday,2020,0,1.0,1,Other,Land
4970,2023-10-29,06:30:00,"Rio Branco, Brazil",ART Taxi Aero,Rio Branco - Envira,Cessna 208B Grand Caravan,12.0,10.0,2.0,12.0,...,The air taxi crashed into a heavy wooded area ...,2023,10,Sunday,2020,0,1.0,1,Other,Land


In [12]:
# ✅ Save Feature-Enriched Dataset
df.to_csv("feature_engineered_crashes.csv", index=False)
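CSV files do not store dtypes, so the `date` column comes back as plain text when the saved file is loaded again. A minimal sketch, using a made-up frame and an in-memory buffer instead of the real file, shows how `parse_dates` restores the datetime type on reload:

```python
import io
import pandas as pd

# Made-up frame standing in for the enriched dataset.
toy = pd.DataFrame({
    "date": pd.to_datetime(["1908-09-17", "2023-10-29"]),
    "Is_Fatal": [1, 1],
})

# Round-trip through an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])

print(plain["date"].dtype, parsed["date"].dtype)
```

Later notebooks that read `feature_engineered_crashes.csv` could pass the same argument if they need date arithmetic.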

## 🎯 Feature Engineering Summary

This notebook added the following features to the cleaned crash dataset:

| Feature Name            | Description |
|-------------------------|-------------|
| `Year`, `Month`, `Weekday` | Extracted from crash date |
| `Decade` | Groups crashes by 10-year periods |
| `is_Military` | 1 if the operator name contains "military" (case-insensitive), else 0 |
| `Fatality_Rate` | Proportion of people aboard who died (stored as 0 when it cannot be computed) |
| `Is_Fatal` | Binary: 1 if there were any fatalities |
| `Aircraft_Type_Simple` | Simplified aircraft types like Jet, Airship, etc. |
| `Crash_Location_Type` | Categorized crash locations (Water, Mountain, Island, Land) |

📁 File saved: `feature_engineered_crashes.csv`

---

### ⏭️ Next Step:
Continue with the **BASIC QUESTIONS** notebook to start exploring the enriched dataset.