# Notebook 02: EDA

In this notebook I'm going to dive a bit deeper into the (now cleaned) dataset.

In [1]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import altair as alt
import altair_data_server

from src.data.load_data import Data

alt.data_transformers.enable("data_server")


DataTransformerRegistry.enable('data_server')

In [2]:
df = Data().load(clean=True)

In [3]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4814 entries, 0 to 5267
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          4814 non-null   datetime64[ns]
 1   year          4814 non-null   int64         
 2   month         4814 non-null   string        
 3   location      4814 non-null   string        
 4   country       4814 non-null   category      
 5   sector        4814 non-null   category      
 6   operator      4814 non-null   string        
 7   manufacturer  4814 non-null   category      
 8   type          4814 non-null   string        
 9   aboard        4814 non-null   int64         
 10  fatalities    4814 non-null   int64         
 11  fatality_pct  4812 non-null   float64       
 12  ground        4814 non-null   int64         
 13  summary       4814 non-null   string        
dtypes: category(3), datetime64[ns](1), float64(1), int64(4), string(5)
memory usage: 2.9 MB


In [4]:
df.sample(10)

Unnamed: 0,date,year,month,location,country,sector,operator,manufacturer,type,aboard,fatalities,fatality_pct,ground,summary
1280,1953-09-01,1953,September,"Vail, Washington",United States,Civilian,Regina Cargo Airlines,Douglas,Douglas DC-3,21,21,1.0,0,Crashed 26 nm short of McChord AFB. The pilot'...
3978,1991-02-17,1991,February,"Cleveland, Ohio",United States,Civilian,Ryan International Airlines,McDonell Douglas,McDonnell Douglas DC-9-15RC,2,2,1.0,0,The cargo plane stalled during takeoff cart wh...
1702,1960-12-17,1960,December,"Munich, West Germany",West Germany,Military,Military - U.S. Air Force,Convair,Convair C-131D (CV-340-79),20,20,1.0,31,The aircraft lost an engine on takeoff from Mu...
1521,1958-02-06,1958,February,"Munich, Germany",Germany,Civilian,British European Airways,Airspeed,Airspeed Ambassador A5-57,44,23,0.522727,0,The aircraft crashed during takeoff in a snows...
1018,1949-08-15,1949,August,"Off Lurga Point, Ireland",Ireland,Civilian,Transocean Air Lines,Douglas,Douglas DC-3 (C-54A-DO),58,8,0.137931,0,Fuel exhaustion forced the plane to ditch in t...
4321,1995-01-30,1995,January,"Kuei Shan Hsiang, Taiwan",Taiwan,Civilian,Transasia Airways,Aerospatiale,Aerospatiale ATR-72,4,4,1.0,0,Crashed while en route on a positioning flight.
1170,1951-12-30,1951,December,"Near Phoenix, Arizona",United States,Military,Military - U.S. Air Force,Douglas,Douglas C-47D,28,28,1.0,0,Crashed into a mountainside at an altitude of ...
3934,1990-03-27,1990,March,"Near Kuito, Angola",Angola,Military,Military - Angolan Air Force,CASA,CASA 212 Aviocar 300,25,25,1.0,0,Shot down with a missile fired by UNITA rebels.
1282,1953-09-14,1953,September,"Chablekal, Mexico",Mexico,Civilian,Transportes Aéreos Mexicanos,Douglas,Douglas C-47A,2,1,0.5,0,The cargo plane struck a tower in fog while at...
5202,2008-05-23,2008,May,"Billings, Montana",United States,Civilian,Alpine Aviation,Beechcraft,Beechcraft 1900C,1,1,1.0,0,"After taking off and being told to turn left,..."


## Data Cleaning

You'll see we now get a cleaned-up version of the raw data when I pass `clean=True` to the `load` method of the `Data` class. If you're interested, this is all implemented in `src/data/load_data.py`.

Basic summary of the cleaning steps:

* Dropped a few columns:
    * `Flight #` - I didn't expect any useful insights from this one; it's essentially just an arbitrary number assigned by the airline.
    * `Registration` - Similar thing. This is the aircraft's registration number, e.g. G-EGHG.
    * `cn/In` - Same as registration, this refers to the aircraft's airframe ID or model number.
    * `Route` - Initially I wanted to keep this to see whether any one route is particularly hazardous, but some scratchpad analysis showed there were just too many distinct values to really learn anything.
    * `Time` - This is the one I'm most sad about losing. I strongly suspect that accidents are more common at night than during the day, but there were lots of missing values and I couldn't get it to play nicely with pandas datetime parsing in my cleaning pipeline. The date parsed correctly, but because the time was `NaN` the combined column would get cast to `object`. If anyone knows a good way of handling this, let me know!

* Did some basic formatting and tidying, e.g. stripped whitespace from text and dropped any remaining `NaN`s.

* Extracted some features like `year`, `month`, `country`, whether it was a military or commercial flight etc.

* Grouped some countries. For example, the country extractor would return names of US states, so I wrote a method to group them under `United States`. I similarly grouped `USSR` and `Russia`, as well as lots of misspelled versions of `Atlantic Ocean` and `Pacific Ocean`.

* And finally, reordered the columns to satisfy my OCD!
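
On the dropped `Time` column: one possible way around the `object` cast is to parse the date and the time separately and coerce anything unparseable to a missing value. The sketch below is only an illustration on a toy frame; the raw column names `Date` and `Time` and the value formats are assumptions, not taken from `src/data/load_data.py`.

```python
import pandas as pd

# Toy frame standing in for the raw data. The column names and value
# formats here are assumptions made for illustration only.
raw = pd.DataFrame(
    {
        "Date": ["09/17/1908", "07/12/1912", "08/06/1913"],
        "Time": ["17:18", None, "c 06:30"],
    }
)

# Parse the date on its own so a missing time can never spoil it.
raw["date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y", errors="coerce")

# Pull out an HH:MM pattern, ignoring stray prefixes; no match gives NaN.
hhmm = raw["Time"].str.extract(r"(\d{1,2}:\d{2})", expand=False)

# Unparseable or missing times become NaT rather than forcing object dtype.
parsed_time = pd.to_datetime(hhmm, format="%H:%M", errors="coerce")

# Keep only the hour, stored as a nullable integer so gaps stay as <NA>.
raw["hour"] = parsed_time.dt.hour.astype("Int64")
```

Keeping `hour` as its own nullable column leaves `date` as a proper datetime, and rows without a time can simply be filtered out when looking at day versus night.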

## EDA

Let's get into it!

I'm going to start with my favourite auto-data-magic-summariser... pandas-profiling. I usually do this at the start of an EDA and keep it around to refer back to.

I also basically exclusively use [Altair](https://altair-viz.github.io) for visualisations like this. I love its declarative grammar: you just initialise a chart and pipe everything into it, and I've yet to run into something I couldn't do.

In [5]:
# ProfileReport(df)

Let's start by finding out the top 10 most dangerous countries to fly in...

In [6]:
(alt.Chart(df.groupby(by = "country")[["fatalities"]].sum().nlargest(10, "fatalities").reset_index())
.mark_bar()
.encode(
    x = alt.X("country:N", title = "Country", sort = "-y",),
    y = alt.Y("fatalities:Q", title = "Total Fatalities"),
    tooltip = ["country", "fatalities"],
).properties(
    title = "Total Fatalities by Country (Top 10)",
    height = 500,
    width = 750
).configure_axisX(
    labelAngle = -40
))

Wow, the US does not come off well here. This could be real, but it could also be an artifact of something like:

* The data was collected from a source in the US, making data collection easier for domestic accidents.
* The number of people regularly travelling by air in the US is likely to be high compared to the other countries. What we should really look at if we want to determine the likelihood of a crash fatality is fatalities divided by the flying population.

Let's instead look at the mean crash fatality rate, i.e. the fraction of people aboard who died in each accident (`fatality_pct`), averaged per country.

A quick look at the value counts reveals that the US has far more accidents on record. Again, this could be a real observation or an artifact of the causes discussed above.

In [7]:
df['country'].value_counts()

United States                          1334
Russia                                  216
Brazil                                  173
Colombia                                140
Canada                                  138
                                       ... 
Midway Island Naval Air Station           1
Milford Sound                             1
Minnisota                                 1
Moldova                                   1
off the Philippine island of Elalat       1
Name: country, Length: 412, dtype: int64
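
The tail of that output still contains a few values that are not countries (for example `Minnisota`, a misspelled US state). A minimal sketch of how such leftovers could be folded into the grouping step described in the cleaning section is shown below. The alias table and the function name are hypothetical and are not the mapping used in `src/data/load_data.py`.

```python
import pandas as pd

# Hypothetical alias table: the keys are illustrative examples only,
# not the notebook's actual mapping.
ALIASES = {
    "Minnisota": "United States",
    "Minnesota": "United States",
    "USSR": "Russia",
}


def normalise_country(countries: pd.Series) -> pd.Series:
    """Map known aliases to a canonical name and leave other values untouched."""
    cleaned = countries.str.strip()
    return cleaned.replace(ALIASES).astype("category")


sample = pd.Series(["Minnisota", " USSR", "Brazil", "Moldova"])
result = normalise_country(sample)
```

Values that are not in the alias table pass through unchanged, so the table can be extended one entry at a time while checking `value_counts()` again.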

In [8]:
top_10 = set(["United States", "Russia", "Brazil", "Colombia", "France", "India", "Japan", "Indonesia", "China", "Canada"])
top_10_df = df[df["country"].isin(top_10)]
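
The list above is typed in by hand. As an alternative, the same kind of set could be derived from the data so it stays in sync if the cleaning changes. This is a sketch on a toy frame; the helper name `top_n_by_fatalities` is made up for illustration.

```python
import pandas as pd


def top_n_by_fatalities(frame: pd.DataFrame, column: str, n: int = 10) -> list:
    """Return the n values of `column` with the largest total fatalities."""
    # observed=True skips unused categories when `column` is categorical.
    totals = frame.groupby(column, observed=True)["fatalities"].sum()
    return totals.nlargest(n).index.tolist()


# Toy data standing in for the cleaned DataFrame.
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "C", "C", "C"],
        "fatalities": [5, 7, 20, 1, 1, 1],
    }
)

top_2 = top_n_by_fatalities(toy, "country", n=2)
toy_top_df = toy[toy["country"].isin(top_2)]
```

On the real data this would be called as `top_n_by_fatalities(df, "country")`, and the same helper would work for the `manufacturer` column.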

In [9]:
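# numeric_only=True restricts the mean to numeric columns (pandas 2.x raises on string columns otherwise)
top_10_df.groupby("country").mean(numeric_only=True).sort_values(by="fatality_pct", ascending=False).dropna()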

country,year,aboard,fatalities,fatality_pct,ground
Colombia,1978.628571,22.55,20.328571,0.873064,0.292857
India,1967.054348,34.423913,27.695652,0.867188,0.869565
Russia,1981.824074,50.416667,42.00463,0.867038,0.268519
Japan,1968.463415,74.390244,48.902439,0.866045,0.04878
France,1958.907407,29.925926,24.277778,0.848833,0.259259
United States,1971.595202,18.888306,12.025487,0.835286,4.334333
Brazil,1972.942197,25.011561,17.82659,0.792493,0.410405
Canada,1976.971014,19.775362,13.224638,0.783645,0.108696
China,1965.044118,41.191176,27.838235,0.749167,0.661765
Indonesia,1989.3375,37.9625,24.8625,0.735071,0.625
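
One thing to keep in mind when reading this table: the `fatality_pct` column is a mean of per-crash rates, which is not the same as total fatalities divided by total people aboard. For the United States row, for instance, mean fatalities over mean aboard is about 0.64, while the mean of the per-crash rates is about 0.84, because small aircraft with no survivors count as much as large ones. A toy illustration of the difference (made-up numbers, not the dataset):

```python
import pandas as pd

# Made-up numbers chosen so the two summaries disagree for country "X".
toy = pd.DataFrame(
    {
        "country": ["X", "X", "Y", "Y"],
        "aboard": [2, 200, 50, 50],
        "fatalities": [2, 20, 25, 25],
    }
)
toy["fatality_pct"] = toy["fatalities"] / toy["aboard"]

grouped = toy.groupby("country")

# Mean of per-crash rates: every crash has equal weight.
mean_of_rates = grouped["fatality_pct"].mean()

# Pooled rate: every person aboard has equal weight.
totals = grouped[["fatalities", "aboard"]].sum()
pooled_rate = totals["fatalities"] / totals["aboard"]

comparison = pd.DataFrame(
    {"mean_of_rates": mean_of_rates, "pooled_rate": pooled_rate}
)
```

Which of the two is more appropriate depends on the question: the mean of rates describes a typical crash, while the pooled rate describes a typical person aboard a crashed aircraft.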


In [19]:
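# Mean of the numeric columns per country, highest crash fatality rate first
mean_by_country = (
    top_10_df.groupby("country")
    .mean(numeric_only=True)  # avoids a TypeError on string columns in pandas 2.x
    .sort_values(by="fatality_pct", ascending=False)
    .dropna()
    .reset_index()
)

alt.Chart(mean_by_country).mark_bar().encode(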
    x = alt.X("country:N", title = "Country", sort = "-y"),
    y = alt.Y("fatality_pct:Q", title = "Crash Fatality Rate", axis = alt.Axis(format = "%")),
    color = alt.Color("fatality_pct:Q", title = None, scale = alt.Scale(scheme = "blues"))
).properties(
    title = "Mean Crash Fatality Rate by Country",
    height = 500,
    width = 750
).configure_axisX(
    labelAngle = -40
)

In [11]:
month_order = [
    "January",
    "February",
    "March",
    "April",
    "May",
    "June",
    "July",
    "August",
    "September",
    "October",
    "November",
    "December",
]

alt.Chart(df).mark_bar(opacity = 0.4).encode(
    x = alt.X("month:N", title = "Month", sort = month_order),
    y = alt.Y("sum(fatalities):Q", title = "Total Fatalities", stack = None),
    color = alt.Color("sector:N", title = "Sector")
).properties(
    title = "Total Fatalities by Month",
    height = 500,
    width = 750
).configure_axisX(
    labelAngle = -40
)

In [12]:
chart = (alt.Chart(df).mark_line(interpolate="basis").encode(
    x = alt.X("year(date):T", title = "Year"),
    y = alt.Y("mean(fatality_pct):Q", title = "Fatality Rate", axis=alt.Axis(format="%")),
    color = alt.Color("sector:N", title = "Sector")
).properties(
    title = "Crash Fatality Rate by Year and Sector",
    width = 750,
    height = 500
))

chart

In [13]:
alt.Chart(df).mark_bar(opacity = 0.6).encode(
    x = alt.X("fatality_pct:Q", title = "Crash Fatality Rate", bin = alt.Bin(maxbins=10)),
    y = alt.Y("count()", stack = None),
    color = alt.Color("sector:N", title = "Sector")
).properties(
    title = "Crash Fatality Rate Histogram",
    height = 500,
    width = 750
)


In [14]:
alt.Chart(df[df["fatality_pct"] < 1]).mark_bar(opacity = 0.6).encode(
    x = alt.X("fatality_pct:Q", title = "Crash Fatality Rate", bin = alt.Bin(maxbins=20)),
    y = alt.Y("count()", stack = None),
    color = alt.Color("sector:N", title = "Sector")
).properties(
    title = "Crash Fatality Rate Histogram (Crashes with Fatality Rate < 1)",
    height = 500,
    width = 750
)

In [15]:
# Sum only the fatalities column so non-numeric columns are not aggregated
alt.Chart(df.groupby("manufacturer")[["fatalities"]].sum().nlargest(10, "fatalities").reset_index()).mark_bar().encode(
    x = alt.X("manufacturer:N", title = "Manufacturer", sort = "-y"),
    y = alt.Y("fatalities:Q", title = "Fatalities")
).properties(
    title = "Fatalities by Aircraft Manufacturer",
    width = 750,
    height = 500
).configure_axisX(
    labelAngle = -40
)

In [16]:
top_10_manf = set(["Boeing", "Douglas", "Lockheed", "McDonell Douglas", "Antonov", "Tupolev", "Ilyushin", "Airbus", "De Havilland", "Fokker"])
top_10_manf_df = df[df["manufacturer"].isin(top_10_manf)]

In [18]:
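# Mean of the numeric columns per manufacturer, highest crash fatality rate first
mean_by_manf = (
    top_10_manf_df.groupby("manufacturer")
    .mean(numeric_only=True)  # avoids a TypeError on string columns in pandas 2.x
    .sort_values(by="fatality_pct", ascending=False)
    .dropna()
    .reset_index()
)

alt.Chart(mean_by_manf).mark_bar().encode(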
    x = alt.X("manufacturer:N", title = "Manufacturer", sort = "-y"),
    y = alt.Y("fatality_pct:Q", title = "Crash Fatality Rate", axis = alt.Axis(format = "%")),
    color = alt.Color("fatality_pct:Q", title = None, scale = alt.Scale(scheme="blues"))
).properties(
    title = "Mean Crash Fatality Rate by Aircraft Manufacturer",
    height = 500,
    width = 750
).configure_axisX(
    labelAngle = -40
)