# Notebook 01: Initial Data Exploration

In this notebook I load the raw data to get a feel for it, see what cleaning it needs, and work out what I might be able to do with it.

In [1]:
import pandas as pd
from src.config import RAW_DATA
from pandas_profiling import ProfileReport

In [2]:
data = RAW_DATA.joinpath("Airplane_Crashes_and_Fatalities_Since_1908.csv")

In [3]:
df = pd.read_csv(data)

## Data Overview

First, let's get a high-level overview of the data. One of my favourite tools for this is pandas-profiling.

In [4]:
ProfileReport(df)

Summarize dataset: 100%|██████████| 27/27 [00:04<00:00,  6.19it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:04<00:00,  4.26s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.28it/s]




### Observations

* Looks like we're missing a fair bit of data. Thankfully, some of the affected columns aren't that important (e.g. `Flight #`)

* There's a lot of missing data in `Time`, which is a shame as I think it would have given some interesting insights

* `Registration` and `cn/In` just hold the registration number and airframe ID number of the aircraft involved, and probably aren't that useful

* `Summary` contains the description of the crash. This column will be particularly useful when trying to classify crashes
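
To put rough numbers on the missing data, something like the following would do. This is only a sketch: `toy` is a small hand-made stand-in for `df` (an assumption so the snippet runs on its own), and `missing_share` is just an illustrative name.

```python
import pandas as pd

# Hand-made stand-in for `df`; in the notebook you would use `df` directly.
toy = pd.DataFrame(
    {
        "Time": ["04:30", None, None, "10:21"],
        "Flight #": [None, None, None, "426"],
        "Summary": ["The aircraft took off...", None, "Disappeared...", "The aircraft made..."],
    }
)

# isna() marks missing cells as True; the mean of booleans is the share of
# missing values in each column. Sort so the worst columns come first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on `df` should rank the columns by how incomplete they are, which makes it easier to decide what to drop.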

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          5268 non-null   object 
 1   Time          3049 non-null   object 
 2   Location      5248 non-null   object 
 3   Operator      5250 non-null   object 
 4   Flight #      1069 non-null   object 
 5   Route         3562 non-null   object 
 6   Type          5241 non-null   object 
 7   Registration  4933 non-null   object 
 8   cn/In         4040 non-null   object 
 9   Aboard        5246 non-null   float64
 10  Fatalities    5256 non-null   float64
 11  Ground        5246 non-null   float64
 12  Summary       4878 non-null   object 
dtypes: float64(3), object(10)
memory usage: 535.2+ KB


* It also looks like `Date` and `Time` are simply stored as objects. We'll convert these to datetimes later

* Let's inspect a random sample of the data

In [6]:
df.sample(20)

| | Date | Time | Location | Operator | Flight # | Route | Type | Registration | cn/In | Aboard | Fatalities | Ground | Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1071 | 07/27/1950 | 04:30 | Off O-shima Island, Japan | Military - U.S. Air Force | | | Douglas C-47D | 44-76439A | | 26.0 | 25.0 | 0.0 | The aircraft took off from O-shima and reached... |
| 993 | 02/02/1949 | | Trinity Bay, Newfoundland | Saint Lawrence Airways | | | Avro Anson | CF-FEO | 3708 | 6.0 | 6.0 | 0.0 | |
| 3738 | 12/30/1987 | | PacifiOcean | Merpati Nasantara Airlines | | Samarinda - Berau | de Havilland Canada DHC-6 Twin Otter 300 | PK-NUY | 459 | 17.0 | 17.0 | 0.0 | Disappeared between Samarinda and Berau, Indon... |
| 3069 | 09/02/1978 | 10:21 | AtlantiOcean | Antillies Air - Air Taxi | | St. Croix, VI - St. Thomas, VI | Grumman G-21A | N7777V | | 11.0 | 4.0 | 0.0 | The aircraft made a force landing in the water... |
| 3781 | 08/02/1988 | 17:42 | Reykjavik, Iceland | Geoterrex | | Narsarsuaq, Greenland - Reykjavik, Iceland | CASA 212 Aviocar 200 | C-GILU | 245 | 3.0 | 3.0 | 0.0 | The plane, on a positioning flight entered a s... |
| 2085 | 11/22/1966 | 12:20 | Near Aden, Yemen | Aden Airways | | | Douglas DC-3 | VR-AAN | 4284 | 30.0 | 30.0 | 0.0 | The aircraft crashed into the desert 20 minute... |
| 2835 | 08/07/1975 | 16:11 | Denver, Colorado | Continental Airlines | 426.0 | Denver - Wichita | Boeing B-727-224 | N88777 | 19798/608 | 131.0 | 0.0 | 0.0 | The aircraft climbed to about 100 feet above r... |
| 3246 | 09/15/1980 | 00:00 | Near Medina, Saudi Arabia | Military - Royal Saudi Air Force | | | Lockheed C-130E | 453 | 4128 | 89.0 | 89.0 | 0.0 | Crashed into the desert after taking off. Repo... |
| 2058 | 07/04/1966 | 15:59 | Auckland, New Zealand | Air New Zealand | | Training | Douglas DC-8-52 | ZK-NZB | 45751/231 | 5.0 | 2.0 | 0.0 | The incurrence of reverse thrust during simula... |
| 994 | 02/04/1949 | | Castel Benito, Libya | Skyways of London | | Khartoum - Castel Benito | Douglas C-54A-1-DO Skymaster | G-AJPL | 7464 | 53.0 | 1.0 | 0.0 | The No. 4, followed by the No. 3 engines faile... |


Looks like the bulk of the work will be cleaning up text columns:

* `Location` appears to be mostly in the format `location, country`, which makes the country an obvious candidate for a feature

* Similarly, `Operator` often takes the form `Military - {something}`. I'm willing to bet the military has far more accidents than commercial aviation, just by the nature of their flights

* `Route` could be interesting, but I think there could be too many distinct values to really make use of it

* `Summary` is where the real work is, I think. Clustering crashes by their description is my main goal.
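
As a rough sketch of how the `Location` and `Operator` ideas above might turn into features, here is one possible approach. The rows in `toy` are copied from the sample shown earlier, but the column names `Region` and `Is_Military` are my own placeholders, not anything that exists in the dataset. Note that the text after the last comma is not always a country (e.g. `Denver, Colorado`), and some values have no comma at all.

```python
import pandas as pd

# A few Location/Operator values taken from the sample; `toy` stands in for `df`.
toy = pd.DataFrame(
    {
        "Location": [
            "Off O-shima Island, Japan",
            "Trinity Bay, Newfoundland",
            "PacifiOcean",
            "Denver, Colorado",
        ],
        "Operator": [
            "Military - U.S. Air Force",
            "Saint Lawrence Airways",
            "Merpati Nasantara Airlines",
            "Continental Airlines",
        ],
    }
)

# Split once on the last comma and keep the right-hand piece.
# Values without a comma come back unchanged.
toy["Region"] = toy["Location"].str.rsplit(",", n=1).str[-1].str.strip()

# Flag operators whose name starts with "Military"; missing operators count as False.
toy["Is_Military"] = toy["Operator"].str.startswith("Military", na=False)

print(toy[["Region", "Is_Military"]])
```

The `Region` values would still need a mapping step (US states, provinces, the mangled ocean names) before they could be treated as countries.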

## Cleaning

Now that I've figured out roughly what I want to do, I'll clean the data up so that I can easily load it later via `src.data`.

Cleaning steps are going to be:

* Remove redundant columns

* Cast column dtypes

* Drop rows with a missing `Summary`, since I think a crash will be very hard to classify without one
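
The steps above could be sketched roughly as follows. This is not the actual `src.data` code: the `clean` function is hypothetical, the choice of `Flight #`, `Registration` and `cn/In` as the redundant columns is my assumption based on the observations earlier, and the `MM/DD/YYYY` date format is inferred from the sample rows.

```python
import pandas as pd


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning helper illustrating the three steps."""
    # 1. Remove columns assumed to be redundant (ignored if already absent).
    out = raw.drop(columns=["Flight #", "Registration", "cn/In"], errors="ignore")

    # 3. Drop rows that have no Summary (done early so less data is cast).
    out = out.dropna(subset=["Summary"]).copy()

    # 2. Cast dtypes: dates to datetime, counts to nullable integers.
    #    Unparseable dates become NaT instead of raising.
    out["Date"] = pd.to_datetime(out["Date"], format="%m/%d/%Y", errors="coerce")
    for col in ["Aboard", "Fatalities", "Ground"]:
        out[col] = out[col].astype("Int64")  # keeps missing counts as <NA>

    return out.reset_index(drop=True)


# Small stand-in for the raw data, using values from the sample above.
raw = pd.DataFrame(
    {
        "Date": ["07/27/1950", "02/02/1949", "12/30/1987"],
        "Flight #": [None, None, None],
        "Registration": ["44-76439A", "CF-FEO", "PK-NUY"],
        "cn/In": [None, "3708", "459"],
        "Aboard": [26.0, 6.0, 17.0],
        "Fatalities": [25.0, 6.0, 17.0],
        "Ground": [0.0, 0.0, 0.0],
        "Summary": [
            "The aircraft took off from O-shima and reached...",
            None,
            "Disappeared between Samarinda and Berau, Indon...",
        ],
    }
)

cleaned = clean(raw)
print(cleaned.dtypes)
```

Using the nullable `Int64` dtype is a design choice here: the counts are whole numbers, but a plain `int` cast would fail on the rows where they are missing.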