# Imputing NTSB Master dataset

This notebook imputes missing values in the NTSB master dataset, which is produced by the Jupyter notebook `ntsb_feature_selection.ipynb`.

***

#### Notes 
- We don't want to drop columns based on their frequency in the test set, so dropping them here is probably a temporary measure to simplify our exploration; we can take it out once we have a model.
- The steps in this notebook, which fill in missing entries, should be run __before__ dropping infrequent values.

Eventually, when we impute values, I think we should do so __before__ dropping any columns or infrequent values of categorical variables, because we may use that information in imputation even if we don't use it in modeling. E.g. if we only have two occurrences of a particular aircraft model, that could still be useful for imputing missing info about the aircraft.
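
As a rough illustration of why rare categories can still help with imputation, the sketch below fills a missing `num_eng` with the most common value seen for the same `acft_model`. The toy rows, the helper `first_mode`, and the column `num_eng_imputed` are hypothetical and only for illustration; this is not the method used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical rows, not taken from the NTSB dataset
toy = pd.DataFrame({
    "acft_model": ["PA-28", "PA-28", "RareModel", "RareModel", "C172"],
    "num_eng": [1.0, np.nan, 2.0, np.nan, np.nan],
})

def first_mode(s):
    """Return the most frequent non-missing value of s, or NaN if there is none."""
    modes = s.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# Most common engine count per aircraft model (NaN if the model has no known value)
mode_by_model = toy.groupby("acft_model")["num_eng"].agg(first_mode)

# Fill each missing num_eng with the value for the same model, where one exists
toy["num_eng_imputed"] = toy["num_eng"].fillna(toy["acft_model"].map(mode_by_model))
print(toy)
```

Even though `RareModel` appears only twice, its one known engine count is enough to fill the other row. A model with no known value (here `C172`) stays missing and would need a fallback rule.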

#### `total_person_count`
We have some missing data for `total_person_count`, which can mostly be calculated, but:
1. there are a few (~10) cases where we're missing data for both aircraft in a multi-aircraft event. 
2. if `inj_tot_t` and aircraft-level counts are missing and the other event-level injury counts are 0, this does not necessarily indicate that there were only unmanned aircraft involved. It seems that the other event-level injury counts default to 0 when they are unknown, and only `inj_tot_t` is left blank in the dataset.
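
A minimal sketch of the "can mostly be calculated" step, assuming that `total_person_count` equals the sum of the four per-severity counts (`Fatal_count`, `Serious_count`, `Minor_count`, `None_count`). That relationship is an assumption made for illustration, and the rows are toy data. The total is only filled in when all four components are present, so rows with unknown counts stay missing instead of silently becoming 0.

```python
import numpy as np
import pandas as pd

severity_cols = ["Fatal_count", "Serious_count", "Minor_count", "None_count"]

# Toy data: hypothetical rows, not taken from the NTSB dataset
toy = pd.DataFrame({
    "Fatal_count": [0, 1, np.nan],
    "Serious_count": [0, 0, np.nan],
    "Minor_count": [1, 0, np.nan],
    "None_count": [2, 1, np.nan],
    "total_person_count": [3, np.nan, np.nan],
})

# min_count makes the row sum NaN unless all four components are present
computed = toy[severity_cols].sum(axis=1, min_count=len(severity_cols))

# Only fill totals that are missing; existing totals are left untouched
toy["total_person_count"] = toy["total_person_count"].fillna(computed)
print(toy)
```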

I found the following reasons why the injury counts might be unknown:
- Crash occurred outside of the USA --> no NTSB investigation
- Incident and not accident --> superficial investigation
- Aircraft damage was discovered in an inspection --> investigation could not determine whether injuries occurred

#### Filtering data
- I think we should filter out (a) events outside the USA and (b) non-accidents because of a high likelihood of sparse data. This leaves us with ~83% of the data. 
- If we do this, most of the variables now have well over 80% of the values present
- My gut is to filter before the train-test split, but I'm not certain that's right (or that it matters)

#### Other notes
- Oddly, `gust_kts` is 100% present but `wind_vel_kts` is ~20% missing. When `wind_vel_kts` is missing, `gust_kts` is 0 more than 99% of the time, which probably means that 0 is entered by default when it's unknown
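
The check behind this observation can be sketched as follows, using toy rows in place of the real data. The column `gust_kts_clean` is a hypothetical follow-up, not something this notebook does: it treats a gust of 0 as unknown whenever the wind speed itself is missing.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical rows, not taken from the NTSB dataset
toy = pd.DataFrame({
    "wind_vel_kts": [5.0, np.nan, np.nan, 12.0, np.nan],
    "gust_kts": [0.0, 0.0, 0.0, 18.0, 0.0],
})

wind_missing = toy["wind_vel_kts"].isna()

# Share of rows with gust_kts == 0 among rows where wind_vel_kts is missing
share_zero_gust = (toy.loc[wind_missing, "gust_kts"] == 0).mean()
print("Share of zero gusts when wind speed is missing:", share_zero_gust)

# Possible follow-up: mark those default-looking zeros as missing
toy["gust_kts_clean"] = toy["gust_kts"].mask(wind_missing & (toy["gust_kts"] == 0))
print(toy)
```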

### Imputing values

Categorical
- Target variables (does it ever make sense to impute a target variable, or should we just omit from the dataset / performance metrics?)
  - `damage`: guess based on injury severity 
  - `ev_highest_injury`: calculate from injury counts
- All others: 'other/unknown'

Numerical
- Calculate from other data: `total_person_count`, `Minor_count`, `None_count`, `Serious_count`, `Fatal_count`, `injured_person_count`, `ev_highest_injury`, `inj_tot_t`
- `latitude`, `longitude`: randomly sample? (not a huge issue -- it's only 1 row)
- `Environmental issues`, `Organizational issues`, `Personnel issues`: impute 0
- `num_eng`: find max number of passengers on a 1-engine aircraft, impute 1 for aircraft with at most this many passengers, 2 for aircraft with more passengers
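
One way the "calculate from injury counts" idea for `ev_highest_injury` could look is sketched below. The category labels (`FATL`, `SERS`, `MINR`, `NONE`) are assumed here and should be checked against the values actually present in the column. The counts in this notebook may also be per aircraft, while `ev_highest_injury` is per event, so a real implementation would probably need to aggregate counts by event first.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical rows, not taken from the NTSB dataset
toy = pd.DataFrame({
    "Fatal_count": [0, 2, 0, 0],
    "Serious_count": [1, 0, 0, 0],
    "Minor_count": [3, 0, 1, 0],
    "None_count": [0, 1, 0, 2],
})

# Conditions are checked in order, so the most severe non-zero count wins
conditions = [
    toy["Fatal_count"] > 0,
    toy["Serious_count"] > 0,
    toy["Minor_count"] > 0,
]
labels = ["FATL", "SERS", "MINR"]  # assumed label names

toy["ev_highest_injury"] = np.select(conditions, labels, default="NONE")
print(toy)
```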

In [26]:
import pandas as pd
import missingno as mno
import matplotlib.pyplot as plt
import numpy as np

In [27]:
data = pd.read_csv("../data/ntsb/cleaned/master_train.csv")

In [28]:
# Limit to US accidents
data = data.loc[(data['ev_country'] == 'USA') & (data['ev_type'] == 'ACC')]
data = data.drop(columns=['ev_country', 'ev_type'])

# Drop rows with no event-level injury total; .copy() avoids chained-assignment warnings below
data = data.loc[data['inj_tot_t'].notna()].copy()

# Impute missing ground injury counts with 0
ground_cols = ['inj_f_grnd', 'inj_m_grnd', 'inj_s_grnd']
data[ground_cols] = data[ground_cols].fillna(0)

data['ground_injury_total'] = data[ground_cols].sum(axis=1)

In [29]:
for col in data.columns:
    pna = data[col].isna().sum() / len(data)
    if pna > 0.2 :
        print(col, round(pna,3))


on_ground_collision 0.974
wind_vel_kts 0.211
owner_acft 0.423
oprtng_cert 1.0
oper_cert 1.0
evacuation 1.0
rwy_len 0.477
rwy_width 0.479
acft_year 0.515
fuel_on_board 0.708


In [30]:
def drop_sparse_columns(data, threshold, safe_cols=None):
    '''
    Drops columns from data that do not contain at least a given proportion of non-empty entries
    
    Inputs
        data: pandas DataFrame
        threshold: float in [0,1], all columns with less than this proportion of non-empty entries are dropped
        safe_cols: list of names of columns that should not be dropped even if they are below the sparsity threshold
    Outputs
        data: same DataFrame with appropriate columns dropped
    '''
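    # treat a missing safe_cols argument as "no protected columns"
    safe_cols = [] if safe_cols is None else safe_cols

    # list of columns to drop if they are too sparse
    unsafe_cols = [col for col in data.columns if col not in safe_cols]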

    for col in unsafe_cols:
        # calculate proportion of na entries in col
        prop_na = data[col].isna().sum() / len(data)
        
        # drop col if the column is too sparse
        if prop_na > 1 - threshold:
            data.drop(columns=col, inplace=True)
    
    return data

In [31]:
data = drop_sparse_columns(data, 0.8, safe_cols=['damage', 'acft_category', 'acft_make', 'acft_model'])

In [32]:
# Already processed
data.drop(columns=['Aircraft', 'Aircraft_Key', 'ev_id', 'finding_description'], inplace=True)

# Possible data leakage
data.drop(columns=['acft_fire', 'acft_expl'], inplace=True)

# Probably not relevant (ev_time seemingly boils down to light_cond)
data.drop(columns=['wx_dew_pt', 'type_fly', 'ev_time'], inplace=True)

# (Almost) all rows have same value
data.drop(columns=['certs_held', 'unmanned'], inplace=True)

In [33]:
## Handling missing `damage` values
## Only about 3.9% of the entries in the damage column are missing, so we drop these rows instead of imputing them.
data = data.loc[data['damage'].notna()].copy()

In [34]:
## Imputing other categorical values

for col in data.columns:
    # fill missing entries of string-valued columns with a catch-all category
    if data[col].dtype == 'object' and data[col].isna().any():
        data[col] = data[col].fillna('other/unknown')

In [35]:
## Imputing findings

data[['Environmental issues', 'Organizational issues', 'Personnel issues']] = data[['Environmental issues', 'Organizational issues', 'Personnel issues']].fillna(0)

In [36]:
## Imputing total person count
## Only four rows are missing total_person_count (event_key '20080505X00589_2', '20130118X53100_2', '20160218X94149_2', '20170913X72254_2').
## Whatever the exact circumstances (e.g., parked at the airport or taxiing), there were no occupants in these second aircraft, as all reported injuries (inj_tot_t) are attributed to the first aircraft in each event.
count = ['event_key', 'Fatal_count', 'Minor_count', 'None_count', 'Serious_count', 'total_person_count', 'injured_person_count','ground_injury_total', 'inj_tot_t']
missing_total = data['total_person_count'].isna()
data.loc[missing_total, count] = data.loc[missing_total, count].fillna(0)

In [37]:
## In the data, single-engine aircraft have fewer than 20 seats (apart from an apparent typo).

## For single-engine aircraft, check the maximum number of seats.
print('Maximum number of total seats in single engine aircraft', np.max(data[data['num_eng']==1]['total_seats'].fillna(0).values))
## If this prints 3300, that value must be a typo.


## Impute num_eng = 1 when the total number of seats is at most 20.
small_missing = data['num_eng'].isna() & (data['total_seats'] <= 20)
data.loc[small_missing, 'num_eng'] = 1

## Only a few cases have both total seats and the number of engines blank.
## The table below lists those with more than 20 people on board (six rows).
print('Below table shows accidents whose total seats and the number of engines are blank.')
print(data[(data['num_eng'].isna())& (data['total_seats'].isna()) & (data['total_person_count']>20)][['event_key','acft_model']])

## We checked each acft_model in the table, and all of them typically have two engines.
both_missing = data['num_eng'].isna() & data['total_seats'].isna()
data.loc[both_missing, 'num_eng'] = 2

## The other remaining cases also typically have two engines.
data.loc[data['num_eng'].isna(), 'num_eng'] = 2

Maximum number of total seats in single engine aircraft 3300.0
Below table shows accidents whose total seats and the number of engines are blank.
              event_key    acft_model
8320   20141217X43728_1           737
8453   20150305X42958_1         MD 88
9234   20151029X44249_1           767
9479   20160225X92701_1        EMB145
11463  20171219X90251_1  ERJ170-200LR
12086  20180725X32722_1          MD88


In [38]:
## Check whether any column other than 'total_seats' still needs imputation.

for col in data.columns:
    mask = data[col].isna()
    if any(mask):
        print(col)

total_seats


In [39]:
data.to_csv('../data/ntsb/cleaned/master_train_imputed.csv',index=False)