# NTSB Feature Selection

This notebook examines the rate of missingness for each column in the master NTSB dataset, as well as how well populated each dummy variable is, so that we can pare down features that are unlikely to be useful or predictive.

### Note: 
This notebook uses the **missingno** package, which can be installed by running `pip install missingno` in your terminal.
***


In [20]:
import pandas as pd
import missingno as mno

In [21]:
data = pd.read_csv("../data/ntsb/cleaned/master_train.csv")

In [22]:
# number of missing entries in each column
missing = pd.DataFrame(data.isna().sum())
# the same counts expressed as a fraction of all rows
missing_pct = missing / len(data)

In [None]:
# mno.bar(data,sort='ascending')
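
As a rough sketch of how the missingness fractions can be ranked without the plot, the snippet below uses a small made-up DataFrame. The toy values are assumptions for illustration only and are not taken from the NTSB data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NTSB data; the values are illustrative only
toy = pd.DataFrame({
    "ev_year": [2009, 2018, 2016, 2013],
    "apt_dist": [0.0, np.nan, 0.0, 3.0],
    "acft_year": [np.nan, np.nan, np.nan, 1979.0],
})

# isna() yields booleans, so the column-wise mean is the fraction missing
toy_missing_pct = toy.isna().mean().sort_values(ascending=False)
print(toy_missing_pct)
```

Sorting in descending order puts the sparsest columns first, which makes candidates for dropping easy to spot.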

### Drop columns we don't need

- The function below drops columns whose proportion of non-empty entries falls below a threshold, except for columns explicitly marked as safe.

In [28]:
def drop_sparse_columns(data, threshold, safe_cols=None):
    '''
    Drops columns from data that do not contain at least a given proportion of non-empty entries
    
    Inputs
        data: pandas DataFrame
        threshold: float in [0,1], all columns with less than this proportion of non-empty entries are dropped
        safe_cols: list of names of columns that should not be dropped even if they are below the sparsity threshold
    Outputs
        data: same DataFrame with appropriate columns dropped
    '''
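    # treat a missing safe_cols argument as "no protected columns"
    if safe_cols is None:
        safe_cols = []

    # columns that may be dropped if they are too sparse
    unsafe_cols = [col for col in data.columns if col not in safe_cols]

    for col in unsafe_cols:
        # calculate proportion of na entries in col
        prop_na = data[col].isna().sum() / len(data)
        
        # drop col if the column is too sparse
        if prop_na > 1 - threshold:
            # columns= is required; a bare label would be looked up among the row labels
            data.drop(columns=col, inplace=True)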
    
    return data
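
The same rule can also be written without an explicit loop. The following is a minimal sketch under the assumption that returning a new DataFrame (rather than modifying the input in place) is acceptable; `drop_sparse_columns_vectorized` and the toy data are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns_vectorized(data, threshold, safe_cols=None):
    # Hypothetical variant: returns a new DataFrame and leaves the input untouched
    safe_cols = set(safe_cols) if safe_cols is not None else set()
    # fraction of non-empty entries in each column
    prop_present = data.notna().mean()
    too_sparse = [
        col for col in data.columns
        if prop_present[col] < threshold and col not in safe_cols
    ]
    return data.drop(columns=too_sparse)

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],                # fully populated
    "b": [np.nan, np.nan, np.nan, 1.0],       # 25% populated
    "c": [np.nan, np.nan, np.nan, np.nan],    # empty, but protected below
})

result = drop_sparse_columns_vectorized(toy, 0.3, safe_cols=["c"])
print(list(result.columns))
```

Here `b` falls below the 0.3 threshold and is dropped, while `c` survives only because it is listed in `safe_cols`.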

In [None]:
# Passing safe_cols=data.columns protects every column, so this call is a no-op.
# Pass a shorter list of protected columns to actually drop sparse ones.
data = drop_sparse_columns(data, 0.3, safe_cols=data.columns)

### Get dummy variables for non-sparse values in categorical columns

In [27]:
def get_dummies_frequent_values(data, threshold, columns=None):
    '''
    Get dummy variables for the given columns, but only for values whose frequency is at least a given proportion.
    All other values get no dummy variable. Note that NA values are treated the same as infrequent values.

    Note: pass the columns argument explicitly so that non-categorical columns are not considered in the loop below.
    If columns is None, pd.get_dummies converts every column with object, string, or category dtype.
    
    Inputs
        data: pandas DataFrame
        threshold: float in [0,1], a dummy column is only kept for a value whose frequency is at least this proportion
        columns: list of column names to turn into dummies
    Outputs
        one_hot_data: pandas DataFrame of required dummy variables
    '''

    orig_cols = data.columns

    data = pd.get_dummies(data, columns=columns, dtype=int)

    # Use set difference to get list of new dummy columns
    new_cols = list(set(data.columns) - set(orig_cols))

    # drop sparse columns
    for col in new_cols:
        # frequency of value corresponding to this column (mean of a 0/1 indicator)
        freq = data[col].mean()

        if freq < threshold:
            data.drop(columns=col, inplace=True)

    # TODO drop columns as needed to avoid multicollinearity (if necessary, although I assume it will be)

    return data
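
To preview which values would keep a dummy column before encoding, the frequencies can be computed directly from the raw column. This is a minimal sketch on made-up data: `DAYL` and `NITE` appear in the notebook output, while `DUSK` and all the counts are assumptions for illustration. Note that dividing by `len(...)` counts NA rows in the denominator, which matches the function above, whereas `value_counts(normalize=True)` divides by the number of non-NA rows only.

```python
import numpy as np
import pandas as pd

# Illustrative column: 8 DAYL, 3 NITE, 1 DUSK, 8 missing (20 rows in total)
toy = pd.Series(["DAYL"] * 8 + ["NITE"] * 3 + ["DUSK"] + [np.nan] * 8,
                name="light_cond")

# Frequency over ALL rows, so NA rows count toward the denominator
freq_all_rows = toy.value_counts() / len(toy)

# Frequency over non-NA rows only (what normalize=True computes by default)
freq_non_na = toy.value_counts(normalize=True)

threshold = 0.1
frequent = freq_all_rows[freq_all_rows >= threshold].index.tolist()
print(frequent)
```

With many missing entries the two denominators can disagree noticeably, so it is worth deciding which one the threshold is meant to refer to.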

In [26]:
# threshold set to 0.03 so that dummies are created for DAYL and NITE only
# TODO will this work on the test / validation data if the proportions of each light_cond are a bit different?

# pass arguments by keyword: the signature is (data, threshold, columns)
data = get_dummies_frequent_values(data, threshold=0.03, columns=['light_cond'])
data.head()

Unnamed: 0,ev_id,ev_type,ev_time,ev_year,ev_month,on_ground_collision,latitude,longitude,apt_dist,wx_dew_pt,...,acft_year,fuel_on_board,unmanned,finding_description,Aircraft,Environmental issues,Organizational issues,Personnel issues,light_cond_DAYL,light_cond_NITE
0,20091106X42849,ACC,1810.0,2009,11,,018172N,0066245W,0.0,0.0,...,,,0.0,Personnel issues-Action/decision-Action-Delaye...,0.0,0.0,0.0,1.0,1,0
1,20181001X33722,ACC,307.0,2018,9,,061147N,1495635W,3.0,41.0,...,1959.0,27.0,0.0,Personnel issues-Action/decision-Action-Incorr...,0.0,1.0,0.0,2.0,1,0
2,20161201X23648,ACC,2015.0,2016,11,,302349N,0872056W,0.0,61.0,...,2012.0,10.0,0.0,Personnel issues-Task performance-Use of equip...,3.0,0.0,0.0,2.0,1,0
3,20130415X82334,ACC,1515.0,2013,4,,381332N,0805214W,0.0,34.0,...,,,0.0,Aircraft-Aircraft oper/perf/capability-Perform...,1.0,0.0,0.0,1.0,1,0
4,20160825X83222,ACC,1515.0,2016,8,,411436N,0765529W,0.0,57.0,...,1979.0,25.0,0.0,Personnel issues-Psychological-Attention/monit...,0.0,1.0,0.0,1.0,1,0
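
One possible way to address the TODO about test and validation data is to choose the frequent values on the training set once and then build the same indicator columns for every split. The sketch below is an assumption about how that could look, not the notebook's method; `fit_frequent_levels`, `dummies_for_levels`, and the toy data are hypothetical.

```python
import pandas as pd

def fit_frequent_levels(series, threshold):
    # Hypothetical helper: values whose share of all rows is at least threshold
    freq = series.value_counts() / len(series)
    return freq[freq >= threshold].index.tolist()

def dummies_for_levels(data, column, levels):
    # Hypothetical helper: one 0/1 column per pre-chosen level, original column removed
    out = data.drop(columns=column)
    for level in levels:
        out[f"{column}_{level}"] = (data[column] == level).astype(int)
    return out

# Illustrative splits with different value proportions
train = pd.DataFrame({"light_cond": ["DAYL"] * 6 + ["NITE"] * 3 + ["DUSK"]})
test = pd.DataFrame({"light_cond": ["DAYL", "DUSK", "DUSK", None]})

levels = fit_frequent_levels(train["light_cond"], 0.2)   # learned on train only
train_dummies = dummies_for_levels(train, "light_cond", levels)
test_dummies = dummies_for_levels(test, "light_cond", levels)
print(list(test_dummies.columns))
```

Because the levels are fixed from the training data, both splits end up with identical columns even when a value is rare or absent in one of them.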
