# Airline Data Modeling
___

This will be a classification model intended to predict budget vs. non-budget commercial passenger airlines based on safety data collected from the NTSB aviation accident [database](https://www.ntsb.gov/Pages/AviationQueryV2.aspx). The budget airlines will be the target and assigned as '1'.

## Contents
---
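- [Data Cleaning & Setting Target](#Data-Cleaning-&-Setting-Target)
- [Train/Test/Split & Base Model](#Train/Test/Split-&-Base-Model)
- [Feature Engineering](#Feature-Engineering)
- [Modeling](#Modeling)
- [Predictions and Interpretation](#Predictions-and-Interpretation)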

#### Budget Airlines (Target)

* Southwest
* Alaska
* Frontier
* JetBlue
* Allegiant
* Spirit
* Sun Country

#### Non-Budget Airlines

* American
* United
* Delta
* US Airways 
* Continental
* Hawaiian

Note: US Airways merged with American and Continental merged with United.

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

import warnings
warnings.filterwarnings("ignore", message="Found unknown categories in columns")

## Data Cleaning & Setting Target
___

Read in the data and drop the columns we won't need for modeling. Tail number and lat/long were kept in the file for research purposes only; tail number is great for looking up plane models.


In [2]:
airlines_df = pd.read_csv('./data/text_processed_aviation_data.csv')

In [3]:
airlines_df.columns

Index(['event_type', 'event_date', 'tail_number', 'highest_injury_level',
       'fatal_injury_count', 'serious_injury_count', 'minor_injury_count',
       'probable_cause', 'latitude', 'longitude', 'airport_id', 'operator',
       'make', 'aircraft_damage', 'model'],
      dtype='object')

In [4]:
columns_to_drop = ['tail_number', 'event_date', 'latitude', 'longitude']

In [5]:
airlines_df.drop(columns = columns_to_drop, inplace = True)

In [6]:
airlines_df['operator'].value_counts()

operator
american       129
delta          102
united          89
southwest       62
continental     54
us airways      52
alaska          22
frontier        15
jetblue         11
hawaiian         7
allegiant        6
spirit           5
sun country      4
Name: count, dtype: int64

Binarize the budget and non-budget airlines.

In [7]:
target_map = {'american':0,
              'delta':0,
              'united':0,
              'southwest':1,
              'continental':0,
              'us airways':0,
              'alaska':1,
              'frontier':1,
              'jetblue':1,
              'hawaiian':0,
              'allegiant':1,
              'spirit':1,
              'sun country':1
             }

# Every operator appears in target_map, so map() converts the whole column to 0/1
airlines_df['operator'] = airlines_df['operator'].map(target_map)

In [8]:
# Note that the classes are imbalanced, so the train/test split should be stratified.
airlines_df['operator'].value_counts()

operator
0    433
1    125
Name: count, dtype: int64

## Train/Test/Split & Base Model
___

In [9]:
# Set X and y
y = airlines_df['operator']
X = airlines_df.drop(columns = 'operator')

In [10]:
# Train/Test/Split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state=42,
                                                   stratify=y)  

**What is the base model?**
* Always predicting the majority class (non-budget) is correct about **77.5%** of the time on the training set.

In [11]:
y_train.value_counts(normalize = True)

operator
0    0.77512
1    0.22488
Name: proportion, dtype: float64
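
The baseline can also be reproduced by hand: a model that always predicts the majority class is right as often as that class occurs. Below is a minimal sketch in plain Python, using the overall class counts printed earlier (433 non-budget, 125 budget) rather than the training split, so the figure differs slightly from the proportion above. `majority_baseline` is a hypothetical helper name, not part of the notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, accuracy) for a model that always predicts the majority class."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Overall class counts taken from the value_counts() output earlier in the notebook
labels = [0] * 433 + [1] * 125

majority_class, baseline_accuracy = majority_baseline(labels)
print(majority_class, round(baseline_accuracy, 4))
```

Within scikit-learn, `DummyClassifier(strategy='most_frequent')` should give the same kind of baseline through the usual `fit`/`score` interface.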

In [12]:
#Ensuring that stratify is maintaining the slight skew between the types of airlines.
y_train.value_counts(normalize = True), y_test.value_counts(normalize = True)

(operator
 0    0.77512
 1    0.22488
 Name: proportion, dtype: float64,
 operator
 0    0.778571
 1    0.221429
 Name: proportion, dtype: float64)

## Feature Engineering
___

All of the columns in the dataset are categorical except for the fatal/serious/minor injury counts and the free-text probable_cause column. The categorical columns will be one-hot encoded, the injury counts will be standardized with a standard scaler, and probable_cause will be count-vectorized. I am not sure whether the highest_injury_level column is helpful when paired with the quantitative injury counts. Possibly something to explore given time.

In [13]:
ohe = OneHotEncoder()

In [14]:
#Set up a column transformer

ohe_columns = ['event_type', 'highest_injury_level', 'airport_id', 'make', 'aircraft_damage', 'model']
ss_columns = ['fatal_injury_count','serious_injury_count', 'minor_injury_count']


ctx = ColumnTransformer(
    transformers = [
        ('ohe', OneHotEncoder(drop = 'first',
                              handle_unknown='ignore',
                              sparse_output = False),
                                ohe_columns),
        ('sc', StandardScaler(), ss_columns),
        ('cvec', CountVectorizer(), 'probable_cause')
    ],
    verbose_feature_names_out = False,
    remainder = 'passthrough'
)


In [15]:
pipe = Pipeline([
    ('ctx', ctx),
    ('lr', LogisticRegression())
])


## Modeling
___

Logistic Regression and CountVectorizer

In [16]:
pipe.fit(X_train, y_train)

In [17]:
pipe.score(X_train, y_train)

0.9784688995215312

In [18]:
pipe.score(X_test, y_test)



0.7785714285714286

This 78% test score is essentially the same as the 77.5% baseline, and with 98% accuracy on the training set the model is extremely overfit. I assume the unknown-categories warning raised at this step (silenced in the imports cell) occurs because some airplane models and airports appear only once in the entire set.
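
One way to check the single-instance assumption is to count how often each category occurs and fold the rare ones into a shared bucket before encoding. The sketch below uses plain Python and made-up airport codes; `bucket_rare` and the `'other'` label are hypothetical and not part of the notebook.

```python
from collections import Counter

def bucket_rare(values, min_count=2, other_label="other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Made-up airport codes for illustration only
airport_ids = ["ORD", "ORD", "ATL", "DEN", "ATL", "SEA", "ORD", "PDX"]

# Categories that occur exactly once can never appear in both train and test
counts = Counter(airport_ids)
singletons = sorted(v for v, n in counts.items() if n == 1)
print("Categories seen once:", singletons)

bucketed = bucket_rare(airport_ids)
print(bucketed)
```

On the real data the same count would come from `airlines_df['airport_id'].value_counts()`. Recent scikit-learn versions (1.1 and later) also offer a `min_frequency` argument on `OneHotEncoder` that groups infrequent categories, which may be worth trying here.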

### Parameter Tweaking

Used a range of values for each of these parameters except the penalty and solver. Confirmed that these were the best parameters by also checking values at the higher and lower ends of each range.

In [19]:
pipe_params = {
    'ctx__cvec__max_features': [1500], 
    'ctx__cvec__min_df': [0.0001], 
    'ctx__cvec__max_df': [0.5], 
    'ctx__cvec__ngram_range': [(1,1)],
    'lr__C': [1.0],
    'lr__penalty': ['l2'],
    'lr__solver': ['liblinear']
}

In [20]:
gs = GridSearchCV(pipe,
                  pipe_params,
                  cv = 5)

In [21]:
gs.fit(X_train, y_train)



In [22]:
#What was the best score?
print(f'The best accuracy score in all models tested in grid search is {round(gs.best_score_* 100,2)}%')

The best accuracy score in all models tested in grid search is 83.49%


In [23]:
# Breakdown of the parameters chosen to make the best model.  Liblinear was the only option added to pipe parameters.
gs.best_params_

{'ctx__cvec__max_df': 0.5,
 'ctx__cvec__max_features': 1500,
 'ctx__cvec__min_df': 0.0001,
 'ctx__cvec__ngram_range': (1, 1),
 'lr__C': 1.0,
 'lr__penalty': 'l2',
 'lr__solver': 'liblinear'}
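
Because every key in `pipe_params` holds a single value, the search above evaluates one candidate across five folds. The sketch below gives a rough idea of how the number of fits grows once ranges are restored. The ranges are hypothetical, since the values originally searched are not recorded in the notebook.

```python
from itertools import product

# Hypothetical ranges for illustration; not the values originally searched
param_grid = {
    "ctx__cvec__max_features": [500, 1000, 1500],
    "ctx__cvec__max_df": [0.5, 0.75, 1.0],
    "ctx__cvec__ngram_range": [(1, 1), (1, 2)],
    "lr__C": [0.01, 0.1, 1.0, 10.0],
}
cv_folds = 5

# Every combination of values is one candidate model
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

With only 418 training rows each fit is cheap, but the count is a useful check before widening the grid further.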

## Predictions and Interpretation
___

In [24]:
#Training set results
print(f'Training accuracy score is at {round(gs.score(X_train, y_train)* 100,2)}%')

Training accuracy score is at 97.85%


In [25]:
# Test accuracy is about 78%, the same as the null model, so this model
# is unable to predict airline type based on safety data.
print(f'Testing accuracy score is at {round(gs.score(X_test, y_test)* 100,2)}%')

Testing accuracy score is at 77.86%




In [26]:
# Calculating model predictions (budget airlines = 1 and non-budget = 0)
# and the predicted probabilities for each row.
pred = gs.predict(X_test)
pred[:20]



array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)

In [27]:
gs.predict_proba(X_test)[:20]



array([[0.93245872, 0.06754128],
       [0.78279545, 0.21720455],
       [0.93095636, 0.06904364],
       [0.93007834, 0.06992166],
       [0.67584885, 0.32415115],
       [0.69038505, 0.30961495],
       [0.28159463, 0.71840537],
       [0.98929777, 0.01070223],
       [0.90205364, 0.09794636],
       [0.91409828, 0.08590172],
       [0.84746087, 0.15253913],
       [0.99513452, 0.00486548],
       [0.62827619, 0.37172381],
       [0.97030588, 0.02969412],
       [0.97605314, 0.02394686],
       [0.83227688, 0.16772312],
       [0.989321  , 0.010679  ],
       [0.52586823, 0.47413177],
       [0.98541353, 0.01458647],
       [0.95057227, 0.04942773]])
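
For a binary logistic regression, `predict` labels a row as budget only when its budget-class probability exceeds 0.5, which is why a single 1 appears among the first 20 predictions. The sketch below shows how the threshold changes that count, using the 20 budget-class probabilities printed above (rounded to four places). The 0.3 threshold is an arbitrary illustration, not a tuned value, and `label_at` is a hypothetical helper.

```python
# Budget-class (second column) probabilities copied from the output above, rounded
budget_proba = [
    0.0675, 0.2172, 0.0690, 0.0699, 0.3242, 0.3096, 0.7184, 0.0107, 0.0979, 0.0859,
    0.1525, 0.0049, 0.3717, 0.0297, 0.0239, 0.1677, 0.0107, 0.4741, 0.0146, 0.0494,
]

def label_at(probabilities, threshold):
    """Assign 1 (budget) when the probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

default_labels = label_at(budget_proba, 0.5)
lenient_labels = label_at(budget_proba, 0.3)

print(sum(default_labels), "rows flagged as budget at 0.5")
print(sum(lenient_labels), "rows flagged as budget at 0.3")
```

A lower threshold flags more rows as budget, but whether that helps depends on how many of those extra rows are truly budget airlines, which requires comparing against `y_test`.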

In [28]:
# Accuracy Score - What percentage of total predictions were correct?
print(f'This model has an accuracy score of {round(metrics.accuracy_score(y_test, pred) * 100,2)}%')

This model has an accuracy score of 77.86%
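
Accuracy alone hides how the model treats the minority class: on this split, a classifier that always predicts 0 would score about the same. The sketch below shows the per-class breakdown in plain Python, using made-up labels rather than the notebook's `y_test` and `pred`. `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 (budget) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

# Made-up labels for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of budget rows the model found
precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of budget predictions that were right

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

On the real predictions, `metrics.confusion_matrix(y_test, pred)` and `metrics.classification_report(y_test, pred)` should report the same quantities without the hand-written helper.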
