# Trees & Forests & Bagging & Boosting
___

None of these models did better than the baseline score of about 78%, and some did worse.  It seems safe to assume that we are unable to make a predictive model based on the gathered NTSB data on commercial aviation accidents and incidents.

All pipe parameters in this notebook have already been searched over to find ideal settings, so each parameter grid lists only the best value found.

## Contents
---
- [Data Cleaning & Setting Target](#Data-Cleaning-&-Setting-Target)
- [Train/Test/Split & Base Model](#Train/Test/Split-&-Base-Model)
- [Decision Tree Bagging](#Decision-Tree-Bagging)
- [Random Forests](#Random-Forests)
- [Adaboost](#Adaboost)
- [GradientBoost](#GradientBoost)

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings("ignore", message="Found unknown categories in columns")

### Data Cleaning & Setting Target
___

In [2]:
airlines_df = pd.read_csv('./data/text_processed_aviation_data.csv')

In [3]:
columns_to_drop = ['tail_number', 'event_date', 'latitude', 'longitude']
airlines_df.drop(columns = columns_to_drop, inplace = True)

In [4]:
target_map = {'american':0,
              'delta':0,
              'united':0,
              'southwest':1,
              'continental':0,
              'us airways':0,
              'alaska':1,
              'frontier':1,
              'jetblue':1,
              'hawaiian':0,
              'allegiant':1,
              'spirit':1,
              'sun country':1
             }

airlines_df['operator'] = airlines_df['operator'].replace(target_map)
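One thing to watch with `replace` is that an operator name missing from the map passes through unchanged as a string. The sketch below shows one way to check for that with `Series.map`, which returns `NaN` for unmapped keys. The frame `demo_df`, the shortened `demo_map`, and the operator `'hypothetical air'` are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical operators; 'hypothetical air' is deliberately absent from the map.
demo_df = pd.DataFrame({"operator": ["delta", "spirit", "hypothetical air"]})
demo_map = {"delta": 0, "spirit": 1}

# map() yields NaN for any key that is not in the dictionary.
mapped = demo_df["operator"].map(demo_map)

# Collect the operator names that did not receive a 0/1 label.
unmapped = demo_df.loc[mapped.isna(), "operator"].unique().tolist()
print(unmapped)
```

If the list is empty, every operator received a label and the target column is safe to use as a binary target.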

### Train/Test/Split & Base Model
___

In [5]:
# Set X and y
y = airlines_df['operator']
X = airlines_df.drop(columns = 'operator')

In [6]:
# Train/Test/Split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state=42,
                                                   stratify=y)  
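As a quick illustration of what `stratify=y` does, the sketch below splits a made-up label vector with about the same class balance as this data and compares the class shares on each side. The arrays `X_demo` and `y_demo` are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 310 non-budget (0) and 90 budget (1), i.e. 22.5% budget.
y_demo = np.array([0] * 310 + [1] * 90)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# With stratification the budget share should be nearly identical in both splits.
print(f"Overall budget share: {y_demo.mean():.3f}")
print(f"Train budget share:   {y_tr.mean():.3f}")
print(f"Test budget share:    {y_te.mean():.3f}")
```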

**What is the base model?**
* Always predicting the majority class (non-budget airline, `0`) is correct about **77.5%** of the time.  This is the baseline the models need to beat.

In [7]:
y_train.value_counts(normalize = True)

operator
0    0.77512
1    0.22488
Name: proportion, dtype: float64
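The same baseline can be expressed as a model. The sketch below uses scikit-learn's `DummyClassifier` with `strategy="most_frequent"` on made-up labels that have the same 77.5% / 22.5% balance. `X_demo` and `y_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with a 77.5% / 22.5% class balance.
y_demo = np.array([0] * 310 + [1] * 90)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any real model should score above this number on held-out data before it can be said to have learned anything.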

### Decision Tree Bagging
___

Again, this model was no better than the baseline of 77%.  It actually does worse, at 72% on the testing set.  I will try Random Forests next.

In [8]:
#Set up a column transformer

ohe_columns = ['event_type', 'highest_injury_level', 'airport_id', 'make', 'aircraft_damage', 'model']
ss_columns = ['fatal_injury_count','serious_injury_count', 'minor_injury_count']


ctx = ColumnTransformer(
    transformers = [
        ('ohe', OneHotEncoder(drop = 'first',
                              handle_unknown='ignore',
                              sparse_output = False),
                                ohe_columns),
        ('sc', StandardScaler(), ss_columns),
        ('cvec', CountVectorizer(), 'probable_cause')
    ],
    verbose_feature_names_out = False,
    remainder = 'passthrough'
)


In [9]:
pipe = Pipeline([
    ('ctx', ctx),
    ('bag', BaggingClassifier(estimator=DecisionTreeClassifier(), random_state = 42))
])


In [10]:
pipe_params = {
    'ctx__cvec__max_features': [200],
    'ctx__cvec__min_df': [3],
    'ctx__cvec__max_df': [0.50],
    'ctx__cvec__ngram_range': [(1,3)],
    'bag__n_estimators': [50],
    'bag__estimator__max_depth': [5],
}

In [11]:
#GridSearch set-up
gs = GridSearchCV(pipe,
                  pipe_params,
                  cv = 5)

In [12]:
# Fit the grid search on the training data.
gs.fit(X_train, y_train)

In [13]:
print(gs.score(X_train, y_train))
gs.score(X_test, y_test)

0.9210526315789473


0.7214285714285714

In [14]:
gs.best_params_

{'bag__estimator__max_depth': 5,
 'bag__n_estimators': 50,
 'ctx__cvec__max_df': 0.5,
 'ctx__cvec__max_features': 200,
 'ctx__cvec__min_df': 3,
 'ctx__cvec__ngram_range': (1, 3)}

#### Predictions and Interpretation
___

In [15]:
#Training set results
print(f'Training accuracy score is at {round(gs.score(X_train, y_train)* 100,2)}%')

Training accuracy score is at 92.11%


In [16]:
# Testing accuracy is at 72% - Null model is at 77%, so this model
# is unable to predict airline based on safety data.
print(f'Testing accuracy score is at {round(gs.score(X_test, y_test)* 100,2)}%')

Testing accuracy score is at 72.14%


In [17]:
# Calculating model predictions (budget airlines = 1 and non-budget = 0)
# and the predicted probability of each class for every row.
pred = gs.predict(X_test)
pred[:20]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)
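When the predictions are almost all `0`, accuracy alone hides how the minority class is handled. The sketch below shows how a confusion matrix, recall, and balanced accuracy expose that for a model that always predicts non-budget. The labels in `y_true_demo` are hypothetical and are not the notebook's `y_test`.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, recall_score)

# Hypothetical test labels: 109 non-budget (0) and 31 budget (1).
y_true_demo = np.array([0] * 109 + [1] * 31)
# A model that predicts the majority class for every row.
y_pred_demo = np.zeros_like(y_true_demo)

cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)
budget_recall = recall_score(y_true_demo, y_pred_demo)
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(f"Accuracy: {acc:.3f}")
print(f"Recall for budget airlines: {budget_recall:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
```

Here accuracy looks respectable while recall for the budget class is zero, which is the pattern to check for on the real predictions.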

In [18]:
gs.predict_proba(X_test)[:20]

array([[0.88961631, 0.11038369],
       [0.63600845, 0.36399155],
       [0.88513531, 0.11486469],
       [0.934147  , 0.065853  ],
       [0.50167265, 0.49832735],
       [0.5122516 , 0.4877484 ],
       [0.50421764, 0.49578236],
       [0.94732597, 0.05267403],
       [0.90948058, 0.09051942],
       [0.62632697, 0.37367303],
       [0.55943129, 0.44056871],
       [0.94732597, 0.05267403],
       [0.51610448, 0.48389552],
       [0.94732597, 0.05267403],
       [0.94732597, 0.05267403],
       [0.90977051, 0.09022949],
       [0.92785695, 0.07214305],
       [0.63350577, 0.36649423],
       [0.94732597, 0.05267403],
       [0.94732597, 0.05267403]])
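Several of these rows sit just under 0.5 for the budget class, so the predicted label is sensitive to the decision threshold. The sketch below applies two thresholds to the class-1 probabilities from the first seven rows above. The 0.45 cut-off is an arbitrary, hypothetical value chosen only to show the effect.

```python
import numpy as np

# Class-1 (budget) probabilities copied from the first seven rows of the output above.
proba_budget = np.array([0.11038369, 0.36399155, 0.11486469, 0.065853,
                         0.49832735, 0.4877484, 0.49578236])

# Default rule: predict budget only when its probability reaches 0.5.
default_pred = (proba_budget >= 0.5).astype(int)

# Hypothetical lower threshold, used purely for illustration.
lowered_pred = (proba_budget >= 0.45).astype(int)

print(default_pred)
print(lowered_pred)
```

Lowering the threshold trades more budget predictions for more false positives, so any real choice would need to be validated on held-out data.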

### Random Forests
---

Once again, the model is severely overfit (99.76% training accuracy) and the testing score is essentially the same as the baseline.

In [19]:
rf = RandomForestClassifier()

In [20]:
pipe_params = {
    'ctx__cvec__max_features': [200],
    'ctx__cvec__min_df': [3],
    'ctx__cvec__max_df': [0.50],
    'ctx__cvec__ngram_range': [(1,3)],
    'rf__n_estimators': [200],
    'rf__max_depth': [None],
    'rf__random_state': [42]
}

In [21]:
pipe = Pipeline([
    ('ctx', ctx),
    ('rf', RandomForestClassifier())
])

In [22]:
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 cv = 5,
                 n_jobs=-1)

In [23]:
gs.fit(X_train, y_train)

In [24]:
gs.best_params_

{'ctx__cvec__max_df': 0.5,
 'ctx__cvec__max_features': 200,
 'ctx__cvec__min_df': 3,
 'ctx__cvec__ngram_range': (1, 3),
 'rf__max_depth': None,
 'rf__n_estimators': 200,
 'rf__random_state': 42}

#### Predictions and Interpretation
___

In [25]:
#Training set results
print(f'Training accuracy score is at {round(gs.score(X_train, y_train)* 100,2)}%')

Training accuracy score is at 99.76%


In [26]:
# Testing accuracy is at 78% - Null model is at 77%, so this model
# is unable to predict airline based on safety data.
print(f'Testing accuracy score is at {round(gs.score(X_test, y_test)* 100,2)}%')

Testing accuracy score is at 77.86%
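A gap this large between training and testing accuracy is the usual sign of overfitting. One common way to probe it is to compare training accuracy with cross-validated accuracy for an unrestricted forest and a depth-limited one. The sketch below does that on synthetic, noisy data. The data and the depth value of 3 are assumptions, and the numbers it prints say nothing about the NTSB data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced, noisy data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=3,
                                     weights=[0.78, 0.22], flip_y=0.2,
                                     random_state=42)

rf_results = {}
for depth in [None, 3]:  # None = fully grown trees, 3 = hypothetical depth cap
    forest = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                    random_state=42)
    # Mean accuracy over 5 held-out folds.
    cv_acc = cross_val_score(forest, X_demo, y_demo, cv=5).mean()
    # Accuracy on the same rows the forest was fit on.
    train_acc = forest.fit(X_demo, y_demo).score(X_demo, y_demo)
    rf_results[depth] = (train_acc, cv_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```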


In [27]:
# Calculating model predictions (budget airlines = 1 and non-budget = 0)
# and the predicted probability of each class for every row.
pred = gs.predict(X_test)
pred[:20]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)

In [28]:
gs.predict_proba(X_test)[:20]

array([[0.795, 0.205],
       [0.665, 0.335],
       [0.785, 0.215],
       [0.915, 0.085],
       [0.66 , 0.34 ],
       [0.64 , 0.36 ],
       [0.68 , 0.32 ],
       [1.   , 0.   ],
       [0.795, 0.205],
       [0.81 , 0.19 ],
       [0.695, 0.305],
       [0.92 , 0.08 ],
       [0.655, 0.345],
       [0.765, 0.235],
       [0.935, 0.065],
       [0.8  , 0.2  ],
       [0.96 , 0.04 ],
       [0.715, 0.285],
       [1.   , 0.   ],
       [0.965, 0.035]])

### Adaboost
___

The model score is, again, the same as the baseline score.  I used AdaBoost with both a DecisionTreeClassifier and a RandomForestClassifier as the base estimator, with the same results.

In [29]:
pipe_params = {
    'ctx__cvec__max_features': [200],
    'ctx__cvec__min_df': [3],
    'ctx__cvec__max_df': [0.25],
    'ctx__cvec__ngram_range': [(1,3)],
    'ada__n_estimators': [50],
    'ada__learning_rate': [0.5],
    'ada__random_state': [24]
}

In [30]:
pipe = Pipeline([
    ('ctx', ctx),
    ('ada', AdaBoostClassifier(estimator = RandomForestClassifier()))
])

In [31]:
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 cv = 5,
                 n_jobs=-1)

In [32]:
gs.fit(X_train, y_train)

In [33]:
gs.best_params_

{'ada__learning_rate': 0.5,
 'ada__n_estimators': 50,
 'ada__random_state': 24,
 'ctx__cvec__max_df': 0.25,
 'ctx__cvec__max_features': 200,
 'ctx__cvec__min_df': 3,
 'ctx__cvec__ngram_range': (1, 3)}

In [34]:
#Training set results
print(f'Training accuracy score is at {round(gs.score(X_train, y_train)* 100,2)}%')

Training accuracy score is at 99.76%


In [35]:
# Testing accuracy is at 77% - Null model is at 77%, so this model is still
# unable to predict airline based on safety data.
print(f'Testing accuracy score is at {round(gs.score(X_test, y_test)* 100,2)}%')

Testing accuracy score is at 77.14%


In [36]:
# Calculating model predictions (budget airlines = 1 and non-budget = 0)
# and the predicted probability of each class for every row.
pred = gs.predict(X_test)
pred[:20]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)

In [37]:
gs.predict_proba(X_test)[:20]

array([[8.46944817e-01, 1.53055183e-01],
       [7.59457490e-01, 2.40542510e-01],
       [7.35539486e-01, 2.64460514e-01],
       [6.61927652e-01, 3.38072348e-01],
       [6.17833265e-01, 3.82166735e-01],
       [5.41328230e-01, 4.58671770e-01],
       [5.84497931e-01, 4.15502069e-01],
       [9.99926377e-01, 7.36225377e-05],
       [8.17147506e-01, 1.82852494e-01],
       [7.70160438e-01, 2.29839562e-01],
       [6.91399205e-01, 3.08600795e-01],
       [9.06243822e-01, 9.37561781e-02],
       [6.26745277e-01, 3.73254723e-01],
       [7.71733054e-01, 2.28266946e-01],
       [9.28709058e-01, 7.12909421e-02],
       [7.42364995e-01, 2.57635005e-01],
       [9.72267127e-01, 2.77328727e-02],
       [7.33808089e-01, 2.66191911e-01],
       [9.98813685e-01, 1.18631474e-03],
       [9.14922561e-01, 8.50774394e-02]])

### GradientBoost
___

In [38]:
pipe_params = {
    'ctx__cvec__max_features': [200],
    'ctx__cvec__min_df': [3],
    'ctx__cvec__max_df': [0.25],
    'ctx__cvec__ngram_range': [(1,3)],
    'gb__n_estimators': [25],
    'gb__max_depth': [None],
    'gb__random_state': [24]
}

In [39]:
pipe = Pipeline([
    ('ctx', ctx),
    ('gb', GradientBoostingClassifier())
])

In [40]:
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 cv = 5,
                 n_jobs=-1)

In [41]:
gs.fit(X_train, y_train)

In [42]:
gs.best_params_

{'ctx__cvec__max_df': 0.25,
 'ctx__cvec__max_features': 200,
 'ctx__cvec__min_df': 3,
 'ctx__cvec__ngram_range': (1, 3),
 'gb__max_depth': None,
 'gb__n_estimators': 25,
 'gb__random_state': 24}

In [43]:
#Training set results
print(f'Training accuracy score is at {round(gs.score(X_train, y_train)* 100,2)}%')

Training accuracy score is at 99.76%


In [44]:
# Testing accuracy is at 73% - Null model is at 77%, so this model is worse than
# the baseline.
print(f'Testing accuracy score is at {round(gs.score(X_test, y_test)* 100,2)}%')

Testing accuracy score is at 72.86%


In [45]:
# Calculating model predictions (budget airlines = 1 and non-budget = 0)
# and the predicted probability of each class for every row.
pred = gs.predict(X_test)
pred[:20]

array([0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
      dtype=int64)

In [46]:
gs.predict_proba(X_test)[:20]

array([[0.98176227, 0.01823773],
       [0.98176227, 0.01823773],
       [0.98176227, 0.01823773],
       [0.98176227, 0.01823773],
       [0.05892014, 0.94107986],
       [0.05892014, 0.94107986],
       [0.05892014, 0.94107986],
       [0.98176227, 0.01823773],
       [0.98176227, 0.01823773],
       [0.98176227, 0.01823773],
       [0.98176227, 0.01823773],
       [0.98176227, 0.01823773],
       [0.05892014, 0.94107986],
       [0.98176227, 0.01823773],
       [0.98176227, 0.01823773],
       [0.05892014, 0.94107986],
       [0.98176227, 0.01823773],
       [0.98176227, 0.01823773],
       [0.98176227, 0.01823773],
       [0.98176227, 0.01823773]])
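The probabilities above take only two distinct values. That is what one might expect when `'gb__max_depth': [None]` lets every tree grow until it separates the two training classes completely, although that reading is an assumption and not something this notebook confirms. The sketch below compares fully grown trees with a depth cap of 3 (scikit-learn's default for this estimator) on synthetic data. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced, slightly noisy data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.78, 0.22], flip_y=0.1,
                                     random_state=24)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          stratify=y_demo, random_state=24)

gb_scores = {}
for depth in [None, 3]:  # None = fully grown trees, 3 = scikit-learn's default
    gb = GradientBoostingClassifier(n_estimators=25, max_depth=depth,
                                    random_state=24)
    gb.fit(X_tr, y_tr)
    gb_scores[depth] = (gb.score(X_tr, y_tr), gb.score(X_te, y_te))
    # Count how many different class-1 probabilities the model outputs.
    n_distinct = len(np.unique(gb.predict_proba(X_te)[:, 1].round(8)))
    print(f"max_depth={depth}: train={gb_scores[depth][0]:.3f}, "
          f"test={gb_scores[depth][1]:.3f}, distinct probabilities={n_distinct}")
```

If the depth-limited model narrows the train/test gap on the real data, adding finite values to the `gb__max_depth` grid would be a natural next step.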