# Unit 2 Assessment

In this assignment, we will focus on airline incidents. The data set for this assignment includes information on the cost of bird strikes. Use it to see if you can predict the cost of a bird strike (i.e., the `Total Cost` column in the data set) based on the attributes of the incident. This matters because such a model can produce a cost estimate as soon as a bird strike incident happens.

## Description of Variables

The descriptions of the variables are provided in "Airline - Data Dictionary.docx".

## Goal

Use the **airline.csv** data set and build models to predict **Total Cost**.

**Be careful: this is a REGRESSION task**

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Important hints:

* This assignment requires you to work with a text-based column in addition to regular numeric/categorical columns. So you will have to pay attention to your pipelines during data processing.
* You can do your data prep before or after the train/test split. Regardless, you should use train_test_split only once. If you find yourself using it twice, it means you are doing something wrong.
* Recommended approach: 
    * import the data and perform the train/test split - like we always do. 
    * identify the names of numeric, categorical, feature engineered, and text columns - like we always do
    * create individual pipelines for each type of column - like we always do. For the text pipeline, I would recommend the TF-IDF vectorizer followed by SVD, though you can also use the TF-IDF vectorizer with the top N terms (without SVD).
    * combine all pipelines using the column transformer - like we always do 
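
The recommended approach above can be sketched on a tiny made-up data set. This is only an illustration under stated assumptions: the column names (`speed`, `phase`, `remarks`, `cost`) are hypothetical placeholders and not the `airline.csv` schema. It also shows one possible way to feed a text column to `TfidfVectorizer`: naming the column as a plain string (not a list) makes the column transformer pass a 1-D series of strings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; these column names are placeholders, not the real schema.
toy = pd.DataFrame({
    "speed": [120.0, 140.0, np.nan, 90.0, 200.0, 160.0, 110.0, 130.0],
    "phase": ["climb", "approach", "climb", "landing",
              "landing", "approach", "climb", "landing"],
    "remarks": ["bird hit engine", "struck windshield during approach",
                "engine ingested bird", "minor dent on wing",
                "bird hit landing gear", "windshield cracked",
                "engine damage reported", "gear inspected wing repaired"],
    "cost": [500.0, 1200.0, 3000.0, 0.0, 800.0, 1500.0, 2500.0, 100.0],
})

# Split exactly once, before any transformer is fitted.
train, test = train_test_split(toy, test_size=0.25, random_state=13)
train_y, test_y = train["cost"], test["cost"]
train_inputs = train.drop(columns="cost")
test_inputs = test.drop(columns="cost")

numeric_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                         ("scaler", StandardScaler())])
categorical_pipe = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])
text_pipe = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                      ("svd", TruncatedSVD(n_components=2, random_state=13))])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, ["speed"]),
    ("cat", categorical_pipe, ["phase"]),
    # A scalar column name hands the vectorizer a 1-D series of strings.
    ("text", text_pipe, "remarks"),
])

# Fit on the training rows only, then reuse the fitted transformer on the test rows.
train_x = preprocessor.fit_transform(train_inputs)
test_x = preprocessor.transform(test_inputs)
print(train_x.shape, test_x.shape)
```

Fitting the preprocessor on the training rows only keeps test information out of the imputation values, scaling statistics, and TF-IDF vocabulary.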

# Section 1: 

## Data Prep (5 points)

In [None]:
#Imports

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer

np.random.seed(13)

In [None]:
# Read in the dataset:

airline = pd.read_csv("airline.csv")
airline.head()

In [None]:
# Split the data
from sklearn.model_selection import train_test_split

train, test = train_test_split(airline, test_size=0.3)

In [None]:
# Identify the dependent variable for train and test
train_y = train['Total Cost']
test_y = test['Total Cost']

# Create independent variables by removing the dependent variable for train and test
train_inputs = train.drop(['Total Cost'], axis=1)
test_inputs = test.drop(['Total Cost'], axis=1)

## Feature Engineering (1 point)

Create one NEW feature from existing data. You can either transform a single variable or create a new variable from existing ones.

Grading: 
- 0.5 points for creating the new feature correctly
- 0.5 points for the justification of the new feature (i.e., why did you create this new feature)
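
As a rough illustration of transforming a single variable, the sketch below wraps a log transform in a `FunctionTransformer` so it can sit inside a pipeline. The column name `count` and the helper `add_log_count` are hypothetical and used only for this example; a log transform is just one possible choice for a right-skewed count.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_log_count(df):
    """Return a one-column frame holding log(1 + count) for a skewed count column."""
    out = df.copy()  # work on a copy so the caller's frame is not modified
    out["count_log"] = np.log1p(out["count"])
    return out[["count_log"]]

# "count" is a hypothetical column name, not taken from the data dictionary.
toy = pd.DataFrame({"count": [1, 1, 2, 5, 40]})

log_transformer = FunctionTransformer(add_log_count)
engineered = log_transformer.fit_transform(toy)
print(engineered)
```

The justification for a feature like this would be that compressing a long right tail keeps a few extreme values from dominating scale-sensitive models.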

In [None]:
# Look at the descriptive statistics for the 'Number_Objects' column
train_inputs['Number_Objects'].describe()

In [None]:
# Look at the value counts for 'Number_Objects'
train_inputs['Number_Objects'].value_counts()

In [None]:
# Define a function that bins 'Number_Objects' to help address its skewed distribution

def new_col(df):
    #Create a copy so that we don't overwrite the existing dataframe
    df2 = df.copy()
    
    df2['Number_Objects_binned'] = pd.cut(df2['Number_Objects'],
                                       bins=[1,1.5,6,11,1000],
                                       labels=False, 
                                       include_lowest=True,
                                       ordered=True)
    
    return df2[['Number_Objects_binned']]

In [None]:
# Check results of the new function on train_inputs
new_col(train_inputs).value_counts()

In [None]:
# Define a function for the text mining process

def text_col(df):
    #Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()
    
    # First, convert the dataframe column to a numpy array. Then, call the ravel function to make it one-dimensional
    return np.array(df1).ravel()

In [None]:
# Review the data types for processing consideration
train_inputs.dtypes

In [None]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

# Identify the binary, text, and feature-engineered columns so that each one can be routed to its own pipeline
binary_columns = ['Warning']
feat_eng_column = ['Number_Objects']
text_column = ['Description']

# Remove the binary and text column from categorical columns.
categorical_columns.remove('Description')
categorical_columns.remove('Warning')


In [None]:
# Display/Confirm binary columns
binary_columns

In [None]:
# Display/Confirm numeric columns
numeric_columns

In [None]:
# Display/Confirm categorical columns
categorical_columns

In [None]:
# Define the numeric pipeline

numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [None]:
# Define the categorical pipeline

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
# Define the binary pipeline; the ordinal encoder maps 'N'/'Y' to 0/1 so the transformed output is numeric rather than object dtype
from sklearn.preprocessing import OrdinalEncoder
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=[['N', 'Y']]))])

In [None]:
# Define the binned values pipeline

my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col)),
                               ('scaler', StandardScaler())])

In [None]:
# Define the text mining pipeline

text_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='constant', fill_value='NA')),            
                ('new_column', FunctionTransformer(text_col)),
                ('vectorizer', TfidfVectorizer(stop_words='english')),
                ('svd', TruncatedSVD(n_components=300, n_iter=10))
            ])

In [None]:
# Create the preprocessor with all the column types

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns),
        ('trans', my_new_column, feat_eng_column),
        ('text', text_transformer, text_column)
        ],
        remainder='drop')

In [None]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

In [None]:
# Check the shape of the transformed dataset
train_x.shape

In [None]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

In [None]:
test_x.shape

## Find the Baseline (1 point)
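
A mean-strategy dummy regressor always predicts the training mean, so its training RMSE is the population standard deviation of the training target. The sketch below checks this on made-up numbers; the values in `y` are hypothetical and carry no meaning for the airline data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical target values, for illustration only.
y = np.array([0.0, 100.0, 250.0, 1000.0, 5000.0])
X = np.zeros((len(y), 1))  # the dummy model ignores the features entirely

dummy = DummyRegressor(strategy="mean").fit(X, y)
rmse = np.sqrt(mean_squared_error(y, dummy.predict(X)))

# np.std uses the population formula (ddof=0) by default, matching the RMSE.
print(rmse, np.std(y))
```

Any model worth keeping should achieve a test RMSE below the one this baseline produces.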

In [None]:
# Import functions for baseline determination

from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

dummy_regr = DummyRegressor(strategy="mean")

dummy_regr.fit(train_x, train_y)

In [None]:
#Baseline Train RMSE
dummy_train_pred = dummy_regr.predict(train_x)

baseline_train_mse = mean_squared_error(train_y, dummy_train_pred)

baseline_train_rmse = np.sqrt(baseline_train_mse)

print('Baseline Train RMSE: {}' .format(baseline_train_rmse))

In [None]:
#Baseline Test RMSE
dummy_test_pred = dummy_regr.predict(test_x)

baseline_test_mse = mean_squared_error(test_y, dummy_test_pred)

baseline_test_rmse = np.sqrt(baseline_test_mse)

print('Baseline Test RMSE: {}' .format(baseline_test_rmse))

# Section 2: 

Build the following models:


## Decision Tree: (1 point)

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=5) 

tree_reg.fit(train_x, train_y)

In [None]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

Yes, it is overfitting.
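
One way to back up an answer like this is to put the train and test RMSE side by side: a training error far below the test error suggests overfitting. The sketch below uses synthetic data and a hypothetical helper `rmse_report`; it is not the notebook's data or result, only a demonstration of the comparison.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def rmse_report(model, train_x, train_y, test_x, test_y):
    """Return the train and test RMSE of an already fitted model."""
    train_rmse = np.sqrt(mean_squared_error(train_y, model.predict(train_x)))
    test_rmse = np.sqrt(mean_squared_error(test_y, model.predict(test_x)))
    return {"train_rmse": float(train_rmse), "test_rmse": float(test_rmse)}

# Synthetic noisy linear data, for illustration only.
rng = np.random.default_rng(13)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 5, size=200)
X_tr, X_te, y_tr, y_te = X[:140], X[140:], y[:140], y[140:]

# An unconstrained tree can memorize the training rows.
deep_tree = DecisionTreeRegressor(random_state=13).fit(X_tr, y_tr)
# Limiting depth trades some training fit for better generalization.
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=13).fit(X_tr, y_tr)

deep = rmse_report(deep_tree, X_tr, y_tr, X_te, y_te)
shallow = rmse_report(shallow_tree, X_tr, y_tr, X_te, y_te)
print("deep:", deep)
print("shallow:", shallow)
```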

In [None]:
# Import RandomizedSearchCV to perform a randomized hyperparameter search

from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(1, 10), 
     'max_depth': np.arange(1,5)}
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=20,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x, train_y)

In [None]:
grid_search.best_estimator_

In [None]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

## Voting regressor (1 point):

The voting regressor should have at least 3 individual models
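
With no weights supplied, a voting regressor predicts the plain average of its fitted members' predictions. The sketch below verifies this on synthetic data; the three member models are stand-ins chosen for speed, not a recommendation for this assignment.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only.
rng = np.random.default_rng(13)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

voter = VotingRegressor(estimators=[
    ("lin", LinearRegression()),
    ("ridge", Ridge(alpha=1.0)),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=13)),
]).fit(X, y)

# estimators_ holds the fitted clones of the three member models.
individual = np.column_stack([est.predict(X) for est in voter.estimators_])
manual_average = individual.mean(axis=1)

print(np.allclose(manual_average, voter.predict(X)))
```

Because the output is an average, one badly overfit member pulls the ensemble with it, which is why each member is usually regularized on its own.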

In [None]:
# Import more functions for Voting Regressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor

# Define the Voting Regressor parameters
dtree_reg = DecisionTreeRegressor(max_depth=5)
svm_reg = SVR(kernel="rbf", C=1, epsilon=0.1, gamma=1) 
rnd_reg = RandomForestRegressor(n_estimators=500, max_depth=5, n_jobs=-1) 

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('rnd', rnd_reg)])

voting_reg.fit(train_x, train_y)

In [None]:
#Train RMSE
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

In [None]:
# Inspect the test RMSE of each individual regressor and of the voting ensemble
for reg in (dtree_reg, svm_reg, rnd_reg, voting_reg):
    reg.fit(train_x, train_y)
    test_y_pred = reg.predict(test_x)
    print(reg.__class__.__name__, 'Test rmse=', np.sqrt(mean_squared_error(test_y, test_y_pred)))

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

In [None]:
# Redefine the Voting Regressor parameters to address overfitting
dtree_reg = DecisionTreeRegressor(max_depth=3)
svm_reg = SVR(kernel="rbf", C=20, epsilon=0.01, gamma=.01) 
rnd_reg = RandomForestRegressor(n_estimators=500, max_depth=3, n_jobs=-1) 

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('rnd', rnd_reg)])

voting_reg.fit(train_x, train_y)

In [None]:
#Train RMSE
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

## A Boosting model: (1 point)

Build either an Adaboost or a GradientBoost model
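
For the gradient boosting option, a minimal sketch on synthetic data is shown below. It uses `staged_predict` to track the error after each boosting stage, which is one possible way to see where extra trees stop helping. The validation slice here is assumed to be carved out of the training rows, not taken from the final test set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data, for illustration only.
rng = np.random.default_rng(13)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# Hypothetical validation slice taken from the training rows.
X_fit, X_val, y_fit, y_val = X[:200], X[200:], y[:200], y[200:]

gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=2, random_state=13)
gbr.fit(X_fit, y_fit)

# staged_predict yields the ensemble's predictions after each boosting stage.
val_rmse = [float(np.sqrt(mean_squared_error(y_val, pred)))
            for pred in gbr.staged_predict(X_val)]
best_stage = int(np.argmin(val_rmse)) + 1  # stages are counted from 1

print("best number of stages:", best_stage)
print("validation RMSE at that stage:", val_rmse[best_stage - 1])
```

A smaller `learning_rate` usually needs more stages, so the two parameters are best tuned together.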

In [None]:
from sklearn.ensemble import AdaBoostRegressor 

ada_reg = AdaBoostRegressor( 
            SVR(), n_estimators=500, 
            learning_rate=0.1) 

ada_reg.fit(train_x, train_y)

In [None]:
#Train RMSE
train_pred = ada_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = ada_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

In [None]:
# Attempt to address overfitting by decreasing C from the default 1 to 0.01 and decreasing n_estimators from 500 to 200 (note that learning_rate is also raised from 0.1 to 1)
ada_reg = AdaBoostRegressor( 
            SVR(C=0.01), n_estimators=200, 
            learning_rate=1) 

ada_reg.fit(train_x, train_y)

In [None]:
#Train RMSE
train_pred = ada_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = ada_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

## Neural network: (1 point)

In [None]:
# Import MLPRegressor for NN
from sklearn.neural_network import MLPRegressor

mlp_reg = MLPRegressor(hidden_layer_sizes=(100,))

mlp_reg.fit(train_x, train_y)

In [None]:
#Train RMSE
train_pred = mlp_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = mlp_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

In [None]:
mlp_reg = MLPRegressor(hidden_layer_sizes=(100,),
                       max_iter=1000,
                       early_stopping=True,
                      alpha = 0.1)

mlp_reg.fit(train_x, train_y)

In [None]:
#Train RMSE
train_pred = mlp_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = mlp_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

## Grid search (1 point)

Perform either a full or randomized grid search on any model you want. The search must cover at least two parameters.
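
A minimal sketch of a randomized search over two parameters follows, using synthetic data and a small decision tree so that it runs quickly. The parameter values are arbitrary examples. Setting `scoring="neg_root_mean_squared_error"` makes the cross-validated score directly comparable to an RMSE after flipping its sign.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only.
rng = np.random.default_rng(13)
X = rng.uniform(0, 10, size=(150, 2))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=150)

# Two parameters, as the task requires; the candidate values are arbitrary.
param_distributions = {
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_leaf": [1, 5, 10, 20],
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=13),
    param_distributions,
    n_iter=8,                               # sample 8 of the 20 combinations
    cv=3,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE values
    random_state=13,
)
search.fit(X, y)

cv_rmse = -search.best_score_  # flip the sign to get a positive RMSE
print(search.best_params_, cv_rmse)
```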

In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'hidden_layer_sizes': [(50,), (100,), (150,), (200,), (250,)],
             'activation': ['relu', 'tanh', 'logistic'],
             'alpha' : [0.00001, .0001, .001, .01]}

mlp_gs = RandomizedSearchCV(MLPRegressor(max_iter=1000, early_stopping=True), 
                            param_grid, cv=3, 
                            random_state=13, 
                            n_iter=30, n_jobs=-1)

mlp_gs.fit(train_x, train_y)

In [None]:
mlp_gs.best_params_

In [None]:
#Train RMSE:
train_pred = mlp_gs.best_estimator_.predict(train_x)
train_mse = mean_squared_error(train_y, train_pred)
train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = mlp_gs.best_estimator_.predict(test_x)
test_mse = mean_squared_error(test_y, test_pred)
test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

### Is the model overfitting? Provide your answer below. If yes, please add more cells below and show how you corrected overfitting. If your model is overfitting and you don't correct it, you will lose points. (0.25 points)

# Discussion (3 points in total)


## List the train and test values of each model you built (1 point)

## Which model performs the best and why? (1 point)

Hint: The best model is the one that has the best TEST value (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.
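
Selecting on the test value can be sketched by collecting every model's train and test RMSE in one table and sorting by the test column. The models and data below are synthetic stand-ins, so the printed numbers say nothing about the airline results.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def rmse(y_true, y_pred):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Synthetic data, for illustration only.
rng = np.random.default_rng(13)
X = rng.uniform(0, 10, size=(200, 2))
y = 4.0 * X[:, 0] + rng.normal(scale=2.0, size=200)
X_tr, X_te, y_tr, y_te = X[:140], X[140:], y[:140], y[140:]

models = {
    "baseline": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=13),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({"model": name,
                 "train_rmse": rmse(y_tr, model.predict(X_tr)),
                 "test_rmse": rmse(y_te, model.predict(X_te))})

# Sort by TEST RMSE only; the train column is shown for context.
results = pd.DataFrame(rows).set_index("model").sort_values("test_rmse")
best_model = results.index[0]
print(results)
print("best by test RMSE:", best_model)
```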

## How does it compare to baseline? (1 point)
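
One simple way to express the comparison is the relative reduction in test RMSE against the baseline. The helper name `relative_improvement` and the two numbers passed to it are hypothetical placeholders, not results from this notebook.

```python
def relative_improvement(baseline_rmse, model_rmse):
    """Fraction by which the model's RMSE is lower than the baseline's RMSE."""
    if baseline_rmse <= 0:
        raise ValueError("baseline_rmse must be positive")
    return 1.0 - model_rmse / baseline_rmse

# Placeholder values for illustration only; substitute the test RMSEs you measured.
improvement = relative_improvement(baseline_rmse=200.0, model_rmse=150.0)
print("{:.1%} lower test RMSE than the baseline".format(improvement))
```

A negative value would mean the model does worse than always predicting the training mean.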