# Investor Classifier, Part II
Welcome to the eleventh lesson! This Jupyter Notebook file is meant to accompany **L11 - Investor Classifier II.**

Type your solutions for each exercise in the code cells below, and then press **Shift + Enter** to execute your code. Then, check the solution video to see how you did!

### 1. Redundant Features & Dummy Variables

**Lesson Workspace**

In [1]:
# Import NumPy, Pandas, Pyplot, and Seaborn
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns


In [2]:
# Create a new DataFrame with investor_data_2.csv
investor_data = pd.read_csv('C:\\Users\\zacha\\Desktop\\MLE Test\\ML 06. Module Files\\investor_data_2.csv')
investor_data.head(3)

|   | investor | commit | deal_size | invite | rating | int_rate | covenants | total_fees | fee_share | prior_tier | invite_tier | tier_change | fee_percent | invite_percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Goldman Sachs | Commit | 300 | 40 | 2 | Market | 2 | 30 | 0.0 | Participant | Bookrunner | Promoted | 0.0 | 0.133333 |
| 1 | Deutsche Bank | Decline | 1200 | 140 | 2 | Market | 2 | 115 | 20.1 | Bookrunner | Participant | Demoted | 0.174783 | 0.116667 |
| 2 | Bank of America | Commit | 900 | 130 | 3 | Market | 2 | 98 | 24.4 | Bookrunner | Bookrunner |  | 0.24898 | 0.144444 |


<font color = 'blue'> **EXERCISE 1.1** </font>

In [3]:
# Remove the invite_tier, fee_share, and invite features
investor_data = investor_data.drop(['invite_tier', 'fee_share', 'invite'], axis=1)
investor_data.shape


(7233, 11)
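
The three dropped columns look redundant because other columns appear to carry the same information. The sketch below checks this on the three rows shown in the `head(3)` output; the ratio relationships are inferred from those rows only and are an assumption about the rest of the dataset. (`tier_change` likewise looks like a comparison of `prior_tier` with `invite_tier`.)

```python
import numpy as np
import pandas as pd

# First three rows, copied from the head(3) output above
sample = pd.DataFrame({
    'deal_size':      [300, 1200, 900],
    'invite':         [40, 140, 130],
    'total_fees':     [30, 115, 98],
    'fee_share':      [0.0, 20.1, 24.4],
    'fee_percent':    [0.0, 0.174783, 0.24898],
    'invite_percent': [0.133333, 0.116667, 0.144444],
})

# Assumption: the *_percent columns are plain ratios of the raw columns
fee_ratio = sample.fee_share / sample.total_fees
invite_ratio = sample.invite / sample.deal_size

fee_matches = np.allclose(fee_ratio, sample.fee_percent, atol=1e-5)
invite_matches = np.allclose(invite_ratio, sample.invite_percent, atol=1e-5)
print(fee_matches, invite_matches)
```

If the same check holds on the full DataFrame, keeping both the raw and the ratio columns would feed the models near-duplicate information.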

<font color = 'blue'> **EXERCISE 1.2** </font>

In [4]:
# Create dummy variables
investor_data = pd.get_dummies(investor_data)
investor_data.shape


(7233, 21)
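
To see why the column count grows, here is a minimal sketch on a hypothetical toy frame (not the investor data). `pd.get_dummies` leaves numeric columns alone and replaces each text column with one indicator column per distinct value. Recent pandas versions return boolean indicators by default, so `dtype=int` is passed here to get the 0/1 values shown in this notebook.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and two text columns
toy = pd.DataFrame({
    'deal_size': [300, 1200, 900],
    'commit':    ['Commit', 'Decline', 'Commit'],
    'int_rate':  ['Market', 'Above', 'Market'],
})

# Each text column becomes one 0/1 column per category
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

The same bookkeeping explains the jump from 11 to 21 columns above: the 6 numeric columns stay, and the 5 text columns expand into 15 indicator columns.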

**Lesson Workspace**

In [5]:
investor_data.head(1)

|   | deal_size | rating | covenants | total_fees | fee_percent | invite_percent | investor_Bank of America | investor_Deutsche Bank | investor_Goldman Sachs | investor_MUFG Union | ... | commit_Commit | commit_Decline | int_rate_Above | int_rate_Below | int_rate_Market | prior_tier_Bookrunner | prior_tier_Participant | tier_change_Demoted | tier_change_None | tier_change_Promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300 | 2 | 2 | 30 | 0.0 | 0.133333 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |


In [6]:
# Drop the commit_Commit Series from your DataFrame
investor_data = investor_data.drop('commit_Commit', axis=1)
investor_data.shape


(7233, 20)
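
Dropping `commit_Commit` loses no information, because the two `commit_*` indicators are exact complements of each other. A small sketch on a hypothetical series illustrates this:

```python
import pandas as pd

# Hypothetical target values, for illustration only
commit = pd.Series(['Commit', 'Decline', 'Commit', 'Decline', 'Decline'])
dummies = pd.get_dummies(commit, prefix='commit', dtype=int)

# Every row has exactly one of the two flags set
one_flag_per_row = bool((dummies.sum(axis=1) == 1).all())

# So either column can be rebuilt from the other
rebuilt_commit = 1 - dummies['commit_Decline']
is_complement = bool((rebuilt_commit == dummies['commit_Commit']).all())

print(one_flag_per_row, is_complement)
```

Keeping only `commit_Decline` also makes "Decline" the positive class (value 1), which matters when reading the confusion matrix later on.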

In [7]:
# Define new DataFrames for your target variable and input features
target = investor_data.commit_Decline
inputs = investor_data.drop('commit_Decline', axis=1)


### 2. Stratified Random Sampling

**Lesson Workspace**

In [8]:
# Split your data using stratified random sampling
from sklearn.model_selection import train_test_split
split_list = train_test_split(inputs, target, test_size=0.2, random_state=1, stratify=target)


<font color = 'blue'> **EXERCISE 2.1** </font>

In [9]:
# Unpack split_list into four new objects and print their shapes
input_train, input_test, target_train, target_test = split_list
print(input_train.shape)
print(input_test.shape)
print(target_train.shape)
print(target_test.shape)


(5786, 19)
(1447, 19)
(5786,)
(1447,)
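
Stratifying means the split preserves the class proportions of the target in both subsets. A minimal sketch on synthetic stand-in data (not the investor dataset, with an assumed positive rate of roughly 20%) shows how to verify that:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one random feature, roughly 20% positives
rng = np.random.default_rng(1)
X = pd.DataFrame({'x': rng.normal(size=1000)})
y = pd.Series((rng.random(1000) < 0.2).astype(int), name='decline')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# The share of positives should be nearly identical in all three
print(round(y.mean(), 3), round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Running the same three `.mean()` calls on `target`, `target_train`, and `target_test` would confirm the split in this notebook.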


### 3. Pipelines & Hyperparameter Grids

**Lesson Workspace**

In [10]:
# Import Scikit-Learn functions and classifiers
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier


<font color = 'blue'> **EXERCISE 3.1** </font>

In [11]:
# Create your pipeline dictionary
pipelines = {
    'l1' : make_pipeline(StandardScaler(), LogisticRegression(penalty='l1', random_state=1, solver='liblinear')),
    'l2' : make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', random_state=1, solver='liblinear')),
    'rf' : make_pipeline(StandardScaler(), RandomForestClassifier(random_state=1)),
    'gb' : make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))
}


In [12]:
for key, value in pipelines.items():
    print(key, type(value))

l1 <class 'sklearn.pipeline.Pipeline'>
l2 <class 'sklearn.pipeline.Pipeline'>
rf <class 'sklearn.pipeline.Pipeline'>
gb <class 'sklearn.pipeline.Pipeline'>
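
`make_pipeline` names each step automatically, using the lowercased class name. Those names matter in the next exercise, because hyperparameter grids address a step's parameters as `<step name>__<parameter name>`. A quick sketch with a standalone pipeline:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step names are derived from the class names, lowercased
step_names = list(pipe.named_steps.keys())
print(step_names)

# get_params() lists every key a hyperparameter grid is allowed to use
has_c = 'logisticregression__C' in pipe.get_params()
print(has_c)
```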


<font color = 'blue'> **EXERCISE 3.2** </font>

In [13]:
# Create the hyperparameter grids and store them in a new dictionary
l1_hyperparameters = {
    'logisticregression__C' : [0.1, 1, 10]
}

l2_hyperparameters = {
    'logisticregression__C' : [0.1, 1, 10]
}

rf_hyperparameters = {
    'randomforestclassifier__n_estimators' : [100, 200],
    'randomforestclassifier__max_features' : ['sqrt', 0.3, 0.6]
}

gb_hyperparameters = {
    'gradientboostingclassifier__n_estimators' : [100, 200],
    'gradientboostingclassifier__learning_rate' : [0.05, 0.1, 0.2],
    'gradientboostingclassifier__max_depth' : [1, 3, 5]
}

hyperparameters = {
    'l1' : l1_hyperparameters, 
    'l2' : l2_hyperparameters,
    'rf' : rf_hyperparameters,
    'gb' : gb_hyperparameters
}


In [14]:
for key in ['l1', 'l2', 'rf', 'gb']:
    if key in hyperparameters:
        if isinstance(hyperparameters[key], dict):
            print(key, 'was found, and it is a grid.')
        else:
            print(key, 'was found, but it is not a grid.')
    else:
        print(key, 'was not found.')

l1 was found, and it is a grid.
l2 was found, and it is a grid.
rf was found, and it is a grid.
gb was found, and it is a grid.
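
Grid size drives training time, so it helps to count the candidates before fitting. The sketch below uses `ParameterGrid` on a copy of the gradient boosting grid; the fold count of 5 is taken from the `cv=5` setting used in the next section.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the gradient boosting grid defined above
gb_grid = {
    'gradientboostingclassifier__n_estimators': [100, 200],
    'gradientboostingclassifier__learning_rate': [0.05, 0.1, 0.2],
    'gradientboostingclassifier__max_depth': [1, 3, 5],
}

n_candidates = len(ParameterGrid(gb_grid))  # 2 * 3 * 3 combinations
n_folds = 5                                 # one fit per candidate per fold
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

By default the search then refits the best candidate once more on the full training set.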


### 4. Cross-Validation

**Lesson Workspace**

In [15]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV


<font color = 'blue'> **EXERCISE 4.1** </font>

In [16]:
# Create a dictionary containing your untrained models
models = {}

for key, pipeline in pipelines.items():
    models[key] = GridSearchCV(pipeline, hyperparameters[key], cv=5)
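
When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers. Since the models are compared by AUROC later in this lesson, one option is to tune on that metric as well. This is a sketch of an alternative setup, not the lesson's solution; the `demo_*` names are hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_grid = {'logisticregression__C': [0.1, 1, 10]}

# scoring='roc_auc' ranks candidates by cross-validated AUROC instead of accuracy;
# n_jobs=-1 spreads the fits across all available CPU cores
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5,
                           scoring='roc_auc', n_jobs=-1)

print(demo_search.scoring, demo_search.cv)
```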


<font color = 'blue'> **EXERCISE 4.2** </font>

In [17]:
# Write a for loop to train your models
for key in models.keys():
    models[key].fit(input_train, target_train)
    print(key, 'is trained and tuned.')


l1 is trained and tuned.
l2 is trained and tuned.
rf is trained and tuned.
gb is trained and tuned.
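
Once a `GridSearchCV` object has been fitted, it records which candidate won and how each one scored. A minimal sketch on synthetic stand-in data (not the investor dataset) shows the attributes worth inspecting:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = {'logisticregression__C': [0.1, 1, 10]}

search = GridSearchCV(pipe, grid, cv=5).fit(X, y)

# Winning hyperparameters and their mean cross-validated score
print(search.best_params_)
print(round(search.best_score_, 3))

# One row per candidate, with its mean score and rank
results = pd.DataFrame(search.cv_results_)
print(results[['param_logisticregression__C', 'mean_test_score', 'rank_test_score']])
```

The same attributes are available on each entry of `models`, for example `models['gb'].best_params_`.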


### 5. Model Selection

**Lesson Workspace**

In [18]:
# Print the confusion matrix for your L1 Logistic Regression model
from sklearn.metrics import confusion_matrix

pred = models['l1'].predict(input_test)
print(confusion_matrix(target_test, pred))


[[1124   22]
 [  23  278]]
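
For labels 0 and 1, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, with actual classes in rows and predicted classes in columns. Because the target is `commit_Decline`, the positive class here is "Decline". The sketch below unpacks the matrix printed above and derives a few standard rates from it:

```python
import numpy as np

# Values copied from the L1 confusion matrix printed above
cm = np.array([[1124, 22],
               [23, 278]])

# Row-major order: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)      # share of actual declines the model caught
precision = tp / (tp + fp)   # share of predicted declines that were correct

print(round(accuracy, 3), round(recall, 3), round(precision, 3))
```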


In [19]:
# Import functions
from sklearn.metrics import roc_curve, auc

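# Calculate the ROC curve and print the L1 AUROC
# Note: pred holds hard class labels from predict(), so the curve has a single operating point
fpr, tpr, thresholds = roc_curve(target_test, pred)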
print('l1')
print('AUROC =', round(auc(fpr, tpr), 3))


l1
AUROC = 0.952
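
Because `pred` contains hard 0/1 labels rather than probabilities, the ROC curve has only one interior point, and the area under it reduces to the average of the true positive rate and the true negative rate. The sketch below rebuilds label vectors that reproduce the confusion matrix above (the row order is arbitrary) and confirms that this arithmetic gives the printed value:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Label vectors matching the confusion matrix: 1124 TN, 22 FP, 23 FN, 278 TP
y_true = np.array([0] * 1124 + [0] * 22 + [1] * 23 + [1] * 278)
y_pred = np.array([0] * 1124 + [1] * 22 + [0] * 23 + [1] * 278)

fpr, tpr, _ = roc_curve(y_true, y_pred)
area = auc(fpr, tpr)

# With hard predictions the area equals (TPR + TNR) / 2
tpr_point = 278 / (278 + 23)
tnr_point = 1124 / (1124 + 22)
shortcut = (tpr_point + tnr_point) / 2

print(round(area, 3), round(shortcut, 3))
```

A score computed this way describes one decision threshold only; it does not measure how well the model ranks observations across all thresholds.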


<font color = 'blue'> **EXERCISE 5.1** </font>

In [20]:
# Write a for loop that calculates and prints each model's AUROC (from hard class predictions)
for key, model in models.items():
    pred = model.predict(input_test)
    fpr, tpr, thresholds = roc_curve(target_test, pred)
    print(key)
    print('AUROC =', round(auc(fpr, tpr), 4))
    print('---')


l1
AUROC = 0.9522
---
l2
AUROC = 0.9518
---
rf
AUROC = 0.9616
---
gb
AUROC = 0.9683
---
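
The loop above scores hard class labels. AUROC is more commonly computed from predicted probabilities, so that every threshold contributes to the curve. The sketch below shows that variant on synthetic stand-in data with two hypothetical demo models; it is not a rerun of the lesson's models, so its numbers are not comparable to the ones printed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the investor dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Hypothetical demo models, fitted directly for brevity
demo_models = {
    'lr': LogisticRegression(max_iter=1000).fit(X_train, y_train),
    'rf': RandomForestClassifier(n_estimators=50, random_state=1).fit(X_train, y_train),
}

scores = {}
for name, model in demo_models.items():
    # Column 1 holds the predicted probability of the positive class
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = roc_auc_score(y_test, proba)
    print(name, 'AUROC =', round(scores[name], 4))
```

A fitted `GridSearchCV` object exposes `predict_proba` when its best estimator does, so the same pattern should carry over to the `models` dictionary by replacing `predict` with `predict_proba(input_test)[:, 1]`.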
