## Predicting the Optimal APR for e-Car
### Nomis Solutions - LT 12

### Part 2 - Gradient Boosting Method

Results of the Gradient Boosting Method show high accuracy (test-set R² of roughly 0.84 to 0.92) across all customer tiers relative to the other models used in the analysis.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from tqdm.autonotebook import tqdm
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import GridSearchCV



### Reading Data

In [2]:
raw = pd.read_excel('NomisB.xlsx', na_values=' ')

In [3]:
print(raw.shape)
raw.columns

(208085, 12)


Index(['Tier', 'FICO', 'Approve Date', 'Term', 'Amount', 'Previous Rate',
       'Car  Type', 'Competition rate', 'Outcome', 'Rate', 'Cost of Funds',
       'Partner Bin'],
      dtype='object')

In [4]:
df = raw.copy()

# Fill missing values with 0 (intended for blank 'Previous Rate' entries; applies to every column)
df = df.fillna(0)

# Drop date
df = df.drop('Approve Date', axis=1)

# Partner Bin is categorical
df['Partner Bin'] = df['Partner Bin'].astype('category')
df = pd.get_dummies(df)

# Drop Amount that is too small
df = df[df.Amount>10]
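
To make the effect of the preprocessing steps concrete, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the notebook.
toy = pd.DataFrame({
    'Amount': [25000.0, 5.0, 18000.0],
    'Previous Rate': [np.nan, 4.5, np.nan],
    'Partner Bin': [1, 2, 3],
})

toy = toy.fillna(0)                                         # fills NaN in every column, not only 'Previous Rate'
toy['Partner Bin'] = toy['Partner Bin'].astype('category')  # mark the bin as categorical
toy = pd.get_dummies(toy)                                   # one indicator column per 'Partner Bin' level
toy = toy[toy.Amount > 10]                                  # drops the row with Amount == 5

print(toy.columns.tolist())
print(toy)
```

Note that `pd.get_dummies` only expands object, string and category columns, which is why `Partner Bin` is cast to `category` first.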

### Segmenting Data based on Tiers

In [5]:
# combi = (Outcome, Tier)
combi = [(1,1),(1,2),(1,3),(1,4),(0,1),(0,2),(0,3),(0,4)]

# Build the groupby once and reuse it for every (Outcome, Tier) pair
grouped = df.groupby(['Outcome', 'Tier'])

Xy = {i : {
            'X' : grouped.get_group(i).drop(['Outcome','Rate','Tier'], axis=1),
            'y' : grouped.get_group(i).Rate
          }
      for i in combi}
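
As an illustration of the segmentation, the sketch below builds the same kind of dictionary on a toy frame by iterating over the groupby object, which yields `(Outcome, Tier)` tuples as keys. The data and the names `toy_seg` and `segments` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the column names mirror the ones used in the notebook.
toy_seg = pd.DataFrame({
    'Outcome': [1, 1, 0, 0, 1],
    'Tier':    [1, 2, 1, 2, 1],
    'FICO':    [720, 680, 700, 650, 740],
    'Rate':    [4.5, 6.0, 5.0, 7.5, 4.2],
})

# Iterating a two-key groupby yields ((outcome, tier), sub-frame) pairs.
segments = {
    key: {'X': part.drop(['Outcome', 'Rate', 'Tier'], axis=1), 'y': part['Rate']}
    for key, part in toy_seg.groupby(['Outcome', 'Tier'])
}

print(sorted(segments))
print(segments[(1, 1)]['X'])
```

Unlike the explicit `combi` list, this variant only creates entries for combinations that actually occur in the data.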

In [6]:
list(range(1,5))

[1, 2, 3, 4]

### Applying GBM

In [8]:
results = {}
for i in tqdm(range(1,5)):
    X = Xy[(1,i)]['X']
    y = Xy[(1,i)]['y']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=1)

    param_grids = {'learning_rate': [0.2, 0.1, 0.05],
                   'max_depth': [3, 4, 6],
                   'min_samples_leaf': [2,3,4],
                   'max_features':[0.5,0.3,0.2]}   

    est = GradientBoostingRegressor(n_estimators=50)
    gs_cv = GridSearchCV(est, param_grids, n_jobs=-1, cv=5).fit(X_train, y_train)
    results[(1,i)] = {
        'model' : gs_cv,
        'best_params' : gs_cv.best_params_,
        # score() on a regressor returns R^2, here on the held-out test split
        'acc' : gs_cv.score(X_test, y_test)
    }
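
The following is a small, self-contained sketch of the same grid-search pattern on synthetic data, meant only to show what the `acc` entry measures. The data, the reduced grid, the `demo_*` names and the `random_state` on the estimator are assumptions added for this example and are not part of the analysis above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real features and target come from NomisB.xlsx.
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

# Deliberately tiny grid so the sketch runs quickly.
demo_grid = {'learning_rate': [0.2, 0.1], 'max_depth': [3, 4]}

# random_state is an addition here: it makes the max_features subsampling repeatable.
demo_est = GradientBoostingRegressor(n_estimators=50, max_features=0.5, random_state=1)
demo_cv = GridSearchCV(demo_est, demo_grid, cv=3).fit(X_tr, y_tr)

test_score = demo_cv.score(X_te, y_te)             # default scorer of the refit best estimator
manual_r2 = r2_score(y_te, demo_cv.predict(X_te))  # the same quantity computed explicitly
print(demo_cv.best_params_, round(test_score, 3), round(manual_r2, 3))
```

With no `scoring` argument, `GridSearchCV` falls back to the estimator's own `score` method, so both numbers printed at the end agree. Without a fixed `random_state`, a model that subsamples features can give slightly different scores from run to run.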


In [9]:
add_profit = {}
for i in tqdm(range(1,5)):
    # Score the Outcome 0 rows of tier i with the model fitted on the Outcome 1 rows
    exp = pd.concat([Xy[(0,i)]['X'].reset_index(drop=True),
                     Xy[(0,i)]['y'].reset_index(drop=True),
                     pd.DataFrame({'Predicted rate': np.round(results[(1,i)]['model'].predict(Xy[(0,i)]['X']), 2)})], axis=1)
    # Count a row as captured when the predicted rate undercuts both the quoted and the competitor rate
    temp = exp[(exp['Predicted rate'] < exp['Rate']) & (exp['Predicted rate'] < exp['Competition rate'])]

    add_profit[(0,i)] = {
        'Captured': temp.shape[0],
        'Captured pct': round(temp.shape[0]/exp.shape[0]*100, 1),
        # Margin over cost of funds for the full term (assumes Term in months, rates in percent)
        'Profit': (temp.Amount * temp.Term/12 * (temp['Predicted rate'] - temp['Cost of Funds'])/100).sum()
    }
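
The capture rule and the profit formula can be checked by hand on a few rows. The sketch below uses hypothetical numbers and assumes, as the `/12` and `/100` factors suggest, that `Term` is in months and rates are in percent. For the single captured row, 20000 over 5 years at a 3.5 point margin gives 3500.

```python
import pandas as pd

# Hypothetical Outcome 0 rows; column names follow the notebook.
demo_rows = pd.DataFrame({
    'Amount': [20000.0, 15000.0, 30000.0],
    'Term': [60, 48, 72],
    'Rate': [6.0, 5.0, 7.0],
    'Competition rate': [5.5, 4.0, 6.5],
    'Cost of Funds': [1.5, 1.2, 1.8],
    'Predicted rate': [5.0, 4.5, 6.8],
})

# Captured: predicted rate is below both the quoted rate and the competitor rate.
captured = demo_rows[(demo_rows['Predicted rate'] < demo_rows['Rate']) &
                     (demo_rows['Predicted rate'] < demo_rows['Competition rate'])]

# Profit: amount * years * (predicted rate - cost of funds), with rates in percent.
profit = (captured.Amount * captured.Term / 12
          * (captured['Predicted rate'] - captured['Cost of Funds']) / 100).sum()

print(len(captured), profit)
```

Only the first row is captured: the other two predicted rates sit below the quoted rate but above the competitor rate.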


In [10]:
# Total additional profit across the four tiers, in millions
sum([add_profit[i]['Profit'] for i in add_profit])/1000000

222.17515435096757

### Calculating Additional Profits for each Tier (Outcome 0 Segments, Priced by the Outcome 1 Models)

In [11]:
add_profit

{(0, 1): {'Captured': 44675,
  'Captured pct': 58.8,
  'Profit': 199825969.2103832},
 (0, 2): {'Captured': 3327, 'Captured pct': 9.3, 'Profit': 20399108.693883263},
 (0, 3): {'Captured': 415, 'Captured pct': 1.3, 'Profit': 1939704.65890111},
 (0, 4): {'Captured': 2, 'Captured pct': 0.0, 'Profit': 10371.787800000002}}

In [12]:
[results[i]['acc'] for i in results]

[0.915138532237375, 0.8718606471173999, 0.8621208067161452, 0.8428563217815007]
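
The `acc` entries are what `score` returns for a regressor, namely the coefficient of determination on the held-out rows rather than a classification accuracy. For reference, with $y_i$ the observed rates, $\hat{y}_i$ the predicted rates and $\bar{y}$ the mean observed rate (symbols introduced here, not in the notebook):

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A value of 1 means the predictions match the observed rates exactly, and a value of 0 means the model does no better than always predicting the mean rate.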

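| Tier | Accuracy (GBM, test-set R²) |
|:----:|:---------------------------:|
| 1 | 0.9020883779290694 |
| 2 | 0.8821695815569348 |
| 3 | 0.8592910813705846 |
| 4 | 0.8553037883010983 |

The table values differ slightly from the `In [12]` output, presumably because they come from a separate run: the estimator is not seeded, so results vary between runs.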

**The Gradient Boosting Method gives the highest accuracy (test-set R²) compared to Random Forest, Ridge Regression and Support Vector Regressor.**