#### Data science: Direct marketing optimization
##### Task:
Use dummy data to maximize revenue from direct marketing campaigns.
##### Data:                   
For the analysis, several tables are available:                  
1) Socio-demographic data (age, gender, tenure with the bank)
2) Products owned + actual volumes (current account, saving account, mutual funds, overdraft, credit card, consumer loan)      
3) Inflow/outflow on C/A, aggregated card turnover (monthly average over past 3 months)          
4) For 60 % of clients actual sales + revenues from these are available (training set)                          

##### Conditions:     
> The bank has capacity to contact only 15% of the clients (circa 100 people) with a marketing offer, and each client can be targeted only once.

Proposed steps:
1. Create an analytical dataset (both training and targeting sets)                  
2. Develop 3 propensity models (consumer loan, credit card, mutual fund) using training data set                
3. Optimize targeting clients with the direct marketing offer to maximize the revenue 

##### Expected result:                                            
1) Which clients have higher propensity to buy consumer loan?             
2) Which clients have higher propensity to buy credit card?            
3) Which clients have higher propensity to buy mutual fund?              
4) Which clients are to be targeted with which offer? General description.            
5) What would be the expected revenue based on your strategy?             
##### The executive summary of the analysis should not be larger than two pages. Attach the technical report, list of clients to be contacted with which offer, data, algorithms and codes used.
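Step 3 of the proposed approach (optimizing who gets which offer) can be sketched as follows. This is only an illustration under stated assumptions, not the notebook's final method: it assumes each client already has a predicted purchase probability and a predicted revenue per product. The column names `p_CL`, `rev_CL` and so on, the helper `select_targets`, and all numbers are hypothetical. Each offer is scored by expected revenue (probability times revenue), the best offer is kept per client so nobody is targeted twice, and only the top share of clients is contacted.

```python
import pandas as pd

# Hypothetical model outputs: purchase probability (p_*) and predicted revenue (rev_*)
scores = pd.DataFrame({
    "Client": [1, 2, 3, 4, 5],
    "p_CL": [0.60, 0.10, 0.30, 0.05, 0.50],
    "rev_CL": [10.0, 12.0, 8.0, 9.0, 4.0],
    "p_CC": [0.20, 0.40, 0.10, 0.05, 0.30],
    "rev_CC": [5.0, 6.0, 4.0, 3.0, 15.0],
    "p_MF": [0.10, 0.05, 0.50, 0.02, 0.10],
    "rev_MF": [7.0, 9.0, 8.0, 2.0, 5.0],
})

def select_targets(scores, products=("CL", "CC", "MF"), capacity=0.15):
    """Pick one offer per client by expected revenue, then keep the top share of clients."""
    expected = pd.DataFrame(
        {p: scores[f"p_{p}"] * scores[f"rev_{p}"] for p in products}
    )
    plan = pd.DataFrame({
        "Client": scores["Client"],
        "offer": expected.idxmax(axis=1),          # best product for each client
        "expected_revenue": expected.max(axis=1),  # its expected revenue
    })
    n_contacts = max(1, int(len(plan) * capacity))  # contact budget
    return plan.nlargest(n_contacts, "expected_revenue").reset_index(drop=True)

# 40% of the 5 toy clients gives a budget of 2 contacts
targets = select_targets(scores, capacity=0.4)
print(targets)
```

The sum of `expected_revenue` over the selected rows is one simple estimate of the campaign's expected revenue under these assumptions.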

In [514]:
# importing required packages
import pandas as pd
import numpy as np 
from collections import Counter
from sklearn.impute import KNNImputer
import math
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import arange
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression,LinearRegression,Ridge
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,RandomForestRegressor,AdaBoostRegressor
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score,roc_curve
from sklearn.metrics import classification_report,accuracy_score,precision_score,recall_score,confusion_matrix,roc_auc_score,f1_score, precision_recall_curve,auc
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, cross_validate, KFold, StratifiedKFold, RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
import eli5
import pickle

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None)

##### Reading Data Set

In [526]:
# reading the data set (a forward slash avoids the invalid "\T" escape and works on every OS)
data_path = "Data/Task_Data_Scientist_Dataset.xlsx"
df_demog = pd.read_excel(data_path, engine='openpyxl', sheet_name='Soc_Dem')
df_prod = pd.read_excel(data_path, engine='openpyxl', sheet_name='Products_ActBalance')
df_in_out = pd.read_excel(data_path, engine='openpyxl', sheet_name='Inflow_Outflow')
df_sales = pd.read_excel(data_path, engine='openpyxl', sheet_name='Sales_Revenues')

In [None]:
df_demog.head(3)

In [None]:
df_prod.head(3)

In [None]:
df_in_out.head(3)

In [None]:
df_sales.head(3)

##### Data Exploration

In [None]:
# printing shape of provided data set
print("Print shape of Social Demographic data set: ",df_demog.shape)
print("Print shape of Products Owned and their actual volumes data set: ",df_prod.shape)
print("print shape of Inflow and Outflow data set: ",df_in_out.shape)
print("print shape of Train set data set: ",df_sales.shape)

There are 28 clients that do not appear in the Inflow and Outflow data set. The left joins used for merging keep these clients and leave their inflow/outflow columns empty (NaN); those missing values are handled during pre-processing.
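To see exactly which clients are missing from one table, a left join with pandas' `indicator=True` option can flag them. A minimal sketch on toy frames (the data is made up; only the `Client` key mirrors the notebook):

```python
import pandas as pd

# Toy stand-ins for df_demog and df_in_out (values are hypothetical)
demog = pd.DataFrame({"Client": [1, 2, 3, 4], "Age": [30, 45, 27, 61]})
in_out = pd.DataFrame({"Client": [1, 3], "VolumeCred": [100.0, 250.0]})

# indicator=True adds a _merge column that records where each row came from
merged = demog.merge(in_out, how="left", on="Client", indicator=True)
missing_clients = merged.loc[merged["_merge"] == "left_only", "Client"].tolist()
print(missing_clients)
```

Rows flagged `left_only` exist in the left table only, so their right-hand columns are NaN after the join.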

In [None]:
# merging the three feature tables on Client
df = pd.merge(df_demog, df_prod, how="left", on=["Client"])
df = pd.merge(df, df_in_out, how="left", on=["Client"])

In [None]:
print("Print shape of combined data set: ",df.shape)

##### Splitting the data set into Train and Test

In [None]:
df_train = pd.merge(df, df_sales[['Client','Sale_CL','Revenue_CL']], how="inner", on=["Client"])

In [None]:
print("Print shape of training data set: ",df_train.shape)

In [None]:
df_train.head(3)

Most columns span a very wide range while a few take only small values, so the features need to be scaled before fitting scale-sensitive models.
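As an illustration of what scaling does, here is a minimal sketch of min-max scaling, which maps each feature to [0, 1] via (x - min) / (max - min). The helper `min_max_scale` and the toy arrays are hypothetical; the notebook itself relies on scikit-learn's `MinMaxScaler` inside a pipeline.

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array to [0, 1] via (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)  # constant column: map everything to 0
    return (x - x.min()) / span

volume = np.array([0.0, 500.0, 2000.0])   # hypothetical large-range feature
count = np.array([1.0, 2.0, 5.0])         # hypothetical small-range feature
scaled_volume = min_max_scale(volume)
scaled_count = min_max_scale(count)
print(scaled_volume, scaled_count)
```

After scaling, both features live on the same [0, 1] range, so neither dominates a distance or a regularized coefficient purely because of its units.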

##### Data Pre-Processing

In [None]:
# columns which are not required for sale of Consumer loan prediction
# columns_sale_cl = ['Count_MF','Count_CC','ActBal_MF','ActBal_CC']
columns_sale_cl = ['Count_CL','ActBal_CL']
df_train.drop(columns_sale_cl,inplace = True,axis = 1)

In [None]:
# finding total number of duplicate values in data set if any
print('Total number of duplicate values in the data set is/are: {}'.format(df_train.duplicated().sum()))

There are no duplicate rows in the data set

In [None]:
# checking types of the columns in the data set
df_train.dtypes

In [None]:
# checking for null values in the data set
col = df_train.columns
for i in col:
    # count the missing values in this column (a scalar, so it formats cleanly with %d)
    n_miss = df_train[i].isnull().sum()
    perc = n_miss / df_train.shape[0] * 100
    print('%s, Missing: %d (%.1f%%)' % (i, n_miss, perc))

There are many missing values in the data set. Before applying any algorithm we have to either impute them or drop the affected rows.
1) Sex is missing in two rows, so I impute it with U (Unknown), assuming the client preferred not to reveal their gender.
2) The features from the Inflow/Outflow data set are each missing in 18 rows; we impute them with 0, assuming the client was inactive over the past 3 months.
3) The features from the Products Owned data set are missing in roughly 70-90% of rows. These features may still add value to the model, so I impute them with 0 as well, assuming the client does not hold these products with the bank.

We do not use mean or median imputation because it ignores correlations between features and artificially reduces the variance of the imputed columns. With a data set this small, the understated variance makes the fitted distributions look more certain than they are, which can bias the model.
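The variance argument can be checked on a toy column. The sketch below (made-up numbers) compares the sample variance of the observed values with the variance after mean imputation: the filled-in values sit exactly at the mean, so they add rows without adding spread.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan, 10.0])  # toy data

observed_var = values.var()                           # NaNs are skipped by default
mean_filled_var = values.fillna(values.mean()).var()  # same spread, more rows

print(observed_var, mean_filled_var)
```

Any constant fill value, including 0, also distorts the column's distribution, so the choice of 0 in this notebook rests on the stated business assumptions rather than on statistical neutrality.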

In [None]:
# replacing NaN values of the Sex field with U (Unknown)
# df_train.dropna(subset = ["Sex"], inplace=True)

df_train.Sex = df_train.Sex.fillna("U")

In [None]:
# Sex has to be converted from object to a numeric type
# df_train.Sex.unique()

# encoding M, F and U as 1, 0 and 2
df_train.Sex = df_train.Sex.replace({'M':1, 'F':0,'U':2})

In [None]:
# imputing with KNNImputer
# col_mean = ['VolumeCred','VolumeCred_CA','TransactionsCred','TransactionsCred_CA','VolumeDeb','VolumeDeb_CA',
#            'VolumeDebCash_Card','VolumeDebCashless_Card','VolumeDeb_PaymentOrder','TransactionsDeb','TransactionsDeb_CA'
#            ,'TransactionsDebCash_Card','TransactionsDebCashless_Card','TransactionsDeb_PaymentOrder']
# k = math.sqrt(df_train.shape[0])
# imputer = KNNImputer(n_neighbors=100, weights='uniform', metric='nan_euclidean')
# df_train = imputer.fit_transform(df_train)

In [None]:
df_train.describe().T

The minimum value of Age is zero, which means the Age column contains erroneous values. We have to analyse the column and either impute those values or, if that is not possible, drop the rows.

In [None]:
# distribution of customer age
sns.set_style('darkgrid')
sns.set(rc={'figure.figsize':(8,6)})
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(x=df_train.Age, bins=30, kde=True, stat='density');

In [None]:
# Checking age of customer where age is less than the tenure with the bank
df_train.query('Age*12 <=Tenure')

There are 34 rows where the client's age is less than their tenure with the bank, so either Age or Tenure must be incorrect. On closer inspection, in some cases the age is below 10 years even though the client holds a current account with the bank, so we conclude that the Age values are the ones in error.

##### Assumption     
1) We assume a client must be at least 10 years old to hold an account with the bank, since even a student account requires the holder to be at least 10.
2) To impute the age, we add 10 years to the client's tenure.

In [None]:
# imputing age with tenure + 120 months 
df_train.Age = np.where((df_train.Age *12 <= df_train.Tenure),round(df_train.Tenure/12) + 10,df_train.Age)

In [None]:
# imputing other values with 0 in the data set
df_train.fillna(0,inplace = True)

In [None]:
df_train.isnull().sum()

In [None]:
# statistical analysis of the data set
df_train.describe().T

Some clients still have an age of 10 years or less, so we now impute these with KNNImputer, treating the values as missing at random.

In [None]:
# imputing with KNNImputer, using k = sqrt(n) neighbours as a rule of thumb
k = math.sqrt(df_train.shape[0])
imputer = KNNImputer(n_neighbors=round(k), weights='uniform', metric='nan_euclidean')
# treat implausible ages (10 years or less) as missing
df_train.Age = df_train.Age.mask(df_train.Age <= 10)
# note: the imputer is fitted on every column, including Client and the target columns
df_train[col] = imputer.fit_transform(df_train.values)

In [None]:
sns.pairplot(df_train[col], hue='Sale_CL', corner=True);

In [None]:
# Checking for multicollinearity

plt.figure(figsize=(20,12))
sns.heatmap(df_train.corr(),cmap='RdBu',annot=True);

There is some correlation between the features, but we will use algorithms that can handle it. The other option is to drop one feature from each highly correlated pair.
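The second option, dropping one feature from each highly correlated pair, could look like the sketch below. The helper `drop_correlated`, the 0.9 threshold and the toy frame are assumptions for illustration; the notebook does not apply this step.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],    # only weakly correlated with a and b
})
reduced, dropped = drop_correlated(toy, threshold=0.9)
print(dropped)
```

Which member of a pair to drop is a judgment call; this sketch simply keeps the column that appears first.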

In [None]:
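# distribution of customer tenure with the bank
sns.set_style('darkgrid')
sns.set(rc={'figure.figsize':(10,8)})
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(x=df_train.Tenure, bins=30, kde=True, stat='density');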

In [None]:
sns.set(rc={'figure.figsize':(6,4)})
# pass the column as x=; recent seaborn versions no longer accept it positionally
sns.countplot(x=df_train.Sale_CL);

There is a clear class imbalance in the data set, which we have to handle when training the models.
Two ways to handle this class imbalance problem:
1) By adjusting the class weights during training
2) By over-sampling or under-sampling the data set
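For the first option, a common starting point is the "balanced" heuristic that scikit-learn documents for `class_weight='balanced'`: each class is weighted by n_samples / (n_classes * class_count). A minimal sketch (the helper name `balanced_class_weights` and the toy labels are hypothetical):

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y_toy = np.array([0] * 7 + [1] * 3)   # hypothetical 70/30 class split
weights = balanced_class_weights(y_toy)
print(weights)
```

The minority class receives the larger weight, so misclassifying a buyer costs more during training than misclassifying a non-buyer.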

In [None]:
# splitting the data set into predictors and targets, then into train and test
X = df_train.copy()
X.drop(['Client','Sale_CL','Revenue_CL'],inplace = True, axis = 1)
# select the targets by name rather than by column position
y_sale_cl = df_train['Sale_CL'].values
y_revenue_cl = df_train['Revenue_CL'].values
X_train, X_test, y_train, y_test = train_test_split(X, y_sale_cl, test_size=0.2, stratify=y_sale_cl, random_state=1)


##### Machine learning Models

Applying ML classification models for predicting sale of consumer loan

I will create a function for model evaluation so that the same code does not have to be repeated, and evaluate each model on several metrics. Based on this evaluation I will decide which algorithms are worth hyperparameter tuning.

In [None]:
def model_evaluation(model, scale=False, classification=False):
    """Cross-validate `model` on the global training split and return the per-fold scores."""
    if classification:
        scoring = ['precision', 'recall', 'f1', 'accuracy']
    else:
        scoring = ['neg_mean_squared_error',
                   'neg_mean_absolute_error',
                   'neg_root_mean_squared_error']
    # optional resampling steps, currently disabled (see the commented lines below)
    over = RandomOverSampler(sampling_strategy=0.3, random_state=1)
    under = RandomUnderSampler(sampling_strategy=0.5, random_state=1)
    steps = [
        # ('i', KNNImputer(n_neighbors=31)),
        # ('ov', over),
        # ('un', under),
    ]
    if scale:
        steps.append(('minmaxscaler', MinMaxScaler(feature_range=(0, 1))))
    steps.append(('m', model))
    pipeline = Pipeline(steps=steps)
    # evaluate the model
    if classification:
        # repeated stratified k-fold keeps the class ratio in every fold
        cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
        scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv, n_jobs=-1)
    else:
        scores = cross_validate(pipeline, X_train_reg, y_train_reg, scoring=scoring, cv=10, n_jobs=-1)
    # store results, one row per fold
    return pd.DataFrame(scores)

In [None]:
print("Model Evaluation for Ada Boost Algorithm")
model_evaluation(AdaBoostClassifier(),scale = False,classification=True)

In [None]:
print("Model Evaluation XGBoost Algorithm")
model_evaluation(XGBClassifier(),scale = False,classification=True)

In [516]:
print("Model Evaluation Random Forest Classifier Algorithm")
model_evaluation(RandomForestClassifier(),scale = False,classification=True)

Model Evaluation Random Forest Classifier Algorithm


| fold | fit_time | score_time | test_precision | test_recall | test_f1 | test_accuracy |
|---|---|---|---|---|---|---|
| 0 | 2.990541 | 0.211409 | 0.8 | 0.173913 | 0.285714 | 0.741935 |
| 1 | 1.892108 | 0.126805 | 0.785714 | 0.23913 | 0.366667 | 0.754839 |
| 2 | 2.199315 | 0.136008 | 0.5 | 0.130435 | 0.206897 | 0.703226 |
| 3 | 1.93171 | 0.156 | 0.461538 | 0.12766 | 0.2 | 0.690323 |
| 4 | 1.808655 | 0.133008 | 0.466667 | 0.148936 | 0.225806 | 0.690323 |
| 5 | 1.936865 | 0.1404 | 0.529412 | 0.195652 | 0.285714 | 0.709677 |
| 6 | 1.790055 | 0.1248 | 0.428571 | 0.065217 | 0.113208 | 0.696774 |
| 7 | 1.710632 | 0.0936 | 0.571429 | 0.26087 | 0.358209 | 0.722581 |
| 8 | 1.414206 | 0.099204 | 0.583333 | 0.148936 | 0.237288 | 0.709677 |
| 9 | 1.394416 | 0.087005 | 0.625 | 0.212766 | 0.31746 | 0.722581 |


In [517]:
print("Model Evaluation Logistic Regression Algorithm")
model_evaluation(LogisticRegression(class_weight={0:0.3,1:0.7}),scale = True,classification=True)

Model Evaluation Logistic Regression Algorithm


| fold | fit_time | score_time | test_precision | test_recall | test_f1 | test_accuracy |
|---|---|---|---|---|---|---|
| 0 | 0.606035 | 0.072603 | 0.439394 | 0.630435 | 0.517857 | 0.651613 |
| 1 | 0.618035 | 0.052801 | 0.428571 | 0.586957 | 0.495413 | 0.645161 |
| 2 | 0.573033 | 0.058202 | 0.405405 | 0.652174 | 0.5 | 0.612903 |
| 3 | 0.489028 | 0.069004 | 0.391304 | 0.574468 | 0.465517 | 0.6 |
| 4 | 0.121404 | 0.0468 | 0.360465 | 0.659574 | 0.466165 | 0.541935 |
| 5 | 0.098803 | 0.0468 | 0.404494 | 0.782609 | 0.533333 | 0.593548 |
| 6 | 0.079203 | 0.0624 | 0.357143 | 0.543478 | 0.431034 | 0.574194 |
| 7 | 0.083401 | 0.0468 | 0.34375 | 0.478261 | 0.4 | 0.574194 |
| 8 | 0.0624 | 0.0312 | 0.592593 | 0.680851 | 0.633663 | 0.76129 |
| 9 | 0.0624 | 0.0312 | 0.415385 | 0.574468 | 0.482143 | 0.625806 |


In [518]:
print("Model Evaluation Naive Bayes Algorithm")
model_evaluation(MultinomialNB(),scale = True,classification=True)

Model Evaluation Naive Bayes Algorithm


| fold | fit_time | score_time | test_precision | test_recall | test_f1 | test_accuracy |
|---|---|---|---|---|---|---|
| 0 | 0.147009 | 0.041201 | 1.0 | 0.021739 | 0.042553 | 0.709677 |
| 1 | 0.17801 | 0.041201 | 1.0 | 0.021739 | 0.042553 | 0.709677 |
| 2 | 0.17801 | 0.056801 | 0.0 | 0.0 | 0.0 | 0.703226 |
| 3 | 0.17501 | 0.041201 | 0.25 | 0.021277 | 0.039216 | 0.683871 |
| 4 | 0.0468 | 0.059801 | 0.0 | 0.0 | 0.0 | 0.696774 |
| 5 | 0.0468 | 0.046201 | 0.5 | 0.021739 | 0.041667 | 0.703226 |
| 6 | 0.074401 | 0.048003 | 0.666667 | 0.043478 | 0.081633 | 0.709677 |
| 7 | 0.057201 | 0.043003 | 0.0 | 0.0 | 0.0 | 0.703226 |
| 8 | 0.049003 | 0.032002 | 1.0 | 0.021277 | 0.041667 | 0.703226 |
| 9 | 0.043002 | 0.036002 | 0.0 | 0.0 | 0.0 | 0.696774 |


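After cross-validating all the models, AdaBoost and logistic regression performed well on this data set. In the outputs shown above, logistic regression reaches a test F1 between 0.40 and 0.63 across folds, against 0.11 to 0.37 for random forest and below 0.09 for naive Bayes. We therefore perform hyperparameter tuning to see whether performance can be improved. Note that the boosted-model grid search in the next section is run on `XGBClassifier`, not `AdaBoostClassifier`.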
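Comparing ten folds by eye is error-prone, so it can help to collapse each model's folds into a mean and standard deviation. A minimal sketch using the `test_f1` values copied from the cross-validation outputs above (the frame `f1_by_model` is introduced here only for illustration):

```python
import pandas as pd

# test_f1 per fold, copied from the cross-validation outputs above
f1_by_model = pd.DataFrame({
    "RandomForest": [0.285714, 0.366667, 0.206897, 0.2, 0.225806,
                     0.285714, 0.113208, 0.358209, 0.237288, 0.31746],
    "LogisticRegression": [0.517857, 0.495413, 0.5, 0.465517, 0.466165,
                           0.533333, 0.431034, 0.4, 0.633663, 0.482143],
    "MultinomialNB": [0.042553, 0.042553, 0.0, 0.039216, 0.0,
                      0.041667, 0.081633, 0.0, 0.041667, 0.0],
})

# one row per model, sorted from best to worst mean F1
summary = f1_by_model.agg(["mean", "std"]).T.sort_values("mean", ascending=False)
print(summary.round(3))
```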

##### XGBoost Classifier

In [519]:
# creating a parameter grid for cross validation and hyper parameter tuning

cv_xgb = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
param_grid_xgb = {'m__max_depth':[6,7,8],
                  'm__gamma':[0,1],
                  'm__alpha':[0,1]
}
model = XGBClassifier()
Steps_xgb = [('m', model)]
pipeline_xgb = Pipeline(steps = Steps_xgb)
grid_search_xgb = GridSearchCV(estimator = pipeline_xgb, param_grid = param_grid_xgb, cv = cv_xgb, n_jobs = -1, verbose = 2)

In [520]:
grid_search_xgb.fit(X_train, y_train)
grid_search_xgb.best_params_

Fitting 10 folds for each of 12 candidates, totalling 120 fits


{'m__alpha': 0, 'm__gamma': 0, 'm__max_depth': 7}
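The "12 candidates, 120 fits" in the log follows directly from the grid and the cross-validation scheme: the candidates are the Cartesian product of the parameter lists, and each candidate is fitted once per fold. A small sketch reproducing the count (parameter names are shown without the `m__` pipeline prefix):

```python
from itertools import product

param_grid = {"max_depth": [6, 7, 8], "gamma": [0, 1], "alpha": [0, 1]}

# every combination of one value per parameter is one candidate
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

n_splits, n_repeats = 5, 2                   # RepeatedStratifiedKFold settings used above
folds_per_candidate = n_splits * n_repeats   # 10 train/validation splits
total_fits = len(candidates) * folds_per_candidate
print(len(candidates), folds_per_candidate, total_fits)
```

Because the count grows multiplicatively, adding one more parameter with three values would triple the number of fits.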

In [521]:
model_xgb = grid_search_xgb.best_estimator_

In [522]:
# predicting the values using X_test data set
y_pred_xgb = model_xgb.predict(X_test)

In [523]:
print(classification_report(y_test,y_pred_xgb))

              precision    recall  f1-score   support

         0.0       0.75      0.90      0.82       136
         1.0       0.58      0.31      0.40        58

    accuracy                           0.73       194
   macro avg       0.67      0.61      0.61       194
weighted avg       0.70      0.73      0.70       194
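The f1-score column in the report is the harmonic mean of precision and recall, so it can be re-derived from the other two columns. A short check against the rows above (the helper name `f1_score_from` is hypothetical):

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# rows of the report above: class 1.0 has precision 0.58 and recall 0.31,
# class 0.0 has precision 0.75 and recall 0.90
f1_positive = f1_score_from(0.58, 0.31)
f1_negative = f1_score_from(0.75, 0.90)
print(round(f1_positive, 2), round(f1_negative, 2))
```

The low F1 for class 1.0 is driven by recall: the model finds fewer than a third of the actual buyers in the test set.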



##### Logistic Regression Algorithm

In [524]:
# creating grid for Logistic regression
cv_lr = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
param_grid_lr = {'m__max_iter':[40,70,100],
              'm__penalty':['l1','l2','elasticnet'],
              'm__class_weight':[{0:0.35,1:0.65},{0:0.30,1:0.70},'balanced']
                }
# the default lbfgs solver does not support the l1 and elasticnet penalties in the grid;
# saga supports all three, and l1_ratio is only used when penalty='elasticnet'
model = LogisticRegression(solver='saga', l1_ratio=0.5)
Steps_lr = [
            ('msc',MinMaxScaler(feature_range=(0, 1))),
#             ('sc',StandardScaler()),
            ('m', model)
]
pipeline_lr = Pipeline(steps = Steps_lr)
grid_search_lr = GridSearchCV(estimator = pipeline_lr, param_grid = param_grid_lr, cv = cv_lr, n_jobs = -1, verbose = 2)

In [525]:
grid_search_lr.fit(X_train, y_train)
grid_search_lr.best_params_

Fitting 10 folds for each of 27 candidates, totalling 270 fits


KeyboardInterrupt: 

In [None]:
model_lr = grid_search_lr.best_estimator_

In [None]:
# predicting the values using X_test data set
y_pred_lr = model_lr.predict(X_test)

In [None]:
print(classification_report(y_test,y_pred_lr))

In [None]:
confusion_matrix(y_test,y_pred_lr)

In [None]:
# ROC curve

# predict probabilities
yhat_roc = model_lr.predict_proba(X_test)
# retrieve just the probabilities for the positive class
pos_probs_roc = yhat_roc[:, 1]
# plot no skill roc curve
plt.plot([0, 1], [0, 1], linestyle='--', label='No Skill')
# calculate roc curve for model
fpr, tpr, _ = roc_curve(y_test, pos_probs_roc)
# plot model roc curve
plt.plot(fpr, tpr, marker='.', label='Logistic')
roc_auc = roc_auc_score(y_test, pos_probs_roc)
print('ROC AUC: %.3f' % roc_auc)
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
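One way to read the ROC AUC printed above: it is the probability that a randomly chosen buyer receives a higher predicted score than a randomly chosen non-buyer, with ties counted as one half. A minimal sketch of that pairwise definition on toy data (the helper `rank_auc` is hypothetical and is not how scikit-learn computes the value internally):

```python
import numpy as np

def rank_auc(y_true, scores):
    """ROC AUC as P(positive score > negative score); ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive/negative pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = rank_auc(y_toy, p_toy)   # 3 of the 4 pairs are ordered correctly
print(auc_toy)
```

This ranking view is what matters for targeting, because clients are contacted in order of predicted score rather than by a fixed 0.5 cut-off.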

In [None]:
# Precision-Recall Curve

# predict probabilities
yhat = model_lr.predict_proba(X_test)
# retrieve just the probabilities for the positive class
pos_probs = yhat[:, 1]
# no-skill baseline: the share of positives in the evaluated (test) set
sale = len(y_test[y_test==1]) / len(y_test)
plt.plot([0, 1], [sale, sale], linestyle='--', label='No Skill')
# calculate model precision-recall curve
precision, recall, _ = precision_recall_curve(y_test, pos_probs)
auc_score = auc(recall, precision)
print('PR AUC: %.3f' % auc_score)
# plot the model precision-recall curve
plt.plot(recall, precision, marker='.', label='LR')
# axis labels
plt.xlabel('Recall')
plt.ylabel('Precision')
# show the legend
plt.legend()
# show the plot
plt.show()

In [None]:
# feature importance code

importance = model_lr.named_steps['m'].coef_[0]
cols = list(X_train.columns)

# green bars for positive coefficients, red bars for negative ones
bar_colors = np.where(importance > 0, 'g', 'r')

# summarize feature importance (the loop variable no longer overwrites the global `col`)
for feature, score in zip(cols, importance):
    print('Feature: %s, Score: %.5f' % (feature, score))

# plot feature importance
positions = list(range(len(importance)))
plt.bar(positions, importance, color=bar_colors)
plt.xticks(positions, cols, rotation=90)
plt.show()

For this data set, logistic regression performed well once the class weights were adjusted, so we use it for the final prediction of consumer loan sales.

In [None]:
# Saving model to disk for sale of consumer loan prediction
with open('model_lr_sale_cl.pkl', 'wb') as f:
    pickle.dump(model_lr, f)

##### Applying Machine Learning Model

Applying Machine Learning model for regression problem and to find the revenue from the sale of consumer loan

In [None]:
# splitting the data set into predictors and target for the regression problem, then into train and test
X_reg = df_train.copy()
X_reg.drop(['Client','Revenue_CL'],inplace = True, axis = 1)
# select the target by name rather than by column position
y_revenue_cl = df_train['Revenue_CL'].values
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_revenue_cl, test_size=0.2, random_state=1)


In [None]:
model_evaluation(RandomForestRegressor(),scale = False,classification = False)

In [None]:
model_evaluation(AdaBoostRegressor(),scale = False,classification = False)

In [None]:
# import sklearn as sklearn
# sklearn.metrics.SCORERS.keys()

From the cross-validation results, random forest performs well; a linear model (Ridge) is evaluated next for comparison.

In [None]:
model_evaluation(Ridge(),scale = True,classification = False)

Of all the models, random forest and Ridge performed best, so we tune their hyperparameters to try to improve performance.


##### Random Forest Regression

In [None]:
# creating a parameter grid for cross validation and hyper parameter tuning
param_grid_rf = {
    'm__max_depth': [80, 90],
    'm__min_samples_leaf': [ 4, 5],
    'm__min_samples_split': [8, 10],
    'm__n_estimators': [50,100,150]
}
model = RandomForestRegressor()
Steps_rf = [ ('m', model)]
pipeline_rf = Pipeline(steps = Steps_rf)
grid_search_rf = GridSearchCV(estimator = pipeline_rf, param_grid = param_grid_rf, cv = 10, n_jobs = -1, verbose = 2)  
# grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)

In [None]:
grid_search_rf.fit(X_train_reg, y_train_reg)
grid_search_rf.best_params_

In [None]:
model_rf = grid_search_rf.best_estimator_

In [None]:
# predicting the values using X_test data set
y_pred_rf = model_rf.predict(X_test_reg)

In [None]:
# Calculating the metrics for our model performance
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_reg, y_pred_rf))
print('Mean Squared Error:', metrics.mean_squared_error(y_test_reg, y_pred_rf))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_reg, y_pred_rf)))
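For reference, the three error metrics printed above can be computed by hand from the residuals. A minimal NumPy sketch on made-up revenues and predictions:

```python
import numpy as np

y_true = np.array([3.0, 0.0, 2.0, 7.0])   # toy revenues
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # toy predictions

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # average absolute miss
mse = np.mean(errors ** 2)      # squaring penalises large misses more
rmse = np.sqrt(mse)             # back in the units of the target
print(mae, mse, rmse)
```

RMSE is never smaller than MAE; a large gap between the two points to a few large errors, which is plausible when most clients generate zero revenue and a few generate a lot.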

In [None]:
# Feature importance
cols_reg = list(X_train_reg.columns)
eli5.explain_weights(model_rf.named_steps['m'], top=15, feature_names=cols_reg)

##### Ridge Regression Algorithm

In [None]:
# creating a parameter grid for cross validation and hyper parameter tuning

param_grid_rd = {
    'm__alpha': arange(0, 1, 0.01)
    
}

model = Ridge()
Steps_rd = [ ('m', model)]
pipeline_rd = Pipeline(steps = Steps_rd)
grid_search_rd = GridSearchCV(estimator = pipeline_rd, param_grid = param_grid_rd, cv = 10, n_jobs = -1, verbose = 2)  
# grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)

In [None]:
grid_search_rd.fit(X_train_reg, y_train_reg)
grid_search_rd.best_params_

In [None]:
model_rd = grid_search_rd.best_estimator_

In [None]:
# predicting the values using X_test data set
y_pred_rd = model_rd.predict(X_test_reg)

In [None]:
# Calculating the metrics for our model performance
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_reg, y_pred_rd))
print('Mean Squared Error:', metrics.mean_squared_error(y_test_reg, y_pred_rd))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_reg, y_pred_rd)))

In [None]:
# Feature Importance

eli5.explain_weights(model_rd.named_steps['m'], top=15, feature_names=cols_reg)

In [None]:
# Saving model to disk for revenue prediction
with open('model_rf_revenue_cl.pkl', 'wb') as f:
    pickle.dump(model_rf, f)