***
***
***

# A2 Data Analysis and Code | Classification Modeling
**Machine Learning**<br />
Shresth Sethi - Student MSBA<br />
FMSBA5 - Valencia<br />
Hult International Business School<br><br>

***
***
***

***
***

**PROJECT APPRENTICE CHEF - Halfway There Cross-Sell service**<br /><br />
**AIM**: The aim of this project is to build a machine learning model that predicts whether a customer will subscribe to the Halfway There service.<br />

**BACKGROUND**: Halfway There is a unique subscription in which subscribers receive a half bottle of wine from a local California vineyard every Wednesday.<br />

**WHY**: Apprentice Chef wants to diversify its revenue stream by adding a cross-sell service. This would also give Apprentice Chef a competitive advantage through its unique offering of hard-to-find local wines.<br />

**ASSUMPTION**: The dataset provided by the engineering team was prepared with sound data engineering techniques, is statistically sound, and represents the true picture of Apprentice Chef’s customers.<br />

***
***

**Importing Libraries**

In [32]:
#Import all the required libraries
import pandas as pd                                      # data science essentials
import matplotlib.pyplot as plt                          # data visualization
import seaborn as sns                                    # enhanced data visualization
import statsmodels.formula.api as smf                    # regression modeling
from sklearn.model_selection import train_test_split     # train/test split
import sklearn.linear_model                              # linear models
from sklearn.neighbors import KNeighborsRegressor        # KNN for Regression
from sklearn.preprocessing import StandardScaler         # standard scaler
from sklearn.metrics import confusion_matrix             # confusion matrix
from sklearn.metrics import roc_auc_score                # auc score
from sklearn.neighbors import KNeighborsClassifier       # KNN for classification
from sklearn.tree import DecisionTreeClassifier          # classification trees
from sklearn.tree import export_graphviz                 # exports graphics
from io import StringIO                                  # saves objects in memory
from IPython.display import Image                        # displays on frontend
import pydotplus                                         # interprets dot objects
from sklearn.model_selection import GridSearchCV         # hyperparameter tuning
from sklearn.metrics import make_scorer                  # customizable scorer
from sklearn.ensemble import RandomForestClassifier      # random forest
from sklearn.ensemble import GradientBoostingClassifier  # gbm


**Loading dataset to the environment**

This step loads the dataset so that it can be analyzed. I have also set the pandas display options so that wide results are shown in full.

In [33]:
#Loading data to the python file
original_df = pd.read_excel("Apprentice_Chef_Dataset.xlsx")
original_description = pd.read_excel('Apprentice_Chef_Data_Dictionary.xlsx')

#Setting print operations for df
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

**CHECKING MISSING VALUES**<br /> 
Checking whether the dataset contains any missing values<br /><br />

***
The dataset is almost clean; only the **FAMILY_NAME** column has missing values.
***

Next, I dropped FAMILY_NAME, NAME, and FIRST_NAME from the analysis, as they do not add value to the steps performed later.

In [34]:
#Checking missing values and then summing it up to know the total column wise
d  = original_df.isnull().sum()

#as only FAMILY_NAME column has missing values
#dropping FAMILY_NAME, NAME, FIRST_NAME

original_df = original_df.drop(labels = ['FAMILY_NAME', 'NAME', 'FIRST_NAME'],
                               axis   = 1)
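
As a minimal sketch of the missing-value check described above, the per-column counts can be filtered so that only columns with at least one gap are shown. The small DataFrame here is made-up illustration data, not the Apprentice Chef dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy_df = pd.DataFrame({
    "NAME":        ["Ann Lee", "Bob", "Cy Oh"],
    "FAMILY_NAME": ["Lee", np.nan, "Oh"],
    "REVENUE":     [1200.0, 850.0, 430.0],
})

# count missing values per column, then keep only columns with at least one
missing_counts    = toy_df.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0]

print(cols_with_missing)
```

Filtering the counts this way makes it easier to spot the affected columns when the dataset has many of them.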

***
***

## **Exploratory Data Analysis**

***
***

### Handling Categorical Data <br />
Separating email domains into three groups: professional, personal, and junk <br />
This is done because the case assigns each email domain to one of these categories. The resulting groups are then one-hot encoded.

STEP 1 : Separating the ID and domain into different columns and storing them in email_df

In [35]:
#Creating an empty list
empty_lst = []



# looping over each email address
for index, col in original_df.iterrows():
    
    # splitting email domain at '@'
    split_email = original_df.loc[index, 'EMAIL'].split(sep = '@')
    
    # appending the list
    empty_lst.append(split_email)
    


# converting empty_lst into a DataFrame 
email_df = pd.DataFrame(empty_lst)



# Creating the id and domain column in email_df
email_df.columns = ['ID', 'DOMAIN']
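
The same split can also be done without an explicit loop. The sketch below uses the pandas string accessor on a few hypothetical email addresses; it is an alternative illustration, not the code that produced the results in this notebook.

```python
import pandas as pd

# Hypothetical email addresses for illustration
toy_df = pd.DataFrame({"EMAIL": ["ann.lee@gmail.com",
                                 "bob.ray@ibm.com",
                                 "cy.oh@aol.com"]})

# split each address once at '@'; expand=True returns one column per part
email_df = toy_df["EMAIL"].str.split("@", n=1, expand=True)
email_df.columns = ["ID", "DOMAIN"]

print(email_df)
```

Because the split is applied to the whole column at once, the result keeps the same row index as the source DataFrame.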

STEP 2 : Joining the data of email_df with original_df<br />
    
*NOTE : I will only be using domain to categorize so the join will be with only the DOMAIN column of email_df*

In [36]:
original_df = pd.concat([  original_df, 
                           email_df.loc[:,'DOMAIN']], 
                           axis = 1)

STEP 3 : Creating customer email domain groups

*Professional/Personal/Junk* are the categories defined in the case; each one is now assigned its list of email domains.

In [37]:
#Creating a list with professional and assigning domain id related to it
professional = ['@mmm.com', '@amex.com', '@apple.com','@boeing.com',
                '@caterpillar.com', '@chevron.com', '@cisco.com',
                '@cocacola.com','@disney.com', '@dupont.com',
                '@exxon.com', '@ge.org','@goldmansacs.com', 
                '@homedepot.com', '@ibm.com', '@intel.com',
                '@jnj.com', '@jpmorgan.com', '@mcdonalds.com',
                '@merck.com', '@microsoft.com', '@nike.com', 
                '@pfizer.com', '@pg.com', '@travelers.com',
                '@unitedtech.com', '@unitedhealth.com', 
                '@verizon.com', '@visa.com', '@walmart.com']


#Creating a list with personal and assigning domain id related to it
personal     = ['@gmail.com', '@yahoo.com', '@protonmail.com']


#Creating a list with junk and assigning domain id related to it
junk         = ['@me.com', '@aol.com', '@hotmail.com', '@live.com',
                '@msn.com', '@passport.com']

STEP 4 : Assigning the groups created for email domains

Storing the values by creating a new column 'DOMAIN_GRP' in the df for further analysis

In [38]:
# Creating an empty list
empty_lst = []



# looping to assign domains
for i in original_df['DOMAIN']:
    
    if '@' + i in professional:
        empty_lst.append('professional') # professional emails
        
    elif '@' + i in personal:
        empty_lst.append('personal') # personal emails

    elif '@' + i in junk:
        empty_lst.append('junk') # junk emails
        
    else:
        # appending a placeholder keeps the list aligned with the rows
        print('Unknown')
        empty_lst.append('unknown')

            

# Creating a new column in the original dataset to store the assigned values
original_df['DOMAIN_GRP'] = pd.Series(empty_lst, index = original_df.index)
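
A lookup dictionary combined with `Series.map` is another way to assign the groups. The sketch below assumes shortened, hypothetical domain lists written without the leading '@', and labels anything unmatched as 'unknown'.

```python
import pandas as pd

# Hypothetical, shortened domain lists (no leading '@' in this sketch)
professional = ["ibm.com", "nike.com"]
personal     = ["gmail.com", "yahoo.com"]
junk         = ["aol.com", "msn.com"]

# build one lookup table: domain -> group label
domain_to_group = {}
for label, domains in [("professional", professional),
                       ("personal",     personal),
                       ("junk",         junk)]:
    for domain in domains:
        domain_to_group[domain] = label

toy_df = pd.DataFrame({"DOMAIN": ["gmail.com", "ibm.com",
                                  "aol.com", "example.org"]})

# map() keeps row alignment; unmatched domains become NaN, then 'unknown'
toy_df["DOMAIN_GRP"] = toy_df["DOMAIN"].map(domain_to_group).fillna("unknown")

print(toy_df)
```

Since every row receives a label, the new column always has the same length as the DataFrame, even when an unexpected domain appears.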

STEP 5 : Creating one hot encoding variables

The DOMAIN_GRP column is one-hot encoded first. The "DOMAIN_GRP", "DOMAIN", and "EMAIL" columns are then dropped, since their information is now encoded, and the one-hot columns are joined to the original data frame.

In [39]:
#using get_dummies to encode the DOMAIN_GRP column
domain_one_hot = pd.get_dummies(original_df['DOMAIN_GRP'])



#Dropping the columns
original_df          = original_df.drop(['DOMAIN_GRP', 'DOMAIN', 'EMAIL'], axis = 1)



#Joining the one hot encoding with the df
original_df          = original_df.join([domain_one_hot])
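
To make the encoding step concrete, here is a small sketch on hypothetical data. Passing `dtype=int` is my own choice for the illustration so that the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy_df = pd.DataFrame({"REVENUE":    [1200, 850, 430],
                       "DOMAIN_GRP": ["personal", "professional", "junk"]})

# one indicator column per group; dtype=int yields 0/1 instead of True/False
domain_one_hot = pd.get_dummies(toy_df["DOMAIN_GRP"], dtype=int)

# drop the encoded source column and attach the indicator columns
encoded_df = toy_df.drop(columns="DOMAIN_GRP").join(domain_one_hot)

print(encoded_df)
```

Each row has exactly one indicator set to 1, so the three columns always sum to 1. This is why a regression model with an intercept typically uses at most two of them.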

**Checkpoint 1**

Creating a checkpoint to save the data, so that the original Excel data file is left unchanged

In [40]:
# saving results
original_df.to_excel('chef_feature_rich.xlsx',
                 index = False)

In [41]:
# loading saved file
original_df = pd.read_excel('chef_feature_rich.xlsx')

### Checking Outliers<br />
**Visualizing numerical data**

Box plots and distribution plots are used to identify outliers so that they can be treated separately in the analysis. For each outlier I create a separate flag column and check how it affects the results. The visuals are needed to choose the cutoff for each variable or to pick specific values to flag.


*I have commented out this code for faster processing, but my decisions are based on these graphs.*

In [42]:
#creating subplots for better visuals

#fig, ax = plt.subplots(figsize = (10, 8)) 
#plt.subplot(2, 2, 1)
#sns.boxplot(original_df['TOTAL_MEALS_ORDERED'])
#plt.subplot(2, 2, 2)
#sns.distplot(original_df['UNIQUE_MEALS_PURCH'])
#plt.subplot(2, 2, 3)
#sns.boxplot(original_df['CONTACTS_W_CUSTOMER_SERVICE'])
#plt.subplot(2, 2, 4)
#sns.boxplot(original_df['PRODUCT_CATEGORIES_VIEWED'])

#fig, ax = plt.subplots(figsize = (10, 8))
#plt.subplot(2, 2, 1)
#sns.boxplot(original_df['AVG_TIME_PER_SITE_VISIT'])
#plt.subplot(2, 2, 2)
#sns.distplot(original_df['MOBILE_NUMBER'])
#plt.subplot(2, 2, 3)
#sns.boxplot(original_df['CANCELLATIONS_BEFORE_NOON'])
#plt.subplot(2, 2, 4)
#sns.boxplot(original_df['CANCELLATIONS_AFTER_NOON'])

#fig, ax = plt.subplots(figsize = (10, 8))
#plt.subplot(2, 2, 1)
#sns.distplot(original_df['TASTES_AND_PREFERENCES'])
#plt.subplot(2,2,2)
#sns.distplot(original_df['PC_LOGINS'])
#plt.subplot(2,2,3)
#sns.distplot(original_df['MOBILE_LOGINS'])
#plt.subplot(2,2,4)
#sns.boxplot(original_df['WEEKLY_PLAN'])

#fig, ax = plt.subplots(figsize = (10, 8))
#plt.subplot(2, 2, 1)
#sns.distplot(original_df['EARLY_DELIVERIES'])
#plt.subplot(2,2,2)
#sns.boxplot(original_df['LATE_DELIVERIES'])
#plt.subplot(2,2,3)
#sns.distplot(original_df['PACKAGE_LOCKER'])
#plt.subplot(2,2,4)
#sns.distplot(original_df['REFRIGERATED_LOCKER'])

#fig, ax = plt.subplots(figsize = (10, 8))
#plt.subplot(2, 2, 1)
#sns.boxplot(original_df['FOLLOWED_RECOMMENDATIONS_PCT'])
#plt.subplot(2,2,2)
#sns.boxplot(original_df['AVG_PREP_VID_TIME'])
#plt.subplot(2,2,3)
#sns.boxplot(original_df['LARGEST_ORDER_SIZE'])
#plt.subplot(2,2,4)
#sns.distplot(original_df['MASTER_CLASSES_ATTENDED'])

#fig, ax = plt.subplots(figsize = (10, 8))
#plt.subplot(2, 2, 1)
#sns.distplot(original_df['MEDIAN_MEAL_RATING'])
#plt.subplot(2,2,2)
#sns.boxplot(original_df['AVG_CLICKS_PER_VISIT'])
#plt.subplot(2,2,3)
#sns.distplot(original_df['TOTAL_PHOTOS_VIEWED'])
#plt.subplot(2,2,4)
#sns.distplot(original_df['junk'])

#fig, ax = plt.subplots(figsize = (10, 8))
#plt.subplot(2, 2, 1)
#sns.distplot(original_df['personal'])
#plt.subplot(2,2,2)
#sns.distplot(original_df['professional'])
#plt.subplot(2,2,3)
#sns.distplot(original_df['CROSS_SELL_SUCCESS'])
#plt.subplot(2,2,4)
#sns.boxplot(original_df['REVENUE'])


#plt.show()
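
The repeated 2x2 grids above could also be produced by a small helper. The sketch below is an assumption-laden illustration: `plot_grid` is a hypothetical function name, the data is random, and it uses plain matplotlib rather than seaborn (note that `sns.distplot` is deprecated in recent seaborn releases).

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric data for illustration
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "TOTAL_MEALS_ORDERED":     rng.poisson(70, 200),
    "AVG_TIME_PER_SITE_VISIT": rng.normal(100, 30, 200),
    "WEEKLY_PLAN":             rng.poisson(10, 200),
    "REVENUE":                 rng.gamma(2.0, 1000.0, 200),
})

def plot_grid(df, columns, kind="box"):
    """Draw one box plot or histogram per column on a 2x2 grid."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for ax, col in zip(axes.ravel(), columns):
        if kind == "box":
            ax.boxplot(df[col].to_numpy())
        else:
            ax.hist(df[col].to_numpy(), bins=20)
        ax.set_title(col)
    fig.tight_layout()
    return fig

fig = plot_grid(toy_df, list(toy_df.columns), kind="box")
```

Calling the helper once per group of four columns replaces the copied blocks and keeps the plot settings in one place.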

***
***

## **Feature Engineering**

***
***

**Defining Thresholds**

I used the graphs above to define thresholds for the different variables.

In [43]:
#Setting Thresholds for outliers

REVENUE_out = 4200 
TOTAL_MEALS_ORDERED_out = 180
UNIQUE_MEALS_PURCH_out = 8
CONTACTS_W_CUSTOMER_SERVICE_hi = 8
CONTACTS_W_CUSTOMER_SERVICE_lo = 5
PRODUCT_CATEGORIES_VIEWED_at = 5
AVG_TIME_PER_SITE_VISIT_out = 200
MOBILE_NUMBER_at = 1
CANCELLATIONS_BEFORE_NOON_at = 1
CANCELLATIONS_AFTER_NOON_at = 0
TASTES_AND_PREFERENCES_at = 1
PC_LOGINS_at = 5
PC_LOGINS_at1 = 6
MOBILE_LOGINS_at = 1
MOBILE_LOGINS_at1 = 2
WEEKLY_PLAN_out = 30
EARLY_DELIVERIES_at = 0
LATE_DELIVERIES_at = 3
PACKAGE_LOCKER_at = 0
PACKAGE_LOCKER_at1 = 1
REFRIGERATED_LOCKER_at = 0
FOLLOWED_RECOMMENDATIONS_PCT_at = 30
AVG_PREP_VID_TIME_out = 280
LARGEST_ORDER_SIZE_at = 4
MASTER_CLASSES_ATTENDED_at = 0
MASTER_CLASSES_ATTENDED_at1 = 1
MEDIAN_MEAL_RATING_at = 3
AVG_CLICKS_PER_VISIT_at = 13
TOTAL_PHOTOS_VIEWED_at = 0

### Creating new outlier variables

Based on these thresholds, I created the following outlier flag variables for the final model.

In [44]:
# Selecting REVENUE outliers
original_df['REVENUE_out'] = 0
condition_hi = original_df.loc[0:,'REVENUE_out'][original_df['REVENUE'] > REVENUE_out]
original_df['REVENUE_out'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

# Selecting TOTAL_MEALS_ORDERED_out outliers
original_df['TOTAL_MEALS_ORDERED_out'] = 0
condition_hi = original_df.loc[0:,'TOTAL_MEALS_ORDERED_out'][original_df['TOTAL_MEALS_ORDERED'] > TOTAL_MEALS_ORDERED_out]
original_df['TOTAL_MEALS_ORDERED_out'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)
#Selecting UNIQUE_MEALS_PURCH_out outliers
original_df['UNIQUE_MEALS_PURCH_out'] = 0
condition_hi = original_df.loc[0:,'UNIQUE_MEALS_PURCH_out'][original_df['UNIQUE_MEALS_PURCH'] > UNIQUE_MEALS_PURCH_out]
original_df['UNIQUE_MEALS_PURCH_out'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting CONTACTS_W_CUSTOMER_SERVICE_out defining boundaries for outliers
original_df['CONTACTS_W_CUSTOMER_SERVICE_out'] = 0
condition_hi = original_df.loc[0:,'CONTACTS_W_CUSTOMER_SERVICE_out'][original_df['CONTACTS_W_CUSTOMER_SERVICE'] > CONTACTS_W_CUSTOMER_SERVICE_hi]
condition_lo = original_df.loc[0:,'CONTACTS_W_CUSTOMER_SERVICE_out'][original_df['CONTACTS_W_CUSTOMER_SERVICE'] < CONTACTS_W_CUSTOMER_SERVICE_lo]
original_df['CONTACTS_W_CUSTOMER_SERVICE_out'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)
original_df['CONTACTS_W_CUSTOMER_SERVICE_out'].replace(to_replace = condition_lo,
                                              value      = 1,
                                              inplace    = True)

#Selecting PRODUCT_CATEGORIES_VIEWED_at
original_df['PRODUCT_CATEGORIES_VIEWED_at'] = 0
condition_hi = original_df.loc[0:,'PRODUCT_CATEGORIES_VIEWED_at'][original_df['PRODUCT_CATEGORIES_VIEWED'] == PRODUCT_CATEGORIES_VIEWED_at]
original_df['PRODUCT_CATEGORIES_VIEWED_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting AVG_TIME_PER_SITE_VISIT_out
original_df['AVG_TIME_PER_SITE_VISIT_out'] = 0
condition_hi = original_df.loc[0:,'AVG_TIME_PER_SITE_VISIT_out'][original_df['AVG_TIME_PER_SITE_VISIT'] > AVG_TIME_PER_SITE_VISIT_out]
original_df['AVG_TIME_PER_SITE_VISIT_out'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting MOBILE_NUMBER_at
original_df['MOBILE_NUMBER_at'] = 0
condition_hi = original_df.loc[0:,'MOBILE_NUMBER_at'][original_df['MOBILE_NUMBER'] == MOBILE_NUMBER_at]
original_df['MOBILE_NUMBER_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting CANCELLATIONS_BEFORE_NOON_at
original_df['CANCELLATIONS_BEFORE_NOON_at'] = 0
condition_hi = original_df.loc[0:,'CANCELLATIONS_BEFORE_NOON_at'][original_df['CANCELLATIONS_BEFORE_NOON'] == CANCELLATIONS_BEFORE_NOON_at]
original_df['CANCELLATIONS_BEFORE_NOON_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting CANCELLATIONS_AFTER_NOON_at
original_df['CANCELLATIONS_AFTER_NOON_at'] = 0
condition_hi = original_df.loc[0:,'CANCELLATIONS_AFTER_NOON_at'][original_df['CANCELLATIONS_AFTER_NOON'] == CANCELLATIONS_AFTER_NOON_at]
original_df['CANCELLATIONS_AFTER_NOON_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting TASTES_AND_PREFERENCES_at
original_df['TASTES_AND_PREFERENCES_at'] = 0
condition_hi = original_df.loc[0:,'TASTES_AND_PREFERENCES_at'][original_df['TASTES_AND_PREFERENCES'] == TASTES_AND_PREFERENCES_at]
original_df['TASTES_AND_PREFERENCES_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting PC_LOGINS_at
original_df['PC_LOGINS_at'] = 0
condition_hi = original_df.loc[0:,'PC_LOGINS_at'][original_df['PC_LOGINS'] == PC_LOGINS_at]
original_df['PC_LOGINS_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting PC_LOGINS_at1
original_df['PC_LOGINS_at1'] = 0
condition_hi = original_df.loc[0:,'PC_LOGINS_at1'][original_df['PC_LOGINS'] == PC_LOGINS_at1]
original_df['PC_LOGINS_at1'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting MOBILE_LOGINS_at
original_df['MOBILE_LOGINS_at'] = 0
condition_hi = original_df.loc[0:,'MOBILE_LOGINS_at'][original_df['MOBILE_LOGINS'] == MOBILE_LOGINS_at]
original_df['MOBILE_LOGINS_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting MOBILE_LOGINS_at1
original_df['MOBILE_LOGINS_at1'] = 0
condition_hi = original_df.loc[0:,'MOBILE_LOGINS_at1'][original_df['MOBILE_LOGINS'] == MOBILE_LOGINS_at1]
original_df['MOBILE_LOGINS_at1'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting WEEKLY_PLAN_out
original_df['WEEKLY_PLAN_out'] = 0
condition_hi = original_df.loc[0:,'WEEKLY_PLAN_out'][original_df['WEEKLY_PLAN'] > WEEKLY_PLAN_out]
original_df['WEEKLY_PLAN_out'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting EARLY_DELIVERIES_at
original_df['EARLY_DELIVERIES_at'] = 0
condition_hi = original_df.loc[0:,'EARLY_DELIVERIES_at'][original_df['EARLY_DELIVERIES'] == EARLY_DELIVERIES_at]
original_df['EARLY_DELIVERIES_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting LATE_DELIVERIES_at
original_df['LATE_DELIVERIES_at'] = 0
condition_hi = original_df.loc[0:,'LATE_DELIVERIES_at'][original_df['LATE_DELIVERIES'] == LATE_DELIVERIES_at]
original_df['LATE_DELIVERIES_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting PACKAGE_LOCKER_at
original_df['PACKAGE_LOCKER_at'] = 0
condition_hi = original_df.loc[0:,'PACKAGE_LOCKER_at'][original_df['PACKAGE_LOCKER'] == PACKAGE_LOCKER_at]
original_df['PACKAGE_LOCKER_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting PACKAGE_LOCKER_at1
original_df['PACKAGE_LOCKER_at1'] = 0
condition_hi = original_df.loc[0:,'PACKAGE_LOCKER_at1'][original_df['PACKAGE_LOCKER'] == PACKAGE_LOCKER_at1]
original_df['PACKAGE_LOCKER_at1'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting REFRIGERATED_LOCKER_at
original_df['REFRIGERATED_LOCKER_at'] = 0
condition_hi = original_df.loc[0:,'REFRIGERATED_LOCKER_at'][original_df['REFRIGERATED_LOCKER'] == REFRIGERATED_LOCKER_at]
original_df['REFRIGERATED_LOCKER_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting FOLLOWED_RECOMMENDATIONS_PCT_at
original_df['FOLLOWED_RECOMMENDATIONS_PCT_at'] = 0
condition_hi = original_df.loc[0:,'FOLLOWED_RECOMMENDATIONS_PCT_at'][original_df['FOLLOWED_RECOMMENDATIONS_PCT'] == FOLLOWED_RECOMMENDATIONS_PCT_at]
original_df['FOLLOWED_RECOMMENDATIONS_PCT_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)


#Selecting AVG_PREP_VID_TIME_out
original_df['AVG_PREP_VID_TIME_out'] = 0
condition_hi = original_df.loc[0:,'AVG_PREP_VID_TIME_out'][original_df['AVG_PREP_VID_TIME'] > AVG_PREP_VID_TIME_out]
original_df['AVG_PREP_VID_TIME_out'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting LARGEST_ORDER_SIZE_at
original_df['LARGEST_ORDER_SIZE_at'] = 0
condition_hi = original_df.loc[0:,'LARGEST_ORDER_SIZE_at'][original_df['LARGEST_ORDER_SIZE'] == LARGEST_ORDER_SIZE_at]
original_df['LARGEST_ORDER_SIZE_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting MASTER_CLASSES_ATTENDED_at
original_df['MASTER_CLASSES_ATTENDED_at'] = 0
condition_hi = original_df.loc[0:,'MASTER_CLASSES_ATTENDED_at'][original_df['MASTER_CLASSES_ATTENDED'] == MASTER_CLASSES_ATTENDED_at]
original_df['MASTER_CLASSES_ATTENDED_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting MASTER_CLASSES_ATTENDED_at1
original_df['MASTER_CLASSES_ATTENDED_at1'] = 0
condition_hi = original_df.loc[0:,'MASTER_CLASSES_ATTENDED_at1'][original_df['MASTER_CLASSES_ATTENDED'] == MASTER_CLASSES_ATTENDED_at1]
original_df['MASTER_CLASSES_ATTENDED_at1'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting MEDIAN_MEAL_RATING_at
original_df['MEDIAN_MEAL_RATING_at'] = 0
condition_hi = original_df.loc[0:,'MEDIAN_MEAL_RATING_at'][original_df['MEDIAN_MEAL_RATING'] == MEDIAN_MEAL_RATING_at]
original_df['MEDIAN_MEAL_RATING_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)

#Selecting AVG_CLICKS_PER_VISIT_at
original_df['AVG_CLICKS_PER_VISIT_at'] = 0
condition_hi = original_df.loc[0:,'AVG_CLICKS_PER_VISIT_at'][original_df['AVG_CLICKS_PER_VISIT'] == AVG_CLICKS_PER_VISIT_at]
original_df['AVG_CLICKS_PER_VISIT_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)
#Selecting TOTAL_PHOTOS_VIEWED_at
original_df['TOTAL_PHOTOS_VIEWED_at'] = 0
condition_hi = original_df.loc[0:,'TOTAL_PHOTOS_VIEWED_at'][original_df['TOTAL_PHOTOS_VIEWED'] == TOTAL_PHOTOS_VIEWED_at]
original_df['TOTAL_PHOTOS_VIEWED_at'].replace(to_replace = condition_hi,
                                                   value      = 1,
                                                   inplace    = True)
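
Each block above builds a 0/1 flag from a threshold. A boolean comparison is one compact way to build the same kind of flag; the sketch below shows the three variants used here (above a cutoff, equal to a value, outside a range) on hypothetical toy data. It is an illustration of the idea, not the code that produced this notebook's results.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy_df = pd.DataFrame({"REVENUE":                     [1500, 5000, 4200],
                       "MOBILE_LOGINS":               [1, 2, 1],
                       "CONTACTS_W_CUSTOMER_SERVICE": [4, 6, 9]})

# thresholds copied from the cell that defines them
REVENUE_out                    = 4200
MOBILE_LOGINS_at               = 1
CONTACTS_W_CUSTOMER_SERVICE_hi = 8
CONTACTS_W_CUSTOMER_SERVICE_lo = 5

# '_out' flag: 1 where the value exceeds the threshold
toy_df["REVENUE_out"] = (toy_df["REVENUE"] > REVENUE_out).astype(int)

# '_at' flag: 1 where the value equals the threshold
toy_df["MOBILE_LOGINS_at"] = (toy_df["MOBILE_LOGINS"] == MOBILE_LOGINS_at).astype(int)

# two-sided flag: 1 where the value is above the upper or below the lower bound
contacts = toy_df["CONTACTS_W_CUSTOMER_SERVICE"]
toy_df["CONTACTS_W_CUSTOMER_SERVICE_out"] = (
    (contacts > CONTACTS_W_CUSTOMER_SERVICE_hi) |
    (contacts < CONTACTS_W_CUSTOMER_SERVICE_lo)
).astype(int)

print(toy_df)
```

Assigning the result directly to a new column also avoids calling `replace(..., inplace = True)` on a selected column, which may not modify the DataFrame in pandas versions that enable copy-on-write.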



**Checkpoint 2**

Creating a checkpoint to save the data

In [45]:
# saving results
original_df.to_excel('chef_feature_rich.xlsx',
                 index = False)

In [46]:
# loading saved file
original_df = pd.read_excel('chef_feature_rich.xlsx')

***
***

## **Modeling Techniques**

***
***


### CORRELATION MATRIX<br /> 


The correlations show how strongly each variable is linearly related to CROSS_SELL_SUCCESS. The variables displayed below have the strongest correlations with it. The sign gives the direction of the relationship: a positive value means the variable tends to rise with cross-sell success, and a negative value means it tends to fall.

**High Correlations**

~~~
CROSS_SELL_SUCCESS                 1.00
FOLLOWED_RECOMMENDATIONS_PCT       0.46
junk                              -0.28
FOLLOWED_RECOMMENDATIONS_PCT_at   -0.24
professional                       0.19
CANCELLATIONS_BEFORE_NOON          0.16
MOBILE_NUMBER                      0.10
MOBILE_NUMBER_at                   0.10
TASTES_AND_PREFERENCES_at          0.08
TASTES_AND_PREFERENCES             0.08
REFRIGERATED_LOCKER                0.07
REFRIGERATED_LOCKER_at            -0.07
~~~

*The code below is commented out as the relevant results are displayed above.*

In [47]:
#creating correlation for the dataset
#df_corr = original_df.corr().round(2)


#displaying result of correlation with respect to CROSS_SELL_SUCCESS
#df_corr.loc[ : ,'CROSS_SELL_SUCCESS'].sort_values(ascending = False)
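
As a small illustration of the correlation step, the sketch below computes each column's Pearson correlation with the target on hypothetical toy data and ranks the columns by absolute strength, so that strong negative correlations are not pushed to the bottom.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy_df = pd.DataFrame({
    "CROSS_SELL_SUCCESS":           [1, 0, 1, 1, 0, 0, 1, 0],
    "FOLLOWED_RECOMMENDATIONS_PCT": [80, 10, 60, 90, 20, 30, 70, 0],
    "junk":                         [0, 1, 0, 0, 1, 0, 0, 1],
})

# Pearson correlation of every column with the target (target itself removed)
target_corr = (toy_df.corr()
                     .loc[:, "CROSS_SELL_SUCCESS"]
                     .drop("CROSS_SELL_SUCCESS")
                     .round(2))

# rank by absolute value, keeping the original signs
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)

print(ranked)
```

Sorting by absolute value is a design choice for screening: for this purpose a correlation of -0.28 is more informative than one of 0.07, even though it sorts lower in a plain descending list.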


This code was used to print the column names, each followed by a comma ",", so they could be pasted into the next cell.

*The code has been commented out because its output is not relevant here and to speed up processing.*

In [48]:
#Printing column names to use
#for name in original_df.columns:
#    print(f"'{name}' ,")


**Creating X variable with statistically significant columns**

Printing the variables joined by a + sign, for use in the logistic regression formula.

- The print statement is commented out because the statistically significant variables are already pasted into the regression model<br />
- The other entries in x_val are commented out because they were not statistically significant; they were removed one at a time based on their p-values

In [49]:
#Defining the X variables and commenting out the statistically insignificant ones
x_val = [   #'REVENUE' ,
            #'TOTAL_MEALS_ORDERED' ,
            #'UNIQUE_MEALS_PURCH' ,
            #'CONTACTS_W_CUSTOMER_SERVICE' ,
            #'PRODUCT_CATEGORIES_VIEWED' ,
            #'AVG_TIME_PER_SITE_VISIT' ,
            #'MOBILE_NUMBER' ,
            'CANCELLATIONS_BEFORE_NOON' ,
            #'CANCELLATIONS_AFTER_NOON' ,
            #'TASTES_AND_PREFERENCES' ,
            #'PC_LOGINS' ,
            #'MOBILE_LOGINS' ,
            #'WEEKLY_PLAN' ,
            #'EARLY_DELIVERIES' ,
            #'LATE_DELIVERIES' ,
            #'PACKAGE_LOCKER' ,
            #'REFRIGERATED_LOCKER' ,
            'FOLLOWED_RECOMMENDATIONS_PCT' ,
            #'AVG_PREP_VID_TIME' ,
            #'LARGEST_ORDER_SIZE' ,
            #'MASTER_CLASSES_ATTENDED' ,
            #'MEDIAN_MEAL_RATING' ,
            #'AVG_CLICKS_PER_VISIT' ,
            #'TOTAL_PHOTOS_VIEWED' ,
            #'junk' ,
            'personal' ,
            'professional' ,
            #'REVENUE_out' ,
            #'TOTAL_MEALS_ORDERED_out' ,
            #'UNIQUE_MEALS_PURCH_out' ,
            #'CONTACTS_W_CUSTOMER_SERVICE_out' ,
            #'PRODUCT_CATEGORIES_VIEWED_at' ,
            #'AVG_TIME_PER_SITE_VISIT_out' ,
            'MOBILE_NUMBER_at' ,
            #'CANCELLATIONS_BEFORE_NOON_at' ,
            #'CANCELLATIONS_AFTER_NOON_at' ,
            'TASTES_AND_PREFERENCES_at' ,
            #'PC_LOGINS_at' ,
            'PC_LOGINS_at1' ,
            'MOBILE_LOGINS_at' ,
            #'MOBILE_LOGINS_at1' ,
            #'WEEKLY_PLAN_out' ,
            #'EARLY_DELIVERIES_at' ,
            #'LATE_DELIVERIES_at' ,
            #'PACKAGE_LOCKER_at' ,
            #'PACKAGE_LOCKER_at1' ,
            #'REFRIGERATED_LOCKER_at' ,
            'FOLLOWED_RECOMMENDATIONS_PCT_at' ,
            #'AVG_PREP_VID_TIME_out' ,
            #'LARGEST_ORDER_SIZE_at' ,
            #'MASTER_CLASSES_ATTENDED_at' ,
            #'MASTER_CLASSES_ATTENDED_at1' ,
            #'MEDIAN_MEAL_RATING_at' ,
            #'AVG_CLICKS_PER_VISIT_at' ,
            #'TOTAL_PHOTOS_VIEWED_at'
        ]

#Running a loop to print the statistically significant variables to be used in logit regression
#for name in x_val:
   # print(f"original_df['{name}'] +")
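
Instead of printing the terms and pasting them by hand, the formula string can be assembled directly from the list. The sketch below uses a shortened copy of `x_val` and plain column names on the right-hand side, which is an assumption about how the formula could be written.

```python
# shortened copy of x_val for illustration
x_val = ["CANCELLATIONS_BEFORE_NOON",
         "FOLLOWED_RECOMMENDATIONS_PCT",
         "personal",
         "professional"]

# join the predictors with ' + ' and prepend the response variable
formula = "CROSS_SELL_SUCCESS ~ " + " + ".join(x_val)

print(formula)
```

A string built this way can be passed as the `formula` argument of `smf.logit` together with `data = original_df`, so the variable list only has to be maintained in one place.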

### Logistic Regression

- The model below uses only statistically significant variables. I first fit a model with all the variables and then removed those with p-values greater than 0.05. The removal was done step by step: the variable with the highest p-value was removed first, then the next, one at a time.


In [50]:
logit_full = smf.logit(formula =  """   CROSS_SELL_SUCCESS ~
                                        original_df['CANCELLATIONS_BEFORE_NOON'] +
                                        original_df['FOLLOWED_RECOMMENDATIONS_PCT'] +
                                        original_df['personal']+
                                        original_df['professional'] +
                                        original_df['MOBILE_NUMBER_at'] +
                                        original_df['TASTES_AND_PREFERENCES_at'] +
                                        original_df['PC_LOGINS_at1'] +
                                        original_df['MOBILE_LOGINS_at'] +
                                        original_df['FOLLOWED_RECOMMENDATIONS_PCT_at']""",
                          data = original_df)
#Fit the model to the data
logit_full = logit_full.fit()

#Display the results
print(logit_full.summary())

Optimization terminated successfully.
         Current function value: 0.409431
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:     CROSS_SELL_SUCCESS   No. Observations:                 1946
Model:                          Logit   Df Residuals:                     1936
Method:                           MLE   Df Model:                            9
Date:                Sun, 15 Mar 2020   Pseudo R-squ.:                  0.3478
Time:                        23:50:23   Log-Likelihood:                -796.75
converged:                       True   LL-Null:                       -1221.6
Covariance Type:            nonrobust   LLR p-value:                4.242e-177
                                                     coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------
Intercept                           
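
The step-by-step removal described for this model was done by hand. A minimal sketch of how such a backward elimination could be automated is shown below; `backward_eliminate` is a hypothetical helper, the data is synthetic, and the 0.05 cutoff mirrors the one stated above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(df, response, predictors, alpha=0.05):
    """Drop the predictor with the largest p-value until all are <= alpha."""
    kept = list(predictors)
    while kept:
        formula = f"{response} ~ " + " + ".join(kept)
        result  = smf.logit(formula=formula, data=df).fit(disp=0)
        pvalues = result.pvalues.drop("Intercept")
        worst   = pvalues.idxmax()
        if pvalues[worst] <= alpha:
            break                  # every remaining predictor is significant
        kept.remove(worst)         # remove one variable, then refit
    return kept

# Hypothetical synthetic data: only 'signal' truly drives the outcome
rng = np.random.default_rng(42)
n = 500
toy_df = pd.DataFrame({"signal":  rng.normal(size=n),
                       "noise_a": rng.normal(size=n),
                       "noise_b": rng.normal(size=n)})
prob = 1 / (1 + np.exp(-1.5 * toy_df["signal"]))
toy_df["y"] = (rng.random(n) < prob).astype(int)

kept = backward_eliminate(toy_df, "y", ["signal", "noise_a", "noise_b"])
print(kept)
```

Removing one variable per iteration matters because p-values change after every refit; dropping several at once can discard a variable that would have become significant.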

**CREATING DICTIONARY**


- The dictionary stores two variable sets: "logit_full", with all variables, and "logit_sig", with only the significant variables. <br />
- The dictionary makes it possible to test these variable sets on different models and compare model performance.


In [51]:
#Creates the dictionary
candidate_dict = {

                 #With all the variables significant and non significant
                 'logit_full'   : [ 'REVENUE' ,
                                    'TOTAL_MEALS_ORDERED' ,
                                    'UNIQUE_MEALS_PURCH' ,
                                    'CONTACTS_W_CUSTOMER_SERVICE' ,
                                    'PRODUCT_CATEGORIES_VIEWED' ,
                                    'AVG_TIME_PER_SITE_VISIT' ,
                                    'MOBILE_NUMBER' ,
                                    'CANCELLATIONS_BEFORE_NOON' ,
                                    'CANCELLATIONS_AFTER_NOON' ,
                                    'TASTES_AND_PREFERENCES' ,
                                    'PC_LOGINS' ,
                                    'MOBILE_LOGINS' ,
                                    'WEEKLY_PLAN' ,
                                    'EARLY_DELIVERIES' ,
                                    'LATE_DELIVERIES' ,
                                    'PACKAGE_LOCKER' ,
                                    'REFRIGERATED_LOCKER' ,
                                    'FOLLOWED_RECOMMENDATIONS_PCT' ,
                                    'AVG_PREP_VID_TIME' ,
                                    'LARGEST_ORDER_SIZE' ,
                                    'MASTER_CLASSES_ATTENDED' ,
                                    'MEDIAN_MEAL_RATING' ,
                                    'AVG_CLICKS_PER_VISIT' ,
                                    'TOTAL_PHOTOS_VIEWED' ,
                                    'junk' ,
                                    'personal' ,
                                    'professional' ,
                                    'REVENUE_out' ,
                                    'TOTAL_MEALS_ORDERED_out' ,
                                    'UNIQUE_MEALS_PURCH_out' ,
                                    'CONTACTS_W_CUSTOMER_SERVICE_out' ,
                                    'PRODUCT_CATEGORIES_VIEWED_at' ,
                                    'AVG_TIME_PER_SITE_VISIT_out' ,
                                    'MOBILE_NUMBER_at' ,
                                    'CANCELLATIONS_BEFORE_NOON_at' ,
                                    'CANCELLATIONS_AFTER_NOON_at' ,
                                    'TASTES_AND_PREFERENCES_at' ,
                                    'PC_LOGINS_at' ,
                                    'PC_LOGINS_at1' ,
                                    'MOBILE_LOGINS_at' ,
                                    'MOBILE_LOGINS_at1' ,
                                    'WEEKLY_PLAN_out' ,
                                    'EARLY_DELIVERIES_at' ,
                                    'LATE_DELIVERIES_at' ,
                                    'PACKAGE_LOCKER_at' ,
                                    'PACKAGE_LOCKER_at1' ,
                                    'REFRIGERATED_LOCKER_at' ,
                                    'FOLLOWED_RECOMMENDATIONS_PCT_at' ,
                                    'AVG_PREP_VID_TIME_out' ,
                                    'LARGEST_ORDER_SIZE_at' ,
                                    'MASTER_CLASSES_ATTENDED_at' ,
                                    'MASTER_CLASSES_ATTENDED_at1' ,
                                    'MEDIAN_MEAL_RATING_at' ,
                                    'AVG_CLICKS_PER_VISIT_at' ,
                                    'TOTAL_PHOTOS_VIEWED_at'],
    
              #With statistically significant variables

              'logit_sig'     : [   'CANCELLATIONS_BEFORE_NOON',
                                    'FOLLOWED_RECOMMENDATIONS_PCT',
                                    'personal',
                                    'professional',
                                    'MOBILE_NUMBER_at',
                                    'TASTES_AND_PREFERENCES_at',
                                    'PC_LOGINS_at1',
                                    'MOBILE_LOGINS_at',
                                    'FOLLOWED_RECOMMENDATIONS_PCT_at']
}

### Train Test Split

- Creating a train/test split with the significant variables to see how the models perform on them. The target variable is CROSS_SELL_SUCCESS. A test size of 0.25 creates a 75%/25% split between train and test, at a random state of 222.<br />
- A second train/test split is created with the full set of variables, to keep the significant and full variable sets separate. The parameters are the same for both splits.<br />
- The stratify parameter is also set, to ensure that the train and test sets have approximately the same percentage of samples of each target class as the complete set.


In [52]:
# train_test_split
# explanatory (x) data: the statistically significant variables, excluding the target variable
chef_data   = original_df.loc[ : , candidate_dict['logit_sig']]

# target data on the y variable that has the target that we need to achieve
# -preparing response variable data - cross sell success
chef_target = original_df.loc[:, 'CROSS_SELL_SUCCESS']


# test_size = 0.25 and random_state = 222
X_train, X_test, y_train, y_test = train_test_split(
                                                    chef_data,
                                                    chef_target,
                                                    test_size    = 0.25,
                                                    random_state = 222,
                                                    stratify     = chef_target
                                                    )

# train/test split with the logit_full variables
chef_data_full   =  original_df.loc[ : , candidate_dict['logit_full']]
chef_target_full =  original_df.loc[ : , 'CROSS_SELL_SUCCESS']


# train/test split
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(
                                                                        chef_data_full,
                                                                        chef_target_full,
                                                                        random_state = 222,
                                                                        test_size    = 0.25,
                                                                        stratify     = chef_target_full
                                                                        )
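
As a quick check on the stratification described above, the following sketch uses a small synthetic target (an assumption, not the Apprentice Chef data) to show that `stratify` keeps the class shares close to identical in the train and test sets. The names `demo_X` and `demo_y` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic, imbalanced binary target (assumption: stands in for CROSS_SELL_SUCCESS)
rng    = np.random.default_rng(222)
demo_X = pd.DataFrame({'feature': rng.normal(size = 1000)})
demo_y = pd.Series([1] * 680 + [0] * 320, name = 'target')

# same split settings as in the notebook
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X,
    demo_y,
    test_size    = 0.25,
    random_state = 222,
    stratify     = demo_y)

# share of the positive class in the complete, train and test sets
full_share  = demo_y.mean()
train_share = demo_y_train.mean()
test_share  = demo_y_test.mean()

print(round(full_share, 3), round(train_share, 3), round(test_share, 3))
```

If the three printed shares differ noticeably on real data, the `stratify` argument is probably missing or pointing at the wrong series.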

### Logistic Regression

- Using GridSearchCV for hyperparameter tuning to find the tuned parameters for use in the logistic regression.

*The results of this tuning run are displayed here and the code is commented out to reduce processing time; the same tuned parameters are used in the regression.*<br />

**OUTPUT**
~~~
Tuned Parameters  : {'C': 1.9000000000000001, 'warm_start': True}
Tuned CV AUC      : 0.614
~~~

In [53]:

# declaring a hyperparameter space
#C_space          = np.arange(0.1, 3.0, 0.1)   # needs: import numpy as np (pd.np was removed in pandas 2.0)
#warm_start_space = [True, False]


# creating a hyperparameter grid
#param_grid = {'C'          : C_space,
#              'warm_start' : warm_start_space}


# INSTANTIATING the model object without hyperparameters
#lr_tuned = sklearn.linear_model.LogisticRegression(solver = 'lbfgs',
#                              max_iter= 1000,
#                              random_state = 222)


# GridSearchCV object
#lr_tuned_cv = GridSearchCV(estimator  = lr_tuned,
#                           param_grid = param_grid,
#                           cv         = 3,
#                           scoring    = make_scorer(roc_auc_score,
#                                                    needs_threshold = False))


# FITTING to the FULL DATASET (due to cross-validation)
#lr_tuned_cv.fit(chef_data, chef_target)


# printing the optimal parameters and best score
#print("Tuned Parameters  :", lr_tuned_cv.best_params_)
#print("Tuned CV AUC      :", lr_tuned_cv.best_score_.round(4))
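
The search above can be sketched in runnable form as follows. This is a minimal illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than the Apprentice Chef dataset, builds the `C` range with NumPy directly, and uses scikit-learn's built-in `scoring = 'roc_auc'` string instead of the `make_scorer` call from the commented-out cell. The names `demo_X`, `demo_y`, `demo_lr` and `demo_cv` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data (assumption: not the Apprentice Chef dataset)
demo_X, demo_y = make_classification(n_samples    = 300,
                                     n_features   = 9,
                                     random_state = 222)

# hyperparameter grid with the same ranges as the commented-out search
param_grid = {'C'          : np.arange(0.1, 3.0, 0.1),
              'warm_start' : [True, False]}

# model object without tuned hyperparameters
demo_lr = LogisticRegression(solver       = 'lbfgs',
                             max_iter     = 1000,
                             random_state = 222)

# 'roc_auc' scores each fold on predicted scores rather than hard 0/1 labels
demo_cv = GridSearchCV(estimator  = demo_lr,
                       param_grid = param_grid,
                       cv         = 3,
                       scoring    = 'roc_auc')

demo_cv.fit(demo_X, demo_y)

print("Tuned Parameters  :", demo_cv.best_params_)
print("Tuned CV AUC      :", round(demo_cv.best_score_, 4))
```

Because the scorer differs from the one in the notebook, the AUC this sketch reports is not comparable to the 0.614 shown above.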

**Logistic Regression Model** with tuned results

 - The tuned parameters from the search above are entered into the model manually.

In [54]:
# INSTANTIATING a logistic regression model with tuned values
lr_tuned = sklearn.linear_model.LogisticRegression(solver = 'lbfgs',
                                                   C = 1.9,
                                                   warm_start = True,
                                                   max_iter= 1000,
                                                   random_state = 222)


lr_tuned = lr_tuned.fit(X_train, y_train)

# PREDICTING based on the testing set
lr_tuned_pred = lr_tuned.predict(X_test)

**Creating Model Performance** dataframe

- For better comparison storing results of each model in a data frame.

In [55]:
# creating a list that starts with the header row
model_performance = [['Model', 'Training Accuracy',
                      'Testing Accuracy', 'AUC Value']]


# train accuracy
logreg_train_acc  = lr_tuned.score(X_train, y_train).round(3)


# test accuracy
logreg_test_acc   = lr_tuned.score(X_test, y_test).round(3)


# auc value
logreg_auc = roc_auc_score(y_true  = y_test,
                           y_score = lr_tuned_pred).round(3)


# saving the results
model_performance.append(['Tuned Logestic Regression',
                          logreg_train_acc,
                          logreg_test_acc,
                          logreg_auc])

#declaring a DataFrame object
model_performance = pd.DataFrame(model_performance[1:], columns = model_performance[0])
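
The AUC above is computed from the hard class predictions in `lr_tuned_pred`. `roc_auc_score` also accepts predicted probabilities, which lets the metric use the full ranking of observations. The sketch below compares the two on synthetic data (an assumption, not the Apprentice Chef dataset); all `demo_` names are hypothetical, and the notebook's reported results are left unchanged.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data (assumption: not the Apprentice Chef dataset)
demo_X, demo_y = make_classification(n_samples    = 500,
                                     n_features   = 9,
                                     random_state = 222)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X,
    demo_y,
    test_size    = 0.25,
    random_state = 222,
    stratify     = demo_y)

# logistic regression with the tuned C value from the notebook
demo_model = LogisticRegression(solver       = 'lbfgs',
                                C            = 1.9,
                                max_iter     = 1000,
                                random_state = 222).fit(demo_X_train, demo_y_train)

# AUC from hard 0/1 predictions (the approach used in the notebook)
auc_from_labels = roc_auc_score(y_true  = demo_y_test,
                                y_score = demo_model.predict(demo_X_test))

# AUC from the predicted probability of the positive class
auc_from_proba = roc_auc_score(y_true  = demo_y_test,
                               y_score = demo_model.predict_proba(demo_X_test)[:, 1])

print(round(auc_from_labels, 3), round(auc_from_proba, 3))
```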

### Decision Tree Classifier

- Using GridSearchCV for hyperparameter tuning to find the tuned parameters for use in the decision tree classifier.

*The results of this tuning run are displayed here and the code is commented out to reduce processing time; the same tuned parameters are used in the classifier.*

**OUTPUT**
~~~
Tuned Parameters  : {'criterion': 'gini', 'max_depth': 8, 'min_samples_leaf': 6, 'splitter': 'best'}
Tuned Training AUC: 0.638
~~~

In [56]:
# declaring a hyperparameter space
#criterion_space = ['gini', 'entropy']
#splitter_space = ['best', 'random']
#depth_space = np.arange(1, 25)     # needs: import numpy as np (pd.np was removed in pandas 2.0)
#leaf_space  = np.arange(1, 100)


# creating a hyperparameter grid
#param_grid = {'criterion'        : criterion_space,
#              'splitter'         : splitter_space,
#              'max_depth'        : depth_space,
#              'min_samples_leaf' : leaf_space}


# INSTANTIATING the model object without hyperparameters
#tuned_tree = DecisionTreeClassifier(random_state = 222)


# GridSearchCV object
#tuned_tree_cv = GridSearchCV(estimator  = tuned_tree,
#                             param_grid = param_grid,
#                             cv         = 3,
#                             scoring    = make_scorer(roc_auc_score,
#                                                      needs_threshold = False))


# FITTING to the FULL DATASET (due to cross-validation)
#tuned_tree_cv.fit(chef_data, chef_target)


# printing the optimal parameters and best score
#print("Tuned Parameters  :", tuned_tree_cv.best_params_)
#print("Tuned Training AUC:", tuned_tree_cv.best_score_.round(4))
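
The grid above covers 2 x 2 x 24 x 99 parameter combinations, each fitted three times under `cv = 3`, which explains the long processing time. One way to cut that cost, sketched below, is scikit-learn's `RandomizedSearchCV`, which samples a fixed number of combinations instead of trying all of them. This is a different technique from the exhaustive search used in the notebook, it runs on synthetic data (an assumption), and the value `n_iter = 50` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption: not the Apprentice Chef dataset)
demo_X, demo_y = make_classification(n_samples    = 300,
                                     n_features   = 9,
                                     random_state = 222)

# same search space as the commented-out grid
param_dist = {'criterion'        : ['gini', 'entropy'],
              'splitter'         : ['best', 'random'],
              'max_depth'        : np.arange(1, 25),
              'min_samples_leaf' : np.arange(1, 100)}

# number of combinations an exhaustive grid search would evaluate
n_grid = (len(param_dist['criterion']) * len(param_dist['splitter'])
          * len(param_dist['max_depth']) * len(param_dist['min_samples_leaf']))

# sampling 50 combinations instead of evaluating all of them
demo_tree_cv = RandomizedSearchCV(
    estimator           = DecisionTreeClassifier(random_state = 222),
    param_distributions = param_dist,
    n_iter              = 50,
    cv                  = 3,
    scoring             = 'roc_auc',
    random_state        = 222)

demo_tree_cv.fit(demo_X, demo_y)

print("Grid combinations :", n_grid)
print("Sampled           :", len(demo_tree_cv.cv_results_['params']))
print("Tuned Parameters  :", demo_tree_cv.best_params_)
```

A random search may miss the single best combination, so it trades some accuracy in tuning for a large reduction in run time.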


**Decision Tree Classifier** with tuned results

 - The tuned parameters from the search above are entered into the model manually.

In [57]:
#Instantiating the decision tree classifier
tree_tuned =   DecisionTreeClassifier (
                                       criterion        = 'gini',
                                       max_depth        = 8,
                                       min_samples_leaf = 6,
                                       splitter         = 'best',
                                       random_state     = 222)

#fitting the tree on train data
tree_tuned = tree_tuned.fit(X_train, y_train)

#predicting the tree on the test data
tree_tuned_pred = tree_tuned.predict(X_test)

Adding the results of the decision tree classifier to the model performance data frame

In [58]:
# declaring model performance objects
tree_train_acc = tree_tuned.score(X_train, y_train).round(3)
tree_test_acc  = tree_tuned.score(X_test, y_test).round(3)
tree_auc       = roc_auc_score(y_true  = y_test,
                              y_score = tree_tuned_pred).round(3)


# appending to model_performance
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
model_performance = pd.concat(
                          [model_performance,
                           pd.DataFrame([{'Model'             : 'Tuned Tree',
                                          'Training Accuracy' : tree_train_acc,
                                          'Testing Accuracy'  : tree_test_acc,
                                          'AUC Value'         : tree_auc}])],
                          ignore_index = True)
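
The same three metrics are computed by hand for every model in this notebook. As an illustration only, the sketch below wraps that pattern in a hypothetical helper, `score_model`, and builds a results table in one step. It runs on synthetic data (an assumption, not the Apprentice Chef dataset), and the helper is not part of the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def score_model(name, model, X_tr, X_te, y_tr, y_te):
    """Return one results row for an already fitted classifier (hypothetical helper)."""
    return {'Model'             : name,
            'Training Accuracy' : round(model.score(X_tr, y_tr), 3),
            'Testing Accuracy'  : round(model.score(X_te, y_te), 3),
            'AUC Value'         : round(roc_auc_score(y_true  = y_te,
                                                      y_score = model.predict(X_te)), 3)}


# synthetic stand-in data (assumption: not the Apprentice Chef dataset)
demo_X, demo_y = make_classification(n_samples    = 500,
                                     n_features   = 9,
                                     random_state = 222)

X_tr, X_te, y_tr, y_te = train_test_split(demo_X,
                                          demo_y,
                                          test_size    = 0.25,
                                          random_state = 222,
                                          stratify     = demo_y)

# two illustrative models, keyed by the label used in the results table
demo_models = {
    'Demo Logistic Regression' : LogisticRegression(max_iter = 1000),
    'Demo Tree'                : DecisionTreeClassifier(max_depth        = 8,
                                                        min_samples_leaf = 6,
                                                        random_state     = 222)}

# fitting each model and collecting one row per model
rows = [score_model(name, model.fit(X_tr, y_tr), X_tr, X_te, y_tr, y_te)
        for name, model in demo_models.items()]

demo_performance = pd.DataFrame(rows)
print(demo_performance)
```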

### Random Forest Classifier

- Using GridSearchCV for hyperparameter tuning to find the tuned parameters for use in the random forest classifier.

*The results of this tuning run are displayed here and the code is commented out to reduce processing time; the same tuned parameters are used in the classifier.*

**OUTPUT**
~~~
Tuned Parameters  : {'bootstrap': False, 'criterion': 'entropy', 'min_samples_leaf': 1, 'n_estimators': 100, 'warm_start': True}
Tuned Training AUC: 0.592
~~~

In [59]:
# declaring a hyperparameter space
#estimator_space  = np.arange(100, 1100, 250)   # needs: import numpy as np (pd.np was removed in pandas 2.0)
#leaf_space       = np.arange(1, 31, 10)
#criterion_space  = ['gini', 'entropy']
#bootstrap_space  = [True, False]
#warm_start_space = [True, False]


# creating a hyperparameter grid
#param_grid = {'n_estimators'     : estimator_space,
#              'min_samples_leaf' : leaf_space,
#              'criterion'        : criterion_space,
#              'bootstrap'        : bootstrap_space,
#              'warm_start'       : warm_start_space}


# INSTANTIATING the model object without hyperparameters
#full_forest_grid = RandomForestClassifier(random_state = 222)


# GridSearchCV object
#full_forest_cv = GridSearchCV(estimator  = full_forest_grid,
#                              param_grid = param_grid,
#                              cv         = 3,
#                              scoring    = make_scorer(roc_auc_score,
#                                           needs_threshold = False))


# using full data for cross validation
#full_forest_cv.fit(chef_data_full, chef_target_full)


# printing the optimal parameters and best score
#print("Tuned Parameters  :", full_forest_cv.best_params_)
#print("Tuned Training AUC:", full_forest_cv.best_score_.round(4))

**RANDOM FOREST CLASSIFIER**

- Model performance with the tuned hyperparameters

- Manually entering the tuned parameters into the random forest classifier

In [60]:
#Developing model with tuned results
full_rf_tuned = RandomForestClassifier(bootstrap        = False,
                                       criterion        = 'entropy',
                                       min_samples_leaf = 1,
                                       n_estimators     = 100,
                                       warm_start       = True,
                                       random_state     = 222)


#Fitting the model on train
full_rf_tuned_fit = full_rf_tuned.fit(X_train, y_train)


#Predicting the result on test data
full_rf_tuned_pred = full_rf_tuned_fit.predict(X_test)


Adding the results of the random forest classifier to the model performance data frame

In [61]:
# declaring model performance objects
rf_train_acc = full_rf_tuned_fit.score(X_train, y_train).round(3)
rf_test_acc  = full_rf_tuned_fit.score(X_test, y_test).round(3)
rf_auc       = roc_auc_score(y_true  = y_test,
                             y_score = full_rf_tuned_pred).round(3)


# appending to model_performance
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
model_performance = pd.concat(
                          [model_performance,
                           pd.DataFrame([{'Model'             : 'Tuned Random Forest',
                                          'Training Accuracy' : rf_train_acc,
                                          'Testing Accuracy'  : rf_test_acc,
                                          'AUC Value'         : rf_auc}])],
                          ignore_index = True)

### Gradient Boosting Classifier

This model uses the default hyperparameter values, without tuning.

In [62]:
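# INSTANTIATING the model object with default hyperparameter values (no tuning)
# note: scikit-learn 1.1+ names this loss 'log_loss'; 'deviance' was removed in 1.3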
full_gbm_default = GradientBoostingClassifier(loss          = 'deviance',
                                              learning_rate = 0.1,
                                              n_estimators  = 100,
                                              criterion     = 'friedman_mse',
                                              max_depth     = 3,
                                              warm_start    = False,
                                              random_state  = 222)


# FIT step
full_gbm_default_fit = full_gbm_default.fit(X_train, y_train)


# PREDICTING based on the testing set
full_gbm_default_pred = full_gbm_default_fit.predict(X_test)


Adding the results of the untuned gradient boosting classifier to the model performance data frame

In [63]:
# declaring model performance objects
gbm_train_acc = full_gbm_default_fit.score(X_train, y_train).round(3)
gbm_test_acc  = full_gbm_default_fit.score(X_test, y_test).round(3)
gbm_auc       = roc_auc_score(y_true  = y_test,
                              y_score = full_gbm_default_pred).round(3)


# appending to model_performance
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
model_performance = pd.concat(
                          [model_performance,
                           pd.DataFrame([{'Model'             : 'GBM_without_tune',
                                          'Training Accuracy' : gbm_train_acc,
                                          'Testing Accuracy'  : gbm_test_acc,
                                          'AUC Value'         : gbm_auc}])],
                          ignore_index = True)


### Gradient Boosting Classifier

- Using GridSearchCV for hyperparameter tuning to find the tuned parameters for use in the gradient boosting classifier.

*The results of this tuning run are displayed here and the code is commented out to reduce processing time; the same tuned parameters are used in the classifier.*

**OUTPUT**
~~~
Tuned Parameters  : {'learning_rate': 1.3000000000000003, 'max_depth': 1, 'n_estimators': 100}
Tuned Training AUC: 0.631
~~~

In [64]:
# declaring a hyperparameter space
#learn_space     = np.arange(0.1, 1.6, 0.3)    # needs: import numpy as np (pd.np was removed in pandas 2.0)
#estimator_space = np.arange(50, 250, 50)
#depth_space     = np.arange(1, 10)


# creating a hyperparameter grid
#param_grid = {'learning_rate' : learn_space,
#              'max_depth'     : depth_space,
#              'n_estimators'  : estimator_space}


# INSTANTIATING the model object without hyperparameters
#full_gbm_grid = GradientBoostingClassifier(random_state = 222)


# GridSearchCV object
#full_gbm_cv = GridSearchCV(estimator  = full_gbm_grid,
#                           param_grid = param_grid,
#                           cv         = 3,
#                           scoring    = make_scorer(roc_auc_score,
#                                        needs_threshold = False))


# FITTING to the FULL DATASET (due to cross-validation)
#full_gbm_cv.fit(chef_data_full, chef_target_full)


# printing the optimal parameters and best score
#print("Tuned Parameters  :", full_gbm_cv.best_params_)
#print("Tuned Training AUC:", full_gbm_cv.best_score_.round(4))

**GRADIENT BOOSTING CLASSIFIER**

Model performance with the tuned hyperparameters

Manually entering the tuned parameters into the gradient boosting classifier

In [65]:
# INSTANTIATING the model object with the tuned hyperparameters
gbm_tuned = GradientBoostingClassifier(learning_rate = 1.3,
                                       max_depth     = 1,
                                       n_estimators  = 100,
                                       random_state  = 222)


# FIT step is needed as we are not using .best_estimator_
gbm_tuned_fit = gbm_tuned.fit(X_train, y_train)


# PREDICTING based on the testing set
gbm_tuned_pred = gbm_tuned_fit.predict(X_test)


Adding the results of the tuned gradient boosting classifier to the model performance data frame

In [66]:
# declaring model performance objects
gbm_train_acc = gbm_tuned_fit.score(X_train, y_train).round(3)
gbm_test_acc  = gbm_tuned_fit.score(X_test, y_test).round(3)
gbm_auc       = roc_auc_score(y_true  = y_test,
                              y_score = gbm_tuned_pred).round(3)


# appending to model_performance
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
model_performance = pd.concat(
                          [model_performance,
                           pd.DataFrame([{'Model'             : 'Tuned GBM',
                                          'Training Accuracy' : gbm_train_acc,
                                          'Testing Accuracy'  : gbm_test_acc,
                                          'AUC Value'         : gbm_auc}])],
                          ignore_index = True)

***
***

## **Displaying Model Comparisons**

***
***

In [67]:
sort_perform = model_performance.sort_values(by        = 'AUC Value',
                                             ascending = False)
sort_perform

|   | Model | Training Accuracy | Testing Accuracy | AUC Value |
|---|---|---|---|---|
| 3 | GBM_without_tune | 0.824 | 0.797 | 0.779 |
| 4 | Tuned GBM | 0.812 | 0.793 | 0.766 |
| 1 | Tuned Tree | 0.835 | 0.774 | 0.758 |
| 2 | Tuned Random Forest | 0.897 | 0.737 | 0.713 |
| 0 | Tuned Logestic Regression | 0.791 | 0.737 | 0.703 |


**The best model here is the GBM without tuning, which has the highest AUC (0.779) and testing accuracy (0.797), so I will use it as my final model.**
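
Beyond AUC, the difference between training and testing accuracy gives a rough view of overfitting. The sketch below recomputes that gap from the figures in the comparison table; the column name `Train-Test Gap` is a hypothetical addition, and the model labels are copied exactly as they appear in the output.

```python
import pandas as pd

# values copied from the model comparison output above
results = pd.DataFrame({
    'Model'             : ['GBM_without_tune', 'Tuned GBM', 'Tuned Tree',
                           'Tuned Random Forest', 'Tuned Logestic Regression'],
    'Training Accuracy' : [0.824, 0.812, 0.835, 0.897, 0.791],
    'Testing Accuracy'  : [0.797, 0.793, 0.774, 0.737, 0.737],
    'AUC Value'         : [0.779, 0.766, 0.758, 0.713, 0.703]})

# a larger gap suggests the model fits the training data more closely than the test data
results['Train-Test Gap'] = (results['Training Accuracy']
                             - results['Testing Accuracy']).round(3)

# model with the largest gap between training and testing accuracy
largest_gap_model = results.loc[results['Train-Test Gap'].idxmax(), 'Model']

print(results.sort_values(by = 'AUC Value', ascending = False))
print("Largest gap:", largest_gap_model)
```

On these figures the random forest shows the largest gap and the two gradient boosting models the smallest, which is consistent with choosing the untuned GBM.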