# IMPROVING LEAD GENERATION AT EUREKA FORBES 

Eureka Forbes, part of the Shapoorji Pallonji Group conglomerate, is currently one of the world's largest direct sales companies, known for its water purifier brand Aquaguard, with a turnover of more than INR 30 billion. The company is estimated to have a customer base of 20 million across 53 countries. Its distribution channels include a direct sales force of dealers, institutional channels, a business partner network and a rural channel across 1500 cities and towns in India. Under the company's previous customer acquisition model, interested customers were visited individually for a product demonstration and to complete the purchase. While this made the company a household name, it kept acquisition costs on the higher side.

With the rise of online retailing, the brand had been taking steps to establish its digital presence and build a stable online sales channel. The company website (www.eurekaforbes.com) attracts online traffic from various sources such as organic searches, Google Ads, email campaigns, etc. The company has started to use this clickstream data, collected through the Google Analytics Reporting API, to build a rich database of visitor acquisition factors and behavioral variables such as session duration, device category, pages visited, lead forms filled, etc. The company treats these visitors as potential customers and is actively deploying remarketing campaigns in the hope of converting them.

**Source**: https://store.hbr.org/product/improving-lead-generation-at-eureka-forbes-using-machine-learning-algorithms/IMB779

The business goal is clearly defined for the company – they want to target potential customers while keeping
the cost per lead (CPL) as low as possible. For Kashif Kudalkar, the Deputy General Manager for Digital
Marketing and Analytics, the task is to achieve better conversion at lower costs. This is achievable when
the target audience is narrowed down to a sizeable number for remarketing campaigns. Kashif wants to use
the collected behavioral and visitor data to achieve the following objectives:

1. Find the target audience with a high probability of submitting a lead and eventually converting.
2. Segment the visitor audience into buckets based on their activity for designing better advertising and
remarketing campaigns.
3. Finally, have a probability score that can be used to run a personalized campaign for users/segments.
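
As a rough illustration of the third objective, the sketch below buckets visitors by a predicted lead probability. The `visitor_id` and `lead_prob` columns, the randomly generated scores and the three-way split are all assumptions made for this example; none of them come from the case data.

```python
import numpy as np
import pandas as pd

# Hypothetical scores standing in for a model's predicted lead probabilities.
rng = np.random.default_rng(42)
scores_df = pd.DataFrame({'visitor_id': np.arange(1000),
                          'lead_prob': rng.beta(1, 8, size=1000)})

# Split visitors into three near-equal-sized buckets by predicted probability.
scores_df['segment'] = pd.qcut(scores_df['lead_prob'], q=3,
                               labels=['low', 'medium', 'high'])

# Size and average score of each bucket.
segment_summary = (scores_df.groupby('segment', observed=True)['lead_prob']
                            .agg(['count', 'mean']))
print(segment_summary)
```

A campaign could then be tailored per bucket, for example by spending more of the remarketing budget on the `high` segment.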

## Loading Dataset

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd

In [None]:
eureka_df_v3 = pd.read_csv('https://raw.githubusercontent.com/manaranjanp/IIMBClasses/main/classification/eureka_encoded_csv.zip')

In [None]:
eureka_df_v3.head(5)

In [None]:
eureka_df_v3.info()

In [None]:
eureka_df_v3.converted.unique()

In [None]:
eureka_df_v3.converted.value_counts()

## All Features

In [None]:
X_features = list(eureka_df_v3.columns)
X_features.remove('converted')

## Splitting Dataset

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_set_0, test_set_0 = train_test_split(eureka_df_v3[eureka_df_v3.converted == 0],
                                       train_size = 0.99)

train_set_1, test_set_1 = train_test_split(eureka_df_v3[eureka_df_v3.converted == 1],
                                       train_size = 0.8)

In [None]:
train_set = pd.concat([train_set_0, train_set_1])
test_set = pd.concat([test_set_0, test_set_1])

In [None]:
train_set.converted.value_counts()

In [None]:
test_set.converted.value_counts()

## Resampling to create a balanced dataset

In [None]:
from sklearn.utils import resample, shuffle

In [None]:
train_label_1 = resample(train_set[train_set.converted == 1],
                             replace = True,
                             n_samples=50000)

train_label_0 = resample(train_set[train_set.converted == 0],
                             replace = False,
                             n_samples=50000)                              

In [None]:
# Combine the downsampled majority class with the upsampled minority class
train_set_resampled = pd.concat([train_label_1, train_label_0])

In [None]:
train_set_resampled = shuffle(train_set_resampled)
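
The effect of the two `resample` calls is easier to see on a small frame. The sketch below uses a made-up 100-row dataset (the `toy_df` name, its `x` column and the sample size of 20 are assumptions for illustration) and applies the same pattern: the minority class is drawn with replacement, the majority class without.

```python
import pandas as pd
from sklearn.utils import resample, shuffle

# Toy imbalanced frame: 95 non-converted rows, 5 converted rows.
toy_df = pd.DataFrame({'x': range(100),
                       'converted': [0] * 95 + [1] * 5})

# Upsample the minority class with replacement (rows repeat).
toy_1 = resample(toy_df[toy_df.converted == 1],
                 replace=True, n_samples=20, random_state=42)

# Downsample the majority class without replacement (no row repeats).
toy_0 = resample(toy_df[toy_df.converted == 0],
                 replace=False, n_samples=20, random_state=42)

toy_balanced = shuffle(pd.concat([toy_1, toy_0]), random_state=42)
print(toy_balanced.converted.value_counts())
```

Because only the training set is resampled, the test set keeps its original rows and class mix.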

## Decision Tree Model

### Building the model

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree_v1 = DecisionTreeClassifier(max_depth=8, 
                                 criterion = 'gini')

In [None]:
tree_v1.fit(train_set_resampled[X_features], 
            train_set_resampled['converted'])

### Predicting on test set

In [None]:
y_tree_pred = tree_v1.predict(test_set[X_features])

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
# labels must be passed by keyword; [1, 0] puts 'Converted' first on both axes
cm_tree = confusion_matrix(test_set['converted'], y_tree_pred, labels = [1,0])

In [None]:
sn.heatmap(cm_tree,
           fmt='.0f',
           annot = True,
           xticklabels = ['Converted', 'Not Converted'],
           yticklabels = ['Converted', 'Not Converted'])
plt.xlabel('Predicted')
plt.ylabel('Actual');
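
The heatmap orders both axes as converted first, then not converted. The sketch below shows how that ordering maps to the cells of the matrix, using five made-up labels (`y_true` and `y_pred` are illustrative values, not model output).

```python
from sklearn.metrics import confusion_matrix

# Five hypothetical visitors: actual outcome vs. predicted outcome.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])

# Rows are actual classes, columns are predicted classes, both ordered [1, 0]:
# cm[0, 0] = true positives,  cm[0, 1] = false negatives,
# cm[1, 0] = false positives, cm[1, 1] = true negatives.
print(cm)
```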

In [None]:
from sklearn.metrics import roc_auc_score, RocCurveDisplay

In [None]:
y_tree_pred_prob = tree_v1.predict_proba(test_set[X_features])

In [None]:
auc_score = roc_auc_score(test_set['converted'], y_tree_pred_prob[:,1])

In [None]:
auc_score

In [None]:
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
RocCurveDisplay.from_estimator(tree_v1, test_set[X_features], 
                               test_set['converted']);

### Finding important features

In [None]:
import numpy as np

In [None]:
features_df = pd.DataFrame({'feature': X_features,
                            'importance': np.round(tree_v1.feature_importances_, 3) })

In [None]:
features_df = features_df.sort_values('importance', 
                                      ascending = False)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

In [None]:
plt.figure(figsize=(6,12))
sn.barplot(y = 'feature', 
           x = 'importance', 
           data = features_df[0:50]);

In [None]:
features_df['cumsum'] = features_df.importance.cumsum()

In [None]:
imp_cumsum_df = features_df.sort_values('cumsum', ascending=True)

In [None]:
imp_cumsum_df = imp_cumsum_df.reset_index()

In [None]:
imp_cumsum_df
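
One possible use of the cumulative importance column is to keep only the features that together account for most of the total importance. The sketch below assumes a 90% cut-off and uses five made-up importances (`toy_imp_df`, `threshold` and the values are illustrative, not results from the model above).

```python
import pandas as pd

# Hypothetical importances, already sorted in descending order and summing to 1.
toy_imp_df = pd.DataFrame({'feature': ['f1', 'f2', 'f3', 'f4', 'f5'],
                           'importance': [0.50, 0.30, 0.15, 0.04, 0.01]})
toy_imp_df['cumsum'] = toy_imp_df.importance.cumsum()

# Keep every feature up to and including the first one that reaches the threshold.
threshold = 0.90  # assumed cut-off, not taken from the case
n_keep = int((toy_imp_df['cumsum'] < threshold).sum()) + 1
selected_features = list(toy_imp_df['feature'][:n_keep])
print(selected_features)
```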

## Random Forest model

### Building the Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf_clf = RandomForestClassifier(n_estimators = 100,
                                max_depth=8,
                                max_features=0.3,
                                max_samples=0.4)

In [None]:
rf_clf.fit(train_set_resampled[X_features], 
           train_set_resampled['converted'])

### Predicting on test set

In [None]:
y_rf_pred = rf_clf.predict(test_set[X_features])

In [None]:
cm_rf = confusion_matrix(test_set['converted'], y_rf_pred, labels = [1,0])

In [None]:
sn.heatmap(cm_rf,
           fmt='.0f',
           annot = True,
           xticklabels = ['Converted', 'Not Converted'],
           yticklabels = ['Converted', 'Not Converted'])
plt.xlabel('Predicted')
plt.ylabel('Actual')

In [None]:
y_rf_pred_prob = rf_clf.predict_proba(test_set[X_features])

In [None]:
auc_score = roc_auc_score(test_set['converted'], y_rf_pred_prob[:,1])

In [None]:
auc_score

In [None]:
from sklearn.metrics import precision_score

In [None]:
precision_score(test_set['converted'], y_rf_pred)

### Finding important features

In [None]:
features_df = pd.DataFrame({'feature': X_features,
                            'importance': np.round( rf_clf.feature_importances_, 3) })

In [None]:
features_df = features_df.sort_values('importance', 
                                      ascending = False)

In [None]:
plt.figure(figsize=(6,12))
sn.barplot(y = 'feature', 
           x = 'importance', 
           data = features_df[0:50]);

In [None]:
features_df['cumsum'] = features_df.importance.cumsum()

In [None]:
imp_cumsum_df = features_df.sort_values('cumsum', ascending=True)

In [None]:
imp_cumsum_df = imp_cumsum_df.reset_index()

In [None]:
imp_cumsum_df

## Lift Chart and Gain Chart 

### By Decile

In [None]:
sorted_predict_prob_df = pd.DataFrame( { 'actual': test_set['converted'], 
                                         'prob' : y_rf_pred_prob[:,1] })

In [None]:
sorted_predict_prob_df = sorted_predict_prob_df.sort_values('prob', 
                                                            ascending = False)

In [None]:
sorted_predict_prob_df[0:10]

In [None]:
num_per_decile = int( len( sorted_predict_prob_df ) / 10 )
print( "Number of observations per decile: ", num_per_decile)

In [None]:
len(sorted_predict_prob_df)

### Creating Deciles

In [None]:
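def get_deciles( df ):
    # Default to the last decile (9 before the shift below) so that leftover
    # rows, when the row count is not a multiple of 10, stay with the
    # lowest-scored group
    df['decile'] = 9

    idx = 0

    # df is sorted by probability, so the first block of rows is decile 0
    for each_d in range( 0, 10 ):
        df.iloc[idx:idx+num_per_decile, df.columns.get_loc('decile')] = each_d 
        idx += num_per_decile

    # Shift labels from 0-9 to 1-10
    df['decile'] = df['decile'] + 1    
    
    return df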

In [None]:
deciles_predict_df = get_deciles( sorted_predict_prob_df )

In [None]:
deciles_predict_df[0:10]

In [None]:
gain_lift_df = pd.DataFrame( 
    deciles_predict_df.groupby( 
            'decile')['actual'].sum() ).reset_index()
gain_lift_df.columns = ['decile', 'gain']

In [None]:
gain_lift_df['gain_percentage'] = (100 * 
            gain_lift_df.gain.cumsum()/gain_lift_df.gain.sum())

### Gain Chart

In [None]:
gain_lift_df

In [None]:
plt.figure( figsize = (8,4))
plt.plot( gain_lift_df['decile'], 
         gain_lift_df['gain_percentage'], '-' )

plt.title("Gain Chart")
plt.show()

In [None]:
gain_lift_df['lift'] = ( gain_lift_df.gain_percentage 
                        / ( gain_lift_df.decile * 10) )
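
In symbols, writing $c_i$ for the number of converted visitors in decile $i$ (notation introduced here for illustration), the two derived columns are:

$$
\text{gain\_percentage}_k = 100 \cdot \frac{\sum_{i=1}^{k} c_i}{\sum_{i=1}^{10} c_i},
\qquad
\text{lift}_k = \frac{\text{gain\_percentage}_k}{10\,k}
$$

For example, if the top two deciles together captured 60% of all conversions, the lift at decile 2 would be $60 / 20 = 3$. A lift of 1 means the ranking does no better than contacting visitors at random, and by construction the lift at decile 10 is exactly 1.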

In [None]:
gain_lift_df

In [None]:
# Cumulative lift: cumulative gain (%) divided by the cumulative share of population (%)
gain_lift_df['lift'] = ( gain_lift_df.gain_percentage 
                        / ( gain_lift_df.decile * 10 ) )

gain_lift_df

plt.figure( figsize = (8,4))
plt.plot( gain_lift_df['decile'], gain_lift_df['lift'], '-' )
plt.title("Lift Chart")
plt.show()

### By Percentile

In [None]:
sorted_predict_prob_df = pd.DataFrame( { 'actual': test_set['converted'], 
                                         'prob' : y_rf_pred_prob[:,1] })

In [None]:
sorted_predict_prob_df = sorted_predict_prob_df.sort_values('prob', 
                                                            ascending = False)

In [None]:
num_per_percentile = int( len( sorted_predict_prob_df ) / 100 )
print( "Number of observations per percentile: ", num_per_percentile)

In [None]:
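def get_percentiles( df ):
    # Default to the last percentile (99 before the shift below) so that
    # leftover rows stay with the lowest-scored group
    df['percentile'] = 99

    idx = 0

    # Each block must span num_per_percentile rows (not num_per_decile)
    for each_d in range( 0, 100 ):
        df.iloc[idx:idx+num_per_percentile, df.columns.get_loc('percentile')] = each_d 
        idx += num_per_percentile

    # Shift labels from 0-99 to 1-100
    df['percentile'] = df['percentile'] + 1    
    
    return df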

In [None]:
percentile_predict_df = get_percentiles( sorted_predict_prob_df )

In [None]:
percentile_predict_df[0:10]

In [None]:
gain_lift_df = pd.DataFrame( 
    percentile_predict_df.groupby( 
            'percentile')['actual'].sum() ).reset_index()
gain_lift_df.columns = ['percentile', 'gain']

In [None]:
gain_lift_df['gain_percentage'] = (100 * 
            gain_lift_df.gain.cumsum()/gain_lift_df.gain.sum())

In [None]:
gain_lift_df

In [None]:
plt.figure( figsize = (8,4))
plt.plot( gain_lift_df['percentile'], 
         gain_lift_df['gain_percentage'], '-' )

plt.title("Gain Chart")
plt.show()

In [None]:
gain_lift_df['lift'] = ( gain_lift_df.gain_percentage 
                        / ( gain_lift_df.percentile ) )

In [None]:
gain_lift_df

In [None]:
plt.figure( figsize = (8,4))
plt.plot( gain_lift_df['percentile'], gain_lift_df['lift'], '-' )
plt.title("Lift Chart")
plt.show()
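
The decile and percentile steps differ only in the number of bins, so they could be folded into one helper. The sketch below is one way to do that, not the notebook's own method: `gain_lift_table`, its `n_bins` parameter and the synthetic `toy_actual` / `toy_prob` data are all names and values assumed for this example.

```python
import numpy as np
import pandas as pd

def gain_lift_table(actual, prob, n_bins=10):
    """Cumulative gain (%) and lift per bin, where bin 1 holds the highest scores."""
    df = pd.DataFrame({'actual': np.asarray(actual), 'prob': np.asarray(prob)})
    df = df.sort_values('prob', ascending=False).reset_index(drop=True)
    # Assign near-equal-sized bins by position in the sorted frame.
    df['bin'] = (np.arange(len(df)) * n_bins // len(df)) + 1
    table = df.groupby('bin')['actual'].agg(gain='sum', n='count').reset_index()
    table['gain_percentage'] = 100 * table['gain'].cumsum() / table['gain'].sum()
    table['population_percentage'] = 100 * table['n'].cumsum() / table['n'].sum()
    table['lift'] = table['gain_percentage'] / table['population_percentage']
    return table

# Synthetic scores: positives tend to receive higher probabilities.
rng = np.random.default_rng(0)
toy_actual = rng.binomial(1, 0.1, size=1000)
toy_prob = np.clip(0.1 + 0.5 * toy_actual + rng.normal(0, 0.2, size=1000), 0, 1)

toy_table = gain_lift_table(toy_actual, toy_prob)
print(toy_table[['bin', 'gain_percentage', 'lift']])
```

Calling it with `n_bins=100` would give the percentile version. Dividing by the actual cumulative population share, rather than by the bin number, keeps the lift correct even when the bins are not exactly equal in size.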