# Gain and Lift

To illustrate gain and lift charts, we will use the bank marketing dataset. The bank wants to predict which customers are likely to respond to its direct marketing campaign by opening a term deposit. The response variable Y = 1 means that the customer opened a term deposit after the campaign, and Y = 0 otherwise. The campaign is conducted through phone calls.

Gain = (Cumulative number of positive observations up to decile i) / (Total number of positive observations in the data)

Lift = (Cumulative number of positive observations up to decile i using the LR model) / (Cumulative number of positive observations up to decile i using a random model)
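
The two ratios above can be sketched in plain Python. This is a minimal illustration rather than the notebook's own code: the function name `gain_and_lift`, the toy labels, and the toy scores are made up for this example, and the random model is assumed to capture positives in proportion to the share of observations covered so far.

```python
def gain_and_lift(y_true, y_score, n_bins=10):
    """Return one (decile, gain, lift) tuple per decile.

    y_true  : sequence of 0/1 outcomes
    y_score : sequence of predicted probabilities for the positive class
    """
    # Rank the outcomes by predicted probability, highest first.
    pairs = sorted(zip(y_score, y_true), key=lambda p: p[0], reverse=True)
    ranked = [y for _, y in pairs]

    n = len(ranked)
    total_pos = sum(ranked)

    rows = []
    for i in range(1, n_bins + 1):
        cutoff = (i * n) // n_bins          # observations covered up to decile i
        cum_pos = sum(ranked[:cutoff])      # positives captured by the model
        gain = cum_pos / total_pos
        # Assumption: a random model captures positives in proportion
        # to the fraction of observations covered.
        random_cum_pos = total_pos * cutoff / n
        lift = cum_pos / random_cum_pos
        rows.append((i, gain, lift))
    return rows


# Hypothetical toy data: 20 observations, 4 positives.
y_score = [(20 - k) / 20 for k in range(20)]   # 1.00, 0.95, ..., 0.05
y_true = [1, 1, 0, 1, 0, 0, 0, 1] + [0] * 12

rows = gain_and_lift(y_true, y_score)
for decile, gain, lift in rows:
    print(decile, round(gain, 2), round(lift, 2))
```

In this toy data the top decile holds 2 of the 4 positives, so its gain is 0.5 and its lift is 5. By the last decile both the model and the random baseline have covered every positive, so the lift falls to 1.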

### Data Description for Bank Marketing Dataset

#### Age
Dtype : Numerical

Desc : Age of the client who is the target of this marketing exercise


#### Job
Dtype : Categorical

Desc : Type of job (admin, blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)


#### Marital
Dtype : Categorical

Desc : Marital Status (married, divorced, single)


#### Education
Dtype : Categorical

Desc : Education qualification (unknown, secondary, primary, tertiary)


#### Default
Dtype : Categorical

Desc : Customer has credit in default? (yes, no)


#### Balance
Dtype : Numerical

Desc : Average yearly balance, in euros


#### Housing Loan
Dtype : Categorical

Desc : Has a housing loan? (no, yes)


#### Personal Loan
Dtype : Categorical

Desc : Has a personal loan? (no, yes)


#### Previous Campaign
Dtype : Numerical

Desc : Number of contacts made with this client before this campaign


#### Subscribed
Dtype : Categorical

Desc : Has the client subscribed to a term deposit? (yes, no)

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [16]:
bank_df = pd.read_csv('Datasets/bank.csv', delimiter=';')

In [17]:
bank_df.head()

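| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |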


In [18]:
new_bank_df = bank_df[['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'campaign', 'previous', 'y']]

In [19]:
new_bank_df.head()

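| | age | job | marital | education | default | balance | housing | loan | campaign | previous | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | 1 | 0 | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | 1 | 4 | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | 1 | 1 | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | 4 | 0 | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | 1 | 0 | no |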


In [20]:
# new_bank_df was created by selecting columns from bank_df, so modifying it
# directly raises SettingWithCopyWarning. Take an explicit copy first.
new_bank_df = new_bank_df.copy()
new_bank_df['Subscribed'] = new_bank_df['y']

In [22]:
# Drop the original outcome column now that it is stored as 'Subscribed'.
new_bank_df.drop('y', axis=1, inplace=True)
new_bank_df.head()

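| | age | job | marital | education | default | balance | housing | loan | campaign | previous | Subscribed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | 1 | 0 | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | 1 | 4 | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | 1 | 1 | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | 4 | 0 | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | 1 | 0 | no |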


In [23]:
new_bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   age         4521 non-null   int64 
 1   job         4521 non-null   object
 2   marital     4521 non-null   object
 3   education   4521 non-null   object
 4   default     4521 non-null   object
 5   balance     4521 non-null   int64 
 6   housing     4521 non-null   object
 7   loan        4521 non-null   object
 8   campaign    4521 non-null   int64 
 9   previous    4521 non-null   int64 
 10  Subscribed  4521 non-null   object
dtypes: int64(4), object(7)
memory usage: 388.7+ KB


In [24]:
new_bank_df.Subscribed.value_counts()

Subscribed
no     4000
yes     521
Name: count, dtype: int64

The dataset has a total of 4521 observations. Of these, 521 customers (approximately 11.5%) subscribed to the term deposit
and the remaining 4000 did not.
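
The class proportion matters for the lift calculation, because it determines how many positives a random model is expected to find in each decile. The arithmetic can be checked with a short sketch; the variable names are chosen for this example only, and the counts are taken from the `value_counts()` output above.

```python
n_total = 4521       # total observations
n_positive = 521     # customers who subscribed

# Share of positives in the data (the baseline response rate).
base_rate = n_positive / n_total

# With 10 equal-sized deciles, a random ordering is expected to place
# the same share of positives in every decile.
obs_per_decile = n_total / 10
expected_pos_per_decile = base_rate * obs_per_decile

print(f"baseline response rate: {base_rate:.4f}")
print(f"expected positives per decile (random model): {expected_pos_per_decile:.1f}")
```

A model whose top decile contains more than about 52 subscribers therefore has a lift greater than 1 in that decile.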

Let us capture the independent variables in the list X_features:

In [25]:
X_features = list(new_bank_df.columns)
X_features.remove('Subscribed')
X_features

['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'campaign',
 'previous']

Encode the categorical features into dummy variables using the following code:

In [26]:
encoded_bank_df = pd.get_dummies(new_bank_df[X_features], drop_first=True)
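
To see what `drop_first=True` does, here is a minimal sketch on a small hypothetical frame (`toy` and its three rows are made up for illustration and are not part of the bank data). Numeric columns pass through unchanged, and each categorical column with k levels becomes k - 1 indicator columns, with the alphabetically first level dropped as the reference category.

```python
import pandas as pd

# Hypothetical three-row frame that mimics a few of the bank columns.
toy = pd.DataFrame({
    'age': [30, 33, 35],
    'marital': ['married', 'single', 'divorced'],
    'housing': ['no', 'yes', 'yes'],
})

# 'marital' has 3 levels -> 2 dummy columns; 'housing' has 2 levels -> 1 dummy column.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
print(encoded)
```

Dropping one level per feature avoids perfectly collinear dummy columns, since the dropped level is implied when all the remaining indicators for that feature are 0.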

The outcome variable Subscribed takes the values yes and no. It needs to be encoded as 1 (yes) and 0 (no).

In [27]:
Y = new_bank_df['Subscribed'].map(lambda x: int(x == 'yes'))
X = encoded_bank_df

For simplicity, the dataset is not split into training and test sets here, since our objective is primarily to
understand gain and lift charts.