# Gain and Lift

To illustrate gain and lift charts, we will use the bank marketing dataset. The data describes a problem in which a bank wants to predict which customers are likely to respond to its direct marketing campaign and open a term deposit. The response variable Y = 1 means that the customer opened a term deposit after the campaign, and Y = 0 otherwise. The marketing campaign is conducted through phone calls.

Gain = (Cumulative number of positive observations up to decile i) / (Total number of positive observations in the data)

Lift = (Cumulative number of positive observations up to decile i using the logistic regression (LR) model) / (Cumulative number of positive observations up to decile i using a random model)
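
The two ratios above can be worked through on a small example. The sketch below is illustrative only: the function name `gain_and_lift` and the per-decile counts are hypothetical, and it assumes equal-sized deciles, so that a random model is expected to capture a fraction i/10 of all positives by decile i.

```python
def gain_and_lift(positives_per_decile):
    """Return (gains, lifts) given the number of positives in each decile.

    positives_per_decile[0] is the decile with the highest predicted
    probabilities. Assumes equal-sized deciles, so a random model is
    expected to capture i/10 of all positives by decile i.
    """
    total = sum(positives_per_decile)
    n_deciles = len(positives_per_decile)
    gains, lifts = [], []
    cumulative = 0
    for i, count in enumerate(positives_per_decile, start=1):
        cumulative += count
        # Gain: share of all positives captured up to decile i
        gains.append(cumulative / total)
        # Lift: model's cumulative positives vs. the random model's
        random_cumulative = total * i / n_deciles
        lifts.append(cumulative / random_cumulative)
    return gains, lifts


# Hypothetical counts of positives per decile (not taken from the bank data)
toy_counts = [30, 20, 15, 10, 8, 6, 5, 3, 2, 1]
gains, lifts = gain_and_lift(toy_counts)
for decile, (g, l) in enumerate(zip(gains, lifts), start=1):
    print(decile, round(g, 2), round(l, 2))
```

By the last decile the gain always reaches 1 and the lift always falls to 1, because both the model and the random baseline have then covered every observation.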

### Data Description for Bank Marketing Dataset

#### Age
Dtype : Numerical

Desc : Age of the client who is the target of this marketing exercise


#### Job
Dtype : Categorical

Desc : Type of job (admin, blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown)


#### Marital
Dtype : Categorical

Desc : Marital Status (married, divorced, single)


#### Education
Dtype : Categorical

Desc : Education qualification (unknown, secondary, primary, tertiary)


#### Default
Dtype : Categorical

Desc : Customer has credit in default? (yes, no)


#### Balance
Dtype : Numerical

Desc : Average yearly balance, in euros


#### Housing Loan
Dtype : Categorical

Desc : Has a housing loan? (yes, no)


#### Personal Loan
Dtype : Categorical

Desc : Has a personal loan? (yes, no)


#### Previous Campaign
Dtype : Numerical

Desc : Number of contacts performed for this client before this campaign


#### Subscribed
Dtype : Categorical

Desc : Has the client subscribed to a term deposit? (yes, no)

In [58]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [59]:
bank_df = pd.read_csv('Datasets/bank.csv', delimiter=';')

In [60]:
bank_df.head()

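| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |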


In [61]:
# Take an explicit copy so that later column edits do not trigger SettingWithCopyWarning
new_bank_df = bank_df[['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'campaign', 'previous', 'y']].copy()

In [62]:
new_bank_df.head()

| | age | job | marital | education | default | balance | housing | loan | campaign | previous | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | 1 | 0 | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | 1 | 4 | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | 1 | 1 | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | 4 | 0 | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | 1 | 0 | no |


In [63]:
new_bank_df['Subscribed'] = new_bank_df['y']

In [64]:
new_bank_df.drop('y', axis=1, inplace=True)
new_bank_df.head()

| | age | job | marital | education | default | balance | housing | loan | campaign | previous | Subscribed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | 1 | 0 | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | 1 | 4 | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | 1 | 1 | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | 4 | 0 | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | 1 | 0 | no |


In [65]:
new_bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   age         4521 non-null   int64 
 1   job         4521 non-null   object
 2   marital     4521 non-null   object
 3   education   4521 non-null   object
 4   default     4521 non-null   object
 5   balance     4521 non-null   int64 
 6   housing     4521 non-null   object
 7   loan        4521 non-null   object
 8   campaign    4521 non-null   int64 
 9   previous    4521 non-null   int64 
 10  Subscribed  4521 non-null   object
dtypes: int64(4), object(7)
memory usage: 388.7+ KB


In [66]:
new_bank_df.Subscribed.value_counts()

Subscribed
no     4000
yes     521
Name: count, dtype: int64

The dataset has a total of 4521 observations, of which 521 customers (approximately 11.5%) subscribed to the term deposit and the remaining 4000 did not.

Let us capture the independent variables in the list X_features.

In [67]:
X_features = list(new_bank_df.columns)
X_features.remove('Subscribed')
X_features

['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'campaign',
 'previous']

Encode the categorical features into dummy variables using the following code:

In [68]:
encoded_bank_df = pd.get_dummies(new_bank_df[X_features], drop_first=True, dtype=int)
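
To see what `drop_first=True` does, here is a minimal sketch on a hypothetical three-row frame (`toy_df` and its values are made up for illustration). Numeric columns pass through unchanged, and each categorical column loses the dummy for its first level in sorted order, which then acts as the reference category.

```python
import pandas as pd

# Hypothetical toy data, not taken from bank.csv
toy_df = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "divorced"],
})

# 'divorced' sorts first among the marital levels, so its dummy column is dropped
encoded = pd.get_dummies(toy_df, drop_first=True, dtype=int)
print(encoded)
```

A row with zeros in both `marital_married` and `marital_single` therefore represents a divorced client.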

The outcome variable Subscribed takes the values yes and no. These need to be encoded as 1 (yes) and 0 (no).

In [69]:
Y = new_bank_df['Subscribed'].map({'yes': 1, 'no': 0})
X = encoded_bank_df

For simplicity, the dataset is not split into training and test sets here, as our objective is primarily to understand gain and lift charts.

### Building Logistic Regression Model

In [70]:
import statsmodels.api as sm
logit_model = sm.Logit(Y, sm.add_constant(X)).fit()

Optimization terminated successfully.
         Current function value: 0.335572
         Iterations 7


In [71]:
logit_model.summary2()

| | | | |
|---|---|---|---|
| Model: | Logit | Method: | MLE |
| Dependent Variable: | Subscribed | Pseudo R-squared: | 0.061 |
| Date: | 2025-12-09 14:07 | AIC: | 3082.2384 |
| No. Observations: | 4521 | BIC: | 3236.2341 |
| Df Model: | 23 | Log-Likelihood: | -1517.1 |
| Df Residuals: | 4497 | LL-Null: | -1615.5 |
| Converged: | 1.0000 | LLR p-value: | 1.4866e-29 |
| No. Iterations: | 7.0000 | Scale: | 1.0000 |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.7573 | 0.3799 | -4.6251 | 0.0000 | -2.5019 | -1.0126 |
| age | 0.0078 | 0.0058 | 1.3395 | 0.1804 | -0.0036 | 0.0191 |
| balance | -0.0000 | 0.0000 | -0.2236 | 0.8231 | -0.0000 | 0.0000 |
| campaign | -0.0905 | 0.0238 | -3.8042 | 0.0001 | -0.1371 | -0.0439 |
| previous | 0.1414 | 0.0212 | 6.6569 | 0.0000 | 0.0998 | 0.1830 |
| job_blue-collar | -0.3412 | 0.2000 | -1.7060 | 0.0880 | -0.7331 | 0.0508 |
| job_entrepreneur | -0.2900 | 0.3161 | -0.9175 | 0.3589 | -0.9096 | 0.3295 |
| job_housemaid | -0.0166 | 0.3339 | -0.0497 | 0.9603 | -0.6711 | 0.6379 |
| job_management | -0.0487 | 0.1984 | -0.2455 | 0.8061 | -0.4375 | 0.3401 |


In [72]:
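def get_significant_vars(lm):
    # Collect the p-value of every model term into a DataFrame
    var_p_val_df = pd.DataFrame(lm.pvalues)
    var_p_val_df['vars'] = var_p_val_df.index
    var_p_val_df.columns = ['pvals', 'vars']
    # Keep only the terms that are significant at the 5% level
    return list(var_p_val_df[var_p_val_df['pvals'] <= 0.05]['vars'])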

In [73]:
significant_vars = get_significant_vars(logit_model)
significant_vars

['const',
 'campaign',
 'previous',
 'job_retired',
 'marital_married',
 'education_tertiary',
 'housing_yes',
 'loan_yes']
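
The same filter can also be written as boolean indexing on a Series of p-values. The sketch below is an alternative, not the notebook's own method: it uses a hand-built Series holding a few p-values copied from the summary table (in the notebook, `logit_model.pvalues` would play this role), and the variable names are illustrative.

```python
import pandas as pd

# A few p-values copied from the first model summary
pvalues = pd.Series({
    "const": 0.0000,
    "age": 0.1804,
    "balance": 0.8231,
    "campaign": 0.0001,
    "previous": 0.0000,
    "job_blue-collar": 0.0880,
})

# Boolean indexing keeps the terms significant at the 5% level
significant = pvalues[pvalues <= 0.05].index.tolist()

# The intercept is not a feature, so drop it before selecting columns of X
significant_features = [name for name in significant if name != "const"]
print(significant_features)
```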

Set X_features to only the significant variables and build a logistic regression model with them. The const term is left out of the list because sm.add_constant() adds it back.

In [75]:
X_features = ['campaign', 'previous', 'job_retired', 'marital_married', 'education_tertiary', 'housing_yes', 'loan_yes']
logit_model2 = sm.Logit(Y, sm.add_constant(X[X_features])).fit()
logit_model2.summary2()

Optimization terminated successfully.
         Current function value: 0.337228
         Iterations 7


| | | | |
|---|---|---|---|
| Model: | Logit | Method: | MLE |
| Dependent Variable: | Subscribed | Pseudo R-squared: | 0.056 |
| Date: | 2025-12-09 14:12 | AIC: | 3065.2182 |
| No. Observations: | 4521 | BIC: | 3116.5501 |
| Df Model: | 7 | Log-Likelihood: | -1524.6 |
| Df Residuals: | 4513 | LL-Null: | -1615.5 |
| Converged: | 1.0000 | LLR p-value: | 8.1892e-36 |
| No. Iterations: | 7.0000 | Scale: | 1.0000 |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.4754 | 0.1133 | -13.0260 | 0.0000 | -1.6974 | -1.2534 |
| campaign | -0.0893 | 0.0236 | -3.7925 | 0.0001 | -0.1355 | -0.0432 |
| previous | 0.1419 | 0.0211 | 6.7097 | 0.0000 | 0.1004 | 0.1833 |
| job_retired | 0.8246 | 0.1731 | 4.7628 | 0.0000 | 0.4853 | 1.1639 |
| marital_married | -0.3767 | 0.0969 | -3.8878 | 0.0001 | -0.5667 | -0.1868 |
| education_tertiary | 0.2991 | 0.1014 | 2.9500 | 0.0032 | 0.1004 | 0.4978 |
| housing_yes | -0.5834 | 0.0986 | -5.9179 | 0.0000 | -0.7767 | -0.3902 |
| loan_yes | -0.7025 | 0.1672 | -4.2012 | 0.0000 | -1.0302 | -0.3748 |


The LLR (likelihood ratio test) p-value is far below 0.05, which shows that the overall model is statistically significant. Since the dataset was not split, we will predict probabilities for the same observations that were used to fit the model.
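
The figures in the summary can be cross-checked from the two reported log-likelihoods. The sketch below uses the rounded values printed in the summary, so it reproduces the results only to that precision. The variable names are illustrative, and the pseudo R-squared formula shown is McFadden's.

```python
# Rounded values copied from the model summary
ll_model = -1524.6   # Log-Likelihood of the fitted model
ll_null = -1615.5    # LL-Null: log-likelihood of the intercept-only model
df_model = 7         # number of predictors in the fitted model

# Likelihood ratio statistic: twice the improvement in log-likelihood
llr_statistic = 2 * (ll_model - ll_null)

# McFadden's pseudo R-squared
pseudo_r2 = 1 - ll_model / ll_null

print(round(llr_statistic, 1), round(pseudo_r2, 3))
```

Under the null hypothesis that all slope coefficients are zero, the statistic approximately follows a chi-square distribution with `df_model` degrees of freedom. The reported LLR p-value is the upper-tail probability of that distribution.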

In [79]:
y_pred_df = pd.DataFrame({
    'actual' : Y,
    'predicted_prob': logit_model2.predict(sm.add_constant(X[X_features]))
})
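
Each predicted probability is the logistic function applied to the linear combination of coefficients and feature values. The sketch below works this through for one hypothetical customer (contacted once in this campaign, no previous contacts, all dummy variables equal to 0), using the rounded coefficients from the summary. The dictionary names are illustrative.

```python
import math

# Rounded coefficients copied from the second model summary
coefficients = {
    "const": -1.4754,
    "campaign": -0.0893,
    "previous": 0.1419,
    "job_retired": 0.8246,
    "marital_married": -0.3767,
    "education_tertiary": 0.2991,
    "housing_yes": -0.5834,
    "loan_yes": -0.7025,
}

# Hypothetical customer; const is always 1 (the intercept term)
customer = {
    "const": 1,
    "campaign": 1,
    "previous": 0,
    "job_retired": 0,
    "marital_married": 0,
    "education_tertiary": 0,
    "housing_yes": 0,
    "loan_yes": 0,
}

# Linear predictor (log-odds), then the logistic transform
z = sum(coefficients[name] * customer[name] for name in coefficients)
prob = 1.0 / (1.0 + math.exp(-z))
print(round(z, 4), round(prob, 2))
```

For this customer the log-odds are -1.5647, which corresponds to a predicted probability of roughly 0.17.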

Now sort the observations by their predicted probabilities in descending order.

In [80]:
sorted_pred_df = y_pred_df[['predicted_prob', 'actual']].sort_values(by='predicted_prob', ascending=False).reset_index(drop=True)

After sorting, we will segment all the observations into deciles. First we will find the number of observations in each decile by dividing the total number of observations by 10.

In [83]:
num_per_decile = int(len(sorted_pred_df) / 10)
print("Number of observations per decile:", num_per_decile)

Number of observations per decile: 452


The function get_deciles() takes a DataFrame and segments the observations into deciles and marks each observation with the decile number it belongs to. The DataFrame with sorted probabilities should be passed to this function.

In [84]:
def get_deciles(df):
    # Default every row to the last decile (index 9). 4521 is not divisible
    # by 10, so the leftover row stays in the lowest-probability decile.
    df['decile'] = 9

    idx = 0
    # Iterate over all 10 deciles
    for each_d in range(0, 10):
        # Assign the next num_per_decile (452) observations to this decile
        df.iloc[idx: idx + num_per_decile, df.columns.get_loc('decile')] = each_d
        idx += num_per_decile

    # Shift the labels from 0-9 to 1-10
    df['decile'] = df['decile'] + 1
    return df

In [85]:
deciles_predict_df = get_deciles(sorted_pred_df)

In [86]:
deciles_predict_df[0:10]

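| | predicted_prob | actual | decile |
|---|---|---|---|
| 0 | 0.864769 | 0 | 1 |
| 1 | 0.828031 | 0 | 1 |
| 2 | 0.706809 | 0 | 1 |
| 3 | 0.642337 | 1 | 1 |
| 4 | 0.631032 | 1 | 1 |
| 5 | 0.619146 | 0 | 1 |
| 6 | 0.609129 | 0 | 1 |
| 7 | 0.573199 | 0 | 1 |
| 8 | 0.572364 | 0 | 1 |
| 9 | 0.55935 | 0 | 1 |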
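
Decile labels can also be derived directly from row positions with integer division. The sketch below is an alternative formulation, not the notebook's own code: the helper `assign_deciles` is hypothetical, it assumes the rows are already sorted by predicted probability in descending order, and it places any leftover rows (4521 is not a multiple of 10) in the last decile.

```python
def assign_deciles(n_rows, n_groups=10):
    """Return a 1-based decile label for each row position.

    Assumes rows are already sorted by predicted probability (descending).
    Rows left over after n_groups equal blocks go into the last decile.
    """
    per_group = n_rows // n_groups
    return [min(i // per_group, n_groups - 1) + 1 for i in range(n_rows)]


deciles = assign_deciles(4521)
# Deciles 1-9 hold 452 rows each; decile 10 holds the remaining 453
print(deciles.count(1), deciles.count(10))
```

With a sorted DataFrame, the resulting list could be assigned as a new column whose length matches the number of rows.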
