
# Problem Statement
* Our client, Happy Customer Bank, is a mid-sized private bank that deals in all kinds of banking products.
* It also cross-sells products to its existing customers through different channels such as tele-calling, e-mails, recommendations on net banking, mobile banking, etc.
* In this case, the bank wants to cross-sell its credit cards to its existing customers.
* The bank has identified a set of customers who are eligible for these credit cards.

**Given historic data and other data of the customers we have to identify which customers are most likely to accept our cross-sell offer.**

## Data Description:-
We have the following information regarding the customer:
* Customer details (gender, age, region etc.)
* Details of his/her relationship with the bank (Channel_Code, Vintage, Avg_Account_Balance etc.)

## Expected Outcome:-
* Build a model to predict whether the person will be interested in buying the Credit card offered by our client.
* Grading Metric: **ROC_AUC_SCORE**

## Problem Category:-
From the data and objective, it's evident that this is a **Binary Classification Problem** on **Tabular Data**.
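
Since the grading metric is ROC AUC, it helps to recall what it measures. The sketch below uses made-up labels and scores (not the bank data): ROC AUC is the fraction of positive/negative pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, made up for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Positive scores: 0.35, 0.8; negative scores: 0.1, 0.4.
# 3 of the 4 positive/negative pairs are ranked correctly.
auc = roc_auc_score(y_true, y_score)
print(auc)
```

Because only the ranking matters, the metric is computed on predicted probabilities rather than on hard 0/1 labels.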

So without further ado, let's now start with some basic imports to take us through this journey of Lead prediction:-

In [None]:
# Aesthetics
import warnings
import sklearn.exceptions
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

# General
from tqdm import tqdm
import pandas as pd
import numpy as np
import os
import random
pd.set_option('display.max_columns', None)

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid")

# Machine Learning
# Utils
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, StratifiedKFold
from sklearn import preprocessing
#Feature Selection
from sklearn.feature_selection import chi2, f_classif, f_regression, mutual_info_classif
from sklearn.feature_selection import mutual_info_regression, SelectKBest, SelectPercentile
from sklearn.feature_selection import VarianceThreshold
# Models
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from imblearn.ensemble import BalancedBaggingClassifier, BalancedRandomForestClassifier
# Unsupervised Models
from sklearn.cluster import KMeans
#Metrics
from sklearn.metrics import roc_auc_score

# Fixing Seed
RANDOM_SEED = 42

def seed_everything(seed=RANDOM_SEED):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
    
seed_everything()

# EDA
Let's first take a basic look at the data we have at hand.

In [None]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
sub_df = pd.read_csv("sample.csv")

Let's see what columns we have in the training data.

In [None]:
train_df.sample(10)

In [None]:
train_df.columns

From the column keys in the problem statement, we know the following about each feature:-

 Variable      | Definition         
 :-----------  |:------------
 ID | Unique Identifier for a row
 Gender | Gender of the Customer
 Age | Age of the Customer (in Years)
 Region_Code | Code of the Region for the customers
 Occupation | Occupation Type for the customer
 Channel_Code | Acquisition Channel Code for the Customer  (Encoded)
 Vintage | Vintage for the Customer (In Months)
 Credit_Product | If the Customer has any active credit product (Home loan, Personal loan, Credit Card etc.)
 Avg_Account_Balance | Average Account Balance for the Customer in last 12 Months
 Is_Active | If the Customer is Active in last 3 Months
 Is_Lead (Target) | 0: Customer is not interested; 1: Customer is interested

We can get a rough idea of the variable types from the definitions themselves, and looking at the data makes it clearer.

In [None]:
train_df.describe().T

In [None]:
np.sum(train_df.isnull())

We can see that some columns contain null values. Most models cannot ingest nulls directly, so let's look at the null situation upfront:-
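
As a minimal sketch of the calculation, the null percentage per column can also be read off the mean of the boolean null mask. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration only.
toy = pd.DataFrame({
    "Credit_Product": ["Yes", np.nan, "No", np.nan],
    "Age": [30, 45, 52, 61],
})

# Mean of the boolean null mask = fraction of nulls per column.
null_pct = toy.isnull().mean() * 100
# Keep only the columns that actually contain nulls.
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct)
```
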
## 1. Null Values

In [None]:
nulls_train = np.sum(train_df.isnull())
nullcols_train = nulls_train.loc[(nulls_train != 0)].sort_values(ascending=False)
nullcols_train = nullcols_train.apply(lambda x: 100*x/train_df.shape[0])

barplot_dim = (15, 8)
fig, ax = plt.subplots(figsize=barplot_dim)  # subplots returns (figure, axes)
sns.barplot(x=nullcols_train.index, y=nullcols_train, ax=ax)
plt.ylabel("Null %", size=20);
plt.xlabel("Feature Name", size=20);
plt.show()

In [None]:
# Same null summary, this time for the test set
nulls_test = np.sum(test_df.isnull())
nullcols_test = nulls_test.loc[(nulls_test != 0)].sort_values(ascending=False)
nullcols_test = nullcols_test.apply(lambda x: 100*x/test_df.shape[0])

barplot_dim = (15, 8)
fig, ax = plt.subplots(figsize=barplot_dim)  # subplots returns (figure, axes)
sns.barplot(x=nullcols_test.index, y=nullcols_test, ax=ax)
plt.ylabel("Null %", size=20);
plt.xlabel("Feature Name", size=20);
plt.show()

The situation is almost the same in the train and test sets: around 12% of the values in the 'Credit_Product' column are missing. Looking at the feature definition, we can assume that a NaN in Credit_Product means we do not know whether the customer has an active credit product. So we can replace NaN with 'Unk' (unknown) and create another binary column to track whether the value was known:-
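
The idea can be illustrated on a toy column first. This is only a sketch: the `toy` frame is made up, and the indicator is built with `notna()` here, which is one of several equivalent ways to do it.

```python
import numpy as np
import pandas as pd

# Toy column, made up for illustration only.
toy = pd.DataFrame({"Credit_Product": ["Yes", np.nan, "No"]})

# Record which rows were originally known before overwriting the NaNs.
toy["Credit_Product_Known"] = toy["Credit_Product"].notna().astype(int)
# Replace the missing values with an explicit 'Unk' category.
toy["Credit_Product"] = toy["Credit_Product"].fillna("Unk")
print(toy)
```
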

In [None]:
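# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas and may not update the DataFrame.
train_df['Credit_Product'] = train_df['Credit_Product'].fillna('Unk')
test_df['Credit_Product'] = test_df['Credit_Product'].fillna('Unk')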

In [None]:
train_df['Credit_Product_Known'] = train_df['Credit_Product'].apply(lambda x: 0 if x == 'Unk' else 1)
test_df['Credit_Product_Known'] = test_df['Credit_Product'].apply(lambda x: 0 if x == 'Unk' else 1)

In [None]:
np.sum(train_df.isnull())

In [None]:
np.sum(test_df.isnull())

Now that we have resolved all the null values, let's move on to EDA, starting with the class imbalance of the dataset.
## 2. Class Imbalance
As this is a classification problem, let's start with the population of each class in our training set.

In [None]:
ax = plt.subplots(figsize=(18, 6))
sns.set_style("whitegrid")
sns.countplot(x='Is_Lead', data=train_df);
plt.ylabel("No. of Observations", size=20);
plt.xlabel("Is Lead?", size=20);

In [None]:
imbalance_ratio = train_df[train_df['Is_Lead'] == 0].shape[0]/train_df[train_df['Is_Lead'] == 1].shape[0]
print(f'Imbalance ratio: {imbalance_ratio}')

Okay, so it is an imbalanced set. We have to keep that in mind while modelling and choosing hyper-parameters later.  
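
One common way to account for imbalance is through class weights. The sketch below uses made-up labels and scikit-learn's `compute_class_weight` helper to show how 'balanced' weights relate to the imbalance ratio. It is an illustration of the idea, not necessarily the weighting used later in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target with a 3:1 imbalance, made up for illustration only.
y = np.array([0, 0, 0, 1])
classes = np.unique(y)

# 'balanced' weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)

# The ratio of the two weights equals the imbalance ratio (3.0 here).
weight_ratio = class_weight[1] / class_weight[0]
```
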
## 3. Feature Value Counts
Let's see how many unique values there are in each feature.

In [None]:
train_df.nunique()

In [None]:
train_df.shape

* There is a mix of continuous, high-cardinality categorical and low-cardinality categorical features.

Now let's look at each feature separately and understand the data...
## 4. Gender
This feature contains the Gender data of the customer

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.set_style("whitegrid")
sns.countplot(x='Gender', hue='Is_Lead', data=train_df);

In [None]:
# Response Rate from Gender
v = train_df.groupby('Gender').Is_Lead.value_counts().unstack()
v['Ratio'] = v[1]/(v[0] + v[1])
v.reset_index(inplace=True)

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.set_style("whitegrid")
sns.barplot(x='Gender', y='Ratio', data=v.sort_values(by=['Ratio'], ascending=False));

* There are more observations from Male customers than from Female customers in the training data.
* According to the dataset, Male customers have a better conversion ratio than Female customers.
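
Because `Is_Lead` is coded as 0/1, the response rate per category is simply the group mean of the target. A minimal sketch on made-up data (the `toy` frame is hypothetical) that yields the same kind of `Ratio` column as the cells above:

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Is_Lead": [1, 0, 1, 0, 1],
})

# Mean of a 0/1 target per group = share of positive responses.
rate = (
    toy.groupby("Gender")["Is_Lead"].mean()
    .rename("Ratio")
    .reset_index()
    .sort_values("Ratio", ascending=False)
)
print(rate)
```

The same pattern applies to any of the categorical features explored in this notebook.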

## 5. Age
Age of the Customer (in Years)

In [None]:
g = sns.catplot(x='Is_Lead', y='Age', kind='boxen', data=train_df);
g.fig.set_size_inches(15,8)

In [None]:
g = sns.catplot(x='Gender', y='Age', kind='boxen', data=train_df);
g.fig.set_size_inches(15,8)

In [None]:
g = sns.catplot(x='Gender', y='Age', hue='Is_Lead', kind='box', data=train_df);
g.fig.set_size_inches(15,8)

* On average, older customers are more likely to respond positively to our offer.
* The average age of Male customers in the dataset is higher than that of Female customers.
* Split by gender, the average age of positively responding Male customers is higher than that of positively responding Female customers.
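
The plots above show medians and quantiles rather than means, so a numeric summary can back up the observations. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Is_Lead": [1, 0, 1, 0, 1, 0],
    "Age": [50, 30, 45, 28, 60, 32],
})

# Mean and median age for every (Gender, Is_Lead) combination.
summary = toy.groupby(["Gender", "Is_Lead"])["Age"].agg(["mean", "median"])
print(summary)
```
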

## 6. Region Code
Code of the Region for the customers

In [None]:
ax = plt.subplots(figsize=(30, 8))
sns.set_style("whitegrid")
sns.countplot(x='Region_Code', hue='Is_Lead', data=train_df);

In [None]:
# Response Rate from Region_Code
v = train_df.groupby('Region_Code').Is_Lead.value_counts().unstack()
v['Ratio'] = v[1]/(v[0] + v[1])
v.reset_index(inplace=True)

In [None]:
v['Ratio'].mean()

In [None]:
ax = plt.subplots(figsize=(30, 8))
sns.set_style("whitegrid")
sns.barplot(x='Region_Code', y='Ratio', data=v.sort_values(by=['Ratio'], ascending=False));

## 7. Occupation
Occupation Type for the customer

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.set_style("whitegrid")
sns.countplot(x='Occupation', hue='Is_Lead', data=train_df);

In [None]:
# Response Rate from Occupation
v = train_df.groupby('Occupation').Is_Lead.value_counts().unstack()
v['Ratio'] = v[1]/(v[0] + v[1])
v.reset_index(inplace=True)

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.set_style("whitegrid")
sns.barplot(x='Occupation', y='Ratio', data=v.sort_values(by=['Ratio'], ascending=False));

* The customer base in the training data is mostly Self-Employed or Salaried.
* Entrepreneur customers are far more likely to respond positively to our credit card offer.
* Salaried customers are the least likely to respond positively to our offer.

## 8. Channel Code
Acquisition Channel Code for the Customer  (Encoded)

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.set_style("whitegrid")
sns.countplot(x='Channel_Code', hue='Is_Lead', data=train_df);

In [None]:
# Response Rate from Channel_Code
v = train_df.groupby('Channel_Code').Is_Lead.value_counts().unstack()
v['Ratio'] = v[1]/(v[0] + v[1])
v.reset_index(inplace=True)

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.set_style("whitegrid")
sns.barplot(x='Channel_Code', y='Ratio', data=v.sort_values(by=['Ratio'], ascending=False));

* Customers acquired through the X3 channel are the most likely to respond positively, and those acquired through X1 are the least likely.

## 9. Vintage
Vintage for the Customer (In Months)

In [None]:
g = sns.catplot(x='Is_Lead', y='Vintage', kind='boxen', data=train_df);
g.fig.set_size_inches(15,8)

* On average, customers with a higher Vintage are more likely to respond positively.

## 10. Credit Product
If the Customer has any active credit product (Home loan, Personal loan, Credit Card etc.)

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.set_style("whitegrid")
sns.countplot(x='Credit_Product', hue='Is_Lead', data=train_df);

In [None]:
# Response Rate from Credit_Product
v = train_df.groupby('Credit_Product').Is_Lead.value_counts().unstack()
v['Ratio'] = v[1]/(v[0] + v[1])
v.reset_index(inplace=True)

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.set_style("whitegrid")
sns.barplot(x='Credit_Product', y='Ratio', data=v.sort_values(by=['Ratio'], ascending=False));

* Customers who already have some kind of credit product are far more likely to respond positively to the offer than customers without one.

## 11. Avg Account Balance
Average Account Balance for the Customer in last 12 Months

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.histplot(data=train_df, x='Avg_Account_Balance');

In [None]:
g = sns.catplot(x='Is_Lead', y='Avg_Account_Balance', kind='boxen', data=train_df);
g.fig.set_size_inches(15,8)

* Positively responding customers have a slightly higher average account balance than negatively responding customers.

## 12. Is Active
If the Customer is Active in last 3 Months

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.set_style("whitegrid")
sns.countplot(x='Is_Active', hue='Is_Lead', data=train_df);

In [None]:
# Response Rate from Is_Active
v = train_df.groupby('Is_Active').Is_Lead.value_counts().unstack()
v['Ratio'] = v[1]/(v[0] + v[1])
v.reset_index(inplace=True)

In [None]:
ax = plt.subplots(figsize=(8, 5))
sns.set_style("whitegrid")
sns.barplot(x='Is_Active', y='Ratio', data=v.sort_values(by=['Ratio'], ascending=False));

* Customers active in the last 3 months are slightly more likely to respond positively than inactive customers.

In [None]:
target = ['Is_Lead']
not_features = ['ID', 'Is_Lead']
cols = list(train_df.columns)
features = [feat for feat in cols if feat not in not_features]
print(features)

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categorical_features = []
numerical_features = []

for i in features:
    if train_df[i].dtype in numerics:
        numerical_features.append(i)
    else:
        categorical_features.append(i)
        
print(f'Numeric features: {numerical_features}')
print(f'Categorical features: {categorical_features}')

## 13. Numerical Feature Interactions

In [None]:
g = sns.pairplot(train_df[numerical_features + ['Is_Lead']], hue='Is_Lead')
g.fig.set_size_inches(10,8)

In [None]:
train_df_cor_spear = train_df[numerical_features].corr(method='spearman')
plt.figure(figsize=(10,8))
sns.heatmap(train_df_cor_spear, square=True, cmap='coolwarm', annot=True);

# KFolds Split
Before we move on to feature engineering, it is a good idea to fix the cross-validation splits. That way, we reduce the risk of data leakage and can be more confident that each validation set is representative of unseen real-world data.
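
As a sanity check on what stratification buys us, the sketch below assigns folds on a made-up target and verifies that every fold keeps the same positive rate. The `toy` frame is hypothetical, and `shuffle=True` with a fixed `random_state` is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data with a 3:1 class imbalance, made up for illustration only.
toy = pd.DataFrame({"x": np.arange(20), "Is_Lead": [0] * 15 + [1] * 5})
toy["kfold"] = -1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(skf.split(X=toy, y=toy["Is_Lead"])):
    # Rows in the validation part of this split get the fold number.
    toy.loc[valid_idx, "kfold"] = fold

# Positive rate per fold; stratification keeps it at the overall 25%.
fold_rates = toy.groupby("kfold")["Is_Lead"].mean()
print(fold_rates)
```
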

In [None]:
NUM_SPLITS = 5

train_df["kfold"] = -1
train_df = train_df.sample(frac=1).reset_index(drop=True)
y = train_df.Is_Lead.values
kf = StratifiedKFold(n_splits=NUM_SPLITS)
for f, (t_, v_) in enumerate(kf.split(X=train_df, y=y)):
    train_df.loc[v_, 'kfold'] = f
    
train_df.head()

# Feature Encoding

First, let's convert all the categorical features to numbers. I decided to use a label encoder after a lot of experiments with other types of categorical encoders; the decision is based on the CV ROC AUC score.
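
A minimal sketch of why the encoder is fitted on the union of train and test values: a category that appears only in the test set would otherwise raise an error at transform time. The two toy series below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns, made up for illustration; 'Entrepreneur' occurs only in test.
train_col = pd.Series(["Salaried", "Other", "Salaried"])
test_col = pd.Series(["Entrepreneur", "Other"])

enc = LabelEncoder()
# Fit on all values so that both sets share one consistent mapping.
enc.fit(pd.concat([train_col, test_col]))

train_codes = enc.transform(train_col)
test_codes = enc.transform(test_col)
print(list(enc.classes_), train_codes, test_codes)
```

Classes are sorted alphabetically, so the integer codes carry no meaningful order. Tree-based models usually cope with this, while linear models may not.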

In [None]:
def label_enc(train_df, test_df, features):
    lbl_enc = preprocessing.LabelEncoder()
    full_data = pd.concat(
        [train_df[features], test_df[features]],
        axis=0
    )
    
    for col in (features):
        print(col)
        if train_df[col].dtype == 'object':
            lbl_enc.fit(full_data[col].values)
            train_df[col] = lbl_enc.transform(train_df[col])
            test_df[col] = lbl_enc.transform(test_df[col])
            
    return train_df, test_df

In [None]:
mapping_dict = {'Yes': 1,
                'No': 0,
                'Unk': 0.5}

In [None]:
train_df['Credit_Product'] = train_df['Credit_Product'].map(mapping_dict)
test_df['Credit_Product'] = test_df['Credit_Product'].map(mapping_dict)

train_df['Is_Active'] = train_df['Is_Active'].map(mapping_dict)
test_df['Is_Active'] = test_df['Is_Active'].map(mapping_dict)

In [None]:
train_df.head()

In [None]:
target = ['Is_Lead']
not_features = ['ID', 'Is_Lead', 'kfold']
cols = list(train_df.columns)
features = [feat for feat in cols if feat not in not_features]
print(features)

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categorical_features = []
numerical_features = []

for i in features:
    if train_df[i].dtype in numerics:
        numerical_features.append(i)
    else:
        categorical_features.append(i)
        
print(f'Numeric features: {numerical_features}')
print(f'Categorical features: {categorical_features}')

In [None]:
train_df[numerical_features] = train_df[numerical_features].astype('float64')
test_df[numerical_features] = test_df[numerical_features].astype('float64')

In [None]:
train_df[target] = train_df[target].astype('float64')

In [None]:
cutoff = 10
low_cardinal_columns = []
high_cardinal_columns = []

for i in categorical_features:
    if train_df[i].nunique() > cutoff:
        high_cardinal_columns.append(i)
    else:
        low_cardinal_columns.append(i)
        
print(f'High Cardinality columns: {high_cardinal_columns}')
print(f'Low Cardinality columns: {low_cardinal_columns}')

In [None]:
if len(low_cardinal_columns) > 0:
    train_df, test_df = label_enc(train_df, test_df, low_cardinal_columns)

In [None]:
if len(high_cardinal_columns) > 0:
    train_df, test_df = label_enc(train_df, test_df, high_cardinal_columns)

# Clustering
We can cluster the customers into groups using the given features and use the cluster a customer belongs to as an additional feature.
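
KMeans relies on Euclidean distances, so columns with large magnitudes (such as account balances) can dominate columns with small ones (such as age). Standardising the features first is a common precaution. The sketch below shows the mechanics on made-up data; the scaling step is an assumption of this sketch and is not part of the cell that follows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy customers (age, balance), made up for illustration only.
X_train = np.array([
    [25, 1_000_000.0],
    [27, 1_050_000.0],
    [65, 1_500_000.0],
    [67, 1_550_000.0],
])
X_new = np.array([[26, 1_020_000.0], [66, 1_520_000.0]])

# Scale each column to zero mean and unit variance, then cluster.
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
train_labels = pipe.fit_predict(X_train)
# Unseen rows are scaled with the training statistics before assignment.
new_labels = pipe.predict(X_new)
print(train_labels, new_labels)
```
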

In [None]:
# `n_jobs` was removed from KMeans in scikit-learn 1.0; n_init is set explicitly
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(train_df[features])

train_df['cluster'] = kmeans.predict(train_df[features])
test_df['cluster'] = kmeans.predict(test_df[features])

# Feature Selection

We need to select only the important features for better model performance. Unnecessary features will, in the best case, add nothing productive to the algorithm's calculations or, in the worst case, 'confuse' the model.

To do this, let's create a wrapper class around scikit-learn's built-in univariate statistical tests that takes a few basic inputs from the user and returns the selected features.

In [None]:
cols = list(train_df.columns)
features = [feat for feat in cols if feat not in not_features+target]

In [None]:
class UnivariateFeatureSelction:
    def __init__(self, n_features, problem_type, scoring, return_cols=True):
        """
        Custom univariate feature selection wrapper on
        different univariate feature selection models from
        scikit-learn.
        :param n_features: SelectPercentile if float else SelectKBest
        :param problem_type: classification or regression
        :param scoring: scoring function, string
        """
        self.n_features = n_features
        
        if problem_type == "classification":
            valid_scoring = {
                "f_classif": f_classif,
                "chi2": chi2,
                "mutual_info_classif": mutual_info_classif
            }
        else:
            valid_scoring = {
                "f_regression": f_regression,
                "mutual_info_regression": mutual_info_regression
            }
        if scoring not in valid_scoring:
            raise Exception("Invalid scoring function")
            
        if isinstance(n_features, int):
            self.selection = SelectKBest(
                valid_scoring[scoring],
                k=n_features
            )
        elif isinstance(n_features, float):
            self.selection = SelectPercentile(
                valid_scoring[scoring],
                percentile=int(n_features * 100)
            )
        else:
            raise Exception("Invalid type of feature")
    
    def fit(self, X, y):
        return self.selection.fit(X, y)
    
    def transform(self, X):
        return self.selection.transform(X)
    
    def fit_transform(self, X, y):
        return self.selection.fit_transform(X, y)
    
    def return_cols(self, X):
        # Boolean mask of the columns kept by the fitted selector
        # (works for both SelectKBest and SelectPercentile)
        mask = self.selection.get_support()
        return [feature for feature, keep in zip(X.columns, mask) if keep]

In [None]:
ufs = UnivariateFeatureSelction(
    n_features=1.0,
    problem_type="classification",
    scoring="f_classif"
)

ufs.fit(train_df[features], train_df[target].values.ravel())
selected_features = ufs.return_cols(train_df[features])

Through iterations it was found that all features are important. Hence we are not dropping any features, which is why the n_features parameter has the value 1.0 (i.e. keep 100% of the features).
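
To make the two selection modes concrete, here is a minimal sketch on synthetic data using the underlying scikit-learn selectors directly. The data and the variable names `keep_all` and `keep_top` are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# Synthetic data: only the first of three features drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)

# percentile=100 keeps every feature (the n_features=1.0 case).
keep_all = SelectPercentile(f_classif, percentile=100).fit(X, y)
# k=1 keeps only the feature with the highest F-score.
keep_top = SelectKBest(f_classif, k=1).fit(X, y)

print(keep_all.get_support(), keep_top.get_support())
```
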

In [None]:
print(f'{len(selected_features)} Features Selected')
print(selected_features)

# Models Benchmarking
First let's create a benchmarking function which finds the best single model for this dataset.

In [None]:
def get_models():
    models = dict()
    models['gauss'] = GaussianNB()
    models['QDA'] = QuadraticDiscriminantAnalysis()
    models['lr'] = LogisticRegression(solver='liblinear')
    models['rf'] = RandomForestClassifier(class_weight='balanced_subsample',
                                          random_state=42)
    models['lgbm'] = LGBMClassifier(metric='binary_logloss',
                                    objective='binary',
                                    reg_alpha=2.945525898790487,
                                    max_depth=13,
                                    num_leaves=34,
                                    seed=42,
                                    learning_rate=0.0037601596530868493,
                                    n_estimators=1913)
    models['BalBag'] = BalancedBaggingClassifier()
    models['BalRF'] = BalancedRandomForestClassifier()
    
    return models

def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv,
                             n_jobs=-1, error_score='raise')
    return scores

In [None]:
%%time

X = train_df[selected_features].values
y = train_df[target].values.ravel()  # 1-D target, as scikit-learn expects

models = get_models()
results = []
names = []

for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print(f'{name} : {round(np.mean(scores),6)} ({round(np.std(scores),3)})')

In [None]:
ax = plt.subplots(figsize=(12, 6))
plt.boxplot(results, labels=names, showmeans=True)
plt.show()

# Solution-1
Prediction using the best single model.

In [None]:
# Pick the model with the highest mean CV ROC AUC
mean_scores = [np.mean(score) for score in results]
best_index = int(np.argmax(mean_scores))
model_name = names[best_index]

In [None]:
print(f'Best Score: {round(mean_scores[best_index], 3)}')
print(f'Best Model: {model_name}')

In [None]:
%%time

models = get_models()
clf = models[model_name]
X = train_df[selected_features]
y = train_df[target].values.ravel()  # 1-D target, as scikit-learn expects
clf.fit(X, y)

preds = clf.predict_proba(test_df[selected_features])
sub = pd.DataFrame()
sub['ID'] = test_df['ID']
sub['Is_Lead'] = preds[:, 1]

In [None]:
sub.head()

In [None]:
sub.to_csv('Best_single_Model.csv', index=False)

# Final Solution
Creating a custom ensemble using the best single model from above and training on folds while tracking the OOF scores.
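
The fold-averaging idea can be sketched independently of the model. The example below uses synthetic data and `LogisticRegression` as a lightweight stand-in for the actual model (both are assumptions of this sketch): one model is trained per fold, out-of-fold (OOF) predictions are collected for scoring, and test predictions are averaged across folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the train/test sets.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros(len(X_train))        # out-of-fold predictions
test_pred = np.zeros(len(X_test))   # running average of test predictions

for train_idx, valid_idx in skf.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    # Each training row is predicted by the one model that never saw it.
    oof[valid_idx] = model.predict_proba(X_train[valid_idx])[:, 1]
    # Every fold model contributes equally to the test prediction.
    test_pred += model.predict_proba(X_test)[:, 1] / skf.get_n_splits()

oof_auc = roc_auc_score(y_train, oof)
print(f"OOF ROC AUC: {oof_auc:.3f}")
```

The OOF score gives a single estimate over the whole training set, which complements the per-fold scores printed in the loop below.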

In [None]:
from lightgbm import early_stopping, log_evaluation

test_pred_all = None

for i in tqdm(range(NUM_SPLITS)):
    print('#'*50)
    print(f'{"*"*21} FOLD {i+1} {"*"*21}')
    
    train = train_df[train_df['kfold'] != i]
    valid = train_df[train_df['kfold'] == i]
    test = test_df
    
    clf = LGBMClassifier(metric='binary_logloss',
                         objective='binary',
                         reg_alpha=2.945525898790487,
                         max_depth=13,
                         num_leaves=34,
                         seed=42,
                         learning_rate=0.0037601596530868493,
                         n_estimators=20000)
    # LightGBM 4.0 removed `early_stopping_rounds` and `verbose` from fit();
    # the equivalent callbacks are used instead.
    clf.fit(train[selected_features].values, train[target].values.ravel(),
            eval_set=[(valid[selected_features].values, valid[target].values.ravel())],
            eval_metric='binary_logloss',
            callbacks=[early_stopping(stopping_rounds=500),
                       log_evaluation(period=1000)])
    
    # Score this fold on its held-out validation part
    pred = clf.predict_proba(valid[selected_features].values)[:, 1]
    roc = roc_auc_score(valid[target].values.ravel(), pred)

    # Accumulate test predictions; they are averaged after the loop
    test_pred = clf.predict_proba(test[selected_features].values)[:, 1]
    if test_pred_all is None:
        test_pred_all = test_pred
    else:
        test_pred_all += test_pred
    
    print(f'ROC: {roc}')
    print('#'*50)
    
test_pred_all /= NUM_SPLITS

In [None]:
sub_2 = pd.DataFrame()
sub_2['ID'] = test_df['ID']
sub_2['Is_Lead'] = test_pred_all

In [None]:
sub_2.head()

In [None]:
sub_2.to_csv('LGBM_5fold_Ensemble.csv', index=False)