# Problem Statement: 

Happy Customer Bank wants to cross-sell credit cards to its customers and would like to predict, from the parameters below, whether a customer will be interested.

## Given Parameters:

1. ID                  - Unique Identifier for a row
2. Gender              - Gender of the Customer
3. Age                 - Age of the Customer (in Years)
4. Region_Code         - Code of the Region for the customers
5. Occupation          - Occupation Type for the customer
6. Channel_Code        - Acquisition Channel Code for the Customer  (Encoded)
7. Vintage             - Vintage for the Customer (In Months)
8. Credit_Product      - If the Customer has any active credit product (Home loan, Personal loan, Credit Card etc.)
9. Avg_Account_Balance - Average Account Balance for the Customer in last 12 Months
10. Is_Active          - If the Customer is Active in last 3 Months
11. Is_Lead(Target)    - If the Customer is interested in the Credit Card
                            0 : Customer is not interested
                            1 : Customer is interested

### Models Used:

CatBoost, XGBoost, RandomForest




## Let's begin...

In [None]:
#Importing all the required libraries 

!pip install catboost

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier, Pool
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score,mean_absolute_error,confusion_matrix,classification_report

import xgboost
import warnings 
warnings.filterwarnings('ignore')

## Data Loading

In [None]:
# Loading train and test data sets

train = pd.read_csv("../input/jobathon-may-2021-credit-card-lead-prediction/train.csv")
test = pd.read_csv("../input/jobathon-may-2021-credit-card-lead-prediction/test.csv")

In [None]:
#Creating a function with name 'analysis' for extracting data type, unique and null count

def analysis(data):
    return pd.DataFrame({"Data Type":data.dtypes, "Unique Count":data.apply(lambda x: x.nunique(),axis=0), 
                         "Null Count": data.isnull().sum() })

In [None]:
# Getting train data analysis

analysis(train)

Observations:

- Null values are present only in Credit_Product, so we need a way to replace them
- Data types look fine

In [None]:
# Getting test data analysis

analysis(test)

Observations:

- Null values are present in Credit_Product, so we need a way to replace them
- Data types look fine
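
To size the problem, it can help to look at the share of missing values per column rather than the raw count. A minimal sketch on a small made-up frame (the values are illustrative and not taken from the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train/test; the values are made up for illustration.
toy = pd.DataFrame({
    "Credit_Product": ["Yes", np.nan, "No", np.nan, "No"],
    "Age": [30, 45, 52, 28, 61],
})

# isnull() gives a boolean frame; its column mean is the fraction of missing values.
null_share = toy.isnull().mean()
print(null_share)
```

The same expression can be applied to `train` and `test` to compare how much of Credit_Product is missing in each set.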

In [None]:
# Making copies of the train and test data

train_copy = train.copy()
test_copy = test.copy()
print(train.shape)

## Visualizations

In [None]:
# Check count of target variable 

sns.countplot(x = 'Is_Lead', data = train)

Notes: 

- The data looks imbalanced, so we need to be careful while splitting
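
One common way to be careful with an imbalanced target is a stratified split, which keeps the class ratio the same on both sides. This is a sketch on synthetic labels, shown as an option and not as what this notebook does later:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 negatives and 20 positives (illustrative only).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks scikit-learn to preserve the class proportions in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1234
)
print(y_tr.mean(), y_va.mean())
```

Without `stratify`, the positive share in the validation part can drift away from the overall share purely by chance.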

In [None]:
# Heatmap for numeric variables

plt.figure(figsize = (7,5))
sns.heatmap(train.select_dtypes(include = 'number').corr(), annot = True, cmap = 'YlGnBu')

Notes:

- As expected, Age and Vintage are correlated, so they need to be brought to a comparable scale

In [None]:
# Scatterplot to observe Avg_Account_Balance data

plt.figure(figsize=(12,10))
sns.scatterplot(x = 'Age', y = 'Avg_Account_Balance', hue = 'Is_Lead', data = train)

Notes:

- Surprisingly, account balance appears independent of Age. In fact, account balance is a bit higher for the 25-35 age group

In [None]:
# Pair plot among the variables

sns.pairplot(train[['Age', 'Vintage', 'Avg_Account_Balance','Is_Lead']], hue= 'Is_Lead')

Notes:

- Age: most of the customers are under 40, and they are spread roughly equally across both Is_Lead classes
- Comparing Is_Lead with the other attributes does not give a clear picture of the distribution

In [None]:
# Analyse the distribution of various attributes w.r.t. the target variable

plt.figure(figsize = (15,10))

plt.subplot(2,2,1)
sns.countplot(x = 'Gender', hue = 'Is_Lead', data = train).set_title('Gender')

plt.subplot(2,2,2)
sns.countplot(x = 'Occupation', hue = 'Is_Lead', data = train, palette = 'Set2').set_title('Occupation')

plt.subplot(2,2,3)
sns.countplot(x = 'Channel_Code', hue = 'Is_Lead', data = train).set_title('Channel_Code')

Notes:

- Male and female customers show a similar level of interest
- Self-employed people are highly interested compared to others
- Customers acquired through the X3 channel are more likely to be interested in a credit card

In [None]:
# Check the target variable distribution among rows with missing Credit_Product

sns.countplot(x = 'Is_Lead', data = train[train['Credit_Product'].isnull()]).set_title('Is_Lead where Credit_Product is missing')

Notes:

- The majority of rows with missing Credit_Product belong to interested customers, so the missingness itself carries signal and these values should be filled rather than dropped.
- Let's fill them with 'Not Sure' and check the performance
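
The count plot can be complemented with a number: the share of interested customers among rows with and without Credit_Product. A minimal sketch on made-up rows (the `missing`/`present` labels are names chosen here for readability):

```python
import numpy as np
import pandas as pd

# Made-up rows; only the mechanics matter here.
toy = pd.DataFrame({
    "Credit_Product": ["Yes", np.nan, "No", np.nan, "No", np.nan],
    "Is_Lead": [0, 1, 0, 1, 0, 0],
})

# Label each row by whether Credit_Product is missing.
missing = toy["Credit_Product"].isnull().map({True: "missing", False: "present"})

# The mean of a 0/1 target within a group is the share of interested customers.
lead_rate = toy.groupby(missing)["Is_Lead"].mean()
print(lead_rate)
```

A clearly higher rate in the `missing` group would support keeping missingness as its own category.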

## Feature Engineering

In [None]:
# Convert Age from years to months, since Vintage is in months

train['Age'] = train['Age']*12
test['Age'] = test['Age']*12

In [None]:
# Replacing null values with 'Not Sure' for both train and test sets. This effectively creates a new class

train['Credit_Product'] = train['Credit_Product'].fillna("Not Sure")
test['Credit_Product'] = test['Credit_Product'].fillna("Not Sure")
train[train['Credit_Product'] == 'Not Sure'].head()

In [None]:
# Storing target value in 'Target' attribute for further usage

Target = pd.DataFrame(train['Is_Lead'])

In [None]:
# Dropping unwanted columns 

train = train.drop(['Is_Lead', 'ID'], axis = 1)
test = test.drop(['ID'], axis = 1)

print("Shape of train data:", train.shape)
print("Shape of test data:", test.shape)

In [None]:
# Concat both sets to data file

data = pd.concat([train, test])
data.shape

In [None]:
# Trying to reduce skewness by applying a log transform

#data['Vintage'] = round(np.log(round(np.log(data['Vintage']),2)),2)
#data['Age'] = round(np.log(round(np.log(data['Age']),2)),2)
data['Avg_Account_Balance'] = np.log(data['Avg_Account_Balance'])

data.head()
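
The effect of the log transform on skewness can be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for a right-skewed balance column, which is an assumption about the shape of Avg_Account_Balance and not a measurement of it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for an account balance column.
balance = pd.Series(rng.lognormal(mean=13.0, sigma=0.8, size=5000))

# Sample skewness before and after the log transform.
skew_before = balance.skew()
skew_after = np.log(balance).skew()
print(round(skew_before, 2), round(skew_after, 2))
```

Note that `np.log` needs strictly positive inputs; `np.log1p` is a common choice when zeros can occur.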

In [None]:
# Getting numeric and categorical columns

data_num_cols = data.select_dtypes(include = 'number').columns
data_cat_cols = data.columns.difference(data_num_cols)
print("Numeric columns: ", data_num_cols)
print()
print("Categorical columns: ", data_cat_cols)

In [None]:
#Separating both numeric and categorical data from set

data_num_data = data.loc[:, data_num_cols]
data_cat_data = data.loc[:, data_cat_cols]

print("Shape of num data:", data_num_data.shape)
print("Shape of cat data:", data_cat_data.shape)

In [None]:
# Using StandardScaler to scale the data

s_scaler = preprocessing.StandardScaler()
data_num_data_s = s_scaler.fit_transform(data_num_data)

data_num_data_s = pd.DataFrame(data_num_data_s, columns = data_num_cols)

fig, (ax1) = plt.subplots(ncols=1, figsize=(8, 5))
ax1.set_title('After StandardScaler')

sns.kdeplot(data_num_data_s['Age'], ax=ax1)
sns.kdeplot(data_num_data_s['Vintage'], ax=ax1)
sns.kdeplot(data_num_data_s['Avg_Account_Balance'], ax=ax1);

Notes:

- Avg_Account_Balance looks well scaled, while Age and Vintage are still in bad shape
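
The scaling cell fits the scaler on the combined train and test rows. A common alternative, sketched here on tiny made-up arrays and not used in this notebook, is to learn the mean and standard deviation from the training rows only and reuse them for the test rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up numeric columns for illustration.
train_num = np.array([[20.0], [30.0], [40.0]])
test_num = np.array([[50.0]])

# fit() learns mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(train_num)

# transform() applies those same statistics to both sets.
train_scaled = scaler.transform(train_num)
test_scaled = scaler.transform(test_num)
print(train_scaled.ravel(), test_scaled.ravel())
```

This keeps information from the test rows out of the preprocessing step.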

In [None]:
# Dealing with categorical variables using label encoding (each column is encoded independently)

data_cat_data = data_cat_data.apply(LabelEncoder().fit_transform)
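
Label encoding assigns integer codes in sorted order of the category values. The sketch below, on a made-up frame, keeps one encoder per column so that the codes can be mapped back to the original labels; the `encoders` dictionary is a hypothetical helper, not part of this notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical columns for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Credit_Product": ["No", "Not Sure", "Yes", "No"],
})

encoders = {}
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # kept so integer codes can be mapped back later

print(encoded)
print(list(encoders["Credit_Product"].classes_))
```

Calling `encoders["Gender"].inverse_transform(...)` on a column of codes recovers the original strings.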

In [None]:
# Storing cleaned data into 'data_new'

data_num_data_s.reset_index(drop=True, inplace=True)
data_cat_data.reset_index(drop=True, inplace=True)
data_new = pd.concat([data_num_data_s, data_cat_data], axis = 1)

In [None]:
# Splitting back the data into train and test (the first len(train) rows came from train)

n_train = len(train)
train_new = data_new.iloc[:n_train]
test_new = data_new.iloc[n_train:]

print("Shape of train data:", train_new.shape)
print("Shape of test data:", test_new.shape)

In [None]:
# Splitting train data into train and validation for model building

trainx,valx,trainy,valy = train_test_split(train_new,Target,test_size=0.3,random_state=1234)
#print(cust_data.shape)
print(trainx.shape)
print(valx.shape)

In [None]:
# As the data is imbalanced, it needs resampling. Undersampling fits best here

training_set = pd.concat([trainx, trainy], axis = 1)
lead = training_set[training_set.Is_Lead == 1]
not_lead = training_set[training_set.Is_Lead == 0]

In [None]:
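# Draw as many non-leads as there are leads (replace = True samples with replacement, so a row can repeat)
undersample = resample(not_lead, replace = True, n_samples = len(lead), random_state = 4)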

In [None]:
# Storing sampled trainx and trainy data for model training

us_training_set = pd.concat([lead, undersample])
us_trainy = us_training_set['Is_Lead']
us_trainx = us_training_set.drop('Is_Lead', axis = 1)
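
The undersampling steps can be checked on a toy frame. This sketch draws the majority class without replacement (`replace=False`), which is an assumption about the usual intent of undersampling and differs from the `replace = True` setting in the cell above:

```python
import pandas as pd
from sklearn.utils import resample

# Toy training set: 3 leads and 7 non-leads (made up).
toy = pd.DataFrame({"x": range(10), "Is_Lead": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]})
lead = toy[toy.Is_Lead == 1]
not_lead = toy[toy.Is_Lead == 0]

# replace=False draws distinct majority rows, so no row is duplicated.
not_lead_down = resample(not_lead, replace=False, n_samples=len(lead), random_state=4)

# Both classes now have the same number of rows.
balanced = pd.concat([lead, not_lead_down])
print(balanced["Is_Lead"].value_counts())
```

Only the training part is resampled; the validation part keeps its original class ratio so that scores reflect the real distribution.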

## Models

### 1. CatBoost Classifier

In [None]:
# Storing the positions of the categorical (non-float) attributes in a variable and defining the SEED value

cat_var = np.where(us_trainx.dtypes != np.float64)[0]
SEED = 1993

In [None]:
params = {
    'cat_features':cat_var,
    'eval_metric': 'AUC',
    'random_seed': SEED}

cat = CatBoostClassifier(**params)

cross_val_score(cat,us_trainx, us_trainy, cv=5, n_jobs=-1, verbose=1, scoring='roc_auc').mean()

In [None]:
# Fitting the CatBoost classifier

#cat = CatBoostClassifier.fit(X = us_trainx, y = us_trainy, cat_features=categorical_var)
mod = cat.fit(us_trainx, us_trainy,plot=True, verbose=False)

In [None]:
# Predicting values on train and validation sets

pred_train_cat = mod.predict(trainx)
pred_val_cat = mod.predict(valx)
pred_val_cat

In [None]:
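# Checking roc_auc_score for both train and validation sets.
# ROC AUC ranks by score, so it is computed from the predicted probability of class 1
# rather than from the hard 0/1 labels returned by predict().

cat_auc_train = roc_auc_score(trainy, mod.predict_proba(trainx)[:, 1])
cat_auc_val = roc_auc_score(valy, mod.predict_proba(valx)[:, 1])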
print("ROC_AUC_Score Train: ",cat_auc_train)
print("ROC_AUC_Score Val: ", cat_auc_val)
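
`roc_auc_score` accepts either hard labels or scores, but the two inputs generally give different numbers: hard labels collapse the ranking to a single threshold. A small sketch with made-up probabilities shows the difference:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
# Made-up predicted probabilities of the positive class.
scores = np.array([0.10, 0.40, 0.45, 0.35, 0.80, 0.90])
# Hard labels obtained by thresholding the scores at 0.5.
labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

For a fitted classifier, the scores would typically come from `predict_proba(X)[:, 1]`.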

### 2. XGBoost Classifier

In [None]:
# Defining XGBoost classifier

xgb = xgboost.XGBClassifier()

xgb = xgb.fit(us_trainx, us_trainy)

In [None]:
cross_val_score(xgb,us_trainx, us_trainy, cv=5, n_jobs=-1, verbose=1, scoring='roc_auc').mean()

In [None]:
'''# Tried applying grid-search CV with a small number of parameters, but it did not work due to technical constraints
# Time taken to run: 100min


xgb_params = {'learning_rate': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5],
                'max_depth': [3, 5, 7, 10, 15, 20],
                'min_child_weight': [1, 3, 5]}

gridcv = GridSearchCV(estimator = xgboost.XGBClassifier(),
                      param_grid = xgb_params)

gridcv.fit(us_trainx, us_trainy)'''
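
The grid above has 6 x 6 x 3 = 108 combinations, each refitted once per fold, which explains the long run time. One cheaper option is a randomized search that samples only a few combinations. The sketch below runs on synthetic data and swaps in scikit-learn's `GradientBoostingClassifier` to stay dependency-free; `min_samples_leaf` is used as a rough stand-in for `min_child_weight`, and all values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the search finishes quickly.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_distributions = {
    "learning_rate": [0.01, 0.1, 0.2, 0.3],
    "max_depth": [2, 3, 5],
    "min_samples_leaf": [1, 3, 5],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # only 5 of the 36 combinations are tried
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Since `XGBClassifier` follows the scikit-learn estimator interface, it could be passed as `estimator` in the same way, with its own parameter names.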

In [None]:
# Predicting on train and validation data 

pred_train_xgb = xgb.predict(trainx)
pred_val_xgb = xgb.predict(valx)
pred_val_xgb

In [None]:
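# Measuring the roc_auc_score values from the predicted probability of class 1
# (hard 0/1 labels would reduce the ROC curve to a single threshold)

xgb_auc_train = roc_auc_score(trainy, xgb.predict_proba(trainx)[:, 1])
xgb_auc_val = roc_auc_score(valy, xgb.predict_proba(valx)[:, 1])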
print("ROC_AUC_Score Train: ",xgb_auc_train)
print("ROC_AUC_Score Val: ", xgb_auc_val)

In [None]:
# Getting Confusion Matrix and Classification Report

results = confusion_matrix(valy, pred_val_xgb) 
print('Confusion Matrix :')
print(results) 
print ('Accuracy Score :',accuracy_score(valy, pred_val_xgb))
print ('Report : ')
print (classification_report(valy, pred_val_xgb))

### 3. RandomForest Classifier

In [None]:
# Model fit

rfc = RandomForestClassifier(max_depth = 15, criterion= 'entropy', n_estimators=200)
rfc.fit(X = us_trainx,y = us_trainy)

In [None]:
cross_val_score(rfc,us_trainx, us_trainy, cv=5, n_jobs=-1, verbose=1, scoring='roc_auc').mean()

In [None]:
# Predicting on train and validation data 

pred_train_rfc = rfc.predict(trainx)
pred_val_rfc = rfc.predict(valx)
pred_val_rfc

In [None]:
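# Measuring the roc_auc_score values from the predicted probability of class 1
# (hard 0/1 labels would reduce the ROC curve to a single threshold)

rfc_auc_train = roc_auc_score(trainy, rfc.predict_proba(trainx)[:, 1])
rfc_auc_val = roc_auc_score(valy, rfc.predict_proba(valx)[:, 1])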
print("ROC_AUC_Score Train: ",rfc_auc_train)
print("ROC_AUC_Score Val: ", rfc_auc_val)

In [None]:
# Getting Confusion Matrix and Classification Report

results = confusion_matrix(valy, pred_val_rfc) 
print('Confusion Matrix :')
print(results) 
print ('Accuracy Score :',accuracy_score(valy, pred_val_rfc))
print ('Report : ')
print (classification_report(valy, pred_val_rfc))

## Submission File

In [None]:
# Copying ID column from test_copy and creating the 'submission' file

#submission = pd.DataFrame(test_copy['ID'])

# Storing best model output to submission file

#submission['Is_Lead'] = pred_test_rfc
#submission.head()

# Downloading the file to local drive

#submission1_XGb = submission.to_csv (r'C:\Users\91879\Desktop\AV\Submissions\Submission_Prudhvi.csv',index = None, header=True)
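
The commented-out cell refers to `pred_test_rfc`, which is not computed anywhere above. A self-contained sketch of the submission step on synthetic data follows; the IDs and the model are hypothetical, and whether to submit probabilities or hard labels depends on the competition rules (probabilities are shown here as an assumption):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for the prepared train and test features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]
test_ids = [f"ID_{i}" for i in range(len(X_test))]  # hypothetical IDs

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

submission = pd.DataFrame({
    "ID": test_ids,
    "Is_Lead": model.predict_proba(X_test)[:, 1],  # probability of class 1
})

# Without a path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])
```

In this notebook the equivalent inputs would be `test_new` for the features and `test_copy['ID']` for the identifiers.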

### Challenges:
1. Missing data in the Credit_Product attribute - I created a separate model with Credit_Product as the target variable and the rest as input attributes, but it did not give promising results, so I finally replaced the missing values with "Not Sure". I think the main reason could be the imbalanced nature of the data.

2. Target value imbalance - Given the size of the data, this is a case of heavy imbalance. I tried various sampling techniques, and undersampling worked best for this binary classification problem.

Initially, I built XGBoost, RF, LR and KNN models. Later I saw some better submissions, tried CatBoost based on them, and it worked very well. I need to study CatBoost and its parameters in more depth.


### Please feel free to add your comments and suggestions!!!