# Banking Dataset - Marketing Targets
**(Banking dataset of different customers, used to predict whether they will subscribe to a term deposit or not.)**

## About the dataset : 

Term deposits are a major source of income for a bank.
A term deposit is a cash investment held at a financial institution. Your money is invested for an agreed rate of interest over a fixed amount of time, or term.
The bank has various outreach plans to sell term deposits to their customers such as email marketing, advertisements, telephonic marketing, and digital marketing.
Telephonic marketing campaigns remain one of the most effective ways to reach people. However, they require a huge investment, because large call centers are hired to execute these campaigns. Hence, it is crucial to identify beforehand the customers most likely to convert, so that they can be specifically targeted by phone.

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

**Attributes of the dataset:**

|column name|description|
|-----|-----|
|age | Age of the client|
|job | Type of job|
|marital | Marital status of the client|
|education | Education level of the client|
|default | Whether the client has credit in default|
|balance | Average yearly balance, in euros|
|housing | Whether the client has a housing loan|
|loan | Whether the client has a personal loan|
|contact | Contact communication type|
|day | Day of the month of the last contact|
|month | Month of the last contact|
|duration | Duration of the last contact, in seconds|
|campaign | Number of contacts performed during this campaign for this client|
|pdays | Number of days since the client was last contacted in a previous campaign (-1 appears in the data for clients with no previous contact)|
|previous | Number of contacts performed before this campaign for this client|
|poutcome | Outcome of the previous marketing campaign|
|y | Has the client subscribed to a term deposit? (target variable)|

# Objective

**The classification goal is to predict if the client will subscribe to a term deposit (variable y)**

### Import Libraries

In [8]:
#basic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import joblib

In [9]:
#for preprocessing data
from sklearn.preprocessing import OneHotEncoder ,StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn import set_config
from imblearn.over_sampling import SMOTE

In [10]:
#statistical models
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier ,AdaBoostClassifier
from xgboost import XGBClassifier

In [11]:
#Evaluations
from sklearn.metrics import classification_report , confusion_matrix , accuracy_score ,recall_score

In [12]:
#Set some default parameters
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',None)
%matplotlib inline
sns.set_style('darkgrid')
plt.rcParams["figure.figsize"] = (8, 5)
set_config(display='diagram')

### Load the data

In [6]:
data = pd.read_csv('data/train.csv',header = 0,sep =';')
data.head()  

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [7]:
#check the shape
print('shape of the data:',data.shape)
print('rows:',data.shape[0])
print('columns:',data.shape[1])

shape of the data: (45211, 17)
rows: 45211
columns: 17


### Make a copy of the data

In [8]:
df = data.copy(deep = True)
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [9]:
#check the shape
print('shape of the copy:',df.shape)

shape of the copy: (45211, 17)


### Data Preprocessing (Based on EDA)

#### 1.Deal with Outliers

1. The only logically implausible outliers are in the `previous` column, so we drop those rows.
2. The remaining numerical columns contain too many outliers to cap or remove, so instead we build machine learning models that are robust to outliers.
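The EDA behind these two points is not shown here. As a rough sketch of how the share of outliers per numerical column could be measured, the snippet below applies the common 1.5 × IQR rule. The helper name `iqr_outlier_share` and the toy values are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd


def iqr_outlier_share(frame: pd.DataFrame) -> pd.Series:
    """Return, per numeric column, the fraction of rows outside 1.5 * IQR."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.mean()


# Toy data standing in for the real columns (values are made up).
toy = pd.DataFrame({
    "previous": [0, 0, 1, 0, 2, 0, 1, 275],
    "age": [30, 41, 35, 52, 47, 29, 38, 44],
})
shares = iqr_outlier_share(toy)
print(shares)
```

A column with a large share is a poor candidate for row removal, which is the situation described in point 2.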

In [10]:
#drop the row where previous > 100

df = df[df['previous'] <= 100].copy()
df.shape

(45210, 17)

In [11]:
# change 'admin.' to 'admin' in 'job' column
df['job'] = df['job'].replace({'admin.':'admin'})
df['job'].unique()

array(['management', 'technician', 'entrepreneur', 'blue-collar',
       'unknown', 'retired', 'admin', 'services', 'self-employed',
       'unemployed', 'housemaid', 'student'], dtype=object)

#### 2.Separate dependent (y) variable

In [12]:
#split into X and y variable
X= df.drop('y',axis=1)
y= df['y']

print(X.shape)
print(y.shape)

(45210, 16)
(45210,)


#### 3.Feature Selection (Pre model Building)

1. Drop the `default` variable: 98% of the rows fall into a single category, so the column is heavily biased.
2. Drop the `day` variable: little could be inferred from it during EDA.
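The first point can be checked by measuring how much of a column is taken up by its most frequent value. The following is a minimal sketch on toy data; the helper `dominant_share` and the 0.95 cut-off are hypothetical choices, not values taken from the original EDA.

```python
import pandas as pd


def dominant_share(column: pd.Series) -> float:
    """Fraction of rows taken by the most frequent value of a column."""
    return float(column.value_counts(normalize=True).iloc[0])


# Toy columns (made-up values): 'default' is almost constant, 'housing' is not.
toy = pd.DataFrame({
    "default": ["no"] * 49 + ["yes"],
    "housing": ["yes"] * 28 + ["no"] * 22,
})
shares = {name: dominant_share(toy[name]) for name in toy.columns}

# Hypothetical threshold: flag columns where one value covers at least 95% of rows.
near_constant = [name for name, share in shares.items() if share >= 0.95]
print(shares)
print(near_constant)
```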

In [13]:
X.drop(columns=['default','day'],inplace= True)
X.columns

Index(['age', 'job', 'marital', 'education', 'balance', 'housing', 'loan',
       'contact', 'month', 'duration', 'campaign', 'pdays', 'previous',
       'poutcome'],
      dtype='object')

#### 4.Feature Engineering(Training data)

**1.Separate categorical and numerical features**

In [14]:
numerical_cols = X.select_dtypes(include='number').columns
categorical_cols = X.select_dtypes(include='object').columns
print(f'numerical columns :\n{numerical_cols}')

print(f'\ncategorical columns :\n{categorical_cols}')

print('\ncount of numerical columns: ',len(numerical_cols))
print('count of categorical columns: ',len(categorical_cols))

numerical columns :
Index(['age', 'balance', 'duration', 'campaign', 'pdays', 'previous'], dtype='object')

categorical columns :
Index(['job', 'marital', 'education', 'housing', 'loan', 'contact', 'month',
       'poutcome'],
      dtype='object')

count of numerical columns:  6
count of categorical columns:  8


In [15]:
final_columns = X.columns
final_columns

Index(['age', 'job', 'marital', 'education', 'balance', 'housing', 'loan',
       'contact', 'month', 'duration', 'campaign', 'pdays', 'previous',
       'poutcome'],
      dtype='object')

**2.Transform the data into an array**

In [16]:
# Transformer objects

#Imputer 
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

#Encoding
ohe = OneHotEncoder(drop='first',sparse_output=False,handle_unknown='ignore')

#Scale the data
scaler = StandardScaler()

In [17]:
cat_pipe = Pipeline([
            ('cat_imputer', cat_imputer),
            ('onehot', ohe)
        ])

In [18]:
cat_pipe

In [19]:
# Create a ColumnTransformer with a Pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num_imputer',num_imputer,numerical_cols),
        ('cat_imputer', cat_pipe, categorical_cols)
    ],
    remainder='passthrough'
)

In [20]:
#scaling
transform_toarray = Pipeline(
    [
        ('preprocess_train', preprocessor),
        ('scaling_train' , scaler)
        
    ])

In [21]:
transform_toarray

In [22]:
#Transform the data into an array
transform_toarray.fit(X)
Transformed_train = transform_toarray.transform(X)
Transformed_train

array([[ 1.60694537,  0.25641067,  0.01102379, ..., -0.2059165 ,
        -0.18595013,  0.47249062],
       [ 0.28852415, -0.43789615, -0.41611513, ..., -0.2059165 ,
        -0.18595013,  0.47249062],
       [-0.74737823, -0.44676384, -0.70734621, ..., -0.2059165 ,
        -0.18595013,  0.47249062],
       ...,
       [ 2.92536659,  1.42957244,  3.37377201, ..., -0.2059165 ,
         5.37778601, -2.11644415],
       [ 1.51277243, -0.22802763,  0.97014482, ..., -0.2059165 ,
        -0.18595013,  0.47249062],
       [-0.37068646,  0.52835298,  0.3993319 , ...,  4.85633741,
        -0.18595013, -2.11644415]])

In [23]:
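# Save the objects after fitting
# Note: ColumnTransformer fits clones of the transformers it is given, so the
# standalone num_imputer, cat_imputer and ohe objects below are saved unfitted;
# the fitted state lives in transform_toarray (which also fits scaler in place).
transformation_objects = {
    'num_imputer' : num_imputer,
    'cat_imputer' : cat_imputer,
    'ohe' : ohe,
    'scaler' : scaler,
    'transform_toarray' : transform_toarray
}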

for name , obj in transformation_objects.items():
    path = 'Transformation_objects/'+name+'.joblib'
    joblib.dump(obj , path)
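When saving individual transformers, it helps to know which objects actually hold fitted state. `ColumnTransformer` fits clones of the transformers passed to it, whereas a `Pipeline` without caching fits its steps in place. The toy sketch below (made-up data, hypothetical names) shows the difference and how the fitted clone can be reached through `named_transformers_`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numerical data with one missing value (made up for illustration).
toy = pd.DataFrame({
    "age": [30.0, None, 52.0, 38.0],
    "balance": [100.0, 2500.0, -20.0, 640.0],
})

num_imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
preprocessor = ColumnTransformer([("num", num_imputer, ["age", "balance"])])
pipe = Pipeline([("preprocess", preprocessor), ("scaling", scaler)])
pipe.fit(toy)

# The object handed to ColumnTransformer stays unfitted ...
original_is_fitted = hasattr(num_imputer, "statistics_")
# ... while the fitted clone is stored on the ColumnTransformer itself.
fitted_imputer = pipe.named_steps["preprocess"].named_transformers_["num"]
clone_is_fitted = hasattr(fitted_imputer, "statistics_")
# The scaler is a direct Pipeline step and is fitted in place.
scaler_is_fitted = hasattr(scaler, "mean_")
print(original_is_fitted, clone_is_fitted, scaler_is_fitted)
```

In practice this means the saved `transform_toarray` pipeline is the object to reload for inference.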


In [24]:
#transformed_train shape
Transformed_train.shape

(45210, 40)

In [25]:
#encode y_train as 1 (yes) / 0 (no)
y = y.map({'yes' :1 , 'no' : 0})
y.value_counts()

y
0    39921
1     5289
Name: count, dtype: int64

**3.Deal with imbalanced data**

In [26]:
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(Transformed_train, y)
unique, counts = np.unique(y_train_resampled, return_counts=True)
print(dict(zip(unique, counts)))

{0: 39921, 1: 39921}
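SMOTE balances the classes by creating synthetic minority rows rather than duplicating existing ones: each new row lies on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea for a binary target, written with NumPy and scikit-learn. It is not imbalanced-learn's implementation, and the function name `smote_like_oversample` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X, y, minority_label=1, k=3, seed=42):
    """Add interpolated minority rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(minority))
    if n_needed <= 0:
        return X, y
    # Ask for k + 1 neighbours: the closest one is normally the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    neighbour_idx = nn.kneighbors(minority, return_distance=False)[:, 1:]
    # Pick a random minority row and one of its k neighbours for each new row.
    base = rng.integers(0, len(minority), size=n_needed)
    chosen = neighbour_idx[base, rng.integers(0, k, size=n_needed)]
    # Interpolate at a random position between the row and its neighbour.
    gap = rng.random((n_needed, 1))
    synthetic = minority[base] + gap * (minority[chosen] - minority[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_needed, minority_label, dtype=y.dtype)])
    return X_out, y_out


# Toy imbalanced data: 40 majority rows and 8 minority rows (made up).
toy_rng = np.random.default_rng(0)
X_toy = np.vstack([toy_rng.normal(0, 1, (40, 2)), toy_rng.normal(3, 1, (8, 2))])
y_toy = np.array([0] * 40 + [1] * 8)
X_bal, y_bal = smote_like_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

As in the cell above, resampling is applied to the training data only, so the test set keeps its original class distribution.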


#### 5.Prepare the test data

##### 1.import test data

In [27]:
#import test data
test = pd.read_csv('data/test.csv',header = 0,sep =';')
test.head()  

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


In [28]:
#check shape
test.shape

(4521, 17)

##### 2.Separate dependent (y) variable

In [29]:
#split into X and y variable
X_test= test.drop('y',axis=1)
y_test= test['y']

print(X_test.shape)
print(y_test.shape)

(4521, 16)
(4521,)


##### 3.Feature selection

In [30]:
#feature selection
X_test.drop(columns=['default','day'],inplace= True)
X_test.columns

Index(['age', 'job', 'marital', 'education', 'balance', 'housing', 'loan',
       'contact', 'month', 'duration', 'campaign', 'pdays', 'previous',
       'poutcome'],
      dtype='object')

In [31]:
#Transform test data to array
Transformed_test = transform_toarray.transform(X_test)
Transformed_test

array([[-1.02989707e+00,  1.39488613e-01, -6.95696964e-01, ...,
        -2.05916499e-01, -1.85950128e-01,  4.72490616e-01],
       [-7.47378232e-01,  1.12544371e+00, -1.48182531e-01, ...,
        -2.05916499e-01, -1.85950128e-01, -2.11644415e+00],
       [-5.59032344e-01, -4.03649729e-03, -2.84090369e-01, ...,
        -2.05916499e-01, -1.85950128e-01, -2.11644415e+00],
       ...,
       [ 1.51277243e+00, -3.50533044e-01, -4.16115126e-01, ...,
        -2.05916499e-01, -1.85950128e-01,  4.72490616e-01],
       [-1.21824295e+00, -7.39926721e-02, -5.01542910e-01, ...,
         4.85633741e+00, -1.85950128e-01, -2.11644415e+00],
       [ 2.88524155e-01, -7.43211048e-02,  3.37202604e-01, ...,
         4.85633741e+00, -1.85950128e-01, -2.11644415e+00]])

In [32]:
Transformed_test.shape

(4521, 40)

In [33]:
#encode y_test as 1 (yes) / 0 (no)
y_test = y_test.map({'yes' :1 , 'no' : 0})
y_test.value_counts()

y
0    4000
1     521
Name: count, dtype: int64

#### 6.Model testing

**1.Base models**

In [34]:
models = [
    ('Naïve Bayes', GaussianNB()),
    ('KNN', KNeighborsClassifier(n_neighbors=50)),
    ('Tree' ,DecisionTreeClassifier(random_state=42 , min_samples_split=10)),
    ('SVM', SVC(gamma=0.01 , C = 1 , kernel='rbf'))
]

for name, model in models:
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(Transformed_test)
    
    print(name)
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print()


Naïve Bayes
Confusion Matrix:
[[3551  449]
 [ 260  261]]
Accuracy Score: 0.8431762884317628
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.89      0.91      4000
           1       0.37      0.50      0.42       521

    accuracy                           0.84      4521
   macro avg       0.65      0.69      0.67      4521
weighted avg       0.87      0.84      0.85      4521


KNN
Confusion Matrix:
[[3247  753]
 [  69  452]]
Accuracy Score: 0.8181818181818182
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.81      0.89      4000
           1       0.38      0.87      0.52       521

    accuracy                           0.82      4521
   macro avg       0.68      0.84      0.71      4521
weighted avg       0.91      0.82      0.85      4521


Tree
Confusion Matrix:
[[3893  107]
 [  78  443]]
Accuracy Score: 0.9590798495907985
Classification Report:
              

**2.Ensemble models**

In [35]:
models = [
    ('rf',RandomForestClassifier(n_estimators=100,random_state=42)),
    ('gradient',GradientBoostingClassifier(n_estimators=100,random_state=42)),
    ('adaboost',AdaBoostClassifier(n_estimators=100, random_state=42)),
    ('XGB',XGBClassifier(n_estimators=100, random_state=42))
]

for name, model in models:
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(Transformed_test)
    
    print(name)
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print()


rf
Confusion Matrix:
[[4000    0]
 [   0  521]]
Accuracy Score: 1.0
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4000
           1       1.00      1.00      1.00       521

    accuracy                           1.00      4521
   macro avg       1.00      1.00      1.00      4521
weighted avg       1.00      1.00      1.00      4521


gradient
Confusion Matrix:
[[3521  479]
 [ 138  383]]
Accuracy Score: 0.8635257686352577
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.88      0.92      4000
           1       0.44      0.74      0.55       521

    accuracy                           0.86      4521
   macro avg       0.70      0.81      0.74      4521
weighted avg       0.90      0.86      0.88      4521


adaboost
Confusion Matrix:
[[3627  373]
 [ 187  334]]
Accuracy Score: 0.876133598761336
Classification Report:
              precision    rec
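The per-model reports are easier to compare when condensed into one table. The sketch below collects accuracy, recall, and precision for the positive class into a DataFrame. It runs on synthetic data from `make_classification` with two of the base models, so the numbers it prints are unrelated to the results in this notebook, and the column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real train/test arrays.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

candidates = [
    ("Naive Bayes", GaussianNB()),
    ("Tree", DecisionTreeClassifier(random_state=42, min_samples_split=10)),
]

rows = []
for name, model in candidates:
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_hat),
        "recall_yes": recall_score(y_te, y_hat, pos_label=1),
        "precision_yes": precision_score(y_te, y_hat, pos_label=1, zero_division=0),
    })

# One row per model, sorted by recall on the positive ('yes') class.
summary = pd.DataFrame(rows).set_index("model").sort_values("recall_yes", ascending=False)
print(summary.round(3))
```

Sorting by recall on the positive class reflects the goal of not missing likely subscribers; a different business cost would suggest a different sort key.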

**Conclusion:**
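The model with balanced recall as well as the highest accuracy for both classes is:
1. Random Forest

Random Forest reached 100% accuracy on the test set, so we choose it for future predictions, and no hyperparameter tuning was done. A perfect score on held-out data is unusual, so it is worth confirming that the test rows do not also appear in the training data before relying on this result.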
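A quick sanity check for a perfect test score is to count how many test rows also occur, value for value, in the training data. The sketch below shows one way to do this with a pandas merge; the helper `count_shared_rows` and the toy frames are hypothetical, and the same call could be applied to the `data` and `test` frames loaded earlier.

```python
import pandas as pd


def count_shared_rows(train: pd.DataFrame, test: pd.DataFrame) -> int:
    """Count test rows whose full set of column values also occurs in train."""
    cols = [c for c in test.columns if c in train.columns]
    # Deduplicate train so the left merge keeps exactly one row per test row.
    train_unique = train[cols].drop_duplicates()
    merged = test[cols].merge(train_unique, on=cols, how="left", indicator=True)
    return int((merged["_merge"] == "both").sum())


# Toy frames (made-up rows): one test row is an exact copy of a train row.
toy_train = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "job": ["management", "technician", "entrepreneur", "blue-collar"],
    "y": ["no", "no", "no", "no"],
})
toy_test = pd.DataFrame({
    "age": [44, 30],
    "job": ["technician", "unemployed"],
    "y": ["no", "no"],
})
shared = count_shared_rows(toy_train, toy_test)
print(f"{shared} of {len(toy_test)} test rows also appear in train")
```

If a large share of test rows overlaps with train, models that can memorize training rows, such as deep trees and random forests, will score far higher on that test set than on truly unseen data.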

In [36]:
# fit the final model on training data
model = RandomForestClassifier(n_estimators=100,random_state=42)
model.fit(X_train_resampled, y_train_resampled)

y_pred = model.predict(Transformed_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
print()

Confusion Matrix:
[[4000    0]
 [   0  521]]
Accuracy Score: 1.0
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4000
           1       1.00      1.00      1.00       521

    accuracy                           1.00      4521
   macro avg       1.00      1.00      1.00      4521
weighted avg       1.00      1.00      1.00      4521




#### 8.Dump the model into a file

In [37]:
joblib.dump(model , 'Model/RandomForest.joblib')

['Model/RandomForest.joblib']

#### 9.Make an inference pipeline for test data and dump it into a file

In [40]:
#load the model
final_model = joblib.load('Model/RandomForest.joblib')

In [41]:
#pipeline to predict the output
test_pipline = Pipeline(
    [
        ('array_transformation',transform_toarray),
        ('model_fitting' , final_model)
    ]
)

In [42]:
test_pipline

In [43]:
joblib.dump(test_pipline , 'Transformation_objects/test_pipline.joblib')

['Transformation_objects/test_pipline.joblib']
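Once saved, a pipeline of this kind can be reloaded and applied directly to raw rows, since it bundles preprocessing and the model. The sketch below demonstrates the round trip on a small toy pipeline written to a temporary directory. The data, column subset, and file name are assumptions, and unlike the notebook's pipeline, the toy one is fitted end to end in a single `fit` call.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (made-up values) with one numerical and one categorical column.
toy = pd.DataFrame({
    "age": [30, 45, 52, 38, 27, 61],
    "housing": ["yes", "no", "yes", "yes", "no", "no"],
})
labels = [0, 1, 0, 0, 1, 1]

toy_pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [
            ("num", StandardScaler(), ["age"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["housing"]),
        ],
        sparse_threshold=0,  # dense output
    )),
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
])
toy_pipeline.fit(toy, labels)

# Save and reload, as done above with joblib.dump / joblib.load.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# Raw rows must carry the same column names the pipeline was fitted on.
new_clients = pd.DataFrame({"age": [33, 58], "housing": ["yes", "no"]})
predictions = restored.predict(new_clients)
print(predictions)
```

For the real pipeline, the input frame needs the same 14 columns that remained after dropping `default` and `day`.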

# End