## Name: Deepak Rajbhar
## Roll no: 12
## M.Sc (Applied Mathematics)

## Project Name: Predictive Modeling for Customer Churn

### Objective

The objective of this project is to build a predictive model that can predict customer churn for a given company. We will use machine learning techniques to build the model and document the process, including feature selection, model evaluation, and performance metrics.

## Instructions:

1. Obtain a dataset of customer information, including demographic information, customer behavior, and whether or not the customer has churned.
2. Perform data cleaning and preprocessing on the dataset, including handling missing data and converting categorical variables to numerical variables.
3. Explore the data and perform feature selection to choose the relevant features that will be used in the model.
4. Build a predictive model using machine learning algorithms such as Logistic Regression, Random Forest, or Gradient Boosting.
5. Train the model on a portion of the data and use the remaining data to evaluate its performance.
6. Evaluate the model performance using metrics such as accuracy, precision, recall, F1-score and AUC-ROC.
7. Fine-tune the model by trying different parameters or techniques to improve performance.
8. Create a report detailing the process and results, including the feature selection process, the model used, the evaluation metrics, and the performance of the final model.
9. Provide a brief note on the limitations of the model and possible future work.

## Data Information 

The data is related to direct marketing campaigns of Portuguese banking institutions.  
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed or not.  
The classification goal is to predict whether the client will subscribe to a term deposit.

Attribute Information

1. age (numeric)
2. job: type of job (categorical: admin., unknown, unemployed, management, housemaid, entrepreneur, student, blue-collar, self-employed, retired, technician, services)
3. marital: marital status (categorical: married, divorced, single; note: divorced means divorced or widowed)
4. education (categorical: unknown, primary, secondary, tertiary)
5. default: has credit in default? (binary: yes or no)
6. balance: average yearly balance, in euros (numeric)
7. housing: has housing loan? (binary: yes or no)
8. loan: has personal loan? (binary: yes or no)

## Other attributes:

1. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
2. pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; -1 means the client was not previously contacted)
3. previous: number of contacts performed before this campaign and for this client (numeric)
4. poutcome: outcome of the previous marketing campaign (categorical: unknown, other, failure, success)

## Problem Statement 

The classification goal is to predict whether or not the client will subscribe to a term deposit.

### Import libraries 

In [None]:
# import libraries
import os
import json
import pickle
import numpy as np
from sklearn.pipeline import Pipeline

# libraries for data structuring, analysis and visualization
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# chi2 is imported from sklearn.feature_selection below; importing scipy.stats.chi2
# here as well would only be shadowed by that later import
from scipy.stats import chi2_contingency

# suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# encoders, scalers and imputers
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
# train/test split, cross-validation and randomized search
from sklearn.model_selection import train_test_split, KFold, cross_val_score, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import PowerTransformer, KBinsDiscretizer
# models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
#from xgboost import XGBClassifier as xgb
from sklearn.feature_selection import SelectKBest, chi2
# performance metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
from sklearn.metrics import classification_report

from sklearn.metrics import confusion_matrix

# SMOTEENN can resample (balance) the data; left disabled here
#from imblearn.combine import SMOTEENN

# display all columns when printing DataFrames
pd.set_option("display.max_columns", None)


### Getting the Data 

In [None]:
data=pd.read_csv('bank.csv',delimiter =';')
data.head()

### Dropping unnecessary columns 

In [3]:
data=data.drop(['contact', 'day', 'poutcome','month'], axis=1)
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,duration,campaign,pdays,previous,y
0,30,unemployed,married,primary,no,1787,no,no,79,1,-1,0,no
1,33,services,married,secondary,no,4789,yes,yes,220,1,339,4,no
2,35,management,single,tertiary,no,1350,yes,no,185,1,330,1,no
3,30,management,married,tertiary,no,1476,yes,yes,199,4,-1,0,no
4,59,blue-collar,married,secondary,no,0,yes,no,226,1,-1,0,no


### Replacing "unknown" values with NaN

In [4]:
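# Replace the placeholder string "unknown" with NaN in every column.
# np.where returns float arrays for the numeric columns, which is why
# integers such as 30 are shown as 30.0 from here on.
for i in data.columns:
    data[i]=np.where(data[i]=="unknown",np.nan,data[i])
# Count the NaN values in each column
data.isnull().sum()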

age            0
job           38
marital        0
education    187
default        0
balance        0
housing        0
loan           0
duration       0
campaign       0
pdays          0
previous       0
y              0
dtype: int64
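
The loop above works, but it also turns the integer columns into floats. As a hedged alternative, `DataFrame.replace` only touches the cells that match, so numeric dtypes are preserved. The sketch below uses a small toy frame (the values are illustrative, not taken from `bank.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data (illustrative values only)
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "job": ["unemployed", "unknown", "management"],
    "education": ["primary", "secondary", "unknown"],
})

# replace() changes only the cells equal to "unknown"
cleaned = toy.replace("unknown", np.nan)

# For comparison: np.where mixes NaN with integers and therefore returns floats
via_where = np.where(toy["age"] == "unknown", np.nan, toy["age"])

print(cleaned.dtypes)
print(cleaned.isnull().sum())
print(via_where.dtype)
```

Either approach yields the same missing-value counts; the difference is only in the dtypes of the untouched columns.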

In [5]:
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,duration,campaign,pdays,previous,y
0,30.0,unemployed,married,primary,no,1787.0,no,no,79.0,1.0,-1.0,0.0,no
1,33.0,services,married,secondary,no,4789.0,yes,yes,220.0,1.0,339.0,4.0,no
2,35.0,management,single,tertiary,no,1350.0,yes,no,185.0,1.0,330.0,1.0,no
3,30.0,management,married,tertiary,no,1476.0,yes,yes,199.0,4.0,-1.0,0.0,no
4,59.0,blue-collar,married,secondary,no,0.0,yes,no,226.0,1.0,-1.0,0.0,no


### Splitting the data into training and testing sets

In [6]:
# Hold out 20% of the rows for testing; y is passed as a 1-D Series
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('y', axis=1), data['y'], test_size=0.2, random_state=42)
X_train.shape, X_test.shape 

((3616, 12), (905, 12))
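
The split above is purely random. Because the target is imbalanced (far more "no" than "yes"), one option is to pass `stratify` so that both parts keep the same class ratio. This is a sketch on synthetic data, not the split used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 10 positives out of 100 rows (illustrative only)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the 10% positive rate in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

Using `stratify` on the real data would change which rows land in the test set, so the numbers reported below would shift slightly.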

### Encoding the training and testing target variables with LabelEncoder

In [7]:
y_train_labeled=LabelEncoder().fit_transform(y_train)
y_train_labeled

array([0, 0, 0, ..., 0, 0, 0])

In [8]:
y_test_labeled=LabelEncoder().fit_transform(y_test) 
y_test_labeled

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
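
The two cells above fit a separate `LabelEncoder` on the training and test targets. That gives the same mapping here because both contain "no" and "yes", but a safer pattern is to fit once on the training labels and reuse the fitted encoder. A minimal sketch with toy labels (the variable names are hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for y_train and y_test
y_train_toy = ["no", "yes", "no", "no"]
y_test_toy = ["yes", "no"]

encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train_toy)  # learns the mapping from the training labels
y_test_enc = encoder.transform(y_test_toy)        # reuses the same mapping

print(encoder.classes_)
print(y_train_enc, y_test_enc)
```

Keeping the fitted encoder also allows `encoder.inverse_transform` to turn predictions back into "no"/"yes".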

#### 1st ColumnTransformer: imputing missing values with SimpleImputer (most frequent)

In [9]:
# Imputation: fill missing job (column 1) and education (column 3) with the most frequent value
trf1 = ColumnTransformer(
    [('impute_job', SimpleImputer(strategy='most_frequent'), [1]),
     ('impute_edu', SimpleImputer(strategy='most_frequent'), [3])],
    remainder='passthrough')


In [10]:
X_train_impute=pd.DataFrame(trf1.fit_transform(X_train)) 
X_train_impute

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,entrepreneur,tertiary,34.0,divorced,no,262.0,no,no,371.0,1.0,-1.0,0.0
1,management,tertiary,32.0,married,no,2349.0,no,no,134.0,5.0,-1.0,0.0
2,technician,secondary,34.0,single,no,1076.0,no,no,70.0,2.0,-1.0,0.0
3,management,tertiary,31.0,married,no,156.0,no,no,657.0,7.0,-1.0,0.0
4,blue-collar,primary,46.0,married,no,258.0,yes,no,217.0,1.0,-1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
3611,admin.,tertiary,41.0,married,no,1536.0,no,no,54.0,2.0,-1.0,0.0
3612,self-employed,secondary,34.0,married,no,-370.0,yes,no,748.0,1.0,-1.0,0.0
3613,management,tertiary,46.0,married,no,523.0,yes,no,105.0,4.0,366.0,2.0
3614,management,tertiary,47.0,single,no,440.0,no,no,71.0,4.0,-1.0,0.0


### 2nd ColumnTransformer: ordinal encoding for ordered categorical values

In [11]:
# Ordinal encoding of marital, education, default, housing and loan.
# The indices refer to column positions in the output of trf1.
trf2 = ColumnTransformer(
    [('ord_martial_education_default_housing_loan',
      OrdinalEncoder(categories=[['single', 'married', 'divorced'],
                                 ['primary', 'secondary', 'tertiary'],
                                 ['no', 'yes'], ['no', 'yes'], ['no', 'yes']],
                     dtype=int),
      [3, 1, 4, 6, 7])],
    remainder='passthrough')

In [12]:
X_train_ord=pd.DataFrame(trf2.fit_transform(X_train_impute)) 
X_train_ord

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,2,2,0,0,0,entrepreneur,34.0,262.0,371.0,1.0,-1.0,0.0
1,1,2,0,0,0,management,32.0,2349.0,134.0,5.0,-1.0,0.0
2,0,1,0,0,0,technician,34.0,1076.0,70.0,2.0,-1.0,0.0
3,1,2,0,0,0,management,31.0,156.0,657.0,7.0,-1.0,0.0
4,1,0,0,1,0,blue-collar,46.0,258.0,217.0,1.0,-1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
3611,1,2,0,0,0,admin.,41.0,1536.0,54.0,2.0,-1.0,0.0
3612,1,1,0,1,0,self-employed,34.0,-370.0,748.0,1.0,-1.0,0.0
3613,1,2,0,1,0,management,46.0,523.0,105.0,4.0,366.0,2.0
3614,0,2,0,0,0,management,47.0,440.0,71.0,4.0,-1.0,0.0
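
Passing `categories` explicitly fixes the integer assigned to each level instead of relying on alphabetical order. A small sketch for the education column alone; it also shows that a value outside the listed categories is rejected by default:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

education = np.array([["tertiary"], ["primary"], ["secondary"]], dtype=object)

# The position in the list defines the code: primary=0, secondary=1, tertiary=2
encoder = OrdinalEncoder(categories=[["primary", "secondary", "tertiary"]], dtype=int)
codes = encoder.fit_transform(education).ravel()
print(codes)

# With the default handle_unknown="error", unseen values raise a ValueError
try:
    encoder.transform(np.array([["unknown"]], dtype=object))
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)
```

The ordering chosen for marital status (single, married, divorced) is a modelling assumption; unlike education it has no natural order, so one-hot encoding would be an equally reasonable choice.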


### 3rd ColumnTransformer: one-hot encoding for nominal (unordered) values

In [13]:
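# One-hot encoding of job (column 5 in the output of trf2).
# scikit-learn >= 1.2 uses sparse_output; older releases use sparse=False instead.
trf3 = ColumnTransformer(
    [('ohe_job',
      OneHotEncoder(sparse_output=False, handle_unknown='ignore', dtype=int, drop='first'),
      [5])],
    remainder='passthrough')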

In [14]:
X_train_ohe=pd.DataFrame(trf3.fit_transform(X_train_ord)) 
X_train_ohe

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
0,0,1,0,0,0,0,0,0,0,0,2,2,0,0,0,34.0,262.0,371.0,1.0,-1.0,0.0
1,0,0,0,1,0,0,0,0,0,0,1,2,0,0,0,32.0,2349.0,134.0,5.0,-1.0,0.0
2,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,34.0,1076.0,70.0,2.0,-1.0,0.0
3,0,0,0,1,0,0,0,0,0,0,1,2,0,0,0,31.0,156.0,657.0,7.0,-1.0,0.0
4,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,46.0,258.0,217.0,1.0,-1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3611,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,41.0,1536.0,54.0,2.0,-1.0,0.0
3612,0,0,0,0,0,1,0,0,0,0,1,1,0,1,0,34.0,-370.0,748.0,1.0,-1.0,0.0
3613,0,0,0,1,0,0,0,0,0,0,1,2,0,1,0,46.0,523.0,105.0,4.0,366.0,2.0
3614,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,47.0,440.0,71.0,4.0,-1.0,0.0
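
The job column expands into 10 indicator columns above because `drop='first'` removes the first of its 11 categories. A toy sketch of that behaviour follows; the `try`/`except` is only there so the example runs on both older and newer scikit-learn releases:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

jobs = np.array([["admin."], ["services"], ["technician"], ["admin."]], dtype=object)

try:
    ohe = OneHotEncoder(sparse_output=False, drop="first", dtype=int)  # scikit-learn >= 1.2
except TypeError:
    ohe = OneHotEncoder(sparse=False, drop="first", dtype=int)  # older releases

encoded = ohe.fit_transform(jobs)

# Categories are sorted; the first one ("admin.") is dropped and becomes the all-zero row
print(ohe.categories_[0])
print(encoded)
```

Three categories give two columns, and rows with the dropped category are encoded as all zeros.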


### 4th ColumnTransformer: power transformation to scale the numeric data

In [15]:
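# Power transformation of the six numeric columns
# (age, balance, duration, campaign, pdays, previous), now at positions 15-20
trf4=ColumnTransformer([('Power_transform', PowerTransformer(),[15,16,17,18,19,20])], remainder='passthrough' )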

In [16]:
X_train_power=pd.DataFrame(trf4.fit_transform(X_train_ohe)) 
X_train_power

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
0,-0.621854,-0.309759,0.767816,-1.1138,-0.468806,-0.468846,0,1,0,0,0,0,0,0,0,0,2,2,0,0,0
1,-0.870418,0.496341,-0.371197,1.299566,-0.468806,-0.468846,0,0,0,1,0,0,0,0,0,0,1,2,0,0,0
2,-0.621854,0.038189,-1.011429,0.117351,-0.468806,-0.468846,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
3,-1.001665,-0.364477,1.486026,1.583323,-0.468806,-0.468846,0,0,0,1,0,0,0,0,0,0,1,2,0,0,0
4,0.577261,-0.311743,0.146531,-1.1138,-0.468806,-0.468846,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3611,0.128687,0.211024,-1.249767,0.117351,-0.468806,-0.468846,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0
3612,-0.621854,-1.609322,1.657575,-1.1138,-0.468806,-0.468846,0,0,0,0,0,1,0,0,0,0,1,1,0,1,0
3613,0.577261,-0.188705,-0.619153,1.068456,2.157275,2.16323,0,0,0,1,0,0,0,0,0,0,1,2,0,1,0
3614,0.660027,-0.225732,-0.998118,1.068456,-0.468806,-0.468846,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0
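
`PowerTransformer()` defaults to the Yeo-Johnson method, which accepts negative values (such as negative balances), and it standardizes the result to zero mean and unit variance. A short sketch on synthetic skewed data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Right-skewed synthetic column that also contains negative values
skewed = rng.exponential(scale=2.0, size=(500, 1)) - 1.0

pt = PowerTransformer()  # defaults: method="yeo-johnson", standardize=True
transformed = pt.fit_transform(skewed)

print(pt.lambdas_)                      # fitted power parameter for the column
print(transformed.mean(), transformed.std())
```

The transformed columns appear first in the output above for the same reason as before: the `ColumnTransformer` puts transformed columns ahead of the passthrough ones.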


In [17]:
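# Feature selection (5th step). trf5 is defined here but is not included in
# the pipelines below; chi2 also requires non-negative inputs, which the
# power-transformed columns do not satisfy.
trf5=SelectKBest(score_func=chi2,k=10)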
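
If feature selection were added after the power transformation, `chi2` would fail because the standardized columns contain negative values. One possible substitute is `f_classif` (an ANOVA F-test), which has no such restriction. This is a sketch on synthetic data, not part of the pipelines trained below:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))  # standardized-like data with negative values
y_toy = (X_toy[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# chi2 rejects negative inputs
try:
    SelectKBest(score_func=chi2, k=3).fit(X_toy, y_toy)
    chi2_failed = False
except ValueError:
    chi2_failed = True

# f_classif works on the same data
selector = SelectKBest(score_func=f_classif, k=3).fit(X_toy, y_toy)
X_selected = selector.transform(X_toy)

print(chi2_failed)
print(selector.get_support())
print(X_selected.shape)
```

Another option would be to run `chi2` on non-negative features only, for example the one-hot and ordinal columns before scaling.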

### 6th step: Random Forest classifier model

In [18]:
# Random forest estimator; it is fitted later as the last pipeline step
trf6=RandomForestClassifier() 

### 7th step: LogisticRegression model

In [19]:
trf7=LogisticRegression() 

### Building the pipeline that runs all transformations together and trains the Random Forest model

In [None]:
pipe=Pipeline([('trf1', trf1),('trf2', trf2), ('trf3', trf3), ('trf4', trf4),
               ('trf6', trf6) ])


### Building pipeline 1 that runs all transformations together and trains the LogisticRegression model

In [21]:
pipe1=Pipeline([('trf1', trf1),('trf2', trf2), ('trf3', trf3), ('trf4', trf4),
               ('trf7', trf7) ])
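
The chained transformers above rely on positional indices that shift after every step. As a hedged alternative (not the approach used in this notebook), the same kind of preprocessing can be expressed as a single `ColumnTransformer` keyed by column names, with a small sub-pipeline per column group. The toy frame and the reduced column set below are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, PowerTransformer

# Toy data standing in for the bank dataset (illustrative values only)
toy = pd.DataFrame({
    "age": [30.0, 33.0, 35.0, 59.0, 41.0, 47.0, 26.0, 52.0],
    "balance": [1787.0, 4789.0, 1350.0, 0.0, -370.0, 523.0, 156.0, 2349.0],
    "job": pd.Series(["services", np.nan, "management", "services",
                      "technician", "management", "services", "technician"],
                     dtype=object),
    "education": pd.Series(["primary", "secondary", "tertiary", np.nan,
                            "secondary", "tertiary", "secondary", "primary"],
                           dtype=object),
})
y_toy = np.array([0, 1, 0, 0, 1, 0, 0, 1])

# Impute then one-hot encode the nominal column
job_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
# Impute then ordinal-encode the ordered column
edu_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ord", OrdinalEncoder(categories=[["primary", "secondary", "tertiary"]])),
])

preprocess = ColumnTransformer([
    ("job", job_steps, ["job"]),
    ("education", edu_steps, ["education"]),
    ("numeric", PowerTransformer(), ["age", "balance"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(toy, y_toy)
preds = model.predict(toy)
print(preds)
```

Because columns are selected by name, reordering or adding columns in the input frame does not require renumbering any indices.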


In [22]:
pipe.fit(X_train,y_train_labeled ) 

Pipeline(steps=[('trf1',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('impute_job',
                                                  SimpleImputer(strategy='most_frequent'),
                                                  [1]),
                                                 ('impute_edu',
                                                  SimpleImputer(strategy='most_frequent'),
                                                  [3])])),
                ('trf2',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ord_martial_education_default_housing_loan',
                                                  OrdinalEncoder(categories=[['single',
                                                                              'm...
                                                                 dtype=<class 'int'>),
                                                  

In [23]:
pipe1.fit(X_train,y_train_labeled) 

Pipeline(steps=[('trf1',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('impute_job',
                                                  SimpleImputer(strategy='most_frequent'),
                                                  [1]),
                                                 ('impute_edu',
                                                  SimpleImputer(strategy='most_frequent'),
                                                  [3])])),
                ('trf2',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ord_martial_education_default_housing_loan',
                                                  OrdinalEncoder(categories=[['single',
                                                                              'm...
                                                                 dtype=<class 'int'>),
                                                  

In [24]:
pred_rf=pipe.predict(X_test) 
pred_rf

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [25]:
print(classification_report(y_test_labeled,pred_rf,target_names=['no', 'yes'])) 

              precision    recall  f1-score   support

          no       0.91      0.97      0.94       807
         yes       0.45      0.17      0.25        98

    accuracy                           0.89       905
   macro avg       0.68      0.57      0.59       905
weighted avg       0.86      0.89      0.86       905



In [26]:
confusion_matrix(y_test_labeled, pred_rf, labels=[0, 1]) 

array([[786,  21],
       [ 81,  17]], dtype=int64)
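
The figures in the classification report can be re-derived from this confusion matrix, where rows are the true classes and columns the predicted classes in the order `[0, 1]`. A minimal sketch in plain Python using the four counts printed above:

```python
# Counts from the random forest confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 786, 21, 81, 17
total = tn + fp + fn + tp

precision = tp / (tp + fp)          # share of predicted "yes" that are correct
recall = tp / (tp + fn)             # share of actual "yes" that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total
baseline = (tn + fp) / total        # accuracy of always predicting "no"

print(round(precision, 2), round(recall, 2), round(f1, 2))
print(round(accuracy, 3), round(baseline, 3))
```

The baseline line is worth noting: always predicting "no" would score about 0.892 on this test set, which is slightly above the random forest's accuracy, so accuracy alone says little about how well the "yes" class is detected.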

In [27]:
pred_lr=pipe1.predict(X_test) 
pred_lr

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [28]:
print(classification_report(y_test_labeled,pred_lr,target_names=['no', 'yes'])) 

              precision    recall  f1-score   support

          no       0.91      0.98      0.94       807
         yes       0.57      0.20      0.30        98

    accuracy                           0.90       905
   macro avg       0.74      0.59      0.62       905
weighted avg       0.87      0.90      0.87       905



In [29]:
confusion_matrix(y_test_labeled, pred_lr, labels=[0, 1]) 

array([[792,  15],
       [ 78,  20]], dtype=int64)
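
Recall for the "yes" class is low for both models at the default 0.5 decision threshold. `roc_auc_score` is imported at the top but not used; it evaluates the ranking of predicted probabilities independently of any threshold. The sketch below uses synthetic imbalanced data from `make_classification` (an assumption, not the bank data) to show the AUC and the effect of lowering the threshold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_yes = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

auc = roc_auc_score(y_te, proba_yes)

# Lowering the threshold flags more rows as positive, trading precision for recall
pred_default = (proba_yes >= 0.5).astype(int)
pred_lower = (proba_yes >= 0.3).astype(int)

print(auc)
print(recall_score(y_te, pred_default), recall_score(y_te, pred_lower))
```

With the fitted pipelines, the equivalent call would be `pipe1.predict_proba(X_test)[:, 1]` passed to `roc_auc_score` together with `y_test_labeled`.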

In [30]:
X_test.head(2)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,duration,campaign,pdays,previous
2398,51.0,entrepreneur,married,secondary,yes,-2082.0,no,yes,123.0,6.0,-1.0,0.0
800,50.0,management,married,tertiary,no,2881.0,no,no,510.0,2.0,2.0,5.0


In [31]:
# pipe is the random forest pipeline (hence "without_logistic" in the file name)
with open('churn_prediction_without_logistic_model.pkl', 'wb') as file:
    pickle.dump(pipe, file)

In [32]:
# pipe1 is the logistic regression pipeline (hence "without_randomforest" in the file name)
with open('churn_prediction_without_randomforest_model.pkl', 'wb') as file:
    pickle.dump(pipe1, file)
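
A saved pipeline can be restored with `pickle.load` and used for prediction without refitting. The round trip is sketched below with a small stand-in pipeline and a temporary file (both hypothetical, so the example does not depend on the files written above):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline fitted on toy data
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(path, "wb") as fh:
        pickle.dump(toy_pipe, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored pipeline gives the same predictions as the original
same_predictions = np.array_equal(toy_pipe.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Pickled scikit-learn models should be loaded with the same scikit-learn version that created them, and only from trusted sources, since unpickling can execute arbitrary code.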

In [33]:
test_input=np.array([35.0,'management','single', 'tertiary', 'no', 2500.0,'yes','no',300.0,4.0,0.0,1.0], dtype=object).reshape(1,12)
test_input 

array([[35.0, 'management', 'single', 'tertiary', 'no', 2500.0, 'yes',
        'no', 300.0, 4.0, 0.0, 1.0]], dtype=object)

In [34]:
test_rf_output=pipe.predict(test_input)
test_rf_output

array([0])

In [35]:
test_lr_output=pipe1.predict(test_input)
test_lr_output

array([0])

In [36]:
print('mean score of logistics model  ={}'.format(np.mean(cross_val_score(pipe1,X_train,y_train_labeled, cv=10,scoring='accuracy')))) 

mean score of logistics model  =0.8868903138917371


In [37]:
print('mean score of random forest model  ={}'.format(np.mean(cross_val_score(pipe,X_train,y_train_labeled, cv=10,scoring='accuracy')))) 

mean score of random forest model  =0.8863301755406253
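
Both cross-validated accuracies sit close to the share of "no" labels, so accuracy alone does not separate the models well. `cross_validate` can report several metrics in one run. A sketch on synthetic imbalanced data (an assumption; with the real data the estimator would be `pipe` or `pipe1`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data with roughly 10% positives (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

metrics = ["accuracy", "recall", "f1", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring=metrics
)

# Mean of each metric across the folds
summary = {name: float(scores[f"test_{name}"].mean()) for name in metrics}
print(summary)
```

For a classifier, passing an integer `cv` to `cross_val_score` already uses stratified folds; the explicit `StratifiedKFold` here only adds shuffling with a fixed seed.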


In [38]:
kbins=KBinsDiscretizer(n_bins=7, encode='ordinal',strategy='quantile')
trf_Kbins=ColumnTransformer([('binning', kbins,[0])],remainder ='passthrough')
X_train_kbins=pd.DataFrame(trf_Kbins.fit_transform(X_train))

X_train_kbins.rename(columns={0:'age_trf'} ,inplace=True)
X_train_kbins.head()

Unnamed: 0,age_trf,1,2,3,4,5,6,7,8,9,10,11
0,2.0,entrepreneur,divorced,tertiary,no,262.0,no,no,371.0,1.0,-1.0,0.0
1,1.0,management,married,tertiary,no,2349.0,no,no,134.0,5.0,-1.0,0.0
2,2.0,technician,single,secondary,no,1076.0,no,no,70.0,2.0,-1.0,0.0
3,1.0,management,married,tertiary,no,156.0,no,no,657.0,7.0,-1.0,0.0
4,4.0,blue-collar,married,primary,no,258.0,yes,no,217.0,1.0,-1.0,0.0


In [39]:
print('mean score of logistics  model  ={}'.format(np.mean(cross_val_score(pipe1,X_train_kbins,y_train_labeled, cv=10,scoring='accuracy')))) 

mean score of logistics  model  =0.8860608193936427


In [40]:
print('mean score of random forest model  ={}'.format(np.mean(cross_val_score(pipe,X_train_kbins,y_train_labeled, cv=10,scoring='accuracy')))) 

mean score of random forest model  =0.8857830458670666
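
Step 7 of the instructions calls for fine-tuning, and `RandomizedSearchCV` is imported but not used above. The sketch below shows how pipeline parameters are addressed as `<step name>__<parameter>`; the step name `trf7` mirrors the logistic regression step in `pipe1`, while the data, search ranges and scoring choice are assumptions for illustration:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% positives (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("trf7", LogisticRegression(max_iter=1000)),
])

search = RandomizedSearchCV(
    toy_pipe,
    param_distributions={
        "trf7__C": loguniform(1e-3, 1e2),            # regularization strength
        "trf7__class_weight": [None, "balanced"],    # optional reweighting of the rare class
    },
    n_iter=10,
    scoring="f1",   # focuses on the positive class rather than overall accuracy
    cv=5,
    random_state=42,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Applied to this notebook, the estimator would be `pipe1` (or `pipe` with random forest parameters such as `trf6__n_estimators`), fitted on `X_train` and `y_train_labeled`.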
