#### Problem Description:
* A relatively young bank is growing rapidly in terms of overall customer acquisition. The majority of these customers are liability customers with varying sizes of relationship with the bank. The base of asset customers is quite small, and the bank wants to grow it rapidly to bring in more loan business.

* Specifically, it wants to explore ways of converting its liability customers to Personal Loan customers.

* A campaign the bank ran for liability customers last year showed a healthy conversion rate of over 9%. This has encouraged the Retail Marketing department to devise smarter campaigns with better target marketing.

#### Analytics Objectives :
	
	
* (1) While designing a new campaign, can we model the previous campaign's customer behavior to analyze what combination of parameters makes a customer more likely to accept a personal loan?

* (2) There are several special products / facilities the bank offers, such as CD and securities accounts, online services, credit cards, etc. Can we spot any association among these to find cross-selling opportunities?

#### Data Set Description :
* ID:	Customer ID			
* Age:	Customer's age in completed years			
* Experience:	Number of years of professional experience
* Income:	Annual income of the customer in thousands of Dollars			
* ZIP Code:	Home address ZIP code. Not used as a predictor (the column is dropped before modeling).
* Family:	Family size of the customer			
* CCAvg:	Avg. spending on credit cards per month in thousands of Dollars		
* Education:	Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional			
* Mortgage:	Value of house mortgage if any. (thousands of Dollars)			
* **Personal Loan:	Did this customer accept the personal loan offered in the last campaign?**			
* Securities Account:	Does the customer have a securities account with the bank?			
* CD Account:	Does the customer have a certificate of deposit (CD) account with the bank?			
* Online:	Does the customer use internet banking facilities?			
* CreditCard:	Does the customer use a credit card issued by UniversalBank?			

#### Note:
* While reading the data set, treat the values '?' and ',' as missing (NA).
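As a minimal sketch of what this note means in practice, the snippet below reads a tiny in-memory CSV (a hypothetical three-row sample, not the real `UniversalBank.csv`) and lets `na_values` turn the `?` placeholders into `NaN`. The column names are borrowed from the data set description purely for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for UniversalBank.csv
raw = io.StringIO(
    "ID,Age,Income\n"
    "1,25,49\n"
    "2,?,34\n"
    "3,39,?\n"
)

# Any cell whose text is exactly '?' or ',' is read as NaN
sample = pd.read_csv(raw, na_values=["?", ","])
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Because the placeholders become `NaN`, the affected columns are read as floats rather than strings, which is why `Age`, `Experience` and `Income` show up as `float64` later on.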

#### Experiment :
* Build a Random Forest to predict whether a customer accepts a personal loan or not



In [60]:
import os
import pandas as pd
path = os.getcwd()
os.chdir(path)

In [61]:
import warnings
warnings.filterwarnings('ignore')

####  Check the dimensions and type

In [62]:
bank=pd.read_csv("UniversalBank.csv",na_values=["?",","])
print("The number of Rows in the bank data set  ="+str(bank.shape[0]))
print("The number of Columns in the bank data set =" +str(bank.shape[1]))

The number of Rows in the bank data set  =5000
The number of Columns in the bank data set =14


#### Print the column names and check the data types of the columns (dtypes)

In [63]:
print("The columns in the data set are : \n",list(bank.columns))

The columns in the data set are : 
 ['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Personal Loan', 'Securities Account', 'CD Account', 'Online', 'CreditCard']


In [64]:
print("The data types of the columns are :\n ",bank.dtypes)

The data types of the columns are :
  ID                      int64
Age                   float64
Experience            float64
Income                float64
ZIP Code                int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal Loan           int64
Securities Account      int64
CD Account              int64
Online                  int64
CreditCard              int64
dtype: object


#### Check the first and last 10 rows to get a glance at the data set

In [65]:
bank.head(10)

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25.0,1.0,49.0,91107,4,1.6,1,0,0,1,0,0,0
1,2,45.0,19.0,34.0,90089,3,1.5,1,0,0,1,0,0,0
2,3,39.0,15.0,11.0,94720,1,1.0,1,0,0,0,0,0,0
3,4,35.0,9.0,100.0,94112,1,2.7,2,0,0,0,0,0,0
4,5,35.0,8.0,45.0,91330,4,1.0,2,0,0,0,0,0,1
5,6,37.0,13.0,29.0,92121,4,0.4,2,155,0,0,0,1,0
6,7,53.0,27.0,72.0,91711,2,1.5,2,0,0,0,0,1,0
7,8,50.0,24.0,22.0,93943,1,0.3,3,0,0,0,0,0,1
8,9,35.0,10.0,81.0,90089,3,0.6,2,104,0,0,0,1,0
9,10,,9.0,180.0,93023,1,8.9,3,0,1,0,0,0,0


In [66]:
bank.tail(10)

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
4990,4991,55.0,25.0,58.0,95023,4,2.0,3,219,0,0,0,0,1
4991,4992,51.0,25.0,92.0,91330,1,1.9,2,100,0,0,0,0,1
4992,4993,30.0,5.0,13.0,90037,4,0.5,3,0,0,0,0,0,0
4993,4994,45.0,21.0,218.0,91801,2,6.67,1,0,0,0,0,1,0
4994,4995,64.0,40.0,75.0,94588,3,2.0,3,0,0,0,0,1,0
4995,4996,29.0,3.0,40.0,92697,1,1.9,3,0,0,0,0,1,0
4996,4997,30.0,4.0,15.0,92037,4,0.4,1,85,0,0,0,1,0
4997,4998,63.0,39.0,24.0,93023,2,0.3,3,0,0,0,0,0,0
4998,4999,65.0,40.0,49.0,90034,3,0.5,2,0,0,0,0,1,0
4999,5000,28.0,4.0,83.0,92612,3,0.8,1,0,0,0,0,1,1


#### Check the summary of the dataframe (describe())

In [67]:
bank.describe(include='all')

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
count,5000.0,4998.0,4998.0,4987.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,2500.5,45.336335,20.108043,73.807098,93152.503,2.3964,1.937938,1.881,56.4988,0.096,0.1044,0.0604,0.5968,0.294
std,1443.520003,11.460241,11.468603,46.037325,2121.852197,1.147663,1.747659,0.839869,101.713802,0.294621,0.305809,0.23825,0.490589,0.455637
min,1.0,23.0,-3.0,8.0,9307.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1250.75,35.0,10.0,39.0,91911.0,1.0,0.7,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2500.5,45.0,20.0,64.0,93437.0,2.0,1.5,2.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,3750.25,55.0,30.0,98.0,94608.0,3.0,2.5,3.0,101.0,0.0,0.0,0.0,1.0,1.0
max,5000.0,67.0,43.0,224.0,96651.0,4.0,10.0,3.0,635.0,1.0,1.0,1.0,1.0,1.0


#### Check the unique levels in the target attribute Personal Loan and also check the percentage distribution

In [68]:
bank["Personal Loan"].value_counts()/bank.shape[0]

0    0.904
1    0.096
Name: Personal Loan, dtype: float64
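The same shares can be obtained without dividing by the row count. A small sketch, assuming a hypothetical target column with the same 90.4% / 9.6% split as above:

```python
import pandas as pd

# Hypothetical target column: 904 customers who declined, 96 who accepted
target = pd.Series([0] * 904 + [1] * 96, name="Personal Loan")

# normalize=True returns proportions instead of raw counts
shares = target.value_counts(normalize=True)
print(shares)
```

The imbalance matters later: a model can reach about 90% accuracy here without identifying a single loan acceptor.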

#### Check the number of unique ZIP Codes present in the dataset 

In [69]:
print("The number of Unique ZIP Codes in the bank data set",bank['ZIP Code'].nunique())


The number of Unique ZIP Codes in the bank data set 467


#### Check the number of customers for each Family size present in the dataset

In [70]:
print("The number of Unique Family members in the bank data set: \n",bank['Family'].value_counts())

The number of Unique Family members in the bank data set: 
 1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64


#### Check the number of customers for each Education level present in the dataset

In [71]:
print("The number of Unique Education levels in the bank data set: \n",bank['Education'].value_counts())

The number of Unique Education levels in the bank data set: 
 1    2096
3    1501
2    1403
Name: Education, dtype: int64


#### Make the necessary data type changes based on the previous observations

In [72]:
bank['Education']=bank['Education'].astype('category')
bank['CD Account']=bank['CD Account'].astype('category')
bank['Online']=bank['Online'].astype('category')
bank['CreditCard']=bank['CreditCard'].astype('category')
bank['Securities Account']=bank['Securities Account'].astype('category')
bank['Family']=bank['Family'].astype('category')
bank['ZIP Code']=bank['ZIP Code'].astype('category')
######################################################################################################################
# Use the following code when you have more columns
# for column in ['Education', 'CD Account', 'Online']:
#     bank[column]=bank[column].astype('category')
# bank[bank.select_dtypes(['object']).columns] = bank.select_dtypes(['object']).apply(lambda x: x.astype('category'))
######################################################################################################################

In [73]:
bank.dtypes

ID                       int64
Age                    float64
Experience             float64
Income                 float64
ZIP Code              category
Family                category
CCAvg                  float64
Education             category
Mortgage                 int64
Personal Loan            int64
Securities Account    category
CD Account            category
Online                category
CreditCard            category
dtype: object
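When many columns need the same conversion, passing a dictionary to `astype` is a compact alternative to one assignment per column. The sketch below uses a hypothetical four-row frame with column names borrowed from this data set; it is an illustration, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy frame; the real data has many more rows and columns
toy = pd.DataFrame({
    "Education": [1, 2, 3, 1],
    "Online": [0, 1, 1, 0],
    "Income": [49.0, 34.0, 11.0, 100.0],
})

cat_cols = ["Education", "Online"]

# Map every listed column to the 'category' dtype in a single call
toy = toy.astype({col: "category" for col in cat_cols})
print(toy.dtypes)
```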

#### Remove the unnecessary columns

In [74]:
bank=bank.drop(["ID","ZIP Code"],axis=1)

In [75]:
bank.head(3)

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,25.0,1.0,49.0,4,1.6,1,0,0,1,0,0,0
1,45.0,19.0,34.0,3,1.5,1,0,0,1,0,0,0
2,39.0,15.0,11.0,1,1.0,1,0,0,0,0,0,0


####  Check the missing values 

In [76]:
bank.isnull().sum()

Age                    2
Experience             2
Income                13
Family                 0
CCAvg                  0
Education              0
Mortgage               0
Personal Loan          0
Securities Account     0
CD Account             0
Online                 0
CreditCard             0
dtype: int64

#### Split the data into train and test sets (use the sklearn package)

In [77]:
from sklearn.model_selection import train_test_split

In [78]:
y=bank["Personal Loan"]
X=bank.drop('Personal Loan', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)  

In [79]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4000, 11)
(1000, 11)
(4000,)
(1000,)
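The split above is random and unseeded, so the share of loan acceptors can drift between train and test, and results change from run to run. One possible variation, shown here on hypothetical synthetic data, is to pass `stratify` and `random_state`. This is a sketch of an alternative, not what the notebook ran, so the scores reported further down do not come from it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with the same 9.6% positive rate as the bank data
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({"Income": rng.uniform(8, 224, size=1000)})
y_demo = pd.Series([1] * 96 + [0] * 904)

# stratify keeps the class ratio (almost) identical in both splits;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42)

print(y_tr.mean(), y_te.mean())
```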


#### Split the numerical and Categorical Attributes

In [80]:
num_attr = X.select_dtypes(include=['float64','int64']).columns
num_attr

Index(['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage'], dtype='object')

In [81]:
cat_attr = X.select_dtypes(exclude=['float64','int64']).columns
cat_attr

Index(['Family', 'Education', 'Securities Account', 'CD Account', 'Online',
       'CreditCard'],
      dtype='object')

#### Impute and standardize the numerical attributes (StandardScaler); impute and one-hot encode the categorical attributes


In [82]:
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [83]:
numeric_transformer = Pipeline(memory ='./' ,steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

In [84]:
categorical_transformer = Pipeline(memory = './',steps=[
    # fill_value only applies to strategy='constant', so it is omitted here
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [85]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_attr),
        ('cat', categorical_transformer, cat_attr)])
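To see what such a preprocessor produces, here is a minimal sketch on a hypothetical four-row frame with one numeric and one categorical column. The names `toy` and `toy_pre` are illustrative only, and the `memory` caching used above is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one missing Income, three Education levels
toy = pd.DataFrame({"Income": [49.0, np.nan, 11.0, 100.0],
                    "Education": [1, 2, 3, 1]})
toy["Education"] = toy["Education"].astype("category")

toy_pre = ColumnTransformer(transformers=[
    # numeric branch: fill NaN with the column mean, then standardize
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                            ("scaler", StandardScaler())]), ["Income"]),
    # categorical branch: fill NaN with the mode, then one-hot encode
    ("cat", Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                            ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["Education"]),
])

out = toy_pre.fit_transform(toy)
print(out.shape)
```

The two input columns become four output columns: one scaled numeric column plus one indicator column per Education level.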

# Now it is time for model building

#### Let us see the details of the Random Forest in Sklearn 
* class sklearn.ensemble.RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)



* A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True 
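Drawing with replacement means each tree sees only part of the distinct training rows, even though its sample has the full size. A rough NumPy sketch of a single bootstrap draw, assuming the 4000-row training split used in this notebook (the seed is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 4000  # size of the training split in this notebook

# One bootstrap sample: n_samples row indices drawn with replacement
bootstrap_idx = rng.randint(0, n_samples, size=n_samples)

# Fraction of distinct rows that made it into the sample
in_bag_fraction = np.unique(bootstrap_idx).size / n_samples
print(round(in_bag_fraction, 3))
```

The fraction lands near 1 - 1/e (about 0.632). The remaining rows are the "out-of-bag" samples that `oob_score=True` uses for validation.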

#### Parameters
* **n_estimators** : integer, optional (default=10)
    **The number of trees in the forest**.The default value of n_estimators will change from 10 in version 0.20 to 100 in version 0.22.

* **criterion** : string, optional (default="gini")
    The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Note: this parameter is tree-specific.

* **max_depth** : integer or None, optional (default=None)
    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

* **min_samples_split** : int, float, optional (default=2)
    The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number.
    If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.


* **min_samples_leaf** : int, float, optional (default=1)
    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    If int, then consider min_samples_leaf as the minimum number.
    If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.


* **min_weight_fraction_leaf** : float, optional (default=0.)
    The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

* **max_features** : int, float, string or None, optional (default="auto")
    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split.
    If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
    If "auto", then max_features=sqrt(n_features).
    If "sqrt", then max_features=sqrt(n_features) (same as "auto").
    If "log2", then max_features=log2(n_features).
    If None, then max_features=n_features.
    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

* **max_leaf_nodes** : int or None, optional (default=None)
    Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

* **min_impurity_decrease** : float, optional (default=0.)
    A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

  The weighted impurity decrease equation is the following:

    N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)
    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

    N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

* **bootstrap** : boolean, optional (default=True)
    Whether bootstrap samples are used when building trees.

* **oob_score** : bool (default=False)
    Whether to use out-of-bag samples to estimate the generalization accuracy.

* **n_jobs** : int or None, optional (default=None)
    The number of jobs to run in parallel for both fit and predict. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

* **random_state** : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

* **verbose** : int, optional (default=0)
    Controls the verbosity when fitting and predicting.

* **warm_start** : bool, optional (default=False)
    When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.

* **class_weight** : dict, list of dicts, "balanced", "balanced_subsample" or None, optional (default=None)
    Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

    Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

    The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

    The "balanced_subsample" mode is the same as "balanced" except that weights are computed based on the bootstrap sample for every tree grown.

    For multi-output, the weights of each column of y will be multiplied.

    Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
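To make the "balanced" formula concrete, the sketch below applies `n_samples / (n_classes * np.bincount(y))` to a hypothetical label vector with the same 904 / 96 split as this data set, and cross-checks the result against scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 904 negatives, 96 positives
y_demo = np.array([0] * 904 + [1] * 96)

n_samples = y_demo.size
n_classes = np.unique(y_demo).size

# The formula quoted in the class_weight description above
balanced_weights = n_samples / (n_classes * np.bincount(y_demo))

# The same weights computed by scikit-learn's helper
sk_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

print(balanced_weights, sk_weights)
```

The minority class gets a weight roughly nine times larger than the majority class, which is one way to push the forest towards better recall on loan acceptors.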

####  Build Random Forest Classifier

In [87]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10,max_depth=8)

* Append the classifier to the preprocessing pipeline. Now we have a full prediction pipeline.

In [88]:
clf = Pipeline(memory = './',steps=[('preprocessor', preprocessor),
                      ('classifier', classifier)])
# Fit the full pipeline on the training split before predicting
clf.fit(X_train, y_train)

In [31]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score


y_pred = clf.predict(X_train)
print("Train Accuracy = ",accuracy_score(y_train,y_pred))
print("Recall in train = ",recall_score(y_train,y_pred,pos_label=1))

y_pred = clf.predict(X_test)
print("Test Accuracy = ",accuracy_score(y_test,y_pred))
print("Recall on Test = ",recall_score(y_test,y_pred,pos_label=1))

Train Accuracy =  0.99275
Recall in train =  0.9324675324675324
Test Accuracy =  0.987
Recall on Test =  0.8736842105263158
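Accuracy alone is hard to interpret here, because only about 9.6% of customers accepted the loan. As a point of comparison, this sketch scores a trivial "nobody accepts" prediction on hypothetical labels with that class ratio:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels with the data set's class ratio
y_true = np.array([0] * 904 + [1] * 96)

# Trivial baseline: predict that no customer accepts the loan
y_all_zero = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_zero)
baseline_recall = recall_score(y_true, y_all_zero, pos_label=1)
print(baseline_accuracy, baseline_recall)
```

The baseline already reaches 0.904 accuracy with zero recall, so the recall figures above say more about the model's usefulness for campaign targeting than the accuracy figures do.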


In [37]:
clf.named_steps['classifier'].feature_importances_

array([0.01969535, 0.02075261, 0.27850165, 0.21043057, 0.02749233,
       0.01065475, 0.00890064, 0.04080783, 0.03946239, 0.17150858,
       0.02684214, 0.05103575, 0.00614659, 0.00172055, 0.03243235,
       0.03831813, 0.00188224, 0.00142408, 0.00953436, 0.00245712])

In [58]:
!ls -la ./joblib/sklearn/pipeline/_fit_transform_one/

total 8
drwxrwxr-x 8 mahidharv mahidharv 4096 Feb  9 22:53 .
drwxrwxr-x 3 mahidharv mahidharv   39 Feb  9 22:53 ..
drwxrwxr-x 2 mahidharv mahidharv   31 Feb  9 22:53 432c7082c00dcbfedecabf1dab9526bc
drwxrwxr-x 2 mahidharv mahidharv   34 Feb  9 22:53 5504e8cd7225a04788ff0677a7cae659
drwxrwxr-x 2 mahidharv mahidharv   31 Feb  9 22:53 58319a044ddc3afe2e8bab54c0fb9a39
drwxrwxr-x 2 mahidharv mahidharv   34 Feb  9 22:53 679a677abca30f7d9894759dd0604cdd
drwxrwxr-x 2 mahidharv mahidharv   31 Feb  9 22:53 d216262512be451bdb0b07eaeaa7af36
drwxrwxr-x 2 mahidharv mahidharv   34 Feb  9 22:53 db636cf727786f3938442c3c9f744cd3
-rw-rw-r-- 1 mahidharv mahidharv  418 Feb  9 22:53 func_code.py
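The directory listed above exists because the pipelines were built with `memory='./'`, which makes scikit-learn cache fitted transformers on disk through joblib. A small sketch of the same mechanism, using a temporary directory (a hypothetical stand-in for `./`) and synthetic data:

```python
import os
import shutil
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical throwaway cache location instead of './'
cache_dir = tempfile.mkdtemp()

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# With memory set, fitted non-final steps (here the scaler) are cached
cached_pipe = Pipeline(memory=cache_dir, steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
cached_pipe.fit(X_demo, y_demo)

cache_contents = os.listdir(cache_dir)
print(cache_contents)

# Remove the cache once it is no longer needed
shutil.rmtree(cache_dir)
```

Caching mainly pays off when the same preprocessing is refitted many times, for example inside a grid search. For a single fit it mostly adds files to the working directory.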


In [None]:
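import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# The classifier has no column names, so rebuild them from the fitted
# preprocessor: numeric columns first, then one column per categorical level
onehot = clf.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
features = list(num_attr) + [str(col) + '_' + str(level)
                             for col, levels in zip(cat_attr, onehot.categories_)
                             for level in levels]
print(features)
importances = clf.named_steps['classifier'].feature_importances_
indices = np.argsort(importances)
print(indices)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
print([features[i] for i in indices])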

#### GridSearch Cross validation

In [None]:
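from sklearn.model_selection import GridSearchCV
rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt')

# X_train still contains missing values, so search over the full pipeline
# (preprocessing + forest) instead of the bare classifier
rf_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', rfc)])

# Use a grid over parameters of interest ('classifier__' targets the forest step)
param_grid = {
           "classifier__n_estimators" : [9, 18, 27, 36, 45, 54, 63],
           "classifier__max_depth" : [2,3,5,7],
           "classifier__min_samples_leaf" : [2, 4]}

CV_rfc = GridSearchCV(estimator=rf_pipe, param_grid=param_grid, cv= 10)
CV_rfc.fit(X=X_train, y=y_train)
print (CV_rfc.best_score_, CV_rfc.best_params_)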
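Before launching a search like this, it can help to count how many models it will train. A short sketch using `ParameterGrid` with the same value lists as the grid above (the parameter names are written without any pipeline prefix here, which does not change the count):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [9, 18, 27, 36, 45, 54, 63],
    "max_depth": [2, 3, 5, 7],
    "min_samples_leaf": [2, 4],
}

# Every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold (cv=10)
n_fits = n_candidates * 10
print(n_candidates, n_fits)  # 56 560
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the whole training set.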

In [None]:
y_pred_test=CV_rfc.predict(X_test)
print(accuracy_score(y_test,y_pred_test))

In [None]:
from sklearn.metrics import recall_score
print(recall_score(y_test,y_pred_test, pos_label=1, average='binary'))

#### Bagging

In [None]:
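from sklearn.ensemble import BaggingClassifier
# Reuse the preprocessor: the raw X_train still has missing values
clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', BaggingClassifier(n_estimators=10))])
clf.fit(X=X_train, y=y_train)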

#### Accuracy on test data

In [None]:
y_pred = clf.predict(X_test)
print(accuracy_score(y_test,y_pred))

#### Accuracy on train data

In [None]:
y_pred = clf.predict(X_train)
print(accuracy_score(y_train,y_pred))

In [None]:
import sklearn
print(sklearn.__version__)

In [None]:
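from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
# Bagged trees inside the preprocessing pipeline, so imputation happens before fitting.
# Note: 'base_estimator' was renamed to 'estimator' in scikit-learn 1.2;
# adjust the parameter key to match the version printed above.
bag_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', BaggingClassifier(DecisionTreeClassifier(),
                                                            n_estimators = 100, max_features = 0.5))])
param_grid = {
    'classifier__base_estimator__max_depth' : [1, 2, 3, 4, 5],
    'classifier__max_samples' : [0.05, 0.1, 0.2, 0.5]
}

clf = GridSearchCV(bag_pipe, param_grid,scoring='recall')
clf.fit(X_train, y_train)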

In [None]:
print(clf.best_score_,clf.best_params_)

In [None]:

from sklearn.metrics import recall_score
trainpreds=clf.predict(X_train)
print(recall_score(y_train,trainpreds,pos_label=1))# recall
print(accuracy_score(y_train,trainpreds))# accuracy