# Lab 4: Due Sunday February 6th

In [1]:
# Import required packages
from pathlib import Path

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

!pip install dmba
from dmba import classificationSummary

%matplotlib inline



# Personal Loan Acceptance

The file _UniversalBank.csv_ contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign. In this exercise, we focus on two predictors and one outcome: Online (whether or not the customer is an active user of online banking services), Credit Card (whether or not the customer holds a credit card issued by the bank; abbreviated CC below), and the outcome Personal Loan (abbreviated Loan below).

Partition the data into training (60%) and validation (40%) sets.

## 1. Data Preparation

__1.1__ Load the data, remove all unnecessary columns, and convert _Online_ and _CreditCard_ to categories. Remove any spaces from variable names. Split the data into training (60%) and validation (40%) sets (use <code>random_state=1</code>).

In [2]:
# Load the data and keep only the outcome and the two predictors
loan = pd.read_csv("UniversalBank1.csv")
loan = loan[['Personal Loan', 'Online', 'CreditCard']]
# Remove spaces from variable names
loan = loan.rename(columns={'Personal Loan': 'PersonalLoan'})
# Treat the two binary predictors as categorical
loan['Online'] = loan['Online'].astype('category')
loan['CreditCard'] = loan['CreditCard'].astype('category')

Split dataset into training and validation sets.

In [3]:
# Hold out 40% of the rows for validation; random_state=1 makes the split reproducible
train_df, valid_df = train_test_split(loan, test_size=0.4, random_state=1)

## 2. Pivot Table

__2.1__ Create a pivot table for the training data with Online as a column variable, CC as a row variable, and Loan as a secondary row variable. The values inside the table should convey the count.

(Hint: Use pivot_table with an index, columns and aggfunc=len.  This step should only be one line of code.)

In [4]:
#Insert code here
count_table = train_df.pivot_table(index=['CreditCard','PersonalLoan'],columns = 'Online', aggfunc=len)
count_table

CreditCard,PersonalLoan,Online=0,Online=1
0,0,792,1117
0,1,73,126
1,0,327,477
1,1,39,49


__2.2__ Using the pivot table created, consider the task of classifying a customer who owns a bank credit card and is actively using online banking services. 

Looking at the pivot table, what is the probability that this customer will accept the loan offer? (This is the probability of loan acceptance (Loan = 1) conditional on having a bank credit card (CC = 1) and being an active user of online banking services (Online = 1)).

In [5]:
# Rows 2 and 3 are (CC=1, Loan=0) and (CC=1, Loan=1); column 1 is Online=1
count_table.values[3,1]/(count_table.values[3,1]+count_table.values[2,1])

0.09315589353612168

__Answer:__

49/(477+49) = 9.32%
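
The same conditional probability can be read straight off the counts. The following is a minimal sketch in plain Python, assuming the eight counts are copied by hand from the pivot table in 2.1; the dictionary name `counts` and the variable names are illustrative only and do not appear elsewhere in the lab.

```python
# Training-set counts copied from the pivot table in 2.1,
# keyed by (CreditCard, PersonalLoan, Online).
counts = {
    (0, 0, 0): 792, (0, 0, 1): 1117,
    (0, 1, 0): 73,  (0, 1, 1): 126,
    (1, 0, 0): 327, (1, 0, 1): 477,
    (1, 1, 0): 39,  (1, 1, 1): 49,
}

# Customers with CC = 1 and Online = 1, split by loan outcome
accepted = counts[(1, 1, 1)]
declined = counts[(1, 0, 1)]

# P(Loan = 1 | CC = 1, Online = 1), computed from the counts alone
p_exact = accepted / (accepted + declined)
print(round(p_exact, 4))  # 0.0932
```

Keying the counts by their labels avoids relying on the row and column positions of the pivot table.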

__2.3__ Create two separate pivot tables for the training data. One will have Loan (rows) as a function of Online (columns) and the other will have Loan (rows) as a function of CC.

Pivot table for Loan (rows) as a function of Online (columns). Here we can use the `pivot_table` method of the pandas data frame.

In [6]:
#Since we had some issues in class, I'm providing the following code:

predictors = ['CreditCard', 'Online']

print(train_df['PersonalLoan'].value_counts() / len(train_df))
print()

for predictor in predictors:
    # construct the frequency table
    df = train_df[['PersonalLoan', predictor]]
    freqTable = df.pivot_table(index='PersonalLoan', columns=predictor, aggfunc=len)

    # divide each row by the sum of the row to get conditional probabilities
    propTable = freqTable.apply(lambda x: x / sum(x), axis=1)
    print(propTable)
    print()

0    0.904333
1    0.095667
Name: PersonalLoan, dtype: float64

CreditCard           0         1
PersonalLoan                    
0             0.703649  0.296351
1             0.693380  0.306620

Online               0         1
PersonalLoan                    
0             0.412459  0.587541
1             0.390244  0.609756



<small>(<em>CreditCard</em> abbreviated as CC, <em>Personal Loan</em> abbreviated as Loan)</small>

__2.4__ Compute the following quantities [P(A | B) means “the probability of A given B”]:

<ul>
<li>i. P(CC = 1 | Loan = 1) (the proportion of credit card holders among the loan acceptors)</li>
<li>ii. P(Online = 1|Loan = 1)</li>
<li>iii. P(Loan = 1) = the proportion of loan acceptors</li>
<li>iv. P(CC = 1|Loan = 0)</li>
<li>v.  P(Online = 1|Loan = 0)</li>
<li>vi. P(Loan = 0)</li>
</ul>

Use the pivot tables created in 2.3.

__Answers:__

    i. P(CC = 1|Loan = 1) = 0.306620
    ii. P(Online = 1|Loan = 1) = 0.609756
    iii. P(Loan = 1) = 0.095667
    iv. P(CC = 1|Loan = 0) = 0.296351
    v. P(Online = 1|Loan = 0) = 0.587541
    vi. P(Loan = 0) = 0.904333
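
As a cross-check, the six quantities can also be recomputed from the raw training counts rather than read off the printed tables. This is a sketch in plain Python, assuming the counts from the pivot table in 2.1; the helper `total` and all variable names are hypothetical and introduced only for this illustration.

```python
# Training-set counts from the pivot table in 2.1,
# keyed by (CreditCard, PersonalLoan, Online).
counts = {
    (0, 0, 0): 792, (0, 0, 1): 1117,
    (0, 1, 0): 73,  (0, 1, 1): 126,
    (1, 0, 0): 327, (1, 0, 1): 477,
    (1, 1, 0): 39,  (1, 1, 1): 49,
}

def total(cc=None, loan=None, online=None):
    """Sum the counts of all cells matching the given values (None means any)."""
    return sum(
        n for (c, ln, o), n in counts.items()
        if (cc is None or c == cc)
        and (loan is None or ln == loan)
        and (online is None or o == online)
    )

n_train = total()

p_cc1_loan1 = total(cc=1, loan=1) / total(loan=1)          # i
p_online1_loan1 = total(online=1, loan=1) / total(loan=1)  # ii
p_loan1 = total(loan=1) / n_train                          # iii
p_cc1_loan0 = total(cc=1, loan=0) / total(loan=0)          # iv
p_online1_loan0 = total(online=1, loan=0) / total(loan=0)  # v
p_loan0 = total(loan=0) / n_train                          # vi

for label, value in [
    ("P(CC = 1 | Loan = 1)", p_cc1_loan1),
    ("P(Online = 1 | Loan = 1)", p_online1_loan1),
    ("P(Loan = 1)", p_loan1),
    ("P(CC = 1 | Loan = 0)", p_cc1_loan0),
    ("P(Online = 1 | Loan = 0)", p_online1_loan0),
    ("P(Loan = 0)", p_loan0),
]:
    print(f"{label} = {value:.6f}")
```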

__2.5__ Which of the entries in this table are needed for computing P(Loan = 1 | CC = 1, Online = 1)? In Python, run naive Bayes on the data. Examine the model output on training data, and find the entry that corresponds to P(Loan = 1 | CC = 1, Online = 1).

In Python, run naive Bayes on the training data. Use data points that match the condition <em>CreditCard=1,Online=1</em> to find the predicted probability for P(Loan=1|CC=1,Online=1).

(Hint: Your target is Loan and your predictors are Online and CC.)

All six quantities from 2.4 enter the naive Bayes formula:

P(Loan = 1 | CC = 1, Online = 1)

= (0.306620 * 0.609756 * 0.095667) / [(0.306620 * 0.609756 * 0.095667) + (0.296351 * 0.587541 * 0.904333)]

= 0.01789 / (0.01789 + 0.15746)

= 0.01789 / 0.17535

= 0.1020
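
The hand calculation above can be reproduced in a few lines. This is a sketch in plain Python, assuming the six rounded values from 2.4 as inputs; the dictionary names are illustrative. Each class gets an unnormalized score (prior times the two conditional probabilities), and the scores are then normalized so they sum to 1.

```python
# Rounded inputs taken from the answers in 2.4, keyed by the Loan class k
p_cc1 = {1: 0.306620, 0: 0.296351}      # P(CC = 1 | Loan = k)
p_online1 = {1: 0.609756, 0: 0.587541}  # P(Online = 1 | Loan = k)
prior = {1: 0.095667, 0: 0.904333}      # P(Loan = k)

# Unnormalized naive Bayes score for each class:
# P(CC = 1 | k) * P(Online = 1 | k) * P(k)
score = {k: p_cc1[k] * p_online1[k] * prior[k] for k in (0, 1)}

# Normalize so the two class probabilities sum to 1
p_nb = score[1] / (score[0] + score[1])
print(f"{p_nb:.4f}")  # about 0.102
```

This naive Bayes estimate is not the same as the exact conditional probability from the pivot table in 2.2 (about 0.0932), because naive Bayes assumes CC and Online are independent given Loan.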

Change the types of variables to categories and create dummies

In [7]:
#Insert code here
train_df = pd.get_dummies(train_df,drop_first=True)
valid_df = pd.get_dummies(valid_df,drop_first=True)

Create your outcome and predictors lists and fit a MultinomialNB

In [8]:
X_train = train_df[['Online_1', 'CreditCard_1']]
y_train = train_df['PersonalLoan']
X_valid = valid_df[['Online_1', 'CreditCard_1']]
y_valid = valid_df['PersonalLoan']

In [9]:
#Insert code here
loan_nb = MultinomialNB(alpha=0.1)
loan_nb.fit(X_train, y_train)

y_train_pred = loan_nb.predict(X_train)
y_valid_pred = loan_nb.predict(X_valid)

a) Predict probabilities using predict_proba. 

b) Concatenate the training data frame and the predicted probability data frame in part a.

c) Check for the probability of "1" in the row where Online = 1 and CreditCard = 1

In [10]:
# a) Predict probabilities using predict_proba.
pred_prob_train = loan_nb.predict_proba(X_train)
pred_prob_valid = loan_nb.predict_proba(X_valid)

In [11]:
# b) Concatenate the training data frame and the predicted probability data frame in part a.
X_train_concatenate = X_train.copy().reset_index(drop=True)
X_train_concatenate['loan'] = y_train.values
X_train_concatenate[['no_pred_prob','pred_prob']] = pd.DataFrame(pred_prob_train)
X_train_concatenate['loan_pred'] = y_train_pred
X_train_concatenate

,Online_1,CreditCard_1,loan,no_pred_prob,pred_prob,loan_pred
0,0,0,0,0.904333,0.095667,0
1,1,0,0,0.904260,0.095740,0
2,1,1,0,0.904406,0.095594,0
3,0,1,0,0.904480,0.095520,0
4,1,0,1,0.904260,0.095740,0
...,...,...,...,...,...,...
2995,1,0,0,0.904260,0.095740,0
2996,1,0,0,0.904260,0.095740,0
2997,1,1,0,0.904406,0.095594,0
2998,1,0,0,0.904260,0.095740,0


In [12]:
# c) Check for the probability of "1" in the row where Online = 1 and CreditCard = 1
X_train_concatenate[(X_train_concatenate['Online_1']==1) & (X_train_concatenate['CreditCard_1']==1)]

,Online_1,CreditCard_1,loan,no_pred_prob,pred_prob,loan_pred
2,1,1,0,0.904406,0.095594,0
9,1,1,1,0.904406,0.095594,0
10,1,1,0,0.904406,0.095594,0
11,1,1,0,0.904406,0.095594,0
22,1,1,0,0.904406,0.095594,0
...,...,...,...,...,...,...
2972,1,1,0,0.904406,0.095594,0
2980,1,1,0,0.904406,0.095594,0
2983,1,1,0,0.904406,0.095594,0
2984,1,1,0,0.904406,0.095594,0


__2.6__ What is the probability of a Loan given Online and CC?

The fitted model's predicted probability of Loan = 1 given Online = 1 and CC = 1 is **0.095594**.
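
The model's value of 0.095594 is lower than the hand-computed naive Bayes estimate of about 0.102 in 2.5. One plausible explanation is the encoding. With `drop_first=True`, only the `Online_1` and `CreditCard_1` indicators reach `MultinomialNB`, and a multinomial likelihood normalizes each feature's count by the total count over all features within a class. The sketch below is a hand-written emulation in plain Python, not a call into scikit-learn. It assumes the smoothed estimate (N_ki + alpha) / (N_k + alpha * n_features) that scikit-learn documents for `MultinomialNB`, with class priors taken from the class frequencies. The function name and the dictionaries are hypothetical, and the full one-hot encoding is shown as an alternative that the notebook does not use.

```python
def multinomial_nb_posterior(feature_counts, class_counts, active, alpha=0.1):
    """P(class 1 | active features) under a smoothed multinomial naive Bayes.

    feature_counts[k][name] is the summed value of feature `name` in class k;
    `active` lists the indicator features equal to 1 for the scored customer.
    """
    n_total = sum(class_counts.values())
    score = {}
    for k, feats in feature_counts.items():
        # Smoothed total feature count for class k
        denom = sum(feats.values()) + alpha * len(feats)
        score[k] = class_counts[k] / n_total  # class prior
        for name in active:
            score[k] *= (feats[name] + alpha) / denom
    return score[1] / sum(score.values())

# Class sizes in the training set (Loan = 0 and Loan = 1)
class_counts = {0: 2713, 1: 287}

# Encoding A: drop_first=True keeps only the "_1" indicator columns
dropped = {
    0: {"Online_1": 1594, "CreditCard_1": 804},
    1: {"Online_1": 175, "CreditCard_1": 88},
}

# Encoding B (assumed alternative): full one-hot, one column per level
full = {
    0: {"Online_0": 1119, "Online_1": 1594, "CreditCard_0": 1909, "CreditCard_1": 804},
    1: {"Online_0": 112, "Online_1": 175, "CreditCard_0": 199, "CreditCard_1": 88},
}

active = ["Online_1", "CreditCard_1"]
p_dropped = multinomial_nb_posterior(dropped, class_counts, active)
p_full = multinomial_nb_posterior(full, class_counts, active)

print(f"drop_first encoding:   {p_dropped:.6f}")
print(f"full one-hot encoding: {p_full:.4f}")
```

Under these assumptions the drop-first encoding reproduces the 0.095594 reported above, while the full one-hot encoding comes out at about 0.102, in line with the hand calculation. With every level present, each customer activates exactly one column per predictor, so the extra normalizing factor is the same for both classes and cancels.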