# Question #2:

## Question #2, Part A
Please see the Word document submitted with the assignment for the written response to Question 2, Part A.

## Question #2, Part B (Python)

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

#Part 1: Data Acquisition. Read data from a CSV file into a data frame
df = pd.read_csv(r'C:\Users\Home\Documents\Data Mining\Lecture Slides and Python Examples\Lecture 7-Logistic Regression for model accuracy\PersonalLoan.csv')

#Part 2: Setting Up Categorical and Numerical Variables

#Dropping ZIP CODE as we are not yet ready (10/19/2020) to handle the tons of dummies it creates
rvar_list = ['ZIPCode']
df_sample1 = df.drop(columns=rvar_list)

#Specify Variable Names Based off their definitions
cvar_list = ['Education','SecuritiesAccount','CDAccount','Online','CreditCard','PersonalLoan']
nvar_list = ['Age','Experience','Income','Family','CCAvg','Mortgage']

#Standardizing Numerical variables

df_sample2 = df_sample1.copy()
    
original_column_values = df_sample1[nvar_list]
sample_mean = df_sample1[nvar_list].mean()
sample_stddev = df_sample1[nvar_list].std()

df_sample2[nvar_list] = ((original_column_values - sample_mean)/sample_stddev)

#Creating Dummies for Categorical Variables
#Create the dummies
df_sample3 = df_sample2.copy()
df_sample3[cvar_list] = df_sample2[cvar_list].astype('category')
df_sample3[nvar_list] = df_sample2[nvar_list].astype('float64')

df_sample4 = pd.get_dummies(df_sample3, prefix_sep = '_')

#Remove one redundant dummy from each set of dummies
rdummies = ['Education_1','SecuritiesAccount_Yes','CDAccount_Yes','Online_Yes','CreditCard_Yes','PersonalLoan_No']
df_sample5 = df_sample4.drop(columns=rdummies)

#Part 3: Data Partition:
#Splitting the data into our partitions will return two dataframes, so we must prep like so:
testpart_size = .2
df4partition = df_sample5

df_nontestdata, df_testdata = train_test_split(df4partition, test_size = testpart_size, random_state = 1)

#Part 4: calculate profits under the naive strategy:
#Isolate the dependent variable in the PersonalLoan test partition data
index_reset = df_testdata['PersonalLoan_Yes'].reset_index(drop=True)
accept_or_decline = index_reset

#Since the Bank is pursuing a "naive strategy", every single customer will receive a loan offer.
#Thus, if the customer accepts, the Bank makes $5. If they do not accept, the Bank loses $2.
total_profit = 0
for i in range(len(accept_or_decline)):
    if accept_or_decline[i] == 1:
        profit = 5
    elif accept_or_decline[i] == 0:
        profit = -2
    total_profit += profit
    
#Calculate average profit
average_net_profit = total_profit / len(accept_or_decline)
print(average_net_profit)

-1.3


## Question #2, Part B (Explanation):
As the output above shows, when the Bank pursues the naive strategy of sending an offer to every customer, it loses an average of $1.30 per customer in the test partition.
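
The figure can be cross-checked by hand. With a payoff of +5 for an acceptance and -2 for a decline, the average profit per customer equals 5p - 2(1 - p) = 7p - 2, where p is the acceptance rate, so an average of -1.3 corresponds to p = 0.1. The sketch below illustrates this on a made-up sample; the function name `naive_average_profit` and the ten-customer list are hypothetical and are not taken from the data set.

```python
def naive_average_profit(outcomes, gain=5, loss=2):
    """Average profit per customer when every customer receives an offer.

    outcomes: iterable of 1 (offer accepted) or 0 (offer declined).
    gain/loss: payoffs assumed from the assignment (+5 and -2).
    """
    outcomes = list(outcomes)
    total = sum(gain if accepted == 1 else -loss for accepted in outcomes)
    return total / len(outcomes)


# Hypothetical sample: 1 acceptance out of 10 offers (10% acceptance rate).
sample = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
avg = naive_average_profit(sample)
print(avg)
```

Any sample with a 10% acceptance rate gives the same average, which is why the result depends only on the share of acceptances and not on the order of the customers.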

## Question #2, Part C

In [2]:
#Logistic Regression Analysis over the test partition:
DV = 'PersonalLoan_Yes'
y = df_testdata[DV]
x = df_testdata.drop(columns = [DV])

def summary_coef(model_object):
    n_predictors = x.shape[1]
    model_coef = pd.DataFrame(model_object.coef_.reshape(1,n_predictors),columns = x.columns.values)
    model_coef['Intercept'] = model_object.intercept_
    return (model_coef.transpose())

#Setup Logistic Regression with k-folds = 5
kfolds = 5

#Specifying the alpha range for the logistic regression function
min_alpha = .01
max_alpha = 100

max_C = 1/min_alpha
min_C = 1/max_alpha

#Because there are infinitely many values between min_alpha and max_alpha, we must specify how many candidates to search over.
#np.linspace returns n_candidates evenly spaced values of C (= 1/alpha) between min_C and max_C.
n_candidates = 10000
c_list= list(np.linspace(min_C, max_C, num = n_candidates))

#Incorporating expected profit calculations into the optimal model selection
#Note: relative to the profit loop in Part B, we made some tweaks so the function matches the scoring signature (model, x, y) that LogisticRegressionCV expects.
def profit_calculation (model,x_value,y_value):
    #This is the cutoff value that results from revenue of $5 per customer loan acceptance and cost of $2 per sending
    #Please see Question 2, Part A for the calculation
    d_cutoff = 2/7
    decision = list(model.predict_proba(x_value)[:,1] > d_cutoff)
    y = list(y_value)
    #With this list of classified binaries, we can now calculate total profit
    n_obs = len(y)

    total_profit = 0
    for i in range(n_obs):
        if decision[i] == True and y[i] == 1:
            profit = 5
        elif decision[i] == True and y[i] == 0:
            profit = -2
        else:
            profit = 0
        total_profit += profit
    #Calculate average profit
    average_net_profit = total_profit / n_obs
    return average_net_profit

#Fit the L1-penalized logistic regression with 5-fold cross-validation, using average profit as the score for selecting C
clf_optimal = LogisticRegressionCV(Cs = c_list,cv=kfolds, scoring = profit_calculation, penalty = 'l1',solver='saga',max_iter=200, random_state=1, n_jobs = -1).fit(x,y)

print(summary_coef(clf_optimal))

#Find the optimal selected alpha
print(1/clf_optimal.C_)

#Calculate profit
print("If the Bank accepts the final selected model and the decision rule from Part A, then the average profit over the test partition will be $",profit_calculation(clf_optimal,x,y))

                             0
Age                   0.000000
Experience            0.182282
Income                2.225531
Family                0.635112
CCAvg                 0.122225
Mortgage             -0.008906
Education_2           2.515176
Education_3           2.022576
SecuritiesAccount_No  0.000000
CDAccount_No         -2.824235
Online_No             0.444838
CreditCard_No         0.313069
Intercept            -3.121575
[2.04081633]
If the Bank accepts the final selected model and the decision rule from Part A, then the average profit over the test partition will be $ 0.328
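
To make the scoring rule concrete, here is a small sketch of the same profit logic applied to hand-made predicted probabilities instead of a fitted model. The function name `average_profit_at_cutoff` and the five probabilities and outcomes are hypothetical; only the 2/7 cutoff and the +5 / -2 / 0 payoffs come from the code above.

```python
def average_profit_at_cutoff(probabilities, outcomes, cutoff=2 / 7, gain=5, loss=2):
    """Average profit per customer when offers go only to customers whose
    predicted acceptance probability exceeds the cutoff."""
    total = 0
    for p, accepted in zip(probabilities, outcomes):
        if p > cutoff:  # an offer is sent
            total += gain if accepted == 1 else -loss
        # no offer sent: profit contribution is 0
    return total / len(outcomes)


# Hypothetical predictions and true outcomes for five customers.
probs = [0.90, 0.40, 0.25, 0.10, 0.30]
actual = [1, 0, 1, 0, 1]

result = average_profit_at_cutoff(probs, actual)
print(result)
```

In this toy sample, three customers are above the cutoff (0.90, 0.40, 0.30). Two of them accept and one declines, so the total is 5 - 2 + 5 = 8 over five customers. The customer at 0.25 would have accepted but receives no offer, which shows that the rule gives up some acceptances in exchange for avoiding unprofitable mailings.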


## Question #2, Part D
The naive strategy yields an average net profit of -$1.30 per customer, while the selected model yields an average net profit of $0.328 per customer. Using the model therefore improves average profit by 0.328 - (-1.3) = $1.628 per customer.

## Question #2, Part E (Python)

In [3]:
#Revenue in the profit calculation has been updated to 20 and costs to 8
def profit_calculation_updated (model,x_value,y_value):
    #With an updated expected revenue of $20 and cost of $8, the decision cutoff is 8/28,
    #which reduces to 2/7, the same cutoff as with revenue of $5 and cost of $2
    d_cutoff = 8/28
    decision = list(model.predict_proba(x_value)[:,1] > d_cutoff)
    y = list(y_value)
    #With this list of classified binaries, we can now calculate total profit
    n_obs = len(y)

    total_profit = 0
    for i in range(n_obs):
        if decision[i] == True and y[i] == 1:
            profit = 20
        elif decision[i] == True and y[i] == 0:
            profit = -8
        else:
            profit = 0
        total_profit += profit
    #Calculate average profit
    average_net_profit = total_profit / n_obs
    return average_net_profit

#Refit the cross-validated L1 logistic regression, this time using the updated profit function as the score for selecting C
clf_optimal = LogisticRegressionCV(Cs = c_list,cv=kfolds, scoring = profit_calculation_updated, penalty = 'l1',solver='saga',max_iter=200, random_state=1, n_jobs = -1).fit(x,y)

print(summary_coef(clf_optimal))

#Find the optimal selected alpha
print(1/clf_optimal.C_)

#Calculate profit
print("If the Bank accepts the final selected model and the decision rule from Part A, then the average profit over the test partition will be $",profit_calculation_updated(clf_optimal,x,y))

                             0
Age                   0.000000
Experience            0.182282
Income                2.225531
Family                0.635112
CCAvg                 0.122225
Mortgage             -0.008906
Education_2           2.515176
Education_3           2.022576
SecuritiesAccount_No  0.000000
CDAccount_No         -2.824235
Online_No             0.444838
CreditCard_No         0.313069
Intercept            -3.121575
[2.04081633]
If the Bank accepts the final selected model and the decision rule from Part A, then the average profit over the test partition will be $ 1.312


## Question #2, Part E (Explanation):
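Comparing the two coefficient tables and the selected alpha values shows that the final selected model after updating revenue and costs is identical to the prior final selected model. This is expected: both values were multiplied by 4, so the cutoff 8/28 = 2/7 is unchanged and the same customers receive offers. Only the resulting profit changes, scaling by the same factor (4 × 0.328 = 1.312).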
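
The unchanged model selection can be checked with exact arithmetic. The sketch below assumes the Part A cutoff is the break-even probability, that is, the smallest p for which p × gain - (1 - p) × loss is positive; this is consistent with the 2/7 and 8/28 values used in the code, but the helper `breakeven_cutoff` is a hypothetical name introduced here for illustration.

```python
from fractions import Fraction


def breakeven_cutoff(gain, loss):
    """Probability above which sending an offer has positive expected profit.

    Solves p*gain - (1 - p)*loss > 0 for p, giving p > loss / (gain + loss).
    """
    return Fraction(loss, gain + loss)


old_cutoff = breakeven_cutoff(5, 2)    # original payoffs
new_cutoff = breakeven_cutoff(20, 8)   # updated payoffs

# Both payoffs grew by the same factor, so the cutoff is identical.
print(old_cutoff, new_cutoff, old_cutoff == new_cutoff)

# Every customer's profit scales by the same factor, and so does the average.
scale = Fraction(20, 5)
scaled_profit = float(scale * Fraction(328, 1000))
print(scaled_profit)
```

Because every candidate model's score is multiplied by the same positive constant, the ranking of models during cross-validation does not change, so the same value of C is selected.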

