# <div align="center"> SPECIAL TOPICS III </div>
## <div align="center"> Data Science for Social Scientists  </div>
### <div align="center"> ECO 4199 </div>
#### <div align="center">Class 7 - Classification</div>
<div align="center"> Jonathan Holmes, (he/him)</div>

# Mixing quantitative and qualitative

$$ Y = f(\mathbf{X}) + \varepsilon$$

- Last class we saw only instances of quantitative variables, both for our __predictors__ X and the __target__ Y
- In the multiple regression model, we assumed that TV and Radio ads have an __additive effect__
    - Remember from econometrics that this gives you the effect of TV on Sales, __holding Radio constant__
    - In other words, the effect of TV ads is assumed to be the same, regardless of how much you spend on Radio ads
- Today, we will explore new models

# New data set
- Remember that we are still using the book [Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/)
- I will sometimes use code from this [set of scripts](https://github.com/JWarmenhoven/ISLR-python)
- We will also use the <span style="color:orange;">Credit dataset</span> from the R package ISLR or on the [book's webpage](https://www.statlearning.com/resources-first-edition)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib

import statsmodels.api as sm
import statsmodels.formula.api as smf
import sklearn.linear_model as skl_lm

import math

folderPath="~/Dropbox/_teaching/ECO4199/2023/Data-Science-for-Social-Scientists/Class 07 - Classification/"

In [None]:
credit=pd.read_csv('/'.join([folderPath, 'Credit.csv']))
credit.info()  # prints column types and non-null counts (returns None, so it is not wrapped in display)
display(credit.describe())
credit.head()

Let's inspect this dataset visually

In [None]:
sns.pairplot(credit[['Education','Limit','Rating']],diag_kind="kde")
plt.show()

In [None]:
sns.pairplot(credit[['Age', 'Cards','Education','Income','Limit','Rating','Balance']],diag_kind="kde")
txt="\nEach panel of the figure above is a scatterplot for a pair of variables whose identities are given by the corresponding row and column labels.\nFor example, the leftmost panel in the $\\it{Balance}$ row depicts balance versus age, \nwhile the plot directly to the right of the $\\it{Age}$ density corresponds to age versus cards."
plt.figtext(0.5, -0.03, txt, wrap=True, horizontalalignment='center', fontsize=12)
plt.show()

## The statistical task at hand
Our credit dataset records:
- _balance_ (average credit card debt for a number of individuals) 
- __quantitative predictors__: 
    - age, cards (number of credit cards), education (years of education), income (in thousands of dollars), limit (credit limit), and rating (credit rating). 
- __qualitative variables__: 
    - gender, student (student status), status (marital status), and region (East, South or West).

# Dummy variables

- Suppose we are interested in knowing how credit card balance differs between students and non-students, ignoring the other variables for the moment. 
- A qualitative predictor (__factor__) with only 2 levels can be easily integrated into a regression
    - Create an indicator or __dummy variable__ that takes on two possible numerical values. For example, based on the student variable, we can create a variable
$$
x_i=
\begin{cases}
1,\ \  \text{if the ith person is a student}\\
0,\ \  \text{if the ith person is not a student}
\end{cases}
$$

Which yields the following population model:
$$
y_i= \beta_0 + \beta_1 x_i + \varepsilon_i =
\begin{cases}
\beta_0 + \beta_1 + \varepsilon_i ,\ \  \text{if the ith person is a student}\\
\beta_0 + \varepsilon_i ,\ \  \text{if the ith person is not a student}
\end{cases}
$$

In [None]:
# Replace Student by a dummy variable
print("Before")
display(credit.head())
credit['Student'] = credit['Student'].map({"No": 0, "Yes": 1}).astype(np.int8) # encode No/Yes as 0/1
print("After")
credit.head()

In [None]:
# Regress Balance on Student Status
results = smf.ols('Balance ~ Student ', data=credit).fit()
# Inspect the results
print(results.summary())

## Interpreting dummy variables

- $\hat{β}_0$ (\$480) is the average credit card balance among non-students
- $\hat{β}_0 + \hat{β}_1$ (\$876) is the average credit card balance among students, and 
- $\hat{β}_1$ (\$396) is the difference, on average, in credit card balance between students and non-students.
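
To see why these interpretations hold, here is a minimal sketch on made-up data (the balances below are hypothetical and are not taken from the Credit dataset). Regressing an outcome on a constant and a single 0/1 dummy reproduces the two group means: the intercept is the mean of the group coded 0, and the intercept plus the slope is the mean of the group coded 1.

```python
import numpy as np

# Hypothetical balances for 3 non-students (dummy = 0) and 2 students (dummy = 1)
student = np.array([0, 0, 0, 1, 1])
balance = np.array([400.0, 500.0, 540.0, 800.0, 950.0])

# Design matrix with a constant term and the dummy variable
X = np.column_stack([np.ones_like(balance), student])
beta_hat, *_ = np.linalg.lstsq(X, balance, rcond=None)  # OLS by least squares

mean_non_students = balance[student == 0].mean()
mean_students = balance[student == 1].mean()

print("intercept:", beta_hat[0], "| mean of non-students:", mean_non_students)
print("intercept + slope:", beta_hat[0] + beta_hat[1], "| mean of students:", mean_students)
```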

# Qualitative Predictors with More than Two Levels

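- If a factor has more than two levels, a single dummy variable cannot represent all of them, so we create additional dummy variables
- For example, for the region variable we can create one dummy variable per level: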

$$
x^S_{i}=
\begin{cases}
1,\ \  \text{if the ith person is from the South region}\\
0,\ \  \text{if the ith person is not from the South region}
\end{cases}
$$
and define $x^E_i$ (East) and $x^W_i$ (West) similarly (note: there is no North). With East as the omitted category, the model is:

$$y_i = β_0+β_1x^S_{i}+β_2x^W_{i}+\varepsilon_i =
\begin{cases}
β_0+β_1+\varepsilon_i\ \ \ , \text{if the ith person is from the South}\\
β_0+β_2+\varepsilon_i\ \ \ , \text{if the ith person is from the West} \\
β_0+\varepsilon_i\ \ \ \ \ \ \ \ \ \ \ \ , \text{if the ith person is from the East} 
\end{cases}
$$

- $β_0$ is the average credit card balance for East 
- $β_1$ is the difference in the average balance between South and East
- $β_2$ is the difference in the average balance between West and East


- To avoid perfect __multicollinearity__ (the dummies for all levels add up to the constant term), one category must be omitted. 
- The level with no dummy variable in this example is also known as the __leave-out group__.
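
The need for an omitted category can be checked numerically. The sketch below uses a handful of hypothetical region labels (not the Credit data) and compares the rank of the design matrix with and without the East dummy. When the constant and all three dummies are included, one column is redundant, so the matrix has fewer independent columns than coefficients to estimate.

```python
import numpy as np

# Hypothetical region labels for six people
region = np.array(["East", "South", "West", "East", "West", "South"])
levels = ["East", "South", "West"]

# One 0/1 column per level
dummies = np.column_stack([(region == lvl).astype(float) for lvl in levels])
const = np.ones((len(region), 1))

X_all = np.hstack([const, dummies])          # constant + all three dummies
X_drop = np.hstack([const, dummies[:, 1:]])  # constant + South and West (East omitted)

rank_all = np.linalg.matrix_rank(X_all)
rank_drop = np.linalg.matrix_rank(X_drop)

print(X_all.shape[1], "columns, rank", rank_all)    # rank is below the column count
print(X_drop.shape[1], "columns, rank", rank_drop)  # full column rank
```

Dropping the intercept instead of a dummy is the other way to restore full rank, which is what the `-1` in a formula does.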

In [None]:
# Regress Balance on a constant term and Region -- THE EASY WAY
results = smf.ols('Balance ~ Region ', data=credit).fit() # smf treats Region as a categorical variable and drops the first category (East) as the baseline
# Inspect the results
print(results.summary())

In [None]:
# Regress Balance on a constant term and Region -- THE EXPLICIT WAY
dummies=pd.get_dummies(credit['Region'], drop_first=True, dtype=int) # create 0/1 dummies and drop the omitted category (East)
display(dummies.head())
df=credit.join(dummies)

results = smf.ols('Balance ~ South + West ', data=df).fit() # pass the dummies explicitly
# Inspect the results
print(results.summary())

In [None]:
# Regress Balance on Region -- THE MULTICOLLINEARITY-AWARE WAY
dummies=pd.get_dummies(credit['Region'], drop_first=False, dtype=int) # create 0/1 dummies with no omitted category
dummies.columns=dummies.columns.str.replace(" ","")
display(dummies.head())
df=credit.join(dummies)

results = smf.ols('Balance ~ South + West + East -1', data=df).fit() # -1 indicates to drop the intercept
# Inspect the results
print(results.summary())

# In-Class Exercise: 

Suppose I am a business owner, and I want to predict seasonal sales. Let: 
- $Y$ = total sales
- $X_1$ = A dummy variable for _fall_
- $X_2$ = A dummy variable for _winter_
- $X_3$ = A dummy variable for _spring_
- $X_4$ = A dummy variable for _summer_

Using data on my historical sales, I estimate the following model: 
\begin{equation}
    Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i
\end{equation}

I estimate the following coefficients: $\hat{\beta}_0 = 100$, $\hat{\beta}_1 = 10$, $\hat{\beta}_2 = -10$, and $\hat{\beta}_3 = 50$. 

Question #1: What were my average sales in fall, winter, spring, and summer? 

Question #2: Suppose I estimated a different model: 
\begin{equation}
    Y_i = \alpha_0 + \alpha_2 X_{2i} + \alpha_3 X_{3i} + \alpha_4 X_{4i} + u_i
\end{equation}

Is it possible to predict what estimates I would get for $\hat{\alpha}_0$, $\hat{\alpha}_2$, $\hat{\alpha}_3$, and $\hat{\alpha}_4$? If yes, what are they?   
















# Removing the Additive Assumption

- Last week we used the <span style="color:orange;">Advertising</span> dataset
- In the multiple regression model we saw that both TV and radio were associated with sales. 
- We also assumed that the effects of TV and Radio were independent of one another
    - Using the old terminology, we captured the effect of TV ads on Sales keeping radio constant


## Interaction effect

- Suppose that spending money on radio ads actually increases the effectiveness of TV ads
    - This means that the slope for TV should increase as radio spending increases.
- If true, for a given budget, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio. 
- This is an interaction effect 

## Interaction term
- Instead of running:
$$Y_i = β_0 + β_1 \text{TV}_i + β_2 \text{Radio}_i+ \varepsilon_i$$
- We could interact TV and radio using an __interaction term__:
$$Y_i = β_0 + β_1 \text{TV}_i + β_2 \text{Radio}_i+ β_3 \text{Radio}_i\times\text{TV}_i+ \varepsilon_i$$


## Interaction term - continued
- Note that we can rewrite the last equation as:
$$Y_i = β_0 + \underbrace{(β_1 + β_3 \text{Radio}_i)}_{\tilde{β_1}} \text{TV}_i + β_2 \text{Radio}_i+ \varepsilon_i$$


$$Y_i = β_0 + \tilde{β_1} \text{TV}_i + β_2 \text{Radio}_i+ \varepsilon_i$$

- Since $\tilde{β_1}$ changes with Radio, the effect of TV on Y is no longer constant 
    - adjusting Radio will change the impact of TV on Y
    - $β_3$ is the change in the effect of TV ads on Sales when Radio increases by one unit
    - $β_1$ is now the effect of TV ads on Sales when Radio ads equal zero

In [None]:
ads=pd.read_csv('/'.join([folderPath, 'Advertising.csv']), usecols=[1,2,3,4])

print("***Additive Model***")
results = smf.ols('Sales ~ TV + Radio ', data=ads).fit()
print(results.summary())

In [None]:
print("***Model with Interaction Term***")
results = smf.ols('Sales ~ TV + Radio + TV*Radio', data=ads).fit() 
# Inspect the results
print(results.summary())

## Interpretation of the results
-  $\hat{β_1} \text{ and } \hat{β_2}$ give what's known as the __main effect__
- $\hat{β_3}$ is significant and implies that:
    - an increase in TV advertising of \$1,000 is associated with increased sales of ($\hat{β_1}+ \hat{β_3}\times $ radio) × 1,000 = 19 + 1.1 × radio units. 
    - an increase in radio advertising of \$1,000 is associated with an increase in sales of ($\hat{β_2} + \hat{β_3} \times $ TV) × 1,000 = 29 + 1.1 × TV units.
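
The formula for the TV effect can be evaluated at a few radio budgets to see how the slope moves. This is a minimal sketch that hard-codes the rounded values implied by the text above (19 and 1.1 after multiplying by 1,000); the helper name `tv_effect_per_1000` is made up for illustration, and the radio values are arbitrary.

```python
# Rounded coefficients implied by the text: 19 = beta_1 * 1000 and 1.1 = beta_3 * 1000
beta_tv = 0.019
beta_interaction = 0.0011

def tv_effect_per_1000(radio):
    """Change in sales (units) associated with an extra $1,000 of TV ads, at a given radio budget."""
    return (beta_tv + beta_interaction * radio) * 1000

# The TV slope grows with the radio budget (radio measured in the dataset's units)
for radio in [0, 10, 50]:
    print(f"radio = {radio:>2}: {tv_effect_per_1000(radio):.1f} units")
```

With radio equal to zero the effect collapses to the main effect alone, which is why $\hat{β_1}$ is read as the TV effect when no money is spent on radio.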

## Models' performance
- We now have two models that include the same number of predictor variables (TV and Radio)
- But the second model allows for synergies between Radio and TV
- The $R^2$ for the model with the interaction term is 96.8%, compared to only 89.7% for the additive model
    -  This means that (96.8 − 89.7)/(100 −89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term.
    - This interaction term has greatly improved our predictive power!
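
The 69% figure is simple arithmetic on the two $R^2$ values quoted above, reproduced here as a quick check:

```python
r2_additive = 0.897      # R^2 of the additive model, as quoted above
r2_interaction = 0.968   # R^2 of the model with the interaction term

unexplained_before = 1 - r2_additive  # variability left after the additive model
share = (r2_interaction - r2_additive) / unexplained_before

print(f"Share of the remaining variability explained by the interaction term: {share:.1%}")
```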

# Classification
- Qualitative variables need not be used only as predictors
- In many interesting predictive tasks, the response variable is qualitative
- Predicting a __categorical variable__ (two or more levels) is a process known as __classification__ 

- In this section, we will attempt to predict whether a person defaults on a loan

In [None]:
## Load New Dataset: Loan Defaults

df = pd.read_excel('/'.join([folderPath, 'Default.xlsx']),index_col=0)

#Data cleaning: The dataset has a column that is listed as "Yes" and "No." We want it to read 0/1.
display(df.head())

# Note: factorize() returns two objects: an array of integer codes and an array with the unique values.
display(df.default.factorize())
# We are only interested in the first object. Codes follow order of appearance, so check that "No" maps to 0.
df.rename(columns={"default":"default_lab","student":"student_lab" },inplace=True)
df['default'] = df.default_lab.factorize()[0] 
df['student'] = df.student_lab.factorize()[0]
df.head()


In [None]:
sns.pairplot(df[['balance','income','default', 'student']],diag_kind="kde")

In [None]:
# Create Binned Scatterplot of Balance 
df['bin'] = pd.cut(df.balance, bins=np.linspace(0, 3000, 10), include_lowest=True)

balbins = df.groupby(['bin']).mean(numeric_only=True)  # skip the text label columns when averaging

fig, ax = plt.subplots(figsize=(12,12))

ax.plot(balbins['balance'], balbins['default'], marker='o', linestyle="")

ax.set_xlabel("Credit Card Balance")
ax.set_ylabel("Average Probability of Default")

plt.title("Binned Scatterplot, Default vs. Balance", {'fontsize':30})

plt.show()



## Linear probability model
- The simplest way to perform classification with a categorical variable (2 levels) is to use the __linear probability model__ (LPM)
- In the LPM, you regress a dummy variable on your regressors:
\begin{equation*}
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + ... + \beta_k X_{ki} + \varepsilon_i
\end{equation*}


- The population regression function is defined by the expected value of the dependent variable given each set of possible values of the regressors:
\begin{equation*}
E(Y | X_1, X_2, ..., X_k) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k
\end{equation*}

Question: why was $\varepsilon_i$ in the population regression but not in the expectation?

## From expectation to mean to probability

- The expectation is the population counterpart of the sample mean
- If a random variable can take only two values (0, 1) then its mean will also be between 0 and 1
    - Mean of [0,0,0] is $\frac{1}{3} \times (0 + 0 + 0) = \frac{0}{3} = 0$
    - Mean of [1,1,1] is $\frac{1}{3} \times (1 + 1 + 1) = \frac{3}{3} = 1$
    - Mean of [0,1,1] is $\frac{1}{3} \times (0 + 1 + 1) = \frac{2}{3}$, which is the share of ones
- If $Y$ is a dummy variable, then its expected value is the same as the probability that $Y$ is equal to 1
- So in this case, the population regression function is:
\begin{equation*}
P(Y = 1 | X_1, X_2, ..., X_k) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k
\end{equation*}
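
As a quick numerical illustration of the step from mean to probability, the sketch below uses a made-up 0/1 vector and checks that its mean is the same number as the share of ones.

```python
import numpy as np

# Hypothetical outcomes: 1 = default, 0 = no default
y = np.array([0, 1, 1, 0, 1])

sample_mean = y.mean()
share_of_ones = np.sum(y == 1) / len(y)

print("mean:", sample_mean, "| share of ones:", share_of_ones)
```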



In [None]:
est_ols = smf.ols('default ~ balance', data=df).fit() 
# Inspect the results
print(est_ols.summary())
balance=1000
print("\n\n******")
print(f"An increase in the credit card balance by ${balance:,} is associated with, on average, \
an increase in the probability of default by {round(est_ols.params['balance']*balance*100,3)} percentage points.")


In [None]:
fig, ax= plt.subplots(1,1,figsize=(10,5))

ax = sns.regplot(x="balance", y="default", data=df, ci=None, marker='*', scatter_kws={'alpha':.05, 'color':'darkorange'}, line_kws={'color':'darkgreen'})
ax.get_xaxis().set_major_formatter( matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.axhline(0, color='k', ls=":",alpha=0.5)
ax.axhline(1, color='k', ls=":",alpha=0.5)

ax.set_xlabel("Balance in $", fontsize=16)
ax.set_ylabel("Default (Dummy Variable)", fontsize=16)
plt.show()

# What's wrong with the linear probability model?
- The relationship we estimated seems intuitive:
    - As balance increases the probability of defaulting increases (linearly).
- But remember that we now care more about our ability to predict the probability of defaulting
    - Obviously our $R^2$ is not fantastic but both our parameters are significant at the 1\% level 
    - More importantly, think about the values $\hat{Y}_i$ can take

In [None]:
balance=[bal for bal in np.arange(0,10000, 200)]
df_hat=pd.DataFrame(data={'intercept':est_ols.params['Intercept'], 'slope':est_ols.params['balance'], 'balance':balance})
df_hat['Predicted Probability of Default']=df_hat['intercept'] + df_hat['slope']*df_hat['balance']
fig, ax=plt.subplots(1,1,figsize=(10,10))

sns.lineplot(data=df_hat, x="balance", y="Predicted Probability of Default", ax=ax, color='darkgreen')

ax.set_xlim(left=0)
ax.set_ylim(bottom=df_hat['Predicted Probability of Default'].min() , top=df_hat['Predicted Probability of Default'].max())

ax.get_xaxis().set_major_formatter( matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

ax.set_xlabel("Balance in $", fontsize=16)
ax.set_ylabel("Probability of Default", fontsize=16)

ax.axhline(0, color='darkorange', ls=":")
ax.axhline(1, color='darkorange', ls=":")

ax.set_title("Predicted Probability of Default -- Linear Probability Model")

plt.axhspan(0, df_hat['Predicted Probability of Default'].min(), facecolor='0.5', alpha=0.3)
plt.axhspan(1, df_hat['Predicted Probability of Default'].max(), facecolor='0.5', alpha=0.3)

plt.text(0.2, 0.5, 'Probability',fontsize=16, horizontalalignment='center',verticalalignment='center', transform=ax.transAxes)
plt.text(0.2, .9, 'Not a Probability',fontsize=16, horizontalalignment='center',verticalalignment='center', transform=ax.transAxes)
plt.text(0.2, .015, 'Not a Probability',fontsize=16, horizontalalignment='center',verticalalignment='center', transform=ax.transAxes)

plt.show()

# Finding the right function
- Remember that our goal is to first find the correct function linking X to Y in the population 
$$Y = f(\mathbf{X}) + \varepsilon$$
- It seems that a linear function is bound to fail if the response variable is qualitative
- In our example, even a reasonable support yields predicted probabilities below 0 or above 1
    - the support is the __range of values__ over which we use our function
- Ideally, we want a function that predicts a probability between 0 and 1 even for extravagant inputs (e.g. balances of -50,000 or 2 million)
- It turns out that cumulative distribution functions fit this description

# Logistic Function
- The most commonly used cumulative distribution function for this purpose is the standard logistic function:
$$f(x) = \frac{e^x}{1+e^x}$$
- This function has a useful property for us:
    - As $x \to -\infty$, $\frac{e^x}{1+e^x} \to 0$
    - As $x \to +\infty$, $\frac{e^x}{1+e^x} \to 1$
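
These limits can be checked numerically. One practical caveat: computing `math.exp(x)` directly raises an `OverflowError` for very large `x`, so the sketch below (the function name `logistic` is just for illustration) uses the algebraically equivalent form $\frac{1}{1+e^{-x}}$ when $x \geq 0$.

```python
import math

def logistic(x):
    """Standard logistic function e^x / (1 + e^x), arranged to avoid overflow."""
    if x >= 0:
        # Divide numerator and denominator by e^x: 1 / (1 + e^(-x))
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)  # x < 0, so e^x lies between 0 and 1
    return ex / (1.0 + ex)

for x in [-1000, -5, 0, 5, 1000]:
    print(x, logistic(x))
```

Whatever the input, the output stays between 0 and 1, which is the property the linear probability model lacks.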

In [None]:
fig, axes = plt.subplots(1,2,figsize=(14,8))

X=[x for x in np.arange(-20,20,1)]
Y1=[math.exp(x) for x in X ]
Y2=[math.exp(x)/(1+math.exp(x)) for x in X ]

sns.lineplot( x=X, y=Y1 ,ax=axes[0], color='darkgreen')
sns.lineplot( x=X, y=Y2 ,ax=axes[1], color='darkgreen')

axes[0].set_xlabel(r"$X$", fontsize=20)
axes[0].set_ylabel(r"$f(X) = e^x$", fontsize=16)
axes[0].set_title("Exponential function")
axes[0].axhline(0, color='darkorange', ls=":")


axes[1].set_xlabel(r"$X$", fontsize=20)
axes[1].set_ylabel(r"$f(X) = \frac{e^x}{1+e^x} $", fontsize=16)
axes[1].set_title("standard logistic function".capitalize())
axes[1].axhline(0, color='darkorange', ls=":")
axes[1].axhline(1, color='darkorange', ls=":")

plt.show()

## From Logistic Function to Logistic Regression

- We need to transform our logistic function to a __logistic regression__:
    - Note that probability of default given balance can be written as
        - Pr(default = Yes|balance)
        - You read it as the probability (Pr) of defaulting ("default=Yes") given that ("|") balance is a given value
    - Unlike the linear probability model, Pr(default = Yes|balance) will range between 0 and 1. 
    - Then for any given value of balance, a prediction can be made for default. 
$$Pr(Y=1 | X) = p(X) = \Large\frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}} $$


## Logit (tl;dr)
After a bit of algebra (that you don't need to know about) we can rewrite the logistic regression:
$$ p(X) = \Large\frac{e^{(\beta_0 + \beta_1X)}}{1+e^{(\beta_0 + \beta_1X)}} $$

as

$$ \underbrace{\frac{p(X)}{1-p(X) }}_{odds} = \Large e^{(\beta_0 + \beta_1X)} $$
- e.g. $p(X) = 0.2$ implies odds of $\frac{0.2}{1-0.2} = \frac{1}{4}$
    
taking log of both sides yields:
$$ \underbrace{\log (\frac{p(X)}{1-p(X) })}_{\text{log-odds}} = \beta_0 + \beta_1X $$
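
For readers who do want to see the skipped algebra, here is one way to get from the logistic form to the odds. Multiply both sides by the denominator, then collect the terms in $p(X)$:

$$
\begin{aligned}
p(X)\left(1+e^{\beta_0 + \beta_1X}\right) &= e^{\beta_0 + \beta_1X} \\
p(X) &= e^{\beta_0 + \beta_1X}\left(1-p(X)\right) \\
\frac{p(X)}{1-p(X)} &= e^{\beta_0 + \beta_1X}
\end{aligned}
$$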


## Logistic Regression - Estimation

- As with OLS regression, the goal is to find the $\hat{\beta}_0$ and $\hat{\beta}_1$ that provide the best fit to the data.
- This is done using __maximum likelihood estimation__ (MLE)
    - We won't explain the MLE method in detail now; the solution is more technical than OLS
    - But MLE follows the same logic:
        - Try to find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that plugging these estimates into the model for p(X),  yields a number close to one for all individuals who defaulted, and a number close to zero for all individuals who did not.
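
For the curious, the idea can be sketched in a few lines. The code below is an illustration only, not the routine that sklearn or statsmodels actually runs: it simulates data from a logistic model with made-up coefficients (`true_beta` is an assumption of this example) and maximizes the log-likelihood with Newton's method. The function names are hypothetical.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of a logistic model; X includes a column of ones."""
    z = X @ beta
    # log p = z - log(1 + e^z) and log(1 - p) = -log(1 + e^z)
    return np.sum(y * z - np.logaddexp(0.0, z))

def fit_logit_newton(X, y, n_iter=25):
    """Maximize the log-likelihood with Newton's method, starting from zeros."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        gradient = X.T @ (y - p)                # slope of the log-likelihood
        hessian = -(X.T * (p * (1.0 - p))) @ X  # curvature of the log-likelihood
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta

# Simulated data with made-up true coefficients
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-1.0, 2.0])
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_hat = fit_logit_newton(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))

print("estimates:", beta_hat)
print("log-likelihood at zeros:    ", log_likelihood(np.zeros(2), X, y))
print("log-likelihood at estimates:", log_likelihood(beta_hat, X, y))
```

The estimates land near the coefficients used to simulate the data, and the log-likelihood is higher at the estimates than at the starting point.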

Example using [Scikit Learn's Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [None]:
%%capture --no-stdout 
# hide warnings from sklearn

# Logistic regression using sklearn
print("Shape {} and type {} before.".format(df.balance.shape, type(df.balance)))
X_train = df.balance.values.reshape(-1,1) # store the balance variable (our predictor) to a numpy array
print("Shape {} and type {} after.".format(X_train.shape, type(X_train)))
y = df.default # Store the target variable in y

# Calculate the classification probability and predicted classification.
clf = skl_lm.LogisticRegression(solver='newton-cg') # instantiate a logistic regression object (note: sklearn applies L2 regularization by default)
clf.fit(X_train,y) # fit the data using this class

# Summary of the results
beta_0= clf.intercept_[0] ; beta_1=clf.coef_[0][0]
print('\n\nsklearn function used: ',clf)
print('classes (unique values for Default): ',clf.classes_)
print('\u03B2\u0302\u2080:',beta_0)
print('\u03B2\u0302\u2081:',beta_1)

## New prediction

- We now have parameter values that can be used to predict the probability of default given balance.
- Our predicted probability will thus follow:

$\hat{Pr}(Y=1 | X) = \hat{p}(X) = \Huge\frac{e^{\hat{\beta}_0 + \hat{\beta}_1X}}{1+e^{\hat{\beta}_0 + \hat{\beta}_1X}} = \Huge\frac{e^{-10.7 + 0.0055\times X}}{1+e^{-10.7  + 0.0055\times X}} $


In [None]:
# Plot the predicted default against the actual data

balance=[bal for bal in np.arange(df.balance.min(), df.balance.max())]
df_hat=pd.DataFrame(data={'intercept':beta_0, 'slope':beta_1, 'balance':balance})
# Predict using the logistic function formula
df_hat['Predicted Probability of Default']=np.exp(df_hat['intercept'] + df_hat['slope']*df_hat['balance']) /(1+  np.exp(df_hat['intercept'] + df_hat['slope']*df_hat['balance']))

display(df_hat.head())  # preview the predictions (a bare expression mid-cell prints nothing)
# initiate plot
fig, ax=plt.subplots(1,1,figsize=(10,10))
# plot actual datapoints
sns.scatterplot(x="balance", y="default", data=df, ax=ax,alpha=.3, color='darkorange', marker='*')
# plot y hat, the predicted probability of default from our estimated model
sns.lineplot(data=df_hat, x="balance", y="Predicted Probability of Default", ax=ax, color='darkgreen')

ax.get_xaxis().set_major_formatter( matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.axhline(0, color='k',alpha=.3, linestyle=":")
ax.axhline(1, color='k',alpha=.3, linestyle=":")

ax.set_xlabel("Balance in $", fontsize=16)
ax.set_ylabel("Probability of Default", fontsize=16)
ax.set_title("Predicted Probability of Default -- Logistic Regression")
plt.show()

In [None]:
est_logit = smf.logit('default ~ balance' , data=df).fit()
print(est_logit.summary())

# Assessing the Accuracy of the Coefficient Estimates
- Many aspects of the logistic regression output shown above are similar to the linear regression output from last week. 
- Estimated parameters also have standard errors. 
- The z-statistic in the table plays the same role as the t-statistic in the linear regression output. For example:
    - the z-statistic associated with $β_1$ is equal to $\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$
    - a large (absolute) value of the z-statistic indicates evidence against the null hypothesis that the probability of default does not depend on balance 
        - H0 : $β_1$ = 0. 
        - This null hypothesis implies that $p(X) = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$
- The p-value associated with balance is small, so we can reject H0. 
- The estimated intercept is typically not of interest; its main purpose is to adjust the average fitted probabilities to the proportion of ones in the data.
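
The z-statistic and its p-value can be computed by hand from an estimate and its standard error. The sketch below uses only the standard library; the helper name `z_test` and the numbers passed to it are hypothetical, so read the actual values off your own regression table.

```python
import math

def z_test(beta_hat, std_err):
    """Return the z-statistic and two-sided p-value for H0: beta = 0."""
    z = beta_hat / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical estimate and standard error
z, p = z_test(0.0055, 0.0002)
print(f"z = {z:.1f}, p-value = {p:.3g}")

# Sanity check with the familiar 1.96 critical value
print(z_test(1.96, 1.0))
```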

In [None]:
# Classification

balance=[bal for bal in np.arange(0,10000, 200)]









In [None]:
#Comparing the three models 
est_probit = smf.probit('default ~ balance' , data=df).fit()


df['logit_predict'] = est_logit.predict()
df['ols_predict'] = est_ols.predict()
df['probit_predict'] = est_probit.predict()
df['bin'] = pd.cut(df.balance, bins=np.linspace(0, 3000, 100), include_lowest=True)


balbins = df.groupby(['bin']).mean(numeric_only=True)  # skip the text label columns when averaging

fig, ax = plt.subplots(figsize=(12,12))

ax.plot(balbins['balance'], balbins['default'], marker='o', linestyle="", label='Binned Scatterplot')
ax.plot(balbins['balance'], balbins['logit_predict'], marker='+', label="Logit")
ax.plot(balbins['balance'], balbins['probit_predict'], marker='+', label="Probit")
ax.plot(balbins['balance'], balbins['ols_predict'], marker='.', label="Linear Probability Model")


ax.set_xlabel("Credit Card Balance")
ax.set_ylabel("Average Probability of Default")

plt.title("Binned Scatterplot, Default vs. Balance", {'fontsize':30})
plt.legend(loc='upper left')

plt.show()


# Which of these models is a better classifier? 

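__TRICK QUESTION__: With a single predictor, every model's predicted probability increases with balance, so any probability cutoff is equivalent to a single balance threshold above which all observations are classified as _default_. The models differ only in where a given cutoff places that threshold.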

Last step for classifier: We need to set a threshold to determine assignment. What cutoff should we use? 
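
To make the trade-off concrete, here is a minimal sketch on toy data (the outcomes and predicted probabilities are invented for illustration, and `classification_rates` is a hypothetical helper). It uses the standard definitions: the true positive rate is the share of actual defaults that are flagged, and the false positive rate is the share of actual non-defaults that are flagged.

```python
import numpy as np

def classification_rates(y_true, p_hat, cutoff):
    """Classify as default when p_hat > cutoff and return the TPR and FPR."""
    y_true = np.asarray(y_true)
    predicted = np.asarray(p_hat) > cutoff
    tp = np.sum(predicted & (y_true == 1))    # defaults correctly flagged
    fp = np.sum(predicted & (y_true == 0))    # non-defaults wrongly flagged
    fn = np.sum(~predicted & (y_true == 1))   # defaults missed
    tn = np.sum(~predicted & (y_true == 0))   # non-defaults correctly passed
    return {"TPR": tp / (tp + fn), "FPR": fp / (fp + tn)}

# Toy outcomes and predicted probabilities
y_toy = [0, 0, 0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.6, 0.2, 0.7, 0.3]

for cutoff in [0.25, 0.5]:
    print(cutoff, classification_rates(y_toy, p_toy, cutoff))
```

Lowering the cutoff catches more true defaults but also flags more non-defaulters, so the choice depends on the relative cost of the two mistakes.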

In [None]:
errors = pd.DataFrame(data={'cutoffs': np.linspace(0,1,10).tolist()})  # ten evenly spaced cutoffs between 0 and 1

errors['Defaults'] = df[df['default']== 1].shape[0]
errors['Total Observations'] = df.shape[0]
errors['Predicted Defaults'] = [df[df['logit_predict']>x].shape[0] for x in errors['cutoffs']]
errors['Correct Predictions'] = [df[(df['logit_predict']>x) & (df['default']==1)].shape[0] for x in errors['cutoffs']]
errors['Incorrect Predictions'] = [df[(df['logit_predict']>x) & (df['default']==0)].shape[0] for x in errors['cutoffs']]
errors['False Positive Rate'] = errors['Incorrect Predictions']/(errors['Total Observations'] - errors['Defaults'])  # false positives / actual non-defaults
errors['True Positive Rate'] = errors['Correct Predictions']/errors['Defaults']

errors.head()




In [None]:

fig, ax = plt.subplots(figsize=(12,12))

ax.plot(errors['False Positive Rate'], errors['True Positive Rate'], marker='o', linestyle="-")

ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")

plt.title("ROC Curve", {'fontsize':30})


# Multiple Logistic Regression

- By analogy with the extension from simple to multiple linear regression in the last class, we can generalize our logit to $p$ predictors:
$$ \log (\frac{p(X)}{1-p(X) }) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p $$


In [None]:
df['income_thd']=df['income']/1000

In [None]:
est_mult_logistic = smf.logit('default ~ balance + income_thd + student' , data=df).fit()
print(est_mult_logistic.summary())


# Multiple Logistic Regression - Interpretation

- The p-values associated with balance and student are small
- The coefficient for the student dummy variable is negative:
    - for a fixed value of balance and income, a student is less likely to default than a non-student
- The p-value associated with income is large: we find no evidence that income is associated with default
    - conditional on balance level and student status 

In [None]:
est_mult_linear = smf.ols('default ~ balance + income_thd + student' , data=df).fit()




In [None]:
errors_mult = pd.DataFrame(data={'cutoffs': np.linspace(0,1,10).tolist()})

df['logit_predict'] = est_mult_logistic.predict()
df['ols_predict'] = est_mult_linear.predict()

errors_mult['Defaults'] = df[df['default']== 1].shape[0]
errors_mult['Total Observations'] = df.shape[0]
errors_mult['Predicted Defaults'] = [df[df['logit_predict']>x].shape[0] for x in errors_mult['cutoffs']]
errors_mult['Correct Predictions'] = [df[(df['logit_predict']>x) & (df['default']==1)].shape[0] for x in errors_mult['cutoffs']]
errors_mult['Incorrect Predictions'] = [df[(df['logit_predict']>x) & (df['default']==0)].shape[0] for x in errors_mult['cutoffs']]
errors_mult['False Positive Rate'] = errors_mult['Incorrect Predictions']/(errors_mult['Total Observations'] - errors_mult['Defaults'])  # false positives / actual non-defaults
errors_mult['True Positive Rate'] = errors_mult['Correct Predictions']/errors_mult['Defaults']

errors_mult

In [None]:
df['ols_predict'] = est_mult_linear.predict()

errors_lin = pd.DataFrame(data={'cutoffs': np.linspace(0,df['ols_predict'].max(),10).tolist()})


errors_lin['Defaults'] = df[df['default']== 1].shape[0]
errors_lin['Total Observations'] = df.shape[0]
errors_lin['Predicted Defaults'] = [df[df['ols_predict']>x].shape[0] for x in errors_lin['cutoffs']]
errors_lin['Correct Predictions'] = [df[(df['ols_predict']>x) & (df['default']==1)].shape[0] for x in errors_lin['cutoffs']]
errors_lin['Incorrect Predictions'] = [df[(df['ols_predict']>x) & (df['default']==0)].shape[0] for x in errors_lin['cutoffs']]
errors_lin['False Positive Rate'] = errors_lin['Incorrect Predictions']/(errors_lin['Total Observations'] - errors_lin['Defaults'])  # false positives / actual non-defaults
errors_lin['True Positive Rate'] = errors_lin['Correct Predictions']/errors_lin['Defaults']

errors_lin

In [None]:
fig, ax = plt.subplots(figsize=(12,12))

ax.plot(errors_mult['False Positive Rate'], errors_mult['True Positive Rate'], marker='o', linestyle="-", label='Multiple Logistic')
ax.plot(errors_lin['False Positive Rate'], errors_lin['True Positive Rate'], marker='+', linestyle="--", label='Multiple Linear')
ax.plot(errors['False Positive Rate'], errors['True Positive Rate'], marker='+', linestyle="--", label='Single Logistic')

ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")

plt.title("ROC Curve", {'fontsize':30})

plt.legend()

plt.show()


# In-Class Exercise

3. Let's say I want to target a 0.4 false-positive rate. What model gives the best predictions in this sample given a 0.4 false-positive rate? 

4. Is one of these three models the "best?" Why or why not? 