In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import copy

import sklearn.linear_model as skl_lm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn import preprocessing
from sklearn import neighbors

import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
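# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')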
import warnings
warnings.filterwarnings("ignore")

# Linear Discriminant Analysis

We are interested in classifying an observation into one of $K$ classes, with $K\geq 2$. In other words, the qualitative response variable $Y$ can take $K$ distinct, unordered values. Let $\pi_k$ denote the prior probability that a randomly chosen observation belongs to the $k$th class of $Y$, and let $f_k (x) = Pr(X = x|Y = k)$ denote the density function of $X$ for an observation that comes from the $k$th class.


By this definition, $f_k (x)$ is relatively large if there is a high probability that an observation in the $k$th class has $X \approx x$, and relatively small if it is very unlikely that an observation in the $k$th class has $X \approx x$. It follows from *Bayes’ theorem* that

$$Pr(Y = k|X = x) =\dfrac{\pi_k f_k (x)}{\sum_{j=1}^{K}\pi_j f_j (x)}$$
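As a rough illustration of the formula above, the following sketch evaluates the posterior for a single predictor and two classes. The priors, class means, and shared standard deviation are hypothetical values chosen only for this example, and the normal density is an assumption rather than something estimated from the Default data.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior(x, priors, means, sd):
    """Posterior Pr(Y = k | X = x) for every class k, via Bayes' theorem."""
    weighted = [p * normal_pdf(x, m, sd) for p, m in zip(priors, means)]
    total = sum(weighted)  # denominator: sum over j of pi_j * f_j(x)
    return [w / total for w in weighted]

# Hypothetical values, for illustration only
priors = [0.9, 0.1]  # pi_1, pi_2
means = [0.0, 2.0]   # class means of X
sd = 1.0             # standard deviation shared by both classes

# x = 1.0 is equally far from both means, so f_1(x) = f_2(x)
# and the posterior reduces to the priors.
post = posterior(1.0, priors, means, sd)
print(post)
```

Moving `x` toward one of the class means increases that class's density and therefore its posterior probability.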

## Default Example

In [2]:
df = pd.read_csv('Data/Default.csv')
df['Income']=df['Income']/1000
df.head(5)

Unnamed: 0,Default,Student,Balance,Income
0,No,No,729.526495,44.361625
1,No,Yes,817.180407,12.106135
2,No,No,1073.549164,31.767139
3,No,No,529.250605,35.704494
4,No,No,785.655883,38.463496


Student status is encoded as a dummy variable:

$$\text{Student}=\begin{cases}
  1, & \mbox{Student},\\
  0, & \mbox{Non-Student}. \\
\end{cases}$$
In code, both Default and Student are converted to 0/1 codes:

In [3]:
df2=copy.deepcopy(df)
df2['Default'] = df.Default.factorize()[0]
df2['Student'] = df.Student.factorize()[0]
df2.head(5)

Unnamed: 0,Default,Student,Balance,Income
0,0,0,729.526495,44.361625
1,0,1,817.180407,12.106135
2,0,0,1073.549164,31.767139
3,0,0,529.250605,35.704494
4,0,0,785.655883,38.463496
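One caveat worth keeping in mind: `factorize` assigns codes in order of first appearance, so `'No'` becomes 0 above only because the first row of each column is `'No'`. The sketch below uses a small made-up series (not the Default data) to show the difference from an explicit mapping, which does not depend on row order.

```python
import pandas as pd

# Made-up labels whose first value is 'Yes'
s = pd.Series(['Yes', 'No', 'No', 'Yes'])

# factorize numbers the labels by first appearance, so here 'Yes' -> 0
codes, uniques = s.factorize()
print(codes.tolist(), list(uniques))

# An explicit mapping fixes the encoding regardless of row order
explicit = s.map({'No': 0, 'Yes': 1})
print(explicit.tolist())
```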


Preparing Data:

In [4]:
# X
X = df2[['Balance', 'Income', 'Student']].to_numpy()
# True Default
True_Default = df2.Default.to_numpy()

LDA:

In [5]:
LDA = LinearDiscriminantAnalysis(solver='svd')
# Predicted Default
Predicted_Default = LDA.fit(X, True_Default).predict(X)

In [6]:
# Dataframe
df_LDA = pd.DataFrame({'True Default Status': True_Default, 'Predicted Default Status': Predicted_Default})
# replacing the dummy variables with 'Yes' and 'No'
df_LDA.replace(to_replace={0:'No', 1:'Yes'}, inplace=True)

# grouping by each category and creating a new Table
Table=df_LDA.groupby(['Predicted Default Status','True Default Status']).size().unstack('True Default Status')
Table

Predicted Default Status,True Default Status: No,True Default Status: Yes
No,9645,254
Yes,22,79


The table above is known as a **confusion matrix**. It compares the LDA predictions with the true default statuses for the 10,000 training observations in the Default data set.

Elements on the diagonal of the matrix, 9645 and 79, represent individuals whose default statuses were correctly predicted, and off-diagonal elements represent individuals who were misclassified. LDA made incorrect predictions for 22 individuals who did not default and for 254 individuals who did default.

In [7]:
Table['Total']=[sum(Table.iloc[0]), sum(Table.iloc[1])]
temp=pd.DataFrame({'No': [sum(Table.iloc[:,0])], 'Yes': [sum(Table.iloc[:,1])], 'Total': [sum(Table.iloc[:,2])]})
Table = pd.concat([Table, temp])
del temp
Table.rename(index={0:'Total'}, inplace=True)
Table

True Default Status,No,Yes,Total
No,9645,254,9899
Yes,22,79,101
Total,9667,333,10000
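A table with row and column totals can also be built in a single step with `pd.crosstab` and its `margins` option. The sketch below uses six made-up observations rather than the Default data, so the counts are only illustrative.

```python
import pandas as pd

# Made-up labels, for illustration only
true_status = pd.Series(['No', 'No', 'Yes', 'Yes', 'No', 'Yes'],
                        name='True Default Status')
pred_status = pd.Series(['No', 'Yes', 'Yes', 'No', 'No', 'Yes'],
                        name='Predicted Default Status')

# Rows: predicted status, columns: true status, plus a 'Total' row and column
table = pd.crosstab(pred_status, true_status, margins=True, margins_name='Total')
print(table)
```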


Among the 333 individuals who defaulted, LDA missed 254. When the goal is to identify high-risk individuals, an error rate of

In [8]:
Table.iloc[0,1]/Table.iloc[-1,1]*100

76.27627627627628

percent of individuals who default may well be unacceptable.

In [9]:
print(classification_report(True_Default, Predicted_Default, target_names=['No', 'Yes']))

              precision    recall  f1-score   support

          No       0.97      1.00      0.99      9667
         Yes       0.78      0.24      0.36       333

    accuracy                           0.97     10000
   macro avg       0.88      0.62      0.67     10000
weighted avg       0.97      0.97      0.97     10000



The **sensitivity**: the percentage of true defaulters that are correctly identified = $79/333 \approx 24\%$

The **specificity**: the percentage of non-defaulters that are correctly identified = $9645/9667 \approx 99.8\%$
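Both quantities follow directly from the four cells of the confusion matrix for the 0.5 threshold. The sketch below recomputes them from those counts; the variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix at the 0.5 threshold
tn, fp = 9645, 22   # true non-defaulters: predicted No / predicted Yes
fn, tp = 254, 79    # true defaulters:     predicted No / predicted Yes

sensitivity = tp / (tp + fn)                 # share of defaulters correctly flagged
specificity = tn / (tn + fp)                 # share of non-defaulters correctly cleared
error_rate = (fp + fn) / (tn + fp + fn + tp) # overall training error rate

print(f"sensitivity = {sensitivity:.1%}")
print(f"specificity = {specificity:.1%}")
print(f"overall error rate = {error_rate:.2%}")
```

In the `classification_report` output, these two values appear as the recall of the `Yes` and `No` classes, respectively.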

The Bayes classifier works by assigning an observation to the class for which the posterior probability $p_k (X) = Pr(Y = k|X)$ is greatest. In the two-class case, this amounts to assigning an observation to the default class if

$$Pr(default = Yes|X = x) > 0.5. $$

However, if we are concerned about incorrectly predicting the default status for individuals who default, then we can consider
lowering this threshold.

$$Pr(default = Yes|X = x) > \text{Decision probability} $$

For example, we can consider

$$Pr(default = Yes|X = x) > 0.2. $$

In [13]:
Decision_probability = 0.2
Predicted_Default = LDA.fit(X, True_Default).predict_proba(X)

In [14]:
# label both columns explicitly as 'Yes'/'No' instead of relying on replace() with mixed int/bool/str keys
df_LDA = pd.DataFrame({'True Default Status': np.where(True_Default == 1, 'Yes', 'No'),
                       'Predicted Default Status': np.where(Predicted_Default[:,1] > Decision_probability, 'Yes', 'No')})
Table=df_LDA.groupby(['Predicted Default Status','True Default Status']).size().unstack('True Default Status')
Table

Predicted Default Status,True Default Status: No,True Default Status: Yes
No,9435,140
Yes,232,193


In [15]:
Table['Total']=[sum(Table.iloc[0]), sum(Table.iloc[1])]
temp=pd.DataFrame({'No': [sum(Table.iloc[:,0])], 'Yes': [sum(Table.iloc[:,1])], 'Total': [sum(Table.iloc[:,2])]})
Table = pd.concat([Table, temp])
del temp
Table.rename(index={0:'Total'}, inplace=True)
Table

True Default Status,No,Yes,Total
No,9435,140,9575
Yes,232,193,425
Total,9667,333,10000


In [16]:
print(classification_report(True_Default, Predicted_Default[:,1] > Decision_probability, target_names=['No', 'Yes']))

              precision    recall  f1-score   support

          No       0.99      0.98      0.98      9667
         Yes       0.45      0.58      0.51       333

    accuracy                           0.96     10000
   macro avg       0.72      0.78      0.74     10000
weighted avg       0.97      0.96      0.96     10000



Now LDA predicts that 425 individuals will default. Of the 333 individuals
who default, LDA correctly predicts all but 140, which corresponds to an error rate (in percent) of

In [17]:
Table.iloc[0,1]/Table.iloc[-1,1]*100

42.04204204204204

This is a vast improvement over the error rate of $76\%$ that resulted from using the threshold of $50\%$. However, this improvement comes at a cost: 232 individuals who do not default are now incorrectly classified, up from 22 at the $50\%$ threshold.
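The trade-off can be made explicit by computing both error types for several thresholds. The sketch below is a minimal illustration on ten made-up observations with made-up posterior probabilities; `rates_at_threshold` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np

def rates_at_threshold(y_true, prob_yes, threshold):
    """Return (miss rate among defaulters, false-alarm rate among non-defaulters)."""
    y_true = np.asarray(y_true, dtype=bool)
    pred = np.asarray(prob_yes) > threshold
    tp = np.sum(pred & y_true)     # defaulters flagged
    fn = np.sum(~pred & y_true)    # defaulters missed
    fp = np.sum(pred & ~y_true)    # non-defaulters flagged
    tn = np.sum(~pred & ~y_true)   # non-defaulters cleared
    return fn / (fn + tp), fp / (fp + tn)

# Made-up labels (1 = default) and posterior probabilities, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
prob_yes = [0.9, 0.6, 0.3, 0.1, 0.4, 0.25, 0.15, 0.1, 0.05, 0.02]

for t in (0.5, 0.2):
    miss, false_alarm = rates_at_threshold(y_true, prob_yes, t)
    print(f"threshold {t}: miss rate {miss:.2f}, false-alarm rate {false_alarm:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.2 reduces the miss rate while raising the false-alarm rate, which mirrors the pattern seen in the two confusion matrices above.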