# Logistic Regression

Logistic regression is a statistical method for predicting a binary outcome, such as yes or no, from prior observations in a data set. It is used when the dependent variable (target) is categorical. Just as linear regression assumes that the data follow a linear function, logistic regression models the probability of the outcome using the sigmoid function.
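
As a minimal sketch of the sigmoid function mentioned above: it maps any real-valued score to a value between 0 and 1, which is then read as a probability. The `sigmoid` helper and the `scores` array are illustrative names and values, not part of the restatements analysis.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (intercept + coefficients * features)
scores = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(scores)

# A score of 0 maps to 0.5; negative scores map below 0.5, positive above
print(probabilities)
```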

Logistic regression is classified into three types, based on the number and ordering of the outcome categories:

- __Binary Logistic Regression__: The categorical response has only two possible outcomes.
- __Multinomial Logistic Regression__: Three or more categories without ordering. Example: predicting which food is preferred (Veg, Non-Veg, Vegan).
- __Ordinal Logistic Regression__: Three or more categories with ordering. Example: a movie rating from 1 to 5.

Below, logistic regression is performed on the 'restatements' dataset.

- Features: fyear, roa, size, mtb, sic2 (the models below use roa, size and mtb)
- Dependent variable: restatement

__Reference__

- https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
- https://www.geeksforgeeks.org/understanding-logistic-regression/
- https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# read dataset
dataset = pd.read_excel(r'..\datasets\restatements.xlsx')
dataset.head()

### Dependent variable

The dependent variable is an indicator variable that is 1 if the financial statement for that year is later restated, and 0 otherwise.
A restatement means fixing an error (the error can be intentional, but does not have to be).

In [None]:
# restatements are somewhat scarce (4% of the obs)
dataset[['restatement']].value_counts()

In [None]:
Y = dataset['restatement'].values
Y

### Independent variables

Independent variables: roa, size and mtb

In [None]:
# independent variables; .values gives a numpy array (the datatype that sklearn expects)
X = dataset[['roa', 'size', 'mtb']].values
X

### Sklearn: test and train sample and model training

In [None]:
from sklearn.model_selection import train_test_split
# 70% of the obs used for training, 30% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)

In [None]:
print("#obs for training", len(X_train), "#obs for testing", len(X_test) )
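
Because restatements are scarce, a purely random split can leave the training and test samples with different shares of 1s. The sketch below shows one possible variation, not what the notebook does: `train_test_split` with `stratify` keeps the class shares equal, and `random_state` makes the split reproducible. The data are synthetic stand-ins (`X_demo`, `Y_demo` with 4% positives), and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 observations, 4% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
Y_demo = np.zeros(1000, dtype=int)
Y_demo[:40] = 1

# stratify keeps the share of 1s (roughly) the same in both samples;
# random_state fixes the shuffle so the split can be reproduced
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.3, stratify=Y_demo, random_state=42
)

print("share of 1s in train:", Y_tr.mean(), "in test:", Y_te.mean())
```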

## statsmodels and scikit-learn logistic regression

Just like with OLS, there are two main packages for logistic regression: statsmodels and scikit-learn (sklearn).

### statsmodels

In [None]:
import statsmodels.api as sm

# set up without a constant (the default: sm.Logit does not add an intercept)
# note that unlike with OLS there is no 'nice model' to edit (instead, we pass in the data)
# logit_model = sm.Logit(Y_train, X_train)

# set up with an intercept (constant)
logit_model = sm.Logit(Y_train, sm.add_constant(X_train))

# train the model
result = logit_model.fit()

# print summary
print(result.summary())

### sklearn

In [None]:
from sklearn.linear_model import LogisticRegression

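# new logistic regression object (the default applies L2 regularization)
# clf = LogisticRegression()
# disable sklearn regularization so the estimates are comparable to statsmodels
# see https://stats.stackexchange.com/questions/203740/logistic-regression-scikit-learn-vs-statsmodels
# penalty=None needs scikit-learn 1.2 or newer; older versions used the string penalty='none'
clf = LogisticRegression(penalty=None)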

# run model
clf.fit(X_train, Y_train.ravel())

In [None]:
# coefficients
print('Intercept', clf.intercept_)
print('Coefficients', clf.coef_)

In [None]:
# fitted values on test sample
Y_pred = clf.predict(X_test)
Y_pred

In [None]:
# these are the probabilities of being '0' vs '1' (each row adds up to 1)
# predict picks the class with the highest probability (in this case all 0)
y_prob = clf.predict_proba(X_test)
y_prob
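
To make the link between the coefficients, the sigmoid function and `predict_proba` concrete, here is a small sketch on synthetic data. The names `X_demo`, `Y_demo`, `demo_clf` and the `threshold` value of 0.3 are assumptions for illustration. It checks that the probability of class 1 equals the sigmoid of the linear score, and shows how a cutoff other than 0.5 could be applied when the default prediction is 0 for nearly every observation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: three features and a binary outcome drawn from a logistic model
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))
true_score = -1.0 + X_demo @ np.array([1.5, -0.5, 0.8])
Y_demo = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_score))).astype(int)

demo_clf = LogisticRegression().fit(X_demo, Y_demo)

# Column 1 of predict_proba holds the probability of class 1
prob_1 = demo_clf.predict_proba(X_demo)[:, 1]

# The same probability computed by hand: sigmoid(intercept + X @ coefficients)
linear_score = demo_clf.intercept_[0] + X_demo @ demo_clf.coef_[0]
manual_prob_1 = 1.0 / (1.0 + np.exp(-linear_score))
print("matches predict_proba:", np.allclose(prob_1, manual_prob_1))

# predict() corresponds to a 0.5 cutoff; a lower, hypothetical cutoff flags more 1s
threshold = 0.3
Y_pred_custom = (prob_1 >= threshold).astype(int)
print("predicted 1s at 0.5:", demo_clf.predict(X_demo).sum(),
      "at", threshold, ":", Y_pred_custom.sum())
```

Lowering the cutoff trades more false positives for fewer false negatives, so the appropriate value depends on the relative cost of each error.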

In [None]:
from sklearn.metrics import accuracy_score
# accuracy compares predicted values (0 or 1) vs actual values (0 or 1)
print('Test Accuracy', accuracy_score(Y_test, Y_pred))
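
Accuracy can look deceptively good when one class is rare. As a sketch with made-up labels (`y_true_demo` has 4 positives out of 100, mirroring the roughly 4% restatement rate), a model that always predicts 0 is right 96% of the time without identifying a single restatement.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 4 restatements out of 100 observations
y_true_demo = np.array([1] * 4 + [0] * 96)

# A "model" that always predicts 0
y_pred_demo = np.zeros(100, dtype=int)

baseline_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print("Accuracy of always predicting 0:", baseline_accuracy)
```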

## Confusion Matrix

To get more information on the accuracy of the model, a confusion matrix is used. In the case of binary classification, the confusion matrix shows the numbers of the following:

- True negatives in the upper-left position
- False negatives in the lower-left position
- False positives in the upper-right position
- True positives in the lower-right position
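
The layout above can be checked on a small example. The labels below are made up for illustration. Flattening the 2x2 matrix with `ravel()` yields the four counts in the order true negatives, false positives, false negatives, true positives, from which precision and recall follow.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Precision: share of predicted 1s that are correct
precision = tp / (tp + fp)
# Recall: share of actual 1s that were found
recall = tp / (tp + fn)
print("precision", precision, "recall", recall)
```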

In [None]:
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(Y_test,Y_pred)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=matrix,display_labels=clf.classes_)
disp.plot()
plt.show()