# Binomial Logistic Regression
---

1.   **[Introduction to Logistic Regression](#1.-Introduction-to-Logistic-Regression)**
2.   **[Foundations of Logistic Regression](#2.-Foundations-of-Logistic-Regression)**
3.   **[Model Assumptions](#3.-Model-Assumptions)**
4.   **[Model Interpretation](#4.-Model-Interpretation)**
5.   **[Exploratory Data Analysis](#5.-Exploratory-Data-Analysis)**
6.   **[Model Construction](#6.-Model-Construction)**
7.   **[Model Results](#7.-Model-Results)**
8.   **[Model Evaluation](#8.-Model-Evaluation)**
9.   **[Conclusions](#9.-Conclusions)**

---
<a name="1.-Introduction-to-Logistic-Regression"></a>
### 1. Introduction to Logistic Regression

#### 1.1 Definitions

**Logistic Regression |** Technique that models a categorical dependent variable based on one or more independent variables 
- Used to estimate the probability of an outcome

**Dependent variable (Y) |** The variable a given model estimates, also referred to as a response or outcome variable

**Independent variable (X) |** A variable that explains trends in the dependent variable, also referred to as an explanatory or predictor variable.

**Logistic Regression Model |** 
- $\mu\{Y|X\} = Prob(Y = 1|X) = p$
- $g(p) = \beta_0 + \beta_1X$ - (Link Function)

**Link Function |** A nonlinear function that connects or links the dependent variable to the independent variable mathematically 

---
<a name="2.-Foundations-of-Logistic-Regression"></a>
### 2. Foundations of Logistic Regression

#### 2.1 Binomial Logistic Regression


A technique that models the probability of an observation falling into one of two categories, based on one or more independent variables.
- This is a classification model

**Binomial logistic regression linearity assumption |** There should be a linear relationship between each X variable and the logit of the probability that Y equals 1

**Odds |** The probability of an outcome occurring divided by the probability of it not occurring
- $Odds = \frac{p}{1-p}$

**Logit (log-odds) |** The logarithm of the odds of a given probability. The logit of probability p is equal to the logarithm of p divided by 1 minus p.
- $logit(p) = \log(\frac{p}{1-p})$

**Logit in terms of X variables |** 
- $logit(p) = \log(\frac{p}{1-p}) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n$
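
The odds, logit, and inverse-logit relationships can be sketched in a few lines of plain Python. The coefficient values `b0` and `b1` and the input `x` below are hypothetical numbers chosen only for illustration; they do not come from any fitted model.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log-odds of probability p."""
    return math.log(odds(p))

def inverse_logit(z):
    """Map a log-odds value z back to a probability (the sigmoid function)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients for a one-predictor model: logit(p) = b0 + b1 * x
b0, b1 = -1.5, 0.8
x = 2.0

# Convert the linear predictor (log-odds) into a probability
p = inverse_logit(b0 + b1 * x)
print("Probability that Y = 1:", round(p, 4))
print("Log-odds recovered from p:", round(logit(p), 4))
```

Applying `logit` to the resulting probability returns the original linear predictor, which is why the model is linear on the log-odds scale even though it is not linear on the probability scale.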

**Maximum Likelihood Estimation (MLE) |** A technique for estimating the beta parameters that maximize the likelihood of the model producing the observed data
- The best logistic regression model estimates the set of beta coefficients that maximizes the likelihood of observing all of the sample data. 
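
To make the idea of maximizing the likelihood concrete, here is a minimal sketch that fits a one-predictor model by plain gradient ascent on the log-likelihood. The data is synthetic, and the step size and iteration count are assumptions chosen for this toy example. Library implementations such as scikit-learn use more sophisticated solvers.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of coefficients beta for design matrix X and labels y."""
    z = X @ beta
    # Uses log(p) = z - log(1 + e^z) and log(1 - p) = -log(1 + e^z)
    return float(np.sum(y * z - np.logaddexp(0, z)))

# Synthetic data (an assumption for illustration): an intercept column plus one predictor
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-0.5, 1.5])
y = (rng.random(500) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)

# Gradient ascent: the gradient of the log-likelihood is X^T (y - p)
beta = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ beta)))
    beta += 0.5 * X.T @ (y - p) / len(y)

print("Estimated coefficients:", beta)
print("Log-likelihood at all-zero coefficients:", log_likelihood(np.zeros(2), X, y))
print("Log-likelihood at the estimate:", log_likelihood(beta, X, y))
```

The log-likelihood at the estimated coefficients is higher than at the starting point, which is the sense in which MLE picks the "best" coefficients.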

---
<a name="3.-Model-Assumptions"></a>
### 3. Model Assumptions

Model assumptions are statements about the data that must be true in order to justify the use of a particular modeling technique



#### 3.1 Logistic Regression Assumptions
- **Linearity**
- **Independent Observations**
- **No Multicollinearity**
- **No Extreme Outliers**


##### 3.1.1 Linearity

Each predictor variable $(X_i)$ is linearly related to the logit of the probability that the outcome variable $(Y)$ equals 1

##### 3.1.2 Independent Observations 

Each observation in the dataset is independent

##### 3.1.3 No Multicollinearity

No two independent variables ($X_i$ and $X_j$) can be highly correlated with each other.
- The variance inflation factor (VIF) quantifies how correlated each independent variable is with all of the other independent variables

##### 3.1.4 No Extreme Outliers

These can be identified once the model has been constructed


#### 3.2 Assumption Violations

*`Lacking information on dealing with extreme outliers`*



##### 3.2.1 Linearity
**Transform one or both of the variables**, such as taking the logarithm.
- For example, if measuring the relationship between years of education and income, take the logarithm of the income variable and check if that helps the linear relationship.

##### 3.2.2 Independent Observations
**Take just a subset of the available data.**
- If, for example, the data is a survey that includes responses from people in the same household, their responses may be correlated. Correct for this by keeping the data of just one person in each household.
- Another example is data on bike rentals over a time period. If the data is collected every 15 minutes, the number of bikes rented out at 8:00 a.m. might correlate with the number rented out at 8:15 a.m. The observations may be closer to independent if the data is taken once every 2 hours instead of once every 15 minutes.
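
Both subsetting ideas can be sketched with pandas. The DataFrames and column names below (`household_id`, `respondent`, `bikes_rented`, and so on) are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical survey data: several respondents can share a household_id
survey = pd.DataFrame({
    "household_id": [1, 1, 2, 3, 3, 3],
    "respondent": ["a", "b", "c", "d", "e", "f"],
    "response": [1, 1, 0, 1, 0, 1],
})

# Keep only the first respondent from each household
one_per_household = survey.drop_duplicates(subset="household_id", keep="first")
print(one_per_household)

# Hypothetical bike rental counts recorded every 15 minutes
rentals = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01 08:00", periods=16, freq="15min"),
    "bikes_rented": range(16),
})

# Keep every 8th row, which is one observation every 2 hours
every_two_hours = rentals.iloc[::8]
print(every_two_hours)
```

Which row to keep per household (first, random, or some other rule) is a design choice; `keep="first"` is used here only because it is the simplest.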

##### 3.2.3 No Multicollinearity
**Drop Variables**
- Drop one or more variables that have high multicollinearity
- Strategic variable selection:
    - Forward Selection
    - Backward elimination 
- Advanced variable selection:
    - Ridge regression
    - Lasso regression
    - Elastic-net regression
    - Principal component analysis (PCA)

**Create new Variables**
- Use existing data to create new variables

---
<a name="4.-Model-Interpretation"></a>
### 4. Model Interpretation

#### 4.1 Confusion Matrix

A graphical representation of how accurate a classifier is at predicting the labels for a categorical variable 

####  4.2 Evaluation Metrics 
- Precision 
- Recall
- Accuracy

##### 4.2.1 Precision
The proportion of positive predictions that were true positives

$Precision = \frac{True Positives}{True Positives + False Positives}$

##### 4.2.2 Recall
The proportion of positives the model was able to identify correctly 

$Recall = \frac{True Positives}{True Positives + False Negatives}$

##### 4.2.3 Accuracy
The proportion of data points that were correctly categorized

$Accuracy = \frac{True Positives + True Negatives}{Total Predictions}$
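
As a worked example of the three formulas, the snippet below computes precision, recall, and accuracy directly from confusion-matrix counts. The counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tp, fp, fn, tn = 40, 10, 20, 30

# Precision: of everything predicted positive, how much was actually positive
precision = tp / (tp + fp)

# Recall: of everything actually positive, how much the model found
recall = tp / (tp + fn)

# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("Precision:", precision)
print("Recall:", round(recall, 3))
print("Accuracy:", accuracy)
```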

##### 4.2.4 ROC Curves (Receiver Operating Characteristic)
Used to visualize the performance of a classifier at different classification thresholds on a graph. In binary classification, a classification threshold is the cutoff for differentiating the positive class from the negative class.
- Plots two key concepts
    1. True Positive Rate
    2. False Positive Rate

**The more that the ROC curve hugs the top left corner of the plot, the better the model does at classifying the data.**

##### 4.2.5 AUC 
Stands for Area Under the ROC Curve. Provides an aggregate measure of performance across all possible classification thresholds. AUC ranges in value from 0.0 to 1.0. A model whose predictions are 100% wrong has an AUC of 0.0, and a model whose predictions are 100% correct has an AUC of 1.0.
- An AUC of 0.5 is what a random classifier achieves, so an AUC of less than 0.5 indicates that the model performs worse than a random classifier 
- Python function: `metrics.roc_auc_score(y_test, y_score)`, where `y_score` would typically hold the predicted probabilities for the positive class, for example `clf.predict_proba(X_test)[:, 1]`
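
A small sketch of computing the ROC curve and AUC with scikit-learn. The labels and scores are a tiny made-up example. In practice the scores would usually be predicted probabilities for the positive class rather than hard 0/1 predictions, so that every threshold can be evaluated.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Tiny made-up example: true labels and predicted scores for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve
auc = roc_auc_score(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc)
```

Here three of the four positive/negative pairs are ranked correctly (only the positive scored 0.35 falls below the negative scored 0.4), which gives an AUC of 0.75.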

#### 4.3 Considerations when choosing metrics

##### 4.3.1 When to use Precision
**Using precision as an evaluation metric is especially helpful in contexts where the cost of a false positive is quite high and much higher than the cost of a false negative.** For example, in the context of email spam detection, a false positive (predicting a non-spam email as spam) would be more costly than a false negative (predicting a spam email as non-spam). A non-spam email that is misclassified could contain important information, such as project status updates from a vendor to a client or assignment deadline announcements from an instructor to a class of students. 

##### 4.3.2 When to use Recall
**Using recall as an evaluation metric is especially helpful in contexts where the cost of a false negative is quite high and much higher than the cost of a false positive.** For example, in the context of fraud detection among credit card transactions, a false negative (predicting a fraudulent credit card charge as non-fraudulent) would be more costly than a false positive (predicting a non-fraudulent credit card charge as fraudulent). A fraudulent credit card charge that is misclassified could lead to the customer losing money, undetected.

##### 4.3.3 When to use Accuracy
**It is helpful to use accuracy as an evaluation metric when you specifically want to know how much of the data at hand has been correctly categorized by the classifier.** Another scenario to consider: **accuracy is an appropriate metric to use when the data is balanced, in other words, when the data has a roughly equal number of positive examples and negative examples. Otherwise, accuracy can be biased.** For example, imagine that 95% of a dataset contains positive examples, and the remaining 5% contains negative examples. Then you train a logistic regression classifier on this data and use this classifier to predict on this data. If you get an accuracy of 95%, that does not necessarily indicate that this classifier is effective. Since there is a much larger proportion of positive examples than negative examples, the classifier may be biased towards the majority class (positive) and thus the accuracy metric in this context may not be meaningful. When the data you are working with is imbalanced, consider either transforming it to be balanced or using an evaluation metric other than accuracy. 
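
The 95%/5% scenario can be reproduced in a few lines. The "classifier" below is a deliberately trivial stand-in that always predicts the majority class; it is an assumption for illustration, not a trained model.

```python
# Hypothetical labels: 95 positives and 5 negatives, matching the scenario described above
y_true = [1] * 95 + [0] * 5

# A "classifier" that ignores its input and always predicts the majority (positive) class
y_pred = [1] * 100

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the negative (minority) class: share of actual negatives predicted as negative
negatives_found = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
negative_recall = negatives_found / y_true.count(0)

print("Accuracy:", accuracy)
print("Recall on the negative class:", negative_recall)
```

Accuracy comes out at 0.95 even though not a single negative example is identified, which is why accuracy alone can mislead on imbalanced data.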

---
<a name="5.-Exploratory-Data-Analysis"></a>
### 5. Exploratory Data Analysis

#### 5.1 Imports

In [None]:
# Standard operational package imports.
import numpy as np
import pandas as pd

# Important imports for preprocessing, modeling, and evaluation.
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics

# Visualization package imports.
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset into a DataFrame and save in a variable
data = pd.read_csv("example_file.csv")

#### 5.2 Data Exploration

In [None]:
# Display the first 10 rows of the data
data.head(10)

In [None]:
# Display number of rows, number of columns
data.shape

In [None]:
# Display the data type for each column. NB logistic regression models expect numeric data
data.dtypes

#### 5.3 Missing Data

##### 5.3.1 Check for missing data

In [None]:
# Step 1. Start with .isna() to get booleans indicating whether each value in the data is missing
data.isna()

In [None]:
# Step 2. Use .any(axis=1) to get booleans indicating whether there are any missing values along the columns in each row
data.isna().any(axis=1)

In [None]:
# Step 3. Use .sum() to get the number of rows that contain missing values
data.isna().any(axis=1).sum()

##### 5.3.2 Drop missing data

In [None]:
# Step 1. Use .dropna(axis=0) to indicate that you want rows which contain missing values to be dropped
# Step 2. To update the DataFrame, reassign it to the result
data = data.dropna(axis=0)

# This is generally a good place to copy the cleaned data into a new variable (for example, data_subset)
# so that the original DataFrame is left untouched. The name data is kept here for simplicity.

In [None]:
# Check to make sure that the data does not contain any rows with missing values now

# Step 1. Start with .isna() to get booleans indicating whether each value in the data is missing
# Step 2. Use .any(axis=1) to get booleans indicating whether there are any missing values along the columns in each row
# Step 3. Use .sum() to get the number of rows that contain missing values
data.isna().any(axis=1).sum()

##### 5.3.3 Convert Categorical Data to Numerical

In [None]:
# create list of columns that need to be encoded
columns_to_encode = ['cat_column_1', 'cat_column_2']

# instantiate new df from the encoded df
df_new = pd.get_dummies(data, columns=columns_to_encode, drop_first=True)

df_new.head()

##### 5.3.4 Create training, validation and test data

In [None]:
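# Separate the dataset into labels (y) and features (X).
# 'left' is the name of the outcome column in this example; replace it with your own target column.
y = df_new['left']

X = df_new.copy()
X = X.drop('left', axis = 1)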

# Separate into train, validate, test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, stratify= y, random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.30, stratify= y_train, random_state = 0)
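
To see what the two successive 30% splits produce, here is a sketch on synthetic stand-in data (1,000 rows with 25% positive labels, both assumptions for illustration). Holding out 30% and then holding out 30% of the remainder gives roughly a 49% / 21% / 30% train / validate / test split, and `stratify` keeps the class balance similar in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 25% positive labels
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 250 + [0] * 750)

# First split: hold out 30% of all rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Second split: hold out 30% of the remaining rows as the validation set
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.30, stratify=y_train, random_state=0)

# Report the size and positive-class share of each part
for name, part in [("train", y_tr), ("validate", y_val), ("test", y_test)]:
    print(name, len(part), round(part.mean(), 3))
```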

---
<a name="6.-Model-Construction"></a>
### 6. Model Construction

In [None]:
# Fit a LogisticRegression model to the training data
clf = LogisticRegression().fit(X_train,y_train)

In [None]:
# Obtain parameter estimates
clf.coef_

In [None]:
clf.intercept_
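
Because the model is linear in the log-odds, exponentiating a coefficient gives an odds ratio: the factor by which the odds of Y = 1 are multiplied for a one-unit increase in that predictor, holding the others fixed. The sketch below uses synthetic data and a separate `model` variable (both assumptions for illustration). Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, which shrinks coefficients somewhat.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): one predictor with a positive effect
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = (rng.random(500) < 1 / (1 + np.exp(-(0.5 + 1.2 * x[:, 0])))).astype(int)

model = LogisticRegression().fit(x, y)

# coef_ has shape (1, n_features) for a binary problem; take the first row
odds_ratios = np.exp(model.coef_[0])
print("Coefficient (log-odds scale):", model.coef_[0])
print("Odds ratio:", odds_ratios)
```

An odds ratio above 1 means the odds of the positive class increase with the predictor, and an odds ratio below 1 means they decrease.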

In [None]:
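# Plot a fitted logistic curve for a single predictor using the seaborn package
# (regplot handles one independent variable at a time, so it does not visualize the full multivariable model)
# Replace the placeholder strings with real column names; logistic=True requires statsmodels to be installed
sns.regplot(x="independent variable", y="dependent variable", data=data, logistic=True, ci=None)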

**Question to Answer:** What observations can be made from this initial visualization?

---
<a name="7.-Model-Results"></a>
### 7. Model Results

In [None]:
# Predict the outcome for the test dataset
# Save predictions and print predictions
y_pred = clf.predict(X_test)
print(y_pred)

In [None]:
# Use predict_proba to output a probability.
clf.predict_proba(X_test)

In [None]:
# Use predict to output 0's and 1's.
clf.predict(X_test)
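
The link between `predict_proba` and `predict` can be sketched as follows: for a binary model, `predict` corresponds to labeling an observation positive when its predicted probability exceeds 0.5, and a different threshold can be applied by hand. The data, the `model` variable, and the 0.8 threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (an assumption for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Column 1 of predict_proba holds the probability of the positive class (label 1)
proba_positive = model.predict_proba(X_demo)[:, 1]

# Labels at the default 0.5 threshold, and at a stricter hypothetical threshold of 0.8
default_labels = (proba_positive > 0.5).astype(int)
strict_labels = (proba_positive > 0.8).astype(int)

print("Positives at threshold 0.5:", default_labels.sum())
print("Positives at threshold 0.8:", strict_labels.sum())
```

Raising the threshold generally trades recall for precision, since fewer observations are labeled positive.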

In [None]:
# Analyze the results by printing accuracy, precision, recall and F1 score

print("Accuracy:", "%.6f" % metrics.accuracy_score(y_test, y_pred))
print("Precision:", "%.6f" % metrics.precision_score(y_test, y_pred))
print("Recall:", "%.6f" % metrics.recall_score(y_test, y_pred))
print("F1 Score:", "%.6f" % metrics.f1_score(y_test, y_pred))

In [None]:
# Produce a confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred, labels = clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix = cm,display_labels = clf.classes_)
disp.plot()
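
The four cells of a binary confusion matrix can also be unpacked into named counts. With labels 0 and 1, scikit-learn orders the matrix as `[[TN, FP], [FN, TP]]`. The labels below are a tiny made-up example.

```python
from sklearn.metrics import confusion_matrix

# Tiny made-up example
y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```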

---
<a name="8.-Model-Evaluation"></a>
### 8. Model Evaluation

#### 8.1 Model Assumptions Check

##### 8.1.1 Linearity Check

In [None]:
# Create a scatterplot for each independent variable and the dependent variable.

# Create a 1x2 plot figure.
fig, axes = plt.subplots(1, 2, figsize = (8,4))

# Create a scatterplot between X_1 and Y.
sns.scatterplot(x = data['X_1'], y = data['Y'], ax=axes[0])

# Set the title of the first plot.
axes[0].set_title("X_1 and Y")

# Create a scatterplot between X_2 and Y.
sns.scatterplot(x = data['X_2'], y = data['Y'], ax=axes[1])

# Set the title of the second plot.
axes[1].set_title("X_2 and Y")

# Set the xlabel of the second plot.
axes[1].set_xlabel("X_2")

# Use matplotlib's tight_layout() function to add space between plots for a cleaner appearance.
plt.tight_layout()

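**Question to answer |** Is there a clear linear relationship between each X variable and the logit of the probability that Y equals 1? If yes, then the assumption is met. Because Y is binary, a raw scatterplot of X against Y shows only two horizontal bands of points, so linearity is easier to judge against the log-odds than against Y itself.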
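
A minimal sketch of one way to assess linearity against the log-odds: bin a predictor, compute the share of Y = 1 in each bin, and compare the bin means with the empirical logit. The data is synthetic, and the choice of 10 bins and the 0.5 continuity correction are assumptions for illustration rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): the true logit is linear in X_1
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = (rng.random(5000) < 1 / (1 + np.exp(-(0.3 + 1.0 * x)))).astype(int)
df_demo = pd.DataFrame({"X_1": x, "Y": y})

# Split X_1 into 10 equal-sized bins and count the positives in each bin
df_demo["bin"] = pd.qcut(df_demo["X_1"], q=10)
grouped = df_demo.groupby("bin", observed=True).agg(
    x_mean=("X_1", "mean"), n=("Y", "size"), positives=("Y", "sum"))

# Empirical logit, with a 0.5 correction so bins containing only 0s or only 1s stay finite
grouped["empirical_logit"] = np.log(
    (grouped["positives"] + 0.5) / (grouped["n"] - grouped["positives"] + 0.5))

print(grouped[["x_mean", "empirical_logit"]])
```

Plotting `x_mean` against `empirical_logit` (for example with `sns.scatterplot`) should show points close to a straight line when the linearity assumption is reasonable.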

##### 8.1.2 Independence Check

The independent observation assumption states that each observation in the dataset is independent. If each row (for example, each marketing promotion) was collected independently of the others, the independence assumption is not violated.
- Consider whether each row of data is independent of the others; if so, the independence assumption is not violated.

##### 8.1.3 Multicollinearity Check

**Two common ways to check for multicollinearity are to:**

* Create scatterplots to show the relationship between pairs of independent variables
* Use the variance inflation factor to detect multicollinearity

In [None]:
# Create a pairplot of the data.
sns.pairplot(data)

**Question to answer:**

1. Are the independent variables visibly linearly correlated?
    - If no then assumption met

In [None]:
# Calculate the variance inflation factor

# Import variance_inflation_factor from statsmodels.
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a subset of the data with the continuous independent variables. 
X = data[['X_1','X_2']]

# Calculate the variance inflation factor for each variable.
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Create a DataFrame with the VIF results for the column names in X.
df_vif = pd.DataFrame(vif, index=X.columns, columns = ['VIF'])

# Display the VIF results.
df_vif

**Interpreting the results:**

A VIF value of 1 indicates that there is no correlation between the predictor variables, while a value greater than 1 indicates that there is some correlation. Generally, a VIF value of 5 or above is considered to indicate a high degree of multicollinearity.



##### 8.1.4 Check for extreme outliers
Note that the assumption of no extreme outliers is not a formal assumption of logistic regression. However, extreme outliers can have a disproportionate impact on the model's results and influence the estimates of the coefficients. Therefore, it is good practice to check for outliers in any regression analysis.

In [None]:
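# scikit-learn's LogisticRegression has no get_influence() method, so refit the model with statsmodels
# (this fit is unregularized, so its coefficients can differ somewhat from clf)
import statsmodels.api as sm
X_sm = sm.add_constant(X_train.astype(float))
logit_model = sm.GLM(y_train, X_sm, family=sm.families.Binomial()).fit()

# Obtain the standardized (studentized) residuals for each observation
std_resid = logit_model.get_influence().resid_studentized

# Create a plot of the standardized residuals against the predicted values
pred_vals = logit_model.predict(X_sm)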
plt.scatter(pred_vals, std_resid)
plt.axhline(y=2, color='r', linestyle='--')
plt.axhline(y=-2, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Standardized Residuals')
plt.title('Plot of Standardized Residuals vs. Predicted Values')
plt.show()

# Look for any points that fall outside the range of +/- 2 or +/- 3 for the standardized residuals. These points may be considered as potential outliers.
# If potential outliers found, investigate them further to determine whether they are genuine outliers or data errors. You may want to remove them from the analysis if they are confirmed as outliers.


---
<a name="9.-Conclusions"></a>
### 9. Conclusions


#### 9.1 Drawing conclusions

1. Interpret the confusion matrix
    - Is there a significant difference between the number of false positives and the number of false negatives that the model produced? 

2. What could be done to improve the model?
