# Logistic Regression

## Why Not Linear Regression?

Linear regression is a fundamental statistical technique that is widely used for modeling the relationship between a dependent variable and one or more independent variables. It has its strengths and weaknesses, and there are situations where it may not be the best choice. Here are some reasons why linear regression might not be suitable [James et al., 2023]:

1. **Nonlinear Relationships:** Linear regression assumes a linear relationship between the independent and dependent variables. If the true relationship is nonlinear, using linear regression can lead to poor model fit and inaccurate predictions.

2. **Complex Interactions:** Linear regression is limited in handling complex interactions between variables. If the relationship involves intricate interactions, nonlinear effects, or higher-order terms, linear regression may not capture these nuances effectively.

3. **Assumption Violations:** Linear regression assumes that the residuals (the differences between observed and predicted values) are normally distributed and have constant variance. If these assumptions are violated, the results and inferences from linear regression may be unreliable.

4. **Outliers and Influential Points:** Linear regression is sensitive to outliers and influential data points. A single outlier can significantly affect the regression line and the estimated coefficients.

5. **Categorical and Binary Variables:** While linear regression can handle continuous and numeric independent variables, it may struggle with categorical or binary predictors without appropriate encoding or treatment.

6. **Multi-Collinearity:** When the independent variables in a linear regression model are highly correlated with each other, it can lead to multicollinearity issues. This makes it challenging to interpret the individual contributions of these variables.

7. **Limited in Handling Non-Parametric Data:** Linear regression is a parametric method, meaning it makes assumptions about the underlying data distribution. If the data is non-parametric or has heavy-tailed distributions, linear regression may not be appropriate.

In cases where these limitations are significant, other techniques such as nonlinear regression, generalized linear models, decision trees, support vector machines, neural networks, or other advanced statistical and machine learning methods may be more suitable. The choice of the appropriate modeling technique depends on the nature of the data, the research question, and the specific goals of the analysis.

## Logistic Regression: Predicting Binary Outcomes

Logistic regression is a statistical method tailored to binary classification tasks. Its primary objective is to predict a binary outcome, such as yes/no or 0/1, from one or more predictor variables. This relationship is represented by the logistic regression model [James et al., 2023].

### Logistic Model

At the core of logistic regression is a model that combines the predictor variables linearly and then passes the result through the logistic function. The aim is to predict the probability of the positive class (class 1) in a binary response setting. The model is given by the following equation [James et al., 2023]:

\begin{align}
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p,
\end{align}

In this equation:
- $X = (X_1, ..., X_p)$ constitutes the vector of predictor variables (features).
- $\beta_0, \beta_1, ..., \beta_p$ take on the roles of coefficients, encompassing both the intercept and each predictor variable's contribution.
- $p(X)$ designates the probability of the positive class, informed by the predictor variables.

Notably, the ratio $p(X)/[1 - p(X)]$, dubbed **the odds**, encapsulates the relationship between the probability of positive and negative outcomes [James et al., 2023].
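To make the odds and the log-odds concrete, the short sketch below converts a few probabilities into both quantities with NumPy. The values stored in `p` are arbitrary illustrative choices and are not taken from any dataset.

```python
import numpy as np

# Arbitrary illustrative probabilities of the positive class
p = np.array([0.1, 0.5, 0.8])

odds = p / (1 - p)          # p(X) / [1 - p(X)]
log_odds = np.log(odds)     # the logit, i.e. the left-hand side of the model

for pi, oi, li in zip(p, odds, log_odds):
    print(f"p = {pi:.1f}  odds = {oi:.3f}  log-odds = {li:+.3f}")
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0; probabilities above 0.5 give positive log-odds, and probabilities below 0.5 give negative log-odds.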

### Estimating Probabilities

The logistic model allows us to deduce the estimated probability of the positive class ($p(X)$) through the following equation:

\begin{align}
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}},
\end{align}

In this equation, the base $e$ of the natural logarithm (approximately 2.71828) plays a pivotal role in the transformation. This transformation, facilitated by the logistic function, molds the linear combination of predictor variables into a probability value that spans the entire spectrum between 0 and 1.
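The expression for $p(X)$ follows from the logit form of the model by solving for $p(X)$. As a brief sketch, write $z = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$ (the shorthand $z$ is introduced here only for this derivation):

\begin{align}
\log\left(\frac{p(X)}{1 - p(X)}\right) = z
&\;\Longrightarrow\; \frac{p(X)}{1 - p(X)} = e^{z} \\
&\;\Longrightarrow\; p(X) = e^{z}\,\bigl(1 - p(X)\bigr) \\
&\;\Longrightarrow\; p(X)\,\bigl(1 + e^{z}\bigr) = e^{z} \\
&\;\Longrightarrow\; p(X) = \frac{e^{z}}{1 + e^{z}}.
\end{align}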

### Sigmoid (Logistic) Function

The equation given:

\begin{equation}
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p
\end{equation}

is closely connected to the sigmoid (logistic) function within the context of logistic regression, a method used for binary classification.

In logistic regression, the objective is to estimate the likelihood of the positive outcome (class 1) based on the values of certain predictor variables, such as $X_1, X_2, \ldots, X_p$. The left side of the equation, $\log\left(\frac{p(X)}{1 - p(X)}\right)$, signifies the log-odds (logit) of the probability $p(X)$ for the positive outcome. This transformation linearizes the connection between the predictor variables and the probability.

Now, let's introduce the sigmoid function. The [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function), denoted as $f(x) = \frac{1}{1 + e^{-x}}$, converts a real number $x$ into a value between 0 and 1. In the logistic regression equation, this can be related to $p(X)$ as follows:

\begin{equation}p(X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}}\end{equation}

In this expression, $p(X)$ represents the probability of the positive outcome. The right-hand side applies the sigmoid function to the linear combination $\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$ of the predictor variables, which ensures that the result, $p(X)$, remains within the range of 0 to 1.

In essence, the logistic regression equation incorporates the sigmoid function to transform the linear combination of predictor variables into a valid probability value, enabling binary classification decisions based on this probability threshold.

---

<font color='Red'><b>Note:</b></font>

`scipy.special.expit` is a function in the SciPy library, a widely used open-source package for scientific and mathematical computing. The function, also known as the logistic sigmoid, is the inverse of the logit function. It accepts either a single value or an array as input and applies the logistic sigmoid function, mathematically expressed as:

\begin{equation}f(x) = \frac{1}{1 + e^{-x}}\end{equation}

Here, $e$ signifies the natural logarithm base, while $x$ denotes the input variable. The fundamental role of the logistic sigmoid function is to transform input values into a bounded range of 0 to 1. This property finds extensive application in various domains, including machine learning, specifically in logistic regression, where it is instrumental in modeling the probability of binary outcomes.
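As a quick check of this description, the sketch below compares `scipy.special.expit` with the formula written out by hand, and uses `scipy.special.logit` to confirm that the two functions undo each other. The input grid `x` is an arbitrary choice made for illustration.

```python
import numpy as np
from scipy.special import expit, logit

x = np.linspace(-6, 6, 7)            # arbitrary real-valued inputs

manual = 1 / (1 + np.exp(-x))        # sigmoid written out explicitly
via_scipy = expit(x)                 # same transformation from SciPy

print(np.allclose(manual, via_scipy))      # the two agree numerically
print(np.allclose(logit(via_scipy), x))    # logit undoes expit
```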

---

<font color='Blue'><b>Example:</b></font>
Let's examine a straightforward scenario with a single variable, denoted as X, and a dependent variable, y. In this instance, we generate X values from a normal distribution with a mean ($\mu$) of 10 and a standard deviation ($\sigma$) of 2. The variable y is defined as follows:

\begin{equation}
y = \begin{cases}
1, & \text{if } x > \mu \\
0, & \text{otherwise}.
\end{cases}
\end{equation}

Next, we estimate the probability of the positive outcome ($y = 1$) using linear regression, which is depicted in the top plot. Some of the estimated values are negative and others exceed 1, which is problematic because probabilities must always fall within the range of 0 to 1.

To address this issue, we employ logistic regression, as shown in the bottom plot. In logistic regression, all estimated probabilities strictly conform to the valid probability range, ensuring that they stay between 0 and 1.

In [None]:
import matplotlib.pyplot as plt
plt.style.use('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Files/mystyle.mplstyle')
import numpy as np
# from scipy.special import expit
from sklearn.linear_model import LinearRegression, LogisticRegression

mu, sigma = 10, 2  # mean and standard deviation
np.random.seed(0)
X = np.random.normal(mu, sigma, 100)
y = np.where(X > mu, 1, 0)
X = X.reshape(-1, 1)
X_gen = np.linspace(X.min(), X.max(), 300)

# Create the models
models = [LinearRegression(), LogisticRegression()]

# Titles for the plots
titles = [ "Linear Regression Model", "Logistic Regression Model"]

fig, axes = plt.subplots(2, 1, figsize=(9.5, 9), sharex=True, gridspec_kw={'hspace': 0})

for i, ax in enumerate(axes):
    model = models[i]
    model.fit(X, y)

    if i == 1:
        Px = (1/(1 + np.exp(- (X_gen * model.coef_ + model.intercept_)))).flatten()
        # Or we could utilize scipy.special.expit
        # Px = expit(X_gen * model.coef_ + model.intercept_).ravel()
    else:
        Px = model.coef_ * X_gen + model.intercept_

    ax.scatter(X, y, color='#b496ff', ec='#602ce5', alpha=0.3, label="Sample Data")
    ax.plot(X_gen, Px, color='#C60004', lw=2, label=titles[i])
    ax.set_xlim([4, 16])
    ax.hlines([0, 1], *ax.get_xlim(), linestyles='dashed', color = '#26912d', lw=1.5)
    ax.set(xlabel='' if i == 0 else 'X', ylabel='Probability(X)')
    ax.grid(False)
    ax.legend(loc = 'right')
    ax.set_ylim(-0.1, 1.1)

plt.tight_layout()

---

<font color='Red'><b>Remark:</b></font>

Logistic regression is primarily used for classification, not regression, despite its name. It's a statistical model used to predict the probability of a binary outcome, typically denoted as class 0 and class 1. The logistic regression model estimates the probability that a given input belongs to one of these two classes. This makes it a valuable tool for various classification tasks, such as spam detection, disease diagnosis, or sentiment analysis. In logistic regression, the output is a probability score, and a threshold (often 0.5) is applied to classify the input into one of the two classes. Therefore, it's fundamentally a classification technique, even though the term "regression" is in its name.
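The thresholding step described in this remark can be sketched as follows. The data are a small synthetic one-feature sample generated only for illustration (the names `X_demo`, `y_demo`, and `clf` are hypothetical), and the threshold of 0.8 is an arbitrary alternative used to show how a stricter cut-off changes the predicted labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic one-feature dataset (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(10, 2, size=200).reshape(-1, 1)
# Noisy labels: larger X makes class 1 more likely
y_demo = (X_demo.ravel() + rng.normal(0, 1.5, size=200) > 10).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)[:, 1]        # estimated P(class 1 | X)
labels_05 = (proba > 0.5).astype(int)          # threshold of 0.5
labels_08 = (proba > 0.8).astype(int)          # stricter, hypothetical threshold

print("Agreement with clf.predict:", np.mean(labels_05 == clf.predict(X_demo)))
print("Predicted positives at 0.5:", labels_05.sum())
print("Predicted positives at 0.8:", labels_08.sum())
```

Raising the threshold can only reduce the number of points labelled as class 1, which is one way to trade false positives against false negatives.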

---

### Finding the Coefficients

In logistic regression, the goal is to find the optimal coefficients ($\beta_0$, $\beta_1$, ..., $\beta_p$) for the model that best fits the data. This is achieved through the [Maximum Likelihood Estimation (MLE)](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) method. MLE aims to determine the coefficient values that maximize the likelihood of observing the actual binary outcomes in the dataset, given the predictor variables.

The procedure for finding these coefficients can be summarized in the following steps:

1. **Formulate the Likelihood Function**: Begin by defining the likelihood function, which represents the probability of observing the binary outcomes (0s and 1s) based on the logistic regression model. This function is a mathematical representation of the relationship between the coefficients and the observed data.

2. **Take the Natural Logarithm**: Simplify the likelihood calculations by working with the natural logarithm of the likelihood function, known as the log-likelihood. This transformation preserves the underlying structure while making computations more manageable.

3. **Maximize the Log-Likelihood**: Utilize optimization techniques, often numerical methods, to find the coefficient values that maximize the log-likelihood function. The goal is to identify the coefficient values that make the observed data most likely under the logistic regression model.

4. **Interpret the Coefficients**: Once the estimated coefficients are obtained, interpret their values. Each coefficient (excluding $\beta_0$, the intercept) represents the change in the log-odds (logit) of the positive outcome (class 1) associated with a one-unit change in the corresponding predictor variable while holding other variables constant.

Practical implementations of logistic regression, especially with multiple predictor variables ($X_1$, $X_2$, ..., $X_p$) and non-linear optimization, can be complex. The choice of optimization method may depend on the software libraries or statistical packages you are using.

Most modern statistical software packages for data analysis, such as Python's scikit-learn or R's glm function, provide built-in functions to perform logistic regression. These functions automatically handle the MLE process and provide the estimated coefficients. However, if manual implementation of logistic regression is necessary, you'll likely rely on optimization algorithms like gradient descent or more advanced techniques. Proficiency in numerical optimization and statistical concepts is crucial for successfully implementing this process from scratch.
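To illustrate what such a from-scratch implementation might look like, here is a minimal sketch that maximizes the log-likelihood with Newton-Raphson updates in NumPy. It is not how scikit-learn or R fit the model internally; the helper names (`sigmoid`, `log_likelihood`, `fit_logistic_newton`), the fixed iteration count, and the synthetic data with `true_beta` are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, A, y):
    # Sum over observations of y*log(p) + (1 - y)*log(1 - p)
    p = sigmoid(A @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic_newton(X, y, n_iter=25):
    """Maximize the log-likelihood with Newton-Raphson updates."""
    A = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = sigmoid(A @ beta)
        gradient = A.T @ (y - p)                       # first derivatives
        hessian = -(A * (p * (1 - p))[:, None]).T @ A  # second derivatives
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta, A

# Synthetic data drawn from a known (hypothetical) model
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
true_beta = np.array([-0.5, 1.5])
y = rng.binomial(1, sigmoid(true_beta[0] + true_beta[1] * X[:, 0]))

beta_hat, A = fit_logistic_newton(X, y)
print("Estimated coefficients:", beta_hat)
print("Log-likelihood at estimate:", log_likelihood(beta_hat, A, y))
```

At the maximum, the gradient of the log-likelihood is (numerically) zero, and the log-likelihood at `beta_hat` is at least as large as at any other coefficient vector, including the `true_beta` used to simulate the data. Note that this sketch assumes the classes overlap; for perfectly separable data the maximum likelihood estimate does not exist and the updates would diverge.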

### Translating Theory into Python

In the world of Python, practical implementation of logistic regression is greatly simplified by leveraging powerful libraries like scikit-learn. This Python library provides a comprehensive toolkit for machine learning, and the `LogisticRegression` class it offers is an invaluable tool for binary classification tasks. By utilizing this class, you can effortlessly apply logistic regression to your dataset, allowing you to fit the model and gain access to the estimated coefficients.

The official scikit-learn documentation describes the `LogisticRegression` class in detail: [scikit-learn Logistic Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). It is a useful reference for both beginners and experienced practitioners, offering guidance on the available parameters, techniques, and best practices for applying logistic regression in real-world scenarios.

<font color='Blue'><b>Example:</b></font> In the next example, we use the Default dataset from the ISLR repository [James et al., 2023], available at https://www.statlearning.com/resources-python. This dataset records instances where customers defaulted on their credit obligations.

In [None]:
# Download the zip file using wget
!wget -N "https://www.statlearning.com/s/ALL-CSV-FILES-2nd-Edition-corrected.zip"

# Unzip the downloaded zip file
!unzip -j -o ALL-CSV-FILES-2nd-Edition-corrected.zip "ALL CSV FILES - 2nd Edition/Default.csv"

# Remove the downloaded zip file after extraction
!rm -r ALL-CSV-FILES-2nd-Edition-corrected.zip

In [None]:
import pandas as pd
Default = pd.read_csv('Default.csv')
Default.columns = [x.title() for x in Default.columns]
display(Default.head(5))

---

<font color='Red'><b>Note:</b></font>


In the context of credit cards, "default" refers to the failure of a cardholder to meet the agreed-upon terms and obligations associated with the credit card account. This typically includes missing minimum monthly payments, exceeding the credit limit, or violating other terms specified in the credit card agreement. When a cardholder defaults, it indicates a breach of the contract between the cardholder and the credit card issuer.

---

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Files/mystyle.mplstyle')

fig, ax = plt.subplots(1, 3, figsize=(9.5, 4.5), gridspec_kw = {'width_ratios':[.6,.2,.2]})
CP = {'Yes': 'MediumSeaGreen', 'No': 'OrangeRed'}
# Left
_ = ax[0].scatter(Default.loc[Default.Default == 'Yes', 'Balance'],
              Default.loc[Default.Default == 'Yes', 'Income'],
              edgecolors=CP['Yes'], label='Yes')
_ = ax[0].scatter(Default.loc[Default.Default == 'No', 'Balance'],
              Default.loc[Default.Default == 'No', 'Income'],
              edgecolors=CP['No'], label='No', alpha=0.2)
_ = ax[0].set_ylim([0, 8e4])
_ = ax[0].set(xlabel='Balance', ylabel='Income')
_ = ax[0].legend(title='Default', loc='upper right')
# Center
_ = sns.boxplot(x='Default', y='Balance', data=Default, orient='v', ax=ax[1], palette=CP)
_ = ax[1].set_ylim([0, 3e3])
# Right
_ = sns.boxplot(x='Default', y='Income', data=Default, orient='v', ax=ax[2], palette=CP)
_ = ax[2].set_ylim([0, 8e4])

plt.tight_layout()

The Default dataset is portrayed through three enlightening panels. Let's explore each [James et al., 2023]:

**Left Panel:**
In this panel, yearly income is plotted against monthly credit card balance for each individual. Following the color map `CP` defined in the code above, individuals who defaulted on their credit card payments are outlined in green, while those who did not default are outlined in orange-red. The plot gives a first impression of how income and credit card balance relate to default.

**Center Panel:**
At the heart of this panel, a series of boxplots takes prominence, providing insights into the distribution of credit card balances in relation to default status. These boxplots offer a comprehensive overview of medians, quartiles, and potential outliers within each default group. The contrast between individuals who experienced default and those who did not is vividly delineated, succinctly revealing disparities in balances.

**Right Panel:**
Complementing the center panel, the right panel introduces a pair of boxplots that cast a spotlight on income variation according to default status. These boxplots concisely capture nuanced shifts in income distribution for both defaulters and non-defaulters, offering a visual narrative that illuminates the role income plays in credit card payment outcomes.

The Student status can be effectively represented using a [dummy variable](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)), which is a common technique in data analysis and machine learning. This involves creating a binary variable that takes the value of 1 if the individual is a student, and 0 if the individual is not a student. In mathematical terms, this encoding can be expressed as:

\begin{equation}
\text{Student} =
\begin{cases}
1, & \text{if Student}, \\
0, & \text{if Non-Student}.
\end{cases}
\end{equation}

In this context:
- When an individual's Student status is true, the corresponding value of the dummy variable is set to 1, indicating that the individual is a student.
- Conversely, if an individual is not a student, the dummy variable takes the value of 0, signifying the non-student status.

This approach simplifies the representation of Student status in a format that can be effectively utilized in various analyses and predictive modeling scenarios. It provides a straightforward and standardized way to incorporate categorical information like Student status into computational workflows.
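As a small illustration of this encoding, the sketch below builds a dummy variable with pandas on a tiny hypothetical sample. The column name `Student_dummy` and the four rows are made up for the example, and the labels are assumed to be the strings `'Yes'` and `'No'`.

```python
import pandas as pd

# Tiny hypothetical sample; labels are assumed to be 'Yes' / 'No'
sample = pd.DataFrame({'Student': ['No', 'Yes', 'No', 'Yes']})

# Explicit mapping: 1 for students, 0 for non-students
sample['Student_dummy'] = (sample['Student'] == 'Yes').astype(int)
print(sample)
```

An explicit comparison such as this fixes which label maps to 1. By contrast, `pandas.factorize` numbers the categories in order of first appearance, so its codes depend on which label happens to occur first in the data.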

In [None]:
df = Default.copy()
# Encode explicitly so that 'Yes' -> 1 and 'No' -> 0 regardless of row order
df['Default'] = (Default.Default == 'Yes').astype(int)
df['Student'] = (Default.Student == 'Yes').astype(int)
print('From:')
display(Default.head(5))
print('To')
display(df.head(5))

## Simple Logistic Regression
Let's explore a logistic regression approach characterized by the following equation:

\begin{equation}
\log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 \times \text{Balance}
\end{equation}

This equation models the log-odds of default as a linear function of the 'Balance' predictor. Although the dataset also contains 'Student' status and 'Income', we focus on 'Balance' alone here for clarity.

The probability of default, denoted as $ p(X) $, is expressed as follows:

\begin{equation}
p(X) = \frac{e^{\beta_0 + \beta_1 \times \text{Balance}}}
{1 + e^{\beta_0 + \beta_1 \times \text{Balance}}}
\end{equation}

Breaking down the elements of the equation:
- $ \beta_0 $ and $ \beta_1 $ are coefficients associated with the intercept and 'Balance', respectively.
- 'Balance' serves as the predictor variable that shapes the logistic regression model's outcome.
- The natural logarithm ($ \log $) is applied to the odds $ \frac{p(X)}{1 - p(X)} $, enabling the representation of the relationship between the 'Balance' predictor and the log-odds of default in a linear manner.
- The exponentiation ($ e^{\dots} $) incorporated within the formula translates the log-odds into a probability scale.

In essence, this logistic regression equation dissects the impact of 'Balance' on predicting the probability of default, providing a focused lens into this specific predictor's role in the model.

Within the context of the Default dataset, logistic regression serves as a potent tool for modeling the probability of encountering default. To illustrate, consider the probability of default given a specific balance, which can be expressed as:

\begin{equation}
\text{Pr (Default = Yes | balance)}.
\end{equation}

In this formulation, logistic regression enables us to quantitatively assess the likelihood of a default occurrence based on the observed balance value. This probability-based perspective provides valuable insights into the potential outcomes within the dataset, shedding light on the dynamics between credit card balances and the probability of encountering default.
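As a worked illustration, the sketch below evaluates this probability for two balance values. The coefficients `beta_0` and `beta_1` are hypothetical numbers chosen only to illustrate the arithmetic; they are not the result of a fit performed in this section.

```python
import numpy as np

# Hypothetical coefficients for illustration only (not fitted in this section)
beta_0, beta_1 = -10.65, 0.0055

def prob_default(balance):
    """p(X) = e^(b0 + b1*Balance) / (1 + e^(b0 + b1*Balance))."""
    z = beta_0 + beta_1 * balance
    return np.exp(z) / (1 + np.exp(z))

for balance in (1000, 2000):
    print(f"Balance = {balance}: estimated Pr(Default = Yes) = {prob_default(balance):.4f}")

# A one-unit increase in Balance multiplies the odds by e^(beta_1)
print("Odds multiplier per unit of Balance:", np.exp(beta_1))
```

Under these assumed coefficients, the estimated probability rises sharply between the two balances, while the odds change by the same constant factor $e^{\beta_1}$ for every additional unit of 'Balance'.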

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = df['Balance'].values.reshape(-1, 1)
y = df['Default'].values

# Generating test data
X_gen = np.arange(X.min(), X.max()).reshape(-1, 1)

log_reg = LogisticRegression()
_ = log_reg.fit(X, y)
Pred_proba = log_reg.predict_proba(X_gen)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(9.5, 4.5))
_ = ax.scatter(X, y, color='#7aea49', ec = '#38761d', alpha = 0.3)
_ = ax.plot(X_gen, Pred_proba[:, 1], color='#C60004', lw=2)
_ = ax.hlines([0, 1], *ax.get_xlim(), linestyles='dashed', lw=1)
_ = ax.set(xlabel = 'Balance', ylabel='Probability of Default',
              title = 'Estimated probability of default using logistic regression')
_ = ax.grid(False)
plt.tight_layout()

The figure shows the predicted probabilities of default obtained from logistic regression. Unlike the linear regression fit in the earlier synthetic example, these probabilities always lie within the range of 0 to 1, as probabilities must. This adherence to the valid probability range is a distinctive trait of logistic regression's predictions.

An alternative approach is to utilize the [**Statsmodels Generalized Linear Models**](https://www.statsmodels.org/devel/examples/notebooks/generated/glm_formula.html) library.

In [None]:
from io import StringIO

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Also recall our Reg_Result helper for OLS regression results
def Reg_Result(Inp):
    # Parse the coefficient table of the summary into a DataFrame
    html = Inp.summary().tables[1].as_html()
    Temp = pd.read_html(StringIO(html), header=0, index_col=0)[0]
    # GLM summaries label the p-value column 'P>|z|' rather than 'P>|t|'
    display(Temp.style\
    .format({'coef': '{:.4e}', 'P>|t|': '{:.4e}', 'P>|z|': '{:.4e}', 'std err': '{:.4e}'})\
    .bar(subset=['coef'], align='mid', color='Lime')\
    .set_properties(subset=['std err'], **{'background-color': 'DimGray', 'color': 'White'}))

model = smf.glm(formula = 'Default ~ Balance', data = df, family=sm.families.Binomial())
Results = model.fit()
print(Results.summary())
Reg_Result(Results)

In particular, for
```python
model = smf.glm(formula='Default ~ Balance', data=df, family=sm.families.Binomial())
```
we have,

1. `formula='Default ~ Balance'`: This is the formula notation used to specify the relationship between the response variable ('Default') and the predictor variable ('Balance'). The tilde symbol (~) separates the response variable from the predictor variable. In this case, it signifies that we are modeling the influence of 'Balance' on the likelihood of 'Default'.

2. `family=sm.families.Binomial()`: This parameter defines the error distribution or family of the model. In this case, we're using the Binomial distribution, which is appropriate for binary response variables. The `sm.families.Binomial()` indicates that we're using the Binomial distribution from the Statsmodels library.

## Multiple Logistic Regression

Let's delve into an alternative logistic regression approach characterized by the following equation:

\begin{equation}
\log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 \times \text{Student} + \beta_2 \times \text{Balance} + \beta_3 \times \text{Income}
\end{equation}

This formulation encapsulates the likelihood of encountering default. It hinges on the contribution of multiple predictors, including 'Student' status, 'Balance', and 'Income'. The probability of default, $ p(X) $, is expressed as follows:

\begin{equation}
p(X) = \dfrac{e^{\beta_0 + \beta_1 \times \text{Student} + \beta_2 \times \text{Balance} + \beta_3 \times \text{Income}}}
{1 + e^{\beta_0 + \beta_1 \times \text{Student} + \beta_2 \times \text{Balance} + \beta_3 \times \text{Income}}}
\end{equation}

In this equation:
- $ \beta_0 $, $ \beta_1 $, $ \beta_2 $, and $ \beta_3 $ are the coefficients corresponding to the intercept, 'Student' status, 'Balance', and 'Income', respectively.
- 'Student', 'Balance', and 'Income' are the predictor variables that contribute to the logistic regression model.
- The natural logarithm ($ \log $) is applied to the odds $ \frac{p(X)}{1 - p(X)} $ to linearly model the relationship between the predictors and the log-odds of default.
- The exponentiation ($ e^{\dots} $) within the formula facilitates the transformation of the log-odds back to the probability scale.

In essence, this logistic regression equation accounts for the combined influence of 'Student' status, 'Balance', and 'Income' in predicting the probability of default.
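To see how the predictors combine, the sketch below evaluates the equation for a student and a non-student with the same balance and income. The coefficient values are hypothetical and serve only to illustrate the arithmetic; in particular, 'Income' is assumed here to be measured in thousands of dollars, unlike the raw column in `Default.csv`.

```python
import numpy as np

# Hypothetical coefficients for illustration; Income is assumed to be in
# thousands of dollars here, unlike the raw column in Default.csv
beta = {'Intercept': -10.869, 'Student': -0.6468,
        'Balance': 0.00574, 'Income': 0.0030}

def prob_default(student, balance, income):
    # Linear combination of the predictors, then the sigmoid transformation
    z = (beta['Intercept'] + beta['Student'] * student
         + beta['Balance'] * balance + beta['Income'] * income)
    return 1 / (1 + np.exp(-z))

p_student = prob_default(student=1, balance=1500, income=40)
p_non_student = prob_default(student=0, balance=1500, income=40)
print(f"Student:     {p_student:.3f}")
print(f"Non-student: {p_non_student:.3f}")
```

With a negative coefficient on the 'Student' dummy variable, as assumed here, a student has a lower estimated probability of default than a non-student with the same balance and income.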

We fit this model using the [**Statsmodels Generalized Linear Models**](https://www.statsmodels.org/devel/examples/notebooks/generated/glm_formula.html) library, as in the simple logistic regression case.

In [None]:
formula = 'Default ~ Student + Income + Balance'
model = smf.glm(formula = formula, data=df, family=sm.families.Binomial())
Results = model.fit()
print(Results.summary())
Reg_Result(Results)

## Example: Synthetic Dataset

<font color='Blue'><b>Example</b></font>: In this code example, a logistic regression classifier is used to illustrate decision boundaries on synthetic data. The synthetic dataset is generated using the `make_blobs` function from scikit-learn, designed for creating artificial datasets for various machine learning experiments. This particular dataset has the following characteristics:

- **Number of Samples:** 1000
- **Number of Features:** 2
- **Number of Classes:** 2
- **Random Seed (random_state):** 0
- **Cluster Standard Deviation (cluster_std):** 1.0

**Features:**
- The dataset contains 1000 data points, each described by a pair of feature values. These features are represented as 'Feature 1' and 'Feature 2'.

**Outcome (Target Variable):**
- The dataset also includes a target variable called 'Outcome.' This variable assigns each data point to one of two distinct classes, identified as 'Class 0' and 'Class 1'.

The dataset has been designed to simulate a scenario with two well-separated clusters, making it suitable for binary classification tasks. Each data point in this dataset is associated with one of the two classes, and it can be used for practicing and evaluating machine learning algorithms that deal with binary classification problems.

In [None]:
from sklearn.datasets import make_blobs
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate synthetic data
X, y = make_blobs(n_samples=1000, centers=2, random_state=0, cluster_std=1.0)

# Create a scatter plot using Matplotlib
fig, ax = plt.subplots(1, 1, figsize=(9.5, 6))

colors = ["#f5645a", "#b781ea"]
markers = ['o', 's']

# Scatter plot of data points
for num in np.unique(y):
    ax.scatter(X[:, 0][y == num], X[:, 1][y == num], c=colors[num],
                s=40, edgecolors="k", marker=markers[num], label=str(num))

ax.set(xlim=[-2, 6], ylim=[-2, 8])
ax.legend(title = 'Outcome')
ax.set_title('Synthetic Dataset', weight = 'bold', fontsize = 16)
plt.tight_layout()

The synthetic dataset depicted above exhibits a balanced distribution between two classes.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(5, 2))
_ = sns.countplot(y = y, palette= colors)
_ = ax.set(xlabel = 'Count', ylabel = 'Outcome' )
plt.tight_layout()

Next, Logistic Regression from the scikit-learn library [scikit-learn Developers, 2023] is employed for the classification of the aforementioned dataset.

In [None]:
from sklearn.linear_model import LogisticRegression
from matplotlib.colors import ListedColormap
from sklearn.inspection import DecisionBoundaryDisplay

colors = ["#f5645a", "#b781ea"]
edge_colors = ['#8A0002', '#3C1F8B']
cmap_light = ListedColormap(['#f7dfdf', '#e3d3f2'])
markers = ['o', 's']

# Plot decision boundaries
fig, ax = plt.subplots(1, 1, figsize=(9.5, 6))
log_reg = LogisticRegression(max_iter = 200, solver = 'lbfgs')
log_reg.fit(X, y)
DecisionBoundaryDisplay.from_estimator(log_reg, X, cmap=cmap_light, ax=ax,
                                   response_method="predict",
                                   plot_method="pcolormesh",
                                   xlabel= 'Feature 1', ylabel='Feature 2',
                                   shading="auto")
# Scatter plot of data points
for num in np.unique(y):
    ax.scatter(X[:, 0][y == num], X[:, 1][y == num], c=colors[num],
                s=40, edgecolors="k", marker=markers[num], label=str(num))

ax.set(xlim=[-2, 6], ylim=[-2, 8])
ax.legend(title = 'Outcome')
ax.set_title('Logistic Regression', fontweight='bold', fontsize = 16)
ax.grid(False)
plt.tight_layout()
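
With two features, the boundary between the two shaded regions is the straight line on which the log-odds is zero, that is, where the estimated probability equals 0.5. The sketch below recovers that line from the fitted coefficients. It refits the model on the same `make_blobs` settings so that it is self-contained; the names `X_demo`, `y_demo`, `clf`, `b0`, `b1`, and `b2` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_blobs(n_samples=1000, centers=2, random_state=0, cluster_std=1.0)
clf = LogisticRegression(max_iter=200).fit(X_demo, y_demo)

b0 = clf.intercept_[0]
b1, b2 = clf.coef_[0]

# Boundary: b0 + b1*x1 + b2*x2 = 0  =>  x2 = -(b0 + b1*x1) / b2
x1 = np.linspace(-2, 6, 5)
x2 = -(b0 + b1 * x1) / b2

# Points on the boundary should receive probability 0.5
boundary_points = np.column_stack([x1, x2])
print(clf.predict_proba(boundary_points)[:, 1])
```

Because the boundary is linear in the features, logistic regression in this form cannot separate classes whose true boundary is strongly curved unless additional features are constructed.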

In practical application, it is imperative to take into account the division of the dataset into two distinct subsets: one designated for model training and the other for model testing. This separation can be achieved using the `train_test_split` function, as exemplified below:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25, random_state=0)

print(f'Shape of X_train = {X_train.shape}')
print(f'Shape of y_train = {y_train.shape}')
print(f'Shape of X_test = {X_test.shape}')
print(f'Shape of y_test = {y_test.shape}')

log_reg = LogisticRegression()
_ = log_reg.fit(X_train, y_train)

def _gen_cr(model, X, y):
    y_pred = model.predict(X)
    Results = pd.DataFrame(classification_report(y, y_pred,
                                             output_dict=True)).T
    display(Results.style.format(precision = 3))

print('\nTrain Data:')
_gen_cr(log_reg, X_train, y_train)

print('\nTest Data:')
_gen_cr(log_reg, X_test, y_test)

The classification reports above summarize the performance of the trained model on the training and test sets. The corresponding decision regions for both sets are plotted as follows:

In [None]:
from matplotlib.colors import ListedColormap
from sklearn.inspection import DecisionBoundaryDisplay

fig, axes = plt.subplots(1, 2, figsize=(9.5, 4.5))

# Create a loop for train and test sets
for i, (X_set, y_set, title) in enumerate([(X_train, y_train, 'Train Set'), (X_test, y_test, 'Test Set')]):
    # Plot decision boundaries
    DecisionBoundaryDisplay.from_estimator(log_reg, X_set, cmap=cmap_light, ax=axes[i],
                                           response_method="predict",
                                           plot_method="pcolormesh",
                                           xlabel='Feature 1', ylabel='Feature 2',
                                           shading="auto")
    for num in np.unique(y):
        axes[i].scatter(X_set[:, 0][y_set == num],
                     X_set[:, 1][y_set == num], c=colors[num],
                    s=40, edgecolors="k", marker=markers[num], label=str(num))

    axes[i].legend(title='Outcome')
    axes[i].set_title(f'{title} - Logistic Regression', fontweight='bold', fontsize=16)
    axes[i].grid(False)

plt.tight_layout()

A closer look shows that some data points in the training and test sets have been misclassified. These points are highlighted with yellow hexagons in the following figure.

In [None]:
from matplotlib.colors import ListedColormap
from sklearn.inspection import DecisionBoundaryDisplay

fig, axes = plt.subplots(1, 2, figsize=(9.5, 4.5))

# Create a loop for train and test sets
for i, (X_set, y_set, title) in enumerate([(X_train, y_train, 'Train Set'), (X_test, y_test, 'Test Set')]):
    # Plot decision boundaries
    DecisionBoundaryDisplay.from_estimator(log_reg, X_set, cmap=cmap_light, ax=axes[i],
                                           response_method="predict",
                                           plot_method="pcolormesh",
                                           xlabel='Feature 1', ylabel='Feature 2',
                                           shading="auto")
    for num in np.unique(y):
        axes[i].scatter(X_set[:, 0][y_set == num],
                     X_set[:, 1][y_set == num], c=colors[num],
                    s=40, edgecolors="k", marker=markers[num], label= f'Outcome {num}')

    # Highlight points whose predicted label differs from the true label
    misclassified = y_set != log_reg.predict(X_set)
    axes[i].scatter(X_set[misclassified, 0], X_set[misclassified, 1],
                    fc='yellow', ec='black', s=40, marker='h', label='Inaccurate Predictions')

    axes[i].set_title(f'{title} - Logistic Regression', fontweight='bold', fontsize=16)
    axes[i].grid(False)

# Create a single legend for both subplots at the top
handles, labels = axes[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='upper center', ncol=3, borderaxespad= -0.1)
plt.tight_layout()
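
The highlighted points are selected with a boolean mask that compares the true labels with the model's predictions. The same mask can be used to count the misclassified points and compute the error rate. The following is a minimal sketch, assuming synthetic two-class data from `make_blobs` as a stand-in for the notebook's `X` and `y`; the names `X_demo`, `y_demo`, and `clf` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data; a stand-in for the notebook's X and y (assumption)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)

# Boolean mask: True where the prediction disagrees with the true label
miss_mask = y_te != clf.predict(X_te)

n_miss = int(miss_mask.sum())   # number of misclassified test points
error_rate = miss_mask.mean()   # fraction of misclassified test points
print(f'Misclassified test points: {n_miss} of {len(y_te)}')
print(f'Error rate = {error_rate:.3f}, accuracy = {clf.score(X_te, y_te):.3f}')
```

The error rate and the accuracy returned by `score` always sum to one, which gives a quick consistency check on the mask.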

Lastly, the confusion matrices for the training and test sets are presented below:

In [None]:
import matplotlib.pyplot as plt  # Import matplotlib for plotting
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix  # Import necessary functions/classes

def format_confusion_matrix(cm, title):
    # For binary labels, sklearn's confusion_matrix ravels as (TN, FP, FN, TP)
    true_neg, false_pos, false_neg, true_pos = cm.ravel()
    result = f"\033[1m{title} Set Confusion Matrix\033[0m:\n"
    result += f"- {true_pos} instance(s) correctly predicted as class 1.\n"
    result += f"- {true_neg} instance(s) correctly predicted as class 0.\n"
    result += f"- {false_pos} instance(s) incorrectly predicted as class 1 (actually class 0).\n"
    result += f"- {false_neg} instance(s) incorrectly predicted as class 0 (actually class 1).\n"

    return result


def plot_cm(model, X_train, X_test, y_train, y_test, class_names, figsize=(7, 4)):
    # Create a figure and axes for displaying confusion matrices side by side
    fig, ax = plt.subplots(1, 2, figsize=figsize)

    datasets = [(X_train, y_train, 'Train'), (X_test, y_test, 'Test')]

    for i, (X, y, dataset_name) in enumerate(datasets):
        # Compute confusion matrix for the dataset predictions
        cm = confusion_matrix(y, model.predict(X))

        result = format_confusion_matrix(cm, dataset_name)
        print(result)

        # Create a ConfusionMatrixDisplay and plot it on the respective axis
        ConfusionMatrixDisplay(cm, display_labels=class_names).plot(
            ax=ax[i],
            im_kw=dict(cmap='Greens' if dataset_name == 'Train' else 'Blues'),
            text_kw={"size": 16}, colorbar=False)
        ax[i].set_title(f'{dataset_name} Data')
        ax[i].grid(False)

    # Add a super title for the entire figure
    fig.suptitle('Confusion Matrices', fontsize=16, weight = 'bold')

    # Adjust the layout for better spacing
    plt.tight_layout()

In [None]:
plot_cm(log_reg, X_train, X_test, y_train, y_test, ['0', '1'], figsize=(6, 3))
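
For binary labels, `confusion_matrix` places the true classes on the rows and the predicted classes on the columns, so `cm.ravel()` yields the counts in the order TN, FP, FN, TP. The following is a minimal sketch with invented toy labels (not the notebook's data) that illustrates this ordering and how precision and recall follow from it:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels, invented for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted class-1 points that are truly class 1
recall = tp / (tp + fn)     # share of true class-1 points that were found
print(cm)
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
print(f'precision={precision:.3f}, recall={recall:.3f}')
```

These hand-computed values agree with `precision_score` and `recall_score`, and with the class 1 row of `classification_report`.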