# Logistic Regression Classification

Logistic regression is a statistical method used in machine learning for binary classification tasks. It models the probability that a given input belongs to a particular class by applying the logistic function to a linear combination of the input features. Unlike linear regression, which predicts continuous values, logistic regression outputs values between 0 and 1, representing the probability of the input belonging to the positive class. The model is trained using a dataset with known class labels, and the coefficients are optimized to minimize the error between the predicted probabilities and the actual labels. Logistic regression is widely used due to its simplicity, interpretability, and effectiveness in many practical applications.

1. The linear combination of input features:
$$
z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n = \mathbf{\beta}^T \mathbf{x}
$$

2. The logistic function (also known as the sigmoid function) applied to the linear combination:
$$
p = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)}}
$$

3. The decision rule for classification (for binary classification with a threshold of 0.5):
$$
\hat{y} =
\begin{cases}
1 & \text{if } p \geq 0.5 \\
0 & \text{otherwise}
\end{cases}
$$


In [15]:
# Importing the needed libraries

import pandas as pd
import plotly.express as px
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

In [16]:
# Creating the Data
# The data is about results from a 50 Mile Ultra Marathon

d = {
  'miles_per_week': [37, 39] + [np.random.randint(0, 200) for _ in range(97)],
  'completed_50m_ultra': ['no', 'yes'] + ['no' or 'yes' for _ in range(97)]
}

In [17]:
# Creating the DataFrame

df = pd.DataFrame(data=d)
df.head()

Unnamed: 0,miles_per_week,completed_50m_ultra
0,37,no
1,39,yes
2,98,no
3,140,no
4,117,no


In [38]:
# Feature/ Label Selection

X = df['miles_per_week'].to_numpy().reshape(-1,1) # Reshapes Data to 2D Array
y = df['completed_50m_ultra'].to_numpy()

#'completed_50m_ultra' need to use ordinal encoding to convert to integers. 0 = no, 1 = yes

encoder = LabelEncoder()
df['completed_50m_ultra'] = encoder.fit_transform(df['completed_50m_ultra'])
df.head()

Unnamed: 0,miles_per_week,completed_50m_ultra
0,37,0
1,39,1
2,98,0
3,140,0
4,117,0


In [39]:
# Splitting the data into training/ testing and making the model

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

logit = LogisticRegression()
logit.fit(X_train, y_train)

In [41]:
# Making Predictions

y_pred = logit.predict(X_test)
print(y_pred)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [42]:
# Interpreting results

cm = confusion_matrix(y_test, y_pred)
print(cm)

cr = classification_report(y_test, y_pred)
print(cr)

[[20]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20



