# Logistic Regression

Logistic regression is a linear model used for **binary classification**. It estimates the probability that an input belongs to the positive class by applying a sigmoid (logistic) function to a linear combination of features.

Key ideas:

- **Sigmoid function**: maps real-valued scores to probabilities in (0, 1).
- **Decision boundary**: the hyperplane `w^T x + b = 0` (a line in 2D, a plane in 3D), where the predicted probability is exactly 0.5.
- **Training objective**: minimize log loss (cross-entropy) to fit the model.
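
As a minimal sketch of the first two ideas, the snippet below applies a sigmoid to a few hand-picked scores and thresholds the result at 0.5. The standalone `sigmoid` helper and the example scores are illustrative assumptions, not part of the dataset used later.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores z = w^T x + b for five inputs
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(scores)
print(np.round(probs, 3))

# A score of exactly 0 gives probability 0.5, i.e. a point on the decision boundary
labels = (probs >= 0.5).astype(int)
print(labels)
```

Negative scores fall on the class-0 side of the boundary and non-negative scores on the class-1 side.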

This notebook demonstrates how logistic regression works and how to train it on data.


In [1]:
import numpy as np
import pandas as pd

np.random.seed(42)

In [2]:

df = pd.read_csv('IRIS.csv')

top_2_species = df['species'].value_counts().index[:2]
print(f"Two species selected for training: {list(top_2_species)}")

# Keep only the rows belonging to the two selected species
df_binary = df[df['species'].isin(top_2_species)].copy()

# Encode the two species names as binary labels 0 and 1
mapping = {top_2_species[0]: 0, top_2_species[1]: 1}
df_binary['species'] = df_binary['species'].map(mapping)

print("Data after filtering and relabeling:")
print(df_binary.head())
print("Number of samples:", len(df_binary))
print("Label distribution:")
print(df_binary['species'].value_counts())

Two species selected for training: ['Iris-setosa', 'Iris-versicolor']
Data after filtering and relabeling:
   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0
Number of samples: 100
Label distribution:
species
0    50
1    50
Name: count, dtype: int64


In [4]:
df = df_binary.reset_index(drop=True)
df.head()

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0


In [5]:
from sklearn.model_selection import train_test_split

y = df['species'].values
x = df.drop(columns=['species']).values

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    test_size=0.2,     # hold out 20% of the samples for testing
    random_state=42,   # fixed seed so every run produces the same split
    shuffle=True,      # shuffle the data before splitting
    stratify=y         # (recommended) keep the class 0/1 ratio the same in train and test
)

print(x_train.shape, y_train.shape)

(80, 4) (80,)


In [8]:
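class MyLogisticRegression:
    """Binary logistic regression trained with batch gradient descent."""

    def __init__(self, lr=0.1, epochs=1000):
        self.lr = lr          # learning rate (gradient descent step size)
        self.epochs = epochs  # number of full passes over the training set
        self.w = None         # weights, shape (n_features, 1), set in fit()
        self.b = 0.0          # bias (intercept)

    def sigmoid(self, z):
        # Map real-valued scores to probabilities in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, x, y):
        y = y.reshape(-1, 1)  # column vector so it lines up with x @ w
        m, n = x.shape        # m samples, n features
        self.w = np.zeros((n, 1))
        self.b = 0.0

        for _ in range(self.epochs):
            # Forward pass: scores, then probabilities
            z = x @ self.w + self.b
            y_hat = self.sigmoid(z)

            # Gradients of the mean log loss with respect to w and b
            dz = y_hat - y
            dw = (x.T @ dz) / m
            db = float(dz.mean())

            # Gradient descent update
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, x):
        # Predict label 1 when the probability is at least 0.5
        proba = self.sigmoid(x @ self.w + self.b)
        return (proba >= 0.5).astype(int).reshape(-1)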


In [9]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model = MyLogisticRegression(lr=0.1, epochs=1000)
model.fit(x_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)  # computed but not printed
f1 = f1_score(y_test, y_pred)          # computed but not printed
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")

Accuracy: 1.0000
Precision: 1.0000


## How Logistic Regression Works

Logistic regression models the probability of the positive class using:

`p(y=1|x) = sigmoid(z)`, where `z = w^T x + b`.

Training finds `w` and `b` that make the predicted probabilities match the true labels by minimizing **log loss** (cross-entropy):

- If the true label is 1, the loss penalizes low predicted probability.
- If the true label is 0, the loss penalizes high predicted probability.
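
To make the two bullet points concrete, here is a small sketch of the mean log loss, `-mean(y*log(p) + (1-y)*log(1-p))`. The function name `binary_log_loss`, the clipping constant `eps`, and the example probabilities are assumptions chosen for illustration.

```python
import numpy as np

def binary_log_loss(y_true, p, eps=1e-12):
    """Mean binary cross-entropy; eps is an assumed clipping constant that avoids log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

y_true = np.array([1, 1, 0, 0])
mostly_right = np.array([0.9, 0.8, 0.1, 0.2])  # high p for label 1, low p for label 0
mostly_wrong = np.array([0.1, 0.2, 0.9, 0.8])  # the reverse

loss_right = binary_log_loss(y_true, mostly_right)
loss_wrong = binary_log_loss(y_true, mostly_wrong)
print(f"loss when mostly right: {loss_right:.4f}")
print(f"loss when mostly wrong: {loss_wrong:.4f}")
```

The second loss comes out much larger than the first, because confident predictions on the wrong side are penalized heavily.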

Typical training steps:

1. Compute scores `z = Xw + b` and probabilities `p = sigmoid(z)`.
2. Compute log loss over all samples.
3. Compute gradients of the loss with respect to `w` and `b`.
4. Update parameters with gradient descent (repeat for many epochs).
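
The four steps above can be sketched as a single loop that also records the loss at every epoch. This is an illustrative version under stated assumptions: the function `train_logistic`, the synthetic two-cluster data, and the clipping constant are made up for the example, and it is separate from the `MyLogisticRegression` class defined earlier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent on the mean log loss; returns weights, bias, loss history."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    losses = []
    for _ in range(epochs):
        # Step 1: scores and probabilities
        p = sigmoid(X @ w + b)
        # Step 2: mean log loss (probabilities clipped to avoid log(0))
        p_safe = np.clip(p, 1e-12, 1.0 - 1e-12)
        losses.append(float(-np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))))
        # Step 3: gradients of the loss with respect to w and b
        error = p - y
        grad_w = X.T @ error / m
        grad_b = error.mean()
        # Step 4: gradient descent update
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

# Synthetic data (assumption): two Gaussian clusters in 2D, 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b, losses = train_logistic(X, y)
train_acc = float(np.mean((sigmoid(X @ w + b) >= 0.5) == y))
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}, train accuracy: {train_acc:.2f}")
```

Because the weights start at zero, every initial probability is 0.5, so the first recorded loss equals `log(2)`. Watching the loss fall from there is a simple check that the gradients and the update have the right sign.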

After training, predictions are made by thresholding the probability (commonly 0.5).
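
A short sketch of the thresholding step follows. The helper `predict_labels` and the example probabilities are hypothetical, and 0.8 is shown only to illustrate that the cutoff can be changed.

```python
import numpy as np

def predict_labels(proba, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels using a cutoff."""
    return (np.asarray(proba) >= threshold).astype(int)

proba = np.array([0.10, 0.45, 0.50, 0.70, 0.95])
default_labels = predict_labels(proba)               # cutoff 0.5
strict_labels = predict_labels(proba, threshold=0.8)  # stricter cutoff
print(default_labels)
print(strict_labels)
```

Raising the threshold produces fewer positive predictions, which generally trades recall for precision.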
