<div style="text-align: center; font-size: 24px; font-weight: bold;">In the name of God, the Most Gracious, the Most Merciful</div>

Full Name: MohammadDavood VahhabRajaee

Student ID: 4041419041

# Logistic Regression from Scratch using NumPy
**What is logistic regression?**

Logistic regression is a supervised learning method used for binary classification: predicting outcomes that can take only two values (e.g., 0/1, yes/no, positive/negative).

It models the probability that an input belongs to a particular class.

Logistic regression is helpful when you want to predict which of two categories an input belongs to, and when the relationship between the features and the log-odds of the outcome, $\log\frac{p}{1-p}$, is approximately linear.
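
The linearity assumption lives on the log-odds (logit) scale, not on the probability scale. The sketch below illustrates this with a single feature; the weights `w0` and `w1` are made-up values chosen for illustration, not fitted to `binary.csv`.

```python
import numpy as np

# Hypothetical weights, chosen only for illustration (not fitted to binary.csv)
w0, w1 = -1.0, 2.0
x = np.array([-1.0, 0.0, 1.0, 2.0])

log_odds = w0 + w1 * x             # linear in x
p = 1 / (1 + np.exp(-log_odds))    # probability of class 1 (sigmoid of the score)
recovered = np.log(p / (1 - p))    # logit(p) gives back the linear score

print(np.round(p, 3))
print(np.allclose(recovered, log_odds))
```

The probabilities bend into an S-shape, while their logit stays a straight line in `x`.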

**Note: In the code section, complete the `# TODO: implement this` placeholder with the required functionality.**

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## 1. Dataset Overview



### Practice 1
- Load the provided dataset (`binary.csv`)
- How many samples does it have? Are the labels balanced? What are the labels?

In [2]:
# 1. Load the dataset from the CSV file
data = pd.read_csv('binary.csv')
print("✅ Data loaded successfully!")

# 2. Display the first 5 rows to inspect structure and data types
# print("\nFirst 5 rows:")
# print(data.head())

# 3. Print the shape (number of rows, number of columns)
print(f"📊 Shape: {data.shape}\n")

# 4. Analyze the target variable 'admit' to check for balance
print("🎯 Target (admit) distribution:")
admit_counts = data['admit'].value_counts()
print(admit_counts)
admission_rate = data['admit'].mean()
print(f"\nAdmission rate: {admission_rate:.1%}\n")

# 5. Examine the 'rank' feature distribution to understand it better
print("🎓 Rank distribution:")
rank_counts = data['rank'].value_counts().sort_index()
print(rank_counts)

✅ Data loaded successfully!
📊 Shape: (400, 4)

🎯 Target (admit) distribution:
admit
0    273
1    127
Name: count, dtype: int64

Admission rate: 31.8%

🎓 Rank distribution:
rank
1     61
2    151
3    121
4     67
Name: count, dtype: int64


### Practice 2: Preprocess (Add Bias, Normalize)



In [3]:
# 1. Extract features (gre, gpa) and target (admit)
# We exclude 'rank' for this part of the assignment.
X_raw = data[['gre', 'gpa']].values
y = data['admit'].values

print("✅ Raw data loaded:")
print(f"   X_raw shape: {X_raw.shape} | y shape: {y.shape}")
print(f"   gre range: [{X_raw[:, 0].min()}, {X_raw[:, 0].max()}]")
print(f"   gpa range: [{X_raw[:, 1].min():.2f}, {X_raw[:, 1].max():.2f}]")


# 2. Train-Test Split (80% train, 20% test)
# We use stratify=y to ensure the proportion of admitted/rejected is the same in both sets.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42, stratify=y
)

print("\n✅ Train-test split:")
print(f"   Train: {len(y_train)} samples (admit rate: {y_train.mean():.2%})")
print(f"   Test:  {len(y_test)} samples (admit rate: {y_test.mean():.2%})")


# 3. Standardize features (Z-score normalization)
# IMPORTANT: Calculate mean and std ONLY from the training data.
means = np.mean(X_train_raw, axis=0)
stds = np.std(X_train_raw, axis=0)

# Standardize the training set
X_train_std = (X_train_raw - means) / stds

# Standardize the test set using the *training* statistics to prevent data leakage
X_test_std = (X_test_raw - means) / stds

print("\n✅ Standardization (using TRAIN statistics):")
print(f"   gre: μ = {means[0]:.2f}, σ = {stds[0]:.2f}")
print(f"   gpa: μ = {means[1]:.2f}, σ = {stds[1]:.2f}")


# 4. Add bias term (intercept)
# This adds a column of ones to the beginning of the feature matrices.
X_train = np.c_[np.ones(X_train_std.shape[0]), X_train_std]
X_test = np.c_[np.ones(X_test_std.shape[0]), X_test_std]

print("\n✅ Final feature matrices:")
print(f"   X_train shape: {X_train.shape} | feature order: [bias, gre_std, gpa_std]")
print(f"   X_test  shape: {X_test.shape}")
print(f"   First train sample: {X_train[0]}")

✅ Raw data loaded:
   X_raw shape: (400, 2) | y shape: (400,)
   gre range: [220.0, 800.0]
   gpa range: [2.26, 4.00]

✅ Train-test split:
   Train: 320 samples (admit rate: 31.87%)
   Test:  80 samples (admit rate: 31.25%)

✅ Standardization (using TRAIN statistics):
   gre: μ = 588.75, σ = 118.63
   gpa: μ = 3.40, σ = 0.38

✅ Final feature matrices:
   X_train shape: (320, 3) | feature order: [bias, gre_std, gpa_std]
   X_test  shape: (80, 3)
   First train sample: [ 1.         -0.0737578   0.00493654]
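
A quick sanity check on this preprocessing pattern: after standardizing with training statistics, the training features have mean 0 and standard deviation 1 exactly, while the test features are only close to that. The sketch below uses synthetic stand-ins for `gre` and `gpa` (the random draws and their parameters are assumptions, not the real data).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for (gre, gpa); NOT the values from binary.csv
train = rng.normal(loc=[600.0, 3.4], scale=[100.0, 0.4], size=(320, 2))
test = rng.normal(loc=[600.0, 3.4], scale=[100.0, 0.4], size=(80, 2))

# Statistics come from the training split only
mu = train.mean(axis=0)
sigma = train.std(axis=0)

train_std = (train - mu) / sigma
test_std = (test - mu) / sigma   # reuse the train statistics, no refit on test

print("train mean/std:", train_std.mean(axis=0).round(6), train_std.std(axis=0).round(6))
print("test  mean/std:", test_std.mean(axis=0).round(3), test_std.std(axis=0).round(3))
```

The small offset in the test statistics is expected: the test set is treated as unseen data, so nothing about it may influence the scaling.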


## 2. From Linear to Logistic Regression

Recap: linear regression (which you implemented in the previous notebook)

$\hat{y} = Xw$

Problem: the output is unbounded, so it cannot be read as a probability.

Logistic regression:

$\hat{y} = \sigma(Xw)$

Where sigmoid is:

$\sigma(z) = \frac{1}{1 + e^{-z}}$


### practice 3: implement sigmoid function

In [4]:
def sigmoid(z):
    """
    Applies the sigmoid (logistic) function element-wise to input `z`.
    """
    # The formula for the sigmoid function is 1 divided by (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

In [5]:
print("sigmoid(0) =", sigmoid(0))
print("sigmoid(-10) =", sigmoid(-10))
print("sigmoid(10) =", sigmoid(10))

sigmoid(0) = 0.5
sigmoid(-10) = 4.5397868702434395e-05
sigmoid(10) = 0.9999546021312976
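
For very negative inputs, `np.exp(-z)` overflows and NumPy emits a runtime warning, even though the result still rounds to 0. A common workaround is to branch on the sign of `z` so that `exp` is only ever called on non-positive values. The sketch below is one such variant; the name `stable_sigmoid` is hypothetical and it is not part of the assignment's required code. (SciPy's `scipy.special.expit` provides the same function if SciPy is available.)

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only exponentiates non-positive values (always returns an array)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # z >= 0: exp(-z) <= 1, so the textbook formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # z < 0: rewrite as e^z / (1 + e^z), where e^z < 1
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(stable_sigmoid([-1000.0, 0.0, 1000.0]))
```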


## 3. Loss Function

Binary cross entropy:

$J(w) = -\frac{1}{m} \sum \left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$

### practice 4: Implement cost function

In [6]:
def compute_loss(X, y, w):
    """
    Computes the binary cross-entropy (log) loss for logistic regression.
    """
    # Get the number of samples
    m = len(y)
    
    # 1. Calculate the linear combination (z = X * w)
    z = X @ w
    
    # 2. Apply the sigmoid function to get predicted probabilities (y_hat)
    y_hat = sigmoid(z)

    # 3. Compute the binary cross-entropy loss
    # The formula is the mean of -[y*log(y_hat) + (1-y)*log(1-y_hat)]
    # A small epsilon (1e-15) is added to prevent log(0) errors for numerical stability
    loss = -np.mean(y * np.log(y_hat + 1e-15) + (1 - y) * np.log(1 - y_hat + 1e-15))

    return loss
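
To build intuition for the loss values, the sketch below evaluates binary cross-entropy on a few hand-picked predictions. The helper `bce` is a hypothetical stand-alone variant that takes probabilities directly and clips them instead of adding an epsilon; the labels and probabilities are toy values.

```python
import numpy as np

def bce(y, y_hat, eps=1e-15):
    """Binary cross-entropy on probabilities, clipped away from 0 and 1."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([0, 0, 1, 1])                 # toy labels

uninformed = bce(y, np.full(4, 0.5))       # always predict 0.5
confident_right = bce(y, np.array([0.1, 0.2, 0.8, 0.9]))
confident_wrong = bce(y, np.array([0.9, 0.8, 0.2, 0.1]))

print(uninformed, np.log(2))               # both equal ln 2
print(confident_right, confident_wrong)
```

Predicting 0.5 everywhere costs exactly $\ln 2 \approx 0.693$, which is the loss of an all-zero weight vector; confident wrong predictions are penalized far more heavily than confident right ones are rewarded.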

## 4. Gradient Descent

Gradient:

$\frac{\partial J}{\partial w} = \frac{1}{m} X^T (\hat{y} - y)$
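
One way to gain confidence in this formula is a finite-difference gradient check: compare the analytic gradient against central differences of the loss. The sketch below does this on random synthetic data; the helper names (`loss`, `analytic_grad`, `numeric_grad`) and the data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(X, y, w):
    y_hat = sigmoid(X @ w)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grad(X, y, w):
    # (1/m) * X^T (y_hat - y)
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def numeric_grad(X, y, w, h=1e-6):
    # Central difference, one coordinate at a time
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (loss(X, y, w + e) - loss(X, y, w - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]   # bias + 2 synthetic features
y = (rng.random(50) < 0.3).astype(float)           # synthetic 0/1 labels
w = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_grad(X, y, w) - numeric_grad(X, y, w)))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

A difference many orders of magnitude below the gradient's own scale suggests the analytic expression and the loss agree.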

### Practice 5: Implement gradient descent and train with lr=0.0001 for 60k steps. What are the final loss and accuracy?

In [7]:
# TODO: implement this
def gradient_descent(X, y, lr=0.0001, steps=60000, verbose=True):
    """
    Performs batch gradient descent to find the optimal weights `w`.
    """
    # Get the number of samples (m) and features (n)
    m, n = X.shape
    
    # Initialize weights to zeros
    w = np.zeros(n)
    losses = []

    for i in range(steps):
        # --- Forward Pass ---
        # 1. Calculate the linear combination
        z = X @ w
        # 2. Get the predictions (probabilities)
        y_hat = sigmoid(z)

        # --- Gradient Calculation ---
        # The gradient of the binary cross-entropy loss w.r.t. w
        gradient = (1 / m) * X.T @ (y_hat - y)

        # --- Weight Update ---
        # Move weights in the opposite direction of the gradient
        w = w - lr * gradient

        # Log progress at specified intervals
        if i % 20000 == 0 or i == steps - 1:
            loss = compute_loss(X, y, w)
            losses.append(loss)
            if verbose:
                print(f"Step {i:>6} | Loss: {loss:.6f}")

    return w, losses

In [8]:
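# Train (note: this run uses lr=0.001 and 600,000 steps,
# not the lr=0.0001 / 60k steps named in the practice statement)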
w, losses = gradient_descent(X_train, y_train, lr=0.001, steps=600000)

Step      0 | Loss: 0.693097
Step  20000 | Loss: 0.596942
Step  40000 | Loss: 0.596908
Step  60000 | Loss: 0.596908
Step  80000 | Loss: 0.596908
Step 100000 | Loss: 0.596908
Step 120000 | Loss: 0.596908
Step 140000 | Loss: 0.596908
Step 160000 | Loss: 0.596908
Step 180000 | Loss: 0.596908
Step 200000 | Loss: 0.596908
Step 220000 | Loss: 0.596908
Step 240000 | Loss: 0.596908
Step 260000 | Loss: 0.596908
Step 280000 | Loss: 0.596908
Step 300000 | Loss: 0.596908
Step 320000 | Loss: 0.596908
Step 340000 | Loss: 0.596908
Step 360000 | Loss: 0.596908
Step 380000 | Loss: 0.596908
Step 400000 | Loss: 0.596908
Step 420000 | Loss: 0.596908
Step 440000 | Loss: 0.596908
Step 460000 | Loss: 0.596908
Step 480000 | Loss: 0.596908
Step 500000 | Loss: 0.596908
Step 520000 | Loss: 0.596908
Step 540000 | Loss: 0.596908
Step 560000 | Loss: 0.596908
Step 580000 | Loss: 0.596908
Step 599999 | Loss: 0.596908


In [9]:
# 🔹 Test accuracy only
y_test_pred = (sigmoid(X_test @ w) > 0.5).astype(int)
accuracy = np.mean(y_test_pred == y_test)

print(f"Test Accuracy: {accuracy:.4f}")

Test Accuracy: 0.6625
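
Accuracy alone hides how the errors are distributed between the two classes, which matters when labels are imbalanced. The sketch below computes confusion-matrix counts, precision, and recall with plain NumPy. The function name `confusion_counts` is hypothetical, and the labels are toy values rather than the notebook's actual predictions.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn

# Toy labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 0])

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In the notebook, the same function could be called as `confusion_counts(y_test, y_test_pred)` to see how many admitted students the model actually identifies.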


### Practice 6: Analyze the test accuracy. Is it good enough? Does training for more epochs help? Explain.

### **Practice 6: Analyze the Test Accuracy. Is it good enough?**

**No**, an accuracy of ~66.25% is not considered good for this problem.

To evaluate a classifier, we should compare it to a simple **baseline model**. For an imbalanced dataset, the most common baseline is the "majority class classifier," which always predicts the most frequent class.

1.  **Majority Class:** In the overall dataset, the majority class is `0` (rejected): 273 of 400 samples, or about **68.25%**. The stratified test set has nearly the same makeup: 55 of its 80 samples (**68.75%**) are class `0`.
2.  **Baseline Accuracy:** A naive model that always predicts `0` would achieve an accuracy of `68.75%` on the test set.
3.  **Comparison:** Our model's accuracy of `66.25%` (53 of 80 correct) is **worse than this trivial baseline** (55 of 80 correct). This indicates that our model has failed to learn a meaningful pattern from the `gre` and `gpa` features alone.
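
The baseline comparison can be reproduced in a few lines. The sketch below rebuilds a label vector with the test-set composition reported by the split output (80 samples, 31.25% admitted); the array `y_test_demo` is a reconstruction for illustration, not the actual `y_test`.

```python
import numpy as np

# Test-set composition taken from the split output: 80 samples, 31.25% admitted
n_test = 80
n_admit = round(0.3125 * n_test)                       # 25 admitted
y_test_demo = np.array([1] * n_admit + [0] * (n_test - n_admit))

majority = np.bincount(y_test_demo).argmax()           # most frequent label
baseline_acc = np.mean(y_test_demo == majority)        # accuracy of always predicting it

print(f"majority class: {majority}, baseline accuracy: {baseline_acc:.4f}")
# → majority class: 0, baseline accuracy: 0.6875
```

In the notebook itself, replacing `y_test_demo` with `y_test` gives the baseline directly.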

---
**Does training for longer epochs help?**

**No**, training for longer will not help.

Looking at the training output from the `gradient_descent` function, we can see that the loss function has already **converged**.

```text
Step  20000 | Loss: 0.596942
Step  40000 | Loss: 0.596908
...
Step 599999 | Loss: 0.596908
```

The loss value stabilizes at `0.596908` by step 40,000 and does not decrease further. Because the cross-entropy loss of logistic regression is convex, this plateau means gradient descent has effectively reached the best parameters (`w`) available for these two features. Continuing to train will not yield any improvement.
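
Since the loss plateaus long before the last step, a stopping rule can avoid the redundant iterations. The sketch below stops once the gradient norm falls below a tolerance. The function name `gradient_descent_tol`, the `tol` and `lr` values, and the synthetic data are all assumptions for illustration, not the notebook's settings.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_tol(X, y, lr=0.1, max_steps=100_000, tol=1e-6):
    """Batch gradient descent that stops when ||gradient|| < tol."""
    w = np.zeros(X.shape[1])
    for step in range(max_steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return w, step            # converged early
        w -= lr * grad
    return w, max_steps               # hit the step budget

# Synthetic data with overlapping classes, so the optimum is finite
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
p = sigmoid(X @ np.array([-0.8, 0.5, 0.3]))
y = (rng.random(200) < p).astype(float)

w, steps_used = gradient_descent_tol(X, y)
print(f"stopped after {steps_used} steps, w = {np.round(w, 3)}")
```

A loss-difference threshold would work similarly; the gradient norm is used here because it is already computed at every step.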

---
### **Explanation**

The primary reason for the poor performance is likely the **oversimplification of the model**. We intentionally excluded the `rank` feature, which is a strong predictor for university admissions. With only `gre` and `gpa`, the two classes overlap heavily, so no linear decision boundary separates them well. The converged training loss of `0.5969` is only modestly below the $\ln 2 \approx 0.6931$ of an uninformed model, which points the same way.
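
If `rank` were brought back in, one option is to treat it as a categorical variable and one-hot encode it before standardizing the numeric columns. The sketch below shows that encoding on a tiny made-up frame that only mimics the column names of `binary.csv`. Treating `rank` as categorical (rather than numeric) is an assumption, and whether it improves accuracy on the real data is not shown here.

```python
import pandas as pd

# Tiny made-up frame with the same column names as binary.csv (values are illustrative)
demo = pd.DataFrame({
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
    "rank": [3, 3, 1, 4, 2],
})

# One indicator column per rank level; drop the first level to avoid
# redundancy with the bias term
rank_dummies = pd.get_dummies(demo["rank"], prefix="rank", drop_first=True, dtype=float)
X_demo = pd.concat([demo[["gre", "gpa"]], rank_dummies], axis=1)

print(X_demo.columns.tolist())
# → ['gre', 'gpa', 'rank_2', 'rank_3', 'rank_4']
```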