# Machine Learning from Scratch: Linear & Logistic Regression
---
This assignment notebook is designed to help you understand the mathematics behind two fundamental ML algorithms: Linear Regression and Logistic Regression.<br>
You will:
- Derive cost functions (MSE & Log Loss)
- Implement Gradient Descent from scratch
- Apply the models to simple datasets
- Understand how theory translates into practice

*(Let's begin 🚀)*

---

## Linear Regression (predicting numbers)

### The Idea  
We use linear regression when we want to **fit a straight line** to data. It is a simple model used for regression problems in ML.  
Recall from your MTH courses that the equation of a straight line is $\hat y = w \cdot x + b$ <br> where:
- \(w\) = slope or gradient (how steep the line is)  
- \(b\) = intercept (where it cuts the y-axis)  
- \(x\) = feature or independent variable (the input)
- \(y\) = label or dependent variable (the true output); \(\hat y\) is the model's prediction of \(y\)

A linear regression model learns the rule that maps inputs (x) to outputs (y). In this case, it learns the values of w and b that best explain the relationship between x and y.

### Training Data for the Questions

| x | y |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |
| 4 | 8 |

This follows the rule \(y = 2x\).  

---

# **1. Question**
We start with the formula:
$\hat y = w \cdot x + b$ <br>
Your task is to write this formula in code and check that it produces the correct labels. <br>
Starter code:


In [1]:
def predict(x, w, b):
    pass

# Try different values
print(predict(2, 1, 0))   # expect 2
print(predict(2, 2, 0))   # expect 4

---

### Understanding the Loss (or Cost) Function: Mean Squared Error (MSE)

Now that we have drawn a line, how do we know if it is a **good line** or a **bad line**?  
We need a way to **measure how wrong** our line is compared to the actual data.

Think of it this way:
- For each point, we can look at the **real value** (`y`) and the **predicted value** (`ŷ`).
- If our line is perfect, the real value and the predicted value will be the same.
- If our line is not perfect, there will be some **difference (error)**.

We square this difference (so negatives don’t cancel out) and then take the **average of all the squared errors**.  
This gives us a single number that tells us how "bad" or "good" our line is.

**Formula (don’t worry, it’s just saying what we explained):**

$$
MSE = \frac{1}{m} \sum (y - \hat y)^2
$$

Where:  
- \( y \) = the real value from the data  
- \( \hat y \) = the predicted value from our line  
- \( m \) = the number of data points  
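
To make the formula concrete, here is a small worked example on the training data above. It is only a sketch: the guessed line \(\hat y = 1 \cdot x + 0\) is an arbitrary, deliberately wrong choice for illustration, and the variable names are not part of the starter code.

```python
# Real values from the training table (the data follows y = 2x)
ys = [2, 4, 6, 8]

# Predictions of the guessed line y_hat = 1*x + 0 for x = 1, 2, 3, 4
# (the guess w = 1, b = 0 is an assumption made for this example)
y_hats = [1, 2, 3, 4]

# Squared difference between real and predicted value, point by point
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats)]
print(squared_errors)   # [1, 4, 9, 16]

# Average of the squared errors
mse_value = sum(squared_errors) / len(ys)
print(mse_value)        # 7.5
```

For the true line (\(w = 2, b = 0\)) every prediction equals the real value, so every squared error is 0 and the MSE is 0.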

---

# **2. Question**

Calculating the Mean Squared Error (MSE)

Now that you wrote the prediction formula in Question 1, let’s see **how wrong your line is** compared to the real data.  

Remember, the MSE is the **average squared difference** between the predicted values (`ŷ`) and the actual values (`y`). To compute it:

- Use the same `predict()` function you wrote in Question 1.
- Start with the **weight (`w`) and intercept (`b`)** that gave the right outputs for the Training Data, then try other values of `w` and `b` and observe how the MSE changes.
- Calculate the MSE for each choice.

**Starter code**:


In [2]:
# Training data
x_data = [1, 2, 3, 4]
y_data = [2, 4, 6, 8]

# Use your calculated w and b for the data given
w = None  # replace with your chosen value
b = None   # replace with your chosen value

# Step 1: Make predictions using your line (no need to edit this)
y_pred = [predict(xi, w, b) for xi in x_data]

# Step 2: Calculate MSE (no need to edit this)
def mse(x_data, y_data, w, b):
    errors = [(yi - predict(xi, w, b))**2 for xi, yi in zip(x_data, y_data)]
    return sum(errors)/len(x_data)

print("Mean Squared Error:", mse(x_data, y_data, w, b))

### Gradient Descent (Learning the Best Line)

So far we can:
- Make predictions using `w` and `b`
- Measure how wrong the line is using MSE

But how do we **improve `w` and `b`** so the line fits the data better?  
We use a process called **Gradient Descent**, which is just a fancy way of saying:

> “Try changing `w` and `b` little by little to make the error smaller.”

---

#### How it works (in simple words):

1. **Look at the error**: How far is our line from the real points?  
2. **Calculate direction**: Should we increase or decrease `w` and `b`?  
3. **Take a small step**: Move `w` and `b` a little in the right direction.  
4. **Repeat**: Keep adjusting until the line fits well.

The size of each step we move is called the **learning rate** (we call it `lr` in code).  
- Too small → takes a long time to find the optimal `w` and `b`
- Too big → each step can overshoot the best values, so the error may bounce around or even grow instead of shrinking
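
The effect of the learning rate can be seen with a small experiment on the training data from earlier. This is a rough sketch, not the assignment's starter code: the helper names `demo_loss` and `demo_train` and the two learning rates are assumptions chosen for illustration, and the exact point at which training starts to blow up depends on the data.

```python
# Training data from the table above (follows y = 2x)
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

def demo_loss(w, b):
    """MSE of the line w*x + b on the training data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def demo_train(lr, steps):
    """Run gradient descent from w = 0, b = 0 and return the final loss."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # Average gradients of the MSE with respect to w and b
        dw = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / m
        db = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / m
        w -= lr * dw
        b -= lr * db
    return demo_loss(w, b)

start_loss = demo_loss(0.0, 0.0)                # loss before any training
small_lr_loss = demo_train(lr=0.01, steps=20)   # small steps: loss goes down
big_lr_loss = demo_train(lr=0.2, steps=20)      # big steps: loss goes up

print(start_loss, small_lr_loss, big_lr_loss)
```

On this data the small learning rate ends with a loss well below the starting loss, while the large one ends far above it, because every step overshoots by more than the previous one.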

---

#### Formulas (don’t worry, just for reference):

We adjust like this:

$$
w = w - \text{learning rate} \times dw
$$
$$
b = b - \text{learning rate} \times db
$$

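`dw` is the gradient of the error with respect to `w`: it tells us how much the error would go up or down if we nudge `w` a tiny bit. `db` is the same for `b`. We subtract them because we want to move in the direction that makes the error smaller.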
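
The idea of nudging `w` a tiny bit can be checked numerically. The sketch below compares that nudge with the averaged formulas for the MSE, `dw = mean(-2 * x * (y - ŷ))` and `db = mean(-2 * (y - ŷ))`, on the training data from earlier. The helper `demo_mse` and the step size `eps` are assumptions made for this example.

```python
# Training data from the table above (follows y = 2x)
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

def demo_mse(w, b):
    """MSE of the line w*x + b on the training data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, b = 0.0, 0.0   # starting guesses
eps = 1e-4        # size of the tiny nudge (assumed value)

# Nudge w (then b) a little in each direction and measure how the error changes
numeric_dw = (demo_mse(w + eps, b) - demo_mse(w - eps, b)) / (2 * eps)
numeric_db = (demo_mse(w, b + eps) - demo_mse(w, b - eps)) / (2 * eps)

# The same quantities from the averaged gradient formulas
m = len(xs)
formula_dw = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / m
formula_db = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / m

print(numeric_dw, formula_dw)   # both very close to -30
print(numeric_db, formula_db)   # both very close to -10
```

Both gradients are negative here, meaning the error drops when `w` and `b` increase, so the update `w = w - lr * dw` raises `w` toward the true slope of 2.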

---

# **3. Question**

Train the Model

We are going to improve `w` and `b` using Gradient Descent, starting with both set to 0.  

- The code below already **uses your `predict()` function** inside the `compute_gradients()` and `mse()` functions.  
- You **don’t need to call `predict()` directly**, just fill in:
  - `lr` (learning rate)
  - `epochs` (number of steps to repeat)  


In [3]:
# Gradient Descent Starter Code

# Compute gradients for Linear Regression (basically calculates dw and db) (no need to edit this!)
def compute_gradients(x_data, y_data, w, b):
    """
    Calculates how much w and b should change to reduce error.
    Uses the formula:
        dw = average of (-2 * x * (y - y_pred))
        db = average of (-2 * (y - y_pred))
    """
    m = len(x_data)
    dw, db = 0, 0
    for i in range(m):
        y_pred = predict(x_data[i], w, b)  # <-- uses the student's predict() function
        dw += -2 * x_data[i] * (y_data[i] - y_pred)
        db += -2 * (y_data[i] - y_pred)
    return dw / m, db / m


# Initial guesses for w and b
w = 0
b = 0

# Learning rate
lr = None    # replace with a small number like 0.01

# Number of steps
epochs = None   # you can try 50 or 100

# Training loop (no need to edit this!)
for i in range(epochs):
    dw, db = compute_gradients(x_data, y_data, w, b)  # calculates how to change w and b
    w -= lr * dw
    b -= lr * db
    if i % 10 == 0:
        print(f"Step {i}: w={w:.2f}, b={b:.2f}, loss={mse(x_data,y_data,w,b):.2f}")

print(f"\nFinal line: y = {w:.2f}x + {b:.2f}")


In [4]:
## Let's look at how our learned model performs
## Watch how close the red line is to the points: the closer, the better the model learned.

import matplotlib.pyplot as plt

# Predictions using trained w and b
y_pred = [predict(xi, w, b) for xi in x_data]

plt.scatter(x_data, y_data, color="blue", label="Data points")
plt.plot(x_data, y_pred, color="red", label=f"Learned line: y={w:.2f}x + {b:.2f}")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Linear Regression Fit")
plt.legend()
plt.show()


---
## Logistic Regression (predicting classes 0 or 1)

### The Idea  
Sometimes we don’t want to predict numbers — we want to **classify things**.  
For example:

| x | y |
|---|---|
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 1 |

- If `x` is small → class 0  
- If `x` is big → class 1  

We use **Logistic Regression** to find a “line” that separates the two classes.  
Instead of predicting a number, it predicts a **probability** between 0 and 1.  

---

### Sigmoid Function  

We use a special function called **sigmoid** to turn any number into a probability:  

$$
sigmoid(z) = \frac{1}{1 + e^{-z}}
$$

- Input `z` can be any number  
- Output is always between 0 and 1  
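
As a quick check on the formula, plugging in a few values by hand (rounded to two decimals) gives:

$$
sigmoid(0) = \frac{1}{1 + e^{0}} = \frac{1}{2} = 0.5
$$

$$
sigmoid(2) = \frac{1}{1 + e^{-2}} \approx \frac{1}{1 + 0.135} \approx 0.88
$$

$$
sigmoid(-2) = \frac{1}{1 + e^{2}} \approx \frac{1}{1 + 7.389} \approx 0.12
$$

Large positive `z` pushes the output toward 1, and large negative `z` pushes it toward 0.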

# **4. Question**
Write the sigmoid function in code.

**Starter code:**

In [5]:
import math

def sigmoid(z):
    pass

# Test it
print(sigmoid(0))   # 0.5
print(sigmoid(2))   # ~0.88
print(sigmoid(-2))  # ~0.12


---
### Prediction Function

To get the equation for logistic regression, we combine the line formula you wrote in Question 1 (`wx + b`) with the **sigmoid** function you wrote in Question 4:

$$
\hat y = sigmoid(w \cdot x + b)
$$

- `ŷ` is the **predicted probability** that `y = 1`.  
- If `ŷ > 0.5` → predict class **1**  
- If `ŷ <= 0.5` → predict class **0**
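
Because `sigmoid(0) = 0.5` and sigmoid always increases with its input, the 0.5 threshold on `ŷ` is the same as a threshold of 0 on `wx + b`. A short derivation, assuming `w > 0`:

$$
\hat y > 0.5 \iff w \cdot x + b > 0 \iff x > -\frac{b}{w}
$$

So the model switches from class 0 to class 1 at `x = -b/w`. For example, with the purely illustrative values `w = 2, b = -5`, the switch happens at `x = 2.5`, which sits between the class-0 points (`x = 1, 2`) and the class-1 points (`x = 3, 4`) in the table above.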



In [6]:
def predict_prob(x, w, b):
    z = predict(x, w, b)
    return sigmoid(z)

def predict_class(prob):
    """
    Returns the predicted class (0 or 1) for input prob
    """
    return 1 if prob > 0.5 else 0

# Example usage
print(predict_prob(2, 1, 0))  # probability for x=2
print(predict_class(predict_prob(2, 1, 0)))


# **5. Question**

Now that you have your `predict_prob()` function, let's use it to calculate the **probability** and the **class** for some new inputs.

- Calculate probability for `x = 4, w = 2, b = 1`  
- Calculate probability for `x = 1, w = 0.5, b = 1`  
- Calculate probability for `x = 3, w = -2, b = 3`  

**Hint:** Call your `predict_prob(x, w, b)` and `predict_class(prob)` functions for each case.


---

### Loss Function: Log Loss

Remember in **Linear Regression**, we used **Mean Squared Error (MSE)** to see how far off our predictions were:

$$
MSE = \frac{1}{m} \sum (y_i - \hat y_i)^2
$$

- Small MSE → our line fits the data well  
- Large MSE → our line is far from the data  

In **Logistic Regression**, we are predicting **probabilities** instead of exact numbers, so we need a different way to measure error: **Log Loss**.  

### What is Log Loss?

Log Loss tells us:  

> “How confident was the model when it made a wrong prediction?”  

Formula:

$$
LogLoss = -\frac{1}{m} \sum \Big[y \cdot \log(\hat y) + (1-y) \cdot \log(1-\hat y)\Big]
$$

Where:  
- `y` = actual class (0 or 1)  
- `ŷ` = predicted probability that `y = 1`  

Interpretation:  
- **Small Log Loss → good predictions** (predicted probabilities match actual classes)  
- **Large Log Loss → bad predictions** (predictions are far from the true class)  
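
To see how the formula rewards and punishes confidence, here is a small sketch that computes the Log Loss term for one data point at a time. The helper `example_loss` and the probabilities used are hypothetical, picked only for illustration.

```python
import math

def example_loss(y, y_hat):
    """Log Loss term for a single data point with true class y and predicted probability y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# The true class is 1 in all three cases; only the predicted probability changes
confident_right = example_loss(1, 0.9)   # about 0.105
unsure = example_loss(1, 0.5)            # about 0.693
confident_wrong = example_loss(1, 0.1)   # about 2.303

print(round(confident_right, 3), round(unsure, 3), round(confident_wrong, 3))
```

A confident correct prediction costs very little, an unsure one costs more, and a confident wrong prediction is penalized most heavily.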


---

# **6. Question**

Implement Log Loss

Now we will **compute the Log Loss** for some example predictions.  

You already have the `predict_prob(x, w, b)` function.  

Since we haven’t trained the model yet, you can **try different values of `w` and `b`** to see how the Log Loss changes.  

Example:  
- Start with `w = 1, b = 0`  
- Then try `w = 2, b = 1`  
- Try other values and observe how the Log Loss changes  

Use the dataset in the starter code

**Starter Code:**


In [8]:
# Dataset
x_data = [1, 2, 3, 4]
y_data = [0, 0, 1, 1]


def log_loss(x_data, y_data, w, b):
    m = len(x_data)
    total = 0
    for i in range(m):
        y_pred = predict_prob(x_data[i], w, b)
        # avoid log(0)
        y_pred = min(max(y_pred, 1e-10), 1-1e-10)
        total += y_data[i]*math.log(y_pred) + (1-y_data[i])*math.log(1-y_pred)
    return -total/m

w = None  # replace with your chosen value
b = None  # replace with your chosen value
print("Log Loss:", log_loss(x_data, y_data, w, b))

---
### Gradients for Logistic Regression

Just like in **Linear Regression**, we need a way to **update our parameters** `w` and `b` so that our predictions get better.  

- `dw` tells us **how much to change `w`** to reduce Log Loss  
- `db` tells us **how much to change `b`**  

We calculate the **average change needed** across all data points:

$$
dw = \frac{1}{m} \sum (\hat y - y) \cdot x
$$

$$
db = \frac{1}{m} \sum (\hat y - y)
$$

Where:  
- `y` = actual class (0 or 1)  
- `ŷ` = predicted probability from `sigmoid(wx+b)`  
- `m` = number of data points  

**Let's relate this to Linear Regression:**  
- In Linear Regression, `dw` was the derivative of the MSE with respect to `w`  
- Here, `dw` is the derivative of the Log Loss with respect to `w`  
- Same idea: **we look at how much the error changes with respect to each parameter**  
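
For reference, the formulas for `dw` and `db` follow from the chain rule applied to a single data point. This is a sketch, writing \(z = w \cdot x + b\), \(\hat y = sigmoid(z)\), and \(L = -[y \cdot \log(\hat y) + (1-y) \cdot \log(1-\hat y)]\) for that point's Log Loss term:

$$
\frac{\partial L}{\partial \hat y} = -\frac{y}{\hat y} + \frac{1-y}{1-\hat y} = \frac{\hat y - y}{\hat y (1-\hat y)}, \qquad
\frac{\partial \hat y}{\partial z} = \hat y (1-\hat y), \qquad
\frac{\partial z}{\partial w} = x
$$

$$
\frac{\partial L}{\partial w} = \frac{\hat y - y}{\hat y (1-\hat y)} \cdot \hat y (1-\hat y) \cdot x = (\hat y - y) \cdot x
$$

Averaging this over all `m` data points gives `dw`. For `b`, the last factor is \(\partial z / \partial b = 1\), which is why `db` has no `x` in it.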

### Train Logistic Regression

Now that we can:

- Predict probabilities using `predict_prob()`  
- Measure error using `log_loss()`  
- Compute gradients using `compute_gradients_logistic()`

…it’s time to **train the model** using **Gradient Descent**.  

### Steps:

1. Start with initial guesses for `w` and `b`  
2. Choose a small learning rate `lr` (how big each step should be)  
3. Repeat the update for several steps (`epochs`)  

Each step:

$$
w = w - lr \cdot dw
$$

$$
b = b - lr \cdot db
$$

---

# **7. Question**
Fill in `lr` and `epochs`, then run the cell and watch your model learn.




In [9]:
# Dataset
x_data = [1, 2, 3, 4]
y_data = [0, 0, 1, 1]

# Initial guesses
w = 0
b = 0

# Parameters to fill in
lr = None      # learning rate, e.g., 0.1
epochs = None   # number of steps, e.g., 100


def compute_gradients_logistic(x_data, y_data, w, b):
    m = len(x_data)
    dw, db = 0, 0
    for i in range(m):
        y_pred = predict_prob(x_data[i], w, b)  # uses your predict_prob() function
        dw += (y_pred - y_data[i]) * x_data[i]
        db += (y_pred - y_data[i])
    return dw/m, db/m


# Training loop (no need to edit)
for i in range(epochs):
    dw, db = compute_gradients_logistic(x_data, y_data, w, b)
    w -= lr * dw
    b -= lr * db
    if i % 10 == 0:
        print(f"Step {i}: w={w:.2f}, b={b:.2f}, loss={log_loss(x_data,y_data,w,b):.2f}")

print(f"\nFinal model: probability = sigmoid({w:.2f}*x + {b:.2f})")

In [10]:
# Visualize Logistic Regression
# Just as we visualized the linear regression model, let's visualize the logistic regression model

import matplotlib.pyplot as plt

x_range = [i*0.1 for i in range(10, 41)]  # 1.0 to 4.0
y_prob = [predict_prob(xi, w, b) for xi in x_range]

plt.scatter(x_data, y_data, color="blue", label="Data points")
plt.plot(x_range, y_prob, color="red", label="Sigmoid curve")
plt.xlabel("x")
plt.ylabel("Probability of class 1")
plt.title("Logistic Regression Fit")
plt.legend(loc='upper left')
plt.show()


---

# **8. Questions**

1. For both linear and logistic regression, try different learning rates (`lr`). How do they affect training?  
2. What happens if you start with `w` and `b` far from the solution?  
3. For `x = 2.5` in logistic regression, what is the predicted probability and class?  
4. If you set `w = 0` in your Linear Regression model, what does the line look like? What does this tell you about the importance of `w`?  
5. Compare Log Loss to MSE from Linear Regression: which one is easier to interpret? Why?
