# 🧠 Machine Learning Fundamentals: Linear Regression & Gradient Descent

Welcome to your first step into the world of machine learning! In this notebook, we'll explore how computers learn from data — starting with one of the most powerful and intuitive ideas in AI: **learning by improving**.

We’ll break down two essential building blocks of modern machine learning:

**🎯 What You'll Learn:**
1. **Linear Regression** – How machines find patterns and make predictions using lines  
2. **Gradient Descent** – How computers improve their guesses through trial and error  
3. **Real-World Example** – Applying these tools to a relatable scenario

Whether you’re a curious beginner or brushing up on your foundations, this notebook is designed to be:  
- ✅ **Visual** – with clear plots to show what’s happening  
- ✅ **Interactive** – so you can tweak the data and see the results  
- ✅ **Accessible** – no advanced math required, just an open mind

**📦 What’s Inside:**
- A gentle introduction to core ideas  
- A bottom-up learning path with minimal prerequisites  
- A real-world mini project to tie it all together  
- Code you can copy, extend, and reuse

**💡 Why It Matters:**  
These simple tools — linear regression and gradient descent — are the backbone of many AI systems. Understanding them gives you a clear window into how models learn from data, how optimization works, and how predictions are made.

> *From predicting house prices to powering deep learning — this is where it all begins.*

---

Ready? Let’s teach machines to learn! 🚀

In [None]:
# Quick Setup - Import Our Tools. Run this cell first (takes ~10 seconds).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Import our plotting utilities
from plotting_utils.ml_fundamentals import (
    plot_house_data_scatter,
    create_manual_line_interactive,
    plot_computer_best_line,
    plot_learning_process,
    create_learning_rate_interactive,
    plot_coffee_productivity,
    normalize_data,
    denormalize_slope,
    denormalize_intercept
)

In [None]:
# Set random seed for reproducible results
np.random.seed(42)

## Part 1: Linear Regression - Finding the Pattern

### 🏠 The House Price Challenge

Imagine you're a real estate agent. A client asks: *"How much should I price my 1,800 sq ft house?"* You have data from recent sales. How do you find the pattern?

**Linear regression** finds the best straight line through data points - like drawing the "line of best fit" you might remember from school, but done automatically by a computer.


Let's start by creating some realistic house price data. This will help us visualize how linear regression works in practice. The house sizes are in square feet, and the prices are in thousands of dollars.

In [None]:
house_sizes = np.array([800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600])
house_prices = np.array([150, 180, 220, 250, 280, 320, 350, 380, 420, 450])

🧠 What's in this data?

- Each house is described by **one feature**: its size (in square feet).
- The **target** we're trying to predict is the **price** (in $1000s).
- This is a typical supervised learning setup: we want to learn a rule that maps inputs to outputs.

Let's print out the house sizes and prices in data pairs to see what we're working with. The first element is the input (house size), and the second is the output (price).

In [None]:
# Let's print out the house sizes and prices.
print("🏠 Recent House Sales Data:")
for size, price in zip(house_sizes, house_prices):
    print(f"   {size:,} sq ft → ${price}k")

Visualizations help us understand data better. We'll plot the house sizes against their prices. Maybe that will help us see a pattern.

In [None]:
plot_house_data_scatter(house_sizes, house_prices)

🤔 Question: If you had to draw a straight line through these points, where would you draw it?

### 🎮 Interactive: Try to Find the Best Line Yourself!

Now it's your turn! Let's see if you can find the best line that fits the data. The following code allows you to manually adjust the slope and intercept of a line to see how well it fits the data. 

- 🎯 Try different values to minimize the error!
- 💡 The 'best' line minimizes the average squared error.

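**Squared error** = the difference between an actual price and the price your line predicts, squared so that positive and negative differences don't cancel out (and large misses count more). Averaging it over all houses gives the mean squared error, and the goal is to make that as small as possible.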
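
To make the definition concrete, here is a minimal sketch that scores two hand-picked lines on the house data. The helper name `line_mse` and the two guesses are illustrative choices, not part of the notebook's utilities.

```python
import numpy as np

# Same data as in the notebook (sizes in sq ft, prices in $1000s)
house_sizes = np.array([800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600])
house_prices = np.array([150, 180, 220, 250, 280, 320, 350, 380, 420, 450])

def line_mse(slope, intercept, x, y):
    """Average squared difference between the actual y and the line's predictions."""
    predictions = slope * x + intercept
    return np.mean((y - predictions) ** 2)

# Two hand-picked guesses (hypothetical values chosen for illustration)
guess_a = line_mse(0.20, 0.0, house_sizes, house_prices)
guess_b = line_mse(0.17, 10.0, house_sizes, house_prices)

print(f"Guess A (slope=0.20, intercept=0):  MSE = {guess_a:.1f}")
print(f"Guess B (slope=0.17, intercept=10): MSE = {guess_b:.1f}")
```

The guess with the lower number fits the data better; the interactive widget below reports the same kind of score as you move the sliders.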

In [None]:
create_manual_line_interactive(house_sizes, house_prices)

### 🤖 Now Let's See How the Computer Finds the Best Line

Now let's see how the computer finds the best line automatically using linear regression. We'll fit a simple linear regression model to the data and visualize the result. Here the optimal slope and intercept come from a closed-form (analytic) least-squares solution rather than from trial and error. scikit-learn's `LinearRegression` computes it with a least-squares solver, which gives the same line as the textbook normal equation.

Does the line look like the one you drew? If not, don't worry! The computer uses a systematic approach to find the best fit.

In [None]:
# Train linear regression model
model = LinearRegression()
X = house_sizes.reshape(-1, 1)  # Reshape for sklearn
y = house_prices

model.fit(X, y)

# Get the best line parameters
best_slope = model.coef_[0]
best_intercept = model.intercept_
best_predictions = model.predict(X)
best_error = mean_squared_error(y, best_predictions)

# Visualize the result
plot_computer_best_line(house_sizes, house_prices, best_slope, best_intercept, best_predictions, best_error)

This looks like a good fit! Let's see how the computer interprets this line:

In [None]:
print(f"   • Each additional sq ft adds ${best_slope*1000:.0f} to the price")
print(f"   • A 0 sq ft house would cost ${best_intercept*1000:.0f} (base value)")
print(f"   • 🎯 Estimated price for an 1,800 sq ft house: ${model.predict(np.array([[1800]]))[0]:.1f}k")

Great! Now we have a model that predicts house prices based on their sizes. The computer's best line gives us a systematic way to estimate prices, and we can see how it fits the data. This is how machine learning helps us find patterns in data and make predictions based on those patterns.

There are some important aspects to consider when letting a computer find the model:
- **We** still need to pick the model type (linear regression in this case). It is up to us to decide if this is the right model for our data.
- **We** need to define exactly what the model is trying to accomplish. In this case, we want to minimize the squared error between the predicted prices and the actual prices. This is called the training objective.
- **We** need to define exactly how the model reaches this objective. In this case, a closed-form least-squares solution gives the slope and intercept directly. Part 2 introduces an iterative alternative, gradient descent.
- **We** need to curate the data that we use to train the model. In this case, we have a small dataset of house sizes and prices, but in practice, we would want to use a larger and more diverse dataset to ensure the model generalizes well.

All in all, this is a simple example of how machine learning works. We define the model, the objective, and the data, and then let the computer find the best solution. Computers are only as smart as we make them, and they need our guidance to learn effectively.
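
For readers curious about the closed-form fit used in Part 1, here is a minimal sketch of it in plain NumPy. It is not scikit-learn's internal code, just two standard ways to solve the same least-squares problem; the variable names are chosen for this example.

```python
import numpy as np

house_sizes = np.array([800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600], dtype=float)
house_prices = np.array([150, 180, 220, 250, 280, 320, 350, 380, 420, 450], dtype=float)

# Design matrix: one column for the size, one column of ones for the intercept
A = np.column_stack([house_sizes, np.ones_like(house_sizes)])

# Option 1: a least-squares solver for A @ [slope, intercept] ≈ prices
(slope, intercept), *_ = np.linalg.lstsq(A, house_prices, rcond=None)

# Option 2: the textbook normal equation (A^T A) w = A^T y
slope_ne, intercept_ne = np.linalg.solve(A.T @ A, A.T @ house_prices)

print(f"lstsq:           slope={slope:.4f}, intercept={intercept:.2f}")
print(f"normal equation: slope={slope_ne:.4f}, intercept={intercept_ne:.2f}")
```

Both routes should agree with `model.coef_[0]` and `model.intercept_` from the scikit-learn cell, up to rounding.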

## Part 2: Gradient Descent - How Computers Learn

### 🏔️ The Mountain Climbing Analogy

The computer found the best line for our housing price problem with a direct formula. Most models have no such formula, so **how** does a computer find good parameters in general? Imagine you're hiking in thick fog and want to reach the bottom of a valley (the lowest error). You can't see far, but you can feel which direction slopes downward. So you:

1. **Feel the ground** around your feet (measure the slope)
2. **Take a step** downhill (adjust your position)  
3. **Repeat** until you reach the bottom (find the minimum error)

This is **gradient descent** - the fundamental algorithm that powers most machine learning!
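
For the mean squared error used in this notebook, the three steps above can be written out explicitly. This is the standard textbook form, sketched with symbols that do not appear in the original text: $m$ for the slope, $b$ for the intercept, $\eta$ for the learning rate, and $n$ for the number of data points.

$$
\text{MSE}(m, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2
$$

"Feeling the ground" corresponds to computing the two partial derivatives:

$$
\frac{\partial\,\text{MSE}}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl(y_i - (m x_i + b)\bigr),
\qquad
\frac{\partial\,\text{MSE}}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)
$$

"Taking a step downhill" corresponds to moving each parameter against its derivative:

$$
m \leftarrow m - \eta \, \frac{\partial\,\text{MSE}}{\partial m},
\qquad
b \leftarrow b - \eta \, \frac{\partial\,\text{MSE}}{\partial b}
$$

These are the same expressions that appear as `slope_gradient` and `intercept_gradient` in the demo code later in this part.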

**🎯 Why This Matters:** Neural networks use gradient descent to learn. They adjust their parameters (weights) iteratively to minimize the error between predicted and actual outputs, which is how deep learning models pick up complex patterns in data. Learning the principle on a simple example makes the more complex models easier to understand.

### 🎬 Gradient Descent in Action - Step by Step

The following code simulates the gradient descent process for our house price model. It starts from a flat line (slope 0, intercept 0) and iteratively adjusts it to minimize the error.

We need a little data preparation to make gradient descent work well: the algorithm converges much faster and more reliably when the input feature is normalized (mean = 0, standard deviation = 1). So we normalize the house sizes before training and convert the learned slope and intercept back to the original units ('denormalize') afterwards.
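
The normalize-then-convert-back idea can be sketched as follows. The notebook's own `normalize_data`, `denormalize_slope`, and `denormalize_intercept` helpers are not shown here, so `standardize` and `to_original_scale` are hypothetical stand-ins written for this example, and `np.polyfit` is used only as a quick reference fit.

```python
import numpy as np

def standardize(x):
    """Shift x to mean 0 and scale it to standard deviation 1."""
    mean, std = x.mean(), x.std()
    return (x - mean) / std, mean, std

def to_original_scale(slope_norm, intercept_norm, mean, std):
    """Convert a line fitted on standardized x back to the original units.

    y = slope_norm * (x - mean) / std + intercept_norm
      = (slope_norm / std) * x + (intercept_norm - slope_norm * mean / std)
    """
    slope = slope_norm / std
    intercept = intercept_norm - slope_norm * mean / std
    return slope, intercept

house_sizes = np.array([800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600])
house_prices = np.array([150, 180, 220, 250, 280, 320, 350, 380, 420, 450])

x_norm, mean_size, std_size = standardize(house_sizes)

# Fit a line on the standardized sizes, then convert it back
slope_norm, intercept_norm = np.polyfit(x_norm, house_prices, deg=1)
slope, intercept = to_original_scale(slope_norm, intercept_norm, mean_size, std_size)

print(f"Normalized fit: slope={slope_norm:.2f}, intercept={intercept_norm:.2f}")
print(f"Original units: slope={slope:.4f}, intercept={intercept:.2f}")
```

On standardized sizes the best intercept equals the mean price, which is why the two parameters can be learned at a similar pace.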

Let's see it in action! We run the gradient descent algorithm with two parameters:
- **Learning rate**: How big of a step we take downhill each time. A small value means we take small steps, while a larger value means we take bigger steps.
- **Number of iterations**: How many times we repeat the process of feeling the ground and taking a step.

In [None]:
# Normalize data for stable learning
house_sizes_norm, mean_size, std_size = normalize_data(house_sizes)
mean_price = np.mean(house_prices)

def gradient_descent_demo(learning_rate, steps):
    """Demonstrate gradient descent step by step"""

    # Start from a flat line at zero (a fixed starting point, not a random one)
    slope = 0
    intercept = 0
    slope_orig = 0
    intercept_orig = 0

    # Track progress
    history = {'step': [], 'slope': [], 'intercept': [], 'error': []}

    print(f"🚀 Starting gradient descent (LR={learning_rate}, {steps} steps)")
    print("Step | Error   | Slope  | Intercept")
    print("-" * 35)

    for step in range(steps):
        # Calculate predictions and error
        predictions = slope * house_sizes_norm + intercept
        error = np.mean((house_prices - predictions) ** 2)

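        # Convert to original scale for display.
        # Note: the helper receives the current slope and mean_price, so the
        # displayed intercept is derived from those, not from `intercept` itself.
        slope_orig = denormalize_slope(slope, std_size)
        intercept_orig = denormalize_intercept(slope, mean_size, std_size, mean_price)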

        # Record progress
        history['step'].append(step)
        history['slope'].append(slope_orig)
        history['intercept'].append(intercept_orig)
        history['error'].append(error)

        # Print progress every 10 steps
        if step % 10 == 0:
            print(f"{step:4d} | {error:7.2f} | {slope_orig:6.4f} | {intercept_orig:9.2f}")

        # Calculate gradients (which direction to move)
        n = len(house_sizes_norm)
        errors = house_prices - predictions
        slope_gradient = -2 * np.sum(errors * house_sizes_norm) / n
        intercept_gradient = -2 * np.sum(errors) / n

        # Take a step downhill
        slope = slope - learning_rate * slope_gradient
        intercept = intercept - learning_rate * intercept_gradient

    print(f"\n🏁 Final result: Slope={slope_orig:.4f}, Intercept={intercept_orig:.2f}")
    print(f"🎯 Compare to computer's solution: Slope={best_slope:.4f}, Intercept={best_intercept:.2f}")

    return history

# Run the demonstration
history = gradient_descent_demo(learning_rate=0.02, steps=50)

You'll see how the slope and intercept change over time, and how the error decreases as the model learns. We can also see that we aren't quite at the bottom of the valley yet, but we are getting closer with each step. How many more steps do you think it will take to reach the bottom?

It's probably better to get a visualization of the gradient descent process.

### 📊 Visualize the Learning Process

You'll 'see' the learning process in action! The following code visualizes how the model learns over time. It shows how the slope and intercept change with each step, and how the error decreases as the model learns.

In [None]:
# Plot the error and the parameter values recorded at each step
plot_learning_process(history)

**🔍 What you're seeing:**
- Left: Error goes down as the algorithm learns ("going downhill")
- Right: Parameters gradually approach the optimal values (dashed lines). The slope already looks pretty good from the start, but the intercept is still far from the optimal value. This shows how gradient descent iteratively improves the model parameters.

This is exactly how neural networks train, too!

### 🎛️ Interactive: Effect of Learning Rate

Let's explore how the learning rate affects the gradient descent process. The learning rate determines how big of a step we take downhill each time. A small learning rate means we take small steps, while a larger learning rate means we take bigger steps.

The following code allows you to adjust the learning rate and see how it affects the gradient descent process. You can try different values and see how quickly the model converges to the optimal solution. By default, the code runs 50 iterations, but you can change this to see how the model learns over time.

**🎯 Can you find a learning rate that works well?**

- If the learning rate is too small, the model takes a long time to converge.
- If the learning rate is too large, the model may overshoot the optimal solution and oscillate around it, or even diverge.
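
Both failure modes from the list above can be reproduced in a few lines. This is a minimal sketch that re-implements the batch update on standardized sizes; the function name `final_mse` and the three learning rates are illustrative choices, not values taken from the interactive widget.

```python
import numpy as np

house_sizes = np.array([800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600])
house_prices = np.array([150, 180, 220, 250, 280, 320, 350, 380, 420, 450])

# Standardize the sizes (mean 0, standard deviation 1)
x = (house_sizes - house_sizes.mean()) / house_sizes.std()

def final_mse(learning_rate, steps=50):
    """Run batch gradient descent from (0, 0) and return the final mean squared error."""
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        errors = house_prices - (slope * x + intercept)
        slope -= learning_rate * (-2 * np.mean(errors * x))
        intercept -= learning_rate * (-2 * np.mean(errors))
    return np.mean((house_prices - (slope * x + intercept)) ** 2)

# Hypothetical settings: too small, reasonable, too large
for lr in (0.001, 0.1, 1.1):
    print(f"learning_rate={lr:<5} -> MSE after 50 steps: {final_mse(lr):.3g}")
```

With the smallest rate the error has barely moved after 50 steps, and with the largest one it ends up higher than where it started, which is what divergence looks like in numbers.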

In [None]:
create_learning_rate_interactive(house_sizes_norm, house_prices, std_size, mean_size, mean_price, denormalize_slope, denormalize_intercept)

### 🌀 Gradient Descent, but Stochastic!

The gradient descent we just saw used all data points at once to compute the slope and intercept update. This is called batch gradient descent.

But what if we had millions of data points? It would take a long time to compute the gradients at every step!

Stochastic Gradient Descent (SGD) speeds this up by using only one data point at a time to update the parameters. Each update is much cheaper and needs less memory, and the noise in the updates can sometimes help the model generalize, too.
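
To see why a single data point can stand in for the whole dataset, the sketch below compares the gradient computed from all points with the gradients computed from each point alone. The current parameters (`slope`, `intercept`) are arbitrary values picked for illustration, and the function names are hypothetical.

```python
import numpy as np

house_sizes = np.array([800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600])
y = np.array([150, 180, 220, 250, 280, 320, 350, 380, 420, 450])
x = (house_sizes - house_sizes.mean()) / house_sizes.std()  # standardized sizes

slope, intercept = 0.05, 100.0  # arbitrary "current" parameters

def sample_gradient(i):
    """Gradient of the squared error for data point i alone: [d/d slope, d/d intercept]."""
    residual = y[i] - (slope * x[i] + intercept)
    return np.array([-2 * x[i] * residual, -2 * residual])

def batch_gradient():
    """Gradient of the mean squared error over all data points."""
    residuals = y - (slope * x + intercept)
    return np.array([-2 * np.mean(x * residuals), -2 * np.mean(residuals)])

per_sample = np.array([sample_gradient(i) for i in range(len(x))])

print("Batch gradient:             ", batch_gradient())
print("Mean of per-sample gradients:", per_sample.mean(axis=0))
print("Std of per-sample gradients: ", per_sample.std(axis=0))
```

On average, the single-point gradients point the same way as the batch gradient; the spread between them is the source of the noisy path SGD takes.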

**🧠 Two Ways to Learn: Batch vs Stochastic Gradient Descent**

| Method                          | Uses                  | Pros                                   | Cons                           |
|---------------------------------|-----------------------|----------------------------------------|--------------------------------|
| **Batch Gradient Descent**      | All data at each step | Stable steps, clear convergence        | Can be slow for large datasets |
| **Stochastic Gradient Descent** | One point at a time   | Cheap updates, can help generalization | Noisy path, less stable        |


The following code repeats the gradient descent process, but this time using stochastic gradient descent. It updates the slope and intercept using only one data point at a time, which makes each step much cheaper to compute. How do you think it compares to the batch gradient descent we saw earlier?

In [None]:
def sgd_demo(learning_rate, steps):
    """Stochastic Gradient Descent using one data point at a time"""

    slope = 0
    intercept = 0
    slope_orig = 0
    intercept_orig = 0

    history = {'step': [], 'slope': [], 'intercept': [], 'error': []}

    print(f"🎯 Starting SGD (LR={learning_rate}, {steps} steps)")

    for step in range(steps):
        # Randomly pick one data point
        idx = np.random.randint(0, len(house_sizes_norm))
        x_i = house_sizes_norm[idx]
        y_i = house_prices[idx]

        # Prediction for just this point
        prediction = slope * x_i + intercept
        error = (y_i - prediction) ** 2

        # Gradients for this one point
        slope_grad = -2 * x_i * (y_i - prediction)
        intercept_grad = -2 * (y_i - prediction)

        # Update parameters
        slope -= learning_rate * slope_grad
        intercept -= learning_rate * intercept_grad

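        # Track in original scale.
        # Note: the helper receives the current slope and mean_price, so the
        # displayed intercept is derived from those, not from `intercept` itself.
        slope_orig = denormalize_slope(slope, std_size)
        intercept_orig = denormalize_intercept(slope, mean_size, std_size, mean_price)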

        history['step'].append(step)
        history['slope'].append(slope_orig)
        history['intercept'].append(intercept_orig)
        history['error'].append(error)

        if step % 10 == 0:
            print(f"{step:4d} | Error: {error:.2f} | Slope: {slope_orig:.4f} | Intercept: {intercept_orig:.2f}")

    print(f"\n🏁 Final SGD result: Slope={slope_orig:.4f}, Intercept={intercept_orig:.2f}")
    return history

sgd_history = sgd_demo(learning_rate=0.03, steps=50)

The new result does not look quite as good as the one we got with batch gradient descent, but keep in mind that each step is based on only one data point. In 50 steps the model sees just 50 randomly drawn data points in total, whereas batch gradient descent processed all 10 data points at each of its 50 steps (500 in total)!

Let's visualize the learning process again to see how the model learns with stochastic gradient descent. The following code shows how the slope and intercept change over time, and how the error decreases as the model learns. You will notice that everything looks a bit more noisy than with batch gradient descent, but the model still learns and improves over time.

In [None]:
plot_learning_process(sgd_history)


## Part 3: Quick Real-World Application (5 minutes)

### 📚 Predicting Data Scientist Productivity

Let’s bring regression into the daily life of a data scientist — and yes, that includes coffee.

In this real-world-style example, we’ll try to predict how many tasks a data scientist gets done based on how many cups of coffee they drink per day. It's a relatable scenario: some caffeine, some inspiration, maybe a bit of chaos — but is there a measurable pattern?

Our fictional dataset tracks daily coffee intake and task completion. With it, you’ll:
- Explore whether productivity increases linearly with coffee consumption
- Train a simple linear model on one feature: cups of coffee
- Use the model to predict performance for new caffeine levels (including dangerously high ones!)
- Visualize the trend and ask yourself: does more always mean better?

This is a playful example, but it reflects the kind of quick exploratory modeling that kicks off many real-world data projects. It’s a chance to practice everything you’ve learned in a familiar but fun setting — and who knows, maybe you’ll discover your optimal coffee zone along the way. ☕

In [None]:
# Create fictional data: Coffee intake (cups/day) vs tasks completed
coffee_cups = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
tasks_done = np.array([2, 4, 6, 8, 10, 11, 11, 10, 9])  # Diminishing returns!

print("☕ Coffee and Productivity Data:")
for cups, tasks in zip(coffee_cups, tasks_done):
    print(f"   {cups} cup(s)/day → {tasks} tasks completed")

# Train linear regression model
coffee_model = LinearRegression()
coffee_model.fit(coffee_cups.reshape(-1, 1), tasks_done)

# Predict productivity for a given input
cups_input = 6
predicted_tasks = coffee_model.predict(np.array([[cups_input]]))[0]

print(f"\n🎯 Prediction: With {cups_input} cups of coffee/day")
print(f"   → predicted productivity: {predicted_tasks:.1f} tasks/day")

# Visualization
plot_coffee_productivity(coffee_cups, tasks_done, coffee_model, cups_input, predicted_tasks)

💡 Insight:
- In the data, each extra cup of coffee adds 2 completed tasks, but only up to about 5 cups. The fitted line averages over the whole range and gives roughly 1 task per extra cup.
- The data tells a non-linear story: productivity plateaus (and may even drop!) after about 6 cups.
- A linear model is easy to fit, but it captures neither the diminishing returns nor the steep initial boost.

👉 This is a great reminder: always check your model assumptions before trusting the predictions!
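
One simple way to check the linear assumption is to look at the residuals (actual minus predicted). The sketch below refits the coffee data with `np.polyfit`, which finds the same least-squares line as the `LinearRegression` model above; the variable names are chosen for this example.

```python
import numpy as np

coffee_cups = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
tasks_done = np.array([2, 4, 6, 8, 10, 11, 11, 10, 9])

# Least-squares line through the coffee data
slope, intercept = np.polyfit(coffee_cups, tasks_done, deg=1)
residuals = tasks_done - (slope * coffee_cups + intercept)

print(f"Fitted line: tasks = {slope:.2f} * cups + {intercept:.2f}")
for cups, r in zip(coffee_cups, residuals):
    print(f"   {cups} cup(s)/day: residual {r:+.2f}")
```

If the relationship were truly linear, the residuals would scatter around zero without structure. Here they are negative at both ends and positive in the middle, a curved pattern that signals the straight line is missing something.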


## 🎉 What You've Accomplished Today!

In just 45 minutes, you've explored the foundations of machine learning — the same building blocks behind modern AI systems used in everything from recommendation engines to autonomous vehicles.

✅ **Core Concepts Mastered:**
1. **Linear Regression** - Using simple models to uncover patterns and make predictions
2. **Gradient Descent** - Understanding how machines "learn" by iteratively minimizing errors
3. **Real-World Framing** - Applying models to relatable problems, and learning when they break

🚀 **Key Insights:**
- **Machine learning starts simple** — with lines and gradients — but these tools scale to powerful models.
- **Model assumptions matter** – A linear model is fast and interpretable, but it can miss the big picture.
- **Learning is iterative** – Algorithms improve step by step, just like we do.

🎯 **Next Steps:**
1. **Try your own data** - Apply these concepts to problems you care about
2. **Learn more algorithms** - Decision trees, neural networks, etc.


🌟 **The Big Picture:**

What you’ve seen today — especially gradient descent — isn’t just a classroom exercise. It’s the beating heart of deep learning, which uses the same principles to train complex neural networks on massive datasets.

Understanding these fundamentals gives you:
- The confidence to explore more complex models
- The ability to debug and demystify what’s happening under the hood
- A clear lens on where machine learning excels — and where caution is needed

👏 **Congratulations!** You now understand the core principles that power the AI revolution. The tools may grow in complexity, but the core ideas — patterns, learning, and optimization — start right here.
