<img src="../assets/Header_NovaAi_Camp2.0.png" alt="NovaAi Camp 2.0 Header" width="100%">

# Machine Learning From Scratch 🔧

## Understanding HOW Machine Learning Works

Welcome! Before we use fancy libraries, let's build machine learning models **from scratch** using only NumPy and basic Python.

### Why Learn From Scratch?

**Think of it like learning to drive:**
- First, you need to understand what's under the hood (the engine)
- Then you can drive confidently knowing how it works
- Same with ML - understand the math, then use the tools!

### What You'll Build Today:

1. **Linear Regression** - Find the best line through data
2. **Prediction Function** - Make predictions manually
3. **Error Metrics** - Calculate how good your model is
4. **Train/Test Split** - Test on unseen data
5. **Multiple Features** - Handle more complex problems

### What You'll Learn:

- 📐 The math behind machine learning (simplified!)
- 🔢 How models find patterns in data
- 📊 How to measure success
- 🧪 Why we split data into train and test sets

By the end of this notebook, you'll understand EXACTLY how machine learning works under the hood! 🚀

# Part 1: Setup & Introduction 📦

In [None]:
# Import basic libraries (no machine learning libraries yet!)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print("Libraries imported successfully! ✓")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

# Part 2: Understanding Linear Regression 📈

## What is Linear Regression?

**Simple Answer:** Finding the **best straight line** that fits through your data points.

### Real-World Example:

Imagine you're trying to predict **house prices** based on **square footage**.

- You know that bigger houses cost more
- But HOW MUCH more per square foot?
- Linear regression finds the answer!

### The Equation:

$$\text{Price} = (\text{Slope} \times \text{Square Feet}) + \text{Intercept}$$

Or in math notation:

$$y = mx + b$$

**Where:**
- **y** = Price (what we want to predict)
- **x** = Square Feet (what we know)
- **m** = Slope (how much price changes per sq ft)
- **b** = Intercept (base price when sq ft = 0)

### What We Need to Find:

Our job is to find the **best values** for:
1. **Slope (m)** - The rate of change
2. **Intercept (b)** - The starting point

## Visual Understanding

Let's see what different lines look like:

In [None]:
# Create sample data
square_feet = np.array([1000, 1500, 2000, 2500, 3000])
prices = np.array([200000, 300000, 400000, 500000, 600000])

# Plot the data points
plt.figure(figsize=(10, 6))
plt.scatter(square_feet, prices, color='blue', s=100, label='Actual Houses', zorder=3)

# Try different lines
x_line = np.array([1000, 3000])

# Bad line 1 (slope too low)
y_bad1 = 100 * x_line + 100000
plt.plot(x_line, y_bad1, 'r--', alpha=0.5, label='Bad Line (slope too low)')

# Bad line 2 (slope too high)
y_bad2 = 300 * x_line - 50000
plt.plot(x_line, y_bad2, 'orange', linestyle='--', alpha=0.5, label='Bad Line (slope too high)')

# Good line (we'll calculate this soon!)
y_good = 200 * x_line
plt.plot(x_line, y_good, 'g-', linewidth=3, label='Good Line (best fit)', zorder=2)

plt.xlabel('Square Feet', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.title('Different Lines Through the Same Data', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("The green line fits best because it's closest to all points!")

## How Do We Find the Best Line?

### The Goal: Minimize Errors

**Error** = How far each point is from the line

We want to find the line that makes the **total error as small as possible**.

### The Method: Least Squares

For each data point:
1. Calculate the error: `Error = Actual Price - Predicted Price`
2. Square it (to make negatives positive): `Squared Error = Error²`
3. Add up all squared errors: `Total Error = Sum of all Squared Errors`

The **best line** is the one that minimizes this total error!
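
To make this concrete, here is a minimal sketch that scores the three candidate lines from the plot above by their total squared error. The helper name `total_squared_error` is made up for this illustration, and the arrays are stored as floats so the large squared values do not overflow.

```python
import numpy as np

# Same five houses used in the plot above (floats avoid integer overflow when squaring)
square_feet = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
prices = np.array([200000, 300000, 400000, 500000, 600000], dtype=float)

def total_squared_error(slope, intercept, x, y):
    """Sum of squared differences between actual and predicted values."""
    predicted = slope * x + intercept
    errors = y - predicted
    return float(np.sum(errors ** 2))

# (slope, intercept) for each candidate line drawn earlier
candidates = {
    "slope too low": (100, 100000),
    "slope too high": (300, -50000),
    "best fit": (200, 0),
}

for name, (m, b) in candidates.items():
    sse = total_squared_error(m, b, square_feet, prices)
    print(f"{name:<15} total squared error = {sse:,.0f}")
```

The line with the smallest total is the one least squares would pick. For this toy data the green line has a total error of zero because every point sits exactly on it.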

### The Formulas:

**Slope (m):**

$$m = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}}$$

**Intercept (b):**

$$b = \bar{y} - m\bar{x}$$

Where:
- $x_i$, $y_i$ = Individual data points
- $\bar{x}$, $\bar{y}$ = Mean (average) of x and y
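
For the curious, here is a sketch of where these two formulas come from. It assumes the total squared error from the steps above, written as a function $E(m, b)$ (the name $E$ is introduced here for illustration), and uses calculus: at the minimum, both partial derivatives are zero.

$$E(m, b) = \sum_{i=1}^{n} \left(y_i - (m x_i + b)\right)^2$$

$$\frac{\partial E}{\partial b} = -2 \sum_{i=1}^{n} (y_i - m x_i - b) = 0 \quad \Rightarrow \quad b = \bar{y} - m\bar{x}$$

$$\frac{\partial E}{\partial m} = -2 \sum_{i=1}^{n} x_i (y_i - m x_i - b) = 0$$

Substituting $b = \bar{y} - m\bar{x}$ into the last equation and solving for $m$ gives:

$$m = \frac{\sum{x_i (y_i - \bar{y})}}{\sum{x_i (x_i - \bar{x})}} = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}}$$

The final step works because $\sum (y_i - \bar{y}) = 0$ and $\sum (x_i - \bar{x}) = 0$, so subtracting $\bar{x}$ inside each sum changes nothing.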

**Don't worry!** We'll code this step by step below! 👇

# Part 3: Building Linear Regression From Scratch 🛠️

## Step 1: Create Simple Dataset

In [None]:
# Simple dataset: Square Feet vs Price
data = {
    'Square_Feet': [1000, 1500, 2000, 2500, 3000],
    'Price': [200000, 300000, 400000, 500000, 600000]
}

df = pd.DataFrame(data)

print("Our Training Data:")
print(df)
print("\nNotice the pattern: Each 500 sq ft adds about $100,000 to the price")

## Step 2: Calculate Slope and Intercept Manually

Let's calculate step by step using the formulas!

In [None]:
# Extract our x and y values
x = df['Square_Feet'].values  # Features
y = df['Price'].values         # Target

print("Step 1: Calculate the means (averages)")
mean_x = np.mean(x)
mean_y = np.mean(y)

print(f"Mean of Square Feet: {mean_x:,.0f}")
print(f"Mean of Price: ${mean_y:,.0f}")
print("\n" + "="*50 + "\n")

print("Step 2: Calculate the slope (m)")
print("Formula: m = Σ[(x - mean_x)(y - mean_y)] / Σ[(x - mean_x)²]")

# Numerator: sum of products of deviations
numerator = np.sum((x - mean_x) * (y - mean_y))
print(f"  Numerator: {numerator:,.0f}")

# Denominator: sum of squared deviations of x
denominator = np.sum((x - mean_x) ** 2)
print(f"  Denominator: {denominator:,.0f}")

# Calculate slope
slope = numerator / denominator
print(f"  Slope (m): {slope:.2f}")
print(f"  Interpretation: Each square foot adds ${slope:.2f} to the price!")
print("\n" + "="*50 + "\n")

print("Step 3: Calculate the intercept (b)")
print("Formula: b = mean_y - (m × mean_x)")

intercept = mean_y - (slope * mean_x)
print(f"  Intercept (b): ${intercept:,.2f}")
print(f"  Interpretation: Base price when sq ft = 0")
print("\n" + "="*50 + "\n")

print("✓ WE DID IT! We found our model:")
print(f"\nPrice = {slope:.2f} × Square_Feet + {intercept:,.2f}")
print(f"\nOr: Price = {slope:.2f} × Square_Feet + {intercept:.2f}")
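
As a sanity check, the hand-computed values can be compared against NumPy's built-in polynomial fit. This is only a sketch: it recomputes the slope and intercept so the block runs on its own, and the `_manual` and `_np` variable names are made up here. `np.polyfit` with `deg=1` returns coefficients from the highest power down, so the slope comes before the intercept.

```python
import numpy as np

x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([200000, 300000, 400000, 500000, 600000], dtype=float)

# Manual least-squares formulas from this section
slope_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_manual = y.mean() - slope_manual * x.mean()

# NumPy's degree-1 polynomial fit returns [slope, intercept]
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"Manual:  slope={slope_manual:.2f}, intercept={intercept_manual:.2f}")
print(f"polyfit: slope={slope_np:.2f}, intercept={intercept_np:.2f}")
```

The two results should agree up to tiny floating-point differences.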

## Step 3: Create a Prediction Function

Now that we have our slope and intercept, let's create a function to make predictions!

In [None]:
def predict_price(square_feet, slope, intercept):
    """
    Our handmade prediction function!
    
    Formula: y = mx + b
    
    Parameters:
    - square_feet: Size of the house
    - slope: How much each sq ft adds to price
    - intercept: Base price
    
    Returns:
    - predicted_price: Our prediction!
    """
    predicted_price = (slope * square_feet) + intercept
    return predicted_price

# Test our function!
test_sqft = 2200
predicted = predict_price(test_sqft, slope, intercept)

print(f"🏠 Prediction for a {test_sqft:,} sq ft house:")
print(f"   Predicted Price: ${predicted:,.2f}")
print("\n" + "="*50 + "\n")

# Try multiple predictions
test_houses = [1200, 1800, 2500, 3200]

print("Predictions for different house sizes:")
print("-" * 40)
for sqft in test_houses:
    pred = predict_price(sqft, slope, intercept)
    print(f"{sqft:,} sq ft  →  ${pred:,.2f}")

## Step 4: Visualize Our Model

In [None]:
# Plot actual data points
plt.figure(figsize=(10, 6))
plt.scatter(df['Square_Feet'], df['Price'], color='blue', s=150, 
            label='Actual Prices', zorder=3, edgecolors='black', linewidth=2)

# Create our prediction line
x_line = np.array([1000, 3000])
y_line = predict_price(x_line, slope, intercept)

plt.plot(x_line, y_line, color='red', linewidth=3, 
         label=f'Our Model: y = {slope:.2f}x + {intercept:.2f}', zorder=2)

# Add prediction for 2200 sq ft
pred_2200 = predict_price(2200, slope, intercept)
plt.scatter([2200], [pred_2200], color='green', s=200, marker='*', 
            label='New Prediction (2200 sq ft)', zorder=4, edgecolors='black', linewidth=2)

plt.xlabel('Square Feet', fontsize=12, fontweight='bold')
plt.ylabel('Price ($)', fontsize=12, fontweight='bold')
plt.title('Our Handmade Linear Regression Model', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("✓ The red line is our model!")
print("✓ The green star is a new prediction we made!")

# Part 4: Model Evaluation - How Good Is Our Model? 📊

## Why Evaluate?

Making predictions is great, but **how do we know if they're good?**

We need **metrics** (measurements) to answer:
- How accurate are our predictions?
- Can we trust this model?
- Is it better than just guessing?

## Understanding Errors

**Error** = How far off our prediction is from the actual value

```
Error = Actual Price - Predicted Price
```

**Example:**
- Actual price: $400,000
- Predicted price: $380,000
- Error: $400,000 - $380,000 = $20,000 (we were $20k off)

Let's calculate errors for all our data points!

In [None]:
# Make predictions for all our training data
df['Predicted_Price'] = predict_price(df['Square_Feet'], slope, intercept)

# Calculate errors
df['Error'] = df['Price'] - df['Predicted_Price']
df['Absolute_Error'] = np.abs(df['Error'])
df['Squared_Error'] = df['Error'] ** 2

print("Predictions vs Actual:")
print("="*60)
print(df)
print("\n" + "="*60)

print("\nError Analysis:")
print("-" * 40)
for idx, row in df.iterrows():
    print(f"House {idx+1}: Predicted ${row['Predicted_Price']:,.0f}, "
          f"Actual ${row['Price']:,.0f}, Error: ${row['Error']:,.0f}")

## Metric 1: Mean Absolute Error (MAE)

**What it is:** The average of all absolute errors.

**Formula:**

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

**In simple terms:** "On average, how far off are we?"

**Why it's useful:**
- Easy to understand (same units as target)
- If predicting house prices, MAE is in dollars
- "On average, we're off by $X"

In [None]:
# Calculate MAE manually
mae = np.mean(df['Absolute_Error'])

print("Mean Absolute Error (MAE):")
print("="*60)
print(f"MAE: ${mae:,.2f}")
print("\nInterpretation:")
print(f"On average, our predictions are off by ${mae:,.2f}")
print("\nThis is our 'average error' - the smaller, the better!")

## Metric 2: Mean Squared Error (MSE)

**What it is:** The average of all squared errors.

**Formula:**

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

**Why square the errors?**
- Makes all errors positive
- Punishes big errors more heavily
- A $10,000 error is treated worse than two $5,000 errors

**Downside:** Units are squared (hard to interpret)
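
A quick numeric check of the "$10,000 versus two $5,000" claim. The two error arrays below are hypothetical, chosen so both cases have the same average absolute error.

```python
import numpy as np

# Two hypothetical sets of errors with the same total absolute error ($10,000)
one_big_error = np.array([10000.0, 0.0])
two_small_errors = np.array([5000.0, 5000.0])

# MAE treats both cases the same
mae_big = np.mean(np.abs(one_big_error))
mae_small = np.mean(np.abs(two_small_errors))

# MSE is larger when the error is concentrated in one big miss
mse_big = np.mean(one_big_error ** 2)
mse_small = np.mean(two_small_errors ** 2)

print(f"One $10,000 error: MAE = {mae_big:,.0f}, MSE = {mse_big:,.0f}")
print(f"Two $5,000 errors: MAE = {mae_small:,.0f}, MSE = {mse_small:,.0f}")
```

Both cases have an MAE of 5,000, but the single large miss doubles the MSE (50,000,000 versus 25,000,000).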

In [None]:
# Calculate MSE manually
mse = np.mean(df['Squared_Error'])

print("Mean Squared Error (MSE):")
print("="*60)
print(f"MSE: {mse:,.2f}")
print("\nNote: MSE is in squared units (harder to interpret)")
print("That's why we often use RMSE instead...")

## Metric 3: Root Mean Squared Error (RMSE)

**What it is:** Square root of MSE.

**Formula:**

$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

**Why it's useful:**
- Same units as target (like MAE)
- Still penalizes large errors
- One of the most commonly used metrics for regression

In [None]:
# Calculate RMSE manually
rmse = np.sqrt(mse)

print("Root Mean Squared Error (RMSE):")
print("="*60)
print(f"RMSE: ${rmse:,.2f}")
print("\nInterpretation:")
print(f"Typical prediction error is around ${rmse:,.2f}")
print("\nRMSE is like MAE but punishes big mistakes more!")

## Metric 4: R² Score (R-squared / Coefficient of Determination)

**What it is:** How much of the variance in prices does our model explain?

**Formula:**

$$R^2 = 1 - \frac{SS_{residual}}{SS_{total}}$$

Where:
- $SS_{residual}$ = Sum of squared errors (the variation our model does NOT explain)
- $SS_{total}$ = Sum of squared differences between each price and the mean price (the total variation in the data)

**Scale:** Usually 0 to 1, and never above 1 (can be negative if model is terrible)
- **R² = 1.0** → Perfect predictions! 🎯
- **R² = 0.9** → We explain 90% of variance (excellent!)
- **R² = 0.5** → We explain 50% of variance (okay)
- **R² = 0.0** → Our model is no better than always predicting the mean
- **R² < 0** → Our model is worse than just predicting the mean!
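
The reference points on this scale can be reproduced with a few lines of NumPy. This is a minimal sketch: the function name `r_squared` and the three sets of "predictions" are invented for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_residual = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_residual / ss_total

prices = np.array([200000, 300000, 400000, 500000, 600000], dtype=float)

perfect = prices.copy()                             # every prediction exactly right
always_mean = np.full_like(prices, prices.mean())   # ignore the input, guess the average
way_off = prices[::-1]                              # cheapest house predicted as priciest, etc.

print(f"Perfect predictions: R² = {r_squared(prices, perfect):.2f}")
print(f"Always the mean:     R² = {r_squared(prices, always_mean):.2f}")
print(f"Reversed prices:     R² = {r_squared(prices, way_off):.2f}")
```

Guessing the mean every time scores exactly 0, which is why 0 is the "no better than a naive baseline" mark, and the reversed predictions land well below zero.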

In [None]:
# Calculate R² manually
# Step 1: Total Sum of Squares (total variance)
ss_total = np.sum((df['Price'].astype(float) - df['Price'].mean()) ** 2)
print(f"Total variance in prices: {ss_total:,.0f}")

# Step 2: Residual Sum of Squares (unexplained variance)
ss_residual = np.sum(df['Squared_Error'])
print(f"Unexplained variance: {ss_residual:,.0f}")

# Step 3: Calculate R²
r2_score = 1 - (ss_residual / ss_total)

print("\n" + "="*60)
print(f"R² Score: {r2_score:.4f}")
print("="*60)

print(f"\nInterpretation:")
print(f"Our model explains {r2_score*100:.2f}% of the variance in house prices!")

if r2_score > 0.9:
    print("✓ Excellent model!")
elif r2_score > 0.7:
    print("✓ Good model!")
elif r2_score > 0.5:
    print("○ Okay model, could be better")
else:
    print("✗ Weak model, needs improvement")

## Summary of Metrics

Let's put all our metrics together:

In [None]:
print("📊 MODEL EVALUATION SUMMARY")
print("="*60)
print(f"Mean Absolute Error (MAE):      ${mae:,.2f}")
print(f"Root Mean Squared Error (RMSE): ${rmse:,.2f}")
print(f"R² Score:                       {r2_score:.4f} ({r2_score*100:.2f}%)")
print("="*60)

print("\n💡 What These Numbers Mean:")
print("-" * 60)
print("• MAE: On average, we're off by this amount")
print("• RMSE: Typical error (punishes big mistakes)")
print("• R²: Percentage of variance we explain (higher = better)")
print("\nFor this simple problem, our model works perfectly because")
print("the data has a perfect linear relationship!")
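
Since the same four metrics get recomputed several times in this notebook, one option is to wrap them in a single helper. The sketch below is self-contained. The function name `regression_metrics` and the sample numbers are assumptions made for illustration, not part of the notebook's data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mae": float(np.mean(np.abs(errors))),
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "r2": float(1 - np.sum(errors ** 2) / ss_total),
    }

# Hypothetical example: every prediction is off by $10,000, alternating direction
actual = np.array([200000, 300000, 400000, 500000])
predicted = np.array([210000, 290000, 410000, 490000])

for name, value in regression_metrics(actual, predicted).items():
    print(f"{name}: {value:,.4f}")
```

Because every error has the same size here, MAE and RMSE come out equal. They only diverge when some errors are much larger than others.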

# Part 5: Train-Test Split - The Right Way to Evaluate ✂️

## The Problem with Current Evaluation

**Question:** Are we cheating?

**Answer:** YES! 🚨

We're testing our model on the SAME data we used to train it!

**Think of it like:**
- 📖 Studying WITH the exam questions in front of you
- 🎮 Playing a video game with the walkthrough open
- 🏃 Running a race where you already know the route

You'll get great results, but it doesn't prove you actually learned anything useful!

## The Solution: Train-Test Split

**The Idea:**
1. **Split data** into two parts BEFORE training
2. **Training set (80%)** - Model learns from this
3. **Test set (20%)** - Model proves it learned (never seen during training)

**Visual:**
```
Original Data (100%)
    ↓
    ├─ Training Set (80%) → Train model → Model learns patterns
    └─ Test Set (20%) → Test model → See if it really learned!
```

This simulates how the model will perform on NEW data in the real world!

## Create a Larger Dataset

Our current dataset is too small (only 5 points). Let's create a bigger one!

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Create larger dataset (100 houses)
n_samples = 100

# Generate random square footage
square_feet_large = np.random.randint(1000, 4000, n_samples)

# True relationship: $200 per sq ft + base of $50,000
true_slope = 200
true_intercept = 50000

# Add some random noise (real world isn't perfect!)
noise = np.random.normal(0, 30000, n_samples)  # Random variation (standard deviation of $30k)

# Generate prices
prices_large = (true_slope * square_feet_large) + true_intercept + noise

# Create DataFrame
df_large = pd.DataFrame({
    'Square_Feet': square_feet_large,
    'Price': prices_large
})

print(f"Created dataset with {len(df_large)} houses!")
print("\nFirst 10 houses:")
print(df_large.head(10))

print(f"\nData statistics:")
print(df_large.describe())

## Visualize the Larger Dataset

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df_large['Square_Feet'], df_large['Price'], alpha=0.5, s=50)
plt.xlabel('Square Feet', fontsize=12, fontweight='bold')
plt.ylabel('Price ($)', fontsize=12, fontweight='bold')
plt.title('House Prices vs Square Footage (100 Houses)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Notice: Points are scattered (not perfect), just like real data!")

## Perform Train-Test Split Manually

In [None]:
# Step 1: Shuffle the data (randomize order)
df_shuffled = df_large.sample(frac=1, random_state=42).reset_index(drop=True)

# Step 2: Calculate split index (80% for training)
split_ratio = 0.8
split_index = int(split_ratio * len(df_shuffled))

# Step 3: Split into train and test
train_data = df_shuffled[:split_index].copy()
test_data = df_shuffled[split_index:].copy()

print("✓ Data Split Complete!")
print("="*60)
print(f"Original dataset: {len(df_large)} houses")
print(f"Training set: {len(train_data)} houses ({len(train_data)/len(df_large)*100:.0f}%)")
print(f"Test set: {len(test_data)} houses ({len(test_data)/len(df_large)*100:.0f}%)")
print("="*60)

print("\n💡 Important:")
print("• Model will ONLY see training data during training")
print("• Test data is kept hidden until evaluation")
print("• This simulates predicting for new, unseen houses!")
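
The same shuffle-then-slice idea can also be written as a small reusable function that works directly on NumPy arrays. This is a sketch under stated assumptions: the function name `train_test_split_manual`, its parameters, and the tiny demo dataset are all invented for illustration.

```python
import numpy as np

def train_test_split_manual(x, y, test_ratio=0.2, seed=42):
    """Shuffle row positions, then slice them into train and test portions."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))           # shuffled row positions
    n_test = int(round(test_ratio * len(x)))    # how many rows to hold out
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Tiny made-up dataset: 10 houses priced at exactly $200 per sq ft
x_demo = np.arange(1000, 2000, 100)
y_demo = 200 * x_demo

x_tr, x_te, y_tr, y_te = train_test_split_manual(x_demo, y_demo)
print(f"Train: {len(x_tr)} rows, Test: {len(x_te)} rows")
```

Shuffling indices rather than the data itself keeps each `x` value paired with its `y` value, which is the detail that is easiest to get wrong when splitting by hand.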

## Train Model on Training Data ONLY

In [None]:
# Extract training features and target
x_train = train_data['Square_Feet'].values
y_train = train_data['Price'].values

# Calculate slope and intercept using ONLY training data
mean_x_train = np.mean(x_train)
mean_y_train = np.mean(y_train)

numerator_train = np.sum((x_train - mean_x_train) * (y_train - mean_y_train))
denominator_train = np.sum((x_train - mean_x_train) ** 2)

slope_train = numerator_train / denominator_train
intercept_train = mean_y_train - (slope_train * mean_x_train)

print("✓ Model Trained on Training Data!")
print("="*60)
print(f"Trained Slope: ${slope_train:.2f} per sq ft")
print(f"Trained Intercept: ${intercept_train:,.2f}")
print("="*60)

print(f"\nCompare to true values we used to generate data:")
print(f"True Slope: ${true_slope:.2f}")
print(f"True Intercept: ${true_intercept:,.2f}")
print(f"\nPretty close! Our model learned the pattern!")

## Evaluate on Training Data

In [None]:
# Predict on training data
train_predictions = predict_price(x_train, slope_train, intercept_train)

# Calculate training metrics
train_errors = y_train - train_predictions
train_mae = np.mean(np.abs(train_errors))
train_rmse = np.sqrt(np.mean(train_errors ** 2))

ss_total_train = np.sum((y_train - mean_y_train) ** 2)
ss_residual_train = np.sum(train_errors ** 2)
train_r2 = 1 - (ss_residual_train / ss_total_train)

print("📊 TRAINING SET PERFORMANCE:")
print("="*60)
print(f"MAE:  ${train_mae:,.2f}")
print(f"RMSE: ${train_rmse:,.2f}")
print(f"R²:   {train_r2:.4f} ({train_r2*100:.2f}%)")
print("="*60)

## The Moment of Truth: Evaluate on Test Data!

This is the REAL test - how well does our model work on data it has NEVER seen?

In [None]:
# Extract test features and target
x_test = test_data['Square_Feet'].values
y_test = test_data['Price'].values

# Predict on test data (using model trained on training data)
test_predictions = predict_price(x_test, slope_train, intercept_train)

# Calculate test metrics
test_errors = y_test - test_predictions
test_mae = np.mean(np.abs(test_errors))
test_rmse = np.sqrt(np.mean(test_errors ** 2))

mean_y_test = np.mean(y_test)
ss_total_test = np.sum((y_test - mean_y_test) ** 2)
ss_residual_test = np.sum(test_errors ** 2)
test_r2 = 1 - (ss_residual_test / ss_total_test)

print("🎯 TEST SET PERFORMANCE (Unseen Data):")
print("="*60)
print(f"MAE:  ${test_mae:,.2f}")
print(f"RMSE: ${test_rmse:,.2f}")
print(f"R²:   {test_r2:.4f} ({test_r2*100:.2f}%)")
print("="*60)

print("\n📊 Comparison:")
print("-"*60)
print(f"{'Metric':<10} {'Training':<20} {'Test':<20}")
print("-"*60)
print(f"{'MAE':<10} ${train_mae:<19,.2f} ${test_mae:<19,.2f}")
print(f"{'RMSE':<10} ${train_rmse:<19,.2f} ${test_rmse:<19,.2f}")
print(f"{'R²':<10} {train_r2:<20.4f} {test_r2:<20.4f}")
print("-"*60)

if abs(train_r2 - test_r2) < 0.05:
    print("\n✓ Great! Training and test scores are similar")
    print("  This means our model generalizes well to new data!")
else:
    print("\n⚠ Warning: Big difference between train and test")
    print("  This might indicate overfitting or a small test set")

## Visualize Predictions vs Actual

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Left plot: Training data
ax1.scatter(x_train, y_train, alpha=0.5, s=30, label='Actual Prices')
x_line_train = np.array([x_train.min(), x_train.max()])
y_line_train = predict_price(x_line_train, slope_train, intercept_train)
ax1.plot(x_line_train, y_line_train, 'r-', linewidth=3, label='Our Model')
ax1.set_xlabel('Square Feet', fontsize=12, fontweight='bold')
ax1.set_ylabel('Price ($)', fontsize=12, fontweight='bold')
ax1.set_title('Training Data', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right plot: Test data
ax2.scatter(x_test, y_test, alpha=0.5, s=30, color='orange', label='Actual Prices')
x_line_test = np.array([x_test.min(), x_test.max()])
y_line_test = predict_price(x_line_test, slope_train, intercept_train)
ax2.plot(x_line_test, y_line_test, 'r-', linewidth=3, label='Our Model')
ax2.set_xlabel('Square Feet', fontsize=12, fontweight='bold')
ax2.set_ylabel('Price ($)', fontsize=12, fontweight='bold')
ax2.set_title('Test Data (Unseen!)', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✓ Our model (red line) works well on both training and test data!")

# Part 6: Multiple Features - Beyond One Variable 🎯

## The Real World Problem

So far we predicted price using **only** square footage.

**But in reality, price depends on many factors:**
- Square footage
- Number of bedrooms
- Number of bathrooms
- Age of house
- Location
- School district
- And much more!

## Multiple Linear Regression

With multiple features, our formula extends:

**One feature:**
$$Price = (Slope \times Square\text{ }Feet) + Intercept$$

**Multiple features:**
$$Price = (w_1 \times Feature_1) + (w_2 \times Feature_2) + ... + (w_n \times Feature_n) + Intercept$$

Where $w_1, w_2, ..., w_n$ are **weights** (slopes for each feature)

**Example:**
$$Price = (200 \times Sq.Ft) + (10000 \times Bedrooms) + (15000 \times Bathrooms) + 50000$$
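
Plugging a house into the example equation shows that a multi-feature prediction is just a weighted sum. The house below (2,000 sq ft, 3 bedrooms, 2 bathrooms) is a made-up example, and the variable names are chosen for this sketch only.

```python
import numpy as np

# Weights and intercept taken from the example equation above
example_weights = np.array([200, 10000, 15000])   # per sq ft, per bedroom, per bathroom
example_intercept = 50000

# A hypothetical house: 2,000 sq ft, 3 bedrooms, 2 bathrooms
house = np.array([2000, 3, 2])

# Weighted sum of the features plus the intercept
example_price = int(np.dot(example_weights, house)) + example_intercept
print(f"Predicted price: ${example_price:,}")   # prints: Predicted price: $510,000
```

The breakdown is 400,000 for square footage, 30,000 for bedrooms, 30,000 for bathrooms, plus the 50,000 intercept.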

## Create Dataset with Multiple Features

In [None]:
# Create dataset with multiple features
np.random.seed(42)
n = 100

# Features
square_feet_multi = np.random.randint(1000, 4000, n)
bedrooms = np.random.randint(1, 6, n)
bathrooms = np.random.randint(1, 4, n)

# True weights
w_sqft = 150      # $150 per square foot
w_bed = 20000     # $20k per bedroom
w_bath = 15000    # $15k per bathroom
true_intercept_multi = 50000

# Generate prices with noise
noise_multi = np.random.normal(0, 30000, n)
prices_multi = (w_sqft * square_feet_multi + 
                w_bed * bedrooms + 
                w_bath * bathrooms + 
                true_intercept_multi + 
                noise_multi)

# Create DataFrame
df_multi = pd.DataFrame({
    'Square_Feet': square_feet_multi,
    'Bedrooms': bedrooms,
    'Bathrooms': bathrooms,
    'Price': prices_multi
})

print("Dataset with Multiple Features:")
print("="*60)
print(df_multi.head(10))
print("\nShape:", df_multi.shape)
print("\nFeatures: Square_Feet, Bedrooms, Bathrooms")
print("Target: Price")

## The Math Behind Multiple Features (Advanced)

For multiple features, we use **matrix algebra** to solve everything at once:

$$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

**Where:**
- $\mathbf{X}$ = Feature matrix (one row per house, one column per feature, plus a column of 1s for the intercept)
- $\mathbf{y}$ = Target vector (prices)
- $\mathbf{w}$ = Weights (the intercept plus one slope per feature)

**Don't worry if this looks complex!** 
- This is why we use libraries like scikit-learn
- They handle the math for us automatically
- But it's good to know what's happening under the hood!

Let's implement this:

In [None]:
# Prepare feature matrix (add column of 1s for intercept)
X = df_multi[['Square_Feet', 'Bedrooms', 'Bathrooms']].values
X_with_intercept = np.column_stack([np.ones(len(X)), X])

# Target
y = df_multi['Price'].values

# Calculate weights using matrix formula
# w = (X^T X)^(-1) X^T y
XTX = X_with_intercept.T @ X_with_intercept  # X transpose times X
XTX_inv = np.linalg.inv(XTX)                  # Inverse
XTy = X_with_intercept.T @ y                  # X transpose times y
weights = XTX_inv @ XTy                        # Final weights

intercept_multi = weights[0]
w_sqft_learned = weights[1]
w_bed_learned = weights[2]
w_bath_learned = weights[3]

print("✓ Multiple Linear Regression Complete!")
print("="*60)
print("Learned Weights:")
print(f"  Intercept:    ${intercept_multi:,.2f}")
print(f"  Square Feet:  ${w_sqft_learned:.2f} per sq ft")
print(f"  Bedrooms:     ${w_bed_learned:,.2f} per bedroom")
print(f"  Bathrooms:    ${w_bath_learned:,.2f} per bathroom")

print("\nTrue Values (what we used to generate data):")
print(f"  Intercept:    ${true_intercept_multi:,.2f}")
print(f"  Square Feet:  ${w_sqft:.2f}")
print(f"  Bedrooms:     ${w_bed:,.2f}")
print(f"  Bathrooms:    ${w_bath:,.2f}")

print("\n✓ Our model learned the pattern pretty well!")
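
Explicitly inverting $\mathbf{X}^T \mathbf{X}$ works for a small example like this one, but it can be numerically fragile when features are strongly correlated. As an aside, NumPy's `np.linalg.lstsq` solves the same least-squares problem without forming the inverse. The sketch below generates its own small synthetic dataset (not the notebook's `df_multi`) and checks that both routes agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic dataset: 50 houses, 3 features (values assumed for illustration)
n_demo = 50
X_demo = np.column_stack([
    rng.integers(1000, 4000, n_demo),   # square feet
    rng.integers(1, 6, n_demo),         # bedrooms
    rng.integers(1, 4, n_demo),         # bathrooms
]).astype(float)
y_demo = (150 * X_demo[:, 0] + 20000 * X_demo[:, 1] + 15000 * X_demo[:, 2]
          + 50000 + rng.normal(0, 30000, n_demo))

# Add a column of 1s so the first weight acts as the intercept
X1 = np.column_stack([np.ones(n_demo), X_demo])

# Route 1: normal equation with an explicit inverse
w_inverse = np.linalg.inv(X1.T @ X1) @ X1.T @ y_demo

# Route 2: least-squares solver, no explicit inverse
w_lstsq, *_ = np.linalg.lstsq(X1, y_demo, rcond=None)

print("inverse:", np.round(w_inverse, 2))
print("lstsq:  ", np.round(w_lstsq, 2))
```

Both routes should print the same four weights (intercept first) up to floating-point noise.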

## Make Predictions with Multiple Features

In [None]:
# Make predictions
predictions_multi = X_with_intercept @ weights

# Calculate metrics
errors_multi = y - predictions_multi
mae_multi = np.mean(np.abs(errors_multi))
rmse_multi = np.sqrt(np.mean(errors_multi ** 2))

ss_total_multi = np.sum((y - y.mean()) ** 2)
ss_residual_multi = np.sum(errors_multi ** 2)
r2_multi = 1 - (ss_residual_multi / ss_total_multi)

print("📊 Multiple Regression Performance:")
print("="*60)
print(f"MAE:  ${mae_multi:,.2f}")
print(f"RMSE: ${rmse_multi:,.2f}")
print(f"R²:   {r2_multi:.4f} ({r2_multi*100:.2f}%)")
print("="*60)

print("\n💡 Multiple features give the model more information to work with.")
print("   Note: these metrics use the same data we trained on, so a train/test split would give a fairer estimate.")
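
The metrics in this cell are computed on the same rows used to fit the weights. Applying the train-test idea from Part 5 gives a more honest estimate. Here is a minimal sketch on a synthetic dataset generated inside the block. The 80/20 split, the random seed, and the helper names `add_intercept`, `fit_weights` and `r_squared` are all choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with the same structure as the multi-feature dataset above
n_rows = 100
X_all = np.column_stack([
    rng.integers(1000, 4000, n_rows),   # square feet
    rng.integers(1, 6, n_rows),         # bedrooms
    rng.integers(1, 4, n_rows),         # bathrooms
]).astype(float)
y_all = (150 * X_all[:, 0] + 20000 * X_all[:, 1] + 15000 * X_all[:, 2]
         + 50000 + rng.normal(0, 30000, n_rows))

def add_intercept(features):
    """Prepend a column of 1s so the first weight is the intercept."""
    return np.column_stack([np.ones(len(features)), features])

def fit_weights(features, target):
    """Least-squares weights; the first entry is the intercept."""
    w, *_ = np.linalg.lstsq(add_intercept(features), target, rcond=None)
    return w

def r_squared(y_true, y_pred):
    return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Shuffle row positions, then hold out the last 20% for testing
order = rng.permutation(n_rows)
train_idx, test_idx = order[:80], order[80:]

split_weights = fit_weights(X_all[train_idx], y_all[train_idx])   # training rows only
train_r2_multi = r_squared(y_all[train_idx], add_intercept(X_all[train_idx]) @ split_weights)
test_r2_multi = r_squared(y_all[test_idx], add_intercept(X_all[test_idx]) @ split_weights)

print(f"Train R²: {train_r2_multi:.4f}")
print(f"Test R²:  {test_r2_multi:.4f}")
```

If the two scores are close, the model is likely generalizing rather than memorizing the training rows.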

## Example: Predict Price for a New House

In [None]:
# New house features
new_house_sqft = 2500
new_house_bed = 4
new_house_bath = 2

# Predict using our learned model
predicted_price_new = (intercept_multi + 
                       w_sqft_learned * new_house_sqft + 
                       w_bed_learned * new_house_bed + 
                       w_bath_learned * new_house_bath)

print("🏠 New House to Predict:")
print("="*60)
print(f"Square Feet: {new_house_sqft:,}")
print(f"Bedrooms:    {new_house_bed}")
print(f"Bathrooms:   {new_house_bath}")
print("="*60)
print(f"\n💰 Predicted Price: ${predicted_price_new:,.2f}")
print("\nBreakdown:")
print(f"  Base price:       ${intercept_multi:,.2f}")
print(f"  + Square footage: ${w_sqft_learned * new_house_sqft:,.2f}")
print(f"  + Bedrooms:       ${w_bed_learned * new_house_bed:,.2f}")
print(f"  + Bathrooms:      ${w_bath_learned * new_house_bath:,.2f}")
print(f"  {'='*30}")
print(f"  Total:            ${predicted_price_new:,.2f}")

# Final Summary 🎓

## Congratulations! 🎉

You just built **machine learning models from scratch** without any ML libraries!

## What We Accomplished Today

### 1. Linear Regression Fundamentals ✓
- **What it is:** Finding the best line through data
- **Formula:** $y = mx + b$ (Slope-intercept form)
- **Manual calculation:** Used math formulas to find slope and intercept
- **Understanding:** You know EXACTLY how predictions are made

### 2. Model Evaluation Metrics ✓

Understanding how good our predictions are:

| Metric | What It Measures | Units | Interpretation |
|--------|-----------------|-------|----------------|
| **MAE** | Average absolute error | Same as target | "On average, we're off by X" |
| **MSE** | Average squared error | Squared units | Penalizes big errors more |
| **RMSE** | Square root of MSE | Same as target | Typical error size |
| **R²** | Variance explained | Unitless (at most 1) | Closer to 1 = more of the pattern captured |

### 3. Train-Test Split ✓
- **Why:** Prevent "studying with the exam questions"
- **How:** Split data (80% train, 20% test)
- **Goal:** Test model on unseen data
- **Result:** Know if model truly learned patterns

### 4. Multiple Features ✓
- **Reality:** Predictions depend on many factors
- **Extension:** One weight per feature, plus an intercept
- **Math:** Matrix algebra (complex, but powerful!)
- **Benefit:** More accurate, realistic predictions

---

## What This Means for You

🎯 **You now understand what's happening under the hood!**

When you use professional tools like **scikit-learn**, you'll know:
- What the library is doing internally
- Why we split data into train and test sets
- What evaluation metrics actually mean
- How to interpret and trust your results

---

## The Real World Approach

**In practice, we DON'T write these formulas manually!**

Instead, we use **scikit-learn** which:
- ✓ Handles all the math automatically
- ✓ Optimizes calculations for speed
- ✓ Provides easy-to-use functions
- ✓ Works with any number of features
- ✓ Includes many other model types beyond linear regression

**But now you understand HOW it works!** This makes you a much better practitioner.

---

## Next Steps 🚀

**You're ready for:**
1. **Day 1: SK-Learn Basics** - Same concepts, professional tools (ONE line of code!)
2. More complex models (polynomial regression, decision trees, neural networks)
3. Real-world datasets from Kaggle
4. Advanced evaluation techniques (cross-validation, hyperparameter tuning)

---

## Final Thoughts 💡

> **"Understanding the basics from scratch makes you a better ML practitioner!"**

You didn't just learn to use tools - you learned the **foundations**.

This knowledge will help you:
- Debug problems when things go wrong
- Understand documentation and research papers
- Make better modeling decisions
- Explain your work to others confidently

**Keep learning! Keep building! Keep growing!** 🌟

Happy Learning! 📚

By Abdulellah Mojalled : [linkedin](https://www.linkedin.com/in/abdulellah-mojalled/)