![Header](../assets/Header_NovaAi_Camp2.0.png)

# Day 1: Scikit-Learn Basics 🚀

## Welcome to Professional Machine Learning

In the previous notebook, we built everything from scratch to understand the fundamentals. Now it's time to use the tools that data scientists use in the real world.

**You learned:**
- How to calculate slope and intercept manually
- How to compute MAE, MSE, RMSE, R² from formulas
- How to split data into training and test sets
- Why evaluation metrics matter

**Now we'll do the exact same things**, but using professional tools that make everything faster and easier.

## What is Scikit-Learn (sklearn)?

Scikit-learn is the most popular machine learning library in Python. It's used by companies worldwide for everything from fraud detection to recommendation systems.

**Think of it like this:**
- **From Scratch** = Learning how a car engine works (educational, teaches fundamentals)
- **Scikit-Learn** = Driving the car (practical, gets you where you need to go)

You did the first part - now you understand HOW it works. Now let's learn to USE it effectively.

## Why Use Scikit-Learn?

| Feature | Benefit |
|---------|---------|
| **Automatic Math** | Complex formulas handled internally |
| **Speed** | Optimized compiled code (NumPy, SciPy, Cython) under the hood |
| **Consistent API** | Same pattern for all models |
| **Scalability** | Works with 1 or 1,000 features |
| **Production Ready** | Battle-tested by millions of users |
| **Rich Library** | 100+ algorithms included |

## What You'll Learn

1. **Linear Regression** - The sklearn way
2. **Train-Test Split** - One line of code
3. **Model Evaluation** - Built-in metrics
4. **Multiple Features** - No matrix math needed
5. **Making Predictions** - Clean and simple

Let's dive in! 💡

## Import Libraries 📦

Before we start, let's import our tools:

- **Pandas** - For working with data tables
- **NumPy** - For numerical operations
- **Matplotlib** - For visualizations
- **Sklearn** - For machine learning

**Key Sklearn Components:**
- `LinearRegression` - The model we'll use
- `train_test_split` - Splits data automatically
- `mean_absolute_error`, `mean_squared_error`, `r2_score` - Evaluation metrics

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt

# Scikit-Learn imports
from sklearn.linear_model import LinearRegression      # The model
from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score  # Evaluation metrics

print("✓ All libraries imported successfully!")
print("Ready to use scikit-learn! 🚀")

# Part 1: Linear Regression with Scikit-Learn 🎯

## The Sklearn Workflow - Universal Recipe

Scikit-learn follows a **consistent pattern** for ALL models. This is one of its best features!

### The 3-Step Recipe

```
Step 1: CREATE        →  model = LinearRegression()
                         (Get a blank model ready to learn)

Step 2: FIT           →  model.fit(X, y)
                         (Train the model on your data)

Step 3: PREDICT       →  predictions = model.predict(X_new)
                         (Use the trained model on new data)
```

**This SAME pattern works for:**
- Linear Regression (what we're learning)
- Decision Trees
- Random Forests
- Neural Networks
- And 100+ other algorithms!
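
To make the shared pattern concrete, here is a minimal sketch that runs the same create, fit, predict steps on two different model types. The tiny dataset and the names `X_demo`, `y_demo`, and `demo_predictions` are made up for this illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Tiny illustrative dataset (values chosen only for this sketch)
X_demo = np.array([[1000], [1500], [2000], [2500], [3000]])
y_demo = np.array([250000, 350000, 450000, 550000, 650000])

demo_predictions = {}
# Step 1: CREATE two different estimators
for estimator in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    estimator.fit(X_demo, y_demo)            # Step 2: FIT (same call for both)
    preds = estimator.predict(X_demo)        # Step 3: PREDICT (same call for both)
    demo_predictions[type(estimator).__name__] = preds

for name, preds in demo_predictions.items():
    print(name, preds)
```

Only the line that creates the estimator differs between the two models; the `fit` and `predict` calls are identical.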

**Learn it once, use it everywhere.** That's the power of good design! ✨

---

## Create Sample Dataset

Let's start with a simple example: predicting house prices from square footage.

**Remember this from the previous notebook?** Same data, different tools!

In [None]:
# Create simple dataset
data = {
    'Square_Feet': [1000, 1500, 2000, 2500, 3000],
    'Price': [250000, 350000, 450000, 550000, 650000]
}

df = pd.DataFrame(data)

print("Our Dataset:")
print("="*50)
print(df)
print("\n📊 Goal: Predict house price from square footage")

## Prepare Data for Sklearn

**IMPORTANT:** Sklearn expects data in a specific format. This trips up many beginners!

### The Rules:

**X (Features) must be 2D:**
```python
X = df[['Square_Feet']]  # Double brackets [[ ]] → Creates a 2D DataFrame
# Shape: (5, 1) - 5 samples, 1 feature
```

**y (Target) can be 1D:**
```python
y = df['Price']  # Single brackets [ ] → Creates a 1D Series
# Shape: (5,) - 5 samples
```

### Why 2D for X?

Sklearn needs to handle multiple features, so it always expects a table structure:

**One feature (our current example):**
```python
X = [[1000],   # House 1: 1000 sq ft
     [1500],   # House 2: 1500 sq ft
     [2000]]   # House 3: 2000 sq ft
# Shape: (3, 1) - 3 houses, 1 feature
```

**Multiple features (we'll see this later):**
```python
X = [[1000, 3, 2],  # House 1: [sqft, bedrooms, bathrooms]
     [1500, 4, 2],  # House 2
     [2000, 5, 3]]  # House 3
# Shape: (3, 3) - 3 houses, 3 features
```

**Key Point:** Always use double brackets `[[]]` for features, even with just one feature.

**Think of it as:** Sklearn wants a table (DataFrame), not a list (Series).
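
If your feature arrives as a plain NumPy array rather than a DataFrame column, `reshape(-1, 1)` plays the same role as the double brackets. This is a small sketch; `sqft_1d` and `sqft_2d` are names invented for the example.

```python
import numpy as np

sqft_1d = np.array([1000, 1500, 2000])   # Shape (3,): a flat sequence of values
sqft_2d = sqft_1d.reshape(-1, 1)         # Shape (3, 1): 3 rows, 1 column

print(sqft_1d.shape)
print(sqft_2d.shape)
```

The `-1` asks NumPy to work out the number of rows itself, so the same line works for any number of samples.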

In [None]:
# Prepare features (X) - needs to be 2D
X = df[['Square_Feet']]  # Double brackets [[]] make it 2D

# Prepare target (y) - can be 1D
y = df['Price']  # Single brackets [] make it 1D

print("Feature Matrix (X):")
print(X)
print(f"Shape: {X.shape} - (5 samples, 1 feature)\n")

print("Target Vector (y):")
print(y.values)
print(f"Shape: {y.shape} - (5 samples,)")

print("\n✓ Data ready for sklearn!")

## Step 1: Create the Model

Creating a model is like getting a blank notebook - ready to learn but currently empty.

```python
model = LinearRegression()
```

**What happens here:**
- A blank LinearRegression model is created
- It has NO knowledge yet
- The learned parameters (slope, intercept) do not exist yet
- It's ready to learn but hasn't seen any data

Think of it as hiring a student on their first day - they have potential but haven't studied yet!

In [None]:
# Create a Linear Regression model
model = LinearRegression()

print("✓ Model created!")
print(f"Model type: {type(model)}")
print("\nThe model is ready to learn, but hasn't seen any data yet.")

## Step 2: Train (Fit) the Model 🏋️

Training teaches the model to recognize patterns in your data.

```python
model.fit(X, y)
```

**All the math we did manually before?** Sklearn does it in milliseconds! ⚡

**What happens during `.fit()`:**

**BEFORE:**
```python
model.coef_      # Doesn't exist yet
model.intercept_ # Doesn't exist yet
```

**DURING (sklearn automatically):**
1. Analyzes the relationship between X and y
2. Calculates optimal slope using least squares
3. Calculates optimal intercept
4. Minimizes prediction errors
5. Stores learned parameters in the model

**AFTER:**
```python
model.coef_       # array([200.]) → Learned slope: $200 per sq ft
model.intercept_  # 50000.0 (up to rounding) → Learned intercept: $50,000 base price
```

The model is now "trained" and ready to make predictions!

In [None]:
# Train the model
model.fit(X, y)

print("✓ Model trained successfully!")
print("\nBehind the scenes, sklearn:")
print("  1. Calculated the optimal slope")
print("  2. Calculated the optimal intercept")
print("  3. Minimized prediction errors")
print("\nAll in one line of code! 🎉")

## Inspect What the Model Learned 🔍

After training, we can see what the model learned from the data.

Every sklearn model stores learned values in **attributes** (variables ending with `_`):

```python
model.intercept_  # The y-intercept (base value)
model.coef_       # Coefficients (slopes) - one per feature
```

### Understanding the Values:

**Intercept:** The Y value when X = 0 (starting point of the line)
- Example: `intercept_ = 50000` means base price is $50,000
- In context: "A house with 0 sq ft would theoretically cost $50,000"

**Coefficient:** How much Y changes when X increases by 1 (slope of the line)
- Example: `coef_ = 200` means each square foot adds $200
- In context: "Each additional square foot increases price by $200"

### Why the underscore `_`?

In sklearn:
- **No underscore** = Settings you provide (before training)
- **With underscore `_`** = Values learned from data (after training)

This convention helps you distinguish between what you set and what the model learned!
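
The convention can be checked directly. The sketch below assumes a throwaway model called `demo_model` (a name invented here) and looks at one setting and one learned attribute before and after training.

```python
from sklearn.linear_model import LinearRegression

demo_model = LinearRegression(fit_intercept=True)   # Setting: no underscore

# Settings are available right away
print(demo_model.get_params()["fit_intercept"])

# Learned attributes are absent before training
fitted_before = hasattr(demo_model, "coef_")

demo_model.fit([[1000], [2000], [3000]], [250000, 450000, 650000])

# ...and present after training
fitted_after = hasattr(demo_model, "coef_")

print(fitted_before, fitted_after)
```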

In [None]:
# Get learned parameters
slope = model.coef_[0]  # Coefficient (slope)
intercept = model.intercept_  # Intercept

print("Model Parameters:")
print("="*50)
print(f"Slope (coefficient):  ${slope:.2f} per sq ft")
print(f"Intercept (base):     ${intercept:,.2f}")
print("="*50)

print("\n📐 Learned Formula:")
print(f"Price = {slope:.2f} × Square_Feet + {intercept:,.2f}")

print("\n💡 Interpretation:")
print(f"• Each additional square foot adds ${slope:.2f} to the price")
print(f"• Base price (0 sq ft) would be ${intercept:,.2f}")

## Step 3: Make Predictions 🔮

Now our trained model can predict prices for any square footage!

### How `.predict()` Works:

```
Input: Square footage (X)
  ↓
Model applies learned formula:
  Price = (slope × Square_Feet) + intercept
  ↓
Output: Predicted price
```

**Example:**
```python
new_house = [[2200]]           # 2200 square feet (must be 2D!)
prediction = model.predict(new_house)

# Behind the scenes:
# Price = 200 × 2200 + 50000
# Price = 440000 + 50000
# Price = 490000  ‚Üí $490,000
```

**Key Point:** The input must be 2D, just like when we trained! Use `[[value]]` not `[value]`.
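
To see why the shape matters, this sketch passes a 1D input and then a 2D input to a small throwaway model. The name `shape_demo` and its three training points are assumptions made for the example; they follow the same $200 per sq ft plus $50,000 pattern as above.

```python
from sklearn.linear_model import LinearRegression

shape_demo = LinearRegression().fit(
    [[1000], [2000], [3000]], [250000, 450000, 650000]
)

try:
    shape_demo.predict([2200])        # 1D input: sklearn rejects it
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", str(err).splitlines()[0])

price_2d = shape_demo.predict([[2200]])[0]   # 2D input: accepted
print(round(price_2d))
```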

In [None]:
# Make predictions for our original data
predictions = model.predict(X)

# Create comparison DataFrame
comparison = pd.DataFrame({
    'Square_Feet': X['Square_Feet'],
    'Actual_Price': y,
    'Predicted_Price': predictions,
    'Error': y - predictions
})

print("Predictions vs Actual:")
print("="*70)
print(comparison)
print("="*70)

print("\n✓ Perfect predictions! (Because data is perfectly linear)")

## Predict for a New House

Let's predict the price for a house we haven't seen before.

In [None]:
# New house: 2200 square feet
new_house_sqft = 2200

# Format as a 2D DataFrame with the same column name used in training,
# so sklearn does not warn about missing feature names
new_house_X = pd.DataFrame({'Square_Feet': [new_house_sqft]})

# Predict
predicted_price = model.predict(new_house_X)[0]  # [0] gets the single value

print(f"🏠 New House: {new_house_sqft:,} square feet")
print("="*50)
print(f"💰 Predicted Price: ${predicted_price:,.2f}")
print("="*50)

# Show the calculation
manual_calc = slope * new_house_sqft + intercept
print("\n📐 Calculation:")
print(f"{slope:.2f} × {new_house_sqft:,} + {intercept:,.2f} = ${manual_calc:,.2f}")

## Visualize the Model

In [None]:
# Create visualization
plt.figure(figsize=(10, 6))

# Plot actual data points
plt.scatter(X, y, color='blue', s=100, alpha=0.6, label='Actual Prices', zorder=3)

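# Plot regression line (predict on a DataFrame so the feature name matches training)
x_line = np.array([[X['Square_Feet'].min()], [X['Square_Feet'].max()]])
y_line = model.predict(pd.DataFrame(x_line, columns=['Square_Feet']))
plt.plot(x_line, y_line, 'r-', linewidth=3, label='Sklearn Model', zorder=2)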

# Plot prediction for new house
plt.scatter(new_house_sqft, predicted_price, color='green', s=200, 
            marker='*', label='New House Prediction', zorder=4, edgecolors='black')

# Labels and styling
plt.xlabel('Square Feet', fontsize=12, fontweight='bold')
plt.ylabel('Price ($)', fontsize=12, fontweight='bold')
plt.title('Linear Regression with Scikit-Learn', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("✓ Model fits the data perfectly!")
print("✓ Green star shows our new prediction")

# Part 2: Train-Test Split with Sklearn ✂️

## Why Split Data?

**Remember the golden rule:** Never test on your training data!

Testing on training data is like:
- 📖 Studying WITH the exam questions in front of you
- 🎮 Playing a video game with a walkthrough
- 🏃 Running a race where you already know the route

You'll get great results, but it doesn't prove you actually learned anything useful!

### The Problem:

**Bad Approach (no split):**
```
All Data → Train model → Test on same data → "Perfect" scores ✓
                                             But will it work on new data? ❌
```

**Good Approach (train-test split):**
```
All Data → Split into Train (80%) & Test (20%)
         → Train on training set only
         → Test on unseen test data → Realistic performance ✓
```

---

## The Sklearn Way

**From scratch:** We wrote ~10 lines to shuffle and split data.

**With sklearn:** ONE line does it all! 🎉

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

**This ONE function:**
- Shuffles the data randomly
- Splits into train and test sets
- Returns 4 arrays (X_train, X_test, y_train, y_test)
- Maintains proper correspondence between features and targets

---

## Create a Larger Dataset

Let's create a more realistic dataset with some noise (variation):

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Create 100 houses
n_samples = 100

# Generate data
square_feet_large = np.random.randint(1000, 4000, n_samples)
noise = np.random.normal(0, 30000, n_samples)  # Add realistic variation
prices_large = (200 * square_feet_large) + 50000 + noise

# Create DataFrame
df_large = pd.DataFrame({
    'Square_Feet': square_feet_large,
    'Price': prices_large
})

print(f"Created dataset with {len(df_large)} houses!")
print("\nFirst 10 houses:")
print(df_large.head(10))
print(f"\nData shape: {df_large.shape}")

## Visualize the Dataset

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df_large['Square_Feet'], df_large['Price'], alpha=0.5, s=50)
plt.xlabel('Square Feet', fontsize=12, fontweight='bold')
plt.ylabel('Price ($)', fontsize=12, fontweight='bold')
plt.title('House Prices vs Square Footage (100 Houses)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Notice: Data has scatter (noise), like real-world data!")

## Prepare Features and Target

In [None]:
# Prepare X and y
X_large = df_large[['Square_Feet']]
y_large = df_large['Price']

print(f"Features (X) shape: {X_large.shape}")
print(f"Target (y) shape: {y_large.shape}")
print("\n✓ Data ready for train-test split!")

## Perform Train-Test Split 🎲

This one line does everything we need:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

### Breaking It Down:

**Inputs:**
- `X` - All features (100%)
- `y` - All targets (100%)
- `test_size=0.2` - Keep 20% for testing (0.2 = 20%)
- `random_state=42` - Random seed for reproducibility

**Outputs (4 arrays):**
- `X_train` - Features for training (80%)
- `X_test` - Features for testing (20%)
- `y_train` - Targets for training (80%)
- `y_test` - Targets for testing (20%)

### Understanding `test_size`:

Common practice is to use 20-30% for testing:
```python
test_size=0.2   # 20% test, 80% train (most common)
test_size=0.25  # 25% test, 75% train
test_size=0.3   # 30% test, 70% train
```

**Rule of thumb:** More training data is usually better, but you need enough test data to evaluate reliably.

### Understanding `random_state`:

**Without random_state:**
```python
# Run 1: Random houses [5, 12, 23, 45, ...] go to test set
# Run 2: Different houses [3, 8, 19, 41, ...] go to test set ❌ Not reproducible!
```

**With random_state:**
```python
random_state=42  # Or any number
# Run 1: Houses [5, 12, 23, 45, ...] go to test set
# Run 2: Same houses [5, 12, 23, 45, ...] go to test set ✓ Reproducible!
```

**Why 42?** It's a convention from "Hitchhiker's Guide to the Galaxy" - any number works, but 42 is popular!
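
Here is a small sketch of what `random_state` buys you. It uses a made-up 20-row dataset (`X_seed`, `y_seed`) in which each target equals its feature value, so it is easy to check that rows stay paired after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_seed = np.arange(20).reshape(-1, 1)   # 20 samples, 1 feature
y_seed = np.arange(20)                  # target equals the feature value

# Same seed twice: the two splits should contain the same rows
_, X_test_a, _, y_test_a = train_test_split(
    X_seed, y_seed, test_size=0.2, random_state=42
)
_, X_test_b, _, y_test_b = train_test_split(
    X_seed, y_seed, test_size=0.2, random_state=42
)

same_split = np.array_equal(X_test_a, X_test_b)
rows_still_paired = np.array_equal(X_test_a.ravel(), y_test_a)

print(same_split, rows_still_paired)
```

The second check shows that shuffling moves features and targets together, so each `X` row keeps its own `y` value.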

In [None]:
# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X_large, 
    y_large, 
    test_size=0.2,      # 20% for testing
    random_state=42     # For reproducibility
)

print("✓ Train-Test Split Complete!")
print("="*60)
print(f"Original dataset:  {len(X_large)} houses")
print(f"Training set:      {len(X_train)} houses ({len(X_train)/len(X_large)*100:.0f}%)")
print(f"Test set:          {len(X_test)} houses ({len(X_test)/len(X_large)*100:.0f}%)")
print("="*60)

print("\n💡 What just happened:")
print("• X_train, y_train → Model will learn from these")
print("• X_test, y_test → Model will be tested on these (unseen!)")
print("• random_state=42 → Makes results reproducible")

## Train Model on Training Data Only

In [None]:
# Create and train model
model_split = LinearRegression()
model_split.fit(X_train, y_train)

print("✓ Model trained on training data!")
print("="*60)
print(f"Learned Slope:     ${model_split.coef_[0]:.2f} per sq ft")
print(f"Learned Intercept: ${model_split.intercept_:,.2f}")
print("="*60)

print("\nModel has NOT seen the test data yet!")

# Part 3: Model Evaluation with Sklearn Metrics 📊

### The Easy Way
Remember calculating metrics from scratch? Let's compare:


### From Scratch (Manual):

```python
# MAE - Mean Absolute Error
errors = y_true - y_pred
mae = np.mean(np.abs(errors))

# MSE - Mean Squared Error
mse = np.mean(errors ** 2)

# RMSE - Root Mean Squared Error
rmse = np.sqrt(mse)

# R² - Coefficient of Determination
ss_total = np.sum((y_true - y_true.mean()) ** 2)
ss_residual = np.sum(errors ** 2)
r2 = 1 - (ss_residual / ss_total)
```

### Sklearn Approach (Simple):

```python
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # Or root_mean_squared_error(y_true, y_pred) in scikit-learn 1.4+
r2 = r2_score(y_true, y_pred)
```

**Much easier!** One function call per metric. ✨
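
As a quick sanity check, the sketch below computes the metrics both ways on four made-up prices and compares the results. The arrays `y_true_demo` and `y_pred_demo` are illustrative values, not data from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only
y_true_demo = np.array([250000.0, 350000.0, 450000.0, 550000.0])
y_pred_demo = np.array([260000.0, 340000.0, 470000.0, 540000.0])

errors_demo = y_true_demo - y_pred_demo
ss_total_demo = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

# Manual formulas
manual = {
    "MAE": np.mean(np.abs(errors_demo)),
    "MSE": np.mean(errors_demo ** 2),
    "R2": 1 - np.sum(errors_demo ** 2) / ss_total_demo,
}

# Sklearn functions
library = {
    "MAE": mean_absolute_error(y_true_demo, y_pred_demo),
    "MSE": mean_squared_error(y_true_demo, y_pred_demo),
    "R2": r2_score(y_true_demo, y_pred_demo),
}

for key in manual:
    print(key, manual[key], library[key], np.isclose(manual[key], library[key]))
```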

---

## Understanding the Metrics

### 1. MAE (Mean Absolute Error)

**What it measures:** Average size of errors (in original units)

```
MAE = Average of |actual - predicted|
```

**Example:** If MAE = $25,000
- "On average, predictions are off by $25,000"
- Could be $25k too high or $25k too low

**Lower is better!** MAE = 0 means perfect predictions.

---

### 2. MSE (Mean Squared Error)

**What it measures:** Average of squared errors

```
MSE = Average of (actual - predicted)²
```

**Why square the errors?**
- Makes all errors positive (no cancellation)
- Penalizes large errors MORE than small errors
- Example: One $20k error hurts more than two $10k errors

**Downside:** Units are squared ($²), harder to interpret

**Lower is better!**

---

### 3. RMSE (Root Mean Squared Error)

**What it measures:** Square root of MSE

```
RMSE = √MSE
```

**Why it's popular:**
- Same units as original target (dollars, not dollars²)
- Still penalizes large errors
- One of the most commonly reported regression metrics

**Example:** If RMSE = $28,000
- "Typical prediction error is around $28,000"

**Lower is better!**

---

### 4. R² (R-Squared / Coefficient of Determination)

**What it measures:** Proportion of variance in Y explained by the model

```
R² = 1 - (unexplained variance / total variance)
```

**Scale:** At most 1. Usually between 0 and 1, but it goes negative when the model does worse than always predicting the mean.

**Interpretation:**
- **R² = 1.0** → Perfect predictions! Every point on the line! 🎯
- **R² = 0.9** → Excellent! Model explains 90% of variance
- **R² = 0.7** → Good! Model explains 70% of variance
- **R² = 0.5** → Okay. Model explains 50% of variance
- **R² = 0.0** → Useless. No better than predicting the mean
- **R² < 0** → Worse than useless! Less accurate than always guessing the mean

**Higher is better!** Maximum is 1.0

**Example:** If R² = 0.85
- "Our model explains 85% of the variance in house prices"
- "15% is due to factors we haven't captured"
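
The three reference points on this scale (1, 0, and negative) can be reproduced with a few made-up numbers. In this sketch, `y_actual`, `mean_baseline`, and `reversed_guess` are illustrative names and values.

```python
import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([200000.0, 300000.0, 400000.0, 500000.0])

mean_baseline = np.full_like(y_actual, y_actual.mean())  # always predict the mean
reversed_guess = y_actual[::-1]                          # right values, wrong houses

r2_perfect = r2_score(y_actual, y_actual)         # predictions equal the truth
r2_mean = r2_score(y_actual, mean_baseline)       # no better than the mean
r2_reversed = r2_score(y_actual, reversed_guess)  # worse than the mean

print(r2_perfect, r2_mean, r2_reversed)
```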

---

## Evaluate on Training Data

In [None]:
# Make predictions on training data
y_train_pred = model_split.predict(X_train)

# Calculate metrics using sklearn
train_mae = mean_absolute_error(y_train, y_train_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
train_rmse = np.sqrt(train_mse)  # Or root_mean_squared_error(...) in scikit-learn 1.4+
train_r2 = r2_score(y_train, y_train_pred)

print("📊 TRAINING SET PERFORMANCE:")
print("="*60)
print(f"MAE (Mean Absolute Error):       ${train_mae:,.2f}")
print(f"RMSE (Root Mean Squared Error):  ${train_rmse:,.2f}")
print(f"R² Score:                        {train_r2:.4f} ({train_r2*100:.2f}%)")
print("="*60)

print("\n💡 Interpretation:")
print(f"• On average, predictions are off by ${train_mae:,.0f}")
print(f"• Model explains {train_r2*100:.1f}% of price variance")

## Evaluate on Test Data (The Real Test!) 🎯

This tells us how well the model works on **new, unseen data**.

This is what actually matters in the real world!

In [None]:
# Make predictions on test data
y_test_pred = model_split.predict(X_test)

# Calculate metrics
test_mae = mean_absolute_error(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)

print("🎯 TEST SET PERFORMANCE (Unseen Data):")
print("="*60)
print(f"MAE (Mean Absolute Error):       ${test_mae:,.2f}")
print(f"RMSE (Root Mean Squared Error):  ${test_rmse:,.2f}")
print(f"R² Score:                        {test_r2:.4f} ({test_r2*100:.2f}%)")
print("="*60)

print("\n💡 This is what really matters!")
print("   Test performance shows how the model will work in the real world.")

## Compare Training vs Test Performance

In [None]:
# Create comparison table
comparison_df = pd.DataFrame({
    'Metric': ['MAE', 'RMSE', 'R²'],
    'Training': [f'${train_mae:,.2f}', f'${train_rmse:,.2f}', f'{train_r2:.4f}'],
    'Test': [f'${test_mae:,.2f}', f'${test_rmse:,.2f}', f'{test_r2:.4f}']
})

print("📊 Training vs Test Comparison:")
print("="*60)
print(comparison_df.to_string(index=False))
print("="*60)

# Analyze the difference
r2_diff = abs(train_r2 - test_r2)

if r2_diff < 0.05:
    print("\n✓ Excellent! Training and test scores are similar")
    print("  → Model generalizes well to new data")
elif r2_diff < 0.1:
    print("\n○ Good! Scores are reasonably close")
    print("  → Model is performing acceptably")
else:
    print("\n⚠ Warning: Large gap between training and test")
    print("  → Possible overfitting or small test set")

## Visualize Predictions on Both Sets

In [None]:
# Create side-by-side plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Training data plot
ax1.scatter(X_train, y_train, alpha=0.5, s=30, label='Actual Prices')
x_line_train = np.array([[X_train.values.min()], [X_train.values.max()]])
# Predict on a DataFrame so the feature name matches the one used in training
y_line_train = model_split.predict(pd.DataFrame(x_line_train, columns=X_train.columns))
ax1.plot(x_line_train, y_line_train, 'r-', linewidth=3, label='Model')
ax1.set_xlabel('Square Feet', fontsize=12, fontweight='bold')
ax1.set_ylabel('Price ($)', fontsize=12, fontweight='bold')
ax1.set_title(f'Training Data (R² = {train_r2:.4f})', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Test data plot
ax2.scatter(X_test, y_test, alpha=0.5, s=30, color='orange', label='Actual Prices')
x_line_test = np.array([[X_test.values.min()], [X_test.values.max()]])
# Predict on a DataFrame so the feature name matches the one used in training
y_line_test = model_split.predict(pd.DataFrame(x_line_test, columns=X_test.columns))
ax2.plot(x_line_test, y_line_test, 'r-', linewidth=3, label='Model')
ax2.set_xlabel('Square Feet', fontsize=12, fontweight='bold')
ax2.set_ylabel('Price ($)', fontsize=12, fontweight='bold')
ax2.set_title(f'Test Data (R² = {test_r2:.4f})', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✓ Model performs well on both training and test data!")

# Part 4: Multiple Features with Sklearn 💪

This is where sklearn REALLY shines!


### From Scratch vs Sklearn:

**From Scratch (Multiple Features):**
```python
# Add column of 1s for intercept
X_with_ones = np.column_stack([np.ones(len(X)), X])

# Matrix multiplication
XTX = X_with_ones.T @ X_with_ones

# Matrix inversion (complex!)
XTX_inv = np.linalg.inv(XTX)

# More matrix operations
XTy = X_with_ones.T @ y

# Final calculation
weights = XTX_inv @ XTy

# 😫 Complex! Error-prone! Hard to understand!
```

**With Sklearn:**
```python
model.fit(X, y)  # That's it! 🎉

# 😄 Same line! Works with 1, 10, or 1000 features!
```

### The Beautiful Part:

The API is **identical** regardless of feature count:

```python
# 1 Feature:
model.fit(X, y)  # Works! ✓

# 10 Features:
model.fit(X, y)  # Works! ✓

# 100 Features:
model.fit(X, y)  # Works! ✓

# 1000 Features:
model.fit(X, y)  # Still works! ✓
```

**This is brilliant design!** Learn it once, use it everywhere.
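
To check that both routes land on the same answer, the sketch below solves the normal equation by hand and compares it with `LinearRegression` on synthetic data. The data, the seed, and names such as `X_ne` and `weights_ne` are assumptions for this example, and it uses `np.linalg.solve` in place of an explicit matrix inverse, which is generally the more numerically stable choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_ne = rng.uniform(0, 10, size=(50, 3))                 # 50 samples, 3 features
true_weights = np.array([2.0, -1.0, 0.5])
y_ne = 5.0 + X_ne @ true_weights + rng.normal(0, 0.1, size=50)

# From scratch: add a column of 1s, then solve (X^T X) w = X^T y
X_aug = np.column_stack([np.ones(len(X_ne)), X_ne])
weights_ne = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_ne)

# Sklearn: one call
sk_model = LinearRegression().fit(X_ne, y_ne)
weights_sk = np.concatenate([[sk_model.intercept_], sk_model.coef_])

print(weights_ne)
print(weights_sk)
print(np.allclose(weights_ne, weights_sk))
```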

---

## Real-World House Prices 🏡

In reality, house prices depend on MANY factors:

```
House Price = f(
    Square Footage,      ← Size of the house
    Number of Bedrooms,  ← Sleeping capacity
    Number of Bathrooms, ← Convenience
    Age of House,        ← Condition
    Location,            ← Neighborhood quality
    School District,     ← Education quality
    Lot Size,            ← Land area
    Amenities,           ← Pool, garage, etc.
    ... and more!
)
```

**One feature (square feet) is useful, but limited.**

**Multiple features give a more complete picture!**

### The Formula Extends:

**One feature:**
```
Price = (w₁ × Square_Feet) + intercept
```

**Multiple features:**
```
Price = (w₁ × Square_Feet) + 
        (w₂ × Bedrooms) + 
        (w₃ × Bathrooms) + 
        intercept
```

Where: w₁, w₂, w₃ = Weights (how much the price changes per unit of each feature)

**Example calculation:**
```
Price = (150 × 2500) +     # $150 per square foot
        (20000 × 4) +      # $20,000 per bedroom
        (15000 × 2) +      # $15,000 per bathroom
        50000              # Base price
      = $375,000 + $80,000 + $30,000 + $50,000
      = $535,000
```
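
The same arithmetic can be written as a dot product, which is the form the formula takes once there are many features. This is just a sketch of the calculation above; the variable names are invented for the example.

```python
import numpy as np

weights_example = np.array([150, 20000, 15000])   # $/sqft, $/bedroom, $/bathroom
house_example = np.array([2500, 4, 2])            # sqft, bedrooms, bathrooms
base_price = 50000

# Multiply each feature by its weight, sum the results, then add the base price
price_example = weights_example @ house_example + base_price

print(price_example)
```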

Let's create a multi-feature dataset:

In [None]:
# Set random seed
np.random.seed(42)
n = 100

# Generate features
square_feet_multi = np.random.randint(1000, 4000, n)
bedrooms = np.random.randint(1, 6, n)
bathrooms = np.random.randint(1, 4, n)

# Generate prices (true relationship + noise)
noise_multi = np.random.normal(0, 30000, n)
prices_multi = (150 * square_feet_multi + 
                20000 * bedrooms + 
                15000 * bathrooms + 
                50000 + 
                noise_multi)

# Create DataFrame
df_multi = pd.DataFrame({
    'Square_Feet': square_feet_multi,
    'Bedrooms': bedrooms,
    'Bathrooms': bathrooms,
    'Price': prices_multi
})

print("Dataset with Multiple Features:")
print("="*70)
print(df_multi.head(10))
print("\nShape:", df_multi.shape)
print("\nFeatures: Square_Feet, Bedrooms, Bathrooms")
print("Target: Price")

## Prepare Multi-Feature Data

**Now X has 3 columns instead of 1!**

### Visual Comparison:

**Before (One Feature):**
```python
X = df[['Square_Feet']]
# Shape: (100, 1) - 100 houses, 1 feature
```

**Now (Three Features):**
```python
X = df[['Square_Feet', 'Bedrooms', 'Bathrooms']]
# Shape: (100, 3) - 100 houses, 3 features
```

### The Beautiful Part:

**Same sklearn code works for both!**

```python
# One feature:
X = df[['Square_Feet']]
model.fit(X, y)  # Works! ✓

# Multiple features:
X = df[['Square_Feet', 'Bedrooms', 'Bathrooms']]
model.fit(X, y)  # Also works! ✓  Same code!
```

No code changes needed - sklearn handles it automatically!

In [None]:
# Prepare features (all 3 columns)
X_multi = df_multi[['Square_Feet', 'Bedrooms', 'Bathrooms']]

# Prepare target
y_multi = df_multi['Price']

print("Feature Matrix (X):")
print(X_multi.head())
print(f"\nShape: {X_multi.shape} - (100 samples, 3 features)")

print("\n✓ Ready for multi-feature regression!")

## Train-Test Split for Multi-Feature Data

Same function works with any number of features!

In [None]:
# Split the data
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, 
    y_multi, 
    test_size=0.2, 
    random_state=42
)

print("✓ Multi-feature data split complete!")
print("="*60)
print(f"Training set: {len(X_train_multi)} houses")
print(f"Test set:     {len(X_test_multi)} houses")
print("="*60)
print(f"\nEach sample has {X_train_multi.shape[1]} features")

## Train Multi-Feature Model

Exact same code as before!

In [None]:
# Create and train model
model_multi = LinearRegression()
model_multi.fit(X_train_multi, y_train_multi)

print("✓ Multi-feature model trained!")
print("\nNo matrix math required - sklearn handled it all! 🎉")

## Inspect Multi-Feature Model Parameters 🔍

**Now we have 3 coefficients - one for each feature!**

### Understanding Coefficients:

**With one feature:**
```python
model.coef_ = [200]  # Just one number
```

**With multiple features:**
```python
model.coef_ = [150, 20000, 15000]  # Array of numbers
# One coefficient per feature, in column order: sqft, bedrooms, bathrooms
```

### The Formula:

```
Price = intercept + (coef[0] × sqft) + (coef[1] × beds) + (coef[2] × baths)
```

**Example calculation:**
```
Price = $50,000 + (150 × 2500) + (20000 × 4) + (15000 × 2)
      = $50,000 + $375,000 + $80,000 + $30,000
      = $535,000
```

### What Each Coefficient Means:

**Coefficient 1 (Square Feet): $150**
- "Each additional square foot adds $150 to the price"
- *Holding bedrooms and bathrooms constant*
- Going from 2000 to 2001 sqft → +$150

**Coefficient 2 (Bedrooms): $20,000**
- "Each additional bedroom adds $20,000 to the price"
- *Holding square feet and bathrooms constant*
- Going from 3 to 4 bedrooms → +$20,000

**Coefficient 3 (Bathrooms): $15,000**
- "Each additional bathroom adds $15,000 to the price"
- *Holding other features constant*
- Going from 2 to 3 bathrooms → +$15,000

**Key insight:** Each coefficient tells you the impact of ONE feature while keeping others constant. This is called "all else being equal" or *ceteris paribus*.
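
One way to see "all else being equal" in action: take two houses that differ by exactly one bedroom and compare their predicted prices. The sketch below trains a throwaway model (`cp_model`) on synthetic data generated with the same pattern as this notebook; the seed and all names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_cp = 200
X_cp = np.column_stack([
    rng.integers(1000, 4000, n_cp),   # square feet
    rng.integers(1, 6, n_cp),         # bedrooms
    rng.integers(1, 4, n_cp),         # bathrooms
]).astype(float)
y_cp = (150 * X_cp[:, 0] + 20000 * X_cp[:, 1] + 15000 * X_cp[:, 2]
        + 50000 + rng.normal(0, 30000, n_cp))

cp_model = LinearRegression().fit(X_cp, y_cp)

# Two houses identical except for one extra bedroom
house_3bed = np.array([[2000.0, 3.0, 2.0]])
house_4bed = np.array([[2000.0, 4.0, 2.0]])
price_gap = cp_model.predict(house_4bed)[0] - cp_model.predict(house_3bed)[0]

print(price_gap, cp_model.coef_[1])
print(np.isclose(price_gap, cp_model.coef_[1]))
```

The gap between the two predictions matches the learned bedroom coefficient, whatever value the model happened to learn from the noisy data.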

In [None]:
# Get coefficients (slopes)
coefficients = model_multi.coef_
intercept_multi = model_multi.intercept_

# Get feature names
feature_names = X_multi.columns.tolist()

print("Model Parameters:")
print("="*60)
print(f"Intercept (base price): ${intercept_multi:,.2f}")
print("\nCoefficients (feature weights):")
for name, coef in zip(feature_names, coefficients):
    print(f"  {name:<15}: ${coef:,.2f}")
print("="*60)

print("\n📐 Learned Formula:")
print(f"Price = ${intercept_multi:,.0f} + ")
print(f"        ({coefficients[0]:.2f} × Square_Feet) + ")
print(f"        ({coefficients[1]:,.2f} × Bedrooms) + ")
print(f"        ({coefficients[2]:,.2f} × Bathrooms)")

print("\n💡 Interpretation:")
print(f"• Each square foot adds: ${coefficients[0]:.2f}")
print(f"• Each bedroom adds:     ${coefficients[1]:,.2f}")
print(f"• Each bathroom adds:    ${coefficients[2]:,.2f}")

## Evaluate Multi-Feature Model

In [None]:
# Predictions
y_train_pred_multi = model_multi.predict(X_train_multi)
y_test_pred_multi = model_multi.predict(X_test_multi)

# Training metrics
train_mae_multi = mean_absolute_error(y_train_multi, y_train_pred_multi)
train_rmse_multi = np.sqrt(mean_squared_error(y_train_multi, y_train_pred_multi))
train_r2_multi = r2_score(y_train_multi, y_train_pred_multi)

# Test metrics
test_mae_multi = mean_absolute_error(y_test_multi, y_test_pred_multi)
test_rmse_multi = np.sqrt(mean_squared_error(y_test_multi, y_test_pred_multi))
test_r2_multi = r2_score(y_test_multi, y_test_pred_multi)

print("📊 Multi-Feature Model Performance:")
print("="*70)
print(f"{'Metric':<20} {'Training':<25} {'Test':<25}")
print("-"*70)
print(f"{'MAE':<20} ${train_mae_multi:<24,.2f} ${test_mae_multi:<24,.2f}")
print(f"{'RMSE':<20} ${train_rmse_multi:<24,.2f} ${test_rmse_multi:<24,.2f}")
print(f"{'R²':<20} {train_r2_multi:<24.4f} {test_r2_multi:<24.4f}")
print("="*70)

print("\n💡 Observations:")
print(f"• Using 3 features gives R² = {test_r2_multi:.4f}")
print("• More features often = better predictions!")
print("• But be careful not to add too many (overfitting risk)")

## Predict for a New House (Multiple Features)

In [None]:
# New house specifications
new_house = {
    'Square_Feet': 2500,
    'Bedrooms': 4,
    'Bathrooms': 2
}

# Convert to a one-row DataFrame so the column names match the training features
new_house_df = pd.DataFrame([new_house])

# Predict
predicted_price_multi = model_multi.predict(new_house_df)[0]

print("🏠 New House Specifications:")
print("="*60)
for feature, value in new_house.items():
    print(f"  {feature:<15}: {value:,}")
print("="*60)

print(f"\nüí∞ Predicted Price: ${predicted_price_multi:,.2f}")
print("="*60)

# Show breakdown
print("\nüìê Price Breakdown:")
print(f"  Base price:       ${intercept_multi:,.2f}")
print(f"  + Square footage: ${coefficients[0] * new_house['Square_Feet']:,.2f}")
print(f"  + Bedrooms:       ${coefficients[1] * new_house['Bedrooms']:,.2f}")
print(f"  + Bathrooms:      ${coefficients[2] * new_house['Bathrooms']:,.2f}")
print(f"  {'-'*40}")
print(f"  Total:            ${predicted_price_multi:,.2f}")
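
The breakdown works because a linear model's prediction is just the intercept plus each coefficient times its feature value. The sketch below checks that arithmetic against `predict()` on a tiny made-up dataset (the numbers are assumptions, not this notebook's data); no train-test split is used because only the arithmetic is being verified.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset: columns are [square_feet, bedrooms, bathrooms]
X = np.array([
    [1400, 2, 1],
    [1800, 3, 2],
    [2400, 3, 2],
    [3000, 4, 3],
    [3500, 5, 3],
    [2100, 4, 2],
], dtype=float)
y = np.array([240_000, 310_000, 385_000, 480_000, 560_000, 350_000], dtype=float)

model = LinearRegression().fit(X, y)

new_house = np.array([2500, 4, 2], dtype=float)

# Manual calculation: intercept + sum(coefficient_i * feature_i)
manual_price = model.intercept_ + np.dot(model.coef_, new_house)

# sklearn's calculation for the same row (predict expects a 2-D array)
sklearn_price = model.predict(new_house.reshape(1, -1))[0]

print(f"Manual:  ${manual_price:,.2f}")
print(f"sklearn: ${sklearn_price:,.2f}")
```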

## Batch Predictions 🏭

**One of sklearn's superpowers: Predict for MANY samples at once!**

### Real-World Application:

**Scenario:** Real estate website
- User searches for houses
- 100 matching houses found
- Need to predict prices for all of them

**Inefficient approach:**
```python
# Predict one at a time (slow!)
pred1 = model.predict([[1500, 2, 1]])
pred2 = model.predict([[2000, 3, 2]])
pred3 = model.predict([[2800, 4, 2]])
# ... repeat 97 more times 😫
```

**Efficient approach (batch prediction):**
```python
# Predict all at once (fast!)
houses = [[1500, 2, 1],
          [2000, 3, 2],
          [2800, 4, 2]]  # ...and so on for all 100 houses
predictions = model.predict(houses)  # One call! ⚡
```

**Benefits:**
- Faster (optimized internally)
- Cleaner code
- Less error-prone
- Professional approach
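
The comparison above can be tried with a short sketch. It assumes synthetic training data and 100 randomly generated houses (all values are made up), times both approaches with `time.perf_counter`, and confirms that they return the same predictions. Exact timings depend on the machine.

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (an assumption): [square_feet, bedrooms, bathrooms]
rng = np.random.default_rng(0)
X_train = np.column_stack([
    rng.uniform(800, 4000, 500),
    rng.integers(1, 6, 500),
    rng.integers(1, 4, 500),
])
y_train = 150 * X_train[:, 0] + 10_000 * X_train[:, 1] + rng.normal(0, 20_000, 500)
model = LinearRegression().fit(X_train, y_train)

# 100 "search result" houses to price
houses = np.column_stack([
    rng.uniform(800, 4000, 100),
    rng.integers(1, 6, 100),
    rng.integers(1, 4, 100),
])

# One predict() call per house
start = time.perf_counter()
loop_preds = np.array([model.predict(row.reshape(1, -1))[0] for row in houses])
loop_seconds = time.perf_counter() - start

# A single predict() call for all houses
start = time.perf_counter()
batch_preds = model.predict(houses)
batch_seconds = time.perf_counter() - start

print(f"Loop:  {loop_seconds * 1000:.2f} ms")
print(f"Batch: {batch_seconds * 1000:.2f} ms")
print("Same predictions:", np.allclose(loop_preds, batch_preds))
```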

In [None]:
# Multiple new houses
new_houses = pd.DataFrame({
    'Square_Feet': [1500, 2000, 2800, 3200],
    'Bedrooms': [2, 3, 4, 5],
    'Bathrooms': [1, 2, 2, 3]
})

# Predict for all at once
predictions_batch = model_multi.predict(new_houses)

# Add predictions to DataFrame
new_houses['Predicted_Price'] = predictions_batch

print("🏘️ Batch Predictions for Multiple Houses:")
print("="*70)
print(new_houses.to_string(index=False))
print("="*70)

print("\n✓ Predicted prices for all 4 houses in one call!")

# Part 5: The Sklearn Workflow - Complete Picture 🎨

## The Universal Pattern

This is the **most important concept** in sklearn:

```
┌─────────────────────────────────────────┐
│         SKLEARN WORKFLOW                │
├─────────────────────────────────────────┤
│  1. IMPORT                              │
│     from sklearn.X import ModelName     │
│                                         │
│  2. CREATE                              │
│     model = ModelName()                 │
│                                         │
│  3. FIT (Train)                         │
│     model.fit(X_train, y_train)         │
│                                         │
│  4. PREDICT                             │
│     predictions = model.predict(X_test) │
│                                         │
│  5. EVALUATE                            │
│     score = model.score(X_test, y_test) │
└─────────────────────────────────────────┘
```

### This Works For All Models:

- ✓ Linear Regression (what we learned)
- ✓ Logistic Regression
- ✓ Decision Trees
- ✓ Random Forests
- ✓ Support Vector Machines
- ✓ Gradient Boosting
- ✓ Neural Networks
- ✓ And 100+ more algorithms!

**Same workflow! Just change the model name.** 🎯

This is why sklearn is so powerful - learn the pattern once, apply it everywhere.
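
As a minimal sketch of that idea, the loop below runs the same fit, predict, and score steps for three different regressors. The data comes from `make_regression` (synthetic, an assumption for illustration), and the choice of models and settings is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=3, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Only this dictionary changes when you want to try another algorithm
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

test_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                      # FIT
    predictions = model.predict(X_test)              # PREDICT
    test_scores[name] = model.score(X_test, y_test)  # EVALUATE (R² for regressors)
    print(f"{name:<20} test R² = {test_scores[name]:.4f}")
```

Note that `score()` returns R² for regressors and accuracy for classifiers, so the number means something different once you switch to a classification model.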

---

## Complete Example - All Steps Together

Here's a **complete sklearn program** in one place:

```python
# Step 1: Import everything you need
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Step 2: Prepare your data
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']

# Step 3: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Create the model
model = LinearRegression()

# Step 5: Train the model
model.fit(X_train, y_train)

# Step 6: Make predictions
predictions = model.predict(X_test)

# Step 7: Evaluate performance
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"MAE: ${mae:,.2f}")
print(f"R²: {r2:.4f}")
```

**That's it!** This template covers the core of most supervised ML projects! 📋

Copy this template, change the model name, and you're ready to try different algorithms!

# Final Summary 🎓

## What We Accomplished Today

Congratulations! You just learned the **professional way** to do machine learning!

### 1. Linear Regression with Sklearn ✓
- The 3-step workflow (Create → Fit → Predict)
- Creating and training models in seconds
- Making predictions for new data
- Inspecting learned parameters (slope, intercept)

### 2. Train-Test Split ✓
- Why splitting data is critical (it reveals overfitting!)
- How to split data in ONE line of code
- Understanding `test_size` and `random_state`
- Testing on unseen data for realistic performance

### 3. Model Evaluation ✓
- Calculate MAE, RMSE, R² with simple function calls
- What each metric means and when to use it
- Comparing training vs test performance
- Detecting overfitting (big gap between train and test)

### 4. Multiple Features ✓
- Same code works for 1, 10, or 1000 features
- Preparing multi-feature data
- Interpreting multiple coefficients
- Making batch predictions efficiently

---

## From Scratch vs Sklearn - The Comparison

| Task | From Scratch | Sklearn | Time Saved |
|------|-------------|---------|------------|
| **Linear Regression** | ~20 lines of formulas | `model.fit(X, y)` | 95% ⚡ |
| **Train-Test Split** | ~10 lines of code | `train_test_split(X, y)` | 90% ⚡ |
| **Calculate MAE** | Manual formula | `mean_absolute_error(...)` | 80% ⚡ |
| **Calculate R²** | Complex formula | `r2_score(...)` | 85% ⚡ |
| **Multiple Features** | Matrix algebra (~30 lines) | `model.fit(X, y)` | 98% ⚡ |

**Average time saved: ~90%!**

But more importantly, sklearn code is:
- More readable
- Less error-prone
- Battle-tested by millions
- Production-ready

---

## Why Learn Both?

### From Scratch Approach 🧠

**Benefits:**
- Understand fundamentals deeply
- Know what's happening "under the hood"
- Better debugging when things go wrong
- Explain concepts to non-technical people
- Understand limitations and edge cases

**When to use:**
- Learning and education
- Implementing custom algorithms
- Research and experimentation

### Sklearn Approach 🚀

**Benefits:**
- Fast and efficient
- Tested and reliable (used by millions!)
- Industry standard (what companies use)
- Production-ready code
- Access to 100+ algorithms
- Active community and documentation

**When to use:**
- Real-world projects
- Production systems
- Quick prototyping
- Business applications
- When you need results fast

**You now have BOTH!** This makes you a **complete** ML practitioner! 💪

---

## Resources 📚

### Official Documentation:
- **Sklearn Docs:** https://scikit-learn.org/
- **User Guide:** https://scikit-learn.org/stable/user_guide.html
- **Examples Gallery:** https://scikit-learn.org/stable/auto_examples/

### Practice Platforms:
- **Kaggle Learn:** Free interactive ML courses
- **Kaggle Competitions:** Practice on real problems
- **Google Colab:** Free GPU for experimentation

---

## You Did It! 🎉

You went from manually calculating slopes to using professional ML tools.

**That's HUGE progress!**

**Key takeaways:**
- You understand the fundamentals (from scratch)
- You can use professional tools (sklearn)
- You know the workflow (create → fit → predict → evaluate)
- You're ready for real projects!

**Now go build something amazing!** 🚀

**Keep learning! Keep coding! Keep growing!**

Happy Learning! 💡

By Abdulellah Mojalled: [LinkedIn](https://www.linkedin.com/in/abdulellah-mojalled/)

## Extra Tips: Best Practices for Using Sklearn ✅

### 1. Always Split Your Data

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This lets you detect overfitting and gives you an honest performance estimate.

---

### 2. Train ONLY on Training Data

**Do this:**

    model.fit(X_train, y_train)

**Not this:**

    model.fit(X, y)  # Don't use all your data!

Your model needs to see fresh data at test time, not data it's already seen.

---

### 3. Evaluate on Test Data

**Right:**

    test_score = model.score(X_test, y_test)

**Misleading on its own:**

    train_score = model.score(X_train, y_train)  # Too optimistic

Test performance is what actually matters.

---

### 4. Use Multiple Metrics

    mae = mean_absolute_error(y_test, predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    r2 = r2_score(y_test, predictions)

Each metric shows something different about your model.
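
A small worked example makes the difference concrete. The numbers below are made up: two hypothetical models have the same MAE, but RMSE penalizes the one that makes a single large error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 200.0, 300.0, 400.0])

# Model A is off by 10 on every sample
pred_a = y_true + 10.0
# Model B is perfect on three samples and off by 40 on one
pred_b = y_true + np.array([0.0, 0.0, 0.0, 40.0])

results = {}
for name, pred in [("A", pred_a), ("B", pred_b)]:
    mae = mean_absolute_error(y_true, pred)
    rmse = np.sqrt(mean_squared_error(y_true, pred))
    results[name] = (mae, rmse)
    print(f"Model {name}: MAE = {mae:.1f}, RMSE = {rmse:.1f}")
# Model A: MAE = 10.0, RMSE = 10.0
# Model B: MAE = 10.0, RMSE = 20.0
```

If large individual errors are especially costly for your application, RMSE is the more informative of the two; if every dollar of error counts equally, MAE is easier to interpret.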

---

### 5. Compare Training vs Test Performance

    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)

    if train_r2 - test_r2 > 0.1:  # 0.1 is a rule of thumb, not a hard limit
        print("Warning: Possible overfitting!")

A training score far above the test score? Your model memorized the training data instead of learning general patterns.
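
To see such a gap appear, the sketch below swaps in a `DecisionTreeRegressor` (not the linear model used in this notebook) because an unrestricted tree memorizes training data easily. The data is synthetic and the depth limit of 3 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption): a linear trend plus random noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An unrestricted tree can memorize every training point, noise included
deep_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
# Limiting the depth forces the tree to capture only the broad trend
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)

gaps = {}
for name, tree in [("Unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    train_r2 = tree.score(X_train, y_train)
    test_r2 = tree.score(X_test, y_test)
    gaps[name] = train_r2 - test_r2
    print(f"{name:<13} train R² = {train_r2:.3f}, test R² = {test_r2:.3f}, "
          f"gap = {gaps[name]:.3f}")
```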

---

### 6. Use Consistent Random States

    train_test_split(X, y, random_state=42)

Makes your results reproducible.
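
A quick way to convince yourself, sketched on a toy array (the data is made up): two splits with the same `random_state` select identical test rows, while a split without one can change from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# Same random_state -> identical split every time
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=42)
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)

# No random_state -> the shuffle may differ on each run
_, _, _, y_test_c = train_test_split(X, y, test_size=0.3)

print("Same seed, same test rows:", same_split)
print("Test rows without a seed: ", y_test_c)
```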

---

### 7. Check Your Data Shapes

    print(f"X shape: {X.shape}")  # Should be (n_samples, n_features)
    print(f"y shape: {y.shape}")  # Should be (n_samples,)

Catches issues before they become confusing errors.
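
The most common shape problem is passing a single feature as a 1-D array. The sketch below uses made-up values to show the error sklearn raises and the usual `reshape(-1, 1)` fix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

square_feet = np.array([1400.0, 1800.0, 2400.0, 3000.0])  # shape (4,)
prices = np.array([240_000.0, 310_000.0, 385_000.0, 480_000.0])

# A 1-D feature array is rejected: sklearn expects X to be 2-D
try:
    LinearRegression().fit(square_feet, prices)
    raised_error = False
except ValueError as err:
    raised_error = True
    print("ValueError:", str(err).splitlines()[0])

# reshape(-1, 1) turns (n_samples,) into (n_samples, 1)
X = square_feet.reshape(-1, 1)
model = LinearRegression().fit(X, prices)
print(f"X shape: {X.shape}, y shape: {prices.shape}")
```

Selecting columns from a DataFrame with double brackets, as in `df[['Square_Feet']]`, gives a 2-D result directly and avoids the reshape.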