# Linear Regression - Complete Practical Guide

## Business Problem: Predicting House Prices

**Scenario:** You work as a Data Scientist at a real estate company. Your manager wants a model that **predicts house prices** based on features like size, bedrooms, age, etc.

---

### What is Linear Regression?

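It finds the **best-fit straight line** through your data: the line that minimizes the sum of squared differences between actual and predicted values. With more than one feature, the "line" becomes a plane (or hyperplane).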

- **Simple:** `y = mx + b` (1 feature)
- **Multiple:** `y = b0 + b1*x1 + b2*x2 + ... + bn*xn` (many features)
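
For the simple (1 feature) case, the slope and intercept have a closed-form solution. The sketch below uses made-up numbers that lie exactly on `y = 2x + 1`, so the recovered values are easy to check by hand. The variable names are illustrative only.

```python
# Toy data that lies exactly on y = 2x + 1 (made-up numbers for illustration)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

n_points = len(xs)
mean_x = sum(xs) / n_points
mean_y = sum(ys) / n_points

# Slope m = covariance(x, y) / variance(x); intercept b = mean_y - m * mean_x
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
m = cov_xy / var_x
b = mean_y - m * mean_x

print(f"y = {m:.2f}x + {b:.2f}")   # prints: y = 2.00x + 1.00
```

With noisy real data the points will not sit exactly on the line, but the same two formulas give the least-squares slope and intercept.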

---

## Step 1: Install & Import Libraries

In [None]:
# Run this ONCE to install required libraries
!pip install numpy pandas matplotlib seaborn scikit-learn -q

In [None]:
# --- Data Handling ---
import numpy as np                # Math operations on arrays
import pandas as pd               # Data tables (like Excel)

# --- Visualization ---
import matplotlib.pyplot as plt   # Creating charts
import seaborn as sns             # Beautiful statistical plots

# --- Machine Learning ---
from sklearn.model_selection import train_test_split   # Split data
from sklearn.linear_model import LinearRegression      # Our model
from sklearn.preprocessing import StandardScaler       # Scale features
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

import warnings
warnings.filterwarnings('ignore')

print("All libraries loaded!")

## Step 2: Create the Dataset

We generate **200 synthetic houses** (simulated, not real listings) with these features:

| Feature | Meaning |
|---------|---------|
| size_sqft | House size in square feet |
| bedrooms | Number of bedrooms |
| age_years | How old the house is |
| distance_city_km | Distance from city center |
| **price_lakhs** | **TARGET - what we predict** (price in lakhs; 1 lakh = 100,000) |

In [None]:
# Seed for reproducibility (everyone gets same data)
np.random.seed(42)
n = 200

# Generate features
size_sqft = np.random.uniform(500, 3500, n)
bedrooms = np.clip(np.round(size_sqft / 600 + np.random.normal(0, 0.5, n)), 1, 6)
age_years = np.random.uniform(0, 50, n)
distance_city_km = np.random.uniform(1, 30, n)

# Price formula: bigger=more, older=less, far from city=less, + random noise
price_lakhs = (
    15
    + 0.02 * size_sqft
    + 5 * bedrooms
    - 0.3 * age_years
    - 0.5 * distance_city_km
    + np.random.normal(0, 5, n)   # Real-world noise
)

# Create table
df = pd.DataFrame({
    'size_sqft': np.round(size_sqft, 1),
    'bedrooms': bedrooms.astype(int),
    'age_years': np.round(age_years, 1),
    'distance_city_km': np.round(distance_city_km, 1),
    'price_lakhs': np.round(price_lakhs, 2)
})

print(f"Created {len(df)} house records\n")
df.head(10)

## Step 3: Explore the Data (EDA)

**Rule:** ALWAYS explore before modeling!

In [None]:
# Basic info
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nStatistics:")
df.describe().round(2)

In [None]:
# Correlation heatmap - shows how features relate to price
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='RdYlGn', center=0, fmt='.2f', square=True)
plt.title('Correlation Matrix', fontsize=14)
plt.tight_layout()
plt.show()

# Values close to +1 = strong positive linear relation
# Values close to -1 = strong negative linear relation
# Values close to  0 = no linear relation (a curved relation can still exist)
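
A correlation near 0 only rules out a straight-line relationship. As a small illustration on made-up data, the sketch below compares a perfectly linear target with a curved one (`x ** 2`) that is fully determined by `x` yet has a correlation of essentially zero.

```python
import numpy as np

x_sym = np.linspace(-3, 3, 61)    # values symmetric around 0
y_linear = 2 * x_sym + 1          # perfectly linear in x
y_curved = x_sym ** 2             # fully determined by x, but not linear

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the Pearson correlation
r_linear = np.corrcoef(x_sym, y_linear)[0, 1]
r_curved = np.corrcoef(x_sym, y_curved)[0, 1]

print(f"linear target: r = {r_linear:.2f}")
print(f"curved target: |r| = {abs(r_curved):.2f}")
```

This is why the scatter plots in the next cell are worth looking at even when the heatmap shows a weak correlation.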

In [None]:
# Scatter plots: each feature vs price
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
features = ['size_sqft', 'bedrooms', 'age_years', 'distance_city_km']
colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12']

for i, feat in enumerate(features):
    ax = axes[i // 2][i % 2]
    ax.scatter(df[feat], df['price_lakhs'], alpha=0.5, color=colors[i], edgecolors='black', linewidth=0.5)
    ax.set_xlabel(feat)
    ax.set_ylabel('Price (Lakhs)')
    ax.set_title(f'{feat} vs Price')

plt.suptitle('Feature vs Price Relationships', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

# If the points roughly follow a straight line, Linear Regression is a reasonable fit for that feature

## Step 4: Prepare Data

1. Separate **X** (features) from **y** (target price)
2. Split into **80% train** / **20% test**
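
Conceptually, the split shuffles the row positions and holds out a fraction of them. The sketch below shows that idea with plain NumPy. It is not scikit-learn's implementation, and the rows it selects will differ from what `train_test_split(..., random_state=42)` picks.

```python
import numpy as np

rng = np.random.default_rng(42)        # seeded so the split is repeatable
n_rows = 200
shuffled = rng.permutation(n_rows)     # row positions 0..199 in random order

n_test = int(n_rows * 0.2)             # 20% held out for testing
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print(f"Training rows: {len(train_idx)}")   # 160
print(f"Testing rows:  {len(test_idx)}")    # 40
```

The key property is that no row appears in both sets, so the test score reflects houses the model has not seen.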

In [None]:
# Separate features and target
X = df.drop('price_lakhs', axis=1)   # Everything except price
y = df['price_lakhs']                 # Only price

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training: {X_train.shape[0]} houses")
print(f"Testing:  {X_test.shape[0]} houses")

## Step 5: Train the Model

`.fit()` finds the intercept and coefficients that minimize the sum of squared differences between predicted and actual prices (ordinary least squares).
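
As a rough picture of the problem being solved, the sketch below recovers known coefficients from made-up, noise-free data using NumPy's least-squares solver. The data, the names `X_demo` and `y_demo`, and the rule `4.0 + 1.5*x1 - 2.0*x2` are assumptions for illustration; the solver scikit-learn uses internally may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))                 # 50 rows, 2 made-up features
y_demo = 4.0 + 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1]    # known rule, no noise

# Add a column of ones so the first solved weight acts as the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: weights minimizing sum((A @ w - y) ** 2)
weights, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept_demo, coefs_demo = weights[0], weights[1:]

print("intercept:", np.round(intercept_demo, 4))
print("coefs:    ", np.round(coefs_demo, 4))
```

Because there is no noise here, the solved weights match the rule that generated the data. With noisy data they would only be close.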

In [None]:
# Create and train
model = LinearRegression()
model.fit(X_train, y_train)

# What the model learned
print("LEARNED EQUATION:")
print(f"Price = {model.intercept_:.2f}", end="")
for feat, coef in zip(X.columns, model.coef_):
    sign = "+" if coef >= 0 else "-"
    print(f" {sign} {abs(coef):.4f} x {feat}", end="")
print()

print("\nWhat each coefficient means:")
for feat, coef in zip(X.columns, model.coef_):
    effect = "increases" if coef > 0 else "decreases"
    print(f"  {feat:20s} -> +1 unit {effect} price by {abs(coef):.4f} lakhs")
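
To make the "+1 unit" reading concrete, the sketch below uses the noise-free price rule from Step 2 (not the fitted model) and changes only the size of one house. `noiseless_price` is a hypothetical helper written for this illustration.

```python
def noiseless_price(size_sqft, bedrooms, age_years, distance_city_km):
    """Price rule used to generate the data in Step 2, without the random noise."""
    return (15 + 0.02 * size_sqft + 5 * bedrooms
            - 0.3 * age_years - 0.5 * distance_city_km)

base_price = noiseless_price(1500, 3, 10, 8)
bigger_price = noiseless_price(1600, 3, 10, 8)   # same house, +100 sqft

print(f"Base price:      {base_price:.2f} lakhs")                  # 53.00
print(f"+100 sqft price: {bigger_price:.2f} lakhs")                # 55.00
print(f"Difference:      {bigger_price - base_price:.2f} lakhs")   # 100 * 0.02 = 2.00
```

A coefficient describes the change in price when that one feature moves and all others stay fixed. The fitted coefficients printed above should land near the values in this rule, though noise keeps them from matching exactly.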

## Step 6: Predict & Compare

In [None]:
# Predict on test data (model has NEVER seen these houses)
y_pred = model.predict(X_test)

# Side-by-side comparison
results = pd.DataFrame({
    'Actual': y_test.values,
    'Predicted': np.round(y_pred, 2),
    'Error': np.round(y_test.values - y_pred, 2)
})
print("Actual vs Predicted (first 10):")
results.head(10)

## Step 7: Evaluate the Model

| Metric | Meaning | Good Value |
|--------|---------|------------|
| **R²** | Fraction of the variation in price that the model explains | Closer to 1.0 |
| **MAE** | Mean absolute error, in lakhs | Lower = better |
| **RMSE** | Root mean squared error, in lakhs (penalizes big mistakes more) | Lower = better |
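
The three metrics can be computed by hand in a few lines, which makes their definitions easier to remember. The sketch below uses four made-up prices and predictions; the `_demo` names are illustrative and kept separate from the notebook's own variables.

```python
import numpy as np

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])    # made-up actual prices
y_hat_demo = np.array([2.5, 5.0, 7.5, 9.0])     # made-up predictions

errors = y_true_demo - y_hat_demo
mae_demo = np.mean(np.abs(errors))               # average size of the error
rmse_demo = np.sqrt(np.mean(errors ** 2))        # squaring weights large errors more
ss_res = np.sum(errors ** 2)                     # variation the model failed to explain
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total variation around the mean
r2_demo = 1 - ss_res / ss_tot

print(f"MAE  = {mae_demo:.3f}")    # 0.250
print(f"RMSE = {rmse_demo:.3f}")   # 0.354
print(f"R²   = {r2_demo:.3f}")     # 0.975
```

RMSE is larger than MAE here because the two non-zero errors are squared before averaging.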

In [None]:
r2   = r2_score(y_test, y_pred)
mae  = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("MODEL PERFORMANCE")
print("=" * 40)
print(f"  R² Score : {r2:.4f}  ({r2*100:.1f}% explained)")
print(f"  MAE      : {mae:.2f} lakhs")
print(f"  RMSE     : {rmse:.2f} lakhs")
print("=" * 40)

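# Rough rule-of-thumb thresholds; what counts as "good" depends on the problem
if r2 > 0.85:
    print("Excellent model!")
elif r2 > 0.7:
    print("Good model!")
else:
    print("Needs improvement - check for missing features or non-linear patterns")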

In [None]:
# --- Actual vs Predicted plot ---
# Perfect predictions would sit exactly on the red dashed line

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.6, color='#3498db', edgecolors='black', linewidth=0.5)
mn, mx = min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())
plt.plot([mn, mx], [mn, mx], 'r--', linewidth=2, label='Perfect Prediction Line')
plt.xlabel('Actual Price (Lakhs)', fontsize=12)
plt.ylabel('Predicted Price (Lakhs)', fontsize=12)
plt.title(f'Actual vs Predicted (R² = {r2:.3f})', fontsize=14)
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# --- Residual Plots ---
# Residual = Actual - Predicted (the error)

residuals = y_test - y_pred
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Residuals should be randomly scattered around 0 (no pattern)
axes[0].scatter(y_pred, residuals, alpha=0.6, color='#e74c3c', edgecolors='black', linewidth=0.5)
axes[0].axhline(y=0, color='black', linestyle='--')
axes[0].set_xlabel('Predicted Price')
axes[0].set_ylabel('Residual (Error)')
axes[0].set_title('Residuals vs Predicted')

# Residuals should follow a bell curve centered at 0
axes[1].hist(residuals, bins=20, color='#9b59b6', edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residual (Error)')
axes[1].set_ylabel('Count')
axes[1].set_title('Residual Distribution')

plt.tight_layout()
plt.show()
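
To see what a "pattern" in the residuals looks like, the sketch below fits a straight line to made-up curved data (`y = x ** 2`). The residuals are positive at both ends and negative in the middle, the kind of U-shape that suggests a straight line is the wrong form for the data.

```python
import numpy as np

x_curve = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y_curve = x_curve ** 2                     # curved relationship (made up)

# Fit a straight line; np.polyfit returns [slope, intercept] for degree 1
slope_demo, intercept_demo = np.polyfit(x_curve, y_curve, 1)
residuals_demo = y_curve - (slope_demo * x_curve + intercept_demo)

print(np.round(residuals_demo, 2))
```

For the house data, the price rule is linear by construction, so the residual plot above should look like a shapeless cloud around 0.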

## Step 8: Feature Importance

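Which features influence price the most? Raw coefficients are in different units (lakhs per sqft vs. lakhs per bedroom), so we standardize the features first to compare them on one scale.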
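
The sketch below shows why scaling matters, using made-up data with two of the Step 2 coefficients and no noise. `fit_ols` is a hypothetical helper for this illustration. The scaling divides by the population standard deviation (NumPy's default), which is also what `StandardScaler` does.

```python
import numpy as np

rng = np.random.default_rng(1)
size_demo = rng.uniform(500, 3500, 100)             # large numeric range (made-up data)
age_demo = rng.uniform(0, 50, 100)                  # small numeric range
price_demo = 15 + 0.02 * size_demo - 0.3 * age_demo # Step 2 coefficients, no noise

def fit_ols(features, target):
    """Return [intercept, coef_1, coef_2, ...] via least squares."""
    design = np.column_stack([np.ones(len(features)), features])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return weights

raw = np.column_stack([size_demo, age_demo])
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)   # mean 0, std 1 per column

raw_coefs = fit_ols(raw, price_demo)[1:]
scaled_coefs = fit_ols(scaled, price_demo)[1:]

print("raw coefficients:   ", np.round(raw_coefs, 3))
print("scaled coefficients:", np.round(scaled_coefs, 3))
```

The raw size coefficient (0.02) looks tiny next to age (-0.3), yet after scaling size has the larger effect, because each scaled coefficient equals the raw one multiplied by that feature's standard deviation. When features are correlated with each other, as size and bedrooms are in this dataset, scaled coefficients are still only a rough guide to importance.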

In [None]:
# Scale features to mean 0 / std 1 (fitted on training data only) so coefficients are comparable across units
scaler = StandardScaler()
model_s = LinearRegression()
model_s.fit(scaler.fit_transform(X_train), y_train)

imp = pd.DataFrame({'Feature': X.columns, 'Coefficient': model_s.coef_})
imp = imp.sort_values('Coefficient')

plt.figure(figsize=(8, 5))
colors = ['#e74c3c' if c < 0 else '#2ecc71' for c in imp['Coefficient']]
plt.barh(imp['Feature'], imp['Coefficient'], color=colors, edgecolor='black')
plt.axvline(x=0, color='black', linewidth=0.5)
plt.xlabel('Change in Price (lakhs) per 1 std-dev increase in feature')
plt.title('Feature Importance (Green=Increases, Red=Decreases Price)', fontsize=13)
plt.tight_layout()
plt.show()

## Step 9: Predict on New Houses (Business Use Case)

In [None]:
# New houses that the company wants to price
new_houses = pd.DataFrame({
    'size_sqft':        [1200,  2500,   800,  3000],
    'bedrooms':         [   2,     4,     1,     5],
    'age_years':        [   5,    15,    30,     2],
    'distance_city_km': [  10,     5,    20,     3]
})

new_houses['predicted_price'] = np.round(model.predict(new_houses), 2)

print("NEW LISTING PRICE PREDICTIONS")
print("=" * 60)
for i, row in new_houses.iterrows():
    # iterrows() upcasts mixed int/float rows to float, so format counts with :.0f
    print(f"House {i+1}: {row['size_sqft']:.0f}sqft, {row['bedrooms']:.0f}BHK, "
          f"{row['age_years']:.0f}yrs old, {row['distance_city_km']:.0f}km from city")
    print(f"  --> Predicted Price: {row['predicted_price']:.2f} Lakhs\n")

## Step 10: Simple Linear Regression (Visual Understanding)

Using **only 1 feature** so we can SEE the regression line on a 2D plot.

In [None]:
# Simple model: only size -> price
simple = LinearRegression()
simple.fit(df[['size_sqft']], df['price_lakhs'])

# Line points (kept as a DataFrame so the column name matches what the model was fit on)
x_line = pd.DataFrame({'size_sqft': np.linspace(df['size_sqft'].min(), df['size_sqft'].max(), 100)})
y_line = simple.predict(x_line)

plt.figure(figsize=(10, 6))
plt.scatter(df['size_sqft'], df['price_lakhs'], alpha=0.5, color='#3498db',
            edgecolors='black', linewidth=0.5, label='Data Points')
plt.plot(x_line['size_sqft'], y_line, color='red', linewidth=3, label='Best Fit Line')
plt.xlabel('House Size (sqft)', fontsize=13)
plt.ylabel('Price (Lakhs)', fontsize=13)
plt.title('Simple Linear Regression: Size vs Price', fontsize=15)
plt.legend(fontsize=11)

# Show equation on plot
plt.text(0.05, 0.95, f'Price = {simple.coef_[0]:.3f} x Size {simple.intercept_:+.2f}',
         transform=plt.gca().transAxes, fontsize=12, va='top',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
plt.tight_layout()
plt.show()

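# Score both models on the same held-out test set so the numbers are comparable
simple_eval = LinearRegression().fit(X_train[['size_sqft']], y_train)
r2_simple = r2_score(y_test, simple_eval.predict(X_test[['size_sqft']]))
print(f"Simple Model R² (test) = {r2_simple:.4f}")
print(f"Full Model R² (test)   = {r2:.4f}")
print("\nThe full model scores higher when the extra features carry real signal, as age and distance do here.")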

---

## Practice Exercises

Try these yourself to solidify your understanding!

In [None]:
# ============================================================
# EXERCISE 1: Predict YOUR dream house price
# ============================================================
# Change the values below to your dream house and run the cell!

my_house = pd.DataFrame({
    'size_sqft':        [1500],     # Change this: house size in sqft
    'bedrooms':         [3],        # Change this: number of bedrooms
    'age_years':        [10],       # Change this: age of house
    'distance_city_km': [8]         # Change this: distance from city
})

my_price = model.predict(my_house)[0]
print(f"Your dream house predicted price: {my_price:.2f} Lakhs")

In [None]:
# ============================================================
# EXERCISE 2: What happens if we use only 2 features?
# ============================================================
# Try: train a model using only size_sqft and bedrooms
# Compare R² with the full model

X_two = df[['size_sqft', 'bedrooms']]        # Only 2 features
y_two = df['price_lakhs']

X2_train, X2_test, y2_train, y2_test = train_test_split(X_two, y_two, test_size=0.2, random_state=42)

model_two = LinearRegression()
model_two.fit(X2_train, y2_train)

y2_pred = model_two.predict(X2_test)
r2_two = r2_score(y2_test, y2_pred)

print(f"2-feature model R² = {r2_two:.4f}")
print(f"4-feature model R² = {r2:.4f}")
print(f"\nDifference: {(r2 - r2_two)*100:.1f} percentage points -- the dropped features (age, distance) carried useful signal")

In [None]:
# ============================================================
# EXERCISE 3: Try different train/test split ratios
# ============================================================

for test_pct in [0.1, 0.2, 0.3, 0.5]:
    Xt, Xte, yt, yte = train_test_split(X, y, test_size=test_pct, random_state=42)
    m = LinearRegression().fit(Xt, yt)
    score = r2_score(yte, m.predict(Xte))
    print(f"Train {(1-test_pct)*100:.0f}% / Test {test_pct*100:.0f}%  ->  R² = {score:.4f}  (train size: {len(Xt)})")

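# Notice: less training data tends to give a less reliable model, and a small
# test set gives a noisy R² estimate. With a single random_state, some of the
# differences above are just luck of the split.
# An 80/20 split is a common, reasonable default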
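
One way to separate the effect of the split ratio from the luck of a single split is to repeat each ratio over several random seeds and look at the average and the spread. The sketch below does this on a made-up noisy line using NumPy only; `holdout_r2` is a hypothetical helper, and the numbers it prints are not the house-price results.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 200
x_noisy = rng.uniform(0, 10, n_rows)
y_noisy = 3.0 * x_noisy + 2.0 + rng.normal(0, 2.0, n_rows)   # made-up noisy line

def holdout_r2(test_fraction, seed):
    """Fit a line on a random train split and return R² on the held-out rows."""
    order = np.random.default_rng(seed).permutation(n_rows)
    n_test = int(n_rows * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    slope, intercept = np.polyfit(x_noisy[train_idx], y_noisy[train_idx], 1)
    pred = slope * x_noisy[test_idx] + intercept
    ss_res = np.sum((y_noisy[test_idx] - pred) ** 2)
    ss_tot = np.sum((y_noisy[test_idx] - y_noisy[test_idx].mean()) ** 2)
    return 1 - ss_res / ss_tot

for test_fraction in [0.1, 0.2, 0.3, 0.5]:
    scores = np.array([holdout_r2(test_fraction, seed) for seed in range(20)])
    print(f"test {test_fraction:.0%}: mean R² = {scores.mean():.3f}, "
          f"spread (std) = {scores.std():.3f}")
```

If the spread across seeds is as large as the gap between two ratios, the gap seen with one `random_state` should not be read as a real difference.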

---

## Quick Reference Summary

| Step | Code | Purpose |
|------|------|---------|
| Create model | `model = LinearRegression()` | Initialize |
| Train | `model.fit(X_train, y_train)` | Learn patterns |
| Predict | `model.predict(X_test)` | Make predictions |
| Evaluate | `r2_score(y_test, y_pred)` | Check accuracy |
| Coefficients | `model.coef_` | Feature weights |
| Intercept | `model.intercept_` | Base value |

### When to use Linear Regression:
- Predicting a **continuous number** (price, salary, temperature)
- Features have a roughly **linear relationship** with the target
- You need an **explainable** model

### When NOT to use:
- Classification (yes/no) -> use Logistic Regression
- Non-linear patterns -> try Decision Trees / Random Forest
- Very complex, high-dimensional data (images, text, audio) -> consider Neural Networks