# 1. Linear Regression

## 📖 What is Linear Regression?

Linear Regression models a continuous output (y) as a weighted sum of the input features (X). With one feature this is a straight line; with several features it is a hyperplane:

**Formula**: `y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ`

Where:
- `β₀` = intercept (bias)
- `β₁, β₂, ..., βₙ` = coefficients (weights)
- `x₁, x₂, ..., xₙ` = features

**How it works**: Finds the coefficients that minimize the sum of squared errors, i.e. the squared differences between predictions and actual values (ordinary least squares).
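
As a minimal sketch of the least-squares idea above, the snippet below estimates the intercept and coefficients with NumPy's `np.linalg.lstsq`. The synthetic data, the true coefficients (3, 2, -1), and names such as `beta_hat` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 3 + 2*x1 - 1*x2 + small noise
n_samples = 200
X = rng.normal(size=(n_samples, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(n_samples), X])

# Least squares: find beta minimizing ||X_design @ beta - y||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta_hat
sse = float(np.sum(residuals ** 2))

print("Estimated [intercept, b1, b2]:", np.round(beta_hat, 2))
print("Sum of squared errors:", round(sse, 3))
```

Any other coefficient vector gives a sum of squared errors at least as large as `sse` on the same data, which is what "best-fit" means here.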

## 🎯 Why Use Linear Regression?

### **Advantages:**
1. **Simple and Fast** - Easy to implement, quick to train
2. **Interpretable** - Clear understanding of feature impact
3. **Low Computational Cost** - Works on large datasets
4. **Good Baseline** - Starting point for comparison
5. **Well-Understood** - Extensive theory and diagnostics

### **Disadvantages:**
1. **Assumes Linearity** - Can't capture complex non-linear patterns
2. **Sensitive to Outliers** - Outliers heavily influence the line
3. **Multicollinearity Issues** - Correlated features make coefficient estimates unstable
4. **Overfitting Risk** - When there are many features relative to the number of samples

## ⏱️ When to Use Linear Regression

### ✅ **Use When:**

**1. Linear Relationship Exists**
- Example: Predict house price from square footage
- Why: Price increases roughly linearly with size
- Data: A scatter plot of square footage vs price falls close to a straight line

**2. Need Interpretability**
- Example: Medical research - understand factor impact
- Why: Each coefficient gives the expected change in the output per unit change in a feature, holding the others fixed
- Benefit: Can explain predictions to stakeholders

**3. Small to Medium Dataset**
- Example: 100-10,000 samples
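- Why: Efficient, and overfitting risk is low when samples far outnumber features
- Alternative: For very large data, an iterative solver such as scikit-learn's `SGDRegressor` may be more practical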

**4. Quick Baseline Needed**
- Example: Start of ML project
- Why: Fast to implement and test
- Next step: Compare against complex models

**5. Few Features**
- Example: 1-20 features
- Why: Fewer parameters to estimate, so less risk of overfitting
- Note: Can handle more with regularization

### ❌ **Don't Use When:**

**1. Non-Linear Relationships**
- Example: Exponential growth (population, virus spread)
- Better: Polynomial regression, tree-based models
- Why: Linear model won't capture curves

**2. Many Outliers**
- Example: Income data (billionaires skew results)
- Better: Robust regression, tree models
- Why: Outliers pull the line away from most data

**3. Categorical Predictions**
- Example: Classify spam/not spam
- Better: Logistic regression, classifiers
- Why: Linear regression predicts continuous values, not class labels or probabilities

**4. High Multicollinearity**
- Example: Correlated features (height in cm and inches)
- Better: Ridge/Lasso regression, PCA first
- Why: Coefficient estimates become unstable and hard to interpret
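
The multicollinearity case above can be checked numerically before fitting. The sketch below computes the variance inflation factor (VIF) for each feature by regressing it on the others. The data and the helper `vif` are hypothetical, written for this illustration. A commonly quoted rule of thumb treats VIF values above roughly 5 to 10 as a warning sign, though the threshold is a convention rather than a hard rule.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples = 300

# Hypothetical features: height in cm, the same height in inches with slight
# measurement noise, and an unrelated feature.
height_cm = rng.normal(170, 10, n_samples)
height_in = height_cm / 2.54 + rng.normal(0, 0.1, n_samples)
weight_kg = rng.normal(70, 12, n_samples)
X = np.column_stack([height_cm, height_in, weight_kg])
names = ["height_cm", "height_in", "weight_kg"]


def vif(X, j):
    """VIF for column j: 1 / (1 - R^2) from regressing column j on the rest."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)


vifs = {name: vif(X, j) for j, name in enumerate(names)}
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")
```

The two height columns carry nearly the same information, so each is almost perfectly predicted by the other and receives a very large VIF, while the unrelated feature stays close to 1.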

## 📊 Complexity

- **Training Time**: O(n × p² + p³) for the closed-form least-squares solution, where n=samples, p=features
- **Prediction Time**: O(p) per sample
- **Space**: O(p) for the fitted model

## 🌍 Real-World Applications

1. **Real Estate** - Predict house prices from features
2. **Finance** - Stock price forecasting, risk assessment
3. **Marketing** - Sales prediction based on ad spend
4. **Healthcare** - Predict patient recovery time
5. **Economics** - GDP prediction, inflation modeling
6. **Weather** - Temperature forecasting

## 💡 Key Insights

✅ Best for linear relationships  
✅ Highly interpretable - coefficients show the direction and size of each feature's effect  
✅ Fast and simple - good starting point  
✅ Assumptions: linearity, independent errors, homoscedasticity, normally distributed residuals  
✅ Check residual plots to validate assumptions  
✅ Use regularization (Ridge/Lasso) for many features  
✅ Scale features when using regularization or gradient-based solvers (plain least-squares predictions do not depend on feature scale)
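
To make the points about scaling and regularization concrete, here is a hedged sketch comparing plain least squares and Ridge with and without standardization. The data, the choice `alpha=100.0`, and all variable names are assumptions for illustration. The intent is to show that standardizing leaves ordinary least-squares predictions unchanged, while a Ridge fit depends on feature scale because its penalty acts on coefficient size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 200

# Two features on very different scales (values assumed for illustration)
sqft = rng.uniform(800, 3500, n_samples)
bedrooms = rng.integers(1, 6, n_samples).astype(float)
X = np.column_stack([sqft, bedrooms])
y = 150 * sqft + 20000 * bedrooms + rng.normal(0, 20000, n_samples)

# Plain least squares: standardizing changes the coefficients, not the predictions
ols_raw = LinearRegression().fit(X, y)
ols_scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
ols_same = np.allclose(ols_raw.predict(X), ols_scaled.predict(X))

# Ridge: the penalty acts on coefficient size, so feature scale changes the fit
ridge_raw = Ridge(alpha=100.0).fit(X, y)
ridge_scaled = make_pipeline(StandardScaler(), Ridge(alpha=100.0)).fit(X, y)
ridge_same = np.allclose(ridge_raw.predict(X), ridge_scaled.predict(X))

print("OLS predictions identical after scaling:", ols_same)
print("Ridge predictions identical after scaling:", ridge_same)
```

Putting the scaler inside a pipeline also means it is fit only on the training data when the pipeline is used with a train/test split or cross-validation.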

In [None]:
# LINEAR REGRESSION - COMPLETE EXAMPLE

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

print("="*80)
print("LINEAR REGRESSION - HOUSE PRICE PREDICTION")
print("="*80)

# 1. CREATE SAMPLE DATA
print("\n1. CREATING SAMPLE DATA")
print("-"*80)

np.random.seed(42)
n_samples = 200

# Features
square_feet = np.random.randint(800, 3500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)

# Target: Price (with some noise)
# Formula: Price = 100 + 150*sqft + 20000*bedrooms - 2000*age + noise
price = (100 + 
         150 * square_feet + 
         20000 * bedrooms - 
         2000 * age + 
         np.random.normal(0, 20000, n_samples))

# Create DataFrame
df = pd.DataFrame({
    'SquareFeet': square_feet,
    'Bedrooms': bedrooms,
    'Age': age,
    'Price': price
})

print(f"Dataset shape: {df.shape}")
print(f"\nFirst 5 rows:")
print(df.head())
print(f"\nStatistics:")
print(df.describe())

# 2. PREPARE DATA
print("\n2. PREPARING DATA")
print("-"*80)

X = df[['SquareFeet', 'Bedrooms', 'Age']]
y = df['Price']

# Split into train and test sets (80-20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# 3. CREATE AND TRAIN MODEL
print("\n3. TRAINING LINEAR REGRESSION MODEL")
print("-"*80)

# Initialize model
model = LinearRegression()

# Train model
model.fit(X_train, y_train)

print("Model trained successfully!")
print(f"\nIntercept (β₀): ${model.intercept_:,.2f}")
print(f"\nCoefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"  {feature}: ${coef:,.2f}")

# 4. MAKE PREDICTIONS
print("\n4. MAKING PREDICTIONS")
print("-"*80)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Show sample predictions
print("\nSample predictions (first 5 test samples):")
comparison = pd.DataFrame({
    'Actual': y_test.values[:5],
    'Predicted': y_pred_test[:5],
    'Difference': y_test.values[:5] - y_pred_test[:5]
})
print(comparison)

# 5. EVALUATE MODEL
print("\n5. MODEL EVALUATION")
print("-"*80)

# Calculate metrics
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
train_mae = mean_absolute_error(y_train, y_pred_train)
test_mae = mean_absolute_error(y_test, y_pred_test)

print("Training Set Metrics:")
print(f"  R² Score: {train_r2:.4f}")
print(f"  RMSE: ${train_rmse:,.2f}")
print(f"  MAE: ${train_mae:,.2f}")

print("\nTest Set Metrics:")
print(f"  R² Score: {test_r2:.4f}")
print(f"  RMSE: ${test_rmse:,.2f}")
print(f"  MAE: ${test_mae:,.2f}")

# 6. INTERPRET RESULTS
print("\n6. INTERPRETATION")
print("-"*80)

print(f"\nModel Equation:")
print(f"Price = {model.intercept_:,.2f} + ")
print(f"        {model.coef_[0]:,.2f} × SquareFeet + ")
print(f"        {model.coef_[1]:,.2f} × Bedrooms + ")
print(f"        {model.coef_[2]:,.2f} × Age")

print(f"\nInterpretation (each effect assumes the other features are held constant):")
print(f"  - Each additional square foot adds ${model.coef_[0]:,.2f} to price")
print(f"  - Each additional bedroom adds ${model.coef_[1]:,.2f} to price")
print(f"  - Each year of age changes price by ${model.coef_[2]:,.2f}")
print(f"  - R² = {test_r2:.2%} of test-set variance explained by model")

# 7. MAKE NEW PREDICTION
print("\n7. PREDICT NEW HOUSE PRICE")
print("-"*80)

new_house = pd.DataFrame({
    'SquareFeet': [2000],
    'Bedrooms': [3],
    'Age': [10]
})

predicted_price = model.predict(new_house)[0]

print(f"New house features:")
print(f"  Square Feet: {new_house['SquareFeet'].values[0]:,}")
print(f"  Bedrooms: {new_house['Bedrooms'].values[0]}")
print(f"  Age: {new_house['Age'].values[0]} years")
print(f"\nPredicted Price: ${predicted_price:,.2f}")

# 8. VISUALIZE RESULTS
print("\n8. CREATING VISUALIZATIONS")
print("-"*80)

fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Actual vs Predicted
axes[0, 0].scatter(y_test, y_pred_test, alpha=0.5)
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0, 0].set_xlabel('Actual Price')
axes[0, 0].set_ylabel('Predicted Price')
axes[0, 0].set_title(f'Actual vs Predicted (R² = {test_r2:.3f})')
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Residuals
residuals = y_test - y_pred_test
axes[0, 1].scatter(y_pred_test, residuals, alpha=0.5)
axes[0, 1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0, 1].set_xlabel('Predicted Price')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('Residual Plot')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Model coefficients
# Note: raw coefficients are per unit of each feature, so their sizes are not
# directly comparable across features measured on different scales.
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values('Coefficient', ascending=False)

axes[1, 0].barh(feature_importance['Feature'], feature_importance['Coefficient'])
axes[1, 0].set_xlabel('Coefficient Value')
axes[1, 0].set_title('Model Coefficients (Unscaled)')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Distribution of Errors
axes[1, 1].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Residual Value')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Distribution of Residuals')
axes[1, 1].axvline(x=0, color='r', linestyle='--', lw=2)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Visualizations created!")

print("\n" + "="*80)
print("SUMMARY")
print("="*80)
print("✓ Linear Regression successfully trained")
print(f"✓ Model explains {test_r2:.1%} of price variation")
print(f"✓ Average prediction error: ${test_mae:,.2f}")
print("✓ Best for: Linear relationships, interpretability needed")
print("✓ Assumptions: Check residual plots for validation")
print("="*80)

# 2. Polynomial Regression

## 📖 What is Polynomial Regression?

Polynomial Regression extends linear regression by adding polynomial terms (x², x³, etc.) to capture **non-linear relationships**.

**Formula**: `y = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ`

**How it works**: 
1. Transform features into polynomial features
2. Apply linear regression on transformed features
3. Result: A curve that can follow non-linear trends in the data

**Example**:
```
Original: x = [1, 2, 3]
Degree 2: [1, x, x²] = [1, 1, 1], [1, 2, 4], [1, 3, 9]
Degree 3: [1, x, x², x³] = [1, 1, 1, 1], [1, 2, 4, 8], [1, 3, 9, 27]
```
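
The expansion above can be reproduced with scikit-learn's `PolynomialFeatures`. This is a small sketch; the array name `x` simply mirrors the example, and the default `include_bias=True` is what produces the leading column of ones.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1], [2], [3]])  # one feature, three samples

# include_bias=True (the default) adds the constant column of ones
poly2 = PolynomialFeatures(degree=2).fit_transform(x)
poly3 = PolynomialFeatures(degree=3).fit_transform(x)

print(poly2)  # rows are [1, x, x^2]
print(poly3)  # rows are [1, x, x^2, x^3]
```

A linear model fit on these columns is linear in its coefficients but produces a curve in the original `x`.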

## 🎯 Why Use Polynomial Regression?

### **Advantages:**
1. **Captures Non-Linearity** - Models curves and complex patterns
2. **Still Interpretable** - Extension of linear regression
3. **Flexible** - Control complexity with degree
4. **No New Algorithm** - Uses linear regression

### **Disadvantages:**
1. **Overfitting Risk** - High degrees fit noise
2. **Extrapolation Issues** - Poor predictions outside training range
3. **More Features** - Computational cost increases
4. **Multicollinearity** - x and x² are correlated

## ⏱️ When to Use Polynomial Regression

### ✅ **Use When:**

**1. Curved Relationships**
- Example: Temperature vs ice cream sales (peaks at optimal temp)
- Why: Linear model misses the curve
- Degree: Usually 2-3

**2. Known Non-Linear Physics**
- Example: Projectile motion (parabolic)
- Why: The governing equation is itself a polynomial (height is quadratic in time under constant gravity)
- Degree: Based on physics (often 2)

**3. Small Dataset with Curve**
- Example: 100 samples showing clear curve
- Why: Tree models might overfit
- Benefit: Smooth curve generalization

**4. Need Interpretability**
- Example: Scientific research
- Why: Can still interpret coefficients
- Alternative: Neural networks are black boxes

**5. Interpolation Needed**
- Example: Fill gaps in time series
- Why: Smooth curve between points
- Warning: Don't extrapolate far!

### ❌ **Don't Use When:**

**1. Need to Extrapolate**
- Example: Predict 10 years beyond training data
- Better: Domain-specific models
- Why: Polynomials diverge outside range

**2. Many Features**
- Example: 100 features → thousands after polynomial transform (5,151 terms at degree 2)
- Better: Tree-based models, neural networks
- Why: The number of terms grows combinatorially with features and degree

**3. Very Complex Non-Linearity**
- Example: Image recognition
- Better: Neural networks
- Why: Polynomial can't capture complex patterns

**4. Large Dataset**
- Example: Millions of samples
- Better: Gradient boosting, neural networks
- Why: Building and fitting the expanded feature matrix becomes costly, especially with many features or a high degree
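
The extrapolation warning in this list can be illustrated with a short experiment. The sketch below assumes a hypothetical ground truth (a sine wave observed only on [0, 6]) and a degree-5 fit. It compares the error at a point inside the training range with the error at a point well outside it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical ground truth: a sine wave, observed only on [0, 6]
X_train = rng.uniform(0, 6, size=(80, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, 80)

model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
model.fit(X_train, y_train)

X_inside = np.array([[3.0]])    # within the training range
X_outside = np.array([[12.0]])  # far outside the training range

err_inside = abs(model.predict(X_inside)[0] - np.sin(3.0))
err_outside = abs(model.predict(X_outside)[0] - np.sin(12.0))

print(f"Absolute error at x=3  (interpolation): {err_inside:.3f}")
print(f"Absolute error at x=12 (extrapolation): {err_outside:.3f}")
```

Outside the training range the highest-degree term dominates, so the fitted curve moves away from the bounded true function even though it tracks it closely inside the range.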

## 📊 Complexity

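- **Training Time**: O(n × m² + m³) for closed-form least squares, where n=samples, p=original features, d=degree, and m = C(p + d, d) = O(p^d) is the number of expanded features
- **Prediction Time**: O(m) = O(p^d) per sample
- **Space**: O(m) = O(p^d) features created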
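
The feature counts behind these complexity figures can be computed directly. As a sketch, the hypothetical helper `n_poly_features` below counts all polynomial terms of total degree at most d in p features (bias included) using a binomial coefficient, and cross-checks one small case against scikit-learn's `PolynomialFeatures`.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(p, d):
    """Number of polynomial terms of total degree <= d in p features (bias included)."""
    return comb(p + d, d)


for p, d in [(1, 2), (3, 2), (10, 3), (100, 2)]:
    print(f"p={p:>3}, degree={d}: {n_poly_features(p, d)} features")

# Cross-check against scikit-learn for a small case (4 features, degree 3)
poly = PolynomialFeatures(degree=3).fit(np.zeros((1, 4)))
sklearn_count = poly.n_output_features_
print("scikit-learn count for p=4, degree=3:", sklearn_count)
```

The count grows quickly in both p and d, which is why the polynomial transform is usually limited to a few features and a low degree.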

## 🌍 Real-World Applications

1. **Physics** - Trajectory prediction, force relationships
2. **Economics** - Diminishing returns, growth curves
3. **Biology** - Population growth, enzyme kinetics
4. **Marketing** - Sales response curves
5. **Engineering** - Stress-strain relationships
6. **Climate** - Temperature patterns

## 💡 Key Insights

✅ Best for curved relationships  
✅ Degree 2-3 usually sufficient  
✅ Higher degree = overfitting risk  
✅ Use cross-validation to choose the degree  
✅ Don't extrapolate far from training range  
✅ Consider regularization (Ridge/Lasso)  
✅ Visualize to choose appropriate degree
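
One way to apply the cross-validation advice is sketched below, assuming synthetic quadratic data similar in spirit to a temperature-vs-sales curve. The data, the candidate degrees, and the 5-fold setting are illustrative choices, not a prescribed recipe.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Synthetic quadratic data (assumed for illustration)
X = rng.uniform(0, 40, size=(100, 1))
y = -2 * (X[:, 0] - 25) ** 2 + 1000 + rng.normal(0, 50, 100)

cv_scores = {}
for degree in [1, 2, 3, 5]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # 5-fold cross-validated R^2, averaged over the folds
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_scores[degree] = fold_scores.mean()
    print(f"Degree {degree}: mean CV R^2 = {cv_scores[degree]:.4f}")

best_degree = max(cv_scores, key=cv_scores.get)
print("Degree selected by cross-validation:", best_degree)
```

Because the degree is chosen from held-out folds of the training data, a separate test set can still give an unbiased final estimate of performance.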

In [None]:
# POLYNOMIAL REGRESSION - COMPLETE EXAMPLE

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

print("="*80)
print("POLYNOMIAL REGRESSION - TEMPERATURE vs SALES")
print("="*80)

# 1. CREATE NON-LINEAR DATA
print("\n1. CREATING NON-LINEAR DATA")
print("-"*80)

np.random.seed(42)
n_samples = 100

# Temperature (°C)
temperature = np.random.uniform(0, 40, n_samples)

# Sales: Quadratic relationship (peaks around 25°C)
# Formula: Sales = -2*(temp-25)² + 1000 + noise
sales = -2 * (temperature - 25)**2 + 1000 + np.random.normal(0, 50, n_samples)

# Create DataFrame
df = pd.DataFrame({
    'Temperature': temperature,
    'Sales': sales
})

print(f"Dataset shape: {df.shape}")
print(f"\nFirst 5 rows:")
print(df.head())

# 2. PREPARE DATA
print("\n2. PREPARING DATA")
print("-"*80)

X = df[['Temperature']]
y = df['Sales']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# 3. TRAIN MODELS WITH DIFFERENT DEGREES
print("\n3. TRAINING MODELS (Degrees 1, 2, 3, 5)")
print("-"*80)

degrees = [1, 2, 3, 5]
models = {}
scores = {}

for degree in degrees:
    # Create pipeline: PolynomialFeatures → LinearRegression
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    
    # Train
    model.fit(X_train, y_train)
    
    # Predict
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    # Evaluate
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    
    models[degree] = model
    scores[degree] = {
        'train_r2': train_r2,
        'test_r2': test_r2,
        'test_rmse': test_rmse
    }
    
    print(f"\nDegree {degree}:")
    print(f"  Train R²: {train_r2:.4f}")
    print(f"  Test R²: {test_r2:.4f}")
    print(f"  Test RMSE: {test_rmse:.2f}")
    
    # Show number of features created
    n_features = model.named_steps['poly'].n_output_features_
    print(f"  Features created: {n_features}")

# 4. FIND BEST DEGREE
print("\n4. SELECTING BEST DEGREE")
print("-"*80)

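# NOTE: picking the degree by test R² reuses the test set for model selection,
# so the reported test score is optimistic. In practice, choose the degree with
# cross-validation on the training data and keep the test set for a final check.
best_degree = max(scores, key=lambda k: scores[k]['test_r2'])
best_model = models[best_degree]

print(f"Best degree: {best_degree}")
print(f"Best test R²: {scores[best_degree]['test_r2']:.4f}")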
print(f"Best test RMSE: {scores[best_degree]['test_rmse']:.2f}")

# 5. MAKE PREDICTIONS
print("\n5. MAKING PREDICTIONS WITH BEST MODEL")
print("-"*80)

# Predict for new temperatures
new_temps = pd.DataFrame({'Temperature': [15, 25, 35]})
predictions = best_model.predict(new_temps)

print("\nPredictions for new temperatures:")
for temp, pred in zip(new_temps['Temperature'], predictions):
    print(f"  Temperature: {temp}°C → Predicted Sales: {pred:.2f}")

# 6. VISUALIZE RESULTS
print("\n6. CREATING VISUALIZATIONS")
print("-"*80)

fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Create smooth curve for plotting
X_plot = np.linspace(0, 40, 300).reshape(-1, 1)

# Plot 1-4: Different polynomial degrees
for idx, degree in enumerate(degrees):
    row = idx // 2
    col = idx % 2
    
    # Scatter plot of actual data
    axes[row, col].scatter(X_train, y_train, alpha=0.5, label='Training data')
    axes[row, col].scatter(X_test, y_test, alpha=0.5, color='red', label='Test data')
    
    # Plot fitted curve
    y_plot = models[degree].predict(X_plot)
    axes[row, col].plot(X_plot, y_plot, 'g-', linewidth=2, label=f'Degree {degree} fit')
    
    axes[row, col].set_xlabel('Temperature (°C)')
    axes[row, col].set_ylabel('Sales')
    axes[row, col].set_title(
        f'Polynomial Degree {degree}\n'
        f'Train R²={scores[degree]["train_r2"]:.3f}, '
        f'Test R²={scores[degree]["test_r2"]:.3f}'
    )
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Comparison plot
plt.figure(figsize=(12, 6))
plt.scatter(X_train, y_train, alpha=0.5, label='Training data')

colors = ['blue', 'green', 'orange', 'red']
for degree, color in zip(degrees, colors):
    y_plot = models[degree].predict(X_plot)
    plt.plot(X_plot, y_plot, color=color, linewidth=2, 
             label=f'Degree {degree} (R²={scores[degree]["test_r2"]:.3f})')

plt.xlabel('Temperature (°C)')
plt.ylabel('Sales')
plt.title('Comparison of Polynomial Degrees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Visualizations created!")

# 7. DEMONSTRATE OVERFITTING
print("\n7. OVERFITTING ANALYSIS")
print("-"*80)

print("\nDegree vs Performance:")
print(f"{'Degree':<10} {'Train R²':<12} {'Test R²':<12} {'Difference':<12}")
print("-"*50)
for degree in degrees:
    train_r2 = scores[degree]['train_r2']
    test_r2 = scores[degree]['test_r2']
    diff = train_r2 - test_r2
    print(f"{degree:<10} {train_r2:<12.4f} {test_r2:<12.4f} {diff:<12.4f}")

print("\nInterpretation:")
print("  - Large difference = Overfitting")
print("  - Degree 1 (linear) = Underfitting")
print(f"  - Degree {best_degree} = Best balance")

print("\n" + "="*80)
print("SUMMARY")
print("="*80)
print("✓ Polynomial Regression successfully trained")
print(f"✓ Best polynomial degree: {best_degree}")
print("✓ Captures curved relationships")
print("✓ Use when: Non-linear but smooth relationships")
print("✓ Warning: Don't extrapolate beyond training range!")
print("✓ Tip: Use cross-validation to choose degree")
print("="*80)