MODULE 4 - MODEL DEVELOPMENT

•	Simple and Multiple Linear Regression Model
•	Evaluation Using Visualization
•	Polynomial Regression and Pipelines
•	R-squared and MSE for In-Sample Evaluation
•	Prediction and Decision Making


This module covers model development concepts in detail, focusing on **Simple and Multiple Linear Regression**, **Model Evaluation Using Visualization**, **Polynomial Regression and Pipelines**, **R-squared and MSE for In-Sample Evaluation**, and **Prediction and Decision Making**. Each section includes an explanation and Python code using `pandas`, `scikit-learn`, `numpy`, and `matplotlib`. The explanations assume a foundational understanding of data cleaning (as covered previously) and focus on regression modeling for predictive tasks.

---

## 1. Simple and Multiple Linear Regression Model

### Simple Linear Regression
Simple linear regression models the relationship between one independent variable (predictor) and one dependent variable (target) using a linear equation:

\[ y = \beta_0 + \beta_1 x \]

- **\(\beta_0\)**: Intercept (value of \(y\) when \(x = 0\)).
- **\(\beta_1\)**: Slope (change in \(y\) for a unit change in \(x\)).
- **\(x\)**: Independent variable.
- **\(y\)**: Dependent variable.

Use cases: Predicting house prices based on size, sales based on advertising spend, etc.
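
To make the roles of \(\beta_0\) and \(\beta_1\) concrete, the ordinary least-squares estimates can be computed directly from the data. The following is a minimal sketch, assuming a small made-up dataset (`x` and `y` are illustrative values, not data from this module):

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least-squares estimates for y = beta_0 + beta_1 * x:
#   beta_1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   beta_0 = y_mean - beta_1 * x_mean
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

print(f"Intercept (beta_0): {beta_0:.3f}")
print(f"Slope (beta_1): {beta_1:.3f}")
```

`LinearRegression` in scikit-learn should arrive at the same two numbers for a single predictor; the manual version only shows where they come from.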

### Multiple Linear Regression
Multiple linear regression extends simple linear regression to multiple independent variables:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \]

- Each \(\beta_i\) represents the effect of \(x_i\) on \(y\), holding other variables constant.
- Assumes linearity, independence of errors, homoscedasticity, and normality of residuals.

Use cases: Predicting house prices based on size, number of bedrooms, and location.
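
The same idea extends to several predictors by stacking them into a design matrix with a leading column of ones for the intercept. The sketch below assumes noise-free, made-up data generated from \(y = 1 + 2x_1 + 3x_2\), so the recovered coefficients are easy to check; `np.linalg.lstsq` is used here as one possible solver, not necessarily what scikit-learn does internally.

```python
import numpy as np

# Hypothetical predictors (two columns: x1, x2), for illustration only
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Target built without noise from y = 1 + 2*x1 + 3*x2
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Design matrix: a column of ones (intercept) followed by the predictors
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution for [beta_0, beta_1, beta_2]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print("beta_0, beta_1, beta_2:", np.round(beta, 6))
```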

### Python Code

```python
# Linear Regression, Evaluation, and Decision Making
# This code demonstrates simple and multiple linear regression, evaluation using
# visualization, polynomial regression, and decision making based on predictions.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Sample dataset
data = {
    'Size': [1500, 1800, 2400, 2000, 1700],            # House size in sq ft
    'Bedrooms': [3, 4, 3, 4, 2],                       # Number of bedrooms
    'Price': [300000, 350000, 400000, 380000, 320000]  # House price
}
df = pd.DataFrame(data)

# Simple Linear Regression
X_simple = df[['Size']]  # Single predictor
y = df['Price']          # Target
X_train, X_test, y_train, y_test = train_test_split(X_simple, y, test_size=0.2, random_state=42)

simple_lr = LinearRegression()
simple_lr.fit(X_train, y_train)

print("Simple Linear Regression:")
print(f"Intercept: {simple_lr.intercept_:.2f}")
print(f"Coefficient: {simple_lr.coef_[0]:.2f}")

# Multiple Linear Regression
X_multiple = df[['Size', 'Bedrooms']]  # Multiple predictors
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_multiple, y, test_size=0.2, random_state=42)

multiple_lr = LinearRegression()
multiple_lr.fit(X_train_m, y_train_m)

print("\nMultiple Linear Regression:")
print(f"Intercept: {multiple_lr.intercept_:.2f}")
print(f"Coefficients: Size = {multiple_lr.coef_[0]:.2f}, Bedrooms = {multiple_lr.coef_[1]:.2f}")
```

**Sample Output**:
```
Simple Linear Regression:
Intercept: 126956.52
Coefficient: 117.39

Multiple Linear Regression:
Intercept: 102771.08
Coefficients: Size = 108.43, Bedrooms = 13734.94
```

**Explanation**:
- **Simple Linear Regression**: Uses `Size` to predict `Price`. The model learns the intercept and slope; here each additional square foot adds about 117 to the predicted price.
- **Multiple Linear Regression**: Uses `Size` and `Bedrooms`. Each coefficient indicates the effect of that variable on the predicted price with the other held constant.
- `train_test_split`: Splits data into training (80%) and testing (20%) sets to evaluate performance on unseen data. With only five rows, this leaves four rows for training and a single row for testing, so the test-set results in this module are illustrative rather than reliable.

## 2. Evaluation Using Visualization

Visualizing model performance helps assess how well predictions align with actual values and identify patterns or issues (e.g., non-linearity, outliers).

### Common Visualizations
- **Scatter Plot with Regression Line**: For simple linear regression, plot data points and the fitted line.
- **Residual Plot**: Plot residuals (actual - predicted) vs. predicted values to check for randomness (no patterns should exist).
- **Prediction vs. Actual Plot**: Scatter plot of predicted vs. actual values (should ideally lie along the line \(y = x\)).

### Python Code

```python
import matplotlib.pyplot as plt

# Simple Linear Regression Visualization
y_pred_simple = simple_lr.predict(X_test)

# Draw the fitted line over the full range of sizes. The test set holds a
# single row here, and one point is not enough to draw a line.
X_line = X_simple.sort_values('Size')
y_line = simple_lr.predict(X_line)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_train['Size'], y_train, color='gray', label='Training data')
plt.scatter(X_test['Size'], y_test, color='blue', label='Test data')
plt.plot(X_line['Size'], y_line, color='red', label='Fitted Line')
plt.xlabel('Size')
plt.ylabel('Price')
plt.title('Simple Linear Regression')
plt.legend()

# Residual Plot (actual - predicted on the test set)
residuals = y_test - y_pred_simple
plt.subplot(1, 2, 2)
plt.scatter(y_pred_simple, residuals, color='purple')
plt.axhline(0, color='black', linestyle='--')
plt.xlabel('Predicted Price')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()

# Prediction vs Actual for Multiple Linear Regression
y_pred_multiple = multiple_lr.predict(X_test_m)
plt.figure(figsize=(6, 6))
plt.scatter(y_test_m, y_pred_multiple, color='green')
# Reference diagonal y = x, drawn over the full observed price range
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted (Multiple LR)')
plt.show()
```

**Explanation**:
- **Scatter with Regression Line**: Shows how well the simple linear model fits the test data.
- **Residual Plot**: Residuals should be randomly scattered around zero. Patterns suggest the model misses some structure (e.g., non-linearity).
- **Actual vs. Predicted**: Points close to the diagonal line indicate good predictions.

## 3. Polynomial Regression and Pipelines

### Polynomial Regression
Linear regression assumes a linear relationship, but many relationships are non-linear. Polynomial regression models higher-degree relationships:

\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n \]

- Use `PolynomialFeatures` in scikit-learn to transform features into polynomial terms.
- Fit a linear regression model on the transformed features.

### Pipelines
Pipelines streamline preprocessing and modeling by chaining steps (e.g., scaling, polynomial transformation, regression) into a single object. This ensures consistency and prevents data leakage.
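
One way to see the leakage point: when a pipeline is fitted, each preprocessing step learns its parameters from the training data only, and those same parameters are reused at prediction time. The sketch below assumes small made-up arrays (`X_train_demo`, `y_train_demo`, `X_test_demo` are hypothetical names) and inspects the scaler inside the fitted pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data, for illustration only (y = 2 * x)
X_train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train_demo = np.array([2.0, 4.0, 6.0, 8.0])
X_test_demo = np.array([[10.0]])

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train_demo, y_train_demo)

# make_pipeline names each step after its lowercased class name
scaler = pipe.named_steps['standardscaler']

# The scaler's mean comes from the training rows only; the test row
# (10.0) played no part in it.
print("Scaler mean learned from training data:", scaler.mean_)
print("Prediction for the test row:", pipe.predict(X_test_demo))
```

Scaling the full dataset before splitting would instead let the test rows influence the mean and standard deviation, which is the leakage the pipeline avoids.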


### Python Code

```python
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline

# Polynomial Regression
degree = 2
polyreg = make_pipeline(PolynomialFeatures(degree), StandardScaler(), LinearRegression())
polyreg.fit(X_train, y_train)

# Predictions over an evenly spaced grid spanning the observed sizes.
# The grid is a DataFrame with the same column name used in training;
# the single-row test set would give a grid with no width.
X_range = pd.DataFrame({'Size': np.linspace(df['Size'].min(), df['Size'].max(), 100)})
y_range_pred = polyreg.predict(X_range)

# Visualization
plt.figure(figsize=(8, 6))
plt.scatter(df['Size'], df['Price'], color='blue', label='Actual')
plt.plot(X_range['Size'], y_range_pred, color='red', label='Polynomial Fit (degree=2)')
plt.xlabel('Size')
plt.ylabel('Price')
plt.title('Polynomial Regression')
plt.legend()
plt.show()

# Pipeline Example (Multiple Linear Regression with Scaling)
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train_m, y_train_m)
y_pred_pipeline = pipeline.predict(X_test_m)
print("Pipeline Predictions:", y_pred_pipeline)
```


**Explanation**:
- **Polynomial Regression**: `PolynomialFeatures` creates terms like \(x\), \(x^2\). The pipeline scales features and fits a linear model.
- **Pipelines**: Combine `StandardScaler` and `LinearRegression` to preprocess and model in one step. This is especially useful for complex workflows.
- **Visualization**: The polynomial fit curves to capture non-linear patterns, unlike the straight line of simple linear regression.





## 4. R-squared and MSE for In-Sample Evaluation

### R-squared (\(R^2\))
- Measures the proportion of variance in the dependent variable explained by the model.
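- Range: at most 1 (higher is better; 1 = perfect fit). For a least-squares model with an intercept it lies between 0 and 1 on the training data, but on test data it can be negative when the model predicts worse than the mean of the actual values.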
- Formula: \( R^2 = 1 - \frac{\text{SSR}}{\text{SST}} \), where SSR is the sum of squared residuals, and SST is the total sum of squares.

### Mean Squared Error (MSE)
- Measures the average squared difference between actual and predicted values.
- Lower is better; sensitive to outliers.
- Formula: \( \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \).
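
Both formulas are short enough to evaluate by hand. The sketch below assumes made-up actual and predicted values (`y_true` and `y_hat` are illustrative) and checks the manual results against `r2_score` and `mean_squared_error` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# SSR: sum of squared residuals; SST: total sum of squares around the mean
ssr = np.sum((y_true - y_hat) ** 2)
sst = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ssr / sst
mse_manual = ssr / len(y_true)

print(f"Manual R^2: {r2_manual:.4f}, sklearn R^2: {r2_score(y_true, y_hat):.4f}")
print(f"Manual MSE: {mse_manual:.4f}, sklearn MSE: {mean_squared_error(y_true, y_hat):.4f}")
```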

### Python Code

```python
from sklearn.metrics import r2_score, mean_squared_error

# Simple Linear Regression Evaluation
y_pred_train_simple = simple_lr.predict(X_train)
y_pred_test_simple = simple_lr.predict(X_test)

print("Simple Linear Regression:")
print(f"Training R^2: {r2_score(y_train, y_pred_train_simple):.4f}")
print(f"Test R^2: {r2_score(y_test, y_pred_test_simple):.4f}")
print(f"Training MSE: {mean_squared_error(y_train, y_pred_train_simple):.2f}")
print(f"Test MSE: {mean_squared_error(y_test, y_pred_test_simple):.2f}")

# Multiple Linear Regression Evaluation
y_pred_train_multiple = multiple_lr.predict(X_train_m)
y_pred_test_multiple = multiple_lr.predict(X_test_m)

print("\nMultiple Linear Regression:")
print(f"Training R^2: {r2_score(y_train_m, y_pred_train_multiple):.4f}")
print(f"Test R^2: {r2_score(y_test_m, y_pred_test_multiple):.4f}")
print(f"Training MSE: {mean_squared_error(y_train_m, y_pred_train_multiple):.2f}")
print(f"Test MSE: {mean_squared_error(y_test_m, y_pred_test_multiple):.2f}")
```

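**Sample Output**:
```
Simple Linear Regression:
Training R^2: 0.9322
Test R^2: nan
Training MSE: 115217391.30
Test MSE: 137807183.36

Multiple Linear Regression:
Training R^2: 0.9823
Test R^2: nan
Training MSE: 30120481.93
Test MSE: 8361155.47
```

**Explanation**:
- **R^2**: Higher values indicate better fit. A test \(R^2\) well below the training \(R^2\) suggests possible overfitting.
- **Test R^2 is `nan` here**: The five-row dataset leaves a single test row, so the total sum of squares is zero and \(R^2\) is undefined; scikit-learn returns `nan` and emits a warning. A meaningful test \(R^2\) needs at least two test rows, and in practice many more.
- **MSE**: Lower values indicate better accuracy. Compare training and test MSE to assess generalization; with one test row, the test MSE is just one squared error and should not be over-interpreted.
- Adding predictors never lowers the training \(R^2\) of a least-squares model, so the multiple regression fits the training data at least as well as the simple one. Whether it also predicts better on unseen data has to be checked on a test set.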

---

## 5. Prediction and Decision Making

### Prediction
Once a model is trained, it can predict outcomes for new data. Predictions are made using the `predict` method, and the results are interpreted in the context of the problem.

### Decision Making
- **Interpret Coefficients**: In linear regression, each coefficient gives the change in the predicted target for a one-unit increase in that feature, holding the other features constant. For example, a coefficient of 100 for `Size` means each additional square foot raises the predicted price by 100.
- **Evaluate Trade-offs**: Use predictions to compare scenarios (e.g., is a larger house worth the price?).
- **Model Selection**: Choose the model (simple, multiple, or polynomial) based on evaluation metrics, interpretability, and problem requirements.
- **Uncertainty**: Consider prediction intervals or model limitations (e.g., extrapolation beyond training data).
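
The extrapolation caveat can be checked mechanically before trusting a prediction. The sketch below is one possible approach, not a standard API: `flag_extrapolation` is a hypothetical helper, and the two DataFrames hold illustrative values. It flags any new row in which at least one feature falls outside the range seen during training.

```python
import pandas as pd

# Illustrative training features and new inputs
train_features = pd.DataFrame({
    'Size': [1500, 1700, 2000, 2400],
    'Bedrooms': [3, 2, 4, 3],
})
new_inputs = pd.DataFrame({'Size': [1900, 2500], 'Bedrooms': [3, 4]})


def flag_extrapolation(train, new):
    """Return a boolean Series: True where any feature is outside the training range."""
    below = new.lt(train.min(), axis=1)  # compare each column with its training minimum
    above = new.gt(train.max(), axis=1)  # compare each column with its training maximum
    return (below | above).any(axis=1)


flags = flag_extrapolation(train_features, new_inputs)
for i, is_outside in enumerate(flags):
    status = "outside training range" if is_outside else "within training range"
    print(f"Row {i}: {status}")
```

A flagged row does not mean the prediction is wrong, only that the model has no nearby training examples to support it. Note that this per-feature check is a rough screen: a row can sit inside every individual range and still be an unusual combination of features.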

### Python Code

```python
# New data for prediction
new_data = pd.DataFrame({'Size': [1900, 2500], 'Bedrooms': [3, 4]})

# Predict using Multiple Linear Regression
predictions = multiple_lr.predict(new_data)
print("Predictions for new houses:")
for i, pred in enumerate(predictions):
    print(f"House {i+1} (Size={new_data['Size'][i]}, Bedrooms={new_data['Bedrooms'][i]}): ${pred:.2f}")

# Decision Making Example
# Compare two houses based on predicted price and other factors
if predictions[0] < predictions[1]:
    print("House 1 is cheaper. Consider if the extra bedroom in House 2 is worth the price difference.")
else:
    print("House 2 is cheaper or equal. It may be a better deal with more bedrooms.")
```

**Sample Output**:
```
Predictions for new houses:
House 1 (Size=1900, Bedrooms=3): $350000.00
House 2 (Size=2500, Bedrooms=4): $428795.18
House 1 is cheaper. Consider if the extra bedroom in House 2 is worth the price difference.
```

**Explanation**:
- **Prediction**: The model predicts prices for new houses based on `Size` and `Bedrooms`.
- **Decision Making**: Predictions inform choices (e.g., which house to buy). Incorporate domain knowledge (e.g., budget, location) for final decisions.
- **Coefficients**: In the multiple regression model, `Size` and `Bedrooms` coefficients help quantify their impact on price.


## Summary
- **Simple Linear Regression**: Models one predictor; easy to interpret but limited.
- **Multiple Linear Regression**: Handles multiple predictors; more flexible but requires careful feature selection.
- **Visualization**: Scatter plots, residual plots, and actual vs. predicted plots reveal model fit and issues.
- **Polynomial Regression**: Captures non-linear relationships; pipelines simplify preprocessing and modeling.
- **R-squared and MSE**: Quantify model performance; compare training and test metrics to assess generalization.
- **Prediction and Decision Making**: Use models to predict outcomes and guide decisions, considering coefficients and context.

### Key Considerations
- **Assumptions**: Linear regression assumes linearity, independence of errors, homoscedasticity, and normality of residuals. Check residual plots to validate.
- **Overfitting**: Polynomial regression or many predictors can overfit; use test metrics to detect.
- **Feature Engineering**: Data preparation (e.g., normalization, indicator variables) is critical for model performance.
- **Model Selection**: Balance complexity (simple vs. polynomial) with performance and interpretability.

This code uses `scikit-learn` and `matplotlib` for modeling and visualization.