# Day 6: Linear Regression

**Course**: Machine Learning with Python: From Basics to Applications  
**Objective**: Understand linear regression and apply it to predict house prices using the Boston Housing dataset.  
**Prerequisites**: Basic Python, NumPy, Pandas (Day 2), preprocessing (Days 3–5).  
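**Tools**: Pandas, scikit-learn older than 1.2 (install with `pip install pandas "scikit-learn<1.2"`).  
**Dataset**: Boston Housing dataset (available via `sklearn.datasets.load_boston`, which was deprecated in scikit-learn 1.0 and removed in 1.2, so this notebook needs an older release).  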

In this notebook, we will:  
1. Load the Boston Housing dataset.  
2. Split data into training (80%) and test (20%) sets.  
3. Scale features using `StandardScaler`.  
4. Train a linear regression model and predict on the test set.  
5. Evaluate with Mean Squared Error (MSE) and R-squared (R²).  
6. Save predictions to a CSV.  
7. Verify splits, scaling, and metrics.  

Let’s get started!

## Step 1: Import Libraries

Import Pandas for data handling and scikit-learn for dataset loading, splitting, scaling, modeling, and evaluation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Step 2: Load the Boston Housing Dataset

Load the dataset and convert it to a Pandas DataFrame for easier inspection.

In [None]:
# Load dataset
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

# Display first 5 rows
print("First 5 rows of features:")
print(X.head())

# Display dataset info
print("\nDataset shape:", X.shape)
print("Target shape:", y.shape)

**Expected Output**:  
- `X.head()` shows 13 features (e.g., `CRIM`, `RM`, `AGE`).  
- `X.shape`: (506, 13), `y.shape`: (506,).  

**Note**: The dataset has no missing values, so no imputation is needed.
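
The "no missing values" claim is easy to confirm rather than take on trust. The sketch below uses a hypothetical toy frame (the column names only mimic the Boston features, and one value is deliberately missing) to show the check; on the real data you would call the same methods on `X`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Boston dataset; one value is missing on purpose.
toy = pd.DataFrame({
    "RM": [6.5, np.nan, 5.9],
    "AGE": [65.2, 78.9, 61.1],
})

# Count missing values per column, then in total.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero on the real feature matrix means no imputation step is required.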

## Step 3: Train/Test Split

Split the data into 80% training and 20% test sets with `random_state=42`.

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify shapes
print("Training set shape (X_train, y_train):", X_train.shape, y_train.shape)
print("Test set shape (X_test, y_test):", X_test.shape, y_test.shape)

**Expected Output**:  
- Training: 404 rows, test: 102 rows (`X_train.shape` is `(404, 13)`, `X_test.shape` is `(102, 13)`).
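
The split sizes follow from simple arithmetic on the 506 rows. The sketch below assumes that the test count is rounded up and the training set takes the remainder, which matches the sizes reported for this dataset; it does not call scikit-learn itself.

```python
import math

n_samples = 506
test_size = 0.2

# Assumption: the test count is rounded up, and training gets what is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train rows:", n_train)
print("test rows:", n_test)
```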

## Step 4: Feature Scaling

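Scale all features using `StandardScaler`, fitting it on the training set only so that no test-set statistics leak into training. Plain least-squares predictions do not change under this scaling, but it puts the coefficients on a comparable scale and becomes necessary once you move to regularized or gradient-based models.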

In [None]:
# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform training data
X_train = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train, columns=boston.feature_names)

# Transform test data
X_test = scaler.transform(X_test)
X_test = pd.DataFrame(X_test, columns=boston.feature_names)

# Verify scaling
print("Training set mean (first few features):")
print(X_train.mean()[:5])
print("\nTraining set std (first few features):")
print(X_train.std()[:5])

**Expected Output**:  
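- Mean ~0, std ~1 for all features in `X_train`. The printed std is slightly above 1 (about 1.001) because pandas `.std()` uses the sample formula (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`).  
- `X_test` means and stds will not be exactly 0 and 1. This is expected, because the scaler was fit on the training data only.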
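
To make the transformation concrete, here is a minimal standard-library sketch of standardization on a single made-up feature: subtract the mean and divide by the population standard deviation. It also shows why a sample standard deviation computed afterwards comes out slightly above 1.

```python
import statistics

# Made-up values for one feature.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)  # population std (ddof=0)

# z-score: subtract the mean, divide by the population std.
scaled = [(v - mean) / pop_std for v in values]

print("mean of scaled values:", statistics.fmean(scaled))
print("population std of scaled values:", statistics.pstdev(scaled))
print("sample std of scaled values:", statistics.stdev(scaled))
```

The gap between the two standard deviations shrinks as the number of rows grows, which is why it is barely visible on a training set of several hundred rows.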

## Step 5: Train Linear Regression Model

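Fit an ordinary least squares model on the training data. The model learns one coefficient per feature plus an intercept, available afterwards as `model.coef_` and `model.intercept_`.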

In [None]:
# Initialize and train model
model = LinearRegression()
model.fit(X_train, y_train)
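
As a rough illustration of what fitting involves, the sketch below solves a least-squares problem directly with NumPy on synthetic data. The names `X_toy`, `y_toy`, `true_coef`, and `true_intercept` are hypothetical, and this is not a description of how `LinearRegression` is implemented internally; it only shows the same optimization problem being solved.

```python
import numpy as np

# Synthetic data with known coefficients, so the fit can be checked.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y_toy = X_toy @ true_coef + true_intercept + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is estimated along with the weights.
X_design = np.column_stack([X_toy, np.ones(len(X_toy))])

# Minimize the sum of squared residuals ||X_design @ w - y_toy||^2.
solution, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)
coef, intercept = solution[:-1], solution[-1]

print("estimated coefficients:", coef)
print("estimated intercept:", intercept)
```

The estimates should land close to the values used to generate the data, with small differences caused by the added noise.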

## Step 6: Predict and Evaluate

Predict on the test set and compute MSE and R².

In [None]:
# Predict on test set
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)

**Expected Output**:  
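- MSE: ~20–30, in squared target units (`random_state=42` fixes the split, so reruns give the same value).  
- R²: ~0.65–0.8, meaning the model explains roughly two thirds to three quarters of the variance in test-set prices.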
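
Both metrics are short formulas: MSE is the mean of the squared errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes them by hand on made-up numbers; the `_manual` names are hypothetical and chosen so they do not overwrite the notebook's `mse` and `r2`.

```python
# Made-up actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.0, 7.5, 10.0]

n = len(y_true)
y_mean = sum(y_true) / n

# Residual sum of squares: squared gaps between actual and predicted values.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
# Total sum of squares: squared gaps between actual values and their mean.
ss_tot = sum((t - y_mean) ** 2 for t in y_true)

mse_manual = ss_res / n
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual)
print("R²:", r2_manual)
```

An R² of 1 would mean perfect predictions, 0 means no better than always predicting the mean, and negative values are possible on a test set when the model does worse than that baseline.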

## Step 7: Save Predictions

Save actual and predicted values to a CSV.

In [None]:
# Create DataFrame with actual and predicted values
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
results.to_csv('boston_predictions.csv', index=False)

print("Predictions saved as boston_predictions.csv")
print("\nFirst 5 predictions:")
print(results.head())
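
A quick way to confirm the file was written as intended is to read it back. The sketch below does a round trip on hypothetical values standing in for `y_test` and `y_pred`, writing to a temporary directory so that it does not touch `boston_predictions.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical values standing in for y_test and y_pred.
toy_results = pd.DataFrame({
    "Actual": [24.0, 21.6, 34.7],
    "Predicted": [25.1, 20.9, 33.2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_predictions.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy_results.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print("rows:", len(reloaded))
```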

## Step 8: Verification

Verify split sizes, scaling, and metrics.

In [None]:
# Verify split sizes
print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

# Verify scaling
print("\nTraining set mean (first feature):", X_train.iloc[:, 0].mean())
print("Training set std (first feature):", X_train.iloc[:, 0].std())

# Verify metrics
print("\nMSE:", mse)
print("R²:", r2)

**Expected Output**:  
- Training: 404 rows, test: 102 rows.  
- Mean ~0, std ~1 for training features.  
- MSE and R² identical to the values printed in Step 6.

## Assignment

1. Run this notebook to train and evaluate a linear regression model on the Boston Housing dataset.  
2. Verify:  
   - Split sizes (404 train, 102 test).  
   - Features scaled (mean ~0, std ~1 in training data).  
   - Reasonable MSE (~20–30) and R² (~0.65–0.8).  
3. Save predictions to `boston_predictions.csv`.  
4. Submit a screenshot of the notebook output showing split sizes, MSE, and R².  

**Next Steps**: On Day 7, we’ll explore logistic regression for classification using the Titanic dataset.