## Student Information

**Name:** [Your Name Here]  
**Roll Number:** [Your Roll Number]  
**Date:** [Submission Date]  
**College:** [Your College Name]  
**Academic Year:** 2025-2026  

---

## Learning Outcomes Checklist

- [ ] LO1: Understand linear regression theory
- [ ] LO2: Implement Simple Linear Regression
- [ ] LO3: Implement Multiple Linear Regression
- [ ] LO4: Analyze regression coefficients
- [ ] LO5: Calculate evaluation metrics (MSE, RMSE, MAE, R²)
- [ ] LO6: Perform residual analysis
- [ ] LO7: Detect overfitting/underfitting
- [ ] LO8: Apply feature scaling
- [ ] LO9: Visualize regression results
- [ ] LO10: Compare simple vs multiple regression

## Phase 0: Environment Setup & Library Imports

In [None]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score)
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Print versions
print("📚 LIBRARY VERSIONS:")
print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")
print(f"Scikit-learn: {__import__('sklearn').__version__}")
print("\n✅ All libraries imported successfully!")

# Configure display
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

## Phase 1: Dataset Creation & Exploration

Create a regression dataset for model training.

In [None]:
# Create regression dataset
from sklearn.datasets import load_diabetes

print("📊 CREATING REGRESSION DATASET:\n")

# Load Diabetes dataset
X, y = load_diabetes(return_X_y=True)
feature_names = load_diabetes().feature_names

# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['Target'] = y

print(f"Dataset shape: {df.shape}")
print(f"Features: {list(feature_names)[:5]}... (10 total)")
print(f"Target: Diabetes disease progression")
print(f"\nFirst 5 rows:")
print(df.head())
print(f"\nDataset statistics:")
print(df[['age', 'sex', 'bmi', 'Target']].describe())

In [None]:
# Data exploration
print("📈 DATA EXPLORATION:\n")

print(f"Missing values: {df.isnull().sum().sum()}")
print(f"Data shape: {df.shape}")
print(f"\nTarget variable statistics:")
print(f"Mean: {df['Target'].mean():.2f}")
print(f"Std: {df['Target'].std():.2f}")
print(f"Min: {df['Target'].min():.2f}")
print(f"Max: {df['Target'].max():.2f}")

In [None]:
# Visualize target distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Histogram
axes[0].hist(df['Target'], bins=30, color='steelblue', edgecolor='black')
axes[0].set_xlabel('Target Value')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Target Distribution')

# Box plot
axes[1].boxplot(df['Target'])
axes[1].set_ylabel('Target Value')
axes[1].set_title('Target Box Plot')

plt.tight_layout()
plt.show()

print("✅ Target distribution visualized!")

## Phase 2: Correlation Analysis

Analyze relationships between features and target.

In [None]:
# Calculate correlations
print("📊 CORRELATION ANALYSIS:\n")

correlation_with_target = df.corr()['Target'].sort_values(ascending=False)
print("Correlation with Target (Top 10):")
print(correlation_with_target.head(10))

# Visualize correlations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot of correlations
top_features = correlation_with_target[1:6]  # Exclude Target itself
axes[0].barh(range(len(top_features)), top_features.values)
axes[0].set_yticks(range(len(top_features)))
axes[0].set_yticklabels(top_features.index)
axes[0].set_xlabel('Correlation Coefficient')
axes[0].set_title('Top 5 Features by Correlation')

# Scatter plot: BMI vs Target
axes[1].scatter(df['bmi'], df['Target'], alpha=0.5)
axes[1].set_xlabel('BMI')
axes[1].set_ylabel('Target')
axes[1].set_title('BMI vs Target (Strongest Correlation)')

plt.tight_layout()
plt.show()

print("\n✅ Correlation analysis completed!")
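
The correlations above are Pearson coefficients, which is the default for `df.corr()`. As a rough illustration of what that number measures, the sketch below computes Pearson's r by hand and checks that, for a one-feature least-squares line, the training R² equals r squared. The arrays are hypothetical toy data, not the diabetes dataset, and `pearson_r` is an illustrative helper name.

```python
import numpy as np

# Hypothetical toy data, not the diabetes dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.5, 3.1, 2.9, 5.2, 5.8, 7.1])

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

r = pearson_r(x, y)

# Fit a one-feature least-squares line and compute its R² on the same data
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"Pearson r: {r:.4f}")
print(f"r squared: {r ** 2:.4f}")
print(f"R² of the fitted line (training data): {r2:.4f}")
```

This identity holds only on the data used for fitting and only for a single feature with an intercept; the test-set R² of the simple model can differ from it.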

## Phase 3: Data Preparation & Train-Test Split

Prepare data for regression model training.

In [None]:
# Prepare data
print("🔀 DATA PREPARATION:\n")

X = df.drop('Target', axis=1)
y = df['Target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print(f"Number of features: {X_train.shape[1]}")
print(f"\n✅ Data split successfully!")

In [None]:
# Feature scaling
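print("\n📏 FEATURE SCALING:\n")

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# mean(axis=0) returns one value per feature, so round the array instead of using :.4f
print(f"Training data - Mean per feature: {np.round(X_train_scaled.mean(axis=0), 4)}")
print(f"Training data - Std per feature: {np.round(X_train_scaled.std(axis=0), 4)}")
print(f"\n✅ Features scaled successfully!")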
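
To make the scaling step concrete, here is a minimal NumPy sketch of the standardization that `StandardScaler` performs: subtract the training mean and divide by the training standard deviation (population form, `ddof=0`). The arrays are hypothetical toy data. The point it illustrates is that the test split is transformed with the training statistics, so its mean is generally not exactly zero.

```python
import numpy as np

# Hypothetical toy data, not the diabetes dataset
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0], [4.0, 30.0]])
test = np.array([[5.0, 40.0], [6.0, 10.0]])

# Statistics come from the training split only (population std, ddof=0)
mu = train.mean(axis=0)
sigma = train.std(axis=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuse training statistics, do not refit

print("Train mean per feature:", np.round(train_scaled.mean(axis=0), 4))
print("Train std per feature: ", np.round(train_scaled.std(axis=0), 4))
print("Test mean per feature: ", np.round(test_scaled.mean(axis=0), 4))
```

Refitting the scaler on the test split would leak information about the test data into preprocessing, which is why the notebook calls `fit_transform` on the training data and only `transform` on the test data.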

## Phase 4: Simple Linear Regression

Implement regression with a single feature.

In [None]:
# Simple Linear Regression (using BMI only)
print("📈 SIMPLE LINEAR REGRESSION:\n")

X_train_simple = X_train[['bmi']].values
X_test_simple = X_test[['bmi']].values

# Scale the feature
scaler_simple = StandardScaler()
X_train_simple_scaled = scaler_simple.fit_transform(X_train_simple)
X_test_simple_scaled = scaler_simple.transform(X_test_simple)

# Train simple model
simple_model = LinearRegression()
simple_model.fit(X_train_simple_scaled, y_train)

# Make predictions
y_pred_simple = simple_model.predict(X_test_simple_scaled)

# Calculate metrics
simple_mse = mean_squared_error(y_test, y_pred_simple)
simple_rmse = np.sqrt(simple_mse)
simple_mae = mean_absolute_error(y_test, y_pred_simple)
simple_r2 = r2_score(y_test, y_pred_simple)

print(f"Simple Linear Regression Results:")
print(f"Coefficient (slope): {simple_model.coef_[0]:.4f}")
print(f"Intercept (bias): {simple_model.intercept_:.4f}")
print(f"Equation: y = {simple_model.coef_[0]:.4f} * x + {simple_model.intercept_:.4f}")
print(f"\nMetrics:")
print(f"  MSE: {simple_mse:.4f}")
print(f"  RMSE: {simple_rmse:.4f}")
print(f"  MAE: {simple_mae:.4f}")
print(f"  R² Score: {simple_r2:.4f}")
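
For a single feature, the slope and intercept reported above have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept places the line through the point of means. The sketch below, on hypothetical toy data with an illustrative helper `fit_simple_ols`, cross-checks this against `np.polyfit` and shows what standardizing the feature does to the two parameters.

```python
import numpy as np

# Hypothetical toy data, not the diabetes dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def fit_simple_ols(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

slope, intercept = fit_simple_ols(x, y)

# Cross-check against NumPy's degree-1 polynomial fit
ref_slope, ref_intercept = np.polyfit(x, y, deg=1)

# Standardize x: the slope is rescaled by std(x) and the intercept becomes mean(y)
x_std = (x - x.mean()) / x.std()
slope_std, intercept_std = fit_simple_ols(x_std, y)

print(f"Raw feature:          slope={slope:.4f}, intercept={intercept:.4f}")
print(f"Standardized feature: slope={slope_std:.4f}, intercept={intercept_std:.4f}")
```

Because the notebook fits on a standardized BMI column, its slope should be read as the change in the target per one standard deviation of BMI, and its intercept as the mean of the training target.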

In [None]:
# Visualize Simple Linear Regression
plt.figure(figsize=(12, 5))

# Scatter plot with regression line
plt.subplot(1, 2, 1)
plt.scatter(X_test_simple, y_test, alpha=0.6, label='Actual')
plt.scatter(X_test_simple, y_pred_simple, alpha=0.6, color='red', label='Predicted')
plt.xlabel('BMI')
plt.ylabel('Target')
plt.title('Simple Linear Regression: BMI vs Target')
plt.legend()
plt.grid(True, alpha=0.3)

# Residuals plot
residuals_simple = y_test - y_pred_simple
plt.subplot(1, 2, 2)
plt.scatter(y_pred_simple, residuals_simple, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals Plot')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Simple regression visualization completed!")

## Phase 5: Multiple Linear Regression

Implement regression using all features.

In [None]:
# Multiple Linear Regression (using all features)
print("📈 MULTIPLE LINEAR REGRESSION:\n")

# Train model on scaled data
multiple_model = LinearRegression()
multiple_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_train = multiple_model.predict(X_train_scaled)
y_pred_test = multiple_model.predict(X_test_scaled)

# Calculate metrics
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
rmse_test = np.sqrt(mse_test)
mae_test = mean_absolute_error(y_test, y_pred_test)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

print(f"Multiple Linear Regression Results:")
print(f"\nTraining Metrics:")
print(f"  MSE: {mse_train:.4f}")
print(f"  R² Score: {r2_train:.4f}")
print(f"\nTesting Metrics:")
print(f"  MSE: {mse_test:.4f}")
print(f"  RMSE: {rmse_test:.4f}")
print(f"  MAE: {mae_test:.4f}")
print(f"  R² Score: {r2_test:.4f}")
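
The four metrics printed above can be written out directly from their definitions. The following is a small NumPy sketch on hypothetical values; `regression_metrics` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)                       # mean squared error
    rmse = np.sqrt(mse)                              # same units as the target
    mae = np.mean(np.abs(errors))                    # mean absolute error
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

# Hypothetical values for illustration
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Predictions worse than always guessing the mean give a negative R²
worse_than_mean = regression_metrics([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(f"R² for reversed predictions: {worse_than_mean['R2']:.4f}")
```

Note that R² is bounded above by 1 but not below by 0 on held-out data, so a negative test R² signals a poor model rather than a calculation error.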

In [None]:
# Feature importance (coefficients)
print("\n📊 FEATURE IMPORTANCE (COEFFICIENTS):\n")

coefficients = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': multiple_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print(coefficients.to_string(index=False))
print(f"\nIntercept: {multiple_model.intercept_:.4f}")

# Visualize coefficients
plt.figure(figsize=(10, 6))
colors = ['green' if x > 0 else 'red' for x in coefficients['Coefficient']]
plt.barh(coefficients['Feature'], coefficients['Coefficient'], color=colors)
plt.xlabel('Coefficient Value')
plt.title('Feature Coefficients - Multiple Linear Regression')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.show()

print("✅ Feature importance visualized!")

## Phase 6: Residual Analysis

Analyze residuals for model diagnostics.

In [None]:
# Residual analysis
print("🔍 RESIDUAL ANALYSIS:\n")

residuals = y_test - y_pred_test

print(f"Residuals Statistics:")
print(f"  Mean: {residuals.mean():.6f}")
print(f"  Std Dev: {residuals.std():.4f}")
print(f"  Min: {residuals.min():.4f}")
print(f"  Max: {residuals.max():.4f}")
print(f"\nNormality Test (Shapiro-Wilk):")
stat, p_value = stats.shapiro(residuals)
print(f"  p-value: {p_value:.6f}")
print(f"  Normal distribution: {'Yes' if p_value > 0.05 else 'No'}")

In [None]:
# Visualize residuals
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Residuals vs Predicted
axes[0, 0].scatter(y_pred_test, residuals, alpha=0.6)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Predicted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Predicted Values')
axes[0, 0].grid(True, alpha=0.3)

# Histogram of residuals
axes[0, 1].hist(residuals, bins=20, edgecolor='black')
axes[0, 1].set_xlabel('Residuals')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Residuals Distribution')

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot')
axes[1, 0].grid(True, alpha=0.3)

# Actual vs Predicted
axes[1, 1].scatter(y_test, y_pred_test, alpha=0.6)
axes[1, 1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[1, 1].set_xlabel('Actual Values')
axes[1, 1].set_ylabel('Predicted Values')
axes[1, 1].set_title('Actual vs Predicted')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Residual diagnostics visualized!")

## Phase 7: Model Comparison

Compare Simple vs Multiple Regression.

In [None]:
# Model comparison
print("📊 MODEL COMPARISON:\n")

comparison_data = [
    {
        'Model': 'Simple Linear Regression',
        'Features': 1,
        'MSE': simple_mse,
        'RMSE': simple_rmse,
        'MAE': simple_mae,
        'R² Score': simple_r2
    },
    {
        'Model': 'Multiple Linear Regression',
        'Features': X_train.shape[1],
        'MSE': mse_test,
        'RMSE': rmse_test,
        'MAE': mae_test,
        'R² Score': r2_test
    }
]

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))

print(f"\nImprovement:")
# The R² difference is an absolute gap, so report it in percentage points
print(f"  R² improvement: {(r2_test - simple_r2)*100:.2f} percentage points")
print(f"  RMSE reduction: {(simple_rmse - rmse_test)*100/simple_rmse:.2f}%")
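
When two models use different numbers of features, plain R² tends to favor the larger one, because adding a feature never lowers R² on the training data. One common correction, not used in the cells above, is adjusted R², which penalizes each extra feature. A minimal sketch with a hypothetical helper `adjusted_r2` and made-up inputs:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R²: 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("Need more samples than features plus one.")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Hypothetical numbers: the same raw R² with 1 feature versus 10 features
one_feature = adjusted_r2(0.5, n_samples=101, n_features=1)
ten_features = adjusted_r2(0.5, n_samples=101, n_features=10)

print(f"Adjusted R² with 1 feature:   {one_feature:.4f}")
print(f"Adjusted R² with 10 features: {ten_features:.4f}")
```

In this notebook the helper could be called with the training-set R² of each model, the number of training rows, and 1 or `X_train.shape[1]` as the feature count.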

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

models = ['Simple', 'Multiple']
rmse_values = [simple_rmse, rmse_test]
r2_values = [simple_r2, r2_test]

# RMSE comparison
axes[0].bar(models, rmse_values, color=['orange', 'green'])
axes[0].set_ylabel('RMSE')
axes[0].set_title('Root Mean Squared Error Comparison')
axes[0].grid(True, alpha=0.3, axis='y')

# R² comparison
axes[1].bar(models, r2_values, color=['orange', 'green'])
axes[1].set_ylabel('R² Score')
axes[1].set_title('R² Score Comparison')
axes[1].set_ylim([0, 1])
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("✅ Model comparison visualized!")

## Phase 8: Overfitting/Underfitting Analysis

Detect and analyze overfitting or underfitting.

In [None]:
# Check for overfitting/underfitting
print("🔍 OVERFITTING/UNDERFITTING ANALYSIS:\n")

train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_test

print(f"Training R² Score: {train_r2:.4f}")
print(f"Testing R² Score: {test_r2:.4f}")
print(f"Difference: {train_r2 - test_r2:.4f}")

# The thresholds below are rules of thumb, not statistical tests
if abs(train_r2 - test_r2) < 0.05:
    status = "✅ Good fit (minimal overfitting)"
elif train_r2 - test_r2 > 0.1:
    status = "⚠️ Overfitting detected (high training R², lower test R²)"
elif test_r2 < 0.5:
    status = "⚠️ Underfitting detected (both training and test R² are low)"
else:
    status = "✅ Acceptable fit"

print(f"\nModel Status: {status}")
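
A train/test gap measured on a single split can be noisy. As an alternative that the cell above does not use, k-fold cross-validation averages the gap over several splits. The sketch below hand-rolls it with NumPy on synthetic data; the helper names `r2` and `kfold_r2` and the data are hypothetical. In scikit-learn the same idea is available through `cross_validate` with `return_train_score=True`.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def kfold_r2(X, y, k=5, seed=0):
    """Return per-fold (train R², test R²) arrays for ordinary least squares."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    train_scores, test_scores = [], []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Prepend a column of ones so the model has an intercept
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        train_scores.append(r2(y[train_idx], A_train @ beta))
        test_scores.append(r2(y[test_idx], A_test @ beta))
    return np.array(train_scores), np.array(test_scores)

# Synthetic data (hypothetical): 3 features, linear signal plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

train_scores, test_scores = kfold_r2(X, y, k=5)
print(f"Mean train R²: {train_scores.mean():.4f}")
print(f"Mean test R²:  {test_scores.mean():.4f}")
print(f"Mean gap:      {(train_scores - test_scores).mean():.4f}")
```

A consistently large positive gap across folds is stronger evidence of overfitting than the same gap on one split.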

## Phase 9: Practical Tests

Complete all 5 tests to verify your learning outcomes.

In [None]:
# TEST 1: DATA PREPARATION
print("🧪 TEST 1: DATA PREPARATION")
try:
    assert X_train.shape[0] > 0, "Training set is empty!"
    assert X_test.shape[0] > 0, "Test set is empty!"
    assert X_train.shape[1] == X_test.shape[1], "Feature mismatch!"
    assert len(y_train) == X_train.shape[0], "Target mismatch!"
    print(f"✅ TEST 1 PASSED: Train {X_train.shape[0]}, Test {X_test.shape[0]}")
    test1_result = "PASSED"
except AssertionError as e:
    print(f"❌ TEST 1 FAILED: {e}")
    test1_result = "FAILED"

In [None]:
# TEST 2: SIMPLE LINEAR REGRESSION
print("\n🧪 TEST 2: SIMPLE LINEAR REGRESSION")
try:
    assert simple_r2 > 0.3, "Simple model R² too low!"
    assert len(y_pred_simple) == len(y_test), "Prediction size mismatch!"
    assert not np.isnan(simple_model.coef_[0]), "Coefficient is NaN!"
    print(f"✅ TEST 2 PASSED: R² = {simple_r2:.4f}")
    test2_result = "PASSED"
except AssertionError as e:
    print(f"❌ TEST 2 FAILED: {e}")
    test2_result = "FAILED"

In [None]:
# TEST 3: MULTIPLE LINEAR REGRESSION
print("\n🧪 TEST 3: MULTIPLE LINEAR REGRESSION")
try:
    assert r2_test > simple_r2, "Multiple model should outperform simple!"
    assert len(multiple_model.coef_) == X_train.shape[1], "Coefficient count mismatch!"
    assert r2_test > 0.4, "Model R² too low!"
    print(f"✅ TEST 3 PASSED: R² = {r2_test:.4f}")
    test3_result = "PASSED"
except AssertionError as e:
    print(f"❌ TEST 3 FAILED: {e}")
    test3_result = "FAILED"

In [None]:
# TEST 4: METRICS CALCULATION
print("\n🧪 TEST 4: METRICS CALCULATION")
try:
    assert not np.isnan(mse_test), "MSE is NaN!"
    assert not np.isnan(rmse_test), "RMSE is NaN!"
    assert not np.isnan(mae_test), "MAE is NaN!"
    assert 0 <= r2_test <= 1, "R² out of range!"
    assert rmse_test > 0, "RMSE should be positive!"
    print(f"✅ TEST 4 PASSED: All metrics valid")
    test4_result = "PASSED"
except AssertionError as e:
    print(f"❌ TEST 4 FAILED: {e}")
    test4_result = "FAILED"

In [None]:
# TEST 5: RESIDUAL ANALYSIS
print("\n🧪 TEST 5: RESIDUAL ANALYSIS")
try:
    residuals_test = y_test - y_pred_test
    assert abs(residuals_test.mean()) < 1, "Residuals should be centered near zero!"
    assert residuals_test.std() > 0, "Residuals should have variance!"
    assert len(residuals_test) == len(y_test), "Residuals length mismatch!"
    print(f"✅ TEST 5 PASSED: Residuals analyzed")
    test5_result = "PASSED"
except AssertionError as e:
    print(f"❌ TEST 5 FAILED: {e}")
    test5_result = "FAILED"

## Results Summary

### Test Results

In [None]:
# Create summary report
test_summary = pd.DataFrame([
    {'Test': 'Test 1: Data Preparation', 'Result': test1_result},
    {'Test': 'Test 2: Simple Linear Regression', 'Result': test2_result},
    {'Test': 'Test 3: Multiple Linear Regression', 'Result': test3_result},
    {'Test': 'Test 4: Metrics Calculation', 'Result': test4_result},
    {'Test': 'Test 5: Residual Analysis', 'Result': test5_result}
])

print("\n" + "="*60)
print("TEST RESULTS SUMMARY")
print("="*60)
print(test_summary.to_string(index=False))

passed = sum([1 for r in [test1_result, test2_result, test3_result, test4_result, test5_result] if r == 'PASSED'])
total = 5
print(f"\n📊 SCORE: {passed}/{total} TESTS PASSED ({passed*100/total:.0f}%)")
print("="*60)

## Reflections & Learnings

### Key Insights

1. **Simple vs Multiple Regression:**
   - Simple regression is easy to interpret but limited to one predictor
   - Multiple regression captures the combined effect of several predictors
   - More features do not always mean a better model

2. **Feature Coefficients:**
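   - A positive coefficient raises the prediction as the feature increases, with the other features held fixed
   - A negative coefficient lowers the prediction as the feature increases
   - Magnitudes are comparable across features only when the features share a scale, as they do after standardization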

3. **Residual Analysis:**
   - Residuals should be normally distributed
   - Residuals should have mean near zero
   - Patterns in residuals indicate model issues

4. **Evaluation Metrics:**
   - R² gives the proportion of target variance explained by the model
   - RMSE is in the same units as the target, which makes it easy to interpret
   - MAE is less sensitive to outliers than MSE or RMSE

5. **Overfitting/Underfitting:**
   - Gap between train and test performance indicates overfitting
   - Low performance on both indicates underfitting
   - Feature scaling makes coefficient magnitudes comparable across features; it does not change the predictions of ordinary least squares
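
The points about coefficient magnitude and feature scaling can be checked numerically. The sketch below fits ordinary least squares on raw and standardized versions of the same synthetic features (hypothetical data on deliberately different scales; `ols` is an illustrative helper). Under these assumptions the predictions match, while the coefficients change by a factor of each feature's standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features on very different scales
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, 100)

def ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, predictions)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, A @ beta

beta_raw, pred_raw = ols(X, y)

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
beta_scaled, pred_scaled = ols(X_scaled, y)

print("Raw coefficients:   ", np.round(beta_raw[1:], 4))
print("Scaled coefficients:", np.round(beta_scaled[1:], 4))
print("Same predictions:   ", np.allclose(pred_raw, pred_scaled))
```

On the raw scale the second feature looks negligible because its units are large; after standardization both features show comparable influence, which is why coefficient rankings are usually read from the scaled model.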

## Submission Checklist

Before submitting your practical, verify:

### Code Completion
- [ ] Phase 1 (Data Exploration): Completed with visualizations
- [ ] Phase 2 (Correlation): Correlation analysis done
- [ ] Phase 3 (Data Prep): Train-test split with scaling
- [ ] Phase 4 (Simple): Simple regression implemented
- [ ] Phase 5 (Multiple): Multiple regression implemented
- [ ] Phase 6 (Residuals): Residual analysis completed
- [ ] Phase 7 (Comparison): Models compared
- [ ] Phase 8 (Fit Check): Overfitting analysis done

### Test Results
- [ ] All 5 tests passed ✅
- [ ] Test output clearly visible
- [ ] Error handling demonstrated
- [ ] Performance metrics recorded

### Documentation
- [ ] Student information filled
- [ ] Learning outcomes checklist marked
- [ ] Reflections written
- [ ] Code is well-commented
- [ ] All output is visible and clear

### Files
- [ ] Notebook saved as `Practical_5_Complete_Notebook.ipynb`
- [ ] PDF exported from notebook
- [ ] No errors in notebook execution

---

**Practical Status:** Ready for submission ✅