# Lab 2 — Task 1: Simple Linear Regression (housing_median_age → median_house_value)

Author: LUV-KUSHWAHA

This notebook follows the full ML pipeline for Task 1 from the assignment:
1. Data Retrieval and Collection
2. Data Cleaning
3. Feature Design
4. Algorithm Selection
5. Loss Function Selection
6. Model Learning (Training)
7. Model Evaluation

Every code cell contains line-by-line comments explaining what each statement does so you can copy-paste directly into your assignment file.

## 1) Imports and setup

Import all libraries used in the notebook and set plotting style.

In [None]:
# Standard data science imports
import numpy as np                               # numerical operations and arrays
import pandas as pd                              # DataFrame manipulation
import matplotlib.pyplot as plt                  # plotting basic charts
import seaborn as sns                            # plot styling and the residual histogram (histplot)

from sklearn.datasets import fetch_california_housing  # load California housing data
from sklearn.model_selection import train_test_split   # split data into train/test
from sklearn.linear_model import LinearRegression      # linear regression model
from sklearn.metrics import mean_squared_error, r2_score # evaluation metrics

sns.set_theme(style='whitegrid', context='notebook') # set a pleasant plotting style (set_theme is the current name for sns.set)

## 2) Data Retrieval and Collection

Load the California Housing dataset and put it into a pandas DataFrame for easy inspection.

In [None]:
# Load the California housing dataset; the call returns a scikit-learn Bunch
housing = fetch_california_housing(as_frame=True)   # as_frame=True yields pandas objects

# Make a working DataFrame that contains features and the target
df = housing.data.copy()                            # copy feature DataFrame (avoid mutating original)
df['median_house_value'] = housing.target           # add the target column to the DataFrame

# Rename the feature column for 'HouseAge' to match assignment naming
df = df.rename(columns={'HouseAge': 'housing_median_age'})

# Quick inspection: shape and columns
print('Shape (rows, columns):', df.shape)            # number of samples and columns
print('\nColumn names:')
print(list(df.columns))                              # list column names for reference

# Display first 5 rows to understand the data
df.head()

## 3) Data Cleaning

Check for missing values and data types. Explain handling of missing values (if any).

In [None]:
# Check for missing values in each column
missing_counts = df.isnull().sum()                  # count missing values per column
print('Missing counts per column:\n', missing_counts)

# Data types
print('\nData types:')
print(df.dtypes)

# Note: The scikit-learn California housing dataset has no missing values.
# If missing values were present we could impute (e.g., with the median) or drop rows.
# Commented examples (the scikit-learn version of the data uses 'AveBedrms', not 'total_bedrooms'):
# df['AveBedrms'] = df['AveBedrms'].fillna(df['AveBedrms'].median())  # example imputation
# df = df.dropna(subset=['median_house_value'])  # example: drop rows missing the target
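
The two handling options mentioned in the comments above can be tried on a tiny frame. This is a minimal sketch on made-up data, not the real dataset: the `toy` DataFrame and its values are assumptions introduced only to show the mechanics of dropping rows with a missing target and then median-imputing a feature.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (NOT the California data)
toy = pd.DataFrame({
    'AveBedrms': [1.0, np.nan, 1.2, 0.9],
    'median_house_value': [2.5, 1.8, np.nan, 3.1],
})

# Drop rows whose target is missing: imputing a label would invent training targets
toy = toy.dropna(subset=['median_house_value']).copy()

# Fill missing feature values with the median of the remaining rows
median_bedrooms = toy['AveBedrms'].median()
toy['AveBedrms'] = toy['AveBedrms'].fillna(median_bedrooms)

print(toy)
print('Remaining missing values:', int(toy.isnull().sum().sum()))
```

Dropping before imputing means the median is computed only from rows that will actually be used for training.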

## 4) Feature Design (Task 1: single feature)

Select the single input feature requested by the assignment: `housing_median_age`.

In [None]:
# Select single feature (X) and target (y)
feature_name = 'housing_median_age'                # assignment-specified input feature
target_name = 'median_house_value'                 # label

X = df[[feature_name]]                             # keep as DataFrame (2D) for sklearn
y = df[target_name]                                # Series (1D)

print('X shape:', X.shape)                         # (n_samples, 1)
print('y shape:', y.shape)                         # (n_samples,)

Why this feature?
- The assignment requires using `housing_median_age` for Task 1.
- Intuitively, age of houses can relate to price (older neighborhoods or newer developments influence median value), though it may not be a strong predictor alone.
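
One rough way to check how strong a single feature is on its own is the Pearson correlation between feature and target. The sketch below uses synthetic stand-in arrays (`age`, `value`), which are assumptions for illustration and not the California data. For a one-feature least-squares fit with an intercept, the squared correlation equals the R² on the data the line was fitted to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (NOT the California data): ages in years and a
# weakly related, noisy target
age = rng.uniform(1, 52, size=500)
value = 2.0 + 0.01 * age + rng.normal(0.0, 1.0, size=500)

# Pearson correlation coefficient between feature and target
r = np.corrcoef(age, value)[0, 1]

print(f'Pearson r: {r:.3f}')
print(f'r squared: {r**2:.3f}')
```

In this notebook the analogous check would be `df['housing_median_age'].corr(df['median_house_value'])`.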

## 5) Algorithm selection and Loss

We choose Ordinary Least Squares Linear Regression and will evaluate using Mean Squared Error (MSE).
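
With a single feature, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below illustrates that formula and the MSE on toy data. The helper names `fit_simple_ols` and `mse` are hypothetical, and this is not how `LinearRegression` is implemented internally, though the fitted line should agree.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (slope, intercept) minimising the mean squared error of slope*x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
    # the fitted line passes through (mean_x, mean_y)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

def mse(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Toy data lying exactly on y = 2x + 1
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 2.0 * x_toy + 1.0

slope_toy, intercept_toy = fit_simple_ols(x_toy, y_toy)
print('slope:', slope_toy, 'intercept:', intercept_toy)
print('MSE on toy data:', mse(y_toy, slope_toy * x_toy + intercept_toy))
```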

In [None]:
# Instantiate the Linear Regression model (ordinary least squares)
model = LinearRegression()                         # no hyperparameters required for basic OLS

# Loss for evaluation: Mean Squared Error (MSE)
# MSE measures the average squared difference between predicted and actual values.

## 6) Model learning (training)

Split the data into train and test sets, fit the linear regression model, and print learned parameters.

In [None]:
# Split data: 80% training, 20% testing, fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Fit the linear model on the training data
model.fit(X_train, y_train)                        # learn coefficient and intercept from training data

# Extract learned parameters
slope = model.coef_[0]                             # coef_ is an array; single feature -> first element
intercept = model.intercept_                       # scalar intercept

print(f'Fitted intercept: {intercept:.4f}')
print(f'Fitted coefficient (slope): {slope:.4f}')

Interpretation of parameters
- Coefficient (slope): expected change in median_house_value for a one-year increase in housing_median_age. The scikit-learn target is expressed in units of $100,000, so multiply the slope by 100,000 to read it in dollars per year.
- Intercept: predicted median_house_value when housing_median_age = 0 (may be outside observed range; interpret with caution).
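
To make the interpretation concrete, the sketch below converts a slope and intercept into dollar terms. The values `example_intercept` and `example_slope` are hypothetical numbers chosen for illustration, not the parameters this notebook produces. The only factual assumption is that the scikit-learn target is stored in units of $100,000.

```python
# Hypothetical fitted parameters, chosen only for illustration
# (NOT the values this notebook produces)
example_intercept = 1.80      # in units of $100,000
example_slope = 0.01          # in units of $100,000 per year of age

DOLLARS_PER_UNIT = 100_000    # the scikit-learn target is stored in $100,000s

def predict_dollars(age_years):
    """Predicted median house value in dollars for a given median age in years."""
    return (example_intercept + example_slope * age_years) * DOLLARS_PER_UNIT

slope_dollars = example_slope * DOLLARS_PER_UNIT
print(f'Change per extra year of age: ${slope_dollars:,.0f}')
print(f'Prediction at age 20: ${predict_dollars(20):,.0f}')
```

With the real model, `slope` and `intercept` from the training cell would replace the hypothetical values.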

## 7) Model evaluation

Compute predictions on the test set and report MSE and R². Provide a short interpretation.

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)                      # predicted median house values for test set

# Evaluate using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)            # average squared error
r2 = r2_score(y_test, y_pred)                       # proportion of variance explained

print(f'Test Mean Squared Error (MSE): {mse:.4f}')
print(f'Test R² score: {r2:.4f}')

print('\nShort interpretation:')
print('- MSE: average squared prediction error; lower is better.')
print('- R²: fraction of variance in the target explained by the model (1 is perfect, 0 means no better than predicting the mean, negative means worse than that).')
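
To see what the two metrics measure, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses toy arrays (assumptions, not the notebook's predictions) and a hypothetical helper `r2_by_hand`. It also reports RMSE, the square root of MSE, which is in the same units as the target.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy targets and two sets of toy predictions
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_mean_toy = np.full_like(y_true_toy, y_true_toy.mean())  # always predict the mean
y_close_toy = np.array([1.1, 1.9, 3.2, 3.8])              # close to the truth

r2_mean = r2_by_hand(y_true_toy, y_mean_toy)
r2_close = r2_by_hand(y_true_toy, y_close_toy)
rmse_close = float(np.sqrt(np.mean((y_true_toy - y_close_toy) ** 2)))

print('R² when predicting the mean:', r2_mean)
print('R² for close predictions:', round(r2_close, 4))
print('RMSE for close predictions:', round(rmse_close, 4))
```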

## Extras: Visualizations

1) Scatter plot with regression line
2) Predicted vs Actual
3) Residuals distribution

In [None]:
# 1) Scatter + regression line
plt.figure(figsize=(8, 6))
plt.scatter(X_test[feature_name], y_test, color='blue', alpha=0.5, label='Actual (test)')  # actual test points

# Create evenly spaced inputs across the feature range for plotting the fitted line
# (a DataFrame with the training column name avoids scikit-learn's feature-name warning)
X_line = pd.DataFrame({feature_name: np.linspace(X[feature_name].min(), X[feature_name].max(), 200)})
y_line = model.predict(X_line)                      # model predictions on the evenly spaced inputs
plt.plot(X_line[feature_name], y_line, color='red', linewidth=2, label='Regression line')

plt.xlabel('housing_median_age')
plt.ylabel('median_house_value')
plt.title('Simple Linear Regression — housing_median_age vs median_house_value')
plt.legend()
plt.show()

# 2) Predicted vs Actual
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, color='green', alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linewidth=2)  # identity line
plt.xlabel('Actual median_house_value')
plt.ylabel('Predicted median_house_value')
plt.title('Predicted vs Actual — Test Set')
plt.show()

# 3) Residuals diagnostics
residuals = y_test - y_pred                           # errors: actual - predicted
print('\nResiduals summary:')
print(pd.Series(residuals).describe())                # show descriptive statistics

plt.figure(figsize=(8, 4))
sns.histplot(residuals, kde=True, bins=40, color='purple')
plt.xlabel('Residual (actual - predicted)')
plt.title('Residuals distribution (Test set)')
plt.show()

## Final notes and recommended write-up for the assignment

- Data Retrieval: California Housing dataset from scikit-learn; loaded into pandas DataFrame.
- Data Cleaning: No missing values present; verified data types.
- Feature Design: Single input — housing_median_age (as required by Task 1).
- Algorithm: Ordinary Least Squares Linear Regression, suitable for modeling a continuous target under an assumed linear relationship.
- Loss & Evaluation: Mean Squared Error (MSE) used for evaluation; R² provided as optional metric.
- Model Learning: 80/20 train-test split; fit model with sklearn's LinearRegression; reported slope and intercept.
- Model Evaluation: reported MSE and R², plotted regression line and residuals for diagnostics.

Interpret the coefficient and intercept in your assignment answers. Mention the linear regression assumptions (linearity, independence, homoscedasticity, normality of residuals) and whether they appear to hold based on residual plots.
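
Residual plots can be backed up with a few rough numeric checks. The sketch below is one possible set, run on synthetic residuals whose spread grows with the fitted value (an assumption made to mimic heteroscedasticity). The helper `residual_checks` is hypothetical, and these numbers are informal signals rather than formal statistical tests.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic fitted values and residuals (NOT from the notebook's model);
# the noise scale grows with the fitted value to mimic heteroscedasticity
fitted = rng.uniform(1.0, 5.0, size=2000)
resid = rng.normal(0.0, 0.2 * fitted)

def residual_checks(fitted_values, residuals):
    """Return rough diagnostics for the linear regression assumptions."""
    fitted_values = np.asarray(fitted_values, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    centered = residuals - residuals.mean()
    return {
        # should be near 0 if the model is not systematically biased
        'mean_residual': float(residuals.mean()),
        # clearly non-zero suggests the error spread changes with the prediction
        'abs_resid_vs_fitted_corr': float(np.corrcoef(fitted_values, np.abs(residuals))[0, 1]),
        # near 0 for a symmetric (e.g. normal) residual distribution
        'skewness': float(np.mean(centered ** 3) / np.std(residuals) ** 3),
    }

checks = residual_checks(fitted, resid)
for name, value in checks.items():
    print(f'{name}: {value:.3f}')
```

In this notebook the call would be `residual_checks(y_pred, residuals)` after the evaluation and residual cells have run.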

Possible extensions:
- Convert the target back to dollar units (multiply by 100,000) and update the plots and metrics for clarity.
- Split each pipeline step into its own notebook cell with a fuller markdown explanation above each code cell (assignment write-up style).
- Export a plain .py script version using the same commented lines.