# Linear Regression: Salary Prediction

This notebook demonstrates the full workflow for fitting an ordinary least squares model to the salary dataset contained in this module. We will focus on sound exploratory data analysis, statistical validation, and reusable training code that mirrors the `src/` implementation.

## 1. Mathematical refresher

Ordinary least squares finds coefficients $\beta$ that minimise the residual sum of squares: $\min_\beta \|y - X\beta\|^2$. When $X$ has full column rank the solution admits a closed form, $\hat{\beta} = (X^\top X)^{-1} X^\top y$. The residual diagnostics that follow will help us check linearity, homoscedasticity, and normality assumptions.
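
As a minimal sketch of the closed form above, the snippet below solves the normal equations with NumPy on a tiny synthetic dataset. The numbers are made up for illustration and are not taken from `salary_data.csv`. It calls `np.linalg.solve` on $X^\top X \beta = X^\top y$ rather than forming the inverse explicitly, which gives the same solution with better numerical behaviour, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic, noise-free data for illustration only: y = 3 + 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 + 2.0 * x

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) beta = X^T y instead of inverting X^T X explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)  # intercept first, then slope
```

Because the synthetic data contain no noise, the recovered intercept and slope should match the generating values up to floating-point error.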

## 2. Dataset and experiment plan

We work with `salary_data.csv`, a compact example that maps years of professional experience to annual salary. Even though the dataset is tiny, it is perfect for illustrating:

- deterministic train/validation splits for reproducibility;
- feature scaling inside a pipeline so training and inference use the same logic;
- lightweight evaluation metrics that we can compare with the numbers emitted by `src/train.py`.

> **Tip:** Notebook experiments should mirror the production configuration. The helper utilities in `src/` will ensure that any improvements made here (e.g. switching to Ridge regression) can be ported back with minimal code changes.


In [None]:
# Core libraries
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

DATA_PATH = Path('..') / 'data' / 'salary_data.csv'
DATA_PATH.resolve()

The pipeline components we import here match the definitions in `src/pipeline.py`. Keeping both sides aligned ensures that the experiment you run locally follows the same code path that the automated training routine executes.

In [None]:
# Load data
df = pd.read_csv(DATA_PATH)
df.head()

The raw dataset contains only two columns, so exploratory work focuses on spotting outliers and understanding the linear relationship. With larger datasets you would extend this section to include correlation matrices, missing-value audits, and feature engineering sketches.

In [None]:
# Exploratory plots
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.scatterplot(data=df, x='YearsExperience', y='Salary', ax=axes[0])
axes[0].set_title('Salary vs. Experience')
sns.histplot(df['Salary'], kde=True, ax=axes[1])
axes[1].set_title('Salary distribution')
plt.tight_layout()
plt.show()

We reserve 20% of the observations for validation. Because the dataset is tiny, consider running multiple random seeds or K-fold cross-validation when you expand the module; the helper functions in `src/data.py` can be adapted accordingly.

In [None]:
# Train / validation split
X = df[['YearsExperience']]
y = df['Salary']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_val.shape
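
A single 80/20 split of such a small dataset gives a noisy estimate, so the K-fold idea mentioned above could be sketched as follows. The data here are a synthetic stand-in: the column names match the notebook, but the values are generated rather than read from `salary_data.csv`, and the fold count and seed are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the salary data (values are illustrative only)
rng = np.random.default_rng(42)
years = np.linspace(1.0, 10.0, 30)
salary = 25_000 + 9_500 * years + rng.normal(0.0, 5_000, size=years.size)
demo_df = pd.DataFrame({'YearsExperience': years, 'Salary': salary})

X_demo = demo_df[['YearsExperience']]
y_demo = demo_df['Salary']

# Scaling lives inside the pipeline, so it is refit on each training fold
model = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
])

# Shuffled 5-fold split with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring='neg_root_mean_squared_error'
)

rmse_per_fold = -scores  # the scorer returns negated RMSE
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

Reporting the mean and spread of the per-fold RMSE gives a more stable picture than one validation score, at the cost of fitting the model once per fold.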

In [None]:
# Build the same pipeline used in src/pipeline.py
numeric_features = ['YearsExperience']
numeric_pipeline = Pipeline([
    ('scaler', StandardScaler()),
])
preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, numeric_features),
])
regression_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression()),
])
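# Fit on the training split, then score on the held-out validation split
regression_pipeline.fit(X_train, y_train)
y_pred = regression_pipeline.predict(X_val)
r2 = r2_score(y_val, y_pred)
# RMSE via np.sqrt: the `squared=False` argument is deprecated since scikit-learn 1.4
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
mae = mean_absolute_error(y_val, y_pred)
{'r2': r2, 'rmse': rmse, 'mae': mae}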

The metrics dictionary should align with the contents of `artifacts/metrics.json` generated by the Python training script. Use this parity check to ensure the notebook and pipeline stay in sync whenever you tweak preprocessing or model choices.
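
One way to automate this parity check is sketched below. The helper name `find_metric_drift`, the tolerance, and the assumption that the metrics file is a flat JSON object with the keys `r2`, `rmse`, and `mae` are all hypothetical; adjust them to whatever `src/train.py` actually writes. The example values are placeholders, not real results.

```python
import json
import math
import tempfile
from pathlib import Path


def find_metric_drift(notebook_metrics, metrics_path, rel_tol=1e-6):
    """Return the metric names whose values differ from the JSON file.

    Assumes the file holds a flat JSON object of metric name -> number
    (a hypothetical layout; the real artifacts/metrics.json may differ).
    """
    reference = json.loads(Path(metrics_path).read_text())
    drifted = []
    for name, value in notebook_metrics.items():
        missing = name not in reference
        if missing or not math.isclose(value, reference[name], rel_tol=rel_tol):
            drifted.append(name)
    return drifted


# Demonstration with placeholder numbers written to a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'metrics.json'
    demo_path.write_text(json.dumps({'r2': 0.95, 'rmse': 6000.0, 'mae': 5000.0}))

    same = find_metric_drift({'r2': 0.95, 'rmse': 6000.0, 'mae': 5000.0}, demo_path)
    changed = find_metric_drift({'r2': 0.90, 'rmse': 6000.0, 'mae': 5000.0}, demo_path)

print(same, changed)
```

In the notebook, the first argument would be the metrics dictionary from the previous cell and the second the path to the file written by the training script. An empty list means the two sides agree within the tolerance.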

In [None]:
# Residual diagnostics
residuals = y_val - y_pred
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.scatterplot(x=y_pred, y=residuals, ax=axes[0])
axes[0].axhline(0.0, color='black', linestyle='--')
axes[0].set_title('Residuals vs. Fitted')
sns.histplot(residuals, kde=True, ax=axes[1])
axes[1].set_title('Residual distribution')
plt.tight_layout()
plt.show()
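
The plots can be complemented with a few numbers. The sketch below is one possible summary and is not part of `src/`: `residual_summary` is a hypothetical helper, and the inputs are synthetic. It reports the mean and standard deviation of the residuals, the sample skewness as a rough symmetry check, and the correlation between absolute residuals and fitted values as a crude indicator of non-constant variance.

```python
import numpy as np


def residual_summary(y_true, y_fitted):
    """Summarise residuals with a few simple statistics (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_fitted = np.asarray(y_fitted, dtype=float)
    residuals = y_true - y_fitted

    centred = residuals - residuals.mean()
    std = residuals.std()
    # Sample skewness: third standardised moment (0 for a symmetric distribution)
    skewness = float(np.mean(centred ** 3) / std ** 3) if std > 0 else 0.0
    # Correlation of |residual| with fitted values: far from 0 hints at non-constant variance
    spread_corr = float(np.corrcoef(np.abs(residuals), y_fitted)[0, 1])

    return {
        'mean': float(residuals.mean()),
        'std': float(std),
        'skewness': skewness,
        'abs_resid_vs_fitted_corr': spread_corr,
    }


# Synthetic example: fitted values plus symmetric, constant-variance noise
rng = np.random.default_rng(0)
fitted = np.linspace(40_000, 120_000, 200)
observed = fitted + rng.normal(0.0, 5_000, size=fitted.size)
summary = residual_summary(observed, fitted)
print(summary)
```

With only a handful of validation points, as in this notebook, these statistics are very noisy, so treat them as a sanity check rather than a formal test.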

## 3. Serving the model



The notebook complements the production code in `src/`. After validating an idea here, run:



```bash
python "Supervised Learning/Linear Regression/src/train.py"
```



This command retrains the model, refreshes artifacts, and makes the FastAPI endpoint immediately pick up the latest weights via the shared registry. Start the API with:



```bash
python -m fastapi_app.main
```



You can then send a POST request to `http://localhost:8000/models/linear_regression` with a JSON payload like `{"years_experience": 6.5}` or explore the interactive docs at `/docs`.
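
A minimal client sketch using only the standard library is shown below. The URL and payload come from the paragraph above; the helper names `build_request` and `predict` are hypothetical, and nothing is assumed about the shape of the JSON the endpoint returns. The block only builds the request, so it runs without the API being up.

```python
import json
import urllib.request

# URL and payload shape taken from the text above; the response format is not assumed
ENDPOINT = 'http://localhost:8000/models/linear_regression'


def build_request(years_experience, url=ENDPOINT):
    """Build a JSON POST request for the prediction endpoint."""
    body = json.dumps({'years_experience': years_experience}).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def predict(years_experience, url=ENDPOINT, timeout=5.0):
    """Send the request and return the decoded JSON response (needs a running API)."""
    request = build_request(years_experience, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


request = build_request(6.5)
print(request.get_method(), request.full_url, request.data)
```

Once the API is running, calling `predict(6.5)` sends the request and returns whatever JSON the service responds with.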



As you replicate this structure for other algorithms, mirror the three pillars shown here: rich markdown explanations, reproducible experiments, and modular code that production services can import without drifting from the research findings.