# Linear Regression: Hands-on Notebook

**Goal:** Learn how to load data, visualise it, fit a Linear Regression model, evaluate it, and use OOP to wrap scikit-learn.

**You will:**
- Plot the relationship between hours studied and exam scores
- Fit a linear regression model using scikit-learn
- Interpret slope and intercept
- Evaluate with Mean Squared Error (MSE) and \(R^2\)
- Create a small OOP wrapper class for regression
- (Extension) Compare linear vs polynomial regression on the same data

**Dataset**: `study_scores.csv` (hours, score) — already provided in this workspace.


## 1. Setup
Run this cell to import the libraries. If any import fails, ask your teacher to help install the missing package.

In [None]:
# Numerical, data-handling, and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn: model, metrics, data splitting, and polynomial pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

print('Libraries imported successfully!')

## 2. Load the dataset
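We will load the `study_scores.csv` data (hours vs exam scores). In this cell the rows are embedded as a string and read with `pd.read_csv` through `io.StringIO`, so no file path is needed.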

In [None]:
import io
import pandas as pd

# Embedded CSV data (hours, score)
csv_data = '''hours,score
1,51.36
2,56.97
3,65.3
4,67.81
5,73.84
6,81.13
7,87.73
8,92.18
9,99.77
10,104.37
'''

df = pd.read_csv(io.StringIO(csv_data))
df.head()
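
If the CSV file is present in the workspace, it can be read from disk instead of from the embedded string. The sketch below assumes the file sits in the current working directory. The helper name `load_scores` and its fallback behaviour are illustrative assumptions, not part of the course code.

```python
import io
from pathlib import Path

import pandas as pd

# Same ten rows as the embedded data, used only when the file is missing.
FALLBACK_CSV = """hours,score
1,51.36
2,56.97
3,65.3
4,67.81
5,73.84
6,81.13
7,87.73
8,92.18
9,99.77
10,104.37
"""


def load_scores(path="study_scores.csv"):
    """Hypothetical helper: read the CSV if it exists, else use the embedded rows."""
    csv_path = Path(path)
    if csv_path.exists():
        return pd.read_csv(csv_path)
    return pd.read_csv(io.StringIO(FALLBACK_CSV))


df = load_scores()
print(df.shape)
```

Either route should give a DataFrame with the columns `hours` and `score`, so the later cells work unchanged.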

## 3. Visualise the data
Make a scatter plot to see if a straight line looks reasonable.

**Rule for this course:** When making charts, use `matplotlib` (no seaborn), create one chart per cell, and don't set specific colors.

In [None]:
plt.figure()
plt.scatter(df['hours'], df['score'])
plt.xlabel('Hours studied')
plt.ylabel('Exam score')
plt.title('Study hours vs exam score')
plt.show()

## 4. Fit a Linear Regression model
Split the data into a training set (70%) and a test set (30%), fit the model on the training set only, then inspect the slope and intercept.

In [None]:
X = df[['hours']].values  # 2D array for scikit-learn
y = df['score'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

print('Slope (coef_):', lin_reg.coef_[0])  # coef_ is an array with one value per feature
print('Intercept:', lin_reg.intercept_)
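
To interpret the two numbers: the slope is the predicted change in score for each extra hour studied, and the intercept is the predicted score at zero hours. The sketch below refits a model on all ten rows from Section 2 (no train/test split, so the values will differ slightly from the cell above) and checks that `intercept + slope * hours` gives the same answer as `predict`. The variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The ten rows from Section 2, as arrays
hours = np.arange(1, 11).reshape(-1, 1)
scores = np.array([51.36, 56.97, 65.3, 67.81, 73.84,
                   81.13, 87.73, 92.18, 99.77, 104.37])

model = LinearRegression().fit(hours, scores)
slope = model.coef_[0]
intercept = model.intercept_

# A prediction is just the straight-line equation: score = intercept + slope * hours
manual = intercept + slope * 6.5
from_model = model.predict(np.array([[6.5]]))[0]

print('Slope:', round(slope, 2), 'points per extra hour')
print('Intercept:', round(intercept, 2), 'points at 0 hours')
print('Manual:', round(manual, 2), '| predict():', round(from_model, 2))
```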

## 5. Evaluate the model
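Use Mean Squared Error (MSE) and \(R^2\) to evaluate predictions on the test set. Lower MSE is better; \(R^2\) closer to 1 is better. With only three test rows, both numbers can change a lot from one split to another.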

In [None]:
y_pred = lin_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Test MSE:', round(mse, 2))
print('Test R^2:', round(r2, 3))
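
For reference, \(R^2\) compares the model's squared error with the error of always predicting the mean: \(R^2 = 1 - SS_{res} / SS_{tot}\). A minimal sketch with made-up numbers (not the notebook's data) that checks the hand calculation against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only
y_true = np.array([60.0, 75.0, 90.0])
y_hat = np.array([62.0, 74.0, 88.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # squared error of the model: 4 + 1 + 4 = 9
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean: 225 + 0 + 225 = 450
manual_r2 = 1 - ss_res / ss_tot                 # 1 - 9/450 = 0.98

print('Manual R^2:', round(manual_r2, 3))
print('r2_score  :', round(r2_score(y_true, y_hat), 3))
```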

## 6. Visualise predictions vs actual
Plot the line of best fit on top of the scatter plot. The line was fitted on the training set only, but it is drawn here over all ten points.

In [None]:
plt.figure()
plt.scatter(X, y)
x_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
y_line = lin_reg.predict(x_line)
plt.plot(x_line, y_line)
plt.xlabel('Hours studied')
plt.ylabel('Exam score')
plt.title('Best-fit line (trained on training set) over all data')
plt.show()

## 7. Residuals (errors)
Residual = Actual − Predicted. Let's compute the residuals for the test set and calculate the Mean Squared Error by hand. It should match the value from `mean_squared_error` in Section 5.

In [None]:
residuals = y_test - y_pred
manual_mse = np.mean(residuals ** 2)
print('Test residuals:', np.round(residuals, 2))  # the test set has only 3 rows
print('Manual MSE:', round(manual_mse, 2))

## 8. Make new predictions
What score would the model predict for a student who studies 6.5 hours? Try other values too.

In [None]:
new_hours = np.array([[6.5]])
predicted_score = lin_reg.predict(new_hours)
print('Predicted score for 6.5 hours:', round(float(predicted_score[0]), 2))

## 9. OOP Wrapper for Regression (Assessment-ready)
Create a simple class that wraps scikit-learn's LinearRegression and exposes `fit`, `predict`, and `score`.

In [None]:
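class MyLinearRegressor:
    """Thin wrapper around scikit-learn's LinearRegression."""

    def __init__(self):
        self.model = LinearRegression()

    def fit(self, X, y):
        """Fit the model and return self so calls can be chained."""
        self.model.fit(X, y)
        return self

    def predict(self, X):
        """Return predicted values for X."""
        return self.model.predict(X)

    def score(self, X, y):
        """Return R^2 of the predictions on (X, y)."""
        y_pred = self.predict(X)
        return r2_score(y, y_pred)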

# Demo usage
wrapper = MyLinearRegressor().fit(X_train, y_train)
print('Wrapper R^2 on test set:', round(wrapper.score(X_test, y_test), 3))

## 10. (Extension) Polynomial Regression
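When the relationship is curved, a polynomial model may fit better. We'll add degree-2 features and compare \(R^2\) on the test set. This dataset looks close to a straight line, so the degree-2 model may not improve much.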

In [None]:
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('lin', LinearRegression())
])
poly_pipeline.fit(X_train, y_train)
r2_poly = poly_pipeline.score(X_test, y_test)
r2_lin = lin_reg.score(X_test, y_test)
print('Linear R^2:', round(r2_lin, 3))
print('Polynomial (deg=2) R^2:', round(r2_poly, 3))

# Visualise both fits on the full dataset (separate plot)
plt.figure()
plt.scatter(X, y)
x_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
y_line_lin = lin_reg.predict(x_line)
y_line_poly = poly_pipeline.predict(x_line)
plt.plot(x_line, y_line_lin, label='Linear fit')
plt.plot(x_line, y_line_poly, label='Polynomial fit (deg=2)')
plt.xlabel('Hours studied')
plt.ylabel('Exam score')
plt.title('Linear vs Polynomial fit')
plt.legend()
plt.show()
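
To see what the `poly` step in the pipeline does, the sketch below applies `PolynomialFeatures(degree=2, include_bias=False)` to three sample values. Each input `hours` becomes the pair `[hours, hours**2]`, and the linear model is then fitted on those two columns. The array `hours_demo` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Three sample inputs, shaped as a single-feature 2D array
hours_demo = np.array([[1.0], [2.0], [3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(hours_demo)

# Each row is [hours, hours**2]
print(expanded)
```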

## 11. Your Turn — Tasks (submit to Canvas)
1. Change the train/test split size and re-run. Record how \(R^2\) changes.
2. Try adding an outlier row (e.g., `hours=10, score=40`). What happens to slope/intercept and \(R^2\)?
3. Add a new method `mse(X, y)` to `MyLinearRegressor` that returns the Mean Squared Error of its predictions.
4. (Extension) Try `PolynomialFeatures(degree=3)`. Does it overfit? Explain using train vs test \(R^2\).
