# Linear Regression: Advanced Tutorial

This notebook demonstrates linear regression using synthetic and real-world data.
We'll cover:
- Manual implementation
- Scikit-learn usage
- Residual analysis
- Error metrics
- Feature importance


## 1. Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

sns.set_theme(style="whitegrid")  # sns.set is an older alias for set_theme


## 2. Generate Synthetic Data (1D)

In [None]:
# Generating data: y = 3x + noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 3 * X + np.random.randn(100, 1)

plt.scatter(X, y)
plt.title("Synthetic Linear Data")
plt.xlabel("X")
plt.ylabel("y")
plt.show()


## 3. Manual Linear Regression (Closed-form)

In [None]:
# Normal equation: theta = (X_b^T X_b)^(-1) X_b^T y
X_b = np.c_[np.ones((len(X), 1)), X]  # prepend a column of ones for the intercept
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

print("Theta (Intercept, Coef):", theta_best.ravel())

# Predict
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta_best)

plt.plot(X_new, y_predict, "r-", label="Prediction")
plt.scatter(X, y, alpha=0.6)
plt.title("Closed-form Linear Regression")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
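
Explicitly inverting `X_b.T @ X_b` works for this small, well-conditioned problem, but it can become numerically unstable when features are nearly collinear. As a minimal sketch, the same least-squares problem can be solved with `np.linalg.lstsq`, which avoids forming the inverse. The data is regenerated as in section 2 so the cell runs on its own; the names `theta_inv` and `theta_lstsq` are illustrative and not part of the cells above.

```python
import numpy as np

# Regenerate the synthetic data from section 2 so this cell is self-contained.
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((len(X), 1)), X]  # column of ones for the intercept

# Normal equation with an explicit matrix inverse.
theta_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Same least-squares problem, solved without forming the inverse.
theta_lstsq, _, rank, _ = np.linalg.lstsq(X_b, y, rcond=None)

print("inv:  ", theta_inv.ravel())
print("lstsq:", theta_lstsq.ravel())
print("rank of X_b:", rank)
```

Both approaches should agree to floating-point precision here. The difference only starts to matter when `X_b.T @ X_b` is close to singular.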


## 4. Linear Regression with scikit-learn

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X, y)

print("Intercept:", lin_reg.intercept_)
print("Coefficient:", lin_reg.coef_)

y_pred = lin_reg.predict(X)
plt.scatter(X, y, alpha=0.6)
plt.plot(X, y_pred, color='red', label='sklearn model')
plt.title("Linear Regression with scikit-learn")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()


## 5. Residual Analysis

In [None]:
residuals = y - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title("Residuals Plot")
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.show()
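
A few numeric checks can complement the plot. The sketch below refits the synthetic data from section 2 with NumPy only, so it does not depend on earlier cells, and prints the mean residual, the residual standard deviation, and the correlation between predictions and residuals. The names `resid_mean`, `resid_std`, and `corr` are illustrative.

```python
import numpy as np

# Regenerate the synthetic data from section 2.
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 3 * X + np.random.randn(100, 1)

# Ordinary least squares with an intercept column.
X_b = np.c_[np.ones((len(X), 1)), X]
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
y_pred = (X_b @ theta).ravel()
residuals = y.ravel() - y_pred

resid_mean = residuals.mean()
resid_std = residuals.std(ddof=2)  # two fitted parameters
corr = np.corrcoef(y_pred, residuals)[0, 1]

print("Mean residual:", resid_mean)
print("Residual std:", resid_std)
print("Corr(predicted, residual):", corr)
```

For least squares with an intercept, the mean residual and this correlation are zero by construction (up to rounding), so they confirm the fit is correct rather than that the model is adequate. Curvature or a funnel shape in the plot is still the thing to look for. The residual standard deviation should land near 1, the scale of the noise used to generate the data.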


## 6. Real Dataset: California Housing

In [None]:
data = fetch_california_housing(as_frame=True)
df = data.frame

df.head()


## 7. Prepare the Data and Fit the Model

In [None]:
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)


## 8. Model Evaluation

In [None]:
# MedHouseVal is expressed in units of $100,000, so MAE and RMSE share that unit.
print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R²:", r2_score(y_test, y_pred))
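
To make the definitions behind these metrics concrete, here is a small sketch that computes them by hand with NumPy. The arrays `y_true` and `y_hat` are made-up values chosen so the arithmetic is easy to follow; they are not taken from the housing data.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_hat                 # [1, 0, -2, -1]

mae = np.mean(np.abs(errors))           # mean absolute error
mse = np.mean(errors ** 2)              # mean squared error
rmse = np.sqrt(mse)                     # same units as the target

ss_res = np.sum(errors ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                # fraction of variance explained

print("MAE:", mae)    # 1.0
print("MSE:", mse)    # 1.5
print("RMSE:", rmse)
print("R²:", r2)
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two suggests that a few large errors dominate.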


## 9. Feature Importance

In [None]:
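# Raw coefficients depend on each feature's units, so their magnitudes are not
# directly comparable across features; standardize the features first for that.
importance = pd.Series(model.coef_, index=X.columns).sort_values()

plt.figure(figsize=(10, 6))
importance.plot(kind='barh')
plt.title("Feature Importance (Coefficients)")
plt.xlabel("Coefficient value")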
plt.show()
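
Coefficient size only reflects importance when the features share a scale. The sketch below illustrates this on made-up data with two features that contribute about equally per standard deviation but are measured on very different scales. All names (`X_demo`, `y_demo`, `raw_coef`, `std_coef`) are hypothetical and separate from the housing model above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature with std ~1, one with std ~1000.
rng = np.random.default_rng(0)
small = rng.normal(0, 1, 500)
large = rng.normal(0, 1000, 500)
X_demo = np.column_stack([small, large])
y_demo = 2.0 * small + 0.002 * large + rng.normal(0, 0.1, 500)

# Coefficients on the raw features differ by about three orders of magnitude.
raw_coef = LinearRegression().fit(X_demo, y_demo).coef_

# After standardizing, each coefficient is the effect of a one-std change.
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)
std_coef = pipe.named_steps["linearregression"].coef_

print("Raw coefficients:         ", raw_coef)
print("Standardized coefficients:", std_coef)
```

The same pipeline could be fit on `X_train` and `y_train` from section 7 to obtain coefficients that are comparable across the housing features.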


## 10. Summary

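- Linear regression is fast to fit, and its coefficients are directly interpretable.
- It works well when the target depends roughly linearly on the features.
- Residuals should scatter randomly around zero, with no trend or funnel shape against the predictions.
- It is sensitive to multicollinearity (unstable coefficients) and to outliers (which can pull the fitted line).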

Try experimenting with polynomial features or regularization!
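
As a starting point for that experiment, here is a minimal sketch that combines `PolynomialFeatures` with `Ridge` in a scikit-learn pipeline and compares it with a plain linear fit. The quadratic data, the degree, and `alpha=1.0` are arbitrary choices for illustration, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data with a quadratic relationship plus noise.
rng = np.random.default_rng(42)
X_q = 6 * rng.random((200, 1)) - 3
y_q = 0.5 * X_q[:, 0] ** 2 + X_q[:, 0] + 2 + rng.normal(0, 1, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X_q, y_q, test_size=0.25, random_state=42)

# Baseline: a straight line cannot capture the curvature.
linear = LinearRegression().fit(X_tr, y_tr)

# Degree-2 features, standardized, then fit with L2 regularization.
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
).fit(X_tr, y_tr)

r2_linear = linear.score(X_te, y_te)
r2_poly = poly_ridge.score(X_te, y_te)

print("Test R² (linear):      ", r2_linear)
print("Test R² (poly + ridge):", r2_poly)
```

In practice, `degree` and `alpha` would be chosen by cross-validation, for example with `GridSearchCV`.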