# Introduction to Predictive Analysis

A hands-on introduction to predictive modeling: we build a simple regression model on synthetic data, evaluate it with MAE and R², and inspect its predictions. The examples are kept small, and each step is explained for beginners.

In [None]:
# Setup - minimal imports for a simple regression demo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
%matplotlib inline
np.random.seed(0)
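# The seaborn styles were renamed in Matplotlib 3.6; fall back to the old name on older versions
try:
    plt.style.use('seaborn-v0_8-whitegrid')
except OSError:
    plt.style.use('seaborn-whitegrid')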

## Synthetic regression problem
Create a simple dataset with three features (two continuous, one binary) and a continuous target. We'll keep n small so runs are quick (n=150).

What we'll do (step-by-step):
1. Create a small synthetic dataset.
2. Split into train/test.
3. Fit a simple Linear Regression.
4. Evaluate with MAE and R² and inspect predictions.

Notes:
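- If the coefficients look unstable (for example, they change a lot between random splits), try a regularized model such as Ridge or Lasso.
- A single 80/20 split of 150 rows leaves only 30 test rows, so MAE and R² can vary noticeably from split to split; cross-validation gives more stable estimates.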
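
The cross-validation note can be sketched as follows. This is a minimal illustration rather than part of the main pipeline: the names `X_demo`, `y_demo`, and `n_demo` are hypothetical, and the data is regenerated here (with the same recipe the notebook uses below) so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in data: same recipe as the notebook's synthetic dataset
rng = np.random.default_rng(0)
n_demo = 150
X_demo = np.column_stack([
    rng.normal(0, 1, n_demo),
    rng.normal(3, 2, n_demo),
    rng.binomial(1, 0.25, n_demo),
])
y_demo = (2.5 * X_demo[:, 0] + 0.8 * X_demo[:, 1] + 1.5 * X_demo[:, 2]
          + rng.normal(0, 0.8, n_demo))

# 5-fold CV: every row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports MAE as a negative score (higher is better), so flip the sign
mae_scores = -cross_val_score(LinearRegression(), X_demo, y_demo,
                              cv=cv, scoring='neg_mean_absolute_error')
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                            cv=cv, scoring='r2')

print(f"MAE per fold: {np.round(mae_scores, 3)}, mean = {mae_scores.mean():.3f}")
print(f"R2 per fold:  {np.round(r2_scores, 3)}, mean = {r2_scores.mean():.3f}")
```

The spread across folds gives a rough sense of how much a single train/test split could mislead.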

In [None]:
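# 1) Create a tiny dataset (n=150)
n = 150
X1 = np.random.normal(loc=0, scale=1, size=n)  # standard normal feature
X2 = np.random.normal(loc=3, scale=2, size=n)  # normal feature with mean 3, sd 2
X3 = np.random.binomial(1, 0.25, size=n)       # binary flag, equal to 1 about 25% of the time
# True relationship: coefficients 2.5, 0.8, 1.5, zero intercept, Gaussian noise with sd 0.8
y = 2.5 * X1 + 0.8 * X2 + 1.5 * X3 + np.random.normal(0, 0.8, size=n)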
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'y': y})
df.head()

In [None]:
# 2) Split into train/test
X = df[['X1','X2','X3']]
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

In [None]:
# 3) Fit a simple Linear Regression, then score it on the held-out test set
model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print('MAE:', mean_absolute_error(y_test, preds))
print('R2:', r2_score(y_test, preds))
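
To make the two metrics concrete, here is a small worked example that computes MAE and R² by hand and checks them against scikit-learn. The arrays `y_true` and `y_hat` are made-up illustrative values, not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MAE: average absolute error, in the same units as the target
mae_manual = np.mean(np.abs(y_true - y_hat))

# R2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE manual = {mae_manual:.3f}, sklearn = {mean_absolute_error(y_true, y_hat):.3f}")
print(f"R2  manual = {r2_manual:.3f}, sklearn = {r2_score(y_true, y_hat):.3f}")
```

Here the absolute errors are 0.5, 0.5, 0 and 1, so MAE is 0.5; the residual sum of squares is 1.5 against a total of 20, so R² is 0.925.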

In [None]:
# 4) Inspect coefficients: they should land near the generating values (2.5, 0.8, 1.5), with an intercept near 0
coeffs = pd.Series(model.coef_, index=X.columns)
print('Intercept:', model.intercept_)
print('Coefficients:')
print(coeffs)
# Plot predicted vs actual
plt.figure(figsize=(6,6))
lims = [min(y_test.min(), preds.min()), max(y_test.max(), preds.max())]
plt.scatter(y_test, preds, alpha=0.6)
plt.plot(lims, lims, 'r--')
plt.xlabel('Actual y')
plt.ylabel('Predicted y')
plt.title('Predicted vs Actual')
plt.show()

In [None]:
# Optional: quick check with Ridge (regularized) to compare coefficients
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
ridge_preds = ridge.predict(X_test)
print('Ridge MAE:', mean_absolute_error(y_test, ridge_preds))
print('Ridge R2:', r2_score(y_test, ridge_preds))
print('Ridge coeffs:')
print(pd.Series(ridge.coef_, index=X.columns))
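
With `alpha=1.0` and well-behaved data like this, Ridge and plain linear regression will usually give very similar coefficients. To see what regularization does, it can help to sweep `alpha` over a wide range. The sketch below is self-contained and uses hypothetical names (`X_demo`, `y_demo`, `coef_norms`); the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical stand-in data, generated with the same recipe as the notebook's dataset
rng = np.random.default_rng(0)
n_demo = 150
X_demo = np.column_stack([
    rng.normal(0, 1, n_demo),
    rng.normal(3, 2, n_demo),
    rng.binomial(1, 0.25, n_demo),
])
y_demo = (2.5 * X_demo[:, 0] + 0.8 * X_demo[:, 1] + 1.5 * X_demo[:, 2]
          + rng.normal(0, 0.8, n_demo))

# Larger alpha means a stronger penalty on coefficient size
alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(float(np.linalg.norm(ridge_demo.coef_)))
    print(f"alpha={alpha:>8}: coefs={np.round(ridge_demo.coef_, 3)}, "
          f"norm={coef_norms[-1]:.3f}")
```

The coefficient norm shrinks toward zero as `alpha` grows, which is the trade-off Ridge makes: less variance in the estimates in exchange for some bias.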

In [None]:
# Quick residual plot to inspect model fit
residuals = y_test - preds
plt.figure(figsize=(6,4))
plt.scatter(preds, residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted')
plt.show()

In [None]:
# Short exercise: try different features or noise levels to see how metrics change
# For example, re-run dataset generation with larger noise and compare MAE/R2
print('Try: change the noise in the synthetic data and re-run the pipeline to observe effects on MAE/R2')
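
One way to run the exercise is to wrap the whole pipeline in a function and call it at several noise levels. The helper `run_pipeline` below is hypothetical (it is not defined elsewhere in this notebook), and the noise levels are arbitrary; it regenerates the data with its own random generator so it does not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

def run_pipeline(noise_sd, n=150, seed=0):
    """Hypothetical helper: generate data, split, fit, and return test metrics."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([
        rng.normal(0, 1, n),
        rng.normal(3, 2, n),
        rng.binomial(1, 0.25, n),
    ])
    y = 2.5 * X[:, 0] + 0.8 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, noise_sd, n)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    test_preds = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    return {'MAE': mean_absolute_error(y_te, test_preds),
            'R2': r2_score(y_te, test_preds)}

# Compare a low, the original (0.8), and a high noise level
results = {sd: run_pipeline(sd) for sd in (0.2, 0.8, 3.0)}
for sd, metrics in results.items():
    print(f"noise sd={sd}: MAE={metrics['MAE']:.3f}, R2={metrics['R2']:.3f}")
```

As the noise grows, MAE should rise and R² should fall, because more of the target's variation is something no model can predict.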

In [None]:
# Summary: collect the test-set metrics for both models in one DataFrame
summary = pd.DataFrame({
    'model': ['LinearRegression','Ridge(alpha=1.0)'],
    'MAE': [mean_absolute_error(y_test, preds), mean_absolute_error(y_test, ridge_preds)],
    'R2': [r2_score(y_test, preds), r2_score(y_test, ridge_preds)]
})
summary

In [None]:
# Final short tips for learners (printable)
print('- Keep datasets small when learning to iterate quickly.')
print('- Inspect coefficients but remember correlation/collinearity can affect interpretation.')
print('- Use cross-validation and regularization when in doubt.')
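
The collinearity tip is easier to appreciate with a small experiment. The sketch below is an illustration under made-up assumptions, not part of the dataset above: it builds two nearly identical features (`x2` is `x1` plus tiny noise), refits on 20 fresh samples, and compares how much the coefficients move for ordinary least squares versus Ridge.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

ols_coefs, ridge_coefs = [], []
for seed in range(20):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0, 1, 150)
    x2 = x1 + rng.normal(0, 0.01, 150)        # near-copy of x1 (strong collinearity)
    y_c = 3.0 * x1 + rng.normal(0, 1.0, 150)  # only x1 truly drives the target
    X_c = np.column_stack([x1, x2])
    ols_coefs.append(LinearRegression().fit(X_c, y_c).coef_)
    ridge_coefs.append(Ridge(alpha=1.0).fit(X_c, y_c).coef_)

ols_coefs = np.array(ols_coefs)
ridge_coefs = np.array(ridge_coefs)

# Standard deviation of each coefficient across the 20 refits
ols_std = ols_coefs.std(axis=0)
ridge_std = ridge_coefs.std(axis=0)
print("OLS coefficient std across refits:  ", np.round(ols_std, 3))
print("Ridge coefficient std across refits:", np.round(ridge_std, 3))
print("Mean of OLS coef sum (x1 + x2):     ", round(float(ols_coefs.sum(axis=1).mean()), 3))
```

The individual least-squares coefficients swing widely from sample to sample even though their sum stays close to 3, so reading either one on its own would be misleading. Ridge spreads the effect across the two features and is far more stable.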

In [None]:
# (Empty helper cell) - you can run small experiments here

