# Polynomial Regression: A Step-by-Step Guide for ML Newbies

Welcome to this interactive notebook designed to introduce you to Polynomial Regression! If you're new to machine learning, you've come to the right place. We'll break down each step, making complex concepts easy to understand.

## What is Polynomial Regression?

At its core, Polynomial Regression is a form of **Linear Regression** where the relationship between the independent variable(s) $X$ and the dependent variable $y$ is modeled as an $n$-th degree polynomial. While it's called "polynomial," it's still considered a linear model because the relationship is linear in the coefficients.

Think of it this way:
* **Simple Linear Regression:** $y = \beta_0 + \beta_1 X + \epsilon$ (a straight line)
* **Polynomial Regression (Degree 2):** $y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$ (a curve)
* **Polynomial Regression (Degree $n$):** $y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_n X^n + \epsilon$

This allows us to model non-linear relationships between variables, which is very common in real-world data.
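
To make the "linear in the coefficients" idea concrete, here is a minimal sketch using only NumPy. The toy data and the names `x_toy`, `y_toy`, and `A` are assumptions made for illustration and are not used elsewhere in this notebook. It builds the columns $1$, $X$, and $X^2$ by hand and solves an ordinary least-squares problem, which is the same kind of problem a straight-line fit solves.

```python
import numpy as np

# Hypothetical toy data: an exact quadratic y = 1 + 2x + 3x^2 (no noise)
x_toy = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_toy = 1.0 + 2.0 * x_toy + 3.0 * x_toy**2

# Design matrix with columns [1, x, x^2].
# The model y = A @ beta is linear in beta even though it is curved in x.
A = np.column_stack([np.ones_like(x_toy), x_toy, x_toy**2])

# Ordinary least squares: find beta that minimizes ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

print("Estimated coefficients (beta0, beta1, beta2):", np.round(beta, 4))
```

Because the toy data has no noise, the estimated coefficients should come out very close to 1, 2, and 3.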

## Notebook Sections:

1.  **Setting the Stage:** Importing Libraries
2.  **Getting Our Hands Dirty:** Data Loading and Creation
3.  **Understanding Our Data:** Exploratory Data Analysis (EDA)
4.  **Preparing for Battle:** Data Preprocessing
5.  **Building the Brain:** Polynomial Regression Model
6.  **Judging the Performance:** Model Evaluation
7.  **Final Thoughts:** Conclusion and Next Steps

Let's get started!

---

## 1. Setting the Stage: Importing Libraries

First, we need to import all the necessary tools (libraries) that will help us with data manipulation, visualization, and machine learning.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import os

# Set a style for our plots for better aesthetics
sns.set_style("whitegrid")

---

## 2. Getting Our Hands Dirty: Data Loading and Creation

For this tutorial, instead of loading a pre-existing CSV, we'll create a synthetic (mock) dataset. This gives us full control over the relationship between our variables, making it easier to demonstrate polynomial regression. We'll create a dataset where the `y` variable clearly has a polynomial relationship with `X`.

In [None]:
# Create a directory to save the CSV (does nothing if it already exists)
os.makedirs('data', exist_ok=True)

# --- Create Synthetic Data ---
np.random.seed(42) # for reproducibility

# Generate X values
X = np.linspace(-5, 5, 100).reshape(-1, 1) # 100 points between -5 and 5

# Generate y values with a polynomial relationship: y = 0.5*X^2 + 2*X + 10 + noise
y = 0.5 * X**2 + 2 * X + 10 + np.random.normal(0, 5, X.shape) # Gaussian noise with standard deviation 5

# Create a Pandas DataFrame
df = pd.DataFrame({'X': X.flatten(), 'y': y.flatten()})

# Save the DataFrame to a CSV file (optional, but good for demonstration)
csv_file_path = 'data/polynomial_data.csv'
df.to_csv(csv_file_path, index=False)

print(f"Synthetic data saved to: {csv_file_path}")
print("\nFirst 5 rows of the synthetic data:")
print(df.head())

Now, let's load the data from the CSV, just as you would with any other dataset.

In [None]:
# --- Load Data from CSV ---
# Replace this path with your actual CSV file path if you have one.
# For this notebook, we'll load the one we just created.
# Defining the path here keeps this cell working even if the creation cell was skipped.
csv_file_path = 'data/polynomial_data.csv'
try:
    df = pd.read_csv(csv_file_path)
    print("\nData loaded successfully from CSV:")
    print(df.head())
except FileNotFoundError:
    print(f"Error: The file '{csv_file_path}' was not found. Please ensure it exists.")
    print("If you skipped the data creation step, you'll need to create or provide your own CSV.")

---

## 3. Understanding Our Data: Exploratory Data Analysis (EDA)

EDA is crucial for understanding the characteristics of our dataset. It helps us identify patterns, detect outliers, and prepare for modeling.

### Basic Information

In [None]:
print("--- DataFrame Info ---")
df.info()

print("\n--- Descriptive Statistics ---")
df.describe()

### Visualizing the Relationship

Let's visualize the relationship between our feature `X` and the target `y` using a scatter plot. This will help us understand if a linear or non-linear model is more appropriate.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='X', y='y', data=df, alpha=0.7)
plt.title('Scatter Plot of X vs. y')
plt.xlabel('X (Independent Variable)')
plt.ylabel('y (Dependent Variable)')
plt.grid(True)
plt.show()

**Observation:** From the scatter plot, the relationship between X and y looks curved rather than strictly linear, although the noise blurs the pattern somewhat. This suggests that a simple linear regression model might not capture the underlying pattern well, and that a polynomial regression model is worth trying.

---

## 4. Preparing for Battle: Data Preprocessing

Before feeding data into our model, we need to preprocess it. This involves separating features from the target and splitting our data into training and testing sets.

* **Features (X):** The input variables we use to make predictions.
* **Target (y):** The output variable we want to predict.
* **Training Set:** Used to train our model. The model learns patterns from this data.
* **Testing Set:** Used to evaluate how well our trained model performs on unseen data. This helps us ensure the model generalizes well and isn't just memorizing the training data.

In [None]:
# Separate features (X) and target (y)
# X needs to be a 2D array for scikit-learn
X = df[['X']]
y = df['y']

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

# Split the data into training and testing sets
# test_size=0.20 means 20% of data will be for testing, 80% for training
# random_state ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nShape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

---

## 5. Building the Brain: Polynomial Regression Model

This is where the magic happens! We'll use `PolynomialFeatures` to transform our `X` data into polynomial features, and then apply `LinearRegression` to these transformed features.

### Understanding `PolynomialFeatures`

`PolynomialFeatures` creates new features by raising the existing features to every power up to a specified degree.
For example, if you have a single feature `X` and you set `degree=2`, by default it will generate:
* $X^0$ (a column of ones, representing the intercept; omitted when `include_bias=False`)
* $X^1$ (the original feature)
* $X^2$ (the feature squared)

These new features are then used as inputs for a standard `LinearRegression` model.
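
Here is a tiny sketch of that transformation on two hand-picked values, so you can check the numbers yourself. The array `X_demo` is a hypothetical input made up for this illustration. It compares the default behavior with `include_bias=False`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical input: one feature, two samples (values 2 and 3)
X_demo = np.array([[2.0], [3.0]])

# Default (include_bias=True): columns are X^0, X^1, X^2
with_bias = PolynomialFeatures(degree=2).fit_transform(X_demo)

# include_bias=False: the column of ones is dropped, leaving X^1, X^2
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_demo)

print("With bias column:\n", with_bias)
print("Without bias column:\n", no_bias)
```

For the sample with value 3, the expanded row is 1, 3, 9 with the bias column and 3, 9 without it.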

In [None]:
# Choose the degree of the polynomial
# A common choice is degree=2 or 3 for curves. Let's start with 2.
degree = 2

# Create PolynomialFeatures object
# include_bias=False prevents adding the X^0 term (intercept) as LinearRegression handles it
poly_features = PolynomialFeatures(degree=degree, include_bias=False)

# Transform the training features
X_train_poly = poly_features.fit_transform(X_train)

# Transform the testing features (important: use the *same* transform learned from training data)
X_test_poly = poly_features.transform(X_test)

print(f"Original X_train shape: {X_train.shape}")
print(f"Transformed X_train_poly shape (degree={degree}): {X_train_poly.shape}")
print("\nFirst 5 rows of X_train_poly (transformed features):")
print(X_train_poly[:5]) # Displaying transformed features

Now, we train a standard `LinearRegression` model on these newly created polynomial features.

In [None]:
# Create a Linear Regression model
model = LinearRegression()

# Train the model on the polynomial features of the training data
model.fit(X_train_poly, y_train)

print("\nModel training complete!")
print(f"Model Intercept (β0): {model.intercept_:.2f}")
print(f"Model Coefficients (β1, β2, ...): {np.round(model.coef_, 2)}")
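
As an optional alternative to transforming and fitting in two separate steps, scikit-learn's `make_pipeline` can chain them into a single estimator. The sketch below is not the approach used in the rest of this notebook. It uses hypothetical noise-free data (`X_toy`, `y_toy`) so that the recovered coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data following y = 0.5*x^2 + 2*x + 10
X_toy = np.linspace(-5, 5, 50).reshape(-1, 1)
y_toy = 0.5 * X_toy[:, 0] ** 2 + 2 * X_toy[:, 0] + 10

# The pipeline expands the features and fits the regression in one call
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
pipe.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
lin = pipe.named_steps["linearregression"]
print("Intercept:", round(float(lin.intercept_), 4))
print("Coefficients (for X, X^2):", np.round(lin.coef_, 4))
```

A pipeline also applies the same feature expansion automatically at prediction time, so `pipe.predict(X_new)` works directly on raw `X` values.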

---

## 6. Judging the Performance: Model Evaluation

After training, it's essential to evaluate how well our model performs. We'll make predictions on the test set (data the model hasn't seen during training) and use common regression metrics.

### Making Predictions

In [None]:
# Make predictions on the transformed test set
y_pred = model.predict(X_test_poly)

print("First 5 actual y_test values:", np.round(y_test.head().values, 2))
print("First 5 predicted y_pred values:", np.round(y_pred[:5], 2))

### Evaluation Metrics

* **Mean Squared Error (MSE):** The average of the squared differences between the actual and predicted values. Lower MSE means a better fit.
* **Root Mean Squared Error (RMSE):** The square root of MSE. It's in the same units as the target variable, making it easier to interpret.
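* **R-squared ($R^2$):** Represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A value of 1 indicates a perfect fit, and 0 means the model does no better than always predicting the mean of `y`. On a test set it can even be negative when the model does worse than that baseline. A higher R-squared is generally better.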
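
To see what these metrics compute, the sketch below works them out by hand with NumPy and compares the results with scikit-learn. The arrays `y_true`, `y_hat`, and `y_bad` are hypothetical values chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

residuals = y_true - y_hat

# MSE: average of the squared residuals
mse_manual = np.mean(residuals**2)

# RMSE: square root of MSE, in the same units as y
rmse_manual = np.sqrt(mse_manual)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE:  manual={mse_manual:.4f}, sklearn={mean_squared_error(y_true, y_hat):.4f}")
print(f"RMSE: manual={rmse_manual:.4f}")
print(f"R^2:  manual={r2_manual:.4f}, sklearn={r2_score(y_true, y_hat):.4f}")

# A deliberately poor prediction (always 20) gives a negative R^2
y_bad = np.full_like(y_true, 20.0)
r2_bad = r2_score(y_true, y_bad)
print(f"R^2 for the poor prediction: {r2_bad:.2f}")
```

The last line shows why $R^2$ is not bounded below by 0: a model that predicts worse than the mean of `y` gets a negative score.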

In [None]:
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"\nMean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

### Visualizing Actual vs. Predicted

A good way to see how well our model fits the data is to plot the data points together with the regression curve generated by our model.

In [None]:
plt.figure(figsize=(12, 7))
sns.scatterplot(x='X', y='y', data=df, alpha=0.7, label='Original Data')

# To plot the regression curve, we predict over an evenly spaced grid covering the full range of X
# A DataFrame with the same column name as the training data avoids a feature-name warning
X_range = pd.DataFrame({'X': np.linspace(df['X'].min(), df['X'].max(), 500)})
X_range_poly = poly_features.transform(X_range)
y_range_pred = model.predict(X_range_poly)

plt.plot(X_range['X'], y_range_pred, color='red', label=f'Polynomial Regression (Degree {degree})', linewidth=2)

plt.title(f'Polynomial Regression Fit (Degree {degree})')
plt.xlabel('X (Independent Variable)')
plt.ylabel('y (Dependent Variable)')
plt.legend()
plt.grid(True)
plt.show()

### Experimenting with Different Degrees (Optional but Recommended)

What happens if we choose a different degree? Let's quickly see the effect of choosing a much higher degree.

In [None]:
# Let's try a higher degree, for example, 10
high_degree = 10

poly_features_high = PolynomialFeatures(degree=high_degree, include_bias=False)
X_train_poly_high = poly_features_high.fit_transform(X_train)
X_test_poly_high = poly_features_high.transform(X_test)

model_high_degree = LinearRegression()
model_high_degree.fit(X_train_poly_high, y_train)

y_pred_high = model_high_degree.predict(X_test_poly_high)

mse_high = mean_squared_error(y_test, y_pred_high)
rmse_high = np.sqrt(mse_high)
r2_high = r2_score(y_test, y_pred_high)

print(f"\n--- Evaluation with Polynomial Degree {high_degree} ---")
print(f"MSE: {mse_high:.2f}")
print(f"RMSE: {rmse_high:.2f}")
print(f"R-squared: {r2_high:.2f}")

# Visualize the fit with the higher degree
plt.figure(figsize=(12, 7))
sns.scatterplot(x='X', y='y', data=df, alpha=0.7, label='Original Data')

# A DataFrame with the same column name as the training data avoids a feature-name warning
X_range_high = pd.DataFrame({'X': np.linspace(df['X'].min(), df['X'].max(), 500)})
X_range_poly_high = poly_features_high.transform(X_range_high)
y_range_pred_high = model_high_degree.predict(X_range_poly_high)

plt.plot(X_range_high['X'], y_range_pred_high, color='green', label=f'Polynomial Regression (Degree {high_degree})', linewidth=2)

plt.title(f'Polynomial Regression Fit (Degree {high_degree}) - Beware of Overfitting!')
plt.xlabel('X (Independent Variable)')
plt.ylabel('y (Dependent Variable)')
plt.ylim(y.min() - 10, y.max() + 10) # Adjust y-limits for better visualization if needed
plt.legend()
plt.grid(True)
plt.show()

**Observation:** You might notice that with a very high degree (e.g., 10), the model tries too hard to fit every single training data point, including the noise. This often results in a very wiggly line that performs excellently on the training data but might perform poorly on new, unseen data. This phenomenon is called **overfitting**. It's crucial to find the right balance (the optimal degree) that captures the underlying pattern without memorizing the noise. Cross-validation is a technique often used to find this optimal degree.
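
As a rough sketch of how cross-validation could be used to compare degrees, the example below scores degrees 1 through 10 with 5-fold cross-validation. It regenerates data similar to this notebook's quadratic using its own random generator, so the numbers will not match the cells above. The names `X_cv`, `y_cv`, and `cv_mse`, the choice of 5 folds, and the degree range are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data mirroring the quadratic used in this notebook
rng = np.random.default_rng(42)
X_cv = np.linspace(-5, 5, 100).reshape(-1, 1)
y_cv = 0.5 * X_cv[:, 0] ** 2 + 2 * X_cv[:, 0] + 10 + rng.normal(0, 5, 100)

# Shuffle the folds because the X values are sorted
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_mse = {}
for d in range(1, 11):
    pipe = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        LinearRegression(),
    )
    # scikit-learn reports negated MSE so that higher is always better
    scores = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[d] = -scores.mean()

for d, m in cv_mse.items():
    print(f"Degree {d:2d}: mean CV MSE = {m:.2f}")

best_degree = min(cv_mse, key=cv_mse.get)
print(f"Degree with the lowest mean CV MSE: {best_degree}")
```

The degree with the lowest average validation error is a reasonable candidate, though with noisy data several nearby degrees may score almost the same, in which case the simpler model is usually the safer choice.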

---

## 7. Final Thoughts: Conclusion and Next Steps

Congratulations! You've successfully walked through the entire process of performing polynomial regression. You've learned how to:

* Load and prepare data.
* Perform basic Exploratory Data Analysis (EDA).
* Transform features using `PolynomialFeatures`.
* Train a `LinearRegression` model on polynomial features.
* Evaluate the model's performance using key metrics like MSE, RMSE, and R-squared.
* Visualize the model's fit and understand the concept of overfitting.

### Next Steps to Deepen Your Understanding:

1.  **Experiment with Different Degrees:** Try changing the `degree` variable in the "Building the Brain" section (e.g., to 3, 4, or even higher) and observe how the model's fit and evaluation metrics change.
2.  **Cross-Validation:** For real-world scenarios, you would use techniques like K-Fold Cross-Validation to systematically find the optimal polynomial degree that balances bias and variance (avoiding both underfitting and overfitting).
3.  **Regularization:** Learn about regularization techniques (Lasso, Ridge Regression) that can help prevent overfitting, especially when dealing with high-degree polynomials or many features.
4.  **Real-World Datasets:** Apply polynomial regression to different datasets to see how it performs on various types of data.
5.  **Multi-variate Polynomial Regression:** Extend this concept to datasets with multiple independent variables. `PolynomialFeatures` can handle this automatically by generating interaction terms (e.g., $X_1 X_2$, $X_1^2 X_2$, etc.).

Keep exploring, keep learning, and happy coding!