<a href="https://www.kaggle.com/code/elainedazzio/simplest-conformal-prediction-example?scriptVersionId=261335989" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# The Absolute Simplest Conformal Prediction Example (For Learning!)

This notebook shows you the *very basic idea* behind conformal prediction. We're going to keep things incredibly simple to make it easy to understand.

**Important Note:** This example is *only* for learning the core concept. It does *not* follow the best practices you'd need for real-world projects. Specifically, we're going to use all our data for *both* training and calculating how "wrong" our model usually is. This shortcut makes the example easier to grasp, but it tends to give prediction intervals that are too narrow (more confident than they should be), because a model usually fits the data it was trained on better than it fits new data.

Let's get started!

In [None]:
# --- 1. Imports ---

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

We've just imported the libraries we need:

*   `numpy`: For numerical operations (like creating arrays).
*   `matplotlib.pyplot`: For plotting.
*   `sklearn.linear_model.LinearRegression`: Our simple prediction model.

In [None]:
# --- 2. Generate Data ---

# Super simple linear data with noise
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)  # Reshape for sklearn
y = 2 * X.flatten() + 1 + np.random.normal(0, 2, 100) # Linear relationship + noise

Now we've created some fake data. It's just a simple line (`y = 2x + 1`) with some random "noise" added in. This makes it a little more realistic than a perfect line. We've also made sure that `X` is in the format scikit-learn expects: a 2D array with one row per data point and one column per feature (a "column vector").
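If the reshaping step feels mysterious, here is a tiny sketch of what it does. The names `x_flat` and `x_col` are made up for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np

# A small 1D array, standing in for the output of np.linspace above
x_flat = np.linspace(0, 10, 5)

# reshape(-1, 1) turns it into a 2D array with a single column;
# -1 tells NumPy to work out the number of rows on its own
x_col = x_flat.reshape(-1, 1)

print(x_flat.shape)           # (5,)
print(x_col.shape)            # (5, 1)
print(x_col.flatten().shape)  # (5,)  flatten() goes back to 1D
```

This is why the data cell calls `X.flatten()` when building `y`: the noise array is 1D, so `X` is flattened back to 1D before the two are added.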

In [None]:
# --- 3. Train a Model ---

model = LinearRegression()
model.fit(X, y)

We've trained a basic linear regression model. This model tries to find the "best-fit" straight line through our data.

In [None]:
# --- 4. Make Predictions and Calculate Residuals ---

y_pred = model.predict(X)
residuals = np.abs(y - y_pred)  # Absolute residuals as nonconformity scores

Here's a key step! We've calculated the "residuals."  A residual is simply how far off each of our model's predictions was from the actual value. We use the *absolute value* of the residuals, so we're only looking at the *size* of the error, not whether it was too high or too low.  These residuals are our **nonconformity scores**. They measure how "non-conforming" or "unusual" each data point is, *according to our model*.
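To make the idea concrete, here is a minimal sketch with four hand-picked numbers. The values and the names `y_true_toy`, `y_pred_toy`, and `scores_toy` are hypothetical and do not come from the notebook's data.

```python
import numpy as np

# Hypothetical true values and model predictions (illustration only)
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.5, 7.0, 11.0])

# Absolute residual: the size of each error, ignoring its sign
scores_toy = np.abs(y_true_toy - y_pred_toy)

print(scores_toy.tolist())  # [0.5, 0.5, 0.0, 2.0]
```

The last point has the largest score, so it is the one the model finds most "unusual", even though the first two errors point in opposite directions and end up with the same score.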

In [None]:
# --- 5.  New Data Point for Prediction ---
X_new = np.array([[5]]) # Predict at x=5

We've created a new data point, `X_new`.  This is the `X` value for which we want to make a prediction *and* create a prediction interval.

In [None]:
# --- 6. Calculate Prediction Interval ---

# Find the (1 - alpha) quantile of the residuals. We target 90% coverage (alpha = 0.1).
# Note: this is the plain empirical quantile, which is fine for this simplified demo.
alpha = 0.1
quantile_value = np.quantile(residuals, 1 - alpha)

y_new_pred = model.predict(X_new)
y_lower = y_new_pred - quantile_value
y_upper = y_new_pred + quantile_value

print(f"Prediction for X={X_new[0,0]}: {y_new_pred[0]:.2f}")
print(f"90% Prediction Interval: [{y_lower[0]:.2f}, {y_upper[0]:.2f}]")

This is where the conformal prediction magic happens!

1.  **Choose a Confidence Level:** We've chosen a 90% confidence level (which means `alpha = 0.1`). This means we want our prediction intervals to contain the true value about 90% of the time.

2.  **Find the Quantile:** We use `np.quantile` to find the value that separates the bottom 90% of our residuals from the top 10%. Think of it like this: about 90% of our training data points had errors *no larger* than this `quantile_value`.

3.  **Create the Interval:**  We make our prediction for `X_new` using the model.  Then, we create the prediction interval by *adding* and *subtracting* the `quantile_value` from that prediction.  This interval is our "range of likely values" for the true `y` value at `X_new`.

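We're essentially saying: "Based on how wrong our model was on the data it saw, we expect the true value for a new `X` to fall within an interval like this about 90% of the time." Keep in mind that the residuals here come from the training data, so the real coverage on new data is likely to be somewhat below 90%.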
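One detail worth knowing: when the scores come from held-out calibration data, the usual split conformal recipe does not take the plain `1 - alpha` quantile. It takes the `ceil((n + 1) * (1 - alpha))`-th smallest score, which is slightly more conservative. The sketch below compares the two on made-up scores; `conformal_quantile` and `demo_scores` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest score.

    If that rank is larger than n, no finite value works, so return infinity.
    """
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        return np.inf
    return scores[rank - 1]  # rank is 1-based, NumPy indexing is 0-based

# Made-up nonconformity scores (illustration only)
rng = np.random.default_rng(0)
demo_scores = np.abs(rng.normal(0, 2, 100))

q_plain = np.quantile(demo_scores, 0.9)             # plain empirical quantile
q_conformal = conformal_quantile(demo_scores, 0.1)  # finite-sample version

print(f"Plain 90% quantile:      {q_plain:.3f}")
print(f"Conformal 90% quantile:  {q_conformal:.3f}")
```

With 100 scores the rank works out to 91, so the conformal value is the 91st smallest score and is never smaller than the plain quantile. The gap shrinks as the number of calibration points grows.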

In [None]:
# --- 7. Visualization ---

plt.figure(figsize=(8, 6))
plt.scatter(X, y, label="Training Data")
plt.plot(X, y_pred, color='red', label="Model Prediction")

# Show the prediction interval for the new point
plt.plot([X_new[0,0], X_new[0,0]], [y_lower[0], y_upper[0]], color='green', linestyle='--', linewidth=2, label="Prediction Interval")
plt.scatter(X_new, y_new_pred, color='green', marker='o', s=100)

plt.xlabel("X")
plt.ylabel("y")
plt.title("Simplest Conformal Prediction Example")
plt.legend()
plt.show()

Finally, we plot everything:

*   The original data points (blue dots).
*   The line our model learned (red line).
*   The new data point we predicted for (green dot).
*   The prediction interval for that new data point (green dashed line).

You can see that the interval gives us a range of plausible values, not just a single prediction.

**Important Reminder (Again!)**

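Remember, this example is simplified! In a real application using this approach (often called split conformal prediction), you *must* keep the data used to train the model separate from the calibration data used to compute the nonconformity scores, and ideally hold out a test set as well to check that the intervals cover as often as promised. This example skips that crucial step to make the core idea of conformal prediction as clear as possible.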
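For reference, here is a minimal sketch of what the split version could look like. It is an illustration under a few assumptions, not a production recipe: it uses 1,000 synthetic points (more than the notebook's 100) so that each split has a reasonable size, a 50/25/25 split chosen arbitrarily, and new variable names (`X_all`, `split_model`, `q_hat`, and so on) that do not appear in the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with the same line and noise level as the notebook
rng = np.random.default_rng(42)
X_all = rng.uniform(0, 10, 1000).reshape(-1, 1)
y_all = 2 * X_all.ravel() + 1 + rng.normal(0, 2, 1000)

# 1) Split into training (50%), calibration (25%), and test (25%) sets
X_train, X_rest, y_train, y_rest = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# 2) Fit the model on the training set only
split_model = LinearRegression().fit(X_train, y_train)

# 3) Nonconformity scores come from the calibration set, not the training set
cal_scores = np.sort(np.abs(y_cal - split_model.predict(X_cal)))

# 4) Take the ceil((n + 1) * (1 - alpha))-th smallest calibration score
alpha = 0.1
n_cal = cal_scores.size
rank = int(np.ceil((n_cal + 1) * (1 - alpha)))  # 226 for n_cal = 250
q_hat = cal_scores[rank - 1]                    # valid because rank <= n_cal here

# 5) Build intervals on the test set and measure how often they cover the truth
test_pred = split_model.predict(X_test)
covered = (y_test >= test_pred - q_hat) & (y_test <= test_pred + q_hat)
coverage = covered.mean()

print(f"Interval half-width: {q_hat:.2f}")
print(f"Empirical coverage on the test set: {coverage:.3f}")
```

The empirical coverage should land near 90%, though it will wobble from one random split to another. The key difference from the simplified example is step 3: the model never saw the calibration points during training, so their residuals give an honest picture of how wrong it tends to be on new data.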
