# Simplified Explanation of Predictions in Multiple Regression
This notebook focuses on how to interpret predictions in multiple regression and the different sources of uncertainty that affect prediction accuracy.


### Prediction Equation

In a multiple regression model, once the coefficients $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$ have been estimated, the response $Y$ for a new set of predictor values $X_1, X_2, \dots, X_p$ is predicted using the equation:

$$
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p
$$

This gives the predicted value $\hat{Y}$, which is our best estimate of the response based on the linear model.
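
As a minimal sketch of the prediction equation, the snippet below computes $\hat{Y}$ as a dot product. The coefficient values in `beta_hat` and the predictor values in `x_new` are hypothetical numbers chosen for illustration only; they are not estimates from any dataset in this notebook.

```python
import numpy as np

# Hypothetical coefficient estimates (intercept first), for illustration only
beta_hat = np.array([3.0, 0.05, 0.2, 0.0])

# Hypothetical new predictor values X_1, X_2, X_3
x_new = np.array([100.0, 20.0, 1.0])

# Prepend a 1 so the intercept is picked up by the dot product:
# y_hat = beta_0 + beta_1 * X_1 + beta_2 * X_2 + beta_3 * X_3
y_hat = np.concatenate(([1.0], x_new)) @ beta_hat
print(y_hat)
```

Writing the prediction as a dot product with a leading 1 mirrors how the intercept is handled when a constant column is added to the design matrix.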

However, this prediction comes with **three types of uncertainty**:

### 1. **Inaccuracy in Coefficient Estimates (Reducible Error)**

- The discrepancy between the estimated coefficients and the true coefficients introduces **inaccuracy** into the predictions. This inaccuracy is part of the **reducible error**, meaning it can be reduced with better data or larger sample sizes.

### 2. **Model Bias (Assumption of Linear Model)**


- The linear model estimates the **best linear approximation** to the true underlying function, but it ignores the possibility that the true function could be non-linear. This **model bias** is a **systematic error**: it is potentially reducible, but only by choosing a more flexible model, not by collecting more data.
- While model bias is harder to quantify directly, we acknowledge that even with perfect coefficient estimates, the model might not capture the true nature of the data.

### 3. **Random Error $\epsilon$ (Irreducible Error)**

- The model includes a random error term, denoted $\epsilon$, which accounts for factors that affect the response but are not captured by the predictors in the model. This error represents the randomness inherent in the data that cannot be predicted or modeled.
- Even if we knew the exact values of the true coefficients, the predictions $\hat{Y}$ would still not be perfect because of this **irreducible error**.
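
The difference between reducible and irreducible error can be illustrated with a small simulation. This is a sketch under stated assumptions: the true coefficients `beta_true`, the noise level `sigma`, and the helper `fit_ols` are all made up for the illustration, and the data are synthetic. With a larger sample the coefficient estimates move closer to the truth, while the residual standard deviation stays near `sigma`.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([2.0, 1.5])  # assumed true intercept and slope
sigma = 1.0                       # assumed standard deviation of epsilon

def fit_ols(n):
    """Simulate n observations from the true linear model and fit by least squares."""
    x = rng.uniform(0, 10, size=n)
    X = np.column_stack([np.ones(n), x])              # design matrix with intercept
    y = X @ beta_true + rng.normal(0, sigma, size=n)  # response = signal + random error
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Residual standard deviation estimates sigma (n - 2 degrees of freedom)
    resid_sd = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - 2))
    return beta_hat, resid_sd

for n in (20, 20000):
    beta_hat, resid_sd = fit_ols(n)
    coef_err = np.abs(beta_hat - beta_true).max()
    print(f"n={n}: max coefficient error={coef_err:.4f}, residual SD={resid_sd:.3f}")
```

The exact numbers depend on the random seed, so no specific output is claimed here; the point is that only the coefficient error shrinks as `n` grows.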




### Comparison of Confidence vs. Prediction Intervals:

To account for both the uncertainty in the coefficient estimates (reducible error) and the random error $\epsilon$ (irreducible error), we use confidence and prediction intervals.


1. **Confidence Intervals**: Quantify the first source of uncertainty (reducible error in the coefficient estimates). A confidence interval for the predicted value $\hat{Y}$ estimates the range within which the true mean response lies, given the uncertainty in the coefficient estimates.

2. **Prediction Intervals**: Provide a range for an individual future observation, accounting for both coefficient uncertainty and random error, which makes them wider than confidence intervals. A prediction interval gives the range within which we expect the actual value of $Y$ for a new observation to fall.
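
To make the difference between the two intervals concrete, the sketch below computes both by hand for a simple synthetic dataset, using the standard OLS formulas: the standard error of the mean response is $s\sqrt{x_0^\top (X^\top X)^{-1} x_0}$, and the prediction standard error adds 1 under the square root to account for $\epsilon$. The data, the true coefficients, and the point `x0` are assumptions made up for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])  # intercept + one predictor
y = X @ np.array([2.0, 1.5]) + rng.normal(0, 1.0, size=n)      # synthetic response

# Least squares fit and residual variance estimate
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
df = n - X.shape[1]          # residual degrees of freedom
s2 = resid @ resid / df      # estimate of Var(epsilon)

# New point (leading 1 for the intercept) and its prediction
x0 = np.array([1.0, 5.0])
y0_hat = x0 @ beta_hat
h0 = x0 @ np.linalg.solve(X.T @ X, x0)   # x0' (X'X)^{-1} x0
t_crit = stats.t.ppf(0.975, df)          # two-sided 95% critical value

se_mean = np.sqrt(s2 * h0)        # coefficient uncertainty only
se_pred = np.sqrt(s2 * (1 + h0))  # coefficient uncertainty + random error

ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
print("95% confidence interval:", ci)
print("95% prediction interval:", pi)
```

Because `se_pred` is always larger than `se_mean`, the prediction interval always contains the confidence interval at the same level.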

**Example**:

For the Advertising data:

Given \$100,000 spent on TV, \$20,000 on radio, and \$1,000 on newspaper, what is the 95% prediction interval for sales?

Interpretation: 95% of intervals constructed in this way will contain the true sales value for that city.


In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm  # needed for sm.add_constant and sm.OLS below
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy import stats

# Load the dataset
url = "https://www.statlearning.com/s/Advertising.csv"
data = pd.read_csv(url, index_col=0)

In [None]:
data.head()

Unnamed: 0,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [None]:
# Define the features (advertising budgets) and the target (sales)
X = data[['TV', 'radio', 'newspaper']]
y = data['sales']

In [None]:
# Add a constant column to the predictors so that an intercept is estimated along with the slopes
X = sm.add_constant(X)

In [None]:
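# Split the data into training and test sets
# (test_size and random_state are illustrative assumptions, not values given in the notebook)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)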



In [None]:
# Fit the model using statsmodels
model = sm.OLS(y_train, X_train).fit()



In [2]:
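# Define a DataFrame with the values in the question and call it X_new.
# Budgets in the Advertising data are recorded in thousands of dollars.
# The constant column is written out explicitly because sm.add_constant
# skips adding it when every column of a single-row frame is constant.
X_new = pd.DataFrame({'const': [1.0], 'TV': [100.0], 'radio': [20.0], 'newspaper': [1.0]})

# Make predictions on a new observation
predictions = model.get_prediction(X_new)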

In [5]:
# Get a summary of the prediction
summary_frame = predictions.summary_frame(alpha=0.05)  # 95% intervals
print(summary_frame[['mean', 'mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']])


In [None]:
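# Now find the confidence and prediction intervals (mean_ci_* = confidence, obs_ci_* = prediction)
print(summary_frame[['mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']])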
