### **Model Evaluation and Refinement**  

In the following sections we'll learn:  
- Model evaluation  
- Overfitting, underfitting, and model selection  
- Ridge regression  
- Grid search  

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats as sts
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, root_mean_squared_error

df_data = Path().cwd().parent.parent/"Data"/"Clean_Data"/"clean_auto_df.csv"
auto_df = pd.read_csv(df_data)

---  

### **Model Evaluation**  

In-sample evaluation tells us how well our model fits the data used to  
train it, but not how well the trained model can predict new data.  

Our solution is to separate our data into **in-sample data**, or training data,  
and **out-of-sample data**, or a test set.  

- Our test set simulates real-world data.  
- Usually a large portion of our data is used for training, let's say 70%,  
  and our testing data would be the remaining 30%.  
  - Build + train model = training data  
  - Evaluation (real-world representation) = test data  

How do we separate that data?  

- With `train_test_split()` from `sklearn.model_selection`!

In [None]:
from sklearn.model_selection import train_test_split

x_data = auto_df.drop("price", axis=1)
y_data = auto_df["price"]

x_train, x_test, y_train, y_test = train_test_split(
    x_data,
    y_data,
    test_size=0.3,
    random_state=0
)

split_data = {
    "x_train": x_train,
    "x_test": x_test,
    "y_train": y_train,
    "y_test": y_test,
}

# Report the shape of each split
for label, data_set in split_data.items():
    print(f"The {label} data has a shape of: {data_set.shape}")

- **x_data**: features or independent variables.  
- **y_data**: dataset target, `auto_df["price"]`.  
- **test_size**: fraction of the data held out for testing (0.3, or 30%, here).  
- **random_state**: seed for the random number generator that shuffles the  
  data, so the split is reproducible.  
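
To see what `test_size` and `random_state` do in isolation, here is a minimal sketch on a hypothetical toy array (`X_toy` and `y_toy` are illustrative names, not part of the auto data set). It assumes only that two calls with the same seed should return the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns, 1 target value per row
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# First split: 30% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Second split with the same seed, to check reproducibility
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

print("train rows:", X_tr.shape[0], "| test rows:", X_te.shape[0])
print("same test rows both times:", np.array_equal(X_te, X_te2))
```

Changing `random_state` (or leaving it unset) would generally produce a different shuffle, which is why fixing it is useful when results need to be repeatable.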

---  

### **Generalization Performance**  

The goal of training on one part of the data and testing on another is to  
measure how well our model predicts previously unseen data.  

- The error we obtain using our testing data approximates the  
  *generalization error*: the error we would see on new data.

Important to note:  
- Using a lot of data for training gives us an accurate means of determining  
  how our model will perform in the real world, **but the precision will be  
  low**: with few test points, the estimate varies a lot from one split to  
  the next.  
- If we use fewer data points to train the model and more to test it, **the  
  generalization error will be higher, but the estimate will have good  
  precision**.  
- To overcome this, we use **cross validation**.  
  - One of the most common out-of-sample evaluation methods, it splits the  
    data set into k equal groups (each called a fold), uses each fold once  
    as the test set while training on the remaining folds, then produces an  
    array of $R^2$ scores, one per fold.  
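
To make the fold mechanics concrete, the sketch below writes the k-fold loop by hand and compares it with `cross_val_score()`. It uses hypothetical synthetic data (`X_demo`, `y_demo`) and assumes unshuffled `KFold` splitting with 3 folds; it is an illustration, not the notebook's own workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data: one feature with a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(90, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=90)

kf = KFold(n_splits=3)  # 3 contiguous folds, no shuffling
manual_scores = []

for train_idx, test_idx in kf.split(X_demo):
    # Train on the other folds, score (R^2) on the held-out fold
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)

# The same folds passed to cross_val_score for comparison
auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf)

print("manual:", manual_scores)
print("cross_val_score:", auto_scores)
```

Each row ends up in the test fold exactly once, so every score is computed on data the model did not see during fitting.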


In [None]:
from sklearn.model_selection import cross_val_score

# Model for cross val
lr = LinearRegression()

scores = cross_val_score(
    lr,
    x_data[["horsepower"]],
    y_data,
    cv=3
) # cv = # of folds

print(scores)

# Mean of R^2
np.mean(scores)

What if we want a little more information, like the actual predicted values  
supplied by our model *before* the $R^2$ values are calculated?  
- Enter: `cross_val_predict()`, which takes the same core arguments  
  (estimator, features, target, `cv`) as `cross_val_score()`.

In [None]:
from sklearn.model_selection import cross_val_predict

cross_p = cross_val_predict(
    lr,
    x_data[["horsepower"]],
    y_data,
    cv=3
)

cross_p[0:5]
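
As a rough illustration of what `cross_val_predict()` returns, the sketch below uses hypothetical synthetic data (`X_demo`, `y_demo` are illustrative names). It assumes we want one out-of-fold prediction per row and then a single pooled $R^2$ computed from those predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

# Hypothetical synthetic data with a noisy linear trend
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(60, 1))
y_demo = 2.0 * X_demo[:, 0] + 1.0 + rng.normal(0, 1.0, size=60)

# One prediction per row, each made by a model that did not train on that row
oof_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=3)

# A single R^2 over all out-of-fold predictions
pooled_r2 = r2_score(y_demo, oof_pred)

print("predictions shape:", oof_pred.shape)
print("pooled R^2:", pooled_r2)
```

Note that this pooled $R^2$ is not guaranteed to equal the mean of the per-fold scores from `cross_val_score()`, since each fold has its own target mean.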

### **Overfitting, Underfitting and Model Selection**  

This section will discuss how to pick the best polynomial order and problems  
that arise with selecting the wrong order polynomial.  

**Underfitting**  
*Assuming training points are coming from a polynomial function + some noise,  
and our goal of model selection is to determine the order of polynomial*

- With a (simple) linear regression model, we see our regression line slash  
  through a nonlinear graph, an obvious sign of underfitting.  
- When a model underfits, specifically in the case of applying just linear  
  regression, it means the model is too simple to fit the data.  
- Underfitting can still happen with lower order polynomial regression, even  
  though the model fit may improve.
- We will visually see a better fit when applying higher order polynomial  
  regression (assuming chosen features are accurate), especially at inflection  
  points.  

**Overfitting**  
When we move past that "sweet-spot" of orders in our polynomial linear  
regression model, we start to see overfitting.  

- A model overfits when it does extremely well tracking the training points,  
  but performs poorly at estimating the correct function (testing data).  
- The overfit will be especially dramatic in areas where there is little  
  training data; visually, you will see a lot of function oscillation.  
- Overall, the function is *too* flexible and fits the noise rather than the  
  function.  

We can also analyze the error from an array of polynomial models of  
increasing order. If we were to plot the training and test error (for  
example, the mean squared error) against the order, we would most likely  
observe the following pattern:  
- **Test Data**: The error decreases until it reaches its lowest point,  
  then increases as x (the order) increases.  
  - Anything on the left of that point is *underfitting*, anything on the  
    right is *overfitting*.
- **Training Data**: The error keeps decreasing as the degree increases.  
  Equivalently, test $R^2$ rises to a peak and then falls, while training  
  $R^2$ keeps rising.  
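
A minimal sketch of how training and test error can be compared across polynomial orders. It assumes hypothetical synthetic data drawn from a sine curve plus noise (not the auto data set), and the list of degrees is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical synthetic data: a smooth curve plus random noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(80, 1))
y_demo = np.sin(3 * X_demo[:, 0]) + rng.normal(0, 0.2, size=80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

degrees = [1, 2, 3, 5, 8, 12]
train_mse, test_mse = [], []

for d in degrees:
    # The pipeline fits the polynomial expansion on the training data only
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse.append(mean_squared_error(y_tr, model.predict(X_tr)))
    test_mse.append(mean_squared_error(y_te, model.predict(X_te)))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d:>2}: train MSE = {tr:.4f}, test MSE = {te:.4f}")
```

The degree with the lowest test MSE is the candidate to keep; where exactly that minimum falls depends on the data and the noise level.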

Our test data is what we want to pay attention to because it gives us a better  
means of estimating the error of our polynomial. **However**, even when  
choosing the best fitting polynomial, we will still have some level of error,  
or noise.  
- Noise is random, so we cannot predict all of it. Sometimes, this is referred  
  to as **irreducible error**.  
    - Other sources of error: the polynomial assumption might be wrong, the  
      sample points may have come from a different function, or, for real  
      data, the relationship may be too difficult to fit or we may not have  
      the correct type of data.  
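
To illustrate irreducible error, the sketch below scores the *true* function against noisy observations. Everything here is a hypothetical setup: the quadratic `true_f` and the noise level `sigma` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5  # assumed standard deviation of the noise

# Hypothetical "true" function and noisy observations of it
x = rng.uniform(-1, 1, size=10_000)
true_f = 1.0 + 2.0 * x - 3.0 * x**2
y = true_f + rng.normal(0, sigma, size=x.size)

# Even a perfect model (the true function itself) leaves the noise behind
mse_of_truth = np.mean((y - true_f) ** 2)

print("MSE of the true function:", mse_of_truth)
print("noise variance (sigma^2):", sigma**2)
```

The two printed values should be close to each other: no choice of polynomial order can push the expected error below the variance of the noise.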

Below, we'll see how to quickly loop through models with different polynomial  
degrees to see which is the best fit.

In [None]:
Rsqu_test = []
order = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

for n in order:
    pr = PolynomialFeatures(degree=n)

    # Fit the transformer on the training data only, then reuse it on the test data
    x_train_pr = pr.fit_transform(x_train[["horsepower"]])
    x_test_pr = pr.transform(x_test[["horsepower"]])

    lr.fit(x_train_pr, y_train)
    Rsqu_test.append(lr.score(x_test_pr, y_test))

print(Rsqu_test)

### **Ridge Regression**  

Ridge regression helps reduce overfitting. Whether overfitting comes from many  
independent variables or from polynomial regression, ridge regression  
limits it by managing the **magnitude** of the coefficients (how large they  
are allowed to get) through the **hyperparameter alpha ($\alpha$)**.  

It works by applying a penalty to large coefficients, with $\alpha$ setting how  
large or small that penalty is, to minimize prediction error while keeping the  
model simpler and more generalizable.  
- `Ridge()` goes beyond "fitting a line" by controlling the complexity of  
  that line to avoid chasing noise.
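
Written out, the penalty idea corresponds to the objective that scikit-learn documents for `Ridge`. The symbols are assumptions introduced here for illustration: $X$ is the design matrix, $y$ the target, and $w$ the coefficient vector (the intercept is not penalized).

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

The first term is the usual least-squares error; the second grows with the size of the coefficients, so a larger $\alpha$ makes large coefficients more costly. With $\alpha = 0$ the second term vanishes and the problem reduces to ordinary least squares.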

As alpha increases, the coefficients for each $x^n$ term shrink toward 0:  
- Alpha must be selected carefully.  
- **$\alpha$ = 0**: No penalty, just OLS. Overfitting most evident.  
- **Small $\alpha$**: Light regularization. May still overfit, but more stable.  
- **Medium $\alpha$**: Balanced regularization. Reduces variance, but may  
  slightly bias.
- **Large $\alpha$**: Strong regularization, heavily shrinks coefficients. As  
  they approach 0, they can under-fit the data.  
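
The shrinking effect can be checked numerically. This is a minimal sketch on hypothetical synthetic data with a degree-8 polynomial expansion; the specific alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical synthetic data: a smooth curve plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(50, 1))
y_demo = np.sin(3 * X_demo[:, 0]) + rng.normal(0, 0.2, size=50)

# Degree-8 polynomial features (the intercept is handled by Ridge itself)
X_poly = PolynomialFeatures(degree=8, include_bias=False).fit_transform(X_demo)

alphas = [0.01, 1.0, 100.0, 10_000.0]
coef_norms = []

for a in alphas:
    ridge = Ridge(alpha=a).fit(X_poly, y_demo)
    # Overall size of the coefficient vector for this alpha
    coef_norms.append(np.linalg.norm(ridge.coef_))

for a, norm in zip(alphas, coef_norms):
    print(f"alpha = {a:>8}: ||coef|| = {norm:.4f}")
```

The norm of the coefficient vector drops as alpha grows, which is the "shrink toward 0" behavior described in the list above.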

*Note: Even though the following example uses the variables `x_train, y_train, x_test, y_test`,  
it is very important to select alpha on a completely separate split of the data (**validation data**),  
taken as a subset of `x_train` and `y_train`, rather than on the test data. This applies only if we are NOT using  
`GridSearchCV()`, which runs cross-validation internally, eliminating the need for a manual split.*  

Another important clarification: if you want to tune alpha for a polynomial  
model, you must first expand the features with the polynomial  
`fit_transform()` (and plain `transform()` for validation or test data).
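
One way the validation idea could look in practice is sketched below. It assumes hypothetical synthetic data, a degree-5 expansion, and a short, arbitrary list of candidate alphas; all variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical synthetic data
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 1))
y_demo = np.sin(3 * X_demo[:, 0]) + rng.normal(0, 0.2, size=200)

# Hold out the test set first, then carve a validation set out of the rest
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Fit the polynomial expansion on the training rows only
poly = PolynomialFeatures(degree=5, include_bias=False)
X_tr_p = poly.fit_transform(X_tr)
X_val_p = poly.transform(X_val)
X_test_p = poly.transform(X_test_demo)

# Pick the alpha with the best validation R^2
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
val_scores = [
    Ridge(alpha=a).fit(X_tr_p, y_tr).score(X_val_p, y_val) for a in alphas
]
best_alpha = alphas[int(np.argmax(val_scores))]

# The test set is touched once, after alpha has been chosen
final_model = Ridge(alpha=best_alpha).fit(X_tr_p, y_tr)
test_r2 = final_model.score(X_test_p, y_test_demo)

print("best alpha:", best_alpha, "| test R^2:", test_r2)
```

Keeping the test set out of the alpha search is what lets the final score stand in for performance on unseen data.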

In [None]:
from sklearn.linear_model import Ridge
from tqdm import tqdm # Progress bar visualization

I wanted to split the data again just to keep things sanitary since there is  
a lot happening with and to our data. Important to note:  
- Polynomial feature expansions scale up very quickly, as we know, and that  
  deeply affects modeling and visualization.
- While scaling and normalization help keep the expanded values manageable,  
  overfitting is still likely, which is why Ridge regression is helpful, but:  
  - The larger our data grows with polynomial regression and/or dummy  
    variables, the more likely our **design matrix** (usually X, contains all  
    the input features used in regression) is to become ill-conditioned,  
    making the fitted coefficients unstable and unreliable.
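
One rough way to gauge ill-conditioning is the condition number of the design matrix. The sketch below assumes a single hypothetical feature with horsepower-like values (not the real column) and compares a degree-5 expansion with and without standardizing first.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical feature with values on a horsepower-like scale
rng = np.random.default_rng(0)
x_raw = rng.uniform(50, 250, size=(100, 1))

poly = PolynomialFeatures(degree=5, include_bias=False)

# Expansion of the raw values: columns range from ~1e2 up to ~1e11
cond_raw = np.linalg.cond(poly.fit_transform(x_raw))

# Standardize first, then expand
x_scaled = StandardScaler().fit_transform(x_raw)
cond_scaled = np.linalg.cond(poly.fit_transform(x_scaled))

print(f"condition number, raw:    {cond_raw:.3e}")
print(f"condition number, scaled: {cond_scaled:.3e}")
```

A larger condition number means small changes in the data can cause large changes in the fitted coefficients. Scaling lowers it here, though it does not remove the problem for every feature set.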

As it relates to ridge regression, ill-conditioned models break conventional  
visualization patterns, usually signaling:
- Exploding polynomial terms from high-degree expansion.
- Redundant or near-constant cross-terms (especially from dummy variables).
- Too few rows to support the expanded feature space.
- A breakdown in transformation logic (like applying `fit_transform()` to  
  **test/validation data**).  

For consistency I wanted to use the model I created in a previous example, but  
the combination of dummy variables and polynomial regression ruins the design  
matrix, creating an unreliable model (even with scaling).  

Later I will be covering the use of `sklearn.compose`'s `ColumnTransformer`  
class and `Pipeline` to overcome these obstacles.
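
As a preview only, here is one possible shape such a setup could take. It is a sketch under assumptions: the data frame `demo_df`, its column names, and the target are all hypothetical, and the numeric columns are scaled and expanded while the dummy column is passed through untouched.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data: two numeric columns and one 0/1 dummy column
rng = np.random.default_rng(0)
n = 120
demo_df = pd.DataFrame({
    "num_a": rng.uniform(50, 250, size=n),
    "num_b": rng.uniform(60, 300, size=n),
    "is_gas": rng.integers(0, 2, size=n),
})
demo_y = (
    100 * demo_df["num_a"]
    + 50 * demo_df["num_b"]
    + 2000 * demo_df["is_gas"]
    + rng.normal(0, 500, size=n)
)

# Scale and expand only the numeric columns
numeric_steps = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
])
preprocess = ColumnTransformer([
    ("numeric", numeric_steps, ["num_a", "num_b"]),
    ("dummy", "passthrough", ["is_gas"]),
])
model = Pipeline([("preprocess", preprocess), ("ridge", Ridge(alpha=1.0))])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_df, demo_y, test_size=0.25, random_state=0
)

# fit() learns scaling and expansion from the training rows only
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print("test R^2:", test_r2)
```

Because every transformation lives inside the pipeline, the test data only ever goes through `transform()`, which avoids the `fit_transform()` mistake listed above.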

In [None]:
# Prep variables for model

x_train1, x_test1, y_train1, y_test1 = train_test_split(
    x_data, y_data, random_state=0
)

X1 = x_train1[[
    "horsepower",
    "engine-size",
    "fuel-type-gas",
    "highway-L/100km"
]]

X2 = x_test1[[
    "horsepower",
    "engine-size",
    "fuel-type-gas",
    "highway-L/100km"
]]

SS = StandardScaler()
xtr_scaled = SS.fit_transform(X1)
xte_scaled = SS.transform(X2)

PR = PolynomialFeatures(degree=2, include_bias=False)
xtr_ptf = PR.fit_transform(xtr_scaled)
xte_ptf = PR.transform(xte_scaled)

In [None]:
# Modeling and visualization prep

Rsqu_test = []
Rsqu_train = []
Alpha = 10 * np.arange(1000)  # alpha values 0, 10, 20, ..., 9990
pbar = tqdm(Alpha)

for alpha in pbar:
    RidgeModel = Ridge(alpha=alpha)
    RidgeModel.fit(xtr_ptf, y_train1)
    test_score = RidgeModel.score(xte_ptf, y_test1)
    train_score = RidgeModel.score(xtr_ptf, y_train1)

    # Show the current scores next to the progress bar
    pbar.set_postfix({"Test Score": test_score, "Train Score": train_score})

    Rsqu_test.append(test_score)
    Rsqu_train.append(train_score)

Now we can plot $R^2$ against alpha to find the most effective alpha value!

In [None]:
width = 12
height = 10
plt.figure(figsize=(width, height))

plt.plot(Alpha, Rsqu_test, label='validation data')
plt.plot(Alpha, Rsqu_train, 'r', label='training data')
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.legend()