### **Model Evaluation and Refinement**  

In the following sections we'll learn:  
- Model evaluation  
- Overfitting, underfitting, and model selection  
- Ridge regression  
- Grid search  

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats as sts
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, root_mean_squared_error

df_data = Path().cwd().parent.parent/"Data"/"Clean_Data"/"clean_auto_df.csv"
auto_df = pd.read_csv(df_data)

---  

### **Model Evaluation**  

In-sample evaluation tells us how well our model fits the data used to  
train it, but not how well the trained model can predict new data.  

Our solution is to split our data into **in-sample data** (the training set)  
and **out-of-sample data** (the test set).  

- The test set simulates real-world data that the model has not seen.  
- Usually the larger portion of our data is used for training, let's say 70%,  
  and the remaining 30% is held out for testing.  
  - Build + train model = training data  
  - Evaluation (real-world representation) = test data  

How do we separate that data?  

- With `train_test_split()` from `sklearn.model_selection`!

In [None]:
from sklearn.model_selection import train_test_split

x_data = auto_df.drop("price", axis=1)
y_data = auto_df["price"]

x_train, x_test, y_train, y_test = train_test_split(
    x_data,
    y_data,
    test_size=0.3,
    random_state=0
)

# Map each split's name to its data so no manual counter is needed
split_data = {
    "x_train": x_train,
    "x_test": x_test,
    "y_train": y_train,
    "y_test": y_test,
}

# Report the shape of each split to confirm the 70/30 division
for label, data_set in split_data.items():
    print(f"The {label} data has a shape of: {data_set.shape}")

- **x_data**: features or independent variables.  
- **y_data**: dataset target, `auto_df["price"]`.  
- **test_size**: fraction of the data held out for testing (0.3, i.e. 30%, here).  
- **random_state**: seed for the random number generator that shuffles the  
  data, so the same split is reproduced on every run.  
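
To make the in-sample versus out-of-sample distinction concrete, here is a minimal sketch that fits a model on the training split and scores it on both splits. It uses synthetic stand-in data rather than the auto data set, and the names `X_demo`, `y_demo`, `r2_train`, and `r2_test` are illustrative assumptions. It also checks that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the auto data set):
# one feature with a noisy linear relationship to the target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(50, 250, size=(200, 1))
y_demo = 150 * X_demo[:, 0] + rng.normal(0, 3000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Splitting again with the same seed should give the identical split.
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
same_split = np.array_equal(X_tr, X_tr_again)

# Fit on the training data only, then score on both splits.
model = LinearRegression().fit(X_tr, y_tr)
r2_train = model.score(X_tr, y_tr)  # in-sample R^2
r2_test = model.score(X_te, y_te)   # out-of-sample R^2

print(f"Same split with the same seed: {same_split}")
print(f"Train R^2: {r2_train:.3f} | Test R^2: {r2_test:.3f}")
```

The test R^2 is the number to report, since it comes from rows the model never saw during fitting.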

---  

### **Generalization Performance**  

The goal of training on one part of the data and testing on another is to  
measure how well our model predicts previously unseen data.  

- The error we obtain using our testing data is an approximation of this  
  *generalization error*, our measure of *generalization performance*.

Important to note:  
- Using a lot of data for training gives us an accurate means of determining  
  how our model will perform in the real world, **but the precision will be  
  low**: with few test points, the score varies a lot from one split to the next.  
- If we use fewer data points to train the model and more to test it, **the  
  estimate of generalization performance will be less accurate (the model saw  
  less data), but it will have good precision**.  
- To overcome this, we use **cross validation**.  
  - One of the most common out-of-sample evaluation methods, it splits the  
    data set into k equal groups (each called a fold). Each fold takes one  
    turn as the test set while the remaining k - 1 folds train the model,  
    which produces an array of k scores (R^2 by default for regressors).  
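
The precision trade-off described above can be illustrated with a small simulation. This is a sketch on synthetic data, not a result from the auto data set; the helper `r2_spread` and the variables `X_sim` and `y_sim` are hypothetical names introduced here. It repeats the split many times with a small and a large test set and compares how much the test R^2 moves around.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(1)
X_sim = rng.uniform(50, 250, size=(200, 1))
y_sim = 150 * X_sim[:, 0] + rng.normal(0, 3000, size=200)


def r2_spread(test_size, n_repeats=100):
    """Return the mean and standard deviation of test R^2 over repeated splits."""
    r2_values = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_sim, y_sim, test_size=test_size, random_state=seed
        )
        fitted = LinearRegression().fit(X_tr, y_tr)
        r2_values.append(fitted.score(X_te, y_te))
    return np.mean(r2_values), np.std(r2_values)


mean_small, std_small = r2_spread(test_size=0.1)  # 20 test rows
mean_large, std_large = r2_spread(test_size=0.5)  # 100 test rows

print(f"10% test set: mean R^2 = {mean_small:.3f}, std = {std_small:.3f}")
print(f"50% test set: mean R^2 = {mean_large:.3f}, std = {std_large:.3f}")
```

A larger standard deviation means a less precise estimate: a single split with a small test set can land far from the average purely by chance.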


In [None]:
from sklearn.model_selection import cross_val_score

# Model for cross val
lr = LinearRegression()

scores = cross_val_score(
    lr,
    x_data[["horsepower"]],
    y_data,
    cv=3
) # cv = # of folds

print(scores)

# Mean of R^2
np.mean(scores)
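
For intuition, the fold rotation that `cross_val_score()` performs can be written out by hand. The sketch below assumes synthetic data (`X_cv` and `y_cv` are illustrative names) and uses `KFold` without shuffling, which is what an integer `cv` means for a regressor, so both approaches should return the same three scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(2)
X_cv = rng.uniform(50, 250, size=(90, 1))
y_cv = 150 * X_cv[:, 0] + rng.normal(0, 3000, size=90)

# Manual version: each fold is the test set once,
# and the other two folds train the model.
manual_scores = []
for train_idx, test_idx in KFold(n_splits=3).split(X_cv):
    fold_model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))

# Library version: the same procedure in one call.
library_scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=3)

print("Manual: ", np.round(manual_scores, 4))
print("Library:", np.round(library_scores, 4))
```

Because the folds are taken in row order, a data set that is sorted by some column may benefit from passing `cv=KFold(n_splits=3, shuffle=True, random_state=0)` instead of a plain integer.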

What if we want a little more information, like the actual predicted values  
supplied by our model *before* the R-squared values are calculated?  
- Enter: `cross_val_predict()`, which takes the same main arguments as  
  `cross_val_score()` (estimator, features, target, and `cv`).

In [None]:
from sklearn.model_selection import cross_val_predict

cross_p = cross_val_predict(
    lr,
    x_data[["horsepower"]],
    y_data,
    cv=3
)

cross_p[0:5]
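
Each value returned by `cross_val_predict()` comes from a model that did not see that row during training, so the predictions can be compared directly against the true targets. The following is a minimal sketch on synthetic data; `X_oof`, `y_oof`, and `oof_pred` are illustrative names, not part of the notebook above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(3)
X_oof = rng.uniform(50, 250, size=(90, 1))
y_oof = 150 * X_oof[:, 0] + rng.normal(0, 3000, size=90)

# One out-of-fold prediction per row, in the original row order.
oof_pred = cross_val_predict(LinearRegression(), X_oof, y_oof, cv=3)

# Metrics computed on the pooled out-of-fold predictions.
pooled_r2 = r2_score(y_oof, oof_pred)
pooled_rmse = np.sqrt(mean_squared_error(y_oof, oof_pred))

print(f"Predictions shape: {oof_pred.shape}")
print(f"Pooled R^2: {pooled_r2:.3f} | Pooled RMSE: {pooled_rmse:.1f}")
```

Note that a metric computed on the pooled predictions will generally differ a little from the mean of the per-fold scores returned by `cross_val_score()`, because the folds are combined before scoring rather than averaged afterwards.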