<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/106_ols-train-test-cv.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Ordinary Least Squares (OLS), Overfitting, Training / Testing, and Cross-Validation
___

In this notebook we will cover several new and important concepts of machine learning simultaneously. As you will soon discover, we have (purposefully) ignored **some crucial concepts** so far, as we have focused on the engineering of the learning machine. But there is clearly more to machine learning! 


### 🧑‍💻 <font color=green>**Your Task**</font>

Read the notebook sections and try to understand the code. Most importantly, understand the pictures and answer the questions below.

___
## Ordinary Least Squares (OLS)
In notebook 05, you already became familiar with the most standard method of regression: ordinary least squares (OLS). Recall that the OLS model consists of building a linear relationship between a target variable $\mathbf{y}$ and some features $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p$.
___
### Mathematical intuition
One of the most important assumptions of OLS (if not the most important) is that the target variable ${y}$ is related to the feature variables $x_1, x_2, \dots, x_p$ in a linear manner (at least for the purpose of making reasonably good predictions):

$${y} = b + w_1 \cdot {x}_1 + w_2 \cdot {x}_2 + \dots + w_p \cdot {x}_p + {\epsilon},$$

where ${\epsilon}$ is some randomly distributed error term with zero mean.

Our main goal is to find an estimate of the weights $b, w_1, w_2, \dots, w_p$, which we will denote by $\hat{b}, \hat{w}_1, \hat{w}_2, \dots, \hat{w}_p$. Once we have this estimate, it is easy to build a prediction:

$$\hat{{y}} = \hat{b} + \hat{w}_1 \cdot {x}_1 + \hat{w}_2 \cdot {x}_2 + \dots + \hat{w}_p \cdot {x}_p$$

However, regression folks are not very fond of that notation; they prefer Greek letters. So they would write:

$${y} = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \dots + \beta_p \cdot x_p + \epsilon$$

In some books, you will also find an $\alpha$ instead of $\beta_0$. OLS people normally call the weights *coefficients*. This simply means that they are used to multiply the features. 

It is also common to use vector and matrix notation. Let $\mathbf{\beta} = [\beta_0, \beta_1, \beta_2, \dots, \beta_p]^\top$ denote the $(p+1) \times 1$ vector of weights, and let $\mathbf x_i = [1, x_{1i}, x_{2i}, \dots, x_{pi}]^\top$ denote the $(p+1) \times 1$ vector of feature values of case $i$. Then we can also write 

$$
y_i = \mathbf \beta^\top \mathbf x_i + \epsilon_i
$$

and 

$$
\hat{y}_i = \mathbf{\hat \beta}^\top \mathbf x_i
$$
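
To make the vector notation concrete, here is a minimal sketch of a single prediction computed as an inner product. The coefficient and feature values are made up for illustration only; they are not estimates from any dataset.

```python
import numpy as np

# Hypothetical estimated coefficients [beta_0, beta_1, beta_2] (illustrative values)
beta_hat = np.array([1.0, 2.0, -0.5])
# Feature vector of one case i, with a leading 1 for the constant
x_i = np.array([1.0, 3.0, 4.0])

# Prediction as an inner product: 1.0 * 1 + 2.0 * 3 - 0.5 * 4
y_hat_i = beta_hat @ x_i
print(y_hat_i)  # 5.0
```

The leading 1 in `x_i` is what lets the constant $\beta_0$ be treated like any other coefficient.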

___
#### 🙀 🤯 Closed-form solution vs. gradient descent (you can skip this in a first reading)
We have viewed and discussed the use of gradient descent to find a minimizing solution for the MSE in the preceding notebooks. However, when estimating OLS, we do not actually need gradient descent. Why is that?

Well, as it turns out, there exists a **closed-form solution** to OLS which minimizes the mean squared error. You can think of a closed-form expression as a formula, i.e., it doesn't require running multiple steps like an algorithm but you can simply plug the variables in an equation to obtain the result. So, while we could theoretically use gradient descent to find the weights, there exists a formula which will generally be faster.

Define $\mathbf{X} = [\mathbf{x}_1 \, \mathbf{x}_2 \, \dots \, \mathbf{x}_N]^\top$ as the $N \times (p+1)$ matrix of features whose $i$-th row is $\mathbf{x}^\top_i$ (so its first column contains only ones); this is also called the "design matrix". Further, define $\mathbf y = [y_1, y_2, \dots, y_N]^\top$ as the $N \times 1$ vector of target values for all cases. The OLS estimate is then given by the matrix equation

$$\hat{\mathbf{\beta}} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}$$
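
As a rough illustration of this formula, the sketch below builds a design matrix from synthetic data (the data and the "true" coefficients are assumptions made up for this example) and compares the closed-form estimate with the one from `LinearRegression`. Instead of inverting $\mathbf{X}^\top \mathbf{X}$ explicitly, it solves the equivalent linear system with `np.linalg.solve`, which is usually the numerically safer choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two features (illustrative only)
rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=(N, 2))
y = 1.5 + 2.0 * x[:, 0] - 0.7 * x[:, 1] + rng.normal(scale=0.1, size=N)

# Design matrix: a first column of ones, followed by the features
X = np.column_stack([np.ones(N), x])

# Closed-form OLS: solve (X'X) beta = X'y rather than forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference estimate from scikit-learn (it adds the constant itself)
ols = LinearRegression().fit(x, y)
beta_sklearn = np.concatenate([[ols.intercept_], ols.coef_])

print("closed form: ", beta_hat)
print("scikit-learn:", beta_sklearn)
print("same result: ", np.allclose(beta_hat, beta_sklearn))
```

Both routes minimize the same squared error, so they should agree up to numerical precision.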

On another note, when estimating machine learning models that do use gradient descent, we never explicitly write the gradient descent algorithm ourselves. There are libraries that have been written in a very efficient manner and allow us to focus on other aspects of the problem. Why did we nevertheless do it in this course? For you to get a glimpse of what happens under the hood of a machine learning algorithm!

One more element of jargon. Regression people also like to call the **target variable** a **dependent variable**, and the **feature variables** are also called **independent variables**. This language makes more sense in classical statistics or econometrics where we link "exogenous" independent variables to an "endogenous" dependent variable, and the link is often taken as causal. For you, it's just important to know this terminology, so you can talk to statisticians :-)


___
## Data pre-processing

That's enough math for now. Let's have some fun and turn to coding!

We will work with an agricultural research dataset on U.S. crop yields consisting of only two columns: 
1. the temperature `temp`
2. the crop yield `yield`

For more information view: https://www.pnas.org/content/106/37/15594, or search the web for keywords *temperature*, and *crop yield*.

In [None]:
# Import necessary packages
import numpy as np # Numerical computation package
import pandas as pd # Dataframe package
import matplotlib.pyplot as plt # Plotting package
# Machine learning objects
from sklearn.linear_model import LinearRegression # OLS

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# Read in the crop yield dataset
crops = pd.read_csv(f"{DATA_PATH}/data/us_crops.csv")
# Sort the data by temperature, 
# this is not needed for the estimation but will help with the plotting later
crops.sort_values("temp", inplace=True)

Let's start with a short visual inspection of the data using a scatterplot.

In [None]:
# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Add scatterplot
ax.scatter(crops["temp"], crops["yield"])
# Label axes
ax.set_xlabel("Temperature")
ax.set_ylabel("Yield")
# Add a grid
ax.grid(True)

___
#### 🤔 Pause and ponder

Can you spot any patterns in the data? Can you explain the patterns you spot? Discuss with your classmates.

___
## Baseline regression with full dataset

As you already know, running a linear regression in Python using `scikit-learn` (which is imported under the name `sklearn`) is an easy task. Let us define the `yield` as our target variable, and the `temp` as our feature variable.

In [None]:
# Define our features and labels (⚠️ notice the double bracket for features ⚠️)
X, y = crops[["temp"]], crops["yield"] 
# Define the estimator
ols1 = LinearRegression() 

In [None]:
# Fit the estimator 
ols1.fit(X, y)
# Add the predictions to our `crops` dataframe
crops["pred"] = ols1.predict(X)

# Get value of constant and coefficient
print(f"constant: {ols1.intercept_:.2f}, coefficient: {ols1.coef_[0]:.2f}.")

In [None]:
# Visualize our predictions
# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Add scatterplot
ax.scatter(crops["temp"], crops["yield"], label="Data")
ax.plot(crops["temp"], crops["pred"], "-o", label="Prediction (1)", color="orange")
# Label axes
ax.set_xlabel("Temperature")
ax.set_ylabel("Yield")
# Add a grid
ax.grid(True)
# Add a legend
ax.legend()

___
#### 🤔 Pause and ponder
Notice how our predictions follow a straight line. Is this what you would have expected? Why? Why not? Think about the coefficients estimated by OLS, what are they?
___

In [None]:
# Compute the MSE of our model
mse = np.mean((crops["pred"] - crops["yield"]) ** 2)
mse # Display the MSE

The MSE can sometimes be hard to interpret due to the squares, but we can also look at other measures, e.g., the **mean absolute error (MAE)**.

In [None]:
# Compute the MAE of our model
mae = np.mean(np.abs(crops["pred"] - crops["yield"]))
mae # Display the MAE

This number is much easier to interpret, e.g., we can compare it to the mean of the `yield`.

In [None]:
crops["yield"].mean() # Display the mean of the yield

Another useful model fit metric is the coefficient of determination, the R². The R² tells us, as a percentage, how much of the variance in the target (dependent) variable is explained by our model. Intuitively, a model with an R² close to one (100%) fits the data well, while a model with an R² close to zero (0%) fits the data poorly. Another good way of thinking about the R² is to know that:
+ An R² of 100% implies that our model predicts perfectly (but not necessarily on new data!)
+ An R² of 0% implies that our model does just as well as if we simply used the mean of the target to make our predictions.
+ A negative R² implies that we are doing worse than if we just used the mean to make predictions.

The R² is directly implemented as the `.score` method of the `LinearRegression` object in `scikit-learn`.
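
The three cases above can be checked with a small hand-written version of the metric. This is only a sketch on synthetic data: the helper `r_squared` and the variables `X_demo` and `y_demo` are names introduced here for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r_squared(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = 3.0 + 0.5 * X_demo[:, 0] + rng.normal(scale=1.0, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# R² of the fitted model, computed by hand and by scikit-learn
r2_manual = r_squared(y_demo, model.predict(X_demo))
r2_sklearn = model.score(X_demo, y_demo)

# Always predicting the mean of the target gives an R² of zero
r2_mean = r_squared(y_demo, np.full_like(y_demo, y_demo.mean()))

# Predictions that are worse than the mean give a negative R²
r2_bad = r_squared(y_demo, np.full_like(y_demo, y_demo.mean() + 5.0))

print(r2_manual, r2_sklearn, r2_mean, r2_bad)
```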

In [None]:
ols1.score(X, y) # Display the R² of our model

___
#### 🤔 Pause and ponder
Are you happy with the model we set up? What do you think of the results? Does it match the relationship you expected from the first task? Discuss with your classmates.

___
## Polynomial regressions
Perhaps you feel that the relationship between crop yield and temperature can be better represented by a curve rather than a straight line. In such a case, we can still use linear regression: the model only needs to be **linear** in its coefficients, not in the original features. The trick is to introduce higher-order polynomials of the features, e.g., we can use the temperature squared.
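
Here is a minimal sketch of that trick on made-up numbers. It first shows what `PolynomialFeatures` does to a tiny array, and then fits a plain `LinearRegression` on the expanded features of a noise-free quadratic curve. The arrays and the coefficients 1, 2 and -0.5 are assumptions chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Expanding a single feature into [x, x^2]
x_small = np.array([[2.0], [3.0]])
expanded = PolynomialFeatures(2, include_bias=False).fit_transform(x_small)
print(expanded)  # rows: [2, 4] and [3, 9]

# A noise-free quadratic curve: y = 1 + 2x - 0.5x^2
x_grid = np.linspace(-3, 3, 30).reshape(-1, 1)
y_curve = 1.0 + 2.0 * x_grid[:, 0] - 0.5 * x_grid[:, 0] ** 2

# The regression is still linear in the coefficients of x and x^2
X_quad = PolynomialFeatures(2, include_bias=False).fit_transform(x_grid)
quad = LinearRegression().fit(X_quad, y_curve)
print(quad.intercept_, quad.coef_)
```

Since the data contain no noise, the estimated constant and coefficients match the ones used to generate the curve up to numerical precision.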

___
#### 🤔 Pause and ponder
Why would adding a squared feature help to estimate a nonlinear relationship between temperature and crop yield? What happens if the coefficient on the *simple* feature (baseline, not squared, i.e. $x$) is positive and the coefficient on the squared feature ($x^2$) is slightly negative? Can you try to picture it in your head? Or perhaps even sketch it on a sheet of paper?
___

In [None]:
# scikit-learn provides a convenient transformer to compute polynomial terms of our features
from sklearn.preprocessing import PolynomialFeatures

In any case, let's go ahead and extend our features with the squared temperature. Our `y` stays the same as before.

In [None]:
# Polynomial terms up to degree 2, without the constant (bias) column
poly2 = PolynomialFeatures(2, include_bias=False) 
# Define a new X with the squared feature
X2 = poly2.fit_transform(X)
X2[:5, :] # Check the first five rows of our new features

In [None]:
# Define a new OLS object
ols2 = LinearRegression()

In [None]:
# Fit the estimator 
ols2.fit(X2, y)
# Add the predictions to our `crops` dataframe
crops["pred2"] = ols2.predict(X2)

In [None]:
# Visualize our predictions
# Set up the canvas
fig, ax = plt.subplots(figsize=(12, 8))
# Add scatterplot
ax.scatter(crops["temp"], crops["yield"], label="Data")
ax.plot(crops["temp"], crops["pred"], "-o", label="Prediction (1)", color="orange")
ax.plot(crops["temp"], crops["pred2"], "-o", label="Prediction (2)", color="green")
# Label axes
ax.set_xlabel("Temperature")
ax.set_ylabel("Yield")
# Add a grid
ax.grid(True)
# Add a legend
ax.legend()

In [None]:
# Compute the MSE of our new model
mse2 = np.mean((crops["pred2"] - crops["yield"]) ** 2)
mse2 # Display the MSE

In [None]:
# Compute the MAE of our new model
mae2 = np.mean(np.abs(crops["pred2"] - crops["yield"]))
mae2 # Display the MAE

In [None]:
ols2.score(X2, y) # Display the R² of our new model

It seems hard to argue that this new model including the squared temperature does not perform better than the one with only the temperature. So, what if we add more polynomials instead of only the squared temperature? Can we do even better? Let's see...

___
#### ➡️ ✏️<font color=green>**Question 1**</font>

1. Create an `X3` array with a higher-order polynomial of your liking. There is a catch, however. If you raise the values of `temp` to a high power (say 10 or 50), this will yield numbers that are higher than the numerical limits of `numpy`. So you will not get very interesting results. To prevent this, first define a function `standardize = lambda x: (x - x.mean()) / x.std()` and apply it to `X`. Then calculate the polynomial terms in analogy to the code above. Call the result `X3`. Finally, standardize all the columns of `X3` using `np.apply_along_axis(standardize, 0, X3)`. Why does the standardization solve the numerical limit problem?
2. Create an `ols3` model which uses this new (and standardized) `X3` to estimate a linear regression.
3. Compute the MSE, MAE, R².
4. Plot the results of your new model, compare the results to `ols1` and `ols2`.
5. Try to find the best model. If you were tasked with predicting the effect of an increase in average temperature by, say, 1 degree Celsius, which model would you choose? Why?

In [None]:
# Enter your code below


___
## Investigating model quality using a validation set

We have now used different OLS models and observed how they perform differently with respect to the MSE, MAE, and R² measures. There is, however, a major caveat in the approach we have used. If you think about it, for every one of the three model specifications, we have fitted our model on the data and then used this very same data to measure how well our model performs. Why is this a problem?


___
#### ➡️ ✏️<font color=green>**Question 2**</font>

Suppose that we had built a model where we end up with predicting the exact values of the target variable, i.e. $\hat{y}_i = y_i$ for all cases $i$. 
1. What could be problematic with such a model? **It has an MSE of zero!** So isn't it perfect?
2. Do you think such a model with an MSE of zero exists? Can you construct one? Try for a few minutes!
___

<br/><br/>
### Splitting the data
We now split the data into two subsets:

#### Train(ing) set  
This is the subset of data which we use to fit our model, i.e., estimate the parameters of our models, such as the weights $\hat{\mathbf{w}}$. In essence, what we did above is to use the full dataset as a training set. (This is not a good practice, however!)

#### Test(ing) set  
The test set refers to the subset of data which we use to test how our model performs **out-of-sample**. Since the model was fitted on the training data, the training data results are the **in-sample** results, i.e., the results *within the trained sample*. However, from the point of view of our model, the test data is completely new, it was never seen before. So the performance on this new data indicates how well our model is able to generalize what was learned to a new dataset, i.e., we talk about **out-of-sample** performance.
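
To see the difference between in-sample and out-of-sample errors in action, here is a small sketch on synthetic data (a noisy sine curve invented for this example, not the crop data). It fits polynomials of degree 1, 3, and 9 on a training subset and reports the MSE on both subsets. The names `x_demo`, `y_demo`, `train_mse`, and `test_mse` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a noisy sine curve (illustrative only)
rng = np.random.default_rng(42)
x_demo = rng.uniform(-1, 1, size=(60, 1))
y_demo = np.sin(3 * x_demo[:, 0]) + rng.normal(scale=0.3, size=60)

# Hold out 30% of the cases; the model never sees them during fitting
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0)

train_mse, test_mse = [], []
for degree in (1, 3, 9):
    poly = PolynomialFeatures(degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(x_tr), y_tr)
    # In-sample error: measured on the data used for fitting
    train_mse.append(np.mean((model.predict(poly.transform(x_tr)) - y_tr) ** 2))
    # Out-of-sample error: measured on the held-out data
    test_mse.append(np.mean((model.predict(poly.transform(x_te)) - y_te) ** 2))
    print(f"degree {degree}: train MSE {train_mse[-1]:.3f}, test MSE {test_mse[-1]:.3f}")
```

The training MSE can only go down (or stay equal) as we add polynomial terms, because each model contains the previous one as a special case. The test MSE carries no such guarantee, which is exactly why we need it.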

So let's use some code to split our data. A good split ratio depends on how much data you have, but in general we use something like 60% to 80% of the data for the training set.

In [None]:
# sklearn provides a nifty function to split train/test sets
from sklearn.model_selection import train_test_split

In [None]:
X.shape # Display the size of X

In [None]:
# Split the dataset into train and test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=72)

In [None]:
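# Define a new Xtrain and Xtest with the squared feature
X2train = poly2.fit_transform(Xtrain)
# PolynomialFeatures learns nothing from the data values, but calling
# `transform` makes explicit that the test set is not used for fitting
X2test = poly2.transform(Xtest)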

In [None]:
# Define a new Xtrain and Xtest with the higher-order polynomial features
# NOTE: poly3 is the polynomial from your solution to Question 1 above.

Xtrain_s = standardize(Xtrain)
X3train = poly3.fit_transform(Xtrain_s)
X3train = np.apply_along_axis(standardize,0, X3train)

Xtest_s = standardize(Xtest)
X3test = poly3.transform(Xtest_s)
X3test = np.apply_along_axis(standardize,0, X3test)


In [None]:
Xtrain.shape # Display the size of the train set

In [None]:
Xtest.shape # Display the size of the test set

### Training the models
Alright. We have a train set of roughly 70% of our original dataset and a test set consisting of the rest. Let's go ahead and fit some models on the train set.
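
A side note on the preprocessing: the cells above standardize the test set with its own mean and standard deviation. A common alternative, which is not what this notebook does, is to learn every preprocessing parameter on the train set only and reuse it on the test set. The sketch below shows one way to do that with a `scikit-learn` pipeline on synthetic data; the data, the polynomial degree of 3, and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for a temperature/yield relationship (illustrative only)
rng = np.random.default_rng(3)
x_demo = rng.uniform(10, 30, size=(80, 1))
y_demo = (50 + 4 * x_demo[:, 0] - 0.1 * x_demo[:, 0] ** 2
          + rng.normal(scale=2.0, size=80))

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0)

# Standardize, expand to polynomial terms, standardize again, then run OLS
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(3, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
# All means and standard deviations are estimated on the train set only
pipe.fit(x_tr, y_tr)

# The test set is transformed with the train-set parameters
print("train R²:", pipe.score(x_tr, y_tr))
print("test R²: ", pipe.score(x_te, y_te))
```

Note that `StandardScaler` divides by the population standard deviation, so its output differs by a constant factor from the `standardize` helper of Question 1 when that helper is applied to `pandas` objects.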

In [None]:
# Define the estimator
ols1_t = LinearRegression() 
# Fit the estimator (⚠️ notice the fit on the train data only ⚠️)
ols1_t.fit(Xtrain, ytrain)

In [None]:
# Define the estimator
ols2_t = LinearRegression() 
# Fit the estimator (⚠️ notice the fit on the train data only ⚠️)
ols2_t.fit(X2train, ytrain)

In [None]:
# Define the estimator
ols3_t = LinearRegression() 
# Fit the estimator (⚠️ notice the fit on the train data only ⚠️)
ols3_t.fit(X3train, ytrain)

### Evaluating the models
The models are set up, let us now compute the model metrics on the train and test sets in order to evaluate their performance.

In [None]:
# Define some lists to help us compute the metrics
model_list = [ols1_t, ols2_t, ols3_t]
Xtrain_list = [Xtrain, X2train, X3train]
ytrain_list = [ytrain for _ in range(3)]
Xtest_list = [Xtest, X2test, X3test]
ytest_list = [ytest for _ in range(3)]

# # In case you are confused by the list comprehensions
# print("ytrain:", ytrain)
# print("\n\n List comprehension:\n\n", [ytrain for _ in range(3)])

In [None]:
# Helpers to compute MSE, MAE, R2
compute_mse = lambda m, X, y: np.mean((m.predict(X) - y) ** 2)
compute_mae = lambda m, X, y: np.mean(np.abs(m.predict(X) - y))
compute_r2  = lambda m, X, y: m.score(X, y)

In [None]:
# Compute the metrics into lists for plotting
# OLS with single feature
ols1_results = {
    "train": [f(ols1_t, Xtrain, ytrain) for f in [compute_mse, compute_mae, compute_r2]],
    "test": [f(ols1_t, Xtest, ytest) for f in [compute_mse, compute_mae, compute_r2]]
}
# OLS with 2 features
ols2_results = {
    "train": [f(ols2_t, X2train, ytrain) for f in [compute_mse, compute_mae, compute_r2]],
    "test": [f(ols2_t, X2test, ytest) for f in [compute_mse, compute_mae, compute_r2]]
}
# OLS with multiple features
ols3_results = {
    "train": [f(ols3_t, X3train, ytrain) for f in [compute_mse, compute_mae, compute_r2]],
    "test": [f(ols3_t, X3test, ytest) for f in [compute_mse, compute_mae, compute_r2]]
}

# OK, these list comprehensions may feel a little dense, but you see that they are very elegant and practical. Do you understand them?
# Here is their output:

print("ols1: ", ols1_results)
print("\nols2: ", ols2_results)
print("\nols3: ", ols3_results)

In [None]:
# Now we present these results as plots (🙀 🤯 this code is quite complicated, no need to focus on it for now)
fig, axs = plt.subplots(1, 3, figsize=(18, 6))
width = .3 # Bar width
for i in range(3):
    # Only label the bars of the first subplot to avoid duplicate legend entries
    labs = [f"OLS {k}" for k in range(1, 4)] if i == 0 else ["" for _ in range(1, 4)]
    axs[i].bar(0, ols1_results["train"][i], width, label=labs[0], color="blue")
    axs[i].bar(0 + width, ols2_results["train"][i], width, label=labs[1], color="orange")
    axs[i].bar(0 + 2 * width, ols3_results["train"][i], width, label=labs[2], color="green")
    axs[i].bar(1, ols1_results["test"][i], width, color="blue")
    axs[i].bar(1 + width, ols2_results["test"][i], width, color="orange")
    axs[i].bar(1 + 2 * width, ols3_results["test"][i], width, color="green")
# Plot titles
axs[0].set_title("Mean Squared Error")
axs[1].set_title("Mean Absolute Error")
axs[2].set_title("R²")
# Labels and legend
for ax in axs:
    ax.set_xticks([width, width + 1], ["Train", "Test"])
fig.legend()

___
#### ➡️ ✏️<font color=green>**Question 3**</font>

1. What is going on in the plots? Why is OLS 3 performing this badly on the test set? Is it even possible for an R² to be negative? 
2. Is it also possible (at least theoretically) that the error on the testing set is smaller than that of the training set?

___
## Validation and cross-validation
It's a natural question to ask what degree of the polynomial delivers the best model in the sense of optimizing the performance on data that has not been used to derive the estimated (trained) weights or coefficients. This means asking the question: *what is the best model out-of-sample?*

We have already mentioned a **validation set** above; we are now going to explore this concept in more depth. The difference between **validation** and **testing** can be difficult to grasp at first. In fact, it is so difficult to grasp that more often than not, the terms are used interchangeably. Even [Wikipedia has a section on this confusion in terminology](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets). All in all, understanding the difference in semantics is less important than understanding the main concept. Just be aware that when you hear about *validation* and *testing*, it might mean something different, depending on the source.

In general, the process is the following:
1. **Training**: we estimate the parameters of a model on some data.
2. **Validation**: we select the best model (*validate*) based on how it performs on some other data that has not been used yet.
3. **Testing**: we assess the performance of the model. At this point, the model is final, you can think of this as *observing how your chosen model might perform in the real world*.
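
One possible way to obtain the three subsets for this process is to call `train_test_split` twice. The sketch below uses a toy array of 100 cases and a 60/20/20 split; both the data and the split ratios are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with 100 cases (illustrative only)
X_all = np.arange(100, dtype=float).reshape(-1, 1)
y_all = 2.0 * X_all[:, 0]

# First split: put 20% of the data aside as the final test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0)

# Second split: 25% of the remaining 80 cases become the validation set
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 60 20 20
```

Models are then fitted on the training part, compared on the validation part, and only the selected model is evaluated once on the test part.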

**Cross-validation** is then simply an advanced method of validation, where we split the data into training and validation sets multiple times. This generally gives us more robust estimates of out-of-sample performance, and therefore a more reliable model choice. Cross-validation is generally the preferred method in machine learning.
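
To demystify the procedure, the sketch below writes a 5-fold cross-validation loop by hand using `KFold` and compares it with the `cross_val_score` helper from `scikit-learn`. The synthetic data and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (illustrative only)
rng = np.random.default_rng(7)
X_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = 1.0 + 0.8 * X_demo[:, 0] + rng.normal(scale=1.0, size=50)

# Five folds: each fold serves exactly once as the validation set
kfold = KFold(n_splits=5)

manual_mae = []
for train_idx, val_idx in kfold.split(X_demo):
    # Fit on the four training folds ...
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    # ... and measure the error on the held-out fold
    pred = model.predict(X_demo[val_idx])
    manual_mae.append(np.mean(np.abs(pred - y_demo[val_idx])))

# The same computation with the scikit-learn helper (it returns negative MAEs)
sklearn_mae = -cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=kfold,
    scoring="neg_mean_absolute_error")

print(np.round(manual_mae, 3))
print(np.round(sklearn_mae, 3))
```

Averaging the five values gives the cross-validation estimate of the out-of-sample MAE.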

`scikit-learn` provides a lot of helpful functions for machine learning, and, of course, [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html) is also part of this toolkit. So let's dive into the code and see how we might use cross-validation to find the optimal polynomial degree to include in our model!

In [None]:
from sklearn.model_selection import cross_val_score # cross-validation function

The `cross_val_score` function is the default `scikit-learn` function for cross-validation. We will first demonstrate its use for our simple OLS model with a single variable and observe the mean absolute error over the iterations of a 5-fold cross-validation.

In [None]:
# cv = 5, implies that we use 5 folds, the function will return 
# the negative mean absolute error for each fold, so we have to take its 
# negative value again to obtain the MAE
mae_cv = -cross_val_score(ols1_t, Xtrain, ytrain, cv=5, scoring="neg_mean_absolute_error")
print(f"Cross-validation mean absolute error: {np.mean(mae_cv):.2f} (± {np.std(mae_cv)/np.sqrt(5):.2f})")

So let's now use this to actually select the polynomial degree we want to include in our model...

We will run a loop over polynomial degrees from 1 to 20 and then select the model with the lowest cross-validation MAE as our final model.

In [None]:
# Create lists to keep track of the results
mae_mean_list = []
mae_se_list = []
# Instantiate a LinearRegression model
ols_cv = LinearRegression()
# Loop over the polynomials
for p in range(1, 21):
    
    # Compute polynomials
    poly = PolynomialFeatures(p, include_bias=False)
    X_cv = poly.fit_transform(Xtrain_s)
    X_cv = np.apply_along_axis(standardize,0, X_cv)
    
    # Run cross-validation
    mae_cv = -cross_val_score(ols_cv, X_cv, ytrain, cv=5, scoring="neg_mean_absolute_error")
    
    # Store the mean and s.e. of the 5-folds
    mae_mean_list.append(np.mean(mae_cv))
    mae_se_list.append(np.std(mae_cv) / np.sqrt(5))

In [None]:
# Plot the results
fig, ax = plt.subplots(figsize=(12, 8))
# Errorbar plot of mean MAE and standard error of the mean MAE
ax.errorbar(range(1, 21), mae_mean_list, yerr=mae_se_list)
# Single red dot for best result
best = np.argmin(mae_mean_list)
ax.scatter(best+1, mae_mean_list[best], color="red", s=100, label="Best")
# Add labels, ticks, legend, grid
ax.set_xlabel("Number of polynomials")
ax.set_ylabel("Mean CV MAE")
ax.legend()
ax.grid(True)
ax.set_xticks(range(1, 21), range(1, 21))

According to our cross-validation, the best model is the one with polynomials up to the fourth order.

___
#### ➡️ ✏️<font color=green>**Question 4**</font>

1. What does this actually mean, that the best model is one with a polynomial of fourth order? Is this a guarantee that the model with polynomials up to the fourth order will also perform best on a test set? 
2. What if, instead of 100 data points in total, we had 1'000? What about 10'000, 100'000, and 1'000'000? Would this change anything about how confident you feel about results from cross-validation?

___
To conclude, let's have a quick look at how this *winning* model performs on the test set. Very similar to above, we'll just compare it to the *simple* OLS model and the model with polynomials of second order.

In [None]:
# Define model
ols_best = LinearRegression()
# Compute polynomials
poly_best = PolynomialFeatures(4, include_bias=False)

Xtrain_best = poly_best.fit_transform(Xtrain_s)
Xtrain_best = np.apply_along_axis(standardize,0, Xtrain_best)

Xtest_best = poly_best.transform(Xtest_s)
Xtest_best = np.apply_along_axis(standardize,0, Xtest_best)

# Fit model
ols_best.fit(Xtrain_best, ytrain)
# OLS with 4 polynomial features (temperature up to the fourth power)
ols_best_results = {
    "train": [f(ols_best, Xtrain_best, ytrain) for f in [compute_mse, compute_mae, compute_r2]],
    "test": [f(ols_best, Xtest_best, ytest) for f in [compute_mse, compute_mae, compute_r2]]
}

print("ols_best_results: train: ", ols_best_results["train"])
print("\nols_best_results: test: ", ols_best_results["test"])


In [None]:
# Plot the metrics (🙀 🤯 same code as above, no need to focus on it for now)
fig, axs = plt.subplots(1, 3, figsize=(18, 6))
width = .3 # Bar width
for i in range(3):
    labs = ["OLS 1", "OLS 2", "Best CV"] if i == 0 else ["" for _ in range(1, 4)]
    axs[i].bar(0, ols1_results["train"][i], width, label=labs[0], color="blue")
    axs[i].bar(0 + width, ols2_results["train"][i], width, label=labs[1], color="orange")
    axs[i].bar(0 + 2 * width, ols_best_results["train"][i], width, label=labs[2], color="green")
    axs[i].bar(1, ols1_results["test"][i], width, color="blue")
    axs[i].bar(1 + width, ols2_results["test"][i], width, color="orange")
    axs[i].bar(1 + 2 * width, ols_best_results["test"][i], width, color="green")
# Plot titles
axs[0].set_title("Mean Squared Error")
axs[1].set_title("Mean Absolute Error")
axs[2].set_title("R²")
# Labels and legend
for ax in axs:
    ax.set_xticks([width, width + 1], ["Train", "Test"])
fig.legend()

___
#### ➡️ ✏️<font color=green>**Question 5**</font>

1. While the model selected by cross-validation clearly outperforms the other two on the training set, it does not perform very well on the testing set. As a matter of fact, the model selected as optimal by the cross-validation method is really disappointing. Why could that be the case?
2. Rerun the entire analysis with the data set `us_crops_900.csv`, instead of `us_crops.csv`. What is different? What do you conclude about cross-validation and testing?