## Model Evaluation
Model evaluation tells us how our model performs in the real world. In-sample evaluation tells us how well the model fits the data used to train it, but it does not give us an estimate of how well the trained model predicts new data. The solution is to split the data: the in-sample data, or training data, is used to train the model, and the rest, called test data, is held out as out-of-sample data. The test data is then used to approximate how the model performs in the real world.

When we split a data set, the larger portion of the data (for example, 70%) is usually used for training and the smaller portion (30%) for testing.

We use a training set to build a model and discover predictive relationships. We then use a testing set to evaluate model performance. When we have completed testing our model, we should use all the data to train the model.

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import Ridge

The `train_test_split` function randomly splits a dataset into training and testing subsets:
> xTrain, xTest, yTrain, yTest = train_test_split(xData, yData, test_size=0.3, random_state=0)

**yData** - is the target variable ("price" in the car appraisal example)

**xData** - the predictor variables (all the other variables in the car data set that we use to predict the price).

The function returns four subsets:
- "xTrain" and "yTrain" - the subsets for training
- "xTest" and "yTest" - the subsets for testing

Two arguments control the split:
- "test_size" - the fraction of the data used for the testing set (0.3, that is, 30%)
- "random_state" - a seed for the random splitting, so that the same split can be reproduced
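
The split described above can be tried on a small synthetic dataset. This is a minimal sketch: the random `xData` and `yData` arrays are stand-ins invented for illustration, not the car data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 100 rows, 3 predictor columns.
rng = np.random.default_rng(0)
xData = rng.normal(size=(100, 3))
yData = xData @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 30% of the rows for testing; the seed makes the split repeatable.
xTrain, xTest, yTrain, yTest = train_test_split(
    xData, yData, test_size=0.3, random_state=0
)

print(xTrain.shape, xTest.shape)  # (70, 3) (30, 3)
```

Calling the function again with the same `random_state` gives the same rows in each subset.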

## Generalization error
Generalization error is a measure of how well our model does at predicting previously unseen data. The error we obtain using our testing data is an approximation of this error.

Using most of the data for training gives us an accurate means of determining how our model will perform in the real world, but the precision of the estimate will be low, because a small test set makes the result vary from one split to the next. If we use fewer data points to train the model and more to test it, the accuracy of the generalization estimate will be lower, but its precision will be good.

To overcome this problem, we use cross-validation, one of the most common out-of-sample evaluation methods. The dataset is split into k equal groups; each group is referred to as a fold. Some of the folds are used as a training set to train the model, and the remaining folds are used as a test set to test it. For example, with 4 folds we can use three folds for training and one fold for testing. This is repeated until each fold has been used for both training and testing. At the end, we use the average of the results as the estimate of out-of-sample error.
The evaluation metric depends on the model. For a regression model, for example, the default is the R-squared.
> scores = cross_val_score(lr, xData, yData, cv=3)

- lr - the model we are using to do the cross-validation.
- xData - the predictor variable data
- yData - the target variable data
- cv - to manage the number of partitions. Here, cv = 3, which means the data set is split into 3 equal partitions.

The function returns an array of scores, one for each partition that was chosen as the testing set.
We can average the result together to estimate out-of-sample R-squared using the mean function in numpy.

`np.mean(scores)`
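
To make the folding procedure concrete, the sketch below runs `cross_val_score` and then repeats the same three-fold procedure by hand with `KFold`. The synthetic data and the choice of `LinearRegression` for `lr` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption): 90 rows, 2 predictors.
rng = np.random.default_rng(0)
xData = rng.normal(size=(90, 2))
yData = 3.0 * xData[:, 0] - 2.0 * xData[:, 1] + rng.normal(scale=0.5, size=90)

lr = LinearRegression()

# Built-in version: one R-squared per fold.
scores = cross_val_score(lr, xData, yData, cv=3)

# The same procedure written out: train on two folds, score on the third.
manualScores = []
for trainIdx, testIdx in KFold(n_splits=3).split(xData):
    lr.fit(xData[trainIdx], yData[trainIdx])
    manualScores.append(lr.score(xData[testIdx], yData[testIdx]))

print(scores)
print(np.mean(scores))  # estimate of the out-of-sample R-squared
```

For a regression model and an integer `cv`, the folds are consecutive blocks of rows without shuffling, which is why the manual loop reproduces the same scores.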

What if we want a little more information: what if we want to know the actual predicted values supplied by our model before the R squared values are calculated? To do this, we use the cross_val_predict() function.
> yHat = cross_val_predict(lr, xData, yData, cv=3)

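The inputs are the same, but the output is an array of predictions, one for each row of the data. Each prediction is made by a model trained on the folds that do not contain that row.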
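
A short sketch of the call, again on synthetic data with a `LinearRegression` model standing in for the estimator (both are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (an assumption): 90 rows, 2 predictors.
rng = np.random.default_rng(0)
xData = rng.normal(size=(90, 2))
yData = 3.0 * xData[:, 0] - 2.0 * xData[:, 1] + rng.normal(scale=0.5, size=90)

lr = LinearRegression()

# Each row is predicted by a model that never saw that row during training.
yHat = cross_val_predict(lr, xData, yData, cv=3)

print(yHat.shape)  # one prediction per row of xData
print(yHat[:5])
```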

## Overfitting, Underfitting and Model Selection:
- Underfitting -  where the model is too simple to fit the data.
- Overfitting - where the model is too flexible and fits the noise rather than the function.

Plotting R^2 against the order of the polynomial, for both the training and the test data, helps us select the right model.

We can calculate different R-squared values as follows:

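In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

lr = LinearRegression() # We assume lr is a linear regression model.
RsquTest = [] # We create an empty list to store the R-squared values.
order = [1,2,3,4] # We create a list containing different polynomial orders.
for n in order:
    pr = PolynomialFeatures(degree = n) # We create a polynomial feature object with the order of the polynomial as a parameter
    xTrainPr = pr.fit_transform(xTrain[["horsepower"]]) # We fit the transformer on the training data (double brackets keep the input two-dimensional)
    xTestPr = pr.transform(xTest[["horsepower"]]) # We apply the same transformation to the test data
    lr.fit(xTrainPr, yTrain) # We fit the regression model using the transformed data.
    RsquTest.append(lr.score(xTestPr,yTest)) # We then calculate the R-squared using the test data and store it in the list.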
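
The comparison between training and test R-squared can be sketched end to end on synthetic data. Everything about the data here (a noisy cubic in one feature) is an assumption chosen so that the effect of the polynomial order is visible; it is not the car data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption): a noisy cubic in one feature.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = x[:, 0] ** 3 - 2 * x[:, 0] + rng.normal(scale=2.0, size=200)

xTrain, xTest, yTrain, yTest = train_test_split(
    x, y, test_size=0.3, random_state=0
)

orders = [1, 2, 3, 4, 5, 6]
rsquTrain, rsquTest = [], []
for n in orders:
    pr = PolynomialFeatures(degree=n)
    xTrainPr = pr.fit_transform(xTrain)  # fit the transformer on training data only
    xTestPr = pr.transform(xTest)        # reuse it for the test data
    lr = LinearRegression().fit(xTrainPr, yTrain)
    rsquTrain.append(lr.score(xTrainPr, yTrain))
    rsquTest.append(lr.score(xTestPr, yTest))

for n, tr, te in zip(orders, rsquTrain, rsquTest):
    print(f"order {n}: train R^2 = {tr:.3f}, test R^2 = {te:.3f}")
```

The training R-squared can only go up as the order increases, so it is the test R-squared that indicates where additional flexibility stops helping.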

## Ridge regression:
Higher-degree polynomials often have coefficients of very large magnitude. Ridge regression controls the magnitude of these polynomial coefficients by introducing the parameter alpha, which we select before fitting or training the model. As alpha increases, the coefficients get smaller; this is most evident for the higher-order polynomial features. Alpha must be selected carefully: if it is too large, the coefficients approach zero and the model underfits the data; if it is zero, the overfitting is evident.

In order to select alpha we use cross-validation.

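In [None]:
ridgeModel = Ridge(alpha=0.1) # We create a Ridge regression object, setting alpha.
ridgeModel.fit(x,y) # We fit the model; x and y stand for the predictor and target data.
yHat = ridgeModel.predict(x) # We make a prediction.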

To determine the parameter alpha, we use some data for training and a second set called validation data; this is similar to test data, but it is used to select parameters like alpha. We start with a small value of alpha, train the model, make a prediction using the validation data, then calculate the R-squared and store the value. We then repeat the process for a larger value of alpha: we train the model again, make a prediction using the validation data, and store the R-squared.

We repeat the process for each remaining alpha value and select the value of alpha that maximizes the R-squared.

Note that we can use other metrics to select the value of alpha, such as mean squared error.

The overfitting problem is even worse if we have lots of features. Increasing alpha may give a higher R-squared on the validation data, while the same alpha value produces a lower R-squared on the training data, because it prevents overfitting.
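
The selection loop described in this section might look like the sketch below. The synthetic data, the degree-8 polynomial, the list of candidate alphas, and the use of a pipeline with `StandardScaler` are all assumptions made for illustration; scaling is added only so that one alpha penalizes all polynomial terms on a comparable scale.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption): a noisy sine curve in one feature.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=120)

# Hold out validation data that is used only to choose alpha.
xTrain, xVal, yTrain, yVal = train_test_split(
    x, y, test_size=0.3, random_state=0
)

alphas = [0.001, 0.01, 0.1, 1, 10, 100]  # hypothetical candidate values
valScores = []
for alpha in alphas:
    model = make_pipeline(
        PolynomialFeatures(degree=8, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),
    )
    model.fit(xTrain, yTrain)
    valScores.append(model.score(xVal, yVal))  # R-squared on validation data

# Keep the alpha with the highest validation R-squared.
bestAlpha = alphas[int(np.argmax(valScores))]
print(bestAlpha, max(valScores))
```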

## Grid search:
Grid search allows us to scan through multiple free parameters with a few lines of code. Parameters like the alpha term discussed in the previous section are not learned during the fitting or training process. These values are called hyperparameters. Scikit-learn has a means of automatically iterating over these hyperparameters using cross-validation. This method is called grid search.

Grid search takes the model or objects you would like to train and different values of the hyperparameters.
It then calculates the mean squared error or R-squared for the various hyperparameter values, allowing you to choose the best values.

To select the hyperparameter, we split our dataset into three parts: the training set, the validation set, and the test set. We train the model for different hyperparameter values and compute the R-squared or mean squared error for each model on the validation set. We select the hyperparameter that minimizes the mean squared error or maximizes the R-squared on the validation set. Finally, we test the performance of the selected model using the test data.
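
A minimal sketch of this workflow with scikit-learn's `GridSearchCV` follows. The synthetic data and the candidate alpha values are assumptions for illustration. Note that `GridSearchCV` does not use one fixed validation set; it cross-validates on the training data, so each fold takes a turn as the validation set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption): 150 rows, 5 predictors.
rng = np.random.default_rng(0)
xData = rng.normal(size=(150, 5))
yData = xData @ np.array([1.5, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=150)

# Keep a test set aside; it plays no part in choosing alpha.
xTrain, xTest, yTrain, yTest = train_test_split(
    xData, yData, test_size=0.3, random_state=0
)

# Hypothetical grid of alpha values to scan.
paramGrid = {"alpha": [0.001, 0.1, 1, 10, 100, 1000]}

# 4-fold cross-validation on the training data for every alpha in the grid.
grid = GridSearchCV(Ridge(), paramGrid, cv=4)
grid.fit(xTrain, yTrain)

print(grid.best_params_)                           # alpha with the best mean R-squared
print(grid.cv_results_["mean_test_score"])         # mean validation R-squared per alpha
print(grid.best_estimator_.score(xTest, yTest))    # final check on the test data
```

To select by mean squared error instead, `scoring="neg_mean_squared_error"` can be passed to `GridSearchCV`; the score is negated so that larger is still better.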