# Model Evaluation
- We previously discussed in-sample evaluation, which tells us how well our model fits the data used to train it.
- Problem: It does not tell us how well the trained model predicts new data.
- In the real world, we are given new data that the model has not seen before, and we need to know how the model will perform on it.
- Solution: We can use the `train_test_split()` function from the `sklearn.model_selection` module to split the data into training and testing data.
- It divides the data into a `training set` (`in-sample data`) and a `testing set` (`out-of-sample data`).
- When we split our dataset, the larger portion is usually used for `training` and the smaller portion for `testing`.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

In [None]:
pipe.fit(x_train, y_train)

In [None]:
pipe.score(x_test, y_test)
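
The cells above rely on `X`, `Y`, and `pipe` defined elsewhere in the notebook. As a self-contained sketch of the same idea, the example below uses synthetic data and a plain `LinearRegression` (both are assumptions made for illustration, not the notebook's dataset or pipeline) and compares the in-sample score with the out-of-sample score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
Y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=200)

# Hold out 30% of the rows as out-of-sample data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

model = LinearRegression().fit(x_train, y_train)
r2_in_sample = model.score(x_train, y_train)    # R^2 on data the model was trained on
r2_out_of_sample = model.score(x_test, y_test)  # R^2 on data the model has not seen

print("train rows:", x_train.shape[0], "test rows:", x_test.shape[0])
print("in-sample R^2:", r2_in_sample)
print("out-of-sample R^2:", r2_out_of_sample)
```

The out-of-sample score is the one that indicates how the model is likely to behave on new data.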

# Generalization Error
- Generalization error is the error that results from using a model to predict new data.
- It is calculated by comparing the predicted values with the actual values.
- The smaller the difference between the predicted and actual values, the lower the generalization error.
- Generalization error is also known as `out-of-sample error`.
- The goal of any model is to have the lowest possible generalization error.
- Using enough data for both training and testing helps us reduce this error and estimate it reliably.
- For example, let's say we take a random sample of the data, using 90% of the data for training and 10% for testing.
    - The first time we run the experiment, we get a good estimate of the generalization error.
    - If we run it again, training the model with a different combination of samples, we also get a good result, but it differs from the result of the first run.
    - Repeating the experiment with yet another combination of training and testing samples, the results are relatively close to the generalization error, but distinct from each other.
    - Over many repetitions, the results approximate the generalization error well on average, but the precision is poor, i.e., the individual results differ widely from one another.
    - If we use fewer data points to train the model and more to test it, the estimate of the generalization performance will be less accurate, but it will have good precision.
    - If we use more data points to train the model and fewer to test it, the estimate of the generalization performance will be more accurate, but it will have poor precision.
    - To overcome this problem, we use `cross validation`.
        - It is one of the most common `out-of-sample evaluation` methods.
        - In this method, the dataset is split into k equal groups; each group is referred to as a fold.
        - For example, 4 folds.
        - Some of the folds can be used as a training set, which we use to train the model, and the remaining parts are used as a test set, which we use to test the model.
        - For example, we can use three folds for training; then use one fold for testing.
        - This is repeated until each partition is used for both training and testing.
        - At the end, we use the average results as the estimate of `out-of-sample error`.
        - The advantage of this method is that it matters less how the training and testing sets are partitioned.
        - The disadvantage of this method is that it is more computationally expensive than train/test split.
- The simplest way to apply cross validation is to call the `cross_val_score()` function, which performs multiple `out-of-sample` evaluations.
- This function is imported from the `sklearn.model_selection` module.

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipe, X=X, y=Y, cv=3, n_jobs=1)

- We call the function `cross_val_score()` with the following arguments:
    - The first argument, `estimator`, is the model we are using to do the `cross validation`.
    - The second argument, `X`, is the `feature data`.
    - The third argument, `y`, is the `target data`.
    - The fourth argument, `cv`, is the `number of folds`. Here, cv = 3, which means the data set is split into 3 equal partitions.
    - The fifth argument, `n_jobs`, is the number of CPU cores used to evaluate the folds in parallel. It does not select the metric: because no `scoring` argument is given, the estimator's default score is used, which is `R-squared` for regression models.
- The function returns an array of R-squared scores, one per fold.
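
To make the fold mechanics concrete, here is a minimal sketch that runs the k-fold loop by hand and compares it with `cross_val_score()`. The synthetic data and the `LinearRegression` model are assumptions for illustration; the sketch also relies on the fact that, for a regressor, an integer `cv` corresponds to an unshuffled `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(90, 1))
Y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=90)

model = LinearRegression()
kfold = KFold(n_splits=3)  # three folds, no shuffling

# Manual loop: train on two folds, score on the remaining fold
manual_scores = []
for train_idx, test_idx in kfold.split(X):
    model.fit(X[train_idx], Y[train_idx])
    manual_scores.append(model.score(X[test_idx], Y[test_idx]))

# The same evaluation in one call
scores = cross_val_score(model, X, Y, cv=3)

print("manual:", manual_scores)
print("cross_val_score:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

The mean of the fold scores is the estimate of out-of-sample performance, and the standard deviation gives a rough sense of how much that estimate depends on the partition.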


- If we want to know the actual predicted values supplied by our model before the R-squared values are calculated, we use the `cross_val_predict()` function.
- The input parameters are exactly the same as for the `cross_val_score()` function, but the output is the predictions: each sample gets the prediction made when it was in the test fold.

In [None]:
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(estimator=pipe, X=X, y=Y, cv=3, n_jobs=1)
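
As a self-contained sketch of how these predictions can be used, the example below obtains one out-of-fold prediction per sample and then scores all of them together. The synthetic data and the `LinearRegression` model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(90, 1))
Y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=90)

# Each sample is predicted by a model that did not see it during training
y_pred = cross_val_predict(LinearRegression(), X, Y, cv=3)

overall_r2 = r2_score(Y, y_pred)  # one R^2 over all out-of-fold predictions
print("predictions:", y_pred.shape)
print("first five:", y_pred[:5])
print("R^2 of out-of-fold predictions:", overall_r2)
```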

# Over and Under Fitting 
- Overfitting occurs when the model fits the noise, not the underlying process.
- It typically occurs when the model is too flexible for the amount of training data available.
- It is characterized by a low training error and high testing error.
- Underfitting occurs when the model is too simple to capture the structure of the data.
- It typically occurs when the model is not flexible enough, regardless of how much training data we have.
- It is characterized by a high training error and high testing error.
- A common way to detect overfitting is to use `cross validation`.
- We can use `cross validation` to estimate the `out-of-sample error` and compare it with the `in-sample error`.
- If the `out-of-sample error` is much higher than the `in-sample error`, then we are overfitting.
- If both errors are high and about the same, then we are underfitting.
- If both errors are low and about the same, then we are neither overfitting nor underfitting.

Consider the following function `y(x)+noise`: 
- We assume the training points come from a polynomial function plus some noise.
- The goal of model selection is to determine the order of the polynomial that provides the best estimate of the function `y(x)`.
- If we try and fit the function with a linear function `y = b0 + b1x`, the line is not complex enough to fit the data.
- As a result, there are many errors.
- This is called under-fitting, where the model is too simple to fit the data.
- If we increase the order of the polynomial `y = b0 + b1x + b2x^2`, the model fits better, but the model is still not flexible enough and exhibits under-fitting.
- When an 8th order polynomial `y = b0 + b1x + b2x^2 + b3x^3 + b4x^4 + b5x^5 + b6x^6 + b7x^7 + b8x^8` is used to fit the data, we see the model does well at fitting the data and estimating the function, even at the inflection points.
- Increasing it to a 16th order polynomial `y = b0 + b1x + b2x^2 + b3x^3 + b4x^4 + b5x^5 + b6x^6 + b7x^7 + b8x^8 + b9x^9 + b10x^10 + b11x^11 + b12x^12 + b13x^13 + b14x^14 + b15x^15 + b16x^16`, the model does extremely well at tracking the training points, but performs poorly at estimating the function.
- This is especially apparent where there is little training data; the estimated function oscillates instead of tracking the true function.
- This is called over-fitting, where the model is too flexible and fits the noise rather than the function.
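
The progression from under-fitting to over-fitting can be sketched numerically. The example below assumes a hypothetical underlying function, `sin(2*pi*x)`, with Gaussian noise, and fits polynomials of order 1, 2, 8, and 16 with NumPy; none of these choices come from the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    # Hypothetical underlying function y(x), chosen only for illustration
    return np.sin(2 * np.pi * x)

# Training points: y(x) + noise
x_train = rng.uniform(0, 1, 30)
y_train = true_function(x_train) + rng.normal(0, 0.2, 30)

# Test points drawn from the same process, inside the training range
x_test = np.linspace(x_train.min(), x_train.max(), 200)
y_test = true_function(x_test) + rng.normal(0, 0.2, 200)

def mse(y_true, y_hat):
    return float(np.mean((y_true - y_hat) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 2, 8, 16):
    # Least-squares polynomial fit of the given order
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = mse(y_train, poly(x_train))
    test_mse[degree] = mse(y_test, poly(x_test))
    print(degree, "train MSE:", train_mse[degree], "test MSE:", test_mse[degree])
```

The training error can only go down as the order increases, because each higher-order model contains the lower-order ones. The test error is the quantity to watch: it typically falls while the model is still under-fitting and rises again once the polynomial starts to follow the noise.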

- Even if we select the best order of the polynomial, we will still have some error. Recall the original expression for the training points.
- It contains a noise term, `y(x)+noise`; this term is one reason for the error.
- Because the noise is random, we cannot predict it; this is sometimes referred to as `irreducible error`.
- There are other sources of errors as well. For example, 
    - Our polynomial assumption may be wrong. 
    - Our sample points may have come from a different function.
- We can calculate the R^2 value for polynomials of different orders and plot the results.
- On the test data, the R^2 value typically increases with the order of the polynomial up to a point and then drops once the model begins to over-fit.
- First, we make an empty list to store the R^2 values.
- Then we create another list containing the polynomial orders we want to try.
- We then use a for loop to iterate through the list of orders.
- For each order, we create the polynomial features and fit the model using the transformed training data.
- We then calculate the R^2 value using the transformed test data.
- We then append the R^2 value to the list of R^2 values.
- Finally, we plot the R^2 values against the order of the polynomial.

In [None]:
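from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

lr = LinearRegression()
Rsqu_test = []  # test-set R^2 for each polynomial order
order = [1, 2, 3, 4]
for n in order:
    pr = PolynomialFeatures(degree=n)
    # Fit the transformer on the training data only, then reuse it on the test data
    x_train_pr = pr.fit_transform(x_train[['horsepower']])
    x_test_pr = pr.transform(x_test[['horsepower']])
    lr.fit(x_train_pr, y_train)
    Rsqu_test.append(lr.score(x_test_pr, y_test))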

# Ridge Regression
- `Ridge regression` is a technique used when the data suffers from `multicollinearity` (independent variables are highly correlated).
- Under `multicollinearity`, the ordinary least squares (OLS) estimates are still unbiased, but their variances are large, so an estimated coefficient can be far from its true value.
- By adding a degree of bias to the regression estimates, `ridge regression` reduces the standard errors.
- `Ridge regression` solves the multicollinearity problem through a `shrinkage parameter λ (lambda)`.
- In many cases, real data has outliers, and these outliers have a big influence on the estimated model.
- If we use a higher order polynomial to fit these outliers, we will get a model that fits the outliers well but does not fit the rest of the data.
- `Ridge regression` controls the magnitude of these polynomial coefficients by introducing the parameter `alpha`.
- `Alpha` is a parameter we select before fitting or training the model.
- As alpha increases, the parameters get smaller. This is most evident for the higher order polynomial features, but alpha must be selected carefully.
- If alpha is too large, the coefficients will approach zero and under-fit the data.
- If alpha is `zero`, the `over-fitting is evident`.
- For alpha equal to `0.001`, the `over fitting begins to subside`.
- For alpha equal to `0.01`, the `estimated function tracks the actual function`.
- When alpha equals `1`, we see the `first signs of under-fitting`.
- `Ridge regression` does this by adding a penalty term to the loss function.
- The penalty term is the sum of the squared coefficients multiplied by a constant lambda.
- This constant lambda must be non-negative; with lambda equal to 0 we recover ordinary least squares.
- The higher the value of lambda, the more the shrinkage.
- The coefficients are shrunk toward zero and each other.
- The advantage of ridge regression is that it can reduce the variance of the model, which helps against over-fitting.
- The disadvantage of ridge regression is that it keeps all the variables in the model, so it does not perform feature selection.
- In the ridge objective, `sum((y - y_hat)^2) + lambda * sum(b_j^2)`, the first term is the same as in ordinary least squares.
- The second term is the penalty term.
- The lambda parameter is a scalar hyper-parameter that we choose; it is not fitted from the training data like the coefficients.
- The lambda parameter is also known as the shrinkage parameter.
- A good value of lambda can be found using cross validation.
- The model is created with the `Ridge()` class from the `sklearn.linear_model` module.
- Its input parameter `alpha` plays the role of lambda.
- The output is the model object, which we then fit with the training data.
- We can use the fitted model object to:
    - predict the values of the training or test data (`predict()`),
    - find the R^2 value of the training or test data (`score()`),
    - find the coefficients and the intercept of the model (`coef_` and `intercept_`).

In [None]:
# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

features = ['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']
pr = PolynomialFeatures(degree=2)
# Fit the polynomial transform on the training data only, then apply it to the test data
x_train_pr = pr.fit_transform(x_train[features])
x_test_pr = pr.transform(x_test[features])
RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(x_train_pr, y_train)
RidgeModel.score(x_test_pr, y_test)  # R^2 on the test data
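
The shrinkage effect of `alpha` can be seen directly on a small example. The sketch below assumes synthetic data with two almost identical features (a deliberately extreme case of multicollinearity, not the notebook's dataset) and records the size of the coefficient vector for increasing values of `alpha`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (an assumption for illustration): x2 is nearly a copy of x1
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# Ordinary least squares for comparison: coefficients can be large and unstable
ols = LinearRegression().fit(X, y)
print("OLS coefficients:", ols.coef_)

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    coef_norms.append(float(np.linalg.norm(ridge.coef_)))
    print("alpha =", alpha, "coefficients:", ridge.coef_)
```

As `alpha` grows, the length of the coefficient vector shrinks; with a very large `alpha` the coefficients approach zero and the model under-fits.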

# Grid Search
- `Grid search` is a method used to tune hyper-parameters.
- `Hyper-parameters` are parameters that are not directly learnt within estimators.
- `Grid search` takes the model or objects you would like to train and different values of the hyper-parameters, then trains and evaluates the model for each combination.
- `Grid search` allows us to scan through multiple free parameters with only a few lines of code.
- In `scikit-learn`, hyper-parameters are passed as arguments to the constructor of the estimator classes.
- For example, `n_estimators` is a hyper-parameter of `RandomForestRegressor()`.
- The term `alpha` is a hyper-parameter of `Ridge()`.
- We can use the `GridSearchCV()` class from the `sklearn.model_selection` module to tune hyper-parameters.
- The input parameters are the model object, a dictionary (or list of dictionaries) of parameter values, and the number of folds.
- After fitting, the `best_estimator_` attribute holds the model refit with the best parameters.
- We can use that model object to predict the values of the test data.

In [None]:
# Grid Search
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

features = ['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']
parameters1 = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]}]
RR = Ridge()
Grid1 = GridSearchCV(RR, parameters1, cv=4)
# Search on the training data only so the test score below stays out-of-sample
Grid1.fit(x_train[features], y_train)
BestRR = Grid1.best_estimator_
BestRR.score(x_test[features], y_test)
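
Besides `best_estimator_`, the fitted search object exposes the chosen parameters and the cross-validated score of every candidate. The sketch below shows this on synthetic data (an assumption for illustration), using the `best_params_` and `cv_results_` attributes of `GridSearchCV`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (an assumption for illustration): four features, linear target plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

param_grid = {"alpha": [0.001, 0.1, 1, 10, 100, 1000]}
grid = GridSearchCV(Ridge(), param_grid, cv=4)
grid.fit(x_train, y_train)  # the search only sees the training portion

print("best parameters:", grid.best_params_)
print("mean CV score per alpha:", grid.cv_results_["mean_test_score"])

# Final check on data that played no part in the search
test_r2 = grid.best_estimator_.score(x_test, y_test)
print("test R^2:", test_r2)
```

Keeping the test data out of the search means the final score remains an honest estimate of out-of-sample performance.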

- One of the advantages of `Grid search` is how quickly we can test multiple parameters. For example, in scikit-learn versions before 1.2, `Ridge` has a `normalize` option (the parameter was removed in version 1.2).
    - The term alpha is the first key in the dictionary.
    - The second key is the normalize option. Its values are the options to try; because we can either normalize the data or not, the values are `True` and `False`, respectively.


In [None]:
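# Grid Search over alpha and normalize
# Note: the `normalize` argument of Ridge only exists in scikit-learn < 1.2
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

features = ['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']
# Both keys go in one dictionary so every alpha is tried with and without normalization
parameters1 = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000],
                'normalize': [True, False]}]
RR = Ridge()
Grid1 = GridSearchCV(RR, parameters1, cv=4)
# Search on the training data only so the test score below stays out-of-sample
Grid1.fit(x_train[features], y_train)
BestRR = Grid1.best_estimator_
BestRR.score(x_test[features], y_test)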
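
Because `normalize` is not available in scikit-learn 1.2 and later, one possible substitute is to put a scaler in a `Pipeline` and let the grid search switch it on and off. The sketch below is an assumption about how this could be done, not the notebook's method: it uses `StandardScaler`, which is not numerically identical to the old `normalize=True` behaviour, and synthetic data with features on very different scales.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = X @ np.array([2.0, 0.3, 0.02, 0.001]) + rng.normal(scale=0.5, size=200)

pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])

# "passthrough" disables the scaling step, so the grid covers scaled and unscaled variants
param_grid = {
    "scale": [StandardScaler(), "passthrough"],
    "ridge__alpha": [0.001, 0.1, 1, 10, 100],
}
grid = GridSearchCV(pipe, param_grid, cv=4)
grid.fit(X, y)

print("candidates tried:", len(grid.cv_results_["params"]))
print("best parameters:", grid.best_params_)
```

A single dictionary with two keys makes the search try every combination (here 2 x 5 candidates), which matches the intent of searching over alpha with and without normalization.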