# Regression for Car Price Prediction

AutoScout24 provided us with a dump of their production database. You are assigned to train a prediction model for the selling price of second-hand cars, which in the future shall enable the platform to automatically suggest an adequate selling price whenever a customer uploads a new sale advertisement.

<a href="https://www.autoscout24.ch/de"><img src="https://www.autoscout24.ch/MVC/Content/desktop/img/autoscout24-logo-og.png" height="30%" width="30%"/></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline

from sklearn.dummy import DummyRegressor
from sklearn.neighbors import KNeighborsRegressor

from sklearn.model_selection import KFold

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

from tqdm.notebook import tqdm
import ipywidgets as widgets
%matplotlib inline

### Load the data

In [None]:
df = pd.read_csv("cars.csv", parse_dates=['Registration'])

In [None]:
df.shape

In [None]:
df.head(n=5)

### Data Preparation

Here we set datatypes and indices, split fields, join tables if we have multiple input sources, and so on. In this exercise, the data is mostly clean and well-prepared. All we have to do is set the datatypes. Please note that you can do this directly with `pd.read_csv` if you want (as shown above with the `parse_dates` argument).
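
As a minimal sketch of setting datatypes while reading, the following parses a date column and a categorical column in one call. The CSV text is a made-up stand-in for `cars.csv` (the rows and the colour `weiss` are assumptions for illustration only).

```python
import io
import pandas as pd

# Made-up stand-in for cars.csv; only the column names follow the notebook.
csv_text = """Name,Color,Registration,Price
VW Golf,schwarz,2012-03-01,15000
AUDI A3,weiss,2015-07-01,22000
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Registration"],  # parse this column as datetime
    dtype={"Color": "category"},   # read this column as a categorical
)
print(sample.dtypes)
```

Setting the dtypes at read time keeps the preparation in one place, but converting afterwards with `astype` (as in the exercise below) gives the same result.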

##### Handle categorical variables

One variable is categorical. Encode it as the Pandas datatype `category` (you can use [pd.DataFrame.astype()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html)) and print the dtypes afterwards.

*Click on the dots to display the solution*

In [None]:
df.Color = df.Color.astype('category')
df.dtypes

### Data Analysis and Quality Assessment

To get to know the data and to get a feeling for it, it is important to examine it before doing any modelling. Data Cleaning and Data Analysis (also called Explorative Data Analysis) are a large part of Machine Learning.

##### Duplicates

Let's begin by checking for duplicate rows. With the data at hand, exact duplicates should not occur. While it is of course possible that two cars of the same maker and model are being sold, the mileage and registration probably would not match. Exact duplicates thus hint at an anomaly from entering, processing or extracting the data (which is in fact very common).
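
To illustrate how duplicates are flagged, here is a small sketch on made-up rows (the `toy` DataFrame is an assumption, not part of the dataset). It shows the difference between the default behaviour of `duplicated()`, which marks only the repeats, and `keep=False`, which marks every member of a duplicate group.

```python
import pandas as pd

# Made-up rows: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "Name": ["VW Golf", "VW Golf", "AUDI A3", "VW Golf"],
    "Mileage": [50000, 50000, 30000, 50000],
})

repeats_only = toy.duplicated()            # first occurrence is not marked
whole_group = toy.duplicated(keep=False)   # every copy is marked
deduped = toy.drop_duplicates()            # keeps the first occurrence of each row

print(repeats_only.tolist())
print(whole_group.tolist())
print(len(deduped))
```

`keep=False` is handy for inspecting duplicates side by side, while the default is what `drop_duplicates()` uses to decide which rows to remove.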

> Find out if there are rows that are exact duplicates. If there are, examine and possibly drop them. You can use [pd.DataFrame.duplicated()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html) and [pd.DataFrame.drop_duplicates()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html).

In [None]:
# check if there are any

*Click on the dots to display the solution*

In [None]:
# check if there are any
df.duplicated().any()

In [None]:
# if there are, print them...

*Click on the dots to display the solution*

In [None]:
# since there are, print them
df[df.duplicated(keep=False)].head(n=10)

In [None]:
# ... and drop them

*Click on the dots to display the solution*

In [None]:
# and drop them
df.drop_duplicates(inplace=True)
#df = df.drop_duplicates() # same result

##### Null values

> Now check for null values.

*Click on the dots to display the solution*

In [None]:
df.isna().any() # add another .any() to aggregate to a single Boolean

##### Data ranges

> Check if the date ranges make sense.

*Click on the dots to display the solution*

In [None]:
df.Year.min(), df.Year.max()

In [None]:
df.Registration.min(), df.Registration.max()

A year of 1900 is not impossible (the [oldest car](https://en.wikipedia.org/wiki/History_of_the_automobile) is quite a bit older than that), but it is unlikely that cars from 1900 will be sold on AutoScout24. We need to examine ...

But first, it is also a bit suspicious that both the Registration and the Year show the same range. Maybe we have redundant information that we can drop?

##### Redundant data

> Check whether the year from *Registration* and the *Year* column are identical. You can use the [.dt accessor](https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dt-accessors) to get the year from the registration date.

*Click on the dots to display the solution*

In [None]:
(df.Registration.dt.year == df.Year).all()

> Next, check if there is variation in the day of the Registration date. You can use [.nunique()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html) or [.value_counts()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) for this.

*Click on the dots to display the solution*

In [None]:
df.Registration.dt.day.value_counts()

So the *Registration* column contains the same year that we also have in the *Year* column and an additional month. The day is always 1.

To simplify the rest of this exercise, let's only keep the *Year* column. For more detailed analysis, you might want to keep the month, however.

> Drop the *Registration* column.

*Click on the dots to display the solution*

In [None]:
df.drop('Registration', axis='columns', inplace=True)

##### Data ranges again

Now back to these very old cars. 

> Sort the dataframe so that the oldest cars show first and list the first 5 entries.

*Click on the dots to display the solution*

In [None]:
df.sort_values(by='Year').head()

Googling these, it looks like only the first row has an invalid Year and the remaining are ok. 

>Select the row with year 1900 and drop it.

*Click on the dots to display the solution*

In [None]:
df.drop(df.index[df.Year==1900], axis='rows', inplace=True)

Now we calculate basic statistics and check if they are reasonable.

In [None]:
df.describe()

Do you see anything else suspicious? 
> Feel free to examine further.

##### Outliers

Let's next take care of outliers. As we have seen in the lecture, outliers can affect the performance of algorithms that are based on distance or similarity.

We create a boxplot for the four numerical columns.

In [None]:
numerical_cols = ['Price', 'Mileage', 'Horsepower', 'EngineSize']
df.loc[:, numerical_cols].plot(kind='box', subplots=True, layout=(2, 2), figsize=(10, 10), sharex=False)

According to the rule that values at least 1.5 * IQR above the 3rd quartile or below the 1st quartile are considered outliers, we have many of those (the circles in the boxplot). However, as we already have checked the validity of the data ranges, we assume that these do not signify a problem with the data (such as wrongly entered data), but are in fact valid, but extreme samples.

For algorithms that are heavily influenced by outliers (such as Linear Regression), it might be best to remove them anyway. But this is something that has to be evaluated with experiments and cross-validation. Furthermore, it is very important that statistics such as the mean for normalisation and the IQR for outlier removal are computed on the training set only (meaning *after* a cross-validation split and not *before*). 
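
A minimal sketch of what "computed on the training set only" means for an IQR bound, using synthetic prices (the data and the variable names are assumptions for illustration). The bound is derived after the split and from the training part alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed "prices"; not taken from the dataset.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=10, sigma=0.5, size=200)

# Split first, then compute the statistics from the training part only.
train, test = train_test_split(prices, test_size=0.2, random_state=42)

q1, q3 = np.percentile(train, [25, 75])
upper = q3 + 1.5 * (q3 - q1)

# Filter the training data with the bound learned from the training data.
train_kept = train[train <= upper]
print(len(train), len(train_kept), upper)
```

Computing the bound on the complete dataset instead would let information from the held-out samples leak into the training procedure.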

We will skip outlier removal for now and look at it again in detail in one of the next exercises. The following code is just given for reference. 

In [None]:
# The following code can be used to calculate an upper bound.
# If applied, this bound must be calculated only on the training set, not on the complete dataset.
# In a dataset where there are outliers above as well as below the two quartiles, the lower bound
# would have to be calculated accordingly
if False:
    q3 = df.loc[:, numerical_cols].describe().loc['75%']
    iqr = q3 - df.loc[:, numerical_cols].describe().loc['25%']
    upper_boundary = q3 + 1.5*iqr
    upper_boundary

In [None]:
# And here the outliers are removed
if False:
    df = df[(df.Price <= upper_boundary.Price) &
            (df.Mileage <= upper_boundary.Mileage) &
            (df.Horsepower <= upper_boundary.Horsepower) &
            (df.EngineSize <= upper_boundary.EngineSize)]

##### Data distribution and pairwise relation

Lastly, let's look at the distributions and the pairwise relations. 

In [None]:
sns.pairplot(df.loc[:, numerical_cols], diag_kind = "kde", kind = "scatter")

# or with just pandas:
# pd.plotting.scatter_matrix(df.loc[:, numerical_cols], diagonal='kde')

We see that all distributions are right-skewed. It might be beneficial to transform them, another experiment that would have to be evaluated with cross-validation. We also see the expected linear relationship between Horsepower and EngineSize. Something else to expect which is not supported by the data at hand could be a linear relationship between Mileage and Price.
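
As a sketch of such a transformation, the following applies a log transform to synthetic right-skewed values and compares the skewness before and after. The data is made up, and the choice of `np.log1p` is an assumption; whether it helps on the real data would still have to be checked with cross-validation.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a column such as Price.
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=10, sigma=1.0, size=5000))

log_price = np.log1p(price)      # log(1 + x), safe for zeros
restored = np.expm1(log_price)   # inverse transform back to the original scale

print("skew before:", price.skew())
print("skew after: ", log_price.skew())
```

If the target itself is transformed, predictions have to be mapped back with the inverse (`np.expm1` here) before computing metrics in the original unit.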

And certain outliers are also visible. Let's remove them now.

>Remove the 3 elements with the largest Mileage.

*Click on the dots to display the solution*

In [None]:
df.sort_values(by='Mileage', ascending=False).head(n=6)

In [None]:
df.drop([17010, 7734, 47002], axis='rows', inplace=True)
sns.pairplot(df.loc[:, numerical_cols], diag_kind = "kde", kind = "scatter")

>Remove the 2 samples with the largest Price.

*Click on the dots to display the solution*

In [None]:
df.sort_values(by='Price', ascending=False).head(n=6)

In [None]:
df.drop([44369, 24720], axis='rows', inplace=True)
sns.pairplot(df.loc[:, numerical_cols], diag_kind = "kde", kind = "scatter")

>Remove the 8 samples with the largest EngineSize.

*Click on the dots to display the solution*

In [None]:
df.sort_values(by='EngineSize', ascending=False).head(n=12)

In [None]:
df.drop(df.index[df.EngineSize > 7500], axis='rows', inplace=True)
sns.pairplot(df.loc[:, numerical_cols], diag_kind = "kde", kind = "scatter")

Okay, let's leave it at that.

## Price prediction

Ok, now that we have done our data quality assessment, let's dive into the [Scikit-learn](https://scikit-learn.org/) toolkit. Our goal is to fit a model which is able to predict the price of a car. We are going to use the **K-Nearest Neighbor (KNN)** algorithm for this regression problem.

### Feature Engineering
Before we can train our model, we have to do some feature engineering. Let's take a look at our dataset again.

In [None]:
df.head()

Since we now actually use the categorical column, we need to transform it. Use [pd.get_dummies()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) and [pd.concat](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) to [One-Hot encode](https://en.wikipedia.org/wiki/One-hot) the *Color* column (in the lecture, this is described in the section *Vector Space Model*).

> Then, drop the original *Color* column.

*Click on the dots to display the solution*

In [None]:
df = pd.concat([df, pd.get_dummies(df.Color)], axis='columns')
df.drop('Color', axis='columns', inplace=True)
df.head()

And we still have that name column, which isn't really useful. One possibility would be to treat it as a categorical variable.
>Let's check how many different car names that would yield.

*Click on the dots to display the solution*

In [None]:
df.Name.nunique()

Too many. If we want to use this information, we need to either make bins or dissect the column into subfields. This is a bit of work, so to keep it simple, extract the brand from the beginning of the string and use this as a feature. This is a simplification, and it doesn't get everything right, e.g. brand names with a space such as 'LAND ROVER' or 'ASTON MARTIN' are truncated. But for the sake of the exercise, let's be happy with that.
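
If the truncation mattered, one possible refinement would be to check for known multi-word brands first. The sketch below assumes a hand-maintained list `MULTI_WORD_BRANDS` and a helper `extract_brand`; both are hypothetical and the example names are made up.

```python
import pandas as pd

# Hypothetical, hand-maintained list of brands that contain a space.
MULTI_WORD_BRANDS = ["LAND ROVER", "ASTON MARTIN"]

def extract_brand(name):
    """Return a multi-word brand if the name starts with one, else the first word."""
    for brand in MULTI_WORD_BRANDS:
        if name == brand or name.startswith(brand + " "):
            return brand
    return name.split(" ")[0]

# Made-up example names.
names = pd.Series(["LAND ROVER Discovery 3.0", "VW Golf 1.4", "ASTON MARTIN DB9"])
brands = names.map(extract_brand)
print(brands.tolist())
```

The exercise below sticks with the simpler first-word rule.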

>Use the [.str accessor](https://pandas.pydata.org/pandas-docs/stable/text.html#splitting-and-replacing-strings) to split the Name column, [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) with a lambda expression to take only the brand and assign to a new column called *Brand*. Then drop the Name column.

*Click on the dots to display the solution*

In [None]:
df['Brand'] = df.Name.str.split(' ').map(lambda x: x[0])
df.drop('Name', axis='columns', inplace=True)
df.head()

In [None]:
df.Brand.nunique()

Now we have a new categorical variable. Again we have to [One-Hot encode](https://en.wikipedia.org/wiki/One-hot) this column.

*Click on the dots to display the solution*

In [None]:
df = pd.concat([df, pd.get_dummies(df.Brand)], axis='columns')
df.drop('Brand', axis='columns', inplace=True)

Now our Dataframe looks like this:

In [None]:
df.head()

We probably have too many features in relation to the amount of data. So let's start by only using 5 features.
As a next step, we split the features from the target *(price)* we are going to predict.

In [None]:
feature_columns = ['Mileage', 'Year', 'Horsepower', 'Doors', 'schwarz']

In [None]:
X = df[feature_columns].values
# We convert it to float so we don't get a conversion warning when normalizing the data
X = X.astype("float") 
y = df.Price.values

#### Splitting the data
Split the data into a training and test set using the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. Use 20% of the data for the test set. Then print the shapes of the resulting data sets and make sure the splitting was correct.

In [None]:
# X_train, ...
# print("X_train:", X_train.shape)
# print("X_test:", X_test.shape)
# print("y_train", y_train.shape)
# print("y_test", y_test.shape)

*Click on the dots to display the solution*

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("X_train:", X_train.shape)
print("y_train", y_train.shape)
print("X_test:", X_test.shape)
print("y_test", y_test.shape)

In [None]:
print(X_train[0])

### Baseline Model

To get an idea of how well our model performs, it often makes sense to implement a baseline model. In our case, the baseline model should just return the median of the price. For that we can use the [DummyRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html) from Scikit-Learn.

Instantiate a new DummyRegressor with the strategy `median`

In [None]:
# dummy = ...

*Click on the dots to display the solution*

In [None]:
dummy = DummyRegressor(strategy="median")

Now fit the model on the training data.

In [None]:
# dummy.fit(...)

*Click on the dots to display the solution*

In [None]:
dummy.fit(X_train, y_train)

Congrats, you have trained your first model (even though the model is pretty dumb :-))! Let's check how well the baseline model performs.

But first we need to make a prediction on the test set.

In [None]:
# y_pred = ...

*Click on the dots to display the solution*

In [None]:
y_pred = dummy.predict(X_test)

We will measure the quality of our result with the [Mean Absolute Error (MAE)](https://en.wikipedia.org/wiki/Mean_absolute_error). This means that we don't care whether we estimate a price as too high or too low, only the magnitude of the difference counts. We will also calculate the [Coefficient of Determination (R<sup>2</sup>) score](https://en.wikipedia.org/wiki/Coefficient_of_determination). Our goal is to minimize the MAE and to maximize the R<sup>2</sup> score.

Calculate the [Mean Absolute Error (MAE)](https://en.wikipedia.org/wiki/Mean_absolute_error) for the predictions made on the test set. Implement it yourself instead of using the Scikit-Learn toolkit.

In [None]:
# Calculate the MAE
# np....

*Click on the dots to display the solution*

In [None]:
np.mean(np.abs(y_test - y_pred))

Scikit-Learn also provides the function [mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) so we don't have to implement it ourselves.

In [None]:
mean_absolute_error(y_test, y_pred)

The R<sup>2</sup> score can be calculated using the Scikit-Learn function [r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html).

In [None]:
r2_score(y_test, y_pred)
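
For reference, R<sup>2</sup> is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements it by hand on made-up prices (the arrays and the helper `r2_manual` are assumptions for illustration) and compares it with `r2_score`. It also shows why a constant prediction cannot score above 0 on the data it is evaluated on: predicting the mean gives exactly 0, and any other constant, such as the median, gives a negative value.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true prices and predictions.
y_true = np.array([10000.0, 15000.0, 20000.0, 40000.0])
y_hat = np.array([12000.0, 14000.0, 23000.0, 35000.0])

def r2_manual(y_true, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Constant predictions: the mean and the median of the evaluated targets.
mean_pred = np.full_like(y_true, np.mean(y_true))
median_pred = np.full_like(y_true, np.median(y_true))

print("manual: ", r2_manual(y_true, y_hat))
print("sklearn:", r2_score(y_true, y_hat))
print("mean baseline:  ", r2_manual(y_true, mean_pred))
print("median baseline:", r2_manual(y_true, median_pred))
```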

### K-Nearest Neighbors (KNN)

Okay, our baseline model is pretty bad! Let's see if we can improve our performance with a KNN-model.

#### Normalize the data
When using the K-Nearest Neighbors algorithm it is important to normalize the data. We use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from Scikit-Learn which represents the Z-Score normalization.
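
To see why this matters, consider a sketch with two made-up features on very different scales (the values in `X_toy` are assumptions). In raw units the Euclidean distance is dominated by Mileage, so the number of doors has almost no influence on which sample counts as the nearest neighbour. After z-score normalization both features contribute.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: Mileage (km), Doors. Made-up values.
X_toy = np.array([
    [50000.0, 5.0],
    [50500.0, 3.0],
    [90000.0, 5.0],
    [120000.0, 3.0],
])

def dist(a, b):
    """Euclidean distance between two samples."""
    return np.linalg.norm(a - b)

# Raw units: sample 1 looks much closer to sample 0 than sample 2 does.
raw_d01 = dist(X_toy[0], X_toy[1])
raw_d02 = dist(X_toy[0], X_toy[2])

# After scaling, the difference in Doors counts as well and the order flips.
X_toy_scaled = StandardScaler().fit_transform(X_toy)
scaled_d01 = dist(X_toy_scaled[0], X_toy_scaled[1])
scaled_d02 = dist(X_toy_scaled[0], X_toy_scaled[2])

print("raw:   ", raw_d01, raw_d02)
print("scaled:", scaled_d01, scaled_d02)
```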

In [None]:
print("Original", X_train[0])

In [None]:
scaler = StandardScaler()

Now that we have instantiated the scaler, we can call the `fit` method, which calculates the *variance* and *mean* per feature and stores them. We then call `transform` to apply the normalization to the data. 

In [None]:
# fit the scaler on the training data and then transform both sets
# scaler...
# X_train = ...
# X_test = ...

*Click on the dots to display the solution*

In [None]:
scaler.fit(X_train)
X_train = scaler.transform(X_train)

# You could also call fit_transform, which fits the scaler and then transforms the data in one step.
# X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

In [None]:
print("Mean per feature:", scaler.mean_)
print("Variance per feature", scaler.var_)

In [None]:
print("Scaled", X_train[0])

As we have measured the performance of the baseline model, we can start to use an actual machine learning model. We will use the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) model.
Instantiate a new object of the `KNeighborsRegressor` model. Set the hyperparameter `n_neighbors` equal to 5. Then fit the model on the training data and calculate the metrics on the test set.

In [None]:
# Instantiate model
# knr = ...

*Click on the dots to display the solution*

In [None]:
knr = KNeighborsRegressor(n_neighbors=5)

In [None]:
# Fit the model

*Click on the dots to display the solution*

In [None]:
knr.fit(X_train, y_train)

In [None]:
# Predict on the test set
# y_pred = ...

*Click on the dots to display the solution*

In [None]:
y_pred = knr.predict(X_test)
y_pred_train = knr.predict(X_train)

In [None]:
# Calculate metrics
# mae = ....
# r2 = ...
# print("MAE:", mae)
# print("r2:", r2)

*Click on the dots to display the solution*

In [None]:
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("r2:", r2)

## Hyperparameter Tuning

The KNeighborsRegressor has the hyperparameter `n_neighbors` that we would like to tune. As explained in the lecture you **should not use the test set** to tune the hyperparameters. Therefore we either split off a separate validation set from the training data or use k-fold cross-validation. 



#### Splitting the data
Split the original data into the following: 60% training data, 20% validation, 20% test data. Make sure the splitting is correct by checking the shapes.

In [None]:
# X_train, X_test, y_train, y_test = ...
# X_test, X_val, y_test, y_val = ...
# print("X_train:", X_train.shape)
# print("y_train", y_train.shape)
# print("X_val", X_val.shape)
# print("y_val", y_val.shape)
# print("X_test:", X_test.shape)
# print("y_test", y_test.shape)

*Click on the dots to display the solution*

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
print("X_train:", X_train.shape)
print("y_train", y_train.shape)
print("X_val", X_val.shape)
print("y_val", y_val.shape)
print("X_test:", X_test.shape)
print("y_test", y_test.shape)

#### Normalization
Again, apply z-normalization to the data

In [None]:
# X_train = 
# X_test = 
# X_val = 

*Click on the dots to display the solution*

In [None]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_val = scaler.transform(X_val)

#### Manually select best parameter
Let's select the value for `n_neighbors` where we get the highest R<sup>2</sup> score.

In [None]:
@widgets.interact(k=widgets.IntSlider(
    value=1,
    min=1,
    max=12,
    step=1,
    description='n_neighbors:'))
def f(k):
    knr = KNeighborsRegressor(n_neighbors=k)
    knr.fit(X_train, y_train)
    y_pred = knr.predict(X_val)
    r2 = r2_score(y_val, y_pred)
    print("K={} -> r2={}".format(k, r2))

#### Automatically select best parameter
Now let's try different values for `n_neighbors` by looping through a list of possible values. We want to automatically select the parameter for `n_neighbors` which results in the highest R<sup>2</sup> score.

In [None]:
r2_scores = []
k_range = list(range(1, 13))
for k in tqdm(k_range):
    # START YOUR CODE
    
    
    
    # END YOUR CODE
    #print("K={} -> r2={}".format(k, r2))
    pass

#idx = np.argmax(r2_scores)
#best_k = k_range[idx]
#max_r2 = np.max(r2_scores)
#print("Best K:", best_k)
#print("max r2:", max_r2)

*Click on the dots to display the solution*

In [None]:
r2_scores = []
k_range = list(range(1, 13))
for k in tqdm(k_range):
    knr = KNeighborsRegressor(n_neighbors=k)
    knr.fit(X_train, y_train)
    y_pred = knr.predict(X_val)
    r2 = r2_score(y_val, y_pred)
    r2_scores.append(r2)
    print("K={} -> r2={}".format(k, r2))
    
idx = np.argmax(r2_scores)
best_k = k_range[idx]
max_r2 = np.max(r2_scores)
print("Best K:", best_k)
print("max r2:", max_r2)

We can now plot how the R<sup>2</sup> score changes when we increase K.

In [None]:
plt.plot(k_range, r2_scores)
plt.xlabel("K")
plt.ylabel("r2")

Now that we have found our optimal number of neighbors, we can finally evaluate how well our model performs on the test set. 

**Note**: Until now we should have never touched the test set. If we want to further improve our model after looking at the test data we would need to split the data again! Otherwise our result might be optimistically biased.

In [None]:
knr = KNeighborsRegressor(n_neighbors=best_k)
knr.fit(X_train, y_train)

y_pred = knr.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("r2: ", r2)

### Cross-Validation

What we have done so far is a simple hold-out validation (a so-called 1-fold cross-validation): we split our data into a training, a validation and a test set. Now we want to perform a k-fold cross-validation.

![k-Fold-CV.png](attachment:k-Fold-CV.png)

##### Splitting the data
When we do cross-validation we split our data set into a training and a test set and then split the training set during the cross-validation into a training and validation set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("X_train:", X_train.shape)
print("y_train", y_train.shape)
print("X_test:", X_test.shape)
print("y_test", y_test.shape)

To apply cross-validation, use the [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) class from Scikit-Learn to further split the training data into a training and validation set. Use the `split` method and don't forget to normalize your data.
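
As a reminder of what `KFold.split` yields, here is a minimal sketch on six made-up samples (the array `X_toy` is an assumption). Each iteration returns two index arrays, one for the training part and one for the validation part, and every sample ends up in the validation part exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six made-up samples with two features each.
X_toy = np.arange(12).reshape(6, 2)

kf = KFold(n_splits=3, shuffle=True, random_state=42)

val_seen = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy)):
    # train_idx and val_idx are integer index arrays into X_toy
    print("fold", fold, "train:", train_idx, "val:", val_idx)
    val_seen.extend(val_idx.tolist())
```

These index arrays can be used directly to select rows from the feature matrix and the target vector.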

In [None]:
def apply_cv(X, y, model, n_splits):
    y_trues = []
    y_preds = []
    # START YOUR CODE

    
    # END YOUR CODE
    return y_trues, y_preds

*Click on the dots to display the solution*

In [None]:
def apply_cv(X, y, model, n_splits):
    y_trues = []
    y_preds = []
    kf = KFold(n_splits=n_splits, random_state=42, shuffle=True)
    for train_index, val_index in kf.split(X):
        X_train_cv, X_val_cv = X[train_index], X[val_index] 
        y_train_cv, y_val_cv = y[train_index], y[val_index]

        X_train_cv = scaler.fit_transform(X_train_cv)
        X_val_cv = scaler.transform(X_val_cv)

        model.fit(X_train_cv, y_train_cv)
        y_pred = model.predict(X_val_cv)
        
        y_trues.extend(y_val_cv.tolist())
        y_preds.extend(y_pred.tolist())
    return y_trues, y_preds

We can now use our implemented method `apply_cv` to run the 10-fold cross-validation for our dataset.

In [None]:
y_trues, y_preds = apply_cv(X_train, y_train, model=knr, n_splits=10)
MAE = mean_absolute_error(y_trues, y_preds)
r2 = r2_score(y_trues, y_preds)
print("MAE:", MAE)
print("r2:", r2)

We have now implemented a simple cross-validation function. Let's use this function now for tuning the `n_neighbors` hyperparameter.

In [None]:
r2_scores = []
k_range = list(range(1, 13))
for k in tqdm(k_range):
    # START YOUR CODE

    
    # END YOUR CODE
    #print("K={} -> r2={}".format(k, r2))
    pass
    
#idx = np.argmax(r2_scores)
#best_k = k_range[idx]
#max_r2 = np.max(r2_scores)
#print("Best K:", best_k)
#print("max r2:", max_r2)

*Click on the dots to display the solution*

In [None]:
r2_scores = []
k_range = list(range(1, 13))
for k in tqdm(k_range):
    knr = KNeighborsRegressor(n_neighbors=k)
    # apply_cv fits the model and predicts on each fold, so no separate predict call is needed
    y_trues, y_preds = apply_cv(X_train, y_train, knr, n_splits=10)
    r2 = r2_score(y_trues, y_preds)
    r2_scores.append(r2)
    print("K={} -> r2={}".format(k, r2))
    
idx = np.argmax(r2_scores)
best_k = k_range[idx]
max_r2 = np.max(r2_scores)
print("Best K:", best_k)
print("max r2:", max_r2)

### This is how it's done with Scikit-Learn

Scikit-Learn provides the class [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), which enables us to tune the hyperparameters by applying a k-fold cross-validation. We have to create a `params` dictionary where we define all hyperparameters that we want to tune. We then pass the model and the parameters, and specify the metric we want to optimize.

In [None]:
knr = KNeighborsRegressor()
params = {
    'n_neighbors': list(range(1, 13))
}
grid_search = GridSearchCV(knr, params, cv=10, scoring="r2")
grid_search.fit(X_train, y_train)
print("Best params", grid_search.best_params_)
print("max r2", grid_search.best_score_)

The maximum R<sup>2</sup> score seems to be much lower than when we applied cross-validation ourselves. 

> Can you think of an explanation for this?

We did not apply normalization on the training data.

As you have seen, you should calculate the *variance* and *mean* for the z-score normalization from the training data. When we want to do a cross-validation, this makes it a bit complicated. Therefore Scikit-Learn has introduced [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). With a pipeline we can assemble multiple transformations and a final model into a combined model. When we then call the `fit` method, it will automatically preprocess the data before passing it to the model. In our case we use the pipeline to combine the z-score normalization and the KNN model.

In [None]:
pipe = [("s", scaler), ("knr", knr)]
model = Pipeline(pipe)

When we call the `fit` and the `predict` methods, the passed data is automatically normalized before being used with the models. We don't have to care about scaling the data by ourselves.

In [None]:
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r2_score(y_test, y_pred)

#### Cross-Validation using the pipeline
We can pass the pipeline instead of the `KNeighborsRegressor` object to the `GridSearchCV` class and run the grid search.

In [None]:
params = {
    'knr__n_neighbors': list(range(1, 13))
}
grid_search = GridSearchCV(model, params, cv=10, scoring="r2")
grid_search.fit(X_train, y_train)
print("Best params", grid_search.best_params_)
print("max r2", grid_search.best_score_)

As expected, the result is now much better.
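
After the search, the best pipeline can be used directly. The sketch below uses synthetic data from `make_regression` instead of the car data (all names ending in `_demo`, `_tr` and `_te` are assumptions for illustration). With the default `refit=True`, `GridSearchCV` refits the best parameter combination on the complete training data and exposes it as `best_estimator_`, including the fitted scaler.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the car features and prices.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe_demo = Pipeline([("s", StandardScaler()), ("knr", KNeighborsRegressor())])
search = GridSearchCV(pipe_demo, {"knr__n_neighbors": list(range(1, 13))}, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

# best_estimator_ is the whole pipeline, refit on all of X_tr.
best_model = search.best_estimator_
fitted_scaler = best_model.named_steps["s"]

# score() of a regressor pipeline returns R^2; the test data is scaled internally.
test_r2 = best_model.score(X_te, y_te)
print("best params:", search.best_params_)
print("test r2:", test_r2)
```

Because the scaler lives inside the pipeline, new samples can be passed to `predict` in their original units.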

## Assignment

Now answer the ILIAS Quiz **Supervised Learning Fundamentals**

> Fit a KNN-model with 7 neighbors and predict the price of a black car with the following properties:
* Mileage = 1000
* Year = 2012
* Horsepower = 150
* Doors = 5