In this exercise, you will use what you've learned in the two previous tutorials to tune a machine learning model with a **pipeline** and **cross-validation**.

# Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
#from learntools.ml_level_2_dev.ex4 import *
print("Setup Complete")

You will work with the [Ames Housing dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) from the previous exercise. 

![Ames Housing dataset image](./images/ex1_housesbanner.png)

Run the next code cell without changes to load the training data in `X` and `y`.  The test set is loaded in `X_test`.

For simplicity, we drop categorical variables.

In [None]:
import pandas as pd

# read the data
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv', index_col='Id')
test_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv', index_col='Id')

# remove rows with missing target, separate target from predictors
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_data.SalePrice              
train_data.drop(['SalePrice'], axis=1, inplace=True)

# select numeric columns only
numeric_cols = [cname for cname in train_data.columns if
                train_data[cname].dtype in ['int64', 'float64']]

X = train_data[numeric_cols]
X_test = test_data[numeric_cols]

Use the next code cell to print the first several rows of the data.

In [None]:
X.head()

So far, you've learned how to build pipelines with scikit-learn.  For instance, the pipeline below will use [`SimpleImputer()`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to replace missing values in the data, before using [`RandomForestRegressor()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) to train a random forest model to make predictions.  We set the number of trees in the random forest model with the `n_estimators` parameter, and setting `random_state` ensures reproducibility.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

my_pipeline = make_pipeline(SimpleImputer(), 
                            RandomForestRegressor(n_estimators=80,
                                                  random_state=0))
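
To make the imputation step concrete, here is a minimal plain-Python sketch of what mean imputation does to a single column.  It is only an illustration: `impute_mean` is a hypothetical helper, not scikit-learn's implementation, and the sample values are made up.  With default parameters, `SimpleImputer()` uses the `mean` strategy, which is what the sketch mimics.

```python
import math

def impute_mean(column):
    """Replace NaN entries in a list of floats with the mean of the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]

# Made-up column with one missing entry
lot_frontage = [60.0, 80.0, float("nan"), 70.0]
filled = impute_mean(lot_frontage)
print(filled)  # [60.0, 80.0, 70.0, 70.0]
```

Inside a pipeline, the imputer learns its fill values from the training data only and then applies them to whatever data is passed on to the model.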

You have also learned how to use pipelines in cross-validation.  The code below uses the [`cross_val_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function to obtain the mean absolute error (MAE), averaged across five different folds.  Recall we set the number of folds with the `cv` parameter.

In [None]:
from sklearn.model_selection import cross_val_score

# multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("Average MAE score:", scores.mean())
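
To see what `cross_val_score()` is averaging, the following plain-Python sketch runs k-fold cross-validation by hand.  It is a simplified illustration under stated assumptions: the "model" just predicts the mean of the training targets, the folds are contiguous and unshuffled, and `k_fold_mae` and the toy targets are hypothetical names and values, not part of scikit-learn or the exercise data.

```python
def k_fold_mae(y, k):
    """Return the MAE of a predict-the-training-mean baseline on each of k contiguous folds."""
    n = len(y)
    # Fold sizes differ by at most one when n is not divisible by k
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    scores = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid = y[start:stop]          # held-out fold
        train = y[:start] + y[stop:]   # all remaining folds
        prediction = sum(train) / len(train)
        mae = sum(abs(v - prediction) for v in valid) / len(valid)
        scores.append(mae)
        start = stop
    return scores

toy_prices = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]  # made-up targets
fold_scores = k_fold_mae(toy_prices, k=3)
average_mae = sum(fold_scores) / len(fold_scores)
print(fold_scores)  # [300.0, 50.0, 300.0]
```

Each fold's score here is a positive MAE.  scikit-learn instead reports `neg_mean_absolute_error` so that a higher score is always better, which is why the cell above multiplies by -1.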

# Step 1: Build this function

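Write a function `get_score()` that reports the average MAE, over three cross-validation folds, of a pipeline that uses `SimpleImputer()` with default parameters to replace missing values and `RandomForestRegressor()` with `random_state=0` to fit a random forest model on the data in `X` and `y`.  The `n_estimators` argument passed to `get_score()` sets the number of trees in the forest.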

In [None]:
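def get_score(n_estimators):
    """Return the average MAE over 3 cross-validation folds.

    n_estimators: number of trees in the random forest
    """
    my_pipeline = make_pipeline(SimpleImputer(),
                                RandomForestRegressor(n_estimators=n_estimators,
                                                      random_state=0))
    # multiply by -1 since sklearn calculates *negative* MAE
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()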

# Step 2: Test different parameters

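Use the function from Step 1 to evaluate the model for eight values of the number of trees: 50, 100, 150, ..., 400.  Store the results in a dictionary `results`, where each key is a number of trees `n` and `results[n]` is the average MAE returned by `get_score(n)`.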

In [None]:
results = {}
for i in range(1,9):
    results[50*i] = get_score(50*i)

step_2.check()

In [None]:
step_2.hint()
step_2.solution()
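
The same dictionary can also be built with a comprehension.  The sketch below uses a hypothetical stand-in, `fake_score()`, in place of `get_score()` so that it runs instantly without the housing data; the numbers it returns are invented and say nothing about the real model.

```python
def fake_score(n_estimators):
    """Hypothetical stand-in for get_score(); returns an invented 'MAE'."""
    return 18000.0 + abs(n_estimators - 200) * 2.0

# Candidate tree counts: 50, 100, ..., 400
candidates = range(50, 401, 50)
results_sketch = {n: fake_score(n) for n in candidates}

print(list(results_sketch))  # [50, 100, 150, 200, 250, 300, 350, 400]
```

Using `range(50, 401, 50)` states the candidate values directly, which can be easier to read than computing `50*i` inside the loop.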

# Step 3: Find the best parameter

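Run the next code cell without changes to plot the average MAE from Step 2 against the number of trees.  Then, in the cell after it, set `n_estimators_best` to the number of trees with the lowest average MAE.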

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(list(results.keys()), list(results.values()))
plt.xlabel("n_estimators")
plt.ylabel("Average MAE")
plt.show()

In [None]:
n_estimators_best = min(results, key=results.get)

step_3.check()

In [None]:
step_3.hint()
step_3.solution()
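
`min(results, key=results.get)` iterates over the dictionary's keys and returns the key whose value is smallest.  A minimal illustration with invented scores (`toy_results` is hypothetical, not output from the exercise):

```python
# Invented scores: keys are tree counts, values are made-up errors
toy_results = {50: 3.2, 100: 2.9, 150: 2.7, 200: 2.8}

best_key = min(toy_results, key=toy_results.get)  # key with the smallest value
best_value = toy_results[best_key]

print(best_key, best_value)  # 150 2.7
```

Calling `min(toy_results.values())` would return the lowest score itself rather than the parameter that produced it.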