# Increase Linear Regression Performance 

We will learn how to improve the performance of a linear regression model. We will use this model to predict the transaction price of a house using a real estate dataset. 

## Overview: 
1. Import the dataset
2. Separate the dataset's features from target variable
3. Split data into training and testing sets
4. Make a pipeline 
5. Make a hyper-parameter dictionary
6. Perform a cross-validation grid search using the training set, setting the following parameters when initializing the `GridSearchCV` object:

  - `estimator` parameter will be assigned with the `pipeline` object instance as its argument
  - `param_grid` parameter will be assigned with the `hyper-parameter-dictionary` instance

7. Identify the best-performing model and its hyper-parameter set using the training set
8. Evaluate the best performing model using the testing data

<hr>

<br>

## Import Required Libraries

**Note:** By convention, you can tell the difference between a class and a function by its capitalization. 

- A **class** will be capitalized (for example, `LinearRegression`)
- A **function** will be lowercase (for example, `train_test_split`)
- A **method**, a function belonging to a class, will also be lowercase. You can call a method through an instance of a class (instance method) or through the class itself (static method)

References: 
- [Understanding what a class is](https://www.hackerearth.com/practice/python/object-oriented-programming/classes-and-objects-i/tutorial/)
- [Differences between functions and methods](https://www.tutorialspoint.com/difference-between-method-and-function-in-python)
- [Different types of methods](https://www.bogotobogo.com/python/python_differences_between_static_method_and_class_method_instance_method.php)
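
The naming convention above can be illustrated with a small, self-contained sketch. The names `HousePricer`, `square_feet`, `estimate`, and `to_thousands` are hypothetical and exist only for this illustration. Keep in mind that the capitalization is a style convention (PEP 8), not something Python enforces.

```python
import inspect


class HousePricer:  # a class: capitalized by convention
    def __init__(self, rate):
        self.rate = rate

    def estimate(self, sqft):  # an instance method: lowercase
        return self.rate * sqft

    @staticmethod
    def to_thousands(price):  # a static method: called through the class itself
        return price / 1000


def square_feet(width, length):  # a plain function: lowercase
    return width * length


pricer = HousePricer(rate=200)                   # instantiate the class
area = square_feet(20, 30)                       # call the function
price = pricer.estimate(area)                    # call the instance method
in_thousands = HousePricer.to_thousands(price)   # call the static method

# The inspect module can confirm what kind of object each name refers to
is_class = inspect.isclass(HousePricer)
is_function = inspect.isfunction(square_feet)
```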

In [0]:
# Collection libraries 
import numpy as np
import pandas as pd

# Visual libraries 
import matplotlib.pyplot as plt
import seaborn as sns

# Helper for splitting training and testing data
from sklearn.model_selection import train_test_split

# Model/Estimator
from sklearn.linear_model import LinearRegression

# Helper for pipelines
from sklearn.pipeline import make_pipeline

# Helper for standardizing the dataset
from sklearn.preprocessing import StandardScaler

# Helper for cross-validation
from sklearn.model_selection import GridSearchCV

#### Notes about imports with this notebook:
We will re-import some of these libraries at the point where we use them. This is to get you used to importing modules and understanding their classes and functions. Refer to the documentation to understand each library's classes, methods, and functions. 

## Load Data

<hr>

##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount our Google Drive before we can access the files stored on it. Run the Python cell below, click the link it generates, and paste the authorization code into the cell's prompt.



In [0]:
from google.colab import drive
drive.mount('/content/drive')

Change Directory To Access The Dependent Files - **Google Colab Only Step**

In [0]:
directory = "student"
if (directory == "student"):
  %cd drive/Colab\ Notebooks/machine-learning/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science\ Track/machine-learning

<hr> 
<br>

### Import Real Estate Dataset
Read in the real estate dataset using the path provided and store it in a variable called `df`.

#### Import the cleaned real estate dataset
- Use pandas' `read_csv` function

#### Pandas' `read_csv` parameters:
- `filepath_or_buffer` (string): path of csv to import

```python 
filepath_or_buffer = './data/cleaned_and_feature_engineered_real_estate.csv'
```

In [0]:
df = pd.read_csv(filepath_or_buffer = './data/cleaned_and_feature_engineered_real_estate.csv')

### Show Head Of Dataset

In [0]:
df.head()

Unnamed: 0,tx_price,beds,baths,sqft,year_built,lot_size,basement,median_age,married,college_grad,property_tax,insurance,median_school,num_schools,tx_year,lifestyle_avg,two_and_two,exterior_walls_Brick,exterior_walls_Brick veneer,exterior_walls_Combination,exterior_walls_Metal,exterior_walls_Other,exterior_walls_Siding (Alum/Vinyl),exterior_walls_Wood,exterior_walls_missing,roof_Asphalt,roof_Composition Shingle,roof_Other,roof_Shake Shingle,roof_missing,property_type_Apartment / Condo / Townhouse,property_type_Single-Family
0,295850.0,1.0,1.0,584.0,2013.0,0.0,0.0,33.0,65.0,84.0,234.0,81.0,9.0,3.0,2013.0,1.493259,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
1,216500.0,1.0,1.0,612.0,1965.0,0.0,1.0,39.0,73.0,69.0,169.0,51.0,3.0,3.0,2006.0,0.676598,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0
2,279900.0,1.0,1.0,615.0,1963.0,0.0,0.0,28.0,15.0,86.0,216.0,74.0,8.0,3.0,2012.0,2.298254,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
3,379900.0,1.0,1.0,618.0,2000.0,33541.0,0.0,36.0,25.0,91.0,265.0,92.0,9.0,3.0,2005.0,2.47365,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
4,340000.0,1.0,1.0,634.0,1992.0,0.0,0.0,37.0,20.0,75.0,88.0,30.0,9.0,3.0,2002.0,1.661371,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0


### Show Tail Of Dataset

In [0]:
df.tail()

Unnamed: 0,tx_price,beds,baths,sqft,year_built,lot_size,basement,median_age,married,college_grad,property_tax,insurance,median_school,num_schools,tx_year,lifestyle_avg,two_and_two,exterior_walls_Brick,exterior_walls_Brick veneer,exterior_walls_Combination,exterior_walls_Metal,exterior_walls_Other,exterior_walls_Siding (Alum/Vinyl),exterior_walls_Wood,exterior_walls_missing,roof_Asphalt,roof_Composition Shingle,roof_Other,roof_Shake Shingle,roof_missing,property_type_Apartment / Condo / Townhouse,property_type_Single-Family
1877,385000.0,5.0,6.0,6381.0,2004.0,224334.0,1.0,46.0,76.0,87.0,1250.0,381.0,10.0,3.0,2002.0,-0.792553,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1
1878,690000.0,5.0,6.0,6501.0,1956.0,23086.0,1.0,42.0,73.0,61.0,1553.0,473.0,9.0,3.0,2015.0,0.247411,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1
1879,600000.0,5.0,6.0,7064.0,1995.0,217800.0,1.0,43.0,87.0,66.0,942.0,287.0,8.0,1.0,1999.0,-0.643123,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
1880,759900.0,5.0,6.0,7500.0,2006.0,8886.0,1.0,43.0,61.0,51.0,803.0,245.0,5.0,2.0,2009.0,-0.524305,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1
1881,735000.0,5.0,6.0,7515.0,1958.0,10497.0,1.0,37.0,80.0,86.0,1459.0,444.0,9.0,3.0,2015.0,-0.696683,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1


<hr> 

<br>

## Separate the dataset's features from the target variable

**Tasks:**
- Print shape of original DataFrame before manipulating the DataFrame
- Create a new DataFrame called `X` to contain only the features 
- Create a new Series called `y` to contain only the labels

<br>

### Question: 
Why would you split the data this way?



Answer:

We do this to separate the features of the dataset from the target value. For our problem we set `tx_price` as the target variable for this machine learning model, because we want to predict the house price from the other features of the dataset.

<br>

### Print Shape Of Original DataFrame
We will do this to confirm our manipulations later


In [0]:
df.shape

(1882, 32)

### Create A DataFrame Called `X` To Hold All The Features
**Note:** `X` is uppercase because it's a 2D array / matrix. A matrix holds multiple rows and more than one column. 

**Tip:** Consider using the DataFrame's `drop` method to create this new DataFrame


#### DataFrame's `drop` method parameters:
- `labels` (string or list of strings): index or column labels to drop
- `axis`  ({0 or ‘index’, 1 or ‘columns’}): default 0; whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’)
- `inplace` (bool): default False; If True, do operation inplace and return None.
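
As a minimal sketch of how these parameters behave, the example below builds a hypothetical three-row DataFrame named `toy` (the column names and values are borrowed from the first rows shown earlier in this notebook) and drops the target column. With the default `inplace=False`, `drop` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical three-row frame used only for illustration
toy = pd.DataFrame({
    "tx_price": [295850.0, 216500.0, 279900.0],
    "beds": [1.0, 1.0, 1.0],
    "sqft": [584.0, 612.0, 615.0],
})

# axis=1 tells pandas that "tx_price" is a column label, not a row label
features = toy.drop(labels="tx_price", axis=1)

# Equivalent spelling using the `columns` keyword
same_features = toy.drop(columns="tx_price")

# The original frame still has all three columns
original_columns = list(toy.columns)
feature_columns = list(features.columns)
```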



In [0]:
X = df.drop(labels="tx_price", axis=1)

#### Show Shape of `X` to make sure we created the features DataFrame correctly:

It should have 1882 rows and 31 columns

In [0]:
X.shape

(1882, 31)

### Print Head Of Features Matrix `X`

In [0]:
X.head()

Unnamed: 0,beds,baths,sqft,year_built,lot_size,basement,median_age,married,college_grad,property_tax,insurance,median_school,num_schools,tx_year,lifestyle_avg,two_and_two,exterior_walls_Brick,exterior_walls_Brick veneer,exterior_walls_Combination,exterior_walls_Metal,exterior_walls_Other,exterior_walls_Siding (Alum/Vinyl),exterior_walls_Wood,exterior_walls_missing,roof_Asphalt,roof_Composition Shingle,roof_Other,roof_Shake Shingle,roof_missing,property_type_Apartment / Condo / Townhouse,property_type_Single-Family
0,1.0,1.0,584.0,2013.0,0.0,0.0,33.0,65.0,84.0,234.0,81.0,9.0,3.0,2013.0,1.493259,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
1,1.0,1.0,612.0,1965.0,0.0,1.0,39.0,73.0,69.0,169.0,51.0,3.0,3.0,2006.0,0.676598,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0
2,1.0,1.0,615.0,1963.0,0.0,0.0,28.0,15.0,86.0,216.0,74.0,8.0,3.0,2012.0,2.298254,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
3,1.0,1.0,618.0,2000.0,33541.0,0.0,36.0,25.0,91.0,265.0,92.0,9.0,3.0,2005.0,2.47365,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
4,1.0,1.0,634.0,1992.0,0.0,0.0,37.0,20.0,75.0,88.0,30.0,9.0,3.0,2002.0,1.661371,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0


### Create A Series Called `y` To Hold All The Labels
**Note:** `y` is lowercase because it's a Series, a one-dimensional array that holds a single value per row.



In [0]:
y = df.loc[:, 'tx_price']

#### Show Shape of `y` to make sure we created the target series correctly:

It should have 1882 rows and a single column of values

**Note:** the shape will print out like this, `(1882,)`. A Series is one-dimensional, so its shape has only one entry: 1882 rows, each holding one value.
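
To make the difference in shapes concrete, here is a small sketch on a hypothetical two-row DataFrame named `toy`. Selecting a column with a single label returns a one-dimensional Series, while selecting with a list of labels returns a two-dimensional DataFrame.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration
toy = pd.DataFrame({"tx_price": [295850.0, 216500.0], "beds": [1.0, 1.0]})

as_series = toy.loc[:, "tx_price"]    # single label -> Series (1-D)
as_frame = toy.loc[:, ["tx_price"]]   # list of labels -> DataFrame (2-D)

series_shape = as_series.shape
frame_shape = as_frame.shape
```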

In [0]:
y.shape

(1882,)

### Print Head Of Label Series `y`

In [0]:
y.head()

0    295850.0
1    216500.0
2    279900.0
3    379900.0
4    340000.0
Name: tx_price, dtype: float64

<hr>
<br>

## Split Data Into Training And Testing 

Even though we will perform cross validation shortly, we still want to split the dataset into a training set and a testing set. After we use the training data to find the best estimator, meaning the best fitted model with specific hyper-parameter values, we can evaluate that model using the unseen testing data. This will allow us to understand if the model is overfitting or underfitting. 

Additional Resources:
- [Learn more about overfitting and underfitting](https://github.com/SoftStackFactory/PythonDataScienceHandbook/blob/master/notebooks/05.03-Hyperparameters-and-Model-Validation.ipynb)
- [Interested in how to better fit a model?](https://github.com/SoftStackFactory/PythonDataScienceHandbook/blob/master/notebooks/05.04-Feature-Engineering.ipynb)

Note: The second resource is very informative about feature engineering, but we specifically want to emphasize the **Derived Features** section. 

<br>

#### Split Data Into Training And Testing Sets Using The `train_test_split` Function
 
Requirements: 
- pass in `0.20` as the argument for the `test_size` parameter
- pass in `1` as the argument for the `random_state` parameter

**Note:** We set the `random_state` parameter to a fixed value so that we receive the same split of data when we run this notebook multiple times or on different computers, which is important for re-running experiments and simulations.
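
The effect of a fixed `random_state` can be checked with a short sketch. The arrays `X_demo` and `y_demo` are small synthetic stand-ins, not the real estate data. Splitting twice with the same seed should give identical pieces.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Compare the four returned arrays pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_tr, X_te, y_tr, y_te = split_a
```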

[`train_test_split` function documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

### Import `train_test_split` Function From Sklearn's Library

In [0]:
# Helper for splitting training and testing sets
from sklearn.model_selection import train_test_split

#### Split Data Into Training And Testing Sets 

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

<hr>

<br>

## Cross Validation Recap
One disadvantage of using a holdout set, in other words a single static split of the data, for model validation is that the held-out portion is lost for model training. If we split the data only once and do not use cross validation, some of the dataset never contributes to training the model. This is not optimal and can cause problems, especially if the initial set of training data is small.

We will use cross validation in conjunction with splitting the data using the `train_test_split` function. Cross validation runs on the training set only, so the testing set stays unseen. This way we do not train and evaluate the model with the same data, and we can eventually score the model on unseen data to get a feel for how it is performing.
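
As a rough sketch of the idea, the example below uses scikit-learn's `KFold` on ten synthetic samples (an assumption for illustration, not the notebook's data). Each fold holds out a different slice for validation, so every sample is used for validation exactly once and for training in the remaining folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten synthetic samples with one feature each
samples = np.arange(10).reshape(-1, 1)

folds = KFold(n_splits=5)
validation_indices = []
overlap_found = False

for train_idx, val_idx in folds.split(samples):
    # Within a fold, the training and validation indices never overlap
    if set(train_idx) & set(val_idx):
        overlap_found = True
    validation_indices.extend(val_idx.tolist())
```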

<br>

### Perform A Cross Validation Grid Search 
A cross-validation grid search is an exhaustive search over specified hyper-parameter values to find the best-performing estimator and its hyper-parameter set. 

<br>

#### To perform a cross validation grid search on each machine learning algorithm we need to construct the following:
- Pipeline object, one for each algorithm 
- Model hyper-parameter dictionary, one for each algorithm

<br>

Think of a Pipeline object as a production line that transforms the `features`, or X DataFrame, before eventually fitting the model. Every step of the pipeline except the last must be a transformer object, which transforms the `features` data and must implement the following methods:
- `fit()` 
- `transform()`

Most transformers also provide `fit_transform()`, which performs both operations in one call.

<br>

In summary, the pipeline pre-processes any `features` data provided to it before fitting the model. It does this by calling the first step's `fit_transform` method. The output of that call is passed automatically to the next step's `fit_transform` method, and so on. Finally, the output of the last transformer object's `fit_transform` method is passed to the `fit()` method of the estimator object.

#### Rundown of what is conceptually happening:

``` python 
def process_of_pipeline(self, Features, labels):
  Features_transformed = Features
  # Fit each transformer, transform the features, and hand the result to the next step
  for name, transformer_object in self.steps[:-1]:
    Features_transformed = transformer_object.fit_transform(Features_transformed, labels)
  # The last step is the estimator, fit the model after all the transformation operations are complete
  estimator = self.steps[-1][1]
  estimator.fit(Features_transformed, labels)
  return self
```

**Note:** `Features` is capitalized because it's a 2D matrix, which can interchangeably be referred to as a DataFrame: many rows of data, each containing multiple columns of data points. `labels`, on the other hand, should be thought of as a Series, because each row holds a single data point.
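
The conceptual rundown can be checked against scikit-learn itself. The sketch below, which uses synthetic data (`X_demo`, `y_demo`) as an assumption, fits a pipeline and then repeats the same two steps by hand. If the pipeline works as described, both routes should produce the same predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 samples, 3 features, linear target with an offset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 3.0

# Route 1: let the pipeline chain the steps
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

# Route 2: do the same steps manually
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)        # transformer step
manual = LinearRegression().fit(X_scaled, y_demo)  # estimator step

# At prediction time the fitted scaler only transforms; it is not re-fitted
pipe_pred = pipe.predict(X_demo[:5])
manual_pred = manual.predict(scaler.transform(X_demo[:5]))
```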


[For a better understanding check out this video](https://www.youtube.com/watch?v=6zk6uQSuXqs)

[`GridSearchCV` Class's Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

<br>

### Make a Pipeline

We will start by transforming the `features` matrix using the `StandardScaler` object's `fit_transform()` method, which standardizes each feature to zero mean and unit variance. We will then pass the transformed `features` matrix into the estimator's `fit` method, along with the unmodified `labels` series, as arguments.
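
To see what the scaling step does on its own, here is a minimal sketch with a hypothetical four-row array named `raw`. After `fit_transform`, each column should have a mean of roughly 0 and a standard deviation of roughly 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales (e.g. square feet and bedrooms)
raw = np.array([
    [584.0, 1.0],
    [612.0, 2.0],
    [6381.0, 5.0],
    [7515.0, 4.0],
])

scaled = StandardScaler().fit_transform(raw)

# Per-column statistics after scaling
column_means = scaled.mean(axis=0)
column_stds = scaled.std(axis=0)
```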


### Import The Following Libraries From Sklearn's Library
- `make_pipeline` function, which will help us create a pipeline object
- `LinearRegression` estimator class 
- `StandardScaler` class, which will standardize the dataset


**Remember:** By convention, you can tell the difference between a class and a function by its capitalization

- A **class** will be capitalized
- A **function** will be lowercase

In [0]:
# Helper for pipelines
from sklearn.pipeline import make_pipeline

# Model/Estimator
from sklearn.linear_model import LinearRegression

# Helper for standardizing the dataset
from sklearn.preprocessing import StandardScaler

### Create The Pipeline

#### Using the `make_pipeline()` function:
- Pass in an instantiated `StandardScaler` object as the first argument
- Pass in an instantiated `LinearRegression` object as the second argument

**Note:** Both the `StandardScaler` and `LinearRegression` objects will be instantiated with no arguments passed into their constructors

In [0]:
pipeline = make_pipeline(StandardScaler(), LinearRegression())

<br>

### Make A Hyper-Parameter Dictionary

**Hint:**
You can find out which hyper-parameters are available by calling the pipeline object's `get_params` method. This will return a dictionary. 

#### Understanding A Dictionary
A `dictionary` is a collection that stores data as `key:value` pairs, like a map, unlike data types such as lists that hold only a single value per element. Each `key` in a dictionary is unique and maps to one value; the values themselves do not need to be unique. A dictionary is useful when you want to locate a specific value by its key, as opposed to iterating over an array/list to find it. Picture having to cycle through a really long list of items just to find the one you are looking for. Cycling through a list takes time proportional to its length, something we need to be mindful of when working with machine learning and big data. A dictionary lets us quickly access a value by its unique key, without having to iterate, or cycle, through every element in the collection.

It's a good time to mention that values of dictionaries can be dictionaries themselves. 

[Dictionary reference](https://www.geeksforgeeks.org/python-dictionary/)
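
A short sketch of these ideas, using a hypothetical dictionary named `house` whose keys echo some of the dataset's column names:

```python
# Hypothetical record; the nested dictionary shows that values can be dictionaries
house = {
    "beds": 1.0,
    "baths": 1.0,
    "location": {"median_school": 9.0, "num_schools": 3.0},
}

# Direct lookup by key, no iteration required
beds = house["beds"]

# Lookup inside a nested dictionary
schools = house["location"]["num_schools"]

# `get` returns a fallback instead of raising KeyError when a key is absent
garage = house.get("garage", "not recorded")

key_names = list(house.keys())
```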

<br>

#### To view the pipeline dictionary, print the dictionary using the pipeline object's `get_params` method: 
```python
# pipeline dictionary
pipeline.get_params()
```

In [0]:
# pipeline dictionary
pipeline.get_params()

{'linearregression': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 'linearregression__copy_X': True,
 'linearregression__fit_intercept': True,
 'linearregression__n_jobs': None,
 'linearregression__normalize': False,
 'memory': None,
 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'steps': [('standardscaler',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('linearregression',
   LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False))],
 'verbose': False}


#### Invoke the `keys` method on the pipeline dictionary to view all of the pipeline's parameter names

```python
# pipeline dictionary keys
pipeline.get_params().keys()
```

In [0]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'linearregression', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'linearregression__copy_X', 'linearregression__fit_intercept', 'linearregression__n_jobs', 'linearregression__normalize'])

#### Invoke the `values` method on the pipeline dictionary to view all of the pipeline's parameter values
```python
# pipeline dictionary values
pipeline.get_params().values()
```

In [0]:
pipeline.get_params().values()

dict_values([None, [('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False))], False, StandardScaler(copy=True, with_mean=True, with_std=True), LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False), True, True, True, True, True, None, False])

### Create A Dictionary To Hold The Different Hyper-Parameters For An Individual Estimator

For this dictionary:
- The `keys` will be the estimator's hyper-parameter names, each one unique
- The `values` will be lists, each holding the candidate values to try for that specific hyper-parameter

#### <span style="color:red"> Important Note: </span>
When creating an estimator's hyper-parameter dictionary we need to make sure we are using the pipeline object's dictionary `keys`, not the dictionary `keys` of the bare estimator object, e.g. `LinearRegression()`. 
 
##### Don't Do:

<del>

```python 
# Do not use the actual estimator object's keys
LinearRegression().get_params().keys()

hyper_parameter_dict = {
  'fit_intercept': [True, False] # This name raises an error when we fit
}
```

</del>

If we fit the model with hyper-parameter names that do not follow the pipeline's naming convention, we will get the following error:

```ValueError: Invalid parameter fit_intercept for estimator Pipeline```

This is because we need to use the pipeline's hyper-parameter naming convention instead, which joins the step name and the parameter name with a double underscore: 

```python 
# Get Pipeline's hyper-parameter options instead due to naming conventions sklearn follows
pipeline.get_params().keys()

# Use Pipeline's hyper-parameter options instead due to the naming conventions sklearn follows
hyper_parameter_dict = {
    'linearregression__fit_intercept': [True, False],
}
```


In [0]:
# Use Pipeline's hyper-parameter options instead due to the naming conventions sklearn follows
hyper_parameter_dict = {
    'linearregression__fit_intercept': [True, False],
}
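
The naming convention can be explored directly on a pipeline. In the sketch below, `demo_pipeline` is a throwaway object created only for illustration. `make_pipeline` names each step after its lowercased class name, and `set_params` accepts the `<step name>__<parameter name>` form while rejecting the bare parameter name.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(), LinearRegression())

# Step names are generated from the lowercased class names
step_names = [name for name, _ in demo_pipeline.steps]

# The double-underscore form reaches into the named step
demo_pipeline.set_params(linearregression__fit_intercept=False)
intercept_flag = demo_pipeline.named_steps["linearregression"].fit_intercept

# The bare estimator parameter name is not a parameter of the pipeline
try:
    demo_pipeline.set_params(fit_intercept=False)
    rejected = False
except ValueError:
    rejected = True
```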

### Perform A Cross Validation Grid Search
Now that we have a pipeline and the estimator's hyper-parameter dictionary, we can perform a cross-validation grid search. This is the preparation step for finding the best estimator and its hyper-parameter set.


### Import `GridSearchCV` Class From Sklearn's Library

In [0]:
# Helper for cross-validation
from sklearn.model_selection import GridSearchCV

### Build The GridSearchCV Object 

#### Set the following parameters of the `GridSearchCV` object initialization:
- `estimator` parameter will be assigned with the `pipeline` object instance as its argument
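- `param_grid` parameter will be assigned with the `hyper-parameter-dictionary` instance
- `cv` parameter will be assigned with the value `10`, for 10-fold cross validation
- `return_train_score` parameter will be assigned with the value `True`
- `refit` parameter will be assigned with the value `True`, so the best hyper-parameter set is refit on the whole training set
- `n_jobs` parameter will be assigned the value `-1` to use all available CPU cores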

[`GridSearchCV` Class's Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
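
Before applying this to the real estate data, it may help to see the whole search on a small synthetic problem. The sketch below is an illustration under assumptions: the data comes from scikit-learn's `make_regression` with a large `bias`, and the names `X_demo`, `y_demo`, and `search` are hypothetical. Because the scaler centers the features, a model without an intercept cannot recover the offset, so `fit_intercept=True` should win the search here.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with a large constant offset in the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, bias=1000.0, random_state=0
)

search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), LinearRegression()),
    param_grid={"linearregression__fit_intercept": [True, False]},
    cv=5,                     # 5-fold cross validation for this small example
    return_train_score=True,  # also record training scores in cv_results_
    refit=True,               # refit the best candidate on all the data passed to fit
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
n_candidates = len(search.cv_results_["params"])
has_train_scores = "mean_train_score" in search.cv_results_
```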


In [0]:
model = GridSearchCV(pipeline, hyper_parameter_dict, cv=10, n_jobs=-1, return_train_score=True, refit=True) 

<br>

#### **Note:** The model instance is of type `GridSearchCV`

### Use Python's built-in `type()` function, passing in the model instance, to print the type 

In [0]:
type(model)

sklearn.model_selection._search.GridSearchCV

<br>

### Fit Model to Training Data
We will now fit the model. We do this so that we can use the fitted model to predict the target value, here the transaction price, of unseen samples.
This is the training/learning step of the model, and it is where the pipeline process comes alive!

In [0]:
model.fit(X_train, y_train)



GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('standardscaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('linearregression',
                                        LinearRegression(copy_X=True,
                                                         fit_intercept=True,
                                                         n_jobs=None,
                                                         normalize=False))],
                                verbose=False),
             iid='warn', n_jobs=-1,
             param_grid={'linearregression__fit_intercept': [True, False]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring=None, verbose=0)

<hr>

## Evaluate Model


### Highest Performing Model Score
We will now find out how well the best-performing model, with its chosen hyper-parameter values, performs.

Use the `GridSearchCV` object's `score` method with the training data. 

**Note:** Because `refit=True`, this scores the best estimator after it has been refit on the entire training set, so the result is a training score. The cross-validated estimate is stored separately in the `best_score_` attribute, and the held-out testing set gives the final check on unseen data. 

This `score` method uses the best estimator's scoring function because we left `scoring` unset. In our case that is the **r^2** score, since the pipeline ends in a `LinearRegression` estimator.

References:
- [`GridSearchCV` class's `score` method documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.score)
- [`LinearRegression` class's `score` method documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score)
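
For reference, the r^2 score can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below, on synthetic data (`X_demo`, `y_demo` are assumptions for illustration), compares that manual calculation with `LinearRegression.score` and `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=40)

reg = LinearRegression().fit(X_demo, y_demo)
pred = reg.predict(X_demo)

# r^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_demo - pred) ** 2)           # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

r2_from_score = reg.score(X_demo, y_demo)
r2_from_metric = r2_score(y_demo, pred)
```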

In [0]:
model.score(X_train,y_train) 

0.4541561568926857

### Mean Cross-Validated Score
Print the `GridSearchCV` object's `best_score_` attribute to see the mean cross-validated score of the best estimator.

**Note:** `best_score_` is the mean validation score across the cross-validation folds for the best hyper-parameter combination. It is computed during the search, before the best estimator is refit on the whole training set.

In [0]:
model.best_score_

0.404873450982989
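
One way to confirm where `best_score_` comes from is to recompute it. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `search`, `best_pipe`). It looks up the winning row in `cv_results_` and also re-runs cross validation with the best hyper-parameters on the same folds; both values should match `best_score_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=15.0, random_state=0)

search = GridSearchCV(
    make_pipeline(StandardScaler(), LinearRegression()),
    {"linearregression__fit_intercept": [True, False]},
    cv=5,
)
search.fit(X_demo, y_demo)

# Mean validation score of the winning candidate, read from the results table
from_results = search.cv_results_["mean_test_score"][search.best_index_]

# Re-run 5-fold cross validation with the winning hyper-parameters
best_pipe = make_pipeline(StandardScaler(), LinearRegression())
best_pipe.set_params(**search.best_params_)
recomputed = cross_val_score(best_pipe, X_demo, y_demo, cv=5).mean()
```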

### Store the Best Fitted Estimator

Assign the GridSearchCV object's `best_estimator_` attribute to a new variable named `best_estimator`

**Remember:** `best_estimator_` is actually a `Pipeline` object

In [0]:
best_estimator = model.best_estimator_

<br>

### Question
What can we conclude from looking at the best_estimator's set hyperparameters?

### Answer:
The most performant model will have the value `True` for the `fit_intercept` hyper-parameter.

<br>

#### Important note: 
The `best_estimator`'s hyper-parameter values were chosen because of the data it was trained on. If you manipulate the data and then re-fit the model, you might see different values.


<hr>

<br>

### Evaluate the best performing model using the testing data

**Remember:** `best_estimator_` is actually a `Pipeline` object. 

Once you have access to the pipeline object, call its `score` method, passing in `X_test` and `y_test` as arguments.

<br>

### <span style="color:red">**It's really important to always understand the type of object you are working with, use that object's documentation to guide you in the right direction** </span>

In [0]:
best_estimator.score(X_test, y_test)

0.46674832991442416

<hr>

## Make Predictions
Now that we have a fitted model, we can make predictions using the testing set.

Make predictions using the first five observations of the testing data.

In [0]:
for prediction in model.predict(X_test[:5]):
  print(prediction)

426755.0020805967
656771.0020805968
493411.0020805967
584163.0020805968
386659.0020805967


### Compare Predictions To The True Values

In [0]:
for actual_value in y_test[:5]:
  print(actual_value)

360000.0
765000.0
333250.0
590000.0
369900.0
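
A simple way to quantify the comparison is to compute the error for each house and average the absolute values. The sketch below copies the five predictions (rounded to whole dollars) and the five true values printed above; the variable names are hypothetical.

```python
# Predictions from the notebook output above, rounded to whole dollars
predictions = [426755.0, 656771.0, 493411.0, 584163.0, 386659.0]

# True transaction prices from the notebook output above
actual_values = [360000.0, 765000.0, 333250.0, 590000.0, 369900.0]

# Positive error: the model predicted too high; negative: too low
errors = [pred - actual for pred, actual in zip(predictions, actual_values)]

# Mean absolute error over these five observations
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)

# Index of the observation with the largest miss
worst_index = max(range(len(errors)), key=lambda i: abs(errors[i]))
```

Five observations are far too few to judge the model, but the same calculation applied to the full testing set gives an error in dollars, which is often easier to interpret than the r^2 score.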


#### What else can you do to boost the model performance?

Iterate:
- Exploratory analysis
- Data cleaning 
- Feature engineering
- Get More Data