# Linear Regression With Multiple Models

We will learn how to solve a linear regression problem using several linear regression algorithms, including regularized ones (Lasso, Ridge, and ElasticNet). Alongside this, we will perform a cross validation grid search, looping through the different models to find the model and hyper-parameter set with the highest performance on the data in its current state. We will then use this model to predict the transaction price of a house using a real estate dataset. 

## Overview: 
1. Import the dataset
2. Separate the dataset's features from target variable
3. Split data into training and testing sets
4. Make a dictionary called `pipelines_dict` to store multiple pipelines, one for each algorithm 
5. Make multiple hyper-parameter dictionaries, one for each algorithm
6. Make a dictionary called `hyper_parameters_dict` to hold all hyper-parameter dictionaries
7. Make a dictionary called `best_performing_models_dict` to populate and store the highest performing models during the cross validation grid search loop 
8. Create a loop to perform multiple cross validation grid searches on the training set, one for each algorithm. Set the following parameters when initializing the `GridSearchCV` object:
  - `estimator`: the `pipeline` object for the algorithm selected in the current loop iteration
  - `param_grid`: the hyper-parameter dictionary for the algorithm selected in the current loop iteration
9. Identify the highest performing algorithm with its hyper-parameters set using the training set
10. Evaluate the best performing model using the testing data
11. Using the best performing model predict values and compare those results to the real values
<hr>

<br>

## Import Required Libraries

**Note:** By convention, you can usually tell a class from a function by its capitalization. 

- A **class** name is capitalized (for example, `StandardScaler`)
- A **function** name is lowercase (for example, `train_test_split`)
- A **method**, a function belonging to a class, is also lowercase. You can call a method through an instance of a class (instance method) or through the class itself (static method or class method)
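
The naming conventions in the list above can be seen in a small, self-contained sketch. The names `area_of_circle` and `HouseListing` are hypothetical and exist only for illustration; they are not part of any library used in this notebook.

```python
import math


def area_of_circle(radius):
    """A function: lowercase name, not attached to any class."""
    return math.pi * radius ** 2


class HouseListing:
    """A class: capitalized name (hypothetical, for illustration only)."""

    def __init__(self, sqft, price):
        self.sqft = sqft
        self.price = price

    def price_per_sqft(self):
        """An instance method: called through an instance of the class."""
        return self.price / self.sqft

    @staticmethod
    def sqft_to_sqm(sqft):
        """A static method: called through the class itself (approximate conversion)."""
        return sqft * 0.092903


listing = HouseListing(sqft=584.0, price=295850.0)
per_sqft = listing.price_per_sqft()       # instance method call
sqm = HouseListing.sqft_to_sqm(1000)      # static method call, no instance needed
circle_area = area_of_circle(2.0)         # plain function call
```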

References: 
- [Understanding what a class is](https://www.hackerearth.com/practice/python/object-oriented-programming/classes-and-objects-i/tutorial/)
- [Differences between functions and methods](https://www.tutorialspoint.com/difference-between-method-and-function-in-python)
- [Different types of methods](https://www.bogotobogo.com/python/python_differences_between_static_method_and_class_method_instance_method.php)

In [0]:
# Collection libraries 
import numpy as np
import pandas as pd

# Visual libraries 
import matplotlib.pyplot as plt
import seaborn as sns

# Helper for splitting training and testing data
from sklearn.model_selection import train_test_split

# Models/Estimators
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet

# Helper for pipelines
from sklearn.pipeline import make_pipeline

# Helper for standardizing the dataset
from sklearn.preprocessing import StandardScaler

# Helper for cross-validation
from sklearn.model_selection import GridSearchCV

#### Notes about imports with this notebook:
We will re-import some of these libraries when we use them. This is to get you used to importing them and understanding their classes and functions. Refer to each library's documentation to understand its classes, methods, and functions. 

## Load Data

<hr>

##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount our Google Drive in order to access the files on it. Run the Python cell below, click the link it generates, and paste the authorization code into the cell's prompt.



In [122]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Change Directory To Access The Dependent Files - **Google Colab Only Step**

In [123]:
directory = "studsent"
if (directory == "student"):
  %cd drive/Colab\ Notebooks/machine-learning/
else:
  %cd drive/Shared\ drives/Rubrik/Data\ Science\ Track/machine-learning

[Errno 2] No such file or directory: 'drive/Shared drives/Rubrik/Data Science Track/machine-learning'
/content/drive/Shared drives/Rubrik/Data Science Track/machine-learning


<hr> 
<br>

### Import Real Estate Dataset
Read in the real estate dataset using the path provided and store it in a variable called `df`.

#### Import the cleaned real estate dataset
- Use pandas' `read_csv` function

#### Pandas' `read_csv` parameters:
- `filepath_or_buffer` (string): path of csv to import

```python 
filepath_or_buffer = './data/cleaned_and_feature_engineered_real_estate.csv'
```

In [0]:
df = pd.read_csv(filepath_or_buffer = './data/cleaned_and_feature_engineered_real_estate.csv')

### Show Head Of Dataset

In [125]:
df.head()

Unnamed: 0,tx_price,beds,baths,sqft,year_built,lot_size,basement,median_age,married,college_grad,property_tax,insurance,median_school,num_schools,tx_year,lifestyle_avg,two_and_two,exterior_walls_Brick,exterior_walls_Brick veneer,exterior_walls_Combination,exterior_walls_Metal,exterior_walls_Other,exterior_walls_Siding (Alum/Vinyl),exterior_walls_Wood,exterior_walls_missing,roof_Asphalt,roof_Composition Shingle,roof_Other,roof_Shake Shingle,roof_missing,property_type_Apartment / Condo / Townhouse,property_type_Single-Family
0,295850.0,1.0,1.0,584.0,2013.0,0.0,0.0,33.0,65.0,84.0,234.0,81.0,9.0,3.0,2013.0,1.493259,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
1,216500.0,1.0,1.0,612.0,1965.0,0.0,1.0,39.0,73.0,69.0,169.0,51.0,3.0,3.0,2006.0,0.676598,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0
2,279900.0,1.0,1.0,615.0,1963.0,0.0,0.0,28.0,15.0,86.0,216.0,74.0,8.0,3.0,2012.0,2.298254,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
3,379900.0,1.0,1.0,618.0,2000.0,33541.0,0.0,36.0,25.0,91.0,265.0,92.0,9.0,3.0,2005.0,2.47365,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
4,340000.0,1.0,1.0,634.0,1992.0,0.0,0.0,37.0,20.0,75.0,88.0,30.0,9.0,3.0,2002.0,1.661371,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0


### Show Tail Of Dataset

In [126]:
df.tail()

Unnamed: 0,tx_price,beds,baths,sqft,year_built,lot_size,basement,median_age,married,college_grad,property_tax,insurance,median_school,num_schools,tx_year,lifestyle_avg,two_and_two,exterior_walls_Brick,exterior_walls_Brick veneer,exterior_walls_Combination,exterior_walls_Metal,exterior_walls_Other,exterior_walls_Siding (Alum/Vinyl),exterior_walls_Wood,exterior_walls_missing,roof_Asphalt,roof_Composition Shingle,roof_Other,roof_Shake Shingle,roof_missing,property_type_Apartment / Condo / Townhouse,property_type_Single-Family
1877,385000.0,5.0,6.0,6381.0,2004.0,224334.0,1.0,46.0,76.0,87.0,1250.0,381.0,10.0,3.0,2002.0,-0.792553,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1
1878,690000.0,5.0,6.0,6501.0,1956.0,23086.0,1.0,42.0,73.0,61.0,1553.0,473.0,9.0,3.0,2015.0,0.247411,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1
1879,600000.0,5.0,6.0,7064.0,1995.0,217800.0,1.0,43.0,87.0,66.0,942.0,287.0,8.0,1.0,1999.0,-0.643123,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
1880,759900.0,5.0,6.0,7500.0,2006.0,8886.0,1.0,43.0,61.0,51.0,803.0,245.0,5.0,2.0,2009.0,-0.524305,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1
1881,735000.0,5.0,6.0,7515.0,1958.0,10497.0,1.0,37.0,80.0,86.0,1459.0,444.0,9.0,3.0,2015.0,-0.696683,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1


<hr> 

<br>

## Separate the dataset's features from the target variable

**Tasks:**
- Print shape of original DataFrame before manipulating the DataFrame
- Create a new DataFrame called `X` to contain only the features 
- Create a new Series called `y` to contain only the labels

<br>

### Question: 
Why would you split the data this way?



Answer:

We do this to separate the features of the dataset from the target value. For our problem, `tx_price` is the target variable, because we want to predict the house price from the other features in the dataset.

<br>

### Print Shape Of Original DataFrame
We will do this to confirm our manipulations later


In [127]:
df.shape

(1882, 32)

### Create A DataFrame Called `X` To Hold All The Features
**Note:** `X` is uppercase because it's a 2D array / matrix. A matrix can hold multiple rows and more than one column. 

**Tip:** Consider using the DataFrame's `drop` method to create this new DataFrame


#### DataFrame's `drop` method parameters:
- `labels` (string or list of strings): index or column labels to drop
- `axis`  ({0 or ‘index’, 1 or ‘columns’}): default 0; whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’)
- `inplace` (bool): default False; If True, do operation inplace and return None.
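
To make the three parameters above concrete, here is a minimal sketch on a toy DataFrame. The name `toy_df` and its three columns are assumptions for illustration; the values are borrowed from the first rows of the real estate dataset.

```python
import pandas as pd

# Toy frame (hypothetical): two rows, three columns
toy_df = pd.DataFrame({
    "tx_price": [295850.0, 216500.0],
    "beds": [1.0, 1.0],
    "sqft": [584.0, 612.0],
})

# axis=1 drops a column and returns a new DataFrame; toy_df is unchanged
features = toy_df.drop(labels="tx_price", axis=1)

# axis=0 (the default) drops a row by its index label instead
rows_removed = toy_df.drop(labels=0, axis=0)

# inplace=True modifies the object itself and returns None
copy_df = toy_df.copy()
returned = copy_df.drop(labels="tx_price", axis=1, inplace=True)
```

Because `inplace=False` is the default, assigning the result to a new variable (as with `features`) leaves the original DataFrame available for later checks.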



In [0]:
X = df.drop(labels="tx_price", axis=1)

#### Show Shape of `X` to make sure we created the features DataFrame correctly:

It should have 1882 rows and 31 columns

In [129]:
X.shape

(1882, 31)

### Print Head Of Features Matrix `X`

In [130]:
X.head()

Unnamed: 0,beds,baths,sqft,year_built,lot_size,basement,median_age,married,college_grad,property_tax,insurance,median_school,num_schools,tx_year,lifestyle_avg,two_and_two,exterior_walls_Brick,exterior_walls_Brick veneer,exterior_walls_Combination,exterior_walls_Metal,exterior_walls_Other,exterior_walls_Siding (Alum/Vinyl),exterior_walls_Wood,exterior_walls_missing,roof_Asphalt,roof_Composition Shingle,roof_Other,roof_Shake Shingle,roof_missing,property_type_Apartment / Condo / Townhouse,property_type_Single-Family
0,1.0,1.0,584.0,2013.0,0.0,0.0,33.0,65.0,84.0,234.0,81.0,9.0,3.0,2013.0,1.493259,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
1,1.0,1.0,612.0,1965.0,0.0,1.0,39.0,73.0,69.0,169.0,51.0,3.0,3.0,2006.0,0.676598,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0
2,1.0,1.0,615.0,1963.0,0.0,0.0,28.0,15.0,86.0,216.0,74.0,8.0,3.0,2012.0,2.298254,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
3,1.0,1.0,618.0,2000.0,33541.0,0.0,36.0,25.0,91.0,265.0,92.0,9.0,3.0,2005.0,2.47365,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0
4,1.0,1.0,634.0,1992.0,0.0,0.0,37.0,20.0,75.0,88.0,30.0,9.0,3.0,2002.0,1.661371,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0


### Create A Series Called `y` To Hold All The Labels
**Note:** `y` is lowercase because it's a Series (a 1D vector), meaning it holds a single value per row.



In [0]:
y = df.loc[:, 'tx_price']

#### Show Shape of `y` to make sure we created the target series correctly:

It should have 1882 rows and a single column of values

**Note:** The shape will print as `(1882,)`. A Series is one-dimensional, so its shape has a single entry: 1882 rows holding one value each.

In [132]:
y.shape

(1882,)
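
The difference between a one-dimensional Series and a single-column DataFrame shows up directly in the shape. The sketch below uses a hypothetical three-row `toy_df` rather than the real dataset.

```python
import pandas as pd

# Toy frame (hypothetical) standing in for the real estate dataset
toy_df = pd.DataFrame({
    "tx_price": [295850.0, 216500.0, 279900.0],
    "beds": [1.0, 1.0, 1.0],
})

# Selecting with a single label returns a 1D Series
as_series = toy_df.loc[:, "tx_price"]

# Selecting with a list of labels returns a 2D DataFrame, even for one column
as_frame = toy_df.loc[:, ["tx_price"]]

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)
```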

### Print Head Of Label Series `y`

In [133]:
y.head()

0    295850.0
1    216500.0
2    279900.0
3    379900.0
4    340000.0
Name: tx_price, dtype: float64

<hr>
<br>

## Split Data Into Training And Testing 

Even though we will perform cross validation shortly, we still want to split the dataset into a training set and a testing set. We do this so that after we find the best estimator, that is, the model fitted on the training data with the best-performing hyper-parameter values, we can evaluate it on unseen testing data. This helps us understand whether the model is overfitting or underfitting. 

Additional Resources:
- [Learn more about overfitting and underfitting](https://github.com/SoftStackFactory/PythonDataScienceHandbook/blob/master/notebooks/05.03-Hyperparameters-and-Model-Validation.ipynb)
- [Interested in how to better fit a model?](https://github.com/SoftStackFactory/PythonDataScienceHandbook/blob/master/notebooks/05.04-Feature-Engineering.ipynb)

**Note:** The second resource is very informative about feature engineering, but we specifically want to emphasize the **Derived Features** section. 


#### Split Data Into Training And Testing Sets Using `train_test_split` function
 
Requirements: 
- pass in `0.20` as the argument for the `test_size` parameter
- pass in `1` as the argument for the `random_state` parameter

**Note:** We set the `random_state` parameter to a fixed value so that when we run this notebook multiple times, or on different computers, we receive the same split of data. This is important for re-running experiments and simulations.

[`train_test_split` function documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
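
As a quick check of the reproducibility claim, the sketch below splits a small synthetic array twice with the same `random_state` and compares the results. The arrays `X_toy` and `y_toy` are assumptions for illustration, not the real estate data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 10 rows, 2 feature columns
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same random_state
split_a = train_test_split(X_toy, y_toy, test_size=0.20, random_state=1)
split_b = train_test_split(X_toy, y_toy, test_size=0.20, random_state=1)

# Every returned piece (X_train, X_test, y_train, y_test) matches
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_tr, X_te, y_tr, y_te = split_a
print(same_split, X_tr.shape, X_te.shape)
```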

### Import `train_test_split` Function From Sklearn's Library

In [0]:
# Helper for splitting training and testing sets
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

<hr>

<br>

## Cross Validation Recap
One disadvantage of using a holdout set, in other words a single static split of the data for model validation, is that the held-out portion never contributes to training the model. If we split the data only once and do not use cross validation, part of the dataset is lost to training. This is not optimal and can cause problems, especially if the initial set of training data is small.

We will use cross validation in conjunction with splitting the data using the `train_test_split` function. Cross validation lets every row of the training set take a turn at both fitting and validating the model, while the separate testing set ensures the final model is scored on data it has never seen, which gives us a feel for how the model is really performing.
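
To see how cross validation rotates the validation role through the training data, here is a minimal sketch using scikit-learn's `KFold` on ten row indices. The choice of 5 folds and the array `row_indices` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical training rows, identified by index
row_indices = np.arange(10)

# Five folds: each fold holds out two rows for validation
kfold = KFold(n_splits=5)

# Count how many times each row lands in a validation fold
validation_counts = np.zeros(len(row_indices), dtype=int)
for train_idx, val_idx in kfold.split(row_indices):
    validation_counts[val_idx] += 1
    print("train:", train_idx, "validate:", val_idx)
```

Each row is used for validation exactly once and for training in every other fold, so no part of the training set is permanently set aside.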

<br>

### Perform A Cross Validation Grid Search 
A Grid Search Cross Validation is an exhaustive search over specified hyper-parameter values to find the most performant estimator with a specific hyper-parameter set. 

<br>

#### To perform a cross validation grid search on each machine learning algorithm we need to construct the following:
- Pipeline object, one for each algorithm 
- Model hyper-parameter dictionary, one for each algorithm

<br>

Think of a Pipeline object as a production line that transforms the `features`, or `X` DataFrame, before eventually fitting the model. Every step of the pipeline except the last must be a transformer object, which transforms the `features` data and must implement the following methods:
- `fit()` 
- `transform()`

Most transformers also provide `fit_transform()`, which performs both operations in one call.

<br>

In summary, the pipeline pre-processes any `features` data provided to it before fitting the model. It does this by calling the first step's `fit_transform` method. The output of that call is passed automatically to the next step's `fit_transform` method, and so on. Eventually the output of the last transformer's `fit_transform` method is passed to the `fit()` method of the estimator object.

#### Rundown of what is conceptually happening:

``` python 
def process_of_pipeline(self, Features, labels):
  # Start with the untransformed features
  Features_transformed = Features
  # Every step except the last is a transformer
  for name, transformer_object in self.steps[:-1]:
    Features_transformed = transformer_object.fit_transform(Features_transformed, labels)
  # The last step is the estimator; fit the model after all the transformation operations are complete
  estimator = self.steps[-1][1]
  estimator.fit(Features_transformed, labels)
```

**Note:** `Features` is capitalized because it's a 2D matrix, which can interchangeably be referred to as a DataFrame: many rows of data, each containing multiple columns of data points. `labels` should be thought of as a Series, because each row holds a single data point.
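
One way to convince yourself of the conceptual rundown above is to compare a fitted pipeline against the same steps performed by hand. This is a sketch on synthetic data; `X_toy`, `y_toy`, and the coefficients used to generate them are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 30 rows, 3 features, exact linear target
rng = np.random.RandomState(0)
X_toy = rng.rand(30, 3) * 100
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + 10

# Pipeline: scaling and fitting happen inside a single fit() call
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_toy, y_toy)

# Manual equivalent: fit_transform with the scaler, then fit the estimator
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)
manual_model = LinearRegression().fit(X_scaled, y_toy)

# When predicting, the already-fitted scaler only transforms; it is not refitted
pipeline_pred = pipeline.predict(X_toy[:5])
manual_pred = manual_model.predict(scaler.transform(X_toy[:5]))
print(np.allclose(pipeline_pred, manual_pred))
```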


[For a better understanding check out this video](https://www.youtube.com/watch?v=6zk6uQSuXqs)

[`GridSearchCV` Class's Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

<br>

### Make Pipelines, One For Each Model Algorithm

We will start by transforming the `features` matrix using the `StandardScaler` object's `fit_transform()` method, which standardizes each feature to zero mean and unit variance. We will then pass the transformed `features` matrix into the estimator's `fit` method along with the unmodified `labels` series as arguments.
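
As a small illustration of what the scaling step does, the sketch below compares `StandardScaler` with the formula `(x - mean) / standard deviation` computed by hand. The five `sqft` values are taken from the first rows of the dataset; everything else is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature column (sqft values from the first five rows of the dataset)
sqft = np.array([[584.0], [612.0], [615.0], [618.0], [634.0]])

# Scaling with scikit-learn
scaled = StandardScaler().fit_transform(sqft)

# The same computation by hand: subtract the column mean, divide by the column std
manual = (sqft - sqft.mean(axis=0)) / sqft.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

After scaling, the column has a mean of approximately 0 and a standard deviation of approximately 1, which keeps features measured in very different units (such as `sqft` and `beds`) on a comparable scale for the regularized models.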

#### Import The Following Libraries From Sklearn's Library
- `make_pipeline` function, which will help us create a pipeline object
- `StandardScaler` class, which will standardize the dataset
- `LinearRegression` estimator class 
- `Lasso` estimator class
- `Ridge` estimator class
- `ElasticNet` estimator class 

**Remember:** By convention, you can usually tell a class from a function by its capitalization

- A **class** will be capitalized
- A **function** will be lowercase

In [0]:
# Helper for pipelines
from sklearn.pipeline import make_pipeline

# Models/Estimators
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet

# Helper for standardizing the dataset
from sklearn.preprocessing import StandardScaler

Resources:
- [Linear Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [Lasso Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso)
- [Ridge Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge)
- [ElasticNet Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet)

### Create Pipeline Dictionary
Create a dictionary called `pipelines_dict` to hold multiple pipeline objects. The `key` of each entry should be the model's name and the `value` should be an instantiated pipeline object, which can be made by invoking the `make_pipeline` function. We will create a key and value pair for each estimator.

<br>

#### Understanding A Dictionary
A `dictionary` is a collection that stores data as `key: value` pairs, like a map. Unlike data types whose elements are single values, each element of a dictionary pairs a `key` with a `value`. Every `key` is unique and maps to exactly one value (the values themselves may repeat). A dictionary is useful when you want to locate a specific value by its key, as opposed to iterating over an array/list to find it. Picture having to cycle through a really long list of items just to find the one you were looking for. Is cycling through all those items really necessary? Scanning a list takes time proportional to its length, something we need to be mindful of when working with machine learning on big data. A dictionary lets us access a value directly by its unique key, without having to iterate, or cycle, through all elements in the collection.

It's a good time to mention that values of dictionaries can be dictionaries themselves. 

[Dictionary reference](https://www.geeksforgeeks.org/python-dictionary/)
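
The contrast between a keyed lookup and a list scan, and the idea of nested dictionaries, can be sketched as follows. The dictionary names and contents are hypothetical and only serve the illustration.

```python
# A dictionary keyed by model name (hypothetical contents)
model_notes = {
    "lasso": "L1 penalty",
    "ridge": "L2 penalty",
    "elastic_net": "L1 and L2 penalties",
}

# Direct lookup by key: no iteration needed
ridge_note = model_notes["ridge"]

# The same data as a list of pairs has to be scanned item by item
model_notes_list = list(model_notes.items())
found_note = None
for name, note in model_notes_list:
    if name == "ridge":
        found_note = note
        break

# Values can be dictionaries themselves
nested = {"ridge": {"alpha": [0.1, 1, 10]}}
ridge_alphas = nested["ridge"]["alpha"]

# Membership tests check the keys
has_lasso = "lasso" in model_notes
```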

<br> 

#### Use The `make_pipeline()` Function To Set `Values` Of The Dictionary:
- The `keys` should be the estimator's name. 
- Pass in an instantiated `StandardScaler` object as the first argument to the `make_pipeline` function
- Pass in an instantiated estimator object as the second argument to the `make_pipeline` function

**Note:** Both the `StandardScaler` and estimator objects will be instantiated with no arguments passed to their constructors

**Note:** We will make multiple dictionaries, so it is important to keep using the same `key`. That way we can later use one `key` to access values from multiple dictionaries.

In [0]:
pipelines_dict = {
    "linear_regression": make_pipeline(StandardScaler(), LinearRegression()),
    "lasso": make_pipeline(StandardScaler(), Lasso()),
    "ridge": make_pipeline(StandardScaler(), Ridge()),
    "elastic_net": make_pipeline(StandardScaler(), ElasticNet())
}

<br>

### Make Multiple Hyper-Parameter Dictionaries
One for each machine learning estimator algorithm.

**Hint:**
You can find out what hyper-parameters an estimator has by using the pipeline object's `get_params` method. This will return a dictionary. 


<br>

#### To view the pipeline's parameters, print the dictionary returned by a pipeline object's `get_params` method: 
```python
# linear regression pipeline's parameter dictionary
pipelines_dict['linear_regression'].get_params()
```
**Note:** We access a pipeline object by using bracket notation on the `pipelines_dict` dictionary. The key goes inside the brackets, and the desired value (the pipeline object) is returned.

[Accessing Dictionary Values Reference](https://realpython.com/python-dicts/#accessing-dictionary-values)
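
The parameter names returned by `get_params` follow a pattern: `make_pipeline` names each step after its lowercased class name, and a step's parameters are addressed as `<step name>__<parameter name>`. The sketch below illustrates this with a `Ridge` pipeline; the variable names are assumptions for illustration.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), Ridge())

# Step names are generated from the lowercased class names
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Read a nested parameter with the <step>__<parameter> key
default_alpha = pipeline.get_params()["ridge__alpha"]

# Set a nested parameter using the same naming convention
pipeline.set_params(ridge__alpha=5)
updated_alpha = pipeline.named_steps["ridge"].alpha
```

This is the same naming convention the hyper-parameter dictionaries rely on when they are handed to `GridSearchCV`.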


#### Print The Linear Regression Pipeline Parameters Using The Pipeline's `get_params` method

In [138]:
# linear regression pipeline's parameter dictionary
pipelines_dict['linear_regression'].get_params()

{'linearregression': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 'linearregression__copy_X': True,
 'linearregression__fit_intercept': True,
 'linearregression__n_jobs': None,
 'linearregression__normalize': False,
 'memory': None,
 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'steps': [('standardscaler',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('linearregression',
   LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False))],
 'verbose': False}


#### Invoke the `keys` method on the linear regression pipeline's parameter dictionary to view all of the pipeline's parameter names

```python
# linear regression pipeline's parameter dictionary keys
pipelines_dict['linear_regression'].get_params().keys()
```

In [139]:
pipelines_dict['linear_regression'].get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'linearregression', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'linearregression__copy_X', 'linearregression__fit_intercept', 'linearregression__n_jobs', 'linearregression__normalize'])

#### Invoke the `values()` method on the linear regression pipeline's parameter dictionary to view all the pipeline's parameter values
```python
# pipeline dictionary values
pipelines_dict['linear_regression'].get_params().values()
```

In [140]:
pipelines_dict['linear_regression'].get_params().values()

dict_values([None, [('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False))], False, StandardScaler(copy=True, with_mean=True, with_std=True), LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False), True, True, True, True, True, None, False])

### Create Multiple Dictionaries To Hold The Different Hyper-Parameters For Each Individual Estimator

For each dictionary:
- Name the dictionary with respect to the model name
- The `key` will be a unique estimator hyper-parameter name
- The `value` will be an array, filled with multiple unique values for that specific hyper-parameter

#### <span style="color:red"> Important Note: </span>
When creating an estimator's hyper-parameter dictionary we need to make sure we are using the pipeline object's dictionary `keys`, not the `keys` of the bare estimator object (e.g. `LinearRegression()`). 
 
##### Don't Do:

<del>

```python 
# Do not use the actual estimator object's keys
LinearRegression().get_params().keys()

linear_regression_hyper_parameter_dict = {
  'fit_intercept': [True, False] # Raises an error when we fit
}
```

</del>

If we do not use the pipeline's hyper-parameter names, fitting the model with the hyper-parameters above raises the following error:

```ValueError: Invalid parameter fit_intercept for estimator Pipeline```

This is because we need to use the pipeline's hyper-parameter naming convention instead, `<step name>__<parameter name>`: 

```python 
# Get Pipeline's hyper-parameter options instead due to naming conventions sklearn follows
pipelines_dict['linear_regression'].get_params().keys()

# Use Pipeline's hyper-parameter options instead due to the naming conventions sklearn follows
linear_regression_hyper_parameter_dict = {
    'linearregression__fit_intercept': [True, False],
}
```
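
The error can be reproduced without running a full grid search by calling the pipeline's `set_params` method directly, which is, as far as we understand it, the mechanism the grid search uses to apply each candidate. This is a minimal sketch; the exact wording of the error message may differ between scikit-learn versions.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), LinearRegression())

# The bare estimator name is not a parameter of the pipeline
try:
    pipeline.set_params(fit_intercept=False)
    error_message = None
except ValueError as error:
    error_message = str(error)

print(error_message)

# The prefixed name is accepted and reaches the LinearRegression step
pipeline.set_params(linearregression__fit_intercept=False)
fit_intercept_now = pipeline.named_steps["linearregression"].fit_intercept
```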


#### Create Linear Regression Hyper-Parameter Dictionary

In [141]:
print(pipelines_dict['linear_regression'].get_params().keys())

# Use Pipeline's hyper-parameter options instead due to the naming conventions sklearn follows
linear_regression_hyper_parameter_dict = {
    'linearregression__fit_intercept': [True, False],
}

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'linearregression', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'linearregression__copy_X', 'linearregression__fit_intercept', 'linearregression__n_jobs', 'linearregression__normalize'])


#### Create Lasso Hyper-Parameter Dictionary

In [142]:
print(pipelines_dict['lasso'].get_params().keys())

# Use Pipeline's hyper-parameter options instead due to the naming conventions sklearn follows
lasso_hyper_parameter_dict = {
    'lasso__random_state': [1],
    'lasso__fit_intercept': [True, False],
    'lasso__alpha': [0.01, 0.1, 0.5, 0.7, 1, 2, 5, 10],
}

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'lasso', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'lasso__alpha', 'lasso__copy_X', 'lasso__fit_intercept', 'lasso__max_iter', 'lasso__normalize', 'lasso__positive', 'lasso__precompute', 'lasso__random_state', 'lasso__selection', 'lasso__tol', 'lasso__warm_start'])


#### Create Ridge Hyper-Parameter Dictionary

In [143]:
print(pipelines_dict['ridge'].get_params().keys())

# Use Pipeline's hyper-parameter options instead due to the naming conventions sklearn follows
ridge_hyper_parameter_dict = {
    'ridge__random_state': [1],
    'ridge__fit_intercept': [True, False],
    'ridge__alpha': [0.01, 0.1, 0.5, 0.7, 1, 2, 5, 10],
}

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'ridge', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'ridge__alpha', 'ridge__copy_X', 'ridge__fit_intercept', 'ridge__max_iter', 'ridge__normalize', 'ridge__random_state', 'ridge__solver', 'ridge__tol'])


#### Create ElasticNet Hyper-Parameter Dictionary

In [144]:
print(pipelines_dict['elastic_net'].get_params().keys())

# Use Pipeline's hyper-parameter options instead due to the naming conventions sklearn follows
elastic_net_hyper_parameter_dict = {
    'elasticnet__random_state': [1],
    'elasticnet__fit_intercept': [True, False],
    'elasticnet__alpha': [0.01, 0.1, 0.5, 0.7, 1, 2, 5, 10],
    'elasticnet__l1_ratio': [0.01, 0.1, 0.5, 0.7, 0.8, 0.9, 1]
}

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'elasticnet', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'elasticnet__alpha', 'elasticnet__copy_X', 'elasticnet__fit_intercept', 'elasticnet__l1_ratio', 'elasticnet__max_iter', 'elasticnet__normalize', 'elasticnet__positive', 'elasticnet__precompute', 'elasticnet__random_state', 'elasticnet__selection', 'elasticnet__tol', 'elasticnet__warm_start'])
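
Because a grid search is exhaustive, the size of a grid is the product of the lengths of its value lists. The sketch below enumerates the ElasticNet grid with `itertools.product` to show how quickly the number of candidates grows; the fold count of 10 is an assumption matching the `cv=10` used later in this notebook.

```python
from itertools import product

# Same grid as the ElasticNet hyper-parameter dictionary above
elastic_net_hyper_parameter_dict = {
    'elasticnet__random_state': [1],
    'elasticnet__fit_intercept': [True, False],
    'elasticnet__alpha': [0.01, 0.1, 0.5, 0.7, 1, 2, 5, 10],
    'elasticnet__l1_ratio': [0.01, 0.1, 0.5, 0.7, 0.8, 0.9, 1],
}

# Build every combination of one value per hyper-parameter
names = list(elastic_net_hyper_parameter_dict.keys())
value_lists = list(elastic_net_hyper_parameter_dict.values())
candidates = [dict(zip(names, values)) for values in product(*value_lists)]

n_candidates = len(candidates)   # 1 * 2 * 8 * 7 = 112
n_folds = 10                     # assumed number of cross validation folds
n_fits = n_candidates * n_folds  # 1120 cross validation fits

print(n_candidates, n_fits)
print(candidates[0])
```

Adding one more value to any list multiplies the work, so it is worth keeping each list short and widening it only where the results suggest the best value lies near an edge of the grid.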


### Create a Dictionary To Group All Individual Model Hyper-Parameter Dictionaries 
Call this dictionary `hyper_parameters_dict`.
Each `key` should match the `key` name used in `pipelines_dict`, and each value should be that model's hyper-parameter dictionary. 

**Note:** It is important to use the same `key` names so that later, when we perform the cross validation grid search, we can loop through the `key` names once and use each key to access the matching values from multiple dictionaries.

For Example:
```python
hyper_parameters_dict = {
    "linear_regression": linear_regression_hyper_parameter_dict,
    ...
}
```


In [0]:
hyper_parameters_dict = {
    "linear_regression": linear_regression_hyper_parameter_dict,
    "lasso": lasso_hyper_parameter_dict,
    "ridge": ridge_hyper_parameter_dict,
    "elastic_net": elastic_net_hyper_parameter_dict
}

<hr>

### Perform A Cross Validation Grid Search With Each Model
Now that we have a `pipelines_dict` and a `hyper_parameters_dict` dictionary we can perform multiple cross validation grid searches, one for each model. 

We will do this by looping through a dictionary's keys to get access to the `key` names. We will then use the `key` names to access values from multiple dictionaries.

We will loop through the key names using the dictionary's `keys` method:

```python 
# Loop through model names
for model_name in pipelines_dict.keys():
  print(model_name)
```


[Looping Through Dictionary Keys Reference](https://realpython.com/iterate-through-dictionary-python/#iterating-through-keys)
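
The pattern of using one key to index several dictionaries can be sketched with placeholder values. The dictionaries `pipelines` and `grids` below are hypothetical stand-ins that hold strings and small grids instead of real pipeline objects.

```python
# Hypothetical stand-ins for pipelines_dict and hyper_parameters_dict
pipelines = {
    "linear_regression": "pipeline for linear regression",
    "lasso": "pipeline for lasso",
}
grids = {
    "linear_regression": {"linearregression__fit_intercept": [True, False]},
    "lasso": {"lasso__alpha": [0.1, 1]},
}

# A cheap sanity check: both dictionaries must share exactly the same keys
keys_match = pipelines.keys() == grids.keys()

# One loop over the key names gives access to the matching value in each dictionary
paired = {}
for model_name in pipelines.keys():
    paired[model_name] = (pipelines[model_name], grids[model_name])
    print(model_name, "->", grids[model_name])
```

Checking that the key sets match before looping turns a misspelled key into an obvious failure instead of a `KeyError` halfway through a long search.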

#### Print Key Names By Looping Through The `pipelines_dict` Keys
This exercise will show you that we can access `key` names from dictionaries. We will use these `key` names to perform a cross validation grid search for each model to find the highest performing model with specific hyper-parameters set. 

In [146]:
# Loop through model names
for model_name in pipelines_dict.keys():
  print(model_name)

linear_regression
lasso
ridge
elastic_net


<br> 


### Import `GridSearchCV` Class From Sklearn's Library

In [0]:
# Helper for cross-validation
from sklearn.model_selection import GridSearchCV

### Create A Dictionary To Hold All Of The Highest Performing Models
Create an empty dictionary called `best_performing_models_dict`. Once we have created this dictionary, we will create a loop to populate it. The dictionary will contain one entry per algorithm: the model fitted with the hyper-parameter set that maximizes that algorithm's cross validation performance. 

In [0]:
best_performing_models_dict = {}

### Create A Loop To Perform A Cross Validation Grid Search For Each Model To Populate the `best_performing_models_dict`


#### When looping, for each iteration of the loop, set the following parameters of the `GridSearchCV` object initialization:
- `estimator` parameter will be assigned its corresponding `pipeline` object instance as its argument
- `param_grid` parameter will be assigned its corresponding `hyper-parameter-dictionary` instance
- `return_train_score` parameter will be assigned the value `True`
- `refit` parameter will be assigned the value `True`
- `n_jobs` parameter will be assigned the value `-1` to use all available CPU cores

**Note:** We will index `pipelines_dict` and `hyper_parameters_dict` with the current iteration's model name. We will then add a key and value pair to `best_performing_models_dict`, using that same model name as the key and the fitted search's `best_estimator_` attribute as the value.

[`GridSearchCV` Class's Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
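
Before running the full loop, it may help to see a single grid search from start to finish. The sketch below is an assumption-laden miniature: it uses synthetic data from `make_regression`, a `Ridge` pipeline, a four-value `alpha` grid, and 5 folds, none of which come from the real estate dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical stand-in for the real dataset)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=1)

search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": [0.01, 0.1, 1, 10]},
    cv=5,
    return_train_score=True,
    refit=True,  # refit the best candidate on the whole training set
)
search.fit(X_tr, y_tr)

# Best hyper-parameters and their mean cross-validated score (R^2 for regressors by default)
print(search.best_params_)
print(search.best_score_)

# best_estimator_ is a fitted pipeline, ready to be scored on unseen data
test_r2 = search.best_estimator_.score(X_te, y_te)
print(test_r2)
```

With `refit=True`, `best_estimator_` has already been trained on all of the training data, which is why it can be stored and evaluated later without another call to `fit`.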


In [0]:
pipelines_dict = {
    "linear_regression": make_pipeline(StandardScaler(), LinearRegression()),
    # "lasso": make_pipeline(StandardScaler(), Lasso()),
    # "ridge": make_pipeline(StandardScaler(), Ridge()),
    # "elastic_net": make_pipeline(StandardScaler(), ElasticNet())
}

In [0]:
hyper_parameters_dict = {
    "linear_regression": linear_regression_hyper_parameter_dict,
    # "lasso": lasso_hyper_parameter_dict,
    # "ridge": ridge_hyper_parameter_dict,
    # "elastic_net": elastic_net_hyper_parameter_dict
}

In [151]:
# Loop through model names
for model_name in pipelines_dict.keys():

  # Print to screen which model is being fitted
  print("Searching for the best {} model:".format(model_name))

  # Preparation step for finding the best estimator with a specific hyper-parameter set.
  model = GridSearchCV(
      estimator=pipelines_dict[model_name],
      param_grid=hyper_parameters_dict[model_name],
      cv=10,
      n_jobs=-1,
      return_train_score=True,
      refit=True,
  )

  # Fit model to training data
  model.fit(X_train, y_train)

  # Print the model's mean performing score 
  print("{} model's cross validation mean performing score: {}".format(model_name, model.best_score_))

  # Populate the best_performing_models_dict with the model's best_estimator_ attribute
  best_performing_models_dict[model_name] = model.best_estimator_
  
  # Confirm that the model has been stored
  if best_performing_models_dict[model_name] is not None:
    print("Best {} model stored \n".format(model_name))

Searching for the best linear_regression model:
linear_regression model's cross validation mean performing score: 0.404873450982989
Best linear_regression model stored 





<br>

#### **Note:** The `model` instance is of type `GridSearchCV`


### Print the Cross Validation Results For the Last Model Run In The Cross Validation Search
Use the `cv_results_` attribute of the current model instance.

```python
# cross validation results
model.cv_results_
```

**Note:** The current model instance is the last model that was run in the cross validation grid search. To see the results for every model, print `model.cv_results_` inside the loop above.
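The raw `cv_results_` dictionary is hard to scan, so it can help to reduce it to one row per hyper-parameter candidate. The sketch below assumes a plain dictionary with the same keys as `cv_results_`, filled with values copied from the run in this section; `summarize_cv_results` is a hypothetical helper, not a scikit-learn function.

```python
# Plain-dict stand-in for model.cv_results_; values copied from the run in this section
cv_results_demo = {
    "params": [
        {"linearregression__fit_intercept": True},
        {"linearregression__fit_intercept": False},
    ],
    "mean_test_score": [0.40487345, -7.53932783],
    "mean_train_score": [0.45632874, -7.47182603],
    "rank_test_score": [1, 2],
}

def summarize_cv_results(cv_results):
    """Return one (rank, params, mean_test, mean_train) row per candidate, best first."""
    rows = zip(
        cv_results["rank_test_score"],
        cv_results["params"],
        cv_results["mean_test_score"],
        cv_results["mean_train_score"],
    )
    return sorted(rows, key=lambda row: row[0])

for rank, params, mean_test, mean_train in summarize_cv_results(cv_results_demo):
    print("rank {}: {} test={:.4f} train={:.4f}".format(rank, params, mean_test, mean_train))
```

The same helper should also accept the real `model.cv_results_`, since it only indexes the four keys shown.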

In [152]:
# cross validation results
display(model.cv_results_)

{'mean_fit_time': array([0.01072845, 0.01089587]),
 'mean_score_time': array([0.00249472, 0.00247881]),
 'mean_test_score': array([ 0.40487345, -7.53932783]),
 'mean_train_score': array([ 0.45632874, -7.47182603]),
 'param_linearregression__fit_intercept': masked_array(data=[True, False],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'params': [{'linearregression__fit_intercept': True},
  {'linearregression__fit_intercept': False}],
 'rank_test_score': array([1, 2], dtype=int32),
 'split0_test_score': array([ 0.38213891, -7.42620115]),
 'split0_train_score': array([ 0.45946899, -7.44700949]),
 'split1_test_score': array([ 0.37662457, -8.35941026]),
 'split1_train_score': array([ 0.45942683, -7.44587178]),
 'split2_test_score': array([ 0.47636875, -8.50996462]),
 'split2_train_score': array([ 0.45107483, -7.4396096 ]),
 'split3_test_score': array([ 0.38005467, -7.83495383]),
 'split3_train_score': array([ 0.4598655 , -7.35421953]),
 'split4_test

<hr>

## Evaluate Highest Performing Models Score Using Training Data
We will now find out how well each of the highest performing models, with its specific hyper-parameter set, performs on the training data.

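Use the `score` method of each stored best estimator, passing in the training data. When no `scoring` argument is given, as is the case here, the `GridSearchCV` class's `score` method delegates to the best estimator's `score` method.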

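**Note:** A score computed on the same data the model was trained on tends to be optimistic. We use it here only to rank the refit models. The cross validation scores above and the testing score computed later are the better guides to how well a model generalizes.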

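This `score` method uses the scoring function of the pipeline's final estimator, and different estimators can have different scoring functions. For regressors such as `LinearRegression` it is R² (the coefficient of determination).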
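To make the regression score concrete, here is a small sketch of R² computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The `r_squared` helper and the toy numbers are illustrative assumptions, not part of this notebook. It also shows why a score can be negative, as with the `fit_intercept=False` candidate in the `cv_results_` output: a model that does worse than always predicting the mean scores below zero.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true_demo = [1.0, 2.0, 3.0, 4.0]  # made-up targets for illustration

print(r_squared(y_true_demo, [1.0, 2.0, 3.0, 4.0]))      # perfect predictions
print(r_squared(y_true_demo, [2.5, 2.5, 2.5, 2.5]))      # always predicting the mean
print(r_squared(y_true_demo, [10.0, 10.0, 10.0, 10.0]))  # worse than the mean
```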

<br>

**Steps:**
1. Create a list called `highest_performing_models` to store the best models in case there are multiple models with the same score
2. Loop through model names
3. Store the score of the current iteration's best estimator on the training data in a temporary variable
4. Print the model name and the score held in the temporary variable
5. Store the name and score of the highest performing model, or models if several share the same highest score, in a tuple and then store that tuple inside the `highest_performing_models` list

**Note:** When storing the highest performing model or models make sure to handle the following cases:
- Handling of first highest performing model using training data 
- Handling if we find a model with a new high training score
- Handling of multiple models with the same highest training score
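The three cases above can also be covered in one pass by finding the maximum score first and then collecting every model that matches it. This is a compact alternative sketch, not the approach used in the cell below; `top_scorers` and the scores are hypothetical.

```python
def top_scorers(scores):
    """Return every (name, score) pair whose score equals the maximum score."""
    if not scores:
        return []
    best_score = max(scores.values())
    return [(name, score) for name, score in scores.items() if score == best_score]

# Hypothetical scores for illustration only
demo_scores = {"linear_regression": 0.45, "lasso": 0.47, "ridge": 0.47}
print(top_scorers(demo_scores))
```

Exact equality between floating point scores is rare in practice, so a tolerance-based comparison such as `math.isclose` may be a better fit if near-ties should count as ties.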

References:
- [`GridSearchCV` class's `score` method documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.score)
- [Tuples Reference](https://www.tutorialspoint.com/python/python_tuples.htm)


In [153]:
# Create a list to store the best models in case there are multiple models with the same score
highest_performing_models = []

# Loop through model names
for model_name in pipelines_dict.keys():
  # Store the score of the best estimator using the training data
  training_model_score = best_performing_models_dict[model_name].score(X_train, y_train)

  # Print the model name and score
  print("{} training model score: {}".format(model_name, training_model_score))

  ## Store Highest performing model or models if there are multiple models with the same highest score 
  # Handling of first highest performing model 
  if len(highest_performing_models) == 0:
    value = (model_name, training_model_score)
    highest_performing_models.append(value)
  
  # Handling if we find a model with a new high score
  elif training_model_score > highest_performing_models[0][1]:
    # Replace the list to wipe the old records
    highest_performing_models = [(model_name, training_model_score)]
  
  # Handling of multiple models with the same highest score
  elif training_model_score == highest_performing_models[0][1]:
    value = (model_name, training_model_score)
    highest_performing_models.append(value)

linear_regression training model score: 0.4541561568926857


### Print The Top Performing Model Name(s), Score, and Hyper-Parameter Set(s)
**Note:** Handle the following cases:
- A single model having the highest training score
- Multiple models having the same high training score



**Remember:** Each `value` of the `best_performing_models_dict` is actually a `Pipeline` object. Each `Pipeline` object has a `score` method; use this to evaluate the model on the training set.

<span style="color:red">**It's really important to always understand the type of object you are working with, use that object's documentation to guide you in the right direction**</span>

**Hint:**
Use the `highest_performing_models` list to access the name(s) of the highest performing model(s). Remember that each item in the list is a tuple of two values, the `model_name` and the `training_model_score`. Once you have the `model_name`, use it to index the `best_performing_models_dict` and invoke the `get_params` method to find out the hyper-parameter set.

[Indexing tuples Reference](https://www.tutorialspoint.com/python/python_tuples.htm)
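`get_params` returns every parameter of every step, which buries the few values that were actually tuned. One way to narrow it down is to keep only the keys that appear in the model's hyper-parameter dictionary. The sketch below assumes the grid is a single dictionary; `tuned_params` is a hypothetical helper, and the sample values are a subset of the `get_params` output shown in this notebook.

```python
def tuned_params(all_params, param_grid):
    """Keep only the parameters that were part of the grid search."""
    return {name: all_params[name] for name in param_grid if name in all_params}

# Subset of the get_params() output shown in this notebook
all_params_demo = {
    "standardscaler__with_mean": True,
    "standardscaler__with_std": True,
    "linearregression__copy_X": True,
    "linearregression__fit_intercept": True,
}
# Grid keys as they appear in the cv_results_ 'params' entries
param_grid_demo = {"linearregression__fit_intercept": [True, False]}

print(tuned_params(all_params_demo, param_grid_demo))
```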

In [154]:
# Handling of a single model having the highest score
if len(highest_performing_models) == 1:
  for model_name, training_model_score in highest_performing_models:
    print("{} is the highest performing model, with a training score of {}".format(model_name, training_model_score))
    print("{}'s highest performing model hyper-parameter set: \n {} \n".format(model_name, best_performing_models_dict[model_name].get_params()))

# Handling of multiple models having the same high score
else:
  print("Top performing models with a training score of {}: \n".format(highest_performing_models[0][1]))
  for model_name, training_model_score in highest_performing_models:
    print("{} model hyper-parameter set:".format(model_name))
    print("{}\n".format(best_performing_models_dict[model_name].get_params()))

linear_regression is the highest performing model, with a training score of 0.4541561568926857
linear_regression's highest performing model hyper-parameter set: 
 {'memory': None, 'steps': [('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False))], 'verbose': False, 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'linearregression': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False), 'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'linearregression__copy_X': True, 'linearregression__fit_intercept': True, 'linearregression__n_jobs': None, 'linearregression__normalize': False} 



<br>

### Question
What can we conclude from looking at all of the best estimators' scores?

### Answer:
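With all four pipelines enabled, the most performant model is the lasso algorithm. In the run shown above only the `linear_regression` pipeline is enabled, so it is reported as the highest performing model.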

<br>

#### **Important note:** 
The `best_estimator_`'s hyper-parameter values were chosen because of the data the model was trained on. If you manipulate the data and then re-fit the model, you might see different values.

<hr>

<br>

## Evaluate The Best Performing Model(s) Using The Testing Data

**Remember:** Each `value` of the `best_performing_models_dict` is actually a `Pipeline` object. Each `Pipeline` object has a `score` method; use this to evaluate the model on the testing set.

<span style="color:red">**It's really important to always understand the type of object you are working with, use that object's documentation to guide you in the right direction**</span>

**Hint:**
Use the `highest_performing_models` list to access the name(s) of the highest performing model(s). Remember that each item in the list is a tuple of two values, the `model_name` and the `training_model_score`. Once you have the `model_name`, use it to access the `Pipeline` object(s) in the `best_performing_models_dict`, then call the `score` method, passing in `X_test` and `y_test` as arguments.

[Indexing tuples Reference](https://www.tutorialspoint.com/python/python_tuples.htm)

In [155]:
# Handling of a single model having the highest score
if len(highest_performing_models) == 1:
  for model_name, training_model_score in highest_performing_models:
    print('Model name: {}'.format(model_name))
    print("Training score: {}".format(training_model_score))
    print("Testing score: {}".format(best_performing_models_dict[model_name].score(X_test, y_test)))

# Handling of multiple models having the same high score
else:
  print("Top performing models with a training score of {}: \n".format(highest_performing_models[0][1]))
  for model_name, training_model_score in highest_performing_models:
    print('Model name: {}'.format(model_name))
    print("Testing score: {}".format(best_performing_models_dict[model_name].score(X_test, y_test)))

Model name: linear_regression
Training score: 0.4541561568926857
Testing score: 0.46674832991442416


<hr>

## Make Predictions Using Highest Performing Model(s) and Testing Data
Now that we have the highest scoring fitted model(s), we can make predictions using the testing set.

Make predictions using the first ten observations of the testing data to get a feel of how well the highest performing model performed. We will be interested in the residual, or the difference between the true value and the predicted value. 


If more than one model shares the highest score, make one series of predictions per model and prefix each series' variable name with the model name.
For example:
```python
# Create predictions with testing data
lasso_test_predictions = pd.Series(best_performing_models_dict['lasso'].predict(X_test))
# Re-assign indexes
lasso_test_predictions.index = X_test.index

print("Lasso predictions:")
display(lasso_test_predictions.head(10))
print("True Values:")
display(y_test.head(10))


# Create predictions with testing data
ridge_test_predictions = pd.Series(best_performing_models_dict['ridge'].predict(X_test))
# Re-assign indexes
ridge_test_predictions.index = X_test.index

print("Ridge predictions:")
display(ridge_test_predictions.head(10))
print("True Values:")
display(y_test.head(10))
```

**Note:**
After re-assigning the index of the predictions series, each index label of the predictions series refers to the same observation as it does in the true values series.
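The index re-assignment matters because pandas aligns series on index labels during arithmetic. The following sketch illustrates this with the first three true values and predictions copied from the outputs in this notebook; the variable names are hypothetical.

```python
import pandas as pd

# First three true values and predictions, copied from the outputs in this notebook
y_true_demo = pd.Series([360000.0, 765000.0, 333250.0], index=[1347, 1868, 887], name="tx_price")
raw_predictions = pd.Series([426755.002081, 656771.002081, 493411.002081])  # default index 0, 1, 2

# Subtraction aligns on index labels, so mismatched labels produce only NaN
misaligned_residuals = y_true_demo - raw_predictions
print(misaligned_residuals)

# After copying the index, each prediction lines up with its observation
aligned_predictions = raw_predictions.copy()
aligned_predictions.index = y_true_demo.index
residuals = y_true_demo - aligned_predictions
print(residuals)
```

Without the shared index, the subtraction silently returns missing values instead of raising an error, which is easy to overlook.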

### Display the predictions using the testing set

In [205]:
# Create predictions with testing data
linear_regression_test_predictions = pd.Series(best_performing_models_dict['linear_regression'].predict(X_test))
# Re-assign indexes
linear_regression_test_predictions.index = X_test.index

print("Linear regression predictions:")
display(linear_regression_test_predictions.head(10))

Linear regression predictions:


1347    426755.002081
1868    656771.002081
887     493411.002081
650     584163.002081
102     386643.002081
1219    335059.002081
1165    357187.002081
282     463235.002081
1813    545915.002081
510     360867.002081
dtype: float64

### Compare Predictions To The True Values

In [157]:
print("True Values:")
display(y_test.head(10))

True Values:


1347    360000.0
1868    765000.0
887     333250.0
650     590000.0
102     369900.0
1219    429000.0
1165    384350.0
282     526275.0
1813    508000.0
510     306000.0
Name: tx_price, dtype: float64
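To put numbers on the comparison, the residuals for these ten observations can be computed directly. This is a sketch that uses plain Python lists with values copied from the two outputs above; the mean absolute error it prints covers only these ten rows and is not a score for the whole testing set.

```python
# Index labels, true values, and predictions copied from the two outputs above
labels = [1347, 1868, 887, 650, 102, 1219, 1165, 282, 1813, 510]
true_values = [360000.0, 765000.0, 333250.0, 590000.0, 369900.0,
               429000.0, 384350.0, 526275.0, 508000.0, 306000.0]
predictions = [426755.002081, 656771.002081, 493411.002081, 584163.002081, 386643.002081,
               335059.002081, 357187.002081, 463235.002081, 545915.002081, 360867.002081]

# Residual = true value - predicted value
residuals = [t - p for t, p in zip(true_values, predictions)]
for label, residual in zip(labels, residuals):
    print("{:>5}: {:>12.2f}".format(label, residual))

# Mean absolute error over these ten observations only
mean_absolute_error = sum(abs(r) for r in residuals) / len(residuals)
print("Mean absolute error: {:.2f}".format(mean_absolute_error))
```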

<hr>

<br>

### What else can you do to boost the model performance?

Iterate:
- Exploratory analysis
- Data cleaning 
- Feature engineering
- Get more data
- Create a loop that builds models with all combinations of features to find out which features matter most for accurately predicting the target value
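For the last item, `itertools.combinations` can enumerate the feature subsets to try. The sketch below only generates the subsets; the feature names are hypothetical, and fitting a model on each subset (for example by selecting those columns from `X_train`) is left out.

```python
from itertools import combinations

def feature_subsets(feature_names):
    """Yield every non-empty combination of the given feature names."""
    for size in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, size):
            yield subset

# Hypothetical feature names for illustration only
demo_features = ["beds", "baths", "sqft"]
for subset in feature_subsets(demo_features):
    print(subset)
```

With n features there are 2^n - 1 non-empty subsets, so an exhaustive search is only practical for a small number of features.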