# Practicing the Fundamentals of the Machine Learning Process

In this end-of-phase summary practice notebook, you will work through an end-to-end machine learning workflow, focusing on the fundamental concepts of machine learning theory and processes. The main emphasis is on modeling theory (not EDA or preprocessing), so we will skip over some of the data visualization and data preparation steps that you would take in an actual modeling process.

Most of Phase 3 has been focused on classification - here, you'll be solving a regression task instead.

## Your Task: Build a Model to Predict Blood Pressure

![stethoscope sitting on a case](images/stethoscope.jpg)

<span>Photo by <a href="https://unsplash.com/@marceloleal80?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Marcelo Leal</a> on <a href="https://unsplash.com/s/photos/blood-pressure?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

### Business and Data Understanding

Hypertension (high blood pressure) is a treatable condition, but measuring blood pressure requires specialized equipment that most people do not have at home.

The question, then, is ***can we predict blood pressure using just a scale and a tape measure***? These measuring tools, which individuals are more likely to have at home, might be able to flag individuals with an increased risk of hypertension.

[Researchers in Brazil](https://doi.org/10.1155/2014/637635) collected data from several hundred college students in order to answer this question. We will be specifically using the data they collected from female students.

The measurements we have are:

* Age (age in years)
* BMI (body mass index, weight in kilograms divided by the square of height in meters)
* WC (waist circumference in centimeters)
* HC (hip circumference in centimeters)
* WHR (waist-hip ratio)
* SBP (systolic blood pressure)

The chart below describes various blood pressure values:

<a title="Ian Furst, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Hypertension_ranges_chart.png"><img width="512" alt="Hypertension ranges chart" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8b/Hypertension_ranges_chart.png/512px-Hypertension_ranges_chart.png"></a>

### The Steps

#### 1. Perform a Train-Test Split

Load the data into a dataframe using pandas, separate the features (`X`) from the target (`y`), and use the `train_test_split` function to separate data into training and test sets. Do a bit of exploration to become familiar with the training data.

#### 2. Build and Evaluate a First Simple Model

Build a linear regression model with scikit-learn's `LinearRegression`, fit it on the training data only, and evaluate it using root mean squared error (RMSE), computed with the `mean_squared_error` function. Then use `cross_val_score` to simulate unseen data without actually using the holdout test set just yet.

#### 3. Compare to a Different Model Type

Fit a `DecisionTreeRegressor` model and compare it to the linear model. Evaluate it based on RMSE and use `cross_val_score` again, so that the choice between model architectures is not overfit to the test set.

#### 4. Choose Which Model to Use, then Iterate and Improve

Based on steps 2 and 3, and what you know from your initial exploration, decide whether you'd like to pursue a linear or tree-based model. Write out the justification of your decision.

- If you choose to continue with the linear regression model:

    - Use `PolynomialFeatures` to Reduce Underfitting: Apply a `PolynomialFeatures` transformer to give the model more ability to pick up on information from the training data. Test out different polynomial degrees until you have a model that is perfectly fit to the training data.
    - Use Regularization to Reduce Overfitting: Instead of a basic `LinearRegression`, use a `Ridge` regression model to apply regularization to the overfit model. In order to do this you will need to scale the data. Test out different regularization penalties to find the best model.
    
- If you choose to continue with the tree-based model:

    - Level Up to an Ensemble Method: Implement an example of both a bagged (`RandomForestRegressor` or `ExtraTreesRegressor`) and boosted (`AdaBoostRegressor` or `XGBRegressor`) model, and decide which one to tune. Remember that you're exploring without yet finding the optimal parameters at this point, so write out your justification for choosing one over another.
    - Find Optimal Hyperparameters: Use visualizations and `GridSearchCV` to tune the model you choose to reduce overfitting.

#### 5. Evaluate a Final Model on the Test Set

Process `X_test` and `y_test` appropriately in order to evaluate the performance of your final model on unseen data.

## 1. Perform a Train-Test Split

Before looking at the text below, try to remember: why is a train-test split the *first* step in a machine learning process?

.

.

.

A machine learning (predictive) workflow fundamentally emphasizes creating *a model that will perform well on unseen data*. We will hold out a subset of our original data as the "test" set that will stand in for truly unseen data that the model will encounter in the future.

We make this separation as the first step for two reasons:

1. Most importantly, we are avoiding *leakage* of information from the test set into the training set. Leakage can lead to inflated metrics, since the model has information about the "unseen" data that it won't have about real unseen data. This is why we always want to fit our transformers and models on the training data only, not the full dataset.
2. Also, we want to make sure the code we have written will actually work on unseen data. If we are able to transform our test data and evaluate it with our final model, that's a good sign that the same process will work for future data as well.
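
As a rough illustration of the first point, the sketch below fits a `StandardScaler` on the training rows only and then reuses it on the test rows. The data is a synthetic stand-in generated with numpy (not the lab's dataset), and the `_demo`, `X_tr` and `X_te` names are placeholders chosen for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 100 rows, 3 numeric features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = rng.normal(loc=120, scale=10, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Fit the transformer on the training rows only...
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# ...then reuse the already-fitted transformer on the test rows
X_te_scaled = scaler.transform(X_te)

# The learned means come from the training rows alone,
# so nothing about the test rows has leaked into the scaler
print(scaler.mean_)
print(X_tr.mean(axis=0))
```

If the scaler had been fit on `X_demo` as a whole, its means and variances would include the test rows, which is exactly the leakage described above.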

### Loading the Data

In the cell below, we import the pandas library and open the full dataset for you. It has already been formatted and subsetted down to the relevant columns.

In [None]:
# Run this cell without changes
import pandas as pd
df = pd.read_csv("data/blood_pressure.csv", index_col=0)
df.head()

### Exploring the Data 

Do a bit of exploration to see what this data contains.

In [None]:
# Code here to explore


### Identifying Features and Target

Recall that in this instance, we are trying to predict systolic blood pressure.

In [None]:
# Replace None with appropriate code

X = None
y = None

### Consider a Model-Less Baseline

If you had to make a model-less prediction of this target, what would you predict? How well would that model-less prediction perform, on average?
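
One common choice of model-less prediction is to always guess the mean of the target. The sketch below uses a handful of made-up values (not the lab's data) to show that the RMSE of that constant guess is the same as the standard deviation of the target.

```python
import numpy as np

# Hypothetical target values, for illustration only
y_demo = np.array([110.0, 118.0, 121.0, 125.0, 136.0])

# A model-less prediction: always guess the mean of the target
baseline_pred = np.full_like(y_demo, y_demo.mean())

# RMSE of that constant prediction
baseline_rmse = np.sqrt(np.mean((y_demo - baseline_pred) ** 2))

# The two printed values match: the RMSE of a mean-only guess
# is the (population) standard deviation of the target
print(baseline_rmse)
print(y_demo.std())
```

This gives a reference point: a fitted model is only useful if its RMSE is meaningfully lower than this number.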

In [None]:
# Code here to explore a model-less prediction


### Performing a Train-Test Split

[documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
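
For reference, here is a minimal sketch of `train_test_split` on a small synthetic dataframe. The column names mirror a few of the measurements listed earlier, but the values are random and the `demo` frame is purely hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in frame; the values are random, not real measurements
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.integers(18, 30, size=40),
    "BMI": rng.normal(23, 3, size=40),
    "SBP": rng.normal(115, 10, size=40),
})

# Features are every column except the target
X_demo = demo.drop(columns="SBP")
y_demo = demo["SBP"]

# random_state makes the split reproducible; test_size defaults to 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
```

The row indices of `X_tr` and `y_tr` stay aligned after the split, so features and target can be passed to a model together.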

In [None]:
# Code here to create your train-test split


## 2. Build and Evaluate a First Simple Model

For our first simple model (FSM), which will serve as our baseline model, we'll use a `LinearRegression` from scikit-learn ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)).

In [None]:
# Code here to create your baseline model


### Fitting and Evaluating the Model on the Full Training Set

In [None]:
# Code here to fit and evaluate your baseline model


Then, evaluate the model using root mean squared error (RMSE). To do this, first import the `mean_squared_error` function from scikit-learn ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)). Then pass in both the actual and predicted y values, along with `squared=False` (to get the RMSE rather than MSE).

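(The `squared` argument is not available in every version of scikit-learn: very old versions predate it, and it was deprecated in version 1.4 in favor of a separate `root_mean_squared_error` function. If `squared=False` raises an error in your environment, use numpy to take the square root of the MSE instead.)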
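
A version-independent way to get RMSE is to compute the MSE and take its square root with numpy. The sketch below uses made-up actual and predicted values purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values, for illustration only
y_true_demo = np.array([110.0, 120.0, 130.0])
y_pred_demo = np.array([112.0, 118.0, 133.0])

# MSE first, then the square root; this does not rely on the `squared` argument
mse = mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse)

print(rmse)
```

Because RMSE is in the same units as the target, it can be read directly as a typical size of the prediction error.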

In [None]:
# Code here to generate predictions

# Evaluate on your train set using RMSE


If you correctly evaluated on your train set, you saw how far off on average your model is on your training data.

But what about on *unseen* data?

We don't want to use `X_test` or `y_test` yet, because that would mean making modeling decisions based on this particular data split. Instead, let's use cross-validation to stand in for truly unseen data.

### Fitting and Evaluating the Model with Cross Validation

In the cell below, import `cross_val_score` ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)).

scikit-learn scorers follow the convention that a higher score is always better, so error metrics are exposed in negated form. That is why you'll need to use `scoring="neg_root_mean_squared_error"`, which returns the RMSE values with their signs flipped to negative. Then we take the average and negate it at the end, so the number is directly comparable to the RMSE number above.
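
The negate-at-the-end step can be sketched as follows, on synthetic data rather than the lab's dataset. The `X_demo` and `y_demo` names and the choice of `cv=5` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: a linear signal plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(80, 3))
y_demo = 120 + X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=5, size=80)

# Each fold's score is the RMSE with its sign flipped, so every value is <= 0
neg_rmse_scores = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    scoring="neg_root_mean_squared_error",
    cv=5,
)

# Average across folds, then negate to get back to a positive RMSE
cv_rmse = -neg_rmse_scores.mean()

print(neg_rmse_scores)
print(cv_rmse)
```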

In [None]:
# Import the relevant function

# Get the cross validated scores for our baseline model

# Display the average of the cross-validated scores


### Analysis of Baseline Model

How can you explain your model's scores, based on our business problem above? How can you explain your cross-validated RMSE score in the context of the actual data - which groups of patients, based on the chart provided at the top, are we likely to confuse? Did this do better or worse than a model-less prediction?

- 


## 3. Compare to a Different Model Type

For our next model, we'll use a `DecisionTreeRegressor` from scikit-learn (documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)).
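
When comparing against a tree, it helps to look at the training score and the cross-validated score side by side. The sketch below does this on synthetic data (not the lab's dataset) with an unconstrained tree; the variable names are placeholders for this example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a linear signal plus noise
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(80, 3))
y_demo = 120 + 4 * X_demo[:, 0] + rng.normal(scale=5, size=80)

# With default settings the tree keeps splitting until every leaf is pure
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_demo, y_demo)
train_rmse = np.sqrt(np.mean((y_demo - tree.predict(X_demo)) ** 2))

# Cross-validation scores the same kind of tree on rows it was not fit on
cv_scores = cross_val_score(
    tree, X_demo, y_demo, scoring="neg_root_mean_squared_error", cv=5
)
cv_rmse = -cv_scores.mean()

print(train_rmse)
print(cv_rmse)
```

On this synthetic data the tree memorizes the training rows, so the training RMSE is essentially zero while the cross-validated RMSE is not. A gap like that is a sign of overfitting.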

In [None]:
# Code here to create your tree-based model


### Fitting the Model

In [None]:
# Code here to fit and evaluate your tree-based model


### Evaluating the Model with Cross Validation

In [None]:
# Get the cross validated scores for our tree-based model

# Display the average of the cross-validated scores


### Analysis of Tree-based Model

Did this model do better or worse than your linear model? Do you think it better represents the underlying data, or is a linear model a better choice?

- 


## 4. Choose Which Model to Use, then Iterate and Improve

#### If you choose to continue with a linear regression model:

1) Use `PolynomialFeatures` to Reduce Underfitting: Apply a `PolynomialFeatures` transformer to give the model more ability to pick up on information from the training data. Test out different polynomial degrees until you have a model that is perfectly fit to the training data.
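
As a rough sketch of what trying different degrees can look like, the loop below expands two synthetic features with `PolynomialFeatures` and records the training RMSE for each degree. The data and the list of degrees are assumptions for illustration, not values recommended for the lab's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data: 30 rows, 2 features
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(30, 2))
y_demo = 120 + 4 * X_demo[:, 0] + rng.normal(scale=5, size=30)

results = []
for degree in [1, 2, 3, 4]:
    # include_bias=False because LinearRegression fits its own intercept
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X_demo)

    model = LinearRegression().fit(X_poly, y_demo)
    train_rmse = np.sqrt(np.mean((y_demo - model.predict(X_poly)) ** 2))

    # Record (degree, number of generated features, training RMSE)
    results.append((degree, X_poly.shape[1], train_rmse))

for degree, n_features, train_rmse in results:
    print(degree, n_features, round(train_rmse, 3))
```

The feature count grows quickly with the degree, and the training RMSE cannot get worse as features are added. That is why the training score alone is not enough to pick a degree.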

In [None]:
# Code here to work on polynomial features, if relevant


2) Use Regularization to Reduce Overfitting: Instead of a basic `LinearRegression`, use a `Ridge` regression model to apply regularization to the overfit model. In order to do this you will need to scale the data. Test out different regularization penalties to find the best model.
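
One possible way to organize this step, sketched on synthetic data, is to chain the polynomial expansion, the scaler, and the `Ridge` model with `make_pipeline` so that each cross-validation fold fits the transformers on its own training portion. The pipeline approach, the degree, and the list of `alpha` values are all assumptions made for this example; the lab does not prescribe them.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: a linear signal plus noise
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 3))
y_demo = 120 + 4 * X_demo[:, 0] + rng.normal(scale=5, size=60)

cv_rmse_by_alpha = {}
for alpha in [0.1, 1, 10, 100]:
    # Scaling matters because the ridge penalty depends on coefficient size
    pipe = make_pipeline(
        PolynomialFeatures(degree=3, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),
    )
    scores = cross_val_score(
        pipe, X_demo, y_demo, scoring="neg_root_mean_squared_error", cv=5
    )
    cv_rmse_by_alpha[alpha] = -scores.mean()

for alpha, cv_rmse in cv_rmse_by_alpha.items():
    print(alpha, round(cv_rmse, 3))
```

A larger `alpha` means a stronger penalty; comparing the cross-validated RMSE across values is one way to choose among them.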

In [None]:
# Code here to create a ridge model, if relevant


#### If you choose to continue with the tree-based model:

1) Level Up to an Ensemble Method: Implement an example of both a bagged (`RandomForestRegressor` or `ExtraTreesRegressor`) and boosted (`AdaBoostRegressor` or `XGBRegressor`) model, and decide which one to tune. Remember that you're exploring without yet finding the optimal parameters at this point, so write out your justification for choosing one over another.
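
A first-pass comparison might look like the sketch below, which scores one bagged and one boosted model with the same cross-validation setup. It uses synthetic data, picks `RandomForestRegressor` and `AdaBoostRegressor` only because both ship with scikit-learn, and leaves hyperparameters near their defaults; none of this is tuned for the lab's dataset.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: a linear signal plus noise
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(80, 3))
y_demo = 120 + 4 * X_demo[:, 0] + rng.normal(scale=5, size=80)

models = {
    "random forest (bagged)": RandomForestRegressor(n_estimators=50, random_state=42),
    "AdaBoost (boosted)": AdaBoostRegressor(n_estimators=50, random_state=42),
}

cv_rmse_by_model = {}
for name, model in models.items():
    # Same scoring and folds for both models, so the numbers are comparable
    scores = cross_val_score(
        model, X_demo, y_demo, scoring="neg_root_mean_squared_error", cv=5
    )
    cv_rmse_by_model[name] = -scores.mean()

for name, cv_rmse in cv_rmse_by_model.items():
    print(name, round(cv_rmse, 3))
```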

In [None]:
# Code here to train and evaluate ensemble methods, if relevant


2) Find Optimal Hyperparameters: Use visualizations and `GridSearchCV` to tune the model you choose to reduce overfitting.
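
The sketch below shows the general shape of a `GridSearchCV` run on synthetic data. The estimator and the small `param_grid` are hypothetical choices for illustration, not a recommended search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: a linear signal plus noise
rng = np.random.default_rng(6)
X_demo = rng.normal(size=(80, 3))
y_demo = 120 + 4 * X_demo[:, 0] + rng.normal(scale=5, size=80)

# Hypothetical grid: both parameters limit how closely trees can fit the data
param_grid = {
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 5],
}

grid = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X_demo, y_demo)

# best_score_ is a negated RMSE, so flip the sign to read it as an RMSE
print(grid.best_params_)
print(-grid.best_score_)
```

Every combination in the grid is cross-validated, so the search cost grows with the product of the list lengths.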

In [None]:
# Code here to tune hyperparameters, if relevant


### Choosing a Final Model

Decide on your best model, and explain why: is it a better representation of the underlying data? Does it have the best scores? Is it the most generalizable?

## 5. Evaluate a Final Model on the Test Set

Often our lessons leave out this step because we are focused on other concepts, but if you were to present your final model to stakeholders, it's important to perform one final analysis on truly unseen data to make sure you have a clear idea of how the model will perform in the field.

### Instantiating the Final Model

Unless you are using a model that is very slow to fit, it's a good idea to re-create it from scratch prior to the final evaluation. That way you avoid any artifacts of how you iterated on the model previously.

In [None]:
# Code here to instantiate and fit a new model with your best parameters


### Processing the Holdout Test Set

The training data for our final model may have been transformed during the preparation and earlier training iteration process, especially if you went with a linear model.

In the cell below, transform the test data in the same way, with the same transformer objects. Do NOT re-instantiate or re-fit these objects - just `transform`.
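
To make the "just `transform`" point concrete, here is a sketch on synthetic data with a hypothetical polynomial-plus-scaling setup. Your own final model may use different transformers, or none at all.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: a linear signal plus noise
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(80, 3))
y_demo = 120 + 4 * X_demo[:, 0] + rng.normal(scale=5, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Training side: fit each transformer, then fit the model
poly = PolynomialFeatures(degree=2, include_bias=False)
scaler = StandardScaler()
X_tr_final = scaler.fit_transform(poly.fit_transform(X_tr))
final_model = Ridge(alpha=1.0).fit(X_tr_final, y_tr)

# Test side: the same fitted objects, in the same order, with transform only
X_te_final = scaler.transform(poly.transform(X_te))
y_pred_demo = final_model.predict(X_te_final)

print(X_tr_final.shape, X_te_final.shape)
```

The transformed test set ends up with the same columns as the transformed training set, which is what the fitted model expects.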

In [None]:
# Code here to transform the test set to mimic your final train set


### Evaluating RMSE with Final Model and Processed Test Set

This time we don't need to use cross-validation, since we are using the test set. In the cell below, generate predictions for the test data, then use `mean_squared_error` with `squared=False` (or take the square root of the MSE, if your version of scikit-learn does not accept that argument) to find the RMSE for our holdout test set.

In [None]:
# Code here to evaluate your final model on your holdout set
# Make sure to grab your test predictions as y_pred_test so the next section works

y_pred_test = None

### Interpreting Our Results

So, we successfully arrived at some version of a final model, after iterating through a few approaches. But, can we recommend that this model be used for the purpose of predicting blood pressure based on these features?

Let's create a scatter plot of actual vs. predicted blood pressure, with the boundaries of high blood pressure indicated:

In [None]:
# Run this cell without changes
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plot
fig, ax = plt.subplots(figsize=(8,6))

# Seaborn scatter plot with best fit line
sns.regplot(x=y_test, y=y_pred_test, ci=None, truncate=False, ax=ax)
ax.set_xlabel("Actual Blood Pressure")
ax.set_ylabel("Predicted Blood Pressure")

# Add spans showing high blood pressure + legend
ax.axvspan(129, max(y_test) + 1, alpha=0.2, color="blue", label="actual high blood pressure risk")
ax.axhspan(129, max(y_pred_test) + 1, alpha=0.2, color="gray", label="predicted high blood pressure risk")
ax.legend();

Now the question is: how well did we do? Recall that our question was: ***can we predict blood pressure using just a scale and a tape measure?*** Do we answer that question? What's your interpretation of these results?

- 
