## Supervised Learning - Regression
In the first tutorial you worked on binary classification to predict whether a point should be one class or another. In the second, you extended to multiple classes to predict a digit that the hand-drawn image corresponded to. But what if you wanted to predict a number?

Let's say you wanted to predict a housing price. Knowing what you currently know, you might think about bucketizing prices. Maybe you end up with 100 buckets between \$400,000 and \$1,000,000. This would result in 100 classes where class 0 represents prices `[400,000, 406,000)`, class 1 represents `[406,000, 412,000)`, etc. But what do you do once you're in a bucket? You still need an actual value. You could randomly pick some value, but you might end up leaving money on the table. Also think about what this means. This would mean you've set a minimum and maximum range for a housing prediction. Your classifier would have to select something. What if a house is valued at \$12m and your model says it's worth \$1m?

This is where regression comes in. Regression works with continuous values, while classification works with discrete labels. The goal of regression is to predict a continuous value rather than a label. Because of this, the goal is to minimize the error - the difference between the predicted value and the real value.
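
To make the bucketing problem concrete, here is a minimal sketch of the 100-bucket scheme described above. The function names `price_to_bucket` and `bucket_range` are made up for this illustration, and clipping out-of-range prices into the first or last bucket is an assumption about how such a classifier would have to behave.

```python
# Hypothetical bucketing scheme from the text: 100 equal-width buckets
# covering [$400,000, $1,000,000).
LOW, HIGH, N_BUCKETS = 400_000, 1_000_000, 100
WIDTH = (HIGH - LOW) // N_BUCKETS  # 6,000 dollars per bucket


def price_to_bucket(price):
    """Map a price to a class index, clipping anything outside the range."""
    index = (price - LOW) // WIDTH
    return max(0, min(N_BUCKETS - 1, index))


def bucket_range(index):
    """Return the [start, end) price range that a class index stands for."""
    start = LOW + index * WIDTH
    return start, start + WIDTH


for price in (409_500, 12_000_000):
    bucket = price_to_bucket(price)
    print(price, "-> class", bucket, "covering", bucket_range(bucket))
```

The \$12m house lands in the last bucket, so the best a classifier could say is "somewhere just under \$1m". A regressor has no such ceiling.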

### Ames Housing Data
In this tutorial you will be working with the Ames housing market data. This is a very common dataset to work with when getting started with machine learning. Just like once upon a time you wrote a `hello world`, a `Fizz Buzz` or a `fibonacci` program, you will get familiar with these "starter" datasets. Half moon and MNIST (from your first two tutorials) are also common.

In this tutorial you will learn:<br/>
- How to load data from openml
- Data cleaning and preprocessing
- How to handle categorical variables
- Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
- Ensemble methods
- How to visualize decision trees
- One way to determine feature importance

Let's get started!

In [None]:
# This cell is meant to install dependencies within google colab
def _install_deps():
    try:
        if 'google.colab' in str(get_ipython()):
            print('Installing dependencies within Google Colab...')
            !wget https://github.com/Chrispresso/ML-for-coders/blob/main/requirements.txt?raw=True -O requirements.txt
            !python -m pip install -r requirements.txt
    except:
        pass

def _install_extra_deps():
    try:
        if 'google.colab' in str(get_ipython()):
            import requests
            # Grab the name of the notebook and download https://github.com/Chrispresso/ML-for-coders/tree/main/<filename>/requirements.txt
            filename = requests.get('http://172.28.0.2:9000/api/sessions').json()[0]['name']
            filename = filename[:filename.rindex('.')]
            notebook_specific_reqs = f"https://github.com/Chrispresso/ML-for-coders/tree/main/{filename}/requirements.txt?raw=True"
            !wget {notebook_specific_reqs} -O requirements_extra.txt
            !python -m pip install -r requirements_extra.txt
    except Exception as e:
        pass

try:
    if __cell_install_requirements:
        _install_deps()
        _install_extra_deps()
    else:
        pass
except:
    _install_deps()

__cell_install_requirements = False

### Loading Data
There are many publicly available datasets and servers where they reside. In this tutorial you will use `openml`. Specifically you will be using [this](https://www.openml.org/search?type=data&sort=runs&id=42165) dataset. Below you will load the dataset into a pandas dataframe.

In [None]:
import sklearn
sklearn.__version__

In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd

data = fetch_openml(data_id=42165, as_frame=True)
column_names = data.feature_names
df = pd.DataFrame(data.data, columns=column_names)
df['Price'] = data.target

And that's all that's needed to load the data. Luckily Python has many packages that make loading data fairly easy. I also wanted to introduce `openml` since it can be a good place to find datasets. Again, a lot of the problems you will work on can first be tested on publicly available datasets.

### Data Cleaning and Preprocessing
Data cleaning and data preprocessing are some of the most important things you'll do when creating a machine learning solution. Machine learning methods consume tons of data. If the data is bad, the model will be too. I can't possibly cover all the things you'll want to look at in terms of data cleaning and preprocessing in a single tutorial, so I'll be revisiting these throughout the tutorials.

Let's start by getting some information on our newly created [dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [None]:
df.info()

As you can see you have 80 features and the target price in the dataframe. Some of the features are floats and others are objects. You can also see that a number of features have missing values. Notice that there are a total of 1460 entries, so any feature that shows a value other than 1460 in the non-null count column contains some missing data.

The features that are `float64` are basically what you've seen in previous tutorials - some real number. Let's take a look at something like `GarageType` instead and get the different values.

In [None]:
df['GarageType'].unique()

There are seven categorical values. These need to get converted to a numerical value in order to be interpreted by a model. There are a number of ways to do this. The first is to use one-hot encoding. For `GarageType`, this would create a vector of length 7 (one slot per value) and place a `1` at the location representing the `GarageType` for that entry, and a `0` elsewhere. For instance, if a certain house had an `Attchd` garage, this would be represented by `[1, 0, 0, 0, 0, 0, 0]`.

Another option would be an ordinal encoder. This would replace the categorical value with an integer value representing the location in the original array. For instance, if a house had `Basment` garage, this would be represented by `6`.

Another option would be to slap on some weights to an embedding layer and allow a model to learn the embeddings of the categorical values. This is something we'll look at in the future, but is a bit complex for now. One of the things we're after is ensuring that all features get treated fairly. If we were to use the ordinal encoder, then we run into a problem where `CarPort` will have a higher value than `Attchd` simply because of the location in the array. So instead we'll be using a one-hot encoder to essentially say whether a certain feature value is on or off.
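
To see the difference between the two encodings, here is a small hand-rolled sketch. It is not how sklearn implements its encoders, and the four-category list and its ordering are made up for illustration (a real encoder decides its own ordering).

```python
# Hypothetical category order; a real encoder decides its own ordering.
categories = ["Attchd", "Detchd", "BuiltIn", "CarPort"]


def ordinal_encode(value):
    """Replace a category with its position in the category list."""
    return categories.index(value)


def one_hot_encode(value):
    """Return a vector with a 1 at the category's position and 0 elsewhere."""
    return [1 if category == value else 0 for category in categories]


for garage in ("Attchd", "CarPort"):
    print(garage, ordinal_encode(garage), one_hot_encode(garage))
```

With the ordinal encoding, `CarPort` becomes a bigger number than `Attchd` purely because of where it sits in the list. With one-hot encoding, every category gets a vector of the same "size", so no category looks larger than another.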

Cool, so now you have an idea of what you can do with categorical features. But before we handle categorical features we need to talk about the elephant in the room - missing values. What do we do with missing values? Let's discuss it by first looking at `FireplaceQu`.

In [None]:
import matplotlib.pyplot as plt

df_copy = df.copy()
df_copy['FireplaceQu'].value_counts(dropna=False).plot(kind='bar')

Now, one option we have is to remove entries with missing values. This isn't great in this case because most of the entries would be gone. We can also try to be intelligent and use `Fireplaces` to help fill in the missing values. We can assume that if they have no fireplaces, there won't be a fireplace quality. Let's see what happens.

In [None]:
df_copy['FireplaceQu'] = df_copy['FireplaceQu'].fillna('None')
df_copy['FireplaceQu'].value_counts(dropna=False).plot(kind='bar')

As you can see, making this change would keep the same distribution as before. That's good. You wouldn't want to make a change that resulted in `Fa` suddenly having a higher frequency than `Gd`. But now let's look at `Fence`.

In [None]:
df_copy['Fence'].value_counts(dropna=True).plot(kind='bar')

It's hard to know if any of the other features would give us insight into whether or not a house has a fence. We could also train a model to take other features and predict whether or not the house has a fence. This might work but there are a lot more missing values than filled in values, so we probably wouldn't gain much insight.

There's a couple more options. One option would be to sample from the distribution and randomly assign the missing values. This option can work well, but again, there's a lot of missing values here. A second option would be to just remove this feature entirely. This is the route I'm choosing to take. Let's start by removing the features that have too many missing values.

In [None]:
# $MODIFY 1
threshold = 0.85  # Percentage of values that must be filled in for a feature in order to include the feature
cols_to_remove = list(df.columns[df.isna().sum() > (1 - threshold) * len(df) ])
cols_to_remove

So even though we could replace the missing values in `FireplaceQu`, there's enough missing values that it gets flagged for removal. Honestly that's fine since there are so many missing values. 

Next we want to look for features that have primarily zero value. Before this we looked at missing values - values that just weren't entered. But in this case we're looking for values that have been entered and are zero. If there's enough values for a feature that are zero, again, we might not gain much insight into the meaning of that feature. For instance, you and I know that everyone loves a nice kitchen. If there were a `KitchenSqft` feature and 90% of data were 0, how would we expect a model to be able to know what to do when it finds non-zero values?

Sometimes in machine learning literature you'll find terms related to "signal to noise ratio". We want a good signal to noise ratio (SNR). This essentially means we expect to extract a good amount of signal compared to the background noise. If 90% of the data is 0 and 10% is non-zero, then the SNR is most likely going to be quite poor, and could hurt model performance. Let's make these modifications now.

In [None]:
# $MODIFY 2
threshold_zero_vals = 0.75  # Maximum fraction of a feature's values that may be zero before the feature is dropped

additional_cols_to_reomve = []

for col in df.select_dtypes(include='float').columns:
    count = df[col].count()  # Number of non-missing values
    # NaN never compares equal to 0, so missing values are not counted as zeros
    count_zeros = (df[col] == 0).sum()
    ratio_zeros = count_zeros / float(count)
    
    if ratio_zeros > threshold_zero_vals:
        additional_cols_to_reomve.append(col)

print(additional_cols_to_reomve)

Now remove these columns from the dataframe.

In [None]:
drop_cols = list(set(cols_to_remove) | set(additional_cols_to_reomve))
df_new = df.drop(drop_cols, axis=1).copy()
df_new.info()

This is looking a lot better. But we're not done yet! Yes, we removed features that have a large number of missing and zero values, but we also need to look at numeric features that still contain missing values.

In [None]:
df_new_numeric_cols = df_new.select_dtypes(include= 'float').columns
missing_num_cols = list(df_new_numeric_cols[df_new.select_dtypes(include= 'float').isna().sum() > 0])
missing_num_cols

Let's start with `GarageYrBlt`. You might be tempted to just fill the missing values with 0, but that has a high chance to shift the distribution. Let's take a look to see what I mean.

In [None]:
import seaborn as sns

if 'GarageYrBlt' in missing_num_cols:
    og_garage = df_new['GarageYrBlt'].rename('OG')#.value_counts().rename('OG')
    mod_garage = df_new['GarageYrBlt'].fillna(0.0).rename('MOD') #.value_counts().rename('MOD')
    garage = pd.concat([og_garage, mod_garage], axis=1)

    sns.displot(garage, kind="kde")

    print('og mean', og_garage.mean())
    print('og std', og_garage.std())
    print('mod mean', mod_garage.mean())
    print('mod std', mod_garage.std())


As you can see, the above change would drastically change the original distribution. Instead, let's see what happens if we just assume that if there isn't a year built for the garage, then it must have been built when the house was built.

In [None]:

if 'GarageYrBlt' in missing_num_cols:
    og_garage = df_new['GarageYrBlt'].rename('OG')
    # Fill missing garage years with the year the house was built.
    # fillna returns a new Series, so df_new itself is left untouched here.
    mod_garage = df_new['GarageYrBlt'].fillna(df_new['YearBuilt']).rename('MOD')
    garage = pd.concat([og_garage, mod_garage], axis=1)

    sns.displot(garage, kind="kde")

    print('og mean', og_garage.mean())
    print('og std', og_garage.std())
    print('mod mean', mod_garage.mean())
    print('mod std', mod_garage.std())

That's much better. The distribution changes slightly, but still maintains a lot of the original information. Now let's take a look at `MasVnrArea` and see what happens if we replace the missing values with 0.

In [None]:
if 'MasVnrArea' in missing_num_cols:
    og_mas_vnr_area = df_new['MasVnrArea'].rename('OG')
    mod_mas_vnr_area = df_new['MasVnrArea'].fillna(0.0).rename('MOD')
    mas_vnr_area = pd.concat([og_mas_vnr_area, mod_mas_vnr_area], axis=1)

    sns.displot(mas_vnr_area, kind="kde")

    print('og mean', og_mas_vnr_area.mean())
    print('og std', og_mas_vnr_area.std())
    print('mod mean', mod_mas_vnr_area.mean())
    print('mod std', mod_mas_vnr_area.std())


That doesn't seem to have much effect on it, so that modification will be fine.

Now let's commit to these modifications and apply them to our dataframe.

In [None]:
if 'GarageYrBlt' in missing_num_cols:
    garage_missing_idxs = df_new['GarageYrBlt'].isna()
    df_new.loc[garage_missing_idxs, 'GarageYrBlt'] = df_new['YearBuilt'][garage_missing_idxs]

if 'MasVnrArea' in missing_num_cols:
    df_new['MasVnrArea'] = df_new['MasVnrArea'].fillna(0.0)


Now we need to create a one-hot encoding for the categorical values. There are a couple of ways to do this. One would be to use `pd.get_dummies`, which creates dummy values (one-hot). Another way is to use `sklearn` and its `OneHotEncoder`. I'm choosing the latter because it follows the fit-then-transform approach that you'll often see with these models. First we need to fill the missing values of categories. We don't need to do anything fancy here, but the one-hot encoder will have issues with a value that's missing. A simple way to correct this is to replace all missing categorical values with "missing".

In [None]:
cat_cols = list(df_new.select_dtypes(include='object'))
df_new[cat_cols] = df_new[cat_cols].fillna('missing')

Now we can create a one-hot encoding.

In [None]:
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder().fit(df_new[cat_cols])

This can be a common approach in supervised learning. Less so these days, but most of the sklearn package follows a similar format, which is why I chose to use it. Sklearn is still a wildly popular package and for good reason. For these approaches you will generally see `.fit()`, `.fit_transform()` and `.transform()`.

`.fit()` simply fits a model to the data provided.<br/>
`.fit_transform()` will fit and then modify the data. For instance in this case `.fit()` is really only concerned with figuring out what the values of the categories are, but the transformation is what actually modifies the values.<br/>
`.transform()` takes the underlying model that has already been learned and applies the transformation to the data.

These probably seem like the same thing right now but there are subtle differences. If we were splitting this data ahead of time, we would run `.fit()` on the training data and then `.transform()` on it (or simply combine them into `.fit_transform()`). However, with the test set, you would use just `.transform()`. Why? Because the test set stands in for data the model hasn't seen. We can assume that its distribution is the same as the training data, but we can't know for sure, so anything the transformer learns should come from the training data only.
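
Here is a minimal sketch of that train/test distinction using a hand-rolled scaler. `ToyStandardScaler` is a made-up stand-in written for this illustration, not sklearn's implementation, and the square-footage numbers are invented.

```python
import statistics


class ToyStandardScaler:
    """Hand-rolled stand-in for a scaler; not sklearn's implementation."""

    def fit(self, values):
        # Learn the statistics from the data passed in (the training set).
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the statistics learned in fit(); nothing is re-learned here.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train_sqft = [1000.0, 1500.0, 2000.0]
test_sqft = [2500.0]

scaler = ToyStandardScaler()
train_scaled = scaler.fit_transform(train_sqft)  # learn from train, then scale it
test_scaled = scaler.transform(test_sqft)        # reuse the train statistics

print(train_scaled)
print(test_scaled)
```

The test value is scaled with the mean and standard deviation of the training data. Calling `fit()` on the test set would quietly swap in different statistics, and the model would see features on a different scale than the one it was trained on.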

Next let's standardize the numerical features. We want each feature/column to have similar importance when given to the model. If we left everything as-is, then features that naturally have higher values would be more affected by small weight changes. For instance, the model might bias towards square feet instead of number of bedrooms, since their ranges of values are so different.

In [None]:
from sklearn.preprocessing import StandardScaler

# First separate out the target (Price), then drop it and the Id column from the features.
target = df_new['Price'].copy()
df_new = df_new.drop(['Price', 'Id'], axis=1)

# Grab the numerical columns and fit a standard scaler
num_cols = list(df_new.select_dtypes(include='float'))
standard_scaler = StandardScaler().fit(df_new[num_cols])

One final look at the dataframe before we transform it:

In [None]:
df_new.head()

Now we can transform the data and create a new dataframe with that transformed data.

In [None]:
import numpy as np


# Combine the data by stacking along the columns
combined_data = np.column_stack((
    one_hot_encoder.transform(df_new[cat_cols]).toarray(),  # toarray() gives a plain ndarray rather than np.matrix
    standard_scaler.transform(df_new[num_cols])
))

# Create the new column names
new_cols = list(one_hot_encoder.get_feature_names_out()) + num_cols

X = pd.DataFrame(combined_data, columns=new_cols)
X.head()

Awesome! You're done with data preprocessing now. You just created one-hot encoded variables for your categorical data and used a standard scaler across numerical features. You're now left with more columns than you started with, but your data is now in a good state for machine learning models!

### MSE

In the past you were doing classification. The loss functions that you dealt with previously are concerned with whether or not a predicted class is the correct one. This changes a bit for predicting a real value. You now want to minimize the error of the model. In a sense you want to minimize the difference between the predicted value and the actual - this is where MSE comes in. MSE forms a loss that is equal to the average, over all training examples, of the squared difference.

Technically it looks like this:

$$
MSE = \frac{1}{n} \sum_{i=1}^{n}{(Y_i - \hat{Y}_i)}^2
$$

where $Y_i$ is the actual value and $\hat{Y}_i$ is the predicted.
Now get ready cause this will blow your mind. RMSE is the same thing.... but you take a square root after.

i.e. $$
RMSE = \sqrt{MSE}
$$

It's good to use RMSE when you want to know how well your model is fitting a dataset since it will be in the same units as the value you're predicting.

For instance, say I wanted to predict the number of cookies I can eat. If the actual is 12 but I predicted 14, then for that single prediction the MSE would be $(12 - 14)^2 = 4$ and the RMSE would be $\sqrt{4} = 2$. This is helpful because if I did this a number of times, RMSE would give me a typical deviation between predicted and actual, measured in cookies rather than cookies squared.
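
The same calculation over several predictions can be sketched in a few lines. The cookie counts below are made up for illustration.

```python
import math

y_actual = [12, 10, 8]  # made-up cookie counts
y_pred = [14, 9, 8]

# Squared difference for each prediction, then the mean of those
squared_errors = [(actual - pred) ** 2 for actual, pred in zip(y_actual, y_pred)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(f"squared errors: {squared_errors}")
print(f"MSE:  {mse:.3f} (cookies squared)")
print(f"RMSE: {rmse:.3f} (cookies)")
```

Notice that the miss of 2 contributes four times as much to the MSE as the miss of 1. Squaring the differences punishes large errors more heavily than small ones.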

### Training and Feature Importance

Let's start by splitting the data into a train and test set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0xC0FFEE)

In the previous two tutorials you used PyTorch to solve your problems. Technically you could use PyTorch here as well, but it's not needed. Sometimes it's faster or even better to apply simple solutions. Sklearn offers quite a few traditional and advanced machine learning techniques and we're going to explore a few. I'll introduce one and you can explore some more in the tasks section. The one I'll introduce is `RandomForestRegressor` which is an `ensemble` method. Ensemble methods work by having a number of learners - often trained with different parameters or data. These learners each have their own prediction, or model output. The goal of ensemble methods is to combine the predictions of these learners into a final prediction. Generally these are done either with `averaging` or with `boosting`.

With averaging methods, each learner is trained independently and then you average their predictions at the end. On the other hand, boosting methods are built sequentially, where learner 2 would try to reduce the bias of learner 1. This effectively makes learner 2 try to focus on harder examples. At the end the learners' predictions are combined with a weighted vote (for classification) or a weighted sum (for regression).
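
Both ideas can be sketched in plain Python. This is a toy illustration under made-up data, not how sklearn implements its ensembles: the "learners" in the averaging half are just invented numbers, and the boosting half fits one-split trees (stumps) to the leftover errors with an assumed learning rate of 0.5.

```python
def mse(targets, predictions):
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# --- Averaging: independent learners, combined by a plain mean ---
actual_price = 200_000
learner_predictions = [210_000, 188_000, 205_000]  # made-up learner outputs
averaged = sum(learner_predictions) / len(learner_predictions)
print("averaged prediction:", averaged)


# --- Boosting: each new learner is fit to the errors left by the previous ones ---
def fit_stump(xs, residuals):
    """Fit a one-split tree: one constant on each side of the best threshold."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


xs = [1, 2, 3, 4, 5, 6]
ys = [10, 20, 30, 40, 50, 60]
learning_rate = 0.5

predictions = [sum(ys) / len(ys)] * len(ys)  # start from the mean of the targets
mse_history = [mse(ys, predictions)]
for _ in range(5):
    residuals = [y - p for y, p in zip(ys, predictions)]
    stump = fit_stump(xs, residuals)
    predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    mse_history.append(mse(ys, predictions))

print("MSE after each boosting round:", [round(m, 1) for m in mse_history])
```

In the averaging half, the combined prediction is closer to the actual price than any single learner because their errors partly cancel out. In the boosting half, the error shrinks round by round because each stump only has to explain what the earlier ones got wrong.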

We'll be looking at an averaging method based on randomized decision trees called `RandomForest`.

In [None]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor()
regressor.fit(X_train, y_train)

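Let's check the score that the model gets. For sklearn regressors, `.score()` returns the coefficient of determination ($R^2$): 1.0 is a perfect fit, and 0.0 is what you'd get by always predicting the mean of the targets.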

In [None]:
print(regressor.score(X_train, y_train))
print(regressor.score(X_test, y_test))

If the train score comes out noticeably higher than the test score, the model has likely memorized parts of the training data. One way to get around this is to set a minimum number of samples per leaf node, which keeps each tree from growing a leaf for every individual house. A great thing about decision trees is that you can use them to know **why** your model is predicting something. At some point you might want to be able to explain why your model is choosing what it's choosing. Imagine you made a prediction about whether someone gets a car loan. If they don't, you don't really want to say "well my black-box model said no". It would be better to trace it through and find a reason to give the customer.

In [None]:
from sklearn import tree

regressor = RandomForestRegressor(min_samples_leaf=20)
regressor.fit(X_train, y_train)
print(regressor.score(X_train, y_train))
print(regressor.score(X_test, y_test))

estimator = regressor.estimators_[0]
plt.figure(figsize=(80,30))
# Passing a plain list of names; some sklearn versions reject a pandas Index here
tree.plot_tree(estimator, feature_names=list(X.columns), filled=True, rounded=True, fontsize=12)
plt.show()



You can see a higher score for the train set than the test set. This is a sign that there might be some overfitting. Let's look at feature importance. Luckily our `RandomForestRegressor` has a built in `feature_importances_` attribute that will tell us the score. Let's see the **least** important features first.

In [None]:
bottom_10 = list(regressor.feature_importances_.argsort())[:10]
print(np.array(new_cols)[bottom_10])

Well that's.... not helpful. We have our one-hot encoded variables in there but not being treated as a whole. What we want is a way to look at feature importance without the dummy values, i.e. `'Exterior2nd_CBlock', 'RoofMatl_ClyTile', 'RoofMatl_Roll', 'Exterior2nd_Other'` would instead be `'Exterior2nd', 'RoofMatl'`. Sklearn offers a utility here! It's called `permutation_importance`. Basically it takes a fitted model and computes a baseline score for it. Then it permutes (shuffles) a single feature column and calculates the score again. The difference between the two is the permutation importance of that feature.

Now, this is a bit tricky because if we permute the features as they are, we're going to end up in the same boat. We want to essentially permute the features before the categorical features become one-hot encoded. To do this we can use a `Pipeline`. Machine learning pipelines are generally used in production to help make things uniform and to simplify the process. We're going to use a `Pipeline` here to tell sklearn that we want it to perform the encoding and standardization. This way it can permute the features **before** running the pipeline.
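
To make the permutation idea concrete, here is a hand-rolled sketch on made-up data. It is not sklearn's implementation: `toy_model` is a hypothetical, already "trained" model that uses square footage and ignores the house id, and the score is $R^2$ computed by hand.

```python
import random


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def toy_model(row):
    """Hypothetical 'trained' model: uses sqft, ignores the house id."""
    sqft, house_id = row
    return 100 * sqft


rng = random.Random(0)
rows = [(rng.uniform(500, 3000), i) for i in range(50)]
targets = [100 * sqft + rng.gauss(0, 5_000) for sqft, _ in rows]

# Score on the untouched data
baseline = r2(targets, [toy_model(row) for row in rows])

importances = {}
for col, name in enumerate(["sqft", "house_id"]):
    shuffled_col = [row[col] for row in rows]
    rng.shuffle(shuffled_col)
    # Rebuild the rows with only this one column shuffled
    permuted_rows = [
        tuple(shuffled_col[i] if c == col else row[c] for c in range(len(row)))
        for i, row in enumerate(rows)
    ]
    permuted_score = r2(targets, [toy_model(row) for row in permuted_rows])
    importances[name] = baseline - permuted_score

print(importances)
```

Shuffling `sqft` wrecks the score, so it gets a large importance. Shuffling `house_id` changes nothing because the model never looks at it, so its importance is zero.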

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import *
from sklearn.linear_model import *

# Train/test split on the raw dataframe
X_train, X_test, y_train, y_test = train_test_split(df_new, target, random_state=0xC0FFEE)

In [None]:
# Categorical pipeline
cat_pipe = Pipeline(
    [
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)

# Numerical pipeline
num_pipe = StandardScaler()

# Define the column transformer which will run the numerical and categorical
# pipelines that we defined above
preprocessing = ColumnTransformer(
    [
        ('cat', cat_pipe, cat_cols),
        ('num', num_pipe, num_cols)
    ],
    verbose_feature_names_out=False
)

# The full pipeline - just the preprocessing and then the regressor
pipeline = Pipeline(
    [
        ('preprocess', preprocessing),
        # $MODIFY 3
        ('reg', RandomForestRegressor(random_state=0xC0FFEE))
    ]
)

# $MODIFY 4 - have some fun dawg
# Remember, set_params() takes in a step name followed by a double underscore
# and then the attribute you want to change. In this case I'm using 'reg'
# since I named the regressor 'reg'.
pipeline.set_params(reg__min_samples_leaf=20)

pipeline.fit(X_train, y_train)
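
One detail in the categorical pipeline above is `handle_unknown='ignore'`. Now that the encoder is fit on the training split only, the test split may contain a category that the encoder never saw. A minimal sketch of what that setting does, using made-up garage types:

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up garage types: "Basment" never appears in the training rows.
train = [["Attchd"], ["Detchd"], ["Attchd"]]
test = [["Detchd"], ["Basment"]]

encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_test = encoder.transform(test).toarray()

print(encoder.categories_)
print(encoded_test)
```

The unseen category is encoded as a row of all zeros instead of raising an error. With the default setting (`handle_unknown='error'`), the `transform` call would fail on the `Basment` row.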



Notice that in the pipeline we're not specifying a loss function to minimize like we did with PyTorch. Instead, this is usually passed as a `criterion` parameter (depending on the model you select) and most of them use MSE by default.

Now we can check how the model performs on the training and test data.

In [None]:
print('train score', pipeline.score(X_train, y_train))
print('test score', pipeline.score(X_test, y_test))
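
The number printed by `.score()` for a regressor is the coefficient of determination, $R^2$. A hand-rolled sketch of the formula, on made-up numbers, shows how to read it. The function name `r2_score_by_hand` is invented for this illustration.

```python
def r2_score_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]  # made-up targets

print(r2_score_by_hand(y_true, [2.5, 5.0, 7.5, 9.0]))  # close predictions
print(r2_score_by_hand(y_true, [6.0, 6.0, 6.0, 6.0]))  # always predicting the mean
```

A score near 1.0 means the predictions explain almost all of the variation in the targets. A score of 0.0 means the model is no better than always guessing the mean, and a negative score means it is worse than that.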

The goal is to get the test score somewhere around the train score. We don't want to be overfitting the train set. Let's now take a look at the feature importance calculated over the test set.

In [None]:
from sklearn.inspection import permutation_importance

result = permutation_importance(
    pipeline, X_test, y_test, n_repeats=5, random_state=0xC0FFEE, n_jobs=10
)

sorted_importances_idx = result.importances_mean.argsort()
most_important = list(sorted_importances_idx[-10:])[::-1]
least_important = list(sorted_importances_idx[:5])[::-1]
idxs = np.array(most_important + least_important)
importances = pd.DataFrame(
    result.importances[idxs].T,
    columns=df_new.columns[idxs],
)

ax = importances.plot.box(vert=False, whis=10, figsize=(12, 8))
ax.set_title("Permutation Importances (test set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in R^2 score")  # A regressor's score is R^2, not accuracy
ax.figure.tight_layout()


Now let's take a look at the feature importance of the training set. These should be similar. If the training set importance is drastically different, there's a good chance you overfit to the training set and the model didn't generalize well.

In [None]:
result = permutation_importance(
    pipeline, X_train, y_train, n_repeats=5, random_state=0xC0FFEE, n_jobs=10
)

sorted_importances_idx = result.importances_mean.argsort()
most_important = list(sorted_importances_idx[-10:])[::-1]
least_important = list(sorted_importances_idx[:5])[::-1]
idxs = np.array(most_important + least_important)
importances = pd.DataFrame(
    result.importances[idxs].T,
    columns=df_new.columns[idxs],
)
ax = importances.plot.box(vert=False, whis=10, figsize=(12, 8))
ax.set_title("Permutation Importances (train set)")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in R^2 score")  # A regressor's score is R^2, not accuracy
ax.figure.tight_layout()

Now let's take a look at some of the predicted values vs the actual ones.

In [None]:
from sklearn.metrics import mean_squared_error


y_pred = pipeline.predict(X_test)
y_actual = y_test.to_numpy()

nshow = len(y_actual)


df_outcome = pd.DataFrame(
    np.column_stack((
        y_pred[:nshow],
        y_actual[:nshow],
        y_pred[:nshow] - y_actual[:nshow]
    )),
    columns=['pred', 'actual', 'delta']
).round()


mse = mean_squared_error(y_actual, y_pred)
print(f'MSE: {mse}')
# Take the square root directly; the `squared` flag is deprecated in newer sklearn releases
print(f'RMSE: {np.sqrt(mse)}')

print('\n=== Predictions ===')
print(df_outcome)
print()

print('Average absolute difference', df_outcome['delta'].abs().mean())
print('Average difference', df_outcome['delta'].mean())

### Tasks

1. If you change `threshold` and re-run the code, how does it affect the model? Is there a threshold that seems better?

2. If you change `threshold_zero_vals` and re-run the code, how does it affect the model? Also, think about this. We are thresholding on values that are mostly zero. Should we also threshold on values that are mostly the same? For instance, if a feature has 90% of its values as `2`, does it really matter if we use it?

3. Try changing `RandomForestRegressor` to a different kind of regressor. A list of supported sklearn ensemble methods can be found [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble). Try to read through some of them and try out a few. How do they compare? A list of linear models and their regressors can be found [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model). Read through some of them and try out a few.

4. This can be done in conjunction with task 3, but try to change some of the parameters for the regressor(s) of your choice. How do these parameters affect the performance?

5. If you made it this far consider taking [the survey](https://www.surveymonkey.com/r/MBZCZMT). Results are anonymous and help me improve future tutorials.

### Conclusion

In this tutorial you saw how to load datasets using openml. You also saw how to go about data preprocessing and cleaning - a very important part of all machine learning applications. You learned several ways to handle categorical features and how to use one-hot encoding specifically to help. You learned about ensemble methods, fit a random forest regressor to your data, and even saw how to plot one of the decision trees that your model built! Finally you saw how to look at feature importance to measure how much a particular feature matters for model performance.

That's all for this one, see you next time!