In this exercise, you will apply what you learned in the two previous tutorials to handle **categorical variables** and **missing values** in a challenging dataset.

# Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.ml_level_2.ex3 import *
print("Setup Complete")

You will work with data from the [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course). 

![Ames Housing dataset image](./images/ex1_housesbanner.png)

The dataset has 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. Your goal is to predict the final sale price of each home.

Run the next code cell without changes to load the training and validation sets into `X_train`, `X_valid`, `y_train`, and `y_valid`.  The test set is loaded into `X_test`.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/home-data-for-ml-course/train.csv', index_col='Id')
X_test = pd.read_csv('../input/home-data-for-ml-course/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train.columns if
                        X_train[cname].nunique() < 10 and X_train[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train[my_cols]
X_valid = X_valid[my_cols]
X_test = X_test[my_cols]

Use the next code cell to print the first several rows of the data.

In [None]:
X_train.head()

If you try to train a random forest model on the data as-is, you'll get an error.  This is because the data still contains categorical variables (text-valued columns) and missing values, both of which must be preprocessed first.
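
To see where the problem columns are, it can help to list the text-valued columns, count their distinct values, and count missing entries per column. The cell below is a minimal sketch on a small toy frame; the values and the names `toy`, `categorical_cols`, `missing_counts`, and `cardinality` are made up for illustration and are not part of the exercise.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the competition files)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Text-valued (categorical) columns: anything that is not numeric
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of missing entries per column, keeping only columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Cardinality: number of distinct values in each categorical column
cardinality = toy[categorical_cols].nunique()

print(categorical_cols)
print(missing_counts)
print(cardinality)
```

The same three checks should work on `X_train` if you want to see which columns need attention before you start Step 1.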

In the first two steps of this exercise, you'll preprocess the data.  Then, in the third (and final!) step, you'll fit a random forest model and test its performance.

![Exercise Steps Visualized](./images/ex2_steps.png)

# Step 1: Deal with categorical variables

In this step, you'll preprocess the training and validation sets in `X_train` and `X_valid`.  The preprocessed datasets with encoded categorical variables should be stored in `X_train_1` and `X_valid_1`.  

In order for this step to be marked as correct, `X_train_1` and `X_valid_1` should have any categorical variables in a format that is recognizable by a scikit-learn random forest model.
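
One thing to watch for with one-hot encoding is that the training and validation sets can end up with different columns when a category appears in only one of them. The cell below is a small sketch of that situation on toy data; the values and the names `toy_train`, `toy_valid`, `enc_train`, `enc_valid`, `aligned_train`, and `aligned_valid` are hypothetical.

```python
import pandas as pd

# Toy splits (hypothetical values): "Wood" appears only in the training rows,
# "Slab" only in the validation rows
toy_train = pd.DataFrame({"Foundation": ["PConc", "CBlock", "Wood"],
                          "LotArea": [8450, 9600, 11250]})
toy_valid = pd.DataFrame({"Foundation": ["PConc", "Slab"],
                          "LotArea": [9550, 14260]})

# One-hot encode each split independently
enc_train = pd.get_dummies(toy_train)
enc_valid = pd.get_dummies(toy_valid)

# The two encodings disagree on columns until they are aligned;
# an inner join keeps only the columns present in both splits
aligned_train, aligned_valid = enc_train.align(enc_valid, join="inner", axis=1)

print(list(enc_train.columns))
print(list(enc_valid.columns))
print(list(aligned_train.columns))
```

The inner join discards the indicator columns for categories seen in only one split. That is a convenient choice for this sketch, not the only reasonable one.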

In [None]:
X_train_1 = ____
X_valid_1 = ____

# Check your answer
step_1.check()

In [None]:
#%%RM_IF(PROD)%
# One-hot encode categorical data
X_train_1 = pd.get_dummies(X_train)
X_valid_1 = pd.get_dummies(X_valid)

# Keep only the columns present in both datasets, in the same order
X_train_1, X_valid_1 = X_train_1.align(X_valid_1, join='inner', axis=1)

step_1.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_1.hint()
#_COMMENT_IF(PROD)_
step_1.solution()

# Step 2: Deal with missing values

Now, you'll need to deal with missing values.  Use `X_train_1` and `X_valid_1` as a starting point.  The preprocessed datasets with missing values either removed or imputed should be stored in `X_train_2` and `X_valid_2`.

In order for this step to be marked as correct, `X_train_2` and `X_valid_2` should not have any missing values (_and any categorical variables should be properly encoded -- but you get this for free from your work in Step 1!_).
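
As a reminder of how imputation behaves, the sketch below fills missing values with the column mean learned from the training rows and reuses that mean for the validation rows. The toy values and the names `toy_train`, `toy_valid`, `imputer`, `imputed_train`, and `imputed_valid` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy splits (hypothetical values)
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                          "LotArea": [8450, 9600, 11250]},
                         index=[11, 12, 13])
toy_valid = pd.DataFrame({"LotFrontage": [np.nan, 70.0],
                          "LotArea": [9550, 14260]},
                         index=[21, 22])

# Learn the fill values (column means) from the training rows only
imputer = SimpleImputer(strategy="mean")
imputed_train = pd.DataFrame(imputer.fit_transform(toy_train),
                             columns=toy_train.columns, index=toy_train.index)

# Reuse the training means for the validation rows
imputed_valid = pd.DataFrame(imputer.transform(toy_valid),
                             columns=toy_valid.columns, index=toy_valid.index)

print(imputed_train)
print(imputed_valid)
```

Passing `columns=` and `index=` when rebuilding the DataFrame keeps the original labels, since the imputer itself returns a plain array.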

In [None]:
X_train_2 = ____
X_valid_2 = ____

# Check your answer
step_2.check()

In [None]:
#%%RM_IF(PROD)%
from sklearn.impute import SimpleImputer

# Make copy to avoid changing original data (when imputing)
X_train_imp = X_train_1.copy()
X_valid_imp = X_valid_1.copy()

# Get names of columns with missing values
cols_with_missing = [col for col in X_train_1.columns if X_train_1[col].isnull().any()]

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_imp[col + '_was_missing'] = X_train_1[col].isnull()
    X_valid_imp[col + '_was_missing'] = X_valid_1[col].isnull()
    
# Imputation
my_imputer = SimpleImputer()
X_train_2 = pd.DataFrame(my_imputer.fit_transform(X_train_imp))
X_valid_2 = pd.DataFrame(my_imputer.transform(X_valid_imp))

# Imputation removed column names; put them back
X_train_2.columns = X_train_imp.columns
X_valid_2.columns = X_valid_imp.columns

step_2.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_2.hint()
#_COMMENT_IF(PROD)_
step_2.solution()

# Step 3: Test your performance

Now that your data is preprocessed, you're ready to plug it into a model!  You'll use the same `score_dataset()` function from the previous tutorial.  This function fits a random forest model on the training data and reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) of its predictions on the validation data.
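
If MAE is unfamiliar, it is the average of the absolute differences between the predictions and the true values. The cell below is a small worked example with made-up prices; `y_true`, `preds`, `mae_manual`, and `mae_sklearn` are illustrative names only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical sale prices and predictions
y_true = np.array([100_000, 200_000, 300_000])
preds = np.array([110_000, 180_000, 330_000])

# Absolute errors are 10000, 20000 and 30000, so their mean is 20000
mae_manual = np.mean(np.abs(y_true - preds))

# scikit-learn computes the same quantity
mae_sklearn = mean_absolute_error(y_true, preds)

print(mae_manual, mae_sklearn)
```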

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

Run the code cell below without changes to obtain your MAE score.  

In order for this step to be marked as correct, your MAE score must be less than 18000.  If your MAE score is too high, please revisit Step 1 and Step 2 above to test alternative preprocessing techniques.
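
One alternative you could try in Step 2 is dropping every column that has missing values instead of imputing them. The cell below sketches that on toy data; the values and the names `toy_train`, `toy_valid`, `toy_cols_with_missing`, `reduced_train`, and `reduced_valid` are hypothetical, and whether this helps or hurts your MAE is something to check rather than assume.

```python
import numpy as np
import pandas as pd

# Toy splits (hypothetical values)
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                          "LotArea": [8450, 9600, 11250]})
toy_valid = pd.DataFrame({"LotFrontage": [75.0, 70.0],
                          "LotArea": [9550, 14260]})

# Decide which columns to drop by looking at the training rows only
toy_cols_with_missing = [col for col in toy_train.columns
                         if toy_train[col].isnull().any()]

# Drop the same columns from both splits so their columns still match
reduced_train = toy_train.drop(columns=toy_cols_with_missing)
reduced_valid = toy_valid.drop(columns=toy_cols_with_missing)

print(toy_cols_with_missing)
print(list(reduced_train.columns))
```

Note that a column with gaps only in the validation rows would survive this step and would still need to be handled.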

In [None]:
score = score_dataset(X_train_2, X_valid_2, y_train, y_valid)
print('MAE score:', score)

# Check your answer
step_3.check()

In [None]:
#%%RM_IF(PROD)%
score = score_dataset(X_train_2, X_valid_2, y_train, y_valid)
print('MAE score:', score)

step_3.assert_check_passed()

In [None]:
# Line below will give you a hint 
#_COMMENT_IF(PROD)_
step_3.hint()

# Keep going

Move on to learn all about how to use **[pipelines](#$NEXT_NOTEBOOK_URL$)** to improve your machine learning code!