By encoding **categorical variables**, you'll obtain your best results thus far!

# Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.ml_level_2.ex3 import *
print("Setup Complete")

In this exercise, you will work with data from the [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course). 

![Ames Housing dataset image](https://i.imgur.com/lTJVG4e.png)

Run the next code cell without changes to load the training and validation sets into `X_train`, `X_valid`, `y_train`, and `y_valid`. The test set is loaded into `X_test`.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

Use the next code cell to print the first five rows of the data.

In [None]:
X_train.head()

Notice that the dataset contains both numerical and categorical variables.  You'll need to encode the categorical data before training a random forest model.

To compare different models, you'll use the same `score_dataset()` function from the tutorial.  This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.
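
As a reminder of what the metric measures, MAE is the average of the absolute differences between the actual and predicted values. The sketch below computes it by hand in plain Python; the helper name `mean_absolute_error_manual` and the prices are made up for illustration and are not part of the exercise or the housing data.

```python
# Hand-computed MAE on made-up numbers (hypothetical values, not from the housing data)
def mean_absolute_error_manual(y_true, y_pred):
    """Return the average of |actual - predicted| over all rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual_prices = [200000, 150000, 320000]
predicted_prices = [210000, 140000, 300000]

# Absolute errors are 10000, 10000 and 20000, so the MAE is 40000 / 3
mae = mean_absolute_error_manual(actual_prices, predicted_prices)
print(mae)
```

A lower MAE means the predictions are closer to the true sale prices on average, which is how the approaches in this exercise are compared.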

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Step 1: Drop columns with categorical data

...

In [None]:
# Fill in the lines below: drop columns in training and validation data
drop_X_train = ____
drop_X_valid = ____

# Check your answers
step_1.check()

In [None]:
#%%RM_IF(PROD)%%
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
step_1.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_1.hint()
#_COMMENT_IF(PROD)_
step_1.solution()

Run the next code cell to get the MAE for this approach.

In [None]:
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

# Step 2: Label Encoding

...

In [None]:
from sklearn.preprocessing import LabelEncoder

# Get names of columns containing categorical data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Fill in the lines below: apply label encoder to each column with categorical data
label_X_train = ____
label_X_valid = ____

# Check your answers
step_2.check()

In [None]:
#%%RM_IF(PROD)%%
# Get names of columns containing categorical data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
X = pd.concat([X_train, X_valid], axis=0)
for col in object_cols:
    label_encoder.fit(X[col])
    label_X_train[col] = label_encoder.transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

step_2.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_2.hint()
#_COMMENT_IF(PROD)_
step_2.solution()

Run the next code cell to get the MAE for this approach.

In [None]:
print("MAE from Approach 2 (Label Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

# Step 3: Investigating cardinality

What is cardinality? Run the code cell below to count the unique entries in each column with categorical data.

In [None]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])
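
To make the idea concrete: the cardinality of a column is its number of unique values, and one-hot encoding creates one new 0/1 column per unique value. The sketch below shows this in plain Python on a made-up `colors` column; it is a stand-in for illustration, not the scikit-learn encoder, and the values are not from the housing data.

```python
# Made-up categorical column (hypothetical values, not from the housing data)
colors = ["red", "green", "blue", "green", "red"]

# Cardinality = number of unique values in the column
categories = sorted(set(colors))
cardinality = len(categories)

# One-hot encoding: one 0/1 column per unique value, exactly one 1 per row
one_hot_rows = [[int(value == cat) for cat in categories] for value in colors]

print(categories)
print(cardinality)
for row in one_hot_rows:
    print(row)
```

The number of new columns equals the cardinality, so the encoded dataset gets wider with every additional unique value in the original column.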

fill in answer

In [None]:
# Fill in the line below: How many categorical columns have more than ten unique values?
high_cardinality_numcols = ____

# Fill in the line below: How many columns are needed to one-hot encode the 'Neighborhood' column?
num_cols_neighborhood = ____

# Check your answers
step_3.a.check()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_3.a.hint()
#_COMMENT_IF(PROD)_
step_3.a.solution()

thought question

In [None]:
#_COMMENT_IF(PROD)_
step_3.b.hint()

In [None]:
#_COMMENT_IF(PROD)_
step_3.b.solution()

# Step 4: One-Hot Encoding

...

In [None]:
# List containing names of low cardinality columns
low_cardinality_cols = [col for col in X_train.columns if
                        X_train[col].nunique() < 10 and
                        X_train[col].dtype == "object"]

...

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Use as many lines of code as you need!

OH_X_train = ____ # Your code here
OH_X_valid = ____ # Your code here

# Check your answer
step_4.check()

In [None]:
#%%RM_IF(PROD)%%
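# Apply one-hot encoder to each low-cardinality categorical column
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases only accept `sparse_output=False`
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)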
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

step_4.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
step_4.hint()
#_COMMENT_IF(PROD)_
step_4.solution()

Run the next code cell to get the MAE for this approach.

In [None]:
print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

# Step 5: Generate test predictions and submit your results

completely optional.

need to edit ... Once you have successfully completed Step 4, you're ready to submit your results to the leaderboard!  (_You also learned how to do this in the previous exercise.  If you need a reminder of how to do this, please use the instructions below._)
- Begin by clicking on the blue **COMMIT** button in the top right corner.  This will generate a pop-up window.  
- After your code has finished running, click on the blue **Open Version** button in the top right of the pop-up window.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
- Click on the **Output** tab on the left of the screen.  Then, click on the **Submit to Competition** button to submit your results to the leaderboard.
- If you want to keep working to improve your performance, select the blue **Edit** button in the top right of the screen. Then you can change your model and repeat the process.

In [None]:
# (Optional) Your code here

# Keep going

Continue to learn how to use **[pipelines](#$NEXT_NOTEBOOK_URL$)** to preprocess datasets with a mixture of categorical variables and missing values.