![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Feature Engineering

## Categorical features and dirty cats

In this project, you will work with dirty cats again to practice the techniques you learned in previous lessons.

**Remember**: always learn from your training data, then apply what you learned to transform your test data. Let's practice applying feature engineering to train and test sets.
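
As a minimal sketch of that "learn from train, apply to test" idea, the toy example below learns the set of known categories from a train series only and then checks a test series against it. The series values and the names `toy_train`, `toy_test`, and `known_categories` are hypothetical and are not taken from `dirty_cats.csv`.

```python
import pandas as pd

# Toy data (hypothetical values, not from dirty_cats.csv)
toy_train = pd.Series(["Red", "Green", "Blue", "Green"])
toy_test = pd.Series(["Green", "Purple"])  # "Purple" never appears in train

# "Learn" step: the known categories come from train only
known_categories = sorted(toy_train.unique())

# "Apply" step: use what was learned on test and flag anything unseen
test_is_known = toy_test.isin(known_categories)

print(known_categories)        # ['Blue', 'Green', 'Red']
print(test_is_known.tolist())  # [True, False]
```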

In [1]:
# Import necessary packages
import numpy as np
import pandas as pd

In [2]:
# Read in dirty_cats.csv
df = pd.read_csv('datasets/dirty_cats.csv')
df.head()

| | nom_0 | nom_1 | nom_2 | nom_3 | nom_4 | ord_0 | ord_1 | ord_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | Green | Triangle | Snake | Finland | Bassoon | 2 | Grandmaster | Cold |
| 1 | Green | Trapezoid | Hamster | Russia | Piano | 1 | Grandmaster | Hot |
| 2 | Blue | Trapezoid | Lion | Russia | Theremin | 1 | Expert | Lava Hot |
| 3 | Red | Trapezoid | Snake | Canada | Oboe | 1 | Grandmaster | Boiling Hot |
| 4 | Red | Trapezoid | Lion | Canada | Oboe | 1 | Grandmaster | Freezing |


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## How do we get dummies from our train set and apply them to our test set?

In [3]:
# Let's work with our nom_0 column again
# Get dummies for nom_0
nom_0_dummies = pd.get_dummies(df.nom_0, drop_first=True)

In [4]:
# Check out the dummy columns
nom_0_dummies

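| | BULE | Blue | Bule | GEREN | GREEN | Geren | Green | RED | Rde | Red | blue | green | red |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 299995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 299996 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 299997 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 299998 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |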


---
### nom_0 has 3 categories: Green, Blue and Red
But the dummies (because we dropped a column) should only have 2 of the 3.

> How do we make sure our test dummies have the same 2 out of 3 columns?
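
To make the effect of `drop_first=True` concrete, here is a small sketch on a clean, hypothetical three-category series (the name `toy_colors` is illustrative only). `pd.get_dummies` orders the categories alphabetically, so the first one, "Blue", is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Clean toy series with exactly three categories (hypothetical data)
toy_colors = pd.Series(["Green", "Blue", "Red", "Blue"])

full = pd.get_dummies(toy_colors)                      # one column per category
reduced = pd.get_dummies(toy_colors, drop_first=True)  # first category dropped

print(list(full.columns))     # ['Blue', 'Green', 'Red']
print(list(reduced.columns))  # ['Green', 'Red']

# A row of all zeros in `reduced` stands for the dropped category ("Blue")
blue_rows_all_zero = (reduced[toy_colors == "Blue"].sum(axis=1) == 0).all()
print(blue_rows_all_zero)     # True
```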

In [5]:
# Split the data into train and test
# There's no target in this toy dataset.  We can just split the data randomly
# Select 80% of the data randomly for train
msk = np.random.rand(len(df)) < 0.8
# Pull the train data out
df_train = df[msk].copy()
# Pull the test data out
df_test = df[~msk].copy()
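
The mask above is random, so each run produces a different split. If a repeatable split is wanted, one option is a seeded NumPy generator. This is a sketch under that assumption: the frame `demo` stands in for `df`, and the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# Stand-in frame (hypothetical); in the notebook this would be `df`
demo = pd.DataFrame({"nom_0": ["Red", "Green", "Blue", "Green", "Red"] * 20})

# A seeded generator makes the random mask repeatable across runs
rng = np.random.default_rng(seed=42)  # seed value is arbitrary
mask = rng.random(len(demo)) < 0.8

demo_train = demo[mask].copy()
demo_test = demo[~mask].copy()

# Every row lands in exactly one of the two splits
print(len(demo_train) + len(demo_test) == len(demo))  # True
```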

In [6]:
# Learn our dummy columns from our train data
# Use pd.get_dummies to get dummy columns for train data
df_train_dummies = pd.get_dummies(df_train.nom_0, drop_first=True)
# Save the column labels from the dummy data frame for use on test data
train_dummies = df_train_dummies.columns
train_dummies

Index(['BULE', 'Blue', 'Bule', 'GEREN', 'GREEN', 'Geren', 'Green', 'RED',
       'Rde', 'Red', 'blue', 'green', 'red'],
      dtype='object')

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Checking dummy columns

There are two things to check when applying the dummy columns learned from train to the test data:
1. What dummy columns (categories) are missing from test that were in train?
    - We'll add these and fill the values with 0
2. What dummy columns (categories) are in test that were not in train?
    - We'll drop these

In [7]:
# Let's use set operations to get the sets we need to answer the questions above
# Make the train_dummies column list into a set
train_set = set(train_dummies)
# Get the unique categories from test and create a set
test_set = set(df_test.nom_0.unique().tolist())
# cols to add exist in train, but not in test
cols_to_add = train_set.difference(test_set)
# cols to remove exist in test but not in train
cols_to_remove = test_set.difference(train_set)
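
One detail is easy to miss: the train set holds dummy column names (after `drop_first`), while the test set holds raw categories. The toy sketch below, with hypothetical labels and names, shows that the dropped baseline category therefore ends up in the "remove" set alongside any category that train never saw.

```python
# Dummy columns learned from train; "Blue" was dropped by drop_first
toy_train_cols = {"Green", "Red"}
# Raw categories found in test; "Teal" is a hypothetical unseen label
toy_test_cats = {"Blue", "Green", "Teal"}

# In train but not in test: add these to test, filled with 0
toy_cols_to_add = toy_train_cols.difference(toy_test_cats)
# In test but not in train: drop these from test
toy_cols_to_remove = toy_test_cats.difference(toy_train_cols)

print(sorted(toy_cols_to_add))     # ['Red']
print(sorted(toy_cols_to_remove))  # ['Blue', 'Teal']
```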

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Now that we know what columns to look for, let's apply what we know to our test set

In [8]:
# One-hot encode the test set (we want to start with all of the columns)
df_test_onehot = pd.get_dummies(df_test.nom_0)
# Add any cols that are missing -> fill values with 0
for col in cols_to_add:
    df_test_onehot[col] = 0
# Remove any cols that weren't in train
df_test_dummies = df_test_onehot.drop(columns=list(cols_to_remove))

# Check that the width (number of columns) of train dummies and test dummies match
df_test_dummies.shape[1] == df_train_dummies.shape[1]

True

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## One final thing

Once the number of columns matches, there is one more thing to check: that the columns are in the same order.

In [9]:
# We already have an ordered list of columns from train in train_dummies
# Let's apply that column order to test dummies
df_test_dummies = df_test_dummies[train_dummies]

# Check that the columns in train and test match
for train_col,test_col in zip(df_train_dummies.columns,df_test_dummies.columns):
    print(train_col,'<===>',test_col)

BULE <===> BULE
Blue <===> Blue
Bule <===> Bule
GEREN <===> GEREN
GREEN <===> GREEN
Geren <===> Geren
Green <===> Green
RED <===> RED
Rde <===> Rde
Red <===> Red
blue <===> blue
green <===> green
red <===> red
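
The add, drop, and reorder steps can also be combined. As an alternative sketch (not the approach used in this notebook), `DataFrame.reindex` with `columns=` and `fill_value=0` keeps only the train columns, in train order, and fills any missing ones with 0. The toy series are hypothetical, and `dtype=int` is passed so the dummies are integers regardless of pandas version.

```python
import pandas as pd

toy_train = pd.Series(["Green", "Blue", "Red", "Green"])
toy_test = pd.Series(["Blue", "Green", "Teal"])  # no "Red", plus unseen "Teal"

# Learn the dummy columns from train (baseline "Blue" is dropped)
toy_train_dummies = pd.get_dummies(toy_train, drop_first=True, dtype=int)
# One-hot encode test with all of its own categories
toy_test_onehot = pd.get_dummies(toy_test, dtype=int)

# Keep only train's columns, in train's order; missing columns become 0
aligned = toy_test_onehot.reindex(columns=toy_train_dummies.columns, fill_value=0)

print(list(aligned.columns))    # ['Green', 'Red']
print(aligned.values.tolist())  # [[0, 0], [1, 0], [0, 0]]
```

The unseen "Teal" row ends up as all zeros, the same encoding as the dropped baseline, which is worth keeping in mind when interpreting the result.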


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Incongruent data labeling

It looks like this data has some incongruent labeling: the same category appears under several spellings and capitalizations.

> Go back and fix the labeling, then redo the dummy columns

In [10]:
# Check unique categories in nom_0 using value_counts()
df_train.nom_0.value_counts()

Green    101849
Blue      77109
Red       61053
blue          5
green         4
Rde           4
Bule          3
RED           3
Geren         2
red           2
BLUE          2
BULE          2
GREEN         1
GEREN         1
Name: nom_0, dtype: int64

In [11]:
# Looks like making case uniform will solve part of the problem
# Apply .str.lower() to train and test
df_train['nom_0'] = df_train.nom_0.str.lower()
df_test['nom_0'] = df_test.nom_0.str.lower()

In [12]:
# Have another look at the train categories
df_train.nom_0.value_counts()

green    101854
blue      77116
red       61058
bule          5
rde           4
geren         3
Name: nom_0, dtype: int64

In [13]:
# Create a mapping using train data only to correct spelling
nom_0_map = {'bule':'blue','rde':'red','geren':'green','blue':'blue','red':'red','green':'green'}
# Use .map to apply the mapping to train data
df_train['nom_0_mapped'] = df_train.nom_0.map(nom_0_map)
# Use .map to apply the mapping to test data
df_test['nom_0_mapped'] = df_test.nom_0.map(nom_0_map)

In [15]:
# Check categories in train
df_train.nom_0_mapped.value_counts(dropna=False)

green    101857
blue      77121
red       61062
Name: nom_0_mapped, dtype: int64
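
Because the mapping is built from train only, test data could contain a spelling the map does not cover, and `Series.map` returns `NaN` for any value missing from the dictionary. A small sketch of how to surface such values, using a hypothetical map and a hypothetical new typo ("reed"):

```python
import pandas as pd

# Hypothetical map and labels, for illustration only
toy_map = {"bule": "blue", "blue": "blue", "rde": "red", "red": "red"}
toy_labels = pd.Series(["blue", "bule", "reed"])  # "reed" is not in the map

mapped = toy_labels.map(toy_map)

# Values that the map could not translate show up as NaN
unmapped = toy_labels[mapped.isna()].unique().tolist()
print(unmapped)  # ['reed']
```

Passing `dropna=False` to `value_counts()`, as in the cell above, serves the same purpose: any `NaN` count would reveal unmapped labels.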

In [16]:
# Now that we've cleaned up our labeling, let's make our dummy columns again
# Learn our dummy columns from our train data
df_train_dummies = pd.get_dummies(df_train.nom_0_mapped, drop_first=True)
# keep the dummy cols for use on test data
train_dummies = df_train_dummies.columns
train_dummies

Index(['green', 'red'], dtype='object')

In [17]:
# Assume we don't know whether our test categories match our train categories
# (this should always be your assumption), so get the cols to check
# Use set operations again to find the columns to add and remove
# Make the train_dummies column list into a set
train_set = set(train_dummies)
# Get the unique categories from test and create a set
test_set = set(df_test.nom_0_mapped.unique().tolist())
# cols to add exist in train, but not in test
cols_to_add = train_set.difference(test_set)
# cols to remove exist in test but not in train
cols_to_remove = test_set.difference(train_set)

In [19]:
# One-hot encode the test set (we want to start with all of the columns)
df_test_onehot = pd.get_dummies(df_test.nom_0_mapped)
# Add any cols that are missing
for col in cols_to_add:
    df_test_onehot[col] = 0

# Remove any cols that weren't in train
df_test_dummies = df_test_onehot.drop(columns=list(cols_to_remove))

# Check that the width of train dummies and test dummies match
df_test_dummies.shape[1] == df_train_dummies.shape[1]

True

In [24]:
# Make sure the order of the test columns matches the order of the train columns
df_test_dummies = df_test_dummies[train_dummies]
print("Train | Test\n")
for a, b in zip(df_train_dummies.columns, df_test_dummies.columns):
    print("{0} <==> {1}".format(a, b))

Train | Test

green <==> green
red <==> red


In [20]:
# Have a look at the train data
df_train_dummies.head()

| | green | red |
|---|---|---|
| 0 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |
| 5 | 0 | 0 |
| 6 | 1 | 0 |


In [21]:
# Have a look at the test data
df_test_dummies.head()

| | green | red |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 0 | 0 |
| 13 | 0 | 0 |
| 16 | 0 | 1 |
| 19 | 0 | 1 |
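
Another way to keep train and test columns in sync, sketched here on toy data rather than taken from this notebook, is to store the categories learned from train in a `pd.CategoricalDtype` and cast both series to it before calling `pd.get_dummies`. The names `toy_train`, `toy_test`, and `color_dtype` are hypothetical.

```python
import pandas as pd

toy_train = pd.Series(["green", "blue", "red", "green"])
toy_test = pd.Series(["blue", "green", "teal"])  # "teal" is unseen in train

# Learn the category list from train only
color_dtype = pd.CategoricalDtype(categories=sorted(toy_train.unique()))

# Casting test to the train dtype turns unseen labels into NaN
train_d = pd.get_dummies(toy_train.astype(color_dtype), drop_first=True, dtype=int)
test_d = pd.get_dummies(toy_test.astype(color_dtype), drop_first=True, dtype=int)

print(list(train_d.columns))   # ['green', 'red']
print(list(test_d.columns))    # ['green', 'red']
print(test_d.values.tolist())  # [[0, 0], [1, 0], [0, 0]]
```

Because `get_dummies` emits a column for every declared category, even ones absent from the data, both frames get the same columns in the same order without a separate add/drop step.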


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## They match!

You've successfully cleaned up labeling and created matching dummy columns for train and test!

<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98729912-57be3e80-237a-11eb-80e4-233ac344b391.png"></img>
</div>