![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Feature Engineering

## Categorical features and dirty cats

In this project, you will work with dirty cats again to practice the techniques you learned in previous lessons.

**Remember**: it's important to always learn from your training data and transform your test data.  Let's gain some practice applying feature engineering to train and test sets.

In [73]:
# Import necessary packages
import numpy as np
import pandas as pd

In [74]:
# Read in dirty_cats.csv
df_dirty = pd.read_csv('datasets/dirty_cats.csv')

df_dirty.head()

Unnamed: 0,nom_0,nom_1,nom_2,nom_3,nom_4,ord_0,ord_1,ord_2
0,Green,Triangle,Snake,Finland,Bassoon,2,Grandmaster,Cold
1,Green,Trapezoid,Hamster,Russia,Piano,1,Grandmaster,Hot
2,Blue,Trapezoid,Lion,Russia,Theremin,1,Expert,Lava Hot
3,Red,Trapezoid,Snake,Canada,Oboe,1,Grandmaster,Boiling Hot
4,Red,Trapezoid,Lion,Canada,Oboe,1,Grandmaster,Freezing


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## How do we get dummies from our train set and apply them to our test set?

In [75]:
# Let's work with our nom_0 column again
# Get dummies for nom_0
df_dummies = pd.get_dummies(df_dirty['nom_0'], drop_first=True).astype(int)

In [76]:
# Check out the dummy columns
df_dummies.columns

Index(['BULE', 'Blue', 'Bule', 'GEREN', 'GREEN', 'Geren', 'Green', 'RED',
       'Rde', 'Red', 'blue', 'green', 'red'],
      dtype='object')

---
### nom_0 has 3 categories: Green, Blue and Red
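With clean labels, the dummies would have only 2 of the 3, because we dropped a column. The extra columns in the output above come from inconsistent labels, which we fix later in this notebook.

> How do we make sure our test dummies have the same 2 out of 3 columns?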
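
A minimal sketch of what `drop_first=True` does, using a small made-up Series (the values are illustrative and not taken from dirty_cats.csv). With three clean categories, pandas orders the category names and drops the first one, so a row of all zeros stands for the dropped category.

```python
import pandas as pd

# Hypothetical toy column with three clean categories
colors = pd.Series(['Green', 'Blue', 'Red', 'Blue'], name='nom_0')

# Full one-hot encoding: one column per category
full = pd.get_dummies(colors).astype(int)

# drop_first=True removes the first category in sorted order ('Blue')
reduced = pd.get_dummies(colors, drop_first=True).astype(int)

print(list(full.columns))     # ['Blue', 'Green', 'Red']
print(list(reduced.columns))  # ['Green', 'Red']
```

Rows where both `Green` and `Red` are 0 represent `Blue`, the dropped baseline.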

In [77]:
# Split the data into train and test
# There's no target in this toy dataset.  We can just split the data randomly
# Select 80% of the data randomly for train
msk = np.random.rand(len(df_dirty)) < 0.8
# Pull the train data out
df_train = df_dirty[msk].copy()
# Pull the test data out
df_test = df_dirty[~msk].copy()
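
Because `np.random.rand` is not seeded here, each run produces a different split, so the categories that land in train and test can change between runs. A minimal sketch of a repeatable version, assuming a seed of 42 and a small stand-in DataFrame (both are illustrative choices, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Stand-in data; in the notebook this would be df_dirty
df_demo = pd.DataFrame({'nom_0': ['Green', 'Blue', 'Red', 'Green', 'Blue'] * 20})

# A seeded generator makes the random mask repeatable across runs
rng = np.random.default_rng(42)  # the seed value is an arbitrary choice
mask = rng.random(len(df_demo)) < 0.8

df_demo_train = df_demo[mask].copy()
df_demo_test = df_demo[~mask].copy()

# Every row lands in exactly one of the two sets
print(len(df_demo_train) + len(df_demo_test) == len(df_demo))
```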

In [63]:
'''
Why split the data?

Learn transformations on training data only.

    In real-world machine learning, your model learns patterns from the training data and is evaluated on unseen test data.
    Any preprocessing step (like dummy encoding) must follow the same principle:
        Learn how to transform data using only the training data.
        Apply those learned transformations to the test data.

What happens if you use the entire dataset (no split)?

    If you learn transformations from the entire dataset (e.g., when creating dummy variables), information from the test set can "leak" into the training process. This is called data leakage.
    It gives the model access to information it wouldn't normally have in a real-world scenario, leading to overly optimistic performance metrics.

Example of data leakage:

    Imagine your nom_0 column has the following categories:
        Training data: ['Green', 'Blue', 'Red']
        Test data: ['Green', 'Blue', 'Yellow']
    If you include the test data while creating dummies, the encoding picks up the Yellow category from the test data. In reality, Yellow should be treated as unseen during evaluation.

Key rule: split the data into training and testing sets so the test set remains unseen until evaluation.
'''
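
To make the Yellow example concrete, here is a small sketch with made-up data. Encoding train and test together produces a `Yellow` column that the training data alone would never create.

```python
import pandas as pd

# Illustrative data matching the Yellow example
train = pd.Series(['Green', 'Blue', 'Red'])
test = pd.Series(['Green', 'Blue', 'Yellow'])

# Leaky: categories are learned from train and test combined
leaky_cols = pd.get_dummies(pd.concat([train, test])).columns.tolist()

# Correct: categories are learned from train only
train_cols = pd.get_dummies(train).columns.tolist()

print(leaky_cols)  # ['Blue', 'Green', 'Red', 'Yellow']
print(train_cols)  # ['Blue', 'Green', 'Red']
```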

In [78]:
# Learn our dummy columns from our train data
# Use pd.get_dummies to get dummy columns for train data
df_train_dummies = pd.get_dummies(df_train['nom_0'], drop_first=True).astype(int)

# Save the columns from the dummy data frame as a list
train_dummies = df_train_dummies.columns
train_dummies

Index(['BULE', 'Blue', 'Bule', 'GREEN', 'Geren', 'Green', 'RED', 'Rde', 'Red',
       'blue', 'green', 'red'],
      dtype='object')

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Checking dummy columns

There are 2 things you need to check when applying what you learned about dummies in train to the test data:
1. What dummy columns (categories) are missing from test that were in train?
    - We'll add these and fill the values with 0
2. What dummy columns (categories) are in test that were not in train?
    - We'll drop these

In [79]:
# Let's use set operations to get the sets we need to answer the questions above
# Make the train_dummies column list into a set
train_set = set(train_dummies)
# Get the unique categories from test and create a set
test_set = set(df_test.nom_0.unique().tolist())
# cols to add exist in train, but not in test
cols_to_add = train_set.difference(test_set)
# cols to remove exist in test but not in train
cols_to_remove = test_set.difference(train_set)

In [80]:
print(train_set)
print(test_set)
print(cols_to_add)
print(cols_to_remove)

{'BULE', 'Blue', 'Geren', 'RED', 'Red', 'Bule', 'GREEN', 'blue', 'red', 'green', 'Green', 'Rde'}
{'Blue', 'BLUE', 'GEREN', 'Red', 'blue', 'red', 'Green', 'Rde'}
{'BULE', 'Geren', 'RED', 'Bule', 'GREEN', 'green'}
{'GEREN', 'BLUE'}


![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Now that we know what columns to look for, let's apply what we know to our test set

In [50]:
# One-hot encode the test set (we want to start with all of the columns)
df_test_onehot = pd.get_dummies(df_test['nom_0']).astype(int)
# Add any cols that are missing -> fill values with 0
for col in cols_to_add:
    df_test_onehot[col] = 0
# Remove any cols that weren't in train
df_test_dummies = df_test_onehot.drop(cols_to_remove,axis=1)

# Check that the width (number of columns) of train dummies and test dummies match
print(len(df_test_dummies.columns) == len(df_train_dummies.columns))

True


In [81]:
'''
pd.get_dummies on the test set creates a preliminary set of dummy variables, but it may include extra categories or miss some that exist in the training set.

Adding columns
Purpose: Handle columns (categories) that are in the training set but missing from the test set.
How:
    cols_to_add contains the names of dummy columns present in the training set but absent from the test set.
    For each missing column, a new column is added to df_test_onehot with all values set to 0.
Why: Ensures that the test set has all the dummy columns required by the model.

Removing columns
Purpose: Handle columns (categories) that are in the test set but not in the training set.
How:
    cols_to_remove contains the names of dummy columns present in the test set but not in the training set.
    These columns are dropped from df_test_onehot.
Why: Ensures that only the dummy columns learned during training are present in the test set.
'''
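
The add-and-drop steps can also be done in one call. This is a sketch, on made-up data, of an alternative that uses `DataFrame.reindex` with `columns=` and `fill_value=0`. Columns missing from test are added as zeros, columns not in the train list are dropped, and the result follows the train column order.

```python
import pandas as pd

# Hypothetical train/test columns for illustration
train_col = pd.Series(['blue', 'green', 'red', 'green'])
test_col = pd.Series(['blue', 'red', 'purple'])  # no 'green', extra 'purple'

# Learn the dummy columns from train only
train_dummy_cols = pd.get_dummies(train_col, drop_first=True).columns

# One-hot encode test, then align it to the train columns
test_onehot = pd.get_dummies(test_col).astype(int)
test_aligned = test_onehot.reindex(columns=train_dummy_cols, fill_value=0)

print(list(test_aligned.columns))  # ['green', 'red']
```

The explicit set-based version in this notebook is easier to inspect step by step. The `reindex` form is shorter once the idea is clear.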

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## One final thing

When the numbers of columns match, there is one more thing to check: that the columns are in the same order.

In [51]:
# We already have an ordered list of columns from train in train_dummies
# Let's apply that column order to test dummies
df_test_dummies = df_test_dummies[train_dummies]

# Check that the columns in train and test match
for train_col,test_col in zip(df_train_dummies.columns,df_test_dummies.columns):
    print(train_col,'<===>',test_col)

BULE <===> BULE
Blue <===> Blue
Bule <===> Bule
GEREN <===> GEREN
GREEN <===> GREEN
Geren <===> Geren
Green <===> Green
RED <===> RED
Rde <===> Rde
Red <===> Red
blue <===> blue
green <===> green
red <===> red


In [82]:
''' 
train_dummies is the ordered list of dummy columns from the training data.
The test DataFrame (df_test_dummies) is reindexed to match this order.
Why: Ensures that the test set dummy columns are aligned in the same order as the training set.

Order matters in machine learning:
Models treat input features as arrays. If the order of the columns differs between training and test sets, the model could receive incorrect inputs, leading to unpredictable behavior.
This step ensures both the structure and order of the columns are consistent.
'''

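
A small sketch of why order matters, using made-up frames. Once a DataFrame is converted to a NumPy array the column names are gone, so two frames with the same columns in a different order produce different arrays.

```python
import pandas as pd

# Same data, different column order (illustrative values)
train_like = pd.DataFrame({'green': [1, 0], 'red': [0, 1]})
test_like = pd.DataFrame({'red': [0, 1], 'green': [1, 0]})

# Same set of columns, but the positions differ
same_names = set(train_like.columns) == set(test_like.columns)
same_order = list(train_like.columns) == list(test_like.columns)
print(same_names, same_order)  # True False

# Reordering test with the train column list fixes the mismatch
test_fixed = test_like[train_like.columns]
print((test_fixed.to_numpy() == train_like.to_numpy()).all())  # True
```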

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## Incongruent data labeling

Looks like we have some incongruent data labeling in this data.

> Go back and fix the labeling then redo the dummy columns

In [52]:
# Check unique categories in nom_0 using value_counts()
df_train['nom_0'].value_counts()

nom_0
Green    101840
Blue      76917
Red       61090
blue          4
Rde           4
RED           3
GREEN         3
green         3
Bule          2
red           2
Geren         2
BULE          2
GEREN         2
BLUE          1
Name: count, dtype: int64

In [53]:
# Looks like making case uniform will solve part of the problem
# Apply .str.lower() to train and test (unlike a lambda, it also tolerates missing values)
df_train['nom_0'] = df_train['nom_0'].str.lower()
df_test['nom_0'] = df_test['nom_0'].str.lower()

In [54]:
# Have another look at the train categories
df_train['nom_0'].value_counts()

nom_0
green    101846
blue      76922
red       61095
bule          4
rde           4
geren         4
Name: count, dtype: int64

In [55]:
# Create a mapping using train data only to correct spelling
nom_0_map = {'bule':'blue','rde':'red','geren':'green','blue':'blue','red':'red','green':'green'}
# Use .map to apply the mapping to train data
df_train['nom_0_mapped'] = df_train['nom_0'].map(nom_0_map)
# Use .map to apply the mapping to test data
df_test['nom_0_mapped'] = df_test['nom_0'].map(nom_0_map)
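
One thing to watch with `.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. Here is a sketch with a made-up test column containing a misspelling (`'gren'`, a hypothetical value) that the mapping has never seen. Checking for nulls after mapping is one way to catch this.

```python
import pandas as pd

# Mapping as defined in the notebook
nom_0_map = {'bule': 'blue', 'rde': 'red', 'geren': 'green',
             'blue': 'blue', 'red': 'red', 'green': 'green'}

# Hypothetical test values; 'gren' is not a key in the mapping
test_values = pd.Series(['blue', 'rde', 'gren'])
mapped = test_values.map(nom_0_map)

# Labels the mapping did not cover show up as NaN
unmapped = test_values[mapped.isna()].unique().tolist()
print(unmapped)  # ['gren']
```

By default `pd.get_dummies` encodes a `NaN` row as all zeros, which with `drop_first=True` looks the same as the dropped category, so it is worth checking for unmapped labels before encoding.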

In [56]:
# Check categories in train
df_train['nom_0_mapped'].value_counts()

nom_0_mapped
green    101850
blue      76926
red       61099
Name: count, dtype: int64

In [57]:
# Now that we've cleaned up our labeling, let's make our dummy columns again
# Learn our dummy columns from our train data
df_train_dummies = pd.get_dummies(df_train['nom_0_mapped'], drop_first=True).astype(int)
# keep the dummy cols for use on test data
train_dummies = df_train_dummies.columns

In [58]:
# Assuming we don't know that our test categories match our train categories 
# (this should always be your assumption)
# Get the cols to check
# Let's use set operations to get the sets we need to answer the questions above
# Make the train_dummies column list into a set
train_set = set(train_dummies)
# Get the unique categories from test and create a set
test_set = set(df_test.nom_0_mapped.unique().tolist())
# cols to add exist in train, but not in test
cols_to_add = train_set.difference(test_set)
# cols to remove exist in test but not in train
cols_to_remove = test_set.difference(train_set)

In [59]:
# One-hot encode the test set (we want to start with all of the columns)
df_test_onehot = pd.get_dummies(df_test.nom_0_mapped).astype(int)
# Add any cols that are missing
for col in cols_to_add:
    df_test_onehot[col] = 0
# Remove any cols that weren't in train
df_test_dummies = df_test_onehot.drop(cols_to_remove,axis=1)

# Check that the width of train dummies and test dummies match
df_train_dummies.shape[1],df_test_dummies.shape[1]

(2, 2)

In [60]:
# Have a look at the train data
df_train_dummies.head()

Unnamed: 0,green,red
0,1,0
1,1,0
2,0,0
3,0,1
5,0,0


In [61]:
# Have a look at the test data
df_test_dummies.head()

Unnamed: 0,green,red
4,0,1
20,0,0
22,0,0
25,1,0
27,0,1


In [84]:
'''
Why Encoding Was Done First

The notebook might have gone straight to dummy encoding first for the following reasons:

Workflow Demonstration:
    Often, tutorials and notebooks are structured to show the standard steps (e.g., splitting, encoding, aligning columns) without anticipating data issues up front.
    Encoding first allows the notebook to introduce the dummy variable creation process as part of a typical workflow.

Reveal Problems Visually:
    While value_counts() could reveal inconsistent labels in the raw data, the issues might be more obvious when seen in the dummy columns.
    For example, value_counts() could show Green and green separately, but creating dummy variables highlights that they create redundant columns, which has a more direct impact on machine learning models.

Educational Intent:
    Encoding first emphasizes why clean, consistent data is critical for preprocessing. Spotting the problem during encoding makes it clear how data inconsistencies can directly affect feature engineering.
'''

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

## They match!

You've successfully cleaned up labeling and created matching dummy columns for train and test!

<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98729912-57be3e80-237a-11eb-80e4-233ac344b391.png"></img>
</div>