## Encoding, Transforming, and Scaling Features

This tutorial is based upon the textbook:

Walker, M. (2022). Data Cleaning and Exploration with Machine Learning. Packt Publishing Ltd.

Typically, machine learning algorithms require some form of encoding of variables.
Additionally, our models often perform better with scaling so that features with higher
variability do not overwhelm the optimization. We will show you how to use different
scaling techniques when your features have dramatically different ranges.
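
As a rough illustration of why scaling matters, the sketch below standardizes two features with very different ranges using z-scores. The column names and values are made up for the example, and in a real workflow the mean and standard deviation would be learned from the training rows only.

```python
import pandas as pd

# Hypothetical toy data: two features on very different scales
df = pd.DataFrame({
    "income": [20000.0, 45000.0, 60000.0, 120000.0],
    "gpa": [2.1, 3.0, 3.4, 3.9],
})

# Learn the mean and standard deviation of each column
means = df.mean()
stds = df.std(ddof=0)

# z-score: subtract the mean, then divide by the standard deviation
scaled = (df - means) / stds
print(scaled.round(2))
```

After this step both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.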

In [1]:
!pip install feature-engine category_encoders



In [2]:
# import pandas, the encoders, and the train/test split helper
import pandas as pd
import category_encoders as ce
import feature_engine.selection as fesel
from feature_engine.encoding import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from category_encoders.hashing import HashingEncoder

pd.options.display.float_format = '{:,.2f}'.format


In [3]:
nls97 = pd.read_csv("nls97b.csv")
nls97.set_index("personid", inplace=True)

ltpoland = pd.read_csv("ltpoland.csv")
ltpoland.set_index("station", inplace=True)
ltpoland.dropna(inplace=True)


Next, we create training and testing DataFrames for the features (X_train and
X_test) and the targets (y_train and y_test). In this example, wageincome
is the target variable. We set the test_size parameter to 0.3 to leave 30%
of the observations for testing. Note that we will only work with the Scholastic
Assessment Test (SAT) and grade point average (GPA) data from the NLS.

In [4]:
feature_cols = ['satverbal','satmath','gpascience',
  'gpaenglish','gpamath','gpaoverall']

# separate NLS data into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    nls97[feature_cols], nls97[['wageincome']],
    test_size=0.3, random_state=0)



In [5]:
X_train.info()
y_train.info()
X_test.info()
y_test.info()


<class 'pandas.core.frame.DataFrame'>
Index: 6288 entries, 574974 to 370933
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   satverbal   1001 non-null   float64
 1   satmath     1001 non-null   float64
 2   gpascience  3998 non-null   float64
 3   gpaenglish  4078 non-null   float64
 4   gpamath     4056 non-null   float64
 5   gpaoverall  4223 non-null   float64
dtypes: float64(6)
memory usage: 343.9 KB
<class 'pandas.core.frame.DataFrame'>
Index: 6288 entries, 574974 to 370933
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   wageincome  3599 non-null   float64
dtypes: float64(1)
memory usage: 98.2 KB
<class 'pandas.core.frame.DataFrame'>
Index: 2696 entries, 363170 to 629736
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   satverbal   405 non-null    float64
 1   satmath     40

## Removing redundant or unhelpful features

During the process of data cleaning and manipulation, we often end up with data that
is no longer meaningful. Perhaps we subsetted data based on a single feature value, and
we have retained that feature even though it now has the same value for all observations.
Or, for the subset of the data that we are using, two features have the same value. Ideally,
we catch those redundancies during our data cleaning. However, if we do not catch them
during that process, we can use the open source feature-engine package to help us.

Additionally, there might be features that are so highly correlated that it is very unlikely
that we could build a model that could use all of them effectively. feature-engine has
a method, DropCorrelatedFeatures, that makes it easy to remove a feature when it
is highly correlated with another feature.

### Warning - you are dropping data without testing its usefulness to your model. This is shown for demonstration; be careful that you are not losing valuable information.

Here we will work with land temperature data, along with the NLS data. Note
that we will only load temperature data for Poland here.

In [6]:
# remove a feature highly correlated with another
X_train.corr()


Unnamed: 0,satverbal,satmath,gpascience,gpaenglish,gpamath,gpaoverall
satverbal,1.0,0.73,0.44,0.44,0.38,0.42
satmath,0.73,1.0,0.48,0.43,0.52,0.48
gpascience,0.44,0.48,1.0,0.67,0.61,0.79
gpaenglish,0.44,0.43,0.67,1.0,0.6,0.84
gpamath,0.38,0.52,0.61,0.6,1.0,0.75
gpaoverall,0.42,0.48,0.79,0.84,0.75,1.0


Let's drop features that have a correlation higher than 0.75 with another feature.
We pass 0.75 to the threshold parameter of DropCorrelatedFeatures, set method
to 'pearson' to use Pearson coefficients, and set variables to None to evaluate
all the features. We use the fit method on the training data and then transform
both the training and testing data. The info method shows that the resulting
training DataFrame (X_train_tr) has all of the features except gpaoverall, which
has correlations of 0.793 and 0.844 with gpascience and gpaenglish, respectively.
DropCorrelatedFeatures evaluates features from left to right, so if gpamath and
gpaoverall are highly correlated, it drops gpaoverall. If gpaoverall had been to
the left of gpamath, it would have dropped gpamath instead.

https://feature-engine.trainindata.com/en/latest/user_guide/selection/DropCorrelatedFeatures.html

In [7]:
tr = fesel.DropCorrelatedFeatures(variables=None, method='pearson', threshold=0.75)
tr.fit(X_train)
X_train_tr = tr.transform(X_train)
X_test_tr = tr.transform(X_test)
X_train_tr.info()




<class 'pandas.core.frame.DataFrame'>
Index: 6288 entries, 574974 to 370933
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   satverbal   1001 non-null   float64
 1   satmath     1001 non-null   float64
 2   gpascience  3998 non-null   float64
 3   gpaenglish  4078 non-null   float64
 4   gpamath     4056 non-null   float64
dtypes: float64(5)
memory usage: 294.8 KB
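
To make the left-to-right behaviour described above concrete, here is a minimal pandas-only sketch. It approximates the idea and is not feature-engine's actual implementation; the helper name `drop_correlated` and the toy data are hypothetical.

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.75) -> pd.DataFrame:
    """Scan columns left to right; drop a column that is too correlated with a kept one."""
    corr = df.corr(method="pearson").abs()
    kept = []
    for col in df.columns:
        # NaN correlations compare as False, so such columns are kept
        if any(corr.loc[col, k] > threshold for k in kept):
            continue
        kept.append(col)
    return df[kept]

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],   # nearly a multiple of "a"
    "c": [5, 1, 4, 2, 3],    # only weakly related to "a" and "b"
})
reduced = drop_correlated(toy, threshold=0.75)
print(list(reduced.columns))  # → ['a', 'c']
```

Because "a" appears first it is retained, and "b" is dropped; reordering the columns would change which of the pair survives.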


Let's drop features that have the same values as other features.

In [8]:
# drop features that have the same values as another feature
tr = fesel.DropDuplicateFeatures()
tr.fit(X_train_tr)
X_train_tr = tr.transform(X_train_tr)
X_test_tr = tr.transform(X_test_tr)
X_train_tr.head()


Unnamed: 0_level_0,satverbal,satmath,gpascience,gpaenglish,gpamath
personid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
574974,,,,100.0,
894733,,,,,
452383,,,,,
670866,,,,,
353165,,,,,190.0
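
For intuition, duplicate columns can also be found with plain pandas. The sketch below is an illustration of the idea on made-up data, not a description of how DropDuplicateFeatures works internally.

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "x_copy": [1.0, 2.0, 3.0],  # identical to "x"
    "y": [3.0, 2.0, 1.0],
})

# Transpose so each column becomes a row, then flag rows that repeat an earlier one
is_dup = toy.T.duplicated().to_numpy()

# Keep only the columns that were not flagged
deduped = toy.loc[:, ~is_dup]
print(list(deduped.columns))  # → ['x', 'y']
```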


## Removing features that leak data!

* **Data leakage** happens when a model has access to information during training that would not be available when making predictions in production.
* It makes your offline results look “too good to be true” but leads to poor real-world performance.

**Example:** predicting loan repayment at the time of application. If you include a feature like “number of missed payments,” that information only becomes available *after* the loan is issued — so your model is secretly peeking into the future.

### Common Types of Leakage

* **Target leakage:** features that directly or indirectly contain the answer.
  *Example: using “loan\_status” to predict default.*
* **Temporal leakage (“time travel”):** features include data from after the prediction point.
  *Example: collections activity after the loan application.*
* **Preprocessing leakage:** fitting scalers, imputers, or encoders on the full dataset instead of only the training portion.
* **Entity leakage:** the same customer or loan appears in both train and validation splits, letting the model memorize.
* **Join leakage:** merging on the “latest snapshot” without restricting to information available at the prediction time.

### How to Detect Leakage

* **Check feature timestamps**: ask, “Was this known *as of prediction time*?” Anything later leaks.
* **Monitor performance jumps**: if a single feature or small group suddenly boosts accuracy/AUC to unrealistic levels, it may be leaking.
* **Inspect top features**: use feature importance or SHAP to see if suspicious “future” features dominate.
* **Cross-validate carefully**: if validation scores are much higher with random splits than with time-based splits, leakage is likely.
* **Audit your joins**: make sure you only join records that existed before the prediction timestamp.

### Best Practices to Prevent Leakage

* **Anchor to a timeline:** define an explicit “as-of date” for every observation, and only use features known at or before that time.
* **Use time-based splits:** validate on future data, not random samples.
* **Point-in-time joins:** when combining tables, only keep rows recorded before the as-of date.
* **Fit preprocessing on train only:** scaling, encoding, and imputing should learn from training folds only.
* **Use group-aware splits:** keep the same customer or entity entirely in train or test, not both.
* **Handle target encoding safely:** compute mean encoding within cross-validation folds, not across the whole dataset.
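
Two of these practices, group-aware splitting and fitting preprocessing on the training data only, can be sketched with scikit-learn as follows. The array `X` and the `customer_id` values are hypothetical stand-ins for real loan data.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 8 loans belonging to 4 customers
X = np.arange(16, dtype=float).reshape(8, 2)
customer_id = np.array([1, 1, 2, 2, 3, 3, 4, 4])

# Group-aware split: every customer lands entirely in train or entirely in test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=customer_id))

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler().fit(X[train_idx])
X_train_scaled = scaler.transform(X[train_idx])
X_test_scaled = scaler.transform(X[test_idx])

shared = set(customer_id[train_idx]) & set(customer_id[test_idx])
print("customers in both splits:", shared)
```

The test rows are scaled with statistics learned from the training rows, so nothing about the test set influences the preprocessing.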

### Example: Loan Default Prediction

**Allowed features (safe):**

* Customer’s income at application.
* Credit score from bureau snapshot before application date.
* Number of past loans repaid or defaulted *before* application.

**Leaking features (unsafe):**

* Whether the first installment was paid on time.
* Collections calls after the loan was granted.
* Charge-off date or recovery amounts.

Always ask: *Would this information be known at prediction time in production?*
* If not, remove or adjust the feature.
* Use time-aware splits and point-in-time joins to enforce this rule.
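
A point-in-time join for the loan example might look like the sketch below, which uses `pandas.merge_asof`. The tables, column names, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical tables: loan applications and dated credit-score snapshots
applications = pd.DataFrame({
    "customer_id": [1, 2],
    "application_date": pd.to_datetime(["2023-03-01", "2023-03-15"]),
}).sort_values("application_date")

scores = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "snapshot_date": pd.to_datetime(
        ["2023-01-10", "2023-04-01", "2023-02-20", "2023-03-20"]),
    "credit_score": [640, 700, 710, 590],
}).sort_values("snapshot_date")

# direction="backward" picks the latest snapshot at or before the application date
pit = pd.merge_asof(
    applications, scores,
    left_on="application_date", right_on="snapshot_date",
    by="customer_id", direction="backward",
)
print(pit[["customer_id", "application_date", "snapshot_date", "credit_score"]])
```

Each application is matched to the most recent score that already existed on the application date, so the later snapshots (700 and 590) are never used.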

Data leakage is one of the most common reasons for models failing in production, but it can be detected with careful thinking about timelines and feature sources.


## Encoding categorical features

1. **Most ML algorithms need numbers**
   Models like regression, trees, and neural networks can’t work directly with raw text or categories. They require numeric inputs.

2. **Numbers ≠ categories**
   If we code `female = 1` and `male = 2`, the model may mistakenly think "male is greater than female." Encoding makes it clear these are categories, not numeric scales.

3. **Ordinal features matter**
   Some categories *do* have an order (e.g., “low”, “medium”, “high”). Encoding should preserve that order so the model understands the ranking.

4. **High cardinality**
   If a categorical variable has many unique values (e.g., ZIP codes), we may need special strategies to reduce complexity.

## One-Hot Encoding

Turn each category into its own binary (0/1) column.
* Example: Feature `letter` with values `{A, B, C}` becomes three columns:

  * `letter_A`: 1 if A, 0 otherwise
  * `letter_B`: 1 if B, 0 otherwise
  * `letter_C`: 1 if C, 0 otherwise

These binary columns are often called **dummy variables**.

#### When to Use

* **Low-cardinality categorical features** (roughly ≤ 15 categories).
* Examples: gender, education level, marital status.

For ordinal features (with meaningful ranking), we may use **ordinal encoding** instead.
For high-cardinality features, we need other methods (covered later).

Encoding ensures that categorical variables are represented in a way that ML algorithms can use. One-hot encoding is the most common approach for variables with a small set of categories, because it’s simple and preserves category identity without implying false numeric relationships.
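
The `letter` example above can be reproduced in a few lines with `pandas.get_dummies`. This is a minimal sketch on toy data; the cells that follow use feature_engine's OneHotEncoder instead.

```python
import pandas as pd

df = pd.DataFrame({"letter": ["A", "B", "C", "A"]})

# One 0/1 column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(df, columns=["letter"], dtype=int)
print(dummies)
```

Passing `drop_first=True` would omit one of the dummies, which avoids a redundant column because its value is implied by the others.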



In [9]:
feature_cols =['gender','maritalstatus','colenroct99']
nls97demo = nls97[['wageincome'] + feature_cols].dropna()
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    nls97demo[feature_cols], nls97demo[['wageincome']],
    test_size=0.3, random_state=0)
X_demo_train.head()

Unnamed: 0_level_0,gender,maritalstatus,colenroct99
personid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
736081,Female,Married,1. Not enrolled
832734,Male,Never-married,1. Not enrolled
453537,Male,Married,1. Not enrolled
322059,Female,Divorced,1. Not enrolled
324323,Female,Married,2. 2-year college


In [10]:
ohe = OneHotEncoder(drop_last=True,
variables=['gender','maritalstatus'])
ohe.fit(X_demo_train)
X_demo_train_ohe = ohe.transform(X_demo_train)
X_demo_test_ohe = ohe.transform(X_demo_test)

X_demo_test_ohe.head()

Unnamed: 0_level_0,colenroct99,gender_Female,maritalstatus_Married,maritalstatus_Never-married,maritalstatus_Divorced,maritalstatus_Separated
personid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
653737,1. Not enrolled,1,1,0,0,0
473759,1. Not enrolled,1,0,1,0,0
348093,1. Not enrolled,0,1,0,0,0
163509,1. Not enrolled,1,1,0,0,0
895873,1. Not enrolled,1,1,0,0,0


Categorical features can be either nominal or ordinal, as discussed in Chapter 1 of the
textbook, Examining the Distribution of Features and Targets. Gender and marital status
are nominal. Their values do not imply order. For example, "never married" is not a
higher value than "divorced."

However, when a categorical feature is ordinal, we want the encoding to capture the
ranking of the values. For example, if we have a feature that has the values of low, medium,
and high, one-hot encoding would lose this ordering. Instead, a transformed feature with
the values of 1, 2, and 3 for low, medium, and high, respectively, would be better. We can
accomplish this with ordinal encoding.
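
As a minimal sketch of the low/medium/high example, an explicit mapping makes the intended ranking unambiguous. The series and the mapping are toy values chosen for illustration.

```python
import pandas as pd

levels = pd.Series(["low", "high", "medium", "low"])

# An explicit mapping keeps the intended ranking low < medium < high
order = {"low": 1, "medium": 2, "high": 3}
encoded = levels.map(order)
print(encoded.tolist())  # → [1, 3, 2, 1]
```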

The college enrollment feature on the NLS dataset can be considered an ordinal feature.
The values range from 1. Not enrolled to 3. 4-year college. We should use ordinal encoding
to prepare it for modeling. We will do that next:

In [11]:
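# -----------------------------------------------
# Step 1: Define the category order explicitly
# -----------------------------------------------
# OrdinalEncoder expects categories as a list of lists:
# each inner list holds one column's allowed categories in the order to encode them.
# .unique() returns values in order of first appearance, not in rank order, so we
# sort them; this works here only because the labels start with "1.", "2.", "3.".
# The outer list is needed because we are encoding ONE column.

categories = [sorted(X_demo_train['colenroct99'].dropna().unique())]

categories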

[['1. Not enrolled', '2. 2-year college ', '3. 4-year college']]

In [12]:
# -----------------------------------------------
# Step 2: Initialize the encoder with fixed categories
# -----------------------------------------------
oe = OrdinalEncoder(categories=categories)

# -----------------------------------------------
# Step 3: Fit *only on training data*, then transform
# -----------------------------------------------
colenr_enc_train = pd.DataFrame(
    oe.fit_transform(X_demo_train_ohe[['colenroct99']]),  # fit + transform
    columns=['colenroct99'],
    index=X_demo_train_ohe.index
)

# Remove original column (to avoid collision), then add encoded version
X_demo_train_enc = (
    X_demo_train_ohe.drop(columns=['colenroct99'], errors='ignore')
    .join(colenr_enc_train)
)

# -----------------------------------------------
# Step 4: Transform the TEST set using same encoder
# -----------------------------------------------
# Notice we use transform(), NOT fit_transform().
# The mapping is already learned from train; we just apply it here.
colenr_enc_test = pd.DataFrame(
    oe.transform(X_demo_test_ohe[['colenroct99']]),  # transform only
    columns=['colenroct99'],
    index=X_demo_test_ohe.index
)

# Drop original, then add encoded version
X_demo_test_enc = (
    X_demo_test_ohe.drop(columns=['colenroct99'], errors='ignore')
    .join(colenr_enc_test)
)

X_demo_train_enc.head()


Unnamed: 0_level_0,gender_Female,maritalstatus_Married,maritalstatus_Never-married,maritalstatus_Divorced,maritalstatus_Separated,colenroct99
personid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
736081,1,1,0,0,0,0.0
832734,0,0,1,0,0,0.0
453537,0,1,0,0,0,0.0
322059,1,0,0,1,0,0.0
324323,1,1,0,0,0,1.0


## Encoding categorical features with medium or high cardinality

When a categorical feature has **many unique values** (10, 50, or even thousands), creating a dummy variable (one-hot encoding) for each value quickly becomes impractical. Why?

* **Too many columns**: one-hot encoding expands into hundreds of new features.
* **Sparse data**: some categories may only appear a handful of times, giving the model little to learn from.
* **Extreme case – IDs**: if each observation has a unique value (like student ID), one-hot encoding adds no useful information.


#### Common Strategies

1. **Top-K categories + “other”**

   * Keep dummies only for the most common *k* categories.
   * Group all rare categories into a single `"other"` column.
   * Useful when a few categories dominate the data.

2. **Feature hashing (hashing trick)**

   * Map categories into a fixed number of bins using a hash function.
   * You choose the number of bins (say 50), and all categories are compressed into them.
   * Fast and memory-efficient, but collisions (different categories mapping to the same bin) can happen.
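
The top-K strategy can be sketched in pandas as follows. The data and the choice of `k = 2` are arbitrary, and in a modeling workflow the top categories would be determined from the training data only.

```python
import pandas as pd

region = pd.Series(
    ["Europe"] * 5 + ["Asia"] * 4 + ["Africa"] * 3 + ["Oceania", "Caribbean"],
    name="region",
)

k = 2  # number of categories to keep (chosen arbitrarily here)
top_k = region.value_counts().nlargest(k).index

# Keep the k most frequent labels and collapse everything else into "other"
collapsed = region.where(region.isin(top_k), "other")
dummies = pd.get_dummies(collapsed, prefix="region", dtype=int)
print(dummies.sum())
```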

#### More Advanced Options

3. **Target / Mean Encoding**

   * Replace each category with a summary statistic of the target variable (often the **mean target** for that category).
   * Example: if the target is `defaulted (0/1)`, and category = `job_title`, we can encode each `job_title` with its default rate.
   * Very powerful, especially with high cardinality, but must be done carefully:

     * Can cause **data leakage** if computed on the whole dataset (solution: compute only on train data, or use cross-validation folds).
     * Works best with regularization (smoothing rare categories toward the global mean).

4. **Embeddings (deep learning approach)**

   * Represent each category as a learned vector (like word embeddings in NLP).
   * These vectors capture similarity among categories automatically.
   * Typically used when features have very high cardinality (e.g., user IDs, product IDs) in neural networks.
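
To illustrate the regularization point under target encoding, here is a sketch of one common smoothing form, a count-weighted blend of the category mean and the global mean. The weight `m` and the data are hypothetical, and libraries such as category_encoders may use a different smoothing formula.

```python
import pandas as pd

train = pd.DataFrame({
    "job_title": ["nurse", "nurse", "nurse", "nurse", "pilot"],
    "defaulted": [0, 0, 1, 0, 1],
})

m = 2.0  # smoothing weight (hypothetical): larger values pull harder toward the global mean
global_mean = train["defaulted"].mean()

# Per-category mean and number of observations
stats = train.groupby("job_title")["defaulted"].agg(["mean", "count"])

# Blend each category mean with the global mean, weighted by the category count
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

train["job_title_enc"] = train["job_title"].map(smoothed)
print(smoothed)
```

The single "pilot" row has a raw default rate of 1.0, but its smoothed value is pulled to 0.6, closer to the global rate of 0.4.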



In [13]:
covidtotals = pd.read_csv("covidtotals.csv")
feature_cols = ['location','population',
    'aged_65_older','diabetes_prevalence','region']
covidtotals = covidtotals[['total_cases'] + feature_cols].dropna()

# Separate into train and test sets
X_train, X_test, y_train, y_test =  \
  train_test_split(covidtotals[feature_cols],\
  covidtotals[['total_cases']], test_size=0.3, random_state=0)


# use the one hot encoder for region
X_train.region.value_counts()


region
Eastern Europe     16
East Asia          12
Western Europe     12
West Africa        11
East Africa        10
West Asia          10
South Asia          7
South America       7
Southern Africa     7
Central Africa      7
Caribbean           6
Oceania / Aus       6
Central Asia        5
North Africa        4
North America       3
Central America     3
Name: count, dtype: int64

We can use the OneHotEncoder class from feature_engine again to
encode the region feature. This time, we use the top_categories parameter to
indicate that we only want to create dummies for the top six category values.
Any values that do not fall into the top six will have a 0 for all of the dummies:

In [14]:

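# NOTE: for display we fit on the full covidtotals DataFrame; in a modeling
# workflow, fit on X_train and then transform X_test to avoid leakage.
ohe = OneHotEncoder(top_categories=6, variables=['region'])
covidtotals_ohe = ohe.fit_transform(covidtotals)
covidtotals_ohe.filter(regex='location|region',
  axis="columns").sample(5, random_state=99)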



Unnamed: 0,location,region_Eastern Europe,region_Western Europe,region_West Africa,region_East Asia,region_West Asia,region_East Africa
97,Israel,0,0,0,0,1,0
173,Senegal,0,0,1,0,0,0
92,Indonesia,0,0,0,1,0,0
187,Sri Lanka,0,0,0,0,0,0
104,Kenya,0,0,0,0,0,1


Feature hashing maps a large number of unique feature values to a smaller number of
dummy variables. We can specify the number of dummy variables to create. However,
collisions are possible; that is, some feature values might map to the same dummy variable
combination. The number of collisions increases as we decrease the number of requested
dummy variables.
We can use HashingEncoder from category_encoders to do feature hashing.
We use n_components to indicate that we want six dummy variables (we copy the
region feature before we do the transform so that we can compare the original values to
the new dummies):

In [15]:
# use the hashing encoder for region
X_train['region2'] = X_train.region
he = HashingEncoder(cols=['region'], n_components=6)
X_train_enc = he.fit_transform(X_train)
X_train_enc.\
 groupby(['col_0','col_1','col_2','col_3','col_4',
   'col_5','region2']).\
 size().reset_index().rename(columns={0:'count'})

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,region2,count
0,0,0,0,0,0,1,Caribbean,6
1,0,0,0,0,0,1,Central Africa,7
2,0,0,0,0,0,1,East Africa,10
3,0,0,0,0,0,1,North Africa,4
4,0,0,0,0,1,0,Central America,3
5,0,0,0,0,1,0,Eastern Europe,16
6,0,0,0,0,1,0,North America,3
7,0,0,0,0,1,0,Oceania / Aus,6
8,0,0,0,0,1,0,Southern Africa,7
9,0,0,0,0,1,0,West Asia,10


Unfortunately, this gives us a large number of collisions. For example, Caribbean, Central
Africa, East Africa, and North Africa all get the same dummy variable values. In this case
at least, using one-hot encoding and specifying the number of categories, as we did in the
last section, was a better solution.
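
Why collisions are unavoidable with few buckets can be seen in a small standard-library sketch. The helper `hash_bucket` is hypothetical and uses MD5 only for illustration; it is not necessarily the scheme HashingEncoder applies internally.

```python
import hashlib

def hash_bucket(value: str, n_buckets: int) -> int:
    """Map a string to one of n_buckets using a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

regions = ["Caribbean", "Central Africa", "East Africa", "North Africa",
           "Eastern Europe", "West Asia", "Oceania / Aus", "South Asia"]

buckets = {r: hash_bucket(r, n_buckets=6) for r in regions}
for r, b in buckets.items():
    print(f"{r:15s} -> bucket {b}")

# With 8 labels and 6 buckets, at least two labels must share a bucket
n_used = len(set(buckets.values()))
print("distinct buckets used:", n_used)
```

Raising the number of buckets makes collisions less likely, at the cost of more columns.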


#### Using mathematical transformations
Sometimes, we want to use features that do not have a Gaussian distribution with
a machine learning algorithm that assumes our features are distributed in that way. When
that happens, we either need to change our minds about which algorithm to use (for
example, we could choose KNN rather than linear regression) or transform our features so
that they approximate a Gaussian distribution. The cell below shows a related
transformation, mean (target) encoding, which replaces the categorical region feature
with a numeric one.
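
As a minimal sketch of a transformation that moves a skewed feature toward a more symmetric shape, the example below applies a log transform with NumPy. The values are invented and the log is only one of several possible choices.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature (values spanning several orders of magnitude)
cases = pd.Series([10, 50, 200, 1_000, 5_000, 40_000, 900_000], dtype=float)

# log1p computes log(1 + x), which is defined at zero and compresses large values
cases_log = np.log1p(cases)

print("skew before:", round(cases.skew(), 2))
print("skew after: ", round(cases_log.skew(), 2))
```

The skewness of the transformed values is much closer to zero, which is what a roughly symmetric distribution would show.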

In [None]:
# Read data
covidtotals = pd.read_csv("covidtotals.csv")
feature_cols = ['location', 'population',
                'aged_65_older', 'diabetes_prevalence', 'region']

covidtotals = covidtotals[['total_cases'] + feature_cols].dropna()

# Separate into train and test sets
X = covidtotals[feature_cols]
y = covidtotals['total_cases']      # use a Series (1D), not a DataFrame

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Categorical columns to target-encode
cat_cols = ['region']

# Set up the TargetEncoder (mean encoding using y)
encoder = ce.TargetEncoder(cols=cat_cols)

# Fit on training data ONLY (to avoid leakage), using y_train as target
X_train_enc = encoder.fit_transform(X_train, y_train)

# Transform test set using the encoder fitted on train
X_test_enc = encoder.transform(X_test)

# Now X_train_enc and X_test_enc are ready for modeling
print(X_train_enc.head())


### Advantages
* **Handles high-cardinality categoricals**
  Works better than one-hot when you have many unique categories (e.g. hundreds of locations), avoiding huge sparse matrices.
* **Keeps feature space small**
  Each categorical variable becomes **one numeric column**, not dozens/hundreds.
* **Often boosts model performance**
  Especially for tree-based models (Random Forest, XGBoost, LightGBM) that like meaningful numeric encodings.
* **Captures signal from the target**
  If some categories are strongly associated with high/low target values, mean encoding makes that explicit.

### Disadvantages / Risks
* **Target leakage**
  If you compute the mean using the *whole dataset* (train + test), you “peek” at the test labels → overly optimistic performance.
* **Overfitting on rare categories**
  Categories with very few examples can get extreme means that don’t generalize.
* **Less interpretable**
  “Region = 1234.5” is less intuitive than “Region = Europe / Asia / …”.
* **Needs careful implementation**
  Must be fit only on training data; often with smoothing and/or cross-validation schemes.

### When to use
Use target/mean encoding when:
* You have **categorical variables**, especially with **many levels**.
* You’re training **supervised models** (regression or classification).
* You’re comfortable handling **data leakage** properly:
  * Fit encoder on **train only** (like you did with `TargetEncoder` and `y_train`).
  * Optionally use cross-validation-based target encoding in more advanced setups.
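
A cross-validation-based variant can be sketched as follows: each training row is encoded with category means computed from the other folds, so its own target value never feeds its encoding. The data, fold count, and column name `region_oof` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

train = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "total_cases": [10.0, 12.0, 14.0, 100.0, 110.0, 90.0, 50.0, 70.0],
})

global_mean = train["total_cases"].mean()
oof = pd.Series(np.nan, index=train.index)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(train):
    # Category means come only from rows outside the fold being encoded
    fold_means = train.iloc[fit_idx].groupby("region")["total_cases"].mean()
    oof.iloc[enc_idx] = train.iloc[enc_idx]["region"].map(fold_means).to_numpy()

# Categories unseen in a fold's fitting rows fall back to the global mean
train["region_oof"] = oof.fillna(global_mean)
print(train)
```

The test set would still be encoded with means computed from the full training set, as in the TargetEncoder cell above.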

Avoid or be cautious when:
* Dataset is **small** and categories are many and very rare.
* You’re doing quick “intro” ML and want **maximum interpretability** → one-hot may be safer and simpler to explain.
