# **Frequency Encoding - Jupyter Notebook**

## **Introduction**
This notebook demonstrates Frequency Encoding using two approaches:
- A simple example for conceptual understanding.
- Applying Frequency Encoding on a real-world dataset loaded from a CSV file.

## **Importing Required Libraries**

In [2]:
import pandas as pd
import category_encoders as ce

# **Simple Example for Understanding**

In [None]:
# Creating a small dataset with categorical values
data = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']})

# Apply Frequency Encoding using category_encoders
encoder = ce.CountEncoder()
data['Encoded_City'] = encoder.fit_transform(data['City'])

print(data)

# **Another Approach: pandas `value_counts()` and `map()`**

In [None]:
# Creating a small dataset with categorical values
data = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']})

# Count how often each city appears
freq_encoding = data['City'].value_counts()

# Map each city to its count and display the result
data['Encoded_City'] = data['City'].map(freq_encoding)
print(data)

# **Real-World Example - Applying Frequency Encoding on a CSV Dataset**

In [1]:
import pandas as pd
import category_encoders as ce

# Load dataset
df_real = pd.read_csv("sample_data.csv")

# Display the dataset before encoding (the notebook truncates long output)
print("\nReal-World Dataset (Before Encoding):\n")
df_real


Real-World Dataset (Before Encoding):



Unnamed: 0,Company,Product,TypeName,Inches,Ram,OS,Weight,Price_euros,Screen,ScreenW,...,RetinaDisplay,CPU_company,CPU_freq,CPU_model,PrimaryStorage,SecondaryStorage,PrimaryStorageType,SecondaryStorageType,GPU_company,GPU_model
0,Apple,MacBook Pro,Ultrabook,13.3,8,macOS,1.37,1339.69,Standard,2560,...,Yes,Intel,2.3,Core i5,128,0,SSD,No,Intel,Iris Plus Graphics 640
1,Apple,Macbook Air,Ultrabook,13.3,8,macOS,1.34,898.94,Standard,1440,...,No,Intel,1.8,Core i5,128,0,Flash Storage,No,Intel,HD Graphics 6000
2,HP,250 G6,Notebook,15.6,8,No OS,1.86,575.00,Full HD,1920,...,No,Intel,2.5,Core i5 7200U,256,0,SSD,No,Intel,HD Graphics 620
3,Apple,MacBook Pro,Ultrabook,15.4,16,macOS,1.83,2537.45,Standard,2880,...,Yes,Intel,2.7,Core i7,512,0,SSD,No,AMD,Radeon Pro 455
4,Apple,MacBook Pro,Ultrabook,13.3,8,macOS,1.37,1803.60,Standard,2560,...,Yes,Intel,3.1,Core i5,256,0,SSD,No,Intel,Iris Plus Graphics 650
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1270,Lenovo,Yoga 500-14ISK,2 in 1 Convertible,14.0,4,Windows 10,1.80,638.00,Full HD,1920,...,No,Intel,2.5,Core i7 6500U,128,0,SSD,No,Intel,HD Graphics 520
1271,Lenovo,Yoga 900-13ISK,2 in 1 Convertible,13.3,16,Windows 10,1.30,1499.00,Quad HD+,3200,...,No,Intel,2.5,Core i7 6500U,512,0,SSD,No,Intel,HD Graphics 520
1272,Lenovo,IdeaPad 100S-14IBR,Notebook,14.0,2,Windows 10,1.50,229.00,Standard,1366,...,No,Intel,1.6,Celeron Dual Core N3050,64,0,Flash Storage,No,Intel,HD Graphics
1273,HP,15-AC110nv (i7-6500U/6GB/1TB/Radeon,Notebook,15.6,6,Windows 10,2.19,764.00,Standard,1366,...,No,Intel,2.5,Core i7 6500U,1024,0,HDD,No,AMD,Radeon R5 M330


In [6]:
# Count the occurrences of each unique company in the "Company" column and display them in descending order
print(df_real["Company"].value_counts())

Company
Dell         291
Lenovo       289
HP           268
Asus         152
Acer         101
MSI           54
Toshiba       48
Apple         21
Samsung        9
Razer          7
Mediacom       7
Microsoft      6
Xiaomi         4
Vero           4
Chuwi          3
Google         3
Fujitsu        3
LG             3
Huawei         2
Name: count, dtype: int64


In [7]:
# Define the CountEncoder (frequency encoding)
count_encoder = ce.CountEncoder()

# Apply encoding on "Company" column
df_real["Company"] = count_encoder.fit_transform(df_real["Company"])

# Display the dataset after encoding (the notebook truncates long output)
print("\nReal-World Dataset (After Encoding):\n")
df_real


Real-World Dataset (After Encoding):



Unnamed: 0,Company,Product,TypeName,Inches,Ram,OS,Weight,Price_euros,Screen,ScreenW,...,RetinaDisplay,CPU_company,CPU_freq,CPU_model,PrimaryStorage,SecondaryStorage,PrimaryStorageType,SecondaryStorageType,GPU_company,GPU_model
0,21,MacBook Pro,Ultrabook,13.3,8,macOS,1.37,1339.69,Standard,2560,...,Yes,Intel,2.3,Core i5,128,0,SSD,No,Intel,Iris Plus Graphics 640
1,21,Macbook Air,Ultrabook,13.3,8,macOS,1.34,898.94,Standard,1440,...,No,Intel,1.8,Core i5,128,0,Flash Storage,No,Intel,HD Graphics 6000
2,268,250 G6,Notebook,15.6,8,No OS,1.86,575.00,Full HD,1920,...,No,Intel,2.5,Core i5 7200U,256,0,SSD,No,Intel,HD Graphics 620
3,21,MacBook Pro,Ultrabook,15.4,16,macOS,1.83,2537.45,Standard,2880,...,Yes,Intel,2.7,Core i7,512,0,SSD,No,AMD,Radeon Pro 455
4,21,MacBook Pro,Ultrabook,13.3,8,macOS,1.37,1803.60,Standard,2560,...,Yes,Intel,3.1,Core i5,256,0,SSD,No,Intel,Iris Plus Graphics 650
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1270,289,Yoga 500-14ISK,2 in 1 Convertible,14.0,4,Windows 10,1.80,638.00,Full HD,1920,...,No,Intel,2.5,Core i7 6500U,128,0,SSD,No,Intel,HD Graphics 520
1271,289,Yoga 900-13ISK,2 in 1 Convertible,13.3,16,Windows 10,1.30,1499.00,Quad HD+,3200,...,No,Intel,2.5,Core i7 6500U,512,0,SSD,No,Intel,HD Graphics 520
1272,289,IdeaPad 100S-14IBR,Notebook,14.0,2,Windows 10,1.50,229.00,Standard,1366,...,No,Intel,1.6,Celeron Dual Core N3050,64,0,Flash Storage,No,Intel,HD Graphics
1273,268,15-AC110nv (i7-6500U/6GB/1TB/Radeon,Notebook,15.6,6,Windows 10,2.19,764.00,Standard,1366,...,No,Intel,2.5,Core i7 6500U,1024,0,HDD,No,AMD,Radeon R5 M330
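
One property worth noticing in the output above: companies that occur equally often (for example Razer and Mediacom, both 7) receive the same encoded value, so this feature alone can no longer tell them apart. A minimal sketch of checking for such ties, using a handful of counts copied from the `value_counts()` output above; the variable names are made up for this illustration.

```python
import pandas as pd

# A few counts copied from the value_counts() output shown earlier.
company_counts = pd.Series(
    {'Apple': 21, 'Razer': 7, 'Mediacom': 7, 'Xiaomi': 4, 'Vero': 4}
)

# Keep only the companies whose count is shared with at least one other company.
shared = company_counts[company_counts.duplicated(keep=False)]

print(shared)
```

If ties like these matter for the model, frequency encoding may need to be combined with another feature or encoding.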


# Difference Between fit(), fit_transform(), and transform()

| Method           | Description |
|-----------------|-------------|
| **fit()**       | Learns the count of each category from the data but does **NOT** transform it. |
| **fit_transform()** | Learns the counts and transforms the same data in one step. |
| **transform()**  | Transforms new data using the counts already learned, without re-learning. |
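
The distinction in the table can be sketched with a tiny pandas-only encoder. `SimpleCountEncoder` is a hypothetical class written for this illustration, not the `category_encoders` implementation, and encoding unseen categories as 0 is an assumption of the sketch.

```python
import pandas as pd


class SimpleCountEncoder:
    """Hypothetical minimal count encoder for a single column (illustration only)."""

    def fit(self, values: pd.Series) -> "SimpleCountEncoder":
        # Learn the count of each category; the input data is not changed.
        self.counts_ = values.value_counts()
        return self

    def transform(self, values: pd.Series) -> pd.Series:
        # Reuse the learned counts; unseen categories become 0 (assumption of this sketch).
        return values.map(self.counts_).fillna(0).astype(int)

    def fit_transform(self, values: pd.Series) -> pd.Series:
        # Learn the counts and encode the same data in one step.
        return self.fit(values).transform(values)


train = pd.Series(['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles'])
new_data = pd.Series(['Chicago', 'New York', 'Boston'])

encoder = SimpleCountEncoder()
train_encoded = encoder.fit_transform(train)   # learns counts from train
new_encoded = encoder.transform(new_data)      # reuses those counts

print(train_encoded.tolist())
print(new_encoded.tolist())
```

`Boston` never appeared during fitting, so `transform()` cannot look up a count for it; how such values are handled is a design choice.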


# Important Tips for Frequency Encoding  

1. **Key Parameters of `ce.CountEncoder`:**  

| Parameter          | Description | Explanation | Example Usage |
|--------------------|-------------|-------------|---------------|
| `normalize`       | If `True`, encodes each category as its relative frequency (proportion) instead of its raw count. | Keeps encoded values comparable across datasets of different sizes. | `ce.CountEncoder(normalize=True)` |
| `min_group_size`  | Combines categories that occur fewer times than this threshold into a single group. | Reduces noise from very rare categories. | `ce.CountEncoder(min_group_size=10)` |
| `handle_unknown`  | Controls how categories that were not seen during `fit()` are treated in `transform()`. | Matters whenever new data may contain categories that are missing from the training data. | `ce.CountEncoder(handle_unknown='return_nan')` |
| `verbose`         | Controls the display of messages during encoding. | Higher values print more messages during encoding. | `ce.CountEncoder(verbose=1)` |
| `cols`            | Specifies which columns to encode (if not set, it applies to all categorical data). | Helps in selective encoding if only specific columns need transformation. | `ce.CountEncoder(cols=['City'])` |
| `drop_invariant`  | If `True`, it removes columns with no variance after encoding. | Useful for cleaning redundant features after transformation. | `ce.CountEncoder(drop_invariant=True)` |
| `return_df`       | If `True`, returns a DataFrame instead of a NumPy array. | Ensures compatibility with pandas DataFrames when further processing is needed. | `ce.CountEncoder(return_df=True)` |


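2. **Be careful with data leakage.**  
   - If you fit the encoder on the entire dataset before splitting, the counts include rows from the validation and test sets, so information about held-out data leaks into the training features. Unlike Target Encoding, no target values are involved, so the leak is milder, but evaluation results may still look better than they should.  
   - **Solution:** Fit the encoder on the training data only (inside each cross-validation fold), then call `transform()` on the validation or test data.  

**Best Practice:** Fit on the training data only, decide up front how unseen categories should be encoded, and apply the encoder within cross-validation folds to maintain model integrity.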
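
A minimal sketch of split-first encoding, using pandas only. The toy data, the positional 4/2 split, and the choice to encode unseen categories as 0 are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago',
                            'New York', 'Los Angeles', 'Boston']})

# Split first (a simple positional split; real code would usually shuffle).
train_df = df.iloc[:4].copy()
test_df = df.iloc[4:].copy()

# Learn the counts from the training rows only.
train_counts = train_df['City'].value_counts()

# Encode both parts with the training counts.
train_df['City_count'] = train_df['City'].map(train_counts)
# 'Boston' is not in the training rows, so it is encoded as 0 here (assumption).
test_df['City_count'] = test_df['City'].map(train_counts).fillna(0).astype(int)

print(train_df)
print(test_df)
```

The test rows never influence `train_counts`, so the training features carry no information about held-out data.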

---

## **Difference Between Target Encoding and Frequency Encoding**
| **Encoding Type**  | **How It Works** | **Best Use Case** |
|--------------------|----------------|-------------------|
| **Target Encoding** | Uses the **target variable’s mean** to encode categorical values. | When there is a strong relationship between the category and the target variable. |
| **Frequency Encoding** | Uses **only the count of occurrences** of each category. | When categories are unordered and have many unique values. |
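
The contrast in the table can be sketched on a toy DataFrame. The `Price` values are invented for this illustration, and the target encoding shown is a plain per-category mean without smoothing, not the output of `ce.TargetEncoder`.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles'],
    'Price': [300, 200, 100, 500, 400],  # hypothetical target values
})

# Frequency encoding: depends only on how often each category appears.
df['City_freq'] = df['City'].map(df['City'].value_counts())

# Target (mean) encoding: depends on the target column.
df['City_target_mean'] = df['City'].map(df.groupby('City')['Price'].mean())

print(df)
```

Changing the `Price` column changes `City_target_mean` but leaves `City_freq` untouched.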

---

## **When to Use Frequency Encoding?**
- When categorical features have **many unique values** (e.g., `City Names`, `Product Names`).  
- When you need a **simple numerical representation** without depending on the target variable (`Y`).  

---

**Tip**: Use `normalize=True` when encoded values must be comparable across datasets of different sizes, and fit the encoder inside cross-validation folds (on the training portion only) to avoid leakage.
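
A minimal sketch of why normalization helps, using pandas `value_counts(normalize=True)` as a stand-in for the encoder option. The two toy Series are assumptions for illustration: they have the same category proportions but very different sizes.

```python
import pandas as pd

small = pd.Series(['A', 'A', 'B', 'C'])
large = pd.Series(['A'] * 50 + ['B'] * 25 + ['C'] * 25)

# Raw counts differ with dataset size.
small_counts = small.value_counts()
large_counts = large.value_counts()

# Relative frequencies are comparable across the two datasets.
small_freq = small.value_counts(normalize=True)
large_freq = large.value_counts(normalize=True)

print(small_counts['A'], large_counts['A'])
print(small_freq['A'], large_freq['A'])
```

Category `A` has a raw count of 2 in one dataset and 50 in the other, yet its relative frequency is the same in both.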