# Databricks Data Preparation in ML - Notebook 04
## Data Encoding Fundamentals

**Part of the Databricks Data Preparation in ML Training Series**

---

## Objectives

This notebook covers the essential categorical encoding techniques needed for the Databricks ML Associate certification:

- **Label Encoding** - Simple numerical mapping for ordinal categories
- **One-Hot Encoding** - Binary vectors for nominal categories  
- **Ordinal Encoding** - Preserving natural ordering in categorical data
- **Target Encoding** - Leveraging target variable information
- **Categorical Embeddings** - Learned representations through deep learning
- **Text Embeddings** - Semantic encoding for textual data

## Duration: ~45 minutes
## Level: Fundamental → Advanced

---

## Why is Categorical Encoding Critical?

Most ML algorithms require **numerical input**, but business data is full of **categorical and text values**:
- Cities, products, customer segments
- Status values, ratings, sizes
- Textual descriptions, comments, reviews

**Proper encoding** can improve both model performance and training efficiency, while a poor choice can introduce artificial ordering or an unnecessarily large feature space.

## Theory: Categorical Types and Encoding Methods

### Types of Categorical Variables

Understanding the nature of categorical data is crucial for selecting the appropriate encoding strategy.

#### **Nominal Categories** (No natural ordering)
- Cities: "Warsaw", "Krakow", "Gdansk"
- Colors: "red", "blue", "green"  
- Brands: "Apple", "Samsung", "Google"
- **Characteristic**: Categories are mutually exclusive with no inherent ranking

#### **Ordinal Categories** (Natural hierarchy exists)
- Education: elementary < high school < bachelor < master < PhD
- Sizes: XS < S < M < L < XL
- Ratings: poor < fair < good < excellent
- **Characteristic**: Categories have meaningful order relationships

## Environment Setup

In [0]:
# Basic imports for Databricks ML
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, when, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
import pandas as pd
import numpy as np

In [0]:
%pip install faker

In [0]:
from faker import Faker
import random
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

fake = Faker()
Faker.seed(42)
random.seed(42)

# Dataset size
n_rows = 1000

education_levels = ["elementary", "high_school", "bachelor", "master", "phd"]
cities = ["London", "Manchester", "Birmingham", "Leeds", "Glasgow"]
departments = ["IT", "Marketing", "Finance", "HR", "Operations"]
performance_levels = ["poor", "average", "good", "excellent"]

def random_salary(edu, perf):
    base = {
        "elementary": 3000,
        "high_school": 4000,
        "bachelor": 6000,
        "master": 8000,
        "phd": 10000
    }.get(edu, 5000)
    bonus = {
        "poor": -1000,
        "average": 0,
        "good": 1000,
        "excellent": 2000
    }.get(perf, 0)
    return float(base + bonus + random.randint(-500, 500))

data = []
for i in range(1, n_rows + 1):
    edu = random.choice(education_levels)
    city = random.choice(cities)
    dept = random.choice(departments)
    perf = random.choice(performance_levels)
    salary = random_salary(edu, perf)
    description = fake.sentence(nb_words=8) + " " + fake.job()
    data.append((i, edu, city, dept, perf, salary, description))

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("education", StringType(), True),       # ORDINAL
    StructField("city", StringType(), True),            # NOMINAL  
    StructField("department", StringType(), True),      # NOMINAL
    StructField("performance", StringType(), True),     # ORDINAL
    StructField("salary", DoubleType(), True),          # TARGET
    StructField("description", StringType(), True)      # TEXT
])

df = spark.createDataFrame(data, schema)
display(df)

# Label Encoding - Mapping to numbers

## Theory
**Label Encoding** assigns each category a unique integer (0, 1, 2, 3...).

### ✅ When to use:
- **Ordinal** data with natural ordering  
- **Tree-based models** (Random Forest, XGBoost)
- Need **compact representation**

### ❌ When to avoid:
- **Nominal** data without ordering (artificial ranking!)
- **Linear models** (they treat the integer codes as magnitudes, so an arbitrary order distorts the fit)
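
To see why integer codes depend on how they are assigned, the plain-Python sketch below mimics frequency-based indexing (the most frequent label gets 0, as `StringIndexer` does by default; ties are broken alphabetically here) and compares it with an explicit ordinal mapping. The helper `frequency_index` and the sample values are hypothetical and only for illustration; this is not Spark code.

```python
from collections import Counter

def frequency_index(values):
    """Assign 0.0 to the most frequent label, 1.0 to the next, and so on."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample in which "bachelor" is the most common level
sample = ["bachelor"] * 3 + ["phd"] * 2 + ["elementary"]

by_frequency = frequency_index(sample)
# Explicit mapping that follows the real-world order of the levels
by_meaning = {"elementary": 1, "bachelor": 3, "phd": 5}

for label in ("elementary", "bachelor", "phd"):
    print(label, by_frequency[label], by_meaning[label])
```

In this sample the frequency-based codes rank `elementary` above `phd`, which is why ordinal columns usually need an explicit mapping.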

In [0]:
# Label Encoding for education (ordinal)
education_indexer = StringIndexer(
    inputCol="education", 
    outputCol="education_encoded"
)

education_model = education_indexer.fit(df)
df_label = education_model.transform(df)

# Check mapping
df_label.select("education", "education_encoded").distinct().orderBy("education_encoded").display()

In [0]:
# ⚠️ WARNING: StringIndexer orders labels by frequency (most frequent = 0), not by meaning!
# For ordinal data it's better to use a manual mapping:

# Intended order, kept for reference; the when() chain below applies it
education_order = {"elementary": 1, "high_school": 2, "bachelor": 3, "master": 4, "phd": 5}
df_manual = df.withColumn("education_manual", 
    when(col("education") == "elementary", 1)
    .when(col("education") == "high_school", 2)  
    .when(col("education") == "bachelor", 3)
    .when(col("education") == "master", 4)
    .when(col("education") == "phd", 5)
    .otherwise(0)  # 0 marks unknown or missing levels
)

df_manual.select("education", "education_manual").distinct().orderBy("education_manual").display()

# One-Hot Encoding - Binary vectors

## Theory
**One-Hot Encoding** creates a **separate binary column (0/1)** for each unique category.

### When to use:
- **Nominal** data without natural ordering
- **Linear models** (Logistic Regression, SVM)  
- **Low cardinality** (<20 categories)

### When to avoid:
- **High cardinality** (>20 categories) → curse of dimensionality
- **Limited memory** → many sparse columns
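
As a rough illustration of what the encoder produces, this plain-Python sketch builds the binary vector for a category index. The function `one_hot` and its `drop_last` flag are hypothetical names; `drop_last=True` imitates the default `dropLast=True` behaviour of Spark's `OneHotEncoder`, where the last category is represented by an all-zero vector.

```python
def one_hot(index, n_categories, drop_last=True):
    """Return a dense one-hot list for a category index."""
    # With drop_last the vector has one slot fewer than there are categories
    size = n_categories - 1 if drop_last else n_categories
    vector = [0.0] * size
    # The dropped (last) category keeps an all-zero vector
    if index < size:
        vector[index] = 1.0
    return vector

print(one_hot(1, 5))                    # second of five categories
print(one_hot(4, 5))                    # last category, dropped slot
print(one_hot(4, 5, drop_last=False))   # last category, full-length vector
```

With 5 categories the default output has 4 slots, so the vector length grows linearly with cardinality.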

In [0]:
# One-Hot Encoding for department (nominal)
# Step 1: StringIndexer (required before OneHotEncoder)
dept_indexer = StringIndexer(inputCol="department", outputCol="dept_indexed")
df_indexed = dept_indexer.fit(df_label).transform(df_label)

# Step 2: OneHotEncoder  
dept_encoder = OneHotEncoder(
    inputCols=["dept_indexed"], 
    outputCols=["dept_onehot"],
    # dropLast=False  # uncomment to keep one vector slot per category (the default drops the last one)
)
df_onehot = dept_encoder.fit(df_indexed).transform(df_indexed)

# Check results
df_onehot.select("department", "dept_indexed", "dept_onehot").display()

In [0]:
# Analysis of One-Hot Encoding dimensions
unique_depts = df.select("department").distinct().count()
sample_vector = df_onehot.select("dept_onehot").first()["dept_onehot"]

print(f"Categories: {unique_depts}")
print(f"Vector dimensions: {len(sample_vector)}")
print(f"Sample vector: {sample_vector.toArray()}")

# 💡 With the default dropLast=True the vector has (categories - 1) slots;
#    the last (least frequent) category is encoded as all zeros

# Ordinal Encoding - Preserving Order

## Theory  
**Ordinal Encoding** is a **manual mapping** of ordinal categories to numbers that preserves their natural hierarchy.

### When to use:
- Data with **clear ordering** (sizes, ratings, levels)
- You need to **preserve relationships** between categories
- **All ML model types**

### Advantages vs Label Encoding:
- **Control** over mapping  
- **Logical order** vs frequency occurrence
- **Consistent** results
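
One way to keep that control explicit is to derive the mapping from a single ordered list, so that no level is forgotten. This is a plain-Python sketch; `build_ordinal_mapping`, `encode` and the choice of `0` for unknown values are assumptions made for illustration.

```python
def build_ordinal_mapping(ordered_levels, start=1):
    """Map each level to its rank, following the order of the list."""
    return {level: rank for rank, level in enumerate(ordered_levels, start=start)}

def encode(value, mapping, unknown=0):
    """Look up a value; unseen or missing values get the `unknown` code."""
    return mapping.get(value, unknown)

performance_order = ["poor", "average", "good", "excellent"]
performance_map = build_ordinal_mapping(performance_order)

print(performance_map)
print(encode("good", performance_map))
print(encode("outstanding", performance_map))  # not in the list, falls back to 0
```

Because the ranks come from the list position, reordering or extending the levels only requires editing `performance_order`.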

In [0]:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Small pandas example (separate from the Spark DataFrame `df`)
df_ordinalEncoder = pd.DataFrame({
    "education": ["elementary", "bachelor", "high_school", "master", "phd"]
})

# Define the order manually (for ordinal data)
encoder = OrdinalEncoder(categories=[["elementary", "high_school", "bachelor", "master", "phd"]])
# Fit on the pandas column; ravel() flattens the (n, 1) result into a single column
df_ordinalEncoder["education_index"] = encoder.fit_transform(df_ordinalEncoder[["education"]]).ravel()

display(df_ordinalEncoder.sort_values("education_index"))

In [0]:
# Ordinal Encoding for performance (ordinal with clear ordering)
df_ordinal = df_onehot.withColumn("performance_ordinal",
    when(col("performance") == "poor", 1)
    .when(col("performance") == "average", 2)
    .when(col("performance") == "good", 3) 
    .when(col("performance") == "excellent", 4)
    .otherwise(0)
)

# Check mapping
df_ordinal.select("performance", "performance_ordinal").distinct().orderBy("performance_ordinal").display()

In [0]:
# Comparison: Ordinal vs StringIndexer
perf_indexer = StringIndexer(inputCol="performance", outputCol="performance_auto")
df_comparison = perf_indexer.fit(df_ordinal).transform(df_ordinal)

df_comparison.select("performance", "performance_ordinal", "performance_auto").distinct().orderBy("performance_ordinal").display()

# 💡 Ordinal preserves logical order, StringIndexer codes by frequency!

# Target Encoding - Information from target variable

## Theory
**Target Encoding** replaces each category with **target variable statistics** (e.g. mean, median).

### When to use:
- **High cardinality** (>20-50 categories)
- **Tree-based models** (Random Forest, XGBoost)
- Categories with **different predictive power**

### WARNINGS:
- **Overfitting risk** - use cross-validation!
- **Data leakage** - compute the statistics on training data only, never on validation/test targets
- **Smoothing** - shrink estimates for small categories toward the global mean
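
Smoothing can be sketched as a weighted average of the category mean and the global mean: `(n * category_mean + alpha * global_mean) / (n + alpha)`. The function name `smoothed_mean` and the numbers below are hypothetical; `alpha` acts as the smoothing strength.

```python
def smoothed_mean(n, category_mean, global_mean, alpha=3):
    """Shrink a category mean toward the global mean; small n shrinks more."""
    return (n * category_mean + alpha * global_mean) / (n + alpha)

global_mean = 6000.0

# Same category mean, very different sample sizes
small = smoothed_mean(n=2, category_mean=10000.0, global_mean=global_mean)
large = smoothed_mean(n=200, category_mean=10000.0, global_mean=global_mean)

print(small)            # 7600.0: pulled strongly toward the global mean
print(round(large, 1))  # stays close to the category mean of 10000
```

A category seen only twice is pulled most of the way toward the global mean, while a category with 200 rows is almost unchanged.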

In [0]:
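# Target Encoding for city (average salary per city)
# Note: for illustration the statistics use the full dataset;
# in a real workflow, compute them on the training split only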
city_target_stats = df.groupBy("city").agg(
    avg("salary").alias("city_avg_salary"),
    count("*").alias("city_count")
)

city_target_stats.orderBy("city_avg_salary", ascending=False).display()

# Join with main dataset
df_target = df_comparison.join(city_target_stats, "city")
df_target.select("city", "salary", "city_avg_salary").display()

In [0]:
# Bayesian Smoothing for small samples
global_mean = df.agg(avg("salary")).collect()[0][0]
alpha = 3  # smoothing parameter

df_smoothed = city_target_stats.withColumn("city_smoothed_salary",
    (col("city_count") * col("city_avg_salary") + lit(alpha) * lit(global_mean)) / 
    (col("city_count") + lit(alpha))
)

df_smoothed.select("city", "city_count", "city_avg_salary", "city_smoothed_salary").display()

# 💡 Smoothing "shrinks" small samples toward global mean

In [0]:
# Same smoothing expressed in Spark SQL
city_target_stats.createOrReplaceTempView("city_target_stats")

global_mean = df.agg(avg("salary")).collect()[0][0]
alpha = 3

query = f"""
SELECT
  city,
  city_count,
  city_avg_salary,
  (city_count * city_avg_salary + {alpha} * {global_mean}) / (city_count + {alpha}) AS city_smoothed_salary
FROM city_target_stats
"""

df_smoothed_sql = spark.sql(query)
display(df_smoothed_sql)

# Categorical Embeddings - VectorAssembler

## Theory
**Categorical Embeddings** are **learned dense vector representations** of categories, trained end-to-end with neural network models.

### When to use:
- **High cardinality** (>50-100 categories)
- **Deep learning models** (neural networks)
- **Complex relationships** between categories
- **Recommendation systems**

### Advantages:
- **Automatic similarity learning** - similar categories → similar vectors
- **Dense representation** vs sparse one-hot
- **Tunable dimensions** - control over size

### Embedding dimensions (rule of thumb):
```python
embedding_dim = min(50, (cardinality + 1) // 2)
```
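
Conceptually, an embedding layer is a lookup table with one dense vector per category index; the vectors start out random and are adjusted during training. The sketch below only builds such a table in plain Python (no training), using the rule of thumb above for the dimension. `embedding_dim`, `init_embedding_table`, the value range and the seed are hypothetical choices.

```python
import random

def embedding_dim(cardinality):
    """Rule-of-thumb dimension: half the cardinality, capped at 50."""
    return min(50, (cardinality + 1) // 2)

def init_embedding_table(cardinality, seed=42):
    """One small random vector per category; a real model would train these."""
    rng = random.Random(seed)
    dim = embedding_dim(cardinality)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(cardinality)]

departments = ["IT", "Marketing", "Finance", "HR", "Operations"]
dept_to_index = {name: i for i, name in enumerate(departments)}
table = init_embedding_table(len(departments))

# Encoding a category is just a row lookup by its index
vector = table[dept_to_index["Finance"]]
print(len(table), len(vector))
```

With only 5 departments the table is tiny; the approach pays off when the cardinality is in the hundreds or thousands.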

In [0]:
display(df_indexed)

In [0]:
from pyspark.ml.feature import StringIndexer, VectorAssembler

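# Note: VectorAssembler only concatenates columns into one feature vector;
# it does not learn embeddings. An index like this is what an embedding layer would take as input.

# 1. Encode the 'department' column into numerical indices
indexer = StringIndexer(inputCol="department", outputCol="department_idx")
df_indexed = indexer.fit(df).transform(df)

# 2. Select features for vectorization
# Example: we use 'department_idx' and 'salary' as input features
# (illustration only: 'salary' is the target in this dataset and would not be a feature in a real model)
assembler = VectorAssembler(
    inputCols=["department_idx", "salary"],  # you can add other numerical features here
    outputCol="features_vector"
)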

df_vectorized = assembler.transform(df_indexed)

# 3. Check the result
df_vectorized.select("department", "department_idx", "salary", "features_vector").show(truncate=False)

# Text Embeddings - Semantic encoding

## Theory
**Text Embeddings** transform **text into dense vector representations** that capture semantic meaning.

### When to use:
- **Text features** (descriptions, reviews, documents)
- **Semantic search** and similarity matching
- **Document classification** and clustering
- **Multilingual** applications

### Popular models:

| Model | Provider | Dimensions | Best for |
|-------|----------|---------|---------------|
| **all-MiniLM-L6-v2** | Sentence-BERT | 384 | Fast, local |
| **text-embedding-ada-002** | OpenAI | 1536 | High quality |
| **all-mpnet-base-v2** | Sentence-BERT | 768 | Balanced |
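
Embeddings are typically compared with cosine similarity, which is what makes semantic search and similarity matching possible. A minimal plain-Python sketch follows; the three short vectors are made-up stand-ins for real embeddings (which would have hundreds of dimensions), and `cosine_similarity` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors to avoid division by zero
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration
query = [0.9, 0.1, 0.0]
doc_similar = [0.8, 0.2, 0.1]
doc_unrelated = [0.0, 0.1, 0.9]

print(cosine_similarity(query, doc_similar) > cosine_similarity(query, doc_unrelated))
```

Ranking documents by this score against a query vector is the basic building block of semantic search.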

In [0]:
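# Requires the sentence-transformers package (e.g. %pip install sentence-transformers)
from sentence_transformers import SentenceTransformer
import pandas as pd

# Convert to pandas (collects all rows to the driver; fine for this small dataset)
pdf = df.toPandas()

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
pdf["embedding"] = model.encode(pdf["description"].tolist()).tolist()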

In [0]:
from pyspark.sql.types import ArrayType, FloatType

# Add the embedding column to a Spark DataFrame
pdf["embedding"] = pdf["embedding"].apply(lambda x: [float(i) for i in x])
df_embedded = spark.createDataFrame(pdf)

display(df_embedded.select("id", "description", "embedding"))

# Summary: Choosing the Right Encoding Method

## Decision Matrix - ML Associate Certification

| Scenario | Cardinality | Category Type | ML Algorithm | **Recommended Method** |
|----------|-------------|---------------|-------------|----------------------|
| Colors, countries | Low (2-10) | Nominal | Linear/SVM | **One-Hot** |
| Education level | Low | Ordinal | All | **Ordinal** (manual) |
| User IDs | High (1000+) | Nominal | Tree-based | **Target Encoding** |
| Product codes | High | Nominal | Neural Network | **Embeddings** |
| Descriptions | N/A (free text) | Text | All | **Text Embeddings** |
| Ratings (1-5) | Low | Ordinal | All | **Keep numeric** |

## Common Mistakes - ML Associate

1. **Label encoding for nominal** → artificial ordering
2. **One-hot for high cardinality** → curse of dimensionality  
3. **Target encoding without CV** → data leakage
4. **Inconsistent train/test encoding** → model degradation
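
Mistake 3 can be avoided with out-of-fold encoding: each row is encoded with target statistics computed from the other folds only, so a row never sees its own target. The plain-Python sketch below assumes a simple `index % n_folds` fold assignment and falls back to the training-fold mean for categories unseen in those folds; `out_of_fold_target_encode` is a hypothetical helper, not a Spark API.

```python
from collections import defaultdict

def out_of_fold_target_encode(categories, targets, n_folds=5):
    """Encode each row with the category mean computed on the other folds."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        total, seen = 0.0, 0
        # Accumulate statistics from every row outside the current fold
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        fallback = total / seen if seen else 0.0
        # Encode the held-out rows using those statistics only
        for i in range(n):
            if i % n_folds == fold:
                cat = categories[i]
                n_cat = counts.get(cat, 0)
                encoded[i] = sums[cat] / n_cat if n_cat else fallback
    return encoded

cities = ["London", "London", "Leeds", "Leeds"]
salaries = [1.0, 3.0, 10.0, 20.0]
print(out_of_fold_target_encode(cities, salaries, n_folds=2))
```

Each encoded value here comes from the other row of the same city, never from the row's own salary.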

## Production Best Practices

### Pipeline Design
- **Consistent encoding** across train/validation/test
- **Handle unknown categories** (`StringIndexer` with `handleInvalid="keep"`)
- **Version control** for encoding mappings
- **Monitoring** for distribution drift
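
The first two points can be sketched in plain Python: learn the mapping once on training data, reuse it unchanged everywhere, and send unseen categories to a reserved extra index (similar in spirit to `handleInvalid="keep"`). The class `CategoryIndexer` is hypothetical and only illustrates the idea.

```python
from collections import Counter

class CategoryIndexer:
    """Learn label indices on training data and reuse them for any split."""

    def fit(self, values):
        counts = Counter(values)
        # Most frequent label first, alphabetical order breaks ties
        labels = sorted(counts, key=lambda label: (-counts[label], label))
        self.mapping_ = {label: i for i, label in enumerate(labels)}
        return self

    def transform(self, values):
        # Unseen labels share one extra index after the known ones
        unknown_index = len(self.mapping_)
        return [self.mapping_.get(v, unknown_index) for v in values]

train_cities = ["London", "Leeds", "London", "Glasgow"]
test_cities = ["Leeds", "Cardiff", "London"]  # "Cardiff" never appears in training

indexer = CategoryIndexer().fit(train_cities)
print(indexer.transform(train_cities))
print(indexer.transform(test_cities))
```

Fitting a second indexer on the test data instead would silently assign different indices to the same cities, which is the inconsistency the list above warns about.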

### Performance Tips
- **Batch processing** for embeddings API calls
- **Caching** for expensive operations  
- **Sparse formats** for one-hot (memory efficiency)
- **Pipeline optimization** (avoid repeated transforms)
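
For the batching tip, a small generator is often enough to group texts before sending them to an embedding model or API. This is a sketch with a hypothetical helper name `batched` and an arbitrary batch size; the actual embedding call is left out.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch
    if batch:
        yield batch

texts = [f"description {i}" for i in range(5)]
for batch in batched(texts, batch_size=2):
    # A real pipeline would pass `batch` to the embedding model here
    print(len(batch), batch[0])
```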