# Categorical Encoders Performance: A Classic Comparison

This tutorial compares four fundamental categorical encoding approaches on a milk quality dataset:
OneHot, Frequency, Target, and Ordinal encoders paired with SVM classification.

In [1]:
using Pkg;
Pkg.activate(@__DIR__);

using MLJ, MLJTransforms, LIBSVM, DataFrames, ScientificTypes
using Random, CSV

  Activating project at `~/Documents/GitHub/MLJTransforms/docs/src/tutorials/classic_comparison`


## Load and Prepare Data
Load the milk quality dataset, which contains categorical features for quality prediction:

In [2]:
df = CSV.read("./milknew.csv", DataFrame)

first(df, 5)

 Row │ pH       Temprature  Taste  Odor   Fat    Turbidity  Colour  Grade
     │ Float64  Int64       Int64  Int64  Int64  Int64      Int64   String7
─────┼─────────────────────────────────────────────────────────────────────
   1 │     6.6          35      1      0      1          0     254  high
   2 │     6.6          36      0      1      0          1     253  high
   3 │     8.5          70      1      1      1          1     246  low
   4 │     9.5          34      1      1      0          1     255  low
   5 │     6.6          37      0      0      0          0     255  medium


Check the scientific types to understand our data structure:

In [3]:
ScientificTypes.schema(df)

┌────────────┬────────────┬─────────┐
│ names      │ scitypes   │ types   │
├────────────┼────────────┼─────────┤
│ pH         │ Continuous │ Float64 │
│ Temprature │ Count      │ Int64   │
│ Taste      │ Count      │ Int64   │
│ Odor       │ Count      │ Int64   │
│ Fat        │ Count      │ Int64   │
│ Turbidity  │ Count      │ Int64   │
│ Colour     │ Count      │ Int64   │
│ Grade      │ Textual    │ String7 │
└────────────┴────────────┴─────────┘


Automatically coerce columns with few unique values to categorical:

In [4]:
df = coerce(df, autotype(df, :few_to_finite))

ScientificTypes.schema(df)

┌────────────┬───────────────────┬───────────────────────────────────┐
│ names      │ scitypes          │ types                             │
├────────────┼───────────────────┼───────────────────────────────────┤
│ pH         │ OrderedFactor{16} │ CategoricalValue{Float64, UInt32} │
│ Temprature │ OrderedFactor{17} │ CategoricalValue{Int64, UInt32}   │
│ Taste      │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Odor       │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Fat        │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Turbidity  │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}   │
│ Colour     │ OrderedFactor{9}  │ CategoricalValue{Int64, UInt32}   │
│ Grade      │ Multiclass{3}     │ CategoricalValue{String7, UInt32} │
└────────────┴───────────────────┴───────────────────────────────────┘


## Split Data
Separate the features from the target and create a 90/10 train/test split:

In [5]:
y, X = unpack(df, ==(:Grade); rng = 123)
train, test = partition(eachindex(y), 0.9, shuffle = true, rng = 100);

## Setup Encoders and Classifier
Load the required models and create different encoding strategies:

In [6]:
OneHot = @load OneHotEncoder pkg = MLJModels verbosity = 0
SVC = @load SVC pkg = LIBSVM verbosity = 0

MLJLIBSVMInterface.SVC

**Encoding Strategies Explained:**
1. **OneHot**: Creates binary columns for each category (sparse, interpretable)
2. **Frequency**: Replaces categories with their occurrence frequency
3. **Target**: Replaces each category with a statistic of the target (such as its mean or class frequencies) computed over the rows in that category
4. **Ordinal**: Assigns integer codes to categories (assumes ordering)
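
To make the four strategies concrete, here is a minimal sketch in plain Julia on a toy column. It does not call MLJTransforms: the column `colour`, the binary `target`, and every helper name below are made up for illustration, and the real encoders may differ in detail (for example in how target statistics are smoothed or how levels are ordered).

```julia
# Toy illustration of the four encodings on one categorical column.
# Plain Julia only; an assumption-laden sketch, not the MLJTransforms implementation.

colour = ["red", "blue", "red", "green", "red", "blue"]   # hypothetical feature
target = [1, 0, 1, 0, 0, 0]                                # hypothetical binary target

lvls = sort(unique(colour))                                # ["blue", "green", "red"]

# Ordinal: each category becomes its position in the sorted level list.
ordinal_code = Dict(l => i for (i, l) in enumerate(lvls))
ordinal = [ordinal_code[c] for c in colour]

# Frequency: each category becomes its raw count (the unnormalized variant).
n_per_level = Dict(l => count(==(l), colour) for l in lvls)
frequency = [n_per_level[c] for c in colour]

# Target: each category becomes the mean of the target over its rows (no smoothing).
target_mean = Dict(l => sum(target[colour .== l]) / n_per_level[l] for l in lvls)
target_enc = [target_mean[c] for c in colour]

# One-hot with the last level dropped: one 0/1 column per remaining level.
onehot = [Int(c == l) for c in colour, l in lvls[1:end-1]]
```

Here `ordinal`, `frequency`, and `target_enc` are single numeric columns, whereas `onehot` is a matrix with one column per retained level, which is why one-hot encoding grows with the number of categories.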

In [7]:
onehot_model = OneHot(drop_last = true, ordered_factor = true)
freq_model = MLJTransforms.FrequencyEncoder(normalize = false, ordered_factor = true)
target_model = MLJTransforms.TargetEncoder(lambda = 0.9, m = 5, ordered_factor = true)
ordinal_model = MLJTransforms.OrdinalEncoder(ordered_factor = true)
svm = SVC()

SVC(
  kernel = LIBSVM.Kernel.RadialBasis, 
  gamma = 0.0, 
  cost = 1.0, 
  cachesize = 200.0, 
  degree = 3, 
  coef0 = 0.0, 
  tolerance = 0.001, 
  shrinking = true)

Create four different pipelines to compare:

In [8]:
pipelines = [
    ("OneHot + SVM", onehot_model |> svm),
    ("FreqEnc + SVM", freq_model |> svm),
    ("TargetEnc + SVM", target_model |> svm),
    ("Ordinal + SVM", ordinal_model |> svm),
]

4-element Vector{Tuple{String, MLJBase.DeterministicPipeline{N, MLJModelInterface.predict} where N<:NamedTuple}}:
 ("OneHot + SVM", DeterministicPipeline(one_hot_encoder = OneHotEncoder(features = Symbol[], …), …))
 ("FreqEnc + SVM", DeterministicPipeline(frequency_encoder = FrequencyEncoder(features = Symbol[], …), …))
 ("TargetEnc + SVM", DeterministicPipeline(target_encoder = TargetEncoder(features = Symbol[], …), …))
 ("Ordinal + SVM", DeterministicPipeline(ordinal_encoder = OrdinalEncoder(features = Symbol[], …), …))

## Evaluate Pipelines
Use 10-fold cross-validation on the training rows to estimate each pipeline's accuracy:

In [9]:
results = DataFrame(pipeline = String[], accuracy = Float64[])

for (name, pipe) in pipelines
    println("Evaluating: $name")
    mach = machine(pipe, X, y)
    eval_results = evaluate!(
        mach,
        resampling = CV(nfolds = 10, rng = 123),
        measure = accuracy,
        rows = train,
        verbosity = 0,
    )
    acc = eval_results.measurement[1]  # one entry per measure, already aggregated over the 10 folds
    push!(results, (name, acc))
end

Evaluating: OneHot + SVM
Evaluating: FreqEnc + SVM
Evaluating: TargetEnc + SVM
Evaluating: Ordinal + SVM
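
The idea behind k-fold cross-validation can be sketched in plain Julia as follows. This is an illustration under simplifying assumptions, not how MLJ implements `CV`: the names `kfold_indices`, `cross_validate`, and the `score` callback are hypothetical, and the fold layout here (every k-th shuffled index) is just one convenient choice.

```julia
using Random

# Split 1:n into k shuffled, near-equal folds.
# Hypothetical helper; a sketch of the idea, not MLJ's CV implementation.
function kfold_indices(n::Integer, k::Integer; rng = Random.default_rng())
    idx = shuffle(rng, collect(1:n))
    return [idx[i:k:n] for i in 1:k]      # fold i takes every k-th shuffled index
end

# `score(train_rows, val_rows)` is a hypothetical callback that fits a model on
# train_rows and returns its accuracy on val_rows.
function cross_validate(score, n::Integer, k::Integer; rng = Random.default_rng())
    folds = kfold_indices(n, k; rng = rng)
    per_fold = map(folds) do val_rows
        train_rows = setdiff(1:n, val_rows)   # all rows not held out in this fold
        score(train_rows, val_rows)
    end
    return sum(per_fold) / k                  # average the k per-fold scores
end
```

Each row is held out exactly once, so the averaged score uses every training row for validation without ever scoring a model on rows it was fitted to.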


Sort results by accuracy (highest first) and display:

In [10]:
sort!(results, :accuracy, rev = true)
results

 Row │ pipeline         accuracy
     │ String           Float64
─────┼───────────────────────────
   1 │ OneHot + SVM     0.998951
   2 │ TargetEnc + SVM  0.974816
   3 │ Ordinal + SVM    0.940189
   4 │ FreqEnc + SVM    0.885624


## Results Analysis
One-hot encoding performed best here (cross-validated accuracy of about 0.999), followed by target encoding (about 0.975).
Ordinal encoding also did reasonably well (about 0.940), plausibly because every categorical feature in this dataset can be read as ordered.
Frequency encoding lagged behind (about 0.886): it maps each category to how often it occurs, so categories that occur about equally often become indistinguishable.
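
The frequency-encoding limitation is easy to reproduce on a toy column. The sketch below is plain Julia with made-up data (`feature` and `n_per_level` are illustrative names, not part of the tutorial's dataset or of MLJTransforms):

```julia
# Two categories with equal counts receive the same frequency code.
feature = ["a", "a", "b", "b", "c"]        # hypothetical categorical column

# Count the occurrences of each category.
n_per_level = Dict(l => count(==(l), feature) for l in unique(feature))

# Replace each value by its category's count: "a" and "b" both become 2.
encoded = [n_per_level[v] for v in feature]
```

After encoding, the classifier can no longer tell `"a"` from `"b"`, even if the two categories relate to the target very differently.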

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*