1. Define the Prediction Target

* Choose a **numeric target** that has some real-world meaning (e.g., price, rating, quantity).
* Give the target non-linear relationships with some of the categorical variables, so that a simple linear model fit without proper preprocessing performs noticeably worse.

Example:

* Target: `House_Price` (continuous)
* Model drivers: location (nominal), material quality (ordinal), zoning (nominal), construction phase (ordinal), etc.

2. Build Multiple Categorical Variables with Different Encoding Needs

You want participants to *infer* the encoding type from the data distribution, not from documentation.

| Variable                 | Type    | Encoding Participants Should Choose | Design Note                                         |
| ------------------------ | ------- | ----------------------------------- | --------------------------------------------------- |
| `Neighborhood`           | Nominal | One-hot                             | No inherent order, \~8–15 categories                |
| `Quality_Rating`         | Ordinal | Ordinal encoding                    | Values like `Poor`, `Fair`, `Good`, `Excellent`     |
| `Material_Type`          | Nominal | One-hot                             | Discrete material names                             |
| `Construction_Stage`     | Ordinal | Ordinal encoding                    | `Planning` < `Foundation` < `Framing` < `Finishing` |
| `Energy_Efficiency_Band` | Ordinal | Ordinal encoding                    | `G`, `F`, `E`, `D`, `C`, `B`, `A`                   |
| `Color`                  | Nominal | One-hot                             | Non-ordered strings (`Red`, `Blue`, etc.)           |
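The encoding choices in the table can be sketched with pandas alone. This is a minimal illustration on a made-up four-row frame (the `toy` frame and its values are assumptions, not part of the challenge data); the rank orders are taken from the design notes above.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the table above.
toy = pd.DataFrame({
    "Neighborhood": ["Downtown", "Suburb", "Seaside", "Suburb"],
    "Quality_Rating": ["Good", "Poor", "Excellent", "Fair"],
    "Energy_Efficiency_Band": ["C", "G", "A", "E"],
})

# Ordinal columns: spell out the rank order explicitly, worst to best.
ordinal_orders = {
    "Quality_Rating": ["Poor", "Fair", "Good", "Excellent"],
    "Energy_Efficiency_Band": ["G", "F", "E", "D", "C", "B", "A"],
}

encoded = toy.copy()
for col, order in ordinal_orders.items():
    # .codes gives the integer rank of each value within `order`.
    encoded[col] = pd.Categorical(encoded[col], categories=order, ordered=True).codes

# Nominal columns: one indicator column per category, no implied order.
encoded = pd.get_dummies(encoded, columns=["Neighborhood"], dtype=int)
print(encoded)
```

Spelling out the category order matters: without it, the integer codes follow alphabetical order, which would rank `Excellent` below `Fair`.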

3. Inject Ambiguity and Noise

* **Ambiguous ordinal cues**: Name ordinal categories with words instead of numbers so participants must map them manually.
* **Mixed granularity**: Some categorical variables have many unique categories (making one-hot expensive), others have few (making one-hot more feasible).
* **Irrelevant columns**: Add categorical columns that have no effect on the target to encourage feature selection.
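The mixed-granularity and irrelevant-column ideas above might be sketched as follows. All column names (`Street_ID`, `Heating_Type`, `Mailbox_Color`), category counts, and effect sizes are hypothetical; for simplicity only `Heating_Type` enters the target here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# High-cardinality nominal column: up to 200 distinct values (one-hot is expensive).
street = rng.choice([f"Street_{i:03d}" for i in range(200)], size=n)
# Low-cardinality nominal column: 3 values (one-hot is cheap).
heating = rng.choice(["Gas", "Electric", "Solar"], size=n)
# Irrelevant column: drawn independently and never used in the target.
mailbox = rng.choice(["Red", "Blue", "Green", "White"], size=n)

# Assumed effect sizes, for illustration only.
heating_effect = pd.Series(heating).map(
    {"Gas": 0, "Electric": 20_000, "Solar": 50_000}
).to_numpy()
price = 200_000 + heating_effect + rng.normal(0, 10_000, size=n)

noise_df = pd.DataFrame({
    "Street_ID": street,
    "Heating_Type": heating,
    "Mailbox_Color": mailbox,
    "Price": price,
})

# Number of distinct categories per column.
cardinality = noise_df[["Street_ID", "Heating_Type", "Mailbox_Color"]].nunique()

def mean_spread(col):
    """Gap between the highest and lowest per-category mean price."""
    means = noise_df.groupby(col)["Price"].mean()
    return means.max() - means.min()

print(cardinality)
print(mean_spread("Heating_Type"), mean_spread("Mailbox_Color"))
```

The spread for the irrelevant column should be small compared with a column that truly drives the target, which is the signal participants would use to drop it.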

4. Create Non-trivial Relationships

* For **ordinal categories**, make sure the target has a monotonic relationship with the category rank (so ordinal encoding helps).
* For **nominal categories**, make sure the target relationship is *non-monotonic*, so ordinal encoding would inject false order and hurt performance.
* Introduce interaction effects between categorical and numeric variables.
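A quick way to check that a generated dataset follows these rules is to compare a linear fit that integer-codes a nominal column against one that one-hot encodes it. The sketch below is an assumption-laden illustration: the per-neighborhood price per m² values (the interaction term) are invented, and `r_squared` is a hypothetical helper built on ordinary least squares.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

neighborhoods = ["Countryside", "Downtown", "Industrial", "Seaside", "Suburb"]
neigh = pd.Series(rng.choice(neighborhoods, size=n))
quality = pd.Series(rng.choice(["Poor", "Fair", "Good", "Excellent"], size=n))
size = rng.normal(150, 40, size=n)

# Ordinal: effect grows monotonically with rank.
quality_effect = quality.map(
    {"Poor": 0, "Fair": 20_000, "Good": 40_000, "Excellent": 60_000}
).to_numpy()
# Nominal: effect zig-zags across the (alphabetical) category order.
neigh_effect = neigh.map(
    {"Countryside": -10_000, "Downtown": 30_000, "Industrial": -20_000,
     "Seaside": 25_000, "Suburb": 10_000}
).to_numpy()
# Interaction: price per m^2 depends on the neighborhood (assumed values).
price_per_m2 = neigh.map(
    {"Countryside": 800, "Downtown": 1500, "Industrial": 700,
     "Seaside": 1400, "Suburb": 1000}
).to_numpy()

price = (50_000 + size * price_per_m2 + quality_effect + neigh_effect
         + rng.normal(0, 10_000, size=n))

def r_squared(features, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - resid.var() / y.var()

# Integer codes impose a false order on the nominal column.
codes = pd.Categorical(neigh, categories=neighborhoods).codes
one_hot = pd.get_dummies(neigh, drop_first=True, dtype=float).to_numpy()

r2_int = r_squared(np.column_stack([size, codes]), price)
r2_hot = r_squared(np.column_stack([size, one_hot]), price)
print(f"integer-coded: {r2_int:.3f}  one-hot: {r2_hot:.3f}")
```

If the two scores come out close, the nominal effect is probably too close to monotonic in whatever order the integer codes follow, and the design would not penalize the wrong encoding.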

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed so the dataset is reproducible
n = 5000

# Nominal (Neighborhood)
neighborhoods = ['Downtown', 'Suburb', 'Countryside', 'Industrial', 'Seaside']
neigh = np.random.choice(neighborhoods, n)

# Ordinal (Quality Rating), with uneven class frequencies
quality_levels = ['Poor', 'Fair', 'Good', 'Excellent']
quality = np.random.choice(quality_levels, n, p=[0.2, 0.3, 0.3, 0.2])

# Nominal (Material)
materials = ['Brick', 'Wood', 'Concrete', 'Steel']
material = np.random.choice(materials, n)

# Ordinal (Construction Stage)
stages = ['Planning', 'Foundation', 'Framing', 'Finishing']
stage = np.random.choice(stages, n)

# Numeric features
size = np.random.normal(150, 40, n)  # floor area in m^2
age = np.random.randint(0, 50, n)    # age in years, 0 to 49

# Rank maps for the ordinal variables: the price effect grows with rank
quality_map = {'Poor': 0, 'Fair': 1, 'Good': 2, 'Excellent': 3}
stage_map = {'Planning': 0, 'Foundation': 1, 'Framing': 2, 'Finishing': 3}

# Per-category effects for the nominal variable: no meaningful order
neigh_effect_map = {'Downtown': 30_000, 'Suburb': 10_000,
                    'Countryside': -10_000, 'Industrial': -20_000,
                    'Seaside': 25_000}

base_price = 50_000 + size * 1000 - age * 500
effect_quality = np.array([quality_map[q] * 20_000 for q in quality])
effect_stage = np.array([stage_map[s] * 5_000 for s in stage])
# Loop variable renamed so it no longer reuses the sample-count name `n`
effect_neigh = np.array([neigh_effect_map[nb] for nb in neigh])

# Note: Material_Type does not enter the price formula below,
# so in this version it has no effect on the target.
noise = np.random.normal(0, 10_000, n)
price = base_price + effect_quality + effect_stage + effect_neigh + noise

df = pd.DataFrame({
    'Neighborhood': neigh,
    'Quality_Rating': quality,
    'Material_Type': material,
    'Construction_Stage': stage,
    'Size_m2': size,
    'Age_Years': age,
    'Price': price
})
```

Consider adding a "trap" column: a categorical variable whose labels *look* ordinal (like `"Alpha", "Beta", "Gamma"`) but whose relationship with the target is actually nominal.
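One possible way to build such a trap column is sketched below. The column name `Plan_Tier`, the extra `Delta` level, and the effect sizes are all hypothetical. The only requirement is that the per-category effects do not rise or fall steadily in the apparent label order.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 3000

# Labels suggest an order (Alpha < Beta < Gamma < Delta)...
tiers = ["Alpha", "Beta", "Gamma", "Delta"]
tier = pd.Series(rng.choice(tiers, size=n))

# ...but the assumed effects zig-zag, so the column behaves as nominal.
tier_effect = {"Alpha": 15_000, "Beta": -20_000, "Gamma": 30_000, "Delta": -5_000}
price = 200_000 + tier.map(tier_effect).to_numpy() + rng.normal(0, 10_000, size=n)

trap_df = pd.DataFrame({"Plan_Tier": tier, "Price": price})

# Sanity check: per-category means, listed in the apparent label order.
means = trap_df.groupby("Plan_Tier")["Price"].mean().reindex(tiers)
steps = np.diff(means.to_numpy())
is_monotonic = bool(np.all(steps > 0) or np.all(steps < 0))

print(means.round(0))
print("monotonic in label order:", is_monotonic)
```

Participants who plot the target against the category, as in the sanity check, should spot that the apparent order carries no signal and choose one-hot encoding.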