# **Digitizing Categorical Features: Diamond Dataset**

## This notebook implements two approaches:
1. **Label Encoding**: Replace categories with integers (e.g., D=0, E=1, ...)
2. **One-Hot Encoding**: Replace a categorical variable with ℓ dummy (binary) variables, one per category level.

We apply these to the Diamond.csv dataset (Chu, 2001), which contains three categorical features:
- `colour` (D, E, F, G, H, I)
- `clarity` (IF, VVS1, VVS2, VS1, VS2)
- `certification` (GIA, IGI, HRD)

## **1. Import Libraries and Load Data**

In [18]:
import pandas as pd

# Load the dataset
df = pd.read_csv("Diamond.csv")

print("Original dataset shape:", df.shape)
df.head()

Original dataset shape: (308, 5)


Unnamed: 0,carat,colour,clarity,certification,price
0,0.3,D,VS2,GIA,1302
1,0.3,E,VS1,GIA,1510
2,0.3,G,VVS1,GIA,1510
3,0.3,G,VS1,GIA,1260
4,0.31,D,VS1,GIA,1641


## **2. Inspect Categorical Variables**

In [19]:
# Display unique values for each categorical column
for col in ['colour', 'clarity', 'certification']:
    print(f"\nUnique values in '{col}':")
    print(sorted(df[col].unique()))

# Note: `clarity` (and arguably `colour`, which is a grading scale) is ordinal, while
# `certification` is nominal. Sections 3 and 4 treat all three as nominal for encoding.


Unique values in 'colour':
['D', 'E', 'F', 'G', 'H', 'I']

Unique values in 'clarity':
['IF', 'VS1', 'VS2', 'VVS1', 'VVS2']

Unique values in 'certification':
['GIA', 'HRD', 'IGI']


## **3. Method 1: Label Encoding (Integer Mapping)**

In [20]:
# As described in the lecture (Slide 59):
# > "Encode the levels of the categorical variable with (integer) numerical value"
# > Example: ["Firefox", "Safari", "Chrome"] → [0, 1, 2]
# 
# ⚠️ **Warning**: This method implies an artificial order. Only use for **ordinal** data or tree-based models.

from sklearn.preprocessing import LabelEncoder

# Create a copy to avoid modifying the original
df_label = df.copy()

In [21]:
# Apply LabelEncoder to each categorical column
label_encoders = {}
for col in ['colour', 'clarity', 'certification']:
    le = LabelEncoder()
    df_label[col + '_label'] = le.fit_transform(df[col])
    label_encoders[col] = le

In [22]:
# Display mapping for 'colour'
print("\nLabel Encoding Mapping for 'colour':")
# The position of each class in classes_ is its integer code
le_colour = label_encoders['colour']
colour_mapping = {cls: int(code) for code, cls in enumerate(le_colour.classes_)}
print(colour_mapping)


Label Encoding Mapping for 'colour':
{'D': 0, 'E': 1, 'F': 2, 'G': 3, 'H': 4, 'I': 5}


In [23]:
# Show first few rows of encoded data
df_label[['colour', 'colour_label', 'clarity', 'clarity_label', 'certification', 'certification_label']].head()

Unnamed: 0,colour,colour_label,clarity,clarity_label,certification,certification_label
0,D,0,VS2,2,GIA,0
1,E,1,VS1,1,GIA,0
2,G,3,VVS1,3,GIA,0
3,G,3,VS1,1,GIA,0
4,D,0,VS1,1,GIA,0


### **Interpretation**:
- Each category is replaced by an integer (0, 1, 2, ...).
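- **Do not use this for linear models** unless the integer codes follow a genuine order. `LabelEncoder` numbers categories alphabetically, so for `clarity` it gives IF=0, VS1=1, VS2=2, VVS1=3, VVS2=4, which does not match the quality ranking.
- Generally acceptable for tree-based models (Random Forest, XGBoost), which split on thresholds and use only the order of the codes, not their spacing.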
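
The alphabetical numbering is easy to check directly. The sketch below is a minimal illustration that uses a hand-written list of the five clarity grades rather than rows from Diamond.csv; the names `clarity_values`, `quality_order`, and `mismatched` are introduced here for illustration only. It compares the codes `LabelEncoder` assigns with a worst-to-best quality ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only: the five clarity grades, not rows read from Diamond.csv.
clarity_values = ["VS2", "VS1", "VVS1", "IF", "VVS2"]

le = LabelEncoder()
le.fit(clarity_values)

# classes_ is sorted lexicographically; each class's position is its integer code.
label_codes = {str(c): i for i, c in enumerate(le.classes_)}
print(label_codes)

# Quality ranking from worst to best, for comparison.
quality_order = ["VS2", "VS1", "VVS2", "VVS1", "IF"]
quality_codes = {c: i for i, c in enumerate(quality_order)}

# Grades whose alphabetical code differs from their quality rank.
mismatched = sorted(c for c in quality_order if label_codes[c] != quality_codes[c])
print(mismatched)
```

Any grade listed in `mismatched` would be placed at the wrong point on the scale if the label codes were fed to a model that reads them as magnitudes.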

## **4. Method 2: One-Hot Encoding (Dummy Variables)**

In [24]:
# As described in the lecture (Slide 59):
# > "Replace a categorical variable with ℓ dummy variables"
# > (We typically use ℓ−1 dummies in regression to avoid multicollinearity, but for ML, ℓ is often fine.)
# 
# This creates **binary (0/1) columns** for each category.

# Create a copy
df_onehot = df.copy()

In [25]:
# Apply one-hot encoding using pandas
# dtype=int yields 0/1 columns; pandas 2.x otherwise defaults to True/False
cat_cols = ['colour', 'clarity', 'certification']
df_onehot = pd.get_dummies(df_onehot, columns=cat_cols, prefix=cat_cols, dtype=int)

print("\nShape after one-hot encoding:", df_onehot.shape)
print("New columns (first 10):", list(df_onehot.columns)[:10])


Shape after one-hot encoding: (308, 16)
New columns (first 10): ['carat', 'price', 'colour_D', 'colour_E', 'colour_F', 'colour_G', 'colour_H', 'colour_I', 'clarity_IF', 'clarity_VS1']


In [26]:
# Show first few rows of one-hot encoded data
df_onehot.filter(regex='^(colour_|clarity_|certification_)').head()

Unnamed: 0,colour_D,colour_E,colour_F,colour_G,colour_H,colour_I,clarity_IF,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2,certification_GIA,certification_HRD,certification_IGI
0,1,0,0,0,0,0,0,0,1,0,0,1,0,0
1,0,1,0,0,0,0,0,1,0,0,0,1,0,0
2,0,0,0,1,0,0,0,0,0,1,0,1,0,0
3,0,0,0,1,0,0,0,1,0,0,0,1,0,0
4,1,0,0,0,0,0,0,1,0,0,0,1,0,0


### **Interpretation**:
- Each original category becomes a new binary column.
- A value of 1 means the observation belongs to that category; 0 means it does not.
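- This method **does not impose ordering**, so it suits models that read inputs as magnitudes (linear models, neural nets, distance-based methods). In a regression with an intercept, keep only ℓ−1 of the ℓ dummies to avoid perfect multicollinearity.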
- **Downside**: Increases dimensionality (e.g., 6 colours → 6 new columns).
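
For the ℓ−1 variant used in regression, `pd.get_dummies` accepts `drop_first=True`. The sketch below is a minimal illustration on a toy frame (the `toy` DataFrame is made up for this example and is not Diamond.csv); it shows the full and reduced sets of dummy columns side by side.

```python
import pandas as pd

# Toy frame with the three certification bodies; illustrative, not Diamond.csv.
toy = pd.DataFrame({"certification": ["GIA", "IGI", "HRD", "GIA"]})

# Full set: one dummy per level (3 columns).
full = pd.get_dummies(toy, columns=["certification"], dtype=int)

# Reduced set: the first level (alphabetically, GIA) is dropped and becomes the baseline.
reduced = pd.get_dummies(toy, columns=["certification"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))

# With the full set, every row sums to 1, which duplicates an intercept column.
row_sums = full.sum(axis=1).tolist()
print(row_sums)
```

In the reduced frame, a row with zeros in every dummy column belongs to the dropped baseline level.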

## **5. Comparison and Best Practices**

### When to Use Which?

| Method               | Pros                               | Cons                             | Recommended For                        |
|----------------------|------------------------------------|----------------------------------|----------------------------------------|
| **Label Encoding**   | Compact (1 column)                 | Implies false ordering           | Tree-based models, ordinal data        |
| **One-Hot Encoding** | No artificial order, interpretable | High dimensionality, sparse data | Linear models, neural nets, clustering |


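### Special Note on `clarity`:
The `clarity` variable is **ordinal** (IF > VVS1 > VVS2 > VS1 > VS2).
For such cases, **ordinal encoding** (a custom integer mapping) is better than label encoding, because the integers then follow the quality ranking rather than alphabetical order.

In [29]:
# Example: map each clarity grade to its rank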
clarity_order = ['VS2', 'VS1', 'VVS2', 'VVS1', 'IF']  # worst to best
df['clarity_ordinal'] = df['clarity'].map({v: i for i, v in enumerate(clarity_order)})

print("\nOrdinal encoding for clarity (0=worst, 4=best):")
df[['clarity', 'clarity_ordinal']].drop_duplicates().sort_values('clarity_ordinal')



Ordinal encoding for clarity (0=worst, 4=best):


Unnamed: 0,clarity,clarity_ordinal
0,VS2,0
1,VS1,1
7,VVS2,2
2,VVS1,3
83,IF,4
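
One thing to watch with a dictionary `.map` is how values outside the mapping are handled. The sketch below is a minimal illustration on a hand-written Series (not Diamond.csv); `"SI1"` is included as a hypothetical grade that is not among the five in this dataset. It compares `.map` with an ordered `pd.Categorical`, which is one alternative way to express the same ordering in pandas.

```python
import pandas as pd

clarity_order = ["VS2", "VS1", "VVS2", "VVS1", "IF"]  # worst to best

# Illustrative values; "SI1" is a hypothetical grade outside clarity_order.
s = pd.Series(["IF", "VS2", "VVS1", "SI1"])

# Dictionary mapping: grades missing from the mapping become NaN.
mapping = {grade: rank for rank, grade in enumerate(clarity_order)}
mapped = s.map(mapping)
missing_mask = mapped.isna().tolist()
print(missing_mask)

# Ordered categorical: codes follow clarity_order, and unknown grades get code -1.
cat = pd.Categorical(s, categories=clarity_order, ordered=True)
codes = cat.codes.tolist()
print(codes)
```

Either way, it is worth checking for NaN or -1 after encoding, since both indicate a grade that was not in the declared order.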


In [30]:
# Save encoded datasets if needed
# df_label.to_csv("Diamond_label_encoded.csv", index=False)
# df_onehot.to_csv("Diamond_onehot_encoded.csv", index=False)