## Encoding Categorical Variables

### Why encode?
Most ML models require numeric inputs. Categorical variables (nominal/ordinal) must be mapped to numbers **without distorting meaning**.

---

### One-Hot Encoding (OHE)
Creates a binary (0/1) feature for each category.

- **Use for:** Nominal (unordered) categories (e.g., color, city).
- **Pros:** No order is implied; works well with linear models, distance-based models, and neural nets.
- **Cons:** Can explode dimensionality with high-cardinality features.

**Dummy Variable Trap:**  
If you keep every one-hot column, the columns sum to 1 in each row, which makes them perfectly collinear with the intercept in linear models.  
**Fix:** drop one reference category (e.g., `drop='first'`) or use regularization.
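
The redundancy can be checked numerically. The sketch below uses a made-up `color` column (not from any dataset in this notebook) and compares the rank of the design matrix with and without a dropped category; the intercept column is added by hand for illustration.

```python
import numpy as np
import pandas as pd

# Toy nominal feature (hypothetical data)
colors = pd.Series(["red", "blue", "green", "blue", "red"], name="color")

# Full one-hot encoding: one column per category
full = pd.get_dummies(colors, dtype=int)
row_sums = full.sum(axis=1)  # every row sums to 1

# Intercept column plus all dummy columns: 4 columns in total
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one category removes the redundant column
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # more columns than rank
print(X_reduced.shape[1], rank_reduced)  # columns equal rank
```

When the rank is lower than the number of columns, ordinary least squares has no unique solution, which is the practical meaning of the trap.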

---

### Ordinal Encoding
Maps categories to ordered integers (e.g., `S < M < L < XL` → `0,1,2,3`).

- **Use for:** Truly ordered categories.
- **Pros:** Preserves ordinal relationships.
- **Cons:** If used on nominal data, it creates a fake order and misleads models.

---

### Label Encoding
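Maps each class to an integer between 0 and `n_classes - 1`. The integers carry no meaning beyond identity; scikit-learn's `LabelEncoder` assigns them in sorted order of the labels (e.g., `{"blue":0, "green":1, "red":2}`).
It is meant for encoding target labels, not input variables.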

- **Use for:** Target labels in classification; **not** for nominal features in linear/distance-based models.
- **Cons:** When applied to a nominal feature, it implies an order where none exists → can harm models.

---

### OHE with Most-Frequent Categories (High Cardinality)
When a column has many categories:
- Keep **Top-K** most frequent categories, group the rest into `"Other"`.
- Or use **thresholds** (e.g., `min_frequency` in scikit-learn's `OneHotEncoder`) to automatically group rare categories.

**Trade-offs:**
- Keeps feature space manageable.
- Loses some information about rare categories, which is usually an acceptable cost since each rare category has few rows to learn from.

---

## Where to Use What?

| Data Type | Good Choice | Caution |
|---|---|---|
| Nominal | **One-Hot** (drop a baseline to avoid dummy trap) | Avoid Label Encoding for features |
| Ordinal | **Ordinal Encoding** (explicit order) | Don’t one-hot if order matters strongly |
| High Cardinality | **Top-K OHE**, **min_frequency**, **hashing** | Full OHE may blow up feature space |
| Tree Models (RF/XGB) | No scaling needed; **ordinal integers** often work as-is; OHE sometimes helps | Label Encoding on nominal can still mislead splits |
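
The table mentions hashing, which is not demonstrated elsewhere in this notebook. A rough sketch with scikit-learn's `FeatureHasher` follows; the choice of `n_features=8` and the brand list are arbitrary assumptions for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical brand values
brands = ["Maruti", "Skoda", "Honda", "Hyundai", "Maruti"]

# Each sample is passed as a list of strings; every string is hashed
# into one of n_features buckets, so the output width is fixed
# regardless of how many distinct brands exist.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[b] for b in brands]).toarray()

print(hashed.shape)
print(hashed)
```

The trade-off is that different categories can collide in the same bucket, and the columns are no longer interpretable by name.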

---

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('cars.csv')

In [3]:
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [4]:
df['owner'].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

### 1. OneHotEncoding using Pandas

In [5]:
pd.get_dummies(df, columns=['fuel','owner'])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False


### 2. K-1 OneHotEncoding

In [6]:
pd.get_dummies(df,columns=['fuel','owner'],drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False


### 3. OneHotEncoding using Sklearn

In [7]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.iloc[:,0:4],df.iloc[:,-1],test_size=0.2,random_state=2)

In [8]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
5571,Hyundai,35000,Diesel,First Owner
2038,Jeep,60000,Diesel,First Owner
2957,Hyundai,25000,Petrol,First Owner
7618,Mahindra,130000,Diesel,Second Owner
6684,Hyundai,155000,Diesel,First Owner


In [9]:
from sklearn.preprocessing import OneHotEncoder

In [11]:
ohe = OneHotEncoder(drop='first',sparse_output=False,dtype=np.int32)

In [12]:
X_train_new = ohe.fit_transform(X_train[['fuel','owner']])

In [13]:
X_test_new = ohe.transform(X_test[['fuel','owner']])

In [15]:
X_train_new.shape

(6502, 7)

In [16]:
np.hstack((X_train[['brand','km_driven']].values,X_train_new))

array([['Hyundai', 35000, 1, ..., 0, 0, 0],
       ['Jeep', 60000, 1, ..., 0, 0, 0],
       ['Hyundai', 25000, 0, ..., 0, 0, 0],
       ...,
       ['Tata', 15000, 0, ..., 0, 0, 0],
       ['Maruti', 32500, 1, ..., 1, 0, 0],
       ['Isuzu', 121000, 1, ..., 0, 0, 0]], shape=(6502, 9), dtype=object)

### 4. OneHotEncoding with Top Categories

In [17]:
counts = df['brand'].value_counts()

In [18]:
df['brand'].nunique()
threshold = 100

In [19]:
repl = counts[counts <= threshold].index

In [150]:
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
8093,0,0,0,0,1,0,0,0,0,0,0,0,0
3274,0,0,0,0,0,0,1,0,0,0,0,0,0
2966,0,0,0,0,0,0,1,0,0,0,0,0,0
1092,1,0,0,0,0,0,0,0,0,0,0,0,0
5355,0,0,0,0,0,0,0,0,0,0,0,0,1
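
The cell above computes the rare brands on the full DataFrame. With a train/test split, one option is to learn the list of kept brands from the training data only and reuse it on the test data. The sketch below assumes toy Series and a hypothetical helper `group_rare`:

```python
import pandas as pd

# Hypothetical train/test brand columns
train_brand = pd.Series(["Maruti"] * 4 + ["Hyundai"] * 3 + ["Jeep", "Isuzu"])
test_brand = pd.Series(["Maruti", "Opel", "Hyundai", "Jeep"])

threshold = 2  # brands with count <= threshold are treated as rare
counts = train_brand.value_counts()
keep = counts[counts > threshold].index

def group_rare(s, keep, other="uncommon"):
    """Replace every value not in `keep` with the `other` label."""
    return s.where(s.isin(keep), other)

train_ohe = pd.get_dummies(group_rare(train_brand, keep), dtype=int)

# Reindex so the test frame has exactly the training columns
test_ohe = pd.get_dummies(group_rare(test_brand, keep), dtype=int).reindex(
    columns=train_ohe.columns, fill_value=0
)
print(test_ohe)
```

Brands that never appeared in training (here `Opel`) fall into `uncommon` as well, so the test matrix always has the same columns as the training matrix.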


### 5. Ordinal Encoding

In [54]:
df = pd.read_csv('customer.csv')
df.sample(5)

Unnamed: 0,age,gender,review,education,purchased
22,18,Female,Poor,PG,Yes
45,61,Male,Poor,PG,Yes
18,19,Male,Good,School,No
9,74,Male,Good,UG,Yes
42,30,Female,Good,PG,Yes


In [55]:
df = df.iloc[:,2:]
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [56]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns=['purchased']),df['purchased'],
                                                test_size=0.2)

In [57]:
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ['review', 'education']

oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']],
                   handle_unknown='use_encoded_value',
                    unknown_value=-1)

oe.fit_transform(X_train[cat_cols])

array([[1., 0.],
       [1., 2.],
       [1., 0.],
       [0., 2.],
       [2., 1.],
       [1., 1.],
       [2., 2.],
       [1., 1.],
       [0., 0.],
       [0., 1.],
       [0., 2.],
       [0., 1.],
       [0., 0.],
       [2., 0.],
       [2., 2.],
       [2., 1.],
       [2., 0.],
       [0., 0.],
       [1., 1.],
       [0., 2.],
       [0., 2.],
       [1., 0.],
       [1., 0.],
       [2., 0.],
       [0., 2.],
       [2., 1.],
       [1., 2.],
       [0., 0.],
       [0., 1.],
       [1., 1.],
       [0., 2.],
       [0., 1.],
       [2., 2.],
       [0., 0.],
       [1., 1.],
       [2., 1.],
       [1., 2.],
       [2., 2.],
       [0., 2.],
       [2., 1.]])

In [58]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
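
The encoder above sets `handle_unknown='use_encoded_value'` with `unknown_value=-1`, but the training data never exercises it. A small sketch of what happens at transform time, using invented values (`Excellent`, `PhD`) that are not in `categories`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

train = pd.DataFrame({
    "review": ["Poor", "Good", "Average"],
    "education": ["UG", "PG", "School"],
})
oe.fit(train)

# 'Excellent' and 'PhD' are hypothetical values absent from `categories`
new_rows = pd.DataFrame({
    "review": ["Good", "Excellent"],
    "education": ["PhD", "School"],
})
encoded_new = oe.transform(new_rows)
print(encoded_new)
# [[ 2. -1.]
#  [-1.  0.]]
```

Known categories get their position in the supplied order, and anything unseen becomes `-1` instead of raising an error.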

### 6. Label Encoder

In [59]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [60]:
le.fit(y_train)

In [61]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [62]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [63]:
y_train

array([1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1])
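
After a model predicts integer classes, `inverse_transform` maps them back to the original labels. A minimal sketch with made-up labels and predictions:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = np.array(["Yes", "No", "No", "Yes"])

# Classes are stored in sorted order: 'No' -> 0, 'Yes' -> 1
y_encoded = le.fit_transform(y)

# Hypothetical integer predictions from some classifier
predictions = np.array([1, 0, 1])
decoded = le.inverse_transform(predictions)

print(y_encoded)  # [1 0 0 1]
print(decoded)    # ['Yes' 'No' 'Yes']
```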

## Column transformation - Normal way

**Simple Imputer**  
A univariate imputer that fills missing values in each column with a descriptive statistic (mean, median, or most frequent value) or with a constant.

In [20]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [21]:
df = pd.read_csv('covid_toy.csv')

In [22]:
df.head()

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


In [23]:
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In [24]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns=['has_covid']),df['has_covid'],
                                                test_size=0.2)

In [25]:
X_train


Unnamed: 0,age,gender,fever,cough,city
85,16,Female,103.0,Mild,Bangalore
10,75,Female,,Mild,Delhi
64,42,Male,104.0,Mild,Mumbai
74,34,Female,104.0,Strong,Delhi
13,64,Male,102.0,Mild,Bangalore
...,...,...,...,...,...
34,74,Male,102.0,Mild,Mumbai
30,15,Male,101.0,Mild,Delhi
68,54,Female,104.0,Strong,Kolkata
91,38,Male,,Mild,Delhi


In [32]:
X_train['city'].value_counts()

city
Bangalore    26
Kolkata      26
Delhi        17
Mumbai       11
Name: count, dtype: int64

In [28]:
# adding simple imputer to fever col
si = SimpleImputer(strategy='mean') 
X_train_fever = si.fit_transform(X_train[['fever']])

# also the test data
X_test_fever = si.transform(X_test[['fever']])
                                 
X_train_fever.shape

(80, 1)

In [29]:
# Ordinalencoding -> cough
oe = OrdinalEncoder(categories=[['Mild','Strong']])
X_train_cough = oe.fit_transform(X_train[['cough']])

# also the test data
X_test_cough = oe.transform(X_test[['cough']])

X_train_cough.shape

(80, 1)

In [33]:
# OneHotEncoding -> gender,city
ohe = OneHotEncoder(drop='first', sparse_output=False)
X_train_gender_city = ohe.fit_transform(X_train[['gender','city']])

# also the test data
X_test_gender_city = ohe.transform(X_test[['gender','city']])

X_train_gender_city.shape

(80, 4)

In [34]:
# Extracting Age
X_train_age = X_train.drop(columns=['gender','fever','cough','city']).values

# also the test data
X_test_age = X_test.drop(columns=['gender','fever','cough','city']).values

X_train_age.shape

(80, 1)

In [35]:
X_train_transformed = np.concatenate((X_train_age,X_train_fever,X_train_gender_city,X_train_cough),axis=1)
# also the test data
X_test_transformed = np.concatenate((X_test_age,X_test_fever,X_test_gender_city,X_test_cough),axis=1)

X_train_transformed.shape

(80, 7)

## Column transformation - Best way

In [37]:
from sklearn.compose import ColumnTransformer

In [38]:
transformer = ColumnTransformer(transformers=[
    ('tnf1',SimpleImputer(),['fever']),
    ('tnf2',OrdinalEncoder(categories=[['Mild','Strong']]),['cough']),
    ('tnf3',OneHotEncoder(sparse_output=False,drop='first'),['gender','city'])
],remainder='passthrough')

In [39]:
transformer.fit_transform(X_train).shape

(80, 7)

In [40]:
transformer.transform(X_test).shape

(20, 7)