# Categorical Data

Categorical data consists of values that fall into distinct categories or groups. It comes in two kinds: nominal (no inherent order, such as city names) and ordinal (a natural order, such as low, medium, high). Unlike numerical data, which represents measurable quantities, categorical data represents qualitative or descriptive characteristics. Understanding categorical data matters when working with machine learning models, because most models require numerical inputs, so categories must be encoded as numbers first.

- There are different encoding methods, each with its pros and cons. Here’s a quick guide to some popular choices:
    - **One-Hot Encoding:** Creates one new binary feature per category (1 for the category, 0 for the others). Works well when the number of categories is small; with many categories it produces a very wide, sparse feature matrix.
    - **Label Encoding:** Simple, assigns an integer to each category, but the integers imply an order that may not exist in the data.
    - **Ordinal Encoding:** Similar to label encoding, but with the integers assigned in a meaningful order; use it only if the categories have a natural order (like low, medium, high).
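
The difference between these choices is easiest to see on a tiny example. The following is a minimal sketch using pandas only; the column names and the `size_order` mapping are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): one nominal column and one ordinal column
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["low", "high", "medium", "low"],
})

# One-hot: one binary column per distinct color
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Ordinal: integers that follow the stated order low < medium < high
size_order = {"low": 0, "medium": 1, "high": 2}
size_encoded = df["size"].map(size_order)

print(one_hot)
print(size_encoded.tolist())
```

Each row of `one_hot` contains exactly one 1, while `size_encoded` keeps the ordering information in a single column.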

## Analyzing Categorical Features

### Value Counts

In [9]:
# read csv using pandas
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')

# check value counts of Cut column
data['Cut'].value_counts()


Cut
Ideal              2482
Very Good          2428
Good                708
Signature-Ideal     253
Fair                129
Name: count, dtype: int64

In [5]:
import plotly.express as px
cut_counts = data['Cut'].value_counts()
fig = px.bar(x=cut_counts.index, y=cut_counts.values)
fig.show()

### Group by

In [None]:
# read csv using pandas
# import pandas as pd
# data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')

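# average carat weight and price by Cut (numeric columns only;
# without numeric_only=True recent pandas versions raise an error on the text columns)
data.groupby(by='Cut').mean(numeric_only=True)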

### Cross tab

In [7]:
# read csv using pandas
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')

# cross tab of Cut and Color
pd.crosstab(index=data['Cut'], columns=data['Color'])


Cut \ Color,D,E,F,G,H,I
Fair,12,32,24,21,24,16
Good,74,110,133,148,128,115
Ideal,280,278,363,690,458,413
Signature-Ideal,30,35,38,64,45,41
Very Good,265,323,455,578,424,383
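
Raw counts are hard to compare across cuts of very different sizes, so it can help to turn each row into proportions. The sketch below uses a small made-up sample (not the diamond dataset) so that it runs without a download; `normalize="index"` is a standard `pd.crosstab` option.

```python
import pandas as pd

# Small hypothetical sample, so the example runs without downloading anything
sample = pd.DataFrame({
    "Cut":   ["Ideal", "Ideal", "Good", "Good", "Good", "Fair"],
    "Color": ["D",     "E",     "D",    "D",    "E",    "E"],
})

# normalize="index" divides each row by its row total
shares = pd.crosstab(index=sample["Cut"], columns=sample["Color"], normalize="index")
print(shares.round(2))
```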


### Pivot Table

In [17]:
# read csv using pandas
import pandas as pd
import numpy as np
data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')

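# create pivot table: mean Price for each Cut/Color combination
# (the string 'mean' avoids the deprecation warning newer pandas emits for np.mean)
pd.pivot_table(data, values='Price', index='Cut', columns='Color', aggfunc='mean')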

Cut \ Color,D,E,F,G,H,I
Fair,6058.25,5370.625,6063.625,7345.52381,5908.5,4573.1875
Good,10058.716216,8969.545455,9274.007519,9988.614865,9535.132812,8174.113043
Ideal,18461.953571,12647.107914,14729.426997,13570.310145,11527.700873,9459.588378
Signature-Ideal,19823.1,11261.914286,13247.947368,10248.296875,9112.688889,8823.463415
Very Good,13218.826415,12101.910217,12413.905495,12354.013841,10056.106132,8930.031332


## category_encoders Library

The `category_encoders` Python package provides a collection of encoders for categorical data, all with a scikit-learn-style `fit`/`transform` interface.

In [3]:
%pip install category_encoders


### Label Encoding or Ordinal Encoding

In [4]:
import category_encoders as ce
import pandas as pd
train_df=pd.DataFrame({'Degree':['High school','Masters','Diploma','Bachelors','Bachelors','Masters','Phd','High school','High school']})

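# create the OrdinalEncoder object with an explicit category-to-integer mapping
# NOTE: mapping keys are case-sensitive. The key 'phd' below does not match the
# value 'Phd' in the data, so that row is encoded as -1 (unknown); use 'Phd' to get 5.
encoder= ce.OrdinalEncoder(cols=['Degree'],return_df=True,
                           mapping=[{'col':'Degree',
'mapping':{'None':0,'High school':1,'Diploma':2,'Bachelors':3,'Masters':4,'phd':5}}])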

#Original data
print(train_df)

        Degree
0  High school
1      Masters
2      Diploma
3    Bachelors
4    Bachelors
5      Masters
6          Phd
7  High school
8  High school


In [5]:
df_train_transformed = encoder.fit_transform(train_df)

In [6]:
df_train_transformed

,Degree
0,1.0
1,4.0
2,2.0
3,3.0
4,3.0
5,4.0
6,-1.0
7,1.0
8,1.0
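
The -1.0 in row 6 of the output above is what an unmatched category looks like: the data contains `Phd`, while the mapping only knows `phd`. The following pandas-only sketch illustrates the same idea; it is not how `category_encoders` is implemented, and the case-insensitive variant is just one possible safeguard.

```python
import pandas as pd

degrees = pd.Series(["High school", "Masters", "Phd", "Diploma"])

# Keys are matched exactly, so a key spelled 'phd' would NOT match 'Phd'
mapping = {"High school": 1, "Diploma": 2, "Bachelors": 3, "Masters": 4, "Phd": 5}

# Categories missing from the mapping become NaN; flag them with -1
encoded = degrees.map(mapping).fillna(-1).astype(int)
print(encoded.tolist())  # [1, 4, 5, 2]

# Normalising case on both sides avoids accidental mismatches
lower_map = {key.lower(): value for key, value in mapping.items()}
encoded_ci = pd.Series(["PHD", "masters"]).str.lower().map(lower_map)
print(encoded_ci.tolist())  # [5, 4]
```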


### One Hot Encoding

In [7]:
import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':[
'Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Bangalore','Delhi'
]})

#Create object for one-hot encoding
encoder=ce.OneHotEncoder(cols='City',handle_unknown='return_nan',return_df=True,use_cat_names=True)

#Original Data
data

,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad
7,Bangalore
8,Delhi


In [8]:
#Fit and transform Data
data_encoded = encoder.fit_transform(data)
data_encoded

,City_Delhi,City_Mumbai,City_Hyderabad,City_Chennai,City_Bangalore
0,1.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0
5,1.0,0.0,0.0,0.0,0.0
6,0.0,0.0,1.0,0.0,0.0
7,0.0,0.0,0.0,0.0,1.0
8,1.0,0.0,0.0,0.0,0.0


### Dummy Encoding

In [12]:
import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']})

#Original Data
data

,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad


In [13]:
#encode the data
data_encoded=pd.get_dummies(data=data,drop_first=True)
data_encoded

,City_Chennai,City_Delhi,City_Hyderabad,City_Mumbai
0,False,True,False,False
1,False,False,False,True
2,False,False,True,False
3,True,False,False,False
4,False,False,False,False
5,False,True,False,False
6,False,False,True,False


### Effect Encoding

In [16]:
import category_encoders as ce
import pandas as pd
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']}) 

#Create object for effect (sum) encoding
encoder=ce.SumEncoder(cols=['City'],verbose=False)

#Original Data
data

,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad


In [17]:
encoder.fit_transform(data)



,intercept,City_0,City_1,City_2,City_3
0,1,1.0,0.0,0.0,0.0
1,1,0.0,1.0,0.0,0.0
2,1,0.0,0.0,1.0,0.0
3,1,0.0,0.0,0.0,1.0
4,1,-1.0,-1.0,-1.0,-1.0
5,1,1.0,0.0,0.0,0.0
6,1,0.0,0.0,1.0,0.0
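
Effect (sum) encoding looks like dummy encoding, except that the reference category is coded as -1 in every column instead of 0. The sketch below rebuilds the `City_0` to `City_3` pattern by hand, assuming the reference is the last category in order of first appearance (which is what the output above suggests); it leaves out the `intercept` column.

```python
import pandas as pd

cities = pd.Series(["Delhi", "Mumbai", "Hyderabad", "Chennai",
                    "Bangalore", "Delhi", "Hyderabad"])

# Keep categories in order of first appearance; the last one is the reference
levels = list(pd.unique(cities))
reference = levels[-1]

# One indicator column per non-reference level
effect = pd.DataFrame(
    {f"City_{i}": (cities == level).astype(int) for i, level in enumerate(levels[:-1])}
)

# Rows of the reference level get -1 in every column
effect.loc[cities == reference, :] = -1
print(effect)
```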


### Hash Encoder

In [4]:
import category_encoders as ce
import pandas as pd

#Create the dataframe
data=pd.DataFrame({'Month':['January','April','March','April','Februay','June','July','June','September']})

#Create object for hash encoder
encoder=ce.HashingEncoder(cols='Month',n_components=6)

data

,Month
0,January
1,April
2,March
3,April
4,Februay
5,June
6,July
7,June
8,September


In [2]:
#Fit and Transform Data
encoder.fit_transform(data)

,col_0,col_1,col_2,col_3,col_4,col_5
0,0,0,0,0,1,0
1,0,0,0,1,0,0
2,0,0,0,0,1,0
3,0,0,0,1,0,0
4,0,0,0,1,0,0
5,0,1,0,0,0,0
6,1,0,0,0,0,0
7,0,1,0,0,0,0
8,0,0,0,0,1,0
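
In the output above, January, March and September all land in `col_4`: with a fixed number of columns, different categories can collide in the same bucket. The following pure-Python sketch shows the general hashing trick; the helper names are hypothetical, and it is not guaranteed to assign the same buckets as `HashingEncoder`.

```python
import hashlib

def hash_bucket(value: str, n_components: int) -> int:
    """Map a string to a bucket index in [0, n_components) with a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components=6):
    """Return one row per value with a single 1 in the hashed bucket."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows

months = ["January", "April", "March", "April", "June", "June"]
encoded = hash_encode(months)
for month, row in zip(months, encoded):
    print(f"{month:<8} {row}")
```

The same category always maps to the same bucket, so no fitted vocabulary is needed; the price is that collisions cannot be undone.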


### Binary Encoding

In [5]:
#Import the libraries
import category_encoders as ce
import pandas as pd

#Create the Dataframe
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})

#Create object for binary categorical encoding
encoder= ce.BinaryEncoder(cols=['City'],return_df=True)

#Original Data
data

,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad
7,Mumbai
8,Agra


In [6]:
#Fit and Transform Data 
data_encoded=encoder.fit_transform(data) 
data_encoded

,City_0,City_1,City_2
0,0,0,1
1,0,1,0
2,0,1,1
3,1,0,0
4,1,0,1
5,0,0,1
6,0,1,1
7,0,1,0
8,1,1,0


### Base N Encoding

In [9]:
#Import the libraries
import category_encoders as ce
import pandas as pd

#Create the dataframe
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})

#Create an object for Base N Encoding
encoder= ce.BaseNEncoder(cols=['City'],return_df=True,base=5)

#Original Data
data

,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad
7,Mumbai
8,Agra


In [10]:
#Fit and Transform Data
data_encoded=encoder.fit_transform(data)
data_encoded

,City_0,City_1
0,0,1
1,0,2
2,0,3
3,0,4
4,1,0
5,0,1
6,0,3
7,0,2
8,1,1
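
Binary and Base N encoding follow the same recipe: give each category an integer code, then write that code in the chosen base, one digit per column. The sketch below is a hand-rolled illustration (the helper `to_base_n` is hypothetical), assuming codes start at 1 in order of first appearance, which is consistent with the two outputs above.

```python
def to_base_n(code: int, base: int, width: int) -> list[int]:
    """Return the digits of `code` in the given base, left-padded to `width`."""
    digits = []
    for _ in range(width):
        digits.append(code % base)
        code //= base
    return digits[::-1]

cities = ["Delhi", "Mumbai", "Hyderabad", "Chennai", "Bangalore",
          "Delhi", "Hyderabad", "Mumbai", "Agra"]

# Ordinal code per category, starting at 1, in order of first appearance
codes = {}
for city in cities:
    codes.setdefault(city, len(codes) + 1)

binary = [to_base_n(codes[c], base=2, width=3) for c in cities]
base5 = [to_base_n(codes[c], base=5, width=2) for c in cities]

print(binary[0], base5[0])  # Delhi: [0, 0, 1] [0, 1]
print(binary[8], base5[8])  # Agra:  [1, 1, 0] [1, 1]
```

A larger base needs fewer columns for the same number of categories, at the cost of columns that are no longer 0/1.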


### Target Encoding

In [12]:
#import the libraries
import pandas as pd
import category_encoders as ce

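#Create the Dataframe
#Note: 'A,' (with a trailing comma) is a different string from 'A', so it is encoded as its own category
data=pd.DataFrame({'class':['A,','B','C','B','C','A','A','A'],'Marks':[50,30,70,80,45,97,80,68]})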

#Create target encoding object
encoder=ce.TargetEncoder(cols='class') 

#Original Data
data

,class,Marks
0,"A,",50
1,B,30
2,C,70
3,B,80
4,C,45
5,A,97
6,A,80
7,A,68


In [13]:
#Fit and Transform Train Data
encoder.fit_transform(data['class'],data['Marks'])

,class
0,63.048373
1,63.581489
2,63.936117
3,63.581489
4,63.936117
5,67.574421
6,67.574421
7,67.574421
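
The encoded values above are not the raw per-class means: class B has a raw mean of 55 (marks 30 and 80), yet it is encoded as about 63.58. `TargetEncoder` shrinks each category mean toward the overall mean (65 here), more strongly for categories with few rows. The sketch below shows the idea with a simple m-estimate; this formula and the value of `m` are assumptions for illustration, not the formula `category_encoders` uses, so the numbers differ from the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["A,", "B", "C", "B", "C", "A", "A", "A"],
    "Marks": [50, 30, 70, 80, 45, 97, 80, 68],
})

m = 5.0  # hypothetical smoothing strength: larger m pulls harder toward the global mean
global_mean = df["Marks"].mean()

# Per-category mean and row count
stats = df.groupby("class")["Marks"].agg(["mean", "count"])

# Weighted blend of the category mean and the global mean
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df["class_encoded"] = df["class"].map(smoothed)
print(df)
```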


### Key Takeaways

- Encoding categorical variables is an essential preprocessing step for machine learning, as most algorithms require numerical input.
- One-hot encoding is a common choice for nominal data, and ordinal (label) encoding for ordinal data.
- Methods such as target and hash encoding can handle high-cardinality categorical features with far fewer columns than one-hot encoding.
- The choice of encoding depends on the number of categories, whether they have an order, and the model being used.

## Using scikit-learn to Handle Categorical Variables

### Label Encoding

In [25]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example DataFrame
data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# Label Encoding
label_encoder = LabelEncoder()
df['Color_Label'] = label_encoder.fit_transform(df['Color'])
print("Label Encoded DataFrame:\n", df)


Label Encoded DataFrame:
    Color  Color_Label
0    Red            2
1   Blue            0
2  Green            1
3   Blue            0
4    Red            2


### One-Hot Encoding

In [19]:
# One-Hot Encoding using pandas
df_one_hot = pd.get_dummies(df, columns=['Color'], prefix='Color')
print("\nOne-Hot Encoded DataFrame:\n", df_one_hot)



One-Hot Encoded DataFrame:
    Color_Label  Color_Blue  Color_Green  Color_Red
0            2       False        False       True
1            0        True        False      False
2            1       False         True      False
3            0        True        False      False
4            2       False        False       True


## Using Pandas to Handle Categorical Encoding

### CategoricalDtype in Pandas

In [20]:
# Ordinal Encoding with pandas
from pandas.api.types import CategoricalDtype

# Define the order of categories
categories = ['Low', 'Medium', 'High']
cat_type = CategoricalDtype(categories=categories, ordered=True)

# Example DataFrame
df_ordinal = pd.DataFrame({'Priority': ['Low', 'Medium', 'High', 'Low', 'Medium']})
df_ordinal['Priority_Ordinal'] = df_ordinal['Priority'].astype(cat_type).cat.codes
print("\nOrdinal Encoded DataFrame:\n", df_ordinal)



Ordinal Encoded DataFrame:
   Priority  Priority_Ordinal
0      Low                 0
1   Medium                 1
2     High                 2
3      Low                 0
4   Medium                 1
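
Beyond producing integer codes, an ordered `CategoricalDtype` lets pandas compare and sort values by the declared order instead of alphabetically. A short sketch reusing the same three priority levels:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

priority_type = CategoricalDtype(categories=["Low", "Medium", "High"], ordered=True)
priority = pd.Series(["Low", "Medium", "High", "Low", "Medium"]).astype(priority_type)

# Comparisons follow the declared order, not alphabetical order
at_least_medium = priority >= "Medium"
print(at_least_medium.tolist())

# max and sorting also respect the order
print(priority.max())
print(priority.sort_values().tolist())
```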


### Frequency Encoding in Pandas

In [21]:
# Frequency Encoding
df_frequency = df.copy()
df_frequency['Color_Freq'] = df_frequency['Color'].map(df_frequency['Color'].value_counts())
print("\nFrequency Encoded DataFrame:\n", df_frequency)



Frequency Encoded DataFrame:
    Color  Color_Label  Color_Freq
0    Red            2           2
1   Blue            0           2
2  Green            1           1
3   Blue            0           2
4    Red            2           2


### Target Encoding (Mean Encoding) in Pandas

In [22]:
# Example DataFrame with target variable
df_target = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Price': [10, 20, 30, 20, 10]
})

# Target Encoding
target_mean = df_target.groupby('Color')['Price'].mean()
df_target['Color_Target'] = df_target['Color'].map(target_mean)
print("\nTarget Encoded DataFrame:\n", df_target)



Target Encoded DataFrame:
    Color  Price  Color_Target
0    Red     10          10.0
1   Blue     20          20.0
2  Green     30          30.0
3   Blue     20          20.0
4    Red     10          10.0


### Leave-One-Out Encoding

In [23]:
import category_encoders as ce

# Leave-One-Out Encoding
loo_encoder = ce.LeaveOneOutEncoder(cols=['Color'], random_state=42)
df_loo = loo_encoder.fit_transform(df_target['Color'], df_target['Price'])
df_loo.columns = ['Color_LOO']
print("\nLeave-One-Out Encoded DataFrame:\n", df_loo)



Leave-One-Out Encoded DataFrame:
    Color_LOO
0       10.0
1       20.0
2       18.0
3       20.0
4       10.0
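
Leave-one-out encoding replaces each row's category with the mean target of the other rows in that category, so a row never sees its own target value. The pandas sketch below reproduces the numbers above by hand; the fallback to the global mean for single-row categories is inferred from the value 18.0 for Green, not taken from the library's source.

```python
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Blue", "Red"],
    "Price": [10, 20, 30, 20, 10],
})

group = df.groupby("Color")["Price"]
group_sum = group.transform("sum")
group_count = group.transform("count")

# Mean of the *other* rows in the same category
loo = (group_sum - df["Price"]) / (group_count - 1)

# A category with a single row has no other rows (0 / 0 gives NaN),
# so fall back to the global mean of the target
loo = loo.fillna(df["Price"].mean())
print(loo.tolist())  # [10.0, 20.0, 18.0, 20.0, 10.0]
```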


## Embedding (Using Deep Learning Models)

In [None]:
# Example using TensorFlow/Keras for embedding

import tensorflow as tf
from tensorflow.keras.layers import Embedding, Input, Flatten
from tensorflow.keras.models import Model

# Assume 3 unique categories, each mapped to a 2-dimensional vector
# (input_length is omitted: it is deprecated in recent Keras versions)
input_layer = Input(shape=(1,))
embedding_layer = Embedding(input_dim=3, output_dim=2)(input_layer)
flattened = Flatten()(embedding_layer)
model = Model(inputs=input_layer, outputs=flattened)

# Example categorical data
import numpy as np
data = np.array([[0], [1], [2]])

# Get embeddings (randomly initialised here, since the model has not been trained)
embeddings = model.predict(data)
print("\nEmbeddings:\n", embeddings)
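
Conceptually, an embedding layer is a lookup table: one row of weights per category. The NumPy sketch below shows the lookup and its equivalence to multiplying one-hot vectors by the weight matrix; the random matrix stands in for weights that a real model would learn during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# One row per category (3 categories), one column per embedding dimension (2)
embedding_matrix = rng.normal(size=(3, 2))

category_ids = np.array([0, 2, 2, 1])

# Embedding lookup is plain row indexing ...
looked_up = embedding_matrix[category_ids]

# ... which equals one-hot vectors multiplied by the matrix
one_hot = np.eye(3)[category_ids]
assert np.allclose(looked_up, one_hot @ embedding_matrix)

print(looked_up.shape)  # (4, 2)
```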

## Combining Methods

In [24]:
# Example combining One-Hot and Frequency Encoding
df_combined = pd.get_dummies(df, columns=['Color'], prefix='Color')
df_combined['Color_Freq'] = df['Color'].map(df['Color'].value_counts())
print("\nCombined Encoding DataFrame:\n", df_combined)



Combined Encoding DataFrame:
    Color_Label  Color_Blue  Color_Green  Color_Red  Color_Freq
0            2       False        False       True           2
1            0        True        False      False           2
2            1       False         True      False           1
3            0        True        False      False           2
4            2       False        False       True           2
