# One-Hot Encoding in NLP 

One-Hot Encoding (OHE) is a widely used technique for converting text or categorical data into a numerical format that machine learning models can process.

## **What is One-Hot Encoding?**

One-Hot Encoding is a process of representing categorical data or text tokens as binary vectors.

- Each unique word or token in the vocabulary is assigned a unique index.
- A word is represented as a vector as long as the vocabulary, with a 1 at the word's index and 0 everywhere else.

**Example**:

For the words `['cat', 'dog', 'mouse']`, the one-hot encoding would look like this:

```
cat -> [1, 0, 0]
dog -> [0, 1, 0]
mouse -> [0, 0, 1]
```
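
The mapping above can be reproduced by hand. The following is a minimal sketch in plain Python, not a library API: the helper name `one_hot_encode` is hypothetical, and it assumes indices are assigned in first-seen order.

```python
def one_hot_encode(words):
    """Map each unique word to a one-hot list (hypothetical helper)."""
    # dict.fromkeys keeps the first occurrence of each word, in order
    vocab = list(dict.fromkeys(words))
    index = {word: i for i, word in enumerate(vocab)}
    # One vector per word: 1 at the word's index, 0 elsewhere
    return {
        word: [1 if i == index[word] else 0 for i in range(len(vocab))]
        for word in vocab
    }

encoding = one_hot_encode(["cat", "dog", "mouse"])
for word, vector in encoding.items():
    print(word, "->", vector)
```

Each vector has exactly as many entries as there are unique words, which is why the representation grows with the vocabulary.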


## **Dataset Example**

Let's create a small dataset to work with.

```python
import pandas as pd

data = {
    "Sentence": [
        "The cat sat on the mat",
        "Dogs are friendly animals",
        "The sun is bright today"
    ],
    "Label": ["Animal", "Animal", "Nature"]
}

df = pd.DataFrame(data)
df
```

|   | Sentence | Label |
|---|---|---|
| 0 | The cat sat on the mat | Animal |
| 1 | Dogs are friendly animals | Animal |
| 2 | The sun is bright today | Nature |


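```python
import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer data used by word_tokenize.
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')

# Tokenize the sentences
df['Tokens'] = df['Sentence'].apply(word_tokenize)
df
```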



|   | Sentence | Label | Tokens |
|---|---|---|---|
| 0 | The cat sat on the mat | Animal | [The, cat, sat, on, the, mat] |
| 1 | Dogs are friendly animals | Animal | [Dogs, are, friendly, animals] |
| 2 | The sun is bright today | Nature | [The, sun, is, bright, today] |


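```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# sparse_output=False returns a dense NumPy array instead of a SciPy sparse matrix
encoder = OneHotEncoder(sparse_output=False)

# OneHotEncoder expects 2D input: one row per sample, one column per feature
labels = df['Label'].values.reshape(-1, 1)
labels_encoded = encoder.fit_transform(labels)

print("Original Labels:", df['Label'].values)
print("One-Hot Encoded Labels:")
print(labels_encoded)
```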


```
Original Labels: ['Animal' 'Animal' 'Nature']
One-Hot Encoded Labels:
[[1. 0.]
 [1. 0.]
 [0. 1.]]
```
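
To see which column stands for which label, the fitted encoder can be inspected. The sketch below is illustrative and rebuilds the labels inline rather than reusing `df`. It assumes scikit-learn is installed and uses the encoder's default sparse output with `.toarray()`, which should behave the same across versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array(["Animal", "Animal", "Nature"]).reshape(-1, 1)

# The default output is a SciPy sparse matrix; .toarray() makes it dense
encoder = OneHotEncoder()
encoded = encoder.fit_transform(labels).toarray()

# categories_ holds one sorted array per input column;
# position i in that array corresponds to output column i
label_columns = encoder.categories_[0].tolist()

# inverse_transform maps each one-hot row back to its original label
decoded = encoder.inverse_transform(encoded).ravel().tolist()

print(label_columns)
print(decoded)
```

Because the categories are sorted, `Animal` takes column 0 and `Nature` takes column 1, which matches the matrix printed above.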


```python
df['Tokens']
```

```
0     [The, cat, sat, on, the, mat]
1    [Dogs, are, friendly, animals]
2     [The, sun, is, bright, today]
Name: Tokens, dtype: object
```

```python
# Flatten the per-sentence token lists into one list of tokens
Words = [token for sublist in df['Tokens'] for token in sublist]
Words
```

```
['The',
 'cat',
 'sat',
 'on',
 'the',
 'mat',
 'Dogs',
 'are',
 'friendly',
 'animals',
 'The',
 'sun',
 'is',
 'bright',
 'today']
```

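```python
# OneHotEncoder expects 2D input, so reshape to one token per row
Words = np.array(Words).reshape(-1, 1)
```

```python
# Refitting the encoder on the tokens replaces the categories learned from the labels
Words_encoded = encoder.fit_transform(Words)

print("One-Hot Encoded Sentence:")
print(Words_encoded)
```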


One-Hot Encoded Sentence:
[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
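
The matrix has 14 columns for 15 tokens because `The` appears twice, while `The` and `the` count as different categories. The plain-Python sketch below illustrates this. It assumes that sorting the unique tokens mirrors the encoder's column order, and the lowercasing step is an optional preprocessing choice that the cells above do not perform.

```python
tokens = [
    "The", "cat", "sat", "on", "the", "mat",
    "Dogs", "are", "friendly", "animals",
    "The", "sun", "is", "bright", "today",
]

# Sorted unique tokens: assumed to match the encoder's column order
vocab_cased = sorted(set(tokens))

# Lowercasing first merges "The" and "the" into a single category
vocab_lower = sorted(set(token.lower() for token in tokens))

print(len(vocab_cased), vocab_cased)
print(len(vocab_lower), vocab_lower)
print(vocab_cased.index("The"), vocab_cased.index("the"))
```

Uppercase letters sort before lowercase ones, so `Dogs` and `The` occupy the first two columns, which agrees with the rows printed above.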


## **Advantages**
1. **Simplicity**: 
   - One-Hot Encoding is easy to implement and understand.
2. **No Assumptions**: 
   - It imposes no ordering or distance between categories, unlike integer codes, which can suggest that one category is larger than another.
3. **Effective for Small Vocabularies**: 
   - Works well when the vocabulary size is small.
4. **Input for ML Models**: 
   - Provides a straightforward way to represent categorical data for machine learning models.
   
## **Disadvantages**
1. **High Dimensionality**: 
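   - Each vector is as long as the vocabulary and contains a single 1, so for large vocabularies the representation becomes extremely sparse and, if stored densely, consumes significant memory and computational resources.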
2. **Lack of Semantic Meaning**: 
   - Does not capture relationships or similarities between words: every pair of distinct one-hot vectors is equally far apart, so "king" is no closer to "queen" than it is to "apple".
3. **Scalability Issues**: 
   - Not practical for tasks with massive vocabularies (e.g., large corpora or multi-lingual datasets).
4. **Curse of Dimensionality**: 
   - High-dimensional feature spaces can lead to overfitting in machine learning models.
5. **No Context Awareness**: 
   - It treats words independently, ignoring the context in which they appear.
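
The lack of semantic meaning can be checked numerically. The sketch below is a small illustration in plain Python, with hypothetical helper names `one_hot` and `cosine` and a made-up three-word vocabulary. It computes the cosine similarity between one-hot vectors.

```python
import math

def one_hot(index, size):
    """Return a list of zeros with a single 1 at `index` (hypothetical helper)."""
    vector = [0] * size
    vector[index] = 1
    return vector

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

vocab = ["apple", "king", "queen"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

sim_king_queen = cosine(vectors["king"], vectors["queen"])
sim_king_apple = cosine(vectors["king"], vectors["apple"])
print(sim_king_queen, sim_king_apple)
```

Both similarities are zero because distinct one-hot vectors never share a non-zero position. Dense embeddings are designed to avoid exactly this.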

---

## **Conclusion**

In this tutorial, we learned:
- How to tokenize text data.
- How to apply one-hot encoding to words by running `OneHotEncoder` on the flattened token list.
- How to apply one-hot encoding to labels using `OneHotEncoder`.
- How to read the resulting matrices, in which each row is one sample and each column is one category.

One-Hot Encoding is simple but effective for basic NLP tasks. For more advanced tasks, consider using embeddings like Word2Vec, GloVe, or contextual embeddings like BERT.