## One-Hot Encoding in NLP

One-Hot Encoding is a vectorization technique that transforms categorical or textual data into a binary vector representation. Each unique word or token is assigned a vector with a value of 1 at the position representing that word and 0 elsewhere.

### ✅ Characteristics

| Feature           | Description                                  |
| ----------------- | -------------------------------------------- |
| Type              | Binary / Sparse Representation               |
| Use Case          | Text classification, preprocessing           |
| Context Awareness | ❌ No                                         |
| Dimensionality    | Equal to vocabulary size (can be very large) |
| Interpretability  | ✅ High (easy to understand)                  |


### 🧠 Theory

Assume a corpus with 3 tokens: `["apple", "banana", "orange"]`

The vocabulary is: `{"apple": 0, "banana": 1, "orange": 2}`

### One-hot vectors

| Token  | Vector    |
| ------ | --------- |
| apple  | [1, 0, 0] |
| banana | [0, 1, 0] |
| orange | [0, 0, 1] |

Each vector has a length equal to the vocabulary size, with a single 1 at the index assigned to the token and 0 everywhere else.
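
The mapping above can be reproduced without any library. The following is a minimal sketch: the helper names `build_vocab` and `one_hot` are illustrative and not part of any package, and it assumes the vocabulary is indexed in order of first appearance.

```python
def build_vocab(tokens):
    """Map each unique token to an index, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return len(vocab) zeros with a single 1 at the token's index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector


corpus = ["apple", "banana", "orange"]
vocab = build_vocab(corpus)
vectors = {token: one_hot(token, vocab) for token in corpus}

print(vocab)  # {'apple': 0, 'banana': 1, 'orange': 2}
for token, vector in vectors.items():
    print(token, vector)
```

Indexing by first appearance is only one possible convention; any fixed one-to-one assignment of tokens to positions yields a valid one-hot encoding.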



### 🔧 Example using Python (sklearn)
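```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# OneHotEncoder expects a 2D array of shape (n_samples, n_features),
# so reshape the token list into a single column.
tokens = np.array(["apple", "banana", "apple", "orange"]).reshape(-1, 1)

# sparse_output=False returns a dense NumPy array (requires scikit-learn >= 1.2).
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(tokens)

print("Vocabulary:", encoder.categories_)
print("Encoded Vectors:\n", encoded)
```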


Output:

```text
Vocabulary: [array(['apple', 'banana', 'orange'], dtype='<U6')]
Encoded Vectors:
 [[1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
```


### 📊 Advantages

| Advantage                           | Description                                       |
| ----------------------------------- | ------------------------------------------------- |
| ✅ Simple & interpretable            | Easy to implement and understand                  |
| ✅ Suitable for categorical features | Effective for ML models expecting numerical input |
| ✅ No ordering assumption            | Does not impose ordinal relationships             |
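
The "no ordering assumption" row can be made concrete by comparing one-hot vectors with plain integer labels. This is a minimal sketch that reuses the index assignment from the Theory section and assumes Euclidean distance as the measure of how far apart two encodings are; `euclidean` is a helper defined here for illustration.

```python
import math

labels = {"apple": 0, "banana": 1, "orange": 2}
one_hot = {
    "apple": [1, 0, 0],
    "banana": [0, 1, 0],
    "orange": [0, 0, 1],
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


pairs = [("apple", "banana"), ("apple", "orange"), ("banana", "orange")]

# Integer labels: distances differ, so a model may read an order into them.
label_dist = {p: abs(labels[p[0]] - labels[p[1]]) for p in pairs}

# One-hot vectors: every pair of distinct tokens is equally far apart.
one_hot_dist = {p: euclidean(one_hot[p[0]], one_hot[p[1]]) for p in pairs}

print(label_dist)
print(one_hot_dist)
```

With integer labels, "apple" sits twice as far from "orange" as from "banana", which is an artifact of the numbering. With one-hot vectors, every pairwise distance is the square root of 2.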


### ⚠️ Disadvantages

| Limitation                   | Description                                                                                          |
| ---------------------------- | ---------------------------------------------------------------------------------------------------- |
| ❌ High Dimensionality        | Vector length equals vocabulary size, which grows with the corpus                                    |
| ❌ Sparse Representation      | Memory-inefficient when stored densely; all but one entry is zero                                    |
| ❌ No Semantic Similarity     | Every pair of distinct tokens is equally dissimilar; contextual and semantic meaning is not captured |
| ❌ No Morphological Awareness | ‘run’, ‘runs’, and ‘running’ are treated as unrelated tokens                                         |
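
Several of these limitations can be observed on a toy vocabulary. The sketch below is illustrative only: the token list is made up for the example, and `one_hot` and `dot` are hypothetical helpers rather than library functions.

```python
tokens = ["run", "runs", "running", "walk"]
vocab = {token: index for index, token in enumerate(tokens)}


def one_hot(token):
    """One-hot vector for a token in the toy vocabulary above."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Morphological variants receive unrelated indices.
print(vocab["run"], vocab["runs"], vocab["running"])  # 0 1 2

# Distinct tokens are orthogonal: "run" is no closer to "runs" than to "walk".
print(dot(one_hot("run"), one_hot("runs")))  # 0
print(dot(one_hot("run"), one_hot("walk")))  # 0

# Only one entry per vector is non-zero, so density is 1 / vocabulary size.
density = sum(one_hot("run")) / len(vocab)
print(density)  # 0.25
```

As the vocabulary grows, the density of each vector shrinks toward zero, which is why dense storage becomes wasteful.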
