# One-Hot Encoding for Text to Vector Conversion in NLP

One-Hot Encoding is a method used to convert categorical data into a numerical format that machine learning algorithms can understand. In the context of Natural Language Processing (NLP), we often need to convert words or phrases into numerical vectors.

## What is One-Hot Encoding?

In One-Hot Encoding, each word in the vocabulary is assigned a unique index and represented as a binary vector. The length of the vector equals the size of the vocabulary: the position at the word's index holds a `1`, and every other position holds a `0`.

### Example Vocabulary

Let's consider a simple vocabulary:
- Vocabulary: `["apple", "banana", "orange"]`

The One-Hot Encodings for these words would be:
- `apple`: `[1, 0, 0]`
- `banana`: `[0, 1, 0]`
- `orange`: `[0, 0, 1]`
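
One way to read the pattern above: the one-hot vectors for a vocabulary of size 3 are the rows of the 3×3 identity matrix. The sketch below assumes NumPy is installed and uses the same word order as the list above; the names `identity` and `encodings` are illustrative only.

```python
import numpy as np

vocab = ["apple", "banana", "orange"]

# Row i of the identity matrix is the one-hot vector for vocab[i]
identity = np.eye(len(vocab), dtype=int)
encodings = {word: identity[i] for i, word in enumerate(vocab)}

for word, vector in encodings.items():
    print(word, vector.tolist())
# apple [1, 0, 0]
# banana [0, 1, 0]
# orange [0, 0, 1]
```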

## Implementing One-Hot Encoding in Python

We'll implement One-Hot Encoding using a small example.

### Step 1: Create a Sample Text

```python
# Sample text
text = "apple banana apple orange"
```

### Step 2: Build the Vocabulary

```python
# Create a vocabulary of unique words from the text
words = text.split()
vocab = set(words)
vocab
```

Output:

```text
{'apple', 'banana', 'orange'}
```

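### Step 3: Define the Encoding Function

```python
import numpy as np

def one_hot_encode(word, vocab):
    """Return the one-hot vector for `word` over the sorted vocabulary."""
    # Sort the vocabulary so every word gets a fixed, reproducible position
    vocab_list = sorted(vocab)
    # Initialize a zero vector of length equal to the vocabulary size
    one_hot_vector = np.zeros(len(vocab_list))
    # Set the position of the word to 1
    # (list.index raises ValueError if the word is not in the vocabulary)
    index = vocab_list.index(word)
    one_hot_vector[index] = 1
    return one_hot_vector
```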


### Step 4: Encode Each Word

```python
# Encode each word in the sample text (the repeated "apple" maps to a single key)
one_hot_encodings = {word: one_hot_encode(word, vocab) for word in words}
one_hot_encodings
```

Output:

```text
{'apple': array([1., 0., 0.]),
 'banana': array([0., 1., 0.]),
 'orange': array([0., 0., 1.])}
```

### Step 5: Display the Encodings

```python
# Display the One-Hot Encodings
for word, encoding in one_hot_encodings.items():
    print(f"{word}: {encoding}")
```

Output:

```text
apple: [1. 0. 0.]
banana: [0. 1. 0.]
orange: [0. 0. 1.]
```
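
`one_hot_encode` sorts the vocabulary every time it is called, which is fine for three words but wasteful for larger vocabularies. As a sketch of one alternative (not part of the walkthrough above), the word-to-index mapping can be built once and the whole text encoded as a matrix with one row per token. The names `word_to_index` and `one_hot_matrix` are illustrative.

```python
import numpy as np

text = "apple banana apple orange"
words = text.split()

# Build the word -> position mapping once, in sorted order
vocab_list = sorted(set(words))
word_to_index = {word: i for i, word in enumerate(vocab_list)}

# Look up each token's index, then select the matching identity-matrix rows
indices = [word_to_index[word] for word in words]
one_hot_matrix = np.eye(len(vocab_list))[indices]

print(one_hot_matrix.shape)  # (4, 3): 4 tokens, vocabulary of 3
print(one_hot_matrix)
```

Each row is the one-hot vector of the token at that position, so the first and third rows (both "apple") are identical.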


## Disadvantages of One-Hot Encoding

One-Hot Encoding is simple to implement, but it comes with several drawbacks:

- High dimensionality: each vector is as long as the vocabulary, so vector size grows linearly with the number of distinct words. Almost every entry is `0`, and computing on such sparse data in dense form is inefficient.
- No semantic meaning: One-Hot Encoding does not capture relationships between words. Every pair of distinct words is equally dissimilar, so "apple" and "banana" are treated as completely unrelated even though both are fruits.
- Inability to handle unseen words: a word that was not present when the vocabulary was built (for example, one that was not in the training set) has no position in the vector, so it cannot be encoded without updating the vocabulary.
- Memory inefficiency: stored densely, the encoding of a text needs one entry per token for every word in the vocabulary, which leads to high memory usage for large vocabularies.
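
Two of these drawbacks are easy to check numerically. The sketch below is a self-contained illustration rather than a recommended implementation: it shows that any two different one-hot vectors have a dot product of 0, and it shows one common workaround for unseen words, a reserved `<UNK>` slot. The `<UNK>` token, the word "car", and the helper `encode_with_unk` are assumptions introduced here, not part of the example above.

```python
import numpy as np

# Hypothetical vocabulary with a reserved slot for unknown words
vocab_list = ["<UNK>", "apple", "banana", "car"]
word_to_index = {word: i for i, word in enumerate(vocab_list)}

def encode_with_unk(word):
    """Return the one-hot vector for word, falling back to the <UNK> slot."""
    vector = np.zeros(len(vocab_list))
    vector[word_to_index.get(word, word_to_index["<UNK>"])] = 1
    return vector

apple, banana, car = (encode_with_unk(w) for w in ["apple", "banana", "car"])

# No semantic meaning: every pair of distinct words has a dot product of 0
print(np.dot(apple, banana))  # 0.0
print(np.dot(apple, car))     # 0.0

# Unseen word: mapped to the <UNK> slot instead of raising an error
print(encode_with_unk("kiwi"))  # [1. 0. 0. 0.]
```

The `<UNK>` slot avoids the error but loses information, since every unseen word receives the same vector.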

The growth in vector length is already visible with a slightly larger vocabulary, where each vector has nine entries:

```python
# Larger vocabulary example
# Note: one_hot_encode sorts the vocabulary, so positions follow
# alphabetical order rather than the order of this list
large_vocab = ["apple", "banana", "orange", "grape", "kiwi", "mango", "pineapple", "strawberry", "blueberry"]
text_large = "apple kiwi grape banana"
words_large = text_large.split()

one_hot_encodings_large = {word: one_hot_encode(word, large_vocab) for word in words_large}
for word, encoding in one_hot_encodings_large.items():
    print(f"{word}: {encoding}")
```

Output:

```text
apple: [1. 0. 0. 0. 0. 0. 0. 0. 0.]
kiwi: [0. 0. 0. 0. 1. 0. 0. 0. 0.]
grape: [0. 0. 0. 1. 0. 0. 0. 0. 0.]
banana: [0. 1. 0. 0. 0. 0. 0. 0. 0.]
```
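
To put rough numbers on the dimensionality and memory points, the sketch below compares a dense one-hot matrix with the plain integer indices that carry the same information. The vocabulary size and token count are made-up values chosen for illustration, and random indices stand in for a real tokenized text.

```python
import numpy as np

vocab_size = 10_000  # hypothetical vocabulary size
num_tokens = 100     # hypothetical number of tokens in a text

# Random word indices stand in for a real tokenized text
rng = np.random.default_rng(0)
indices = rng.integers(0, vocab_size, size=num_tokens, dtype=np.int64)

# Dense one-hot matrix: one row per token, one column per vocabulary word
dense = np.zeros((num_tokens, vocab_size), dtype=np.float64)
dense[np.arange(num_tokens), indices] = 1

print(dense.nbytes)      # 8000000 bytes (100 x 10,000 x 8)
print(indices.nbytes)    # 800 bytes (100 x 8)
print(int(dense.sum()))  # 100: exactly one nonzero entry per row
```

Under these assumptions the dense matrix is 10,000 times larger than the index array, which is why sparse formats or index-based lookups are often preferred at scale.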
