# One-Hot Encoding:
One-hot encoding is the process of turning categorical variables into a numerical format that machine learning algorithms can readily process. It represents each category in a feature as a binary vector that contains a single 1 and 0s everywhere else, with the vector’s length equal to the number of possible categories.

For example, if we have a feature with three categories (A, B, and C), 
each category can be represented as a binary vector of length three, 
with the vector for category A being [1, 0, 0], 
the vector for category B being [0, 1, 0], and the vector for category C being [0, 0, 1].
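
As a minimal sketch of this mapping, the three vectors can be built by placing a single 1 at each category's index. The helper name `one_hot` is an assumption made for illustration and does not come from any library.

```python
# Hypothetical helper for illustration: build a one-hot vector as a plain list.
def one_hot(category, categories):
    """Return a list with a 1 at the category's position and 0 elsewhere."""
    index = categories.index(category)
    return [1 if i == index else 0 for i in range(len(categories))]

categories = ["A", "B", "C"]
for category in categories:
    print(category, one_hot(category, categories))
# A [1, 0, 0]
# B [0, 1, 0]
# C [0, 0, 1]
```
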
# Why One-Hot Encoding is Used in NLP:
In NLP, one-hot encoding is used to represent categorical items, such as words or part-of-speech tags, as binary vectors.
This approach is helpful because machine learning algorithms generally operate on numerical data, so representing text as numerical vectors is required for these algorithms to work.
In a sentiment analysis task, for example, we might represent each word in a sentence as a one-hot encoded vector and then use these vectors as input to a neural network that predicts the sentiment of the sentence.
Example 1:
Suppose we have a small corpus of text that contains three sentences:

The quick brown fox jumped over the lazy dog.

She sells seashells by the seashore.

Peter Piper picked a peck of pickled peppers.

Each word in these sentences should be represented as a one-hot vector. The first step is to identify the categorical variable, which here is the set of words in the sentences. The second step is to count the distinct words to determine the number of categories. Ignoring case and punctuation, there are 21 distinct words (“the” occurs three times but counts once), so there are 21 categories.

The third step is to create a binary vector for each category. Because there are 21 categories, each binary vector has 21 elements, exactly one of which is 1. If the unique words are listed in order of first appearance (the, quick, brown, and so on), the vector for “quick” is [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], with the 1 in the second position because “quick” is the second word in the list of unique words.

Finally, we use the binary vectors generated in step 3 to represent each word in the sentences. Every occurrence of a word gets the same vector. For example, “quick” in the first sentence is encoded as the vector shown above, and “seashells” in the second sentence, the eleventh unique word, is encoded as [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

# Python Implementation for One-Hot Encoding in NLP
Now let’s implement the above example in Python. The encoding has to be done programmatically before it can be used to train NLP models.

In [1]:
import numpy as np

# Define the corpus of text
corpus = [
	"The quick brown fox jumped over the lazy dog.",
	"She sells seashells by the seashore.",
	"Peter Piper picked a peck of pickled peppers."
]

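# Create a set of unique words in the corpus
# (tokens keep their punctuation, e.g. "dog.", and set order is arbitrary,
# so the index assigned to each word can differ between runs)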
unique_words = set()
for sentence in corpus:
	for word in sentence.split():
		unique_words.add(word.lower())

# Create a dictionary to map each
# unique word to an index
word_to_index = {}
for i, word in enumerate(unique_words):
	word_to_index[word] = i

# Create one-hot encoded vectors for
# each word in the corpus
one_hot_vectors = []
for sentence in corpus:
	sentence_vectors = []
	for word in sentence.split():
		vector = np.zeros(len(unique_words))
		vector[word_to_index[word.lower()]] = 1
		sentence_vectors.append(vector)
	one_hot_vectors.append(sentence_vectors)

# Print the one-hot encoded vectors 
# for the first sentence
print("One-hot encoded vectors for the first sentence:")
for vector in one_hot_vectors[0]:
	print(vector)


One-hot encoded vectors for the first sentence:
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
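
Iterating over a Python `set` of strings does not guarantee a stable order, so the position of the 1 in each row above can change from one run to the next. The sketch below shows one possible way to get reproducible indices. It assumes that indices are assigned in order of first appearance and that surrounding punctuation is stripped; both are choices made here for illustration and are not part of the code above. The helper names `tokenize`, `encode`, and `decode` are likewise hypothetical.

```python
import string

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "She sells seashells by the seashore.",
    "Peter Piper picked a peck of pickled peppers.",
]

# Assumption: lowercase each token and strip surrounding punctuation.
def tokenize(sentence):
    return [word.strip(string.punctuation).lower() for word in sentence.split()]

# Assign indices in order of first appearance (dicts preserve insertion order).
word_to_index = {}
for sentence in corpus:
    for word in tokenize(sentence):
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)

index_to_word = {index: word for word, index in word_to_index.items()}

def encode(word):
    """Return the one-hot vector (as a list) for a known word."""
    vector = [0] * len(word_to_index)
    vector[word_to_index[word]] = 1
    return vector

def decode(vector):
    """Return the word whose index holds the single 1."""
    return index_to_word[vector.index(1)]

print(len(word_to_index))           # 21
print(word_to_index["quick"])       # 1
print(decode(encode("seashells")))  # seashells
```

Storing the reverse mapping `index_to_word` makes it possible to turn a vector back into its word, which is useful when inspecting model inputs.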


# Using sklearn

In [None]:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

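# Sample text as a list of words
# (here the first sentence of `corpus` defined above; any string works)
text = corpus[0]
words = text.split()

# Convert words to a 2D array (one word per row), as OneHotEncoder expects
word_column = np.array(words).reshape(-1, 1)

# Apply One-Hot Encoding
# (`sparse_output` requires scikit-learn >= 1.2; older versions use `sparse`)
encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = encoder.fit_transform(word_column)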

# Display One-Hot Encoded result
print(one_hot_encoded)


In [4]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence 
concerned with the interactions between computers and human language. As such, NLP is related to the area of 
human-computer interaction. Many challenges in NLP involve understanding natural language to derive meaning 
and information from it."""


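# Split on whitespace; punctuation stays attached to the tokens
words = text.split()
# One token per row, including repeats (OneHotEncoder expects a 2D array)
word_column = np.array(words).reshape(-1, 1)
print(word_column)

# sparse_output=False returns a dense NumPy array (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = encoder.fit_transform(word_column)

print(one_hot_encoded)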

[['Natural']
 ['language']
 ['processing']
 ['(NLP)']
 ['is']
 ['a']
 ['subfield']
 ['of']
 ['linguistics,']
 ['computer']
 ['science,']
 ['and']
 ['artificial']
 ['intelligence']
 ['concerned']
 ['with']
 ['the']
 ['interactions']
 ['between']
 ['computers']
 ['and']
 ['human']
 ['language.']
 ['As']
 ['such,']
 ['NLP']
 ['is']
 ['related']
 ['to']
 ['the']
 ['area']
 ['of']
 ['human-computer']
 ['interaction.']
 ['Many']
 ['challenges']
 ['in']
 ['NLP']
 ['involve']
 ['understanding']
 ['natural']
 ['language']
 ['to']
 ['derive']
 ['meaning']
 ['and']
 ['information']
 ['from']
 ['it.']]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
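
The token list printed above treats `language` and `language.`, or `Natural` and `natural`, as different categories, because `text.split()` only splits on whitespace. One possible way to reduce such duplicates before encoding is sketched below, under the assumption that lowercasing and stripping surrounding punctuation is acceptable for the task at hand. The variable names are illustrative.

```python
import string

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence
concerned with the interactions between computers and human language. As such, NLP is related to the area of
human-computer interaction. Many challenges in NLP involve understanding natural language to derive meaning
and information from it."""

raw_tokens = text.split()

# Assumption: case and surrounding punctuation carry no meaning for this task.
normalized_tokens = [token.strip(string.punctuation).lower() for token in raw_tokens]

raw_vocab = sorted(set(raw_tokens))
normalized_vocab = sorted(set(normalized_tokens))

# The normalized vocabulary is smaller, so each one-hot vector is shorter.
print(len(raw_vocab), len(normalized_vocab))
print("language." in raw_vocab, "language." in normalized_vocab)
```

The normalized tokens can then be reshaped into a single column and passed to `OneHotEncoder` in the same way as the raw tokens.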
