
# One-Hot Encoding in NLP
One-hot encoding is a technique used in Natural Language Processing (NLP) to represent words or characters as binary vectors. Each unique word in the vocabulary is assigned a vector whose length equals the vocabulary size, with a 1 in exactly one position and 0 everywhere else.
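
As a quick illustration of this definition, here is a minimal sketch using a hypothetical three-word vocabulary (`toy_vocab` is an assumption made for illustration, not the corpus used in the steps that follow). Row `i` of an identity matrix serves as the one-hot vector for the word at index `i`.

```python
import numpy as np

# Hypothetical toy vocabulary, used only for this illustration
toy_vocab = ["cat", "dog", "fish"]

# Row i of the identity matrix has a single 1 at position i
toy_vectors = np.eye(len(toy_vocab), dtype=int)

for word, vector in zip(toy_vocab, toy_vectors):
    print(word, vector)
```

Each vector has as many positions as there are words, so the vectors grow with the vocabulary.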

## Steps to Implement One-Hot Encoding
* Create a sample text corpus
* Tokenize the sentences into words
* Build a vocabulary of unique words
* Map each word to an integer index
* Generate a one-hot encoded vector for each word


## Step 1: Import Libraries

In [1]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

## Step 2: Define a Sample Corpus

In [2]:
corpus = ["I love NLP", "NLP is amazing", "I love deep learning"]

## Step 3: Tokenize the Corpus and Create Vocabulary

In [3]:
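# Collect the unique whitespace-separated tokens across all sentences.
# A set has no guaranteed order, so the vocabulary order may differ between runs.
words = set(word for sentence in corpus for word in sentence.split())
vocab = list(words)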

print("Vocabulary:", vocab)

Vocabulary: ['I', 'learning', 'love', 'NLP', 'amazing', 'is', 'deep']


Total vocabulary size = 7 unique words.
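
Because a Python `set` has no guaranteed iteration order, the printed vocabulary order can differ from one run to the next. Below is a minimal sketch of one way to get a reproducible order, assuming alphabetical sorting is acceptable for your use case (`sorted_vocab` is a name introduced here for illustration).

```python
corpus = ["I love NLP", "NLP is amazing", "I love deep learning"]

# Collect unique whitespace-separated tokens, then sort them for a stable order
sorted_vocab = sorted({word for sentence in corpus for word in sentence.split()})

print("Sorted vocabulary:", sorted_vocab)
print("Vocabulary size:", len(sorted_vocab))
```

Uppercase letters sort before lowercase ones, so `I` and `NLP` come first. The remaining steps keep using the unsorted `vocab` defined above.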

## Step 4: Create Word-to-Index Mapping

In [4]:
# Assign an index to each word
word_to_index = {word: idx for idx, word in enumerate(vocab)}

print("Word Index Mapping:", word_to_index)


Word Index Mapping: {'I': 0, 'learning': 1, 'love': 2, 'NLP': 3, 'amazing': 4, 'is': 5, 'deep': 6}


📌 Explanation:

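* We create a dictionary that maps each word to a unique integer index, from 0 to 6. This index is the position that will hold the 1 in the word's one-hot vector.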
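
The reverse lookup is often useful as well, for example to turn an index back into a word. Here is a minimal sketch, with the mapping printed above copied in as a literal so the block runs on its own (`index_to_word` is a name introduced here for illustration).

```python
# Mapping copied from the output above so this block is self-contained
word_to_index = {'I': 0, 'learning': 1, 'love': 2, 'NLP': 3,
                 'amazing': 4, 'is': 5, 'deep': 6}

# Invert the dictionary: index -> word
index_to_word = {idx: word for word, idx in word_to_index.items()}

print(word_to_index["NLP"])  # index assigned to a word
print(index_to_word[3])      # word stored at an index
```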

## Step 5: Perform One-Hot Encoding

In [6]:
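# sparse_output=False returns a dense NumPy array (parameter available in scikit-learn 1.2+)
encoder = OneHotEncoder(sparse_output=False)
# OneHotEncoder expects a 2D array: one row per sample, one column per feature
integer_encoded = np.array([[word_to_index[word]] for word in vocab])
one_hot_encoded = encoder.fit_transform(integer_encoded)

# Print each word next to its one-hot vector
for word, vector in zip(vocab, one_hot_encoded):
    print(f"{word}: {vector}")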

I: [1. 0. 0. 0. 0. 0. 0.]
learning: [0. 1. 0. 0. 0. 0. 0.]
love: [0. 0. 1. 0. 0. 0. 0.]
NLP: [0. 0. 0. 1. 0. 0. 0.]
amazing: [0. 0. 0. 0. 1. 0. 0.]
is: [0. 0. 0. 0. 0. 1. 0.]
deep: [0. 0. 0. 0. 0. 0. 1.]
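
The same mapping can encode a whole sentence as a matrix with one row per word. The sketch below uses NumPy indexing rather than `OneHotEncoder`, and repeats the mapping from Step 4 as a literal so it is self-contained. `encode_sentence` is a hypothetical helper, and it assumes every word in the sentence is already in the vocabulary.

```python
import numpy as np

# Mapping copied from Step 4 so this block runs on its own
word_to_index = {'I': 0, 'learning': 1, 'love': 2, 'NLP': 3,
                 'amazing': 4, 'is': 5, 'deep': 6}

def encode_sentence(sentence, word_to_index):
    """Return a (num_words, vocab_size) matrix of one-hot rows."""
    # Raises KeyError if a word is not in the vocabulary
    indices = [word_to_index[word] for word in sentence.split()]
    # Selecting rows of the identity matrix yields the one-hot vectors
    return np.eye(len(word_to_index))[indices]

matrix = encode_sentence("I love NLP", word_to_index)
print(matrix.shape)
print(matrix)
```

Each row contains a single 1, placed at the column given by the word's index.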


## Summary

| Step | Action |
|------|--------|
| 1️⃣ | Import the libraries |
| 2️⃣ | Define a sample corpus |
| 3️⃣ | Tokenize the sentences and create a vocabulary of unique words |
| 4️⃣ | Assign an index to each word |
| 5️⃣ | Convert words into one-hot encoded vectors |