<span style="font-size:16px; font-weight:bold">Welcome to Natural Language Processing (NLP) in Python</span><br/>

Presented by: Reza Saadatyar (2024-2025)<br/>
E-mail: Reza.Saadatyar@outlook.com<br/>

<span style="font-size: 16px;font-weight:bold"> One-hot Encoding:</span><br/>
One-hot encoding converts each categorical label (in NLP, typically a token) into a binary vector whose length equals the number of unique categories. Exactly one element is "1" (hot), at the index assigned to that label, and all others are "0".

**Steps:**<br/>
▪ `Lower-case:` Convert all text to lowercase for uniformity<br/>
▪ `Ascending Sorting (A-Z):` Sort unique labels or tokens alphabetically<br/>
▪ `Assign an Index:` Give each unique label or token an integer index, following the sorted order<br/>
▪ `Transforming to a Binary Vector:` Convert each label into a binary vector with a single "1"<br/>
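
The four steps above can be sketched in plain Python. This is a minimal illustration, not part of the original notebook; the `labels` list and the `one_hot` helper are hypothetical names chosen for the example.

```python
# Hypothetical category labels (assumed for illustration only).
labels = ["Red", "green", "Blue", "green"]

# Step 1: lower-case the labels so "Red" and "red" count as one category.
lowered = [label.lower() for label in labels]

# Step 2: sort the unique labels alphabetically.
categories = sorted(set(lowered))

# Step 3: assign each category its position in the sorted list.
index_of = {category: i for i, category in enumerate(categories)}


def one_hot(label):
    """Return a binary list with a single 1 at the label's index."""
    vector = [0] * len(categories)
    vector[index_of[label]] = 1
    return vector


# Step 4: encode every label, keeping the original order.
encoded = [one_hot(label) for label in lowered]

print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each vector has as many elements as there are unique categories, so repeated labels (here `green`) map to identical vectors.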

**Implementation:**<br/>
1️⃣ `Using NumPy:` Manual one-hot encoding with arrays<br/>
2️⃣ `Using Scikit-learn:` Using `OneHotEncoder` for categorical data<br/>
3️⃣ `Using TensorFlow:` Integration with neural network pipelines<br/>
4️⃣ `Using PyTorch:` Manual and tensor-based encoding<br/>

<span style="font-size:16.5px; color:rgb(245, 5, 5); font-weight:bold;">Importing libraries</span>

In [3]:
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

<span style="font-size: 16.5px; font-weight: bold; color:rgb(240, 185, 7)">1️⃣ NumPy</span><br>

**Steps:**<br/>
1. Convert text to lowercase<br/>
2. Tokenize the text<br/>
3. Get the unique words<br/>
4. Sort the word list<br/>
5. Map each word to its integer index (its position in the sorted list)<br/>
6. Create a vector for each word by setting the element at its index to 1 and all others to 0<br/>
7. Stack the vectors into a matrix, one row per word<br/>

In [4]:
txt = "I am learning Natural Language Processing."

# Step 1: Convert text to lower case
txt = txt.lower()
print(f"Step 1: {txt}")

# Step 2: Tokenize the text (whitespace split, so punctuation stays attached: 'processing.')
words = txt.split()
print(f"Step 2: {words}")

# Step 3: Get unique words
unique_words = list(set(words))
print(f"Step 3: {unique_words}")

# Step 4: Sort the word list
sorted_words = np.sort(unique_words)
print(f"Step 4: {sorted_words}")

# Step 5: Get the integer/position of the words
word_to_index = {str(word): index for index, word in enumerate(sorted_words)}
print(f"Step 5: {word_to_index}")


# Step 6: Create a vector of each word by marking its position as 1 and rest as 0
vectors = []
for word in words:
    vector = np.zeros(len(unique_words))
    vector[word_to_index[word]] = 1
    vectors.append(vector)

# Step 7: Create a matrix of the found vectors
matrix = np.array(vectors)

print("One-hot encoding matrix:\n", matrix)

Step 1: i am learning natural language processing.
Step 2: ['i', 'am', 'learning', 'natural', 'language', 'processing.']
Step 3: ['language', 'i', 'learning', 'processing.', 'am', 'natural']
Step 4: ['am' 'i' 'language' 'learning' 'natural' 'processing.']
Step 5: {'am': 0, 'i': 1, 'language': 2, 'learning': 3, 'natural': 4, 'processing.': 5}
One-hot encoding matrix:
 [[0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]]
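
In the output above, the last token is `processing.` because `str.split()` separates on whitespace only and leaves punctuation attached. The sketch below is one possible variation, not the notebook's method: it assumes a simple regular expression is an acceptable tokenizer for this sentence, and it replaces the explicit loop with indexing into an identity matrix (`np.eye`).

```python
import re

import numpy as np

txt = "I am learning Natural Language Processing."

# Assumed tokenizer: keep runs of letters, digits and apostrophes, drop punctuation.
words = re.findall(r"[a-z0-9']+", txt.lower())

# Sorted vocabulary and word -> index mapping, as in the steps above.
vocab = sorted(set(words))
word_to_index = {word: index for index, word in enumerate(vocab)}

# Row i of the identity matrix is the one-hot vector for index i,
# so indexing with the list of indices builds the whole matrix at once.
indices = [word_to_index[word] for word in words]
matrix = np.eye(len(vocab))[indices]

print(words)
print(matrix)
```

With the punctuation removed, the vocabulary entry becomes `processing`, so the same word matches whether or not it ends a sentence.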


<span style="font-size: 16.5px; font-weight: bold; color:rgb(7, 213, 240)">2️⃣ Scikit-learn</span><br>

**Steps:**<br/>
1. Convert text to lowercase<br/>
2. Tokenize the text<br/>
3. Use LabelEncoder to assign an integer to each unique word<br/>
▪  Create a LabelEncoder instance<br/>
▪  Fit the LabelEncoder on the list of words to learn the unique words and assign each a unique integer<br/>
▪  Transform the list of words into their corresponding integer values (positions)<br/>
4. Use OneHotEncoder to create one-hot vectors for each word<br/>
▪  Create a OneHotEncoder instance<br/>
▪  Reshape the integer labels for compatibility<br/>
▪  Fit and transform the integer labels to one-hot vectors<br/>
5. The returned 2-D array already stacks the vectors into a matrix, one row per word<br/>

In [5]:
txt = "I am learning Natural Language Processing."

# Step 1: Convert text to lower case
txt = txt.lower()
print(f"Step 1: {txt}")

# Step 2: Tokenize the text
words = txt.split()
print(f"Step 2: {words}")

# Step 3: Assign each word an integer index with LabelEncoder (classes_ is sorted alphabetically)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(words)
word_mapping = {str(word): index for index, word in enumerate(label_encoder.classes_)} # Mapping of words to indices
print("Word to index mapping:", word_mapping)
print("Step 3:", integer_encoded)

# Step 4: One-hot encode the integer labels with OneHotEncoder
onehot_encoder = OneHotEncoder(sparse_output=False)  # dense output; sparse_output needs scikit-learn >= 1.2
integer_encoded = integer_encoded.reshape(-1, 1)  # column vector: one sample per row
one_hot_encoded = onehot_encoder.fit_transform(integer_encoded)
print("Step 4:\n", one_hot_encoded)

Step 1: i am learning natural language processing.
Step 2: ['i', 'am', 'learning', 'natural', 'language', 'processing.']
Word to index mapping: {'am': 0, 'i': 1, 'language': 2, 'learning': 3, 'natural': 4, 'processing.': 5}
Step 3: [1 0 3 4 2 5]
Step 4:
 [[0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]]


<span style="font-size:16.5px; color:#45ff54; font-weight:bold">3️⃣ TensorFlow (Keras)</span><br/>

**Steps:**<br/>
1. Convert text to lowercase<br/>
2. Tokenize the text (split into words)<br/>
3. Assign a unique integer to each unique word<br/>
4. Integer encode the sequence of words<br/>
5. Use Keras's `to_categorical` to one-hot encode the integer sequence<br/>
6. The returned array is the matrix of one-hot vectors, one row per word<br/>

In [18]:
txt = "I am learning Natural Language Processing."

# Step 1: Convert text to lower case
txt = txt.lower()
print(f"Step 1: {txt}")

# Step 2: Tokenize the text
words = txt.split()
print(f"Step 2: {words}")

# Step 3: Assign a unique integer to each word
unique_words = sorted(set(words))
word_to_index = {word: idx for idx, word in enumerate(unique_words)}
print("Step 3:", word_to_index)

# Step 4: Integer encode the sequence
sequences = [word_to_index[word] for word in words]
print("Step 4:", sequences)

# Step 5: One-hot encode using to_categorical
vocab_size = len(unique_words)
one_hot_encoded = to_categorical(sequences, num_classes=vocab_size)
print("Step 5:\n", one_hot_encoded)

Step 1: i am learning natural language processing.
Step 2: ['i', 'am', 'learning', 'natural', 'language', 'processing.']
Step 3: {'am': 0, 'i': 1, 'language': 2, 'learning': 3, 'natural': 4, 'processing.': 5}
Step 4: [1, 0, 3, 4, 2, 5]
Step 5:
 [[0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]]
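
A one-hot matrix can be decoded back to words by taking the position of the 1 in each row. The sketch below uses only NumPy; `one_hot_encoded` is rebuilt here with `np.eye` as a stand-in for the `to_categorical` output shown above, and `index_to_word` is a helper name assumed for the example.

```python
import numpy as np

words = "i am learning natural language processing.".split()
unique_words = sorted(set(words))
word_to_index = {word: idx for idx, word in enumerate(unique_words)}
sequences = [word_to_index[word] for word in words]

# Stand-in for to_categorical(sequences, num_classes=len(unique_words)).
one_hot_encoded = np.eye(len(unique_words), dtype="float32")[sequences]

# argmax along each row recovers the integer index of the "hot" element.
decoded_indices = np.argmax(one_hot_encoded, axis=1)

# Invert the mapping to turn indices back into words.
index_to_word = {idx: word for word, idx in word_to_index.items()}
decoded_words = [index_to_word[int(idx)] for idx in decoded_indices]

print(decoded_indices)
print(decoded_words)
```

The same `argmax` step is commonly applied to a model's per-class output scores to recover a predicted index.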
