In [1]:
import pandas as pd 

# Why RNN, LSTM, GRU, and Transformers?

## **Recurrent Neural Networks (RNN)**
RNNs were developed to handle sequential data by maintaining a "memory" of previous inputs in the sequence. They are good for tasks like time series prediction or text generation where past information influences future predictions.

**However, RNNs have limitations:**

**Vanishing Gradient Problem:** During backpropagation, gradients can get very small, causing learning to stop as you go further back in the sequence. This makes training RNNs on long sequences very hard.

## **Long Short-Term Memory (LSTM)**
LSTM was introduced to address the vanishing gradient problem in RNNs. It uses a special memory cell structure that allows the network to "remember" information for long periods and is much more effective for longer sequences.

**Problem:**

While LSTMs mitigate the vanishing gradient problem, they still have some issues with long-range dependencies and are computationally expensive.

## **Gated Recurrent Unit (GRU)**
GRU is a simplified version of LSTM. It combines the forget and input gates into one, which makes it faster to train and requires fewer parameters.

**Problem:**

GRUs might not handle very long sequences as well as LSTMs due to fewer gates.

## **Transformers**
Transformers solve the problem of long-range dependencies. Instead of relying on sequential processing, transformers use self-attention mechanisms that allow the model to weigh all parts of the sequence at once. This makes transformers faster and better at handling very long sequences. They are currently the state-of-the-art for most sequence modeling tasks, including NLP tasks like machine translation.

**Problem:**

Transformers can be very computationally expensive because of their attention mechanism, which needs to evaluate pairwise relations between every token in the sequence.

**Code Setup**
We’ll use the sklearn.datasets and work with the 20 Newsgroups dataset, which is often used for text classification. We'll preprocess it into sequences suitable for RNN, LSTM, GRU, and Transformer models.

We’ll split the dataset into three sets: Train, Validation (Eval), and Test.

## 1. Data Preprocessing
We'll first load the dataset, vectorize it using TfidfVectorizer, and then pad the sequences.

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np 
import pandas as pd 

2025-04-30 16:30:33.078881: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746012633.090975 3817788 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746012633.094570 3817788 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-30 16:30:33.106840: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
# Load Dataset 
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
X_raw = data.data
y = data.target
num_classes = np.unique(y).shape[0]
y_cat = to_categorical(y, num_classes)


X_train_raw, X_temp, y_train, y_temp = train_test_split(X_raw, y_cat, test_size=0.4, random_state=42)
X_val_raw, X_test_raw, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# TF-IDF (Term Frequency-Inverse Document Frequency)

## What is TF-IDF?

**TF-IDF** is a statistical measure used in text analysis and natural language processing (NLP) to evaluate how important a word is in a document relative to a collection of documents (also called a corpus). It combines two components:

### 1. Term Frequency (TF):
This measures how frequently a term (word) appears in a document. It helps indicate the relative importance of a term within the document.

**Formula:**
TF(t)= Number of times word "t" appears / Total number of words in the document



### 2. Inverse Document Frequency (IDF):
This measures how important a term is across the entire corpus (collection of documents). Words that appear in many documents have lower IDF values, while words that are unique to fewer documents have higher IDF values. This helps reduce the importance of words that are too common (like "the," "and," etc.).

**Formula:**
IDF(t) = log((Total number of documents) / (Number of documents containing term t))

### 3. TF-IDF:
The final value is the product of the Term Frequency and the Inverse Document Frequency, helping determine the **weight of each word** in a document relative to the entire corpus.

**Formula:**
TF-IDF(t) = TF(t) * IDF(t)

## When to Use TF-IDF:

- **Text Classification:** When building models to classify text data into categories, TF-IDF helps identify important words (features) that can differentiate between classes.
- **Information Retrieval:** TF-IDF is used in search engines to rank documents based on relevance to a search query.
- **Feature Extraction:** In NLP, when you want to convert text data into a numerical format for machine learning models, TF-IDF is commonly used.
- **Reducing Noise:** Common words (e.g., "the," "is," "to") are usually given low weights, reducing their impact on models.

## How TF-IDF is Applied to the 20 Newsgroups Dataset:

### The 20 Newsgroups Dataset:
This dataset contains 20 different categories of newsgroup posts. Some of the categories are:

- **alt.atheism**
- **comp.graphics**
- **rec.autos**
- **sci.med**
- **talk.politics.misc**
- ... and others.

Each newsgroup post is a piece of text, and the goal is to classify each post into one of the 20 categories. To do that, we need to convert the text into a format that a machine learning model can understand, which is where TF-IDF comes in.

### How TF-IDF Transforms the Dataset:

#### Term Frequency (TF):
For each post in the dataset, the **TF** part of the formula measures how many times each word appears in that specific post. For example:

If the word "graphics" appears 5 times in a post and the post contains 100 words, the **TF** for "graphics" in that post would be:
TF("graphics") = 5/100 = 0.05

This means "graphics" contributes 5% of the total words in that document.

#### Inverse Document Frequency (IDF):
The **IDF** component adjusts the weight of words that are common across all the newsgroup posts. Words like "the," "and," "is" will appear in many documents and thus will have a low IDF value.

For instance, if the word "the" appears in almost every newsgroup post, the **IDF** for "the" will be small, which means it will have a low weight and not significantly influence the classification.

If a word appears in only a few categories (like "graphics" in the **comp.graphics** category), its **IDF** value will be higher, meaning it is more distinctive and useful for classification.

#### Combining TF and IDF (TF-IDF):
The final **TF-IDF** value for a word in a document is the product of its **TF** and **IDF**. Words that are frequent in a specific post but rare across the entire corpus (like "graphics" in **comp.graphics**) will have a high **TF-IDF** score, making them highly informative for classification.

### Example of How TF-IDF Works in the 20 Newsgroups Dataset:
Let’s say you have the following three documents in the dataset:

- **Document 1** (from **comp.graphics**): "Graphics hardware is important in modern computing."
- **Document 2** (from **rec.autos**): "The importance of graphics in automobile design."
- **Document 3** (from **sci.med**): "Medical graphics can be used in medical research."

For each document, TF calculates the frequency of terms:

- In **Document 1**, "graphics" may appear 1 time, and other words like "hardware" and "modern" also have their frequencies.
- In **Document 2**, "graphics" appears 1 time, and similar calculations are done for other words.

The **IDF** component checks how many documents contain the word "graphics" and assigns it a weight. If "graphics" appears in many documents, its **IDF** will be lower, as it is common. However, if "graphics" appears mainly in **comp.graphics** and not much elsewhere, the **IDF** will be higher.

Finally, the **TF-IDF** values are calculated by multiplying the term frequency (**TF**) by the inverse document frequency (**IDF**). Words like "graphics" will have a high **TF-IDF** score in **comp.graphics**, making them more important for classifying the document into that category.

## What Happens in Your Code:
```python
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(X).toarray()

TfidfVectorizer(max_features=10000): This limits the vocabulary to the top 10,000 words based on their TF-IDF scores. These words are considered the most informative in the dataset for classification purposes.

fit_transform(X): This applies the TF-IDF transformation to the data (X), converting the raw text into a matrix of TF-IDF features. Each document is now represented as a vector of numerical values, where each value corresponds to the TF-IDF score for a word in the document.

toarray(): Converts the sparse matrix (which only stores non-zero values to save memory) into a dense array, which is easier to work with but takes more memory.

**Summary:**
TF-IDF helps convert text data into numerical vectors that represent the importance of words in the context of the dataset.

It reduces the influence of common words that don’t provide much insight into the category of the document (like "the," "is," "and").

It highlights more unique and distinctive words that are valuable for text classification (like "graphics" in the comp.graphics category).

Using TF-IDF in the 20 Newsgroups dataset enables you to represent each newsgroup post in a way that a machine learning algorithm can use to predict its category based on the important features (words).

### Why is Embedding Used, Even If We Have TF-IDF?
**1. TF-IDF vs Embedding Layer:**

TF-IDF is a statistical representation of a document. It assigns each word in the document a weight (importance) based on how often it appears in the document relative to its frequency across the entire corpus. The result is a sparse vector representation, which is good for text classification tasks, but does not capture semantic relationships between words.

For example, TF-IDF would give the words "king" and "queen" very different weights, even though they are semantically similar.

Embeddings, on the other hand, provide dense, continuous vectors that capture semantic meaning. "King" and "queen" would be represented by similar vectors because they are related concepts (e.g., royalty).

**2. When We Use an Embedding Layer:**

The Embedding layer is typically used when you're dealing with raw text data, and you want to learn the best representations (embeddings) for words during the training process.

TF-IDF is a feature extraction technique, not a model itself. After converting your raw text into a TF-IDF matrix, you're feeding it directly to the model. The Embedding layer, however, is used in models where you want to learn word representations (vectors) that improve during training, which is particularly useful for deep learning models like LSTM, GRU, etc.

**Example to Illustrate:**

**Using TF-IDF:**

You convert your text into a sparse matrix where each word has a frequency or weight. This matrix can be fed into a machine learning model like logistic regression or a neural network.

This doesn't capture relationships like "king" and "queen" being similar.

**Using an Embedding Layer:**

When using a deep learning model, you might start by converting the words into indices (e.g., 0 for "king", 1 for "queen", etc.), then use the Embedding layer to transform these indices into dense vectors (e.g., [0.32, 0.15, -0.67, ...]).

These dense vectors are learned by the model during training. So, the model learns to represent words like "king" and "queen" with similar vectors because they have similar meanings.

**When to Use Which:**

Use TF-IDF when you want a quick, non-learned representation of your text for shallow machine learning models (like SVM, Naive Bayes, etc.).

Use the Embedding layer when you want your model to learn word representations and capture complex semantic relationships between words (useful in deep learning models like LSTM, GRU, Transformers, etc.).

**In Summary:**

TF-IDF gives you a static, sparse representation of words based on their frequency.

Embedding layers give you a dynamic, dense representation of words that captures their semantic meaning and is learned by the model during training.

If you're using deep learning models (like LSTM, GRU, etc.), you generally use Embeddings to get rich word representations, which are beneficial for learning the context in sequences.



## 🔸 1. TF-IDF + RNN/LSTM/GRU
### 🔹 TF-IDF Vectorization

In [4]:
X_train_raw[0] #just for check form of data 

'Hi all,\n\nI think the subject says it all - does anyone know how to take the rgb/h/vsync from a standard vga connector and record them on video tape??\n\nAny help is appreciated!\n\n\nMark J Cargill'

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
vectorizer = TfidfVectorizer(max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train_raw).toarray()
X_val_tfidf = vectorizer.fit_transform(X_val_raw).toarray()
X_test_tfidf = vectorizer.fit_transform(X_test_raw).toarray()

In [6]:
X_train_tfidf.shape, X_val_tfidf.shape, X_test_tfidf.shape

((11307, 1000), (3769, 1000), (3770, 1000))

In [7]:
# Reshaping, pad to make it 3D
X_train_tfidf_reshaped = np.expand_dims(X_train_tfidf, axis=-1)
X_val_tfidf_reshaped = np.expand_dims(X_val_tfidf, axis=-1)
X_test_tfidf_reshaped = np.expand_dims(X_test_tfidf, axis=-1)
X_train_tfidf_reshaped.shape, X_val_tfidf_reshaped.shape, X_test_tfidf_reshaped.shape

((11307, 1000, 1), (3769, 1000, 1), (3770, 1000, 1))

## 1️⃣ TF-IDF + RNN

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, LSTM, GRU, Embedding

In [9]:
y_train.shape, y_val.shape, y_test.shape

((11307, 20), (3769, 20), (3770, 20))

In [10]:
rnn_tfidf_model = Sequential()
rnn_tfidf_model.add(SimpleRNN(128, input_shape = (1000,1)))
rnn_tfidf_model.add(Dense(20, activation = "softmax"))

rnn_tfidf_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
rnn_tfidf_model.fit(X_train_tfidf_reshaped, y_train, validation_data=(X_val_tfidf_reshaped, y_val), epochs=3, batch_size=64)

print("RNN tfidf Accuracy: ", rnn_tfidf_model.evaluate(X_test_tfidf_reshaped, y_test)[1])

Epoch 1/3


2025-04-30 16:30:37.410242: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
  super().__init__(**kwargs)


[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m24s[0m 134ms/step - accuracy: 0.0478 - loss: 3.0460 - val_accuracy: 0.0578 - val_loss: 3.0013
Epoch 2/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m24s[0m 136ms/step - accuracy: 0.0484 - loss: 3.0190 - val_accuracy: 0.0552 - val_loss: 3.0238
Epoch 3/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m24s[0m 133ms/step - accuracy: 0.0492 - loss: 3.0241 - val_accuracy: 0.0491 - val_loss: 3.0811
[1m118/118[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 30ms/step - accuracy: 0.0525 - loss: 3.0644
RNN tfidf Accuracy:  0.05066312849521637


## 2️⃣ TF-IDF + LSTM

In [11]:
lstm_tfidf_model = Sequential()
lstm_tfidf_model.add(LSTM(128, input_shape = (1000,1)))
lstm_tfidf_model.add(Dense(20,activation="softmax"))

lstm_tfidf_model.compile(optimizer="adam", loss = "categorical_crossentropy", metrics=["accuracy"])
lstm_tfidf_model.fit(X_train_tfidf_reshaped, y_train,epochs = 3, validation_data=(X_val_tfidf_reshaped, y_val), batch_size=64)
print("LSTM tfidf Accuracy: ", lstm_tfidf_model.evaluate(X_test_tfidf_reshaped, y_test)[1])

Epoch 1/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m76s[0m 422ms/step - accuracy: 0.0491 - loss: 2.9934 - val_accuracy: 0.0581 - val_loss: 2.9900
Epoch 2/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m76s[0m 427ms/step - accuracy: 0.0502 - loss: 2.9913 - val_accuracy: 0.0520 - val_loss: 2.9873
Epoch 3/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m72s[0m 407ms/step - accuracy: 0.0604 - loss: 2.9881 - val_accuracy: 0.0570 - val_loss: 2.9885
[1m118/118[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12s[0m 105ms/step - accuracy: 0.0553 - loss: 2.9795
LSTM tfidf Accuracy:  0.05755968019366264


## 3️⃣ TF-IDF + GRU

In [12]:
gru_tfidf_model = Sequential()
gru_tfidf_model.add(GRU(128, input_shape=(1000,1)))
gru_tfidf_model.add(Dense(20, activation="softmax"))

gru_tfidf_model.compile(optimizer="adam", loss = "categorical_crossentropy", metrics=["accuracy"])
gru_tfidf_model.fit(X_train_tfidf_reshaped, y_train,epochs = 3, validation_data=(X_val_tfidf_reshaped, y_val), batch_size=64)
print("GRU tfidf Accuracy: ", gru_tfidf_model.evaluate(X_test_tfidf_reshaped, y_test)[1])

Epoch 1/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m80s[0m 445ms/step - accuracy: 0.0495 - loss: 2.9935 - val_accuracy: 0.0483 - val_loss: 2.9887
Epoch 2/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m75s[0m 424ms/step - accuracy: 0.0621 - loss: 2.9840 - val_accuracy: 0.0584 - val_loss: 2.9830
Epoch 3/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m75s[0m 423ms/step - accuracy: 0.0568 - loss: 2.9811 - val_accuracy: 0.0560 - val_loss: 2.9756
[1m118/118[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 83ms/step - accuracy: 0.0650 - loss: 2.9692
GRU tfidf Accuracy:  0.06445623189210892


## 4️⃣ TF-IDF + Transformer (using simple Attention layer)

In [None]:
from tensorflow.keras.layers import Dense, LayerNormalization, MultiHeadAttention, GlobalMaxPooling1D, Input, Add 
from tensorflow.keras.models import Model
import tensorflow as tf 
input_layer = Input(shape = (1000,1))
x = Dense(64)(input_layer)
att = MultiHeadAttention(num_heads=2, key_dim=64)(x,x)
att = Add()([x, att])
att = LayerNormalization()(att)
ffn = Dense(64, activation='relu')(att)
ffn = Add()([att, ffn])
ffn = LayerNormalization()(ffn)
att = GlobalMaxPooling1D()(ffn)
output_layer = Dense(20, activation='softmax')(att)
transformer_tfidf_model = Model(inputs = input_layer, outputs = output_layer)
transformer_tfidf_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

#Fit and Evaluate

transformer_tfidf_model.fit(X_train_tfidf_reshaped, y_train, epochs=3, batch_size=64, validation_batch_size=(X_val_tfidf_reshaped, y_val))
print("Transformer TF-IDF Accuracy: ", transformer_tfidf_model.evaluate(X_test_tfidf_reshaped, y_test)[1])


Epoch 1/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m180s[0m 1s/step - accuracy: 0.0501 - loss: 3.0845
Epoch 2/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m179s[0m 1s/step - accuracy: 0.0595 - loss: 3.0044
Epoch 3/3
[1m177/177[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m179s[0m 1s/step - accuracy: 0.0564 - loss: 2.9994
[1m 49/118[0m [32m━━━━━━━━[0m[37m━━━━━━━━━━━━[0m [1m13s[0m 196ms/step - accuracy: 0.0525 - loss: 3.0081

## 🔹 Common Setup for Word2Vec Models

In [None]:
# Tokenization
from tensorflow.keras.preprocessing.text import Tokenizer
from gensim.models import Word2Vec
tokenizer = Tokenizer(num_words=1000, oov_token="<OOV>")
tokenizer.fit_on_texts(X_raw)
X_seq = tokenizer.texts_to_sequences(X_raw)
X_pad = pad_sequences(X_seq, padding='post', maxlen=500)

In [None]:
# Train Word2Vec model
sentences = [text.split() for text in X_raw]
w2v_model = Word2Vec(sentences, vector_size=1000, window=5, min_count=1)
word_index = tokenizer.word_index
embedding_matrix = np.zeros((1000,128))
for word, i in word_index.items():
    if i < 10000 and word in w2v_model.wv:
        embedding_matrix[i]: w2v_model.wv[word]
embedding_matrix

In [None]:
# Train/Val/Test Split
X_train_w2v, X_temp_w2v, y_train, y_temp = train_test_split(X_pad, y_cat, test_size=0.4, random_state=42)
X_val_w2v, X_test_w2v, y_val, y_test = train_test_split(X_temp_w2v, y_temp, test_size=0.5, random_state=42)

In [None]:
X_train_w2v

### 1️⃣ Word2Vec + RNN

In [None]:
rnn_w2v_model = Sequential()
rnn_w2v_model.add(Embedding(input_dim=1000, output_dim=128, weights=[embedding_matrix], input_length = 500, trainable = False))
rnn_w2v_model.add(SimpleRNN(128))
rnn_w2v_model.add(Dense(20, activation="softmax"))

rnn_w2v_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
rnn_w2v_model.fit(X_train_w2v, y_train, validation_data = [X_val_w2v, y_val], epochs=3, batch_size=64)
print("RNN W2v Accuracy: ", rnn_w2v_model.evaluate(X_test_w2v, y_test)[1])

### 2️⃣ Word2Vec + LSTM

In [None]:
lstm_w2v_model = Sequential()
lstm_w2v_model.add(Embedding(input_dim=1000, output_dim=128, weights=[embedding_matrix], input_length = 500, trainable = False))
lstm_w2v_model.add(LSTM(128))
lstm_w2v_model.add(Dense(20, activation="softmax"))

lstm_w2v_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["Accuracy"])
lstm_w2v_model.fit(X_train_w2v, y_train, validation_data=[X_val_w2v, y_val], batch_size=64, epochs = 3)
print("LSTM w2v Accuracy: ", lstm_w2v_model.evaluate(X_test_w2v, y_test)[1])


### 3️⃣ Word2Vec + GRU

In [None]:
gru_w2v_model = Sequential()
gru_w2v_model.add(Embedding(input_dim = 1000, output_dim = 128, weights=[embedding_matrix], input_length= 500))
gru_w2v_model.add(GRU(128))
gru_w2v_model.add(Dense(20, activation="softmax"))

gru_w2v_model.compile(optimizer="adam", loss = "categorical_crossentropy", metrics = ["Accuracy"])
gru_w2v_model.fit(X_train_w2v, y_train, validation_data=[X_val_w2v, y_val], epochs=3, batch_size=64)
print("GRU w2v Accuracy: ", gru_w2v_model.evaluate(X_test_w2v, y_test)[1])

### 4️⃣ Word2Vec + Transformer (Simplified)

In [None]:
input_layer = Input(shape=(500,))
embedding_layer = Embedding(input_dim = 1000, output_dim = 128, weights = [embedding_matrix], trainable = 500)(input_layer)
att = MultiHeadAttention(num_heads = 2, key_dim = 64)(embedding_layer, embedding_layer)
att = LayerNormalization()(att)
att = GlobalMaxPooling1D()(att)
output_layer = Dense(20, activation = "softmax")(att)

transformer_w2v_model = Model(inputs = input_layer, outputs = output_layer)
transformer_w2v_model.compile(optimizer = "adam", loss = "categorical_crossentropy", metrics = ["accuracy"])
transformer_w2v_model.fit(X_train_w2v, y_train, validation_data = [X_val_w2v, y_val], epochs = 3, batch_size = 64)
print("Transformer w2v Accuracy: ", transformer_w2v_model.evaluate(X_test_w2v, y_test)[1])

**What do the parameters mean?**

Parameter	                            Meaning

**input_dim=10000**	--> Size of the vocabulary. That is, the total number of unique words your model expects. Here, it assumes 10,000 different words. Each word will have its own vector.

Number of unique tokens (words) in whole dataset

**output_dim=128** -->	Size of the vector for each word. That is, each word will be represented by a dense vector of 128 numbers.

Number of tokens in each input sequence (padded/truncated)

**input_length=500** --> Length of each input sequence (number of words). Here, each input (sentence/document) will have 500 words (after padding/truncating).

Size of the dense vector for each word


**In simple words:**

* You have 10,000 different words.

* Each word will be converted into a 128-dimensional vector.

* Each input document must have exactly 500 words (either padded or cut).

So the output shape of the Embedding layer will be:

(batch_size, 500, 128)

→ meaning batch_size documents,

each document has 500 words,

each word is represented by a 128-length vector.

## 🔹 Common Setup for GloVe-Based Models

wget http://nlp.stanford.edu/data/glove.6B.zip

unzip glove.6B.zip

### 🔁 Load and Prepare GloVe Embeddings