In [1]:
import pandas as pd 

# Why RNN, LSTM, GRU, and Transformers?

## **Recurrent Neural Networks (RNN)**
RNNs were developed to handle sequential data by maintaining a "memory" of previous inputs in the sequence. They are good for tasks like time series prediction or text generation where past information influences future predictions.

**However, RNNs have limitations:**

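**Vanishing Gradient Problem:** During backpropagation through time, the gradient is multiplied by a per-step factor at every time step. When those factors are smaller than 1, the gradient shrinks exponentially, so early time steps receive almost no learning signal. This makes training RNNs on long sequences very hard.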
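
To make the effect concrete, here is a minimal sketch that assumes a scalar RNN of the form h_t = tanh(w * h_{t-1} + x_t). The weight `w = 0.5`, the constant input, and the helper name `gradient_through_time` are illustrative assumptions, not part of the models built in this notebook. It multiplies the per-step derivatives w * (1 - h_t^2) across time steps:

```python
import math

def gradient_through_time(w, inputs):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# Hypothetical setup: weight 0.5 and a constant input of 0.1 at every step.
for steps in (5, 20, 50):
    g = gradient_through_time(w=0.5, inputs=[0.1] * steps)
    print(f"{steps:>2} steps: gradient = {g:.3e}")
```

Because each factor is at most 0.5 in this setup, the gradient after T steps is bounded by 0.5^T, which is why the printed values collapse toward zero as the sequence grows.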

## **Long Short-Term Memory (LSTM)**
LSTM was introduced to address the vanishing gradient problem in RNNs. It uses a special memory cell structure that allows the network to "remember" information for long periods and is much more effective for longer sequences.

**Problem:**

While LSTMs mitigate the vanishing gradient problem, they still have some issues with long-range dependencies and are computationally expensive.
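
The memory cell described above can be sketched as a single time step in NumPy. This is a simplified illustration of the standard LSTM equations, not the Keras implementation; the gate ordering inside the stacked weight matrices, the shapes, and the random weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much of the cell to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate values for the cell state
    c = f * c_prev + i * g       # additive update of the memory cell
    h = o * np.tanh(c)           # new hidden state
    return h, c

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):  # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

The line `c = f * c_prev + i * g` is the key design choice: the cell state is updated by addition rather than by repeated matrix multiplication, which gives gradients a more direct path back through time.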

## **Gated Recurrent Unit (GRU)**
GRU is a simplified variant of LSTM. It merges the forget and input gates into a single update gate and drops the separate cell state, so it has fewer parameters and is usually faster to train.

**Problem:**

GRUs might not handle very long sequences as well as LSTMs due to fewer gates.
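
The parameter savings can be estimated with a short calculation. The sketch below uses the textbook layout in which every gate or candidate block has one input matrix, one recurrent matrix, and one bias vector; the helper name `recurrent_params` and the layer sizes are assumptions, and some library implementations add an extra bias vector, so exact counts may differ slightly.

```python
def recurrent_params(input_dim, hidden_dim, n_blocks):
    """Each block has an input matrix, a recurrent matrix, and a bias vector."""
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return n_blocks * per_block

# Hypothetical layer sizes chosen only for illustration.
input_dim, hidden_dim = 128, 64
counts = {
    "SimpleRNN": recurrent_params(input_dim, hidden_dim, n_blocks=1),
    "GRU": recurrent_params(input_dim, hidden_dim, n_blocks=3),   # update, reset, candidate
    "LSTM": recurrent_params(input_dim, hidden_dim, n_blocks=4),  # input, forget, output, candidate
}
for name, n in counts.items():
    print(f"{name:>9}: {n:,} parameters")
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer with the same sizes.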

## **Transformers**
Transformers address the problem of long-range dependencies. Instead of processing tokens one after another, transformers use self-attention, which lets every position in the sequence attend directly to every other position. Because the whole sequence is processed in parallel, transformers train faster on modern hardware and handle long-range dependencies better than recurrent models. They are currently the state of the art for most sequence modeling tasks, including NLP tasks like machine translation.
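
As a rough illustration of self-attention, the following NumPy sketch implements single-head scaled dot-product attention. It omits masking, multiple heads, and positional encodings, and the dimensions and random projection matrices are assumptions chosen only to make the example runnable.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a (n_tokens, d_model) input."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n_tokens, n_tokens) pairwise scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 8, 4
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)
```

The `weights` matrix has one row and one column per token, so its size grows with the square of the sequence length.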

**Problem:**

Transformers can be very computationally expensive because the attention mechanism evaluates pairwise relations between every pair of tokens in the sequence, so compute and memory grow quadratically with sequence length.

**Code Setup**

We’ll load the 20 Newsgroups dataset from sklearn.datasets, a corpus that is often used for text classification, and preprocess it into sequences suitable for RNN, LSTM, GRU, and Transformer models.

We’ll split the dataset into three sets: Train, Validation (Eval), and Test.
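
One common way to get three sets is to call `train_test_split` twice. The sketch below uses small toy data (`X_demo`, `y_demo`) as stand-ins for the real features and labels, and the 60/20/20 ratio, the stratification, and the `random_state` are assumptions rather than choices stated in this notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real features and labels (hypothetical data).
X_demo = [[i] for i in range(100)]
y_demo = [i % 4 for i in range(100)]

# First carve out the test set, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```

Taking 25% of the remaining 80% for validation yields a 60/20/20 split overall, and `stratify` keeps the class proportions similar across the three sets.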

## 1. Data Preprocessing
We'll first load the dataset, vectorize it using TfidfVectorizer, and then pad the sequences.

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np 
import pandas as pd 

In [18]:
# Load Dataset 
newsgroups = fetch_20newsgroups(subset="all")

In [39]:
X_raw = newsgroups.data
y = newsgroups.target

In [40]:
# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(X_raw).toarray()
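
A note on memory: `fit_transform` returns a sparse matrix, and `.toarray()` converts it into a dense one. The back-of-the-envelope estimate below assumes about 18,846 posts in `subset="all"` and 64-bit floats; both figures are assumptions worth checking against `X.shape` and `X.dtype` on your machine.

```python
n_docs = 18_846        # assumed number of posts in subset="all"
n_features = 10_000    # max_features passed to TfidfVectorizer
bytes_per_value = 8    # assumed float64 values

dense_bytes = n_docs * n_features * bytes_per_value
print(f"Dense matrix: {dense_bytes / 1024**3:.2f} GiB")
```

If that is too much for the available RAM, keeping the sparse matrix or lowering `max_features` are two possible alternatives.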

# TF-IDF (Term Frequency-Inverse Document Frequency)

## What is TF-IDF?

**TF-IDF** is a statistical measure used in text analysis and natural language processing (NLP) to evaluate how important a word is in a document relative to a collection of documents (also called a corpus). It combines two components:

### 1. Term Frequency (TF):
This measures how frequently a term (word) appears in a document. It helps indicate the relative importance of a term within the document.

**Formula:**
TF(t) = (Number of times term t appears in the document) / (Total number of terms in the document)
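
A minimal sketch of this formula in plain Python, assuming lowercase whitespace tokenization (a simplification; `TfidfVectorizer` tokenizes differently) and a hypothetical helper name `term_frequency`:

```python
def term_frequency(term, document):
    """TF(t) = occurrences of t in the document / total tokens in the document."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)

doc = "graphics cards render graphics quickly"
print(term_frequency("graphics", doc))  # 2 occurrences out of 5 tokens
```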



### 2. Inverse Document Frequency (IDF):
This measures how informative a term is across the entire corpus (collection of documents). Words that appear in many documents have lower IDF values, while words that appear in only a few documents have higher IDF values. This helps reduce the importance of words that are too common (like "the," "and," etc.).

**Formula:**
IDF(t) = log((Total number of documents) / (Number of documents containing term t))
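
The same formula as a small Python sketch, assuming the natural logarithm and lowercase whitespace tokenization; the helper name `inverse_document_frequency` and the toy corpus are illustrative assumptions:

```python
import math

def inverse_document_frequency(term, documents):
    """IDF(t) = log(N / number of documents containing t)."""
    term = term.lower()
    containing = sum(1 for doc in documents if term in doc.lower().split())
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / containing)

corpus = [
    "the engine of the car",
    "the graphics card",
    "the doctor and the patient",
]
print(inverse_document_frequency("the", corpus))       # appears in every document
print(inverse_document_frequency("graphics", corpus))  # appears in one document
```

A term that occurs in every document gets log(1) = 0, while a term found in a single document gets the largest value the corpus allows.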

### 3. TF-IDF:
The final value is the product of the Term Frequency and the Inverse Document Frequency, helping determine the weight of each word in a document relative to the entire corpus.

**Formula:**
TF-IDF(t) = TF(t) * IDF(t)
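
Putting the two pieces together, the sketch below computes TF-IDF weights for a toy corpus using exactly the formulas above. It assumes whitespace tokenization and the natural logarithm; the function name `tf_idf` and the corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: TF-IDF weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

corpus = ["the car engine", "the graphics card", "the medical study"]
for doc_weights in tf_idf(corpus):
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In this corpus "the" receives a weight of 0 in every document because it occurs in all of them, while the topic-specific words keep a positive weight.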

## When to Use TF-IDF:

- **Text Classification:** When building models to classify text data into categories, TF-IDF helps identify important words (features) that can differentiate between classes.
- **Information Retrieval:** TF-IDF is used in search engines to rank documents based on relevance to a search query.
- **Feature Extraction:** In NLP, when you want to convert text data into a numerical format for machine learning models, TF-IDF is commonly used.
- **Reducing Noise:** Common words (e.g., "the," "is," "to") are usually given low weights, reducing their impact on models.

## How TF-IDF is Applied to the 20 Newsgroups Dataset:

### The 20 Newsgroups Dataset:
This dataset contains 20 different categories of newsgroup posts. Some of the categories are:

- **alt.atheism**
- **comp.graphics**
- **rec.autos**
- **sci.med**
- **talk.politics.misc**
- ... and others.

Each newsgroup post is a piece of text, and the goal is to classify each post into one of the 20 categories. To do that, we need to convert the text into a format that a machine learning model can understand, which is where TF-IDF comes in.

### How TF-IDF Transforms the Dataset:

#### Term Frequency (TF):
For each post in the dataset, the **TF** part of the formula measures how often each word appears in that specific post, relative to the post's length. For example:

If the word "graphics" appears 5 times in a post and the post contains 100 words, the **TF** for "graphics" in that post would be:
TF("graphics") = 5/100 = 0.05

This means "graphics" accounts for 5% of the words in that document.

#### Inverse Document Frequency (IDF):
The **IDF** component adjusts the weight of words that are common across all the newsgroup posts. Words like "the," "and," "is" will appear in many documents and thus will have a low IDF value.

For instance, if the word "the" appears in almost every newsgroup post, the **IDF** for "the" will be small, which means it will have a low weight and not significantly influence the classification.

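If a word appears in only a small share of the posts (like "graphics", which occurs mostly in posts from the **comp.graphics** category), its **IDF** value will be higher, meaning it is more distinctive and useful for classification. Note that IDF is computed over documents, not over categories.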

#### Combining TF and IDF (TF-IDF):
The final **TF-IDF** value for a word in a document is the product of its **TF** and **IDF**. Words that are frequent in a specific post but rare across the entire corpus (like "graphics" in **comp.graphics**) will have a high **TF-IDF** score, making them highly informative for classification.

### Example of How TF-IDF Works in the 20 Newsgroups Dataset:
Let’s say you have the following three documents in the dataset:

- **Document 1** (from **comp.graphics**): "Graphics hardware is important in modern computing."
- **Document 2** (from **rec.autos**): "The importance of graphics in automobile design."
- **Document 3** (from **sci.med**): "Medical graphics can be used in medical research."

For each document, TF calculates the frequency of terms:

- In **Document 1**, "graphics" appears once, and other words like "hardware" and "modern" each have their own frequencies.
- In **Document 2**, "graphics" also appears once, and the same calculation is done for the other words.

The **IDF** component counts how many documents contain each word and weights the word accordingly. In this three-document example, "graphics" appears in every document, so IDF("graphics") = log(3/3) = 0 and the word does not help tell these documents apart, whereas "hardware" and "automobile" each appear in only one document and get IDF = log(3/1), the largest value possible here. Across the full dataset the picture changes: if "graphics" appears mainly in **comp.graphics** posts and rarely elsewhere, its **IDF** will be higher.

Finally, the **TF-IDF** values are calculated by multiplying the term frequency (**TF**) by the inverse document frequency (**IDF**). In the full dataset, words like "graphics" will tend to have high **TF-IDF** scores in **comp.graphics** posts, making them useful for classifying a post into that category.
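
The three example documents can also be run through scikit-learn directly. Keep in mind that `TfidfVectorizer` does not use the plain formulas above: by default it applies a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2-normalizes each document vector, so the numbers differ from a hand calculation. The sketch below is only an illustration on this toy corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Graphics hardware is important in modern computing.",
    "The importance of graphics in automobile design.",
    "Medical graphics can be used in medical research.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs).toarray()  # dense is fine for a tiny corpus

vocab = vectorizer.vocabulary_  # maps each term to its column index
for term in ("graphics", "hardware", "medical"):
    col = vocab[term]
    print(f"{term:>9}: idf={vectorizer.idf_[col]:.3f}, "
          f"weights per document={tfidf[:, col].round(3)}")
```

Because "graphics" occurs in all three documents, it receives the lowest IDF of the three terms, while "hardware" and "medical" each occur in a single document and are weighted more heavily.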

## What Happens in Your Code:
```python
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(X).toarray()
```

In [21]:
newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [25]:
# Load dataset
newsgroups = fetch_20newsgroups(subset='all')
X = newsgroups.data
y = newsgroups.target

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(X).toarray()

In [32]:
pd.DataFrame(X[0])[0].unique()

array([0.        , 0.0551521 , 0.06228945, 0.05398663, 0.03889093,
       0.12335451, 0.02971408, 0.01860012, 0.13315578, 0.03173198,
       0.06314308, 0.07767828, 0.08378101, 0.05066146, 0.16870275,
       0.11411294, 0.08328202, 0.14112828, 0.09147512, 0.06531498,
       0.18464977, 0.1083335 , 0.04455032, 0.05699943, 0.15994526,
       0.07580327, 0.11806278, 0.02067249, 0.01571676, 0.16034569,
       0.0619455 , 0.06772143, 0.09961367, 0.02395684, 0.08101771,
       0.04431971, 0.02890425, 0.0532248 , 0.03675459, 0.04002432,
       0.09919647, 0.22413785, 0.09168329, 0.03407714, 0.0788684 ,
       0.05893586, 0.07678712, 0.05146762, 0.01576353, 0.08181758,
       0.10915665, 0.05937229, 0.09655792, 0.08303835, 0.04310689,
       0.05764678, 0.02907696, 0.05918566, 0.0252772 , 0.1425931 ,
       0.07032004, 0.01634109, 0.03397822, 0.077074  , 0.56144282,
       0.07223526, 0.08788553, 0.05473728, 0.02819857, 0.07891483,
       0.10608441, 0.0609065 , 0.05467748, 0.10321891, 0.07531