## Step 1: Install Dependencies

This cell installs the required packages such as nltk, torch, transformers, tensorflow, keras, numpy, matplotlib, and seaborn. These libraries provide a comprehensive toolkit for natural language processing, deep learning, and visualization.




In [None]:
### Step 1: Install Dependencies
!pip install nltk torch transformers tensorflow keras numpy matplotlib seaborn

## Step 2: Import Necessary Libraries

This cell imports all the essential libraries for text processing, deep learning, and data visualization. Libraries like nltk enable natural language processing, torch and tensorflow/keras are used for building neural networks, while matplotlib and seaborn allow for creating plots and visualizations.



In [None]:
### Step 2: Import Necessary Libraries
import nltk
import torch
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.util import ngrams
from collections import Counter
from transformers import pipeline

## Step 3: Load Sample Text Data

This cell sets up a sample text string and downloads the necessary NLTK data packages (like 'punkt' and 'punkt_tab') for tokenization. These tokenizers break the text into words and punctuation needed for later analysis.


In [None]:
### Step 3: Load Sample Text Data
text = "The quick brown fox jumps over the lazy dog. The fox is clever and quick."
nltk.download('punkt')
nltk.download('punkt_tab')

## Step 4: N-gram Model Explanation  

This step implements a **Bigram Model**, a type of **N-gram model** where `n=2`. The purpose of this model is to analyze word sequences and determine the frequency of consecutive word pairs in the given text.  

### How It Works  

1. **Tokenization**  
   - The input text is converted to lowercase to ensure uniformity.  
   - It is then split into individual tokens (words and punctuation marks) using `word_tokenize` from NLTK.  

2. **Generating Bigrams**  
   - The `ngrams` function from NLTK creates a list of word pairs (bigrams).  
   - Each bigram consists of two consecutive tokens from the tokenized text, so pairs that include punctuation (such as `dog .`) are counted as well.  

3. **Counting Bigrams**  
   - The `Counter` class from the `collections` module is used to count the frequency of each bigram.  
   - The most common bigrams are displayed using the `most_common(5)` method.  

### Theory and Applications  

An **N-gram Model** estimates the probability of a word sequence based on observed word patterns. In the case of a bigram model, the probability of a word appearing is conditioned on the previous word.  
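
The conditional estimate described above can be written as a ratio of counts, P(w2 | w1) = count(w1, w2) / count(w1). The snippet below is a minimal sketch of that maximum-likelihood estimate. It assumes plain whitespace tokenization with punctuation already removed (not the `word_tokenize` output used in this notebook), and the helper name `bigram_probability` is hypothetical.

```python
from collections import Counter

sample = "the quick brown fox jumps over the lazy dog the fox is clever and quick"
tokens = sample.split()

# Count every token that starts a bigram (all but the last) and every adjacent pair.
context_counts = Counter(tokens[:-1])
bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))

def bigram_probability(prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word)."""
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]

# "the" is followed by "quick", "lazy" and "fox" once each.
print(bigram_probability("the", "quick"))
# A pair that never occurs in the sample gets probability 0.
print(bigram_probability("the", "cat"))
```

Here "the" occurs three times as a context, each time with a different continuation, so each observed continuation gets one third of the probability mass.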

**Applications**  
- **Language Modeling:** Predicting the next word in a sentence.  
- **Text Generation:** Constructing sentences based on learned word sequences.  
- **Speech Recognition:** Improving accuracy by considering word sequence probabilities.  
- **Machine Translation:** Enhancing translation quality by analyzing word pair patterns.  

### Advantages  

- **Simple and easy to implement** due to its straightforward structure.  
- **Computationally efficient** compared to deep learning models, making it useful for quick text analysis tasks.  

### Limitations  

- **Limited context awareness** since it only considers the immediate previous word, making it ineffective for capturing long-range dependencies.  
- **Data sparsity issue**, where unseen bigrams in the training data receive zero probability, affecting model accuracy.  
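
One common way to soften the zero-probability problem is add-k smoothing (Laplace smoothing when k = 1). This technique is not used in this notebook; the sketch below is only an illustration, assuming whitespace tokenization and a hypothetical helper named `smoothed_probability`.

```python
from collections import Counter

tokens = "the quick brown fox jumps over the lazy dog the fox is clever and quick".split()
vocab = set(tokens)
context_counts = Counter(tokens[:-1])
bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))

def smoothed_probability(prev_word, word, k=1.0):
    """Add-k estimate of P(word | prev_word); k=1 is Laplace smoothing."""
    numerator = bigram_counts[(prev_word, word)] + k
    denominator = context_counts[prev_word] + k * len(vocab)
    return numerator / denominator

# "the dog" never occurs in the sample, yet it now gets a small non-zero probability.
print(smoothed_probability("the", "dog"))
# An observed pair still scores higher than an unseen one.
print(smoothed_probability("the", "fox"))
```

The smoothed values for a given context still sum to 1 over the vocabulary, at the cost of taking some probability mass away from the pairs that were actually observed.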

### Additional References  

- [NLTK n-grams Documentation](https://www.nltk.org/_modules/nltk/util.html)  
- [Python Counter Class](https://docs.python.org/3/library/collections.html)  
- [N-gram Language Models: Speech and Language Processing (Jurafsky & Martin), Chapter 3](https://web.stanford.edu/~jurafsky/slp3/3.pdf)  

In [None]:
### Step 4: N-gram Model
from nltk.tokenize import word_tokenize
def generate_ngrams(text, n=2):
    tokens = word_tokenize(text.lower())
    n_grams = list(ngrams(tokens, n))
    return [" ".join(gram) for gram in n_grams]

ngram_counts = Counter(generate_ngrams(text, 2))
print("Bigram counts:", ngram_counts.most_common(5))

## Step 5: Simple RNN Model (PyTorch)  

This step implements a **Simple Recurrent Neural Network (RNN)** using **PyTorch** for **character-level text generation**. The model learns patterns from input text and generates new sequences based on the learned character dependencies.  

### How It Works  

1. **Character-to-Index Mapping**  
   - The input text is encoded into numerical form using a **character-to-index dictionary**.  
   - Each character is assigned a unique index, allowing the model to process textual data in numerical format.  

2. **One-Hot Encoding for Input Sequences**  
   - The dataset is prepared using one-hot encoding, where each character is represented as a vector with a single "1" at the index corresponding to the character.  
   - A **custom PyTorch Dataset class** is defined to generate input-target pairs from the text.  
   - The **input sequence** consists of `seq_length` characters, and the **target** is the next character in the sequence.  

3. **Defining the RNN Model**  
   - The model consists of:  
     - An **RNN layer** (`nn.RNN`) that processes input sequences and maintains hidden states.  
     - A **fully connected layer** (`nn.Linear`) that maps the hidden state output to character predictions.  

4. **Training the Model**  
   - A **cross-entropy loss function** (`nn.CrossEntropyLoss`) measures the prediction error between the model's output scores and the true next character.  
   - The **Adam optimizer** updates model parameters.  
   - The model is trained for **100 epochs**, adjusting weights to minimize loss.  

5. **Text Generation**  
   - A **text generation function** is implemented to predict the next character given an initial seed text.  
   - The model iteratively generates new characters by selecting the most probable next character based on learned patterns.  
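
To make the data preparation in steps 1 and 2 concrete, the following sketch builds the character mapping, the one-hot vectors, and the input-target pairs for a very short string. It uses plain Python lists instead of PyTorch tensors, and the sample string and the helper `one_hot` are assumptions made for illustration only.

```python
sample = "fox is"

# Assign each distinct character an index in order of first appearance.
char_to_idx = {}
for ch in sample:
    char_to_idx.setdefault(ch, len(char_to_idx))

encoded = [char_to_idx[ch] for ch in sample]
seq_length = 3

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Each pair: seq_length one-hot vectors -> index of the character that follows.
pairs = [
    ([one_hot(i, len(char_to_idx)) for i in encoded[start:start + seq_length]],
     encoded[start + seq_length])
    for start in range(len(encoded) - seq_length)
]

print(char_to_idx)   # {'f': 0, 'o': 1, 'x': 2, ' ': 3, 'i': 4, 's': 5}
print(len(pairs))    # 3
print(pairs[0])      # one-hot rows for 'f', 'o', 'x' and target 3 (the space)
```

A string of six characters with `seq_length = 3` yields three training pairs, which matches the `len(self.data) - self.seq_length` rule used by the dataset class.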

### Theoretical Background  

**Recurrent Neural Networks (RNNs)** are designed for **sequential data processing**, where each output depends on previous inputs. Unlike traditional feedforward networks, RNNs maintain a **hidden state** that retains past information, making them effective for modeling sequences.  
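
The role of the hidden state can be sketched with the basic recurrence h_t = tanh(W_xh x_t + W_hh h_(t-1) + b). The NumPy code below is a simplified illustration with random, untrained weights; the sizes and variable names are assumptions, and `nn.RNN` differs in details such as keeping two separate bias vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 5

# Randomly initialised parameters (illustrative, untrained).
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

inputs = rng.normal(size=(seq_len, input_size))
h = np.zeros(hidden_size)  # initial hidden state

hidden_states = []
for x_t in inputs:
    # The new state depends on the current input and on the previous state.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)
print(hidden_states.shape)  # (5, 3)
```

Because `h` is fed back into every step, the state after the last input is a function of the whole sequence, which is what the final linear layer reads in the model below.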

However, **basic RNNs** suffer from **vanishing gradients**: during backpropagation through time the gradient is multiplied by a per-step factor at every time step, so over long sequences it can shrink toward zero and the influence of early inputs is effectively not learned. More advanced architectures like **LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units)** mitigate this issue with gating mechanisms that help preserve information over longer spans.  
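
The shrinking effect can be seen in a one-dimensional RNN, h_t = tanh(w * h_(t-1) + x_t), where the derivative of each step with respect to the previous state is w * (1 - h_t^2). The sketch below multiplies these factors together; the weight and input values are arbitrary assumptions chosen for illustration.

```python
import math

w = 0.5      # recurrent weight (assumed value)
h = 0.0      # initial hidden state
grad = 1.0   # d h_t / d h_0, accumulated with the chain rule
grads = []

for t in range(30):
    x_t = 0.1  # constant input, chosen arbitrarily
    h = math.tanh(w * h + x_t)
    # Local derivative of this step with respect to the previous state.
    grad *= w * (1.0 - h * h)
    grads.append(grad)

# Gradient of the state after 1, 10 and 30 steps with respect to the initial state.
print(grads[0], grads[9], grads[29])
```

Each factor is smaller than 0.5 here, so after 30 steps the gradient is below 0.5 ** 30, which is why the earliest inputs contribute almost nothing to the weight updates.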

### Applications  

- **Text Generation:** Creating text sequences based on learned character dependencies.  
- **Speech Recognition:** Converting spoken language into written text.  
- **Machine Translation:** Translating text between languages.  
- **Time Series Forecasting:** Predicting future values in time-dependent data.  

### Advantages  

- **Simple to implement** and computationally efficient for short sequences.  
- **Effective for learning short-term dependencies in sequential data.**  

### Limitations  

- **Difficulty capturing long-term dependencies** due to vanishing gradients.  
- **Limited performance compared to LSTM and GRU models**, which are better suited for longer sequences.  

### Additional References  

- [PyTorch RNN Documentation](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html)  
- [Understanding LSTM Networks (Colah's Blog)](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)  
- [Sequence Modeling with RNNs](https://www.deeplearningbook.org/contents/rnn.html)  

In [None]:
### Step 5: Simple RNN Model (PyTorch)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from collections import defaultdict

# Define the SimpleRNN model
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.rnn(x)
        out = self.fc(out[:, -1, :])
        return out

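# Preprocessing: Create a character-to-index mapping
char_to_idx = defaultdict(lambda: len(char_to_idx))
text_data = text
encoded_text = [char_to_idx[char] for char in text_data]
# Freeze the mapping so unseen characters raise KeyError instead of silently growing the vocabulary
char_to_idx = dict(char_to_idx)
vocab_size = len(char_to_idx)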

# Prepare dataset
class TextDataset(Dataset):
    def __init__(self, data, seq_length=3):
        self.data = data
        self.seq_length = seq_length

    def __len__(self):
        return len(self.data) - self.seq_length

    def __getitem__(self, idx):
        # One-hot encode the input sequence: shape (seq_length, vocab_size)
        input_seq = torch.zeros(self.seq_length, vocab_size)
        for i, char_idx in enumerate(self.data[idx:idx + self.seq_length]):
            input_seq[i, char_idx] = 1
        return (
            input_seq,  # the DataLoader adds the batch dimension
            torch.tensor(self.data[idx + self.seq_length])  # index of the next character
        )

# Training the SimpleRNN model
hidden_size = 10
dataset = TextDataset(encoded_text)
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
rnn_model = SimpleRNN(vocab_size, hidden_size, vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(rnn_model.parameters(), lr=0.01)

# Training loop
for epoch in range(100):
    for seq, target in dataloader:
        optimizer.zero_grad()
        output = rnn_model(seq)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

# Text Generation Function
def generate_text(model, start_text, length=10):
    model.eval()
    # Reverse lookup built once instead of searching the dict on every step
    idx_to_char = {idx: char for char, idx in char_to_idx.items()}
    input_seq = torch.tensor([char_to_idx[char] for char in start_text]).unsqueeze(0)
    generated_text = start_text
    for _ in range(length):
        with torch.no_grad():
            # One-hot encode the current window: shape (1, window_length, vocab_size)
            input_seq_onehot = torch.zeros(1, input_seq.shape[1], vocab_size)
            for i, char_idx in enumerate(input_seq[0]):
                input_seq_onehot[0, i, char_idx.item()] = 1
            output = model(input_seq_onehot)
            # Greedy decoding: pick the highest-scoring next character
            predicted_idx = torch.argmax(output, dim=1).item()
            generated_text += idx_to_char[predicted_idx]
            # Slide the window: drop the oldest character, append the prediction
            input_seq = torch.cat([input_seq[:, 1:], torch.tensor([[predicted_idx]])], dim=1)
    return generated_text

print("Generated text:", generate_text(rnn_model, "fox", length=100))

## Step 6: LSTM Model (TensorFlow/Keras)  

This step implements an **LSTM-based neural network** using **TensorFlow/Keras** for **word-level text generation**. The model learns word sequences from input text and generates new sentences by predicting the next word based on previous words.  

### How It Works  

1. **Tokenization and Sequence Conversion**  
   - The input text is tokenized using **Keras' `Tokenizer`**, which assigns a unique index to each word.  
   - The text is converted into numerical sequences based on these word indices.  

2. **Creating Input-Output Pairs**  
   - A fixed **sequence length (`seq_length = 3`)** is defined.  
   - Input sequences of `seq_length` words are extracted, with the next word as the target (`y`).  
   - These pairs are stored in NumPy arrays (`X` and `y`) for training.  

3. **Defining the LSTM Model**  
   - **Embedding Layer (`Embedding`)**: Converts word indices into dense vector representations.  
   - **LSTM Layers (`LSTM`)**: Two stacked LSTM layers process sequential input while retaining long-term dependencies.  
     - The first LSTM layer returns sequences (`return_sequences=True`), allowing the second LSTM layer to process the entire sequence.  
   - **Dense Output Layer (`Dense`)**: Uses **softmax activation** to output a probability distribution over possible next words.  

4. **Compiling and Training the Model**  
   - The model is compiled using **sparse categorical cross-entropy loss**, which is efficient for integer-labeled classification.  
   - The **Adam optimizer** is used to update model parameters.  
   - The model is trained for **100 epochs**, adjusting weights to improve accuracy.  

5. **Text Generation**  
   - A **text generation function** takes a seed phrase and predicts new words iteratively.  
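   - The function tokenizes the input, pads or truncates it to the sequence length, and predicts the most likely next word using the trained model. Words that are not in the training vocabulary (such as "jumping" in the example seed) are dropped by the tokenizer before prediction.  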
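
The seed handling in step 5 can be illustrated without TensorFlow. The sketch below re-implements, in plain Python, what `pad_sequences(..., maxlen=seq_length, padding='pre')` does with its default truncation, and how unknown words are skipped when no out-of-vocabulary token is configured. The word indices and the helpers `pad_pre` and `encode` are hypothetical.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items, left-padding with `value` if too short."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

# Hypothetical word indices; index 0 is reserved for padding.
word_index = {"the": 1, "fox": 2, "quick": 3, "brown": 4, "jumps": 5}

def encode(text):
    # Words missing from the vocabulary are skipped, mirroring the Keras
    # Tokenizer when no out-of-vocabulary token is configured.
    return [word_index[w] for w in text.lower().split() if w in word_index]

seq_length = 3
print(pad_pre(encode("the jumping"), seq_length))                # [0, 0, 1]
print(pad_pre(encode("the quick brown fox jumps"), seq_length))  # [4, 2, 5]
```

With the seed "the jumping", only "the" survives tokenization, so the model effectively predicts from a single word preceded by padding.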

### Theoretical Background  

**Long Short-Term Memory (LSTM)** networks are an advanced type of **Recurrent Neural Network (RNN)** specifically designed to address the **vanishing gradient problem** in traditional RNNs.  

- **Vanishing Gradient Issue**: In standard RNNs, gradients diminish over long sequences, making it difficult to learn dependencies from earlier words.  
- **LSTM Solution**: LSTMs introduce **gates (input, forget, and output gates)** that control how much past information is retained or forgotten, making them better at capturing **long-term dependencies** in sequential data.  
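
The gating described above can be sketched as a single LSTM time step in NumPy. This is an illustrative, untrained cell; the parameter layout (gates stacked in the order input, forget, output, candidate) is an arbitrary assumption and differs from the internal layouts used by Keras and PyTorch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters for the
    input (i), forget (f), output (o) gates and the candidate (g)."""
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b           # shape: (4 * hidden,)
    i = sigmoid(z[0 * hidden:1 * hidden])  # input gate
    f = sigmoid(z[1 * hidden:2 * hidden])  # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate cell values
    c_t = f * c_prev + i * g               # keep part of the old memory, add new
    h_t = o * np.tanh(c_t)                 # expose a gated view of the memory
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W = rng.normal(scale=0.5, size=(4 * hidden_size, input_size))
U = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # (3,) (3,)
```

The cell state `c_t` is updated additively: when the forget gate is close to 1 and the input gate close to 0, the previous memory passes through almost unchanged, which is what lets information and gradients survive over many steps.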

### Applications  

- **Text Generation:** Producing coherent text based on learned word patterns.  
- **Speech Recognition:** Converting spoken language into written text.  
- **Machine Translation:** Improving translation quality using context-aware sequences.  
- **Sentiment Analysis:** Understanding sentiment in text by analyzing word sequences.  

### Advantages  

- **Better at learning long-term dependencies** compared to simple RNNs.  
- **More stable training on longer sequences**, because gating mitigates vanishing gradients (exploding gradients can still occur and are usually handled with gradient clipping).  

### Limitations  

- **Higher computational cost** due to complex memory mechanisms.  
- **Longer training times** compared to basic RNNs.  

### Additional References  

- [TensorFlow LSTM Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)  
- [Understanding LSTMs – Colah’s Blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)  
- [Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank (Socher et al., EMNLP 2013)](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf)  

In [None]:
### Step 6: LSTM Model (TensorFlow/Keras)
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample text data
text = "The quick brown fox jumps over the lazy dog. The fox is clever and quick."

# Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
vocab_size = len(tokenizer.word_index) + 1

# Convert text to sequences
sequences = tokenizer.texts_to_sequences([text])[0]
seq_length = 3  # Define input sequence length

# Create input-output pairs
X, y = [], []
for i in range(len(sequences) - seq_length):
    X.append(sequences[i:i+seq_length])
    y.append(sequences[i+seq_length])

X, y = np.array(X), np.array(y)

# Define LSTM model
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=64, input_length=seq_length),
    LSTM(64, return_sequences=True),
    LSTM(64),
    Dense(vocab_size, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# Train the model
model.fit(X, y, epochs=100, verbose=1)

# Text generation function
def generate_text(seed_text, next_words=5):
    for _ in range(next_words):
        tokenized = tokenizer.texts_to_sequences([seed_text])[0]
        # Keep the last seq_length tokens, left-padding with 0 if the seed is shorter
        tokenized = pad_sequences([tokenized], maxlen=seq_length, padding='pre')
        predicted_idx = int(np.argmax(model.predict(tokenized, verbose=0), axis=-1)[0])
        # index_word is the reverse of word_index; index 0 (padding) has no word
        word = tokenizer.index_word.get(predicted_idx)
        if word is not None:
            seed_text += " " + word
    return seed_text

# Generate text
print("Generated text:", generate_text("the jumping"))

## Step 7: Transformer Model (Hugging Face GPT-2)  

This step demonstrates how to use a **pre-trained GPT-2 model** from Hugging Face’s `transformers` library for **text generation**. Unlike traditional RNN-based models, **GPT-2 leverages the Transformer architecture**, which enables **better context retention** and more **coherent text generation**.  

### How It Works  

1. **Loading the Pre-Trained Model and Tokenizer**  
   - `GPT2Tokenizer` is used to tokenize input text and convert it into numerical format.  
   - `GPT2LMHeadModel` is loaded with pre-trained GPT-2 weights to generate text.  

2. **Text Generation Function**  
   - The input text is **encoded** into token IDs using `tokenizer.encode()`.  
   - The model generates new tokens one at a time, using **self-attention** over the tokens so far to score candidates for the next token (GPT-2 tokens are subword units, not whole words).  
   - The generated tokens are **decoded back to human-readable text** using `tokenizer.decode()`.  

3. **Generating New Text**  
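   - The function takes a **prompt** (starting text) and generates a continuation; `max_length=50` caps the **total** length, prompt included, at 50 tokens.  
   - Because each new token is conditioned on the prompt and on everything generated so far, the continuation usually **follows on from the prompt**, although this is not guaranteed.  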
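
The generation loop can be sketched independently of GPT-2. In the toy example below, a hypothetical lookup table named `next_token_scores` stands in for the language model, and the loop performs greedy decoding (always taking the top-scoring token), which is what `generate` is expected to do when no sampling or beam-search options are passed. The tokens and scores are invented for illustration.

```python
def next_token_scores(tokens):
    """Hypothetical stand-in for a language model: scores candidate
    next tokens given the tokens so far (here, only the last one)."""
    table = {
        "the": {"quick": 0.6, "lazy": 0.4},
        "quick": {"brown": 0.9, "fox": 0.1},
        "brown": {"fox": 1.0},
        "fox": {"<eos>": 1.0},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})

def greedy_generate(prompt_tokens, max_length):
    tokens = list(prompt_tokens)
    # max_length bounds the total length, prompt included.
    while len(tokens) < max_length:
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # greedy: take the top-scoring token
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

print(greedy_generate(["the"], max_length=50))  # ['the', 'quick', 'brown', 'fox']
print(greedy_generate(["the"], max_length=2))   # ['the', 'quick']
```

The second call shows why a long prompt leaves less room for new text: the limit applies to the prompt and the continuation together.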

### Theoretical Background  

**Transformers** (Vaswani et al., 2017) introduced the **self-attention mechanism**, which allows models to consider **the entire context** at once, rather than processing words sequentially (as in RNNs or LSTMs).  

- **Self-Attention**: Each token in a sequence **attends to other tokens** directly, helping the model capture **long-range dependencies**.  
- **Decoder-Only Architecture**: GPT-2 is a decoder-only Transformer. Its attention is **causally masked**, so each token attends only to itself and earlier tokens, and text is generated **autoregressively**, one token at a time.  
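
A single attention head with a causal mask can be sketched in NumPy as follows. This is a simplified illustration with random weights and no multi-head split, layer normalization, or positional encoding; the sizes and the function name `causal_self_attention` are assumptions.

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention with a causal mask."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len)
    # Mask out future positions so token t only sees tokens 0..t.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (subtract the max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = causal_self_attention(X, W_q, W_k, W_v)
print(output.shape)          # (4, 8)
print(np.round(weights, 2))  # lower-triangular: no weight on future tokens
```

Every row of `weights` sums to 1, and the first token can only attend to itself, so its row is a single 1 followed by zeros.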

### Applications  

- **Conversational AI (Chatbots, Virtual Assistants)**  
- **Content Generation (Story, Blog, and Code Completion)**  
- **Summarization and Paraphrasing**  
- **Machine Translation**  

### Advantages  

- **Generates high-quality, coherent, and contextually relevant text.**  
- **No need for training from scratch**—leverages large-scale pre-trained models.  
- **Handles long-range dependencies** better than RNNs and LSTMs.  

### Limitations  

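- **Computationally expensive** at larger model sizes; the base `gpt2` checkpoint (roughly 124M parameters) can run on a CPU, but GPUs or TPUs help considerably for fast inference and for the larger variants.  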
- **Can generate biased or misleading text**, as it is trained on large-scale internet data.  
- **May produce unexpected outputs** if prompt formatting or pre-processing is incorrect.  

### Additional References  

- [Hugging Face GPT-2 Documentation](https://huggingface.co/docs/transformers/model_doc/gpt2)  
- [Original Transformer Paper (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762)  
- [OpenAI’s GPT-2 Release Blog](https://openai.com/research/gpt-2)  

In [None]:
### Step 7: Transformer Model (Hugging Face GPT-2)
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

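def generate_text(prompt):
    # Calling the tokenizer returns both input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=50,  # total length in tokens, prompt included
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS to avoid a warning
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)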

print("GPT-2 Output:", generate_text("The quick br"))