<a href="https://colab.research.google.com/github/Aditya100300/LLMs_from_scratch/blob/main/Module%201/Module_1_Foundation_LLM_Maven_v2_module_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Language Models: Trigram Model & GPT-2 Generation

In this notebook, I explore two approaches to language modeling:

1. **Trigram Language Model with NLTK**  
   - Build a simple count-based trigram model using the Reuters corpus.
   - Convert trigram counts into probabilities.
   - Query the model to understand word prediction based on two previous words.

2. **Text Generation with GPT-2**  
   - Use a pre-trained GPT-2 model from the `pytorch-transformers` library (the earlier name of Hugging Face's `transformers` package).
   - Predict the next token for a given input.
   - Extend predictions to generate multiple tokens.

Each section is broken down into detailed steps with bullet points for clarity.


In [1]:
# Cell: Install NLTK
# This command installs the nltk library quietly.
# - The "quiet" flag reduces the installation output.
!pip install nltk --quiet

## Part 1: Trigram Language Model with NLTK

In this section, we will:
- **Import** required libraries from NLTK and Python's collections.
- **Download** the Reuters corpus and tokenizer data.
- **Build** a trigram model:
  - Iterate over sentences in the corpus.
  - Create trigrams with sentence padding.
  - Count occurrences of each trigram.
- **Convert** the counts into probabilities.
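
The padding and counting steps listed above can be sketched in plain Python on a toy corpus, without NLTK. This is a minimal illustration, not the notebook's implementation: `padded_trigrams`, `count_trigrams`, and `toy_sentences` are hypothetical names, and the helper mimics the way `trigrams(..., pad_left=True, pad_right=True)` pads each sentence with two `None` markers on each side.

```python
from collections import defaultdict


def padded_trigrams(tokens):
    """Yield (w1, w2, w3) tuples, padding with two None markers on each side."""
    padded = [None, None] + list(tokens) + [None, None]
    for i in range(len(padded) - 2):
        yield tuple(padded[i:i + 3])


def count_trigrams(sentences):
    """Map each (w1, w2) context to a dict of next-word counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        for w1, w2, w3 in padded_trigrams(sentence):
            counts[(w1, w2)][w3] += 1
    return counts


# Toy corpus (an assumption for illustration; the notebook uses Reuters)
toy_sentences = [
    ["the", "price", "of", "oil"],
    ["the", "price", "of", "gold"],
    ["the", "price", "rose"],
]
toy_counts = count_trigrams(toy_sentences)
print(dict(toy_counts[("the", "price")]))  # {'of': 2, 'rose': 1}
```

The padding lets the model learn which words start a sentence (context `(None, None)`) and which words end one (next word `None`).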


In [3]:
# Import necessary libraries
import nltk
from nltk.corpus import reuters
from nltk import trigrams
from collections import Counter, defaultdict

# Download required NLTK data
nltk.download('reuters')
nltk.download('punkt')
# Download the missing punkt_tab data package.
nltk.download('punkt_tab')


# Create a placeholder for the language model
# Using nested defaultdict to automatically handle new keys
model = defaultdict(lambda: defaultdict(lambda: 0))

# Build the trigram model
for sentence in reuters.sents():
    # Generate trigrams from each sentence
    # pad_left and pad_right add None markers before and after the sentence
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        # Increment the count for this trigram
        model[(w1, w2)][w3] += 1

# Convert frequency counts to probabilities
for w1_w2 in model:
    # Calculate total count for this bigram
    total_count = float(sum(model[w1_w2].values()))

    # Convert each count to a probability
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

# At this point, model contains the probabilities of each word (w3)
# given the previous two words (w1, w2)

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


### Converting Counts to Probabilities

The final loop in the cell above converts the frequency counts into probabilities for each trigram.
- For each bigram context `(w1, w2)`:
  - Sum the counts of all following words `w3`.
  - Divide each individual count by the total to get the probability.
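
In other words, the estimate is the relative frequency P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2, any word). A minimal sketch of that division follows, with made-up counts. The `normalize` helper is hypothetical, and unlike the notebook's loop, which overwrites the counts in place, it returns a new dictionary.

```python
def normalize(next_word_counts):
    """Turn next-word counts for one context into relative frequencies."""
    total = sum(next_word_counts.values())
    if total == 0:
        return {}  # no observations for this context
    return {word: count / total for word, count in next_word_counts.items()}


# Toy counts for the context ("the", "price"); values are assumed for illustration
counts_after_the_price = {"of": 2, "rose": 1}
probs = normalize(counts_after_the_price)
print(probs)
```

The probabilities for one context always sum to 1, and any word never seen after that context gets probability 0, which is the limitation that smoothing techniques address.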


### Querying the Trigram Model

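Let's query the trigram statistics:
- We choose the bigram `("the", "price")` as the context.
- The helper in the next cell recounts the matching trigrams directly from the corpus, so it does not depend on the `model` dictionary built earlier.
- We sort the candidate next words by probability to see which are most likely to come after "the price".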
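
Because the `model` dictionary from the earlier cell already stores these probabilities, the same query could also be answered with a dictionary lookup instead of another pass over the corpus. The sketch below assumes a dictionary shaped like that model; `top_next_words` and `toy_model` are hypothetical names with made-up probabilities.

```python
def top_next_words(model, context, k=3):
    """Return up to k (word, probability) pairs for a context, most likely first."""
    # .get avoids silently inserting unseen contexts when model is a defaultdict
    next_words = model.get(context, {})
    ranked = sorted(next_words.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy model with assumed probabilities, keyed by (w1, w2) context
toy_model = {("the", "price"): {"of": 0.5, "rose": 0.25, "fell": 0.25}}

print(top_next_words(toy_model, ("the", "price"), k=2))  # [('of', 0.5), ('rose', 0.25)]
print(top_next_words(toy_model, ("unseen", "context")))  # []
```

Using `.get` rather than `model[context]` matters with a nested `defaultdict`: indexing an unseen context would create an empty entry as a side effect.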


In [14]:
from collections import Counter
from nltk import trigrams

# Assumes `reuters` was imported and downloaded in the earlier cell

# Function to get probabilities given preceding words
def get_probabilities(preceding_words):
    probabilities = Counter()
    for sentence in reuters.sents():
        for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
            if (w1, w2) == preceding_words:
                probabilities[w3] += 1
    total_count = sum(probabilities.values())
    if total_count > 0:
        for word in probabilities:
            probabilities[word] /= total_count
    return probabilities

# Get probabilities for "the price"
preceding_words = ("the", "price")
probabilities = get_probabilities(preceding_words)

# Sort and print
sorted_probabilities = probabilities.most_common()  # Sort by probability (descending order)
print("Most probable words following 'the price', in order:")
for word, prob in sorted_probabilities:
    print(f"{word}: {prob}")

Most probable words following 'the price', in order:
of: 0.3209302325581395
it: 0.05581395348837209
to: 0.05581395348837209
for: 0.05116279069767442
.: 0.023255813953488372
at: 0.023255813953488372
adjustment: 0.023255813953488372
is: 0.018604651162790697
,: 0.018604651162790697
paid: 0.013953488372093023
increases: 0.013953488372093023
per: 0.013953488372093023
the: 0.013953488372093023
will: 0.013953488372093023
cut: 0.009302325581395349
cuts: 0.009302325581395349
(: 0.009302325581395349
differentials: 0.009302325581395349
has: 0.009302325581395349
stayed: 0.009302325581395349
was: 0.009302325581395349
freeze: 0.009302325581395349
increase: 0.009302325581395349
would: 0.009302325581395349
yesterday: 0.004651162790697674
effect: 0.004651162790697674
used: 0.004651162790697674
climate: 0.004651162790697674
reductions: 0.004651162790697674
limit: 0.004651162790697674
now: 0.004651162790697674
moved: 0.004651162790697674
adjustments: 0.004651162790697674
slumped: 0.004651162790697674
mov

## Observations from the Trigram Model

- The model leverages **simple counting** to build a probability distribution.
- It demonstrates how **local context** (the previous two words) can influence the prediction of the next word.
- Although basic, this method introduces fundamental concepts of language modeling and probability estimation.


## Part 2: Text Generation with GPT-2

In this section, we'll generate text using the pre-trained GPT-2 model:
- **Install** and import the necessary libraries.
- **Load** the GPT-2 tokenizer and model.
- **Encode** an input sentence.
- **Predict** the next token.
- **Extend** the prediction to generate a longer text sequence.


In [15]:
!pip install pytorch-transformers --quiet


### Setting Up GPT-2

Let's set up the GPT-2 model:
- **Import** torch and GPT-2 classes.
- **Load** the tokenizer which converts text to tokens.
- **Encode** a given input text into tokens.
- **Load** the pre-trained GPT-2 model and set it to evaluation mode.
- **Move** the model and input data to GPU if available for faster processing.
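
To make the encode step concrete, here is a toy round trip from text to token ids and back. It is only an analogy: `ToyTokenizer` is a hypothetical word-level tokenizer with a four-word vocabulary, whereas GPT-2's tokenizer uses byte-level BPE subwords, so its ids and token boundaries differ.

```python
class ToyTokenizer:
    """Hypothetical word-level tokenizer; GPT-2 itself uses byte-level BPE subwords."""

    def __init__(self, vocabulary):
        self.token_to_id = {token: i for i, token in enumerate(vocabulary)}
        self.id_to_token = {i: token for token, i in self.token_to_id.items()}

    def encode(self, text):
        """Split on whitespace and map each word to its integer id."""
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids):
        """Map ids back to words and join them with spaces."""
        return " ".join(self.id_to_token[i] for i in ids)


toy_tokenizer = ToyTokenizer(["I", "am", "thinking", "of"])
ids = toy_tokenizer.encode("I am thinking")
print(ids)                        # [0, 1, 2]
print(toy_tokenizer.decode(ids))  # I am thinking

# Wrapping the ids in another list adds a batch dimension of size 1,
# which is what torch.tensor([indexed_tokens]) does in the notebook.
batch = [ids]
```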


In [16]:
# Import required libraries
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')




In [17]:
# Define the input text
text = "I am thinking"
print(f"Input text: {text}")

# Encode the input text
indexed_tokens = tokenizer.encode(text)

# Convert indexed tokens to a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

# Load pre-trained model (weights)
# Note: this rebinds `model`, replacing the trigram dictionary from Part 1
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model to evaluation mode to deactivate the dropout modules
model.eval()

# Check if CUDA is available and move model and tensors to GPU if possible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokens_tensor = tokens_tensor.to(device)
model.to(device)

print(f"Using device: {device}")




Input text: I am thinking




Using device: cuda


### Predicting the Next Token with GPT-2

Using the input text, we now predict the next token:
- **Disable gradients** since we're only doing inference.
- **Pass** the input tensor through the model to get predictions.
- **Extract** the most likely token from the output.
- **Decode** the predicted token back into text and append it to the input.
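
The extraction step relies on the shape of the model output: one score (logit) per vocabulary entry, for every position, for every batch item. The sketch below mirrors the indexing `predictions[0, -1, :]` with nested Python lists in place of a tensor. The numbers in `toy_logits` are made up, and the vocabulary has only four entries.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits with shape [batch=1][sequence=3][vocab=4]
toy_logits = [[
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 2.0, 0.1, 1.0],
]]

# First batch item, last position: the scores for the token that comes next
last_position = toy_logits[0][-1]
predicted_index = max(range(len(last_position)), key=lambda i: last_position[i])
probs = softmax(last_position)

print(predicted_index)  # 1
print(probs)
```

Softmax preserves the ordering of the scores, so taking the argmax of the raw logits, as the notebook does, selects the same token as taking the argmax of the probabilities.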


In [18]:
# Predict next token
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# Get the predicted next sub-word (token)
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_token = tokenizer.decode([predicted_index])

# Add the predicted token to the original text
predicted_text = text + predicted_token

# Print the results
print(f"Predicted next token: '{predicted_token}'")
print(f"Complete predicted text: '{predicted_text}'")

Predicted next token: ' of'
Complete predicted text: 'I am thinking of'


### Extending Text Generation: Predicting Multiple Tokens

We can extend the generation process to predict a sequence of tokens:
- **Loop** for a defined number of tokens.
- **Update** the input with each new token.
- **Generate** a longer piece of text. Because each step greedily picks the single most likely token, the output can fall into a repeating loop.
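
The loop structure, and the way always taking the top-scoring token can produce cycles, can be sketched with a toy stand-in for the model. Everything here is hypothetical: `toy_next_scores` is a hand-written lookup table that conditions only on the last token, whereas GPT-2 conditions on the whole prefix.

```python
# Hypothetical lookup table standing in for a model: last token -> next-token scores
toy_next_scores = {
    "I": {"am": 0.9, "was": 0.1},
    "am": {"thinking": 0.8, "here": 0.2},
    "thinking": {"of": 0.7, "about": 0.3},
    "of": {"books": 0.6, "you": 0.4},
    "books": {".": 1.0},
    ".": {"I": 1.0},
}


def greedy_generate(start_tokens, num_new_tokens):
    """Append the highest-scoring next token, one step at a time."""
    tokens = list(start_tokens)
    for _ in range(num_new_tokens):
        scores = toy_next_scores[tokens[-1]]
        tokens.append(max(scores, key=scores.get))
    return tokens


generated = greedy_generate(["I", "am", "thinking"], 9)
print(" ".join(generated))  # I am thinking of books . I am thinking of books .
```

Once the deterministic choice returns to a state it has visited before, the same continuation is picked again, which is one reason sampling strategies are often used in place of pure argmax.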


In [20]:
# Optional: Generate multiple next tokens
num_tokens_to_generate = 100
generated_text = text

for _ in range(num_tokens_to_generate):
    # Encode all text generated so far
    indexed_tokens = tokenizer.encode(generated_text)
    tokens_tensor = torch.tensor([indexed_tokens]).to(device)

    # Predict next token
    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]

    # Get the predicted next token
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    predicted_token = tokenizer.decode([predicted_index])

    # Add the predicted token to the generated text
    generated_text += predicted_token

print(f"\nGenerated text with {num_tokens_to_generate} additional tokens:")
print(generated_text)


Generated text with 100 additional tokens:
I am thinking of doing a book about the history of the United States. I am thinking of doing a book about the history of the United States. I am thinking of doing a book about the history of the United States. I am thinking of doing a book about the history of the United States. I am thinking of doing a book about the history of the United States. I am thinking of doing a book about the history of the United States. I am thinking of doing a book about the history of the United


## Final Thoughts & Next Steps

- **Trigram Model:**  
  - Demonstrates basic language modeling using count-based probabilities.
  - Reinforces how local context affects word prediction.
- **GPT-2 Generation:**  
  - Uses a pre-trained neural network to predict one token at a time; with greedy decoding the output is fluent but quickly becomes repetitive.
  - Shows the difference between count-based n-gram models and neural language models.

### What I Plan to Explore Next:
- Experiment with different input texts to see varied GPT-2 outputs.
- Adjust generation parameters (e.g., temperature, top-k sampling) for diverse results.
- Extend the trigram model with higher n-grams or smoothing techniques.
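
As a starting point for the generation-parameter experiment, temperature and top-k sampling can be sketched on a plain list of logits. This is a minimal illustration rather than the API of any library: `sample_next` and its parameters are hypothetical, and the logits are made up.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample an index after temperature scaling and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rank indices by score, then keep only the k best if top_k is given
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]
    # Lower temperature sharpens the distribution, higher temperature flattens it
    scaled = [logits[i] / temperature for i in indices]
    shift = max(scaled)
    weights = [math.exp(s - shift) for s in scaled]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)              # seeded for repeatable experiments

print(sample_next(toy_logits, top_k=1, rng=rng))  # 0, since top_k=1 is greedy decoding
samples = [sample_next(toy_logits, temperature=0.7, top_k=3, rng=rng) for _ in range(20)]
print(samples)
```

With `top_k=1` the function reduces to the argmax choice used in this notebook. Larger `top_k` and higher `temperature` trade predictability for variety.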

Feel free to modify and expand these experiments; I will keep building on them as I continue learning about language models!
