<a href="https://colab.research.google.com/github/TRach07/gpt2-tokenization-generation-analysis/blob/main/GPT2_Tokenization_Generation_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NLP Practical Work: Pre-trained Language Model Manipulation**

- **Objective**: Manipulate a pre-trained language model to understand text generation and explore the tokenization process.

- **Tools**: Python, Hugging Face Transformers, PyTorch

**1. Installation and Library Import**

In [None]:
# Install required libraries
!pip install transformers torch



In [None]:
# Import necessary libraries
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")

Libraries imported successfully!
PyTorch version: 2.8.0+cu126


**2. Loading a Pre-trained Language Model**

In [None]:
# Load GPT-2 model and tokenizer
model_name = "gpt2"  # Using the base GPT-2 model

print("Loading GPT-2 model and tokenizer...")
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()
print("Model and tokenizer loaded successfully!")

# Display model information
print(f"\nModel: {model_name}")
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

Loading GPT-2 model and tokenizer...


Model and tokenizer loaded successfully!

Model: gpt2
Vocabulary size: 50257
Model parameters: 124,439,808
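
The parameter count printed above can be cross-checked by hand. The sketch below is a worked calculation, not a call into the library: it assumes the published configuration of the base `gpt2` checkpoint (768 hidden units, 12 blocks, 1024 positions), and the helper names `linear`, `layer_norm`, and `block_params` are introduced here for illustration only.

```python
# Published configuration of the base "gpt2" checkpoint
# (assumed here, not read from the loaded model).
VOCAB_SIZE = 50257   # matches tokenizer.vocab_size printed above
N_POSITIONS = 1024   # maximum context length
N_EMBD = 768         # hidden size
N_LAYER = 12         # number of transformer blocks


def linear(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of one LayerNorm."""
    return 2 * n


def block_params(d):
    """Parameters in one transformer block of width d."""
    attention = linear(d, 3 * d) + linear(d, d)  # fused Q/K/V projection + output projection
    mlp = linear(d, 4 * d) + linear(4 * d, d)    # expand to 4*d, then project back to d
    return layer_norm(d) + attention + layer_norm(d) + mlp


# Token embeddings plus learned position embeddings
embeddings = VOCAB_SIZE * N_EMBD + N_POSITIONS * N_EMBD
# Embeddings + all blocks + the final LayerNorm
total = embeddings + N_LAYER * block_params(N_EMBD) + layer_norm(N_EMBD)

print(f"Embeddings: {embeddings:,}")
print(f"Per block:  {block_params(N_EMBD):,}")
print(f"Total:      {total:,}")
```

The total matches the printed figure only because the output head shares its weight matrix with the token embedding, so `model.parameters()` counts that matrix once.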


**3. Input Text Tokenization**

In [None]:
# Define input text
input_text = "The artificial intelligence system"

print("Original text:", input_text)

# Tokenize the input text
tokens = tokenizer.encode(input_text, return_tensors='pt')
token_ids = tokens[0].tolist()

print("\n=== Tokenization Results ===")
print(f"Token IDs: {token_ids}")
print(f"Number of tokens: {len(token_ids)}")

# Display token to ID mapping
print("\nToken to ID mapping:")
for i, token_id in enumerate(token_ids):
    token = tokenizer.decode([token_id])
    print(f"  Token {i+1}: ID={token_id}, Text='{token}'")

Original text: The artificial intelligence system

=== Tokenization Results ===
Token IDs: [464, 11666, 4430, 1080]
Number of tokens: 4

Token to ID mapping:
  Token 1: ID=464, Text='The'
  Token 2: ID=11666, Text=' artificial'
  Token 3: ID=4430, Text=' intelligence'
  Token 4: ID=1080, Text=' system'
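
GPT-2's tokenizer is a byte-level byte-pair encoding (BPE), which is why every token after the first carries its leading space. The following is a minimal sketch of the BPE merge loop in plain Python. The merge table `TOY_MERGES` is hypothetical and far smaller than GPT-2's real `merges.txt`, and the sketch works on characters rather than bytes.

```python
def bpe_encode(word, merge_ranks):
    """Split a word into characters, then apply the best-ranked merge until none applies."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Lower rank means the merge was learned earlier and is applied first
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no known merge is left
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, for illustration only (not GPT-2's actual merges)
TOY_MERGES = {
    ("s", "y"): 0,
    ("sy", "s"): 1,
    ("t", "e"): 2,
    ("te", "m"): 3,
    ("sys", "tem"): 4,
    (" ", "system"): 5,
}

print(bpe_encode(" system", TOY_MERGES))
print(bpe_encode(" systems", TOY_MERGES))
```

With this toy table, a word that is fully covered by merges collapses into a single token that includes the leading space, while an unseen suffix is left as a separate piece.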


**4. Text Generation**

In [None]:
# Generate text continuation
print("=== Basic Text Generation ===")
print(f"Input: '{input_text}'")

with torch.no_grad():  # Disable gradient calculation for inference
    # early_stopping is omitted: it only applies to beam search and is ignored here
    generated_output = model.generate(
        tokens,
        attention_mask=torch.ones_like(tokens),  # single unpadded prompt: attend to every token
        pad_token_id=tokenizer.eos_token_id,     # GPT-2 defines no pad token, so reuse EOS
        max_length=50,                           # total length, prompt tokens included
        num_return_sequences=1,
        no_repeat_ngram_size=2                   # never emit the same bigram twice
    )

# Decode and display generated text
generated_text = tokenizer.decode(generated_output[0], skip_special_tokens=True)
print(f"Generated: '{generated_text}'")



=== Basic Text Generation ===
Input: 'The artificial intelligence system'
Generated: 'The artificial intelligence system is able to predict the future, and it can also predict what the next generation of robots will look like.

The system, called the AI-Robot-AI (AI-R) system (AAR), is'
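
The call above relies on `no_repeat_ngram_size=2`, which forbids any bigram from appearing twice. The sketch below illustrates the idea in plain Python on word strings instead of token IDs. It is not the library's implementation, and `banned_next_tokens` is a name made up for this example.

```python
def banned_next_tokens(history, ngram_size):
    """Return the tokens that would complete an n-gram already present in history."""
    if ngram_size < 1 or len(history) < ngram_size:
        return set()  # no complete n-gram has been generated yet
    # The last n-1 tokens form the prefix of the n-gram being built
    prefix = tuple(history[len(history) - (ngram_size - 1):])
    banned = set()
    for i in range(len(history) - ngram_size + 1):
        ngram = tuple(history[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])  # emitting this token would repeat the n-gram
    return banned


history = ["the", "system", "is", "able", "to", "predict", "the"]
print(banned_next_tokens(history, ngram_size=2))
```

Because "the system" already occurred, "system" cannot follow the final "the". A constraint of this kind may be why the greedy output above does not fall into a loop.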


**5. Generation Parameters Exploration**

In [None]:
# Function to generate text with different parameters
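def generate_with_parameters(input_text, **kwargs):
    """Generate a continuation of input_text; kwargs are forwarded to model.generate."""
    # Calling the tokenizer returns both input_ids and attention_mask
    inputs = tokenizer(input_text, return_tensors='pt')
    # GPT-2 defines no pad token, so reuse EOS unless the caller overrides it
    kwargs.setdefault("pad_token_id", tokenizer.eos_token_id)

    with torch.no_grad():
        generated_output = model.generate(**inputs, **kwargs)

    return tokenizer.decode(generated_output[0], skip_special_tokens=True)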

*5.1 Varying Temperature Parameter*

In [None]:
print("=== Temperature Parameter Comparison ===")
print(f"Input: '{input_text}'\n")

temperatures = [0.1, 0.5, 1.0, 1.5]
for temp in temperatures:
    generated_text = generate_with_parameters(
        input_text,
        max_length=60,
        temperature=temp,
        do_sample=True,
        num_return_sequences=1
    )
    print(f"Temperature {temp}: {generated_text}\n")


=== Temperature Parameter Comparison ===
Input: 'The artificial intelligence system'




Temperature 0.1: The artificial intelligence system is able to predict the future, and it can also predict the future in real time.

The system is able to predict the future, and it can also predict the future in real time. The system is able to predict the future in real time. The system is able




Temperature 0.5: The artificial intelligence system is capable of making decisions about where to place a place, how to use the energy it generates, and how to use it in the future.

The system is also capable of making decisions about who to send to a place. The system can make decisions about how to use




Temperature 1.0: The artificial intelligence system is the result of the researchers working on its artificial brains. So far, so good. But the researchers think they've found the exact right way to do that, and now they'll have a way to make that work in a system that's much more complicated than its computers.

Temperature 1.5: The artificial intelligence system is a virtual organism with many thousands of neurons that work together. It may take 20-100 thousand artificial neurons at normal performance without any problems at all, although it might need much more of our human capabilities, just to put it simply, even if its memory processing would suffer,



*5.2 Varying Top-k Parameter*

In [None]:
print("=== Top-k Parameter Comparison ===")
print(f"Input: '{input_text}'\n")

top_k_values = [5, 20, 50, 100]
for top_k in top_k_values:
    generated_text = generate_with_parameters(
        input_text,
        max_length=60,
        top_k=top_k,
        do_sample=True,
        temperature=0.7,
        num_return_sequences=1
    )
    print(f"Top-k {top_k}: {generated_text}\n")


=== Top-k Parameter Comparison ===
Input: 'The artificial intelligence system'




Top-k 5: The artificial intelligence system is designed to be a 'real' human, rather than just a 'computer,' but a 'computer' that can be 'computerized' and 'programmed,' which is what the artificial intelligence system does. It's not a 'computer,' because it's not. It




Top-k 20: The artificial intelligence system in question is known as Machine Learning. This artificial intelligence system was developed by the University of Maryland's Artificial Intelligence Research Center and is based on the research in the field of artificial intelligence.

In a statement, the company said it was aware of the problem and was "working




Top-k 50: The artificial intelligence system was able to predict the speed of its own movements and adjust it accordingly.

The researchers said that it would be possible to build a system that could track and identify when a car is in motion.

The machine would also be able to track and collect details about all

Top-k 100: The artificial intelligence system is able to determine which of the two fields of interest is best suited for the job and which to leave in the hands of the AI-led group.

"We are still working on the first iteration of the AI, which will also have a small amount of human-



*5.3 Combined Parameters*

In [None]:
print("=== Combined Parameters ===")
print(f"Input: '{input_text}'\n")

# Test different combinations
configurations = [
    {"temperature": 0.3, "top_k": 10, "max_length": 50},
    {"temperature": 0.8, "top_k": 50, "max_length": 50},
    {"temperature": 1.2, "top_k": 100, "max_length": 50},
]

for i, config in enumerate(configurations, 1):
    generated_text = generate_with_parameters(
        input_text,
        do_sample=True,
        **config
    )
    print(f"Configuration {i} (temp={config['temperature']}, top_k={config['top_k']}):")
    print(f"{generated_text}\n")


=== Combined Parameters ===
Input: 'The artificial intelligence system'




Configuration 1 (temp=0.3, top_k=10):
The artificial intelligence system can predict the future and predict the future. It can also predict the future and predict the future.

The AI system can predict the future and predict the future. It can also predict the future and predict the future.





Configuration 2 (temp=0.8, top_k=50):
The artificial intelligence system was developed with the help of some of the most important people of the planet: the scientists who developed it, and the scientists who created it. Now, these scientists are back in the lab, working in a new way: they

Configuration 3 (temp=1.2, top_k=100):
The artificial intelligence system has been applied since 2003 to search Google," said Michael Martin, a consultant expert at CapitalBiz, which helps companies develop digital training services for defense analysts. "The idea of this problem of trust at work has so far been



**6. Analysis Questions**

*6.1 Parameter Effects Analysis*

**Temperature Impact:**

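When temperature increases (0.1 → 1.5):

- Low temperature (0.1): Sampling is close to greedy decoding, so the output is nearly deterministic and repetitive. The run above loops on "The system is able to predict the future". Each sentence is grammatical, but the text adds no new information.

- Medium temperature (0.5-1.0): Balances coherence and diversity. The wording varies more from sentence to sentence while still following a logical thread.

- High temperature (1.5): Flattens the next-token distribution, so unlikely tokens are sampled more often. The output is more inventive (for example "a virtual organism with many thousands of neurons") but the sentences ramble and lose focus.

These runs do not set `top_k`, so `generate` falls back to the library default (`top_k=50` in Transformers 4.x). Temperature therefore reshapes the probabilities of the 50 most likely candidates rather than of the full vocabulary.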
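
The temperature effect described above comes from dividing the logits by the temperature before the softmax. The following is a minimal sketch in plain Python. The four `toy_logits` values are invented for illustration and do not come from GPT-2.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for four candidate next tokens
toy_logits = [2.0, 1.0, 0.5, 0.0]

for t in [0.1, 0.5, 1.0, 1.5]:
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At T=0.1 almost all of the probability mass sits on the top-scoring token, which corresponds to the near-greedy, repetitive behaviour. As T grows, the mass spreads toward the lower-scoring tokens.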

**Top-k Impact:**

When top-k is reduced (100 → 5):

- Low top-k (5): Only the five most probable tokens can be sampled at each step, which constrains the text and makes it repetitive. The run above keeps circling around the word "computer".

- Medium top-k (20-50): Gives more varied wording while staying relevant to the topic.

- High top-k (100): Admits less likely tokens, producing more varied and sometimes surprising content.
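
Top-k filtering can be sketched in a few lines: keep the k highest-scoring candidates, renormalise, and sample only among them. The code below is an illustrative plain-Python version, not the Transformers implementation, and `toy_vocab` and `toy_logits` are invented for the example.

```python
import math
import random


def top_k_probs(logits, k):
    """Keep the k highest logits and renormalise; every other token gets probability 0."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of logits")
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in keep)  # subtract the maximum for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


# Hypothetical candidates and scores for the next token
toy_vocab = ["is", "was", "can", "has", "banana"]
toy_logits = [3.0, 2.5, 2.0, 1.0, -1.0]

probs = top_k_probs(toy_logits, k=2)
rng = random.Random(0)  # fixed seed so the draw is repeatable
samples = rng.choices(toy_vocab, weights=probs, k=5)

print([round(p, 3) for p in probs])
print(samples)
```

With k=2 only "is" and "was" can ever be drawn, however many samples are taken. Raising k lets lower-ranked candidates such as "banana" back into the pool.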




*6.2 Model Advantages and Limitations*

**Advantages:**

1. **Contextual Understanding**: The model continues the prompt "The artificial intelligence system" with on-topic content about prediction, machine learning, and AI research.

2. **Coherent Text Structure**: Generates mostly grammatical sentences and breaks the text into paragraphs.

3. **Knowledge Retention**: Reproduces vocabulary and associations from its training data, such as neurons, prediction systems, and university research centers.

4. **Rapid Generation**: Produces a 50- to 60-token continuation in a single `generate` call, with no manual intervention.

5. **Parameter Control**: Temperature and top-k give direct control over the trade-off between creativity and coherence.


**Limitations:**

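1. **Factual Accuracy**: Presents unverified and likely fabricated details as fact, such as the "University of Maryland's Artificial Intelligence Research Center", the "AI-Robot-AI (AI-R) system", and a quoted consultant at "CapitalBiz".

2. **Repetition Tendency**: Most visible at low temperature (0.1 and 0.3), where the same phrases and ideas recur.

3. **Context Drift**: Can move away from the original topic as generation continues. The top-k 20 run shifts from an AI system to a company statement about an unnamed problem.

4. **Parameter Sensitivity**: Small changes in parameters can produce significantly different results, so settings need careful tuning. Because no random seed is set and each setting is sampled only once, some of the differences between runs are sampling noise rather than parameter effects.

5. **Training Data Bias**: Reflects the biases and knowledge cutoff of its training data. GPT-2 was released in 2019, but its WebText training corpus contains no links created after December 2017.

6. **Lack of True Understanding**: The text reads as coherent, but the model only predicts likely next tokens and does not comprehend or verify what it describes.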
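
Since the comparisons in this notebook rely on random sampling, fixing the random seed is one way to separate parameter effects from chance. The sketch below shows the principle with Python's standard `random` module and an invented toy distribution. In the notebook itself, calling `transformers.set_seed` or `torch.manual_seed` before each `generate` call would play the same role.

```python
import random


def sample_sequence(seed, length=8):
    """Draw a short token sequence from a fixed toy distribution using the given seed."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    vocab = ["the", "system", "is", "able", "to", "predict"]
    weights = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]  # hypothetical next-token probabilities
    return rng.choices(vocab, weights=weights, k=length)


run_a = sample_sequence(seed=42)
run_b = sample_sequence(seed=42)

print(run_a)
print("Same seed, same output:", run_a == run_b)
```

With the seed held constant, any change in the output after adjusting temperature or top-k can be attributed to the parameter rather than to a different random draw.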

*6.3 Educational Value Demonstrated*

This practical work successfully illustrates:

- Tokenization process and how text is processed by AI models

- Parameter effects on generation quality and creativity

- Trade-offs between coherence and diversity in AI text generation

- Real-world limitations of current language models

- Practical skills in manipulating and configuring AI systems