## Objective of the Lab
##### Manipulate a pre-trained language model to understand how it generates text and explore the process of tokenization and text generation.

### Manipulating a Language Model with Hugging Face Transformers
##### Required materials: Python, access to the Hugging Face `transformers` library (pre-installed if possible), access to Jupyter Notebook or a Python-compatible IDE.

#### 1. Installation and Import of Libraries
- Install the required libraries: `transformers`, `torch`

In [44]:
! pip install transformers torch




In [45]:
# Import the Transformers library
import transformers
from transformers import pipeline

# Checking PyTorch (torch)
import torch

# Display the versions of the installed libraries
print("Transformers version:", transformers.__version__)
print("Torch version:", torch.__version__)


Transformers version: 4.46.3
Torch version: 2.5.1


#### 2. Loading a Pre-trained Language Model

In [46]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

#### 3. Tokenization of Input Text

In [47]:
# Load the input text
input_text = "The artificial intelligence system"

# Tokenize the text
tokens = tokenizer.tokenize(input_text)
token_ids = tokenizer.encode(input_text, return_tensors="pt")

# Display the results
print("Tokens (as text):", tokens)
print("Tokens (as numeric indices):", token_ids)


Tokens (as text): ['The', 'Ġartificial', 'Ġintelligence', 'Ġsystem']
Tokens (as numeric indices): tensor([[  464, 11666,  4430,  1080]])
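
In GPT-2's byte-level BPE vocabulary, the `Ġ` prefix marks a token that starts with a space. As a minimal sketch, the token list printed above can be mapped back to the prompt with a plain string replacement. This simplification only holds for ASCII text like this prompt; in practice `tokenizer.convert_tokens_to_string(tokens)` or `tokenizer.decode(...)` does the job properly. The variable name `example_tokens` is only illustrative.

```python
# Tokens copied from the output above; "Ġ" stands for a leading space.
example_tokens = ['The', 'Ġartificial', 'Ġintelligence', 'Ġsystem']

# Simplified reconstruction, valid for plain ASCII text such as this prompt.
reconstructed = "".join(example_tokens).replace("Ġ", " ")
print(reconstructed)  # The artificial intelligence system
```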


#### 4. Text Generation
- Generate a continuation for this sentence.
- Code to generate the text continuation: `model.generate()`

In [48]:
# Generate a text continuation
output = model.generate(
    token_ids,
    max_length=50,  # Maximum total length in tokens (prompt + generated)
    num_return_sequences=1,  # Number of sequences to generate
    no_repeat_ngram_size=2,  # No bigram may appear more than once
    top_k=50,  # Sample only from the 50 most probable tokens
    top_p=0.95,  # Nucleus sampling: smallest token set with cumulative probability >= 0.95
    temperature=1.0,  # Scales the logits before sampling (1.0 = unchanged)
    do_sample=True  # Sample instead of always taking the most probable token
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
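
The warning above appears because `tokenizer.encode` returns only the token IDs, so `generate()` receives neither an attention mask nor a pad token. One possible way to avoid it, sketched below, is to call the tokenizer object directly (which returns both `input_ids` and `attention_mask`) and to pass `pad_token_id` explicitly. The helper name `generate_with_mask` is hypothetical; it expects the `model` and `tokenizer` objects loaded in step 2.

```python
def generate_with_mask(model, tokenizer, text, **generate_kwargs):
    """Generate from `text`, passing the attention mask and a pad token id."""
    # Calling the tokenizer returns both input_ids and attention_mask.
    inputs = tokenizer(text, return_tensors="pt")
    return model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        # GPT-2 has no pad token, so the end-of-sequence id is reused here.
        pad_token_id=tokenizer.eos_token_id,
        **generate_kwargs,
    )

# Example call (assumes `model`, `tokenizer` and `input_text` from the cells above):
# output = generate_with_mask(model, tokenizer, input_text, max_length=50, do_sample=True)
```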


In [49]:
# Decode the generated output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Display the results
print("Generated text:", generated_text)

Generated text: The artificial intelligence system was developed in partnership with IBM and Intel in the first few years of this new project.

The machines will also include three of the world's most popular smartphones – the Galaxy Nexus and the Note 9. The Android OS is


#### 5. Exploring Generation Parameters
- Modify parameters like `max_length`, `temperature` (for creativity), or `top_k` (filtering the most probable words) to observe how they affect the generated text.

#### Generation Parameters

##### 1. **`token_ids`**
- **Description**: Tensor of token IDs representing the input text (the prompt).
- **Role**: Gives the model the prompt in numerical form, one integer ID per token.
- **Default Value**: None; it is produced by the tokenizer from the input text.
- **Note**: This is the input to `generate()` rather than a tunable setting. It is listed here because it is the first argument of the call.

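##### 2. **`max_length`**
- **Description**: Maximum length of the output sequence in tokens. For GPT-2 this count includes the prompt tokens as well as the generated ones.
- **Role**: Limits the size of the generated sequence to avoid excessively long outputs.
- **Default Value**: `20` in the library's default generation configuration, unless the model's own configuration overrides it.
- **Example**: `max_length=100` yields a sequence of at most 100 tokens, prompt included.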
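
For a decoder-only model such as GPT-2, `max_length` counts the prompt tokens plus the generated ones. The arithmetic can be sketched as below; the helper name `new_token_budget` is hypothetical. To bound only the newly generated tokens, `generate()` also accepts `max_new_tokens`.

```python
def new_token_budget(prompt_length: int, max_length: int) -> int:
    """Upper bound on newly generated tokens when max_length includes the prompt."""
    return max(max_length - prompt_length, 0)

# The prompt "The artificial intelligence system" is 4 tokens long.
print(new_token_budget(prompt_length=4, max_length=50))  # 46
```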

##### 3. **`num_return_sequences`**
- **Description**: Number of text sequences returned by the model for a single request.
- **Role**: Useful for exploring multiple variations of the generated response.
- **Default Value**: `1`.
- **Example**: `num_return_sequences=3` returns 3 sequences for the same input. They differ from one another only when sampling (`do_sample=True`) or beam search is used.

##### 4. **`no_repeat_ngram_size`**
- **Description**: Size `n` of the n-grams (sequences of `n` consecutive tokens) that may appear only once in the output.
- **Role**: Prevents repetitions in the generated text.
- **Default Value**: `0` (disabled).
- **Example**: `no_repeat_ngram_size=2` prevents any two-token sequence (bigram) from appearing twice.
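
The idea behind n-gram blocking can be sketched in plain Python: at each step, any token that would complete an n-gram already present in the output is banned. This is an illustrative re-implementation on a toy word list, not the library's code, and the function name `banned_next_tokens` is an assumption.

```python
def banned_next_tokens(generated, ngram_size):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if ngram_size <= 0 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# "the cat" already occurred, so "cat" may not follow the second "the".
print(banned_next_tokens(["the", "cat", "sat", "on", "the"], ngram_size=2))  # {'cat'}
```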

##### 5. **`top_k`**
- **Description**: Number of highest-probability tokens kept as candidates at each step of generation.
- **Role**: Restricts sampling to the `k` most probable options. It only takes effect when `do_sample=True`.
- **Default Value**: `50`.
- **Example**: `top_k=10` limits consideration to the top 10 most probable options.
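
A minimal sketch of top-k filtering on a made-up next-token distribution: keep the `k` most probable tokens and renormalize. The probabilities and the name `top_k_filter` are illustrative assumptions; the library applies the same idea to the model's logits.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy distribution over four candidate next tokens (values are invented).
toy_probs = {"is": 0.5, "was": 0.3, "can": 0.15, "danced": 0.05}
print(top_k_filter(toy_probs, k=2))  # only "is" and "was" remain
```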

##### 6. **`top_p`**
- **Description**: Cumulative probability threshold used for **Nucleus Sampling**.
- **Role**: At each step, keeps the smallest set of most probable tokens whose cumulative probability reaches `top_p`, and samples only from that set.
- **Default Value**: `1.0` (no limit).
- **Example**: `top_p=0.95` samples from the most probable tokens that together account for at least 95% of the probability mass and discards the low-probability tail.
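
Nucleus sampling can be sketched the same way: sort the tokens by probability, accumulate until the running total reaches `top_p`, and drop the rest. The distribution and the name `top_p_filter` are illustrative assumptions, and the library's implementation works on logits rather than on a dictionary.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose total mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: p / cumulative for token, p in kept}

# Toy distribution over four candidate next tokens (values are invented).
nucleus_probs = {"is": 0.5, "was": 0.3, "can": 0.15, "danced": 0.05}
print(sorted(top_p_filter(nucleus_probs, top_p=0.9)))  # ['can', 'is', 'was']
```

Unlike `top_k`, the number of surviving tokens varies from step to step: a peaked distribution keeps few candidates, a flat one keeps many.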

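##### 7. **`temperature`**
- **Description**: Controls the diversity of the output by dividing the model's logits by this value before they are converted to probabilities.
- **Role**: Lower temperature sharpens the distribution and results in more conservative generation; higher temperature flattens it and increases creativity. It only takes effect when `do_sample=True`.
- **Default Value**: `1.0`.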
- **Example**:
  - `temperature=0.1` → Predictable and conservative response.
  - `temperature=1.5` → More varied and unpredictable response.
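
To see why temperature has this effect, the sketch below applies a softmax to a few invented logits at different temperatures. The logits and the function name are assumptions chosen for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens
for t in (0.1, 1.0, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

At `temperature=0.1` nearly all the probability sits on the top token; at `1.5` the distribution is flatter, so less likely tokens are sampled more often.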

##### 8. **`do_sample`**
- **Description**: Enables or disables random sampling during generation.
- **Role**: Determines whether the model chooses tokens based on probability (`do_sample=True`) or always selects the most probable token (`do_sample=False`).
- **Default Value**: `False` (selects the most probable token).
- **Example**:
  - `do_sample=True` generates more varied responses.
  - `do_sample=False` generates deterministic responses.
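
The difference between the two modes can be sketched on a toy distribution: greedy decoding always returns the most probable token, while sampling draws a token in proportion to its probability. The helper names and the probabilities are illustrative assumptions.

```python
import random

# Toy next-token distribution (values are invented).
choice_probs = {"is": 0.5, "was": 0.3, "can": 0.15, "danced": 0.05}

def greedy_pick(probs):
    """do_sample=False: always take the most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """do_sample=True: draw a token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
print("greedy: ", [greedy_pick(choice_probs) for _ in range(5)])
print("sampled:", [sample_pick(choice_probs, rng) for _ in range(5)])
```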

#### **Practical Examples**
##### Example 1: Precise and Structured Text
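```python
max_length=50, top_k=5, temperature=0.3, do_sample=True
```
- Generates short text from a narrow set of candidates with little randomness. Sampling must be enabled for `top_k` and `temperature` to apply; with `do_sample=False` the model decodes greedily and ignores both.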

##### Example 2: Creative Text
```python
max_length=100, top_p=0.9, temperature=1.2, do_sample=True
```
- Generates longer, more varied text rich in lexical diversity.

##### Example 3: Preventing Repetitions
```python
no_repeat_ngram_size=3, top_k=20, temperature=0.7, do_sample=True
```
- Prevents 3-gram repetitions while balancing creativity and coherence. `do_sample=True` is needed for `top_k` and `temperature` to take effect.

In [50]:
# Generate a text continuation
output = model.generate(
    token_ids,
    max_length=100,  # Maximum total length in tokens (prompt + generated)
    num_return_sequences=1,  # Number of sequences to generate
    no_repeat_ngram_size=2,  # No bigram may appear more than once
    top_k=10,  # Sample only from the 10 most probable tokens
    top_p=0.95,  # Nucleus sampling: smallest token set with cumulative probability >= 0.95
    temperature=0.1,  # Low temperature: sampling stays close to the most probable token
    do_sample=True  # Sample instead of always taking the most probable token
)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


In [51]:
# Decode the generated output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Display the results
print("Generated text:", generated_text)

Generated text: The artificial intelligence system is able to predict the future, and it can also predict what will happen in the next few years.

The system can then predict how long it will take to complete a task, how much time it takes to finish a sentence, what the expected outcome will be, etc. The system also can predict when the task will end. It can even predict whether the tasks will last for a certain amount of time. This is called the "learning curve."
. . .


#### 6. Analysis Questions
- Compare the results based on the changed parameters. What happens when the temperature increases? What about when `top_k` is reduced?
- What are the advantages and limitations of this type of model for text generation?
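
One way to run the comparison asked for in the first question is to vary a single parameter while holding the others fixed. The helper below is a sketch under that assumption; the name `sweep_parameter` is hypothetical, and it expects the `model`, `tokenizer` and `token_ids` objects created in the earlier cells.

```python
def sweep_parameter(model, tokenizer, input_ids, name, values, **base_kwargs):
    """Generate one continuation per value of a single generate() parameter."""
    results = {}
    for value in values:
        # Override only the parameter under study; keep everything else fixed.
        kwargs = {**base_kwargs, name: value}
        output = model.generate(input_ids, **kwargs)
        results[value] = tokenizer.decode(output[0], skip_special_tokens=True)
    return results

# Example call (assumes the objects from the cells above):
# texts = sweep_parameter(model, tokenizer, token_ids, "temperature",
#                         [0.3, 1.0, 1.5], max_length=50, do_sample=True, top_k=50)
# for value, text in texts.items():
#     print(f"temperature={value}: {text}\n")
```

Because sampling is random, a single run per setting can mislead. Calling `torch.manual_seed(...)` before each run, or generating several sequences per setting, makes the comparison more reliable.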


#### **Analysis of Results Based on Parameters**

#### **1. Effect of Temperature (`temperature`)**
- Temperature controls the "creativity" or diversity in the model's choices for each generated token.
- **Low Values (`<1.0`)**: 
  - The model becomes more conservative.
  - Prioritizes the most probable words (choices become nearly deterministic as the temperature approaches 0).
  - Generates more coherent but less surprising text.
  - **Example** (temperature = 0.7):
    ```
    The artificial intelligence system is designed to improve efficiency and solve complex problems in various industries.
    ```
- **High Values (`>1.0`)**:
  - The model becomes more random.
  - Increases the probability of unusual choices (less predictable decisions).
  - Generates more creative text but may risk incoherence.
  - **Example** (temperature = 1.5):
    ```
    The artificial intelligence system danced with paradoxical algorithms, weaving dreams in quantum fields of uncertainty.
    ```

---

#### **2. Effect of Reducing `top_k`**
- `top_k` limits the number of possible choices at each step of generation.
- **High Values (`>50`)**:
  - The model has a broad range of words to choose from.
  - Generates diverse but potentially less coherent text.
  - **Example** (top_k = 100):
    ```
    The artificial intelligence system can adapt, predict, simulate, and conceptualize advanced scenarios to revolutionize industries.
    ```
- **Low Values (`<10`)**:
  - Strongly restricts the model's choices.
  - Generates more repetitive and predictable text.
  - **Example** (top_k = 5):
    ```
    The artificial intelligence system is designed to improve performance and solve problems in various fields of technology.
    ```

---

### **Advantages and Limitations of This Model Type**

#### **Advantages:**
1. **Powerful Generalization**:
   - Capable of generating convincing text across different domains and styles.
2. **Customizable**:
   - Parameters like `temperature`, `top_k`, and `top_p` allow adjustment of creativity and coherence.
3. **Ease of Use**:
   - Simple interfaces with libraries like `Transformers`.
4. **Wide Applications**:
   - Useful for writing, content creation, chatbots, and more.

#### **Limitations:**
1. **Contextual Inconsistencies**:
   - Can generate grammatically correct but logically incorrect sentences.
   - Example: "The cat programmed the AI to solve equations."
2. **Dependence on Training Data**:
   - The model may reproduce biases present in the original data.
3. **Lack of Deep Understanding**:
   - The model doesn't truly "understand" concepts. It predicts the probabilities of the next words.
4. **Control Challenges**:
   - Hard to ensure relevance or factual accuracy of the generated content.
5. **Computational Limitations**:
   - Requires significant resources for large models (CPU/GPU, memory).

---

### **Conclusion:**
To generate effective text, parameters must be adjusted based on the use case:
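- **For coherent and formal text**: Use a low `temperature` and a low `top_k`, so the model stays close to its most probable choices.
- **For creative or imaginative text**: Increase `temperature` and raise `top_k` (or `top_p`), so less probable tokens can be selected.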

However, these models cannot replace humans for tasks requiring deep understanding, such as critical analysis or fact-checking.
