# 2. Loading and using an LLM

Welcome to the second lesson of the NNLG Tutorial!

In this session, we look at how to find, load, and prompt an LLM (Large Language Model).

## 2.1 Loading an LLM

Loading an LLM is as simple as loading any other model thanks to the Hugging Face `pipeline` function.

We will load a small LLM so we can run it on the CPU: [`HuggingFaceTB/SmolLM2-1.7B-Instruct`](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct).

In [1]:
# Install Transformers library
! pip install transformers --quiet

# Import Pipeline for the LLM and Pytorch to find the best available device
from transformers import pipeline
import torch

# Find the best available device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # Use GPU if available, otherwise use CPU

# Load the model
model_identifier = 'unsloth/Llama-3.2-3B-Instruct'
llm = pipeline(model=model_identifier, device=device)

## 2.2 Using an LLM

In this case, the LLM is instantiated as a `TextGenerationPipeline`. This pipeline continues the text of the provided input prompt. It accepts several generation arguments, such as `max_new_tokens` (an `int`), which sets the maximum number of new tokens to generate. For more information about generation arguments, see the [Transformers documentation on generation strategies](https://huggingface.co/docs/transformers/en/generation_strategies).

For now, let's try something simple:

In [13]:
prompt = 'What is the capital of France? ' # Expected output: Paris
generation = llm(prompt, max_new_tokens = 32)
print(generation)

[{'generated_text': 'What is the capital of France?  A. Paris B. London C. New York D. Berlin\nA. Paris'}]


As you can see, the output of the pipeline is a list that contains one dictionary for each input prompt. Each dictionary has a key, `generated_text`, which holds the text generated by the model.

Since the model is autoregressive, it continues the prompt, and the pipeline returns the full text by default. This is why our input prompt (`'What is the capital of France? '`) appears at the start of the generation.

Let's write a bit of code to extract only the newly generated text.

In [14]:
new_text = generation[0]['generated_text'][len(prompt):]
print(new_text)

 A. Paris B. London C. New York D. Berlin
A. Paris
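
Slicing by `len(prompt)` works because the pipeline returns the prompt followed by the continuation. The sketch below wraps that step in a small helper and runs it on a hand-written copy of the output above, so it needs no model. The name `strip_prompt` is ours, not part of Transformers. Text-generation pipelines also accept `return_full_text=False`, which asks the pipeline to return only the new text.

```python
def strip_prompt(generation, prompt):
    """Return only the newly generated text for each pipeline result."""
    new_texts = []
    for result in generation:
        full_text = result['generated_text']
        # Only strip the prompt if the text really starts with it
        if full_text.startswith(prompt):
            full_text = full_text[len(prompt):]
        new_texts.append(full_text)
    return new_texts

# Hand-written stand-in for the pipeline output shown above
prompt = 'What is the capital of France? '
generation = [{'generated_text': prompt + ' A. Paris B. London C. New York D. Berlin\nA. Paris'}]

print(strip_prompt(generation, prompt)[0])
```

Checking with `startswith` before slicing avoids cutting into the answer if the returned text does not begin with the prompt.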


## 2.3 Instruction Tuning

While every execution produces different results, you might have noticed that the model does more than just answer the question from the prompt. This is because, right now, we are using its autoregressive nature in a naive way: the model simply continues generating text.

One way to tackle this issue is **Instruction Tuning**, where the model is fine-tuned to follow instructions and address them directly. While this approach is not perfect and might still lead to overgeneration, it is much better than the naive approach.

The model we are currently using has been trained with Instruction Tuning. To make use of this, we need to format our prompt as a chat. The chat consists of a list of dictionaries, each of which includes a `role` and a `content`. Typically, there is a `system` message followed by alternating `user` and `assistant` messages.
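
To make the typical layout concrete, here is a small sketch of a multi-turn chat together with a check that the roles alternate as described. The helper `check_chat` and the example messages are our own illustration. Transformers does not require this exact layout, and individual chat templates may be stricter or looser.

```python
def check_chat(messages):
    """Check the typical layout: optional system message, then user/assistant alternating."""
    # Skip the system message if the chat starts with one
    turns = messages[1:] if messages and messages[0]['role'] == 'system' else messages
    expected = 'user'
    for message in turns:
        if message['role'] != expected:
            return False
        expected = 'assistant' if expected == 'user' else 'user'
    return True

# A chat with one earlier exchange followed by a new user question
multi_turn_chat = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'What is the capital of Spain?'},
    {'role': 'assistant', 'content': 'The capital of Spain is Madrid.'},
    {'role': 'user', 'content': 'What is the capital of France?'},
]

print(check_chat(multi_turn_chat))
```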

To turn this list of messages into a string that we can use as a prompt, we will use the `apply_chat_template` method of the `tokenizer` object included in our LLM pipeline. Its arguments include `tokenize`, which when set to `False` makes the method return a string instead of a list of token IDs, and `add_generation_prompt`, which when set to `True` appends the tokens that mark the start of the assistant's answer.


In [15]:
chat = [
    {
        'role':'system',
        'content':'''You are a helpful assistant.'''
    },
    {
        'role':'user',
        'content':'Answer the following question: What is the capital of France?'
    }
]

prompt = llm.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

print(prompt)

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Answer the following question: What is the capital of France?<|im_end|>
<|im_start|>assistant
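
The printed prompt above follows a ChatML-style layout. The sketch below shows roughly what a template does for that layout, using plain string formatting. The function `render_chatml` is a hypothetical illustration, not the tokenizer's implementation: the real template is stored with each tokenizer and differs between models, so in practice you should always call `apply_chat_template`.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in a ChatML-style layout (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers and labelled with its role
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append('<|im_start|>assistant\n')
    return ''.join(parts)

chat = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Answer the following question: What is the capital of France?'},
]

print(render_chatml(chat))
```

With `add_generation_prompt=False`, the string ends after the user turn, and the model is free to continue with something other than an assistant answer.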



Now, if we generate the output as we did before, we should get a much more concise and consistent answer:

In [16]:
generation = llm(prompt, max_new_tokens = 32)
new_text = generation[0]['generated_text'][len(prompt):]
print(new_text)

The capital of France is Paris.


Let's test the model on a Trivia dataset: [`mandarjoshi/trivia_qa`](https://huggingface.co/datasets/mandarjoshi/trivia_qa). We can load it like we did in Lesson 1.

In [17]:
# Install Datasets library
! pip install datasets --quiet

# Import load_dataset and Dataset
from datasets import load_dataset, Dataset

# Instantiate the Trivia QA dataset in streaming mode
dataset = load_dataset(
    'mandarjoshi/trivia_qa',
    'rc',
    split='train',
    streaming=True
)

# Preprocess the dataset
def preprocess_trivia_qa(sample):
    wiki_context = []
    for title, context in zip(sample['entity_pages']['title'], sample['entity_pages']['wiki_context']):
        wiki_context.append((title, context))
    new_sample = {
        'wiki_context':wiki_context,
        'answer':sample['answer']['value']
    }
    return new_sample

dataset = dataset.map(
    preprocess_trivia_qa,
    remove_columns=[
        'question_id',
        'question_source',
        'entity_pages',
        'search_results'
    ]
)

# Take the first 8 elements of the dataset
dataset = dataset.take(8)

# Convert from IterableDataset to Dataset
dataset = Dataset.from_generator(lambda: iter(dataset), features=dataset.features)

dataset[0]

{'question': 'Which American-born Sinclair won the Nobel Prize for Literature in 1930?',
 'answer': 'Sinclair Lewis',
 'wiki_context': []}
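
To see what the preprocessing step does without streaming the dataset, the sketch below applies the same logic to a hand-made sample. The field names follow the code above; the page text is a placeholder we made up, and the dictionary merge only imitates how `map` combines the returned keys with the remaining columns for a single sample.

```python
def preprocess_trivia_qa(sample):
    # Pair each Wikipedia page title with its text
    wiki_context = list(zip(sample['entity_pages']['title'],
                            sample['entity_pages']['wiki_context']))
    return {'wiki_context': wiki_context, 'answer': sample['answer']['value']}

# Hand-made sample: same nesting as the fields used above, placeholder page text
sample = {
    'question': 'Which American-born Sinclair won the Nobel Prize for Literature in 1930?',
    'answer': {'value': 'Sinclair Lewis'},
    'entity_pages': {'title': ['Sinclair Lewis'], 'wiki_context': ['Placeholder page text']},
}

# Imitate map for one sample: returned keys overwrite existing ones
processed = {**sample, **preprocess_trivia_qa(sample)}
del processed['entity_pages']  # stands in for remove_columns

print(processed)
```

Note that `answer` starts as a nested dictionary and ends up as a plain string, because the returned key replaces the original column.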

Now let's go over every sample in the dataset. First we collect the question and the reference answer; then, using the code we developed in this tutorial, we generate a new answer with our model.

In [18]:
for sample in dataset:

    conversation = [
        {
            'role':'system',
            'content':'''You are a helpful assistant.'''
        },
        {
            'role':'user',
            'content': 'Answer the following question: '+sample['question']
        }
    ]

    prompt = llm.tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

    generation = llm(prompt, max_new_tokens = 32)

    new_text = generation[0]['generated_text'][len(prompt):]

    print('Question:  ', sample['question'])
    print('Answer:    ', sample['answer'])
    print('Generation:', new_text)
    print()

Question:   Which American-born Sinclair won the Nobel Prize for Literature in 1930?
Answer:     Sinclair Lewis
Generation: Ernest Hemingway won the Nobel Prize for Literature in 1954, not the year mentioned in the question (1930). However

Question:   Where in England was Dame Judi Dench born?
Answer:     York
Generation: Dame Judi Dench was born on 17 March 1935 at St Mary's Hospital, Paddington, London, England. Her

Question:   In which decade did Billboard magazine first publish and American hit chart?
Answer:     30s
Generation: Billboard magazine first published its American hit chart in 1932.

Question:   From which country did Angola achieve independence in 1975?
Answer:     Portugal
Generation: Angola achieved independence from Portugal on November 11, 1975.

Key points about Angola's independence:

1. The Portuguese colonial government of

Question:   Which city does David Soul come from?
Answer:     Chicago
Generation: David Soul comes from Birmingham, England. He was born 

As you can see, the answers are not always concise, and sometimes they are incorrect. In the following lessons we will learn how to mitigate these issues.
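
One rough way to quantify "incorrect" is to check whether the reference answer appears anywhere in the generation. The sketch below does this for four of the pairs printed above. The helper `contains_answer` is our own illustration and not a standard metric implementation.

```python
def contains_answer(generation, answer):
    """Case-insensitive check that the reference answer appears in the generation."""
    return answer.lower() in generation.lower()

# (reference answer, generation) pairs copied from the output above
results = [
    ('Sinclair Lewis', 'Ernest Hemingway won the Nobel Prize for Literature in 1954, '
                       'not the year mentioned in the question (1930). However'),
    ('York', "Dame Judi Dench was born on 17 March 1935 at St Mary's Hospital, "
             'Paddington, London, England. Her'),
    ('30s', 'Billboard magazine first published its American hit chart in 1932.'),
    ('Portugal', 'Angola achieved independence from Portugal on November 11, 1975.'),
]

for answer, generation in results:
    print(f'{answer:<15} {contains_answer(generation, answer)}')
```

Substring matching is crude. It rejects the Billboard generation because "1932" does not literally contain "30s", and it would accept "New York" as a match for "York". Treat it as a first sanity check rather than a measure of correctness.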

## 2.4 Exercise

Apply this or any other model to the dataset you selected and processed in the previous tutorial. Larger models might perform better, but they take longer to process your input. You can reduce generation time by using a GPU instead of the CPU.

### Group A

- Oyetunji ABIOYE
- Mehsen AZIZI
- Mohammad AL TAKACH

In [None]:
# Import Pipeline for the LLM and Pytorch to find the best available device
from transformers import pipeline
import torch

# Find the best available device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # Use GPU if available, otherwise use CPU

# Load the model
model_identifier = 'unsloth/Llama-3.2-1B-Instruct'
llm = pipeline(model=model_identifier, device=device)

In [20]:
from datasets import load_dataset

dataset = load_dataset(
    'ImruQays/Rasaif-Classical-Arabic-English-Parallel-texts',
    split='train',
    streaming=True
)

In [21]:
sample = next(iter(dataset))

def nested_print(key, element, level=0):
    if isinstance(element, dict):
        print(f'{"│ "*(level)}├─{key}:')
        for k, v in element.items():
            nested_print(k, v, level+1)
    else:
        print(f'{"│ "*(level)}├─{key}: {element}')

nested_print('sample', sample)

├─sample:
│ ├─ar: وبعد، فلما كان السلطان الأعظم الملك الناصر، العالم المجاهد المرابط المتاغر، المؤيد المظفر المنصور، زين الدنيا والدين، سلطان الإسلام والمسلمين، محيى العدل فى العالمين، وارث ملك ملوك العرب والعجم والترك، ظل الله فى أرضه، القائم بسنته وفرضه
│ ├─en: To proceed: Since the great Sultan, the King, the Victor, the Sage, the Just, the Struggler, the Perseverer, the Trail-blazer, the God-supported, the Conquering, the Victorious, the Ornament of the World and of Religion, the Sultan of Islam and of the Muslims, the Rejuvenator of Justice in the Worlds, the Heir of the kingdom of the Kings of the Arabs and the Persians and the Turks, Shadow of God in His land, the Upholder of God’s sunnah and of His Ordinances.


In [22]:
def preprocess_en_ar(sample):
    return {
        'source':sample['en'],
        'target':sample['ar'],
    }

dataset = dataset.map(
    preprocess_en_ar,
    remove_columns=[
            'en',
            'ar'
        ])


In [23]:
# Filter out samples that are too short or too long (lengths are measured in characters).
# We raise the lower bound to 30 because Arabic is a more verbose language.
def filter_en_ar(sample):
    return 30 <= len(sample['source']) <= 60 and 30 <= len(sample['target']) <= 60

dataset = dataset.filter(filter_en_ar)
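
The length filter can be tried on individual pairs before running it over the stream. The sketch below restates the filter with the bounds exposed as parameters (`low` and `high` are our additions for illustration) and applies it to one pair from this dataset and one made-up short pair.

```python
def filter_en_ar(sample, low=30, high=60):
    # len() counts Unicode code points, not words or tokens
    return (low <= len(sample['source']) <= high
            and low <= len(sample['target']) <= high)

# A pair taken from the dataset, and a made-up pair that is too short
chapter_pair = {'source': 'Chapter One: about the maintenance of caution generally.',
                'target': 'الفصل الأول: فى أخذ الحذر فى الجملة.'}
short_pair = {'source': 'Yes.', 'target': 'نعم.'}

for pair in (chapter_pair, short_pair):
    print(len(pair['source']), len(pair['target']), filter_en_ar(pair))
```

Because Arabic diacritics are separate code points, fully vocalised text counts as longer than the same words written without them.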

In [24]:
dataset = dataset.take(16)

In [25]:
for sample in dataset:
    nested_print('sample', sample)
    print()

├─sample:
│ ├─source: Chapter One: about the maintenance of caution generally.
│ ├─target: الفصل الأول: فى أخذ الحذر فى الجملة.

├─sample:
│ ├─source: Chapter One: about the choosing of the site for camping.
│ ├─target: الفصل الأول: فى اختيار موضع المنزل.

├─sample:
│ ├─source: Chapter Two: about the method of night raiding.
│ ├─target: الفصل الثانى: فى كيفية البيات.

├─sample:
│ ├─source: Chapter Three: about the method of investment.
│ ├─target: الفصل الثالث: فى كيفية الحصار.

├─sample:
│ ├─source: “You never do anything right!” sighed Abu‘Abdullah.
│ ├─target: فقال له الشيخ: لا يجيء والله منك، من صالح أبدا.

├─sample:
│ ├─source: “So what should I do?” moaned the sheikh.
│ ├─target: قال الشيخ: فكيف أصنع جعلت فداك؟

├─sample:
│ ├─source: Thumama got extremely upset when his house burned down.
│ ├─target: وقيل: أصبح ثمامة شديد الغمّ حين احترقت داره.

├─sample:
│ ├─source: As a result, good drinking water went to waste.
│ ├─target: فكان ذلك الماء العذب الصافي يذهب باطلا.

├─sample:
│ ├

In [28]:
for sample in dataset:

    conversation = [
        {
            'role':'system',
            'content':'''You are a helpful assistant.'''
        },
        {
            'role':'user',
            'content': 'You must Translate the following English text to Arabic only: '+sample['source']
        }
    ]

    prompt = llm.tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

    generation = llm(prompt, max_new_tokens = 32)

    new_text = generation[0]['generated_text'][len(prompt):]

    print('Question:  ', sample['source'])
    print('Answer:    ', sample['target'])
    print('Generation:', new_text)
    print()

Question:   Hariri means a manufacturer or seller of harir (silk).
Answer:     والحريري: نسبة إلى الحرير وعمله أو بيعه.
Generation: هاريري (حريري) هو مصنف أو مصنف مصنع منيصرة.

Question:   Abu Dulaf heard of this, and sent him a thousand dinars.
Answer:     فبلغ خبره أبا دلف، فوجه إليه ألف دينار.
Generation: "أبى دلفا لِأَن يُحضِرَ عَلى ذَلِكَ مِنهُ مِنهُ

Question:   Here the poet has attained the acme of perfection.
Answer:     ولقد أحسن في هذا غاية الإحسان.
Generation: "المرحبة هي ما يأتي من أعلى"

Question:   We have already spoken of his father Ali.
Answer:     وقد تقدم ذكر والده في حرف العين.
Generation: The translation of the given English text in Arabic is:

نعم، نطالبنا من عمته.

Question:   ”We have spoken of al-Sharat in the life of his father Ali.
Answer:     وقد تقدم الكلام على الشراة في ترجمة أبيه علي بن عبد الله.
Generation: "نحن نكتبنا في حياته من أصوله الشارع.

Question:   Al-Bukhari was a lean-bodied man and of the middle size.
Answer:     وكان شيخا نحيف الجسم، لا بال