<a href="https://colab.research.google.com/github/NkM20/IA/blob/main/LLMs_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Hugging Face? 🤗
Hugging Face is a company and open-source platform that provides tools and resources for Natural Language Processing (NLP), machine learning (ML), and artificial intelligence (AI). It has become one of the most widely used platforms for building, training, and deploying AI models, especially in the realm of transformer-based models like BERT, GPT, and T5.

# Key Features of Hugging Face
1. Transformers Library

   The Transformers library is Hugging Face's flagship product. It provides access to state-of-the-art pre-trained models for NLP and beyond, including models for:
   * Text classification
   * Question answering
   * Language translation
   * Text generation
   * Summarization
   * Conversational AI (chatbots)

   Example models: BERT, GPT-2, T5, RoBERTa, DistilBERT, etc.

2. Datasets Library

   A repository of thousands of datasets for training and fine-tuning ML models. It lets you easily load and preprocess datasets for tasks like:
   * Text classification
   * Machine translation
   * Speech processing
   * Image processing

3. Hugging Face Hub

   An online repository for sharing and accessing pre-trained models.
   * Contains thousands of pre-trained models uploaded by researchers, organizations, and the community.
   * Models can be downloaded and used with a simple API.

4. Accelerate Library

   Provides tools for training large models efficiently on distributed systems and GPUs/TPUs.

5. Inference API

   Hugging Face offers a cloud-based API to deploy and use models in production without worrying about infrastructure.

6. Integration with Other Frameworks

   Hugging Face integrates with popular ML frameworks and formats like:
   * PyTorch
   * TensorFlow
   * ONNX

# Installation and Configuration

The libraries below are essential for developing machine learning and natural language processing (NLP) models. *Transformers*, by HuggingFace, offers a wide range of pre-trained models like BERT, GPT, and T5 for NLP tasks. *Einops* makes it easy to manipulate tensors with a clear syntax, making complex operations simpler. *Accelerate*, also from HuggingFace, helps optimize model training on different hardware accelerators like GPUs and TPUs. Finally, *BitsAndBytes* enables efficient quantization of large models, reducing memory consumption in PyTorch.

In [None]:
!pip install -q transformers einops accelerate bitsandbytes

In [None]:
#!pip install bitsandbytes-cuda110 bitsandbytes

In our next cell, we will configure the environment by importing the necessary libraries and configuring our device.

Let's import some components from the transformers library:

* AutoModelForCausalLM: A class that loads a pre-trained causal (or autoregressive) language model (e.g. GPT-2 or Phi-3), the kind of model suited to text generation tasks.
* AutoTokenizer: A class that provides a tokenizer that matches the model. The tokenizer is responsible for converting text into (numeric) tokens that the model can understand.
* pipeline: Provides a simple, unified interface for various NLP tasks, making it easier to perform tasks such as text generation, classification, and translation.
* BitsAndBytesConfig: A class for configuring quantization and other low-level optimizations to improve computational efficiency.



In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig

We are also defining the device variable, which specifies the computing device to use:

* The line below checks whether a CUDA-enabled GPU is available. If so, it sets the device to `cuda:0` (the first GPU); otherwise, it falls back to the CPU.

Remember that a GPU can significantly accelerate both training and inference of deep learning models, so let's take advantage of Colab's free GPU (T4).

In [None]:
import torch
import getpass
import os

device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
device

Although not necessary, you can also define a seed, to ensure reproducibility between different experiments and runs.

This way, we can ensure that the same random numbers are generated every time the code is run, leading to consistent results.

In [None]:
torch.random.manual_seed(42)
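
As a minimal sketch of what seeding does, the example below uses Python's standard `random` module instead of PyTorch so that it runs anywhere. The helper name `draw` is hypothetical; the same principle applies to `torch.random.manual_seed`.

```python
import random


def draw(seed, n=3):
    """Return n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, global state is untouched
    return [rng.random() for _ in range(n)]


first = draw(42)
second = draw(42)
other = draw(43)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```

Fixing the seed makes sampling repeatable within the same environment; results can still differ across library versions or hardware.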

## Token definition

Below you must paste the access token generated in your Hugging Face account.
To get one, go to Hugging Face, sign up, and generate a new READ token in your account settings, then paste it here when prompted.

In [None]:
os.environ["HF_TOKEN"] = getpass.getpass()
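
If you re-run the notebook often, it can be convenient to prompt for the token only when it is not already set. The sketch below is one possible way to do that; `ensure_hf_token` and its parameters are hypothetical names, and the demonstration uses a plain dictionary and a stand-in prompt so that nothing is typed or stored.

```python
import getpass
import os


def ensure_hf_token(env=os.environ, ask=getpass.getpass):
    """Return the token stored under HF_TOKEN, prompting only if it is missing."""
    token = env.get("HF_TOKEN")
    if not token:
        token = ask("Hugging Face token: ")
        env["HF_TOKEN"] = token
    return token


# Demonstration with a stand-in environment and a fake prompt function.
demo_env = {}
ensure_hf_token(demo_env, ask=lambda _message: "hf_example_token")
print("HF_TOKEN" in demo_env)
```

In the notebook itself, calling `ensure_hf_token()` with no arguments would use the real environment and `getpass`.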



## Loading the Model

In this step, we will download and configure a Hugging Face model. This may take a few minutes because the model weighs a few GB, but overall the download to Colab should be relatively quick.


First, let's start with Phi 3 (microsoft/Phi-3-mini-4k-instruct), a smaller model that has nevertheless proved very interesting and comparable to much larger ones.

https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

It was chosen because it is open source, accessible, and responds well in Portuguese (although it is still better in English). You will see that many models do not understand Portuguese, and those that do are often too heavy to run in our environment, which means accessing them via an API or web interface, such as ChatGPT. At this point, however, we want to explore open-source solutions to gain greater freedom.





In [None]:
id_model = "microsoft/Phi-3-mini-4k-instruct"

* `device_map="cuda"`: Specifies that the model should be loaded on a CUDA-enabled GPU. This is one of the main advantages of using Colab, since a GPU significantly improves inference and training performance through parallel processing.

* `torch_dtype="auto"`: Automatically sets the data type of the model's tensors, using the data type the checkpoint was saved in (typically float16 or bfloat16 for recent models) instead of always defaulting to float32.

* `trust_remote_code=True`: Allows loading custom code from the model's repository on Hugging Face. This is necessary for certain models that require specific configurations or implementations not included in the standard library. Enable it only for repositories you trust, since that code runs in your environment.

* `attn_implementation="eager"`: Specifies the implementation of the attention mechanism. "eager" is the plain PyTorch implementation; it is not the fastest option, but it works on any hardware, including GPUs such as the T4 that do not support FlashAttention 2.

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    id_model,
    device_map = "cuda",
    torch_dtype = "auto",
    trust_remote_code = True,
    attn_implementation="eager"
)

## Tokenizer

In our configuration, we also need to load the tokenizer associated with the model. The tokenizer is crucial for preparing text data in a format that the model can understand.

* A tokenizer converts raw text into tokens, which are numeric representations that the model can process. It also converts the model's output tokens back into human-readable text.
* Tokenizers handle tasks such as splitting text into words or subwords, adding special tokens, and managing vocabulary mapping.



The tokenizer is a crucial component in the NLP pipeline, bridging the gap between raw text and model-ready tokens.
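
To make the text-to-ids round trip concrete, here is a toy sketch. It is not how the Phi-3 tokenizer works internally (real tokenizers split text into subwords and have much larger vocabularies); `ToyTokenizer` is a hypothetical class that splits on whitespace and uses a tiny fixed vocabulary.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocabulary (illustration only)."""

    def __init__(self, corpus):
        words = sorted(set(corpus.lower().split()))
        # Reserve id 0 for unknown words, much as real vocabularies reserve special tokens.
        self.id_to_token = ["<unk>"] + words
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text):
        """Text to a list of integer ids."""
        return [self.token_to_id.get(word, 0) for word in text.lower().split()]

    def decode(self, ids):
        """List of integer ids back to text."""
        return " ".join(self.id_to_token[i] for i in ids)


tok = ToyTokenizer("what is quantum computing")
ids = tok.encode("What is computing")
print(ids)
print(tok.decode(ids))
print(tok.encode("what is love"))  # the unseen word maps to the <unk> id
```

Because the vocabulary is fixed when the tokenizer is built, the ids only make sense to a model trained with that same vocabulary, which is why the tokenizer must match the model.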

To implement, we will use the `AutoTokenizer.from_pretrained()` function, specifying the same tokenizer as the model, thus ensuring consistency between text processing during training and inference.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(id_model)

## Pipeline Creation

Now we will create a pipeline for text generation using our previously loaded model and tokenizer. The HuggingFace pipeline function simplifies the process of performing various natural language processing tasks by providing a high-level interface.

A pipeline is an abstraction that simplifies the use of pre-trained models for a variety of NLP tasks. It provides a unified API for different tasks like text generation, text classification, translation, and more.


Parameters:

* `"text-generation"`: specifies the task that the pipeline is configured to execute. In this case, we are setting up a pipeline for text generation. The pipeline will use the model to generate text based on a provided prompt.
* `model=model`: specifies the pre-trained model that the pipeline will use. Here, we are passing the model that we loaded earlier. This model is responsible for generating text based on input tokens.
* `tokenizer=tokenizer`: specifies the tokenizer that the pipeline will use. We pass the tokenizer we loaded earlier to ensure that the input text is tokenized correctly and the output tokens are decoded accurately.

In [None]:
pipe = pipeline("text-generation", model = model, tokenizer = tokenizer)

In [None]:
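generation_args = {
    "max_new_tokens": 500,  # upper limit on the number of generated tokens
    "return_full_text": False,  # return only the new text, not the prompt
    "temperature": 0.1,  # try values from 0.1 to 0.9; higher means more random output
    "do_sample": True,  # sample tokens, so the answer is not always the same
}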
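
To build some intuition for the `temperature` setting, the sketch below applies a temperature-scaled softmax to a few made-up scores. The logits are hypothetical values and the function is a plain-Python illustration, not the code the library runs internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.1, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all probability goes to the top-scoring token, so sampling behaves nearly deterministically; at a higher temperature the alternatives become more likely to be picked.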

Generating output: `output = pipe(prompt, **generation_args)` passes the input prompt and the generation arguments to the text generation pipeline, which generates a response based on that prompt and the specified parameters.

* `**generation_args`: This unpacks the `generation_args` dictionary and passes its contents as keyword arguments to the pipeline, customizing the text generation process.
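
The unpacking behavior can be checked without loading a model. In the sketch below, `fake_pipe` is a hypothetical stand-in that simply returns the settings it receives; it is not part of transformers.

```python
def fake_pipe(prompt, max_new_tokens=20, return_full_text=True,
              temperature=1.0, do_sample=False):
    """Stand-in for a pipeline call: it only reports the settings it received."""
    return {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "return_full_text": return_full_text,
        "temperature": temperature,
        "do_sample": do_sample,
    }


demo_args = {"max_new_tokens": 500, "return_full_text": False,
             "temperature": 0.1, "do_sample": True}

# Unpacking the dictionary is equivalent to writing every keyword by hand.
unpacked = fake_pipe("What is quantum computing?", **demo_args)
explicit = fake_pipe("What is quantum computing?", max_new_tokens=500,
                     return_full_text=False, temperature=0.1, do_sample=True)
print(unpacked == explicit)
```

Keeping the settings in one dictionary means they can be changed in a single place and reused across every call.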

In [None]:
prompt = "What is quantum computing?"
#prompt = "Calculate 7 x 6 - 42?"

output = pipe(prompt, **generation_args)

In [None]:
output

In [None]:
print(output[0]['generated_text'])

In [None]:
prompt = "Calculate 7 x 6 - 42?"
output = pipe(prompt, **generation_args)
print(output[0]['generated_text'])

In [None]:
prompt = "Who was the first person on the moon?"
output = pipe(prompt, **generation_args)
print(output[0]['generated_text'])

Notice that the model continued generating after giving the answer, which is why it took longer this time.
What happens is that the model keeps "talking to itself", as if simulating a conversation. This is expected behavior, since the prompt does not include what we call an end token. This will be explained in detail later; for now, what you need to know is that we avoid this behavior by using templates, which are generally recommended by the model's authors (or by the community).
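
As a rough illustration of what an end token is for, the sketch below trims a made-up output at a hypothetical `<END>` marker. In practice generation stops as soon as the model emits its end token; trimming afterwards, as done here, only imitates that effect.

```python
def cut_at_end_token(text, end_token="<END>"):
    """Keep only the text before the first end marker, if one is present."""
    head, _separator, _tail = text.partition(end_token)
    return head.strip()


# Hypothetical raw output in which the model kept going after its answer.
raw = "Neil Armstrong.<END> And who was the second? Buzz Aldrin.<END>"
print(cut_at_end_token(raw))
```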



And how do we evaluate which model performs best for a given task?
Here, for example, we are testing general factual knowledge. To find out which model performs best at a task like this, you can search for benchmarks and leaderboards.

## Templates and prompt engineering

Prompt templates help translate user input and parameters into instructions for a language model. This can be used to guide a model's response, helping it understand context and generate relevant, more coherent output.

> Fixing text that keeps being generated after the answer

To find the appropriate template, always check the model's description (model card), for example: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

In the case of Phi 3, the authors recommend the template below.

Note: later we will see a way to apply this template automatically, without having to copy and paste it here by hand.

Tags of the form `<|name|>` are what we call special tokens. They delimit the beginning and end of each piece of text and tell the model how we want the message to be interpreted.

The special tokens used to interact with Phi 3 are these:

* `<|system|>`, `<|user|>` and `<|assistant|>`: correspond to the role of each message. The roles used here are system, user and assistant.

* `<|end|>`: Marks the end of a message. It plays the role of the EOS (End of Sequence) token, telling the model where the text ends.

We will use `.format` to insert the prompt into this template, so we don't need to retype it manually.

In [None]:
template = """<|system|>
You are a helpful assistant.<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(prompt)

In [None]:
template

In [None]:
output = pipe(template, **generation_args)
print(output[0]['generated_text'])

With this, the problem is resolved. We can run another test (in fact, the prompt below is an easy way to check whether the model responds well in Portuguese).

In [None]:
prompt = "Você entende português?"  # "Do you understand Portuguese?"

template = """<|system|>
You are a helpful assistant.<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(prompt)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])

In [None]:
prompt = "What is AI?"  # @param {type:"string"}

template = """<|system|>
You are a helpful assistant.<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(prompt)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])

In [None]:

prompt = "What is AI?" # @param {type:"string"}

sys_prompt = "You are a helpful assistant."

template = """<|system|>
{}<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(sys_prompt, prompt)

print(template)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])
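
Since the same template is repeated in several cells, it can be wrapped in a small helper. This is a sketch rather than anything provided by the library: `PHI3_TEMPLATE` and `build_phi3_prompt` are hypothetical names, and unlike the cells above it does not wrap the user message in double quotes.

```python
PHI3_TEMPLATE = "<|system|>\n{system}<|end|>\n<|user|>\n{user}<|end|>\n<|assistant|>"


def build_phi3_prompt(user, system="You are a helpful assistant."):
    """Fill the Phi-3 style template with a system prompt and a user message."""
    return PHI3_TEMPLATE.format(system=system, user=user)


text = build_phi3_prompt("What is AI?")
print(text)
```

With the pipeline from earlier, a call such as `pipe(build_phi3_prompt(prompt), **generation_args)` would then replace the repeated template code.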

> Example with code generation

We can modify the system prompt further.

More modern models require less prompt engineering in this regard. Simply telling the model that it is an assistant is enough in most cases.


In [None]:
prompt = "Write Python code that generates the Fibonacci sequence"

sys_prompt = "You are an expert programmer. Show the code and explain it"

template = """<|system|>
{}<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(sys_prompt, prompt)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])

In [None]:
def fibonacci(n):
    """Return the first n numbers of the Fibonacci sequence."""
    a, b = 0, 1
    sequence = []
    while len(sequence) < n:
        sequence.append(a)
        a, b = b, a + b
    return sequence


# Usage example:
n = 10  # How many Fibonacci numbers to generate
print(fibonacci(n))

### Improving results

**Exploring changes to the prompt**
* The model may fail depending on the type of request. We will see many ways to improve the results.
* For now, remember to check whether your prompt could be more specific. If you still have difficulty achieving the expected result after improving the prompt (and after trying other parameters), the model is probably not well suited to the task.
* Bonus tip for code generation: a prompt that can be convenient if you want to use an LLM as your co-pilot:
> "Refactor using concepts such as SOLID, Clean Code, DRY, KISS and if possible apply one or more appropriate design patterns aiming at scalability and performance, creating an organized folder structure and separating by files" (and of course, you can modify this as you wish)

* Note: Sometimes it is better to keep the prompt simple and not too elaborate. Including too many different pieces of information or references can confuse the model. It is therefore worth adding or removing terms one at a time when you are experimenting and looking for better results.

**Exploring other models**
* To achieve more accurate results, you can look for larger and more modern models with more parameters (remember the trade-off between efficiency and quality of responses) or for models focused on the desired task, for example code generation or conversation/chat.
* For the code generation example, you could use the [deepseek-coder 6.7B](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) model (or look for others with this focus).


### Where to find prompts

Creating your own prompt can be ideal if you want to reach very specific cases.
But if you don't have much time to experiment (or don't know the best way) a good tip is to search for prompts on the internet.

There are several websites and repositories that provide prompts made by the community.

One example is the LangSmith hub (https://smith.langchain.com/hub), which is part of the LangChain ecosystem. This will be very convenient later, as we will see how to pull prompts hosted there with a single function call.

## Message Format

An increasingly common use case for LLMs is chat. In a chat context, instead of continuing a single string of text (as is the case with a standard language model), the model continues a conversation consisting of one or more messages, each of which includes a role, such as "user" or "assistant", as well as message text.

The prompt can therefore also be structured as shown below. We will look at this in more detail when we use LangChain, which offers additional features that improve the usability of this mode.

`msg`: This list contains the input messages that we want the model to respond to. Each message is a dictionary with the keys `role` and `content`.

* `role`: `"user"` indicates that the message is from the user. Other possible roles include `"system"` and `"assistant"`, the latter for earlier model replies in a multi-turn conversation. Different models may name their roles differently; these are the ones Phi 3 expects.

* `content`: The actual text of the message, in this case our prompt.
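
To connect this message list with the special tokens from the previous section, here is a minimal sketch of how such a list could be rendered into Phi-3 style text. `render_phi3_chat` is a hypothetical helper written for illustration; the real chat template ships with the tokenizer and may differ in details such as whitespace.

```python
def render_phi3_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dictionaries with Phi-3 style role tags."""
    allowed_roles = {"system", "user", "assistant"}
    parts = []
    for message in messages:
        role = message["role"]
        if role not in allowed_roles:
            raise ValueError(f"unexpected role: {role!r}")
        parts.append(f"<|{role}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        parts.append("<|assistant|>\n")  # cue the model to answer next
    return "".join(parts)


demo_msg = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AI?"},
]
rendered = render_phi3_chat(demo_msg)
print(rendered)
```

Each message is closed with the end token, and the trailing assistant tag tells the model that its turn comes next.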

In [None]:
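prompt = "O que é IA?"  # "What is AI?"

msg = [
    # System prompt: "You are a helpful virtual assistant. Answer the questions in Portuguese."
    {"role": "system", "content": "Você é um assistente virtual prestativo. Responda as perguntas em português."},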
    {"role": "user", "content": prompt}
]

output = pipe(msg, **generation_args)
print(output[0]["generated_text"])

In [None]:
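prompt = "Liste o nome de 10 cidades famosas da Europa"  # "List the names of 10 famous European cities"
# System prompt: "You are a helpful travel assistant. Answer the questions in Portuguese."
prompt_sys = "Você é um assistente de viagens prestativo. Responda as perguntas em português."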

msg = [
    {"role": "system", "content": prompt_sys},
    {"role": "user", "content": prompt},
]

output = pipe(msg, **generation_args)
print(output[0]['generated_text'])

However, when running more modern models such as Phi 3 or Llama 3 in this way, it is recommended to follow the appropriate template as mentioned above, which is a more reliable way to prevent the model from continuing to talk to itself after giving the desired answer.

This concern does not arise with some LangChain methods, or when using proprietary models such as those from OpenAI or Gemini.

> Check GPU

To check how much GPU memory has already been consumed, you can use the command below. With these smaller models we don't need to worry, but after prolonged use and after switching between several different models it is worth keeping an eye on consumption to avoid the error "OutOfMemoryError: CUDA out of memory."

If this error occurs, simply restart the session. Note: you will need to re-run the import cells and the cells that declare the variables you use.

In [None]:
!nvidia-smi
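
If you want the numbers in Python rather than as a printed table, one option is to call `nvidia-smi` with its CSV query flags. The helper below is a sketch: `gpu_memory_mib` is a hypothetical name, and it returns `None` when no NVIDIA driver is available.

```python
import shutil
import subprocess


def gpu_memory_mib():
    """Return [(used, total), ...] in MiB per GPU, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    usage = []
    try:
        for line in result.stdout.strip().splitlines():
            used, total = (int(field) for field in line.split(","))
            usage.append((used, total))
    except ValueError:
        return None  # unexpected output format
    return usage


print(gpu_memory_mib())
```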

### Optimizing with quantization

So far we have used a lightweight model (Phi-3 mini 4k), but running much larger models can be challenging with limited resources, especially on the free version of Google Colab. However, with quantization techniques and `BitsAndBytesConfig` from the transformers library, it is possible to load and run much larger models efficiently without significantly compromising output quality.

Quantization techniques reduce memory and computation costs by representing weights and activations with lower precision data types, such as 8-bit integers (int8). This allows us to load larger models and speed up inference.
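
The idea of storing weights as 8-bit integers can be sketched in a few lines of plain Python. This is simple absmax quantization on made-up weight values, shown only to illustrate the principle; it is not the NF4 scheme that bitsandbytes applies later in this notebook.

```python
def quantize_int8(weights):
    """Absmax quantization: map floats to integers in [-127, 127] plus one scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid dividing by zero
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in q_weights]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"largest rounding error: {worst:.4f}")
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half the scale factor.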


To run the model efficiently on Google Colab, we will use BitsAndBytesConfig to enable 4-bit quantization. This configuration helps reduce the memory footprint and computational load, making it feasible to use large models on limited hardware resources.

More about quantization here:
https://huggingface.co/blog/4bit-transformers-bitsandbytes

There are also other quantization solutions (for example [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) or [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)) that may give better or worse results depending on the model and hardware.
If you would rather not deal with optimization and performance while still maintaining quality, consider using a paid solution.

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

Explanation of the above parameters:

`load_in_4bit` - This parameter enables 4-bit quantization. When set to True, model weights are loaded with 4-bit precision, significantly reducing memory usage.
* Impact: Much lower memory usage, usually with only a small impact on model accuracy.

`bnb_4bit_quant_type` - Specifies the type of 4-bit quantization to use. "nf4" stands for NormalFloat4, a quantization scheme designed for normally distributed weights that helps maintain model quality while reducing precision.
* Impact: Balances the trade-off between model size and quality.

`bnb_4bit_use_double_quant` - When set to True, this parameter enables double quantization, which also quantizes the quantization constants produced by the first quantization step.
* Impact: Saves a little more memory (roughly 0.4 bits per parameter, according to the blog post linked above) with little effect on quality.

`bnb_4bit_compute_dtype` - Defines the data type used for calculations. Weights are stored in 4 bits but dequantized to this type for computation. torch.bfloat16 (Brain Floating Point) uses 16 bits while keeping the same exponent range as 32-bit floating point numbers.
* Impact: Efficient calculations with a modest loss of precision.

To apply quantization, we now load the model with `AutoModelForCausalLM.from_pretrained`, as before, this time passing the `quantization_config`. Note that this is a gated repository: you must accept Meta's license on the model page, and the token defined earlier must belong to that account.

**Note:** if you load this model without quantization, it will run out of memory on Colab. Quantization is therefore essential if we want to use a model of this size on Colab's free GPU.
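
The note above can be checked with rough arithmetic. The sketch below estimates the memory needed for the weights alone, assuming a round figure of 8 billion parameters for an "8B" model; activations, the KV cache, and quantization overhead are left out, so real usage is higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

In 16-bit precision the weights by themselves come close to the 16 GB of a T4, which leaves no room for anything else, while the 4-bit version fits with plenty of headroom.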

In [None]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Typically, when preparing input for an LLM, one uses `tokenizer.apply_chat_template`, which wraps each message in the model's special tokens, including the end token after each turn.

More on chat templates:
https://huggingface.co/docs/transformers/chat_templating

In [None]:
prompt = "Quem foi a primeira pessoa no espaço?"  # "Who was the first person in space?"
messages = [{"role": "user", "content": prompt}]

We recommend using the Hugging Face `tokenizer.apply_chat_template()` function, which automatically applies the correct chat template for the respective model. It is easier and less error-prone than writing the chat template by hand. `return_tensors="pt"` specifies that the returned tensors should be in PyTorch format.

The remaining lines of code tokenize the input messages, move the tensors to the correct device, generate new tokens based on the given inputs, decode the generated tokens back into readable text, and finally return the generated text.

* `model_inputs = encodeds.to(device)` - Moves the encoded tensors to the specified device for processing by the model.
  * `encodeds` - The tensors generated in the previous line.
  * `to(device)` - Moves the tensors to the specified device, which can be a CPU or GPU.

* `generated_ids = model.generate(...)` - Generates a sequence of tokens from `model_inputs`.
  * `model.generate` - Model function that generates text based on the provided inputs.
  * `model_inputs` - The processed inputs, ready to be used by the model.
  * `max_new_tokens=1000` - Limits generation to a maximum of 1000 new tokens.
  * `do_sample=True` - Enables random sampling during generation, which can result in more varied outputs.
  * `pad_token_id=tokenizer.eos_token_id` - Uses the end-of-sequence token as the padding token, which avoids a warning when the tokenizer does not define a padding token of its own.

* `decoded = tokenizer.batch_decode(generated_ids)` - Decodes the generated IDs back to readable text.
  * `tokenizer.batch_decode` - A function that converts a batch of token ID sequences back to text.
  * `generated_ids` - The IDs of the tokens generated in the previous step.

* `res = decoded[0]` - Extracts the first item from the list of decoded texts, which corresponds to the generated text for the first (and here only) input provided.

In [None]:
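# Apply the model's chat template and tokenize; add_generation_prompt=True appends
# the assistant header so the model answers instead of continuing the user turn.
encodeds = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
model_inputs = encodeds.to(device)  # move the input ids to the GPU (or CPU)
generated_ids = model.generate(model_inputs, max_new_tokens = 1000, do_sample = True,
                               pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)  # token ids back to text
res = decoded[0]  # single input, so take the first decoded string
res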
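
For decoder-only models, `model.generate` returns the prompt tokens followed by the new tokens, so the decoded text includes the prompt as well. The sketch below shows the slicing idea on plain lists with made-up token ids; `split_generation` is a hypothetical helper.

```python
def split_generation(prompt_ids, generated_ids):
    """Separate the echoed prompt from the newly generated token ids."""
    if generated_ids[:len(prompt_ids)] != prompt_ids:
        raise ValueError("generated sequence does not start with the prompt")
    return generated_ids[len(prompt_ids):]


prompt_ids = [101, 7, 42, 9]            # hypothetical ids for the prompt
generated_ids = [101, 7, 42, 9, 5, 88]  # prompt followed by two new tokens
print(split_generation(prompt_ids, generated_ids))
```

With the tensors from the cell above, the equivalent slice would be `generated_ids[0, model_inputs.shape[1]:]`, which can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` to get only the reply.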

You will see that LangChain gives us more options and tools, as the library offers a complete ecosystem integrated with the main and most modern language modeling solutions, both open and proprietary.

So why is it worth knowing the method shown here if LangChain offers more options? It can be useful when you are testing a new, recently published model that is not yet widely supported.

Even LangChain, dealing with literally thousands of different models, may hit some incompatibility when loading them. This is usually fixed by the development team in a future release, but not always immediately, and other fixes can only be found by searching forums, since they are published by the community.

So knowing this method can be useful when you are testing the latest open-source models and they do not load correctly with LangChain.

It may be a small inconvenience for some, but this is the "price" of being on the frontier: using the most modern open-source models, and using them for free.