# Tokenizers and models

Let's begin by testing how to use tokenizers and models from HuggingFace.

In [9]:
%pip install transformers
%pip install datasets
%pip install openai
%pip install scikit-learn
%pip install numpy
%pip install sentence_transformers



In [10]:

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
    pipeline
)
from typing import List
from datasets import load_dataset
from openai import AzureOpenAI
from sklearn.metrics import accuracy_score
from sklearn.neighbors import NearestNeighbors
import os
import numpy as np
from sentence_transformers import SentenceTransformer

# Let's test text generation with different models

### Load GPT-2 model and tokenizer from Huggingface

In [None]:
# Load the gpt-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load the gpt-2 model with the text generation head
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")

### Try out the loaded tokenizer

In [None]:
# Encoding can be done with encode method
input_text = "The most important thing in life is"
print("Input text was: ", input_text, "\n")

encoded_input = tokenizer.encode(input_text)
print("Encoded input:", encoded_input, "\n")

# Decoding can be done with the decode method
# When decoding the encoded input, the tokenizer should return the original text.
decoded_input = tokenizer.decode(encoded_input)
print("Decoding the tokens back to original input: ", decoded_input)
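
To make the encode/decode round trip concrete, here is a toy word-level tokenizer. It is only a sketch: GPT-2 actually uses byte-pair encoding over sub-word units, and the class `ToyTokenizer` and its vocabulary are made up for this illustration.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical word-level tokenizer; NOT how GPT-2's BPE tokenizer works."""

    def __init__(self, corpus: str):
        # Build the vocabulary from the unique whitespace-separated words.
        words = sorted(set(corpus.split()))
        self.token_to_id: Dict[str, int] = {word: idx for idx, word in enumerate(words)}
        self.id_to_token: Dict[int, str] = {idx: word for word, idx in self.token_to_id.items()}

    def encode(self, text: str) -> List[int]:
        # Map every word to its integer id.
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids: List[int]) -> str:
        # Map ids back to words and join them with single spaces.
        return " ".join(self.id_to_token[i] for i in ids)


toy_text = "The most important thing in life is"
toy_tokenizer = ToyTokenizer(toy_text)
toy_ids = toy_tokenizer.encode(toy_text)
print("Toy token ids:", toy_ids)
print("Decoded again:", toy_tokenizer.decode(toy_ids))
```

The real tokenizer follows the same contract (text in, integer ids out, and back), but its vocabulary is fixed at training time and it can split unseen words into smaller pieces instead of failing on them.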

### Try out the loaded GPT-2 model

In [None]:
# Inference can be done by calling .generate method of the model
model_output = gpt2_model.generate(**tokenizer(input_text, return_tensors="pt"), max_new_tokens=10)

print("Model output is just tokens:")
print(model_output[0])

print("\nModel output needs to be decoded with the tokenizer to get meaningful words:")
print(tokenizer.decode(model_output[0]))


### TODO
The output above was reasonably coherent for the GPT-2 model. What happens if you increase `max_new_tokens`?

Try it out!
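
As a rough mental model of what `max_new_tokens` controls, the sketch below generates one token at a time and stops either when the budget is used up or when an end-of-sequence marker appears. The lookup table `TOY_NEXT_TOKEN` and the function `toy_generate` are hypothetical stand-ins for a language model and are not part of the `transformers` API.

```python
from typing import Dict, List

# Hypothetical next-token table standing in for a real language model.
TOY_NEXT_TOKEN: Dict[str, str] = {
    "life": "is",
    "is": "to",
    "to": "be",
    "be": "happy",
    "happy": "<eos>",
}


def toy_generate(prompt_tokens: List[str], max_new_tokens: int) -> List[str]:
    """Append at most max_new_tokens tokens, stopping early at <eos>."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Always pick the single most likely continuation (greedy choice).
        next_token = TOY_NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


short_output = toy_generate(["life"], max_new_tokens=2)
long_output = toy_generate(["life"], max_new_tokens=10)
print(short_output)
print(long_output)
```

A small budget cuts the continuation off mid-thought, while a large budget only helps until the model decides to stop. A real model without a sensible stopping point may instead start repeating itself as the budget grows.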

### Try out a model trained for classification

The previous GPT-2 model was trained for the causal language modelling task, i.e. to predict how a text continues. Let's try out a model trained for a classification task.

ProsusAI/finbert model description:

"FinBERT is a pre-trained NLP model to analyze sentiment of financial text. It is built by further training the BERT language model in the finance domain, using a large financial corpus and thereby fine-tuning it for financial sentiment classification."


In [None]:
# Load the finbert tokenizer
finbert_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

# Load the finbert model with the sequence classification head
finbert_model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

### Try out the classification model

Notice that the model is now invoked by calling it directly rather than through the `.generate` method, and that the `max_new_tokens` parameter does not exist here.

In [None]:
input_text = "Top private equity firms put brakes on China dealmaking"
model_output = finbert_model(**finbert_tokenizer(input_text, return_tensors="pt"))
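print("Model output (raw logits, one per class: positive, negative or neutral sentiment):")
# The first element of the output holds the logits; they are unnormalized scores, not probabilities
print(model_output[0])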


### TODO

1. Make sure you understand the model output.
2. Try out the finbert model some more and test it with other inputs. Can you find examples for which it outputs a faulty classification (sentiment)?
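
One way to read the classifier output is to turn the logits into probabilities with a softmax. The sketch below uses made-up logit values and an assumed label order; for the real mapping, check `finbert_model.config.id2label` rather than relying on `ASSUMED_LABELS`.

```python
import math
from typing import Dict, List

# Assumed label order for illustration only; verify against finbert_model.config.id2label.
ASSUMED_LABELS = ["positive", "negative", "neutral"]


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_probabilities(logits: List[float], labels: List[str]) -> Dict[str, float]:
    """Pair each label with its softmax probability."""
    return dict(zip(labels, softmax(logits)))


example_logits = [-1.2, 2.3, 0.4]  # made-up values, not real model output
probabilities = logits_to_probabilities(example_logits, ASSUMED_LABELS)
predicted_label = max(probabilities, key=probabilities.get)
print(probabilities)
print("Predicted:", predicted_label)
```

With the real model you would pass something like `model_output.logits[0].tolist()` instead of `example_logits`.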

### Let's test some more advanced models through Azure APIs

It's easy to deploy models to the cloud by using any of the LLM API providers. Let's test how to run models deployed using Azure AI services.

In [11]:
# TODO: Insert the provided API key here
from google.colab import userdata
api_key_gpt4o = userdata.get("AZURE_GPT4O_KEY")

GPT-4o mini is specifically built for chat, so the deployed model has a "chat/completions" endpoint. Notice also that the input has a pre-defined structure: a list of messages, each of which has "role" and "content" fields.
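
The message structure can be sketched with plain Python dictionaries. The helpers `build_messages` and `validate_messages` below are hypothetical conveniences for this notebook and are not part of the `openai` package.

```python
from typing import Dict, List

# Roles commonly used in chat-style requests.
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_messages(system_prompt: str, user_prompt: str) -> List[Dict[str, str]]:
    """Create the smallest useful conversation: one system and one user message."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def validate_messages(messages: List[Dict[str, str]]) -> None:
    """Raise ValueError if a message lacks a field or uses an unexpected role."""
    for message in messages:
        if set(message) != {"role", "content"}:
            raise ValueError(f"Unexpected fields: {sorted(message)}")
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"Unexpected role: {message['role']}")


example_messages = build_messages(
    "You are a helpful assistant.",
    "Give me four basic ingredients for crepes.",
)
validate_messages(example_messages)
print(example_messages)
```

A list built this way can be passed as the `messages` argument of `client.chat.completions.create`.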

In [12]:
deployment_name="gpt-4o-mini"
api_version="2024-08-01-preview"
task = "chat/completions"
endpoint = "https://aiservices-forge-test-westeu.openai.azure.com/"

client = AzureOpenAI(
    api_key=api_key_gpt4o,
    api_version=api_version,
    azure_endpoint=endpoint,
)

messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me four basic ingredients for crepes. Answer only with a list of ingredients."},
]
chat_completion = client.chat.completions.create(
    model=deployment_name,
    messages=messages
)
chat_completion.choices[0].message.content


'1. Flour  \n2. Milk  \n3. Eggs  \n4. Butter  '

In [13]:
# TODO: Insert the provided API key here
from google.colab import userdata
api_key_gpt35 = userdata.get("AZURE_GPT35_KEY")

The GPT-3.5 model used here is trained for causal language modelling (text continuation), and the deployed model has a "completions" endpoint for that purpose.

In [14]:
api_version="2024-02-01"
endpoint = "https://aiservices-forge-test-swe.openai.azure.com/"


client = AzureOpenAI(
    api_key=api_key_gpt35,
    api_version=api_version,
    azure_endpoint = endpoint
    )

input = "Basic ingredients for crepes are: "
response = client.completions.create(model="gpt-35-turbo-instruct", prompt=input, max_tokens=50)

print(f"Input: {input}")
print(f"Response: {response.choices[0].text}")

Input: Basic ingredients for crepes are: 
Response:  

1 cup all-purpose flour
2 large eggs
1 cup milk
1/2 cup water
1/4 teaspoon salt
2 tablespoons melted butter

Optional ingredients for sweet crepes:
1 tablespoon sugar
1 teaspoon vanilla extract

Optional ingredients


You can also deploy models for text embeddings. Let's try one out.

In [None]:
# TODO: Insert the provided API key here
api_key_embedding = os.getenv("AZURE_EMBEDDINS_KEY")

In [None]:
#TODO: deploy this

deployment_name="text-embedding-3-large"
api_version="2023-05-15"
endpoint = "https://aiservices-forge-test-swe.openai.azure.com/"

client = AzureOpenAI(
    api_key=api_key_embedding,
    api_version=api_version,
    azure_endpoint = endpoint
    )

input = "Some text to generate embeddings for."
response = client.embeddings.create(model=deployment_name, input=input)

print(f"Input: {input}")
print(f"Response: {response.data[0].embedding}")

Suggestions for things to try out later on:
1. Search HuggingFace for some models that look interesting and try them out. You can also use the HuggingFace portal's "Inference API" directly if you want.
2. Test different embedding models. Can you mix and match different models, i.e. are the embeddings somehow comparable across different models?
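
When comparing embeddings, cosine similarity is a common starting point. The sketch below is a plain-Python version with made-up vectors; it refuses vectors of different lengths, which is what you typically get from two different embedding models. Even when the lengths happen to match, vectors from different models generally live in unrelated spaces, so a similarity between them is usually not meaningful.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of the same length."""
    if len(a) != len(b):
        raise ValueError(f"Dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```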

### HuggingFace pipeline

HuggingFace also has a convenient `pipeline` abstraction for model inference. It offers a simple API for running models without the need to load, for instance, tokenizers separately.


In [None]:
pipe = pipeline("text-classification", model="ProsusAI/finbert")

input_text = "Top private equity firms put brakes on China dealmaking"
pipe(input_text)

# Embeddings and RAG

Next, let's build a very simple RAG application. The application uses financial news articles as its database. Given one article, it finds similar articles and generates some additional information about the retrieved ones.

### Load a dataset from HuggingFace

In [15]:
fina_news = load_dataset("Aappo/fina_news_1000")

The loaded dataset contains financial news data (news headline, journalists, date, link to the article and the article text).

In [16]:
fina_news['train'][0]

{'Headline': 'Ivory Coast Keeps Cocoa Export Tax Below 22%, Document Shows',
 'Journalists': ['Baudelaire Mieu'],
 'Date': Timestamp('2011-10-06 15:14:20'),
 'Link': 'http://www.bloomberg.com/news/2011-10-06/ivory-coast-keeps-cocoa-export-tax-below-22-document-shows.html',
 'Article': 'Export taxes on cocoa beans from Ivory Coast , the world’s biggest producer of the chocolate ingredient, won’t exceed 22 percent of the international price this season, meeting a commitment to the International Monetary Fund , according to a finance ministry document. In the 2008-9 season taxes averaged 25.3 percent of international prices, the IMF said in a document posted on its website in November last year. While the country met the commitment in the season just ended, it had a change in government earlier this year. The rate meets a demand by the International Monetary Fund and the World Bank to reform the Ivorian cocoa and coffee industries in order to comply with the terms of its Heavily Indebted 

We will use an embedding model from HuggingFace. Embedding models can be loaded by using the SentenceTransformer class.

In [17]:
embedder = SentenceTransformer("msmarco-distilbert-base-v4")

### Some helper functions

Let's define some helper functions for generating a vector index and for searching the index. In this example case the vector index is a scikit-learn nearest neighbour model.

In [18]:
def index_documents_huggingface(articles:List[str]):
    embeddings = embedder.encode(articles)
    nbrs = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(embeddings)
    return nbrs

In [19]:
def get_nearest_neighbours_huggingface(nbrs, article:str, all_articles: List[str], n_neighbors:int=2):
    embedding = embedder.encode(article)
    # kneighbors returns a (distances, indices) tuple, each with shape (1, n_neighbors)
    distances, indices = nbrs.kneighbors([embedding], n_neighbors=n_neighbors)
    neighbour_articles = np.array(all_articles)[indices[0]]
    return neighbour_articles, distances
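
To see what the nearest neighbour index computes, here is a brute-force sketch in plain Python. It uses Euclidean distance, which is the default metric of scikit-learn's `NearestNeighbors`; the function names and the tiny two-dimensional vectors are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_neighbours(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    n_neighbors: int = 2,
) -> Tuple[List[float], List[int]]:
    """Return (distances, indices) of the closest vectors, nearest first."""
    scored = sorted(
        (euclidean_distance(query, vector), index)
        for index, vector in enumerate(vectors)
    )
    top = scored[:n_neighbors]
    return [distance for distance, _ in top], [index for _, index in top]


# Made-up 2-dimensional vectors standing in for article embeddings.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
toy_distances, toy_indices = brute_force_neighbours([0.9, 0.1], toy_vectors, n_neighbors=2)
print(toy_indices, toy_distances)
```

Comparing the query against every stored vector like this scales linearly with the number of articles. Tree-based or approximate indexes exist to avoid that cost on larger collections.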

### Let's index the articles

This can take a short while on Colab, so we are only using the first 100 articles.

In [20]:
nbrs_huggingface = index_documents_huggingface(fina_news["train"]["Article"][:100])

### Find the similar articles of a given one

Let's pick one article from our article catalog:

In [21]:
article = fina_news["train"]["Article"][10]
display(article)

In [None]:

nearest_articles = get_nearest_neighbours_huggingface(nbrs=nbrs_huggingface, all_articles=fina_news["train"]["Article"][:1000], article=article, n_neighbors=5)
display(nearest_articles)

### Generate some additional information about the retrieved articles

Let's start by generating short summaries of the retrieved articles. There are specialized summarization models as well, but in this case we'll use prompting with the GPT-4o mini model.

In [None]:
deployment_name="gpt-4o-mini"
api_version="2024-08-01-preview"
endpoint = "https://aiservices-forge-test-westeu.openai.azure.com/"

client = AzureOpenAI(
    api_key=api_key_gpt4o,
    api_version=api_version,
    azure_endpoint=endpoint,
)


for article in nearest_articles[0]:
    messages = [{"role":"system", "content": "You are a helpful assistant giving short one sentence summary of the given text."},
                {"role": "user", "content": article}]
    response = client.chat.completions.create(model=deployment_name, messages=messages, max_tokens=100)
    print("\nSummary:")
    display(response.choices[0].message.content)


### TODO

You can continue to develop this application further:

1. How could you use the GPT-4o model to classify the articles based on for instance their topic or sentiment?
2. How could you change the prompt to use GPT-4o to explain why the articles are similar to each other?
3. What if you use the above `ProsusAI/finbert` model for classification? If there are errors, how could you prevent those?
4. In what type of real-life scenario could you use this type of retrieval setup?
5. Modify the code so that you use the model `text-embedding-3-large` for generating the embeddings.
6. Try deploying your own LLM on an API provider's infrastructure and use it to (1) generate the embeddings and (2) generate the additional information.
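
As a possible starting point for the first item, the sketch below builds a chat prompt that asks for a single sentiment label and parses the reply defensively. The helper names, the label set and the truncation limit are assumptions made for this example; the actual call to `client.chat.completions.create` is left out so the snippet runs without an API key.

```python
from typing import Dict, List

# Assumed label set for illustration; adjust it to your own classification task.
SENTIMENT_LABELS = ("positive", "negative", "neutral")


def build_sentiment_messages(article: str, max_chars: int = 2000) -> List[Dict[str, str]]:
    """Build chat messages asking for exactly one sentiment label."""
    system_prompt = (
        "You are a financial news classifier. "
        f"Answer with exactly one word: {', '.join(SENTIMENT_LABELS)}."
    )
    # Truncate long articles to keep the request small (arbitrary limit).
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": article[:max_chars]},
    ]


def parse_sentiment(reply: str) -> str:
    """Normalise the model reply; fall back to 'unknown' for anything unexpected."""
    normalised = reply.strip().lower().rstrip(".")
    return normalised if normalised in SENTIMENT_LABELS else "unknown"


sentiment_messages = build_sentiment_messages("Top private equity firms put brakes on China dealmaking")
print(sentiment_messages)
print(parse_sentiment(" Negative."))
```

Restricting the answer to a fixed label set and mapping anything else to `unknown` makes it easier to spot replies where the model did not follow the instruction.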