# Exercise 1: Pipelines and APIs
In this exercise, we will explore the capabilities of LLMs for natural language processing (NLP) tasks using the Hugging Face (HF) ecosystem. First, we will use the `sentence-transformers` package to extract features from text data by running language models from HF in your own environment. Second, we will use the HF `InferenceClient` API to generate text by running language models hosted on HF servers. 

By the end of this exercise, you will have learned how to:
- Extract features (embeddings) from text data using LLMs via `sentence-transformers`
- Generate text using LLMs via the HF API

## Using Notebook Environments 
1. To run a cell, press `shift + enter`. The notebook will execute the code in the cell and move to the next cell. If the cell is a markdown cell (text only), it will render the markdown and move to the next cell.
2. Since cells can be executed in any order and variables can be overwritten, you may at some point feel that you have lost track of the state of your notebook. If this is the case, you can always restart the notebook by clicking Runtime in the menu bar (if you're using Colab) and selecting `Restart runtime`. This will clear all variables and outputs.
3. The final variable in a cell will be printed on the screen. If you want to print preceding variables, use the `print()` function as usual.

Notebook environments support code cells and markdown (text) cells. For the purposes of this workshop, markdown cells are used to provide high-level explanations of the code. More specific details are provided in the code cells themselves in the form of comments (lines beginning with `#`).

## Environment Setup

In [1]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    
    # Installing requisite packages
    !pip install --upgrade transformers sentence-transformers &> /dev/null

# pip install --upgrade scikit-learn

We begin by loading the requisite packages. For those coming from R, packages in Python are sometimes given shorter names for use in the code via the `import <name> as <nickname>` syntax (e.g. `import pandas as pd`). The nicknames are usually standardized. Here we make use of three packages:

1. `pandas`: A very popular package for reading and manipulating data in Python.
2. `sentence_transformers`: A module for extracting features from text data using LLMs.
3. `huggingface_hub`: A high-level API to interact with models hosted on the HF Hub. 

In [2]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from huggingface_hub import InferenceClient
import textwrap



Next, we load the HF API key from a local `API_key` module stored in the `src` folder.

In [3]:
import os

# Assuming 'src' sits next to the current directory (i.e., inside the parent directory)
path_to_src = os.path.join('..','src')  # Moves one level up, then into the 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
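
As an alternative to a local `API_key` module, the token could be read from an environment variable. The sketch below is a minimal illustration and not part of the workshop code: the helper `load_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption about where you have stored your token.

```python
import os


def load_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in the environment variable `env_var`.

    Hypothetical helper: the variable name is an assumption, adjust it to
    whatever name you used when exporting your token.
    """
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a readable message instead of a failed API call later
        raise RuntimeError(f"No access token found in environment variable {env_var!r}")
    return token

# Example usage (assumes the variable has been set beforehand):
# client = InferenceClient(token=load_hf_token())
```

Keeping the token outside the notebook avoids accidentally sharing it when the notebook is passed on to others.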

## Feature Extraction with `sentence_transformers`

The following begins by extracting features (embeddings), which are numerical representations of the meaning of text, using the `sentence_transformers` package. To start, it uses three sentences that the code cell places in a list of strings. This list is provided as input to the model.

The code makes use of the `all-MiniLM-L6-v2` model, which is a small and efficient embedding model, to extract features from the sentences. The model will encode the sentences into 384-dimensional vector representations. The cell will then print the features as a pandas dataframe for easy viewing. 

Run the cell below. 

In [4]:
# Define sentences
sentences = [
    "I feel great this morning",
    "I am feeling very good today",
    "I am feeling terrible"
]

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract features
features = model.encode(sentences)

# Print the features as a pandas dataframe
pd.DataFrame(features, index=sentences)



| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I feel great this morning | -0.026462 | -0.044373 | 0.072443 | 0.034525 | 0.089534 | -0.050451 | 0.018811 | 0.071296 | -0.020522 | -0.043637 | ... | -0.005689 | -0.000328 | -0.049055 | 0.016308 | -0.027642 | 0.017276 | 0.065253 | 0.017496 | -0.02281 | -0.036687 |
| I am feeling very good today | -0.043895 | -0.020341 | 0.066563 | -0.00631 | 0.02598 | -0.04042 | 0.079304 | -0.0097 | -0.04292 | -0.025988 | ... | -0.045309 | 0.049151 | -0.049057 | 0.017821 | -0.018061 | -0.010441 | 0.04307 | 0.01844 | -0.008274 | -0.006016 |
| I am feeling terrible | 0.017495 | -0.057904 | 0.033315 | 0.00171 | 0.051957 | -0.048159 | 0.007659 | 0.119096 | 0.029929 | -0.06896 | ... | 0.038813 | 0.003015 | -0.074585 | -0.018391 | -0.026449 | 0.005867 | 0.051495 | -0.009829 | 0.030009 | -0.064299 |


In [5]:
similarities = model.similarity(features, features)
print(similarities)

tensor([[1.0000, 0.7923, 0.5926],
        [0.7923, 1.0000, 0.5782],
        [0.5926, 0.5782, 1.0000]])
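
To make the similarity scores above less of a black box, here is a minimal sketch of cosine similarity, which to our understanding is the default measure used by `model.similarity`. The function name `cosine_similarity` and the two-dimensional toy vectors are illustrative assumptions, chosen so the arithmetic can be checked by hand. They are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings" (made up for illustration only)
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

# Pairwise similarity matrix, analogous in shape to the tensor printed above
toy_similarities = [
    [cosine_similarity(a, b) for b in toy_vectors] for a in toy_vectors
]

for row in toy_similarities:
    print([round(value, 2) for value in row])
```

As in the tensor above, the diagonal is 1 because every vector points in exactly the same direction as itself, and the matrix is symmetric.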


**TASK 1**: Have a scroll through the features printed by the cell. Can you see that the features of the first two sentences are more similar to each other (i.e., have similar numerical values) than they are to the third sentence? Why do you think this is the case?

**TASK 2**: Try to add another sentence to the `sentences` list defined above by copy-pasting one of the existing sentences but replacing one or two words with a synonym. For instance, you could change "I feel *great* this morning" to "I feel *fantastic* this morning". Then rerun the cell. What do you notice about the features of this new sentence compared to the original?

## Text Generation with `huggingface_hub`
This section demonstrates how to use the HF API. The main benefit of the API is that it allows us to run the latest, largest open models without the specialised hardware needed to run them (since the models run in the cloud). We will use the `meta-llama/Meta-Llama-3-8B-Instruct` model to start with (we will show you how to use the larger 70B model in the exercise).

The code begins by initializing the `InferenceClient` with an access token, **which you will need to replace with your own [access token](https://huggingface.co/settings/tokens)** (access tokens start with 'hf_...'). 

It is common for large text-generation models to take a "system-user" (or "system-user-assistant") prompting format. The format begins with `"system"` and then alternates between `"user"` and `"assistant"` roles to generate a chat-like conversation. In this case, the "system" prompt provides the general role that the model should play, and the "user" prompt provides the task-specific details. The optional `"assistant"` role can be used to add past model responses to the prompt.
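
To illustrate the alternating format, the sketch below builds a message list that includes a past model response. The helper `build_messages` and the placeholder contents are hypothetical and only meant to show the structure. The resulting list has the same shape as the `messages` argument used in this section.

```python
def build_messages(system_content, turns):
    """Build a chat message list from a system prompt and alternating turns.

    `turns` is a list of strings in the order user, assistant, user, ...
    (hypothetical helper for illustration).
    """
    messages = [{"role": "system", "content": system_content}]
    for i, content in enumerate(turns):
        # Even positions are user turns, odd positions are assistant turns
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": content})
    return messages


example_messages = build_messages(
    "You are a helpful assistant.",
    [
        "Summarize the following text: <text>",      # user
        "Here is a summary: <summary>",              # assistant (past response)
        "Now shorten the summary to one sentence.",  # user
    ],
)

for message in example_messages:
    print(message["role"], "->", message["content"])
```

Including the earlier assistant response lets the model see the conversation so far, which is how follow-up requests such as "shorten the summary" can refer back to previous output.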

Run the cell below.


In [8]:
# Initialize client
# api_key = '<your_access_token>' 
client = InferenceClient(token=key.hugging_api_key_read)

# Create prompts
system_content = "You are a helpful assistant."
user_content = """
    Summarize the following text:
    
    Once upon a time in a land far far away, there was a young prince named John. He was known for his bravery and courage. 
    One day, he decided to go on an adventure to explore the unknown lands. The prince rode his horse through the dense forests,
    crossed the vast deserts, and climbed the highest mountains. After many days of travel, he finally reached the edge of the world.
    There, he found a magical portal that led to a parallel universe. The prince stepped through the portal and found himself in a
    world filled with strange creatures and mystical beings. He knew that he had found his true calling and decided to stay in this
    new world forever.
"""

# Feed prompts into model
output = client.chat_completion(
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ],
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_tokens=500
)

# Accessing the text in the output object
text = output.choices[0].message.content

# Printing the output in a more readable format
print('\n'.join(textwrap.wrap(text, 100)))

Here's a summary of the text:  Prince John, known for his bravery and courage, sets out on an
adventure to explore unknown lands. He travels through dense forests, vast deserts, and high
mountains before reaching the edge of the world, where he finds a magical portal to a parallel
universe. He decides to stay in this new world, where he finds strange creatures and mystical
beings, and knows he's found his true calling.


**TASK 1**: Try copy-pasting an abstract of one of your papers into the `user_content` variable and rerun the cell. Does Llama-3 do a good job of summarizing the work?<br>
**TASK 2**: Change the `system_content` to `"You are an incredibly unhelpful assistant."` and rerun the cell. Note the power of the system prompt for directing the model's behavior.<br>
**TASK 3**: Change the `system_content` back to `"You are a helpful assistant."`. Now try changing the `max_tokens` parameter and see how it affects the length of the summaries.<br>
**TASK 4**: Play around, experiment, and have fun with the model! 

