LLM Workshop Use Case 2: Translation and coding of posts.

Author: Simon van Baal

Date: 20240405

# Setup

In [None]:
# Install custom package
!pip install openai

# Import packages
from google.colab import drive
import pandas as pd
import os
from openai import OpenAI

from transformers import pipeline

Collecting openai
  Downloading openai-1.31.0-py3-none-any.whl (324 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, openai
Successfully installed h11-0.14.0 httpcore-1.0.5 

In [None]:
# Mount google drive to enable access to data files

drive.mount('/content/drive/')
# You will be asked to give permission here.

Mounted at /content/drive/


## Working Directory

This step is a little tricky. The command below sets the working directory. The part of the path up to and including "My Drive" is the same for everyone (even if your Google Drive is set to a different language). If the workshop folder sits at the top level of your Drive, simply delete "projects/" from the path below and it should run.

Check the output carefully: the cell below will tell you if it could not find the directory, but the notebook keeps running anyway, so later cells may fail or look in the wrong folder.
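If you would rather have the notebook stop when the folder is missing, the check can be made explicit. The sketch below uses plain `os.chdir` instead of the `%cd` magic; `change_to_project_dir` is a hypothetical helper name, not part of the workshop materials.

```python
import os

def change_to_project_dir(path):
    """Change into `path`, raising an error if the folder does not exist."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Directory not found: {path}")
    os.chdir(path)
    # Return the new working directory so it can be inspected.
    return os.getcwd()

# Usage sketch; adjust the path to your own Drive layout.
# change_to_project_dir("/content/drive/My Drive/projects/workshop-llm_uni-wien-main/")
```

Because the function raises an error, any cell that calls it stops immediately instead of continuing in the wrong folder.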

In [7]:
# You will need to change this line below to suit your directory, unfortunately.
%cd /content/drive/My Drive/projects/workshop-llm_uni-wien-main/

/content/drive/My Drive/projects/workshop-llm_uni-wien-main


In [8]:
# Use this to check that you are in the right folder.
# print() is needed here: a notebook only displays the last expression in a cell.
print(os.listdir())

# Load in some data.
df = pd.read_csv("data/data_workshop-llm.csv")

In [9]:
# Find the right model for your translation task on the Hugging Face model hub.
# https://huggingface.co/models - simply copy the title of the model and paste.

translate_de = pipeline("translation",
                        model = "Helsinki-NLP/opus-mt-de-en",
                        max_length = 800)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





In [10]:
translate_de("Hallo, wie geht's?")

[{'translation_text': 'Hello, how are you?'}]
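As the output above shows, the pipeline returns a list with one dictionary per input text. A minimal sketch of pulling out just the translated string follows; `extract_translation` is a hypothetical helper name (not part of transformers), and the sample result is copied from the output above.

```python
def extract_translation(result):
    """Return the translated string from a translation pipeline result."""
    # The result is a list of dicts, each with a 'translation_text' key.
    return result[0]["translation_text"]

sample_result = [{'translation_text': 'Hello, how are you?'}]
print(extract_translation(sample_result))  # Hello, how are you?
```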

In [11]:
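# Let us wrap this in a function, so it is easier to repeat the same operation.
def translate_to_english(text, language="German"):
  # Start with None so there is always something to return, even on failure.
  translation = None
  # The default language is German; it is the only translator set up so far.
  if language == "German":
    try:
      translation = translate_de(text)
    except Exception as error:
      print("Something went wrong with the text input:", error)
  else:
    print("No other translators have been set yet.")
  return translation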



In [12]:
# Note that Python starts indexing from zero, not from one, so range(0, 10)
# below covers the first ten posts (rows 0 to 9).
df_translation = pd.DataFrame()

for row in range(0,10):
  # Grab post from df.
  post = df.iloc[row]

  # Translate it.
  english_text = translate_to_english(str(post["content"]))
  print(english_text)

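  # Create an output list: column names first, then the row of values.
  # The translator returns a list with one dict, so we pull out its text.
  translated = english_text[0]["translation_text"] if english_text else None
  output = [["channel", "user_id", "content", "timestamp", "views", "translation"]]
  output.append(list(post[["channel", "user_id", "content", "timestamp", "views"]]) + [
                                translated])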

  # Convert it to a data frame so we can save to csv.
  df_out=pd.DataFrame(output[1:],
                        columns=output[0])
  df_out.to_csv("output/post"+str(row)+"_workshop.csv")


[{'translation_text': 'Livechat Monday 31 October 2022 8 pm with **Hannes Brejcha**😉 的[ Hier](https://t.me/Fairdenkenoriginal)'}]
[{'translation_text': '"The energy crisis hits private households with full force: Gas and electricity prices have now reached a record level. From 2023 on, consumers even have to adjust to a tripling of reductions. In addition, the heating season soon begins and many people are faced with the question: How can I save energy? Business Insider has talked to ten young people about how they are in crisis and how they save energy. In the talks, we have learned that despair seems to be still limited – rather, they are trying to find creative solutions." https://www.businessinsider.de/politics/germanland/I-will-me-now-one-flowerpot-oven-building-ten-young-people-give-tips-to-energysaving/'}]
[{'translation_text': '*At the end of last week, the fourth Corona vaccine was approved in the EU with the vector vaccine of Johnson & Johnson. The approval in the USA was alr

# Fancier Models

Now we move on to more cutting-edge models.

The first problem is that most of our machines don't have enough memory (RAM) to hold these models: many have more than 7 billion parameters. We would need to pay for Google Colab Pro to get more RAM.

The second problem is that many top-of-the-line models are not open source, so we need to pay a (usually small) fee to use them.

Now let's try OpenAI's models. You first need to sign up at platform.openai.com, and then, hopefully, you get $5 of free credit. To generate an API key for authentication, go to "API Keys" in the menu on the left and press "create secret key".
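Pasting a key directly into a notebook makes it easy to share by accident. One common alternative is to read it from an environment variable. A minimal sketch, assuming the key has been stored under the name `OPENAI_API_KEY` (the variable the OpenAI client also looks for by default); `get_api_key` is a hypothetical helper.

```python
import os

def get_api_key(var_name="OPENAI_API_KEY"):
    """Read the API key from an environment variable, failing loudly if unset."""
    key = os.environ.get(var_name, "")
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key

# Usage sketch: client = OpenAI(api_key=get_api_key(), timeout=30)
```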

In [None]:
# Set up connection to OpenAI; generate an API key, and
# paste it in between the quotes below.
client = OpenAI(
    api_key='',
    timeout=30
)

# =========================== Define primary coding function
def chat(system_msg,
         user_assistant,
         model,
         temperature,
         top_p,
         json):
  assert isinstance(system_msg, str), "`system_msg` should be a string"
  assert isinstance(user_assistant, list), "`user_assistant` should be a list"
  assert isinstance(json, bool), "`json` should be one of True/False"

  # Define inputs to LLM, first we have "system message" to give it guidance.
  system_msg = [{"role": "system",
                 "content": system_msg}]

  # We now feed it the context of the conversation that occurred before.
  # if role = assistant, it was the LLM speaking, if role = user, it was us.
  user_assistant_msgs = [
        {"role": "assistant",
         "content": user_assistant[i]} if i % 2 else {"role": "user",
                                                      "content": user_assistant[i]}
        for i in range(len(user_assistant))
        ]
  msgs = system_msg + user_assistant_msgs

  # If we specify no json output, we will get a regular text response.
  if not json:
    response = client.chat.completions.create(
        model=model,
        messages=msgs,
        temperature=temperature,
        seed=1,
        top_p=top_p
    )
  elif not any(new_model in model for new_model in ["gpt-4o",
                                                    "gpt-4",
                                                    "gpt-3.5-turbo"]):
    raise ValueError("You have selected JSON output. This model cannot handle that.")
  else:
    # Otherwise proceed and produce JSON output.
    response = client.chat.completions.create(model=model,
                                              messages=msgs,
                                              temperature=temperature,
                                              top_p=top_p,
                                              seed=1,
                                              response_format={"type": "json_object"})

  return response
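The role assignment inside `chat` is easiest to see in isolation. The sketch below repeats the same logic in a standalone function (`build_messages` is a hypothetical name used only for illustration) and shows that even positions in the list become user turns and odd positions become assistant turns. It makes no API call.

```python
def build_messages(system_msg, user_assistant):
    """Combine a system message with alternating user/assistant turns."""
    messages = [{"role": "system", "content": system_msg}]
    for i, text in enumerate(user_assistant):
        # Even positions are our prompts, odd positions are the model's replies.
        role = "assistant" if i % 2 else "user"
        messages.append({"role": role, "content": text})
    return messages

example = build_messages("You are a helpful coder.",
                         ["First post", "First answer", "Second post"])
print([m["role"] for m in example])
# ['system', 'user', 'assistant', 'user']
```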


In [None]:
codable_post = pd.read_csv("output/post2_workshop.csv", index_col=False)

system = ("You're an assistant helping me code posts from Telegram about war"
          " and vaccination."
          " Please tell me whether the post I enter is about war, vaccination,"
          " or neither."
          " Answer in JSON format: {'cat': category of post [war/vax/none],"
          " 'stance': whether it is pro war/vaccination, or anti [pro/anti/NA].}")

# Select the post. I am selecting the original German here, which it can handle.
# Column 3 is "content", because the saved row index takes up column 0.
prompt = [str(codable_post.iloc[0, 3])]

# Q: How would you select the translated column?

model_output = chat(system_msg = system,
                user_assistant = prompt,
                model = "gpt-3.5-turbo",
                temperature = 1,
                top_p = .9,
                json = True)



In [None]:
# We can see it adequately shows us what is happening in this post.
print(model_output.choices[0].message.content)
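Because the response was requested in JSON format, the returned string can be turned into a Python dictionary with the standard `json` module. A minimal sketch; the `example_content` string is a made-up stand-in for `model_output.choices[0].message.content`, not real model output.

```python
import json

# Hypothetical reply in the format requested by the system message.
example_content = '{"cat": "vax", "stance": "anti"}'

# Parse the JSON string into a dictionary so the codes can be used directly.
coded = json.loads(example_content)
print(coded["cat"], coded["stance"])  # vax anti
```

From here, the values could be written into new columns of the data frame, one row per coded post.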


# The way this is set up, if we save the user prompt and the LLM message in a list,
# and make a new request afterwards, it will be automatically passed to the
# model in the correct order.
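A minimal sketch of that bookkeeping, assuming the `user_assistant` list format used by `chat` above (our prompts at even positions, model replies at odd positions). `add_turn` is a hypothetical helper and all strings are placeholders.

```python
def add_turn(history, reply, next_prompt):
    """Return a new history with the model's reply and our next prompt appended."""
    return history + [reply, next_prompt]

history = ["Post text goes here"]                     # our first prompt
history = add_turn(history,
                   '{"cat": "war", "stance": "NA"}',  # placeholder model reply
                   "Why did you choose this category?")
print(len(history))  # 3
# The list can then be passed on as chat(..., user_assistant=history, ...).
```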