# Epilogue
## Using open source LLMs for analysing historical documents

Make sure you are using a [GPU](https://cloud.google.com/gpu) when running the code below.

To enable one, go to **`Runtime`**, select **`Change runtime type`**, then choose `T4 GPU` (or any other available GPU).

In [None]:
# install the transformers and datasets libraries (-U already upgrades, so --upgrade is not needed)
!pip install -q -U "transformers==4.40.0" datasets

In [None]:
import warnings
warnings.filterwarnings('ignore') # suppress all warnings to keep the output readable

In [None]:
import transformers
from datasets import Dataset
from tqdm import tqdm
import pandas as pd
import torch

In [None]:
device = 'cuda' # make sure you use a GPU
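
If no GPU is attached to the runtime, loading the model later on will fail. A minimal sketch for checking this up front, assuming PyTorch is installed as in the imports above; `pick_device` is a hypothetical helper and not part of the notebook:

```python
def pick_device() -> str:
    """Return 'cuda' if PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing, so nothing can run on a GPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

If this prints `cpu`, change the runtime type before continuing: an 8B-parameter model is unlikely to run at a usable speed without a GPU.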

In [None]:
# load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/kasparvonbeelen/lancaster-newspaper-workshop/wc/data/subsample500mixedocr-selected_mitch.csv')
df.head(3)

In [None]:
# for the purposes of this exercise, we remove both very short and long documents from the dataset
df = df[df.word_count.between(10,250)].reset_index()
df.shape

## The Hugging Face Hub

In the example below, we will experiment with Llama-3-8B, part of Llama 3, a recent series of open-source LLMs created by Meta. To use Llama 3 you need to:

- Make an account on Hugging Face: https://huggingface.co/
- Go to the Llama-3-8B model page and accept the terms of use; you should get a reply swiftly: https://huggingface.co/meta-llama/Meta-Llama-3-8B
- Create a user access token with read access: https://huggingface.co/docs/hub/en/security-tokens
- Run the code cell below to log into the Hugging Face Hub and paste the access token when prompted.
- Reply `n` to the question 'Add token as git credential? (Y/n)'.

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
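
The interactive prompt above can be awkward when a notebook is re-run often. As an alternative, here is a minimal sketch that logs in with `huggingface_hub.login` using a token stored in an environment variable. The helper `login_from_env` is hypothetical, and storing the token under the name `HF_TOKEN` is an assumption made for this example:

```python
import os


def login_from_env(env=None, var_name="HF_TOKEN") -> bool:
    """Log into the Hugging Face Hub if a token is found; return True on login."""
    env = os.environ if env is None else env
    token = env.get(var_name)
    if not token:
        # no token available, fall back to the interactive login
        return False
    # imported lazily, so the library is only needed when a token exists
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_from_env()` returns `False` when the variable is not set, in which case the `huggingface-cli login` cell is still needed.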


## Load the LLM model

In [None]:
# define the model, we use the instruct variant
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"

# instantiate a text generation pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=checkpoint,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device=device,  # reuse the device defined at the top of the notebook
)

# stop tokens: generation ends at the tokenizer's EOS token or at Llama 3's end-of-turn token <|eot_id|>
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]


## Prompting

- System message: describes how you want the LLM to work, i.e. the behaviour you want it to exhibit.
- User message: the content you want to process (or want the LLM to act on).

```python
messages = [
  {
    "role" : "system",
    "content": "<system prompt here>"
  },
  {
    "role" : "user",
    "content": "<user prompt here>"
  }
]
```

Define a message by articulating a system and user prompt.

In [None]:
messages = [
    {
        "role": "system",
        "content": """
          You are a helpful AI that will assist me with analysing and reading newspaper articles.
          Read the newspaper articles attentively and extract the required information.
          Each newspaper article will be enclosed with triple hash tags (i.e. ###).
          Don't make things up! If the information is not in the article then just say 'Dunno'"""
    },
    {
        "role": "user",
        "content": f"""Provide a short description of the principal characters portrayed in the newspaper article.
                  ###{df.iloc[0].text}###"""
    }
]
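
Because every prompt in this notebook follows the same pattern (a system message, a user instruction, and an article wrapped in `###`), the list can also be built with a small helper. This is only a sketch: `build_messages` is a hypothetical function and the article text is an invented sample, not taken from the dataset.

```python
def build_messages(system_prompt: str, user_prompt: str, article: str) -> list:
    """Build a system/user message list with the article enclosed in ###."""
    return [
        {"role": "system", "content": system_prompt.strip()},
        {"role": "user", "content": f"{user_prompt.strip()}\n\n###{article}###"},
    ]


demo = build_messages(
    "You are a helpful AI that reads newspaper articles.",
    "Summarize the article content in one sentence.",
    "A fire broke out in Mill Street last night.",  # invented sample text
)
for message in demo:
    print(message["role"], "->", message["content"])
```

Keeping the delimiter logic in one place makes it harder to forget the closing `###` when the prompts are changed later.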

In [None]:
def get_completion(messages: list, temperature=.1, top_p=.1) -> str:
  """Get a completion for a given system and user prompt.
    Arguments:
    messages (list): a list containing a system and a user message as
      Python dictionaries with keys 'role' and 'content'
    temperature (float): regulates the randomness ("creativity") of the text generation
    top_p (float): cumulative probability mass of the candidate tokens
      included in the sampling process
  """
  prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
      )

  outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=temperature,
    top_p=top_p,
      )
  return outputs[0]["generated_text"][len(prompt):]
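
To make the role of `apply_chat_template` and of the `<|eot_id|>` terminator more concrete, the sketch below approximates how a list of messages is flattened into a single prompt string for Llama 3 Instruct. `render_llama3_prompt` is a hypothetical stand-in written for illustration; the template shipped with the tokenizer remains the authoritative version and may differ in detail.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 Instruct chat layout as a plain string."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # every turn starts with a role header and ends with <|eot_id|>
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += message["content"].strip() + "<|eot_id|>"
    if add_generation_prompt:
        # an open assistant header invites the model to write its reply
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example = render_llama3_prompt([
    {"role": "system", "content": "You are a helpful AI."},
    {"role": "user", "content": "Summarize the article."},
])
print(example)
```

Each turn closes with `<|eot_id|>`, which is why that token is listed as a terminator: the model emits it when its own turn is finished. The pipeline returns the prompt followed by the reply, so slicing with `len(prompt)` keeps only the newly generated text.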

In [None]:
print(get_completion(messages))
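
The `temperature` and `top_p` arguments are easier to understand on a toy example. The sketch below is a simplified illustration in plain Python, not the implementation used by `transformers`: it applies temperature scaling to three invented scores and then keeps the most likely tokens until their cumulative probability reaches `top_p`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(score - highest) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # renormalise so the surviving probabilities sum to one
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens
for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs], top_p_filter(probs, top_p=0.1))
```

With the low defaults used in `get_completion` (`temperature=.1`, `top_p=.1`), nearly all probability mass sits on the single most likely token, so the output is close to deterministic even though `do_sample=True`.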

## Exercise

- Change the system message and ask the model to reply in medieval French.
- Change the user message and ask the model to summarize the article and condense it to one sentence.

In [None]:
# Enter code here

#### Solution

In [None]:
messages = [
    {"role": "system", "content": """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! Answer in medieval French!"""},
    {"role": "user", "content": f"""Provide a short description of the principal characters portrayed in the newspaper article.
    ###{df.iloc[0].text}###"""}
]

print(get_completion(messages))


In [None]:
messages = [
    {"role": "system", "content": """
    You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! If the information is not in the article then just say 'Dunno'"""},
    {"role": "user", "content": f"""Summarize the article content in one sentence.
    ###{df.iloc[0].text}###"""}
]

print(get_completion(messages))

## Applying text generation to historical documents


### Example 1: Summarization

In [None]:
df_small = df.sample(5, random_state=1984).reset_index(drop=True)
df_small

In [None]:

def apply_completions(item: pd.Series,
                      system_message: str,
                      user_message: str,
                      text_column: str = 'text') -> str:
  """
  Apply a system and user prompt to a single document and return the completion.
  Arguments:
    item (pd.Series): row from a pandas DataFrame
    system_message (str): system prompt, specifies how the system
      should behave
    user_message (str): user prompt, gives instructions on how to
      process each historical document. The document itself is appended
      from the 'text_column' argument
    text_column (str): name of the text column
  """
  messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
      ]
  # append the document, enclosed in ###, to the user prompt
  messages[1]['content'] += f"\n\n###{item[text_column]}###"
  return get_completion(messages)

In [None]:
tqdm.pandas() # use tqdm to view progress

system_message = """You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract the required information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up! If the information is not in the article then just say 'Dunno'"""
user_message = "Summarize the article content in one sentence."

df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)

In [None]:
# get the summaries
df_small['completion']

### Example 2: Biography as microgenre

In [None]:
df_small = df.sample(10, random_state=1984).reset_index(drop=True)

In [None]:
system_message = """You are a helpful AI that will assist me with analysing and reading newspaper articles.
    Read the newspaper articles attentively and extract structured information.
    Each newspaper article will be enclosed with triple hash tags (i.e. ###).
    Don't make things up!"""


user_message = """Who are the characters portrayed in the article?
    Extract biographical information from the newspaper article.
    For each identified person return a nested Python dictionary with the key equal to the name of the individual.
    The values consist of dictionaries that record specific attributes such as age, gender, nationality, profession, place of birth etc.
    The format has to be a Python dictionary, do not add extra text!"""

In [None]:
df_small['completion'] =  df_small.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


In [None]:
df_small['completion'][5]

In [None]:
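import ast

# drop any introductory text ending in 'format:' and parse the rest;
# unlike eval, ast.literal_eval only accepts Python literals and never executes model output
ast.literal_eval(df_small['completion'][5].split('format:\n\n')[-1].strip())

In [None]:
ast.literal_eval(df_small['completion'][4].split('format:\n\n')[-1].strip())

In [None]:
ast.literal_eval(df_small['completion'][2].split('format:\n\n')[-1].strip())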
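
The model does not always respect the requested format, so parsing a completion can fail. A more defensive sketch is shown below: it looks for the first dictionary or list in the text and returns `None` when nothing parseable is found. `parse_literal` is a hypothetical helper and the sample completions are invented for illustration.

```python
import ast
import re


def parse_literal(completion: str):
    """Return the first Python dict or list found in a completion, or None."""
    # greedy match from the first opening bracket to the last closing bracket
    match = re.search(r"[\[{].*[\]}]", completion, flags=re.DOTALL)
    if match is None:
        return None
    try:
        # literal_eval only accepts literals, it never runs code
        return ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError, TypeError):
        return None


good = "Here is the information in the requested format:\n\n{'John Smith': {'age': 34, 'profession': 'miner'}}"
print(parse_literal(good))
print(parse_literal("Dunno"))
```

Applied to a whole column, for example with `df_small['completion'].apply(parse_literal)`, failed rows show up as `None` and can be inspected or re-prompted instead of stopping the notebook with an exception.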

### Example 3: OCR correction

In [None]:
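# take the five documents with the lowest OCR quality score;
# copy so a new column can be added without a SettingWithCopyWarning
df_small_bad_ocr = df.sort_values('ocrquality').head(5).copy()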

In [None]:
user_message = """Transcribe the text and correct typos and errors in the text caused by bad optical character recognition (OCR).
Do not add any information that is not in the original text!"""

df_small_bad_ocr['completion'] = df_small_bad_ocr.progress_apply(apply_completions,system_message=system_message, user_message=user_message, axis=1)


In [None]:
print(df_small_bad_ocr.iloc[0]['text'])

In [None]:
print(df_small_bad_ocr.iloc[0]['completion'])

In [None]:
print(df_small_bad_ocr.iloc[4]['text'])

In [None]:
print(df_small_bad_ocr.iloc[4]['completion'])

In [None]:
df_small_bad_ocr.to_csv('newspaper_ocr_corrected.csv')
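
Since the model may silently add or drop words while correcting, it is worth checking what actually changed. A minimal sketch with the standard library's `difflib` is given below; `word_changes` and `similarity` are hypothetical helpers and the two strings are invented examples, not rows from the dataset.

```python
import difflib


def word_changes(original: str, corrected: str):
    """List (operation, original words, corrected words) for every changed span."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def similarity(original: str, corrected: str) -> float:
    """Character-level similarity between 0 (unrelated) and 1 (identical)."""
    return difflib.SequenceMatcher(None, original, corrected, autojunk=False).ratio()


ocr = "Tlie accidcnt occurred near the station"  # invented noisy text
fixed = "The accident occurred near the station"  # invented correction
print(word_changes(ocr, fixed))
print(round(similarity(ocr, fixed), 2))
```

A very low similarity, or long `insert` spans in the list of changes, suggests that the model rewrote the text rather than correcting it, and that the row deserves a manual check.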

## Combining document filtering and targeted prompting

Below, we combine many of the things we covered in the previous notebook. Instead of running an LLM on all the documents, we use regular expressions to select a relevant subset of newspaper articles and then use the LLM to extract structured information from them.
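
Before filtering the full dataset, it helps to check the regular expression on a few strings. The sketch below uses invented sentences to show what a word-boundary pattern for "accident" does and does not match:

```python
import re

# match 'accident' or 'accidents' as whole words, ignoring case
pattern = re.compile(r"\baccidents?\b", re.I)

samples = [  # invented sentences for illustration
    "A serious ACCIDENT occurred on the railway.",
    "Two accidents were reported yesterday.",
    "The death was ruled accidental.",
    "Market prices remained steady.",
]
matches = [bool(pattern.search(sentence)) for sentence in samples]
print(matches)  # [True, True, False, False]
```

The word boundary `\b` keeps "accidental" out of the selection. Whether that is desirable depends on the research question, so the pattern is worth adjusting before running the more expensive LLM step.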

In [None]:
import re
pattern = re.compile(r'\baccidents?\b', re.I) # match 'accident' or 'accidents', ignoring case
# keep only rows whose text matches the regex; copy so a new column can be added later
df_kw_sample = df[df.text.apply(lambda t: bool(pattern.search(t)))].copy()

# define the user message; we retain the system message from the previous examples
user_message = """Does the newspaper article describe a historical accident? If not, return an empty Python list.
If it does describe an accident, extract information on the people involved in the accident.
Return a list of Python dictionaries. For each dictionary the key is equal to the name of the person.
The values list characteristics of this person such as gender, age and occupation.
Only return the Python list and no additional text!
"""

# apply messages
df_kw_sample['completion'] = df_kw_sample.progress_apply(apply_completions, user_message=user_message, system_message=system_message, axis=1)
# save outputs
df_kw_sample.to_csv('accidents.csv')

In [None]:
df_kw_sample['completion']

In [None]:
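import ast
# parse the output as a Python literal; unlike eval, this never executes model output
ast.literal_eval(df_kw_sample.iloc[0]['completion'])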
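
Once a completion has been parsed, the nested structure requested in the prompt (a list of dictionaries keyed by person name) can be flattened into one row per person. This is a sketch under the assumption that the model followed the requested format; `flatten_people` is a hypothetical helper and the example data is invented.

```python
def flatten_people(parsed):
    """Turn [{name: {attribute: value}}] into a list of flat row dictionaries."""
    rows = []
    for entry in parsed:
        for name, attributes in entry.items():
            row = {"name": name}
            if isinstance(attributes, dict):
                # copy the attributes next to the name
                row.update(attributes)
            rows.append(row)
    return rows


example = [  # invented output in the requested format
    {"John Smith": {"gender": "male", "age": 34, "occupation": "miner"}},
    {"Mary Jones": {"gender": "female"}},
]
rows = flatten_people(example)
print(rows)
```

The resulting list can be passed to `pd.DataFrame(rows)`; attributes the model did not return for a person then appear as missing values.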

## Exercise

Experiment with your own system and user message! Have fun :-)

In [None]:
# enter code here

# Fin.