# Adapting LLaMA-2 for NER Tasks Using Few-shot In-Context Learning

This notebook evaluates the LLaMA-2 model on Named Entity Recognition (NER) using few-shot in-context learning. It first loads the required libraries and datasets and initializes the LLaMA-2 model. It then builds few-shot prompts that ask the model for a sequence of BIO tags for each sentence, submits these prompts to the model, and parses the predicted tags from the generated text. The last section evaluates the model's NER performance with precision, recall, and F1 score for each language.

Some of our evaluations were conducted in this Google Colab notebook using the setup described in the paper.

In [1]:
# Installing libraries
! pip install transformers pandas huggingface_hub seqeval accelerate > /dev/null 2>&1

In [2]:
# Logging into Hugging Face using an authorization token
! huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
# Importing Libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd
import seqeval.metrics
import json
from IPython.display import clear_output

In [4]:
# Mounting Google Drive to access files stored there
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Evaluation settings
FEW_SHOT_SIZE = 10 # Number of Examples included in the prompt
SAMPLE_SIZE = 300 # Number of samples it is evaluated on

In [6]:
# Path to files on Google Drive
folder_path = '/content/drive/MyDrive/CPSC_488_Data/'

## Loading Data

In [5]:
def load_ner_data(file_path):
  """
  Loads NER data from a specified file and formats it into a DataFrame.

  Parameters:
  file_path (str): The path to the file containing the NER data.

  Returns:
  pandas.DataFrame: A DataFrame with three columns: 'sentence_id', 'words', and 'tags'.
                    Each row represents a sentence, with its list of tokens and the matching list of NER tags.
  """

  current_words = []
  current_tags = []
  sentence_id = 0
  sentences_data = []  # List to store each sentence's data

  with open(file_path, 'r', encoding='utf-8') as file:
      for line in file:
          line = line.strip()

          # A blank line or a '# id' header marks a sentence boundary
          if not line or line.startswith('# id'):
              if current_words:  # Save the previous sentence
                  sentences_data.append({"sentence_id": sentence_id, "words": current_words, "tags": current_tags})
                  sentence_id += 1
              current_words = []
              current_tags = []
          else:
              # Split the line into columns
              parts = line.split()
              current_words.append(parts[0])
              current_tags.append(parts[-1])  # The last column is the tag

  # Save the final sentence if the file does not end with a blank line
  if current_words:
      sentences_data.append({"sentence_id": sentence_id, "words": current_words, "tags": current_tags})

  # Build the DataFrame directly from the list of sentence records
  data = pd.DataFrame(sentences_data, columns=["sentence_id", "words", "tags"])

  return data
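
The loader above relies on a particular file layout: a `# id` header line, one token per line with the tag in the last column, and a blank line between sentences. The following is a minimal sketch of that parsing logic on an in-memory sample. The sample text (including the `_ _` filler columns and the `domain=en` suffix) and the helper name `parse_conll` are assumptions made for illustration, not taken from the dataset files.

```python
import io

# Hypothetical two-sentence sample in the layout the loader expects.
SAMPLE = """# id 0001 domain=en
the _ _ O
boeing _ _ B-AerospaceManufacturer
company _ _ I-AerospaceManufacturer

# id 0002 domain=en
phoenix _ _ B-HumanSettlement
"""

def parse_conll(handle):
    """Group token lines into (words, tags) pairs, one pair per sentence."""
    sentences, words, tags = [], [], []
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("# id"):
            # Sentence boundary: store what has been collected so far
            if words:
                sentences.append((words, tags))
            words, tags = [], []
        else:
            parts = line.split()
            words.append(parts[0])   # first column is the token
            tags.append(parts[-1])   # last column is the tag
    if words:  # flush the last sentence when there is no trailing blank line
        sentences.append((words, tags))
    return sentences

parsed = parse_conll(io.StringIO(SAMPLE))
for words, tags in parsed:
    print(list(zip(words, tags)))
```

The sample deliberately ends without a trailing blank line, which is the case where the final flush matters.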

In [7]:
# Loading Files
# All of these files have already been uploaded to the Google Drive directory at this time

# Bangla Data
bn_test_file_path = folder_path + 'bn_test.conll'
bn_test_ner_data = load_ner_data(bn_test_file_path)

# Farsi Data
fa_test_file_path = folder_path + 'fa_test.conll'
fa_test_ner_data = load_ner_data(fa_test_file_path)

# Hindi Data
hi_test_file_path = folder_path + 'hi_test.conll'
hi_test_ner_data = load_ner_data(hi_test_file_path)

# Portuguese Data
pt_test_file_path = folder_path + 'pt_test.conll'
pt_test_ner_data = load_ner_data(pt_test_file_path)

# Italian Data
it_test_file_path = folder_path + 'it_test.conll'
it_test_ner_data = load_ner_data(it_test_file_path)

# Ukrainian Data
uk_test_file_path = folder_path + 'uk_test.conll'
uk_test_ner_data = load_ner_data(uk_test_file_path)

# English Data
en_test_file_path = folder_path + 'en_test.conll'
en_test_ner_data = load_ner_data(en_test_file_path)

## Peeking at the Dataset

In [None]:
bn_test_ner_data.head()

Unnamed: 0,sentence_id,words,tags
0,0,"[প্রোপেলারটি, একটি, ডি, হ্যাভিল্যান্ড, এয়ারক্...","[O, O, B-AerospaceManufacturer, I-AerospaceMan..."
1,1,"[এটি, ১৯৫৫, সালে, ১৯৫৭, সালে, নর্ড, এভিয়েশন, ...","[O, O, O, O, O, B-AerospaceManufacturer, I-Aer..."
2,2,"[পাঁচটি, f108, চালিত, উদাহরণগুলি, বোয়িং, কোম্...","[O, B-OtherPROD, O, O, B-AerospaceManufacturer..."
3,3,"[গ্রিনহিল, ১৯৯৫, সালে, l'auto-neige, bombardie...","[O, O, O, B-AerospaceManufacturer, I-Aerospace..."
4,4,"[যদিও, ডকটি, অ্যাসোসিয়েটেড, ব্রিটিশ, পোর্টস, ...","[O, O, B-ORG, I-ORG, I-ORG, O, O, O, O, O, O, ..."


In [None]:
fa_test_ner_data.head()

Unnamed: 0,sentence_id,words,tags
0,0,"[قراردادی, که, دولت, بریتانیا, سعی, در, بستن, ...","[O, O, O, B-HumanSettlement, O, O, O, O, O, O,..."
1,1,"[اِپُک, تایمز, ،, یک, شرکت, رسانه, ای, خبری, د...","[O, O, O, O, O, O, O, O, O, B-HumanSettlement,..."
2,2,"[این, فرودگاه, در, کشور, ایالات, متحده, آمریکا...","[O, O, O, O, B-HumanSettlement, I-HumanSettlem..."
3,3,"[فینیکس, جمعیت, ۱, ٬, ۴۴۵, ٬, ۶۳۲]","[B-HumanSettlement, O, O, O, O, O, O]"
4,4,"[هنزیک, (, honzik, ), ،, حشره, ای, است, که, در...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, B-H..."


In [None]:
hi_test_ner_data.head()

Unnamed: 0,sentence_id,words,tags
0,0,"[उनकी, विशेषताओं, आंदोलनों, और, खेल, शैली, के,...","[O, O, O, O, O, O, O, O, O, O, B-SportsManager..."
1,1,"[एथन, काट्ज़, (, जन्म, १९८३, ), शिकागो, व्हाइट...","[B-SportsManager, I-SportsManager, O, O, O, O,..."
2,2,"[वह, प्रसिद्ध, रैंडविक, रेसकोर्स, ट्रेनर, इसहा...","[O, O, B-Facility, I-Facility, O, B-SportsMana..."
3,3,"[चेल्सी, ने, उथल, -पुथल, में, मैच, में, प्रवेश...","[O, O, O, O, O, O, O, O, O, O, O, B-SportsMana..."
4,4,"[शूप, २००४, में, मुख्य, कोच, जॉन, ग्रुडेन, के,...","[O, O, O, O, O, B-SportsManager, I-SportsManag..."


## Functions for Prompting

In [None]:
def create_ner_prompt(language, examples, annotations):
  """
  Generates an NER task prompt in BIO tagging format for a given language.

  Parameters:
  language (str): The language for which the prompt is being created.
  examples (list): A list of example sentences in the specified language.
  annotations (list): A list of corresponding BIO tag sequences for the example sentences.

  Returns:
  str: A formatted prompt string for an NER task in the specified language.
  """

  # BIO Tags included
  entity_types = (
      "Location (LOC): B-Facility, I-Facility, B-OtherLOC, I-OtherLOC, B-HumanSettlement, I-HumanSettlement, B-Station, I-Station\n"
      "Creative Work (CW): B-VisualWork, I-VisualWork, B-MusicalWork, I-MusicalWork, B-WrittenWork, I-WrittenWork, B-ArtWork, I-ArtWork, B-Software, I-Software\n"
      "Group (GRP): B-MusicalGRP, I-MusicalGRP, B-PublicCORP, I-PublicCORP, B-PrivateCORP, I-PrivateCORP, B-AerospaceManufacturer, I-AerospaceManufacturer, B-SportsGRP, I-SportsGRP, B-CarManufacturer, I-CarManufacturer, B-ORG, I-ORG\n"
      "Person (PER): B-Scientist, I-Scientist, B-Artist, I-Artist, B-Athlete, I-Athlete, B-Politician, I-Politician, B-Cleric, I-Cleric, B-SportsManager, I-SportsManager, B-OtherPER, I-OtherPER\n"
      "Product (PROD): B-Clothing, I-Clothing, B-Vehicle, I-Vehicle, B-Food, I-Food, B-Drink, I-Drink, B-OtherPROD, I-OtherPROD\n"
      "Medical (MED): B-Medication/Vaccine, I-Medication/Vaccine, B-MedicalProcedure, I-MedicalProcedure, B-AnatomicalStructure, I-AnatomicalStructure, B-Symptom, I-Symptom, B-Disease, I-Disease\n"
      "O (Outside of any entity)\n"
  )

  prompt = f"For the following sequences of words in the {language} sentences, generate the appropriate sequence of BIO tags, each tag corresponding with each word in a sentence. Indicate the end of the generated sequence with a ##### symbol. ##### means that the sequence of BIO Tags for the corresponding sentence has ended. Each entity type is marked as 'B-' (beginning), 'I-' (inside), or 'O' (outside). Types include Location (LOC), Creative Work (CW), Group (GRP), Person (PER), Product (PROD), and Medical (MED). Here are all possible BIO Tags:\n{entity_types}\n Here are some examples:\n"

  for sentence, annotation in zip(examples, annotations):
      prompt += f"Sentence: {sentence}\n   Sequence of BIO Tags: {annotation} #####\n"

  prompt += f"\nNow, using the same format as the examples, generate a sequence of BIO tags for the following sentence with each tag corresponding with each word in the new {language} sentence:\n"

  return prompt
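
To make the few-shot portion of the prompt concrete, here is a small sketch that formats two toy examples in the same `Sentence:` / `Sequence of BIO Tags:` layout, ending each with the `#####` marker. The helper name `format_examples`, the toy sentences, and the length check are assumptions added for illustration; they are not part of the function above.

```python
def format_examples(sentences, tag_sequences, end_marker="#####"):
    """Render sentence/tag pairs in the few-shot layout used by the prompt."""
    lines = []
    for sentence, tags in zip(sentences, tag_sequences):
        # Sanity check (an addition in this sketch): one tag per word
        if len(sentence.split()) != len(tags.split()):
            raise ValueError(f"Tag count does not match word count: {sentence!r}")
        lines.append(f"Sentence: {sentence}")
        lines.append(f"   Sequence of BIO Tags: {tags} {end_marker}")
    return "\n".join(lines) + "\n"

demo = format_examples(
    ["phoenix is in arizona", "she drives a toyota"],
    ["B-HumanSettlement O O B-HumanSettlement", "O O O B-CarManufacturer"],
)
print(demo)
```

Because every example ends with the same marker, the model's own output can later be cut at the first `#####` that follows the query sentence.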

In [None]:
def generate_prediction(sentence, model, tokenizer, prompt_template, decoded_response_filepath):
  """
  Generates and writes a sequence of predicted BIO tags for a given sentence using a language model.

  Parameters:
  sentence (str): The sentence for which BIO tags are to be predicted.
  model (PreTrainedModel): A pre-trained language model from HuggingFace's transformers library.
  tokenizer (PreTrainedTokenizer): A tokenizer corresponding to the pre-trained model.
  prompt_template (str): A template string used to construct the complete prompt for the model.
  decoded_response_filepath (str): The file path where the model's decoded response will be written.

  Returns:
  list: A list of predicted BIO tags for the given sentence.
  """

  prompt = prompt_template + f"\nSentence: {sentence}\nSequence of BIO Tags:"
  inputs = tokenizer.encode(prompt, return_tensors='pt')

  # Move input_ids to the same device as the model
  inputs = inputs.to(model.device)

  outputs = model.generate(inputs, max_length=2500, num_return_sequences=1)
  decoded_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

  # Append the full decoded response to the log file
  with open(decoded_response_filepath, 'a', encoding='utf-8') as decoded_response_file:
      decoded_response_file.write("START OF DECODED RESPONSE \n\n")
      decoded_response_file.write(decoded_response)
      decoded_response_file.write("\nEND OF DECODED RESPONSE \n\n\n")

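  # Identify the start of the predicted tags.
  # Check the result of find() before adding the marker length; otherwise the -1 case is never detected.
  response_marker = f"Sentence: {sentence}\nSequence of BIO Tags:"
  marker_index = decoded_response.find(response_marker)
  if marker_index == -1:
      return []
  start_index = marker_index + len(response_marker)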

  # Locate the end of the predicted tags
  end_index = decoded_response.find("#####", start_index)
  predicted_tags_str = decoded_response[start_index:end_index].strip() if end_index != -1 else decoded_response[start_index:].strip()
  predicted_tags = predicted_tags_str.split() if predicted_tags_str else []

  return predicted_tags
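
The tag extraction step can be tried without loading the model. The sketch below applies the same string search to a hand-written decoded response; the helper name `extract_tags` and the sample text are hypothetical and only illustrate the parsing, including the case where the query sentence is not found.

```python
def extract_tags(decoded_response, sentence, end_marker="#####"):
    """Return the tags generated after the query sentence, or [] if it is absent."""
    anchor = f"Sentence: {sentence}\nSequence of BIO Tags:"
    anchor_pos = decoded_response.find(anchor)
    if anchor_pos == -1:
        return []
    start = anchor_pos + len(anchor)
    end = decoded_response.find(end_marker, start)
    # Without an end marker, fall back to everything after the anchor
    span = decoded_response[start:] if end == -1 else decoded_response[start:end]
    return span.split()

# Hand-written stand-in for a decoded model response (prompt plus continuation)
decoded = (
    "Sentence: she drives a toyota\n   Sequence of BIO Tags: O O O B-CarManufacturer #####\n"
    "\nSentence: phoenix is hot\nSequence of BIO Tags: B-HumanSettlement O O #####\n"
    "Sentence: extra text the model kept generating"
)

print(extract_tags(decoded, "phoenix is hot"))       # ['B-HumanSettlement', 'O', 'O']
print(extract_tags(decoded, "not in the response"))  # []
```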

In [None]:
def clean_and_align_predicted_tags(predicted_tags, sentence_length):
  """
  Cleans and aligns predicted BIO tags to match the length of the given sentence.

  Parameters:
  predicted_tags (list): A list of predicted BIO tags.
  sentence_length (int): The length of the sentence for which tags were predicted.

  Returns:
  list: A list of cleaned and aligned BIO tags corresponding to the sentence length.
  """

  # Replace any non-tag elements with 'O'
  cleaned_tags = [
      tag if tag.startswith(('B-', 'I-')) or tag == 'O' else 'O'
      for tag in predicted_tags
  ]

  # Truncate or pad with 'O' to match the sentence length
  return cleaned_tags[:sentence_length] + ['O'] * (sentence_length - len(cleaned_tags))
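
A short worked example shows the three situations this alignment handles: too few tags, too many tags, and stray non-tag tokens. The function is restated under the hypothetical name `align_tags` so the block runs on its own.

```python
def align_tags(predicted_tags, sentence_length):
    """Map non-tag tokens to 'O', then truncate or pad to sentence_length."""
    cleaned = [
        tag if tag == "O" or tag.startswith(("B-", "I-")) else "O"
        for tag in predicted_tags
    ]
    # A negative multiplier yields an empty list, so only short outputs are padded
    return cleaned[:sentence_length] + ["O"] * (sentence_length - len(cleaned))

# Too few tags: padded with 'O'
print(align_tags(["B-Artist", "I-Artist"], 4))        # ['B-Artist', 'I-Artist', 'O', 'O']
# Too many tags: truncated
print(align_tags(["O", "B-Food", "O", "O", "O"], 3))  # ['O', 'B-Food', 'O']
# Stray tokens that are not tags: replaced with 'O'
print(align_tags(["Tags:", "B-ORG", "n/a"], 3))       # ['O', 'B-ORG', 'O']
```

Padding with `O` keeps every prediction the same length as its reference, which sequence-level scoring requires, at the cost of counting missing output as "no entity".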

In [None]:
def evaluate_for_language(model, tokenizer, language, dataset, prediction_filepath, score_filepath, decoded_response_filepath):
  """
  Evaluates a language model's NER performance for a specific language using a given dataset.

  Parameters:
  model (PreTrainedModel): The pre-trained language model used for NER prediction.
  tokenizer (PreTrainedTokenizer): The tokenizer corresponding to the model.
  language (str): The language for which the evaluation is performed.
  dataset (DataFrame): The dataset containing sentences and their corresponding NER tags.
  prediction_filepath (str): File path to write the model's NER predictions.
  score_filepath (str): File path to write the evaluation scores.
  decoded_response_filepath (str): File path to write the model's decoded responses.

  Outputs:
  The function outputs precision, recall, and F1 scores to the console and writes the
  detailed predictions and scores to the specified files.
  """

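  # Prepare the initial part of the prompt with examples.
  # Note: the examples are the first FEW_SHOT_SIZE rows of `dataset`, and the loop below
  # also evaluates those rows, so the prompt examples overlap with the evaluated sentences.
  example_sentences = [" ".join(words) for words in dataset.head(FEW_SHOT_SIZE)['words']]
  example_annotations = [" ".join(tags) for tags in dataset.head(FEW_SHOT_SIZE)['tags']]
  prompt = create_ner_prompt(language, example_sentences, example_annotations)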

  # List to store cleaned and aligned predicted tags
  cleaned_predicted_tags = []

  count = 1

  # Open the file for writing predictions
  with open(prediction_filepath, 'w', encoding='utf-8') as prediction_file:

      # Iterate over the test data
      for index, row in dataset.iterrows():
          sentence = " ".join(row['words'])
          generated_prediction = generate_prediction(sentence, model, tokenizer, prompt, decoded_response_filepath)
          aligned_tags = clean_and_align_predicted_tags(generated_prediction, len(row['words']))

          # Save aligned tags and reference tags for each sentence
          prediction_file.write(f"Sentence: {sentence}\n")
          prediction_file.write(f"Predicted Tags: {' '.join(aligned_tags)}\n")
          prediction_file.write(f"Reference Tags: {' '.join(row['tags'])}\n\n")

          clear_output(wait=True)
          print("Count: ", count)

          count += 1

          cleaned_predicted_tags.append(aligned_tags)

  # Actual tags from the test data
  actual_tags = [tags for tags in dataset['tags']]

  # Calculate evaluation metrics
  precision = seqeval.metrics.precision_score(actual_tags, cleaned_predicted_tags)
  recall = seqeval.metrics.recall_score(actual_tags, cleaned_predicted_tags)
  f1_score = seqeval.metrics.f1_score(actual_tags, cleaned_predicted_tags)

  print(f"Precision: {precision}, Recall: {recall}, F1-Score: {f1_score}")

  # Save the scores
  with open(score_filepath, 'w', encoding='utf-8') as score_file:
      scores = {
          'Precision': precision,
          'Recall': recall,
          'F1-Score': f1_score
      }
      score_file.write(json.dumps(scores, indent=4))
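
The scores above are computed over entities rather than individual tokens. As a rough illustration of what entity-level, micro-averaged precision, recall, and F1 mean, the sketch below extracts spans from BIO sequences and compares them. It is a simplified stand-in written for this explanation, not seqeval's implementation, and edge cases (for example, how a stray `I-` tag is handled) may differ from the library.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in one BIO sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and current == tag[2:]
        if not continues:
            if current is not None:
                spans.add((current, start, i - 1))
                start, current = None, None
            # Assumption of this sketch: a stray I- tag opens a new span
            if tag.startswith(("B-", "I-")):
                start, current = i, tag[2:]
    return spans

def micro_prf(gold_sequences, pred_sequences):
    """Micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [["B-ORG", "I-ORG", "O", "B-Artist"], ["O", "B-Food"]]
pred = [["B-ORG", "I-ORG", "O", "O"], ["O", "B-Drink"]]
precision, recall, f1 = micro_prf(gold, pred)
print(precision, recall, f1)
```

In this toy case one of the two predicted entities matches one of the three gold entities, so precision is 1/2 and recall is 1/3. A span with the right boundaries but the wrong type (`B-Drink` versus `B-Food`) counts as an error on both sides.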

## Sampling Examples

In [None]:
# Bangla Sample

# Drop the first FEW_SHOT_SIZE rows
bn_test_ner_data_remaining = bn_test_ner_data.iloc[FEW_SHOT_SIZE:]
# Shuffle and select SAMPLE_SIZE samples from the remaining data
bn_test_ner_data_sample = bn_test_ner_data_remaining.sample(n=SAMPLE_SIZE, random_state=16)

# Farsi Sample
fa_test_ner_data_remaining = fa_test_ner_data.iloc[FEW_SHOT_SIZE:]
fa_test_ner_data_sample = fa_test_ner_data_remaining.sample(n=SAMPLE_SIZE, random_state=16)

# Hindi Sample
hi_test_ner_data_remaining = hi_test_ner_data.iloc[FEW_SHOT_SIZE:]
hi_test_ner_data_sample = hi_test_ner_data_remaining.sample(n=SAMPLE_SIZE, random_state=16)

# Portuguese Sample
pt_test_ner_data_remaining = pt_test_ner_data.iloc[FEW_SHOT_SIZE:]
pt_test_ner_data_sample = pt_test_ner_data_remaining.sample(n=SAMPLE_SIZE, random_state=16)

# Italian Sample
it_test_ner_data_remaining = it_test_ner_data.iloc[FEW_SHOT_SIZE:]
it_test_ner_data_sample = it_test_ner_data_remaining.sample(n=SAMPLE_SIZE, random_state=16)

# Ukrainian Sample
uk_test_ner_data_remaining = uk_test_ner_data.iloc[FEW_SHOT_SIZE:]
uk_test_ner_data_sample = uk_test_ner_data_remaining.sample(n=SAMPLE_SIZE, random_state=16)

# English Sample
en_test_ner_data_remaining = en_test_ner_data.iloc[FEW_SHOT_SIZE:]
en_test_ner_data_sample = en_test_ner_data_remaining.sample(n=SAMPLE_SIZE, random_state=16)
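
The sampling pattern above holds out the first `FEW_SHOT_SIZE` rows and draws a fixed-seed sample from the rest. The sketch below shows the same idea with plain Python lists standing in for the DataFrames, and checks that the held-out rows and the sample are disjoint. The helper name and the stand-in data are assumptions, and `random.Random(16)` will not select the same rows as pandas' `sample(random_state=16)`.

```python
import random

FEW_SHOT_SIZE = 10
SAMPLE_SIZE = 300

def split_few_shot_and_sample(rows, few_shot_size, sample_size, seed=16):
    """Hold out the first rows for the prompt and sample the rest for evaluation."""
    few_shot = rows[:few_shot_size]
    remaining = rows[few_shot_size:]
    # A dedicated Random instance keeps the draw reproducible for a given seed
    sample = random.Random(seed).sample(remaining, sample_size)
    return few_shot, sample

rows = list(range(1000))  # stand-in for sentence ids
few_shot, sample = split_few_shot_and_sample(rows, FEW_SHOT_SIZE, SAMPLE_SIZE)

print(len(few_shot), len(sample))
print(set(few_shot).isdisjoint(sample))
```

Keeping the two sets disjoint ensures that no sentence shown in the prompt is also scored during evaluation.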

## Loading Model

In [None]:
# Load the LLaMA model
model_name = "meta-llama/Llama-2-7b-chat-hf"

# Loading tokenizer and model
# `token=True` uses the token saved by the login step; it replaces the deprecated `use_auth_token` argument
tokenizer = AutoTokenizer.from_pretrained(model_name, token=True)
model = AutoModelForCausalLM.from_pretrained(model_name, token=True, device_map='auto')

## Evaluating the Model on the 7 Different Languages

In [None]:
evaluate_for_language(model, tokenizer, "Bangla", bn_test_ner_data_sample, folder_path + "bn3_predicted_vs_reference_tags.txt", folder_path + "bn3_evaluation_scores.json", folder_path + "bn3_decoded_responses.txt")

In [None]:
evaluate_for_language(model, tokenizer, "Farsi", fa_test_ner_data_sample, folder_path + "fa3_predicted_vs_reference_tags.txt", folder_path + "fa3_evaluation_scores.json", folder_path + "fa3_decoded_responses.txt")

In [None]:
evaluate_for_language(model, tokenizer, "Hindi", hi_test_ner_data_sample, folder_path + "hi3_predicted_vs_reference_tags.txt", folder_path + "hi3_evaluation_scores.json", folder_path + "hi3_decoded_responses.txt")

In [None]:
evaluate_for_language(model, tokenizer, "Portuguese", pt_test_ner_data_sample, folder_path + "pt3_predicted_vs_reference_tags.txt", folder_path + "pt3_evaluation_scores.json", folder_path + "pt3_decoded_responses.txt")

In [None]:
evaluate_for_language(model, tokenizer, "Italian", it_test_ner_data_sample, folder_path + "it3_predicted_vs_reference_tags.txt", folder_path + "it3_evaluation_scores.json", folder_path + "it3_decoded_responses.txt")

In [None]:
evaluate_for_language(model, tokenizer, "Ukrainian", uk_test_ner_data_sample, folder_path + "uk3_predicted_vs_reference_tags.txt", folder_path + "uk3_evaluation_scores.json", folder_path + "uk3_decoded_responses.txt")

In [None]:
evaluate_for_language(model, tokenizer, "English", en_test_ner_data_sample, folder_path + "en3_predicted_vs_reference_tags.txt", folder_path + "en3_evaluation_scores.json", folder_path + "en3_decoded_responses.txt")
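
The seven calls above differ only in the language name, the two-letter code, and the output file names. One possible way to reduce the repetition is to derive the three output paths from the code and loop over the languages. In this sketch, `output_paths`, `LANGUAGES`, the `run_tag` parameter, and the `samples` dictionary mentioned in the comment are hypothetical names introduced for illustration.

```python
# Language codes and names as used in the calls above
LANGUAGES = {
    "bn": "Bangla",
    "fa": "Farsi",
    "hi": "Hindi",
    "pt": "Portuguese",
    "it": "Italian",
    "uk": "Ukrainian",
    "en": "English",
}

def output_paths(folder, code, run_tag="3"):
    """Build the prediction, score, and decoded-response paths for one language."""
    prefix = f"{folder}{code}{run_tag}"
    return (
        prefix + "_predicted_vs_reference_tags.txt",
        prefix + "_evaluation_scores.json",
        prefix + "_decoded_responses.txt",
    )

folder_path = "/content/drive/MyDrive/CPSC_488_Data/"
for code, language in LANGUAGES.items():
    prediction_path, score_path, response_path = output_paths(folder_path, code)
    print(language, prediction_path)
    # With a hypothetical dict `samples` mapping codes to the sampled DataFrames:
    # evaluate_for_language(model, tokenizer, language, samples[code],
    #                       prediction_path, score_path, response_path)
```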