<a href="https://colab.research.google.com/github/RyanChen12035/capstone/blob/main/Llama2_synthesizing_wNER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Download the quantized Llama 2 13B chat model (GGML, q5_1) and try different ways of prompting it to generate synthetic data based on BIOS tagging.

In [None]:
# Build llama-cpp-python with cuBLAS (GPU) support.
# This command already installs both pinned packages, so they are not installed again below.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 numpy==1.23.4 --force-reinstall --upgrade --no-cache-dir --verbose
!pip install huggingface_hub

In [2]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin"

In [3]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

In [None]:
# download the model
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

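# Load the model and offload part of it to the GPU
lcpp_llm = None  # drop any previously loaded model before reloading
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2,  # number of CPU threads
    n_batch=512,  # number of prompt tokens processed per batch
    n_gpu_layers=32  # number of layers offloaded to the GPU; adjust to your model and available VRAM
    )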

### Example of Llama2 prompt template

According to https://www.youtube.com/watch?v=Pb_RGAl75VE:

The impact of chat templates on the performance of the model is unclear. In most cases, we fine-tune base models that have not been trained with a particular template, which is also why there's no clear standard. However, they are important as they can cause many issues and limit the compatibility of your models.

    <s>[INST] <<SYS>>
    You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

    If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
    <</SYS>>

    There's a llama in my garden 😱 What should I do? [/INST]
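
A minimal sketch of how the template above could be assembled in code. The helper name `build_llama2_prompt` is hypothetical and not part of any library. Whether the literal `<s>` should be kept depends on whether the tokenizer adds the beginning-of-sequence token itself, which is treated as an open assumption here.

```python
def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Wrap a system prompt and one user message in the Llama 2 chat template.

    Hypothetical helper for illustration; it only does string formatting.
    """
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt.strip()}\n"
        "<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )


example = build_llama2_prompt(
    "You are a helpful, respectful and honest assistant.",
    "There's a llama in my garden 😱 What should I do?",
)
print(example)
```

Keeping the template in one function makes it easier to check that every prompt in the notebook uses the same delimiters.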

In [112]:
# This is the chat version of the model, so the prompt must follow the Llama 2 chat template.
# Build the system prompt, the task prompt and the final prompt string.
system_prompt = "You are an intelligent Reversed Named Entity Recognition (NER) system. Please synthesize data based on the definition of NER tagging provided and replace the tagging. The output should be in the given format as examples but don't use the tagging in the example. Please be creative but still keep the output reasonable."
sentence_NER = " {ORG} rejects {PERSON} call to boycott {LOC} lamb"
prompt = """Definition of NER Tagging:\n
            1. {PERSON}: Short name or full name of a person from any geographic regions.\n
            2. {DATE}: Any format of dates. Dates can also be in natural language.\n
            3. {LOC}: Name of any geographic location, like cities, countries, continents, districts etc.\n
            4. {ORG}: Name of the companies like Google, Samsung, Apple etc.\n
            5. {NUMBERS}: Numerical entities which are numerically present or mentioned in words like 7000, half a dozen etc.\n
            Examples:\n
            1. Sentence: {LOC} and {LOC} are friends. G20 summit going to held in {LOC} in {DATE}. Indian Prime Minister {PERSON} will be hosting it and {ORG} will be giving charity of {NUMBERS}.\n
            Output: Sentence: USA and India are friends. G20 summit going to held in India in September 2023. Indian Prime Minister Narendra Modi will be hosting it and TATA will be giving charity of $150 Million.\n

        """
prompt_template = f"<s>[INST] <<SYS>>:\n{system_prompt}\n<</SYS>>\n{prompt}\n Please synthesize the data and replace the NER tagging:{sentence_NER}[/INST]\n"

In [63]:
# NER training material from https://huggingface.co/datasets/conll2003?row=0
# Example: EU rejects German call to boycott British lamb
# BIOS tagging: {ORG} rejects {LOC} call to boycott {LOC} lamb
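
The placeholder sentence above was written by hand. A possible way to derive it from token-level tags is sketched below. The function name `bio_to_template`, the string tag format (`B-ORG`, `I-PER`, ...) and the `PER` to `PERSON` mapping are assumptions made for illustration; the Hugging Face dataset stores the tags as integer ids that would first need to be mapped to these strings. As far as we can tell, the dataset labels "German" and "British" as `MISC`, which has no placeholder in the definitions used in this notebook, so the sketch keeps it as `{MISC}`.

```python
def bio_to_template(tokens, tags, tag_map=None):
    """Replace each tagged entity span with a single {LABEL} placeholder.

    Hypothetical helper: `tags` are strings such as "O", "B-ORG", "I-ORG".
    """
    tag_map = tag_map or {"PER": "PERSON"}  # assumed mapping to the notebook's tag names
    out = []
    prev_label = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            out.append(token)
            prev_label = None
            continue
        prefix, label = tag.split("-", 1)
        # Start a new placeholder on B-, or on an I- that does not continue the previous span.
        if prefix == "B" or label != prev_label:
            out.append("{" + tag_map.get(label, label) + "}")
        prev_label = label
    return " ".join(out)


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
template = bio_to_template(tokens, tags)
print(template)  # {ORG} rejects {MISC} call to boycott {MISC} lamb .
```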

In [113]:
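# echo=True returns the prompt followed by the completion, so the prompt is sliced off when printing.
response = lcpp_llm(prompt=prompt_template, max_tokens=512, temperature=0.5, top_p=0.95,
                    repeat_penalty=1.2, top_k=150,
                    echo=True)
# A higher temperature gives a more creative response.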

Llama.generate: prefix-match hit


In [114]:
print(response["choices"][0]["text"][len(prompt_template):])

#temp = 1   The tech giant Apple has rejected a call by its CEO Tim Cook to boycott the upcoming G20 summit in India. According to sources close to the company, Apple is concerned that such a move could harm their business and relationships with Indian leaders. The company is instead choosing to focus on donating $150 million to support education initiatives in underserved communities across India.
#temp = 0.9 Apple rejects Tim Cook's call to boycott California lamb.
#temp = 0.5 pple rejects Tim Cook's call to boycott Cupertino lamb.
#temp = 0.3 Apple rejects Tim Cook's call to boycott California lamb.
#temp = 0.1 Apple rejects Tim Cook's call to boycott California lamb.

  Sure, here's the synthesized output based on the given definition of NER tagging:

Apple rejects Tim Cook's call to boycott California lamb.
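
The temperature comparison recorded in the comments above was presumably collected by rerunning the cell by hand. A small loop could gather the same comparison in one pass; the sketch below assumes a callable with the same keyword interface as `lcpp_llm` and uses a stand-in `fake_llm` so that it runs without the model. Both `sweep_temperatures` and `fake_llm` are hypothetical names.

```python
def sweep_temperatures(generate, prompt, temperatures):
    """Call `generate` once per temperature and collect the completions."""
    results = {}
    for t in temperatures:
        response = generate(
            prompt=prompt, max_tokens=512, temperature=t, top_p=0.95,
            repeat_penalty=1.2, top_k=150,
            echo=False,  # return only the completion, so no prompt slicing is needed
        )
        results[t] = response["choices"][0]["text"].strip()
    return results


def fake_llm(prompt, temperature, **kwargs):
    """Stand-in for lcpp_llm that mimics the shape of its response dictionary."""
    return {"choices": [{"text": f" completion at temperature {temperature} "}]}


results = sweep_temperatures(fake_llm, "some prompt", [0.1, 0.3, 0.5, 0.9, 1.0])
for temperature, text in results.items():
    print(temperature, text)
```

In the notebook, `fake_llm` would be replaced by `lcpp_llm` and `"some prompt"` by `prompt_template`.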


## 2. Given a sentence with private data, annotate it and replace the entities with NER tags

In [96]:
# This is the chat version of the model, so the prompt must follow the Llama 2 chat template.
# Build the system prompt, the task prompt and the final prompt string.
system_prompt = "You are an intelligent Named Entity Recognition (NER) system. Please annotate and replace the input sentence accordingly with the definition of the NER tagging provided. The output should be in given format with examples."
sentence = "EU rejects German call to boycott British lamb"
prompt = """Definition of NER tagging:\n
            1. {PERSON}: Short name or full name of a person from any geographic regions.\n
            2. {DATE}: Any format of dates. Dates can also be in natural language.\n
            3. {LOC}: Name of any geographic location, like cities, countries, continents, districts etc.\n
            4. {ORG}: Name of the companies like Google, Samsung, Apple etc.\n
            5. {NUMBERS}: Numerical entities which are numerically present or mentioned in words like 7000, half a dozen etc.\n
            Examples:\n
            1. Sentence: USA and India are friends. G20 summit going to held in India in September 2023. Indian Prime Minister Narendra Modi will be hosting it and TATA will be giving charity of $150 Million.\n
            Output: {LOC} and {LOC} are friends. G20 summit going to held in {LOC} in {DATE}. Indian Prime Minister {PERSON} will be hosting it and {ORG} will be giving charity of {NUMBERS}.\n

        """
prompt_template = f"<s>[INST] <<SYS>>:\n{system_prompt}\n<</SYS>>\n{prompt}\n Please annotate the input with NER tagging and replace it based on the definition of the NER tagging:{sentence}[/INST]\n"

In [97]:
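# echo=True returns the prompt followed by the completion, so the prompt is sliced off when printing.
response = lcpp_llm(prompt=prompt_template, max_tokens=512, temperature=0.5, top_p=0.95,
                    repeat_penalty=1.2, top_k=150,
                    echo=True)
# A higher temperature gives a more creative response.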

Llama.generate: prefix-match hit


In [98]:
print(response["choices"][0]["text"][len(prompt_template):])

 Sure! Here's the input sentence with NER tags added, based on the definition you provided:

Input sentence: EU rejects German call to boycott British lamb.

NER-tagged output:

{ORG} rejects {PERSON}'s call to boycott {LOC} lamb.

Here's a breakdown of each tag:

* {ORG}: European Union (EU)
* {PERSON}: German (German government or representative)
* {LOC}: Britain (United Kingdom)
* {NUMBERS}: 150 million (the amount of charity
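
The response above wraps the tagged sentence in extra explanation, so some post-processing is needed before it can be used as data. The sketch below is one possible approach, not part of the original notebook: it takes the first non-bullet line that contains a placeholder and reports any tags that are outside the five definitions in the prompt. The names `extract_tagged_sentence` and `unknown_tags` are hypothetical.

```python
import re

ALLOWED_TAGS = {"PERSON", "DATE", "LOC", "ORG", "NUMBERS"}  # tags defined in the prompt
PLACEHOLDER = re.compile(r"\{([A-Z]+)\}")


def extract_tagged_sentence(response_text):
    """Return the first line with a {TAG} placeholder that is not a bullet explanation."""
    for line in response_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("*"):
            continue  # skip the per-tag breakdown bullets
        if PLACEHOLDER.search(stripped):
            return stripped
    return None


def unknown_tags(sentence):
    """List placeholders in `sentence` that are not among the allowed tags."""
    return sorted(set(PLACEHOLDER.findall(sentence)) - ALLOWED_TAGS)


response_text = """ Sure! Here's the input sentence with NER tags added, based on the definition you provided:

Input sentence: EU rejects German call to boycott British lamb.

NER-tagged output:

{ORG} rejects {PERSON}'s call to boycott {LOC} lamb.

Here's a breakdown of each tag:

* {ORG}: European Union (EU)
"""

tagged = extract_tagged_sentence(response_text)
print(tagged)
print(unknown_tags(tagged))
```

This only checks the format of the output. It does not tell us whether a tag is correct, for example whether "German" should really be `{PERSON}`.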
