# Generate Synthetic Data and Q&A with Citations

## Adapted from the Ollama notebook

## Requirements

### Install llama-cpp-python and outlines

In [1]:
# RUN IT ONLY ONCE TO INSTALL THE REQUIREMENTS
# %pip install llama-cpp-python outlines

For detailed installation instructions, see the [llama-cpp-python installation guide](https://llama-cpp-python.readthedocs.io/en/stable/) and the [outlines installation guide](https://outlines-dev.github.io/outlines/installation/).

### Pull the model from HuggingFace

Download a GGUF build of the model from [HuggingFace](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF/tree/main), for example the `Q4_K_M` quantization (it requires 4.92 GB):

In [2]:
# RUN IT ONLY ONCE TO DOWNLOAD THE GGUF MODEL, IN THIS CASE THE Q4_K_M
# !wget https://hf.co/NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF/resolve/main/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf

## Usage

### Generate Synthetic Data

#### Define Pydantic class

In [3]:
from pydantic import BaseModel, Field

class UserDetail(BaseModel):
    id: int = Field(..., description="Unique identifier") # so the model keeps track of the number of fake users
    first_name: str
    last_name: str
    age: int

In [4]:
from typing import List

class Users(BaseModel):
    users: List[UserDetail]
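
The sketch below shows what these two classes enforce on their own, without any model involved. It repeats the class definitions so that it runs standalone, and the `good` and `bad` payloads are made-up samples, not model output. It assumes Pydantic v2, which the `model_json_schema()` call later in this notebook also relies on.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class UserDetail(BaseModel):
    id: int = Field(..., description="Unique identifier")
    first_name: str
    last_name: str
    age: int


class Users(BaseModel):
    users: List[UserDetail]


# Hypothetical payload that matches the schema
good = '{"users": [{"id": 1, "first_name": "John", "last_name": "Doe", "age": 25}]}'
parsed = Users.model_validate_json(good)
first_user = parsed.users[0]

# Hypothetical payload with two required fields left out
bad = '{"users": [{"id": 1, "first_name": "John"}]}'
try:
    Users.model_validate_json(bad)
    missing_fields = []
except ValidationError as exc:
    # Each error carries the path to the offending field in "loc"
    missing_fields = sorted(
        str(err["loc"][-1]) for err in exc.errors() if err["type"] == "missing"
    )

print(first_user)
print(missing_fields)
```

Constrained generation is meant to make the second case unreachable, since the model can only emit text that fits the schema derived from `Users`.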

#### Load the model

In [5]:
import llama_cpp
from llama_cpp import Llama
from outlines import generate, models

llm = Llama(
    "/home/asilva/models/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf", # replace with your /path/to/the/model
    n_gpu_layers=-1,
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        "NousResearch/Hermes-2-Pro-Llama-3-8B"
    ),
    use_mlock=True,
    flash_attn=True,
    verbose=False
)
model = models.LlamaCpp(llm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [6]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning) # ignore runtime warnings

In [7]:
generator = generate.json(model, Users)
response = generator("Create 5 fake users", max_tokens=1024, temperature=0, seed=42)

In [8]:
response.users

[UserDetail(id=1, first_name='John', last_name='Doe', age=25),
 UserDetail(id=2, first_name='Jane', last_name='Doe', age=30),
 UserDetail(id=3, first_name='Bob', last_name='Smith', age=40),
 UserDetail(id=4, first_name='Alice', last_name='Smith', age=35),
 UserDetail(id=5, first_name='John', last_name='Smith', age=20)]
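
The schema guarantees field names and types, but not that the list has the requested length, that the ids are unique, or that the ages are plausible. A minimal sketch of a post-generation check follows. `FakeUser` and `check_users` are hypothetical names introduced here, and the 0 to 120 age range is an arbitrary assumption; `FakeUser` stands in for `UserDetail` so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeUser:
    """Stand-in with the same fields as UserDetail."""
    id: int
    first_name: str
    last_name: str
    age: int


def check_users(users: Iterable[FakeUser], expected_count: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    users = list(users)
    problems = []
    if len(users) != expected_count:
        problems.append(f"expected {expected_count} users, got {len(users)}")
    ids = [u.id for u in users]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    for u in users:
        if not 0 <= u.age <= 120:  # assumed plausibility range
            problems.append(f"implausible age for id {u.id}: {u.age}")
    return problems


# Same values as the output shown above
sample = [
    FakeUser(1, "John", "Doe", 25),
    FakeUser(2, "Jane", "Doe", 30),
    FakeUser(3, "Bob", "Smith", 40),
    FakeUser(4, "Alice", "Smith", 35),
    FakeUser(5, "John", "Smith", 20),
]
problems = check_users(sample, expected_count=5)
print(problems)
```

Because the function only reads attributes, the same call should also work on `response.users`.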

In [9]:
for user in response.users:
    print(user.first_name)
    print(user.last_name)
    print(user.age)
    print()

John
Doe
25

Jane
Doe
30

Bob
Smith
40

Alice
Smith
35

John
Smith
20



### QA with Citations

#### Define Pydantic class

In [10]:
from typing import List

from pydantic import BaseModel


class QuestionAnswer(BaseModel):
    question: str
    answer: str
    citations: List[str]

schema = QuestionAnswer.model_json_schema()
schema

{'properties': {'question': {'title': 'Question', 'type': 'string'},
  'answer': {'title': 'Answer', 'type': 'string'},
  'citations': {'items': {'type': 'string'},
   'title': 'Citations',
   'type': 'array'}},
 'required': ['question', 'answer', 'citations'],
 'title': 'QuestionAnswer',
 'type': 'object'}

#### Create function to generate final prompt

In [11]:
def my_final_prompt(question, context):
    return (
        "<|im_start|>system\n"
        "You are a world class AI model who answers questions in JSON with correct and exact citations "
        "extracted from the `Context`. "
        f"Here's the json schema you must adhere to:\n<schema>\n{schema}\n</schema><|im_end|>\n"
        "<|im_start|>user\n"
        + "`Context`: "
        + context
        + "\n`Question`: "
        + question + "<|im_end|>"
        + "\n<|im_start|>assistant\n"
        "<schema>"
    )

In [12]:
question = "What did the author do during college?"
context = """
My name is Jason Liu, and I grew up in Toronto Canada but I was born in China.
I went to an arts high school but in university I studied Computational Mathematics and physics.
As part of coop I worked at many companies including Stitchfix, Facebook.
I also started the Data Science club at the University of Waterloo and I was the president of the club for 2 years.
"""

In [13]:
print(my_final_prompt(question, context))

<|im_start|>system
You are a world class AI model who answers questions in JSON with correct and exact citations extracted from the `Context`. Here's the json schema you must adhere to:
<schema>
{'properties': {'question': {'title': 'Question', 'type': 'string'}, 'answer': {'title': 'Answer', 'type': 'string'}, 'citations': {'items': {'type': 'string'}, 'title': 'Citations', 'type': 'array'}}, 'required': ['question', 'answer', 'citations'], 'title': 'QuestionAnswer', 'type': 'object'}
</schema><|im_end|>
<|im_start|>user
`Context`: 
My name is Jason Liu, and I grew up in Toronto Canada but I was born in China.
I went to an arts high school but in university I studied Computational Mathematics and physics.
As part of coop I worked at many companies including Stitchfix, Facebook.
I also started the Data Science club at the University of Waterloo and I was the president of the club for 2 years.

`Question`: What did the author do during college?<|im_end|>
<|im_start|>assistant
<schema>
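
The printed prompt shows the schema with single quotes. Formatting a dict inside an f-string produces its Python repr, which is not valid JSON. The sketch below contrasts that with `json.dumps`, using a copy of the schema from the output above. `is_valid_json` is a helper introduced here for illustration. Whether the model answers differently with a JSON-formatted schema is not tested in this notebook, so treat this as something to try rather than a required change.

```python
import json

# Copy of the schema printed by QuestionAnswer.model_json_schema()
schema = {
    "properties": {
        "question": {"title": "Question", "type": "string"},
        "answer": {"title": "Answer", "type": "string"},
        "citations": {
            "items": {"type": "string"},
            "title": "Citations",
            "type": "array",
        },
    },
    "required": ["question", "answer", "citations"],
    "title": "QuestionAnswer",
    "type": "object",
}


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


as_repr = str(schema)         # what f"{schema}" inserts into the prompt
as_json = json.dumps(schema)  # double-quoted, valid JSON

print(is_valid_json(as_repr))
print(is_valid_json(as_json))
```

To use the JSON form, the prompt function could interpolate `json.dumps(schema)` instead of `schema`.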


In [14]:
from outlines import generate, models

model = models.LlamaCpp(llm)
generator = generate.json(model, QuestionAnswer)

In [15]:
answer = generator(my_final_prompt(question, context), max_tokens=1024, temperature=0, seed=42)

In [16]:
answer

QuestionAnswer(question='What did Jason Liu do during college?', answer='During college, Jason Liu studied Computational Mathematics and Physics. He also worked at companies such as Stitchfix and Facebook, and started the Data Science club at the University of Waterloo, serving as its president for two years.', citations=['I went to an arts high school but in university I studied Computational Mathematics and physics.', 'As part of coop I worked at many companies including Stitchfix, Facebook.', 'I also started the Data Science club at the University of Waterloo and I was the president of the club for 2 years.'])

In [17]:
question1 = "Where was John born?"
context1 = """
John Doe is a software engineer who was born in New York, USA. 
He studied Computer Science at the Massachusetts Institute of Technology. 
During his studies, he interned at Google and Microsoft. 
He also founded the Artificial Intelligence club at his university and served as its president for three years.
"""


question2 = "What did Emily study in university?"
context2 = """
Emily Smith is a data scientist from London, England. 
She attended the University of Cambridge where she studied Statistics and Machine Learning. 
She interned at IBM and Amazon during her summer breaks. 
Emily was also the head of the Women in Tech society at her university.
"""

question3 = "Which companies did Robert intern at?"
context3 = """
Robert Johnson, originally from Sydney, Australia, is a renowned cybersecurity expert. 
He studied Information Systems at the University of Melbourne. 
Robert interned at several cybersecurity firms including NortonLifeLock and McAfee. 
He was also the leader of the Cybersecurity club at his university.
"""


question4 = "What club did Alice start at her university?"
context4 = """
Alice Williams, a native of Dublin, Ireland, is a successful web developer. 
She studied Software Engineering at Trinity College Dublin. 
Alice interned at several tech companies including Shopify and Squarespace. 
She started the Web Development club at her university and was its president for two years.
"""


question5 = "What did Michael study in high school?"
context5 = """
Michael Brown is a game developer from Tokyo, Japan. 
He attended a specialized high school where he studied Game Design. 
He later attended the University of Tokyo where he studied Computer Science. 
Michael interned at Sony and Nintendo during his university years. 
He also started the Game Developers club at his university.
"""

In [18]:
# Build the generator once; it depends only on the model and the schema
generator = generate.json(model, QuestionAnswer)

for question, context in [
    (question1, context1),
    (question2, context2),
    (question3, context3),
    (question4, context4),
    (question5, context5),
]:
    final_prompt = my_final_prompt(question, context)
    response = generator(final_prompt, max_tokens=1024, temperature=0, seed=42)
    display(question)
    display(response.answer)
    display(response.citations)
    print("\n\n")

'Where was John born?'

'John Doe was born in New York, USA.'

['John Doe is a software engineer who was born in New York, USA.']






'What did Emily study in university?'

'Emily studied Statistics and Machine Learning in university.'

['She attended the University of Cambridge where she studied Statistics and Machine Learning.']






'Which companies did Robert intern at?'

'Robert interned at NortonLifeLock and McAfee.'

['Robert Johnson, originally from Sydney, Australia, is a renowned cybersecurity expert. He interned at several cybersecurity firms including NortonLifeLock and McAfee.']






'What club did Alice start at her university?'

'Alice started the Web Development club at her university.'

['Alice Williams, a native of Dublin, Ireland, is a successful web developer. She started the Web Development club at her university and was its president for two years.']






'What did Michael study in high school?'

'Michael studied Game Design in high school.'

['Michael Brown is a game developer from Tokyo, Japan. He attended a specialized high school where he studied Game Design.']
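
Several of the citations printed above merge two sentences or reword one (for example "He interned" where the context says "Robert interned"), so they are not verbatim extracts. The schema only constrains `citations` to be a list of strings and cannot enforce that each string appears in the context. A minimal sketch of a post-hoc check follows. `normalize` and `check_citations` are hypothetical helpers introduced here, and the whitespace-insensitive substring match is an assumption about what should count as "exact".

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not affect matching."""
    return re.sub(r"\s+", " ", text).strip()


def check_citations(citations: List[str], context: str) -> Dict[str, bool]:
    """Map each citation to True if it appears verbatim in the context."""
    haystack = normalize(context)
    return {c: normalize(c) in haystack for c in citations}


context = """
Robert Johnson, originally from Sydney, Australia, is a renowned cybersecurity expert.
He studied Information Systems at the University of Melbourne.
Robert interned at several cybersecurity firms including NortonLifeLock and McAfee.
He was also the leader of the Cybersecurity club at his university.
"""

# A sentence copied verbatim from the context
verbatim = "Robert interned at several cybersecurity firms including NortonLifeLock and McAfee."
# The citation the model returned for the third question above
merged = (
    "Robert Johnson, originally from Sydney, Australia, is a renowned cybersecurity expert. "
    "He interned at several cybersecurity firms including NortonLifeLock and McAfee."
)

results = check_citations([verbatim, merged], context)
for citation, found in results.items():
    print(found, "|", citation)
```

In the loop above, the equivalent call would be `check_citations(response.citations, context)`. Citations that fail the check could be dropped or the generation retried.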




