## **Outline**

- Generating Data and Synthetic Datasets
- Case Study: Sentiment Analysis
- Workshop: Tackling Generated Datasets Diversity


<img src="./images/border.jpg" height="10" width="1500" align="center"/>

In [1]:
!python -m venv openai-env

In [2]:
!source openai-env/bin/activate

In [3]:
!pip install --upgrade openai

^C
ERROR: Operation cancelled by user

In [4]:
from openai import OpenAI
import os

client = OpenAI()
# defaults to getting the key using os.environ.get("OPENAI_API_KEY")
# if you saved the key under a different environment variable name, you can do something like:
client = OpenAI(
  api_key=os.environ.get("OPENAI_API_KEY"),
)

In [5]:
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content="In the realm of code, a concept quite grand,\nIs recursion, a magical hand,\nLike a mirror reflecting its own image,\nA function that calls itself with courage.\n\nA puzzle solved by breaking it down,\nInto smaller tasks, it wears the crown,\nWith elegance and power it repeats,\nSolving problems in defining feats.\n\nThrough layers of depth, it delves,\nInto the unknown, where mystery dwells,\nEach recursive call, a journey anew,\nUnraveling complexity, it's true.\n\nA loop within a loop, a cycle unbroken,\nInfinite possibilities, words unspoken,\nA dance of logic, a pattern that gleams,\nRecursion in programming, a poet's dreams.", role='assistant', function_call=None, tool_calls=None)


<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Function Calling with LLMs**
- Function calling enables LLMs like GPT-4 and GPT-3.5 to reliably connect with external tools and APIs.
- The model detects when a function call is needed within a chat and outputs JSON containing the arguments for that call.

- **Tool Integration:**
  - Functions act as tools within AI applications, allowing for multiple tools to be defined and called in a single request.

- **Importance for AI Applications:**
  - Essential for developing LLM-powered chatbots or agents that require context retrieval or need to interact with external tools.
  - Transforms natural language instructions into actionable API calls, enhancing the utility and interactivity of chatbots.

- **Enhancing Chatbot Capabilities:**
  - Facilitates seamless integration of LLMs with a wide range of external services and data sources.
  - Enables chatbots to perform complex tasks, such as data retrieval, content creation, and more, by calling specific functions.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- **Applications Enabled by Function Calling**

- **Conversational Agents with External Tool Usage:**
  - Allows conversational agents to efficiently utilize external tools to answer user questions.
  - Example: Querying weather information translates to function calls like `get_current_weather(location: string, unit: 'celsius' | 'fahrenheit')`.

- **Data Extraction and Tagging Solutions:**
  - Empowers LLM-powered solutions to extract and tag data from various sources.
  - Example: Extracting people's names from a Wikipedia article.

- **Natural Language to API Conversion:**
  - Facilitates the creation of applications that translate natural language into API calls or database queries.
  - Enhances the usability and accessibility of data and services.

- **Conversational Knowledge Retrieval Engines:**
  - Enables conversational engines to interact with knowledge bases, facilitating knowledge retrieval through natural language queries.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Function Calling with GPT-4**


- **Integration of LLM with External Tool for Weather Query**
  - **Challenge:**
    - An LLM alone cannot answer dynamic queries such as the current weather, because its training data is static and contains no real-time information.
  - **Solution:**
    - Utilize function calling capabilities of the LLM to invoke an external tool for weather information retrieval.

  - **Implementation Example:**
    - User query: "What is the weather like in a given location?"
    - LLM processes the query and recognizes the need for external information.
    - Function calling mechanism selects appropriate function and arguments (e.g., `get_current_weather(location: string, unit: 'celsius' | 'fahrenheit')`).
    - The OpenAI API returns the selected function name and arguments; the application then calls the weather API with them and passes the result back to the model.
    - Final response generated based on the retrieved weather data.

```
What is the weather like in London?

```

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- To handle this request using function calling, define a weather function (or a set of functions) and pass it as part of the OpenAI API request:


In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string", 
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- Function Details:
  - Name: get_current_weather
  - Description: Retrieves the current weather in a specified location.
  - Parameters:
    - location: Specifies the city and state (e.g., San Francisco, CA).
    - unit: Specifies the temperature unit (celsius or fahrenheit).


Define a completion function as follows:

In [None]:
def get_completion(client, messages, model="gpt-3.5-turbo-1106", temperature=0, max_tokens=300, tools=None, tool_choice=None):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        tools=tools,
        tool_choice=tool_choice
    )
    return response.choices[0].message

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

Compose the user question:



In [None]:
messages = [
    {
        "role": "user",
        "content": "What is the weather like in London?"
    }
]

In [None]:
response = get_completion(client, messages, tools=tools)
print(response)


ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_q17T7UG98lVTh9XMuZcb0yrv', function=Function(arguments='{"location":"London","unit":"celsius"}', name='get_current_weather'), type='function')])


<img src="./images/border.jpg" height="10" width="1500" align="center"/>

Capture the arguments:

In [None]:
import json

args = json.loads(response.tool_calls[0].function.arguments)


<img src="./images/border.jpg" height="10" width="1500" align="center"/>

Dummy function


In [None]:
# Defines a dummy function to get the current weather
def get_current_weather(location, unit="fahrenheit"):
    """Get the current weather in a given location"""
    weather = {
        "location": location,
        "temperature": "50",
        "unit": unit,
    }

    return json.dumps(weather)


In [None]:
get_current_weather(**args)


'{"location": "London", "temperature": "50", "unit": "celsius"}'
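
The arguments are generated by the model, so they are not guaranteed to be valid JSON or to respect the schema. The sketch below shows one way to check them before calling the function. `parse_tool_arguments` and `WEATHER_PARAMETERS` are hypothetical names introduced here for illustration, and the checks cover only required keys, unknown keys, and `enum` values, not full JSON Schema validation.

```python
import json

# Parameter schema copied from the get_current_weather tool definition
WEATHER_PARAMETERS = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}


def parse_tool_arguments(raw_arguments, parameters):
    """Decode a tool call's JSON argument string and run basic schema checks.

    Hypothetical helper: checks required keys, unknown keys, and enum values only.
    """
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("Arguments must decode to a JSON object")

    properties = parameters.get("properties", {})
    missing = [name for name in parameters.get("required", []) if name not in args]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    unknown = [name for name in args if name not in properties]
    if unknown:
        raise ValueError(f"Unexpected arguments: {unknown}")
    for name, value in args.items():
        allowed = properties[name].get("enum")
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {allowed}")
    return args


checked_args = parse_tool_arguments('{"location":"London","unit":"celsius"}', WEATHER_PARAMETERS)
print(checked_args)  # {'location': 'London', 'unit': 'celsius'}
```

Raising `ValueError` keeps the failure visible; an application could instead send the error text back to the model and ask it to retry.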

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **An LLM-powered conversational agent**


In [None]:
messages = [
    {
        "role": "user",
        "content": "Hello! How are you?",
    }
]

In [None]:
get_completion(client, messages, tools=tools)

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- Specify desired behavior for function calling to tailor the response generation process according to specific requirements.
- By default, the model autonomously determines whether to call a function and which function to call.
- Use the `tool_choice` parameter to control this behavior.
- Default setting: `tool_choice="auto"`.

- In "auto" mode, the model automatically decides whether and which function to call based on the context of the user query.
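
As a rough illustration of the values `tool_choice` takes in this notebook, the sketch below wraps them in a small helper. `build_tool_choice` is a hypothetical name, not part of the OpenAI SDK; the dictionary shape for forcing a specific function is the one passed to the API later in this notebook.

```python
def build_tool_choice(mode="auto", function_name=None):
    """Return a value suitable for the tool_choice parameter.

    Hypothetical helper. mode is "auto", "none", or "function"; the last one
    needs function_name and forces a call to that function.
    """
    if mode in ("auto", "none"):
        return mode
    if mode == "function":
        if not function_name:
            raise ValueError("function_name is required when mode='function'")
        return {"type": "function", "function": {"name": function_name}}
    raise ValueError(f"Unsupported mode: {mode!r}")


print(build_tool_choice())                                   # auto
print(build_tool_choice("none"))                             # none
print(build_tool_choice("function", "get_current_weather"))
```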


<img src="./images/border.jpg" height="10" width="1500" align="center"/>

In [None]:
get_completion(client, messages, tools=tools, tool_choice="auto")


ChatCompletionMessage(content="Hello! I'm here and ready to assist you. How can I help you today?", role='assistant', function_call=None, tool_calls=None)

In [None]:
get_completion(client, messages, tools=tools, tool_choice="none")


ChatCompletionMessage(content="Hello! I'm here and ready to assist you. How can I help you today?", role='assistant', function_call=None, tool_calls=None)

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

Setting `tool_choice="none"` forces the model not to call any of the provided functions, even when the question could be answered with one:

In [None]:
messages = [
    {
        "role": "user",
        "content": "What's the weather like in London?",
    }
]
get_completion(client, messages, tools=tools, tool_choice="none")

ChatCompletionMessage(content='I will check the current weather in London for you.', role='assistant', function_call=None, tool_calls=None)

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

Force the model to choose a function if that's the behavior you want in your application.

In [None]:
messages = [
    {
        "role": "user",
        "content": "What's the weather like in London?",
    }
]
get_completion(client, messages, tools=tools, tool_choice={"type": "function", "function": {"name": "get_current_weather"}})

ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_dh0nudIqqUzA3YLm51IXh2EJ', function=Function(arguments='{"location":"London","unit":"celsius"}', name='get_current_weather'), type='function')])

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

**Exercise**

Develop an agent that passes back the result obtained after calling your APIs with the inputs generated from function calling.

In [None]:
messages = []
messages.append({"role": "user", "content": "What's the weather like in Boston!"})
assistant_message = get_completion(client, messages, tools=tools, tool_choice="auto")
assistant_message = json.loads(assistant_message.model_dump_json())
assistant_message["content"] = str(assistant_message["tool_calls"][0]["function"])

#a temporary patch but this should be handled differently
# remove "function_call" from assistant message
del assistant_message["function_call"]

In [None]:
messages.append(assistant_message)


In [None]:
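# get the weather information to pass back to the model
# the arguments arrive as a JSON string, so decode them before calling the function
tool_args = json.loads(messages[1]["tool_calls"][0]["function"]["arguments"])
weather = get_current_weather(**tool_args)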

messages.append({"role": "tool",
                 "tool_call_id": assistant_message["tool_calls"][0]["id"],
                 "name": assistant_message["tool_calls"][0]["function"]["name"],
                 "content": weather})

In [None]:
final_response = get_completion(client, messages, tools=tools)


In [None]:
final_response


ChatCompletionMessage(content='The current weather in Boston, MA is 50°F.', role='assistant', function_call=None, tool_calls=None)
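
The round trip in this exercise can be generalized into a small function. This is a minimal sketch under stated assumptions: `run_tool_round`, `complete`, and `registry` are hypothetical names, and `complete` stands for any callable that takes the message list and returns the assistant message as a plain dict (for example, a wrapper around `get_completion` that converts its result to a dict). A stub is used here so the example runs without an API key.

```python
import json


def run_tool_round(complete, messages, registry):
    """Run one model -> tool -> model round trip.

    complete: hypothetical callable returning the assistant message as a dict.
    registry: maps tool names to the Python functions that implement them.
    """
    assistant = complete(messages)
    tool_calls = assistant.get("tool_calls") or []
    if not tool_calls:
        return assistant  # the model answered directly, no tool needed

    messages.append(assistant)
    for call in tool_calls:
        name = call["function"]["name"]
        kwargs = json.loads(call["function"]["arguments"])
        result = registry[name](**kwargs)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "name": name,
            "content": result,
        })
    # Second request: the model now sees the tool output and can answer in prose
    return complete(messages)


def stub_weather(location, unit="fahrenheit"):
    """Stand-in for the dummy weather function defined earlier."""
    return json.dumps({"location": location, "temperature": "50", "unit": unit})


def stub_complete(messages):
    """Stub model: requests the weather tool once, then gives a fixed answer."""
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": "stubbed final answer", "tool_calls": None}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_current_weather", "arguments": '{"location": "Boston, MA"}'},
        }],
    }


history = [{"role": "user", "content": "What's the weather like in Boston?"}]
final = run_tool_round(stub_complete, history, {"get_current_weather": stub_weather})
print(final["content"])  # stubbed final answer
```

Keeping the tools in a dictionary means new functions can be added without touching the loop.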

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Generating Data**


- LLMs possess powerful text generation capabilities for creating coherent text.
- Implementing effective prompt strategies enhances the quality, consistency, and factual accuracy of the model's outputs.
- LLMs are invaluable for generating datasets, facilitating a wide range of experiments and evaluations.
- They can be utilized for quick sample generation in applications, such as training a sentiment classifier.

```
Produce 10 exemplars for sentiment analysis. Examples are categorized as either positive or negative. Produce 2 negative examples and 8 positive examples. Use this format for the examples:
Q: <sentence>
A: <sentiment>
```


**Output:**

```
Q: I just got the best news ever!
A: Positive

Q: We just got a raise at work!
A: Positive

Q: I'm so proud of what I accomplished today.
A: Positive

Q: I'm having the best day ever!
A: Positive

Q: I'm really looking forward to the weekend.
A: Positive

Q: I just got the best present ever!
A: Positive

Q: I'm so happy right now.
A: Positive

Q: I'm so blessed to have such an amazing family.
A: Positive

Q: The weather outside is so gloomy.
A: Negative

Q: I just got some terrible news.
A: Negative
```
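
To use generated text like this as training data, it first has to be turned into structured records. The sketch below parses the `Q:`/`A:` layout requested in the prompt and counts the labels, which also makes it easy to check whether the model respected the requested 8/2 split. `parse_qa_examples` is a hypothetical helper, and it assumes each example occupies exactly one `Q:` line followed by one `A:` line.

```python
from collections import Counter


def parse_qa_examples(generated_text):
    """Turn 'Q: ... / A: ...' pairs into a list of {"text", "label"} dicts."""
    examples = []
    question = None
    for line in generated_text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            examples.append({"text": question, "label": line[2:].strip().lower()})
            question = None
    return examples


sample_output = """Q: I'm so happy right now.
A: Positive

Q: The weather outside is so gloomy.
A: Negative

Q: I just got some terrible news.
A: Negative"""

parsed = parse_qa_examples(sample_output)
label_counts = Counter(example["label"] for example in parsed)
print(parsed[0])      # {'text': "I'm so happy right now.", 'label': 'positive'}
print(label_counts)
```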

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- Machine Learning projects often face challenges due to insufficient labeled data, leading to prolonged data collection and labeling phases.
- The emergence of LLMs has transformed this paradigm, enabling the testing and development of ideas or AI-powered features with minimal delay.
- LLMs leverage their generalization capabilities to provide immediate insights and preliminary results.
- Successful initial testing with LLMs can justify and lead into the traditional, more time-consuming development process.

<img src="./images/synthetic_rag_1.webp" width="800" align="center"/>

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- Retrieval Augmented Generation (RAG) is a method that combines information retrieval with LLM text generation for knowledge-intensive tasks.
- The Retrieval model is key: it selects relevant documents for the LLM to process, and its performance directly impacts the quality of the output.
- RAG's effectiveness can vary across languages and specific domains, sometimes struggling with tasks like creating a chatbot for Czech legal advice or a tax assistant for the Indian market.
- One way to enhance RAG's performance is to use LLMs to synthesize training data for new models, a method that can improve accuracy in underrepresented languages or specialized areas.
- This approach, while computationally demanding, aims at distilling LLMs into more efficient models, potentially lowering inference costs and boosting overall system performance.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Domain-Specific Dataset Generation**


- Data labeling is crucial for machine learning projects, following the principle of "garbage in, garbage out."
- Labeling entails generating annotated datasets for training and evaluation purposes.
- This process is often time-consuming and costly, particularly for projects with extensive datasets.
- Leveraging advancements in Large Language Models (LLMs) could potentially mitigate the cost and effort associated with data labeling tasks.

- Utilizing pre-labeled data is more appealing than starting from scratch, as editing is typically less arduous than composing original work.
- Leveraging GPT-4 as a prediction engine for pre-labeling data capitalizes on its contextual understanding and ability to generate human-like text.
- By incorporating GPT-4 into the labeling process, manual effort can be significantly reduced, leading to cost savings and a less monotonous process.
- Prompt engineering, familiar to users of GPT models, involves setting the context for the model to generate specific outputs. Crafting prompts tailored to our desired output format facilitates GPT-4's generation of text resembling model predictions.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Example task: Sentiment analysis - classifying text as positive, negative, or neutral.**

- Prompt: "Given the input text, classify its sentiment as positive, negative, or neutral."
- Follow-up instructions: "Ensure the generated output corresponds accurately to the sentiment expressed in the input text."
- Desired outcome: Text generated should indicate whether the sentiment is positive, negative, or neutral based on the input.

In [23]:
import re

def get_sentiment(input_text):
    """Classify input_text as positive, neutral, or negative using the chat model."""
    prompt = f"Respond in the json format: {{'response': sentiment_classification}}\nText: {input_text}\nSentiment (positive, neutral, negative):"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt},
        ],
        max_tokens=40,
        n=1,
        stop=None,
        temperature=0.5,
    )
    response_text = response.choices[0].message.content.strip()
    # Match case-insensitively, since the model may answer "Positive" or "POSITIVE"
    match = re.search("negative|neutral|positive", response_text.lower())
    if match is None:
        # Fail with a readable message instead of an AttributeError on None
        raise ValueError(f"No sentiment label found in response: {response_text!r}")
    # Add input_text back in for the result
    return {"text": input_text, "response": match.group(0)}

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

In [24]:
# Test single example

sample_text = "I had a terrible time at the party last night!"
sentiment = get_sentiment(sample_text)
sentiment


{'text': 'I had a terrible time at the party last night!',
 'response': 'negative'}

In [25]:
%%writefile input_texts.txt
I love going to the park on a sunny day.
The customer service was terrible; they were rude and unhelpful.
I am neither happy nor sad about the new policy changes.
The cake was delicious and the presentation was fantastic.
I had a really bad experience with the product; it broke after two days.

Writing input_texts.txt


In [26]:
import json

input_file_path = "input_texts.txt"
output_file_path = "output_responses.json"

with open(input_file_path, "r") as input_file, open(output_file_path, "w") as output_file:
    examples = []
    for line in input_file:
        text = line.strip()
        if text:
            examples.append((get_sentiment(text)))
    output_file.write(json.dumps(examples))

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Should I now use it directly for training my model?**

- Verification is required!
- One approach is to review a sample of the labels in a user-friendly annotation tool such as Label Studio.

In [27]:
def convert_ls_format(input_dict):
    """
    Convert sentiment analysis output from a simple format to Label Studio's prediction format.
    
    Args:
        input_dict (dict): A dictionary containing text and response keys. Example:
            {
                "text": "I love going to the park on a sunny day.",
                "response": "positive"
            }
            
    Returns:
        dict: A dictionary in Label Studio's prediction format. 
    """
    
    score_value = 1.00  # We don't know the model confidence
    output_dict = {
        "data": {
            "text": input_dict["text"]
        },
        "predictions": [
            {
                "result": [
                    {
                        "value": {
                            "choices": [
                                input_dict["response"].capitalize()
                            ]
                        },
                        "from_name": "sentiment",
                        "to_name": "text",
                        "type": "choices"
                    }
                ],
                "score": score_value,
                "model_version": "gpt-3.5-turbo"
            }
        ]
    }
    return output_dict

In [28]:
print(convert_ls_format(sentiment))


{'data': {'text': 'I had a terrible time at the party last night!'}, 'predictions': [{'result': [{'value': {'choices': ['Negative']}, 'from_name': 'sentiment', 'to_name': 'text', 'type': 'choices'}], 'score': 1.0, 'model_version': 'gpt-3.5-turbo'}]}


In [29]:
import json

input_file_path = "input_texts.txt"
output_file_path = "output_responses.json"

with open(input_file_path, "r") as input_file, open(output_file_path, "w") as output_file:
    examples = []
    for line in input_file:
        text = line.strip()
        if text:
            examples.append(convert_ls_format(get_sentiment(text)))
    output_file.write(json.dumps(examples))

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- Set up a Label Studio project: Begin by creating a new project in Label Studio.
- Import pre-labeled data: Upload the pre-labeled data into the Label Studio project.
- Review annotations: Examine the annotations within Label Studio to ensure accuracy and consistency.
- Iterate if necessary: Make any necessary adjustments or corrections to the annotations based on the review.
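
Before importing, it can help to sanity-check the pre-labeled tasks so that malformed entries are caught early. The sketch below inspects the dictionary layout produced by `convert_ls_format`. `check_task` and `ALLOWED_LABELS` are hypothetical names, and the label set is an assumption that should be adjusted to match the choices defined in your project's labeling configuration.

```python
# Assumed label set; adjust to the choices in your labeling configuration
ALLOWED_LABELS = {"Positive", "Negative", "Neutral"}


def check_task(task):
    """Return a list of problems found in one pre-labeled task (empty if none)."""
    problems = []
    text = task.get("data", {}).get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("missing data.text")
    for prediction in task.get("predictions", []):
        for item in prediction.get("result", []):
            for choice in item.get("value", {}).get("choices", []):
                if choice not in ALLOWED_LABELS:
                    problems.append(f"unexpected label: {choice!r}")
    if not task.get("predictions"):
        problems.append("no predictions")
    return problems


example_task = {
    "data": {"text": "I love going to the park on a sunny day."},
    "predictions": [{
        "result": [{
            "value": {"choices": ["Positive"]},
            "from_name": "sentiment",
            "to_name": "text",
            "type": "choices",
        }],
        "score": 1.0,
        "model_version": "gpt-3.5-turbo",
    }],
}
print(check_task(example_task))  # []
```

The same function can be applied to every entry loaded from `output_responses.json` with `json.load`.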

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

**Step 1: Install and Launch Label Studio**


In [30]:
!pip install label-studio


Collecting label-studio
  Downloading label_studio-1.11.0-py3-none-any.whl.metadata (14 kB)
Collecting Django<3.3.0,>=3.2.23 (from label-studio)
  Downloading Django-3.2.24-py3-none-any.whl.metadata (4.0 kB)
Collecting appdirs>=1.4.3 (from label-studio)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting attr==0.3.1 (from label-studio)
  Downloading attr-0.3.1.tar.gz (1.7 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting azure-storage-blob>=12.6.0 (from label-studio)
  Downloading azure_storage_blob-12.19.0-py3-none-any.whl.metadata (26 kB)
Collecting bleach<5.1.0,>=5.0.0 (from label-studio)
  Downloading bleach-5.0.1-py3-none-any.whl.metadata (27 kB)
Collecting boto<3.0.0,>=2.49.0 (from label-studio)
  Downloading boto-2.49.0-py2.py3-none-any.whl (1.4 MB)

In [32]:
!label-studio


=> Database and media directory: /Users/mzihayat/Library/Application Support/label-studio
=> Static URL is set to: /static/
=> Database and media directory: /Users/mzihayat/Library/Application Support/label-studio
=> Static URL is set to: /static/
Read environment variables from: /Users/mzihayat/Library/Application Support/label-studio/.env
get 'SECRET_KEY' casted as '<class 'str'>' with default ''
[Tracing] Create new propagation context: {'trace_id': '88bc9d1b2d4641eea2eeb987976c674d', 'span_id': '91b9ebd3dc59ee7d', 'parent_span_id': None, 'dynamic_sampling_context': None}
Starting new HTTPS connection (1): pypi.org:443
https://pypi.org:443 "GET /pypi/label-studio/json HTTP/1.1" 200 31429
Performing system checks...

System check identified no issues (1 silenced).
March 01, 2024 - 16:57:08
Django version 3.2.24, using settings 'label_studio.core.settings.label_studio'
Starting development server at http://0.0.0.0:8080/
Quit the server with CONTROL-C.
[2024-03-01 16:57:11,321] [django

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

**Step 2: Create a New Project**

- Create Project: Click on "Create Project" and input a project name, such as "Review Bootstrapped Labels."
- Define Labeling Configuration: Choose the "Sentiment Analysis Text Classification" template for Sentiment Analysis tasks.
- Configuration Flexibility: These templates are customizable, allowing for easy adjustment of properties as needed.
- Default Labeling Configuration: Below is the default labeling configuration for reference.
  
<img src="./images/lbl_studio_1.png" width="800" align="center"/>

<img src="./images/lbl_studio_2.png" width="800" align="center"/>



<img src="./images/border.jpg" height="10" width="1500" align="center"/>

**Step 3: Import Pre-labeled Data**

<img src="./images/lbl_studio_3.png" width="800" align="center"/>

<img src="./images/lbl_studio_4.png" width="800" align="center"/>

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

**Step 4: Review and Update Labels**


<img src="./images/lbl_studio_5.png" width="800" align="center"/>

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

**Beyond Sentiment Analysis: Label Any NLP Task**

- **Summarization**
```
Respond in the json format with a summary for the following text: {'summary': summary}

Text: I love going to the park on a sunny day. The customer service was terrible; they were rude and unhelpful. I am neither happy nor sad about the new policy changes. The cake was delicious and the presentation was fantastic. I had a really bad experience with the product; it broke after two days.

Summary:

```



<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- **Named Entity Recognition (NER)**
  
  ```
  You are an expert NER labeler. Only return JSON. Do not return explanations. Return the CoNLL format for the following sentence in JSON format.
  
  ```
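
Prompts like these ask for JSON, but a model may still wrap the object in extra text. The sketch below shows one way to pull the first JSON object out of a reply using only the standard library. `extract_json_object` is a hypothetical helper, and it returns `None` when nothing parseable is found, including replies that use single quotes, since those are not valid JSON.

```python
import json


def extract_json_object(response_text):
    """Return the first JSON object found in a model reply, or None."""
    decoder = json.JSONDecoder()
    start = response_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            obj, _ = decoder.raw_decode(response_text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = response_text.find("{", start + 1)
    return None


reply = 'Sure! {"summary": "Mixed feedback about parks, service, and products."} Hope that helps.'
print(extract_json_object(reply))
print(extract_json_object("No JSON here"))  # None
```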

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- Human Evaluation: There is no replacement for human assessment in certain contexts.
- GPT Models as Starting Point: Utilizing GPT models to provide initial insights can offer significant advantages.
- Balancing Human and AI Contributions: Combining human evaluation with AI-generated insights can enhance efficiency and accuracy in various tasks.
- Leveraging GPT Models: Incorporating GPT models provides a valuable starting point, which can streamline processes and improve productivity.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- Prioritize responsible manual annotation, preparing around **20 examples** and randomly **selecting 2-8 for prompts** to enhance data diversity.
- Ensure examples are representative, correctly formatted, and detail specifics like query length and tone.
- Precision in examples and instructions improves synthetic data quality for Retriever training.
- Low-quality few-shot examples can detrimentally affect the trained model's quality.
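
The sampling step described above can be sketched as follows: keep a pool of about 20 hand-written examples and draw a random subset of 2 to 8 for each prompt. `sample_few_shot` and the placeholder pool are hypothetical; a real pool would hold manually prepared (passage, query) pairs.

```python
import random


def sample_few_shot(examples, rng, min_k=2, max_k=8):
    """Pick a random number of distinct examples, between min_k and max_k."""
    upper = min(max_k, len(examples))
    lower = min(min_k, upper)
    k = rng.randint(lower, upper)  # randint is inclusive on both ends
    return rng.sample(examples, k)


# Placeholder pool standing in for ~20 manually annotated examples
pool = [(f"passage {i}", f"query {i}") for i in range(20)]
rng = random.Random(0)  # seeded only to make the run repeatable
shots = sample_few_shot(pool, rng)
print(len(shots))
```

Drawing a fresh subset for every prompt varies both the number and the choice of demonstrations, which is where the added diversity comes from.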

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- Utilizing an affordable model like ChatGPT is often adequate for non-English languages and unusual domains.
- A typical prompt with instructions and 4-5 examples might use about 700 tokens, assuming each passage stays within the Retriever's 128-token limit, plus about 25 tokens for the generated query.
- The cost of generating a synthetic dataset for 50,000 documents, considering GPT-3.5 Turbo API pricing, is approximately $55.
  - Calculation: 50,000 * (700 * 0.001 * $0.0015 + 25 * 0.001 * $0.002)
- Generating multiple (2-4) query examples per document is feasible and can enhance local model fine-tuning.
- Further training, despite additional costs, often yields significant benefits, particularly for specialized domains such as Czech law.
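
The cost estimate above can be reproduced with a few lines of arithmetic. The token counts and per-1,000-token prices are the ones quoted in the calculation and may not reflect current pricing; `synthetic_dataset_cost` is a hypothetical helper name.

```python
def synthetic_dataset_cost(num_documents, prompt_tokens, completion_tokens,
                           input_price_per_1k, output_price_per_1k):
    """Estimate the API cost of generating one query per document."""
    per_document = (
        (prompt_tokens / 1000) * input_price_per_1k
        + (completion_tokens / 1000) * output_price_per_1k
    )
    return num_documents * per_document


cost = synthetic_dataset_cost(
    num_documents=50_000,
    prompt_tokens=700,
    completion_tokens=25,
    input_price_per_1k=0.0015,   # price quoted above, USD per 1,000 input tokens
    output_price_per_1k=0.002,   # price quoted above, USD per 1,000 output tokens
)
print(f"${cost:.2f}")  # $55.00
```

Generating 2-4 queries per document scales the total roughly linearly, since the prompt tokens dominate the per-request cost.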

- Dai et al. (2022) found that around 50,000 manually labeled examples are needed for a model to match the quality of one trained on synthetic data.
- Gathering 10,000 manual examples before launch takes time and costs over a thousand dollars.
- Synthetic data and local Retriever Model training are more cost-effective.
- With the new technique, double-digit metric growth can be achieved in a few days.
  
<img src="./images/synthetic_rag_3.webp" width="800" align="center"/>

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Synthetic Data for RAG Setup**

- The Retrieval model is pivotal in the RAG (Retrieval Augmented Generation) framework, as it identifies relevant documents and forwards them to the Large Language Model (LLM) for subsequent processing.
- The effectiveness of the Retrieval model directly impacts the quality of the product or feature outcome.
- Ideally, the Retrieval model should exhibit satisfactory performance without requiring extensive fine-tuning.
- The performance of the Retrieval model may diminish when operating in diverse languages or specialized domains.
- Strategies such as domain adaptation or multilingual training may be employed to enhance the Retrieval model's performance in various contexts.


<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Scenario:**
- Consider the task of developing a chatbot to answer questions regarding Czech laws and legal practices, or designing a tax assistant customized for the Indian market.
- In such specialized domains and languages, the Retrieval model may struggle to identify the most pertinent documents, leading to suboptimal performance and impacting the overall quality of the system.
- The Retrieval model's limitations may result in missing crucial information and hindering the chatbot or tax assistant's ability to provide accurate and comprehensive responses.
- Addressing these challenges requires tailored approaches such as domain-specific prompting, language adaptation, or incorporating specialized knowledge bases to enhance the Retrieval model's effectiveness in retrieving relevant information.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Solution:**
- Leveraging existing Large Language Models (LLMs) to synthesize data for training new generations of LLMs, Retrievers, and other models.
   - This approach involves distilling LLMs into standard-sized encoders through prompt-based query generation.
   - The distillation process requires significant computational resources.
- Despite its computational demands, this method significantly reduces inference costs and has the potential to greatly enhance model performance, especially in scenarios involving low-resource languages or specialized domains.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **How to use it?**

- Users typically need to supply a brief description and manually label a few examples.
- Different retrieval tasks exhibit diverse search intents, resulting in varying interpretations of "relevance" for a given (Query, Document) pair.
- Relevance can differ significantly based on the specific search intent of the task.
  - For example, in an argument retrieval task, relevance might entail identifying supporting arguments, whereas in other scenarios, counter-arguments might be sought.

**Prompt:**
Task: Identify a counter-argument for the given argument.

Argument #1: {insert passage X1 here}

A concise counter-argument query related to the argument #1: {insert manually prepared query Y1 here}

Argument #2: {insert passage X2 here}
A concise counter-argument query related to the argument #2: {insert manually prepared query Y2 here}

**Examples:**

Argument N: Even if a fine is made proportional to income, you will not get the equality of impact you desire. This is because the impact is not proportional simply to income, but must take into account a number of other factors. For example, someone supporting a family will face a greater impact than someone who is not, because they have a smaller disposable income. Further, a fine based on income ignores overall wealth (i.e. how much money someone actually has: someone might have a lot of assets but not have a high income). The proposition does not cater for these inequalities, which may well have a much greater skewing effect, and therefore the argument is being applied inconsistently.

A concise counter-argument query related to the argument #N:

**Output:**

```
punishment house would make fines relative income

```
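
A prompt in this layout can be assembled programmatically from the manually prepared pairs. The following is a minimal sketch; `build_query_generation_prompt` is a hypothetical helper, and the passages and queries are placeholders rather than real data.

```python
def build_query_generation_prompt(task, examples, new_passage):
    """Assemble a few-shot prompt: labeled (passage, query) pairs, then the new passage."""
    lines = [f"Task: {task}", ""]
    for i, (passage, query) in enumerate(examples, start=1):
        lines.append(f"Argument #{i}: {passage}")
        lines.append(f"A concise counter-argument query related to the argument #{i}: {query}")
        lines.append("")
    n = len(examples) + 1
    lines.append(f"Argument #{n}: {new_passage}")
    # Left open so the model completes the query for the new passage
    lines.append(f"A concise counter-argument query related to the argument #{n}:")
    return "\n".join(lines)


few_shot_pairs = [
    ("placeholder passage X1", "placeholder query Y1"),
    ("placeholder passage X2", "placeholder query Y2"),
]
prompt = build_query_generation_prompt(
    "Identify a counter-argument for the given argument.",
    few_shot_pairs,
    "placeholder passage to generate a query for",
)
print(prompt)
```

Changing only the task line and the example pairs adapts the same builder to a different search intent, such as retrieving supporting arguments.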

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

<img src="./images/synthetic_rag_2.webp" width="800" align="center"/>

- Suppose Information Retrieval isn't your primary task, such as when working on a legal document classification problem without the option to utilize external APIs.
  - In such cases, training a local model becomes necessary to maintain data privacy and comply with restrictions on data sharing.
  - However, gathering sufficient data locally can present a significant hurdle, potentially leading to delays in product development.
  - Limited access to data may hinder model training and subsequent performance, necessitating innovative strategies to overcome this obstacle.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- Consider the task of generating children's stories, as explored in the research by Eldan and Li (2023).
- Each story typically comprises 2-3 paragraphs, following a simple plot and theme, with the dataset covering vocabulary and factual knowledge suitable for children.
- Language serves not only as a system of rules and symbols but also as a medium for conveying and interpreting meaning.
- A key challenge in using large language models for generating training data is ensuring **dataset diversity**. 
  - Even with high generation temperatures, models may produce repetitive datasets lacking the necessary variety, particularly concerning children's language.
- Additionally, natural language generation faces challenges related to **coherence** and **relevance**, which are crucial for producing meaningful and engaging children's stories.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- To address the diversity challenge, the authors curated a vocabulary of approximately 1500 basic words, representative of a typical child's vocabulary and categorized into nouns, verbs, and adjectives.
- For each generation, one verb, one noun, and one adjective were chosen at random from this vocabulary.
- The model was then asked to write a story that uses these three words, so each request starts from a different combination and the resulting dataset is more varied.

```
Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would likely understand. The story should use the verb "{random.choice(verbs_list)}", the noun "{random.choice(nouns_list)}" and the adjective "{random.choice(adjectives_list)}". The story should have the following features: {random.choice(features_list)}, {random.choice(features_list)}. Remember to only use simple words!
```
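A minimal sketch of how this template could be filled in Python. The word and feature lists below are short hypothetical stand-ins for the curated vocabulary, and `build_story_prompt` is an illustrative name, not taken from the paper.

```python
import random

# Hypothetical stand-ins for the curated vocabulary (about 1500 words in the paper).
verbs_list = ["decorate", "jump", "share"]
nouns_list = ["thunder", "ball", "garden"]
adjectives_list = ["ancient", "happy", "small"]
# Hypothetical phrasings of story features.
features_list = [
    "the story should contain at least one dialogue",
    "the story has a bad ending",
    "the story has a moral value",
    "the story has a plot twist",
]


def build_story_prompt(rng=random):
    """Return one story prompt with randomly chosen words and features."""
    verb = rng.choice(verbs_list)
    noun = rng.choice(nouns_list)
    adjective = rng.choice(adjectives_list)
    # sample() picks two distinct features, so the same one is never repeated.
    feature_1, feature_2 = rng.sample(features_list, 2)
    return (
        "Write a short story (3-5 paragraphs) which only uses very simple words "
        "that a 3 year old child would likely understand. "
        f'The story should use the verb "{verb}", the noun "{noun}" '
        f'and the adjective "{adjective}". '
        f"The story should have the following features: {feature_1}, {feature_2}. "
        "Remember to only use simple words!"
    )


print(build_story_prompt())
```

Using `random.sample` instead of two independent `random.choice` calls is a choice of this sketch; the template as written could select the same feature twice.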

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- The technique significantly expanded the dataset's diversity by ensuring that stories encompassed a child's entire vocabulary and various combinations of concepts.
- Additionally, the authors incorporated potential story features such as dialogue, plot twists, bad endings, or moral lessons.
- A random subset of these features was selected for each story generation process.
- The model was prompted to include the chosen story features, thereby enriching the narratives with diverse elements and enhancing the overall quality and variety of the generated stories.

```
Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would likely understand. The story should use the verb "decorate", the noun "thunder" and the adjective "ancient". The story should have the following features: the story should contain at least one dialogue, the story has a bad ending. Remember to only use simple words!
```

**Output:**

```
Once upon a time, in an ancient house, there lived a girl named Lily. She loved to decorate her room with pretty things. One day, she found a big box in the attic. She opened it and saw many shiny decorations. Lily was very happy and decided to use them in her room. As Lily was decorating her room, the sky outside became dark. There was a loud thunder sound, and Lily got scared. She ran to her mom and said, "Mommy, the thunder is so loud!" Her mom hugged her and said, "Don’t worry, it will pass soon." But the thunder did not stop. It got louder and louder, and the ancient house started to shake. Suddenly, the roof fell down on the pretty decorations. Lily was sad because her room was not pretty anymore. The end.
```

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Iterative (Hierarchical) Synthetic Data Generation**

- The format can be further enriched by incorporating additional entities, some of which are generated beforehand by the LLM itself.
  - For instance, one could initiate the process by requesting the generation of a story summary and specifying a particular sentence that must be included in the story.
  - The LLM generates this intermediate data, which is then utilized in subsequent requests.
  - In the final request, the previously generated intermediate data is incorporated, allowing for the creation of more complex and intricate story narratives.


```
Summary: {a short summary generated by LLM, using the approach above}
Features: {copy the features from the initial prompt}
Sentence: {a sentence generated by LLM, which should be present in the story}
Words: {copy the words from the initial prompt}
Story:
```

**Example:**

```
Summary: Lily and Timmy build a sandcastle together and learn to compromise, but it gets knocked over by a gust of wind. They find beauty in the broken sandcastle and play happily with a butterfly.
Features: Dialogue, Foreshadowing, Twist
Sentence: One day, she went to the park and saw a beautiful butterfly.
Words: disagree, network, beautiful
Story:
```

**Output:**

```
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, she went to the park and saw a beautiful butterfly. She tried to catch it, but it flew away. Lily’s friend, Timmy, came to play with her. They decided to build a sandcastle, but they disagreed on how to build it. Lily wanted to make it tall, while Timmy wanted to make it wide. They argued for a while, but finally agreed to make it both tall and wide. As they were building the sandcastle, they noticed a group of ants working together. They realized that the ants were building a network of tunnels. Suddenly, a gust of wind came and knocked over their sandcastle. They were sad, but then they saw the butterfly again. It landed on the remains of their sandcastle and they realized that it was still beautiful, even in its broken state. They smiled and played together happily.
```
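The two-stage process described in this section can be sketched as follows. This is an illustration under stated assumptions: `llm` is any callable that maps a prompt string to a completion string, the wording of the two intermediate prompts is hypothetical, and `fake_llm` is a stub so the example runs without an API key.

```python
def generate_story(llm, words, features):
    """Generate intermediate data first, then the final story.

    llm: callable taking a prompt string and returning text (assumption).
    Returns the final prompt and the story produced from it.
    """
    word_text = ", ".join(words)
    feature_text = ", ".join(features)

    # Stage 1: ask the model for intermediate data (hypothetical prompt wording).
    summary = llm(
        f"Write a short summary of a children's story that uses the words: "
        f"{word_text}. Features: {feature_text}."
    )
    sentence = llm(
        f"Write one sentence that should appear in a story with this summary: {summary}"
    )

    # Stage 2: feed the intermediate data back in, following the template above.
    final_prompt = (
        f"Summary: {summary}\n"
        f"Features: {feature_text}\n"
        f"Sentence: {sentence}\n"
        f"Words: {word_text}\n"
        "Story:"
    )
    return final_prompt, llm(final_prompt)


def fake_llm(prompt):
    """Stub standing in for a real model call."""
    return "<generated text>"


final_prompt, story = generate_story(
    fake_llm,
    words=["disagree", "network", "beautiful"],
    features=["Dialogue", "Foreshadowing", "Twist"],
)
print(final_prompt)
```

Passing the model in as a callable keeps the pipeline independent of any particular client library.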

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Generating Code**


- Language models like ChatGPT are highly effective at generating code.
- As with other chat models from OpenAI, a System Message defines the response behavior and format.
- The prompt examples below use the following System Message:

```
You are a helpful code assistant that can teach a junior developer how to code. Your language of choice is Python. Don't explain the code, just generate the code block itself.
```
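A minimal sketch of how this System Message could be paired with a user prompt, reusing the `client.chat.completions.create` call shown at the start of this notebook. The helper names `build_messages` and `generate_code` are hypothetical, and `generate_code` is defined but not called here because it needs a configured `OpenAI` client.

```python
SYSTEM_MESSAGE = (
    "You are a helpful code assistant that can teach a junior developer how to "
    "code. Your language of choice is Python. Don't explain the code, just "
    "generate the code block itself."
)


def build_messages(user_prompt, system_message=SYSTEM_MESSAGE):
    """Return the message list for a chat completion request."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
    ]


def generate_code(client, user_prompt, model="gpt-3.5-turbo"):
    """Send the prompt and return the reply text.

    client is assumed to be an openai.OpenAI instance, created as in the
    earlier cells of this notebook.
    """
    completion = client.chat.completions.create(
        model=model,
        messages=build_messages(user_prompt),
    )
    return completion.choices[0].message.content


messages = build_messages("# function to multiply two numbers and add 75 to it")
print(messages)
```

Keeping the System Message in one constant makes it easy to compare how different instructions change the model's behavior across the examples that follow.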

- The code generation capabilities of LLMs can also be employed to generate code solely from comments.
- Instructions are provided within a comment block to guide the code generation process.
- By leveraging comments as input, LLMs can interpret the instructions and generate corresponding code snippets.
- This approach offers versatility in generating code from various types of input, such as comments or descriptions of desired functionality.

```
"""
1. Create a list of movies
2. Create a list of ratings for these movies
3. Combine them to make a json object of 10 movies with their ratings.
"""
```

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

**Output:**

In [33]:
import json

# Movie titles and their ratings, in matching order
movies = ["The Shawshank Redemption", "The Godfather", "The Dark Knight", "Schindler's List", "Pulp Fiction", "The Lord of the Rings: The Return of the King", "Forrest Gump", "Star Wars: Episode V - The Empire Strikes Back", "Inception", "The Silence of the Lambs"]

ratings = [9.3, 9.2, 9.0, 8.9, 8.9, 8.9, 8.8, 8.7, 8.7, 8.6]

# Map each title to its rating
movie_ratings = {}

for i in range(len(movies)):
    movie_ratings[movies[i]] = ratings[i]

# Serialize the mapping as a JSON object
json_object = json.dumps(movie_ratings, indent=4)

print(json_object)

{
    "The Shawshank Redemption": 9.3,
    "The Godfather": 9.2,
    "The Dark Knight": 9.0,
    "Schindler's List": 8.9,
    "Pulp Fiction": 8.9,
    "The Lord of the Rings: The Return of the King": 8.9,
    "Forrest Gump": 8.8,
    "Star Wars: Episode V - The Empire Strikes Back": 8.7,
    "Inception": 8.7,
    "The Silence of the Lambs": 8.6
}


<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Complete Functions or Next Line**


```
# function to multiply two numbers and add 75 to it
 
def multiply(
    
```

**Output:**

```
a, b):
    result = a * b
    result += 75
    return result

```
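Because the model continues exactly where the prompt stops, the prompt and the completion only form valid code once they are joined. A small sketch of that check, with both strings copied from the example above:

```python
# The prompt ends mid-signature; the completion supplies the rest.
prompt = "# function to multiply two numbers and add 75 to it\n\ndef multiply(\n"
completion = "a, b):\n    result = a * b\n    result += 75\n    return result\n"

# Drop the trailing whitespace of the prompt so the signature reads
# "def multiply(a, b):" after joining.
source = prompt.rstrip() + completion

namespace = {}
# Only execute generated code that you trust or have sandboxed.
exec(source, namespace)
multiply = namespace["multiply"]

print(multiply(2, 3))  # prints 81
```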

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Explain Code**



- When learning a programming language, prompting the model to explain specific code segments can be highly beneficial.
- Queries generated previously can be repurposed to request explanations for code snippets.
- It's essential to consider the System Message used, as the model's behavior may vary based on its instructions.
  - For instance, if the System Message restricts the model to acting as a Python code assistant, it might refuse certain requests that deviate from this instruction.
- The model may still provide explanations for code snippets, demonstrating its adaptability and utility in aiding programming education.

```
SELECT students.StudentId, students.StudentName
FROM students
INNER JOIN departments
ON students.DepartmentId = departments.DepartmentId
WHERE departments.DepartmentName = 'Computer Science';

Explain the above SQL statement.

```

**Output:**

```
Sorry, as a code assistant in Python, I cannot generate SQL code. However, I can explain the purpose of the SQL statement provided. 

This SQL statement selects the `StudentId` and `StudentName` columns from the `students` table, where the `DepartmentName` column in the `departments` table is equal to 'Computer Science'. To achieve this, the statement performs an inner join on the `DepartmentId` column between the `students` and `departments` tables. 

In simpler terms, it is selecting the names and IDs of students who are enrolled in the Computer Science department.
```

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Graduate Job Classification Case Study**

- Clavié et al. (2023) present a case study on prompt engineering applied to a text classification task within a production system.
- **Task Description:** 
  - Classifying whether a job is suitable as an entry-level position for recent graduates or not.
- The study evaluates various prompt engineering techniques using GPT-3.5 (gpt-3.5-turbo), comparing its performance against other models, including DeBERTa-V3.
  - Few-shot CoT prompting underperforms compared to Zero-shot prompting when expert knowledge is not required.
  - The prompt significantly influences the model's reasoning: after prompt engineering, the model achieves a substantially higher F1 score than when it is simply asked to classify a job.
  - Attempts to enforce template adherence reduce performance across all cases.
  - Small modifications have a disproportionate impact on performance.
  - Clear instructions and repetition of key points are crucial for enhancing performance.
  - Even simple modifications, such as assigning a human name to the model, lead to measurable improvements in F1 score.
- In early testing with GPT-4, carried out after the period covered in the paper, the template adherence issue diminishes.


<img src="./images/study_1.png" width="800" align="center"/>


<img src="./images/study_2.png" width="800" align="center"/>


<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Prompt Function**

- GPT's dialogue interface can be compared to a programming language's shell.
  - In this analogy, encapsulating a prompt is like defining a function.
- The encapsulated prompt acts as a **function** with a unique name. 
  - When invoked with input text, it produces results based on predefined internal rules.
- Creating a reusable prompt with a name simplifies engagement with GPT, akin to having a handy tool to delegate specific tasks.
  - Inputting text triggers the function, generating the desired output.
- Encapsulating prompts into functions enables the creation of a series of functions to establish a workflow.
  - Each function represents a distinct step or task, and when sequenced appropriately, they automate complex processes or enhance problem-solving efficiency.


- **Before** we can use a function, we need to **let GPT know about it**. 


```
Hello, ChatGPT! I hope you are doing well. I am reaching out to you for assistance with a specific function. I understand that you have the capability to process information and perform various tasks based on the instructions provided. In order to help you understand my request more easily, I will be using a template to describe the function, input, and instructions on what to do with the input. Please find the details below:

function_name: [Function Name]
input: [Input]
rule: [Instructions on how to process the input]

I kindly request you to provide the output for this function, based on the details I have provided. Your assistance is greatly appreciated. Thank you!
I will replace the text inside the brackets with the relevant information for the function I want you to perform. This detailed introduction should help you understand my request more efficiently and provide the desired output. The format is function_name(input). If you understand, just answer with one word: ok.

```
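The template and the `function_name(input)` call format can also be produced programmatically. The sketch below is one possible way to do it: `define_function`, `call`, and the `Call` marker class are hypothetical helpers, and the strings they return would be pasted or sent into the chat as ordinary messages.

```python
class Call(str):
    """Marks text that is already a rendered function call."""


def define_function(name, inputs, rule):
    """Render a function definition in the bracketed template format."""
    quoted = ", ".join(f'"{item}"' for item in inputs)
    return f"function_name: [{name}]\ninput: [{quoted}]\nrule: [{rule}]"


def call(name, *args):
    """Render function_name(input); nested calls are inserted unquoted."""
    rendered = ", ".join(a if isinstance(a, Call) else repr(a) for a in args)
    return Call(f"{name}({rendered})")


definition = define_function(
    "fix_english",
    ["text"],
    "Improve the text's vocabulary and sentences. Keep the meaning the same.",
)
print(definition)

# Chaining: the inner call becomes the input of the outer call.
chained = call("fix_english", call("trans_word", "hola"))
print(chained)
```

Rendering nested calls as plain text mirrors how the functions are chained in the chat itself: the model, not Python, evaluates them.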


<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Examples: English Study Assistant**

```
function_name: [trans_word]
input: ["text"]
rule: [I want you to act as an English translator, spelling corrector and improver. I will provide you with input forms including "text" in any language and you will detect the language, translate it and answer with the corrected version of my text, in English.]
```

```
function_name: [fix_english]
input: ["text"]
rule: [Please serve as an English master, spelling corrector, and language enhancer. I will provide you with input forms including "text". I want you to improve the text's vocabulary and sentences to be more natural and elegant. Keep the meaning the same.]
```

```
trans_word('婆罗摩火山处于享有“千岛之国”美称的印度尼西亚. 多岛之国印尼有4500座之多的火山, 世界著名的十大活火山有三座在这里.')
fix_english('Finally, you can run the function independently or chain them together.')
fix_english(expand_word(trans_word('婆罗摩火山处于享有“千岛之国”美称的印度尼西亚. 多岛之国印尼有4500座之多的火山, 世界著名的十大活火山有三座在这里.')))
```

> If you don't want ChatGPT to output excessive information, you can simply add a sentence after defining the function's rules.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Multiple-Parameter Function**

```
function_name: [pg]
input: ["length", "capitalized", "lowercase", "numbers", "special"]
rule: [I want you to act as a password generator for individuals in need of a secure password. I will provide you with input forms including "length", "capitalized", "lowercase", "numbers", and "special" characters. Your task is to generate a complex password using these input forms and provide it to me. Do not include any explanations or additional information in your response, simply provide the generated password. For example, if the input forms are length = 8, capitalized = 1, lowercase = 4, numbers = 2, special = 1, your response should be a password such as "D5%t9cgf".]

```
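For comparison, the same five parameters can be implemented locally. This is a minimal sketch rather than the behavior of the prompt function: the local `pg` below is a hypothetical counterpart that uses Python's `secrets` module, and it treats `string.punctuation` as the set of special characters, which is an assumption.

```python
import secrets
import string


def pg(length, capitalized, lowercase, numbers, special):
    """Generate a password with the requested number of each character type."""
    if capitalized + lowercase + numbers + special != length:
        raise ValueError("character counts must add up to length")
    chars = (
        [secrets.choice(string.ascii_uppercase) for _ in range(capitalized)]
        + [secrets.choice(string.ascii_lowercase) for _ in range(lowercase)]
        + [secrets.choice(string.digits) for _ in range(numbers)]
        + [secrets.choice(string.punctuation) for _ in range(special)]
    )
    # Shuffle with a cryptographically secure source so the types are interleaved.
    secrets.SystemRandom().shuffle(chars)
    return "".join(chars)


password = pg(length=12, capitalized=2, lowercase=6, numbers=3, special=1)
print(password)
```

The explicit check that the counts add up to `length` makes an inconsistent request fail loudly, whereas a language model given the same inputs has to guess which constraint to relax.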