## **Outline**

- Generating Data and Synthetic Datasets
- Case Study: Graduate Job Classification
- Workshop: Tackling Generated Datasets Diversity


<img src="./images/border.jpg" height="10" width="1500" align="center"/>

In [3]:
!python -m venv openai-env

In [4]:
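# Note: each "!" command runs in its own subshell, so this activation does not carry over to later cells
!source openai-env/bin/activate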

In [5]:
!pip install --upgrade openai



In [6]:
from openai import OpenAI
import os

client = OpenAI()
# defaults to getting the key using os.environ.get("OPENAI_API_KEY")
# if you saved the key under a different environment variable name, you can do something like:
client = OpenAI(
  api_key=os.environ.get("OPENAI_API_KEY"),
)

In [7]:
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content="In the realm of coding, there's a method quite grand,\nA concept mysterious, yet elegantly planned.\nRecursion it's called, a looping of delight,\nA function that calls itself, shining bright.\n\nLike a mirror reflecting endlessly,\nRecursion spirals through functions, fluently.\nIn a dance of repetition, it finds its grace,\nSolving problems with an elegant embrace.\n\nThe base case, a safety net in its design,\nControls the looping, so sublime.\nAs calls unravel, deeply nested in the code,\nRecursion weaves a tale, elegantly told.\n\nFrom fractals to trees, it finds its place,\nA concept so profound, yet full of grace.\nSo embrace the mystery, let recursion guide your way,\nIn the enchanted realm of coding, where its magic holds sway.", role='assistant', function_call=None, tool_calls=None)


## **Function Calling with LLMs**
- Function calling enables LLMs like GPT-4 and GPT-3.5 to reliably connect with external tools and APIs.
- The model detects when a function call is needed within a chat and outputs JSON containing the arguments; your application then executes the function.

- **Tool Integration:**
  - Functions act as tools within AI applications, allowing for multiple tools to be defined and called in a single request.

- **Importance for AI Applications:**
  - Essential for developing LLM-powered chatbots or agents that require context retrieval or need to interact with external tools.
  - Transforms natural language instructions into actionable API calls, enhancing the utility and interactivity of chatbots.

- **Enhancing Chatbot Capabilities:**
  - Facilitates seamless integration of LLMs with a wide range of external services and data sources.
  - Enables chatbots to perform complex tasks, such as data retrieval, content creation, and more, by calling specific functions.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

- **Applications Enabled by Function Calling**

- **Conversational Agents with External Tool Usage:**
  - Allows conversational agents to efficiently utilize external tools to answer user questions.
  - Example: Querying weather information translates to function calls like `get_current_weather(location: string, unit: 'celsius' | 'fahrenheit')`.

- **Data Extraction and Tagging Solutions:**
  - Empowers LLM-powered solutions to extract and tag data from various sources.
  - Example: Extracting people's names from a Wikipedia article.

- **Natural Language to API Conversion:**
  - Facilitates the creation of applications that translate natural language into API calls or database queries.
  - Enhances the usability and accessibility of data and services.

- **Conversational Knowledge Retrieval Engines:**
  - Enables conversational engines to interact with knowledge bases, facilitating knowledge retrieval through natural language queries.

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Function Calling with GPT-4**


- **Integration of LLM with External Tool for Weather Query**
  - **Challenge:**
    - An LLM alone cannot answer queries about live data, such as the current weather, because its knowledge is limited to its training data.
  - **Solution:**
    - Utilize function calling capabilities of the LLM to invoke an external tool for weather information retrieval.

  - **Implementation Example:**
    - User query: "What is the weather like in a given location?"
    - LLM processes the query and recognizes the need for external information.
    - Function calling mechanism selects appropriate function and arguments (e.g., `get_current_weather(location: string, unit: 'celsius' | 'fahrenheit')`).
    - The application calls the weather API with those arguments and passes the result back to the LLM through the OpenAI API.
    - Final response generated based on the retrieved weather data.

```
What is the weather like in London?

```

- To handle this request using function calling, first define the weather function (or set of functions) that you will pass as part of the OpenAI API request:


In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string", 
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

- Function Details:
  - Name: get_current_weather
  - Description: Retrieves the current weather in a specified location.
  - Parameters:
    - location: Specifies the city and state (e.g., San Francisco, CA).
    - unit: Specifies the temperature unit (celsius or fahrenheit).
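
Because the model returns the arguments as JSON text, it can help to check them against the schema before running real code. The following is a minimal sketch, not part of the OpenAI SDK: `validate_args` is a hypothetical helper that only checks for required keys, unexpected keys, and `enum` values, and the schema is repeated here so the cell is self-contained.

```python
import json

# Same "parameters" schema as in the tools definition, repeated for a self-contained cell.
weather_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

def validate_args(raw_arguments, schema):
    """Hypothetical helper: parse the JSON string and return (args, errors)."""
    try:
        parsed = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    if not isinstance(parsed, dict):
        return None, ["arguments must be a JSON object"]

    errors = []
    # Every key listed under "required" must be present.
    for key in schema.get("required", []):
        if key not in parsed:
            errors.append(f"missing required argument: {key}")
    # Every supplied key must be declared, and enum values must match.
    for key, value in parsed.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unexpected argument: {key}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"invalid value for {key}: {value!r}")
    return parsed, errors

checked_args, problems = validate_args('{"location": "London", "unit": "celsius"}', weather_schema)
print(checked_args, problems)
```

A fuller implementation would likely use a JSON Schema validation library; this sketch only covers the three checks named above.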


Define a completion function as follows:

In [None]:
def get_completion(client, messages, model="gpt-3.5-turbo-1106", temperature=0, max_tokens=300, tools=None, tool_choice=None):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        tools=tools,
        tool_choice=tool_choice
    )
    return response.choices[0].message

Compose the user question:



In [None]:
messages = [
    {
        "role": "user",
        "content": "What is the weather like in London?"
    }
]

In [None]:
response = get_completion(client, messages, tools=tools)
print(response)


ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_q17T7UG98lVTh9XMuZcb0yrv', function=Function(arguments='{"location":"London","unit":"celsius"}', name='get_current_weather'), type='function')])


Capture the arguments:

In [None]:
import json

args = json.loads(response.tool_calls[0].function.arguments)


Define a dummy function that stands in for a real weather API:


In [None]:
# Defines a dummy function to get the current weather
def get_current_weather(location, unit="fahrenheit"):
    """Get the current weather in a given location"""
    weather = {
        "location": location,
        "temperature": "50",
        "unit": unit,
    }

    return json.dumps(weather)


In [None]:
get_current_weather(**args)


'{"location": "London", "temperature": "50", "unit": "celsius"}'
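
When several tools are defined, the function name returned by the model has to be mapped to a Python callable. One possible sketch uses a plain dictionary as a registry. `TOOL_REGISTRY` and `dispatch_tool_call` are hypothetical names, and `fake_call` is a stand-in object shaped like `response.tool_calls[0]` so that the cell runs without an API request.

```python
import json
from types import SimpleNamespace

def get_current_weather(location, unit="fahrenheit"):
    """Dummy stand-in for a real weather API (same behavior as the cell above)."""
    return json.dumps({"location": location, "temperature": "50", "unit": unit})

# Hypothetical registry: tool name -> Python callable.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}

def dispatch_tool_call(tool_call, registry=TOOL_REGISTRY):
    """Run the function named in a tool call and return its result as a string."""
    name = tool_call.function.name
    if name not in registry:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(tool_call.function.arguments)
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments were not valid JSON"})
    return registry[name](**kwargs)

# Stand-in for response.tool_calls[0]; a real call would come from the API response.
fake_call = SimpleNamespace(
    id="call_123",
    function=SimpleNamespace(
        name="get_current_weather",
        arguments='{"location":"London","unit":"celsius"}',
    ),
)
print(dispatch_tool_call(fake_call))
```

Returning an error as JSON, rather than raising, is a design choice assumed here so that the error text could be passed back to the model as the tool result.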

## **An LLM-powered conversational agent**


In [None]:
messages = [
    {
        "role": "user",
        "content": "Hello! How are you?",
    }
]

In [None]:
get_completion(client, messages, tools=tools)

- You can specify the desired function-calling behavior to tailor response generation to your requirements.
- Use the `tool_choice` parameter to control this behavior.
- The default setting is `tool_choice="auto"`: the model decides for itself whether to call a function, and which one, based on the context of the user query.


In [None]:
get_completion(client, messages, tools=tools, tool_choice="auto")


ChatCompletionMessage(content="Hello! I'm here and ready to assist you. How can I help you today?", role='assistant', function_call=None, tool_calls=None)

In [None]:
get_completion(client, messages, tools=tools, tool_choice="none")


ChatCompletionMessage(content="Hello! I'm here and ready to assist you. How can I help you today?", role='assistant', function_call=None, tool_calls=None)

Setting `tool_choice="none"` forces the model not to use any of the functions provided.

In [None]:
messages = [
    {
        "role": "user",
        "content": "What's the weather like in London?",
    }
]
get_completion(client, messages, tools=tools, tool_choice="none")

ChatCompletionMessage(content='I will check the current weather in London for you.', role='assistant', function_call=None, tool_calls=None)

You can also force the model to call a specific function, if that is the behavior you want in your application, by naming it in `tool_choice`.

In [None]:
messages = [
    {
        "role": "user",
        "content": "What's the weather like in London?",
    }
]
get_completion(client, messages, tools=tools, tool_choice={"type": "function", "function": {"name": "get_current_weather"}})

ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_dh0nudIqqUzA3YLm51IXh2EJ', function=Function(arguments='{"location":"London","unit":"celsius"}', name='get_current_weather'), type='function')])
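
The three `tool_choice` forms used in the cells above can be gathered in one small helper. This is only an illustrative sketch: `make_tool_choice` is a hypothetical name, and it covers just the values shown in this notebook (`"auto"`, `"none"`, and a named function).

```python
def make_tool_choice(mode="auto", function_name=None):
    """Hypothetical helper that builds a value for the tool_choice parameter."""
    if function_name is not None:
        # Force a call to one specific function.
        return {"type": "function", "function": {"name": function_name}}
    if mode not in ("auto", "none"):
        raise ValueError(f"unsupported mode: {mode!r}")
    # "auto" lets the model decide; "none" disables function calls.
    return mode

print(make_tool_choice())
print(make_tool_choice("none"))
print(make_tool_choice(function_name="get_current_weather"))
```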

**Exercise**

Develop an agent that calls your API with the arguments generated by function calling and passes the result back to the model.

In [None]:
messages = []
messages.append({"role": "user", "content": "What's the weather like in Boston?"})
assistant_message = get_completion(client, messages, tools=tools, tool_choice="auto")
# Convert the response object to a plain dict so it can be appended to the message history
assistant_message = json.loads(assistant_message.model_dump_json())
# The content is None when the model calls a tool; store the function call as text instead
assistant_message["content"] = str(assistant_message["tool_calls"][0]["function"])

# Temporary patch (this should be handled differently):
# remove the empty "function_call" field from the assistant message before sending it back
del assistant_message["function_call"]

In [None]:
messages.append(assistant_message)


In [None]:
# Get the weather information to pass back to the model.
# The arguments arrive as a JSON string, so parse them into keyword arguments first.
tool_args = json.loads(messages[1]["tool_calls"][0]["function"]["arguments"])
weather = get_current_weather(**tool_args)

messages.append({"role": "tool",
                 "tool_call_id": assistant_message["tool_calls"][0]["id"],
                 "name": assistant_message["tool_calls"][0]["function"]["name"],
                 "content": weather})

In [None]:
final_response = get_completion(client, messages, tools=tools)


In [None]:
final_response


ChatCompletionMessage(content='The current weather in Boston, MA is 50°F.', role='assistant', function_call=None, tool_calls=None)
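
The manual steps of the exercise can be folded into a loop that keeps calling the model until it answers without requesting a tool. The sketch below is one possible shape, under stated assumptions: `run_agent` and `FakeClient` are hypothetical names, and `FakeClient` replays scripted messages so that the cell runs offline. With the real `client`, the same `run_agent` function would issue actual API requests.

```python
import json
from types import SimpleNamespace

def run_agent(client, messages, tools, registry, model="gpt-3.5-turbo-1106", max_rounds=3):
    """Sketch of a tool-calling loop: call the model, run requested tools, repeat."""
    for _ in range(max_rounds):
        response = client.chat.completions.create(model=model, messages=messages, tools=tools)
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # plain answer, no tool needed
        # Record the assistant turn, including the tool calls it requested.
        messages.append({
            "role": "assistant",
            "content": message.content,
            "tool_calls": [
                {"id": tc.id, "type": "function",
                 "function": {"name": tc.function.name, "arguments": tc.function.arguments}}
                for tc in message.tool_calls
            ],
        })
        # Run each requested tool and return its result under the matching id.
        for tc in message.tool_calls:
            result = registry[tc.function.name](**json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
    return None  # no final answer within max_rounds

class FakeClient:
    """Scripted stand-in for the OpenAI client (assumption, for offline runs only)."""
    def __init__(self, scripted_messages):
        self._scripted = iter(scripted_messages)
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=self._create))

    def _create(self, **kwargs):
        return SimpleNamespace(choices=[SimpleNamespace(message=next(self._scripted))])

def get_current_weather(location, unit="fahrenheit"):
    """Dummy weather function, as defined earlier in the notebook."""
    return json.dumps({"location": location, "temperature": "50", "unit": unit})

tool_turn = SimpleNamespace(content=None, tool_calls=[SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_current_weather", arguments='{"location": "Boston, MA"}'),
)])
final_turn = SimpleNamespace(content="It is 50 degrees Fahrenheit in Boston, MA.", tool_calls=None)

history = [{"role": "user", "content": "What's the weather like in Boston?"}]
answer = run_agent(FakeClient([tool_turn, final_turn]), history, tools=None,  # tools ignored by the fake
                   registry={"get_current_weather": get_current_weather})
print(answer)
```

The `max_rounds` cap is an assumed safeguard against a model that keeps requesting tools indefinitely.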

## **Generating Data**


- LLMs are strong at generating coherent text.
- Effective prompting strategies improve the quality, consistency, and factual accuracy of the model's outputs.
- This makes LLMs useful for generating datasets for a wide range of experiments and evaluations.
- For example, they can quickly generate samples for training a sentiment classifier.

**Prompt:**

```
Produce 10 exemplars for sentiment analysis. Examples are categorized as either positive or negative. Produce 2 negative examples and 8 positive examples. Use this format for the examples:
Q: <sentence>
A: <sentiment>
```


**Output:**

```
Q: I just got the best news ever!
A: Positive

Q: We just got a raise at work!
A: Positive

Q: I'm so proud of what I accomplished today.
A: Positive

Q: I'm having the best day ever!
A: Positive

Q: I'm really looking forward to the weekend.
A: Positive

Q: I just got the best present ever!
A: Positive

Q: I'm so happy right now.
A: Positive

Q: I'm so blessed to have such an amazing family.
A: Positive

Q: The weather outside is so gloomy.
A: Negative

Q: I just got some terrible news.
A: Negative
```
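
To use such output as training data, the `Q:`/`A:` text has to be turned into labeled pairs. A minimal sketch follows, assuming the model kept to the requested format. `parse_qa_pairs` is a hypothetical helper, and `sample_output` copies two of the examples above.

```python
from collections import Counter

def parse_qa_pairs(text):
    """Turn 'Q: <sentence>' / 'A: <sentiment>' lines into (sentence, label) tuples."""
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip().lower()))
            question = None  # wait for the next question
    return pairs

sample_output = """Q: I just got the best news ever!
A: Positive

Q: The weather outside is so gloomy.
A: Negative
"""

pairs = parse_qa_pairs(sample_output)
print(pairs)
# Counting labels makes it easy to check the requested positive/negative split.
print(Counter(label for _, label in pairs))
```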

- Machine Learning projects often face challenges due to insufficient labeled data, leading to prolonged data collection and labeling phases.
- The emergence of LLMs has transformed this paradigm, enabling the testing and development of ideas or AI-powered features with minimal delay.
- LLMs leverage their generalization capabilities to provide immediate insights and preliminary results.
- Successful initial testing with LLMs can justify and lead into the traditional, more time-consuming development process.

<img src="./images/synthetic_rag_1.webp" width="800" align="center"/>

- Retrieval Augmented Generation (RAG) combines information retrieval with LLM text generation for knowledge-intensive tasks.
- The Retrieval model is key: it selects the relevant documents for the LLM to process, so its performance directly impacts the quality of the output.
- RAG's effectiveness can vary across languages and specific domains, and it sometimes struggles with tasks like creating a chatbot for Czech legal advice or a tax assistant for the Indian market.
- One way to enhance RAG's performance is to use LLMs to synthesize training data for new models, which can improve accuracy in underrepresented languages or specialized areas.
- This approach, while computationally demanding, aims at distilling LLMs into more efficient models, potentially lowering inference costs and boosting overall system performance.

## **Domain-Specific Dataset Generation**


- To generate training data for a retrieval task with an LLM, provide a concise task description and a handful of manually labeled examples.
- Different retrieval tasks have distinct search intents, impacting the definition of "relevance" for any given (Query, Document) pair.
- The perceived relevance of a document can vary significantly based on the specific search intent behind the retrieval task.
- For example, an argument retrieval task may look for documents containing supporting arguments, whereas other tasks may seek counter-arguments, as illustrated by the ArguAna dataset.

```
Task: Identify a counter-argument for the given argument.

Argument #1: {insert passage X1 here}

A concise counter-argument query related to the argument #1: {insert manually prepared query Y1 here}

Argument #2: {insert passage X2 here}
A concise counter-argument query related to the argument #2: {insert manually prepared query Y2 here}

<- paste your examples here ->

Argument N: Even if a fine is made proportional to income, you will not get the equality of impact you desire. This is because the impact is not proportional simply to income, but must take into account a number of other factors. For example, someone supporting a family will face a greater impact than someone who is not, because they have a smaller disposable income. Further, a fine based on income ignores overall wealth (i.e. how much money someone actually has: someone might have a lot of assets but not have a high income). The proposition does not cater for these inequalities, which may well have a much greater skewing effect, and therefore the argument is being applied inconsistently.

A concise counter-argument query related to the argument #N:
```

**Output:**

```
punishment house would make fines relative income
```

<img src="./images/synthetic_rag_2.webp" width="800" align="center"/>

- Prioritize responsible manual annotation, preparing around **20 examples** and randomly **selecting 2-8 for prompts** to enhance data diversity.
- Ensure examples are representative, correctly formatted, and detail specifics like query length and tone.
- Precision in examples and instructions improves synthetic data quality for Retriever training.
- Low-quality few-shot examples can detrimentally affect the trained model's quality.
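
The sampling step described above can be sketched as follows: draw between 2 and 8 examples at random from the manually prepared pool and fill the counter-argument template with them. `build_prompt` is a hypothetical helper, and the pool below holds placeholder strings rather than real annotated pairs.

```python
import random

TASK = "Task: Identify a counter-argument for the given argument."

def build_prompt(example_pool, new_passage, k_min=2, k_max=8, rng=random):
    """Sample k_min..k_max labeled examples and fill the few-shot template."""
    k = rng.randint(k_min, min(k_max, len(example_pool)))
    shots = rng.sample(example_pool, k)
    parts = [TASK, ""]
    for i, (passage, query) in enumerate(shots, start=1):
        parts += [
            f"Argument #{i}: {passage}",
            "",
            f"A concise counter-argument query related to the argument #{i}: {query}",
            "",
        ]
    # The final argument is left without a query for the model to complete.
    n = len(shots) + 1
    parts += [
        f"Argument #{n}: {new_passage}",
        "",
        f"A concise counter-argument query related to the argument #{n}:",
    ]
    return "\n".join(parts)

# Placeholder pool; in practice these would be about 20 manually prepared pairs.
pool = [(f"<passage {i}>", f"<query {i}>") for i in range(1, 21)]
prompt = build_prompt(pool, "<new passage>", rng=random.Random(0))
print(prompt)
```

Passing a seeded `random.Random` instance makes a prompt reproducible; the default uses the global `random` module, so each document gets a different mix of examples.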

- Utilizing an affordable model like ChatGPT is often adequate for non-English languages and unusual domains.
- A typical prompt with instructions and 4-5 examples uses about 700 tokens (assuming each passage is at most 128 tokens because of Retriever constraints), plus about 25 tokens for the generated query.
- At GPT-3.5 Turbo API pricing ($0.0015 per 1,000 input tokens and $0.002 per 1,000 output tokens), generating a synthetic dataset for 50,000 documents costs approximately $55.
  - Calculation: `50,000 * (700 * 0.001 * $0.0015 + 25 * 0.001 * $0.002) = $55`
- Generating multiple (2-4) query examples per document is feasible and can enhance local model fine-tuning.
- Further training, despite additional costs, often yields significant benefits, particularly for specialized domains such as Czech law.
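
The cost estimate above can be reproduced in a few lines. This is a sketch using the prices quoted in the calculation, which may no longer be current. `estimate_cost` is a hypothetical helper, and `queries_per_document` assumes that each extra query is generated by a separate API call with the same prompt size.

```python
def estimate_cost(num_documents, prompt_tokens, completion_tokens,
                  input_price_per_1k, output_price_per_1k, queries_per_document=1):
    """Estimate the API cost in dollars for generating synthetic queries."""
    per_call = ((prompt_tokens / 1000) * input_price_per_1k
                + (completion_tokens / 1000) * output_price_per_1k)
    return num_documents * queries_per_document * per_call

# Figures from the text: 50,000 documents, 700 prompt tokens, 25 generated tokens.
cost = estimate_cost(50_000, 700, 25, input_price_per_1k=0.0015, output_price_per_1k=0.002)
print(f"${cost:.2f}")  # $55.00

# Three queries per document, under the one-call-per-query assumption.
print(f"${estimate_cost(50_000, 700, 25, 0.0015, 0.002, queries_per_document=3):.2f}")
```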

In [15]:
# Function to generate synthetic data (uses the `client` created earlier in the notebook)
def generate_synthetic_data(client, genre, num_samples):
    prompts = {
        'Science Fiction': 'Write a science fiction story about a robot uprising.',
        'Romance': 'Write a romance story set in a small coastal town.'
    }

    synthetic_data = []
    for _ in range(num_samples):
        # The chat completions endpoint takes `model` and `messages`
        # (not the legacy `engine` and `prompt` arguments)
        response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=[{"role": "user", "content": prompts[genre]}],
            max_tokens=1024
        )
        # Responses are objects, so use attribute access rather than dict indexing
        synthetic_data.append((response.choices[0].message.content, genre))

    return synthetic_data

# Generate data for each genre
num_samples_per_genre = 50  # You can choose how many samples you want to generate
sci_fi_data = generate_synthetic_data(client, 'Science Fiction', num_samples_per_genre)
romance_data = generate_synthetic_data(client, 'Romance', num_samples_per_genre)

# Combine the data
all_data = sci_fi_data + romance_data