# Data Generation

We use this repository to generate training data, as described in the document, using the OpenAI API and prompt engineering.

In [13]:
# Importing libraries and keys
import openai
import yaml
from yaml.loader import SafeLoader
with open('env.yml') as f:
    data = yaml.load(f, Loader=SafeLoader)
openai.organization = data["OPEN_API_ORG"]
openai.api_key = data["OPENAI_API_KEY"]

In [14]:
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]
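
Generating many rows means many API calls, and any of them can fail transiently (for example on a rate limit). The following is a minimal sketch of a retry helper with exponential backoff. The name `with_retries` and its parameters are hypothetical and not part of the `openai` library; it catches every exception for brevity, which you would probably narrow to the API's own error types in practice.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying with exponential backoff if it raises.

    Hypothetical helper: `sleep` is injectable so the function can be
    exercised without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Under these assumptions, a call such as `with_retries(lambda: get_completion(prompt))` would wrap the function defined above without changing it.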

In [16]:
example_row = """
"01:00:00","01:00:20","Alice","Hello everyone!"
"01:00:00","01:01:20","Bob","Today we are going to discuss about overall product metrics"
"12:01:34","12:01:50","Tay","Awesome, thanks for informing about that!"
"""
prompt = f"""
Generate an audio transcript of a multi turn conversation. \
The generated text should have similar structure to the example delimited with triple backticks

Example Transcription: ```{example_row}```
The conversation would have only 5 turns.
"""
response = get_completion(prompt)
print(response)

```
"00:00:00","00:00:05","John","Hey guys, how's everyone doing today?"
"00:00:05","00:00:10","Sarah","I'm doing pretty well, thanks for asking."
"00:00:10","00:00:15","Mark","Same here, just trying to stay productive."
"00:00:15","00:00:20","Emily","I'm feeling a bit overwhelmed, to be honest."
"00:00:20","00:00:25","John","Oh no, what's going on Emily?"


This produces a good multi-turn conversation in the desired format, but the transcript contains no useful information or action items that can be extracted.<br>
Next, we modify the prompt to describe the setting of the conversation and the topics we want discussed.

In [19]:
example_row = """
"01:00:00","01:00:20","Alice","Hello everyone!"
"01:00:00","01:01:20","Bob","Today we are going to discuss about overall product metrics"
"12:01:34","12:01:50","Tay","Awesome, thanks for informing about that!"
"""
conv_env = "weekly standup meeting of an app development team"
conv_topics = "bugs and next steps."
prompt = f"""
Generate an audio transcript of a multi-turn conversation from a {conv_env} \
discussing {conv_topics} \
The generated text should have similar structure to the example delimited with triple backticks

Example Transcription: ```{example_row}```
The conversation would have only 5 turns.
"""
response = get_completion(prompt)
print(response)

```
"09:00:00","09:00:10","John","Good morning everyone, let's start the weekly standup meeting."
"09:00:10","09:00:20","Alice","Hi John, I have an update on the new feature we are working on. We have completed the design and are now moving to the development phase."
"09:00:20","09:00:30","John","Great to hear that Alice. Bob, do you have any updates on the product metrics?"
"09:00:30","09:00:40","Bob","Yes John, we have seen a 10% increase in user engagement since the last update. However, we still need to work on reducing the app's loading time."
"09:00:40","09:00:50","John","Thanks for the update Bob. Tay, have you found any new bugs in the app?"
"09:00:50","09:01:00","Tay","Yes John, we have identified a few bugs in the payment gateway. We are working on fixing them as soon as possible."
"09:01:00","09:01:10","John","Thanks for letting us know Tay. So, what are the next steps for the team?"
"09:01:10","09:01:20","Alice","We will continue working on the new feature and aim to comple

The output above shows that we can generate an audio transcript quite effectively for a given setting, and control what kind of content it includes.<br><br>
However, there is still a lot of noise. For now we only want to train a model to find action items and link them to the person mentioned in the same text, so we should modify the prompt accordingly. We can also have the model generate the labels.
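
To make the training target concrete, here is a minimal sketch of how one label could be represented once it has been decoded into a dictionary. It assumes the label convention used in the next cell: a `text` field, an `assignee` field, and the placeholders `UNKNOWN` or `N/A` when nobody is named. `ActionItem` and `to_action_item` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder values that are treated as "no assignee". "UNKNOWN" and "N/A"
# come from the example rows in this notebook; the empty string is an
# extra assumption.
NO_ASSIGNEE = {"", "UNKNOWN", "N/A"}


@dataclass
class ActionItem:
    text: str                 # the task to be done
    assignee: Optional[str]   # the person it is linked to, if any


def to_action_item(label: dict) -> Optional[ActionItem]:
    """Convert one label dict into an ActionItem, or None if it has no task."""
    text = str(label.get("text", "")).strip()
    if not text:
        # Rows such as small talk carry an empty "text" and no action item.
        return None
    assignee = str(label.get("assignee", "")).strip()
    return ActionItem(text, None if assignee in NO_ASSIGNEE else assignee)
```

Mapping the placeholders to `None` keeps "no one was assigned" distinct from a real person's name, which may matter when the labels are later used for training.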

In [23]:
transcript = """
start_time,end_time,speaker,text,labels
"10:00:00","10:01:50","Bob","... Alice, can you take the UX bug? ...","{"text": "UX bug", "assignee": "Alice"}"
"12:25:00","12:25:30","Alice","... We need to plan for offsite next month ...","{"text": "plan for offsite next month", "assignee": "UNKNOWN"}"
"12:01:34","12:01:50","Tay","Awesome, thanks for informing about that!","{"text": "", "assignee": "N/A"}"
"""
prompt = f"""
Create a csv containing 10 rows of audio transcripts from tech company standups \
discussing {conv_topics} \
The csv should have a similar structure to the example csv \
delimited with triple backticks.

Make sure each entry has a valid name.

Example Transcription: ```{transcript}```
"""
response = get_completion(prompt)
print(response)


start_time,end_time,speaker,text,labels
"09:00:00","09:01:30","John","Good morning everyone, let's start with the updates. We have successfully launched the new feature last week and we are seeing a positive response from the users.","{"text": "new feature launch", "assignee": "UNKNOWN"}"
"09:02:00","09:03:30","Sarah","That's great news John. We have also noticed a decrease in the app's loading time after the recent optimization.","{"text": "app optimization", "assignee": "UNKNOWN"}"
"09:04:00","09:05:30","Tom","I have been working on fixing the bugs reported by the users. We have resolved most of them and the remaining ones will be fixed by the end of this week.","{"text": "bug fixes", "assignee": "Tom"}"
"09:06:00","09:07:30","Alice","I have been analyzing the product metrics and we have seen a significant increase in the user engagement. However, we need to work on improving the retention rate.","{"text": "improve retention rate", "assignee": "Alice"}"
"09:08:00","09:09:30","Bob","T

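The rows printed above put a JSON object with unescaped double quotes inside a quoted field, so a standard CSV reader is likely to split the `labels` column at the wrong places. The sketch below parses one generated row with a regular expression instead. It assumes every row has exactly the five columns shown in the header, with no spaces between fields; `ROW_PATTERN` and `parse_row` are hypothetical names. Rows that do not match, such as a truncated last line, are returned as `None` so they can be dropped.

```python
import json
import re
from typing import Optional

# Four quoted fields followed by a quoted JSON object, as in the output above.
ROW_PATTERN = re.compile(
    r'^"(?P<start_time>[^"]*)","(?P<end_time>[^"]*)","(?P<speaker>[^"]*)",'
    r'"(?P<text>.*)","(?P<labels>\{.*\})"$'
)


def parse_row(line: str) -> Optional[dict]:
    """Parse one generated row; return None if it is malformed or truncated."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    try:
        # The labels column is expected to hold a JSON object.
        row["labels"] = json.loads(row["labels"])
    except json.JSONDecodeError:
        return None
    return row
```

A caller could apply `parse_row` to each line of `response` after the header and keep only the rows that are not `None`.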
In [22]:
print(prompt)


Create a csv containing 10 rows of audio transcript which are of tech company standups where the call has an agenda and people are getting tasks assigned, and also From meeting transcripts, identify action items (tasks to be done) that were identified during the course of the meeting, and who they were assigned to.The csv should have a similar struture to the example csv delimited with triple backticks.

Make sure each entry has a valid name.

Example Transcription: ```
start_time,end_time,speaker,text,labels
“10:00:00”, “10:01:50”, “Bob“, “... Alice, can you take the UX bug? ...”,{"text": "UX bug", "assignee": "Alice"}""
“12:25:00”, “12:25:30”, “Alice”, ”... We need to plan for offsite next month ...”,"{"text": "plan for offsite next month", "assignee": "UNKNOWN"}"
```

