Table of contents:

1. [Set API Key](#Key)
2. [Imports](#Import)
3. [Load Data](#Load)
4. [Function Definitions](#Functions)
5. [Dataset Expansion with GPT-3](#GPT3)
6. [Dataset Expansion with Pegasus](#Pegasus)
7. [Preparing the final dataframe](#Dataframe)

<a name = "Key"></a>
## 1. Set API Key

In [2]:
api_key = None

<a name = "Import"></a>
## 2. Imports

In [None]:
import pandas as pd
import pickle
import torch
import openai
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from time import sleep

<a name = "Load"></a>
## 3. Load Data
The starting dataset 'intents.csv' was created manually and covers a range of basic questions. It consists of 70 examples, which the steps below expand into over 4,000.

In [65]:
with open('intents.csv', 'r') as f:
    data = pd.read_csv(f)

with open('intents-key.csv', 'r') as f:
    data_key = pd.read_csv(f)

data.head()

| Unnamed: 0 | id | area | intent | prompt | completion |
|---|---|---|---|---|---|
| 0 | 14 | ExperienceProjects | Experience | Has Charlie worked with machine learning before? | Charlie has completed various interesting mach... |
| 1 | 15 | ExperienceProjects | Experience | What projects has Charlie completed? | Charlie has a diverse portfolio of machine lea... |
| 2 | 16 | ExperienceProjects | ProgrammingLanguages | What programming language is Charlie proficien... | Charlie is proficient in Python. |
| 3 | 17 | ExperienceProjects | ProgrammingLanguages | Which programming languages can Charlie use? | Charlie can use Python, C# and C++ |
| 4 | 18 | ExperienceProjects | ProgrammingLanguages | Which programming languages does Charlie know? | Charlie knows Python, C# and C++ |


<a name = "Functions"></a>
## 4. Function Definitions

In [424]:
def get_data_dict(data, intent):
    # Keep only the rows for this intent, drop incomplete rows and the
    # bookkeeping columns, then return one dict per remaining row.
    rows = data[data['intent'] == intent].dropna()
    rows = rows.drop(columns=['id', 'intent', 'area'])
    return rows.to_dict('records')

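def extract_response(response):
    # Split the completion into lines and skip the first one.
    lines = response["choices"][0]['text'].split('\n')[1:]
    # Strip a 3-character list marker such as "1. " or "a) " when the first line has one.
    if lines and lines[0][:1] in ('1', 'a'):
        lines = [line[3:] for line in lines]
    return lines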
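
The parsing step can be tried without calling the API. The following is a minimal sketch, not the notebook's own function: `parse_rephrasings` and `LIST_PREFIX` are hypothetical names, the mock response only imitates the `response["choices"][0]["text"]` shape used above, and the assumption is that the model returns one rephrasing per line, optionally prefixed with a list marker.

```python
import re

# Matches a leading list marker such as "1. ", "2) " or "a. " (assumed format).
LIST_PREFIX = re.compile(r"^\s*(?:\d+|[a-zA-Z])[.)]\s+")

def parse_rephrasings(response):
    """Return the non-empty completion lines with any list marker removed."""
    lines = response["choices"][0]["text"].split("\n")
    return [LIST_PREFIX.sub("", line).strip() for line in lines if line.strip()]

# Hand-written stand-in for an API response; no request is made.
mock_response = {
    "choices": [
        {"text": "\n1. What is Charlie's job?\n2. What does Charlie do for work?"}
    ]
}

rephrasings = parse_rephrasings(mock_response)
print(rephrasings)
```

Using a regular expression instead of a fixed 3-character slice also copes with two-digit markers such as "10. ", at the cost of being slightly less explicit.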

<a name = "GPT3"></a>
## 5. Dataset Expansion with GPT-3

### Stage 1: Question expansion

In [1]:
openai.api_key = api_key

intents = set(data['intent'])
intent_list = list(intents)

questions=[]
for intent in intent_list:
    data_dict = get_data_dict(data, intent)
    for datum in data_dict:
        questions.append( (intent, datum['prompt']) )
        prompt=f"Rephrase 5 times: '{datum['prompt']}'\n"

        response = openai.Completion.create(
            model='text-davinci-002',
            prompt=prompt,
            temperature=0.9,
            max_tokens=256,
            top_p=1,
            best_of=1,
            frequency_penalty=1,
            presence_penalty=0.2,
            stop=["\n\n"]
        )
        
        # Add the rephrasings (the original question was appended above), then deduplicate.
        questions_list = extract_response(response)
        questions.extend((intent, question) for question in questions_list)
        questions = list(set(questions))
        
df = pd.DataFrame(questions, columns=['intent', 'prompt'])
df.to_csv("questions_post_gpt3.csv")


### Stage 2: Question answering
Before this stage, `questions_post_gpt3.csv` is cleaned manually and saved as `questions_post_gpt3_cleaned.csv`.
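
Each question is paired with a context paragraph through two lookups, intent to area and then area to context, before it is sent to the model. Here is a minimal plain-dictionary sketch of that chain. The function name `build_answer_prompt` is hypothetical and the context string is a placeholder, since the real contexts live in `intents-key.csv`, which is not shown here.

```python
# Illustrative lookup tables. The area and intent names come from data.head();
# the context text is a placeholder, not the real content of intents-key.csv.
intent2area = {
    "Experience": "ExperienceProjects",
    "ProgrammingLanguages": "ExperienceProjects",
}
area2context = {
    "ExperienceProjects": "Charlie is proficient in Python.",
}

def build_answer_prompt(intent, question):
    """Chain intent -> area -> context and format the question-answering prompt."""
    context = area2context[intent2area[intent]]
    return f"Context: {context}\nQuestion: {question}\n"

example_prompt = build_answer_prompt(
    "ProgrammingLanguages", "Which programming languages can Charlie use?"
)
print(example_prompt)
```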

In [604]:
input_df = pd.read_csv("questions_post_gpt3_cleaned.csv")

area2context = data_key.set_index('area').T.to_dict('list')
intent2area = data[['area', 'intent']].drop_duplicates().set_index('intent').T.to_dict('list')

questions_areas = [intent2area[intent][0] for intent in input_df['intent']]
questions_contexts = [area2context[area][0] for area in questions_areas]

input_df['area'] = questions_areas
input_df['context'] = questions_contexts

intents_list = list(intents)

answers=[]
for intent in intents_list:
    inputs = get_data_dict(input_df, intent)
    for datum in inputs:
        answers_prompt = f"Context: {datum['context']}\nQuestion: {datum['prompt']}\n"
        answers_response = openai.Completion.create(
            model='text-davinci-002',
            prompt=answers_prompt,
            temperature=0.9,
            max_tokens=100,
            top_p=1,
            best_of=1,
            frequency_penalty=1.5,
            presence_penalty=1.5,
            stop=["\n\n"]
        )
        # Drop the first character of the completion (assumed to be a leading newline or space).
        answer = answers_response["choices"][0]["text"][1:]
        print((intent, datum['prompt'], answer))
        answers.append((intent, datum['prompt'], answer))
        sleep(1.1)
        
answers_df = pd.DataFrame(answers, columns=['intent', 'prompt', 'completion'])
answers_df.to_csv("intermediate_dataset.csv")

('MusicGenre', "What kind of music is Charlie's favourite?", "Charlie's favourite music genre is rock, but his favourite artist is Bliss, who makes psytrance music.")
('MusicGenre', 'What style of music does Charlie favor?', "Charlie's favorite style of music is rock, but he also enjoys psytrance music.")
('MusicGenre', 'Which kind of music does Charlie like best?', 'Charlie likes rock music best.')
('MusicGenre', 'What type of music does Charlie prefer?', 'Charlie prefers rock music, but he also enjoys psytrance music by Bliss.')
('MusicGenre', 'What type of music does Charlie like the most?', 'Charlie likes rock music the most.')
('MusicGenre', "What is Charlie's favourite music genre?", "Charlie's favorite music genre is rock.")
('MusicGenre', "What is Charlie's favourite type of music?", "Charlie's favourite type of music is rock.")
('MusicGenre', "Which genre of music is Charlie's favorite?", "Charlie's favorite genre of music is rock.")
('MusicGenre', 'Which music genre does Char

('Author', 'Which author does Charlie prefer?', 'Sebastian De Castell')
('Author', "If you had to guess, who would be Charlie's top pick for an author?", 'Probably Sebastian De Castell, since he enjoys reading his books so much.')
('Author', 'Who does Charlie like best when it comes to authors?', "Charlie's favorite author is Sebastian De Castell.")
('Author', "Who is Charlie's favourite author?", 'Sebastian De Castell')
('Author', "Which author is Charlie's favourite?", 'Sebastian De Castell')
('Weight', 'How much does Charlie weigh in pounds?', 'Charlie weighs 183.7 pounds.')
('Weight', "What is Charlie's weight in pounds?", "Charlie's weight in pounds is approximately 182.98.")
('Weight', "What is Charlie's weight?", "Charlie's weight is 83 kg.")
('Weight', 'How much does Charlie weigh?', 'Charlie weighs 83 kg.')
('Weight', 'How heavy is Charlie?', '83 kg')
('Weight', 'How many kilos does Charlie weigh?', '83 kg.')
('Weight', 'How much does charlie weigh?', '83 kg')
('GuitarMusic', 

('Responsibilities', 'What does Charlie do for work?', 'Charlie works as a software engineer at Bilfinger. He is responsible for designing software, commissioning gas plants, and establishing requirements and specifications. In addition to his work experience, Charlie is also very knowledgeable in the Python programming language and its various frameworks.')
('Responsibilities', "What are Charlie's tasks and responsibilities?", "Charlie's tasks and responsibilities include establishing requirements and specifications, calculating bid proposals, developing software and commissioning gas plants.")
('Responsibilities', "What is Charlie's job?", "Charlie's job is to design software and commission gas plants into operation.")
('Responsibilities', "What was Charlie's old job?", "Charlie's old job was working as a software engineer for Bilfinger.")
('Responsibilities', "Charlie's old job - what did he do there?", "Charlie's old job was working as a software engineer at Bilfinger, where he gai

('Studies', 'To what did Charlie devote his studies? ', 'Charlie has devoted his studies to mechanical engineering and control systems.')
('Studies', 'Where did Charlie go to University?', 'Charlie attended the University of Exeter for his undergraduate degree and then went on to study at Imperial College London for his masters.')
('Studies', "What was Charlie's major in college?", "Charlie's major in college was Mechanical Engineering.")
('Studies', "Of what was Charlie's studying composed?", "Charlie's studying was composed of Mechanical Engineering at the University of Exeter and then he did his Masters in Control Systems at Imperial College London.")
('Studies', 'What did Charlie focus on in his studies? ', 'Charlie focused on mechanical engineering and control systems in his studies.')
('Studies', 'Where did Charlie go to school?', 'Charlie went to Pent Valley Technology College when he was a boy. He later studied Mechanical Engineering at the University of Exeter and did his Mast

('Parkour', 'How much longer does Charlie have to train parkour?', ' Charlie does not have to train parkour anymore. He switched to gymnastics instead.')
('Parkour', 'When will Charlie finally be done training parkour?', 'Charlie quit training parkour after he switched to gymnastics.')
('Parkour', 'How long has Charlie trained parkour?', 'Charlie trained parkour for a brief period of time before switching to gymnastics.')
('MLFrameworks', 'What machine learning frameworks does Charlie understand?', 'Charlie understands the machine learning frameworks Pytorch, Tensorflow with Keras, and scikit-learn.')
('MLFrameworks', 'How many machine learning frameworks does Charlie know?', 'Charlie knows three machine learning frameworks: Pytorch, Tensorflow with Keras, and scikit-learn.')
('MLFrameworks', 'Can you list the machine learning frameworks that Charlie knows?', 'Charlie knows the machine learning frameworks: Pytorch, Tensorflow with Keras, scikit-learn.')
('MLFrameworks', 'What machine l

('Experience', "Tell me about Charlie's experience with this.", 'Charlie has 3 years of experience with Python, including its various frameworks. He has also undergone training and finished independent projects using a variety of Frameworks for machine learning, data science, and web development. Charlie is very knowledgeable with the Python programming language, having used it for 3 years, along with its various frameworks. Additionally Charlie has some experience with programming languages C# and C++ from his studies.')
('Experience', 'Has Charlie worked with machine learning before?', 'Has Charlie worked with machine learning before?')
('Experience', 'Is Charlie knowledgeable about machine learning?', 'Yes, Charlie is knowledgeable about machine learning.')
('Experience', "What is Charlie's expertise in machine learning?", "Charlie's expertise in machine learning includes experience with the Pytorch, Tensorflow and Keras frameworks, as well as extensive knowledge of essential data s

('Experience', 'What sorts of things has Charlie experienced?', 'Charlie has experienced working with customers and colleagues to design software, commissioning gas plants, establishing requirements and specifications, calculating bid proposals, developing software, and training. He is also knowledgeable with the Python programming language and its various frameworks. Additionally, he has undergone projects using a variety of Frameworks for machine learning, data science, and web development.')
('Experience', 'What has Charlie gone through?', 'Charlie has gained experience working with customers and colleagues to design software and commission gas plants into operation. He has also undergone training and finished independent projects using a variety of Frameworks for machine learning, data science, and web development.')
('Experience', 'How much professional experience does Charlie have?', 'Charlie has 3 years of professional experience as a software engineer.')
('Experience', 'What pr

('Hobbies', "What are Charlie's hobbies?", "Charlie's hobbies are playing the guitar, training calisthenics, and programming.")
('Hobbies', 'What are some things Charlie likes to do?', 'Some things Charlie likes to do are train calisthenics, play his guitar with friends, and program using AI.')
('Hobbies', "What are Charlie's interests?", "Charlie's interests include calisthenics, gymnastics, music, guitar playing, programming and artificial intelligence. He also enjoys travelling and experiencing new cultures.")
('Hobbies', 'What does Charlie like to do for fun?', 'Charlie likes to do a lot of different things for fun. He loves music and playing his guitar, and he also enjoys programming and using AI. Charlie loves to be outside, so he would love to travel the world and experience new cultures.')
('Hobbies', "What are some of Charlie's favorite activities?", "Some of Charlie's favorite activities include training calisthenics, playing his guitar with friends, and programming. He also 

('RemoteWork', 'Does Charlie have any reservations about working remotely?', 'No, Charlie does not have any reservations about working remotely. He loves to be outside and believes that this would be the ideal situation for him since he has friends all over the world and enjoys experiencing new cultures.')
('RemoteWork', 'How does Charlie feel about working remotely? ', 'Charlie feels great about working remotely! He loves to be outside, and this way he can work from anywhere in the world.')
('RemoteWork', 'Would Charlie be open to the idea of working remotely?', 'Yes, Charlie would be open to the idea of working remotely.')
('Grades', 'What grades did Charlie get in University?', 'Charlie got a 2:1 in his Mechanical Engineering degree from the University of Exeter and a distinction in his Masters in Control Systems from Imperial College London.')
('Grades', 'What types of grades did Charlie get while attending University?', "Looking at Charlie's CV, it appears that he received good gr

('ProgrammingLanguages', 'What programming languages can Charlie use?', 'Charlie can use the programming languages Python, C#, and C++.')
('ProgrammingLanguages', 'What are the programming languages that Charlie is good at?', 'Judging by the information provided, it seems that Charlie is good at Python, C#, and C++.')
('ProgrammingLanguages', 'What programming languages can Charlie use well?', 'Charlie knows how to use the programming languages Python, C# and C++.')
('ProgrammingLanguages', 'What programming languages can Charlie use?', 'Charlie can use Python, C#, and C++.')
('ProgrammingLanguages', 'What extent is Charlie proficient in different programming languages?', 'Charlie is proficient in the programming languages Python, C# and C++.')
('ProgrammingLanguages', 'What is the programming language that Charlie knows well?', 'The programming language that Charlie knows well is Python.')
('ProgrammingLanguages', 'Which programming languages is Charlie familiar with?', 'Charlie is fa

RateLimitError: Rate limit reached for default-text-davinci-002 in organization org-ETAduTqLVAj5ZDS1d9qjZpPr on requests per min. Limit: 60.000000 / min. Current: 66.000000 / min. Contact support@openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://beta.openai.com/account/billing to add a payment method.
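
The run stopped with a `RateLimitError` even though each request was followed by `sleep(1.1)`. One common way to make such a loop more tolerant is to retry with exponential backoff. This is a generic sketch under stated assumptions: `call_with_backoff` and its parameters are hypothetical names, and it catches a plain `Exception` to stay independent of any client library. In practice you would catch only the client's rate-limit exception.

```python
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            # Give up after the last allowed attempt and surface the error.
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo with a stand-in for the API call that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

waits = []  # record the requested waits instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

In the answering loop, the request could be wrapped as `call_with_backoff(lambda: ...)` around the existing completion call, keeping the fixed `sleep(1.1)` as the baseline pacing.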

<a name = "Pegasus"></a>
## 6. Dataset Expansion with Pegasus

In [11]:
df = pd.read_csv("intermediate_dataset.csv")
questions = df['prompt']
answers = df['completion']

model_name = 'tuner007/pegasus_paraphrase'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

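def get_response(src_text, num_return_sequences, num_beams):
    # Tokenize the input (truncated to 60 tokens) and move the tensors to the model's device.
    batch = tokenizer(src_text, truncation=True, padding="longest", max_length=60, return_tensors="pt").to(device)
    # Beam search returns `num_return_sequences` candidates per input. Note: `temperature`
    # only influences generation when sampling is enabled (do_sample=True).
    translated = model.generate(**batch, max_length=60, num_beams=num_beams,
                                num_return_sequences=num_return_sequences, temperature=1.5)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text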

num_return_sequences=10
num_beams=10

generated_questions=[]
for text in questions:
    r = get_response(text,num_return_sequences,num_beams)
    generated_questions.append((text, r))
    
generated_answers=[]
for text in answers:
    r = get_response(text,num_return_sequences,num_beams)
    generated_answers.append((text, r))

del r, tokenizer, model
torch.cuda.empty_cache()

with open("generated_questions.pickle", "wb") as f:
    pickle.dump(generated_questions, f)
with open("generated_answers.pickle", "wb") as f:
    pickle.dump(generated_answers, f)
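
The loops above call `get_response` once per text. Because the tokenizer call pads to the longest input, several texts could in principle be passed in one call, in which case the decoded outputs come back as a single flat list. Regrouping them could look like the sketch below. It assumes that the outputs arrive grouped per input, in input order, with exactly `num_return_sequences` entries each. `group_outputs` is a hypothetical helper, and the strings are stand-ins rather than model output.

```python
def group_outputs(sources, flat_outputs, num_return_sequences):
    """Pair each source text with its slice of the flat output list."""
    expected = len(sources) * num_return_sequences
    if len(flat_outputs) != expected:
        raise ValueError(f"expected {expected} outputs, got {len(flat_outputs)}")
    return [
        (src, flat_outputs[i * num_return_sequences:(i + 1) * num_return_sequences])
        for i, src in enumerate(sources)
    ]

# Stand-in data: two inputs with two paraphrases each.
sources = ["q1", "q2"]
flat = ["q1-a", "q1-b", "q2-a", "q2-b"]
grouped = group_outputs(sources, flat, num_return_sequences=2)
print(grouped)
```

The result has the same `(text, paraphrases)` shape as the entries in `generated_questions` and `generated_answers`, so the pickling step would not need to change.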

<a name = "Dataframe"></a>
## 7. Preparing the final dataframe

In [67]:
with open("generated_questions.pickle", "rb") as f:
    imported_questions = pickle.load(f)
with open("generated_answers.pickle", "rb") as f:
    imported_answers = pickle.load(f)

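intents_final = []
questions_final = []
answers_final = []
# Each source row contributes the original text plus its paraphrases:
# 1 + num_return_sequences = 11 rows. `df` is the intermediate dataset loaded in section 6.
for i, (question, question_paraphrases) in enumerate(imported_questions):
    answer, answer_paraphrases = imported_answers[i]
    intents_final.extend(11 * [df['intent'][i]])

    questions_final.append(question)
    questions_final.extend(question_paraphrases)

    answers_final.append(answer)
    answers_final.extend(answer_paraphrases)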
    
df_final = pd.DataFrame({'intent':intents_final, 'prompt':questions_final, 'completion':answers_final})
df_final.to_csv('dataset.csv')
df_final

| Unnamed: 0 | intent | prompt | completion |
|---|---|---|---|
| 0 | MusicGenre | What kind of music is Charlie's favourite? | Charlie's favourite music genre is rock, but h... |
| 1 | MusicGenre | What kind of music does Charlie like? | Charlie's favourite artist is Bliss, who makes... |
| 2 | MusicGenre | What kind of music is Charlie fond of? | Rock is Charlie's favourite genre, but his fav... |
| 3 | MusicGenre | What kind of music do you like? | Charlie's favourite genre of music is rock, bu... |
| 4 | MusicGenre | What kind of music is Charlie into? | Rock is Charlie's favourite music genre, but h... |
| ... | ... | ... | ... |
| 4890 | BookGenre | What is Charlie's favourite genre of books? | Charlie likes fantasy books. |
| 4891 | BookGenre | What genre of books do Charlie like? | Fantasy is the genre that Charlie likes the most. |
| 4892 | BookGenre | What are Charlie's favorite books? | Fantasy is one of Charlie's favorite genres. |
| 4893 | BookGenre | What is Charlie's favorite genre? | Fantasy is Charlie's favourite type of books. |
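
The assembly loop pairs the i-th question variant with the i-th answer variant and relies on every source row yielding exactly 11 variants. A sketch of the same flattening that derives the count from the data and checks the alignment is given below. `flatten_pairs` is a hypothetical helper, and the toy rows are made up for illustration.

```python
def flatten_pairs(intents, question_groups, answer_groups):
    """Expand (original, paraphrases) pairs into (intent, prompt, completion) rows."""
    rows = []
    for intent, (question, q_paras), (answer, a_paras) in zip(
        intents, question_groups, answer_groups
    ):
        prompts = [question, *q_paras]
        completions = [answer, *a_paras]
        # Questions and answers are matched by position, so the counts must agree.
        if len(prompts) != len(completions):
            raise ValueError(f"mismatched paraphrase counts for intent {intent!r}")
        rows.extend((intent, p, c) for p, c in zip(prompts, completions))
    return rows

# Toy input in the same shape as the pickled lists: one original plus one paraphrase.
toy_intents = ["BookGenre"]
toy_questions = [("What genre does Charlie read?", ["Which genre does Charlie read?"])]
toy_answers = [("Charlie likes fantasy books.", ["Fantasy is Charlie's favourite genre."])]

rows = flatten_pairs(toy_intents, toy_questions, toy_answers)
print(rows)
```

The resulting list of tuples can be passed to `pd.DataFrame(rows, columns=['intent', 'prompt', 'completion'])` to obtain the same layout as `df_final`.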
