# Creating Aira's Dataset

Return to the [index](https://github.com/Nkluge-correa/Aira-EXPERT).

**The goal of this notebook/tutorial is to create a dataset that allows the construction of:**

1. **`Closed-domain` chatbots.**
2. **`Open-domain` chatbots.**

**`Closed-domain` chatbots are designed to respond to specific questions and interactions within a limited set of predefined topics. They are programmed to recognize `keywords` and `patterns` and respond to them in a predefined way. In other words, they are trained in a specific `knowledge domain` or set of tasks and can provide accurate and consistent answers within these limits. We will treat this as a `text classification` problem: given an input $x$, predict the label $y$ of the most probable response.**
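
**In symbols (the notation is ours, added only for illustration): if $\mathcal{Y} = \{1, \dots, K\}$ indexes the $K$ predefined answers, a classifier that models $P(y \mid x)$ replies with the answer whose label is**

$$
\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y \mid x)
$$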

**On the other hand, `open-domain` chatbots are designed to interact with users on a wide variety of topics and issues, without predefined limitations. They use machine learning algorithms and natural language processing to understand the user's intentions and respond in a more free and adaptive way, without the need to follow a rigid set of rules.**

**In summary, closed-domain chatbots are more effective at providing accurate and consistent answers within a limited set of topics, while open-domain chatbots are more flexible and adaptive, able to interact with users on a wide variety of topics and questions, at the cost of control and predictability.**

**Let us start. 🤗**

## Defining a seed dataset

**To create our chatbots, we will need a `dataset`. `Aira` is a chatbot specifically designed to master a specific domain (AI Ethics and AI Safety). So we created a `seed dataset`, which contains 142 questions + 142 answers.**

> **We can define a `seed dataset` as a starting point to create a bigger dataset via `data augmentation`. Data augmentation is a technique used in machine learning and deep learning to increase the size of a training dataset by artificially creating additional examples. This is typically done by applying various transformations to existing data samples. In our case, this will involve "_rewriting and paraphrasing_" our seed dataset.**

**Let us first load and look at an example of our `seed dataset`.**

In [3]:
language = "pt" # "en" or "pt"

# The `with` block closes each file automatically, so no explicit close() is needed.
with open(f'data/original_data/originalQA_{language}.txt', encoding='utf-8') as fp:
    questions = [line.strip() for line in fp]

with open(f'data/original_data/answers_{language}.txt', encoding='utf-8') as fp:
    answers = [line.strip() for line in fp]

print(f"Found {len(questions)} questions in the original file (language: {language}).")
print(f"Found {len(answers)} answers in the original file (language: {language}).")

Found 142 questions in the original file (language: pt).
Found 142 answers in the original file (language: pt).
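
**The two files are aligned line by line: question `i` is answered by answer `i`, and labels are 1-based in the rest of this notebook. A minimal sketch of that pairing, using two invented placeholder entries instead of the real 142 (the variable names are ours):**

```python
# Toy stand-ins for the two seed files (invented placeholders, not the real data).
seed_questions = ["What is your name?", "What is AI Ethics?"]
seed_answers = ["My name is Aira.", "AI Ethics studies the moral issues raised by AI."]

# Line i of the question file is answered by line i of the answer file,
# so both lists must have the same length.
assert len(seed_questions) == len(seed_answers)

# Labels start at 1, matching the numbering used for the generated samples.
seed_dataset = {
    label: {"question": q, "answer": a}
    for label, (q, a) in enumerate(zip(seed_questions, seed_answers), start=1)
}

print(seed_dataset[1]["answer"])  # → My name is Aira.
```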


## Augmenting the Seed Dataset

**We have created two seed datasets, one in Portuguese and one in English. Each seed dataset is composed of 142 questions and answers. The questions are in the file `originalQA.txt` and the answers are in the file `answers.txt`. To create our augmented dataset we will use an LLM (large language model) to help us create our artificial samples. There are other tools to achieve this, like the [TextAttack augmenter API](https://textattack.readthedocs.io/en/latest/3recipes/augmenter_recipes.html), or commercial solutions like [QuillBot](https://quillbot.com/), but we will be using OpenAI's GPT models.**

**OpenAI offers many models for working with text, with [differential pricing](https://openai.com/pricing). Models like `text-davinci-003`, `text-davinci-002`, `text-curie-001`, `text-babbage-001`, `text-ada-001` can follow instructions like "paraphrase this sentence." You can test all of them if you wish. We found that GPT-3.5 (`gpt-3.5-turbo`) creates the best samples: they stay aligned with the initial question while being sufficiently different and diverse.**

**The rephrasing task is solicited via a prompt. Prompts were created for both the English and Portuguese seed datasets and work pretty well:**

> **Portuguese: _Você é um sistema que parafraseia e reescreve textos. Você seguirá estas instruções: Crie uma lista com 25 variações de cada pergunta que você receber. Devolva todas as perguntas geradas como uma lista em Markdown (use "- " no começo de cada item)._**

> **English: _You are a system that paraphrases and rewrites text. You will follow these paraphrasing instructions: Create a list with 25 variations of every question you receive. Give all of the generated questions back as a markdown list (use "- " to make the list)._**

**With this prompt, we loop over the seed questions and send one request per question to the OpenAI API. Inside the loop, the code uses the `openai.ChatCompletion.create()` method to generate 25 variations of the current question. The method takes a `model` parameter to specify which model to use, and a `messages` parameter, which is a list of dictionaries containing the message history between the user and the system.**

**After we receive a chat completion, the generated variations are split on newline characters (`\n`), and for each variation a new string is created that contains the variation followed by the index of the current question. This new string is then appended to a file called `generatedQA_pt.txt`.**
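
**The model returns its variations as a Markdown list, so each line may start with `- ` and blank lines can appear between items. The sketch below shows one way to turn such a response into labeled lines. `parse_variations` is a hypothetical helper of ours, not part of the notebook, and it goes slightly beyond the cell that follows by also dropping bullets and empty lines:**

```python
def parse_variations(content: str, label: int) -> list[str]:
    """Turn a Markdown list returned by the model into '<variation> <label>' lines."""
    labeled = []
    for line in content.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip blank lines between list items
        if line.startswith("- "):
            line = line[2:].strip()  # drop the Markdown bullet
        labeled.append(f"{line} {label}")
    return labeled


example = "- Como você se chama?\n\n- Qual é o seu nome?"
print(parse_variations(example, 1))
# → ['Como você se chama? 1', 'Qual é o seu nome? 1']
```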

In [7]:
import openai  # this cell uses the legacy `openai.ChatCompletion` interface (openai < 1.0)
import tqdm

openai.api_key="your_api_key_here"

model = 'gpt-3.5-turbo' # 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-003'

system_prompt = """Você é um sistema que parafraseia e reescreve textos. Você seguirá estas instruções: Crie uma lista com 25 variações de cada pergunta que você receber. Devolva todas as perguntas geradas como uma lista em Markdown (use "- " no começo de cada item)."""

# Open the output file once, in append mode, with an explicit encoding
# (the Portuguese text contains accented characters).
with open(f'data/generated_data/generatedQA_{language}.txt', 'a', encoding='utf-8') as fp:
    for i, question in enumerate(tqdm.tqdm(questions)):

        response = openai.ChatCompletion.create(
            model=model,  # use the variable defined above instead of a hardcoded name
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ]
        )

        # One generated variation per line; the label (i + 1) becomes the last word.
        for string in response['choices'][0]['message']['content'].split('\n'):
            fp.write(f'{string} {i+1}\n')

        

100%|██████████| 142/142 [42:24<00:00, 17.92s/it]


**And like this, we just made our dataset larger. You could augment it even further by calling the model again to create more samples. Every generated sample carries its label (the number of the seed question it paraphrases) as the final word of the line.**

In [5]:
with open(f'data/generated_data/generatedQA_{language}.txt', encoding='utf-8') as fp:
    generated_text = [line.strip() for line in fp]

generated_text[:20]

['Qual é a sua designação? 1',
 'Como você se autodenomina? 1',
 'Qual é a alcunha que você utiliza? 1',
 'Como devo me referir a você? 1',
 'Qual é o seu título? 1',
 'Qual é o nome que você utiliza? 1',
 'Como devo chamá-lo? 1',
 'Qual é o seu apelido? 1',
 'Qual é a sua denominação oficial? 1',
 'De que maneira você se chama? 1',
 'Como você é chamado? 1',
 'Qual é o seu nome pessoal? 1',
 'Qual é a sua identificação? 1',
 'Como você é identificado? 1',
 'Poderia me dizer o seu nome? 1',
 'Qual é a sua designação pessoal? 1',
 'Como devo me dirigir a você? 1',
 'Qual nome você utiliza? 1',
 'Qual é a sua nomenclatura? 1',
 'Como posso te chamar? 1']
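
**Since the label is always the last space-separated token, a line can be split back into its text and label with a single `rsplit`. A minimal sketch (`split_label` is a name we made up for illustration):**

```python
def split_label(line: str) -> tuple[str, int]:
    """Split 'question text 12' into ('question text', 12)."""
    text, label = line.strip().rsplit(" ", 1)  # split only on the last space
    return text, int(label)


print(split_label("Como posso te chamar? 1"))
# → ('Como posso te chamar?', 1)
```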

## Cleaning the Generated Dataset

**The code below imports a function called `standerdize_text` (the auxiliary functions are found in the `utilities.py` file), which takes a string of text and performs some cleaning operations: it removes all punctuation marks, converts the text to lowercase, and removes any accents or diacritics.**

**This can be very helpful, especially for small language models. Standardizing text data before training a language model helps reduce noise and inconsistencies in the data.**

**The standardized text is saved in a `lower_generatedQA.txt` file.**
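
**The real `standerdize_text` lives in `utilities.py` and is not shown here. The sketch below is only our guess at how the three steps described above (lowercasing, removing punctuation, stripping accents) could be written with the standard library; `standardize` is a hypothetical stand-in, and the actual helper may differ in its details:**

```python
import string
import unicodedata


def standardize(text: str) -> str:
    """Lowercase, remove accents, and strip ASCII punctuation (illustrative only)."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation and collapse repeated whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


print(standardize("Qual é a sua designação?"))  # → qual e a sua designacao
```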

In [None]:
from utilities import standerdize_text

# Each line has the form "<question> <label>"; read the file once and split the label off the end.
with open(f'data/generated_data/generatedQA_{language}.txt', encoding='utf-8') as fp:
    lines = [line.strip() for line in fp]

questions = [' '.join(line.split(' ')[:-1]) for line in lines]
labels = [line.split(' ')[-1] for line in lines]

standerdized_questions = [standerdize_text(x) for x in questions]

# Note: append mode ('a') adds to an existing file, so delete it before re-running this cell.
with open(f'data/generated_data/lower_generatedQA_{language}.txt', 'a', encoding='utf-8') as fp:
    for question, label in zip(standerdized_questions, labels):
        fp.write(f'{question} {label}\n')

## Knowledge Bases for Expert Systems

**`Expert systems` were one of the earliest approaches used to perform language modeling. Expert systems are computer programs that mimic the decision-making abilities of human experts in a specific domain by using a set of `rules` and `heuristics`. In the context of language modeling, `expert systems` were designed to incorporate linguistic knowledge and rules in order to produce coherent and grammatically correct text.**

**While `expert systems` were an important early approach to language modeling, they had several limitations. For example, they were often limited by the amount and specificity of the rules and heuristics they could incorporate. As a result, more sophisticated approaches to language modeling, such as `statistical` and `neural network-based models`, have replaced expert systems in modern `natural language processing` applications.**

**We will create our `expert system` to serve as a `baseline` against our other ML models. This system will use a simple heuristic that combines search with a "_scoring function_". The scoring function assigns more value to keys that are already known and found directly than to keys that can only be found after breaking the sentence into smaller keys.**

**The code below takes our `generated dataset` and builds `n-grams` (sequences of adjacent words) of length 6 for each question (you are free to look for better window sizes). It then creates a list of keys by joining each `n-gram` with spaces, followed by the corresponding label, and writes them to a `keys` text file. These keys are then merged with an existing list of keys stored in a file called `handcrafted_grams.txt`. These `handcrafted_grams` were, as the name suggests, made by hand. We created them to make our dictionary more fine-grained (with keys shorter than an `n-gram` of window size 6).**

**Next, the code extracts the keys and labels from the combined list of handcrafted and generated keys and creates a dictionary that maps keys to labels. This dictionary is then sorted by label, and saved in a JSON file called `keys.json`.**
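
**To make the idea of `n-gram` keys concrete, here is a small sketch of a sliding window over the words of a sentence. `make_ngrams` is a hypothetical stand-in of ours; the real `make_keys` in `utilities.py` is not shown in this notebook and may handle edge cases (such as sentences shorter than the window) differently:**

```python
def make_ngrams(sentence: str, n: int) -> list[str]:
    """Return all runs of n adjacent words; shorter sentences yield themselves."""
    words = sentence.split()
    if len(words) <= n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


grams = make_ngrams("qual e o nome que voce utiliza", 6)
print(grams)
# → ['qual e o nome que voce', 'e o nome que voce utiliza']

# Every key built from this question maps to the question's label (here "1").
vocabulary = {gram: "1" for gram in grams}
```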

In [112]:
import json
from utilities import make_keys

def read_labeled_lines(path):
    """Read '<text> <label>' lines and return (texts, labels) as two lists of strings."""
    with open(path, encoding='utf-8') as fp:
        lines = [line.strip() for line in fp]
    texts = [' '.join(line.split(' ')[:-1]) for line in lines]
    labels = [line.split(' ')[-1] for line in lines]
    return texts, labels

questions, labels = read_labeled_lines(f'data/generated_data/lower_generatedQA_{language}.txt')

# Write every 6-gram key, followed by its label, to the keys file.
# Note: append mode ('a') adds to an existing file, so delete it before re-running this cell.
with open(f'data/generated_data/keys_{language}.txt', 'a', encoding='utf-8') as fp:
    for question, label in zip(questions, labels):
        for key in make_keys(question, 6):
            fp.write(f'{key} {label}\n')

X, Y = read_labeled_lines(f'data/generated_data/keys_{language}.txt')
x, y = read_labeled_lines(f'data/generated_data/handcrafted_grams_{language}.txt')

X.extend(x)
Y.extend(y)

# When a key appears more than once, the last occurrence wins,
# so handcrafted keys override generated ones.
vocabulary = dict(zip(X, Y))

# Sort the dictionary by label (numerically) before saving it.
sorted_vocabulary = {k: v for k, v in sorted(vocabulary.items(), key=lambda item: int(item[1]))}

with open(f"data/generated_data/keys_{language}.json", "w", encoding='utf-8') as fp:
    json.dump(sorted_vocabulary, fp, indent=2)
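
**The notebook does not show the expert system's lookup itself, so the following is only a rough sketch of the kind of heuristic described above, under our own assumptions: a sentence that is already a known key wins outright, otherwise the sentence is broken into smaller keys and longer matches contribute more to a label's score. The function `lookup`, its weighting scheme, and the toy dictionary are all hypothetical:**

```python
from collections import defaultdict


def lookup(sentence, vocabulary, max_n=6):
    """Return the best-scoring label for a sentence, or None if nothing matches."""
    # A sentence that is itself a known key wins outright.
    if sentence in vocabulary:
        return vocabulary[sentence]
    words = sentence.split()
    scores = defaultdict(float)
    # Otherwise break the sentence into smaller keys; longer matches weigh more.
    for n in range(min(max_n, len(words)), 0, -1):
        for i in range(len(words) - n + 1):
            key = " ".join(words[i:i + n])
            if key in vocabulary:
                scores[vocabulary[key]] += n / max_n
    return max(scores, key=scores.get) if scores else None


# Toy dictionary in the same shape as the saved JSON (key -> label as a string).
toy_vocabulary = {"qual e o seu nome": "1", "seu nome": "1", "o que e etica": "2"}
print(lookup("me diga qual e o seu nome", toy_vocabulary))  # → 1
```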

## Creating a Conditional Completion Dataset 

**While we modeled our `closed-domain` chatbots as `text classification` models, our `open-domain` chatbots will be solving a problem of `conditional text generation`. Conditional text generation refers to the task of generating natural language text that is conditioned on some given input, such as a prompt, a question, or a context. The goal is to generate text that is coherent, grammatically correct, and consistent with the given input.**

**However, `conditional text generation` depends on a good `foundational model` that can generate coherent and grammatically correct text. Given that training large language models (and fine-tuning them) requires vast amounts of computational resources, we will be working with OpenAI's fine-tuning API, which takes care of this training for us. We only need to provide the dataset.**

**Building a dataset for conditional text generation typically involves collecting a large set of text samples that are paired with corresponding input conditions or prompts. For example, if the task is to generate movie reviews based on a given movie title, the dataset would consist of pairs of movie titles and reviews. There already are open-source datasets for this kind of tuning (like the [OpenAssistant dataset](https://huggingface.co/OpenAssistant)), but here we will build our own using (again) an LLM. If you are interested in how LLMs can be used to generate data that they themselves can use to improve, check [this publication](https://arxiv.org/abs/2210.11610).**

**The code reads two text files: `generatedQA.txt` and `answers.txt`. The `generatedQA.txt` file contains a list of questions and corresponding label numbers, and the `answers.txt` file contains a list of answers. For each question, it generates a paraphrase of the corresponding answer using OpenAI's `gpt-3.5-turbo`, and saves the resulting question and paraphrase pairs as a JSON file. Here is the prompt we used to generate our samples:**

> **Portuguese: _Você é um sistema que parafraseia e reescreve textos. Você seguirá as seguintes instruções: Crie uma paráfrase de cada parágrafo que você receber. Retorne apenas a paráfrase gerada e nada mais. Mantenha todas as paráfrases curtas e similares ao texto original. Se a paráfrase tiver mais de um parágrafo, devolva apenas o primeiro parágrafo._**

> **English: _You are a system that paraphrases and rewrites text. You will follow these paraphrasing instructions: Create a paraphrase for every text sample you receive. Return only the generated paraphrase and nothing more. Keep all the paraphrases short and close to the original example. If the paraphrase has more than one paragraph, return only the first paragraph._**

**Since we are going to fine-tune the models available in [OpenAI's API](https://platform.openai.com/docs/guides/fine-tuning), we have to create our dataset in the format it specifies. For both classification and conditional generation tasks, we need to append a special token at the end of each prompt to signal to the model that "the sequence ends here" (e.g., `"\n\n###\n\n"`). The completions need to start with a space, either for classification (`" class"`) or completion (`" Response is..."`). In the case of completions, we also need to specify an "end token" at the end of every completion (e.g., `"[END]"`).**
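
**The formatting rules above can be sketched as two small helpers. The function and constant names are ours and purely illustrative; the separator and end token are the example values mentioned above:**

```python
SEPARATOR = "\n\n###\n\n"  # appended to every prompt: "the sequence ends here"
END_TOKEN = "[END]"        # appended to every generated completion


def make_completion_record(question: str, answer: str) -> dict:
    """Prompt/completion pair for conditional generation."""
    return {"prompt": question + SEPARATOR, "completion": " " + answer + END_TOKEN}


def make_classification_record(question: str, label: int) -> dict:
    """Prompt/completion pair for classification: the completion is the label."""
    return {"prompt": question + SEPARATOR, "completion": f" {label}"}


print(make_completion_record("Qual é o seu nome?", "Meu nome é Aira."))
print(make_classification_record("Qual é o seu nome?", 1))
```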

In [7]:
# Each line of the generated file has the form "<question> <label>".
with open(f'data/generated_data/generatedQA_{language}.txt', encoding='utf-8') as fp:
    lines = [line.strip() for line in fp]

questions = [' '.join(line.split(' ')[:-1]) for line in lines]
labels = [int(line.split(' ')[-1]) for line in lines]

with open(f'data/original_data/answers_{language}.txt', encoding='utf-8') as fp:
    answers = [line.strip() for line in fp]

print(f"Found {len(questions)} questions in the original file (language: {language}).")
print(f"Found {len(labels)} labels in the original file (language: {language}).")
print(f'Found {len(answers)} unique answers in the original file (language: {language}).')

dataset = list()

for i, question in enumerate(tqdm.tqdm(questions)):

    # Use the label of the current question (labels[i], not labels[i+1]).
    # Labels are 1-based, so label k points to answers[k - 1].
    index = labels[i]
    answer = answers[index-1]

    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": """You are a system that paraphrases and rewrites text. You will follow these paraphrasing instructions: Create a paraphrase for every text sample you receive. Return only the generated paraphrase and nothing more. Keep all the paraphrases short and close to the original example. If the paraphrase has more than one paragraph, return only the first paragraph."""},
            {"role": "user", "content": answer},
        ]
    )

    # Prompt ends with the separator; completion starts with a space and ends with the end token.
    dataset.append(
        {"prompt": question + '\n\n###\n\n',
         "completion": " " + response['choices'][0]['message']['content'] + "[END]",
         }
    )

with open("data/fine_tuning_data/fine_tuning_completion.json", "w", encoding='utf-8') as fp:
    json.dump(dataset, fp, indent=2)

Found 4147 questions in the original file (language: pt).
Found 4147 labels in the original file (language: pt).
Found 142 unique answers in the original file (language: pt).
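
**One practical note: the cell above saves the dataset as a single JSON list, while OpenAI's prompt/completion fine-tuning format (at the time this notebook was written, as far as we know) expects JSON Lines, that is, one JSON object per line. If you need that format, a conversion could look like the sketch below. `to_jsonl` is a hypothetical helper and the record is an invented example; please check the current fine-tuning documentation before relying on it:**

```python
import json


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


# Invented example record in the prompt/completion shape used above.
records = [
    {"prompt": "Qual é o seu nome?\n\n###\n\n", "completion": " Meu nome é Aira.[END]"},
]

jsonl = to_jsonl(records)

# Round trip: every non-empty line parses back into the original record.
restored = [json.loads(line) for line in jsonl.split("\n") if line]
print(restored == records)  # → True
```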


**Done! Now we have everything we need to build our chatbots. All data generated can be found in the `data` folder. We created two training datasets: one for `classification` and one for `completion` tasks. 🤗**
