# Making Closed & Open Domain Chatbots Via `Text Classification` and `Conditional Text Generation`

Return to the [index](https://github.com/Nkluge-correa/Aira-EXPERT).

**In this notebook, we train several versions of our chatbot (`Aira`):**

- **A `rule-based system` based on heuristics and search.**
- **A `bidirectional-LSTM`.**
- **An `ensemble model` of `Bi-LSTMs`.**
- **An `encoder-only transformer`.**
- **A fine-tuned BERT.**
- **Fine-tuned versions of `Ada`, `Babbage`, `Curie`, and `Davinci`.**

**To begin, let us create our `rule-based system` to use as a baseline. We will start with the `closed-domain` chatbots and leave the `open-domain` chatbots for the end.**

## Rule-based System (`heuristic dictionary search`)

**Our `rule_based_prediction` function is documented in the `utilities.py` file. Let us test this non-ML version of Aira and see how well it classifies our generated dataset. Because the keys were generated from this same dataset, a large enough dictionary naturally yields a high score.**
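
**As a rough illustration of what a heuristic dictionary search can look like, here is a minimal sketch. It is not the implementation in `utilities.py`: the function name `toy_rule_based_prediction`, the `toy_keys` data, and the assumption that the keys are a list of keyword lists (one per class) are all hypothetical.**

```python
import re


def toy_rule_based_prediction(question, keys):
    """Return the 0-based index of the class whose keywords best match the question.

    `keys` is assumed (hypothetically) to be a list in which entry i holds
    the keyword list for class i.
    """
    tokens = set(re.findall(r"\w+", question.lower()))
    # Score each class by how many of its keywords appear in the question.
    scores = [len(tokens & {k.lower() for k in keywords}) for keywords in keys]
    # Ties are resolved in favor of the first class with the top score.
    return max(range(len(scores)), key=scores.__getitem__)


toy_keys = [
    ["machine", "learning"],
    ["reinforcement", "reward", "agent"],
    ["activation", "function"],
]

print(toy_rule_based_prediction("What is an activation function?", toy_keys))  # 2
```

**The returned index is 0-based, which is why a `+1` is needed when comparing against 1-indexed labels.**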


In [1]:
from utilities import rule_based_prediction
import json

language = 'pt'

# Each line holds a question followed by its integer label (1-indexed).
with open(f'data/generated_data/generatedQA_{language}.txt', encoding='utf-8') as fp:
    lines = [line.strip() for line in fp]

questions = [' '.join(line.split(' ')[:-1]) for line in lines]
labels = [int(line.split(' ')[-1]) for line in lines]

with open(f'data/generated_data/keys_{language}.json', encoding='utf-8') as fp:
    keys = json.load(fp)

# 1 for a correct prediction, 0 otherwise.
preds = list()

for question, label in zip(questions, labels):
    prediction = rule_based_prediction(question, keys)
    preds.append(int(prediction + 1 == label))  # +1 because the keys are 1-indexed

print(f"Accuracy of Rule-Based Model: {(sum(preds)/len(preds)) * 100:.2f}%")

Accuracy of Rule-Based Model: 96.77%


**This is the baseline we want to beat:**

| Models      	| Accuracy (PT) 	| Accuracy (EN) 	|
|-------------	|---------------	|---------------	|
| Rule-based  	| 96.77%        	| 95.36%        	|

**Now, to our ML models.**

## Data & Preprocessing

**The cell below prepares the data for the language models we will train. Specifically, it creates training and test sets for a text classification task, and a vocabulary shared by all models trained from scratch (not the fine-tuned ones). To choose the parameters of our `TextVectorization` layer, we tested several configurations with different tokenization schemes and vocabulary sizes:**


| Parameters                      	| Accuracy (PT) 	|
|---------------------------------	|---------------	|
| vocab_size = 20_000, ngrams = 6 	| 86.02%        	|
| vocab_size = 20_000, ngrams = 3 	| 89.64%        	|
| vocab_size = 12_700, ngrams = 2 	| 90.46%        	|
| vocab_size = 12_300, ngrams = 2 	| **92.29%**    	|
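
**To make the `ngrams` parameter in the table above concrete, the sketch below builds word n-grams up to a maximum size in plain Python. It only illustrates the idea: the helper `word_ngrams` is hypothetical, and the punctuation handling and token ordering of the Keras layer are not reproduced here.**

```python
def word_ngrams(text, max_n=2):
    """Return all word n-grams of size 1..max_n, shorter n-grams first."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


print(word_ngrams("what is machine learning"))
# ['what', 'is', 'machine', 'learning', 'what is', 'is machine', 'machine learning']
```

**With larger n-gram sizes the number of distinct tokens grows quickly, so a fixed vocabulary budget covers a smaller share of them.**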

**We ran this comparison for both the English and Portuguese datasets and arrived at a specific `vocabulary_size` and tokenization format for each. We also keep the `dimensionality` of our `embeddings` (512) and the `maximum sequence length` (50) equal across models. All models are trained and tested on the same `training` and `testing` sets, defined in the cell below.**


In [9]:
import tensorflow as tf

vocabs = {"pt": 12_300, "en": 8_150}

# Each line holds a lowercased question followed by its integer label (1-indexed).
with open(f'data/generated_data/lower_generatedQA_{language}.txt', encoding='utf-8') as fp:
    lines = [line.strip() for line in fp]

X = [[' '.join(line.split(' ')[:-1])] for line in lines]  # one-element lists, shape (n, 1)
Y = [int(line.split(' ')[-1]) for line in lines]

vocab_size = vocabs[language]
embed_size = 512
sequence_length = 50

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    ngrams=2,
    output_sequence_length=sequence_length)

vectorize_layer.adapt(X)
vocabulary = vectorize_layer.get_vocabulary()

# Written as UTF-8 so it matches the encoding used when the file is read back.
with open(f'aira/vocabulary_{language}.txt', 'w', encoding='utf-8') as fp:
    for word in vocabulary:
        fp.write("%s\n" % word)

encoded_X = vectorize_layer(X)

one_hot_encoded_Y = tf.keras.utils.to_categorical(Y)[:,1:]

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(encoded_X.numpy(), 
                                                    one_hot_encoded_Y, 
                                                    test_size=0.1, 
                                                    random_state=42)
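
**The `[:, 1:]` slice applied to the one-hot labels can be puzzling at first. The sketch below reproduces the same layout with NumPy on toy labels (the label values are made up): because the labels start at 1, the column for class 0 is never used and can be dropped.**

```python
import numpy as np

labels = np.array([1, 3, 2])  # toy 1-indexed class labels

# Same layout as a one-hot encoding with columns 0..max(labels).
one_hot = np.eye(labels.max() + 1)[labels]

# Column 0 is always empty for 1-indexed labels, so it is dropped.
one_hot = one_hot[:, 1:]

print(one_hot.shape)           # (3, 3)
print(one_hot.argmax(axis=1))  # [0 2 1]
```

**After the slice, the `argmax` of a row equals `label - 1`, so it can be used directly as a 0-based index into a list of answers.**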

## `Bi-LSTM`

**The cell below creates and trains a `Bidirectional Long Short-Term Memory` network with two `Bidirectional` layers (128 units each).**

> **A `BiLSTM` is a type of recurrent neural network (`RNN`) that is commonly used for processing sequential data, such as text or speech. It extends the capabilities of a regular `LSTM` by processing the input sequence in both forward and backward directions, allowing the network to capture dependencies that exist in both directions.**

**To better train our models, we are also defining a list of callbacks to be used during training, including `ModelCheckpoint` (to save the best model according to the validation set), and `ReduceLROnPlateau` to reduce the learning rate when the model stops improving.**
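
**The bidirectional processing described above can be sketched with NumPy. This is a simplification under stated assumptions: it uses a plain `tanh` recurrent cell instead of an LSTM, random weights, and hypothetical helper names (`simple_rnn`, `bidirectional`). It only shows how the forward and backward passes are combined.**

```python
import numpy as np


def simple_rnn(inputs, w_in, w_rec):
    """Run a plain tanh RNN over `inputs` (steps, features); return all hidden states."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x_t in inputs:
        hidden = np.tanh(x_t @ w_in + hidden @ w_rec)
        states.append(hidden)
    return np.stack(states)


def bidirectional(inputs, forward_weights, backward_weights):
    """Concatenate a left-to-right pass with a right-to-left pass, step by step."""
    fwd = simple_rnn(inputs, *forward_weights)
    # Process the reversed sequence, then flip the states back to the original order.
    bwd = simple_rnn(inputs[::-1], *backward_weights)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)


rng = np.random.default_rng(0)
steps, features, units = 5, 8, 4


def make_weights():
    return rng.normal(size=(features, units)), rng.normal(size=(units, units))


sequence = rng.normal(size=(steps, features))
outputs = bidirectional(sequence, make_weights(), make_weights())
print(outputs.shape)  # (5, 8)
```

**The output width is twice the number of units, which is why a `Bidirectional` layer wrapping 128 units yields 256 features per step.**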


In [None]:
import tensorflow as tf

inputs = tf.keras.Input(shape=(x_train.shape[1],), dtype="int32")

embedded_inputs = tf.keras.layers.Embedding(
    input_dim=vocab_size, 
    output_dim=embed_size, 
    mask_zero=True)(inputs)

x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(embedded_inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(x)

outputs = tf.keras.layers.Dense(142, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath=f'aira/Aira_BiLSTM_{language}.keras', 
                                        monitor='val_categorical_accuracy', 
                                        save_best_only=True,),  
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_categorical_accuracy', 
                                        factor=0.1, 
                                        patience=10), 
]

model.fit(x_train,
          y_train,
          validation_split = 0.2,
          epochs=10,
          batch_size=32,
          verbose=1,
          callbacks=callbacks)

model = tf.keras.models.load_model(f'aira/Aira_BiLSTM_{language}.keras')
test_loss_score, test_acc_score = model.evaluate(x_test, y_test, verbose=0)

print(f'Final Loss: {round(test_loss_score, 1)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')
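
**To see how a reduce-on-plateau rule behaves with the settings used in the cell above (`patience=10`, `epochs=10`), here is a minimal pure-Python sketch. The function `plateau_schedule` is hypothetical and only approximates the callback's logic for a metric that should increase; cooldown and minimum learning rate are left out.**

```python
def plateau_schedule(history, lr=1e-3, factor=0.1, patience=10, min_delta=1e-4):
    """Return the learning rate in effect after each epoch of `history`."""
    best = float("-inf")
    wait = 0
    rates = []
    for value in history:
        if value > best + min_delta:
            # The monitored metric improved: remember it and reset the counter.
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the learning rate.
                lr *= factor
                wait = 0
        rates.append(lr)
    return rates


flat_run = plateau_schedule([0.5] * 10, patience=10)
short_patience = plateau_schedule([0.5] * 4, patience=3)

print(flat_run[-1] == 1e-3)       # True
print(short_patience[-1] < 1e-3)  # True
```

**In this simplified logic, a patience of ten cannot trigger within ten epochs even if the metric never improves after the first epoch, so the reduction would only matter for longer runs or a smaller `patience`.**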

**You can also load and test the `BiLSTM` model using the cell below.**

In [24]:
from IPython.display import Markdown
import tensorflow as tf

language = 'pt'
vocabs = {"pt": 12_300, "en": 8_150}
vocab_size = vocabs[language]
sequence_length = 50

model = tf.keras.models.load_model(f'aira/Aira_BiLSTM_{language}.keras')

with open(f'aira/vocabulary_{language}.txt', encoding='utf-8') as fp:
    vocabulary = [line[:-1] for line in fp]
    fp.close()

with open(f'data/original_data/answers_{language}.txt', encoding='utf-8') as fp:
    answers = [line.strip() for line in fp]
    fp.close()

text_vectorization = tf.keras.layers.TextVectorization(max_tokens=vocab_size,
                                                       output_mode="int",
                                                       ngrams=2,
                                                       vocabulary=vocabulary,
                                                       output_sequence_length=sequence_length)

encoded_sentence = text_vectorization("o que e aprendizagem de maquina?")
encoded_sentence = tf.keras.backend.expand_dims(encoded_sentence, axis=0)
preds = model.predict(encoded_sentence, verbose=0)[0]
index = tf.math.argmax(preds).numpy()

display(Markdown(answers[index]))

Aprendizagem de Máquina é um campo de pesquisa dedicado à compreensão e construção de métodos computacionais que "aprendem", ou seja, métodos que utilizam informação/dados para melhorar o desempenho em algumas tarefas. Geralmente, ML é utilizado em problemas onde uma descrição precisa da solução seria muito desafiadora (por exemplo, visão computacional).

**The `Bi-LSTM` trained in Portuguese does not outperform the rule-based model, but the model trained in English does:**

| Models        | Accuracy (PT)     | Accuracy (EN)     |
|-------------  |---------------    |---------------    |
| Rule-based    | **96.77%**        | 95.36%            |
| Bi-LSTM       | 92.29%            | **96.98%**        |

**Let us try to ensemble more LSTMs and see if we get some improvement.**

## `Ensemble Bi-LSTM`

**This model is an ensemble of two `Bi-LSTM` stacks, each with its own input.**

> **Ensembling is a machine learning technique that combines several individual models to improve the performance of the overall system.**

**For simplicity, we keep both inputs equal, but in practice we could, for example, shift inputs 1 and 2 by a fixed amount before presenting them to the model. The two inputs are processed in parallel, and their outputs are concatenated and fed into a final dense layer with a softmax activation function. The model below has two separate `embedding` layers and two separate stacks of `Bi-LSTM` layers, and the concatenation of their outputs is processed by a single `dense` layer.**
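
**The concatenation step described above can be sketched with NumPy. The branch outputs here are random stand-ins for the two `Bi-LSTM` stacks and the dense weights are untrained, so this only illustrates the shapes involved.**

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
batch, branch_dim, n_classes = 2, 256, 142  # 256 = 2 directions x 128 LSTM units

branch_1 = rng.normal(size=(batch, branch_dim))  # stand-in for the first stack's output
branch_2 = rng.normal(size=(batch, branch_dim))  # stand-in for the second stack's output
merged = np.concatenate([branch_1, branch_2], axis=-1)

# A single dense layer with softmax maps the merged features to class probabilities.
weights = rng.normal(scale=0.01, size=(merged.shape[1], n_classes))
bias = np.zeros(n_classes)
probs = softmax(merged @ weights + bias)

print(merged.shape, probs.shape)  # (2, 512) (2, 142)
```

**Because both branches feed one shared output layer and are trained jointly, this design is closer to a two-branch network than to an ensemble that averages the predictions of independently trained models.**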


In [None]:
import tensorflow as tf

inputs_1= tf.keras.Input(shape=(x_train.shape[1],), dtype="int32",  name='input_1')

embedded_inputs_1 = tf.keras.layers.Embedding(
    input_dim=vocab_size, 
    output_dim=embed_size, 
    mask_zero=True,
    name="embedded_inputs_1")(inputs_1)

x_1 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(embedded_inputs_1)
x_1 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(x_1)

inputs_2= tf.keras.Input(shape=(x_train.shape[1],), dtype="int32",  name='input_2')

embedded_inputs_2 = tf.keras.layers.Embedding(
    input_dim=vocab_size, 
    output_dim=embed_size, 
    mask_zero=True,
    name="embedded_inputs_2")(inputs_2)

x_2 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(embedded_inputs_2)
x_2 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(x_2)

concatenated = tf.keras.layers.concatenate([x_1, x_2], axis=-1)
outputs = tf.keras.layers.Dense(142, activation="softmax")(concatenated)

model = tf.keras.Model([inputs_1, inputs_2], outputs)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()


callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath=f'aira/Aira_BiLSTM_ENSEMB_{language}.keras', 
                                        monitor='val_categorical_accuracy', 
                                        save_best_only=True,),  
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_categorical_accuracy', 
                                        factor=0.1, 
                                        patience=10), 
]

model.fit([x_train, x_train],
          y_train,
          validation_split = 0.2,
          epochs=10,
          batch_size=32,
          verbose=1,
          callbacks=callbacks)

model = tf.keras.models.load_model(f'aira/Aira_BiLSTM_ENSEMB_{language}.keras')
test_loss_score, test_acc_score = model.evaluate([x_test,x_test], y_test, verbose=0)

print(f'Final Loss: {round(test_loss_score, 1)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')

**You can also load and test the `BiLSTM-ensemble` model using the cell below.**

In [25]:
from IPython.display import Markdown
import tensorflow as tf

language = 'pt'
vocabs = {"pt": 12_300, "en": 8_150}
vocab_size = vocabs[language]
sequence_length = 50

model = tf.keras.models.load_model(f'aira/Aira_BiLSTM_ENSEMB_{language}.keras')

with open(f'aira/vocabulary_{language}.txt', encoding='utf-8') as fp:
    vocabulary = [line[:-1] for line in fp]
    fp.close()

with open(f'data/original_data/answers_{language}.txt', encoding='utf-8') as fp:
    answers = [line.strip() for line in fp]
    fp.close()

text_vectorization = tf.keras.layers.TextVectorization(max_tokens=vocab_size,
                                                       output_mode="int",
                                                       ngrams=2,
                                                       vocabulary=vocabulary,
                                                       output_sequence_length=sequence_length)

encoded_sentence = text_vectorization("o que e uma funcao de ativacao?")
encoded_sentence = tf.keras.backend.expand_dims(encoded_sentence, axis=0)
preds = model.predict([encoded_sentence, encoded_sentence], verbose=0)[0]
index = tf.math.argmax(preds).numpy()

display(Markdown(answers[index]))

Aprendizagem de Máquina é um campo de pesquisa dedicado à compreensão e construção de métodos computacionais que "aprendem", ou seja, métodos que utilizam informação/dados para melhorar o desempenho em algumas tarefas. Geralmente, ML é utilizado em problemas onde uma descrição precisa da solução seria muito desafiadora (por exemplo, visão computacional).

**The `Ensembled-Bi-LSTM` trained in Portuguese improves marginally when compared with the `Bi-LSTM` but does not outperform the rule-based model. The `Ensembled-Bi-LSTM` trained in English underperforms the `Bi-LSTM`, but outperforms the baseline:**

| Models        | Accuracy (PT)     | Accuracy (EN)     |
|-------------  |---------------    |---------------    |
| Rule-based    | **96.77%**        | 95.36%            |
| Bi-LSTM       | 92.29%            | **96.98%**        |
| Ensembled-Bi-LSTM | 93.73%            | 95.43%        |

**Now, to the transformer.**

## `Encoder-Transformer`

**The cell below creates and trains an `encoder-only Transformer` model.**

> **The `Transformer` is a neural network architecture [developed by Google](https://arxiv.org/abs/1706.03762) that relies on attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings without relying on convolutions or recurrent neural networks. We can think of a transformer as a stack of `attention` (and `self-attention`) layers connected by `residual connections`, with `feed-forward` and `normalization` layers.**

> **`Encoder-only transformers` are a type of transformer architecture that consists only of the encoder module and omits the decoder module, as in [BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers).**

**Our custom transformer blocks (`PositionalEmbedding` and `TransformerEncoder`) are documented in the `tblocks` file. In our current version, this model uses 6 attention heads, a single encoder block, and a `latent dimensionality` (the size of the feed-forward network after the attention heads) of 512.**
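
**The attention mechanism at the core of an encoder block can be sketched in a few lines of NumPy. This is a single-head, untrained illustration with made-up sizes, not the `TransformerEncoder` from `tblocks`; the function name and the random projection matrices are assumptions.**

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Return the attended values and the attention weights for one head."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over the key axis, shifted for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))  # toy token embeddings

# In self-attention, queries, keys, and values are projections of the same input.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)

print(out.shape)   # (4, 8)
print(attn.shape)  # (4, 4)
```

**Each row of `attn` sums to 1 and says how much every other position contributes to that token's new representation. A multi-head layer runs several such heads in parallel and combines their outputs.**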

In [None]:
from tblocks import PositionalEmbedding, TransformerEncoder
import tensorflow as tf

num_heads = 6
latent_dim = 512

inputs = tf.keras.Input(shape=(x_train.shape[1],), dtype="int64")

x = PositionalEmbedding(sequence_length, vocab_size, embed_size)(inputs)
x = TransformerEncoder(embed_size, latent_dim, num_heads)(x)
x = tf.keras.layers.GlobalMaxPooling1D()(x) 
x = tf.keras.layers.Dropout(0.5)(x)

outputs = tf.keras.layers.Dense(142, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", 
    loss="categorical_crossentropy", 
    metrics=["categorical_accuracy"])
model.summary()

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath=f'aira/Aira_transformer_{language}.keras', 
                                        monitor='val_categorical_accuracy', 
                                        save_best_only=True,),  
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_categorical_accuracy', 
                                        factor=0.1, 
                                        patience=10), 
]

model.fit(x_train,
          y_train,
          validation_split = 0.2,
          epochs=10,
          batch_size=32,
          verbose=1,
          callbacks=callbacks)

model = tf.keras.models.load_model(f"aira/Aira_transformer_{language}.keras", 
                                custom_objects={"TransformerEncoder": TransformerEncoder, 
                                                 "PositionalEmbedding": PositionalEmbedding})

test_loss_score, test_acc_score = model.evaluate(x_test, y_test, verbose=0)

print(f'Final Loss: {round(test_loss_score, 1)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')


**You can also load and test the `transformer` model using the cell below.**

In [29]:
from tblocks import PositionalEmbedding, TransformerEncoder
from IPython.display import Markdown
import tensorflow as tf

language = 'pt'
vocabs = {"pt": 12_300, "en": 8_150}
vocab_size = vocabs[language]
sequence_length = 50

model = tf.keras.models.load_model(f"aira/Aira_transformer_{language}.keras",
                                   custom_objects={"TransformerEncoder": TransformerEncoder,
                                                   "PositionalEmbedding": PositionalEmbedding})


with open(f'aira/vocabulary_{language}.txt', encoding='utf-8') as fp:
    vocabulary = [line[:-1] for line in fp]
    fp.close()

with open(f'data/original_data/answers_{language}.txt', encoding='utf-8') as fp:
    answers = [line.strip() for line in fp]
    fp.close()

text_vectorization = tf.keras.layers.TextVectorization(max_tokens=vocab_size,
                                                       output_mode="int",
                                                       ngrams=2,
                                                       vocabulary=vocabulary,
                                                       output_sequence_length=sequence_length)

encoded_sentence = text_vectorization("o que e um sistema GOFAI?")
encoded_sentence = tf.keras.backend.expand_dims(encoded_sentence, axis=0)
preds = model.predict(encoded_sentence, verbose=0)[0]
index = tf.math.argmax(preds).numpy()

display(Markdown(answers[index]))

GOFAI ("good-old-fashioned-ai"), ou inteligência artificial simbólica, é o termo utilizado para se referir a métodos de desenvolvimento de sistemas de IA baseados em representações simbólicas (interpretáveis) de alto nível, lógica e busca. Deep Blue é um grande exemplo de um sistema especialista/GOFAI. Deep Blue venceu Garry Kasparov (um grão-mestre de xadrez russo) em uma partida de seis jogos em 1996.

**Now, we are able to beat our baseline using the `Transformer` on both languages:**

| Models        | Accuracy (PT)     | Accuracy (EN) |
|-------------  |---------------    |---------------|
| Rule-based    | 96.77%            | 95.36%        |
| Bi-LSTM       | 92.29%            | 96.98%        |
| Ensembled-Bi-LSTM | 93.73%        | 95.43%        |
| Transformer   | **97.11%**        | **98.35%**    |

**And now, to a pre-trained transformer. From now on, we will no longer use our own vocabulary and `TextVectorization` layer. Instead, we will fine-tune a BERT model that comes pre-trained with its own tokenizer.**

## `Fine-tuned BERT`

**We will need to have access to pre-trained BERT models in both of our languages of choice (English and Portuguese), something we can attain via the `transformers` library and the ever-growing space of models and datasets available in [Hugging Face](https://huggingface.co/).**

> **Bidirectional Encoder Representations from Transformers ([BERT](https://arxiv.org/abs/1810.04805)) is a type of pre-trained natural language processing model developed by Google in 2018. It is based on the `transformer` architecture and is designed to capture contextual relations between words in a text by considering both the left and right contexts. `BERT` is trained with a `self-supervised` learning approach called `masked language modeling`. In this approach, certain words in a sentence are randomly replaced with a `[MASK]` token, and the model is trained to predict the original word based on its context.**

**We also need to prepare our dataset differently. Since BERT is already pre-trained and has its own tokenizer, we will pass it the raw text instead of our vectorized tokens.**
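
**To illustrate the masked language modeling objective mentioned above, here is a toy sketch in plain Python. It is a simplification: it works on whitespace-separated words rather than WordPiece tokens, and it always substitutes `[MASK]`, whereas BERT's original recipe selects about 15% of the tokens and sometimes keeps or randomly replaces them. The helper `mask_tokens` is hypothetical.**

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace random tokens with [MASK]; return the masked input and the targets."""
    rng = random.Random(seed)
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(token)  # the model must recover this word
        else:
            masked.append(token)
            targets.append(None)   # no prediction is required at this position
    return masked, targets


tokens = "the model learns to predict hidden words from context".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)

print(masked)
print(targets)
```

**During pre-training, the loss is computed only at the positions where `targets` is not `None`.**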

In [2]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast, TFBertForSequenceClassification

# Each line holds a lowercased question followed by its integer label (1-indexed).
with open(f'data/generated_data/lower_generatedQA_{language}.txt', encoding='utf-8') as fp:
    lines = [line.strip() for line in fp]

X = [' '.join(line.split(' ')[:-1]) for line in lines]
Y = [int(line.split(' ')[-1]) for line in lines]

x_train, x_test, y_train, y_test = train_test_split(X, 
                                                Y, 
                                                test_size=0.1, 
                                                random_state=42)

y_train = tf.keras.utils.to_categorical(y_train)[:,1:]

**We will be fine-tuning [`bert-base-cased`](https://huggingface.co/bert-base-cased) for the English dataset, and [`bert-base-portuguese-cased`](https://huggingface.co/neuralmind/bert-base-portuguese-cased) for the Portuguese one.**

**Since our model is already pre-trained, and we do not want to destroy the valuable representations it has, we will set a very low learning rate (paired with an `ExponentialDecay` scheduler) to minimize the size of the updates on our model.**
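
**The effect of an exponential decay schedule can be checked by hand. The sketch below assumes the continuous (non-staircase) form, `initial_lr * decay_rate ** (step / decay_steps)`, with the same values used in this notebook; the function name is hypothetical.**

```python
def exponential_decay(step, initial_lr=5e-5, decay_steps=10_000, decay_rate=0.9):
    """Learning rate after `step` optimizer updates (continuous decay)."""
    return initial_lr * decay_rate ** (step / decay_steps)


for step in (0, 1_000, 10_000, 50_000):
    print(step, f"{exponential_decay(step):.3e}")
```

**With these settings the learning rate falls to 90% of its initial value only after 10,000 updates, so on a small dataset trained for a few epochs it stays close to `5e-5` throughout.**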

In [None]:
from transformers import BertTokenizerFast, TFBertForSequenceClassification
from transformers import TextClassificationPipeline

model_type = 'neuralmind/bert-base-portuguese-cased' if language == 'pt' else 'bert-base-cased'

model = TFBertForSequenceClassification.from_pretrained(model_type, num_labels=y_train.shape[1])
tokenizer = BertTokenizerFast.from_pretrained(model_type)

train_encodings = tokenizer(x_train, truncation=True, padding=True)
train_ds = tf.data.Dataset.from_tensor_slices((dict(train_encodings),y_train))
train_ds = train_ds.batch(32)

learning_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=5e-5,
    decay_steps=10000,
    decay_rate=0.9)

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_schedule)
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=tf.metrics.CategoricalAccuracy()
              )

model.summary()

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

# No checkpoint or plateau callbacks here: the model is saved with `save_pretrained`
# below, and the learning rate is already driven by the `ExponentialDecay` schedule.
# `train_ds` is already batched, so `batch_size` is not passed to `fit`.
model.fit(train_ds, 
          epochs=10, 
          verbose=1)

tokenizer.save_pretrained(f"aira/aira_BERT_{language}")
model.save_pretrained(f"aira/aira_BERT_{language}")

tokenizer = BertTokenizerFast.from_pretrained(f"aira/aira_BERT_{language}")
model = TFBertForSequenceClassification.from_pretrained(f"aira/aira_BERT_{language}", 
                                                        id2label=dict([(i, i+1) for i in range(y_train.shape[1])])) 

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)
predictions = pipe(x_test)

acc = list()

for i, pred in enumerate(predictions):
    if y_test[i] == pred['label']:
        acc.append(1)
    else:
        acc.append(0)

print(f"Accuracy of BERT Model: {(sum(acc)/len(acc)) * 100:.2f}%")

**We were again able to outperform the baseline and the `Transformer` on both languages by fine-tuning a pre-trained BERT:**

| Models        | Accuracy (PT)     | Accuracy (EN) |
|-------------  |---------------    |---------------|
| Rule-based    | 96.77%            | 95.36%        |
| Bi-LSTM       | 92.29%            | 96.98%        |
| Ensembled-Bi-LSTM | 93.73%        | 95.43%        |
| Transformer   | 97.11%            | 98.35%        |
| BERT          | **98.55%**        | **99.45%**    |

**Our last classification model (closed-domain chatbot) will be created using `Ada`, the smallest GPT-3 model that OpenAI allows fine-tuning.**

## Fine-tuning `Ada` for `text classification`

**`Ada` is the smallest and fastest GPT-3 model (the sizes below are approximate):**

| Models        | Size (Parameters) |
|-------------  |---------------    |
| Ada           | 350M              | 
| Babbage       | 3B                | 
| Curie         | 13B               | 
| Davinci       | 175B              | 

**We will only train `Ada` for text classification. Afterwards, we will train all four models for conditional text generation, and they will be our open-domain Aira.**

**OpenAI has specific [guides on how to prepare your dataset](https://platform.openai.com/docs/guides/fine-tuning/prepare-training-data) for fine-tuning a model. The `openai` library comes with functionalities to help check if your dataset is in the correct format. But for them to work, we will convert our poorly structured JSON file into a CSV.**

In [None]:
import json
import pandas as pd

with open('data/fine_tuning_data/fine_tuning_classification_train.json') as fp:
    dataset = json.load(fp)

# Collect the prompt/completion pairs into two parallel columns.
prompt = [sample['prompt'] for sample in dataset]
completion = [sample['completion'] for sample in dataset]

df = pd.DataFrame({'prompt': prompt, 'completion': completion})
df.to_csv('data/fine_tuning_data/fine_tuning_classification_train.csv', index=False)

display(df.head())


**To turn your new CSV file into a JSONL (the format you send to the API), use the following command on your CLI.**

> **Note: all of the datasets and results are already available in the `fine_tuning_data` folder.**

```bash

openai tools fine_tunes.prepare_data -f your_file_name.csv

```
**This will create the `your_file_name.jsonl` file you will send to the API.**
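
**For orientation, the sketch below shows roughly what a prompt/completion JSONL file looks like, one JSON object per line. The two records are made up, and both the `\n\n###\n\n` separator (the one that appears in this notebook's prompts) and the leading space before the label are assumptions about what the prepared file contains. Inspect the file produced by the tool rather than relying on this.**

```python
import json

# Toy records; the real dataset has one prompt per question and a class number as completion.
records = [
    {"prompt": "What is machine learning?", "completion": "1"},
    {"prompt": "What is reinforcement learning?", "completion": "2"},
]

SEPARATOR = "\n\n###\n\n"  # assumed: marks where the prompt ends

lines = []
for record in records:
    lines.append(json.dumps({
        "prompt": record["prompt"] + SEPARATOR,
        "completion": " " + record["completion"],  # assumed: leading space before the label
    }))

jsonl_text = "\n".join(lines)
print(jsonl_text)
```

**Each line parses as a standalone JSON object, which is what distinguishes JSONL from a single JSON array.**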

> **Note: if you are a Windows user, you will need to add your OpenAI API key as an environment variable to send fine-tune requests to the API.**

**Since we have few samples, we cannot set aside a validation set to evaluate the fine-tuning (OpenAI recommends around 100 samples per class, and we do not have that). You can [check the documentation](https://platform.openai.com/docs/guides/fine-tuning/advanced-usage) for more information on the parameters you can adjust in this fine-tuning. Below, we provide the command used to fine-tune Ada on this classification task.**

```bash

openai api fine_tunes.create -t "fine_tuning_classification_train_prepared.jsonl" --compute_classification_metrics --classification_n_classes 142 -m ada --n_epochs 10 --suffix "aira"

```

**If you have your API key set up, OpenAI will accept your request and begin the process of fine-tuning. The cost of the process will show before the fine-tuning starts, as well as your place in the queue:**

```bash

Created fine-tune: ft-xxxxxxxxxxxxxxxxxxxx
Fine-tune costs $0.21
Fine-tune enqueued. Queue number: 9

```

**`ft-xxxxxxxxxxxxxxxxxxxx` is the ID of your fine-tune request. You can follow the process of your model with the following command (fine-tunings usually take around an hour, depending on your place in the queue).**

```bash

openai api fine_tunes.follow -i ft-xxxxxxxxxxxxxxxxxxxx

```

**At the end of the fine-tuning, you will be shown the following message:**

```bash

Fine-tune started
Completed epoch 1/10
Completed epoch 3/10
Completed epoch 5/10
Completed epoch 7/10
Completed epoch 9/10
Uploaded model: xxxxxxxxxxxxxxxxxxxx
Uploaded result file: file-xxxxxxxxxxxxxxxxxxxx
Fine-tune succeeded

Job complete! Status: succeeded 🎉

```

**In it, you will find the name of your new model (which can be called by the API), and the ID of your results folder (which keeps track of the optimization/learning scores during fine-tuning).**

**Let us take a look at the optimization/learning curves from Ada.**



In [5]:
import plotly.graph_objects as go
import pandas as pd

df = pd.read_csv(f'data/fine_tuning_data/ada_classification_fine_tuning_results.csv')

fig = go.Figure()

fig.add_trace(go.Scatter(x=df.step, y=df.training_loss,
 name='Training Loss'))
fig.add_trace(go.Scatter(x=df.step, y=df.training_sequence_accuracy,
 name='Training Sequence Accuracy'))

fig.update_layout(template='plotly_dark',
                  title='Learning/Optimization Curves',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.show()

**Here are [some useful functions](https://platform.openai.com/docs/api-reference/fine-tunes/create) to work with your OpenAI API:**

```python
import openai

openai.api_key="your_api_key_here"

# lists all your files
openai.File.list()

# deletes a file
openai.File.delete("file-xxxxxxxxxxxxxxxxxxxx")

# retrieves a file
openai.File.retrieve("file-xxxxxxxxxxxxxxxxxxxx")

# cancels a fine-tuning job (takes the fine-tune ID, not a file ID)
openai.FineTune.cancel(id="ft-xxxxxxxxxxxxxxxxxxxx")
```

**Let us now test our new model.**

In [22]:
import openai
from IPython.display import Markdown

openai.api_key="your_api_key_here"

with open(f'data/original_data/answers_en.txt', encoding='utf-8') as fp:
    answers = [line.strip() for line in fp]
    fp.close()

response = openai.Completion.create(
    model="ada:ft-aires:aira-2023-04-14-13-46-07",
    prompt=f"What is Reinforcment Learning?\n\n###\n\n",
    max_tokens=1,
    temperature=0
    )

prediction = int(response['choices'][0]['text']) - 1
    
display(Markdown(answers[prediction]))

Reinforcement Learning (RL) is a machine learning technique concerned with how intelligent agents should act in an environment to maximize the expected return of reward. RL is one of the three basic paradigms of machine learning, along with supervised learning and unsupervised learning. Reinforcement learning differs from supervised learning in that it does not require labeled input/output pairs. Instead, the focus is on finding a balance between exploration and exploitation, a problem canonically represented by the "multi-armed bandit problem."

**Given our limited test sample, we were not able to measure the accuracy of the Ada model fine-tuned for classification on held-out data. However, we tried all 142 original questions against the model, and it achieved 100% accuracy. This is not a fair test in comparison to the other models, but it is a good indicator nonetheless.**

In [5]:
import openai

openai.api_key="your_api_key_here"

with open(f'data/original_data/originalQA_en.txt', encoding='utf-8') as fp:
    questions = [line.strip() for line in fp]
    fp.close()

labels = [i+1 for i in range(len(questions))]

acc = list()

for i, question in enumerate(questions):
    response = openai.Completion.create(
        model="ada:ft-aires:aira-2023-04-14-13-46-07",
        prompt=f"{question}\n\n###\n\n",
        max_tokens=1,
        temperature=0
        )
    
    prediction = int(response['choices'][0]['text'])
    
    if prediction == labels[i]:
        acc.append(1)
    else:
        acc.append(0)

print(f"Accuracy of ADA Model: {(sum(acc)/len(acc)) * 100:.2f}%")

Accuracy of ADA Model: 100.00%
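
**The loop above assumes every completion can be converted with `int(...)`. A generative model gives no such guarantee, so a more defensive version of the scoring step might look like the sketch below. The helpers `parse_label` and `accuracy` are hypothetical names introduced here for illustration, and the `completions` list is a stand-in for the text returned by the API, not real model output.**

```python
import re

def parse_label(completion_text):
    """Return the first integer found in the completion, or None if there is none."""
    match = re.search(r"\d+", completion_text)
    return int(match.group()) if match else None

def accuracy(predictions, labels):
    """Fraction of predictions equal to their label; None predictions count as wrong."""
    if not labels:
        return 0.0
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)

# Hypothetical completions standing in for response['choices'][0]['text'].
completions = [" 1", "2", " n/a", " 4"]
labels = [1, 2, 3, 4]

predictions = [parse_label(text) for text in completions]
print(f"Accuracy: {accuracy(predictions, labels) * 100:.2f}%")
```

**Counting an unparseable completion as a wrong answer keeps the evaluation running and keeps the score honest, instead of crashing halfway through a batch of paid API calls.**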


**However, using Ada for `text classification` is overkill when a fine-tuned BERT already puts us in the 99% accuracy range. We can put these pre-trained GPT models to work on a much more interesting task, such as open-ended Q&A. For this, we will fine-tune all four available models on our `fine_tuning_completion` dataset.**

## Open-Domain Chatbots via `Conditional Text Generation`

**For conditional text generation, we fine-tune the model to produce a certain response given a specific input, like the answer to a question. This approach has many limitations. Even though we can make a chatbot that answers questions about anything, getting the model to produce good-quality responses is hard. By good, we mean factual and nontoxic responses. This leads us to two of the most common problems of generative models used in conversational applications:**

- 🤥 **Generative models can produce pseudo-informative content, that is, false information that appears truthful. For example, multi-modal generative models can be used to create images with untruthful content, while language models for text generation can automate the generation of misinformation.**

- 🤬 **In certain types of tasks, generative models can generate toxic and discriminatory content inspired by historical stereotypes against sensitive attributes (for example, gender, race, religion). Unfiltered public datasets may also contain inappropriate content, such as pornography, racist images, and social stereotypes, which can contribute to unethical biases in generative models. Furthermore, when prompted with non-English languages, some generative models may perform poorly.**

**How to create aligned AI is still an open problem, researched by many pioneers. Given the recent advances in NLP and generative models, much of this effort is currently directed at aligning language models. Here are some interesting sources to learn more about the field:**

1. **_[Risks from Learned Optimization in Advanced Machine Learning Systems](https://arxiv.org/abs/1906.01820)._**
2. **_[Artificial Intelligence, Values and Alignment](https://arxiv.org/abs/2001.09768)._**
3. **_[Language Models for Dialog Applications](https://arxiv.org/abs/2201.08239)._**
4. **_[Learning to summarize with human feedback](https://arxiv.org/abs/2009.01325)._**
5. **_[Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862)._**
6. **_[Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://arxiv.org/abs/2209.07858)._**
7. **_[Discovering Language Model Behaviors with Model-Written Evaluations](https://arxiv.org/abs/2212.09251)._**
8. **_[Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)._**
9. **_[Fine-Tuning Language Models from Human Preferences](https://arxiv.org/abs/1909.08593)._**
10. **_[A General Language Assistant as a Laboratory for Alignment](https://arxiv.org/abs/2112.00861)._**
11. **_[OpenAssistant Conversations - Democratizing Large Language Model Alignment](https://www.ykilcher.com/OA_Paper_2023_04_15.pdf)._**

**We can think of a pre-trained language model as the aggregation of all the personas that ever wrote something that happens to show up in its dataset. During training, it learns the distribution over tokens that best represents that data. A distribution made by "_us_".**
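
**To make "a distribution over tokens" concrete, here is a toy sketch that estimates next-token probabilities by counting word pairs in a tiny, made-up corpus. This is a bigram counter, not how a transformer is trained, and the function name `next_token_distribution` and the corpus are assumptions introduced only for illustration. The idea it shares with a real language model is that the probabilities are nothing more than a summary of what the authors of the corpus wrote.**

```python
from collections import Counter, defaultdict

def next_token_distribution(corpus):
    """Estimate P(next token | current token) from raw bigram counts."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        # Count every (current, following) pair of adjacent tokens.
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    # Normalize the counts of each context into probabilities.
    return {
        token: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for token, followers in counts.items()
    }

# A made-up corpus; each sentence plays the role of one "persona" writing text.
corpus = [
    "ai safety is important",
    "ai ethics is important",
    "ai safety is hard",
]

distribution = next_token_distribution(corpus)
print(distribution["ai"])  # "safety" is twice as likely as "ethics" after "ai"
print(distribution["is"])  # "important" is twice as likely as "hard" after "is"
```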

**Fine-tuning the model, as we are about to do, or conditioning it with few-shot examples, just helps to set the model in a mode more aligned with the fine-tuning instructions. If we wanted to give the model more information than plain examples, we could also provide comparisons between examples or a feedback signal built from human evaluations. Demonstrations of expert behavior paired with human evaluations are the basis of one of the most used alignment strategies in the literature (`Reinforcement Learning from Human Feedback`).**

> **Note: `RLHF` is a technique that trains a "_reward model_" directly from human feedback and then uses that model as the reward function to optimize an agent's policy with a `reinforcement learning` algorithm like `Proximal Policy Optimization`.**
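
**As a rough illustration of the "_reward model_" step, the sketch below computes the pairwise comparison loss commonly used for this purpose, as in the summarization and instruction-following papers listed above: given a response a human preferred and one they rejected, the loss is `-log(sigmoid(r_chosen - r_rejected))`. The function name and the example reward values are hypothetical. In practice the rewards would come from a neural network and the loss would be minimized with gradient descent, which is not shown here.**

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)) for one comparison."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(diff)) equals softplus(-diff); this form avoids overflow
    # when the difference between the two rewards is large.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# Hypothetical scalar rewards assigned to a (chosen, rejected) pair of responses.
agrees_with_human = pairwise_preference_loss(2.0, 0.0)     # chosen scored higher
undecided = pairwise_preference_loss(0.0, 0.0)             # both scored the same
disagrees_with_human = pairwise_preference_loss(0.0, 2.0)  # rejected scored higher

print(agrees_with_human, undecided, disagrees_with_human)
```

**The loss shrinks as the reward model scores the human-preferred response higher than the rejected one, which is what pushes the learned reward toward human judgments.**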

**Other techniques besides `RLHF` can also be a part of an alignment strategy, like toxicity detection and automated fact-checking. In the future, we will present more tutorials on this topic, but for now, since we are using the OpenAI API (and at the moment, it does not allow other forms of fine-tuning aside from the one we will implement), we will simply work with fine-tuning.** 

**Our dataset is already prepared and available in the `data` folder, and to queue our fine-tuning job, we just repeat the same command as before. The command below uses `davinci` as the base model; pass `ada`, `babbage`, or `curie` to `-m` to fine-tune the other models. We train for 2 epochs with a low learning rate (just like we did with BERT). For more information, [check the documentation](https://platform.openai.com/docs/api-reference/fine-tunes).**

```bash

openai api fine_tunes.create -t "fine_tuning_completion_prepared.jsonl" -m davinci --n_epochs 2 --learning_rate_multiplier 0.02 --suffix "aira"

```
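
**For reference, each line of a prepared completion dataset is a JSON object with a `prompt` and a `completion` field. The sketch below builds such lines from question/answer pairs. It assumes the prompt separator `\n\n###\n\n` and the `[END]` stop sequence that appear in the API calls of this notebook, plus a leading space in the completion. The helper `to_finetune_record` and the example pair are hypothetical, and the actual contents of `fine_tuning_completion_prepared.jsonl` may differ.**

```python
import json

PROMPT_SEPARATOR = "\n\n###\n\n"  # marks the end of the prompt (assumed from the API calls)
STOP_SEQUENCE = " [END]"          # marks the end of the completion (assumed from `stop`)

def to_finetune_record(question, answer):
    """Build one prompt/completion training record from a question/answer pair."""
    return {
        "prompt": question.strip() + PROMPT_SEPARATOR,
        # The completion starts with a space and ends with the stop sequence.
        "completion": " " + answer.strip() + STOP_SEQUENCE,
    }

# A placeholder pair used only for illustration, not taken from the dataset.
pairs = [("What is AI Ethics?", "A placeholder answer used only for illustration.")]

# One JSON object per line is what makes the file JSONL.
jsonl_lines = [json.dumps(to_finetune_record(q, a), ensure_ascii=False) for q, a in pairs]
print("\n".join(jsonl_lines))
```

**Using the same separator and stop sequence at training and inference time is what lets the model know where the question ends and where its answer should stop.**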

**Your fine-tuned model becomes available for API calls once training finishes, and that's it! Our trained models are available for testing in the [AIRES playground](https://playground.airespucrs.org/aira).**

In [7]:
import openai
from IPython.display import Markdown

openai.api_key="your_api_key_here"

response = openai.Completion.create(
  model="ada:ft-xxxxxxxxxxxxxxxxx", # replace with your model id
  prompt="What is the pythagorean theorem?\n\n###\n\n",
  max_tokens=250,
  temperature=0.5,
  top_p=0.5,
  frequency_penalty=0.5,
  presence_penalty=0.5,
  stop=["[END]"]
)

display(Markdown(response['choices'][0]['text']))


 The Pythagorean Theorem, which is a theorem in geometry, states that the sum of the squares of two sides of a right angled triangle equals the square of the hypotenuse.

**As you can see, after being trained on 3492 question/answer examples (all related to AI Ethics and AI Safety), the model can still generalize to other areas, like maths.**

**Congratulations, we just made a whole trip through conversational AI, from closed-domain expert systems to open-domain, fine-tuned large language models.**

---

Return to the [index](https://github.com/Nkluge-correa/Aira-EXPERT).