# Data Assistant

This is my thesis on creating a large language model that can generate GraphQL queries based on a given prompt.

**Why?**

If we teach a model to retrieve the information we are looking for, instead of expecting it to hold all of that information itself, the answers we get stay up to date and are not limited by the contents of a training dataset.


## Version 1

This was a proof-of-concept GPT model with a highly specialized tokenizer, both created and trained entirely from scratch.

The tokenizer defined several classes of tokens:
- 0: END token, marking the end of a prompt so that generation stops.
- 1: EMPTY token, used to pad sequences to the fixed input size.
- 2-7: ARGUMENT tokens, used to pass keywords from the prompt to the output without having to teach those words to the model.
- 8-157: GENERIC tokens, covering every syntactically required token that has no special meaning.
- 158 and up: GLOSSARY tokens, standing for values from the rows of the database, which are replaced with the name of their table.
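
The ID ranges above can be sketched as a small lookup. This is only an illustration of the layout, not the tokenizer used in Version 1: the names `token_class` and `pad_to_fixed_size` are hypothetical, and placing END directly after the last real token is an assumption.

```python
# Token ID layout from the list above; all helper names are hypothetical.
END = 0
EMPTY = 1
ARGUMENT_IDS = range(2, 8)     # IDs 2-7
GENERIC_IDS = range(8, 158)    # IDs 8-157
GLOSSARY_START = 158           # IDs 158 and up


def token_class(token_id: int) -> str:
    """Return the name of the class a token ID belongs to."""
    if token_id < 0:
        raise ValueError(f"token IDs are non-negative, got {token_id}")
    if token_id == END:
        return "END"
    if token_id == EMPTY:
        return "EMPTY"
    if token_id in ARGUMENT_IDS:
        return "ARGUMENT"
    if token_id in GENERIC_IDS:
        return "GENERIC"
    return "GLOSSARY"


def pad_to_fixed_size(token_ids: list[int], size: int) -> list[int]:
    """Close the sequence with END, then fill the remaining slots with EMPTY.

    Assumption: END comes directly after the last real token.
    """
    if len(token_ids) + 1 > size:
        raise ValueError("sequence does not fit into the fixed input size")
    closed = token_ids + [END]
    return closed + [EMPTY] * (size - len(closed))


print(token_class(5))                  # ARGUMENT
print(pad_to_fixed_size([8, 160], 5))  # [8, 160, 0, 1, 1]
```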

Using the class (or table) name instead of the instance (or row) has interesting trade-offs that make this approach very useful in specific scenarios.

For example, because the model only has to learn the class names, the results will have better accuracy and fewer hallucinations. However, the model also needs training data that is specific to the structure of the database it will use, and therefore unique to each database, because even the tokenizer depends on that structure.
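
One possible reading of the class-instead-of-instance idea is sketched below: every word that matches a known row value is swapped for the name of its table before the text reaches the model. The glossary contents, the `<table>` placeholder format, and the function name `abstract_prompt` are all made up for illustration, and the sketch only handles single-word values.

```python
# Hypothetical glossary: row value (lowercase) -> name of the table it comes from.
GLOSSARY = {
    "queen": "artist",
    "innuendo": "album",
}


def abstract_prompt(prompt: str, glossary: dict[str, str]) -> tuple[list[str], list[str]]:
    """Replace known row values with their table name.

    Returns the abstracted words and the original values that were replaced,
    so the values can be restored in the generated query afterwards.
    """
    words: list[str] = []
    replaced: list[str] = []
    for word in prompt.lower().split():
        table = glossary.get(word)
        if table is None:
            words.append(word)            # ordinary word, kept as-is
        else:
            words.append(f"<{table}>")    # instance replaced by its class
            replaced.append(word)
    return words, replaced


words, replaced = abstract_prompt("Which albums did Queen release", GLOSSARY)
print(words)     # ['which', 'albums', 'did', '<artist>', 'release']
print(replaced)  # ['queen']
```

With this kind of substitution the model only ever sees `<artist>`, never the individual artist names, which is also why the glossary has to be rebuilt for every database.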

Using this technique showed promising results with a small sample set, but collecting that specific dataset proved to be too much work.

## Version 2

To make this project more widely usable, the main dataset needs to be more general, so that only a small dataset is needed to fine-tune the model for a given application.

For this we can use the Hugging Face Transformers library, which provides a matching tokenizer for each pretrained model and makes fine-tuning existing models easy.

Install dependencies:

In [None]:
%pip install pandas tensorflow transformers

For the base LLM we can use a generic language model like EleutherAI's gpt-j-6B:

In [None]:
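from transformers import AutoTokenizer, TFAutoModelForCausalLM

# The fine-tuning cells below use TensorFlow/Keras, so load the TensorFlow model class.
# If the checkpoint provides no TensorFlow weights, pass from_pt=True (requires PyTorch).
model = TFAutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")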

Then we have to load our GraphQL dataset and tokenize it to fine-tune this model.

This will create the generic GraphQL generator model that should be tuned further for real applications.

In [None]:
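import pandas as pd

# Load the semicolon-delimited dataset of question/query pairs
dataset = pd.read_csv("../data/training_data.csv", delimiter=';')
questions = dataset["question"].tolist()
answers = dataset["query"].tolist()

# padding=True needs a padding token; if the tokenizer does not define one,
# reuse the end-of-sequence token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Encode each question together with its query as a single sequence
inputs = tokenizer(questions, answers, padding=True, truncation=True, return_tensors="tf")

input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]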
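
The loading cell expects a semicolon-delimited file with a `question` and a `query` column. The actual `training_data.csv` is not shown here, so the rows below are made up; this sketch only illustrates the expected layout, parsed with the standard library so that it runs without extra dependencies.

```python
import csv
import io

# Made-up sample rows in the layout the loading cell expects.
SAMPLE_CSV = """question;query
List the names of all users;{ users { name } }
Show the title of every post;{ posts { title } }
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV), delimiter=";")
rows = list(reader)

sample_questions = [row["question"] for row in rows]
sample_queries = [row["query"] for row in rows]

for question, query in zip(sample_questions, sample_queries):
    print(f"{question!r} -> {query!r}")
```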

Compile the model and run the fine-tuning.

In [None]:
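import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

# For causal language modeling the targets are the input tokens themselves;
# padded positions are set to -100 so that the loss ignores them
labels = tf.where(attention_mask == 1, input_ids, -100 * tf.ones_like(input_ids))

# No loss is passed to compile(): Transformers TensorFlow models fall back to
# their built-in language-modeling loss when labels are part of the input
model.compile(optimizer=optimizer)

model.fit(
    {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels},
    epochs=3,  # Adjust the number of epochs as needed
)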
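
A causal language model is trained to predict each next token, so the training targets are the input IDs themselves, with padded positions excluded from the loss. A common convention in Transformers is to mark those positions with -100. The plain-Python sketch below shows the idea on made-up token IDs; `build_labels` is a hypothetical helper, not a library function.

```python
IGNORE_INDEX = -100  # label value that the language-modeling loss skips


def build_labels(input_ids: list[list[int]], attention_mask: list[list[int]]) -> list[list[int]]:
    """Copy the input IDs as labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [token if keep == 1 else IGNORE_INDEX for token, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Two toy sequences with made-up IDs; the second one is padded to length 4.
toy_input_ids = [[5, 6, 7, 8], [9, 10, 0, 0]]
toy_attention_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]

print(build_labels(toy_input_ids, toy_attention_mask))
# [[5, 6, 7, 8], [9, 10, -100, -100]]
```

Masking by attention mask rather than by token ID keeps real end-of-sequence tokens in the loss even when the same token is reused for padding.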

Lastly, save the fine-tuned model for later use.

In [None]:
model.save_pretrained("general_querygen_model")