## ThirdAI's Playground

In this notebook, we will show:

1. How to easily build a semantic QnA engine for all your documents with ThirdAI's BOLT engine.

2. (Optional) How to use your OpenAI key to get retrieval augmented answers from OpenAI.

3. How to teach your retrieval model with RLHF.

4. (Optional) How to save your models and export them to ThirdAI's Playground web-app to do interactive QnA and teach your model with RLHF.

In [None]:
# Activate your ThirdAI license

import thirdai
try:
    # Paste your license key between the quotes
    thirdai.licensing.activate("")
except Exception:
    print("You need a license key to use ThirdAI's library. Please request a trial license at https://www.thirdai.com/try-bolt/")

In [None]:
from thirdai import bolt
import os
import nltk
nltk.data.path.append("./data/")
from pathlib import Path
import pickle
from student import models, documents, loggers, teachers, qa
from student.state import State

In [None]:
# how many search results do you want to retrieve from your files for every query

N_REFERENCES = 5

### Model Definition
#### Option 1: Define a model from scratch

In [None]:
# Define a model from scratch

state = State()
state.documents = documents.DocList()
state.model = models.Mach(id_col="id", query_col="query")
state.logger = loggers.InMemoryLogger()

#### Option 2: Load from a checkpoint

In [None]:
# Load from a checkpoint.
# Note: if you load a checkpoint that was saved after training on a file, the data associated with that file is loaded as well.
# You can clear the data associated with a checkpoint by running "state.documents = documents.DocList()" after "state.load(path_to_checkpoint)".

path_to_checkpoint = "/Users/tharunkr/qt-app/base_models/checkpoints_without_bias/msmarco_0_reindexes_frozen.zip"

state = State()
state.load(path_to_checkpoint)

### Load your files

#### Option 1: PDF or DOCX files

In [None]:
filenames = ['/Users/tharunkr/Desktop/mutual_nda_teamplate_for_testing.pdf']

pdfs = [name for name in filenames if name.endswith(".pdf")]
docxs = [name for name in filenames if name.endswith(".docx")]

combined_pdfs = ""
combined_docxs = ""

if len(pdfs)>0:
    combined_pdfs = documents.PDF(
        files=pdfs, 
        expected_id_col=state.model.get_id_col(),
        hash_to_id_offset=state.documents.get_source_hash_to_id_offset_map(),
        next_id_offset=state.documents.get_n_new_ids(),
    )

if len(docxs)>0:
    combined_docxs = documents.DOCX(
        files=docxs, 
        expected_id_col=state.model.get_id_col(),
        hash_to_id_offset=state.documents.get_source_hash_to_id_offset_map(),
        next_id_offset=state.documents.get_n_new_ids(),
    )
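The cell above selects files with a case-sensitive `endswith` check, so a file named `REPORT.PDF` would be skipped without any warning. A minimal sketch of a more forgiving split is shown here; `group_by_extension` is a hypothetical helper written for this notebook, not part of the `student` library, and the file names are made up.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension(filenames, allowed=(".pdf", ".docx")):
    """Group paths by lowercased extension; everything else goes under 'unsupported'."""
    groups = defaultdict(list)
    for name in filenames:
        suffix = Path(name).suffix.lower()
        key = suffix if suffix in allowed else "unsupported"
        groups[key].append(name)
    return dict(groups)


# Made-up file names, for illustration only
example = group_by_extension(["a.pdf", "B.PDF", "notes.docx", "table.csv"])
print(example)
```

The lists under `".pdf"` and `".docx"` can then be passed as `files=` exactly as `pdfs` and `docxs` are in the cell above, and the `"unsupported"` list shows which inputs were left out.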

#### Option 2: CSV files

In [None]:
csv_file = "/Users/tharunkr/qt-app/checkpoint 2023-05-29 10:03:42.663941/documents/0/train.csv"

# Inspect the dataframe to see the column names in csv_file.
# You will need to choose strong_columns and weak_columns for the training step shown next.
# Strong columns are usually the most important ones, such as document titles, keywords, or categories.
# Weak columns are usually the long descriptions.

import pandas as pd
pd.options.display.max_colwidth = 700

df = pd.read_csv(csv_file)
df.iloc[0:1]
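One rough way to decide between strong and weak columns is to compare how long the text in each column tends to be: short, dense columns are candidates for strong columns, and long free-text columns are candidates for weak ones. The sketch below is only a heuristic under that assumption. It uses the standard `csv` module so it runs on its own, the sample rows are invented, and `average_lengths` is a hypothetical helper rather than a ThirdAI function.

```python
import csv
import io


def average_lengths(csv_text):
    """Return {column: mean character length of its cells}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    totals = {name: 0 for name in (reader.fieldnames or [])}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in totals:
            totals[name] += len(row[name] or "")
    return {name: (total / n_rows if n_rows else 0.0) for name, total in totals.items()}


# Invented sample data; replace with the contents of your own CSV file
sample = (
    "id,title,passage\n"
    "0,Mutual NDA,This agreement is made by and between company A and company B.\n"
    "1,Liability,The liability of each party is limited as described in this section.\n"
)
lengths = average_lengths(sample)
for column, mean_len in sorted(lengths.items(), key=lambda item: item[1]):
    print(f"{column}: {mean_len:.1f}")
```

Here `title` is short and `passage` is long, which matches the rule of thumb in the comments above. The final choice still depends on what the columns mean in your data.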

### Train the model
state.model.train


In [None]:
state.model.cold_start(
    strong_columns = ["passage"],
    weak_columns = [""]
)

### How to teach your model (RLHF)

This is one of the marquee features that we provide. Because our training is highly efficient, you can teach the retrieval model to correct itself whenever it fails to retrieve the right paragraphs from a document. To this end, we provide two functions:

1. Associate: Using this function, you can associate two phrases so that they give similar results. For example, assume you work in contract review and want to ask a question like "who are the parties involved in this contract?". Most contracts, however, use the phrase "made by and between" to introduce the parties (as in "this agreement is made by and between company A and company B"). In this scenario, you can simply call *model.associate(["parties involved","made by and between"])* and the model will learn the relation. In subsequent documents, you are then more likely to retrieve the passage containing the correct information.

2. Upvote: Let's say you searched for the query "is there a limited liability clause?" and got 5 search results (along with their passage IDs). If you know that the correct result is actually the 2nd one rather than the 1st, you can simply call *model.upvote("is there a limited liability clause",passage_id_of_the_best_search_result)*.
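To make the effect of these two kinds of feedback concrete, here is a toy illustration. It is not ThirdAI's model or API: `ToyRetriever` is a hypothetical word-overlap ranker written for this notebook, the passages are invented, and the scoring rule (shared words plus a fixed bonus of 10 per upvote) is an assumption chosen only to keep the example small.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class ToyRetriever:
    """Toy ranker: score = shared words + upvote bonus. Illustration only."""

    def __init__(self, passages):
        self.passages = passages      # {passage_id: text}
        self.associations = {}        # phrase -> set of related phrases
        self.bonus = {}               # (normalized query, passage_id) -> bonus

    def associate(self, pair):
        first, second = (" ".join(tokenize(p)) for p in pair)
        self.associations.setdefault(first, set()).add(second)
        self.associations.setdefault(second, set()).add(first)

    def upvote(self, query, passage_id):
        key = (" ".join(tokenize(query)), passage_id)
        self.bonus[key] = self.bonus.get(key, 0) + 10

    def search(self, query, top_k=5):
        normalized = " ".join(tokenize(query))
        words = set(normalized.split())
        # Expand the query with every phrase associated with one it contains
        for phrase, related in self.associations.items():
            if phrase in normalized:
                for other in related:
                    words.update(other.split())
        scores = {}
        for pid, text in self.passages.items():
            overlap = len(words & set(tokenize(text)))
            scores[pid] = overlap + self.bonus.get((normalized, pid), 0)
        # Highest score first; ties are broken by the smaller passage id
        return sorted(scores, key=lambda pid: (-scores[pid], pid))[:top_k]


retriever = ToyRetriever({
    0: "Involved staff must keep this contract confidential.",
    1: "This agreement is made by and between company A and company B.",
    2: "Each party's liability is capped at direct damages.",
})

q1 = "who are the parties involved in this contract?"
before_associate = retriever.search(q1)
retriever.associate(["parties involved", "made by and between"])
after_associate = retriever.search(q1)

q2 = "is there a limited liability clause?"
before_upvote = retriever.search(q2)
retriever.upvote(q2, 2)
after_upvote = retriever.search(q2)

print(before_associate, after_associate)
print(before_upvote, after_upvote)
```

In this toy setting, the association moves passage 1 (the "made by and between" sentence) to the top for the first query, and the upvote moves passage 2 from second place to first for the second query. The real model learns these relations through training rather than through a lookup table, so treat this only as a mental model of what each call is meant to achieve.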

We provide two interfaces to do the teaching.

1. You can save a checkpoint to your trained model and export it to our Playground web-app to do QnA and teaching.

2. You can skip to the cell titled "RLHF using function calls" below.

### 1. Export your model to the Playground app

ThirdAI's Playground is a dockerized Gradio app that you can run on your laptop. It works with any model checkpoint and lets you do QnA and teach the model using the functions described above.

Before you save your checkpoint, please go through the following short video tutorial to install Docker Desktop, download our image, and run the web-app in a container.

https://drive.google.com/file/d/16tI1OAm2Lu0OuUOCiJzGrTjiBZejJWs3/view

In [None]:
# Save your state (includes the model and the data)

from student import models
from student.documents import DocList
from student.state import State
from student.loggers import InMemoryLogger

checkpoint_location = "checkpoint_custom_name.zip"

state.save(checkpoint_location)

After you save the checkpoint, please copy the .zip file to the folder from which you are running the docker container. Then go through this short video tutorial to do QnA and teach.

https://drive.google.com/file/d/1WIt2-EpYkQJpFgFiUXbc_iYU9uhOJdMn/view

### 2. RLHF using function calls

If you do not want to use the dockerized web-app, you can continue from the above