## RAG Challenge (Online-Tuning)

In this notebook, we show the benefits of online fine-tuning with NeuralDB on a dataset called Amazon-50K. The results suggest that users' search preferences are highly task-specific, and that off-the-shelf embedding models and vector databases do not capture all of these nuances. NeuralDB addresses this by letting you tune retrieval on your own task.

This dataset is curated by taking the products common to:

1. The [AmazonTitles-1.3MM](https://drive.google.com/file/d/12zH4mL2RX8iSvH0VCNnd3QxO4DzuHWnK/view) dataset from the [Extreme Classification Repository](http://manikvarma.org/downloads/XC/XMLRepository.html).
2. [3 million Amazon product catalog](https://www.kaggle.com/datasets/piyushjain16/amazon-product-data) from Kaggle.

In [None]:
!pip3 install thirdai
!pip3 install thirdai[neural_db]

In [None]:
import pandas as pd
from thirdai import neural_db as ndb, licensing

import nltk
nltk.download("punkt")

import os

if "THIRDAI_KEY" in os.environ:
    licensing.activate(os.environ["THIRDAI_KEY"])
else:
    licensing.activate("")  # Enter your ThirdAI key here

### Download Dataset

In [None]:
data_dir = "./amazon-50K/"
# Create the data directory if it does not exist yet.
os.makedirs(data_dir, exist_ok=True)

# The URLs are single-quoted so the shell does not treat "&" as a background operator.
os.system("wget -O "+data_dir+"unsup.csv 'https://www.dropbox.com/scl/fi/0ouziv3n4cf5xfo9zyd2x/unsup.csv?rlkey=7qgnyj6ye6o293oc2t9du2d5o&dl=0'")
os.system("wget -O "+data_dir+"trn_sup.csv 'https://www.dropbox.com/scl/fi/66xpkhj6jt6lmx6kmrquk/trn_sup.csv?rlkey=mitkjdxp6pts5xtdu9w5o7wqf&dl=0'")
os.system("wget -O "+data_dir+"tst_sup.csv 'https://www.dropbox.com/scl/fi/o268pp6y6ynmtlgpolfk6/tst_sup_trimmed.csv?rlkey=bjeimrmv0l1rq6a3go5pl4os6&dl=0'")
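
The download cell does not check whether the files arrived intact, so a quick sanity check may be useful before moving on. The sketch below is not part of the original notebook: `missing_columns`, `check_dataset`, and `EXPECTED_COLUMNS` are hypothetical helper names, and the expected headers are simply the column names used by the later cells (`DOC_ID`, `TITLE`, `DESCRIPTION`, `QUERY`).

```python
import csv
import os

# Column names taken from the later cells of this notebook (assumed to be the CSV headers).
EXPECTED_COLUMNS = {
    "unsup.csv": ["DOC_ID", "TITLE", "DESCRIPTION"],
    "trn_sup.csv": ["QUERY", "DOC_ID"],
}


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return [col for col in required if col not in header]


def check_dataset(data_dir, expected=EXPECTED_COLUMNS):
    """Map each file name to its missing columns, or to None if the file is absent."""
    report = {}
    for name, cols in expected.items():
        path = os.path.join(data_dir, name)
        report[name] = missing_columns(path, cols) if os.path.isfile(path) else None
    return report
```

Calling `check_dataset("./amazon-50K/")` should return an empty list for each file when the download succeeded; a `None` entry or a non-empty list suggests the file is missing or is not the expected CSV (for example, an HTML page saved in its place).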

### Initialize a NeuralDB

In [None]:
db = ndb.NeuralDB(id_delimiter=":")

### Load a document

In [None]:
doc = ndb.CSV(data_dir+"unsup.csv", id_column="DOC_ID", strong_columns=["TITLE"], weak_columns=["DESCRIPTION"])

### Insert the document into NeuralDB

In [None]:
source_ids = db.insert([doc], train=True)

In the previous step, setting `train=True` tells NeuralDB to tune directly on the inserted product data. We will now show how to do feedback-driven tuning.

### Load the Feedback Data

In [None]:
sup_doc = ndb.Sup(
    data_dir+"trn_sup.csv",
    query_column="QUERY",
    id_column="DOC_ID",
    id_delimiter=":",
    source_id=source_ids[0],
)
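
The arguments passed to `ndb.Sup` above suggest that each row of the feedback file pairs a query with one or more relevant product ids joined by `:`. As a rough illustration of that layout, the sketch below parses two made-up rows (they are not taken from the dataset). `parse_feedback` is a hypothetical helper, and treating the ids as integers is an assumption.

```python
import csv
import io

# Made-up rows for illustration only; the real trn_sup.csv may differ in detail.
SAMPLE = "QUERY,DOC_ID\nwireless mouse,12:407\nusb c cable,93\n"


def parse_feedback(text, query_column="QUERY", id_column="DOC_ID", id_delimiter=":"):
    """Return (query, [relevant ids]) pairs from CSV text in the assumed layout."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        (row[query_column], [int(x) for x in row[id_column].split(id_delimiter)])
        for row in rows
    ]


pairs = parse_feedback(SAMPLE)
for query, ids in pairs:
    print(query, ids)
```

A query with several relevant products therefore needs only one row, with all of its ids in the `DOC_ID` field.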

### Train the DB with the feedback data

In [None]:
db.supervised_train([sup_doc], learning_rate=0.001, epochs=10)

### Make Predictions and calculate metrics

In [None]:
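df_test = pd.read_csv(data_dir+"tst_sup.csv")

# Collect the top-1 predicted document id for every test query.
# Assumes the test CSV names its query column 'query'; adjust if the header differs.
all_preds = []
for i in range(df_test.shape[0]):
    results = db.search(df_test['query'][i], top_k=1)
    top_pred = results[0].id
    all_preds.append(top_pred)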
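
The cell above collects predictions but stops short of computing a metric. A minimal sketch of Precision@1 follows, assuming the ground-truth ids for each test query are stored as a `:`-delimited string, as in the training file. `precision_at_1` is a hypothetical helper, not a NeuralDB function.

```python
def precision_at_1(predictions, relevant_ids, id_delimiter=":"):
    """Fraction of queries whose top prediction is among that query's relevant ids."""
    predictions = list(predictions)
    relevant_ids = list(relevant_ids)
    if len(predictions) != len(relevant_ids):
        raise ValueError("predictions and relevant_ids must have the same length")
    if not predictions:
        return 0.0
    hits = 0
    for pred, rel in zip(predictions, relevant_ids):
        # Compare as strings so integer ids and "12:407"-style labels both work.
        gold = {part.strip() for part in str(rel).split(id_delimiter)}
        if str(pred) in gold:
            hits += 1
    return hits / len(predictions)


# Tiny made-up example: the first and third predictions are correct.
score = precision_at_1([1, 5, 7], ["1:2", "3", "7"])
print(f"Precision@1: {score:.2%}")
```

In the notebook this could be called as `precision_at_1(all_preds, df_test["DOC_ID"])`, where the `DOC_ID` column name for the test file is an assumption carried over from the training file.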



### Comparisons

| Retrieval system | Precision@1 |
| --- | --- |
| Elasticsearch | 6.6% |
| ChromaDB (all-MiniLM-L6-v2) | 9.98% |
| NeuralDB (with online tuning) | 42% |