
# Real-Time Query Reformulation a.k.a Query Correction using the UniversalDeepTransformer API

This notebook shows how to build a query reformulation model with ThirdAI's Universal Deep Transformer (UDT) model, our all-purpose solution for classification tasks on tabular datasets and query reformulation. In this demo, we will train and evaluate the model on a spelling correction dataset and show less than 5ms P99.9 inference latency.

You can immediately run a version of this notebook in your browser on Google Colab at the following link:

https://githubtocolab.com/ThirdAILabs/Demos/blob/main/QueryReformulation.ipynb

In [51]:
!pip3 install datasets==2.6.2
!pip3 install gradio
!pip3 install thirdai --upgrade



This notebook uses an activation key that will only work with this demo. If you want to try us out on your own dataset, you can obtain a free trial license at the following link: https://www.thirdai.com/try-bolt/

In [12]:
import thirdai
thirdai.licensing.activate("AWK9-WPMK-3NRE-AAAV-C39P-N9JV-43VC-CFUH")

AttributeError: module 'thirdai.licensing' has no attribute 'activate'

## Dataset Download

We will use the demos module in the ThirdAI repo to download and pre-process a dataset from HuggingFace. This dataset is typically used for semantic sentence similarity, so we pre-process it by adding noise to make it suitable for query reformulation. You can replace this step and the next with a UDT initialization specific to your own dataset, as long as your training data consists of **CSV files with two string columns**: the first column holds the incorrect queries and the second holds their target reformulations. Entries in the incorrect-queries column may be empty strings.
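If you want to bring your own data, the two-column layout described above can be sketched as follows. This is only an illustration: the adjacent-character swap in `add_noise` is a hypothetical stand-in and is not necessarily the noise that `prepare_query_reformulation_data` applies, the file name `my_train_file.csv` is made up, and the header row with `source_queries` and `target_queries` is an assumption based on the column names used later in this notebook.

```python
import csv
import random


def add_noise(query, rng, swap_prob=0.2):
    """Return a copy of `query` with some adjacent characters swapped.

    Hypothetical noise model, used here only to produce misspelled queries.
    """
    chars = list(query)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def write_training_csv(path, target_queries, seed=0):
    """Write (noisy source, clean target) pairs as a two-column CSV."""
    rng = random.Random(seed)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Assumed header: the column names passed to UDT in this notebook.
        writer.writerow(["source_queries", "target_queries"])
        for target in target_queries:
            writer.writerow([add_noise(target, rng), target])


write_training_csv(
    "my_train_file.csv",
    ["air quality health alert", "heavy rains hit mumbai"],
)
```

The resulting file has one incorrect query and one target reformulation per row, which is the shape the paragraph above asks for.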

In [17]:
from thirdai.demos import prepare_query_reformulation_data

train_filename, test_filename, inference_batch = prepare_query_reformulation_data()

Using custom data configuration embedding-data--sentence-compression-d643585deb6e0073
Found cached dataset json (/Users/gyan/.cache/huggingface/datasets/embedding-data___json/embedding-data--sentence-compression-d643585deb6e0073/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab)



## UDT Initialization

We can create a UDT model for query reformulation by specifying the name of the source column (optional; the column containing the queries to be reformulated), the name of the target column (the correct reformulations), and a dataset size parameter. The dataset size can be set to "small" (size < 1M), "medium" (size < 10M), or "large" (size >= 10M), and different model parameters are configured for each setting.

In [18]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    source_column="source_queries", target_column="target_queries", dataset_size="medium"
)
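The size thresholds described above can be captured in a small helper. This is a sketch only: `choose_dataset_size` is a hypothetical function that is not part of the ThirdAI API, and it assumes that "size" refers to the number of rows in the training file.

```python
def choose_dataset_size(num_rows):
    """Map a row count to the dataset_size string described above.

    Hypothetical helper; assumes "size" means the number of training rows.
    """
    if num_rows < 1_000_000:
        return "small"
    if num_rows < 10_000_000:
        return "medium"
    return "large"


for rows in (500_000, 5_000_000, 50_000_000):
    print(rows, choose_dataset_size(rows))
```

The returned string could then be passed as the `dataset_size` argument when constructing the model.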

## Training

We can now train our model in just one line of code. You just have to specify the path to the training file. 

In [19]:
model.train(filename=train_filename) 

loading data | source 'train_file.csv'
loaded data | source 'train_file.csv' | vectors 900000 | batches 90 | time 4s | complete

train | time 167s | complete



## Evaluation 

Evaluating the UDT model is also just one line of code. Because this UDT model is specific to query reformulation, you need to specify how many candidate reformulations it should generate for each query. For instance, to see the top 5 suggested reformulations of each input query, set the `top_k` parameter to 5. Evaluation also prints Recall@k. If you also want the scores for the reformulated queries, pass `return_scores=True`.

In [20]:
query_reformulations, scores = model.evaluate(filename=test_filename, top_k=5, return_scores=True)

loading data | source 'test_file.csv'
loaded data | source 'test_file.csv' | vectors 270000 | batches 27 | time 1s | complete

evaluate | {Recall@5: 0.887} | time 54s | complete                         
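To make the Recall@k figure concrete, here is a minimal sketch of the usual definition: the fraction of queries whose target reformulation appears among the top k candidates. This is an assumption about the metric, not a description of how UDT computes it internally, and `recall_at_k` along with the sample data are hypothetical.

```python
def recall_at_k(candidates, targets, k):
    """Fraction of targets found among the first k candidates for each query.

    candidates: one ranked list of candidate strings per query
    targets:    the expected reformulation for each query
    """
    if len(candidates) != len(targets):
        raise ValueError("candidates and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for cands, target in zip(candidates, targets) if target in cands[:k]
    )
    return hits / len(targets)


# Made-up data: three queries with ranked candidates and their true targets.
candidates = [
    ["air quality alert", "air quality index"],
    ["heavy rain in delhi", "heavy rains hit mumbai"],
    ["stock market news"],
]
targets = ["air quality alert", "heavy rains hit mumbai", "weather forecast"]

print(recall_at_k(candidates, targets, k=1))
print(recall_at_k(candidates, targets, k=2))
```

Raising k can only keep recall the same or increase it, since each query is checked against a longer candidate list.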



## Saving and Loading

Saving a trained UDT model to disk and loading it back is also straightforward.

In [21]:
model_location = "query_reformulation.model"

# Saving
model.save(filename=model_location)

# Loading
model = bolt.UniversalDeepTransformer.load(model_location)

## Testing Predictions 

The `evaluate` method is great for testing, but it requires labels, which don't exist in a production environment. For easy integration into production pipelines, we also provide `predict_batch`, which takes a list of queries, and `predict`, which takes a single query.

In [22]:
predictions, = model.predict_batch(queries=inference_batch, top_k=5)

In [27]:
out = model.predict(query="Health iiaclfsfo susei der air quality alert", top_k=5)
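One way to check a tail-latency figure such as P99.9 on your own hardware is to time many single-query calls and read off percentiles. The sketch below is an assumption about how such a measurement could be done, not the benchmark behind the number quoted in the introduction. `measure_latency_ms` and `percentile` are hypothetical helpers, and a trivial stand-in function replaces the model so that the snippet runs on its own.

```python
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct is in (0, 100]."""
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def measure_latency_ms(predict_fn, queries, repeats=100):
    """Time predict_fn on every query `repeats` times; report milliseconds."""
    latencies = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            predict_fn(query)
            latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {
        "p50": percentile(latencies, 50),
        "p99.9": percentile(latencies, 99.9),
        "max": latencies[-1],
    }


# Stand-in predictor so the sketch is self-contained. With a trained model,
# pass something like: lambda q: model.predict(query=q, top_k=5)
stats = measure_latency_ms(lambda q: q.lower(), ["Havy rains hae Mumabai"])
print(stats)
```

A P99.9 estimate needs well over a thousand timed calls to be meaningful, so increase `repeats` or the number of queries accordingly.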

In [50]:
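import gradio as gr
import pandas as pd


def generate(text, top_k):
    # The slider may pass a float, so cast top_k to int before predicting.
    result = model.predict(query=text, top_k=int(top_k))
    # Take the list of reformulated queries for the single input query.
    return pd.DataFrame(data=result[0][0], columns=["Queries"])


# Each example row supplies a value for both inputs: the query and top_k.
examples = [
    ["Health iiaclfsfo susei der air quality alert", 5],
    ["Havy rains hae Mumabai", 5],
]

demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(lines=1, label="Input Query"),
        gr.Slider(5, 10, step=1, label="Top K"),
    ],
    outputs=gr.DataFrame(label="Reformulated Query"),
    examples=examples,
)

demo.launch()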



Running on local URL:  http://127.0.0.1:7874

To create a public link, set `share=True` in `launch()`.


