
# Real-Time Query Reformulation (a.k.a. Query Correction) Using the UniversalDeepTransformer API

This notebook shows how to build a query reformulation model with ThirdAI's Universal Deep Transformer (UDT), our all-purpose solution for classification tasks on tabular datasets and for query reformulation. In this demo, we will train and evaluate the model on a spelling correction dataset and show a P99.9 inference latency of less than 5 ms.

You can immediately run a version of this notebook in your browser on Google Colab at the following link:

https://githubtocolab.com/ThirdAILabs/Demos/blob/main/QueryReformulation.ipynb

This notebook uses an activation key that will only work with this demo. If you want to try us out on your own dataset, you can obtain a free trial license at the following link: https://www.thirdai.com/try-bolt/

In [4]:
# !pip3 install datasets==2.6.1 
# !pip3 install thirdai --upgrade

import thirdai
# thirdai.activate("AWK9-WPMK-3NRE-AAAV-C39P-N9JV-43VC-CFUH")

## Dataset Download

We will use the demos module in the ThirdAI repo to download and pre-process a dataset from HuggingFace. This dataset is typically used for semantic sentence similarity, so we pre-process it by adding noise to make it suitable for query reformulation. You can replace this step and the next with a UDT initialization specific to your own dataset, as long as your training data consists of **CSV files with two string columns**: the first column should contain the incorrect queries and the second column their target reformulations. The incorrect-queries column may contain empty strings.
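
As a rough illustration of the two-column format described above, the sketch below writes a tiny CSV with Python's standard `csv` module. The example pairs, the output path, and the presence of a header row are assumptions made for illustration; the column names are the ones passed to the UDT constructor later in this notebook.

```python
import csv
import os
import tempfile

# Hypothetical example pairs: (incorrect query, target reformulation).
# The last pair uses an empty source string, which the format allows.
pairs = [
    ("helth officals isue air qualty alert", "Health officials issue air quality alert"),
    ("aple sels bonds in recrd deal", "Apple sells bonds in record deal"),
    ("", "Air quality alert issued for Friday"),
]

# Hypothetical output location; any writable path works.
train_path = os.path.join(tempfile.mkdtemp(), "my_supervised_train.csv")

with open(train_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["source_queries", "target_queries"])  # assumed header row
    writer.writerows(pairs)

# Read the file back to confirm every row has exactly two string fields.
with open(train_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
```

Using the `csv` module rather than manual string joins means queries that contain commas or quotes are escaped correctly.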

In [5]:
from thirdai.demos import prepare_query_reformulation_data
import pandas

supervised_train_filename, unsupervised_train_filename, test_filename, inference_batch = prepare_query_reformulation_data()

Using custom data configuration embedding-data--sentence-compression-d643585deb6e0073
Reusing dataset json (/Users/shubh/.cache/huggingface/datasets/embedding-data___json/embedding-data--sentence-compression-d643585deb6e0073/0.0.0/a3e658c4731e59120d44081ac10bf85dc7e1388126b92338344ce9661907f253)
100%|██████████| 1/1 [00:00<00:00, 14.53it/s]


## UDT Initialization

We can create a UDT model specific to query reformulation by specifying the name of the source column (the column containing the queries to be reformulated), the name of the target column (the correct reformulations), and a dataset size parameter. The dataset size can be set to "small" (size < 1M), "medium" (size < 10M), or "large" (size >= 10M). We configure different model parameters depending on the size of the input dataset.
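
The thresholds above can be turned into a small helper for picking the label programmatically. This is only a sketch: `choose_dataset_size` is a hypothetical name, not part of the ThirdAI API, and it assumes that "size" refers to the number of training rows.

```python
def choose_dataset_size(num_rows: int) -> str:
    """Map a training-set row count to the size label described above.

    Hypothetical helper; assumes 'size' means the number of training rows.
    """
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    if num_rows < 1_000_000:
        return "small"
    if num_rows < 10_000_000:
        return "medium"
    return "large"
```

The returned string could then be passed as the `dataset_size` argument when constructing the model.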

In [6]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    source_column="source_queries", target_column="target_queries", dataset_size="medium"
)

## Training

We can now train our model in just one line of code. You only have to specify the path to the training file.
You can train the model in both **supervised** and **unsupervised** settings. When training in an unsupervised setting, only the target column is needed in the training file.

*supervised_train_filename* has two columns, `target_queries,source_queries`, whereas *unsupervised_train_filename* has only `target_queries`.
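
If you only have a supervised file for your own data, an unsupervised file with the layout described above can be derived by keeping just the target column. The sketch below uses the standard `csv` module; the file names and the two example rows are made up for illustration and are not the files produced by `prepare_query_reformulation_data`.

```python
import csv
import os
import tempfile

workdir = tempfile.mkdtemp()
supervised_path = os.path.join(workdir, "supervised.csv")      # hypothetical name
unsupervised_path = os.path.join(workdir, "unsupervised.csv")  # hypothetical name

# Tiny made-up supervised file with the header described above.
with open(supervised_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["target_queries", "source_queries"])
    writer.writeheader()
    writer.writerow({
        "target_queries": "Air quality alert issued for Friday",
        "source_queries": "air qualty alrt isued for friday",
    })
    writer.writerow({
        "target_queries": "Apple sells bonds in record deal",
        "source_queries": "aple sels bonds in recrd deal",
    })

# Copy only the target column into the unsupervised file.
with open(supervised_path, newline="", encoding="utf-8") as src, \
        open(unsupervised_path, "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["target_queries"])
    writer.writeheader()
    for row in reader:
        writer.writerow({"target_queries": row["target_queries"]})

# Read the result back for inspection.
with open(unsupervised_path, newline="", encoding="utf-8") as f:
    unsupervised_rows = list(csv.DictReader(f))
```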

In [7]:
# supervised training
model.train(filename=supervised_train_filename) 

loading data | source 'supervised_train_file.csv'
loading data | source 'supervised_train_file.csv' | vectors 900000 | batches 90 | time 4s | complete

train | batches 90 | complete                                           



In [8]:
# unsupervised training
model.train(filename=unsupervised_train_filename)

loading data | source 'unsupervised_train_file.csv'
loading data | source 'unsupervised_train_file.csv' | vectors 900000 | batches 90 | time 3s | complete

train | batches 90 | complete                                           



## Evaluation 

Evaluating the UDT model is also just one line of code. Since this UDT model is specific to query reformulation, you need to provide the number of candidate reformulations that the model should generate for each query. For instance, if you want to see the top 10 suggested reformulations of each input query, set the `top_k` parameter to 10. Evaluating the model also prints recall@k.

In [11]:
query_reformulations, = model.evaluate(filename=test_filename, top_k=5)

loading data | source 'test_file.csv'
loading data | source 'test_file.csv' | vectors 270000 | batches 27 | time 1s | complete

evaluate | batches 27 | complete                                           

Recall@5 = 0.88363
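
For reference, recall@k is commonly defined as the fraction of test queries whose target reformulation appears among the top k candidates. The sketch below implements that common definition in plain Python on made-up data; it is an assumption that the library computes the metric the same way, since the notebook does not show the exact computation, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(candidates_per_query, targets, k):
    """Fraction of queries whose target is among the first k candidates."""
    if len(candidates_per_query) != len(targets):
        raise ValueError("need exactly one target per query")
    if not targets:
        return 0.0
    hits = sum(
        1
        for candidates, target in zip(candidates_per_query, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Made-up candidate lists for two queries, ordered best first.
example_candidates = [
    ["a b", "a c", "a d"],
    ["x y", "x z", "x w"],
]
example_targets = ["a c", "x q"]

# Only the first query's target appears in its top 2 candidates.
example_recall = recall_at_k(example_candidates, example_targets, k=2)
```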


## Saving and Loading

Saving a trained UDT model to disk and loading it back is also straightforward.

In [None]:
model_location = "query_reformulation.model"

# Saving
model.save(filename=model_location)

# Loading
model = bolt.UniversalDeepTransformer.load(model_location)

## Testing Predictions 

The evaluation method is great for testing, but it requires labels, which don't exist in a production environment. For that case we also provide `predict`, which takes a single query, and `predict_batch`, which takes a list of queries. Both allow for easy integration into production pipelines.
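
When integrating into a production pipeline, tail latency per query is usually what matters, as with the P99.9 figure quoted in the introduction. The sketch below shows one way such a number could be measured for any single-query callable. `nearest_rank_percentile` and `tail_latency_ms` are hypothetical helpers, not part of the ThirdAI API, and the nearest-rank method is just one of several percentile conventions.

```python
import math
import time


def nearest_rank_percentile(values, percentile):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Smallest rank such that at least `percentile` percent of samples
    # are at or below the returned value, clamped to a valid index.
    rank = math.ceil(percentile * len(ordered) / 100)
    rank = min(max(rank, 1), len(ordered))
    return ordered[rank - 1]


def tail_latency_ms(fn, inputs, percentile=99.9):
    """Call fn(x) once per input and return the latency percentile in ms."""
    timings = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return nearest_rank_percentile(timings, percentile)


# Hypothetical usage with a trained model and a list of query strings:
# p999 = tail_latency_ms(lambda q: model.predict(query=q, top_k=5), queries)
```

A stable P99.9 estimate needs at least a few thousand timed calls, since it is determined by the slowest 0.1% of them.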

In [15]:
predictions, = model.predict_batch(queries=inference_batch, top_k=5)

In [16]:
model.predict(query="Health iiaclfsfo susei der air quality alert", top_k=5)

([['Health officials issue red air quality alert',
   'Air quality officials issue health notice',
   'Air quality alert issued for Friday',
   'Wife of Shane Osborn files for protection order',
   'Apple sells $17 billion in bonds in record deal']],)
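
The output above is a one-element tuple wrapping a list that holds one candidate list per query, which is why the earlier cells unpack results with a trailing comma. The sketch below walks through that structure using the output copied verbatim from the cell above. It assumes candidates are ordered from best to worst, which the notebook does not state explicitly.

```python
# Output copied from the single-query prediction above.
result = (
    [
        [
            "Health officials issue red air quality alert",
            "Air quality officials issue health notice",
            "Air quality alert issued for Friday",
            "Wife of Shane Osborn files for protection order",
            "Apple sells $17 billion in bonds in record deal",
        ]
    ],
)

# Unpack the one-element tuple, then take the candidates for the only query.
(candidates_per_query,) = result
top_candidates = candidates_per_query[0]

# Assumed ordering: the first candidate is the model's best suggestion.
best_candidate = top_candidates[0]

for rank, candidate in enumerate(top_candidates, start=1):
    print(f"{rank}. {candidate}")
```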