
# Query Reformulation using the UniversalDeepTransformer API

This notebook shows how to build a query reformulation model with ThirdAI's Universal Deep Transformer (UDT), our all-purpose API for classification tasks on tabular datasets and for query reformulation. In this demo, we train and evaluate the model on a spelling correction dataset.

To run this notebook, you need a ThirdAI license. If you do not have one yet, you can obtain it here: https://www.thirdai.com/try-bolt/

In [None]:
!pip3 install thirdai==0.5.4
!pip3 install pandas
!pip3 install numpy

## Dataset Download

We will use the utils module in this repository to download and pre-process a dataset from HuggingFace. The dataset is typically used for semantic sentence similarity, so we pre-process it by adding noise to make it suitable for query reformulation. You can replace this step and the next with a UDT initialization specific to your own dataset, as long as that dataset has one column of incorrect queries and one column of their target reformulations.

In [None]:
from utils import QueryReformulationDataProcessor
import pandas as pd

TRAIN_FILE_PATH = "queries.csv"

data_processor = QueryReformulationDataProcessor(
    dataset_name="embedding-data/sentence-compression"
)

# The perturbed dataset pairs each original text (the target) with an
# incorrect query formed by adding noise to that text (the source).
perturbed_dataset = data_processor.perturb_dataset()

# Add file header since the "train" and "evaluate" methods assume the
# input CSV file has a header.
perturbed_dataset.columns = ["target_column", "source_column"]
perturbed_dataset.to_csv(TRAIN_FILE_PATH, index=False)

In [None]:
perturbed_dataset
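
The noise that `QueryReformulationDataProcessor` applies is not shown in this notebook. As a rough illustration of the idea only, the sketch below assumes character-level typos (swapping or dropping one character per word). The helper `add_typo_noise` and the sample sentences are hypothetical and are not part of the utils module.

```python
import random


def add_typo_noise(text, rng, edits_per_word=1):
    """Return `text` with simple character-level typos (hypothetical helper)."""
    noisy_words = []
    for word in text.split():
        chars = list(word)
        for _ in range(edits_per_word):
            if len(chars) < 2:
                break  # Too short to perturb without emptying the word.
            i = rng.randrange(len(chars) - 1)
            if rng.random() < 0.5:
                # Swap two adjacent characters.
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
            else:
                # Drop one character.
                del chars[i]
        noisy_words.append("".join(chars))
    return " ".join(noisy_words)


# Build (target, source) pairs in the same column order used above.
rng = random.Random(0)
targets = [
    "The Beatles almost reformed",
    "Pakistan begins presidential poll process",
]
rows = [(target, add_typo_noise(target, rng)) for target in targets]
```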

## UDT Initialization

We can create a UDT model for query reformulation by specifying three things: the name of the source column (the column containing the queries to be reformulated), the name of the target column (the correct reformulations), and a `dataset_size` parameter. The dataset size can be set to "small", "medium" or "large", and UDT configures different model parameters depending on this setting.

In [None]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    source_column="source_column", target_column="target_column", dataset_size="medium"
)
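
Because the model is configured by column name, it can help to confirm that the training file's header really contains those names before training. The sketch below uses only the Python standard library. `check_columns` is a hypothetical helper, not part of the thirdai API, and the temporary file stands in for your own CSV.

```python
import csv
import os
import tempfile


def check_columns(csv_path, required_columns):
    """Return the CSV header, raising ValueError if a required column is missing."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    missing = [name for name in required_columns if name not in header]
    if missing:
        raise ValueError(f"{csv_path} is missing columns: {missing}")
    return header


# Demonstrate on a small throwaway file with the same header as queries.csv.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "queries.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["target_column", "source_column"])
        writer.writerow(["hello world", "helo wrld"])
    header = check_columns(path, ["source_column", "target_column"])
```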

## Training

We can now train our model in just one line of code. You just have to specify the path to the training file. 

In [None]:
model.train(filename=TRAIN_FILE_PATH)

## Evaluation 

Evaluating the UDT model is also just one line of code. Since this UDT model is specific to query reformulation, you need to provide the number of candidate queries that the model should suggest for each input. For instance, if you want to see the top 10 suggested reformulations of each input query, set the `top_k` parameter to 10. Evaluating the model also prints recall@k.

In [None]:
query_reformulations = model.evaluate(filename=TRAIN_FILE_PATH, top_k=5)
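
To make the metric concrete, here is a minimal sketch of how recall@k can be computed. It assumes the common definition, the fraction of queries whose target appears among the top k candidates. The return format of `model.evaluate` is not shown in this notebook, so the sketch works on plain Python lists, and `recall_at_k` and the sample data are hypothetical.

```python
def recall_at_k(candidates_per_query, targets, k):
    """Fraction of queries whose target is among the first k candidates."""
    if len(candidates_per_query) != len(targets):
        raise ValueError("Need exactly one candidate list per target.")
    if not targets:
        return 0.0
    hits = sum(
        1
        for candidates, target in zip(candidates_per_query, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Candidate lists are ordered from most to least likely.
candidates = [
    ["the beatles", "the beetles", "beat less"],
    ["muslim population", "muslin population"],
    ["pakistan begins", "pakistan begs"],
]
targets = ["the beetles", "muslim population", "pakistan wins"]

recall_top1 = recall_at_k(candidates, targets, k=1)  # 1 of 3 queries hit
recall_top2 = recall_at_k(candidates, targets, k=2)  # 2 of 3 queries hit
```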

## Saving and Loading

Saving a trained UDT model to disk and loading it back is also straightforward.

In [None]:
model_location = "query_reformulation.model"

# Saving (reuse the path defined above)
model.save(model_location)

# Loading
model = bolt.UniversalDeepTransformer.load(model_location)

## Testing Predictions 

The evaluation method is great for testing, but it requires labels, which don't exist in a production environment. We also provide a predict method that can take a list of queries or a single query, which allows for easy integration into production pipelines. 

In [None]:
incorrect_queries_list = [
    "ehT Beatles almost dformde while they were all still elive",
    "Muslim poplatio to growe at twice the rate",
    "Pakintan begins predesential poll process"
]
predictions = model.predict_batch(queries=incorrect_queries_list, top_k=5)



In [None]:
# Remove the created training file
import os

os.remove(TRAIN_FILE_PATH)