# Text Embeddings / Cold Start / Recommendation from Raw Text Using ThirdAI's UDT

Cold start is a common problem that most e-commerce companies deal with daily. Here, we demonstrate how to build a neural product search engine from the raw text in a product catalog. A catalog contains a product ID, a title, and optionally descriptions and metadata. We may not have any data about which products were purchased for a given query.

This notebook shows how to use ThirdAI's Universal Deep Transformer (UDT) to pretrain, or cold-start, a large neural model. The model can generate embeddings for any textual description or for the entities provided during training. It also provides a reasonable semantic search engine that is ready to use.

#### Get Text and Entities Embeddings
Once the cold-start model is pre-trained on raw text data, it provides two kinds of embeddings for use in downstream AI tasks. We can use the model to generate a domain-specific embedding for any given text string. During training, the model also learns an internal representation of each entity (such as a document or product).

#### Fine-tune on supervised query-product data (Optional)
If you have both a product catalog and query-product data, we first pre-train on just the catalog data (Cold Start), and the same model can be later fine-tuned on your query-product data for better results. Models pre-trained with cold start converge faster with significantly less supervised data than models trained from scratch on query-product data.
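
As a rough illustration of what query-product data could look like on disk, the sketch below writes (query, product IDs) pairs to CSV. It assumes the supervised file reuses the `QUERY` and `PRODUCT_ID` column names from this notebook's model definition and joins multiple product IDs with `;`. The helper name `write_query_product_csv` and the toy IDs are hypothetical, not part of ThirdAI's API.

```python
import csv
import io

def write_query_product_csv(pairs, out, delimiter=";"):
    """Write (query, [product_ids]) pairs as a two-column CSV.

    Hypothetical helper: the column names and the ';' separator are
    assumptions taken from the model definition used in this notebook.
    """
    writer = csv.writer(out)
    writer.writerow(["QUERY", "PRODUCT_ID"])
    for query, product_ids in pairs:
        # Several relevant products for one query share a single cell.
        writer.writerow([query, delimiter.join(str(p) for p in product_ids)])

# Toy pairs with made-up product IDs, written to an in-memory buffer.
buffer = io.StringIO()
write_query_product_csv(
    [("laptop bag", [12, 340]), ("washing machine", [723])],
    buffer,
)
csv_text = buffer.getvalue()
print(csv_text)
```

In practice you would pass an open file handle instead of the in-memory buffer and use IDs that match the `PRODUCT_ID` values in your catalog.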

## Dataset Download

We start with the Amazon Kaggle Product Catalog Dataset with 110K products. To make the demo run on a single-core Colab in a few minutes, we keep just the first 5% of the products (about 5,000). The cell below downloads the dataset; if you download it yourself, extract the file and set *catalog_file* to its path.

In [None]:
import pandas as pd
import os

os.system('wget -O amazon-kaggle-product-catalog.csv https://www.dropbox.com/s/9km4arjsjkevzw9/amazon-kaggle-product-catalog.csv?dl=0')

# path of the file written by the wget command above
catalog_file = "./amazon-kaggle-product-catalog.csv"

## for a quick demo, we keep just the first 5% of the products
def sample_catalog(catalog_file, percentage=0.05):
    df = pd.read_csv(catalog_file)
    # .copy() so the PRODUCT_ID assignment below modifies an independent frame
    df = df.iloc[:int(percentage * len(df))].copy()
    # re-number products 0..n-1 so the IDs can serve as integer targets
    df["PRODUCT_ID"] = range(len(df))
    #
    sampled_catalog_file = f"./amazon-kaggle-product-catalog-sampled-{percentage}.csv"
    df.to_csv(sampled_catalog_file, index=False)
    #
    return sampled_catalog_file, df

sampled_catalog_file, dataframe = sample_catalog(catalog_file, percentage=0.05)

In [None]:
## A sample row from the sampled catalog file is printed below. Please ensure that your file has data in the correct format.
pd.options.display.max_colwidth = 700
dataframe[dataframe["PRODUCT_ID"] == 300]

## UDT Initialization

Initialize a UDT model by specifying the input and output column names. Here, we intend to use the model for query-to-product prediction, so our input is a "QUERY" and the output is a "PRODUCT_ID". The "QUERY" column name can be anything you choose, but it has to be consistent with the prediction step (shown later). The "PRODUCT_ID" column name has to match the name in your catalog file (shown above).

Note: contextual_encoding has three options, 'local', 'global' and 'none'. If your queries are short in length (~5 tokens per query), 'global' tends to converge to a better accuracy. For longer queries, we would suggest using 'local', although you can experiment with either of them.
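
One way to apply the rule of thumb above is to look at the typical length of a sample of your queries. The sketch below is only a heuristic under stated assumptions: `suggest_contextual_encoding` is a hypothetical helper, whitespace splitting is a rough stand-in for the model's own tokenization, and the 5-token threshold simply mirrors the note.

```python
from statistics import median

def suggest_contextual_encoding(queries, short_query_tokens=5):
    """Suggest 'global' for short queries and 'local' for longer ones.

    Hypothetical helper; the threshold follows the ~5 tokens rule of thumb.
    """
    if not queries:
        raise ValueError("need at least one sample query")
    # Whitespace tokens are only a proxy for the model's real tokenizer.
    token_counts = [len(q.split()) for q in queries]
    return "global" if median(token_counts) <= short_query_tokens else "local"

sample_queries = ["laptop bag", "dust resistant laptop bag", "washing machine"]
encoding = suggest_contextual_encoding(sample_queries)
print(encoding)
```

The returned string could then be passed as `contextual_encoding` in the model definition. Treat it as a starting point and compare both options on your own data.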

In [None]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    data_types={
        "QUERY": bolt.types.text(contextual_encoding="local"),
        "PRODUCT_ID": bolt.types.categorical(delimiter=';'),
    },
    target="PRODUCT_ID",
    n_target_classes=dataframe.shape[0],
    integer_target=True,
)

## Cold Start Pretraining

The following cell cold-start trains the model on the catalog information. You can specify which columns in the catalog should have a strong influence on the eventual embeddings and which columns should have a weak influence. The column names should match those in the catalog file.

NOTE: Specifying the learning rate and the number of epochs is optional. We can automatically tune the training hyperparameters, but you can set them yourself if you wish.

In [None]:
model.cold_start(
    filename=sampled_catalog_file,
    strong_column_names=["TITLE"],
    weak_column_names=["DESCRIPTION", "BULLET_POINTS", "BRAND"],
    learning_rate=0.001,
    epochs=5,
    metrics=["categorical_accuracy"],
)

## Save and Load Model

In [None]:
model_save_path = './ust-cold-start-amzn-kaggle.model'

## save the model
model.save(model_save_path)

## load it back and keep a reference to the loaded model
model = bolt.UniversalDeepTransformer.load(model_save_path)

## Prediction

In [None]:
## Helper function to print the results
def top_k_products(query, k):
    result = model.predict({"QUERY": query})
    #
    k = min(k, len(result) - 1)
    sorted_product_ids = result.argsort()[-k:][::-1]
    #
    print('***************************************************')
    print('******************* STARTS HERE *******************')
    print('***************************************************')
    for p_id in sorted_product_ids:
        # `dataframe` is the sampled catalog returned by sample_catalog above
        print(dict(dataframe.iloc[p_id, [1, 2]]))
        print('******************************')
    #
    print('***************************************************')
    print('******************** ENDS HERE ********************')
    print('***************************************************')

In [None]:
## example 1
top_k_products('laptop bag', 5)

In [None]:
## example 2
top_k_products('dust resistant laptop bag', 5)
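
The helper above ranks products by sorting the score vector returned by the model. The sketch below shows the same top-k selection on a toy score list using only the standard library. It assumes one score per product ID, which is what the `argsort` call above implies; `top_k_indices` is a hypothetical name used for illustration.

```python
import heapq

def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    # Clamp k so that k = 0 yields an empty list and large k yields all indices.
    k = max(0, min(k, len(scores)))
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

# Toy scores for five products; index i plays the role of PRODUCT_ID i.
toy_scores = [0.05, 0.40, 0.10, 0.30, 0.15]
top3 = top_k_indices(toy_scores, 3)
print(top3)
```

One design note: with NumPy slicing, `[-k:]` returns the whole array when `k` is 0, so clamping `k` explicitly as done here avoids that edge case.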

## Get Product Embeddings

In [None]:
# get embeddings for a specific product

product_id = 723
product_embedding = model.get_entity_embeddings(product_id)

## Get a Query Embedding

In [None]:
query_emb = model.embedding_representation({'QUERY':'washing machine'})
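
A common downstream use of such embeddings is ranking candidates by cosine similarity. The sketch below uses small toy vectors in place of real model outputs. Whether query embeddings and product embeddings share the same dimension and vector space is an assumption you should verify for your model; the function names here are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, candidates):
    """Return (id, similarity) pairs sorted from most to least similar."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for a query embedding and three product embeddings.
toy_query = [1.0, 0.0, 1.0]
toy_products = {
    723: [0.9, 0.1, 0.8],
    300: [0.0, 1.0, 0.0],
    42: [-1.0, 0.0, -1.0],
}
ranking = rank_by_similarity(toy_query, toy_products)
print(ranking)
```

With real embeddings you would first convert the arrays to lists or swap in a vectorized implementation. The pure-Python version is only meant to make the computation explicit.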