# Text Embeddings / Cold Start / Recommendation from Raw Text Using ThirdAI's UDT

Cold start is a common problem that most e-commerce companies face daily. Here, we demonstrate how to build a neural product search engine from the raw text in a product catalog. A catalog contains a product ID, a title, descriptions (optional), and metadata (optional). We may not have any data about which products were purchased for a given query.

This notebook shows how to use ThirdAI's Universal Deep Transformer (UDT) to pre-train, or cold-start, a large neural model. The model can generate embeddings for any textual description or for the entities provided during training. The trained model also serves as a reasonable semantic search engine out of the box.
You can immediately run a version of this notebook in your browser on Google Colab at the following link:

https://githubtocolab.com/ThirdAILabs/Demos/blob/main/embeddings/EmbeddingsAndColdStart.ipynb

#### Get Text and Entity Embeddings
Once the cold-start model is pre-trained on raw text data, it produces two kinds of embeddings for use in downstream AI tasks. We can build embedding models of any dimension up to 50,000; this script produces 512-dimensional embeddings (the default). First, we can use the model to generate data-specific embeddings for any text string. Second, during training the model also creates internal representations of entities, such as the raw text documents or, in the current demo, the products.

#### Fine-tune on supervised query-product data (Optional)
Suppose you have both a product catalog and query-product data. In that case, we first pre-train on just the catalog data (cold start). The same pre-trained model can then be fine-tuned on the supervised query-product data for better results. Fine-tuning is standard supervised text classification: we load the UDT model and train it on labeled query-product pairs. Models pre-trained with cold start converge faster, and with significantly less supervised data, than models trained from scratch on query-product data.
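
As a rough illustration of what the supervised query-product data could look like, the sketch below writes a small CSV with the same `QUERY` and `PRODUCT_ID` column names used when the model is defined later in this notebook. The query-product pairs and the helper `write_supervised_csv` are hypothetical, made up purely for illustration; real pairs would come from your own click or purchase logs.

```python
import csv
import io

# Hypothetical query-product pairs: each row maps a user query to the
# PRODUCT_ID of a catalog item that was clicked or purchased for it.
supervised_pairs = [
    ("Laptop Bag", 417),
    ("birthday gifts", 723),
]

def write_supervised_csv(pairs, handle):
    """Write (query, product_id) pairs under the QUERY/PRODUCT_ID header."""
    writer = csv.writer(handle)
    writer.writerow(["QUERY", "PRODUCT_ID"])
    for query, product_id in pairs:
        # Lowercase the query to mirror the lowercasing applied to catalog text.
        writer.writerow([query.lower(), product_id])

buffer = io.StringIO()
write_supervised_csv(supervised_pairs, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice you would pass an open file handle instead.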

In [None]:
!pip3 install thirdai --upgrade

import thirdai
## activate the license (works only for this demo)
thirdai.licensing.activate("A4D695-FE9744-A1918F-536DC9-F8728E-V3")

## Dataset Download

We start with the Amazon Kaggle Product Catalog Dataset with 110K products. To make the demo run on a single-core Colab instance in a few minutes, we randomly sample just 5% of the products (about 5,000). The cell below downloads the dataset and stores its path in *catalog_file*; if you download and extract the file manually instead, set *catalog_file* to its location.

In [None]:
import pandas as pd
import os

os.system('wget -O amazon-kaggle-product-catalog.csv https://www.dropbox.com/s/kq5396ypmtagsyr/amzn-kaggle-product-catalog.csv?dl=0')

catalog_file = "./amazon-kaggle-product-catalog.csv"

## for a quick demo, we randomly sample 5% of the products (fixed seed for reproducibility)
def sample_catalog(catalog_file, percentage=0.05):
    df = pd.read_csv(catalog_file)
    df = df.sample(frac=percentage, random_state=43)
    # re-number the sampled products 0..n-1 so the IDs can serve as integer class labels
    df["PRODUCT_ID"] = range(len(df))
    df["TITLE"] = df["TITLE"].str.lower()
    df["DESCRIPTION"] = df["DESCRIPTION"].str.lower()
    df["BULLET_POINTS"] = df["BULLET_POINTS"].str.lower()
    df["BRAND"] = df["BRAND"].str.lower()
    #
    sampled_catalog_file = f"./amazon-kaggle-product-catalog-sampled-{percentage}.csv"
    df.to_csv(sampled_catalog_file, index=False)
    #
    return sampled_catalog_file, df

sampled_catalog_file, dataframe = sample_catalog(catalog_file, 0.05)

In [None]:
## A sample row from the sampled catalog file is printed below.
pd.options.display.max_colwidth = 700
dataframe[dataframe["PRODUCT_ID"] == 417]

## UDT Initialization

Initialize a UDT model by simply specifying the input and output column names. Here, we intend to use the model for query-to-product prediction, so our input is a "QUERY" and the output is a "PRODUCT_ID". The "QUERY" column name can be anything you choose, but it must be consistent with the prediction step (shown later). The "PRODUCT_ID" column name must match the name in your catalog file (shown above).

NOTE: *model_config* is optional. If it is not specified, we automatically initialize an appropriate model. Please reach out to contact@thirdai.com for more info on model configs.

In [None]:
from thirdai import bolt
import os

config_dir = os.path.join(os.path.abspath(""), "../configs/")

model = bolt.UniversalDeepTransformer(
    data_types={
        "QUERY": bolt.types.text(),
        "PRODUCT_ID": bolt.types.categorical(),
    },
    target="PRODUCT_ID",
    n_target_classes=dataframe.shape[0],
    integer_target=True,
    model_config=os.path.join(config_dir, "embeddings_and_cold_start.config")
)
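
The sampling step re-numbers `PRODUCT_ID` to 0..n-1, and the model above is created with `integer_target=True` and `n_target_classes` equal to the number of rows. Assuming integer targets are expected to fall in the range 0 to `n_target_classes - 1` (an assumption about UDT that this notebook does not state explicitly), a quick check like the following sketch can catch catalogs whose IDs were not re-numbered. The function name `ids_are_contiguous` and the example IDs are hypothetical.

```python
def ids_are_contiguous(product_ids):
    """Return True if the IDs are exactly 0, 1, ..., n-1 in some order."""
    ids = list(product_ids)
    return sorted(ids) == list(range(len(ids)))

# IDs as produced by re-numbering a sampled catalog: 0 .. n-1.
renumbered_ids = [0, 1, 2, 3]
# Hypothetical raw IDs that survived sampling without re-numbering.
raw_ids = [12, 7, 4051, 998]

print(ids_are_contiguous(renumbered_ids))  # True
print(ids_are_contiguous(raw_ids))         # False
```

With the real data, the same check could be run on `dataframe["PRODUCT_ID"]` before constructing the model.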

## Cold Start a.k.a. Pre-training

The following code starts the pre-training (or cold start) on raw catalog information. UDT lets you specify which columns in the catalog should strongly influence the eventual embeddings and which columns should have a weak influence. In the example below, we choose the product title to have a strong influence and everything else to have a weak influence. The column names must match those in the catalog file.

NOTE: Specifying the learning rate and number of epochs is optional. We can automatically tune the training hyperparameters, but you can set them yourself if you wish.
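
Because the strong and weak column names have to match the catalog file, it can help to verify them before starting a training run. The sketch below is one possible check using only the standard library; `missing_columns` is a hypothetical helper (not part of the ThirdAI API), and the toy catalog is an in-memory stand-in for the sampled CSV.

```python
import csv
import io

def missing_columns(csv_handle, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_handle), [])
    present = set(header)
    return [name for name in required if name not in present]

# Toy stand-in for the sampled catalog file; only the header matters here.
toy_catalog = io.StringIO("PRODUCT_ID,TITLE,DESCRIPTION,BULLET_POINTS\n0,a,b,c\n")

strong = ["TITLE"]
weak = ["DESCRIPTION", "BULLET_POINTS", "BRAND"]
print(missing_columns(toy_catalog, strong + weak))  # ['BRAND']
```

For the real file, you would open it with `open(sampled_catalog_file, newline="")` and pass the handle in place of the toy catalog.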

In [None]:
model.cold_start(
    filename=sampled_catalog_file,
    strong_column_names=["TITLE"],
    weak_column_names=["DESCRIPTION", "BULLET_POINTS", "BRAND"],
    learning_rate=0.001,
    epochs=5,
    metrics=["categorical_accuracy"],
)

## Save and Load Model

In [None]:
model_save_path = './udt-cold-start-amzn-kaggle.model'

## save the model
model.save(model_save_path)

## load it back
model = bolt.UniversalDeepTransformer.load(model_save_path)

## Prediction

In [None]:
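## Helper function to print the top-k predicted products for a query
def top_k_products(query, k):
    # the code below treats the result as one score per product
    result = model.predict({"QUERY": query})
    # cap k below the number of available scores
    k = min(k, len(result) - 1)
    # indices of the k highest scores, best first
    sorted_product_ids = result.argsort()[-k:][::-1]
    # print catalog columns 1 and 2 for each matched product
    for p_id in sorted_product_ids:
        print(dict(dataframe.iloc[p_id,[1,2]]))
        print('******************************')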

In [None]:
## example 1
top_k_products('laptop bag', 2)

In [None]:
## example 2
top_k_products('birthday gifts', 2)
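
The helper above selects the best matches by sorting the prediction scores and taking the last `k` indices in reverse. The following standard-library sketch reproduces just that index-selection step on a toy score list, assuming (as the helper does) that there is one score per product. The function `top_k_indices` and the scores are hypothetical; unlike the slicing approach, this version also returns an empty list when `k` is 0.

```python
import heapq

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, highest score first."""
    # Clamp k to the valid range so oversized or zero requests are safe.
    k = max(0, min(k, len(scores)))
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

# Hypothetical scores for a 5-product catalog.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
print(top_k_indices(scores, 2))  # [1, 3]
```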

## Get Entity Embeddings

In [None]:
# get the embedding for a specific product
# The product ID should match the catalog file used in the cold_start step

product_id = 723
product_embedding = model.get_entity_embedding(product_id)

## Get Semantic Embedding of any Text

In [None]:
query_emb = model.embedding_representation({'QUERY':'washing machine'})
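
One common way to use embeddings like these downstream is to rank items by cosine similarity to a query vector. The sketch below shows that idea with tiny made-up vectors rather than real model output; the helper names, the vectors, and the choice of cosine similarity are all assumptions for illustration, and whether text and entity embeddings from UDT are directly comparable in this way is not something this notebook confirms.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, candidates):
    """Sort candidate names by cosine similarity to query_vec, best first."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query_vec, candidates[name]),
        reverse=True,
    )

# Toy 3-dimensional vectors standing in for real 512-dimensional embeddings.
query_vec = [1.0, 0.0, 1.0]
candidates = {
    "washer": [0.9, 0.1, 0.8],
    "novel": [0.0, 1.0, 0.0],
    "dryer": [0.5, 0.5, 0.5],
}
print(rank_by_similarity(query_vec, candidates))  # ['washer', 'dryer', 'novel']
```

For a full catalog, a vectorized library or an approximate nearest-neighbor index would be a more practical choice than this pure-Python loop.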

## Visualize the Embeddings

If you want a super-cool 3D visualization of the embeddings that our UDT model learned for this dataset, please visit the following link. You can type a query into the search bar to get results, and you can click on any product to explore its neighborhood of related products.

http://20.221.80.155/amazon-catalog

If you want to build a similar visualization for your own datasets, please reach out to contact@thirdai.com. We will soon put up instructions to automatically generate the visualization with the trained embeddings.