# Cold Start Product Recommendation using ThirdAI's UDT

Cold start is a problem that most e-commerce companies face routinely. The goal is to build a product search engine from a product catalog that has titles, and optionally descriptions and metadata, when there is no data about which products were purchased for a given query.

In this notebook, we show how to use ThirdAI's Universal Deep Transformer (UDT) to train a model using just the product catalog and get good search results without any supervised data about which products were purchased, added to cart, or clicked for a given query.

#### Get Product Embeddings (Optional)
Once the cold-start model is trained, we show how to obtain product vectors (embeddings) for use in other downstream tasks with a single line of code.

#### Finetune on query-product data (Optional)
If you have both a product catalog and query-product data, we also show how easy it is to train on just the catalog first (cold start) and then fine-tune on your query-product data for better results. Models pre-trained with cold start converge faster and need much less supervised data than models trained from scratch on query-product data.

## Dataset Download

We train on the Amazon Kaggle Product Catalog Dataset, which has 110K products. For the purpose of this demo, we randomly sample just 5% of the products and build a cold-start search engine with UDT. Please download the dataset, extract the downloaded file, and specify the file path as *original_product_catalog_file* below.

In [19]:
import pandas as pd

original_product_catalog_file = "/share/data/catalog_recommender/amzn-kaggle/reformatted_trn_unsupervised.csv"

## for a quick demo, we are sub-sampling the product catalog to just 5% of the products
def subsample_catalog(catalog_file, percentage=0.05):
    df = pd.read_csv(catalog_file)
    df = df.sample(frac=percentage, random_state=42)
    # Re-number products 0..N-1 so PRODUCT_ID lines up with the row position and with n_target_classes
    df["PRODUCT_ID"] = range(len(df))

    sampled_product_catalog_file = f"/share/data/catalog_recommender/amzn-kaggle/reformatted_sampled_{percentage}_trn_unsupervised.csv"
    df.to_csv(sampled_product_catalog_file, index=False)

    return sampled_product_catalog_file, df

sampled_product_catalog_file, dataframe = subsample_catalog(original_product_catalog_file)

In [None]:
## A sample row from the sampled catalog file is printed below. Please ensure that your file has data in the correct format.
pd.options.display.max_colwidth = 1000
dataframe[dataframe["PRODUCT_ID"] == 300]
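
Before training, it can help to check the catalog format programmatically. The sketch below is a minimal, standard-library-only illustration and not part of ThirdAI's API: the helper name `check_catalog` is hypothetical, the required column names are taken from the `cold_start()` call used later in this notebook, and the checks (contiguous `PRODUCT_ID`, non-empty `TITLE`) are assumptions about what "correct format" means here.

```python
import csv
import io

# Column names taken from this notebook's cold_start() call; adjust to your catalog.
REQUIRED_COLUMNS = ["PRODUCT_ID", "TITLE", "DESCRIPTION", "BULLET_POINTS", "BRAND"]


def check_catalog(csv_file, required_columns=REQUIRED_COLUMNS):
    """Return a list of problems found in an open catalog CSV (empty if none)."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in required_columns if c not in header]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    rows = list(reader)
    ids = [row["PRODUCT_ID"] for row in rows]
    # The notebook re-numbers PRODUCT_ID to 0..N-1 and passes N as n_target_classes.
    if ids != [str(i) for i in range(len(rows))]:
        problems.append("PRODUCT_ID is not the contiguous range 0..N-1")
    if any(not row["TITLE"].strip() for row in rows):
        problems.append("some rows have an empty TITLE")
    return problems


# Tiny in-memory catalog standing in for the sampled CSV file.
sample = io.StringIO(
    "PRODUCT_ID,TITLE,DESCRIPTION,BULLET_POINTS,BRAND\n"
    "0,Party balloons,Pack of 50,,Acme\n"
    "1,Gift bags,,Assorted colours,Acme\n"
)
print(check_catalog(sample))  # []
```

To run it on the real file, pass an open file handle, for example `check_catalog(open(sampled_product_catalog_file, newline=""))`.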

## UDT Initialization

Initialize a UDT model by specifying the input and output column names. Here, we intend to use the model for query-to-product prediction, so the input is a "QUERY" and the output is a "PRODUCT_ID". The "QUERY" column name can be anything you choose, but it must be consistent with the prediction step (shown later). The "PRODUCT_ID" column name must match the column name in your catalog file (shown above).

Note: contextual_encoding has three options: 'local', 'global' and 'none'. If your queries are short (~5 tokens per query), 'global' tends to reach better accuracy. For longer queries, we suggest 'local', although you can experiment with either.
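
The rule of thumb in the note above can be turned into a small helper that looks at a sample of your queries. This is only a sketch: `suggest_contextual_encoding` is a hypothetical name, tokens are approximated by splitting on whitespace, and the cutoff of 5 tokens is an assumption based on the "~5 tokens" guidance rather than a value defined by the library.

```python
def suggest_contextual_encoding(queries, short_query_tokens=5):
    """Suggest 'global' for short queries and 'local' for longer ones.

    The threshold of 5 whitespace-separated tokens mirrors the "~5 tokens"
    rule of thumb; it is an assumption, not a library constant.
    """
    if not queries:
        raise ValueError("need at least one sample query")
    avg_tokens = sum(len(q.split()) for q in queries) / len(queries)
    return "global" if avg_tokens <= short_query_tokens else "local"


sample_queries = ["birthday party return gift", "usb c cable", "kids water bottle"]
print(suggest_contextual_encoding(sample_queries))  # global
```

The returned string could then be passed as the `contextual_encoding` argument when the model is created; trying both settings remains the more reliable way to choose.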

In [None]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    data_types={
        "QUERY": bolt.types.text(contextual_encoding="local"),
        "PRODUCT_ID": bolt.types.categorical(),
    },
    target="PRODUCT_ID",
    n_target_classes=dataframe.shape[0],
    integer_target=True,
)

## Cold Start Pretraining

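Pretrain the model on the sampled catalog alone. In the call below, `strong_column_names` and `weak_column_names` select the catalog text columns used for training (here the title is the strong column, and the description, bullet points and brand are weak columns), while `learning_rate`, `epochs` and `metrics` are the training settings. A fuller description of every argument of the *cold_start()* method is still to be written.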

In [None]:
model.cold_start(
    filename=sampled_product_catalog_file,
    strong_column_names=["TITLE"],
    weak_column_names=["DESCRIPTION", "BULLET_POINTS", "BRAND"],
    learning_rate=0.001,
    epochs=5,
    metrics=["f_measure(0.95)"]
)

## Prediction

In [None]:
import time
from thirdai import bolt

# dataframe = pd.read_csv('/share/data/catalog_recommender/amzn-kaggle/reformatted_trn_unsupervised.csv')
# df = dataframe.iloc[:,[1,2]]

# model = bolt.UniversalDeepTransformer.load('/home/david/Universe/coldstart_amazon_kaggle.bolt')

def top_k_products(query, k):
    t1 = time.time()
    result = model.predict({"QUERY": query})
    #
    k = min(k, len(result) - 1)
    sorted_product_ids = result.argsort()[-k:][::-1]
    #
    print('***************************************************')
    print('******************* STARTS HERE *******************')
    print('***************************************************')
    for p_id in sorted_product_ids:
        # PRODUCT_ID was re-numbered 0..N-1 in row order, so it doubles as a positional index
        print(dict(dataframe.iloc[p_id]))
        print('******************************')
    #
    t2 = time.time()
    print('Time Elapsed:', (t2-t1)*1000, 'ms')
    print('***************************************************')
    print('******************** ENDS HERE ********************')
    print('***************************************************')


top_k_products('Birthday Party return Gift', 5)
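
The ranking step inside `top_k_products` selects the indices of the `k` highest scores. As a rough illustration of that step on its own, the sketch below does the same selection with the standard library. It is not the notebook's implementation: `top_k_indices` is a hypothetical helper, and the scores are made-up numbers standing in for a model output. One design difference is that `k` is clamped explicitly, since a slice such as `[-k:]` returns the whole array when `k` is 0.

```python
import heapq


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    k = max(0, min(k, len(scores)))
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Made-up scores standing in for one prediction over five products.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
print(top_k_indices(scores, 3))  # [1, 3, 4]
```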

## Get Product Embeddings (Optional)

In [None]:
import numpy as np

# get embeddings for all the products
product_embeddings = model._get_model().get_layer('fc_2').weights.copy()

# save the embeddings
np.save('./amazon_kaggle_cold_start_embeddings.npy',product_embeddings)

# reload the embeddings
product_embeddings = np.load('./amazon_kaggle_cold_start_embeddings.npy')
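
One possible downstream use of the embeddings is finding similar products by cosine similarity. The sketch below is illustrative only: it assumes that row `i` of the embedding matrix is the vector for `PRODUCT_ID` `i`, which is not confirmed by this notebook, and it uses a tiny hand-written list of vectors in place of `product_embeddings`. The helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(embeddings, product_id, k=2):
    """Return the ids of the k products closest to product_id, excluding itself."""
    query = embeddings[product_id]
    scored = [
        (cosine_similarity(query, vec), other_id)
        for other_id, vec in enumerate(embeddings)
        if other_id != product_id
    ]
    scored.sort(reverse=True)
    return [other_id for _, other_id in scored[:k]]


# Toy vectors standing in for product_embeddings (assumed: one row per product).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
print(most_similar(toy_embeddings, 0))  # [1, 3]
```

For a full catalog, a vectorized NumPy version or an approximate nearest-neighbor index would be a more practical choice than this pure-Python loop.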
