<a href="https://colab.research.google.com/github/SaeedARV/TorobDataChallenge2023/blob/main/Copy_of_baseline_solution_official.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baseline Solution for Torob Data Challenge 2023

**Congratulations on participating in Torob Data Challenge 2023! 🎉**

This notebook provides a baseline solution for the challenge. As the name suggests, it is only a *baseline*: it shows how the challenge data can be loaded and processed, and it can serve as a starting point to which you add complexity and parameter tuning to get better results. The approach used here is only one of many ways to solve a learning-to-rank problem, and you are not required to use it at all. We do strongly suggest that you read and run it at least once to better understand the challenge and its data.

In this baseline solution, a LambdaMART model is trained to rank products by their relevance to a query. Queries and product names are both represented with a combination of TF-IDF and random projection. As mentioned above, the solution is not tuned for the best possible performance, so there is a lot of room for improvement that you can explore. Also, feel free to try solutions based on entirely different models, representations, etc. We are really excited and looking forward to seeing you and your novel solutions in the challenge!

Ok, let's get started!

NOTE: you can find a brief description of the baseline solution in the following document.
https://docs.google.com/document/d/1aLvD5RoakD-eS7IXcTMGSfo4PHzxa75dZwsC1uhFQXw/edit?usp=sharing

In [None]:
!pip install --no-cache-dir --upgrade gdown

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gdown
  Downloading gdown-4.6.4-py3-none-any.whl (14 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.4.0
    Uninstalling gdown-4.4.0:
      Successfully uninstalled gdown-4.4.0
Successfully installed gdown-4.6.4


In [None]:
from collections import defaultdict, Counter
import json
import os
import re

from tqdm import tqdm
import pandas as pd

## Fetch Data

Fetch the public challenge data from Google Drive and extract its content to a directory called `data`.

In [None]:
!gdown 1spSFY1yieMcjGJc989bbbEheNhegmEmR

Downloading...
From: https://drive.google.com/uc?id=1spSFY1yieMcjGJc989bbbEheNhegmEmR
To: /content/torob-data-challenge-2023_datafiles_v1.7z
100% 454M/454M [00:02<00:00, 159MB/s]


In [None]:
!7z e torob-data-challenge-2023_datafiles_v1.7z -odata/


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 454165175 bytes (434 MiB)

Extracting archive: torob-data-challenge-2023_datafiles_v1.7z
--
Path = torob-data-challenge-2023_datafiles_v1.7z
Type = 7z
Physical Size = 454165175
Headers Size = 276
Method = LZMA2:24
Solid = +
Blocks = 2


In [None]:
# Create a directory for all the files generated while the notebook runs.
# `-p` keeps the cell from failing if the directory already exists.
!mkdir -p output_data

## Data Preprocessing

In [None]:
char_mappings = {
    "٥": "5",
    "А": "a",
    "В": "b",
    "Е": "e",
    "Н": "h",
    "Р": "P",
    "С": "C",
    "Т": "T",
    "а": "a",
    "г": "r",
    "е": "e",
    "к": "k",
    "м": "m",
    "о": "o",
    "р": "p",
    "ڈ": "د",
    "ڇ": "چ",
    # Persian digits (will be replaced by English ones)
    "۰": "0",
    "۱": "1",
    "۲": "2",
    "۳": "3",
    "۴": "4",
    "۵": "5",
    "۶": "6",
    "۷": "7",
    "۸": "8",
    "۹": "9",
    ".": ".",
    # Arabic digits (will be replaced by English ones)
    "٠": "0",
    "١": "1",
    "٢": "2",
    "٣": "3",
    "٤": "4",
    "٥": "5",
    "٦": "6",
    "٧": "7",
    "٨": "8",
    "٩": "9",
    # Special Arabic Characters (will be replaced by persian one)
    "ك": "ک",
    "ى": "ی",
    "ي": "ی",
    "ؤ": "و",
    "ئ": "ی",
    "إ": "ا",
    "أ": "ا",
    "آ": "ا",
    "ة": "ه",
    "ء": "ی",
    # French alphabet (will be replaced by English letters)
    "à": "a",
    "ä": "a",
    "ç": "c",
    "é": "e",
    "è": "e",
    "ê": "e",
    "ë": "e",
    "î": "i",
    "ï": "i",
    "ô": "o",
    "ù": "u",
    "û": "u",
    "ü": "u",
    # Comma (will be replaced by a dot for floating point numbers)
    ",": ".",
    # Ampersand (will be replaced by the word "and")
    "&": " and ",
    # Vowels (will be removed)
    "ّ": "",  # tashdid
    "َ": "",  # a
    "ِ": "",  # e
    "ُ": "",  # o
    "ـ": "",  # tatvil
    # Spaces
    "‍": "",  # 0x9E -> ZERO WIDTH JOINER
    "‌": " ",  # 0x9D -> ZERO WIDTH NON-JOINER
    # Arabic Presentation Forms-A (will be replaced by persian one)
    "ﭐ": "ا",
    "ﭑ": "ا",
    "ﭖ": "پ",
    "ﭗ": "پ",
    "ﭘ": "پ",
    "ﭙ": "پ",
    "ﭞ": "ت",
    "ﭟ": "ت",
    "ﭠ": "ت",
    "ﭡ": "ت",
    "ﭺ": "چ",
    "ﭻ": "چ",
    "ﭼ": "چ",
    "ﭽ": "چ",
    "ﮊ": "ژ",
    "ﮋ": "ژ",
    "ﮎ": "ک",
    "ﮏ": "ک",
    "ﮐ": "ک",
    "ﮑ": "ک",
    "ﮒ": "گ",
    "ﮓ": "گ",
    "ﮔ": "گ",
    "ﮕ": "گ",
    "ﮤ": "ه",
    "ﮥ": "ه",
    "ﮦ": "ه",
    "ﮪ": "ه",
    "ﮫ": "ه",
    "ﮬ": "ه",
    "ﮭ": "ه",
    "ﮮ": "ی",
    "ﮯ": "ی",
    "ﮰ": "ی",
    "ﮱ": "ی",
    "ﯼ": "ی",
    "ﯽ": "ی",
    "ﯾ": "ی",
    "ﯿ": "ی",
    # Arabic Presentation Forms-B (will be removed)
    "ﹰ": "",
    "ﹱ": "",
    "ﹲ": "",
    "ﹳ": "",
    "ﹴ": "",
    "﹵": "",
    "ﹶ": "",
    "ﹷ": "",
    "ﹸ": "",
    "ﹹ": "",
    "ﹺ": "",
    "ﹻ": "",
    "ﹼ": "",
    "ﹽ": "",
    "ﹾ": "",
    "ﹿ": "",
    # Arabic Presentation Forms-B (will be replaced by persian one)
    "ﺀ": "ی",
    "ﺁ": "ا",
    "ﺂ": "ا",
    "ﺃ": "ا",
    "ﺄ": "ا",
    "ﺅ": "و",
    "ﺆ": "و",
    "ﺇ": "ا",
    "ﺈ": "ا",
    "ﺉ": "ی",
    "ﺊ": "ی",
    "ﺋ": "ی",
    "ﺌ": "ی",
    "ﺍ": "ا",
    "ﺎ": "ا",
    "ﺏ": "ب",
    "ﺐ": "ب",
    "ﺑ": "ب",
    "ﺒ": "ب",
    "ﺓ": "ه",
    "ﺔ": "ه",
    "ﺕ": "ت",
    "ﺖ": "ت",
    "ﺗ": "ت",
    "ﺘ": "ت",
    "ﺙ": "ث",
    "ﺚ": "ث",
    "ﺛ": "ث",
    "ﺜ": "ث",
    "ﺝ": "ج",
    "ﺞ": "ج",
    "ﺟ": "ج",
    "ﺠ": "ج",
    "ﺡ": "ح",
    "ﺢ": "ح",
    "ﺣ": "ح",
    "ﺤ": "ح",
    "ﺥ": "خ",
    "ﺦ": "خ",
    "ﺧ": "خ",
    "ﺨ": "خ",
    "ﺩ": "د",
    "ﺪ": "د",
    "ﺫ": "ذ",
    "ﺬ": "ذ",
    "ﺭ": "ر",
    "ﺮ": "ر",
    "ﺯ": "ز",
    "ﺰ": "ز",
    "ﺱ": "س",
    "ﺲ": "س",
    "ﺳ": "س",
    "ﺴ": "س",
    "ﺵ": "ش",
    "ﺶ": "ش",
    "ﺷ": "ش",
    "ﺸ": "ش",
    "ﺹ": "ص",
    "ﺺ": "ص",
    "ﺻ": "ص",
    "ﺼ": "ص",
    "ﺽ": "ض",
    "ﺾ": "ض",
    "ﺿ": "ض",
    "ﻀ": "ض",
    "ﻁ": "ط",
    "ﻂ": "ط",
    "ﻃ": "ط",
    "ﻄ": "ط",
    "ﻅ": "ظ",
    "ﻆ": "ظ",
    "ﻇ": "ظ",
    "ﻈ": "ظ",
    "ﻉ": "ع",
    "ﻊ": "ع",
    "ﻋ": "ع",
    "ﻌ": "ع",
    "ﻍ": "غ",
    "ﻎ": "غ",
    "ﻏ": "غ",
    "ﻐ": "غ",
    "ﻑ": "ف",
    "ﻒ": "ف",
    "ﻓ": "ف",
    "ﻔ": "ف",
    "ﻕ": "ق",
    "ﻖ": "ق",
    "ﻗ": "ق",
    "ﻘ": "ق",
    "ﻙ": "ک",
    "ﻚ": "ک",
    "ﻛ": "ک",
    "ﻜ": "ک",
    "ﻝ": "ل",
    "ﻞ": "ل",
    "ﻟ": "ل",
    "ﻠ": "ل",
    "ﻡ": "م",
    "ﻢ": "م",
    "ﻣ": "م",
    "ﻤ": "م",
    "ﻥ": "ن",
    "ﻦ": "ن",
    "ﻧ": "ن",
    "ﻨ": "ن",
    "ﻩ": "ه",
    "ﻪ": "ه",
    "ﻫ": "ه",
    "ﻬ": "ه",
    "ﻭ": "و",
    "ﻮ": "و",
    "ﻯ": "ی",
    "ﻰ": "ی",
    "ﻱ": "ی",
    "ﻲ": "ی",
    "ﻳ": "ی",
    "ﻴ": "ی",
    "ﻵ": "لا",
    "ﻶ": "لا",
    "ﻷ": "لا",
    "ﻸ": "لا",
    "ﻹ": "لا",
    "ﻺ": "لا",
    "ﻻ": "لا",
    "ﻼ": "لا",
}

valid_chars = [
    " ",
    "0",
    "1",
    "2",
    "3",
    "4",
    "5",
    "6",
    "7",
    "8",
    "9",
    "A",
    "B",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "J",
    "K",
    "L",
    "M",
    "N",
    "O",
    "P",
    "Q",
    "R",
    "S",
    "T",
    "U",
    "V",
    "W",
    "X",
    "Y",
    "Z",
    "a",
    "b",
    "c",
    "d",
    "e",
    "f",
    "g",
    "h",
    "i",
    "j",
    "k",
    "l",
    "m",
    "n",
    "o",
    "p",
    "q",
    "r",
    "s",
    "t",
    "u",
    "v",
    "w",
    "x",
    "y",
    "z",
    "ا",
    "ب",
    "ت",
    "ث",
    "ج",
    "ح",
    "خ",
    "د",
    "ذ",
    "ر",
    "ز",
    "س",
    "ش",
    "ص",
    "ض",
    "ط",
    "ظ",
    "ع",
    "غ",
    "ف",
    "ق",
    "ل",
    "م",
    "ن",
    "ه",
    "و",
    "پ",
    "چ",
    "ژ",
    "ک",
    "گ",
    "ی",
]

translation_table = {ord(src): dst for src, dst in char_mappings.items()}

# Create a regex for recognizing invalid characters.
nonvalid_reg_text = '[^{}]'.format("".join(valid_chars))
nonvalid_reg = re.compile(nonvalid_reg_text)


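def normalize_text(text, to_lower=True, remove_invalid=True):
    """Normalize a query or a product title.

    Characters listed in `char_mappings` are replaced first. The text is then
    optionally lowercased, characters outside `valid_chars` are optionally
    replaced with spaces, and runs of whitespace are collapsed.
    """
    # Map characters that have a replacement to their valid counterparts.
    text = text.translate(translation_table)
    if to_lower:
        text = text.lower()
    if remove_invalid:
        text = nonvalid_reg.sub(' ', text)
    # Replace consecutive whitespaces with a single space character.
    text = re.sub(r"\s+", " ", text)
    return text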
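
To get a feel for what this normalization does, here is a minimal, self-contained sketch. It does not use the full tables above: `demo_mappings`, `demo_invalid` and `demo_normalize` are hypothetical, cut-down stand-ins written only for illustration.

```python
import re

# Hypothetical mini-version of `char_mappings`: Persian digits -> ASCII digits,
# Arabic kaf -> Persian kaf, and "&" -> " and ".
demo_mappings = {"\u06f1": "1", "\u06f2": "2", "\u0643": "\u06a9", "&": " and "}
demo_table = {ord(src): dst for src, dst in demo_mappings.items()}

# In this sketch, anything that is not a space, an ASCII digit, a lowercase
# ASCII letter or the Persian kaf counts as invalid.
demo_invalid = re.compile("[^ 0-9a-z\u06a9]")


def demo_normalize(text):
    text = text.translate(demo_table).lower()  # map characters, then lowercase
    text = demo_invalid.sub(" ", text)         # blank out invalid characters
    return re.sub(r"\s+", " ", text)           # collapse runs of whitespace


print(demo_normalize("iPhone \u06f1\u06f2 Pro-Max & Case"))
# iphone 12 pro max and case
```

Note that, as in the notebook's function, the result is not stripped, so an input that ends in an invalid character keeps a single trailing space.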

In [None]:
def read_json_lines(path, n_lines=None):
    """Create a generator that reads a JSON Lines file one line at a time
    and yields each line as a dictionary.

    This can be used as a memory-efficient alternative to `pandas.read_json`
    for reading a JSON Lines file. If `n_lines` is given, reading stops
    after that many lines.
    """
    with open(path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if n_lines == i:
                break
            yield json.loads(line)

In [None]:
class JSONLinesWriter:
    """
    Helper class to write list of dictionaries into a file in json lines
    format, i.e. one json record per line.
    """

    def __init__(self, file_path):
        self.fd = None
        self.file_path = file_path
        self.delimiter = "\n"

    def open(self):
        self.fd = open(self.file_path, "w")
        self.first_record_written = False
        return self

    def close(self):
        self.fd.close()
        self.fd = None

    def write_record(self, obj):
        if self.first_record_written:
            self.fd.write(self.delimiter)
        self.fd.write(json.dumps(obj))
        self.first_record_written = True

    def __enter__(self):
        return self.open()

    def __exit__(self, type, value, traceback):
        self.close()
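
The two helpers above amount to a JSON Lines round trip: one JSON document per line on the way out, and a lazy line-by-line reader on the way in. The sketch below shows the same idea on a temporary file. `demo_read` and the toy records are assumptions made for illustration, not part of the challenge data.

```python
import json
import os
import tempfile

records = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}, {"id": 3, "title": "c"}]


def demo_read(path, n_lines=None):
    """Yield one dictionary per line, stopping after `n_lines` lines if given."""
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if n_lines == i:
                break
            yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.jsonl")

    # One JSON document per line: a newline between records, none at the end.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(json.dumps(r) for r in records))

    # The generator is consumed while the temporary file still exists.
    first_two = list(demo_read(path, n_lines=2))

print(first_two)
# [{'id': 1, 'title': 'a'}, {'id': 2, 'title': 'b'}]
```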

In [None]:
data_dir = os.path.join('data')
output_dir = os.path.join('output_data')

search_data_path = os.path.join(data_dir, 'torob-search-data_v1.jsonl')
aggregated_search_data_path = os.path.join(output_dir, 'aggregated_search_data.jsonl')

products_path = os.path.join(data_dir, 'products-info_v1.jsonl')
preprocessed_products_path = os.path.join(output_dir, 'preprocessed_products.jsonl')

test_data_path = os.path.join(data_dir, 'test-offline-data_v1.jsonl')
preprocessed_test_queries_path = os.path.join(output_dir, 'preprocessed_test_queries.jsonl')

In [None]:
def aggregate_searches(search_data_path, output_path):
    """Aggregate searches based on raw query.
    
    For each unique raw query in the search data, the frequency of products and
    clicked products would be aggregated.
    """
    agg_searches = defaultdict(
        lambda : dict(
            results=Counter(),
            clicks=Counter(),
        )
    )
    print("Aggregating searches based on raw query...")
    for search in tqdm(read_json_lines(search_data_path)):
        agg_searches[search['raw_query']]['results'].update(search['result'])
        agg_searches[search['raw_query']]['clicks'].update(search['clicked_result'])
    
    print('Writing aggregated searches into file...')
    with JSONLinesWriter(output_path) as out_file:
        for raw_query, stats in tqdm(agg_searches.items()):
            results, results_count = list(zip(*stats['results'].most_common()))
            clicks, clicks_count = list(zip(*stats['clicks'].most_common()))
            record = {
                'raw_query': raw_query,
                'raw_query_normalized': normalize_text(raw_query),
                'results': results,
                'results_count': results_count,
                'clicks': clicks,
                'clicks_count': clicks_count,
            }
            out_file.write_record(record)

    print("Finished aggregating searches.")
    print(f'Number of aggregate search records: {len(agg_searches)}')
    print(f"The aggregated searches data were stored in '{output_path}'.")

In [None]:
aggregate_searches(search_data_path, aggregated_search_data_path)

Aggregating searches based on raw query...


2499901it [01:07, 37021.62it/s]


Writing aggregated searches into file...


100%|██████████| 270099/270099 [00:16<00:00, 16680.41it/s]

Finished aggregating searches.
Number of aggregate search records: 270099
The aggregated searches data were stored in 'output_data/aggregated_search_data.jsonl'.
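
To make the shape of an aggregated record concrete, here is a small sketch of the same counting logic on a toy search log. The two searches are made up. Only the keys (`raw_query`, `result`, `clicked_result`) follow the data used above.

```python
from collections import Counter, defaultdict

# Hypothetical toy search log with the same keys the notebook reads.
toy_searches = [
    {"raw_query": "phone", "result": [1, 2, 3], "clicked_result": [2]},
    {"raw_query": "phone", "result": [2, 3, 4], "clicked_result": [2, 3]},
]

agg = defaultdict(lambda: {"results": Counter(), "clicks": Counter()})
for search in toy_searches:
    agg[search["raw_query"]]["results"].update(search["result"])
    agg[search["raw_query"]]["clicks"].update(search["clicked_result"])

stats = agg["phone"]
# `most_common()` sorts by count (descending); `zip(*...)` splits IDs from counts.
results, results_count = zip(*stats["results"].most_common())
clicks, clicks_count = zip(*stats["clicks"].most_common())

print(results, results_count)  # (2, 3, 1, 4) (2, 2, 1, 1)
print(clicks, clicks_count)    # (2, 3) (2, 1)
```

Product 2 was shown twice and clicked twice, so it comes first in both lists. Products 1 and 4 were shown once and never clicked.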





In [None]:
def preprocess_products(products_path, output_path):
    """Preprocess product names.
    
    The different titles of a product are concatenated together and 
    the resulting string would be normalized. Then, the normalized title
    is split into tokens and only the set of unique tokens would be selected
    as the final title of the product.
    """
    print('Preprocessing products...')
    count = 0
    with JSONLinesWriter(output_path) as out_file:
        for product in tqdm(read_json_lines(products_path)):
            titles = product['titles']
            titles_concat_normalized = normalize_text(" ".join(titles))
            titles_words_set = set(titles_concat_normalized.split())
            titles_words_concat = " ".join(titles_words_set)
            
            record = {
                'id': product['id'],
                'title_normalized': titles_words_concat,
            }
            out_file.write_record(record)
            count += 1
    print('Finished preprocessing products.')
    print(f'Number of processed products: {count}')
    print(f"The processed products data were stored in '{output_path}'")

In [None]:
preprocess_products(products_path, preprocessed_products_path)

Preprocessing products...


3612277it [02:04, 29014.69it/s]

Finished preprocessing products.
Number of processed products: 3612277
The processed products data were stored in 'output_data/preprocessed_products.jsonl'
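
As a rough illustration of the title preprocessing, the sketch below merges two made-up titles of one product into a bag of unique tokens. It lowercases instead of calling `normalize_text`, and it sorts the tokens so the output is reproducible. Both are simplifications for this example: the notebook joins the set directly, so its token order is arbitrary.

```python
# Hypothetical titles of a single product, as different shops might list it.
toy_titles = ["Samsung Galaxy A52", "galaxy a52 samsung 128GB"]

concat = " ".join(toy_titles).lower()   # stand-in for normalize_text(...)
unique_tokens = set(concat.split())     # duplicates across titles collapse here
title_normalized = " ".join(sorted(unique_tokens))

print(title_normalized)
# 128gb a52 galaxy samsung
```

Since the TF-IDF step later treats the title as a bag of words, the order of the tokens does not affect the features.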





In [None]:
def preprocess_test_queries(test_data_path, output_path):
    """Normalize test queries."""
    print('Preprocessing test queries...')
    count = 0
    with JSONLinesWriter(output_path) as out_file:
        for test_sample in tqdm(read_json_lines(test_data_path)):
            normalized_query = normalize_text(test_sample['raw_query'])
            record = {
                'raw_query_normalized': normalized_query,
            }
            count += 1
            out_file.write_record(record)
    print('Finished preprocessing test queries.')
    print(f'Number of processed test queries: {count}')
    print(f"The processed test queries were stored in '{output_path}'")

In [None]:
preprocess_test_queries(test_data_path, preprocessed_test_queries_path)

Preprocessing test queries...


23140it [00:00, 58063.63it/s]


Finished preprocessing test queries.
Number of processed test queries: 23140
The processed test queries were stored in 'output_data/preprocessed_test_queries.jsonl'


## Feature extraction

In [None]:
# Reset environment due to memory constraints.
%reset -f

In [None]:
import os
import json
import gc
import pickle

from tqdm import tqdm
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
def read_json_lines(path, n_lines=None):
    """Create a generator that reads a JSON Lines file one line at a time
    and yields each line as a dictionary.

    This can be used as a memory-efficient alternative to `pandas.read_json`
    for reading a JSON Lines file. If `n_lines` is given, reading stops
    after that many lines.
    """
    with open(path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if n_lines == i:
                break
            yield json.loads(line)

In [None]:
output_dir = os.path.join('output_data')

aggregated_search_data_path = os.path.join(output_dir, 'aggregated_search_data.jsonl')
preprocessed_products_path = os.path.join(output_dir, 'preprocessed_products.jsonl')
preprocessed_test_queries_path = os.path.join(output_dir, 'preprocessed_test_queries.jsonl')

train_dat_file_path = os.path.join(output_dir, 'train.dat')

random_projection_mat_path = os.path.join(output_dir, 'random_projection_mat.npy')
product_features_path = os.path.join(output_dir, 'product_features.npy')
queries_train_features_path = os.path.join(output_dir, 'queries_train_features.npy')
queries_test_features_path = os.path.join(output_dir, 'queries_test_features.npy')
products_id_to_idx_path = os.path.join(output_dir, 'products_id_to_idx.pkl')

In [None]:
# Number of tokens in the vocabulary of TF-IDF.
VOCAB_SIZE = 4096
# Embedding dimension used for random projection of TF-IDF vectors.
EMBEDDING_DIM = 256
# Number of training samples to use (set to None to use all samples).
NUM_TRAIN_SAMPLES = 10_000

In [None]:
# Load aggregated search data which will be used as training data.
aggregated_searches_df = pd.DataFrame(
    read_json_lines(aggregated_search_data_path, n_lines=NUM_TRAIN_SAMPLES)
)

In [None]:
# Load preprocessed product data.
products_data_df = pd.DataFrame(read_json_lines(preprocessed_products_path))

In [None]:
# Load preprocessed test queries.
test_offline_queries_df = pd.DataFrame(read_json_lines(preprocessed_test_queries_path))

In [None]:
# Create a mapping from ID of products to their integer index.
products_id_to_idx = dict(
    (p_id, idx)
    for idx, p_id in enumerate(products_data_df['id'])
)

In [None]:
# Create a random matrix which will be used to project
# TF-IDF vectors into a lower-dimensional space.
random_projection_mat = np.random.rand(VOCAB_SIZE, EMBEDDING_DIM)

vectorizer = TfidfVectorizer(max_features=VOCAB_SIZE, lowercase=True, use_idf=True)

# Fit tf-idf vectorizer on normalized product names and compute their tf-idf vectors.
products_tfidf = vectorizer.fit_transform(products_data_df['title_normalized'])
# Project the tf-idf vectors using random projection matrix.
products_projected = products_tfidf.dot(random_projection_mat)
del products_tfidf  # Free up memory.
gc.collect()

# Transform the training raw queries into tf-idf vectors.
queries_train_tfidf = vectorizer.transform(aggregated_searches_df['raw_query_normalized'])
queries_train_projected = queries_train_tfidf.dot(random_projection_mat)
del queries_train_tfidf # Free up memory.
gc.collect();
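
The same pipeline can be tried on a handful of strings. The sketch below is only an illustration under stated assumptions: the texts, the tiny sizes and the seeded generator are made up, and the projection matrix is sliced to the number of terms actually found, because a toy corpus has fewer distinct terms than `max_features`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_products = ["samsung galaxy a52", "apple iphone 12", "galaxy watch"]
toy_queries = ["galaxy phone"]

vocab_size, embedding_dim = 8, 4  # tiny stand-ins for VOCAB_SIZE / EMBEDDING_DIM
rng = np.random.default_rng(0)    # seeded only to make the sketch reproducible
projection = rng.random((vocab_size, embedding_dim))

vectorizer = TfidfVectorizer(max_features=vocab_size)
products_tfidf = vectorizer.fit_transform(toy_products)  # sparse matrix
queries_tfidf = vectorizer.transform(toy_queries)        # same vocabulary

n_terms = products_tfidf.shape[1]                        # may be < vocab_size
products_vec = products_tfidf.dot(projection[:n_terms])  # shape (3, 4)
queries_vec = queries_tfidf.dot(projection[:n_terms])    # shape (1, 4)

print(products_vec.shape, queries_vec.shape)
# (3, 4) (1, 4)
```

Query words that the vectorizer never saw in a product title (here "phone") are simply ignored, which is one limitation of fitting the vocabulary on product names only.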

In [None]:
# Transform test raw queries into tf-idf vectors.
queries_test_tfidf = vectorizer.transform(test_offline_queries_df['raw_query_normalized'])
queries_test_projected = queries_test_tfidf.dot(random_projection_mat)
del queries_test_tfidf # Free up memory.
gc.collect();

In [None]:
del vectorizer
gc.collect();

A training sample is generated for each pair (Q, P), where Q is a search query and P is a product in the search results of Q. The features of the sample are the random-projected TF-IDF vectors of P and Q, concatenated in that order. The label is derived from the number of clicks on P in the search results of Q: it is log2(clicks + 1), so a product that was shown but never clicked gets the label 0.
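
A single training row can be sketched as follows. The click count, the two short vectors and the `qid` are invented stand-ins. The real vectors have `EMBEDDING_DIM` entries each.

```python
import numpy as np

clicks_on_product = 3                   # hypothetical click count for (Q, P)
label = np.log2(clicks_on_product + 1)  # log2(3 + 1) = 2.0

product_vec = np.array([0.1234, 0.5])   # stand-in for a projected product title
query_vec = np.array([0.25, 0.0])       # stand-in for a projected query

# Product features come first, then query features, rounded to 3 decimals.
features = np.around(np.concatenate((product_vec, query_vec)), 3)

qid = 7
line = f"{label} qid:{qid} " + " ".join(f"{i}:{v}" for i, v in enumerate(features))
print(line)
# 2.0 qid:7 0:0.123 1:0.5 2:0.25 3:0.0
```

The logarithm compresses large click counts, so a product with 100 clicks is not treated as 100 times more relevant than a product with one click.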

In [None]:
def create_dat_file(
    dat_file_path,
    agg_searches_df,
    query_features,
    product_features,
    n_candidates=None,
):
    """
    Create a `dat` file which holds the training data of the LambdaMART model.

    The file format of the training and test files is the same as for SVMlight,
    with the exception that the lines in the input files have to be sorted by increasing qid.
    The first lines may contain comments and are ignored if they start with #.
    Each of the following lines represents one training example and is of the following format:

    <line> .=. <target> qid:<qid> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
    <target> .=. <float>
    <qid> .=. <positive integer>
    <feature> .=. <positive integer>
    <value> .=. <float>
    <info> .=. <string>

    The target value and each of the feature/value pairs are separated by a space character.
    Feature/value pairs MUST be ordered by increasing feature number.
    Features with value zero can be skipped.
    The target value defines the order of the examples for each query.
    Implicitly, the target values are used to generate pairwise preference constraints as described in [Joachims, 2002c].
    A preference constraint is included for all pairs of examples in the example_file, for which the target value differs.
    The special feature "qid" can be used to restrict the generation of constraints.
    Two examples are considered for a pairwise preference constraint only if the value of "qid" is the same.

    For example, given the example_file

    3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
    2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
    1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
    1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
    1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A
    2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B
    1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C
    1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D
    2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 3A
    3 qid:3 1:1 2:1 3:0 4:0.3 5:0 # 3B
    4 qid:3 1:1 2:0 3:0 4:0.4 5:1 # 3C
    1 qid:3 1:0 2:1 3:1 4:0.5 5:0 # 3D

    the following set of pairwise constraints is generated (examples are referred to by the info-string after the # character):

    1A>1B, 1A>1C, 1A>1D, 1B>1C, 1B>1D, 2B>2A, 2B>2C, 2B>2D, 3C>3A, 3C>3B, 3C>3D, 3B>3A, 3B>3D, 3A>3D

    More information:
     - https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html#embedding-additional-information-inside-libsvm-file
     - https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
    """
    with open(dat_file_path, "w") as file:
        for qid, agg_search in tqdm(enumerate(agg_searches_df.itertuples(index=False))):
            if n_candidates is None:
                limit = len(agg_search.results)
            else:
                limit = min(n_candidates, len(agg_search.results))
            clicks = dict(zip(agg_search.clicks, agg_search.clicks_count))

            for candidate_product_id in agg_search.results[:limit]:
                if candidate_product_id is None:
                    continue
                candidate_score = clicks.get(candidate_product_id, 0)
                candidate_score = np.log2(candidate_score + 1)

                p_idx = products_id_to_idx[candidate_product_id]
                features = np.concatenate((product_features[p_idx], query_features[qid]))
                features = np.around(features, 3)

                file.write(
                    f"{candidate_score} qid:{qid} "
                    + " ".join([f"{i}:{s}" for i, s in enumerate(features)])
                    + "\n"
                )

In [None]:
create_dat_file(
    train_dat_file_path,
    aggregated_searches_df,
    queries_train_projected,
    products_projected,
    n_candidates=200,
)

10000it [10:30, 15.86it/s]
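
The pairwise constraints listed in the docstring can be reproduced with a few lines of code. This is only a sketch of the rule stated there (same `qid`, different target), not of how any particular library implements it. The parsing is deliberately minimal.

```python
from itertools import combinations

example_file = """\
3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 3A
3 qid:3 1:1 2:1 3:0 4:0.3 5:0 # 3B
4 qid:3 1:1 2:0 3:0 4:0.4 5:1 # 3C
1 qid:3 1:0 2:1 3:1 4:0.5 5:0 # 3D
"""

rows = []
for row in example_file.strip().splitlines():
    body, info = row.split("#")
    target, qid = body.split()[:2]
    rows.append((float(target), qid, info.strip()))

constraints = []
for (t1, q1, name1), (t2, q2, name2) in combinations(rows, 2):
    if q1 != q2 or t1 == t2:
        continue  # only pairs from the same query with different targets count
    winner, loser = (name1, name2) if t1 > t2 else (name2, name1)
    constraints.append(f"{winner}>{loser}")

print(len(constraints))
# 14
```

Pairs such as 1C and 1D produce no constraint because their targets are equal, and no pair ever crosses two different queries.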


In [None]:
# Since memory is limited, we store all the necessary data,
# such as the extracted features, on disk. Some of these files
# are needed later, in the inference step.
np.save(random_projection_mat_path, random_projection_mat)
np.save(product_features_path, products_projected)
np.save(queries_train_features_path, queries_train_projected)
np.save(queries_test_features_path, queries_test_projected)

In [None]:
with open(products_id_to_idx_path, 'wb') as f:
    pickle.dump(products_id_to_idx, f)

## Training of model

In [None]:
# Reset due to memory constraints.
%reset -f

In [None]:
import os
import xgboost as xgb

In [None]:
output_dir = os.path.join('output_data')

train_dat_path = os.path.join(output_dir, 'train.dat')
model_path = os.path.join(output_dir, 'ranker.json')

In [None]:
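# The `?format=libsvm` suffix states the text format explicitly;
# newer XGBoost releases may refuse to guess it from the file name.
train_data = xgb.DMatrix(f"{train_dat_path}?format=libsvm")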

In [None]:
param = {
    "max_depth": 20,
    "eta": 0.3,
    "objective": "rank:ndcg",
    "verbosity": 1,
    "num_parallel_tree": 1,
    "tree_method": "gpu_hist",
    "eval_metric": ["ndcg"],
}
eval_list = [(train_data, "train")]

model = xgb.train(
    param,
    train_data,
    num_boost_round=200,
    evals=eval_list,
)

model.save_model(model_path)

[0]	train-ndcg:0.61594
[1]	train-ndcg:0.66907
[2]	train-ndcg:0.69414
[3]	train-ndcg:0.70671
[4]	train-ndcg:0.71801
[5]	train-ndcg:0.72733
[6]	train-ndcg:0.73543
[7]	train-ndcg:0.74177
[8]	train-ndcg:0.74898
[9]	train-ndcg:0.75489
[10]	train-ndcg:0.76081
[11]	train-ndcg:0.76573
[12]	train-ndcg:0.77120
[13]	train-ndcg:0.77680
[14]	train-ndcg:0.78134
[15]	train-ndcg:0.78544
[16]	train-ndcg:0.78966
[17]	train-ndcg:0.79433
[18]	train-ndcg:0.79824
[19]	train-ndcg:0.80163
[20]	train-ndcg:0.80453
[21]	train-ndcg:0.80773
[22]	train-ndcg:0.81058
[23]	train-ndcg:0.81338
[24]	train-ndcg:0.81670
[25]	train-ndcg:0.81926
[26]	train-ndcg:0.82174
[27]	train-ndcg:0.82452
[28]	train-ndcg:0.82677
[29]	train-ndcg:0.82899
[30]	train-ndcg:0.83119
[31]	train-ndcg:0.83310
[32]	train-ndcg:0.83449
[33]	train-ndcg:0.83661
[34]	train-ndcg:0.83861
[35]	train-ndcg:0.84041
[36]	train-ndcg:0.84234
[37]	train-ndcg:0.84406
[38]	train-ndcg:0.84584
[39]	train-ndcg:0.84738
[40]	train-ndcg:0.84899
[41]	train-ndcg:0.85065
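
For readers who have not met the metric in the log above, here is a minimal sketch of NDCG for one query. It assumes the common exponential-gain form, `2^rel - 1`, with a `log2` position discount. XGBoost's exact settings (gain type, truncation, handling of queries without relevant items) may differ, so treat the functions `dcg` and `ndcg` below as an illustration only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of labels listed in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances)
    )


def ndcg(relevances_in_predicted_order):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / ideal if ideal > 0 else 0.0


perfect = ndcg([2.0, 1.0, 0.0])   # labels already in descending order
reversed_ = ndcg([0.0, 1.0, 2.0])  # best product ranked last
print(perfect, reversed_ < perfect)
```

Keep in mind that the values printed during training are computed on the training data itself, so they say little about how well the model ranks unseen queries.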

## Inference (prediction) on test data

In [None]:
# Reset due to memory constraints.
%reset -f

In [None]:
import os
import pickle
import json

from tqdm import tqdm
import pandas as pd
import numpy as np
import xgboost as xgb

In [None]:
def read_json_lines(path, n_lines=None):
    """Create a generator that reads a JSON Lines file one line at a time
    and yields each line as a dictionary.

    This can be used as a memory-efficient alternative to `pandas.read_json`
    for reading a JSON Lines file. If `n_lines` is given, reading stops
    after that many lines.
    """
    with open(path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if n_lines == i:
                break
            yield json.loads(line)

In [None]:
data_dir = os.path.join('data')
output_dir = os.path.join('output_data')

test_data_path = os.path.join(data_dir, 'test-offline-data_v1.jsonl')

product_features_path = os.path.join(output_dir, 'product_features.npy')
queries_test_features_path = os.path.join(output_dir, 'queries_test_features.npy')
products_id_to_idx_path = os.path.join(output_dir, 'products_id_to_idx.pkl')

predictions_path = os.path.join(output_dir, 'predictions.txt')

model_path = os.path.join(output_dir, 'ranker.json')

In [None]:
# Load projected products and queries data.
products_projected = np.load(product_features_path)
queries_test_projected = np.load(queries_test_features_path)
with open(products_id_to_idx_path, 'rb') as f:
    products_id_to_idx = pickle.load(f)

In [None]:
# Load the original test data, which contains the results to be ranked.
test_data_df = pd.DataFrame(read_json_lines(test_data_path))

In [None]:
# Load the trained LambdaMART model.
model = xgb.Booster()
model.load_model(model_path)

In [None]:
BATCH_SIZE = 64
test_predictions = []
for batch_idx in tqdm(range(0, len(test_data_df), BATCH_SIZE)):
    batch_data = test_data_df['result_not_ranked'].iloc[batch_idx:batch_idx + BATCH_SIZE]
    batch_features = []
    for test_qid, test_candidates in enumerate(batch_data, start=batch_idx):
        test_query_projected = queries_test_projected[test_qid]
        for candidate_pid in test_candidates:
            p_idx = products_id_to_idx[candidate_pid]
            features = np.concatenate((products_projected[p_idx], test_query_projected))
            batch_features.append(features)
    
    batch_features = np.stack(batch_features)
    batch_features = xgb.DMatrix(batch_features)
    batch_preds = model.predict(batch_features)
    
    start_idx = 0
    for test_candidates in batch_data:
        preds_sample = batch_preds[start_idx:start_idx + len(test_candidates)]
        sorted_idx = np.argsort(preds_sample)[::-1]
        sorted_candidates = [test_candidates[i] for i in sorted_idx]
        test_predictions.append(sorted_candidates)
        start_idx += len(test_candidates)
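
The second half of the loop above cuts one flat array of scores back into per-query pieces and sorts each piece. The sketch below isolates that step with made-up candidate IDs and scores.

```python
import numpy as np

# Two hypothetical test queries with 3 and 2 candidate products.
batch_candidates = [[10, 11, 12], [20, 21]]
# One score per (query, candidate) pair, in the order the features were stacked.
batch_preds = np.array([0.2, 0.9, 0.5, 0.1, 0.7])

ranked = []
start = 0
for candidates in batch_candidates:
    scores = batch_preds[start:start + len(candidates)]
    order = np.argsort(scores)[::-1]  # indices from highest to lowest score
    ranked.append([candidates[i] for i in order])
    start += len(candidates)

print(ranked)
# [[11, 12, 10], [21, 20]]
```

Each inner list is one line of the final predictions file: the candidate IDs of one query, best first.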

In [None]:
def write_test_predictions(predictions_path, predictions):
    """Write one line per test query: its product IDs, best first, comma-separated."""
    lines = []
    for preds in predictions:
        lines.append(",".join([str(p_id) for p_id in preds]))

    with open(predictions_path, 'w') as f:
        f.write("\n".join(lines))

In [None]:
write_test_predictions(predictions_path, test_predictions)

Now you can submit the `predictions.txt` file from the `output_data` directory on the RoboEpics platform.