# Deep Learning Project - Data Preprocessing and Dataset Creation
##### Andrea Gervasio, Matricola Number: 1883259

# 0. Installing libraries and downloading required files

### Data Preprocessing

The dataset used in this experiment is MS MARCO, in particular the Document ranking dataset, publicly available in [this](https://github.com/microsoft/msmarco/blob/master/Datasets.md#document-ranking-dataset) GitHub repository. The notebook downloads all the necessary files automatically, so no manual step is required. \\
The main corpus is contained in the [msmarco-docs.tsv](https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz) file. It stores all 3.2 million documents of the dataset, each with its unique id, URL, title and body. \\
The train, validation and test splits each consist of two files: a .tsv file containing the queries and their query ids, and another file that maps each query id to the ids of its top 100 ranked documents.

Because of the Colab runtime constraints, I sampled 8000, 2000 and 2000 queries from the train, validation and test split respectively. Moreover, I only considered the first 10 ranked documents for each query.

The final dataset will then be made of:
*   a training dataset containing all the documents to be indexed and their queries;
*   two datasets, one for validation and one for testing, containing the queries and the ids of the first 10 ranked documents for each query.

### Usage
Recognizing that preprocessing the dataset is an essential aspect of the project, I created this explanatory notebook that details the code used to generate the train, validation, and test sets. \\
While this preprocessing notebook is integral to the work, it is not included as part of the model training notebook. Consequently, due to the large size of the original dataset and the limitations in Colab's free tier regarding runtime and storage, full reproducibility of the preprocessing steps is not expected. \\
The notebook is divided into 7 numbered sections, preceded by this setup section (number 0). The user should restart the Colab runtime after each section and re-run this setup section each time, to import the necessary libraries. \\
A Google Drive folder with all the intermediate files can be found at [this link.](https://drive.google.com/drive/folders/1ENf6VXPJNiu6-NJ1i10kuP6ss7VKfuz_?usp=sharing)

### Model training:
The definition and training of the model can be found at [this link.](https://colab.research.google.com/drive/1d5RXWi-ZAPjJwMGxZG_dbRePUYZvA8Bc?usp=sharing)

In [None]:
!pip install dask[dataframe]

Collecting dask-expr<1.2,>=1.1 (from dask[dataframe])
  Downloading dask_expr-1.1.11-py3-none-any.whl.metadata (2.5 kB)
INFO: pip is looking at multiple versions of dask-expr to determine which version is compatible with other requirements. This could take a while.
  Downloading dask_expr-1.1.10-py3-none-any.whl.metadata (2.5 kB)
  Downloading dask_expr-1.1.9-py3-none-any.whl.metadata (2.5 kB)
Downloading dask_expr-1.1.9-py3-none-any.whl (241 kB)
Installing collected packages: dask-expr
Successfully installed dask-expr-1.1.9


In [None]:
from tqdm.notebook import tqdm

import pandas as pd
import numpy as np
import dask.dataframe as df
import torch

import gzip
import pickle
import os

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [None]:
if not os.path.exists("/content/DLPreprocessing"):
  !gdown --fuzzy --folder https://drive.google.com/drive/folders/1_SLHgD4dK_gOPdRLizxQ06hWECBjiKZt?usp=drive_link

Retrieving folder contents
Processing file 1rfTx9PEPmtKZRtgAz8IG8gvwYMD_qbEl docleaderboard-queries.tsv.gz
Processing file 1qN5pfd73ly5v9NPBu0zE3mkpKpzqFKqs docleaderboard-top100.tsv.gz
Processing file 1no9744luKp0j1BJ8wfXIuf8g1uGEDFW5 msmarco-docdev-queries.tsv.gz
Processing file 18iTo9GAERW6a3jP1210fOf8VEHNfP3mX msmarco-docdev-top100.gz
Processing file 1Yq9CJ6vHh3-GPa69gLL1kPCoo0j6L5nY msmarco-doctrain-queries.tsv.gz
Processing file 1O-LS8oW1VTvgkHbsOlAJjCc2gDUcD9YO msmarco-doctrain-top100.gz
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1rfTx9PEPmtKZRtgAz8IG8gvwYMD_qbEl
To: /content/DLPreprocessing/docleaderboard-queries.tsv.gz
100% 102k/102k [00:00<00:00, 61.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1qN5pfd73ly5v9NPBu0zE3mkpKpzqFKqs
To: /content/DLPreprocessing/docleaderboard-top100.tsv.gz
100% 6.36M/6.36M [00:00<00:00, 50.2MB/s]
Downloading...
From: https

In [None]:
RANDOM_SEED = 1883259

# 1. Sampling queries

I load the query files and sample a different number of queries from each, depending on the split.

In [None]:
def sample_queries(path, number, split):
  '''
  Samples the training, validation and test queries.
  '''
  # The query files have no header row, so the column names are set explicitly;
  # otherwise the first query would be consumed as the header.
  queries = pd.read_csv(path, sep="\t", header=None, names=["qid", "query"])
  queries = queries.sample(number,
                           random_state = RANDOM_SEED).reset_index(drop = True)

  queries.to_csv(f"{split}_queries.csv.zip")

  print(f"{split} queries shape after sampling:", queries.shape)
  return queries

In [None]:
train_queries_path = "/content/DLPreprocessing/msmarco-doctrain-queries.tsv.gz"
train_queries = sample_queries(train_queries_path, 8000, "train")

val_queries_path = "/content/DLPreprocessing/msmarco-docdev-queries.tsv.gz"
val_queries = sample_queries(val_queries_path, 2000, "val")

test_queries_path = "/content/DLPreprocessing/docleaderboard-queries.tsv.gz"
test_queries = sample_queries(test_queries_path, 2000, "test")

train queries shape after sampling: (8000, 2)
val queries shape after sampling: (2000, 2)
test queries shape after sampling: (2000, 2)
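As a small illustration of the sampling step, the sketch below runs the same kind of seeded `DataFrame.sample` on a tiny in-memory TSV. The toy rows and the `toy_tsv` name are invented for the example; the only assumption carried over from the real data is that the query files have no header row, which is why `header=None` is passed.

```python
import io
import pandas as pd

RANDOM_SEED = 1883259  # same seed as the notebook

# Toy stand-in for a queries file: tab-separated, no header row.
toy_tsv = "\n".join(f"{100 + i}\tquery number {i}" for i in range(20))

queries = pd.read_csv(io.StringIO(toy_tsv), sep="\t",
                      header=None, names=["qid", "query"])

# Sampling twice with the same seed returns the same rows in the same order.
first = queries.sample(5, random_state=RANDOM_SEED).reset_index(drop=True)
second = queries.sample(5, random_state=RANDOM_SEED).reset_index(drop=True)

print(len(queries))          # all 20 rows are kept, including the first one
print(first.equals(second))
```

Fixing `random_state` is what makes the sampled query set repeatable across runtime restarts.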


# 2. Reduce the top100 files

Since I sampled the query files, I keep only the rows of the top100 files that belong to the sampled queries. After doing that, I take only the first 10 ranked documents for each query.

In [None]:
def reduce_top100(path, queries, split):
  '''
  Filters the top100 documents to get the ones sampled from the queries and
  reduces the number of documents to 10, then saves the new file.
  '''
  top100 = pd.read_table(path, delimiter = " ", header = None)
  top100.columns = ["qid", "Q0", "docid", "rank", "score", "runstring"]
  print(split, "documents shape before reduction:", top100.shape)

  condition = top100["qid"].isin(queries["qid"].unique())
  sampled_top100 = top100[condition].reset_index(drop = True)
  print(split, "documents shape after reduction:", sampled_top100.shape)

  sampled_top100.to_csv(f"{split}_top100.csv.zip")

  top10 = sampled_top100.copy()
  # Flag ranks 1..10; every other row gets NaN and is dropped below.
  # The helper column "tmp" stays in the saved file.
  top10["tmp"] = np.where(top10["rank"].between(1, 10), 1, np.nan)

  top10 = top10.dropna()

  print(split, "top10 shape:", top10.shape)

  top10.to_csv(f"{split}_top10.csv.zip")

  return top10

In [None]:
train_top_100_path = "/content/DLPreprocessing/msmarco-doctrain-top100.gz"
train_top10 = reduce_top100(train_top_100_path, train_queries, "train")

train documents shape before reduction: (36701116, 6)
train documents shape after reduction: (800000, 6)
train top10 shape: (80000, 7)


In [None]:
val_top_100_path = "/content/DLPreprocessing/msmarco-docdev-top100.gz"
val_top10 = reduce_top100(val_top_100_path, val_queries, "val")

val documents shape before reduction: (519300, 6)
val documents shape after reduction: (200000, 6)
val top10 shape: (20000, 7)


In [None]:
test_top_100_path = "/content/DLPreprocessing/docleaderboard-top100.tsv.gz"
test_top10 = reduce_top100(test_top_100_path, test_queries, "test")

test documents shape before reduction: (579300, 6)
test documents shape after reduction: (200000, 6)
test top10 shape: (20000, 7)
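The top-10 reduction can also be written as a plain boolean mask. The sketch below is a minimal illustration on an invented two-query run file, not the notebook's own function: it does not add a helper column, so the result keeps the six original columns, and it ends with a sanity check that every query retains exactly 10 documents.

```python
import pandas as pd

# Toy run file: 2 queries, 15 ranked documents each (ranks 1..15).
rows = [(qid, "Q0", f"D{qid}{rank:02d}", rank, 1.0 / rank, "toy")
        for qid in (1, 2) for rank in range(1, 16)]
top100 = pd.DataFrame(rows, columns=["qid", "Q0", "docid", "rank",
                                     "score", "runstring"])

# Keep only the first 10 ranks with a boolean mask (no helper column needed).
top10 = top100[top100["rank"].between(1, 10)].reset_index(drop=True)

# Sanity check: every query should keep exactly 10 documents.
per_query = top10.groupby("qid").size()
print(per_query.to_dict())
```

The same per-query count is a quick way to confirm the shapes printed above, for example 8000 queries times 10 documents giving 80000 rows for the train split.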


# 3. Reduce corpus

I apply the same reduction to the main corpus file, keeping only the documents that appear in the top-10 lists of the sampled queries.

In [None]:
!wget https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz

In [None]:
def decompress_gz_file(input_path, output_path, chunk_size=1024*1024):
  '''
  Decompress a .gz file in chunks.
  '''
  with gzip.open(input_path, 'rb') as input_file:
    with open(output_path, 'wb') as output_file:
      while True:
        chunk = input_file.read(chunk_size)
        if not chunk:
          break
        output_file.write(chunk)

In [None]:
input_gz_file = "msmarco-docs.tsv.gz"
output_file = "msmarco-docs.tsv"
decompress_gz_file(input_gz_file, output_file)
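For reference, the same chunked decompression can be expressed with `shutil.copyfileobj` from the standard library, which performs the read/write loop internally. This is only an equivalent sketch; the function name `decompress_with_copyfileobj` and the toy payload are made up for the example.

```python
import gzip
import os
import shutil
import tempfile

def decompress_with_copyfileobj(input_path, output_path, chunk_size=1024 * 1024):
    """Stream a .gz file to disk without loading it fully into memory."""
    with gzip.open(input_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)

# Round-trip check on a small temporary file (toy data, not the real corpus).
payload = b"D1\thttp://example.com\ttitle\tbody\n" * 1000
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "toy.tsv.gz")
    out_path = os.path.join(tmp, "toy.tsv")
    with gzip.open(gz_path, "wb") as f:
        f.write(payload)
    decompress_with_copyfileobj(gz_path, out_path)
    with open(out_path, "rb") as f:
        restored = f.read()

print(restored == payload)
```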

In [None]:
dataset_path = "msmarco-docs.tsv"
dataset = df.read_table(dataset_path, blocksize=100e6, header=None)
dataset.columns = ["docid", "url", "title", "body"]

In [None]:
def create_corpus(top10, dataset, split):
  '''
  Filters the corpus to get only the documents present in the split.
  '''
  condition = dataset["docid"].isin(top10["docid"].unique())
  corpus = dataset[condition].reset_index(drop=True)
  corpus = corpus.drop(columns="url")

  corpus["doc"] = corpus["title"] + " " + corpus["body"]

  print("Length of ", split, " corpus:", len(corpus))

  return corpus

In [None]:
train_corpus = create_corpus(train_top10, dataset, "train")
train_corpus = train_corpus.compute()
train_corpus.to_csv("train_corpus.csv.zip")

In [None]:
val_corpus = create_corpus(val_top10, dataset, "val")
val_corpus = val_corpus.compute()
val_corpus.to_csv("val_corpus.csv.zip")

In [None]:
test_corpus = create_corpus(test_top10, dataset, "test")
test_corpus = test_corpus.compute()
test_corpus.to_csv("test_corpus.csv.zip")
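The filtering logic itself does not depend on Dask. As a hedged alternative, the sketch below applies the same `isin` filter with plain pandas, reading the file in chunks through the `chunksize` argument of `read_csv`. The ten-row corpus and the `wanted` set are invented for the example; this is not the approach used in the cells above.

```python
import io
import pandas as pd

# Toy corpus in the same four-column layout (docid, url, title, body).
toy_corpus = "\n".join(
    f"D{i}\thttp://example.com/{i}\tTitle {i}\tBody of document {i}"
    for i in range(10))
wanted = {"D2", "D5", "D7"}  # docids that appear in a (toy) top-10 file

kept = []
reader = pd.read_csv(io.StringIO(toy_corpus), sep="\t", header=None,
                     names=["docid", "url", "title", "body"], chunksize=4)
for chunk in reader:
    # Filter each chunk independently, so only the matches stay in memory.
    kept.append(chunk[chunk["docid"].isin(wanted)])

corpus = pd.concat(kept, ignore_index=True).drop(columns="url")
corpus["doc"] = corpus["title"] + " " + corpus["body"]
print(corpus["docid"].tolist())
```

Dask parallelises this pattern across blocks; the chunked loop trades that parallelism for a smaller dependency footprint.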

# 4. Semantic Clustering

Following the [DSI paper](https://arxiv.org/pdf/2202.06991v3), I use semantically structured identifiers, which can improve the performance of the model, especially when beam search is used as a decoding strategy. \\
I implemented semantic clustering by computing the embeddings of the documents and applying a K-Means algorithm to cluster them and assign semantic ids.

For the embeddings, I used the Sentence Transformers library with a model hosted on HuggingFace. I chose [*msmarco-bert-base-dot-v5*](https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5), a sentence transformer based on the BERT model. It has the same embedding dimension as BERT, 768, and it was designed for semantic search.

*Note: this part of the notebook contains the creation of the embeddings. It is advisable to save the files created up to this point, and switch the runtime to work on GPU.*

In [None]:
train_path = "train_corpus.csv.zip"
train_corpus = pd.read_csv(train_path).drop(columns="Unnamed: 0")

val_path = "val_corpus.csv.zip"
val_corpus = pd.read_csv(val_path).drop(columns="Unnamed: 0")

test_path = "test_corpus.csv.zip"
test_corpus = pd.read_csv(test_path).drop(columns="Unnamed: 0")

In [None]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.met

In [None]:
from sentence_transformers import SentenceTransformer



In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

sentence_transformer = SentenceTransformer("sentence-transformers/msmarco-bert-base-dot-v5").to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
def create_embedding_dict(corpus):
  '''
  Creates a dictionary that stores the embeddings of the documents in the
  corpus, keyed by their docids. Only the body of each document is embedded.
  '''
  embeddings_dict = {}

  loop = tqdm(range(len(corpus)))

  for i in loop:
    docid = corpus['docid'].iloc[i]
    doc = corpus['body'].iloc[i]

    # Missing bodies are read back as NaN (a float), so they are skipped.
    if isinstance(doc, str):
      embedding = sentence_transformer.encode(doc)
      embeddings_dict[docid] = embedding
      torch.cuda.empty_cache()

  return embeddings_dict

In [None]:
train_embeddings = create_embedding_dict(train_corpus)

with open("train_embeddings.pkl", "wb") as f:
  pickle.dump(train_embeddings, f)

  0%|          | 0/72295 [00:00<?, ?it/s]

In [None]:
val_embeddings = create_embedding_dict(val_corpus)

with open("val_embeddings.pkl", "wb") as f:
  pickle.dump(val_embeddings, f)

  0%|          | 0/19396 [00:00<?, ?it/s]

In [None]:
test_embeddings = create_embedding_dict(test_corpus)

with open("test_embeddings.pkl", "wb") as f:
  pickle.dump(test_embeddings, f)

  0%|          | 0/19397 [00:00<?, ?it/s]
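Encoding one document per call is simple but leaves the GPU underused. `SentenceTransformer.encode` also accepts a list of texts, so the loop could be batched. The sketch below shows one possible batching helper; `embed_in_batches` and `fake_encode` are hypothetical names, and the stand-in encoder exists only so the example runs without downloading a model. In real use, `encode_fn` would presumably be `sentence_transformer.encode`.

```python
import numpy as np

def embed_in_batches(docids, texts, encode_fn, batch_size=32):
    """Encode texts batch by batch and return a {docid: vector} dictionary.

    Non-string entries (e.g. missing bodies read as NaN) are skipped,
    mirroring the type check in the notebook.
    """
    pairs = [(d, t) for d, t in zip(docids, texts) if isinstance(t, str)]
    embeddings = {}
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        vectors = encode_fn([t for _, t in batch])
        for (docid, _), vector in zip(batch, vectors):
            embeddings[docid] = vector
    return embeddings

# Hypothetical stand-in encoder: maps each text to a 2-d vector
# (number of characters, number of spaces).
def fake_encode(batch):
    return np.array([[len(t), t.count(" ")] for t in batch], dtype=float)

toy = embed_in_batches(["D1", "D2", "D3"],
                       ["a b c", float("nan"), "hello"],
                       fake_encode, batch_size=2)
print(sorted(toy))
```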

In [None]:
corpus_embeddings = {}

corpus_embeddings.update(train_embeddings)
corpus_embeddings.update(val_embeddings)
corpus_embeddings.update(test_embeddings)

In [None]:
with open("corpus_embeddings.pkl", "wb") as f:
  pickle.dump(corpus_embeddings, f)

In [None]:
with open("corpus_embeddings.pkl", "rb") as f:
  corpus_embeddings = pickle.load(f)

These are the functions to cluster the embeddings and generate the new semantic ids.

In [None]:
from sklearn.cluster import KMeans

def cluster_documents(embeddings):
  '''
  Cluster documents using KMeans.
  '''
  embeddings_values = list(embeddings.values())

  kmeans = KMeans(n_clusters = 10,
                  random_state = RANDOM_SEED).fit(embeddings_values)

  clusters = {i : [] for i in range(10)}
  for docid, label in zip(embeddings.keys(), kmeans.labels_):
    clusters[label].append(docid)

  return clusters

def generate_semantic_ids(embeddings, prefix=""):
  '''
  Recursively generates semantic ids with mapping.
  '''
  if len(embeddings) == 0:
    return {}

  clusters = cluster_documents(embeddings)

  new_ids = {}
  for i in range(10):
    cluster_ids = clusters[i]
    cluster_embeddings = {id : embeddings[id] for id in cluster_ids}

    if len(cluster_embeddings) > 100:
      temp_ids = generate_semantic_ids(cluster_embeddings,
                                       prefix=f"{prefix}{i}")
    else:
      temp_ids = {id : f"{prefix}{i}{j}" for j, id in enumerate(cluster_ids)}

    new_ids.update(temp_ids)

  return new_ids

In [None]:
semantic_map = generate_semantic_ids(corpus_embeddings)

  super()._check_params_vs_input(X, default_n_init=10)

In [None]:
with open("semantic_map.pkl", "wb") as f:
  pickle.dump(semantic_map, f)
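To make the structure of the generated ids easier to see, the toy sketch below reproduces the recursion (10 groups per level, leaves of at most 100 documents) on made-up one-dimensional data. The clustering step is deliberately replaced by a stand-in, `split_evenly`, which sorts the values and cuts them into equal chunks; it is not K-Means and is used only so the example is deterministic and runs without scikit-learn. `assign_ids` is likewise a hypothetical name.

```python
def split_evenly(items, k=10):
    """Stand-in for K-Means: sort by value and cut into k nearly equal chunks."""
    ordered = sorted(items, key=items.get)
    size = -(-len(ordered) // k)  # ceiling division
    return [ordered[i * size:(i + 1) * size] for i in range(k)]

def assign_ids(items, prefix="", k=10, leaf_size=100):
    """Recursively build ids as: cluster digits, then position inside the leaf."""
    ids = {}
    for i, members in enumerate(split_evenly(items, k)):
        if len(members) > leaf_size:
            # Group still too large: recurse and extend the prefix by one digit.
            sub = {m: items[m] for m in members}
            ids.update(assign_ids(sub, f"{prefix}{i}", k, leaf_size))
        else:
            # Leaf group: append the position of each document in the group.
            ids.update({m: f"{prefix}{i}{j}" for j, m in enumerate(members)})
    return ids

# 2500 toy "documents" whose single feature is their own number.
toy = {f"D{n}": float(n) for n in range(2500)}
ids = assign_ids(toy)

print(ids["D0"], ids["D2499"])
print(len(set(ids.values())) == len(ids))  # ids are unique
```

Because a group is either split further or treated as a leaf, never both, the leaf prefixes form a prefix-free set, which is why the ids stay unique even though they differ in length.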

# 5. Mapping the corpus

I map every document id in the train, validation and test corpora to its corresponding semantic one.

In [None]:
with open("semantic_map.pkl", "rb") as f:
  semantic_map = pickle.load(f)

In [None]:
def docid_to_semantic(docid):
  '''
  Maps the docid in the corpus to its semantic one.
  '''
  return semantic_map.get(docid, None)

def map_corpus(corpus):
  '''
  Maps the docids of the corpus to their semantic ones.
  '''
  mapped_corpus = corpus.copy()
  mapped_corpus["semantic_id"] = mapped_corpus["docid"].map(docid_to_semantic)

  return mapped_corpus

In [None]:
train_top100_path = "train_top100.csv.zip"
train_top100 = pd.read_csv(train_top100_path).drop(columns = "Unnamed: 0")

val_top100_path = "val_top100.csv.zip"
val_top100 = pd.read_csv(val_top100_path).drop(columns = "Unnamed: 0")

test_top100_path = "test_top100.csv.zip"
test_top100 = pd.read_csv(test_top100_path).drop(columns = "Unnamed: 0")

In [None]:
train_corpus_path = "train_corpus.csv.zip"
train_corpus = pd.read_csv(train_corpus_path).drop(columns = "Unnamed: 0")

val_corpus_path = "val_corpus.csv.zip"
val_corpus = pd.read_csv(val_corpus_path).drop(columns = "Unnamed: 0")

test_corpus_path = "test_corpus.csv.zip"
test_corpus = pd.read_csv(test_corpus_path).drop(columns = "Unnamed: 0")

In [None]:
mapped_train_top100 = map_corpus(train_top100)
mapped_val_top100 = map_corpus(val_top100)
mapped_test_top100 = map_corpus(test_top100)

mapped_train_corpus = map_corpus(train_corpus)
mapped_val_corpus = map_corpus(val_corpus)
mapped_test_corpus = map_corpus(test_corpus)

full_corpus = pd.concat([mapped_train_corpus,
                         mapped_val_corpus,
                         mapped_test_corpus], ignore_index=True)

In [None]:
full_corpus = full_corpus.dropna()

In [None]:
with open("mapped_train_top100.pkl", "wb") as f:
  pickle.dump(mapped_train_top100, f)

with open("mapped_val_top100.pkl", "wb") as f:
  pickle.dump(mapped_val_top100, f)

with open("mapped_test_top100.pkl", "wb") as f:
  pickle.dump(mapped_test_top100, f)

with open("full_corpus.pkl", "wb") as f:
  pickle.dump(full_corpus, f)
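Documents without an embedding (for example, those whose body was missing) have no entry in the semantic map and end up with a missing `semantic_id`. A quick coverage check, sketched below on invented toy data, shows how many rows are affected before they are dropped. The names `toy_semantic_map` and `toy_top100` are made up for the example.

```python
import pandas as pd

# Toy mapping and table; the ids are invented for the example.
toy_semantic_map = {"D1": "00", "D2": "01", "D4": "10"}
toy_top100 = pd.DataFrame({"qid": [7, 7, 8, 8],
                           "docid": ["D1", "D3", "D2", "D4"]})

# Series.map accepts a dict directly; missing keys become NaN.
toy_top100["semantic_id"] = toy_top100["docid"].map(toy_semantic_map)

unmapped = toy_top100["semantic_id"].isna()
print(f"{unmapped.sum()} of {len(toy_top100)} rows have no semantic id")
print(toy_top100.loc[unmapped, "docid"].tolist())
```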

# 6. Creating the training dataset

The training dataset will be made of the full corpus, where each document is labelled with its own semantic id, and the train queries, each labelled with the semantic id of its highest-ranked document.

In [None]:
with open("mapped_train_top100.pkl", "rb") as f:
  mapped_train_top100 = pickle.load(f)

with open("full_corpus.pkl", "rb") as f:
  full_corpus = pickle.load(f)

In [None]:
train_queries_path = "train_queries.csv.zip"
train_queries = pd.read_csv(train_queries_path)

I extract the highest-ranked document for each train query and create a dictionary that maps each query to that document's semantic id. \\
To save memory, I keep only the first 50 words of each document.

In [None]:
query_to_semantic_id_map = {}

for i in range(len(train_queries)):
  row = train_queries.iloc[i]
  qid = row["qid"]
  query = row["query"]

  top100 = mapped_train_top100[mapped_train_top100["qid"] == qid]
  if len(top100) > 0:
    docid = top100.iloc[0]["semantic_id"]
    query_to_semantic_id_map[query] = docid

semantic_dict = {"doc": list(query_to_semantic_id_map.keys()),
                 "semantic_id": list(query_to_semantic_id_map.values())}

train_queries_df = pd.DataFrame(semantic_dict)

full_corpus["doctype"] = "document"
train_queries_df["doctype"] = "query"

def shorten_document(row):
  # Keep only the first 50 words; slicing already handles shorter documents.
  words = row["doc"].split()
  return " ".join(words[:50])

full_corpus["doc"] = full_corpus.apply(shorten_document, axis=1)

full_train_corpus = pd.concat([full_corpus, train_queries_df],
                              ignore_index=True)

# Shuffle with a fixed seed so that the row order is reproducible.
full_train_corpus = full_train_corpus.sample(
    frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

full_train_corpus = full_train_corpus.dropna()
full_train_corpus = full_train_corpus.drop_duplicates()
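The query-to-label lookup above filters the whole top100 table once per query. A possible vectorised variant is sketched below on toy data: it sorts by rank explicitly, keeps the best row of each query, and joins the result with the queries. Note one difference from the loop, which relies on the rows already being in rank order. All frames and names here are invented for the illustration.

```python
import pandas as pd

toy_queries = pd.DataFrame({"qid": [1, 2, 3],
                            "query": ["first query", "second query", "third query"]})
toy_top100 = pd.DataFrame({
    "qid":         [1, 1, 2, 2],
    "rank":        [2, 1, 1, 2],
    "semantic_id": ["05", "31", "12", "40"],
})

# Sort by rank, then keep the first (best-ranked) row of every query.
best = (toy_top100.sort_values(["qid", "rank"])
                  .drop_duplicates("qid", keep="first")[["qid", "semantic_id"]])

# Inner join: queries without any ranked document (qid 3 here) are dropped.
labelled = toy_queries.merge(best, on="qid", how="inner")
query_to_semantic_id = dict(zip(labelled["query"], labelled["semantic_id"]))
print(query_to_semantic_id)
```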

In [None]:
full_train_corpus = full_train_corpus.drop(columns = ["docid", "title", "body"])

In [None]:
full_train_corpus.head()

|   | doc | semantic_id | doctype |
|---|---|---|---|
| 0 | What is Curriculum? From Managed Instruction t... | 667721 | document |
| 1 | How Long Can Lice Live Without a Host? How Lon... | 36183 | document |
| 2 | Which Rangefinder Should You Purchase ? Which ... | 417221 | document |
| 3 | What's the difference between latte, mocha, an... | 12044 | document |
| 4 | Quality management system From Wikipedia, the ... | 864910 | document |


In [None]:
with open("full_train_corpus.pkl", "wb") as f:
  pickle.dump(full_train_corpus, f)

In [None]:
full_corpus_path = "full_train_corpus.pkl"
full_corpus = pd.read_pickle(full_corpus_path)

In [None]:
corpus1 = full_corpus.iloc[:35000, :]
corpus2 = full_corpus.iloc[35000:70000, :]
corpus3 = full_corpus.iloc[70000:, :]

# 7. Query Generation

I use the [*castorini/doc2query-t5-base-msmarco*](https://github.com/castorini/docTTTTTquery?tab=readme-ov-file#predicting-queries-from-passages-t5-inference-with-pytorch) model to generate one query for each entry of the corpus. Each generated query is then prepended to the corresponding text and stored in the new corpus.

For computational reasons, the corpus is split into three segments. The user can choose which one to process with a parameter; however, all three should be processed.

*Note: this part of the notebook contains the generation of the query for each document. It is advisable to save the files created up to this point, and switch the runtime to work on GPU.*

In [None]:
with open("corpus1.pkl", "wb") as f:
  pickle.dump(corpus1, f)

with open("corpus2.pkl", "wb") as f:
  pickle.dump(corpus2, f)

with open("corpus3.pkl", "wb") as f:
  pickle.dump(corpus3, f)

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_name = "castorini/doc2query-t5-base-msmarco"

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [None]:
segment = 1 # @param ["1", "2", "3"] {type:"raw"}

In [None]:
path = f"corpus{segment}.pkl"

In [None]:
corpus = pd.read_pickle(path)

In [None]:
corpus = corpus.reset_index(drop = True)

In [None]:
loop = tqdm(range(len(corpus)))

for i in loop:
  doc = corpus.at[i, "doc"]

  input_ids = tokenizer.encode(doc, return_tensors='pt').to(device)
  outputs = model.generate(
      input_ids=input_ids,
      max_length=16,
      do_sample=True,
      top_k=3,
      num_return_sequences=1)
  query = tokenizer.decode(outputs[0], skip_special_tokens=True)

  # Prepend the generated query to the document. Writing through .at avoids
  # chained assignment (the index was reset above, so labels are 0..n-1).
  corpus.at[i, "doc"] = query + " " + doc

  0%|          | 0/35000 [00:00<?, ?it/s]

In [None]:
with open(f"query_generation_train_corpus{segment}.pkl", "wb") as f:
  pickle.dump(corpus, f)
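The notebook stops after saving each processed segment. If the three outputs need to be recombined into a single table, one possible way is sketched below. The helper name `merge_segments` is hypothetical, the file name pattern is the one used in the cell above, and the toy frames exist only to make the example runnable.

```python
import os
import tempfile
import pandas as pd

def merge_segments(directory, segments=(1, 2, 3)):
    """Load the per-segment pickles and stack them into one DataFrame."""
    parts = [pd.read_pickle(os.path.join(
                 directory, f"query_generation_train_corpus{s}.pkl"))
             for s in segments]
    return pd.concat(parts, ignore_index=True)

# Toy check with three tiny segments written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for s in (1, 2, 3):
        toy = pd.DataFrame({"doc": [f"generated query {s} document text {s}"],
                            "semantic_id": [str(s)],
                            "doctype": ["document"]})
        toy.to_pickle(os.path.join(tmp, f"query_generation_train_corpus{s}.pkl"))
    merged = merge_segments(tmp)

print(merged.shape)
```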