<a href="https://colab.research.google.com/github/NickVoulg02/Information-Retrieval/blob/main/colbert_test_link4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ColBERTv2: Indexing & Search Notebook
First, we'll fetch the ColBERT repository and install its dependencies; then we'll import the relevant classes. Note that `Indexer` and `Searcher` are the key actors here.

In [1]:
!git -C ColBERT/ pull || git clone https://github.com/stanford-futuredata/ColBERT.git
import sys; sys.path.insert(0, 'ColBERT/')


fatal: cannot change to 'ColBERT/': No such file or directory
Cloning into 'ColBERT'...
remote: Enumerating objects: 2576, done.
remote: Counting objects: 100% (1089/1089), done.
remote: Compressing objects: 100% (332/332), done.
remote: Total 2576 (delta 853), reused 801 (delta 757), pack-reused 1487
Receiving objects: 100% (2576/2576), 2.01 MiB | 14.42 MiB/s, done.
Resolving deltas: 100% (1606/1606), done.


In [2]:
try:  # When on Google Colab, let's install all dependencies with pip.
    import google.colab
    !pip install -U pip
    !pip install -e ColBERT/['faiss-gpu','torch']
except Exception:
    import sys; sys.path.insert(0, 'ColBERT/')
    try:
        from colbert import Indexer, Searcher
    except Exception:
        print("If you're running outside Colab, please make sure you install ColBERT in conda following the instructions in our README. You can also install (as above) with pip but it may install slower or less stable faiss or torch dependencies. Conda is recommended.")
        assert False

Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
     2.1/2.1 MB 8.8 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.2
Obtaining file:///content/ColBERT
  Preparing metadata (setup.py) ... done
Collecting bitarray (from colbert-ai==0.2.17)
  Downloading bitarray-2.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (34 kB)
Collecting datasets (from colbert-ai==0.2.17)
  Downloading datasets-2.16.1-py3-none-any.whl.metadata (20 kB)
Collecting git-python (from colbert-ai==0.2.17)
  Downloading git_python-1.0.3-py2.py3-none-any.whl (1.9 kB)
Collecting python-dotenv (from colbert-ai==0.2.17)
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting ninja (from colbert-

In [3]:
import colbert

In [4]:
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection

We will use the document collection file (`doc_col.tsv`) and the 20-query file (`queries_20.tsv`).

In [5]:
# Import the TSV files from a personal GitHub repository and build the docs and queries datasets.
!git clone https://github.com/NickVoulg02/Information-Retrieval.git
import pandas as pd
from datasets import Dataset

dataset = 'test'
df1 = pd.read_csv("Information-Retrieval/colbert_test/doc_col.tsv", delimiter='\t', index_col=0)
df2 = pd.read_csv("Information-Retrieval/colbert_test/queries_20.tsv", delimiter='\t', index_col=0)
collection = Dataset.from_pandas(df1, preserve_index=True)
query = Dataset.from_pandas(df2, preserve_index=True)
f'Loaded {len(query)} queries and {len(collection):,} passages'

Cloning into 'Information-Retrieval'...
remote: Enumerating objects: 87, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 87 (delta 23), reused 15 (delta 15), pack-reused 59
Receiving objects: 100% (87/87), 525.94 KiB | 14.21 MiB/s, done.
Resolving deltas: 100% (32/32), done.


'Loaded 20 queries and 1,209 passages'

In [6]:
print(query[0])
print(collection[0])

{'query': 'WHAT ARE THE EFFECTS OF CALCIUM ON THE PHYSICAL PROPERTIES OF MUCUS FROM CF PATIENTS', 'query_id': 0}
{'doc': 'PSEUDOMONAS AERUGINOSA INFECTION IN CYSTIC FIBROSIS OCCURRENCE OF PRECIPITATING ANTIBODIES AGAINST PSEUDOMONAS AERUGINOSA IN RELATION TO THE CONCENTRATION OF SIXTEEN SERUM PROTEINS AND THE CLINICAL AND RADIOGRAPHICAL STATUS OF THE LUNGS THE SIGNIFICANCE OF PSEUDOMONAS AERUGINOSA INFECTION IN THE RESPIRATORY TRACT OF 9 CYSTIC FIBROSIS PATIENTS HAVE BEEN STUDIED BY MEANS OF IMMUNOELECTROPHORETICAL ANALYSIS OF PATIENTS SERA FOR THE NUMBER OF PRECIPITINS AGAINST PSEUDOMONAS AERUGINOSA AND THE CONCENTRATIONS OF 16 SERUM PROTEINS IN ADDITION THE CLINICAL AND RADIOGRAPHICAL STATUS OF THE LUNGS HAVE BEEN EVALUATED USING 2 SCORING SYSTEMS PRECIPITINS AGAINST PSEUDOMONAS AERUGINOSA WERE DEMONSTRATED IN ALL SERA THE MAXIMUM NUMBER IN ONE SERUM WAS 22 THE CONCENTRATIONS OF 12 OF THE SERUM PROTEINS WERE SIGNIFICANTLY CHANGED COMPARED WITH MATCHED CONTROL PERSONS NOTABLY IGG AND 

## Indexing

For an efficient search, we can pre-compute the ColBERT representation of each passage and index them.
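
To make "ColBERT representation" more concrete: a passage is stored as a matrix with one embedding per token, and a query is scored against it by late interaction (MaxSim), where each query token contributes its best similarity over all passage tokens. The sketch below uses random NumPy vectors as stand-ins for real embeddings; `maxsim_score` is a hypothetical helper for illustration and not part of the ColBERT API, and the 128-dimensional size is an assumption.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Sum, over query tokens, of the maximum cosine similarity to any passage token."""
    # Normalize rows so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
query_emb = rng.normal(size=(4, 128))   # 4 query tokens (toy data)
doc_emb = rng.normal(size=(30, 128))    # 30 passage tokens (toy data)

# A passage that also contains the query's own token vectors matches perfectly.
doc_with_query = np.vstack([doc_emb, query_emb])

score_random = maxsim_score(query_emb, doc_emb)
score_match = maxsim_score(query_emb, doc_with_query)
print(score_random, score_match)
```

Because the passage side does not depend on the query, the passage matrices can be computed once and indexed ahead of time, which is what the indexing step does.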

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.

In [7]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300 # truncate passages at 300 tokens

index_name = f'{dataset}.{nbits}bits'
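
As a rough guide to what `nbits` controls, ColBERTv2 stores each token embedding as a centroid ID plus a residual quantized to `nbits` per dimension. The back-of-the-envelope sketch below assumes 128-dimensional embeddings and a 4-byte centroid ID (figures from the ColBERTv2 paper, not measured on this index); `bytes_per_embedding` is a hypothetical helper.

```python
def bytes_per_embedding(nbits: int, dim: int = 128, centroid_id_bytes: int = 4) -> int:
    """Approximate storage per token embedding: centroid ID + quantized residual."""
    residual_bytes = dim * nbits // 8  # nbits per dimension, packed into bytes
    return centroid_id_bytes + residual_bytes

sizes = {bits: bytes_per_embedding(bits) for bits in (1, 2)}
uncompressed_fp16 = 128 * 2  # 128 dimensions at 2 bytes each, for comparison
print(sizes, uncompressed_fp16)
```

Under these assumptions, `nbits = 2` costs roughly 36 bytes per token embedding versus 256 bytes for an uncompressed 16-bit vector.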

Now run the `Indexer` on the collection subset. Assuming the use of only one GPU, this cell should take about six minutes to finish running.

In [8]:
checkpoint = 'colbert-ir/colbertv2.0'

with Run().context(RunConfig(nranks=1, experiment='notebook')):  # nranks specifies the number of GPUs to use
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4) # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
                                                                                # Consider larger numbers for small datasets.

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=collection["doc"], overwrite=True)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


artifact.metadata:   0%|          | 0.00/1.63k [00:00<?, ?B/s]



[Jan 22, 15:42:23] #> Creating directory /content/experiments/notebook/indexes/test.2bits 


#> Starting...
#> Joined...


In [9]:
indexer.get_index() # You can get the absolute path of the index, if needed.

'/content/experiments/notebook/indexes/test.2bits'

## Search

Having built the index, we prepare a `Searcher` for it and can then search for individual query strings.

In [10]:
# To create the searcher using its relative name (i.e., not a full path), set
# experiment=value_used_for_indexing in the RunConfig.
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name, collection=collection["doc"])

[Jan 22, 15:46:08] #> Loading codec...
[Jan 22, 15:46:08] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jan 22, 15:46:08] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jan 22, 15:46:08] #> Loading IVF...
[Jan 22, 15:46:08] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 3457.79it/s]

[Jan 22, 15:46:08] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 146.49it/s]


In [60]:
# question = query["query"][1]
question = Queries("Information-Retrieval/queries_201.tsv")   # queries_20.tsv without the header
print(f"#> {question}")

# Find the top-120 passages for the first query
results = searcher.search(question[0], k=120)
print(results)
# full length search? check file

passages_ranked = {}
# Print out the top-k retrieved passages
print("Rank\tScore\tId\tPassage")
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"[{passage_rank}] \t{passage_score:.1f} \t{collection['doc_id'][passage_id]} \t{searcher.collection[passage_id]}")
    passages_ranked[passage_id] = passage_rank

[Jan 22, 17:29:51] #> Loading the queries from Information-Retrieval/queries_201.tsv ...
[Jan 22, 17:29:51] #> Got 20 queries. All QIDs are unique.

#> <colbert.data.queries.Queries object at 0x7aae98f2f910>
([516, 431, 934, 473, 721, 807, 491, 937, 1171, 427, 734, 1156, 485, 1081, 136, 487, 546, 51, 144, 550, 847, 383, 1139, 722, 1117, 441, 944, 855, 470, 240, 259, 952, 510, 186, 148, 617, 197, 639, 927, 482, 541, 295, 1177, 1147, 692, 444, 509, 10, 1002, 830, 503, 505, 578, 756, 429, 965, 671, 1145, 1207, 875, 294, 946, 1143, 657, 881, 84, 377, 1170, 858, 770, 455, 107, 1141, 451, 735, 959, 992, 388, 995, 420, 306, 425, 891, 674, 745, 1159, 574, 746, 281, 373, 1051, 260, 54, 523, 544, 201, 571, 904, 39, 453, 1094, 840, 629, 774, 187, 549, 1173, 525, 731, 163, 950, 738, 170, 134, 577, 484, 488, 62, 169, 520], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
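
The `queries_201.tsv` file loaded above is described as `queries_20.tsv` without its header row. One way such a file could be produced with pandas is sketched below on toy data; the helper name `strip_tsv_header`, the column names, and the temporary paths are assumptions for illustration, not the author's actual procedure.

```python
import os
import tempfile

import pandas as pd

def strip_tsv_header(src_path: str, dst_path: str) -> int:
    """Copy a TSV file without its header row; return the number of data rows."""
    df = pd.read_csv(src_path, sep="\t")
    df.to_csv(dst_path, sep="\t", header=False, index=False)
    return len(df)

# Build a tiny TSV with a header so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "queries_with_header.tsv")
dst = os.path.join(tmp_dir, "queries_no_header.tsv")
with open(src, "w") as fh:
    fh.write("query_id\tquery\n0\tfirst toy query\n1\tsecond toy query\n")

n_rows = strip_tsv_header(src, dst)
with open(dst) as fh:
    lines = fh.read().splitlines()
print(n_rows, lines)
```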

## Metrics


In [61]:
# Each line of Relevant_20 lists the relevant document numbers for one query.
relevant_docs = []
with open("Information-Retrieval/Relevant_20", "r") as f:
    for line in f.read().splitlines():
        # Prefix every number with 'doc' to match the keys used in the metric cells.
        relevant_docs.append(['doc' + str(int(num)) for num in line.split()])

### Mean Average Precision

In [67]:
# avg_pr_list = []
for i in range(20):
    results = searcher.search(question[i], k=400)
    # Map each retrieved passage id to its 1-based rank.
    passages_ranked = {}
    for passage_id, passage_rank, passage_score in zip(*results):
        passages_ranked[passage_id] = passage_rank
    precision_at_k = []
    recall_at_k = []
    true_positives = 0

    for key, value in passages_ranked.items():
        doc = "doc" + str(collection['doc_id'][key])
        if doc in relevant_docs[i]:
            true_positives += 1
            precision_at_k.append(true_positives / value)
            recall_at_k.append(true_positives / len(relevant_docs[i]))

    # Note: each precision@k is weighted by recall@k here. The textbook average
    # precision sums precision@k alone at the relevant ranks before dividing
    # by the number of relevant documents, so these values are lower than it.
    average_precision = 0
    for x in range(len(precision_at_k)):
        average_precision += precision_at_k[x] * recall_at_k[x]

    average_precision = average_precision / len(relevant_docs[i])
    print(average_precision)
    # avg_pr_list.append(average_precision)

# mean_average_precision = sum(avg_pr_list) / 20
# print(mean_average_precision)

0.12064240604644527
0.05778769841269841
0.03727467579632158
0.077378055351569
0.029028859789991122
0.0451372762945074
0.015614722990647293
0.018055090011407263
0.10758473441452165
0.06199334075503886
0.22026992407388776
0.029319945024747034
0.019303477091204002
0.04042878167057118
0.023373417203075265
0.04868709221683805
0.018224388665211644
0.031986484945198866
0.02031230112695237
0.15614637822938712
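
For comparison, the textbook definition of average precision sums precision@k at every rank k where a relevant document appears and divides by the number of relevant documents; MAP is then the mean of that value over all queries. The sketch below is a self-contained illustration on toy data. `average_precision` and the document IDs are hypothetical, and its result is unrelated to the values printed above.

```python
def average_precision(ranked_ids, relevant_ids):
    """Textbook AP: mean of precision@k over the ranks of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank at a relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)

ranked = ["doc3", "doc7", "doc1", "doc9"]   # toy ranking, best first
relevant = ["doc3", "doc1", "doc5"]         # toy relevance judgments
ap = average_precision(ranked, relevant)
print(ap)
```

Here the relevant hits are at ranks 1 and 3, so the result is (1/1 + 2/3) / 3.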


### Mean Reciprocal Rank

In [70]:
mean_rep_rank = 0
for i in range(20):
    results = searcher.search(question[i], k=400)
    passages_ranked = {}
    for passage_id, passage_rank, passage_score in zip(*results):
        passages_ranked[passage_id] = passage_rank

    # Add the reciprocal rank of the first relevant passage, then stop.
    for key, value in passages_ranked.items():
        doc = "doc" + str(collection['doc_id'][key])
        if doc in relevant_docs[i]:
            mean_rep_rank += 1 / value
            break

mean_rep_rank = mean_rep_rank / 20
print(mean_rep_rank)


0.7333333333333333
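
The same computation can be written as two small pure functions, which makes it easy to check on toy data without a `Searcher`. This is a minimal sketch; the function names and document IDs are hypothetical, and a query with no relevant passage retrieved is assumed to contribute 0.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, relevance):
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevance)]
    return sum(scores) / len(scores)

rankings = [["doc2", "doc5"], ["doc8", "doc1", "doc4"]]  # toy rankings, one per query
relevance = [["doc2"], ["doc4"]]                         # toy relevance judgments
mrr = mean_reciprocal_rank(rankings, relevance)
print(mrr)
```

The first toy query has its relevant document at rank 1 and the second at rank 3, so the result is (1 + 1/3) / 2.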
