<a href="https://colab.research.google.com/github/Twenkid/Vsy-Jack-Of-All-Trades-AGI-Bulgarian-Internet-Archive-And-Search-Engine/blob/main/code/retrieve/Colbertv2_Retrieval_16_12_2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ColBERTv2: Indexing & Search Notebook

If you're working in Google Colab, we recommend selecting "GPU" as your hardware accelerator in the runtime settings.

First, we'll install the necessary dependencies. Then we'll import the relevant classes; note that `Indexer` and `Searcher` are the key actors here.

**Continued: Twenkid 16.12.2024**

In [None]:
!pip install --upgrade torch torchvision torchaudio
#Restart Session!

In [None]:
!git -C ColBERT/ pull || git clone https://github.com/stanford-futuredata/ColBERT.git
import sys; sys.path.insert(0, 'ColBERT/')


Already up to date.


In [None]:
try: # When on google Colab, let's install all dependencies with pip.
    import google.colab
    !pip install -U pip
    !pip install -e ColBERT/['faiss-gpu','torch']
except Exception:
  import sys; sys.path.insert(0, 'ColBERT/')
  try:
    from colbert import Indexer, Searcher
  except Exception:
    print("If you're running outside Colab, please make sure you install ColBERT in conda following the instructions in our README. You can also install (as above) with pip but it may install slower or less stable faiss or torch dependencies. Conda is recommended.")
    assert False

Obtaining file:///content/ColBERT
  Preparing metadata (setup.py) ... done
Collecting torch==1.13.1 (from colbert-ai==0.2.20)
  Using cached torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl.metadata (24 kB)
Using cached torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)
Installing collected packages: torch, colbert-ai
  Attempting uninstall: torch
    Found existing installation: torch 2.5.1
    Uninstalling torch-2.5.1:
      Successfully uninstalled torch-2.5.1
  Attempting uninstall: colbert-ai
    Found existing installation: colbert-ai 0.2.20
    Uninstalling colbert-ai-0.2.20:
      Successfully uninstalled colbert-ai-0.2.20
  DEPRECATION: Legacy editable install of colbert-ai[faiss-gpu,torch]==0.2.20 from file:///content/ColBERT (setup.py develop) is deprecated. pip 25.0 will enforce this behaviour change. A possible replacement is to add a pyproject.toml or enable --use-pep517, and use setuptools >= 64. If the resulting installation is not behaving as expe

In [None]:
!pip install --upgrade torch torchvision torchaudio

Collecting torch
  Using cached torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Using cached torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl (906.4 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.13.1
    Uninstalling torch-1.13.1:
      Successfully uninstalled torch-1.13.1
Successfully installed torch-2.5.1


In [None]:
import torch
print(torch.__version__)
print(torch.cuda.is_available()) # Check if CUDA is available if you're using GPU

2.5.1+cu124
True


In [None]:
import torch
import colbert

In [None]:
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection

We will use the dev set of the **LoTTE benchmark** we recently introduced in the ColBERTv2 paper. We'll download it from HuggingFace datasets. The dev and test sets contain several domain-specific corpora, and we'll use the smallest dev set corpus, namely lifestyle:dev.

For the purposes of a quick demo, we will only run the `Indexer` on the first `max_id` passages (`max_id` is set to 110,000 in the Indexing section). As we do this, let's also remove the queries whose relevant passages are all outside this subset of passages.

In [None]:
from datasets import load_dataset

dataset = 'lifestyle'
datasplit = 'dev'

collection_dataset = load_dataset("colbertv2/lotte_passages", dataset)
collection = [x['text'] for x in collection_dataset[datasplit + '_collection']]

queries_dataset = load_dataset("colbertv2/lotte", dataset)
queries = [x['query'] for x in queries_dataset['search_' + datasplit]]

f'Loaded {len(queries)} queries and {len(collection):,} passages'

'Loaded 417 queries and 268,893 passages'

In [None]:
from datasets import load_dataset

def load(s="technology"):
  """Return (queries, collection) for the given LoTTE dev corpus."""
  dataset = s  # e.g. 'lifestyle'
  datasplit = 'dev'
  collection_dataset = load_dataset("colbertv2/lotte_passages", dataset)
  collection = [x['text'] for x in collection_dataset[datasplit + '_collection']]

  queries_dataset = load_dataset("colbertv2/lotte", dataset)
  queries = [x['query'] for x in queries_dataset['search_' + datasplit]]

  # A bare f-string inside a function is silently discarded, so print it explicitly.
  print(f'Loaded {len(queries)} queries and {len(collection):,} passages')
  return queries, collection



In [None]:
query1, techno = load("technology")  # ran for 2:08:29 without finishing, so it was stopped manually

In [None]:
print(query1[0:10])
print(techno[212456])

This loaded 417 queries and 269k passages. Let's inspect one query and one passage to verify we have done so correctly.

In [None]:
print(queries[24])
print()
print(collection[19929])
print()

are blossom end rot tomatoes edible?

I think the spraying thing is not after, its during. The cold will freeze the mist, keeping the air around the trees at (but not below) freezing. See http://www.ehow.com/how_5805520_use-freeze-damage-fruit-trees.html for example which recommends a sprinkler. The releases heat thing is kind of an oversimplification, but basically as long as you have any liquid water around, it will keep things at zero. The sap of your tree is not pure water, and therefore freezes somewhat below zero. By having the water freeze instead you stay away from the temps that would damage your plants. That said, http://www.ehow.com/how-does_5245655_spraying-frost-protect-fruit-freezing_.html is total gibberish since evaporation doesnt generate heat, quite the opposite. There is a better explanation at http://www.gardenguides.com/135830-spray-water-plants-during-frost.html This is a picture from a blog entry that gives you details from the citrus farmers point of view.



## Indexing

For an efficient search, we can pre-compute the ColBERT representation of each passage and index them.

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.
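
To make it clearer what the pre-computed passage representations are for, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring in plain Python. Each query token keeps its best match among the passage tokens, and those maxima are summed. The toy 2-dimensional vectors and the `maxsim_score` helper are assumptions for illustration only; the ColBERT package does this on GPU over compressed embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product against any passage token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy token embeddings (made up for this sketch).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # matches both query tokens exactly
passage_b = [[0.5, 0.5], [0.6, 0.0]]               # only partial matches

score_a = maxsim_score(query_vecs, passage_a)
score_b = maxsim_score(query_vecs, passage_b)
print(score_a, score_b)  # passage_a scores higher than passage_b
```

Because the passage vectors do not depend on the query, they can be computed once at indexing time, which is what the `Indexer` does.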

In [None]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300 # truncate passages at 300 tokens
max_id = 110000

index_name = f'{dataset}.{datasplit}.{nbits}bits'
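
As a rough picture of what "2 bits per dimension" means, the sketch below assigns each dimension of a residual vector to one of `2**nbits` buckets and later replaces it with a representative value for that bucket. The names `bucket_cutoffs` and `bucket_weights` mirror what the index loads at search time, but the numbers and the two helper functions are hypothetical; the real codec also stores a centroid id per embedding and packs the codes into bytes.

```python
import bisect

def quantize_residual(residual, bucket_cutoffs):
    """Map each residual dimension to a bucket index in [0, len(bucket_cutoffs)]."""
    return [bisect.bisect_right(bucket_cutoffs, x) for x in residual]

def dequantize(codes, bucket_weights):
    """Replace each bucket index with that bucket's representative value."""
    return [bucket_weights[c] for c in codes]

nbits = 2
bucket_cutoffs = [-0.05, 0.0, 0.05]          # hypothetical: 2**nbits - 1 boundaries
bucket_weights = [-0.08, -0.02, 0.02, 0.08]  # hypothetical: one value per bucket

residual = [0.07, -0.01, 0.03, -0.2]
codes = quantize_residual(residual, bucket_cutoffs)
approx = dequantize(codes, bucket_weights)

print(codes)   # [3, 1, 2, 0]
print(approx)  # [0.08, -0.02, 0.02, -0.08]
```

With `nbits = 2`, every dimension costs two bits on disk instead of a full float, at the price of the approximation error visible in `approx`.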

To save space and time, we will only run the `Indexer` on the first `max_id` passages (110,000 with the setting above). To match, we keep only the queries that have at least one relevant passage with an id below `max_id`.

In [None]:
answer_pids = [x['answers']['answer_pids'] for x in queries_dataset['search_' + datasplit]]
filtered_queries = [q for q, apids in zip(queries, answer_pids) if any(x < max_id for x in apids)]

f'Filtered down to {len(filtered_queries)} queries'

'Filtered down to 141 queries'
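
The filter above can be checked on a toy example. The queries, passage ids and `max_id` below are made up; the point is only that a query survives when at least one of its relevant passage ids falls inside the indexed prefix.

```python
# Toy data (hypothetical): three queries with their relevant passage ids.
queries = ["q0", "q1", "q2"]
answer_pids = [[5, 120], [300], [7]]
max_id = 100  # pretend only passages 0..99 are indexed

# Keep a query if any relevant passage id is below max_id.
filtered_queries = [
    q for q, pids in zip(queries, answer_pids)
    if any(pid < max_id for pid in pids)
]

# Optionally keep only the relevant ids that can actually be retrieved.
reachable = {
    q: [pid for pid in pids if pid < max_id]
    for q, pids in zip(queries, answer_pids)
    if q in filtered_queries
}

print(filtered_queries)  # ['q0', 'q2']
print(reachable)         # {'q0': [5], 'q2': [7]}
```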

Now run the `Indexer` on the collection subset. Assuming the use of only one GPU, this cell should take about six minutes to finish running.

In [None]:
checkpoint = 'colbert-ir/colbertv2.0'
from colbert.infra import Run, RunConfig, ColBERTConfig # import the necessary classes
with Run().context(RunConfig(nranks=1, experiment='notebook')):  # nranks specifies the number of GPUs to use
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4) # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
                                                                                # Consider larger numbers for small datasets.

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=collection[:max_id], overwrite=True)

artifact.metadata:   0%|          | 0.00/1.63k [00:00<?, ?B/s]



[Dec 16, 08:21:01] #> Creating directory /content/experiments/notebook/indexes/lifestyle.dev.2bits 


#> Starting...
#> Joined...


In [None]:
p = indexer.get_index() # You can get the absolute path of the index, if needed.
print(p)

/content/experiments/notebook/indexes/lifestyle.dev.2bits
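
Judging from the path printed above, the index appears to live under `<root>/<experiment>/indexes/<index_name>`. The helper below is a hypothetical sketch that rebuilds that path from its parts; it is not a ColBERT function, and the actual resolution may depend on other settings.

```python
import posixpath

def expected_index_path(root, experiment, index_name):
    """Rebuild the index directory in the layout inferred from the notebook output."""
    return posixpath.join(root, experiment, "indexes", index_name)

dataset, datasplit, nbits = "lifestyle", "dev", 2
index_name = f"{dataset}.{datasplit}.{nbits}bits"

path = expected_index_path("/content/experiments", "notebook", index_name)
print(path)  # /content/experiments/notebook/indexes/lifestyle.dev.2bits
```

This is useful to keep in mind when searching: if the experiment name differs between indexing and searching, the lookup lands in a different directory.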


In [None]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300 # truncate passages at 300 tokens
max_id = 110000

index_name = f'{dataset}.{datasplit}.{nbits}bits'
answer_pids = [x['answers']['answer_pids'] for x in queries_dataset['search_' + datasplit]]
filtered_queries = [q for q, apids in zip(queries, answer_pids) if any(x < max_id for x in apids)]

f'Filtered down to {len(filtered_queries)} queries'

checkpoint = 'colbert-ir/colbertv2.0'
from colbert.infra import Run, RunConfig, ColBERTConfig # import the necessary classes
with Run().context(RunConfig(nranks=1, experiment='notebook')):  # nranks specifies the number of GPUs to use
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4) # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
                                                                                # Consider larger numbers for small datasets.

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=collection[:max_id], overwrite=True)

p1 = indexer.get_index() # You can get the absolute path of the index, if needed.
print(p1)

## Search

Having built the index, we can now prepare a `Searcher` and search for individual query strings.

We can use the `queries` set we loaded earlier, or you can supply your own questions. Feel free to get creative! But keep in mind that the index covers only the first `max_id` passages of the roughly 269k-passage lifestyle collection, so it can only answer a small, focused set of questions!

In [None]:
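# Note: calling context() without `with` never enters the context, so the
# experiment name is not applied and the lookup falls back to 'default'.
Run().context(RunConfig(experiment='notebook'))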
searcher = Searcher(index=index_name, collection=collection)

FileNotFoundError: [Errno 2] No such file or directory: '/content/experiments/default/indexes/lifestyle.dev.2bits/plan.json'
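
The error path points at `experiments/default/` rather than `experiments/notebook/`, which is consistent with the `RunConfig` never having been applied. A likely cause is that `context(...)` was called without a `with` statement. The sketch below shows the general Python behaviour with a toy `run_context` built on `contextlib`; it is a hypothetical stand-in, not ColBERT's implementation.

```python
from contextlib import contextmanager

settings = {"experiment": "default"}

@contextmanager
def run_context(experiment):
    """Temporarily override the experiment name (toy stand-in for a run context)."""
    previous = settings["experiment"]
    settings["experiment"] = experiment
    try:
        yield
    finally:
        settings["experiment"] = previous

# Creating the context manager without entering it changes nothing.
run_context("notebook")
without_with = settings["experiment"]

# Entering it with `with` applies the override for the duration of the block.
with run_context("notebook"):
    inside_with = settings["experiment"]

after_with = settings["experiment"]
print(without_with, inside_with, after_with)  # default notebook default
```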

In [None]:
searcher = None
# To create the searcher using its relative name (i.e., not a full path), set
# experiment=value_used_for_indexing in the RunConfig.
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name, collection=collection)


# If you want to customize the search latency--quality tradeoff, you can also supply a
# config=ColBERTConfig(ncells=.., centroid_score_threshold=.., ndocs=..) argument.
# The default settings with k <= 10 (1, 0.5, 256) gives the fastest search,
# but you can gain more extensive search by setting larger values of k or
# manually specifying more conservative ColBERTConfig settings (e.g. (4, 0.4, 4096)).

[Dec 16, 09:00:01] #> Loading codec...
[Dec 16, 09:00:01] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


  self.scaler = torch.cuda.amp.GradScaler()
  centroids = torch.load(centroids_path, map_location='cpu')
  avg_residual = torch.load(avgresidual_path, map_location='cpu')
  bucket_cutoffs, bucket_weights = torch.load(buckets_path, map_location='cpu')
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


[Dec 16, 09:00:01] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


[Dec 16, 09:00:02] #> Loading IVF...


  ivf, ivf_lengths = torch.load(os.path.join(self.index_path, "ivf.pid.pt"), map_location='cpu')


[Dec 16, 09:00:02] #> Loading doclens...


100%|██████████| 5/5 [00:00<00:00, 249.31it/s]

[Dec 16, 09:00:02] #> Loading codes and residuals...



  return torch.load(codes_path, map_location='cpu')
  return torch.load(residuals_path, map_location='cpu')
100%|██████████| 5/5 [00:00<00:00,  5.80it/s]


In [None]:
searcher

<colbert.searcher.Searcher at 0x7beceac0b6a0>

In [None]:
query = filtered_queries[13] # try with an in-range query or supply your own
print(f"#> {query}")

# Find the top-3 passages for this query
results = searcher.search(query, k=3)

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

#> are some cats just skinny?

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: are some cats just skinny?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2024,  2070,  8870,  2074, 15629,  1029,   102,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103], device='cuda:0')
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')
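
The tokenizer output above has a fixed length of 32: the ids start with 101 and 1, continue with the query's word pieces, then 102, and the rest is filled with 103 while the mask is 0 there. The sketch below reproduces that shape. It assumes that 101, 102 and 103 are BERT's `[CLS]`, `[SEP]` and `[MASK]` ids and that 1 is the query marker; `pad_query` is a hypothetical helper, not the package's tokenizer, and it assumes the query fits within `query_maxlen`.

```python
CLS_ID, SEP_ID, MASK_ID, Q_MARKER_ID = 101, 102, 103, 1

def pad_query(token_ids, query_maxlen=32):
    """Wrap the query ids with special tokens and pad to a fixed length with [MASK]."""
    ids = [CLS_ID, Q_MARKER_ID] + list(token_ids) + [SEP_ID]
    n_real = len(ids)
    padding = query_maxlen - n_real          # assumes n_real <= query_maxlen
    ids = ids + [MASK_ID] * padding
    mask = [1] * n_real + [0] * padding      # 1 for real tokens, 0 for padding
    return ids, mask

# Word-piece ids for "are some cats just skinny?" copied from the output above.
ids, mask = pad_query([2024, 2070, 8870, 2074, 15629, 1029])
print(len(ids), sum(mask))  # 32 9
```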



  return torch.cuda.amp.autocast() if self.activated else NullContextManager()


	 [1] 		 25.4 		 A cat can certainly be naturally skinny. I know one who was the runt of her litter and has been extremely thin all her life, to the point where you can easily count every bone. She is now 17 years old, having outlived two other cats in that household, so it certainly doesnt seem to have held her back.
	 [2] 		 25.0 		 Yes. Just like us, cats vary in size and shape and weight. And like us, some of that is diet, some is health, some is genetics, some is age. One of my ladys cats was the runt of the litter; it wasnt certain she would survive, and she has always been both small and skinny. It hasnt seemed to limit her climbing/jumping much, if at all; I think she benefits from square/cube law to be stronger relative to her weight than you would expect. Another cat in the family was not only longer/taller/broader but also more solidly muscled. I think he may have weighed twice what the small one did, without being overweight. (The simplest rule-of-thumb test for whether a c
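
The loop in the cell above unpacks `zip(*results)` into an id, a rank and a score, which suggests `results` is a tuple of three parallel lists. The sketch below shows that transposition on made-up values; the ids and scores are hypothetical and no `Searcher` is involved.

```python
# Hypothetical search output: parallel lists of passage ids, ranks and scores.
results = ([4021, 17, 933], [1, 2, 3], [25.4, 25.0, 24.1])

# zip(*results) turns three parallel lists into one (id, rank, score) tuple per hit.
rows = list(zip(*results))

for passage_id, passage_rank, passage_score in rows:
    print(f"[{passage_rank}] {passage_score:.1f} passage #{passage_id}")
```

Keeping the rows as tuples makes it easy to filter hits, for example by a minimum score, before printing the passages.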

In [None]:
def many():
    """Run every filtered query and print its top-3 passages."""
    for query in filtered_queries:
        print(f"#> {query}")

        # Find the top-3 passages for this query
        results = searcher.search(query, k=3)

        # Print out the top-k retrieved passages
        for passage_id, passage_rank, passage_score in zip(*results):
            print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

In [None]:
many()

#> how much should i feed my 1 year old english mastiff?
	 [1] 		 21.5 		 I have a 2 1/2 year old bull mastiff. I have been feeding him Blue Buffalo since I got him at 8 weeks old. He is very lean and active for a bull mastiff. I feed him about 3-4 cups twice a day which averages about 130.00 a month. It is very important that you can afford this breed. I just had to take mine to the vet because he developed some sort of allergies on his skin, eyes and ears and the vet bill was $210.00 with all his medication. This wasnt an option I had to take him an get all his meds or he would have gotten worse. Theyre just like your children, you can expect things to come up and you need to be able to care for them.
	 [2] 		 18.0 		 I breed mastiffs so I will try to help you with this: Age Amount 4-8 weeks 3-4 cups per day spread between 3-4 meals 8-12 weeks 4-6 cups per day spread between 3-4 meals 12-16 weeks 6-8 cups per day spread between 3-4 meals 4 to 6 months 8-10 cups per day spread between

In [None]:
def interact():
    """Prompt for one query and print its top-8 passages; return False on 'q' or 'Q'."""
    query = input("Query... ")
    if query in ("Q", "q"):
        return False

    # Find the top-8 passages for this query
    results = searcher.search(query, k=8)

    # Print out the top-k retrieved passages
    for passage_id, passage_rank, passage_score in zip(*results):
        print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

    return True

In [None]:
# Word Wrap for long lines #1.3.2024
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
interact()

Query... What is a computer?
	 [1] 		 16.6 		 I think that computers are incredibly powerful tools for learning, and getting them in the hands of as many children as possible is a good thing. Certainly, supervising your children, especially when they have internet access, is a good idea. My 8yo keeps his netbook on a small work table in my office, so we have our computer time together. I do like that hes mobile -- hes even taken to coming to tech conferences with me (he spoke at one of them) -- and our rules at home are plenty to keep the computer where it belongs.
	 [2] 		 15.6 		 When they show a clear interest in a deeper understanding of computers. And yes, that can very easily happen at pretty much any age. For some, it may take 3 years. For some, it may take 6 years. For many, it takes 12 years. For most, it never happens. And you should be able to accept that. Computers are really not very interesting to anyone who is not interested in computers. I dont understand what is it abo

In [None]:
# Keep prompting until interact() returns False (the user entered 'q' or 'Q').
while interact():
    pass


Query... Where is home?
	 [1] 		 16.5 		 For social animals, home is where the rest of their pack is, not a specific physical location. Assuming the animal left home under their own power, they can find their way back by following their own scent. If the rest of the pack has moved on in the meantime, they can follow the packs scent until catching up to them. Leaving under their own power is an important point, though. If you walk your dog to the park, he should be able to backtrack to your home if you get separated. If you take him in a car, though, that wont work; he will aimlessly wander until he picks up a familiar scent—or until Animal Control captures him.
	 [2] 		 14.6 		 We usually start from home, set off on a journey and eventually arrive back home. Home is usually the tonic chord: 1 V (the dominant 7) can take us back home of to a new, temporary home. So a C major chord can lead to its V, the G7 and G7 can point us back to C. But you can use a dominant 7 chord to take you off
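
One way to make the interactive loop easier to test is to pass in the input and search functions instead of calling `input()` and `searcher.search()` directly. The sketch below is a hypothetical variant along those lines; `run_loop` and `fake_search` are made-up names, and the scripted queries stand in for a user typing at the prompt.

```python
def run_loop(read_query, search, k=3):
    """Answer queries until the user enters 'q' or 'Q'; return the (query, hits) history."""
    history = []
    while True:
        query = read_query()
        if query.strip().lower() == "q":
            break
        history.append((query, search(query, k)))
    return history

def fake_search(query, k):
    """Stand-in for a real search call: returns k placeholder passages."""
    return [f"passage {i} for {query!r}" for i in range(1, k + 1)]

# Scripted input: two questions, then 'q' to quit; the last entry is never read.
scripted = iter(["Where is home?", "What is a computer?", "q", "never reached"])
history = run_loop(lambda: next(scripted), fake_search, k=2)

print(len(history))  # 2
```

In the notebook, the same function could be driven by `lambda: input("Query... ")` and a small wrapper around `searcher.search`.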