# ColBERT v2, PLAID Classes

In [1]:
import fastrag
import torch
from fastrag.stores import PLAIDDocumentStore

# Store

In [2]:
store = PLAIDDocumentStore(index_path="/path/to/index",
                           checkpoint_path="Intel/ColBERT-NQ",
                           collection_path="/path/to/collection.tsv")

[Dec 05, 20:11:21] #> Loading collection...
0M 
[Dec 05, 20:11:24] #> Loading codec...
[Dec 05, 20:11:24] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Dec 05, 20:11:25] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Dec 05, 20:11:25] #> Loading IVF...
[Dec 05, 20:11:25] #> Loading doclens...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 666.08it/s]

[Dec 05, 20:11:25] #> Loading codes and residuals...



100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 15.11it/s]
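
The `collection_path` argument points at a tab-separated file of passages. The sketch below shows one way such a file could be written from Python. The column order (id, text, title) is an assumption inferred from the `id`, `content`, and `title` fields in this notebook's query results, and `write_collection` is a hypothetical helper, not part of fastRAG; check the fastRAG documentation for the exact layout it expects.

```python
import tempfile
from pathlib import Path


def write_collection(rows, path):
    """Write (doc_id, text, title) tuples as one tab-separated line each.

    ASSUMPTION: the id/text/title column order is illustrative only.
    Returns the number of rows written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, text, title in rows:
            # Collapse tabs and newlines inside a field so each record stays on one line.
            fields = [" ".join(str(value).split()) for value in (doc_id, text, title)]
            f.write("\t".join(fields) + "\n")
            count += 1
    return count


# Hypothetical toy passages, written to a temporary directory.
rows = [
    (0, "Paris is the capital\tof France.", "Paris"),
    (1, "Bordeaux lies on the\nriver Garonne.", "Bordeaux"),
]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "collection.tsv"
    written = write_collection(rows, target)
    lines = target.read_text(encoding="utf-8").splitlines()

print(written, lines[0].split("\t"))
```

Collapsing whitespace inside each field keeps stray tabs or newlines in the passage text from breaking the one-record-per-line layout.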


In [4]:
store.query("Where is Paris?", top_k=3)

[<Document: {'content': 'located close to the European Atlantic coast, in the southwest of France and in the north of the Aquitaine region. It is around southwest of Paris. The city is built on a bend of the river Garonne, and is divided into two parts: the right bank to the east and left bank in the west. Historically the left bank is more developed because when flowing outside the bend, the water makes a furrow of the required depth to allow the passing of merchant ships, which used to offload on this side of the river. But, today, the right bank is', 'content_type': 'text', 'score': 14.8046875, 'meta': {'title': 'Bordeaux'}, 'embedding': None, 'id': '48428'}>,
 <Document: {'content': 'which took place in Paris from April to October in 1925. This was officially sponsored by the French government, and covered a site in Paris of 55 acres, running from the Grand Palais on the right bank to Les Invalides on the left bank, and along the banks of the Seine. The Grand Palais, the largest ha
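
The raw `Document` reprs are long, so a compact summary can be easier to scan. The helper below is a minimal sketch: `summarize` is a hypothetical name, and it assumes each result exposes `content`, `score`, and `meta` as attributes, matching the fields visible in the output above. It is demonstrated on stand-in objects so that it runs without an index.

```python
from types import SimpleNamespace


def summarize(docs, width=60):
    """Return one 'score  title  snippet' line per document."""
    lines = []
    for doc in docs:
        title = (doc.meta or {}).get("title", "")
        snippet = doc.content[:width]
        if len(doc.content) > width:
            snippet += "..."  # mark that the passage was cut short
        lines.append(f"{doc.score:8.3f}  {title}  {snippet}")
    return lines


# Stand-in object (hypothetical) mirroring the fields shown in the output above.
fake_docs = [
    SimpleNamespace(
        content="located close to the European Atlantic coast",
        score=14.8046875,
        meta={"title": "Bordeaux"},
        id="48428",
    ),
]
for line in summarize(fake_docs, width=20):
    print(line)
```

With a loaded store, the same helper could be applied to the query results, for example `summarize(store.query("Where is Paris?", top_k=3))`.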

# Retriever

In [5]:
from fastrag.retrievers.colbert import ColBERTRetriever

In [6]:
retriever = ColBERTRetriever(store)

In [7]:
retriever.retrieve("What is Machine Learning?", top_k=3)

[<Document: {'content': "intelligence. Machine learning, a fundamental concept of AI research since the field's inception, is the study of computer algorithms that improve automatically through experience. Unsupervised learning is the ability to find patterns in a stream of input, without requiring a human to label the inputs first. Supervised learning includes both classification and numerical regression, which requires a human to label the input data first. Classification is used to determine what category something belongs in, after seeing a number of examples of things from several categories. Regression is the attempt to produce a function that describes the relationship between inputs and", 'content_type': 'text', 'score': 17.21875, 'meta': {'title': 'Artificial intelligence'}, 'embedding': None, 'id': '11791'}>,
 <Document: {'content': 'as biological ones, by using methods of supervised and unsupervised learning, regression, detection of clusters and association rule mining, amo

# Haystack Pipeline

In [8]:
from haystack import Pipeline

In [9]:
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])

In [10]:
res = p.run(query="What did Einstein work on?")
res["documents"][:3]

[<Document: {'content': 'his Prague stay, he wrote 11 scientific works, five of them on radiation mathematics and on the quantum theory of solids. In July 1912, he returned to his alma mater in Zürich. From 1912 until 1914, he was professor of theoretical physics at the ETH Zurich, where he taught analytical mechanics and thermodynamics. He also studied continuum mechanics, the molecular theory of heat, and the problem of gravitation, on which he worked with mathematician and friend Marcel Grossmann. On 3 July 1913, he was voted for membership in the Prussian Academy of Sciences in Berlin. Max Planck and Walther Nernst', 'content_type': 'text', 'score': 20.828125, 'meta': {'title': 'Albert Einstein'}, 'embedding': None, 'id': '2081'}>,
 <Document: {'content': 'technology". Much of his work at the patent office related to questions about transmission of electric signals and electrical–mechanical synchronization of time, two technical problems that show up conspicuously in the thought ex

## Pipeline from YAML
```yaml
version: 1.12.2

components:
- name: Store
  params:
    checkpoint_path: Intel/ColBERT-NQ
    collection_path: /path/to/collection.tsv
    index_path: /path/to/index
  type: PLAIDDocumentStore
- name: Retriever
  params:
    document_store: Store
    top_k: 10
    use_gpu: false
  type: ColBERTRetriever

pipelines:
- name: my_pipeline
  nodes:
  - inputs:
    - Query
    name: Retriever
```
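
The next cell loads this configuration from `plaid-colbert-pipeline.yaml`. As a rough sketch, the file could be generated from Python with the three paths filled in. `write_pipeline_yaml` and `PIPELINE_TEMPLATE` are hypothetical names introduced here for illustration, and the demo writes to a temporary directory rather than the working directory.

```python
import tempfile
from pathlib import Path
from string import Template

# Same structure as the YAML above, with the three paths left as placeholders.
PIPELINE_TEMPLATE = Template("""\
version: 1.12.2

components:
- name: Store
  params:
    checkpoint_path: $checkpoint_path
    collection_path: $collection_path
    index_path: $index_path
  type: PLAIDDocumentStore
- name: Retriever
  params:
    document_store: Store
    top_k: 10
    use_gpu: false
  type: ColBERTRetriever

pipelines:
- name: my_pipeline
  nodes:
  - inputs:
    - Query
    name: Retriever
""")


def write_pipeline_yaml(path, checkpoint_path, collection_path, index_path):
    """Fill in the placeholders and write the config; returns the rendered text."""
    text = PIPELINE_TEMPLATE.substitute(
        checkpoint_path=checkpoint_path,
        collection_path=collection_path,
        index_path=index_path,
    )
    Path(path).write_text(text, encoding="utf-8")
    return text


# Demo in a temporary directory, using the placeholder paths from this notebook.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "plaid-colbert-pipeline.yaml"
    rendered = write_pipeline_yaml(
        config_path,
        checkpoint_path="Intel/ColBERT-NQ",
        collection_path="/path/to/collection.tsv",
        index_path="/path/to/index",
    )
    on_disk = config_path.read_text(encoding="utf-8")
```

Keeping the paths as template parameters makes it easy to point the same pipeline definition at a different index or checkpoint.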

In [11]:
pipeline = Pipeline.load_from_yaml("plaid-colbert-pipeline.yaml")

[Dec 05, 20:12:49] #> Loading collection...
0M 
[Dec 05, 20:12:51] #> Loading codec...
[Dec 05, 20:12:51] #> Loading IVF...
[Dec 05, 20:12:51] #> Loading doclens...


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1044.27it/s]

[Dec 05, 20:12:51] #> Loading codes and residuals...



100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 22.42it/s]


In [13]:
pipeline.run(query="What did Einstein work on?")["documents"][:2]

[<Document: {'content': 'his Prague stay, he wrote 11 scientific works, five of them on radiation mathematics and on the quantum theory of solids. In July 1912, he returned to his alma mater in Zürich. From 1912 until 1914, he was professor of theoretical physics at the ETH Zurich, where he taught analytical mechanics and thermodynamics. He also studied continuum mechanics, the molecular theory of heat, and the problem of gravitation, on which he worked with mathematician and friend Marcel Grossmann. On 3 July 1913, he was voted for membership in the Prussian Academy of Sciences in Berlin. Max Planck and Walther Nernst', 'content_type': 'text', 'score': 20.828125, 'meta': {'title': 'Albert Einstein'}, 'embedding': None, 'id': '2081'}>,
 <Document: {'content': 'technology". Much of his work at the patent office related to questions about transmission of electric signals and electrical–mechanical synchronization of time, two technical problems that show up conspicuously in the thought ex

------

# Index Creation

1. Install the package, then change into `libs/colbert`.
2. Create the environment with Anaconda/Miniconda, using the provided YAML file for either GPU or CPU.
3. Make sure a sufficiently capable GPU is available (for example, an RTX 3090 or a V100).
4. Run fastRAG's `scripts/indexing/create_plaid.py` script.

Example:
```sh
python scripts/indexing/create_plaid.py            \
         --checkpoint /path/to/checkpoint          \
         --collection /path/to/collection.tsv      \
         --index-save-path /path/to/index          \
         --gpu 2 --ranks 2
```
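
When indexing is launched from Python (for example, inside a larger data-preparation script), the same command can be assembled as an argument list. This is a minimal sketch that only uses the flags shown in the example above; `build_index_command` is a hypothetical helper, and the script path is assumed to be relative to the fastRAG repository root.

```python
import shlex
import sys


def build_index_command(checkpoint, collection, index_save_path, gpu, ranks,
                        script="scripts/indexing/create_plaid.py"):
    """Return the argument list for the indexing script, mirroring the example above."""
    return [
        sys.executable, script,
        "--checkpoint", str(checkpoint),
        "--collection", str(collection),
        "--index-save-path", str(index_save_path),
        "--gpu", str(gpu),
        "--ranks", str(ranks),
    ]


cmd = build_index_command(
    "/path/to/checkpoint",
    "/path/to/collection.tsv",
    "/path/to/index",
    gpu=2,
    ranks=2,
)
# Print the command in copy-pastable shell form without running it.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from the repository root would then start the indexing job; building a list instead of a single string avoids shell-quoting problems with paths that contain spaces.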