<a href="https://colab.research.google.com/github/TobiasHeOl/kasearch/blob/main/notebooks/KAsearch_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Quick and easy use of KA-Search**

You can run a simple query on the reduced version of OAS here. This part of the notebook runs a single query against a single region at a time. To run several queries or search several regions simultaneously, see the section "KA-Search with more configuration" further down.

In [None]:
#@title Input query sequence, then hit `Runtime` -> `Run all`
import sys
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"

#@markdown Insert the query sequence. The sequence should be either the heavy or light chain of an antibody variable domain.

query_sequence = 'QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS' #@param {type:"string"}

#@markdown Select what region you want to search by and whether to restrict the search to antibodies with the same number of amino acids in the selected region. 

search_by = "whole" #@param ["whole", "cdrs", "cdr3"]
length_matched = False #@param {type:"boolean"}

#@markdown Restrict what species to search by.

species = "Any" #@param ["Human", "Mouse", "Any"]

#@markdown Choose how many of the closest matches to retrieve. Colab provides few CPUs, so larger values increase the runtime.

n_sequences = 100 #@param {type:"integer"}
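
Before running the search, it can help to check that the query contains only amino acid letters. The following is a minimal sketch of such a pre-check. The helper `find_invalid_residues` is hypothetical and not part of the KA-Search package, and it assumes that only the 20 standard amino acids are expected.

```python
# Hypothetical pre-check; not part of the KA-Search package.
# Assumes the query should contain only the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence):
    """Return a sorted list of characters that are not standard amino acids."""
    cleaned = sequence.strip().upper()
    return sorted(set(cleaned) - STANDARD_AMINO_ACIDS)


example_query = "QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQ"
print(find_invalid_residues(example_query))   # an empty list means the query looks clean
print(find_invalid_residues("QVKLQ ESG*X"))   # whitespace and symbols are reported
```

An empty list suggests the sequence can be passed on as-is. Anything else points to pasted whitespace, stop symbols or ambiguity codes worth removing first.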

In [None]:
#@title Install dependencies
%%capture
%%bash -s $python_version

#@markdown This script will download and install the KA-Search code and ANARCI

PYTHON_VERSION=$1
set -e


# setup conda
if [ ! -f CONDA_READY ]; then
  wget -qnc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash Miniconda3-latest-Linux-x86_64.sh -bfp /usr/local 2>&1 1>/dev/null
  rm Miniconda3-latest-Linux-x86_64.sh
  touch CONDA_READY
fi

# setup anarci
if [ ! -f ANARCI_READY ]; then
  conda install -y -q -c bioconda anarci python="${PYTHON_VERSION}" 2>&1 1>/dev/null
  touch ANARCI_READY
fi

# setup kasearch
if [ ! -f CODE_READY ]; then
  # install dependencies
  pip install kasearch 2>&1 1>/dev/null --root-user-action=ignore
  touch CODE_READY
fi

In [None]:
#@title Download database
%%bash

if [ ! -f DATA_READY ]; then
  # download and unpack the reduced OAS database
  wget -qnc "https://zenodo.org/record/7384311/files/oas-aligned-tiny.tar" -O small_OAS.tar
  tar -xf small_OAS.tar
  touch DATA_READY
fi

In [None]:
#@title Search the database

#@markdown This will take a few minutes
if f"/usr/local/lib/python{python_version}/site-packages/" not in sys.path:
    sys.path.insert(0, f"/usr/local/lib/python{python_version}/site-packages/")

from kasearch import EasySearch

results = EasySearch(query_sequence,                              # Single sequence to search
               keep_best_n=n_sequences,                           # Number of closest matches to return
               database_path='oas-aligned-tiny',                  # Database to search
               allowed_chain='Any',                               # Chains to search, either 'Heavy', 'Light' or 'Any'
               allowed_species=species,                           # Species to search
               regions=[search_by],                               # Region to search, either 'whole', 'cdrs', 'cdr3' or a user-specified region
               length_matched=[length_matched],                   # To search only for sequences with a matched length or any length
              )

results

In [None]:
#@title Download results
#@markdown If you are having issues downloading the result archive, try disabling your adblocker and run this cell again. If that fails, click on the little folder icon to the left, navigate to the file `KA_search_output.csv`, right-click and select "Download".

from google.colab import files

results.to_csv("KA_search_output.csv", index = False)
files.download("KA_search_output.csv")

------------------
# **KA-Search with more configuration**

In [1]:
from kasearch import AlignSequences, SearchDB, PrepareDB

### **Align query sequences**

Query sequences need to be aligned to the KA-Search alignment, as described in Olsen et al., 2022, before they can be used in a search.

In [2]:
raw_queries = [
    'QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS',
    'QVQLKESGPGLVAPSQSLSITCTVSGFSLTSYGVSWVRQPPGKGLEWLGVIWGDGSTNYHSALISRLSISKENSKSQVFLKLNSLQTDDTATYYCAKPGGDYWGQGTSVTVSS',
]

aligned_seqs = AlignSequences(allowed_species=['Human', 'Mouse'], # Species to use for numbering (human and mouse are the default).
                              n_jobs=1                            # Number of jobs/threads allocated to the alignment.
                             )(raw_queries)                       # Sequences as strings to align.
aligned_seqs

array([[81, 86, 75,  0, 76, 81, 69, 83, 71, 65,  0, 69, 76, 65, 82, 80,
        71, 65, 83, 86, 75, 76, 83, 67, 75, 65, 83, 71, 89, 84, 70,  0,
         0,  0,  0,  0,  0,  0,  0,  0, 84, 78, 89, 87, 77, 81,  0, 87,
        86, 75, 81,  0, 82,  0, 80,  0, 71,  0, 81,  0,  0, 71,  0, 76,
        68,  0, 87, 73, 71, 65, 73, 89, 80, 71,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0, 68, 71, 78, 84, 82, 89,  0,  0, 84,  0,  0,
        72,  0,  0, 75, 70,  0,  0, 75,  0,  0,  0, 71, 75, 65, 84, 76,
        84, 65,  0, 68,  0,  0,  0, 75,  0, 83,  0,  0, 83, 83,  0,  0,
         0,  0, 84,  0, 65, 89, 77, 81, 76, 83, 83, 76, 65, 83,  0, 69,
        68, 83, 71, 86, 89, 89, 67, 65, 82, 71, 69, 71, 78,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0, 89, 65, 87, 70, 65, 89, 87, 71,  0, 81,
        71, 84, 84, 86, 84, 86, 83, 83],
       [81, 86, 81,  0, 76, 75, 69, 83, 71, 80,  0, 71, 76, 86, 65, 80,
        83, 81, 83, 76,
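
The integers in the aligned array appear to be character codes of the residues (81 is `Q`, 86 is `V`, 75 is `K`), with 0 marking an unoccupied alignment position. Under that assumption, an aligned row can be turned back into a gapped string. The helper `decode_aligned` below is a hypothetical sketch and not part of KA-Search.

```python
def decode_aligned(row, gap="-"):
    """Convert a row of character codes into a gapped string.

    Assumes 0 marks an empty alignment position and every other value
    is the character code of a residue. `row` can be any iterable of integers.
    """
    return "".join(gap if code == 0 else chr(code) for code in row)


# First twelve values of the first aligned query shown above.
first_positions = [81, 86, 75, 0, 76, 81, 69, 83, 71, 65, 0, 69]

gapped = decode_aligned(first_positions)
print(gapped)                    # QVK-LQESGA-E
print(gapped.replace("-", ""))   # QVKLQESGAE, the start of the raw query
```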

### Canonical alignment
The unique positions allowed in the canonical alignment can be viewed as follows:

In [3]:
from kasearch import canonical_numbering
print(canonical_numbering)

['1 ', '2 ', '3 ', '3A', '4 ', '5 ', '6 ', '7 ', '8 ', '9 ', '10 ', '11 ', '12 ', '13 ', '14 ', '15 ', '16 ', '17 ', '18 ', '19 ', '20 ', '21 ', '22 ', '23 ', '24 ', '25 ', '26 ', '27 ', '28 ', '29 ', '30 ', '31 ', '32 ', '32A', '32B', '33C', '33B', '33A', '33 ', '34 ', '35 ', '36 ', '37 ', '38 ', '39 ', '40 ', '40A', '41 ', '42 ', '43 ', '44 ', '44A', '45 ', '45A', '46 ', '46A', '47 ', '47A', '48 ', '48A', '48B', '49 ', '49A', '50 ', '51 ', '51A', '52 ', '53 ', '54 ', '55 ', '56 ', '57 ', '58 ', '59 ', '60 ', '60A', '60B', '60C', '60D', '61E', '61D', '61C', '61B', '61A', '61 ', '62 ', '63 ', '64 ', '65 ', '66 ', '67 ', '67A', '67B', '68 ', '68A', '68B', '69 ', '69A', '69B', '70 ', '71 ', '71A', '71B', '72 ', '73 ', '73A', '73B', '74 ', '75 ', '76 ', '77 ', '78 ', '79 ', '80 ', '80A', '81 ', '81A', '81B', '81C', '82 ', '82A', '83 ', '83A', '83B', '84 ', '85 ', '85A', '85B', '85C', '85D', '86 ', '86A', '87 ', '88 ', '89 ', '90 ', '91 ', '92 ', '93 ', '94 ', '95 ', '96 ', '96A', '97 ', '
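
Judging from the printed list, each position label is the position number followed by either an insertion letter or a single space (for example `'3 '` and `'3A'`). The sketch below builds labels in that form and checks a user-defined region against a numbering list. Both helpers and the short `numbering_excerpt` list are hypothetical and only for illustration. In practice the full `canonical_numbering` list would be used.

```python
def position_label(number, insertion=""):
    """Build a label such as '3 ' or '3A' (number plus insertion letter or space)."""
    return f"{number}{insertion if insertion else ' '}"


def missing_positions(region, numbering):
    """Return the labels in `region` that do not occur in `numbering`."""
    known = set(numbering)
    return [label for label in region if label not in known]


# Short excerpt for illustration only; use the full canonical_numbering in practice.
numbering_excerpt = ['1 ', '2 ', '3 ', '3A', '4 ', '32 ', '32A', '32B']

region = [position_label(3), position_label(3, "A"), position_label(5)]
print(region)                                        # ['3 ', '3A', '5 ']
print(missing_positions(region, numbering_excerpt))  # ['5 ']
```

A non-empty result usually means a label was mistyped, most often a missing trailing space.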

--------------
### **Initiate search class**

#### Database to search against
- If no database path is given, a small OAS version will be downloaded to search against.
- The full version of OAS can be downloaded here ().
- You can also give it the path for a custom database to search against. (See below for how to create a custom database).
- You can place the custom database in the OAS folder to have KA-Search search against both databases.

#### Regions to search with
- Default regions are the whole chain, CDRs or CDR3.
- User-defined regions can be added, as seen with the paratope search below.
- For each region, the search can either be based on exact length match or not.
- For a more specific search, the search can be focused on a specific chain and species.

In [4]:
paratope = ["107 ", "108 ","111C", "114 ","115 "]

In [5]:
searchdb = SearchDB(
    database_path='oas-aligned-tiny',   # Path to your database. Default will be to download a small prepared version of OAS.
    allowed_chain='Heavy',         # Search against a specific chain. Default is any chain.
    allowed_species='Human',       # Search against a specific species. Default is any species.
    regions=['whole', 'cdrs', 'cdr3', paratope], # Regions to search with.
    length_matched=[False, True, True, True],    # Whether to search with length match or not.
)
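
In the call above, `regions` and `length_matched` have the same number of entries, so each region appears to be paired with the flag at the same index. The following sketch makes that pairing explicit and catches a length mismatch early. The helper `pair_regions` is hypothetical and not part of KA-Search.

```python
def pair_regions(regions, length_matched):
    """Pair each region with its length-matching flag, checking the lengths agree."""
    if len(regions) != len(length_matched):
        raise ValueError(
            f"{len(regions)} regions but {len(length_matched)} length_matched flags"
        )
    pairs = []
    for region, matched in zip(regions, length_matched):
        # Named regions are strings; user-defined regions are lists of positions.
        name = region if isinstance(region, str) else "user-defined"
        pairs.append((name, matched))
    return pairs


paratope = ["107 ", "108 ", "111C", "114 ", "115 "]
for name, matched in pair_regions(['whole', 'cdrs', 'cdr3', paratope],
                                  [False, True, True, True]):
    print(f"{name}: length matched = {matched}")
```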

-----------
### **Run search**

A search takes ~23 min per sequence against all of OAS and ~2 min per sequence against the small OAS.

To change how many of the most similar sequences are kept, set the `keep_best_n` parameter.

In [6]:
%%time 
searchdb.search(aligned_seqs,   # Input can be a single or multiple aligned sequences at a time.
                keep_best_n=5,  # You can define how many most similar sequences to return
            )

CPU times: user 49.9 s, sys: 24.1 s, total: 1min 14s
Wall time: 17.4 s
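
The idea behind `keep_best_n` is to retain only the highest-scoring hits. The sketch below shows this with the standard library. It is not KA-Search's implementation, and the helper `keep_best` and the hit names are made up for the example.

```python
import heapq


def keep_best(scored_hits, n):
    """Return the n hits with the highest identity, best first.

    `scored_hits` is an iterable of (hit_id, identity) tuples.
    """
    return heapq.nlargest(n, scored_hits, key=lambda hit: hit[1])


# Made-up hits for illustration.
hits = [("seq_a", 0.71), ("seq_b", 0.90), ("seq_c", 0.83), ("seq_d", 0.65)]
print(keep_best(hits, 2))   # [('seq_b', 0.9), ('seq_c', 0.83)]
```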


### Get N best identities

The identities of the most similar sequences for each region can be fetched from the search object with the command below.

In [7]:
searchdb.current_best_identities

array([[[0.90378153, 0.78571427, 0.8333333 , 1.        ],
        [0.8983696 , 0.75      , 0.8333333 , 1.        ],
        [0.89311516, 0.75      , 0.75      , 1.        ],
        [0.8925946 , 0.75      , 0.75      , 1.        ],
        [0.8907563 , 0.71428573, 0.75      , 1.        ]],

       [[0.91817904, 0.8636364 , 0.85714287, 1.        ],
        [0.91069317, 0.77272725, 0.85714287, 1.        ],
        [0.90951705, 0.77272725, 0.85714287, 1.        ],
        [0.9026549 , 0.72727275, 0.85714287, 1.        ],
        [0.90085495, 0.72727275, 0.85714287, 1.        ]]], dtype=float32)
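
The output above has two outer blocks (one per query), five rows each (`keep_best_n=5`) and four columns (one per region, in the order given to `SearchDB`). Assuming that layout, and that the first row of each block holds the highest identities, the top identity per region can be read out as sketched below. The nested list is a rounded excerpt of the output, and `best_per_region` is a hypothetical helper.

```python
# Rounded excerpt of the output above: [query][rank][region].
identities = [
    [[0.9038, 0.7857, 0.8333, 1.0],
     [0.8984, 0.7500, 0.8333, 1.0]],
    [[0.9182, 0.8636, 0.8571, 1.0],
     [0.9107, 0.7727, 0.8571, 1.0]],
]
region_names = ["whole", "cdrs", "cdr3", "paratope"]


def best_per_region(identities, query_index, region_names):
    """Map each region name to the top-ranked identity for one query.

    Assumes rank 0 is the best hit for every region.
    """
    top_row = identities[query_index][0]
    return dict(zip(region_names, top_row))


print(best_per_region(identities, 0, region_names))
print(best_per_region(identities, 1, region_names))
```

The same indexing works on the NumPy array itself, for example `searchdb.current_best_identities[0][0]` for the top row of the first query.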

---------
## Extract the meta data from matched sequences

Using the `get_meta` function, the metadata for all matched sequences for each query and region can be extracted as seen below.

Indexing is zero-based: `n_query = 0` selects the first query, and `n_region = 0` selects the first region in the list passed when initiating the search class.

NB: The column "sequence_alignment_aa" holds the antibody sequence.

In [8]:
n_best_sequences = searchdb.get_meta(n_query = 0,          # Which query to extract meta data from
                                     n_region = 0,         # Which region to extract meta data from
                                     n_sequences = 'all',  # Number of sequences to extract (default is all, which is keep_best_n)
                                     n_jobs=10             # Allocated number for jobs/threads for the extraction
                                    )
n_best_sequences

Unnamed: 0,sequence,locus,stop_codon,vj_in_frame,v_frameshift,productive,rev_comp,complete_vdj,v_call,d_call,...,Longitudinal,Age,Disease,Subject,Vaccine,Chain,Unique sequences,Total sequences,Isotype,Identity
0,GAAACAACCTATGATCAGTGTCCTCTCTACACAGTCCCTGACGACA...,H,F,T,F,T,F,F,IGHV1-46*01,IGHD3-16*01,...,no,no,POEMS,Patient_12,,Heavy,21060,37905,Bulk,0.903782
1,GGCATATGATCAGTAACCTCTTCACAGTCACTGAAAACACTGACTC...,H,F,T,F,T,F,F,IGHV1-46*01,IGHD5-12*01,...,no,no,POEMS,Patient_12,,Heavy,21060,37905,Bulk,0.89837
2,GGCATATGATCAGTAACCTCTTCACAGTCACTGAAAACACTGACTC...,H,F,T,F,T,F,F,IGHV1-46*01,IGHD3-9*01,...,no,no,POEMS,Patient_12,,Heavy,21060,37905,Bulk,0.893115
3,GACAGTCACTGAAAACACTGACTCTAATCATGGAATGTAACTGGAT...,H,F,T,F,T,F,T,IGHV1-46*01,IGHD5-12*01,...,no,no,POEMS,Patient_12,,Heavy,21060,37905,Bulk,0.892595
4,GGCATATGATCAGTAACCTCTTCACAGTCACTGAAAACACTGACTC...,H,F,T,F,T,F,F,IGHV1-46*01,IGHD3-3*01,...,no,no,POEMS,Patient_12,,Heavy,21060,37905,Bulk,0.890756


In [9]:
n_best_sequences.sequence_alignment_aa.values

array(['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGEPRYDYAWFAYWGQGTLVTVS',
       'QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGPATAWFAYWGQGTLVTVS',
       'QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARSAWFAYWGQGTLVTVS',
       'QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGGYWGQGTTLTVSS',
       'QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGGLRRGAWFAYWGQGTLVTVS'],
      dtype=object)

----------
## Create custom database


To create your own database, you first need to create a CSV file in the OAS format. For an example file, see data/custom-data-example.csv. The first line of the file holds a dictionary containing the metadata, and the following rows hold the individual sequences. Only Species and Chain are strictly required in the metadata, and only the amino acid sequence is required for each antibody.

### 1. Format your data into OAS files

In [2]:
import json, os, shutil
import pandas as pd

from kasearch import PrepareDB

In [3]:
custom_data_file = "custom-data-examples.csv"

seq_df = pd.DataFrame([
    ["EVQLVESGGGLAKPGGSLRLHCAASGFAFSSYWMNWVRQAPGKRLEWVSAINLGGGLTYYAASVKGRFTISRDNSKNTLSLQMNSLRAEDTAVYYCATDYCSSTYCSPVGDYWGQGVLVTVSS"],
    ["EVQLVQSGAEVKRPGESLKISCKTSGYSFTSYWISWVRQMPGKGLEWMGAIDPSDSDTRYNPSFQGQVTISADKSISTAYLQWSRLKASDTATYYCAIKKYCTGSGCRRWYFDLWGPGT"],
    ['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGEPRYDYAWFAYWGQGTLVTVS'],
    ['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGPATAWFAYWGQGTLVTVS'],
    ['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARSAWFAYWGQGTLVTVS'],
    ['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGGYWGQGTTLTVSS'],
    ['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGGLRRGAWFAYWGQGTLVTVS']
], columns = ['heavy_sequences'])
meta_data = pd.Series(name=json.dumps({"Species":"Human", "Chain":"Heavy"}), dtype='object')

meta_data.to_csv(custom_data_file, index=False)
seq_df.to_csv(custom_data_file, index=False, mode='a')
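
To check that a file has the expected layout, it can be read back with the standard library. The sketch below assumes the layout produced by the cell above: a first line holding the JSON metadata as a single CSV field, a second line with the column names, and one sequence per row after that. The function `read_oas_style_file` and the example file it writes are hypothetical.

```python
import csv
import json
import os
import tempfile


def read_oas_style_file(path):
    """Return (metadata dict, list of row dicts) from an OAS-style CSV file."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    metadata = json.loads(rows[0][0])   # first line: JSON metadata in one field
    header = rows[1]                    # second line: column names
    records = [dict(zip(header, row)) for row in rows[2:]]
    return metadata, records


# Write a tiny example file in the assumed layout, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.csv")
    with open(example_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([json.dumps({"Species": "Human", "Chain": "Heavy"})])
        writer.writerow(["heavy_sequences"])
        writer.writerow(["EVQLVESGGGLAKPGGSLRLHCAAS"])
    metadata, records = read_oas_style_file(example_path)

print(metadata)
print(records)
```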

### 2. Turn your OAS formatted files into a custom database

After creating all the files you want to include in the new database, you can run the following code to create the database.

**NB:** Each CSV file needs to be copied into the 'extra_data' folder inside the database directory, so that the metadata can be extracted later.

In [4]:
path_to_custom_db = "my_kasearch_db"
many_custom_data_files = [custom_data_file]

In [5]:
%%timeit -n 1 -r 1

customDB = PrepareDB(db_path=path_to_custom_db, n_jobs=2, from_scratch=True)

for num, data_file in enumerate(many_custom_data_files):
    
    customDB.prepare_sequences(
        data_file,
        file_id=num, 
        chain='Heavy', 
        species='Human',
        seq_column_name = 'heavy_sequences',
    )
    shutil.copy(data_file, os.path.join(path_to_custom_db, 'extra_data'))
    
customDB.finalize_prepared_files()

243 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### 3. Initiate the search class with your custom database


In [25]:
from kasearch import EasySearch

raw_queries = [
    'VKLLEQSGAEVKKPGASVKVSCKASGYSFTSYGLHWVRQAPGQRLEWMGWISAGTGNTKYSQKFRGRVTFTRDTSATTAYMGLSSLRPEDTAVYYCARDPYGGGKSEFDYWGQGTLVTVSS',
]

results = EasySearch(
    raw_queries, 
    keep_best_n=10,
    database_path=path_to_custom_db, 
    allowed_chain='Any', 
    allowed_species='Any',
    regions=['whole'],
    length_matched=[False],
)
results

Unnamed: 0,heavy_sequences,Species,Chain,Identity
0,QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLE...,Human,Heavy,0.628099
1,QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLE...,Human,Heavy,0.619835
2,QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLE...,Human,Heavy,0.619835
3,QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLE...,Human,Heavy,0.619835
4,QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLE...,Human,Heavy,0.603306
5,EVQLVESGGGLAKPGGSLRLHCAASGFAFSSYWMNWVRQAPGKRLE...,Human,Heavy,0.504065
6,EVQLVQSGAEVKRPGESLKISCKTSGYSFTSYWISWVRQMPGKGLE...,Human,Heavy,0.496
