# 3. Building a mapping between Ciona KY identifiers and UniProt identifiers.

This notebook builds a crude mapping between Ciona KY identifiers and UniProt identifiers using all-vs-all BLAST. Running it requires a `conda` environment with `blast` installed. To create one from the `envs/blast.yml` file, run the following command from the root directory:

```{bash}
conda env create -f envs/blast.yml
```
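
Before running the cells below, it can help to confirm that the BLAST+ executables are visible to the notebook kernel. The following is a minimal sketch using only the standard library. The helper name `find_missing_tools` is hypothetical, and the check assumes the environment has been activated so that `makeblastdb` and `blastp` are on `PATH`.

```python
import shutil


def find_missing_tools(tools=("makeblastdb", "blastp")):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Report which executables, if any, are not visible from this kernel.
missing = find_missing_tools()
if missing:
    print(f"Not found on PATH: {', '.join(missing)}")
else:
    print("BLAST+ executables found.")
```

If a tool is reported as missing, the notebook kernel is probably not running inside the environment created above.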

## 3.1. Build a BLAST database from the KY21 gene set.

The snakemake workflow downloads both input proteomes: the Ciona reference proteome in UniProt and the KY21 proteome from the [Ghost database](http://ghost.zool.kyoto-u.ac.jp/download_ht.html).

In [1]:
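from pathlib import Path

# Directory containing the downloaded proteomes
data_dir = Path("../../data/Ciona_gene_models")
ky21_fasta_filename = "HT.KY21Gene.protein.2.fasta"

# Input FASTA and output name for the BLAST database files
reference = data_dir / ky21_fasta_filename
blastdb = data_dir / f"{ky21_fasta_filename}.blastdb"

# Build a protein database; -parse_seqids parses sequence IDs from the FASTA headers
!makeblastdb -in $reference -parse_seqids -dbtype prot -out $blastdb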



Building a new DB, current time: 05/08/2025 09:42:10
New DB name:   /Users/dennis/Code/2025-zoogle-collabs/data/Ciona_gene_models/HT.KY21Gene.protein.2.fasta.blastdb
New DB title:  ../../data/Ciona_gene_models/HT.KY21Gene.protein.2.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 55505 sequences in 1.25157 seconds.
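
The sequence count reported by `makeblastdb` can be cross-checked against the input FASTA. The following is a minimal sketch, assuming every record starts with a `>` header line. The helper name `count_fasta_records` is hypothetical and not part of the workflow, and the demo lines are made up for illustration.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Made-up in-memory example with two records.
demo_lines = [">seq1\n", "MKT\n", ">seq2\n", "MLV\n"]
print(count_fasta_records(demo_lines))

# Hypothetical usage with the `reference` path defined in the cell above:
# with open(reference) as handle:
#     print(count_fasta_records(handle))
```

A count that differs from the number `makeblastdb` reports would suggest that the FASTA file is worth inspecting before continuing.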




## 3.2. Run BLAST between the KY21 and Ciona reference proteomes.

Set the `num_threads` variable to the number of threads BLAST should use. The search can take 15 minutes or more to run.
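
One way to choose a thread count is to derive it from the number of CPUs on the machine. The following is a minimal sketch, assuming it is acceptable to leave one core free for other work. The name `pick_num_threads` and its `reserve` parameter are hypothetical and not part of this notebook.

```python
import os


def pick_num_threads(reserve=1):
    """Return the CPU count minus `reserve`, but never less than 1."""
    total = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, total - reserve)


# Hypothetical usage: replace the hard-coded value in the next cell.
# num_threads = pick_num_threads()
print(pick_num_threads())
```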

In [2]:
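# Query proteome and output path for the BLAST results
query_fasta_filename = "Ciona_intestinalis.faa"
query = data_dir / query_fasta_filename
output = data_dir / f"{ky21_fasta_filename}.{query_fasta_filename}.blastout"
num_threads = 10

# Search the query proteins against the KY21 database; -outfmt 6 writes tab-separated output
!blastp -query $query -db $blastdb -out $output -outfmt 6 -num_threads $num_threads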

To process the BLAST results, run the `4_ciona-blast-processing.ipynb` notebook.
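
For orientation, `-outfmt 6` without extra format specifiers writes one tab-separated line per hit with twelve default columns. The following is a minimal sketch of reading one such line. It is not the method used in the processing notebook; `parse_outfmt6_line` is a hypothetical helper, and the example line is made up rather than taken from this search.

```python
# Default column order for BLAST tabular output (-outfmt 6).
OUTFMT6_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def parse_outfmt6_line(line):
    """Split one tab-separated -outfmt 6 line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(OUTFMT6_COLUMNS):
        raise ValueError(f"expected {len(OUTFMT6_COLUMNS)} fields, got {len(fields)}")
    return dict(zip(OUTFMT6_COLUMNS, fields))


# Made-up example line for illustration only.
example = "query1\tsubject1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t400\n"
hit = parse_outfmt6_line(example)
print(hit["qseqid"], hit["sseqid"], float(hit["pident"]))
```

In the search above, `qseqid` holds identifiers from the query FASTA and `sseqid` holds identifiers from the KY21 database.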