# 4. Process BLAST results to generate mapping between UniProt and KY21 protein IDs

This notebook uses the all-vs-all BLAST results generated by `3_ciona-all-v-all-blast.ipynb` to build a mapping between UniProt and KY21 protein IDs.

## 4.1 Setup

Load the necessary libraries and set the data directory.

In [1]:
from pathlib import Path

import pandas as pd

import zoogletools as zt

data_dir = Path("../../data/Ciona_gene_models")

ky21_fasta_filename = "HT.KY21Gene.protein.2.fasta"
query_fasta_filename = "Ciona_intestinalis.faa"

query = data_dir / query_fasta_filename
output = data_dir / f"{ky21_fasta_filename}.{query_fasta_filename}.blastout"

## 4.2 Process BLAST results

Read the tab-separated BLAST results and keep the top hit for each query sequence, defined here as the hit with the lowest e-value. The column names follow BLAST's default tabular layout.

In [2]:
blast_result_columns = [
    "qseqid",
    "sseqid",
    "pident",
    "length",
    "mismatch",
    "gapopen",
    "qstart",
    "qend",
    "sstart",
    "send",
    "evalue",
    "bitscore",
]
blast_results = pd.read_csv(output, sep="\t", header=None, names=blast_result_columns)

# Sort by e-value so that the first row within each qseqid group is that query's top hit.
# Many hits tie at an e-value of 0.0, so the order among tied hits is not meaningful.
blast_results.sort_values(ascending=True, by="evalue", axis=0, inplace=True)
display(blast_results.head(10))

top_hits = (
    blast_results.groupby("qseqid")
    .agg(
        top_hit=("sseqid", "first"),
        top_hit_evalue=("evalue", "first"),
        top_hit_bitscore=("bitscore", "first"),
    )
    .reset_index()
)
display(top_hits.head(10))

Unnamed: 0,qseqid,sseqid,pident,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore
0,tr|A0A1W3JEV5|A0A1W3JEV5_CIOIN,KY21.Chr9.416.v1.SL2-1,100.0,245,0,0,1,245,5,249,0.0,502.0
1328378,tr|H2XXI4|H2XXI4_CIOIN,KY21.Chr11.1100.v1.nonSL6-1,99.029,412,4,0,1,412,13,424,0.0,855.0
1328377,tr|H2XXI4|H2XXI4_CIOIN,KY21.Chr11.1100.v1.nonSL11-1,99.029,412,4,0,1,412,14,425,0.0,855.0
1040086,tr|F6RMT1|F6RMT1_CIOIN,KY21.Chr5.414.v1.SL1-1,97.641,551,2,2,1,551,7,546,0.0,1111.0
1328320,tr|H2XVY7|H2XVY7_CIOIN,KY21.Chr5.388.v1.SL1-1,100.0,559,0,0,1,559,6,564,0.0,1135.0
1328289,tr|H2XVJ5|H2XVJ5_CIOIN,KY21.Chr11.338.v1.nonSL16-2,98.765,243,3,0,20,262,1,243,0.0,499.0
1328288,tr|H2XVJ5|H2XVJ5_CIOIN,KY21.Chr11.338.v1.nonSL16-1,98.765,243,3,0,20,262,1,243,0.0,499.0
1328287,tr|H2XVJ5|H2XVJ5_CIOIN,KY21.Chr11.338.v1.nonSL21-2,98.765,243,3,0,20,262,1,243,0.0,499.0
1328379,tr|H2XXI4|H2XXI4_CIOIN,KY21.Chr11.1100.v1.nonSL7-1,99.029,412,4,0,1,412,12,423,0.0,855.0
1328286,tr|H2XVJ5|H2XVJ5_CIOIN,KY21.Chr11.338.v1.nonSL17-1,98.765,243,3,0,20,262,1,243,0.0,499.0


Unnamed: 0,qseqid,top_hit,top_hit_evalue,top_hit_bitscore
0,sp|F6Q2R9|CRYBG_CIOIN,KY21.Chr1.2268.v1.nonSL14-1,7.65e-57,171.0
1,sp|F6RCC2|NNRD_CIOIN,KY21.Chr2.1144.v1.nonSL6-1,0.0,737.0
2,sp|F6W3G8|MTND_CIOIN,KY21.Chr9.269.v2.nonSL7-1,1.52e-129,362.0
3,sp|F6X2V8|MTAP_CIOIN,KY21.Chr6.40.v1.nonSL8-1,0.0,584.0
4,sp|F6Y089|ITPA_CIOIN,KY21.Chr9.290.v1.nonSL4-1,8.68e-144,399.0
5,sp|F7A355|TRM5_CIOIN,KY21.Chr4.392.v1.nonSL3-1,0.0,932.0
6,sp|F7J186|NVD1_CIOIN,KY21.Chr6.393.v1.SL1-1,0.0,972.0
7,sp|F7J187|NVD2_CIOIN,KY21.Chr9.1102.v1.SL1-1,0.0,935.0
8,sp|O02367|CALM_CIOIN,KY21.Chr3.1526.v1.nonSL4-1,1.7100000000000001e-105,298.0
9,sp|O76808|SUH_CIOIN,KY21.Chr2.360.v1.SL1-1,0.0,1164.0
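
Because many hits share an e-value of 0.0, the top hit among tied rows depends on how the sort happens to order them. The sketch below shows one possible way to make the choice reproducible: break e-value ties by bitscore (highest first) and then by subject ID. It runs on a small made-up table, not on the real BLAST output, and the tie-breaking rule is an assumption rather than what this notebook does.

```python
import pandas as pd

# Toy hits using the same column names as blast_results (all values are made up).
hits = pd.DataFrame(
    {
        "qseqid": ["q1", "q1", "q1", "q2", "q2"],
        "sseqid": ["s1", "s2", "s3", "s4", "s5"],
        "evalue": [0.0, 0.0, 1e-50, 1e-10, 1e-30],
        "bitscore": [499.0, 855.0, 170.0, 60.0, 120.0],
    }
)

# Lowest e-value first; among equal e-values, highest bitscore first;
# sseqid is a final key so the order never depends on the sort algorithm.
ranked = hits.sort_values(
    by=["evalue", "bitscore", "sseqid"],
    ascending=[True, False, True],
)

# Keep the first (best) row per query and rename columns to match top_hits.
toy_top_hits = (
    ranked.drop_duplicates(subset="qseqid", keep="first")
    .rename(
        columns={
            "sseqid": "top_hit",
            "evalue": "top_hit_evalue",
            "bitscore": "top_hit_bitscore",
        }
    )
    .sort_values("qseqid")
    .reset_index(drop=True)
)
print(toy_top_hits)
```

In the real data, tied hits for a query are often different transcripts of the same KY21 gene, so the gene-level mapping may be unaffected by which one is picked.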


## 4.3 Extract UniProt and KY21 IDs to generate a mapping

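Extract the UniProt and KY21 IDs from the top hits and build a mapping between them. The UniProt accession is the second `|`-delimited field of `qseqid`, and the KY21 gene ID is the part of `top_hit` before the first `.v`.
Then check whether any UniProt accession maps to more than one KY21 ID and display any that do (there should be none).

Save the mapping to a TSV file for use by downstream functions.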

In [6]:
top_hits["nonref_protein"] = top_hits["qseqid"].str.split("|", expand=True)[1]
top_hits["ky_id"] = top_hits["top_hit"].str.split(".v", expand=True)[0]
display(top_hits)

uniprot_ky_map = top_hits[["nonref_protein", "ky_id"]].drop_duplicates()
display(uniprot_ky_map)

zt.utils.check_many_to_many_mappings(uniprot_ky_map, "nonref_protein", "ky_id")

uniprot_ky_map.to_csv(
    data_dir / "ciona_uniprot_ky_map.tsv",
    sep="\t",
    index=False,
)

Unnamed: 0,qseqid,top_hit,top_hit_evalue,top_hit_bitscore,nonref_protein,ky_id
0,sp|F6Q2R9|CRYBG_CIOIN,KY21.Chr1.2268.v1.nonSL14-1,7.650000e-57,171.0,F6Q2R9,KY21.Chr1.2268
1,sp|F6RCC2|NNRD_CIOIN,KY21.Chr2.1144.v1.nonSL6-1,0.000000e+00,737.0,F6RCC2,KY21.Chr2.1144
2,sp|F6W3G8|MTND_CIOIN,KY21.Chr9.269.v2.nonSL7-1,1.520000e-129,362.0,F6W3G8,KY21.Chr9.269
3,sp|F6X2V8|MTAP_CIOIN,KY21.Chr6.40.v1.nonSL8-1,0.000000e+00,584.0,F6X2V8,KY21.Chr6.40
4,sp|F6Y089|ITPA_CIOIN,KY21.Chr9.290.v1.nonSL4-1,8.680000e-144,399.0,F6Y089,KY21.Chr9.290
...,...,...,...,...,...,...
16417,tr|Q9NDQ5|Q9NDQ5_CIOIN,KY21.Chr13.333.v2.SL2-2,0.000000e+00,713.0,Q9NDQ5,KY21.Chr13.333
16418,tr|Q9NL28|Q9NL28_CIOIN,KY21.Chr12.1009.v2.SL2-1,0.000000e+00,756.0,Q9NL28,KY21.Chr12.1009
16419,tr|Q9NL43|Q9NL43_CIOIN,KY21.Chr1.2121.v1.SL2-1,0.000000e+00,846.0,Q9NL43,KY21.Chr1.2121
16420,tr|Q9NL46|Q9NL46_CIOIN,KY21.Chr7.155.v1.SL1-1,0.000000e+00,1420.0,Q9NL46,KY21.Chr7.155


Unnamed: 0,nonref_protein,ky_id
0,F6Q2R9,KY21.Chr1.2268
1,F6RCC2,KY21.Chr2.1144
2,F6W3G8,KY21.Chr9.269
3,F6X2V8,KY21.Chr6.40
4,F6Y089,KY21.Chr9.290
...,...,...
16417,Q9NDQ5,KY21.Chr13.333
16418,Q9NL28,KY21.Chr12.1009
16419,Q9NL43,KY21.Chr1.2121
16420,Q9NL46,KY21.Chr7.155


No identifiers in nonref_protein map to multiple values in ky_id
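
The splits above rely on the ID layouts seen in the outputs: `<db>|<accession>|<entry name>` for UniProt and `<gene ID>.v<version>.<transcript suffix>` for KY21. As a minimal sketch, assuming those layouts hold for every row, the same extraction can be written with regular expressions, which return `NaN` for any ID that does not match instead of silently producing a wrong value. The DataFrame `ids` and the variable `unparsed` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Two ID pairs copied from the outputs above.
ids = pd.DataFrame(
    {
        "qseqid": ["sp|F6Q2R9|CRYBG_CIOIN", "tr|Q9NL46|Q9NL46_CIOIN"],
        "top_hit": ["KY21.Chr1.2268.v1.nonSL14-1", "KY21.Chr7.155.v1.SL1-1"],
    }
)

# Capture the accession: the field between the first and second "|".
ids["nonref_protein"] = ids["qseqid"].str.extract(
    r"^[^|]+\|([^|]+)\|", expand=False
)

# Capture the gene ID: everything before the first ".v<digits>." (assumed layout).
ids["ky_id"] = ids["top_hit"].str.extract(r"^(.+?)\.v\d+\.", expand=False)

# Rows that fail either pattern contain NaN and can be inspected directly.
unparsed = ids[ids[["nonref_protein", "ky_id"]].isna().any(axis=1)]
print(ids)
print(f"{len(unparsed)} rows could not be parsed")
```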


## 4.4 Merge the mapping with the Zoogle results

The following cells are mostly illustrative data exploration, included to show why we don't merge the mapping into the Zoogle results at this stage. The relationship between human and Ciona genes is not 1:1: a single Ciona protein can be paired with several human genes, so a merged table cannot be treated as one human gene per Ciona protein.

In [4]:
ciona_zoogle_results = pd.read_csv(
    "../../data/2025-04-21-os-portal-reprocessed/per-nonref-species/Ciona-intestinalis.tsv",
    sep="\t",
    usecols=["nonref_protein", "ref_protein", "hgnc_gene_symbol"],
)
display(ciona_zoogle_results)

zt.utils.check_many_to_many_mappings(ciona_zoogle_results, "ref_protein", "hgnc_gene_symbol")

Unnamed: 0,nonref_protein,ref_protein,hgnc_gene_symbol
0,H2XQT6,P21439,ABCB4
1,F7BIM6,P26006,ITGA3
2,F6Z5V2,Q6ZT07,TBC1D9
3,F6S968,Q9H2B2,SYT4
4,F6VKE2,P30531,SLC6A1
...,...,...,...
50688,H2XPY4,P78312,FAM193A
50689,F6VCN7,Q92831,KAT2B
50690,F6QB43,Q8IZF0,NALCN
50691,H2Y2X3,Q9UM22,EPDR1


No identifiers in ref_protein map to multiple values in hgnc_gene_symbol
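
The left join in this section expects each `nonref_protein` to appear at most once in `uniprot_ky_map`, so the join should add a `ky_id` column without adding rows. A hedged sketch of making that expectation explicit uses the `validate` argument of `pandas.merge`, shown here on toy data with placeholder IDs (`P1`, `GENE_A`, `KY_1`, and so on are made up).

```python
import pandas as pd

# Toy stand-in for the Zoogle results: one protein paired with two human genes.
zoogle = pd.DataFrame(
    {
        "nonref_protein": ["P1", "P1", "P2"],
        "hgnc_gene_symbol": ["GENE_A", "GENE_B", "GENE_C"],
    }
)

# Toy stand-in for the UniProt-to-KY21 mapping: one row per protein.
mapping = pd.DataFrame(
    {
        "nonref_protein": ["P1", "P2"],
        "ky_id": ["KY_1", "KY_2"],
    }
)

# validate="many_to_one" raises pandas.errors.MergeError if any
# nonref_protein is duplicated on the right-hand side.
merged = zoogle.merge(
    mapping, on="nonref_protein", how="left", validate="many_to_one"
)

# A many-to-one left join keeps the left-hand row count unchanged.
assert len(merged) == len(zoogle)
print(merged)
```

The check guards only the mapping side of the join; the left-hand table can still contain several human genes per Ciona protein.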


In [5]:
merged_results = pd.merge(
    ciona_zoogle_results,
    uniprot_ky_map,
    on="nonref_protein",
    how="left",
)
display(merged_results)

zt.utils.check_many_to_many_mappings(merged_results, "nonref_protein", "hgnc_gene_symbol")

Unnamed: 0,nonref_protein,ref_protein,hgnc_gene_symbol,ky_id
0,H2XQT6,P21439,ABCB4,KY21.Chr11.1314
1,F7BIM6,P26006,ITGA3,KY21.Chr9.516
2,F6Z5V2,Q6ZT07,TBC1D9,KY21.Chr9.336
3,F6S968,Q9H2B2,SYT4,KY21.Chr10.936
4,F6VKE2,P30531,SLC6A1,KY21.Chr8.880
...,...,...,...,...
50688,H2XPY4,P78312,FAM193A,KY21.Chr5.476
50689,F6VCN7,Q92831,KAT2B,KY21.Chr7.877
50690,F6QB43,Q8IZF0,NALCN,KY21.Chr3.1336
50691,H2Y2X3,Q9UM22,EPDR1,KY21.Chr11.819


Found 6048 identifiers in nonref_protein that map to multiple values in hgnc_gene_symbol:
Showing first 10 multi-mappings found:
  A0A140TAT7 maps to 2 values: ['SHMT1', 'SHMT2']
  A0A1W2W0N1 maps to 3 values: ['SRD5A2', 'SRD5A3', 'SRD5A1']
  A0A1W2W150 maps to 4 values: ['YPEL5', 'YPEL1', 'YPEL3', 'YPEL4']
  A0A1W2W3I8 maps to 28 values: ['SYT13', 'SYT11', 'SYT10', 'RPH3A', 'SYT16', 'SYT4', 'SYT3', 'SYT17', 'MCTP1', 'SYTL1', 'SYT14', 'SYT5', 'MCTP2', 'SYT12', 'SYT15B', 'SYT15', 'SYT9', 'SYT8', 'DOC2B', 'SYT6', 'DOC2A', 'SYTL5', 'SYT2', 'SYT7', 'SYTL2', 'SYT14P1', 'SYT1', 'SYTL4']
  A0A1W2W422 maps to 6 values: ['CTDSPL2', 'CTDSP1', 'CTDSP2', 'CTDNEP1', 'CTDSPL', 'TIMM50']
  A0A1W2W480 maps to 5 values: ['KDM8', 'TYW5', 'HSPBAP1', 'JMJD7', 'HIF1AN']
  A0A1W2W5D3 maps to 5 values: ['JMJD7', 'KDM8', 'TYW5', 'HSPBAP1', 'HIF1AN']
  A0A1W2W635 maps to 5 values: ['AP4S1', 'AP2S1', 'AP1S2', 'AP1S1', 'AP1S3']
  A0A1W2W6U9 maps to 3 values: ['CAV1', 'CAV3', 'CAV2']
  A0A1W2W7Q7 maps to 3 values

nonref_protein
A0A140TAT7     2
A0A1W2W0N1     3
A0A1W2W150     4
A0A1W2W3I8    28
A0A1W2W422     6
              ..
Q9NDQ5        15
Q9NL28        19
Q9NL43         2
Q9NL46         4
Q9U6V0        19
Name: hgnc_gene_symbol, Length: 6048, dtype: int64
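
The per-protein counts above can be reproduced with plain pandas. The sketch below is not the implementation of `zt.utils.check_many_to_many_mappings`, which is not shown in this notebook; it is one assumed way to count how many distinct gene symbols each protein is paired with, using toy data with placeholder IDs.

```python
import pandas as pd

# Toy protein-to-gene pairs; P3 repeats gene "D" to show that duplicates count once.
toy = pd.DataFrame(
    {
        "nonref_protein": ["P1", "P1", "P2", "P3", "P3", "P3"],
        "hgnc_gene_symbol": ["A", "B", "C", "D", "E", "D"],
    }
)

# Number of distinct gene symbols per protein.
counts = toy.groupby("nonref_protein")["hgnc_gene_symbol"].nunique()

# Proteins paired with more than one gene symbol.
multi = counts[counts > 1]
print(multi)
```

Applied to `merged_results`, a count like this would show how far the table is from a one-gene-per-protein relationship.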