# Database Sources & Links

## UniProt - protein dataset
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/33237286/)
- DATABASE LINK (https://beta.uniprot.org/)

## VenomKB - a centralized resource for discovering therapeutic uses for animal venoms and venom compounds
- PREPRINT LINK (https://www.biorxiv.org/content/10.1101/295204v1.full)
- DATABASE LINK (http://venomkb.org/)

## DPL - a comprehensive database on sequences, structures, sources, and functions of peptide ligands
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/33216893/)
- DATABASE LINK (http://www.peptide-ligand.cn/index.php/category/home/)

## APD3 - the updated antimicrobial peptide database and its application in peptide design
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/18957441/)
- DATABASE LINK (https://aps.unmc.edu/downloads)

## PepBank - a database of peptides based on sequence text mining and public peptide data sources
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/17678535/)
- DATABASE LINK (http://pepbank.mgh.harvard.edu/)

## FermFooDb - a database of bioactive peptides derived from fermented foods
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/33898816/)
- DATABASE LINK (https://webs.iiitd.edu.in/raghava/fermfoodb/index.php)

## PlantAFP - a curated database of plant-origin antifungal peptides
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/31612325/)
- DATABASE LINK (http://bioinformatics.cimap.res.in/sharma/PlantAFP/)

## DBETH - a database of bacterial exotoxins for human
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/22102573/)
- DATABASE LINK (http://www.hpppi.iicb.res.in/btox/)

## THPdb - database of FDA approved peptide and protein therapeutics
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/28759605/)
- DATABASE LINK (https://webs.iiitd.edu.in/raghava/thpdb/)

## PepTherDia - database and structural composition analysis of approved peptide therapeutics and diagnostics
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/33647438/)
- DATABASE LINK (http://peptherdia.herokuapp.com)

## PlantPepDB - a manually curated plant peptide database
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/32042035/)
- DATABASE LINK (http://www.nipgr.ac.in/PlantPepDB/)

## ConoServer - a database for conopeptides
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/18065428/)
- DATABASE LINK (http://www.conoserver.org/)

## ISOB - a database of indigenous snake species of Bangladesh with respective known venom composition
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/25848172/)
- DATABASE LINK (http://www.snakebd.com/)
 
## T3DB - a database of toxic proteins and compounds for humans
- PUBMED LINK (https://pubmed.ncbi.nlm.nih.gov/25378312/)
- DATABASE LINK (http://www.t3db.ca/)
- **NOTE**: All entries are duplicates of UniProt entries

# UniProt Database

In [39]:
from bs4 import BeautifulSoup
from io import StringIO
import requests
import pandas as pd
import numpy as np
import json

In [2]:
uniprot_toxin_kw = [
    "0008",
    "0108",
    "0123",
    "0204",
    "0260",
    "0528",
    "0629",
    "0632",
    "0638",
    "0738",
    "0766",
    "0800",
    "0843",
    "0870",
    "0872",
    "0959",
    "1052",
    "1053",
    "1061",
    "1199",
    "1200",
    "1201",
    "1202",
    "1203",
    "1204",
    "1205",
    "1206",
    "1213",
    "1214",
    "1216",
    "1217",
    "1218",
    "1219",
    "1220",
    "1221",
    "1222",
    "1255",
    "1265",
    "1266",
    "1275",
    "0878",
]
uniprot_toxin_kw = ["KW-" + x for x in uniprot_toxin_kw]

In [71]:
# Reviewed (manually annotated) entries: one DataFrame per result page
uniprot_reviewed_frames = []

for toxin_kw in uniprot_toxin_kw:
    kw_url = f'https://rest.uniprot.org/uniprotkb/search?query=keyword:{toxin_kw}+AND+reviewed:true&format=tsv&fields=id,protein_name,sequence&size=500'

    # Follow the rel="next" Link header until the last page is reached
    while kw_url:
        kw_req = requests.get(kw_url)
        kw_req.raise_for_status()  # stop on HTTP errors instead of parsing an error body
        uniprot_reviewed_frames.append(pd.read_csv(StringIO(kw_req.text), sep='\t'))
        kw_url = kw_req.links.get('next', {}).get('url')

uniprot_all_rev = pd.concat(uniprot_reviewed_frames)

In [72]:
# Unreviewed (automatically annotated) entries: one DataFrame per result page
uniprot_unreviewed_frames = []

for toxin_kw in uniprot_toxin_kw:
    kw_url = f'https://rest.uniprot.org/uniprotkb/search?query=keyword:{toxin_kw}+AND+reviewed:false&format=tsv&fields=id,protein_name,sequence&size=500'

    # Follow the rel="next" Link header until the last page is reached
    while kw_url:
        kw_req = requests.get(kw_url)
        kw_req.raise_for_status()  # stop on HTTP errors instead of parsing an error body
        uniprot_unreviewed_frames.append(pd.read_csv(StringIO(kw_req.text), sep='\t'))
        kw_url = kw_req.links.get('next', {}).get('url')

uniprot_all_unrev = pd.concat(uniprot_unreviewed_frames)
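
The reviewed and unreviewed cells differ only in the `reviewed` flag, so the paging logic could be shared. The following is a minimal sketch, not the notebook's actual code: the function name `fetch_keyword_frames`, the `UNIPROT_SEARCH_URL` constant, and the injectable `get` and `page_size` parameters are hypothetical and introduced here for illustration. It relies on `requests` exposing the parsed `Link` header as `response.links`.

```python
from io import StringIO

import pandas as pd
import requests

UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def fetch_keyword_frames(keyword, reviewed, get=requests.get, page_size=500):
    """Return one DataFrame per result page for a UniProt keyword query.

    Hypothetical helper: `get` is injectable so the paging logic can be
    exercised without network access.
    """
    url = (
        f"{UNIPROT_SEARCH_URL}?query=keyword:{keyword}"
        f"+AND+reviewed:{str(reviewed).lower()}"
        f"&format=tsv&fields=id,protein_name,sequence&size={page_size}"
    )
    frames = []
    while url:
        response = get(url)
        response.raise_for_status()
        frames.append(pd.read_csv(StringIO(response.text), sep="\t"))
        # requests parses the Link header into `response.links`;
        # a missing "next" entry means this was the last page.
        url = response.links.get("next", {}).get("url")
    return frames
```

Under these assumptions, the reviewed set could be rebuilt by calling `fetch_keyword_frames(kw, reviewed=True)` for each entry of `uniprot_toxin_kw`, collecting the returned frames in one list, and passing that list to `pd.concat`.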

In [79]:
# Drop rows with duplicate sequences within each set (the first occurrence is kept)
uniprot_all_rev = uniprot_all_rev.drop_duplicates(subset='Sequence')
uniprot_all_unrev = uniprot_all_unrev.drop_duplicates(subset='Sequence')
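
Note that `drop_duplicates` works on each frame separately, so a sequence present in both the reviewed and unreviewed sets would remain in both. The toy example below, with made-up entry names and sequences, sketches how the per-frame behaviour and a possible cross-set check might look. The variable names are illustrative, and whether cross-set filtering is desirable depends on how the two files are used later.

```python
import pandas as pd

# Toy frames with made-up entries, using the column names of UniProt's TSV output.
rev = pd.DataFrame({
    "Entry Name": ["TOX1_EXAMPLE", "TOX2_EXAMPLE", "TOX3_EXAMPLE"],
    "Sequence": ["MKVL", "MKVL", "GLLP"],
})
unrev = pd.DataFrame({
    "Entry Name": ["A0A000_EXAMPLE", "A0A001_EXAMPLE"],
    "Sequence": ["GLLP", "XQVV"],
})

# Within one frame, the first row for each sequence is kept.
rev_unique = rev.drop_duplicates(subset="Sequence")

# Sequences present in both sets survive per-frame de-duplication.
shared = set(rev_unique["Sequence"]) & set(unrev["Sequence"])

# Optional: keep only unreviewed rows whose sequence is not already reviewed.
unrev_only = unrev[~unrev["Sequence"].isin(rev_unique["Sequence"])]
```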

In [86]:
# Write both sets out as gzip-compressed CSV files
uniprot_all_rev.to_csv('uniprot_manually_annotated.csv.gz', index=False)
uniprot_all_unrev.to_csv('uniprot_auto_annotated.csv.gz', index=False)
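
No `compression` argument is passed above because pandas infers gzip from the `.gz` extension by default. As a quick sanity check, the sketch below writes a toy frame (made-up rows, hypothetical file name) to a temporary directory and reads it back.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with made-up rows standing in for the UniProt results.
toy = pd.DataFrame({
    "Entry Name": ["TOX1_EXAMPLE", "TOX2_EXAMPLE"],
    "Sequence": ["MKVL", "GLLP"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_uniprot.csv.gz"
    toy.to_csv(out_path, index=False)

    # gzip streams start with the magic bytes 0x1f 0x8b.
    is_gzip = out_path.read_bytes()[:2] == b"\x1f\x8b"

    # read_csv infers the compression from the extension as well.
    round_trip = pd.read_csv(out_path)
```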

# VenomKB

In [74]:
# Assumes venomkb_proteins.json has already been downloaded
# from http://venomkb.org/download
import pandas as pd

In [83]:
venoms_df = pd.read_json('venomkb_proteins.json')
venoms_df = venoms_df[['name', 'description', 'aa_sequence']]
venoms_df = venoms_df.drop_duplicates(subset='aa_sequence')
venoms_df

| | name | description | aa_sequence |
|---|---|---|---|
| 0 | Thrombin-like enzyme asperase | Snake venom serine protease that clots human p... | MVLIRVLANLLILQLSYAQKSSELVIGGDECNINEHRSLVVLFNSS... |
| 1 | Snake venom serine protease | Snake venom serine protease that may act in th... | MVLIRVLANLLILQLSYAQKSSELVIGGDECNINEHRSLALVYITS... |
| 2 | Zinc metalloproteinase/disintegrin | Snake venom metalloproteinase: impairs hemosta... | MIEVLLVTICLAVSPYQGSSIILESGNVNDYEVVYPRKVTELPKGA... |
| 3 | Cysteine-rich venom protein VAR2 | Blocks ryanodine receptors, and potassium chan... | MILLKLYLTLAAILCQSRGTTSLDLDDLMTTNPEIQNEIINKHNDL... |
| 4 | Toxin KTx8 | This recombinant toxin inhibits the mammalian ... | MNKVCFVVVLVLFVALAAYVSPIEGVPTGGCPLSDSLCAKYCKSHK... |
| ... | ... | ... | ... |
| 6231 | Scolopendra 20566.01 Da toxin | | XKMSEQGLNAQMKAQIVDLHNXARQGVANGQ |
| 6232 | Scolopendra 20528.11 Da toxin | | XQVVERGLDAKAKAAMLDAHNKARQKVANG |
| 6233 | Phospholipase A1 verutoxin-1 | Catalyzes the hydrolysis of emulsified phospha... | GLLPKVKLVPEQISFILSTRENR |
| 6234 | Conantokin-E | Conantokins inhibit N-methyl-D-aspartate (NMDA... | LLVPLVTFHLILGMGTLDHGGALTERRSADATALKPEPVLLQKSDA... |


In [85]:
venoms_df.to_csv('venom_proteins_venomkb.csv', index=False)
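
One detail of the VenomKB step that may be worth checking: `drop_duplicates` treats missing values as equal to each other, so if any record lacked a sequence, one such row would survive into the CSV. The sketch below uses a made-up JSON payload and assumes the file is a list of records with the three fields selected above (the `extra_field` key is a placeholder, not a real VenomKB field). It drops missing sequences before de-duplicating.

```python
from io import StringIO

import pandas as pd

# Made-up records standing in for venomkb_proteins.json.
toy_json = StringIO(
    """[
    {"name": "Toxin A", "description": "made-up", "aa_sequence": "MKVL", "extra_field": 1},
    {"name": "Toxin B", "description": "made-up", "aa_sequence": "MKVL", "extra_field": 2},
    {"name": "Toxin C", "description": null, "aa_sequence": "GLLP", "extra_field": 3},
    {"name": "Toxin D", "description": "made-up", "aa_sequence": null, "extra_field": 4}
    ]"""
)

toy_df = pd.read_json(toy_json)
toy_df = toy_df[["name", "description", "aa_sequence"]]

# Remove records without a sequence, then keep the first row per sequence.
toy_df = toy_df.dropna(subset=["aa_sequence"])
toy_df = toy_df.drop_duplicates(subset="aa_sequence")
```

A missing `description`, as in the two Scolopendra rows of the output above, does not affect this filter because only `aa_sequence` is checked.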

# DPL