# Description

This notebook makes use of the **pandas** library and the **ete3 toolkit**, specifically ete3's [NCBI Taxonomic database](http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html). Several custom functions that simplify the use of ete3 live in the `zoonosis_helper_functions.py` script, which is imported into the notebook.


The primary data is obtained from the UniProt database. It contains data on proteins which facilitate viral [entry into host cells](https://www.uniprot.org/uniprot/?query=keyword:%22Virus%20entry%20into%20host%20cell%20[KW-1160]%22). The data is in two parts: the [tabular data](#tabular-data) and the [fasta sequences](#fasta) of the virus surface proteins. The two are linked by their UniProt entry identifiers. Only a very small portion of the data has been reviewed, which is not sufficient for deep learning, so both the reviewed and unreviewed data are kept. However, much of the unreviewed data lacks host information.

To fill in the missing host information, external sources were used, namely:

- [NCBI Virus database](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?VirusLineage_ss=Viruses,%20taxid:10239&SeqType_s=Protein&Proviral_s=exclude&HostLineage_ss=Mammalia%20(mammals),%20taxid:40674)
- [Virus-Host database](https://www.genome.jp/virushostdb/)
- [Enhanced Infectious Disease Database (EID2)](https://eid2.liverpool.ac.uk/OrganismInteractions)

1. [ete3 is first used to obtain taxonomic identifiers at the species level. If an identifier is already present, the ete3 taxonomic identifier is still used for consistency.](#ete3-taxo)

1. [Data is further filtered to keep only viruses (organism).](#filter)

1. [The dataset contains some repetitive information i.e. same virus, same hosts but different protein or different protein entry. Therefore, the next step was to fill in the host data using information from the reviewed data. The premise was if it's the same virus then it ought to have the same hosts.](#Updating-host-names-from-other-host-data-in-the-dataset)

1. [Thereafter, information from external sources is appended to the primary data. Rows still missing host data after final processing are dropped.](#Updating-host-names-from-external-sources)

1. [An additional column (Infects human) is then added. If at least one of the virus hosts is Homo sapiens, the row is assigned **1**; otherwise it is assigned **0**.](#Further-Processing)

1. [Since the data was obtained from multiple sources further processing was done to make the information format consistent.](#host-name-consistency)

1. [The sequence data is loaded and linked to the tabular data.](#fasta)

1. [Protein names are also updated from sequence data for consistency in the data.](#protein-names-from-sequence)

1. [After processing, the data is then split into training and testing data. Validation split is done upon loading the training data with **keras**.](#splits)

1. [Random undersampling of the datasets.](#Random-Undersampling-of-datasets)

1. [Write file sequences to fasta for feature extraction](#Write-file-sequences-to-fasta-for-feature-extraction)
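
The linking step in the list above joins sequences to the table through the UniProt entry identifier. A minimal sketch of that join, assuming standard UniProt FASTA headers of the form `>db|ACCESSION|ENTRY_NAME description`. The function name `read_fasta_by_accession` and the sample sequences are hypothetical and not taken from the notebook:

```python
from io import StringIO


def read_fasta_by_accession(handle):
    """Map UniProt accession -> sequence from a FASTA stream.

    Assumes headers look like '>tr|A0A0K1H9A7|A0A0K1H9A7_9HIV1 ...'.
    """
    sequences = {}
    accession = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # The accession is the second '|'-separated field of the header
            accession = line[1:].split('|')[1]
            sequences[accession] = []
        elif accession is not None:
            sequences[accession].append(line)
    return {acc: ''.join(parts) for acc, parts in sequences.items()}


# Made-up sequences, for illustration only
fasta_text = """>tr|A0A0K1H9A7|A0A0K1H9A7_9HIV1 Envelope glycoprotein gp160
MRVKGIQ
KNYQHLW
>tr|K0N0Y8|K0N0Y8_PHUV Capsid protein
MPKRDAP
"""
seqs = read_fasta_by_accession(StringIO(fasta_text))
```

The resulting dictionary could then be mapped onto the `Entry` column, for example `df['Sequence'] = df['Entry'].map(seqs)`, where the `Sequence` column name is an assumption.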


### [Absolutely no idea why Virus host name != Virus hosts](#issue)

## Packages

In [1]:
# Import all necessary modules
## Always import pandas before swifter ##
import pandas as pd
import swifter # enables pandas multiprocessing using modin and ray as a backend. Also adds progress bar functionality
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
import re
import os
from ete3 import NCBITaxa
from pprint import pprint
from tqdm.notebook import tqdm_notebook, tqdm
# import warnings
# warnings.filterwarnings("ignore", category=UserWarning)
from zoonosis_helper_functions import *

Please check the zoonosis_helper_functions.py in the current directory

In [2]:
# Configure Progress bar and Modin Pandas Engine

tqdm.pandas(desc='Processing')
os.environ["MODIN_ENGINE"] = "ray"

## Data exploration

<a id='tabular-data'></a>

In [3]:
# Load dataset downloaded from Uniprot
df = pd.read_table('../data/uniprot-keyword Virus+entry+into+host+cell+[KW-1160] +fragment no.tab.gz')

In [4]:
df.shape

(358333, 9)

In [5]:
df.sample(3)

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts
291187,A0A0K1H9A7,A0A0K1H9A7_9HIV1,unreviewed,Envelope glycoprotein gp160 (Env polyprotein) ...,Human immunodeficiency virus 1,856,11676,Human immunodeficiency virus 1,Homo sapiens (Human) [TaxID: 9606]
9698,K0N0Y8,K0N0Y8_PHUV,unreviewed,Capsid protein (Coat protein),Pepper huasteco yellow vein virus (PHYVV) (Pep...,251,223303,Pepper huasteco yellow vein virus (PHYVV) (Pep...,Capsicum annuum (Capsicum pepper) [TaxID: 4072]
183524,N0A2E7,N0A2E7_9INFA,unreviewed,Nucleoprotein (Nucleocapsid protein) (Protein N),Influenza A virus (A/blue-winged teal/North Da...,498,1322771,Influenza A virus,


In [6]:
# Check for number of rows with missing host names
print(df[df['Virus hosts'].isnull()].shape)
df[df['Virus hosts'].isnull()].sample(3)

(237573, 9)


Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts
216582,A0A2S0SZG8,A0A2S0SZG8_PAVC,unreviewed,Capsid protein VP1 (Coat protein VP1),Canine parvovirus 2a,727,497961,Carnivore protoparvovirus 1,
252726,A0A3S6H6K7,A0A3S6H6K7_9INFA,unreviewed,Hemagglutinin [Cleaved into: Hemagglutinin HA1...,Influenza A virus (A/Cheongju/G03578/2016(H3N2)),566,1937226,Influenza A virus,
250265,D7RYH7,D7RYH7_9INFA,unreviewed,Hemagglutinin [Cleaved into: Hemagglutinin HA1...,Influenza A virus (A/Ulyanovsk/CRIE-SHTA/2009(...,566,762371,Influenza A virus,


In [7]:
# Total number of different organisms in dataset (inclusive of reviewed and non-reviewed)
df['Organism'].nunique()

100216

In [8]:
# Total number of different organisms with reviewed data
df[df['Status'] == 'reviewed']['Organism'].nunique()

1518

In [9]:
df[df['Status'] == 'unreviewed']['Organism'].nunique()

99095

In [10]:
# Number of distinct host annotations among reviewed entries
df[df['Status'] == 'reviewed']['Virus hosts'].nunique()

321

In [11]:
df[df['Status'] == 'unreviewed']['Virus hosts'].nunique()

200

In [12]:
# Number of distinct host annotations overall (inclusive of reviewed and non-reviewed)
df['Virus hosts'].nunique()

373
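
The cells above use `nunique()`, which counts distinct host strings rather than rows. A small sketch of the difference on a toy frame (the data below is invented for illustration and only mirrors the `Status` and `Virus hosts` column names):

```python
import pandas as pd

# Toy frame mirroring the 'Status' and 'Virus hosts' columns
toy = pd.DataFrame({
    'Status': ['reviewed', 'reviewed', 'unreviewed', 'unreviewed'],
    'Virus hosts': ['Homo sapiens (Human) [TaxID: 9606]',
                    'Homo sapiens (Human) [TaxID: 9606]',
                    None,
                    'Sus scrofa (Pig) [TaxID: 9823]'],
})

rows_with_hosts = toy['Virus hosts'].notna().sum()   # rows that have a host
distinct_hosts = toy['Virus hosts'].nunique()        # distinct host strings
# Non-null host rows per review status
rows_by_status = toy.groupby('Status')['Virus hosts'].count()
```

Here three rows carry a host but only two distinct host strings exist, so the two measures should not be read interchangeably.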

In [13]:
# Check that no organism taxonomy data is missing: the number of distinct tax IDs should equal the number of distinct organisms
df['Taxonomic lineage IDs'].nunique()

100216

## Initial processing

In [14]:
## Replace missing host values with an empty string; prevents errors in column-wide string operations

df['Virus hosts'] = np.where(df['Virus hosts'].isnull(), '',df['Virus hosts'])

In [15]:
df.sample(5)

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts
300089,D4NUH0,D4NUH0_9HIV1,unreviewed,Envelope glycoprotein gp160 (Env polyprotein) ...,Human immunodeficiency virus 1,849,11676,Human immunodeficiency virus 1,Homo sapiens (Human) [TaxID: 9606]
102900,E7FL33,E7FL33_RUBV,unreviewed,Capsid protein (Coat protein) (E1 envelope gly...,Rubella virus (RUBV),1063,11041,Rubella virus (RUBV),Homo sapiens (Human) [TaxID: 9606]
337621,K4JMP2,K4JMP2_9HIV1,unreviewed,Protein Vpr (R ORF protein) (Viral protein R),Human immunodeficiency virus 1,96,11676,Human immunodeficiency virus 1,Homo sapiens (Human) [TaxID: 9606]
120262,Q1KL40,Q1KL40_9HEPC,unreviewed,Core protein precursor (EC 2.7.7.48) (EC 3.4.2...,Hepatitis C virus subtype 6a,3019,31655,Hepacivirus C,
106942,A0A1V1FWA8,A0A1V1FWA8_9PICO,unreviewed,Genome polyprotein (EC 2.7.7.48) (EC 3.4.22.28...,Foot-and-mouth disease virus - type O,2332,12118,Foot-and-mouth disease virus,


In [16]:
def join_names(df, col_name: str):
    df[col_name] = df[col_name].str.split('; ').apply(set).apply('; '.join) # 'set' function removes duplicate entries
    return df
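
`join_names` relies on `set`, whose iteration order is not guaranteed, so the joined strings may come out in a different order between runs. If a stable order matters, one possible alternative is an order-preserving de-duplication. This is a sketch only, and `dedupe_joined` is a hypothetical name, not a function from the helper script:

```python
def dedupe_joined(value, sep='; '):
    """Remove duplicate items from a separator-joined string, keeping first-seen order."""
    # dict.fromkeys keeps insertion order (Python 3.7+), unlike set
    return sep.join(dict.fromkeys(value.split(sep)))


example = dedupe_joined(
    'Aves [TaxID: 8782]; Sus scrofa [TaxID: 9823]; Aves [TaxID: 8782]'
)
```

It could be applied column-wide with `df[col_name].apply(dedupe_joined)`.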

In [17]:
# df['Virus hosts'] = df['Virus hosts'].str.split('; ')
# df['Virus hosts'] = df['Virus hosts'].swifter.progress_bar(enable=True, desc='Removing duplicate host names').apply(set)
# df['Virus hosts'] = df['Virus hosts'].swifter.progress_bar(enable=True, desc='Joining host names list').apply('; '.join)

# df['Protein names'] = df['Protein names'].str.split('; ')
# df['Protein names'] = df['Protein names'].swifter.progress_bar(enable=True, desc='Removing duplicate protein names').apply(set)
# df['Protein names'] = df['Protein names'].swifter.progress_bar(enable=True, desc='Joining protein names list').apply('; '.join)

# df['Organism'] = df['Organism'].str.split('; ')
# df['Organism'] = df['Organism'].swifter.progress_bar(enable=True, desc='Removing duplicate organism names').apply(set)
# df['Organism'] = df['Organism'].swifter.progress_bar(enable=True, desc='Joining organism names list').apply('; '.join)

In [18]:
# Remove duplicate entries if present
df = join_names(df, 'Virus hosts')
df = join_names(df, 'Protein names')
df = join_names(df, 'Organism')

df.sample(3)

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts
151272,J3EJL5,J3EJL5_9PSED,unreviewed,Integrase,Pseudomonas sp. GM21,399,1144325,Pseudomonas sp. GM21,
140663,I2E0L8,I2E0L8_9INFB,unreviewed,Nucleoprotein (Nucleocapsid protein) (Protein N),Influenza B virus (B/Malaysia/30/2007),560,1038387,Influenza B virus,
293899,A0A1D9IYX2,A0A1D9IYX2_9HIV1,unreviewed,Envelope glycoprotein gp160 (Env polyprotein) ...,Human immunodeficiency virus 1,859,11676,Human immunodeficiency virus 1,Homo sapiens (Human) [TaxID: 9606]


<a id="ete3-taxo" ></a>

In [19]:
# Species ID from organism ID
df['Species taxonomic ID'] = (df['Taxonomic lineage IDs']
                              .swifter.progress_bar(enable=True, desc='Getting Viruses taxonomic IDs')
                              .apply(getRankID, rank='species')) # getRankID function in zoonosis_helper_functions.py

Getting Viruses taxonomic IDs:   0%|          | 0/16 [00:00<?, ?it/s]
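
`getRankID` is defined in `zoonosis_helper_functions.py` and is not shown here. As a rough sketch of how such a lookup could work, assuming it walks the lineage of a taxid and returns the ancestor with the requested rank: the function name `rank_id_from_lineage` is hypothetical, and the lineage and rank labels below are shortened, illustrative values rather than real database output.

```python
def rank_id_from_lineage(lineage, ranks, rank='species'):
    """Return the taxid in `lineage` whose rank equals `rank`, or None.

    lineage: list of taxids from the root down to the query taxid
    ranks:   dict mapping taxid -> rank name
    """
    for taxid in lineage:
        if ranks.get(taxid) == rank:
            return taxid
    return None


# With ete3, the two inputs would come from NCBITaxa (this needs the local
# taxonomy database, so it is left as a comment):
#     ncbi = NCBITaxa()
#     lineage = ncbi.get_lineage(taxid)
#     ranks = ncbi.get_rank(lineage)

# Shortened, illustrative lineage for an Influenza A strain
toy_lineage = [1, 10239, 11320, 2028552]
toy_ranks = {1: 'no rank', 10239: 'superkingdom',
             11320: 'species', 2028552: 'no rank'}
species_id = rank_id_from_lineage(toy_lineage, toy_ranks)
```

Keeping the lineage walk separate from the database call makes the logic easy to check without downloading the taxonomy dump.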

In [20]:
# Copy for later use
dff = df[['Entry', 'Species taxonomic ID']].copy()

In [21]:
df.sample(3)

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts,Species taxonomic ID
151824,A0A101KRF5,A0A101KRF5_RHILI,unreviewed,Integrase,Rhizobium loti (Mesorhizobium loti),407,381,Rhizobium loti (Mesorhizobium loti),,381.0
11306,A7UC63,A7UC63_9MONO,unreviewed,Hemagglutinin-neuraminidase (EC 3.2.1.18),Avian orthoavulavirus 1,571,2560319,Avian orthoavulavirus 1,,2560319.0
26498,A0A2I6SX06,A0A2I6SX06_9INFA,unreviewed,Nucleoprotein (Nucleocapsid protein) (Protein N),Influenza A virus (A/swine/North Carolina/A017...,498,2028552,Influenza A virus,,11320.0


In [22]:
# Check if all tax IDs could be found in NCBI taxonomy database
df[df['Species taxonomic ID'].isnull()].sample(3)

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts,Species taxonomic ID
259582,I7C489,I7C489_GALIV,unreviewed,Genome polyprotein (EC 3.6.1.15) (P1C) (P1D) (...,Gallivirus A (isolate Turkey/Hungary/M176/2011...,2474,1560035,Gallivirus A,Meleagris gallopavo (Wild turkey) [TaxID: 9103],
189041,Q0NCN9,Q0NCN9_VAR65,unreviewed,IMV membrane protein,Variola virus (isolate Human/South Africa/102/...,133,587201,Variola virus,Homo sapiens (Human) [TaxID: 9606],
233522,Q0NLX6,Q0NLX6_VAR66,unreviewed,Protein L5,Variola virus (isolate Human/Brazil/v66-39/196...,128,587203,Variola virus,Homo sapiens (Human) [TaxID: 9606],


In [23]:
# Look up the species ID by species name for rows where the tax ID lookup failed
idx_species_name = df.columns.get_loc('Taxonomic lineage (SPECIES)')
idx_organism_id = df.columns.get_loc('Species taxonomic ID')

for row in tqdm_notebook(range(len(df)), desc='Getting species ID from organism name'):
    if np.isnan(df.iat[row, idx_organism_id]):
        df.iat[row, idx_organism_id] = getIDfromName(df.iat[row, idx_species_name]) # getIDfromName function in zoonosis_helper_functions.py

Getting species ID from organism name:   0%|          | 0/358333 [00:00<?, ?it/s]
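
The loop above visits every row even though only rows with a missing ID need a lookup. A possible vectorised variant restricts the lookup to those rows with a boolean mask. This is a sketch: `lookup_id` is a stand-in for the notebook's `getIDfromName`, and the IDs it returns are placeholders, not real taxids.

```python
import numpy as np
import pandas as pd


def lookup_id(name):
    """Stand-in for getIDfromName; returns placeholder IDs from a fixed mapping."""
    return {'Variola virus': 111, 'Gallivirus A': 222}.get(name, np.nan)


toy = pd.DataFrame({
    'Taxonomic lineage (SPECIES)': ['Influenza A virus', 'Variola virus', 'Gallivirus A'],
    'Species taxonomic ID': [11320.0, np.nan, np.nan],
})

# Only rows whose ID is missing are passed to the name-based lookup
missing = toy['Species taxonomic ID'].isna()
toy.loc[missing, 'Species taxonomic ID'] = (
    toy.loc[missing, 'Taxonomic lineage (SPECIES)'].map(lookup_id)
)
```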

In [24]:
df[df['Species taxonomic ID'].isnull()]

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts,Species taxonomic ID


In [25]:
df['Species taxonomic ID'] = df['Species taxonomic ID'].apply(int) # convert taxid from floats to int

In [26]:
df.shape

(358333, 10)

In [27]:
df = (df.drop(['Status','Taxonomic lineage IDs'], axis=1)
      .groupby('Species taxonomic ID', as_index=False)
      .agg({'Virus hosts':set, 'Organism':set,
            'Protein names':set, 'Taxonomic lineage (SPECIES)':'first'}))

In [28]:
df['Virus hosts'] = df['Virus hosts'].str.join('; ')
df['Organism'] = df['Organism'].str.join('; ')
df['Protein names'] = df['Protein names'].str.join('; ')
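
Aggregating with `set` and then joining keeps empty host strings, which is why some joined values later start with `; `. A variation that drops empty entries and sorts the rest during aggregation is sketched below. It is not the notebook's method; `join_unique` is a hypothetical helper and the rows are toy data.

```python
import pandas as pd


def join_unique(values, sep='; '):
    """Join the distinct non-empty strings in `values`, sorted for reproducibility."""
    return sep.join(sorted({v for v in values if v}))


toy = pd.DataFrame({
    'Species taxonomic ID': [11320, 11320, 11676],
    'Virus hosts': ['', 'Aves [TaxID: 8782]', 'Homo sapiens (Human) [TaxID: 9606]'],
    'Protein names': ['Hemagglutinin', 'Nucleoprotein', 'Protein Vpr'],
})

# One row per species, with hosts and protein names collapsed into strings
grouped = (toy.groupby('Species taxonomic ID', as_index=False)
              .agg({'Virus hosts': join_unique, 'Protein names': join_unique}))
```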

In [29]:
df.sample(5)

Unnamed: 0,Species taxonomic ID,Virus hosts,Organism,Protein names,Taxonomic lineage (SPECIES)
5220,1397528,,Halomonas sp. PBN3,Integrase; Tyr recombinase domain-containing p...,Halomonas sp. PBN3
14227,2723938,,Listeria virus P200,Portal protein,Listeria virus P200
2128,232237,,Xanthomonas virus Xp10,7R,Xanthomonas virus Xp10
2421,313985,,Geobacter lovleyi (strain ATCC BAA-1151 / DSM ...,Integrase family protein,Geobacter lovleyi
13298,2601613,,Klebsiella phage KOX3,Internal virion protein gp15; Internal virion ...,Klebsiella phage KOX3


In [30]:
df.shape

(15109, 5)

In [31]:
# Get species name from NCBI taxo database using Taxonomic ID
df['Species name'] = (df.drop('Taxonomic lineage (SPECIES)', axis=1)
                      .swifter.progress_bar(enable=True, desc='Getting Species name')
                      .apply(lambda x: getRankName(x['Species taxonomic ID'], 
                                                   rank='species'), axis=1))

In [32]:
# Get superkingdom name from NCBI taxo database using Taxonomic ID
df['Species superkingdom'] = df['Species taxonomic ID'].progress_apply(getRankName, rank='superkingdom')

Processing:   0%|          | 0/15109 [00:00<?, ?it/s]

In [33]:
# Get family from NCBI taxo database using Taxonomic ID
df['Species family'] = df['Species taxonomic ID'].progress_apply(getRankName, rank='family')

Processing:   0%|          | 0/15109 [00:00<?, ?it/s]

In [34]:
df['Species superkingdom'].unique()

array(['Bacteria', 'Archaea', 'Eukaryota', 'Viruses', 'IncJ plasmid R391',
       'uncultured organism', 'metagenome', 'Plasmid pFKY1',
       'human gut metagenome', 'marine metagenome',
       'mine drainage metagenome', 'marine sediment metagenome',
       'freshwater metagenome',
       'uncultured marine microorganism HF4000_005I08',
       'wastewater metagenome', 'hydrothermal vent metagenome',
       'sediment metagenome', 'viral metagenome', 'biofilter metagenome',
       'bioreactor metagenome', 'anaerobic digester metagenome',
       'plant metagenome', 'invertebrate metagenome'], dtype=object)

<a id="filter"></a>

In [35]:
# Filter to include only viruses
df = df[df['Species superkingdom'] == 'Viruses']

In [36]:
df.sample(5)

Unnamed: 0,Species taxonomic ID,Virus hosts,Organism,Protein names,Taxonomic lineage (SPECIES),Species name,Species superkingdom,Species family
7233,1863008,,Shigella phage SHFML-11,Portal protein (gp20); Tail sheath monomer; Pr...,Shigella phage SHFML-11,Shigella phage SHFML-11,Viruses,Myoviridae
5141,1357713,,Bacillus phage phiCM3,"Portal protein, HK97 family; YdcL",Bacillus phage phiCM3,Bacillus phage phiCM3,Viruses,Siphoviridae
11734,2500799,,Mycobacterium phage CicholasNage,Integrase,Mycobacterium phage CicholasNage,Mycobacterium phage CicholasNage,Viruses,Siphoviridae
12156,2548107,,Streptococcus phage Javan320,Portal protein,Streptococcus phage Javan320,Streptococcus phage Javan320,Viruses,Siphoviridae
7232,1862978,,Etheostoma fonticola aquareovirus,Putative outer capsid protein,Etheostoma fonticola aquareovirus,Etheostoma fonticola aquareovirus,Viruses,Reoviridae


In [37]:
df.drop(['Taxonomic lineage (SPECIES)'], axis=1, inplace=True)

In [38]:
# Convert empty strings to nan for easy downstream processing
df['Virus hosts'] = np.where(df['Virus hosts']=='', np.nan, df['Virus hosts'])

In [39]:
df[df['Virus hosts'].isnull()].sample(3)

Unnamed: 0,Species taxonomic ID,Virus hosts,Organism,Protein names,Species name,Species superkingdom,Species family
14813,2735535,,Bacillus phage phi3Ts,Integrase; Core-binding (CB) domain-containing...,Bacillus phage phi3Ts,Viruses,Siphoviridae
10592,2170413,,Siphoviridae sp.,phage_tail_N domain-containing protein; Integr...,Siphoviridae sp.,Viruses,Siphoviridae
13887,2686372,,Pseudomonas phage CHF1,Portal protein (Head-to-tail connector),Pseudomonas phage CHF1,Viruses,Autographiviridae


In [40]:
df.drop('Organism', axis=1, inplace=True) # Organism == Species name

## Updating host names from other host data in the dataset

Premise: Same virus has same host irrespective of whether the info has been reviewed or not
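
The premise can be illustrated on a toy frame: hosts known for one row of a virus are copied to rows of the same virus that have none. This is a compact sketch using `map` and `fillna`, not the exact sequence of cells that follows, and the rows are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Species name': ['Influenza A virus', 'Influenza A virus',
                     'Rubella virus', 'Wabat virus'],
    'Virus hosts': ['Aves [TaxID: 8782]', np.nan,
                    'Homo sapiens (Human) [TaxID: 9606]', np.nan],
})

# Known hosts per virus, joined into one string
known = (toy.dropna(subset=['Virus hosts'])
            .groupby('Species name')['Virus hosts']
            .agg(lambda hosts: '; '.join(hosts)))

# Fill only the gaps; viruses with no annotated sibling stay missing
toy['Virus hosts'] = toy['Virus hosts'].fillna(toy['Species name'].map(known))
```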

In [41]:
# List of viruses which do not have assigned hosts in the data
noHostViruses = (df[df['Virus hosts'].isnull()]['Species name']
                 .unique()
                 .tolist())

In [42]:
# Create an independent dataframe of viruses with no assigned host and simultaneously identify the same viruses in the data
# which already have assigned hosts, then assign host names based on those.
df_na_hosts = df[(~df['Virus hosts'].isnull()) & (df['Species name'].isin(noHostViruses))][['Species name', 'Virus hosts']]
df_na_hosts = df_na_hosts.groupby('Species name')['Virus hosts'].apply(list) # Reduces dimension
df_na_hosts = df_na_hosts.reset_index(name='Viral hosts nw')

In [43]:
# Previous operation returns a list when a virus has multiple hosts
# Converts the lists into regular string entries separated by a ;
df_na_hosts['Viral hosts nw'] = (df_na_hosts['Viral hosts nw']
                                 .swifter.progress_bar(desc='Joining host names list', enable=True)
                                 .apply('; '.join))

In [44]:
# Updates the viruses hosts info in the main dataset
df_naa = (df[df['Virus hosts'].isnull()]
          .merge(df_na_hosts, on='Species name', how='left')
          .drop('Virus hosts', axis=1)
          .rename({'Viral hosts nw':'Virus hosts'}, axis=1))

In [45]:
# Creates independent dataset with viruses which have hosts
df_notna = df[~df['Virus hosts'].isnull()]

In [46]:
# Merges the updated virus hosts dataset with the dataset of viruses which have hosts
df = pd.concat([df_naa, df_notna])  # DataFrame.append was removed in pandas 2.0

In [47]:
df.shape # Reduced dimension because of grouping, will later ungroup

(8113, 6)

In [48]:
df.sample(5)

Unnamed: 0,Species taxonomic ID,Protein names,Species name,Species superkingdom,Species family,Virus hosts
5894,2591238,Spike glycoprotein; Spike protein S2; Spike pr...,Coronavirus BtRt-BetaCoV/GX2018,Viruses,Coronaviridae,
1144,1105173,Core protein (EC 3.4.21.91) (EC 3.6.1.15) (EC ...,Marisma mosquito virus,Viruses,Flaviviridae,
3668,2045361,Integrase; Tailspike protein,Escherichia phage APC_JM3.2,Viruses,Podoviridae,
4874,2502449,Portal protein,Streptomyces phage BoomerJR,Viruses,Siphoviridae,
3895,2094134,Portal protein,Gordonia phage Kerry,Viruses,Siphoviridae,


In [49]:
print(df[df['Virus hosts'].isnull()].shape)
df[df['Virus hosts'].isnull()].sample(3)

(7512, 6)


Unnamed: 0,Species taxonomic ID,Protein names,Species name,Species superkingdom,Species family,Virus hosts
3494,2025388,Capsid protein (Coat protein),Maize striate mosaic virus,Viruses,Geminiviridae,
1488,1289596,Integrase; Putative phage portal protein,Streptococcus phage phi30c,Viruses,Siphoviridae,
6525,2697540,Portal protein (gp20); Tail sheath monomer; Pr...,Escherichia phage teqsoen,Viruses,Myoviridae,


In [50]:
df['Virus hosts'] = np.where(df['Virus hosts'].isnull(), '', df['Virus hosts'])

In [51]:
df.sample(3)

Unnamed: 0,Species taxonomic ID,Protein names,Species name,Species superkingdom,Species family,Virus hosts
4044,2169741,Capsid protein (Coat protein),Whitefly-associated begomovirus 3,Viruses,Geminiviridae,
6981,2733882,Internal virion protein gp14; Internal virion ...,Salmonella virus Vi06,Viruses,Autographiviridae,
2110,1701260,Portal protein,Streptococcus phage phiSC070807,Viruses,Siphoviridae,


In [52]:
df = mergeRows(df, 'Species taxonomic ID','Virus hosts') # mergeRows in zoonosis_helper_functions.py

In [53]:
df[(df['Species name'].str.contains('Influenza A virus')) & (df['Virus hosts'] != '')]

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family
161,11320,; Aves [TaxID: 8782]; Aves [TaxID: 8782]; Sus ...,Nucleoprotein; Hemagglutinin; Nucleoprotein (N...,Influenza A virus,Viruses,Orthomyxoviridae


In [54]:
df.sample(3)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family
4657,2169967,,Tail fiber protein; Portal protein B (GpB) (Mi...,Escherichia virus DE3,Viruses,Siphoviridae
4140,2029659,,Putative portal protein,Lactococcus phage 16802,Viruses,Siphoviridae
1526,1034067,,Capsid protein (Coat protein),Jatropha mosaic India virus,Viruses,Geminiviridae


In [55]:
# Separate dataset for easy tracking of updates
dfna = df[df['Virus hosts'] == '']
df = df[~(df['Virus hosts'] == '')]

In [56]:
dfna.shape

(7512, 6)

In [57]:
df.shape

(601, 6)

## Updating host names from external sources

In [58]:
# Data from NCBI Virus
df2 = pd.read_csv('../data/sequences.csv')
df2.shape

(2599675, 3)

In [59]:
df2.sample(2)

Unnamed: 0,Species,Molecule_type,Host
1735656,Human immunodeficiency virus 1,ssRNA-RT,Homo sapiens
1619550,Influenza A virus,ssRNA(-),Sus scrofa


In [60]:
df2.drop_duplicates(inplace=True)
df2.shape

(10956, 3)

In [61]:
# Get taxonomic IDs from species names
df2['Species ID'], df2['Host ID'] = df2['Species'].progress_apply(getIDfromName), df2['Host'].progress_apply(getIDfromName)

Processing:   0%|          | 0/10956 [00:00<?, ?it/s]

'Ungulate copiparvovirus 5'
'Feline pegivirus JP03-2471'
'Feline pegivirus JP03-3208'
'Mamastrovirus HMU-1'
'Hedgehog coronavirus'
'Cingulatid gammaherpesvirus 1'
'Mammarenavirus AnRB3214'
'Bat SARS-like coronavirus Khosta-1'
'Bat SARS-like coronavirus Khosta-2'
'Ungulate copiparvovirus 5'
'Torque teno mustelid virus 2'
'Feline stool-associated circular virus'
'Jingmen Rhinolophus sinicus hepacivirus 1'
'Wenzhou Apodemus agrarius hepacivirus 1'
'Longquan Rhinolophus sinicus hepacivirus 1'
'Longquan Niviventer niviventer hepacivirus 1'
'Longquan Niviventer fulvescens hepacivirus 1'
'Wenzhou Suncus murinus hepacivirus 1'
'Wufeng Rhinolophus sinicus hepacivirus 1'
'Wufeng Niviventer niviventer hepacivirus 1'
'Wufeng Niviventer fulvescens hepacivirus 1'
'Wenzhou Rattus norvegicus pegivirus 1'
'Wenzhou Rattus tanezumi pegivirus 1'
'Longquan Rhinolophus pearsonii pegivirus 1'
'Longquan Rhinolophus sinicus pegivirus 1'
'Longquan Niviventer niviventer pegivirus 1'
'Longquan Niviventer fulvesce

Processing:   0%|          | 0/10956 [00:00<?, ?it/s]

'Bolomys lasiurus'
'Bolomys lasiurus'
'Pipistrellus sp. pipistrellus/pygmaeus AO-2021'
'Pipistrellus musciculus'
'Funisciurus bayonii'
'Rattus sp. r3 YH-2020'
'Rattus sp. r3 YH-2020'
'Soricidae sp. YH-2020'
'Rattus sp. r3 YH-2020'
'Rattus sp. r3 YH-2020'
'Acomys selousi'
'Rhinolophus smithersi'
'Alouatta sp.'
'Pipistrellys abramus'
'Sturnira angeli'
'Sturnira angeli'
'Hipposideros curtus'
'Pipistrellus inexspectatus'
'Dobsonia exoleta'
'Mops demonstrator'
'Pipistrellus musciculus'
'Mus sp. TG-2020'
'Murinae gen. sp. TG-2020'
'Vespadelus baverstocki'
'Ozimops sp. DP-2019'
'Scoterepens balstoni'
'Neoromicia capensis'
'Neoromicia capensis'
'Mus sp. CL-2019'
'Mus sp. CL-2019'
'Neoromicia capensis'
'Bolomys lasiurus'
'Bolomys lasiurus'
'Bolomys lasiurus'
'Neoromicia capensis'
'Neoromicia capensis'
'Neoromicia capensis'
'Neoromicia capensis'
'Bolomys lasiurus'
'Pipistrellus inexspectatus'
'Chiroptera sp.'
'Chaerephon aloysiisabaudiae'
'Chiroptera sp.'
'Paradoxurus musangus'
'Neoromicia capen

In [62]:
df2.dropna(inplace=True)
df2['Species ID'], df2['Host ID'] = df2['Species ID'].astype(int), df2['Host ID'].astype(int)
df2.shape

(10802, 5)

In [63]:
df2['Host name'] = df2.progress_apply(lambda x: nameMerger(x['Host'], x['Host ID']), axis=1)
# Remove Host and Host ID columns as they have been merged and are no longer needed
df2.drop(['Host', 'Host ID'], axis=1, inplace=True)

Processing:   0%|          | 0/10802 [00:00<?, ?it/s]
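
`nameMerger` lives in the helper script and is not shown. Judging by the `Host name` values printed further down (for example `Musa sp. [TaxID: 46838]`), it appears to combine a host name and its taxid into the layout used in the UniProt host column. A hypothetical equivalent, under that assumption:

```python
def merge_name_and_id(name, taxid):
    """Format a host as 'Name [TaxID: N]'; taxid may arrive as a float."""
    return f'{name} [TaxID: {int(taxid)}]'


label = merge_name_and_id('Sus scrofa', 9823.0)
```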

In [64]:
df2['Species ID'] = df2['Species ID'].progress_apply(getRankID, rank='species')

Processing:   0%|          | 0/10802 [00:00<?, ?it/s]

In [65]:
## Create a copy for later use
dfff = df2.copy()

In [66]:
# Add host names
df_na_hosts = AggregateHosts(df2,'Species ID', 'Host name')
dfna = dfna.merge(df_na_hosts, left_on='Species taxonomic ID', right_on='Species ID', how='left')
dfna = dfna.drop(['Virus hosts', 'Species ID'], axis=1).rename({'Host name':'Virus hosts'}, axis=1)
dfna = UpdateHosts(dfna, df_na_hosts, 'Species taxonomic ID', 'Species ID')
df, dfna = UpdateMain(df, dfna)
df = mergeRows(df, 'Species taxonomic ID', 'Virus hosts')
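
`AggregateHosts`, `UpdateHosts` and `UpdateMain` are helper functions whose bodies are not shown. The overall pattern of the cell, aggregating external hosts per species and attaching them to viruses that still lack hosts, might look roughly like the sketch below. All frames are toy stand-ins and the logic is an assumption about what the helpers do, not their actual implementation.

```python
import pandas as pd

# Viruses still lacking hosts (toy stand-in for `dfna`)
no_hosts = pd.DataFrame({
    'Species taxonomic ID': [11320, 1888308],
    'Species name': ['Influenza A virus', 'Wabat virus'],
})

# External source after name/ID merging (toy stand-in for `df2`)
external = pd.DataFrame({
    'Species ID': [11320, 11320, 11676],
    'Host name': ['Sus scrofa [TaxID: 9823]', 'Homo sapiens [TaxID: 9606]',
                  'Homo sapiens [TaxID: 9606]'],
})

# One row per virus species with all of its hosts joined
hosts = (external.groupby('Species ID', as_index=False)['Host name']
                 .agg(lambda names: '; '.join(sorted(set(names)))))

merged = (no_hosts.merge(hosts, left_on='Species taxonomic ID',
                         right_on='Species ID', how='left')
                  .drop(columns='Species ID')
                  .rename(columns={'Host name': 'Virus hosts'}))

# Split into rows that gained hosts and rows that are still missing
updated = merged[merged['Virus hosts'].notna()]
still_missing = merged[merged['Virus hosts'].isna()]
```

Rows in `updated` would move to the main frame, while `still_missing` would be carried to the next external source.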

In [67]:
dfna.shape

(6476, 6)

In [68]:
df.shape

(1637, 6)

In [69]:
# Data from virus host database
df2 = pd.read_table('../data/virushostdb.tsv')
df2.head(3)

Unnamed: 0,virus tax id,virus name,virus lineage,refseq id,KEGG GENOME,KEGG DISEASE,DISEASE,host tax id,host name,host lineage,pmid,evidence,sample type,source organism
0,438782,Abaca bunchy top virus,Viruses; Monodnaviria; Shotokuvirae; Cressdnav...,"NC_010314, NC_010315, NC_010316, NC_010317, NC...",,,,46838.0,Musa sp.,Eukaryota; Viridiplantae; Streptophyta; Strept...,17978886.0,"Literature, NCBI Virus, RefSeq",,
1,438782,Abaca bunchy top virus,Viruses; Monodnaviria; Shotokuvirae; Cressdnav...,"NC_010314, NC_010315, NC_010316, NC_010317, NC...",,,,214697.0,Musa acuminata AAA Group,Eukaryota; Viridiplantae; Streptophyta; Strept...,17978886.0,Literature,,
2,1241371,Abalone herpesvirus Victoria/AUS/2009,Viruses; Duplodnaviria; Heunggongvirae; Peplov...,NC_018874,,,,6451.0,Haliotidae,Eukaryota; Opisthokonta; Metazoa; Eumetazoa; B...,,UniProt,,


In [70]:
df2 = df2[['virus tax id', 'virus name', 'host tax id', 'host name']].copy()
df2.drop_duplicates(inplace=True)
print(df2.shape)
df2.head()

(16612, 4)


Unnamed: 0,virus tax id,virus name,host tax id,host name
0,438782,Abaca bunchy top virus,46838.0,Musa sp.
1,438782,Abaca bunchy top virus,214697.0,Musa acuminata AAA Group
2,1241371,Abalone herpesvirus Victoria/AUS/2009,6451.0,Haliotidae
3,1241371,Abalone herpesvirus Victoria/AUS/2009,36100.0,Haliotis rubra
4,491893,Abalone shriveling syndrome-associated virus,37770.0,Haliotis diversicolor aquatilis


In [71]:
df2[df2['host tax id'].isnull()]

Unnamed: 0,virus tax id,virus name,host tax id,host name
1236,2662138,Bacteriophage Phobos,,
3750,1131416,Cucurbit mild mosaic virus,,
15925,1888308,Wabat virus,,


In [72]:
df2.dropna(inplace=True)

In [73]:
df2['host tax id'] = df2['host tax id'].astype(int)
df2.head()

Unnamed: 0,virus tax id,virus name,host tax id,host name
0,438782,Abaca bunchy top virus,46838,Musa sp.
1,438782,Abaca bunchy top virus,214697,Musa acuminata AAA Group
2,1241371,Abalone herpesvirus Victoria/AUS/2009,6451,Haliotidae
3,1241371,Abalone herpesvirus Victoria/AUS/2009,36100,Haliotis rubra
4,491893,Abalone shriveling syndrome-associated virus,37770,Haliotis diversicolor aquatilis


In [74]:
df2['Species ID'] = df2['virus tax id'].progress_apply(getRankID, rank='species')

Processing:   0%|          | 0/16609 [00:00<?, ?it/s]

In [75]:
df2['Host name'] = df2.progress_apply(lambda x: nameMerger(x['host name'], x['host tax id']), axis=1)
# Remove Host and Host ID columns as they have been merged and are no longer needed
df2.drop(['host name', 'host tax id'], axis=1, inplace=True)
df2.head()

Processing:   0%|          | 0/16609 [00:00<?, ?it/s]

Unnamed: 0,virus tax id,virus name,Species ID,Host name
0,438782,Abaca bunchy top virus,438782,Musa sp. [TaxID: 46838]
1,438782,Abaca bunchy top virus,438782,Musa acuminata AAA Group [TaxID: 214697]
2,1241371,Abalone herpesvirus Victoria/AUS/2009,1513231,Haliotidae [TaxID: 6451]
3,1241371,Abalone herpesvirus Victoria/AUS/2009,1513231,Haliotis rubra [TaxID: 36100]
4,491893,Abalone shriveling syndrome-associated virus,491893,Haliotis diversicolor aquatilis [TaxID: 37770]


In [76]:
df_na_hosts = AggregateHosts(df2,'Species ID', 'Host name')
dfna = dfna.merge(df_na_hosts, left_on='Species taxonomic ID', right_on='Species ID', how='left')
dfna = dfna.drop(['Virus hosts', 'Species ID'], axis=1).rename({'Host name':'Virus hosts'}, axis=1)
dfna = UpdateHosts(dfna, df_na_hosts, 'Species taxonomic ID', 'Species ID')
df, dfna = UpdateMain(df, dfna)
df = mergeRows(df, 'Species taxonomic ID', 'Virus hosts')

In [77]:
df.shape

(4760, 6)

In [78]:
dfna.shape

(3353, 6)

In [79]:
# Data from EID2 (Liverpool University)
df2 = pd.read_csv('../data/virus_host_4rm_untitled.csv')
df2.sample(2)

Unnamed: 0,Host_name,Host_TaxId,Host Group,Virus_name,Virus_TaxId,Micobe_group,Host_common_name,Host_common_name_rev
34826,homo sapiens,9606,primates,influenza a virus (a/yamanashi/1/2004(h3n2)),515116,viruses,Human,Human
4029,sus scrofa,9823,mammals,influenza a virus (a/swine/tennessee/49/1977(h...,437351,viruses,Wild boar,Pig


In [80]:
df2 = df2[['Host_name', 'Host_TaxId', 'Virus_name', 'Virus_TaxId']].copy()
df2['Species ID'] = df2['Virus_TaxId'].progress_apply(getRankID, rank='species')
df2['Host name'] = df2.progress_apply(lambda x: nameMerger(x['Host_name'], x['Host_TaxId']), axis=1)
df2.drop(['Host_name', 'Host_TaxId'], axis=1, inplace=True)
df2.dropna(inplace=True)
df2.sample(2)

Processing:   0%|          | 0/59859 [00:00<?, ?it/s]

878474 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found


Processing:   0%|          | 0/59859 [00:00<?, ?it/s]

Unnamed: 0,Virus_name,Virus_TaxId,Species ID,Host name
33639,norovirus hu/p7-3/2001/swe,534580,11983.0,homo sapiens [TaxID: 9606]
38905,influenza a virus (a/germany/af1011/2007(h3n2)),452873,11320.0,homo sapiens [TaxID: 9606]


In [81]:
df_na_hosts = AggregateHosts(df2,'Species ID', 'Host name')
dfna = dfna.merge(df_na_hosts, left_on='Species taxonomic ID', right_on='Species ID', how='left')
dfna = dfna.drop(['Virus hosts', 'Species ID'], axis=1).rename({'Host name':'Virus hosts'}, axis=1)
dfna = UpdateHosts(dfna, df_na_hosts, 'Species taxonomic ID', 'Species ID')
df, dfna = UpdateMain(df, dfna)
df = mergeRows(df, 'Species taxonomic ID', 'Virus hosts')

In [82]:
df.shape

(4766, 6)

In [83]:
dfna.shape

(3347, 6)

In [84]:
dfna.sample(2)

Unnamed: 0,Species taxonomic ID,Protein names,Species name,Species superkingdom,Species family,Virus hosts
1622,2479933,Portal protein (gp20); Tail sheath protein,Escherichia phage p000v,Viruses,Myoviridae,
859,2015841,Integrase,Mycobacterium phage Appletree2,Viruses,Siphoviridae,


## Further Processing

In [85]:
# Add column to discriminate viruses which contain human hosts from those which do not
df['Infects human'] = np.where(df['Virus hosts'].str.contains(r'960[56]'), 'human-true','human-false')
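
The pattern `960[56]` matches taxid 9605 (genus Homo) or 9606 (Homo sapiens), but it would also match any longer number that happens to contain those digits. A stricter check, assuming every host is written in the `[TaxID: N]` layout, is sketched below; `infects_human` is a hypothetical name and the third test string uses a made-up ID.

```python
import re

# Match 9605 or 9606 only when it is the complete TaxID
HUMAN_TAXID = re.compile(r'\[TaxID: 960[56]\]')


def infects_human(hosts):
    """Return True when the host string lists a human TaxID."""
    return bool(HUMAN_TAXID.search(hosts))


flags = [infects_human(h) for h in (
    'Homo sapiens (Human) [TaxID: 9606]',
    'Aves [TaxID: 8782]; Homo sapiens [TaxID: 9606]',
    'Some host [TaxID: 196054]',  # made-up ID that merely contains "9605"
)]
```

In pandas the same pattern could be passed to `df['Virus hosts'].str.contains(...)`.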

In [86]:
df.sample(2)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human
1427,1093958,Sida rhombifolia [TaxID: 108377],Capsid protein (Coat protein),Sida yellow mottle virus,Viruses,Geminiviridae,human-false
2555,1914162,Shigella flexneri [TaxID: 623],Portal protein (gp20); Protein Gp38 (Receptor-...,Shigella virus UTAM,Viruses,Myoviridae,human-false


In [87]:
df['Virus hosts'] = df['Virus hosts'].str.split('; ')
df['Virus hosts'] = df.progress_apply(lambda x: list(filter(None, x['Virus hosts'])), axis=1)
df['Virus hosts'] = df['Virus hosts'].progress_apply('; '.join)

Processing:   0%|          | 0/4766 [00:00<?, ?it/s]

Processing:   0%|          | 0/4766 [00:00<?, ?it/s]

In [88]:
df.sample(4)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human
973,439427,Solanum lycopersicum [TaxID: 4081],Capsid protein (Coat protein),Tomato leaf curl Toliara virus,Viruses,Geminiviridae,human-false
1091,588068,Vibrio cholerae [TaxID: 666],Portal protein (Head-to-tail connector),Vibrio phage VP3,Viruses,Autographiviridae,human-false
4375,2733602,Klebsiella pneumoniae [TaxID: 573],Head-tail connector protein,Klebsiella virus KpS2,Viruses,Autographiviridae,human-false
1145,665887,Lactococcus lactis [TaxID: 1358],Putative receptor binding protein; Putative po...,Lactococcus virus CB20,Viruses,Siphoviridae,human-false


In [89]:
df[df['Infects human'] == 'human-true'].sample(4)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human
130,11041,Homo sapiens (Human) [TaxID: 9606],Structural polyprotein (p110) [Cleaved into: C...,Rubella virus,Viruses,Matonaviridae,human-true
1193,694448,Homo sapiens [TaxID: 9606],Spike glycprotein,unidentified human coronavirus,Viruses,Coronaviridae,human-true
2020,1513258,Homo sapiens [TaxID: 9606]; Tscherskia triton ...,Minor capsid protein L2; Major capsid protein L1,Gammapapillomavirus 13,Viruses,Papillomaviridae,human-true
531,138950,Homo sapiens (Human) [TaxID: 9606],Protein 3CD (EC 3.4.22.28); Protein 3A (P3A); ...,Enterovirus C,Viruses,Picornaviridae,human-true


<a id="host-name-consistency"></a>

In [90]:
# Ungrouping operation based on host
# 1. Split 'Virus hosts' on the ';' separator
# 2. Stack the split values vertically so that each host gets its own row
df = (df.set_index(df.columns.drop('Virus hosts').tolist())['Virus hosts'].str.split(';', expand=True)
          .stack()
          .reset_index()
          .rename(columns={0:'Virus hosts'})
          .loc[:, df.columns]
         ).copy()
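The same one-row-per-host reshaping can also be sketched with `DataFrame.explode` (available in pandas 0.25 and later). This is an alternative illustration on toy data, not the notebook's method; the frame `toy` and the result `long` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Species name': ['Virus A', 'Virus B'],
    'Virus hosts': ['Homo sapiens [TaxID: 9606]; Mus musculus [TaxID: 10090]',
                    'Escherichia coli [TaxID: 562]'],
})

# Split each host string into a list, then emit one row per list element.
# Splitting on '; ' (with the space) avoids leading whitespace in the hosts.
long = (toy.assign(**{'Virus hosts': toy['Virus hosts'].str.split('; ')})
           .explode('Virus hosts')
           .reset_index(drop=True))
print(long)
```

All other columns are repeated for each host, which is the same effect as the `set_index` / `stack` / `reset_index` chain.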

In [91]:
df.shape

(7270, 7)

In [92]:
df.sample(4)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human
2838,1221637,Taphozous melanopogon [TaxID: 187003],Minor capsid protein VP2 (Minor structural pro...,Bat polyomavirus,Viruses,Polyomaviridae,human-false
5505,2137545,Calomys tener [TaxID: 162310],Capsid protein VP1 (Coat protein VP1); Capsid ...,Rodent protoparvovirus 3,Viruses,Parvoviridae,human-false
950,59563,Molothrus bonariensis (Shiny cowbird) (Tanagr...,Core protein (EC 3.4.21.91) (EC 3.6.1.15) (EC ...,Ilheus virus,Viruses,Flaviviridae,human-true
663,29252,Escherichia coli [TaxID: 562],Integrase (EC 2.7.7.-) (EC 3.1.-.-); Tail shea...,Escherichia virus 186,Viruses,Myoviridae,human-false


In [93]:
df['Virus hosts ID'] = None
idx_organism = df.columns.get_loc('Virus hosts')
idx_host_id = df.columns.get_loc('Virus hosts ID')

pattern = r'(\d+)\]'
for row in range(len(df)):
    host_id = re.search(pattern, df.iat[row, idx_organism]).group()
    df.iat[row, idx_host_id] = host_id
df.head()

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID
0,10243,Mus musculus (Mouse) [TaxID: 10090],CPXV098 protein (Poxvirus myristoylprotein); C...,Cowpox virus,Viruses,Poxviridae,human-true,10090]
1,10243,Homo sapiens (Human) [TaxID: 9606],CPXV098 protein (Poxvirus myristoylprotein); C...,Cowpox virus,Viruses,Poxviridae,human-true,9606]
2,10243,Felis catus (Cat) (Felis silvestris catus) [T...,CPXV098 protein (Poxvirus myristoylprotein); C...,Cowpox virus,Viruses,Poxviridae,human-true,9685]
3,10243,Microtus agrestis (Short-tailed field vole) [...,CPXV098 protein (Poxvirus myristoylprotein); C...,Cowpox virus,Viruses,Poxviridae,human-true,29092]
4,10243,Myodes glareolus (Bank vole) (Clethrionomys g...,CPXV098 protein (Poxvirus myristoylprotein); C...,Cowpox virus,Viruses,Poxviridae,human-true,447135]
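The loop above stores the whole match, which is why the IDs still carry a trailing `]` at this point. A vectorised sketch using a capture group is shown below, assuming every host string contains a `[TaxID: N]` tag. The series `hosts` and `host_ids` are toy names used only for illustration.

```python
import pandas as pd

hosts = pd.Series([
    'Mus musculus (Mouse) [TaxID: 10090]',
    'Homo sapiens (Human) [TaxID: 9606]',
])

# The capture group keeps only the digits, so no bracket needs stripping later.
host_ids = hosts.str.extract(r'TaxID: (\d+)\]', expand=False).astype(int)
print(host_ids.tolist())
```

Rows without a tag would come back as missing values from `str.extract`, and the `astype(int)` call would then fail, so such rows would need handling first.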


In [94]:
df['Virus hosts ID'] = df['Virus hosts ID'].str.strip(']')

In [95]:
df['Virus hosts ID'] = df['Virus hosts ID'].progress_apply(int)

df['Virus hosts ID'] = df['Virus hosts ID'].progress_apply(getRankID, rank='species')
df['Virus host name'] = df['Virus hosts ID'].progress_apply(getRankName, rank='species')
df['Host superkingdom'] = df['Virus hosts ID'].progress_apply(getRankName, rank='superkingdom')
df['Host kingdom'] = df['Virus hosts ID'].progress_apply(getRankName, rank='kingdom')

Processing:   0%|          | 0/7270 [00:00<?, ?it/s]

Processing:   0%|          | 0/7270 [00:00<?, ?it/s]

Processing:   0%|          | 0/7270 [00:00<?, ?it/s]

Processing:   0%|          | 0/7270 [00:00<?, ?it/s]

Processing:   0%|          | 0/7270 [00:00<?, ?it/s]

In [96]:
df[df['Virus hosts ID'].isna()]

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom


In [97]:
df['Virus hosts ID'][1866]

274

In [98]:
df['Virus hosts ID'] = df['Virus hosts ID'].progress_apply(int)

Processing:   0%|          | 0/7270 [00:00<?, ?it/s]

In [99]:
df['Virus hosts'] = (df.drop('Virus hosts', axis=1)
                     .apply(lambda x: nameMerger(x['Virus host name'], x['Virus hosts ID']), axis=1))

In [100]:
df.sample(4)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom
2983,1281454,Equus ferus [TaxID: 1114792],Envelope glycoprotein E1 (EC 2.7.7.48) (EC 3.6...,Rodent hepacivirus,Viruses,Flaviviridae,human-false,1114792,Equus ferus,Eukaryota,Metazoa
206,10820,Avena sativa [TaxID: 4498],Capsid protein (Coat protein); Capsid protein ...,Chloris striate mosaic virus,Viruses,Geminiviridae,human-false,4498,Avena sativa,Eukaryota,Viridiplantae
6684,2721749,Callosciurus prevostii [TaxID: 64676],Minor capsid protein VP2 (Minor structural pro...,Callosciurus prevostii polyomavirus 1,Viruses,Polyomaviridae,human-false,64676,Callosciurus prevostii,Eukaryota,Metazoa
6076,2560298,Arthrobacter sp. ATCC 21022 [TaxID: 1771959],Portal protein,Arthrobacter virus Abidatro,Viruses,Siphoviridae,human-false,1771959,Arthrobacter sp. ATCC 21022,Bacteria,Arthrobacter sp. ATCC 21022


In [101]:
df.shape

(7270, 11)

In [102]:
# Ungroup based on protein names: one row per protein name
df = (df.set_index(df.columns.drop('Protein names').tolist())['Protein names'].str.split(';', expand=True)
          .stack()
          .reset_index()
          .rename(columns={0:'Protein names'})
          .loc[:, df.columns]
         ).copy()

In [103]:
df[df['Host superkingdom'].isnull()].shape

(0, 11)

In [104]:
df['Host superkingdom'].unique()

array(['Eukaryota', 'Bacteria', 'Viruses', 'root', 'Archaea'],
      dtype=object)

In [105]:
df[df['Host superkingdom'] == 'Eukaryota'].shape

(18376, 11)

In [106]:
df[df['Host superkingdom'] == 'Viruses'].shape

(4, 11)

In [107]:
df[df['Host superkingdom'] == 'Bacteria'].shape

(4099, 11)

In [108]:
df[df['Host superkingdom'] == 'root'].shape

(38, 11)

In [109]:
df[df['Host superkingdom'] == 'Archaea'].shape

(14, 11)

In [110]:
print(df[df['Host kingdom'] == 'Metazoa'].shape)
df[df['Host kingdom'] == 'Metazoa'].sample(3)

(17312, 11)


Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom
14683,1508220,Rousettus madagascariensis [TaxID: 77223],Spike protein,Bat coronavirus,Viruses,Coronaviridae,human-false,77223,Rousettus madagascariensis,Eukaryota,Metazoa
20433,2479483,Tupaia belangeri [TaxID: 37347],Glycoprotein G1 (GP1),Rat mammarenavirus,Viruses,Arenaviridae,human-false,37347,Tupaia belangeri,Eukaryota,Metazoa
5635,12637,Aedes taylori [TaxID: 299628],Serine protease subunit NS2B (Flavivirin prot...,Dengue virus,Viruses,Flaviviridae,human-true,299628,Aedes taylori,Eukaryota,Metazoa


In [111]:
df[df['Infects human'] == 'human-true'].shape

(8457, 11)

In [112]:
df[df['Infects human'] == 'human-false'].shape

(14074, 11)

In [113]:
df.sample(2)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom
6834,37124,Aedes polynesiensis [TaxID: 188700],Frameshifted structural polyprotein (p130) [C...,Chikungunya virus,Viruses,Togaviridae,human-true,188700,Aedes polynesiensis,Eukaryota,Metazoa
15060,1511906,Felidae [TaxID: 9681],Capsid protein VP2 (Structural protein VP2),Carnivore protoparvovirus 1,Viruses,Parvoviridae,human-false,9681,Felidae,Eukaryota,Metazoa


In [114]:
df.sample(2)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom
703,10245,Homo sapiens [TaxID: 9606],Uncharacterized protein,Vaccinia virus,Viruses,Poxviridae,human-true,9606,Homo sapiens,Eukaryota,Metazoa
533,10244,Cynomys leucurus [TaxID: 99825],A17L (MPXV-COP-122) (MPXV-SL-122) (MPXV-WRAIR...,Monkeypox virus,Viruses,Poxviridae,human-true,99825,Cynomys leucurus,Eukaryota,Metazoa


<a id="issue"></a>

In [115]:
# Unresolved: 'Virus host name' and 'Virus hosts' do not have the same number of unique values (see counts below)
for column in df.columns:
    print(column, df[column].nunique())
print('Dataframe total',len(df))

Species taxonomic ID 4766
Virus hosts 1765
Protein names 2062
Species name 4766
Species superkingdom 1
Species family 80
Infects human 2
Virus hosts ID 1765
Virus host name 1756
Host superkingdom 5
Host kingdom 344
Dataframe total 22531


In [116]:
df.sample(2)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom
827,10245,Equus caballus [TaxID: 9796],Protein H2,Vaccinia virus,Viruses,Poxviridae,human-true,9796,Equus caballus,Eukaryota,Metazoa
14106,1354514,Mycolicibacterium smegmatis [TaxID: 1772],Integrase,Mycobacterium phage Quink,Viruses,Siphoviridae,human-false,1772,Mycolicibacterium smegmatis,Bacteria,Mycolicibacterium smegmatis


## Restructuring the data

In [117]:
# Earlier saved data
dff.sample(2)

Unnamed: 0,Entry,Species taxonomic ID
15884,A0A1W5X582,11520.0
132531,A0A0X8EDJ8,11320.0


In [118]:
dff.shape

(358333, 2)

In [119]:
## Load sequences
# Using custom IO instead of Bio.SeqIO because it was much easier to customise
# Not as efficient but still light on resources

<a id='fasta'></a>

In [120]:
fastaFileName = '../data/uniprot-keyword Virus+entry+into+host+cell+[KW-1160] +fragment no.fasta'

entry_seq = read_fasta(fastaFileName) # read_fasta in zoonosis_helper_functions.py

In [None]:
dff.sort_values(by='Entry', inplace=True)

seq_object_list = [seq_obj for entry, seq_obj in entry_seq]

dff['Sequence'] = seq_object_list
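Assigning the list positionally relies on the FASTA file and the sorted `dff` containing exactly the same entries in the same order. A more defensive sketch maps sequences by entry identifier instead. It assumes only that `read_fasta` yields `(entry, sequence object)` pairs, as the list comprehension above suggests. Plain strings stand in for the `FASTASeq` objects, and `entry_seq_toy`, `dff_toy` and `seq_by_entry` are hypothetical names.

```python
import pandas as pd

# Stand-ins for the (entry, sequence object) pairs, deliberately out of order.
entry_seq_toy = [('B2', 'MKV'), ('A1', 'MSL')]
dff_toy = pd.DataFrame({'Entry': ['A1', 'B2', 'C3'],
                        'Species taxonomic ID': [11520.0, 11320.0, 470.0]})

# Look each entry up by its identifier; entries without a sequence become NaN.
seq_by_entry = dict(entry_seq_toy)
dff_toy['Sequence'] = dff_toy['Entry'].map(seq_by_entry)
print(dff_toy)
```

With this approach a mismatch shows up as missing values rather than as silently misaligned sequences.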

In [None]:
dff.head()

Unnamed: 0,Entry,Species taxonomic ID,Sequence
50368,A0A009FEK4,470.0,<zoonosis_helper_functions.FASTASeq object at ...
156673,A0A009G3H3,1310609.0,<zoonosis_helper_functions.FASTASeq object at ...
146717,A0A009GC36,470.0,<zoonosis_helper_functions.FASTASeq object at ...
146730,A0A009GCG0,470.0,<zoonosis_helper_functions.FASTASeq object at ...
144753,A0A009GXT7,1310609.0,<zoonosis_helper_functions.FASTASeq object at ...


In [None]:
df.drop(['Virus host name', 'Protein names', 'Species superkingdom'], axis=1, inplace=True)

In [None]:
df = df.merge(dff, on='Species taxonomic ID', how='left')

In [None]:
del dff, df2

In [None]:
df.shape

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

In [None]:
df['Virus hosts ID'] = df['Virus hosts ID'].apply(str)

# Group by Entry and aggregate using set function to avoid duplication
df = (df.groupby('Entry', as_index=False)
       .agg({'Virus hosts':set, #'Protein':'first', 
             'Infects human':'first', 'Species name':'first',
             'Host superkingdom':set,
             'Host kingdom':set,
             'Virus hosts ID':set,
             'Species family':'first',
             'Species taxonomic ID':'first',
             'Sequence': 'first'}))

df['Virus hosts'] = (df['Virus hosts']
                     .swifter.progress_bar(enable=True,
                                           desc='Joining host names list')
                     .apply('; '.join))
df['Virus hosts ID'] = (df['Virus hosts ID']
                        .swifter.progress_bar(enable=True,
                                              desc='Joining host IDs')
                        .apply('; '.join))
df['Host kingdom'] = (df['Host kingdom']
                      .swifter.progress_bar(enable=True,
                                            desc='Joining host kingdom names')
                      .apply('; '.join))
df['Host superkingdom'] = (df['Host superkingdom']
                           .swifter.progress_bar(enable=True,
                                                 desc='Joining host superkingdom names')
                           .apply('; '.join))
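The aggregation above collects multi-valued columns into sets and then joins them into strings. A compact sketch of the same idea on toy data is given below, doing the join inside `agg` and sorting the set so that the output order is reproducible. The frame `toy` and the result `grouped` are illustrative names, and sorting is an addition not present in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'Entry': ['P1', 'P1', 'P2'],
    'Virus hosts': ['Mus musculus [TaxID: 10090]',
                    'Homo sapiens [TaxID: 9606]',
                    'Escherichia coli [TaxID: 562]'],
    'Infects human': ['human-true', 'human-true', 'human-false'],
})

# One row per Entry: deduplicate hosts with set, sort, and join into one string.
grouped = (toy.groupby('Entry', as_index=False)
              .agg({'Virus hosts': lambda s: '; '.join(sorted(set(s))),
                    'Infects human': 'first'}))
print(grouped)
```

Python sets have no guaranteed iteration order, so joining an unsorted set can produce differently ordered host strings between runs.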

In [None]:
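# Group by Entry and aggregate using set function to avoid duplication
# Note: this variant aggregates every column with set; '; '.join only works on
# sets of strings, so non-string columns (e.g. 'Species taxonomic ID', 'Sequence')
# would need converting to str first
df = (df.groupby('Entry', as_index=False)
       .agg({'Virus hosts':set, #'Protein':set, 
             'Infects human':set, 'Species name':set,
             'Host superkingdom':set,
             'Host kingdom':set,
             'Virus hosts ID':set,
             'Species family':set,
             'Species taxonomic ID':set,
             'Sequence': set}))

# Positional slicing of a DataFrame requires .iloc
df.iloc[:, 1:] = df.iloc[:, 1:].swifter.applymap('; '.join)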

In [None]:
df.shape

In [None]:
# Get additional sequence info from the dataset
df['Sequence'] = df.progress_apply(lambda x: getSequenceFeatures(
    seqObj=x['Sequence'], entry=x['Entry'],
    organism=x['Species name'], status=x['Infects human']), axis=1)

<a id="protein-names-from-sequence"></a>

In [None]:
df['Protein'] = df['Sequence'].apply(lambda x: x.protein_name)

In [None]:
df.sample(3)

In [None]:
df[df['Infects human'] == 'human-true'].shape

In [None]:
df[df['Infects human'] == 'human-false'].shape

In [None]:
# Sequences loaded earlier from NCBI Virus; used here to add the molecule type
dfff.rename({'Species ID': 'Species taxonomic ID', 'Molecule_type': 'Molecule type'}, axis=1, inplace=True)
dfff.head()

In [None]:
df['Species taxonomic ID'] = df['Species taxonomic ID'].apply(int)

In [None]:
df = df.merge(dfff[['Species taxonomic ID', 'Molecule type']], how='left', on='Species taxonomic ID')

In [None]:
df.shape

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

In [None]:
del dfff

## Reorganise dataframe

In [None]:
df = df[['Entry', 'Protein', 'Species name', 
         'Species taxonomic ID', 'Species family', 'Virus hosts',
         'Virus hosts ID', 'Host kingdom', 
         'Host superkingdom', 'Molecule type', 'Infects human', 'Sequence']]

In [None]:
df.sample(3)

## Split Dataframe to multiple datasets

In [None]:
df['Host superkingdom'].unique()

In [None]:
df['Host kingdom'].unique()

In [None]:
df[(df['Host kingdom'].str.contains('Viridiplantae')) | df['Virus hosts'].str.contains('[Hh]omo [Ss]apiens')].shape

In [None]:
df['Molecule type'] = np.where(df['Molecule type'].isna(), '', df['Molecule type'])

In [None]:
df[df['Molecule type'].isna()]

In [None]:
df[df['Host kingdom'].str.contains('Metazoa') & df['Molecule type'].str.contains('DNA')].shape

In [None]:
df[df['Host kingdom'].str.contains('Metazoa') & df['Molecule type'].str.contains('RNA')].shape

In [None]:
df.shape

In [None]:
df[~df['Host kingdom'].str.contains('Metazoa')].shape

In [None]:
df[(df['Host superkingdom'].isin(['Bacteria', 'Viruses', 'Archaea'])) | (df['Virus hosts'].str.contains('[Hh]omo [Ss]apiens'))].shape

In [None]:
unfiltered = df
metazoa = df[df['Host kingdom'].str.contains('Metazoa')]
plant_human = df[(df['Host kingdom'].str.contains('Viridiplantae')) | df['Virus hosts'].str.contains('[Hh]omo [Ss]apiens')]
NonEukaryote_Human = df[(df['Host superkingdom'].isin(['Bacteria', 'Viruses', 'Archaea'])) | (df['Virus hosts'].str.contains('[Hh]omo [Ss]apiens'))]
DNA_MetazoaZoonosis = metazoa[metazoa['Molecule type'].str.contains('DNA')]
RNA_MetazoaZoonosis = metazoa[metazoa['Molecule type'].str.contains('RNA')]

In [None]:
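def check_dist(df):
    """Print the class counts of 'Infects human' and the minority-to-majority ratio."""
    true_count = df[df['Infects human'].str.contains('true')].shape[0]
    false_count = df[df['Infects human'].str.contains('false')].shape[0]
    # Ratio of the smaller class to the larger one, whichever label that is
    imb = min(true_count, false_count) / max(true_count, false_count)
    print('The minority class is %.2f of the majority\nhuman-true == %d and human-false == %d\n' % (imb, true_count, false_count))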

In [None]:
dataframes = [metazoa, unfiltered, plant_human, NonEukaryote_Human, DNA_MetazoaZoonosis, RNA_MetazoaZoonosis]
for dt in dataframes:
    check_dist(dt)

## Random Undersampling of datasets

In [None]:
seed = 960505

In [None]:
# Undersample the majority class (human-true) so that the minority class (human-false) is 60% of it
rus = RandomUnderSampler(sampling_strategy=0.6, random_state=seed)
sampled_dataframes = []
for dt in dataframes:
    clas = dt['Infects human']
#     print('Dataframe before sampling: ', dt.shape[0])
    dt, _ = rus.fit_resample(dt, clas)
    sampled_dataframes.append(dt)
    check_dist(dt)
#     print('Dataframe after sampling: ', dt.shape[0])
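To make the meaning of `sampling_strategy=0.6` concrete, the sketch below reproduces the idea with pandas alone: the majority class is randomly reduced until the minority class is about 60% of it. It is not imbalanced-learn's implementation, and the function `undersample` and the toy frame are hypothetical.

```python
import pandas as pd

def undersample(frame, label_col, ratio, seed):
    """Randomly drop majority-class rows so that minority / majority is about `ratio`."""
    counts = frame[label_col].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    # Number of majority rows to keep, never more than are available.
    n_keep = min(int(round(counts[minority] / ratio)), int(counts[majority]))
    kept = frame[frame[label_col] == majority].sample(n=n_keep, random_state=seed)
    return pd.concat([frame[frame[label_col] == minority], kept])

toy = pd.DataFrame({'Infects human': ['human-true'] * 100 + ['human-false'] * 6,
                    'x': range(106)})
balanced = undersample(toy, 'Infects human', ratio=0.6, seed=960505)
print(balanced['Infects human'].value_counts().to_dict())
```

In the toy frame the 6 minority rows are all kept and 10 of the 100 majority rows are sampled, giving the 0.6 ratio.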

## Write file sequences to fasta for feature extraction

In [181]:
metazoaFile = 'MetazoaZoonosis'
plant_humanFile = 'Plant-HumanZoonosis'
unfilteredFile = 'Zoonosis'
NonEukaryote_HumanFile = 'NonEukaryote-Human'
DNA_metazoaFile = 'DNA-MetazoaZoonosis'
RNA_metazoaFile = 'RNA-MetazoaZoonosis'

In [182]:
dirs = ['MetazoaZoonosisData', 'ZoonosisData',
        'Plant-HumanZoonosisData', 'NonEukaryote-HumanData',
        'DNA-MetazoaZoonosisData', 'RNA-MetazoaZoonosisData']
dirs = [os.path.join('../data/', fol) for fol in dirs] # Do not include in script
files = [metazoaFile, unfilteredFile, plant_humanFile, NonEukaryote_HumanFile, DNA_metazoaFile, RNA_metazoaFile]
toSave = list(zip(sampled_dataframes, files, dirs))

<a id="splits"></a>

In [185]:
for dff, file, folder in toSave:
#    Create subdirectories first (this also creates `folder` itself if it is missing)
    os.makedirs(os.path.join(folder, 'train/human-true'), exist_ok=True)
    os.makedirs(os.path.join(folder, 'test/human-true'), exist_ok=True)
    os.makedirs(os.path.join(folder, 'train/human-false'), exist_ok=True)
    os.makedirs(os.path.join(folder, 'test/human-false'), exist_ok=True)
#    Save dataframes as csv
    dff.drop('Sequence', axis=1).to_csv(f'{folder}/{file}Data.csv.gz', index=False, compression='gzip')
    
#    Split data to train and test data
#    random_state was left blank in the original cell; reusing `seed` from the undersampling step is an assumption
    train, test = train_test_split(dff, test_size=0.2, random_state=seed) # Will further split 15% of train as validation during training
#    Save test and train sequences
    save_sequences(train, f'{folder}/train/Sequences') # Will move to subdirectories after feature extraction
    save_sequences(test, f'{folder}/test/Sequences')
    
    print('Done with', folder)

Done with ../data/MetazoaZoonosisData
Done with ../data/ZoonosisData
Done with ../data/Plant-HumanZoonosisData
Done with ../data/NonEukaryote-HumanData
Done with ../data/DNA-MetazoaZoonosisData
Done with ../data/RNA-MetazoaZoonosisData
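The split above does not pass a `stratify` argument, so the class proportions in the train and test sets can drift from those of the sampled data. One way to keep them aligned is sketched below with pandas only (`GroupBy.sample` needs pandas 1.1 or later). The toy frame and the names `train` and `test` are illustrative. With scikit-learn the comparable option would be the `stratify` parameter of `train_test_split`.

```python
import pandas as pd

toy = pd.DataFrame({'Infects human': ['human-true'] * 10 + ['human-false'] * 5,
                    'Entry': [f'E{i}' for i in range(15)]})

# Draw 20% of each class for the test set and keep the remainder for training.
test = toy.groupby('Infects human').sample(frac=0.2, random_state=960505)
train = toy.drop(test.index)
print(len(train), len(test))
```

Each class contributes the same fraction of its rows to the test set, so the ratio between classes is preserved in both parts.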
