# Description

This notebook makes use of the **pandas** library and the **ete3 toolkit**, specifically ete3's [NCBI Taxonomic database](http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html). Several custom functions simplifying the use of ete3 are in the zoonosis_helper_functions.py script which is imported into the notebook.


The primary data comes from the UniProt database and covers proteins that facilitate viral [entry into host cells](https://www.uniprot.org/uniprot/?query=keyword:%22Virus%20entry%20into%20host%20cell%20[KW-1160]%22). The data is in two parts: the [tabular data](#tabular-data) and the [FASTA sequences](#fasta) of the virus surface proteins, linked by their UniProt entry identifiers. Only a very small portion of the data has been reviewed, which is not enough for deep learning, so both the reviewed and unreviewed data are kept. Much of the unreviewed data, however, lacks host information.

To fill in the missing host information, external sources were used, namely:

- [NCBI Virus database](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?VirusLineage_ss=Viruses,%20taxid:10239&SeqType_s=Protein&Proviral_s=exclude&HostLineage_ss=Mammalia%20(mammals),%20taxid:40674)
- [Virus-Host database](https://www.genome.jp/virushostdb/)
- [Enhanced Infectious Disease Database (EID2)](https://eid2.liverpool.ac.uk/OrganismInteractions)

1. [ete3 is first used to obtain taxonomic identifiers at the species level. Where an identifier is already present, the ete3 identifier is still used for consistency.](#ete3-taxo)

1. [Data is further filtered to keep only viruses (organism).](#filter)

1. [The dataset contains some repeated information, i.e. the same virus and the same hosts but a different protein or protein entry. The next step therefore fills in missing host data using information from the reviewed data, on the premise that the same virus ought to have the same hosts.](#Updating-host-names-from-other-host-data-in-the-dataset)

1. [Thereafter, information from external sources is appended to the primary data. Rows still missing data after final processing are dropped.](#Updating-host-names-from-external-sources)

1. [An additional column (Infects human) is then added. If at least one of the virus's hosts is Homo sapiens, the row is assigned **1**; otherwise it is assigned **0**.](#Further-Processing)

1. [Since the data was obtained from multiple sources further processing was done to make the information format consistent.](#host-name-consistency)

1. [The sequence data is loaded and linked to the tabular data.](#fasta)

1. [Protein names are also updated from sequence data for consistency in the data.](#protein-names-from-sequence)

1. [After processing, the data is then split into training and testing data. Validation split is done upon loading the training data with **keras**.](#splits)

1. [random undersampling](#Random-Undersampling-of-datasets)

1. [Write file sequences to fasta for feature extraction](#Write-file-sequences-to-fasta-for-feature-extraction)
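
The random undersampling step listed above can be illustrated on a toy table. This is only a sketch: it uses plain pandas as a stand-in for `RandomUnderSampler`, and the column values and `random_state` are made up for the example.

```python
import pandas as pd

# Toy, imbalanced dataset: 3 positive rows and 7 negative rows (made-up values).
toy = pd.DataFrame({
    "Sequence": [f"SEQ{i}" for i in range(10)],
    "Infects human": [1] * 3 + [0] * 7,
})

# Random undersampling: shrink every class to the size of the smallest class.
n_min = toy["Infects human"].value_counts().min()
balanced = toy.groupby("Infects human").sample(n=n_min, random_state=0)

print(balanced["Infects human"].value_counts().to_dict())
```

Undersampling discards rows from the majority class, so it is usually applied to the training split only, leaving the test split untouched.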


### [Absolutely no idea why Virus host name != Virus hosts](#issue)

## Packages

In [1]:
# Import all necessary modules
## Always import pandas before swifter ##
import pandas as pd
import swifter # enables pandas multiprocessing using modin and ray as a backend. Also adds progress bar functionality
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# import seaborn as sns
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
import re
import os
from ete3 import NCBITaxa
from pprint import pprint
from tqdm.notebook import tqdm_notebook, tqdm
# import warnings
# warnings.filterwarnings("ignore", category=UserWarning)
from zoonosis_helper_functions import *

Please check the zoonosis_helper_functions.py in the current directory

In [2]:
# Configure Progress bar and Modin Pandas Engine

tqdm.pandas(desc='Processing')
os.environ["MODIN_ENGINE"] = "ray"

## Data exploration

<a id='tabular-data'></a>

In [3]:
# Load dataset downloaded from Uniprot
df = pd.read_table('../data/uniprot-keyword Virus+entry+into+host+cell+[KW-1160] +fragment no.tab.gz')

In [4]:
df.shape

(358333, 9)

In [5]:
df.sample(3)

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts
319028,A0A1P8P674,A0A1P8P674_9HIV1,unreviewed,Envelope glycoprotein gp160 (Env polyprotein) ...,Human immunodeficiency virus 1,867,11676,Human immunodeficiency virus 1,Homo sapiens (Human) [TaxID: 9606]
329090,A0A3G5NAL6,A0A3G5NAL6_9HIV1,unreviewed,Envelope glycoprotein gp160 (Env polyprotein) ...,Human immunodeficiency virus 1,847,11676,Human immunodeficiency virus 1,Homo sapiens (Human) [TaxID: 9606]
303443,A0A0R6CY34,A0A0R6CY34_9HIV1,unreviewed,Envelope glycoprotein gp160 (Env polyprotein) ...,Human immunodeficiency virus 1,858,11676,Human immunodeficiency virus 1,Homo sapiens (Human) [TaxID: 9606]


In [6]:
# Check for number of rows with missing host names
print(df[df['Virus hosts'].isnull()].shape)
df[df['Virus hosts'].isnull()].sample(3)

(237573, 9)


Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts
158491,A0A172S3R6,A0A172S3R6_9INFB,unreviewed,Nucleoprotein (Nucleocapsid protein) (Protein N),Influenza B virus (B/California/34/2014),560,1824147,Influenza B virus,
182636,A1E2T2,A1E2T2_9GEMI,unreviewed,Capsid protein (Coat protein),Tomato yellow leaf curl virus - [Tunisia],258,413710,Tomato yellow leaf curl virus,
188270,X2DNY3,X2DNY3_9INFA,unreviewed,Nucleoprotein (Nucleocapsid protein) (Protein N),Influenza A virus (A/mallard/Sweden/80166/2008...,498,1455511,Influenza A virus,


In [7]:
# Total number of different organisms in dataset (inclusive of reviewed and non-reviewed)
df['Organism'].nunique()

100216

In [8]:
# Total number of different organisms with reviewed data
df[df['Status'] == 'reviewed']['Organism'].nunique()

1518

In [9]:
df[df['Status'] == 'unreviewed']['Organism'].nunique()

99095

In [10]:
# Number of distinct 'Virus hosts' values among reviewed entries
df[df['Status'] == 'reviewed']['Virus hosts'].nunique()

321

In [11]:
df[df['Status'] == 'unreviewed']['Virus hosts'].nunique()

200

In [12]:
# Number of distinct 'Virus hosts' values overall (inclusive of reviewed and non-reviewed)
df['Virus hosts'].nunique()

373

In [13]:
# Check that no organism taxonomy data is missing: the number of distinct tax IDs should equal the number of distinct organisms
df['Taxonomic lineage IDs'].nunique()

100216

## Initial processing

In [14]:
## Replace missing values with an empty string to prevent errors in column-wide string operations

df['Virus hosts'] = df['Virus hosts'].fillna('')

In [15]:
df.sample(5)

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts
228353,A0A1Z1X249,A0A1Z1X249_9INFA,unreviewed,Hemagglutinin [Cleaved into: Hemagglutinin HA1...,Influenza A virus (A/mallard duck/Netherlands/...,564,2004310,Influenza A virus,
2919,A0A085P747,A0A085P747_ECOLX,unreviewed,DUF4102 domain-containing protein (Integrase) ...,Escherichia coli,393,562,Escherichia coli,
11061,A0A1D8HWW9,A0A1D8HWW9_9INFB,unreviewed,Hemagglutinin [Cleaved into: Hemagglutinin HA1...,Influenza B virus (B/Vermont/13/2016),585,1910624,Influenza B virus,
106427,A0A1P8L8E1,A0A1P8L8E1_9INFA,unreviewed,Hemagglutinin [Cleaved into: Hemagglutinin HA1...,Influenza A virus (A/California/179/2016(H3N2)),566,1936861,Influenza A virus,
150379,A0A1Q9KXJ8,A0A1Q9KXJ8_ECOLX,unreviewed,Integrase,Escherichia coli,421,562,Escherichia coli,


In [16]:
def join_names(df, col_name: str):
    df[col_name] = df[col_name].str.split('; ').apply(set).apply('; '.join) # 'set' function removes duplicate entries
    return df
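
Because `set` does not preserve order, the joined names can come back in a different order than they appeared in the input. If a stable order matters, one option is to deduplicate with `dict.fromkeys`, which keeps first-seen order. The helper name `join_unique` below is hypothetical and is not part of zoonosis_helper_functions.py; it also assumes missing values were already replaced by empty strings, as done earlier.

```python
import pandas as pd

def join_unique(series: pd.Series, sep: str = "; ") -> pd.Series:
    """Split each string on `sep`, drop repeats, and re-join in first-seen order."""
    # dict.fromkeys keeps only the first occurrence of each part, in order.
    return series.str.split(sep).apply(lambda parts: sep.join(dict.fromkeys(parts)))

hosts = pd.Series(["Homo sapiens; Sus scrofa; Homo sapiens", ""])
print(join_unique(hosts).tolist())
```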

In [17]:
# df['Virus hosts'] = df['Virus hosts'].str.split('; ')
# df['Virus hosts'] = df['Virus hosts'].swifter.progress_bar(enable=True, desc='Removing duplicate host names').apply(set)
# df['Virus hosts'] = df['Virus hosts'].swifter.progress_bar(enable=True, desc='Joining host names list').apply('; '.join)

# df['Protein names'] = df['Protein names'].str.split('; ')
# df['Protein names'] = df['Protein names'].swifter.progress_bar(enable=True, desc='Removing duplicate protein names').apply(set)
# df['Protein names'] = df['Protein names'].swifter.progress_bar(enable=True, desc='Joining protein names list').apply('; '.join)

# df['Organism'] = df['Organism'].str.split('; ')
# df['Organism'] = df['Organism'].swifter.progress_bar(enable=True, desc='Removing duplicate organism names').apply(set)
# df['Organism'] = df['Organism'].swifter.progress_bar(enable=True, desc='Joining organism names list').apply('; '.join)

In [18]:
# Remove duplicate entries if present
df = join_names(df, 'Virus hosts')
df = join_names(df, 'Protein names')
df = join_names(df, 'Organism')

df.sample(3)

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts
247726,A0A2I7A7K8,A0A2I7A7K8_9INFA,unreviewed,Nucleoprotein (Nucleocapsid protein) (Protein N),Influenza A virus (A/California/150/2017(H3N2)),498,2072053,Influenza A virus,
186581,A0A143LYY6,A0A143LYY6_9INFA,unreviewed,Hemagglutinin HA2 chain]; Hemagglutinin [Cleav...,Influenza A virus (A/blue-winged teal/Texas/AI...,564,1807599,Influenza A virus,
234852,A0A1Y0AXS6,A0A1Y0AXS6_9INFA,unreviewed,Nucleoprotein (Nucleocapsid protein) (Protein N),Influenza A virus (A/swine/Kansas/A01781760/20...,498,1911222,Influenza A virus,


<a id="ete3-taxo" ></a>

In [19]:
# Species ID from organism ID
df['Species taxonomic ID'] = (df['Taxonomic lineage IDs']
                              .swifter.progress_bar(enable=True, desc='Getting Viruses taxonomic IDs')
                              .apply(getRankID, rank='species')) # getRankID function in zoonosis_helper_functions.py

Getting Viruses taxonomic IDs:   0%|          | 0/16 [00:00<?, ?it/s]
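
The actual `getRankID` lives in zoonosis_helper_functions.py and is not shown here. As a rough sketch of the idea only, a rank lookup can walk a taxon's lineage and return the ancestor with the requested rank. The function name `rank_id_from_lineage` is hypothetical, and the lineage below is abbreviated and hand-written for illustration; with ete3 the two inputs would come from `NCBITaxa().get_lineage(taxid)` and `NCBITaxa().get_rank(lineage)`.

```python
from typing import Dict, List, Optional

def rank_id_from_lineage(lineage: List[int], ranks: Dict[int, str], rank: str) -> Optional[int]:
    """Return the taxid in `lineage` whose rank equals `rank`, or None if absent."""
    for taxid in lineage:
        if ranks.get(taxid) == rank:
            return taxid
    return None

# Abbreviated, hand-written lineage for illustration (not a full NCBI lineage).
lineage = [1, 10239, 11320, 691915]
ranks = {1: "no rank", 10239: "superkingdom", 11320: "species", 691915: "no rank"}

print(rank_id_from_lineage(lineage, ranks, "species"))
print(rank_id_from_lineage(lineage, ranks, "family"))
```

Returning `None` when the rank is absent mirrors the missing values that the following cells have to handle.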

In [20]:
# Copy for later use
dff = df[['Entry', 'Species taxonomic ID']].copy()

In [21]:
df.sample(3)

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts,Species taxonomic ID
47934,A0A1Y1CBP2,A0A1Y1CBP2_9DELA,unreviewed,Env polyprotein (Surface protein) (Transmembra...,Human T-cell leukemia virus type I,488,11908,Primate T-lymphotropic virus 1,,194440.0
58886,A0A1U9VWY7,A0A1U9VWY7_CHIKV,unreviewed,Togavirin (EC 3.4.21.90),Chikungunya virus (CHIKV),824,37124,Chikungunya virus (CHIKV),Aedes furcifer (Mosquito) [TaxID: 299627]; Hom...,37124.0
33861,A0A2P1GY93,A0A2P1GY93_9HIV1,unreviewed,Transmembrane protein gp41 (TM) (Glycoprotein ...,Human immunodeficiency virus 1,865,11676,Human immunodeficiency virus 1,Homo sapiens (Human) [TaxID: 9606],11676.0


In [22]:
# Check if all tax IDs could be found in NCBI taxonomy database
df[df['Species taxonomic ID'].isnull()].sample(3)

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts,Species taxonomic ID
189041,Q0NCN9,Q0NCN9_VAR65,unreviewed,IMV membrane protein,Variola virus (isolate Human/South Africa/102/...,133,587201,Variola virus,Homo sapiens (Human) [TaxID: 9606],
198013,Q0NG17,Q0NG17_VAR46,unreviewed,Myristylprotein,Variola virus (isolate Human/Japan/Yamada MS-2...,377,587202,Variola virus,Homo sapiens (Human) [TaxID: 9606],
140090,Q0NCM3,Q0NCM3_VAR65,unreviewed,Carbonic anhydrase homolog (Cell surface-bindi...,Variola virus (isolate Human/South Africa/102/...,304,587201,Variola virus,Homo sapiens (Human) [TaxID: 9606],


In [23]:
# Get the species name of the earlier unidentified taxonomic IDs
idx_species_name = df.columns.get_loc('Taxonomic lineage (SPECIES)')
idx_organism_id = df.columns.get_loc('Species taxonomic ID')

for row in tqdm_notebook(range(len(df)), desc='Getting species ID from organism name'):
    if np.isnan(df.iat[row, idx_organism_id]):
        df.iat[row, idx_organism_id] = getIDfromName(df.iat[row, idx_species_name]) # getIDfromName function in zoonosis_helper_functions.py

Getting species ID from organism name:   0%|          | 0/358333 [00:00<?, ?it/s]

In [24]:
df[df['Species taxonomic ID'].isnull()]

Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Length,Taxonomic lineage IDs,Taxonomic lineage (SPECIES),Virus hosts,Species taxonomic ID


In [25]:
df['Species taxonomic ID'] = df['Species taxonomic ID'].astype(int) # convert taxid from floats to int

In [26]:
df.shape

(358333, 10)

In [27]:
df = (df.drop(['Status','Taxonomic lineage IDs'], axis=1)
      .groupby('Species taxonomic ID', as_index=False)
      .agg({'Virus hosts':set, 'Organism':set,
            'Protein names':set, 'Taxonomic lineage (SPECIES)':'first'}))

In [28]:
df['Virus hosts'] = df['Virus hosts'].str.join('; ')
df['Organism'] = df['Organism'].str.join('; ')
df['Protein names'] = df['Protein names'].str.join('; ')

In [29]:
df.sample(5)

Unnamed: 0,Species taxonomic ID,Virus hosts,Organism,Protein names,Taxonomic lineage (SPECIES)
1493,101564,,Pseudomonas alcaliphila JAB1; Pseudomonas alca...,Integrase; Phage integrase family protein,Pseudomonas alcaliphila
2443,319003,,Bradyrhizobium sp. WSM1253,Integrase; Site-specific recombinase XerD,Bradyrhizobium sp. WSM1253
8468,1970425,,Polynucleobacter sp. 35-46-11,Tyr recombinase domain-containing protein,Polynucleobacter sp. 35-46-11
14038,2707201,,Bruce virus,RNA-dependent RNA polymerase,Bruce virus
1875,188763,,Panine betaherpesvirus 2 (Chimpanzee cytomegal...,Envelope glycoprotein H (gH); Capsid vertex co...,Panine betaherpesvirus 2 (Chimpanzee cytomegal...


In [30]:
df.shape

(15104, 5)

In [31]:
# Get species name from NCBI taxo database using Taxonomic ID
df['Species name'] = (df.drop('Taxonomic lineage (SPECIES)', axis=1)
                      .swifter.progress_bar(enable=True, desc='Getting Species name')
                      .apply(lambda x: getRankName(x['Species taxonomic ID'], 
                                                   rank='species'), axis=1))

Getting Species name:   0%|          | 0/16 [00:00<?, ?it/s]

In [32]:
# Get superkingdom name from NCBI taxo database using Taxonomic ID
df['Species superkingdom'] = df['Species taxonomic ID'].progress_apply(getRankName, rank='superkingdom')

Processing:   0%|          | 0/15104 [00:00<?, ?it/s]

In [33]:
# Get family from NCBI taxo database using Taxonomic ID
df['Species family'] = df['Species taxonomic ID'].progress_apply(getRankName, rank='family')

Processing:   0%|          | 0/15104 [00:00<?, ?it/s]

In [34]:
df['Species superkingdom'].unique()

array(['Bacteria', 'Archaea', 'Eukaryota', 'Viruses', 'IncJ plasmid R391',
       'uncultured organism', 'metagenome', 'Plasmid pFKY1',
       'human gut metagenome', 'marine metagenome',
       'mine drainage metagenome', 'marine sediment metagenome',
       'freshwater metagenome',
       'uncultured marine microorganism HF4000_005I08',
       'wastewater metagenome', 'hydrothermal vent metagenome',
       'sediment metagenome', 'viral metagenome', 'biofilter metagenome',
       'bioreactor metagenome', 'anaerobic digester metagenome',
       'plant metagenome', 'invertebrate metagenome'], dtype=object)

<a id="filter"></a>

In [35]:
# Filter to include only viruses
df = df[df['Species superkingdom'] == 'Viruses']

In [36]:
df.sample(5)

Unnamed: 0,Species taxonomic ID,Virus hosts,Organism,Protein names,Taxonomic lineage (SPECIES),Species name,Species superkingdom,Species family
11326,2447802,,Miniopterus schreibersi polyomavirus 3,Minor capsid protein; Capsid protein VP1,Miniopterus schreibersi polyomavirus 3,Miniopterus schreibersi polyomavirus 3,Viruses,Polyomaviridae
13852,2686265,,Flavobacterium phage vB_FspS_mumin9-3,Portal protein,Flavobacterium phage vB_FspS_mumin9-3,Flavobacterium phage vB_FspS_mumin9-3,Viruses,Siphoviridae
13889,2686466,,Mycobacteriophage Whitty,Integrase,Mycobacteriophage Whitty,Mycobacteriophage Whitty,Viruses,Siphoviridae
3577,668548,,Bat betaherpesvirus 2,Envelope glycoprotein B (gB),Bat betaherpesvirus 2,Bat betaherpesvirus 2,Viruses,Herpesviridae
1465,93465,,Avian endogenous retrovirus EAV-HP,Envelope glycoprotein; Integrase; Capsid prote...,Avian endogenous retrovirus EAV-HP,Avian endogenous retrovirus EAV-HP,Viruses,Retroviridae


In [37]:
df.drop(['Taxonomic lineage (SPECIES)'], axis=1, inplace=True)

In [38]:
# Convert empty strings to nan for easy downstream processing
df['Virus hosts'] = np.where(df['Virus hosts']=='', np.nan, df['Virus hosts'])

In [39]:
df[df['Virus hosts'].isnull()].sample(3)

Unnamed: 0,Species taxonomic ID,Virus hosts,Organism,Protein names,Species name,Species superkingdom,Species family
969,47681,,Enterovirus sp.,Capsid protein VP3 (P1C) (Virion protein 3); C...,Enterovirus sp.,Viruses,Picornaviridae
6115,1647554,,Moraxella phage Mcat8,Integrase,Moraxella phage Mcat8,Viruses,Siphoviridae
12177,2548149,,Streptococcus phage Javan40,Portal protein,Streptococcus phage Javan40,Viruses,Siphoviridae


In [40]:
df.drop('Organism', axis=1, inplace=True) # Organism == Species name

## Updating host names from other host data in the dataset

Premise: Same virus has same host irrespective of whether the info has been reviewed or not
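
A minimal sketch of this premise on made-up data: hosts known for one record of a species are copied to records of the same species that have none. The cells below implement the idea differently (via a merge); this toy version only illustrates the logic.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Species name": ["Virus A", "Virus A", "Virus B"],
    "Virus hosts": ["Homo sapiens [TaxID: 9606]", np.nan, np.nan],
})

# Collect the known hosts of each species; species without any host are absent.
known = (toy.dropna(subset=["Virus hosts"])
            .groupby("Species name")["Virus hosts"]
            .agg("; ".join))

# Fill gaps by species lookup; species with no annotated record stay missing.
toy["Virus hosts"] = toy["Virus hosts"].fillna(toy["Species name"].map(known))
print(toy)
```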

In [41]:
# List of viruses which do not have assigned hosts in the data
noHostViruses = (df[df['Virus hosts'].isnull()]['Species name']
                 .unique()
                 .tolist())

In [42]:
# Create an independent dataframe of viruses with no assigned host and simultaneously identify the same viruses in the data
# which already have assigned hosts, then assign host names based on those.
df_na_hosts = df[(~df['Virus hosts'].isnull()) & (df['Species name'].isin(noHostViruses))][['Species name', 'Virus hosts']]
df_na_hosts = df_na_hosts.groupby('Species name')['Virus hosts'].apply(list) # Collapses to one row per species
df_na_hosts = df_na_hosts.reset_index(name='Viral hosts nw')

In [43]:
# The previous operation returns a list when a virus has multiple hosts
# Convert the lists into regular string entries separated by a ;
df_na_hosts['Viral hosts nw'] = (df_na_hosts['Viral hosts nw']
                                 .swifter.progress_bar(desc='Joining host names list', enable=True)
                                 .apply('; '.join))

In [44]:
# Updates the viruses hosts info in the main dataset
df_naa = (df[df['Virus hosts'].isnull()]
          .merge(df_na_hosts, on='Species name', how='left')
          .drop('Virus hosts', axis=1)
          .rename({'Viral hosts nw':'Virus hosts'}, axis=1))

In [45]:
# Creates an independent dataset of viruses which have hosts
df_notna = df[~df['Virus hosts'].isnull()]

In [46]:
# Combines the updated virus hosts dataset with the dataset of viruses which have hosts
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
df = pd.concat([df_naa, df_notna])

In [47]:
df.shape # Reduced dimension because of grouping, will later ungroup

(8113, 6)

In [48]:
df.sample(5)

Unnamed: 0,Species taxonomic ID,Protein names,Species name,Species superkingdom,Species family,Virus hosts
3957,2126790,Portal protein,Mycobacterium phage SchoolBus,Viruses,Siphoviridae,
5622,2563581,Integrase,Pseudomonas phage vB_Pae_CF60a,Viruses,Siphoviridae,
4899,2508204,Putative head to tail connecting protein,Escherichia phage vB_EcoP_R4596,Viruses,Autographiviridae,
5836,2591056,Protein Gp38 (Receptor-recognizing protein); L...,Shigella phage CM8,Viruses,Myoviridae,
796,644007,Uncharacterized protein,Streptococcus phage PH10,Viruses,Siphoviridae,


In [49]:
print(df[df['Virus hosts'].isnull()].shape)
df[df['Virus hosts'].isnull()].sample(3)

(7512, 6)


Unnamed: 0,Species taxonomic ID,Protein names,Species name,Species superkingdom,Species family,Virus hosts
4549,2365030,Portal protein,Streptococcus phage CHPC1156,Viruses,Siphoviridae,
5595,2563517,Integrase,Pseudomonas phage vB_Pae_BR233a,Viruses,Siphoviridae,
2681,1916156,Integrase,Streptococcus phage IPP16,Viruses,Siphoviridae,


In [50]:
df['Virus hosts'] = df['Virus hosts'].fillna('')

In [51]:
df.sample(3)

Unnamed: 0,Species taxonomic ID,Protein names,Species name,Species superkingdom,Species family,Virus hosts
4083,2169967,Tape measure protein (TMP) (Gene product H) (g...,Escherichia virus DE3,Viruses,Siphoviridae,
6955,2733671,Portal protein (Head-to-tail connector); Inter...,Klebsiella virus KN1-1,Viruses,Autographiviridae,
1866,1567487,XkdK,Bacillus phage BalMu-1,Viruses,Myoviridae,


In [52]:
df = mergeRows(df, 'Species taxonomic ID','Virus hosts') # mergeRows in zoonosis_helper_functions.py

In [53]:
df[(df['Species name'].str.contains('Influenza A virus')) & (df['Virus hosts'] != '')]

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family
161,11320,; Sturnira lilium (Lesser yellow-shouldered ba...,Hemagglutinin HA2 chain]; Hemagglutinin [Cleav...,Influenza A virus,Viruses,Orthomyxoviridae


In [54]:
df.sample(3)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family
6111,2560781,,Putative portal protein,Staphylococcus virus Sb1,Viruses,Herelleviridae
5644,2547985,,Portal (Connector) protein,Streptococcus phage Javan116,Viruses,Siphoviridae
7959,2744003,,Integrase,Gordonia phage BlingBling,Viruses,Siphoviridae


In [55]:
# Separate dataset for easy tracking of updates
dfna = df[df['Virus hosts'] == '']
df = df[~(df['Virus hosts'] == '')]

In [56]:
dfna.shape

(7512, 6)

In [57]:
df.shape

(601, 6)

## Updating host names from external sources

In [58]:
# Data from NCBI Virus
df2 = pd.read_csv('../data/sequences.csv')
df2.shape

(2599675, 3)

In [59]:
df2.sample(2)

Unnamed: 0,Species,Molecule_type,Host
1423463,Influenza A virus,ssRNA(-),Homo sapiens
1457127,Norwalk virus,ssRNA(+),Homo sapiens


In [60]:
df2.drop_duplicates(inplace=True)
df2.shape

(10956, 3)

In [61]:
# Get taxonomic IDs from species names
df2['Species ID'], df2['Host ID'] = df2['Species'].progress_apply(getIDfromName), df2['Host'].progress_apply(getIDfromName)

Processing:   0%|          | 0/10956 [00:00<?, ?it/s]

'nan'
'nan'
'nan'
'nan'


Processing:   0%|          | 0/10956 [00:00<?, ?it/s]

'Bolomys lasiurus'
'Bolomys lasiurus'
'Pipistrellus sp. pipistrellus/pygmaeus AO-2021'
'Pipistrellus musciculus'
'Funisciurus bayonii'
'Rattus sp. r3 YH-2020'
'Rattus sp. r3 YH-2020'
'Soricidae sp. YH-2020'
'Rattus sp. r3 YH-2020'
'Rattus sp. r3 YH-2020'
'Acomys selousi'
'Rhinolophus smithersi'
'Alouatta sp.'
'Pipistrellys abramus'
'Sturnira angeli'
'Sturnira angeli'
'Hipposideros curtus'
'Pipistrellus inexspectatus'
'Dobsonia exoleta'
'Mops demonstrator'
'Pipistrellus musciculus'
'Mus sp. TG-2020'
'Murinae gen. sp. TG-2020'
'Vespadelus baverstocki'
'Ozimops sp. DP-2019'
'Scoterepens balstoni'
'Neoromicia capensis'
'Neoromicia capensis'
'Mus sp. CL-2019'
'Mus sp. CL-2019'
'Neoromicia capensis'
'Bolomys lasiurus'
'Bolomys lasiurus'
'Bolomys lasiurus'
'Neoromicia capensis'
'Neoromicia capensis'
'Neoromicia capensis'
'Neoromicia capensis'
'Bolomys lasiurus'
'Pipistrellus inexspectatus'
'Chiroptera sp.'
'Chaerephon aloysiisabaudiae'
'Chiroptera sp.'
'Paradoxurus musangus'
'Neoromicia capen

In [62]:
df2.dropna(inplace=True)
df2['Species ID'], df2['Host ID'] = df2['Species ID'].astype(int), df2['Host ID'].astype(int)
df2.shape

(10886, 5)

In [63]:
df2['Host name'] = df2.progress_apply(lambda x: nameMerger(x['Host'], x['Host ID']), axis=1)
# Remove Host and Host ID columns as they have been merged and are no longer needed
df2.drop(['Host', 'Host ID'], axis=1, inplace=True)

Processing:   0%|          | 0/10886 [00:00<?, ?it/s]

In [64]:
df2['Species ID'] = df2['Species ID'].progress_apply(getRankID, rank='species')

Processing:   0%|          | 0/10886 [00:00<?, ?it/s]

In [65]:
## Create a copy for later use
dfff = df2.copy()

In [66]:
# Add host names
df_na_hosts = AggregateHosts(df2,'Species ID', 'Host name')
dfna = dfna.merge(df_na_hosts, left_on='Species taxonomic ID', right_on='Species ID', how='left')
dfna = dfna.drop(['Virus hosts', 'Species ID'], axis=1).rename({'Host name':'Virus hosts'}, axis=1)
dfna = UpdateHosts(dfna, df_na_hosts, 'Species taxonomic ID', 'Species ID')
df, dfna = UpdateMain(df, dfna)
df = mergeRows(df, 'Species taxonomic ID', 'Virus hosts')
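
`AggregateHosts`, `UpdateHosts` and `UpdateMain` are defined in zoonosis_helper_functions.py and are not shown in this notebook. As a guess at the core idea only, the aggregation step could collapse the external table to one row per species and the left merge then attaches those hosts to viruses that lack them. The function `aggregate_hosts` and the toy rows below are hypothetical.

```python
import pandas as pd

def aggregate_hosts(df: pd.DataFrame, id_col: str, host_col: str) -> pd.DataFrame:
    """Collapse to one row per `id_col`, joining distinct hosts with '; '."""
    return (df.groupby(id_col)[host_col]
              .agg(lambda hosts: "; ".join(dict.fromkeys(hosts)))
              .reset_index())

# Toy external source: repeated (species, host) pairs.
external = pd.DataFrame({
    "Species ID": [11320, 11320, 11320],
    "Host name": ["Homo sapiens [TaxID: 9606]", "Sus scrofa [TaxID: 9823]",
                  "Homo sapiens [TaxID: 9606]"],
})
# Toy list of viruses that still lack hosts.
missing = pd.DataFrame({"Species taxonomic ID": [11320, 2502433]})

filled = missing.merge(aggregate_hosts(external, "Species ID", "Host name"),
                       left_on="Species taxonomic ID", right_on="Species ID", how="left")
print(filled)
```

With `how='left'`, viruses that the external source does not cover keep a missing host and stay in the pool for the next source.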

In [67]:
dfna.shape

(6476, 6)

In [68]:
df.shape

(1637, 6)

In [69]:
# Data from virus host database
df2 = pd.read_table('../data/virushostdb.tsv')
df2.head(3)

Unnamed: 0,virus tax id,virus name,virus lineage,refseq id,KEGG GENOME,KEGG DISEASE,DISEASE,host tax id,host name,host lineage,pmid,evidence,sample type,source organism
0,438782,Abaca bunchy top virus,Viruses; Monodnaviria; Shotokuvirae; Cressdnav...,"NC_010314, NC_010315, NC_010316, NC_010317, NC...",,,,46838.0,Musa sp.,Eukaryota; Viridiplantae; Streptophyta; Strept...,17978886.0,"Literature, NCBI Virus, RefSeq",,
1,438782,Abaca bunchy top virus,Viruses; Monodnaviria; Shotokuvirae; Cressdnav...,"NC_010314, NC_010315, NC_010316, NC_010317, NC...",,,,214697.0,Musa acuminata AAA Group,Eukaryota; Viridiplantae; Streptophyta; Strept...,17978886.0,Literature,,
2,1241371,Abalone herpesvirus Victoria/AUS/2009,Viruses; Duplodnaviria; Heunggongvirae; Peplov...,NC_018874,,,,6451.0,Haliotidae,Eukaryota; Opisthokonta; Metazoa; Eumetazoa; B...,,UniProt,,


In [70]:
df2 = df2[['virus tax id', 'virus name', 'host tax id', 'host name']].copy()
df2.drop_duplicates(inplace=True)
print(df2.shape)
df2.head()

(16612, 4)


Unnamed: 0,virus tax id,virus name,host tax id,host name
0,438782,Abaca bunchy top virus,46838.0,Musa sp.
1,438782,Abaca bunchy top virus,214697.0,Musa acuminata AAA Group
2,1241371,Abalone herpesvirus Victoria/AUS/2009,6451.0,Haliotidae
3,1241371,Abalone herpesvirus Victoria/AUS/2009,36100.0,Haliotis rubra
4,491893,Abalone shriveling syndrome-associated virus,37770.0,Haliotis diversicolor aquatilis


In [71]:
df2[df2['host tax id'].isnull()]

Unnamed: 0,virus tax id,virus name,host tax id,host name
1236,2662138,Bacteriophage Phobos,,
3750,1131416,Cucurbit mild mosaic virus,,
15925,1888308,Wabat virus,,


In [72]:
df2.dropna(inplace=True)

In [73]:
df2['host tax id'] = df2['host tax id'].astype(int)
df2.head()

Unnamed: 0,virus tax id,virus name,host tax id,host name
0,438782,Abaca bunchy top virus,46838,Musa sp.
1,438782,Abaca bunchy top virus,214697,Musa acuminata AAA Group
2,1241371,Abalone herpesvirus Victoria/AUS/2009,6451,Haliotidae
3,1241371,Abalone herpesvirus Victoria/AUS/2009,36100,Haliotis rubra
4,491893,Abalone shriveling syndrome-associated virus,37770,Haliotis diversicolor aquatilis


In [74]:
df2['Species ID'] = df2['virus tax id'].progress_apply(getRankID, rank='species')

Processing:   0%|          | 0/16609 [00:00<?, ?it/s]

In [75]:
df2['Host name'] = df2.progress_apply(lambda x: nameMerger(x['host name'], x['host tax id']), axis=1)
# Remove Host and Host ID columns as they have been merged and are no longer needed
df2.drop(['host name', 'host tax id'], axis=1, inplace=True)
df2.head()

Processing:   0%|          | 0/16609 [00:00<?, ?it/s]

Unnamed: 0,virus tax id,virus name,Species ID,Host name
0,438782,Abaca bunchy top virus,438782,Musa sp. [TaxID: 46838]
1,438782,Abaca bunchy top virus,438782,Musa acuminata AAA Group [TaxID: 214697]
2,1241371,Abalone herpesvirus Victoria/AUS/2009,1513231,Haliotidae [TaxID: 6451]
3,1241371,Abalone herpesvirus Victoria/AUS/2009,1513231,Haliotis rubra [TaxID: 36100]
4,491893,Abalone shriveling syndrome-associated virus,491893,Haliotis diversicolor aquatilis [TaxID: 37770]
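
The `Host name` values above follow a `name [TaxID: id]` pattern, matching the format UniProt uses in `Virus hosts`. The real `nameMerger` is in zoonosis_helper_functions.py; a minimal stand-in that produces the same format might look like this (the name `merge_name_and_taxid` is hypothetical):

```python
def merge_name_and_taxid(name: str, taxid: int) -> str:
    """Format a host as 'name [TaxID: id]' so all sources share one convention."""
    return f"{name} [TaxID: {int(taxid)}]"

print(merge_name_and_taxid("Musa sp.", 46838))
```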


In [76]:
df_na_hosts = AggregateHosts(df2,'Species ID', 'Host name')
dfna = dfna.merge(df_na_hosts, left_on='Species taxonomic ID', right_on='Species ID', how='left')
dfna = dfna.drop(['Virus hosts', 'Species ID'], axis=1).rename({'Host name':'Virus hosts'}, axis=1)
dfna = UpdateHosts(dfna, df_na_hosts, 'Species taxonomic ID', 'Species ID')
df, dfna = UpdateMain(df, dfna)
df = mergeRows(df, 'Species taxonomic ID', 'Virus hosts')

In [77]:
df.shape

(4760, 6)

In [78]:
dfna.shape

(3353, 6)

In [79]:
# Data from EID2 (Liverpool University)
df2 = pd.read_csv('../data/virus_host_4rm_untitled.csv')
df2.sample(2)

Unnamed: 0,Host_name,Host_TaxId,Host Group,Virus_name,Virus_TaxId,Micobe_group,Host_common_name,Host_common_name_rev
52713,anas rubripes,75857,vertebrates,influenza a virus (a/american black duck/quebe...,568495,viruses,American black duck,Duck
19134,homo sapiens,9606,primates,influenza a virus (a/bochum/ins375/2009(h1n1)),856502,viruses,Human,Human


In [80]:
df2 = df2[['Host_name', 'Host_TaxId', 'Virus_name', 'Virus_TaxId']].copy()
df2['Species ID'] = df2['Virus_TaxId'].progress_apply(getRankID, rank='species')
df2['Host name'] = df2.progress_apply(lambda x: nameMerger(x['Host_name'], x['Host_TaxId']), axis=1)
df2.drop(['Host_name', 'Host_TaxId'], axis=1, inplace=True)
df2.dropna(inplace=True)
df2.sample(2)

Processing:   0%|          | 0/59859 [00:00<?, ?it/s]

878474 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found
555869 taxid not found


Processing:   0%|          | 0/59859 [00:00<?, ?it/s]

Unnamed: 0,Virus_name,Virus_TaxId,Species ID,Host name
25156,influenza a virus (a/new york/4981/2009(h1n1)),691915,11320.0,homo sapiens [TaxID: 9606]
55705,influenza a virus (a/mallard duck/alb/191/1990...,352725,11320.0,aves [TaxID: 8782]


In [81]:
df_na_hosts = AggregateHosts(df2,'Species ID', 'Host name')
dfna = dfna.merge(df_na_hosts, left_on='Species taxonomic ID', right_on='Species ID', how='left')
dfna = dfna.drop(['Virus hosts', 'Species ID'], axis=1).rename({'Host name':'Virus hosts'}, axis=1)
dfna = UpdateHosts(dfna, df_na_hosts, 'Species taxonomic ID', 'Species ID')
df, dfna = UpdateMain(df, dfna)
df = mergeRows(df, 'Species taxonomic ID', 'Virus hosts')

In [82]:
df.shape

(4766, 6)

In [83]:
dfna.shape

(3347, 6)

In [84]:
dfna.sample(2)

Unnamed: 0,Species taxonomic ID,Protein names,Species name,Species superkingdom,Species family,Virus hosts
1777,2502433,Integrase,Mycobacterium phage MisterCuddles,Viruses,Siphoviridae,
784,1971433,Portal protein,Streptococcus phage P7631,Viruses,Siphoviridae,


## Further Processing

In [85]:
# Add a column to distinguish viruses that have human hosts (TaxID 9605 or 9606) from those that do not
df['Infects human'] = np.where(df['Virus hosts'].str.contains(r'960[56]'), 'human-true','human-false')
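
One caveat with the pattern `960[56]` is that it also matches any longer taxonomic ID that happens to contain those digits. Whether that occurs in this dataset is not checked here. A sketch of a stricter pattern that anchors on the `[TaxID: ...]` brackets, with a made-up ID to show the difference:

```python
import pandas as pd

hosts = pd.Series([
    "Homo sapiens (Human) [TaxID: 9606]",
    "Example host [TaxID: 196064]",   # made-up ID that merely contains '9606'
    "Sus scrofa (Pig) [TaxID: 9823]",
])

loose = hosts.str.contains(r"960[56]")               # substring match
strict = hosts.str.contains(r"\[TaxID: 960[56]\]")   # whole-ID match

print(loose.tolist())
print(strict.tolist())
```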

In [86]:
df.sample(2)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human
1824,1340819,Mycolicibacterium smegmatis MC2 155 [TaxID: 24...,Portal protein,Mycobacterium phage Catdawg,Viruses,Siphoviridae,human-false
3192,2049886,Gordonia terrae [TaxID: 2055],Integrase; Portal protein,Gordonia virus Vendetta,Viruses,Siphoviridae,human-false


In [87]:
df['Virus hosts'] = df['Virus hosts'].str.split('; ')
df['Virus hosts'] = df.progress_apply(lambda x: list(filter(None, x['Virus hosts'])), axis=1)
df['Virus hosts'] = df['Virus hosts'].progress_apply('; '.join)

Processing:   0%|          | 0/4766 [00:00<?, ?it/s]

Processing:   0%|          | 0/4766 [00:00<?, ?it/s]

In [88]:
df.sample(4)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human
2083,1541887,Salmonella [TaxID: 590]; Salmonella enterica [...,Portal protein,Salmonella virus Chi,Viruses,Siphoviridae,human-false
2886,1982154,Arthrobacter sp. ATCC 21022 [TaxID: 1771959],Portal protein,Arthrobacter virus Gordon,Viruses,Siphoviridae,human-false
4418,2733653,Pectobacterium carotovorum subsp. carotovorum ...,Portal protein (Head-to-tail connector); Inter...,Pectobacterium virus PP81,Viruses,Autographiviridae,human-false
3846,2560502,Gordonia terrae [TaxID: 2055],Integrase,Gordonia virus Hedwig,Viruses,Siphoviridae,human-false


In [89]:
df[df['Infects human'] == 'human-true'].sample(4)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human
491,121791,Sus scrofa (Pig) [TaxID: 9823]; Cynopterus bra...,Fusion glycoprotein F1]; Fusion glycoprotein F...,Nipah henipavirus,Viruses,Paramyxoviridae,human-true
1891,1415628,Homo sapiens [TaxID: 9606],CA1 (Capsid protein) (Coat protein),Gyrovirus Tu789,Viruses,Anelloviridae,human-true
2741,1961678,Homo sapiens [TaxID: 9606],Major capsid protein L1; Minor capsid protein L2,Gammapapillomavirus 21,Viruses,Papillomaviridae,human-true
177,11676,Homo sapiens (Human) [TaxID: 9606],Gag polyprotein; Exoribonuclease H (EC 2.7.7.4...,Human immunodeficiency virus 1,Viruses,Retroviridae,human-true


<a id="host-name-consistency"></a>

In [90]:
# Ungroup on host: one row per (virus, host) pair
# 1. Split 'Virus hosts' on the ';' separator
# 2. Stack the resulting pieces into rows, repeating the other columns
df = (df.set_index(df.columns.drop('Virus hosts').tolist())['Virus hosts'].str.split(';', expand=True)
          .stack()
          .reset_index()
          .rename(columns={0:'Virus hosts'})
          .loc[:, df.columns]
         ).copy()
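
As a point of comparison, the same ungrouping can be sketched with `DataFrame.explode`, which avoids the index round trip. The toy frame and its values are made up; this is an alternative under those assumptions, not the notebook's method.

```python
import pandas as pd

# Toy frame standing in for the notebook's df (values are made up).
toy = pd.DataFrame({
    'Species name': ['Virus A', 'Virus B'],
    'Virus hosts': ['Host 1 [TaxID: 1]; Host 2 [TaxID: 2]', 'Host 3 [TaxID: 3]'],
})

# Split each host string into a list, then emit one row per list element.
long_form = (toy.assign(**{'Virus hosts': toy['Virus hosts'].str.split(';')})
                .explode('Virus hosts')
                .reset_index(drop=True))

# Remove the leading space left behind by the '; ' separator.
long_form['Virus hosts'] = long_form['Virus hosts'].str.strip()
print(long_form)
```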

In [91]:
df.shape

(7270, 7)

In [92]:
df.sample(4)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human
2197,722417,Campylobacter sp. [TaxID: 205],Possible phage tail sheath protein,Campylobacter virus CP220,Viruses,Myoviridae,human-false
4237,1906673,Rhinolophus ferrumequinum [TaxID: 59479],Spike glycoprotein; Spike glycoprotein (S glyc...,Alphacoronavirus sp.,Viruses,Coronaviridae,human-false
6670,2714896,Camelus [TaxID: 9836],Major capsid protein L1; Minor capsid protein L2,Camelus dromedarius papillomavirus 3,Viruses,Papillomaviridae,human-false
3321,1437125,Rattus norvegicus [TaxID: 10116],Glycoprotein G2 (GP2)]; Glycoprotein G1 (GP1);...,Cardamones virus,Viruses,Arenaviridae,human-true


In [93]:
df['Virus hosts ID'] = None
idx_organism = df.columns.get_loc('Virus hosts')
idx_host_id = df.columns.get_loc('Virus hosts ID')

pattern = r'(\d+)\]'
for row in range(len(df)):
    # group() returns the whole match, including the trailing ']' (stripped in the next cell)
    host_id = re.search(pattern, df.iat[row, idx_organism]).group()
    df.iat[row, idx_host_id] = host_id
df.head()

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID
0,10243,Loxodonta africana (African elephant) [TaxID: ...,IMV membrane protein; Protein L5; CPXV098 prot...,Cowpox virus,Viruses,Poxviridae,human-true,9785]
1,10243,Microtus agrestis (Short-tailed field vole) [...,IMV membrane protein; Protein L5; CPXV098 prot...,Cowpox virus,Viruses,Poxviridae,human-true,29092]
2,10243,Myodes glareolus (Bank vole) (Clethrionomys g...,IMV membrane protein; Protein L5; CPXV098 prot...,Cowpox virus,Viruses,Poxviridae,human-true,447135]
3,10243,Felis catus (Cat) (Felis silvestris catus) [T...,IMV membrane protein; Protein L5; CPXV098 prot...,Cowpox virus,Viruses,Poxviridae,human-true,9685]
4,10243,Bos taurus (Bovine) [TaxID: 9913],IMV membrane protein; Protein L5; CPXV098 prot...,Cowpox virus,Viruses,Poxviridae,human-true,9913]
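
The row-by-row loop can also be expressed as a single vectorised call. The sketch below uses `Series.str.extract` with a capture group, so only the digits are kept and no bracket has to be stripped afterwards; the two host strings are reconstructed from the table above, and the series name `hosts` is an assumption.

```python
import pandas as pd

hosts = pd.Series([
    'Bos taurus (Bovine) [TaxID: 9913]',
    'Felis catus (Cat) (Felis silvestris catus) [TaxID: 9685]',
])

# The capture group keeps only the digits inside '[TaxID: ...]'.
host_ids = hosts.str.extract(r'\[TaxID: (\d+)\]', expand=False).astype(int)
print(host_ids.tolist())
```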


In [94]:
df['Virus hosts ID'] = df['Virus hosts ID'].str.rstrip(']')

In [95]:
df['Virus hosts ID'] = df['Virus hosts ID'].progress_apply(int)

df['Virus hosts ID'] = df['Virus hosts ID'].progress_apply(getRankID, rank='species')
df['Virus host name'] = df['Virus hosts ID'].progress_apply(getRankName, rank='species')
df['Host superkingdom'] = df['Virus hosts ID'].progress_apply(getRankName, rank='superkingdom')
df['Host kingdom'] = df['Virus hosts ID'].progress_apply(getRankName, rank='kingdom')

In [96]:
df[df['Virus hosts ID'].isna()]

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom


In [97]:
df['Virus hosts ID'][1866]

274

In [98]:
df['Virus hosts ID'] = df['Virus hosts ID'].progress_apply(int)

In [99]:
df['Virus hosts'] = (df.drop('Virus hosts', axis=1)
                     .apply(lambda x: nameMerger(x['Virus host name'], x['Virus hosts ID']), axis=1))

In [100]:
df.sample(4)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom
3170,1329650,Eptesicus serotinus [TaxID: 59452],Capsid protein,Bat circovirus,Viruses,Circoviridae,human-false,59452,Eptesicus serotinus,Eukaryota,Metazoa
1556,232380,Solanum lycopersicum [TaxID: 4081],Capsid protein (Coat protein),Tomato yellow leaf curl Axarquia virus,Viruses,Geminiviridae,human-false,4081,Solanum lycopersicum,Eukaryota,Viridiplantae
1955,505220,Sus scrofa [TaxID: 9823],Hemagglutinin-neuraminidase (EC 3.2.1.18),Swine parainfluenza virus 3,Viruses,Paramyxoviridae,human-false,9823,Sus scrofa,Eukaryota,Metazoa
6226,2560499,Gordonia terrae [TaxID: 2055],Portal protein,Gordonia virus Ghobes,Viruses,Siphoviridae,human-false,2055,Gordonia terrae,Bacteria,Gordonia terrae
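
The last sample row lists `Gordonia terrae` under `Host kingdom`, which suggests that when a lineage has no node of the requested rank, the helper falls back to the organism's own name. `getRankName` lives in zoonosis_helper_functions.py and is not shown here, so the sketch below is only a guess at that behaviour. The function `rank_name_with_fallback` is hypothetical, and the lineages are abbreviated to a few nodes. With ete3, the three inputs would come from `NCBITaxa.get_lineage`, `NCBITaxa.get_rank` and `NCBITaxa.get_taxid_translator`.

```python
def rank_name_with_fallback(taxid, lineage, ranks, names, rank):
    """Return the name of the ancestor at `rank`, or the taxon's own name if none exists.

    lineage: list of taxids from root to `taxid`
    ranks:   dict taxid -> rank string
    names:   dict taxid -> scientific name
    """
    for ancestor in lineage:
        if ranks.get(ancestor) == rank:
            return names[ancestor]
    return names[taxid]  # assumed fallback, inferred from the sample output

# Abbreviated lineages for illustration (intermediate nodes omitted).
names = {2759: 'Eukaryota', 33208: 'Metazoa', 9823: 'Sus scrofa',
         2: 'Bacteria', 2055: 'Gordonia terrae'}
ranks = {2759: 'superkingdom', 33208: 'kingdom', 9823: 'species',
         2: 'superkingdom', 2055: 'species'}

pig_kingdom = rank_name_with_fallback(9823, [2759, 33208, 9823], ranks, names, 'kingdom')
gordonia_kingdom = rank_name_with_fallback(2055, [2, 2055], ranks, names, 'kingdom')
print(pig_kingdom, '|', gordonia_kingdom)
```

If the helper does behave this way, it would explain why `Host kingdom` contains species names for hosts outside the eukaryotes.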


In [101]:
df.shape

(7270, 11)

In [102]:
# Ungroup on protein names: one row per (virus, host, protein) combination
df = (df.set_index(df.columns.drop('Protein names').tolist())['Protein names'].str.split(';', expand=True)
          .stack()
          .reset_index()
          .rename(columns={0:'Protein names'})
          .loc[:, df.columns]
         ).copy()

In [103]:
df[df['Host superkingdom'].isnull()].shape

(0, 11)

In [104]:
df['Host superkingdom'].unique()

array(['Eukaryota', 'Bacteria', 'Viruses', 'root', 'Archaea'],
      dtype=object)

In [105]:
df[df['Host superkingdom'] == 'Eukaryota'].shape

(18376, 11)

In [106]:
df[df['Host superkingdom'] == 'Viruses'].shape

(4, 11)

In [107]:
df[df['Host superkingdom'] == 'Bacteria'].shape

(4099, 11)

In [108]:
df[df['Host superkingdom'] == 'root'].shape

(38, 11)

In [109]:
df[df['Host superkingdom'] == 'Archaea'].shape

(14, 11)
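
The per-superkingdom row counts above can be obtained in one call with `Series.value_counts`. A minimal sketch on a made-up column:

```python
import pandas as pd

# Toy column standing in for df['Host superkingdom'] (values are made up).
toy = pd.DataFrame({'Host superkingdom': ['Eukaryota', 'Bacteria', 'Eukaryota', 'root', 'Eukaryota']})

# One count per distinct value, sorted from most to least frequent.
counts = toy['Host superkingdom'].value_counts()
print(counts)
```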

In [110]:
print(df[df['Host kingdom'] == 'Metazoa'].shape)
df[df['Host kingdom'] == 'Metazoa'].sample(3)

(17312, 11)


Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom
193,10243,Mus musculus [TaxID: 10090],CPXV098 protein (Myristylprotein) (Myristylpr...,Cowpox virus,Viruses,Poxviridae,human-true,10090,Mus musculus,Eukaryota,Metazoa
10761,394239,Phascolarctos cinereus [TaxID: 38626],R-peptide (p2E)],Koala retrovirus,Viruses,Retroviridae,human-false,38626,Phascolarctos cinereus,Eukaryota,Metazoa
5308,12110,Ovis aries [TaxID: 9940],Genome polyprotein [Cleaved into: Leader prot...,Foot-and-mouth disease virus,Viruses,Picornaviridae,human-false,9940,Ovis aries,Eukaryota,Metazoa


In [111]:
df[df['Infects human'] == 'human-true'].shape

(8457, 11)

In [112]:
df[df['Infects human'] == 'human-false'].shape

(14074, 11)

In [113]:
df.sample(2)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom
10502,340907,Macaca leonina [TaxID: 90387],Envelope glycoprotein B (gB),Papiine alphaherpesvirus 2,Viruses,Herpesviridae,human-true,90387,Macaca leonina,Eukaryota,Metazoa
21267,2591233,Rhinolophus [TaxID: 49442],Spike protein S2,Coronavirus BtRl-BetaCoV/SC2018,Viruses,Coronaviridae,human-false,49442,Rhinolophus,Eukaryota,Metazoa


In [114]:
df.sample(2)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom
7170,44024,Culex annulirostris [TaxID: 162997],Small envelope protein M (Matrix protein),Kokobera virus,Viruses,Flaviviridae,human-true,162997,Culex annulirostris,Eukaryota,Metazoa
2115,11020,Homo sapiens [TaxID: 9606],Spike glycoprotein E2 (E2 envelope glycoprotein),Barmah Forest virus,Viruses,Togaviridae,human-true,9606,Homo sapiens,Eukaryota,Metazoa


<a id="issue"></a>

In [115]:
# Unresolved: 'Virus host name' has fewer unique values than 'Virus hosts' and 'Virus hosts ID'
for column in df.columns:
    print(column, df[column].nunique())
print('Dataframe total',len(df))

Species taxonomic ID 4766
Virus hosts 1765
Protein names 2068
Species name 4766
Species superkingdom 1
Species family 80
Infects human 2
Virus hosts ID 1765
Virus host name 1756
Host superkingdom 5
Host kingdom 344
Dataframe total 22531
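
The counts show 1765 distinct host IDs but only 1756 distinct host names. Assuming each ID resolves to a single name, some names must be shared by more than one ID. The sketch below shows one way to list such names on a made-up frame; the column names match the notebook, the values do not.

```python
import pandas as pd

# Toy data: 'Host A' is attached to two different IDs.
toy = pd.DataFrame({
    'Virus hosts ID': [1, 2, 3, 3],
    'Virus host name': ['Host A', 'Host A', 'Host B', 'Host B'],
})

# Count distinct IDs per name and keep the names that map to more than one.
ids_per_name = toy.groupby('Virus host name')['Virus hosts ID'].nunique()
ambiguous = ids_per_name[ids_per_name > 1]
print(ambiguous)
```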


In [116]:
df.sample(2)

Unnamed: 0,Species taxonomic ID,Virus hosts,Protein names,Species name,Species superkingdom,Species family,Infects human,Virus hosts ID,Virus host name,Host superkingdom,Host kingdom
391,10243,Mus musculus [TaxID: 10090],Carbonic anhydrase homolog (Cell surface-bind...,Cowpox virus,Viruses,Poxviridae,human-true,10090,Mus musculus,Eukaryota,Metazoa
13358,1300978,Rhinolophus pusillus [TaxID: 159858],Attachment glycoprotein,Bat paramyxovirus,Viruses,Paramyxoviridae,human-false,159858,Rhinolophus pusillus,Eukaryota,Metazoa


## Restructuring the data

In [117]:
# Earlier saved data
dff.sample(2)

Unnamed: 0,Entry,Species taxonomic ID
131564,A0A4D6TZ07,11320.0
161504,A0A075EUR5,11320.0


In [118]:
dff.shape

(358333, 2)

In [119]:
## Load sequences
# Using custom IO instead of Bio.SeqIO because it was much easier to customise
# Not as efficient but still light on resources

<a id='fasta'></a>

In [120]:
fastaFileName = '../data/uniprot-keyword Virus+entry+into+host+cell+[KW-1160] +fragment no.fasta'

entry_seq = read_fasta(fastaFileName) # read_fasta in zoonosis_helper_functions.py
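
`read_fasta` is defined in zoonosis_helper_functions.py and is not reproduced here. As a rough idea of what such a custom reader involves, the sketch below parses UniProt-style headers (`>db|ACCESSION|ENTRY_NAME description`) and yields `(entry, sequence)` pairs. The name `parse_fasta`, the plain-string sequences and the demo records are assumptions; the real helper returns `FASTASeq` objects.

```python
import io

def parse_fasta(handle):
    """Yield (entry, sequence) pairs from a UniProt-style FASTA stream."""
    entry, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Emit the previous record before starting a new one.
            if entry is not None:
                yield entry, ''.join(chunks)
            # The accession is the second '|'-separated field of the header.
            entry = line[1:].split('|')[1]
            chunks = []
        else:
            chunks.append(line)
    if entry is not None:
        yield entry, ''.join(chunks)

# Made-up records for illustration.
demo = io.StringIO(
    '>tr|X00001|X00001_TEST Demo protein OS=Demo virus OX=1\nMKT\nLLV\n'
    '>tr|X00002|X00002_TEST Other protein\nMAA\n'
)
records = list(parse_fasta(demo))
print(records)
```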

In [121]:
dff.sort_values(by='Entry', inplace=True)

seq_object_list = [seq_obj for entry, seq_obj in entry_seq]

dff['Sequence'] = seq_object_list
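
Assigning the list positionally relies on the FASTA records arriving in the same order, and in the same number, as the sorted `Entry` column. A key-based alignment is more defensive; the sketch below uses `Series.map` with a dictionary on made-up data (`entry_seq_demo` and `meta` are hypothetical stand-ins for `entry_seq` and `dff`).

```python
import pandas as pd

# Toy stand-ins: (entry, sequence) pairs in arbitrary order, and a metadata frame.
entry_seq_demo = [('B2', 'MAA'), ('A1', 'MKT')]
meta = pd.DataFrame({'Entry': ['A1', 'B2', 'C3']})

seq_by_entry = dict(entry_seq_demo)

# map() aligns on the Entry value; entries without a sequence become NaN.
meta['Sequence'] = meta['Entry'].map(seq_by_entry)
print(meta)
```

Entries left as NaN then make any mismatch between the table and the FASTA file visible instead of silently shifting sequences.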

In [122]:
dff.head()

Unnamed: 0,Entry,Species taxonomic ID,Sequence
50368,A0A009FEK4,470.0,<zoonosis_helper_functions.FASTASeq object at ...
156673,A0A009G3H3,1310609.0,<zoonosis_helper_functions.FASTASeq object at ...
146717,A0A009GC36,470.0,<zoonosis_helper_functions.FASTASeq object at ...
146730,A0A009GCG0,470.0,<zoonosis_helper_functions.FASTASeq object at ...
144753,A0A009GXT7,1310609.0,<zoonosis_helper_functions.FASTASeq object at ...


In [123]:
df.drop(['Virus host name', 'Protein names', 'Species superkingdom'], axis=1, inplace=True)

In [124]:
df = df.merge(dff, on='Species taxonomic ID', how='left')

In [125]:
del dff, df2

In [126]:
df.shape

(48728179, 10)

In [127]:
df.drop_duplicates(inplace=True)

In [128]:
df.shape

(2277584, 10)
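
The jump to roughly 48.7 million rows comes from a many-to-many merge: every row for a species on the left is paired with every entry for that species on the right, and rows that became identical after dropping columns are multiplied as well. Dropping duplicates before the merge gives the same final rows with a much smaller intermediate frame. A minimal sketch on made-up data:

```python
import pandas as pd

# Left side contains a repeated row, as happens after columns are dropped.
left = pd.DataFrame({'Species taxonomic ID': [1, 1, 1],
                     'Virus hosts': ['H1', 'H1', 'H2']})
right = pd.DataFrame({'Species taxonomic ID': [1, 1],
                      'Entry': ['E1', 'E2']})

# Merge first, deduplicate afterwards (the order used above).
naive = left.merge(right, on='Species taxonomic ID', how='left')

# Deduplicate first, then merge.
deduped_first = left.drop_duplicates().merge(right, on='Species taxonomic ID', how='left')

print(len(naive), len(naive.drop_duplicates()), len(deduped_first))
```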

In [129]:
df['Virus hosts ID'] = df['Virus hosts ID'].apply(str)

In [130]:
# Group by Entry and aggregate using set function to avoid duplication
df = (df.groupby('Entry', as_index=False)
       .agg({'Virus hosts':set, #'Protein':set, 
             'Infects human':set, 'Species name':set,
             'Host superkingdom':set,
             'Host kingdom':set,
             'Virus hosts ID':set,
             'Species family':set,
             'Species taxonomic ID':'first',
             'Sequence': 'first'})) 

df.iloc[:, 1:-2] = df.iloc[:, 1:-2].swifter.applymap('; '.join)

In [131]:
df.shape

(317316, 10)
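
The aggregation collapses all rows for an entry into one, collecting distinct values per column and joining them with `'; '`. Because sets are unordered, the joined strings can come out in any order. The sketch below sorts before joining to make the result reproducible; the toy values are made up, and sorting is a suggested variation rather than what the cell above does.

```python
import pandas as pd

toy = pd.DataFrame({
    'Entry': ['E1', 'E1', 'E2'],
    'Virus hosts ID': ['9606', '9823', '9823'],
    'Infects human': ['human-true', 'human-true', 'human-false'],
})

def join_unique(values):
    """Join the distinct values of a group in sorted order."""
    return '; '.join(sorted(set(values)))

grouped = (toy.groupby('Entry', as_index=False)
              .agg({'Virus hosts ID': join_unique,
                    'Infects human': join_unique}))
print(grouped)
```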

In [132]:
# Get additional sequence info from the dataset
df['Sequence'] = df.progress_apply(lambda x: getSequenceFeatures(
    seqObj=x['Sequence'], entry=x['Entry'],
    organism=x['Species name'], status=x['Infects human']), axis=1)

<a id="protein-names-from-sequence"></a>

In [133]:
df['Protein'] = df['Sequence'].apply(lambda x: x.protein_name)

In [134]:
df.sample(3)

Unnamed: 0,Entry,Virus hosts,Infects human,Species name,Host superkingdom,Host kingdom,Virus hosts ID,Species family,Species taxonomic ID,Sequence,Protein
49997,A0A140EVM3,Homo sapiens [TaxID: 9606],human-true,Influenza B virus,Eukaryota,Metazoa,9606,Orthomyxoviridae,11520,<zoonosis_helper_functions.FASTASeq object at ...,Hemagglutinin
20466,A0A0C5AGA5,Cetacea [TaxID: 9721]; Sturnira lilium [TaxID:...,human-true,Influenza A virus,Eukaryota,Metazoa,9796; 9823; 9721; 9606; 9691; 9666; 9685; 9709...,Orthomyxoviridae,11320,<zoonosis_helper_functions.FASTASeq object at ...,Nucleoprotein
41301,A0A0S3MR50,Sus scrofa [TaxID: 9823],human-false,Porcine epidemic diarrhea virus,Eukaryota,Metazoa,9823,Coronaviridae,28295,<zoonosis_helper_functions.FASTASeq object at ...,Spike glycoprotein


In [135]:
df[df['Infects human'] == 'human-true'].shape

(278789, 11)

In [136]:
df[df['Infects human'] == 'human-false'].shape

(38527, 11)

In [137]:
# Data loaded earlier from NCBI Virus; used below to add the molecule type
dfff.rename({'Species ID': 'Species taxonomic ID', 'Molecule_type': 'Molecule type'}, axis=1, inplace=True)
dfff.head()

Unnamed: 0,Species,Molecule type,Species taxonomic ID,Host name
0,Epsilonarterivirus zamalb,ssRNA(+),2501966,Chlorocebus [TaxID: 392815]
1,Rodent arterivirus,ssRNA(+),1806636,Eothenomys inez [TaxID: 870526]
2,Wencheng Sm shrew coronavirus,ssRNA(+),1508228,Suncus murinus [TaxID: 9378]
3,Bat coronavirus,ssRNA(+),1508220,Eidolon helvum [TaxID: 77214]
4,NL63-related bat coronavirus strain BtKYNL63-9b,ssRNA(+),2501929,Triaenops afer [TaxID: 549403]


In [138]:
df['Species taxonomic ID'] = df['Species taxonomic ID'].apply(int)

In [139]:
df = df.merge(dfff[['Species taxonomic ID', 'Molecule type']], how='left', on='Species taxonomic ID')

In [140]:
df.shape

(8551317, 12)

In [141]:
df.drop_duplicates(inplace=True)

In [142]:
df.shape

(317316, 12)
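
The temporary growth to about 8.5 million rows indicates that the NCBI Virus table holds many rows per species. Reducing it to one row per species before merging keeps the row count fixed, and `validate='many_to_one'` makes pandas raise if that assumption is violated. A sketch on made-up frames (`proteins` and `molecule` are hypothetical stand-ins for `df` and `dfff`):

```python
import pandas as pd

proteins = pd.DataFrame({'Entry': ['E1', 'E2'],
                         'Species taxonomic ID': [11320, 11520]})
molecule = pd.DataFrame({
    'Species taxonomic ID': [11320, 11320, 11520],
    'Molecule type': ['ssRNA(-)', 'ssRNA(-)', 'ssRNA(-)'],
})

# Keep one row per species so the merge cannot multiply rows.
lookup = molecule.drop_duplicates(subset='Species taxonomic ID')

merged = proteins.merge(lookup, how='left', on='Species taxonomic ID',
                        validate='many_to_one')
print(merged)
```

Note that `drop_duplicates(subset=...)` keeps the first row per species, so a species with conflicting molecule types would need to be checked separately.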

In [143]:
del dfff

## Reorganise dataframe

In [144]:
df = df[['Entry', 'Protein', 'Species name', 
         'Species taxonomic ID', 'Species family', 'Virus hosts',
         'Virus hosts ID', 'Host kingdom', 
         'Host superkingdom', 'Molecule type', 'Infects human', 'Sequence']]

In [145]:
df.sample(3)

Unnamed: 0,Entry,Protein,Species name,Species taxonomic ID,Species family,Virus hosts,Virus hosts ID,Host kingdom,Host superkingdom,Molecule type,Infects human,Sequence
526183,A0A0A7R6V6,Large envelope protein,Hepatitis B virus,10407,Hepadnaviridae,Gorilla gorilla [TaxID: 9593]; Hylobatidae [Ta...,9600; 9598; 9577; 9606; 9593,Metazoa,Eukaryota,dsDNA-RT,human-true,<zoonosis_helper_functions.FASTASeq object at ...
2256996,A0A1L6YZF4,Nucleoprotein,Influenza B virus,11520,Orthomyxoviridae,Homo sapiens [TaxID: 9606],9606,Metazoa,Eukaryota,ssRNA(-),human-true,<zoonosis_helper_functions.FASTASeq object at ...
3270072,A0A2I7A3J3,Hemagglutinin,Influenza A virus,11320,Orthomyxoviridae,Cetacea [TaxID: 9721]; Sturnira lilium [TaxID:...,9796; 9823; 9721; 9606; 9691; 9666; 9685; 9709...,Metazoa,Eukaryota,ssRNA(-),human-true,<zoonosis_helper_functions.FASTASeq object at ...


## Split Dataframe to multiple datasets

In [146]:
df['Host superkingdom'].unique()

array(['Eukaryota', 'Bacteria', 'root', 'Archaea', 'Viruses',
       'root; Eukaryota'], dtype=object)

In [147]:
df['Host kingdom'].unique()

array(['Metazoa', 'Viridiplantae', 'Lactococcus lactis',
       'Escherichia coli', 'Serratia marcescens',
       'Mycolicibacterium smegmatis', 'Bacillus thuringiensis',
       'Trichormus variabilis', 'Listeria monocytogenes',
       'Pseudomonas syringae', 'Metazoa; Viridiplantae',
       'Cronobacter sakazakii', 'Staphylococcus epidermidis',
       'Enterococcus faecium', 'root', 'Caulobacter vibrioides',
       'Vibrio alginolyticus', 'Staphylococcus aureus', 'Bacillus cereus',
       'Ralstonia solanacearum', 'Klebsiella pneumoniae',
       'Staphylococcus aureus; Staphylococcus xylosus',
       'Acinetobacter baumannii', 'Dickeya sp.',
       'Lactobacillus delbrueckii', 'Salmonella', 'Bacillus pumilus',
       'Citrobacter; Citrobacter freundii', 'Mycobacterium',
       'Rhizobium leguminosarum', 'Mesorhizobium loti',
       'Shigella flexneri', 'Yersinia enterocolitica',
       'Idiomarinaceae bacterium N2-2', 'Sulfitobacter sp. CB2047',
       'Lelliottia sp. GL2', 'Clostridi

In [148]:
df[(df['Host kingdom'].str.contains('Viridiplantae')) | df['Virus hosts'].str.contains('[Hh]omo [Ss]apiens')].shape

(284537, 12)

In [149]:
df['Molecule type'] = df['Molecule type'].fillna('')

In [150]:
df[df['Molecule type'].isna()]

Unnamed: 0,Entry,Protein,Species name,Species taxonomic ID,Species family,Virus hosts,Virus hosts ID,Host kingdom,Host superkingdom,Molecule type,Infects human,Sequence


In [151]:
df[df['Host kingdom'].str.contains('Metazoa') & df['Molecule type'].str.contains('DNA')].shape

(31528, 12)

In [152]:
df[df['Host kingdom'].str.contains('Metazoa') & df['Molecule type'].str.contains('RNA')].shape

(273101, 12)

In [153]:
df.shape

(317316, 12)

In [154]:
df[~df['Host kingdom'].str.contains('Metazoa')].shape

(9987, 12)

In [155]:
df[(df['Host superkingdom'].isin(['Bacteria', 'Viruses', 'Archaea'])) | (df['Virus hosts'].str.contains('[Hh]omo [Ss]apiens'))].shape

(283199, 12)

In [156]:
unfiltered = df
metazoa = df[df['Host kingdom'].str.contains('Metazoa')]
plant_human = df[(df['Host kingdom'].str.contains('Viridiplantae')) | df['Virus hosts'].str.contains('[Hh]omo [Ss]apiens')]
NonEukaryote_Human = df[(df['Host superkingdom'].isin(['Bacteria', 'Viruses', 'Archaea'])) | (df['Virus hosts'].str.contains('[Hh]omo [Ss]apiens'))]
DNA_MetazoaZoonosis = metazoa[metazoa['Molecule type'].str.contains('DNA')]
RNA_MetazoaZoonosis = metazoa[metazoa['Molecule type'].str.contains('RNA')]

In [157]:
def check_dist(df):
    """Print the ratio of human-false to human-true rows together with both counts."""
    true_count = df[df['Infects human'].str.contains('true')].shape[0]
    false_count = df[df['Infects human'].str.contains('false')].shape[0]
    imb = (false_count/true_count)
    print('The minority class is %.2f of the majority\nhuman-true == %d and human false == %d\n' % (imb, true_count, false_count))

In [158]:
dataframes = [metazoa, unfiltered, plant_human, NonEukaryote_Human, DNA_MetazoaZoonosis, RNA_MetazoaZoonosis]
for dt in dataframes:
    check_dist(dt)

The minority class is 0.10 of the majority
human-true == 278755 and human false == 28574

The minority class is 0.14 of the majority
human-true == 278789 and human false == 38527

The minority class is 0.02 of the majority
human-true == 278789 and human false == 5748

The minority class is 0.02 of the majority
human-true == 278755 and human false == 4444

The minority class is 0.29 of the majority
human-true == 24518 and human false == 7010

The minority class is 0.07 of the majority
human-true == 254072 and human false == 19029



## Random Undersampling of datasets

In [159]:
seed = 960505

In [160]:
# Undersample the majority class (human-true) so that the minority class (human-false) is 60% of its size
rus = RandomUnderSampler(sampling_strategy=0.6, random_state=seed)
sampled_dataframes = []
for dt in dataframes:
    clas = dt['Infects human']
#     print('Dataframe before sampling: ', dt.shape[0])
    dt, _ = rus.fit_resample(dt, clas)
    sampled_dataframes.append(dt)
    check_dist(dt)
#     print('Dataframe after sampling: ', dt.shape[0])

The minority class is 0.60 of the majority
human-true == 47623 and human false == 28574

The minority class is 0.60 of the majority
human-true == 64211 and human false == 38527

The minority class is 0.60 of the majority
human-true == 9580 and human false == 5748

The minority class is 0.60 of the majority
human-true == 7406 and human false == 4444

The minority class is 0.60 of the majority
human-true == 11683 and human false == 7010

The minority class is 0.60 of the majority
human-true == 31715 and human false == 19029
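
With `sampling_strategy=0.6` the minority class is left untouched and the majority class is cut down until the minority/majority ratio is 0.6. The worked arithmetic below divides each `human-false` count by 0.6 and truncates to an integer. The truncation is an assumption about the rounding used, chosen because it reproduces the `human-true` counts printed above; the function name is hypothetical.

```python
def majority_after_undersampling(n_minority, sampling_strategy):
    """Majority-class size that brings n_minority / n_majority to the requested ratio."""
    return int(n_minority / sampling_strategy)

# human-false counts of the six datasets, in the order they were sampled.
minority_counts = [28574, 38527, 5748, 4444, 7010, 19029]
expected_majority = [majority_after_undersampling(n, 0.6) for n in minority_counts]
print(expected_majority)
```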



## Write file sequences to fasta for feature extraction

In [161]:
metazoaFile = 'MetazoaZoonosis'
plant_humanFile = 'Plant-HumanZoonosis'
unfilteredFile = 'Zoonosis'
NonEukaryote_HumanFile = 'NonEukaryote-Human'
DNA_metazoaFile = 'DNA-MetazoaZoonosis'
RNA_metazoaFile = 'RNA-MetazoaZoonosis'

# dirs = ['MetazoaZoonosisData', 'ZoonosisData',
#         'Plant-HumanZoonosisData', 'NonEukaryote-HumanData',
#         'DNA-MetazoaZoonosisData', 'RNA-MetazoaZoonosisData']

files = [metazoaFile, unfilteredFile, plant_humanFile,
         NonEukaryote_HumanFile, DNA_metazoaFile, RNA_metazoaFile]

dirs = [os.path.join('../data/', fol) for fol in files] # Do not include in script

toSave = list(zip(sampled_dataframes, files, dirs))

<a id="splits"></a>

In [162]:
for dff, file, folder in toSave:
    # Create subdirectories
    os.makedirs(os.path.join(folder, 'train/human-true'), exist_ok=True)
    os.makedirs(os.path.join(folder, 'test/human-true'), exist_ok=True)
    os.makedirs(os.path.join(folder, 'train/human-false'), exist_ok=True)
    os.makedirs(os.path.join(folder, 'test/human-false'), exist_ok=True)

    # save dataframes as csv
    dff.drop('Sequence', axis=1).to_csv(f'{folder}/{file}Data.csv.gz', index=False, compression='gzip')
    
    # Split data to train and test data
    train, test = train_test_split(dff, test_size=0.2, random_state=2022) # Will further split 15% of train as validation during training
    # Save test and train sequences
    save_sequences(train, f'{folder}/train/Sequences') # Will move to subdirectories after feature extraction
    save_sequences(test, f'{folder}/test/Sequences')
    
    print('Done with', folder)

Done with ../data/MetazoaZoonosis
Done with ../data/Zoonosis
Done with ../data/Plant-HumanZoonosis
Done with ../data/NonEukaryote-Human
Done with ../data/DNA-MetazoaZoonosis
Done with ../data/RNA-MetazoaZoonosis
