## Definition of the working set of fungal species for Fungal Divergent Actin (FDA) and trait association analysis

**Notebook summary**

This Jupyter Notebook:
- Is associated with Step 2 of the Pub approach: Defining the ‘working set of species’
- Determines the number of proteins with an AlphaFold structure per fungal species
- Filters out fungal species with fewer than 6000 proteins with available structures


**Context/Goal reminder**
To carry out trait mapping and association analysis between the presence/absence of Fungal Divergent Actin (FDA) and fungal traits, we need to confidently establish the set of fungal species that do or do not possess FDA. We call this set the working set. Because ProteinCartography relies on protein structures available in UniProt and AlphaFold, we define the working set as any fungal species that has at least 6000 protein structures in AlphaFold. We chose this threshold because studies have shown that fungi can possess as few as 6000 proteins (https://doi.org/10.1186/s12575-015-0020-z; https://doi.org/10.1038/s41598-021-86201-6).

**Notebook description**
In this notebook, we reformat the output of a UniProt query to obtain the number of protein structures per fungal species. We then filter out any fungal species that possesses fewer than 6000 available structures, and retrieve taxonomic information from NCBI for the species that remain.

---

### Setup path and environment

In [1]:
setwd('..')

library(dplyr)
library(rentrez)
library(XML)
library(readr)
library(tidyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [None]:
getwd()

### Function definition

In [2]:
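## Function to fetch lineage information from the NCBI Taxonomy database for a given species name
 # Returns a data frame with one row per taxon (columns Rank and Name), or NULL if the search finds no match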

get_lineage_from_species <- function(species_name){
  search_result <- entrez_search(db = "taxonomy", term = species_name, retmax = 1)
  
  if (length(search_result$ids) == 0) {
    return(NULL)
  }
  
  taxonomy_id <- search_result$ids[1]
  taxonomy_record <- entrez_fetch(db = "taxonomy", id = taxonomy_id, rettype = "xml")
  
  # Parse the XML response
  tax_doc <- xml2::read_xml(taxonomy_record)
  
  # Extract taxonomic ranks and names
  ranks <- xml2::xml_text(xml2::xml_find_all(tax_doc, ".//Rank"))
  names <- xml2::xml_text(xml2::xml_find_all(tax_doc, ".//ScientificName"))
  
  # Combine ranks and names into a data frame
  taxonomic_info <- data.frame(Rank = ranks, Name = names)
  
  return(taxonomic_info)
  
}
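
To see what the function returns without querying NCBI, the parsing step can be run on a hand-written record. The XML below is a hypothetical, heavily simplified imitation of a taxonomy record (a real response carries many more fields), and `mock_record`, `taxon_names` and `mock_lineage` are names introduced only for this sketch. Only the two XPath queries mirror the function above.

```r
library(xml2)

# Hypothetical, simplified stand-in for an NCBI taxonomy XML record
mock_record <- '<TaxaSet>
  <Taxon>
    <ScientificName>Absidia glauca</ScientificName>
    <Rank>species</Rank>
    <LineageEx>
      <Taxon><ScientificName>Fungi</ScientificName><Rank>kingdom</Rank></Taxon>
      <Taxon><ScientificName>Mucoromycota</ScientificName><Rank>phylum</Rank></Taxon>
      <Taxon><ScientificName>Absidia</ScientificName><Rank>genus</Rank></Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>'

tax_doc <- read_xml(mock_record)

# Same XPath queries as in get_lineage_from_species()
ranks <- xml_text(xml_find_all(tax_doc, ".//Rank"))
taxon_names <- xml_text(xml_find_all(tax_doc, ".//ScientificName"))

# One row per taxon: the queried taxon first, then its lineage in document order
mock_lineage <- data.frame(Rank = ranks, Name = taxon_names, stringsAsFactors = FALSE)
mock_lineage
```

Because both queries walk the document in order, the queried taxon itself comes first, followed by each ancestor in the lineage.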


### Analysis

#### Definition of working set

In [3]:
## Data import
 # We exported a tsv file from UniProt and saved it with a .csv extension; the content is still tab-separated, hence sep = '\t'

data_fungi_prot=read.csv('data/step2/Fungi_prot_uniprot.csv',header=TRUE,sep = '\t') # this is a large file, may take a while to load
head(data_fungi_prot)

Entry,Organism,AlphaFoldDB
A0A010SAB3,Colletotrichum fioriniae PJ7,A0A010SAB3;
A0A015JW94,Rhizophagus irregularis (strain DAOM 197198w) (Glomus intraradices),A0A015JW94;
A0A017SIQ0,Aspergillus ruber (strain CBS 135680),A0A017SIQ0;
A0A022VTE5,Trichophyton rubrum CBS 288.86,A0A022VTE5;
A0A022XK41,Trichophyton soudanense CBS 452.61,A0A022XK41;
A0A023I7E1,Rhizomucor miehei,A0A023I7E1;


In [4]:
dim(data_fungi_prot)

In [5]:
## Number of proteins with a structure per species
 # We group the data by Organism (species) and sum the number of proteins per organism

data_fungi_prot$ps_count=1 # we add a value of 1 for each protein in order to count them afterwards
data_fungi_g=data_fungi_prot%>% group_by(Organism)
data_prot_table= data_fungi_g %>% summarise(n_prot=sum(ps_count))

head(data_prot_table,10)

Organism,n_prot
'Aporospora terricola' (nom. ined.),2
[Acremonium] antarcticum,2
[Acremonium] macroclavatum,1
[Anthodidymella] ranunculacearum (nom. inval.),5
[Arthonia] cinnabarina f. cuspidans,4
[Bisifusarium] tonghuanum,5
[Caloplaca] arnoldii subsp. oblitterata,1
[Candida] adriatica,2
[Candida] aechmeae,4
[Candida] akabanensis,3
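
The same per-organism tally can also be obtained with `dplyr::count()`, which avoids the helper column of ones. This is a minimal sketch on a toy table; `toy_prot`, `toy_counts` and the organism labels are made up for illustration and are not part of the notebook's data.

```r
library(dplyr)

# Toy stand-in for the UniProt export: one row per protein entry
toy_prot <- data.frame(
  Entry = c("P1", "P2", "P3", "P4", "P5"),
  Organism = c("Species A", "Species A", "Species B", "Species A", "Species B"),
  stringsAsFactors = FALSE
)

# count() groups by Organism and tallies the rows in each group (column n),
# which is then renamed to match the notebook's n_prot column
toy_counts <- toy_prot %>%
  count(Organism) %>%
  rename(n_prot = n)

toy_counts
```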


In [6]:
dim(data_prot_table)

In [7]:
## Save information about number of protein with available structure per fungal species

write.csv(data_prot_table,'results/step2/uniprot_fungal_species_and_proteinpdb.csv')

In [8]:
## Filtering out Fungal species with fewer than 6000 proteins

data_wset=subset(data_prot_table, data_prot_table$n_prot>=6000)

head(data_wset,10)

Organism,n_prot
[Candida] intermedia,10617
[Torrubiella] hemipterigena,11065
Aaosphaeria arxii CBS 175.79,13815
Absidia glauca (Pin mould),14217
Absidia repens,14353
Acaromyces ingoldii,7585
Acidomyces richmondensis BFW,10856
Agaricus bisporus var. burnettii (strain JB137-S8 / ATCC MYA-4627 / FGSC 10392) (White button mushroom),10948
Ajellomyces capsulatus (strain G186AR / H82 / ATCC MYA-2454 / RMSCC 2432) (Darling's disease fungus) (Histoplasma capsulatum),9199
Ajellomyces capsulatus (strain H143) (Darling's disease fungus) (Histoplasma capsulatum),9314


In [9]:
dim(data_wset)

In [11]:
## Save the data for fungal species with at least 6000 protein structures

write.csv(data_wset,'results/step2/data_wset.csv')


#### Taxonomic information for species of the working set

In [12]:
## Reformat Organism name to fetch taxonomy in NCBI Taxonomy database

data_wset$species=gsub("\\(.*","",data_wset$Organism)  # adding a new column that will be the species name to search in NCBI
colnames(data_wset)=c('Organism_uniprot','n_prot','species_name')
species=c(unique(data_wset$species_name))
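
A quick look at what this pattern does to a few organism strings from the table above. The sketch also shows that the space preceding the first "(" is left behind; `trimws()` would remove it, but the notebook does not do this. If trimming were added, the same trimmed names would have to be used in every later step that matches on the species name.

```r
# Example UniProt organism strings taken from the working set
organism <- c(
  "Absidia glauca (Pin mould)",
  "Absidia repens",
  "Agaricus bisporus var. burnettii (strain JB137-S8 / ATCC MYA-4627 / FGSC 10392) (White button mushroom)"
)

# Same pattern as the notebook: drop everything from the first "(" onward
stripped <- gsub("\\(.*", "", organism)

# Optional extra step (not in the notebook): remove the trailing space
trimmed <- trimws(stripped)

# Number of whitespace characters removed by trimws() for each string
nchar(stripped) - nchar(trimmed)
```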

In [16]:
## Fetching lineage information for each species in the NCBI Taxonomy database
 # We create a for loop that calls the 'get_lineage_from_species' function for each species of the table
 # Taxonomic information is stored in a new table; species for which the taxonomy search was unsuccessful are stored in a separate vector so that they can be searched and added manually

data_taxo=data.frame()  # initiate the table for taxonomic information
species_no_taxo=c() # initiate the vector for unsuccessful taxonomy search

species=c(unique(data_wset$species_name))

In [25]:
## For loop that fetches taxonomy information for each species
# Note: this loop is not ideal. It is interrupted whenever the connection with NCBI is lost, so we print(i) at each iteration to keep track of where the loop stopped and restart it from that index (606 in this run)

for (i in 606:length(species)){
  gen=species[i]  # index the same (unique) vector that sets the loop bounds
  lin=get_lineage_from_species(gen)
  if(is.null(lin)){
    species_no_taxo=c(species_no_taxo,gen)  # no match in NCBI Taxonomy: curate manually
  } else {
    dat_temp=lin
    dat_temp$Organism=gen
    data_taxo=rbind(data_taxo,dat_temp)
  }
  print(i)
  
}

[1] 606
[1] 607
[1] 608
[1] 609
[1] 610
[1] 611
[1] 612
[1] 613
[1] 614
[1] 615
[1] 616
[1] 617
[1] 618
[1] 619
[1] 620
[1] 621
[1] 622
[1] 623
[1] 624
[1] 625
[1] 626
[1] 627
[1] 628
[1] 629
[1] 630
[1] 631
[1] 632
[1] 633
[1] 634
[1] 635
[1] 636
[1] 637
[1] 638
[1] 639
[1] 640
[1] 641
[1] 642
[1] 643
[1] 644
[1] 645
[1] 646
[1] 647
[1] 648
[1] 649
[1] 650
[1] 651
[1] 652
[1] 653
[1] 654
[1] 655
[1] 656
[1] 657
[1] 658
[1] 659
[1] 660
[1] 661
[1] 662
[1] 663
[1] 664
[1] 665
[1] 666
[1] 667
[1] 668
[1] 669
[1] 670
[1] 671
[1] 672
[1] 673
[1] 674
[1] 675
[1] 676
[1] 677
[1] 678
[1] 679
[1] 680
[1] 681
[1] 682
[1] 683
[1] 684
[1] 685
[1] 686
[1] 687
[1] 688
[1] 689
[1] 690
[1] 691
[1] 692
[1] 693
[1] 694
[1] 695
[1] 696
[1] 697
[1] 698
[1] 699
[1] 700
[1] 701
[1] 702
[1] 703
[1] 704
[1] 705
[1] 706
[1] 707
[1] 708
[1] 709
[1] 710
[1] 711
[1] 712
[1] 713
[1] 714
[1] 715
[1] 716
[1] 717
[1] 718
[1] 719
[1] 720
[1] 721
[1] 722
[1] 723
[1] 724
[1] 725
[1] 726
[1] 727
[1] 728
[1] 729
[1] 730
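
Instead of restarting the loop by hand after each dropped connection, each lookup could be wrapped in a retry. The sketch below is one possible approach and not what the notebook ran: `fetch_with_retry` is a hypothetical helper, and `flaky_fetch` is a stand-in for `get_lineage_from_species()` that fails once on purpose so the example runs offline.

```r
# Hypothetical helper: call fetch_fun(x), retrying after an error
fetch_with_retry <- function(x, fetch_fun, max_tries = 3, wait_seconds = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = fetch_fun(x)),
      error = function(e) list(ok = FALSE, value = NULL)
    )
    if (result$ok) return(result$value)  # may legitimately be NULL (no match)
    Sys.sleep(wait_seconds)              # pause before the next attempt
  }
  stop("all attempts failed for: ", x)
}

# Stand-in for get_lineage_from_species(): errors on the first call, then succeeds
n_calls <- 0
flaky_fetch <- function(x) {
  n_calls <<- n_calls + 1
  if (n_calls == 1) stop("connection lost")
  data.frame(Rank = "kingdom", Name = "Fungi", stringsAsFactors = FALSE)
}

lineage <- fetch_with_retry("Absidia glauca", flaky_fetch, wait_seconds = 0)
lineage
```

In the real loop, `fetch_fun` would be `get_lineage_from_species` and `wait_seconds` would be set to a pause long enough for the connection to recover.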


In [27]:
data_taxo_s=data_taxo%>% distinct  #removes any potential duplicates that were added by restarting the for loop at a given i
dim(data_taxo_s)

In [28]:
head(data_taxo_s,10)

Rank,Name,Organism
strain,Aaosphaeria arxii CBS 175.79,Aaosphaeria arxii CBS 175.79
no rank,cellular organisms,Aaosphaeria arxii CBS 175.79
superkingdom,Eukaryota,Aaosphaeria arxii CBS 175.79
clade,Opisthokonta,Aaosphaeria arxii CBS 175.79
kingdom,Fungi,Aaosphaeria arxii CBS 175.79
subkingdom,Dikarya,Aaosphaeria arxii CBS 175.79
phylum,Ascomycota,Aaosphaeria arxii CBS 175.79
clade,saccharomyceta,Aaosphaeria arxii CBS 175.79
subphylum,Pezizomycotina,Aaosphaeria arxii CBS 175.79
clade,leotiomyceta,Aaosphaeria arxii CBS 175.79


In [29]:
length(species_no_taxo)

In [30]:
head(species_no_taxo,10)

In [31]:
## Data cleaning
 # We clean up taxonomic information by removing any row whose rank is 'no rank', 'clade' or 'species', reshape the table so that each rank becomes a column, and then keep only kingdom, phylum, class, order, family and genus

data_taxo_t=subset(data_taxo_s, data_taxo_s$Rank!='no rank' & data_taxo_s$Rank!='clade'& data_taxo_s$Rank!='species' ) %>% distinct

data_taxo_sp=spread(data_taxo_t,Rank,Name)

data_simplified=data_taxo_sp[,c('Organism','kingdom','phylum','class',
                                'order','family','genus')]

head(data_simplified,10)

Organism,kingdom,phylum,class,order,family,genus
Aaosphaeria arxii CBS 175.79,Fungi,Ascomycota,Dothideomycetes,Pleosporales,,Aaosphaeria
Absidia glauca,Fungi,Mucoromycota,Mucoromycetes,Mucorales,Cunninghamellaceae,Absidia
Absidia repens,Fungi,Mucoromycota,Mucoromycetes,Mucorales,Cunninghamellaceae,Absidia
Acaromyces ingoldii,Fungi,Basidiomycota,Exobasidiomycetes,Exobasidiales,Cryptobasidiaceae,Acaromyces
Acidomyces richmondensis BFW,Fungi,Ascomycota,Dothideomycetes,Mycosphaerellales,Teratosphaeriaceae,Acidomyces
Agaricus bisporus var. burnettii,Fungi,Basidiomycota,Agaricomycetes,Agaricales,Agaricaceae,Agaricus
Ajellomyces dermatitidis,Fungi,Ascomycota,Eurotiomycetes,Onygenales,Ajellomycetaceae,Blastomyces
Akanthomyces lecanii RCEF 1005,Fungi,Ascomycota,Sordariomycetes,Hypocreales,Cordycipitaceae,Akanthomyces
Allomyces macrogynus,Fungi,Blastocladiomycota,Blastocladiomycetes,Blastocladiales,Blastocladiaceae,Allomyces
Alternaria alternata,Fungi,Ascomycota,Dothideomycetes,Pleosporales,Pleosporaceae,Alternaria
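
The long-to-wide reshaping done by `spread()` can be illustrated on a toy table (`toy_taxo` and `toy_wide` are names made up for this sketch). `spread()` needs each Organism/Rank pair to occur only once, which is presumably one reason rows with repeated ranks such as 'clade' and 'no rank' are dropped beforehand.

```r
library(tidyr)

# Toy long-format lineage table: one row per organism and rank
toy_taxo <- data.frame(
  Organism = rep(c("Absidia glauca", "Alternaria alternata"), each = 3),
  Rank = rep(c("kingdom", "phylum", "genus"), times = 2),
  Name = c("Fungi", "Mucoromycota", "Absidia",
           "Fungi", "Ascomycota", "Alternaria"),
  stringsAsFactors = FALSE
)

# Each distinct value of Rank becomes a column filled with the matching Name
toy_wide <- spread(toy_taxo, Rank, Name)
toy_wide
```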


In [32]:
dim(data_simplified)

In [33]:
## Check for presence of taxonomy errors
no_fungi_check=subset(data_simplified, data_simplified$kingdom!='Fungi')
dim(no_fungi_check) 

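Two species were assigned incorrect taxonomy information: for both, the NCBI search returned a non-fungal taxon (kingdom Metazoa and kingdom Viridiplantae).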

In [34]:
no_fungi_check

Unnamed: 0,Organism,kingdom,phylum,class,order,family,genus
353,Gibberella moniliformis,Metazoa,Acanthocephala,Archiacanthocephala,Moniliformida,Moniliformidae,Moniliformis
436,Melampsora larici-populina,Viridiplantae,Streptophyta,Magnoliopsida,Lamiales,Acanthaceae,Populina


In [45]:
## We can replace these two entries with the correct information
 # We first need to get the location of these two errors

rowN_Gibb <- which(grepl('Gibberella moniliformis',data_simplified$Organism)) 

rowN_Mel <- which(grepl('Melampsora larici-populina',data_simplified$Organism)) 

rowN_Gibb

rowN_Mel



In [62]:
# First convert every column of data_simplified to character
data_simplified_ch <- data_simplified %>%
  mutate_all(~as.character(.))

In [63]:
#For Melampsora
data_simplified_ch[436,]=c('Melampsora larici-populina', 
                        'Fungi',
                        'Basidiomycota',
                        'Pucciniomycetes',
                        'Pucciniales',
                        'Melampsoraceae',
                        'Melampsora')

#For Gibberella
data_simplified_ch[353,]=c('Gibberella moniliformis',
                    'Fungi',
                    'Ascomycota',
                    'Sordariomycetes',
                    'Hypocreales', 
                    'Nectriaceae',
                    'Fusarium')
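
The same kind of correction can be written without hard-coded row numbers by locating the row through its name and checking that exactly one row matches. This is a sketch on a toy table (`toy` and `row_mel` are made-up names); exact matching with `==` assumes the stored names carry no trailing whitespace.

```r
# Toy table with one wrong entry, mimicking the Melampsora case
toy <- data.frame(
  Organism = c("Absidia glauca", "Melampsora larici-populina"),
  kingdom = c("Fungi", "Viridiplantae"),
  genus = c("Absidia", "Populina"),
  stringsAsFactors = FALSE
)

# Locate the row by exact name match instead of a fixed row number
row_mel <- which(toy$Organism == "Melampsora larici-populina")
stopifnot(length(row_mel) == 1)  # guard against zero or multiple matches

# Overwrite only the columns that were wrong
toy[row_mel, c("kingdom", "genus")] <- c("Fungi", "Melampsora")
toy
```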


In [64]:
head(data_simplified_ch,10)

Organism,kingdom,phylum,class,order,family,genus
Aaosphaeria arxii CBS 175.79,Fungi,Ascomycota,Dothideomycetes,Pleosporales,,Aaosphaeria
Absidia glauca,Fungi,Mucoromycota,Mucoromycetes,Mucorales,Cunninghamellaceae,Absidia
Absidia repens,Fungi,Mucoromycota,Mucoromycetes,Mucorales,Cunninghamellaceae,Absidia
Acaromyces ingoldii,Fungi,Basidiomycota,Exobasidiomycetes,Exobasidiales,Cryptobasidiaceae,Acaromyces
Acidomyces richmondensis BFW,Fungi,Ascomycota,Dothideomycetes,Mycosphaerellales,Teratosphaeriaceae,Acidomyces
Agaricus bisporus var. burnettii,Fungi,Basidiomycota,Agaricomycetes,Agaricales,Agaricaceae,Agaricus
Ajellomyces dermatitidis,Fungi,Ascomycota,Eurotiomycetes,Onygenales,Ajellomycetaceae,Blastomyces
Akanthomyces lecanii RCEF 1005,Fungi,Ascomycota,Sordariomycetes,Hypocreales,Cordycipitaceae,Akanthomyces
Allomyces macrogynus,Fungi,Blastocladiomycota,Blastocladiomycetes,Blastocladiales,Blastocladiaceae,Allomyces
Alternaria alternata,Fungi,Ascomycota,Dothideomycetes,Pleosporales,Pleosporaceae,Alternaria


In [58]:
## Saving list of species with unsuccessful taxonomy query
 # This represents 21 species for which we curated taxonomy information manually

write.csv(unique(species_no_taxo),'results/step2/species_need_taxo_manually.csv')

In [65]:
## Import the manually curated taxonomy for missing species

man_cur=read.csv('data/step2/manually_curated_taxonomy_species.csv')

In [68]:
## Final table of working set with correct taxonomy information

all_uni_data=rbind(data_simplified_ch, man_cur)

 # create a final table with protein information as well

colnames(all_uni_data)=c('species_name', colnames(all_uni_data[,2:7]))
data_uni_id=left_join(data_wset,all_uni_data,by='species_name')
head(data_uni_id,10)

Organism_uniprot,n_prot,species_name,kingdom,phylum,class,order,family,genus
[Candida] intermedia,10617,[Candida] intermedia,Fungi,Ascomycota,Saccharomycetes,Saccharomycetales,Metschnikowiaceae,Candida
[Torrubiella] hemipterigena,11065,[Torrubiella] hemipterigena,Fungi,Ascomycota,Sordariomycetes,Hypocreales,Clavicipitaceae,Torrubiella
Aaosphaeria arxii CBS 175.79,13815,Aaosphaeria arxii CBS 175.79,Fungi,Ascomycota,Dothideomycetes,Pleosporales,,Aaosphaeria
Absidia glauca (Pin mould),14217,Absidia glauca,Fungi,Mucoromycota,Mucoromycetes,Mucorales,Cunninghamellaceae,Absidia
Absidia repens,14353,Absidia repens,Fungi,Mucoromycota,Mucoromycetes,Mucorales,Cunninghamellaceae,Absidia
Acaromyces ingoldii,7585,Acaromyces ingoldii,Fungi,Basidiomycota,Exobasidiomycetes,Exobasidiales,Cryptobasidiaceae,Acaromyces
Acidomyces richmondensis BFW,10856,Acidomyces richmondensis BFW,Fungi,Ascomycota,Dothideomycetes,Mycosphaerellales,Teratosphaeriaceae,Acidomyces
Agaricus bisporus var. burnettii (strain JB137-S8 / ATCC MYA-4627 / FGSC 10392) (White button mushroom),10948,Agaricus bisporus var. burnettii,Fungi,Basidiomycota,Agaricomycetes,Agaricales,Agaricaceae,Agaricus
Ajellomyces capsulatus (strain G186AR / H82 / ATCC MYA-2454 / RMSCC 2432) (Darling's disease fungus) (Histoplasma capsulatum),9199,Ajellomyces capsulatus,Fungi,Ascomycota,Eurotiomycetes,Onygenales,Ajellomycetaceae,Ajellomyces
Ajellomyces capsulatus (strain H143) (Darling's disease fungus) (Histoplasma capsulatum),9314,Ajellomyces capsulatus,Fungi,Ascomycota,Eurotiomycetes,Onygenales,Ajellomycetaceae,Ajellomyces
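
After a `left_join()` like the one above, rows of the working set that found no taxonomy match carry `NA` in the joined columns, so a quick check can list them. The tables below are toy data: "Unknown sp." and its count are invented for the example, and `toy_wset`, `toy_taxo`, `toy_joined` and `missing_taxo` are names used only in this sketch.

```r
library(dplyr)

# Toy working set and toy taxonomy table; "Unknown sp." has no taxonomy entry
toy_wset <- data.frame(
  species_name = c("Absidia glauca", "Absidia repens", "Unknown sp."),
  n_prot = c(14217, 14353, 7000),
  stringsAsFactors = FALSE
)
toy_taxo <- data.frame(
  species_name = c("Absidia glauca", "Absidia repens"),
  kingdom = c("Fungi", "Fungi"),
  stringsAsFactors = FALSE
)

toy_joined <- left_join(toy_wset, toy_taxo, by = "species_name")

# Rows without a taxonomy match have NA in the joined columns
missing_taxo <- toy_joined %>% filter(is.na(kingdom))
missing_taxo$species_name
```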


In [69]:
## Save the data

write.csv(data_uni_id,'results/step2/uniprot_all_wset_taxo.csv')

In [70]:
sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS  10.16

Matrix products: default
BLAS/LAPACK: /Users/manonmorin/miniconda3/envs/R_good_env/lib/libopenblasp-r0.3.24.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_0.8.3   readr_1.3.1   XML_3.98-1.19 rentrez_1.2.3 dplyr_0.8.0.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       xml2_1.2.0       magrittr_1.5     hms_0.4.2       
 [5] tidyselect_0.2.5 uuid_0.1-2       R6_2.4.0         rlang_0.3.4     
 [9] httr_1.4.0       tools_3.6.3      htmltools_0.3.6  assertthat_0.2.1
[13] digest_0.6.18    tibble_2.1.1     crayon_1.3.4     IRdisplay_0.7.0 
[17] purrr_0.3.2      repr_0.19.2      base64enc_0.1-3  curl_3.3        
[21] IRkernel_0.8.15  glue_1.3.1       evaluate_0.13    pbdZMQ_0.3-3    
[25] compiler_3.6.3   pilla