# Building the `community` Database

This notebook explains the essential functions for constructing the `community` database. These functions let you update the database automatically or intervene manually during preprocessing. You can also customize the database by providing your own annotations or your own lists of ligands and receptors to match your research requirements.

**To quickly update the database, load the package and run `auto_update_db()` as shown in the cells below.**

If you are using the `community` conda environment, the necessary libraries should already be installed. If you are using a different virtual environment and do not have [mygene](https://mygene.info/) and [OmnipathR](https://omnipathdb.org/) installed, please install them first.

In [1]:
library(community) # load community package

In [2]:
sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] community_1.4.1

loaded via a namespace (and not attached):
  [1] readxl_1.4.1                uuid_1.1-0                 
  [3] backports_1.4.1             Hmisc_4.7-2                
  [5] BiocFileCache_2.2.1         plyr_1.8.8             

In [3]:
LR_database <- auto_update_db("both") 

[1] "Retrieved interactions from both DB"
[1] "2109 Number of complex pairs detected"
[1] "13582 Number of non-redundant binary pairs produced"
[1] "1262 Number of binary pairs detected through PPI"
[1] "Number of PPI network interactions found:"
[1] 1262
[1] "7415 Non-redundant number of pairs in the DB"


“If this function fails, it may be due to internet connectivity issues. Try running it again.”
Querying chunk 1

Querying chunk 2

Querying chunk 3



Finished
Pass returnall=TRUE to return lists of duplicate or missing query terms.


If you do not have the `mygene` and `OmnipathR` libraries installed, uncomment the blocks below by removing the hash symbol (`#`) and run them.

In [4]:
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")

# BiocManager::install("mygene")

In [5]:
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")

# BiocManager::install("OmnipathR")

# Building step by step



### Import the database of interest from OmniPath

`import_db()` imports ligand-receptor interaction data for the specified database type: `noncurated`, `curated`, or `both`.

In [6]:
db <- import_db("both")
# db <- import_db("curated")
# db <- import_db("noncurated")

[1] "Retrieved interactions from both DB"


### Break down complex interactions

Next, we process the database to handle rows where either the source or the target is a complex. `create_pairwise_pairs()` splits each such complex interaction into pairwise binary interactions.

In [7]:
pairwise_pairs <- create_pairwise_pairs(db)

[1] "2109 Number of complex pairs detected"
[1] "13582 Number of non-redundant binary pairs produced"


In [8]:
head(pairwise_pairs)

Unnamed: 0_level_0,Pair.Name,Ligand,Receptor,complex_pair,source,target,source_genesymbol,target_genesymbol,is_directed,is_stimulation,is_inhibition,consensus_direction,consensus_stimulation,consensus_inhibition,sources,references,curation_effort,n_references,n_resources,annotation_strategy
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<int>,<chr>
1,IL17A_IL17RA,IL17A,IL17RA,IL17A_IL17RA_IL17RC,Q16552,COMPLEX:Q8NAC3_Q96F46,IL17A,IL17RA_IL17RC,1,1,0,1,1,0,CellChatDB;CellPhoneDB;Cellinker;ICELLNET;SIGNOR,Cellinker:19838198;Cellinker:25204502;Cellinker:9367539;ICELLNET:24011563;SIGNOR:32024054,5,5,5,both
2,IL17A_IL17RC,IL17A,IL17RC,IL17A_IL17RA_IL17RC,Q16552,COMPLEX:Q8NAC3_Q96F46,IL17A,IL17RA_IL17RC,1,1,0,1,1,0,CellChatDB;CellPhoneDB;Cellinker;ICELLNET;SIGNOR,Cellinker:19838198;Cellinker:25204502;Cellinker:9367539;ICELLNET:24011563;SIGNOR:32024054,5,5,5,both
3,IL17RA_IL17RC,IL17RA,IL17RC,IL17A_IL17RA_IL17RC,Q16552,COMPLEX:Q8NAC3_Q96F46,IL17A,IL17RA_IL17RC,1,1,0,1,1,0,CellChatDB;CellPhoneDB;Cellinker;ICELLNET;SIGNOR,Cellinker:19838198;Cellinker:25204502;Cellinker:9367539;ICELLNET:24011563;SIGNOR:32024054,5,5,5,both
4,IL17RA_IL17A,IL17RA,IL17A,IL17A_IL17RA_IL17RC,Q16552,COMPLEX:Q8NAC3_Q96F46,IL17A,IL17RA_IL17RC,1,1,0,1,1,0,CellChatDB;CellPhoneDB;Cellinker;ICELLNET;SIGNOR,Cellinker:19838198;Cellinker:25204502;Cellinker:9367539;ICELLNET:24011563;SIGNOR:32024054,5,5,5,both
5,IL17RC_IL17A,IL17RC,IL17A,IL17A_IL17RA_IL17RC,Q16552,COMPLEX:Q8NAC3_Q96F46,IL17A,IL17RA_IL17RC,1,1,0,1,1,0,CellChatDB;CellPhoneDB;Cellinker;ICELLNET;SIGNOR,Cellinker:19838198;Cellinker:25204502;Cellinker:9367539;ICELLNET:24011563;SIGNOR:32024054,5,5,5,both
6,IL17RC_IL17RA,IL17RC,IL17RA,IL17A_IL17RA_IL17RC,Q16552,COMPLEX:Q8NAC3_Q96F46,IL17A,IL17RA_IL17RC,1,1,0,1,1,0,CellChatDB;CellPhoneDB;Cellinker;ICELLNET;SIGNOR,Cellinker:19838198;Cellinker:25204502;Cellinker:9367539;ICELLNET:24011563;SIGNOR:32024054,5,5,5,both
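
The six rows above all come from the single complex `IL17A_IL17RA_IL17RC`, expanded into every ordered pair of its members. The following base R sketch reproduces that expansion for one complex name. `expand_complex` is a hypothetical helper written for illustration, not a function of the `community` package, and it assumes that member gene symbols are separated by underscores and do not contain underscores themselves.

```r
# Illustrative sketch only: expand one complex into all ordered binary pairs.
# `expand_complex` is a hypothetical helper, not part of the community package.
expand_complex <- function(complex_pair) {
  # Assumption: members are separated by "_" and contain no "_" themselves
  members <- strsplit(complex_pair, "_", fixed = TRUE)[[1]]

  # All ordered combinations of members, then drop self-pairs
  grid <- expand.grid(Ligand = members, Receptor = members,
                      stringsAsFactors = FALSE)
  grid <- grid[grid$Ligand != grid$Receptor, ]

  grid$Pair.Name <- paste(grid$Ligand, grid$Receptor, sep = "_")
  grid$complex_pair <- complex_pair
  rownames(grid) <- NULL
  grid[, c("Pair.Name", "Ligand", "Receptor", "complex_pair")]
}

toy_pairs <- expand_complex("IL17A_IL17RA_IL17RC")
print(toy_pairs)
```

A complex with three members yields six ordered pairs, which matches the six `IL17` rows in the table above. How `create_pairwise_pairs()` carries over the remaining metadata columns is not shown here.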


### Filter through PPI

Now, we filter those binary pairs based on their presence in the protein-protein interaction (PPI) network.

In [9]:
pt_interactions <- filter_pairs_with_ppi(pairwise_pairs)

[1] "1262 Number of binary pairs detected through PPI"


In [10]:
head(pt_interactions)

Unnamed: 0_level_0,Pair.Name,Ligand,Receptor,complex_pair,source,target,source_genesymbol,target_genesymbol,is_directed,is_stimulation,is_inhibition,consensus_direction,consensus_stimulation,consensus_inhibition,sources,references,curation_effort,n_references,n_resources,annotation_strategy
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<int>,<chr>
1,IL17A_IL17RA,IL17A,IL17RA,IL17A_IL17RA_IL17RC,Q16552,COMPLEX:Q8NAC3_Q96F46,IL17A,IL17RA_IL17RC,1,1,0,1,1,0,CellChatDB;CellPhoneDB;Cellinker;ICELLNET;SIGNOR,Cellinker:19838198;Cellinker:25204502;Cellinker:9367539;ICELLNET:24011563;SIGNOR:32024054,5,5,5,both
2,IL17A_IL17RC,IL17A,IL17RC,IL17A_IL17RA_IL17RC,Q16552,COMPLEX:Q8NAC3_Q96F46,IL17A,IL17RA_IL17RC,1,1,0,1,1,0,CellChatDB;CellPhoneDB;Cellinker;ICELLNET;SIGNOR,Cellinker:19838198;Cellinker:25204502;Cellinker:9367539;ICELLNET:24011563;SIGNOR:32024054,5,5,5,both
3,IL17RA_IL17A,IL17RA,IL17A,IL17A_IL17RA_IL17RC,Q16552,COMPLEX:Q8NAC3_Q96F46,IL17A,IL17RA_IL17RC,1,1,0,1,1,0,CellChatDB;CellPhoneDB;Cellinker;ICELLNET;SIGNOR,Cellinker:19838198;Cellinker:25204502;Cellinker:9367539;ICELLNET:24011563;SIGNOR:32024054,5,5,5,both
4,NPNT_ITGA8,NPNT,ITGA8,NPNT_ITGA8_ITGB1,Q6UXI9,COMPLEX:P05556_P53708,NPNT,ITGA8_ITGB1,1,1,0,1,1,0,Baccin2019;SIGNOR,Baccin2019:16988024;SIGNOR:22613833,2,2,2,LR
5,NPNT_ITGB1,NPNT,ITGB1,NPNT_ITGA8_ITGB1,Q6UXI9,COMPLEX:P05556_P53708,NPNT,ITGA8_ITGB1,1,1,0,1,1,0,Baccin2019;SIGNOR,Baccin2019:16988024;SIGNOR:22613833,2,2,2,LR
6,ITGAL_ICAM1,ITGAL,ICAM1,ITGAL_ITGB2_ICAM1,COMPLEX:P05107_P20701,P05362,ITGAL_ITGB2,ICAM1,1,1,0,0,0,0,Baccin2019;CellPhoneDB;ICELLNET;SIGNOR,Baccin2019:16988024;ICELLNET:10940895;ICELLNET:23418628;SIGNOR:12808052,4,4,4,both
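
Conceptually, this step keeps a candidate pair only if the same interaction appears in a reference PPI edge list. The base R sketch below shows that idea on toy data. The `toy_ppi` table is made up for illustration and is not the actual OmniPath network, and whether `filter_pairs_with_ppi()` matches pairs in a direction-aware way is an assumption of this sketch.

```r
# Illustrative sketch only: keep candidate pairs that occur in a PPI edge list.
# Both tables are toy data; `toy_ppi` is NOT the real OmniPath PPI network.
candidate_pairs <- data.frame(
  Ligand   = c("IL17A", "IL17A", "IL17RA"),
  Receptor = c("IL17RA", "IL17RC", "IL17RC"),
  stringsAsFactors = FALSE
)

toy_ppi <- data.frame(
  source = c("IL17A", "IL17A"),
  target = c("IL17RA", "IL17RC"),
  stringsAsFactors = FALSE
)

# Build a directed key "source_target" for both tables and match on it
pair_key <- function(a, b) paste(a, b, sep = "_")
in_ppi <- pair_key(candidate_pairs$Ligand, candidate_pairs$Receptor) %in%
  pair_key(toy_ppi$source, toy_ppi$target)

kept_pairs <- candidate_pairs[in_ppi, ]
print(kept_pairs)
```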


### Merge these binary pairs

`process_binary_pairs()` takes the binary pairs from the database and merges them with the binary pairs detected through PPI. It also standardizes and reorders the columns.

In [12]:
complete_data <- process_binary_pairs(db, pt_interactions)

[1] "7415 Non-redundant number of pairs in the DB"


In [13]:
head(complete_data)

Unnamed: 0_level_0,Pair.Name,Ligand,Receptor,source,target,is_directed,is_stimulation,is_inhibition,consensus_direction,consensus_stimulation,consensus_inhibition,sources,references,curation_effort,n_references,n_resources,annotation_strategy,complex_pair
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<int>,<chr>,<chr>
3,JAK2_EPOR,JAK2,EPOR,O60674,P19235,1,1,0,1,1,0,BEL-Large-Corpus_ProtMapper;BioGRID;Cellinker;HPRD;HPRD-phos;HPRD_KEA;HPRD_MIMP;KEA;MIMP;PhosphoNetworks;PhosphoPoint;PhosphoSite_KEA;PhosphoSite_MIMP;ProtMapper;SIGNOR;SIGNOR_ProtMapper;SPIKE;Wang;iPTMnet;phosphoELM;phosphoELM_KEA;phosphoELM_MIMP,BioGRID:8343951;Cellinker:9030561;HPRD-phos:12441334;HPRD:11779507;HPRD:12441334;HPRD:8343951;KEA:10579919;KEA:10660611;KEA:11443118;KEA:12027890;KEA:12441334;KEA:7559499;KEA:9573010;ProtMapper:12441334;ProtMapper:15212693;SIGNOR:12441334;SPIKE:12524467;SPIKE:18672044;iPTMnet:10579919;iPTMnet:12441334;phosphoELM:10579919,21,13,14,LR,
4,NOTCH1_JAG2,NOTCH1,JAG2,P46531,Q9Y219,1,0,1,0,0,0,Baccin2019;CellCall;HPRD;NetPath;Ramilowski2015_Baccin2019;SPIKE,HPRD:11006133;NetPath:11006133;SPIKE:15358736,3,2,5,LR,
7,NOTCH1_DLL1,NOTCH1,DLL1,P46531,O00548,1,1,1,0,0,0,Baccin2019;CellCall;HPRD;NetPath;Ramilowski2015_Baccin2019;SPIKE;SignaLink3;Wang,Baccin2019:11006133;HPRD:11006133;NetPath:11006133;SPIKE:11006133;SPIKE:15358736;SPIKE:17537801;SignaLink3:21985982,7,4,7,LR,
11,SFRP1_WNT4,SFRP1,WNT4,Q8N474,P56705,1,1,1,1,1,0,CancerCellMap;CellChatDB-cofactors;HPRD;NetPath;SIGNOR;SPIKE;SignaLink3;Wang,CancerCellMap:11287180;HPRD:11287180;NetPath:11287180;SIGNOR:11287180;SPIKE:11287180;SignaLink3:11287180;SignaLink3:18988627;SignaLink3:23331499,8,3,8,LR,
12,SFRP2_WNT1,SFRP2,WNT1,Q96HF1,P04628,1,0,1,1,0,1,CellChatDB-cofactors;HPRD;SPIKE;Wang,HPRD:10654605;SPIKE:10654605,2,1,4,LR,
13,SFRP1_WNT2,SFRP1,WNT2,Q8N474,P09544,1,0,1,1,0,1,CellChatDB-cofactors;HPRD;NetPath;SPIKE;Wang,HPRD:10347172;NetPath:10347172;SPIKE:10347172,3,1,5,LR,
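
The merge can be pictured as stacking two tables with the same columns and then dropping repeated pair names. The following base R sketch shows this on toy data. The rows are invented for illustration, and deduplicating on `Pair.Name` is an assumption about how non-redundancy is defined, not a description of the internals of `process_binary_pairs()`.

```r
# Illustrative sketch only: stack two pair tables and drop repeated pair names.
# The rows below are toy data, not output of the community package.
direct_pairs <- data.frame(
  Pair.Name    = c("JAK2_EPOR", "NOTCH1_JAG2"),
  complex_pair = NA_character_,
  stringsAsFactors = FALSE
)

ppi_pairs <- data.frame(
  Pair.Name    = c("IL17A_IL17RA", "JAK2_EPOR"),
  complex_pair = c("IL17A_IL17RA_IL17RC", NA_character_),
  stringsAsFactors = FALSE
)

# rbind() matches data frame columns by name, so both inputs need the same set
merged_pairs <- rbind(direct_pairs, ppi_pairs)

# Keep the first occurrence of every pair name
merged_pairs <- merged_pairs[!duplicated(merged_pairs$Pair.Name), ]
print(merged_pairs)
```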


### Map gene descriptions

Next, we enrich the database with gene descriptions. `map_gene_data()` queries each gene symbol to fetch its description from [MyGene, a gene annotation service](https://mygene.info/).

<div class="alert alert-block alert-info">
<b>Note:</b> This function may fail due to internet connectivity issues. If this is the case, please try again.
</div>



In [14]:
complete_data <- map_gene_data(complete_data)

“If this function fails, it may be due to internet connectivity issues. Try running it again.”
Querying chunk 1

Querying chunk 2

Querying chunk 3



Finished
Pass returnall=TRUE to return lists of duplicate or missing query terms.
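
Because the lookup depends on a remote service, one option is to wrap the call in a small retry helper instead of rerunning the cell by hand. The sketch below is a generic base R pattern. `with_retry` is a hypothetical helper and not part of the `community` package, and the simulated failure only stands in for a dropped connection.

```r
# Illustrative sketch only: retry a function call a few times before giving up.
# `with_retry` is a hypothetical helper, not part of the community package.
with_retry <- function(fun, max_attempts = 3, wait_seconds = 2) {
  stopifnot(max_attempts >= 1)
  for (attempt in seq_len(max_attempts)) {
    # Capture an error as a value instead of aborting
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_attempts) Sys.sleep(wait_seconds)
  }
  stop(result)  # re-signal the last error after the final attempt
}

# Demonstration with a function that fails twice and then succeeds
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("simulated network failure")
  "ok"
}
retry_result <- with_retry(flaky, max_attempts = 3, wait_seconds = 0)

# Possible use with this notebook (not run here):
# complete_data <- with_retry(function() map_gene_data(complete_data))
```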


In [15]:
head(complete_data)

Unnamed: 0_level_0,Pair.Name,Ligand,Ligand.Name,Receptor,Receptor.Name,complex_pair,source,target,is_directed,is_stimulation,⋯,consensus_direction,consensus_stimulation,consensus_inhibition,sources,references,curation_effort,n_references,n_resources,annotation_strategy,dup
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<int>,<chr>,<chr>
3,JAK2_EPOR,JAK2,Janus kinase 2,EPOR,erythropoietin receptor,,O60674,P19235,1,1,⋯,1,1,0,BEL-Large-Corpus_ProtMapper;BioGRID;Cellinker;HPRD;HPRD-phos;HPRD_KEA;HPRD_MIMP;KEA;MIMP;PhosphoNetworks;PhosphoPoint;PhosphoSite_KEA;PhosphoSite_MIMP;ProtMapper;SIGNOR;SIGNOR_ProtMapper;SPIKE;Wang;iPTMnet;phosphoELM;phosphoELM_KEA;phosphoELM_MIMP,BioGRID:8343951;Cellinker:9030561;HPRD-phos:12441334;HPRD:11779507;HPRD:12441334;HPRD:8343951;KEA:10579919;KEA:10660611;KEA:11443118;KEA:12027890;KEA:12441334;KEA:7559499;KEA:9573010;ProtMapper:12441334;ProtMapper:15212693;SIGNOR:12441334;SPIKE:12524467;SPIKE:18672044;iPTMnet:10579919;iPTMnet:12441334;phosphoELM:10579919,21,13,14,LR,EPOR_JAK2
4,NOTCH1_JAG2,NOTCH1,notch receptor 1,JAG2,jagged canonical Notch ligand 2,,P46531,Q9Y219,1,0,⋯,0,0,0,Baccin2019;CellCall;HPRD;NetPath;Ramilowski2015_Baccin2019;SPIKE,HPRD:11006133;NetPath:11006133;SPIKE:15358736,3,2,5,LR,JAG2_NOTCH1
7,NOTCH1_DLL1,NOTCH1,notch receptor 1,DLL1,delta like canonical Notch ligand 1,,P46531,O00548,1,1,⋯,0,0,0,Baccin2019;CellCall;HPRD;NetPath;Ramilowski2015_Baccin2019;SPIKE;SignaLink3;Wang,Baccin2019:11006133;HPRD:11006133;NetPath:11006133;SPIKE:11006133;SPIKE:15358736;SPIKE:17537801;SignaLink3:21985982,7,4,7,LR,DLL1_NOTCH1
11,SFRP1_WNT4,SFRP1,secreted frizzled related protein 1,WNT4,Wnt family member 4,,Q8N474,P56705,1,1,⋯,1,1,0,CancerCellMap;CellChatDB-cofactors;HPRD;NetPath;SIGNOR;SPIKE;SignaLink3;Wang,CancerCellMap:11287180;HPRD:11287180;NetPath:11287180;SIGNOR:11287180;SPIKE:11287180;SignaLink3:11287180;SignaLink3:18988627;SignaLink3:23331499,8,3,8,LR,WNT4_SFRP1
12,SFRP2_WNT1,SFRP2,secreted frizzled related protein 2,WNT1,Wnt family member 1,,Q96HF1,P04628,1,0,⋯,1,0,1,CellChatDB-cofactors;HPRD;SPIKE;Wang,HPRD:10654605;SPIKE:10654605,2,1,4,LR,WNT1_SFRP2
13,SFRP1_WNT2,SFRP1,secreted frizzled related protein 1,WNT2,Wnt family member 2,,Q8N474,P09544,1,0,⋯,1,0,1,CellChatDB-cofactors;HPRD;NetPath;SPIKE;Wang,HPRD:10347172;NetPath:10347172;SPIKE:10347172,3,1,5,LR,WNT2_SFRP1


### Annotate gene space

`annotate_components()` annotates each gene in the protein-protein interaction (PPI) network with its corresponding parent category, along with a score indicating how many of the 44 resources have annotated that gene as such.

In [16]:
annotation <- annotate_components(complete_data)

In [17]:
head(annotation)

Unnamed: 0_level_0,genesymbol,score,parent
Unnamed: 0_level_1,<chr>,<dbl>,<chr>
1,JAK2,2,receptor
2,NOTCH1,22,receptor
3,SFRP1,2,receptor
4,SFRP2,6,ligand
5,MTNR1A,14,receptor
6,GRB7,5,intracellular
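
One way to read the `score` column is as a vote count: each resource that assigns a gene to a category contributes one vote, and the category with the most votes becomes the parent. The base R sketch below illustrates that reading on toy data. The gene and resource names are placeholders, and taking the top-scoring category is an assumption of this sketch rather than a confirmed description of `annotate_components()`.

```r
# Illustrative sketch only: count per-resource annotations and keep the top category.
# Gene and resource names are placeholders, not real annotation records.
toy_annotations <- data.frame(
  genesymbol = c("GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_B"),
  resource   = c("res1", "res2", "res3", "res1", "res2"),
  parent     = c("receptor", "receptor", "ligand", "ligand", "ligand"),
  stringsAsFactors = FALSE
)

# Number of resources supporting each (gene, category) combination
counts <- aggregate(resource ~ genesymbol + parent,
                    data = toy_annotations, FUN = length)
names(counts)[names(counts) == "resource"] <- "score"

# Sort by gene and descending score, then keep the best category per gene
counts <- counts[order(counts$genesymbol, -counts$score), ]
top_annotation <- counts[!duplicated(counts$genesymbol),
                         c("genesymbol", "score", "parent")]
rownames(top_annotation) <- NULL
print(top_annotation)
```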


### True Ligand Receptor Pairs

Additionally, as part of the annotation, we identify pairs formed between a ligand and a receptor molecule and label them `True_LR = TRUE`. Other pairs, such as adhesive pairs or receptor-receptor pairs, are marked `True_LR = FALSE`.

In [18]:
true_LR_DB <- process_lr_db(complete_data, annotation)
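
The labeling rule described above can be sketched in base R as a lookup against the annotation table. The gene symbols below are placeholders, and the rule used (ligand partner annotated as `ligand`, receptor partner annotated as `receptor`) is a simplified reading of the text, not the actual implementation of `process_lr_db()`.

```r
# Illustrative sketch only: flag pairs whose partners are annotated ligand -> receptor.
# Gene symbols are placeholders; this is not the internals of process_lr_db().
toy_annotation <- data.frame(
  genesymbol = c("LIG1", "REC1", "REC2"),
  parent     = c("ligand", "receptor", "receptor"),
  stringsAsFactors = FALSE
)

toy_lr_pairs <- data.frame(
  Ligand   = c("LIG1", "REC2"),
  Receptor = c("REC1", "REC1"),
  stringsAsFactors = FALSE
)

ligand_genes   <- toy_annotation$genesymbol[toy_annotation$parent == "ligand"]
receptor_genes <- toy_annotation$genesymbol[toy_annotation$parent == "receptor"]

# TRUE only for ligand-receptor pairs; receptor-receptor pairs become FALSE
toy_lr_pairs$True_LR <- toy_lr_pairs$Ligand %in% ligand_genes &
  toy_lr_pairs$Receptor %in% receptor_genes
print(toy_lr_pairs)
```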

### Process and direction correction on adhesive pairs

`process_adhesive_DB()` processes adhesive interactions, including swapped duplicated pairs (the same two genes listed in both orders). It allows manual curation by letting the user specify lists of genes annotated as ligands or receptors. If no lists are given, ligands and receptors are detected through the annotation table.

In this step we categorize ADAM, Plexin and Neuroligin families as ligands. 

In [19]:
adhesive_DB <- process_adhesive_DB(complete_data, annotation, ligand_list=list(), receptor_list=list())
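
To illustrate what direction correction of swapped duplicates can look like, the base R sketch below flips any pair that is listed receptor-first and then removes the resulting duplicate. The gene symbols and the `toy_ligands` and `toy_receptors` vectors are placeholders, and the logic is a simplified assumption, not the actual behavior of `process_adhesive_DB()`.

```r
# Illustrative sketch only: orient pairs ligand -> receptor, then drop duplicates.
# Gene symbols and curated vectors are placeholders for this toy example.
toy_adhesive <- data.frame(
  Ligand   = c("REC1", "LIG1"),
  Receptor = c("LIG1", "REC1"),
  stringsAsFactors = FALSE
)
toy_ligands   <- c("LIG1")
toy_receptors <- c("REC1")

# A row needs swapping when the receptor sits in the Ligand column and vice versa
needs_swap <- toy_adhesive$Ligand %in% toy_receptors &
  toy_adhesive$Receptor %in% toy_ligands

swapped_ligand <- toy_adhesive$Ligand[needs_swap]
toy_adhesive$Ligand[needs_swap]   <- toy_adhesive$Receptor[needs_swap]
toy_adhesive$Receptor[needs_swap] <- swapped_ligand

# After reorienting, the two rows describe the same pair; keep one of them
toy_adhesive$Pair.Name <- paste(toy_adhesive$Ligand, toy_adhesive$Receptor, sep = "_")
toy_adhesive <- toy_adhesive[!duplicated(toy_adhesive$Pair.Name), ]
print(toy_adhesive)
```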

### Merge adhesive and True LR

In [20]:
LR_database <- rbind(true_LR_DB, adhesive_DB)

In [21]:
str(LR_database)

'data.frame':	6941 obs. of  21 variables:
 $ True_LR              : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ Pair.Name            : chr  "TP53_TNFRSF10D" "LRIG1_EGFR" "INHA_ACVR1B" "INHA_ACVR1C" ...
 $ Ligand               : chr  "TP53" "LRIG1" "INHA" "INHA" ...
 $ Ligand.Name          : chr  "tumor protein p53" "leucine rich repeats and immunoglobulin like domains 1" "inhibin subunit alpha" "inhibin subunit alpha" ...
 $ Receptor             : chr  "TNFRSF10D" "EGFR" "ACVR1B" "ACVR1C" ...
 $ Receptor.Name        : chr  "TNF receptor superfamily member 10d" "epidermal growth factor receptor" "activin A receptor type 1B" "activin A receptor type 1C" ...
 $ complex_pair         : chr  NA NA NA NA ...
 $ source               : chr  "P04637" "Q96JA1" "P05111" "P05111" ...
 $ target               : chr  "Q9UBN6" "P00533" "P36896" "Q8NER5" ...
 $ is_directed          : num  1 1 1 1 1 1 1 1 1 1 ...
 $ is_stimulation       : num  1 0 1 1 1 1 1 1 1 1 ...
 $ is_inhibition        : num  0 1 0 0