# **Table goes here!**

In [4]:
!sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9

/bin/sh: 1: sudo: not found


## Dependencies

In [0]:
library(rentrez)
library(stringr)
library(dplyr)
library(curl)
library(httr)
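
Before loading the packages, it can help to check that they are all installed. The following is a minimal sketch using base R only; the variable names `pkgs`, `installed` and `missing_pkgs` are illustrative and not part of the notebook.

```r
# Packages this notebook loads. requireNamespace() checks that a package
# is available without attaching it to the search path.
pkgs <- c("rentrez", "stringr", "dplyr", "curl", "httr")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything that is missing
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```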

## Load the R functions 

In [0]:
source("hummusFunctions.R")

## Read Legumes taxids

Load the file `my_hummus.RData`. It contains the vector `legumesIds`, which holds the legume taxids. You can download the taxids for your own family or species of interest from the NCBI.

In [0]:
file = "my_hummus.RData"
load(file)
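
Since `load()` silently restores objects into the workspace, it can be useful to confirm what a file actually contains. The sketch below uses a temporary file with made-up values as a stand-in for `my_hummus.RData`; the ids are hypothetical, not real taxids.

```r
# Toy stand-in for my_hummus.RData (hypothetical values, not real taxids)
tmp <- tempfile(fileext = ".RData")
legumesIds <- c("1001", "1002", "1003")
save(legumesIds, file = tmp)
rm(legumesIds)

# Load into a separate environment so nothing in the workspace is overwritten.
# load() invisibly returns the names of the objects it restored.
env <- new.env()
loaded <- load(tmp, envir = env)
print(loaded)
str(env$legumesIds)
unlink(tmp)
```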

## Conserved domains 

Plant gene families are characterized by common protein structure. The structure that defines a given family can be found in literature. 

You can specify the conserved domain accession number as a query. For example, the three conserved domains that define the ARF gene family are: 

In [0]:
arf <- c("pfam02362", "pfam06507", "pfam02309")
arf

In [0]:
## [1] "pfam02362" "pfam06507" "pfam02309"

Those three accessions correspond with the three ARF conserved domains: 

- B3 DNA binding domain
- Auxin response factor
- AUX/IAA family

We know from the literature that, in the case of the ARF gene family, the last domain (AUX/IAA) may or may not be present.

## SPARCLE architectures

Then, you can get the SPARCLE architectures for each conserved domain using the `getSparcleArchs` function. 

In [0]:
cd1 = getSparcleArchs(arf[1])
cd2 = getSparcleArchs(arf[2])
cd3 = getSparcleArchs(arf[3])

For example, the SPARCLE architectures for Pfam02362 are:

`12034188, 12034184, 12034182, 12034166, 12034151, 11279088, 11279084, 11266712, 11130507, 11130491, 11130489, 11130478, 10975108, 10889850, 10874725, 10803150, 10492348, 10492347, 10178159, 10178158.`

Not all of those architectures link to ARF proteins, but the architecture ids for the ARF proteins will definitely be among them.

## SPARCLE labels

The next step is to identify the architecture ids that correspond to ARF proteins. For this, the SPARCLE labels come in handy. We do not need the labels for every architecture id, only those that could be present in ARF proteins, so we can filter by any word we know appears in the ARF family. For example, we know that an ARF protein contains at least the domain "B3_DNA".

In [0]:
getSparcleLabels(cd1, "B3_DNA")

In [0]:
## [1] "12034188 protein containing domains B3_DNA, Auxin_resp, Activator_LAG-3, and AUX_IAA"
## [1] "12034184 protein containing domains B3_DNA, Auxin_resp, Med15, and PB1"
## [1] "12034182 protein containing domains B3_DNA, Auxin_resp, Atrophin-1, and AUX_IAA"
## [1] "12034166 protein containing domains B3_DNA, Auxin_resp, PAT1, and AUX_IAA"
## [1] "12034151 protein containing domains B3_DNA, LRR_3, and zf-CCCH"
## [1] "11279088 B3_DNA domain-containing protein"
## [1] "11279084 B3_DNA domain-containing protein"
## [1] "11130507 B3_DNA and Auxin_resp domain-containing protein"
## [1] "11130491 protein containing domains B3_DNA, Auxin_resp, and PEARLI-4"
## [1] "11130489 B3_DNA and Auxin_resp domain-containing protein"
## [1] "10975108 B3_DNA domain-containing protein"
## [1] "10889850 AP2 and B3_DNA domain-containing protein"
## [1] "10874725 B3_DNA domain-containing protein"
## [1] "10492348 protein containing domains B3_DNA, Auxin_resp, and PB1"
## [1] "10492347 B3_DNA domain-containing protein"
## [1] "10178159 B3_DNA domain-containing protein"
## [1] "10178158 B3_DNA domain-containing protein"

We can run the `getSparcleLabels` function on the architectures of the other two conserved domains (Pfam06507, Pfam02309). After examining the labels, and based on the number of domains (>=3), the only architecture ids that make sense for the ARF family are:

In [0]:
my_labelsIds <-  c(10332700, 11130507, 10332698, 10492348, 12034166, 11130489)
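
Part of the label examination can be done programmatically. The sketch below assumes the labels are available as a character vector (the notebook only shows them printed, so this is an assumption) and keeps those that mention both "B3_DNA" and "Auxin_resp". The strings are copied from the output above. This only narrows down the candidates; the final choice still relies on inspecting the labels.

```r
# A few labels copied from the getSparcleLabels() output above
labels <- c(
  "12034166 protein containing domains B3_DNA, Auxin_resp, PAT1, and AUX_IAA",
  "12034151 protein containing domains B3_DNA, LRR_3, and zf-CCCH",
  "11279088 B3_DNA domain-containing protein",
  "11130507 B3_DNA and Auxin_resp domain-containing protein",
  "10889850 AP2 and B3_DNA domain-containing protein",
  "10492348 protein containing domains B3_DNA, Auxin_resp, and PB1"
)

# Keep labels that mention both domains (fixed = TRUE: plain string match)
has_both <- grepl("B3_DNA", labels, fixed = TRUE) &
  grepl("Auxin_resp", labels, fixed = TRUE)
candidates <- labels[has_both]

# The architecture id is the first space-separated token of each label
candidate_ids <- as.numeric(sub(" .*", "", candidates))
candidate_ids
```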

## Protein ids

Now, for each SPARCLE architecture, we'll get the full list of protein ids that show that architecture. We'll start with the first architecture (n = 1).

In [0]:
n = 1
sparcleArch = my_labelsIds[n]

To get the protein ids for the architecture id 10332700, we'll call the function `getProtlinks`:

In [0]:
my_protIds <- getProtlinks(sparcleArch)

Depending on the architecture id, the list of protein ids can be long. Because the next step interacts with the [NCBI's web history](https://www.ncbi.nlm.nih.gov/books/NBK25501/) feature, we have to check the length of that list first.

**Note:** If you have a very long list of ids (>300) you may receive an HTTP 414 (URI Too Long) error.
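
To see why long id lists can trigger this error, the sketch below builds a request URL with the ids passed as a comma-separated query parameter and compares its length for 28 and 2489 ids. It assumes the ids travel in the URL; the ids are made up, `make_url` is a hypothetical helper, and the exact length a server accepts is not specified here.

```r
# E-utilities efetch endpoint, with the ids appended as a query parameter
base_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id="

# Hypothetical helper: build a URL for n made-up 9-digit ids
make_url <- function(n_ids) {
  ids <- sprintf("%09d", seq_len(n_ids))
  paste0(base_url, paste(ids, collapse = ","))
}

nchar(make_url(28))    # short list: a few hundred characters
nchar(make_url(2489))  # long list: tens of thousands of characters
```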

In [0]:
length(my_protIds)

In [0]:
## [1] 28

Because we have fewer than 300 ids, we can call `extract_proteins` directly. It takes two arguments: a vector of protein ids and the taxonomy ids of the species in which you want to identify the proteins (in this case study, the legume taxids). The function returns only the protein ids hosted by the RefSeq database.
We store the result in a vector called `my_values`, where we'll keep track of every ARF protein id from the legume family.

In [0]:
my_values <- extract_proteins(my_protIds, legumesIds)

Now we check the architecture n = 2. 

In [0]:
n = 2
my_protIds <- getProtlinks(my_labelsIds[n])

In [0]:
length(my_protIds)

In [0]:
## [1] 2489

Because this time the list is very long (>>300), we need to split it into subsets so that `extract_proteins` can work properly. For this we'll use the function `subsetIds`, which takes the protein ids as its first argument and the size of each subset as its second.

In [0]:
protIds_subset <-  subsetIds(my_protIds, 300)
length(protIds_subset)

In [0]:
## [1] 9

Now we have `protIds_subset`, a list containing 9 elements. Each element holds up to 300 protein ids.
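
The implementation of `subsetIds` is not shown here. As a rough illustration of the idea, the sketch below splits a vector into consecutive chunks with base R; `chunk_ids` is a hypothetical helper, not the notebook's function, and the ids are placeholders.

```r
# Hypothetical helper: split ids into consecutive chunks of at most `size`
chunk_ids <- function(ids, size) {
  groups <- ceiling(seq_along(ids) / size)  # 1,1,...,2,2,...
  unname(split(ids, groups))
}

ids <- seq_len(2489)           # placeholder ids, same count as above
chunks <- chunk_ids(ids, 300)
length(chunks)                 # 9
lengths(chunks)                # eight chunks of 300 and a final chunk of 89
```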

Now we can call the function `extract_proteins_from_subset`, which in turn applies `extract_proteins` to each of the 9 elements. It takes two arguments: the list of subsets and the target taxonomy ids. We then append the result to `my_values`.

In [0]:
vals = extract_proteins_from_subset(protIds_subset, legumesIds)

# Update my_values
my_values = c(my_values, vals)
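
The pattern of applying one function to every chunk and flattening the results can be sketched with `lapply` and `unlist`. Here `filter_chunk` is a hypothetical stand-in for `extract_proteins` that simply keeps ids found in a reference set; no NCBI query is made.

```r
# Hypothetical stand-in for extract_proteins(): keep ids present in `keep`
filter_chunk <- function(ids, keep) ids[ids %in% keep]

chunks <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8))
keep <- c(2, 5, 6, 8, 99)

# Apply the filter to each chunk, then flatten the list into one vector
vals <- unlist(lapply(chunks, filter_chunk, keep = keep))
vals  # 2 5 6 8
```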

Let's run the code for the architectures n=3-6. 

In [0]:
n = 3
my_protIds <- getProtlinks(my_labelsIds[n])
length(my_protIds)

In [0]:
## [1] 29

Update `my_values` with the protein ids from SPARCLE architecture n=3:

In [0]:
my_values <- c(my_values, extract_proteins(my_protIds, legumesIds))
length(my_values)

In [0]:
## [1] 170

In [0]:
n = 4
my_protIds <- getProtlinks(my_labelsIds[n])
length(my_protIds)

In [0]:
## [1] 4577

In [0]:
protIds_subset <-  subsetIds(my_protIds, 300)
vals = extract_proteins_from_subset(protIds_subset, legumesIds)
my_values <- c(my_values,vals)
length(my_values)

In [0]:
## [1] 547

In [0]:
n = 5
my_protIds <- getProtlinks(my_labelsIds[n])
length(my_protIds)

In [0]:
## [1] 34

In [0]:
my_values <- c(my_values, extract_proteins(my_protIds, legumesIds))
length(my_values)

In [0]:
## [1] 549

In [0]:
n = 6
my_protIds <- getProtlinks(my_labelsIds[n])
length(my_protIds)

In [0]:
## [1] 144

In [0]:
my_values <- c(my_values, extract_proteins(my_protIds, legumesIds))
length(my_values)

In [0]:
## [1] 561
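
The cells for n = 1 to 6 repeat the same pattern: fetch the ids, check the length, and either process them directly or in subsets. A loop could express this once. The sketch below uses hypothetical stand-ins (`fetch_ids`, `filter_ids`) in place of the notebook's helpers so that it runs without network access.

```r
max_ids <- 300

# Hypothetical stand-ins; the notebook's helpers query the NCBI instead
fetch_ids  <- function(arch) seq_len(arch)        # pretend the id is a count
filter_ids <- function(ids) ids[ids %% 100 == 0]  # keep an arbitrary subset

collect <- function(archs) {
  out <- integer(0)
  for (arch in archs) {
    ids <- fetch_ids(arch)
    if (length(ids) <= max_ids) {
      hits <- filter_ids(ids)                     # short list: one call
    } else {
      chunks <- split(ids, ceiling(seq_along(ids) / max_ids))
      hits <- unlist(lapply(chunks, filter_ids), use.names = FALSE)
    }
    out <- c(out, hits)
  }
  unique(out)                                     # drop ids seen twice
}

result <- collect(c(28, 2489))
length(result)
```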

At this point we have likely identified the whole set of ARF protein ids from the legume family. Because two SPARCLE architectures may link to the same sequence, we finally check that `my_values` does not contain duplicated values.

In [0]:
my_values = unique(my_values)
length(my_values)

In [0]:
## [1] 561
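
If the length had changed, `duplicated()` would show which ids were repeated. A small sketch with made-up ids:

```r
ids <- c("XP_1", "XP_2", "XP_3", "XP_2", "XP_1")  # made-up ids

# duplicated() flags the second and later occurrences of each value
dups <- unique(ids[duplicated(ids)])
dups                 # "XP_2" "XP_1"
length(unique(ids))  # 3
```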

Now, we want to get the legume species and the number of proteins per species. 

In [0]:
my_values_subset <-  subsetIds(my_values, 300)
spp = extract_spp_from_subset(my_values_subset)
length(spp) == length(my_values)

In [0]:
## [1] TRUE

In [0]:
spp_tidy = c()
for(sp in seq_along(spp)) {
  spp_tidy = c(spp_tidy, get_spp(spp[sp]))
}
sort(table(spp_tidy), decreasing = TRUE)

In [0]:
## spp_tidy
##      Lupinus angustifolius                Glycine max 
##                         92                         86 
##           Arachis ipaensis         Arachis duranensis 
##                         63                         57 
##              Cajanus cajan        Medicago truncatula 
##                         56                         50 
##            Cicer arietinum            Vigna angularis 
##                         45                         43 
## Vigna radiata var. radiata         Phaseolus vulgaris 
##                         41                         30
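
The tallying step can also be written without growing a vector inside a loop. The sketch below assumes each record is a protein title ending with the organism name in square brackets, which may differ from what `extract_spp_from_subset` actually returns; the titles are made up and `parse_species` is a hypothetical stand-in for `get_spp`.

```r
# Made-up titles with the organism in square brackets
titles <- c(
  "auxin response factor 2 [Glycine max]",
  "auxin response factor 5 [Cicer arietinum]",
  "auxin response factor 9 [Glycine max]"
)

# Hypothetical stand-in for get_spp(): extract the text inside the brackets
parse_species <- function(x) sub("^.*\\[(.*)\\]$", "\\1", x)

# vapply() preallocates the result and checks each value is one string
species <- vapply(titles, parse_species, character(1), USE.NAMES = FALSE)
counts <- sort(table(species), decreasing = TRUE)
counts
```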