# Query in storage TARA_R110002003

![A user queries data of Ocean Microbial Reference Gene Catalog v2](https://upload.wikimedia.org/wikipedia/commons/1/19/Oceania-query-fasta.png)

A FASTA file holds one of two mutually exclusive types of biological sequence:

1. Nucleotides (from DNA or RNA): the sequence uses the four-letter alphabet ATCG (there are exceptions, but they are not relevant here). These are the scaftig files (fragments of chromosomes or plasmids) and the CDS (CoDing Sequence) files, which hold genes that will become proteins. The central dogma of molecular biology states: GENE (in the genome) - transcription -> TRANSCRIPT (mRNA) - translation -> PROTEIN (plays a role). Not everything in the genome is expressed, so OM-RGC v2 separates nucleotide data into two kinds: metaG files hold genome fragments, and metaT files hold fragments of what was transcribed (what is running, the transcriptome).

2. Amino acids: the alphabet has roughly 20 to 25 letters:

ABCDEFGHIKLMNPQRSTVWYZ

These sequences correspond to proteins or protein fragments that presumably play a role. A gene can be converted to this type of sequence, for example with the translate method in Biopython.
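To make the gene-to-protein conversion concrete, here is a minimal pure-Python sketch of translation with the standard genetic code. The names `CODON_TABLE` and `translate` are hypothetical helpers written for this illustration only; they are not part of oceania-query-fasta. In practice, Biopython's `Seq.translate()` covers this and also supports other genetic codes.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, listed by codon with bases in TCAG order
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a nucleotide string codon by codon; '*' marks a stop codon."""
    seq = nucleotides.upper().replace("U", "T")  # accept RNA input as well
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    # Codons containing ambiguous letters (for example N) are reported as 'X'
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGGCCTAA"))  # MA*
```

The sketch reads the sequence in the first reading frame only and does not stop at the first stop codon; both are simplifications chosen to keep the example short.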

This example has the following steps:

1. Install the dependency library
2. List all FASTA samples
3. Choose a sample from the list
4. List all genes and gaps between genes for the sample
5. Choose some gaps using a filter criterion (a simple one, such as gap length > n)
6. Extract the sequences of the selected gaps
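
Steps 4 and 5 rely on the notion of a gap: a stretch of a scaffold that no gene covers. The sketch below shows one way such gaps could be derived from gene coordinates and then filtered by length. It is an illustration under stated assumptions, not the algorithm the library uses: `find_gaps` is a hypothetical helper, coordinates are assumed to be 0-based with an exclusive stop, and the gene coordinates are made up for the example.

```python
def find_gaps(genes, scaffold_length, min_length=0):
    """Return (start, stop) regions not covered by any gene.

    genes: list of (start, stop) pairs, 0-based with stop exclusive (assumption).
    Only gaps strictly longer than min_length are returned.
    """
    gaps = []
    cursor = 0  # end of the region covered so far
    for start, stop in sorted(genes):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, stop)  # tolerate overlapping genes
    if cursor < scaffold_length:
        gaps.append((cursor, scaffold_length))  # trailing gap after the last gene
    return [(s, e) for s, e in gaps if e - s > min_length]


# Made-up gene coordinates on a 3200-base scaffold
genes = [(327, 944), (2742, 3100)]
print(find_gaps(genes, scaffold_length=3200, min_length=100))
# [(0, 327), (944, 2742)]
```

The trailing region (3100, 3200) is dropped because its length of 100 does not exceed the threshold.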
    
The user chooses the input file from the Ocean Microbial Reference Gene Catalog v2, in this case data/raw/tara/OM-RGC_v2/assemblies/TARA_R110002003.scaftig.gz, and sets the filters in the POSITIONS variable. The two inputs are:

STORAGE_KEY: The object key of the file in the OceanIA storage.

POSITIONS: The values to query. Each entry represents a sequence to extract in the format "sequence_id,start,end,type" (the example in this notebook passes each entry as a tuple), where:

sequence_id: The sequence ID.

start: The start index position of the sequence to be extracted.

end: The end index position of the sequence to extract.

type: The type of the sequence to extract. Options are ["raw", "complement", "reverse_complement"]. The type value is optional; if it is not provided, the default is "raw".
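
The four fields above can be illustrated with a small local sketch that parses one "sequence_id,start,end,type" entry and applies the requested type to a sequence held in memory. This is not the service's implementation: `parse_position` and `extract` are hypothetical helpers, the sequence and its ID are made up, and the 0-based, end-exclusive slicing is an assumption (it is consistent with the `length` column equalling `stop - start` in the gap listing of this notebook).

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
VALID_TYPES = ("raw", "complement", "reverse_complement")


def parse_position(line):
    """Split 'sequence_id,start,end[,type]' into a tuple; type defaults to 'raw'."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields, got {len(fields)}: {line!r}")
    seq_type = fields[3] if len(fields) == 4 else "raw"
    if seq_type not in VALID_TYPES:
        raise ValueError(f"unknown type: {seq_type!r}")
    return fields[0], int(fields[1]), int(fields[2]), seq_type


def extract(sequence, start, end, seq_type="raw"):
    """Slice [start, end) and apply the requested transformation."""
    fragment = sequence[start:end]
    if seq_type == "complement":
        return fragment.translate(COMPLEMENT)
    if seq_type == "reverse_complement":
        return fragment.translate(COMPLEMENT)[::-1]
    return fragment


seq_id, start, end, seq_type = parse_position("scaffold_x,2,8,reverse_complement")
print(extract("AAATTCGGCC", start, end, seq_type))  # CCGAAT
```

Here the raw slice is ATTCGG, its complement is TAAGCC, and reversing the complement gives CCGAAT.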

After that, the user calls the get_sequences_from_fasta function to get the results as a pandas data frame.

Finally, the user prints the pandas data frame to the console.

## 1. Install oceania-query-fasta as a dependency

In [1]:
!pip install oceania-query-fasta



## 2. List all FASTA samples

In [2]:
#@title Double click to see the Python program

from oceania import list_fasta_samples

# List all FASTA samples
df_samples = list_fasta_samples()
print("Samples list:")
print(df_samples)

Samples list:
           sample_id                                         sample_key
0    TARA_A100000164  data/raw/tara/OM-RGC_v2/assemblies/TARA_A10000...
1    TARA_A100000171  data/raw/tara/OM-RGC_v2/assemblies/TARA_A10000...
2    TARA_A100000172  data/raw/tara/OM-RGC_v2/assemblies/TARA_A10000...
3    TARA_A100001011  data/raw/tara/OM-RGC_v2/assemblies/TARA_A10000...
4    TARA_A100001015  data/raw/tara/OM-RGC_v2/assemblies/TARA_A10000...
..               ...                                                ...
365  TARA_Y100001972  data/raw/tara/OM-RGC_v2/assemblies/TARA_Y10000...
366  TARA_Y100001973  data/raw/tara/OM-RGC_v2/assemblies/TARA_Y10000...
367  TARA_Y100001978  data/raw/tara/OM-RGC_v2/assemblies/TARA_Y10000...
368  TARA_Y100001980  data/raw/tara/OM-RGC_v2/assemblies/TARA_Y10000...
369  TARA_Y200000002  data/raw/tara/OM-RGC_v2/assemblies/TARA_Y20000...

[370 rows x 2 columns]


## 3. Manually choose the file from the list of FASTA samples

In [3]:
#@title Double click to see the cell of the Python program

from oceania import list_genes_and_gaps

sample_id = "TARA_R110002003"
STORAGE_KEY = df_samples[df_samples.sample_id == sample_id]['sample_key'].values[0]
df_gaps = list_genes_and_gaps(sample_id)
print("Genes and gaps by sample:")
print(df_gaps)

Genes and gaps by sample:
             original_sequence_id  \
0   TARA_R110002003_G_scaffold3_1   
1   TARA_R110002003_G_scaffold3_1   
2   TARA_R110002003_G_scaffold3_1   
3   TARA_R110002003_G_scaffold3_1   
4   TARA_R110002003_G_scaffold3_1   
5   TARA_R110002003_G_scaffold3_1   
6   TARA_R110002003_G_scaffold3_1   
7   TARA_R110002003_G_scaffold3_3   
8   TARA_R110002003_G_scaffold3_3   
9   TARA_R110002003_G_scaffold3_3   
10  TARA_R110002003_G_scaffold3_3   
11  TARA_R110002003_G_scaffold3_4   
12  TARA_R110002003_G_scaffold3_4   
13  TARA_R110002003_G_scaffold3_4   
14  TARA_R110002003_G_scaffold3_4   
15  TARA_R110002003_G_scaffold3_4   
16  TARA_R110002003_G_scaffold3_4   
17  TARA_R110002003_G_scaffold3_4   
18  TARA_R110002003_G_scaffold3_4   
19  TARA_R110002003_G_scaffold3_4   
20  TARA_R110002003_G_scaffold3_4   
21  TARA_R110002003_G_scaffold3_4   
22  TARA_R110002003_G_scaffold3_4   
23  TARA_R110002003_G_scaffold3_4   
24  TARA_R110002003_G_scaffold3_4   
25  TARA_R11

## 4. Create the query filter to list gaps

In [4]:
#@title Double click to see the cell of the Python program

query = 'length > 100 and id.str.startswith("gap__")'
query_result = df_gaps.query(query, engine='python').head(5)

print("Query list of gaps, to get 5 with length over 100")
print(query_result)

Query list of gaps, to get 5 with length over 100
             original_sequence_id  \
5   TARA_R110002003_G_scaffold3_1   
7   TARA_R110002003_G_scaffold3_3   
9   TARA_R110002003_G_scaffold3_3   
11  TARA_R110002003_G_scaffold3_4   
13  TARA_R110002003_G_scaffold3_4   

                                                   id strand  start  stop  \
5   gap__TARA_R110002003_G_scaffold3_1_gene3__TARA...    NaN   3290  6293   
7   gap__TARA_R110002003_G_scaffold3_1_gene4__TARA...    NaN      0   327   
9   gap__TARA_R110002003_G_scaffold3_3_gene5__TARA...    NaN    944  2742   
11  gap__TARA_R110002003_G_scaffold3_3_gene6__TARA...    NaN      0   379   
13  gap__TARA_R110002003_G_scaffold3_4_gene7__TARA...    NaN   1530  1669   

    length start_codon stop_codon gene_type  
5     3003         NaN        NaN       NaN  
7      327         NaN        NaN       NaN  
9     1798         NaN        NaN       NaN  
11     379         NaN        NaN       NaN  
13     139         NaN        NaN 

## 5. Choose some gaps using a filter criterion (a simple one, such as gap length > n)

In [5]:
#@title Double click to see the cell of the Python program

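# Keep only the columns needed to build the query
params = query_result[['original_sequence_id', 'start', 'stop']].copy()
# Build one (sequence_id, start, end) tuple per selected gap
POSITIONS = []
for row in params.itertuples(index=False):
    POSITIONS.append((str(row.original_sequence_id), int(row.start), int(row.stop)))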

print("Positions:")
print(POSITIONS)

Positions:
[('TARA_R110002003_G_scaffold3_1', 3290, 6293), ('TARA_R110002003_G_scaffold3_3', 0, 327), ('TARA_R110002003_G_scaffold3_3', 944, 2742), ('TARA_R110002003_G_scaffold3_4', 0, 379), ('TARA_R110002003_G_scaffold3_4', 1530, 1669)]


## 6. Extract the biological sequences of the selected gaps

In [6]:
#@title Double click to see the cell of the Python program

from oceania import get_sequences_from_fasta

results = get_sequences_from_fasta(
    STORAGE_KEY,
    POSITIONS
)

print("Dataframe loaded:")
print(results)

[27-06-2021 01:30:39] Sending request for fasta sequences
[27-06-2021 01:30:40] Request accepted
[27-06-2021 01:30:40] Waiting for results...
[27-06-2021 01:30:53] Done. Elapsed time: 13.899570089000918 seconds


Dataframe loaded:
                              id  start   end type  \
0  TARA_R110002003_G_scaffold3_1   3290  6293  raw   
1  TARA_R110002003_G_scaffold3_3      0   327  raw   
2  TARA_R110002003_G_scaffold3_3    944  2742  raw   
3  TARA_R110002003_G_scaffold3_4      0   379  raw   
4  TARA_R110002003_G_scaffold3_4   1530  1669  raw   

                                            sequence  
0  TGATCGGGAGTCCTCCAGGCTTTGGATCGTTTGGGATAGATTTGTT...  
1  TCCCTCTACACAGAGCAAACCTCCCAGGTAAGATCAGCCCGGGCTA...  
2  CAACATCTCCCTCTTCTTTACTTTGAATCTCTCGTCCTTATTTCGT...  
3  TCTCTCAAACAGTTGTTGTGCTCAACTTAGCAATCCATGTATTTGC...  
4  GAGCAATTTGCAGATGGTGGTGTAGTCCTCGAAGTTGGAACAGATG...  
