# Query in storage TARA_A100000171

![A user queries data of Ocean Microbial Reference Gene Catalog v2](https://upload.wikimedia.org/wikipedia/commons/1/19/Oceania-query-fasta.png)

A FASTA file holds one of two types of biological sequence (they are mutually exclusive):

1. Nucleotides (from DNA/RNA): the sequence uses the four-letter alphabet ATCG (there are exceptions, but they are not relevant here). These correspond to the scaftig files (fragments of chromosomes or plasmids) and the CDS (CoDing Sequence) files, which hold genes that will become proteins. The central dogma of molecular biology says: GENE (in the genome) - transcription -> TRANSCRIPT (mRNA) - translation -> PROTEIN (plays a role). Not everything in the genome is expressed, so OM-RGC-V2 separates nucleotide data into two kinds of files: metaG (genome fragments) and metaT (fragments of what was transcribed, that is, the transcriptome).

2. Amino acids: an alphabet of roughly 20 to 25 letters:

ABCDEFGHIKLMNPQRSTVWYZ

They correspond to proteins or protein fragments that play a role. In principle, genes can be converted to this type of sequence, for example with the translate method in Biopython.
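As a rough illustration of the two alphabets described above, the following sketch guesses the sequence type from the letters it contains. The function name `guess_sequence_type` is hypothetical and not part of oceania-query-fasta. The check is deliberately naive: it ignores ambiguity codes such as N, and a short protein made up only of A, C, G and T would be misclassified.

```python
# Four-letter nucleotide alphabet used in this notebook
NUCLEOTIDE_ALPHABET = set("ACGT")


def guess_sequence_type(sequence: str) -> str:
    """Return "nucleotide" if every letter is in ACGT, else "amino acid"."""
    letters = set(sequence.upper())
    if not letters:
        raise ValueError("empty sequence")
    # A sequence that only uses A, C, G, T is assumed to be nucleotides
    if letters <= NUCLEOTIDE_ALPHABET:
        return "nucleotide"
    return "amino acid"


print(guess_sequence_type("GAAAGGTGGAAAGGCAAGG"))
print(guess_sequence_type("MKVLAAGIW"))
```

In practice the file name (scaftig, CDS or protein file) is a more reliable indicator than the content.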

This example has the following steps:

1. Install the dependency library
2. List all FASTA samples
3. Choose a sample from the list
4. List all genes and gaps between genes for the sample
5. Choose some gaps using a filter criterion (something simple, such as gap length > n)
6. Extract the sequences of the selected gaps
    
The user chooses the input file from the Ocean Microbial Reference Gene Catalog v2, in this case data/raw/tara/OM-RGC_v2/assemblies/TARA_A100000171.scaftig.gz, and sets the filters in the POSITIONS variable.

STORAGE_KEY: The object key file in the OceanIA storage.

POSITIONS: The values to query. Each entry represents a sequence to extract in the format "sequence_id,start,end,type", where:

sequence_id: The sequence ID.

start: The start index position of the sequence to be extracted.

end: The end index position of the sequence to extract.

type: The type of the sequence to extract. Options are ["raw", "complement", "reverse_complement"]. The type is optional; if it is not provided, the default is "raw".
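To make the three type options concrete, here is a minimal pure-Python sketch of the standard definitions of complement and reverse complement for a DNA string. It illustrates what the options mean biologically; it is not the code the OceanIA service runs, and the helper names (`complement`, `reverse_complement`, `apply_type`) are hypothetical.

```python
# Translation table for the DNA complement (A<->T, C<->G), upper and lower case
_COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(sequence: str) -> str:
    """Replace each base with its complementary base, keeping the order."""
    return sequence.translate(_COMPLEMENT_TABLE)


def reverse_complement(sequence: str) -> str:
    """Complement the sequence and read it in the opposite direction."""
    return complement(sequence)[::-1]


def apply_type(sequence: str, seq_type: str = "raw") -> str:
    """Apply one of the three documented type options to a sequence."""
    if seq_type == "raw":
        return sequence
    if seq_type == "complement":
        return complement(sequence)
    if seq_type == "reverse_complement":
        return reverse_complement(sequence)
    raise ValueError(f"unknown type: {seq_type!r}")


print(apply_type("ATCG"))                        # ATCG
print(apply_type("ATCG", "complement"))          # TAGC
print(apply_type("ATCG", "reverse_complement"))  # CGAT
```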

After that, the user calls the get_sequences_from_fasta function to get the results as a pandas data frame.

Finally, the user prints the pandas data frame to the console.

## 1. Install oceania-query-fasta as a dependency

In [1]:
!pip install oceania-query-fasta

Collecting oceania-query-fasta
  Using cached oceania_query_fasta-0.1.6-py3-none-any.whl (13 kB)
Collecting pandas==1.*
  Using cached pandas-1.2.5-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (9.7 MB)
Collecting click==7.*
  Using cached click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting requests==2.*
  Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting pytz>=2017.3
  Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Collecting numpy>=1.16.5
  Using cached numpy-1.21.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting chardet<5,>=3.0.2
  Using cached chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached urllib3-1.26.5-py2.py3-none-any.whl (138 kB)
Collecting idna<3,>=2.5
  Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2021.5.30-py2.py3-none-any.whl (145 kB)
Installing collected packages: urllib3, pytz, numpy, idna, chardet, certif

## 2. List all FASTA samples

In [2]:
#@title Double click to see the cell of the Python program

from oceania import list_fasta_samples

df_samples = list_fasta_samples()

print("Samples list:")
print(df_samples)

Samples list:
           sample_id                                         sample_key
0    TARA_A100000164  data/raw/tara/OM-RGC_v2/assemblies/TARA_A10000...
1    TARA_A100000171  data/raw/tara/OM-RGC_v2/assemblies/TARA_A10000...
2    TARA_A100000172  data/raw/tara/OM-RGC_v2/assemblies/TARA_A10000...
3    TARA_A100001011  data/raw/tara/OM-RGC_v2/assemblies/TARA_A10000...
4    TARA_A100001015  data/raw/tara/OM-RGC_v2/assemblies/TARA_A10000...
..               ...                                                ...
365  TARA_Y100001972  data/raw/tara/OM-RGC_v2/assemblies/TARA_Y10000...
366  TARA_Y100001973  data/raw/tara/OM-RGC_v2/assemblies/TARA_Y10000...
367  TARA_Y100001978  data/raw/tara/OM-RGC_v2/assemblies/TARA_Y10000...
368  TARA_Y100001980  data/raw/tara/OM-RGC_v2/assemblies/TARA_Y10000...
369  TARA_Y200000002  data/raw/tara/OM-RGC_v2/assemblies/TARA_Y20000...

[370 rows x 2 columns]


## 3. Manually choose the file from the list of FASTA samples

In [3]:
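#@title Double click to see the cell of the Python program

from oceania import list_genes_and_gaps

sample_id = "TARA_A100000171"
# Look up the storage key of the chosen sample in the samples list
STORAGE_KEY = df_samples[df_samples.sample_id == sample_id]['sample_key'].values[0]
# List the genes and the gaps between genes for this sample
df_gaps = list_genes_and_gaps(sample_id)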

print("Genes and gaps by sample:")
print(df_gaps)

Genes and gaps by sample:
              original_sequence_id  \
0    TARA_A100000171_G_scaffold5_1   
1    TARA_A100000171_G_scaffold5_1   
2    TARA_A100000171_G_scaffold5_1   
3    TARA_A100000171_G_scaffold6_1   
4    TARA_A100000171_G_scaffold6_1   
5    TARA_A100000171_G_scaffold8_1   
6    TARA_A100000171_G_scaffold8_1   
7   TARA_A100000171_G_scaffold10_1   
8   TARA_A100000171_G_scaffold10_1   
9   TARA_A100000171_G_scaffold12_1   
10  TARA_A100000171_G_scaffold12_1   
11  TARA_A100000171_G_scaffold14_1   
12  TARA_A100000171_G_scaffold14_1   
13  TARA_A100000171_G_scaffold14_1   
14  TARA_A100000171_G_scaffold14_1   
15  TARA_A100000171_G_scaffold23_1   
16  TARA_A100000171_G_scaffold23_1   
17  TARA_A100000171_G_scaffold23_1   
18  TARA_A100000171_G_scaffold23_1   
19  TARA_A100000171_G_scaffold23_1   
20  TARA_A100000171_G_scaffold23_1   
21  TARA_A100000171_G_scaffold24_1   
22  TARA_A100000171_G_scaffold24_1   
23  TARA_A100000171_G_scaffold25_1   
24  TARA_A100000171_G_sc
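
To clarify what a "gap between genes" means, the sketch below derives gaps from a list of gene coordinates on one scaffold. This is only an illustration and not how `list_genes_and_gaps` is implemented. It assumes 0-based, end-exclusive coordinates, and the function name `gaps_between` and the example coordinates are hypothetical.

```python
def gaps_between(genes, scaffold_length):
    """Return the (start, stop) intervals of a scaffold not covered by any gene.

    genes: iterable of (start, stop) pairs, assumed 0-based and end-exclusive.
    scaffold_length: total length of the scaffold.
    """
    gaps = []
    cursor = 0  # first position not yet covered by a gene
    for start, stop in sorted(genes):
        if start > cursor:
            # Uncovered stretch before this gene
            gaps.append((cursor, start))
        cursor = max(cursor, stop)
    if cursor < scaffold_length:
        # Uncovered stretch after the last gene
        gaps.append((cursor, scaffold_length))
    return gaps


# Two hypothetical genes on a scaffold of length 1000
print(gaps_between([(157, 432), (567, 900)], 1000))
# [(0, 157), (432, 567), (900, 1000)]
```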

## 4. Create the query filter to list gaps

In [4]:
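#@title Double click to see the cell of the Python program

# Keep rows that are gaps (id starts with "gap__") and are longer than 100
query = 'length > 100 and id.str.startswith("gap__")'
# engine='python' is needed so that the .str accessor works inside the query
query_result = df_gaps.query(query, engine='python').head(5)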

print("Query list of gaps, to get 5 with length over 100")
print(query_result)

Query list of gaps, to get 5 with length over 100
              original_sequence_id  \
1    TARA_A100000171_G_scaffold5_1   
15  TARA_A100000171_G_scaffold23_1   
17  TARA_A100000171_G_scaffold23_1   
23  TARA_A100000171_G_scaffold25_1   
29  TARA_A100000171_G_scaffold25_2   

                                                   id strand  start  stop  \
1   gap__TARA_A100000171_G_scaffold5_1_gene1__TARA...    NaN    568   783   
15  gap__TARA_A100000171_G_scaffold14_1_gene8__TAR...    NaN      0   157   
17  gap__TARA_A100000171_G_scaffold23_1_gene9__TAR...    NaN    432   567   
23  gap__TARA_A100000171_G_scaffold24_1_gene12__TA...    NaN      0   106   
29  gap__TARA_A100000171_G_scaffold25_2_gene15__TA...    NaN    402   661   

    length start_codon stop_codon gene_type  
1      215         NaN        NaN       NaN  
15     157         NaN        NaN       NaN  
17     135         NaN        NaN       NaN  
23     106         NaN        NaN       NaN  
29     259         NaN      
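
The same filter can also be written as a boolean mask, which avoids the string-expression syntax of `DataFrame.query`. The sketch below uses a small hand-made frame with hypothetical ids (not real catalog rows) that only mimics the `id` and `length` columns shown above.

```python
import pandas as pd

# Hypothetical rows that mimic the `id` and `length` columns of df_gaps
df_example = pd.DataFrame(
    {
        "id": ["gene_a", "gap__gene_a__gene_b", "gene_b", "gap__gene_b__gene_c"],
        "length": [900, 215, 300, 40],
    }
)

# Keep rows whose id starts with "gap__" and whose length exceeds 100
is_gap = df_example["id"].str.startswith("gap__")
is_long = df_example["length"] > 100
long_gaps = df_example[is_gap & is_long].head(5)

print(long_gaps)
```

Only the row with id `gap__gene_a__gene_b` passes both conditions; the other gap is too short and the remaining rows are genes.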

## 5. Choose some gaps using a filter criterion (something simple, such as gap length > n)

In [5]:
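#@title Double click to see the cell of the Python program

# Keep only the columns needed to build the positions
params = query_result[['original_sequence_id', 'start', 'stop']]
# One (sequence_id, start, end) tuple per selected gap; the type is omitted, so it defaults to "raw"
POSITIONS = [
    (str(row.original_sequence_id), int(row.start), int(row.stop))
    for row in params.itertuples(index=False)
]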

print("Positions:")
print(POSITIONS)

Positions:
[('TARA_A100000171_G_scaffold5_1', 568, 783), ('TARA_A100000171_G_scaffold23_1', 0, 157), ('TARA_A100000171_G_scaffold23_1', 432, 567), ('TARA_A100000171_G_scaffold25_1', 0, 106), ('TARA_A100000171_G_scaffold25_2', 402, 661)]
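
Before sending the request, it can help to check the positions locally. The following sketch validates each entry against the format described at the top of this notebook. The helper `validate_positions` is hypothetical and not part of oceania-query-fasta, and the rules that start must be non-negative and smaller than end are assumptions rather than documented constraints.

```python
ALLOWED_TYPES = {"raw", "complement", "reverse_complement"}


def validate_positions(positions):
    """Raise ValueError on the first malformed entry; return the list otherwise."""
    for entry in positions:
        if len(entry) not in (3, 4):
            raise ValueError(f"expected 3 or 4 fields, got {entry!r}")
        sequence_id, start, end = entry[:3]
        # The type is optional and defaults to "raw"
        seq_type = entry[3] if len(entry) == 4 else "raw"
        if not sequence_id:
            raise ValueError(f"empty sequence id in {entry!r}")
        # Assumption: coordinates are non-negative and start comes before end
        if start < 0 or end <= start:
            raise ValueError(f"invalid range in {entry!r}")
        if seq_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown type in {entry!r}")
    return positions


checked = validate_positions(
    [
        ("TARA_A100000171_G_scaffold5_1", 568, 783),
        ("TARA_A100000171_G_scaffold23_1", 0, 157, "reverse_complement"),
    ]
)
print(len(checked), "positions are well formed")
```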


## 6. Extract the biological sequences of the selected gaps

In [6]:
#@title Double click to see the cell of the Python program

from oceania import get_sequences_from_fasta

results = get_sequences_from_fasta(
    STORAGE_KEY,
    POSITIONS
)

print("Dataframe loaded:")
print(results)

[24-06-2021 21:51:47] Sending request for fasta sequences
[24-06-2021 21:51:47] Request accepted
[24-06-2021 21:51:47] Waiting for results...
[24-06-2021 21:51:58] Done. Elapsed time: 11.851772893999964 seconds


Dataframe loaded:
                               id  start  end type  \
0   TARA_A100000171_G_scaffold5_1    568  783  raw   
1  TARA_A100000171_G_scaffold23_1      0  157  raw   
2  TARA_A100000171_G_scaffold23_1    432  567  raw   
3  TARA_A100000171_G_scaffold25_1      0  106  raw   
4  TARA_A100000171_G_scaffold25_2    402  661  raw   

                                            sequence  
0  GAAAGGTGGAAAGGCAAGGTGGCGGCGCCAATTCCGCCGGTGGTCG...  
1  AAGTTCGTAGGTCACTGGAAATCCAAATTTTACGGGTGATGAATGC...  
2  GGATGCCTCCTTTGAAAATGGCTGTCACCTATGACATAGAAGTCAA...  
3  AGTTTGGAAGGTGGATACGGCATTCATTGTCGTGTCATCTTTTTCG...  
4  AGCGGCGGTATGAAATAGAGCTATTTTGCTGAAAGCCGCCATGCAA...  
