
# concrete.biopython quickstart

## WARNING: this tutorial cannot be run, it is just a code structure suggestion :)

**concrete.biopython** is an FHE (fully homomorphic encryption) library based on the Python [**biopython**](https://biopython.org/) library. It implements the same objects and functions wherever they are compatible with FHE.

In biology, data are often sensitive, so protecting their privacy is a major concern. Using FHE to process sensitive data such as human DNA, laboratory research, or hospital patients' personal data guarantees the privacy of the processing: no one other than the data owners has access to the data or to the result of the processing.

### concrete.biopython Seq class

concrete.biopython.Seq.<span style="color:orange">**Seq**</span> is the FHE implementation of biopython Bio.Seq.<span style="color:green">**Seq**</span>.

Biopython Bio.Seq.<span style="color:green">**Seq**</span> objects are constructed from a string, generally representing a **DNA**, **RNA** or **protein** sequence. They provide functions to process this string sequence. Because Bio.Seq.<span style="color:green">**Seq**</span> objects are immutable, Bio.Seq.<span style="color:green">**MutableSeq**</span> can be used when a mutable sequence is needed.

concrete.biopython.Seq.<span style="color:orange">**Seq**</span> works the same way on unencrypted data. On encrypted data, it implements the same functions as Bio.Seq.<span style="color:green">**Seq**</span> (when they are compatible with FHE), operating on an encrypted array of integers that encodes the string sequence. concrete.biopython.Seq.<span style="color:orange">**MutableSeq**</span> is also available.
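To make the idea of "an array of integers that encodes the string sequence" more tangible, here is a minimal sketch in plain Python. The mapping (each letter replaced by its index in the alphabet `ACGT`) and the helper names `encode` and `decode` are assumptions made for illustration; the encoding actually used by the library may differ.

```python
ALPHABET = "ACGT"  # assumed alphabet, for illustration only

def encode(sequence, alphabet=ALPHABET):
    """Map each letter of the sequence to its index in the alphabet."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    return [index[letter] for letter in sequence]

def decode(codes, alphabet=ALPHABET):
    """Inverse of encode: map each integer back to its letter."""
    return "".join(alphabet[code] for code in codes)

codes = encode("ACCAGGTAC")
print(codes)          # [0, 1, 1, 0, 2, 2, 3, 0, 1]
print(decode(codes))  # ACCAGGTAC
```

In an FHE setting, it is an integer array of this kind, not the string itself, that would be encrypted and processed.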

concrete.biopython.Seq.<span style="color:orange">**SeqCompiler**</span> is a wrapper around **fhe.compiler** that allows circuits to be compiled directly on concrete.biopython.Seq.<span style="color:orange">**Seq**</span> objects.

### Working with concrete.biopython Seq and MutableSeq objects

First of all, we need to import concrete.biopython.Seq.<span style="color:orange">**Seq**</span> and concrete.biopython.Seq.<span style="color:orange">**MutableSeq**</span> from **concrete_biopython.Seq**, along with concrete.biopython.Seq.<span style="color:orange">**SeqCompiler**</span>, which allows compiling circuits with **Seq** inputs.

In [None]:
from concrete import fhe

import sys, os
sys.path.append(os.path.dirname(os.getcwd()))
from concrete_biopython.Seq import Seq, MutableSeq, SeqCompiler

Let's define an arbitrary function **process_seq** that takes as input a <span style="color:green">**Seq**</span> object **seq1** and a <span style="color:green">**MutableSeq**</span> object **seq2**, both representing DNA sequences, processes them with some of the operations offered by the biopython library, and returns a short protein sequence.

In [2]:
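def process_seq(seq1, seq2):
    # remove the last base of the mutable sequence (in place)
    seq2.pop()
    # reverse-complement seq1 and append the first three bases (one codon) of seq2
    new_seq = seq1.reverse_complement() + seq2[0:3]
    # translate the DNA sequence into a protein using the standard codon table
    protein = new_seq.translate('Standard')
    return protein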

As a reminder, [**DNA**](https://en.wikipedia.org/wiki/DNA) strands are very long sequences of nucleotides, which come in four types depending on the nitrogenous base they hold. Each type is represented by a letter, either '**A**', '**C**', '**G**' or '**T**'. Each strand is attached to a complementary strand, where '**A**' bases are paired with '**T**' bases and vice versa, and likewise for '**C**' and '**G**'. Some parts of the strand, called genes, encode proteins. In a gene, bases are read in groups of three called codons, where every codon encodes an amino acid (or a stop when the gene ends). Amino acids are chained to form a protein. The conversion from a gene to a protein is called [**translation**](https://en.wikipedia.org/wiki/Translation_(biology)), and follows a mapping from codons to amino acids that can vary. In our case, we will use the standard [codon table](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables).
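The two operations described above, complementing a strand and translating codons, can be sketched in plain Python without any library. This is only an illustration of the biology: the names `reverse_complement` and `translate` are our own helpers, not the biopython or concrete.biopython implementations. The 64-letter string is the standard codon table written compactly, with codons enumerated in the base order T, C, A, G and `*` marking a stop.

```python
from itertools import product

# Standard codon table, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# A pairs with T, C pairs with G
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Read the strand codon by codon; trailing bases that do not form a codon are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(reverse_complement("ACCAGGTAC"))  # GTACCTGGT
print(translate("GTACCTGGTCGT"))        # VPGR
```

Here `GTACCTGGTCGT` splits into the codons `GTA`, `CCT`, `GGT` and `CGT`, which encode valine (V), proline (P), glycine (G) and arginine (R).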

Please read the **biopython** [documentation](https://biopython.org/wiki/Documentation) or [tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) for further details on the functions used.

For the example, let's create two short sequences (with no biological meaning) and test the `process_seq` function on this unencrypted data:

In [3]:
seq1 = Seq('ACCAGGTAC')
seq2 = MutableSeq('CGTTAGC')
output_seq = process_seq(seq1, seq2)
print(output_seq)

VPGR


### Processing steps with FHE

Let's see how to compile a circuit that takes Seq objects as input:

In [6]:
# compile the process_seq function and create a wrapped circuit from Seq.circuit that takes Seq objects
compiler = SeqCompiler(lambda data1,data2: process_seq(data1, data2), {"data1": "encrypted", "data2": "encrypted"})
seqCircuit = compiler.compile(
    # provide a dictionary of all possible characters instead of an inputset;
    # the inputset will be generated internally from it
    dictionnary='ACGT',
    # the configuration options are the same as for a regular fhe compiler
    configuration=fhe.Configuration(
        enable_unsafe_features=True,
        use_insecure_key_cache=True,
        insecure_key_cache_location=".keys",
        dataflow_parallelize=False, # setting it to True makes the jupyter kernel crash
    ),
    verbose=False,
)
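The comment in the cell above says that the inputset is generated internally from the set of possible characters. One plausible way to do this is sketched below in plain Python: draw random sequences over the alphabet and represent them as integers. The function name `make_inputset`, its parameters and the integer representation are hypothetical; this is not the library's actual implementation.

```python
import random

def make_inputset(alphabet, lengths, n_samples=10, seed=0):
    """Build n_samples tuples of random integer-encoded sequences, one sequence per circuit input.

    alphabet: string of all possible characters, e.g. 'ACGT'
    lengths: length of each input sequence, e.g. (9, 7) for seq1 and seq2
    """
    rng = random.Random(seed)
    return [
        tuple(
            [rng.randrange(len(alphabet)) for _ in range(length)]
            for length in lengths
        )
        for _ in range(n_samples)
    ]

# two inputs of lengths 9 and 7, matching seq1 and seq2 from the example
inputset = make_inputset("ACGT", lengths=(9, 7), n_samples=3)
for sample in inputset:
    print(sample)
```

An inputset of this kind gives the compiler representative values for each input, so it can determine the value ranges the circuit has to support.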

### Execution 

We can now run our wrapped circuit on the variables **seq1** and **seq2** defined earlier and compare the output sequence with the one obtained earlier on clear data:

In [None]:
# now we can run our circuit in FHE and compare the result with output_seq

# with simulation
fheSim_output_seq = seqCircuit.simulate(seq1, seq2)
print('Simulated FHE:', fheSim_output_seq)
assert(output_seq == fheSim_output_seq)

# and with encryption
fhe_output_seq = seqCircuit.encrypt_run_decrypt(seq1, seq2)
print('FHE :', fhe_output_seq)
assert(output_seq == fhe_output_seq)

### Going further

In a real-world application, the data could be sent to a remote server for FHE processing. In this scenario, one could proceed as follows:

In [None]:
# encrypt your Seq objects before sending them to a remote server
encryptedSeq1, encryptedSeq2 = seqCircuit.encrypt(seq1, seq2)

# process_server is a placeholder for the remote call; it is not defined in this tutorial
encrypted_out = await process_server(encryptedSeq1, encryptedSeq2)

# decrypt the result locally
output = seqCircuit.decrypt(encrypted_out)

Bio.Seq.<span style="color:green">**Seq**</span> objects can be converted to and from concrete.biopython.Seq.<span style="color:orange">**Seq**</span> objects:

In [None]:
from Bio.Seq import Seq as bioSeq

# from Bio.Seq to concrete.biopython.Seq
bio_seq1 = bioSeq('ACGTACGT')
concrete_seq1 = Seq(bio_seq1)

# the other way around
concrete_seq2 = Seq('ACGTACGT')
bio_seq2 = bioSeq( str(concrete_seq2) )
# or:
bio_seq2_2 = concrete_seq2.to(bioSeq)