
# concrete.biopython quickstart

**concrete.biopython** is an FHE library based on the Python [**biopython**](https://biopython.org/) library. It implements the same objects and functions wherever they are compatible with FHE.

In biology, data are often sensitive, so protecting their privacy is a major concern. Using FHE to process sensitive data such as human DNA, laboratory research, or hospital patients' personal data guarantees the privacy of the processing: no one other than the data owner has access to the data or to the result of the processing.

### FheSeq class

<span style="color:orange">**FheSeq**</span> is the FHE implementation of biopython <span style="color:green">**Seq**</span>.

**Biopython** <span style="color:green">**Seq**</span> objects are constructed from a string, generally representing a **DNA**, **RNA** or **protein** sequence, and provide functions to process it. Because <span style="color:green">**Seq**</span> objects are immutable, <span style="color:green">**MutableSeq**</span> can be used when a mutable sequence is needed.

<span style="color:orange">**FheSeq**</span> implements the same functions as <span style="color:green">**Seq**</span> (when they are compatible with FHE), operating on an encrypted array of integers which encodes the string sequence. <span style="color:orange">**FheMutableSeq**</span> is also available.
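As a rough illustration of what "an array of integers which encodes the string sequence" can look like, here is a plain-Python sketch with no encryption involved. The alphabet, the letter-to-integer mapping and the helper names `to_integers` and `to_string` are assumptions made for this example; the encoding actually used by the library may differ.

```python
# Hypothetical alphabet: the real library may use a different letter set and order.
ALPHABET = "ACGT"
LETTER_TO_INT = {letter: index for index, letter in enumerate(ALPHABET)}


def to_integers(sequence):
    """Encode a sequence string as a list of small integers."""
    return [LETTER_TO_INT[letter] for letter in sequence]


def to_string(integers):
    """Decode a list of integers back into a sequence string."""
    return "".join(ALPHABET[value] for value in integers)


encoded = to_integers("ACCAGGTAC")
print(encoded)             # [0, 1, 1, 0, 2, 2, 3, 0, 1]
print(to_string(encoded))  # ACCAGGTAC
```

In an FHE circuit, it is an integer array of this kind that gets encrypted, and the sequence operations are then expressed as operations on that array.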

### BioCircuit class

To work directly with sequence objects rather than arrays of integers, the <span style="color:purple">**BioCircuit**</span> class can be used. It wraps a Concrete circuit compiler and internally handles the conversion between integer arrays and sequence objects.

## I. Making a simple circuit
We start with the simplest way to use the library: the <span style="color:purple">**BioCircuit**</span> class, which makes it easy to create circuits that process sequence objects. With this approach, a single circuit cannot process both sequence objects and integer arrays; the following section explains how to do that.



### Working with FheSeq and FheMutableSeq objects
First of all, we need to import **numpy** and **concrete.fhe**, as well as <span style="color:green">**Seq**</span>  and <span style="color:green">**MutableSeq**</span>  from **Bio.Seq**.
Then we import <span style="color:orange">**FheSeq**</span>  and <span style="color:orange">**FheMutableSeq**</span>  from **concrete_biopython.FheSeq**, and also **BioCircuit.**<span style="color:purple">**BioCircuit**</span> to create the circuit.

In [1]:
import numpy as np
from concrete import fhe
from Bio.Seq import Seq, MutableSeq

import sys, os
sys.path.append(os.path.dirname(os.getcwd()))

from concrete_biopython.FheSeq import FheSeq, FheMutableSeq
from concrete_biopython.BioCircuit import BioCircuit

Let's define an arbitrary function `process_seq` that takes as input a Seq object `seq1` and a MutableSeq object `seq2`, both representing DNA sequences, processes them with some of the operations offered by the biopython library, and returns a short protein sequence.

In [2]:
def process_seq(seq1, seq2):
    seq2.pop()
    new_seq = seq1.reverse_complement() + seq2[0:3]
    protein = new_seq.translate('Standard')
    return protein

As a reminder, [**DNA**](https://en.wikipedia.org/wiki/DNA) strands are very long sequences of nucleotides, which come in four types depending on the nitrogenous base they hold. Each type is represented by a letter, either '**A**', '**C**', '**G**' or '**T**'. Each strand is attached to a complementary strand, where '**A**' bases are linked to '**T**' bases and vice versa, and likewise for '**C**' and '**G**'. Some parts of the strand, called genes, encode proteins. In a gene, bases are read in groups of three called codons, where every codon encodes an amino acid (or a stop where the gene ends). Amino acids are chained to form a protein. The conversion from a gene to a protein is called [**translation**](https://en.wikipedia.org/wiki/Translation_(biology)), and follows an encoding from codons to amino acids that can vary. In our case, we will use the standard [codon table](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables).
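To make this reminder concrete, here is a minimal plain-Python sketch of complementing and translating a DNA string. It is only an illustration and uses neither biopython nor FHE; `reverse_complement` and `translate` are helpers defined here for the example, and the codon table is the standard genetic code written in its usual compact form (codons enumerated in T, C, A, G order).

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Complement every base and read the strand in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def translate(dna):
    """Translate codon by codon; '*' marks a stop, incomplete trailing codons are ignored."""
    usable_length = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable_length, 3))


print(reverse_complement("ACCAGGTAC"))  # GTACCTGGT
print(translate("GTACCTGGTCGT"))        # VPGR
```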

Please read the **biopython** [documentation](https://biopython.org/wiki/Documentation) or [tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) for more details on the functions used.

For the example, let's create two short sequences (with no biological meaning) and test the `process_seq` function on this unencrypted data:

In [3]:
seq1 = Seq('ACCAGGTAC')
seq2 = MutableSeq('CGTTAGC')
output_seq = process_seq(seq1, seq2)
print(output_seq)

VPGR


The result is a very small protein made of 4 amino acids, with no biological meaning.

### Processing steps with FHE

Now we will see how to run the function **process_seq** homomorphically, using <span style="color:orange">**FheSeq**</span> and <span style="color:orange">**FheMutableSeq**</span> objects.

In [4]:
# prepare the compiler configuration
configuration=fhe.Configuration(
    enable_unsafe_features=True,
    use_insecure_key_cache=True,
    insecure_key_cache_location=".keys",
    dataflow_parallelize=False,
)

# specify for each sequence if it needs to be encrypted or clear, for instance:
encryption = { "seq1": "encrypted", "seq2":"clear" }

# Create a BioCircuit-wrapped circuit from the process_seq function, the encryption settings,
# a representative list of sequences that we would like to process, and the configuration
circuit = BioCircuit(
    function=process_seq,
    encryption=encryption,
    seq_list=[seq1, seq2],
    configuration=configuration,
    # any other concrete compiler arguments;
    verbose=False,
)

In [5]:
circuit.encrypt_run_decrypt(seq1, seq2)

ValueError: Concrete cannot represent Seq('ACCAGGTAC')

## II. Making more complex circuits

In this section, we will not use **BioCircuit**; instead, we will study how to build a more complex circuit with the library.

As <span style="color:orange">**FheSeq**</span> works within a FHE circuit, it is agnostic to whatever comes before or after the circuit. The class <span style="color:#5CC8FF">**SeqWrapper**</span> is thus used to interface <span style="color:green">**Seq**</span> objects from outside the circuit to <span style="color:orange">**FheSeq**</span> objects inside it during encryption, and the other way around at decryption.

### Working with FheSeq and FheMutableSeq objects
First of all, we need to import **numpy** and **concrete.fhe**, as well as <span style="color:green">**Seq**</span>  and <span style="color:green">**MutableSeq**</span>  from **Bio.Seq**.
Then we import <span style="color:orange">**FheSeq**</span>  and <span style="color:orange">**FheMutableSeq**</span>  from **concrete_biopython.FheSeq**, and also **SeqWrapper.**<span style="color:#5CC8FF">**SeqWrapper**</span>, which lets us interface the two libraries.

In [1]:
import numpy as np
from concrete import fhe
from Bio.Seq import Seq, MutableSeq

import sys, os
sys.path.append(os.path.dirname(os.getcwd()))

from concrete_biopython.FheSeq import FheSeq, FheMutableSeq
from concrete_biopython.SeqWrapper import SeqWrapper

Let's define an arbitrary function **process_seq** that takes as input a <span style="color:green">**Seq**</span> object **seq1** and a <span style="color:green">**MutableSeq**</span> object **seq2**, both representing DNA sequences, processes them with some of the operations offered by the biopython library, and returns a short protein sequence.

In [2]:
def process_seq(seq1, seq2):
    seq2.pop()
    new_seq = seq1.reverse_complement() + seq2[0:3]
    protein = new_seq.translate('Standard')
    return protein

As a reminder, [**DNA**](https://en.wikipedia.org/wiki/DNA) strands are very long sequences of nucleotides, which come in four types depending on the nitrogenous base they hold. Each type is represented by a letter, either '**A**', '**C**', '**G**' or '**T**'. Each strand is attached to a complementary strand, where '**A**' bases are linked to '**T**' bases and vice versa, and likewise for '**C**' and '**G**'. Some parts of the strand, called genes, encode proteins. In a gene, bases are read in groups of three called codons, where every codon encodes an amino acid (or a stop where the gene ends). Amino acids are chained to form a protein. The conversion from a gene to a protein is called [**translation**](https://en.wikipedia.org/wiki/Translation_(biology)), and follows an encoding from codons to amino acids that can vary. In our case, we will use the standard [codon table](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables).

Please read the **biopython** [documentation](https://biopython.org/wiki/Documentation) or [tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) for more details on the functions used.

For the example, let's create two short sequences (with no biological meaning) and test the `process_seq` function on this unencrypted data:

In [3]:
seq1 = Seq('ACCAGGTAC')
seq2 = MutableSeq('CGTTAGC')
output_seq = process_seq(seq1, seq2)
print(output_seq)

VPGR


The result is a very small protein made of 4 amino acids, with no biological meaning.

### Processing steps with FHE

Now we will see how to run the function **process_seq** homomorphically, using <span style="color:orange">**FheSeq**</span> and <span style="color:orange">**FheMutableSeq**</span> objects. The function can be reused as is on <span style="color:orange">**FheSeq**</span> and <span style="color:orange">**FheMutableSeq**</span> objects inside a circuit, but some extra work is needed to start from unencrypted <span style="color:green">**Seq**</span> and <span style="color:green">**MutableSeq**</span> objects before processing and to return to them after processing.

We will need to follow the steps below:
1. Convert the <span style="color:green">**Seq**</span> and <span style="color:green">**MutableSeq**</span> objects to integer arrays with <span style="color:#5CC8FF">**SeqWrapper**</span>**.toIntegers**
2. Encrypt the integer arrays (this is done within the `circuit.encrypt_run_decrypt` function)
3. Create <span style="color:orange">**FheSeq**</span> and <span style="color:orange">**FheMutableSeq**</span> objects from the encrypted integer arrays inside the circuit
4. Call our function **process_seq** homomorphically on the FHE sequence objects
5. Convert the output back from a <span style="color:orange">**FheSeq**</span> to an encrypted array with <span style="color:orange">**FheSeq**</span>**.toArray**
6. Decrypt the encrypted output array (again done within the `circuit.encrypt_run_decrypt` function)
7. Convert the array back to a <span style="color:green">**Seq**</span> object using <span style="color:#5CC8FF">**SeqWrapper**</span>**.toSeq**

<div>
<img src="https://rcd-media.com/docs/fhe/diagram-im.png" width="650"/>
</div>


### Factorization
For steps **1**, **2**, **6** and **7**, we can create a circuit wrapper that performs the Seq-to-integer and integer-to-Seq conversions in an FHE-compatible way:

In [4]:
# wrap a fhe circuit in order to input and output Bio.Seq objects.
def circuit_wrapper(circuit, seq1, seq2, simulate=False):
    # convert Seq objects to integers with SeqWrapper.toIntegers
    integers1 = SeqWrapper.toIntegers(seq1)
    integers2 = SeqWrapper.toIntegers(seq2)
    
    # run the circuit with integer inputs
    integer_output = circuit.simulate(integers1, integers2) if simulate else circuit.encrypt_run_decrypt(integers1, integers2)

    # convert the integer output back into a Seq object with SeqWrapper.toSeq
    return SeqWrapper.toSeq(integer_output)

For steps **3** and **5**, we can create an adapter function that will be the circuit's main function. This function will create <span style="color:orange">**FheSeq**</span>  and <span style="color:orange">**FheMutableSeq**</span>  objects for the function **process_seq** to process homomorphically. Then it will convert back the <span style="color:orange">**FheSeq**</span>  output into an encrypted integer array that the circuit can decrypt.

In [5]:
def process_seq_adapter(integer_seq1, integer_seq2):
    
    # convert integer sequences into FheSeq and FheMutableSeq objects
    seq1=FheSeq(integer_seq1)
    seq2=FheMutableSeq(integer_seq2)
    
    # process the sequence objects with our function
    new_seq = process_seq(seq1, seq2)
    
    # return the new sequence as integer array
    return new_seq.toArray()

For step **4**, let's compile a circuit from the **process_seq_adapter** function, using an inputset that accounts for the number of possible letters with **SeqWrapper.maxInteger()** and for the size of the input sequences with **len(seq1)** and **len(seq2)**:

In [6]:
# compile the process_seq_adapter function and create a circuit
compiler = fhe.Compiler(lambda data1,data2: process_seq_adapter(data1, data2), {"data1": "encrypted", "data2": "encrypted"})
circuit = compiler.compile(
    inputset=[
    (np.random.randint(0, SeqWrapper.maxInteger()+1, size=(len(seq1),)),
    np.random.randint(0, SeqWrapper.maxInteger()+1, size=(len(seq2),)))
    for _ in range(100)
    ],
    configuration=fhe.Configuration(
        enable_unsafe_features=True,
        use_insecure_key_cache=True,
        insecure_key_cache_location=".keys",
        dataflow_parallelize=False, # setting it to True makes the jupyter kernel crash
    ),
    verbose=False,
)

### Execution 

We can now run our wrapped circuit on the variables **seq1** and **seq2** defined earlier and compare the output sequence with the one obtained earlier on clear data:

In [7]:
# now we can run our circuit on Seq objects and compare the result with output_seq

# with simulation
fheSim_output_seq = circuit_wrapper(circuit, seq1, seq2, True)
print('Simulated :', fheSim_output_seq)
assert(output_seq == fheSim_output_seq)

# and without (slower)
fhe_output_seq = circuit_wrapper(circuit, seq1, seq2, False)
print('FHE :', fhe_output_seq)
assert(output_seq == fhe_output_seq)

Simulated : VPGR
FHE : VPGR




### Going further

In a real-world application, the data could be sent to a remote server for processing. In this scenario, the input DNA sequences could be private data. To keep the DNA private, the `circuit_wrapper` function could be changed as below, using an encryption setup with an **evaluation** key and a **private** key:
1. Convert DNA sequences to integer arrays
2. Encrypt the sequences locally with `circuit.encrypt` with the **private key**
3. Send the encrypted sequences and the **evaluation key** to the distant server
4. Run the circuit on the server with `circuit.run` and the **evaluation key**
5. Send back the output sequence from the server
6. Decrypt the output array locally with `circuit.decrypt` and the **private key**
7. Convert back the decrypted array to a sequence object
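
The steps above can be sketched as a variant of `circuit_wrapper` that separates encryption, execution and decryption. This is only a sketch under several assumptions: `run_remotely` is a hypothetical name, the conversion functions are passed in as parameters (for instance `SeqWrapper.toIntegers` and `SeqWrapper.toSeq`), `circuit.encrypt` is assumed to return one encrypted value per input, and everything runs in a single process. In a real deployment the encrypted values and the evaluation key would have to be serialized and sent to the server, which is not shown here.

```python
def run_remotely(circuit, to_integers, to_seq, seq1, seq2):
    """Split encrypt_run_decrypt into its client-side and server-side parts."""
    # step 1 (client): convert the sequences to integer arrays
    integers1 = to_integers(seq1)
    integers2 = to_integers(seq2)

    # step 2 (client): encrypt with the private key
    # (assumption: one encrypted value is returned per input)
    encrypted_inputs = circuit.encrypt(integers1, integers2)

    # steps 3 to 5 (server): run the circuit on encrypted data only;
    # sending the data and the evaluation key over the network is omitted
    encrypted_output = circuit.run(*encrypted_inputs)

    # step 6 (client): decrypt the result with the private key
    integer_output = circuit.decrypt(encrypted_output)

    # step 7 (client): convert the integer array back to a sequence object
    return to_seq(integer_output)
```

With the circuit compiled earlier, a call could look like `run_remotely(circuit, SeqWrapper.toIntegers, SeqWrapper.toSeq, seq1, seq2)`.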