
# concrete.biopython quickstart

**concrete.biopython** is an FHE library based on the Python [**biopython**](https://biopython.org/) library. It implements the same objects and functions wherever they are compatible with FHE.

In biology, data are often sensitive, so protecting their privacy is a major concern. Using FHE to process sensitive data such as human DNA, laboratory research, or hospital patients' personal data guarantees the privacy of the processing: no one other than the data owners has access to the data or to the result of the processing.

## I. FheSeq, FheMutableSeq and SeqInterface classes

<span style="color:orange">**FheSeq**</span> is the FHE implementation of biopython <span style="color:green">**Seq**</span>.

**Biopython** <span style="color:green">**Seq**</span> objects are constructed from a string, generally representing a **DNA**, **RNA** or **protein** sequence, and provide functions to process this sequence. Because <span style="color:green">**Seq**</span> objects are immutable, <span style="color:green">**MutableSeq**</span> can be used when a mutable sequence is needed.

<span style="color:orange">**FheSeq**</span> implements the same functions as <span style="color:green">**Seq**</span> (when they are compatible with FHE), operating on an encrypted array of integers which encodes the string sequence. <span style="color:orange">**FheMutableSeq**</span> is also available.

As <span style="color:green">**Seq**</span> objects and <span style="color:orange">**FheSeq**</span> objects can be used the same way, we can write generic code that handles both.

To build <span style="color:orange">**FheSeq**</span> and <span style="color:orange">**FheMutableSeq**</span> objects, a <span style="color:#5CC8FF">**SeqInterface**</span> object is required. This <span style="color:#5CC8FF">**SeqInterface**</span> object is built with an argument providing the minimal set of characters that the sequences will contain, which we call an **alphabet**. Choosing the right alphabet is crucial for the speed of the circuits: the smaller the alphabet, the smaller the bitwidth of the sequence characters, and the faster the computations.
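
To make the link between alphabet size and bitwidth concrete, here is a minimal plain-Python sketch. It assumes a position-based encoding (each letter is mapped to its index in the alphabet), which is only an illustration: the encoding actually used by SeqInterface may differ, and the helper names `bits_per_character`, `encode` and `decode` are hypothetical.

```python
# Illustration only: a position-based encoding, assumed for this sketch.
# The real SeqInterface encoding may differ.

def bits_per_character(alphabet):
    """Number of bits needed to represent any index of the alphabet."""
    return max(1, (len(alphabet) - 1).bit_length())

def encode(sequence, alphabet):
    """Map each letter to its position in the alphabet."""
    return [alphabet.index(letter) for letter in sequence]

def decode(integers, alphabet):
    """Map each integer back to the corresponding letter."""
    return "".join(alphabet[i] for i in integers)

dna_alphabet = "ACGT"                                   # 4 letters
letters_alphabet = "*ABCDEFGHIJKLMNOPQRSTUVWXYZ"        # 27 letters
ascii_alphabet = "".join(chr(i) for i in range(128))    # 128 letters

encoded = encode("ACCAGGTAC", dna_alphabet)
print(encoded)                        # [0, 1, 1, 0, 2, 2, 3, 0, 1]
print(decode(encoded, dna_alphabet))  # ACCAGGTAC

for name, alphabet in [("DNA", dna_alphabet),
                       ("letters", letters_alphabet),
                       ("ASCII", ascii_alphabet)]:
    print(name, ":", len(alphabet), "letters,", bits_per_character(alphabet), "bits")
```

Under this assumption, a 4-letter alphabet fits in 2 bits per character, a 27-letter alphabet needs 5, and a 128-character alphabet needs 7.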


## II. Making a circuit for processing sequences with the BioCircuit class

Let's start with the simplest way of using the **concrete.biopython** library. With the <span style="color:purple">**BioCircuit**</span> class, we can easily create circuits for processing sequence objects.

Unlike a regular concrete circuit, a <span style="color:purple">**BioCircuit**</span> takes sequence objects as input, so we can work directly with sequences rather than arrays of integers, and it can output either an integer array or a sequence object. It wraps a concrete circuit compiler and internally handles the conversion from <span style="color:green">**Seq**</span> to **numpy integer arrays** before encryption, and from **concrete numpy encrypted integer arrays** to <span style="color:orange">**FheSeq**</span> after encryption. The reverse conversions are done before and after decryption.

First of all, we need to import **numpy** and **concrete.fhe**, as well as <span style="color:green">**Seq**</span>  and <span style="color:green">**MutableSeq**</span>  from **Bio.Seq**.
Then we import <span style="color:#5CC8FF">**SeqInterface**</span>, **Alphabets**, <span style="color:orange">**FheSeq**</span>, and <span style="color:orange">**FheMutableSeq**</span> from **concrete_biopython.FheSeq**, and <span style="color:purple">**BioCircuit**</span> from **concrete_biopython.BioCircuit** to create the circuit.

In [1]:
import numpy as np
from concrete import fhe
from Bio.Seq import Seq, MutableSeq

import sys, os
sys.path.append(os.path.dirname(os.getcwd()))

from concrete_biopython.FheSeq import SeqInterface, Alphabets, FheSeq, FheMutableSeq
from concrete_biopython.BioCircuit import BioCircuit

Let's define a dummy function `process_seq` that takes as input a Seq and a MutableSeq object, `seq1` and `seq2`, representing DNA sequences, processes them using some of the features offered by the biopython library, and returns a short protein sequence.

In [2]:
def process_seq(seq1, seq2):
    seq2.pop()
    new_seq = seq1.reverse_complement() + seq2[0:3]
    protein = new_seq.translate('Standard')
    return protein

As a reminder, [**DNA**](https://en.wikipedia.org/wiki/DNA) strands are very long sequences of nucleotides, which come in four types depending on the nitrogenous base they hold. Each type is represented by a letter, either '**A**', '**C**', '**G**' or '**T**'. Each strand is attached to a complementary strand, where the '**A**' bases are linked to '**T**' bases and vice versa, and likewise for '**C**' and '**G**'. Some parts of the strand, called genes, encode proteins. In a gene, bases are read in groups of three called codons, where every codon encodes an amino acid (or a stop when the gene ends). Amino acids are chained to form a protein. The conversion from a gene to a protein is called [**translation**](https://en.wikipedia.org/wiki/Translation_(biology)), and follows an encoding from codons to amino acids that can vary. In our case, we will use the standard [codon table](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables).
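
The reminder above can be sketched in plain Python, without biopython or FHE. The helper names `reverse_complement` and `translate` and the `PARTIAL_CODON_TABLE` dictionary are hypothetical names chosen for this illustration, and the dictionary lists only a handful of entries from the standard codon table.

```python
# Plain-Python illustration of complement, codons and translation.
# Only a few entries of the standard codon table are listed here.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

PARTIAL_CODON_TABLE = {
    "ATG": "M",  # methionine
    "GTA": "V",  # valine
    "CCT": "P",  # proline
    "GGT": "G",  # glycine
    "CGT": "R",  # arginine
    "TAA": "*",  # stop
}

def reverse_complement(dna):
    """Complement every base, then read the strand backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def translate(dna):
    """Read the bases three at a time and map each codon to an amino acid."""
    usable_length = len(dna) - len(dna) % 3  # ignore an incomplete last codon
    codons = [dna[i:i + 3] for i in range(0, usable_length, 3)]
    return "".join(PARTIAL_CODON_TABLE[codon] for codon in codons)

print(reverse_complement("ACCAGGTAC"))  # GTACCTGGT
print(translate("ATGGTACCTTAA"))        # MVP*
```

With biopython, the same operations are provided by the sequence objects themselves, with the full codon tables.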

Please read the **biopython** [documentation](https://biopython.org/wiki/Documentation) or [tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) for more details about the functions used here.

For the example, let's create two short sequences (with no biological meaning) and test the `process_seq` function on this unencrypted data:

In [3]:
seq1 = Seq('ACCAGGTAC')
seq2 = MutableSeq('CGTTAGC')
output_seq = process_seq(seq1, seq2)
print(output_seq)

VPGR


The result is a very small protein made of 4 amino acids, with no biological meaning.

Running the function **process_seq** homomorphically on <span style="color:orange">**FheMutableSeq**</span> objects takes only a few simple steps thanks to the <span style="color:purple">**BioCircuit**</span> class, which is used in much the same way as the concrete compiler. It can handle any number of sequences, all of which get encrypted (clear sequences are not supported for now, due to limitations of the current concrete version with clear Tracers).

In [4]:
# Choose a minimal set of letters that the FheSeq will be able to contain
# This is called an alphabet
alphabet = Alphabets.PROTEINS # we need to be able to encode all possible DNA and PROTEINS letters

print("alphabet :", alphabet)

# Create a SeqInterface object that will generate the FheSeq objects within the BioCircuit
seq_interface = SeqInterface(alphabet)

# prepare the compiler configuration
configuration=fhe.Configuration(
    enable_unsafe_features=True,
    use_insecure_key_cache=True,
    insecure_key_cache_location=".keys",
    dataflow_parallelize=False,
)

# Create a BioCircuit wrapped circuit using the process_seq function,
# a list of the lengths of the sequences that we will process, the SeqInterface,
# and the fhe configuration, along with any other named compiler arguments
circuit = BioCircuit(
    function=process_seq,
    len_seqs=[len(seq1), len(seq2)],
    seq_interface=seq_interface,
    configuration=configuration,
    seq_output=True, # the output of the function is a sequence object
    show_timing=True, # whether to display timings or not
    # any other concrete compiler arguments;
    verbose=False
)

alphabet :  *ABCDEFGHIJKLMNOPQRSTUVWXYZ
|  Compiling  : 0.63 s  |


The function **process_seq** has been compiled for FHE, and it has been wrapped so that the <span style="color:green">**Seq**</span> objects given as input to the circuit are first converted to encrypted <span style="color:orange">**FheMutableSeq**</span> objects before being passed to the function. The output sequence of the function is also a <span style="color:orange">**FheMutableSeq**</span> object, so we specified `seq_output=True` to have it converted back to a <span style="color:green">**Seq**</span> object after decryption.

In [5]:
# Note that encrypt_run_decrypt takes a list of sequences along with a boolean show_timing
circuit.encrypt_run_decrypt(seq1, seq2)

|  Encrypting  : 0.19 s  |
|  Running  : 0.86 s  |
|  Decrypting  : 0.00 s  |


Seq('VPGR')

Now, try using a bigger **alphabet**, such as `Alphabets.ASCII`, and see the difference in computation time.

## III. Making a circuit without the BioCircuit class

You now know enough to make circuits for processing sequences. However, if you want to create a circuit taking both sequences and integer arrays as inputs, you can follow this section of the tutorial. In this scenario, you cannot use the <span style="color:purple">**BioCircuit**</span> class because it only supports sequence inputs.  

We will need to use the class <span style="color:#5CC8FF">**SeqInterface**</span> which is used to interface <span style="color:green">**Seq**</span> objects from outside the circuit to <span style="color:orange">**FheSeq**</span> objects inside it at encryption, and the other way around at decryption.  

We will first build the exact same circuit as in the example, but without using the <span style="color:purple">**BioCircuit**</span>, to show how it can be done. Then we will discuss how such a circuit could take other types of inputs than sequences, such as integer arrays.  

To make the same circuit as in the previous section, but without using the <span style="color:purple">**BioCircuit**</span> class, we need additional work to go from <span style="color:green">**Seq**</span> and <span style="color:green">**MutableSeq**</span> objects to <span style="color:orange">**FheSeq**</span> and <span style="color:orange">**FheMutableSeq**</span> objects, following the steps below:
1. Convert the <span style="color:green">**Seq**</span> and <span style="color:green">**MutableSeq**</span> objects to integer arrays with <span style="color:#5CC8FF">**SeqInterface**</span>**.to_integers**
2. Encrypt the integer arrays (this is done within the `circuit.encrypt_run_decrypt` function)
3. Create <span style="color:orange">**FheSeq**</span> and <span style="color:orange">**FheMutableSeq**</span> objects from the encrypted integer arrays inside the circuit using <span style="color:#5CC8FF">**SeqInterface**</span>**.FheSeq** and <span style="color:#5CC8FF">**SeqInterface**</span>**.FheMutableSeq**
4. Call our function **process_seq** homomorphically on the FHE sequence objects
5. Convert the output back from a <span style="color:orange">**FheSeq**</span> to an encrypted array with <span style="color:orange">**FheSeq**</span>**.to_array**
6. Decrypt the encrypted output array (again done within the `circuit.encrypt_run_decrypt` function)
7. Convert the array back to a <span style="color:green">**Seq**</span> object using <span style="color:#5CC8FF">**SeqInterface**</span>**.array_to_seq**

<div>
<img src="https://rcd-media.com/docs/fhe/diagram2.jpg" width="650"/>
</div>


### Factorization
For steps **1**, **2**, **6** and **7**, we can create a circuit wrapper to do the Seq to integer and integer to Seq conversions in an FHE-compatible way:

In [6]:
# wrap a fhe circuit in order to input and output Bio.Seq objects.
def circuit_wrapper(circuit, seq1, seq2, simulate=False):
    # convert Seq objects to integers with seq_interface.to_integers
    integers1 = seq_interface.to_integers(seq1)
    integers2 = seq_interface.to_integers(seq2)
    
    # run the circuit with integer inputs
    integer_output = circuit.simulate(integers1, integers2) if simulate else circuit.encrypt_run_decrypt(integers1, integers2)

    # convert the integer output back into a Seq object with seq_interface.array_to_seq
    return seq_interface.array_to_seq(integer_output)

For steps **3** and **5**, we can create an adapter function that will be the circuit's main function. This function will create <span style="color:orange">**FheSeq**</span>  and <span style="color:orange">**FheMutableSeq**</span>  objects for the function **process_seq** to process homomorphically. Then it will convert back the <span style="color:orange">**FheSeq**</span>  output into an encrypted integer array that the circuit can decrypt.

In [7]:
def process_seq_adapter(integer_seq1, integer_seq2):
    
    # convert integer sequences into FheSeq and FheMutableSeq objects
    seq1=seq_interface.FheSeq(integer_seq1)
    seq2=seq_interface.FheMutableSeq(integer_seq2)
    
    # process the sequence objects with our function
    new_seq = process_seq(seq1, seq2)
    
    # return the new sequence as integer array
    return new_seq.to_array()

For step **4**, let's build the circuit from the **process_seq_adapter** function, and create a suitable inputset that takes into account the number of possible letters, with **seq_interface.max_integer()**, and the size of the input sequences, with **len(seq1)** and **len(seq2)**:

In [8]:
# compile the process_seq_adapter function and create a circuit
compiler = fhe.Compiler(lambda data1,data2: process_seq_adapter(data1, data2), {"data1": "encrypted", "data2": "encrypted"})
circuit = compiler.compile(
    inputset=[
    (np.random.randint(0, seq_interface.max_integer()+1, size=(len(seq1),)),
    np.random.randint(0, seq_interface.max_integer()+1, size=(len(seq2),)))
    for _ in range(100)
    ],
    configuration=fhe.Configuration(
        enable_unsafe_features=True,
        use_insecure_key_cache=True,
        insecure_key_cache_location=".keys",
        dataflow_parallelize=False, # setting it to True makes the jupyter kernel crash
    ),
    verbose=False,
)

### Execution 

We can now run our wrapped circuit on the variables **seq1** and **seq2** defined earlier and compare the output sequence with the one obtained earlier:

In [9]:
# now we can run our circuit on Seq objects and compare the result with output_seq

# with simulation
fheSim_output_seq = circuit_wrapper(circuit, seq1, seq2, True)
print('Simulated :', fheSim_output_seq)
assert(output_seq == fheSim_output_seq)

# and without (slower)
fhe_output_seq = circuit_wrapper(circuit, seq1, seq2, False)
print('FHE :', fhe_output_seq)
assert(output_seq == fhe_output_seq)

Simulated : VPGR
FHE : VPGR


### Introducing other types of inputs

Now that we have seen how to make the same circuit without using the <span style="color:purple">**BioCircuit**</span>, we can slightly modify it so as to replace the second sequence by a regular integer array. For instance, we can modify it this way:

In [10]:
# some function processing a sequence and an integer array
def process_seq(seq1, array2):
    return len(seq1) == np.sum(array2)
    
# wrap a fhe circuit in order to input a Bio.Seq object along with an integer array
# output either an array or a Seq object
def circuit_wrapper(circuit, seq1, array2, seq_output=False, simulate=False):
    # convert Seq objects to integers with seq_interface.to_integers
    integers1 = seq_interface.to_integers(seq1)
    
    # run the circuit with integer inputs
    output = circuit.simulate(integers1, array2) if simulate else circuit.encrypt_run_decrypt(integers1, array2)
    
    if seq_output:
        # convert to Seq if required
        output = seq_interface.array_to_seq(output)

    return output

def process_seq_adapter(integer_seq1, integers2):    
    # convert the first integer sequence into a FheSeq
    seq1=seq_interface.FheSeq(integer_seq1)
    
    # process the sequence object and array with our function
    output = process_seq(seq1, integers2)
    
    # convert back to array if the output is a FheSeq or FheMutableSeq
    if isinstance(output, (FheSeq, FheMutableSeq)):
        output = output.to_array()
        
    return output

min_int = 0 # the minimum possible value for the integer array
max_int = 1 # the maximum possible value for the integer array

seq1 = Seq('ACCAGG')
array2 = np.array([0,1,1,0,1,0,0,1,1,1]) # some numpy array with values between min_int and max_int

# compile the process_seq_adapter function and create a circuit
compiler = fhe.Compiler(lambda data1,data2: process_seq_adapter(data1, data2), {"data1": "encrypted", "data2": "encrypted"})
circuit = compiler.compile(
    inputset=[
    (np.random.randint(0, seq_interface.max_integer()+1, size=(len(seq1),)),
    np.random.randint(min_int, max_int+1, size=(len(array2),)))
    for _ in range(100)
    ],
    configuration=fhe.Configuration(
        enable_unsafe_features=True,
        use_insecure_key_cache=True,
        insecure_key_cache_location=".keys",
        dataflow_parallelize=False, # setting it to True makes the jupyter kernel crash
    ),
    verbose=False,
)

seq_output = False # a boolean to know whether the output should be converted to a Seq

# run the circuit
fhe_output = circuit_wrapper(circuit, seq1, array2, seq_output, simulate=False)
print('FHE :', bool(fhe_output))

FHE : True
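
Before compiling such a circuit, it can help to evaluate the same comparison on clear data, so that there is a reference value to check the FHE output against. The sketch below uses a plain string and a plain list instead of sequence objects and numpy arrays; `clear_reference` is a hypothetical name for this illustration.

```python
# Clear-data reference for the comparison computed by the circuit:
# is the sequence length equal to the sum of the integer array?
def clear_reference(sequence, integers):
    return len(sequence) == sum(integers)

sequence = "ACCAGG"
integers = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

print(clear_reference(sequence, integers))       # True: 6 letters, six ones
print(clear_reference(sequence, integers[:-1]))  # False: the sum drops to 5
```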


### Going further

In a real-world application, the data could be sent to a remote server for processing. In this scenario, the input DNA sequences could be private data. To keep the DNA private, the code structure could be changed as below, using an encryption setup with a **private key** and an **evaluation key**:
1. Convert DNA sequences to integer arrays
2. Encrypt the sequences locally with `circuit.encrypt` with the **private key**
3. Send the encrypted sequences and the **evaluation key** to the distant server
4. Run the circuit on the server with `circuit.run` and the **evaluation key**
5. Send back the output sequence from the server
6. Decrypt the output array locally with `circuit.decrypt` and the **private key**
7. Convert back the decrypted array to a sequence object
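
The split between local and remote steps listed above can be sketched as three small functions built on the `encrypt`, `run` and `decrypt` calls. This is only an outline under stated assumptions: `client_encrypt`, `server_run`, `client_decrypt` and `StandInCircuit` are hypothetical names, key generation and the transfer of the evaluation key are left out, and `StandInCircuit` performs no encryption at all. It only mimics the call pattern so that the sketch runs without compiling a circuit.

```python
# Sketch of the client/server split for two inputs.
# WARNING: StandInCircuit does NOT encrypt anything. It only mimics the
# encrypt/run/decrypt call pattern and must never be used to protect data.

class StandInCircuit:
    """Insecure stand-in that tags clear values instead of encrypting them."""

    def __init__(self, function):
        self.function = function

    def encrypt(self, *args):
        return tuple(("ciphertext", arg) for arg in args)

    def run(self, *encrypted_args):
        clear_args = [payload for _, payload in encrypted_args]
        return ("ciphertext", self.function(*clear_args))

    def decrypt(self, encrypted_result):
        return encrypted_result[1]


def client_encrypt(circuit, integers1, integers2):
    # step 2: runs locally, where the private key lives
    return circuit.encrypt(integers1, integers2)

def server_run(circuit, encrypted_inputs):
    # step 4: runs remotely, on encrypted inputs only
    return circuit.run(*encrypted_inputs)

def client_decrypt(circuit, encrypted_output):
    # step 6: runs locally again
    return circuit.decrypt(encrypted_output)


# Toy processing function: element-wise sum of two integer arrays
circuit = StandInCircuit(lambda a, b: [x + y for x, y in zip(a, b)])

encrypted_inputs = client_encrypt(circuit, [0, 1, 2], [3, 0, 1])
encrypted_output = server_run(circuit, encrypted_inputs)
print(client_decrypt(circuit, encrypted_output))  # [3, 1, 3]
```

In a real deployment, the circuit would be a compiled concrete circuit, the conversions of steps 1 and 7 would wrap these calls, and the concrete documentation should be consulted for key management and serialization.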