# Examples for the BE-DataHive Database

This script covers:
1. Data exploration and functions in the BE-DataHive Python Wrapper
2. Example to fit a Gradient Boosting Regressor to BE-DataHive data for efficiency rates
3. Example to fit a neural network model to BE-DataHive data for bystander editing rates

More details about the API can be found here: [BE-DataHive Documentation](https://be-datahive.com/documentation.html)

### Data exploration and functions in the BE-DataHive Python Wrapper

In [7]:
# Import BE-DataHive package
from be_datahive import be_datahive

# Initialize the API
api = be_datahive()

In [8]:
# See study data
studies = api.get_studies()
studies


Unnamed: 0,id,title,authors,year,journal,citations,data,model,base_editors,prediction_score,description,study_link,links
0,1,Determinants of Base Editing Outcomes from Tar...,Arbab et al.,2020,Cell,56,"38,538 total pairs of sgRNAs and target sequen...",BE-Hive\nEfficiency module\n- Gradient Boosted...,"ABE, CBE",Editing Efficiency: R= 0.70\nEditing Outcome: ...,- Develop the BE-Hive machine learning model t...,https://www.sciencedirect.com/science/article/...,Paper: https://www.sciencedirect.com/science/a...
1,2,Optimization of C-to-G base editors with seque...,Yuan et al.,2021,Nature Communications,1,"sgRNA library comprising 41,388 sequences of c...",BE-SMART\n- Deep neural network model with con...,CBE,Editing Efficiency: R = 0.47 – 0.75\nFraction ...,- Develop optimized CGBEs by changing the spe...,https://www.nature.com/articles/s41467-021-252...,Paper: https://www.nature.com/articles/s41467-...
2,3,Predicting base editing outcomes using positio...,Pallaseni et al.,2021,Nucleic Acids Research,5,"This study data: 14,000 target sequences and s...",FORECasT-BE\n- Gradient Boosted Tree model\n- ...,"ABE, CBE",Editing Efficiency: R = 0.49 - 0.72,- Develop a machine learning model (FORECast-B...,https://academic.oup.com/nar/article/50/6/3551...,Paper: https://www.biorxiv.org/content/10.1101...
3,4,Predicting base editing outcomes with an atten...,Marquart et al.,2021,Nature Communications,5,"28,294 lentivirally integrated genetic sequences",BE-DICT\n- Output: per base editing probabilit...,"ABE, CBE",BE-DICT (Editing Efficiency): R= 0.55 - 0.90\n...,- Generate pooled lentiviral library of unique...,https://www.nature.com/articles/s41467-021-253...,Paper: https://www.nature.com/articles/s41467-...
4,5,Sequence-specific prediction of the efficienci...,Song et al.,2020,Nature Biotechnology,21,"Lentiviral library of 15,656 guide RNA-encodin...",DeepBaseEditor (DeepABE/DeepCBE)\n- Two to th...,"ABE, CBE",Editing Efficiency: R = 0.60 - 0.78\nFraction ...,- Develop a deep-learning-based computational...,https://www.nature.com/articles/s41587-020-0573-5,Paper: https://www.nature.com/articles/s41587-...


In [9]:
# Download efficiency data
efficiency_data = api.get_efficiency(max_rows=100)
efficiency_data.head()


Downloaded 100.00% of the request


Unnamed: 0,index,original_id,grna,pam_sequence,sequence,full_context_sequence,full_context_sequence_padded,protospace_position,pam_index,grna_sequence_match,...,one_hot_grna,one_hot_pam_sequence,one_hot_sequence,one_hot_full_context_sequence,one_hot_full_context_sequence_padded,hilbert_curve_grna,hilbert_curve_pam_sequence,hilbert_curve_sequence,hilbert_curve_full_context_sequence,hilbert_curve_full_context_sequence_padded
0,0,1_BE4_ontargetpos_editpos6_mES_AID,GCATCCGCGTGAGAACCGCA,GGG,ACCAAGGGCTGCATCCGCGTGAGAACCGCAGGGAGCAGCT,AACCAAGGGCTGCATCCGCGTGAGAACCGCAGGGAGCAGCTGGGGA...,NNNNNNNNNNNNNNNNNNNNNNNNNNAACCAAGGGCTGCATCCGCG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
1,1,2_BE4_ontargetpos_editpos6_mES_AID,GCTTTCCTAGGGGTGGAGGA,TGG,CAGTCTTAGTGCTTTCCTAGGGGTGGAGGATGGGAGTCAC,GCAGTCTTAGTGCTTTCCTAGGGGTGGAGGATGGGAGTCACCCCTA...,NNNNNNNNNNNNNNNNNNNNNNNNNNGCAGTCTTAGTGCTTTCCTA...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
2,2,3_BE4_ontargetpos_editpos6_mES_AID,TGAGTCCTGGCCTGGTATGT,GGG,AGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGG,GAGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGGGGGCC...,NNNNNNNNNNNNNNNNNNNNNNNNNNGAGAGGCCACATGAGTCCTG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
3,3,3_BE4_ontargetpos_editpos6_mES_BE4-CP1028,TGAGTCCTGGCCTGGTATGT,GGG,AGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGG,GAGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGGGGGCC...,NNNNNNNNNNNNNNNNNNNNNNNNNNGAGAGGCCACATGAGTCCTG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
4,4,3_BE4_ontargetpos_editpos6_mES_BE4,TGAGTCCTGGCCTGGTATGT,GGG,AGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGG,GAGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGGGGGCC...,NNNNNNNNNNNNNNNNNNNNNNNNNNGAGAGGCCACATGAGTCCTG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."


In [10]:
# Download bystander data
bystander_data = api.get_bystander(max_rows=100)
bystander_data


Downloaded 100.00% of the request


Unnamed: 0,index,original_id,grna,pam_sequence,sequence,full_context_sequence,full_context_sequence_padded,protospace_position,pam_index,grna_sequence_match,...,one_hot_grna,one_hot_pam_sequence,one_hot_sequence,one_hot_full_context_sequence,one_hot_full_context_sequence_padded,hilbert_curve_grna,hilbert_curve_pam_sequence,hilbert_curve_sequence,hilbert_curve_full_context_sequence,hilbert_curve_full_context_sequence_padded
0,0,1_BE4_ontargetpos_editpos6_mES_AID,GCATCCGCGTGAGAACCGCA,GGG,ACCAAGGGCTGCATCCGCGTGAGAACCGCAGGGAGCAGCT,AACCAAGGGCTGCATCCGCGTGAGAACCGCAGGGAGCAGCTGGGGA...,NNNNNNNNNNNNNNNNNNNNNNNNNNAACCAAGGGCTGCATCCGCG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
1,1,2_BE4_ontargetpos_editpos6_mES_AID,GCTTTCCTAGGGGTGGAGGA,TGG,CAGTCTTAGTGCTTTCCTAGGGGTGGAGGATGGGAGTCAC,GCAGTCTTAGTGCTTTCCTAGGGGTGGAGGATGGGAGTCACCCCTA...,NNNNNNNNNNNNNNNNNNNNNNNNNNGCAGTCTTAGTGCTTTCCTA...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
2,2,3_BE4_ontargetpos_editpos6_mES_AID,TGAGTCCTGGCCTGGTATGT,GGG,AGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGG,GAGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGGGGGCC...,NNNNNNNNNNNNNNNNNNNNNNNNNNGAGAGGCCACATGAGTCCTG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
3,3,3_BE4_ontargetpos_editpos6_mES_BE4-CP1028,TGAGTCCTGGCCTGGTATGT,GGG,AGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGG,GAGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGGGGGCC...,NNNNNNNNNNNNNNNNNNNNNNNNNNGAGAGGCCACATGAGTCCTG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
4,4,3_BE4_ontargetpos_editpos6_mES_BE4,TGAGTCCTGGCCTGGTATGT,GGG,AGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGG,GAGAGGCCACATGAGTCCTGGCCTGGTATGTGGGGGGCCGGGGGCC...,NNNNNNNNNNNNNNNNNNNNNNNNNNGAGAGGCCACATGAGTCCTG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,95,84_BE4_ontargetpos_editpos6_mES_AID,AGTACCGTGGGTACTCGAAG,GGG,TCAGCCAGGTAGTACCGTGGGTACTCGAAGGGGCTGCGTA,CTCAGCCAGGTAGTACCGTGGGTACTCGAAGGGGCTGCGTACCACA...,NNNNNNNNNNNNNNNNNNNNNNNNNNCTCAGCCAGGTAGTACCGTG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
96,96,84_BE4_ontargetpos_editpos6_mES_BE4-CP1028,AGTACCGTGGGTACTCGAAG,GGG,TCAGCCAGGTAGTACCGTGGGTACTCGAAGGGGCTGCGTA,CTCAGCCAGGTAGTACCGTGGGTACTCGAAGGGGCTGCGTACCACA...,NNNNNNNNNNNNNNNNNNNNNNNNNNCTCAGCCAGGTAGTACCGTG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
97,97,84_BE4_ontargetpos_editpos6_mES_BE4,AGTACCGTGGGTACTCGAAG,GGG,TCAGCCAGGTAGTACCGTGGGTACTCGAAGGGGCTGCGTA,CTCAGCCAGGTAGTACCGTGGGTACTCGAAGGGGCTGCGTACCACA...,NNNNNNNNNNNNNNNNNNNNNNNNNNCTCAGCCAGGTAGTACCGTG...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."
98,98,85_BE4_ontargetpos_editpos6_mES_AID,TTCCTCGTATCCCAATGCTA,AGG,GTTCCTGAAATTCCTCGTATCCCAATGCTAAGGCACCTTT,TGTTCCTGAAATTCCTCGTATCCCAATGCTAAGGCACCTTTATATG...,NNNNNNNNNNNNNNNNNNNNNNNNNNTGTTCCTGAAATTCCTCGTA...,11,31,1,...,"?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<i4', 'fortran_order': Tr...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa...","?NUMPY�v�{'descr': '<f8', 'fortran_order': Fa..."


In [11]:
# List the base editors present in the dataset (the dataset is truncated here for demonstration)
api.get_available_base_editors(efficiency_data)


array(['AID', 'BE4-CP1028', 'BE4'], dtype=object)

In [12]:
# List the cell lines present in the dataset (the dataset is truncated here for demonstration)
api.get_available_cells(efficiency_data)


array(['mES'], dtype=object)

### Example to fit a Gradient Boosting Regressor to BE-DataHive data for efficiency rates

In [13]:
# Import required packages
from sklearn.model_selection import train_test_split # pip install -U scikit-learn
from sklearn.ensemble import GradientBoostingRegressor # pip install -U scikit-learn
from sklearn.metrics import mean_squared_error # pip install -U scikit-learn

# Initialize the API
api = be_datahive()

# Get efficiency data
efficiency_data = api.get_efficiency(max_rows=100)

Downloaded 100.00% of the request


#### Convert efficiency data to machine learning arrays

#### Parameters:

- **df**: Required  
  *Type*: `pd.DataFrame`  
  The dataset retrieved from the server.

- **encoding**: Required  
  *Type*: `str`  
  Encoding standard to use. Options are:
  - `'raw'`: Returns the decoded features allowing the user to encode the features themselves.
  - `'one-hot'`: Applies one-hot encoding to the features.
  - `'hilbert-curve'`: Applies Hilbert curve encoding to the features.

- **target_col**: Optional  
  *Type*: `str`  
  The target variable to return. Defaults to `"efficiency_full_grna_calculated"`; valid values are:
  - `"efficiency_full_grna_reported"`
  - `"editing_windows_3_10_efficiency_reported"`
  - `"efficiency_full_grna_calculated"`
  - `"editing_windows_3_10_efficiency_calculated"`

- **clean**: Optional  
  *Type*: `bool`  
  Whether to clean up the dataframe by replacing `None` and `NaN` values. Default is `True`.

- **flatten**: Optional  
  *Type*: `bool`  
  Whether to flatten nested arrays, allowing the data to be used directly in machine learning models. Default is `True`.

- **base_editor**: Optional  
  *Type*: `str` or `list of str`  
  Restrict the data to the given base editor or list of base editors.

- **cell**: Optional  
  *Type*: `str` or `list of str`  
  Restrict the data to the given cell line or list of cell lines.
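
The `'hilbert-curve'` option is not described further here. As a rough illustration of the general idea only, the sketch below lays a sequence along a Hilbert curve so that bases that are neighbours in the sequence stay neighbours on a 2D grid. The function names, the integer codes for A/C/G/T, and the grid-size rule are assumptions made for this example; the wrapper's actual encoding may differ.

```python
def hilbert_d2xy(side, d):
    """Map distance d along a Hilbert curve to (x, y) on a side x side grid.

    `side` must be a power of two. This is the standard iterative
    index-to-coordinate conversion, not code taken from be_datahive.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so the sub-curves join up correctly
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_grid(sequence, codes=None):
    """Place a nucleotide sequence on the smallest power-of-two grid that fits it.

    The numeric codes are hypothetical; unused cells and unknown bases are 0.
    """
    codes = codes or {"A": 1, "C": 2, "G": 3, "T": 4}
    side = 1
    while side * side < len(sequence):
        side *= 2
    grid = [[0] * side for _ in range(side)]
    for d, base in enumerate(sequence.upper()):
        x, y = hilbert_d2xy(side, d)
        grid[y][x] = codes.get(base, 0)
    return grid


# gRNA taken from the first row of the efficiency table above
grid = hilbert_grid("GCATCCGCGTGAGAACCGCA")
for row in grid:
    print(row)
```

A 20-nt gRNA needs an 8 x 8 grid here, because a 4 x 4 grid only holds 16 positions.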

In [14]:
features, target, variable_info = api.get_efficiency_ml_arrays(
    efficiency_data,
    target_col="efficiency_full_grna_reported",
    encoding='hilbert-curve',
    clean=True,
    flatten=True,
)


In [15]:
features

array([[  1.        ,   0.        ,   0.        , ...,   0.        ,
        240.        ,  63.        ],
       [  1.        ,  14.3844157 ,  18.67997623, ...,   0.        ,
        240.        ,  63.        ],
       [  1.        ,   0.        ,   0.        , ...,   0.        ,
        240.        ,  63.        ],
       ...,
       [  1.        ,  10.33778361,  15.09301376, ...,   0.        ,
        240.        ,  63.        ],
       [  1.        ,   0.        ,   0.        , ...,   0.        ,
        240.        ,  63.        ],
       [  1.        ,   0.        ,   0.        , ...,   0.        ,
        240.        ,  63.        ]])

In [16]:
target

array([0.85240964, 0.62392344, 0.47342398, 0.0670579 , 0.50147662,
       0.69325153, 0.79674797, 0.74626866, 0.7370721 , 0.41889669,
       0.79133226, 0.69193742, 0.39846743, 0.51622419, 0.45610278,
       0.84621993, 0.68304915, 0.04534524, 0.04575557, 0.58467023,
       0.69005035, 0.44230769, 0.64660026, 0.05325621, 0.32463768,
       0.51470588, 0.56821378, 0.35130332, 0.65604305, 0.16196228,
       0.24611709, 0.54615385, 0.58872651, 0.8073903 , 0.41459716,
       0.51418101, 0.7075718 , 0.75038285, 0.19708788, 0.82776025,
       0.40044004, 0.79240122, 0.32093933, 0.26737968, 0.60934795,
       0.53421211, 0.69040248, 0.41284404, 0.16012227, 0.61363636,
       0.67427056, 0.53924915, 0.68888889, 0.45342272, 0.51309524,
       0.51959967, 0.84926471, 0.25508059, 0.21957447, 0.75038285,
       0.65011287, 0.85990338, 0.74084507, 0.69333097, 0.86117137,
       0.37327497, 0.28672537, 0.61056208, 0.69100692, 0.64268391,
       0.6222597 , 0.61674009, 0.34014502, 0.12827261, 0.64792

In [44]:
print(f'Features: {variable_info["features"][:20]}, \n Target:  {variable_info["target"]}')

Features: ['grna_sequence_match', 'energy_1', 'energy_2', 'energy_3', 'energy_4', 'energy_5', 'energy_6', 'energy_7', 'energy_8', 'energy_9', 'energy_10', 'energy_11', 'energy_12', 'energy_13', 'energy_14', 'energy_15', 'energy_16', 'energy_17', 'energy_18', 'energy_19'], 
 Target:  ['Position_-11', 'Position_-10', 'Position_-9', 'Position_-8', 'Position_-7', 'Position_-6', 'Position_-5', 'Position_-4', 'Position_-3', 'Position_-2', 'Position_-1', 'Position_0', 'Position_1', 'Position_2', 'Position_3', 'Position_4', 'Position_5', 'Position_6', 'Position_7', 'Position_8', 'Position_9', 'Position_10', 'Position_11', 'Position_12', 'Position_13', 'Position_14', 'Position_15', 'Position_16', 'Position_17', 'Position_18', 'Position_19', 'Position_20', 'Position_21', 'Position_22', 'Position_23', 'Position_24', 'Position_25', 'Position_26', 'Position_27', 'Position_28', 'Position_29', 'Position_30']


In [19]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
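
The efficiency table above contains the same gRNA measured with several base editors (for example, rows 2 to 4), so a purely random split can place the same sequence in both the training and the test set. The sketch below shows one way to split by group instead, in plain Python. The helper name `group_train_test_split` and the toy gRNA list are made up for this example, and it assumes the rows of `features` still line up with the rows of the original table, which may not hold after cleaning.

```python
import random


def group_train_test_split(groups, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so that no group appears on both sides.

    Hypothetical helper for illustration; `groups` holds one label per row,
    for example the gRNA sequence of that row.
    """
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out at least one whole group
    n_test = max(1, round(len(unique) * test_size))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# gRNAs of the first five rows shown in the efficiency table
grnas = [
    "GCATCCGCGTGAGAACCGCA",
    "GCTTTCCTAGGGGTGGAGGA",
    "TGAGTCCTGGCCTGGTATGT",
    "TGAGTCCTGGCCTGGTATGT",
    "TGAGTCCTGGCCTGGTATGT",
]
train_idx, test_idx = group_train_test_split(grnas)
print("train rows:", train_idx)
print("test rows:", test_idx)
```

scikit-learn offers the same behaviour through `GroupShuffleSplit` in `sklearn.model_selection`, which is likely the better choice once the group labels are available as an array.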


In [20]:
# Initialize and fit the gradient boosting regressor
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)


GradientBoostingRegressor()

In [21]:
# Make predictions 
y_pred = gbr.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")


Mean Squared Error: 0.03329681968947069
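
The studies table above reports prediction scores as R values, assumed here to be correlation coefficients, so a correlation is easier to set beside those numbers than a mean squared error. The sketch below computes both in plain Python; the helper names and the toy values are made up for illustration and are not results from this notebook.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


# Hypothetical efficiencies and predictions, for illustration only
y_true_demo = [0.85, 0.62, 0.47, 0.07]
y_pred_demo = [0.80, 0.55, 0.50, 0.20]
print(f"MSE: {mse(y_true_demo, y_pred_demo):.4f}")
print(f"Pearson R: {pearson_r(y_true_demo, y_pred_demo):.4f}")
```

With only 100 rows downloaded and a 20% test split, either metric rests on about 20 test points, so it should be read as a smoke test rather than a benchmark.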


### Example to fit a neural network model to BE-DataHive data for bystander editing rates

In [22]:
# Import required packages
from sklearn.model_selection import train_test_split # pip install -U scikit-learn
from sklearn.neural_network import MLPRegressor # pip install -U scikit-learn
from sklearn.metrics import mean_squared_error # pip install -U scikit-learn

# Initialize the API
api = be_datahive()

# Get bystander data
bystander_data = api.get_bystander(max_rows=100)


Downloaded 100.00% of the request


#### Convert bystander data to machine learning arrays

#### Parameters:

- **df**: Required  
  *Type*: `pd.DataFrame`  
  The dataset retrieved from the server.

- **encoding**: Required  
  *Type*: `str`  
  Encoding standard to use. Options are:
  - `'raw'`: Returns the decoded features allowing the user to encode the features themselves.
  - `'one-hot'`: Applies one-hot encoding to the features.
  - `'hilbert-curve'`: Applies Hilbert curve encoding to the features.

- **bystander_type**: Required  
  *Type*: `str`  
  The bystander task to be performed. Options are:
  - `'edited'`: Specifies the edited task.
  - `'outcome'`: Specifies the outcome task.

- **clean**: Optional  
  *Type*: `bool`  
  Whether to clean up the dataframe by replacing `None` and `NaN` values. Default is `True`.

- **flatten**: Optional  
  *Type*: `bool`  
  Whether to flatten nested arrays, allowing the data to be used directly in machine learning models. Default is `True`.

- **base_editor**: Optional  
  *Type*: `str` or `list of str`  
  Restrict the data to the given base editor or list of base editors.

- **cell**: Optional  
  *Type*: `str` or `list of str`  
  Restrict the data to the given cell line or list of cell lines.
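
With `encoding='raw'` the sequences come back undecorated so that you can encode them yourself. As a minimal sketch of what such a custom step could look like, the code below one-hot encodes a gRNA in plain Python. The function names are hypothetical, and mapping `N` (the padding character seen in `full_context_sequence_padded`) to an all-zero row is a choice made for this example, not necessarily what the wrapper's `'one-hot'` option does.

```python
def one_hot_encode(sequence, alphabet="ACGT"):
    """Return one row per base, with a 1 in the column of that base.

    Characters outside the alphabet (such as the padding 'N') become
    all-zero rows. This is an illustrative helper, not the wrapper's code.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def flatten(rows):
    """Concatenate the rows into one flat feature vector."""
    return [value for row in rows for value in row]


# gRNA taken from the first row of the bystander table above
encoded = one_hot_encode("GCATCCGCGTGAGAACCGCA")
flat = flatten(encoded)
print(len(encoded), "bases x", len(encoded[0]), "channels ->", len(flat), "features")
```

Flattening mirrors the role of `flatten=True`: estimators such as `MLPRegressor` expect one flat feature vector per sample.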


In [23]:
features, target, variable_info = api.get_bystander_ml_arrays(
    bystander_data,
    encoding='one-hot',
    bystander_type='edited',
    clean=True,
    flatten=True,
)

In [24]:
features

array([[ 1.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.        , 14.3844157 , 18.67997623, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 1.        , 10.33778361, 15.09301376, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

In [25]:
target

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [45]:
print(f'Features: {variable_info["features"][:20]}, \n Target:  {variable_info["target"]}')

Features: ['grna_sequence_match', 'energy_1', 'energy_2', 'energy_3', 'energy_4', 'energy_5', 'energy_6', 'energy_7', 'energy_8', 'energy_9', 'energy_10', 'energy_11', 'energy_12', 'energy_13', 'energy_14', 'energy_15', 'energy_16', 'energy_17', 'energy_18', 'energy_19'], 
 Target:  ['Position_-11', 'Position_-10', 'Position_-9', 'Position_-8', 'Position_-7', 'Position_-6', 'Position_-5', 'Position_-4', 'Position_-3', 'Position_-2', 'Position_-1', 'Position_0', 'Position_1', 'Position_2', 'Position_3', 'Position_4', 'Position_5', 'Position_6', 'Position_7', 'Position_8', 'Position_9', 'Position_10', 'Position_11', 'Position_12', 'Position_13', 'Position_14', 'Position_15', 'Position_16', 'Position_17', 'Position_18', 'Position_19', 'Position_20', 'Position_21', 'Position_22', 'Position_23', 'Position_24', 'Position_25', 'Position_26', 'Position_27', 'Position_28', 'Position_29', 'Position_30']


In [27]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)


In [28]:
# Define the neural network model
mlp = MLPRegressor(hidden_layer_sizes=(128, 64), activation='relu', solver='adam', max_iter=500, random_state=42)

# Train the model
mlp.fit(X_train, y_train)


MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=500, random_state=42)

In [29]:
# Make predictions
y_pred = mlp.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")


Mean Squared Error: 0.08567059524752998
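
The bystander `target` is a 2D array with one column per position, and `mean_squared_error` averages over all columns by default, so a single number can hide positions that are predicted poorly, especially when most entries are zero. A minimal sketch of a per-position breakdown in plain Python follows; the helper name and the toy values are made up for illustration. scikit-learn can return the same breakdown with `mean_squared_error(y_test, y_pred, multioutput='raw_values')`.

```python
def per_position_mse(y_true, y_pred):
    """Mean squared error for each output column separately.

    Both arguments are lists of rows (samples), each row holding one value
    per position. Hypothetical helper for illustration.
    """
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    return [
        sum((y_true[r][c] - y_pred[r][c]) ** 2 for r in range(n_rows)) / n_rows
        for c in range(n_cols)
    ]


# Toy example: three positions, two samples; only the middle position is edited
y_true_demo = [[0.0, 0.6, 0.0], [0.0, 0.4, 0.0]]
y_pred_demo = [[0.0, 0.1, 0.0], [0.0, 0.1, 0.0]]

errors = per_position_mse(y_true_demo, y_pred_demo)
overall = sum(errors) / len(errors)
print("per-position MSE:", errors)
print("averaged MSE:", overall)
```

In the toy data the two unedited positions have zero error and pull the average down, while the edited position carries all of it.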
