# OLS Library Preprocessing Notebook

**Authorship:**
Adam Klie, *09/28/2021*
***
**Description:**
Notebook to preprocess the datasets for training machine learning (ML) models. So far, there are a total of **5** feature sets to generate:
1. block encoding
2. fasta sequences
3. mixed-1.0 encoding
4. mixed-2.0 encoding
5. one-hot encoding

Details on each encoding can be found in the corresponding section below.
***
**TODOs:**
 - <font color='green'> Done TODO </font>
 - <font color='orange'> WIP TODO </font>
 - <font color='red'> Queued TODO </font>
***

# Set-up

In [2]:
# Classics
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Helpful libraries
import tqdm
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

In [3]:
import sys
sys.path.append('/cellar/users/aklie/projects/EUGENE/bin/')
import project_utils

In [4]:
ACTIVE_LOW = 0.18  # The activity below which to define inactive enhancers
ACTIVE_HIGH = 0.4  # The activity above which to define active enhancers
PREPROCESS = "{}-{}".format(ACTIVE_LOW, ACTIVE_HIGH)  # String defining the preprocessing for saving
SPLIT = 0.9  # Split into training and test sets
SUBSET = None  # Ratio of dataset to keep if testing. None if not testing

# Load data to preprocess
**Note**: There are two versions of the dataset I have to work with. See shared Google Drive for version history. The data has the following descriptors:
1. **NAME** - Site list 
2. **SEQUENCE** - Sequence
3. **MPRA_FXN** - Functional by MPRA (see slide 2)
    - Activity >.4 = functional (1)
    - Activity <.18 = non-functional (0)
    - If activity is in between => ‘na’
4. **MICROSCOPE_FXN** - Microscope functional group
    - Only for experiments with low variance and good controls (lacking *)
5. **ACTIVITY_SUMRNA_NUMDNA**
    - Activity score => sum MRPM all RNA barcodes / number DNA barcodes

In [5]:
# Load from tsv provided by Joe
OLS_data = pd.read_csv(
    "20210728-3.EnhancerTable.ForAdam.FunctionalEnhancers.WT-detected.ABL-notDetected.10R-20U-0.1P.tsv",
    sep="\t",
    na_values="na",
)
OLS_data.head(1)

Unnamed: 0,NAME,SEQUENCE,MPRA_FXN,MICROSCOPE_FXN,ACTIVITY_SUMRNA_NUMDNA
0,S1-G1R-S2-E1F-S3-E2F-S4-G2R-S5-G3F-S6,CATCTGAAGCTCGTTATCTCTAACGGAAGTTTTCGAAAAGGAAATT...,1.0,Neural Enhancer,0.611767


In [6]:
# Add labels based on the chosen activity cut-offs
OLS_data["label"] = np.nan
OLS_data.loc[OLS_data["ACTIVITY_SUMRNA_NUMDNA"] <= ACTIVE_LOW, "label"] = 0
OLS_data.loc[OLS_data["ACTIVITY_SUMRNA_NUMDNA"] >= ACTIVE_HIGH, "label"] = 1

In [7]:
# Define a set of "block" features, for lack of a better term
block_features = [
    "linker_1",
    "TFBS_1",
    "linker_2",
    "TFBS_2",
    "linker_3",
    "TFBS_3",
    "linker_4",
    "TFBS_4",
    "linker_5",
    "TFBS_5",
    "linker_6",
]

In [8]:
# Add these as columns to the dataframe
OLS_data[block_features] = OLS_data["NAME"].str.split("-").to_list()
OLS_data.head(1)

Unnamed: 0,NAME,SEQUENCE,MPRA_FXN,MICROSCOPE_FXN,ACTIVITY_SUMRNA_NUMDNA,label,linker_1,TFBS_1,linker_2,TFBS_2,linker_3,TFBS_3,linker_4,TFBS_4,linker_5,TFBS_5,linker_6
0,S1-G1R-S2-E1F-S3-E2F-S4-G2R-S5-G3F-S6,CATCTGAAGCTCGTTATCTCTAACGGAAGTTTTCGAAAAGGAAATT...,1.0,Neural Enhancer,0.611767,1.0,S1,G1R,S2,E1F,S3,E2F,S4,G2R,S5,G3F,S6
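
As a small, hedged illustration of the split above, the same columns can be produced with `str.split(..., expand=True)`, which returns a DataFrame directly. The two names are copied from this dataset; the explicit check for missing values is an added assumption about what a malformed name would look like.

```python
import pandas as pd

# Build linker_1, TFBS_1, ..., TFBS_5, linker_6
block_features = []
for i in range(1, 6):
    block_features += ["linker_{}".format(i), "TFBS_{}".format(i)]
block_features.append("linker_6")

names = pd.Series([
    "S1-G1R-S2-E1F-S3-E2F-S4-G2R-S5-G3F-S6",
    "S5-G3R-S4-G2F-S3-E2F-S2-E1R-S1-G1R-S6",
])

# expand=True yields one column per dash-separated block
blocks = names.str.split("-", expand=True)
blocks.columns = block_features

# A name with fewer than 11 blocks would leave missing values behind
n_malformed = int(blocks.isna().any(axis=1).sum())
print(blocks)
print("Malformed names:", n_malformed)
```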


In [9]:
# Sanity check to make sure things match up with what Joe did
OLS_data["MPRA_FXN"].value_counts(dropna=False), OLS_data["label"].value_counts(dropna=False)

(0.0    208729
 NaN    157864
 1.0     94207
 Name: MPRA_FXN, dtype: int64,
 0.0    208729
 NaN    157864
 1.0     94207
 Name: label, dtype: int64)

# Preprocess and save different feature set ideas

## Grab labels and identifiers

In [10]:
# Remove ambiguous enhancers (aka the ones with NaNs)
OLS_data = OLS_data[~OLS_data["label"].isna()].reset_index()
y = OLS_data["label"].values
ID = OLS_data["NAME"].values

In [13]:
# Grab the indexes and the targets for a train set and a test set
train_index, test_index, y_train, y_test = project_utils.split_train_test(OLS_data.index, 
                                                                          y, 
                                                                          split=SPLIT, 
                                                                          subset=SUBSET)
train_index.shape, test_index.shape, y_train.shape, y_test.shape

((272642,), (30294,), (272642,), (30294,))
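
`project_utils.split_train_test` is not shown in this notebook. The sketch below is a hypothetical stand-in, assuming it wraps scikit-learn's `train_test_split` over row indices, as the scratch section at the end of the notebook suggests. The function name `split_indices`, the `seed` argument, and the way `subset` is handled are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_indices(index, y, split=0.9, subset=None, seed=13):
    """Hypothetical stand-in for project_utils.split_train_test."""
    index = np.asarray(index)
    y = np.asarray(y)
    if subset is not None:
        # Assumption: `subset` keeps a random fraction of rows for quick tests
        rng = np.random.RandomState(seed)
        keep = rng.choice(len(index), size=int(len(index) * subset), replace=False)
        index, y = index[keep], y[keep]
    # Returns train indices, test indices, train labels, test labels
    return train_test_split(index, y, train_size=split, random_state=seed)


y_toy = np.array([0, 1] * 50)
train_idx, test_idx, y_tr, y_te = split_indices(np.arange(100), y_toy)
print(train_idx.shape, test_idx.shape, y_tr.shape, y_te.shape)
```

Splitting indices rather than feature matrices means the same `train_index` and `test_index` can be reused for every feature set, which keeps the splits aligned.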

**Labels**: 0 (non-functional) and 1 (functional). These are the same for every feature set.

In [14]:
# Save the labels
np.savetxt("{}_y-binary.txt".format(PREPROCESS), X=y, fmt="%d")
np.savetxt("{}_y-train-{}_binary.txt".format(PREPROCESS, SPLIT), X=y_train, fmt="%d")
np.savetxt("{}_y-test-{}_binary.txt".format(PREPROCESS, round(1-SPLIT, 1)), X=y_test, fmt="%d")

In [15]:
!wc -l *y-*

 302936 0.18-0.4_y-binary.txt
  30294 0.18-0.4_y-test-0.1_binary.txt
 272642 0.18-0.4_y-train-0.9_binary.txt
 605872 total


**Identifiers**: L1-TFBS1-... These are also the same for every feature set.

In [19]:
np.savetxt("{}_id.txt".format(PREPROCESS), X=ID, fmt="%s")
np.savetxt("{}_id-train_{}.txt".format(PREPROCESS, SPLIT), X=ID[train_index], fmt="%s")
np.savetxt("{}_id-test_{}.txt".format(PREPROCESS, round(1-SPLIT, 1)), X=ID[test_index], fmt="%s")

In [20]:
!wc -l *id*

   30294 0.18-0.4_id-test_0.1.txt
   30294 0.18-0.4_id-test_0.9.txt
  272642 0.18-0.4_id-train_0.9.txt
  302936 0.18-0.4_id.txt
  302936 sequence_id_0.18-0.4.txt
  460800 sequence_id_orig.txt
 1399902 total


## **<u>Sequence feature idea 1 </u>**: One hot encoding block features
 - linker_{1-5} could be a one hot encoded vector of length 5 that can be any of S1-S5 --> e.g., [0, 1, 0, 0, 0] encodes S2
 - TFBS_{1-5} could be a one hot encoded vector of length 10 that can be G{1-3}R, G{1-3}F, E{1,2}F, E{1,2}R
 - **Note**: linker_6 is always S6
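
A toy version of this encoding, assuming only three of the eleven block columns and five made-up rows, shows how each column expands into one indicator per observed category. The encoder is left at its default sparse output and converted with `.toarray()` so the sketch does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy block features: one variable linker, one TFBS, and the constant linker_6
toy_blocks = pd.DataFrame({
    "linker_1": ["S1", "S2", "S3", "S4", "S5"],
    "TFBS_1": ["G1R", "E1F", "G2F", "E2R", "G3F"],
    "linker_6": ["S6"] * 5,
})

encoder = OneHotEncoder()  # default output is a SciPy sparse matrix
X_toy = encoder.fit_transform(toy_blocks).toarray()

# The number of output columns is the total number of categories seen
n_expected = sum(len(cats) for cats in encoder.categories_)
print(encoder.categories_)
print(X_toy[1], n_expected)
```

The last column is 1 for every row because `linker_6` has a single category, so it carries no information and can be dropped.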

In [21]:
# Fit to overall dataframe
# Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
ohe_block = OneHotEncoder(sparse=False)
X = OLS_data[block_features]
ohe_block.fit(X)

OneHotEncoder(sparse=False)

In [22]:
# Transform
X_block = ohe_block.transform(X)  # encoder was already fit above

In [23]:
# Sanity check one
ohe_block.categories_, OLS_data.loc[45]["NAME"], X_block[45]

([array(['S1', 'S2', 'S3', 'S4', 'S5'], dtype=object),
  array(['E1F', 'E1R', 'E2F', 'E2R', 'G1F', 'G1R', 'G2F', 'G2R', 'G3F',
         'G3R'], dtype=object),
  array(['S1', 'S2', 'S3', 'S4', 'S5'], dtype=object),
  array(['E1F', 'E1R', 'E2F', 'E2R', 'G1F', 'G1R', 'G2F', 'G2R', 'G3F',
         'G3R'], dtype=object),
  array(['S1', 'S2', 'S3', 'S4', 'S5'], dtype=object),
  array(['E1F', 'E1R', 'E2F', 'E2R', 'G1F', 'G1R', 'G2F', 'G2R', 'G3F',
         'G3R'], dtype=object),
  array(['S1', 'S2', 'S3', 'S4', 'S5'], dtype=object),
  array(['E1F', 'E1R', 'E2F', 'E2R', 'G1F', 'G1R', 'G2F', 'G2R', 'G3F',
         'G3R'], dtype=object),
  array(['S1', 'S2', 'S3', 'S4', 'S5'], dtype=object),
  array(['E1F', 'E1R', 'E2F', 'E2R', 'G1F', 'G1R', 'G2F', 'G2R', 'G3F',
         'G3R'], dtype=object),
  array(['S6'], dtype=object)],
 'S1-G1R-S2-E1F-S4-E2F-S3-G2R-S5-G3R-S6',
 array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,

**Q** How many features should we be expecting?
 - Each non-L6 linker has 5 options and there are 5 of these --> 5\*5=25
 - Each TFBS has 10 options and there are 5 of these --> 10\*5=50
 - L6 has one option S6 --> 1
 - Should have 25 + 50 + 1 = 76 features:

In [24]:
print("Expecting %d features" % ((5 * 5) + (10 * 5) + 1))
print("Observed %d features" % X_block.shape[1])
if (5 * 5) + (10 * 5) + 1 != X_block.shape[1]:
    print("Something is amiss...")

Expecting 76 features
Observed 76 features


In [25]:
# Remove the constant (single-valued) S6 feature at the end
if len(np.unique(X_block[:, -1])) == 1:
    X_block = X_block[:, :-1]

In [30]:
print("Expecting %d features" % ((5 * 5) + (10 * 5)))
print("Observed %d features" % X_block.shape[1])
if (5 * 5) + (10 * 5) != X_block.shape[1]:
    print("Something is amiss...")

Expecting 75 features
Observed 75 features


In [33]:
# Save the different files as numpy arrays
np.save("block/{}_X_block".format(PREPROCESS), arr=X_block)
np.save("block/{}_X-train-{}_block".format(PREPROCESS, SPLIT), arr=X_block[train_index, :])
np.save("block/{}_X-test-{}_block".format(PREPROCESS, round(1-SPLIT, 1)), arr=X_block[test_index, :])

In [36]:
# Save the header for use down the line. Save as one column name per line
with open("block/block_header.txt", "w") as f:
    f.write("\n".join(np.concatenate(ohe_block.categories_[:-1])))

## **<u>Sequence feature idea 2 </u>**: Mixed encodings 

In [40]:
siteName2bindingSiteSequence = {
    "S1": "CATCTGAAGCTC",
    "G1R": "GTTATCTC",
    "S2": "TA",
    "E1F": "ACGGAAGT",
    "S3": "TTTCGAA",
    "E2F": "AAGGAAAT",
    "S4": "TGCTC",
    "G2R": "AATATCT",
    "S5": "",
    "G3F": "AAGATAGG",
    "G1F": "GAGATAAC",
    "E1R": "ACTTCCGT",
    "E2R": "ATTTCCTT",
    "G2F": "AGATATT",
    "G3R": "CCTATCTT",
    "S6": "A",
}

In [41]:
bindingSiteName2affinities = {
    "G1": 0.9,
    "G2": 0.3,
    "G3": 0.5,
    "E1": 0.6,
    "E2": 0.4,
}

In [62]:
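def correct_s5(x):
    """Adjust linker lengths when the empty spacer S5 neighbors a GATA-2 site.

    S5 has length 0, so the nucleotide cannot be taken from S5 itself.
    If S5 is directly upstream of G2F, the linker before S5 is shortened by one.
    If S5 is directly downstream of G2R, the linker after S5 is shortened by one.
    Rows are returned unchanged when no such neighboring linker exists.
    """
    enhancer = x["Enhancer"]
    if "S5-G2F" in enhancer:
        # 1-based linker position of S5 (linkers sit at even block indices)
        s5_loc = int((np.where(np.array(enhancer.split("-")) == "S5")[0][0] / 2) + 1)
        if s5_loc == 1:
            return x
        linker_to_change_loc = s5_loc - 1
        x["L{}_length".format(linker_to_change_loc)] -= 1
        return x
    elif "G2R-S5" in enhancer:
        s5_loc = int((np.where(np.array(enhancer.split("-")) == "S5")[0][0] / 2) + 1)
        if s5_loc == 6:
            return x
        linker_to_change_loc = s5_loc + 1
        x["L{}_length".format(linker_to_change_loc)] -= 1
        return x
    else:
        return x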
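
To make the S5 correction concrete, here is a compact restatement applied to two hand-built rows. It is a sketch rather than the function used in the pipeline: the name `correct_s5_sketch` and the plain-dict rows are assumptions (a pandas row supports the same `row[key]` access), and the starting linker lengths are the spacer lengths from `siteName2bindingSiteSequence`.

```python
def correct_s5_sketch(row):
    """Compact restatement of the S5 correction, for illustration only."""
    enhancer = row["Enhancer"]
    if "S5-G2F" in enhancer:
        offset = -1  # shorten the linker before S5
    elif "G2R-S5" in enhancer:
        offset = 1   # shorten the linker after S5
    else:
        return row
    # 1-based linker position of S5 (linkers sit at even block indices)
    s5_loc = enhancer.split("-").index("S5") // 2 + 1
    target = s5_loc + offset
    if 1 <= target <= 6:
        row["L{}_length".format(target)] -= 1
    return row


# S5 is linker 3 and sits directly upstream of G2F -> L2 shrinks from 2 to 1
row_a = {"Enhancer": "S1-G1R-S2-E1F-S5-G2F-S4-E2F-S3-G3F-S6",
         "L1_length": 12, "L2_length": 2, "L3_length": 0,
         "L4_length": 5, "L5_length": 7, "L6_length": 1}

# S5 is linker 5 and sits directly downstream of G2R -> L6 shrinks from 1 to 0
row_b = {"Enhancer": "S1-G1R-S2-E1F-S3-E2F-S4-G2R-S5-G3F-S6",
         "L1_length": 12, "L2_length": 2, "L3_length": 7,
         "L4_length": 5, "L5_length": 0, "L6_length": 1}

row_a = correct_s5_sketch(row_a)
row_b = correct_s5_sketch(row_b)
print(row_a["L2_length"], row_b["L6_length"])
```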

### Mixed 1.0
 - Replace binding sites using dictionary
 - Separate based on these binding sites and create "dummy variables"
 - Get lengths of linkers around binding sites
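
As a compact illustration of these steps, the sketch below encodes a single enhancer name. It mirrors the logic of the loop in this section but is not the code that produced the saved files: the function name `encode_mixed1` is hypothetical, the integer codes (G=0, E=1, R=0, F=1) are applied inline rather than through a later `replace`, and the separate S5 correction is not included.

```python
# Spacer sequences and affinities copied from the dictionaries in this notebook
spacer_seq = {"S1": "CATCTGAAGCTC", "S2": "TA", "S3": "TTTCGAA",
              "S4": "TGCTC", "S5": "", "S6": "A"}
affinity = {"G1": 0.9, "G2": 0.3, "G3": 0.5, "E1": 0.6, "E2": 0.4}


def encode_mixed1(name):
    """Encode one enhancer as linker lengths plus (type, orient, affinity)."""
    blocks = name.split("-")
    enc = []
    for pos, block in enumerate(blocks):
        if block.startswith("S"):
            length = len(spacer_seq[block])
            after_g2r = pos > 0 and blocks[pos - 1] == "G2R"
            before_g2f = pos < len(blocks) - 1 and blocks[pos + 1] == "G2F"
            # A non-empty spacer next to GATA-2 gives up one nucleotide
            if block != "S5" and (after_g2r or before_g2f):
                length -= 1
            enc.append(length)
        else:
            # TFBS name is <type><number><orientation>, e.g. G1R
            enc += [int(block[0] == "E"), int(block[2] == "F"), affinity[block[:2]]]
    return enc


example = encode_mixed1("S1-G1R-S2-E1F-S3-E2F-S4-G2R-S5-G3F-S6")
print(len(example), example)
```

Each enhancer becomes 21 values: six linker lengths and three values for each of the five binding sites.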

In [90]:
mixed1_encoding = []

# Loop through each enhancer
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(OLS_data[block_features].iterrows())):

    enh_enc = []  # Single enhancer encoding

    # Loop through each position
    for col_num in range(len(enh_data.index)):

        # If we have a spacer in the current position we need to check for surrounding GATA-2 sites
        if "S" in enh_data.iloc[col_num]:

            # If the spacer is the empty spacer, just add a 0 and go to the next
            if enh_data.iloc[col_num] == "S5":
                enh_enc.append(
                    len(siteName2bindingSiteSequence[enh_data.iloc[col_num]])
                )
                continue

            # If the spacer is downstream of a GATA-2 reverse, we need to add a nucleotide to the GATA-2 (subtract one from spacer)
            if col_num > 0:
                if enh_data.iloc[col_num - 1] == "G2R":
                    enh_enc.append(
                        len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]) - 1
                    )
                    continue

            # If the spacer is upstream of a GATA-2 forward, we need to add a nucleotide to the GATA-2 (subtract one from spacer)
            if col_num < len(enh_data.index) - 1:
                if enh_data.iloc[col_num + 1] == "G2F":
                    enh_enc.append(
                        len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]) - 1
                    )
                    continue

            # Finally if no G2 or S5 is involved, just add the normal len of the spacer
            enh_enc.append(len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]))
            continue

        # If we are at a TFBS, add the TFBS type, orientation and affinity
        else:
            tfbs = enh_data.iloc[col_num]
            tf = tfbs[0]
            aff = bindingSiteName2affinities[tfbs[:2]]
            orient = tfbs[2]
            enh_enc += [tf, orient, aff]
    
    # Poison pill
    #if row_num == 10000:
    #    break
    mixed1_encoding.append(enh_enc + ["-".join(list(enh_data.values))])

302936it [01:36, 3139.89it/s]


In [91]:
header = [
    "L1_length",
    "TFBS1_type",
    "TFBS1_orient",
    "TFBS1_affinity",
    "L2_length",
    "TFBS2_type",
    "TFBS2_orient",
    "TFBS2_affinity",
    "L3_length",
    "TFBS3_type",
    "TFBS3_orient",
    "TFBS3_affinity",
    "L4_length",
    "TFBS4_type",
    "TFBS4_orient",
    "TFBS4_affinity",
    "L5_length",
    "TFBS5_type",
    "TFBS5_orient",
    "TFBS5_affinity",
    "L6_length",
    "Enhancer"
]

In [92]:
X_mixed1 = (
    pd.DataFrame(mixed1_encoding, columns=header)
    .replace({"G": 0, "E": 1, "R": 0, "F": 1})
)

In [93]:
X_mixed1 = X_mixed1.apply(correct_s5, axis=1).drop("Enhancer", axis=1).values

In [106]:
X_mixed1_train = X_mixed1[train_index, :]
X_mixed1_test = X_mixed1[test_index, :]

In [111]:
scale_indeces = np.array([0, 3, 4, 7, 8, 11, 12, 15, 16, 19, 20])  # Mixed 1.0

In [114]:
X_mixed1_train, X_mixed1_test = project_utils.standardize_features(train_X=X_mixed1_train, test_X=X_mixed1_test, indeces=scale_indeces)
X_mixed1_train[:, scale_indeces].mean(axis=0), X_mixed1_train[:, scale_indeces].std(axis=0)

(array([-1.26658317e-17,  9.32997482e-18, -1.62362411e-17, -1.87641952e-17,
         1.05809212e-17, -1.04766756e-17, -2.32467529e-17,  1.15712537e-17,
         1.74611261e-17,  2.32076608e-17, -5.85338644e-17]),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
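
`project_utils.standardize_features` is not shown here. A minimal sketch of what it presumably does, assuming it z-scores the selected columns using statistics from the training set only, is given below. The function name `standardize_columns` is hypothetical; the keyword spelling `indeces` follows the call above.

```python
import numpy as np


def standardize_columns(train_X, test_X, indeces):
    """Hypothetical stand-in: z-score selected columns with training statistics."""
    train_X = np.array(train_X, dtype=float)  # copies, so inputs stay untouched
    test_X = np.array(test_X, dtype=float)
    mean = train_X[:, indeces].mean(axis=0)
    std = train_X[:, indeces].std(axis=0)
    train_X[:, indeces] = (train_X[:, indeces] - mean) / std
    # The test set reuses the training mean and std to avoid information leakage
    test_X[:, indeces] = (test_X[:, indeces] - mean) / std
    return train_X, test_X


rng = np.random.RandomState(0)
train = rng.uniform(0, 12, size=(200, 4))
test = rng.uniform(0, 12, size=(50, 4))
cols = np.array([0, 3])

train_s, test_s = standardize_columns(train, test, cols)
print(train_s[:, cols].mean(axis=0), train_s[:, cols].std(axis=0))
```

Under this assumption the training columns have mean 0 and standard deviation 1 up to floating point error, which matches the output above, while the test columns are only approximately standardized.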

In [98]:
# Save the different files as binary NumPy (.npy) arrays
np.save("mixed_1.0/{}_X_mixed-1.0".format(PREPROCESS), X_mixed1)
np.save("mixed_1.0/{}_X-train-{}_mixed-1.0".format(PREPROCESS, SPLIT), X_mixed1_train)
np.save("mixed_1.0/{}_X-test-{}_mixed-1.0".format(PREPROCESS, round(1-SPLIT, 1)), X_mixed1_test)

### Mixed 2.0
 - Replace binding sites using dictionary
 - 4-element vector for each binding site [ets_affinity ets_orientation gata_affinity gata_orientation], which ties the site identity to its affinity
 - Get lengths of linkers around binding sites
 - Corrected for having S5 linker neighbor GATA2 (need to adjust linker down or upstream from S5)
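
The per-site vector described above can be sketched as a small helper. The function name `tfbs_to_vector` is hypothetical, and the orientation codes (F=1, R=-1) are applied inline here, whereas the notebook applies them afterwards with `replace`.

```python
# Affinities copied from bindingSiteName2affinities in this notebook
affinity = {"G1": 0.9, "G2": 0.3, "G3": 0.5, "E1": 0.6, "E2": 0.4}


def tfbs_to_vector(tfbs):
    """Return [ets_affinity, ets_orient, gata_affinity, gata_orient] for one site."""
    aff = affinity[tfbs[:2]]
    orient = 1 if tfbs[2] == "F" else -1
    if tfbs[0] == "E":
        return [aff, orient, 0, 0]  # ETS site: GATA slots stay zero
    return [0, 0, aff, orient]      # GATA site: ETS slots stay zero


print(tfbs_to_vector("E1F"))
print(tfbs_to_vector("G2R"))
```

With six linker lengths and four values for each of the five sites, every enhancer becomes a vector of 26 values.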

In [99]:
mixed2_encoding = []

# Loop through each enhancer
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(OLS_data[block_features].iterrows())):

    enh_enc = []  # Single enhancer encoding

    # Loop through each position
    for col_num in range(len(enh_data.index)):

        # If we have a spacer in the current position we need to check for surrounding GATA-2 sites
        if "S" in enh_data.iloc[col_num]:

            # If the spacer is the empty spacer, just add a 0 and go to the next
            if enh_data.iloc[col_num] == "S5":
                enh_enc.append(
                    len(siteName2bindingSiteSequence[enh_data.iloc[col_num]])
                )
                continue

            # If the spacer is downstream of a GATA-2 reverse, we need to add a nucleotide to the GATA-2 (subtract one from spacer)
            if col_num > 0:
                if enh_data.iloc[col_num - 1] == "G2R":
                    enh_enc.append(
                        len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]) - 1
                    )
                    continue

            # If the spacer is upstream of a GATA-2 forward, we need to add a nucleotide to the GATA-2 (subtract one from spacer)
            if col_num < len(enh_data.index) - 1:
                if enh_data.iloc[col_num + 1] == "G2F":
                    enh_enc.append(
                        len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]) - 1
                    )
                    continue

            # Finally if no G2 or S5 is involved, just add the normal len of the spacer
            enh_enc.append(len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]))
            continue

        # If we are at a TFBS, add the TFBS type, orientation and affinity
        else:
            tfbs = enh_data.iloc[col_num]
            tf = tfbs[0]
            aff = bindingSiteName2affinities[tfbs[:2]]
            orient = tfbs[2]
            if tf == "E":
                enh_enc += [aff, orient, 0, 0]
            elif tf == "G":
                enh_enc += [0, 0, aff, orient]
    # Poison pill
    # if row_num == 10000:
    #    break
    mixed2_encoding.append(enh_enc + ["-".join(list(enh_data.values))])

302936it [01:33, 3229.74it/s]


In [100]:
header = [
    "L1_length",
    "TFBS1_ETS_affinity",
    "TFBS1_ETS_orient",
    "TFBS1_GATA_affinity",
    "TFBS1_GATA_orient",
    "L2_length",
    "TFBS2_ETS_affinity",
    "TFBS2_ETS_orient",
    "TFBS2_GATA_affinity",
    "TFBS2_GATA_orient",
    "L3_length",
    "TFBS3_ETS_affinity",
    "TFBS3_ETS_orient",
    "TFBS3_GATA_affinity",
    "TFBS3_GATA_orient",
    "L4_length",
    "TFBS4_ETS_affinity",
    "TFBS4_ETS_orient",
    "TFBS4_GATA_affinity",
    "TFBS4_GATA_orient",
    "L5_length",
    "TFBS5_ETS_affinity",
    "TFBS5_ETS_orient",
    "TFBS5_GATA_affinity",
    "TFBS5_GATA_orient",
    "L6_length",
    "Enhancer",
]

In [101]:
X_mixed2 = pd.DataFrame(mixed2_encoding, columns=header).replace({"R": -1, "F": 1})

In [102]:
X_mixed2 = X_mixed2.apply(correct_s5, axis=1).drop("Enhancer", axis=1).values

In [117]:
X_mixed2_train = X_mixed2[train_index, :]
X_mixed2_test = X_mixed2[test_index, :]

In [118]:
scale_indeces = np.array([0, 5, 10, 15, 20, 25])  # Mixed 2.0

In [119]:
X_mixed2_train, X_mixed2_test = project_utils.standardize_features(train_X=X_mixed2_train, test_X=X_mixed2_test, indeces=scale_indeces)
X_mixed2_train[:, scale_indeces].mean(axis=0), X_mixed2_train[:, scale_indeces].std(axis=0)

(array([-3.72156537e-17, -8.61328681e-17,  6.50101178e-17,  5.79083912e-17,
         3.13779041e-17,  1.27648650e-16]),
 array([1., 1., 1., 1., 1., 1.]))

In [103]:
# Save the different files as binary NumPy (.npy) arrays
np.save("mixed_2.0/{}_X_mixed-2.0".format(PREPROCESS), X_mixed2)
np.save("mixed_2.0/{}_X-train-{}_mixed-2.0".format(PREPROCESS, SPLIT), X_mixed2_train)
np.save("mixed_2.0/{}_X-test-{}_mixed-2.0".format(PREPROCESS, round(1-SPLIT, 1)), X_mixed2_test)

## **<u>Sequence feature idea 3 </u>**: Use the actual sequence (one-hot encoded)
 - One hot encoded sequence: each position is encoded as a 1-D vector of size 4 e.g., AT is [[1,0,0,0], [0,0,0,1]]
 - Generally, we will get inputs of size (len(seq) X 4). The above example would be of size 2x4
 - Can also save the string seqs in case those are also useful down the line
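
One way to sketch this encoding without scikit-learn is a NumPy lookup table indexed by ASCII code. This is an alternative to the encoder-based cells in this section, not the method they use; the names `LOOKUP` and `one_hot` are hypothetical. Because the table is fixed, the A, C, G, T column order does not depend on which bases happen to appear in a given sequence, and any other character (such as N) maps to all zeros.

```python
import numpy as np

# Row i of the table is the encoding of the character with ASCII code i
LOOKUP = np.zeros((256, 4), dtype=np.float32)
for col, base in enumerate("ACGT"):
    LOOKUP[ord(base), col] = 1.0


def one_hot(seq):
    """Return a (len(seq), 4) one hot array with columns ordered A, C, G, T."""
    codes = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    return LOOKUP[codes]


print(one_hot("AT"))

# Equal-length sequences stack into a (n_seqs, seq_len, 4) array
batch = np.stack([one_hot(s) for s in ["ACGT", "TTNA"]])
print(batch.shape)
```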

**Q** Are all sequences the same length?

In [120]:
OLS_data["SEQUENCE"].apply(len).value_counts()

66    302936
Name: SEQUENCE, dtype: int64

**Answer**: Yes, 66bp long

In [121]:
X_seqs = OLS_data["SEQUENCE"].values

In [122]:
# Define encoders
integer_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder(
    categories=[np.array([0, 1, 2, 3])], handle_unknown="ignore"
)

In [123]:
# Example steps for one hot encoding
test = X_seqs[0]
print("{}...".format(test[:5]))
integer_encoded = integer_encoder.fit_transform(list(test))
print("{}...".format(integer_encoded[:5]))
integer_encoded = np.array(integer_encoded).reshape(-1, 1)
one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)  # convert to one hot
print("{}...".format(one_hot_encoded.toarray()[:5]))

CATCT...
[1 0 3 1 3]...
[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]]...


**One hot encoding of sequences...<u>takes about 5 minutes</u>**

In [125]:
# Applying above for all seqs
# Caveat: the LabelEncoder is refit on every sequence, so the integer codes are
# only A=0, C=1, G=2, T=3 when a sequence contains all four bases
X_features = []  # will hold one hot encoded sequence
for i, seq in enumerate(tqdm.tqdm(X_seqs)):
    integer_encoded = integer_encoder.fit_transform(list(seq))  # convert to integer
    integer_encoded = np.array(integer_encoded).reshape(-1, 1)
    one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)  # convert to one hot
    X_features.append(one_hot_encoded.toarray())

100%|██████████| 302936/302936 [03:27<00:00, 1458.31it/s]


In [None]:
# Check to make sure it is correct length
len(X_features)

In [17]:
X_ohe_seq = np.array(X_features)

In [18]:
# Sanity check encoding for randomly chosen sequences
expected = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
    "N": [0.0, 0.0, 0.0, 0.0],
}
indeces = np.random.choice(len(X_features), size=len(X_features) // 100)
for j, ind in enumerate(indeces):
    seq = X_seqs[ind]
    one_hot_seq = X_features[ind]
    correct = True
    for i, bp in enumerate(seq):
        if bp not in expected:
            print(bp)  # Unexpected character
        # Use .any(): a single mismatched element means the encoding is wrong
        elif (one_hot_seq[i] != expected[bp]).any():
            correct = False
            print("You one hot encoded wrong dummy!")
            print(seq, one_hot_seq)
    if correct:
        print("Seq #{} encoded correctly".format(j + 1))

Seq #1 encoded correctly
Seq #2 encoded correctly
Seq #3 encoded correctly
Seq #4 encoded correctly
Seq #5 encoded correctly
Seq #6 encoded correctly
Seq #7 encoded correctly
Seq #8 encoded correctly
Seq #9 encoded correctly
Seq #10 encoded correctly
Seq #11 encoded correctly
Seq #12 encoded correctly
Seq #13 encoded correctly
Seq #14 encoded correctly
Seq #15 encoded correctly
Seq #16 encoded correctly
Seq #17 encoded correctly
Seq #18 encoded correctly
Seq #19 encoded correctly
Seq #20 encoded correctly
Seq #21 encoded correctly
Seq #22 encoded correctly
Seq #23 encoded correctly
Seq #24 encoded correctly
Seq #25 encoded correctly
Seq #26 encoded correctly
Seq #27 encoded correctly
Seq #28 encoded correctly
Seq #29 encoded correctly
Seq #30 encoded correctly
Seq #31 encoded correctly
Seq #32 encoded correctly
Seq #33 encoded correctly
Seq #34 encoded correctly
Seq #35 encoded correctly
Seq #36 encoded correctly
Seq #37 encoded correctly
Seq #38 encoded correctly
Seq #39 encoded corre

In [20]:
# Save in binary format
np.save("../data/2021_OLS_Library/X_seq_ohe_0.18-0.4", X_ohe_seq)

In [22]:
np.savetxt("../data/2021_OLS_Library/X_seqs_0.18-0.4.txt", X=X_seqs, fmt="%s")

## **<u>Sequence feature idea 4 </u>**: Use the actual sequence (as fasta)
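
The cells in this section use `X_train`, `id_train`, `X_test`, and `id_test`, which are not defined earlier in this notebook. The sketch below assumes they are arrays of sequence strings and identifiers (for example `X_seqs[train_index]` and `ID[train_index]`) and shows the positive/negative split and FASTA writing on toy data. The helper name `write_fasta`, the toy identifiers, and the temporary output paths are all hypothetical.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for the sequence strings, identifiers, and labels
X_toy = np.array(["ACGT", "TTGA", "CCAT"])
id_toy = np.array(["enh_a", "enh_b", "enh_c"])
y_toy = np.array([1, 0, 1])


def write_fasta(path, ids, seqs):
    """Write one two-line FASTA record (header, sequence) per entry."""
    with open(path, "w") as handle:
        for seq_id, seq in zip(ids, seqs):
            handle.write(">{}\n{}\n".format(seq_id, seq))


neg_mask = y_toy == 0
out_dir = tempfile.mkdtemp()
pos_path = os.path.join(out_dir, "toy.pos.fa")
neg_path = os.path.join(out_dir, "toy.neg.fa")

write_fasta(pos_path, id_toy[~neg_mask], X_toy[~neg_mask])
write_fasta(neg_path, id_toy[neg_mask], X_toy[neg_mask])

with open(pos_path) as handle:
    pos_lines = handle.read().splitlines()
with open(neg_path) as handle:
    neg_lines = handle.read().splitlines()
print(len(pos_lines), len(neg_lines))
```

Each record takes two lines, which is why the `wc -l` checks in this section compare the line count against twice the number of sequences.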

### **Split into negative and positive for lsgkm**

#### <u> **Train** </u>

In [9]:
# Mask
train_neg_mask = (y_train == 0)

In [10]:
# Negative seqs
X_train_neg = X_train[train_neg_mask]
y_train_neg = y_train[train_neg_mask]
id_train_neg = id_train[train_neg_mask]

# Positive seqs
X_train_pos = X_train[~train_neg_mask]
y_train_pos = y_train[~train_neg_mask]
id_train_pos = id_train[~train_neg_mask]

In [11]:
# Check
print(X_train_neg.shape, y_train_neg.shape)
print(X_train_pos.shape, y_train_pos.shape)
if (X_train_neg.shape[0] + X_train_pos.shape[0] == X_train.shape[0]):
    print("We good: {}, {}, {}".format(X_train.shape, y_train.shape, id_train.shape))
else:
    print("The game is afoot")

(187791,) (187791,)
(84851,) (84851,)
We good: (272642,), (272642,), (272642,)


*Positive training sequences*

In [12]:
with open("../data/2021_OLS_Library/OLS.tr.fa", "w") as tr_file:
    for i in range(len(X_train_pos)):
        tr_file.write(">" + id_train_pos[i] + "\n" + X_train_pos[i] + "\n")

In [13]:
!wc -l ../data/2021_OLS_Library/OLS.tr.fa

169702 ../data/2021_OLS_Library/OLS.tr.fa


In [14]:
len(X_train_pos)*2

169702

*Negative training sequences*

In [15]:
with open("../data/2021_OLS_Library/OLS.neg.tr.fa", "w") as tr_neg_file:
    for i in range(len(X_train_neg)):
        tr_neg_file.write(">" + id_train_neg[i] + "\n" + X_train_neg[i] + "\n")

In [16]:
!wc -l ../data/2021_OLS_Library/OLS.neg.tr.fa

375582 ../data/2021_OLS_Library/OLS.neg.tr.fa


In [17]:
len(X_train_neg)*2

375582

#### <u> **Test** </u>

In [18]:
# Mask
test_neg_mask = (y_test == 0)

In [19]:
# Negative seqs
X_test_neg = X_test[test_neg_mask]
y_test_neg = y_test[test_neg_mask]
id_test_neg = id_test[test_neg_mask]

# Positive seqs
X_test_pos = X_test[~test_neg_mask]
y_test_pos = y_test[~test_neg_mask]
id_test_pos = id_test[~test_neg_mask]

In [20]:
# Check
print(X_test_neg.shape, y_test_neg.shape)
print(X_test_pos.shape, y_test_pos.shape)
if (X_test_neg.shape[0] + X_test_pos.shape[0] == X_test.shape[0]):
    print("We good: {}, {}. {}".format(X_test.shape, y_test.shape, id_test.shape))
else:
    print("The game is afoot")

(20938,) (20938,)
(9356,) (9356,)
We good: (30294,), (30294,). (30294,)


*Positive test sequences*

In [21]:
with open("../data/2021_OLS_Library/OLS.test.fa", "w") as test_file:
    for i in range(len(X_test_pos)):
        test_file.write(">" + id_test_pos[i] + "\n" + X_test_pos[i] + "\n")

In [22]:
!wc -l ../data/2021_OLS_Library/OLS.test.fa

18712 ../data/2021_OLS_Library/OLS.test.fa


In [23]:
len(X_test_pos)*2

18712

*Negative test sequences*

In [24]:
with open("../data/2021_OLS_Library/OLS.neg.test.fa", "w") as test_neg_file:
    for i in range(len(X_test_neg)):
        test_neg_file.write(">" + id_test_neg[i] + "\n" + X_test_neg[i] + "\n")

In [25]:
!wc -l ../data/2021_OLS_Library/OLS.neg.test.fa

41876 ../data/2021_OLS_Library/OLS.neg.test.fa


In [26]:
len(X_test_neg)*2

41876

# Scratch

## Sanity check ordering of current

In [23]:
from sklearn.model_selection import train_test_split

In [10]:
OLS_data

Unnamed: 0,index,NAME,SEQUENCE,MPRA_FXN,MICROSCOPE_FXN,ACTIVITY_SUMRNA_NUMDNA,label,linker_1,TFBS_1,linker_2,TFBS_2,linker_3,TFBS_3,linker_4,TFBS_4,linker_5,TFBS_5,linker_6
0,0,S1-G1R-S2-E1F-S3-E2F-S4-G2R-S5-G3F-S6,CATCTGAAGCTCGTTATCTCTAACGGAAGTTTTCGAAAAGGAAATT...,1.0,Neural Enhancer,0.611767,1.0,S1,G1R,S2,E1F,S3,E2F,S4,G2R,S5,G3F,S6
1,3,S1-G1R-S2-E1F-S3-E2F-S4-G2R-S5-G3R-S6,CATCTGAAGCTCGTTATCTCTAACGGAAGTTTTCGAAAAGGAAATT...,0.0,,0.000000,0.0,S1,G1R,S2,E1F,S3,E2F,S4,G2R,S5,G3R,S6
2,6,S1-G1F-S2-E1F-S3-E2F-S4-G2F-S5-G3F-S6,CATCTGAAGCTCGAGATAACTAACGGAAGTTTTCGAAAAGGAAATT...,0.0,,0.098698,0.0,S1,G1F,S2,E1F,S3,E2F,S4,G2F,S5,G3F,S6
3,7,S1-G1F-S2-E1F-S3-E2F-S4-G2R-S5-G3R-S6,CATCTGAAGCTCGAGATAACTAACGGAAGTTTTCGAAAAGGAAATT...,1.0,,0.468321,1.0,S1,G1F,S2,E1F,S3,E2F,S4,G2R,S5,G3R,S6
4,8,S1-G1F-S2-E1R-S3-E2F-S4-G2R-S5-G3F-S6,CATCTGAAGCTCGAGATAACTAACTTCCGTTTTCGAAAAGGAAATT...,1.0,,0.633036,1.0,S1,G1F,S2,E1R,S3,E2F,S4,G2R,S5,G3F,S6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
302931,460790,S5-G3R-S4-G2F-S3-E2F-S2-E1R-S1-G1R-S6,CCTATCTTTGCTCAGATATTTTTCGAAAAGGAAATTAACTTCCGTC...,0.0,,0.124246,0.0,S5,G3R,S4,G2F,S3,E2F,S2,E1R,S1,G1R,S6
302932,460792,S5-G3F-S4-G2F-S3-E2R-S2-E1R-S1-G1R-S6,AAGATAGGTGCTCAGATATTTTTCGAAATTTCCTTTAACTTCCGTC...,0.0,,0.025567,0.0,S5,G3F,S4,G2F,S3,E2R,S2,E1R,S1,G1R,S6
302933,460793,S5-G3R-S4-G2R-S3-E2R-S2-E1R-S1-G1R-S6,CCTATCTTTGCTCAATATCTTTTCGAAATTTCCTTTAACTTCCGTC...,1.0,,0.406034,1.0,S5,G3R,S4,G2R,S3,E2R,S2,E1R,S1,G1R,S6
302934,460794,S5-G3R-S4-G2F-S3-E2F-S2-E1R-S1-G1F-S6,CCTATCTTTGCTCAGATATTTTTCGAAAAGGAAATTAACTTCCGTC...,0.0,,0.117635,0.0,S5,G3R,S4,G2F,S3,E2F,S2,E1R,S1,G1F,S6


In [20]:
ID = np.loadtxt("sequence_id_0.18-0.4.txt", dtype=str)

In [26]:
ID_train, ID_test, y_train, y_test = train_test_split(ID, y, train_size=0.9, random_state=13)

In [27]:
ID_train

array(['S3-G2F-S4-G3F-S1-G1F-S2-E1F-S5-E2F-S6',
       'S1-E1R-S4-E2R-S2-G2R-S3-G3F-S5-G1R-S6',
       'S2-G1R-S5-G2R-S4-E1R-S3-G3R-S1-E2R-S6', ...,
       'S4-G3F-S1-G1R-S2-G2F-S5-E2R-S3-E1F-S6',
       'S2-G3R-S4-G1R-S1-G2F-S5-E1F-S3-E2F-S6',
       'S4-E2R-S3-G3R-S2-E1R-S1-G2R-S5-G1F-S6'], dtype='<U37')

In [14]:
y = np.loadtxt("y_binary_0.18-0.4.txt")

In [17]:
(y == OLS_data["label"]).all()

True

## Old mixed encoding 1.0 code

In [226]:
j = 0
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(X_mixed2_df.iterrows())):
    enhancer = enh_data["Enhancer"]
    if "S5-G2F" in enhancer:
        s5_loc = int((np.where(np.array(enhancer.split("-")) == "S5")[0][0] / 2) + 1)
        if s5_loc == 1:
            continue
        linker_to_change_loc = s5_loc - 1
        # print(i, row_num, enhancer, s5_loc, linker_to_change_loc, enh_data["L{}_length".format(linker_to_change_loc)])
        enh_data["L{}_length".format(linker_to_change_loc)] -= 1
        # print(i, row_num, enhancer, s5_loc, linker_to_change_loc, enh_data["L{}_length".format(linker_to_change_loc)])
        j += 1
    elif "G2R-S5" in enhancer:
        s5_loc = int((np.where(np.array(enhancer.split("-")) == "S5")[0][0] / 2) + 1)
        if s5_loc == 6:
            continue
        linker_to_change_loc = s5_loc + 1
        # print(i, row_num, enhancer, s5_loc, linker_to_change_loc, enh_data["L{}_length".format(linker_to_change_loc)])
        enh_data["L{}_length".format(linker_to_change_loc)] -= 1
        # print(i, row_num, enhancer, s5_loc, linker_to_change_loc, enh_data["L{}_length".format(linker_to_change_loc)])
        j += 1
    # if j==100:
    #    break

2400it [00:00, 7178.25it/s]


## Different thresholds and methods for dealing with class imbalance

**Dealing with class imbalance**

In [37]:
from sklearn.utils import resample

In [39]:
pos_mask = y_train == 1

In [42]:
neg_y = y_train[~pos_mask]
pos_y = y_train[pos_mask]

neg_X = X_train[~pos_mask, :]
pos_X = X_train[pos_mask, :]

In [43]:
print(np.unique(pos_y, return_counts=True))
print(np.unique(neg_y, return_counts=True))

(array([1]), array([65561]))
(array([0]), array([349159]))


**Downsample negative class**

In [86]:
downsampled_neg_X, downsampled_neg_y = resample(
    neg_X, neg_y, n_samples=len(pos_y), random_state=13
)

In [87]:
X_train_down = np.concatenate([downsampled_neg_X, pos_X])
y_train_down = np.concatenate([downsampled_neg_y, pos_y])

In [88]:
X_train_down.shape, y_train_down.shape

((131122, 75), (131122,))

In [89]:
np.unique(y_train_down, return_counts=True)[1] / len(y_train_down)

array([0.5, 0.5])
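
One detail worth noting: `sklearn.utils.resample` samples with replacement by default, so the downsampled negatives above may contain duplicate rows. A minimal sketch of downsampling without replacement on toy data follows; the toy arrays are made up, and whether duplicates matter here is a judgment call rather than something this notebook settles.

```python
import numpy as np
from sklearn.utils import resample

# Toy data: 20 unique rows, 5 positives and 15 negatives
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 5 + [0] * 15)
pos_mask = y_toy == 1

# replace=False draws each negative at most once
neg_X_down, neg_y_down = resample(
    X_toy[~pos_mask],
    y_toy[~pos_mask],
    replace=False,
    n_samples=int(pos_mask.sum()),
    random_state=13,
)

X_bal = np.concatenate([neg_X_down, X_toy[pos_mask]])
y_bal = np.concatenate([neg_y_down, y_toy[pos_mask]])
print(X_bal.shape, np.unique(y_bal, return_counts=True))
```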

## Testing mixed encoding code

In [237]:
X_mixed2_df["Enhancer"].str.contains("S5-G2F").sum(), X_mixed2_df[
    "Enhancer"
].str.contains("G2R-S5").sum()

(29952, 23969)

In [241]:
X_mixed2_df_corrected[
    (X_mixed2_df_corrected["L1_length"] == 11)
    & (X_mixed2_df_corrected["TFBS1_GATA_affinity"] != 0.3)
]["L2_length"].value_counts()

0    1437
Name: L2_length, dtype: int64

In [242]:
X_mixed2_df_corrected[
    (X_mixed2_df_corrected["L6_length"] == 0)
    & (X_mixed2_df_corrected["TFBS5_GATA_affinity"] != 0.3)
]["L5_length"].value_counts()

0    6037
Name: L5_length, dtype: int64

In [245]:
X_mixed2_df_corrected = X_mixed2_df_corrected.drop("Enhancer", axis=1).values
