# OLS Library Preprocessing Notebook

**Authorship:**
Adam Klie, *09/28/2021*
***
**Description:**
This notebook can be used to preprocess the OLS library dataset for training machine learning (ML) models on MPRA data. So far, there are a total of **6** feature sets to generate:

1. block encoding
2. fasta sequences
3. mixed-1.0 encoding
4. mixed-2.0 encoding
5. mixed-3.0 encoding
6. one-hot encoding (forward and reverse complement)

The general workflow is as follows:
1. Read in dataset
2. Perform train-test splitting
3. Save relevant information (sequence IDs, sequences, labels, etc.)
4. Perform feature encoding/engineering/selection
5. Save full encoding, as well as standardized train and test set to specified directory

Details on the encodings can be found in the appropriate sections. Make sure to use a standard naming convention. Here I am using:
`PREPROCESS_SET_ENCODING.extension`.

An example: `0.18-0.4_X-train-0.9_block.npy`. This file contains a 0.9 split of the total sample set as a training set with the *block* encoding. These samples were preprocessed with low and high activity cut-offs of 0.18 and 0.4, respectively.

Note the difference between the `_` and `-` separators: `_` separates the main fields, while `-` separates descriptors within a field.
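
The convention can be made mechanical with a pair of small helpers. This is only a minimal sketch: `build_name` and `parse_name` are hypothetical names introduced here (they are not part of this notebook or of `eugene`), and the parser assumes exactly three `_`-separated fields.

```python
def build_name(preprocess, set_name, encoding, extension):
    """Join the three `_`-separated fields and append the extension."""
    return "{}_{}_{}.{}".format(preprocess, set_name, encoding, extension)


def parse_name(file_name):
    """Split a file name back into its fields (assumes exactly three `_` fields)."""
    stem, extension = file_name.rsplit(".", 1)
    preprocess, set_name, encoding = stem.split("_")
    return {
        "preprocess": preprocess,
        "set": set_name.split("-"),  # e.g. ["X", "train", "0.9"]
        "encoding": encoding,
        "extension": extension,
    }


name = build_name("0.18-0.4", "X-train-0.9", "block", "npy")
fields = parse_name(name)
print(name)
print(fields)
```

Splitting the extension off from the right keeps encodings such as `mixed-1.0` intact, since only the last `.` is treated as the extension separator.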
***
**TODOs:**
***

# Set-up

## Packages

In [1]:
# Classics
import os
import tqdm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Helpful libraries
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

## Notebook parameters

In [2]:
# Param definitions
ACTIVE_LOW = np.inf  # The activity below which to define inactive enhancers
ACTIVE_HIGH = -np.inf  # The activity above which to define active enhancers
PREPROCESS = "{}-{}".format(ACTIVE_LOW, ACTIVE_HIGH)  # String defining the preprocessing for saving
SPLIT = 0.9  # Split into training and test sets
SUBSET = None

# Load data to preprocess
**Note**: There are two versions of the dataset I have to work with. See shared Google Drive for version history. The data has the following descriptors:
1. **NAME** - Site list 
2. **SEQUENCE** - Sequence
3. **MPRA_FXN** - Functional by MPRA (see slide 2)
    - Activity >.4 = functional (1)
    - Activity <.18 = non-functional (0)
    - If activity is in between => ‘na’
4. **MICROSCOPE_FXN** - Microscope functional group
    - Only for experiments with low variance and good controls (lacking *)
5. **ACTIVITY_SUMRNA_NUMDNA**
    - Activity score => sum MRPM all RNA barcodes / number DNA barcodes
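
The **MPRA_FXN** rule above can be written as a small function. This is a sketch under stated assumptions: `label_activity` is a hypothetical helper, the inequalities are strict as written in the list, and the activity values are made up for illustration.

```python
import numpy as np

LOW, HIGH = 0.18, 0.4  # cut-offs quoted in the list above


def label_activity(activity, low=LOW, high=HIGH):
    """Return 0.0 below `low`, 1.0 above `high`, and NaN in between."""
    activity = np.asarray(activity, dtype=float)
    labels = np.full(activity.shape, np.nan)
    labels[activity < low] = 0.0
    labels[activity > high] = 1.0
    return labels


toy_activity = [0.05, 0.25, 0.61]  # toy values, not taken from the dataset
toy_labels = label_activity(toy_activity)
print(toy_labels)
```

Values that fall between the two cut-offs stay `NaN`, which corresponds to the 'na' category in the table.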

In [3]:
# Load from tsv provided by Joe
OLS_data = pd.read_csv(
    "20210728-3.EnhancerTable.ForAdam.FunctionalEnhancers.WT-detected.ABL-notDetected.10R-20U-0.1P.tsv",
    sep="\t",
    na_values="na",
)

# Add sequence length
OLS_data = OLS_data.rename({"SEQUENCE":"SEQ"}, axis=1)
OLS_data["SEQ_LEN"] = OLS_data["SEQ"].apply(len)

# Define a set of block features, for current lack of a better term
block_features = ["linker_1", "TFBS_1", "linker_2", "TFBS_2", "linker_3", "TFBS_3", "linker_4", "TFBS_4", "linker_5", "TFBS_5", "linker_6"]

# Add these as columns to the dataframe
OLS_data[block_features] = OLS_data["NAME"].str.split("-").to_list()

# Save non-tiled dataframe in standard format
OLS_data.to_csv("2021_OLS_Library.tsv", index=False, sep="\t")

# Check it
OLS_data.head(1)

Unnamed: 0,NAME,SEQ,MPRA_FXN,MICROSCOPE_FXN,ACTIVITY_SUMRNA_NUMDNA,SEQ_LEN,linker_1,TFBS_1,linker_2,TFBS_2,linker_3,TFBS_3,linker_4,TFBS_4,linker_5,TFBS_5,linker_6
0,S1-G1R-S2-E1F-S3-E2F-S4-G2R-S5-G3F-S6,CATCTGAAGCTCGTTATCTCTAACGGAAGTTTTCGAAAAGGAAATT...,1.0,Neural Enhancer,0.611767,66,S1,G1R,S2,E1F,S3,E2F,S4,G2R,S5,G3F,S6


In [4]:
# Add labels based on decided cutoffs of activity
OLS_data["FXN_LABEL"] = np.nan
OLS_data.loc[OLS_data["ACTIVITY_SUMRNA_NUMDNA"] <= ACTIVE_LOW, "FXN_LABEL"] = 0
OLS_data.loc[OLS_data["ACTIVITY_SUMRNA_NUMDNA"] >= ACTIVE_HIGH, "FXN_LABEL"] = 1

# Impute activity of 23 NaN and 25 inf enhancers with the mean to avoid regression errors
OLS_data["ACTIVITY_SUMRNA_NUMDNA"] = OLS_data["ACTIVITY_SUMRNA_NUMDNA"].replace([np.inf, -np.inf], np.nan)
OLS_data["ACTIVITY_SUMRNA_NUMDNA"] = OLS_data["ACTIVITY_SUMRNA_NUMDNA"].fillna(OLS_data["ACTIVITY_SUMRNA_NUMDNA"].mean())

# Sanity check to make sure things match up with what Joe did if using the 0.18-0.4 thresholds
OLS_data["MPRA_FXN"].value_counts(dropna=False), OLS_data["FXN_LABEL"].value_counts(dropna=False)

(0.0    208729
 NaN    157864
 1.0     94207
 Name: MPRA_FXN, dtype: int64,
 1.0    460777
 NaN        23
 Name: FXN_LABEL, dtype: int64)
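
The imputation step in the cell above can be illustrated on a toy array. This is a minimal NumPy sketch rather than the pandas calls used in the notebook; `impute_non_finite` is a hypothetical helper name.

```python
import numpy as np


def impute_non_finite(values):
    """Replace NaN and +/-inf with the mean of the finite entries."""
    values = np.asarray(values, dtype=float).copy()
    finite = np.isfinite(values)
    values[~finite] = values[finite].mean()
    return values


toy = np.array([0.2, np.inf, 0.4, np.nan])
imputed = impute_non_finite(toy)
print(imputed)
```

Note that the cell above assigns labels before imputing, so an enhancer with infinite activity still satisfies the `>= ACTIVE_HIGH` comparison and receives a label, while a NaN activity does not. That is consistent with only 23 `NaN` labels appearing in the printed counts.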

In [5]:
# Hold out enhancers that already have good labels, plus the middle "ambiguous" enhancers
holdout_mask = ((OLS_data["FXN_LABEL"].isna()) | (~OLS_data["MICROSCOPE_FXN"].isna()))

# How many seqs with annotated microscope function
sum(OLS_data["FXN_LABEL"].isna()), sum(~OLS_data["MICROSCOPE_FXN"].isna()), sum(holdout_mask)

(23, 78, 101)

# Get indexes for each set

## **All sequences**

In [6]:
# All sequences
binary = OLS_data["FXN_LABEL"].values
activity = OLS_data["ACTIVITY_SUMRNA_NUMDNA"].values
ID = OLS_data["NAME"].values
X_seqs = OLS_data["SEQ"].values
binary.shape, activity.shape, ID.shape, X_seqs.shape

((460800,), (460800,), (460800,), (460800,))

## **Deployment sequences**

In [7]:
# Remove ambiguous enhancers (aka the ones with NaNs). This will be the final training data
training_index = OLS_data[~holdout_mask].index
binary_training = binary[training_index]
activity_training = activity[training_index]
binary_training.shape, activity_training.shape, training_index.shape

((460699,), (460699,), (460699,))

## **Train/test split sequences**

In [9]:
from eugene.preprocessing import split_train_test

In [10]:
# Grab the indexes and the targets for a train set and a test set
train_index, test_index, binary_train, binary_test = split_train_test(training_index, binary_training, split=SPLIT, subset=SUBSET)
activity_train = activity[train_index]
activity_test = activity[test_index]
train_index.shape, test_index.shape, binary_train.shape, binary_test.shape, activity_train.shape, activity_test.shape

((414629,), (46070,), (414629,), (46070,), (414629,), (46070,))
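
The internals of `split_train_test` are not shown in this notebook. A shuffled split over indexes could look like the sketch below, which is an assumption for illustration and not `eugene`'s implementation; `split_indices` and its `seed` parameter are hypothetical.

```python
import numpy as np


def split_indices(index, targets, split=0.9, seed=13):
    """Shuffle positions, cut at `split`, and return train/test indexes and targets."""
    index = np.asarray(index)
    targets = np.asarray(targets)
    order = np.random.default_rng(seed).permutation(len(index))
    n_train = int(round(split * len(index)))
    train_pos, test_pos = order[:n_train], order[n_train:]
    return index[train_pos], index[test_pos], targets[train_pos], targets[test_pos]


toy_index = np.arange(100, 120)              # 20 made-up row indexes
toy_targets = (toy_index % 2).astype(float)  # made-up binary labels
train_idx, test_idx, y_train, y_test = split_indices(toy_index, toy_targets, split=0.9)
print(train_idx.shape, test_idx.shape)
```

Returning indexes rather than rows is what lets the later cells slice every encoding (`X_block`, `X_mixed1`, and so on) with the same `train_index` and `test_index`.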

## **Holdout Sequences**

In [11]:
# Grab the heldout data as well
holdout_index = OLS_data[holdout_mask].index
binary_holdout = binary[holdout_index]
activity_holdout = activity[holdout_index]
binary_holdout.shape, activity_holdout.shape, holdout_index.shape

((101,), (101,), (101,))

## **!!!Sanity Check!!!**

In [12]:
len(train_index) + len(test_index) == len(training_index), \
len(training_index) + len(holdout_index) == len(OLS_data), \
len(binary_train) + len(binary_test) == len(binary_training), \
len(binary_training) + len(binary_holdout) == len(binary), \
len(activity_train) + len(activity_test) == len(activity_training), \
len(activity_training) + len(activity_holdout) == len(activity)

(True, True, True, True, True, True)
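
The length checks above would still pass if an index appeared in two sets while another was missing. A stricter check confirms the sets form a true partition. This is a sketch that assumes the DataFrame has the default `0..n-1` integer index; `is_partition` is a hypothetical helper.

```python
import numpy as np


def is_partition(parts, n_total):
    """True if `parts` are disjoint and together cover 0..n_total-1 exactly once."""
    combined = np.sort(np.concatenate(parts))
    return bool(np.array_equal(combined, np.arange(n_total)))


toy_train = np.array([0, 2, 5, 7])
toy_test = np.array([1, 6])
toy_holdout = np.array([3, 4])
ok = is_partition([toy_train, toy_test, toy_holdout], n_total=8)
overlapping = is_partition([toy_train, toy_train, toy_holdout], n_total=8)
print(ok, overlapping)
```

In the notebook this would be called as `is_partition([train_index, test_index, holdout_index], len(OLS_data))`.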

# Save relevant information (labels, IDs, sequences)

## **Binary Labels**
0 (non-functional) and 1 (functional). These are all the same

In [13]:
# Save the labels
if not os.path.isdir("binary"):
    os.makedirs("binary")  

np.savetxt("binary/{}_y-all_binary.txt".format(PREPROCESS), X=binary, fmt="%f")
np.savetxt("binary/{}_y-training_binary.txt".format(PREPROCESS), X=binary_training, fmt="%f")
np.savetxt("binary/{}_y-train-{}_binary.txt".format(PREPROCESS, SPLIT), X=binary_train, fmt="%f")
np.savetxt("binary/{}_y-test-{}_binary.txt".format(PREPROCESS, round(1-SPLIT, 1)), X=binary_test, fmt="%f")
np.savetxt("binary/{}_y-holdout_binary.txt".format(PREPROCESS), X=binary_holdout, fmt="%f")
!wc -l binary/*

  460800 binary/0.09-0.4_y-all_binary.txt
  263252 binary/0.09-0.4_y-holdout_binary.txt
   19755 binary/0.09-0.4_y-test-0.1_binary.txt
  177793 binary/0.09-0.4_y-train-0.9_binary.txt
  197548 binary/0.09-0.4_y-training_binary.txt
  460800 binary/0.18-0.4_y-all_binary.txt
   30288 binary/0.18-0.4_y-test-0.1_binary.txt
  272591 binary/0.18-0.4_y-train-0.9_binary.txt
  302879 binary/0.18-0.4_y-training_binary.txt
  460800 binary/inf--inf_y-all_binary.txt
     101 binary/inf--inf_y-holdout_binary.txt
   46070 binary/inf--inf_y-test-0.1_binary.txt
  414629 binary/inf--inf_y-train-0.9_binary.txt
  460699 binary/inf--inf_y-training_binary.txt
 3568005 total


## **Continuous Labels**
Normalized MPRA counts of RNA reads

In [14]:
# Save the labels
if not os.path.isdir("activity"):
    os.makedirs("activity")  

np.savetxt("activity/{}_y-all_activity.txt".format(PREPROCESS), X=activity, fmt="%f")
np.savetxt("activity/{}_y-training_activity.txt".format(PREPROCESS), X=activity_training, fmt="%f")
np.savetxt("activity/{}_y-train-{}_activity.txt".format(PREPROCESS, SPLIT), X=activity_train, fmt="%f")
np.savetxt("activity/{}_y-test-{}_activity.txt".format(PREPROCESS, round(1-SPLIT, 1)), X=activity_test, fmt="%f")
np.savetxt("activity/{}_y-holdout_activity.txt".format(PREPROCESS), X=activity_holdout, fmt="%f")
!wc -l activity/*

  460800 activity/0.09-0.4_y-all_activity.txt
  263252 activity/0.09-0.4_y-holdout_activity.txt
   19755 activity/0.09-0.4_y-test-0.1_activity.txt
  177793 activity/0.09-0.4_y-train-0.9_activity.txt
  197548 activity/0.09-0.4_y-training_activity.txt
  460800 activity/inf--inf_y-all_activity.txt
     101 activity/inf--inf_y-holdout_activity.txt
   46070 activity/inf--inf_y-test-0.1_activity.txt
  414629 activity/inf--inf_y-train-0.9_activity.txt
  460699 activity/inf--inf_y-training_activity.txt
 2501447 total


## **Identifiers**
L1-TFBS1-... Also all the same

In [15]:
# Save the names of the seqs
if not os.path.isdir("id"):
    os.makedirs("id")
    
np.savetxt("id/{}_id-all.txt".format(PREPROCESS), X=ID, fmt="%s")
np.savetxt("id/{}_id-training.txt".format(PREPROCESS), X=ID[training_index], fmt="%s")
np.savetxt("id/{}_id-train-{}.txt".format(PREPROCESS, SPLIT), X=ID[train_index], fmt="%s")
np.savetxt("id/{}_id-test-{}.txt".format(PREPROCESS, round(1-SPLIT, 1)), X=ID[test_index], fmt="%s")
np.savetxt("id/{}_id-holdout.txt".format(PREPROCESS), X=ID[holdout_index], fmt="%s")

!wc -l id/*

   460800 id/0.09-0.4_id-all.txt
   263252 id/0.09-0.4_id-holdout.txt
    19755 id/0.09-0.4_id-test-0.1.txt
   177793 id/0.09-0.4_id-train-0.9.txt
   197548 id/0.09-0.4_id-training.txt
   460800 id/0.18-0.4_id-all.txt
   157921 id/0.18-0.4_id-holdout.txt
    30288 id/0.18-0.4_id-test-0.1.txt
   272591 id/0.18-0.4_id-train-0.9.txt
   302879 id/0.18-0.4_id-training.txt
   460800 id/inf--inf_id-all.txt
      101 id/inf--inf_id-holdout.txt
    46070 id/inf--inf_id-test-0.1.txt
   414629 id/inf--inf_id-train-0.9.txt
   460699 id/inf--inf_id-training.txt
  3725926 total


## **Sequences**
ACGT...

In [16]:
# Save the different sets of sequences as text files
if not os.path.isdir("seqs"):
    os.makedirs("seqs")  
    
np.savetxt("seqs/{}_seqs-all.txt".format(PREPROCESS), X=X_seqs, fmt="%s")
np.savetxt("seqs/{}_seqs-training.txt".format(PREPROCESS), X=X_seqs[training_index], fmt="%s")
np.savetxt("seqs/{}_seqs-train-{}.txt".format(PREPROCESS, SPLIT), X=X_seqs[train_index], fmt="%s")
np.savetxt("seqs/{}_seqs-test-{}.txt".format(PREPROCESS, round(1-SPLIT, 1)), X=X_seqs[test_index], fmt="%s")
np.savetxt("seqs/{}_seqs-holdout.txt".format(PREPROCESS), X=X_seqs[holdout_index], fmt="%s")
!wc -l seqs/*

   460800 seqs/0.09-0.4_seqs-all.txt
   263252 seqs/0.09-0.4_seqs-holdout.txt
    19755 seqs/0.09-0.4_seqs-test-0.1.txt
   177793 seqs/0.09-0.4_seqs-train-0.9.txt
   197548 seqs/0.09-0.4_seqs-training.txt
   460800 seqs/0.18-0.4_seqs-all.txt
   157921 seqs/0.18-0.4_seqs-holdout.txt
    30288 seqs/0.18-0.4_seqs-test-0.1.txt
   272591 seqs/0.18-0.4_seqs-train-0.9.txt
   302879 seqs/0.18-0.4_seqs-training.txt
   460800 seqs/inf--inf_seqs-all.txt
      101 seqs/inf--inf_seqs-holdout.txt
    46070 seqs/inf--inf_seqs-test-0.1.txt
   414629 seqs/inf--inf_seqs-train-0.9.txt
   460699 seqs/inf--inf_seqs-training.txt
  3725926 total


# Preprocess and save different feature sets

## **<u>Sequence feature idea 1 </u>**: One hot encoding block features
 - linker_{1-5} could be a one-hot encoded vector of length 5 that can be any of S1-S5 --> e.g., [0, 1, 0, 0, 0] encodes S2
 - TFBS_{1-5} could be a one-hot encoded vector of length 10 that can be G{1-3}R, G{1-3}F, E{1,2}F, E{1,2}R
 - **Note**: linker_6 is only ever S6
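
The idea can be sketched without scikit-learn by splitting a NAME on `-` and one-hot encoding each block against its list of options. This is an illustrative stand-in, not the encoder used below; `encode_blocks` is a hypothetical helper and the option lists are sorted alphabetically to mirror how `OneHotEncoder` orders categories.

```python
LINKERS = ["S1", "S2", "S3", "S4", "S5"]
TFBS = ["E1F", "E1R", "E2F", "E2R", "G1F", "G1R", "G2F", "G2R", "G3F", "G3R"]
# Linker and TFBS slots alternate five times, followed by the constant S6 linker
CATEGORIES = [LINKERS, TFBS] * 5 + [["S6"]]


def encode_blocks(name):
    """One-hot encode each `-`-separated block of a NAME and concatenate the results."""
    vector = []
    for block, options in zip(name.split("-"), CATEGORIES):
        vector.extend(1.0 if block == option else 0.0 for option in options)
    return vector


example = encode_blocks("S1-G1R-S2-E1R-S3-E2F-S5-G2R-S4-G3R-S6")
print(len(example), sum(example))
```

Each of the 11 blocks contributes exactly one `1.0`, so the vector length is 5\*5 + 10\*5 + 1 and its sum equals the number of blocks.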

In [17]:
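# Fit to overall dataframe
# NOTE: newer scikit-learn releases rename `sparse` to `sparse_output`
ohe_block = OneHotEncoder(sparse=False)
X = OLS_data[block_features]
ohe_block.fit(X)

# Transform (the encoder was already fitted above)
X_block = ohe_block.transform(X)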

# Sanity check one
ohe_block.categories_, OLS_data.loc[45]["NAME"], X_block[45]

([array(['S1', 'S2', 'S3', 'S4', 'S5'], dtype=object),
  array(['E1F', 'E1R', 'E2F', 'E2R', 'G1F', 'G1R', 'G2F', 'G2R', 'G3F',
         'G3R'], dtype=object),
  array(['S1', 'S2', 'S3', 'S4', 'S5'], dtype=object),
  array(['E1F', 'E1R', 'E2F', 'E2R', 'G1F', 'G1R', 'G2F', 'G2R', 'G3F',
         'G3R'], dtype=object),
  array(['S1', 'S2', 'S3', 'S4', 'S5'], dtype=object),
  array(['E1F', 'E1R', 'E2F', 'E2R', 'G1F', 'G1R', 'G2F', 'G2R', 'G3F',
         'G3R'], dtype=object),
  array(['S1', 'S2', 'S3', 'S4', 'S5'], dtype=object),
  array(['E1F', 'E1R', 'E2F', 'E2R', 'G1F', 'G1R', 'G2F', 'G2R', 'G3F',
         'G3R'], dtype=object),
  array(['S1', 'S2', 'S3', 'S4', 'S5'], dtype=object),
  array(['E1F', 'E1R', 'E2F', 'E2R', 'G1F', 'G1R', 'G2F', 'G2R', 'G3F',
         'G3R'], dtype=object),
  array(['S6'], dtype=object)],
 'S1-G1R-S2-E1R-S3-E2F-S5-G2R-S4-G3R-S6',
 array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,

**Sanity Check!!!** How many features should we be expecting?
 - Each of linker_1 to linker_5 has 5 options and there are 5 of these --> 5\*5=25
 - Each TFBS has 10 options and there are 5 of these --> 10\*5=50
 - linker_6 has one option, S6 --> 1
 - Should have:

In [18]:
print("Expecting %d features" % ((5 * 5) + (10 * 5) + 1))
print("Observed %d features" % X_block.shape[1])
if (5 * 5) + (10 * 5) + 1 != X_block.shape[1]:
    print("Something is amiss...")

# Remove non-unique feature at end
if len(np.unique(X_block[:, -1])) == 1:
    X_block = X_block[:, :-1]

print("Expecting %d features" % ((5 * 5) + (10 * 5)))
print("Observed %d features" % X_block.shape[1])
if (5 * 5) + (10 * 5) != X_block.shape[1]:
    print("Something is amiss...")

Expecting 76 features
Observed 76 features
Expecting 75 features
Observed 75 features


In [19]:
# Save the different files as numpy objects
if not os.path.isdir("block"):
    os.makedirs("block")  
    
# Save the different files as numpy arrays
np.save("block/{}_X-all_block".format(PREPROCESS), arr=X_block)
np.save("block/{}_X-training_block".format(PREPROCESS), arr=X_block[training_index, :])
np.save("block/{}_X-train-{}_block".format(PREPROCESS, SPLIT), arr=X_block[train_index, :])
np.save("block/{}_X-test-{}_block".format(PREPROCESS, round(1-SPLIT, 1)), arr=X_block[test_index, :])
np.save("block/{}_X-holdout_block".format(PREPROCESS), arr=X_block[holdout_index, :])

# Save the header for use down the line. Save as one column per line
with open ("block/block_header.txt", "w") as f:
    f.write("\n".join(np.concatenate(ohe_block.categories_[:-1])))

# Sanity check
!ls -l block

total 2183165
-rw-r--r-- 1 aklie carter-users 276480128 May  5 18:58 0.09-0.4_X-all_block.npy
-rw-r--r-- 1 aklie carter-users 157951328 May  5 18:58 0.09-0.4_X-holdout_block.npy
-rw-r--r-- 1 aklie carter-users  11853128 May  5 18:58 0.09-0.4_X-test-0.1_block.npy
-rw-r--r-- 1 aklie carter-users 106675928 May  5 18:58 0.09-0.4_X-train-0.9_block.npy
-rw-r--r-- 1 aklie carter-users 118528928 May  5 18:58 0.09-0.4_X-training_block.npy
-rw-r--r-- 1 aklie carter-users 276480128 Nov 26  2021 0.18-0.4_X-all_block.npy
-rw-r--r-- 1 aklie carter-users  94752728 Nov 26  2021 0.18-0.4_X-holdout_block.npy
-rw-r--r-- 1 aklie carter-users  18172928 Nov 26  2021 0.18-0.4_X-test-0.1_block.npy
-rw-r--r-- 1 aklie carter-users 163554728 Nov 26  2021 0.18-0.4_X-train-0.9_block.npy
-rw-r--r-- 1 aklie carter-users 181727528 Nov 26  2021 0.18-0.4_X-training_block.npy
-rw-r--r-- 1 aklie carter-users       274 Jun  4 12:09 block_header.txt
-rw-r--r-- 1 aklie carter-users 276480128 Jun  4 12:09 inf--inf_X-all_bloc

## **<u>Sequence feature idea 2 </u>**: Mixed encodings 

In [20]:
from eugene.preprocessing import mixed_OLS_encode
from eugene.preprocessing import standardize_features

In [21]:
X_mixed1, X_mixed2, X_mixed3 = mixed_OLS_encode(OLS_data[block_features])
X_mixed1.shape, X_mixed2.shape, X_mixed3.shape

460800it [00:38, 11880.24it/s]


((460800, 21), (460800, 26), (460800, 21))

In [22]:
OLS_data.iloc[0]["NAME"], X_mixed1[0], X_mixed2[0], X_mixed3[0]

('S1-G1R-S2-E1F-S3-E2F-S4-G2R-S5-G3F-S6',
 array([12. ,  0. ,  0. ,  0.9,  2. ,  1. ,  1. ,  0.6,  7. ,  1. ,  1. ,
         0.4,  5. ,  0. ,  0. ,  0.3,  0. ,  0. ,  1. ,  0.5,  1. ]),
 array([12. ,  0. ,  0. ,  0.9, -1. ,  2. ,  0.6,  1. ,  0. ,  0. ,  7. ,
         0.4,  1. ,  0. ,  0. ,  5. ,  0. ,  0. ,  0.3, -1. ,  0. ,  0. ,
         0. ,  0.5,  1. ,  1. ]),
 array([12. ,  0. ,  0.9,  0. ,  2. ,  0.6,  0. ,  1. ,  7. ,  0.4,  0. ,
         1. ,  5. ,  0. ,  0.3,  0. ,  0. ,  0. ,  0.5,  1. ,  1. ]))

### *Mixed 1.0*
 - Replace binding sites using dictionary
 - Separate based on these binding sites and create "dummy variables"
 - Get lengths of linkers around binding sites

In [25]:
# Subset train and test
X_mixed1_train = X_mixed1[train_index, :]
X_mixed1_test = X_mixed1[test_index, :]
X_mixed1_training = X_mixed1[training_index, :]
X_mixed1_holdout = X_mixed1[holdout_index, :]

# Create the output directory first so the stats pickles below have somewhere to go
if not os.path.isdir("mixed_1.0"):
    os.makedirs("mixed_1.0")

# Standardize features and save the stats for both the training and train sets
scale_indeces = np.array([0, 3, 4, 7, 8, 11, 12, 15, 16, 19, 20])  # Mixed 1.0
_, X_mixed1_training_scaled = standardize_features(train_X=X_mixed1_training, test_X=X_mixed1_training, indeces=scale_indeces, stats_file="mixed_1.0/{}_X-training_stats.pickle".format(PREPROCESS))
X_mixed1_train_scaled, X_mixed1_test_scaled = standardize_features(train_X=X_mixed1_train, test_X=X_mixed1_test, indeces=scale_indeces, stats_file="mixed_1.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT))
X_mixed1_train_scaled, X_mixed1_holdout_scaled = standardize_features(train_X=X_mixed1_train, test_X=X_mixed1_holdout, indeces=scale_indeces)

# Save the different files as numpy arrays
    
np.save("mixed_1.0/{}_X-all_mixed-1.0".format(PREPROCESS), X_mixed1)
np.save("mixed_1.0/{}_X-training_mixed-1.0".format(PREPROCESS), X_mixed1_training_scaled)
np.save("mixed_1.0/{}_X-train-{}_mixed-1.0".format(PREPROCESS, SPLIT), X_mixed1_train_scaled)
np.save("mixed_1.0/{}_X-test-{}_mixed-1.0".format(PREPROCESS, round(1-SPLIT, 1)), X_mixed1_test_scaled)
np.save("mixed_1.0/{}_X-holdout_mixed-1.0".format(PREPROCESS), X_mixed1_holdout_scaled)
    
!ls -l mixed_1.0

# !!!Sanity Check!!!
X_mixed1_train[:, scale_indeces].mean(axis=0), \
X_mixed1_train[:, scale_indeces].std(axis=0), \
X_mixed1_train_scaled[:, scale_indeces].mean(axis=0), \
X_mixed1_train_scaled[:, scale_indeces].std(axis=0), \
X_mixed1_training[:, scale_indeces].mean(axis=0), \
X_mixed1_training[:, scale_indeces].std(axis=0), \
X_mixed1_training_scaled[:, scale_indeces].mean(axis=0), \
X_mixed1_training_scaled[:, scale_indeces].std(axis=0)

total 614739
-rw-r--r-- 1 aklie carter-users 77414528 Nov 27  2021 0.09-0.4_X-all_mixed-1.0.npy
-rw-r--r-- 1 aklie carter-users 44226464 Nov 27  2021 0.09-0.4_X-holdout_mixed-1.0.npy
-rw-r--r-- 1 aklie carter-users  3318968 Nov 27  2021 0.09-0.4_X-test-0.1_mixed-1.0.npy
-rw-r--r-- 1 aklie carter-users 29869352 Nov 27  2021 0.09-0.4_X-train-0.9_mixed-1.0.npy
-rw-r--r-- 1 aklie carter-users      534 Nov 27  2021 0.09-0.4_X-train-0.9_stats.pickle
-rw-r--r-- 1 aklie carter-users 33188192 Nov 27  2021 0.09-0.4_X-training_mixed-1.0.npy
-rw-r--r-- 1 aklie carter-users      534 Nov 27  2021 0.09-0.4_X-training_stats.pickle
-rw-r--r-- 1 aklie carter-users  3523378 Oct 28  2021 0.18-0.4_mixed-1.0_eda.html
-rw-r--r-- 1 aklie carter-users 77414528 Nov 26  2021 0.18-0.4_X-all_mixed-1.0.npy
-rw-r--r-- 1 aklie carter-users 26530856 Nov 26  2021 0.18-0.4_X-holdout_mixed-1.0.npy
-rw-r--r-- 1 aklie carter-users  5088512 Nov 26  2021 0.18-0.4_X-test-0.1_mixed-1.0.npy
-rw-r--r-- 1 aklie carter-users 45795

(array([5.12276999, 0.53998225, 5.03947625, 0.53976495, 5.04224017,
        0.54005726, 5.03799541, 0.53998563, 5.03761676, 0.54020992,
        0.9001927 ]),
 array([4.15142491, 0.20594161, 4.13440734, 0.20593279, 4.13205647,
        0.20595367, 4.13085135, 0.20578677, 4.13232906, 0.20594787,
        0.29974289]),
 array([ 2.12325344e-17, -7.14948615e-16,  5.66886390e-17,  3.78895347e-16,
        -8.03717403e-17, -9.64460884e-17,  3.83865028e-18, -7.59521514e-16,
         7.18718718e-17, -2.80975491e-16,  4.63037190e-17]),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
 array([5.12014786, 0.54000551, 5.03976349, 0.53999401, 5.04012598,
        0.53999531, 5.0399914 , 0.53999423, 5.03993931, 0.54001094,
        0.90000847]),
 array([4.15033187, 0.20591487, 4.13259336, 0.20590925, 4.1326689 ,
        0.20591554, 4.13262294, 0.205908  , 4.13255096, 0.20591535,
        0.29998871]),
 array([-6.88489181e-17,  7.12919442e-16, -1.89951451e-16,  2.84279404e-16,
         7.50798686e-17, 

### *Mixed 2.0*
 - Replace binding sites using dictionary
 - 4 bit vector for each binding site [ets_affinity ets_orientation gata_affinity gata_orientation] - ties together the identity to affinity
 - Get lengths of linkers around binding sites

In [26]:
# Subset train and test
X_mixed2_train = X_mixed2[train_index, :]
X_mixed2_test = X_mixed2[test_index, :]
X_mixed2_training = X_mixed2[training_index, :]
X_mixed2_holdout = X_mixed2[holdout_index, :]

# Create the output directory first so the stats pickles below have somewhere to go
if not os.path.isdir("mixed_2.0"):
    os.makedirs("mixed_2.0")

# Standardize features and save the stats for both the training and train sets
scale_indeces = np.array([0, 5, 10, 15, 20, 25])  # Mixed 2.0
_, X_mixed2_training_scaled = standardize_features(train_X=X_mixed2_training, test_X=X_mixed2_training, indeces=scale_indeces, stats_file="mixed_2.0/{}_X-training_stats.pickle".format(PREPROCESS))
X_mixed2_train_scaled, X_mixed2_test_scaled = standardize_features(train_X=X_mixed2_train, test_X=X_mixed2_test, indeces=scale_indeces, stats_file="mixed_2.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT))
X_mixed2_train_scaled, X_mixed2_holdout_scaled = standardize_features(train_X=X_mixed2_train, test_X=X_mixed2_holdout, indeces=scale_indeces)

# Save the different files as numpy arrays
    
np.save("mixed_2.0/{}_X-all_mixed-2.0".format(PREPROCESS), X_mixed2)
np.save("mixed_2.0/{}_X-training_mixed-2.0".format(PREPROCESS), X_mixed2_training_scaled)
np.save("mixed_2.0/{}_X-train-{}_mixed-2.0".format(PREPROCESS, SPLIT), X_mixed2_train_scaled)
np.save("mixed_2.0/{}_X-test-{}_mixed-2.0".format(PREPROCESS, round(1-SPLIT, 1)), X_mixed2_test_scaled)
np.save("mixed_2.0/{}_X-holdout_mixed-2.0".format(PREPROCESS), X_mixed2_holdout_scaled)
    
!ls -l mixed_2.0

# !!!Sanity Check!!!
X_mixed2_train[:, scale_indeces].mean(axis=0), \
X_mixed2_train[:, scale_indeces].std(axis=0), \
X_mixed2_train_scaled[:, scale_indeces].mean(axis=0), \
X_mixed2_train_scaled[:, scale_indeces].std(axis=0), \
X_mixed2_training[:, scale_indeces].mean(axis=0), \
X_mixed2_training[:, scale_indeces].std(axis=0), \
X_mixed2_training_scaled[:, scale_indeces].mean(axis=0), \
X_mixed2_training_scaled[:, scale_indeces].std(axis=0)

total 750671
-rw-r--r-- 1 aklie carter-users 95846528 Nov 27  2021 0.09-0.4_X-all_mixed-2.0.npy
-rw-r--r-- 1 aklie carter-users 54756544 Nov 27  2021 0.09-0.4_X-holdout_mixed-2.0.npy
-rw-r--r-- 1 aklie carter-users  4109168 Nov 27  2021 0.09-0.4_X-test-0.1_mixed-2.0.npy
-rw-r--r-- 1 aklie carter-users 36981072 Nov 27  2021 0.09-0.4_X-train-0.9_mixed-2.0.npy
-rw-r--r-- 1 aklie carter-users      414 Nov 27  2021 0.09-0.4_X-train-0.9_stats.pickle
-rw-r--r-- 1 aklie carter-users 41090112 Nov 27  2021 0.09-0.4_X-training_mixed-2.0.npy
-rw-r--r-- 1 aklie carter-users      414 Nov 27  2021 0.09-0.4_X-training_stats.pickle
-rw-r--r-- 1 aklie carter-users 95846528 Nov 26  2021 0.18-0.4_X-all_mixed-2.0.npy
-rw-r--r-- 1 aklie carter-users 26530856 Nov 26  2021 0.18-0.4_X-holdout_mixed-2.0.npy
-rw-r--r-- 1 aklie carter-users  6300032 Nov 26  2021 0.18-0.4_X-test-0.1_mixed-2.0.npy
-rw-r--r-- 1 aklie carter-users 56699056 Nov 26  2021 0.18-0.4_X-train-0.9_mixed-2.0.npy
-rw-r--r-- 1 aklie carter-user

(array([5.12276999, 5.03947625, 5.04224017, 5.03799541, 5.03761676,
        0.9001927 ]),
 array([4.15142491, 4.13440734, 4.13205647, 4.13085135, 4.13232906,
        0.29974289]),
 array([ 2.12325344e-17,  5.66886390e-17, -8.03717403e-17,  3.83865028e-18,
         7.18718718e-17,  4.63037190e-17]),
 array([1., 1., 1., 1., 1., 1.]),
 array([5.12014786, 5.03976349, 5.04012598, 5.0399914 , 5.03993931,
        0.90000847]),
 array([4.15033187, 4.13259336, 4.1326689 , 4.13262294, 4.13255096,
        0.29998871]),
 array([-6.88489181e-17, -1.89951451e-16,  7.50798686e-17, -3.57816958e-17,
         2.56641128e-17, -5.70039430e-17]),
 array([1., 1., 1., 1., 1., 1.]))
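
The sanity check above relies on z-scoring only the selected columns with statistics taken from the training rows. The sketch below shows that pattern on toy data. It is a stand-in for illustration and not the `standardize_features` implementation from `eugene`; `standardize_columns` is a hypothetical helper and it does not write a stats file.

```python
import numpy as np


def standardize_columns(train_X, test_X, columns):
    """Z-score `columns` of both arrays using the mean/std of `train_X` only."""
    train_X = np.array(train_X, dtype=float)  # np.array copies, so inputs stay untouched
    test_X = np.array(test_X, dtype=float)
    mean = train_X[:, columns].mean(axis=0)
    std = train_X[:, columns].std(axis=0)
    train_X[:, columns] = (train_X[:, columns] - mean) / std
    test_X[:, columns] = (test_X[:, columns] - mean) / std
    return train_X, test_X


toy_train = np.array([[12.0, 1.0], [2.0, 0.0], [7.0, 1.0]])  # made-up values
toy_test = np.array([[5.0, 0.0]])
train_scaled, test_scaled = standardize_columns(toy_train, toy_test, columns=np.array([0]))
print(train_scaled[:, 0].mean(), train_scaled[:, 0].std())
```

Only the training rows are expected to end up with mean 0 and standard deviation 1. The test rows are shifted and scaled by the same training statistics, so their own mean and standard deviation will generally differ.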

### *Mixed 3.0*
 - Replace binding sites using dictionary
 - 3 bit vector for each binding site [ets_affinity gata_affinity orientation] - ties together the identity to affinity while removing redundant info from mixed-2.0
 - Get lengths of linkers around binding sites
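
The per-site vectors of mixed-2.0 and mixed-3.0 can be compared side by side. The sketch below is an assumption-laden illustration, not the code inside `mixed_OLS_encode`: the affinities and orientation codes are read off the single example printed earlier for `S1-G1R-S2-E1F-S3-E2F-S4-G2R-S5-G3F-S6`, the reading of `E` as ETS and `G` as GATA is inferred from the feature names, and `site_mixed2` and `site_mixed3` are hypothetical helpers.

```python
# Affinities inferred from the one printed example; treat them as assumptions
AFFINITY = {"E1": 0.6, "E2": 0.4, "G1": 0.9, "G2": 0.3, "G3": 0.5}


def site_mixed2(site):
    """[ets_affinity, ets_orientation, gata_affinity, gata_orientation], with F=1 and R=-1."""
    family = site[:2]
    orientation = 1.0 if site[2] == "F" else -1.0
    if family.startswith("E"):
        return [AFFINITY[family], orientation, 0.0, 0.0]
    return [0.0, 0.0, AFFINITY[family], orientation]


def site_mixed3(site):
    """[ets_affinity, gata_affinity, orientation], with F=1 and R=0."""
    family = site[:2]
    orientation = 1.0 if site[2] == "F" else 0.0
    if family.startswith("E"):
        return [AFFINITY[family], 0.0, orientation]
    return [0.0, AFFINITY[family], orientation]


for site in ["G1R", "E1F", "G3F"]:
    print(site, site_mixed2(site), site_mixed3(site))
```

Because a site is either ETS or GATA, one of the two orientation slots in mixed-2.0 is always zero, which is the redundancy that the single shared orientation slot in mixed-3.0 removes.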

In [27]:
# Subset train and test
X_mixed3_train = X_mixed3[train_index, :]
X_mixed3_test = X_mixed3[test_index, :]
X_mixed3_training = X_mixed3[training_index, :]
X_mixed3_holdout = X_mixed3[holdout_index, :]

# Create the output directory first so the stats pickles below have somewhere to go
if not os.path.isdir("mixed_3.0"):
    os.makedirs("mixed_3.0")

# Standardize features and save the stats for both the training and train sets
scale_indeces = np.array([0, 4, 8, 12, 16, 20])  # Mixed 3.0
_, X_mixed3_training_scaled = standardize_features(train_X=X_mixed3_training, test_X=X_mixed3_training, indeces=scale_indeces, stats_file="mixed_3.0/{}_X-training_stats.pickle".format(PREPROCESS))
X_mixed3_train_scaled, X_mixed3_test_scaled = standardize_features(train_X=X_mixed3_train, test_X=X_mixed3_test, indeces=scale_indeces, stats_file="mixed_3.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT))
X_mixed3_train_scaled, X_mixed3_holdout_scaled = standardize_features(train_X=X_mixed3_train, test_X=X_mixed3_holdout, indeces=scale_indeces)

# Save the different files as numpy arrays
    
np.save("mixed_3.0/{}_X-all_mixed-3.0".format(PREPROCESS), X_mixed3)
np.save("mixed_3.0/{}_X-training_mixed-3.0".format(PREPROCESS), X_mixed3_training_scaled)
np.save("mixed_3.0/{}_X-train-{}_mixed-3.0".format(PREPROCESS, SPLIT), X_mixed3_train_scaled)
np.save("mixed_3.0/{}_X-test-{}_mixed-3.0".format(PREPROCESS, round(1-SPLIT, 1)), X_mixed3_test_scaled)
np.save("mixed_3.0/{}_X-holdout_mixed-3.0".format(PREPROCESS), X_mixed3_holdout_scaled)
    
!ls -l mixed_3.0

# !!!Sanity Check!!!
X_mixed3_train[:, scale_indeces].mean(axis=0), \
X_mixed3_train[:, scale_indeces].std(axis=0), \
X_mixed3_train_scaled[:, scale_indeces].mean(axis=0), \
X_mixed3_train_scaled[:, scale_indeces].std(axis=0), \
X_mixed3_training[:, scale_indeces].mean(axis=0), \
X_mixed3_training[:, scale_indeces].std(axis=0), \
X_mixed3_training_scaled[:, scale_indeces].mean(axis=0), \
X_mixed3_training_scaled[:, scale_indeces].std(axis=0)

total 611296
-rw-r--r-- 1 aklie carter-users 77414528 Nov 27  2021 0.09-0.4_X-all_mixed-3.0.npy
-rw-r--r-- 1 aklie carter-users 44226464 Nov 27  2021 0.09-0.4_X-holdout_mixed-3.0.npy
-rw-r--r-- 1 aklie carter-users  3318968 Nov 27  2021 0.09-0.4_X-test-0.1_mixed-3.0.npy
-rw-r--r-- 1 aklie carter-users 29869352 Nov 27  2021 0.09-0.4_X-train-0.9_mixed-3.0.npy
-rw-r--r-- 1 aklie carter-users      414 Nov 27  2021 0.09-0.4_X-train-0.9_stats.pickle
-rw-r--r-- 1 aklie carter-users 33188192 Nov 27  2021 0.09-0.4_X-training_mixed-3.0.npy
-rw-r--r-- 1 aklie carter-users      414 Nov 27  2021 0.09-0.4_X-training_stats.pickle
-rw-r--r-- 1 aklie carter-users 77414528 Nov 26  2021 0.18-0.4_X-all_mixed-3.0.npy
-rw-r--r-- 1 aklie carter-users 26530856 Nov 26  2021 0.18-0.4_X-holdout_mixed-3.0.npy
-rw-r--r-- 1 aklie carter-users  5088512 Nov 26  2021 0.18-0.4_X-test-0.1_mixed-3.0.npy
-rw-r--r-- 1 aklie carter-users 45795416 Nov 26  2021 0.18-0.4_X-train-0.9_mixed-3.0.npy
-rw-r--r-- 1 aklie carter-user

(array([5.12276999, 5.03947625, 5.04224017, 5.03799541, 5.03761676,
        0.9001927 ]),
 array([4.15142491, 4.13440734, 4.13205647, 4.13085135, 4.13232906,
        0.29974289]),
 array([ 2.12325344e-17,  5.66886390e-17, -8.03717403e-17,  3.83865028e-18,
         7.18718718e-17,  4.63037190e-17]),
 array([1., 1., 1., 1., 1., 1.]),
 array([5.12014786, 5.03976349, 5.04012598, 5.0399914 , 5.03993931,
        0.90000847]),
 array([4.15033187, 4.13259336, 4.1326689 , 4.13262294, 4.13255096,
        0.29998871]),
 array([-6.88489181e-17, -1.89951451e-16,  7.50798686e-17, -3.57816958e-17,
         2.56641128e-17, -5.70039430e-17]),
 array([1., 1., 1., 1., 1., 1.]))

## **<u>Sequence feature idea 3 </u>**: Use the actual sequence (one-hot encoded)
 - One-hot encoded sequence: each position is encoded as a 1-D vector of size 4, e.g., AT is [[1,0,0,0], [0,0,0,1]]
 - Generally, we will get inputs of size (len(seq) x 4). The above example would be of size 2x4
 - Can also save the string seqs in case those are also useful down the line
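
A minimal version of this encoding is sketched below. It is not the `encodeDNA`/`decodeDNA` pair from `eugene`; `one_hot_dna` and `decode_dna` are hypothetical helpers, and the `ACGT` column order is assumed from the "AT" example above.

```python
import numpy as np

ALPHABET = "ACGT"  # column order assumed from the "AT" example above


def one_hot_dna(seq):
    """Return a (len(seq), 4) array with a single 1 per row."""
    encoded = np.zeros((len(seq), len(ALPHABET)))
    for position, base in enumerate(seq.upper()):
        encoded[position, ALPHABET.index(base)] = 1.0
    return encoded


def decode_dna(encoded):
    """Map each row back to the base whose column holds the 1."""
    return "".join(ALPHABET[column] for column in encoded.argmax(axis=1))


x = one_hot_dna("AT")
print(x.shape)
print(x)
```

This sketch raises a `ValueError` for any character outside `ACGT` (for example `N`); how the library helpers treat such characters is not covered here.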

**Q**: Are all sequences the same length?

In [28]:
# Check the lengths of sequences to make sure they are all the same
OLS_data["SEQ"].apply(len).value_counts()

66    460800
Name: SEQ, dtype: int64

**Answer**: Yes, 66bp long

### *Forward Seqs*

In [29]:
from eugene.preprocessing import encodeDNA, decodeDNA

In [30]:
# Quick test
decodeDNA(encodeDNA(X_seqs[0:5])) == X_seqs[0:5]

array([ True,  True,  True,  True,  True])

In [31]:
X_ohe_seq = encodeDNA(X_seqs)

# Save the seqs
if not os.path.isdir("ohe_seq"):
    os.makedirs("ohe_seq")
    
np.save("ohe_seq/{}_X-all_ohe-seq".format(PREPROCESS), X_ohe_seq)
np.save("ohe_seq/{}_X-training_ohe-seq".format(PREPROCESS), X_ohe_seq[training_index, :, :])
np.save("ohe_seq/{}_X-train-{}_ohe-seq".format(PREPROCESS, SPLIT), arr=X_ohe_seq[train_index, :, :])
np.save("ohe_seq/{}_X-test-{}_ohe-seq".format(PREPROCESS, round(1-SPLIT, 1)), arr=X_ohe_seq[test_index, :, :])
np.save("ohe_seq/{}_X-holdout_ohe-seq".format(PREPROCESS), X_ohe_seq[holdout_index, :, :])

# Quick check
print(X_seqs[0][:5], X_ohe_seq[0][:5])

!ls -l ohe_seq

CATCT [[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]]
total 15369456
-rw-r--r-- 1 aklie carter-users 973209728 Nov 26  2021 0.09-0.4_X-all_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 973209728 Nov 26  2021 0.09-0.4_X-all_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users 555988352 Nov 26  2021 0.09-0.4_X-holdout_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 555988352 Nov 26  2021 0.09-0.4_X-holdout_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users  41722688 Nov 26  2021 0.09-0.4_X-test-0.1_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users  41722688 Nov 26  2021 0.09-0.4_X-test-0.1_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users 375498944 Nov 26  2021 0.09-0.4_X-train-0.9_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 375498944 Nov 26  2021 0.09-0.4_X-train-0.9_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users 417221504 Nov 26  2021 0.09-0.4_X-training_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 417221504 Nov 26  2021 0.09-0.4_X-training_ohe-seq-rev.npy
-rw-r--r-- 1 aklie cart

### *Reverse Seqs*

In [32]:
from eugene.utils.seq_utils import reverse_complement

In [33]:
# Get the reverse encodings
X_rev_seqs = np.array([reverse_complement(seq) for seq in X_seqs], dtype="object")
X_ohe_rev_seq = encodeDNA(X_rev_seqs)

# Save the seqs
if not os.path.isdir("ohe_seq"):
    os.makedirs("ohe_seq")
    
np.save("ohe_seq/{}_X-all_ohe-seq-rev".format(PREPROCESS), X_ohe_rev_seq)
np.save("ohe_seq/{}_X-training_ohe-seq-rev".format(PREPROCESS), arr=X_ohe_rev_seq[training_index, :, :])
np.save("ohe_seq/{}_X-train-{}_ohe-seq-rev".format(PREPROCESS, SPLIT), arr=X_ohe_rev_seq[train_index, :, :])
np.save("ohe_seq/{}_X-test-{}_ohe-seq-rev".format(PREPROCESS, round(1-SPLIT, 1)), arr=X_ohe_rev_seq[test_index, :, :])
np.save("ohe_seq/{}_X-holdout_ohe-seq-rev".format(PREPROCESS), X_ohe_rev_seq[holdout_index, :, :])

# Quick check
print(X_rev_seqs[0][:5], X_ohe_rev_seq[0][:5])

!ls -l ohe_seq

TCCTA [[0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]]
total 15369456
-rw-r--r-- 1 aklie carter-users 973209728 Nov 26  2021 0.09-0.4_X-all_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 973209728 Nov 26  2021 0.09-0.4_X-all_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users 555988352 Nov 26  2021 0.09-0.4_X-holdout_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 555988352 Nov 26  2021 0.09-0.4_X-holdout_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users  41722688 Nov 26  2021 0.09-0.4_X-test-0.1_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users  41722688 Nov 26  2021 0.09-0.4_X-test-0.1_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users 375498944 Nov 26  2021 0.09-0.4_X-train-0.9_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 375498944 Nov 26  2021 0.09-0.4_X-train-0.9_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users 417221504 Nov 26  2021 0.09-0.4_X-training_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 417221504 Nov 26  2021 0.09-0.4_X-training_ohe-seq-rev.npy
-rw-r--r-- 1 aklie cart

In [34]:
# Compare each forward sequence with the decoded reverse complement of that same sequence
print(X_seqs[0][:5], decodeDNA(X_ohe_rev_seq[0:1])[0][-5:][::-1])
print(X_seqs[10000][:5], decodeDNA(X_ohe_rev_seq[10000:10001])[0][-5:][::-1])

CATCT GTAGA
TGCTC ACGAG
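
The check above reads the tail of the decoded reverse-complement encoding backwards and compares it, by eye, with the start of the forward sequence. A minimal, self-contained sketch of the same idea follows, assuming the A, C, G, T channel order seen in the printed arrays. The `one_hot`, `reverse_complement` and `flip_both_axes` functions are illustrative stand-ins written in plain Python, not the `encodeDNA` or `eugene` implementations.

```python
# Illustrative stand-ins (plain Python, no external libraries); assumes A, C, G, T channel order.
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement; unknown characters (e.g. N) are kept as-is."""
    return "".join(COMPLEMENT.get(base, base) for base in reversed(seq))


def one_hot(seq):
    """One-hot encode a sequence as a list of length-4 rows; unknown bases give all-zero rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def flip_both_axes(ohe):
    """Reverse the position axis and the channel axis of a one-hot matrix."""
    return [row[::-1] for row in ohe[::-1]]


seq = "CATCT"
rev = reverse_complement(seq)

# With A, C, G, T ordering, complementing swaps channels 0<->3 and 1<->2,
# so encoding the reverse complement equals flipping both axes of the forward encoding.
assert one_hot(rev) == flip_both_axes(one_hot(seq))

# Reading the reverse complement backwards gives the plain complement of the forward strand
print(seq, rev, rev[::-1])
```

Because of this symmetry, the reverse-complement arrays could in principle be derived from the forward arrays by flipping two axes, provided the encoder really uses the A, C, G, T order.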


## **<u>Sequence feature idea 4 </u>**: Use the actual sequence (as fasta)

In [35]:
from eugene.utils.seq_utils import seq2Fasta, gkmSeq2Fasta

In [36]:
# All seqs
file_name = "fasta/{}_X-all_fasta".format(PREPROCESS)
seq2Fasta(X_seqs, ID, name=file_name)

# Training seqs
file_name = "fasta/{}_X-training_fasta".format(PREPROCESS)
gkmSeq2Fasta(X_seqs[training_index], ID[training_index], binary_training, name=file_name)

# Train seqs
file_name = "fasta/{}_X-train-{}_fasta".format(PREPROCESS, SPLIT)
gkmSeq2Fasta(X_seqs[train_index], ID[train_index], binary_train, name=file_name)

# Test seqs
file_name = "fasta/{}_X-test-{}_fasta".format(PREPROCESS, round(1-SPLIT, 1))
seq2Fasta(X_seqs[test_index], ID[test_index], name=file_name)

# Holdout seqs
file_name = "fasta/{}_X-holdout_fasta".format(PREPROCESS)
seq2Fasta(X_seqs[holdout_index], ID[holdout_index], name=file_name)

# Sanity checks
print(len(ID)*2)
print(len(training_index)*2)
print(len(train_index)*2)
print(len(test_index)*2)
print(len(holdout_index)*2)
print(len(ID)*2 + len(training_index)*2 + len(train_index)*2 + len(test_index)*2 + len(holdout_index)*2)

!wc -l fasta/0.09-0.4*
!wc -l fasta/0.18-0.4*
!wc -l fasta/inf--inf*

921600
921398
829258
92140
202
2764598
   921600 fasta/0.09-0.4_X-all_fasta.fa
   526504 fasta/0.09-0.4_X-holdout_fasta.fa
    39510 fasta/0.09-0.4_X-test-0.1_fasta.fa
   185928 fasta/0.09-0.4_X-train-0.9_fasta-neg.fa
   169658 fasta/0.09-0.4_X-train-0.9_fasta-pos.fa
   206718 fasta/0.09-0.4_X-training_fasta-neg.fa
   188378 fasta/0.09-0.4_X-training_fasta-pos.fa
  2238296 total
   921600 fasta/0.18-0.4_X-all_fasta.fa
   315842 fasta/0.18-0.4_X-holdout_fasta.fa
    60576 fasta/0.18-0.4_X-test-0.1_fasta.fa
   375660 fasta/0.18-0.4_X-train-0.9_fasta-neg.fa
   169522 fasta/0.18-0.4_X-train-0.9_fasta-pos.fa
   417380 fasta/0.18-0.4_X-training_fasta-neg.fa
   188378 fasta/0.18-0.4_X-training_fasta-pos.fa
  2448958 total
   921600 fasta/inf--inf_X-all_fasta.fa
      202 fasta/inf--inf_X-holdout_fasta.fa
    92140 fasta/inf--inf_X-test-0.1_fasta.fa
        0 fasta/inf--inf_X-train-0.9_fasta-neg.fa
   829258 fasta/inf--inf_X-train-0.9_fasta-pos.fa
        0 fasta/inf--inf_X-training_fasta-neg.
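
The sanity checks above rely on each FASTA record occupying exactly two lines (a `>` header and the sequence), so a file holding *n* sequences should contain 2*n* lines. A minimal sketch of that check follows, using a hypothetical `write_fasta` helper (not the `seq2Fasta` from `eugene`, whose internals are not shown here), made-up IDs and a temporary file.

```python
import os
import tempfile


def write_fasta(seqs, ids, path):
    """Write one two-line FASTA record (header, sequence) per sequence."""
    if len(seqs) != len(ids):
        raise ValueError("seqs and ids must have the same length")
    with open(path, "w") as f:
        for seq_id, seq in zip(ids, seqs):
            f.write(">{}\n{}\n".format(seq_id, seq))


def count_lines(path):
    """Equivalent of `wc -l` for a newline-terminated file."""
    with open(path) as f:
        return sum(1 for _ in f)


# Toy data; the IDs and sequences are made up for illustration
seqs = ["CATCT", "TGCTC", "ACGTA"]
ids = ["seq_0", "seq_1", "seq_2"]

with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = os.path.join(tmp_dir, "toy_X-all_fasta.fa")
    write_fasta(seqs, ids, fasta_path)
    n_lines = count_lines(fasta_path)

# Two lines per record, so the count should be twice the number of sequences
assert n_lines == 2 * len(seqs)
print(n_lines)
```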

# Final Checks

In [37]:
!tree -L 2

[38;5;33m.[0m
├── 20210728-3.EnhancerTable.ForAdam.FunctionalEnhancers.WT-detected.ABL-notDetected.10R-20U-0.1P.tsv
├── 2021_OLS_Library_EDA.ipynb
├── 2021_OLS_Library_preprocess.ipynb
├── 2021_OLS_Library.tsv
├── 3.ForAdam.EnhancerTable.FunctionalEnhancers.WT-detected.ABL-notDetected.10R-20U-0.1P.tsv
├── activity
│   ├── 0.09-0.4_y-all_activity.txt
│   ├── 0.09-0.4_y-holdout_activity.txt
│   ├── 0.09-0.4_y-test-0.1_activity.txt
│   ├── 0.09-0.4_y-train-0.9_activity.txt
│   ├── 0.09-0.4_y-training_activity.txt
│   ├── inf--inf_y-all_activity.txt
│   ├── inf--inf_y-holdout_activity.txt
│   ├── inf--inf_y-test-0.1_activity.txt
│   ├── inf--inf_y-train-0.9_activity.txt
│   └── inf--inf_y-training_activity.txt
├── binary
│   ├── 0.09-0.4_y-all_binary.txt
│   ├── 0.09-0.4_y-holdout_binary.txt
│   ├── 0.09-0.4_y-test-0.1_binary.txt
│   ├── 0.09-0.4_y-train-0.9_binary.txt
│   ├── 0.09-0.4_y-training_binary.txt
│   ├── 0.18-0.4_y-all_binary.txt
│   ├── 0.18-0.4_y-

# Scratch

In [121]:
# Get the ohe
X_ohe_seq = ohe_seqs(X_seqs)

# Save the seqs
if not os.path.isdir("ohe_seq"):
    os.makedirs("ohe_seq")
    
np.save("ohe_seq/{}_X-all_ohe-seq".format(PREPROCESS), X_ohe_seq)
np.save("ohe_seq/{}_X-training_ohe-seq".format(PREPROCESS), X_ohe_seq[training_index, :, :])
np.save("ohe_seq/{}_X-train-{}_ohe-seq".format(PREPROCESS, SPLIT), arr=X_ohe_seq[train_index, :, :])
np.save("ohe_seq/{}_X-test-{}_ohe-seq".format(PREPROCESS, round(1-SPLIT, 1)), arr=X_ohe_seq[test_index, :, :])
np.save("ohe_seq/{}_X-holdout_ohe-seq".format(PREPROCESS), X_ohe_seq[holdout_index, :, :])

# Quick check
print(X_seqs[0][:5], X_ohe_seq[0][:5])

!ls -l ohe_seq

100%|██████████| 460800/460800 [05:46<00:00, 1327.99it/s]


Encoded 460800 seqs
Checking 1000 random seqs for proper encoding
Sequence encoding was great success
CATCT [[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]]
total 8309627
-rw-r--r-- 1 aklie carter-users 973209728 Nov 26 22:49 0.09-0.4_X-all_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 555988352 Nov 26 22:49 0.09-0.4_X-holdout_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 555988352 Nov 24 17:45 0.09-0.4_X-holdout_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users  41722688 Nov 26 22:49 0.09-0.4_X-test-0.1_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users  41722688 Nov 24 17:45 0.09-0.4_X-test-0.1_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users 375498944 Nov 26 22:49 0.09-0.4_X-train-0.9_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 375498944 Nov 24 17:45 0.09-0.4_X-train-0.9_ohe-seq-rev.npy
-rw-r--r-- 1 aklie carter-users 417221504 Nov 26 22:49 0.09-0.4_X-training_ohe-seq.npy
-rw-r--r-- 1 aklie carter-users 973209728 Nov 26 14:24 0.18-0.4_X-all_ohe-seq.npy
-rw-r--r-- 1 akli

In [None]:
# Create the encoding
mixed3_encoding = []

# Loop through each enhancer
for i, (row_num, enh_data) in enumerate(tqdm.tqdm(OLS_data[block_features].iterrows())):

    enh_enc = []  # Single enhancer encoding

    # Loop through each position
    for col_num in range(len(enh_data.index)):

        # If we have a spacer in the current position we need to check for surrounding GATA-2 sites
        if "S" in enh_data.iloc[col_num]:

            # If the spacer is the empty spacer, just add a 0 and go to the next
            if enh_data.iloc[col_num] == "S5":
                enh_enc.append(
                    len(siteName2bindingSiteSequence[enh_data.iloc[col_num]])
                )
                continue

            # If the spacer is downstream of a GATA-2 reverse, we need to add a nucleotide to the GATA-2 (subtract one from spacer)
            if col_num > 0:
                if enh_data.iloc[col_num - 1] == "G2R":
                    enh_enc.append(
                        len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]) - 1
                    )
                    continue

            # If the spacer is upstream of a GATA-2 forward, we need to add a nucleotide to the GATA-2 (subtract one from spacer)
            if col_num < len(enh_data.index) - 1:
                if enh_data.iloc[col_num + 1] == "G2F":
                    enh_enc.append(
                        len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]) - 1
                    )
                    continue

            # Finally if no G2 or S5 is involved, just add the normal len of the spacer
            enh_enc.append(len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]))
            continue

        # If we are at a TFBS, add the TFBS type, orientation and affinity
        else:
            tfbs = enh_data.iloc[col_num]
            tf = tfbs[0]
            aff = bindingSiteName2affinities[tfbs[:2]]
            orient = tfbs[2]
            if tf == "E":
                enh_enc += [aff, 0, orient]
            elif tf == "G":
                enh_enc += [0, aff, orient]
    # Poison pill
    # if row_num == 10000:
    #    break
    mixed3_encoding.append(enh_enc + ["-".join(list(enh_data.values))])
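
The loop above shortens a spacer by one nucleotide whenever it directly follows a `G2R` site or directly precedes a `G2F` site, and it never adjusts the `S5` spacer. A toy sketch of just that rule follows. The `spacer_lengths` function is hypothetical, and the `site_lengths` values are made-up placeholders rather than the real `siteName2bindingSiteSequence` contents.

```python
def spacer_lengths(blocks, site_lengths):
    """Return adjusted spacer lengths for one enhancer given as a list of block names.

    Mirrors the rule in the loop above: a spacer loses one nucleotide if the block
    before it is "G2R" or the block after it is "G2F" (at most one subtraction),
    and the "S5" spacer is never adjusted.
    """
    lengths = []
    for pos, block in enumerate(blocks):
        if "S" not in block:
            continue  # TFBS blocks are encoded separately
        length = site_lengths[block]
        after_g2r = pos > 0 and blocks[pos - 1] == "G2R"
        before_g2f = pos < len(blocks) - 1 and blocks[pos + 1] == "G2F"
        if block != "S5" and (after_g2r or before_g2f):
            length -= 1
        lengths.append(length)
    return lengths


# Placeholder spacer lengths, for illustration only
site_lengths = {"S1": 11, "S2": 7, "S3": 4, "S5": 0}

# Spacers alternate with TFBS blocks, as in the block-feature columns
enhancer = ["S1", "G2F", "S2", "G2R", "S3", "E1F", "S5"]
print(spacer_lengths(enhancer, site_lengths))
```

In this toy enhancer, `S1` is shortened because it precedes `G2F`, `S3` is shortened because it follows `G2R`, and `S2` keeps its full length because neither condition applies.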

In [None]:
# Mixed-3.0 header
header = [
    "L1_length",
    "TFBS1_ETS_affinity",
    "TFBS1_GATA_affinity",
    "TFBS1_orient",
    "L2_length",
    "TFBS2_ETS_affinity",
    "TFBS2_GATA_affinity",
    "TFBS2_orient",
    "L3_length",
    "TFBS3_ETS_affinity",
    "TFBS3_GATA_affinity",
    "TFBS3_orient",
    "L4_length",
    "TFBS4_ETS_affinity",
    "TFBS4_GATA_affinity",
    "TFBS4_orient",
    "L5_length",
    "TFBS5_ETS_affinity",
    "TFBS5_GATA_affinity",
    "TFBS5_orient",
    "L6_length",
    "Enhancer"
]
# Save the header for use down the line, one column name per line
os.makedirs("mixed_3.0", exist_ok=True)  # the directory may not exist yet
with open("mixed_3.0/mixed-3.0_header.txt", "w") as f:
    f.write("\n".join(header[:-1]))

In [None]:
# Create a dataframe and replace occurrences of TFs and orientation with numbers
X_mixed3 = pd.DataFrame(mixed3_encoding, columns=header).replace({"R": -1, "F": 1})

# Drop the enhancer column
X_mixed3 = X_mixed3.drop("Enhancer", axis=1).values

# Subset train, test, training (train + test) and holdout
X_mixed3_train = X_mixed3[train_index, :]
X_mixed3_test = X_mixed3[test_index, :]
X_mixed3_train_and_test = X_mixed3[training_index, :]
X_mixed3_holdout = X_mixed3[holdout_index, :]

# Standardize features and double check means are 0 and stds are 1
scale_indeces = np.array([0, 4, 8, 12, 16, 20])  # Mixed 3.0
_, X_mixed3_scaled = project_utils.standardize_features(train_X=X_mixed3_train_and_test, test_X=X_mixed3_train_and_test, indeces=scale_indeces, stats_file="mixed_3.0/{}_X-training_stats.pickle".format(PREPROCESS))
X_mixed3_train_scaled, X_mixed3_test_scaled = project_utils.standardize_features(train_X=X_mixed3_train, test_X=X_mixed3_test, indeces=scale_indeces, stats_file="mixed_3.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT))
X_mixed3_train_scaled, X_mixed3_holdout_scaled = project_utils.standardize_features(train_X=X_mixed3_train, test_X=X_mixed3_holdout, indeces=scale_indeces)

# Check
X_mixed3_train[:, scale_indeces].mean(axis=0), \
X_mixed3_train[:, scale_indeces].std(axis=0), \
X_mixed3_train_scaled[:, scale_indeces].mean(axis=0), \
X_mixed3_train_scaled[:, scale_indeces].std(axis=0), \
X_mixed3_scaled[:, scale_indeces].mean(axis=0), \
X_mixed3_scaled[:, scale_indeces].std(axis=0)
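
The cell above standardizes only the columns listed in `scale_indeces`, with statistics taken from the training rows and then applied to the test or holdout rows. The implementation of `project_utils.standardize_features` is not shown in this notebook, so the following is only a sketch of that general technique under that assumption. The `standardize_columns` helper is hypothetical and written in plain Python.

```python
import statistics


def standardize_columns(train_X, test_X, indices):
    """Z-score the selected columns using means and stds estimated on train_X only.

    Hypothetical helper: returns new row lists and leaves the inputs untouched.
    Columns with zero spread are only mean-centered to avoid dividing by zero.
    """
    train_out = [list(row) for row in train_X]
    test_out = [list(row) for row in test_X]
    for col in indices:
        values = [row[col] for row in train_X]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # population std, matching NumPy's default ddof=0
        scale = std if std > 0 else 1.0
        for row in train_out:
            row[col] = (row[col] - mean) / scale
        for row in test_out:
            row[col] = (row[col] - mean) / scale
    return train_out, test_out


# Toy matrix: column 0 is a linker length (scaled), column 1 is an orientation flag (left alone)
train = [[2.0, 1], [4.0, -1], [6.0, 1]]
test = [[4.0, -1], [8.0, 1]]

train_scaled, test_scaled = standardize_columns(train, test, indices=[0])

scaled_col = [row[0] for row in train_scaled]
print(statistics.fmean(scaled_col), statistics.pstdev(scaled_col))
print(test_scaled)
```

Scaling the test rows with the training statistics, rather than with their own, keeps information about the test set out of the preprocessing step.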

In [None]:
# Save the different files in NumPy binary (.npy) format
if not os.path.isdir("mixed_3.0"):
    os.makedirs("mixed_3.0")  
    
np.save("mixed_3.0/{}_X-all_mixed-3.0".format(PREPROCESS), X_mixed3)
np.save("mixed_3.0/{}_X-training_mixed-3.0".format(PREPROCESS), X_mixed3_scaled)
np.save("mixed_3.0/{}_X-train-{}_mixed-3.0".format(PREPROCESS, SPLIT), X_mixed3_train_scaled)
np.save("mixed_3.0/{}_X-test-{}_mixed-3.0".format(PREPROCESS, round(1-SPLIT, 1)), X_mixed3_test_scaled)
np.save("mixed_3.0/{}_X-holdout_mixed-3.0".format(PREPROCESS), X_mixed3_holdout_scaled)
    
!ls -l mixed_3.0

In [None]:
# Create the encoding
mixed2_encoding = []

# Loop through each enhancer
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(OLS_data[block_features].iterrows())):

    enh_enc = []  # Single enhancer encoding

    # Loop through each position
    for col_num in range(len(enh_data.index)):

        # If we have a spacer in the current position we need to check for surrounding GATA-2 sites
        if "S" in enh_data.iloc[col_num]:

            # If the spacer is the empty spacer, just add a 0 and go to the next
            if enh_data.iloc[col_num] == "S5":
                enh_enc.append(
                    len(siteName2bindingSiteSequence[enh_data.iloc[col_num]])
                )
                continue

            # If the spacer is downstream of a GATA-2 reverse, we need to add a nucleotide to the GATA-2 (subtract one from spacer)
            if col_num > 0:
                if enh_data.iloc[col_num - 1] == "G2R":
                    enh_enc.append(
                        len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]) - 1
                    )
                    continue

            # If the spacer is upstream of a GATA-2 forward, we need to add a nucleotide to the GATA-2 (subtract one from spacer)
            if col_num < len(enh_data.index) - 1:
                if enh_data.iloc[col_num + 1] == "G2F":
                    enh_enc.append(
                        len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]) - 1
                    )
                    continue

            # Finally if no G2 or S5 is involved, just add the normal len of the spacer
            enh_enc.append(len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]))
            continue

        # If we are at a TFBS, add the TFBS type, orientation and affinity
        else:
            tfbs = enh_data.iloc[col_num]
            tf = tfbs[0]
            aff = bindingSiteName2affinities[tfbs[:2]]
            orient = tfbs[2]
            if tf == "E":
                enh_enc += [aff, orient, 0, 0]
            elif tf == "G":
                enh_enc += [0, 0, aff, orient]
    # Poison pill
    # if row_num == 10000:
    #    break
    mixed2_encoding.append(enh_enc + ["-".join(list(enh_data.values))])

In [None]:
# Mixed-2.0 header
header = [
    "L1_length",
    "TFBS1_ETS_affinity",
    "TFBS1_ETS_orient",
    "TFBS1_GATA_affinity",
    "TFBS1_GATA_orient",
    "L2_length",
    "TFBS2_ETS_affinity",
    "TFBS2_ETS_orient",
    "TFBS2_GATA_affinity",
    "TFBS2_GATA_orient",
    "L3_length",
    "TFBS3_ETS_affinity",
    "TFBS3_ETS_orient",
    "TFBS3_GATA_affinity",
    "TFBS3_GATA_orient",
    "L4_length",
    "TFBS4_ETS_affinity",
    "TFBS4_ETS_orient",
    "TFBS4_GATA_affinity",
    "TFBS4_GATA_orient",
    "L5_length",
    "TFBS5_ETS_affinity",
    "TFBS5_ETS_orient",
    "TFBS5_GATA_affinity",
    "TFBS5_GATA_orient",
    "L6_length",
    "Enhancer",
]
# Save the header for use down the line, one column name per line
os.makedirs("mixed_2.0", exist_ok=True)  # the directory may not exist yet
with open("mixed_2.0/mixed-2.0_header.txt", "w") as f:
    f.write("\n".join(header[:-1]))

In [None]:
# Create a dataframe and replace occurrences of TFs and orientation with numbers
X_mixed2 = (
    pd.DataFrame(mixed2_encoding, columns=header)
    .replace({"G": 0, "E": 1, "R": 0, "F": 1})
)

# Drop the enhancer column
X_mixed2 = X_mixed2.drop("Enhancer", axis=1).values

# Subset train, test, training (train + test) and holdout
X_mixed2_train = X_mixed2[train_index, :]
X_mixed2_test = X_mixed2[test_index, :]
X_mixed2_train_and_test = X_mixed2[training_index, :]
X_mixed2_holdout = X_mixed2[holdout_index, :]

# Standardize features and double check means are 0 and stds are 1
scale_indeces = np.array([0, 5, 10, 15, 20, 25])  # Mixed 2.0
_, X_mixed2_scaled = project_utils.standardize_features(train_X=X_mixed2_train_and_test, test_X=X_mixed2_train_and_test, indeces=scale_indeces, stats_file="mixed_2.0/{}_X-training_stats.pickle".format(PREPROCESS))
X_mixed2_train_scaled, X_mixed2_test_scaled = project_utils.standardize_features(train_X=X_mixed2_train, test_X=X_mixed2_test, indeces=scale_indeces, stats_file="mixed_2.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT))
X_mixed2_train_scaled, X_mixed2_holdout_scaled = project_utils.standardize_features(train_X=X_mixed2_train, test_X=X_mixed2_holdout, indeces=scale_indeces)

# Check
X_mixed2_train[:, scale_indeces].mean(axis=0), \
X_mixed2_train[:, scale_indeces].std(axis=0), \
X_mixed2_train_scaled[:, scale_indeces].mean(axis=0), \
X_mixed2_train_scaled[:, scale_indeces].std(axis=0), \
X_mixed2_scaled[:, scale_indeces].mean(axis=0), \
X_mixed2_scaled[:, scale_indeces].std(axis=0)

In [None]:
# Save the different files in NumPy binary (.npy) format
if not os.path.isdir("mixed_2.0"):
    os.makedirs("mixed_2.0")  
    
np.save("mixed_2.0/{}_X-all_mixed-2.0".format(PREPROCESS), X_mixed2)
np.save("mixed_2.0/{}_X-training_mixed-2.0".format(PREPROCESS), X_mixed2_scaled)
np.save("mixed_2.0/{}_X-train-{}_mixed-2.0".format(PREPROCESS, SPLIT), X_mixed2_train_scaled)
np.save("mixed_2.0/{}_X-test-{}_mixed-2.0".format(PREPROCESS, round(1-SPLIT, 1)), X_mixed2_test_scaled)
np.save("mixed_2.0/{}_X-holdout_mixed-2.0".format(PREPROCESS), X_mixed2_holdout_scaled)
    
!ls -l mixed_2.0

In [None]:
# Create the encoding
mixed1_encoding = []

# Loop through each enhancer
for i, (row_num, enh_data) in enumerate(tqdm.tqdm(OLS_data[block_features].iterrows())):

    enh_enc = []  # Single enhancer encoding

    # Loop through each position
    for col_num in range(len(enh_data.index)):

        # If we have a spacer in the current position we need to check for surrounding GATA-2 sites
        if "S" in enh_data.iloc[col_num]:

            # If the spacer is the empty spacer, just add a 0 and go to the next
            if enh_data.iloc[col_num] == "S5":
                enh_enc.append(
                    len(siteName2bindingSiteSequence[enh_data.iloc[col_num]])
                )
                continue

            # If the spacer is downstream of a GATA-2 reverse, we need to add a nucleotide to the GATA-2 (subtract one from spacer)
            if col_num > 0:
                if enh_data.iloc[col_num - 1] == "G2R":
                    enh_enc.append(
                        len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]) - 1
                    )
                    continue

            # If the spacer is upstream of a GATA-2 forward, we need to add a nucleotide to the GATA-2 (subtract one from spacer)
            if col_num < len(enh_data.index) - 1:
                if enh_data.iloc[col_num + 1] == "G2F":
                    enh_enc.append(
                        len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]) - 1
                    )
                    continue

            # Finally if no G2F or S5 is involved, just add the normal len of the spacer
            enh_enc.append(len(siteName2bindingSiteSequence[enh_data.iloc[col_num]]))
            continue

        # If we are at a TFBS, add the TFBS type, orientation and affinity
        else:
            tfbs = enh_data.iloc[col_num]
            tf = tfbs[0]
            aff = bindingSiteName2affinities[tfbs[:2]]
            orient = tfbs[2]
            enh_enc += [tf, orient, aff]
    
    # Poison pill
    #if row_num == 10000:
    #    break
    mixed1_encoding.append(enh_enc + ["-".join(list(enh_data.values))])

In [None]:
# Mixed-1.0 header
header = [
    "L1_length",
    "TFBS1_type",
    "TFBS1_orient",
    "TFBS1_affinity",
    "L2_length",
    "TFBS2_type",
    "TFBS2_orient",
    "TFBS2_affinity",
    "L3_length",
    "TFBS3_type",
    "TFBS3_orient",
    "TFBS3_affinity",
    "L4_length",
    "TFBS4_type",
    "TFBS4_orient",
    "TFBS4_affinity",
    "L5_length",
    "TFBS5_type",
    "TFBS5_orient",
    "TFBS5_affinity",
    "L6_length",
    "Enhancer"
]
# Save the header for use down the line, one column name per line
os.makedirs("mixed_1.0", exist_ok=True)  # the directory may not exist yet
with open("mixed_1.0/mixed-1.0_header.txt", "w") as f:
    f.write("\n".join(header[:-1]))

In [None]:
# Create a dataframe and replace occurrences of TFs and orientation with numbers
X_mixed1 = (
    pd.DataFrame(mixed1_encoding, columns=header)
    .replace({"G": 0, "E": 1, "R": 0, "F": 1})
)

# Drop the enhancer column
X_mixed1 = X_mixed1.drop("Enhancer", axis=1).values

# Subset train, test, training (train + test) and holdout
X_mixed1_train = X_mixed1[train_index, :]
X_mixed1_test = X_mixed1[test_index, :]
X_mixed1_train_and_test = X_mixed1[training_index, :]
X_mixed1_holdout = X_mixed1[holdout_index, :]

# Standardize features and double check means are 0 and stds are 1
scale_indeces = np.array([0, 3, 4, 7, 8, 11, 12, 15, 16, 19, 20])  # Mixed 1.0
_, X_mixed1_scaled = project_utils.standardize_features(train_X=X_mixed1_train_and_test, test_X=X_mixed1_train_and_test, indeces=scale_indeces, stats_file="mixed_1.0/{}_X-training_stats.pickle".format(PREPROCESS))
X_mixed1_train_scaled, X_mixed1_test_scaled = project_utils.standardize_features(train_X=X_mixed1_train, test_X=X_mixed1_test, indeces=scale_indeces, stats_file="mixed_1.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT))
X_mixed1_train_scaled, X_mixed1_holdout_scaled = project_utils.standardize_features(train_X=X_mixed1_train, test_X=X_mixed1_holdout, indeces=scale_indeces)

## Sanity check ordering of current

In [177]:
X1 = np.load("mixed_1.0/0.18-0.4_X_mixed-1.0.npy")
X1_train = np.load("mixed_1.0/0.18-0.4_X_mixed-1.0.npy")
X1_test = np.load("mixed_1.0/0.18-0.4_X_mixed-1.0.npy")

In [178]:
y1 = np.loadtxt("y_binary/0.18-0.4_y-binary.txt", dtype=float, delimiter=" ")
y1_train = np.loadtxt("y_binary/0.18-0.4_y-train-0.9_binary.txt", dtype=float, delimiter=" ")
y1_test = np.loadtxt("y_binary/0.18-0.4_y-test-0.1_binary.txt", dtype=int, delimiter=" ")

In [179]:
X2 = np.loadtxt("mixed_1.0/X_mixed-1.0_0.18-0.4.txt", dtype=float, delimiter=" ")

In [180]:
y2 = y

In [181]:
(y1 == y2).all()

True

In [182]:
(X1 == X2).all()

True

In [183]:
from sklearn.model_selection import train_test_split

In [184]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, train_size=0.9, random_state=13)

## Old mixed encoding 1.0 code

In [226]:
j = 0
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(X_mixed2_df.iterrows())):
    enhancer = enh_data["Enhancer"]
    if "S5-G2F" in enhancer:
        s5_loc = int((np.where(np.array(enhancer.split("-")) == "S5")[0][0] / 2) + 1)
        if s5_loc == 1:
            continue
        linker_to_change_loc = s5_loc - 1
        # print(i, row_num, enhancer, s5_loc, linker_to_change_loc, enh_data["L{}_length".format(linker_to_change_loc)])
        enh_data["L{}_length".format(linker_to_change_loc)] -= 1
        # print(i, row_num, enhancer, s5_loc, linker_to_change_loc, enh_data["L{}_length".format(linker_to_change_loc)])
        j += 1
    elif "G2R-S5" in enhancer:
        s5_loc = int((np.where(np.array(enhancer.split("-")) == "S5")[0][0] / 2) + 1)
        if s5_loc == 6:
            continue
        linker_to_change_loc = s5_loc + 1
        # print(i, row_num, enhancer, s5_loc, linker_to_change_loc, enh_data["L{}_length".format(linker_to_change_loc)])
        enh_data["L{}_length".format(linker_to_change_loc)] -= 1
        # print(i, row_num, enhancer, s5_loc, linker_to_change_loc, enh_data["L{}_length".format(linker_to_change_loc)])
        j += 1
    # if j==100:
    #    break

2400it [00:00, 7178.25it/s]


## Different thresholds and methods for dealing with class imbalance

**Dealing with class imbalance**

In [37]:
from sklearn.utils import resample

In [39]:
pos_mask = y_train == 1

In [42]:
neg_y = y_train[~pos_mask]
pos_y = y_train[pos_mask]

neg_X = X_train[~pos_mask, :]
pos_X = X_train[pos_mask, :]

In [43]:
print(np.unique(pos_y, return_counts=True))
print(np.unique(neg_y, return_counts=True))

(array([1]), array([65561]))
(array([0]), array([349159]))


**Downsample negative class**

In [86]:
# Sample without replacement so that no negative example is drawn more than once
downsampled_neg_X, downsampled_neg_y = resample(
    neg_X, neg_y, replace=False, n_samples=len(pos_y), random_state=13
)

In [87]:
X_train_down = np.concatenate([downsampled_neg_X, pos_X])
y_train_down = np.concatenate([downsampled_neg_y, pos_y])

In [88]:
X_train_down.shape, y_train_down.shape

((131122, 75), (131122,))

In [89]:
np.unique(y_train_down, return_counts=True)[1] / len(y_train_down)

array([0.5, 0.5])
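
The cells above balance the classes by keeping every positive and drawing an equal number of negatives. A sketch of the same idea with the standard library only follows, sampling without replacement so that no negative is drawn twice. The `downsample_negatives` function and the toy labels are hypothetical, not part of the notebook's pipeline.

```python
import random


def downsample_negatives(X, y, seed=13):
    """Keep all positives and an equally sized random subset of negatives (no replacement)."""
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label != 1]
    if len(neg_idx) < len(pos_idx):
        raise ValueError("fewer negatives than positives; nothing to downsample")
    rng = random.Random(seed)  # local generator so the global random state is untouched
    kept_neg = rng.sample(neg_idx, k=len(pos_idx))
    keep = kept_neg + pos_idx
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 2 positives, 6 negatives; the first feature doubles as the row index
y = [1, 0, 0, 1, 0, 0, 0, 0]
X = [[i, i * 10] for i in range(len(y))]

X_down, y_down = downsample_negatives(X, y)
print(len(y_down), sum(y_down) / len(y_down))
```

Downsampling discards most of the negative class, so it is worth comparing against alternatives such as class weights before committing to it.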

## Testing mixed encoding code

In [32]:
# # This is obsolete. DO NOT USE!!!
# def correct_s5(x):
#     enhancer = x["Enhancer"]
#     if "S5-G2F" in enhancer:
#         s5_loc = int((np.where(np.array(enhancer.split("-")) == "S5")[0][0] / 2) + 1)
#         if s5_loc == 1:
#             return x
#         linker_to_change_loc = s5_loc - 1
#         x["L{}_length".format(linker_to_change_loc)] -= 1
#         return x
#     elif "G2R-S5" in enhancer:
#         s5_loc = int((np.where(np.array(enhancer.split("-")) == "S5")[0][0] / 2) + 1)
#         if s5_loc == 6:
#             return x
#         linker_to_change_loc = s5_loc + 1
#         x["L{}_length".format(linker_to_change_loc)] -= 1
#         return x
#     else:
#         return x

In [237]:
X_mixed2_df["Enhancer"].str.contains("S5-G2F").sum(), X_mixed2_df[
    "Enhancer"
].str.contains("G2R-S5").sum()

(29952, 23969)

In [241]:
X_mixed2_df_corrected[
    (X_mixed2_df_corrected["L1_length"] == 11)
    & (X_mixed2_df_corrected["TFBS1_GATA_affinity"] != 0.3)
]["L2_length"].value_counts()

0    1437
Name: L2_length, dtype: int64

In [242]:
X_mixed2_df_corrected[
    (X_mixed2_df_corrected["L6_length"] == 0)
    & (X_mixed2_df_corrected["TFBS5_GATA_affinity"] != 0.3)
]["L5_length"].value_counts()

0    6037
Name: L5_length, dtype: int64

In [245]:
X_mixed2_df_corrected = X_mixed2_df_corrected.drop("Enhancer", axis=1).values

## Old one hot encoding code

In [48]:
# Get the reverse complement of each sequence
X_rev_seqs = np.array(["".join({'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}.get(base, base) for base in reversed(seq)) for seq in X_seqs])

In [51]:
# Define encoders
integer_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder(
    categories=[np.array([0, 1, 2, 3])], handle_unknown="ignore"
)

In [52]:
# Example steps for one hot encoding
test = X_seqs[0]
print("{}...".format(test[:5]))
integer_encoded = integer_encoder.fit_transform(list(test))
print("{}...".format(integer_encoded[:5]))
integer_encoded = np.array(integer_encoded).reshape(-1, 1)
one_hot_encoder.fit(integer_encoded)  # convert to one hot
one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
print("{}...".format(one_hot_encoded.toarray()[:5]))

TCCTA...
[3 1 1 3 0]...
[[0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]]...


**One hot encoding of sequences...<u>takes about 2 minutes</u>**

In [53]:
# Applying above for all seqs
# Use a fixed base-to-integer map: refitting LabelEncoder on each sequence would shift
# the integer codes whenever a sequence is missing one of the four bases
base_to_int = {"A": 0, "C": 1, "G": 2, "T": 3}
X_features = []  # will hold one hot encoded sequence
for i, seq in enumerate(tqdm.tqdm(X_seqs)):
    # Unknown bases (e.g. N) map to 4, which the encoder ignores and leaves as an all-zero row
    integer_encoded = np.array([base_to_int.get(bp, 4) for bp in seq]).reshape(-1, 1)
    one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)  # convert to one hot
    X_features.append(one_hot_encoded.toarray())

100%|██████████| 302936/302936 [01:39<00:00, 3030.54it/s]


In [54]:
# Check to make sure it is correct length
len(X_features)

302936

In [55]:
# convert to numpy array
X_ohe_seq = np.array(X_features)

In [56]:
# Sanity check encoding for randomly chosen sequences
expected_rows = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
    "N": [0.0, 0.0, 0.0, 0.0],
}
indeces = np.random.choice(len(X_features), size=len(X_features) // 100)
for j, ind in enumerate(indeces):
    seq = X_seqs[ind]
    one_hot_seq = X_features[ind]
    encoded_correctly = True
    for i, bp in enumerate(seq):
        if bp not in expected_rows:
            print(bp)  # unexpected character
        elif (one_hot_seq[i] != expected_rows[bp]).any():  # a mismatch in any channel is an error
            print("You one hot encoded wrong dummy!")
            print(seq, one_hot_seq)
            encoded_correctly = False
    if encoded_correctly:
        print("Seq #{} encoded correctly".format(j + 1))

Seq #1 encoded correctly
Seq #2 encoded correctly
Seq #3 encoded correctly
Seq #4 encoded correctly
Seq #5 encoded correctly
Seq #6 encoded correctly
Seq #7 encoded correctly
Seq #8 encoded correctly
Seq #9 encoded correctly
Seq #10 encoded correctly
Seq #11 encoded correctly
Seq #12 encoded correctly
Seq #13 encoded correctly
Seq #14 encoded correctly
Seq #15 encoded correctly
Seq #16 encoded correctly
Seq #17 encoded correctly
Seq #18 encoded correctly
Seq #19 encoded correctly
Seq #20 encoded correctly
Seq #21 encoded correctly
Seq #22 encoded correctly
Seq #23 encoded correctly
Seq #24 encoded correctly
Seq #25 encoded correctly
Seq #26 encoded correctly
Seq #27 encoded correctly
Seq #28 encoded correctly
Seq #29 encoded correctly
Seq #30 encoded correctly
Seq #31 encoded correctly
Seq #32 encoded correctly
Seq #33 encoded correctly
Seq #34 encoded correctly
Seq #35 encoded correctly
Seq #36 encoded correctly
Seq #37 encoded correctly
Seq #38 encoded correctly
Seq #39 encoded corre

In [57]:
# Format in a way compatible with PyTorch CNN and RNN
X_ohe_seq = np.transpose(X_ohe_seq, axes=(0, 2, 1))
X_ohe_seq.shape

(302936, 4, 66)

In [58]:
# Save in binary format
np.save("ohe_seq/{}_X_rev-ohe-seq".format(PREPROCESS), X_ohe_seq)
np.save("ohe_seq/{}_X-train-{}_rev-ohe-seq".format(PREPROCESS, SPLIT), arr=X_ohe_seq[train_index, :, :])
np.save("ohe_seq/{}_X-test-{}_rev-ohe-seq".format(PREPROCESS, round(1-SPLIT, 1)), arr=X_ohe_seq[test_index, :, :])

## Old test fasta code

In [75]:
# Negative seqs
X_test_neg = OLS_data.loc[test_index][test_neg_mask]["SEQUENCE"].values
y_test_neg = y_test[test_neg_mask]
id_test_neg = ID[test_index][test_neg_mask]

# Positive seqs
X_test_pos = OLS_data.loc[test_index][~test_neg_mask]["SEQUENCE"].values
y_test_pos = y_test[~test_neg_mask]
id_test_pos = ID[test_index][~test_neg_mask]

In [76]:
# Check
print(X_test_neg.shape, y_test_neg.shape)
print(X_test_pos.shape, y_test_pos.shape)
if (X_test_neg.shape[0] + X_test_pos.shape[0] == X_test.shape[0]):
    print("We good: {}, {}. {}".format(X_test.shape, y_test.shape, ID[test_index].shape))
else:
    print("The game is afoot")

(10395,) (10395,)
(9360,) (9360,)
We good: (19755, 18), (19755,). (19755,)


*Positive test sequences*

In [77]:
with open("fasta/{}_X-test-{}_fasta-pos.fa".format(PREPROCESS, round(1-SPLIT, 1)), "w") as test_file:
    for i in range(len(X_test_pos)):
        test_file.write(">" + id_test_pos[i] + "\n" + X_test_pos[i] + "\n")

*Negative test sequences*

In [80]:
with open("fasta/{}_X-test-{}_fasta-neg.fa".format(PREPROCESS, round(1-SPLIT, 1)), "w") as test_neg_file:
    for i in range(len(X_test_neg)):
        test_neg_file.write(">" + id_test_neg[i] + "\n" + X_test_neg[i] + "\n")

In [88]:
!wc -l fasta/0.09-0.4_X-test-0.1_fasta-neg.fa

20790 fasta/0.09-0.4_X-test-0.1_fasta-neg.fa


In [89]:
len(X_test_neg)*2

20790

In [148]:
# Mask
training_neg_mask = (y_training == 0)

# Negative seqs
X_training_neg = X_seqs[training_index][training_neg_mask]
y_training_neg = y_training[training_neg_mask]
id_training_neg = ID[training_index][training_neg_mask]

# Positive seqs
X_training_pos = X_seqs[training_index][~training_neg_mask]
y_training_pos = y_training[~training_neg_mask]
id_training_pos = ID[training_index][~training_neg_mask]

# Check
print(X_training_neg.shape, y_training_neg.shape)
print(X_training_pos.shape, y_training_pos.shape)
if (X_training_neg.shape[0] + X_training_pos.shape[0] == X_seqs[training_index].shape[0]):
    print("We good: {}, {}, {}".format(X_seqs[training_index].shape, y_training.shape, ID[training_index].shape))
else:
    print("The game is afoot")

(208690,) (208690,)
(94189,) (94189,)
We good: (302879,), (302879,), (302879,)


#### <u> **Train** </u>

In [102]:
# Mask
train_neg_mask = (y_train == 0)

# Negative seqs
X_train_neg = X_seqs[train_index][train_neg_mask]
y_train_neg = y_train[train_neg_mask]
id_train_neg = ID[train_index][train_neg_mask]

# Positive seqs
X_train_pos = X_seqs[train_index][~train_neg_mask]
y_train_pos = y_train[~train_neg_mask]
id_train_pos = ID[train_index][~train_neg_mask]

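# Check: the negative and positive subsets should together cover the training split.
# NOTE: X_train is not assigned in this cell. If it holds a different split or
# encoding than X_seqs[train_index], this comparison fails, as in the recorded output below.
print(X_train_neg.shape, y_train_neg.shape)
print(X_train_pos.shape, y_train_pos.shape)
if (X_train_neg.shape[0] + X_train_pos.shape[0] == X_train.shape[0]):
    print("We good: {}, {}, {}".format(X_train.shape, y_train.shape, ID[train_index].shape))
else:
    print("The game is afoot")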

(187830,) (187830,)
(84761,) (84761,)
The game is afoot


*Positive training sequences*

In [103]:
# Save positive training sequences in FASTA format
with open("fasta/{}_X-train-{}_fasta-pos.fa".format(PREPROCESS, SPLIT), "w") as tr_file:
    for i in range(len(X_train_pos)):
        tr_file.write(">" + id_train_pos[i] + "\n" + X_train_pos[i] + "\n")

# Expected line count: two lines per record
print(len(X_train_pos)*2)

# Double check against the file on disk. The path is hard-coded, so it only points
# at the file written above when PREPROCESS is "0.09-0.4" and SPLIT is 0.9.
!wc -l fasta/0.09-0.4_X-train-0.9_fasta-pos.fa

169522
169658 fasta/0.09-0.4_X-train-0.9_fasta-pos.fa


*Negative training sequences*

In [104]:
# Save negative training sequences in FASTA format
with open("fasta/{}_X-train-{}_fasta-neg.fa".format(PREPROCESS, SPLIT), "w") as tr_neg_file:
    for i in range(len(X_train_neg)):
        tr_neg_file.write(">" + id_train_neg[i] + "\n" + X_train_neg[i] + "\n")

# Expected line count: two lines per record
print(len(X_train_neg)*2)

# Double check against the file on disk. The path is hard-coded, so it only points
# at the file written above when PREPROCESS is "0.09-0.4" and SPLIT is 0.9.
!wc -l fasta/0.09-0.4_X-train-0.9_fasta-neg.fa

375660
185928 fasta/0.09-0.4_X-train-0.9_fasta-neg.fa


#### <u> **Test** </u>

In [105]:
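# Only need to split for training
X_test = X_seqs[test_index]
id_test = ID[test_index]

# Write the full test set as two-line FASTA records
with open("fasta/{}_X-test-{}_fasta.fa".format(PREPROCESS, round(1-SPLIT, 1)), "w") as test_file:
    for i in range(len(X_test)):
        test_file.write(">" + id_test[i] + "\n" + X_test[i] + "\n")

# The path is hard-coded, so it only points at the file written above
# when PREPROCESS is "0.09-0.4" and SPLIT is 0.9.
!wc -l fasta/0.09-0.4_X-test-0.1_fasta.fa

# Expected line count: two lines per record
len(X_test)*2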

39510 fasta/0.09-0.4_X-test-0.1_fasta.fa


60576

#### <u> **Hold-out** </u>

In [106]:
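# Only need to split for training
X_holdout = X_seqs[holdout_index]
id_holdout = ID[holdout_index]

# Write the full hold-out set as two-line FASTA records
with open("fasta/{}_X-holdout_fasta.fa".format(PREPROCESS), "w") as holdout_file:
    for i in range(len(X_holdout)):
        holdout_file.write(">" + id_holdout[i] + "\n" + X_holdout[i] + "\n")

# NOTE: this hard-coded path contains "-0.1", but the file above is written as
# "{PREPROCESS}_X-holdout_fasta.fa", so wc cannot find it (see the error below).
!wc -l fasta/0.09-0.4_X-holdout-0.1_fasta.fa

# Expected line count: two lines per record
len(X_holdout)*2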

wc: fasta/0.09-0.4_X-holdout-0.1_fasta.fa: No such file or directory


315842