# 2010 Khoueiry et al CellPress Sequences Preprocessing Notebook

**Authorship:**
Adam Klie
*11/09/2021*
***
**Description:**
This data comes from genomic sequences of *Ciona intestinalis* and *Ciona savignyi* that were tested in a 2010 Cell Press paper by Khoueiry et al. Briefly, these sequences were identified by scanning the genomes for clusters of two ETS and two GATA sites using a phylogenetic footprinting-based algorithm named Search for Evolutionary COnserved MODules (SECOMOD). The algorithm searches *Ciona intestinalis* for clusters of binding sites, given a number of sites and a cluster size specified by the user. It then identifies the orthologous region in *Ciona savignyi* and checks whether that region also contains the required number of sites within the required size. The authors first scanned for GATA (GATA) and MGGAAR (ETS) sites within an 80 bp window, which yielded 9 clusters. They then expanded the search window to 130 bp and relaxed the ETS consensus to HGGAWR, which yielded 46 additional clusters. They also applied an intersite distance threshold of 5 bp and an overall sequence conservation threshold of 40%.

The data was downloaded from Table S2 as an Excel spreadsheet and processed from there.
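
To make the site definitions above concrete, here is a minimal sketch of counting IUPAC consensus matches on both strands of a sequence. It is not SECOMOD itself (there is no windowing, orthology lookup, or conservation check), and the helper names `iupac_to_regex`, `revcomp`, and `count_sites` are hypothetical.

```python
import re

# IUPAC nucleotide codes needed for the GATA, MGGAAR and HGGAWR consensus strings
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "H": "[ACT]",
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def iupac_to_regex(motif):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[base] for base in motif)


def revcomp(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_sites(seq, motif):
    """Count matches of `motif` on both strands, allowing overlapping hits."""
    # A lookahead makes findall report overlapping matches as well
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    seq = seq.upper()
    return len(pattern.findall(seq)) + len(pattern.findall(revcomp(seq)))


print(count_sites("TGGATA", "MGGAAR"))  # strict ETS consensus
print(count_sites("TGGATA", "HGGAWR"))  # relaxed ETS consensus
```

The toy sequence `TGGATA` fails the strict MGGAAR consensus but matches the relaxed HGGAWR one, which illustrates why relaxing the constraint recovers additional clusters.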
***

**TODOs:**
 - <font color='green'> Load in and preprocess excel sheet (Table S2) from downloaded supplement </font>
 - <font color='green'> Preprocess into FASTA, labels and seq ids</font>
 - <font color='green'> Preprocess into ohe </font>
 - <font color='green'> Clean up notebook </font>
***

# Set-up

In [137]:
# Classics
import os
import numpy as np
import pandas as pd
import tqdm
import pickle

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

import sys
sys.path.append("/cellar/users/aklie/projects/EUGENE/bin/")
import project_utils
import otx_enhancer_utils

In [138]:
# Training stats for mixed encodings
PREPROCESS = "0.09-0.4"  # String defining the preprocessing for saving
SPLIT = 0.9  # Split into training and test sets

# Load data to preprocess

## Create full sequence dataframe

In [139]:
# Load the excel spreadsheet with the sequences and functional annotation
dataset = pd.read_excel("1-s2.0-S096098221000432X-mmc2.xls", sheet_name="Table S2", skiprows=2)
dataset = dataset[~dataset["Enhancer Activity"].isna()]
dataset["SEQ"] = dataset["Cloned and Tested C. i. Sequence"]
#dataset["SEQ"] = dataset["Cloned and Tested C. i. Sequence"].apply(lambda x: x.replace("ACCCAACTTT", ""))
#dataset["SEQ"] = dataset["SEQ"].apply(lambda x: x.replace("AAAGTAGGCT", ""))
dataset["FXN_LABEL"] = (dataset["Enhancer Activity"] != "none").astype(int)
dataset["SEQ_LEN"] = dataset["SEQ"].apply(len)
dataset["NAME"] = dataset["Position"] + ":" + dataset["Start"].astype(str) + ":" + dataset["End"].astype(str)
dataset["TILE"] = "full"
dataset = dataset[["NAME", "SEQ", "FXN_LABEL", "TILE", "SEQ_LEN"]]
dataset.head(1)

|   | NAME | SEQ | FXN_LABEL | TILE | SEQ_LEN |
|---|------|-----|-----------|------|---------|
| 0 | scaffold_1:462149:462232 | AAAGTAGGCTATATGCTACAGCCCAGAGCTCATGGATTTTAATGGG... | 1 | full | 195 |


In [140]:
# Save non-tiled dataframe in standard format
dataset.to_csv("2010_Khoueiry_CellPress.tsv", index=False, sep="\t")

# Double check valid labeling
dataset["FXN_LABEL"].value_counts()

0    12
1     8
Name: FXN_LABEL, dtype: int64

## Create tiled dataframe

In [141]:
# Find and concatenate tiles
tile_seqs = []
for i, enh in dataset.iterrows():
    seq = enh["SEQ"]
    print(seq, enh["SEQ_LEN"])
    if len(seq) > 66:
        for j in range(len(seq)-66+1):
            tile = seq[j:j+66]
            tile_seqs.append([enh["NAME"], tile, enh["FXN_LABEL"], j, 66])
    else:
        print("too short:", len(seq))
dataset_tiled = pd.concat([dataset, pd.DataFrame(data=tile_seqs, columns=dataset.columns)]).sort_values(["NAME", "TILE"]).reset_index()

# Save tiled dataframe in standard format
dataset_tiled = dataset_tiled.drop('index', axis=1)
dataset_tiled.to_csv("2010_Khoueiry_CellPress-tiled.tsv", index=False, sep="\t")

AAAGTAGGCTATATGCTACAGCCCAGAGCTCATGGATTTTAATGGGATCGGCTATCTAGGCCGACCCTCGCTCTCCCAAGGAAATGTCCACCTTCCAGCCGGGAAAAGATAACCGCTCGCCAGAGCGACGCTTTCCGGCTGACAAATTGTGTCGGACCTTGATAGCATTCCTGTTCCCTATCGGACCCAACTTT  195
AAAGTAGGCTATCGCGACCACTGATCGTCGCGTAATTATTTTGAGGCAACATATCGATCAGCGATCTCGGCAACGATAGCgAaATtctccCTCAGCTTTCTCGGAAGCGCTCGCTGCTGGTTATCCGGAAAAGTGcggAACTTCCCTCGCTCTCCAGCAGCAacGACTCGGGAACCCAACTTT 183
AAAGTAGGCTTTGTATAATTCCCAATTAGAACACACAGGTAGGGTAATACATTAGAATGTCTTTGATTAGACTAGCAAGATAGGCACCTAATTTCCTATGTTGGTACAACTGGTTTTAGAAAATTATAAAAATTTCCTGCTATAATCGGTTAGAGCAAGGAACGGTGATTACCCAACTTT 180
AAAGTAGGCTTTAACATGGAATCTATTCCCCGTGACGATGAGATAAGGACAAGCGGAAACCGACCAACGCGTATTTCGAGAACTATTTATCTCGACTGATTCCTGTTCCTTCCTTTCCTTCCGCCCGCAGCTTACCGGTAAACAACCCAACTTT 154
AAAGTAGGCTTCTTGCTTTGTTGTTTTTGTTTTGAAACTGGTGAGCAGCACCGGAAATTGGTCTCAGTTCTCTAGCCAGTCTACTTCCTGCTGTTACTGCTAAACGATAACATTTCTCTTGTACATGATATCATTCTATGGTTCATATAAGTTGAAACACAATATGCATTACATTTACCCAACTTT 186
AAAGTAGGCTCTTGGTATTTGTACCTGACGCGTGCTACGCCTTCCTTTCGTAACACAATGACTTAATTATCTTCTTT
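
The tiling step above can be sketched as a small function, together with an arithmetic check of the expected row count. This is an illustrative sketch: `tile_sequence` is a hypothetical helper, and unlike the cell above it would also emit a single tile for a sequence of exactly 66 bp (none occur in this dataset). The lengths are copied from the `value_counts()` length check further down in this notebook.

```python
def tile_sequence(seq, width=66):
    """Return (start, tile) pairs for every window of `width` bp, stepping 1 bp."""
    # range() is empty when the sequence is shorter than the window
    return [(start, seq[start:start + width])
            for start in range(len(seq) - width + 1)]


# Sequence lengths as reported by the length check in this notebook (20 sequences)
lengths = [141] * 3 + [160] * 2 + [55, 195, 228, 167, 139, 140, 186, 226,
                                   150, 180, 149, 129, 183, 154, 223]

# Each sequence of length L >= 66 contributes L - 66 + 1 tiles
n_tiles = sum(max(length - 66 + 1, 0) for length in lengths)
print(n_tiles, n_tiles + len(lengths))
```

The second number printed is the tile count plus the 20 full-length rows, which should agree with the 1977 lines reported by `wc -l` for the tiled files.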

# Save labels, IDs, and Seqs

### **Binary Labels**
0 (non-functional) and 1 (functional). Every tile carries the same label as the sequence it was cut from.

In [142]:
# Save the labels
if not os.path.isdir("binary"):
    os.makedirs("binary")

y = dataset["FXN_LABEL"].values
y_tiled = dataset_tiled["FXN_LABEL"].values
np.savetxt("binary/y_binary.txt", X=y, fmt="%d")
np.savetxt("binary/y-tiled_binary.txt", X=y_tiled, fmt="%d")

!wc -l binary/*

  20 binary/y_binary.txt
1977 binary/y-tiled_binary.txt
1997 total


### **Identifiers**
A name that identifies each sequence, built as `Position:Start:End`; tiled entries append `:TILE`.

In [158]:
# Save the ids
if not os.path.isdir("id"):
    os.makedirs("id")
    
ID = dataset["NAME"].values
ID_tiled = (dataset_tiled["NAME"] + ":" + dataset_tiled["TILE"].astype(str)).values
np.savetxt("id/id.txt", X=ID, fmt="%s")
np.savetxt("id/id-tiled.txt", X=ID_tiled, fmt="%s")

!wc -l id/*

 1977 id/id-tiled.txt
   20 id/id.txt
 1997 total


### **Sequences**
ACGT...

In [144]:
# Save the seqs
if not os.path.isdir("seqs"):
    os.makedirs("seqs")
    
X_seqs = dataset["SEQ"].values
X_seqs_tiled = dataset_tiled["SEQ"].values
np.savetxt("seqs/seqs.txt", X=X_seqs, fmt="%s")
np.savetxt("seqs/seqs-tiled.txt", X=X_seqs_tiled, fmt="%s")

!wc -l seqs/*

  1977 seqs/seqs-tiled.txt
    20 seqs/seqs.txt
  1997 total
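
Since labels, identifiers, and sequences are saved to separate text files, they only stay usable if they remain line-aligned. The `wc -l` outputs above check this by eye; a minimal programmatic version might look like the sketch below. The helper names `count_lines` and `check_aligned` are hypothetical, and the demonstration uses throwaway files rather than the real outputs.

```python
import os
import tempfile


def count_lines(path):
    """Number of lines in a text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def check_aligned(paths):
    """Return the shared line count, or raise if the files disagree."""
    counts = {path: count_lines(path) for path in paths}
    if len(set(counts.values())) != 1:
        raise ValueError(f"line counts differ: {counts}")
    return next(iter(counts.values()))


# Toy demonstration: throwaway files standing in for the label, ID and sequence files
with tempfile.TemporaryDirectory() as tmp:
    contents = {"y.txt": "1\n0\n", "id.txt": "a:1:2\nb:3:4\n", "seqs.txt": "ACGT\nTTGA\n"}
    paths = []
    for name, text in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write(text)
        paths.append(path)
    n_records = check_aligned(paths)

print(n_records)
```

In this notebook the same call could be pointed at `binary/y-tiled_binary.txt`, `id/id-tiled.txt`, and `seqs/seqs-tiled.txt`.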


# Preprocess and save different feature sets

## **<u>Sequence feature idea 2 </u>**: Mixed encodings 

### **Generate all valid encodings**

In [160]:
X_mixed1s, X_mixed2s, X_mixed3s, valid_idxs = otx_enhancer_utils.mixed_encode(dataset)
X_mixed1s.shape, X_mixed2s.shape, X_mixed3s.shape, len(valid_idxs)

20it [00:00, 185.69it/s]


((6, 21), (6, 26), (6, 21), 6)

In [None]:
X_mixed1s[0], X_mixed2s[0], X_mixed3s[0]

In [162]:
X_mixed1s_tiled, X_mixed2s_tiled, X_mixed3s_tiled, valid_idxs_tiled = otx_enhancer_utils.mixed_encode(dataset_tiled)
X_mixed1s_tiled.shape, X_mixed2s_tiled.shape, X_mixed3s_tiled.shape, len(valid_idxs_tiled)

1977it [00:10, 182.42it/s]


((211, 21), (211, 26), (211, 21), 211)

### *Mixed 1.0*
 - Replace binding sites using dictionary
 - Separate based on these binding sites and create "dummy variables"
 - Get lengths of linkers around binding sites
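
The layout described above can be sketched as follows, based on the scratch code at the end of this notebook. The `Site` container and its field names are hypothetical (the real `defineTFBS` output is indexed by position), and the site values are toy numbers.

```python
from collections import namedtuple

# Hypothetical container; field meanings are inferred from the scratch code
Site = namedtuple("Site", ["kind", "orientation", "affinity", "spacer"])


def encode_mixed1(sites, trailing_spacer):
    """Flatten binding sites into [spacer, identity, orientation, affinity] blocks."""
    features = []
    for site in sites:
        features += [
            site.spacer,                          # linker length before the site
            1 if site.kind == "ETS" else 0,       # identity dummy: ETS=1, GATA=0
            1 if site.orientation == "F" else 0,  # orientation dummy: forward=1, reverse=0
            site.affinity,
        ]
    features.append(trailing_spacer)              # linker length after the last site
    return features


# Toy enhancer with five sites (values are made up for illustration)
toy_sites = [
    Site("ETS", "F", 0.12, 31),
    Site("ETS", "F", 0.11, 3),
    Site("GATA", "R", 0.54, 0),
    Site("ETS", "F", 0.39, 19),
    Site("ETS", "R", 0.10, 5),
]
encoded = encode_mixed1(toy_sites, trailing_spacer=10)
print(len(encoded), encoded[:4])
```

With five sites this gives 5 x 4 + 1 = 21 features, which matches the second dimension of `X_mixed1s` reported above.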

In [178]:
# Load in training stats
with open("../2021_OLS_Library/mixed_1.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)
scale_indeces = train_stats["indeces"]

# Full sequences
X_mixed1s[:, scale_indeces] -= train_stats["means"]
X_mixed1s[:, scale_indeces] /= train_stats["stds"]

# Tiled sequences
X_mixed1s_tiled[:, scale_indeces] -= train_stats["means"]
X_mixed1s_tiled[:, scale_indeces] /= train_stats["stds"]
valid_tiled_names = dataset_tiled.loc[valid_idxs_tiled]["NAME"] + ":" + dataset_tiled.loc[valid_idxs_tiled]["TILE"].astype(str)

# Save the vals
if not os.path.isdir("mixed_1.0"):
    os.makedirs("mixed_1.0")
    
np.save("mixed_1.0/{}_{}-split_X-test_mixed-1.0".format(PREPROCESS, SPLIT), X_mixed1s)
np.savetxt("mixed_1.0/id-valid.txt", X=ID[valid_idxs], fmt="%s")
np.save("mixed_1.0/{}_{}-split_X-test_mixed-1.0-tiled".format(PREPROCESS, SPLIT), X_mixed1s_tiled)
np.savetxt("mixed_1.0/id-valid-tiled.txt", X=valid_tiled_names, fmt="%s")

!wc -l mixed_1.0/*

    5 mixed_1.0/0.09-0.4_0.9-split_X-test_mixed-1.0.npy
   52 mixed_1.0/0.09-0.4_0.9-split_X-test_mixed-1.0-tiled.npy
    2 mixed_1.0/0.18-0.4_0.9-split_X-test_mixed-1.0.npy
  150 mixed_1.0/0.18-0.4_0.9-split_X-test_mixed-1.0-tiled.npy
  211 mixed_1.0/id-valid-tiled.txt
    6 mixed_1.0/id-valid.txt
  426 total
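
The scaling step in the cell above applies z-scoring with statistics computed on the training set, so that these held-out sequences are put on the same scale the model saw during training. A minimal sketch, assuming the pickle holds column indices plus per-column means and standard deviations (as the keys `indeces`, `means`, and `stds` suggest); `zscore_with_train_stats` is a hypothetical helper and the numbers are toy values.

```python
import numpy as np


def zscore_with_train_stats(X, columns, means, stds):
    """Z-score the selected columns of X using precomputed training statistics."""
    # Copy as float so the caller's array is untouched; this also avoids the
    # casting error NumPy raises for in-place float arithmetic on integer arrays
    X = np.array(X, dtype=float)
    X[:, columns] = (X[:, columns] - np.asarray(means)) / np.asarray(stds)
    return X


# Toy data: columns 0 and 2 are continuous, column 1 is a 0/1 dummy left unscaled
X_toy = [[10, 1, 0.5],
         [20, 0, 0.7]]
X_scaled = zscore_with_train_stats(X_toy, columns=[0, 2], means=[15, 0.6], stds=[5, 0.1])
print(X_scaled)
```

Returning a scaled copy rather than modifying in place also makes the cell safe to re-run, since repeated in-place scaling would shift the features again each time.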


### *Mixed 2.0*
 - Replace binding sites using dictionary
 - 4-element vector for each binding site [ets_affinity ets_orientation gata_affinity gata_orientation], which ties each site's identity to its affinity
 - Get lengths of linkers around binding sites

In [179]:
# Load in training stats
with open("../2021_OLS_Library/mixed_2.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)
scale_indeces = train_stats["indeces"]

# Full sequences
X_mixed2s[:, scale_indeces] -= train_stats["means"]
X_mixed2s[:, scale_indeces] /= train_stats["stds"]

# Tiled sequences
X_mixed2s_tiled[:, scale_indeces] -= train_stats["means"]
X_mixed2s_tiled[:, scale_indeces] /= train_stats["stds"]
valid_tiled_names = dataset_tiled.loc[valid_idxs_tiled]["NAME"] + ":" + dataset_tiled.loc[valid_idxs_tiled]["TILE"].astype(str)

# Save the vals
if not os.path.isdir("mixed_2.0"):
    os.makedirs("mixed_2.0")
    
np.save("mixed_2.0/{}_{}-split_X-test_mixed-2.0".format(PREPROCESS, SPLIT), X_mixed2s)
np.savetxt("mixed_2.0/id-valid.txt", X=ID[valid_idxs], fmt="%s")
np.save("mixed_2.0/{}_{}-split_X-test_mixed-2.0-tiled".format(PREPROCESS, SPLIT), X_mixed2s_tiled)
np.savetxt("mixed_2.0/id-valid-tiled.txt", X=valid_tiled_names, fmt="%s")

!wc -l mixed_2.0/*

    3 mixed_2.0/0.09-0.4_0.9-split_X-test_mixed-2.0.npy
   54 mixed_2.0/0.09-0.4_0.9-split_X-test_mixed-2.0-tiled.npy
    1 mixed_2.0/0.18-0.4_0.9-split_X-test_mixed-2.0.npy
  122 mixed_2.0/0.18-0.4_0.9-split_X-test_mixed-2.0-tiled.npy
  211 mixed_2.0/id-valid-tiled.txt
    6 mixed_2.0/id-valid.txt
  397 total


### *Mixed 3.0*
 - Replace binding sites using dictionary
 - 3-element vector for each binding site [ets_affinity gata_affinity orientation], which ties each site's identity to its affinity while removing redundant information from mixed-2.0
 - Get lengths of linkers around binding sites
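
For comparison, the per-site vectors of mixed 2.0 and mixed 3.0 can be sketched side by side. This follows the element order listed in the bullets and the orientation codes used in the scratch code at the end of this notebook (F/R mapped to 1/0 for 2.0 and to 1/-1 for 3.0); the function names are hypothetical.

```python
def site_vector_mixed2(kind, orientation, affinity):
    """[ets_affinity, ets_orientation, gata_affinity, gata_orientation]."""
    strand = 1 if orientation == "F" else 0
    if kind == "ETS":
        return [affinity, strand, 0, 0]
    return [0, 0, affinity, strand]


def site_vector_mixed3(kind, orientation, affinity):
    """[ets_affinity, gata_affinity, orientation]; orientation is shared by both types."""
    strand = 1 if orientation == "F" else -1
    if kind == "ETS":
        return [affinity, 0, strand]
    return [0, affinity, strand]


print(site_vector_mixed2("GATA", "R", 0.54), site_vector_mixed3("GATA", "R", 0.54))

# Feature counts for five sites: one spacer per site, the site vector, one trailing spacer
n_sites = 5
print(n_sites * (1 + 4) + 1, n_sites * (1 + 3) + 1)
```

The two counts correspond to the 26 and 21 columns reported for `X_mixed2s` and `X_mixed3s` above.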

In [180]:
# Load in training stats
with open("../2021_OLS_Library/mixed_3.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)
scale_indeces = train_stats["indeces"]

# Full sequences
X_mixed3s[:, scale_indeces] -= train_stats["means"]
X_mixed3s[:, scale_indeces] /= train_stats["stds"]

# Tiled sequences
X_mixed3s_tiled[:, scale_indeces] -= train_stats["means"]
X_mixed3s_tiled[:, scale_indeces] /= train_stats["stds"]
valid_tiled_names = dataset_tiled.loc[valid_idxs_tiled]["NAME"] + ":" + dataset_tiled.loc[valid_idxs_tiled]["TILE"].astype(str)

# Save the vals
if not os.path.isdir("mixed_3.0"):
    os.makedirs("mixed_3.0")
    
np.save("mixed_3.0/{}_{}-split_X-test_mixed-3.0".format(PREPROCESS, SPLIT), X_mixed3s)
np.savetxt("mixed_3.0/id-valid.txt", X=ID[valid_idxs], fmt="%s")
np.save("mixed_3.0/{}_{}-split_X-test_mixed-3.0-tiled".format(PREPROCESS, SPLIT), X_mixed3s_tiled)
np.savetxt("mixed_3.0/id-valid-tiled.txt", X=valid_tiled_names, fmt="%s")

!wc -l mixed_3.0/*

    3 mixed_3.0/0.09-0.4_0.9-split_X-test_mixed-3.0.npy
   54 mixed_3.0/0.09-0.4_0.9-split_X-test_mixed-3.0-tiled.npy
    1 mixed_3.0/0.18-0.4_0.9-split_X-test_mixed-3.0.npy
  122 mixed_3.0/0.18-0.4_0.9-split_X-test_mixed-3.0-tiled.npy
  211 mixed_3.0/id-valid-tiled.txt
    6 mixed_3.0/id-valid.txt
  397 total


## **<u>Sequence feature idea 3 </u>**: Use the actual sequence (one-hot encoded)
 - One hot encoded sequence: each position is encoded as a 1-D vector of size 4 e.g., AT is [[1,0,0,0], [0,0,0,1]]
 - Generally, we will get inputs of size (len(seq) X 4). The above example would be of size 2x4
 - Can also save the string seqs in case those are also useful down the line
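
As a minimal sketch of the encoding described above, the function below maps each base to a row in A, C, G, T column order. It is not the implementation in `project_utils.ohe_seqs`; in particular, encoding any non-ACGT character as an all-zero row is an assumption made here for illustration.

```python
import numpy as np

BASES = "ACGT"  # column order of the one-hot vectors


def one_hot(seq):
    """One-hot encode a DNA string into an array of shape (len(seq), 4)."""
    ohe = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:  # unknown characters (e.g. N) stay all-zero: an assumption
            ohe[i, j] = 1.0
    return ohe


print(one_hot("AT"))
```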

**Q** Are all sequences the same length?

In [151]:
# Check the lengths of sequences to make sure they are all the same
dataset["SEQ"].apply(len).value_counts()

141    3
160    2
55     1
195    1
228    1
167    1
139    1
140    1
186    1
226    1
150    1
180    1
149    1
129    1
183    1
154    1
223    1
Name: SEQ, dtype: int64

**Answer**: Nope

### *Forward Encoding*

**Full sequences**

In [152]:
# Get the sequences only
X_seqs = [seq.upper().strip() for seq in dataset["SEQ"].values]
X_ohe_seq = project_utils.ohe_seqs(X_seqs)

# Save the seqs
if not os.path.isdir("ohe_seq"):
    os.makedirs("ohe_seq")
    
# Save in binary format
np.save("ohe_seq/X_ohe-seq", X_ohe_seq)

# Quick check
X_seqs[0][:5], X_ohe_seq[0][:5]

100%|██████████| 20/20 [00:00<00:00, 903.86it/s]

Encoded 20 seqs
Checking all 20 seqs for proper encoding
Sequence encoding was great success



  


('AAAGT',
 array([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.]]))

### *Reverse encoding*

**Full sequences**

In [153]:
# Get the reverse-complement encodings
X_rev_seqs = np.array(["".join({'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}.get(base, base) for base in reversed(seq)) for seq in X_seqs])
X_ohe_rev_seq = project_utils.ohe_seqs(X_rev_seqs)

# Save the seqs
if not os.path.isdir("ohe_seq"):
    os.makedirs("ohe_seq")
    
# Save in binary format
np.save("ohe_seq/X_ohe-seq-rev", X_ohe_rev_seq)

# Quick check
X_rev_seqs[0][:5], X_ohe_rev_seq[0][:5]

100%|██████████| 20/20 [00:00<00:00, 996.02it/s]

Encoded 20 seqs
Checking all 20 seqs for proper encoding
Sequence encoding was great success



  


('AAAGT',
 array([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.]]))
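
The reverse encoding above is a reverse complement. A compact way to write it uses `str.translate`, and with A, C, G, T column order the one-hot matrix of the reverse complement equals the forward matrix flipped along both axes. The sketch below illustrates both points with plain lists; `reverse_complement` and `one_hot_rows` are hypothetical helpers.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement; characters outside A/C/G/T are passed through unchanged."""
    return seq.translate(_COMPLEMENT)[::-1]


def one_hot_rows(seq):
    """One-hot encode as nested lists in A, C, G, T column order."""
    return [[int(base == b) for b in "ACGT"] for base in seq.upper()]


seq = "AAAGT"
rc = reverse_complement(seq)

# Reversing the rows reverses the sequence; reversing the columns swaps A<->T and C<->G
flipped = [row[::-1] for row in one_hot_rows(seq)[::-1]]
print(rc, flipped == one_hot_rows(rc))
```

This equivalence means the reverse encoding could also be derived from the saved forward array without re-encoding the strings, provided the column order is A, C, G, T.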

## **<u>Sequence feature idea 4 </u>**: Use the actual sequence (as fasta)

In [154]:
# Create dir
if not os.path.isdir("fasta"):
    os.makedirs("fasta")

# Save full sequences in fasta format
dataset = dataset.reset_index()
with open("fasta/X_fasta.fa", "w") as file:
    for i, enh in dataset.iterrows():
        file.write(">" + ID[i] + "\n" + X_seqs[i].upper().strip() + "\n")

# Save tiled sequences in fasta format
# ID_tiled already ends in ":<TILE>", so the tile position is not appended again
with open("fasta/X_fasta-tiled.fa", "w") as file:
    for i, enh in dataset_tiled.iterrows():
        file.write(">" + ID_tiled[i] + "\n" + X_seqs_tiled[i].upper().strip() + "\n")

!wc -l fasta/*

    40 fasta/X_fasta.fa
  3954 fasta/X_fasta-tiled.fa
  3994 total
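
The FASTA files written above use two lines per record (a `>` header followed by the whole sequence on one line), which is why the line counts are exactly twice the number of sequences. A minimal round-trip sketch under that two-line assumption, with hypothetical helpers `write_fasta` and `read_fasta` and an in-memory buffer instead of a real file:

```python
import io


def write_fasta(handle, records):
    """Write (name, sequence) pairs as two-line FASTA records."""
    for name, seq in records:
        handle.write(f">{name}\n{seq.upper().strip()}\n")


def read_fasta(handle):
    """Read two-line FASTA records back into (name, sequence) pairs."""
    records, name = [], None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
        elif line and name is not None:
            records.append((name, line))
    return records


# Toy records: a full-length style ID and a tiled style ID
records = [("scaffold_1:462149:462232", "aaagtaggct"), ("Scaffold_31:173924:174038:0", "AAAGTAGGCT")]
buffer = io.StringIO()
write_fasta(buffer, records)
buffer.seek(0)
print(read_fasta(buffer))
```

Reading the file back and comparing the names against `id/id.txt` or `id/id-tiled.txt` is one way to confirm the headers match the saved identifiers.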


# Final Checks

In [155]:
!tree -L 2

.
├── 1-s2.0-S096098221000432X-mmc2.xls
├── 2010_Khoueiry_CellPress.ipynb
├── 2021_Khoueiry_CellPress-tiled.tsv
├── 2021_Khoueiry_CellPress.tsv
├── binary
│   ├── y_binary.txt
│   └── y-tiled_binary.txt
├── fasta
│   ├── X_fasta.fa
│   └── X_fasta-tiled.fa
├── id
│   ├── id-tiled.txt
│   └── id.txt
├── mixed_1.0
│   ├── 0.09-0.4_0.9-split_X-test_mixed-1.0.npy
│   ├── 0.09-0.4_0.9-split_X-test_mixed-1.0-tiled.npy
│   ├── 0.18-0.4_0.9-split_X-test_mixed-1.0.npy
│   ├── 0.18-0.4_0.9-split_X-test_mixed-1.0-tiled.npy
│   ├── valid_id-tiled.txt
│   └── valid_id.txt
├── mixed_2.0
│   ├── 0.09-0.4_0.9-split_X-test_mixed-2.0.npy
│   ├── 0.09-0.4_0.9-split_X-test_mixed-2.0-tiled.npy
│   ├── 0.18-0.4_0.9-split_X-test_mixed-2.0.npy
│   ├── 0.18-0.4_0.9-split_X-test_mixed-2.0-tiled.npy
│   ├── valid_id-tiled.txt
│   └── valid_id.txt
├── mixed_3.0
│   ├── 0.09-0.4_0.9-split_X-test_mixed-3.0.npy
│   ├── 

In [159]:
%%bash
head fasta/X_fasta-tiled.fa
head -n 5 id/id-tiled.txt

>Scaffold_31:173924:174038:0
AAAGTAGGCTATACAACGCGTGTCGAATCTGAATGTTCAATTCAATAACTGACACAATTTAACAAG
>Scaffold_31:173924:174038:1
AAGTAGGCTATACAACGCGTGTCGAATCTGAATGTTCAATTCAATAACTGACACAATTTAACAAGC
>Scaffold_31:173924:174038:2
AGTAGGCTATACAACGCGTGTCGAATCTGAATGTTCAATTCAATAACTGACACAATTTAACAAGCA
>Scaffold_31:173924:174038:3
GTAGGCTATACAACGCGTGTCGAATCTGAATGTTCAATTCAATAACTGACACAATTTAACAAGCAA
>Scaffold_31:173924:174038:4
TAGGCTATACAACGCGTGTCGAATCTGAATGTTCAATTCAATAACTGACACAATTTAACAAGCAAT
Scaffold_31:173924:174038:0
Scaffold_31:173924:174038:1
Scaffold_31:173924:174038:2
Scaffold_31:173924:174038:3
Scaffold_31:173924:174038:4


In [157]:
%%bash
head fasta/X_fasta.fa
head -n 5 id/id.txt

>scaffold_1:462149:462232
AAAGTAGGCTATATGCTACAGCCCAGAGCTCATGGATTTTAATGGGATCGGCTATCTAGGCCGACCCTCGCTCTCCCAAGGAAATGTCCACCTTCCAGCCGGGAAAAGATAACCGCTCGCCAGAGCGACGCTTTCCGGCTGACAAATTGTGTCGGACCTTGATAGCATTCCTGTTCCCTATCGGACCCAACTTT
>scaffold_102:102675:102754
AAAGTAGGCTATCGCGACCACTGATCGTCGCGTAATTATTTTGAGGCAACATATCGATCAGCGATCTCGGCAACGATAGCGAAATTCTCCCTCAGCTTTCTCGGAAGCGCTCGCTGCTGGTTATCCGGAAAAGTGCGGAACTTCCCTCGCTCTCCAGCAGCAACGACTCGGGAACCCAACTTT
>scaffold_3:475040:475120
AAAGTAGGCTTTGTATAATTCCCAATTAGAACACACAGGTAGGGTAATACATTAGAATGTCTTTGATTAGACTAGCAAGATAGGCACCTAATTTCCTATGTTGGTACAACTGGTTTTAGAAAATTATAAAAATTTCCTGCTATAATCGGTTAGAGCAAGGAACGGTGATTACCCAACTTT
>scaffold_3:781073:781145
AAAGTAGGCTTTAACATGGAATCTATTCCCCGTGACGATGAGATAAGGACAAGCGGAAACCGACCAACGCGTATTTCGAGAACTATTTATCTCGACTGATTCCTGTTCCTTCCTTTCCTTCCGCCCGCAGCTTACCGGTAAACAACCCAACTTT
>scaffold_357:46416:46496
AAAGTAGGCTTCTTGCTTTGTTGTTTTTGTTTTGAAACTGGTGAGCAGCACCGGAAATTGGTCTCAGTTCTCTAGCCAGTCTACTTCCTGCTGTTACTGCTAAACGATAACATTTCTCTTGTACATGATATCATTCTATGGTTCATATAAGTT

# Scratch

## Old mixed encoding code

In [None]:
def encode_mixed1(seq):
    enh_tfbs = otx_enhancer_utils.defineTFBS(seq)
    if len(enh_tfbs) != 5:
        return -1
    enh_encoding = []
    for pos, tfbs in enh_tfbs.items():
        if tfbs[0] == "ETS":
            enh_encoding += [tfbs[4], "E", tfbs[1], tfbs[3]]
        elif tfbs[0] == "GATA":
            enh_encoding += [tfbs[4], "G", tfbs[1], tfbs[3]]
    enh_encoding.append(len(seq)-(pos+5)-1)
    return enh_encoding

def mixed1_encode(data):
    mixed1_encoding, valid_idx = [], []
    for i, (row_num, enh_data) in tqdm.tqdm(enumerate(data.iterrows())):
        sequence = enh_data["SEQ"].upper().strip()
        encoded_seq = encode_mixed1(sequence)
        if encoded_seq != -1:
            mixed1_encoding.append(encoded_seq)
            valid_idx.append(i)
    X_mixed1 = (pd.DataFrame(mixed1_encoding).replace({"G": 0, "E": 1, "R": 0, "F": 1}))
    X_mixed1 = X_mixed1.values
    return X_mixed1, valid_idx

In [65]:
mixed1_encoding = []
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(dataset.iterrows())):
    seq = enh_data["SEQ"].upper().strip()
    enh_tfbs = otx_enhancer_utils.defineTFBS(seq)
    if len(enh_tfbs) == 0:
        mixed1_encoding.append([0]*21)
    enh_encoding = []
    for pos, tfbs in enh_tfbs.items():
        if tfbs[0] == "ETS":
            enh_encoding += [tfbs[4], "E", tfbs[1], tfbs[3]]
        elif tfbs[0] == "GATA":
            enh_encoding += [tfbs[4], "G", tfbs[1], tfbs[3]]
    enh_encoding.append(len(seq)-(pos+5)-1)
    mixed1_encoding.append(enh_encoding)

# Replace strings with one hot encoding
X_mixed1 = (
    pd.DataFrame(mixed1_encoding)
    .replace({"G": 0, "E": 1, "R": 0, "F": 1})
)
X_mixed1 = X_mixed1.values

# Load in training stats
with open("../2021_OLS_Library/mixed_1.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)

# Z-score test set
scale_indeces = train_stats["indeces"]
X_mixed1[:, scale_indeces] -= train_stats["means"]
X_mixed1[:, scale_indeces] /= train_stats["stds"]

# Save the vals
if not os.path.isdir("mixed_1.0"):
    os.makedirs("mixed_1.0")
    
np.save("mixed_1.0/{}_{}-split_X-test_mixed-1.0".format(PREPROCESS, SPLIT), X_mixed1)

!ls -l mixed_1.0

X_mixed1.shape, mixed1_encoding[0], X_mixed1[0]

20it [00:03,  5.35it/s]


total 774
-rw-r--r-- 1 aklie carter-users   7968 Nov 26 12:21 0.18-0.4_0.9-split_X-test_mixed-1.0.npy
-rw-r--r-- 1 aklie carter-users 784128 Nov 26 12:09 0.18-0.4_0.9-split_X-test_mixed-1.0-tiled.npy


((20, 49),
 [31,
  'E',
  'F',
  0.11868156845560487,
  3,
  'E',
  'F',
  0.10973005182726431,
  0,
  'G',
  'R',
  0.5389119998601336,
  19,
  'E',
  'F',
  0.39163576347437207,
  5,
  'E',
  'R',
  0.10072171154881818,
  1,
  'E',
  'F',
  0.14028684169033467,
  -2,
  'G',
  'F',
  0.8051007829329527,
  17,
  'E',
  'R',
  0.3934295478885151,
  20,
  'G',
  'F',
  0.4449949666094348,
  -1,
  'E',
  'R',
  0.09932825230929267,
  -2,
  'E',
  'R',
  0.10277555374902991,
  -3,
  'G',
  'R',
  0.31163566926760605,
  10],
 array([ 6.33683722e+00,  1.00000000e+00,  1.00000000e+00, -2.05414929e+00,
        -4.97137607e-01,  1.00000000e+00,  1.00000000e+00, -2.09314810e+00,
        -1.22237710e+00,  0.00000000e+00,  0.00000000e+00,  3.19738794e-04,
         3.35270336e+00,  1.00000000e+00,  1.00000000e+00, -7.11168311e-01,
        -1.14164866e-02,  1.00000000e+00,  0.00000000e+00, -2.12157986e+00,
         3.36793426e-01,  1.00000000e+00,  1.00000000e+00,  1.40286842e-01,
        -2.0000000

In [35]:
mixed2_encoding = []
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(dataset_tiled.iterrows())):
    seq = enh_data["SEQ"].upper().strip()
    enh_tfbs = otx_enhancer_utils.defineTFBS(seq)
    if len(enh_tfbs) == 0:
        mixed2_encoding.append([0]*26)
    enh_encoding = []
    for pos, tfbs in enh_tfbs.items():
        if tfbs[0] == "ETS":
            enh_encoding += [tfbs[4], tfbs[3], tfbs[1], 0, 0]
        elif tfbs[0] == "GATA":
            enh_encoding += [tfbs[4], 0, 0, tfbs[3], tfbs[1]]
    enh_encoding.append(len(seq)-(pos+5)-1)
    mixed2_encoding.append(enh_encoding)

# Replace strings with one hot encoding
X_mixed2 = (
    pd.DataFrame(mixed2_encoding)
    .replace({"G": 0, "E": 1, "R": 0, "F": 1})
)
X_mixed2 = X_mixed2.values

# Load in training stats
with open("../2021_OLS_Library/mixed_2.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)

# Z-score test set
scale_indeces = train_stats["indeces"]
X_mixed2[:, scale_indeces] -= train_stats["means"]
X_mixed2[:, scale_indeces] /= train_stats["stds"]

# Save the vals
if not os.path.isdir("mixed_2.0"):
    os.makedirs("mixed_2.0")
    
np.save("mixed_2.0/{}_{}-split_X-test_mixed-2.0".format(PREPROCESS, SPLIT), X_mixed2)

!ls -l mixed_2.0

X_mixed2.shape, mixed2_encoding[0], X_mixed2[0]

12it [00:02,  4.92it/s]


KeyError: 'TGGATAAt'

In [117]:
mixed3_encoding = []
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(dataset.iterrows())):
    seq = enh_data["SEQ"].upper().strip()
    enh_tfbs = otx_enhancer_utils.defineTFBS(seq)
    if len(enh_tfbs) == 0:
        mixed3_encoding.append([0]*21)
    enh_encoding = []
    for pos, tfbs in enh_tfbs.items():
        if tfbs[0] == "ETS":
            enh_encoding += [tfbs[4], tfbs[3], 0, tfbs[1]]
        elif tfbs[0] == "GATA":
            enh_encoding += [tfbs[4], 0, tfbs[3], tfbs[1]]
    enh_encoding.append(len(seq)-(pos+5)-1)
    mixed3_encoding.append(enh_encoding)

# Replace strings with one hot encoding
X_mixed3 = pd.DataFrame(mixed3_encoding).replace({"R": -1, "F": 1})
X_mixed3 = X_mixed3.values

In [None]:
mixed3_encoding = []
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(dataset_tiled.iterrows())):
    seq = enh_data["SEQ"].upper().strip()
    enh_tfbs = otx_enhancer_utils.defineTFBS(seq)
    if len(enh_tfbs) == 0:
        mixed3_encoding.append([0]*21)
    enh_encoding = []
    for pos, tfbs in enh_tfbs.items():
        if tfbs[0] == "ETS":
            enh_encoding += [tfbs[4], tfbs[3], 0, tfbs[1]]
        elif tfbs[0] == "GATA":
            enh_encoding += [tfbs[4], 0, tfbs[3], tfbs[1]]
    enh_encoding.append(len(seq)-(pos+5)-1)
    mixed3_encoding.append(enh_encoding)

# Replace strings with one hot encoding
X_mixed3 = pd.DataFrame(mixed3_encoding).replace({"R": -1, "F": 1})
X_mixed3 = X_mixed3.values

# Load in training stats
with open("../2021_OLS_Library/mixed_3.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)

# Z-score test set
scale_indeces = train_stats["indeces"]
X_mixed3[:, scale_indeces] -= train_stats["means"]
X_mixed3[:, scale_indeces] /= train_stats["stds"]

# Save the vals
if not os.path.isdir("mixed_3.0"):
    os.makedirs("mixed_3.0")
    
np.save("mixed_3.0/{}_{}-split_X-test_mixed-3.0-tiled".format(PREPROCESS, SPLIT), X_mixed3)

!ls -l mixed_3.0

X_mixed3.shape, mixed3_encoding[0], X_mixed3[0]

## Old positive and negative sequence break-up code

In [71]:
# Mask
neg_mask = (y == 0)

In [76]:
# Negative seqs
X_neg = X_seqs[neg_mask]
y_neg = y[neg_mask]
id_neg = ID[neg_mask]

# Positive seqs
X_pos = X_seqs[~neg_mask]
y_pos = y[~neg_mask]
id_pos = ID[~neg_mask]

In [79]:
# Check
print(X_neg.shape, y_neg.shape)
print(X_pos.shape, y_pos.shape)
if (X_neg.shape[0] + X_pos.shape[0] == X_seqs.shape[0]):
    print("We good: {}, {}, {}".format(X_seqs.shape, y.shape, ID.shape))
else:
    print("The game is afoot")

(12,) (12,)
(8,) (8,)
We good: (20,), (20,), (20,)


*Positive training sequences*

In [92]:
# Save positive sequences in fasta format
pos_file = open("fasta/X_fasta-pos.fa", "w")
for i in range(len(X_pos)):
    pos_file.write(">" + id_pos[i] + "\n" + X_pos[i].upper().strip() + "\n")
pos_file.close()

In [93]:
# Double check
!wc -l fasta/X_fasta-pos.fa

16 fasta/X_fasta-pos.fa


In [94]:
# Should equal above
len(X_pos)*2

16

*Negative training sequences*

In [95]:
# Save negative sequences in fasta format
neg_file = open("fasta/X_fasta-neg.fa", "w")
for i in range(len(X_neg)):
    neg_file.write(">" + id_neg[i] + "\n" + X_neg[i].upper().strip() + "\n")
neg_file.close()

In [96]:
# Double check
!wc -l fasta/X_fasta-neg.fa

24 fasta/X_fasta-neg.fa


In [97]:
# Should equal above
len(X_neg)*2

24

## Old dataset loading code

In [13]:
# Load in
dataset = pd.read_excel("1-s2.0-S096098221000432X-mmc2.xls", sheet_name="Table S2", skiprows=2)
dataset = dataset[~dataset["Enhancer Activity"].isna()]

# Clean up sequence by removing constant regions and clean up labels by equating 'none' to inactive
dataset["Sequence"] = dataset["Cloned and Tested C. i. Sequence"].apply(lambda x: x.replace("ACCCAACTTT", ""))
dataset["Sequence"] = dataset["Sequence"].apply(lambda x: x.replace("AAAGTAGGCT", ""))
dataset["label"] = (dataset["Enhancer Activity"] != "none").astype(int)
dataset["Sequence_len"] = dataset["Sequence"].apply(len)
dataset["Name"] = dataset["Position"] + ":" + dataset["Start"].astype(str) + ":" + dataset["End"].astype(str)
dataset["Type"] = "full"
dataset.head(1)

Unnamed: 0,Cluster,Position,Start,End,Size,% Identity between C.i. and C.s.,GATA-MGGAAR-80,GATA-MGGAAR-130,GATA-HGGAWR-80,Enhancer Activity,...,Segal Probability on C.i. Clusters (yeast model – 2008),Unnamed: 15,Cloned and Tested C. i. Sequence,Sequence from Published C.i. Assembly,Sequence from Published C.s. Assembly,Sequence,label,Sequence_len,Name,Type
0,C1,scaffold_1,462149,462232,83,82,X,X,X,a6.5 and/or b6.5,...,0.8,,AAAGTAGGCTATATGCTACAGCCCAGAGCTCATGGATTTTAATGGG...,ATATGCTACAGCCCAGAGCTCATGGATTTTAATGGGATCCGCTATC...,ATACTTCGACGCTGGAAGCCGGGGGAATTTAATGGGACGGCCTATC...,ATATGCTACAGCCCAGAGCTCATGGATTTTAATGGGATCGGCTATC...,1,175,scaffold_1:462149:462232,full
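
One caveat with the old loading code above: `str.replace` removes every occurrence of the constant regions, including any that happen to fall inside the tested sequence. A minimal sketch that trims them only from the two ends is shown below; `strip_constant_regions` is a hypothetical helper, and treating `AAAGTAGGCT` as the leading region and `ACCCAACTTT` as the trailing one is inferred from the printed sequences.

```python
LEFT_CONSTANT = "AAAGTAGGCT"   # constant region at the start of the cloned sequences
RIGHT_CONSTANT = "ACCCAACTTT"  # constant region at the end of the cloned sequences


def strip_constant_regions(seq, left=LEFT_CONSTANT, right=RIGHT_CONSTANT):
    """Remove the constant regions only when they sit at the ends of the sequence."""
    seq = seq.upper().strip()
    if seq.startswith(left):
        seq = seq[len(left):]
    if seq.endswith(right):
        seq = seq[:-len(right)]
    return seq


# Toy insert that contains the leading constant region internally as well
insert = "ACGT" + LEFT_CONSTANT + "TTGA"
cloned = LEFT_CONSTANT + insert + RIGHT_CONSTANT
print(strip_constant_regions(cloned) == insert)
print(cloned.replace(LEFT_CONSTANT, "").replace(RIGHT_CONSTANT, "") == insert)
```

For the first sequence in this dataset, trimming both 10 bp regions takes the length from 195 to 175, consistent with the `Sequence_len` shown in the output above.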

