# 2021 Genomic GATA/ETS Cluster Sequences Preprocessing Notebook

**Authorship:**
Adam Klie *11/15/2021*
***
**Description:**
This data comes from Joe in the form of an Excel spreadsheet. From my current understanding, the 9 sequences were identified using a GATA/ETS cluster search algorithm in the *Ciona intestinalis* genome. I will follow up with Joe for more specific details.
***
**TODOs:**
 - <font color='green'> Load in and preprocess excel sheet emailed by Joe </font>
 - <font color='green'> Preprocess into FASTA, labels and seq ids</font>
 - <font color='green'> Preprocess into ohe </font>
 - <font color='green'> Clean up notebook </font>
***

# Set-up

In [2]:
# Classics
import os
import numpy as np
import pandas as pd
import tqdm
import pickle

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

import sys
sys.path.append("/cellar/users/aklie/projects/EUGENE/bin/")
import project_utils
import otx_enhancer_utils

In [14]:
# Training stats for mixed encodings
PREPROCESS = "0.09-0.4"  # String defining the preprocessing for saving
SPLIT = 0.9  # Split into training and test sets

# Load data to preprocess

## Create full sequence dataframe

In [15]:
# Load the excel spreadsheet with the sequences and functional annotation
dataset = pd.read_excel("GATA_ETS_clusters.xlsx", sheet_name=0).dropna()
dataset["FXN_LABEL"] = (dataset["Functional"] == "Yes").astype(int)
dataset = dataset.rename({"Sequence": "SEQ"}, axis=1)
dataset["SEQ_LEN"] = dataset["SEQ"].apply(len)
dataset.loc[dataset.index[4], "Name"] = "Negative_control"  # single .loc assignment avoids the chained-indexing SettingWithCopyWarning
dataset["NAME"] = dataset["Name"].str.replace(" ", "-")
dataset["TILE"] = "full"
dataset = dataset[["NAME", "SEQ", "FXN_LABEL", "TILE", "SEQ_LEN"]]
dataset.head(1)


Unnamed: 0,NAME,SEQ,FXN_LABEL,TILE,SEQ_LEN
0,Otxa-WTg1,CGTGGTACTAGAAATTTTATCACTGTCGTGGTACCGGAAAATTTAT...,1,full,93


In [16]:
# Save non-tiled dataframe in standard format
dataset.to_csv("2021_GATA_ETS_Clusters.tsv", index=False, sep="\t")

# Double check valid labeling
dataset["FXN_LABEL"].value_counts()

0    6
1    3
Name: FXN_LABEL, dtype: int64

## Create tiled dataframe

In [17]:
# Find and concatenate tiles
tile_seqs = []
for i, enh in dataset.iterrows():
    seq = enh["SEQ"]
    print(seq, enh["SEQ_LEN"])
    if len(seq) > 66:
        for j in range(len(seq)-66+1):
            tile = seq[j:j+66]
            tile_seqs.append([enh["NAME"], tile, enh["FXN_LABEL"], j, 66])
    else:
        print("too short:", len(seq))
dataset_tiled = pd.concat([dataset, pd.DataFrame(data=tile_seqs, columns=dataset.columns)]).sort_values(["NAME", "TILE"]).reset_index()

# Save tiled dataframe in standard format
dataset_tiled = dataset_tiled.drop('index', axis=1)
dataset_tiled.to_csv("2021_GATA_ETS_Clusters-tiled.tsv", sep="\t", index=False)

CGTGGTACTAGAAATTTTATCACTGTCGTGGTACCGGAAAATTTATCCCAGCTGTGATACCGGAAATTTTATCTCAGTCGTGGTACCGGAACA 93
ATTATCATTACCTTATTATCAAATGTGCAAGAGGGACGCGGAAGTAGTATTTTCAATTGGGACGCGGAAGTAGAACTTTTCGTATCTCGCAAACCATT 98
CCTATATCTATAATAACAGTCGGAAATTGCCGGAAAATAATAAAATTATCGTCT 54
too short: 54
CATAGATAGCAACAATGCCATCTTATCTACGTAACTAGGAAACATAATCATTTATAAAACTTTCTATCTACTTAAACGGAAACCTTTTGTTACGTCACGGTGACGTATTATCTGAC 116
CAGGAAATTCCCGTTATCTATCCATCGCTATACTTCGTGTAGTTATCTATCCATCGCTATACTTCGTGTAGCCAACGTGCTAATTCGGTAAGTAAAACCAGTTATCATAACATAACCGGATACGCAAACCATT 133
TTATTATCGGACTCAAAGTAGCAAACGGAAATCATGCGTGATGAACTTCCGACAGCAAACTTATCTTCCTGTG 73
GTTTGGATAGAGCGCAAGGTTTATCACAGCAGGATATATAAAAAACAACAGTAGTTTCCTGACTTTGCGTTCATTAAATATCTAATACTACCCTTGCTTTTCCCTTT 107
AAATTATCAACGTACAGAAATTATCTTGGATTGAAATGCAAGCGTCGTTTCCGTTGAGAATATCTTGT 68
TTATTATCGTTTTTTAATATATTTATCAGGTTTCTCTCGTTTCCTGCTATAGGAGGACAAGTTTTATCAGCAGGATAGAAC 81
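
The tiling loop above is a sliding window with stride 1: a sequence of length n yields n - 66 + 1 tiles, and a sequence shorter than the window yields none. A minimal sketch of that logic, where `tile_sequence` is a hypothetical helper name that does not exist in this notebook or in `project_utils`:

```python
def tile_sequence(seq, width=66):
    """Return (start, tile) pairs for every stride-1 window of `width` bases."""
    return [(j, seq[j:j + width]) for j in range(len(seq) - width + 1)]

# Toy inputs (not from the spreadsheet)
tiles = tile_sequence("ACGT" * 20, width=66)        # 80 bp -> 80 - 66 + 1 windows
short_tiles = tile_sequence("ACGT" * 10, width=66)  # 40 bp -> no windows
n_tiles = len(tiles)

# Tile count implied by the SEQ_LEN values printed above
lengths = [93, 98, 54, 116, 133, 73, 107, 68, 81]
expected_tiles = sum(max(0, n - 66 + 1) for n in lengths)
```

One difference from the loop: this helper returns a single tile for a sequence of exactly 66 bp, whereas the loop skips it (`> 66`). No sequence here is 66 bp long, so the two agree on this dataset.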


# Save labels, IDs, and Seqs

### **Binary Labels**
0 (non-functional) and 1 (functional). These are all the same

In [18]:
# Save the labels
if not os.path.isdir("binary"):
    os.makedirs("binary")

y = dataset["FXN_LABEL"].values
y_tiled = dataset_tiled["FXN_LABEL"].values
np.savetxt("binary/y_binary.txt", X=y, fmt="%d")
np.savetxt("binary/y-tiled_binary.txt", X=y_tiled, fmt="%d")

!wc -l binary/*

  9 binary/y_binary.txt
258 binary/y-tiled_binary.txt
267 total


### **Identifiers**
Name of the sequence to identify it

In [47]:
# Save the ids
if not os.path.isdir("id"):
    os.makedirs("id")
    
ID = dataset["NAME"].values
ID_tiled = (dataset_tiled["NAME"] + ":" + dataset_tiled["TILE"].astype(str)).values
np.savetxt("id/id.txt", X=ID, fmt="%s")
np.savetxt("id/id-tiled.txt", X=ID_tiled, fmt="%s")

!wc -l id/*

 258 id/id-tiled.txt
   9 id/id.txt
 267 total


### **Sequences**
ACGT...

In [20]:
# Save the seqs
if not os.path.isdir("seqs"):
    os.makedirs("seqs")
    
X_seqs = dataset["SEQ"].values
X_seqs_tiled = dataset_tiled["SEQ"].values
np.savetxt("seqs/seqs.txt", X=X_seqs, fmt="%s")
np.savetxt("seqs/seqs-tiled.txt", X=X_seqs_tiled, fmt="%s")

!wc -l seqs/*

  258 seqs/seqs-tiled.txt
    9 seqs/seqs.txt
  267 total


# Preprocess and save different feature sets

## **<u>Sequence feature idea 2 </u>**: Mixed encodings 

### **Generate all valid encodings**

In [49]:
X_mixed1s, X_mixed2s, X_mixed3s, valid_idxs = otx_enhancer_utils.mixed_encode(dataset)
X_mixed1s.shape, X_mixed2s.shape, X_mixed3s.shape, len(valid_idxs)

9it [00:00, 126.13it/s]


((3, 21), (3, 26), (3, 21), 3)

In [51]:
X_mixed1s[0], X_mixed2s[0], X_mixed3s[0]

(array([ 0.        ,  0.        ,  0.        ,  0.58244164,  6.        ,
         0.        ,  0.        ,  0.58768221, 15.        ,  1.        ,
         1.        ,  0.64665928, 18.        ,  1.        ,  1.        ,
         0.64665928,  9.        ,  0.        ,  0.        ,  0.43036466,
        10.        ]),
 array([ 0.        ,  0.        ,  0.        ,  0.58244164, -1.        ,
         6.        ,  0.        ,  0.        ,  0.58768221, -1.        ,
        15.        ,  0.64665928,  1.        ,  0.        ,  0.        ,
        18.        ,  0.64665928,  1.        ,  0.        ,  0.        ,
         9.        ,  0.        ,  0.        ,  0.43036466, -1.        ,
        10.        ]),
 array([ 0.        ,  0.        ,  0.58244164,  0.        ,  6.        ,
         0.        ,  0.58768221,  0.        , 15.        ,  0.64665928,
         0.        ,  1.        , 18.        ,  0.64665928,  0.        ,
         1.        ,  9.        ,  0.        ,  0.43036466,  0.        ,
     

In [52]:
X_mixed1s_tiled, X_mixed2s_tiled, X_mixed3s_tiled, valid_idxs_tiled = otx_enhancer_utils.mixed_encode(dataset_tiled)
X_mixed1s_tiled.shape, X_mixed2s_tiled.shape, X_mixed3s_tiled.shape, len(valid_idxs_tiled)

258it [00:01, 146.51it/s]


((17, 21), (17, 26), (17, 21), 17)

In [53]:
X_mixed1s_tiled[0], X_mixed2s_tiled[0], X_mixed3s_tiled[0]

(array([ 3.        ,  0.        ,  0.        ,  0.06471063, -7.        ,
         1.        ,  0.        ,  0.08895922, 16.        ,  0.        ,
         0.        ,  0.84258591, -4.        ,  0.        ,  0.        ,
         0.06471063, -7.        ,  1.        ,  0.        ,  0.08895922,
        25.        ]),
 array([ 3.        ,  0.        ,  0.        ,  0.06471063, -1.        ,
        -7.        ,  0.08895922, -1.        ,  0.        ,  0.        ,
        16.        ,  0.        ,  0.        ,  0.84258591, -1.        ,
        -4.        ,  0.        ,  0.        ,  0.06471063, -1.        ,
        -7.        ,  0.08895922, -1.        ,  0.        ,  0.        ,
        25.        ]),
 array([ 3.        ,  0.        ,  0.06471063,  0.        , -7.        ,
         0.08895922,  0.        ,  0.        , 16.        ,  0.        ,
         0.84258591,  0.        , -4.        ,  0.        ,  0.06471063,
         0.        , -7.        ,  0.08895922,  0.        ,  0.        ,
     

### *Mixed 1.0*
 - Replace binding sites using dictionary
 - Separate based on these binding sites and create "dummy variables"
 - Get lengths of linkers around binding sites

In [54]:
# Load in training stats
with open("../2021_OLS_Library/mixed_1.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)
scale_indeces = train_stats["indeces"]

# Full sequences
X_mixed1s[:, scale_indeces] -= train_stats["means"]
X_mixed1s[:, scale_indeces] /= train_stats["stds"]

# Tiled sequences
X_mixed1s_tiled[:, scale_indeces] -= train_stats["means"]
X_mixed1s_tiled[:, scale_indeces] /= train_stats["stds"]
valid_tiled_names = dataset_tiled.loc[valid_idxs_tiled]["NAME"] + ":" + dataset_tiled.loc[valid_idxs_tiled]["TILE"].astype(str)

# Save the vals
if not os.path.isdir("mixed_1.0"):
    os.makedirs("mixed_1.0")
    
np.save("mixed_1.0/{}_{}-split_X-test_mixed-1.0".format(PREPROCESS, SPLIT), X_mixed1s)
np.savetxt("mixed_1.0/valid_id.txt", X=ID[valid_idxs], fmt="%s")
np.save("mixed_1.0/{}_{}-split_X-test_mixed-1.0-tiled".format(PREPROCESS, SPLIT), X_mixed1s_tiled)
np.savetxt("mixed_1.0/valid_id-tiled.txt", X=valid_tiled_names, fmt="%s")

!wc -l mixed_1.0/*

   1 mixed_1.0/0.09-0.4_0.9-split_X-test_mixed-1.0.npy
   6 mixed_1.0/0.09-0.4_0.9-split_X-test_mixed-1.0-tiled.npy
   2 mixed_1.0/0.18-0.4_0.9-split_X-test_mixed-1.0.npy
   6 mixed_1.0/0.18-0.4_0.9-split_X-test_mixed-1.0-tiled.npy
  17 mixed_1.0/id-valid-tiled.txt
   3 mixed_1.0/id-valid.txt
  17 mixed_1.0/valid_id-tiled.txt
   3 mixed_1.0/valid_id.txt
  55 total
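
The scaling step above reuses the training-set statistics instead of recomputing them on these sequences, so the test features end up on the same scale as the training features. A minimal sketch with made-up numbers: the dictionary keys mirror the ones read from the pickle (`indeces`, `means`, `stds`), but the values and shapes are assumptions, since the pickle contents are not shown here.

```python
import numpy as np

# Hypothetical stand-in for the pickled training stats (values are made up)
train_stats = {
    "indeces": np.array([0, 2]),     # columns to standardize
    "means": np.array([10.0, 0.5]),  # training means of those columns
    "stds": np.array([2.0, 0.25]),   # training standard deviations of those columns
}

X = np.array([[12.0, 1.0, 0.75],
              [8.0, 0.0, 0.25]])

cols = train_stats["indeces"]
X_scaled = X.copy()  # work on a copy so the raw features stay untouched
X_scaled[:, cols] = (X[:, cols] - train_stats["means"]) / train_stats["stds"]
```

Working on a copy also keeps the cell safe to re-run: the in-place `-=` / `/=` form scales already-scaled values a second time if the cell is executed twice without regenerating the encodings.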


### *Mixed 2.0*
 - Replace binding sites using dictionary
 - 4 bit vector for each binding site [ets_affinity ets_orientation gata_affinity gata_orientation] - ties together the identity to affinity
 - Get lengths of linkers around binding sites

In [55]:
# Load in training stats
with open("../2021_OLS_Library/mixed_2.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)
scale_indeces = train_stats["indeces"]

# Full sequences
X_mixed2s[:, scale_indeces] -= train_stats["means"]
X_mixed2s[:, scale_indeces] /= train_stats["stds"]

# Tiled sequences
X_mixed2s_tiled[:, scale_indeces] -= train_stats["means"]
X_mixed2s_tiled[:, scale_indeces] /= train_stats["stds"]
valid_tiled_names = dataset_tiled.loc[valid_idxs_tiled]["NAME"] + ":" + dataset_tiled.loc[valid_idxs_tiled]["TILE"].astype(str)

# Save the vals
if not os.path.isdir("mixed_2.0"):
    os.makedirs("mixed_2.0")
    
np.save("mixed_2.0/{}_{}-split_X-test_mixed-2.0".format(PREPROCESS, SPLIT), X_mixed2s)
np.savetxt("mixed_2.0/valid_id.txt", X=ID[valid_idxs], fmt="%s")
np.save("mixed_2.0/{}_{}-split_X-test_mixed-2.0-tiled".format(PREPROCESS, SPLIT), X_mixed2s_tiled)
np.savetxt("mixed_2.0/valid_id-tiled.txt", X=valid_tiled_names, fmt="%s")

!wc -l mixed_2.0/*

   1 mixed_2.0/0.09-0.4_0.9-split_X-test_mixed-2.0.npy
   4 mixed_2.0/0.09-0.4_0.9-split_X-test_mixed-2.0-tiled.npy
   1 mixed_2.0/0.18-0.4_0.9-split_X-test_mixed-2.0.npy
   5 mixed_2.0/0.18-0.4_0.9-split_X-test_mixed-2.0-tiled.npy
  17 mixed_2.0/id-valid-tiled.txt
   3 mixed_2.0/id-valid.txt
  17 mixed_2.0/valid_id-tiled.txt
   3 mixed_2.0/valid_id.txt
  51 total


### *Mixed 3.0*
 - Replace binding sites using dictionary
 - 3 bit vector for each binding site [ets_affinity gata_affinity orientation] - ties together the identity to affinity while removing redundant info from mixed-2.0
 - Get lengths of linkers around binding sites

In [56]:
# Load in training stats
with open("../2021_OLS_Library/mixed_3.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)
scale_indeces = train_stats["indeces"]

# Full sequences
X_mixed3s[:, scale_indeces] -= train_stats["means"]
X_mixed3s[:, scale_indeces] /= train_stats["stds"]

# Tiled sequences
X_mixed3s_tiled[:, scale_indeces] -= train_stats["means"]
X_mixed3s_tiled[:, scale_indeces] /= train_stats["stds"]
valid_tiled_names = dataset_tiled.loc[valid_idxs_tiled]["NAME"] + ":" + dataset_tiled.loc[valid_idxs_tiled]["TILE"].astype(str)

# Save the vals
if not os.path.isdir("mixed_3.0"):
    os.makedirs("mixed_3.0")
    
np.save("mixed_3.0/{}_{}-split_X-test_mixed-3.0".format(PREPROCESS, SPLIT), X_mixed3s)
np.savetxt("mixed_3.0/valid_id.txt", X=ID[valid_idxs], fmt="%s")
np.save("mixed_3.0/{}_{}-split_X-test_mixed-3.0-tiled".format(PREPROCESS, SPLIT), X_mixed3s_tiled)
np.savetxt("mixed_3.0/valid_id-tiled.txt", X=valid_tiled_names, fmt="%s")

!wc -l mixed_3.0/*

   1 mixed_3.0/0.09-0.4_0.9-split_X-test_mixed-3.0.npy
   4 mixed_3.0/0.09-0.4_0.9-split_X-test_mixed-3.0-tiled.npy
   1 mixed_3.0/0.18-0.4_0.9-split_X-test_mixed-3.0.npy
   5 mixed_3.0/0.18-0.4_0.9-split_X-test_mixed-3.0-tiled.npy
  17 mixed_3.0/id-valid-tiled.txt
   3 mixed_3.0/id-valid.txt
  17 mixed_3.0/valid_id-tiled.txt
   3 mixed_3.0/valid_id.txt
  51 total


## **<u>Sequence feature idea 3 </u>**: Use the actual sequence (one-hot encoded)
 - One hot encoded sequence: each position is encoded as a 1-D vector of size 4 e.g., AT is [[1,0,0,0], [0,0,0,1]]
 - Generally, we will get inputs of size (len(seq) X 4). The above example would be of size 2x4
 - Can also save the string seqs in case those are also useful down the line
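
The encoding described above can be sketched with a simple lookup. This is not the implementation behind `project_utils.ohe_seqs`, which is not shown in this notebook; `one_hot` is a hypothetical helper, and mapping any non-ACGT character to an all-zero row is an assumption.

```python
import numpy as np

BASES = "ACGT"  # column order: A, C, G, T

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) array; unknown bases become all-zero rows."""
    encoded = np.zeros((len(seq), len(BASES)))
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            encoded[i, j] = 1.0
    return encoded

at = one_hot("AT")  # the two-base example from the list above
```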

**Q**: Are all sequences the same length?

In [39]:
# Check the lengths of sequences to make sure they are all the same
dataset["SEQ"].apply(len).value_counts()

81     1
98     1
68     1
116    1
133    1
54     1
73     1
107    1
93     1
Name: SEQ, dtype: int64

**Answer**: Nope

### *Forward Encoding*

**Full sequences**

In [40]:
# Get the sequences only
X_seqs = [seq.upper().strip() for seq in dataset["SEQ"].values]
X_ohe_seq = project_utils.ohe_seqs(X_seqs)

# Save the seqs
if not os.path.isdir("ohe_seq"):
    os.makedirs("ohe_seq")
    
# Save in binary format
np.save("ohe_seq/X_ohe-seq", X_ohe_seq)

# Quick check
X_seqs[0][:5], X_ohe_seq[0][:5]

100%|██████████| 9/9 [00:00<00:00, 918.55it/s]

Encoded 9 seqs
Checking all 9 seqs for proper encoding
Sequence encoding was great success



  


('CGTGG',
 array([[0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.]]))

### *Reverse encoding*

**Full sequences**

In [41]:
# Get the reverse-complement encodings
X_rev_seqs = np.array(["".join({'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}.get(base, base) for base in reversed(seq)) for seq in X_seqs])
X_ohe_rev_seq = project_utils.ohe_seqs(X_rev_seqs)

# Save the seqs
if not os.path.isdir("ohe_seq"):
    os.makedirs("ohe_seq")
    
# Save in binary format
np.save("ohe_seq/X_ohe-seq-rev", X_ohe_rev_seq)

# Quick check
X_rev_seqs[0][:5], X_ohe_rev_seq[0][:5]

100%|██████████| 9/9 [00:00<00:00, 954.31it/s]

Encoded 9 seqs
Checking all 9 seqs for proper encoding
Sequence encoding was great success



  


('TGTTC',
 array([[0., 0., 0., 1.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [0., 0., 0., 1.],
        [0., 1., 0., 0.]]))
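
The "reverse" encoding is the one-hot encoding of each sequence's reverse complement. The same transformation can be sketched with `str.translate`; `reverse_complement` is a hypothetical helper name, and like the dictionary lookup above it leaves characters other than A, C, G, T unchanged.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

# The first sequence ends in GAACA, so its reverse complement starts with TGTTC,
# matching the quick check printed above
rc = reverse_complement("GAACA")
```

Applying the function twice returns the original sequence, which makes for a cheap sanity check.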

## **<u>Sequence feature idea 4 </u>**: Use the actual sequence (as fasta)

**Full sequences**

In [42]:
# Create dir
if not os.path.isdir("fasta"):
    os.makedirs("fasta")

# Save full sequences in fasta format
dataset = dataset.reset_index()
with open("fasta/X_fasta.fa", "w") as file:
    for i, enh in dataset.iterrows():
        file.write(">" + ID[i] + "\n" + X_seqs[i].upper().strip() + "\n")

# Save tiled sequences in fasta format
# ID_tiled already ends in ":<tile>", so the tile position is not appended a second time
with open("fasta/X_fasta-tiled.fa", "w") as file:
    for i, enh in dataset_tiled.iterrows():
        file.write(">" + ID_tiled[i] + "\n" + X_seqs_tiled[i].upper().strip() + "\n")

!wc -l fasta/*

   18 fasta/X_fasta.fa
  516 fasta/X_fasta-tiled.fa
  534 total
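
Each record takes two lines (a `>` header and the sequence), which is why the line counts above are exactly twice the number of sequences (18 for 9 full sequences, 516 for 258 tiled rows). A minimal sketch of writing and reading back this two-line layout, using an in-memory buffer and hypothetical helper names; it assumes sequences are never wrapped across multiple lines.

```python
import io

def write_fasta(handle, ids, seqs):
    """Write one header line and one sequence line per record."""
    for seq_id, seq in zip(ids, seqs):
        handle.write(">{}\n{}\n".format(seq_id, seq.upper().strip()))

def read_fasta(handle):
    """Read back two-line records as (id, sequence) pairs."""
    lines = [line.strip() for line in handle if line.strip()]
    return [(lines[i][1:], lines[i + 1]) for i in range(0, len(lines), 2)]

buffer = io.StringIO()
write_fasta(buffer, ["seq-a", "seq-b"], ["acgt", "GGCC "])  # toy records
n_lines = buffer.getvalue().count("\n")
buffer.seek(0)
records = read_fasta(buffer)
```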


# Final Checks

In [43]:
!tree -L 2

[38;5;33m.[0m
├── 2021_GATA_ETS_Clusters.ipynb
├── 2021_GATA_ETS_Clusters-tiled.tsv
├── 2021_GATA_ETS_Clusters.tsv
├── binary
│   ├── y_binary.txt
│   └── y-tiled_binary.txt
├── fasta
│   ├── X_fasta.fa
│   └── X_fasta-tiled.fa
├── GATA_ETS_clusters_tiled.tsv
├── GATA_ETS_clusters.xlsx
├── id
│   ├── id-tiled.txt
│   └── id.txt
├── mixed_1.0
│   ├── 0.09-0.4_0.9-split_X-test_mixed-1.0.npy
│   ├── 0.09-0.4_0.9-split_X-test_mixed-1.0-tiled.npy
│   ├── 0.18-0.4_0.9-split_X-test_mixed-1.0.npy
│   ├── 0.18-0.4_0.9-split_X-test_mixed-1.0-tiled.npy
│   ├── valid_id-tiled.txt
│   └── valid_id.txt
├── mixed_2.0
│   ├── 0.09-0.4_0.9-split_X-test_mixed-2.0.npy
│   ├── 0.09-0.4_0.9-split_X-test_mixed-2.0-tiled.npy
│   ├── 0.18-0.4_0.9-split_X-test_mixed-2.0.npy
│   ├── 0.18-0.4_0.9-split_X-test_mixed-2.0-tiled.npy
│   ├── valid_id-tiled.txt
│   └── valid_id.txt
├── mixed_3.0
│   ├── 0.09-0.4_0.9-split_X-test_mixe

In [48]:
%%bash
head fasta/X_fasta-tiled.fa
head -n 5 id/id-tiled.txt

>Negative_control:0
CAGGAAATTCCCGTTATCTATCCATCGCTATACTTCGTGTAGTTATCTATCCATCGCTATACTTCG
>Negative_control:1
AGGAAATTCCCGTTATCTATCCATCGCTATACTTCGTGTAGTTATCTATCCATCGCTATACTTCGT
>Negative_control:2
GGAAATTCCCGTTATCTATCCATCGCTATACTTCGTGTAGTTATCTATCCATCGCTATACTTCGTG
>Negative_control:3
GAAATTCCCGTTATCTATCCATCGCTATACTTCGTGTAGTTATCTATCCATCGCTATACTTCGTGT
>Negative_control:4
AAATTCCCGTTATCTATCCATCGCTATACTTCGTGTAGTTATCTATCCATCGCTATACTTCGTGTA
Negative_control:0
Negative_control:1
Negative_control:2
Negative_control:3
Negative_control:4


In [45]:
%%bash
head fasta/X_fasta.fa
head -n 5 id/id.txt

>Otxa-WTg1
CGTGGTACTAGAAATTTTATCACTGTCGTGGTACCGGAAAATTTATCCCAGCTGTGATACCGGAAATTTTATCTCAGTCGTGGTACCGGAACA
>Otxa-WTg2
ATTATCATTACCTTATTATCAAATGTGCAAGAGGGACGCGGAAGTAGTATTTTCAATTGGGACGCGGAAGTAGAACTTTTCGTATCTCGCAAACCATT
>Otxa-WTg3
CCTATATCTATAATAACAGTCGGAAATTGCCGGAAAATAATAAAATTATCGTCT
>Otxa-WTg5
CATAGATAGCAACAATGCCATCTTATCTACGTAACTAGGAAACATAATCATTTATAAAACTTTCTATCTACTTAAACGGAAACCTTTTGTTACGTCACGGTGACGTATTATCTGAC
>Negative_control
CAGGAAATTCCCGTTATCTATCCATCGCTATACTTCGTGTAGTTATCTATCCATCGCTATACTTCGTGTAGCCAACGTGCTAATTCGGTAAGTAAAACCAGTTATCATAACATAACCGGATACGCAAACCATT
Otxa-WTg1
Otxa-WTg2
Otxa-WTg3
Otxa-WTg5
Negative_control


# Scratch

In [75]:
# Define encoders
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

integer_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder(
    categories=[np.array([0, 1, 2, 3])], handle_unknown="ignore"
)

In [76]:
# Example steps for one hot encoding
test = X_seqs[0]
print("{}...".format(test[:5]))
integer_encoded = integer_encoder.fit_transform(list(test))
print("{}...".format(integer_encoded[:5]))
integer_encoded = np.array(integer_encoded).reshape(-1, 1)
one_hot_encoder.fit(integer_encoded)  # convert to one hot
one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
print("{}...".format(one_hot_encoded.toarray()[:5]))

CGTGG...
[1 2 3 2 2]...
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]...


**One hot encoding of sequences...<u>takes about 2 minutes</u>**

In [77]:
# Applying above for all seqs
# Fit the label mapping once on the full alphabet so that A/C/G/T always map to 0/1/2/3,
# even for a sequence that lacks one of the bases (assumes only A/C/G/T occur)
integer_encoder.fit(list("ACGT"))
X_features = []  # will hold one hot encoded sequence
for i, seq in enumerate(tqdm.tqdm(X_seqs)):
    integer_encoded = integer_encoder.transform(list(seq))  # convert to integer
    integer_encoded = np.array(integer_encoded).reshape(-1, 1)
    one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)  # convert to one hot
    X_features.append(one_hot_encoded.toarray())

100%|██████████| 9/9 [00:00<00:00, 1181.90it/s]


In [78]:
# Check to make sure it is correct length
len(X_features)

9

In [79]:
# convert to numpy array
X_ohe_seq = np.array(X_features, dtype="object")

In [80]:
# Sanity check encoding for randomly chosen sequences (sampled with replacement)
expected = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
    "N": [0.0, 0.0, 0.0, 0.0],
}
indeces = np.random.choice(len(X_features), size=len(X_features))
for j, ind in enumerate(indeces):
    seq = X_seqs[ind]
    one_hot_seq = X_features[ind]
    correct = True
    for i, bp in enumerate(seq):
        if bp not in expected:
            print(bp)
        # .any() flags the position as soon as a single element differs;
        # .all() would only fire when every element is wrong
        elif (one_hot_seq[i] != expected[bp]).any():
            correct = False
            print("You one hot encoded wrong dummy!")
            print(seq, one_hot_seq)
    if correct:
        print("Seq #{} encoded correctly".format(j + 1))

Seq #1 encoded correctly
Seq #2 encoded correctly
Seq #3 encoded correctly
Seq #4 encoded correctly
Seq #5 encoded correctly
Seq #6 encoded correctly
Seq #7 encoded correctly
Seq #8 encoded correctly
Seq #9 encoded correctly


### **Binary Labels**
0 (non-functional) and 1 (functional). These are all the same

In [113]:
# Save the labels
y = dataset["label"].values
y_tiled = dataset_tiled["label"].values
np.savetxt("binary/full_y_binary.txt", X=y, fmt="%d")
np.savetxt("binary/tiled_y_binary.txt", X=y_tiled, fmt="%d")

In [114]:
!wc -l binary/*

  9 binary/full_y_binary.txt
258 binary/tiled_y_binary.txt
267 total


### **Identifiers**

In [115]:
ID = dataset["Name"].values
ID_tiled = dataset_tiled["Name"].values
np.savetxt("id/id.txt", X=ID, fmt="%s")
np.savetxt("id/tiled_id.txt", X=ID_tiled, fmt="%s")

In [116]:
!wc -l id/*

   9 id/id.txt
 258 id/tiled_id.txt
 267 total


### **Sequences**
ACGT...

In [117]:
X_seqs = dataset["Sequence"].values
X_seqs_tiled = dataset_tiled["Sequence"].values
np.savetxt("seqs/seqs.txt", X=X_seqs, fmt="%s")
np.savetxt("seqs/tiled_seqs.txt", X=X_seqs_tiled, fmt="%s")

In [118]:
!wc -l seqs/*

    9 seqs/seqs.txt
  258 seqs/tiled_seqs.txt
  267 total

