# Combining All Genomic Sequences Into a Single Test Set Notebook

**Authorship:**
Adam Klie
*11/27/2021*
***
**Description:**
This notebook generates the files that contain all the genomic sequences used to test EUGENE models. The following datasets are included so far:

1. 2010 Khoueiry et al
2. 2021 GATA ETS Clusters
3. 2021 OLS Exact Syntax Match
***

**TODOs:**
 - <font color='green'> Load in data from each encoding and concat in some way</font>
***

# Set-up

In [1]:
# Classics
import os
import numpy as np
import pandas as pd

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

import sys
sys.path.append("/cellar/users/aklie/projects/EUGENE/bin/")
import project_utils
import otx_enhancer_utils

In [2]:
# Training stats for mixed encodings
PREPROCESS = "0.18-0.4"  # String defining the preprocessing for saving
SPLIT = 0.9  # Split into training and test sets

In [3]:
DATASETS = ["2010_Khoueiry_CellPress", "2021_GATA_ETS_Clusters", "2021_OLS_Exact_Syntax_Match"]

# Load data to preprocess

## 1. Dataframes

In [4]:
merged, merged_tiled = pd.DataFrame(), pd.DataFrame()
for dataset in DATASETS:
    path, path_tiled = "../{0}/{0}.tsv".format(dataset), "../{0}/{0}-tiled.tsv".format(dataset)
    if os.path.exists(path):
        print("Reading in full seqs {}".format(dataset))
        df = pd.read_csv(path, sep="\t")
        df["DATASET"] = dataset
        merged = pd.concat([merged, df])
    if os.path.exists(path_tiled):
        print("Reading in tiled seqs {}".format(dataset))
        df = pd.read_csv(path_tiled, sep="\t")
        df["DATASET"] = dataset
        merged_tiled = pd.concat([merged_tiled, df])
merged.to_csv("All_Genomic_Sequences.tsv", sep="\t", index=False)
merged_tiled.to_csv("All_Genomic_Sequences-tiled.tsv", sep="\t", index=False)
len(merged), len(merged_tiled)

Reading in full seqs 2010_Khoueiry_CellPress
Reading in tiled seqs 2010_Khoueiry_CellPress
Reading in full seqs 2021_GATA_ETS_Clusters
Reading in tiled seqs 2021_GATA_ETS_Clusters
Reading in full seqs 2021_OLS_Exact_Syntax_Match


(42, 2235)
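The loop above grows `merged` by calling `pd.concat` once per dataset. As a minimal sketch of an alternative, the frames can be collected in a list and concatenated once. The helper name `merge_frames`, the toy frames, and the `NAME` column are hypothetical; only the `DATASET` and `SEQ` column names come from this notebook.

```python
import pandas as pd

def merge_frames(frames_by_dataset):
    """Concatenate per-dataset frames, tagging each row with its dataset name."""
    tagged = [df.assign(DATASET=name) for name, df in frames_by_dataset.items()]
    if not tagged:
        # Nothing was found on disk: return an empty frame instead of failing
        return pd.DataFrame()
    # ignore_index avoids repeated row labels in the merged frame
    return pd.concat(tagged, ignore_index=True)

# Toy stand-ins for the per-dataset TSV files (hypothetical values)
toy_frames = {
    "dataset_a": pd.DataFrame({"NAME": ["a1", "a2"], "SEQ": ["ACGT", "TTGA"]}),
    "dataset_b": pd.DataFrame({"NAME": ["b1"], "SEQ": ["GGCA"]}),
}
merged_toy = merge_frames(toy_frames)
print(merged_toy)
```

With `ignore_index=True` the merged frame gets a fresh 0..n-1 index, which matters only if downstream code looks rows up by label.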

## 2. Labels

In [84]:
if not os.path.exists("binary"):
    os.makedirs("binary")

!cat \
    ../2010_Khoueiry_CellPress/binary/y_binary.txt \
    ../2021_GATA_ETS_Clusters/binary/y_binary.txt \
    ../2021_OLS_Exact_Syntax_Match/binary/y_binary.txt \
    > binary/y_binary.txt

!cat \
    ../2010_Khoueiry_CellPress/binary/y-tiled_binary.txt \
    ../2021_GATA_ETS_Clusters/binary/y-tiled_binary.txt \
    > binary/y-tiled_binary.txt

!wc -l binary/*

  42 binary/y_binary.txt
2235 binary/y-tiled_binary.txt
2277 total
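The `cat` calls above still write an output file when one of the inputs is missing. A rough Python equivalent that reports which inputs were skipped could look like the sketch below. The function name `concat_text_files` and the temporary files are hypothetical and only serve the demonstration.

```python
import os
import tempfile

def concat_text_files(paths, out_path):
    """Concatenate the files in `paths` into `out_path`; return the missing paths."""
    missing = [p for p in paths if not os.path.exists(p)]
    with open(out_path, "w") as out:
        for p in paths:
            if p in missing:
                continue  # skip absent inputs, but remember them for the caller
            with open(p) as handle:
                out.write(handle.read())
    return missing

# Demonstration on throwaway files
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    absent = os.path.join(tmp, "absent.txt")  # never created
    with open(first, "w") as f:
        f.write("1\n0\n")
    with open(second, "w") as f:
        f.write("1\n")
    merged_path = os.path.join(tmp, "merged.txt")
    missing_demo = concat_text_files([first, absent, second], merged_path)
    with open(merged_path) as f:
        merged_lines = f.read().splitlines()

print(merged_lines, [os.path.basename(p) for p in missing_demo])
```

Returning the missing paths lets the caller decide whether a partial merge is acceptable.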


## IDs

In [85]:
if not os.path.exists("id"):
    os.makedirs("id")

!cat \
    ../2010_Khoueiry_CellPress/id/id.txt \
    ../2021_GATA_ETS_Clusters/id/id.txt \
    ../2021_OLS_Exact_Syntax_Match/id/id.txt \
    > id/id.txt

!cat \
    ../2010_Khoueiry_CellPress/id/id-tiled.txt \
    ../2021_GATA_ETS_Clusters/id/id-tiled.txt \
    > id/id-tiled.txt

!wc -l id/*

 2235 id/id-tiled.txt
   42 id/id.txt
 2277 total


## 3. Seqs

In [86]:
if not os.path.exists("seqs"):
    os.makedirs("seqs")

!cat \
    ../2010_Khoueiry_CellPress/seqs/seqs.txt \
    ../2021_GATA_ETS_Clusters/seqs/seqs.txt \
    ../2021_OLS_Exact_Syntax_Match/seqs/seqs.txt \
    > seqs/seqs.txt

!cat \
    ../2010_Khoueiry_CellPress/seqs/seqs-tiled.txt \
    ../2021_GATA_ETS_Clusters/seqs/seqs-tiled.txt \
    > seqs/seqs-tiled.txt

!wc -l seqs/*

  2235 seqs/seqs-tiled.txt
    42 seqs/seqs.txt
  2277 total
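The `wc -l` outputs for the label, ID, and sequence files are expected to agree, since each line describes the same record. A minimal sketch of that consistency check, using in-memory lists in place of the files (the helper name `check_aligned` and the toy records are hypothetical):

```python
def check_aligned(named_records):
    """Return the shared record count, or raise if the collections differ in length."""
    counts = {name: len(records) for name, records in named_records.items()}
    if not counts:
        return 0
    if len(set(counts.values())) > 1:
        raise ValueError("Record counts differ: {}".format(counts))
    return next(iter(counts.values()))

# Toy stand-ins for the contents of id.txt, seqs.txt and y_binary.txt
toy_records = {
    "id": ["s1", "s2", "s3"],
    "seqs": ["ACGT", "TTGA", "GGCA"],
    "y_binary": [1, 0, 1],
}
n_records = check_aligned(toy_records)
print(n_records)
```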


## 4. Mixed encodings

In [87]:
if not os.path.exists("mixed"):
    os.makedirs("mixed")

!cat \
    ../2010_Khoueiry_CellPress/mixed_1.0/id-valid.txt \
    ../2021_GATA_ETS_Clusters/mixed_1.0/id-valid.txt \
    ../2021_OLS_Exact_Syntax_Match/mixed_1.0/id-valid.txt \
    > mixed/id-valid.txt

!cat \
    ../2010_Khoueiry_CellPress/mixed_1.0/id-valid-tiled.txt \
    ../2021_GATA_ETS_Clusters/mixed_1.0/id-valid-tiled.txt \
    > mixed/id-valid-tiled.txt

!wc -l mixed/*

     8 mixed/0.09-0.4_0.9-split_X-test_mixed-1.0.npy
    57 mixed/0.09-0.4_0.9-split_X-test_mixed-1.0-tiled.npy
     5 mixed/0.09-0.4_0.9-split_X-test_mixed-2.0.npy
    57 mixed/0.09-0.4_0.9-split_X-test_mixed-2.0-tiled.npy
     5 mixed/0.09-0.4_0.9-split_X-test_mixed-3.0.npy
    57 mixed/0.09-0.4_0.9-split_X-test_mixed-3.0-tiled.npy
   228 mixed/id-valid-tiled.txt
    22 mixed/id-valid.txt
   439 total


In [88]:
for encoding in ["1.0", "2.0", "3.0"]:
    print("Merging encoding: {}".format(encoding))
    merged, merged_tiled = [], []
    for dataset in DATASETS:
        path = "../{0}/mixed_{1}/{2}_{3}-split_X-test_mixed-{1}.npy".format(dataset, encoding, PREPROCESS, SPLIT)
        if os.path.exists(path):
            print("Reading full seqs in {}".format(dataset))
            arr = np.load(path)
            merged.append(arr)
        path_tiled = "../{0}/mixed_{1}/{2}_{3}-split_X-test_mixed-{1}-tiled.npy".format(dataset, encoding, PREPROCESS, SPLIT)
        if os.path.exists(path_tiled):
            print("Reading tiled seqs in {}".format(dataset))
            arr_tiled = np.load(path_tiled)
            merged_tiled.append(arr_tiled)
    merged, merged_tiled = np.vstack(merged), np.vstack(merged_tiled)
    print(merged.shape, merged_tiled.shape)
    np.save("mixed/{0}_{1}-split_X-test_mixed-{2}".format(PREPROCESS, SPLIT, encoding), arr=merged)
    np.save("mixed/{0}_{1}-split_X-test_mixed-{2}-tiled".format(PREPROCESS, SPLIT, encoding), arr=merged_tiled)

Merging encoding: 1.0
Reading full seqs in 2010_Khoueiry_CellPress
Reading tiled seqs in 2010_Khoueiry_CellPress
Reading full seqs in 2021_GATA_ETS_Clusters
Reading tiled seqs in 2021_GATA_ETS_Clusters
Reading full seqs in 2021_OLS_Exact_Syntax_Match
(22, 21) (228, 21)
Merging encoding: 2.0
Reading full seqs in 2010_Khoueiry_CellPress
Reading tiled seqs in 2010_Khoueiry_CellPress
Reading full seqs in 2021_GATA_ETS_Clusters
Reading tiled seqs in 2021_GATA_ETS_Clusters
Reading full seqs in 2021_OLS_Exact_Syntax_Match
(22, 26) (228, 26)
Merging encoding: 3.0
Reading full seqs in 2010_Khoueiry_CellPress
Reading tiled seqs in 2010_Khoueiry_CellPress
Reading full seqs in 2021_GATA_ETS_Clusters
Reading tiled seqs in 2021_GATA_ETS_Clusters
Reading full seqs in 2021_OLS_Exact_Syntax_Match
(22, 21) (228, 21)
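`np.vstack` only works when every per-dataset array has the same number of columns, which is why each encoding is merged separately. The sketch below makes that requirement explicit. The helper name `stack_encodings` and the toy shapes are hypothetical.

```python
import numpy as np

def stack_encodings(arrays):
    """Row-wise stack of 2-D arrays that must share the same number of columns."""
    arrays = [np.asarray(a) for a in arrays]
    if any(a.ndim != 2 for a in arrays):
        raise ValueError("Every encoding array must be 2-D")
    widths = {a.shape[1] for a in arrays}
    if len(widths) != 1:
        raise ValueError("Column counts differ: {}".format(sorted(widths)))
    return np.vstack(arrays)

# Two toy datasets with 21 features each (row counts are arbitrary)
toy_a = np.zeros((3, 21))
toy_b = np.ones((2, 21))
stacked = stack_encodings([toy_a, toy_b])
print(stacked.shape)
```

Checking the widths first gives a clearer error than the one `np.vstack` raises when, for example, a 26-column array is mixed with 21-column arrays.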


## 4. One-hot seqs

In [89]:
if not os.path.exists("ohe_seq"):
    os.makedirs("ohe_seq")

merged, merged_rev = [], []
for dataset in DATASETS:
    path = "../{0}/ohe_seq/X_ohe-seq.npy".format(dataset)
    path_rev = "../{0}/ohe_seq/X_ohe-seq-rev.npy".format(dataset)
    if os.path.exists(path) and os.path.exists(path_rev):
        print("Reading forward and reverse seqs in {}".format(dataset))
        arr, arr_rev = np.load(path, allow_pickle=True), np.load(path_rev, allow_pickle=True)
        merged.append(arr), merged_rev.append(arr_rev)
# The per-dataset arrays differ in shape, so the result is a ragged object array
# with one entry per dataset; dtype=object makes that explicit
merged, merged_rev = np.array(merged, dtype=object), np.array(merged_rev, dtype=object)
np.save("ohe_seq/X_ohe-seq.npy", arr=merged), np.save("ohe_seq/X_ohe-seq-rev.npy", arr=merged_rev)

Reading forward and reverse seqs in 2010_Khoueiry_CellPress
Reading forward and reverse seqs in 2021_GATA_ETS_Clusters
Reading forward and reverse seqs in 2021_OLS_Exact_Syntax_Match


(None, None)
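Note that calling `np.array` on the list of per-dataset arrays yields one entry per dataset, not one entry per sequence. If a per-sequence array is wanted instead, one option is to flatten the datasets into a single 1-D object array. This is a sketch under the assumption that each file holds a collection of per-sequence one-hot matrices of shape `(length, 4)`; the helper names and toy data are hypothetical.

```python
import numpy as np

def to_object_array(items):
    """Build a 1-D object array with one element per item, whatever the item shapes."""
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        out[i] = item
    return out

def flatten_datasets(per_dataset):
    """Flatten per-dataset collections into one 1-D object array of sequences."""
    flat = [seq for dataset in per_dataset for seq in dataset]
    return to_object_array(flat)

# Toy one-hot matrices built from rows of the 4x4 identity matrix
ds_a = to_object_array([np.eye(4)[[0, 1, 2]], np.eye(4)[[3, 3, 2, 1, 0]]])
ds_b = to_object_array([np.eye(4)[[2, 2]]])
flat = flatten_datasets([ds_a, ds_b])
print(flat.shape, [seq.shape for seq in flat])
```

Filling a preallocated object array element by element avoids NumPy trying to guess a common shape, so it behaves the same whether or not the sequences share a length.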

## 5. Fasta

In [90]:
if not os.path.exists("fasta"):
    os.makedirs("fasta")

!cat \
    ../2010_Khoueiry_CellPress/fasta/X_fasta.fa \
    ../2021_GATA_ETS_Clusters/fasta/X_fasta.fa \
    ../2021_OLS_Exact_Syntax_Match/fasta/X_fasta.fa \
    > fasta/X_fasta.fa

!cat \
    ../2010_Khoueiry_CellPress/fasta/X_fasta-tiled.fa \
    ../2021_GATA_ETS_Clusters/fasta/X_fasta-tiled.fa \
    > fasta/X_fasta-tiled.fa

!wc -l fasta/*

cat: ../2010_Khoueiry_CellPress/fasta/X_fasta-tiled.fa: No such file or directory
cat: ../2021_GATA_ETS_Clusters/fasta/X_fasta-tiled.fa: No such file or directory
    84 fasta/X_fasta.fa
     0 fasta/X_fasta-tiled.fa
  4470 fasta/X-tiled_fasta.fa
  4554 total
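The line counts above are twice the number of sequences, which suggests each FASTA record occupies two lines (a header and a sequence). A small sketch for counting records directly rather than inferring them from `wc -l`; the function name `count_fasta_records` and the toy input are hypothetical.

```python
import io

def count_fasta_records(handle):
    """Return (number of header lines, total number of lines) for a FASTA handle."""
    n_records, n_lines = 0, 0
    for line in handle:
        n_lines += 1
        if line.startswith(">"):
            n_records += 1
    return n_records, n_lines

# In-memory stand-in for a FASTA file
toy_fasta = io.StringIO(">seq1\nACGT\n>seq2\nTTGA\n")
fasta_counts = count_fasta_records(toy_fasta)
print(fasta_counts)
```

A record count of zero for a file that should hold sequences is a quick signal that an input path was wrong.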


# Final Checks

In [91]:
!tree -L 2

.
├── All_Genomic_Sequences.ipynb
├── All_Genomic_Sequences-tiled.tsv
├── All_Genomic_Sequences.tsv
├── binary
│   ├── y_binary.txt
│   └── y-tiled_binary.txt
├── fasta
│   ├── X_fasta.fa
│   ├── X_fasta-tiled.fa
│   └── X-tiled_fasta.fa
├── id
│   ├── id-tiled.txt
│   └── id.txt
├── mixed
│   ├── 0.09-0.4_0.9-split_X-test_mixed-1.0.npy
│   ├── 0.09-0.4_0.9-split_X-test_mixed-1.0-tiled.npy
│   ├── 0.09-0.4_0.9-split_X-test_mixed-2.0.npy
│   ├── 0.09-0.4_0.9-split_X-test_mixed-2.0-tiled.npy
│   ├── 0.09-0.4_0.9-split_X-test_mixed-3.0.npy
│   ├── 0.09-0.4_0.9-split_X-test_mixed-3.0-tiled.npy
│   ├── id-valid-tiled.txt
│   └── id-valid.txt
├── ohe_seq
│   ├── X_ohe-seq.npy
│   └── X_ohe-seq-rev.npy
└── seqs
    ├── seqs-tiled.txt
    └── seqs.txt

6 directories, 22 files


# Scratch

## Old mixed encoding code

In [None]:
import tqdm  # progress bars; not imported in Set-up

def encode_mixed1(seq):
    enh_tfbs = otx_enhancer_utils.defineTFBS(seq)
    if len(enh_tfbs) != 5:
        return -1
    enh_encoding = []
    for pos, tfbs in enh_tfbs.items():
        if tfbs[0] == "ETS":
            enh_encoding += [tfbs[4], "E", tfbs[1], tfbs[3]]
        elif tfbs[0] == "GATA":
            enh_encoding += [tfbs[4], "G", tfbs[1], tfbs[3]]
    enh_encoding.append(len(seq)-(pos+5)-1)
    return enh_encoding

def mixed1_encode(data):
    mixed1_encoding, valid_idx = [], []
    for i, (row_num, enh_data) in tqdm.tqdm(enumerate(data.iterrows())):
        sequence = enh_data["SEQ"].upper().strip()
        encoded_seq = encode_mixed1(sequence)
        if encoded_seq != -1:
            mixed1_encoding.append(encoded_seq)
            valid_idx.append(i)
    X_mixed1 = (pd.DataFrame(mixed1_encoding).replace({"G": 0, "E": 1, "R": 0, "F": 1}))
    X_mixed1 = X_mixed1.values
    return X_mixed1, valid_idx

In [65]:
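import pickle  # needed below to load the training stats
import tqdm  # progress bars

mixed1_encoding = []
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(dataset.iterrows())):
    seq = enh_data["SEQ"].upper().strip()
    enh_tfbs = otx_enhancer_utils.defineTFBS(seq)
    if len(enh_tfbs) == 0:
        # No binding sites: store an all-zero row and move on, since `pos` is undefined here
        mixed1_encoding.append([0]*21)
        continue
    enh_encoding = []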
    for pos, tfbs in enh_tfbs.items():
        if tfbs[0] == "ETS":
            enh_encoding += [tfbs[4], "E", tfbs[1], tfbs[3]]
        elif tfbs[0] == "GATA":
            enh_encoding += [tfbs[4], "G", tfbs[1], tfbs[3]]
    enh_encoding.append(len(seq)-(pos+5)-1)
    mixed1_encoding.append(enh_encoding)

# Replace string codes with binary integers
X_mixed1 = (
    pd.DataFrame(mixed1_encoding)
    .replace({"G": 0, "E": 1, "R": 0, "F": 1})
)
X_mixed1 = X_mixed1.values

# Load in training stats
with open("../2021_OLS_Library/mixed_1.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)

# Z-score test set
scale_indeces = train_stats["indeces"]
X_mixed1[:, scale_indeces] -= train_stats["means"]
X_mixed1[:, scale_indeces] /= train_stats["stds"]

# Save the vals
if not os.path.isdir("mixed_1.0"):
    os.makedirs("mixed_1.0")
    
np.save("mixed_1.0/{}_{}-split_X-test_mixed-1.0".format(PREPROCESS, SPLIT), X_mixed1)

!ls -l mixed_1.0

X_mixed1.shape, mixed1_encoding[0], X_mixed1[0]

20it [00:03,  5.35it/s]


total 774
-rw-r--r-- 1 aklie carter-users   7968 Nov 26 12:21 0.18-0.4_0.9-split_X-test_mixed-1.0.npy
-rw-r--r-- 1 aklie carter-users 784128 Nov 26 12:09 0.18-0.4_0.9-split_X-test_mixed-1.0-tiled.npy


((20, 49),
 [31,
  'E',
  'F',
  0.11868156845560487,
  3,
  'E',
  'F',
  0.10973005182726431,
  0,
  'G',
  'R',
  0.5389119998601336,
  19,
  'E',
  'F',
  0.39163576347437207,
  5,
  'E',
  'R',
  0.10072171154881818,
  1,
  'E',
  'F',
  0.14028684169033467,
  -2,
  'G',
  'F',
  0.8051007829329527,
  17,
  'E',
  'R',
  0.3934295478885151,
  20,
  'G',
  'F',
  0.4449949666094348,
  -1,
  'E',
  'R',
  0.09932825230929267,
  -2,
  'E',
  'R',
  0.10277555374902991,
  -3,
  'G',
  'R',
  0.31163566926760605,
  10],
 array([ 6.33683722e+00,  1.00000000e+00,  1.00000000e+00, -2.05414929e+00,
        -4.97137607e-01,  1.00000000e+00,  1.00000000e+00, -2.09314810e+00,
        -1.22237710e+00,  0.00000000e+00,  0.00000000e+00,  3.19738794e-04,
         3.35270336e+00,  1.00000000e+00,  1.00000000e+00, -7.11168311e-01,
        -1.14164866e-02,  1.00000000e+00,  0.00000000e+00, -2.12157986e+00,
         3.36793426e-01,  1.00000000e+00,  1.00000000e+00,  1.40286842e-01,
        -2.0000000
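The z-scoring step in the cell above rescales only the columns listed in the training stats, using the training means and standard deviations. A minimal sketch of that step on toy numbers follows. The function name `zscore_columns` and the values are hypothetical; the dictionary keys mirror the ones used in this notebook, and the real layout of the pickled stats is an assumption.

```python
import numpy as np

def zscore_columns(X, indices, means, stds):
    """Return a float copy of X with the selected columns z-scored."""
    # Casting to float also guards against object arrays coming out of DataFrame.replace
    X = np.array(X, dtype=float)
    X[:, indices] = (X[:, indices] - means) / stds
    return X

X_toy = np.array([[10.0, 1.0, 0.2],
                  [20.0, 0.0, 0.6]])
# Hypothetical training stats: columns 0 and 2 are continuous, column 1 is binary
toy_stats = {"indeces": [0, 2], "means": np.array([15.0, 0.4]), "stds": np.array([5.0, 0.2])}
X_scaled = zscore_columns(X_toy, toy_stats["indeces"], toy_stats["means"], toy_stats["stds"])
print(X_scaled)
```

Using the training statistics rather than statistics computed on the test set keeps the test features on the same scale the model saw during training.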

In [35]:
mixed2_encoding = []
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(dataset_tiled.iterrows())):
    seq = enh_data["SEQ"].upper().strip()
    enh_tfbs = otx_enhancer_utils.defineTFBS(seq)
    if len(enh_tfbs) == 0:
        # No binding sites: store an all-zero row and move on, since `pos` is undefined here
        mixed2_encoding.append([0]*26)
        continue
    enh_encoding = []
    for pos, tfbs in enh_tfbs.items():
        if tfbs[0] == "ETS":
            enh_encoding += [tfbs[4], tfbs[3], tfbs[1], 0, 0]
        elif tfbs[0] == "GATA":
            enh_encoding += [tfbs[4], 0, 0, tfbs[3], tfbs[1]]
    enh_encoding.append(len(seq)-(pos+5)-1)
    mixed2_encoding.append(enh_encoding)

# Replace string codes with binary integers
X_mixed2 = (
    pd.DataFrame(mixed2_encoding)
    .replace({"G": 0, "E": 1, "R": 0, "F": 1})
)
X_mixed2 = X_mixed2.values

# Load in training stats
with open("../2021_OLS_Library/mixed_2.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)

# Z-score test set
scale_indeces = train_stats["indeces"]
X_mixed2[:, scale_indeces] -= train_stats["means"]
X_mixed2[:, scale_indeces] /= train_stats["stds"]

# Save the vals
if not os.path.isdir("mixed_2.0"):
    os.makedirs("mixed_2.0")
    
np.save("mixed_2.0/{}_{}-split_X-test_mixed-2.0".format(PREPROCESS, SPLIT), X_mixed2)

!ls -l mixed_2.0

X_mixed2.shape, mixed2_encoding[0], X_mixed2[0]

12it [00:02,  4.92it/s]


KeyError: 'TGGATAAt'

In [117]:
mixed3_encoding = []
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(dataset.iterrows())):
    seq = enh_data["SEQ"].upper().strip()
    enh_tfbs = otx_enhancer_utils.defineTFBS(seq)
    if len(enh_tfbs) == 0:
        # No binding sites: store an all-zero row and move on, since `pos` is undefined here
        mixed3_encoding.append([0]*21)
        continue
    enh_encoding = []
    for pos, tfbs in enh_tfbs.items():
        if tfbs[0] == "ETS":
            enh_encoding += [tfbs[4], tfbs[3], 0, tfbs[1]]
        elif tfbs[0] == "GATA":
            enh_encoding += [tfbs[4], 0, tfbs[3], tfbs[1]]
    enh_encoding.append(len(seq)-(pos+5)-1)
    mixed3_encoding.append(enh_encoding)

# Replace string codes with signed integers (R -> -1, F -> 1)
X_mixed3 = pd.DataFrame(mixed3_encoding).replace({"R": -1, "F": 1})
X_mixed3 = X_mixed3.values

In [None]:
mixed3_encoding = []
for i, (row_num, enh_data) in tqdm.tqdm(enumerate(dataset_tiled.iterrows())):
    seq = enh_data["SEQ"].upper().strip()
    enh_tfbs = otx_enhancer_utils.defineTFBS(seq)
    if len(enh_tfbs) == 0:
        # No binding sites: store an all-zero row and move on, since `pos` is undefined here
        mixed3_encoding.append([0]*21)
        continue
    enh_encoding = []
    for pos, tfbs in enh_tfbs.items():
        if tfbs[0] == "ETS":
            enh_encoding += [tfbs[4], tfbs[3], 0, tfbs[1]]
        elif tfbs[0] == "GATA":
            enh_encoding += [tfbs[4], 0, tfbs[3], tfbs[1]]
    enh_encoding.append(len(seq)-(pos+5)-1)
    mixed3_encoding.append(enh_encoding)

# Replace string codes with signed integers (R -> -1, F -> 1)
X_mixed3 = pd.DataFrame(mixed3_encoding).replace({"R": -1, "F": 1})
X_mixed3 = X_mixed3.values

# Load in training stats
with open("../2021_OLS_Library/mixed_3.0/{}_X-train-{}_stats.pickle".format(PREPROCESS, SPLIT), 'rb') as handle:
    train_stats = pickle.load(handle)

# Z-score test set
scale_indeces = train_stats["indeces"]
X_mixed3[:, scale_indeces] -= train_stats["means"]
X_mixed3[:, scale_indeces] /= train_stats["stds"]

# Save the vals
if not os.path.isdir("mixed_3.0"):
    os.makedirs("mixed_3.0")
    
np.save("mixed_3.0/{}_{}-split_X-test_mixed-3.0-tiled".format(PREPROCESS, SPLIT), X_mixed3)

!ls -l mixed_3.0

X_mixed3.shape, mixed3_encoding[0], X_mixed3[0]

## Old positive and negative sequence break-up code

In [71]:
# Mask
neg_mask = (y == 0)

In [76]:
# Negative seqs
X_neg = X_seqs[neg_mask]
y_neg = y[neg_mask]
id_neg = ID[neg_mask]

# Positive seqs
X_pos = X_seqs[~neg_mask]
y_pos = y[~neg_mask]
id_pos = ID[~neg_mask]

In [79]:
# Check
print(X_neg.shape, y_neg.shape)
print(X_pos.shape, y_pos.shape)
if (X_neg.shape[0] + X_pos.shape[0] == X_seqs.shape[0]):
    print("We good: {}, {}, {}".format(X_seqs.shape, y.shape, ID.shape))
else:
    print("The game is afoot")

(12,) (12,)
(8,) (8,)
We good: (20,), (20,), (20,)
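The split above relies on a boolean mask and its negation, so every record lands in exactly one of the two groups. A toy illustration with hypothetical labels and IDs:

```python
import numpy as np

# Hypothetical labels and identifiers
y_toy = np.array([0, 1, 0, 1, 1])
ids_toy = np.array(["s1", "s2", "s3", "s4", "s5"])

# True where the label is negative; ~mask selects the positives
neg_mask_toy = (y_toy == 0)
ids_neg, ids_pos = ids_toy[neg_mask_toy], ids_toy[~neg_mask_toy]

# The two groups partition the data, so their sizes add up to the total
assert len(ids_neg) + len(ids_pos) == len(ids_toy)
print(ids_neg.tolist(), ids_pos.tolist())
```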


*Positive training sequences*

In [92]:
# Save positive sequences in fasta format
with open("fasta/X_fasta-pos.fa", "w") as pos_file:
    for i in range(len(X_pos)):
        pos_file.write(">" + id_pos[i] + "\n" + X_pos[i].upper().strip() + "\n")

In [93]:
# Double check
!wc -l fasta/X_fasta-pos.fa

16 fasta/X_fasta-pos.fa


In [94]:
# Should equal above
len(X_pos)*2

16

*Negative training sequences*

In [95]:
# Save negative sequences in fasta format
with open("fasta/X_fasta-neg.fa", "w") as neg_file:
    for i in range(len(X_neg)):
        neg_file.write(">" + id_neg[i] + "\n" + X_neg[i].upper().strip() + "\n")

In [96]:
# Double check
!wc -l fasta/X_fasta-neg.fa

24 fasta/X_fasta-neg.fa


In [97]:
# Should equal above
len(X_neg)*2

24
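The positive and negative cells repeat the same write loop. One way to factor it out is a small helper that writes two lines per record and returns the expected line count, so the `wc -l` check has something to compare against. The function name `write_fasta` and the toy records are hypothetical.

```python
import io

def write_fasta(ids, seqs, handle):
    """Write two-line FASTA records to `handle`; return the number of lines written."""
    if len(ids) != len(seqs):
        raise ValueError("ids and seqs must have the same length")
    for seq_id, seq in zip(ids, seqs):
        # Same normalization as the cells above: uppercase and strip whitespace
        handle.write(">{}\n{}\n".format(seq_id, seq.upper().strip()))
    return 2 * len(ids)

# In-memory buffer in place of a file opened with open(path, "w")
buffer = io.StringIO()
n_lines = write_fasta(["enh_1", "enh_2"], ["acgt ", "TTGA"], buffer)
fasta_text = buffer.getvalue()
print(n_lines)
print(fasta_text)
```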

## Old dataset loading code

In [13]:
# Load in
dataset = pd.read_excel("1-s2.0-S096098221000432X-mmc2.xls", sheet_name="Table S2", skiprows=2)
dataset = dataset[~dataset["Enhancer Activity"].isna()]

# Clean up sequences by removing the constant regions, and clean up labels by treating 'none' as inactive
dataset["Sequence"] = dataset["Cloned and Tested C. i. Sequence"].apply(lambda x: x.replace("ACCCAACTTT", ""))
dataset["Sequence"] = dataset["Sequence"].apply(lambda x: x.replace("AAAGTAGGCT", ""))
dataset["label"] = (dataset["Enhancer Activity"] != "none").astype(int)
dataset["Sequence_len"] = dataset["Sequence"].apply(len)
dataset["Name"] = dataset["Position"] + ":" + dataset["Start"].astype(str) + ":" + dataset["End"].astype(str)
dataset["Type"] = "full"
dataset.head(1)

Unnamed: 0,Cluster,Position,Start,End,Size,% Identity between C.i. and C.s.,GATA-MGGAAR-80,GATA-MGGAAR-130,GATA-HGGAWR-80,Enhancer Activity,...,Segal Probability on C.i. Clusters (yeast model – 2008),Unnamed: 15,Cloned and Tested C. i. Sequence,Sequence from Published C.i. Assembly,Sequence from Published C.s. Assembly,Sequence,label,Sequence_len,Name,Type
0,C1,scaffold_1,462149,462232,83,82,X,X,X,a6.5 and/or b6.5,...,0.8,,AAAGTAGGCTATATGCTACAGCCCAGAGCTCATGGATTTTAATGGG...,ATATGCTACAGCCCAGAGCTCATGGATTTTAATGGGATCCGCTATC...,ATACTTCGACGCTGGAAGCCGGGGGAATTTAATGGGACGGCCTATC...,ATATGCTACAGCCCAGAGCTCATGGATTTTAATGGGATCGGCTATC...,1,175,scaffold_1:462149:462232,full


# References