**Author**: Yichen Fang

**Purpose**:

The purpose of this notebook is to produce output files by one-hot encoding the DNA sequences and appending the TFBS scores to each one-hot encoding.

The output files are first saved as plain `txt` files in the specified output folder. They are then combined into one large list, which is stored as a `pickle` buffer so that the output loads faster.

**Important Note Before Using the File**:

1. Always check that the **output location** and the **buffer file name** are correct and intended. Otherwise, you may accidentally **overwrite** previous results.
2. Always check that the motif file names correspond to the ones you are using. Otherwise, the script will not work correctly.
3. This file should only be used to generate outputs that contain motifs. To generate outputs without motifs attached, please use the script `producing_output_files_without_motif.ipynb`.
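
One way to guard against the accidental overwrite mentioned in note 1 is to refuse to write when the target file already exists. The sketch below is only an illustration: the helper name `ensure_not_existing` is hypothetical and not part of this notebook, and the demonstration writes to a temporary directory rather than to the real output paths.

```python
import os
import tempfile

def ensure_not_existing(path):
    """Return `path` unchanged, or raise if a file is already there."""
    if os.path.exists(path):
        raise FileExistsError(path + " already exists; rename or remove it first")
    return path

# Demonstration in a throwaway directory (not the notebook's real paths).
with tempfile.TemporaryDirectory() as tmp_dir:
    buffer_path = os.path.join(tmp_dir, "all_data_buffer.txt")
    with open(ensure_not_existing(buffer_path), "wb") as buff:
        buff.write(b"first run")
    try:
        ensure_not_existing(buffer_path)
        overwritten = True
    except FileExistsError:
        overwritten = False  # the second attempt is refused

print(overwritten)  # False
```

Opening the file with mode `"xb"` instead of `"wb"` gives a similar guarantee, since Python then raises `FileExistsError` itself.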

In [None]:
import os
import glob
import ast
import pickle
import shelve
import pandas as pd
import numpy as np
from Bio import SeqIO

Set address variables in the following cell:

In [None]:
input_folder_path = "/home/ubuntu/data/team_neural_network/data/input/3.24_species_only" + "/"
output_folder_path = "/home/ubuntu/formatted/output" + "/"
motif_folder_path = "/home/ubuntu/raw/5_TFBS_scores_18July2018" + "/"
path_to_buffer_file = "/home/ubuntu/formatted/buffers" + "/" + "all_data_buffer.txt" 
# NOTE: the buffer file need not be created beforehand. Just set the path
# and the file name here. The file is created automatically when it is written.

Set the motif variables:

In [None]:
motif_1 = "zelda"
motif_2 = "hb"
motif_3 = "bcd"

The following cells import all the TFBS scores and transform them into a dictionary called `all_scores`, which is stored on disk with `shelve`.

`all_scores` has the data structure: `{motif: {species: {raw_position: score}}}`.

In [None]:
# A helper function to extract the motif name from the csv name.
def get_motif(name):
    if name == 'zelda':
        return motif_1
    if name == 'hb':
        return motif_2
    if name == 'bcd':
        return motif_3

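def motif_processing(motif_name):
    # Collect every score file for this motif.
    all_csvs = glob.glob(motif_folder_path + motif_name + '/*.csv')
    all_scores = shelve.open("/home/ubuntu/formatted/motif_dic")
    curr_motif = {}    # {species: {raw_position: score}}
    u = 0              # number of files processed so far
    for csv_ in all_csvs:
        with open(csv_, encoding='utf-8') as csv_file:
            for a_line in csv_file:
                # The files are tab-separated despite the .csv extension.
                curr_line = a_line.split('\t')
                # strip() also handles a last line without a trailing newline.
                strand = curr_line[6].strip()
                # Only scores on the positive strand are kept.
                if strand == 'positive':
                    score = float(curr_line[2])
                    species = curr_line[4]
                    raw_position = int(curr_line[5])
                    if species not in curr_motif:
                        curr_motif[species] = {}
                    curr_motif[species][raw_position] = score
        u += 1
        print(motif_name + ': ' + str(u))
    # Store the whole motif under one key, then flush the shelf to disk.
    all_scores[motif_name] = curr_motif
    all_scores.close()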

In [None]:
motif_processing('zelda')
motif_processing('hb')
motif_processing('bcd')

In [None]:
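# Regroup the scores by species so that one lookup returns every motif for a sequence.
# `new_scores` has the structure {species: {motif: {raw_position: score}}}.
all_scores = shelve.open("/home/ubuntu/formatted/motif_dic")
new_scores = shelve.open("/home/ubuntu/formatted/species_motif_dic")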

def redesign_shelve(motif):
    v = 0
    current_motif = all_scores[motif]
    for species in current_motif:
        if species not in new_scores:
            species_dic = {}
            species_dic[motif] = current_motif[species]
            new_scores[species] = species_dic
        else:
            species_dic = new_scores[species]
            species_dic[motif] = current_motif[species]
            new_scores[species] = species_dic
        v += 1
        print(motif + ": " + str(v))

redesign_shelve('zelda')
redesign_shelve('hb')
redesign_shelve('bcd')

new_scores.close()
all_scores.close()
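
The regrouping performed by `redesign_shelve` can be illustrated on ordinary in-memory dictionaries. This is a minimal sketch under assumptions: `regroup_by_species` is a hypothetical helper, and the species keys and scores are made up for the example rather than taken from the real score files.

```python
def regroup_by_species(scores_by_motif):
    """Turn {motif: {species: {position: score}}} into
    {species: {motif: {position: score}}}."""
    scores_by_species = {}
    for motif, per_species in scores_by_motif.items():
        for species, per_position in per_species.items():
            # Create the species entry on first sight, then attach this motif.
            scores_by_species.setdefault(species, {})[motif] = per_position
    return scores_by_species

# Toy input; "sp_A" and "sp_B" are hypothetical keys.
toy = {
    "zelda": {"sp_A": {0: 1.5}, "sp_B": {3: 0.2}},
    "hb": {"sp_A": {0: 0.75}},
}
regrouped = regroup_by_species(toy)
print(regrouped["sp_A"])          # {'zelda': {0: 1.5}, 'hb': {0: 0.75}}
print(sorted(regrouped["sp_B"]))  # ['zelda']
```

The cell above reads each entry out of the shelf, modifies it, and assigns it back instead of using `setdefault`, because a shelf opened without `writeback=True` does not persist in-place changes to a stored value.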

The aim of the following cell is to produce a one-hot encoding scheme with TFBS scores embedded for each DNA sequence segment.

It consists of four steps:

1. Read in all DNA sequence segments.
2. Transform each position of the DNA sequence into a 4-element one-hot encoding based on the `base_pairs` dictionary.
3. For each position, attach the TFBS scores to the end of the one-hot encoding.
4. Output the final encoding into `txt` files for bookkeeping.
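
Steps 2 and 3 can be sketched on a toy sequence before reading the full cell. The names `BASES` and `encode_with_scores` and all scores below are assumptions made for illustration. Unlike the notebook's cell, which keys the zero-padding on the `zelda` scores, this sketch pads each motif independently.

```python
# Hypothetical mini-version of the encoding step; not the notebook's full pipeline.
BASES = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0, 0, 0, 0],
}

def encode_with_scores(sequence, motif_scores, motif_order):
    """One-hot encode `sequence`, then append one score per motif at each position.

    `motif_scores` maps motif -> {position: score}; missing positions get 0.
    """
    encoded = []
    for position, base in enumerate(sequence.upper()):
        row = list(BASES[base])  # copy so the shared template is never mutated
        row.extend(motif_scores[motif].get(position, 0) for motif in motif_order)
        encoded.append(row)
    return encoded

# Made-up scores for a three-base sequence.
toy = {"zelda": {0: 2.0}, "hb": {0: 0.5, 1: 1.0}, "bcd": {}}
rows = encode_with_scores("AcN", toy, ["zelda", "hb", "bcd"])
for row in rows:
    print(row)
# [1, 0, 0, 0, 2.0, 0.5, 0]
# [0, 1, 0, 0, 0, 1.0, 0]
# [0, 0, 0, 0, 0, 0, 0]
```

Each position ends up as a 7-element list: four one-hot entries followed by the three motif scores.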

In [None]:
# Use the following dictionary to perform the transformation
base_pairs = {'A': [1, 0, 0, 0], 
              'C': [0, 1, 0, 0],
              'G': [0, 0, 1, 0],
              'T': [0, 0, 0, 1],
              'a': [1, 0, 0, 0],
              'c': [0, 1, 0, 0],
              'g': [0, 0, 1, 0],
              't': [0, 0, 0, 1],
              'n': [0, 0, 0, 0],
              'N': [0, 0, 0, 0]}

file_num_limit = 10000    # The maximum number of files to be encoded
file_count = 0

def lacking_motif(sequence):
    if 'zelda' not in sequence:
        return True
    elif 'hb' not in sequence:
        return True
    elif 'bcd' not in sequence:
        return True
    return False

new_scores = shelve.open("/home/ubuntu/formatted/species_motif_dic")

# Iterate through every file
for file in os.listdir(input_folder_path):
    one_hot = []
    # Once file_num_limit files have been written, skip the remaining files
    if file_count < file_num_limit:
        data = list(SeqIO.parse(input_folder_path + file,"fasta"))
        for n in range(0, len(data)):
            # Extract the header information
            header = data[n].description.split('|')
            descr = data[n].description
            regionID = header[0]
            expressed = header[1]
            speciesID = header[2]
            strand = header[3]
            # Complement all sequences on the negative DNA strand
            if strand == '-':
                # list(base_pairs[base]) makes a fresh list for each position, so the
                # extend/append calls below never mutate the shared base_pairs entries
                one_hot.append([descr, expressed, speciesID, [list(base_pairs[base]) for base in data[n].seq.complement()]])
            else:
                one_hot.append([descr, expressed, speciesID, [list(base_pairs[base]) for base in data[n].seq]])
        # Attach the TFBS scores to the end of each position
        to_write = True
        for item in one_hot:
            # Only outputs sequences that currently have TFBS scores
            # Ignore all sequences that do not have TFBS scores yet
            sequence_name = item[0]
            if sequence_name not in new_scores:
                to_write = False
                break
            current_sequence = new_scores[sequence_name]
            if lacking_motif(current_sequence):
                to_write = False
                break
            i = 0
            for encoding in item[3]:
                # Take care of positions that do not have TFBS scores, attaching 0 as a placeholder (i.e. NA)
                if i not in current_sequence['zelda']:
                    encoding.extend([0, 0, 0])
                else:
                    encoding.append(current_sequence['zelda'][i])
                    # .get(i, 0) avoids a KeyError if hb or bcd lacks a score at a position that zelda has
                    encoding.append(current_sequence['hb'].get(i, 0))
                    encoding.append(current_sequence['bcd'].get(i, 0))
                i += 1
        # Write the final encoding into txt files
        if to_write:
            with open(output_folder_path + regionID + ".txt", mode="w", encoding='utf-8') as output:
                output.write(str(one_hot))
            file_count += 1
            print("output: " + str(file_count))

new_scores.close()

The rest of the notebook uses the one-hot encoding files produced above to build a neural network prototype to make sure everything works as intended.

The following cell reads in one-hot encoding files as a list `seq_record_list`.

In [None]:
all_txts = glob.glob(output_folder_path + '*.txt')
seq_record_list = []
i = 0
# Iterate through all one-hot encoding files
for txt_ in all_txts:
    i += 1
    print(i)
    with open(txt_, encoding='utf-8') as f:
        # Append the one-hot encoding records of this file to seq_record_list
        seq_record_list += ast.literal_eval(f.read())
len(seq_record_list)
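
Reading the files back with `ast.literal_eval` works because each file holds the `str()` of a list made only of plain Python literals (strings, numbers, nested lists). A minimal round trip is sketched below; the record is hypothetical and only mimics the shape `[description, expressed, species, encodings]`.

```python
import ast

# Hypothetical record in the same shape as the notebook's entries:
# [description, expressed, species, per-position encodings].
records = [["region_1|1|sp_A|+", "1", "sp_A", [[1, 0, 0, 0, 0.5, 0, 0]]]]

text = str(records)                # what `output.write(str(one_hot))` stores
restored = ast.literal_eval(text)  # parses literals only; no code is executed

print(restored == records)  # True
```

Parsing large files this way is slow compared with loading a binary format, which is the motivation for the `pickle` buffer.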

The following cell saves `seq_record_list` as a `pickle` buffer so that it can be reloaded much faster next time.

In [None]:
with open(path_to_buffer_file, "wb") as buff:
    pickle.dump(seq_record_list, buff)
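
In a later session, the buffer can be read back with `pickle.load`. The sketch below shows a round trip under assumptions: it uses a temporary file named `demo_buffer.pkl` and a made-up record instead of the real `path_to_buffer_file` and data.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for seq_record_list.
records = [["region_1|1|sp_A|+", "1", "sp_A", [[1, 0, 0, 0, 0.5, 0, 0]]]]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_buffer.pkl")
    # Save, mirroring the cell above.
    with open(demo_path, "wb") as buff:
        pickle.dump(records, buff)
    # Load; note the binary read mode "rb".
    with open(demo_path, "rb") as buff:
        loaded = pickle.load(buff)

print(loaded == records)  # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.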