**Author**: Yichen Fang

**Purpose**:

This notebook one-hot encodes DNA sequences and writes the results to output files.

The output files are first saved as plain txt files in the `data/output_without_motif` folder. They are then combined into a single list, which is stored as a `pickle` buffer so that the output loads faster later.

**Important Note Before Using the File**:

1. Always check that the **output location** and the **buffer file name** are correct and intended. Otherwise, you may accidentally **overwrite** previous results.
2. Always check that the motif file names correspond to the ones you are using. Otherwise, the script will not work correctly.
3. This file should only be used to generate outputs without motifs. To generate outputs with motifs attached, please use the script `producing_output_files_with_motif.ipynb`.

In [None]:
import os
import glob
import ast
import pickle
import pandas as pd
import numpy as np
from Bio import SeqIO

Set address variables in the following cell:

In [None]:
input_folder_path = "change_the_path_here" + "/"
output_folder_path = "change_the_path_here" + "/"
path_to_buffer_file = "change_the_path_here" + "/" + "change_the_name_here.txt"
# NOTE: the buffer file need not be created beforehand. Just write the path
#       and the file name here. The file will be created when it is written.
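
Because the cells below write into the output folder and the buffer file without asking, a quick check before running them can help. The following is a minimal sketch, not part of the original notebook: the helper `find_files_at_risk` and the `demo_*` names are hypothetical, and the demo runs in a temporary folder so that it touches no real data. It lists the `.txt` files already present in the output folder (which may be overwritten, and which the later reading cell will also pick up) together with the buffer file if it already exists.

```python
import os
import tempfile


def find_files_at_risk(output_folder, buffer_file):
    """Hypothetical helper: list existing files that a new run could overwrite."""
    at_risk = []
    if os.path.isdir(output_folder):
        # Existing txt outputs may be overwritten if a regionID repeats
        for name in sorted(os.listdir(output_folder)):
            if name.endswith(".txt"):
                at_risk.append(os.path.join(output_folder, name))
    if os.path.isfile(buffer_file):
        # Opening the buffer with mode "wb" replaces its previous content
        at_risk.append(buffer_file)
    return at_risk


# Demo in a throwaway folder (assumed names, no real data involved)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_output_folder = tmp_dir + "/"
    demo_buffer_file = tmp_dir + "/" + "demo_buffer.pkl"
    before = find_files_at_risk(demo_output_folder, demo_buffer_file)
    with open(demo_output_folder + "region1.txt", "w", encoding="utf-8") as f:
        f.write("[]")
    after = find_files_at_risk(demo_output_folder, demo_buffer_file)

print(before)
print([os.path.basename(path) for path in after])
```

In the notebook itself, the same helper could be called with `output_folder_path` and `path_to_buffer_file`; an empty result suggests that nothing existing will be replaced.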

The aim of the following cell is to produce a one-hot encoding scheme for each DNA sequence segment.

It consists of three parts:

1. Read in all DNA sequence segments.
2. Transform each position of the DNA sequence into a 4-element one-hot vector based on the `base_pairs` dictionary.
3. Output the final encoding into `txt` files for bookkeeping.

In [None]:
# Use the following dictionary to perform the transformation
base_pairs = {'A': [1, 0, 0, 0], 
              'C': [0, 1, 0, 0],
              'G': [0, 0, 1, 0],
              'T': [0, 0, 0, 1],
              'a': [1, 0, 0, 0],
              'c': [0, 1, 0, 0],
              'g': [0, 0, 1, 0],
              't': [0, 0, 0, 1],
              'n': [0, 0, 0, 0],
              'N': [0, 0, 0, 0]}

file_num_limit = 10000    # The maximum number of files to be encoded
file_count = 0

# Iterate through every file
for file in os.listdir(input_folder_path):
    # Stop once the number of encoded files has reached the limit
    if file_count >= file_num_limit:
        break
    one_hot = []
    data = list(SeqIO.parse(input_folder_path + file, "fasta"))
    # Skip files without FASTA records (regionID would otherwise be undefined or stale)
    if not data:
        continue
    for record in data:
        # Extract the header information
        descr = record.description
        header = descr.split('|')
        regionID = header[0]
        expressed = header[1]
        speciesID = header[2]
        strand = header[3]
        # Complement all sequences on the negative DNA strand
        seq = record.seq.complement() if strand == '-' else record.seq
        # list(...) copies each vector so that positions do not share one list object
        one_hot.append([descr, expressed, speciesID, [list(base_pairs[base]) for base in seq]])
    # The output file is named after the regionID of the last record in the input file
    with open(output_folder_path + regionID + ".txt", mode="w", encoding='utf-8') as output:
        output.write(str(one_hot))
    file_count += 1
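
To make the record layout concrete, here is a small worked example of the encoding step above. It is a sketch under stated assumptions: the header `region1|1|speciesA|-` and the sequence `ACGN` are made up, `encode_record` and `demo_base_pairs` are hypothetical names, and a plain `str.translate` table stands in for Biopython's `Seq.complement()` so that the snippet runs without Biopython.

```python
# Same mapping as the notebook's `base_pairs`, upper case only for brevity
demo_base_pairs = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0],
                   'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1],
                   'N': [0, 0, 0, 0]}

# Stand-in for Seq.complement(): A<->T, C<->G, N stays N
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def encode_record(description, sequence):
    """Hypothetical helper: parse the header and one-hot encode one sequence."""
    region_id, expressed, species_id, strand = description.split('|')[:4]
    if strand == '-':
        # Negative strand: encode the complement, as in the cell above
        sequence = sequence.translate(COMPLEMENT)
    encoding = [list(demo_base_pairs[base]) for base in sequence]
    return [description, expressed, species_id, encoding]


example = encode_record("region1|1|speciesA|-", "ACGN")
print(example[3])
# [[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
```

The sequence `ACGN` on the negative strand is complemented to `TGCN` before encoding, and `N` maps to the all-zero vector.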

The rest of the notebook uses the one-hot encoding files produced above to build a neural network prototype to make sure everything works as intended.

The following cell reads the one-hot encoding files into a list `seq_record_list`.

In [None]:
all_txts = glob.glob(output_folder_path + '*.txt')
seq_record_list = []
i = 0
# Iterate through all one-hot encoding files
for txt_ in all_txts:
    i += 1
    print(i)
    with open(txt_, encoding='utf-8') as f:
        # attach the one-hot encoding information of this file to the end of seq_record_list
        seq_record_list += ast.literal_eval(f.read())
len(seq_record_list)
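
The reading step relies on the fact that a nested list of strings and integers, written with `str(...)`, can be parsed back with `ast.literal_eval`. A minimal round-trip sketch, using a made-up record and a temporary folder (both are assumptions for illustration only):

```python
import ast
import os
import tempfile

# One made-up record in the layout [descr, expressed, speciesID, encoding]
records = [["region1|1|speciesA|+", "1", "speciesA",
            [[1, 0, 0, 0], [0, 0, 0, 1]]]]

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = os.path.join(tmp_dir, "region1.txt")
    # Write the list the same way the encoding cell does
    with open(txt_path, mode="w", encoding="utf-8") as output:
        output.write(str(records))
    # Parse the text back into Python objects
    with open(txt_path, encoding="utf-8") as f:
        restored = ast.literal_eval(f.read())

print(restored == records)
# True
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for reading these files back.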

The following cell saves `seq_record_list` as a `pickle` buffer so that it can be retrieved much faster next time.

In [None]:
with open(path_to_buffer_file, "wb") as buff:
    pickle.dump(seq_record_list, buff)
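
For completeness, the buffer can be read back with `pickle.load`. The sketch below uses a made-up list and a temporary file (`demo_records` and `demo_buffer_path` are hypothetical names); in practice the file opened for reading would be `path_to_buffer_file`.

```python
import os
import pickle
import tempfile

# Made-up stand-in for seq_record_list
demo_records = [["region1|1|speciesA|+", "1", "speciesA",
                 [[1, 0, 0, 0], [0, 0, 0, 1]]]]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_buffer_path = os.path.join(tmp_dir, "demo_buffer.pkl")
    # Save the list as a binary pickle buffer
    with open(demo_buffer_path, "wb") as buff:
        pickle.dump(demo_records, buff)
    # Load it back; note the binary read mode "rb"
    with open(demo_buffer_path, "rb") as buff:
        loaded_records = pickle.load(buff)

print(loaded_records == demo_records)
# True
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.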