**Author**: Zhanyuan

**About this notebook**:
- This notebook takes a sample of 20 raw data files and checks the data-processing procedure on them.
- The purpose of `1a_producing_output_files_with_motif.ipynb` is to produce output files by one-hot encoding the DNA sequence and appending the TFBS scores to the one-hot encoding.
- The results of this notebook are saved in `../test`.

**Important Note Before Using the File**:
1. Always check that all **path variables** are correct and intended. Otherwise, you may accidentally **overwrite** previous results.
2. Always check that the motif file names correspond to the ones you are using. Otherwise, the script will not work correctly.
3. This file should only be used to generate outputs that contain motifs. To generate outputs without motifs attached, please use the script `producing_output_files_without_motif.ipynb`.
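
As a guard against the overwriting risk described in item 1, a check along the following lines could be run before any output is written. This is a minimal sketch and not part of the original notebook; the helper name `ensure_safe_output_dir` and its `allow_existing_files` flag are hypothetical.

```python
import os

def ensure_safe_output_dir(path, allow_existing_files=False):
    """Create `path` if needed; by default refuse to reuse a non-empty folder."""
    os.makedirs(path, exist_ok=True)
    existing = os.listdir(path)
    if existing and not allow_existing_files:
        raise FileExistsError(
            "{} already contains {} item(s); choose another folder "
            "or pass allow_existing_files=True".format(path, len(existing)))
    return path
```

Calling it on `output_folder_path` before the encoding step would stop the run early instead of silently replacing earlier results.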

In [20]:
import os
import glob
import ast
import shelve
import numpy as np
from Bio import SeqIO
import shutil

Set the path variables in the following cell:

In [21]:
input_folder_path = "../../test/sample" + "/"
output_folder_path = "../../test/new_output" + "/"

# This should be the parent folder of motif files
motif_folder_path = "../../../Map_Motif_no_threshold_14Nov2018/"

# The shelve with motif as the outermost key
motif_species_shelve_path = "../../test/formatted/motif_dic"

# The shelve with species name as the outermost key
species_motif_shelve_path = "../../test/formatted/species_motif_dic"

# sample data from:
data_dir = "../../../3.species_24_only_07March2018/3.24_species_only/"

Sanity check on the total number of regions in the `data_dir`.

In [22]:
len(os.listdir(data_dir)) # If the number is less than 3545, then data_dir does not contain all the regions.

870

For simplicity, we sample the following 20 regions from `data_dir` and copy them into `input_folder_path`.

In [23]:
all_data_lst = np.array(os.listdir(data_dir))
n = len(all_data_lst)

np.random.seed(189)
sample_indices = np.random.choice(np.arange(n), 20, replace = False)
sample_files = all_data_lst[sample_indices]

for file in sample_files:
    print(file)
    shutil.copy(os.path.join(data_dir, file), input_folder_path)

outlier_rm_with_length_VT26654.fa
outlier_rm_with_length_VT15830.fa
outlier_rm_with_length_VT17004.fa
outlier_rm_with_length_VT26330.fa
outlier_rm_with_length_VT10559.fa
outlier_rm_with_length_VT17649.fa
outlier_rm_with_length_VT21450.fa
outlier_rm_with_length_VT15159.fa
outlier_rm_with_length_VT18447.fa
outlier_rm_with_length_VT21149.fa
outlier_rm_with_length_VT11142.fa
outlier_rm_with_length_VT26763.fa
outlier_rm_with_length_VT17668.fa
outlier_rm_with_length_VT25271.fa
outlier_rm_with_length_VT23783.fa
outlier_rm_with_length_VT23949.fa
outlier_rm_with_length_VT21906.fa
outlier_rm_with_length_VT18437.fa
outlier_rm_with_length_VT17789.fa
outlier_rm_with_length_VT15722.fa
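
One caveat on reproducibility: `os.listdir` returns entries in arbitrary order, so a fixed seed alone may not select the same files on another machine. The sketch below sorts the names before sampling. It is an illustration only; the helper name `sample_file_names` is hypothetical, and because it uses the standard-library `random` module it will not reproduce the 20 files listed above.

```python
import random

def sample_file_names(names, k, seed):
    """Pick k distinct names reproducibly, independent of the listing order."""
    ordered = sorted(names)        # fix the order before sampling
    rng = random.Random(seed)      # local generator; global random state is untouched
    return rng.sample(ordered, k)
```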


Append all motif names to `motif_list`:

The motif name should be the same as the folder where that particular motif is stored.

For example, if all `zelda_.fm` csv files are stored in the folder `.../motif/zelda`, then the name for this motif should be `zelda`.

If all different motifs exist in a single folder, then you should separate them into distinct folders, one folder for each motif, before running the script.

In [24]:
motif_list = []
motif_list.append("zelda")
motif_list.append("cad_FlyReg")
motif_list.append("bcd_FlyReg")
motif_list.append("eve_new6")
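
Before the scores are imported, it may help to confirm that every name in `motif_list` really has a folder containing score files. The following is a minimal sketch under the folder layout described above; the helper name `missing_motif_folders` is hypothetical, and the `*.fa.csv` pattern is taken from the `motif_processing` function in this notebook.

```python
import glob
import os

def missing_motif_folders(motif_root, motif_names, pattern="*.fa.csv"):
    """Return the motif names that have no matching score files under motif_root."""
    missing = []
    for name in motif_names:
        files = glob.glob(os.path.join(motif_root, name, pattern))
        if not files:
            missing.append(name)
    return missing
```

An empty result from `missing_motif_folders(motif_folder_path, motif_list)` would indicate that each motif has at least one score file to read.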


The following cells import all the TFBS scores and transform them into a `shelve` (i.e. a dictionary stored on disk rather than in memory) called `all_scores`.

`all_scores` has the data structure: `{motif: {species: {raw_position: score}}}`.

In [28]:
def motif_processing(motif_name):
    ''' Transform the motif files into the data structure specified above, one motif at a time.
    '''
    all_csvs = glob.glob(motif_folder_path + motif_name + '/*.fa.csv') # modified
    all_scores = shelve.open(motif_species_shelve_path)
    curr_motif = {}
    u = 0
    for csv_ in all_csvs:
        with open(csv_, encoding='utf-8') as csv_file:
            for a_line in csv_file:
                curr_line = a_line.split(',') # modified
                strand = curr_line[4] # modified
                if strand == 'positive': # modified
                    score = float(curr_line[1]) # modified
                    species = curr_line[2] # modified
                    raw_position = int(curr_line[3]) # modified
                    if species not in curr_motif:
                        curr_motif[species] = {}
                    curr_motif[species][raw_position] = score
        u += 1
        print(motif_name + ': ' + str(u))
    all_scores[motif_name] = curr_motif
    all_scores.close()
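
To make the column indexing in `motif_processing` easier to follow, the sketch below parses a single row with the standard `csv` module. It is illustrative only: `parse_score_row` is a hypothetical helper, the sample row is copied from the file inspection later in this notebook, and the first column is assumed to be an unnamed row index.

```python
import csv
import io

def parse_score_row(row):
    """Return (species, raw_position, score) for a positive-strand row, else None."""
    # Assumed columns: index, score, species, raw_position, strand, align_position, motif
    if len(row) < 5 or row[4] != "positive":
        return None
    return row[2], int(row[3]), float(row[1])

sample = "15504,-16.70252227783203,VT9998|0|MEMB002A|+|963,0,positive,1,zelda_.fm\n"
row = next(csv.reader(io.StringIO(sample)))
print(parse_score_row(row))
```

The header row needs no special handling here, because its strand column holds the text `strand` rather than `positive` and is therefore skipped.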

## Inspect the input region sequence and motif score

#### Inspect the fasta files in the `input_folder_path`

In [31]:
# data contains the DNA sequences of all 24 species for a given region.
region = os.listdir(input_folder_path)[0]
data = list(SeqIO.parse(input_folder_path + region,"fasta"))

one_species = data[0]
print(one_species)

descr = one_species.description
header = one_species.description.split('|')
print('*******************************************')
print('regionID: ', header[0])
print('expressed: ', header[1])
print('speciesID: ', header[2])
print('strand: ', header[3])

ID: VT23949|1|dkik|-|2532
Name: VT23949|1|dkik|-|2532
Description: VT23949|1|dkik|-|2532
Number of features: 0
Seq('TCGCGTGCGGAATAGCGCGTGTTTGAATTTTATTCTCGGGCGAGAACTTTTTTG...CGA', SingleLetterAlphabet())
*******************************************
regionID:  VT23949
expressed:  1
speciesID:  dkik
strand:  -


Note that `ID`, `Name` and `Description` are the same.
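
The header fields can also be unpacked into named attributes, which avoids positional indexing later on. This is a minimal sketch, not the notebook's own code; `RegionHeader` and `parse_header` are hypothetical names, and the fifth field is only presumed to be the sequence length (suggested by the `with_length` file names, not confirmed here).

```python
from collections import namedtuple

# The fifth field is kept under a neutral name because its meaning is an assumption.
RegionHeader = namedtuple(
    "RegionHeader", "region_id expressed species_id strand fifth_field")

def parse_header(description):
    """Split a 'regionID|expressed|speciesID|strand|...' FASTA description."""
    fields = description.split("|")
    if len(fields) != 5:
        raise ValueError("expected 5 '|'-separated fields, got {}".format(len(fields)))
    return RegionHeader(*fields)

header = parse_header("VT23949|1|dkik|-|2532")
print(header.region_id, header.species_id, header.strand)
```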

#### Inspect the file containing motif scores.

In [40]:

# Requires `all_lines` from the cell below, which must be run first
all_lines[1]

'15504,-16.70252227783203,VT9998|0|MEMB002A|+|963,0,positive,1,zelda_.fm\n'

In [41]:
all_csvs = glob.glob(motif_folder_path + 'zelda' + '/*.fa.csv')
with open(os.path.join(motif_folder_path, motif_list[0], 'VT9998.fa.csv'), encoding='utf-8') as csv_file:
    all_lines = [line for line in csv_file]
    print("Total number of lines in the given motif file: {}".format(len(all_lines)))
    title = all_lines[0][1:].split(',')
    print("The column names of the motif file:\n {}".format(title))
    print("Check the first row (split features by ','): \n")
    curr_line = all_lines[1].split(',')[1:]
    print(curr_line)

Total number of lines in the given motif file: 46715
The column names of the motif file:
 ['score', 'species', 'raw_position', 'strand', 'align_position', 'motif\n']
Check the first row (split features by ','): 

['-16.70252227783203', 'VT9998|0|MEMB002A|+|963', '0', 'positive', '1', 'zelda_.fm\n']


In [None]:
for each_motif in motif_list:
    motif_processing(each_motif)

The following cell transforms the `all_scores` shelve created above into a new shelve, `new_scores`.

`new_scores` has the data structure: `{species: {motif: {raw_position: score}}}`.

This structure makes the subsequent motif-attaching step faster and more memory-efficient, because the scores of every motif for one sequence are retrieved with a single shelve lookup instead of loading each motif's full dictionary.

In [None]:
all_scores = shelve.open(motif_species_shelve_path)
new_scores = shelve.open(species_motif_shelve_path)

def redesign_shelve(motif):
    ''' Redesign the data structure as specified above, one motif at a time.
    '''
    v = 0
    current_motif = all_scores[motif]
    for species in current_motif:
        if species not in new_scores:
            species_dic = {}
            species_dic[motif] = current_motif[species]
            new_scores[species] = species_dic
        else:
            species_dic = new_scores[species]
            species_dic[motif] = current_motif[species]
            new_scores[species] = species_dic
        v += 1
        print(motif + ": " + str(v))

for each_motif in motif_list:
    redesign_shelve(each_motif)

new_scores.close()
all_scores.close()
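
The restructuring performed by `redesign_shelve` can be illustrated on plain in-memory dictionaries. The sketch below is a toy example under that simplification; the function name `invert_nested` and the species keys `sp1` and `sp2` are made up for illustration.

```python
def invert_nested(motif_first):
    """Turn {motif: {species: scores}} into {species: {motif: scores}}."""
    species_first = {}
    for motif, per_species in motif_first.items():
        for species, scores in per_species.items():
            species_first.setdefault(species, {})[motif] = scores
    return species_first

toy = {
    "zelda": {"sp1": {0: -1.5}, "sp2": {0: 2.0}},
    "bcd_FlyReg": {"sp1": {0: 0.25}},
}
print(invert_nested(toy))
```

With a `shelve` opened in its default mode (`writeback=False`), an in-place update such as `new_scores[species][motif] = ...` would modify only a temporary copy, which is why `redesign_shelve` reads the entry, updates it, and assigns it back.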

The aim of the following cell is to produce a one-hot encoding scheme with TFBS scores embedded for each DNA sequence segment.

It consists of four parts:

1. Read in all DNA sequence segments.
2. Transform each position of the DNA sequence into a 4-element one-hot encoding based on the `base_pairs` dictionary.
3. For each position, attach the TFBS scores to the end of the one-hot encoding.
4. Output the final encoding into `txt` files for bookkeeping.

In [46]:
# Use the following dictionary to perform the transformation
base_pairs = {'A': [1, 0, 0, 0], 
              'C': [0, 1, 0, 0],
              'G': [0, 0, 1, 0],
              'T': [0, 0, 0, 1],
              'a': [1, 0, 0, 0],
              'c': [0, 1, 0, 0],
              'g': [0, 0, 1, 0],
              't': [0, 0, 0, 1],
              'n': [0, 0, 0, 0],
              'N': [0, 0, 0, 0]}

# The maximum number of files to be processed
file_num_limit = 10000

# A counter for file processing
file_count = 0

def lacking_motif(sequence):
    ''' Return True if one or more motifs are missing for a sequence.
        Otherwise, return False.
    '''
    for each_motif in motif_list:
        if each_motif not in sequence:
            return True
    return False

new_scores = shelve.open(species_motif_shelve_path)

# Iterate through every file
for file in os.listdir(input_folder_path):
    one_hot = []
    # Once the number of files processed has reached the limit, skip the remaining files
    if file_count < file_num_limit:
        data = list(SeqIO.parse(input_folder_path + file,"fasta"))
        for n in range(0, len(data)):
            # Extract the header information
            header = data[n].description.split('|')
            descr = data[n].description
            regionID = header[0]
            expressed = header[1]
            speciesID = header[2]
            strand = header[3]
#             # Complement all sequences in the negative DNA strand
#             if strand == '-':
#                 # Using the syntax [e for e in base_pairs[n]] to create a new pointer for each position
#                 one_hot.append([descr, expressed, speciesID, [[e for e in base_pairs[n]] for n in data[n].seq.complement()]])
#             else:
            # list(...) copies the encoding so that each position gets its own list
            one_hot.append([descr, expressed, speciesID, [list(base_pairs[base]) for base in data[n].seq]])
        # Attach the TFBS scores to the end of each position
        to_write = True
        for item in one_hot:
            # Only outputs sequences that currently have TFBS scores
            # Ignore all sequences that do not have TFBS scores yet
            sequence_name = item[0]
            if sequence_name not in new_scores:
                to_write = False
                break
            current_sequence = new_scores[sequence_name]
            if lacking_motif(current_sequence):
                to_write = False
                break
            i = 0
            for encoding in item[3]:
                # Take care of positions that do not have TFBS scores, attaching 0 as placeholder (i.e. NA)
                if i not in current_sequence[motif_list[0]]:
                    encoding.extend([0 for _ in range(len(motif_list))])
                else:
                    for each_motif in motif_list:
                        encoding.append(current_sequence[each_motif][i])
                i += 1
        # Write the final encoding into txt files
        if to_write:
            with open(output_folder_path + regionID + ".txt", mode="w", encoding='utf-8') as output:
                output.write(str(one_hot))
            file_count += 1
            print("output: " + str(file_count))

new_scores.close()

output: 1
output: 2
output: 3
output: 4
output: 5
output: 6
output: 7
output: 8
output: 9
output: 10
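
Steps 2 and 3 above can be sketched on a short toy sequence as follows. This is an illustration rather than the notebook's exact logic: `ONE_HOT` and `encode_with_scores` are hypothetical names, and unlike the cell above, this version looks up each motif separately instead of testing only the first motif for a missing position.

```python
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0, 0, 0, 0],
}

def encode_with_scores(sequence, scores_by_motif, motif_names, missing=0):
    """One-hot encode `sequence` and append one score per motif at each position."""
    encoded = []
    for position, base in enumerate(sequence):
        vector = list(ONE_HOT[base.upper()])  # copy so the lookup table is never mutated
        for name in motif_names:
            # Positions without a score receive the `missing` placeholder
            vector.append(scores_by_motif.get(name, {}).get(position, missing))
        encoded.append(vector)
    return encoded

toy_scores = {"zelda": {0: -1.5, 1: 2.0}, "bcd_FlyReg": {0: 0.5}}
print(encode_with_scores("ACn", toy_scores, ["zelda", "bcd_FlyReg"]))
# → [[1, 0, 0, 0, -1.5, 0.5], [0, 1, 0, 0, 2.0, 0], [0, 0, 0, 0, 0, 0]]
```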


### Inspect the output file
#### Each position should have length 4 + (number of motifs): the first 4 entries are the one-hot encoding, and each remaining entry is the score of one motif.
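
One way to check this length requirement programmatically is to parse the saved text back into Python objects. The sketch below assumes the file holds the `str(one_hot)` text written earlier in this notebook; `check_encoded_text` is a hypothetical helper, and `ast.literal_eval` may be slow on very large files.

```python
import ast

def check_encoded_text(text, n_motifs):
    """Parse str(one_hot) text and confirm every position has 4 + n_motifs values."""
    records = ast.literal_eval(text)
    expected = 4 + n_motifs
    for descr, expressed, species_id, positions in records:
        for index, vector in enumerate(positions):
            if len(vector) != expected:
                raise ValueError(
                    "{}: position {} has {} values, expected {}".format(
                        descr, index, len(vector), expected))
    return len(records)  # number of sequences checked
```

For the four motifs used here, `check_encoded_text(text, len(motif_list))` should pass only if every position holds 8 values.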

In [60]:
one_hot_sequence = os.listdir(output_folder_path)[0]
# Use a context manager so the file is closed after reading
with open(os.path.join(output_folder_path, one_hot_sequence), "r", encoding="utf-8") as text_wrapper:
    seq = text_wrapper.read()

seq

"[['VT17649|1|dkik|-|2572', '1', 'dkik', [[0, 1, 0, 0, -10.244582176208496, 0.7007919549942017, -7.555264949798584, -1.8725075721740723], [0, 0, 0, 1, -10.244582176208496, -2.5216004848480225, 1.0692260265350342, -7.981031894683838], [0, 1, 0, 0, -10.244582176208496, 0.06336206197738647, -2.345811367034912, -3.457470178604126], [0, 0, 1, 0, -16.70252227783203, -0.9910857081413269, -4.201627731323242, 0.8644579648971558], [0, 1, 0, 0, -3.786642074584961, -1.4061232805252075, -6.555264949798584, 1.8644579648971558], [0, 1, 0, 0, -3.786642074584961, -1.4061232805252075, -8.555264472961426, 3.44942045211792], [0, 0, 0, 1, -10.244582176208496, -1.9910857677459717, -6.140227317810059, -1.074141502380371], [0, 0, 0, 1, -10.244582176208496, -1.9910857677459717, -6.140227317810059, -1.074141502380371], [0, 0, 0, 1, -10.244582176208496, -1.8841705322265625, -5.140227317810059, -5.396069526672363], [0, 0, 0, 1, -10.244582176208496, -1.9910857677459717, -6.140227317810059, -2.074141502380371], [0,