The purpose of this notebook is to produce output files by one-hot encoding each DNA sequence and appending the TFBS scores to the one-hot encoding of every position.

The output files are first saved as plain `txt` files in the `data/output` folder. They are then combined into one large list, which is stored as a `pickle` buffer so that the output loads faster on later runs.

In [1]:
import os
import glob
import ast
import pickle
import pandas as pd
import numpy as np
from Bio import SeqIO  # Biopython; required by the SeqIO.parse call in the encoding cell



The following cell imports all the TFBS scores and transforms them into a nested dictionary called `all_scores`.

`all_scores` has the data structure: `{species: {motif: {raw_position: score}}}`.

In [None]:
# A helper function to extract the motif name from the csv name.
def get_motif(name):
    if 'cad_FlyReg.fm' in name:
        return 'cad_FlyReg.fm'
    if 'hb_nar2008.fm' in name:
        return 'hb_nar2008.fm'
    if 'bcd_FlyReg.fm' in name:
        return 'bcd_FlyReg.fm'

path = '../data/input/5_TFBS_score_subset_30May2018'
all_csvs = glob.glob(path + '/*.csv')
all_scores = {}
for csv_ in all_csvs:
    with open(csv_, encoding='utf-8') as csv_file:
        motif = get_motif(csv_file.name)
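        for a_line in csv_file:
            # Strip the trailing newline so the last column compares cleanly,
            # even when the final line of the file has no newline
            curr_line = a_line.rstrip('\n').split('\t')
            strand = curr_line[6]
            if strand == 'positive':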
                score = float(curr_line[2])
                species = curr_line[4]
                raw_position = int(curr_line[5])
                if species not in all_scores:
                    all_scores[species] = {}
                if motif not in all_scores[species]:
                    all_scores[species][motif] = {}
                all_scores[species][motif][raw_position] = score
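
As a minimal sketch of the nested layout `{species: {motif: {raw_position: score}}}`, the snippet below builds a small dictionary of the same shape and looks up two positions. The species names, positions, and scores are made up for illustration, and `demo_scores` is a hypothetical name that is not used elsewhere in the notebook.

```python
# Hypothetical rows: (score, species, raw_position, motif); all values are made up.
rows = [
    (4.2, "species_A", 10, "cad_FlyReg.fm"),
    (1.5, "species_A", 11, "cad_FlyReg.fm"),
    (3.0, "species_A", 10, "hb_nar2008.fm"),
    (2.7, "species_B", 5, "bcd_FlyReg.fm"),
]

demo_scores = {}
for score, species, raw_position, motif in rows:
    # setdefault creates the inner dictionaries the first time a key is seen
    demo_scores.setdefault(species, {}).setdefault(motif, {})[raw_position] = score

# Look up one stored score, and fall back to 0 for a position without a score
hit = demo_scores["species_A"]["cad_FlyReg.fm"][10]
miss = demo_scores["species_A"]["cad_FlyReg.fm"].get(99, 0)
print(hit, miss)
```

Using `dict.get` with a default is one way to treat a missing position as 0 without a separate membership test.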

The aim of the following cell is to produce a one-hot encoding scheme with TFBS scores embedded for each DNA sequence segment.

It consists of four parts:

1. Read in all DNA sequence segments.
2. Transform each position of the DNA sequence into a 4-element one-hot vector based on the `base_pairs` dictionary.
3. For each position, attach the three TFBS scores to the end of the one-hot vector.
4. Write the final encoding to `txt` files for bookkeeping.
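
The encoding steps listed above can be sketched on a toy sequence as follows. This is an illustration under stated assumptions, not the notebook's exact logic: `encode_with_scores`, `demo_base_pairs`, and `demo_motifs` are hypothetical names, the scores are made up, positions are assumed to be 0-based, and a missing score is filled with 0 separately for each motif.

```python
# One-hot templates (upper case only here; the notebook's `base_pairs` also covers lower case)
demo_base_pairs = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0],
                   'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1], 'N': [0, 0, 0, 0]}
demo_motifs = ['cad_FlyReg.fm', 'hb_nar2008.fm', 'bcd_FlyReg.fm']

def encode_with_scores(seq, motif_scores):
    """Return one 7-element list per position: 4 one-hot values plus 3 TFBS scores."""
    encoded = []
    for pos, base in enumerate(seq):
        row = list(demo_base_pairs[base])  # copy, so rows never share a list object
        for motif in demo_motifs:
            # Assumption: a missing motif or position is filled with 0
            row.append(motif_scores.get(motif, {}).get(pos, 0))
        encoded.append(row)
    return encoded

# Made-up scores for a made-up 4-base sequence
toy_scores = {'cad_FlyReg.fm': {1: 2.5}, 'hb_nar2008.fm': {1: 0.5}, 'bcd_FlyReg.fm': {1: 1.0}}
encoded = encode_with_scores("ACGN", toy_scores)
for row in encoded:
    print(row)
```

Each position ends up with seven values, so a sequence of length `n` becomes an `n x 7` table.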

In [None]:
# Use the following dictionary to perform the transformation
base_pairs = {'A': [1, 0, 0, 0], 
              'C': [0, 1, 0, 0],
              'G': [0, 0, 1, 0],
              'T': [0, 0, 0, 1],
              'a': [1, 0, 0, 0],
              'c': [0, 1, 0, 0],
              'g': [0, 0, 1, 0],
              't': [0, 0, 0, 1],
              'n': [0, 0, 0, 0],
              'N': [0, 0, 0, 0]}

file_num_limit = 110    # The maximum number of files to be encoded
file_count = 0

# Iterate through every file
for file in os.listdir("../data/input/3.24_species_only"):
    one_hot = []
    to_write = False
    # Once the number of files encoded has reached the limit, skip the remaining files
    if file_count < file_num_limit:
        data = list(SeqIO.parse("../data/input/3.24_species_only/" + file,"fasta"))
        for n in range(0, len(data)):
            # Extract the header information
            header = data[n].description.split('|')
            descr = data[n].description
            regionID = header[0]
            expressed = header[1]
            speciesID = header[2]
            strand = header[3]
            # Complement all sequences on the negative DNA strand
            seq = data[n].seq.complement() if strand == '-' else data[n].seq
            # list(...) copies the template so that every position gets its own list object
            one_hot.append([descr, expressed, speciesID, [list(base_pairs[base]) for base in seq]])
        # Attach the TFBS scores to the end of each position
        for item in one_hot:
            # Use this record's own description; `descr` would otherwise still
            # hold the description of the last record parsed above
            descr = item[0]
            # Only output sequences that currently have TFBS scores;
            # ignore all sequences that do not have TFBS scores yet
            if descr in all_scores:
                to_write = True
                i = 0
                for encoding in item[3]:
                    # Positions without TFBS scores get 0 as a placeholder (i.e. NA)
                    if i not in all_scores[descr]['cad_FlyReg.fm']:
                        encoding.extend([0, 0, 0])
                    else:
                        encoding.append(all_scores[descr]['cad_FlyReg.fm'][i])
                        encoding.append(all_scores[descr]['hb_nar2008.fm'][i])
                        encoding.append(all_scores[descr]['bcd_FlyReg.fm'][i])
                    i += 1
        # Write the final encoding into a txt file
        if to_write:
            with open("../data/output/" + regionID + ".txt", mode="w", encoding='utf-8') as output:
                output.write(str(one_hot))
            file_count += 1

The rest of the notebook uses the one-hot encoding files produced above to build a neural network prototype to make sure everything works as intended.

The following cell reads in one-hot encoding files as a list `seq_record_list`.

In [2]:
path = '../data/output'
all_txts = glob.glob(path + '/*.txt')
seq_record_list = []
i = 0
# Iterate through all one-hot encoding files
for txt_ in all_txts:
    i += 1
    print(i)
    with open(txt_, encoding='utf-8') as f:
        # attach the one-hot encoding information of this file to the end of seq_record_list
        seq_record_list += ast.literal_eval(f.read())
len(seq_record_list)

1
2
...
277


10752
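
The text files can be parsed with `ast.literal_eval` because they were written with `str(...)` on nested lists of strings and numbers. The round trip below is a small sketch of that idea; the record is made up, and `record_list` is a hypothetical name chosen for the example.

```python
import ast

# A made-up record in the same shape as the entries written to the txt files:
# [description, expressed, speciesID, per-position encoding]
record_list = [["demo|1|species_A|+", "1", "species_A",
                [[1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 2.5, 0.5, 1.0]]]]

text = str(record_list)             # the form that output.write(str(one_hot)) stores
restored = ast.literal_eval(text)   # parses Python literals only; it does not execute code
print(restored == record_list)
```

Unlike `eval`, `ast.literal_eval` accepts only literal structures such as lists, strings, and numbers, which makes it a safer choice for reading these files back.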

The following cell saves `seq_record_list` as a `pickle` buffer so that it can be retrieved much faster next time.

In [3]:
with open("list_buffer.txt", "wb") as buff:
    pickle.dump(seq_record_list, buff)
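
Reading the buffer back is the mirror image of the cell above. The sketch below shows the round trip on a made-up list and writes to a temporary directory so that it does not touch the real `list_buffer.txt`; `demo_list` and `demo_buffer.pkl` are hypothetical names.

```python
import os
import pickle
import tempfile

demo_list = [["demo", "1", "species_A", [[1, 0, 0, 0, 0, 0, 0]]]]

# Write to a temporary directory so the sketch leaves the real buffer alone
with tempfile.TemporaryDirectory() as tmp_dir:
    buffer_path = os.path.join(tmp_dir, "demo_buffer.pkl")
    with open(buffer_path, "wb") as buff:
        pickle.dump(demo_list, buff)
    with open(buffer_path, "rb") as buff:   # pickle data must be read in binary mode
        loaded = pickle.load(buff)

print(loaded == demo_list)
```

Since `pickle.load` can execute arbitrary code while unpickling, it should only be used on files from a trusted source, such as a buffer produced by this notebook.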