## Data Annotation Pipeline
This pipeline annotates the non-coding regions that fall within the designated gene regions.

Start with test_small. Each variant is parsed into a record, and the records are stored in test_small_list.

In [3]:
import pandas as pd
from tqdm import tqdm
import pickle

test_small = pd.read_pickle('datasets/small/test_small.pkl')
test_small_list = []

for _, row in test_small.iterrows():
    # variant_id is underscore-separated: chromosome, position, ref, alt
    chrom, pos, ref, alt = row['variant_id'].split('_')[:4]
    test_small_list.append({'chr': chrom[3:],  # drop the leading 'chr'
                            'pos': pos,        # kept as a string
                            'ref': ref,
                            'alt': alt,
                            'tss_distance': row['tss_distance'],
                            'annotation': []})

{'chr': '19', 'pos': '3762700', 'ref': 'C', 'alt': 'T', 'tss_distance': 36}
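
The parsing step can be checked in isolation without loading the pickle. The sketch below mirrors the loop above in a small helper; the name `parse_variant_id` is hypothetical, and the example ID is only reconstructed from the record shown above (real IDs may carry extra trailing fields, which the helper ignores).

```python
def parse_variant_id(variant_id, tss_distance):
    """Split an underscore-separated variant ID into a record dict.

    Hypothetical helper, not part of the original notebook.
    """
    # Only the first four fields are used; anything after them is ignored.
    chrom, pos, ref, alt = variant_id.split('_')[:4]
    return {
        # Strip the 'chr' prefix if present, so 'chr19' becomes '19'.
        'chr': chrom[3:] if chrom.startswith('chr') else chrom,
        'pos': pos,  # kept as a string, as in the notebook
        'ref': ref,
        'alt': alt,
        'tss_distance': tss_distance,
        'annotation': [],
    }


record = parse_variant_id('chr19_3762700_C_T', 36)
print(record)
```

Keeping `pos` as a string matches the notebook, which converts it with `int(...)` wherever arithmetic is needed.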

test_small_list is a list in the same order as the rows of test_small. Note that the regulation of the same gene may differ between cell lines.

Annotate each entry below; the resulting annotations are stored in test_small_list. This is the single-threaded version.

In [None]:
regulation_annotation = pd.read_table('preprocess_data/regulation_annotation.tsv', sep='\t')
for test_item in tqdm(test_small_list):
    # Interval between the variant and its TSS; min/max handle either sign of tss_distance
    pos = int(test_item['pos'])
    slice_start = min(pos, pos + test_item['tss_distance'])
    slice_end = max(pos, pos + test_item['tss_distance'])
    for idx, row in regulation_annotation.iterrows():
        if test_item['chr'] != row['Chromosome/scaffold name']:
            continue  # skip if not same chromosome
        # Intersect the variant interval with the regulatory region
        overlap_start = max(slice_start, int(row['Start (bp)']))
        overlap_end = min(slice_end, int(row['End (bp)']))
        if overlap_start < overlap_end:  # keep only non-empty overlaps
            test_item['annotation'].append({'no': idx, 'start': overlap_start, 'end': overlap_end})
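
The overlap test above is easier to verify on plain tuples. The following is a minimal sketch, not the notebook's implementation: it assumes the regions have already been read into `(no, chrom, start, end)` tuples, groups them by chromosome so each variant only scans its own chromosome, and uses made-up region coordinates. All function names are hypothetical.

```python
from collections import defaultdict


def variant_interval(pos, tss_distance):
    """Return (start, end) of the span between a variant and its TSS."""
    a = int(pos)
    b = a + tss_distance
    return min(a, b), max(a, b)


def overlap(start_a, end_a, start_b, end_b):
    """Return the intersection of two intervals, or None if it is empty."""
    start = max(start_a, start_b)
    end = min(end_a, end_b)
    return (start, end) if start < end else None


def index_by_chromosome(regions):
    """Group (no, chrom, start, end) tuples by chromosome name (as str)."""
    index = defaultdict(list)
    for no, chrom, start, end in regions:
        index[str(chrom)].append((no, int(start), int(end)))
    return index


def annotate(item, index):
    """Collect the overlaps between one variant record and its chromosome's regions."""
    slice_start, slice_end = variant_interval(item['pos'], item['tss_distance'])
    hits = []
    for no, region_start, region_end in index.get(item['chr'], []):
        found = overlap(slice_start, slice_end, region_start, region_end)
        if found is not None:
            hits.append({'no': no, 'start': found[0], 'end': found[1]})
    return hits


# Made-up regions for illustration only.
regions = [
    (0, '19', 3762000, 3763000),  # contains the whole variant interval
    (1, '19', 3762736, 3762800),  # only touches the interval end: no overlap
    (2, '1', 3762000, 3763000),   # different chromosome
]
item = {'chr': '19', 'pos': '3762700', 'tss_distance': 36}
print(annotate(item, index_by_chromosome(regions)))
```

Because the comparison is strict (`start < end`), a region that merely touches the interval boundary produces no annotation, which is the same behavior as the loop above.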

The partial results, one pickle file per slice of test_small_list, need to be merged and stored in test_small_annotation.pkl.

In [None]:
import copy
total = 20  # number of partial result files
slice_size = len(test_small_list) // total
for i in tqdm(range(total)):
    start_pos = i * slice_size
    # The last slice also takes the remainder
    end_pos = len(test_small_list) if i == total - 1 else (i + 1) * slice_size
    with open('preprocess_data/small/test_small_annotation_{}.pkl'.format(i), 'rb') as input_file:
        test_small_list_tmp = pickle.load(input_file)
    # Each partial list is indexed from 0, so shift by start_pos
    for j in range(start_pos, end_pos):
        test_small_list[j]['annotation'] = copy.deepcopy(test_small_list_tmp[j - start_pos]['annotation'])
with open('preprocess_data/small/test_small_annotation.pkl', 'wb') as out_file:
    pickle.dump(test_small_list, out_file)
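
The slice arithmetic used for merging can be checked on its own. This sketch reproduces only the boundary computation from the cell above; the helper name `slice_bounds` and the list lengths are illustrative assumptions.

```python
def slice_bounds(n_items, total):
    """Return the (start, end) index pair of each slice, as the merge loop computes them."""
    slice_size = n_items // total
    bounds = []
    for i in range(total):
        start = i * slice_size
        # The last slice absorbs the remainder of the integer division.
        end = n_items if i == total - 1 else (i + 1) * slice_size
        bounds.append((start, end))
    return bounds


bounds = slice_bounds(103, 20)
print(bounds[0], bounds[-1])
```

With 103 items and 20 slices, every slice holds 5 items except the last, which holds 8. If there are fewer items than slices, `slice_size` is 0 and all items end up in the last slice, so the partial files must have been written with the same rule for the merge to line up.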

Load the merged annotations from test_small_annotation.pkl and inspect the first entry.

In [6]:
import pickle
with open('preprocess_data/small/test_small_annotation.pkl', 'rb') as input_file:
    test_small_list = pickle.load(input_file)
test_small_list[0]

{'chr': '19',
 'pos': '3762700',
 'ref': 'C',
 'alt': 'T',
 'tss_distance': 36,
 'annotation': [{'no': 455219, 'start': 3762700, 'end': 3762736}]}

Add the sequence length to each entry, and add each annotation's position relative to the start of the sequence.

In [13]:
for item in test_small_list:
    item['seq_len'] = abs(item['tss_distance']) + 1
    # Leftmost genomic coordinate of the sequence
    seq_start = min(int(item['pos']), int(item['pos']) + item['tss_distance'])
    for ann in item['annotation']:
        ann['start_rel'] = ann['start'] - seq_start
        ann['end_rel'] = ann['end'] - seq_start
with open('preprocess_data/small/test_small_annotation.pkl', 'wb') as out_file:
    pickle.dump(test_small_list, out_file)
test_small_list[54]

{'chr': '19',
 'pos': '45179662',
 'ref': 'C',
 'alt': 'T',
 'tss_distance': 917,
 'annotation': [{'no': 483931,
   'start': 45179662,
   'end': 45180579,
   'start_rel': 0,
   'end_rel': 917}],
 'seq_len': 918}
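
The relative coordinates can be reproduced by hand. The sketch below applies the same arithmetic to the record shown above and to one made-up record with a negative `tss_distance`; the helper name `add_relative_coords` is hypothetical.

```python
def add_relative_coords(item):
    """Add seq_len and sequence-relative annotation coordinates to one record, in place."""
    pos = int(item['pos'])
    # The sequence starts at whichever is smaller: the variant or the TSS.
    seq_start = min(pos, pos + item['tss_distance'])
    item['seq_len'] = abs(item['tss_distance']) + 1
    for ann in item['annotation']:
        ann['start_rel'] = ann['start'] - seq_start
        ann['end_rel'] = ann['end'] - seq_start
    return item


# Record taken from the output above.
downstream = add_relative_coords({
    'chr': '19', 'pos': '45179662', 'ref': 'C', 'alt': 'T', 'tss_distance': 917,
    'annotation': [{'no': 483931, 'start': 45179662, 'end': 45180579}],
})

# Made-up record with a negative tss_distance, for illustration only.
upstream = add_relative_coords({
    'chr': '1', 'pos': '1000', 'ref': 'A', 'alt': 'G', 'tss_distance': -100,
    'annotation': [{'no': 0, 'start': 920, 'end': 960}],
})

print(downstream['seq_len'], downstream['annotation'][0])
print(upstream['seq_len'], upstream['annotation'][0])
```

In the negative case the sequence starts at position 900 rather than at the variant, so the annotation at 920 to 960 maps to relative positions 20 to 60.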