# Boundary detection of composite transposable elements

## Notebook: Outputs predictions from the NER model to detect transposable elements - This script outputs the final probabilities and predicted values used to evaluate the NER predictions

### Description:

Transposable elements are genomic sequences that can change their position within a genome, which is why they are also called “jumping genes”. They can affect the composition and size of genetic replicons. This project focuses on composite transposable elements, which are flanked by two inverted repeats and transposable elements. A composite transposable element moves as one unit within a genome, copying and inserting the genes it encloses. The following traits make composite transposable elements hard to detect:

- Terminal information such as repeats or transposable elements, which would in theory determine the boundaries of a composite transposable element, is sometimes missing.
- Composite transposable elements are diverse in their genetic composition and size.
- Composite transposable elements are usually associated with essential and indispensable genes, which have a high gene frequency across genomes, but also with genes of lower essentiality, which leads to a significant drop in the gene frequency landscape.

We hypothesize that the gene frequency landscape of a replicon follows a particular pattern, which can be used as a marker for putative regions of composite transposable elements. We therefore present an approach to detect the boundaries of composite transposable elements using the language of genome sequences, represented as sequences of protein family labels.
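To make the idea of a frequency landscape concrete, here is a minimal sketch. It assumes that frequency means the fraction of genomes in which a protein family occurs; the source does not define the measure, and the function names and toy labels (`famA`, `famX`, ...) are hypothetical. It is not the method used by the notebook, which relies on an NER model.

```python
from collections import Counter

def family_frequency(genomes):
    """Fraction of genomes in which each protein family label occurs (assumed definition)."""
    counts = Counter()
    for labels in genomes:
        counts.update(set(labels))  # count each family at most once per genome
    return {fam: n / len(genomes) for fam, n in counts.items()}

def frequency_landscape(contig, freq):
    """Per-position frequency values along one contig."""
    return [freq.get(label, 0.0) for label in contig]

# Hypothetical toy genomes: protein family labels in gene order.
genomes = [
    ["famA", "famB", "famC", "famD"],
    ["famA", "famB", "famX", "famD"],
    ["famA", "famB", "famD"],
    ["famA", "famC", "famD"],
]
freq = family_frequency(genomes)
landscape = frequency_landscape(genomes[1], freq)
print(landscape)  # the rare family famX shows up as a dip
```

In this toy case the third position of the second genome has a much lower value than its neighbours, which is the kind of drop the hypothesis refers to.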

### Project Repo:

https://github.com/DMH-dutte/Boundary_detection_comTEs

### Participants:

Nicolas Steen

Friedrich Kerchnawe

Wang Yiqing

Dustin Martin Hanke

### Course and Semester:

Transformers for Natural Language Processing and Beyond - Summer Semester 2022

# TEdetection: NER_predictions_from_files_V2

In [1]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import shelve

2022-06-09 18:41:43.526427: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


## set parameters

In [2]:
input_dir = os.path.join("..", "data", "test_input_files", "contigs")
output_file = os.path.join("..", "data", "test_input_files", "contig.filter")

chunk_len = 150
chunk_offset = 50

## get tokenize_table

In [3]:
results = shelve.open(os.path.join("..", "data", "prep_trainvalidationtest_1"))
tokenize_table = results["tokenize_table"]
results.close()
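A shelf is only closed when `close` is actually called (with parentheses) or when it is used as a context manager. The following sketch shows the context-manager form on a temporary shelf; the shelf name and the three-entry table are made-up stand-ins for the real `tokenize_table`.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prep_demo")  # hypothetical shelf name

    # Write a tiny stand-in for the real tokenize_table.
    with shelve.open(path) as db:
        db["tokenize_table"] = {"pad": 0, "mcl00001": 1, "mcl00002": 2}

    # Read it back; the shelf is closed when the with-block exits.
    with shelve.open(path) as db:
        demo_table = db["tokenize_table"]

print(demo_table["mcl00002"])
```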

## read data

In [4]:
def prep_chunks_from_files(
               input_dir,
               tokenize_table,
               chunk_len = 150,
               chunk_offset = 50):
               
    """-------------------------------------------------------------
    prepare data
    Read original data from input_dir, tokenize the protein families
    with the tokenize_table.
    
    Create chunks from this data of size "chunk_len" and with an 
    offset of "chunk_offset". Define these parameters below.
    
    Output:
    
    "df_chunked" (pandas DataFrame)
        contains all chunks with token_ids.  
        Other columns in this dataframe serve for finding the data 
        quickly in the original files  
    -------------------------------------------------------------"""
    
    col_origin = []
    col_tokens = []
    col_length = []

    for sequence in os.listdir(input_dir):
        df_c = pd.read_csv(os.path.join("..", "data", "original_data", "contigs_func", sequence), 
                           sep = "\t", 
                           names = ("tokens", "strand"), 
                           skiprows = (1) )
        contig_id = sequence.split(".")[0]        
        tokens = df_c["tokens"].to_list()
        length = len(tokens)
        
        col_origin.append(sequence)
        col_tokens.append(tokens)
        col_length.append(length)
        
    df_a = pd.DataFrame({ "origin": col_origin,
                          "tokens": col_tokens,
                          "length": col_length 
                        })
                        
    # assign token_ids to tokens
    df_a["token_ids"] = df_a["tokens"].apply(lambda x: [ tokenize_table[t] for  t in x])
    
    # compute the chunks
    origin = []
    chunk = []
    dset = []
    tokens = []
    token_ids = []
    attention_masks = []

    for index, row in df_a.iterrows():
        l = len(row["token_ids"])
        raw_t = np.zeros(max(l + chunk_offset, chunk_len)).astype(int)
        raw_t[0: l] = row["token_ids"]
        raw_masks = np.zeros(max(l + chunk_offset, chunk_len)).astype(int)
        raw_masks[0:l] = np.ones(l).astype(int)
        raw_tokens = ['pad' for i in range(0, max(l + chunk_offset, chunk_len))]
        raw_tokens[0: l] = row["tokens"]
        chunk_n = 1

        for i in range(0,max(1, l + chunk_offset - chunk_len), chunk_offset):
            origin.append(row["origin"])
            chunk.append(chunk_n)
            token_ids.append(raw_t[i:i+chunk_len])
            attention_masks.append(raw_masks[i:i+chunk_len])
            tokens.append(raw_tokens[i:i+chunk_len])
            chunk_n = chunk_n + 1


    df_chunked = pd.DataFrame({"origin": origin,
                               "chunk": chunk,
                               "tokens": tokens,
                               "token_ids": token_ids,
                               "attention_masks" : attention_masks
                             })
    
    return df_chunked
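
The windowing used above can be checked in isolation. This is a minimal sketch that reuses the same `range` expression as the function; the helper names `chunk_starts` and `real_tokens_in_chunk` are hypothetical and do not appear in the notebook.

```python
def chunk_starts(length, chunk_len=150, chunk_offset=50):
    """Start indices of the sliding windows created for a contig of `length` tokens."""
    return list(range(0, max(1, length + chunk_offset - chunk_len), chunk_offset))

def real_tokens_in_chunk(length, start, chunk_len=150):
    """Number of non-padding positions in the window that begins at `start`."""
    return max(0, min(length, start + chunk_len) - start)

for length in (120, 150, 260, 400):
    starts = chunk_starts(length)
    last = starts[-1]
    real = real_tokens_in_chunk(length, last)
    # The last window always reaches the end of the contig, so the start of
    # the last window plus its number of real tokens gives back the length.
    assert last + real == length
    print(length, starts, real)
```

With a window of 150 and an offset of 50, interior positions are covered by up to three windows, and contigs shorter than 150 tokens produce a single padded window.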

In [5]:
df = prep_chunks_from_files(
               input_dir,
               tokenize_table,
               chunk_len = chunk_len,
               chunk_offset = chunk_offset)

In [6]:
df

Unnamed: 0,origin,chunk,tokens,token_ids,attention_masks
0,8033.fasta,1,"[mcl06955, mcl06642, mcl06735, mcl03626, mcl03...","[13439, 15808, 19615, 8359, 8360, 8361, 6251, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,8033.fasta,2,"[mcl05643, mcl05569, mcl00020, mcl05599, mcl05...","[7493, 7492, 878, 7491, 7490, 7489, 7488, 1171...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,9198.fasta,1,"[mcl0000X, mcl01971, mcl02107, mcl01276, mcl01...","[31, 2986, 2985, 2984, 2982, 2981, 2980, 2979,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,9198.fasta,2,"[mcl06606, mcl02218, mcl03974, mcl03972, mcl02...","[4982, 2961, 4983, 4984, 4985, 4986, 4987, 498...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,9198.fasta,3,"[mcl01240, mcl02266, mcl02094, mcl00911, mcl00...","[2928, 2927, 2926, 2925, 2924, 2923, 2922, 292...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
...,...,...,...,...,...
234558,9730.fasta,83,"[mcl03236, mcl03263, mcl03352, mcl03231, mcl01...","[3343, 3342, 3341, 3339, 3338, 3337, 3336, 599...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
234559,9730.fasta,84,"[mcl02365, mcl00092, mcl00105, mcl01029, mcl03...","[3232, 3231, 3230, 3229, 3228, 3227, 3226, 322...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
234560,9730.fasta,85,"[mcl01590, mcl02602, mcl02510, mcl02447, mcl02...","[3183, 3182, 6110, 3181, 3180, 3179, 3178, 317...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
234561,9730.fasta,86,"[mcl04176, mcl07059, mcl04103, mcl04082, mcl00...","[7231, 7230, 7229, 7228, 3143, 3142, 6117, 611...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


## extract token ids and attention masks for NER task

In [7]:
df_tkn = df["token_ids"]
tokens = np.array([x for x in df_tkn.tolist()])
df_att = df["attention_masks"]
attention_masks = np.array([x for x in df_att.tolist()])

## load the NER model

In [8]:
from transformers import TFAutoModelForTokenClassification
  
#tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForTokenClassification.from_pretrained("models/NER_1")

2022-06-09 18:44:26.149353: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-09 18:44:28.004689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30986 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
2022-06-09 18:44:28.006578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 30986 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0
2022-06-09 18:44:28.008262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/rep

In [9]:
model.summary()

Model: "tf_distil_bert_for_token_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 67715328  
 nLayer)                                                         
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 67,716,866
Trainable params: 67,716,866
Non-trainable params: 0
_________________________________________________________________
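
The parameter counts in the summary can be sanity-checked with a little arithmetic. The sketch below assumes a hidden size of 768 and two output labels; these values are not stated in the notebook and are only inferred because they reproduce the 1538 classifier parameters.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

hidden_size = 768   # assumption, consistent with the classifier size
num_labels = 2      # assumption: TE vs. not TE

classifier = dense_params(hidden_size, num_labels)
total = 67715328 + classifier  # distilbert main layer + classifier
print(classifier, total)
```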


## Functions

In [10]:
def merge_chunks(sequence_chunks):
    n = np.shape(sequence_chunks)[0]
    l = np.shape(sequence_chunks)[1]
    output_sequence = np.zeros(n * l - (n-1) * (l-chunk_offset))
    p = 0
    for chunk in sequence_chunks:
        output_sequence[p:p+l] = output_sequence[p:p+l] + chunk
        p = p + chunk_offset
    
    return output_sequence.astype(bool).astype(int)

In [11]:
def merge_chunks2(sequence_chunks):
    n = np.shape(sequence_chunks)[0]
    l = np.shape(sequence_chunks)[1]
    output_sequence = np.zeros(n * l - (n-1) * (l-chunk_offset))
    p = 0
    for chunk in sequence_chunks:
        output_sequence[p:p+l] = chunk
        p = p + chunk_offset
    
    return output_sequence.astype(float)
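
The two merge functions treat overlapping windows differently: `merge_chunks` sums and then casts to bool, so a position is 1 if any window predicted 1, while `merge_chunks2` overwrites, so the value from the last window covering a position wins. A pure-Python sketch of both behaviours on toy data (window length 4, offset 2; the function names are hypothetical):

```python
def merge_or(chunks, offset):
    """Union of overlapping binary windows: 1 if any window predicts 1."""
    chunk_len = len(chunks[0])
    merged = [0] * (chunk_len + (len(chunks) - 1) * offset)
    for k, chunk in enumerate(chunks):
        for j, value in enumerate(chunk):
            merged[k * offset + j] |= int(bool(value))
    return merged

def merge_last_wins(chunks, offset):
    """Overwrite overlapping positions: the latest window wins."""
    chunk_len = len(chunks[0])
    merged = [0.0] * (chunk_len + (len(chunks) - 1) * offset)
    for k, chunk in enumerate(chunks):
        merged[k * offset:k * offset + chunk_len] = chunk
    return merged

labels = [[0, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
probs = [[0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.3, 0.2], [0.1, 0.2, 0.3, 0.4]]

merged_labels = merge_or(labels, offset=2)
merged_probs = merge_last_wins(probs, offset=2)
print(merged_labels)
print(merged_probs)
```

One consequence worth keeping in mind is that a merged label and its merged probability can come from different windows.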

### Don't use len to assign variables

In [12]:
'''
def merge_chunks2(sequence_chunks):
    
    merged_chunk = []
    for i, chunk in enumerate(sequence_chunks):
        chunk = list(chunk)
        if i == len(sequence_chunks)-1:
            merged_chunk.append(chunk[25:])
            continue
        if i == 0:
            merged_chunk.append(chunk[0:125])
        else:
            merged_chunk.append(chunk[25:125])
            #merged_chunk.append(sequence_chunks[i+1][100:151])
    
    flat_list = [x for xs in merged_chunk for x in xs]
    output_sequence = np.array(flat_list)
    return output_sequence.astype(float)
'''


In [13]:
def get_prediction_element(i, predictions):
    probs = tf.nn.softmax(predictions[0, i])
    result = tf.math.top_k(probs, k=2)
    pred = [result.indices.numpy(), result.values.numpy()]
    TE_prob = str(round(pred[1][0], 2))
    return pred[0][0], TE_prob
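
What `get_prediction_element` computes per position can be sketched without TensorFlow: a softmax over the logits, followed by picking the most probable class and rounding its probability. The helper names and the toy logits below are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1(logits):
    """Index of the most probable class and its probability, rounded to 2 digits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, round(probs[best], 2)

print(top1([2.0, 0.0]))
print(top1([-1.0, 1.0]))
```

Note that the returned probability belongs to the winning class. For positions predicted as class 0 it is therefore the probability of class 0, not of a transposable element.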

In [14]:
def get_prediction_chunk(input_chunk):
    outputs = model(input_chunk)           # call the model to predict TE positions
    logits = outputs[0]
    n = len(input_chunk)                   # chunk length (150 with the parameters set above)
    pred_arr = np.zeros(n)
    pred_TE_prob = np.zeros(n)
    for i in range(n):
        # predicted class at position i and the probability of that class
        pred_arr[i], pred_TE_prob[i] = get_prediction_element(i, logits)
    return pred_arr.astype(int), pred_TE_prob.astype(float)

In [15]:
#c_arr = df[df["origin"] == test_origin]["token_ids"].to_numpy()
#c_arr = np.array(c_arr.tolist())
df[df["origin"]=='7235.fasta']

Unnamed: 0,origin,chunk,tokens,token_ids,attention_masks
2298,7235.fasta,1,"[mcl01584, mcl01943, mcl02101, mcl02243, mcl01...","[3551, 3549, 3548, 3547, 3546, 3545, 24390, 35...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2299,7235.fasta,2,"[mcl00758, mcl01548, mcl00046, mcl02368, mcl02...","[3762, 3763, 3764, 3766, 3767, 3768, 3769, 377...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2300,7235.fasta,3,"[mcl02674, mcl00848, mcl01677, mcl01595, mcl00...","[3813, 3814, 3816, 3817, 3818, 3819, 3855, 385...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2301,7235.fasta,4,"[mcl00404, mcl00192, mcl00172, mcl04802, mcl01...","[3484, 3483, 3482, 23348, 3480, 7635, 7636, 76...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2302,7235.fasta,5,"[mcl11472, mcl04535, mcl11710, mcl12269, mcl12...","[23355, 5162, 23356, 25417, 25416, 3461, 7134,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
...,...,...,...,...,...
2391,7235.fasta,94,"[mcl00878, mcl02236, mcl00804, mcl00765, mcl00...","[4028, 4027, 4026, 4025, 4024, 4022, 4021, 402...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2392,7235.fasta,95,"[mcl01446, mcl01223, mcl01154, mcl01713, mcl00...","[3975, 3974, 3973, 3972, 3971, 3970, 18247, 14...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2393,7235.fasta,96,"[mcl05111, mcl02328, mcl04982, mcl04868, mcl02...","[24357, 3911, 24358, 24359, 3909, 3908, 3907, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2394,7235.fasta,97,"[mcl00839, mcl00250, mcl04731, mcl02681, mcl01...","[3695, 3694, 15532, 3693, 3692, 3691, 3690, 36...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."



In [16]:
def get_prediction_sequence(input_sequence_chunks):
    output_sequence_chunks = np.zeros(np.shape(input_sequence_chunks))
    output_sequence_TE_probs = np.zeros(np.shape(input_sequence_chunks))
    i = 0
    #output_sequence_TE_probs[i] 
    for chunks in input_sequence_chunks:
        output_sequence_chunks[i], output_sequence_TE_probs[i]  = get_prediction_chunk(input_sequence_chunks[i])
        '''
        if output_sequence_chunks[i].sum() > 0:
            #print(chunks)
            print(output_sequence_TE_probs[i])
        '''
        i = i + 1
        #print(output_sequence_TE_probs)
    return merge_chunks(output_sequence_chunks), merge_chunks2(output_sequence_TE_probs)
    

In [106]:

#get_prediction_sequence(c_arr[0:1])



## Predict all data

In [215]:
saver = df_predictions

In [19]:
already_processed = []
with open('output_check.txt', 'r') as f:
    for line in f:
        already_processed.append(line.strip())
print(len(already_processed))

0
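
The check file makes the prediction loop resumable: contigs that are already listed are skipped on the next run. A minimal sketch of that pattern, run twice against a temporary file; the helper names are hypothetical, and unlike the cell above it tolerates a missing check file.

```python
import os
import tempfile

def load_done(path):
    """Return the set of IDs already listed in the check file (empty if it is missing)."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def mark_done(path, item):
    """Append one processed ID to the check file."""
    with open(path, "a") as f:
        f.write(f"{item}\n")

with tempfile.TemporaryDirectory() as tmp:
    check_file = os.path.join(tmp, "output_check.txt")
    processed_runs = []
    for _ in range(2):  # simulate two runs of the same loop
        done = load_done(check_file)
        run = []
        for origin in ["8033.fasta", "9198.fasta"]:
            if origin in done:
                continue
            run.append(origin)          # placeholder for the real prediction
            mark_done(check_file, origin)
        processed_runs.append(run)

print(processed_runs)
```

In this sketch an item is marked only after its placeholder work is done, so an interrupted item would be retried on the next run.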


In [None]:
preds = []
orgs = []
te_preds = []

#check what have been done already
already_processed = []
with open('output_check.txt', 'r') as f:
    for line in f:
        already_processed.append(line.strip())
        
        
        

for i, test_origin in enumerate(df["origin"].unique()):
    if test_origin in already_processed:
        continue
    print(i, test_origin)
    with open('output_check.txt', 'a') as f:
        f.write('{}\n'.format(test_origin))
        
        
        
    #if test_origin == '7751.fasta':
    #    break
#for test_origin in ['7235.fasta', '3885.fasta']:
#for test_origin in ['8584.fasta']: #[275 8584.fasta]
    count = df[df["origin"] == test_origin]["attention_masks"].count()
    len_sequence = chunk_offset*(count-1)+df[(df["origin"] == test_origin) & (df["chunk"] == count)]["attention_masks"].iloc[0].sum()

    c_arr = df[df["origin"] == test_origin]["token_ids"].to_numpy()
    c_arr = np.array(c_arr.tolist())
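    try:
        predictions, TE_probs = get_prediction_sequence(c_arr)
    except ValueError:
        continue
    # trim the padding: slice each array, not the returned tuple
    predictions = predictions[:len_sequence]
    TE_probs = TE_probs[:len_sequence]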
    preds.append(predictions)
    orgs.append(test_origin)
    te_preds.append(TE_probs)
    

df_predictions = pd.DataFrame({ "origin" : orgs,
                                "predictions": preds,
                               "probability": te_preds
                             })
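
When trimming padding, note that a function returning two arrays returns a tuple, and slicing that tuple with `[0:len_sequence]` selects among the two arrays rather than shortening them. A minimal sketch with a stand-in function (hypothetical, not the model):

```python
def fake_prediction_sequence(n):
    """Stand-in returning (labels, probabilities), both of padded length n."""
    return [0] * n, [0.5] * n

len_sequence = 7
pair = fake_prediction_sequence(10)

# Slicing the tuple only selects among its two elements.
still_padded = pair[0:len_sequence]
print(len(still_padded), len(still_padded[0]))

# Slicing each array removes the padded tail.
labels, probs = pair
labels, probs = labels[:len_sequence], probs[:len_sequence]
print(len(labels), len(probs))
```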

0 8033.fasta
1 9198.fasta
2 8779.fasta
3 325.fasta
4 6408.fasta
5 3028.fasta
6 10182.fasta
7 2925.fasta
8 4243.fasta
9 7120.fasta
10 10515.fasta
11 1693.fasta
12 2014.fasta
13 7277.fasta
14 7751.fasta
15 2719.fasta
16 9000.fasta
17 1872.fasta
18 4176.fasta
19 7270.fasta
20 5180.fasta
21 10528.fasta
22 3731.fasta
23 2585.fasta
24 5726.fasta
25 1274.fasta
26 2183.fasta
27 2663.fasta
28 6871.fasta
29 6583.fasta
30 4871.fasta
31 5428.fasta
32 3633.fasta
33 4275.fasta
34 9575.fasta
35 5814.fasta
36 1567.fasta
37 3398.fasta
38 4557.fasta
39 9535.fasta
40 798.fasta
41 3982.fasta
42 7668.fasta
43 9620.fasta
44 6085.fasta
45 6156.fasta
46 6907.fasta
47 2512.fasta


In [None]:
df_predictions.to_csv('predictions.csv', sep='\t')

## create the filter output

### Note: don't use len to assign variables



In [None]:
contig_ID = []
start = []
end = []
lengths = []
probs = []
probs_all = []

s = 0
i = 0
in_TE = False
for row in df_predictions.iterrows():

    for i in range(0, np.shape(row[1]["predictions"])[0]):
        if row[1]["predictions"][i]:
            prob = str(row[1]["probability"][i])
            if in_TE:
                e = i + 1
                probs.append(prob)
            else:
                c_ID = row[1]["origin"].split(".")[0]
                s = i + 1
                e = i + 1
                in_TE = True
                
                #
                #prob = row[1]["probability"][i]
                probs.append(prob)
        else:
            if in_TE:
                contig_ID.append(c_ID)
                start.append(s)
                end.append(e)
                lengths.append(e - s + 1)
                in_TE = False
                #
                probs_all.append(','.join(probs))
                probs = []
                
    if in_TE: # when switching to the next origin: close a TE that was still open at the end of the previous origin
        contig_ID.append(c_ID)
        start.append(s)
        end.append(e)
        lengths.append(e  -s + 1)
        #probs.append(prob)
        in_TE = False
        #
        probs_all.append(','.join(probs))
        probs = []

df_filter = pd.DataFrame({ "contig_ID": contig_ID,
                           "start": start,
                           "end": end,
                           "len": lengths,
                          "probabilities": probs_all
                        })

df_filter.to_csv('df_filter.csv')
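
The loop above is a run-length extraction: every maximal run of positive predictions becomes one row with a 1-based inclusive start and end. The same idea can be sketched compactly with `itertools.groupby`; the function name `positive_runs` is hypothetical.

```python
from itertools import groupby

def positive_runs(predictions):
    """Return 1-based inclusive (start, end, length) for each run of positives."""
    runs = []
    pos = 0
    for value, group in groupby(predictions, key=bool):
        n = len(list(group))
        if value:
            runs.append((pos + 1, pos + n, n))
        pos += n
    return runs

runs = positive_runs([0, 1, 1, 0, 0, 1, 0, 1, 1, 1])
print(runs)
```

A run that reaches the end of the sequence is closed automatically here, which corresponds to the extra `if in_TE` block after the inner loop.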

In [None]:
original_labels = pd.read_csv('../data/original_data/tn.contig.filter.tsv', sep='\t')

labels_all = []
for contig in df_filter["contig_ID"].unique():
    #print(contig)
    tmp = original_labels[original_labels["contig_ID"]==int(contig)]
    positives = set()
    for x, y in zip(tmp["start"].to_list(), tmp["end"].to_list()):
        for i in range(x, y+1):
            positives.add(i)

    tmp2 = df_filter[df_filter["contig_ID"]==contig]
    for x2, y2 in zip(tmp2["start"].to_list(), tmp2["end"].to_list()):
        labels = []
        for i in range(x2, y2+1):
            if i in positives:
                labels.append('1')
            else:
                labels.append('0')
        labels_all.append(','.join(labels))
df_filter["labels"] = labels_all
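
The labelling step marks, for every position of a predicted region, whether it lies inside an annotated region. A small sketch of that comparison with made-up intervals (the helper names and coordinates are hypothetical):

```python
def positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered

def label_run(start, end, truth):
    """Comma-separated 1/0 string for the positions start..end of a predicted run."""
    return ",".join("1" if p in truth else "0" for p in range(start, end + 1))

truth = positions([(3, 5), (9, 9)])   # hypothetical annotated regions
first = label_run(4, 7, truth)
second = label_run(8, 10, truth)
print(first)
print(second)
```

Each label string has as many entries as the predicted run has positions, so it lines up with the probabilities column of the same row.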


In [None]:
df_filter.to_csv('df_filter2.csv')

In [197]:

df_filter.to_csv('final_results', sep='\t')

In [31]:
df_filter[(df_filter["contig_ID"]=='6408') & (df_filter["start"]>=144) ]


Unnamed: 0,contig_ID,start,end,len,probabilities,labels
21,6408,144,151,8,"0.61,0.81,0.94,0.91,0.73,0.67,0.75,0.52",11111100
22,6408,153,169,17,"0.87,0.59,0.74,0.69,0.74,0.5,0.79,0.89,0.57,0....",11000011100001110
23,6408,171,178,8,"0.97,0.96,0.83,0.96,0.55,0.94,0.95,0.91",11110110
24,6408,227,229,3,"0.7,0.98,0.76",11


## save filter to file

In [None]:
df_filter.to_csv(output_file, sep='\t', header=True, index=True)