# Preparation of training data for the model

A 100,000-line file is required, with equal representation of all ten genomes and of intergenic, exon, and intron sequences.

In [1]:
import os
print(os.getcwd())

/home/lieberze/DP


## Parameters of the notebook code
Change these to adjust how the output file is generated. No parameter changes are needed in the other chunks. After the notebook run has finished, check the output of the **IMPORTANT CHUNK** to confirm that the operation succeeded.

Recommended:
- N_max = 256, then SuffixOfTheInputFile = 4000
- N_max = 512, then SuffixOfTheInputFile = 5000

In [2]:
import numpy as np

SuffixOfTheInputFile = 5000 # suffix of the input file name, Sample_of_all_genomes_large_<suffix>.txt
N_max = 512 # length of the output file lines (maximum base pairs)
N_min_len = 30 # minimum length of sequence
N_lines_total = 100000 # desired number of lines in the output file All_equal_shuffled.txt
Equal_part = int(np.floor(N_lines_total/3))
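
As a quick check of the arithmetic above: 100,000 is not divisible by three, so the balanced file can hold at most 99,999 lines, which is the count reported by `wc -l` further down. The sketch below uses lower-case, illustrative names that are not part of the notebook.

```python
# Illustrative check of the equal-split arithmetic (names here are hypothetical).
n_lines_total = 100_000
n_classes = 3  # exon, intron, intergenic

# Integer division gives the same value as int(np.floor(n_lines_total / 3)).
equal_part = n_lines_total // n_classes
achievable_total = equal_part * n_classes

print(equal_part, achievable_total)  # 33333 99999
```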

## Paths

In [3]:
import os
RootFolder = "/home/lieberze/DP/"
ThesisFolder = os.path.abspath(os.path.join(RootFolder, 'Thesis/model_training/data/'))
All_genomes = os.path.abspath(os.path.join(ThesisFolder, 'All_genomes_100K_lines'))
All_genomes_file = os.path.abspath(os.path.join(All_genomes, f'Sample_of_all_genomes_large_{SuffixOfTheInputFile}.txt'))

## Statistics of lengths 

In [4]:
exons, introns, intergenic = [], [], [] 
with open(All_genomes_file, "r") as file:
    for line in file:
        LineSplit = line.strip().split()
        Seqtype, Seq = LineSplit[1], LineSplit[-1]
        if Seqtype == "exon":
            exons.append(len(Seq))
        if Seqtype == "intron":
            introns.append(len(Seq))
        if Seqtype == "intergenic":
            intergenic.append(len(Seq))

In [5]:
import pandas as pd
e = pd.DataFrame(exons).describe()
i = pd.DataFrame(introns).describe()
inter = pd.DataFrame(intergenic).describe()
df = pd.concat([e,i,inter], axis=1)
df.columns = ["exons","introns","intergenic"]
df

Unnamed: 0,exons,introns,intergenic
count,25556.0,20627.0,3817.0
mean,386.886524,7312.223154,58005.05
std,810.095436,22418.163155,114215.8
min,1.0,1.0,2.0
25%,96.0,578.0,2752.0
50%,143.0,1726.0,14399.0
75%,265.0,5319.5,54482.0
max,14943.0,601497.0,1238100.0


## Create a training file

Separate exon, intron, and intergenic sequences into individual files

In [6]:
import textwrap
with open(All_genomes_file, "r") as file,\
    open(f"{All_genomes}/{N_max}_bp/exons.txt","w") as exon_file,\
    open(f"{All_genomes}/{N_max}_bp/introns.txt","w") as intron_file,\
    open(f"{All_genomes}/{N_max}_bp/intergenic.txt","w") as intergenic_file:
    for line in file:
        LineSplit = line.strip().split()
        Seqtype, Seq = LineSplit[1], LineSplit[-1]
        lines = textwrap.wrap(Seq, N_max)
        if Seqtype == "exon":
            for j in lines:
                if len(j) > N_min_len:
                    exon_file.write(Seqtype + "\t" + j + "\n")
        if Seqtype == "intron":
            for j in lines:
                if len(j) > N_min_len:
                    intron_file.write(Seqtype + "\t" + j + "\n")      
        if Seqtype == "intergenic":
            for j in lines:
                if len(j) > N_min_len:
                    intergenic_file.write(Seqtype + "\t" + j + "\n")  
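
For a sequence without whitespace, `textwrap.wrap(Seq, N_max)` behaves like plain fixed-width slicing. The following is a minimal sketch of that chunking step on a toy sequence, assuming the same strict `>` length filter as the cell above; `chunk_sequence` is a hypothetical helper, not part of the notebook.

```python
import textwrap

def chunk_sequence(seq, n_max, n_min_len):
    """Split seq into consecutive pieces of at most n_max characters and
    keep only the pieces strictly longer than n_min_len."""
    pieces = [seq[i:i + n_max] for i in range(0, len(seq), n_max)]
    return [p for p in pieces if len(p) > n_min_len]

seq = "ACGT" * 300  # toy 1200 bp sequence, not real data
chunks = chunk_sequence(seq, n_max=512, n_min_len=30)
print([len(c) for c in chunks])  # [512, 512, 176]

# Same pieces as the textwrap-based approach used in the notebook cell.
wrapped = [p for p in textwrap.wrap(seq, 512) if len(p) > 30]
print(chunks == wrapped)
```

Slicing avoids the text-oriented processing in `textwrap`, which may matter for very long intergenic sequences, but this has not been benchmarked here.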

In [7]:
!wc -l {All_genomes}/{N_max}_bp/*

    99999 /home/lieberze/DP/Thesis/model_training/data/All_genomes_100K_lines/512_bp/All_equal_shuffled.txt
    99999 /home/lieberze/DP/Thesis/model_training/data/All_genomes_100K_lines/512_bp/All_equal.txt
    35824 /home/lieberze/DP/Thesis/model_training/data/All_genomes_100K_lines/512_bp/exons.txt
   434122 /home/lieberze/DP/Thesis/model_training/data/All_genomes_100K_lines/512_bp/intergenic.txt
   304119 /home/lieberze/DP/Thesis/model_training/data/All_genomes_100K_lines/512_bp/introns.txt
   974063 total


Create a file with an equal number of exon, intron, and intergenic lines

In [8]:
!shuf -n {Equal_part} {All_genomes}/{N_max}_bp/exons.txt > {All_genomes}/{N_max}_bp/All_equal.txt
!shuf -n {Equal_part} {All_genomes}/{N_max}_bp/introns.txt >> {All_genomes}/{N_max}_bp/All_equal.txt
!shuf -n {Equal_part} {All_genomes}/{N_max}_bp/intergenic.txt >> {All_genomes}/{N_max}_bp/All_equal.txt
!wc -l {All_genomes}/{N_max}_bp/All_equal.txt

99999 /home/lieberze/DP/Thesis/model_training/data/All_genomes_100K_lines/512_bp/All_equal.txt
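
The sampling step can also be sketched in plain Python. Like `shuf -n`, the hypothetical `sample_equal` helper below returns at most `k` lines per class, so a class with too few lines silently yields fewer, which is the situation the final check in this notebook is meant to catch. The toy data and the `seed` parameter are assumptions for illustration; the `shuf -n` calls above are not seeded.

```python
import random

def sample_equal(groups, k, seed=0):
    """Draw up to k lines without replacement from each group of lines."""
    rng = random.Random(seed)
    sampled = []
    for lines in groups.values():
        # Like `shuf -n k`: fewer than k lines come back if the group is small.
        sampled.extend(rng.sample(lines, min(k, len(lines))))
    return sampled

# Toy groups in the "type<TAB>sequence" layout; not real data.
toy = {
    "exon": [f"exon\tE{i}" for i in range(10)],
    "intron": [f"intron\tI{i}" for i in range(20)],
    "intergenic": [f"intergenic\tG{i}" for i in range(4)],
}
balanced = sample_equal(toy, k=5)
print(len(balanced))  # 14, not 15: the intergenic group only has 4 lines
```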


Shuffle the file with a fixed random seed

In [9]:
%%bash -s "$All_genomes" "$N_max"
get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

shuf ${1}/${2}_bp/All_equal.txt --random-source=<(get_seeded_random 42) > ${1}/${2}_bp/All_equal_shuffled.txt
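
A reproducible shuffle can also be sketched in Python with a seeded `random.Random` instance. This is an alternative illustration only: it will not produce the same permutation as the `shuf --random-source` call above, and `seeded_shuffle` is a hypothetical helper.

```python
import random

def seeded_shuffle(lines, seed=42):
    """Return a shuffled copy of lines; the same seed gives the same order."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled

toy_lines = [f"exon\tSEQ{i}" for i in range(10)]  # toy data, not real sequences
first = seeded_shuffle(toy_lines)
second = seeded_shuffle(toy_lines)
print(first == second)  # True: repeated runs with seed 42 agree
```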

**IMPORTANT CHUNK, CHECK OUTPUT** The two numbers should be equal. If they are not, a larger input file is needed.

In [10]:
len1 = Equal_part*3
len2 = !wc -l {All_genomes}/{N_max}_bp/All_equal_shuffled.txt
len2 = int(len2[0].split()[0])  # first field of the first `wc -l` output line

if len1 == len2:
    print("The number of lines is equal to the desired number. We obtained the required result.")
else:
    print(f"The number of lines is smaller than desired. Please use a larger input file instead of {All_genomes_file}.")

The number of lines is equal to the desired number. We obtained the required result.
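
The check above compares only the total line count. A complementary check, sketched here under the assumption that each output line starts with the sequence type followed by a tab (as written by the separation cell), counts lines per type to confirm the classes are balanced. `class_counts` and the toy lines are hypothetical.

```python
from collections import Counter

def class_counts(lines):
    """Count lines per sequence type, read from the first tab-separated field."""
    return Counter(line.split("\t", 1)[0] for line in lines if line.strip())

# Toy lines in the "type<TAB>sequence" layout; not real data.
toy_lines = [
    "exon\tACGT\n", "intron\tGGCC\n", "intergenic\tTTAA\n",
    "exon\tCCGG\n", "intron\tATAT\n", "intergenic\tGCGC\n",
]
counts = class_counts(toy_lines)
is_balanced = len(set(counts.values())) == 1
print(dict(counts), is_balanced)
```

On the real output, the same function could be applied to an open file handle for `All_equal_shuffled.txt`, where each count would be expected to equal `Equal_part`.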


In [12]:
# ! cat {All_genomes}/{N_max}_bp/All_equal_shuffled.txt | head