# Detection of composite transposable elements 
## Notebook 2: Data Preparation
## Description:
Transposable elements are genomic sequences that can change their position within a genome, which is why they are also called “jumping genes”. They can affect the composition and size of genetic replicons. This project focuses on composite transposable elements, which are flanked by two inverted repeats and transposable elements. A composite transposable element moves as one unit within a genome, copying and inserting the genes it encloses. The following traits make composite transposable elements challenging to detect:

1. Terminal information such as repeats or transposable elements, which would in theory determine the boundaries of a composite transposable element, is sometimes missing.
2. Composite transposable elements are diverse in their genetic composition and size. 

Composite transposable elements are usually associated with essential and indispensable genes, which have a high gene frequency across genomes, but also with genes of lower essentiality, which leads to a significant drop in the gene frequency landscape. We hypothesize that the gene frequency landscape of a replicon follows a particular pattern that can be used as a marker for putative regions of composite transposable elements. We therefore present an approach to detect such regions using gene frequencies, protein family clusters and positions of composite transposable elements as input for a supervised LSTM-based neural network model.

### Project Repo 
https://github.com/DMH-dutte/Detection_of_composite_transposable_elements

## Participants:
Dustin Martin Hanke: dhanke@ifam.uni-kiel.de

Wang Yiging: ywang@ifam.uni-kiel.de 

### Course and Semester
Machine Learning with TensorFlow - Wintersemester 2021/2022

---

### Imports:

In [1]:
import matplotlib.pyplot as plt
import numpy as np 
import random
import os

### Load 1D input arrays:

In [4]:
# The arrays are stored on disk as flattened 1D arrays
two_d_data = np.loadtxt('../arrays/2D_freq_mcl.csv', delimiter=',')
frequencies = np.loadtxt('../arrays/frequencies.csv', delimiter=',')
labels = np.loadtxt('../arrays/labels.csv', delimiter=',')

# Restore the original shapes: 807447 chunks of 25 positions each
two_d_data = two_d_data.reshape(807447, 25, 2)
frequencies = frequencies.reshape(807447, 25)
labels = labels.reshape(807447, 25)
print(two_d_data.shape, frequencies.shape, labels.shape)

(807447, 25, 2) (807447, 25) (807447, 25)


### Convert the labels into binary labels:
1. [1] -> the chunk contains a comTE
2. [0] -> the chunk does not contain a comTE

We started with a binary classification approach to check whether our approach to detecting composite transposable elements is valid at all.

In [None]:
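def make_binary(array):
    '''
    Collapses the per-position labels of each chunk into one binary label:
    1 if the chunk contains any labeled position, 0 otherwise
    '''
    new_list = []
    for element in array:
        # A non-zero sum means at least one position of the chunk is labeled
        if np.sum(element) != 0:
            new_list.append(1)
        else:
            new_list.append(0)
    # One label per chunk, shaped (number of chunks, 1)
    new_array = np.array(new_list).reshape(array.shape[0], 1)
    return new_array

binary_array = make_binary(labels)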
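
The loop above can also be written as a single vectorized NumPy expression. The following is a minimal sketch rather than the notebook's original code: `make_binary_vectorized` and `toy_labels` are hypothetical names, and the small toy array stands in for `labels`.

```python
import numpy as np

def make_binary_vectorized(array):
    """Return a (n_chunks, 1) array: 1 if a chunk's label sum is non-zero, else 0."""
    # Sum the labels of each chunk (row) and flag rows whose sum is non-zero
    return (array.sum(axis=1) != 0).astype(int).reshape(-1, 1)

# Toy stand-in for `labels`: 3 chunks of 5 positions each (the real chunks have 25)
toy_labels = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
])
toy_binary = make_binary_vectorized(toy_labels)
print(toy_binary.ravel())  # prints [0 1 1]
```

On an array with several hundred thousand rows, the vectorized form avoids the Python-level loop and is likely to be noticeably faster.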

### Randomly extract samples and labels from the data to obtain a 50/50 distribution of negative/positive samples:

The labels are heavily imbalanced, so we kept all available positive samples and randomly chose an equal number of negative samples from the dataset.
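
The degree of imbalance can be quantified before resampling. The sketch below is an illustration under assumptions: `toy_binary` is a hypothetical stand-in for `binary_array` with made-up values, and the labels are assumed to be integers 0 and 1.

```python
import numpy as np

# Toy stand-in for `binary_array`: shape (n_chunks, 1), integer values 0 or 1
toy_binary = np.array([[0], [0], [0], [0], [1], [0], [0], [1]])

# np.bincount counts how often each non-negative integer value occurs
counts = np.bincount(toy_binary.ravel(), minlength=2)
n_negative, n_positive = counts[0], counts[1]
positive_fraction = n_positive / counts.sum()
print(n_negative, n_positive, positive_fraction)
```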

In [None]:
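reduced_x = list()
reduced_y = list()
all_zeros = list()

# Keep every positive chunk and collect all negative chunks as a sampling pool
for i in range(len(two_d_data)):
    if binary_array[i] == 1:
        reduced_x.append(two_d_data[i])
        reduced_y.append(binary_array[i])
    else:
        all_zeros.append(two_d_data[i])

# Add one randomly drawn negative chunk per positive chunk.
# random.choice draws with replacement, so a negative chunk can be picked more than once.
n_positive = len(reduced_x)
for _ in range(n_positive):
    reduced_x.append(random.choice(all_zeros))
    reduced_y.append(np.array([0]))

reduced_x = np.array(reduced_x)
reduced_y = np.array(reduced_y)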
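
The cell above draws negatives with `random.choice`, i.e. with replacement and without a fixed seed. As an alternative, the following sketch draws distinct negatives reproducibly with NumPy's `Generator.choice`. It is not the notebook's method: `balance_by_undersampling`, the `seed` parameter and the toy arrays are hypothetical, and it assumes there are at least as many negative as positive chunks.

```python
import numpy as np

def balance_by_undersampling(x, y, seed=0):
    """Keep all positive samples and draw an equal number of distinct negatives."""
    rng = np.random.default_rng(seed)
    y_flat = y.ravel()
    pos_idx = np.flatnonzero(y_flat == 1)
    neg_idx = np.flatnonzero(y_flat == 0)
    # replace=False guarantees that no negative chunk is picked twice
    # (requires at least as many negatives as positives)
    picked_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = np.concatenate([pos_idx, picked_neg])
    return x[keep], y[keep]

# Toy data: 10 chunks of shape (25, 2), of which 3 are positive
toy_x = np.arange(10 * 25 * 2, dtype=float).reshape(10, 25, 2)
toy_y = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0]).reshape(-1, 1)
bal_x, bal_y = balance_by_undersampling(toy_x, toy_y)
print(bal_x.shape, bal_y.shape, bal_y.sum())
```

Positives come first and negatives last in the result, so the balanced set would still need shuffling before it is split into training and validation data.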

### Extract chunks containing at least 4 fragmentary comTE positions:
To increase the signal strength of the input chunks, we created an input dataset in which every positive chunk contains at least 4 positions belonging to a comTE.

In [None]:
# Keep every chunk with at least 4 labeled comTE positions as a positive sample
two_d_4 = []
labels4 = []
for i, el in enumerate(labels):
    if np.sum(el) >= 4:
        two_d_4.append(two_d_data[i])
        labels4.append(1)

# Add one randomly drawn negative chunk per positive chunk
# (all_zeros is the pool of negative chunks built in the previous step)
n_positive = len(labels4)
for _ in range(n_positive):
    two_d_4.append(random.choice(all_zeros))
    labels4.append(0)

# Shapes in the original run: (29360, 1) and (29360, 25, 2)
labels4 = np.array(labels4).reshape(-1, 1)
two_d_4 = np.array(two_d_4).reshape(-1, 25, 2)
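
The selection of chunks with at least 4 labeled positions can also be expressed with a boolean mask instead of a loop. This is a sketch on toy data, not the original implementation: `toy_labels`, `toy_two_d`, `min_positions`, `strong_mask`, `strong_chunks` and `strong_labels` are hypothetical names, and the toy chunks have 6 positions instead of 25.

```python
import numpy as np

# Toy stand-ins: 4 chunks of 6 positions (the real data uses 25 positions per chunk)
toy_labels = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
])
toy_two_d = np.arange(4 * 6 * 2, dtype=float).reshape(4, 6, 2)

min_positions = 4
# Boolean mask over chunks: True where the label sum reaches `min_positions`
strong_mask = toy_labels.sum(axis=1) >= min_positions
# Boolean indexing on the first axis keeps the matching chunks intact
strong_chunks = toy_two_d[strong_mask]
strong_labels = np.ones((strong_chunks.shape[0], 1), dtype=int)
print(strong_mask, strong_chunks.shape)
```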

### Save the data:

In [None]:
def save_flat_arrays(where, array, name):
    '''
    Saves flattened np arrays
    '''
    np.savetxt("{}/{}.csv".format(where, name), array, delimiter=',')
    return 

In [None]:
# two_d_data is the array loaded from 2D_freq_mcl.csv above
save_flat_arrays("arrays", frequencies.flatten(), "frequencies")
save_flat_arrays("arrays", two_d_data.flatten(), "2D_freq_mcl")
save_flat_arrays("arrays", labels.flatten(), "labels")
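
Because the arrays are flattened before saving, their shape has to be restored after loading. The round trip can be sketched as follows, using a temporary directory and a hypothetical toy array and file name (`toy`, `toy_2D.csv`) instead of the project's files.

```python
import os
import tempfile
import numpy as np

# Toy array with the same trailing dimensions as two_d_data
toy = np.arange(3 * 25 * 2, dtype=float).reshape(3, 25, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_2D.csv")
    # np.savetxt only accepts 1D or 2D arrays, hence the flatten() before saving
    np.savetxt(path, toy.flatten(), delimiter=',')
    # The shape is not stored in the file, so it is restored after loading
    restored = np.loadtxt(path, delimiter=',').reshape(-1, 25, 2)

print(restored.shape, np.array_equal(toy, restored))
```

Using `-1` for the first dimension lets NumPy infer the number of chunks, so the reshape does not depend on a hard-coded sample count.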