# Preprocessing for Autonomous Transposable Element Prediction

Both genetic and epigenetic factors strongly influence the expression of transposable elements (TEs). As a result, differential expression analysis of TEs across samples with highly variable genetic landscapes tends to produce noisy results with many false positives. This is because TEs located within or near expressed genes (exonic or intronic) are likely to be co-expressed with them. If the expression of these genes differs between samples, the expression of the neighbouring TEs is likely to differ in the same way. To combat this issue, we can eliminate TEs whose expression is highly linearly correlated with local gene expression. Doing so removes the major confounding variable (local gene expression) when evaluating functional differences between TE expression profiles. This process isolates what we call 'autonomous' transposable element expression.

Assumptions
1. Transposable elements whose expression is linearly correlated with that of local genes (as defined in 'Setting a locality metric' below) are simply 'passenger' transcripts with limited functionality in comparison to the associated gene expression.
2. Collinear expression of local TE(s) and gene(s) is not coincidental.
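
To make the 'passenger' versus 'autonomous' distinction concrete, here is a minimal sketch using made-up TPM values (not real data). The helper `pearson_r` and the three expression vectors are hypothetical and only illustrate the idea; the actual filtering later in this notebook uses `scipy.stats.pearsonr`.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical TPM values across five samples (illustration only).
gene_tpm = [10.0, 20.0, 30.0, 40.0, 50.0]
passenger_te_tpm = [1.1, 2.0, 2.9, 4.2, 5.0]   # rises and falls with the local gene
autonomous_te_tpm = [3.0, 1.0, 4.0, 1.0, 3.0]  # varies independently of the local gene

r_passenger = pearson_r(gene_tpm, passenger_te_tpm)
r_autonomous = pearson_r(gene_tpm, autonomous_te_tpm)

print(f"passenger TE:  r = {r_passenger:.3f}")
print(f"autonomous TE: r = {r_autonomous:.3f}")
```

Under the assumptions above, the first TE would be treated as a passenger of its local gene and removed, while the second would be kept.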

## Input

TE: Transposable element per-locus expression matrix (in TPM) with samples as columns and individual TE expression as rows.
    ** As shown below **
    Column Names: Sample ID
    Row Names: TE name and genomic location of the form 'TEname_RelativeGenomicLocation_Chromosome__Start_Stop'
        RelativeGenomicLocation: (intron/intergenic/exonic)

Gene: Gene expression matrix (in TPM) with samples as columns and individual gene expression as rows.
    Column Names: Sample ID
    Row Names: ENSEMBL ID

## Import 'Input' Dataframes

In this example I am only evaluating a specific subclass of transposable elements, the LINE elements.
- Note that I index the pandas dataframes by the names of the genes and elements.
    - Elements must have the naming convention: 'name_relativegenomiclocation_chromosome__start_end'
        - Example: 'L3_intron_1__19972_20405'
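
As a quick check of the naming convention, the sketch below splits one element name into its parts. The function `parse_te_name` is a hypothetical helper, not part of the notebook's pipeline, and it is stricter than the loop used later: it assumes that neither the TE name nor the chromosome contains an underscore, which may not hold for every annotation.

```python
def parse_te_name(name):
    """Split 'name_location_chromosome__start_end' into its parts.

    Returns None when the name does not follow the convention.
    """
    parts = name.split('_')
    # The double underscore before 'start' produces one empty field at index 3.
    if len(parts) != 6 or parts[3] != '':
        return None
    te, location, chromosome, _, start, end = parts
    if not (start.isdigit() and end.isdigit()):
        return None
    return {
        'name': te,
        'location': location,
        'chromosome': chromosome,
        'start': int(start),
        'end': int(end),
    }


print(parse_te_name('L3_intron_1__19972_20405'))
```

Names that return None here are the ones worth reviewing by hand before running the rest of the notebook.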

In [10]:
import pandas as pd

LINE = pd.read_csv('~/PycharmProjects/TransposableElements/Data/tpm_LINE.csv', index_col='Unnamed: 0')
Gene = pd.read_csv('~/PycharmProjects/TransposableElements/Data/TCGAgene.csv', index_col='Gene')
# in my case I needed to isolate ensembl IDs
Gene.index = [x.split('.')[0] for x in Gene.index.values.tolist()]
print(Gene.head())
print(LINE.head())


## Setting a locality metric

According to Ribeiro et al. (https://www.nature.com/articles/s42003-022-03831-w), genes are considered 'immediately local' if they are <100 kb from their neighbour and 'nearby' if they are <1 Mb from their neighbour. However, a specific value for such a concept is very difficult to define. For this analysis I will use the 'immediately local' value.

In [None]:
locality = 100_000  # 100 kb ('immediately local'), in base pairs

1. Prepare the TE dataframe for the BEDTools intersection
* To identify 'immediately local' genes we also need to add 'locality margins' to our TE locations by subtracting our locality metric from 'start' and adding it to 'end', as shown below

format:
    Chr[Z] start[- locality] end[+ locality]
    Chr[Z] start[- locality] end[+ locality]
    Chr[Z] start[- locality] end[+ locality]
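
For a single element, the margin arithmetic can be sketched as follows. The helper `add_locality_margin` is hypothetical, and the 100 kb margin is just the 'immediately local' threshold quoted above, used here for illustration. The sketch clamps the start at 0 because BED coordinates cannot be negative; it does not clamp the end at the chromosome length, which I assume is harmless for an intersection.

```python
def add_locality_margin(chromosome, start, end, locality):
    """Return a BED-style row widened by `locality` bp on each side."""
    padded_start = max(0, start - locality)  # BED coordinates cannot be negative
    padded_end = end + locality
    return ['chr' + str(chromosome), padded_start, padded_end]


# Example element 'L3_intron_1__19972_20405' with a 100 kb margin.
row = add_locality_margin('1', 19972, 20405, 100_000)
print(row)
```

An element this close to the start of the chromosome would otherwise get a negative start coordinate.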


In [None]:
elementLocations = []

for element in LINE.index:
    # expected form: 'name_relativegenomiclocation_chromosome__start_end'
    if '_' not in element:
        print('please review elements string format')
        continue
    components = element.split(',')[0].split('_')
    if len(components) < 6:
        continue
    chromosome = 'chr' + str(components[2])
    start = components[4]
    end = components[5]
    if start == '' or end == '':
        continue
    # widen the TE by the locality margin; a BED start cannot be negative
    padded_start = max(0, int(start) - locality)
    elementLocation = [chromosome, padded_start, int(end) + locality, element]
    elementLocations.append(elementLocation)

I also need to prepare a BED file for my reference genes.

To do so I first need to extract the genomic locations of these genes based on their ENSEMBL IDs:
1. Download the reference annotation matching the genome used to align your samples; in my case this is GRCh38. To do so, run 'pyensembl install --release [release number] --species homo_sapiens' in your terminal (Ensembl release 77 uses GRCh38).
2. Extract the genomic location of each gene (chr, start, stop, gene name).

In [None]:
import pyensembl

data = pyensembl.EnsemblRelease(77)

geneLocations = []
for ensembl in Gene.index:
    # look each gene up once; skip IDs that are not in this Ensembl release
    # (the try/except sits inside the loop so one missing ID does not end it)
    try:
        gene = data.gene_by_id(ensembl)
    except ValueError:
        continue
    chromosome = 'chr' + str(gene.contig)
    geneLocation = [chromosome, int(gene.start), int(gene.end), str(gene.gene_name)]
    geneLocations.append(geneLocation)

## Perform intersection

In [None]:
import pybedtools

TE = pybedtools.BedTool(elementLocations)
GENE = pybedtools.BedTool(geneLocations)

intersect = TE.intersect(GENE, wo=True)
intersectdf = pd.read_table(intersect.fn, names= ['TEchr', 'TEstart', 'TEend', 'TEname', 'GENEchr', 'GENEstart', 'GENEend', 'GENEname', 'OVERLAP'])
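
The last column of this table comes from the `wo=True` option (bedtools `-wo`), which reports the number of base pairs shared by each TE/gene pair. As a rough illustration of what that number means, here is a minimal sketch for two half-open BED intervals. The function `overlap_bp` and the coordinates are hypothetical and are not how bedtools is implemented.

```python
def overlap_bp(start_a, end_a, start_b, end_b):
    """Number of shared base pairs between two half-open intervals [start, end)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b))


# Made-up coordinates: a padded TE window and two genes on the same chromosome.
padded_te = (0, 120405)
gene_inside = (69091, 70008)     # lies entirely within the padded TE window
gene_outside = (200000, 250000)  # beyond the locality margin

print(overlap_bp(*padded_te, *gene_inside))
print(overlap_bp(*padded_te, *gene_outside))
```

Only pairs with a non-zero overlap appear in `intersectdf`, so every row is a TE with at least one gene inside its locality window.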

## Correlate expression of TE and local gene

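1. Although not necessary for small datasets, I am going to create a dictionary with 'local genes' as keys and 'local TEs' as values.
2. Compute the Pearson correlation between each local gene and each of its local TEs, and drop the TEs with r > 0.70 and p < 0.05.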

In [None]:
LocalGeneTEPairs = {}

for _, row in intersectdf.iterrows():
    TEname = row['TEname']
    # map the gene name back to its ENSEMBL ID (first match if the name is ambiguous)
    GENEid = data.gene_ids_of_gene_name(row['GENEname'])[0]
    LocalGeneTEPairs.setdefault(GENEid, []).append(TEname)
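
The same gene-to-TE grouping can be sketched on toy data with `collections.defaultdict`, which avoids the explicit key check. The pairs below are made up (the ENSEMBL-style IDs are placeholders, not real genes) and stand in for rows of the intersection table.

```python
from collections import defaultdict

# Hypothetical (TE name, gene ID) pairs; a TE can be local to several genes.
pairs = [
    ('L3_intron_1__19972_20405', 'ENSG_PLACEHOLDER_A'),
    ('L2_intergenic_1__30000_30500', 'ENSG_PLACEHOLDER_A'),
    ('L3_intron_1__19972_20405', 'ENSG_PLACEHOLDER_B'),
]

gene_to_tes = defaultdict(list)
for te_name, gene_id in pairs:
    gene_to_tes[gene_id].append(te_name)

for gene_id, te_names in gene_to_tes.items():
    print(gene_id, te_names)
```

Grouping by gene means each gene's expression row only has to be looked up once per group in the correlation step.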

In [None]:
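import scipy.stats as stats

# restrict the gene matrix to the samples present in the TE matrix
tepatients = LINE.columns.values.tolist()
Genefilter = Gene.loc[:, tepatients]

CorrTEid = []

for gene in LocalGeneTEPairs.keys():
    for te in LocalGeneTEPairs[gene]:
        try:
            stat, pval = stats.pearsonr(Genefilter.loc[gene, :], LINE.loc[te, :])
        except (KeyError, ValueError):
            # gene or TE missing from the expression matrices, or not a single row
            continue
        # strong, significant positive correlation: treat the TE as a passenger
        if stat > 0.70 and pval < 0.05:
            CorrTEid.append(te)

# a TE can be local to several genes, so de-duplicate before dropping
CorrTEid = list(set(CorrTEid))
CleanLINE = LINE.drop(CorrTEid)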

In [None]:
print(LINE.shape)
print(CleanLINE.shape)