#### Aim

This notebook extracts FPKM (fragments per kilobase of transcript per million mapped fragments) values from eXpress results of mRNA sequencing data. The expression of the genes of interest is compared between the target knockdowns and the control knockdown. The control knockdown is ND7; the target knockdowns are AS17 and PS17. Several developmental stages are available (starved-veg: 1-10% fragmentation; 20: 10-30% fragmentation; 50: 40-50% fragmentation; 80: 80-90% fragmentation; and 100A: 80 + 6h (100% fragmentation + new MACs visible)).
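For orientation, FPKM normalizes a transcript's fragment count by the transcript length (in kilobases) and by the sequencing depth (in millions of mapped fragments). The snippet below is a minimal sketch of that textbook formula. The function name `fpkm` and the example numbers are illustrative assumptions, not values from this data set. eXpress reports FPKM from its own abundance estimates, so its values need not match this naive calculation.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Naive FPKM: fragments / (length in kb * mapped fragments in millions)."""
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    # 1e9 = 1e3 (bp -> kb) * 1e6 (fragments -> millions of fragments)
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


# illustrative numbers only: 500 fragments on a 2,000 bp transcript,
# in a library with 10 million mapped fragments
example_value = fpkm(500, 2000, 10_000_000)
print(example_value)
```

Because of the length and depth normalization, FPKM values of the same transcript can be compared across samples with different sequencing depths, which is what the knockdown comparison in this notebook relies on.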

#### Samples

Total RNA was extracted from 300 ml/100 ml Paramecium culture that was subjected to RNAi by feeding. Total RNA was sent to Genewiz for mRNA sequencing (poly-A enrichment, NovaSeq 2x150, between 9 and 22 million reads per sample). Reads were trimmed for adapters with trim_galore and mapped to the Paramecium tetraurelia strain 51 transcriptome (https://paramecium.i2bc.paris-saclay.fr/files/Paramecium/tetraurelia/51/annotations/ptetraurelia_mac_51/ptetraurelia_mac_51_annotation_v2.0.transcript.fa). Mapping was done with hisat2, allowing 20 multimappings. Using samtools, the properly paired and mapped reads were filtered (-f2 flag) and sorted by read name (-n flag). Read counts per transcript were estimated with eXpress (https://www.nature.com/articles/nmeth.2251, https://pachterlab.github.io/eXpress/overview.html), running 5 additional online EM rounds after the initial online round (-O 5 flag) to improve accuracy.
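The processing steps described above could be assembled roughly as follows. This is only a sketch under assumptions: the exact command lines used for this data set are not recorded in the notebook, so all file names, the helper `build_commands`, the use of `-k 20` for the 20 multimappings, and the hisat2 index (assumed to be built from the transcriptome FASTA with `hisat2-build`) are hypothetical. The function only builds the commands; it does not run them.

```python
import shlex


def build_commands(sample, raw_r1, raw_r2, trimmed_r1, trimmed_r2,
                   hisat2_index, transcripts_fa):
    """Return the command lines for one sample (hypothetical file layout)."""
    sam = f"{sample}/{sample}.sam"
    bam = f"{sample}/{sample}.proper_pairs.bam"
    sorted_bam = f"{sample}/{sample}.proper_pairs.namesorted.bam"
    express_dir = f"{sample}/eXpress"  # results.xprs is written here
    return [
        # adapter trimming of the paired-end reads
        ["trim_galore", "--paired", "-o", sample, raw_r1, raw_r2],
        # mapping, reporting up to 20 alignments per read (assumed: -k 20)
        ["hisat2", "-x", hisat2_index, "-1", trimmed_r1, "-2", trimmed_r2,
         "-k", "20", "-S", sam],
        # keep only properly paired reads (-f 2)
        ["samtools", "view", "-b", "-f", "2", "-o", bam, sam],
        # sort by read name (-n)
        ["samtools", "sort", "-n", "-o", sorted_bam, bam],
        # quantification with 5 additional online EM rounds (-O 5)
        ["express", "-o", express_dir, "-O", "5", transcripts_fa, sorted_bam],
    ]


commands = build_commands(
    "veg_ND7-1", "raw_R1.fq.gz", "raw_R2.fq.gz",
    "trimmed_R1.fq.gz", "trimmed_R2.fq.gz",
    "transcriptome_index", "transcripts.fa",
)
for cmd in commands:
    print(shlex.join(cmd))
```

The output directory `<sample>/eXpress` mirrors the path from which this notebook later reads `results.xprs`.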

#### What's done here

The results.xprs output file is reformatted, and only the genes of interest for KD-efficiency evaluation are arranged in tables. These tables are saved as CSV files and can be imported into Excel to generate bar plots.
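One simple way to read knockdown efficiency from such a table is to express the FPKM of the target transcript in the knockdown as a fraction of its FPKM in the ND7 control at the same developmental stage. The sketch below illustrates that ratio. The function name `remaining_expression` and the numbers are placeholders chosen for illustration, not measured values, and this calculation is not part of the original analysis.

```python
def remaining_expression(fpkm_knockdown, fpkm_control):
    """Fraction of the control expression that remains in the knockdown."""
    if fpkm_control == 0:
        # the ratio is undefined if the transcript is not detected in the control
        return float("nan")
    return fpkm_knockdown / fpkm_control


# placeholder values, NOT measured data
control_fpkm = 40
knockdown_fpkm = 10

fraction_left = remaining_expression(knockdown_fpkm, control_fpkm)
print(f"remaining expression: {fraction_left:.0%}")
print(f"reduction: {1 - fraction_left:.0%}")
```

A ratio well below 1 at a given stage indicates reduced transcript levels in the knockdown relative to the control.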

In [None]:
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from functools import partial, reduce

In [None]:
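def fpkm_genes(path_data, samples, transcripts, t_names, path_save, name_save, save=False):
    
    '''
    Prepare the FPKM matrix (rows are samples and columns are the transcripts)
    for the transcripts of interest. It contains the FPKM values output by
    eXpress, rounded to integers. The table is displayed and, if save is True,
    written to path_save as <name_save>.csv.
    '''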
    
    dfs = dict.fromkeys(samples, None)
    
    # read the eXpress results of each sample into a dataframe
    for s in samples:
        results_file = path_data + s + '/eXpress/results.xprs'
        result = pd.read_csv(results_file, sep="\t")
        
        # drop all columns except "target_id" and "fpkm"
        result = result.drop(columns=['bundle_id',"tpm",'length','eff_length','tot_counts','uniq_counts',
                                      'est_counts','ambig_distr_alpha','ambig_distr_beta','fpkm_conf_low',
                                      'fpkm_conf_high','solvable','eff_counts'])
        
        # round the values in column "fpkm" to integers
        result['fpkm'] = result['fpkm'].round().astype(int)
        
        # change the column name "fpkm" to the sample name
        dfs[s] = result.rename({'fpkm': s}, axis='columns')
        
    # merge the dataframes into one big dataframe and transpose data frame 
    my_reduce = partial(pd.merge, on='target_id', how='outer')                                                              
    fpkm_df = reduce(my_reduce, dfs.values()).transpose()
    fpkm_df.columns = fpkm_df.iloc[0]
    fpkm_df = fpkm_df.drop(fpkm_df.index[0])
    
    
    # extract the genes of interest (copy, so that adding a row does not modify a view of fpkm_df)
    fpkm_genes_df = fpkm_df[transcripts].copy()
    

    # add the transcript names as the last row
    fpkm_genes_df.loc["transcript_names"] = t_names
    
    display(fpkm_genes_df)
    
    if save:
        #save as .csv
        save_name = path_save + f"{name_save}.csv"
        fpkm_genes_df.to_csv(save_name, sep = ',')
        
    

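The merge step in `fpkm_genes` combines one two-column table per sample into a single matrix by repeatedly outer-joining on `target_id`. The toy example below shows that pattern in isolation. The helper name `merge_fpkm` and the small dataframes are made up for illustration and only mimic the `target_id` and `fpkm` columns of results.xprs.

```python
from functools import partial, reduce

import pandas as pd


def merge_fpkm(frames):
    """Merge {sample: DataFrame(target_id, fpkm)} into a samples x transcripts table."""
    # rename each fpkm column to its sample name so the columns stay distinct
    renamed = [
        df[["target_id", "fpkm"]].rename(columns={"fpkm": sample})
        for sample, df in frames.items()
    ]
    # outer join keeps transcripts that are missing from some samples (as NaN)
    merge_on_id = partial(pd.merge, on="target_id", how="outer")
    merged = reduce(merge_on_id, renamed)
    # transcripts become columns, samples become rows
    return merged.set_index("target_id").transpose()


# toy input, not real data; note the different row order in the two samples
toy_frames = {
    "sample_a": pd.DataFrame({"target_id": ["t1", "t2"], "fpkm": [1.0, 2.0]}),
    "sample_b": pd.DataFrame({"target_id": ["t2", "t1"], "fpkm": [4.0, 3.0]}),
}
toy_matrix = merge_fpkm(toy_frames)
print(toy_matrix)
```

Joining on `target_id` rather than concatenating by position means the result does not depend on the row order of the individual results.xprs files.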
In [None]:
# set the paths
path_data = "/ebio/ag-swart/home/lhaeussermann/ag-swart-paramecium/analysis/mRNA_lilia_KD489/hisat2_express/transcriptome_mapping/"
path_save = "/ebio/ag-swart/home/lhaeussermann/ag-swart-paramecium/analysis/mRNA_lilia_KD489/hisat2_express/transcriptome_mapping/eXpress_results/"


In [None]:
# fpkm for AS17 and PS17 transcripts in kd1
samples = ['veg_ND7-1','veg_AS17-1','veg_PS17-1','20_ND7-1','20_AS17-1','20_PS17-1',
          '80_ND7-1','80_AS17-1','80_PS17-1','100A_ND7-1','100A_AS17-1','100A_PS17-1']
transcripts = ['PTET.51.1.T0620188','PTET.51.1.T0240213']
t_names = ['AS17','PS17']
name_save = "KD489-kd1"

fpkm_genes (path_data, samples, transcripts, t_names, path_save, name_save, save = True)


In [None]:
# fpkm for AS17 and PS17 transcripts in kd2
samples = ['veg_ND7-2','veg_AS17-2','veg_PS17-2','20_ND7-2','20_AS17-2','20_PS17-2',
          '80_ND7-2','80_AS17-2','80_PS17-2','100A_ND7-2','100A_AS17-2','100A_PS17-2']
transcripts = ['PTET.51.1.T0620188','PTET.51.1.T0240213']
t_names = ['AS17','PS17']
name_save = "KD489-kd2"

fpkm_genes (path_data, samples, transcripts, t_names, path_save, name_save, save = True)

In [None]:
# fpkm for AS17 and PS17 transcripts in kd3
samples = ['veg_ND7-3','veg_AS17-3','veg_PS17-3','20_ND7-3','20_AS17-3','20_PS17-3',
          '80_ND7-3','80_AS17-3','80_PS17-3','100A_ND7-3','100A_AS17-3','100A_PS17-3']
transcripts = ['PTET.51.1.T0620188','PTET.51.1.T0240213']
t_names = ['AS17','PS17']
name_save = "KD489-kd3"

fpkm_genes (path_data, samples, transcripts, t_names, path_save, name_save, save = True)

In [None]:
# fpkm for AS17 and PS17 transcripts in kd4
samples = ['veg_ND7-4','veg_AS17-4','veg_PS17-4','20_ND7-4','20_AS17-4','20_PS17-4',
          '80_ND7-4','80_AS17-4','80_PS17-4','100A_ND7-4','100A_AS17-4','100A_PS17-4']
transcripts = ['PTET.51.1.T0620188','PTET.51.1.T0240213']
t_names = ['AS17','PS17']
name_save = "KD489-kd4"

fpkm_genes (path_data, samples, transcripts, t_names, path_save, name_save, save = True)
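The four cells above differ only in the replicate number, so the sample lists could also be generated in a loop. The following is a sketch of that idea, not part of the original analysis. The helper `sample_names` and the constants `STAGES` and `KNOCKDOWNS` are hypothetical names, and the call to `fpkm_genes` is left as a comment because it needs the data paths defined earlier in the notebook.

```python
STAGES = ["veg", "20", "80", "100A"]
KNOCKDOWNS = ["ND7", "AS17", "PS17"]


def sample_names(replicate, stages=STAGES, knockdowns=KNOCKDOWNS):
    """Build the sample folder names of one replicate, e.g. 'veg_ND7-1'."""
    return [f"{stage}_{kd}-{replicate}" for stage in stages for kd in knockdowns]


for replicate in range(1, 5):
    samples = sample_names(replicate)
    name_save = f"KD489-kd{replicate}"
    print(name_save, samples)
    # in the notebook, the table for this replicate would then be produced with:
    # fpkm_genes(path_data, samples, transcripts, t_names, path_save, name_save, save=True)
```

The order of the generated names (stage first, then ND7, AS17, PS17 within each stage) matches the hand-written lists, so the rows of the resulting tables keep the same order.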