# Are there sequences near TSS that are associated with up or down regulation by ...  
    Call transcripts from merged RNA-seq datasets and find the TSS regions
    - Merge the RNA-seq data to obtain more sequencing depth
        - Merge the .bam files from the BAO and BAN experimental sets into one bam file
    - Split the merged bam file into + strand and - strand files
      (+ strand flags: 83, 163; - strand flags: 99, 147)
    - Index each file
    - Run stringtie on the indexed + and - bam files to estimate transcripts

    Filter for the more reliable stringtie transcripts and edit them
    Filter
    - Select stringtie transcripts that start between two genes on one of the
      DNA strands
    - Filter out transcripts that
        - have low read density
        - have no sudden increase in reads over a window of nts
    Edit
    - If a stringtie transcript ends in the middle of a gene, extend the transcript to the end of that gene
    - Create a GFF from the filtered/edited stringtie transcripts and visualize it with a genome browser

    Match known transcripts with their known TSSs to determine whether our
    predictions recover them correctly

    Generate a file that has:
    - all predicted TSS sequences from -100 to +50 (wrt TSS)
    - the genes that are included in the TU for each TSS
    - whether each of those genes is upregulated, downregulated, or not regulated
    TSS1	…NNNNNN…	rpoD	not regulated
    TSS1	…NNNNNN…	lecA	not regulated
    TSS2	…NNNNNN…	dnaA	upregulated
    
    determine if there are motifs present in upregulated, downregulated or not regulated 
    seq using MEME
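
The grouping step before MEME can be sketched as follows: collect the TSS-window sequences by regulation class and build one FASTA text per class, so each class can be searched for motifs separately. This is a minimal sketch and not the notebook's own code; the function name `fasta_by_regulation`, the row layout, and the short sequences are hypothetical placeholders that mirror the example rows above.

```python
from collections import defaultdict


def fasta_by_regulation(records):
    """Group (tss_id, seq, gene, regulation) rows into one FASTA string per class.

    A TSS is written once per class, even when its transcription unit
    contains several genes with the same regulation label.
    """
    grouped = defaultdict(dict)
    for tss_id, seq, gene, regulation in records:
        # setdefault keeps the first sequence seen for each TSS in this class
        grouped[regulation].setdefault(tss_id, seq)
    return {
        regulation: "".join(f">{tss_id}\n{seq}\n" for tss_id, seq in seqs.items())
        for regulation, seqs in grouped.items()
    }


# Placeholder rows (made-up sequences, for illustration only)
rows = [
    ("TSS1", "ACGTACGT", "rpoD", "not regulated"),
    ("TSS1", "ACGTACGT", "lecA", "not regulated"),
    ("TSS2", "TTGACAAT", "dnaA", "upregulated"),
]
fastas = fasta_by_regulation(rows)
```

Each value in `fastas` can be written to its own `.fa` file and passed to MEME as a separate input.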


In [3]:
import pandas as pd
from plotly import graph_objects as go
import numpy as np
import os
from jw_utils import genome_utils as gu
from jw_utils import parse_gff as pgf
from jw_utils import parse_fasta as pf
from jw_utils import file_utils as fu
import bisect
import pysam
from transcript_calling import tc_functions as tf

In [4]:
path_to_gff = '../references/FERM_BP3421.gff'
path_to_strGTF_neg = './merge_all_negStrand.gtf'
path_to_annot_file = '../data/references/Reference_FERM_BP3421.gbk'
path_to_fa_genomes = '../data/references/concat_references.fa'

### Split merged bam file into plus and minus strand templates 
    stringtie seems to perform better for me when I give it only one strand
    - rf sequencing  
    - plus strand flags: 83, 163
    - minus strand flags: 99, 147
    samtools view -f 83 merge_all.bam -o flag83.bam
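
To see why these four flags separate the strands, the SAM FLAG bits can be decoded directly. The sketch below is illustrative only: `decode_flag` and `template_strand_rf` are hypothetical helper names, and the strand rule assumes an rf (fr-firststrand) library, where read 1 aligns antisense to the transcript and read 2 aligns sense.

```python
# SAM FLAG bits relevant to properly paired reads (from the SAM specification)
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
}


def decode_flag(flag):
    """Return the names of the bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def template_strand_rf(flag):
    """Infer the transcript strand for an rf library from one read's flag."""
    is_reverse = bool(flag & 0x10)
    if flag & 0x40:  # read 1 is antisense to the transcript
        return "+" if is_reverse else "-"
    if flag & 0x80:  # read 2 is sense to the transcript
        return "-" if is_reverse else "+"
    raise ValueError("flag is neither read 1 nor read 2")


for flag in (83, 163, 99, 147):
    print(flag, decode_flag(flag), template_strand_rf(flag))
```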

In [6]:
# Plus-strand templates: flags 83 and 163 (-b writes BAM output)
!samtools view -b -f 83 ./merged_bam_files/merge_all.bam -o ./merged_bam_files/flag83.bam
!samtools view -b -f 163 ./merged_bam_files/merge_all.bam -o ./merged_bam_files/flag163.bam
!samtools merge -o ./merged_bam_files/merge_all_posStrand.bam ./merged_bam_files/flag83.bam ./merged_bam_files/flag163.bam
!samtools index ./merged_bam_files/merge_all_posStrand.bam
!rm ./merged_bam_files/flag83.bam
!rm ./merged_bam_files/flag163.bam

# Minus-strand templates: flags 99 and 147
!samtools view -b -f 99 ./merged_bam_files/merge_all.bam -o ./merged_bam_files/flag99.bam
!samtools view -b -f 147 ./merged_bam_files/merge_all.bam -o ./merged_bam_files/flag147.bam
!samtools merge -o ./merged_bam_files/merge_all_negStrand.bam ./merged_bam_files/flag99.bam ./merged_bam_files/flag147.bam
!samtools index ./merged_bam_files/merge_all_negStrand.bam
!rm ./merged_bam_files/flag99.bam
!rm ./merged_bam_files/flag147.bam



### Call initial transcripts using stringtie

In [26]:
!stringtie -g 0 -j 10 ./merged_bam_files/merge_all_posStrand.bam --rf -o merge_all_posStrand.gtf
!stringtie -g 0 -j 10 ./merged_bam_files/merge_all_negStrand.bam --rf -o merge_all_negStrand.gtf
!stringtie -g 0 -j 10 ./merged_bam_files/merge_all.bam --rf -o merge_all.gtf

##### Merge pos and neg strand annotation files produced by stringtie back into one gff file

In [5]:
merged_nested_obj_dict = tf.merge_pos_neg_sttie_objs('merge_all_posStrand.gtf','merge_all_negStrand.gtf',path_to_gff)
tf.write_gff_from_seqobj(merged_nested_obj_dict,'merged_all.gff')

### Make wig files with pileups for each strand of each chromosome

In [4]:
contig_names = pgf.get_contig_names(path_to_gff)
tf.make_wig_pilups('./merged_bam_files/merge_all_posStrand.bam', contig_names)
tf.make_wig_pilups('./merged_bam_files/merge_all_negStrand.bam', contig_names)


### Filter for the more reliable stringtie transcripts and edit them

    Filter
    - Select stringtie transcripts that start between two genes on one of the
      DNA strands
    - Filter out transcripts that
        - have low read density
        - have no sudden increase in reads over a window of nts
    Edit
    - If a stringtie transcript ends in the middle of a gene, extend the transcript to
      the end of that gene
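
The "sudden increase in reads over a window" criterion could be tested along these lines. This is a sketch under stated assumptions, not the logic inside `tc_functions`: the function name `has_coverage_jump`, the window size, the fold threshold, and the pseudocount are all hypothetical, and `coverage` is assumed to be a plain list of per-nt read depths oriented 5' to 3' along the transcript strand.

```python
def has_coverage_jump(coverage, tss_index, window=10, min_fold=3.0, pseudocount=1.0):
    """Return True if mean depth after tss_index rises at least min_fold over the depth before it.

    coverage  : list of per-nt read depths, 5'->3' on the transcript strand
    tss_index : 0-based index of the putative TSS within coverage
    """
    before = coverage[max(0, tss_index - window):tss_index]
    after = coverage[tss_index:tss_index + window]
    if not before or not after:
        # No flanking signal to compare against (e.g. TSS at the contig edge)
        return False
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    # The pseudocount avoids division by zero when the upstream window is empty of reads
    return (mean_after + pseudocount) / (mean_before + pseudocount) >= min_fold


step = [1] * 10 + [30] * 10   # sharp rise at index 10
flat = [20] * 20              # read-through from an upstream transcript
print(has_coverage_jump(step, 10), has_coverage_jump(flat, 10))
```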

### Edit and filter transcripts
`tf.filter_transcripts()`:
1) extends each transcript to the stop codon of the nearest gene on the same strand
2) filters out transcripts that start in the middle of a gene (on the same strand)
3) filters out transcripts whose read density is below the input threshold (reads/nt)
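
A rough sketch of these three steps on toy data is shown below. It is not the implementation of `tf.filter_transcripts()`: the `Feature` class, the helper names, and the coordinates are hypothetical, and the extension step is simplified to the case where the transcript's 3' end lies inside a same-strand gene.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Feature:
    start: int            # 1-based, start <= end regardless of strand
    end: int
    strand: str
    reads_per_nt: float = 0.0


def tss_of(feature):
    """The 5' end: start on '+', end on '-'."""
    return feature.start if feature.strand == "+" else feature.end


def starts_inside_gene(transcript, genes):
    tss = tss_of(transcript)
    return any(g.strand == transcript.strand and g.start < tss < g.end for g in genes)


def extend_to_gene_end(transcript, genes):
    """If the transcript's 3' end falls inside a same-strand gene, extend it to that gene's end."""
    for g in genes:
        if g.strand != transcript.strand:
            continue
        if transcript.strand == "+" and g.start <= transcript.end < g.end:
            return replace(transcript, end=g.end)
        if transcript.strand == "-" and g.start < transcript.start <= g.end:
            return replace(transcript, start=g.start)
    return transcript


def filter_transcripts_sketch(transcripts, genes, min_reads_per_nt):
    kept = []
    for t in transcripts:
        if starts_inside_gene(t, genes):
            continue                      # step 2: TSS inside a gene
        if t.reads_per_nt < min_reads_per_nt:
            continue                      # step 3: read density too low
        kept.append(extend_to_gene_end(t, genes))  # step 1: extend
    return kept


genes = [Feature(100, 400, "+"), Feature(600, 900, "-")]
transcripts = [
    Feature(50, 300, "+", 35.0),    # kept, extended to 400
    Feature(200, 500, "+", 50.0),   # dropped: starts inside a gene
    Feature(20, 90, "+", 5.0),      # dropped: low read density
    Feature(700, 950, "-", 40.0),   # kept, extended to 600
]
kept = filter_transcripts_sketch(transcripts, genes, 20)
```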



In [6]:
filtered_transcript_objs = tf.filter_transcripts('./merged_all.gff',path_to_gff,20,annot_type='gff' )

In [7]:
tf.write_gff_from_seqobj(filtered_transcript_objs, 'merged_all_filtered_20.gff')

###  Generate file that has:  
- all predicted TSS sequences from -100 to +50 (wrt TSS)
- genes that are included in TU for each TSS
- information about whether those genes are upregulated, downregulated, not regulated 
        TSS1    ….NNNNNN….    rpoD    not regulated
        TSS1    …NNNNNN….    lecA    not regulated
        TSS2    …NNNNNN…        dnaA    upregulated

#### get sequence around a given coordinate

<div>
<img src="./notebook_images/get_minusStrand_seq_region2.jpg" alt="get minus strand seq. region" width="400">
</div>
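
The strand handling pictured above can be sketched in plain Python. This is an illustration, not the code behind `tf.get_seq_around_coordinate`: `seq_around_tss` is a hypothetical name, the TSS coordinate is assumed to be 1-based, and the TSS itself is counted as the first downstream nt. Windows that run off either end of the contig are truncated.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def seq_around_tss(contig_seq, strand, tss, upstream, downstream):
    """Return `upstream` nt before the TSS plus `downstream` nt starting at the TSS.

    The result reads 5'->3' on the transcript's own strand.
    tss is a 1-based coordinate on the contig.
    """
    if strand == "+":
        start = tss - 1 - upstream    # 0-based, inclusive
        end = tss - 1 + downstream    # 0-based, exclusive
    elif strand == "-":
        # On the minus strand, upstream lies at higher contig coordinates
        start = tss - downstream
        end = tss + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    # Truncate at the contig boundaries instead of wrapping around
    start = max(start, 0)
    end = min(end, len(contig_seq))
    window = contig_seq[start:end]
    return window if strand == "+" else reverse_complement(window)


contig = "AAACCCGGGTTT"  # toy contig
print(seq_around_tss(contig, "+", 7, 3, 2))
print(seq_around_tss(contig, "-", 6, 3, 2))
```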

In [8]:
def make_seqs_around_start_df(upstream, downstream, fasta_genome, feature_obj_dict):
    """Return df from a nested seq object dictionary with local sequence around start of feature.

    parameters:
    upstream (int): # nt to return upstream of feature_beginning
    downstream (int): # nt to return downstream of feature_beginning
    fasta_genome (str): path_to_fasta_genome
    feature_obj_dict (dict): {contig:{feature_ID:feature_annot_object}}
    """
    fasta_genome_dict = pf.get_seq_dict(fasta_genome)
    seqs = {}
    contigs = {}
    # Iterate over the dictionary passed in, not the notebook-level global
    for contig in feature_obj_dict:
        for seq_obj_id in feature_obj_dict[contig]:
            seq_obj = feature_obj_dict[contig][seq_obj_id]
            strand = seq_obj.strand
            # The TSS is the feature start on '+' and the feature end on '-'
            if strand == '+':
                tss = seq_obj.start
            elif strand == '-':
                tss = seq_obj.end
            else:
                continue  # skip features without a defined strand
            contigs[seq_obj_id] = contig
            seqs[seq_obj_id] = tf.get_seq_around_coordinate(
                fasta_genome_dict, contig, strand, tss, upstream, downstream)

    df1 = pd.DataFrame(seqs.values(), seqs.keys())
    df2 = pd.DataFrame(contigs.values(), contigs.keys())
    df_seqs = pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
    df_seqs.columns = ["up_down_seq 3'-5'", 'chromosome']
    df_seqs.index.name = 'transcript_ID'
    return df_seqs
            
fasta_concat_genomepath = '../references/concat_references.fa'            
df_seqs = make_seqs_around_start_df(100, 50, fasta_concat_genomepath, filtered_transcript_objs)

Could not return all of upstream sequence requested because reached beginning of contig.


## Get genes within the transcript  
1) If the start codon of a gene (on the same strand as the transcript) lies within the 
    start:end coordinates of the transcript, count the gene as part of that transcript.
    - this needs both the gene annotation and the transcript annotations
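
The containment rule above can be sketched on plain tuples. This is a simplified illustration and not the code of `tf.find_genes_in_transcript`: the helper names and the locus tags are hypothetical, features are assumed to be `(start, end, strand)` with 1-based inclusive coordinates, and the start codon is taken to begin at `start` for '+' genes and at `end` for '-' genes.

```python
def gene_start_codon(gene):
    """Coordinate of the first nt of the start codon."""
    start, end, strand = gene
    return start if strand == "+" else end


def genes_in_transcript(transcript, genes):
    """Return locus tags of same-strand genes whose start codon lies in the transcript."""
    t_start, t_end, t_strand = transcript
    return [
        tag
        for tag, gene in genes.items()
        if gene[2] == t_strand and t_start <= gene_start_codon(gene) <= t_end
    ]


# Hypothetical annotation
genes = {
    "g1": (100, 400, "+"),
    "g2": (450, 700, "+"),
    "g3": (150, 300, "-"),
    "g4": (900, 1200, "+"),
}
print(genes_in_transcript((80, 720, "+"), genes))
print(genes_in_transcript((100, 350, "-"), genes))
```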
    

In [82]:
def get_genes_in_transcripts_df(path_trans_annots, path_to_gff):
    """Return a df with one row per gene in each transcript.

    Transcripts that contain no genes get a single row with locus_tag None.
    """
    genes_in_transcript = tf.find_genes_in_transcript(path_trans_annots, path_to_gff)
    transcpt_list = []
    gene_list = []
    num_genes = []
    for tnscpt, genes in genes_in_transcript.items():
        if len(genes) > 0:
            for gene in genes:
                transcpt_list.append(tnscpt)
                gene_list.append(gene)
                num_genes.append(len(genes))
        else:
            transcpt_list.append(tnscpt)
            gene_list.append(None)
            num_genes.append(0)

    df_tran_genes = pd.DataFrame({
        'transcript_ID': transcpt_list,
        'locus_tag': gene_list,
        '#_genes/trnscrpt': num_genes,
    })
    return df_tran_genes

path_trans_annots = './merged_all_filtered_20.gff'
df_tran_genes = get_genes_in_transcripts_df(path_trans_annots, path_to_gff)
df_seqs_with_genes = pd.merge(df_tran_genes,df_seqs, how='outer', on='transcript_ID')

In [83]:
df_seqs_with_genes

| | transcript_ID | locus_tag | #_genes/trnscrpt | up_down_seq 3'-5' | chromosome |
|---|---|---|---|---|---|
| 0 | STRG.3314 | gene-tmp_000007 | 1 | aatccgcgcgcgccatccacaagcccgcccacgaccgtcgacgcct... | BF000000.1 |
| 1 | STRG.954 | gene-tmp_000008 | 1 | ttcgtagtgtagtcgccaaaccgaaaatgccacggtcgggagtgac... | BF000000.1 |
| 2 | STRG.3315 | | 0 | cccgccccggccgccggcgcggtgcacgatccgtcggccccaataa... | BF000000.1 |
| 3 | STRG.955 | gene-tmp_000009 | 2 | gaacctgctcatcggatccgacccttgcaactgttacactttccgc... | BF000000.1 |
| 4 | STRG.955 | gene-tmp_000010 | 2 | gaacctgctcatcggatccgacccttgcaactgttacactttccgc... | BF000000.1 |
| ... | ... | ... | ... | ... | ... |
| 1777 | STRG.141 | | 0 | cgcaggctcagcggaaaacagtagtacaaccaaacggcgtggcaga... | BF000000.3 |
| 1778 | STRG.142 | gene-tmp_006514 | 1 | cccgcaattgacttgacggtaaattatcagagccagtaaatcgagc... | BF000000.3 |
| 1779 | STRG.146 | gene-tmp_006521 | 2 | agtcgaattgtttcatttaatttgaatatgcatatttgatacagat... | BF000000.3 |
| 1780 | STRG.146 | gene-tmp_006522 | 2 | agtcgaattgtttcatttaatttgaatatgcatatttgatacagat... | BF000000.3 |


### Add gene expression data (up and down-regulated genes) to the dataframe

In [66]:
def filter_gene_expression_data_add_up_down(data_path, up_thresh, down_thresh, pval_thresh):
    """Read the expression table, keep rows with PValue < pval_thresh, and label each row by logFC."""
    df_expression = pd.read_csv(data_path, sep='\t')
    # .copy() so that adding a column below does not write to a slice of the original frame
    df_expression = df_expression.loc[df_expression['PValue'] < pval_thresh, :].copy()
    filt_up = df_expression['logFC'] > up_thresh
    filt_down = df_expression['logFC'] < down_thresh
    # 'up' is checked first, then 'down'; every other row is 'not significant'
    df_expression['regulation'] = np.select(
        [filt_up, filt_down], ['up', 'down'], default='not significant')
    return df_expression

data_path = './differentially_regulated_genes_BANvBAO.tsv'
# NOTE: p-values never exceed 1, so pval_thresh=5 keeps every row;
# only the logFC cutoffs take effect in this call.
df_expression = filter_gene_expression_data_add_up_down(data_path, 1.0, -1.0, 5)

In [101]:

df_seqs_with_genes.columns = ['transcript_ID', 'locus_tag',
                              '#_genes/trnscrpt', "up_down_seq 3'-5'", 'chromosome']
# Strip the 'gene-' prefix so the tags match the expression table; missing tags stay missing
df_seqs_with_genes['locus_tag'] = df_seqs_with_genes['locus_tag'].str.replace('gene-', '', regex=False)
df_expression.columns = ['locus_tag', 'gene', 'description', 'featureType', 'logFC', 'pValue',
       'FDR', 'BAO-1', 'BAO-2', 'BAO-3', 'BAN-1', 'BAN-2', 'BAN-3', 'regulation']
# The inner merge keeps only transcripts whose gene appears in the expression table
df_full = pd.merge(df_seqs_with_genes, df_expression, how='inner', on='locus_tag').set_index('transcript_ID')
df_full = df_full[['#_genes/trnscrpt', 'locus_tag', 'chromosome',
       'gene', 'description', 'regulation', 'logFC', 'pValue', 'FDR', 'BAO-1',
       'BAO-2', 'BAO-3', 'BAN-1', 'BAN-2', 'BAN-3', "up_down_seq 3'-5'"]]
df_full.to_csv('transcipts_genes_regulation_BAOvBAN.csv')

Index(['Locustag', 'Gene', 'Description', 'FeatureType', 'logFC', 'PValue',
       'FDR', 'BAO-1', 'BAO-2', 'BAO-3', 'BAN-1', 'BAN-2', 'BAN-3',
       'regulation'],
      dtype='object')

### Volcano plot

In [47]:

df3 = df3[[]]
df_merge.columns = ['Transcript_ID', "Up_down_seq 3'-5'", 'Chromosome', 'Locustag',
       '#_genes/trnscrpt']

Index(['transcript_ID', 'up_down_seq 3'-5'', 'chromosome', 'genes',
       '#_genes_in_trnscrpt'],
      dtype='object')
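
A volcano plot shows logFC on the x-axis against -log10(p-value) on the y-axis. The sketch below only computes those coordinates and a label per gene; it is not taken from the notebook. The function name `volcano_points`, the default cutoffs (|logFC| > 1, p < 0.05), and the example values are hypothetical.

```python
import math


def volcano_points(rows, fc_thresh=1.0, p_thresh=0.05):
    """Turn (logFC, p-value) pairs into (x, y, label) volcano-plot points."""
    points = []
    for logfc, pvalue in rows:
        # Clamp to avoid log10(0) for p-values reported as exactly zero
        y = -math.log10(max(pvalue, 1e-300))
        if pvalue < p_thresh and logfc > fc_thresh:
            label = "up"
        elif pvalue < p_thresh and logfc < -fc_thresh:
            label = "down"
        else:
            label = "not significant"
        points.append((logfc, y, label))
    return points


# Made-up values for illustration
example = [(2.0, 0.001), (-1.5, 0.01), (0.3, 0.5), (3.0, 0.2)]
points = volcano_points(example)
```

The x and y values can then be passed to a scatter trace such as `go.Scatter(x=..., y=..., mode='markers')`, with one trace per label to color the classes.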


In [106]:
fc_select

2        2.033241
5        2.664175
9       12.077312
10      10.234666
11       0.200274
          ...    
1522    18.503156
1523    24.151807
1524    21.638660
1525    14.008044
1526    22.592076
Length: 617, dtype: float64

#### Graph read density on filtered transcripts

In [24]:
# NOTE: coverage_df is not defined in this notebook; it is assumed to hold one row
# per transcript with a 'coverage' column (average reads/nt).
filt_val = 20
coverage_fil_df = coverage_df.loc[coverage_df['coverage'] > filt_val, :].sort_values('coverage')

bins = 50
rnge = (0, 75)
counts, edges = np.histogram(coverage_fil_df['coverage'], bins=bins, range=rnge)
# np.histogram returns bins + 1 edges, so plot each count at its bin center
centers = (edges[:-1] + edges[1:]) / 2
trace1 = go.Bar(x=centers, y=counts, name='coverage')
layout = go.Layout({'title': 'histogram of ave. reads/nt'})
fig = go.Figure(data=[trace1], layout=layout)
fig.update_xaxes({'title':
                      {'text': 'ave. reads/nt',
                       'font': {'size': 20}}
                 })
fig.update_yaxes({'title':
                      {'text': 'number of transcripts',
                       'font': {'size': 20}}
                 })
fig


NameError: name 'coverage_df' is not defined