# Final Bioinformatics Project

## Katelyn, Hannah, Kathleen, and Grant

### BLAST

- What to do:

    + Using Unix commands, create a single table that includes the top hit for each transcript. 
    
    + Save one fasta file of protein sequences per identified transcript (6 total).

- The final code for this part of the project is __'tophitsscript.sh'__

- The final table for this part of the project is __'tophits.txt'__
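
The top-hit selection can also be sketched in Python as a cross-check on the Unix pipeline. This is not the contents of 'tophitsscript.sh'. It assumes the BLAST results were saved in tabular format (`-outfmt 6`), where column 1 is the query ID and column 12 is the bit score. The function name `top_hits` and the sample rows are made up for illustration.

```python
from __future__ import print_function


def top_hits(blast_lines):
    """Return {query_id: fields} holding the highest bit score hit per query.

    Assumes BLAST tabular output (-outfmt 6): 12 tab-separated columns,
    with the query ID first and the bit score last.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment lines
        fields = line.split('\t')
        query, bitscore = fields[0], float(fields[11])
        # on ties the first hit seen is kept, so BLAST's own ordering wins
        if query not in best or bitscore > float(best[query][11]):
            best[query] = fields
    return best


# made-up rows, for illustration only
example = [
    'transcript1\thitA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t360',
    'transcript1\thitB\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-80\t290',
    'transcript2\thitC\t99.0\t150\t1\t0\t1\t150\t1\t150\t1e-70\t270',
]
for query, fields in sorted(top_hits(example).items()):
    # print query ID, subject ID, E-value and bit score
    print(query, fields[1], fields[10], fields[11])
```

Keeping the whole row for each query makes it easy to write the result back out as a tab-separated table.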

### Translation of RNA Seq Data

- What to do:

    + Use Python to translate nucleotides to amino acids
    
    + __READ__ _codonmap.txt_ and the nucleotide file you are translating
    
    + __WRITE__ the translated amino acid sequences to a fasta file
    
    + Use this code in a for loop to translate all four files of RNAseq data

#import packages
from __future__ import print_function
import csv
import os
import pandas
import numpy

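#open codonmap.txt and store it as a dictionary under the variable name 'd'
#each line holds an amino acid followed by one of its codons
d = {}
with open('codonmap.txt', 'r') as codon_file:
    for line in codon_file:
        if not line.strip(): # skip blank lines
            continue
        aa, codon = line.split()
        d[codon] = aa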
        
#translate every sequence in a list of alternating header and sequence lines
def translate(codex, fasta):
    sequences = [] # sequential list of protein sequences
    sequence_names = [] # fasta header lines, in the same order
    for i, item in enumerate(fasta): # loops over list, list items
        if i%2 == 0: # even index: header line
            sequence_names.append(item)
        else: # odd index: nucleotide sequence
            protein = '' # translated protein sequence
            for j in range(0, len(item) - 2, 3): # complete codons only
                res = codex[item[j:j+3]]
                if res == 'Stop':
                    break
                protein += res
            sequences.append(protein)
    return ['{0}\n{1}'.format(sequence_names[p], sequences[p]) for p in range(len(sequences))]

#create protein.fasta files
if __name__ == '__main__':
    #translate each transcript fasta file and write a matching protein fasta file
    for sample in ['control1', 'control2', 'obese1', 'obese2']:
        #'with' closes each file automatically, even if translation fails
        with open(sample + '.fasta', 'r') as infile:
            records = translate(d, infile.read().split())
        with open(sample + 'protein.fasta', 'w') as outfile:
            outfile.write('\n'.join(records))

### Hidden Markov Models

- What to do:

    + Use __muscle__ to make an alignment for downloaded protein sequences and translated RNAseq data
    
    + Use __hmmbuild__ to construct six protein models
    
    + Use __hmmsearch__ to search the translated RNAseq files for each of the protein models made
    
    + Use a bash script to loop over the transcript files and RNAseq files

- Final code for this part is in __'muscle_hmm_script.sh'__

- Final files for this part are __''__
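
The loop over transcripts and RNAseq files can be sketched in Python as well. This is not the contents of 'muscle_hmm_script.sh'. All file names here are hypothetical, apart from the `<sample>protein.fasta` names written by the translation step, and the muscle call assumes MUSCLE v3 syntax (`-in`, `-out`). The sketch only builds the command lines; `run_commands` executes them and needs muscle and HMMER installed.

```python
from __future__ import print_function
import subprocess


def build_commands(transcripts, rnaseq_files):
    """Return the muscle, hmmbuild and hmmsearch command lines as lists.

    Assumes (hypothetically) that the protein sequences for each
    transcript are stored in '<transcript>.fasta'.
    """
    commands = []
    for name in transcripts:
        alignment = name + '.align'
        model = name + '.hmm'
        # MUSCLE v3 syntax; MUSCLE v5 uses -align and -output instead
        commands.append(['muscle', '-in', name + '.fasta', '-out', alignment])
        # hmmbuild takes the output model first, then the alignment
        commands.append(['hmmbuild', model, alignment])
        for rnaseq in rnaseq_files:
            table = '{0}_{1}.tbl'.format(name, rnaseq)
            # --tblout saves one line per hit in a table that is easy to parse
            commands.append(['hmmsearch', '--tblout', table, model,
                             rnaseq + 'protein.fasta'])
    return commands


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.check_call(cmd)


if __name__ == '__main__':
    # dry run: print the commands without executing them
    for cmd in build_commands(['transcript1', 'transcript2'],
                              ['control1', 'control2', 'obese1', 'obese2']):
        print(' '.join(cmd))
```

Printing the commands first is a cheap way to check the file naming before starting the longer alignment and search steps.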

### Graphing of "expression levels"

- What to do:

    + Graph the counts of the hmm hits for each transcript [our measure of RNA expression] in each RNAseq file
    
    + Compare the expression levels across the 2 normal and 2 obese mice
    
    + Qualitatively compare our results to those reported in Kuhns & Pluznick (2017)
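
One possible way to get the counts and the graph is sketched below. It assumes the hmmsearch results were saved with `--tblout`, where every line that does not start with `#` is one hit. The function names, the output file name and the sample rows are made up for illustration, and `plot_counts` needs matplotlib.

```python
from __future__ import print_function


def count_hits(tblout_lines):
    """Count hits in hmmsearch --tblout output (non-blank, non-comment lines)."""
    return sum(1 for line in tblout_lines
               if line.strip() and not line.startswith('#'))


def plot_counts(counts, samples, outfile='expression.png'):
    """Save a grouped bar chart.

    counts maps each transcript name to a list of hit counts,
    one per entry in samples and in the same order.
    """
    import matplotlib
    matplotlib.use('Agg')  # draw to a file, no display needed
    import matplotlib.pyplot as plt
    width = 0.8 / len(counts)
    for k, (transcript, values) in enumerate(sorted(counts.items())):
        positions = [x + k * width for x in range(len(samples))]
        plt.bar(positions, values, width=width, label=transcript)
    # put each tick label under the middle of its group of bars
    centers = [x + 0.4 - width / 2 for x in range(len(samples))]
    plt.xticks(centers, samples)
    plt.ylabel('Number of hmmsearch hits')
    plt.legend()
    plt.savefig(outfile)


# made-up tblout rows, for illustration only
example = [
    '# target name  accession  query name  accession  E-value  score  bias',
    'seq1 - model - 1e-50 170.2 0.1',
    'seq2 - model - 3e-20 70.9 0.0',
    '#',
]
print(count_hits(example))
```

With one count per transcript and RNAseq file, `plot_counts` draws the control and obese samples side by side for each transcript.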

### Further Exploration

1. What to do:

    + For 2-3 of the 6 transcripts, return to the original BLAST search and change the _Optimize for_ option. It might be easier to explore if you also restrict the _Database_ option to either human or mouse.
    
    + How do _discontiguous megablast_ and _blastn_ change your table of BLAST hits?
    
        + _uniquetranscripts.fasta_ with _Mouse genomic + transcript_ database and _discontiguous megablast_
        
        + _uniquetranscripts.fasta_ with _Mouse genomic + transcript_ database and _blastn_
    
2. What to do:

    + For 2-3 of the 6 transcripts, return to the NCBI protein search and explore the effects of phylogenetic relatedness of your amino acid sequences on the performance of your HMM model
    
    + What would happen if you built your HMM protein model using more distantly related mammals (e.g., primates)? Would you still get the same quality of hits if your HMM protein model was based on non-mammalian sequences? 
    
    + Pick one of the RNAseq files to search in order to test your hypotheses. Compare E-values among HMMs built from differing taxa.
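
A minimal sketch of the E-value comparison, assuming each HMM was searched against the same RNAseq file with `hmmsearch --tblout`. In that table the fifth whitespace-separated column is the full-sequence E-value. The cutoff of 1e-5, the model labels and the sample rows are assumptions made for illustration, not results from this project.

```python
from __future__ import print_function


def evalues(tblout_lines):
    """Return the full-sequence E-values from hmmsearch --tblout output."""
    values = []
    for line in tblout_lines:
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and comment lines
        fields = line.split()
        values.append(float(fields[4]))  # column 5: full-sequence E-value
    return values


def summarize(tblout_lines, threshold=1e-5):
    """Summarize one search; threshold is an arbitrary illustrative cutoff."""
    values = evalues(tblout_lines)
    return {
        'hits': len(values),
        'best_evalue': min(values) if values else None,
        'below_threshold': sum(1 for v in values if v <= threshold),
    }


# made-up tblout rows, for illustration only
mammal_model = [
    '# hypothetical search with a model built from mammalian sequences',
    'seq1 - model - 1e-60 200.1 0.0 1e-60 200.0 0.0 1.0 1 0 0 1 1 1 1 -',
    'seq2 - model - 2e-08 30.5 0.1 2e-08 30.4 0.1 1.0 1 0 0 1 1 1 1 -',
]
non_mammal_model = [
    '# hypothetical search with a model built from non-mammalian sequences',
    'seq1 - model - 3e-12 45.0 0.0 3e-12 44.9 0.0 1.0 1 0 0 1 1 1 1 -',
    'seq2 - model - 0.004 12.3 0.2 0.004 12.2 0.2 1.0 1 0 0 1 1 1 1 -',
]
for label, lines in [('mammal', mammal_model), ('non-mammal', non_mammal_model)]:
    print(label, summarize(lines))
```

Comparing the best E-value and the number of hits under a fixed cutoff gives a quick, rough view of how model quality changes with the taxa used to build it.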