# BCB 546 Python Assignment

## Dependency

BioPython



## Task 1: Get sequence



### Function: get_sequences_from_file(fasta_fn)

**Description:** Read the sequences from a FASTA file

**Arguments:**

* fasta_fn: path to a FASTA file

**Return:** a dictionary mapping each species name to its sequence (a `Seq` object)


In [1]:
from Bio import SeqIO

def get_sequences_from_file(fasta_fn):
    sequence_data_dict = {}
    # loop over every record of the FASTA file; each record holds a description and a sequence
    for record in SeqIO.parse(fasta_fn, "fasta"):
        description = record.description.split()  # split the description line on whitespace
        # genus and species are the 2nd and 3rd fields (indices 1 and 2) of the split description
        species_name = description[1] + " " + description[2]
        sequence_data_dict[species_name] = record.seq  # store the sequence under its species name
    return sequence_data_dict  # dictionary of species name -> sequence
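
The indexing in `get_sequences_from_file` relies on the header layout. The sketch below shows the same split on a single header line, assuming the header has the form `ID Genus species ...`. The header string here is made up for illustration and is not taken from `penguins_cytb.fasta`.

```python
# Hypothetical FASTA header (without the leading ">"); the ID and trailing text are invented.
header = "XX000000.1 Aptenodytes forsteri cytochrome b gene"

# Splitting on whitespace puts the ID at index 0, the genus at 1 and the species at 2.
fields = header.split()
species_name = " ".join(fields[1:3])

print(fields[0])       # the record ID
print(species_name)    # genus and species joined by a space
```

If a header had fewer than three fields, `description[2]` would raise an `IndexError`, so this approach only suits files whose headers follow that layout.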

In [2]:
# read the FASTA file with the function defined above

penguin_seq = get_sequences_from_file("penguins_cytb.fasta")

In [3]:
# inspect the parsed sequences
penguin_seq

{'Aptenodytes forsteri': Seq('ATGGCCCCAAATCTCCGAAAATCCCATCCCCTCCTAAAAATAATTAATAACTCC...TAA'),
 'Aptenodytes patagonicus': Seq('ATGGCCCCAAACCTCCGAAAATCCCATCCTCTCCTAAAAATAATTAATAACTCC...TAA'),
 'Eudyptes chrysocome': Seq('ATGGCCCCCAACCTCCGAAAATCCCACCCCCTCCTAAAAACAATCAATAACTCC...TAA'),
 'Eudyptes chrysolophus': Seq('ATGGCCCCCAACCTCCGAAAATCCCACCCCCTCCTAAAAACAATCAATAACTCC...TAA'),
 'Eudyptes sclateri': Seq('ATGGCCCCCAACCTCCGAAAATCCCACCCCCTCCTAAAAACAATCAATAACTCC...TAA'),
 'Eudyptula minor': Seq('ATGGCCCCCAACCTCCGAAAATCTCACCCCCTCCTAAAAATAATCAACAACTCT...TAA'),
 'Pygoscelis adeliae': Seq('ATGGCCCCCAACCTCCGAAAATCCCACCCTCTCCTAAAAATAATTAACAACTCC...TAA'),
 'Pygoscelis antarctica': Seq('ATGGCCCCCAACCTCCGAAAATCCCACCCTCTCCTAAAAATAATCAACAACTCC...TAG'),
 'Pygoscelis papua': Seq('ATGGCCCCCAACCTTCGAAAATCCCACCCTCTCCTAAAAATAATCAACAAATCC...TAG'),
 'Spheniscus demersus': Seq('ATGGCCCCCAACCTCCGAAAATCCCACCCTCTCCTAAAAACAATCAACAACTCC...TAA'),
 'Spheniscus humboldti': Seq('ATGGCCCCCAACCTCCGAAAATCCCACCCTCTCCTAAAAAC

In [4]:
len(penguin_seq)

type(penguin_seq)

dict

## Task 2: Translate DNA to AA

### Function: dna_to_aa(dna_dict)

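**Description:** Translate each DNA sequence in a dictionary into an amino acid sequence, using the vertebrate mitochondrial codon table and stopping at the first stop codon

**Arguments:**

* dna_dict: a dictionary that maps names to DNA sequences

**Return:** a dictionary mapping each name to its amino acid sequence (a string)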




In [9]:
from Bio.Data import CodonTable


def dna_to_aa(dna_dict):

    # use the vertebrate mitochondrial codon table
    mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

    # forward_table has no entries for stop codons, so add them as asterisks;
    # otherwise the lookup in the translation loop below raises a KeyError.
    # Note: this modifies the shared table object in place.
    mito_table.forward_table["TAA"] = "*"
    mito_table.forward_table["TAG"] = "*"
    mito_table.forward_table["AGG"] = "*"
    mito_table.forward_table["AGA"] = "*"

    aa_dict = {}

    for name, dna_seq in dna_dict.items():
        aa_seq = ""

        # read the sequence one codon (three bases) at a time
        for i in range(0, len(dna_seq), 3):
            codon = str(dna_seq[i:i+3])  # plain string key for the table lookup
            aa = mito_table.forward_table[codon]
            if aa != '*':
                aa_seq += aa
            else:
                break  # stop translating at the first stop codon
        aa_dict[name] = aa_seq

    return aa_dict

In [31]:
aa_2 = dna_to_aa(penguin_seq)

In [33]:
type(aa_2)

dict

In [34]:
aa_2

{'Aptenodytes forsteri': 'MAPNLRKSHPLLKMINNSLIDLPTPSNISAWWNFGSLLGICLTTQILTGLLLAMHYTADTTLAFSSVAHTCRNVQYGWLIRNLHANGASFFFICIYLHIGRGFYYGSYLYKETWNTGIILLLTLMATAFVGYVLPWGQMSFWGATVITNLFSAIPYIGQTLVEWTWGGFSVDNPTLTRFFALHFLLPFMIAGLTLIHLTFLHESGSNNPLGIVANSDKIPFHPYYSTKDILGFALMLLPLTTLALFSPNLLGDPENFTPANPLVTPPHIKPEWYFLFAYAILRSIPNKLGGVLALAASVLILFLIPLLHKSKQRTMAFRPLSQLLFWALVANLIILTWVGSQPVEHPFIIIGQLASLTYFTTLLILFPIAGALENKMLNH',
 'Aptenodytes patagonicus': 'MAPNLRKSHPLLKMINNSLIDLPTPSNISAWWNFGSLLGICLTTQILTGLLLAMHYTADTTLAFSSVAHTCRNVQYGWLIRNLHANGASFFFICIYLHIGRGFYYGSYLYKETWNTGIILLLTLMATAFVGYVLPWGQMSFWGATVITNLFSAIPYIGQTLVEWAWGGFSVDNPTLTRFFALHFLLPFMIAGLTLIHLTFLHESGSNNPLGIVANSDKIPFHPYYSTKDTLGFALMLLPLTTLALFSPNLLGDPENFTPANPLVTPPHIKPEWYFLFAYAILRSIPNKLGGVLALAASVLILFLIPLLHKSKQRTMTFRPLSQLLFWTLVANLTILTWIGSQPVEHPFIIIGQLASLTYFTILLILFPLIGTLENKMLNH',
 'Eudyptes chrysocome': 'MAPNLRKSHPLLKTINNSLIDLPTPSNISAWWNFGSLLGICLATQILTGLLLAAHYTADTTLAFSSVAHTCRNVQYGWLIRNLHANGASFFFICIYLHIGRGLYYGSYLYKETWNTGIILLLTLMATAFVGYVLPWGQMSFWGATVITNLFSAI

## Task 3: Alternative translation

An alternative way to complete Task 2, using BioPython's built-in `translate` method.


### Function: translate_sequences(dna_dict)

**Description:** Translate each DNA sequence in a dictionary into an amino acid sequence with BioPython's `translate` method

**Arguments:**

* dna_dict: a dictionary that maps species names to DNA sequences

**Return:** a dictionary mapping each species name to its translated sequence (a `Seq` object)


In [49]:
from Bio.Seq import Seq

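def translate_sequences(dna_dict):
    aa_dict = {}
    for name, dna_seq in dna_dict.items():
        # No table argument is given, so translate() uses the standard genetic code,
        # and to_stop=False keeps every stop codon as "*" in the output.
        # Passing table="Vertebrate Mitochondrial" and to_stop=True would mirror dna_to_aa.
        aa_seq = dna_seq.translate(to_stop=False)
        aa_dict[name] = aa_seq
    return aa_dict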

In [50]:
aa_3 = translate_sequences(penguin_seq)

In [51]:
aa_3

{'Aptenodytes forsteri': Seq('MAPNLRKSHPLLKIINNSLIDLPTPSNISA**NFGSLLGICLTTQILTGLLLAI...NH*'),
 'Aptenodytes patagonicus': Seq('MAPNLRKSHPLLKIINNSLIDLPTPSNISA**NFGSLLGICLTTQILTGLLLAI...NH*'),
 'Eudyptes chrysocome': Seq('MAPNLRKSHPLLKTINNSLIDLPTPSNISA**NFGSLLGICLATQILTGLLLAA...NH*'),
 'Eudyptes chrysolophus': Seq('MAPNLRKSHPLLKTINNSLIDLPTPSNISA**NFGSLLGICLATQILTGLLLAA...NH*'),
 'Eudyptes sclateri': Seq('MAPNLRKSHPLLKTINNSLIDLPTPSNISA**NFGSLLGICLATQILTGLLLAA...NH*'),
 'Eudyptula minor': Seq('MAPNLRKSHPLLKIINNSLIDLPTPSNIST**NFGSLLGICLITQILTGLLLAA...SH*'),
 'Pygoscelis adeliae': Seq('MAPNLRKSHPLLKIINNSLIDLPTPSNISA**NFGSLLGICLTTQILTGLLLAM...NH*'),
 'Pygoscelis antarctica': Seq('MAPNLRKSHPLLKIINNSLIDLPTPSNISA**NFGSLLGICLTTQILTGLLLAI...NF*'),
 'Pygoscelis papua': Seq('MAPNLRKSHPLLKIINKSLIDLPTPPNISA**NFGSLLGICLITQILTGLLLAI...NF*'),
 'Spheniscus demersus': Seq('MAPNLRKSHPLLKTINNSLIDLPTPSNISA**NFGSLLGICLATQILTGLLLAA...NH*'),
 'Spheniscus humboldti': Seq('MAPNLRKSHPLLKTINNSLIDLPTPSNISA**NFGSLLSIC

In [52]:
penguin_aa = translate_sequences(penguin_seq)

In [53]:
len(penguin_seq["Aptenodytes forsteri"])

1143

In [54]:
len(aa_3["Aptenodytes forsteri"])

381

In [56]:
aa_3["Aptenodytes forsteri"]

Seq('MAPNLRKSHPLLKIINNSLIDLPTPSNISA**NFGSLLGICLTTQILTGLLLAI...NH*')

In [58]:
aa_3["Aptenodytes forsteri"].count("*")

11

In [59]:
aa_2["Aptenodytes forsteri"]

'MAPNLRKSHPLLKMINNSLIDLPTPSNISAWWNFGSLLGICLTTQILTGLLLAMHYTADTTLAFSSVAHTCRNVQYGWLIRNLHANGASFFFICIYLHIGRGFYYGSYLYKETWNTGIILLLTLMATAFVGYVLPWGQMSFWGATVITNLFSAIPYIGQTLVEWTWGGFSVDNPTLTRFFALHFLLPFMIAGLTLIHLTFLHESGSNNPLGIVANSDKIPFHPYYSTKDILGFALMLLPLTTLALFSPNLLGDPENFTPANPLVTPPHIKPEWYFLFAYAILRSIPNKLGGVLALAASVLILFLIPLLHKSKQRTMAFRPLSQLLFWALVANLIILTWVGSQPVEHPFIIIGQLASLTYFTTLLILFPIAGALENKMLNH'

In [60]:
aa_2["Aptenodytes forsteri"].count("*")

0

In [61]:
# counts "TAA" in any reading frame, not only in-frame stop codons
penguin_seq["Aptenodytes forsteri"].count("TAA")

17

In [42]:
# 1143 nt minus the 3-nt stop codon leaves 1140 nt, i.e. 380 codons
1143-3
1140/3

380.0

In [35]:
len(aa_2["Aptenodytes forsteri"])

380
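
The two results differ: `aa_3` has 381 symbols including 11 `*`, while `aa_2` has 380 residues and none. This is consistent with `translate_sequences` using the standard genetic code while `dna_to_aa` uses the vertebrate mitochondrial one. The two codes disagree on four codons: `TGA` (stop vs. W), `ATA` (I vs. M), and `AGA`/`AGG` (R vs. stop). That matches the `**` vs. `WW` and `KIINN` vs. `KMINN` differences in the outputs above. A minimal sketch of the effect, using small hand-written tables that cover only the codons in a made-up demo sequence (the tables and the `translate` helper are illustrative, not part of BioPython):

```python
# Partial codon tables: only the codons needed for the demo sequence below.
STANDARD = {"ATG": "M", "GCC": "A", "ATA": "I", "TGA": "*", "TAA": "*"}
VERT_MITO = {"ATG": "M", "GCC": "A", "ATA": "M", "TGA": "W", "TAA": "*"}

def translate(dna, table):
    """Translate codon by codon, keeping stop codons as '*' (hypothetical helper)."""
    usable = len(dna) - len(dna) % 3          # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(table[codon] for codon in codons)

demo = "ATGGCCATATGATGATAA"   # made-up sequence: ATG GCC ATA TGA TGA TAA

standard_aa = translate(demo, STANDARD)
mito_aa = translate(demo, VERT_MITO)
print(standard_aa)   # MAI***
print(mito_aa)       # MAMWW*
```

With the real data, calling `dna_seq.translate(table="Vertebrate Mitochondrial", to_stop=True)` should bring the Task 3 function in line with Task 2. That call has not been run here, so treat it as a suggestion to verify.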

In [9]:


######################## Python Translate Script ########################

## Here's the start of our Python script. Thanks for completing it for me! - Dr. X
## IMPORTANT: install BioPython so that this will work


#%%%%%%%%%%%%%%%#
### FUNCTIONS ###
#%%%%%%%%%%%%%%%#




## 3 ##
####### YOUR ALTERNATIVE FUNCTION ########
## Is there a better way to write the translation function? (Hint: yes there is.) 
## Perhaps using available BioPython library utilities?
## Please also write this function.


## 4 ##
####### YOUR COUNT AA ANALYSIS FUNCTION ########
## Write a function that calculates the molecular weight of each amino acid sequence.
## For this, you can use some BioPython functions. I think you can use the ProtParam module.
## For more info, check this out: http://biopython.org/wiki/ProtParam
## So you should import the following before defining your function:
from Bio.SeqUtils.ProtParam import ProteinAnalysis
def compute_molecular_weight(aa_seq):
    # ProteinAnalysis may require a plain string, so convert a Seq object first,
    # and drop a trailing stop symbol ("*") if there is one.
    protein = ProteinAnalysis(str(aa_seq).rstrip("*"))
    return protein.molecular_weight()

## 5 ##
####### YOUR GC CONTENT ANALYSIS FUNCTION ########
## Write a function that calculates the GC-content (proportion of "G" and "C") of each DNA sequence and returns this value.


#%%%%%%%%%%%%%%#
###   MAIN   ###
#%%%%%%%%%%%%%%#

import pandas as pd  # needed for pd.read_csv below

cytb_seqs = get_sequences_from_file("penguins_cytb.fasta")

penguins_df = pd.read_csv("penguins_mass.csv") # Includes only data for body mass
species_list = list(penguins_df.species)

## 6 ## 
## Add two new columns to the penguin DataFrame: (1) molecular weight and (2) GC content.
## Set the value to 'NaN' to indicate that these cells are currently empty.

## 7 ##
## Write a for-loop that translates each sequence and also gets molecular weight and computes the GC content
## of each translated sequence and adds those data to DataFrame
# for key, value in cytb_seqs.items():
#     aa_seq = nuc2aa_translate_function(value) # whichever function you prefer of #2 or #3
#     get the molecular weight of aa_seq
#     get the GC content of the DNA sequence
#     fill in empty cells in DF that you created above

## 8 ##
## Plot a bar-chart of the mass with the x-axes labeled with species names.
## *Q1* What is the smallest penguin species? 
## *Q2* What is the geographical range of this species?

## 9 ##
## Plot a visualization of the molecular weight (y-axis) as a function of GC-content (x-axis).

## 10 ##
## Save the new DataFrame to a file called "penguins_mass_cytb.csv"

## 11 - BONUS ##
## What else can we do with this dataset in Python? 
## Add functions or anything that might be interesting and fun. (optional)
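
Step 5 of the script asks for a function that returns the GC content of a DNA sequence. A minimal sketch in plain Python follows; the name `compute_gc_content` is an assumption, and the input is converted with `str()` so that it accepts either a string or a `Seq` object.

```python
def compute_gc_content(dna_seq):
    """Return the proportion of G and C bases in a DNA sequence (between 0.0 and 1.0)."""
    seq = str(dna_seq).upper()        # accept str or Seq, and ignore letter case
    if not seq:
        return 0.0                    # avoid dividing by zero on an empty sequence
    return (seq.count("G") + seq.count("C")) / len(seq)

example_gc = compute_gc_content("ATGC")
print(example_gc)   # 0.5
```

The value is a proportion rather than a percentage; multiply by 100 if a percentage is preferred in the DataFrame.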



In [302]:
def gpt_dna(dna_dict):
    """
    This function takes in a dictionary of DNA sequences and returns a dictionary of translated amino acid sequences
    """
    # dictionary to store the translated amino acid sequences
    aa_dict = {}
    codon_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

    # loop through each DNA sequence in the dictionary
    for name, sequence in dna_dict.items():
        # initialize the translated sequence as an empty string
        aa_sequence = ""
        # loop through the sequence, reading three bases at a time
        for i in range(0, len(sequence), 3):
            # extract the current codon as a plain string
            codon = str(sequence[i:i+3])

            # stop codons are not keys of forward_table, so check for them first
            # and end the translation at the first one
            if codon in codon_table.stop_codons:
                break
            # look up the corresponding amino acid and add it to the translated sequence
            aa_sequence += codon_table.forward_table[codon]
        # store the translated sequence in the dictionary
        aa_dict[name] = aa_sequence

    return aa_dict


In [303]:
conda install biopython


Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.

Retrieving notices: ...working... done

Note: you may need to restart the kernel to use updated packages.
