# BF527: Applications in Bioinformatics

>**Note:** Please submit the Jupyter notebook through Blackboard. Your code should follow the guidelines laid out in class, including commenting. Partial credit will be given for nonfunctional code that is logical and well commented. This assignment must be completed on your own.

## Homework 8

### See [Blackboard](https://learn.bu.edu) for assignment and due dates

---

## Problem 8.1 (40%):

#### Go to the PDB website and open the page for the structure with PDB ID 3BMP.

* Use __Pfam__, UniProt, Google or Wikipedia to find some information about this protein. How long is the protein? Which superfamily does the protein belong to? What is the protein’s function, and what is the evolutionary history of the superfamily? What domains and enzymatic properties does the protein have?

#### Explore the 3D structure of “3BMP” using the "3D View" tab on the PDB website.

* Generate two informative pictures of this structure by manipulating the various style options (you can fine tune these options through the right-click menu). Include screen shots with your homework submission and explain the biological meaning of the different styles.

#### Use the other information tabs to answer the remaining questions.

* There are some “dots” buried in the structure—what do these represent? __Hint: try hovering over them with your pointer.__

* Describe the secondary structure composition of this protein. Is there a prevalence of one type of secondary structure?

* Does the protein belong to a family recognized by SCOP, CATH, and/or Pfam?

* Is the protein similar to any other human proteins? To what degree?

__Hints__: You can download a FASTA record from the PDB website. You can restrict BLAST to search only the human database.

* How was the 3D structure and view of this protein generated?

__Hint__: The "Experiment" tab on the PDB website has some information that may help here.

---

## Problem 8.2 (60%):

__Your task is to write a Python script to parse a PDB file__. A typical PDB format file contains atomic coordinates for proteins, as well as small molecules, ions and water. Each atom is entered as a line of information that starts with the keyword ATOM or HETATM. By tradition, the ATOM keyword identifies protein or nucleic acid atoms, and the HETATM keyword identifies atoms in small molecules. Following this keyword, there is a list of information about the atom, including its number in the file, its name, the name and number of the residue it belongs to, one letter to specify the chain (in oligomeric proteins), and its x, y, and z coordinates. Download the raw data for 3BMP. (__Hint: under “Download files” select "PDB Format"__.) Your Python script should do the following things:

* Open the 3BMP.pdb file in order to parse it line by line. __Hint__: PDB files can be a little hard to read because the lines have varied numbers of spaces so that the columns line up exactly in a flat file. If you open the file in a text editor, you’ll also see that it contains a LOT of different information. You are only interested in rows that begin with “ATOM”. The best way to separate individual components of a line is by slicing, e.g. to get just “ATOM” you could use line[0:4]. __Splitting on a delimiter (e.g. `'\t'`) will not work__, because the columns are fixed-width rather than delimiter-separated.
* Amino acids are made of Carbon (C), Nitrogen (N), Sulfur (S), and Oxygen (O). Count the number of C, N, S and O atoms that occur in each amino acid of the protein, as well as the total number of C, N, S and O atoms in the protein. Compute the frequency of each atom in each unique amino acid, expressed as a fraction of that atom's total count in the protein (as in the sample output below). __Remember__: the keyword for atoms in proteins (instead of small molecules) is ATOM; the HETATM keywords can be ignored. The atomic element is given as a one-letter code at the __end of the line__. The PDB file will display the x, y, z coordinates starting at amino acid #9 and continuing to amino acid #114. There will be one line per atom of the amino acid. The question you are trying to answer is: of all the C, N, S and O atoms in the protein structure, how many are in Alanine, Arginine, etc.?

Your output should look like:

```
amino acid  C     N     S     O
ARG         0.03  0.08  0.00  0.03
ASN         0.05  0.10  0.00  0.09
ASP         0.05  0.04  0.00  0.12
…etc
total:      531   142   9     156
```
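The slicing hint above can be illustrated on a single record. The following is a minimal sketch, assuming the standard fixed-column layout of a PDB `ATOM` record; the sample line and its coordinate values are made up for illustration (they are not taken from 3BMP), and `parse_atom_line` is a hypothetical helper name.

```python
# A made-up ATOM record, assembled piece by piece so the column
# boundaries are visible. The values are illustrative, not from 3BMP.
SAMPLE_LINE = (
    "ATOM  "      # columns 1-6:   record keyword
    "    1"       # columns 7-11:  atom serial number
    " "           # column 12:     unused
    " N  "        # columns 13-16: atom name
    " "           # column 17:     alternate location
    "ARG"         # columns 18-20: residue name
    " "           # column 21:     unused
    "A"           # column 22:     chain identifier
    "   9"        # columns 23-26: residue number
    "    "        # columns 27-30: insertion code + unused
    "  12.345"    # columns 31-38: x coordinate
    "  -6.789"    # columns 39-46: y coordinate
    "  23.456"    # columns 47-54: z coordinate
    "  1.00"      # columns 55-60: occupancy
    " 45.67"      # columns 61-66: temperature factor
    "          "  # columns 67-76: unused
    " N"          # columns 77-78: element symbol
)


def parse_atom_line(line):
    """Slice one ATOM/HETATM line into named fields (hypothetical helper)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20],
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_line(SAMPLE_LINE)
print(atom["res_name"], atom["res_seq"], atom["element"])
```

Python slices are zero-based and exclude the end index, so columns 18-20 of the record become `line[17:20]`.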


```python
# Write your script here

# Copy only the ATOM records from "3BMP.pdb" into "ATOM.txt"
with open("3BMP.pdb", "r") as pdb_file:
    with open("ATOM.txt", "w") as atom_file:
        for line in pdb_file:
            if line[:4] == "ATOM":
                atom_file.write(line)

# Read back the file that contains only ATOM records;
# the "with" block closes the file once the lines are loaded
with open("ATOM.txt", "r") as atom_file:
    atom_lines = atom_file.readlines()

# Collect the unique amino acids, in order of first appearance
aa_ls = []
for line in atom_lines:
    if line[17:20] not in aa_ls:
        aa_ls.append(line[17:20])

# Build a dictionary: key = amino acid, value = string of the element
# symbols of every atom that belongs to that amino acid
aa_mol = {}
for aa in aa_ls:
    aa_mol[aa] = ''
    for line in atom_lines:
        if line[17:20] == aa:
            # The element symbol is in columns 77-78 of the record
            aa_mol[aa] += line[76:78].strip()

# Join all element strings to count C, N, S and O in the entire protein
total = ''
for value in aa_mol.values():
    total += value
t_C = total.count("C")
t_N = total.count("N")
t_S = total.count("S")
t_O = total.count("O")

print("amino acid", "C", '\t', "N", '\t', "S", '\t', "O")
# For each amino acid, print its share of each element's protein-wide
# total, rounded to two decimal places
for key in aa_mol.keys():
    print(key, '\t', "%.2f" % (aa_mol[key].count("C") / t_C),
          '\t', "%.2f" % (aa_mol[key].count("N") / t_N),
          '\t', "%.2f" % (aa_mol[key].count("S") / t_S),
          '\t', "%.2f" % (aa_mol[key].count("O") / t_O))
print("total", '\t', t_C, '\t', t_N, '\t', t_S, '\t', t_O)
```

```
amino acid C 	 N 	 S 	 O
ARG 	 0.03 	 0.08 	 0.00 	 0.03
LEU 	 0.10 	 0.06 	 0.00 	 0.06
LYS 	 0.07 	 0.08 	 0.00 	 0.04
SER 	 0.05 	 0.06 	 0.00 	 0.10
CYS 	 0.04 	 0.05 	 0.78 	 0.04
HIS 	 0.06 	 0.11 	 0.00 	 0.03
PRO 	 0.07 	 0.05 	 0.00 	 0.04
TYR 	 0.08 	 0.04 	 0.00 	 0.06
VAL 	 0.10 	 0.08 	 0.00 	 0.07
ASP 	 0.05 	 0.04 	 0.00 	 0.12
PHE 	 0.05 	 0.02 	 0.00 	 0.02
GLY 	 0.02 	 0.04 	 0.00 	 0.03
TRP 	 0.04 	 0.03 	 0.00 	 0.01
ASN 	 0.05 	 0.10 	 0.00 	 0.09
ILE 	 0.05 	 0.03 	 0.00 	 0.03
ALA 	 0.03 	 0.04 	 0.00 	 0.04
GLU 	 0.05 	 0.04 	 0.00 	 0.11
THR 	 0.02 	 0.02 	 0.00 	 0.04
GLN 	 0.02 	 0.03 	 0.00 	 0.03
MET 	 0.02 	 0.01 	 0.22 	 0.01
total 	 531 	 142 	 9 	 156
```
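The script above re-scans the ATOM lines once per amino acid. As an alternative, here is a minimal sketch of the same tally done in a single pass with `collections.Counter`, followed by column-aligned output in the style of the sample table. The names `fake_record`, `tally_elements`, `element_fractions` and `format_table` are hypothetical, and the demo records are made up (with blank coordinate columns), not taken from 3BMP.

```python
from collections import Counter, defaultdict

ELEMENTS = ("C", "N", "S", "O")


def fake_record(keyword, res_name, element):
    """Build a made-up 78-column record for the demo (coordinates left blank)."""
    head = f"{keyword:<6}{1:>5} {element:<4} {res_name:>3} A{1:>4}"
    return head + " " * 50 + f"{element:>2}"


def tally_elements(lines):
    """Count C, N, S and O per residue name and overall, in one pass."""
    per_residue = defaultdict(Counter)
    totals = Counter()
    for line in lines:
        if line[0:6].strip() != "ATOM":   # skip HETATM and everything else
            continue
        element = line[76:78].strip()     # element symbol, columns 77-78
        if element in ELEMENTS:
            per_residue[line[17:20]][element] += 1
            totals[element] += 1
    return per_residue, totals


def element_fractions(per_residue, totals):
    """Share of each element's protein-wide total found in each residue type."""
    return {
        res: {el: (counts[el] / totals[el] if totals[el] else 0.0)
              for el in ELEMENTS}
        for res, counts in per_residue.items()
    }


def format_table(fractions, totals):
    """Return the table as a list of left-aligned, fixed-width text rows."""
    rows = [f"{'amino acid':<12}" + "".join(f"{el:<6}" for el in ELEMENTS)]
    for res, frac in fractions.items():
        rows.append(f"{res:<12}" + "".join(f"{frac[el]:<6.2f}" for el in ELEMENTS))
    rows.append(f"{'total:':<12}" + "".join(f"{totals[el]:<6}" for el in ELEMENTS))
    return [row.rstrip() for row in rows]


# Demo on made-up records: two residue types plus one HETATM water oxygen
demo_lines = [
    fake_record("ATOM", "ALA", "N"),
    fake_record("ATOM", "ALA", "C"),
    fake_record("ATOM", "ALA", "C"),
    fake_record("ATOM", "CYS", "S"),
    fake_record("ATOM", "CYS", "C"),
    fake_record("HETATM", "HOH", "O"),
]
per_residue, totals = tally_elements(demo_lines)
fractions = element_fractions(per_residue, totals)
table = format_table(fractions, totals)
print("\n".join(table))
```

For the real file, the same functions could be called on an open file handle, e.g. `tally_elements(open("3BMP.pdb"))`. The guard in `element_fractions` avoids a division by zero for structures that contain none of a given element, a case the posted script does not handle.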


__What does the distribution of frequencies look like? Are there any atoms that are more prevalent in one amino acid or another?__