# BF527: Applications in Bioinformatics

>**Note:** Please submit the Jupyter notebook through Blackboard. Your code should follow the guidelines laid out in class, including commenting. Partial credit will be given for nonfunctional code that is logical and well commented. This assignment must be completed on your own.

## Homework 7

### See [Blackboard](https://learn.bu.edu) for assignment and due dates

---

## Problem 7.1 (40%):

Explore the gene expression dataset in the **Gene Expression Omnibus (GEO)** with the accession **GSE4115**. In this study, the authors compared gene expression in histologically normal bronchial epithelium between 79 samples from smokers with lung cancer and 73 samples from smokers without lung cancer. The goal of the study was to identify a diagnostic gene expression profile that can distinguish samples from smokers with lung cancer from samples from smokers without it.

Use GEO to identify the number of genes differentially expressed in smokers with lung cancer versus smokers without lung cancer. Use a two-tailed t-test with a significance level of 0.05.
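To make the statistic concrete, here is a minimal sketch of a per-gene two-tailed t-test at a significance level of 0.05. It assumes an equal-variance (Student's) test, and GEO's own tools may use a different variant or apply additional corrections. The gene names and expression values are made up for illustration, and the helper names (`t_pdf`, `two_tailed_p`, `students_t_test`) are hypothetical. The p-value is obtained by numerically integrating the t density with Simpson's rule so that only the standard library is needed.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value: 1 - 2 * (area under the density from 0 to |t|)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps is even, as Simpson's rule requires
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)


def students_t_test(group_a, group_b):
    """Equal-variance two-sample t-test; returns (t statistic, two-tailed p)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (mean(group_a) - mean(group_b)) / std_err
    return t_stat, two_tailed_p(t_stat, df)


# Made-up log-expression values: (cancer samples, non-cancer samples)
expression = {
    "GENE_A": ([8.1, 8.4, 8.0, 8.6, 8.3], [7.0, 7.2, 6.9, 7.4, 7.1]),
    "GENE_B": ([5.0, 5.6, 4.8, 5.3, 5.1], [5.2, 4.9, 5.4, 5.0, 5.5]),
}

ALPHA = 0.05
significant = []
for gene, (cancer, no_cancer) in expression.items():
    t_stat, p_value = students_t_test(cancer, no_cancer)
    if p_value < ALPHA:
        significant.append(gene)
    print(gene, round(t_stat, 3), p_value)

print("Differentially expressed at alpha = 0.05:", significant)
```

With thousands of probes tested at 0.05, some genes will pass the threshold by chance alone, which is worth keeping in mind when interpreting the count.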

Use a web tool of your choice to identify any significant pathways or biological processes that may be affected in lung cancer patients. Do you find anything interesting? Do the results make sense? What would you do next as a follow-up experiment (bioinformatics or biology)?
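Many enrichment tools report some form of over-representation statistic. As a rough sketch of the idea, and not the method of any particular tool, the following computes a one-sided hypergeometric p-value for the overlap between a gene list and a pathway. The function name `enrichment_p_value` and all of the numbers are assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(population, pathway_size, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, selected)
    favourable = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(population, selected)


# Made-up numbers: 20,000 genes measured, 100 in the pathway,
# 500 differentially expressed, 10 of which fall in the pathway.
p_value = enrichment_p_value(20000, 100, 500, 10)
print(p_value)
```

Here about 2.5 pathway genes would be expected by chance (500 × 100 / 20000), so an overlap of 10 yields a small p-value.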

---

## Problem 7.2 (60%):

Mitogen-activated protein kinase 6 (MAPK6) is an enzyme that is a member of the Ser/Thr protein kinase family. MAPK6 and other MAP kinases are extracellular signal-regulated kinases, which are activated through protein phosphorylation. MAPK6 is known to contain one protein kinase domain (Pkinase), located at the N-terminus. This Pkinase domain is a linear motif-binding (LMB) domain and is known to bind the following short linear motifs (SLiMs):

<table>
  <tr>
    <th>LMB Domain</th>
    <th>SLiM</th> 
  </tr>
  <tr>
    <td rowspan="7">Pkinase</td>
    <td>N.E.K..N</td> 
  </tr>
  <tr>
    <td>N.Y....E</td>
  </tr>
  <tr>
    <td>S...D.PL</td>
  </tr>
  <tr>
    <td>S..SS</td>
  </tr>
  <tr>
    <td>S.S..S</td>
  </tr>
  <tr>
    <td>ST.S</td>
  </tr>
  <tr>
    <td>F.FP</td>
  </tr>
</table>
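The SLiMs in the table can be read directly as regular expressions, where `.` stands for any single residue. A minimal sketch, using a hypothetical peptide that is not taken from the assignment's FASTA file:

```python
import re

# SLiMs from the table above; '.' matches any one residue
SLIMS = ['N.E.K..N', 'N.Y....E', 'S...D.PL', 'S..SS', 'S.S..S', 'ST.S', 'F.FP']

# Hypothetical peptide, chosen only to illustrate the matching
peptide = "MKSTASGGFSFPQQ"

hits = {}
for slim in SLIMS:
    match = re.search(slim, peptide)  # first occurrence of this SLiM, if any
    if match:
        hits[slim] = (match.start(), match.group())

print(hits)  # {'ST.S': (2, 'STAS'), 'F.FP': (8, 'FSFP')}
```

The reported start positions are 0-based, as is usual for Python string indices.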

Write a Python script that integrates information about MAPK6’s interaction partners, their sequences, and the known motifs that the Pkinase domain binds, to determine how many of MAPK6’s interactions are mediated by an LMB domain-motif interaction. The file ```HW7.2_uniprot_proteins.fasta``` (available on Blackboard) is a FASTA file that contains proteins that are known to interact with MAPK6. Your code should print: **(1)** the UniProt ID of the binding partner; **(2)** the motif found in the binding partner; and **(3)** the location of the motif in the binding partner’s sequence. There may be more than one motif within one protein. There may also be no motifs in a protein.
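One detail to be aware of when a protein contains several motifs: `re.finditer` returns only non-overlapping matches, and `span()` is 0-based with an exclusive end. The sketch below shows one possible way to report overlapping occurrences with 1-based inclusive positions by wrapping the motif in a lookahead. The helper name `find_motif_hits` and the test sequence are hypothetical, and whether overlapping hits should be counted is a judgment call, not a requirement stated in the assignment.

```python
import re


def find_motif_hits(sequence, motif):
    """Return (start, end, matched_text) for every occurrence of motif,
    overlaps included, using 1-based inclusive coordinates."""
    # A zero-width lookahead lets the scan advance one residue at a time,
    # so a match that starts inside a previous match is still reported.
    pattern = re.compile(f"(?=({motif}))")
    hits = []
    for match in pattern.finditer(sequence):
        start = match.start(1)   # 0-based start of the captured motif
        text = match.group(1)
        hits.append((start + 1, start + len(text), text))
    return hits


# Hypothetical sequence with two overlapping S..SS occurrences
sequence = "SSLSSSS"
plain = [m.span() for m in re.finditer("S..SS", sequence)]
overlapping = find_motif_hits(sequence, "S..SS")

print(plain)        # [(0, 5)]
print(overlapping)  # [(1, 5, 'SSLSS'), (2, 6, 'SLSSS')]
```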


In [4]:
# Write your script here
import re

# Known SLiMs bound by the Pkinase domain; '.' matches any single residue
SLIMS = ['N.E.K..N', 'N.Y....E', 'S...D.PL', 'S..SS', 'S.S..S', 'ST.S', 'F.FP']


def search_motif(sequence):
    '''
    Param: sequence: protein sequence
    Return: None
    Print each motif found in the binding partner and the location of
    every match in the binding partner's sequence, or "None" if no motif matches.
    '''
    found = False  # becomes True once any motif matches this sequence
    for motif in SLIMS:  # loop through SLiMs
        # collect all non-overlapping matches of this motif in the sequence
        matches = list(re.finditer(motif, sequence))
        if matches:
            found = True
            print(motif)
            print('Location', '\t', 'Motif')
            for match in matches:
                # span() is a 0-based (start, end) pair with an exclusive end
                print(match.span(), '\t', match.group())
    if not found:
        print("None")


# Read the FASTA file and build a dictionary mapping Uniprot ID -> sequence
proteins = {}
with open('HW7.2_uniprot_proteins.fasta', 'r') as fasta:
    for line in fasta:
        if line.startswith('>'):
            # header line: the Uniprot ID is the field between the first two '|'
            ID = re.search(r"(?<=\|).+?(?=\|)", line).group()
            proteins[ID] = ''  # initialize the sequence for this ID
        else:
            # sequence line: append it to the current protein
            proteins[ID] += line.strip()

for ID, sequence in proteins.items():
    print("Uniprot ID:", ID)  # print Uniprot ID
    search_motif(sequence)  # search motifs in the corresponding sequence
    print()

Uniprot ID: P30153
None

Uniprot ID: P01023
S.S..S
Location 	 Motif
(60, 66) 	 SASLES
ST.S
Location 	 Motif
(781, 785) 	 STAS
F.FP
Location 	 Motif
(189, 193) 	 FSFP

Uniprot ID: Q9Y478
S..SS
Location 	 Motif
(176, 181) 	 SELSS

Uniprot ID: P42025
None

Uniprot ID: P63261
None

Uniprot ID: P49418
None

Uniprot ID: Q12955
S..SS
Location 	 Motif
(1522, 1527) 	 SLSSS
(1555, 1560) 	 STTSS
(1617, 1622) 	 STFSS
(1697, 1702) 	 SSLSS
(1802, 1807) 	 SLGSS
(1813, 1818) 	 SVTSS
(1869, 1874) 	 SRTSS
(1889, 1894) 	 STPSS
(2421, 2426) 	 SRPSS
(2624, 2629) 	 SPTSS
(2667, 2672) 	 SLPSS
(2712, 2717) 	 SKLSS
(3481, 3486) 	 SKSSS
S.S..S
Location 	 Motif
(910, 916) 	 SASLRS
(1522, 1528) 	 SLSSSS
(1541, 1547) 	 SVSTPS
(1615, 1621) 	 SNSTFS
(2622, 2628) 	 SQSPTS
(2994, 3000) 	 SQSSMS
(3481, 3487) 	 SKSSSS
(4036, 4042) 	 SLSETS
ST.S
Location 	 Motif
(1543, 1547) 	 STPS
(1555, 1559) 	 STTS
(1617, 1621) 	 STFS
(1889, 1893) 	 STPS

Uniprot ID: Q9UJX4
ST.S
Location 	 Motif
(269, 273) 	 STHS

Uniprot ID: Q99767
S

---

## EXTRA CREDIT (5 points):

Watch the three webinars hosted by George Church on YouTube (~20 minutes):

1. http://www.youtube.com/watch?v=mVZI7NBgcWM
2. http://www.youtube.com/watch?v=2r9DpthvNKM
3. http://www.youtube.com/watch?v=mgXAO8pv-X4

Discuss the potential benefits/detriments of getting your genome sequenced and the potential benefits/detriments of making your genome sequence public for all to see.

Would you get your genome sequenced? Would you make it public? Why or why not?