## Analyzing potential homologs of Saccharomyces cerevisiae ACC1/YNR016C for phosphorylation sites

This notebook uses [biopython]() and [pydna]() to search the 1000 sequences most similar to S. cerevisiae Acc1p for sequences that lack phosphorylation sites. Acetyl CoA carboxylases without these 


#### Choi 2014

"AMPK phosphorylation target motif, which is a hydrophobic residue (M,L,F,I or V) for P-5 and P + 4, and basic residues (R,K or H) for P-3 or P-4 (Dale 1995; Scott 2002; Weekes 1993)."

    (M|L|F|I|V).(R|H|K)..S...(M|L|F|I|V)              regex pattern #1
    (M|L|F|I|V)(R|H|K)...S...(M|L|F|I|V)              regex pattern #2
    
    (M|L|F|I|V)(.(R|H|K)..|(R|H|K)...)S...(M|L|F|I|V) regex combining #1&#2
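
The way the combined pattern covers both placements of the basic residue can be checked on a few short strings. The sketch below is only an illustration: the peptides are made up for the purpose (they are not real protein sequences), and the names `HYD`, `BAS`, `toy_peptides`, `hits` and `agree` are not part of the notebook.

```python
import re

HYD = "[MLFIV]"   # hydrophobic residue at P-5 and P+4
BAS = "[RHK]"     # basic residue at P-3 or P-4

pattern_1 = re.compile(f"{HYD}.{BAS}..S...{HYD}")   # basic residue at P-3
pattern_2 = re.compile(f"{HYD}{BAS}...S...{HYD}")   # basic residue at P-4
combined = re.compile("(M|L|F|I|V)(.(R|H|K)..|(R|H|K)...)S...(M|L|F|I|V)")

# Made-up toy peptides, not real protein sequences.
toy_peptides = {
    "basic_at_P-3": "GGLARAASAAALGG",
    "basic_at_P-4": "GGMKAAASAAAVGG",
    "no_basic":     "GGLAAAASAAALGG",
    "no_serine":    "GGLARAATAAALGG",
}

# Does the combined pattern find a site in each peptide?
hits = {name: bool(combined.search(pep)) for name, pep in toy_peptides.items()}

# The combined pattern should agree with "pattern #1 or pattern #2".
agree = all(
    bool(combined.search(pep)) == bool(pattern_1.search(pep) or pattern_2.search(pep))
    for pep in toy_peptides.values()
)

print(hits)
print(agree)
```

The character class `[MLFIV]` matches the same residues as the alternation `(M|L|F|I|V)`; it is used here only to keep the two single patterns short.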

#### Shi 2014

    Hyd-X-Arg-XX-Ser-XXX-Hyd (Dale 1995)
    (M|L|F|I|V).R..S...(M|L|F|I|V)                     regex pattern #3 (special case of #1: Arg only at P-3)
    
    
    
    
    
    
#### References:

Choi JW, Da Silva NA. Improving polyketide and fatty acid synthesis by engineering of the yeast acetyl-CoA carboxylase. J Biotechnol 2014;187:56–9.
    
Dale S, Wilson WA, Edelman AM et al. Similar substrate recognition motifs for mammalian AMP-activated protein kinase, higher plant HMG-CoA reductase kinase-A, yeast SNF1, and mammalian calmodulin-dependent protein kinase I. FEBS Lett 1995;361:191–5.

Shi S, Chen Y, Siewers V et al. Improving production of malonyl coenzyme A-derived metabolites by abolishing Snf1-dependent regulation of Acc1. MBio 2014;5:e01130–14.

Scott JW, Norman DG, Hawley SA et al. Protein kinase substrate recognition studied using the recombinant catalytic domain of AMP-activated protein kinase and a model substrate. J Mol Biol 2002;317:309–23.

Weekes J, Ball KL, Caudwell FB et al. Specificity determinants for the AMP-activated protein kinase and its plant homologue analysed using synthetic peptides. FEBS Lett 1993;334:335–9.

In [145]:
import re
import collections
import pickle
from io import StringIO

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio import Entrez
from Bio import SeqIO
from Bio.SeqUtils import GC

# joblib is no longer bundled with scikit-learn (sklearn.externals.joblib was removed)
from joblib import Memory

from IPython.display import display, HTML, Markdown

import pandas as pd

from CAI import CAI

from pydna.genbank import genbank
from pygenome import sg

# Memory takes `location` in current joblib releases (formerly `cachedir`)
memory = Memory(location='./cachedir', verbose=0)
#@memory.cache
#def computation(p1, p2):

In [1]:
Number_of_sequences = 1000

In [3]:
result_handle = NCBIWWW.qblast("blastp", 
                               "nr", 
                               "AAA20073", 
                               hitlist_size=Number_of_sequences,
                               alignments = 1, 
                               expect=10.0)

with open("my_blast.xml", "w") as f:
    f.write(result_handle.read())    
    
result_handle.close()

The [my_blast.xml](my_blast.xml) file contains information on every High-scoring Segment Pair ([HSP](https://en.wikipedia.org/wiki/BLAST#Algorithm)) for each of the most similar sequences, together with the parameters that were used in the analysis.

Biopython has a [parser](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc91) that can be used to extract [ACCESSION](https://en.wikipedia.org/wiki/Accession_number_%28bioinformatics%29) numbers for all the results:

In [4]:
result_handle = open("my_blast.xml")

blast_record = NCBIXML.read(result_handle)

accessions = []

for h in blast_record.alignments:
    accessions.append(h.accession)

result_handle.close()

In [5]:
len(accessions)

1000

The first five ACCESSION numbers:

In [6]:
accessions[:5]

['AAA20073', 'NP_014413', 'AJT05884', 'EDN62822', 'AJT27522']

Now we would like to get the sequences for all of the ACCESSION numbers.
This is trickier than we might expect, due to the design of the database.

We have to make a string containing all of the accession numbers separated by spaces:

In [7]:
query  = " ".join(accessions)
query[:60]

'AAA20073 NP_014413 AJT05884 EDN62822 AJT27522 AJT07736 AJT31'

We have to set a variable to say how many results we want.
We call this variable retmax. We should set it to as many sequences as we have in the accessions list:

In [8]:
retmax = len(accessions)

We will use functionality of the Entrez module.
We have to tell Genbank who we are when we use their service.
We can do this by setting the Entrez.email variable.

First we need the gi number associated with each accession number.

We will use the biopython wrapper for the Entrez E-utilities server programs.
Here is a tutorial from [NCBI](http://www.ncbi.nlm.nih.gov/books/NBK25501/).

[Entrez.esearch](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc112) can be used to search the E-utilities programs.

The code below runs a search on Entrez whose results we can fetch later:

In [13]:
len(query)

10393

In [15]:

Entrez.email = "bjornjobb@gmail.com"

query  = " ".join(accessions)
retmax=1000
handle = Entrez.esearch( db="protein",term=query,retmax=retmax , usehistory="y")
giList = Entrez.read(handle)['IdList']

In [17]:
len(giList)

1000

In [18]:
handle = Entrez.epost(db="protein", id=",".join(giList), rettype="fasta", retmode="text")

result = Entrez.read(handle)

search_results = result

webenv, query_key  = search_results["WebEnv"], search_results["QueryKey"]

Below, we download the results in batches of 100 sequences.

In [20]:
batchSize = 100
db = "protein"

# Open the output file once in write mode so that re-running the cell
# does not append duplicate records.
with open("genbank_sequences.gb", "w") as f:
    for start in range(0, len(giList), batchSize):
        handle = Entrez.efetch(db=db,
                               rettype="gb",
                               retstart=start,
                               retmax=batchSize,
                               webenv=webenv,
                               query_key=query_key)
        f.write(handle.read())
        handle.close()
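
The batching itself is just a walk over start offsets. As a minimal sketch, the hypothetical helper `batch_windows` (not part of Biopython or of this notebook) lists the start offset and the number of records each request would cover, including a shorter last batch when the total is not a multiple of the batch size.

```python
def batch_windows(total, batch_size):
    """Yield (start, count) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


# 1000 records in batches of 100, as in the download loop
windows = list(batch_windows(1000, 100))
print(len(windows), windows[0], windows[-1])

# A total that is not a multiple of the batch size ends with a short batch
uneven = list(batch_windows(250, 100))
print(uneven)
```

In the download loop, `start` plays the role of `retstart` and the batch size is passed as `retmax`.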

File with all the results (This file can be big!):

[genbank_sequences](genbank_sequences.gb)

In [2]:
# The motif combines regex patterns #1 and #2 (basic residue at P-3 or P-4).
kinase_site = re.compile("((M|L|F|I|V)(.(R|H|K)..|(R|H|K)...)S...(M|L|F|I|V))")

seqs_wo_phos = []
seqs_w_phos = []

with open("genbank_sequences.gb", "r") as f:
    sequences = SeqIO.parse(f, "gb")

    # The first record is the query itself (S. cerevisiae Acc1p).
    ScACC1p = next(sequences)

    for i, s in enumerate(sequences):
        if kinase_site.search(str(s.seq).upper()):
            seqs_w_phos.append(s)
        else:
            # No putative phosphorylation site found in this sequence.
            print(i, s.id)
            seqs_wo_phos.append(s)

556 XP_002174469.1
559 KFH45103.1
625 CDS10010.1
627 RCI05537.1
628 CEP09288.1
629 ORZ01680.1
630 CDH52590.1
660 KDE05599.1
661 SGZ26149.1
681 POY75327.1
700 KWU41533.1
704 XP_018270360.1
714 SCV70467.1
800 KDQ18913.1
885 XP_025349007.1
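
The scan above only records whether a sequence has a site at all. For the sequences that do have one, it can be useful to know where the candidate serine sits. The following is a sketch under assumptions: the named group `ser`, the helper `serine_positions` and the toy peptide are introduced here for illustration and do not appear in the notebook.

```python
import re

# Same motif as kinase_site, with a named group marking the serine.
site = re.compile(r"[MLFIV](?:.[RHK]..|[RHK]...)(?P<ser>S)...[MLFIV]")


def serine_positions(protein):
    """Return 1-based positions of the Ser in each non-overlapping motif match."""
    return [m.start("ser") + 1 for m in site.finditer(protein.upper())]


# Made-up peptide containing two motifs, not a real protein sequence.
toy = "GGLARAASAAALGGGGMKAAASAAAVGG"
positions = serine_positions(toy)
print(positions)
```

Note that `finditer` reports non-overlapping matches only, so two motifs that share residues would be counted once.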


In [31]:
seqs_to_analyze = [ScACC1p]+seqs_wo_phos

In [32]:
for seq in seqs_to_analyze:
    display(HTML(f'<a href="https://www.ncbi.nlm.nih.gov/protein/{seq.id}">{seq.description}</a>'))

In [33]:
for seq in seqs_to_analyze:
    print(seq.format("fasta"))

>AAA20073.1 acetyl-CoA carboxylase [Saccharomyces cerevisiae]
MSEESLFESSPQKMEYEITNYSERHTELPGHFIGLNTVDKLEESPLRDFVKSHGGHTVIS
KILIANNGIAAVKEIRSVRKWAYETFGDDRTVQFVAMATPEDLEANAEYIRMADQYIEVP
GGTNNNNYANVDLIVDIAERADVDAVWAGWGHASENPLLPEKLSQSKRKVIFIGPPGNAM
RSLGDKISSTIVAQSAKVPCIPWSGTGVDTVHVDEKTGLVSVDDDIYQKGCCTSPEDGLQ
KAKRIGFPVMIKASEGGGGKGIRQVEREEDFIALYHQAANEIPGSPIFIMKLAGRARHLE
VQLLADQYGTNISLFGRDCSVQRRHQKIIEEAPVTIAKAETFHEMEKAAVRLGKLVGYVS
AGTVEYLYSHDDGKFYFLELNPRLQVEHPTTEMVSGVNLPAAQLQIAMGIPMHRISDIRT
LYGMNPHSASEIDFEFKTQDATKKQRRPIPKGHCTACRITSEDPNDGFKPSGGTLHELNF
RSSSNVWGYFSVGNNGNIHSFSDSQFGHIFAFGENRQASRKHMVVALKELSIRGDFRTTV
EYLIKLLETEDFEDNTITTGWLDDLITHKMTAEKPDPTLAVICGAATKAFLASEEARHKY
IESLQKGQVLSKDLLQTMFPVDFIHEGKRYKFTVAKSGNDRYTLFINGSKCDIILRQLSD
GGLLIAIGGKSHTIYWKEEVAATRLSVDSMTTLLEVENDPTQLRTPSPGKLVKFLVENGE
HIIKGQPYAEIEVMKMQMPLVSQENGIVQLLKQPGSTIVAGDIMAIMTLDDPSKVKHALP
FEGMLPDFGSPVIEGTKPAYKFKSLVSTLENILKGYDNQVIMNASLQQLIEVLRNPKLPY
SEWKLHISALHSRLPAKLDEQMEELVARSLRRGAVFPARQLSKLIDMAVKNPEYNPDKLL
GAVVEPLADIAHKYSNGLEAHEH

The sequences above were aligned using T-Coffee (http://tcoffee.crg.cat/).



In [34]:
Entrez.email = "bjornjobb@gmail.com"

epost_1 = Entrez.read(Entrez.epost("protein", id=",".join(s.id for s in seqs_to_analyze)))

webenv = epost_1["WebEnv"]

query_key = epost_1["QueryKey"]

iden_prots = Entrez.efetch(db="protein", 
                           rettype='ipg', 
                           retmode='text', 
                           webenv=epost_1["WebEnv"], 
                           query_key=epost_1["QueryKey"])

In [11]:
df = pd.read_csv(iden_prots, sep="\t")


In [13]:
df.insert(0, 'Candidate#', range(len(df)))
df

Unnamed: 0,Candidate#,Id,Source,Nucleotide Accession,Start,Stop,Strand,Protein,Protein Name,Organism,Strain,Assembly
0,0,382261,INSDC,M92156.1,1700.0,8413.0,+,AAA20073.1,acetyl-CoA carboxylase,Saccharomyces cerevisiae,,
1,1,382261,PAT,,,,,AAB75480.1,Sequence 1 from patent US 5641666,Unknown,,
2,2,15864892,RefSeq,NW_011627861.1,847281.0,854150.0,-,XP_002174469.1,acetyl-CoA/biotin carboxylase,Schizosaccharomyces japonicus yFS275,yFS275,GCF_000149845.2
3,3,15864892,RefSeq,XM_002174433.1,1.0,6870.0,+,XP_002174469.1,acetyl-CoA/biotin carboxylase,Schizosaccharomyces japonicus yFS275,yFS275,
4,4,15864892,INSDC,KE651167.1,847281.0,854150.0,-,EEB08176.1,acetyl-CoA/biotin carboxylase,Schizosaccharomyces japonicus yFS275,yFS275,GCA_000149845.2
5,5,61353600,INSDC,JPKY01000037.1,132082.0,139128.0,+,KFH45103.1,Acetyl-CoA carboxylase-like protein,Acremonium chrysogenum ATCC 11550,ATCC 11550,GCA_000769265.1
6,6,61160175,INSDC,LK023335.1,908916.0,915755.0,+,CDS10010.1,hypothetical protein,Lichtheimia ramosa,JMRC FSU:6197,GCA_000945115.1
7,7,197118838,INSDC,PJQM01000363.1,69.0,6876.0,+,RCI05537.1,acetyl-coenzyme-A carboxylase,Rhizopus stolonifer,LSU 92-RS-03,GCA_003325415.1
8,8,75891099,INSDC,LN721335.1,5845.0,12691.0,-,CEP09288.1,hypothetical protein,Parasitella parasitica,CBS 412.66,GCA_000938895.1
9,9,147690390,INSDC,MCGN01000002.1,4251138.0,4257717.0,+,ORZ01680.1,acetyl-CoA carboxylase,Syncephalastrum racemosum,NRRL 2496,GCA_002105135.1


In [14]:
df.to_pickle("identical_proteins_table.pickle")

In [5]:
df = pd.read_pickle("identical_proteins_table.pickle")

In [6]:
Candidate = df["Candidate#"]
Protein = df["Protein"]
Nucleotide_Accession = df["Nucleotide Accession"]
Start = df["Start"]
Stop  = df["Stop"]
Strand = df["Strand"]

In [7]:

genetuple = collections.namedtuple('genetuple', 'candidate sequence')

genes=[]

for cnd,pro,nac,sta,sto,str_ in zip(Candidate,Protein,Nucleotide_Accession,Start,Stop,Strand):
    
    # Skip rows without a nucleotide accession: NaN is the only value that is not equal to itself.
    # https://stackoverflow.com/questions/944700/how-can-i-check-for-nan-in-python
    if nac != nac:
        continue
    
    gene = genbank(nac, seq_start=int(sta),seq_stop=int(sto),strand=str_)

    genes.append( genetuple(cnd, gene) )
    
    with open("genbank_nucelotide_sequences.gb", "a+") as f:
        f.write(gene.format()+"\n\n\n")
    
    print(f"{cnd:<10}{pro:<15}{nac:<15}{sta:<15}{sto:<15}{str_:<15}")

0         AAA20073.1     M92156.1       1700.0         8413.0         +              
2         XP_002174469.1 NW_011627861.1 847281.0       854150.0       -              
3         XP_002174469.1 XM_002174433.1 1.0            6870.0         +              
4         EEB08176.1     KE651167.1     847281.0       854150.0       -              
5         KFH45103.1     JPKY01000037.1 132082.0       139128.0       +              
6         CDS10010.1     LK023335.1     908916.0       915755.0       +              
7         RCI05537.1     PJQM01000363.1 69.0           6876.0         +              
8         CEP09288.1     LN721335.1     5845.0         12691.0        -              
9         ORZ01680.1     MCGN01000002.1 4251138.0      4257717.0      +              
10        CDH52590.1     CBTN010000014.177974.0        84845.0        -              
11        KDE05599.1     GL541683.1     1146.0         8865.0         -              
12        SGZ26149.1     FQNC01000086.1 184978.0      
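
The self-inequality test used in the loop to skip rows without a nucleotide accession relies on a property of floating point numbers: NaN compares unequal to everything, including itself. A minimal sketch of the same check written more explicitly is given below; `is_missing` is a hypothetical helper, not something defined in the notebook.

```python
import math

nan = float("nan")


def is_missing(value):
    """True for a float NaN; strings such as accession numbers are never missing."""
    return isinstance(value, float) and math.isnan(value)


print(nan != nan)               # the self-inequality trick
print(is_missing(nan))          # the explicit check
print(is_missing("M92156.1"))   # a real accession string from the table
```

The `isinstance` guard matters because `math.isnan` raises a `TypeError` when given a string.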

In [125]:
len(genes)

22

In [128]:
with open("genes.pickle", "wb") as f:
    pickle.dump(genes, f)

In [130]:
with open("genes.pickle", "rb") as f:
    genes = pickle.load(f)

In [132]:
mrnas = []
for candidate, gene in genes:
    cds_features = [f for f in gene.features if f.type=="CDS"]
    cds_feature  = cds_features[0]
    mrna = cds_feature.extract(gene)
    mrna.id = gene.id
    mrnas.append(genetuple(candidate, mrna))

In [133]:
sharp_table_1 = """
AA Tri RSCUe we RSCUy wy
Phe TTT 0.456 0.296 0.203 0.113
Phe TTC 1.544 1.000 1.797 1.000
Leu TTA 0.106 0.020 0.601 0.117
Leu TTG 0.106 0.020 5.141 1.000
Leu CTT 0.225 0.042 0.029 0.006
Leu CTC 0.198 0.037 0.014 0.003
Leu CTA 0.040 0.007 0.200 0.039
Leu CTG 5.326 1.000 0.014 0.003
Ile ATT 0.466 0.185 1.352 0.823
Ile ATC 2.525 1.000 1.643 1.000
Ile ATA 0.008 0.003 0.005 0.003
Met ATG 1.000 1.000 1.000 1.000
Val GTT 2.244 1.000 2.161 1.000
Val GTC 0.148 0.066 1.796 0.831
Val GTA 1.111 0.495 0.004 0.002
Val GTG 0.496 0.221 0.039 0.018
Ser TCT 2.571 1.000 3.359 1.000
Ser TCC 1.912 0.744 2.327 0.693
Ser TCA 0.198 0.077 0.122 0.036
Ser TCG 0.044 0.017 0.017 0.005
Pro CCT 0.231 0.070 0.179 0.047
Pro CCC 0.038 0.012 0.036 0.009
Pro CCA 0.442 0.135 3.776 1.000
Pro CCG 3.288 1.000 0.009 0.002
Thr ACT 1.804 0.965 1.899 0.921
Thr ACC 1.870 1.000 2.063 1.000
Thr ACA 0.141 0.076 0.025 0.012
Thr ACG 0.185 0.099 0.013 0.006
Ala GCT 1.877 1.000 3.005 1.000
Ala GCC 0.228 0.122 0.948 0.316
Ala GCA 1.099 0.586 0.044 0.015
Ala GCG 0.796 0.424 0.004 0.001
Tyr TAT 0.386 0.239 0.132 0.071
Tyr TAC 1.614 1.000 1.868 1.000
His CAT 0.451 0.291 0.394 0.245
His CAC 1.549 1.000 1.606 1.000
Gln CAA 0.220 0.124 1.987 1.000
Gln CAG 1.780 1.000 0.013 0.007
Asn AAT 0.097 0.051 0.100 0.053
Asn AAC 1.903 1.000 1.900 1.000
Lys AAA 1.596 1.000 0.237 0.135
Lys AAG 0.404 0.253 1.763 1.000
Asp GAT 0.605 0.434 0.713 0.554
Asp GAC 1.395 1.000 1.287 1.000
Glu GAA 1.589 1.000 1.968 1.000
Glu GAG 0.411 0.259 0.032 0.016
Cys TGT 0.667 0.500 1.857 1.000
Cys TGC 1.333 1.000 0.143 0.077
Trp TGG 1.000 1.000 1.000 1.000
Arg CGT 4.380 1.000 0.718 0.137
Arg CGC 1.561 0.356 0.008 0.002
Arg CGA 0.017 0.004 0.008 0.002
Arg CGG 0.017 0.004 0.008 0.002
Ser AGT 0.220 0.085 0.070 0.021
Ser AGC 1.055 0.410 0.105 0.031
Arg AGA 0.017 0.004 5.241 1.000
Arg AGG 0.008 0.002 0.017 0.003
Gly GGT 2.283 1.000 3.898 1.000
Gly GGC 1.652 0.724 0.077 0.020
Gly GGA 0.022 0.010 0.009 0.002
Gly GGG 0.043 0.019 0.017 0.004"""

In [134]:
#[line.split() for line in sharp_table_1.strip().splitlines()]

df = pd.read_csv(StringIO(sharp_table_1.strip()), sep=" ")

In [135]:
df

Unnamed: 0,AA,Tri,RSCUe,we,RSCUy,wy
0,Phe,TTT,0.456,0.296,0.203,0.113
1,Phe,TTC,1.544,1.000,1.797,1.000
2,Leu,TTA,0.106,0.020,0.601,0.117
3,Leu,TTG,0.106,0.020,5.141,1.000
4,Leu,CTT,0.225,0.042,0.029,0.006
5,Leu,CTC,0.198,0.037,0.014,0.003
6,Leu,CTA,0.040,0.007,0.200,0.039
7,Leu,CTG,5.326,1.000,0.014,0.003
8,Ile,ATT,0.466,0.185,1.352,0.823
9,Ile,ATC,2.525,1.000,1.643,1.000


In [136]:
RSCU_sharp = dict(zip(df["Tri"],df["RSCUy"]))

In [141]:
GAL4str = str(sg.stdgene["GAL4"].cds.seq)
PPR1str = str(sg.stdgene["PPR1"].cds.seq)
GPD1str = str(sg.stdgene["TDH3"].cds.seq)
CAI(GAL4str,RSCUs=RSCU_sharp), CAI(PPR1str,RSCUs=RSCU_sharp), CAI(GPD1str,RSCUs=RSCU_sharp)

(0.11578464288185483, 0.11473527278307559, 0.9242336846040357)
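
To make the numbers above easier to interpret: the codon adaptation index is the geometric mean of the relative adaptiveness values w of the codons in a sequence, where w of a codon is its RSCU divided by the largest RSCU among its synonymous codons. The sketch below recomputes this by hand for two amino acids using yeast RSCU values copied from the table above. It is an illustration only, not the implementation used by the `CAI` package, and `toy_cai` is a hypothetical name.

```python
import math

# Yeast RSCU values for two amino acids, copied from the RSCUy column above.
rscu = {"TTT": 0.203, "TTC": 1.797,   # Phe
        "AAA": 0.237, "AAG": 1.763}   # Lys
synonymous = [("TTT", "TTC"), ("AAA", "AAG")]

# Relative adaptiveness: w = RSCU / max RSCU among synonymous codons.
w = {}
for family in synonymous:
    best = max(rscu[codon] for codon in family)
    for codon in family:
        w[codon] = rscu[codon] / best


def toy_cai(cds):
    """Geometric mean of w over the codons of `cds` (only Phe and Lys codons supported)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return math.exp(sum(math.log(w[codon]) for codon in codons) / len(codons))


print(round(w["TTT"], 3))            # compare with the wy column for TTT
print(toy_cai("TTCAAG"))             # only preferred codons
print(toy_cai("TTTAAA"))             # only non-preferred codons
```

A sequence made only of preferred codons scores 1, and every non-preferred codon pulls the index down in proportion to its w value.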

In [142]:


table = []

for candidate, mrna in mrnas:
    table.append([ candidate,
                   mrna.id, 
                   round( CAI( str(mrna.seq), RSCUs=RSCU_sharp),3), 
                   round(GC(str(mrna.seq)),3),
                   f"https://www.ncbi.nlm.nih.gov/nuccore/{mrna.id}"
                 ])

In [143]:
df2 = pd.DataFrame(table, columns=["Candidate#", "Nucleotide Accession", "CAI", "GC%", "url"])

In [144]:
def make_clickable(val):
    return '<a href="{}">{}</a>'.format(val, val)

df2.style.format({'url': make_clickable})

Unnamed: 0,Candidate#,Nucleotide Accession,CAI,GC%,url
0,0,M92156.1,0.327,40.84,https://www.ncbi.nlm.nih.gov/nuccore/M92156.1
1,2,NW_011627861.1,0.182,46.885,https://www.ncbi.nlm.nih.gov/nuccore/NW_011627861.1
2,3,XM_002174433.1,0.182,46.885,https://www.ncbi.nlm.nih.gov/nuccore/XM_002174433.1
3,4,KE651167.1,0.182,46.885,https://www.ncbi.nlm.nih.gov/nuccore/KE651167.1
4,5,JPKY01000037.1,0.093,60.003,https://www.ncbi.nlm.nih.gov/nuccore/JPKY01000037.1
5,6,LK023335.1,0.177,42.846,https://www.ncbi.nlm.nih.gov/nuccore/LK023335.1
6,7,PJQM01000363.1,0.42,44.038,https://www.ncbi.nlm.nih.gov/nuccore/PJQM01000363.1
7,8,LN721335.1,0.225,47.406,https://www.ncbi.nlm.nih.gov/nuccore/LN721335.1
8,9,MCGN01000002.1,0.105,58.585,https://www.ncbi.nlm.nih.gov/nuccore/MCGN01000002.1
9,10,CBTN010000014.1,0.159,46.409,https://www.ncbi.nlm.nih.gov/nuccore/CBTN010000014.1
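
Ranking the candidates by CAI makes the comparison easier to read. The sketch below sorts only the ten rows displayed above (the values are copied from that output, and candidates not shown there are not included); `cai_by_candidate` and `ranked` are names introduced for this illustration.

```python
# (candidate number, CAI) for the rows displayed above
cai_by_candidate = [(0, 0.327), (2, 0.182), (3, 0.182), (4, 0.182), (5, 0.093),
                    (6, 0.177), (7, 0.42), (8, 0.225), (9, 0.105), (10, 0.159)]

# Highest CAI first; sorted() is stable, so tied rows keep their original order.
ranked = sorted(cai_by_candidate, key=lambda row: row[1], reverse=True)

for candidate, cai in ranked:
    print(candidate, cai)
```

On the full table the same ordering could be obtained with `df2.sort_values("CAI", ascending=False)`.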


#### Candidates 2,3,4 Schizosaccharomyces japonicus yFS275

Most similar to S. cerevisiae

https://www.ncbi.nlm.nih.gov/pubmed/21511999

#### Candidate 6 Lichtheimia ramosa strain JMRC FSU:6197

Fourth best CAI

https://www.leibniz-hki.de/en/institut-staff-details.html?member=357
https://www.leibniz-hki.de/en/institut-staff-details.html?member=90

#### Candidate 7 Rhizopus stolonifer

Better CAI than S. cerevisiae

https://en.wikipedia.org/wiki/Rhizopus_stolonifer
http://fds.duke.edu/db/aas/Biology/faculty/fungi

#### Candidate 8 Parasitella parasitica

Second best CAI

https://github.com/SabrinaEllenberger


#### Candidates 14,15,16,17 Rhodotorula sp. JG-1b & Rhodotorula graminis WP1

Oleaginous yeast https://www.sciencedirect.com/science/article/pii/S0960852412002209

http://genome.jgi.doe.gov/Rhoba1_1




Interesting links:


http://article.gmane.org/gmane.comp.python.bio.general/7974/match=elink

http://article.gmane.org/gmane.comp.python.bio.general/4495/match=elink

http://www.biostars.org/p/66921/

http://www.biostars.org/p/63506/

https://paperpile.com/shared/qlNu3j

https://paperpile.com/shared/kdaV1s

https://www.biostars.org/p/51475/

http://pbpython.com/pandas-list-dict.html

https://stackoverflow.com/questions/50209206/clickable-link-in-pandas-dataframe