## Calculating the Codon Adaptation Index (CAI) according to Sharp and Li 1987

The codon adaptation index (CAI) measures how well the codon usage of a gene is adapted to a given organism. It was introduced by Sharp and Li in 1987. CAI quantifies how far the codons of a protein-coding sequence deviate from the codons used in a reference set of genes.
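
As a minimal sketch of the idea, CAI in Sharp 1987 is the geometric mean of the relative adaptiveness value w of each codon in the gene, leaving out ATG and TGG, which have no synonymous alternative. The function name `cai_from_weights` and the toy sequences are illustrative assumptions, not part of any library; the four w values are taken from the yeast `wy` column of table 1.

```python
from math import exp, log


def cai_from_weights(sequence, weights):
    """Geometric mean of the w values of the codons in a coding sequence."""
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    # ATG (Met) and TGG (Trp) have no synonymous codon and are skipped,
    # as are codons missing from the weight table (e.g. stop codons).
    scored = [weights[c] for c in codons if c in weights and c not in ("ATG", "TGG")]
    return exp(sum(log(w) for w in scored) / len(scored))


# Yeast w values for four codons (wy column of Sharp 1987 table 1).
toy_weights = {"TTT": 0.113, "TTC": 1.000, "CAA": 1.000, "CAG": 0.007}

optimal = cai_from_weights("ATGTTCCAATTC", toy_weights)  # only preferred codons
rare = cai_from_weights("ATGTTTCAGTTT", toy_weights)     # only rare codons
print(round(optimal, 3), round(rare, 3))
```

A sequence built only from preferred codons scores 1, while one built from rare codons scores close to 0.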

Synthetic genes are often designed for a particular organism with an optimal CAI, which often seems to improve expression. See the [Wikipedia](https://en.wikipedia.org/wiki/Codon_Adaptation_Index) article on CAI.

There are numerous CAI tools online, but unfortunately these are 

[Biopython](http://biopython.org) has a module called [CodonUsage](http://biopython.org/DIST/docs/api/Bio.SeqUtils.CodonUsage.CodonAdaptationIndex-class.html). This module does not contain the reference data needed to calculate CAI for yeast. Worse, the fidelity of the Biopython module has been questioned in this exchange on [Biostars](https://www.biostars.org/p/290485/).

The python module [CAI](https://pypi.org/project/CAI/) is an alternative implementation. 

The objective of this notebook is to test the fidelity of this python module using the original data of 
Sharp 1987 for yeast. 

Codon adaptation data is tabulated in table 1 of the [publication](https://www.ncbi.nlm.nih.gov/pubmed/3547335).

![Sharp Table 1](sharp_table_1.png)

Some examples of CAI numbers for identifiable yeast genes are given in table 2 of the same publication.


<img src="sharp_table_2.png" alt="Drawing" style="width=200"/>

The [CAI](https://pypi.org/project/CAI/) module has a function [CAI](https://cai.readthedocs.io/en/latest/api.html#CAI.CAI) that takes as reference either a series of gene sequences or a dictionary with the triplets as keys and relative synonymous codon usage (RSCU) as values. The RSCU values are given in Sharp 1987 table 1.
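
To make the RSCU term concrete, here is a small sketch under stated assumptions: RSCU is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The helper `rscu`, the two-family codon table and the toy sequences are hypothetical and only serve as illustration; this is not how the CAI module is implemented.

```python
from collections import Counter

# Only two synonymous codon families, to keep the sketch short.
SYNONYMOUS = {"Phe": ["TTT", "TTC"], "Gln": ["CAA", "CAG"]}


def rscu(sequences, families=SYNONYMOUS):
    """Observed codon count divided by the count expected under equal usage."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the reference set
        expected = total / len(codons)
        for codon in codons:
            result[codon] = counts[codon] / expected
    return result


toy_reference = ["TTCTTCTTCTTTCAACAA", "TTCCAACAACAG"]
toy_rscu = rscu(toy_reference)
print(toy_rscu)
```

Within each amino acid the RSCU values sum to the number of synonymous codons, which is why the yeast Leu values in table 1 add up to 6.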

Table 2 lists CAI for yeast genes GAL4, PPR1 and GPD1.

The strategy will be to turn table 1 into a dict and use it to calculate CAI for [GAL4](https://www.yeastgenome.org/locus/S000006169), [PPR1](https://www.yeastgenome.org/locus/S000004004) and [GPD1](https://www.yeastgenome.org/locus/S000003424). The gene referred to as GPD1 in Sharp 1987 table 2 is most likely the TDH3 gene encoding glyceraldehyde-3-phosphate dehydrogenase (GAPDH).

### References

Sharp PM, Li WH. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 1987;15:1281–95.

[Benjamin Lee. (2017). Python Implementation of Codon Adaptation Index.](http://joss.theoj.org/papers/8adf6bd9fd6391d5343d15ea0b6b6525)


Import all dependencies as per these [tips](https://hackernoon.com/10-tips-on-using-jupyter-notebook-abc0ba7028a4)

In [11]:
from io import StringIO

import pandas as pd

from CAI import CAI

from pygenome import sg

The Sharp table 1 was copied and edited from the pdf version of the publication. The order of the columns remains the same.

In [12]:
sharp_table_1 = """
AA Tri RSCUe we RSCUy wy
Phe TTT 0.456 0.296 0.203 0.113
Phe TTC 1.544 1.000 1.797 1.000
Leu TTA 0.106 0.020 0.601 0.117
Leu TTG 0.106 0.020 5.141 1.000
Leu CTT 0.225 0.042 0.029 0.006
Leu CTC 0.198 0.037 0.014 0.003
Leu CTA 0.040 0.007 0.200 0.039
Leu CTG 5.326 1.000 0.014 0.003
Ile ATT 0.466 0.185 1.352 0.823
Ile ATC 2.525 1.000 1.643 1.000
Ile ATA 0.008 0.003 0.005 0.003
Met ATG 1.000 1.000 1.000 1.000
Val GTT 2.244 1.000 2.161 1.000
Val GTC 0.148 0.066 1.796 0.831
Val GTA 1.111 0.495 0.004 0.002
Val GTG 0.496 0.221 0.039 0.018
Ser TCT 2.571 1.000 3.359 1.000
Ser TCC 1.912 0.744 2.327 0.693
Ser TCA 0.198 0.077 0.122 0.036
Ser TCG 0.044 0.017 0.017 0.005
Pro CCT 0.231 0.070 0.179 0.047
Pro CCC 0.038 0.012 0.036 0.009
Pro CCA 0.442 0.135 3.776 1.000
Pro CCG 3.288 1.000 0.009 0.002
Thr ACT 1.804 0.965 1.899 0.921
Thr ACC 1.870 1.000 2.063 1.000
Thr ACA 0.141 0.076 0.025 0.012
Thr ACG 0.185 0.099 0.013 0.006
Ala GCT 1.877 1.000 3.005 1.000
Ala GCC 0.228 0.122 0.948 0.316
Ala GCA 1.099 0.586 0.044 0.015
Ala GCG 0.796 0.424 0.004 0.001
Tyr TAT 0.386 0.239 0.132 0.071
Tyr TAC 1.614 1.000 1.868 1.000
His CAT 0.451 0.291 0.394 0.245
His CAC 1.549 1.000 1.606 1.000
Gln CAA 0.220 0.124 1.987 1.000
Gln CAG 1.780 1.000 0.013 0.007
Asn AAT 0.097 0.051 0.100 0.053
Asn AAC 1.903 1.000 1.900 1.000
Lys AAA 1.596 1.000 0.237 0.135
Lys AAG 0.404 0.253 1.763 1.000
Asp GAT 0.605 0.434 0.713 0.554
Asp GAC 1.395 1.000 1.287 1.000
Glu GAA 1.589 1.000 1.968 1.000
Glu GAG 0.411 0.259 0.032 0.016
Cys TGT 0.667 0.500 1.857 1.000
Cys TGC 1.333 1.000 0.143 0.077
Trp TGG 1.000 1.000 1.000 1.000
Arg CGT 4.380 1.000 0.718 0.137
Arg CGC 1.561 0.356 0.008 0.002
Arg CGA 0.017 0.004 0.008 0.002
Arg CGG 0.017 0.004 0.008 0.002
Ser AGT 0.220 0.085 0.070 0.021
Ser AGC 1.055 0.410 0.105 0.031
Arg AGA 0.017 0.004 5.241 1.000
Arg AGG 0.008 0.002 0.017 0.003
Gly GGT 2.283 1.000 3.898 1.000
Gly GGC 1.652 0.724 0.077 0.020
Gly GGA 0.022 0.010 0.009 0.002
Gly GGG 0.043 0.019 0.017 0.004"""

Pandas can turn a text table into a dataframe.

In [13]:
sharp_df = pd.read_csv(StringIO(sharp_table_1.strip()), sep=" ")

The resulting dataframe seems correct.

In [14]:
sharp_df

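|   | AA  | Tri | RSCUe | we    | RSCUy | wy    |
|---|-----|-----|-------|-------|-------|-------|
| 0 | Phe | TTT | 0.456 | 0.296 | 0.203 | 0.113 |
| 1 | Phe | TTC | 1.544 | 1.000 | 1.797 | 1.000 |
| 2 | Leu | TTA | 0.106 | 0.020 | 0.601 | 0.117 |
| 3 | Leu | TTG | 0.106 | 0.020 | 5.141 | 1.000 |
| 4 | Leu | CTT | 0.225 | 0.042 | 0.029 | 0.006 |
| 5 | Leu | CTC | 0.198 | 0.037 | 0.014 | 0.003 |
| 6 | Leu | CTA | 0.040 | 0.007 | 0.200 | 0.039 |
| 7 | Leu | CTG | 5.326 | 1.000 | 0.014 | 0.003 |
| 8 | Ile | ATT | 0.466 | 0.185 | 1.352 | 0.823 |
| 9 | Ile | ATC | 2.525 | 1.000 | 1.643 | 1.000 |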


A dict is made from the `Tri` and `RSCUy` columns (the second and fifth). The `RSCUy` column contains the yeast data.

In [15]:
RSCU_sharp = dict(zip(sharp_df["Tri"],sharp_df["RSCUy"]))
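
As a quick sanity check on the copied table, the relative adaptiveness w of a codon should equal its RSCU divided by the largest RSCU among the synonymous codons for the same amino acid. The sketch below repeats a few yeast rows from the table so that it runs on its own; the variable names are illustrative.

```python
# (amino acid, codon, RSCUy, wy) rows copied from the table above.
rows = [
    ("Phe", "TTT", 0.203, 0.113),
    ("Phe", "TTC", 1.797, 1.000),
    ("Gln", "CAA", 1.987, 1.000),
    ("Gln", "CAG", 0.013, 0.007),
    ("Lys", "AAA", 0.237, 0.135),
    ("Lys", "AAG", 1.763, 1.000),
]

# Largest RSCU per amino acid.
max_rscu = {}
for aa, _, rscu_value, _ in rows:
    max_rscu[aa] = max(max_rscu.get(aa, 0.0), rscu_value)

# Compare w recomputed from RSCU with the published w column.
deviations = {}
for aa, codon, rscu_value, published_w in rows:
    computed_w = rscu_value / max_rscu[aa]
    deviations[codon] = abs(computed_w - published_w)
    print(codon, round(computed_w, 3), published_w)
```

The recomputed and published w values agree to within about 0.001, with small last-digit differences that are consistent with the table being printed to three decimals.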

In [16]:
import pickle

# Save the RSCU dict for reuse; the with block closes the file afterwards.
with open("RSCU_sharp.pickle", "wb") as f:
    pickle.dump(RSCU_sharp, f)
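
The pickled dict can be read back in another notebook with `pickle.load`. The sketch below is self-contained, so it writes a two-entry stand-in for `RSCU_sharp` to a temporary directory rather than touching the real file.

```python
import os
import pickle
import tempfile

# Two entries standing in for the full RSCU_sharp dict.
rscu_subset = {"TTT": 0.203, "TTC": 1.797}

path = os.path.join(tempfile.mkdtemp(), "RSCU_sharp.pickle")
with open(path, "wb") as handle:
    pickle.dump(rscu_subset, handle)

# Reading the file back returns an equal dict.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == rscu_subset)
```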

In [17]:
GAL4str = str(sg.stdgene["GAL4"].cds.seq)
PPR1str = str(sg.stdgene["PPR1"].cds.seq)
GPD1str = str(sg.stdgene["TDH3"].cds.seq)


For comparison, the CAI values published in Sharp 1987 table 2 are:

| Gene | CAI   |
|------|-------|
| GAL4 | 0.116 |
| PPR1 | 0.114 |
| GPD1 | 0.929 |

In [18]:
print(round(CAI(GAL4str,RSCUs=RSCU_sharp),3))
print(round(CAI(PPR1str,RSCUs=RSCU_sharp),3)) 
print(round(CAI(GPD1str,RSCUs=RSCU_sharp),3)) 

0.116
0.115
0.924


There are small differences for PPR1 and GPD1, which could perhaps be explained by rounding of the RSCU values in table 1. The genes used to create the RSCU data in Sharp 1987 were described in [Sharp 1986](https://www.ncbi.nlm.nih.gov/pubmed/3526280).

![](table1a.png)

![](table1b.png)

![](table1b2.png)

![](table1c.png)

    Yeast - 16 ribosomal protein genes, TEF 1, 2 enolase
    genes, 2 GA-3-PDH genes, ADH 1, PGK, pyruvate kinase (data
    sources given in Ref.5)


    Ribosomal protein L16 175 0.83 0.70 0.80 0.79 (1) *1   RPL10
    Ribosomal protein L17a 138 0.79 0.68 0.72 0.63 (2)     RPL17A
    Ribosomal protein L25 138 0.86 0.72 0.82 0.52 (3)      RPL25
    Ribosomal protein L29 150 0.79 0.73 0.83 0.66 (4)      RPL29
    Ribosomal protein L34 114 0.84 0.75 0.79 0.57 (5)      RPL31A
    Ribosomal protein 13 388 0.89 0.78 0.86 0.70 (7)       RPL16A
    Ribosomal protein 28 187 0.89 0.86 0.89 0.61 (8)       RPS23A
    Ribosomal protein 51a 137 0.87 0.85 0.86 0.68 (9)      RPS17A
    Ribosomal protein 59 138 0.88 0.85 0.79 0.63 (10) *1   RPS14A
    Ribosomal protein S10 238 0.94 0.86 0.86 0.67 (11)     RPS20
    Ribosomal protein S16a 145 0.88 0.80 0.78 0.78 (12)    RPS16A
    Ribosomal protein S24 131 0.86 0.77 0.67 0.81 (13) *4  RPS22A
    Ribosomal protein 29 156 0.83 0.72 0.83 0.55 (86) *4   RPL29
    Ribosomal protein 51B 137 0.83 0.78 0.83 0.68 (102) *4 RPS17B
    Ribosomal protein L46 52 0.93 0.99 1.00 0.13 (6) *3    RPL39
    Ribosomal protein S33 68 0.63 0.68 0.79 0.31 (14)      RPS28A
    TEF 1 Elong. factor la 459 0.93 0.78 0.83 0.73 (69)    TEF1
    enolase A 438 0.93 0.82 0.85 0.78 (29)                 ENO1
    enolase B 438 0.96 0.85 0.86 0.75 (30)                 ENO2
    GA-3-PDH 1 331 0.99 0.86 0.86 0.81 (34)                TDH3
    GA-3-PDH 3 331 0.94 0.75 0.81 0.78 (35)                TDH1
    ADH 1 349 0.91 0.76 0.79 0.74 (16)                     ADH1
    PGK 417 0.91 0.75 0.85 0.75 (67)                       PGK1
    Pyruvate kinase 500 0.95 0.79 0.87 0.77 (53) *4        CDC19


In [19]:
highly_expressed_genes= '''
RPL10
RPL17A
RPL25
RPL29 
RPL31A
RPL16A
RPL29 
RPS17B
RPL39 
RPS28A
RPS23A
RPS17A
RPS14A
RPS20 
RPS16A
RPS22A

TEF1  
ENO1 
ENO2 
TDH3 
TDH2 
ADH1 
PGK1 
CDC19
'''
# Coding sequence of every reference gene, as plain strings
reference = [str(sg.stdgene[g].cds.seq) for g in highly_expressed_genes.split()]

In [20]:
print(round(CAI(GAL4str,reference=reference),3))
print(round(CAI(PPR1str,reference=reference),3)) 
print(round(CAI(GPD1str,reference=reference),3))

0.104
0.109
0.913


Using the genes that were supposedly the raw data for Sharp 1987 gives a worse fit to the published values than using the table data.
This could perhaps be due to misidentification of some genes from Sharp 1986, since gene names have changed quite a bit
since then. Going through the references in the Sharp 1986 tables could perhaps help resolve this.
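
The difference in fit can be put into numbers with a short sketch that compares both sets of results with the published values. The numbers are copied from the outputs and the table above; the mean absolute deviation is just one simple choice of summary, and `mean_abs_dev` is an illustrative helper.

```python
published = {"GAL4": 0.116, "PPR1": 0.114, "GPD1": 0.929}   # Sharp 1987 table 2
from_table = {"GAL4": 0.116, "PPR1": 0.115, "GPD1": 0.924}  # CAI(..., RSCUs=RSCU_sharp)
from_genes = {"GAL4": 0.104, "PPR1": 0.109, "GPD1": 0.913}  # CAI(..., reference=reference)


def mean_abs_dev(computed):
    """Mean absolute deviation from the published CAI values."""
    return sum(abs(computed[g] - published[g]) for g in published) / len(published)


for label, computed in (("table 1 RSCU", from_table), ("reference genes", from_genes)):
    print(label, round(mean_abs_dev(computed), 4))
```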