# Coding Domains Of Life

## Background information

[Tree Of Life](https://www.evogeneao.com/en/learn/tree-of-life)
![](../images/tree-of-life_2000.png)

[Genetic Code](https://www.geeksforgeeks.org/biology/genetic-code-molecular-basis-of-inheritance/)
![](../images/Genetic-Code.webp)

## Data source and description

### Source

[UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/577/codon+usage)

### Variable description

In [56]:
from pandas import DataFrame, option_context

with option_context('display.max_colwidth', None):
    display(
        DataFrame(
            {
                'Column': '1 2 3 4 5 6-69'.split(),
                'Variable': 'Kingdom DNAtype SpeciesID Ncodons SpeciesName codon'.split(),
                'Description': [
                    "A 3-letter code corresponding to 'xxx' in the CUTG database name: 'arc'(archaea), 'bct'(bacteria), 'phg'(bacteriophage), 'plm' (plasmid), 'pln' (plant), 'inv' (invertebrate), 'vrt' (vertebrate), 'mam' (mammal), 'rod' (rodent), 'pri' (primate), and 'vrl'(virus) sequence entries. Note that the CUTG database does not contain 'arc' and 'plm' (these have been manually curated ourselves).",
                    "An integer for the genomic composition in the species: 0-genomic, 1-mitochondrial, 2-chloroplast, 3-cyanelle, 4-plastid, 5-nucleomorph, 6-secondary_endosymbiont, 7-chromoplast, 8-leucoplast, 9-NA, 10-proplastid, 11-apicoplast, and 12-kinetoplast.",
                    "An integer, which uniquely indicates the entries of an organism. It is an accession identifier for each different species in the original CUTG database, followed by the first item listed in each genome.",
                    "The algebraic sum of the numbers listed for the different codons in an entry of CUTG. Codon frequencies are normalized to the total codon count, hence the number of occurrences divided by 'Ncodons' is the codon frequencies listed in the data file.",
                    "Descriptive label of the name of the species.",
                    "header: codon; entries: frequency of usage (5 digit floating point number)."
                ]
            },
        ).set_index('Column')
    )

Column,Variable,Description
1,Kingdom,"A 3-letter code corresponding to 'xxx' in the CUTG database name: 'arc'(archaea), 'bct'(bacteria), 'phg'(bacteriophage), 'plm' (plasmid), 'pln' (plant), 'inv' (invertebrate), 'vrt' (vertebrate), 'mam' (mammal), 'rod' (rodent), 'pri' (primate), and 'vrl'(virus) sequence entries. Note that the CUTG database does not contain 'arc' and 'plm' (these have been manually curated ourselves)."
2,DNAtype,"An integer for the genomic composition in the species: 0-genomic, 1-mitochondrial, 2-chloroplast, 3-cyanelle, 4-plastid, 5-nucleomorph, 6-secondary_endosymbiont, 7-chromoplast, 8-leucoplast, 9-NA, 10-proplastid, 11-apicoplast, and 12-kinetoplast."
3,SpeciesID,"An integer, which uniquely indicates the entries of an organism. It is an accession identifier for each different species in the original CUTG database, followed by the first item listed in each genome."
4,Ncodons,"The algebraic sum of the numbers listed for the different codons in an entry of CUTG. Codon frequencies are normalized to the total codon count, hence the number of occurrences divided by 'Ncodons' is the codon frequencies listed in the data file."
5,SpeciesName,Descriptive label of the name of the species.
6-69,codon,header: codon; entries: frequency of usage (5 digit floating point number).
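
As a quick check of the `Ncodons` description, the sketch below recovers an approximate raw count from a stored frequency. The two values are copied from the first record of the data file (Epizootic haematopoietic necrosis virus, displayed in the First look section); the variable names are illustrative only.

```python
# Values copied from the first record of the data file
# (Epizootic haematopoietic necrosis virus).
ncodons = 1995
uuu_frequency = 0.01654

# frequency = count / Ncodons, so count is approximately frequency * Ncodons.
# Rounding is needed because the file stores only 5 decimal places.
uuu_count = round(uuu_frequency * ncodons)
print(uuu_count)  # 33

# Dividing the recovered count by Ncodons reproduces the stored frequency.
print(round(uuu_count / ncodons, 5))  # 0.01654
```

The same arithmetic applies to any of the 64 codon columns, since they are all normalized by the same `Ncodons` value.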


## First look

In [26]:
from pandas import read_csv, DataFrame, option_context

### Load csv and fix Dtypes

In [10]:
# Load CSV
df = read_csv('../data/codon_usage.csv', low_memory=False)

display(df.head())
print(df.info())

Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,CUC,CUA,CUG,AUU,AUC,AUA,AUG,GUU,GUC,GUA,GUG,GCU,GCC,GCA,GCG,CCU,CCC,CCA,CCG,UGG,GGU,GGC,GGA,GGG,UCU,UCC,UCA,UCG,AGU,AGC,ACU,ACC,ACA,ACG,UAU,UAC,CAA,CAG,AAU,AAC,UGU,UGC,CAU,CAC,AAA,AAG,CGU,CGC,CGA,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,vrl,0,100217,1995,Epizootic haematopoietic necrosis virus,0.01654,0.01203,0.0005,0.00351,0.01203,0.03208,0.001,0.0401,0.00551,0.02005,0.00752,0.02506,0.01103,0.0411,0.00902,0.03308,0.01003,0.05013,0.01554,0.01103,0.02356,0.03208,0.01203,0.00501,0.01003,0.01203,0.03158,0.01905,0.02456,0.01353,0.02155,0.00251,0.00652,0.0015,0.01554,0.00501,0.02105,0.00902,0.01053,0.00501,0.02256,0.00301,0.03108,0.00401,0.02607,0.00251,0.01153,0.00501,0.02356,0.01053,0.0386,0.00401,0.00702,0.00401,0.00451,0.01303,0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.0005,0.0
1,vrl,0,100220,1474,Bohle iridovirus,0.02714,0.01357,0.00068,0.00678,0.00407,0.02849,0.00204,0.0441,0.01153,0.0251,0.00882,0.03324,0.00814,0.04071,0.00814,0.03256,0.01085,0.04885,0.01221,0.01357,0.00678,0.02714,0.01221,0.00407,0.01425,0.01221,0.01967,0.02239,0.01289,0.02103,0.01493,0.00407,0.00475,0.00068,0.02035,0.0095,0.02782,0.01425,0.00611,0.00475,0.02917,0.00407,0.02374,0.00882,0.02917,0.00271,0.01628,0.00204,0.01967,0.00543,0.03392,0.00136,0.00678,0.00136,0.00136,0.01696,0.03596,0.01221,0.04545,0.0156,0.0441,0.00271,0.00068,0.0
2,vrl,0,100755,4862,Sweet potato leaf curl virus,0.01974,0.0218,0.01357,0.01543,0.00782,0.01111,0.01028,0.01193,0.02283,0.01604,0.01316,0.0218,0.01625,0.01872,0.01213,0.0107,0.02406,0.01234,0.0144,0.00514,0.01604,0.0146,0.02098,0.0107,0.01728,0.01851,0.00864,0.01172,0.01892,0.01933,0.01419,0.01296,0.00967,0.01337,0.01337,0.01851,0.01131,0.01419,0.0109,0.02612,0.01275,0.01522,0.02365,0.02962,0.01789,0.01625,0.01234,0.01604,0.01687,0.02077,0.03949,0.00864,0.00596,0.00926,0.00596,0.01974,0.02489,0.03126,0.02036,0.02242,0.02468,0.00391,0.0,0.00144
3,vrl,0,100880,1915,Northern cereal mosaic virus,0.01775,0.02245,0.01619,0.00992,0.01567,0.01358,0.0094,0.01723,0.02402,0.02245,0.02507,0.02924,0.02089,0.02141,0.01723,0.01932,0.02141,0.00679,0.02245,0.00522,0.01358,0.00418,0.0141,0.00574,0.01201,0.00992,0.00366,0.02402,0.02663,0.02872,0.00992,0.0235,0.00522,0.01619,0.00836,0.02037,0.01358,0.02089,0.00731,0.02141,0.00888,0.01567,0.01253,0.02298,0.01358,0.00992,0.00888,0.00783,0.00679,0.03133,0.04282,0.00627,0.00261,0.00261,0.00366,0.0141,0.01671,0.0376,0.01932,0.03029,0.03446,0.00261,0.00157,0.0
4,vrl,0,100887,22831,Soil-borne cereal mosaic virus,0.02816,0.01371,0.00767,0.03679,0.0138,0.00548,0.00473,0.02076,0.02716,0.00867,0.0131,0.02773,0.02803,0.00508,0.0092,0.02965,0.02878,0.00574,0.01572,0.01577,0.01007,0.00508,0.00604,0.00679,0.01205,0.03127,0.00775,0.00959,0.00797,0.02006,0.00359,0.00933,0.01191,0.01616,0.00788,0.02593,0.00854,0.012,0.02098,0.02089,0.01367,0.01502,0.01809,0.02738,0.01796,0.01082,0.00705,0.01174,0.00858,0.03408,0.03964,0.0095,0.00429,0.00578,0.00604,0.01494,0.01734,0.04148,0.02483,0.03359,0.03679,0.0,0.00044,0.00131


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13028 entries, 0 to 13027
Data columns (total 69 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Kingdom      13028 non-null  object 
 1   DNAtype      13028 non-null  int64  
 2   SpeciesID    13028 non-null  int64  
 3   Ncodons      13028 non-null  int64  
 4   SpeciesName  13028 non-null  object 
 5   UUU          13028 non-null  object 
 6   UUC          13028 non-null  object 
 7   UUA          13028 non-null  float64
 8   UUG          13028 non-null  float64
 9   CUU          13028 non-null  float64
 10  CUC          13028 non-null  float64
 11  CUA          13028 non-null  float64
 12  CUG          13028 non-null  float64
 13  AUU          13028 non-null  float64
 14  AUC          13028 non-null  float64
 15  AUA          13028 non-null  float64
 16  AUG          13028 non-null  float64
 17  GUU          13028 non-null  float64
 18  GUC          13028 non-null  float64
 19  GUA 

In [14]:
# Find problematic values in the 'object' codon columns
display(df[~df.UUU.str.startswith('0')])
display(df[~df.UUC.str.startswith('0')])

Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,CUC,CUA,CUG,AUU,AUC,AUA,AUG,GUU,GUC,GUA,GUG,GCU,GCC,GCA,GCG,CCU,CCC,CCA,CCG,UGG,GGU,GGC,GGA,GGG,UCU,UCC,UCA,UCG,AGU,AGC,ACU,ACC,ACA,ACG,UAU,UAC,CAA,CAG,AAU,AAC,UGU,UGC,CAU,CAC,AAA,AAG,CGU,CGC,CGA,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
486,vrl,0,12440,1238,Non-A,non-B hepatitis virus,0.04362,0.021,0.01292,0.01292,0.03554,0.01696,0.00323,0.00969,0.02666,0.01212,0.00323,0.01939,0.03958,0.01131,0.00162,0.00646,0.03877,0.01616,0.00727,0.01292,0.02989,0.01373,0.00242,0.00646,0.01454,0.03069,0.01535,0.00727,0.00485,0.07674,0.01696,0.00242,0.00808,0.01131,0.00485,0.0315,0.01212,0.00323,0.00727,0.04443,0.00565,0.00969,0.02181,0.02262,0.0105,0.00889,0.01292,0.00646,0.00727,0.01696,0.02423,0.02181,0.01535,0.00081,0.00323,0.00242,0.00162,0.04443,0.01696,0.02423,0.02262,0.00162,0.0
5063,bct,0,353569,1698,Salmonella enterica subsp. enterica serovar 4,12;I,-,0.0212,0.02356,0.01178,0.01296,0.0106,0.01296,0.00471,0.06949,0.0212,0.0212,0.00471,0.01767,0.00942,0.01531,0.00824,0.02945,0.00707,0.01885,0.00236,0.02827,0.00589,0.00471,0.00589,0.03298,0.01885,0.01178,0.02945,0.00471,0.0106,0.00118,0.01178,0.00471,0.01885,0.01178,0.02591,0.00353,0.01531,0.00236,0.0212,0.02238,0.02238,0.00824,0.03887,0.00824,0.03651,0.00589,0.01178,0.01767,0.01296,0.01885,0.01531,0.02473,0.03062,0.00118,0.00707,0.00118,0.0,0.02945,0.02356,0.04476,0.02473,0.00118


Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,CUC,CUA,CUG,AUU,AUC,AUA,AUG,GUU,GUC,GUA,GUG,GCU,GCC,GCA,GCG,CCU,CCC,CCA,CCG,UGG,GGU,GGC,GGA,GGG,UCU,UCC,UCA,UCG,AGU,AGC,ACU,ACC,ACA,ACG,UAU,UAC,CAA,CAG,AAU,AAC,UGU,UGC,CAU,CAC,AAA,AAG,CGU,CGC,CGA,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
5063,bct,0,353569,1698,Salmonella enterica subsp. enterica serovar 4,12;I,-,0.0212,0.02356,0.01178,0.01296,0.0106,0.01296,0.00471,0.06949,0.0212,0.0212,0.00471,0.01767,0.00942,0.01531,0.00824,0.02945,0.00707,0.01885,0.00236,0.02827,0.00589,0.00471,0.00589,0.03298,0.01885,0.01178,0.02945,0.00471,0.0106,0.00118,0.01178,0.00471,0.01885,0.01178,0.02591,0.00353,0.01531,0.00236,0.0212,0.02238,0.02238,0.00824,0.03887,0.00824,0.03651,0.00589,0.01178,0.01767,0.01296,0.01885,0.01531,0.02473,0.03062,0.00118,0.00707,0.00118,0.0,0.02945,0.02356,0.04476,0.02473,0.00118


In [19]:
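# Keep only records with a numeric UUU value (both problematic records found above fail this test),
# then convert the two 'object' codon columns to float
df = df[df.UUU.str.startswith('0')]
df = df.astype({'UUU': float, 'UUC': float})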

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13026 entries, 0 to 13027
Data columns (total 69 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Kingdom      13026 non-null  object 
 1   DNAtype      13026 non-null  int64  
 2   SpeciesID    13026 non-null  int64  
 3   Ncodons      13026 non-null  int64  
 4   SpeciesName  13026 non-null  object 
 5   UUU          13026 non-null  float64
 6   UUC          13026 non-null  float64
 7   UUA          13026 non-null  float64
 8   UUG          13026 non-null  float64
 9   CUU          13026 non-null  float64
 10  CUC          13026 non-null  float64
 11  CUA          13026 non-null  float64
 12  CUG          13026 non-null  float64
 13  AUU          13026 non-null  float64
 14  AUC          13026 non-null  float64
 15  AUA          13026 non-null  float64
 16  AUG          13026 non-null  float64
 17  GUU          13026 non-null  float64
 18  GUC          13026 non-null  float64
 19  GUA      
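
The filter above relies on valid frequency strings starting with `'0'`. A more general way to flag unparseable values, sketched here on a toy frame rather than the real file, is to coerce each codon column with `pandas.to_numeric(errors='coerce')` and look for the resulting `NaN` values. The toy values only mimic the two problematic records; `toy`, `codon_cols` and `bad_rows` are hypothetical names.

```python
import pandas as pd

# Toy frame standing in for the real data; the string values mimic the
# problematic records shown earlier and are for illustration only.
toy = pd.DataFrame({
    'SpeciesName': ['Bohle iridovirus', 'Non-A', 'Salmonella enterica subsp. enterica serovar 4'],
    'UUU': ['0.02714', 'non-B hepatitis virus', '12;I'],
    'UUC': ['0.01357', '0.04362', '-'],
})
codon_cols = ['UUU', 'UUC']

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
numeric = toy[codon_cols].apply(pd.to_numeric, errors='coerce')

# A record is flagged if any codon column failed to parse.
bad_rows = numeric.isna().any(axis=1)
print(toy[bad_rows])

# Keep the parseable records and convert the codon columns to float.
clean = toy.loc[~bad_rows].astype({c: float for c in codon_cols})
print(clean.dtypes)
```

This version does not depend on how a valid value happens to be formatted, at the cost of one extra pass over the codon columns.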

### Check column values

In [38]:
df.describe(include='all')

Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,CUC,CUA,CUG,AUU,AUC,AUA,AUG,GUU,GUC,GUA,GUG,GCU,GCC,GCA,GCG,CCU,CCC,CCA,CCG,UGG,GGU,GGC,GGA,GGG,UCU,UCC,UCA,UCG,AGU,AGC,ACU,ACC,ACA,ACG,UAU,UAC,CAA,CAG,AAU,AAC,UGU,UGC,CAU,CAC,AAA,AAG,CGU,CGC,CGA,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
count,13026,13026.0,13026.0,13026.0,13026,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0,13026.0
unique,11,,,,13014,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,bct,,,,Escherichia coli O157,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,2919,,,,4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,,0.367265,130443.036926,79617.76,,0.024818,0.02344,0.020637,0.014104,0.017821,0.018287,0.019045,0.018452,0.028355,0.025038,0.018294,0.021136,0.017648,0.015173,0.013622,0.016444,0.019937,0.023802,0.019063,0.011699,0.012945,0.012647,0.015695,0.008599,0.011612,0.017216,0.019062,0.018427,0.010529,0.014702,0.01324,0.015393,0.007161,0.009648,0.011085,0.015975,0.019372,0.01911,0.008226,0.018209,0.016173,0.019367,0.015487,0.022532,0.02198,0.00729,0.007592,0.011541,0.012172,0.028506,0.021531,0.008006,0.009658,0.006963,0.005453,0.00993,0.006423,0.024181,0.021164,0.028291,0.021683,0.00164,0.00059,0.006179
std,,0.688764,124777.067741,719755.6,,0.017628,0.011598,0.02071,0.00928,0.010587,0.014573,0.024252,0.016578,0.017507,0.014596,0.016045,0.008162,0.009953,0.010067,0.008316,0.011719,0.009889,0.017245,0.009112,0.013573,0.006765,0.009109,0.009606,0.00896,0.006569,0.010492,0.014769,0.009325,0.007137,0.00895,0.007574,0.009217,0.00612,0.006498,0.006571,0.008209,0.012454,0.013001,0.006774,0.011818,0.007349,0.0113,0.011285,0.015033,0.00951,0.006231,0.006435,0.006843,0.006604,0.01789,0.014578,0.006308,0.01068,0.004784,0.006601,0.008574,0.006388,0.013826,0.013039,0.014343,0.015019,0.001785,0.000882,0.010345
min,,0.0,7.0,1000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,0.0,28851.25,1602.0,,0.01391,0.01538,0.00561,0.007103,0.01089,0.00783,0.005302,0.00718,0.01637,0.01513,0.00632,0.01579,0.01052,0.00822,0.00694,0.0069,0.0133,0.01033,0.01299,0.00294,0.00843,0.00571,0.00925,0.00251,0.00711,0.00998,0.008973,0.01197,0.005573,0.008682,0.00782,0.008862,0.00258,0.00428,0.006703,0.01062,0.01005,0.01008,0.00318,0.00965,0.01096,0.01285,0.005842,0.011363,0.01556,0.00266,0.0036,0.00703,0.00719,0.017322,0.010235,0.00315,0.00288,0.00334,0.00122,0.00169,0.00117,0.01239,0.01186,0.01736,0.00971,0.00056,0.0,0.00041
50%,,0.0,81971.5,2929.0,,0.02175,0.021905,0.01526,0.01336,0.01613,0.01456,0.00968,0.0128,0.02548,0.02154,0.01414,0.022,0.017135,0.01316,0.01267,0.01452,0.018815,0.020235,0.018595,0.00708,0.01241,0.01099,0.0143,0.00567,0.01206,0.015545,0.01539,0.01752,0.00976,0.01373,0.01246,0.0146,0.0056,0.00945,0.01056,0.01568,0.01718,0.01666,0.00655,0.01608,0.01543,0.01904,0.01451,0.0198,0.02117,0.006155,0.00653,0.01062,0.01146,0.02532,0.02109,0.00687,0.00566,0.00599,0.00353,0.00927,0.004545,0.025425,0.01907,0.026085,0.02054,0.00138,0.00042,0.00113
75%,,1.0,222890.5,9120.0,,0.031308,0.02921,0.029495,0.019808,0.02273,0.02511,0.017255,0.024325,0.038117,0.03186,0.02597,0.02626,0.02346,0.01947,0.01912,0.02407,0.025088,0.03337,0.02454,0.01443,0.01685,0.01783,0.02033,0.011,0.015387,0.022358,0.02396,0.02372,0.01406,0.019078,0.01757,0.0207,0.010308,0.01395,0.01454,0.020698,0.0264,0.0261,0.011898,0.02477,0.020178,0.02499,0.023107,0.031277,0.02743,0.010308,0.00999,0.014697,0.01654,0.03726,0.03091,0.01137,0.01191,0.00985,0.00715,0.015928,0.01025,0.03419,0.02769,0.0368,0.031128,0.00237,0.00083,0.00289
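
Because the codon columns are frequencies normalized by `Ncodons`, the 64 values of a record should add up to roughly 1. The sketch below checks this for the first record shown in the `df.head()` output; the tolerance is an assumption based on each value being rounded to 5 decimal places.

```python
# The 64 codon frequencies of record 0 (Epizootic haematopoietic necrosis virus),
# copied from the df.head() output.
record_0 = (
    "0.01654,0.01203,0.0005,0.00351,0.01203,0.03208,0.001,0.0401,"
    "0.00551,0.02005,0.00752,0.02506,0.01103,0.0411,0.00902,0.03308,"
    "0.01003,0.05013,0.01554,0.01103,0.02356,0.03208,0.01203,0.00501,"
    "0.01003,0.01203,0.03158,0.01905,0.02456,0.01353,0.02155,0.00251,"
    "0.00652,0.0015,0.01554,0.00501,0.02105,0.00902,0.01053,0.00501,"
    "0.02256,0.00301,0.03108,0.00401,0.02607,0.00251,0.01153,0.00501,"
    "0.02356,0.01053,0.0386,0.00401,0.00702,0.00401,0.00451,0.01303,"
    "0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.0005,0.0"
)
frequencies = [float(value) for value in record_0.split(',')]

# Each value is rounded to 5 decimals, so the sum can drift from 1 by at most
# 64 * 0.000005 = 0.00032; 1e-3 is a deliberately loose tolerance (an assumption).
total = sum(frequencies)
is_normalized = abs(total - 1.0) < 1e-3
print(len(frequencies), is_normalized)
```

On the full frame, the equivalent check would sum the codon columns row by row, for example with `df.iloc[:, 5:].sum(axis=1)`, and look for records that fall outside the tolerance.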


In [31]:
# Print the sorted unique values of low-cardinality columns, otherwise the number of unique values
with option_context('display.max_rows', 100, 'display.max_colwidth', 100):
    display(
        DataFrame(
            [[sorted(df[c].unique())] if (n := df[c].nunique()) < 15 else [n] for c in df.columns],
            index=df.columns,
        ).rename(columns={0: 'values'})
    )

Unnamed: 0,values
Kingdom,"[arc, bct, inv, mam, phg, plm, pln, pri, rod, vrl, vrt]"
DNAtype,"[0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]"
SpeciesID,12366
Ncodons,7103
SpeciesName,13014
UUU,4789
UUC,4119
UUA,4796
UUG,3282
CUU,3677
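
The `DNAtype` codes listed above can be compared with the 13 labels given in the variable description. This small sketch, with the mapping transcribed from that description and the observed codes copied from the output above, shows which documented codes never occur in the data; `dna_type_labels` is an illustrative name.

```python
# Code-to-label mapping transcribed from the DNAtype variable description.
dna_type_labels = {
    0: 'genomic', 1: 'mitochondrial', 2: 'chloroplast', 3: 'cyanelle',
    4: 'plastid', 5: 'nucleomorph', 6: 'secondary_endosymbiont',
    7: 'chromoplast', 8: 'leucoplast', 9: 'NA', 10: 'proplastid',
    11: 'apicoplast', 12: 'kinetoplast',
}

# Unique DNAtype values copied from the output above.
observed_codes = [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12]

# Documented codes that do not appear in the data.
missing = {code: label for code, label in dna_type_labels.items()
           if code not in observed_codes}
print(missing)  # {8: 'leucoplast', 10: 'proplastid'}
```

In the notebook, `df.DNAtype.map(dna_type_labels)` would turn the integer codes into readable labels.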


In [37]:
# Count records per class of possible target variables
print(df.Kingdom.value_counts(normalize=True), '\n')
print(df.DNAtype.value_counts(normalize=True))

Kingdom
bct    0.224090
vrl    0.217335
pln    0.193690
vrt    0.159450
inv    0.103255
mam    0.043912
phg    0.016889
rod    0.016505
pri    0.013819
arc    0.009673
plm    0.001382
Name: proportion, dtype: float64 

DNAtype
0     0.711270
1     0.222555
2     0.062644
4     0.002380
12    0.000384
5     0.000154
3     0.000154
11    0.000154
9     0.000154
6     0.000077
7     0.000077
Name: proportion, dtype: float64
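
Both candidate targets are heavily imbalanced. To make that concrete, the sketch below converts the printed `Kingdom` proportions back into approximate record counts and lists the classes below a 1% share. The 1% cutoff is an arbitrary assumption for illustration, and the proportions are retyped from the output above.

```python
import pandas as pd

# Number of records left after removing the two problematic rows.
n_records = 13026

# Kingdom proportions retyped from the value_counts(normalize=True) output.
kingdom_share = pd.Series({
    'bct': 0.224090, 'vrl': 0.217335, 'pln': 0.193690, 'vrt': 0.159450,
    'inv': 0.103255, 'mam': 0.043912, 'phg': 0.016889, 'rod': 0.016505,
    'pri': 0.013819, 'arc': 0.009673, 'plm': 0.001382,
})

# Proportion * total gives the approximate number of records per class.
approx_counts = (kingdom_share * n_records).round().astype(int)
print(approx_counts)

# Hypothetical cutoff: classes with less than 1% of the records.
rare_threshold = 0.01
rare_classes = kingdom_share[kingdom_share < rare_threshold].index.tolist()
print(rare_classes)  # ['arc', 'plm']
```

Classes this small may need special handling, such as stratified sampling or merging, if `Kingdom` or `DNAtype` is later used as a target.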
