# Introduction

In this tutorial we will see how to analyse the data we just retrieved!

In [2]:
from CodonU import analyzer as an

# Setting the file path

A whole-genome FASTA file contains more than 2000 genes, so analysing it produces a large amount of output. Here I restrict the analysis to the first two genes of _Staphylococcus agnetis_.
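
To restrict a large FASTA file to its first few records, as described above, something like the following could be used. This is a minimal sketch in plain Python: `iter_fasta` and `first_records` are hypothetical helpers (not part of CodonU), and the demo records are made up.

```python
from itertools import islice


def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def first_records(lines, n=2):
    """Return the first n records without reading the rest of the input."""
    return list(islice(iter_fasta(lines), n))


# Made-up demo data standing in for a whole-genome file
demo = """>gene_1
ATGAAA
CGCTGA
>gene_2
ATGCCCTAA
>gene_3
ATGTAG
""".splitlines()

subset = first_records(demo, n=2)
for header, seq in subset:
    print(f'>{header}\n{seq}')
```

With a real file, the same function can be fed an open file handle instead of the `demo` list.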

In [2]:
in_file = 'Nucleotide/test.fasta'
# in_file = ''    your choice of file

# Setting up necessary parameters

Here we set three parameters:
- `genetic_code_num`: Genetic table number for the codon table. To know more about genetic table numbers, click [here](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
- `gene_analysis`: Whether to analyse genes individually or the genome as a whole. Set `True` to compute values for each individual gene, or `False` to treat all genes together as a single genome
- `min_len_threshold`: Minimum length a nucleotide sequence must have to be considered a gene

In [3]:
genetic_code_num = 11
gene_analysis = True
min_len_threshold = 200
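
The effect of `min_len_threshold` can be pictured as a simple length filter. The sketch below is only an illustration: `filter_by_length` is a hypothetical helper, and whether CodonU compares with `>=` or `>` is an assumption made here.

```python
def filter_by_length(records, min_len_threshold=200):
    """Keep only the sequences that are at least min_len_threshold long.

    `records` maps a sequence name to its sequence string.
    Assumption: the threshold is inclusive (>=).
    """
    return {
        name: seq
        for name, seq in records.items()
        if len(seq) >= min_len_threshold
    }


# Made-up sequences: one of 210 nucleotides, one of 30
records = {
    'long_gene': 'ATG' * 70,
    'short_gene': 'ATG' * 10,
}

kept = filter_by_length(records, min_len_threshold=200)
print(list(kept))
```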

# CAI

## Gene Analysis

A word of caution: some values may be reported as `nan`. For more on this, see the first part of Generate Summary.

In [5]:
cai = an.calculate_cai(in_file, genetic_code_num, min_len_threshold, gene_analysis)

The code below only displays the result.

In [6]:
# the code below only displays the values
for gene in cai.keys():
    print(f'{gene}:')
    for codon, cai_val in cai[gene].items():
        print(f'    {codon}: {cai_val}')

gene_1:
    TTT: 1.0
    TTC: 0.5
    TTA: 1.0
    TTG: 0.5
    TCT: 0.6956521739130435
    TCC: 0.43478260869565216
    TCA: 1.0
    TCG: 0.043478260869565216
    TAT: 1.0
    TAC: 1.0
    TGT: 1.0
    TGC: 1.0
    TGG: nan
    CTT: 0.5
    CTC: 0.5
    CTA: 0.5
    CTG: 1.0
    CCT: 1.0
    CCC: 0.25
    CCA: 0.5
    CCG: 0.5
    CAT: 1.0
    CAC: 0.5
    CAA: 1.0
    CAG: 0.0625
    CGT: 1.0
    CGC: 0.25
    CGA: 0.12500000000000003
    CGG: 0.12500000000000003
    ATT: 0.5
    ATC: 1.0
    ATA: 0.5
    ATG: nan
    ACT: 1.0
    ACC: 0.2
    ACA: 0.9333333333333333
    ACG: 0.3333333333333333
    AAT: 1.0
    AAC: 0.2857142857142857
    AAA: 1.0
    AAG: 0.1346153846153846
    AGT: 0.13043478260869565
    AGC: 0.021739130434782608
    AGA: 0.75
    AGG: 0.12500000000000003
    GTT: 1.0
    GTC: 0.12500000000000003
    GTA: 0.7500000000000001
    GTG: 0.25
    GCT: 1.0
    GCC: 0.14285714285714285
    GCA: 0.42857142857142855
    GCG: 0.3571428571428572
    GAT: 1.0
    GAC: 0.92857
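
In the output above, `nan` appears for TGG and ATG, the only codons for tryptophan and methionine; an amino acid with a single codon carries no usage bias, which is presumably why no value is reported. When post-processing such a dictionary, the `nan` entries usually need to be skipped explicitly. A minimal sketch, using a few values copied from the output above (the variable names are illustrative only):

```python
import math

# A handful of per-codon values, as printed above
codon_vals = {
    'TTT': 1.0,
    'TTC': 0.5,
    'TGG': float('nan'),
    'ATG': float('nan'),
}

# nan never compares equal to anything (not even itself),
# so math.isnan is the reliable way to detect it
missing = [codon for codon, val in codon_vals.items() if math.isnan(val)]
usable = {codon: val for codon, val in codon_vals.items() if not math.isnan(val)}

print('skipped:', missing)
print('usable:', usable)
```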

## Genome analysis

In [7]:
cai = an.calculate_cai(in_file, genetic_code_num, min_len_threshold, gene_analysis=False)

# the code below only displays the values
for codon, cai_val in cai.items():
    print(f'{codon}: {cai_val}')

TTT: 1.0
TTC: 0.3333333333333333
TTA: 1.0
TTG: 0.4
TCT: 0.8235294117647057
TCC: 0.38235294117647056
TCA: 1.0
TCG: 0.02941176470588235
TAT: 1.0
TAC: 0.5
TGT: 1.0
TGC: 1.0
TGG: nan
CTT: 0.2
CTC: 0.10000000000000002
CTA: 0.05000000000000001
CTG: 0.10000000000000002
CCT: 1.0
CCC: 0.16666666666666669
CCA: 0.3333333333333333
CCG: 1.0
CAT: 1.0
CAC: 0.25
CAA: 1.0
CAG: 0.0625
CGT: 1.0
CGC: 0.5
CGA: 0.08333333333333333
CGG: 0.3333333333333333
ATT: 1.0
ATC: 0.2857142857142857
ATA: 0.5714285714285714
ATG: nan
ACT: 0.85
ACC: 0.15
ACA: 1.0
ACG: 0.7499999999999999
AAT: 1.0
AAC: 0.5
AAA: 1.0
AAG: 0.3211009174311927
AGT: 0.20588235294117643
AGC: 0.014705882352941173
AGA: 0.8333333333333333
AGG: 0.16666666666666669
GTT: 1.0
GTC: 0.09999999999999998
GTA: 1.0
GTG: 0.39999999999999997
GCT: 1.0
GCC: 0.11111111111111109
GCA: 0.4444444444444444
GCG: 0.5555555555555556
GAT: 1.0
GAC: 0.7045454545454546
GAA: 1.0
GAG: 0.4387755102040816
GGT: 1.0
GGC: 0.6666666666666666
GGA: 0.8333333333333334
GGG: 0.5


## Saving the file

Although you can work with the results directly in CodonU, you may want to validate the data elsewhere. For this reason the analyzer functions can save the data as an .xlsx file.

Three parameters control how the functions save files:
- `save_file`: If `True`, the file is saved; otherwise it is not. (Default `False`)
- `file_name`: File name without extension. (Default _operation_\_report, where _operation_ may be CAI, RSCU etc.)
- `folder_path`: Folder name or path where the file will be saved. (Default Report)

The folder does not have to exist already. If a non-existing folder name is given, CodonU will create a folder by that name.

__In order to save the file in excel format you must have the package [openpyxl](https://pypi.org/project/openpyxl/)__
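
The folder behaviour described above (create the folder if it is missing, then place `file_name` plus the `.xlsx` extension inside it) can be sketched with `pathlib`. This is not CodonU's code: `report_path` and its `base_dir` argument are hypothetical and exist only for this illustration.

```python
import tempfile
from pathlib import Path


def report_path(file_name='CAI_report', folder_path='Report', base_dir='.'):
    """Create folder_path under base_dir if needed and return the .xlsx path."""
    folder = Path(base_dir) / folder_path
    # exist_ok=True makes this a no-op when the folder is already there
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f'{file_name}.xlsx'


# Use a temporary directory so the demo leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    out = report_path('CAI_report', 'Report', base_dir=tmp)
    folder_created = out.parent.is_dir()
    out_name = out.name

print(folder_created, out_name)
```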

In [8]:
# file_name = ''
# folder_path = ''
cai = an.calculate_cai(in_file, genetic_code_num, save_file=True)

Report created successfully
The CAI score file can be found at: /home/souro/Projects/CodonU/Examples/Report/CAI_report.xlsx


# RSCU

## Gene analysis

In [23]:
rscu = an.calculate_rscu(in_file, genetic_code_num, gene_analysis=True)

The code below only displays the result.

In [24]:
for gene, rscu_vals in rscu.items():
    print(f'{gene}:')
    for codon, rscu_val in rscu_vals.items():
        print(f'{codon}: {rscu_val}')

gene_1:
TTT: 1.3333333333333333
TTC: 0.6666666666666666
TTA: 1.5
TTG: 0.75
TCT: 1.7943925233644862
TCC: 1.1214953271028039
TCA: 2.579439252336449
TCG: 0.11214953271028039
TAT: 1.0
TAC: 1.0
TGT: 1.0
TGC: 1.0
TGG: 1.0
CTT: 0.75
CTC: 0.75
CTA: 0.75
CTG: 1.5
CCT: 1.7777777777777777
CCC: 0.4444444444444444
CCA: 0.8888888888888888
CCG: 0.8888888888888888
CAT: 1.3333333333333333
CAC: 0.6666666666666666
CAA: 1.8823529411764706
CAG: 0.11764705882352941
CGT: 2.5263157894736845
CGC: 0.6315789473684211
CGA: 0.31578947368421056
CGG: 0.31578947368421056
ATT: 0.75
ATC: 1.5
ATA: 0.75
ATG: 1.0
ACT: 1.6216216216216217
ACC: 0.32432432432432434
ACA: 1.5135135135135136
ACG: 0.5405405405405406
AAT: 1.5555555555555556
AAC: 0.4444444444444444
AAA: 1.7627118644067796
AAG: 0.23728813559322035
AGT: 0.33644859813084116
AGC: 0.05607476635514019
AGA: 1.8947368421052633
AGG: 0.31578947368421056
GTT: 1.8823529411764706
GTC: 0.23529411764705882
GTA: 1.411764705882353
GTG: 0.47058823529411764
GCT: 2.074074074074074
GCC
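
Comparing this output with the gene-level values in the CAI section suggests how the two are related: each value printed there equals the codon's RSCU divided by the largest RSCU among its synonymous codons. The sketch below checks this for phenylalanine and serine using numbers copied from the output above; `relative_adaptiveness` and the `families` mapping are illustrative names, not CodonU functions.

```python
# RSCU values for gene_1, copied from the output above
rscu_gene_1 = {
    'TTT': 1.3333333333333333, 'TTC': 0.6666666666666666,    # Phe
    'TCT': 1.7943925233644862, 'TCC': 1.1214953271028039,    # Ser
    'TCA': 2.579439252336449, 'TCG': 0.11214953271028039,    # Ser
    'AGT': 0.33644859813084116, 'AGC': 0.05607476635514019,  # Ser
}

# Synonymous codon families used in this demo
families = {
    'F': ['TTT', 'TTC'],
    'S': ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'],
}


def relative_adaptiveness(rscu, families):
    """Divide each RSCU by the largest RSCU within its codon family."""
    weights = {}
    for codons in families.values():
        top = max(rscu[codon] for codon in codons)
        for codon in codons:
            weights[codon] = rscu[codon] / top
    return weights


weights = relative_adaptiveness(rscu_gene_1, families)
for codon, weight in weights.items():
    print(f'{codon}: {weight:.4f}')
```

The most used codon of each family gets a weight of 1.0, which matches the many `1.0` entries in the CAI output.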

## Genome analysis

In [25]:
rscu = an.calculate_rscu(in_file, genetic_code_num)

for codon, rscu_val in rscu.items():
    print(f'{codon}: {rscu_val}')

TTT: 1.0
TTC: 1.0
TTA: 1.0
TTG: 1.0
TCT: 1.0
TCC: 1.0
TCA: 1.0
TCG: 1.0
TAT: 1.0
TAC: 1.0
TGT: 1.0
TGC: 1.0
TGG: 1.0
CTT: 1.0
CTC: 1.0
CTA: 1.0
CTG: 1.0
CCT: 1.0
CCC: 1.0
CCA: 1.0
CCG: 1.0
CAT: 1.0
CAC: 1.0
CAA: 1.0
CAG: 1.0
CGT: 1.0
CGC: 1.0
CGA: 1.0
CGG: 1.0
ATT: 1.0
ATC: 1.0
ATA: 1.0
ATG: 1.0
ACT: 1.0
ACC: 1.0
ACA: 1.0
ACG: 1.0
AAT: 1.0
AAC: 1.0
AAA: 1.0
AAG: 1.0
AGT: 1.0
AGC: 1.0
AGA: 1.0
AGG: 1.0
GTT: 1.0
GTC: 1.0
GTA: 1.0
GTG: 1.0
GCT: 1.0
GCC: 1.0
GCA: 1.0
GCG: 1.0
GAT: 1.0
GAC: 1.0
GAA: 1.0
GAG: 1.0
GGT: 1.0
GGC: 1.0
GGA: 1.0
GGG: 1.0


## Saving the file

In [26]:
rscu = an.calculate_rscu(in_file, genetic_code_num, save_file=True)

The RSCU score file can be found at: /home/souro/Projects/CodonU/Examples/Report/RSCU_report.xlsx


# CBI

## Gene Analysis

In [11]:
cbi = an.calculate_cbi(in_file, genetic_code_num, min_len_threshold=66, gene_analysis=True)

__Note__: here `min_len_threshold = 66` because 66 amino acids are translated from 198 nucleotides, the largest whole number of codons that fits within the nucleotide threshold of 200 used for [CAI](#CAI). In general, wherever the sequence is nucleotide based the default value of `min_len_threshold` is 200; wherever it is amino acid based, the default is 66.
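
The arithmetic behind the two thresholds is a plain integer division by the codon length:

```python
nuc_threshold = 200                  # nucleotide-based default

# Three nucleotides make one codon, so only whole codons count
aa_threshold = nuc_threshold // 3    # amino-acid-based threshold
covered = aa_threshold * 3           # nucleotides actually translated

print(aa_threshold, covered)
```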

The code below only displays the result.

In [17]:
for gene, cbi_vals in cbi.items():
    print(f'{gene}:')
    for aa, cbi_val in cbi_vals.items():
        print(f'    {aa}: {cbi_val[0]}\t{cbi_val[1]}')

gene_1:
    A: 0.35802469135802467	GCT
    C: nan	None
    D: 0.037037037037037035	GAT
    E: 0.4444444444444444	GAA
    F: 1.0	TTT
    G: 0.3333333333333333	GGT
    H: 1.0	CAT
    I: 0.25	ATC
    K: 0.7627118644067796	AAA
    L: 0.4	TTA
    M: nan	ATG
    N: 0.5555555555555556	AAT
    P: 0.3333333333333333	CCT
    Q: 1.0	CAA
    R: 0.4	CGT
    S: 0.32075471698113206	TCA
    T: 0.2072072072072072	ACT
    V: 0.3333333333333333	GTT
    W: nan	TGG
    Y: nan	None
gene_2:
    A: 0.2727272727272727	GCG
    C: nan	None
    D: 0.5238095238095238	GAT
    E: 0.06666666666666667	GAG
    F: 0.42857142857142855	TTT
    G: 0.08333333333333333	GGT
    H: 0.5	CAT
    I: 0.5	ATT
    K: 0.3411764705882353	AAA
    L: 0.47500000000000003	TTA
    M: nan	ATG
    N: 0.2222222222222222	AAT
    P: 0.5555555555555556	CCG
    Q: 0.7777777777777778	CAA
    R: 0.06666666666666667	CGT
    S: 0.28	TCT
    T: 0.4074074074074074	ACG
    V: 0.3333333333333333	GTA
    W: nan	TGG
    Y: 1.0	TAT


## Genome Analysis

In [18]:
cbi = an.calculate_cbi(in_file, genetic_code_num)

`gene_analysis` is not given here because its default value is `False`.

The code below only displays the result.

In [19]:
for aa, cbi_vals in cbi.items():
    print(f'{aa}: {cbi_vals[0]}\t{cbi_vals[1]}')

A: 0.2982456140350877	GCT
C: nan	None
D: 0.17333333333333334	GAT
E: 0.3900709219858156	GAA
F: 0.5	TTT
G: 0.1111111111111111	GGT
H: 0.6	CAT
I: 0.3076923076923077	ATT
K: 0.5138888888888888	AAA
L: 0.4666666666666667	TTA
M: nan	ATG
N: 0.3333333333333333	AAT
P: 0.23809523809523808	CCT
Q: 0.8823529411764706	CAA
R: 0.2235294117647059	CGT
S: 0.2915662650602409	TCA
T: 0.15151515151515152	ACA
V: 0.2222222222222222	GTT
W: nan	TGG
Y: 1.0	TAT


## Saving the file

In [22]:
cbi = an.calculate_cbi(in_file, genetic_code_num, save_file=True)

Provided file not empty! Your action will result into completely changing the content of the file. Proceed [y/n]?: y
The CBI score file can be found at: /home/souro/Projects/CodonU/Examples/Report/CBI_report.xlsx


Above you can see a prompt asking whether you want to overwrite the file. CodonU asks this whenever the target file is not empty.
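
The behaviour of such a prompt can be sketched as follows. This is an assumption about the general pattern, not CodonU's actual implementation: `confirm_overwrite` is a hypothetical helper, and the `ask` argument exists only so that the demo can run without keyboard input.

```python
import os
import tempfile


def confirm_overwrite(path, ask=input):
    """Return True if writing to path may proceed.

    A missing or empty file needs no confirmation; otherwise the user
    is asked and only an answer of 'y' counts as consent.
    """
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return True
    answer = ask('Provided file not empty! Proceed [y/n]?: ')
    return answer.strip().lower() == 'y'


with tempfile.TemporaryDirectory() as tmp:
    report = os.path.join(tmp, 'CBI_report.xlsx')

    # No file yet: no question is asked
    fresh = confirm_overwrite(report, ask=lambda _: 'n')

    with open(report, 'w') as fh:
        fh.write('old content')

    # File now has content: the answer decides
    declined = confirm_overwrite(report, ask=lambda _: 'n')
    accepted = confirm_overwrite(report, ask=lambda _: 'y')

print(fresh, declined, accepted)
```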

__To reduce redundancy, the rest of the tutorial is presented in a shorter form.__

# ENc

In [29]:
enc = an.calculate_enc(in_file, genetic_code_num, gene_analysis=True)

for gene, enc_val in enc.items():
    print(f'{gene}: {enc_val}')

gene_1: 52.08838655503453
gene_2: 48.18721859443946


In [30]:
enc = an.calculate_enc(in_file, genetic_code_num)    # gene_analysis = False essentially means genome analysis

print(enc)

45.59355469644335


In [31]:
enc = an.calculate_enc(in_file, genetic_code_num, gene_analysis=True, save_file=True)

The ENc score file can be found at: /home/souro/Projects/CodonU/Examples/Report/ENc_report.xlsx


# Aromaticity

In [33]:
in_file = 'Protein/test.fasta'

Aromaticity and GRAVY scores are computed for proteins, so I have changed the file path to a protein FASTA file.
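
For orientation, both scores can be written in a few lines of plain Python using their standard definitions: aromaticity is the relative frequency of Phe, Trp and Tyr, and GRAVY is the mean Kyte-Doolittle hydropathy of the residues. The sketch below is not CodonU's implementation (its genome-level values in particular may be aggregated differently), and the peptide is made up.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}


def aromaticity(protein):
    """Fraction of residues that are aromatic (F, W or Y)."""
    return sum(protein.count(aa) for aa in 'FWY') / len(protein)


def gravy(protein):
    """Mean Kyte-Doolittle hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)


peptide = 'MKFWY'   # made-up example: 3 aromatic residues out of 5
aroma_demo = aromaticity(peptide)
gravy_demo = gravy(peptide)

print(aroma_demo, gravy_demo)
```

A negative GRAVY value, as in the outputs of the GRAVY section, indicates a protein that is hydrophilic on average.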

In [34]:
aroma = an.calculate_aromaticity(in_file, gene_analysis=True)

for prot, val in aroma.items():
    print(f'{prot}: {val}')

prot_seq1: 0.0049382716049382715
prot_seq2: 0.03180212014134275


In [35]:
aroma = an.calculate_aromaticity(in_file, gene_analysis=False)

print(aroma)

0.015988372093023256


In [36]:
aroma = an.calculate_aromaticity(in_file, gene_analysis=True, save_file=True)

The Aromaticity score file can be found at: /home/souro/Projects/CodonU/Examples/Report/Aroma_report.xlsx


# GRAVY

In [38]:
gravy = an.calculate_gravy(in_file, gene_analysis=True)

for prot, val in gravy.items():
    print(f'{prot}: {val}')

prot_seq1: -2.2782716049382694
prot_seq2: -1.718727915194343


In [39]:
gravy = an.calculate_gravy(in_file)

print(gravy)

-2.0481104651162876


In [40]:
gravy = an.calculate_gravy(in_file, gene_analysis=True, save_file=True)

The GRAVY score file can be found at: /home/souro/Projects/CodonU/Examples/Report/GRAVY_report.xlsx


# Generate Summary

If you only want the results of all the operations at once, CodonU provides two functions:
- `generate_report`: Best for gene analysis
- `generate_report_summary`: Best for genome analysis

__Note:__ Both functions take the type of file as their second argument. Please set
- `_type = 'nuc'` for nucleotide analysis
- `_type = 'aa'` for protein analysis

## Nucleotide Analysis

In [43]:
in_file = 'Nucleotide/test.fasta'
an.generate_report(in_file, 'nuc', 11, 200)

Calculating RSCU, please be patient, this may take some time.
Calculating CAI, please be patient, this may take some time.
Calculating CBI, please be patient, this may take some time.
Calculating ENc, please be patient, this may take some time.

The report can be found at /home/souro/Projects/CodonU/Examples/Report/report_test_nuc.txt


In [44]:
an.generate_report_summary(in_file, 'nuc', 11, 200)

Calculating ENc, please be patient, this may take some time.
Calculating CAI, please be patient, this may take some time.
Calculating CBI, please be patient, this may take some time.
Calculating RSCU, please be patient, this may take some time.

The report can be found at /home/souro/Projects/CodonU/Examples/Report/summary_report_test_nuc.txt


## Protein Analysis

In [45]:
in_file = 'Protein/test.fasta'
an.generate_report(in_file, 'aa', 11, 66)

Calculating GRAVY score, please be patient, this may take some time.
Calculating aromaticity, please be patient, this may take some time.

The report can be found at /home/souro/Projects/CodonU/Examples/Report/report_test_aa.txt


In [48]:
an.generate_report_summary(in_file, 'aa', 11, 66)

Calculating GRAVY score, please be patient, this may take some time.
Calculating aromaticity, please be patient, this may take some time.

The report can be found at /home/souro/Projects/CodonU/Examples/Report/summary_report_test_aa.txt


# tAI

You can retrieve tRNA gene copy numbers from two well-known web servers:
  -  GtRNAdb [http://gtrnadb.ucsc.edu/](http://gtrnadb.ucsc.edu/)
  -  tRNADB-CE [http://trna.ie.niigata-u.ac.jp/cgi-bin/trnadb/index.cgi](http://trna.ie.niigata-u.ac.jp/cgi-bin/trnadb/index.cgi)
  
To retrieve the data you only need to give the link and the type of database. Here we retrieve the data for *Staphylococcus aureus subsp. aureus str. Newman*.

In [2]:
url = 'http://gtrnadb.ucsc.edu/GtRNAdb2/genomes/bacteria/Stap_aure_aureus_Newman/'

anti_codon_dict = an.get_anticodon_count_dict(
    url = url,
    database='GtRNAdb'
)

In [3]:
for anti_codon, count in anti_codon_dict.items():
    print(f'{anti_codon}: {count}')

IGC: 0
GGC: 0
CGC: 0
UGC: 3
ICC: 0
GCC: 2
CCC: 0
UCC: 5
IGG: 0
GGG: 0
CGG: 0
UGG: 2
IGU: 0
GGU: 0
CGU: 0
UGU: 3
IAC: 0
GAC: 0
CAC: 0
UAC: 3
IGA: 0
GGA: 1
CGA: 0
UGA: 3
GCU: 1
ICG: 2
GCG: 0
CCG: 1
UCG: 0
CCU: 0
UCU: 1
IAG: 0
GAG: 1
CAG: 0
UAG: 1
CAA: 1
UAA: 2
GAA: 2
GUU: 3
CUU: 0
UUU: 3
GUC: 4
CUC: 0
UUC: 3
GUG: 2
CUG: 0
UUG: 2
IAU: 0
GAU: 2
CAU: 4
UAU: 0
GUA: 2
UCA: 0
CUA: 0
UUA: 0
GCA: 1
CCA: 1


Now we will look at the function that calculates tAI, named `calculate_gtai`. Its parameters are:
  - `handle`: Path to the fasta file as a string
  - `anticodon_dict`: The dictionary containing anticodons as keys and counts as values
  - `genetic_code_num`: Genetic table number for the codon table
  - `reference`: Path to the reference fasta file as a string (Optional)
  - `size_pop`: Population size for the genetic algorithm (Optional)
  - `generation_num`: Number of generations for the genetic algorithm (Optional)
  - `save_file`: Option for saving the values in xlsx format (Optional)
  - `file_name`: Intended file name (Optional)
  - `folder_path`: Folder path where the file should be saved (Optional)
  
The function returns the following dataframes:
  - `tai_df`: The dataframe contains gene description and tAI values
  - `abs_wi_df`: The dataframe contains each anticodon and absolute weights according to the paper
  - `rel_wi_df`: The dataframe contains each anticodon and relative weights according to the paper
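
To make the role of the relative weights concrete: in the usual definition of tAI, the value for a gene is the geometric mean of the relative weights of its codons. The sketch below illustrates only that final step. The per-codon weights are hypothetical numbers chosen for the example, `tai_from_weights` is not a CodonU function, and the way CodonU derives its weights from anticodon counts is not reproduced here.

```python
import math


def tai_from_weights(seq, rel_weights):
    """Geometric mean of the relative weights of the codons in seq.

    Codons without a weight (for example stop codons) are skipped.
    """
    usable_len = len(seq) - len(seq) % 3   # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable_len, 3)]
    logs = [math.log(rel_weights[c]) for c in codons if c in rel_weights]
    if not logs:
        raise ValueError('no codon in the sequence has a weight')
    return math.exp(sum(logs) / len(logs))


# Hypothetical relative weights, for illustration only
rel_weights = {'AAA': 1.0, 'GCT': 0.25}

tai_demo = tai_from_weights('AAAGCT', rel_weights)
print(tai_demo)   # geometric mean of 1.0 and 0.25
```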
  
It is also worth mentioning that a file named `best_fit.py` will be created, containing the iteration number, the values of sij, and the fitness. A `.enc` file will be created as well.

As the function involves a parser, the example is provided as a separate file named [`tRNA.py`](https://github.com/SouradiptoC/CodonU/blob/8da5f9109433fb6048922f74543da238c8b7f970/Examples/tRNA.py)

# Customized Codon Table

In this section we will see how to create a custom codon table.

In [1]:
table_name = 'My_custom_name'
table_name_alt = 'MyCusNameAlt'    # this name is optional
table_id = 101
forward_table = {    # This must be a dictionary
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "TAT": "Y", "TAC": "Y",                           # noqa: E241, this is for autopep formatter
    "TGT": "C", "TGC": "C",             "TGG": "W",   # noqa: E241
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "ATT": "I", "ATC": "I", "ATA": "I", "ATG": "M",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "AGT": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}    # Note: Stop codons must be excluded
stop_codons=["TAA", "TAG", "TGA"]
start_codons=["TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"]
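
Before registering a table like the one above, it can be worth checking that it is complete and that no stop codon slipped into the forward table. The following is a self-contained sketch: `check_codon_table` is a hypothetical helper, and the demo table is rebuilt from the standard amino acid string so the block runs on its own.

```python
from itertools import product

# All 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...)
CODONS = [''.join(p) for p in product('TCAG', repeat=3)]


def check_codon_table(forward_table, stop_codons, start_codons):
    """Return a list of problems; an empty list means the table looks fine."""
    problems = []
    sense, stops = set(forward_table), set(stop_codons)
    if sense & stops:
        problems.append(f'stop codons in forward table: {sorted(sense & stops)}')
    missing = set(CODONS) - sense - stops
    if missing:
        problems.append(f'codons not covered: {sorted(missing)}')
    unknown = (sense | stops | set(start_codons)) - set(CODONS)
    if unknown:
        problems.append(f'not valid DNA codons: {sorted(unknown)}')
    return problems


# Standard code in the same TCAG order; '*' marks a stop codon
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
demo_forward = {c: aa for c, aa in zip(CODONS, AMINO_ACIDS) if aa != '*'}
demo_stops = [c for c, aa in zip(CODONS, AMINO_ACIDS) if aa == '*']
demo_starts = ['TTG', 'CTG', 'ATT', 'ATC', 'ATA', 'ATG', 'GTG']

ok_report = check_codon_table(demo_forward, demo_stops, demo_starts)

# A deliberately broken table: TAA is both a sense codon and a stop codon
bad_forward = dict(demo_forward, TAA='Y')
bad_report = check_codon_table(bad_forward, demo_stops, demo_starts)

print(ok_report)
print(bad_report)
```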

In [3]:
an.custom_codon_table(
    name=table_name,
    alt_name=table_name_alt,
    genetic_code_id=table_id,
    forward_table=forward_table,
    stop_codons=stop_codons,
    start_codons=start_codons
)

In [4]:
from Bio.Data.CodonTable import unambiguous_dna_by_id
print(unambiguous_dna_by_id[table_id])

Table 101 My_custom_name, MyCusNameAlt

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA I(s)| ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   

### Length and GC contents

In [1]:
from Bio.Seq import Seq
from CodonU.analyzer.internal_comp import g3, a3, gc_123, at_123

In [2]:
seq = Seq('ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA')

In [3]:
length = len(seq)
length

66

In [4]:
g_3 = g3(seq)
g_3

9.090909090909092

In [5]:
a_3 = a3(seq)
a_3

13.636363636363635

In [7]:
gc123 = gc_123(seq)
gc123
# total, for first, second and third positions

(51.515151515151516, 22.727272727272727, 68.18181818181819, 63.63636363636363)

In [8]:
at123 = at_123(seq)
at123
# total, for first, second and third positions

(48.484848484848484, 77.27272727272728, 31.818181818181813, 36.36363636363637)
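
The values above can be reproduced by hand, which helps to see what each function measures. The sketch below is plain Python rather than the CodonU functions themselves: `percent` and `position_stats` are illustrative names, and the sequence length is assumed to be a multiple of three.

```python
def percent(bases, targets):
    """Percentage of characters in `bases` that belong to `targets`."""
    return 100 * sum(base in targets for base in bases) / len(bases)


def position_stats(seq):
    """G3, A3 and GC content (total, 1st, 2nd, 3rd codon position)."""
    seq = str(seq).upper()
    # Slicing with a step of 3 collects every 1st, 2nd or 3rd codon position
    first, second, third = seq[0::3], seq[1::3], seq[2::3]
    return {
        'g3': percent(third, 'G'),
        'a3': percent(third, 'A'),
        'gc_123': (
            percent(seq, 'GC'),
            percent(first, 'GC'),
            percent(second, 'GC'),
            percent(third, 'GC'),
        ),
    }


seq = 'ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA'
stats = position_stats(seq)
print(stats)
```

For the sequence used in the cells above this gives about 9.09 for `g3`, 13.64 for `a3` and (51.52, 22.73, 68.18, 63.64) for `gc_123`, in line with the outputs shown.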