# Introduction

In this tutorial we will see how to analyse the data we just retrieved!

In [1]:
from CodonU import analyzer as an

# Setting file  path

Whole-genome FASTA files contain more than 2000 genes, so analysing one in full produces a significant amount of data. Here I restrict the analysis to the first two genes of _Staphylococcus agnetis_.

In [2]:
in_file = 'Nucleotide/test.fasta'
# in_file = ''    your choice of file

# Setting up necessary parameters

Here we are going to set three parameters:
- `genetic_code_num`: Genetic table number for the codon table. To learn more about genetic table numbers, click [here](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
- `gene_analysis`: Whether to perform gene analysis or genome analysis. Set `True` to analyse each gene individually, or `False` to analyse all genes together as a whole genome
- `min_len_threshold`: Minimum length a nucleotide sequence must have to be considered a gene

In [3]:
genetic_code_num = 11
gene_analysis = True
min_len_threshold = 200
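
To make the role of `min_len_threshold` concrete, here is a minimal sketch of a length filter in plain Python. It is not CodonU's implementation: the `records` dictionary and the `filter_by_length` helper are hypothetical, and whether CodonU treats the threshold as inclusive is an assumption.

```python
# Hypothetical records: sequence IDs mapped to nucleotide strings (not real data).
records = {
    "gene_1": "ATG" + "GCT" * 80 + "TAA",  # 246 nucleotides
    "gene_2": "ATG" + "GCT" * 20 + "TAA",  # 66 nucleotides
}


def filter_by_length(seqs, min_len_threshold=200):
    """Keep sequences with at least min_len_threshold nucleotides.

    Assumption: the threshold is inclusive; CodonU may behave differently.
    """
    return {name: seq for name, seq in seqs.items() if len(seq) >= min_len_threshold}


kept = filter_by_length(records, min_len_threshold=200)
print(list(kept))
```

With a threshold of 200, only the longer hypothetical record survives the filter.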

# CAI

## Gene Analysis

A word of caution: you may find values reported as `nan`. For an explanation, see the first part of [Generate Summary](#generate_summary).

In [5]:
cai = an.calculate_cai(in_file, genetic_code_num, min_len_threshold, gene_analysis)

The code below only prints the results and does nothing else.

In [6]:
# the code below only displays the results
for gene in cai.keys():
    print(f'{gene}:')
    for codon, cai_val in cai[gene].items():
        print(f'    {codon}: {cai_val}')

gene_1:
    TTT: 1.0
    TTC: 0.5
    TTA: 1.0
    TTG: 0.5
    TCT: 0.6956521739130435
    TCC: 0.43478260869565216
    TCA: 1.0
    TCG: 0.043478260869565216
    TAT: 1.0
    TAC: 1.0
    TGT: 1.0
    TGC: 1.0
    TGG: nan
    CTT: 0.5
    CTC: 0.5
    CTA: 0.5
    CTG: 1.0
    CCT: 1.0
    CCC: 0.25
    CCA: 0.5
    CCG: 0.5
    CAT: 1.0
    CAC: 0.5
    CAA: 1.0
    CAG: 0.0625
    CGT: 1.0
    CGC: 0.25
    CGA: 0.12500000000000003
    CGG: 0.12500000000000003
    ATT: 0.5
    ATC: 1.0
    ATA: 0.5
    ATG: nan
    ACT: 1.0
    ACC: 0.2
    ACA: 0.9333333333333333
    ACG: 0.3333333333333333
    AAT: 1.0
    AAC: 0.2857142857142857
    AAA: 1.0
    AAG: 0.1346153846153846
    AGT: 0.13043478260869565
    AGC: 0.021739130434782608
    AGA: 0.75
    AGG: 0.12500000000000003
    GTT: 1.0
    GTC: 0.12500000000000003
    GTA: 0.7500000000000001
    GTG: 0.25
    GCT: 1.0
    GCC: 0.14285714285714285
    GCA: 0.42857142857142855
    GCG: 0.3571428571428572
    GAT: 1.0
    GAC: 0.92857
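
The `nan` entries in the output belong to `TGG` and `ATG`, which in genetic code 11 are the only codons for tryptophan and methionine, so there is no synonymous codon to compare them against. If you want to skip such entries before further processing, a sketch like the following may help. The `cai_example` dictionary is a hypothetical excerpt shaped like the printed output, and `drop_nan` is an illustrative helper, not part of CodonU.

```python
import math

# Hypothetical excerpt shaped like the gene-level dictionary printed above.
cai_example = {
    "gene_1": {"TTT": 1.0, "TTC": 0.5, "TGG": float("nan"), "ATG": float("nan")},
}


def drop_nan(per_gene):
    """Return a copy of the nested dictionary without nan-valued codons."""
    return {
        gene: {codon: val for codon, val in vals.items() if not math.isnan(val)}
        for gene, vals in per_gene.items()
    }


clean = drop_nan(cai_example)
print(clean)
```

Note that `nan` never compares equal to itself, so `math.isnan` is used rather than `==`.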

## Genome analysis

In [7]:
cai = an.calculate_cai(in_file, genetic_code_num, min_len_threshold, gene_analysis=False)

# the code below only displays the results
for codon, cai_val in cai.items():
    print(f'{codon}: {cai_val}')

TTT: 1.0
TTC: 0.3333333333333333
TTA: 1.0
TTG: 0.4
TCT: 0.8235294117647057
TCC: 0.38235294117647056
TCA: 1.0
TCG: 0.02941176470588235
TAT: 1.0
TAC: 0.5
TGT: 1.0
TGC: 1.0
TGG: nan
CTT: 0.2
CTC: 0.10000000000000002
CTA: 0.05000000000000001
CTG: 0.10000000000000002
CCT: 1.0
CCC: 0.16666666666666669
CCA: 0.3333333333333333
CCG: 1.0
CAT: 1.0
CAC: 0.25
CAA: 1.0
CAG: 0.0625
CGT: 1.0
CGC: 0.5
CGA: 0.08333333333333333
CGG: 0.3333333333333333
ATT: 1.0
ATC: 0.2857142857142857
ATA: 0.5714285714285714
ATG: nan
ACT: 0.85
ACC: 0.15
ACA: 1.0
ACG: 0.7499999999999999
AAT: 1.0
AAC: 0.5
AAA: 1.0
AAG: 0.3211009174311927
AGT: 0.20588235294117643
AGC: 0.014705882352941173
AGA: 0.8333333333333333
AGG: 0.16666666666666669
GTT: 1.0
GTC: 0.09999999999999998
GTA: 1.0
GTG: 0.39999999999999997
GCT: 1.0
GCC: 0.11111111111111109
GCA: 0.4444444444444444
GCG: 0.5555555555555556
GAT: 1.0
GAC: 0.7045454545454546
GAA: 1.0
GAG: 0.4387755102040816
GGT: 1.0
GGC: 0.6666666666666666
GGA: 0.8333333333333334
GGG: 0.5
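
Values of this shape, where one codon per amino acid scores `1.0`, are consistent with the relative adaptiveness weights used in the classical CAI definition: each codon's count divided by the count of the most frequent synonymous codon. The sketch below illustrates that idea only. The counts are hypothetical, `relative_adaptiveness` is an illustrative helper, and the tutorial does not confirm that CodonU computes its values exactly this way.

```python
# Hypothetical codon counts for one amino acid (not taken from test.fasta).
phe_counts = {"TTT": 6, "TTC": 2}


def relative_adaptiveness(counts):
    """Divide each synonymous codon's count by the most frequent codon's count."""
    top = max(counts.values())
    return {codon: n / top for codon, n in counts.items()}


weights = relative_adaptiveness(phe_counts)
print(weights)
```

Under this definition an amino acid with a single codon always gets a weight of 1 and carries no information about codon preference, which is why such codons are usually left out of CAI calculations.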


## Saving the file

Although you can work with the results directly in Python, you may want to validate the data elsewhere. The analyzer functions therefore provide an easy way to save the data as an .xlsx file.

__To save the file in Excel format, you must have the [openpyxl](https://pypi.org/project/openpyxl/) package installed.__

In [8]:
# file_name = ''
# folder_path = ''
cai = an.calculate_cai(in_file, genetic_code_num, save_file=True)

Report created successfully
The CAI score file can be found at: /home/souro/Projects/CodonU/Examples/Report/CAI_report.xlsx
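
If openpyxl is not available, one possible workaround is to write the returned dictionary to CSV yourself with the standard library. This is only a sketch and not a CodonU feature: `cai_example` is a hypothetical genome-level result, and `write_cai_csv` and the column names are made up for illustration.

```python
import csv
import io

# Hypothetical genome-level result: codon -> value, as printed earlier.
cai_example = {"TTT": 1.0, "TTC": 0.3333333333333333, "TGG": float("nan")}


def write_cai_csv(values, handle):
    """Write one row per codon to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["codon", "cai"])  # illustrative column names
    for codon, val in values.items():
        writer.writerow([codon, val])


# An in-memory buffer keeps the example self-contained; to write a real file,
# use open(path, "w", newline="") instead.
buffer = io.StringIO()
write_cai_csv(cai_example, buffer)
print(buffer.getvalue())
```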


# CBI

## Gene Analysis

In [11]:
cbi = an.calculate_cbi(in_file, genetic_code_num, min_len_threshold=66, gene_analysis=True)

__Note__: here `min_len_threshold = 66` because a 198-base-pair nucleotide sequence translates to 66 amino acids. For the nucleotide-based analysis ([CAI](#CAI)) `min_len_threshold = 200`, hence in this case it is 66. In general, wherever the sequence is nucleotide-based the default value of `min_len_threshold` is 200; where it is amino-acid-based, the default is 66.
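
The arithmetic behind the two thresholds can be checked in a couple of lines. The variable names here are illustrative only.

```python
nucleotide_threshold = 200  # threshold for nucleotide-based sequences
codon_length = 3            # nucleotides per codon

# Integer division gives the number of complete codons, i.e. amino acids.
aa_threshold = nucleotide_threshold // codon_length
print(aa_threshold)                 # 66
print(aa_threshold * codon_length)  # 198
```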

The code below is only for visualization.

In [17]:
for gene, cbi_vals in cbi.items():
    print(f'{gene}:')
    for aa, cbi_val in cbi_vals.items():
        print(f'    {aa}: {cbi_val[0]}\t{cbi_val[1]}')

gene_1:
    A: 0.35802469135802467	GCT
    C: nan	None
    D: 0.037037037037037035	GAT
    E: 0.4444444444444444	GAA
    F: 1.0	TTT
    G: 0.3333333333333333	GGT
    H: 1.0	CAT
    I: 0.25	ATC
    K: 0.7627118644067796	AAA
    L: 0.4	TTA
    M: nan	ATG
    N: 0.5555555555555556	AAT
    P: 0.3333333333333333	CCT
    Q: 1.0	CAA
    R: 0.4	CGT
    S: 0.32075471698113206	TCA
    T: 0.2072072072072072	ACT
    V: 0.3333333333333333	GTT
    W: nan	TGG
    Y: nan	None
gene_2:
    A: 0.2727272727272727	GCG
    C: nan	None
    D: 0.5238095238095238	GAT
    E: 0.06666666666666667	GAG
    F: 0.42857142857142855	TTT
    G: 0.08333333333333333	GGT
    H: 0.5	CAT
    I: 0.5	ATT
    K: 0.3411764705882353	AAA
    L: 0.47500000000000003	TTA
    M: nan	ATG
    N: 0.2222222222222222	AAT
    P: 0.5555555555555556	CCG
    Q: 0.7777777777777778	CAA
    R: 0.06666666666666667	CGT
    S: 0.28	TCT
    T: 0.4074074074074074	ACG
    V: 0.3333333333333333	GTA
    W: nan	TGG
    Y: 1.0	TAT
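
Each entry in the output pairs a value with a codon, and amino acids without a usable value show `nan`, sometimes with `None` in place of the codon. To collect only the codons that come with a numeric value, something like the following sketch could be used. The `cbi_example` dictionary is a hypothetical excerpt shaped like the output above, and `codons_with_values` is an illustrative helper, not part of CodonU.

```python
import math

# Hypothetical excerpt shaped like the CBI output: amino acid -> (value, codon).
cbi_example = {
    "gene_1": {
        "A": (0.358, "GCT"),
        "C": (float("nan"), None),
        "M": (float("nan"), "ATG"),
    },
}


def codons_with_values(per_gene):
    """Map each amino acid to its reported codon, skipping nan or missing entries."""
    return {
        gene: {
            aa: codon
            for aa, (val, codon) in vals.items()
            if codon is not None and not math.isnan(val)
        }
        for gene, vals in per_gene.items()
    }


print(codons_with_values(cbi_example))
```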


## Genome Analysis

In [None]:
cbi = an.calculate_cbi(in_file, genetic_code_num)

# Generate Summary

## generate_summary