# Introduction

Welcome to the `CodonU` tutorial. This part covers file handling with CodonU: we will fetch a GenBank file from [NCBI](https://www.ncbi.nlm.nih.gov/) and write FASTA files from it.

In [None]:
# If you haven't installed CodonU yet, install it first
# pip install CodonU

In [1]:
from CodonU.file_handler import make_dir, set_entrez_param    # for making directories and setting Entrez parameters
from CodonU.file_handler import write_exome_fasta, write_nucleotide_fasta, write_protein_fasta    # for retrieving data
import pandas as pd    # we will need it for reading the data
from os.path import join    # we will need it for writing the files

# Setting Entrez parameters

A word of caution: to use the package for data fetching, consider linking your Google account with NCBI. You can then get your API key. For details, click [here](https://support.nlm.nih.gov/knowledgebase/article/KA-05317/en-us).

If you don't set the parameters, retrieval will be slower and a warning will be shown (as you will see below, since I do not use my API key here).

In [None]:
# Below we set the email and API key, both of which are strings
# email, api = '', ''
# set_entrez_param(email=email, api_key=api)
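
One way to keep credentials out of the notebook is to read them from environment variables. The sketch below is an illustration only: the variable names `NCBI_EMAIL` and `NCBI_API_KEY` and the helper `entrez_params_from_env` are hypothetical and not part of CodonU. The returned dictionary is meant to be passed on to `set_entrez_param` with the `email` and `api_key` keywords.

```python
import os


def entrez_params_from_env(environ=None):
    """Collect Entrez credentials from environment variables.

    NCBI_EMAIL and NCBI_API_KEY are hypothetical names chosen for this sketch.
    Returns a dict with the keys 'email' and 'api_key', or None if either is missing.
    """
    environ = os.environ if environ is None else environ
    email = environ.get('NCBI_EMAIL', '').strip()
    api_key = environ.get('NCBI_API_KEY', '').strip()
    if not email or not api_key:
        return None
    return {'email': email, 'api_key': api_key}


params = entrez_params_from_env()
if params is None:
    print('Entrez credentials not found; retrieval will be slower.')
# else:
#     set_entrez_param(**params)    # set_entrez_param comes from CodonU.file_handler
```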

# Making directories

I will confine myself to a certain file and folder structure. You can make certain directories to organize your results.

In [4]:
folder_path_dict = {
    'nucleo': 'Nucleotide',
    'exome': 'Exome',
    'prot': 'Protein'
}

for folder in folder_path_dict.values():
    make_dir(folder)
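
For readers who want to see the layout spelled out, the same three folders can be created with the standard library alone. This is a sketch using `pathlib`, not a description of how `make_dir` is implemented. The helper name `make_dirs` and its `base` argument are made up for illustration, and the demonstration runs in a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path


def make_dirs(folder_names, base='.'):
    """Create each folder under `base` and return the resulting paths.

    Hypothetical helper for this sketch; folders that already exist are left untouched.
    """
    created = []
    for name in folder_names:
        folder = Path(base) / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


folder_names = ['Nucleotide', 'Exome', 'Protein']

# Try it in a throwaway directory; pass base='.' to create the folders in place.
with tempfile.TemporaryDirectory() as tmp:
    for folder in make_dirs(folder_names, base=tmp):
        print(folder.name, folder.is_dir())
```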

# Retrieving the gb file

In this tutorial we will fetch the GenBank file of *Staphylococcus aureus subsp. aureus str. Newman* (accession ID: AP009351.1).

In [2]:
accession_id = 'AP009351.1'

## Writing the nucleotide file containing CDS

In [3]:
file_path = 'Nucleotide/Staphylococcus_aureus.fasta'
write_nucleotide_fasta(accession_id=accession_id, file_path=file_path)

Retrieval started


            Email address is not specified.

            To make use of NCBI's E-utilities, NCBI requires you to specify your
            email address with each request.  As an example, if your email address
            is A.N.Other@example.com, you can specify it as follows:
               from Bio import Entrez
               Entrez.email = 'A.N.Other@example.com'
            In case of excessive usage of the E-utilities, NCBI will attempt to contact
            a user at the email address provided before blocking access to the
            E-utilities.


Genbank file of Staphylococcus aureus subsp. aureus str. Newman retrieved successfully
Nucleotide file can be found at: /home/souro/Projects/CodonU/Examples/Nucleotide/Staphylococcus_aureus.fasta
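
To check what was written, the FASTA file can be read back. The sketch below uses a small hand-written parser so that it needs nothing beyond the standard library. `read_fasta` is a hypothetical helper, and the two records in the demonstration are invented placeholders, not sequences from AP009351.1.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue    # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


# Invented two-record example; pass an open file handle to inspect a real file.
demo = ['>cds_1 placeholder', 'ATGAAA', 'TTTTAA', '>cds_2 placeholder', 'ATGCCCTGA']
for header, seq in read_fasta(demo):
    print(header, len(seq))
```

For the file written above, something like `with open(file_path) as handle: records = read_fasta(handle)` should give one tuple per CDS.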


## Writing the protein file

In [5]:
file_path = 'Protein/Staphylococcus_aureus.fasta'
write_protein_fasta(accession_id=accession_id, file_path=file_path)

Retrieval started


            Email address is not specified.

            To make use of NCBI's E-utilities, NCBI requires you to specify your
            email address with each request.  As an example, if your email address
            is A.N.Other@example.com, you can specify it as follows:
               from Bio import Entrez
               Entrez.email = 'A.N.Other@example.com'
            In case of excessive usage of the E-utilities, NCBI will attempt to contact
            a user at the email address provided before blocking access to the
            E-utilities.


Genbank file of Staphylococcus aureus subsp. aureus str. Newman retrieved successfully
Protein file can be found at: /home/souro/Projects/CodonU/Examples/Protein/Staphylococcus_aureus.fasta


## Writing the exome file
The function that writes the exome file takes an additional parameter, `exclude_stops`, which controls whether internal stop codons are included. Put simply, the exome file contains all the CDS joined into a single sequence. If the stop codon of each CDS were kept, the joined sequence would contain stop codons in its interior and would be meaningless for analysis. If you pass `True`, these internal stop codons are not appended while the exome is built. If you pass `False`, the sequence will contain internal stop codons, which may cause problems later.
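
The effect of `exclude_stops` can be illustrated on toy data. The following is a sketch of the idea only, not CodonU's implementation: `concat_cds` and `count_stops` are hypothetical functions, the stop codons are those of the standard genetic code, and the sequences are invented. It assumes each CDS ends with its stop codon and that this terminal codon is the one dropped.

```python
STOP_CODONS = {'TAA', 'TAG', 'TGA'}    # stop codons of the standard genetic code


def concat_cds(cds_list, exclude_stops=True):
    """Join coding sequences into one string.

    With exclude_stops=True, a terminal stop codon is trimmed from each CDS
    before joining (an assumption of this sketch).
    """
    parts = []
    for cds in cds_list:
        cds = cds.upper()
        if exclude_stops and len(cds) >= 3 and cds[-3:] in STOP_CODONS:
            cds = cds[:-3]
        parts.append(cds)
    return ''.join(parts)


def count_stops(seq):
    """Count in-frame stop codons, reading from the first base."""
    return sum(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 2, 3))


toy_cds = ['ATGAAATAA', 'ATGCCCTGA']    # invented sequences
print(count_stops(concat_cds(toy_cds, exclude_stops=False)))    # 2
print(count_stops(concat_cds(toy_cds, exclude_stops=True)))     # 0
```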

In [7]:
file_path = 'Exome/Staphylococcus_aureus.fasta'
write_exome_fasta(file_path=file_path, accession_id=accession_id, exclude_stops=True)

Retrieval started


            Email address is not specified.

            To make use of NCBI's E-utilities, NCBI requires you to specify your
            email address with each request.  As an example, if your email address
            is A.N.Other@example.com, you can specify it as follows:
               from Bio import Entrez
               Entrez.email = 'A.N.Other@example.com'
            In case of excessive usage of the E-utilities, NCBI will attempt to contact
            a user at the email address provided before blocking access to the
            E-utilities.


Genbank file of Staphylococcus aureus subsp. aureus str. Newman retrieved successfully
Exome file can be found at: /home/souro/Projects/CodonU/Examples/Exome/Staphylococcus_aureus.fasta


If you want to work with multiple organisms, you can try:

```
from os.path import join
from CodonU.file_handler import write_nucleotide_fasta

accession_ids = []    # accession ids of the organisms
organism_names = []    # respective organism names, in the same order

nuc_folder_path = ''
# exome_folder_path = ''
# prot_folder_path = ''

for accession_id, organism_name in zip(accession_ids, organism_names):
    # one FASTA file per organism inside the folder; use another folder path if necessary
    file_path = join(nuc_folder_path, f'{organism_name}.fasta')
    write_nucleotide_fasta(accession_id=accession_id, file_path=file_path)    # or another write function

```
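
When fetching many organisms, it can help to build the output paths in one place and to keep going if a single retrieval fails. The sketch below takes the writer function as an argument, so it can be tried without network access. `fetch_all`, `fake_writer`, and the accession IDs `ID_1` and `ID_2` are all invented for illustration. It assumes the writer accepts the `accession_id` and `file_path` keywords used earlier in this tutorial.

```python
from os.path import join


def fetch_all(accession_ids, organism_names, folder_path, writer):
    """Call `writer` once per organism and report which accessions failed.

    `writer` is any function accepting accession_id= and file_path= keywords,
    for example write_nucleotide_fasta. Returns a list of (accession_id, error) pairs.
    """
    failures = []
    for accession_id, name in zip(accession_ids, organism_names):
        # Replace spaces so the organism name makes a tidy file name
        file_path = join(folder_path, f"{name.replace(' ', '_')}.fasta")
        try:
            writer(accession_id=accession_id, file_path=file_path)
        except Exception as err:    # keep going; collect the error for later
            failures.append((accession_id, err))
    return failures


def fake_writer(accession_id, file_path):
    """Stand-in for a CodonU writer; it only prints what would be fetched."""
    print(f'{accession_id} -> {file_path}')


failed = fetch_all(['ID_1', 'ID_2'], ['Organism one', 'Organism two'], 'Nucleotide', fake_writer)
print(failed)
```

Passing `write_nucleotide_fasta` in place of `fake_writer` would run the real retrieval; any accession that raises an error ends up in the returned list instead of stopping the loop.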