# Introduction

Welcome to the `CodonU` tutorial. This part covers file handling with CodonU: we will fetch some GenBank files from [NCBI](https://www.ncbi.nlm.nih.gov/).

In [None]:
# If you haven't installed it already, install it
# pip install CodonU

In [1]:
from CodonU import file_handler as fh    # for file handling
from CodonU import extractor as ex    # for extracting the data
import pandas as pd    # we will need it for reading the data
from os.path import join    # we will need it for writing the files

# Setting Entrez parameters

A word of caution: to use the package for data fetching, consider linking your Google account with NCBI. After that, you can get your API key. For details, click [here](https://support.nlm.nih.gov/knowledgebase/article/KA-05317/en-us).

If you don't set the parameters, retrieval will be slower and a warning will be shown (as you will see below, since I will not use my API key here).

In [None]:
# Below we set the email and API key values, both of which are strings
# email, api = '', ''
# fh.set_entrez_param(email=email, api_key=api)
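
If you would rather not type the email and API key into the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `NCBI_EMAIL` and `NCBI_API_KEY` and the helper `load_entrez_credentials` are my own assumptions and are not part of CodonU.

```python
import os


def load_entrez_credentials(env=os.environ):
    """Return (email, api_key) from the environment; missing or empty values become None."""
    # NCBI_EMAIL and NCBI_API_KEY are hypothetical variable names chosen for this example
    email = env.get("NCBI_EMAIL") or None
    api_key = env.get("NCBI_API_KEY") or None
    return email, api_key


email, api = load_entrez_credentials()
if email and api:
    # Here you would make the same call as in the cell above:
    # fh.set_entrez_param(email=email, api_key=api)
    print("Credentials found")
else:
    print("NCBI_EMAIL / NCBI_API_KEY not set, continuing without them")
```

This keeps the key out of the saved notebook, which helps if you share it.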

# Making directories

I will confine myself to a certain file and folder structure. You can make certain directories to organize your results.

In [2]:
folder_path_dict = {
    'nucleo': 'Nucleotide',
    'exome': 'Exome',
    'prot': 'Protein'
}

for folder in folder_path_dict.values():
    fh.make_dir(folder)

Nucleotide created successfully
Exome created successfully
Protein created successfully


# Reading the csv file

CodonU lets the user fetch GenBank files from NCBI directly. To do that, users just need a .csv file containing the organism names and accession IDs.

In [3]:
df = pd.read_csv('Staphylococcus_species.csv')
df

| Unnamed: 0 | Name | Accession_id |
|---|---|---|
| 0 | Staphylococcus agnetis | CP045927.1 |
| 1 | Staphylococcus argenteus | FR821777.2 |
| 2 | Staphylococcus aureus | CP000253.1 |
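
If you don't have such a file yet, a minimal sketch for creating one with Python's standard `csv` module is given here. The helper `write_species_csv` and the file name `example_species.csv` are made up for this example. Only the `Name` and `Accession_id` columns are written, because those are the two columns this tutorial reads; the unnamed first column in the output above looks like a saved index.

```python
import csv
import os
import tempfile

# Organisms and accession IDs taken from the table above
species = [
    ("Staphylococcus agnetis", "CP045927.1"),
    ("Staphylococcus argenteus", "FR821777.2"),
    ("Staphylococcus aureus", "CP000253.1"),
]


def write_species_csv(path, rows):
    """Write (name, accession id) pairs under the headers Name and Accession_id."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Name", "Accession_id"])
        writer.writerows(rows)


# Hypothetical output location; a temporary folder avoids overwriting your own files
csv_path = os.path.join(tempfile.mkdtemp(), "example_species.csv")
write_species_csv(csv_path, species)

# Read the file back to check its layout
with open(csv_path, newline="") as handle:
    for row in csv.DictReader(handle):
        print(row["Name"], row["Accession_id"])
```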


# Retrieving the gb file

In [4]:
for idx in df.index:
    # getting the accession id
    accession_id = df['Accession_id'][idx]
    # retrieving the data
    record = fh.get_gb(accession_id)
    cds_lst = ex.extract_cds_lst(record)
    # taking the organism name
    organism_name = "_".join(df['Name'][idx].split())
    # defining the file paths
    nuc_file_path = join(folder_path_dict['nucleo'], f"{organism_name}_nucleotide.fasta")
    exome_file_path = join(folder_path_dict['exome'], f"{organism_name}_exome.fasta")
    prot_file_path = join(folder_path_dict['prot'], f"{organism_name}_protein.fasta")
    # writing the files
    fh.write_nucleotide_fasta(nuc_file_path, cds_lst, record, df['Name'][idx])
    fh.write_exome_fasta(exome_file_path, nuc_file_path, df['Name'][idx])
    fh.write_protein_fasta(prot_file_path, cds_lst, df['Name'][idx])

Retrieval started


            Email address is not specified.

            To make use of NCBI's E-utilities, NCBI requires you to specify your
            email address with each request.  As an example, if your email address
            is A.N.Other@example.com, you can specify it as follows:
               from Bio import Entrez
               Entrez.email = 'A.N.Other@example.com'
            In case of excessive usage of the E-utilities, NCBI will attempt to contact
            a user at the email address provided before blocking access to the
            E-utilities.


Genbank file of Staphylococcus agnetis retrieved successfully
Nucleotide file for Staphylococcus agnetis created successfully
Exome file for Staphylococcus agnetis created successfully
Protein file for Staphylococcus agnetis created successfully
Retrieval started
Genbank file of Staphylococcus argenteus retrieved successfully
Nucleotide file for Staphylococcus argenteus created successfully
Exome file for Staphylococcus argenteus created successfully
Protein file for Staphylococcus argenteus created successfully
Retrieval started
Genbank file of Staphylococcus aureus subsp. aureus NCTC 8325 retrieved successfully
Nucleotide file for Staphylococcus aureus created successfully
Exome file for Staphylococcus aureus created successfully
Protein file for Staphylococcus aureus created successfully
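
The path-building part of the loop above can also be pulled into a small helper, which makes the naming scheme easier to reuse and check. This is only a sketch: `build_fasta_paths` is a hypothetical name and not a CodonU function; it reproduces the same file names as the loop.

```python
from os.path import join


def build_fasta_paths(name, folders):
    """Return the nucleotide, exome and protein FASTA paths for one organism."""
    # Replace whitespace with underscores, as done in the loop above
    stem = "_".join(name.split())
    return {
        'nucleo': join(folders['nucleo'], f"{stem}_nucleotide.fasta"),
        'exome': join(folders['exome'], f"{stem}_exome.fasta"),
        'prot': join(folders['prot'], f"{stem}_protein.fasta"),
    }


folders = {'nucleo': 'Nucleotide', 'exome': 'Exome', 'prot': 'Protein'}
paths = build_fasta_paths("Staphylococcus aureus", folders)
for kind, path in paths.items():
    print(kind, path)
```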


If you want to work with a single organism, a simpler approach may be:

```
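accession_id = ''    # accession id of the organism
organism_name = ''    # name of the organism

nuc_file_path = ''    # output path of the nucleotide fasta file
exome_file_path = ''    # output path of the exome fasta file
prot_file_path = ''    # output path of the protein fasta file

# retrieving the data
record = fh.get_gb(accession_id)
cds_lst = ex.extract_cds_lst(record)

# writing the files
fh.write_nucleotide_fasta(nuc_file_path, cds_lst, record, organism_name)
fh.write_exome_fasta(exome_file_path, nuc_file_path, organism_name)
fh.write_protein_fasta(prot_file_path, cds_lst, organism_name)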
```