# **EX_1: Sequence retrieval and manipulation**

### Installing Biopython

In [None]:
pip install biopython

Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.84


### Mounting Google Drive in Colab

To access files from your Google Drive in a Colab notebook, use the following code:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Checking the Biopython Installation

Check that Biopython is installed and imports properly by printing a sample DNA sequence and a sample protein sequence.

In [None]:
# Load the Seq class from Biopython
from Bio.Seq import Seq
dna = Seq("TGTGACTA")
protein = Seq("MCVTEDSWL")
print(dna)
print(protein)


TGTGACTA
MCVTEDSWL


#### The SeqIO module is designed for reading (parsing), writing, and manipulating biological sequence data. It supports a wide range of file formats commonly used in bioinformatics, such as FASTA, GenBank, EMBL, PHYLIP, and NEXUS.

#### The Entrez module allows you to access and query various biological databases hosted by the National Center for Biotechnology Information (NCBI).

In [None]:
from Bio import Entrez
from Bio import SeqIO

### Fetching Sequence Data from NCBI Using Biopython

This code sets up an email (required by NCBI) and retrieves a nucleotide sequence in FASTA format from the **nuccore** database using the GI identifier. The sequence is then printed to the output.


In [None]:
# Set the contact email that NCBI requires
Entrez.email = 'example@gmail.com'

# Identifier: GI number
# Database: nuccore
# rettype='fasta' asks for the record in FASTA format
handle = Entrez.efetch(db='nuccore', id='529158032', rettype='fasta')
print(handle.read())

>KC668274.1 Hordeum vulgare haplotype 16 LUX (LUX) gene, complete cds
ATGGGGGAGGAGGCCGGCGGCTACGGTTTTGACTTCGGAGGATATGGTGGGTATGAAGGGAGGGTGACCG
AGTGGGAGACGGGGCTGCCGGGGTGCGACGAGCTGACCCCGCTGTCTCAGCCGCTGGTGCCGCCGGGGCT
CGCCGCCGCGTTCCGCATCCCGCCGGAGCCGGGGCGCACGCTTCTGGACGTGCACCGCGCGTCCTCCGCC
ACCGTGTCCCGCCTCCGCTCCGCCTCCTCGTCCCCCTCCTCCGGCAACGCCCCCGCCACCGGCGGCTCCT
TCCCGTCCTTCCCCGGCAAGTCGGCGGCCGGGGACGACAACAACAACAACAGCTCGGCCGAGTCGGCGGG
GGAGAAGTCTGCCGCGGCGGCGACGAAGCGGGCGCGGCTGGTGTGGACGCCGCAGCTGCACAAGCGGTTC
GTGGAGGTGGTGGCGCACCTGGGGATCAAGAGCGCCGTGCCCAAGACCATCATGCAGCTGATGAACGTGG
AGGGGCTCACCCGCGAGAACGTCGCCAGCCACCTCCAGAAGTACCGCCTCTACGTCAAGCGGATGCAGGG
CCTCTCCAACGAGGGGCCCTCCGCCTCCGACCACATCTTCGCCTCCACCCCCGTCCCCCACAGCCTCCGC
GAGCCGCAGGTCCCCGTCCCGCACGCCGCCGCCATGGCCCCCGCCATGTACCACCACCACCCGGCCCCCA
TGGGCGGCGTCGCCGCCGGCCACGGCGGCTACTACCAGCAGCAGCACAGCGCCCACGCCGTCTACAACGG
CTACGGCGGCGGCGTCTCCTCCTACCCGCACTACCACCACGGCGACCAGTGA
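
To make the `efetch` parameters less abstract, the sketch below assembles the kind of E-utilities URL that such a request corresponds to, without sending anything over the network. The helper name `build_efetch_url` is hypothetical, and the exact set of parameters Biopython sends (for example a `tool` field) may differ; only the EFetch base address and the `db`, `id`, `rettype`, `retmode`, and `email` parameters are taken as given.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, seq_id, rettype, retmode="text", email="example@gmail.com"):
    """Hypothetical helper: assemble an EFetch query URL without sending it."""
    params = {
        "db": db,            # database to query, e.g. 'nuccore'
        "id": seq_id,        # GI number or accession number
        "rettype": rettype,  # record format, e.g. 'fasta' or 'gb'
        "retmode": retmode,  # encoding of the reply, e.g. 'text'
        "email": email,      # contact address that NCBI asks for
    }
    return EFETCH_URL + "?" + urlencode(params)


url = build_efetch_url("nuccore", "529158032", "fasta")
print(url)
```

Changing `rettype` to `'gb'` in the call changes only that one field of the query string, which is why switching between FASTA and GenBank output needs no other code changes.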




### Fetching a Nucleotide Sequence in GenBank Format Using Accession Number

This code fetches a nucleotide sequence from the **NCBI** database using an **accession number**. The data is retrieved in **GenBank format** by specifying `rettype='gb'`. To change the format to **FASTA**, simply modify the `rettype` parameter to `'fasta'`.



In [None]:
# Identifier: accession number
# Database: nucleotide
# For FASTA output, set rettype='fasta' instead, for example:
# handle = Entrez.efetch(db='nucleotide', id='NM_033497.3', rettype='fasta')
# rettype='gb' returns the record in GenBank format
handle = Entrez.efetch(db='nucleotide', id='KC668274.1', rettype='gb')
print(handle.read())

LOCUS       KC668274                 822 bp    DNA     linear   PLN 06-AUG-2013
DEFINITION  Hordeum vulgare haplotype 16 LUX (LUX) gene, complete cds.
ACCESSION   KC668274
VERSION     KC668274.1
KEYWORDS    .
SOURCE      Hordeum vulgare
  ORGANISM  Hordeum vulgare
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliopsida; Liliopsida; Poales; Poaceae; BOP
            clade; Pooideae; Triticodae; Triticeae; Hordeinae; Hordeum.
REFERENCE   1  (bases 1 to 822)
  AUTHORS   Campoli,C., Pankin,A., Drosse,B., Casao,C.M., Davis,S.J. and von
            Korff,M.
  TITLE     HvLUX1 is a candidate gene underlying the early maturity 10 locus
            in barley: phylogeny, diversity, and interactions with the
            circadian clock and photoperiodic pathways
  JOURNAL   New Phytol. 199 (4), 1045-1059 (2013)
   PUBMED   23731278
REFERENCE   2  (bases 1 to 822)
  AUTHORS   Pankin,A. and von Korff,M.
  TITLE     Direct Submission
  J

Explore the sequence features and download

### Fetching a Nucleotide Sequence as Text and Parsing with Biopython

This code retrieves a nucleotide sequence in **FASTA format** from NCBI’s **nucleotide** database using the accession number. By specifying `retmode='text'`, the data is returned in plain text format. The **SeqIO.read()** function is then used to parse the FASTA sequence for easier manipulation and viewing in a structured way.


In [None]:
# retmode='text' returns the record as plain text
# SeqIO.read() parses the FASTA text into a record object, which is easier to work with than the raw string printed earlier
handle = Entrez.efetch(db='nucleotide', id='KC668274.1', rettype='fasta', retmode='text')
gene = SeqIO.read(handle, 'fasta')
print(gene)

ID: KC668274.1
Name: KC668274.1
Description: KC668274.1 Hordeum vulgare haplotype 16 LUX (LUX) gene, complete cds
Number of features: 0
Seq('ATGGGGGAGGAGGCCGGCGGCTACGGTTTTGACTTCGGAGGATATGGTGGGTAT...TGA')


### Accessing and Printing Sequence Details from a FASTA File

This code takes the record parsed earlier with Biopython's `SeqIO.read()` function and prints specific details from it:
- **gene.id**: The unique identifier of the sequence.
- **gene.description**: A brief description of the sequence.
- **gene.seq**: The actual nucleotide sequence.
- **len(gene.seq)**: The length of the nucleotide sequence.
- **gene.features**: Any additional features associated with the sequence (usually empty in FASTA format).


In [None]:
# Print selected attributes of the downloaded record
print(gene.id)
print(gene.description)
print(gene.seq)
print(len(gene.seq))
print(gene.features)

KC668274.1
KC668274.1 Hordeum vulgare haplotype 16 LUX (LUX) gene, complete cds
ATGGGGGAGGAGGCCGGCGGCTACGGTTTTGACTTCGGAGGATATGGTGGGTATGAAGGGAGGGTGACCGAGTGGGAGACGGGGCTGCCGGGGTGCGACGAGCTGACCCCGCTGTCTCAGCCGCTGGTGCCGCCGGGGCTCGCCGCCGCGTTCCGCATCCCGCCGGAGCCGGGGCGCACGCTTCTGGACGTGCACCGCGCGTCCTCCGCCACCGTGTCCCGCCTCCGCTCCGCCTCCTCGTCCCCCTCCTCCGGCAACGCCCCCGCCACCGGCGGCTCCTTCCCGTCCTTCCCCGGCAAGTCGGCGGCCGGGGACGACAACAACAACAACAGCTCGGCCGAGTCGGCGGGGGAGAAGTCTGCCGCGGCGGCGACGAAGCGGGCGCGGCTGGTGTGGACGCCGCAGCTGCACAAGCGGTTCGTGGAGGTGGTGGCGCACCTGGGGATCAAGAGCGCCGTGCCCAAGACCATCATGCAGCTGATGAACGTGGAGGGGCTCACCCGCGAGAACGTCGCCAGCCACCTCCAGAAGTACCGCCTCTACGTCAAGCGGATGCAGGGCCTCTCCAACGAGGGGCCCTCCGCCTCCGACCACATCTTCGCCTCCACCCCCGTCCCCCACAGCCTCCGCGAGCCGCAGGTCCCCGTCCCGCACGCCGCCGCCATGGCCCCCGCCATGTACCACCACCACCCGGCCCCCATGGGCGGCGTCGCCGCCGGCCACGGCGGCTACTACCAGCAGCAGCACAGCGCCCACGCCGTCTACAACGGCTACGGCGGCGGCGTCTCCTCCTACCCGCACTACCACCACGGCGACCAGTGA
822
[]
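
The same attributes can be explored offline by parsing FASTA text held in memory. The sketch below assumes a made-up toy record (not a real NCBI entry) and uses `StringIO` to stand in for the network handle; it shows how the header line is split into `id` and `description`.

```python
from io import StringIO

from Bio import SeqIO

# Toy FASTA text, made up for illustration (not a real NCBI record).
fasta_text = ">toy1 example sequence for demonstration\nATGGCC\nTTGTAA\n"

# StringIO wraps the string so SeqIO can read it like a file handle.
record = SeqIO.read(StringIO(fasta_text), "fasta")

print(record.id)           # first word of the header line
print(record.description)  # the whole header line without '>'
print(len(record.seq))     # the two sequence lines are joined
print(record.features)     # FASTA carries no feature annotations
```

`SeqIO.read()` expects exactly one record in the handle; for files with several records, `SeqIO.parse()` is the function to use.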


### Saving the Fetched Sequence to Google Drive as a FASTA File

This code demonstrates how to save a nucleotide sequence fetched using Biopython’s `Entrez.efetch()` function to a file on **Google Drive**. The file is saved in **FASTA format** using Biopython’s `SeqIO.write()` function.

`SeqIO.write()` writes the `gene` record to the file in FASTA format. Its arguments are:
- **gene**: The sequence record parsed earlier with `SeqIO.read()`. It holds the sequence data to write to the file.
- **out**: The file object created with `open()`, which refers to the output file, here `/content/drive/MyDrive/example/lux.fasta`.
- **'fasta'**: The output format, here FASTA, a common format for representing nucleotide or protein sequences.

In [None]:
# Open the output file on Google Drive for writing
# (the 'example' folder must already exist in MyDrive)
with open('/content/drive/MyDrive/example/lux.fasta', 'w') as out:
  # Write the record fetched from Entrez in FASTA format
  SeqIO.write(gene, out, 'fasta')

# Exercise 1: Retrieve and download multiple sequences with the following accession IDs:
'NM_000188.3', 'NM_001322365.2', 'NM_033496.3', 'NM_001322364.2', 'NM_033498.3', 'NM_001358263.1', 'NM_001322367.1', 'NM_001322366.1', 'NM_033500.2'



### Fetching Multiple Sequences from NCBI and Storing in a List

This code demonstrates how to fetch multiple nucleotide sequences from the **NCBI nucleotide database** using a list of accession numbers. The sequences are retrieved in **FASTA format** and parsed using Biopython's `SeqIO.parse()` function. All sequences are appended to a list called `all_genes`, and the total number of sequences fetched is verified using the `len()` function.


In [None]:
ids = ['NM_000188.3', 'NM_001322365.2', 'NM_033496.3', 'NM_001322364.2', 'NM_033498.3', 'NM_001358263.1', 'NM_001322367.1', 'NM_001322366.1', 'NM_033500.2']
Entrez.email= 'epsibashurly@gmail.com'
handle=Entrez.efetch(db='nucleotide', id=ids, rettype='fasta', retmode='text')
# SeqIO.parse() returns an iterator over the records in the handle
gene = SeqIO.parse(handle, 'fasta')
#print(gene)
# Collect all records in the list all_genes
all_genes = [i for i in gene]
# Check that all 9 genes are included in all_genes using the len() function
len(all_genes)

9
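
The list of accession numbers reaches NCBI as one comma-separated string. For longer lists it may be worth splitting the IDs into smaller batches; the sketch below shows one way to do that in plain Python. The helper name `batched_ids` and the batch size of 4 are illustrative assumptions, not NCBI requirements.

```python
def batched_ids(ids, batch_size):
    """Hypothetical helper: yield comma-joined ID strings, batch_size IDs at a time."""
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])


ids = ['NM_000188.3', 'NM_001322365.2', 'NM_033496.3', 'NM_001322364.2',
       'NM_033498.3', 'NM_001358263.1', 'NM_001322367.1', 'NM_001322366.1',
       'NM_033500.2']

batches = list(batched_ids(ids, 4))
for batch in batches:
    # Each string could be passed as the id argument of Entrez.efetch()
    print(batch)
```

With nine IDs and a batch size of 4, this produces three strings holding 4, 4, and 1 accession numbers.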

### Printing the Description and Sequence Length of Each Gene

This code iterates through the list of parsed sequences (`all_genes`) and prints two key details for each gene:
- **i.description**: The description of the gene, which typically includes the accession number and a brief annotation.
- **len(i.seq)**: The length of the nucleotide sequence for each gene.


In [None]:
for i in all_genes:
  print(i.description)
  print(len(i.seq))

NM_000188.3 Homo sapiens hexokinase 1 (HK1), transcript variant 1, mRNA
3602
NM_001322365.2 Homo sapiens hexokinase 1 (HK1), transcript variant 7, mRNA
3995
NM_033496.3 Homo sapiens hexokinase 1 (HK1), transcript variant 2, mRNA
3684
NM_001322364.2 Homo sapiens hexokinase 1 (HK1), transcript variant 6, mRNA
3690
NM_033498.3 Homo sapiens hexokinase 1 (HK1), transcript variant 4, mRNA
3866
NM_001358263.1 Homo sapiens hexokinase 1 (HK1), transcript variant 10, mRNA
3998
NM_001322367.1 Homo sapiens hexokinase 1 (HK1), transcript variant 9, mRNA
3506
NM_001322366.1 Homo sapiens hexokinase 1 (HK1), transcript variant 8, mRNA
3493
NM_033500.2 Homo sapiens hexokinase 1 (HK1), transcript variant 5, mRNA
3979


### Saving Multiple Fetched Sequences to a FASTA File on Google Drive

This code saves the nucleotide sequences from the `all_genes` list to a **FASTA file** in **Google Drive**. Using a loop, each gene is written to the file in FASTA format with the `SeqIO.write()` function. After all sequences are saved, the file is closed.


In [None]:
out=open('/content/drive/MyDrive/example/all_hxks.fasta', 'w')
for i in all_genes:
  SeqIO.write(i,out,'fasta')
out.close()
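
As an alternative to the loop, `SeqIO.write()` also accepts a whole list of records and returns how many it wrote. The sketch below assumes two made-up toy records and writes to an in-memory buffer instead of Google Drive so that it runs anywhere.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two toy records, made up for illustration.
records = [
    SeqRecord(Seq("ATGGCC"), id="toy1", description="first toy record"),
    SeqRecord(Seq("ATGTTT"), id="toy2", description="second toy record"),
]

# StringIO stands in for a file opened with open(path, 'w').
buffer = StringIO()
count = SeqIO.write(records, buffer, "fasta")

print(count)              # number of records written
print(buffer.getvalue())  # the multi-FASTA text
```

Checking the returned count against `len(records)` is a quick way to confirm that every sequence reached the file.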

# Exercise 2

In [None]:
ids = ['KC668274.1','KC668273.1','KC668272.1','KC668271.1', 'KC668270.1','KC668269.1']
Entrez.email= 'epsibashurly@gmail.com'
handle=Entrez.efetch(db='nucleotide', id=ids, rettype='fasta', retmode='text')
# SeqIO.parse() returns an iterator over the records in the handle
gene = SeqIO.parse(handle, 'fasta')
#print(gene)
# Collect all records in the list all_genes
all_genes = [i for i in gene]
# Check that all 6 genes are included in all_genes using the len() function
len(all_genes)

6

## Exercise 2a
Print the description and sequence length of all the genes retrieved from NCBI using their accession numbers.

Save all the sequences in a multi-FASTA file in the specified directory (Google Drive or local storage).

In [None]:
for i in all_genes:
  print(i.description)
  print(len(i.seq))

KC668274.1 Hordeum vulgare haplotype 16 LUX (LUX) gene, complete cds
822
KC668273.1 Hordeum vulgare haplotype 15 LUX (LUX) gene, complete cds
840
KC668272.1 Hordeum vulgare haplotype 14 LUX (LUX) gene, complete cds
825
KC668271.1 Hordeum vulgare haplotype 13 LUX (LUX) gene, complete cds
825
KC668270.1 Hordeum vulgare haplotype 12 LUX (LUX) gene, complete cds
834
KC668269.1 Hordeum vulgare haplotype 11 LUX (LUX) gene, complete cds
834


### Saving Retrieved Sequences to a Multi-FASTA File

This code snippet saves all nucleotide sequences stored in the `all_genes` list to a **multi-FASTA file** named `all_lux.fasta` in **Google Drive**. Each sequence is written to the file using the `SeqIO.write()` function, allowing for easy access and sharing of multiple sequences in a single file format.


In [None]:
out=open('/content/drive/MyDrive/example/all_lux.fasta', 'w')
for i in all_genes:
  SeqIO.write(i,out,'fasta')
out.close()

### Reading and Parsing a Multi-FASTA File

This code uses the `SeqIO.parse()` function from Biopython to read a multi-FASTA file named `all_hxks.fasta`. It retrieves the first record from the file and prints its **ID** and **nucleotide sequence**. The type of the object is also printed to confirm that it is a `SeqRecord`, which holds the sequence itself as a `Seq` object in `j.seq`.

- **`j.id`**: Displays the unique identifier of the sequence.
- **`j.seq`**: Displays the actual nucleotide sequence.
- **`type(j)`**: Shows the type of the record object (`SeqRecord`).


In [None]:
# SeqIO.parse() iterates over the records in the file; next() takes the first one
j = next(SeqIO.parse('/content/drive/MyDrive/example/all_hxks.fasta', 'fasta'))
print(j.id)
print(j.seq)
# To get the length of the sequence:
#print(len(j.seq))
print(type(j))

NM_000188.3
GAGGAGGAGCCGCCGAGCAGCCGCCGGAGGACCACGGCTCGCCAGGGCTGCGGAGGACCGACCGTCCCCACGCCTGCCGCCCCGCGACCCCGACCGCCAGCATGATCGCCGCGCAGCTCCTGGCCTATTACTTCACGGAGCTGAAGGATGACCAGGTCAAAAAGATTGACAAGTATCTCTATGCCATGCGGCTCTCCGATGAAACTCTCATAGATATCATGACTCGCTTCAGGAAGGAGATGAAGAATGGCCTCTCCCGGGATTTTAATCCAACAGCCACAGTCAAGATGTTGCCAACATTCGTAAGGTCCATTCCTGATGGCTCTGAAAAGGGAGATTTCATTGCCCTGGATCTTGGTGGGTCTTCCTTTCGAATTCTGCGGGTGCAAGTGAATCATGAGAAAAACCAGAATGTTCACATGGAGTCCGAGGTTTATGACACCCCAGAGAACATCGTGCACGGCAGTGGAAGCCAGCTTTTTGATCATGTTGCTGAGTGCCTGGGAGATTTCATGGAGAAAAGGAAGATCAAGGACAAGAAGTTACCTGTGGGATTCACGTTTTCTTTTCCTTGCCAACAATCCAAAATAGATGAGGCCATCCTGATCACCTGGACAAAGCGATTTAAAGCGAGCGGAGTGGAAGGAGCAGATGTGGTCAAACTGCTTAACAAAGCCATCAAAAAGCGAGGGGACTATGATGCCAACATCGTAGCTGTGGTGAATGACACAGTGGGCACCATGATGACCTGTGGCTATGACGACCAGCACTGTGAAGTCGGCCTGATCATCGGCACTGGCACCAATGCTTGCTACATGGAGGAACTGAGGCACATTGATCTGGTGGAAGGAGACGAGGGGAGGATGTGTATCAATACAGAATGGGGAGCCTTTGGAGACGATGGATCATTAGAAGACATCCGGACAGAGTTTGACAGGGAGATAGACCGGGGATCCCTCAACCCTGGAAAACAGCTGTTTGAGAAGAT

### Transcribing and Translating a Nucleotide Sequence

This code demonstrates how to use Biopython's **Seq** and `SeqRecord` methods to transcribe and translate a nucleotide sequence. The operations performed are as follows:

- **`print(j)`**: Displays a summary of the record (ID, name, description, and a shortened view of the sequence).
- **`j.seq.transcribe()`**: Converts the DNA sequence to its corresponding **mRNA** sequence by replacing T with U.
- **`j.translate(to_stop=True)`**: Translates the sequence in the first reading frame into a protein sequence, stopping at the first stop codon.
- **`j[1:].translate(to_stop=True)`**: Translates the sequence starting from the second nucleotide (the second reading frame), producing a different protein sequence.
- **`j[2:].translate(to_stop=True)`**: Translates the sequence starting from the third nucleotide (the third reading frame), resulting in yet another protein sequence.

These methods showcase how to manipulate and analyze genetic sequences using Biopython.


In [None]:
print(j)
# transcribe() returns the mRNA sequence (T replaced by U)
print(j.seq.transcribe())
# translate() converts the nucleotide sequence to protein;
# slicing with [1:] and [2:] shifts the reading frame by one and two nucleotides
print(j.translate(to_stop=True))
print(j[1:].translate(to_stop=True))
print(j[2:].translate(to_stop=True))

ID: NM_000188.3
Name: NM_000188.3
Description: NM_000188.3 Homo sapiens hexokinase 1 (HK1), transcript variant 1, mRNA
Number of features: 0
Seq('GAGGAGGAGCCGCCGAGCAGCCGCCGGAGGACCACGGCTCGCCAGGGCTGCGGA...AAA')
GAGGAGGAGCCGCCGAGCAGCCGCCGGAGGACCACGGCUCGCCAGGGCUGCGGAGGACCGACCGUCCCCACGCCUGCCGCCCCGCGACCCCGACCGCCAGCAUGAUCGCCGCGCAGCUCCUGGCCUAUUACUUCACGGAGCUGAAGGAUGACCAGGUCAAAAAGAUUGACAAGUAUCUCUAUGCCAUGCGGCUCUCCGAUGAAACUCUCAUAGAUAUCAUGACUCGCUUCAGGAAGGAGAUGAAGAAUGGCCUCUCCCGGGAUUUUAAUCCAACAGCCACAGUCAAGAUGUUGCCAACAUUCGUAAGGUCCAUUCCUGAUGGCUCUGAAAAGGGAGAUUUCAUUGCCCUGGAUCUUGGUGGGUCUUCCUUUCGAAUUCUGCGGGUGCAAGUGAAUCAUGAGAAAAACCAGAAUGUUCACAUGGAGUCCGAGGUUUAUGACACCCCAGAGAACAUCGUGCACGGCAGUGGAAGCCAGCUUUUUGAUCAUGUUGCUGAGUGCCUGGGAGAUUUCAUGGAGAAAAGGAAGAUCAAGGACAAGAAGUUACCUGUGGGAUUCACGUUUUCUUUUCCUUGCCAACAAUCCAAAAUAGAUGAGGCCAUCCUGAUCACCUGGACAAAGCGAUUUAAAGCGAGCGGAGUGGAAGGAGCAGAUGUGGUCAAACUGCUUAACAAAGCCAUCAAAAAGCGAGGGGACUAUGAUGCCAACAUCGUAGCUGUGGUGAAUGACACAGUGGGCACCAUGAUGACCUGUGGCUAUGACGACCAGCACUGUGAAGUCGGCCUGAUCAUC
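
To see why shifting the start by one or two nucleotides gives a different protein, it helps to look at how the sequence is cut into codons in each reading frame. The sketch below uses plain Python and a made-up toy sequence; the helper name `codons` is hypothetical and is not part of Biopython.

```python
def codons(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


dna = "ATGGCCATTGTAATG"  # toy sequence, made up for illustration

# Transcription of the coding strand amounts to replacing T with U.
mrna = dna.replace("T", "U")
print(mrna)

# The same nucleotides group into different codons in each frame.
for frame in range(3):
    print(frame, codons(dna, frame))
```

In this toy sequence the second frame (offset 1) ends with the stop codon `TAA`, while the first frame does not contain a stop codon at all, which illustrates how `to_stop=True` can cut translations at different points depending on the frame.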



# Practice Exercise: Gene Analysis with Accession Number AB000824.1
1) Download the Gene Sequence:
Use Biopython to download the gene sequence associated with the accession number AB000824.1 and save it in FASTA format.

2) Convert to mRNA:
Transcribe the downloaded nucleotide sequence into its corresponding mRNA sequence.

3) Translate the Gene to Protein Sequence:
Translate the mRNA sequence into a protein sequence. Vary the open reading frame (ORF) to analyze different protein variants.

4) Design Primers:
Design forward and reverse primers that are 30 nucleotides long for this gene sequence.

5) Retrieve Protein ID:
From the GenBank file associated with the gene, extract the protein ID and download the corresponding protein sequence.

6) Extract Catalytic Domain:
Use InterProScan to identify and extract the catalytic domain of the protein.

7) Run BLAST on the Domain Sequence:
Perform a BLAST analysis on the extracted catalytic domain sequence to find similar sequences.

8) Identify Homologous Sequences:
Retrieve 5 homologous sequences from different organisms based on the BLAST results.

9) Change Accession IDs to Organism Names:
Replace the accession IDs of the retrieved sequences with the common names of the respective organisms.

10) Perform Multiple Sequence Alignment (MSA):
Use Clustal Omega to perform a multiple sequence alignment of the homologous sequences. Interpret the results to identify closely related organisms.
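
As a starting hint for step 4, one naive approach is to take the first nucleotides of the gene as the forward primer and the reverse complement of the last nucleotides as the reverse primer. This is only a sketch: `reverse_complement` and `naive_primers` are hypothetical helper names, the toy sequence is made up, and real primer design would also weigh melting temperature, GC content, and secondary structure.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def naive_primers(seq, length=30):
    """Hypothetical helper: primers taken from the two ends of seq."""
    forward = seq[:length]                       # same strand, 5' end
    reverse = reverse_complement(seq[-length:])  # opposite strand, 3' end
    return forward, reverse


toy_gene = "ATGGCCATTGTAATGGGCCGCTGA"  # made up, shorter than a real gene
forward, reverse = naive_primers(toy_gene, length=6)
print(forward)
print(reverse)
```

For the exercise, the same function could be called with `str(gene.seq)` and the default length of 30.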