In [1]:
pip install biopython

Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.84


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Playing with sequences: useful advanced exercises

1. From the given multifasta file, create a new multifasta file with the first n sequences.

The file "all_pyr_dehy.faa" likely contains multiple FASTA sequences, each representing a protein sequence from different organisms. A FASTA file can store multiple sequences, each identified by a header (starting with >), followed by the actual sequence in the next lines.

In [6]:
from Bio import SeqIO
pyr_deh= SeqIO.parse("/content/drive/MyDrive/Biopython_Garden_City/Module_3/all_pyr_dehy.faa",'fasta')
all_proteins= [i for i in pyr_deh]
len(all_proteins)

879

Imagine a water tap (pyr_deh): each time you open it, you get one glass of water (one sequence). To keep all the water (all sequences), you pour every glass into a bucket (all_proteins).

This matters because the tap runs dry after one pass: SeqIO.parse returns an iterator, which can be read only once. Without the bucket, you would have to go back and call SeqIO.parse on the file again every time you needed the sequences.
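A Python iterator can be consumed only once, which is why the records are copied into a list. The sketch below illustrates this with a plain generator standing in for `SeqIO.parse`; the function `fake_parse` and its three record names are made up for illustration and are not part of Biopython.

```python
def fake_parse():
    """Hypothetical stand-in for SeqIO.parse: yields one record name at a time."""
    for name in ["seq_A", "seq_B", "seq_C"]:
        yield name

tap = fake_parse()
first_pass = list(tap)    # consumes the iterator
second_pass = list(tap)   # nothing is left to read

print(first_pass)   # ['seq_A', 'seq_B', 'seq_C']
print(second_pass)  # []
```

A list, by contrast, can be looped over as many times as needed.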

### Saving the First 30 Protein Sequences to a New FASTA File

This Python code extracts the first 30 protein sequences from a list (`all_proteins`) and writes them to a new FASTA file:

1. **Set the Total Number of Sequences to Save**:
   ```python
   n = 30
   ```
2. **Open a New File to Write Sequences**:
   Opens a new file, `first30_pyr_dehy.faa`, in write mode (`'w'`) to store the first 30 sequences.

3. **Write Sequences to the New File**:
   Loops through the list `all_proteins` and writes each record to the `out` file using `SeqIO.write()`. A counter (`count`) is increased after each write, and the loop stops once 30 sequences have been written (`count == n`).

In [7]:
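n = 30      # number of sequences to keep
count = 0   # number of sequences written so far
# Open a new file to write the sequences; the with-block closes it automatically
with open("/content/drive/MyDrive/Biopython_Garden_City/Module_3/first30_pyr_dehy.faa", 'w') as out:
    for i in all_proteins:            # write sequences to the new file
        SeqIO.write(i, out, 'fasta')
        count = count + 1
        if count == n:                # stop after n sequences
            break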
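As an alternative to counting by hand, `itertools.islice` takes the first n items from any iterator and then stops. The sketch below shows the idea on a made-up, endless generator (`numbered_records` is hypothetical and only stands in for a record iterator).

```python
from itertools import islice

def numbered_records():
    """Hypothetical stand-in for a record iterator such as SeqIO.parse(...)."""
    i = 0
    while True:
        i += 1
        yield f"record_{i}"

n = 3
first_n = list(islice(numbered_records(), n))  # reads only n items, then stops
print(first_n)  # ['record_1', 'record_2', 'record_3']
```

Since `SeqIO.write` also accepts an iterable of records, the same idea should carry over to the FASTA file by passing `islice(SeqIO.parse(path, 'fasta'), n)` as its first argument, which avoids loading all records into memory first.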

The next cell reads first30_pyr_dehy.faa back in. After it runs, the list all_proteins contains only the first 30 protein sequences from the original FASTA file, as the length check confirms.

In [9]:
pyr_deh= SeqIO.parse("/content/drive/MyDrive/Biopython_Garden_City/Module_3/first30_pyr_dehy.faa",'fasta')
all_proteins= [i for i in pyr_deh] #Storing All Sequences in a List
len(all_proteins)

30

To extract the IDs of the proteins from the FASTA file (first30_pyr_dehy.faa), you can access the .id attribute of each sequence object in the list all_proteins. Here’s how you can do that:

In [10]:

# Parse the FASTA file
pyr_deh = SeqIO.parse("/content/drive/MyDrive/Biopython_Garden_City/Module_3/first30_pyr_dehy.faa", 'fasta')

# Store sequences in a list
all_proteins = [i for i in pyr_deh]

# Extract IDs
protein_ids = [protein.id for protein in all_proteins]

# Print the list of IDs
print(protein_ids)


['Albugo_laibachii', 'Monosiga_brevicollis', 'Salpingoeca_rosetta', 'Pythium_vexans', 'Pythium_iwayamai', 'Phytophthora_kernoviae', 'Hyaloperonospora_arabidopsidis', 'Rattus_norvegicus', 'Gallus_gallus', 'Xenopus_tropicalis', 'Bos_taurus', 'Pan_troglodytes', 'Pongo_abelii', 'Zea_mays', 'Homo_sapiens', 'Macaca_mulatta', 'Solanum_tuberosum', 'Callorhinchus_milii', 'Brassica_rapa', 'Solanum_lycopersicum', 'Saccharomyces_cerevisiae_S288C', 'Mus_musculus', 'Arabidopsis_thaliana', 'Drosophila_melanogaster', 'Schizosaccharomyces_pombe', 'Caenorhabditis_elegans', 'Eremothecium_gossypii_ATCC_10895', 'Achlya_hypogyna', 'Hartmannibacter_diazotrophicus', 'Silicimonas_algicola']


2. Create a new multifasta file by randomly subsetting 20 sequences from a large dataset.

In [11]:
import random  # used to randomly select 20 protein IDs from the larger FASTA file (all_pyr_dehy.faa)
from Bio import SeqIO

all_ids = []  # will store every protein ID (header) found in the FASTA file
for seq in SeqIO.parse("/content/drive/MyDrive/Biopython_Garden_City/Module_3/all_pyr_dehy.faa", 'fasta'):
    all_ids.append(seq.id)

# random.sample() selects 20 IDs from all_ids at random, without replacement.
# The number 20 can be changed as needed.
random_20_ids = random.sample(all_ids, 20)
print(random_20_ids)

['Aspergillus_niger_CBS_513.88', 'Anopheles_stephensi', 'Buceros_rhinoceros_silvestris', 'Monosiga_brevicollis', 'Micropterus_salmoides', 'Crassostrea_virginica', 'Strigops_habroptila', 'Manduca_sexta', 'Gavialis_gangeticus', 'Trypanosoma_grayi', 'Ornithorhynchus_anatinus', 'Pelecanus_crispus', 'Arthroderma_uncinatum', 'Rhinopithecus_roxellana', 'Theileria_annulata', 'Acanthamoeba_castellanii_str_Neff', 'Hartmannibacter_diazotrophicus', 'Kluyveromyces_marxianus_DMKU3-1042', 'Falco_cherrug', 'Oncorhynchus_tshawytscha']


In [12]:
with open("/content/drive/MyDrive/Biopython_Garden_City/Module_3/random20_pyr_dehy.faa", 'w') as out:
    for i in SeqIO.parse("/content/drive/MyDrive/Biopython_Garden_City/Module_3/all_pyr_dehy.faa", 'fasta'):
        # Check whether this sequence's ID (i.id) is one of the randomly selected IDs
        if i.id in random_20_ids:
            # If so, write the sequence to the output file (random20_pyr_dehy.faa) in FASTA format
            SeqIO.write(i, out, 'fasta')
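Two small refinements may be useful here: seeding the random generator makes the draw repeatable, and storing the chosen IDs in a set makes the membership test faster on large files. The following is a minimal sketch with made-up IDs (`organism_0` to `organism_99`) and an arbitrary seed of 42; neither comes from the original data.

```python
import random

# Hypothetical IDs standing in for the headers of a large FASTA file
all_ids = [f"organism_{i}" for i in range(100)]

rng = random.Random(42)                 # fixed seed so the draw is repeatable
chosen = set(rng.sample(all_ids, 20))   # a set gives fast membership tests

# Keep the original file order, as the loop over the FASTA file does
subset = [seq_id for seq_id in all_ids if seq_id in chosen]

print(len(subset))  # 20
```

Running the sketch again with the same seed selects the same 20 IDs, which helps when a result needs to be reproduced.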

3. (i) A text file with only the names of Fungi is provided. Read the file and extract only Fungal pyruvate dehydrogenase from the main file.

The strip() method is called on each line read from the file (the loop variable i in the code below).
What strip() does: it removes whitespace, including spaces and the newline character, from the beginning and the end of the line. This is important because every line read from a text file ends with a newline, and that extra character would prevent the name from matching a sequence ID.
For example, if a line in the file contains " Aspergillus " (with spaces), calling strip() converts it to "Aspergillus".
After cleaning the line with strip(), append() adds the cleaned name to the fungi list.
By the end of the loop, the fungi list contains only the names of the fungi, without any leading or trailing whitespace.
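The effect of strip() can be seen in a short sketch. The three lines below are made-up examples of what reading a names file might return, not the actual contents of fungi_names.txt.

```python
# Hypothetical lines as they might be read from a names file
raw_lines = ["Aspergillus_niger\n", "  Candida_albicans \n", "Neurospora_crassa"]

names = [line.strip() for line in raw_lines]
print(names)  # ['Aspergillus_niger', 'Candida_albicans', 'Neurospora_crassa']

# Without strip(), the trailing newline makes the comparison fail
print("Aspergillus_niger" in raw_lines)  # False
print("Aspergillus_niger" in names)      # True
```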

In [15]:
# Read the file and create a list of fungal names
fungi = []
for i in open("/content/drive/MyDrive/Biopython_Garden_City/fungi_names.txt"):
    fungi.append(i.strip())  # strip() removes the newline and any surrounding spaces
print(len(fungi))

out = open("/content/drive/MyDrive/Biopython_Garden_City/Module_3/fungal_pyr_dehy.faa", 'w')
# SeqIO.parse() reads through the main FASTA file (all_pyr_dehy.faa)
for i in SeqIO.parse("/content/drive/MyDrive/Biopython_Garden_City/Module_3/all_pyr_dehy.faa", 'fasta'):
    if i.id in fungi:  # keep the sequence only if its ID (i.id) is in the fungi list
        SeqIO.write(i, out, 'fasta')
out.close()

193


**Exercise** 3. (ii) Remove all the fungal pyruvate dehydrogenase sequences from the main file and create a new multifasta file.


In [42]:
fungi = []
for i in open("/content/drive/MyDrive/Biopython_Garden_City/fungi_names.txt"):
    fungi.append(i.strip())


# Open (or create) a new file called no_fungi.faa in write mode; it will store the non-fungal pyruvate dehydrogenase sequences.
out = open("/content/drive/MyDrive/Biopython_Garden_City/Module_3/no_fungi.faa", 'w')
# Read the main file (all_pyr_dehy.faa), not the fungal subset, so that non-fungal records are available to keep
for i in SeqIO.parse("/content/drive/MyDrive/Biopython_Garden_City/Module_3/all_pyr_dehy.faa", 'fasta'):
    if i.id not in fungi:  # keep the sequence only if its ID is NOT in the fungi list
        SeqIO.write(i, out, 'fasta')
out.close()
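Exercises 3 (i) and 3 (ii) are complements of each other, so every record should end up in exactly one of the two output files. The sketch below shows that split in a single pass, using made-up IDs and a made-up set of fungal names rather than the real files.

```python
# Hypothetical data: record IDs from a main file and a set of fungal names
record_ids = ["Homo_sapiens", "Aspergillus_niger", "Zea_mays", "Candida_albicans"]
fungi_set = {"Aspergillus_niger", "Candida_albicans"}

fungal, non_fungal = [], []
for seq_id in record_ids:
    # Route each record to exactly one of the two groups
    (fungal if seq_id in fungi_set else non_fungal).append(seq_id)

print(fungal)      # ['Aspergillus_niger', 'Candida_albicans']
print(non_fungal)  # ['Homo_sapiens', 'Zea_mays']
```

A quick sanity check for the real files follows from this: the number of records in the fungal file plus the number in the non-fungal file should equal the number of records in the main file.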



**Exercise 4** - Remove special characters like '.', '=', '+', '#', '(', ')' and '_' from the sequence headers.

**Exercise** - A multifasta file (proteins_with_acc_id.faa) contains protein accession IDs as the headers. A text file (org_headers.txt) contains the corresponding organism names. Using the text file, change the headers of the multifasta file accordingly. (Cells In [19] and In [21] address Exercise 4; this exercise is worked on from In [44] onward.)

In [19]:
# Replace the special characters with an empty string (that is, delete them).

for i in SeqIO.parse("/content/drive/MyDrive/Biopython_Garden_City/Module_3/all_pyr_dehy.faa", 'fasta'):
    j = i.id
    for ch in "._=+#()":  # the characters listed in the exercise
        j = j.replace(ch, "")
    print(j)


Albugolaibachii
Monosigabrevicollis
Salpingoecarosetta
Pythiumvexans
Pythiumiwayamai
Phytophthorakernoviae
Hyaloperonosporaarabidopsidis
Rattusnorvegicus
Gallusgallus
Xenopustropicalis
Bostaurus
Pantroglodytes
Pongoabelii
Zeamays
Homosapiens
Macacamulatta
Solanumtuberosum
Callorhinchusmilii
Brassicarapa
Solanumlycopersicum
SaccharomycescerevisiaeS288C
Musmusculus
Arabidopsisthaliana
Drosophilamelanogaster
Schizosaccharomycespombe
Caenorhabditiselegans
EremotheciumgossypiiATCC10895
Achlyahypogyna
Hartmannibacterdiazotrophicus
Silicimonasalgicola
SolimonasspK1W22B-7
BoseaspTri-49
PleomorphomonasspSM30
RhodanobacteraceaebacteriumDysh456
RheinheimeraspLHK132
SedimentitaleaspW43
LeisingeraspNJS204
SphingosinicellaspBN140058
Sorangiumcellulosum
ThalassococcusspS3
Litorilituussediminis
TetrahymenathermophilaSB210
AspergillusterreusNIH2624
ChaetomiumglobosumCBS14851
CoccidioidesimmitisRS
AspergillusfischeriNRRL181
AspergillusclavatusNRRL1
Plasmodiumfalciparum3D7
Drosophilapseudoobscura
Scheffe

In [21]:
from Bio import SeqIO

# Write every record to a new file with the special characters removed from its ID
with open("/content/drive/MyDrive/Biopython_Garden_City/Module_3/without_special_character", 'w') as out:
    # Read from the main file, as in the previous cell
    for i in SeqIO.parse("/content/drive/MyDrive/Biopython_Garden_City/Module_3/all_pyr_dehy.faa", 'fasta'):
        for ch in "._=+#()":  # the characters listed in the exercise
            i.id = i.id.replace(ch, "")
        i.description = ""  # otherwise the old header text is written after the new ID
        SeqIO.write(i, out, 'fasta')
        print(i.id)
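Chained replace() calls grow long as the list of characters grows. One alternative is `str.translate`, which deletes several characters in a single pass. The sketch below uses the characters named in Exercise 4 and three made-up headers for illustration.

```python
# Characters named in Exercise 4
special = "._=+#()"
remove_table = str.maketrans("", "", special)  # third argument: characters to delete

# Hypothetical headers for illustration
headers = ["Aspergillus_niger_CBS_513.88", "Homo_sapiens", "strain(A)+B#1=x"]
cleaned = [h.translate(remove_table) for h in headers]
print(cleaned)  # ['AspergillusnigerCBS51388', 'Homosapiens', 'strainAB1x']
```

Adding another character to remove then only requires extending the `special` string.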

In [44]:
org_name = {}
with open("/content/drive/MyDrive/Biopython_Garden_City/org_headers.txt", "r") as org_file:
    for line in org_file:
        parts = line.strip().split()
        if len(parts) >= 2:
            accession_id = parts[0]
            organism_name = " ".join(parts[1:])
            org_name[accession_id] = organism_name
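To make the parsing step concrete, the sketch below applies the same split-and-join logic to three made-up lines. The accession numbers and species names are invented, and the "accession followed by description" layout is an assumption about org_headers.txt based on the code above.

```python
# Hypothetical lines in an "accession description" layout
lines = [
    "XP_000001.1 trehalase (Example species) [Example_species]\n",
    "\n",  # blank line: fewer than two parts, so it is skipped
    "XP_000002.1 trehalase-like (Other species) [Other_species]\n",
]

demo_map = {}
for line in lines:
    parts = line.strip().split()
    if len(parts) >= 2:
        # First field is the key; the rest is rejoined as the value
        demo_map[parts[0]] = " ".join(parts[1:])

print(len(demo_map))  # 2
print(demo_map["XP_000001.1"])  # trehalase (Example species) [Example_species]
print(demo_map.get("XP_999999.1", "Unknown_Organism"))  # Unknown_Organism
```

Using `.get()` with a default avoids a KeyError when an accession ID in the FASTA file has no entry in the text file.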

In [55]:

# Open the output file for writing; the with-block closes it when the loop ends
with open("/content/drive/MyDrive/Biopython_Garden_City/protein_seqs_with_org_names.faa", 'w') as out:
    # Parse the input FASTA file and replace accession IDs with organism names
    for seq_record in SeqIO.parse("/content/drive/MyDrive/Biopython_Garden_City/proteins_with_acc_id.faa", 'fasta'):
        # Use the accession ID to get the organism name, defaulting to "Unknown_Organism" if not found
        org_id = org_name.get(seq_record.id, "Unknown_Organism")

        # Print the accession ID and its corresponding organism name
        print(f"Accession ID: {seq_record.id} ===> Organism Name: {org_id}")

        # Replace the header and write the record to the new file
        seq_record.id = org_id
        seq_record.description = ""  # otherwise the old header text is written after the new ID
        SeqIO.write(seq_record, out, 'fasta')


Accession ID: XP_027342331.1 ===> Organism Name: trehalase (Abrus precatorius) [Abrus_precatorius]
Accession ID: XP_022108358.1 ===> Organism Name: trehalase-like (Acanthaster planci) [Acanthaster_planci]
Accession ID: XP_022058461.1 ===> Organism Name: trehalase (Acanthochromis polyacanthus) [Acanthochromis_polyacanthus]
Accession ID: XP_036930371.1 ===> Organism Name: trehalase (Acanthopagrus latus) [Acanthopagrus_latus]
Accession ID: XP_025377972.1 ===> Organism Name: trehalase-domain-containing protein (Acaromyces ingoldii) [Acaromyces_ingoldii]
Accession ID: OQR89487.1 ===> Organism Name: trehalase (Achlya hypogyna) [Achlya_hypogyna]
Accession ID: XP_026892184.1 ===> Organism Name: trehalase (Acinonyx jubatus) [Acinonyx_jubatus]
Accession ID: XP_034768593.1 ===> Organism Name: trehalase (Acipenser ruthenus) [Acipenser_ruthenus]
Accession ID: XP_011067527.1 ===> Organism Name: PREDICTED: trehalase isoform X2 (Acromyrmex echinatior) [Acromyrmex_echinatior]
Accession ID: XP_015767574