# Google Colab

Google Colab is a coding environment that lets you write and execute Python, R, Bash, etc. in your browser without any configuration. It provides free GPU resources, and notebooks are easy to share.

This document is called a notebook: a place where you write and execute code. Cells that hold notes, like this one, are **markdown** cells, while the ones where you write and run code are **code** cells.

In [None]:
# connecting your drive to colab

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!wget https://ftp.ebi.ac.uk/pub/training/2024/Genome_bioinformatics_2024/01_Introduction_BASH/IntroductionBASH_ExercisesMaterial_GenomeBx2024.zip

Files fetched with wget or other download methods go to `/content` by default, a directory on the session's virtual machine. It is temporary storage, so anything you want to keep must be copied to your Drive, which has to be mounted again in every session.

To change the working directory, and with it the default download location (not recommended), use the following:
```
%cd /content/drive/MyDrive/my_colab_workspace
```

To direct wget to a specific folder, use `-P folder_path`.
Check disk space with `!df -h /` and RAM with `!free -h`.
A session is typically disconnected after about 90 minutes of inactivity or about 12 hours of continuous running.
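
Because `/content` is temporary, results worth keeping need to be copied somewhere persistent before the session ends. The following is a minimal sketch of that step, not an official Colab utility: the helper names `free_space_gb` and `save_to_persistent` are assumptions made for this example, and the commented-out call assumes Drive is already mounted at `/content/drive`.

```python
import shutil
from pathlib import Path

def free_space_gb(path="/"):
    """Return the free disk space at `path` in gigabytes (roughly what `df -h` shows)."""
    return shutil.disk_usage(path).free / 1e9

def save_to_persistent(src, dest_dir):
    """Copy `src` into `dest_dir`, creating the folder if needed, and return the new path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest_dir))

print(f"Free disk space: {free_space_gb():.1f} GB")

# Example call once Drive is mounted (the file must already exist in /content):
# save_to_persistent("/content/bioinformatics_data.zip",
#                    "/content/drive/MyDrive/my_colab_workspace")
```

Copying explicitly keeps downloads in `/content`, which is fast, and moves only the files you actually need into Drive.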

In [None]:
!ls /content/

## Intro Task: FASTA File Analysis
**Assignment Overview**  

In this first task, you'll get hands-on experience working with a **FASTA file** using both **Bash** and **Python**. This is a crucial skill for processing genomic data efficiently.  

By the end of this assignment, you will:  
✅ Count the number of sequences in 2 FASTA files (one with Bash & the other with Python)  
✅ Determine the lengths of each sequence  
✅ Identify the shortest and longest sequence (**Python** only)  
✅ Translate a DNA sequence to protein (**Python**)  

---

### Getting the Data
To start, download and extract the FASTA files:  

🔹 **If using Google Colab**  
Run the following code to download the data:  
```bash
!wget -O bioinformatics_data.zip "https://drive.google.com/uc?export=download&id=1KPGqMmiBqSWEhjUFNYxgOqtCkyNwA1aU"
!unzip bioinformatics_data.zip -d /content/bioinformatics_data
```
This will create a folder, `/content/bioinformatics_data/`, containing your FASTA files. Use `ls` to check the file names.  

🔹 **If running locally on Jupyter Notebook**  
1. Download the ZIP file from **[Google Drive link]**  
2. Extract it into your working directory  
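
For the local route, the extraction step can also be done from Python. The sketch below uses the standard-library `zipfile` module; the function name `extract_archive` is an assumption for this example, and the archive and folder names simply mirror the Colab commands above.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, target_dir):
    """Extract every member of `zip_path` into `target_dir` and return the member names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())

# is_zipfile() is False for a missing file and for a download that is not a real ZIP
# (for example an HTML page saved under a .zip name), so the call is skipped in both cases.
if zipfile.is_zipfile("bioinformatics_data.zip"):
    print(extract_archive("bioinformatics_data.zip", "bioinformatics_data"))
```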


In [None]:
# download the data archive

!wget -O bioinformatics_data.zip 'https://drive.google.com/uc?export=download&id=1KPGqMmiBqSWEhjUFNYxgOqtCkyNwA1aU'

--2025-05-16 17:08:14--  https://drive.google.com/uc?export=download&id=1KPGqMmiBqSWEhjUFNYxgOqtCkyNwA1aU
Resolving drive.google.com (drive.google.com)... 74.125.199.113, 74.125.199.102, 74.125.199.139, ...
Connecting to drive.google.com (drive.google.com)|74.125.199.113|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1KPGqMmiBqSWEhjUFNYxgOqtCkyNwA1aU&export=download [following]
--2025-05-16 17:08:14--  https://drive.usercontent.google.com/download?id=1KPGqMmiBqSWEhjUFNYxgOqtCkyNwA1aU&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 173.194.203.132, 2607:f8b0:400e:c05::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|173.194.203.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14193 (14K) [application/octet-stream]
Saving to: ‘bioinformatics_data.zip’


2025-05-16 17:08:16 (83.0 MB/s) - ‘bioinformatics_data.

In [None]:
!unzip bioinformatics_data.zip -d /content/bioinformatics_data

Archive:  bioinformatics_data.zip
  inflating: /content/bioinformatics_data/mystery1.fa  
  inflating: /content/bioinformatics_data/mystery2.fa  


## **Task 1: Count the Number of Sequences**
Each sequence in a FASTA file starts with a **header line (`>` identifier)**. Count the number of sequences in file 1 with bash, and file 2 with Python.
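
To make the format concrete, the sketch below counts headers in a small in-memory FASTA string. The two records are invented for illustration and are not taken from the course files; `count_records` is a hypothetical helper name.

```python
import io

# Two made-up records: each starts with a '>' header, followed by one or more sequence lines
example_fasta = """>seq1 first toy record
ACGTACGT
ACGT
>seq2 second toy record
TTGACA
"""

def count_records(handle):
    """Count FASTA records by counting the lines that begin with '>'."""
    return sum(1 for line in handle if line.startswith(">"))

print(count_records(io.StringIO(example_fasta)))  # prints 2
```

Note that the count depends only on header lines, so it is unaffected by how many lines each sequence is wrapped over.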

In [None]:
# list the content of bioinformatics_data directory
!ls /content/bioinformatics_data

# count the sequence headers, which start with the ">" symbol
!grep '^>' /content/bioinformatics_data/mystery1.fa | wc -l

mystery1.fa  mystery2.fa
27


In [None]:
# define the path to the mystery2.fa file
file_path = '/content/bioinformatics_data/mystery2.fa'

# initialise the sequence count
sequence_count = 0

# open the file in read-only mode
with open(file_path, 'r') as fasta_file:
    for line in fasta_file:
        # every header line starts with '>', so each match is one sequence
        if line.startswith('>'):
            sequence_count += 1

print(f'The mystery2.fa file contains {sequence_count} sequences.')

The mystery2.fa file contains 35 sequences.


## **Task 2: Find the Length of Each Sequence**
For each sequence, compute its length **excluding the header line**.  

In [None]:
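# Using Bash: remove header lines, merge multi-line sequences, then compute each sequence length
file_path = '/content/bioinformatics_data/mystery1.fa'

# The path is written out in full here: the braces in the awk program appear to stop
# IPython from substituting Python variables into the command.
# RS="" makes awk treat each blank-line-separated block as one record (one sequence).
!sed '/^>/d' /content/bioinformatics_data/mystery1.fa | awk 'BEGIN {RS=""} {gsub(/\n/, "", $0); print length($0)}'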


In [None]:
!sed '/^>/d' $file_path

MESGGRPSLCQFILLGTTSVVTAALYSVYRQKARVSQELKGAKKVHLGEDLKSILSEAPGKCVPYAVIEG
AVRSVKETLNSQFVENCKGVIQRLTLQEHKMVWNRTTHLWNDCSKIIHQRTNTVPFDLVPHEDGVDVAVR
VLKPLDSVDLGLETVYEKFHPSIQSFTDVIGHYISGERPKGIQETEEMLKVGATLTGVGELVLDNNSVRL
QPPKQGMQYYLSSQDFDSLLQRQESSVRLWKVLALVFGFATCATLFFILRKQYLQRQERLRLKQMQEEFQ
EHEAQLLSRAKPEDRESLKSACVVCLSSFKSCVFLECGHVCSCTECYRALPEPKKCPICRQAITRVIPLY
NS

MSTRKRRGGAINSRQAQKRTREATSTPEISLEAEPIELVETAGDEIVDLTCESLEPVVVDLTHNDSVVIV
DGPQVLSVVPSAWTDTQRSCRMDVSSFPQNAAMSSVASASVIP

MSTRKRRGGAINSRQAQKRTREATSTPEISLEAEPIELVETAGDEIVDLTCESLEPVVVDLTHNDSVVIV
DERRRPRRNARRLPQDHADSCVVSSDDEELSRDRDVYVTTHTPRNARDEGATGLRPSGTVSCPICMDGYS
EIVQNGRLIVSTECGHVFCSQCLRDSLKNANTCPTCRKKINHKRYHPIYI

MSTRKRRGGAINSRQAQKRTREATSTPEISLEAEPIELVETAGDEIVDLTCESLEPVVVDLTHNDSVVIV
DERRRPRRNARRLPQDHADSCVVSSDDEELSRDRDVYVTTHTPRNARDEGATGLRPSGTVSCPICMDGYS
EIVQNGRLIVSTECGHVFCSQCLRDSLKNANTCPTCRKKINHKRYHPIYI

MAAELVEAKNMVMSFRVSDLQMLLGFVGRSKSGLKHELVTRALQLVQFDCSPELFKKIKELYETRYAKKN
SEPAPQPHRPLDPLTMHSTYDRAGAVPRTPLAGPNIDYPVLYGKYLNGLGRLPAKTLKPEVRLVKL

In [None]:
!sed '/^>/d' $file_path | head

In [None]:
!sed '/^>/d' /content/bioinformatics_data/mystery1.fa | awk 'BEGIN {RS=""} {gsub(/\n/, "", $0); print length($0)}'

192
144
177
167
166
193
194
165
297
240
236
160
206
176
279
294
283
257
104
184
259
270
103
236
225
140
119


In [None]:
# file path
file_path = '/content/bioinformatics_data/mystery1.fa'

# function that prints the length of every sequence in a FASTA file
def count_sequence_lengths(file_path):
    # dictionary mapping each sequence ID (header text) to its sequence
    sequences = {}
    current_seq_id = ""

    # open the file in read-only mode
    with open(file_path, 'r') as fasta_file:
        for line in fasta_file:
            # strip the trailing newline and any surrounding whitespace
            line = line.strip()
            # a line starting with '>' is a header: store its text (minus '>') as the ID
            if line.startswith('>'):
                current_seq_id = line[1:]
                sequences[current_seq_id] = ''
            elif current_seq_id:
                # append this sequence line to the record under the current header
                sequences[current_seq_id] += line

    print("Sequence ID\tLength")
    for seq_id, seq in sequences.items():
        # len() counts the characters (residues) in the joined sequence
        print(f"{seq_id}\t{len(seq)}")

count_sequence_lengths(file_path)


Sequence ID	Length
Q9H9L7 AKIR1_HUMAN Akirin-1 OS=Homo sapiens GN=AKIRIN1 PE=1 SV=1	192
P41223 BUD31_HUMAN Protein BUD31 homolog OS=Homo sapiens GN=BUD31 PE=1 SV=2	144
Q13352 CENPR_HUMAN Centromere protein R OS=Homo sapiens GN=ITGB3BP PE=1 SV=2	177
Q9UFW8 CGBP1_HUMAN CGG triplet repeat-binding protein 1 OS=Homo sapiens GN=CGGBP1 PE=1 SV=2	167
P23528 COF1_HUMAN Cofilin-1 OS=Homo sapiens GN=CFL1 PE=1 SV=3	166
P21291 CSRP1_HUMAN Cysteine and glycine-rich protein 1 OS=Homo sapiens GN=CSRP1 PE=1 SV=3	193
P50461 CSRP3_HUMAN Cysteine and glycine-rich protein 3 OS=Homo sapiens GN=CSRP3 PE=1 SV=1	194
P60981 DEST_HUMAN Destrin OS=Homo sapiens GN=DSTN PE=1 SV=3	165
P07992 ERCC1_HUMAN DNA excision repair protein ERCC-1 OS=Homo sapiens GN=ERCC1 PE=1 SV=1	297
P51858 HDGF_HUMAN Hepatoma-derived growth factor OS=Homo sapiens GN=HDGF PE=1 SV=1	240
Q0VD86 INCA1_HUMAN Protein INCA1 OS=Homo sapiens GN=INCA1 PE=1 SV=1	236
P61244 MAX_HUMAN Protein max OS=Homo sapiens GN=MAX PE=1 SV=1	160
O60682 MUSC_HUMAN M

**Comparison between the Python code and the Bash command for Task 2**

The Python code offers more flexibility than the Bash/UNIX commands, particularly for conditional logic. It also keeps each length paired with its sequence ID, which the Bash pipeline discards.
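
As one illustration of that flexibility, the sketch below parses the records once and then filters them with an arbitrary condition. The names `read_fasta` and `min_length`, and the toy records, are assumptions made for this example rather than part of the assignment.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    # emit the final record, which has no following header to close it
    if header is not None:
        yield header, "".join(chunks)

toy = io.StringIO(">a\nMKT\nLLV\n\n>b\nMS\n>c\nMAAELVEAK\n")
min_length = 5  # hypothetical threshold

# report only the sequences that satisfy the condition
for header, seq in read_fasta(toy):
    if len(seq) >= min_length:
        print(f"{header}\t{len(seq)}")
```

Swapping the `if` condition (for example, matching on text in the header) changes the filter without touching the parser.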

## **Task 3: Find the Longest and Shortest Sequence**

In [None]:
# define filepath
file_path = '/content/bioinformatics_data/mystery1.fa'

def find_longest_sequence(sequences):
    # track the ID, length and residues of the longest sequence seen so far
    longest_seq_id = ""
    longest_seq_length = 0
    longest_seq = ""

    # update the record whenever a longer sequence is found
    for seq_id, seq in sequences.items():
        current_seq_length = len(seq)
        if current_seq_length > longest_seq_length:
            longest_seq_length = current_seq_length
            longest_seq_id = seq_id
            longest_seq = seq

    return longest_seq_id, longest_seq_length, longest_seq

def find_shortest_sequence(sequences):
    # start from infinity so that the first sequence always becomes the current shortest
    shortest_seq_id = ""
    shortest_seq_length = float('inf')
    shortest_seq = ""

    for seq_id, seq in sequences.items():
        current_seq_length = len(seq)
        if current_seq_length < shortest_seq_length:
            shortest_seq_length = current_seq_length
            shortest_seq_id = seq_id
            shortest_seq = seq

    return shortest_seq_id, shortest_seq_length, shortest_seq

# Read the FASTA file once into a dictionary of {header: sequence}
sequences = {}
current_seq_id = ""

with open(file_path, 'r') as fasta_file:
    for line in fasta_file:
        line = line.strip()
        if line.startswith('>'):
            current_seq_id = line[1:]
            sequences[current_seq_id] = ''
        elif current_seq_id:
            # only add sequence lines once a header has been seen
            sequences[current_seq_id] += line

# Find and print the longest and shortest sequences from the same dictionary
if sequences:
    longest_seq_id, longest_seq_length, longest_seq = find_longest_sequence(sequences)
    print(f'The longest sequence: {longest_seq_id}')
    print(f'Length: {longest_seq_length} residues')
    print(f'Here are its residues: {longest_seq}')

    shortest_seq_id, shortest_seq_length, shortest_seq = find_shortest_sequence(sequences)
    print(f"Shortest sequence ID: {shortest_seq_id}")
    print(f"Length of the shortest sequence: {shortest_seq_length}")
    print(f"Shortest sequence: {shortest_seq}")
else:
    print("No sequences found in the file!")

The longest sequence: P07992 ERCC1_HUMAN DNA excision repair protein ERCC-1 OS=Homo sapiens GN=ERCC1 PE=1 SV=1
Length: 297 residues
Here are its residues: MDPGKDKEGVPQPSGPPARKKFVIPLDEDEVPPGVAKPLFRSTQSLPTVDTSAQAAPQTYAEYAISQPLEGAGATCPTGSEPLAGETPNQALKPGAKSNSIIVSPRQRGNPVLKFVRNVPWEFGDVIPDYVLGQSTCALFLSLRYHNLHPDYIHGRLQSLGKNFALRVLLVQVDVKDPQQALKELAKMCILADCTLILAWSPEEAGRYLETYKAYEQKPADLLMEKLEQDFVSRVTECLTTVKSVNKTDSQTLLTTFGSLEQLIAASREDLALCPGLGPQKARRLFDVLHEPFLKVP
Shortest sequence ID: Q9NS25 SPNXB_HUMAN Sperm protein associated with the nucleus on the X chromosome B/F OS=Homo sapiens GN=SPANXB1 PE=2 SV=1
Length of the shortest sequence: 103
Shortest sequence: MGQQSSVRRLKRSVPCESNEANEANEANKTMPETPTGDSDPQPAPKKMKTSESSTILVVRYRRNVKRTSPEELVNDHARENRINPDQMEEEEFIEITTERPKK


In [None]:
# fixed version: the final record in the file is now stored after the loop ends
file_path = '/content/bioinformatics_data/mystery1.fa'

def find_shortest_sequence(sequences):
    # start from infinity so that the first sequence always becomes the current shortest
    shortest_seq_id = ""
    shortest_seq_length = float('inf')
    shortest_seq = ""

    for seq_id, seq in sequences.items():
        current_seq_length = len(seq)
        if current_seq_length < shortest_seq_length:
            shortest_seq_length = current_seq_length
            shortest_seq_id = seq_id
            shortest_seq = seq

    return shortest_seq_id, shortest_seq_length, shortest_seq

sequences = {}
current_seq_id = ""
current_seq = ""

with open(file_path, 'r') as fasta_file:
    for line in fasta_file:
        line = line.strip()
        if line.startswith(">"):
            # store the previous record before starting a new one
            if current_seq_id:
                sequences[current_seq_id] = current_seq
            current_seq_id = line[1:]
            current_seq = ""
        else:
            current_seq += line

# store the last record, which has no following header to trigger the save
if current_seq_id:
    sequences[current_seq_id] = current_seq

if sequences:
    shortest_seq_id, shortest_seq_length, shortest_seq = find_shortest_sequence(sequences)
    print(f"Shortest sequence ID: {shortest_seq_id}")
    print(f"Length of the shortest sequence: {shortest_seq_length}")
    print(f"Shortest sequence: {shortest_seq}")
else:
    print("No sequences found in the file.")


Shortest sequence ID: Q9NS25 SPNXB_HUMAN Sperm protein associated with the nucleus on the X chromosome B/F OS=Homo sapiens GN=SPANXB1 PE=2 SV=1
Length of the shortest sequence: 103
Shortest sequence: MGQQSSVRRLKRSVPCESNEANEANEANKTMPETPTGDSDPQPAPKKMKTSESSTILVVRYRRNVKRTSPEELVNDHARENRINPDQMEEEEFIEITTERPKK
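
Once the records are in a dictionary, Python's built-in `max` and `min` with a `key` function give the same answers as a manual loop. This is a sketch on a toy dictionary (the IDs and sequences are invented for illustration); when several sequences tie, both functions return the first one encountered.

```python
# toy records standing in for the parsed FASTA dictionary
sequences = {"id1": "MKTLLV", "id2": "MS", "id3": "MAAELVEAKNM"}

if sequences:
    # compare the IDs by the length of the sequence each one maps to
    longest_id = max(sequences, key=lambda seq_id: len(sequences[seq_id]))
    shortest_id = min(sequences, key=lambda seq_id: len(sequences[seq_id]))
    print(f"Longest: {longest_id} ({len(sequences[longest_id])} residues)")
    print(f"Shortest: {shortest_id} ({len(sequences[shortest_id])} residues)")
else:
    print("No sequences found.")
```

The `if sequences:` guard matters because `max` and `min` raise `ValueError` on an empty dictionary.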


## **Task 4: DNA to Protein Translation**
Use this codon table to make the conversion.

```python
codon_table = {
    "ATA": "I", "ATC": "I", "ATT": "I", "ATG": "M",
    "ACA": "T", "ACC": "T", "ACG": "T", "ACT": "T",
    "AAC": "N", "AAT": "N", "AAA": "K", "AAG": "K",
    "AGC": "S", "AGT": "S", "AGA": "R", "AGG": "R",
    "CTA": "L", "CTC": "L", "CTG": "L", "CTT": "L",
    "CCA": "P", "CCC": "P", "CCG": "P", "CCT": "P",
    "CAC": "H", "CAT": "H", "CAA": "Q", "CAG": "Q",
    "CGA": "R", "CGC": "R", "CGG": "R", "CGT": "R",
    "GTA": "V", "GTC": "V", "GTG": "V", "GTT": "V",
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",
    "GAC": "D", "GAT": "D", "GAA": "E", "GAG": "E",
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",
    "TCA": "S", "TCC": "S", "TCG": "S", "TCT": "S",
    "TTC": "F", "TTT": "F", "TTA": "L", "TTG": "L",
    "TAC": "Y", "TAT": "Y", "TAA": "*", "TAG": "*",
    "TGC": "C", "TGT": "C", "TGA": "*", "TGG": "W"
}
```

In [None]:
# Python: a function that takes a DNA sequence as input and returns the translated amino-acid sequence
def sequence_translation(dna_sequence):
    # codon table mapping each codon to its amino acid ('*' marks a stop codon)
    codon_table = {
        "ATA": "I", "ATC": "I", "ATT": "I", "ATG": "M",
        "ACA": "T", "ACC": "T", "ACG": "T", "ACT": "T",
        "AAC": "N", "AAT": "N", "AAA": "K", "AAG": "K",
        "AGC": "S", "AGT": "S", "AGA": "R", "AGG": "R",
        "CTA": "L", "CTC": "L", "CTG": "L", "CTT": "L",
        "CCA": "P", "CCC": "P", "CCG": "P", "CCT": "P",
        "CAC": "H", "CAT": "H", "CAA": "Q", "CAG": "Q",
        "CGA": "R", "CGC": "R", "CGG": "R", "CGT": "R",
        "GTA": "V", "GTC": "V", "GTG": "V", "GTT": "V",
        "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",
        "GAC": "D", "GAT": "D", "GAA": "E", "GAG": "E",
        "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",
        "TCA": "S", "TCC": "S", "TCG": "S", "TCT": "S",
        "TTC": "F", "TTT": "F", "TTA": "L", "TTG": "L",
        "TAC": "Y", "TAT": "Y", "TAA": "*", "TAG": "*",
        "TGC": "C", "TGT": "C", "TGA": "*", "TGG": "W"
    }

    # normalise case so that lowercase input is also recognised
    dna_sequence = dna_sequence.upper()

    # split the DNA sequence into complete codons (3-letter groups); a trailing partial codon is ignored
    codons = [dna_sequence[start:start + 3] for start in range(0, len(dna_sequence) - 2, 3)]

    # translate each codon; an unrecognised codon (e.g. one containing N) becomes 'X'
    # instead of None, which would make ''.join() raise a TypeError
    protein_sequence = [codon_table.get(codon, 'X') for codon in codons]
    return ''.join(protein_sequence)

dna = "GCAGCC"
# call the function with a DNA sequence as the argument
print(sequence_translation(dna))


AA
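
In a real coding sequence, translation usually ends at the first stop codon rather than emitting `*` and continuing. The sketch below shows one way to do that. `translate_until_stop` is a hypothetical name, and the table is built compactly from the standard genetic code with codons enumerated in `TCAG` order, which should yield the same 64 mappings as the dictionary above.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order (TTT, TTC, TTA, TTG, TCT, ...)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate_until_stop(dna_sequence):
    """Translate from the first base, stopping at the first stop codon.

    A trailing incomplete codon is ignored; an unrecognised codon becomes 'X'.
    """
    protein = []
    dna_sequence = dna_sequence.upper()
    for start in range(0, len(dna_sequence) - 2, 3):
        amino_acid = CODON_TABLE.get(dna_sequence[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate_until_stop("ATGGCATAAGCC"))  # prints MA
```

This sketch always reads in the first frame; choosing a reading frame or locating the start codon is left out to keep the example short.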
