# Unzipping the Train and Test folders and viewing the MPEG-G files

In [2]:
# Test Docker works
!docker run hello-world


Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/



In [3]:
%pip install biopython

Note: you may need to restart the kernel to use updated packages.


In [4]:
# Import packages
import os
import zipfile
import subprocess
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import shutil


from Bio import SeqIO
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

## Step 1: Extract the Train and Test files from the .zip archives

In [5]:
print(os.listdir('.'))

['Starter_NB_Part1_Decompressing_MPEG_G_files.ipynb']


In [9]:
# Define your zip files and corresponding output directories
zip_targets = {
    r"C:\Users\JUDAH\Downloads\TrainFiles.zip": './train',
    r"C:\Users\JUDAH\Downloads\TestFiles.zip": './test'
}

for zip_path, extract_to in zip_targets.items():
    # Create the output directory if it doesn't exist
    os.makedirs(extract_to, exist_ok=True)

    try:
        # Use shutil to extract
        shutil.unpack_archive(zip_path, extract_to, format='zip')
        print(f"✅ Successfully extracted {zip_path} to {extract_to}/")
    except Exception as e:
        print(f"❌ Failed to extract {zip_path}: {e}")


✅ Successfully extracted C:\Users\JUDAH\Downloads\TrainFiles.zip to ./train/
✅ Successfully extracted C:\Users\JUDAH\Downloads\TestFiles.zip to ./test/
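
To confirm that the archives unpacked as expected, a quick check like the one below can count the `.mgb` files under each output directory. This is a minimal sketch: `count_files_by_extension` is a hypothetical helper, not part of the starter notebook, and it assumes the `./train` and `./test` directories created above.

```python
import os

def count_files_by_extension(root_dir, extension=".mgb"):
    """Walk root_dir recursively and count files whose names end with extension."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root_dir):
        total += sum(1 for name in filenames if name.endswith(extension))
    return total

# Report the number of .mgb files found in each extraction target
for folder in ("./train", "./test"):
    if os.path.isdir(folder):
        print(f"{folder}: {count_files_by_extension(folder)} .mgb files")
    else:
        print(f"{folder}: not found")
```

A count of zero usually means the archive was extracted somewhere else or has a different internal folder layout.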


## Step 2: Pull the Genie Docker Image

In [10]:
# Ensure you have the latest version of Genie
!docker pull muefab/genie:latest

latest: Pulling from muefab/genie
b0907bbb508f: Pulling fs layer
4619a895afbd: Pulling fs layer
0a9a5dfd008f: Pulling fs layer
846abc99b13f: Pulling fs layer
4f4fb700ef54: Pulling fs layer
bb6c686a0e98: Pulling fs layer
4f4fb700ef54: Download complete
0a9a5dfd008f: Download complete
b0907bbb508f: Download complete
4619a895afbd: Download complete
0a9a5dfd008f: Pull complete
846abc99b13f: Download complete
bb6c686a0e98: Download complete
bb6c686a0e98: Pull complete
4619a895afbd: Pull complete
846abc99b13f: Pull complete
b0907bbb508f: Pull complete
4f4fb700ef54: Pull complete
Digest: sha256:c3112a3879cc18061bbab5ed8f76dec255ab1be46e2133cd59320dd5ba98ef89
Status: Downloaded newer image for muefab/genie:latest
docker.io/muefab/genie:latest
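
Before decoding, it can be useful to check from Python that the image is present locally. The sketch below wraps `docker image inspect`, which exits with a non-zero status when the image is missing. `docker_image_available` is a hypothetical helper name, and the 30-second timeout is an arbitrary choice.

```python
import shutil
import subprocess

def docker_image_available(image="muefab/genie:latest"):
    """Return True if `docker image inspect` finds the image locally."""
    if shutil.which("docker") is None:  # Docker CLI is not on PATH
        return False
    try:
        result = subprocess.run(
            ["docker", "image", "inspect", image],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0

print("Genie image available:", docker_image_available())
```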


## Step 3: Write a function to decode the `.mgb` files

In [41]:
import os
import subprocess
import gzip
import shutil

# Set base directories
notebook_dir = os.getcwd()
container_dir = "/data"

# Decode every `.mgb` file in a folder into a gzip-compressed FASTQ file
def decode_all_mgb_in_folder(folder_name):
    # os.path.join returns folder_name unchanged when it is an absolute path
    host_dir = os.path.join(notebook_dir, folder_name)

    for mgb_filename in os.listdir(host_dir):
        if not mgb_filename.endswith(".mgb"):
            continue

        mgb_path = os.path.join(host_dir, mgb_filename)
        mgb_filename_no_ext = os.path.splitext(mgb_filename)[0]
        fastq_path = os.path.join(host_dir, f"{mgb_filename_no_ext}.fastq")
        fastq_gz_path = f"{fastq_path}.gz"

        # Mount the host folder at /data so Genie writes the FASTQ next to the input
        command = [
            "docker", "run", "--rm",
            "-v", f"{host_dir}:{container_dir}",
            "muefab/genie:latest", "run",
            "-f",
            "-i", f"{container_dir}/{mgb_filename}",
            "-o", f"{container_dir}/{mgb_filename_no_ext}.fastq"
        ]

        # Genie's logs are discarded because printing them for every file uses a lot of memory
        result = subprocess.run(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

        # Keep the `.mgb` file if Genie reported a failure or produced no output
        if result.returncode != 0:
            print(f"[Error] Decoding failed (exit code {result.returncode}): {mgb_filename}")
            continue

        if not os.path.exists(fastq_path):
            print(f"[Error] FASTQ not created: {mgb_filename}")
            continue

        try:
            with open(fastq_path, "rb") as f_in, gzip.open(fastq_gz_path, "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
        except Exception as e:
            print(f"[Error] Compression failed: {mgb_filename} - {str(e)}")
            continue

        os.remove(mgb_path)    # delete the original `.mgb` file
        os.remove(fastq_path)  # delete the uncompressed FASTQ file

        print(f"[OK] {mgb_filename} → {os.path.basename(fastq_gz_path)}")

        # To debug a single file, call subprocess.run(command, capture_output=True, text=True)
        # above and uncomment the lines below:
        # print("\n--- STDOUT ---\n")
        # print(result.stdout)
        # if result.stderr:
        #     print("\n--- STDERR ---\n")
        #     print(result.stderr)

## Step 4: Use the function to decode the `.mgb` files

In [42]:
decode_all_mgb_in_folder(r"C:\Users\JUDAH\Desktop\Mircobiome MPEG\train\TrainFiles")
decode_all_mgb_in_folder(r"C:\Users\JUDAH\Desktop\Mircobiome MPEG\test\TestFiles")

[OK] ID_AAFNOT.mgb → ID_AAFNOT.fastq.gz
[OK] ID_AAXPTO.mgb → ID_AAXPTO.fastq.gz
[OK] ID_AAYKAN.mgb → ID_AAYKAN.fastq.gz
[OK] ID_ABEZNS.mgb → ID_ABEZNS.fastq.gz
[OK] ID_ABFFLP.mgb → ID_ABFFLP.fastq.gz
[OK] ID_ABFQPG.mgb → ID_ABFQPG.fastq.gz
[OK] ID_ABMLPB.mgb → ID_ABMLPB.fastq.gz
[OK] ID_ABOEMW.mgb → ID_ABOEMW.fastq.gz
[OK] ID_ABRMNZ.mgb → ID_ABRMNZ.fastq.gz
[OK] ID_ABROLI.mgb → ID_ABROLI.fastq.gz
[OK] ID_ABYEPC.mgb → ID_ABYEPC.fastq.gz
[OK] ID_ABYUSV.mgb → ID_ABYUSV.fastq.gz
[OK] ID_ABZIIM.mgb → ID_ABZIIM.fastq.gz
[OK] ID_ACDYOS.mgb → ID_ACDYOS.fastq.gz
[OK] ID_ACFOIY.mgb → ID_ACFOIY.fastq.gz
[OK] ID_ACKYNO.mgb → ID_ACKYNO.fastq.gz
[OK] ID_ACNBRX.mgb → ID_ACNBRX.fastq.gz
[OK] ID_ACPNZE.mgb → ID_ACPNZE.fastq.gz
[OK] ID_ACSAGK.mgb → ID_ACSAGK.fastq.gz
[OK] ID_ACWUII.mgb → ID_ACWUII.fastq.gz
[OK] ID_ADDBVX.mgb → ID_ADDBVX.fastq.gz
[OK] ID_ADGTHC.mgb → ID_ADGTHC.fastq.gz
[OK] ID_AEHLIF.mgb → ID_AEHLIF.fastq.gz
[OK] ID_AEWEDE.mgb → ID_AEWEDE.fastq.gz
[OK] ID_AFDWVD.mgb → ID_AFDWVD.fastq.gz
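
To view what a decoded file contains, the first few reads can be pulled straight from a `.fastq.gz` file with the standard library. This is a minimal sketch that assumes each record occupies exactly four lines (no wrapped sequences). `peek_fastq_gz` is a hypothetical helper, and the example path is only an illustration that should be adjusted to your own folder layout.

```python
import gzip
import os
from itertools import islice

def peek_fastq_gz(path, n_reads=3):
    """Return the first n_reads records as (read_id, sequence, quality) tuples."""
    records = []
    with gzip.open(path, "rt") as handle:
        while len(records) < n_reads:
            # A FASTQ record is four lines: @id, sequence, +, quality
            block = [line.rstrip("\n") for line in islice(handle, 4)]
            if len(block) < 4:  # end of file or truncated record
                break
            header, sequence, _plus, quality = block
            records.append((header[1:], sequence, quality))  # drop the leading '@'
    return records

# Hypothetical path; adjust to wherever your decoded files live
example_path = os.path.join("train", "TrainFiles", "ID_AAFNOT.fastq.gz")
if os.path.exists(example_path):
    for read_id, sequence, quality in peek_fastq_gz(example_path):
        print(read_id, len(sequence), "bases")
```

Reading through `gzip.open` avoids decompressing the whole file to disk just to inspect a few reads.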


## What MPEG-G did, in simple terms

Your `.mgb` file used the MPEG-G standard to store sequencing data efficiently. Here's what happened under the hood:

- **Access Units (AUs)**: Think of these as independent blocks, like packets or video frames. Each AU can be decoded without needing the entire file.
  
- **Descriptor Streams**:
  - `SEQUENCE`: The DNA bases of each read (A, C, G, T, plus N for an unknown base).
  - `QUALITY`: A confidence score for each base call (used to assess sequencing accuracy).
  - `READ_IDENTIFIER`: The name or ID of each read.

- **Compression Techniques**:
  - Redundancies in the reads and IDs were removed.
  - Quality scores may have been quantized or entropy-coded.
  - Optional reference-based compression could align reads to a known genome and store only differences.

- **Output Format (`.fastq`)**:
  - This format is standard in genomics: it includes the ID, DNA sequence, and quality scores for each read.
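
The descriptor-stream idea above can be illustrated with a toy example. This sketch is not MPEG-G and does not reproduce what Genie does internally: it uses `zlib` as a stand-in compressor and made-up reads, purely to show records being split into per-field streams, compressed independently, and reassembled without loss.

```python
import zlib

# Made-up reads: (read identifier, sequence, quality string)
reads = [
    ("read_1", "ACGTACGTAC", "IIIIIHHHGG"),
    ("read_2", "ACGTACGTTC", "IIIIHHHGGF"),
    ("read_3", "ACGAACGTAC", "IIIHHHGGFF"),
]

def split_into_streams(records):
    """Group the same field from every record into one newline-joined stream."""
    ids, sequences, qualities = zip(*records)
    return {
        "READ_IDENTIFIER": "\n".join(ids),
        "SEQUENCE": "\n".join(sequences),
        "QUALITY": "\n".join(qualities),
    }

def compress_streams(streams):
    """Compress each stream on its own (zlib stands in for a real entropy coder)."""
    return {name: zlib.compress(text.encode("ascii")) for name, text in streams.items()}

def decompress_streams(compressed):
    """Decompress each stream and zip the fields back into records."""
    fields = {
        name: zlib.decompress(blob).decode("ascii").split("\n")
        for name, blob in compressed.items()
    }
    return list(zip(fields["READ_IDENTIFIER"], fields["SEQUENCE"], fields["QUALITY"]))

compressed = compress_streams(split_into_streams(reads))
restored = decompress_streams(compressed)
print("Lossless round trip:", restored == reads)
for name, blob in compressed.items():
    print(f"{name}: {len(blob)} bytes")
```

Grouping similar data together is what lets a codec apply a model suited to each field. With only three short reads, the compressed sizes here say nothing about real-world ratios.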

MPEG-G is to genomics what `.mp4` is to video: a way to store large data efficiently without losing critical information.
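
Because the decoded output is FASTQ, it helps to see how a single record is laid out and how its quality string maps to numbers. The sketch below uses a made-up record and assumes Phred+33 encoding, which is common for current sequencing data but is an assumption about these particular files. `parse_fastq_record` and `phred_scores` are hypothetical helper names.

```python
def parse_fastq_record(lines):
    """Split a four-line FASTQ record into (read_id, sequence, quality)."""
    header, sequence, separator, quality = lines
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality

def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (offset 33 assumed)."""
    return [ord(char) - offset for char in quality]

# Made-up record for illustration
record = ["@read_1", "ACGT", "+", "II5!"]
read_id, sequence, quality = parse_fastq_record(record)

scores = phred_scores(quality)
# A Phred score Q corresponds to an error probability of 10^(-Q/10)
error_probabilities = [10 ** (-q / 10) for q in scores]

print(read_id, sequence)
print(scores)  # → [40, 40, 20, 0]
print(error_probabilities)
```

A score of 40 means roughly a 1 in 10,000 chance that the base call is wrong, while a score of 0 carries no confidence at all.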