# Introduction
This notebook has been created as an addition to the main pipeline and is not necessarily part of the analysis process. Instead, it takes the raw data provided and goes through the process of shaping it so that it is ready to be imported into Euler for the analytical pipeline. Chronologically, this notebook sits before ***01_Data_Exploration***.

The files we received are:
- *Illumina_MAGs.tar.gz*
- *PacBio_MAGs.tar.gz*
- *merged_metadata_filtered.tsv*

The samples provided have been sequenced with each technique, so intuitively every sample from one technique should have a counterpart from the other. Feature files in *Illumina_MAGs* appear to follow the [UUIDv4](https://moshpit.qiime2.org/en/stable/chapters/howtos/import/) naming convention already, but those in *PacBio_MAGs* do not.

What is missing, and the changes that need to be made, are:
- Rename features following the [UUIDv4](https://moshpit.qiime2.org/en/stable/chapters/howtos/import/) naming convention.
- Create a primary key for all samples to retain technique information.
- Update the metadata file to reflect the new primary key.
- Create a feature-wide metadata file for preliminary analysis.
- Fulfill qiime2 requirements for import:
    - Merge the 2 datasets in a common sample_data directory so that it is qiime-compatible.
    - Create a MANIFEST file for **MAGS** in the sample_data directory.
    - Clean sample_data of possible .tmp files that would interfere with qiime compatibility.
- Compress sample_data to be exported to the Euler cluster for analysis.


# Setup

In [1]:
#set up environment
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization

# create directories for the notebook. DO NOT change
!mkdir -p data
!mkdir -p data/processed
raw_data = 'data/raw'
metadata_dir = 'data/processed/metadata'

In [2]:
# If you already ran the notebook and intend to run it again, erase the raw_data folder first.
# -f keeps the cell from failing when the folder does not exist yet.
!rm -rf $raw_data

In [3]:
%%bash -s "$raw_data" "$metadata_dir"
mkdir -p "$1"
mkdir -p "$2"

# Downloading and unzipping raw data

In [4]:
# Download files from the polybox
!wget 'https://polybox.ethz.ch/index.php/s/56JaAiKdGwioBKN/download'  -O $raw_data/Download.zip
# Unzip
!unzip -o $raw_data/Download.zip -d $raw_data
!rm $raw_data/Download.zip
# extract contents from applied_bioinformatics folder
!mv $raw_data/applied_bioinformatics/* $raw_data
!rm -r $raw_data/applied_bioinformatics

--2025-12-10 16:03:52--  https://polybox.ethz.ch/index.php/s/56JaAiKdGwioBKN/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘additional_data/Download.zip’

additional_data/Dow     [       <=>          ] 492.28M   371MB/s    in 1.3s    

2025-12-10 16:03:54 (371 MB/s) - ‘additional_data/Download.zip’ saved [516188455]

Archive:  additional_data/Download.zip
   creating: data/raw/applied_bioinformatics/
 extracting: data/raw/applied_bioinformatics/.DS_Store  
 extracting: data/raw/applied_bioinformatics/Illumina_MAGs.tar.gz  
 extracting: data/raw/applied_bioinformatics/PacBio_MAGs.tar.gz  
 extracting: data/raw/applied_bioinformatics/merged_metadata_filtered.tsv  


## .tar.gz
The MAGs are stored in .tar.gz files, one for the Illumina-generated and one for the PacBio-generated sequences. Extracting the files from these archives is the next step.

In [5]:
#untar MAGs files and store everything in raw_data
!tar -xzf $raw_data/Illumina_MAGs.tar.gz -C $raw_data
!tar -xzf $raw_data/PacBio_MAGs.tar.gz -C $raw_data
#remove the .tar.gz files
!rm -r $raw_data/Illumina_MAGs.tar.gz
!rm -r $raw_data/PacBio_MAGs.tar.gz

<a id='UUID'></a>
## UUIDv4 renaming
A simple nested for loop goes through every file in the directories of interest and uses a freshly generated UUID as the file name in **new_path**. A UUIDv4 is a randomly generated 128-bit identifier, written as 32 hexadecimal digits (usually grouped with hyphens). It is used to ensure that feature names are unique and of uniform length.

In [6]:
import os
from uuid import uuid4

#Rename all fasta files with UUIDs
for technique in os.listdir(raw_data):
    tech_path = os.path.join(raw_data, technique)
    if not os.path.isdir(tech_path):
        continue
    for sample_id in os.listdir(tech_path):
        sample_path = os.path.join(tech_path, sample_id)
        if not os.path.isdir(sample_path):
            continue
        for file in os.listdir(sample_path):
            if file.endswith((".fa", ".fasta")):
                old_path = os.path.join(sample_path, file)
                new_path = os.path.join(sample_path, f"{uuid4()}.fa")
                os.rename(old_path, new_path)
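
After renaming, it can be useful to confirm that a file name really is a UUIDv4. The following is a minimal sketch using only the standard library; the helper name `is_uuid4_name` is hypothetical and not part of the pipeline, and it assumes the canonical hyphenated form is the one wanted.

```python
from pathlib import Path
from uuid import UUID, uuid4


def is_uuid4_name(filename: str) -> bool:
    """Return True if the file stem is a canonical, hyphenated UUIDv4."""
    stem = Path(filename).stem
    try:
        parsed = UUID(stem)
    except ValueError:
        return False
    # UUID() also accepts un-hyphenated or braced input, so compare against
    # the canonical form to accept only the usual 36-character layout.
    return parsed.version == 4 and str(parsed) == stem.lower()


print(is_uuid4_name(f"{uuid4()}.fa"))  # a freshly generated name
print(is_uuid4_name("bin.1.fa"))       # a name that is not a UUID
```

Running such a check over every FASTA file before the import would catch files that were skipped by the renaming loop, for example because of an unexpected extension.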


## Primary Key Generation
By primary key we mean an unequivocal ID. Right now, if we were to merge the features and look at them together with the metadata, the MAGs coming from the same sample and sequenced with Illumina would not be distinguishable from the PacBio ones, and vice versa. As primary key, a technique-dependent prefix ('IL' for Illumina and 'PB' for PacBio) is enough to tell them apart.
To do so we will use a for loop again, adding prefix + '_' to the name of each sample directory.

In [7]:
%%bash -s "$raw_data"
raw_data="$1"
# Prefix every sample directory in Illumina_MAGs with IL_
for d in "$raw_data"/Illumina_MAGs/*; do
    mv "$d" "$(dirname "$d")/IL_$(basename "$d")"
done
# Prefix every sample directory in PacBio_MAGs with PB_
for d in "$raw_data"/PacBio_MAGs/*; do
    mv "$d" "$(dirname "$d")/PB_$(basename "$d")"
done
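
The bash loop adds the prefix every time it runs, so running the cell twice would produce names such as `IL_IL_...`. As an illustration only, here is a Python sketch of the same renaming that skips directories already carrying the prefix. The function name `add_prefix` is hypothetical, and the sketch assumes that no original sample name starts with `IL_` or `PB_`.

```python
import os
import tempfile


def add_prefix(parent_dir: str, prefix: str) -> list[str]:
    """Prefix every sample directory in parent_dir, skipping ones already prefixed."""
    renamed = []
    for name in sorted(os.listdir(parent_dir)):
        path = os.path.join(parent_dir, name)
        # Leave plain files alone and do not prefix the same directory twice.
        if not os.path.isdir(path) or name.startswith(prefix):
            continue
        new_name = f"{prefix}{name}"
        os.rename(path, os.path.join(parent_dir, new_name))
        renamed.append(new_name)
    return renamed


# Demonstration on a throwaway directory with two sample folders.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("M008", "P009"):
        os.mkdir(os.path.join(tmp, sample))
    first = add_prefix(tmp, "IL_")
    second = add_prefix(tmp, "IL_")  # second run finds nothing left to rename
    print(first, second)
```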


It is important that the new primary key is included in the metadata file. The following code retrieves the new sample name of every sample directory, together with the file name of each FASTA file. The latter is not required for the primary key, but will make Data Exploration easier.

In [8]:
# import metadata
metadata_df = pd.read_csv(f"{raw_data}/merged_metadata_filtered.tsv", sep="\t", index_col=0)

# collect one (sample-id, mag-id) record per FASTA file whose sample is in the metadata
prefixes = ("PB_", "IL_")
records = []

for technique in os.listdir(raw_data):
    tech_path = os.path.join(raw_data, technique)
    if not os.path.isdir(tech_path):
        continue
    for sample_id in os.listdir(tech_path):
        sample_path = os.path.join(tech_path, sample_id)
        if not os.path.isdir(sample_path):
            continue

        # derive ID for lookup
        lookup_id = sample_id
        for p in prefixes:
            if lookup_id.startswith(p):
                lookup_id = lookup_id[len(p):]

        for f in os.listdir(sample_path):
            if f.endswith((".fa", ".fasta")):
                if lookup_id in metadata_df.index:
                    mag_id = os.path.splitext(f)[0]
                    records.append((sample_id, mag_id))

primary_df = pd.DataFrame(records, columns=["sample-id", "mag-id"])

In [9]:
# Check sample-id
primary_df['sample-id'].unique()

array(['PB_B039_Aa_Gp_La', 'PB_MS001-3', 'PB_9b8b5', 'PB_M008',
       'PB_B038_Az_Gp_La', 'PB_e7c76', 'PB_P009', 'PB_M010',
       'PB_HM010-03', 'PB_B056_Sc_Na_Af', 'PB_HM010-01', 'PB_M004',
       'PB_36fe4', 'PB_a36ba', 'PB_M009', 'PB_P003', 'PB_P001', 'PB_M012',
       'PB_A001', 'PB_MS013-1', 'PB_B051_Aj_Po_Laf', 'PB_MS003-2',
       'PB_M006', 'PB_MS005-1', 'PB_B037_La_Ac_La', 'PB_MS009-1',
       'PB_3ee22', 'IL_B038_Az_Gp_La', 'IL_A001', 'IL_P003', 'IL_MS003-3',
       'IL_M009', 'IL_HM010-03', 'IL_M010', 'IL_B044_Hb_Ac_Ab',
       'IL_MS009-2', 'IL_MS009-1', 'IL_MS001-3', 'IL_M002', 'IL_M012',
       'IL_MS011-1', 'IL_B056_Sc_Na_Af', 'IL_M008', 'IL_B037_La_Ac_La',
       'IL_MS013-1', 'IL_MS005-1', 'IL_MS003-2', 'IL_HM010-01', 'IL_P009',
       'IL_B051_Aj_Po_Laf', 'IL_A002', 'IL_B039_Aa_Gp_La'], dtype=object)
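
Not every ID in the array above appears under both prefixes. A minimal sketch for listing which samples have a counterpart in the other technique is given below; the function name `split_by_pairing` is hypothetical, and the example list is a small subset of the IDs shown above.

```python
def split_by_pairing(sample_ids, prefixes=("IL_", "PB_")):
    """Group prefixed sample IDs by base name and split them into paired and unpaired."""
    seen = {}
    for sid in sample_ids:
        for prefix in prefixes:
            if sid.startswith(prefix):
                # Record which technique prefix was seen for this base name.
                seen.setdefault(sid[len(prefix):], set()).add(prefix)
                break
    paired = sorted(b for b, techs in seen.items() if len(techs) == len(prefixes))
    unpaired = sorted(b for b, techs in seen.items() if len(techs) < len(prefixes))
    return paired, unpaired


ids = ["PB_A001", "IL_A001", "PB_9b8b5", "IL_A002", "PB_M008", "IL_M008"]
paired, unpaired = split_by_pairing(ids)
print("paired:", paired)      # paired: ['A001', 'M008']
print("unpaired:", unpaired)  # unpaired: ['9b8b5', 'A002']
```

In the notebook, the same function could be fed `primary_df['sample-id'].unique()` to get the full picture.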

Now that all samples are accounted for, we can derive the old sample name and use it to merge the new information into the metadata file provided.

In [10]:
#create 'sample' without the IL/PB prefix
primary_df['sample'] = primary_df['sample-id'].str.split(pat='_', n=1).str[1]
primary_df.head(3)

Unnamed: 0,sample-id,mag-id,sample
0,PB_B039_Aa_Gp_La,6de933c6-afeb-4e58-b764-f8ba5fc0d366,B039_Aa_Gp_La
1,PB_B039_Aa_Gp_La,8d28fd5f-5f2c-4508-9764-1c20ebffa8df,B039_Aa_Gp_La
2,PB_B039_Aa_Gp_La,c0a09ab4-2630-4493-a84e-67f8b764d4d9,B039_Aa_Gp_La


## Feature-based metadata
For distributions and preliminary analysis, a count of FASTA files per sample could be useful. To make this possible, we left-merge *primary_df* with the metadata, so that every MAG row carries the metadata of its sample.

In [11]:
merged_df = primary_df.merge(metadata_df, left_on ='sample', right_index = True, how = 'left')
merged_df.head(3)

Unnamed: 0,sample-id,mag-id,sample,samp_country,category,fermented_food_type
0,PB_B039_Aa_Gp_La,6de933c6-afeb-4e58-b764-f8ba5fc0d366,B039_Aa_Gp_La,Benin,fermented fish,Lanhouin
1,PB_B039_Aa_Gp_La,8d28fd5f-5f2c-4508-9764-1c20ebffa8df,B039_Aa_Gp_La,Benin,fermented fish,Lanhouin
2,PB_B039_Aa_Gp_La,c0a09ab4-2630-4493-a84e-67f8b764d4d9,B039_Aa_Gp_La,Benin,fermented fish,Lanhouin
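
The per-sample count can be sketched without pandas as follows. The first three records are the rows shown above; the last one is a made-up row added only so that the example has two samples, and the function name `mags_per_sample` is hypothetical.

```python
from collections import Counter

records = [
    ("PB_B039_Aa_Gp_La", "6de933c6-afeb-4e58-b764-f8ba5fc0d366"),
    ("PB_B039_Aa_Gp_La", "8d28fd5f-5f2c-4508-9764-1c20ebffa8df"),
    ("PB_B039_Aa_Gp_La", "c0a09ab4-2630-4493-a84e-67f8b764d4d9"),
    ("IL_EXAMPLE", "00000000-0000-4000-8000-000000000000"),  # hypothetical row
]


def mags_per_sample(rows):
    """Count how many MAG files belong to each sample-id."""
    return Counter(sample_id for sample_id, _mag_id in rows)


counts = mags_per_sample(records)
for sample_id, n in counts.most_common():
    print(sample_id, n)
```

With the DataFrame already in memory, the same counts should be obtainable with `merged_df.groupby("sample-id").size()`.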


A column with the relative path (as seen from Jupyter) is added to make Data Exploration easier. The file path is built from *raw_data/sample_data/* and *sample-id/mag-id* taken from *merged_df*.

In [20]:
merged_df["file_path"] = merged_df.apply(
    lambda r: f"{raw_data}/sample_data/{r['sample-id']}/{r['mag-id']}",
    axis=1
)
merged_df.head(3)

Unnamed: 0,sample-id,mag-id,sample,samp_country,category,fermented_food_type,file_path
0,PB_B039_Aa_Gp_La,6de933c6-afeb-4e58-b764-f8ba5fc0d366,B039_Aa_Gp_La,Benin,fermented fish,Lanhouin,data/raw/sample_data/PB_B039_Aa_Gp_La/6de933c6...
1,PB_B039_Aa_Gp_La,8d28fd5f-5f2c-4508-9764-1c20ebffa8df,B039_Aa_Gp_La,Benin,fermented fish,Lanhouin,data/raw/sample_data/PB_B039_Aa_Gp_La/8d28fd5f...
2,PB_B039_Aa_Gp_La,c0a09ab4-2630-4493-a84e-67f8b764d4d9,B039_Aa_Gp_La,Benin,fermented fish,Lanhouin,data/raw/sample_data/PB_B039_Aa_Gp_La/c0a09ab4...


The new *merged_df* can be saved as *Metadata_Extended.tsv* and used for exploratory analysis.

In [12]:
merged_df.to_csv(f"{metadata_dir}/Metadata_Extended.tsv", sep="\t", index=False)

## Getting things ready for qiime2 Import
### Fixing directory tree
To import sequences correctly into qiime, we need to change the directory tree from the current layout:

data/  
&nbsp;&nbsp;&nbsp;&nbsp;- technique 1  
&nbsp;&nbsp;&nbsp;&nbsp;- technique2  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- sample 1  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- sample 2  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- seq1.fa  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- seq2.fa  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...  
To the required:  
data/  
&nbsp;&nbsp;&nbsp;&nbsp;sample1  
&nbsp;&nbsp;&nbsp;&nbsp;sample2  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-seq1  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-seq2  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...  
&nbsp;&nbsp;&nbsp;&nbsp;MANIFEST  

qiime would import the original tree without issue, but the subsequent BUSCO evaluation job would fail because of a path-related error.

Let's move all the sample folders into a new *raw_data/sample_data* folder.

In [13]:
%%bash -s "$raw_data"
raw_data="$1"

# Destination folder
dest_dir="$raw_data/sample_data"
mkdir -p "$dest_dir"

mv "$raw_data/Illumina_MAGs"/* "$dest_dir"/
mv "$raw_data/PacBio_MAGs"/* "$dest_dir"/

# Delete the directory husks left behind
rm -r "$raw_data/Illumina_MAGs"
rm -r "$raw_data/PacBio_MAGs"
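
For comparison, the same flattening step can be sketched in Python with an explicit check that no sample directory gets overwritten. This is an illustration under the assumption that the sample directories already carry their `IL_`/`PB_` prefix; the function name `flatten_samples` is hypothetical.

```python
import os
import shutil
import tempfile


def flatten_samples(raw_dir, technique_dirs, dest_name="sample_data"):
    """Move sample directories from each technique folder into one flat folder."""
    dest = os.path.join(raw_dir, dest_name)
    os.makedirs(dest, exist_ok=True)
    moved = []
    for tech in technique_dirs:
        tech_path = os.path.join(raw_dir, tech)
        for sample in sorted(os.listdir(tech_path)):
            source = os.path.join(tech_path, sample)
            if not os.path.isdir(source):
                continue  # stray files stay behind and are removed with the husk
            target = os.path.join(dest, sample)
            if os.path.exists(target):
                raise FileExistsError(f"{target} already exists")
            shutil.move(source, target)
            moved.append(sample)
        shutil.rmtree(tech_path)  # delete the emptied technique folder
    return moved


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as raw:
    os.makedirs(os.path.join(raw, "Illumina_MAGs", "IL_M008"))
    os.makedirs(os.path.join(raw, "PacBio_MAGs", "PB_M008"))
    moved = flatten_samples(raw, ["Illumina_MAGs", "PacBio_MAGs"])
    remaining = sorted(os.listdir(raw))
    print(moved, remaining)
```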

### MANIFEST

The MANIFEST is a file that associates each data file with its location. The [Moshpit Tutorial](https://moshpit.qiime2.org/en/stable/chapters/howtos/import/) on importing non-dereplicated MAGs provides no information about the MANIFEST format, so the format was initially taken from the week 3 tutorial.

In [14]:
#import MANIFEST from the tutorial
!wget 'https://polybox.ethz.ch/index.php/s/LBs9m5ZAAxGTLR6/download' -O $raw_data/w3_MANIFEST

w3 = pd.read_csv(f'{raw_data}/w3_MANIFEST', sep = '\t')
w3.head(1)

--2025-12-10 16:04:58--  https://polybox.ethz.ch/index.php/s/LBs9m5ZAAxGTLR6/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3012 (2.9K) [application/octet-stream]
Saving to: ‘data/raw/w3_MANIFEST’


2025-12-10 16:04:58 (549 MB/s) - ‘data/raw/w3_MANIFEST’ saved [3012/3012]



Unnamed: 0,sample-id,absolute-filepath
0,550.L1S109.s.1,$PWD/w3_data/seqs/ERR1866491.fastq


As can be seen, the week 3 MANIFEST is a **2-column**, **tab-separated** file. From the Moshpit tutorial we know that the sequence file names have to be UUIDv4-generated. For the MAGs, however, a **3-column, comma-separated** file is required, as the import error messages reported further down make clear.

It is a matter of creating a df with **'sample-id'** and **'mag-id'** from our newly created *merged_df*, plus the **path** of each file in *sample_data*. The path has to match the future location on the cluster, because the files need to be imported into a cache that the computational nodes can access easily. This was one of the first roadblocks of the project: understanding how to get everything into a ready-to-go state on Jupyter (where the UI is more visually intuitive for novice bioinformaticians) and then move it to Euler was a real challenge.

In [15]:
# make sure to change this to your own sample_data directory on Euler
Euler_dir = '/cluster/scratch/emotta/sample_data'

manifest_df = merged_df[["sample-id", "mag-id"]].copy()
manifest_df["filename"] = manifest_df.apply(
    lambda r: f"{Euler_dir}/{r['sample-id']}/{r['mag-id']}.fa",
    axis=1
)

manifest_df.head(3)


Unnamed: 0,sample-id,mag-id,filename
0,PB_B039_Aa_Gp_La,6de933c6-afeb-4e58-b764-f8ba5fc0d366,/cluster/scratch/emotta/sample_data/PB_B039_Aa...
1,PB_B039_Aa_Gp_La,8d28fd5f-5f2c-4508-9764-1c20ebffa8df,/cluster/scratch/emotta/sample_data/PB_B039_Aa...
2,PB_B039_Aa_Gp_La,c0a09ab4-2630-4493-a84e-67f8b764d4d9,/cluster/scratch/emotta/sample_data/PB_B039_Aa...


This has easily been one of the most challenging steps in our project, since no proper documentation exists about it. Even after asking the supervisors of the Wednesday sessions, the only guide was the error messages from the ***Mosh tools cache-import*** command when trying to import the sequences into qiime.
 
 <span style="color:red">*MANIFEST is not a(n) MultiMAGManifestFormat file  
Found header on line 1 with the following labels: ['sample-id\tabsolute-filepath'], expected: ['sample-id', 'mag-id', 'filename']*</span>

And later:

<span style="color:red"> Found header on line 1 with the following labels: ['sample-id\tmag-id\tfilename'], expected: ['sample-id', 'mag-id', 'filename']</span>

In retrospect it was not a complicated issue, but combined with the steep learning curve of both qiime and Euler at the same time, it added up to a major setback in the first part of the semester.

We are now ready to save the MANIFEST in the *sample_data* directory.

In [16]:
manifest_df.to_csv(f'{raw_data}/sample_data/MANIFEST', sep=",", index=False)
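
Both error messages quoted above come down to the header line. The sketch below mirrors only that header comparison, assuming the importer parses the first line as comma-separated values; it is not the validation that qiime itself performs, and `manifest_header_ok` is a hypothetical helper.

```python
import csv
import os
import tempfile

EXPECTED_HEADER = ["sample-id", "mag-id", "filename"]


def manifest_header_ok(path):
    """Return True if the first line, parsed as CSV, equals the expected header."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), None)
    return header == EXPECTED_HEADER


# Demonstration: a tab-separated header is read as one single label and fails.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "MANIFEST_comma")
    bad = os.path.join(tmp, "MANIFEST_tab")
    with open(good, "w", newline="") as handle:
        handle.write("sample-id,mag-id,filename\n")
    with open(bad, "w", newline="") as handle:
        handle.write("sample-id\tmag-id\tfilename\n")
    good_ok = manifest_header_ok(good)
    bad_ok = manifest_header_ok(bad)
    print(good_ok, bad_ok)
```

Running a check like this on *sample_data/MANIFEST* before uploading would catch a wrong separator without a round trip to Euler.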

## Clean sample_data
Another common issue we encountered was an error caused by temporary and hidden files (*.files*) created by Jupyter and moved along from the raw data directories.


In [17]:
!find "$raw_data/sample_data" -maxdepth 1 -type f -name ".*" -delete
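
The `find` command above only looks at files directly inside *sample_data* (`-maxdepth 1`). To see whether hidden entries also sit deeper in the tree, a read-only listing such as the following sketch may help; `find_hidden_entries` is a hypothetical helper and deletes nothing.

```python
import os
import tempfile


def find_hidden_entries(root):
    """List hidden files and directories (names starting with '.') at any depth."""
    hidden = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if name.startswith("."):
                full = os.path.join(dirpath, name)
                hidden.append(os.path.relpath(full, root))
    return sorted(hidden)


# Demonstration on a throwaway tree with hidden entries at two levels.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "IL_M008"))
    os.makedirs(os.path.join(tmp, ".ipynb_checkpoints"))
    for rel in (".DS_Store", os.path.join("IL_M008", ".tmp"),
                os.path.join("IL_M008", "a.fa")):
        open(os.path.join(tmp, rel), "w").close()
    hidden = find_hidden_entries(tmp)
    print(hidden)
```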

## Zip and upload to Euler
*sample_data* has been prepared and can now be compressed into a single archive for transfer.

In [18]:
!tar -czf sample_data.tar.gz -C "$raw_data" sample_data


tar: sample_data: file changed as we read it
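
Since tar reported that *sample_data* changed while it was being read, it may be worth confirming that the archive holds what is expected before uploading it. The following sketch uses the standard `tarfile` module; `archive_members` is a hypothetical helper, and the demonstration builds a tiny archive of its own rather than touching the real one.

```python
import os
import tarfile
import tempfile


def archive_members(archive_path):
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())


# Demonstration: build a small sample_data tree, archive it, and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample_data")
    os.makedirs(os.path.join(src, "IL_M008"))
    open(os.path.join(src, "MANIFEST"), "w").close()
    open(os.path.join(src, "IL_M008", "a.fa"), "w").close()

    archive_path = os.path.join(tmp, "sample_data.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(src, arcname="sample_data")

    names = archive_members(archive_path)
    print("sample_data/MANIFEST" in names)
```

For the real archive, comparing the number of `.fa` members against the number of rows in the MANIFEST would be a natural follow-up check.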


**All set!** *sample_data* is now ready to be downloaded and manually uploaded to the ${HOME}/applied_bioinformatics/03_Aritifacts_Zipped directory on Euler for safekeeping.