# Introduction
This notebook is an addition to the main pipeline and is not strictly part of the analysis process. Instead, it takes the raw data provided and shapes it so that it is ready to be imported into Euler for the analytical pipeline. Chronologically, this notebook sits before ***01_Data_Exploration***.

The files we received are:
- *Illumina_MAGs.tar.gz*
- *PacBio_MAGs.tar.gz*
- *merged_metadata_filtered.tsv*

The samples provided have been sequenced with both techniques, so in principle every sample from one technique should have a counterpart from the other. Feature files in *Illumina_MAGs* appear to already follow the [UUIDv4](https://moshpit.qiime2.org/en/stable/chapters/howtos/import/) naming convention, but those in *PacBio_MAGs* do not.
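One way to verify which files already follow the convention is to try parsing each file stem as a UUID. The following is a minimal sketch; the helper name `is_uuid4_stem` and the file name `bin.1.fa` are hypothetical and only serve as an illustration.

```python
import os
from uuid import UUID


def is_uuid4_stem(filename: str) -> bool:
    """Return True if the file name without extension is a canonical UUIDv4."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    try:
        parsed = UUID(stem)
    except ValueError:
        return False
    # Require version 4 and the canonical hyphenated, lowercase form.
    return parsed.version == 4 and str(parsed) == stem


# The first name is a UUIDv4; the second is a hypothetical non-UUID name.
print(is_uuid4_stem("829f489a-5001-40f7-b6f2-3cfca555ec09.fa"))
print(is_uuid4_stem("bin.1.fa"))
```

Comparing `str(parsed)` with the stem rejects variants that `UUID()` would otherwise accept, such as names without hyphens.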

What is missing, and the changes that need to be made:
- Rename features following the [UUIDv4](https://moshpit.qiime2.org/en/stable/chapters/howtos/import/) naming convention.
- Create a primary key for all samples to retain technique information.
- Update the metadata file to reflect the new primary key.
- Create a feature-wide metadata file for preliminary analysis.
- Fulfill the qiime2 requirements for import:
    - Merge the 2 datasets in a sample_data common directory so that it is qiime-compatible.
    - Create a MANIFEST file for **MAGS** in the sample_data directory.
    - Clean sample_data of possible .tmp files that would interfere with qiime compatibility.
- Compress sample_data to be exported to the Euler cluster for analysis.


# Setup

In [1]:

#set up environment
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization

# create directories for the notebook. DO NOT change
data_dir = 'data/01_Data_Reshaping'

# delete old folders (-f: no error if the folder does not exist yet)
!rm -rf $data_dir

!mkdir -p data
!mkdir -p $data_dir


# Downloading and unzipping raw data

In [2]:
# Download files from the polybox
!wget 'https://polybox.ethz.ch/index.php/s/56JaAiKdGwioBKN/download'  -O additional_data/Download.zip
# Unzip
!unzip -o additional_data/Download.zip -d $data_dir
!rm additional_data/Download.zip
# extract contents from applied_bioinformatics folder
!mv $data_dir/applied_bioinformatics/* $data_dir
!rm -r $data_dir/applied_bioinformatics

--2025-12-07 17:01:40--  https://polybox.ethz.ch/index.php/s/56JaAiKdGwioBKN/download
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘additional_data/Download.zip’

additional_data/Dow     [       <=>          ] 492.28M   392MB/s    in 1.3s    

2025-12-07 17:01:41 (392 MB/s) - ‘additional_data/Download.zip’ saved [516188455]

Archive:  additional_data/Download.zip
   creating: data/01_Data_Reshaping/applied_bioinformatics/
 extracting: data/01_Data_Reshaping/applied_bioinformatics/.DS_Store  
 extracting: data/01_Data_Reshaping/applied_bioinformatics/Illumina_MAGs.tar.gz  
 extracting: data/01_Data_Reshaping/applied_bioinformatics/PacBio_MAGs.tar.gz  
 extracting: data/01_Data_Reshaping/applied_bioinformatics/merged_metadata_filtered.tsv  


## .tar.gz
The MAGs from Illumina and PacBio sequencing are stored in separate .tar.gz archives. Extracting the files from these archives is the next step.

In [3]:
#untar MAGs files and store everything in data_dir
!tar -xzf $data_dir/Illumina_MAGs.tar.gz -C $data_dir
!tar -xzf $data_dir/PacBio_MAGs.tar.gz -C $data_dir
#remove the .tar.gz files
!rm -r $data_dir/Illumina_MAGs.tar.gz
!rm -r $data_dir/PacBio_MAGs.tar.gz

<a id='UUID'></a>
## UUIDv4 renaming
A simple nested for loop goes through every file in the directories of interest and renames it, using a freshly generated UUID as the **new_path**.

In [4]:
import os
from uuid import uuid4

#Rename all fasta files with UUIDs
for technique in os.listdir(data_dir):
    tech_path = os.path.join(data_dir, technique)
    if not os.path.isdir(tech_path):
        continue
    for sample_id in os.listdir(tech_path):
        sample_path = os.path.join(tech_path, sample_id)
        if not os.path.isdir(sample_path):
            continue
        for file in os.listdir(sample_path):
            if file.endswith((".fa", ".fasta")):
                old_path = os.path.join(sample_path, file)
                new_path = os.path.join(sample_path, f"{uuid4()}.fa")
                os.rename(old_path, new_path)
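
Because the rename discards the original file names, it can be useful to keep a record of which UUID replaced which file. The sketch below is one possible variant and not part of the original workflow; the function `rename_with_log` and the demo file names are hypothetical.

```python
import os
import tempfile
from uuid import uuid4


def rename_with_log(sample_path: str) -> dict:
    """Rename FASTA files in one sample directory to UUIDv4 names.

    Returns a dict mapping each original file name to its new name.
    """
    mapping = {}
    for name in sorted(os.listdir(sample_path)):
        if name.endswith((".fa", ".fasta")):
            new_name = f"{uuid4()}.fa"
            os.rename(os.path.join(sample_path, name),
                      os.path.join(sample_path, new_name))
            mapping[name] = new_name
    return mapping


# Demo on a throwaway directory with two hypothetical MAG files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("bin.1.fa", "bin.2.fasta"):
        open(os.path.join(tmp, name), "w").close()
    rename_log = rename_with_log(tmp)
    remaining = sorted(os.listdir(tmp))

print(rename_log)
```

The returned mapping could be written to a small table so that a UUID can later be traced back to its source file.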


## Primary Key Generation
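A second loop prefixes every sample directory with its technique (`IL_` for Illumina, `PB_` for PacBio), so that the sample ID retains the technique information and can serve as the primary key.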


In [5]:
%%bash -s "$data_dir"
data_dir="$1"
# Prefix every sample directory with its technique: IL_ for Illumina, PB_ for PacBio
for d in "$data_dir"/Illumina_MAGs/*; do
    mv "$d" "$(dirname "$d")/IL_$(basename "$d")"
done
for d in "$data_dir"/PacBio_MAGs/*; do
    mv "$d" "$(dirname "$d")/PB_$(basename "$d")"
done


In [6]:
# import metadata
metadata_df = pd.read_csv(f"{data_dir}/merged_metadata_filtered.tsv", sep="\t", index_col=0)

# collect (sample-id, mag-id) pairs for every FASTA file whose sample appears in the metadata
prefixes = ("PB_", "IL_")
records = []

for technique in os.listdir(data_dir):
    tech_path = os.path.join(data_dir, technique)
    if not os.path.isdir(tech_path):
        continue
    for sample_id in os.listdir(tech_path):
        sample_path = os.path.join(tech_path, sample_id)
        if not os.path.isdir(sample_path):
            continue

        # derive ID for lookup
        lookup_id = sample_id
        for p in prefixes:
            if lookup_id.startswith(p):
                lookup_id = lookup_id[len(p):]

        for f in os.listdir(sample_path):
            if f.endswith((".fa", ".fasta")):
                if lookup_id in metadata_df.index:
                    mag_id = os.path.splitext(f)[0]
                    records.append((sample_id, mag_id))

primary_df = pd.DataFrame(records, columns=["sample-id", "mag-id"])

In [7]:
# Check sample-id
primary_df['sample-id'].unique()

array(['PB_B039_Aa_Gp_La', 'PB_MS001-3', 'PB_9b8b5', 'PB_M008',
       'PB_B038_Az_Gp_La', 'PB_e7c76', 'PB_P009', 'PB_M010',
       'PB_HM010-03', 'PB_B056_Sc_Na_Af', 'PB_HM010-01', 'PB_M004',
       'PB_36fe4', 'PB_a36ba', 'PB_M009', 'PB_P003', 'PB_P001', 'PB_M012',
       'PB_A001', 'PB_MS013-1', 'PB_B051_Aj_Po_Laf', 'PB_MS003-2',
       'PB_M006', 'PB_MS005-1', 'PB_B037_La_Ac_La', 'PB_MS009-1',
       'PB_3ee22', 'IL_B038_Az_Gp_La', 'IL_A001', 'IL_P003', 'IL_MS003-3',
       'IL_M009', 'IL_HM010-03', 'IL_M010', 'IL_B044_Hb_Ac_Ab',
       'IL_MS009-2', 'IL_MS009-1', 'IL_MS001-3', 'IL_M002', 'IL_M012',
       'IL_MS011-1', 'IL_B056_Sc_Na_Af', 'IL_M008', 'IL_B037_La_Ac_La',
       'IL_MS013-1', 'IL_MS005-1', 'IL_MS003-2', 'IL_HM010-01', 'IL_P009',
       'IL_B051_Aj_Po_Laf', 'IL_A002', 'IL_B039_Aa_Gp_La'], dtype=object)

In [8]:
#create 'sample' without the IL/PB prefix
primary_df['sample'] = primary_df['sample-id'].str.split(pat='_', n=1).str[1]
primary_df.head(3)

| | sample-id | mag-id | sample |
|---|---|---|---|
| 0 | PB_B039_Aa_Gp_La | 829f489a-5001-40f7-b6f2-3cfca555ec09 | B039_Aa_Gp_La |
| 1 | PB_B039_Aa_Gp_La | 559461fb-ff70-4610-9b15-fd7b67fcaefd | B039_Aa_Gp_La |
| 2 | PB_B039_Aa_Gp_La | 55e60f53-806c-486d-898f-f72cadfb37d6 | B039_Aa_Gp_La |
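
With the prefix-free `sample` column available, it becomes possible to check whether each sample really appears under both techniques; the `unique()` output above suggests that some samples appear under only one prefix. A minimal sketch on a toy stand-in for `primary_df` (sample IDs are taken from that output, MAG IDs are placeholders):

```python
import pandas as pd

# Toy stand-in for primary_df; the mag-id values are placeholders.
toy_df = pd.DataFrame({
    "sample-id": ["PB_M008", "IL_M008", "PB_9b8b5", "IL_M002"],
    "mag-id": ["mag-a", "mag-b", "mag-c", "mag-d"],
})
parts = toy_df["sample-id"].str.split(pat="_", n=1)
toy_df["technique"] = parts.str[0]
toy_df["sample"] = parts.str[1]

# Number of distinct techniques observed for each sample.
n_techniques = toy_df.groupby("sample")["technique"].nunique()
unpaired = sorted(n_techniques[n_techniques < 2].index)
print(unpaired)
```

Note that `primary_df` only lists samples that contain at least one FASTA file and appear in the metadata, so a sample flagged here may still exist in the raw data.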


In [9]:
metadata_df.head(3)

| sample_ID | samp_country | category | fermented_food_type |
|---|---|---|---|
| MS009-1 | Laos | fermented fish | Fermented_fish |
| A002 | Thailand | fermented meat | Fermented_pork_sausage_(Sai-krok_Isaan_Moo) |
| 3ee22 | Germany | fermented vegetables | Sauerkraut |


## Feature-based metadata
For distributions and preliminary analysis, a count of FASTA files per sample could be useful. We left-merge *primary_df* with the metadata to achieve this.

In [10]:
merged_df = primary_df.merge(metadata_df, left_on ='sample', right_index = True, how = 'left')
merged_df.head(3)

| | sample-id | mag-id | sample | samp_country | category | fermented_food_type |
|---|---|---|---|---|---|---|
| 0 | PB_B039_Aa_Gp_La | 829f489a-5001-40f7-b6f2-3cfca555ec09 | B039_Aa_Gp_La | Benin | fermented fish | Lanhouin |
| 1 | PB_B039_Aa_Gp_La | 559461fb-ff70-4610-9b15-fd7b67fcaefd | B039_Aa_Gp_La | Benin | fermented fish | Lanhouin |
| 2 | PB_B039_Aa_Gp_La | 55e60f53-806c-486d-898f-f72cadfb37d6 | B039_Aa_Gp_La | Benin | fermented fish | Lanhouin |
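
The per-sample MAG count mentioned above can be derived from the merged table with a `groupby`. A minimal sketch on toy data, assuming the same column names as *merged_df*; the MAG IDs are placeholders and the column name `n_mags` is an arbitrary choice:

```python
import pandas as pd

# Toy stand-in for merged_df with placeholder MAG IDs.
toy_merged = pd.DataFrame({
    "sample-id": ["PB_B039_Aa_Gp_La"] * 3 + ["IL_B039_Aa_Gp_La"] * 2,
    "mag-id": [f"mag-{i}" for i in range(5)],
    "category": ["fermented fish"] * 5,
})

# One row per sample, with the number of MAG files it contains.
mag_counts = (
    toy_merged.groupby(["sample-id", "category"])
    .size()
    .reset_index(name="n_mags")
)
print(mag_counts)
```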


The new *merged_df* can be saved as *Metadata_Extended.tsv* and used for exploratory analysis.
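
Saving the table, and deriving a sample-level metadata table keyed by the new primary key, could look like the sketch below. This is an assumption about how the step might be done, since the notebook does not show it; the toy data, the temporary output directory, and the name `sample_metadata` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for merged_df with placeholder MAG IDs.
toy_merged = pd.DataFrame({
    "sample-id": ["PB_B039_Aa_Gp_La", "PB_B039_Aa_Gp_La", "IL_B039_Aa_Gp_La"],
    "mag-id": ["mag-0", "mag-1", "mag-2"],
    "sample": ["B039_Aa_Gp_La"] * 3,
    "samp_country": ["Benin"] * 3,
    "category": ["fermented fish"] * 3,
})

# Sample-level table: one row per prefixed sample-id, MAG column dropped.
sample_metadata = (
    toy_merged.drop(columns=["mag-id"])
    .drop_duplicates(subset="sample-id")
    .set_index("sample-id")
)

with tempfile.TemporaryDirectory() as out_dir:  # stand-in for data_dir
    extended_path = os.path.join(out_dir, "Metadata_Extended.tsv")
    toy_merged.to_csv(extended_path, sep="\t", index=False)
    reloaded = pd.read_csv(extended_path, sep="\t")

print(sample_metadata)
```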

## Getting things ready for qiime2 Import
### Fixing directory tree
To import sequences correctly into qiime, we need to change the directory tree from the current layout:

data/  
&nbsp;&nbsp;&nbsp;&nbsp;- technique 1  
&nbsp;&nbsp;&nbsp;&nbsp;- technique 2  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- sample 1  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- sample 2  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- seq1.fa  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- seq2.fa  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...  
To the required:  
data/  
&nbsp;&nbsp;&nbsp;&nbsp;- seq1.fa  
&nbsp;&nbsp;&nbsp;&nbsp;- seq2.fa  
&nbsp;&nbsp;&nbsp;&nbsp;...  
&nbsp;&nbsp;&nbsp;&nbsp;MANIFEST

qiime would import the original tree without issue, but the subsequent BUSCO evaluation job would fail because of a path-related error. That is why the nesting was removed.

Let's loop through all folders and move any .fa or .fasta files found to a new *data_dir/sample_data* folder.

In [11]:
%%bash -s "$data_dir"
data_dir="$1"

# Destination folder
dest_dir="$data_dir/sample_data"
mkdir -p "$dest_dir"

# Function to move fasta files safely
move_fasta() {
    folder="$1"
    # Use find to safely handle no matches
    find "$folder" -type f \( -name "*.fa" -o -name "*.fasta" \) -exec mv {} "$dest_dir" \;
}

# Move files from Illumina and PacBio folders
move_fasta "$data_dir/Illumina_MAGs"
move_fasta "$data_dir/PacBio_MAGs"

# Delete the directory husk left behind (quoted so paths with spaces are safe)
rm -r "$data_dir/Illumina_MAGs"
rm -r "$data_dir/PacBio_MAGs"

### MANIFEST

The MANIFEST is a file that associates each file's data with its location. The [Moshpit Tutorial](https://moshpit.qiime2.org/en/stable/chapters/howtos/import/) for importing non-dereplicated MAGs provides no information about the MANIFEST format. This is not a problem, since the format can be taken from the week 3 tutorial.

In [12]:
#import MANIFEST from the tutorial
w3 = pd.read_csv(f'{data_dir}/w3_MANIFEST', sep = '\t')
w3.head(1)

FileNotFoundError: [Errno 2] No such file or directory: 'data/01_Data_Reshaping/w3_MANIFEST'

The tutorial MANIFEST is a **2-column**, **tab-separated** file. From the Moshpit tutorial we know that the sequence file names have to be UUIDv4-generated. For the MAGs, however, a **3-column, comma-separated** file is required.

It is a matter of creating a DataFrame with **'sample-id'** and **'mag-id'** from our newly created *merged_df*, plus the **path** of each file in *sample_data*. The path must match the future location on the cluster, because the files need to be imported into a cache for ease of access by the compute nodes. This was actually one of the first roadblocks of the project: understanding how to get everything into a ready-to-go state on Jupyter (where the UI is more visually intuitive for novice bioinformaticians) and then move it to Euler was a real challenge.

In [None]:
Euler_dir = '/cluster/scratch/emotta/sample_data'  # target import directory

# Build filename column from mag-id
manifest_df = merged_df[["sample-id", "mag-id"]].copy()
manifest_df["filename"] = manifest_df["mag-id"].apply(
    lambda x: f"{Euler_dir}/{x}.fa"
)
manifest_df.head(3)


This has easily been one of the most challenging steps in our project, since no proper documentation exists about this, even after asking the Wednesday session supervisors. The only guide was the error messages from the ***Mosh tools cache-import*** command when trying to import the sequences into qiime:
 
 <span style="color:red">*MANIFEST is not a(n) MultiMAGManifestFormat file  
Found header on line 1 with the following labels: ['sample-id\tabsolute-filepath'], expected: ['sample-id', 'mag-id', 'filename']*</span>

And later 

<span style="color:red"> Found header on line 1 with the following labels: ['sample-id\tmag-id\tfilename'], expected: ['sample-id', 'mag-id', 'filename']</span>
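
Both messages are consistent with the header being split on commas: a tab-separated header then collapses into a single label. The short illustration below uses Python's `csv` module to reproduce the symptom; it is not QIIME 2's actual validation code.

```python
import csv
import io

tab_header = "sample-id\tmag-id\tfilename\n"
comma_header = "sample-id,mag-id,filename\n"

# Parse both headers with a comma delimiter (the csv.reader default).
labels_from_tab = next(csv.reader(io.StringIO(tab_header)))
labels_from_comma = next(csv.reader(io.StringIO(comma_header)))

print(labels_from_tab)    # a single label that still contains the tabs
print(labels_from_comma)  # three separate labels
```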

In retrospect, it was not a complicated issue, but combined with the steep learning curve of both qiime and Euler at the same time, it added up to a major setback in the first part of the semester.

We are now ready to save the MANIFEST in the *sample_data* directory

In [None]:
manifest_df.to_csv(f'{data_dir}/sample_data/MANIFEST', sep=",", index=False)
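
Before transferring anything, it may be worth checking that the MANIFEST header is the expected one and that every listed file actually exists in *sample_data*. A minimal sketch; the helper `missing_from_manifest`, the placeholder MAG IDs, and the path `/cluster/path` are hypothetical.

```python
import os
import tempfile

import pandas as pd


def missing_from_manifest(manifest_path: str, fasta_dir: str) -> list:
    """Return manifest file names that have no matching file in fasta_dir."""
    manifest = pd.read_csv(manifest_path)
    expected = ["sample-id", "mag-id", "filename"]
    if list(manifest.columns) != expected:
        raise ValueError(f"unexpected header: {list(manifest.columns)}")
    present = set(os.listdir(fasta_dir))
    # Compare base names only, since the manifest paths point at the cluster.
    return [f for f in manifest["filename"]
            if os.path.basename(f) not in present]


# Demo: the manifest lists two files, but only one exists locally.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "mag-0.fa"), "w").close()
    pd.DataFrame({
        "sample-id": ["PB_M008", "PB_M008"],
        "mag-id": ["mag-0", "mag-1"],
        "filename": ["/cluster/path/mag-0.fa", "/cluster/path/mag-1.fa"],
    }).to_csv(os.path.join(tmp, "MANIFEST"), index=False)
    missing = missing_from_manifest(os.path.join(tmp, "MANIFEST"), tmp)

print(missing)
```

An empty list means every MANIFEST entry has a matching file in the directory.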

## Clean sample_data
Another common issue we encountered was an error caused by the presence of temporary and hidden files (*.files*) created by Jupyter and moved over from the raw data directories.


In [None]:
!find "$data_dir/sample_data" -maxdepth 1 -type f -name ".*" -delete
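
Since `-delete` is irreversible, listing the hidden files first can serve as a dry run. A small Python sketch of the same selection (regular files directly inside the directory whose names start with a dot); the helper name `hidden_files` and the demo file names are illustrative.

```python
import tempfile
from pathlib import Path


def hidden_files(directory: str) -> list:
    """List names of hidden regular files directly inside `directory`."""
    return sorted(p.name for p in Path(directory).iterdir()
                  if p.is_file() and p.name.startswith("."))


# Demo on a throwaway directory with one hidden and one regular file.
with tempfile.TemporaryDirectory() as tmp:
    for name in (".DS_Store", "mag-0.fa"):
        (Path(tmp) / name).touch()
    found = hidden_files(tmp)

print(found)
```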

## Zip and upload to Euler
*sample_data* has been cleaned and can now be compressed into a single archive.

In [None]:
!tar -czf sample_data.tar.gz -C "$data_dir" sample_data
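
The archive contents can be inspected without extracting them, which is a cheap check before the upload. The sketch below builds a tiny stand-in archive with Python's `tarfile` module and reads its member list back; the file names are placeholders.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for sample_data.
    sample_dir = os.path.join(tmp, "sample_data")
    os.makedirs(sample_dir)
    for name in ("MANIFEST", "mag-0.fa"):
        open(os.path.join(sample_dir, name), "w").close()

    archive = os.path.join(tmp, "sample_data.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(sample_dir, arcname="sample_data")

    # Read the member list back without extracting anything.
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())

print(members)
```

For the real archive, only the second `tarfile.open` call would be needed, pointed at *sample_data.tar.gz*.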


**All set!** *sample_data* is now ready to be downloaded and manually uploaded to the ${HOME} directory on Euler for safekeeping.