# 0 Setup

In [1]:
# importing all required packages at the start of the notebook
import IPython
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
os.getcwd() #Get the working directory

'/home/jovyan/assignments/FunGut-Project'

In [3]:
data_dir = "/home/jovyan/assignments/FunGut-Project" #Store the folder's path

# 1 Importing the data

In [33]:
# Getting our data from the polybox:
!wget -O fungut_forward_reads.qza "https://polybox.ethz.ch/index.php/s/uV06vmm96ZzB5eM/download/fungut_forward_reads.qza"
!wget -O fungut_sample_metadata.tsv "https://polybox.ethz.ch/index.php/s/CA76kKFC9FApqpR/download/fungut_metadata.tsv"

--2025-10-05 12:36:12--  https://polybox.ethz.ch/index.php/s/uV06vmm96ZzB5eM/download/fungut_forward_reads.qza
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 712595535 (680M) [application/octet-stream]
Saving to: ‘fungut_forward_reads.qza’


2025-10-05 12:36:13 (405 MB/s) - ‘fungut_forward_reads.qza’ saved [712595535/712595535]

--2025-10-05 12:36:15--  https://polybox.ethz.ch/index.php/s/CA76kKFC9FApqpR/download/fungut_metadata.tsv
Resolving polybox.ethz.ch (polybox.ethz.ch)... 129.132.71.243
Connecting to polybox.ethz.ch (polybox.ethz.ch)|129.132.71.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18798 (18K) [application/octet-stream]
Saving to: ‘fungut_sample_metadata.tsv’


2025-10-05 12:36:15 (2.26 MB/s) - ‘fungut_sample_metadata.tsv’ saved [18798/18798]



In [5]:
# To check that our files are in the right place:
qza_file = f"{data_dir}/fungut_forward_reads.qza" #Store the sequences file
tsv_file = f"{data_dir}/fungut_sample_metadata.tsv" #Store the sample metadata file
print("File exists?", os.path.exists(qza_file), os.path.exists(tsv_file))

File exists? True True
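`os.path.exists` only confirms that a file with that name is present; an interrupted download can still leave an empty file or a saved error page. Since `.qza` files are zip archives, a slightly stricter check is possible. The sketch below is only an illustration: the helper name `looks_like_qiime_archive` is hypothetical, and the demo runs on throwaway files rather than on the real download.

```python
import os
import tempfile
import zipfile


def looks_like_qiime_archive(path):
    """Return True if `path` is an existing, non-empty, readable zip file."""
    return (
        os.path.isfile(path)
        and os.path.getsize(path) > 0
        and zipfile.is_zipfile(path)
    )


# Demo on temporary toy files (hypothetical names, not the project data).
with tempfile.TemporaryDirectory() as tmp:
    # A tiny valid zip archive standing in for a .qza file.
    good = os.path.join(tmp, "toy.qza")
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("toy-uuid/metadata.yaml", "type: toy\n")

    # A text file with a .qza name, e.g. a saved error page.
    bad = os.path.join(tmp, "broken.qza")
    with open(bad, "w") as fh:
        fh.write("<html>error page</html>")

    good_ok = looks_like_qiime_archive(good)
    bad_ok = looks_like_qiime_archive(bad)
    missing_ok = looks_like_qiime_archive(os.path.join(tmp, "missing.qza"))

print(good_ok, bad_ok, missing_ok)
```

In the notebook, the same helper could be called on `qza_file`; the metadata file is plain TSV, so only the existence and size checks apply to it.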


# 3 Feature table construction

## 3.1 First overview of our sample and quality score assessment 

In [6]:
!qiime demux summarize \
  --i-data fungut_forward_reads.qza \
  --o-visualization demux-summary.qzv

Saved Visualization to: demux-summary.qzv

In [7]:
Visualization.load(f"{data_dir}/demux-summary.qzv")

The mean read length is 151 nt. Read quality stays high along the whole read, even at the end of the sequences (mean quality score of about 38 at position 151).
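To put a quality score of 38 into perspective, a Phred score Q corresponds to an error probability of 10^(-Q/10). The short sketch below converts the score and estimates the expected number of errors in a 151 nt read, under the simplifying assumption (mine, not taken from the plot) that every position has Q = 38. The function names are illustrative.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def expected_errors(mean_q, read_length):
    """Expected errors per read if every base had the same quality score."""
    return read_length * phred_to_error_prob(mean_q)


p38 = phred_to_error_prob(38)
ee = expected_errors(38, 151)

print(f"error probability at Q38: {p38:.2e}")
print(f"expected errors in a 151 nt read: {ee:.3f}")
```

Under that assumption, far fewer than one error per read is expected, which is consistent with the decision not to trim the reads.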

## 3.2 Denoising and creation of ASVs

In [17]:
! qiime dada2 denoise-single \
    --i-demultiplexed-seqs $data_dir/fungut_forward_reads.qza \
    --p-trunc-len 0 \
    --p-n-threads 3 \
    --o-table $data_dir/dada2_table_no_trunc.qza \
    --o-representative-sequences $data_dir/dada2_rep_set_no_trunc.qza \
    --o-denoising-stats $data_dir/dada2_stats_no_trunc.qza

Saved FeatureTable[Frequency] to: /home/jovyan/assignments/FunGut-Project/dada2_table_no_trunc.qza
Saved FeatureData[Sequence] to: /home/jovyan/assignments/FunGut-Project/dada2_rep_set_no_trunc.qza
Saved SampleData[DADA2Stats] to: /home/jovyan/assignments/FunGut-Project/dada2_stats_no_trunc.qza

I first tried a `--p-trunc-len` of 150 nt, but this discarded too many sequences. I do not want to truncate at a shorter length either, because we would lose a lot of sequence information. Since ITS regions usually vary a lot in length and the mean quality of our reads is good, I want to keep all of them (`--p-trunc-len 0` disables truncation). However, after denoising I want to remove the features that are too rare or that occur in too few samples, to avoid too much noise in the downstream analysis:

In [19]:
! qiime feature-table filter-features \
  --i-table $data_dir/dada2_table_no_trunc.qza \
  --p-min-frequency 10 \
  --p-min-samples 2 \
  --o-filtered-table $data_dir/dada2_table.qza

Saved FeatureTable[Frequency] to: /home/jovyan/assignments/FunGut-Project/dada2_table.qza

We have now removed all features with a total frequency below 10 reads across all samples (`--p-min-frequency 10` is an absolute read count, not a percentage), as well as features observed in fewer than 2 samples. Next, I will update the representative sequences so that they match the filtered table.
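Before moving on, the two thresholds can be made concrete on a toy table. The sketch below is plain Python rather than QIIME 2, and all feature IDs, sample IDs, counts and sequences are invented for illustration. It keeps a feature only if both conditions hold, then restricts the sequences to the retained feature IDs.

```python
MIN_FREQUENCY = 10  # minimum total read count across all samples
MIN_SAMPLES = 2     # minimum number of samples the feature is observed in

# Toy feature table: feature ID -> {sample ID: read count} (invented values).
table = {
    "asv_a": {"s1": 40, "s2": 25, "s3": 0},  # 65 reads, 2 samples
    "asv_b": {"s1": 0, "s2": 12, "s3": 0},   # 12 reads, but only 1 sample
    "asv_c": {"s1": 3, "s2": 2, "s3": 1},    # 3 samples, but only 6 reads
    "asv_d": {"s1": 6, "s2": 0, "s3": 5},    # 11 reads, 2 samples
}

# Toy representative sequences, keyed by the same feature IDs.
sequences = {"asv_a": "ACGT", "asv_b": "ACGA", "asv_c": "ACGC", "asv_d": "ACGG"}


def keep_feature(counts, min_frequency, min_samples):
    """Apply both thresholds to the per-sample counts of one feature."""
    total = sum(counts.values())
    n_samples = sum(1 for count in counts.values() if count > 0)
    return total >= min_frequency and n_samples >= min_samples


filtered_table = {
    feature: counts
    for feature, counts in table.items()
    if keep_feature(counts, MIN_FREQUENCY, MIN_SAMPLES)
}

# Keep only the sequences whose feature survived the table filter.
filtered_sequences = {
    feature: seq for feature, seq in sequences.items() if feature in filtered_table
}

print(sorted(filtered_table))
```

Here `asv_b` fails the sample threshold and `asv_c` fails the frequency threshold, so only `asv_a` and `asv_d` remain in both the table and the sequences.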

In [20]:
! qiime feature-table filter-seqs \
  --i-data $data_dir/dada2_rep_set_no_trunc.qza \
  --i-table $data_dir/dada2_table.qza \
  --o-filtered-data $data_dir/dada2_rep_set.qza

Saved FeatureData[Sequence] to: /home/jovyan/assignments/FunGut-Project/dada2_rep_set.qza

Now that denoising and filtering are done, we can look at the results. Let's start with the denoising statistics (these were computed before filtering, so nothing is filtered here yet):

In [21]:
! qiime metadata tabulate \
    --m-input-file $data_dir/dada2_stats_no_trunc.qza \
    --o-visualization $data_dir/dada2_stats_no_trunc.qzv

Saved Visualization to: /home/jovyan/assignments/FunGut-Project/dada2_stats_no_trunc.qzv

In [22]:
Visualization.load(f"{data_dir}/dada2_stats_no_trunc.qzv")

At a glance, the statistics look good: we do not lose too many sequences.
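The visual impression could be backed up with a per-sample retention figure. The sketch below uses invented numbers and assumes the statistics provide an `input` and a `non-chimeric` read count per sample; these column names should be checked against the actual table before reuse.

```python
from statistics import median

# Toy denoising statistics (invented values; column names are assumptions).
stats = [
    {"sample-id": "toy1", "input": 20000, "non-chimeric": 17000},
    {"sample-id": "toy2", "input": 15000, "non-chimeric": 9000},
    {"sample-id": "toy3", "input": 30000, "non-chimeric": 27000},
]


def percent_retained(row):
    """Percentage of input reads that survive denoising and chimera removal."""
    if row["input"] == 0:
        return 0.0  # avoid dividing by zero for empty samples
    return 100 * row["non-chimeric"] / row["input"]


retained = {row["sample-id"]: percent_retained(row) for row in stats}
worst_sample = min(retained, key=retained.get)
median_retained = median(retained.values())

print(f"median retained: {median_retained:.1f}%")
print(f"lowest retention: {worst_sample} ({retained[worst_sample]:.1f}%)")
```

Reporting the median together with the worst sample makes it easier to spot individual samples that lose many reads even when the overall picture looks fine.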

Now we are going to visualize the sequences:

In [23]:
! qiime feature-table tabulate-seqs \
    --i-data $data_dir/dada2_rep_set.qza \
    --o-visualization $data_dir/dada2_rep_set.qzv

Saved Visualization to: /home/jovyan/assignments/FunGut-Project/dada2_rep_set.qzv

In [24]:
Visualization.load(f"{data_dir}/dada2_rep_set.qzv")

We can see that we have 145 unique features (=ASVs) after filtering.

Finally, we are going to summarize the filtered feature table together with the sample metadata:

In [25]:
! qiime feature-table summarize \
    --i-table $data_dir/dada2_table.qza \
    --m-sample-metadata-file $data_dir/fungut_sample_metadata.tsv \
    --o-visualization $data_dir/dada2_table.qzv

Saved Visualization to: /home/jovyan/assignments/FunGut-Project/dada2_table.qzv

In [26]:
Visualization.load(f"{data_dir}/dada2_table.qzv")

Bonus: we can compare this table, which contains only the filtered features (total frequency >= 10 reads and present in >= 2 samples), with a summary of all features prior to filtering:

In [29]:
! qiime feature-table summarize \
    --i-table $data_dir/dada2_table_no_trunc.qza \
    --m-sample-metadata-file $data_dir/fungut_sample_metadata.tsv \
    --o-visualization $data_dir/dada2_table_no_trunc.qzv

Saved Visualization to: /home/jovyan/assignments/FunGut-Project/dada2_table_no_trunc.qzv

In [30]:
Visualization.load(f"{data_dir}/dada2_table_no_trunc.qzv")

The thing is that my filter removed a lot of features that either appear in only one sample or have a total frequency below 10 reads. I don't know how to decide whether that is good or not.
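One possible way to approach this question is to compare how many features are removed with how many reads are removed, for a few thresholds. The sketch below uses invented per-feature totals, not values from our tables, and the helper `retained_share` is hypothetical. If the removed features carry only a small share of the reads, the filter mostly removes rare noise.

```python
# Toy data: feature ID -> (total read count, number of samples observed in).
features = {
    "asv_a": (5000, 40),
    "asv_b": (1200, 12),
    "asv_c": (11, 2),
    "asv_d": (9, 3),
    "asv_e": (300, 1),
    "asv_f": (2, 1),
}


def retained_share(features, min_frequency, min_samples):
    """Return (share of features kept, share of reads kept) for the thresholds."""
    kept_totals = [
        total
        for total, n_samples in features.values()
        if total >= min_frequency and n_samples >= min_samples
    ]
    all_reads = sum(total for total, _ in features.values())
    return len(kept_totals) / len(features), sum(kept_totals) / all_reads


# Vary the frequency threshold while keeping min_samples fixed at 2.
for min_frequency in (1, 10, 100):
    feature_share, read_share = retained_share(features, min_frequency, 2)
    print(
        f"min_frequency={min_frequency}: "
        f"{feature_share:.0%} of features, {read_share:.1%} of reads kept"
    )
```

In this toy example half of the features are dropped at the thresholds used above while most reads are kept. A feature like `asv_e`, which is abundant but confined to one sample, is the kind of case that may deserve a closer look before it is discarded.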

# Analysis of the sample metadata

In [31]:
sample_metadata = pd.read_csv("fungut_sample_metadata.tsv", sep="\t")
sample_metadata.head()

Unnamed: 0,ID,country_sample,state_sample,latitude_sample,longitude_sample,sex_sample,age_years_sample,height_cm_sample,weight_kg_sample,bmi_sample,diet_type_sample,ibd_sample,gluten_sample
0,ERR5327198,USA,TN,36.1,-86.8,female,67.0,152.0,41.0,17.75,Omnivore,I do not have this condition,No
1,ERR5327199,USA,DC,38.9,-77.1,male,55.0,182.0,79.0,23.73,Omnivore,I do not have this condition,I was diagnosed with gluten allergy (anti-glut...
2,ERR5327266,USA,VA,38.9,-77.1,female,28.0,175.0,61.0,19.94,Omnivore,I do not have this condition,I do not eat gluten because it makes me feel bad
3,ERR5327282,United Kingdom,Not provided,51.6,-0.2,female,26.0,166.0,60.0,21.77,Omnivore,I do not have this condition,No
4,ERR5327284,United Kingdom,Not provided,51.5,-0.2,female,25.0,173.0,59.0,20.01,Vegetarian but eat seafood,I do not have this condition,No


I think we could start by finding metadata categories that are strongly associated with each other and group them, so that our analysis does not contain redundant variables (for example, latitude and longitude are probably largely determined by country and state).
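A simple first check for such redundancy is to ask whether one set of columns fully determines another. The sketch below does this in plain Python on the five rows shown in the `head()` output above, so it says nothing yet about the full table. The helper `determines` is hypothetical.

```python
from collections import defaultdict

# Rows copied from the head() output: ID, country, state, latitude, longitude.
rows = [
    ("ERR5327198", "USA", "TN", 36.1, -86.8),
    ("ERR5327199", "USA", "DC", 38.9, -77.1),
    ("ERR5327266", "USA", "VA", 38.9, -77.1),
    ("ERR5327282", "United Kingdom", "Not provided", 51.6, -0.2),
    ("ERR5327284", "United Kingdom", "Not provided", 51.5, -0.2),
]


def determines(rows, key_of, value_of):
    """Return True if every key maps to exactly one distinct value."""
    seen = defaultdict(set)
    for row in rows:
        seen[key_of(row)].add(value_of(row))
    return all(len(values) == 1 for values in seen.values())


def coordinates(row):
    """Extract the (latitude, longitude) pair of a row."""
    return (row[3], row[4])


coords_to_country = determines(rows, coordinates, lambda row: row[1])
coords_to_state = determines(rows, coordinates, lambda row: row[2])

print("coordinates determine country:", coords_to_country)
print("coordinates determine state:", coords_to_state)
```

In these five rows the coordinates determine the country but not the state, since DC and VA share the same rounded coordinates. On the full table the same question could be asked with a pandas `groupby` on the coordinate columns followed by `nunique` on the country or state column.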

We are also going to answer the basic questions asked in our FunGut guideline and in the general group project guideline.
Once we know the taxa, it could be interesting to check whether they are consistent with what the current literature reports for the subjects' diets, health issues, etc., and to see which factors seem to have the strongest effect (for example, health issues versus diet).
We could also look at whether these taxa are indicators of good health, and perhaps search the literature for how these subjects could improve their gut health.