# Fungut

# 00 Packages and Directory

In [28]:
%matplotlib inline

import os
import IPython
import pandas as pd
import matplotlib.pyplot as plt

import qiime2 as q2
from qiime2 import Visualization

data_dir = 'data'
PATH = "data/fungut_metadata.tsv"
surveys_df = pd.read_csv(PATH, sep="\t")

# 01 Data import

In [2]:
! qiime tools peek $data_dir/fungut_forward_reads.qza

UUID:        3638611d-1767-413b-9390-70ee3d78e4ff
Type:        SampleData[SequencesWithQuality]
Data format: SingleLanePerSampleSingleEndFastqDirFmt


In [3]:
! qiime demux summarize \
  --i-data $data_dir/fungut_forward_reads.qza \
  --o-visualization $data_dir/01/demux_summary.qzv

Saved Visualization to: data/01/demux_summary.qzv

In [4]:
Visualization.load(f"{data_dir}/01/demux_summary.qzv")

# 02 Trimming the primers

In [5]:
! qiime cutadapt trim-single \
  --i-demultiplexed-sequences $data_dir/fungut_forward_reads.qza \
  --p-front CTTGGTCATTTAGAGGAAGTAA \
  --o-trimmed-sequences $data_dir/02/fungut_forward_reads_trimmed.qza \
  --verbose

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: cutadapt -u 0 --error-rate 0.1 --times 1 --overlap 3 --minimum-length 1 -q 0,0 --quality-base 33 --cores 1 -o /tmp/qiime2/jovyan/processes/377-1763043391.11@jovyan/tmp/q2-OutPath-fo8glm1p/ERR5327198_01_L001_R1_001.fastq.gz --front CTTGGTCATTTAGAGGAAGTAA /tmp/qiime2/jovyan/data/3638611d-1767-413b-9390-70ee3d78e4ff/data/ERR5327198_01_L001_R1_001.fastq.gz

This is cutadapt 5.1 with Python 3.10.14
Command line parameters: -u 0 --error-rate 0.1 --times 1 --overlap 3 --minimum-length 1 -q 0,0 --quality-base 33 --cores 1 -o /tmp/qiime2/jovyan/processes/377-1763043391.11@jovyan/tmp/q2-OutPath-fo8glm1p/ERR5327198_01_L001_R1_001.fastq.gz --front CTTGGTCATTTAGAGGAAGTAA /tmp/qiime2/jovyan/data/3638611d-1767-413b-9390-70ee3d78e4ff/data/E

In [6]:
! qiime demux summarize \
  --i-data $data_dir/02/fungut_forward_reads_trimmed.qza \
  --o-visualization $data_dir/02/demux_summary_posttrimming.qzv

Saved Visualization to: data/02/demux_summary_posttrimming.qzv

In [7]:
Visualization.load(f"{data_dir}/02/demux_summary_posttrimming.qzv")

# 03 Denoising

### --p-max-ee 4 sets the maximum number of expected errors per read that DADA2's filtering accepts. A value of 4 is relatively lenient; 1 to 2 is stricter and removes more reads. Choose the value based on the post-trimming quality (see the demux summary).
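
To make the threshold concrete, here is a minimal sketch of how expected errors follow from Phred quality scores, using the standard definition EE = sum(10^(-Q/10)). The two reads below are made-up examples, not reads from this dataset, and the helper names are hypothetical.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities: EE = sum(10 ** (-Q / 10))."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_max_ee(phred_scores, max_ee):
    """A read is kept when its expected errors do not exceed max_ee."""
    return expected_errors(phred_scores) <= max_ee


# Hypothetical reads (assumption, for illustration only)
high_quality = [35] * 200               # uniformly good read
mixed_quality = [35] * 150 + [15] * 100  # good start, poor tail

for name, read in [("high_quality", high_quality), ("mixed_quality", mixed_quality)]:
    ee = expected_errors(read)
    print(f"{name}: EE={ee:.2f}, kept at max-ee 4: {passes_max_ee(read, 4)}, "
          f"kept at max-ee 2: {passes_max_ee(read, 2)}")
```

In this toy case the read with the poor tail survives the lenient setting (4) but is removed by the stricter one (2).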

In [11]:
! qiime dada2 denoise-single \
    --i-demultiplexed-seqs $data_dir/02/fungut_forward_reads_trimmed.qza \
    --p-trim-left 0 \
    --p-trunc-len 0 \
    --p-min-fold-parent-over-abundance 4 \
    --p-max-ee 4 \
    --o-representative-sequences $data_dir/03/dada2_rep_seqs_1.qza \
    --o-table $data_dir/03/dada2_table_1.qza \
    --o-denoising-stats $data_dir/03/dada2_stats_1.qza

Saved FeatureTable[Frequency] to: data/03/dada2_table_1.qza
Saved FeatureData[Sequence] to: data/dada2_rep_seqs_1.qza
Saved SampleData[DADA2Stats] to: data/03/dada2_stats_1.qza

In [15]:
# Convert denoising statistics to a QZV (shows per-sample read counts through DADA2 steps)
! qiime metadata tabulate \
    --m-input-file $data_dir/03/dada2_stats_1.qza \
    --o-visualization $data_dir/03/dada2_stats_1.qzv

Saved Visualization to: data/03/dada2_stats_1.qzv

In [16]:
Visualization.load(f"{data_dir}/03/dada2_stats_1.qzv")

In [17]:
# Create interactive table of ASV representative sequences (inspect sequence lengths/content)
! qiime feature-table tabulate-seqs \
    --i-data $data_dir/03/dada2_rep_seqs_1.qza \
    --o-visualization $data_dir/03/dada2_rep_seqs_1.qzv

Saved Visualization to: data/03/dada2_rep_seqs_1.qzv

In [18]:
Visualization.load(f"{data_dir}/03/dada2_rep_seqs_1.qzv")

In [19]:
# Summarize feature table (check per-sample read depth, number of features, and overall counts)
! qiime feature-table summarize \
    --i-table $data_dir/03/dada2_table_1.qza \
    --m-sample-metadata-file $data_dir/fungut_metadata.tsv \
    --o-visualization $data_dir/03/dada2_table_1.qzv

Saved Visualization to: data/03/dada2_table_1.qzv

In [20]:
Visualization.load(f"{data_dir}/03/dada2_table_1.qzv")

Comparing denoised data with truncation lengths 0 and 150:
ITS sequences vary more in length than 16S sequences, so a truncation length that is too short risks losing valid ITS reads. Because the quality of the original data was good overall (above 30 everywhere), it is better to keep all sequences. We therefore continue the further steps with the data obtained in run 1 (truncation length 0).
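
The risk can be illustrated with a small sketch. It relies on the documented DADA2 behaviour that reads shorter than the truncation length are discarded and that a value of 0 disables truncation. The read lengths are invented for illustration and the function name is hypothetical.

```python
def reads_kept_after_truncation(read_lengths, trunc_len):
    """Count reads that survive truncation.

    trunc_len == 0 means no truncation or length filtering;
    otherwise reads shorter than trunc_len are discarded.
    """
    if trunc_len == 0:
        return len(read_lengths)
    return sum(1 for length in read_lengths if length >= trunc_len)


# Hypothetical post-trimming read lengths (assumption): ITS amplicons of
# variable length, some shorter than the sequencing read length.
read_lengths = [90, 120, 140, 151, 151, 151]

for trunc_len in (0, 150):
    kept = reads_kept_after_truncation(read_lengths, trunc_len)
    print(f"trunc-len {trunc_len}: {kept} of {len(read_lengths)} reads kept")
```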

# 03 Checks to run after denoising: pass/fail criteria for deciding whether to proceed to taxonomy

Open the three QZVs loaded above. For each, check the following items.


B. From dada2_table_1.qzv (feature-table summary):

Look at the sample frequency histogram. Decide a minimum read-depth cutoff for downstream analyses (alpha/beta). The cutoff should retain most samples but filter out failures. Document the cutoff.

Look at number of features per sample: if many samples have very few features, they might be failures.

Check whether any samples were dropped entirely (count=0).

C. From dada2_rep_seqs_1.qzv (rep sequences):

Inspect the length distribution of ASVs. ITS has variable lengths — confirm that distribution is biologically plausible and not dominated by very short sequences (primer dimers) or artifacts.

If you find many very short sequences (<100 nt), consider filtering by minimum sequence length before taxonomic classification.
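
A minimal sketch of such a length filter in plain Python, assuming the representative sequences have been exported to a mapping of feature ID to sequence. The IDs and sequences are placeholders, and the 100 nt threshold is the one suggested above.

```python
def filter_short_sequences(sequences, min_length=100):
    """Keep only sequences that are at least min_length nucleotides long."""
    return {
        feature_id: seq
        for feature_id, seq in sequences.items()
        if len(seq) >= min_length
    }


# Placeholder ASVs (assumption): one plausible amplicon, one short artifact
sequences = {
    "asv_long": "ACGT" * 50,   # 200 nt
    "asv_short": "ACGT" * 10,  # 40 nt, e.g. a primer dimer
}

kept = filter_short_sequences(sequences, min_length=100)
print(sorted(kept))
```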

D. Basic cross-checks:

Ensure that the sample IDs in dada2_table_1.qzv match those in the metadata file fungut_metadata.tsv. Run qiime metadata tabulate --m-input-file fungut_metadata.tsv to generate a metadata QZV, then compare the IDs visually. Mismatched names cause downstream failures.
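
Instead of comparing by eye, the ID check can also be sketched with set differences. The ID lists below are placeholders; in practice they would come from the feature table and from the first column of the metadata file.

```python
def compare_sample_ids(table_ids, metadata_ids):
    """Report IDs that appear on only one side of the comparison."""
    table_ids, metadata_ids = set(table_ids), set(metadata_ids)
    return {
        "only_in_table": sorted(table_ids - metadata_ids),
        "only_in_metadata": sorted(metadata_ids - table_ids),
    }


# Placeholder IDs (assumption, for illustration only)
table_ids = ["S001", "S002", "S003"]
metadata_ids = ["S001", "S002", "S004"]

mismatches = compare_sample_ids(table_ids, metadata_ids)
print(mismatches)
```

Two empty lists mean the IDs agree. Anything in only_in_table has no metadata row and would fail in downstream steps that need metadata.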

Compute the per-sample fraction of unassigned reads after taxonomy (the next step) and flag samples with, for example, >50% unassigned.

If these checks are acceptable (reasonable retention, acceptable chimera rate, plausible sequence length distribution, few low-depth samples), you can proceed to taxonomy.

# 04 Taxonomy Linus (done on Euler, imported qza file)

In [None]:
! qiime metadata tabulate \
    --m-input-file $data_dir/04/taxjoblinus.qza \
    --o-visualization $data_dir/04/taxonomylinus.qzv

In [None]:
Visualization.load(f"{data_dir}/04/taxonomylinus.qzv")

In [None]:
! qiime taxa barplot \
    --i-table $data_dir/03/dada2_table_1.qza \
    --i-taxonomy $data_dir/04/taxjoblinus.qza \
    --m-metadata-file $data_dir/fungut_metadata.tsv \
    --o-visualization $data_dir/04/taxa-bar-plots_linus.qzv

In [None]:
Visualization.load(f"{data_dir}/04/taxa-bar-plots_linus.qzv")

# 05 Filtering

We need to filter out all samples whose taxonomy is k_unclassified, only k_fungi, or anything other than k_fungi.
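
A minimal sketch of this rule as a predicate on taxonomy strings. It assumes semicolon-separated strings with double-underscore rank prefixes (for example k__Fungi;p__Ascomycota); the actual labels in taxjoblinus.qza may be formatted differently, so the kingdom label would need adjusting. The function name is hypothetical.

```python
def is_classified_fungus(taxon, kingdom_label="k__fungi"):
    """True if the kingdom is Fungi and at least one lower rank is present."""
    ranks = [rank.strip() for rank in taxon.split(";") if rank.strip()]
    if not ranks or ranks[0].lower() != kingdom_label:
        return False  # unclassified or a non-fungal kingdom
    return len(ranks) > 1  # drop entries resolved only to kingdom level


# Example taxonomy strings (assumed format, for illustration only)
examples = [
    "k__Fungi;p__Ascomycota;c__Saccharomycetes",
    "k__Fungi",
    "Unassigned",
    "k__Viridiplantae;p__Anthophyta",
]
for taxon in examples:
    print(f"{taxon!r}: keep={is_classified_fungus(taxon)}")
```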

In [21]:
! qiime feature-table summarize \
  --i-table $data_dir/03/dada2_table_1.qza \
  --m-sample-metadata-file $data_dir/fungut_metadata.tsv \
  --o-visualization $data_dir/05/dada2_table_1_metadata.qzv

Saved Visualization to: data/05/dada2_table_1_metadata.qzv

In [22]:
Visualization.load(f"{data_dir}/05/dada2_table_1_metadata.qzv")

Take this over from Livia and then adapt it to the folder structure.

Filtering via the TSV file:
- Filter out unwanted samples in the TSV file
- Group metadata (age and BMI)
- Change "Not provided" to NaN
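
The last two steps could look roughly like this in pandas. The column names (age_years, bmi), the bin edges and the group labels are assumptions for illustration; the real columns in fungut_metadata.tsv may differ.

```python
import io

import pandas as pd

# Tiny stand-in for the metadata file (assumption: column names are made up)
raw_tsv = (
    "sample_id\tage_years\tbmi\n"
    "S1\t34\t22.5\n"
    "S2\tNot provided\t31.0\n"
    "S3\t67\tNot provided\n"
)

# Treat "Not provided" as missing while reading, so numeric columns stay numeric
metadata = pd.read_csv(io.StringIO(raw_tsv), sep="\t", na_values=["Not provided"])

# Group age and BMI into categories (bin edges are illustrative choices)
metadata["age_group"] = pd.cut(
    metadata["age_years"],
    bins=[0, 18, 40, 65, 120],
    labels=["0-17", "18-39", "40-64", "65+"],
    right=False,  # intervals are [lower, upper)
)
metadata["bmi_group"] = pd.cut(
    metadata["bmi"],
    bins=[0, 18.5, 25, 30, 100],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,
)

print(metadata)
```

Missing values stay missing in the grouped columns, so samples without age or BMI are not silently placed in a group.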

# 06 Preparing for further analysis

## 06.1 Selecting sequencing depth

In [None]:
# Max depth 30000
! qiime diversity alpha-rarefaction \
    --i-table $data_dir/03/dada2_table_1.qza \
    --p-max-depth 30000 \
    --m-metadata-file $data_dir/fungut_metadata.tsv \
    --o-visualization $data_dir/06/fungut_alpha_rarefaction_30000.qzv

In [None]:
# Max depth 50000
! qiime diversity alpha-rarefaction \
    --i-table $data_dir/03/dada2_table_1.qza \
    --p-max-depth 50000 \
    --m-metadata-file $data_dir/fungut_metadata.tsv \
    --o-visualization $data_dir/06/fungut_alpha_rarefaction_50000.qzv

In [None]:
Visualization.load(f"{data_dir}/06/fungut_alpha_rarefaction_30000.qzv")

In [None]:
Visualization.load(f"{data_dir}/06/fungut_alpha_rarefaction_50000.qzv")

##### Sequencing depths to test (how many samples do we lose?):
    1. 17500
    2. 20000
    3. 22500
    
Results:
17500: no samples lost (150)
20000: 1 sample lost (149)
22500: 3 samples lost (147)

-> Let's go with 20000. This way we have a good plateau for all curves and lose only 1 sample.
-> We don't have any extremely low-depth samples that might need filtering (yay!)
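
The counting behind these results can be sketched as follows: a sample is retained when its total frequency is at least the chosen depth. The per-sample depths below are invented and do not reproduce the 150-sample table.

```python
def samples_retained(sample_depths, min_frequency):
    """Number of samples whose total read count reaches min_frequency."""
    return sum(1 for depth in sample_depths.values() if depth >= min_frequency)


# Hypothetical per-sample read depths (assumption, for illustration only)
sample_depths = {"A": 17900, "B": 21000, "C": 25000, "D": 40000}

for cutoff in (17500, 20000, 22500):
    kept = samples_retained(sample_depths, cutoff)
    lost = len(sample_depths) - kept
    print(f"{cutoff}: {lost} sample(s) lost ({kept} kept)")
```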

In [None]:
# Depth 17500
!qiime feature-table filter-samples \
  --i-table $data_dir/03/dada2_table_1.qza \
  --p-min-frequency 17500 \
  --o-filtered-table $data_dir/06/table_minfreq_17500.qza

!qiime feature-table summarize \
  --i-table $data_dir/06/table_minfreq_17500.qza \
  --m-sample-metadata-file $data_dir/fungut_metadata.tsv \
  --o-visualization $data_dir/06/table_minfreq_17500_summary.qzv

In [None]:
# Depth 20000
!qiime feature-table filter-samples \
  --i-table $data_dir/03/dada2_table_1.qza \
  --p-min-frequency 20000 \
  --o-filtered-table $data_dir/06/table_minfreq_20000.qza

!qiime feature-table summarize \
  --i-table $data_dir/06/table_minfreq_20000.qza \
  --m-sample-metadata-file $data_dir/fungut_metadata.tsv \
  --o-visualization $data_dir/06/table_minfreq_20000_summary.qzv

In [None]:
# Depth 22500
!qiime feature-table filter-samples \
  --i-table $data_dir/03/dada2_table_1.qza \
  --p-min-frequency 22500 \
  --o-filtered-table $data_dir/06/table_minfreq_22500.qza

!qiime feature-table summarize \
  --i-table $data_dir/06/table_minfreq_22500.qza \
  --m-sample-metadata-file $data_dir/fungut_metadata.tsv \
  --o-visualization $data_dir/06/table_minfreq_22500_summary.qzv

In [None]:
Visualization.load(f"{data_dir}/06/table_minfreq_17500_summary.qzv")

In [None]:
Visualization.load(f"{data_dir}/06/table_minfreq_20000_summary.qzv")

In [None]:
Visualization.load(f"{data_dir}/06/table_minfreq_22500_summary.qzv")

## 06.02 Final check before Bootstrapping

In [None]:
!qiime diversity alpha-rarefaction \
  --i-table $data_dir/03/dada2_table_1.qza \
  --p-max-depth 23000 \
  --p-iterations 50 \
  --m-metadata-file $data_dir/fungut_metadata.tsv \
  --o-visualization $data_dir/06/alpha_rarefaction_sanitycheck.qzv

In [None]:
Visualization.load(f"{data_dir}/06/alpha_rarefaction_sanitycheck.qzv")

# 07 Alpha diversity

### 07.01 Kmerizing our ASVs

In [None]:
# TODO: this should probably use the feature table after filtering
! qiime kmerizer core-metrics \
  --i-table $data_dir/03/dada2_table_1.qza \
  --i-sequences $data_dir/03/dada2_rep_seqs_1.qza \
  --m-metadata-file $data_dir/fungut_metadata.tsv \
  --p-sampling-depth 20000 \
  --p-kmer-size 16 \
  --output-dir $data_dir/fungut_kmerizer_output
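
As a reminder of what kmerizing means, here is a minimal sketch that splits a sequence into fully overlapping k-mers (the plugin may handle overlap differently, so this is only an illustration). The toy input is the forward primer from section 02, not a real ASV.

```python
def kmerize(sequence, k=16):
    """Return all overlapping k-mers of a sequence (empty if it is shorter than k)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


toy_sequence = "CTTGGTCATTTAGAGGAAGTAA"  # 22 nt, used only as a toy input
kmers = kmerize(toy_sequence, k=16)

print(len(kmers))  # a sequence of length L yields L - k + 1 k-mers
print(kmers[0])
```

Two ASVs that differ by a single base still share most of their k-mers, which is why k-mer based metrics can treat closely related ASVs as similar.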

In [34]:
!qiime tools peek $data_dir/fungut_forward_reads.qza

UUID:        3638611d-1767-413b-9390-70ee3d78e4ff
Type:        SampleData[SequencesWithQuality]
Data format: SingleLanePerSampleSingleEndFastqDirFmt


In [32]:
!qiime boots kmer-diversity \
  --i-table $data_dir/03/dada2_table_1.qza \
  --i-sequences $data_dir/fungut_forward_reads.qza \
  --m-metadata-file $data_dir/fungut_metadata.tsv \
  --p-sampling-depth 20000 \
  --p-n 10 \
  --p-replacement \
  --p-kmer-size 16 \
  --p-alpha-average-method median \
  --p-beta-average-method medoid \
  --p-color-by "country_sample" \
  --output-dir boots-kmer-diversity

Usage: qiime boots kmer-diversity [OPTIONS]

  Given a single feature table as input, this action resamples the feature
  table `n` times to a total frequency of `sampling depth` per sample. It then
  splits all input sequences into kmers, and computes common alpha and beta
  diversity on each resulting kmer table. The resulting artifacts are then
  averaged using the method specified by `alpha_average_method` and
  `beta_average_method` parameters. The resulting average alpha and beta
  diversity artifacts are returned, along with a scatter plot integrated all
  alpha diversity metrics and the PCoA axes for all beta diversity metrics.

Inputs:
  --i-table ARTIFACT FeatureTable[Frequency | RelativeFrequency |
    PresenceAbsence]      The input feature table.                  [required]
  --i-sequences ARTIFACT FeatureData[Sequence | RNASequence |
    ProteinSequence]      Inp

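### Questions regarding boots kmer-diversity
- When should we use the mean and when the median?
- Resample with or without replacement?
- Currently we pass the untrimmed reads; how can we use the trimmed ones (file type supposedly not supported)? Note: according to the usage text above, --i-sequences expects FeatureData[Sequence], whereas fungut_forward_reads.qza is SampleData[SequencesWithQuality]. The DADA2 representative sequences are FeatureData[Sequence].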
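
To think through the first two questions, here is a toy sketch in plain Python, not the q2-boots implementation. It resamples one sample's counts with replacement (bootstrapping) or without replacement (rarefaction-style subsampling) and compares mean and median over replicate values. All counts and replicate values are invented.

```python
import random
import statistics


def resample_counts(counts, depth, replacement, rng):
    """Draw `depth` reads from a sample's feature counts."""
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    if replacement:
        return rng.choices(pool, k=depth)  # depth may exceed len(pool)
    return rng.sample(pool, k=depth)       # requires depth <= len(pool)


def observed_features(draws):
    """Simple alpha diversity metric: number of distinct features drawn."""
    return len(set(draws))


rng = random.Random(0)
counts = {"asv1": 5, "asv2": 3, "asv3": 2}  # hypothetical sample

richness = [
    observed_features(resample_counts(counts, 6, replacement=True, rng=rng))
    for _ in range(10)
]
print("replicate richness:", richness)

# Mean vs median: a single extreme replicate shifts the mean but not the median
replicate_values = [12, 13, 12, 14, 40]  # made-up replicate results
print("mean:", statistics.mean(replicate_values))
print("median:", statistics.median(replicate_values))
```

With only a few replicates (n = 10 in the cell above), the median is less sensitive to a single unusual resample, while the mean uses all replicates equally.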


In [24]:
!qiime boots kmer-diversity ?

Usage: qiime boots kmer-diversity [OPTIONS]

  Given a single feature table as input, this action resamples the feature
  table `n` times to a total frequency of `sampling depth` per sample. It then
  splits all input sequences into kmers, and computes common alpha and beta
  diversity on each resulting kmer table. The resulting artifacts are then
  averaged using the method specified by `alpha_average_method` and
  `beta_average_method` parameters. The resulting average alpha and beta
  diversity artifacts are returned, along with a scatter plot integrated all
  alpha diversity metrics and the PCoA axes for all beta diversity metrics.

Inputs:
  --i-table ARTIFACT FeatureTable[Frequency | RelativeFrequency |
    PresenceAbsence]      The input feature table.                  [required]
  --i-sequences ARTIFACT FeatureData[Sequence | RNASequence |
    ProteinSequence]      Input sequences for kmeriz

## ?? Alpha diversity Analysis old

In [None]:
! qiime diversity core-metrics \
  --i-table $data_dir/dada2_table_1.qza  \
  --m-metadata-file $data_dir/fungut_metadata.tsv \
  --p-sampling-depth 17500 \
  --output-dir $data_dir/fungut_coremetrics

In [None]:
# Are these the right sequences? They are not trimmed, but the others somehow don't work (see the data types above)

! qiime kmerizer core-metrics \
  --i-table $data_dir/dada2_table_1.qza \
  --i-sequences $data_dir/dada2_rep_seqs_1.qza \
  --m-metadata-file $data_dir/fungut_metadata.tsv \
  --p-sampling-depth 17500 \
  --p-kmer-size 16 \
  --output-dir $data_dir/fungut_kmerizer_output

### 06.2.1 Alpha diversity: Statistical testing

In [None]:
! qiime diversity alpha-group-significance \
  --i-alpha-diversity $data_dir/fungut_coremetrics/shannon_vector.qza \
  --m-metadata-file $data_dir/fungut_metadata.tsv \
  --o-visualization $data_dir/fungut_coremetrics/shannon-group-significance.qzv

In [None]:
Visualization.load(f"{data_dir}/fungut_coremetrics/shannon-group-significance.qzv")

In [None]:
! qiime diversity alpha-correlation \
  --i-alpha-diversity $data_dir/fungut_coremetrics/shannon_vector.qza \
  --m-metadata-file $data_dir/fungut_metadata.tsv \
  --o-visualization $data_dir/fungut_coremetrics/shannon-group-significance-numeric.qzv

In [None]:
Visualization.load(f"{data_dir}/fungut_coremetrics/shannon-group-significance-numeric.qzv")

In [None]:
!qiime boots kmer-diversity ?