# 4. Differential abundance

> **Goal:** Find which features are differentially abundant across samples.

---

**Overview**

This section examines which features are more abundant (enriched) or less abundant (depleted) in one metadata category relative to another.

The workflow is organized into three key steps:

1. **Normality testing of our data**  
   The distribution of each feature across samples is assessed to determine which statistical method is appropriate.

2. **Preparing our features for analysis**  
   The features are filtered to ensure statistical relevance, and the ASVs are collapsed into taxonomic units.

3. **Differential abundance analysis - ANCOM-BC**  
    - **IBD:** Comparing feature abundance in the officially diagnosed and self-diagnosed IBD categories against the group without IBD.
    - **Gluten status:** Comparing feature abundance in the celiac disease (CD), gluten-allergic and gluten-free categories against the group without CD.
    - **Diet:** Comparing feature abundance in the red-meat-free, fully vegetarian, seafood-eating vegetarian and vegan categories against the omnivore group.
    - **Gender:** Comparing feature abundance in the female and other categories against the male group.
    - **BMI:** Comparing feature abundance in the underweight, overweight, obese and severely obese categories against the healthy BMI group.
    - **Continent:** Comparing feature abundance in America and Oceania against Europe.
    - **Urbanization:** Comparing feature abundance between urbanization categories.
 
 
---

**Table of Contents**

- [4.1 Import packages](#4.1-Import-packages)
- [4.2 Data directory](#4.2-Data-directory)
- [4.3 Normality testing of our data](#4.3-Normality-testing-of-our-data)
- [4.4 Preparing our features for analysis](#4.4-Preparing-our-features-for-analysis)
    - [4.4.1 Filtering of the features](#4.4.1-Filtering-of-the-features)
    - [4.4.2 Transforming ASVs to taxonomic units](#4.4.2-Transforming-ASVs-to-taxonomic-units)
- [4.5 Differential abundance analysis - ANCOM-BC](#4.5-Differential-abundance-analysis)
    - [4.5.1 IBD](#4.5.1-IBD)
    - [4.5.2 Gluten status](#4.5.2-Gluten-status)
    - [4.5.3 Diet](#4.5.3-Diet)
    - [4.5.4 Gender](#4.5.4-Gender)
    - [4.5.5 BMI](#4.5.5-BMI)
    - [4.5.6 Continent](#4.5.6-Continent)
    - [4.5.7 Urbanization](#4.5.7-Urbanization)

## 4.1 Import packages

In [1]:
# Importing all required packages at the start of the notebook
import os
import matplotlib.pyplot as plt
import pandas as pd
import qiime2 as q2
from qiime2 import Visualization
from qiime2 import Artifact
import seaborn as sns
from scipy.stats import shapiro, kruskal, f_oneway

## 4.2 Data directory

In [2]:
# Location
data_dir = "Project_data/Differential_Abundance"
! mkdir -p "$data_dir"

In [3]:
# Paths to project inputs
input_table    = "Project_data/Taxonomy/table_filtered.qza"
input_taxonomy = "Project_data/Taxonomy/taxonomy_pretrained.qza"
input_metadata = "Project_data/Metadata/updated_fungut_metadata.tsv"

## 4.3 Normality testing of our data

In [4]:
data = q2.Artifact.load(input_table).view(pd.DataFrame)


In [5]:
alpha = 0.05
results = {}

# iterate through columns (ASVs) and test each ASV's counts across samples for normality
for asv_name, asv_values in data.items():
    stat, p = shapiro(asv_values)
    results[asv_name] = p

# convert test results into a DataFrame
results_df = pd.DataFrame(data=results.values(), index=results.keys(), columns=['p'])

# add a new column with a descriptive test result
results_df['is_normal'] = results_df['p'] > alpha

In [6]:
print('Number of ASVs with normal distribution:', results_df['is_normal'].sum())

Number of ASVs with normal distribution: 0


The distribution of our ASVs is not normal (as expected for sparse count data), so we will use ANCOM-BC.
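
To illustrate why none of the ASVs passes the test, here is a minimal sketch that applies the same Shapiro-Wilk test to a simulated ASV. The simulated counts, the 20% detection rate and the seed are assumptions for illustration only, not values from our data.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(seed=0)

# Hypothetical ASV: absent from most of 100 samples, with large counts in a few
n_samples = 100
present = rng.random(n_samples) < 0.2  # detected in roughly 20% of samples
counts = np.where(present, rng.poisson(lam=200, size=n_samples), 0)

# Same test as in the loop above, applied to a single column of counts
stat, p = shapiro(counts)
print(f"W = {stat:.3f}, p = {p:.2e}")

# A p-value below alpha rejects the hypothesis of normality
is_normal = p > 0.05
print("Normally distributed:", is_normal)
```

A column dominated by zeros with a handful of large values is far from bell-shaped, which is the typical shape of an ASV count vector.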

## 4.4 Preparing our features for analysis

### 4.4.1 Filtering of the features

We tried different filtering parameters to see how they affect the number of retained features.

In [7]:
# First trying what we did in the course
! qiime feature-table filter-features \
  --i-table $input_table \
  --p-min-frequency 25 \
  --p-min-samples 4 \
  --o-filtered-table $data_dir/table_abund.qza

Saved FeatureTable[Frequency] to: Project_data/Differential_Abundance/table_abund.qza

In [8]:
! qiime feature-table filter-features \
  --i-table $input_table \
  --p-min-samples 4 \
  --o-filtered-table $data_dir/table_abund_test2.qza

Saved FeatureTable[Frequency] to: Project_data/Differential_Abundance/table_abund_test2.qza

In [9]:
! qiime feature-table filter-features \
  --i-table $input_table \
  --p-min-frequency 25 \
  --o-filtered-table $data_dir/table_abund_test3.qza

Saved FeatureTable[Frequency] to: Project_data/Differential_Abundance/table_abund_test3.qza

In [10]:
! qiime feature-table filter-features \
  --i-table $input_table \
  --p-min-frequency 25 \
  --p-min-samples 3 \
  --o-filtered-table $data_dir/table_abund_test4.qza

Saved FeatureTable[Frequency] to: Project_data/Differential_Abundance/table_abund_test4.qza

In [11]:
! qiime feature-table filter-features \
  --i-table $input_table \
  --p-min-frequency 25 \
  --p-min-samples 2 \
  --o-filtered-table $data_dir/table_abund_test5.qza

Saved FeatureTable[Frequency] to: Project_data/Differential_Abundance/table_abund_test5.qza

In [12]:
# Get the number of features remaining after filtering with each parameter set
table_abund_test1 = Artifact.load(f"{data_dir}/table_abund.qza").view(pd.DataFrame)
table_abund_test2 = Artifact.load(f"{data_dir}/table_abund_test2.qza").view(pd.DataFrame)
table_abund_test3 = Artifact.load(f"{data_dir}/table_abund_test3.qza").view(pd.DataFrame)
table_abund_test4 = Artifact.load(f"{data_dir}/table_abund_test4.qza").view(pd.DataFrame)
table_abund_test5 = Artifact.load(f"{data_dir}/table_abund_test5.qza").view(pd.DataFrame)

tests = [f"Test {i}" for i in range(1, 6)]
min_freq = [25, 0, 25, 25, 25]
min_sample = [4, 4, 0, 3, 2]
dfs = [table_abund_test1, table_abund_test2, table_abund_test3, table_abund_test4, table_abund_test5]

rem_features = []

for df in dfs:
    rem_features.append(len(df.columns))

comparison_df = pd.DataFrame({"Minimum frequency": min_freq, "Minimum sample": min_sample, "Number of features remaining": rem_features}, index=tests)

display(comparison_df)

| | Minimum frequency | Minimum sample | Number of features remaining |
|---|---|---|---|
| Test 1 | 25 | 4 | 56 |
| Test 2 | 0 | 4 | 59 |
| Test 3 | 25 | 0 | 538 |
| Test 4 | 25 | 3 | 74 |
| Test 5 | 25 | 2 | 109 |


The differential abundance results are only meaningful if each feature is observed often enough, so we keep the strict parameters (minimum frequency of 25 and minimum of 4 samples), even though this removes a substantial number of features (56 of the 538 that pass the frequency filter alone remain).
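
The two thresholds act on different properties of a feature: its total count across all samples and the number of samples it appears in. The following is a minimal pandas sketch of that logic on a toy table. The function `filter_features` and the toy values are hypothetical, and the sketch is not the QIIME 2 implementation.

```python
import pandas as pd

# Toy feature table: rows are samples, columns are features (hypothetical values)
table = pd.DataFrame(
    {
        "asv_a": [10, 12, 9, 8, 0],  # frequent and widespread
        "asv_b": [30, 0, 0, 0, 0],   # frequent but found in a single sample
        "asv_c": [1, 1, 1, 1, 1],    # widespread but rare
    },
    index=[f"sample_{i}" for i in range(1, 6)],
)


def filter_features(table, min_frequency=0, min_samples=0):
    """Keep features whose total count and prevalence reach both thresholds."""
    total_frequency = table.sum(axis=0)
    n_samples_present = (table > 0).sum(axis=0)
    keep = (total_frequency >= min_frequency) & (n_samples_present >= min_samples)
    return table.loc[:, keep]


filtered = filter_features(table, min_frequency=25, min_samples=4)
print(list(filtered.columns))
```

In this toy table only `asv_a` passes both thresholds. Dropping the prevalence threshold would also keep `asv_b`, which mirrors the large difference between Test 1 and Test 3 above.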

### 4.4.2 Transforming ASVs to taxonomic units
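
Collapsing to a taxonomic level means summing the counts of all ASVs that share the same lineage down to that level. The sketch below shows the idea on a toy table with three ranks. The function `collapse`, the ASV names and the lineages are hypothetical, and details such as how QIIME 2 handles missing ranks are not reproduced.

```python
import pandas as pd

# Toy feature table (samples x ASVs) and taxonomy strings; all values are hypothetical
table = pd.DataFrame(
    {"asv_1": [5, 0], "asv_2": [3, 2], "asv_3": [0, 7]},
    index=["sample_1", "sample_2"],
)
taxonomy = pd.Series(
    {
        "asv_1": "k__Fungi;p__Ascomycota;c__Saccharomycetes",
        "asv_2": "k__Fungi;p__Ascomycota;c__Saccharomycetes",
        "asv_3": "k__Fungi;p__Basidiomycota;c__Tremellomycetes",
    }
)


def collapse(table, taxonomy, level):
    """Sum the counts of all ASVs that share the first `level` taxonomic ranks."""
    # Truncate each lineage string to the requested number of ranks
    lineage = taxonomy.reindex(table.columns).str.split(";").str[:level].str.join(";")
    # Group ASVs (rows after transposing) by lineage, sum, and transpose back
    return table.T.groupby(lineage).sum().T


collapsed = collapse(table, taxonomy, level=3)
print(collapsed)
```

The two Saccharomycetes ASVs end up in a single column, so the collapsed table has fewer features than the ASV table.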

In [13]:
# Collapse to species level (L7)
! qiime taxa collapse \
  --i-table $data_dir/table_abund.qza \
  --i-taxonomy $input_taxonomy \
  --p-level 7 \
  --o-collapsed-table $data_dir/table_abund_L7.qza

Saved FeatureTable[Frequency] to: Project_data/Differential_Abundance/table_abund_L7.qza

## 4.5 Differential abundance analysis 

Features were pre-filtered (min-frequency 25; min-samples 4) prior to ANCOM-BC. To avoid additional filtering inside ANCOM-BC, its prevalence and library-size cutoffs were disabled (`--p-prv-cut 0`, `--p-lib-cut 0`).
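
Each analysis below compares categories against a baseline group, so it matters which category the model treats as the reference. In formula-based models the baseline is commonly the alphabetically first category unless one is set explicitly. The sketch below illustrates this dummy coding with pandas. The function `dummy_code` and the category labels are hypothetical and may differ from the labels in our metadata.

```python
import pandas as pd

# Hypothetical metadata column; the category labels are placeholders
diet = pd.Series(
    ["omnivore", "vegan", "vegetarian", "omnivore"],
    index=["s1", "s2", "s3", "s4"],
    name="diet",
)


def dummy_code(column, reference=None):
    """Dummy-code a categorical column relative to a reference category."""
    levels = sorted(column.unique())
    if reference is not None:
        # Move the chosen reference to the front so that it becomes the baseline
        levels = [reference] + [lvl for lvl in levels if lvl != reference]
    categorical = column.astype(pd.CategoricalDtype(categories=levels))
    # drop_first removes the baseline; each remaining column is one contrast
    return pd.get_dummies(categorical, drop_first=True)


print(list(dummy_code(diet).columns))                     # baseline: omnivore
print(list(dummy_code(diet, reference="vegan").columns))  # baseline: vegan
```

Recent q2-composition releases appear to offer a `--p-reference-levels` option for choosing the baseline explicitly; `qiime composition ancombc --help` shows whether the installed version supports it.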

### 4.5.1 IBD

In [14]:
# ANCOM-BC: effect of IBD
! qiime composition ancombc \
  --i-table $data_dir/table_abund_L7.qza \
  --m-metadata-file $input_metadata \
  --p-formula "ibd_sample" \
  --p-prv-cut 0 \
  --p-lib-cut 0 \
  --o-differentials $data_dir/ancombc_ibd_L7_diffs.qza

Saved FeatureData[DifferentialAbundance] to: Project_data/Differential_Abundance/ancombc_ibd_L7_diffs.qza

In [15]:
# Barplot results
! qiime composition da-barplot \
  --i-data $data_dir/ancombc_ibd_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_ibd_L7_barplot.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_ibd_L7_barplot.qzv

In [16]:
! qiime composition tabulate \
  --i-data $data_dir/ancombc_ibd_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_ibd_L7_results.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_ibd_L7_results.qzv

In [17]:
Visualization.load("Project_data/Differential_Abundance/ancombc_ibd_L7_barplot.qzv")

In [18]:
Visualization.load("Project_data/Differential_Abundance/ancombc_ibd_L7_results.qzv")
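
When reading these results, a taxon is usually called enriched or depleted when its log-fold change relative to the reference group is positive or negative and its adjusted p-value (q-value) falls below the chosen significance level. The sketch below shows that rule on a toy table. The column names `lfc` and `q_val`, the values and the function `classify` are assumptions for illustration, not the actual ANCOM-BC output.

```python
import pandas as pd

# Hypothetical per-taxon results for one contrast; names and values are placeholders
results = pd.DataFrame(
    {
        "lfc": [1.8, -2.1, 0.4, -0.2],
        "q_val": [0.001, 0.03, 0.20, 0.65],
    },
    index=["taxon_a", "taxon_b", "taxon_c", "taxon_d"],
)


def classify(results, alpha=0.05):
    """Label each taxon as enriched, depleted or not significant."""
    labels = pd.Series("not significant", index=results.index)
    significant = results["q_val"] < alpha
    labels[significant & (results["lfc"] > 0)] = "enriched"
    labels[significant & (results["lfc"] < 0)] = "depleted"
    return labels


labels = classify(results)
print(labels)
```

The same reading applies to the bar plots and tables of the other metadata categories in this section.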

### 4.5.2 Gluten status

In [19]:
# ANCOM-BC: effect of Gluten
! qiime composition ancombc \
  --i-table $data_dir/table_abund_L7.qza \
  --m-metadata-file $input_metadata \
  --p-formula "gluten_sample" \
  --p-prv-cut 0 \
  --p-lib-cut 0 \
  --o-differentials $data_dir/ancombc_gluten_L7_diffs.qza

Saved FeatureData[DifferentialAbundance] to: Project_data/Differential_Abundance/ancombc_gluten_L7_diffs.qza

In [20]:
# Barplot results
! qiime composition da-barplot \
  --i-data $data_dir/ancombc_gluten_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_gluten_L7_barplot.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_gluten_L7_barplot.qzv

In [21]:
! qiime composition tabulate \
  --i-data $data_dir/ancombc_gluten_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_gluten_L7_results.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_gluten_L7_results.qzv

In [22]:
Visualization.load("Project_data/Differential_Abundance/ancombc_gluten_L7_barplot.qzv")

In [23]:
Visualization.load("Project_data/Differential_Abundance/ancombc_gluten_L7_results.qzv")

### 4.5.3 Diet

In [24]:
# ANCOM-BC: effect of Diet
! qiime composition ancombc \
  --i-table $data_dir/table_abund_L7.qza \
  --m-metadata-file $input_metadata \
  --p-formula "diet_type_sample" \
  --p-prv-cut 0 \
  --p-lib-cut 0 \
  --o-differentials $data_dir/ancombc_diet_L7_diffs.qza

Saved FeatureData[DifferentialAbundance] to: Project_data/Differential_Abundance/ancombc_diet_L7_diffs.qza

In [25]:
# Barplot results
! qiime composition da-barplot \
  --i-data $data_dir/ancombc_diet_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_diet_L7_barplot.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_diet_L7_barplot.qzv

In [26]:
! qiime composition tabulate \
  --i-data $data_dir/ancombc_diet_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_diet_L7_results.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_diet_L7_results.qzv

In [27]:
Visualization.load("Project_data/Differential_Abundance/ancombc_diet_L7_barplot.qzv")

In [28]:
Visualization.load("Project_data/Differential_Abundance/ancombc_diet_L7_results.qzv")

### 4.5.4 Gender

In [29]:
# ANCOM-BC: effect of Sex
! qiime composition ancombc \
  --i-table $data_dir/table_abund_L7.qza \
  --m-metadata-file $input_metadata \
  --p-formula "sex_sample" \
  --p-prv-cut 0 \
  --p-lib-cut 0 \
  --o-differentials $data_dir/ancombc_sex_L7_diffs.qza

Saved FeatureData[DifferentialAbundance] to: Project_data/Differential_Abundance/ancombc_sex_L7_diffs.qza

In [30]:
# Barplot results
! qiime composition da-barplot \
  --i-data $data_dir/ancombc_sex_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_sex_L7_barplot.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_sex_L7_barplot.qzv

In [31]:
! qiime composition tabulate \
  --i-data $data_dir/ancombc_sex_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_sex_L7_results.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_sex_L7_results.qzv

In [32]:
Visualization.load("Project_data/Differential_Abundance/ancombc_sex_L7_barplot.qzv")

In [33]:
Visualization.load("Project_data/Differential_Abundance/ancombc_sex_L7_results.qzv")

### 4.5.5 BMI

In [34]:
# ANCOM-BC: effect of BMI
! qiime composition ancombc \
  --i-table $data_dir/table_abund_L7.qza \
  --m-metadata-file $input_metadata \
  --p-formula "bmi_category" \
  --p-prv-cut 0 \
  --p-lib-cut 0 \
  --o-differentials $data_dir/ancombc_bmi_L7_diffs.qza

Saved FeatureData[DifferentialAbundance] to: Project_data/Differential_Abundance/ancombc_bmi_L7_diffs.qza

In [35]:
# Barplot results
! qiime composition da-barplot \
  --i-data $data_dir/ancombc_bmi_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_bmi_L7_barplot.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_bmi_L7_barplot.qzv

In [36]:
! qiime composition tabulate \
  --i-data $data_dir/ancombc_bmi_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_bmi_L7_results.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_bmi_L7_results.qzv

In [37]:
Visualization.load("Project_data/Differential_Abundance/ancombc_bmi_L7_barplot.qzv")

In [38]:
Visualization.load("Project_data/Differential_Abundance/ancombc_bmi_L7_results.qzv")

### 4.5.6 Continent

In [39]:
# ANCOM-BC: effect of the continent
! qiime composition ancombc \
  --i-table $data_dir/table_abund_L7.qza \
  --m-metadata-file $input_metadata \
  --p-formula "continent" \
  --p-prv-cut 0 \
  --p-lib-cut 0 \
  --o-differentials $data_dir/ancombc_continent_L7_diffs.qza

Saved FeatureData[DifferentialAbundance] to: Project_data/Differential_Abundance/ancombc_continent_L7_diffs.qza

In [40]:
# Barplot results
! qiime composition da-barplot \
  --i-data $data_dir/ancombc_continent_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_continent_L7_barplot.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_continent_L7_barplot.qzv

In [41]:
! qiime composition tabulate \
  --i-data $data_dir/ancombc_continent_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_continent_L7_results.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_continent_L7_results.qzv

In [42]:
Visualization.load("Project_data/Differential_Abundance/ancombc_continent_L7_barplot.qzv")

In [43]:
Visualization.load("Project_data/Differential_Abundance/ancombc_continent_L7_results.qzv")

### 4.5.7 Urbanization

In [44]:
# ANCOM-BC: effect of the urbanization
! qiime composition ancombc \
  --i-table $data_dir/table_abund_L7.qza \
  --m-metadata-file $input_metadata \
  --p-formula "urban_category" \
  --p-prv-cut 0 \
  --p-lib-cut 0 \
  --o-differentials $data_dir/ancombc_urbanization_L7_diffs.qza

Saved FeatureData[DifferentialAbundance] to: Project_data/Differential_Abundance/ancombc_urbanization_L7_diffs.qza

In [45]:
# Barplot results
! qiime composition da-barplot \
  --i-data $data_dir/ancombc_urbanization_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_urbanization_L7_barplot.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_urbanization_L7_barplot.qzv

In [46]:
! qiime composition tabulate \
  --i-data $data_dir/ancombc_urbanization_L7_diffs.qza \
  --o-visualization $data_dir/ancombc_urbanization_L7_results.qzv

Saved Visualization to: Project_data/Differential_Abundance/ancombc_urbanization_L7_results.qzv

In [47]:
Visualization.load("Project_data/Differential_Abundance/ancombc_urbanization_L7_barplot.qzv")

In [48]:
Visualization.load("Project_data/Differential_Abundance/ancombc_urbanization_L7_results.qzv")
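
The seven ANCOM-BC runs above differ only in the metadata column and the output name, so they could also be generated in a loop. The sketch below builds the same command lines without running them. The helper `ancombc_command` and the `columns` mapping are hypothetical conveniences, while the flags and paths are copied from the cells above.

```python
# Metadata columns analysed above, mapped to the label used in the output file names
columns = {
    "ibd_sample": "ibd",
    "gluten_sample": "gluten",
    "diet_type_sample": "diet",
    "sex_sample": "sex",
    "bmi_category": "bmi",
    "continent": "continent",
    "urban_category": "urbanization",
}

data_dir = "Project_data/Differential_Abundance"
input_metadata = "Project_data/Metadata/updated_fungut_metadata.tsv"


def ancombc_command(column, label):
    """Build (without running) the ANCOM-BC call for one metadata column."""
    return [
        "qiime", "composition", "ancombc",
        "--i-table", f"{data_dir}/table_abund_L7.qza",
        "--m-metadata-file", input_metadata,
        "--p-formula", column,
        "--p-prv-cut", "0",
        "--p-lib-cut", "0",
        "--o-differentials", f"{data_dir}/ancombc_{label}_L7_diffs.qza",
    ]


commands = [ancombc_command(col, label) for col, label in columns.items()]
for cmd in commands:
    print(" ".join(cmd))
```

Each list could then be passed to `subprocess.run(cmd, check=True)` in an environment where QIIME 2 is installed, which keeps the parameters identical across all comparisons.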