# Split feature table by distribution 

## Overview

This notebook contains code to analyze the mucus, tissue, and skeleton feature tables. We first split each table according to whether each microbe falls above, below, or within the prevalence and abundance predicted by neutral theory.
This first step can produce up to 9 tables: mucus_above, mucus_below, mucus_neutral, tissue_above, tissue_below, and so on.
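
The split described above can be sketched as follows. This is a minimal illustration, assuming each microbe has an observed prevalence plus a lower and upper bound on the prevalence predicted by the neutral model. The function names `classify_feature` and `split_by_distribution`, the bound names, and the example values are hypothetical and are not part of this notebook's analysis.

```python
def classify_feature(observed, predicted_lower, predicted_upper):
    """Label one feature relative to a neutral-model prediction interval."""
    if observed > predicted_upper:
        return "above"
    if observed < predicted_lower:
        return "below"
    return "neutral"


def split_by_distribution(features):
    """Group feature IDs into 'above', 'below', and 'neutral' lists.

    features maps feature ID -> (observed, predicted_lower, predicted_upper).
    """
    groups = {"above": [], "below": [], "neutral": []}
    for feature_id, (observed, lower, upper) in features.items():
        groups[classify_feature(observed, lower, upper)].append(feature_id)
    return groups


# Made-up values for illustration only (hypothetical feature IDs and bounds)
example_features = {
    "otu_1": (0.90, 0.40, 0.60),
    "otu_2": (0.10, 0.40, 0.60),
    "otu_3": (0.50, 0.40, 0.60),
}
example_groups = split_by_distribution(example_features)
print(example_groups)
```

Running the same split once per compartment (mucus, tissue, skeleton) would give the 9 possible tables.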

## Import libraries

In [1]:
from os import listdir

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from qiime2 import Artifact, Metadata
from qiime2.plugins.diversity.pipelines import alpha
from qiime2.plugins.diversity.visualizers import alpha_group_significance
from qiime2.plugins.feature_table.methods import (
    filter_features,
    filter_features_conditionally,
)

## Load data and metadata

In [3]:
# Load the data
print("About to load the feature table")
feature_table = Artifact.load("../../Neutral Model Analysis/input/carib_silva_merged_table.qza")
print("Done")

# Load the metadata
print("About to load the metadata table")
metadata = Metadata.load("../../Neutral Model Analysis/input/carib_merged_mapping.txt")
print("Done")

#Load the data as csv files
mucus_data = pd.read_csv("../input/M_rarefied_table.csv")
mucus_data = mucus_data.set_index("id")
mucus_data = mucus_data.rename_axis('#OTU ID')
mucus_data.to_csv("../output/M_rarefied_table_index_renamed.tsv", sep = '\t' )
print("Done")
mucus_data
!biom convert --input-fp ../output/M_rarefied_table_index_renamed.tsv -o ../output/M_rarefied_table_index_renamed.biom --table-type='OTU table' --to-json
# Turn BIOM file into QIIME 2 artifact (qza)
!qiime tools import \
  --input-path ../output/M_rarefied_table_index_renamed.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV100Format \
  --output-path ../output/M_rarefied_table.qza

# Validate QIIME 2 artifact file
!qiime tools validate ../output/M_rarefied_table.qza

#Load the data as csv files
tissue_data = pd.read_csv("../input/T_rarefied_table.csv")
tissue_data = tissue_data.set_index("id")
tissue_data = tissue_data.rename_axis('#OTU ID')
tissue_data.to_csv("../output/T_rarefied_table_index_renamed.tsv", sep = '\t' )
print("Done")
tissue_data
!biom convert --input-fp ../output/T_rarefied_table_index_renamed.tsv -o ../output/T_rarefied_table_index_renamed.biom --table-type='OTU table' --to-json
# Turn BIOM file into QIIME 2 artifact (qza)
!qiime tools import \
  --input-path ../output/T_rarefied_table_index_renamed.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV100Format \
  --output-path ../output/T_rarefied_table.qza

# Validate QIIME 2 artifact file
!qiime tools validate ../output/T_rarefied_table.qza


#Load the data as csv files
skeleton_data = pd.read_csv("../input/S_rarefied_table.csv")
skeleton_data = skeleton_data.set_index("id")
skeleton_data = skeleton_data.rename_axis('#OTU ID')
skeleton_data.to_csv("../output/S_rarefied_table_index_renamed.tsv", sep = '\t' )
print("Done")
skeleton_data
!biom convert --input-fp ../output/S_rarefied_table_index_renamed.tsv -o ../output/S_rarefied_table_index_renamed.biom --table-type='OTU table' --to-json
# Turn BIOM file into QIIME 2 artifact (qza)
!qiime tools import \
  --input-path ../output/S_rarefied_table_index_renamed.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV100Format \
  --output-path ../output/S_rarefied_table.qza

# Validate QIIME 2 artifact file
!qiime tools validate ../output/S_rarefied_table.qza

About to load the feature table
Done
About to load the metadata table
Done
Done
Imported ../output/M_rarefied_table_index_renamed.biom as BIOMV100Format to ../output/M_rarefied_table.qza
Result ../output/M_rarefied_table.qza appears to be valid at level=max.
Done
Imported ../output/T_rarefied_table_index_renamed.biom as BIOMV100Format to ../output/T_rarefied_table.qza
Result ../output/T_rarefied_table.qza appears to be valid at level=max.
Done
Imported ../output/S_rarefied_table_index_renamed.biom as BIOMV100Format to ../output/S_rarefied_table.qza
Result ../output/S_rarefied_table.qza appears to be valid at level=max.

## Calculating mucus alpha diversity

In the next section of code we load our mucus qza file and we calculate alpha diversity.

In [4]:
# Load QZA feature tables
mucus_table = Artifact.load("../output/M_rarefied_table.qza")

mucus_data
# Filter by abundance and prevalence
filtered_mucus_result = filter_features_conditionally(table=mucus_table, abundance=0.01, prevalence=1/50)
filtered_mucus_table = filtered_mucus_result.filtered_table
print("Done")

# Further filter by min frequency and sample occurrence
filtered_mucus_result = filter_features(table=filtered_mucus_table, min_frequency=100, min_samples=2, filter_empty_samples=True)
filtered_mucus_table = filtered_mucus_result.filtered_table
print("Done")

# Calculate observed features (species richness)
alpha_obs_mucus_result = alpha(table=filtered_mucus_table, metric="observed_features")
observed_mucus_alpha_diversity = alpha_obs_mucus_result.alpha_diversity
obs_mucus_alpha_diversity_df = observed_mucus_alpha_diversity.view(pd.Series)
print("Done")


# Calculate Gini index (evenness)
alpha_gini_mucus_result = alpha(table=filtered_mucus_table, metric="gini_index")
gini_index_mucus = alpha_gini_mucus_result.alpha_diversity
gini_mucus_df = gini_index_mucus.view(pd.Series)
print("Done")



Done
Done
Done
1256-012-C121-M               0.986786
1256-019-C123-M               0.965825
1256-022-C127-M               0.993080
1256-025-C128-M               0.944596
1256-028-C129-M               0.993080
                                ...   
E7.6.Mon.aequ.3.20150620.M    0.948491
E7.6.Mon.aequ.4.20150620.M    0.949207
E7.7.Pav.vari.1.20150622.M    0.834301
E7.7.Pav.vari.1.20150623.M    0.979678
E7.7.Pav.vari.2.20150622.M    0.857870
Name: gini_index, Length: 292, dtype: float64

In [10]:
# Next step: take the path to a feature table file and wrap the steps above in a function

In [9]:
tissue_table = Artifact.load("../output/T_rarefied_table.qza")
skeleton_table = Artifact.load("../output/S_rarefied_table.qza")


tissue_data
# Filter by abundance and prevalence
filtered_tissue_result = filter_features_conditionally(table=tissue_table, abundance=0.01, prevalence=1/50)
filtered_tissue_table = filtered_tissue_result.filtered_table
print("Done")

# Further filter by min frequency and sample occurrence
filtered_tissue_result = filter_features(table=filtered_tissue_table, min_frequency=100, min_samples=2, filter_empty_samples=True)
filtered_tissue_table = filtered_tissue_result.filtered_table
print("Done")

# Calculate observed features (species richness)
alpha_obs_tissue_result = alpha(table=filtered_tissue_table, metric="observed_features")
observed_tissue_features = alpha_obs_tissue_result.alpha_diversity
obs_tissue_df = observed_tissue_features.view(pd.Series)
print("Done")

# Calculate Gini index (evenness)
alpha_gini_tissue_result = alpha(table=filtered_tissue_table, metric="gini_index")
gini_tissue_index = alpha_gini_tissue_result.alpha_diversity
gini_tissue_df = gini_tissue_index.view(pd.Series)
print("Done")



skeleton_data
# Filter by abundance and prevalence
filtered_skeleton_result = filter_features_conditionally(table=skeleton_table, abundance=0.01, prevalence=1/50)
filtered_skeleton_table = filtered_skeleton_result.filtered_table
print("Done")

# Further filter by min frequency and sample occurrence
filtered_skeleton_result = filter_features(table=filtered_skeleton_table, min_frequency=100, min_samples=2, filter_empty_samples=True)
filtered_skeleton_table = filtered_skeleton_result.filtered_table
print("Done")

# Calculate observed features (species richness)
alpha_obs_skeleton_result = alpha(table=filtered_skeleton_table, metric="observed_features")
observed_skeleton_features = alpha_obs_skeleton_result.alpha_diversity
obs_skeleton_df = observed_skeleton_features.view(pd.Series)
print("Done")

# Calculate Gini index (evenness)
alpha_gini_skeleton_result = alpha(table=filtered_skeleton_table, metric="gini_index")
gini_skeleton_index = alpha_gini_skeleton_result.alpha_diversity
gini_skeleton_df = gini_skeleton_index.view(pd.Series)
print("Done")

Done
Done
Done
Done
Done
Done
Done
Done

In [4]:
%matplotlib inline
# Calculate mean richness for each sample type
# (the mucus series is named obs_mucus_alpha_diversity_df in the cell that computes it)
means = [obs_mucus_alpha_diversity_df.mean(), obs_tissue_df.mean(), obs_skeleton_df.mean()]

# Labels for the bars
labels = ['Mucus', 'Tissue', 'Skeleton']

# Plot
plt.figure(figsize=(8, 6))
plt.bar(labels, means)
plt.title('Average Observed Features (Richness)')
plt.ylabel('Observed Features')
plt.xlabel('Sample Type')
plt.tight_layout()
plt.show()

## Calculate alpha diversity for mucus, skeleton, and tissue tables

In the next section of code we define a function that calculates alpha diversity for a feature table, then apply it to the mucus, tissue, and skeleton tables.

In [5]:
def calc_alpha_diversity(feature_table_path, sample_type, metrics=("observed_features", "gini_index")):
    """
    Calculate and return alpha diversity metrics for a feature table.

    Parameters:
    - feature_table_path: str, path to the .qza file
    - sample_type: str, name of the sample (e.g., 'tissue', 'skeleton', 'mucus')
    - metrics: sequence of str, alpha diversity metrics to calculate
      (a tuple default avoids sharing a mutable list between calls)

    Returns:
    - DataFrame with sample IDs as index and metrics + sample type as columns
    """
    # Load feature table
    feature_table = Artifact.load(feature_table_path)

    # Filter by abundance and prevalence
    filtered_result = filter_features_conditionally(
        table=feature_table, abundance=0.01, prevalence=1/50)
    
    filtered_table = filtered_result.filtered_table
    print(f"{sample_type} - Done: abundance/prevalence filtering")

    # Further filter by min frequency and sample occurrence
    filtered_result = filter_features(
        table=filtered_table, min_frequency=100, min_samples=2, filter_empty_samples=True)
    
    filtered_table = filtered_result.filtered_table
    print(f"{sample_type} - Done: frequency/sample filtering")

    # Collect alpha diversity results
    alpha_results = {}
    for metric in metrics:
        alpha_result = alpha(table=filtered_table, metric=metric)
        alpha_series = alpha_result.alpha_diversity.view(pd.Series)
        alpha_results[metric] = alpha_series
        print(f"{sample_type} - Done: {metric}")

    # Combine into single DataFrame
    df = pd.DataFrame(alpha_results)
    df["sample_type"] = sample_type
    return df


In [7]:
mucus_df = calc_alpha_diversity("../output/M_rarefied_table.qza", "Mucus")
mucus_df.to_csv("Mucus_alpha_diversity.csv")

tissue_df = calc_alpha_diversity("../output/T_rarefied_table.qza", "Tissue")
tissue_df.to_csv("Tissue_alpha_diversity.csv")

skeleton_df = calc_alpha_diversity("../output/S_rarefied_table.qza", "Skeleton")
skeleton_df.to_csv("Skeleton_alpha_diversity.csv")

Mucus - Done: abundance/prevalence filtering
Mucus - Done: frequency/sample filtering
Mucus - Done: observed_features
Mucus - Done: gini_index
Tissue - Done: abundance/prevalence filtering
Tissue - Done: frequency/sample filtering
Tissue - Done: observed_features
Tissue - Done: gini_index
Skeleton - Done: abundance/prevalence filtering
Skeleton - Done: frequency/sample filtering
Skeleton - Done: observed_features
Skeleton - Done: gini_index

## Filter feature table

Filter the feature table down to microbes that meet a minimum abundance and prevalence.
Our feature table has many zero counts, representing microbes that are present in just a few samples.
We therefore remove rare microbes, keeping only those that reach at least 1% relative abundance in at least 1/50 of the samples.

We then filter the feature table a second time, keeping only microbes with a total frequency of at least 100 counts that occur in at least 2 samples.

In [None]:
# Apply filtering 
filtered_feature_table_results = filter_features_conditionally(table = feature_table, abundance = 0.01, prevalence = 1/50)
filtered_feature_table = filtered_feature_table_results.filtered_table
df = filtered_feature_table.view(pd.DataFrame)

In [None]:
# Then apply additional feature filtering
filtered_feature_table_results = filter_features(
    table=filtered_feature_table,
    min_frequency=100,  # Minimum total frequency for a feature to be retained
    min_samples=2,      # Minimum number of samples a feature must be present in
    filter_empty_samples=True  # Remove samples with no features after filtering    
)
filtered_feature_table = filtered_feature_table_results.filtered_table
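
To make the two filtering rules concrete, here is a minimal plain-Python sketch of the logic on a toy table. It is an illustration of the rules as described above, not the QIIME 2 implementation. The helper names `filter_conditionally` and `filter_by_frequency`, the nested-dictionary table layout, and the toy counts and thresholds are all assumptions made for this example.

```python
def filter_conditionally(table, abundance, prevalence):
    """Keep features whose relative abundance is >= `abundance`
    in at least a `prevalence` fraction of samples.

    table maps sample ID -> {feature ID: count}.
    """
    n_samples = len(table)
    hits = {}  # feature ID -> number of samples where it passes `abundance`
    for counts in table.values():
        total = sum(counts.values())
        if total == 0:
            continue
        for feature_id, count in counts.items():
            if count / total >= abundance:
                hits[feature_id] = hits.get(feature_id, 0) + 1
    keep = {f for f, n in hits.items() if n / n_samples >= prevalence}
    return {s: {f: c for f, c in counts.items() if f in keep}
            for s, counts in table.items()}


def filter_by_frequency(table, min_frequency, min_samples):
    """Keep features with enough total counts and enough non-zero samples,
    then drop samples left with no counts."""
    totals, occurrences = {}, {}
    for counts in table.values():
        for feature_id, count in counts.items():
            totals[feature_id] = totals.get(feature_id, 0) + count
            if count > 0:
                occurrences[feature_id] = occurrences.get(feature_id, 0) + 1
    keep = {f for f in totals
            if totals[f] >= min_frequency and occurrences.get(f, 0) >= min_samples}
    filtered = {s: {f: c for f, c in counts.items() if f in keep}
                for s, counts in table.items()}
    return {s: counts for s, counts in filtered.items() if sum(counts.values()) > 0}


# Toy counts and thresholds, chosen only so that each rule removes something
toy_table = {
    "s1": {"a": 90, "b": 10, "c": 0},
    "s2": {"a": 60, "b": 0, "c": 40},
    "s3": {"a": 0, "b": 0, "c": 100},
    "s4": {"a": 99, "b": 1, "c": 0},
}
step_one = filter_conditionally(toy_table, abundance=0.05, prevalence=0.5)
step_two = filter_by_frequency(step_one, min_frequency=150, min_samples=2)
print(step_two)
```

In this toy run, feature `b` is dropped by the first rule, feature `c` by the second, and sample `s3` is removed because no counts remain in it.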


## Calculate observed features

We calculate the observed features (species richness) of microbes from our filtered feature table. That is, we count how many different types of microbes show up in each sample. The more types of microbes we find in a sample, the higher its richness.

In [None]:
#calculate observed features

alpha_diversity_results = alpha(table = filtered_feature_table, metric = "observed_features")
observed_features = alpha_diversity_results.alpha_diversity

print(observed_features)
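
As a small illustration of what the observed features metric counts, the sketch below computes richness by hand for two toy samples. The sample and feature IDs, the counts, and the helper name `count_observed_features` are made up for this example.

```python
def count_observed_features(sample_counts):
    """Count the features with a non-zero count in one sample."""
    return sum(1 for count in sample_counts.values() if count > 0)


# Hypothetical toy samples: feature ID -> count
toy_samples = {
    "sample_1": {"otu_a": 12, "otu_b": 0, "otu_c": 3},
    "sample_2": {"otu_a": 5, "otu_b": 7, "otu_c": 1},
}
richness = {sample_id: count_observed_features(counts)
            for sample_id, counts in toy_samples.items()}
print(richness)
```

Only presence matters here: a microbe with 1 count contributes as much to richness as one with 1,000 counts.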

## Calculate gini index

Calculate the Gini index for each sample in our filtered feature table. This measures how evenly abundance is spread among the microbes within a sample. A lower Gini index (closer to 0) means the microbes have similar abundances, while a higher value (closer to 1) tells us that a few microbes are much more abundant than the others.

In [None]:
#calculate gini index

alpha_diversity_results = alpha(table = filtered_feature_table, metric = "gini_index")
gini_index = alpha_diversity_results.alpha_diversity

print(gini_index)
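
The behaviour of the Gini index at its two extremes can be checked with a short hand calculation. The sketch below uses the standard rank-based formula for the Gini coefficient. The `gini_index` metric in QIIME 2 may use a different numerical method, so the values need not match exactly. The helper name `gini` and the toy counts are assumptions for this example.

```python
def gini(counts):
    """Gini coefficient of a list of non-negative counts (rank-based formula)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Weight each value by its 1-based rank in the sorted list
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


even_sample = [5, 5, 5, 5]      # every microbe equally abundant
uneven_sample = [0, 0, 0, 20]   # one microbe holds all the counts

gini_even = gini(even_sample)
gini_uneven = gini(uneven_sample)
print(gini_even, gini_uneven)
```

With this formula the largest possible value for a sample with n features is (n - 1) / n, so the uneven toy sample with four features cannot exceed 0.75.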

## Compare diversity of microbes

First we load our sample metadata. Then we create a visualization that helps us see whether there are significant differences in the number of microbes (observed features) between the metadata groups, and save it as a QZV file that we can view.

In [None]:
#compare observed features within alpha diversity
metadata = Metadata.load("../../Neutral Model Analysis/input/carib_merged_mapping.txt")

#get visualization of alpha group significance
alpha_group_significance_results = alpha_group_significance(alpha_diversity = observed_features, metadata = metadata)
observed_features_visualization = alpha_group_significance_results.visualization
observed_features_visualization.save("../../Neutral Model Analysis/output/observed_features_kruskal_wallis.qzv")
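
The group comparison behind this visualization is a Kruskal-Wallis test, which compares the ranks of the values rather than the values themselves. The sketch below computes the H statistic by hand in plain Python, without the tie correction and without a p-value, so it is only meant to show the idea. The helper names and the richness values are made up for this example.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a dict of groups."""
    pooled = [value for group in groups.values() for value in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    rank_term = 0.0
    offset = 0
    for group in groups.values():
        group_ranks = ranks[offset:offset + len(group)]
        rank_term += sum(group_ranks) ** 2 / len(group)
        offset += len(group)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)


# Hypothetical richness values per sample type, for illustration only
toy_richness = {
    "Mucus": [1, 2, 3],
    "Tissue": [4, 5, 6],
    "Skeleton": [7, 8, 9],
}
h_statistic = kruskal_wallis_h(toy_richness)
print(h_statistic)
```

A larger H means the groups' ranks are more separated. When every value is identical, H is 0.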


In [None]:
def calculate_alpha_diversity(feature_table, metrics=None):
    """Calculate multiple alpha diversity metrics for the feature table

    Parameters:
    feature_table -- QIIME2 artifact of the feature table
    metrics -- list of metrics to calculate
               (defaults to observed features and Gini index)

    Returns -- dict mapping each metric name to its alpha diversity artifact
    """
    if metrics is None:
        metrics = ['observed_features', 'gini_index']

    alpha_diversity_results = {}

    # Calculate each requested metric on the table that was passed in
    for metric in metrics:
        results = alpha(table=feature_table, metric=metric)
        alpha_diversity_results[metric] = results.alpha_diversity

    return alpha_diversity_results