# Notebook for Metalign Data Analysis

----

Quick notes: NCBI taxonomy IDs for the `superkingdom_id` argument used below.

| superkingdom   | ID    |
| ---------------| ------|
| Bacteria       | 2     |
| Fungi          | 4751  |
| Viruses        | 10239 |
| Archaea        | 2157  |
| Eukaryota      | 2759  |

Regular imports

In [None]:
from src.metalign_analysis import Metalign as DB
from src.my_decorators import dt
import seaborn as sns

Provide the file path to the Metalign data and initialize the class.
Initialization attempts to set up the database if it has not already been created.

In [None]:

start_time = dt.now()

In [None]:
leaves = "data/Leaf_all.nostrain.txt"
db = DB(leaves)

If metadata is available, add it to the initialized class. If you are using the samples template file submitted to Cornell's slims website, you can add extra columns to the template, one per categorical variable. Pass the names of the categorical columns you want through the `categories` argument.

In [None]:
metadata_file = "data/leaf_phenotype.csv"
db.get_metadata(metadata_file, categories=['samples', 'site', 'treatment_herb_level'], sep=',', index_col='samples')

***

## Calculate Diversity Metrics

- For both alpha and beta diversity, you can specify which metric is used for the computation.

Calculate alpha diversity


In [None]:
alpha_div = db.get_alpha_diversity("shannon")
print(f"Alpha diversity for all samples: \n{list(alpha_div)}")
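For reference, the Shannon index requested above can be computed by hand from one sample's abundances. This is a minimal sketch, not Metalign's implementation: the function name `shannon_index` and the example counts are hypothetical, and the logarithm base used by `get_alpha_diversity` is not stated in this notebook, so the natural log is assumed here.

```python
import math


def shannon_index(counts, base=math.e):
    """Shannon diversity H = -sum(p_i * log(p_i)) over taxa with nonzero abundance.

    `counts` holds per-taxon abundances for a single sample (hypothetical input).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    h = 0.0
    for c in counts:
        if c > 0:  # taxa with zero abundance contribute nothing
            p = c / total
            h -= p * math.log(p, base)
    return h


# Made-up abundances: an even community versus one dominated by a single taxon.
even = shannon_index([25, 25, 25, 25])
skewed = shannon_index([97, 1, 1, 1])
print(f"even: {even:.3f}, skewed: {skewed:.3f}")
```

An evenly distributed community scores higher than a skewed one with the same number of taxa, which is the behavior to expect when comparing samples.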

Calculate beta diversity

In [None]:
beta_diversity = db.get_beta_diversity()
# beta_diversity.to_data_frame().head()
beta_diversity
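As an illustration of what a beta diversity matrix contains, the sketch below computes pairwise Bray-Curtis dissimilarity in plain Python. It is an assumption-laden example rather than the library's code: `bray_curtis` and `pairwise` are hypothetical helpers, the abundance vectors are made up, and the default metric of `get_beta_diversity()` is not stated in this notebook (Bray-Curtis is chosen because the PCoA cell further down uses `'braycurtis'`).

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    if len(u) != len(v):
        raise ValueError("abundance vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        return 0.0  # two empty samples are treated as identical here
    return numerator / denominator


def pairwise(samples):
    """Return {(sample_a, sample_b): dissimilarity} for every unordered pair."""
    return {
        (a, b): bray_curtis(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }


# Made-up abundances for three taxa in three samples.
samples = {"A01": [10, 0, 5], "A02": [10, 0, 5], "A03": [0, 8, 0]}
for pair, value in pairwise(samples).items():
    print(pair, round(value, 3))
```

Identical samples score 0 and samples that share no taxa score 1, so the values in the matrix can be read as "fraction of abundance not shared".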

## Display taxa-level information for samples

These methods return the records at the requested taxonomic level. If __`sample_id`__ is not specified, each method returns a dataframe of all taxa in every sample together with their relative abundance. The output is sorted in descending order of relative abundance.

In [None]:
all_species = db.get_all_species(superkingdom_id=2)
all_species

In [None]:
all_genus = db.get_all_genus()
all_genus

In [None]:
all_family = db.get_all_family("A02")

In [None]:
all_orders = db.get_all_order()

In [None]:
all_class = db.get_all_class("A02")

In [None]:
all_phyla = db.get_all_phyla()
all_phyla.superkingdom_id.unique()

`-2222` represents an individual with an unknown superkingdom ID.

In [None]:
all_phyla[all_phyla["superkingdom_id"] == -2222]

## Data Visualization

### Stacked columns

This plots a bar chart showing the relative abundance of the top `subset` taxa, at the chosen taxonomic level, in the specified sample.

In [None]:
db.barplot_by_sample("A03", "genus", 10)

In [None]:
db.barplot_by_sample('B04', 'order', 8, superkingdom_id=2)
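The "top `subset`" selection can be pictured as ranking taxa by abundance and keeping the first few. The sketch below is only an illustration under stated assumptions: `top_n_with_other` is a hypothetical helper, the abundances are made up, and whether `barplot_by_sample` drops the remaining taxa or groups them is not stated in this notebook. Here they are lumped into an "Other" bucket so the bars still sum to the sample total.

```python
def top_n_with_other(abundances, n):
    """Keep the n most abundant taxa and lump the remainder into 'Other'.

    `abundances` maps taxon name -> abundance for one sample (hypothetical input).
    """
    ranked = sorted(abundances.items(), key=lambda kv: kv[1], reverse=True)
    top = dict(ranked[:n])
    remainder = sum(value for _, value in ranked[n:])
    if remainder > 0:
        top["Other"] = remainder
    return top


# Made-up percentages for four genera in one sample.
genera = {"Pseudomonas": 50, "Bacillus": 30, "Sphingomonas": 15, "Pantoea": 5}
print(top_n_with_other(genera, 2))
```

Keeping an explicit remainder category makes it obvious how much of the sample the plotted taxa actually cover.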

Make a stacked barplot showing the relative abundances at a given taxonomic level for all samples, or for the samples specified with `choose_samples`.

In [None]:
db.taxa_level_barplot(level='class', top_n=10)

In [None]:
db.taxa_level_barplot(level='order', color_palette='tab10', top_n=10, superkingdom_id=2)

Stacked barplot of the top 10 bacterial classes in samples "A01" and "A02"

In [None]:
db.taxa_level_barplot(level='class', top_n=10, color_palette="tab10", choose_samples=["A01", "A02"], superkingdom_id=2)

## Principal Coordinate Analysis Plot

Show PCoA of samples and color by category (complex)

In [None]:
fig = db.plot_pcoa(dissimilarity_metric='braycurtis', color_by="site", method='eigh')

Compute a UMAP embedding of the samples and plot it colored by site

In [None]:
umapDB = db.plot_UMAP(use_dissimilarity=False, color_by='site', metric='braycurtis', n_neighbors=10, min_dist=0.1)
umapDB

In [None]:
sns.scatterplot(data=umapDB, x='UMAP1', y='UMAP2', hue='site')

Save the principal coordinate components to a file

In [None]:
# db.save_pcoa_dimensions("pcoa_output.csv", sep=',')

## SPECIES ACCUMULATION CURVE

Still working on it

In [None]:
# db.plot_species_accum('y')

## PERMANOVA (COMING SOON)

## CO-OCCURRENCE NETWORKS (COMING SOON)

### - MODIFY SPECIES ACCUMULATION CURVE
### - INCLUDE PARAMETERS TO CHOOSE WHICH GROUPS WITHIN DATA SHOULD BE INCLUDED IN THE ACCUMULATION CURVE

## Still Working

In [None]:

end_time = dt.now()
tTime = end_time - start_time
print(f"Total Time Elapsed for Analysis: {tTime}")