# Genome Overview

## Introduction

This notebook gives an overview of the BGCs detected across the genomes in the snakemake run.

### Table of Contents
* Step 1: Import required python packages for the notebook
* Step 2: Query and investigate the dataframe for initial assessment
* Step 3: Visualize the dataframe

### Data directory structure

    1. ../../data/interim/antismash : Output folders for all genomes in analysis (also used as input for BiGSCAPE software). 
    2. ../../data/processed/tables : Directory for raw tables generated by snakemake
        2.1. ../../data/processed/tables/df_genomes.csv (Raw output table with information on genomes contained in the antismash directory)     
        2.2. ../../data/processed/tables/df_bgc_products.csv (Raw output table with information on BGC product distribution contained in the antismash directory) 
    3. ../tables : Directory to save user defined tables (from notebook)
    4. ../figures : Directory to save figures (from notebook)
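
The layout above can be checked before running the rest of the notebook. The following is a minimal sketch, assuming the notebook is run from its own directory so that the relative paths resolve. `EXPECTED_PATHS` and `find_missing` are illustrative names and are not part of bgc_flow.

```python
from pathlib import Path

# Relative locations listed in the directory structure above
EXPECTED_PATHS = [
    Path("../../data/interim/antismash"),
    Path("../../data/processed/tables/df_genomes.csv"),
    Path("../../data/processed/tables/df_bgc_products.csv"),
    Path("../tables"),
    Path("../figures"),
]


def find_missing(paths, base=Path(".")):
    """Return the entries of `paths` that do not exist relative to `base`."""
    return [p for p in paths if not (base / p).exists()]


# Report anything that is missing instead of failing later in the notebook
for path in find_missing(EXPECTED_PATHS):
    print("Missing:", path)
```

If `../tables` or `../figures` are reported as missing, they can be created with `Path.mkdir(parents=True, exist_ok=True)` before any output is written.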

### Load Libraries

In [None]:
# Packages used in the notebook
import sys, os # Directory and file management
from pathlib import Path

import pandas as pd # Dataframe
import seaborn as sns # Visualization
import matplotlib.pyplot as plt # Visualization 
%matplotlib inline

In [None]:
# Custom packages from bgc_flow
module_path = os.path.abspath(os.path.join('../src/')) # location of the bgc_flow custom scripts
if module_path not in sys.path:
    sys.path.append(module_path)
    
from visualization.vis_genome_overview import plot_hist, plot_bgc_dist, scatter_bgcs_len

## Reading and Filtering Genome Summary of the Runs

In [None]:
# Loading saved dataframe
df_samples = pd.read_csv('../../config/samples.csv')

# load raw tables
df_genomes_all = pd.read_csv('../../data/processed/tables/df_genomes.csv', index_col='Unnamed: 0')
df_bgc_products_all = pd.read_csv('../../data/processed/tables/df_bgc_products.csv', index_col= 'Unnamed: 0')

# filter tables for the samples in snakemake config
df_genomes = df_genomes_all[df_genomes_all.index.isin(df_samples.genome_id)]
df_bgc_products = df_bgc_products_all[df_bgc_products_all.index.isin(df_samples.genome_id)]

In [None]:
# View df_genomes
df_genomes.head(2)

In [None]:
# View df_bgc_products
df_bgc_products.head(2)

## Querying genomes dataframe

As the next step of the analysis, we will carry out investigations on the generated tables. We will ask questions such as how many genomes there are per genus and per species.

You can learn more about slicing and searching pandas dataframes to ask your own questions at this step.
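
As a starting point for such questions, here is a small sketch on a toy table. The column names `genus`, `species` and `bgcs_count` mirror those used in this notebook, but the rows are invented placeholders and are not results from any run.

```python
import pandas as pd

# Toy stand-in for df_genomes; all values are made up for illustration
toy = pd.DataFrame(
    {
        "genus": ["GenusA", "GenusA", "GenusB"],
        "species": ["species_1", "species_2", "species_3"],
        "bgcs_count": [27, 22, 14],
    },
    index=["genome_a", "genome_b", "genome_c"],
)

# Boolean mask: keep only the rows of one genus
genus_a = toy[toy["genus"] == "GenusA"]

# Label-based slicing with .loc: selected rows and selected columns
subset = toy.loc[["genome_a", "genome_c"], ["genus", "bgcs_count"]]

# Aggregate per group: mean BGC count per genus
mean_bgcs = toy.groupby("genus")["bgcs_count"].mean()

# Count rows per category, sorted from most to least frequent
genus_counts = toy["genus"].value_counts()
```

The same patterns apply to `df_genomes` once the toy table is swapped for the real one.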

In [None]:
# Find number of genomes per genus
df_genomes.groupby(by='genus').count()['genome_name'].sort_values(ascending=False)

In [None]:
# Find number of genomes per species
df_genomes.groupby(by='species').count()['genome_name'].sort_values(ascending=False)

## Visualization of data

Now that we have investigated the dataframe to satisfy our initial curiosity, we will focus on more comprehensive visualizations of the data using histograms, scatter plots, heatmaps and similar tools.

Many more plot types are shown in the seaborn example gallery (https://seaborn.pydata.org/examples/index.html).

Here, we will generate a histogram, a scatter plot and a clustered heatmap. The histogram and scatter plot are saved in pdf format in ../figures. The plotting functions are the ones imported above from the bgc_flow `visualization.vis_genome_overview` module.
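
For readers who want to write their own plotting helper, here is a minimal sketch of a function that draws a histogram of one column and optionally saves it as a PDF. `plot_count_hist` is a hypothetical name; it is not the bgc_flow `plot_bgc_dist` implementation and uses plain matplotlib.

```python
import matplotlib.pyplot as plt


def plot_count_hist(df, column, to_path=None, bins=20):
    """Draw a histogram of one numeric column and optionally save it."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(df[column].dropna(), bins=bins)
    ax.set_xlabel(column)
    ax.set_ylabel("Number of genomes")
    if to_path is not None:
        # The output format follows the file extension, e.g. .pdf
        fig.savefig(to_path, bbox_inches="tight")
    return fig, ax
```

Returning `fig` and `ax` lets the caller adjust titles or axis limits after the call, for example `fig, ax = plot_count_hist(df_genomes, "bgcs_count")`.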

In [None]:
bgc_dist_path = '../figures/bgc_dist_all.pdf'
plot_bgc_dist(df_genomes, col_select='bgcs_count', to_path=bgc_dist_path)

In [None]:
scat_path = '../figures/scatter_bgcs_len_all.pdf'
scatter_bgcs_len(df_genomes, to_path=scat_path)

## Investigate the outliers from above and remove them from analysis

In this step we will remove the outliers identified above and move their antiSMASH output folders to a separate directory for filtered genomes. These genomes are excluded from all subsequent steps, so please evaluate them manually, case by case, to be sure.
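
The filtering in the cells below follows a simple pattern: keep a copy of the full table, apply a threshold, and derive the removed genomes from the difference between the two indexes. A minimal sketch on a toy table, assuming a `records` column like the one in `df_genomes`; the rows are invented and the cutoff is only illustrative.

```python
import pandas as pd

# Toy stand-in for df_genomes; the record counts are made up
toy_all = pd.DataFrame(
    {"records": [1, 8, 150, 3]},
    index=["genome_a", "genome_b", "genome_c", "genome_d"],
)

max_records = 20  # illustrative cutoff; choose it after inspecting the plots

# Keep genomes at or below the cutoff
toy_kept = toy_all[toy_all["records"] <= max_records]

# Genomes present in the full table but absent from the filtered one
removed_ids = toy_all.index.difference(toy_kept.index)
print(list(removed_ids))
```

Listing `removed_ids` before touching any files makes it easier to review each outlier by hand.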

In [None]:
# Keep df_genomes_all as a copy of all genomes in the project
df_genomes_all = df_genomes.copy()
# df_genomes will be reduced from here on

In [None]:
# Observe the 20 genomes with the highest record numbers
df_genomes.loc[df_genomes.records.sort_values(ascending=False)[:20].index, :]

In [None]:
# Remove all genomes with more than 20 records from the dataset
df_genomes = df_genomes[df_genomes.records <= 20]
bgc_dist_path = '../figures/records_dist.pdf'
plot_bgc_dist(df_genomes, col_select='records', to_path=bgc_dist_path)

In [None]:
# Get list of genomes with least BGC count
df_genomes.loc[df_genomes.bgcs_count.sort_values()[:2].index, :]

In [None]:
# Move the antiSMASH folders of filtered-out genomes to a new folder
from shutil import copytree, rmtree

antismash_dir = '../../data/interim/antismash/'
filtered_genome_dir = '../../data/processed/filtered_antismash'

# Create the target directory (and any missing parents) if it does not exist
os.makedirs(filtered_genome_dir, exist_ok=True)

for genome_id in df_genomes_all.index:
    if genome_id not in df_genomes.index:
        print(genome_id, 'to be moved to filtered directory')
        in_path = os.path.join(antismash_dir, genome_id)
        out_path = os.path.join(filtered_genome_dir, genome_id)
        if os.path.isdir(in_path):
            copytree(in_path, out_path)
            rmtree(in_path)
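
Note that `copytree` raises an error if the destination folder already exists, for example when the cell above is run twice after a partial failure. As a hedged alternative, the sketch below uses `shutil.move`, skips targets that are already present, and supports a dry run. `move_genome_dirs` is a hypothetical helper and is not part of bgc_flow.

```python
import shutil
from pathlib import Path


def move_genome_dirs(genome_ids, src_dir, dst_dir, dry_run=True):
    """Move per-genome folders from src_dir to dst_dir.

    Returns the IDs that were moved (or, in a dry run, would be moved).
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    selected = []
    for genome_id in genome_ids:
        src, dst = src_dir / genome_id, dst_dir / genome_id
        if not src.is_dir() or dst.exists():
            continue  # nothing to move, or target already present
        if not dry_run:
            dst_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
        selected.append(genome_id)
    return selected
```

Running it first with `dry_run=True` shows which folders would be affected without changing anything on disk.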

In [None]:
df_bgc_products = df_bgc_products.reindex(df_genomes.index)

In [None]:
df_genomes.to_csv('../tables/df_genomes.csv')
df_bgc_products.to_csv('../tables/df_bgc_products.csv')

In [None]:
# View distribution of BGC products per genome (first 25 product columns)
# sns.clustermap creates its own figure, so no plt.figure() call is needed
sns.clustermap(df_bgc_products.iloc[:,:25], cmap=sns.color_palette('BuPu'), col_cluster=False, figsize=(10,20))
plt.show()