# Python Study Group
## April 11th, 2019

### Using pandas to analyze tab-delimited or comma-delimited data (tables).

pandas (the name derives from "**pan**el **da**ta") is a powerful Python module for efficiently reading, writing, navigating and munging data tables.

We will cover:

1. Reading in data
2. Data munging
    - Removing duplicates
    - Transitioning from a pandas dataframe to a Python built-in data structure
3. Writing data


Examples: 

1. Given metagenomics binning output (from Autometa), determine the heterogeneity of a binned genome.
2. Determine the number of shared single-copy marker genes between clusters.
3. Determine what single-copy marker genes are duplicated in a genome bin.

In [None]:
# The conventional method of importing pandas
import pandas as pd

# Some other tools for visualizing our data..
from matplotlib import pyplot as plt # boring
%matplotlib inline
import seaborn as sns

In [None]:
!head -n1 data/ML_recruitment_output.tab

In [None]:
# We can easily explore a taxonomy table generated from Autometa...
df = pd.read_csv('data/ML_recruitment_output.tab', sep='\t', index_col='contig')
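
To see what `sep` and `index_col` do without needing the real file, here is a minimal sketch that reads a tiny in-memory table. The table contents and the name `toy_df` are made up for illustration and are not part of the Autometa output.

```python
import io

import pandas as pd

# Hypothetical two-contig table standing in for the real Autometa output.
toy_tsv = (
    "contig\tlength\tgenus\n"
    "contig_1\t1500\tNocardia\n"
    "contig_2\t2300\tStreptomyces\n"
)

# sep='\t' splits fields on tabs; index_col='contig' turns the contig
# names into the row labels instead of a regular column.
toy_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", index_col="contig")
print(toy_df)
```

Because `contig` is now the index, it no longer appears in `toy_df.columns`.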

In [None]:
# Display first 5 observations
df.head()

There is a nice one-liner, used further below, that groups the contigs of each cluster into their own dataframe. First, let's see which columns are available.

In [None]:
# We can easily see which columns we can look up with the columns attribute.
df.columns

## Determining bin heterogeneity

We will determine the different taxa within a genome bin by first grouping the contigs by their named cluster. In this case we will use the `ML_expanded_clustering` column from Autometa's decision tree classifier output table.

In [None]:
clusters = dict(list(df.groupby('ML_expanded_clustering')))
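
The one-liner above works because iterating over a `groupby` object yields (group name, sub-dataframe) pairs, which `dict` turns into a lookup table. A minimal sketch on made-up data (the column name `cluster` and the bin names are hypothetical):

```python
import pandas as pd

# Hypothetical toy table: three contigs assigned to two bins.
toy_df = pd.DataFrame(
    {"cluster": ["bin_a", "bin_b", "bin_a"], "length": [1500, 2300, 900]},
    index=["contig_1", "contig_2", "contig_3"],
)

# Each (name, sub-dataframe) pair becomes one key/value entry.
toy_clusters = dict(list(toy_df.groupby("cluster")))

print(sorted(toy_clusters))   # the bin names
print(toy_clusters["bin_a"])  # only the contigs assigned to bin_a
```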

You may explore the dataframe by row label with the index locator (`.loc`) or by column name.
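
As a rough illustration of both lookup styles, the sketch below uses a small made-up dataframe; the contig names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy table indexed by contig name.
toy_df = pd.DataFrame(
    {"genus": ["Nocardia", "Streptomyces"], "length": [1500, 2300]},
    index=pd.Index(["contig_1", "contig_2"], name="contig"),
)

row = toy_df.loc["contig_1"]            # one row by label, returned as a Series
lengths = toy_df["length"]              # one column, returned as a Series
cell = toy_df.loc["contig_2", "genus"]  # a single value: row label, column name

# A boolean mask selects rows; the list selects columns.
long_contigs = toy_df.loc[toy_df["length"] > 2000, ["genus"]]

print(row, lengths, cell, long_contigs, sep="\n\n")
```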

In [None]:
# We can visually explore the Autometa clustering by building a scatter plot with seaborn
sns.lmplot(x='bh_tsne_x', y='bh_tsne_y', data=df, 
           hue='ML_expanded_clustering', 
#            markers=['o','x','v'],
           palette='Set1',
           fit_reg=False,
          )

In [None]:
for cluster, cluster_df in clusters.items():
    print('{} unique genera in {}:\n{}'.format(cluster_df['genus'].nunique(), cluster, cluster_df.genus.unique()))

Let's continue with clusters 'DBSCAN_round1_9' and 'DBSCAN_round1_0', as both have been classified under the genus _Nocardia_.

Let us determine the number of single-copy marker genes shared between the two clusters, check whether any are duplicated, and check whether each cluster has its own unique set.

We can determine other genome properties such as the total size of the cluster and even mean GC percentage.

In [None]:
# We can easily determine the length in one line. First let's compare the two by size
print('size of DBSCAN_round1_0:',df[df.ML_expanded_clustering == 'DBSCAN_round1_0'].length.sum(), 'bp')

print('size of DBSCAN_round1_9:',df[df.ML_expanded_clustering == 'DBSCAN_round1_9'].length.sum(), 'bp')
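
Total size and mean GC percentage can also be computed for every bin at once with a grouped aggregation. This is only a sketch on made-up data: the column names `cluster`, `length` and `gc` are assumptions, so check `df.columns` for the real names. Note that the mean here is a plain per-contig average, not weighted by contig length.

```python
import pandas as pd

# Hypothetical toy table; 'gc' holds the GC percentage of each contig.
toy_df = pd.DataFrame(
    {
        "cluster": ["bin_a", "bin_a", "bin_b"],
        "length": [1000, 3000, 2000],
        "gc": [60.0, 70.0, 50.0],
    }
)

# Named aggregation: output column = (input column, aggregation function).
summary = toy_df.groupby("cluster").agg(
    total_bp=("length", "sum"),
    mean_gc=("gc", "mean"),
)
print(summary)
```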

As you can see, these genome bins differ substantially in size (5,526,970 bp to be exact!)

Now let us determine the number of single-copy marker genes in each bin and which are shared or unique between the genome bins.

First we will need to retrieve the single-copy marker genes specific to each genome bin.

In [None]:
# Let's recall we have single_copy_PFAMs as a column id to look up
df[df.ML_expanded_clustering == 'DBSCAN_round1_9'].columns

In [None]:
df[df.ML_expanded_clustering == 'DBSCAN_round1_9'].single_copy_PFAMs

Notice that cells with missing values contain NaN, while the other cells contain a comma-delimited list of single-copy PFAM annotations. Let's drop the NaN contigs so we can more closely inspect the contigs within the cluster that contain marker genes.

In [None]:
df[df.ML_expanded_clustering == 'DBSCAN_round1_9'].single_copy_PFAMs.dropna()

In [None]:
n_contigs = len(df[df.ML_expanded_clustering == 'DBSCAN_round1_9'].single_copy_PFAMs.dropna())
print('num contigs containing marker genes: {}'.format(n_contigs))
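
For reference, `dropna()` on a single column (a Series) simply discards the missing entries and keeps the index labels of the rest. A minimal sketch with made-up contigs and placeholder PFAM accessions:

```python
import numpy as np
import pandas as pd

# Hypothetical marker column: contig_2 carries no marker genes.
toy_markers = pd.Series(
    ["PF00001,PF00002", np.nan, "PF00003"],
    index=["contig_1", "contig_2", "contig_3"],
)

with_markers = toy_markers.dropna()  # keep only contigs that have a value
n_with = len(with_markers)

# Counting the non-missing entries directly gives the same number.
n_same = int(toy_markers.notna().sum())

print(with_markers)
print(n_with, n_same)
```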

We can apply the same approach to our other _Nocardia_ cluster...

In [None]:
# Same approach for DBSCAN_round1_0: drop contigs without marker genes, then count the rest
markers = df[df.ML_expanded_clustering == 'DBSCAN_round1_0'].single_copy_PFAMs.dropna()
n_contigs = len(markers)
print('num contigs containing marker genes: {}'.format(n_contigs))

### Now let us determine the shared set of single-copy marker genes from these 19 and 16 contigs, respectively.

In [None]:
n_copies = clusters['DBSCAN_round1_9'].num_single_copies.sum()
print('num single copies cluster DBSCAN_round1_9: {}'.format(n_copies))
n_copies = clusters['DBSCAN_round1_0'].num_single_copies.sum()
print('num single copies cluster DBSCAN_round1_0: {}'.format(n_copies))

In [None]:
# Note this will generate a list of strings, each holding one contig's PFAMs separated by commas.
pfams = clusters['DBSCAN_round1_0'].single_copy_PFAMs.dropna().tolist()

Flatten the list with a list comprehension. Each element of `pfams` is a string of comma-separated PFAMs; the comprehension splits each string into its individual PFAMs and collects all of them into a single flat list.

In [None]:
all_pfams = [p for pfam in pfams for p in pfam.split(',')]
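
If the nested comprehension is hard to read, it may help to see it next to the equivalent explicit loops. The sketch below uses made-up placeholder accessions and `toy_` names so it does not overwrite the notebook's variables.

```python
# Hypothetical per-contig marker strings.
toy_pfams = ["PF00001,PF00002", "PF00003", "PF00002,PF00004"]

# Explicit version: the outer loop walks the contigs, the inner loop walks
# the PFAMs obtained by splitting each contig's string on commas.
toy_flat_loop = []
for pfam in toy_pfams:
    for p in pfam.split(","):
        toy_flat_loop.append(p)

# Comprehension version: the 'for' clauses appear in the same order as the loops.
toy_flat = [p for pfam in toy_pfams for p in pfam.split(",")]

print(toy_flat)
print(toy_flat == toy_flat_loop)
```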

We can determine the total number of PFAMs, as well as the number of unique PFAMs, with the built-in `set` function.

In [None]:
print('total number:',len(all_pfams))
print('num unique:',len(set(all_pfams)))

In [None]:
# Get pfams from cluster 1_9
pfams = clusters['DBSCAN_round1_9'].single_copy_PFAMs.dropna().tolist()
nocardia1 = [p for pfam in pfams for p in pfam.split(',')]
print('num pfams in DBSCAN_round1_9: {}'.format(len(nocardia1)))
# Get pfams from cluster 1_0
pfams = clusters['DBSCAN_round1_0'].single_copy_PFAMs.dropna().tolist()
nocardia2 = [p for pfam in pfams for p in pfam.split(',')]
print('num pfams in DBSCAN_round1_0: {}'.format(len(nocardia2)))
# Convert to sets to drop repeated PFAMs
n1 = set(nocardia1)
n2 = set(nocardia2)
print('unique pfams in DBSCAN_round1_0: {}'.format(len(n2)))
print('unique pfams in DBSCAN_round1_9: {}'.format(len(n1)))

In [None]:
intersecting = n1.intersection(n2)
# Equivalent: intersecting = n1 & n2
print('num shared markers: {}'.format(len(intersecting)))

In [None]:
n2_uniques = n2 - n1
print('{} PFAMs unique to DBSCAN_round1_0'.format(len(n2_uniques)))
n1_uniques = n1 - n2
print('{} PFAMs unique to DBSCAN_round1_9'.format(len(n1_uniques)))

In [None]:
shared_pfams = n2 & n1
print('{} PFAMs shared between clusters'.format(len(shared_pfams)))
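
The set operators used above can be tried on a small example. The two sets below are made-up placeholders, not real marker sets from either cluster.

```python
# Hypothetical marker sets for two bins.
toy_a = {"PF00001", "PF00002", "PF00003"}
toy_b = {"PF00002", "PF00003", "PF00004"}

shared = toy_a & toy_b   # intersection: markers present in both bins
only_a = toy_a - toy_b   # difference: markers found only in the first bin
only_b = toy_b - toy_a   # difference: markers found only in the second bin
either = toy_a ^ toy_b   # symmetric difference: markers in exactly one bin

print(sorted(shared), sorted(only_a), sorted(only_b), sorted(either))
```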

Let's determine which of the "single-copy" marker genes are found as duplicates in these genome bins...

In [None]:
n1_dups = [pfam for pfam in nocardia1 if nocardia1.count(pfam) > 1]
n2_dups = [pfam for pfam in nocardia2 if nocardia2.count(pfam) > 1]

In [None]:
print('DBSCAN_round1_9')
print(n1_dups)
print('DBSCAN_round1_0')
print(n2_dups)
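
The `list.count` approach lists a duplicated PFAM once per occurrence and rescans the whole list for every element. One possible alternative, sketched here on made-up placeholder accessions, is `collections.Counter` from the standard library, which tallies everything in a single pass and reports each duplicated PFAM once along with its copy number.

```python
from collections import Counter

# Hypothetical flattened marker list for one bin.
toy_pfams = ["PF00001", "PF00002", "PF00002", "PF00003", "PF00003", "PF00003"]

# Count how many times each PFAM occurs.
toy_counts = Counter(toy_pfams)

# Keep only the PFAMs seen more than once, with their copy number.
toy_dups = {pfam: n for pfam, n in toy_counts.items() if n > 1}
print(toy_dups)
```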

Finally, if we would like to write out each cluster as its own table, we can use the `to_csv()` method.

In [None]:
for c, c_df in clusters.items():
    outfile = 'data/{}.tsv'.format(c)
    c_df.to_csv(outfile, sep='\t', header=True, index=True)
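
To check that a table written this way can be read back, here is a small round-trip sketch. It writes a made-up dataframe to a temporary directory rather than to `data/`; the file name `toy_cluster.tsv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy cluster table indexed by contig name.
toy_df = pd.DataFrame(
    {"length": [1500, 2300]},
    index=pd.Index(["contig_1", "contig_2"], name="contig"),
)

with tempfile.TemporaryDirectory() as tmpdir:
    outfile = os.path.join(tmpdir, "toy_cluster.tsv")
    # index=True writes the contig names as the first column.
    toy_df.to_csv(outfile, sep="\t", header=True, index=True)
    # Reading with the same separator and index column restores the table.
    round_trip = pd.read_csv(outfile, sep="\t", index_col="contig")

print(round_trip)
```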

Now we can see that a tab-delimited file has been written for each cluster.

In [None]:
!ls data/