<a href="https://colab.research.google.com/github/mlites/mlites2019/blob/master/intro_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Clustering

Hello! This lesson introduces clustering with several unsupervised machine learning algorithms, including k-means, DBSCAN, spectral clustering, and Gaussian mixture modeling.

Along the way, we'll learn about feature scaling, feature selection, expectation maximization, and other tools in the machine learning toolbox.

If you haven't already, make sure to run [kaggle_introduction.ipynb](https://colab.research.google.com/github/mlites/mlites2019/blob/master/kaggle_introduction.ipynb) to download the necessary datasets and set up the environment.

## Genomic Dataset Background

The dataset we'll use for this lesson is based on the Kaggle dataset [Decontamination of Microbial Genomes](https://www.kaggle.com/rec3141/biol342-genome-data). The dataset is the result of 3 years of student work in the General Microbiology course at the University of Alaska Fairbanks. In each year, students cultured a microorganism from the environment and subjected it to DNA sequencing.

For various reasons, some of the cultures were impure, containing 2, 3, or more microorganisms. The goal of our unsupervised machine learning clustering is to separate out (i.e. label) the fragments of DNA belonging to each of the microorganisms.

The dataset is organized by anonymized student ids ("samples") and DNA fragments, which are also known as contigs (for "contiguous sequences"). These contigs have been assembled from raw DNA reads using an *assembler* called SPAdes after various quality control steps. The contigs are named sequentially by student id and ordered by decreasing length, e.g. student0_1, student0_2, student0_3, etc. More detailed information about the dataset can be found [here](https://www.kaggle.com/rec3141/biol342-genome-data).

The dataset consists of five files, although for now we'll focus on just the first one.

1. biol342_cov_len_gc.tsv
    * contains the sequencing coverage, length, and G+C content of each contig
2. biol342_depths.tsv
    * contains the sequencing depth of each contig in each sample
3. biol342_paired.tsv
    * contains the number of raw reads spanning each pair of contigs
4. biol342_tax.tsv
    * contains taxonomic labels based on comparison to a large database
5. biol342_tnf.tsv
    * contains the tetranucleotide frequencies of each contig
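
To make two of these quantities concrete, here is a minimal sketch of how G+C content and tetranucleotide frequencies could be computed for a single contig. The function names `gc_content` and `tetranucleotide_freqs` and the toy sequence are made up for illustration. The dataset's own tables may have been computed differently (for example, by merging reverse-complement 4-mers), so treat this as an assumption rather than the authors' method.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def tetranucleotide_freqs(seq: str) -> dict:
    """Relative frequency of each of the 256 possible 4-mers in seq."""
    seq = seq.upper()
    # Slide a window of width 4 along the sequence, one base at a time.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # Only windows made purely of A, C, G, T are counted in the total.
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


toy_contig = "ATGCGCGCATATGCGC"  # made-up sequence, not from the dataset
tnf = tetranucleotide_freqs(toy_contig)
print(gc_content(toy_contig))
print(len(tnf), tnf["ATGC"])
```

Each contig then becomes a row of numbers (one G+C value, 256 tetranucleotide frequencies), which is the kind of feature table a clustering algorithm can work with.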


## Visualizing the data

It's always a good idea to start off by looking at some of the data.

Let's read it into a pandas dataframe.



In [3]:
import pandas as pd
# Note the sep="\t" option, which tells pandas that the data is tab-separated.
clg = pd.read_csv("biol342_cov_len_gc.tsv", sep="\t")

clg.head()

Unnamed: 0,contig,student,cov,len,gc
0,student0_1,student0,29.0114,255873,0.4995
1,student0_2,student0,31.5053,190425,0.5151
2,student0_3,student0,39.5121,149891,0.5077
3,student0_4,student0,37.9206,135958,0.5212
4,student0_5,student0,34.0143,121845,0.5204
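
The `Unnamed: 0` column in the output above suggests that the file stores a row index in an unlabeled first column. As a minimal sketch, assuming the file's header really does begin with an empty field, passing `index_col=0` makes pandas use that column as the index instead. The example reads an in-memory copy of the five rows shown above rather than the real file.

```python
import io

import pandas as pd

# In-memory stand-in for the first rows of biol342_cov_len_gc.tsv, copied
# from the output above. Assumption: the header's first field is empty.
sample_tsv = (
    "\tcontig\tstudent\tcov\tlen\tgc\n"
    "0\tstudent0_1\tstudent0\t29.0114\t255873\t0.4995\n"
    "1\tstudent0_2\tstudent0\t31.5053\t190425\t0.5151\n"
    "2\tstudent0_3\tstudent0\t39.5121\t149891\t0.5077\n"
    "3\tstudent0_4\tstudent0\t37.9206\t135958\t0.5212\n"
    "4\tstudent0_5\tstudent0\t34.0143\t121845\t0.5204\n"
)

# Default read: the unlabeled first column is kept as "Unnamed: 0".
raw = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# index_col=0: the first column becomes the row index instead.
clg_demo = pd.read_csv(io.StringIO(sample_tsv), sep="\t", index_col=0)

print(list(raw.columns))
print(list(clg_demo.columns))
```

Either form works for the rest of the lesson. Dropping the extra column simply keeps it from being mistaken for a feature later on.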


Let's take a subset of data from a single student.

Here are a few different ways to do the same task, which is called *subsetting* or *slicing* the dataframe.


In [21]:
print("select rows where column 'student' equals 'student0' using dot notation")
stu = clg[clg.student == "student0"]
print(stu.head())

print("select rows where column 'student' equals 'student0' using bracket notation")
stu = clg[clg['student'] == "student0"]
print(stu.head())

print("generate a boolean mask which is True where column 'student' equals 'student0'")
pick_stu = clg['student'] == "student0"
stu = clg[pick_stu]
print(stu.head())

print("use matching via the isin() function")
pick_stu = ['student0']
stu = clg[clg.student.isin(pick_stu)]
print(stu.head())

print("selecting on multiple conditions")
pick_stu = ['student0']
# the parentheses around the second comparison are required because & binds more tightly than >
stu = clg[clg.student.isin(pick_stu) & (clg.gc > 0.5)]
print(stu.head())

print("selecting the inverse")
pick_stu = ['student0']
stu = clg[~clg.student.isin(pick_stu)]
print(stu.head())


select rows where column 'student' equals 'student0' using dot notation
       contig   student      cov     len      gc
0  student0_1  student0  29.0114  255873  0.4995
1  student0_2  student0  31.5053  190425  0.5151
2  student0_3  student0  39.5121  149891  0.5077
3  student0_4  student0  37.9206  135958  0.5212
4  student0_5  student0  34.0143  121845  0.5204
select rows where column 'student' equals 'student0' using bracket notation
       contig   student      cov     len      gc
0  student0_1  student0  29.0114  255873  0.4995
1  student0_2  student0  31.5053  190425  0.5151
2  student0_3  student0  39.5121  149891  0.5077
3  student0_4  student0  37.9206  135958  0.5212
4  student0_5  student0  34.0143  121845  0.5204
generate a boolean mask which is True where column 'student' equals 'student0'
       contig   student      cov     len      gc
0  student0_1  student0  29.0114  255873  0.4995
1  student0_2  student0  31.5053  190425  0.5151
2  student0_3  student0  39.5121  149891  0.5077
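
Two more spellings of the same kind of selection are `.loc` and `.query()`. The sketch below is only illustrative. It uses the five student0 rows printed above plus one made-up row for a second student (`student1_1` and its values are hypothetical), so that the filters have something to exclude.

```python
import pandas as pd

# Five rows copied from the output above, plus one hypothetical row
# ("student1_1") that exists only so the filters have something to drop.
demo = pd.DataFrame({
    "contig": ["student0_1", "student0_2", "student0_3",
               "student0_4", "student0_5", "student1_1"],
    "student": ["student0"] * 5 + ["student1"],
    "cov": [29.0114, 31.5053, 39.5121, 37.9206, 34.0143, 12.0],
    "len": [255873, 190425, 149891, 135958, 121845, 98000],
    "gc": [0.4995, 0.5151, 0.5077, 0.5212, 0.5204, 0.61],
})

# Each comparison is wrapped in parentheses before combining with &.
mask = (demo["student"] == "student0") & (demo["gc"] > 0.5)

by_mask = demo[mask]                        # boolean mask, all columns
by_loc = demo.loc[mask, ["contig", "gc"]]   # rows by mask, columns by name
by_query = demo.query("student == 'student0' and gc > 0.5")  # string expression

print(by_loc)
```

`.loc` is convenient when rows and columns need to be selected in one step, while `.query()` can be easier to read when several conditions are combined.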


Let's go ahead and plot all of the samples dynamically.

In [0]:
# The widgets module ships with Google Colab and is only importable there.
from google.colab import widgets
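
As a static alternative that also runs outside Colab, the sketch below draws one scatter panel per student with plain matplotlib. The helper name `plot_students`, the choice of G+C content against coverage, and the marker scaling are assumptions made for illustration. The demo dataframe reuses the five rows shown earlier plus one made-up row for a hypothetical second student.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_students(df):
    """Draw one G+C-vs-coverage scatter panel per student and return the figure."""
    students = sorted(df["student"].unique())
    fig, axes = plt.subplots(
        1, len(students), figsize=(4 * len(students), 3.5), squeeze=False
    )
    for ax, name in zip(axes[0], students):
        sub = df[df["student"] == name]
        # Marker area scales with contig length so long contigs stand out.
        ax.scatter(sub["gc"], sub["cov"], s=sub["len"] / 2000, alpha=0.6)
        ax.set_yscale("log")
        ax.set_xlabel("G+C content")
        ax.set_ylabel("coverage")
        ax.set_title(name)
    fig.tight_layout()
    return fig


# Five rows copied from the earlier output, plus one hypothetical row.
demo = pd.DataFrame({
    "contig": ["student0_1", "student0_2", "student0_3",
               "student0_4", "student0_5", "student1_1"],
    "student": ["student0"] * 5 + ["student1"],
    "cov": [29.0114, 31.5053, 39.5121, 37.9206, 34.0143, 12.0],
    "len": [255873, 190425, 149891, 135958, 121845, 98000],
    "gc": [0.4995, 0.5151, 0.5077, 0.5212, 0.5204, 0.61],
})

fig = plot_students(demo)
```

With the full dataset, `plot_students(clg)` would produce one panel per student, so a tabbed or gridded layout is likely easier to browse than a single wide figure.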


## k-means Clustering


## DBSCAN Clustering

## Spectral Clustering

## Gaussian Mixture Clustering