# PCA Plot Tutorial

In [None]:
import caplot
from pprint import pprint
from bokeh.plotting import show
from bokeh.io import output_notebook
output_notebook()

## Dataset
The dataset used in this notebook is described in [SampleData.md](https://github.com/ArashLab/caplot/tree/main/examples/data/SampleData.md).

Briefly, `samples.tsv.gz` contains the following columns:
- s: sample id
- pheno-: phenotypic information including subpopulation, superpopulation, age, t2d, bmi and isFemale
- sample_qc-: quality-control metrics computed by hail.sample_qc
- Principal Component Analysis (PCA)
  - pcaSS1-scores_: The first 3 principal component vectors, computed from a randomly selected 1% of variants
  - pcaSS2-scores_: The first 10 principal component vectors, computed from a randomly selected 10% of variants
  - pcaMAF-scores_: The first 10 principal component vectors, computed from common variants with minor allele frequency above 1%
  - pca-scores_: The first 20 principal component vectors, computed from all variants

### Create the caplot PCA object

In [None]:
plot = caplot.PCA()

### Load data
You may load data from a pandas dataframe, a tabular file or an SQL database.\
Read the documentation for this property for details of the supported formats.\
If the data source is a file, caplot infers the file format from the extension (e.g. `tsv.gz`).

In [None]:
plot.source = 'data/samples.tsv.gz'
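
How caplot maps extensions to formats is not shown in this notebook. As a rough sketch of the idea only, the hypothetical helper `compound_extension` below (not part of caplot) uses the standard library to read a multi-part extension such as `tsv.gz`:

```python
from pathlib import PurePosixPath


def compound_extension(path):
    """Return the full extension of a path (hypothetical helper, not caplot API)."""
    # .suffixes lists every dot-separated suffix of the file name,
    # e.g. ['.tsv', '.gz'] for 'samples.tsv.gz'
    return "".join(PurePosixPath(path).suffixes).lstrip(".")


print(compound_extension("data/samples.tsv.gz"))
print(compound_extension("results/pca.caplot"))
```

A lookup keyed on the full extension, rather than only the last suffix, is what lets a compressed file like `samples.tsv.gz` be told apart from a plain `.gz` archive.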

### Access internal data 
caplot stores data internally in a pandas dataframe.\
You can access that dataframe using `_data`.\
Let's see the columns available in the data.

In [None]:
pprint(list(plot._data.columns))
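
Because related columns share a prefix, a plain list comprehension is enough to pick out one family of columns. The sketch below uses a short hand-written list of names from the dataset description; in the notebook, `columns` would instead be `list(plot._data.columns)`.

```python
# A few column names from the dataset description; the real file has more.
columns = [
    "s", "pheno-age", "pheno-bmi", "sample_qc-call_rate",
    "pcaMAF-scores_1", "pcaMAF-scores_2", "pcaMAF-scores_3",
    "pca-scores_1", "pca-scores_2",
]

# Keep only the scores of the PCA computed from common variants.
maf_scores = [c for c in columns if c.startswith("pcaMAF-scores_")]
print(maf_scores)
```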

### Set the required attributes
Which columns contain the principal components to be plotted?
These are the X and Y coordinates of a scatter plot.

In [None]:
plot.subplots = ['pcaMAF-scores_1', 'pcaMAF-scores_2']

### Show the plot

In [None]:
plot.Show()

### Color samples by super-population

In [None]:
plot.coloringColumn = 'pheno-superpopulation'
plot.coloringStyle = 'Categorical'
plot.coloringPalette = 'Category10'
plot.Show()

### Try continuous coloring by bmi

In [None]:
plot.coloringColumn = 'pheno-bmi'
plot.coloringStyle = 'Continuous'
plot.coloringPalette = 'Magma256'
plot.Show()

### More than 2 PCA vectors?
caplot plots all pairwise combinations

In [None]:
# Revert to color by super population
plot.coloringColumn = 'pheno-superpopulation'
plot.coloringStyle = 'Categorical'
plot.coloringPalette = 'Category10'

plot.subplots = ['pcaMAF-scores_1', 'pcaMAF-scores_2', 'pcaMAF-scores_3', 'pcaMAF-scores_4']
plot.Show()
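
To get a feel for how many subplots to expect, the pairs can be enumerated with the standard library. This sketch assumes "all pairwise combinations" means each unordered pair exactly once; the order in which caplot lays them out, and which component goes on which axis, is not confirmed here.

```python
from itertools import combinations

components = [f"pcaMAF-scores_{i}" for i in range(1, 5)]

# Every unordered pair of distinct components, each listed once.
pairs = list(combinations(components, 2))
for x_col, y_col in pairs:
    print(x_col, "vs", y_col)
print(len(pairs), "subplots")
```

Under that assumption the count is n(n-1)/2, so it grows quickly as more components are added.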

### Don't want all combinations?
caplot accepts a list of pairs too.

In [None]:
plot.subplots = [['pcaMAF-scores_1', 'pcaMAF-scores_2'], ['pcaMAF-scores_3', 'pcaMAF-scores_4']]
plot.Show()

### Filter Samples
Filter samples using SQL queries.\
In this example we keep only the samples with Type 2 Diabetes.

In [None]:
#Revert to a single plot
plot.subplots = ['pcaMAF-scores_1', 'pcaMAF-scores_2']

plot.filter = 'SELECT * FROM data WHERE "pheno-t2d"==1'
plot.Show()
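
The notebook does not say which SQL engine caplot runs these queries on. As a stand-in, the sketch below runs the same query with Python's built-in `sqlite3` on a few made-up rows, to show the double-quoting that hyphenated column names need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hyphenated column names must be double-quoted, as in the notebook's query.
conn.execute('CREATE TABLE data (s TEXT, "pheno-t2d" INTEGER, "pheno-age" INTEGER)')
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [("S1", 1, 35), ("S2", 0, 52), ("S3", 1, 61)],  # made-up rows
)

rows = conn.execute('SELECT * FROM data WHERE "pheno-t2d"==1').fetchall()
print(rows)
conn.close()
```

SQLite accepts `==` as a synonym for `=`. Standard SQL only defines `=`, so `=` is the safer spelling if the query is ever run on another engine.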

### Highlight Samples
Highlight samples using SQL queries.\
In this example we highlight samples younger than 40.\
Zoom in with the Bokeh tools to see the highlighted samples better.

In [None]:
plot.highlight = 'SELECT * FROM data WHERE "pheno-age"<40'
plot.Show()

### More contrast in highlight

In [None]:
plot.minorAlpha = 0.05
plot.Show()
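
One way to read `minorAlpha` is as the opacity of the samples that are not highlighted. The sketch below illustrates that reading with a hypothetical helper, `point_alphas`; it is not caplot code, and the fully opaque default for highlighted samples is an assumption.

```python
def point_alphas(sample_ids, highlighted_ids, minor_alpha=0.05, major_alpha=1.0):
    """Return one alpha per sample (hypothetical helper, not caplot API)."""
    highlighted = set(highlighted_ids)
    # Highlighted samples keep major_alpha; all others fade to minor_alpha.
    return [major_alpha if s in highlighted else minor_alpha for s in sample_ids]


alphas = point_alphas(["S1", "S2", "S3"], highlighted_ids=["S2"])
print(alphas)
```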

### Add Hovers
This way you can see the sample id and gender when you hover over a sample.

In [None]:
plot.hovers = {'id': 's', 'isFemale': 'pheno-isFemale'}
plot.Show()

### Hovers is a Python dictionary
Use dictionary functions to modify hovers.
For example, to remove the sample id and add the call rate and Ti/Tv ratio:

In [None]:
plot.hovers.pop('id', None)
plot.hovers['call-rate'] = 'sample_qc-call_rate'
plot.hovers.update({'TiTv-ratio': 'sample_qc-r_ti_tv'})
plot.Show()
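
The same three calls on a plain dictionary show what the mapping holds afterwards. This assumes `plot.hovers` behaves like an ordinary `dict`, as the heading suggests.

```python
hovers = {"id": "s", "isFemale": "pheno-isFemale"}

hovers.pop("id", None)  # remove a key; the default avoids a KeyError if it is absent
hovers["call-rate"] = "sample_qc-call_rate"  # add a single entry
hovers.update({"TiTv-ratio": "sample_qc-r_ti_tv"})  # add or overwrite several entries

print(hovers)
```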

### Even more interactivity with a form
Run the following cell and click Show.\
Play with the form and click Show again.

In [None]:
plot.ShowWithForm()

### Ultimate interactivity with customized form
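In this example `c1` to `c4` supply the values used in the queries.\
`c1` and `c2` are used in the filter query.\
`c3` and `c4` are used in the highlight query.

`c1` selects the super-population to keep.\
`c2` sets the minimum age.\
`c3` sets the BMI threshold.\
`c4` sets the value of `pheno-isFemale`.

`c1` and `c3` are form selectors. `c2` and `c4` are fixed values in this cell; their selector templates are left in the trailing comments.

Also, we color the samples by sub-population.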


In [None]:
c1 = '{SuperPop to Filter:singleChoice:"pheno-superpopulation":"AMR"}'
c2 = '25' #'{Minimum Age to Filter:intSlider:5:100:5:25}'
c3 = '{Highlight BMI above:floatBox:18.55}'
c4 = 'True' #'{Filter SuperPop:singleChoice:"pheno-isFemale":"True"}'

plot.coloringColumn = 'pheno-subpopulation'
plot.filterTemplate = f'SELECT * FROM data WHERE "pheno-superpopulation" = {c1} AND "pheno-age" > {c2}'
plot.highlightTemplate = f'SELECT * FROM data WHERE "pheno-bmi" > {c3} AND "pheno-isFemale" = {c4}'

plot.ShowWithForm()
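
Note that the f-strings are evaluated by Python before caplot receives them, so each selector placeholder is pasted verbatim into the query text. Printing the template makes this visible. How caplot then parses the placeholder is not covered by this sketch.

```python
c1 = '{SuperPop to Filter:singleChoice:"pheno-superpopulation":"AMR"}'
c2 = '25'

# Python substitutes c1 and c2 here; the braces inside c1 are kept as plain text.
filter_template = f'SELECT * FROM data WHERE "pheno-superpopulation" = {c1} AND "pheno-age" > {c2}'
print(filter_template)
```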

In [None]:
plot.coloringColumn = 'pheno-superpopulation'

### Save your plot in a variety of formats
The format is inferred from the file extension.\
Hovers and the Bokeh tools remain active in the HTML output.

In [None]:
#plot.SaveAs('results/pca.html')
plot.SaveAs('results/pca.png')
plot.SaveAs('results/pca.svg')

### Save plot, data and config all together.
It is possible to save everything in one file and share it.\
Use `caplot` as your file extension. That's all. 

In [None]:
plot.SaveAs('results/pca.caplot')

### Restore everything.
To test this feature you can restart your notebook (clear all data) and run the following cell.\
It will restore your plot, data and config all together.

In [None]:
plot = caplot.read('results/pca.caplot')

In [None]:
plot.ShowWithForm()

### Too many plots?
Smaller plots in more columns could help.

In [None]:
plot.filter = 'SELECT * FROM data'
plot.highlight = 'SELECT * FROM data'
plot.coloringColumn = 'pheno-superpopulation'

plot.subplots = [f'pcaMAF-scores_{i}' for i in range(1,7)]
plot.numCols = 5
plot.subplotWidth = 200
plot.subplotHeight = 200
plot.Show()
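
A quick back-of-the-envelope check of the resulting layout, assuming one subplot per unordered pair of components and a grid filled row by row with `numCols` columns. Toolbars and padding are ignored, so the pixel size is only approximate.

```python
from math import ceil, comb

n_components = 6  # pcaMAF-scores_1 .. pcaMAF-scores_6
num_cols = 5
subplot_width = subplot_height = 200

n_subplots = comb(n_components, 2)  # unordered pairs of components
n_rows = ceil(n_subplots / num_cols)

print(n_subplots, "subplots in", n_rows, "rows")
print("approx. grid size:", num_cols * subplot_width, "x", n_rows * subplot_height, "px")
```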

### Smaller points
For crowded regions

In [None]:
plot.pointSize = 1
plot.Show()

### Directly play with bokeh plot
Change the axis titles by accessing the underlying Bokeh object.

In [None]:
# revert to single plot
plot.subplots = ['pcaMAF-scores_1', 'pcaMAF-scores_2']
plot.subplotWidth = 400
plot.subplotHeight = 400

bokeh_plot = plot.Generate()
# bokeh_plot.xaxis.axis_label = 'PC1'
# bokeh_plot.yaxis.axis_label = 'PC2'
show(bokeh_plot)

### Do it all at once
You can set all parameters in one go

In [None]:
plot = caplot.PCA(source = 'data/samples.tsv.gz',
                  coloringColumn = 'pheno-superpopulation',
                  coloringStyle = 'Categorical',
                  coloringPalette = 'Category10',
                  subplots = ['pcaMAF-scores_1', 'pcaMAF-scores_2'])
plot.Show()