# NESM Python Part 4 - Advanced Topics

- Deep learning with TensorFlow
- Our image analysis pipeline at a glance
- Dask for out of memory computing
- Classical machine learning with `scikit-learn`

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
from mpl_interactions import hyperslicer
%matplotlib widget

## Dask for out of memory computing

In [None]:
memory = 8e9 #8GB 
pixels = 1024*1024
bytes_per_pix = 2 #16 bit unsigned ints

In [None]:
memory/(pixels*bytes_per_pix) #images you can have in memory

That seems like a lot, but it is fewer images than a single modest experiment produces:

(20 Time points) x (10 Positions) x (4 Channels) x (5 z-slices) = 4000 Images


**Enter Dask Array**

In [None]:
20*10*4*5

In [None]:
import dask.array as da

In [None]:
# As a NumPy array of float64 this would need ~84 GB of RAM, so it stays commented out
#impossible_arr = np.random.random((10000,1024,1024))

In [None]:
darr = da.random.random((10000, 1024, 1024))

In [None]:
darr

In [None]:
darr.mean(0)

In [None]:
(darr - darr.min())/(darr.std())

In [None]:
# launch a client from dask-labextension
# scale to 3 cores

In [None]:
# Or, the oldschool way
# from dask.distributed import Client
# client = Client()
# client.cluster

In [None]:
out = darr.mean(0).compute()

In [None]:
out

In [None]:
fake_data = da.random.random((10, 100, 4, 5, 1024,1024))
xr.DataArray(fake_data, dims=['S','T','C','Z','Y','X'])

**Note about Dask**

- One of my favorite things about dask is that I can develop on my laptop and run with 4 cores but then move to Harvard's computing cluster and run with much more computing power. Dask scales seamlessly between these two settings.

- Dask maintains several different APIs. I'd recommend [this page from their documentation](https://docs.dask.org/en/latest/user-interfaces.html) to see what would work for you. In brief, there are high-level interfaces:
  - Array - for data shaped like a high-dimensional rectangle. It will take you far with large imaging datasets and is likely the easiest to use.
  - Dataframe - for tabular data. Array:Numpy :: Dataframe:Pandas
  - Bag - for unordered collections of generic Python objects (e.g. text lines or JSON records), with Map-Reduce type operations.
  - Dask-ML - a scikit-learn-like interface (more on scikit-learn below) for scaling machine learning tasks.
 
 
- There are also lower-level interfaces for custom computation:
  - Delayed - For custom Python computation that does not necessarily fit the array paradigm. *Importantly*, you set up all your computation first and then tell dask when to evaluate it.
  - Futures - *Dynamic* custom computation. Things start running in real time, and dask decides when each task runs by working out which computations depend on which others. This is likely the most powerful and the most confusing interface. Example: a task returns a list whose length you don't know ahead of time, and you then run operations on every element.
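
To make the Delayed idea above concrete, here is a minimal sketch. `load_image` and `total_intensity` are hypothetical stand-ins invented for this example (they are not part of dask or of our pipeline), and the synchronous scheduler is chosen only to keep the example simple and deterministic.

```python
from dask import delayed


def load_image(index):
    # Hypothetical stand-in for reading one image from disk
    return [index] * 4


def total_intensity(image):
    # Hypothetical per-image computation
    return sum(image)


# Building the graph runs nothing yet; each call just returns a Delayed object
totals = [delayed(total_intensity)(delayed(load_image)(i)) for i in range(5)]
grand_total = delayed(sum)(totals)

# The work only happens here, when we explicitly ask for the result
result = grand_total.compute(scheduler="synchronous")
print(result)  # 40
```

Swapping the scheduler (or connecting a `Client`) changes where the tasks run without changing how the graph is built.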

## PCA on Hyperspectral SRS imaging data

**What is PCA?**

Principal component analysis (PCA) finds an orthogonal set of basis vectors, ordered by how much of the variance in a dataset each one explains. Below is a picture from the [Wikipedia page](https://en.wikipedia.org/wiki/Principal_component_analysis) which shows the principal components of some correlated 2D data.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/GaussianScatterPCA.svg/2560px-GaussianScatterPCA.svg.png" width="500"/>
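
As a rough illustration of the picture above, the sketch below does PCA "by hand" with NumPy on made-up correlated 2D data. The data, the 30 degree rotation, and all variable names are assumptions for this example only. scikit-learn's `PCA` is the practical tool for real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2D data: stretch along one axis, then rotate by 30 degrees
n_points = 2000
raw = rng.normal(size=(n_points, 2)) * np.array([3.0, 0.5])
theta = np.deg2rad(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
data = raw @ rotation.T

# PCA by hand: center the data, then eigendecompose its covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Sort descending so the first component explains the most variance
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
components = eigvecs[:, order].T  # one principal axis per row

# Fraction of the total variance along each principal axis
explained_ratio = eigvals / eigvals.sum()

# Direction of the first axis in degrees (mod 180, since the sign is arbitrary)
angle = np.rad2deg(np.arctan2(components[0, 1], components[0, 0])) % 180
print(explained_ratio, angle)
```

The first principal axis should come out close to the 30 degree direction the data was stretched along, and it should carry most of the variance.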

**What is SRS?**

[Stimulated Raman Scattering](https://en.wikipedia.org/wiki/Stimulated_Raman_spectroscopy) (SRS) is an optical imaging technique that probes the vibrational energy levels of different molecules. I'm using it to study cellular metabolism and composition, but it's generally good for chemical mapping of materials with different vibrational energy levels.

The toy dataset below is a spectral scan of two different species of beads. We will use PCA to "discover" how many different species are in the sample and what their spectra look like. 

In [None]:
import io
import requests

In [None]:
# Get the dataset directly from GitHub (33MB)
# Feel free to just watch if you don't want to download
response = requests.get(
    "https://github.com/jrussell25/data-sharing/raw/master/srs_beads.npy"
)
response.raise_for_status()
beads = np.load(io.BytesIO(response.content))

In [None]:
beads.nbytes/1e6

In [None]:
# Define the coordinates for the xarray as a dict of name:array pairs
# wavenums = wavenumber, the relevant spectroscopic unit, in cm^-1
# X, Y = physical dimensions of the images in microns, from the microscope metadata
coords = {'wavenums': np.linspace(2798.65, 3064.95, beads.shape[0]),
          'X': np.linspace(0, 386.44, 512),
          'Y': np.linspace(0, 386.44, 512)}

x_beads = xr.DataArray(beads, dims=list(coords), coords=coords)

In [None]:
plt.figure()
ctrls = hyperslicer(x_beads)

### How to do PCA in python?

Of course we google it first.

But the answer is [scikit-learn](https://scikit-learn.org/stable/)

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=10)

In [None]:
# need to do some annoying reshaping because sklearn expects (N_data, N_features):
# here each pixel is one data point and each wavenumber is one feature
pcs = pca.fit_transform(beads.reshape(beads.shape[0], -1).T)

In [None]:
#instead of 126 spectral points, we have 10 features corresponding to the first 10 PCs
pcs.shape
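
If the reshape-and-transpose step feels opaque, this small sketch checks what it does on a tiny stand-in array. The shapes and names here are made up for illustration. The real stack is much larger, but the indexing logic is the same.

```python
import numpy as np

# Small stand-in for the bead stack: (n_wavenums, n_rows, n_cols)
n_wavenums, n_rows, n_cols = 4, 3, 5
stack = np.arange(n_wavenums * n_rows * n_cols).reshape(n_wavenums, n_rows, n_cols)

# Flatten the two image axes, then transpose so each row is one pixel's spectrum
flat = stack.reshape(n_wavenums, -1).T
print(flat.shape)  # (15, 4)

# Row r*n_cols + c of `flat` holds the spectrum of pixel (r, c)
r, c = 2, 1
same_spectrum = np.array_equal(flat[r * n_cols + c], stack[:, r, c])

# Reshaping the pixel axis back to (n_rows, n_cols) recovers the image layout
restored = flat.reshape(n_rows, n_cols, n_wavenums)
round_trip = np.array_equal(restored, np.moveaxis(stack, 0, -1))
print(same_spectrum, round_trip)  # True True
```

The round trip is why the per-pixel PCA scores can later be reshaped straight back into an image.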

In [None]:
plt.figure()
plt.plot(x_beads['wavenums'],pca.components_[:5].T + np.arange(5)[None,:])
plt.show()

In [None]:
plt.figure()
plt.plot(pca.explained_variance_ratio_)
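
One way to read a plot like this is to ask how many components are needed to reach some fraction of the total variance. The sketch below tries that on synthetic data. The two Gaussian "species" spectra, the noise level, and the 95% threshold are all assumptions made up for this example, not properties of the bead dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two made-up "species" spectra: Gaussian peaks at different positions
axis = np.linspace(0, 1, 50)
species_a = np.exp(-((axis - 0.3) / 0.05) ** 2)
species_b = np.exp(-((axis - 0.7) / 0.05) ** 2)

# Each pixel is background, pure A, or pure B, plus a little noise
labels = rng.integers(0, 3, size=3000)
spectra = np.zeros((labels.size, axis.size))
spectra[labels == 1] = species_a
spectra[labels == 2] = species_b
spectra += rng.normal(scale=0.02, size=spectra.shape)

pca = PCA(n_components=10)
pca.fit(spectra)

# Cumulative variance explained as components are added
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that explains at least 95% of the variance
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_needed)  # 2
```

In this toy case two components carry nearly all of the variance, matching the two made-up species. Real data is noisier, so treat any threshold as a judgment call.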

In [None]:
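# a fun visualization: use the first 3 PC scores of each pixel as RGB channels
rgb = pcs[..., :3].reshape(512, 512, 3)
# rescale each channel independently to the range [0, 1] so imshow can display it
rgb = (rgb - rgb.min(0).min(0))
rgb = rgb / rgb.max(0).max(0)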

In [None]:
plt.figure()
plt.imshow(rgb)