# (BioImage) Data Analysis with Python

*created June 2018 by Jonas Hartmann (Gilmour group, EMBL Heidelberg)*<br>

## Table of Contents

1. [About this Tutorial](#about)
2. [Preparations](#prep)
    1. [Imports](#imports)
    2. [Loading the Data](#loading)
    3. [Some Checks for Common Problems](#checks)
3. [Basic Data Visualization](#dataviz)
    1. [Basic Boxplot](#bplot)
    2. [Interactive Scatterplot](#iscatter)
    3. [Interactive Backmapping](#ibackmap)
4. [Multi-Dimensional Analysis](#MDA)
    1. [Feature Standardization](#featstand)
    2. [Dimensionality Reduction by PCA](#pca)
    3. [Dimensionality Reduction by tSNE](#tsne)
    4. [Clustering with k-Means](#cluster)
    5. [Cluster Visualization by Minimum Spanning Tree](#mst)
    6. [Classification of Mitotic Cells](#mitotic)
    7. [Grouped Analysis and Hypothesis Testing](#grouped_and_hypot)

## 1. About this Tutorial <a id=about></a>

*Analyzing biological image data commonly involves the detection and segmentation of objects of interest such as cells or nuclei, the properties of which can then be measured individually, producing **single-cell data**. However, extracting biological meaning from such data is often far from trivial! Fortunately, a large host of data analysis algorithms and data visualization tools is freely available in the Python ecosystem. This tutorial provides an introductory overview of some of the most important tools in the field.*


#### <font color=orangered>Warning:</font> This Tutorial is in Beta!

It has not been extensively tested yet and may contain flaws at both the conceptual and the implementation level. Furthermore, it has not yet been extended to be fully self-explanatory!


#### Background

The images used for this tutorial were downloaded from the [Broad Bioimage Benchmark Collection (BBBC)](https://data.broadinstitute.org/bbbc/index.html), which is a collection of freely downloadable microscopy image sets.

They are 3-color images of cultured **HT29 cells**, a widely used human colon cancer cell line. The data was originally produced by *Moffat et al.* in the context of a high-content RNAi screen. The three channels are `Hoechst 33342` (channel named `DNA`, showing the nuclei), `phospho-histone H3` (channel named `pH3`, indicates cells in mitosis), and `phalloidin` (channel named `actin`, shows the actin cytoskeleton). This dataset makes for a very nice example case because the cells are morphologically highly diverse and the pH3 staining allows the classification and analysis of a functionally relevant subset of cells.

The images were obtained from [BBBC018](https://data.broadinstitute.org/bbbc/BBBC018/) as `16bit` images in the `.DIB` format and converted into `8bit .tif` images using a simple Fiji macro. Next, nuclei were segmented based on the `DNA` channel and segmentations were extended to capture cell outlines using the `actin` channel (see `\data\image_analysis_pipeline_DEV.ipynb` and `\data\image_analysis_pipeline_RUN.ipynb`). 

Features quantifying cell shape and intensity of each channel were extracted using `skimage.measure.regionprops` and converted to a pandas DataFrame, which was then saved in `\data\BBBC018_v1_features.pkl`. This file is the starting point for this tutorial.
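
To illustrate the kind of feature extraction described above, here is a minimal sketch on a toy label image. The arrays, column names and the per-object loop are illustrative assumptions; the actual pipeline notebooks may extract different features and name them differently.

```python
import numpy as np
import pandas as pd
from skimage.measure import regionprops

# Toy label image (hypothetical data): 0 is background, 1 and 2 are two "cells"
labels = np.zeros((6, 6), dtype=int)
labels[1:3, 1:3] = 1   # 2x2 object
labels[3:6, 3:5] = 2   # 3x2 object

# Toy intensity image standing in for one fluorescence channel
intensity = np.arange(36, dtype=float).reshape(6, 6)

# Collect one row of measurements per segmented object
rows = []
for region in regionprops(labels):
    rr, cc = region.coords.T   # pixel coordinates belonging to this object
    rows.append({'label': region.label,
                 'area': region.area,
                 'mean_intensity': intensity[rr, cc].mean()})

# One row per cell, one column per feature
features = pd.DataFrame(rows).set_index('label')
print(features)
```

A table of this shape (cells as rows, features as columns) is what the rest of this tutorial works with.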


#### Required Modules

- Make sure the following modules are installed before you get started:
    - numpy
    - scipy
    - matplotlib
    - pandas
    - scikit-learn
    - networkx
    - scikit-image or tifffile (only used for imread function)
    - ipywidgets (used for the interactive plots)
- All required modules (except tifffile) come pre-installed if you are using the **[Anaconda distribution](https://www.anaconda.com/download/)** of Python. 
- To install tifffile, use `conda install -c conda-forge tifffile`.

## 2. Preparations <a id=prep></a>

In this section we import the required modules, load the data and prepare it for analysis.

Importantly, we check the data for some of the most common problems/mistakes that can sneak into such datasets. Although this step seems trivial, it is often *crucial* for the success of data analysis! Input data frequently comes with all kinds of issues and failing to clean them up will lead to error messages when running analysis algorithms (in the best case) or to biased/erroneous results that go unnoticed (in the worst case).

### Imports <a id=imports></a>

In [None]:
# General
import os
import numpy as np
import matplotlib.pyplot as plt

# Images
from tifffile import imread

# Statistics & machine learning
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_validate

# Networks
import networkx as nx
from scipy.spatial.distance import pdist, squareform

# Interactivity
from ipywidgets import interact

### Loading the Data <a id=loading></a>

In [None]:
# Path to data 
datafile_path = 'data/BBBC018_v1_features.pkl'

# Load dataframe
df = pd.read_pickle(datafile_path)

# Report
print( df.shape )

In [None]:
df.head()

In [None]:
df.describe()

### Some Checks for Common Problems <a id=checks></a>

In [None]:
### Are there any columns (except imageID) that do not have numerical data?

# Check
print( df.select_dtypes(exclude=[np.number]).columns )  # ->> No, it's only imageID!

In [None]:
### Are there any duplicated columns or rows?

# Check rows
print( df.duplicated().to_numpy().nonzero() )    # ->> No, looks fine!

# Check columns
print( df.T.duplicated().to_numpy().nonzero() )  # ->> Yes, there are! Remove them and check again!

# Remove duplicate columns and check again
df = df.drop(df.columns[df.T.duplicated()], axis=1)
print( df.T.duplicated().to_numpy().nonzero() )

In [None]:
### Are there any columns or rows that have NaN entries?

# Find NaN columns
print( df.loc[:, df.isnull().sum() > 0].columns )  # ->> There is one column with NaNs!

# Find NaN rows
print( df.isnull().any(axis=1).to_numpy().nonzero() )  # ->> There are many rows with NaNs!

# Since all rows' NaNs are in one column, the easiest is to remove that column!
df = df.dropna(axis=1)
print( df.loc[:, df.isnull().sum() > 0].columns )
print( df.isnull().any(axis=1).to_numpy().nonzero() )

In [None]:
### Are there any columns where all values are identical?
# This can be checked looking for columns that have a standard deviation of zero.

# Check
print ( df.select_dtypes([np.number]).loc[:, df.select_dtypes([np.number]).std()==0].columns )  # ->> No, looks fine!
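
The individual checks above could also be bundled into one reusable function. The following is only a sketch: `report_common_problems` is a hypothetical helper (not part of pandas) and the toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

def report_common_problems(frame):
    """Summarize common data problems in a DataFrame (hypothetical helper)."""
    numeric = frame.select_dtypes(include=[np.number])
    return {
        'non_numeric_columns': list(frame.select_dtypes(exclude=[np.number]).columns),
        'duplicated_rows':     int(frame.duplicated().sum()),
        'duplicated_columns':  list(frame.columns[frame.T.duplicated()]),
        'nan_columns':         list(frame.columns[frame.isnull().any()]),
        'constant_columns':    list(numeric.columns[numeric.std() == 0]),
    }

# Toy data containing one instance of most problem types
toy = pd.DataFrame({'imageID':   ['a', 'a', 'b', 'b'],
                    'area':      [1.0, 2.0, 3.0, 4.0],
                    'area_copy': [1.0, 2.0, 3.0, 4.0],     # duplicate of 'area'
                    'with_nan':  [1.0, np.nan, 3.0, 5.0],  # contains a NaN
                    'constant':  [7.0, 7.0, 7.0, 7.0]})    # zero variance

report = report_common_problems(toy)
for problem, finding in report.items():
    print(problem, '->', finding)
```

Returning a dictionary rather than printing inside the function makes it easy to run the same report again after each cleaning step.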

## 3. Basic Data Visualization <a id=dataviz></a>

As a first step, we need to get an idea of what our data "looks like". Things like `df.describe` are a starting point for that but they don't get us very far; we need plots! Lots and lots of plots!

### Basic Boxplot <a id=bplot></a>

A good starting point for looking at any kind of data that can be divided into categories.

In [None]:
### Simple boxplot

# Prep
fig = plt.figure(figsize=(18,5))

# Create boxplot
# Pandas dataframes come with a boxplot function. This is useful since it
# provides some additional functionalities over matplotlib's standard boxplots,
# as we will see later in the tutorial.
df.boxplot()

# Some formatting
plt.grid(False)
fig.autofmt_xdate()

# Done
plt.show()

### Interactive Scatterplot <a id=iscatter></a>

In multi-dimensional datasets such as this one, the limitations of plotting to the 2D or 3D space present a real problem. Fortunately, interactive plotting can to some extent solve this problem, as illustrated in this interactive scatterplot.

<font color=green>**Exercise:**</font> Color the dots based on a third feature, which should be selectable from a third drop-down menu.

In [None]:
### Interactive scatterplot

# Set interactivity
@interact(x = list(df.columns),
          y = list(df.columns))
def make_interactive_scatterplot(x=df.columns[0], 
                                 y=df.columns[1]):

    # Handle potential problems
    if 'imageID' in [x,y]:
        print("'imageID' is an invalid selection for this plot.")
        return
    
    # Prep
    fig = plt.figure(figsize=(8,8))
    
    # Create scatterplot
    plt.scatter(df[x], df[y], s=20,
                edgecolor='k', alpha=0.5)
    
    # Labels
    plt.xlabel(x)
    plt.ylabel(y)
    
    # Done
    plt.show()

### Interactive Backmapping <a id=ibackmap></a>

Since our data originally derives from images, one of the most interesting ways of visualizing it is to map it back onto the image as a colored overlay. This was already shown in the image analysis tutorial but here it is extended to allow interactive choice of various aspects of the visualization.

In [None]:
### Backmapping onto images

# Location of images & segmentations
img_path = os.path.join('data', 'BBBC018_v1_images_tif')
seg_path = os.path.join('data', 'BBBC018_v1_images_seg')

# Set interactivity
@interact(img_id  = list(set(df['imageID'])),
          channel = ['DNA', 'pH3', 'actin'],
          segtype = ['nucseg', 'cytseg'],
          feature = list(df.columns),
          alpha   = (0.0, 1.0, 0.1))
def make_interactive_backmap(img_id  = list(set(df['imageID']))[0], 
                             channel = 'actin',
                             segtype = 'cytseg',
                             feature = 'cyt-area-act',
                             alpha   = 0.4):

    # Handle potential problems
    if feature=='imageID':
        print("'imageID' is an invalid feature for this plot.")
        return
    
    # Load image & segmentation
    img = imread(os.path.join(img_path, img_id+'-'+channel+'_8bit.tif'))
    seg = imread(os.path.join(seg_path, img_id+'-'+segtype+'.tif'))
    
    # Get feature values and standardize to 8bit
    feat = np.array( df[df['imageID']==img_id][feature] )
    feat = (feat - feat.min()) / (feat.max() - feat.min()) * 255.0
    feat = feat.astype(np.uint8)
    
    # Recolor segmentation
    seg_colored = np.zeros_like(seg).astype(np.uint8)
    for cell_idx, cell_value in zip(np.unique(seg)[1:], feat):
        seg_colored[seg==cell_idx] = cell_value
    
    # Prep
    fig = plt.figure(figsize=(10,10))
    
    # Display image
    plt.imshow(img, interpolation='none', cmap='gray')
    
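    # Overlay values (mask the background via the segmentation itself, so that
    # cells with the lowest feature value, which are scaled to 0, remain visible)
    plt.imshow(np.ma.array(seg_colored, mask=seg==0), 
               interpolation='none', cmap='viridis', alpha=alpha)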
    
    # Add a title
    plt.title('img: '+img_id+' | ch: '+channel+' | seg: '+segtype[:3]+' | feat: '+feature,
              fontsize=18)
    
    # Other formatting
    plt.axis('off')
    
    # Done
    plt.show()
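
The recoloring step at the heart of backmapping can be sketched on a toy segmentation. This variant uses a lookup table instead of a loop and reserves 0 for the background. The arrays are made up, and the approach assumes that the feature values are ordered like the sorted label IDs.

```python
import numpy as np

# Toy segmentation (hypothetical data): 0 = background, labels 1..3 are cells
seg = np.array([[0, 1, 1],
                [2, 2, 0],
                [0, 3, 3]])

# One feature value per cell, ordered like the sorted labels 1, 2, 3
feat = np.array([10.0, 20.0, 40.0])

# Rescale to 1..255 so that 0 stays reserved for the background
span = feat.max() - feat.min()
if span == 0:
    # All cells share the same value; avoid dividing by zero
    scaled = np.full(feat.shape, 255, dtype=np.uint8)
else:
    scaled = (1 + (feat - feat.min()) / span * 254).round().astype(np.uint8)

# Lookup table: index = label ID, entry = color value (index 0 = background)
lut = np.zeros(seg.max() + 1, dtype=np.uint8)
lut[np.unique(seg)[1:]] = scaled

# Integer-array indexing recolors all pixels at once
seg_colored = lut[seg]
print(seg_colored)
# [[  0   1   1]
#  [ 86  86   0]
#  [  0 255 255]]
```

For large images with many cells, the lookup table avoids one full-image comparison per cell.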

## 4. Multi-Dimensional Analysis <a id=MDA></a>

Whilst simple plots and summary statistics allow the investigation of individual measures and their relationships, the true power of large multi-dimensional datasets lies in the combined use of all the extracted features.

Multi-dimensional data analysis closely intersects with the field of *machine learning*. As in machine learning, two types of multi-dimensional analysis can be distinguished:

- **Unsupervised methods** investigate the structure of the dataset to find patterns, such as clusters of similar cells.
    - Here, we will...
        - ...visualize the diversity of the cells in the "phenotype space" using PCA and tSNE
        - ...cluster the cells into phenotypically similar groups using k-means clustering
        - ...visualize cluster relationships and properties using a minimum spanning tree


- **Supervised methods** relate the data to some pre-determined external piece of information, for example the classification of specific cell types based on pre-annotated training data. 
    - Here, we will...
        - ...classify cells into mitotic and non-mitotic based on their phenotype, using the pH3 marker to create the pre-annotated training data
        - ...analyze the differences between mitotic and non-mitotic cells

### Feature Standardization <a id=featstand></a>

Before doing any analysis, the different features/dimensions of the data need to be normalized such that they all can equally contribute to the analysis. Without normalization, the area of a cell might contribute more than the circumference, simply because the numbers measuring area are generally larger than those measuring circumferences - not because the area necessarily encodes more information.

The most common normalization is called `normalization to zero mean and unit variance`, also known simply as `standardization` or `standard scaling` ([wiki](https://en.wikipedia.org/wiki/Feature_scaling#Standardization)). For each dimension, the mean is subtracted and the result is divided by the standard deviation, which makes the 'unit' of the axes into 'unit variance' and therefore encodes the relative differences of cells more than the absolute magnitude of values.
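
As a small worked example of the formula described above, the following sketch standardizes a toy matrix by hand and compares the result to scikit-learn's `StandardScaler`. The numbers are made up; the first column plays the role of a large-valued feature such as area.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 4 cells, 2 features on very different scales
X = np.array([[100.0, 1.0],
              [200.0, 2.0],
              [300.0, 3.0],
              [400.0, 6.0]])

# Manual standardization: subtract the column mean, divide by the column std
# (np.std uses the population standard deviation, ddof=0, by default)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same operation with scikit-learn
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))   # both approaches agree
print(scaled.mean(axis=0))           # approximately 0 for every feature
print(scaled.std(axis=0))            # approximately 1 for every feature
```

After scaling, both features span a comparable range regardless of their original units.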

<font color=green>**Exercise:**</font> In what situations might standardization be problematic? Can you think of (and implement) alternatives that might work better in such situations?

In [None]:
# Remove non-numerical columns (here only imageID)
data_df = df.select_dtypes([np.number])

# Show boxplot before standardization
fig = plt.figure(figsize=(12, 3))
data_df.boxplot(grid=False)
fig.autofmt_xdate()
plt.show()

# Standardize to zero mean and unit variance
scaled  = StandardScaler().fit_transform(data_df)
data_df = pd.DataFrame(scaled, index=data_df.index, columns=data_df.columns)

# Show boxplot after standardization
fig = plt.figure(figsize=(12, 3))
data_df.boxplot(grid=False)
fig.autofmt_xdate()
plt.show()

### Dimensionality Reduction by PCA <a id=pca></a>

... ([wiki](https://en.wikipedia.org/wiki/Principal_component_analysis))

In [None]:
### PCA

# Perform PCA
pca = PCA()
pca.fit(data_df)
pca_df = pd.DataFrame(pca.transform(data_df), 
                      index=data_df.index, 
                      columns=['PC'+str(i) for i in range(1,data_df.shape[1]+1)])

# Look at explained variance ratio
plt.figure(figsize=(12,3))
plt.plot(pca.explained_variance_ratio_)
plt.xlabel('PCs'); plt.ylabel('expl_var_ratio')
plt.show()

# Truncate to remove unimportant PCs
pca_df = pca_df.iloc[:, :15]
pca_df.head()
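
Instead of picking a fixed number of PCs by eye, one option is to keep as many components as are needed to reach a chosen fraction of the total variance. The sketch below shows this on synthetic data; the 95% threshold is an arbitrary assumption, not a recommendation from this tutorial.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 10 features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# Fit PCA with all components and accumulate the explained variance ratio
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print('Cumulative explained variance:', np.round(cumulative, 3))
print('Components kept:', n_keep)
```

scikit-learn can also do this selection directly when a fraction is passed, as in `PCA(n_components=0.95)`.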

In [None]:
### Plot PCs in interactive scatterplot

# Set interactivity
@interact(x = list(pca_df.columns),
          y = list(pca_df.columns),
          color = list(data_df.columns))
def make_interactive_scatterplot(x=pca_df.columns[0], 
                                 y=pca_df.columns[1],
                                 color=data_df.columns[0]):
    
    # Prep
    fig = plt.figure(figsize=(8,8))
    
    # Create scatterplot
    plt.scatter(pca_df[x], pca_df[y], s=20,
                c=data_df[color], alpha=0.5)
    
    # Labels
    plt.xlabel(x)
    plt.ylabel(y)
    
    # Limits
    plt.xlim([np.percentile(pca_df[x], 0.5), np.percentile(pca_df[x], 99.5)])
    plt.ylim([np.percentile(pca_df[y], 0.5), np.percentile(pca_df[y], 99.5)])
    
    # Done
    plt.show()

### Dimensionality Reduction by tSNE <a id=tsne></a>

... ([wiki](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding))

In [None]:
### tSNE

# Random subsampling of cells
sample = np.random.choice(np.arange(data_df.shape[0]), 2000, replace=False)

# Perform tSNE
# WARNING: The metaparameters (in particular perplexity) matter a lot for tSNE!
#          See https://distill.pub/2016/misread-tsne/ for more information!
# NOTE:    From scikit-learn 1.5 onward, `n_iter` is named `max_iter`.
tsne = TSNE(n_components=2, perplexity=30.0, learning_rate=200.0, n_iter=2000) 
tsne_df = pd.DataFrame(tsne.fit_transform(pca_df.iloc[sample, :]), 
                       index=data_df.iloc[sample,:].index, 
                       columns=['tSNE1', 'tSNE2'])
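
Because perplexity matters so much, it is usually worth computing the embedding for several values and comparing the results side by side. The following is a minimal sketch on synthetic data; the two perplexity values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data: two well-separated blobs of 40 points each in 5 dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 5)),
               rng.normal(8.0, 1.0, size=(40, 5))])

# Compute one embedding per perplexity value
# (perplexity must be smaller than the number of samples)
embeddings = {}
for perplexity in (5, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)

for perplexity, emb in embeddings.items():
    print('perplexity', perplexity, '-> embedding shape', emb.shape)
```

Plotting each entry of `embeddings` in its own scatterplot shows which structures are stable across perplexities and which are not.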

In [None]:
### Plot tSNEs in interactive scatterplot

# Set interactivity
@interact(color = list(data_df.columns))
def make_interactive_scatterplot(color=data_df.columns[0]):

    # Prep
    fig = plt.figure(figsize=(8,8))

    # Create scatterplot
    plt.scatter(tsne_df['tSNE1'], tsne_df['tSNE2'], s=20,
                c=data_df.iloc[sample,:][color], alpha=0.5)

    # Labels
    plt.xlabel('tSNE1')
    plt.ylabel('tSNE2')

    # Done
    plt.show()

### Clustering with k-Means <a id=cluster></a>

... ([wiki](https://en.wikipedia.org/wiki/K-means_clustering))

In [None]:
### Simple k-means

# Perform clustering and get cluster labels
kmeans = KMeans(n_clusters=12)
kmeans.fit(pca_df)

# Get labels and add to df
labels = kmeans.labels_
df['cluster'] = labels
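
The number of clusters is a free parameter of k-means. One common heuristic for choosing it (among several) is the silhouette score, sketched here on synthetic data with three obvious groups. On real single-cell data the optimum is often far less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three compact, well-separated 2D blobs of 50 points each
rng = np.random.default_rng(2)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Cluster with different k and record the mean silhouette score for each
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels_k)

# The k with the highest score is the heuristic's suggestion
best_k = max(scores, key=scores.get)
print(scores)
print('Suggested number of clusters:', best_k)
```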

<font color=green>**Exercise:**</font> There are many unsupervised clustering algorithms available in scikit-learn and all of them are easy to use in the same way as KMeans. Find and implement another one and think about ways of comparing the results of the two.

In [None]:
### Plot tSNE with KMeans labels colored

# Prep
fig = plt.figure(figsize=(8,8))

# Create scatterplot
plt.scatter(tsne_df['tSNE1'], tsne_df['tSNE2'], s=20,
            c=labels[sample], edgecolor='face', cmap='Set1')

# Labels
plt.xlabel('tSNE1')
plt.ylabel('tSNE2')

# Done
plt.show()

In [None]:
### Interactive boxplot grouped by cluster

# Set interactivity
@interact(feature=list(data_df.columns))
def make_interactive_box(feature=data_df.columns[0]):
    
    # Create boxplot
    df.boxplot(by='cluster', column=feature, grid=False, figsize=(12,6))
    
    # Formatting
    plt.xlabel('Cluster', fontsize=18)
    plt.ylabel(feature, fontsize=18)
    plt.suptitle('')
    plt.title('')
    
    # Done
    plt.show()

### Cluster Visualization by Minimum Spanning Tree <a id=mst></a>

... ([wiki](https://en.wikipedia.org/wiki/Minimum_spanning_tree))

In [None]:
### Create graph based on pairwise distance between cluster centers

# Adjacency matrix
dists = squareform(pdist(kmeans.cluster_centers_))

# Graph from adjacency matrix
G = nx.from_numpy_array(dists)

# Minimum Spanning Tree
T = nx.minimum_spanning_tree(G)

# Show
fig, ax = plt.subplots(1, 2, figsize=(12,4))
nx.draw(G, ax=ax[0])
nx.draw(T, ax=ax[1])
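
To see what the minimum spanning tree does, consider a tiny example with four points on a line. The points are made up, and the sketch assumes a networkx version that provides `from_numpy_array`.

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

# Four points on a line (hypothetical "cluster centers")
points = np.array([[0.0], [1.0], [3.0], [6.0]])

# Full pairwise distance matrix, used as a weighted adjacency matrix
dists = squareform(pdist(points))

# Fully connected graph: 4 nodes, 6 weighted edges
G = nx.from_numpy_array(dists)

# The MST keeps the n-1 edges that connect all nodes at minimum total weight
T = nx.minimum_spanning_tree(G)
tree_edges = sorted((min(u, v), max(u, v)) for u, v in T.edges())
total_weight = T.size(weight='weight')

print(G.number_of_edges(), '->', T.number_of_edges())   # 6 -> 3
print(tree_edges)                                       # [(0, 1), (1, 2), (2, 3)]
print(total_weight)                                     # 6.0
```

Only neighboring points remain connected, which is why the tree is a useful summary of which clusters are most similar to each other.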

In [None]:
### Interactive display of minimum spanning tree of clusters

# Get positions
pos = nx.fruchterman_reingold_layout(T, random_state=46)

# Get mean data per cluster (numeric columns only; imageID cannot be averaged)
cluster_df = df.groupby('cluster').mean(numeric_only=True)

# Set interactivity
@interact(feature=list(data_df.columns))
def make_interactive_MST(feature=data_df.columns[0]):
    
    # Prep
    plt.figure(figsize=(12,6))
    
    ## Draw network
    #nx.draw(T, pos=pos, width=2,
    #        node_color = cluster_df.iloc[np.array(T.nodes)][feature],
    #        node_size  = df.groupby('cluster').count().iloc[:, 0],
    #        edge_color = [e[-1]['weight'] for e in T.edges(data=True)])
    
    # Draw edges
    p_edges = nx.draw_networkx_edges(T, pos=pos, width=3, edge_color='gray')
    
    # Draw nodes
    nodes = nx.draw_networkx_nodes(T, pos=pos, node_size=500,
                                   node_color=cluster_df.iloc[np.array(T.nodes)][feature])
    
    # Add colorbar
    cbar = plt.colorbar(nodes)
    cbar.set_label(feature, labelpad=10, fontsize=18)
    cbar.ax.tick_params(labelsize=14)
    
    # Formatting
    plt.axis('off')
    
    # Done
    plt.show()

### Classification of Mitotic Cells <a id=mitotic></a>

... SVM ([wiki](https://en.wikipedia.org/wiki/Support_vector_machine)) ([sklearn](http://scikit-learn.org/stable/modules/svm.html))

In [None]:
### Use pH3 signal to create ground truth labels (True: "in mitosis" | False: "not in mitosis") 

# Check pH3 signal distribution with histogram
plt.figure(figsize=(12,4))
plt.hist(df['nuc-mean_intensity-pH3'], bins=50)
plt.xticks(range(0,130,5))
plt.ylim([0, 500])
plt.show()

# Create ground truth
ground_truth = (df['nuc-mean_intensity-pH3'] > 20).values
print( ground_truth )

In [None]:
### Split into training and test set

out = train_test_split(pca_df, ground_truth, test_size=0.3, random_state=43, stratify=ground_truth)
X_train, X_test, y_train, y_test = out
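
The `stratify` argument matters when one class is rare, as is typically the case for mitotic cells. The sketch below uses made-up labels with 10% positives to show that a stratified split keeps the class ratio nearly identical in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, only 10 of which are positive
X = np.arange(200).reshape(100, 2)
y = np.array([True] * 10 + [False] * 90)

# Stratified split: the class proportions of y are preserved in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

print('Positive fraction overall: ', y.mean())
print('Positive fraction in train:', y_tr.mean())
print('Positive fraction in test: ', y_te.mean())
```

Without stratification, a small test set could by chance contain very few positives, which would make the evaluation unreliable.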

In [None]:
### Support Vector Classification

# Train linear SVC on training data
svc = LinearSVC()
svc.fit(X_train, y_train)

# Predict on test data
y_pred = svc.predict(X_test)

In [None]:
### Check how well it worked

# Compute accuracy: (TP+TN)/ALL
accuracy = np.sum(y_pred==y_test) / y_pred.size
print( "Accuracy: ", accuracy )

# Compute precision: TP/(TP+FP)
precision = np.sum( (y_pred==1) & (y_pred==y_test) ) / np.sum(y_pred)
print( "Precision:", precision )

# Confusion matrix
cmat = confusion_matrix(y_test, y_pred)

# Show
plt.imshow(cmat, interpolation='none', cmap='Blues')
for (i, j), z in np.ndenumerate(cmat):
    plt.text(j, i, z, ha='center', va='center')
plt.xticks([0,1], ["Non-Mitotic", "Mitotic"])
plt.yticks([0,1], ["Non-Mitotic", "Mitotic"], rotation=90)
plt.xlabel("prediction")
plt.ylabel("ground truth")
plt.show()

# Note: This already works very well with just a linear SVC. In practice, a non-linear
#       SVC (with a so-called 'RBF' kernel) is often better suited, which will require
#       hyper-parameter optimization to yield the best possible results!
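
Accuracy, precision and recall are easy to confuse, so here is a small worked example with hand-made labels (not taken from the dataset). It computes each measure from the four confusion matrix entries and cross-checks precision and recall against scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hand-made labels (hypothetical): 3 true positives among 10 samples
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
y_hat  = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0, 0], dtype=bool)

# The four entries of the confusion matrix
tp = np.sum( y_hat &  y_true)   # predicted positive, actually positive -> 2
fp = np.sum( y_hat & ~y_true)   # predicted positive, actually negative -> 2
fn = np.sum(~y_hat &  y_true)   # predicted negative, actually positive -> 1
tn = np.sum(~y_hat & ~y_true)   # predicted negative, actually negative -> 5

accuracy  = (tp + tn) / y_true.size   # fraction of correct predictions: 0.7
precision = tp / (tp + fp)            # fraction of predicted positives that are real: 0.5
recall    = tp / (tp + fn)            # fraction of real positives that were found: 2/3

print(accuracy, precision, recall)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

With rare classes such as mitotic cells, accuracy alone can look excellent even if most positives are missed, which is why precision and recall are worth reporting alongside it.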

In [None]:
### Cross validation

# Run cross-validation
cross_val = cross_validate(svc, pca_df, ground_truth, cv=5, scoring=['accuracy', 'precision'])

# Print results
print( cross_val['test_accuracy'] )
print( cross_val['test_precision'] )
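
In this tutorial, standardization and PCA were fit on the full dataset before cross-validation. A stricter variant, sketched here on synthetic data, wraps all preprocessing steps and the classifier in a scikit-learn pipeline so that each step is re-fit on the training folds only. The dataset and the number of components are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced two-class dataset (stand-in for real feature data)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)

# Scaling, PCA and the classifier are treated as one model,
# so no information from the test folds leaks into the preprocessing
model = make_pipeline(StandardScaler(), PCA(n_components=10), LinearSVC())

# 5-fold cross-validation with several scoring measures at once
results = cross_validate(model, X, y, cv=5,
                         scoring=['accuracy', 'precision', 'recall'])

print('Accuracy: ', np.round(results['test_accuracy'], 3))
print('Precision:', np.round(results['test_precision'], 3))
print('Recall:   ', np.round(results['test_recall'], 3))
```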

In [None]:
### Highlighting in tSNE plot

# Prep
fig = plt.figure(figsize=(8,8))

# Create scatterplot
plt.scatter(tsne_df['tSNE1'], tsne_df['tSNE2'], s=20,
            c=svc.predict(pca_df)[sample], edgecolor='face', cmap='Set1_r')

# Labels
plt.xlabel('tSNE1')
plt.ylabel('tSNE2')

# Done
plt.show()

### Grouped Analysis and Hypothesis Testing <a id=grouped_and_hypot></a>

...

In [None]:
### Add mitotic label to df

# Predict for everyone
mitotic = svc.predict(pca_df)

# Add to df
df['mitotic'] = mitotic

In [None]:
### Grouped interactive boxplot

# Set interactivity
@interact(feature=list(data_df.columns))
def make_interactive_box(feature=data_df.columns[0]):
    
    # Create boxplot
    df.boxplot(by='mitotic', column=feature, grid=False, figsize=(4,6), fontsize=16, widths=0.6)
    
    # Formatting
    plt.xlabel('mitotic', fontsize=18)
    plt.ylabel(feature, fontsize=18)
    plt.suptitle('')
    plt.title('')
    
    # Done
    plt.show()

In [None]:
### Simple hypothesis tests

from scipy.stats import mannwhitneyu

# Check if solidity is greater in mitotic cells
s,p = mannwhitneyu(df.loc[ df['mitotic']]['cyt-solidity-act'],
                   df.loc[~df['mitotic']]['cyt-solidity-act'],
                   alternative='greater')
print( 'MWU p-value:', p )

# Check if area is greater in mitotic cells
s,p = mannwhitneyu(df.loc[ df['mitotic']]['cyt-area-act'],
                   df.loc[~df['mitotic']]['cyt-area-act'],
                   alternative='greater')
print( 'MWU p-value:', p )

# WARNING: Large sample numbers tend to yield 'significant' p-values even for very small
#          (and possibly only technical) differences. Be very careful in interpreting 
#          these measures and ask your resident statistician for complementary approaches
#          to validate your results (e.g. effect size measures such as Cohen's d, or 
#          sampling-based methods such as bootstrapping).
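
The two complementary approaches mentioned in the warning above can be sketched with plain numpy. Both functions are hypothetical helpers written for this illustration, and the data is synthetic; treat this as a starting point rather than a validated statistical workflow.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def bootstrap_mean_diff_ci(a, b, n_boot=2000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement and recompute the difference
        diffs[i] = rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
    lower = (100 - ci) / 2
    return np.percentile(diffs, [lower, 100 - lower])

# Synthetic groups whose true means differ by one standard deviation
rng = np.random.default_rng(3)
group_a = rng.normal(1.0, 1.0, size=300)
group_b = rng.normal(0.0, 1.0, size=300)

d = cohens_d(group_a, group_b)
ci_low, ci_high = bootstrap_mean_diff_ci(group_a, group_b)
print("Cohen's d:", d)
print('95% CI of the mean difference:', ci_low, ci_high)
```

Unlike a p-value, the effect size does not grow with sample size, and the confidence interval indicates how large the difference plausibly is.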

<font color=green>**Exercise:**</font> There are many hypothesis tests available in `scipy.stats`. See if you can do a t-test instead of Mann-Whitney U for the data above (but don't forget that you first have to check if the data fits the assumptions of a t-test!)