# Uncorking the I/O Bottleneck of Bio-Imaging

## Part 3 - Analysis of Whole Slide Image Features

Now that we have reduced the dimensionality of our image data to a more manageable size, we can load the features that were saved as a Pandas dataframe in the previous lab and look at a few methods that suit this type of data. Rather than using the standard Pandas dataframe, we are going to use the RAPIDS equivalent, cuDF (CUDA dataframe). This loads the data into GPU memory rather than the host system's RAM. As you will see, this opens up a new realm of possibilities because of the large speed boost it can provide.

As usual, we start by importing the libraries that we'll need. You will notice a few new names here, such as cuDF, cuGraph and cuML. These are the core of the RAPIDS tools, offering GPU-accelerated dataframe functionality, GPU-accelerated graph analytics and GPU-accelerated machine learning routines.

You will find documentation on all of these libraries and features here: https://docs.rapids.ai/api

In [None]:
%matplotlib inline
#from sklearn.decomposition import PCA
from cuml import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import pandas as pd
import cudf
import cugraph
import cuml
import numpy as np

We start by loading the wsi_dfx file, which is a dataframe containing all of the features from the tiles from the last notebook. Once we have loaded this, we immediately create a cuDF version of the dataframe, which puts it on the GPU. The cdf.head() call displays the first few rows, and you can see that we have 32 variables named latent var (0-31) and an X and Y coordinate.

In [None]:
# import the data we saved in our other notebook
df = pd.read_pickle('data/wsi_dfx')

#create a cuda dataframe from the pandas one
cdf = cudf.from_pandas(df)

# show the first few rows
cdf.head()

We can also display some summary statistics for the dataframe columns.

In [None]:
# get some stats on the data
cdf.describe()

We can find the principal components of these feature vectors, which is another way to distill the important information in these features into fewer variables.

In [None]:
# get the column names
feat_cols = cdf.columns

pca = PCA(n_components=3)
pca_result = pca.fit_transform(df[feat_cols[0:32]].values)
pca_df = cudf.DataFrame(pca_result,columns=["pca-1", "pca-2", "pca-3"])

print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
print(pca_df)

So there are a few things going on in that last cell. Let's unpick what we did. 

First, we stored the dataframe's column labels in feat_cols, which we used to select the first 32 columns and their values.

We created a [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) object with 3 principal components. Then we used its fit_transform method to actually determine the 3 principal components.

Finally, we created a new dataframe and provided names for the new PCA columns.
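To make the fit_transform step less of a black box, here is a minimal CPU sketch of PCA using NumPy's SVD. It is an illustration of the technique, not how cuML implements it. The helper name `pca_numpy` and the random stand-in data are assumptions made for this example.

```python
import numpy as np

def pca_numpy(X, n_components):
    """Project X onto its top principal components using an SVD."""
    X_centered = X - X.mean(axis=0)            # PCA works on mean-centred data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]             # principal directions, one per row
    projected = X_centered @ components.T      # coordinates in the new basis
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return projected, ratio

rng = np.random.default_rng(0)
# Hypothetical stand-in for the latent features: 1,000 rows x 32 columns
X = rng.standard_normal((1000, 32))
X[:, 0] *= 10.0  # give one feature far more variance than the others

projected, ratio = pca_numpy(X, n_components=3)
print(projected.shape)
print(ratio)
```

Because one column was scaled up, the first component should capture most of the variance, which is the same information that explained_variance_ratio_ reports in the cell above.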

To check your understanding:

1) Try printing the first 10 columns in the feat_cols and determine what data type is being used. 

2) Print out the memory_usage of the dataframe and see how much memory the Index adds (see the documentation here https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.DataFrame.memory_usage.html)

3) See if you can generate a new PCA with 4 principal components, but only using the first 16 columns and the first 10,000 rows

([Solution](solutions/solution3_1.py))

In [None]:
# TODO - use this cell to answer the questions above


Another great visualisation tool is a correlation matrix rendered as a heatmap, which shows the correlation between each pair of features.

In [None]:
figsize = (34,34)
sns.set(rc={'figure.figsize':figsize}, style = 'white')

data_corr = df.corr()
ax =sns.heatmap(data=round(data_corr,3),
    cmap=sns.diverging_palette(150, 275, s=80, l=55, n=9), annot=True,fmt='.1g', linewidth = .5,annot_kws={"size":8})
b, t = plt.ylim()
b += 0.5
t -= 0.5
plt.ylim(b,t)
plt.show()

# delete the dataframe to conserve memory
del(data_corr)

What you should observe is that the purple diagonal is showing that each feature is perfectly correlated with itself (as you'd expect!). You can also see that there are a few features that do seem to be correlated with each other.
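As a small illustration of what the heatmap encodes, the sketch below builds a Pearson correlation matrix for three synthetic variables with NumPy. The variables `a`, `b` and `c` are made up for this example and are not part of the lab data.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(500)
b = 2.0 * a + 0.1 * rng.standard_normal(500)  # strongly correlated with a
c = rng.standard_normal(500)                   # independent of a and b

# np.corrcoef treats each row as one variable
corr = np.corrcoef(np.vstack([a, b, c]))
print(np.round(corr, 2))
```

The diagonal is 1 because every variable is perfectly correlated with itself, the a/b entry is close to 1, and the entries involving c stay near 0.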

To check your understanding, which is the only latent feature that seems to have any correlation to the x or y coordinate? 

(answer: Latent Var 30 with x coordinate)

We can also plot the principal components in a scatter plot. The cell below passes the PCA result straight to cudf.DataFrame, which places the array in GPU memory. Numba, a Python-based function compiler that can create multi-core or GPU-accelerated versions of ordinary Python code, offers another way to move a NumPy array into GPU memory, but it is not needed here.

You will also notice that we are importing cuxfilter, another RAPIDS component, which provides visualization capabilities.

In [None]:
from cuxfilter import DataFrame as fdf
from cuxfilter.charts import scatter

cdf_pca = cudf.DataFrame(pca_result, columns=["pca-1", "pca-2", "pca-3"])

cuxdf = fdf.from_dataframe(cdf_pca)
scatter_chart = scatter(x="pca-1", y="pca-3", pixel_shade_type="linear")

cuxdf.dashboard([scatter_chart])
scatter_chart.view()


There are other types of plot we can do, including 3D plots.

In [None]:
# Use Matplotlib - which means we need to copy the data 
# back to the CPU - hence the .to_numpy() calls
rndperm = np.random.permutation(len(pca_df))

# shuffle the rows once and reuse the result for every axis
shuffled = pca_df.loc[rndperm, :]

ax = plt.figure(figsize=(16,10)).add_subplot(projection='3d')
ax.scatter(
    xs=shuffled["pca-1"].to_numpy(),
    ys=shuffled["pca-2"].to_numpy(),
    zs=shuffled["pca-3"].to_numpy(),
    c=shuffled["pca-3"].to_numpy(),
    cmap='tab10'
)
ax.set_xlabel('pca-1')
ax.set_ylabel('pca-2')
ax.set_zlabel('pca-3')
plt.show()

In this case, the visualization is not showing us anything particularly insightful, but these types of plot can be very useful for gaining insights into data.

A more interesting idea to explore is how the tiles on the whole slide image relate to one another in terms of the features that we calculated with our VAE (i.e. the latent features). It can often be insightful to classify image features, such as nuclei, and then build a graph in which each nucleus is a node linked by edges to its nearest neighbours.

In order to turn our data into some sort of graph, we can use the nearest neighbour algorithm in cuML to find the nearest neighbours in feature space.
With around 0.5 million items, each with 32 features, finding the 5 nearest neighbours is a sizeable computation. 

In [None]:
from cuml.neighbors import NearestNeighbors as cuNN

knn_cuml = cuNN()
no_xy=cdf[feat_cols[0:32]]
knn_cuml.fit(no_xy)

%time D_cuml, I_cuml = knn_cuml.kneighbors(no_xy, 5)
I_cuml, D_cuml

What this has produced is a list of the 5 nearest neighbours for each tile in latent feature space, since we only used the first 32 columns in the dataframe and thereby excluded the x and y coordinates. So what do we mean by 'nearest neighbour in feature space'? You can think of each feature as an axis, and the value of each feature places the parent tile somewhere along this axis. The KNN (k-nearest neighbours) algorithm uses a Euclidean distance calculation, which tells us how close each node is to every other node. Because we chose 5 as the number of nearest neighbours, each row represents one node and has five columns containing the indexes of the 5 nearest nodes, with the nearest in column 0 and the furthest in column 4. You will also notice that the index in the nearest column, 0, always matches the row index. That's because the algorithm does not exclude each node from being its own nearest neighbour. We can ignore that column.

We are looking at the indexes in the I_cuml dataframe. To look at the feature-space distances you should look at the D_cuml dataframe.
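The behaviour described above, including the self-match in column 0, can be reproduced on a small scale with a brute-force NumPy sketch. This is a CPU illustration on made-up random data, not the cuML implementation, and `knn_brute_force` is a hypothetical helper name.

```python
import numpy as np

def knn_brute_force(X, k):
    """Return (distances, indices) of the k nearest rows for every row of X."""
    # Euclidean distance between every pair of rows: shape (n, n)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sort each row by distance and keep the k closest
    idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
    d = np.take_along_axis(dist, idx, axis=1)
    return d, idx

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))  # 200 hypothetical tiles, 32 features each

distances, indices = knn_brute_force(X, k=5)
print(indices[:3])
print(distances[:3].round(2))
```

Column 0 of `indices` equals the row number and column 0 of `distances` is zero, because each point is its own nearest neighbour. The pairwise distance matrix grows with the square of the number of rows, which is why this approach becomes expensive at the scale of the full dataset.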

If you want to compare this with the sklearn CPU implementation, be aware that it can take > 30 minutes to run! It is not necessary to execute it - the code is there as a reference. 

In [None]:
# Only run this and the following cell to compare with CPU version
from sklearn.neighbors import NearestNeighbors as skNN

df_1=df[feat_cols[:32]]
knn_sk = skNN(algorithm='brute',n_jobs=1)
knn_sk.fit(df_1)

# Only uncomment the lines below if you have a spare 30 mins to wait for it to complete!
#%time D_sk, I_sk = knn_sk.kneighbors(df_1, 5)
#I_sk

In order to convert the output of the KNN operation into a graph, we need to prepare the data. The data needs to be presented to the RAPIDS graph library, cugraph, as a set of edges with the source and destination node and an optional weight parameter.

First, we combine the nearest neighbour indexes and distances into one dataframe and give them unique column names. We do this so that we can use the distance to set the weight of the connection between the nodes.

In [None]:
# give the columns names because they have to be unique in the merged dataframe
I_cuml.columns=['ix1','n1','n2','n3','n4']
D_cuml.columns=['ix2','d1','d2','d3','d4']
all_cols = cudf.concat([I_cuml, D_cuml],axis=1)

# remove the index and distance from the self-referenced nearest neighbour
all_cols = all_cols[['n1','n2','n3','n4','d1','d2','d3','d4']]

all_cols 

So the next step is to manipulate this data so that it is in the desired format. There should be 3 columns, named 'source', 'target' and 'weight'.

To do this, you will need to extract 4 sets of columns - one for each neighbour - and then concatenate the rows into a new dataframe.

Remember that each row index represents a node, the n1-n4 columns contain the row index of a destination node and the d1-d4 columns contain the distance between these nodes. 

In [None]:
# Reformat the data to match the way edges are defined in cuGraph
all_cols['index1'] = all_cols.index

c1 = all_cols[['index1','n1','d1']]
c1.columns=['source','target','weight']
c2 = all_cols[['index1','n2','d2']]
c2.columns=['source','target','weight']
c3 = all_cols[['index1','n3','d3']]
c3.columns=['source','target','weight']
c4 = all_cols[['index1','n4','d4']]
c4.columns=['source','target','weight']
                 
edges = [c1,c2,c3,c4]

edge_df = cudf.concat(edges)

# remove the old dataframe from memory
del(all_cols)

edge_df = edge_df.reset_index()
edge_df = edge_df[['source','target','weight']]
edge_df
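The same wide-to-long reshaping can be sketched with plain NumPy arrays, which may help when checking the result by hand. The tiny `neighbours` and `distances` arrays below are invented for illustration. The rows come out grouped by source node rather than by neighbour rank, which does not matter for an edge list.

```python
import numpy as np

# Hypothetical kNN output for 3 nodes, self-neighbour column already dropped
neighbours = np.array([[1, 2],
                       [0, 2],
                       [1, 0]])
distances = np.array([[0.5, 0.9],
                      [0.5, 0.7],
                      [0.7, 0.9]])

n_nodes, k = neighbours.shape
source = np.repeat(np.arange(n_nodes), k)  # row index, once per neighbour
target = neighbours.ravel()                # neighbour index for each edge
weight = distances.ravel()                 # distance for each edge

edges = np.column_stack([source, target, weight])
print(edges)
```

Each of the `n_nodes * k` rows is one edge in (source, target, weight) order, matching the three columns of edge_df.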

Lastly, if we are going to use the weight to represent the strength of the connection between nodes, then we need to invert the weight value. Otherwise the nodes with the greatest distance between them would have the strongest connections, which is the opposite of what we want. To do this we need to manipulate the data in the dataframe.

See if you can invert the weights by manipulating the data in the edge_df DataFrame ([solution](solutions/solution3_2.py))

In [None]:
# TODO Invert the weight values so that larger distances create weaker weights


This dataframe is now ready to be used to generate the graph. For this we use the cugraph library.

In [None]:
# now we can actually create a graph!!
G = cugraph.Graph()

%time G.from_cudf_edgelist(edge_df,source='source', destination='target', edge_attr='weight', renumber=True)

Once we have the graph we can do standard graph analytical operations.

In [None]:
#now we can do some graph things
count = cugraph.triangles(G)
print("No of triangles = " + str(count))

coreno = cugraph.core_number(G)
print("Core Number = " + str(coreno))
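For readers new to graph analytics, the sketch below shows what a triangle count means on a tiny undirected graph, in pure Python. It is a naive illustration that checks every triple of nodes, so it is only practical for small graphs. It makes no claim about how cugraph computes or returns its result, and `count_triangles` is a hypothetical helper.

```python
from itertools import combinations

def count_triangles(edges):
    """Count sets of three nodes that are all connected to each other."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    triangles = 0
    for u, v, w in combinations(sorted(adjacency), 3):
        if v in adjacency[u] and w in adjacency[u] and w in adjacency[v]:
            triangles += 1
    return triangles

# Two triangles (0-1-2 and 2-3-4) plus one dangling edge (4-5)
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4), (4, 5)]
print(count_triangles(edges))  # prints 2
```

In a nearest-neighbour graph, a high triangle count suggests that neighbours of a tile tend to be neighbours of each other, which is a sign of tightly clustered regions in feature space.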

cugraph is also able to use the force_atlas2 algorithm to generate a layout for the graph, which can allow us to visualise it. This algorithm uses a physics-based approach, treating the weights between the nodes as springs. The output is a new dataframe with the x and y coordinates of each node, which can then be fed into a visualiser.

In [None]:
nodes_ = cugraph.layout.force_atlas2(G, max_iter=500,
                strong_gravity_mode=False,
                outbound_attraction_distribution=True,
                lin_log_mode=False,
                barnes_hut_theta=0.5, verbose=True)
nodes_
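To give a feel for the physics-based idea, here is a toy force-directed layout in NumPy: every pair of nodes repels, and every edge pulls its endpoints together like a spring. This is a generic sketch and not the ForceAtlas2 algorithm, which adds gravity, Barnes-Hut approximation and other refinements. The function name, step size and iteration count are assumptions chosen for this example.

```python
import numpy as np

def toy_force_layout(n_nodes, edges, weights, n_iter=200, step=0.05, seed=0):
    """Very small spring-and-repulsion layout; returns (n_nodes, 2) positions."""
    rng = np.random.default_rng(seed)
    pos = rng.standard_normal((n_nodes, 2))
    src, dst = edges[:, 0], edges[:, 1]
    for _ in range(n_iter):
        # Repulsion: every node pushes every other node away (strength ~ 1/distance)
        delta = pos[:, None, :] - pos[None, :, :]
        dist2 = (delta ** 2).sum(axis=-1) + 1e-9
        repulsion = (delta / dist2[..., None]).sum(axis=1)
        # Attraction: each edge pulls its two endpoints towards each other
        pull = weights[:, None] * (pos[dst] - pos[src])
        attraction = np.zeros_like(pos)
        np.add.at(attraction, src, pull)
        np.add.at(attraction, dst, -pull)
        pos = pos + step * (repulsion + attraction)
    return pos

# Two separate triangles: nodes 0-2 and nodes 3-5
edges = np.array([[0, 1], [1, 2], [0, 2], [3, 4], [4, 5], [3, 5]])
weights = np.ones(len(edges))
pos = toy_force_layout(6, edges, weights)

# Average distance along edges versus between the two groups
intra = np.linalg.norm(pos[edges[:, 0]] - pos[edges[:, 1]], axis=1).mean()
inter = np.mean([np.linalg.norm(pos[a] - pos[b])
                 for a in range(3) for b in range(3, 6)])
print(intra, inter)
```

Connected nodes settle close together while the two unconnected groups drift apart, which is the same qualitative effect that makes clusters visible in the full graph layout.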

Finally, we use cuxfilter to render the whole graph!

In [None]:
import cuxfilter.charts as cfc
import cuxfilter.layouts as clo

cux_df = fdf.load_graph((nodes_, edge_df))

chart0 = cfc.graph(edge_color_palette=['gray', 'black'],
                                            timeout=200, 
                                            node_aggregate_fn='mean', node_pixel_shade_type='linear',
                                            edge_render_type='direct',#other option available -> 'curved'
                                            edge_transparency=0.5
                                          )
d = cux_df.dashboard([chart0], layout=clo.double_feature)

# draw the graph
chart0.view()

You should be able to use the mouse wheel to zoom in and out of the graph plot. If you zoom in far enough you will see the individual vertices (coloured) and edges (grey lines).

As a follow-up exercise, can you generate a similar plot that shows the 2 nearest neighbours of each tile using their X and Y coordinates rather than feature space? See if you can generate one plot for the graph of nearest neighbours using the force atlas computation and one on which the tiles are shown in their actual locations.

If you do it correctly, then the latter should resemble the actual slide. If you zoom in, you may notice that the connections to the nearest neighbours are not always quite as you might expect. This is because, by default, the nearest neighbour algorithm uses heuristics to make the calculation faster. However, you can force it to use the brute force algorithm (see if you can find out how to do that from the documentation).

In [None]:
from cuml.neighbors import NearestNeighbors as cuNN
import cuxfilter.charts as cfc
import cuxfilter.layouts as clo
import pickle

# This loads the tiles created in the previous lab
with open("data/tiles_xy", "rb") as fp: 
    tiles_xy = pickle.load(fp)
    
tiles_xy_cdf=cudf.DataFrame(tiles_xy)

#TODO Complete the solution

([Solution](solutions/solution3_last.py)) 

We hope you enjoyed this course and discovered a few new techniques to apply to your own Bio-imaging challenges!