# Cell Analysis Worksheet #1

### Emaad Khwaja

<div class="alert alert-warning">
    <b>NOTE</b> This worksheet is meant to be challenging. It closely reflects a normal workflow for analyzing cell images. Expect to do a ton of Googling.
</div>

## Introduction

Below is an example of a cell barcode. Similar to how a digital image has RGB channels, we acquire images of fluorophores on different channels based on the colors they emit. Just as a unique color in an RGB image is produced by the ratio of the intensities in each color channel, we selectively insert unique fluorophore ratios into cells to identify them.

![Cell Barcode](Images/Barcode.jpg)

We want to determine which cells in an image belong to each unique population. The CSV file already lists the per-channel intensity for each cell in a particular image. This file contains a mixture of different cell types.

## Pandas Basics

![](Images/panda.png)

<div class="alert alert-warning">
    <b>NOTE</b> This analysis will rely on some external packages. You will need to import <i>numpy</i>, <i>pandas</i>, <i>sklearn</i>, and <i>matplotlib</i>.
</div>

0. Import the CSV file and convert it to a pandas dataframe. You will use this for the rest of the analysis, so pick a good name for it (e.g. `cell_dataframe`).

1. Use the `.head()` and `.tail()` methods to inspect the first and last rows of the dataframe.
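As a starting point, here is a minimal sketch of loading a CSV and peeking at it. It reads a tiny made-up table from a string so that it runs anywhere; the column names `Mask` and `DAPI Clean Intensity (Magnitude/px^2)` are assumptions for illustration, and the real file's columns may differ.

```python
import io

import pandas as pd

# Hypothetical stand-in for the worksheet's CSV file (made-up values).
csv_text = """Mask,Area (px^2),DAPI Clean Intensity (Magnitude/px^2)
1,120,10.5
2,95,3.2
3,210,7.8
"""

# With the real file, pass its path instead of a StringIO object.
cell_dataframe = pd.read_csv(io.StringIO(csv_text))

print(cell_dataframe.head(2))  # first two rows
print(cell_dataframe.tail(1))  # last row
```

Both methods take an optional row count and default to five rows.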

## Cell Stats

![](Images/histogram.png)

1. Using the Area (px^2) column, calculate the average cell area. What is the standard deviation?

2. Create a frequency histogram of the cell areas.

3. How big is the largest cell? What is the mask number corresponding to this cell?
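The pattern for these three questions might look like the sketch below, run on made-up areas. The `Mask` column name is an assumption; check the real dataframe for the column that holds the mask number.

```python
import numpy as np
import pandas as pd

# Toy data only; "Mask" is a hypothetical column name.
cells = pd.DataFrame({
    "Mask": [1, 2, 3, 4],
    "Area (px^2)": [100.0, 150.0, 200.0, 350.0],
})

area = cells["Area (px^2)"]
mean_area = area.mean()
std_area = area.std()  # pandas uses the sample standard deviation (ddof=1)

# idxmax() returns the index label of the largest value
largest_mask = cells.loc[area.idxmax(), "Mask"]
largest_area = area.max()

# Bin counts behind a frequency histogram; plt.hist(area, bins=5)
# would draw the same bins with matplotlib.
counts, edges = np.histogram(area, bins=5)

print(mean_area, std_area, largest_mask, largest_area)
print(counts, edges)
```

Note that `numpy.std` defaults to the population standard deviation (`ddof=0`), so the two libraries can give slightly different answers on the same column.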

## Population Identification

1. We need to calculate the relative ratio of each color channel. Create a new column in the pandas dataframe that holds, for each cell, the sum of every column whose name contains "Clean Intensity (Magnitude/px^2)".

2. Plot a frequency histogram of the summation column.

3. Now create 3 more columns which correspond to the relative ratio of each channel to the newly created sum column. 

4. Using the newly created relative ratio columns, create 3 scatterplots with the following axes:

a) DAPI vs FITC

b) FITC vs Alexa-594

c) DAPI vs Alexa-594

Bonus: Make a 3D scatter plot with all 3 color channels

How many cell populations do we appear to have based on these three plots?
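The sum, ratio, and scatter steps above can be sketched as follows on a made-up four-cell table. The full column names (channel name followed by "Clean Intensity (Magnitude/px^2)") and the new column names `Intensity Sum` and `... Ratio` are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed naming scheme: "<channel> Clean Intensity (Magnitude/px^2)"
suffix = " Clean Intensity (Magnitude/px^2)"
channels = ["DAPI", "FITC", "Alexa-594"]

cells = pd.DataFrame({
    "DAPI" + suffix: [8.0, 1.0, 2.0, 7.0],
    "FITC" + suffix: [1.0, 6.0, 2.0, 2.0],
    "Alexa-594" + suffix: [1.0, 3.0, 4.0, 1.0],
})

# Select the intensity columns by name, then sum them row by row
intensity_cols = list(cells.filter(like="Clean Intensity").columns)
cells["Intensity Sum"] = cells[intensity_cols].sum(axis=1)

# One relative-ratio column per channel
ratio_cols = {}
for ch in channels:
    ratio_cols[ch] = ch + " Ratio"
    cells[ratio_cols[ch]] = cells[ch + suffix] / cells["Intensity Sum"]

# Three pairwise scatter plots, first channel of each pair on the x-axis
pairs = [("DAPI", "FITC"), ("FITC", "Alexa-594"), ("DAPI", "Alexa-594")]
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (x_ch, y_ch) in zip(axes, pairs):
    ax.scatter(cells[ratio_cols[x_ch]], cells[ratio_cols[y_ch]])
    ax.set_xlabel(x_ch + " ratio")
    ax.set_ylabel(y_ch + " ratio")
fig.tight_layout()
```

Because the three ratios of a cell always add up to 1, any two of them determine the third, which is why the pairwise plots tend to show the same groups from different angles.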

## Principal Component Analysis

PCA is a dimension-reduction technique which can be extremely useful in identifying cell populations. In essence, it finds the directions along which the data varies the most, so that plotting the data along those directions shows the largest spread between datapoints. Each principal component is a weighted sum (a linear combination) of the original variables, and the components are mutually orthogonal. We will be using all columns as data sources to see if this provides better clustering.

https://en.wikipedia.org/wiki/Principal_component_analysis

![](Images/PCA.jpeg)
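To make the "weighted sum of the original variables" idea concrete, the sketch below runs PCA on made-up data and reproduces scikit-learn's projection by hand. The synthetic data is an assumption; only the `PCA` attributes `mean_`, `components_`, and `explained_variance_ratio_` come from scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: the second variable closely follows the first,
# the third is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2.0 * x + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.5, size=200),
])

pca = PCA(n_components=2).fit(data)
projected = pca.transform(data)

# Each row of components_ holds the weights of one principal component.
# Projecting is centering the data and taking those weighted sums.
manual = (data - pca.mean_) @ pca.components_.T

print(np.allclose(projected, manual))
print(pca.components_)
print(pca.explained_variance_ratio_)  # fraction of variance per component
```

The components are sorted by explained variance, so the first axis of a PCA scatter plot always carries the most spread.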

1. Use the following code to plot a PCA representation of the dataframe.

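```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# scaled_data_frame is assumed to hold the numeric columns of your dataframe,
# standardized beforehand (for example with sklearn.preprocessing.StandardScaler).

# Fit a two-component PCA and project the data onto it
pca = PCA(n_components=2)
pca.fit(scaled_data_frame)
x_pca = pca.transform(scaled_data_frame)

# x_pca keeps the same number of rows but has only two columns
print(x_pca.shape)
print(scaled_data_frame.shape)

plt.scatter(x_pca[:, 0], x_pca[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
```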
2. Sometimes a few very bright cells can dominate the analysis and hide the structure among dimmer cells. To counter this, we can apply a logarithmic function to compress the range of the intensity values. Apply the log function to the intensity columns and recalculate the proportions. Then re-plot the PCA using the same method shown above.
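One way the log step could look is sketched below on randomly generated intensities. Two things here are assumptions rather than part of the worksheet: the short column names, and the use of `np.log1p` (which computes log(1 + x)) in place of a plain log so that zero intensities do not produce negative infinity. For brevity the sketch feeds only the ratio columns into PCA, whereas the worksheet suggests using all columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up intensities with a long bright tail; column names are hypothetical.
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    rng.lognormal(mean=3.0, sigma=1.0, size=(50, 3)),
    columns=["DAPI", "FITC", "Alexa-594"],
)

# Compress the dynamic range, then recompute the per-cell proportions
log_intensity = np.log1p(raw)
log_ratio = log_intensity.div(log_intensity.sum(axis=1), axis=0)

# Standardize each column to zero mean and unit variance before PCA
scaled_data_frame = StandardScaler().fit_transform(log_ratio)
x_pca = PCA(n_components=2).fit_transform(scaled_data_frame)

print(x_pca.shape)
```

The resulting `x_pca` can be passed to the same `plt.scatter` call used for the first PCA plot.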

## Clustering

![](Images/kmeans.png)

1. Choose the dataframe that produced the best separation when you plotted it. Use k-means clustering to identify which cells belong to which population, then color the points in your plots according to each cell's assigned population. See the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
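A minimal sketch of the clustering and coloring step, run on three made-up blobs of points, is shown below. The blob positions and the choice of three clusters are assumptions for this toy data; on the real data, `n_clusters` should match the number of populations you identified, and `points` would be your ratio columns or PCA coordinates.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Three made-up, well-separated blobs standing in for cell populations
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# fit_predict returns one integer cluster label per row
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Passing the labels as c= colors each point by its assigned cluster
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="viridis")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
```

The label numbers themselves are arbitrary, so cluster 0 in one run need not correspond to cluster 0 in another unless `random_state` is fixed.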