# AILAB final notebook GROUP 2

Matteo Nulli, Lorenzo Tarricone, Yaraslau Ivashynka, Karim Ahmed Abdel Sadek, Fabio Pernisi, Joel Kim

# Heuristic and setting of our experiment

**Why cancer and genome?**

Studying the relationship between cancer and the human genome is difficult because gene expression in cancer cells varies widely between patients and even between cells of the same patient. Moreover, many mutations can develop over time. To understand and treat cancer cells, it is crucial to study how they interact with their neighbors (for example, to see how cancer cells escape T-cells). For this reason we cannot limit ourselves to studying DNA: we also need to study RNA, since it is the "instruction list" that guides the cell's behaviour and metabolism.

**Why is it important to study hypoxia?**

Cancer cells are cells that have grown excessively and refuse to commit "controlled suicide" (*apoptosis*), continuing instead to grow in the body. To do so they need more oxygen than normal cells, and in regions with large metastases the blood vessels cannot keep up with the cancer cells' demand for oxygen. As a result, cancer cells that are about to die "against their will" end up expressing abnormal genes (encoding a metabolic response that lets them survive these extreme conditions), which eventually makes them much more aggressive and difficult to treat.
Our question now is: can we identify hypoxic cells? If we could understand how they react, we could treat cancer better!

**How is the experiment designed?**

In our experiment, cells were grown in two different environments: one with normal oxygen levels (21%) and one with reduced oxygen levels (1%). Every single cell was then sequenced (single-cell RNA sequencing with the Smart-Seq and Drop-Seq methods). Our goal is to build, from these data, a good predictor that, given the gene expression of a cell, tells whether that cell was living in a hypoxic or a normal environment. Both sequencing techniques were applied to two different cell lines, called HCC1086 and MCF7. The first one is a type of liver cancer and the second one is breast cancer.

**How are the data structured?**

The data were provided to us as .csv tables. Every column represents a single sequenced cell (an observation), identified by a name that encodes which of the two treatment conditions it grew in (hypoxia or normoxia) and, for Smart-Seq, its position on the culture plate. Every row represents a single gene (a feature), identified by its official gene symbol. Each entry of the table is the gene expression count for Smart-Seq cells and the Unique Molecular Identifier (UMI) count for Drop-Seq cells.
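
To make the layout concrete, here is a minimal sketch of such a table and of how labels could be derived from the cell names. The counts are invented, the gene symbols are only illustrative, and the `Hypo`/`Norm` tags in the column names are an assumption made for this toy example, not the naming scheme of the real files.

```python
import pandas as pd

# Toy table in the layout described above: genes as rows, cells as columns.
# Counts and cell names are invented for illustration only.
counts = pd.DataFrame(
    {
        "cell1_Hypo": [12, 0, 7],
        "cell2_Norm": [3, 5, 0],
        "cell3_Hypo": [9, 1, 4],
    },
    index=["CA9", "ACTB", "VEGFA"],
)

# Most learning tools expect observations as rows, so transpose to cells x genes.
X = counts.T

# Derive a binary label from each cell name (1 = hypoxia, 0 = normoxia),
# assuming the hypothetical "Hypo"/"Norm" tags used in this toy table.
y = pd.Series(["Hypo" in name for name in X.index], index=X.index).astype(int)
```

After the transpose, `X` has one row per cell and one column per gene, and `y` is aligned with the rows of `X`.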

**How will we proceed with these data?**

We will analyse all four tables given to us using a fairly standard differential gene expression pipeline, which includes the following steps:

- Exploratory data analysis and Data cleaning
- Quality Control and Normalization
- Feature selection and Highly Variable Genes study
- Dimensionality Reduction and Clustering
- Supervised learning: Tree based methods and Deep Neural Networks
- BONUS: Heuristics on Enrichment Analysis and Ontology Analysis
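
As a rough illustration of the middle steps of this list (normalization, feature selection, dimensionality reduction), the sketch below runs them on synthetic counts. Library-size scaling followed by a log transform, variance ranking, and PCA are common choices assumed here for illustration; they are not necessarily the exact methods used later in this notebook, and the sizes (40 cells, 200 genes, 50 selected genes) are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for a counts table: 40 cells x 200 genes (cells as rows).
counts = rng.poisson(lam=5.0, size=(40, 200)).astype(float)

# Normalization: scale each cell to the same total count, then log-transform.
totals = counts.sum(axis=1, keepdims=True)
normalized = np.log1p(counts / totals * 1e4)

# Feature selection: keep the 50 genes with the highest variance across cells.
top_genes = np.argsort(normalized.var(axis=0))[-50:]
selected = normalized[:, top_genes]

# Dimensionality reduction: project the cells onto 2 principal components.
embedding = PCA(n_components=2).fit_transform(selected)
```

The resulting `embedding` has one 2D point per cell and could be passed to a clustering method or plotted directly.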

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Python libraries

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

## Defining functions

In [None]:
def get_condition_list(X, condition):
    """Return the positional indices of the columns of X whose name contains `condition`."""
    # Cell names encode the growth condition, so a substring match is enough.
    return [i for i, name in enumerate(X.columns) if condition in name]
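
A short usage sketch for this helper follows. The function is restated in compact form so the snippet runs on its own, and the cell names are hypothetical; the real tables use their own naming scheme.

```python
import pandas as pd

def get_condition_list(X, condition):
    # Compact restatement of the helper above so this snippet runs on its own.
    return [i for i, name in enumerate(X.columns) if condition in name]

# Hypothetical cell names, with genes as rows and cells as columns.
df = pd.DataFrame(0, index=["geneA", "geneB"],
                  columns=["c1_Hypo", "c2_Norm", "c3_Hypo", "c4_Norm"])

hypo_idx = get_condition_list(df, "Hypo")
norm_idx = get_condition_list(df, "Norm")

# Positional indices can be used with .iloc to split the table by condition.
hypo_cells = df.iloc[:, hypo_idx]
```

Note that the match is a case-sensitive substring test, so the chosen tag should not appear elsewhere in the cell names.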

In [None]:
def plot_data(X):
    plt.plot(X[:, 0], X[:, 1], 'k.', markersize=2)

def plot_centroids(centroids, weights=None, circle_color='w', cross_color='k'):
    # When weights are given, drop centroids whose weight is below a tenth of the largest.
    if weights is not None:
        centroids = centroids[weights > weights.max() / 10]
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='o', s=35, linewidths=8,
                color=circle_color, zorder=10, alpha=0.9)
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=2, linewidths=12,
                color=cross_color, zorder=11, alpha=1)

def plot_decision_boundaries(clusterer, X, resolution=1000, show_centroids=True,
                             show_xlabels=True, show_ylabels=True):
    mins = X.min(axis=0) - 0.1
    maxs = X.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
                         np.linspace(mins[1], maxs[1], resolution))
    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                cmap="Pastel2")
    plt.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                linewidths=1, colors='k')
    plot_data(X)
    # Only centroid-based clusterers (e.g. KMeans) expose cluster_centers_.
    if show_centroids and hasattr(clusterer, "cluster_centers_"):
        plot_centroids(clusterer.cluster_centers_)

    if show_xlabels:
        plt.xlabel("$x_1$", fontsize=14)
    else:
        plt.tick_params(labelbottom=False)
    if show_ylabels:
        plt.ylabel("$x_2$", fontsize=14, rotation=0)
    else:
        plt.tick_params(labelleft=False)
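
The core of `plot_decision_boundaries` is evaluating a fitted clusterer on a dense 2D grid. Here is a minimal sketch of that step without the plotting, assuming a `KMeans` model and synthetic two-cluster data (both chosen only for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic 2D blobs standing in for cells after dimensionality reduction.
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(4.0, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Build a grid covering the data with a small margin, as the helper does.
resolution = 100
mins = X.min(axis=0) - 0.1
maxs = X.max(axis=0) + 0.1
xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
                     np.linspace(mins[1], maxs[1], resolution))

# Predict a cluster label for every grid point and reshape back to the grid.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```

Because the grid has exactly two coordinates, the clusterer must have been fitted on 2D data; high-dimensional expression data would first need to be reduced to two components.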

In [None]:
from matplotlib.colors import LogNorm

def plot_gaussian_mixture(clusterer, X, resolution=1000, show_ylabels=True):
    mins = X.min(axis=0) - 0.1
    maxs = X.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
                         np.linspace(mins[1], maxs[1], resolution))
    # Negative log-density of every grid point under the fitted mixture.
    Z = -clusterer.score_samples(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z,
                 norm=LogNorm(vmin=1.0, vmax=30.0),
                 levels=np.logspace(0, 2, 12))
    plt.contour(xx, yy, Z,
                norm=LogNorm(vmin=1.0, vmax=30.0),
                levels=np.logspace(0, 2, 12),
                linewidths=1, colors='k')

    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contour(xx, yy, Z,
                linewidths=2, colors='r', linestyles='dashed')

    plt.plot(X[:, 0], X[:, 1], 'k.', markersize=2)
    plot_centroids(clusterer.means_, clusterer.weights_)

    plt.xlabel("$x_1$", fontsize=14)
    if show_ylabels:
        plt.ylabel("$x_2$", fontsize=14, rotation=0)
    else:
        plt.tick_params(labelleft=False)
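
The quantities this function draws can be inspected without plotting. The sketch below assumes a two-component `GaussianMixture` fitted on synthetic 2D data, chosen only for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two synthetic 2D blobs of unequal size.
X = np.vstack([rng.normal(0.0, 0.5, (60, 2)), rng.normal(4.0, 0.5, (40, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# score_samples returns log-densities; the helper plots their negation.
neg_log_density = -gm.score_samples(X)

# plot_centroids keeps only components whose weight exceeds a tenth of the largest.
kept = gm.means_[gm.weights_ > gm.weights_.max() / 10]
```

Because the contour levels start at 1, regions where the negative log-density falls below 1 are left unfilled in the plot.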

In [None]:
from scipy.cluster.hierarchy import dendrogram

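def plot_dendrogram(model, **kwargs):
    # Build a SciPy linkage matrix from a fitted AgglomerativeClustering model and plot it.
    # The model must expose distances_ (fit with distance_threshold set or compute_distances=True).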

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
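
The helper above needs a model whose `distances_` attribute is populated. The sketch below shows one way to obtain such a model and rebuilds the linkage matrix on synthetic data. The name `linkage_from_model` is hypothetical and simply restates the counting logic so the snippet runs on its own; `no_plot=True` is used so that no figure is drawn.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# 20 synthetic 2D points in two groups.
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(3.0, 0.3, (10, 2))])

# distance_threshold=0 with n_clusters=None builds the full tree and fills distances_.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)

def linkage_from_model(model):
    # Hypothetical helper: same counting logic as plot_dendrogram, without plotting.
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        # A child index below n_samples is a leaf; otherwise it is an earlier merge.
        counts[i] = sum(1 if c < n_samples else counts[c - n_samples] for c in merge)
    return np.column_stack([model.children_, model.distances_, counts]).astype(float)

linkage_matrix = linkage_from_model(model)

# Compute the dendrogram layout only, truncated to the top three levels.
tree = dendrogram(linkage_matrix, no_plot=True, truncate_mode="level", p=3)
```

With `n` samples the linkage matrix has `n - 1` rows, and the last row's count equals `n` because the final merge contains every sample.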