<img src='https://www.msm.edu/online/makingmedicines/images/makingmedicines784.jpg'>
<p>
<p>
<h1><center>Mechanisms of Action (MoA) Prediction</center></h1>
<h3><center>💊 Can you improve the algorithm that classifies drugs based on their biological activity? 💊</center></h3>
    
    
[image credit](https://www.msm.edu/online/makingmedicines/index.php)

# Table of contents🗃<a id='0.1'></a>

- [1. Introduction](#1)
- [2. Import Packages](#2)
- [3. Data Overview](#3)
   - [3.1 Train Data](#3-1)
   - [3.2 Train Targets](#3-2)
   - [3.3 Test Data](#3-3)
   - [3.4 Missing Values](#3-4)
- [4. Visualization](#4)
   - [4.1 Treatment Features (Type, Dose & Duration)](#4-1)
   - [4.2 Gene Expression](#4-2)
       - [4.2.1 Gene Expression: Correlation](#4-2-1)
   - [4.3 Cell Viability](#4-3)
       - [4.3.1 Cell Viability: Correlation](#4-3-1)
   - [4.4 Target Features](#4-4)
       - [4.4.1 Top Occurrences](#4-4-1)
       - [4.4.2 Low Occurrences](#4-4-2)
       - [4.4.3 Activations Per Sample](#4-4-3)
   - [4.5 Non Target Features](#4-5)
       - [4.5.1 Top Occurrences](#4-5-1)
       - [4.5.2 Activations Per Sample](#4-5-2)

# 1. <a id='1'>Introduction📃</a>
[Table of Contents](#0.1)

This competition is presented by [Connectivity Map](https://clue.io/) and [Laboratory for Innovation Science at Harvard (LISH)](https://lish.harvard.edu/). The goal of this competition is to advance drug development through improvements to MoA prediction algorithms. Let's first understand what a **Mechanism of Action (MoA)** is.

## What is Mechanism of Action (MoA)?

In medicine, mechanism of action is a term used to describe how a drug or other substance produces an effect in the body. For example, a drug's mechanism of action could be how it affects a specific target in a cell, such as an enzyme, or a cell function, such as cell growth. Knowing the mechanism of action of a drug may help provide information about the safety of the drug and how it affects the body. It may also help identify the right dose of a drug and which patients are most likely to respond to treatment. It is also called MOA. For more information, check [here](https://www.kaggle.com/c/lish-moa/overview).

## Competition Data

We have access to a unique dataset that combines gene expression and cell viability data. The data is based on the responses of 100 different types of human cells to various drugs. We are provided with separate CSVs for the predictors (train_features.csv) and the targets (train_targets_scored.csv). Our data has:

   * train_features.csv - Features for the training set. Features prefixed with g- are gene expression data, and features prefixed with c- are cell viability data. cp_type indicates whether a sample was treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle); control perturbations have no MoAs. cp_time and cp_dose indicate treatment duration (24, 48 or 72 hours) and dose (high or low).
   
   * train_targets_scored.csv - The binary MoA targets that are scored.
   
   * train_targets_nonscored.csv - Additional (optional) binary MoA responses for the training data. These are not predicted nor scored.
   
   * test_features.csv - Features for the test data. You must predict the probability of each scored MoA for each row in the test data.
   
   * sample_submission.csv - A submission file in the correct format.
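Because the features and targets come in separate CSVs, it can help to confirm that they line up row by row before modelling. The sketch below is only an illustration on two tiny made-up frames: the names `features`, `targets` and the column `moa_x` are hypothetical, and it assumes `sig_id` is the key shared by both files.

```python
import pandas as pd

# Hypothetical stand-ins for train_features.csv and train_targets_scored.csv
features = pd.DataFrame({"sig_id": ["id_a", "id_b", "id_c"], "g-0": [0.1, 0.2, 0.3]})
targets = pd.DataFrame({"sig_id": ["id_a", "id_b", "id_c"], "moa_x": [0, 1, 0]})

# True only if both files list the same ids in the same order
same_order = features["sig_id"].equals(targets["sig_id"])

# validate="one_to_one" raises if any sig_id is duplicated on either side
merged = features.merge(targets, on="sig_id", how="inner", validate="one_to_one")

print(same_order, merged.shape)
```

If `same_order` is true, the two frames can be used side by side without merging.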


## What are we predicting?

**Our goal is to classify drugs based on their biological activity**. Since drugs can have multiple MoA annotations, the task is formally a **multi-label classification problem**. Given gene expression and cell viability features, we have to predict the probability of each scored MoA target for every sample.
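To make the multi-label idea concrete, here is a minimal sketch with made-up numbers. The label names and probabilities are hypothetical and are not taken from the competition data. The point is that each sample can carry zero, one or several labels, and a model outputs one independent probability per sample and label.

```python
import numpy as np

# Hypothetical toy example: 4 samples, 3 MoA labels (names are made up)
labels = ["moa_a", "moa_b", "moa_c"]
y_true = np.array([
    [1, 0, 0],   # one MoA
    [1, 1, 0],   # two MoAs at once: this is what makes the task multi-label
    [0, 0, 0],   # no MoA at all (e.g. a control sample)
    [0, 0, 1],
])

# One probability per (sample, label); rows do not need to sum to 1,
# unlike the output of a multi-class softmax
y_prob = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.7, 0.1],
    [0.1, 0.1, 0.1],
    [0.2, 0.0, 0.6],
])

labels_per_sample = y_true.sum(axis=1)
y_pred = (y_prob >= 0.5).astype(int)   # threshold each label independently

print(labels_per_sample)
print((y_pred == y_true).all())
```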

# 2. <a id='2'>Import Packages📚</a>

[Table of Contents](#0.1)

In [None]:
# import packages
import os, gc
import numpy as np

# data manipulation
import pandas as pd
import pandas_profiling 

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import manifold
import cufflinks as cf
import plotly.offline

# Settings
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
plt.style.use('fivethirtyeight')
plt.show()

%matplotlib inline

os.listdir('../input/lish-moa/')

# 3. <a id='3'>Data Overview🔍</a>
[Table of Contents](#0.1)

In [None]:
# root directory
ROOT = '../input/lish-moa/'

# files
target_scored = pd.read_csv(f'{ROOT}train_targets_scored.csv')
target_nonscored = pd.read_csv(f'{ROOT}train_targets_nonscored.csv')
train_features = pd.read_csv(f'{ROOT}train_features.csv')
test_features = pd.read_csv(f'{ROOT}test_features.csv')

## 3.1 <a id='3-1'>Train Data</a>
[Table of Contents](#0.1)

In [None]:
train_features.head()

In [None]:
print(f'We have {train_features.shape[0]} rows and {train_features.shape[1]} columns in train_features.')

* sig_id is a unique ID for each sample.
* **cp_type**, **cp_time** and **cp_dose** indicate whether the sample was treated with a compound, the treatment duration and the dose level, respectively.
* **"g-"** means gene expression data and **"c-"** means cell viability data.
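Since the column names follow a prefix convention, the feature groups can be picked out by name. The sketch below uses a tiny made-up frame (`toy` and its values are hypothetical) with the same naming scheme as train_features.

```python
import pandas as pd

# Toy stand-in for train_features with the same naming scheme (values are made up)
toy = pd.DataFrame({
    "sig_id": ["id_a", "id_b"],
    "cp_type": ["trt_cp", "ctl_vehicle"],
    "cp_time": [24, 72],
    "cp_dose": ["D1", "D2"],
    "g-0": [0.5, -0.2],
    "g-1": [1.1, 0.3],
    "c-0": [-0.4, 0.9],
})

# Group columns by their prefix
gene_cols = [c for c in toy.columns if c.startswith("g-")]
cell_cols = [c for c in toy.columns if c.startswith("c-")]
treatment_cols = ["cp_type", "cp_time", "cp_dose"]

print(len(gene_cols), len(cell_cols))
```

The same list comprehensions should work unchanged on the real `train_features` frame.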

## 3.2 <a id='3-2'>Train Targets</a>
[Table of Contents](#0.1)

In [None]:
target_scored.head()

In [None]:
print(f'We have {target_scored.shape[0]} rows and {target_scored.shape[1]} columns in target_scored.')

* We have 206 target variables (207 columns including sig_id). All of them are binary labels (0 or 1).

In [None]:
target_nonscored.head()

In [None]:
print(f'We have {target_nonscored.shape[0]} rows and {target_nonscored.shape[1]} columns in target_nonscored.')

* These are the additional binary responses. They are neither predicted nor scored.

## 3.3 <a id='3-3'>Test Data</a>
[Table of Contents](#0.1)

In [None]:
test_features.head()

In [None]:
print(f'We have {test_features.shape[0]} rows and {test_features.shape[1]} columns in test_features.')

* The test data has the same columns as the train data (train_features). We need to predict the probability of each scored MoA for each row in the test data.

## 3.4 <a id='3-4'>Missing Values</a>
[Table of Contents](#0.1)

In [None]:
# missing values
print(f'We have {train_features.isnull().values.sum()} missing values in train data')
print(f'We have {test_features.isnull().values.sum()} missing values in test data')
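A single total hides where the gaps are, so a per-column breakdown can be a useful follow-up when the total is not zero. This is a small sketch on a made-up frame (`toy` is hypothetical and deliberately contains gaps).

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing entries
toy = pd.DataFrame({
    "g-0": [0.1, np.nan, 0.3],
    "c-0": [1.0, 2.0, 3.0],
    "cp_dose": ["D1", None, "D2"],
})

# Missing count per column, keeping only columns that actually have gaps
missing_per_column = toy.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0].sort_values(ascending=False)

total_missing = int(toy.isnull().values.sum())
print(missing_per_column)
print(total_missing)
```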

# 4. <a id='4'>Visualization</a>
[Table of contents](#0.1)

Now comes the amazing part, **Visualization**. We will visualize both the categorical and the numerical features.

## 4.1 <a id='4-1'>Treatment Features (Type, Dose & Duration)</a>
[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(16, 6))
gs = f.add_gridspec(1, 3)

# Pass each column as x= and draw on the subplot just created;
# recent seaborn versions treat a lone positional argument as `data`.
with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[0, 0])
    sns.countplot(x=train_features['cp_type'], palette="Set3", ax=ax)
    ax.set_title("Treatment Type")

with sns.axes_style("white"):
    ax = f.add_subplot(gs[0, 1])
    sns.countplot(x=train_features['cp_dose'], palette="Set2", ax=ax)
    ax.set_title("Treatment Dose")

with sns.axes_style("ticks"):
    ax = f.add_subplot(gs[0, 2])
    sns.countplot(x=train_features['cp_time'], palette="ch:.25", ax=ax)
    ax.set_title("Treatment Time")

f.tight_layout()

📌 **Points to note :**

   * Most of the treatments are compound treatments (trt_cp).
   * Treatment dose has two categories, D1 and D2 (high vs low).
   * Treatment duration has three categories: 24, 48 and 72 hours.
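The bar charts can be backed up with numbers. The sketch below computes the share of each treatment type and a dose-by-duration table on a made-up frame (`toy` and its rows are hypothetical, only the column names and category values follow the data described above).

```python
import pandas as pd

# Toy stand-in for the three treatment columns (rows are made up)
toy = pd.DataFrame({
    "cp_type": ["trt_cp", "trt_cp", "trt_cp", "ctl_vehicle"],
    "cp_dose": ["D1", "D2", "D1", "D2"],
    "cp_time": [24, 48, 72, 24],
})

# Share of each treatment type instead of raw counts
type_share = toy["cp_type"].value_counts(normalize=True)

# How dose and duration combine: one cell per (dose, time) pair
dose_by_time = pd.crosstab(toy["cp_dose"], toy["cp_time"])

print(type_share)
print(dose_by_time)
```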

## 4.2 <a id='4-2'>Gene Expression</a>
[Table of contents](#0.1)

**"Gene expression"** is the process by which the **information stored in our DNA is converted into instructions for making a functional product, such as a protein or another molecule**.

We have columns ranging from **"g-0 to g-771"**, which appear to be similarly distributed. We will make a density plot to view the first 26 of them.

In [None]:
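f = plt.figure(figsize=(16, 8))

with sns.axes_style("white"):
    # first 26 gene expression columns (g-0 to g-25)
    for i in range(0, 26):
        # fill= replaces the deprecated shade= (use shade=True on seaborn < 0.11)
        sns.kdeplot(x=train_features.loc[:, f"g-{i}"], fill=True)
    plt.title("Gene Distribution")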

**📌 Points to note :**

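   * Most of the plotted gene expression columns look roughly bell-shaped, close to a normal distribution.
   * Only the first 26 columns are plotted here, so this suggests, but does not prove, that all the columns from **"g-0 to g-771"** are approximately normally distributed.
   * Values in the plotted gene expression features range from about -10 to 10.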
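A density plot only gives a visual impression, so a quick numeric check across all columns can complement it. The sketch below is one possible approach, not part of the original analysis: it computes skewness and excess kurtosis with pandas on synthetic data (`toy` and both of its columns are made up). Values near 0 for both are what a normal distribution would give.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: one bell-shaped, one with a long left tail
toy = pd.DataFrame({
    "g-0": rng.normal(size=5000),
    "g-1": -rng.exponential(size=5000),
})

# pandas reports excess kurtosis, so a normal distribution scores about 0
summary = pd.DataFrame({"skew": toy.skew(), "excess_kurtosis": toy.kurt()})
print(summary)
```

On the real data, the same two calls could be applied to all the g- columns at once to flag the ones that deviate most.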

### 4.2.1 <a id='4-2-1'>Gene Expression: Correlation</a>
[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(20, 16))

# compute the 772 x 772 correlation matrix once and reuse it
gene_corr = train_features.loc[:,"g-0":"g-771"].corr()
mask = np.triu(np.ones_like(gene_corr, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)

with sns.axes_style("white"):
    sns.heatmap(gene_corr, mask=mask, square=True, cmap=cmap)
    plt.title("Gene Expression: Correlation")

**📌 Points to note :**

   * This is a huge correlation plot covering all **772 gene expression features**, so individual pairs are hard to read off the heatmap.
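With this many features, a ranked list of the strongest pairs is often easier to read than the heatmap. The sketch below shows one way to build such a list from the upper triangle of a correlation matrix. It runs on synthetic data (`toy` is made up, with `g-1` constructed as a noisy copy of `g-0`).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)

toy = pd.DataFrame({
    "g-0": base,
    "g-1": base + rng.normal(scale=0.1, size=500),   # nearly a copy of g-0
    "g-2": rng.normal(size=500),                     # independent noise
})

corr = toy.corr()

# Indices of the strict upper triangle: each pair once, diagonal excluded
i, j = np.triu_indices(len(corr), k=1)
pairs = pd.Series(
    corr.values[i, j],
    index=[f"{corr.index[a]} ~ {corr.columns[b]}" for a, b in zip(i, j)],
)

# Order pairs by absolute correlation, strongest first
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs.head())
```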

## 4.3 <a id='4-3'>Cell Viability</a>

**"Cell viability"** is a measure of the proportion of live, healthy cells within a population. Cell viability assays are used to determine the overall health of cells, optimize culture or experimental conditions, and to measure cell survival following treatment with compounds, such as during a drug screen.

[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(16, 8))

with sns.axes_style("white"):
    for i in range(0,26):
        # fill= replaces the deprecated shade= (use shade=True on seaborn < 0.11)
        sns.kdeplot(x=train_features.loc[:, f"c-{i}"], fill=True)
    plt.title("Cell Viability Distribution")

**📌 Points to note :**

   * Most of the plotted cell viability columns are left-skewed, with a long tail toward negative values.
   * There are 100 cell viability features ranging from **"c-0 to c-99"**.
   * Values in the cell viability features range from about -10 to 6.

### 4.3.1 <a id='4-3-1'>Cell Viability: Correlation</a>
[Table of contents](#0.1)

In [None]:
f = plt.figure(figsize=(16, 10))

# compute the 100 x 100 correlation matrix once and reuse it
cell_corr = train_features.loc[:,"c-0":"c-99"].corr()
mask = np.triu(np.ones_like(cell_corr, dtype=bool))

with sns.axes_style("white"):
    sns.heatmap(cell_corr, mask=mask, square=True)
    plt.title("Cell Viability: Correlation")

**📌 Points to note :**
   * There is high correlation among the **"cell viability features"**.
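To put a number on "high correlation", one option is to average the absolute off-diagonal entries of the correlation matrix. This is only a sketch on synthetic data: `toy` is made up, and its columns share a common factor by construction, so the result says nothing about the real features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
shared = rng.normal(size=1000)   # common factor driving every toy column

toy = pd.DataFrame(
    {f"c-{i}": shared + rng.normal(scale=0.5, size=1000) for i in range(5)}
)

corr = toy.corr().to_numpy()

# Drop the diagonal, which is always 1.0 and would inflate the average
off_diag = corr[~np.eye(len(corr), dtype=bool)]
mean_abs_corr = np.abs(off_diag).mean()

print(round(mean_abs_corr, 2))
```

Running the same calculation on the g- block and the c- block separately would allow a direct comparison of the two feature groups.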

## 4.4 <a id='4-4'>Target Features</a>
[Table of contents](#0.1)

We have **206 target variables**, all of which are **binary (0 or 1)**.

### 4.4.1 <a id='4-4-1'>Top Occurrences</a>

In [None]:
data = target_scored.drop(['sig_id'], axis=1).sum().sort_values(inplace=False, ascending=False)
data = pd.Series(data)
with sns.axes_style("whitegrid"):
    data[:20].plot(kind='bar', figsize=(16, 8), title='Top 20 Target Features')

**📌 Points to note :**
   * We have **206 target variables**. Since the target dataframe is huge, only the **top 20 target variables** are shown here.
   * **nfkb_inhibitor** has the most occurrences (832), followed by proteasome_inhibitor (726).
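Raw counts are easier to compare across datasets when expressed as a positive rate per target. The sketch below shows the idea on a made-up target frame (`toy_targets` and its label names are hypothetical), which also makes the class imbalance explicit.

```python
import pandas as pd

# Toy stand-in for the scored targets (label names are made up)
toy_targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "moa_common": [1, 1, 1, 0],
    "moa_rare": [0, 0, 0, 1],
    "moa_never": [0, 0, 0, 0],
})

label_cols = toy_targets.columns.drop("sig_id")

# Number of positive samples per target, most frequent first
positives = toy_targets[label_cols].sum().sort_values(ascending=False)

# For 0/1 labels, the column mean is the fraction of positive samples
positive_rate = toy_targets[label_cols].mean()

print(positives)
print(positive_rate)
```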

### 4.4.2 <a id='4-4-2'>Low Occurrences</a>
[Table of contents](#0.1)

Next, we look at the target variables with the fewest positive samples.

In [None]:
data = target_scored.drop(['sig_id'], axis=1).sum().sort_values(inplace=False)
data = pd.Series(data)
with sns.axes_style("whitegrid"):
    data[:20].plot(kind='bar', figsize=(16, 8), title='Last 20 Target Features')

**📌 Points to note :**
   * **"atp-sensitive_potassium_channel_antagonist"** and **"erbb2_inhibitor"** each occur only once.

### 4.4.3 <a id='4-4-3'>Activations Per Sample</a>
[Table of contents](#0.1)

In [None]:
with sns.axes_style("whitegrid"):
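    # drop the sig_id string column so that only the binary targets are summed
    target_scored.drop(['sig_id'], axis=1).sum(axis=1).value_counts().plot(kind='bar', figsize=(16, 8), title='Number of MoA Activations per samples')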

**📌 Points to note :**
   * Most samples have exactly one Mechanism of Action (MoA) annotation, and a significant number of samples have no MoA annotation at all.
   * No sample has exactly 6 MoA annotations.
   * Very few samples have 4 or more annotations.
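Gaps such as a label count that never occurs are easy to miss in a bar chart ordered by frequency. The sketch below counts labels per sample and fills in the missing counts with zero. It uses a made-up target frame (`toy_targets` and its label names are hypothetical).

```python
import pandas as pd

# Toy stand-in for the scored targets (label names are made up)
toy_targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d", "id_e"],
    "moa_a": [1, 0, 1, 0, 0],
    "moa_b": [0, 0, 1, 0, 1],
    "moa_c": [0, 0, 1, 0, 0],
})

# Drop the id column so only the binary labels are summed per row
labels_per_sample = toy_targets.drop(columns="sig_id").sum(axis=1)

# Order by number of labels and make absent counts visible as zeros
counts = labels_per_sample.value_counts().sort_index()
counts = counts.reindex(range(labels_per_sample.max() + 1), fill_value=0)

print(counts)
```

Here no toy sample has exactly two labels, and the reindexed series shows that as an explicit zero.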

## 4.5 <a id='4-5'>Non Target Features</a>
[Table of contents](#0.1)

### 4.5.1 <a id='4-5-1'>Top Occurrences: Non Target Features</a>


In [None]:
data = target_nonscored.drop(['sig_id'], axis=1).sum().sort_values(inplace=False, ascending=False)
data = pd.Series(data)
with sns.axes_style("whitegrid"):
    data[:20].plot(kind='bar', figsize=(16, 8), title='Top 20 Non Target Features')

### 4.5.2 <a id='4-5-2'>Activations Per Sample</a>
[Table of contents](#0.1)

In [None]:
with sns.axes_style("whitegrid"):
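    # drop the sig_id string column so that only the binary targets are summed
    target_nonscored.drop(['sig_id'], axis=1).sum(axis=1).value_counts().plot(kind='bar', figsize=(16, 8), title='Number of MoA Activations per samples (Non Target Features)')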

**📌 Points to note :**
   * Most samples have **no Mechanism of Action (MoA) annotation**, around **19224** of them.
   * No sample has exactly 6 MoA annotations.
   * Very few samples have 2 or more annotations.

# Reference
[Table of contents](#0.1)

* [MOA Definition](https://www.cancer.gov/publications/dictionaries/cancer-terms/def/mechanism-of-action)
* [Cell viability](https://www.cellsignal.com/contents/_/synopsis-of-cell-proliferation-metabolic-status-and-cell-death/cell-viability-and-survival)
* [Seaborn](https://seaborn.pydata.org/index.html)
* [Parul Pandey's notebook collection on Kaggle](https://www.kaggle.com/parulpandey/notebooks)
* [headsortails' MoA EDA notebook on Kaggle](https://www.kaggle.com/headsortails/explorations-of-action-moa-eda#individual-feature-visualisations)

# WORK IN PROGRESS