<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gold; border:0; color:blue' role="tab" aria-controls="home"><center>Mechanism of Action</center></h2>


<img src = "https://image.slidesharecdn.com/mechanismofdrugaction-131104071748-phpapp01/95/mechanism-of-drug-action-3-638.jpg?cb=1383549622">
    
This competition is about MoA, so let's first understand what it is before moving on to the data and model building.
    
In pharmacology, the term `Mechanism of Action (MOA)` refers to the specific biochemical interaction through which a drug substance produces its pharmacological effect. 
    
A mechanism of action usually includes mention of the specific molecular targets to which the drug binds, such as an `enzyme` or `receptor`.

In the past, scientists derived drugs from natural products or were inspired by traditional remedies. 
    
Very common drugs, such as paracetamol, known in the US as acetaminophen, were put into clinical use decades before the biological mechanisms driving their pharmacological activities were understood. 
    
Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease. 
    
In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. 
    
As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action or MoA for short.

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gold; border:0; color:blue' role="tab" aria-controls="home"><center>How do we determine the MoAs of a new drug?</center></h2>

One approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as libraries of gene expression or cell viability patterns of drugs with known MoAs.

In this competition, we have access to a unique dataset that combines gene expression and cell viability data. 
    
The data is based on a new technology that measures simultaneously (within the same samples) human cells’ responses to drugs in a pool of 100 different cell types (thus solving the problem of identifying ex-ante, which cell types are better suited for a given drug). 
    
In addition, we have access to MoA annotations for more than 5,000 drugs in this dataset.

As is customary, the dataset has been split into testing and training subsets. 
    
Hence, our task is to use the training dataset to develop an algorithm that automatically labels each case in the test set as one or more MoA classes. 
    
**Note that since drugs can have multiple MoA annotations, the task is formally a multi-label classification problem.**

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gold; border:0; color:blue' role="tab" aria-controls="home"><center>How to evaluate the accuracy of a solution?</center></h2>

Based on the MoA annotations, the accuracy of solutions will be evaluated on the average value of the logarithmic loss function applied to each drug-MoA annotation pair.

If successful, we will help to develop an algorithm to predict a compound’s MoA given its cellular signature, thus helping scientists advance the drug discovery process.
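To make the metric described above concrete, here is a minimal sketch of the mean logarithmic loss over all sample-MoA pairs. The function name `mean_columnwise_log_loss` and the clipping constant `eps` are assumptions made for illustration, not taken from the competition code.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over every (sample, MoA) pair.

    eps is an assumed clipping constant that keeps log() finite.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return losses.mean()

# Toy example: 2 samples x 2 MoA targets (made-up values)
y_true_demo = np.array([[1, 0], [0, 0]])
y_pred_demo = np.array([[0.9, 0.1], [0.2, 0.05]])
demo_loss = mean_columnwise_log_loss(y_true_demo, y_pred_demo)
print(demo_loss)
```

Because every pair contributes equally, confidently wrong probabilities are punished heavily, which is why submissions contain probabilities rather than hard 0/1 labels.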

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gold; border:0; color:blue' role="tab" aria-controls="home"><center>Information about the Data given for the competition</center></h2>
    
In this competition, we will be predicting multiple targets of the Mechanism of Action (MoA) response(s) of different samples (sig_id), given various inputs such as gene expression data and cell viability data.

**Important notes:**

1) The training data has an additional (optional) set of MoA labels that are not included in the test data and are not used for scoring.

2) The re-run dataset has approximately 4x the number of examples seen in the public test set.

    
**Files Provided:**
    
1) **train_features.csv** - Features for the training set. 
    
    a) Features prefixed with g- signify gene expression data.
    
    b) Features prefixed with c- signify cell viability data.
    
    c) cp_type indicates samples treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle); 
       control perturbations have no MoAs.
    
    d) cp_time indicates treatment duration (24, 48, 72 hours).
    
    e) cp_dose indicates whether the dose was high or low.

2) **train_targets_scored.csv** - The binary MoA targets that are scored.

3) **train_targets_nonscored.csv** - Additional (optional) binary MoA responses for the training data. These are not predicted nor scored.

4) **test_features.csv** - Features for the test data. You must predict the probability of each scored MoA for each row in the test data.

5) **sample_submission.csv** - A submission file in the correct format.

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gold; border:0; color:blue' role="tab" aria-controls="home"><center>Let's get Started!</center></h2>
    
We have learnt enough about MoA and the competition. It's now time to load the datasets and check the data!

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
from colorama import Fore, Back, Style
import matplotlib.pyplot as plt
import seaborn as sns

# Set Style
sns.set_style("whitegrid")
sns.despine(left=True, bottom=True)

# Set Color Palettes for the notebook
colors_nude = ['#e0798c','#65365a','#da8886','#cfc4c4','#dfd7ca']
sns.palplot(sns.color_palette(colors_nude))

### Read Files

In [None]:
train = pd.read_csv("/kaggle/input/lish-moa/train_features.csv")
test  = pd.read_csv("/kaggle/input/lish-moa/test_features.csv")
sub = pd.read_csv("/kaggle/input/lish-moa/sample_submission.csv")
target = pd.read_csv("/kaggle/input/lish-moa/train_targets_scored.csv")

### Check Training Dataset

In [None]:
print(Fore.YELLOW+"Training dataset:", Style.RESET_ALL + "has {} rows and {} columns".format(train.shape[0],train.shape[1]))

In [None]:
train.head()

Training Dataset has 23814 samples and 875 feature variables excluding sig_id

### Check Test Dataset

In [None]:
print(Fore.YELLOW+"Test dataset:", Style.RESET_ALL + "has {} rows and {} columns".format(test.shape[0],test.shape[1]))

In [None]:
test.head()

* Test Dataset has 3982 drug samples and 875 feature variables excluding sig_id
* This means we need to make predictions for all these 3982 drugs for all the target variables

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Check Target Features Dataset</center></h2>

In [None]:
print(Fore.YELLOW+"Target dataset:", Style.RESET_ALL + "has {} rows and {} columns".format(target.shape[0],target.shape[1]))

Important Observations ✍
<br><br>
📌 The number of rows is equal to that of the training dataset, which makes complete sense
<br>  
📌 We have 207 columns, i.e. sig_id plus 206 target variables, which means we need to make predictions for 206 target variables for all the 3982 samples provided in the test data<br>

📌 This means our submission file should have a shape of 3982 rows and 207 columns (sig_id plus one column per target)<br> 

Let's check if our understanding is correct!

In [None]:
target.head()

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Check Submission File</center></h2>

In [None]:
# sub was already loaded in the "Read Files" cell above
sub.shape

<b>📌 3982 rows and 207 columns, exactly the shape we were expecting!</b>

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Check Null Values</center></h2>

In [None]:
# helper function to check null values
def check_nulls(data):
    isNull = "N"
    for col in data.columns:
        if data[col].isnull().sum() > 0:        
            print("{} has {} null values".format(col,data[col].isnull().sum()))
            isNull = "Y"

    if isNull == "N":
        print("No Null Values found in the dataset")   

In [None]:
check_nulls(train)

In [None]:
check_nulls(test)

In [None]:
check_nulls(target)

<b>No null values are present in any of the datasets.</b>
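As a side note, the same check can be written without an explicit loop. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in for `train`, `test` or `target`), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value
toy = pd.DataFrame({"g-0": [0.1, np.nan, 0.3], "c-0": [1.0, 2.0, 3.0]})

null_counts = toy.isnull().sum()                 # number of nulls per column
cols_with_nulls = null_counts[null_counts > 0]   # keep only the affected columns

if cols_with_nulls.empty:
    print("No Null Values found in the dataset")
else:
    print(cols_with_nulls)
```

On a frame with hundreds of columns this keeps the output short, because only the affected columns are listed.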

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:gold; border:0; color:blue' role="tab" aria-controls="home"><center>Exploratory Data Analysis!</center></h2>
  
We have loaded all the necessary datasets and taken a look at the data available. 
It's time to dig into the data by performing EDA!

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Let's dig training dataset</center></h2>

In [None]:
train.columns

There are a total of 876 columns (875 features plus sig_id). The majority of these are "c-" and "g-" features!

Let's check if there are any duplicate "sig_id" in the dataset

In [None]:
#check for duplicate sig_ids in training dataset
print("Total no. of records in the training dataset: ",train.shape[0])
print("No. of unique sig_ids in the training dataset:",train.sig_id.nunique())

The number of unique sig_id values is equal to the number of rows in the training dataset, which means we don't have any duplicates here.

Let's now check the count of different kind of features in the dataset!

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Feature Distribution</center></h2>

In [None]:
# Count of "c-" & "g-" features
c_count = 0
g_count = 0
others = []
for feat in train.columns:
    if feat.startswith("c-"):
        c_count += 1
    elif feat.startswith("g-"):
        g_count += 1
    else:
        others.append(feat)

print(Fore.YELLOW + "No. of g- features:", Style.RESET_ALL + "{}".format(g_count))
print(Fore.YELLOW + "No. of c- features:", Style.RESET_ALL + "{}".format(c_count))
print(Fore.YELLOW + "Other features:", Style.RESET_ALL + "{}".format(len(others)), others)

# visualize the no. of g- & c- features in the dataset
plt.figure(figsize = (8,6))
plt.bar(["g-", "c-", "others"], [g_count, c_count, len(others)], color = colors_nude)
plt.title("Feature Type Distribution")
plt.xlabel("Features")
plt.ylabel("Count")
plt.show()

The g- and c- features make up more than 99% of the columns in the training dataset, and they are numeric.
The remaining columns are the sig_id identifier and the categorical treatment features (cp_type, cp_time and cp_dose).
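The share quoted above can be double-checked with a few lines. This is only a sketch: `toy_columns` is a made-up column list, and the counts 772, 100 and 876 are the ones reported in this notebook.

```python
import pandas as pd

# Hypothetical, shortened column list standing in for train.columns
toy_columns = pd.Index(["sig_id", "cp_type", "cp_time", "cp_dose", "g-0", "g-1", "c-0"])

g_cols = [c for c in toy_columns if c.startswith("g-")]
c_cols = [c for c in toy_columns if c.startswith("c-")]
other_cols = [c for c in toy_columns if c not in g_cols and c not in c_cols]

# Share of g-/c- columns using the counts reported in this notebook
share_in_notebook = (772 + 100) / 876
print(g_cols, c_cols, other_cols, round(share_in_notebook, 3))
```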

We will now check how these features are distributed in the training dataset

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Categorical Feature Distribution (cp_type, cp_time & cp_dose)</center></h2>

In [None]:
# helper function to plot categorical variables
def plot_cp(feats):
    # print the value counts of each feature
    for feat in feats:
        print("--------------------" + feat + "------------------------")
        print(train[feat].value_counts())
        print("---------------------------------------------------")
        print("\n")

    # draw one count plot per feature, side by side
    plt.figure(figsize = (21,8))
    for idx, feat in enumerate(feats):
        plt.subplot(1, len(feats), idx + 1)
        sns.countplot(x=train[feat], palette=colors_nude)
        plt.xlabel("Features Distribution - " + feat, fontsize=15)
        plt.ylabel("Count", fontsize=15)

    plt.suptitle("Feature Distribution for cp_ variables", fontsize=25)
    plt.show()

In [None]:
# lets check how the distribution of cp_ features looks like
plot_cp(['cp_type','cp_dose','cp_time'])

cp_type

* The cp_type feature has 2 possible values: trt_cp & ctl_vehicle
* It means that the samples are either treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle)
* The majority of the samples are treated with a compound

cp_dose

* cp_dose also has 2 possible values: D1 & D2
* Both values have an almost equal presence in the dataset

cp_time

* cp_time has 3 possible values: 24, 48 & 72
* All these values have an almost equal presence in the dataset

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Cell Viability Feature Distribution</center></h2>

Cell viability is a measure of the proportion of live, healthy cells within a population. 

Cell viability assays are used to determine the overall health of cells, optimize culture or experimental conditions, and to measure cell survival following treatment with compounds, such as during a drug screen.

Cell-viability assessment is based on `PRISM` technology. 

PRISM is a high-throughput screen for assessing cell viability in which cell lines that have each been labelled with a unique 24-nucleotide barcode are pooled and treated with the experimental condition, and surviving cells are “counted” through identification of the cognate barcode. 

PRISM is an acronym for `Profiling Relative Inhibition Simultaneously in Mixture`.

There are 100 cell-viability features (c-0 to c-99) in the training dataset. 

Each cell-viability feature represents viability of one particular cell line, and all experiments are based on a set of similar cells. These are mostly cancer cells.

In [None]:
# make separate lists for the various types of features
c_list = []
g_list = []
others = []
for feat in train.columns:
        if (feat.find("c-")) != -1:
            c_list.append(feat)
        elif (feat.find("g-")) != -1:
            g_list.append(feat)
        else:
            others.append(feat)

### Cell Viability - Meta Statistics

In [None]:
plt.figure(figsize = (12,10))

plt.subplot(2,2,1)
sns.distplot(train[c_list].describe().values[1],color = "red")
plt.title("Mean",fontsize=15)

plt.subplot(2,2,2)
sns.distplot(train[c_list].describe().values[2],color = "blue")
plt.title("Standard Deviation",fontsize=15)


plt.subplot(2,2,3)
sns.distplot(train[c_list].describe().values[3],kde_kws={'bw': 0.1},color = "green")
plt.title("Minimum Value",fontsize=15)

plt.subplot(2,2,4)
sns.distplot(train[c_list].describe().values[7],color = "yellow")
plt.title("Maximum Value",fontsize=15)


plt.suptitle("Cell Viability - Meta Statistics",fontsize=20)

plt.show()

* This is a sharp contrast to the gene expression meta distributions. Most obviously, the minima are nearly all below -9.5, reaching down to the floor of -10. The maxima show a much broader distribution, between 3 and 5.
 
* As a consequence of this imbalance, the means are shifted towards negative values, around -0.5. Note that none of the means is above zero. Compared to the gene data, the distribution of standard deviations is shifted from around 1 to around 2, with a notable tail towards small values.
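The meta statistics discussed above are simply per-feature summaries. As a rough sketch, they can be collected into one table with `DataFrame.agg`. The data below is random and only stands in for `train[c_list]`.

```python
import numpy as np
import pandas as pd

# Hypothetical random data standing in for train[c_list]
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=[f"c-{i}" for i in range(4)])

# One row per feature, one column per statistic
meta = toy.agg(["mean", "std", "min", "max"]).T
print(meta)
```

Selecting statistics by name avoids relying on the row order of `describe()`.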

In [None]:
# helper function to plot distribution of g- & c- features
def plot_g_c(feats, feature_type):
    # expects up to 10 feature names; they are drawn on a 5x2 grid
    plt.figure(figsize = (15,30))

    for idx, feat in enumerate(feats):
        plt.subplot(5, 2, idx + 1)
        sns.distplot(train[feat], color = "red")
        plt.xlabel("Features Distribution - " + feat, fontsize=15)
        plt.ylabel("Density", fontsize=15)
        plt.title(feat, fontsize=15)

    plt.suptitle(feature_type + " Features - Distribution", fontsize=20)
    plt.show()

In [None]:
# call the helper function to check c- feature distribution
plot_g_c(['c-1','c-20','c-30','c-40','c-50','c-60','c-65','c-70','c-75','c-99'],"Cell Viability")

* The most common cell viability feature distribution looks like a left-skewed bell curve with a mean slightly below zero
* The features with the most distinct distributions are c-37, c-58, c-69, c-74 and c-76, because of their shorter tails
* These features have higher overall cell viability. Minimum values in the other features are clipped at -10, but clipping was not needed for the aforementioned features

### Correlation Matrix between c- variables

In [None]:
# heat map
plt.figure(figsize = (50,100))
corrMatrix = train[c_list].corr()
mask = np.triu(corrMatrix)
sns.heatmap(corrMatrix,
            annot=True,
            fmt='.1f',
            cmap='coolwarm',            
            mask=mask,
            linewidths=1,
            cbar=False)
plt.show()

* Cell Viability variables are highly correlated with each other
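A 100x100 heat map is hard to read, so a single summary number can help back up the observation above. The sketch below computes the mean absolute pairwise correlation on made-up data in which all columns share a common signal. It illustrates the idea and does not reproduce the notebook's result.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five columns built from one shared signal plus noise
rng = np.random.default_rng(0)
shared = rng.normal(size=200)
toy = pd.DataFrame({f"c-{i}": shared + 0.3 * rng.normal(size=200) for i in range(5)})

corr = toy.corr().to_numpy()
upper = corr[np.triu_indices_from(corr, k=1)]   # each pair once, diagonal excluded
mean_abs_corr = np.abs(upper).mean()
print(round(mean_abs_corr, 3))
```

Applied to `train[c_list]` and `train[g_list]`, the same two lines would give one comparable number per feature family.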

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Gene Expression Feature Distribution</center></h2>

Gene expression is the amount and type of proteins that are expressed in a cell at any given point in time. 

Gene expression level is based on a protocol similar to L1000 which is a high-throughput gene expression assay that measures the mRNA transcript abundance of 978 "landmark" genes from human cells. (The "L" in L1000 refers to the Landmark genes measured in the assay.)

There are 772 gene expression features (g-0 to g-771) present in the training dataset. 

Each gene expression feature represents the expression of one particular gene, so 772 individual genes are being monitored in this assay.

### Gene Expression - Meta Statistics

In [None]:
train[g_list].describe().index

In [None]:
plt.figure(figsize = (12,10))

plt.subplot(2,2,1)
sns.distplot(train[g_list].describe().values[1],color = "red")
plt.title("Mean",fontsize=15)

plt.subplot(2,2,2)
sns.distplot(train[g_list].describe().values[2],color = "blue")
plt.title("Standard Deviation",fontsize=15)


plt.subplot(2,2,3)
sns.distplot(train[g_list].describe().values[3],color = "green")
plt.title("Minimum Value",fontsize=15)

plt.subplot(2,2,4)
sns.distplot(train[g_list].describe().values[7],color = "yellow")
plt.title("Maximum Value",fontsize=15)


plt.suptitle("Gene Expression - Meta Statistics",fontsize=20)

plt.show()


* The means are pretty nicely distributed around zero; with standard deviations chiefly between 0.5 and 1.5.

* The min and max are a nice mirror image of each other. There are notable increases around the range of positive/negative 9 - 10.

In [None]:
# call the helper function to check g- feature distribution
plot_g_c(['g-1','g-6','g-11','g-16','g-21','g-26','g-31','g-36','g-41','g-46'],"Gene Expression")

* Gene expression feature distributions are more diverse than cell viability feature distributions
* Both left- and right-tailed, as well as long- and short-tailed, distributions exist

### Correlation Matrix between g- variables

In [None]:
# heat map
plt.figure(figsize = (12,8))
corrMatrix = train[g_list].corr()
sns.heatmap(corrMatrix)
plt.show()

* The correlation between g- variables is not as strong as it is between c- variables

### Treatment Features relation with Cell Features

In [None]:
plt.figure(figsize = (12,12))

plt.subplot(2,3,1)
c_1 = train[["c-1","cp_time"]]
c_1_24 = c_1[c_1["cp_time"]==24]
sns.distplot(c_1_24["c-1"],color = "red")
plt.title("C-1 Vs CP TIME = 24")

plt.subplot(2,3,2)
c_1_48 = c_1[c_1["cp_time"]==48]
sns.distplot(c_1_48["c-1"],color = "blue")
plt.title("C-1 Vs CP TIME = 48")


plt.subplot(2,3,3)
c_1_72 = c_1[c_1["cp_time"]==72]
sns.distplot(c_1_72["c-1"],color = "green")
plt.title("C-1 Vs CP TIME = 72")

plt.subplot(2,3,4)
c_2 = train[["c-2","cp_time"]]
c_2_24 = c_2[c_2["cp_time"]==24]
sns.distplot(c_2_24["c-2"],color = "red")
plt.title("C-2 Vs CP TIME = 24")

plt.subplot(2,3,5)
c_2_48 = c_2[c_2["cp_time"]==48]
sns.distplot(c_2_48["c-2"],color = "blue")
plt.title("C-2 Vs CP TIME = 48")


plt.subplot(2,3,6)
c_2_72 = c_2[c_2["cp_time"]==72]
sns.distplot(c_2_72["c-2"],color = "green")
plt.title("C-2 Vs CP TIME = 72")


plt.suptitle("Cell Viability Vs CP Time")

plt.show()

For both of the sample features we took above, c-1 and c-2, there is no visible difference in the distribution across the different treatment durations.

In [None]:
plt.figure(figsize = (12,12))

plt.subplot(2,2,1)
c_1 = train[["c-1","cp_dose"]]
c_1_D1 = c_1[c_1["cp_dose"]=="D1"]
sns.distplot(c_1_D1["c-1"],color = "red")
plt.title("C-1 Vs CP DOSE = D1")

plt.subplot(2,2,2)
c_1_D2 = c_1[c_1["cp_dose"]=="D2"]
sns.distplot(c_1_D2["c-1"],color = "green")
plt.title("C-1 Vs CP DOSE = D2")

plt.subplot(2,2,3)
c_2 = train[["c-2","cp_dose"]]
c_2_D1 = c_2[c_2["cp_dose"]=="D1"]
sns.distplot(c_2_D1["c-2"],color = "red")
plt.title("C-2 Vs CP DOSE = D1")

plt.subplot(2,2,4)
c_2_D2 = c_2[c_2["cp_dose"]=="D2"]
sns.distplot(c_2_D2["c-2"],color = "green")
plt.title("C-2 Vs CP DOSE = D2")

plt.suptitle("Cell Viability Vs CP DOSE")

plt.show()

For both of the sample features we took above, c-1 and c-2, there is no visible difference in the distribution across the different doses.
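Comparing density plots by eye can be complemented with grouped summary statistics. The following is a minimal sketch on random data (the frame `toy` is hypothetical and only mimics the column names used here).

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same column names as the training set
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cp_time": np.repeat([24, 48, 72], 100),
    "cp_dose": np.tile(["D1", "D2"], 150),
    "c-1": rng.normal(size=300),
})

# Summary statistics of c-1 per treatment duration and per dose
by_time = toy.groupby("cp_time")["c-1"].agg(["mean", "std", "median"])
by_dose = toy.groupby("cp_dose")["c-1"].agg(["mean", "std", "median"])
print(by_time)
print(by_dose)
```

If the rows of each table are close to one another, that supports the "no difference" reading of the plots.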

### Treatment Features relation with Gene Features

In [None]:
plt.figure(figsize = (12,12))

plt.subplot(2,3,1)
g_1 = train[["g-1","cp_time"]]
g_1_24 = g_1[g_1["cp_time"]==24]
sns.distplot(g_1_24["g-1"],color = "red")
plt.title("G-1 Vs CP TIME = 24")

plt.subplot(2,3,2)
g_1_48 = g_1[g_1["cp_time"]==48]
sns.distplot(g_1_48["g-1"],color = "blue")
plt.title("G-1 Vs CP TIME = 48")


plt.subplot(2,3,3)
g_1_72 = g_1[g_1["cp_time"]==72]
sns.distplot(g_1_72["g-1"],color = "green")
plt.title("G-1 Vs CP TIME = 72")

plt.subplot(2,3,4)
g_2 = train[["g-2","cp_time"]]
g_2_24 = g_2[g_2["cp_time"]==24]
sns.distplot(g_2_24["g-2"],color = "red")
plt.title("G-2 Vs CP TIME = 24")

plt.subplot(2,3,5)
g_2_48 = g_2[g_2["cp_time"]==48]
sns.distplot(g_2_48["g-2"],color = "blue")
plt.title("G-2 Vs CP TIME = 48")


plt.subplot(2,3,6)
g_2_72 = g_2[g_2["cp_time"]==72]
sns.distplot(g_2_72["g-2"],color = "green")
plt.title("G-2 Vs CP TIME = 72")


plt.suptitle("Gene Features Vs CP Time")

plt.show()

For both of the sample features we took above, g-1 and g-2, there is almost no difference in the distribution across the different treatment durations.

In [None]:
plt.figure(figsize = (12,12))

plt.subplot(2,2,1)
g_1 = train[["g-1","cp_dose"]]
g_1_D1 = g_1[g_1["cp_dose"]=="D1"]
sns.distplot(g_1_D1["g-1"],color = "red")
plt.title("G-1 Vs CP DOSE = D1")

plt.subplot(2,2,2)
g_1_D2 = g_1[g_1["cp_dose"]=="D2"]
sns.distplot(g_1_D2["g-1"],color = "blue")
plt.title("G-1 Vs CP DOSE = D2")

plt.subplot(2,2,3)
g_2 = train[["g-2","cp_dose"]]
g_2_D1 = g_2[g_2["cp_dose"]=="D1"]
sns.distplot(g_2_D1["g-2"],color = "red")
plt.title("G-2 Vs CP DOSE = D1")

plt.subplot(2,2,4)
g_2_D2 = g_2[g_2["cp_dose"]=="D2"]
sns.distplot(g_2_D2["g-2"],color = "blue")
plt.title("G-2 Vs CP DOSE = D2")



plt.suptitle("Gene Features Vs CP DOSE")

plt.show()

For both of the sample features we took above, g-1 and g-2, there is almost no difference in the distribution across the different doses.

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Let's now check Target Dataset</center></h2>

In [None]:
target.head()

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Target Dataset - Sparsity</center></h2>

As we saw earlier, there are 206 target variables, so the targets are most likely going to be sparse. Let's check!

In [None]:
plt.figure(figsize = (35,350))
for idx,col in enumerate(target.columns[1:]):   
    plt.subplot(42,5,idx+1)       
    sns.countplot(x=target[col],palette=colors_nude)
    plt.xlabel("Target Distribution - " + col)
    plt.ylabel("Count")
plt.show()

* Clearly, all the target variables are mostly 0s. The percentage of positive MoAs (target variable = 1) is extremely low, and the positive bar is not even visible for most of the target variables.

* This also means that the dataset is highly imbalanced, which is something we need to keep in mind while preparing the data for modelling.
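The sparsity can also be put into numbers instead of 206 small plots. The sketch below computes the positive rate per target on a made-up frame. `toy_target` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical miniature version of the target dataframe
toy_target = pd.DataFrame({
    "sig_id": ["id_0", "id_1", "id_2", "id_3"],
    "moa_a": [1, 0, 0, 0],
    "moa_b": [0, 0, 0, 0],
})

# Fraction of positive labels per target, most frequent first
positive_rate = toy_target.drop(columns="sig_id").mean().sort_values(ascending=False)
print(positive_rate)
```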

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Check for Multi-Label Drugs</center></h2>

In [None]:
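# count the labels per sample, ignoring the sig_id column
label_counts = target.drop(columns="sig_id").sum(axis=1)
new_df = target[label_counts > 1].copy()
new_df['sum'] = label_counts[label_counts > 1]
new_df.shape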

There are 1915 samples that are classified into more than one category

In [None]:
plt.figure(figsize = (15,10))
    
plt.subplot(2,2,1)
sns.countplot(x=new_df['sum'],palette=colors_nude)
plt.xlabel("No. of Targets",fontsize=15)
plt.ylabel("Count",fontsize=15)
plt.title("Samples classified into more than one category",fontsize=20)

* Around 1500 samples are classified into 2 targets
* Around 300 samples are classified into 3 targets
* Around 100 samples are classified into 4, 5 or 7 targets
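The counts read off the plot can also be tabulated directly. Here is a small sketch on made-up labels (`toy_target` is hypothetical) that counts how many samples carry 0, 1, 2, ... positive targets.

```python
import pandas as pd

# Hypothetical miniature target frame
toy_target = pd.DataFrame({
    "sig_id": ["id_0", "id_1", "id_2", "id_3"],
    "moa_a": [1, 1, 0, 0],
    "moa_b": [1, 0, 0, 1],
    "moa_c": [1, 0, 0, 1],
})

labels_per_sample = toy_target.drop(columns="sig_id").sum(axis=1)
distribution = labels_per_sample.value_counts().sort_index()   # label count -> samples
n_multi_label = int((labels_per_sample > 1).sum())
print(distribution)
print(n_multi_label)
```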

**The last thing we should check is that the "sig_id" values in the target dataset match the "sig_id" values in the training dataset**

In [None]:
# we are using set.intersection to find out similarity between the sig_id values present in the two datasets
len(set(train.sig_id.values).intersection(set(target.sig_id.values)))

The number we got above is equal to the number of records in both dataframes, hence we can conclude that the two dataframes contain exactly the same sig_id values.
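One caveat: a set comparison confirms that the same ids are present, but says nothing about row order. The sketch below shows the difference on two made-up frames (`toy_train` and `toy_target` are hypothetical).

```python
import pandas as pd

# Hypothetical frames: same ids, different row order
toy_train = pd.DataFrame({"sig_id": ["id_a", "id_b", "id_c"]})
toy_target = pd.DataFrame({"sig_id": ["id_a", "id_c", "id_b"]})

same_ids = set(toy_train["sig_id"]) == set(toy_target["sig_id"])
same_order = toy_train["sig_id"].equals(toy_target["sig_id"])
print(same_ids, same_order)
```

Checking the order as well matters whenever the two frames are later combined row by row rather than joined on `sig_id`.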

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Target Variables Interaction with Independent Features</center></h2>

In [None]:
# create a master dataset by joining train and target dataframes on sig_id
master_df = train.merge(target, on="sig_id")
master_df.shape

In [None]:
master_df.head()

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Gene Expression Vs Target Variables</center></h2>

In [None]:
for tar_col in target.columns[1:11]:
    plt.figure(figsize = (25,10))

    for idx in range(1,11):
        plt.subplot(2,5,idx)
        col = "g-" + str(idx)
        plt.scatter(x=master_df[col],y=master_df[tar_col])
        plt.xlabel(col,fontsize=10)
        plt.ylabel(tar_col,fontsize=10)
        plt.suptitle("Correlation between g-1 to 10 &  {}".format(tar_col),fontsize=15)
    plt.show()

* Only the first ten gene expression features are plotted against the first ten targets, because it is not possible to visualize all 772x206 feature interactions. 
     
* Gene expression features and targets have weaker relationships compared to cell viability features, because data points with positive target values are more spread along the x axis.
 
* All gene expression features have one thing in common: positive target values are clustered around zero, just like in the cell viability features. 
     
* High absolute values in gene expression features (>2 or <-2) indicate that the drug or perturbation had a significant effect on the cell, whereas values close to zero mean that the effect for that cell was not measurable.

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>Cell Viability Vs Target Variables</center></h2>

In [None]:
for tar_col in target.columns[1:11]:
    plt.figure(figsize = (25,10))

    for idx in range(1,11):
        plt.subplot(2,5,idx)
        col = "c-" + str(idx)
        plt.scatter(x=master_df[col],y=master_df[tar_col])
        plt.xlabel(col,fontsize=10)
        plt.ylabel(tar_col,fontsize=10)
    plt.suptitle("Correlation between c-1 to 10 &  {}".format(tar_col),fontsize=15)
    plt.show()

* Cell viability features are plotted against only the first ten target features, since there are 206 targets. 
     
* It can be seen that there are positive relationships between cell viability features and target features in most cases. 
 
* However, some of them have no relationship or a negative relationship with target features. This could be because most of the cells are cancer cells while some of them are not. Another pattern that can be seen in cell viability features is that positive target values are clustered around zero in most cases.

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:brown; border:0; color:white' role="tab" aria-controls="home"><center>cp_type, cp_time & cp_dose Vs Target Variables</center></h2>

In [None]:
target.columns[25:26]

In [None]:
def bivariate(feat):
    plt.figure(figsize = (25,50))

    for idx,tar_col in enumerate(target.columns[25:50]):  
        plt.subplot(10,5,idx+1)
        sns.countplot(x=master_df[tar_col],hue = master_df[feat],palette=colors_nude)    
    plt.show()

In [None]:
bivariate("cp_type")

* We have already seen how cp_type values are distributed in the training dataset. The majority of the records have the value "trt_cp", and that's the reason why we see tall bars for "trt_cp"
 
* To summarize what we see above: the majority of the records are classified as "0", and most of the records have a value of "trt_cp" for the cp_type feature

In [None]:
bivariate("cp_time")

To summarize what we see above: the majority of the records are classified as "0", and the different values of cp_time are almost equally distributed

In [None]:
bivariate("cp_dose")

To summarize what we see above: the majority of the records are classified as "0", and the different values of cp_dose are almost equally distributed

# Model Building

In [None]:
objects = []
for col in train.columns:
    if train[col].dtype == "object" and col != "sig_id":
        if train[col].nunique() > 1:
            train[col] = train[col].astype('category')
            objects.append(col)

In [None]:
for col in objects:
    train[col] = train[col].astype("category")
    train[col] = train[col].cat.codes + 1

train[objects].head()

In [None]:
for col in objects:
    test[col] = test[col].astype("category")
    test[col] = test[col].cat.codes + 1

test[objects].head()
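The two cells above encode train and test independently, which gives matching codes only when both frames contain the same set of values in every column. A more defensive variant, sketched here on made-up frames (`toy_train` and `toy_test` are hypothetical), derives the categories once from the training data and reuses them for the test data.

```python
import pandas as pd

# Hypothetical frames; note that the toy test set contains only one value per column
toy_train = pd.DataFrame({"cp_type": ["trt_cp", "ctl_vehicle", "trt_cp"],
                          "cp_dose": ["D1", "D2", "D1"]})
toy_test = pd.DataFrame({"cp_type": ["trt_cp", "trt_cp"],
                         "cp_dose": ["D2", "D2"]})

for col in ["cp_type", "cp_dose"]:
    # fix the category order from the training data and share it with test
    dtype = pd.CategoricalDtype(categories=sorted(toy_train[col].unique()))
    toy_train[col] = toy_train[col].astype(dtype).cat.codes + 1
    toy_test[col] = toy_test[col].astype(dtype).cat.codes + 1

print(toy_train)
print(toy_test)
```

With independent encoding, the toy test frame would map both "trt_cp" and "D2" to 1, which would disagree with the training codes.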

In [None]:
X = train.drop(columns="sig_id")
y = target.drop(columns="sig_id")
X.shape, y.shape

# PCA

In [None]:
#Importing the PCA module
#from sklearn.decomposition import PCA
#pca = PCA(svd_solver='randomized', random_state=42)

In [None]:
#let's apply PCA
#pca.fit(X)

In [None]:
#Making the screeplot - plotting the cumulative variance against the number of components
#%matplotlib inline
#fig = plt.figure(figsize = (12,8))
#plt.plot(np.cumsum(pca.explained_variance_ratio_))
#plt.xlabel('number of components')
#plt.ylabel('cumulative explained variance')
#plt.show()

In [None]:
#Using incremental PCA for efficiency - saves a lot of time on larger datasets
#from sklearn.decomposition import IncrementalPCA
#pca_final = IncrementalPCA(n_components=350)

In [None]:
#df_pca = pd.DataFrame(pca_final.fit_transform(X))
#df_pca.shape

In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score

# split dataset into training and validation set
#X_train, X_test, y_train, y_test = train_test_split(df_pca, y, test_size=0.2, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# create XGBoost instance with hand-picked hyper-parameters ("gpu_hist" requires a GPU)
args = {'max_depth': 7, 'learning_rate': .05,'n_estimators': 170, 'tree_method': "gpu_hist", 'gamma':4}

xgb_estimator = xgb.XGBClassifier(**args)

# create MultiOutputClassifier instance with XGBoost model inside
multilabel_model = MultiOutputClassifier(xgb_estimator)

# fit the model
multilabel_model.fit(X_train, y_train)

# evaluate on the held-out data; for multi-label targets accuracy_score is the
# exact-match (subset) accuracy, not the log loss used by the competition
print('Accuracy on held-out data: {:.1f}%'.format(accuracy_score(y_test, multilabel_model.predict(X_test))*100))


In [None]:
test.drop(columns='sig_id',inplace=True)
#test_pca = pd.DataFrame(pca_final.fit_transform(test))
#test_pca.shape

In [None]:
#val_preds = multilabel_model.predict_proba(test_pca) # list of preds per class
val_preds = multilabel_model.predict_proba(test) # list with one (n_samples, 2) array per target
val_preds = np.array(val_preds)[:,:,1].T # take the positive class -> (n_samples, n_targets)
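The indexing in the cell above is compact, so here is a small sketch of what it does. `proba_list` is a hand-written stand-in for the output of `MultiOutputClassifier.predict_proba`, which returns one `(n_samples, 2)` array per target. The numbers are made up.

```python
import numpy as np

# Hypothetical predict_proba output for 3 samples and 2 targets:
# each array holds [P(class 0), P(class 1)] per sample
proba_list = [
    np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]),     # target 0
    np.array([[0.6, 0.4], [0.5, 0.5], [0.99, 0.01]]),   # target 1
]

stacked = np.array(proba_list)   # shape (n_targets, n_samples, 2)
positive = stacked[:, :, 1].T    # keep P(class 1), shape (n_samples, n_targets)
print(positive)
```

The transpose is what brings the matrix into the submission layout: one row per sample, one column per MoA.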

In [None]:
# create the submission file
sub.iloc[:,1:] = val_preds
sub.to_csv('submission.csv', index=False)

In [None]:
sub.head()