<img src='https://img.timesnownews.com/story/1536770343-miedicine-2.jpg?d=600x450'>
<p>
<h1><center>Mechanisms of Action (MoA) Prediction: EDA!💊</center></h1>
    
    
    
# 1. <a id='Introduction'>Introduction 🃏 </a>
    
### 1.1 What is a Mechanism of Action (MoA)?
* In the past, scientists derived drugs from natural products or were inspired by traditional remedies. Very common drugs, such as paracetamol, known in the US as acetaminophen, were put into clinical use decades before the biological mechanisms driving their pharmacological activities were understood. Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease. In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action or MoA for short.

###  1.2 What is the Mechanisms of Action (MoA) Prediction Competition?
* The aim of this challenge is to “classify drugs based on their biological activity”. Pharmaceutical drug discovery aims to identify certain proteins that are associated with a specific disease, and then to develop molecules that can target those proteins. The MoA of a molecule encodes its biological activity. Our dataset describes the response of 100 different types of human cells to various drugs. Those response patterns will be used to classify the MoA response. This is a multi-label classification problem: a sample can carry several MoA labels at once, or none.

### 1.3 Metric: log loss
* Please see the [evaluation metric](https://www.kaggle.com/c/lish-moa/overview/evaluation).
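
To make the metric concrete, here is a minimal sketch of a column-wise mean log loss. It assumes the score is the binary log loss averaged over all samples and all targets, with predictions clipped away from 0 and 1. The function name `mean_columnwise_log_loss` and the clipping value `1e-15` are my own choices for illustration; the linked evaluation page remains the authoritative definition.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss averaged over every (sample, target) cell."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs (eps is an assumed value)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    cell_loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Mean over samples for each target column, then mean over the columns
    return cell_loss.mean(axis=0).mean()

# Toy example: 3 samples, 2 targets (made-up numbers)
y_true = [[1, 0], [0, 0], [0, 1]]
y_pred = [[0.9, 0.1], [0.2, 0.1], [0.1, 0.8]]
score = mean_columnwise_log_loss(y_true, y_pred)
print(round(score, 4))
```

A useful sanity check: predicting 0.5 for every cell gives a score of ln 2 (about 0.693), whatever the labels are.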

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#color
from colorama import Fore, Back, Style

# 2. <a id='2'>Reading the data 📚</a>

In [None]:
base_path = '../input/lish-moa/'

* train_features.csv - Features for the training set. Features g- signify gene expression data, and c- signify cell viability data. cp_type indicates samples treated with a compound (cp_vehicle) or with a control perturbation (ctrl_vehicle); control perturbations have no MoAs; cp_time and cp_dose indicate treatment duration (24, 48, 72 hours) and dose (high or low).
* train_targets_scored.csv - The binary MoA targets that are scored.
* train_targets_nonscored.csv - Additional (optional) binary MoA responses for the training data. These are not predicted nor scored.
* test_features.csv - Features for the test data. You must predict the probability of each scored MoA for each row in the test data.
* sample_submission.csv - A submission file in the correct format.
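
The statement that control perturbations have no MoAs can be checked by joining features and targets on `sig_id`. The sketch below uses tiny made-up frames instead of the real files. The `cp_type` values `trt_cp` and `ctl_vehicle` and the target column names `moa_1`/`moa_2` are assumptions on my part, so check `train_features_df['cp_type'].unique()` before reusing it.

```python
import pandas as pd

# Made-up stand-ins for train_features.csv and train_targets_scored.csv
features = pd.DataFrame({
    'sig_id': ['id_a', 'id_b', 'id_c', 'id_d'],
    'cp_type': ['trt_cp', 'ctl_vehicle', 'trt_cp', 'ctl_vehicle'],  # assumed labels
})
targets = pd.DataFrame({
    'sig_id': ['id_a', 'id_b', 'id_c', 'id_d'],
    'moa_1': [1, 0, 0, 0],  # hypothetical target columns
    'moa_2': [0, 0, 1, 0],
})

# Join on the sample id, then count the active MoA labels per row
merged = features.merge(targets, on='sig_id', how='inner')
target_cols = [c for c in targets.columns if c != 'sig_id']
merged['n_moa'] = merged[target_cols].sum(axis=1)

# Total number of active labels per cp_type; controls should sum to 0
labels_per_type = merged.groupby('cp_type')['n_moa'].sum()
print(labels_per_type)
```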

In [None]:
test_features_df = pd.read_csv(base_path + 'test_features.csv')
train_features_df = pd.read_csv(base_path + 'train_features.csv')
train_targets_scored_df = pd.read_csv(base_path + 'train_targets_scored.csv')
train_targets_nonscored_df = pd.read_csv(base_path + 'train_targets_nonscored.csv')
sample_submission_df = pd.read_csv(base_path + 'sample_submission.csv')

In [None]:
print(Fore.YELLOW + 'Sample submission shape: ',Style.RESET_ALL,sample_submission_df.shape)
sample_submission_df.head(5)

# 3. <a id='3'>Basic Data Exploration 🏕️</a>

In [None]:
# Null values and data types
# DataFrame.info() prints its summary and returns None, so call it directly
print(Fore.YELLOW + 'test_features_df !!', Style.RESET_ALL)
test_features_df.info()
print('-------------')
print(Fore.BLUE + 'train_features_df !!', Style.RESET_ALL)
train_features_df.info()
print('-------------')
print(Fore.YELLOW + 'train_targets_scored_df !!', Style.RESET_ALL)
train_targets_scored_df.info()
print('-------------')
print(Fore.BLUE + 'train_targets_nonscored_df !!', Style.RESET_ALL)
train_targets_nonscored_df.info()
print('-------------')
print(Fore.YELLOW + 'sample_submission_df !!', Style.RESET_ALL)
sample_submission_df.info()

float64(`872`) : 772 columns with gene expression data + 100 columns with cell viability data
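
The 772 + 100 split can be verified by grouping column names on their prefix. This is a minimal sketch on a made-up frame: the column names mimic the `g-`/`c-` convention, and the counts are toy values rather than the real ones.

```python
import pandas as pd

# Toy frame with the same naming convention as the feature files
df = pd.DataFrame(columns=['sig_id', 'cp_type', 'cp_time', 'cp_dose',
                           'g-0', 'g-1', 'g-2', 'c-0', 'c-1'])

gene_cols = [c for c in df.columns if c.startswith('g-')]   # gene expression
cell_cols = [c for c in df.columns if c.startswith('c-')]   # cell viability
other_cols = [c for c in df.columns if c not in gene_cols + cell_cols]

print(len(gene_cols), len(cell_cols), other_cols)
```

On the real data, replace `df` with `train_features_df`; the two lengths should then match the counts quoted above.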

## Correlation
* https://www.kaggle.com/blessondensil294/beginners-eda-viz-moa-drug-prediction

In [None]:
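# Correlate numeric columns only (sig_id, cp_type and cp_dose are strings)
corrmat = train_features_df.select_dtypes(include='number').corr()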
f, ax = plt.subplots(figsize=(14,14))
sns.heatmap(corrmat, square=True, vmax=.8)

In [None]:
# Drop the string sig_id column before computing correlations
corrmat = train_targets_scored_df.select_dtypes(include='number').corr()
f, ax = plt.subplots(figsize=(14,14))
sns.heatmap(corrmat, square=True, vmax=.8)

In [None]:
# Drop the string sig_id column before computing correlations
corrmat = train_targets_nonscored_df.select_dtypes(include='number').corr()
f, ax = plt.subplots(figsize=(14,14))
sns.heatmap(corrmat, square=True, vmax=.8)
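
With hundreds of columns, a heatmap mostly shows overall structure. To see which specific pairs are strongly related, one option is to flatten the correlation matrix and sort it. This is a sketch on made-up data, where `g-1` is deliberately built to track `g-0`; the column names and random values are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Made-up data: 'g-1' follows 'g-0' closely, 'c-0' is independent noise
toy = pd.DataFrame({
    'g-0': base,
    'g-1': base + rng.normal(scale=0.1, size=200),
    'c-0': rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
top_pair = pairs.index[0]
print(pairs.head())
```

On the real data, `toy` would be replaced by the numeric part of `train_features_df`.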

## Automated ML tools for EDA

For basic EDA, I'll use some automated ML tools.

# 4. <a id='4'>Data Analysis Baseline Library (dabl) 🌱</a> 

dabl tries to make supervised machine learning more accessible for beginners and to reduce boilerplate for common tasks.

In [None]:
!pip install dabl

In [None]:
import dabl

Now let’s ask dabl what it thinks by cleaning up the data.

dabl tries to detect the types of your data and apply appropriate conversions. It also tries to detect potential data quality issues. The field of data cleaning is impossibly broad, and dabl’s approaches are by no means sophisticated. The goal of dabl is to get the data “clean enough” to create useful visualizations and models, and to allow users to perform custom cleaning operations themselves. In particular, if the detection of semantic types (continuous, categorical, ordinal, text, etc.) fails, the user can provide `type_hints`. Here I rely on the automatic detection.

In [None]:
test_features_df_clean = dabl.clean(test_features_df, verbose=0)
train_features_df_clean = dabl.clean(train_features_df, verbose=0)
train_targets_scored_df_clean = dabl.clean(train_targets_scored_df, verbose=0)
train_targets_nonscored_df_clean = dabl.clean(train_targets_nonscored_df, verbose=0)

# EDA for test_features_df

In [None]:
types = dabl.detect_types(test_features_df_clean)
print(types) 

### cp_type

`cp_type` indicates samples treated with a compound (cp_vehicle) or with a control perturbation (ctrl_vehicle).

In [None]:
dabl.plot(test_features_df_clean, 'cp_type')

### cp_time

`cp_time` : treatment duration (24, 48, 72 hours).

In [None]:
dabl.plot(test_features_df_clean, 'cp_time')

### cp_dose

`cp_dose` : dose (high or low)

In [None]:
dabl.plot(test_features_df_clean, 'cp_dose')
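
Beyond the per-column plots, a cross-tabulation shows whether treatment durations and doses are balanced against each other. This is a sketch on made-up rows; the dose labels `D1`/`D2` are an assumption about how the column is encoded, so check `test_features_df['cp_dose'].unique()` first.

```python
import pandas as pd

# Made-up rows mimicking the cp_time / cp_dose columns
toy = pd.DataFrame({
    'cp_time': [24, 24, 48, 48, 72, 72, 24, 48],
    'cp_dose': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2', 'D1', 'D1'],  # assumed labels
})

# Rows: treatment duration, columns: dose, cells: number of samples
counts = pd.crosstab(toy['cp_time'], toy['cp_dose'])
print(counts)
```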

### g-0 ~ g-771

Features `g-` signify gene expression data. Gene expression is the process by which the information encoded in a gene is used to direct the assembly of a protein molecule. The cell reads the sequence of the gene in groups of three bases.

In test_features_df, we can see `g-0` ~ `g-771`. I'll show only `g-0, g-100, g-350, g-500, g-771` since there are too many `g-` columns to plot them all.

In [None]:
dabl.plot(test_features_df_clean, 'g-0')

In [None]:
dabl.plot(test_features_df_clean, 'g-100')

In [None]:
dabl.plot(test_features_df_clean, 'g-350')

In [None]:
dabl.plot(test_features_df_clean, 'g-500')

In [None]:
dabl.plot(test_features_df_clean, 'g-771')

### c-0 ~ c-99

Features `c-` signify `cell viability` data. Cell viability is a measure of the proportion of live, healthy cells within a population. Cell viability assays are used to determine the overall health of cells, optimize culture or experimental conditions, and to measure cell survival following treatment with compounds, such as during a drug screen.

In test_features_df, we can see `c-0` ~ `c-99`. I'll show only `c-0, c-50, c-99` since there are too many `c-` columns to plot them all.

In [None]:
dabl.plot(test_features_df_clean, 'c-0')

In [None]:
dabl.plot(test_features_df_clean, 'c-50')

In [None]:
dabl.plot(test_features_df_clean, 'c-99')

# EDA for train_features_df

In [None]:
types = dabl.detect_types(train_features_df_clean)
print(types) 

### cp_type

In [None]:
dabl.plot(train_features_df_clean, 'cp_type')

### cp_time

In [None]:
dabl.plot(train_features_df_clean, 'cp_time')

### cp_dose

In [None]:
dabl.plot(train_features_df_clean, 'cp_dose')

### g-0 ~ g-771

In train_features_df, we can see `g-0` ~ `g-771`. I'll show only `g-0, g-100, g-350, g-500, g-771` since there are too many `g-` columns to plot them all.

In [None]:
dabl.plot(train_features_df_clean, 'g-0')

In [None]:
dabl.plot(train_features_df_clean, 'g-100')

In [None]:
dabl.plot(train_features_df_clean, 'g-350')

In [None]:
dabl.plot(train_features_df_clean, 'g-500')

In [None]:
dabl.plot(train_features_df_clean, 'g-771')

### c-0 ~ c-99

In train_features_df, we can see `c-0` ~ `c-99`. I'll show only `c-0, c-50, c-99` since there are too many `c-` columns to plot them all.

In [None]:
dabl.plot(train_features_df_clean, 'c-0')

In [None]:
dabl.plot(train_features_df_clean, 'c-50')

In [None]:
dabl.plot(train_features_df_clean, 'c-99')

# EDA for train_targets_scored_df

In [None]:
types = dabl.detect_types(train_targets_scored_df_clean)
print(types) 

In [None]:
train_targets_scored_df_clean

In [None]:
train_targets_scored_df

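Every value shown in this preview is `0`. That reflects how sparse the MoA labels are, not an absence of positive labels: only a small share of the cells in the full table are `1`.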
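
A table preview only shows a handful of rows and columns, so it cannot tell us whether positive labels exist. Counting them directly is more reliable. This is a minimal sketch on a made-up target frame; the `moa_*` column names are hypothetical.

```python
import pandas as pd

# Made-up stand-in for train_targets_scored_df
toy_targets = pd.DataFrame({
    'sig_id': ['id_a', 'id_b', 'id_c', 'id_d'],
    'moa_1': [0, 0, 1, 0],
    'moa_2': [0, 0, 0, 0],
    'moa_3': [0, 1, 1, 0],
})
labels = toy_targets.drop(columns='sig_id')

positives_per_target = labels.sum(axis=0)  # how often each MoA is active
labels_per_sample = labels.sum(axis=1)     # how many MoAs each sample has
positive_rate = labels.values.mean()       # fraction of cells equal to 1

print(positives_per_target.sort_values(ascending=False))
print(labels_per_sample.value_counts().sort_index())
print(positive_rate)
```

The same three lines applied to `train_targets_scored_df` show which targets are rare and how many samples have no label at all.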

# EDA for train_targets_nonscored_df

In [None]:
train_targets_nonscored_df_clean

In [None]:
train_targets_nonscored_df

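Again, every value shown in the preview is `0`. The preview covers only a few rows and columns, so this alone does not show that the whole table is empty.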

# 5. <a id='5'>Datasist ✨</a> 

![](https://warehouse-camo.ingress.cmh1.psfhosted.org/6572d848c045b008268a4d6ca2617526a102d9b0/68747470733a2f2f726973656e772e6769746875622e696f2f64617461736973742f64617461736973742e706e67)
**datasist** is a Python package that provides a fast, abstracted interface to popular and frequently used functions and techniques for data analysis, visualization, data exploration, feature engineering, computer vision, NLP, deep learning, modeling, model deployment, etc.

In [None]:
!pip install datasist

In [None]:
import datasist as ds  #import datasist library

**check_train_test_set**: Checks the distribution of train and test for uniqueness in order to determine the best feature engineering strategy.

In [None]:
ds.structdata.check_train_test_set(train_features_df, test_features_df, index=None, col=None)

In [None]:
ds.structdata.describe(test_features_df)

In [None]:
ds.structdata.describe(train_features_df)

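**detect_outliers**: Detects rows with outliers. As I understand it, the second argument (`80` here) sets how many outlying values a row may contain before it is reported.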

In [None]:
numerical_feats = ds.structdata.get_num_feats(test_features_df)
ds.structdata.detect_outliers(test_features_df,80,numerical_feats)

In [None]:
numerical_feats = ds.structdata.get_num_feats(train_features_df)
ds.structdata.detect_outliers(train_features_df,80,numerical_feats)
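
To show what "rows with outliers" can mean in practice, here is a sketch of the common 1.5 × IQR (Tukey) rule applied per column, flagging rows that are outliers in more than `n` columns. This is not datasist's code. I am assuming its behaviour is similar, and the helper name `rows_with_many_outliers` is hypothetical.

```python
import pandas as pd

def rows_with_many_outliers(df, n, cols):
    """Return index labels of rows that are IQR outliers in more than n columns."""
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    step = 1.5 * (q3 - q1)
    # Boolean frame: True where a cell lies outside [Q1 - step, Q3 + step]
    is_outlier = (df[cols] < (q1 - step)) | (df[cols] > (q3 + step))
    counts = is_outlier.sum(axis=1)
    return counts[counts > n].index.tolist()

# Made-up data: row 4 is extreme in two columns, row 3 in one
toy = pd.DataFrame({
    'g-0': [0.1, 0.2, 0.0, -0.1, 9.0],
    'g-1': [1.0, 1.1, 0.9, 1.0, -7.0],
    'c-0': [0.5, 0.4, 0.6, 5.0, 0.5],
})
flagged = rows_with_many_outliers(toy, n=1, cols=['g-0', 'g-1', 'c-0'])
print(flagged)
```

With a threshold as high as 80, only rows that are extreme across a large share of the features would be flagged.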

**display_missing**: Display missing values as a pandas dataframe.

In [None]:
ds.structdata.display_missing(test_features_df)

In [None]:
ds.structdata.display_missing(train_features_df)
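
The same kind of summary can be built with plain pandas, which is handy when datasist is not installed. This is a minimal sketch on a made-up frame; the output column names `missing_count` and `missing_percent` are my own and not necessarily what `display_missing` returns.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing cells
toy = pd.DataFrame({
    'g-0': [0.1, np.nan, 0.3, 0.4],
    'c-0': [1.0, 2.0, 3.0, 4.0],
    'cp_dose': ['D1', None, None, 'D2'],
})

# One row per column: number and percentage of missing values
missing = pd.DataFrame({
    'missing_count': toy.isna().sum(),
    'missing_percent': toy.isna().mean() * 100,
}).sort_values('missing_count', ascending=False)
print(missing)
```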

**get_cat_feats** : Returns the categorical features in a data set

In [None]:
cat_feats = ds.structdata.get_cat_feats(test_features_df)
cat_feats

In [None]:
cat_feats = ds.structdata.get_cat_feats(train_features_df)
cat_feats

**get_num_feats** : Returns the numerical features in a data set

In [None]:
num_feats = ds.structdata.get_num_feats(test_features_df)
print(len(num_feats))

In [None]:
num_feats = ds.structdata.get_num_feats(train_features_df)
print(len(num_feats))
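
For comparison, pandas can split columns by dtype on its own with `select_dtypes`. This is a sketch on a made-up frame; datasist may apply different rules (for example, for low-cardinality integer columns), so the two results are not guaranteed to match.

```python
import pandas as pd

# Made-up frame with two string columns and two numeric columns
toy = pd.DataFrame({
    'sig_id': ['id_a', 'id_b'],
    'cp_type': ['trt_cp', 'ctl_vehicle'],  # assumed labels
    'cp_time': [24, 48],
    'g-0': [0.5, -0.2],
})

num_cols = toy.select_dtypes(include='number').columns.tolist()
cat_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(num_cols, cat_cols)
```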

In [None]:
get_unique_counts = ds.structdata.get_unique_counts(test_features_df)
get_unique_counts

In [None]:
get_unique_counts = ds.structdata.get_unique_counts(train_features_df)
get_unique_counts

# 6. <a id='6'>AutoViz 🛶 </a> 

![](https://github.com/AutoViML/AutoViz/raw/master/logo.png)
Automatically Visualize any dataset, any size with a single line of code.

AutoViz performs automatic visualization of any dataset with one line. Give any input file (CSV, txt or json) and AutoViz will visualize it.

In [None]:
!pip install autoviz

In [None]:
import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()

## test_features

In [None]:
test_features_df

In [None]:
sep = ','
target = 'cp_type'
dft = AV.AutoViz(filename="", sep=sep, depVar=target, dfte=test_features_df, header=0, verbose=1,
                            lowess=False,chart_format='svg',max_rows_analyzed=5000,max_cols_analyzed=50)

## train_features

In [None]:
sep = ','
target = 'cp_type'
dft = AV.AutoViz(filename="", sep=sep, depVar=target, dfte=train_features_df, header=0, verbose=1,
                            lowess=False, chart_format='svg',max_rows_analyzed=4000,max_cols_analyzed=50)

I could not get AutoViz to render plots for this data, but its log output still gives some information.
* Number of variables removed due to `high correlation` = `235` in test_features_df
* Number of variables removed due to `high correlation` = `229`  in train_features_df

# 7. <a id='7'>missingno 🛶</a> 

![](https://storage.googleapis.com/coderzcolumn/static/tutorials/data_science/article_image/missingno%20-%20Visualize%20Missing%20Data%20in%20Python.jpg)
Messy datasets? Missing values? missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. Just pip install missingno to get started.

In the case of a real-world dataset, it is very common that some values in the dataset are missing. We represent these missing values as NaN (Not a Number) values. But to build a good machine learning model our dataset should be complete. That’s why we use some imputation techniques to replace the NaN values with some probable values. But before doing that we need to have a good understanding of how the NaN values are distributed in our dataset.

The missingno library offers a very nice way to visualize the distribution of NaN values. It is a Python library and is compatible with pandas.

In [None]:
!pip install missingno

In [None]:
import missingno as msno

import matplotlib.pyplot as plt
%matplotlib inline

**Matrix:**

Visualising missing values for a sample. Using this matrix you can very quickly find the pattern of missingness in the dataset.

In [None]:
msno.matrix(test_features_df)

In [None]:
msno.matrix(train_features_df)

**Bar Chart :**

This bar chart shows how many non-missing values there are in each column, so short bars point to columns with missing data.

In [None]:
msno.bar(test_features_df.sample(10))

In [None]:
msno.bar(train_features_df.sample(10))
