# 🔥🔥TPS Oct 2021 - 🔥🔥EDA and Analysis 🔥🔥

# Biological Molecules response to Chemicals 
.         |  ..
:-------------------------:|:-------------------------:
![](https://miro.medium.com/max/1400/0*hQG_y2xkkj4cjexx)  |  ![DNA](https://upload.wikimedia.org/wikipedia/commons/1/16/DNA_orbit_animated.gif)
![Key Bio Proteins](https://upload.wikimedia.org/wikipedia/commons/d/d3/0322_DNA_Nucleotides.jpg)  | ![Meta Genomic Methods](https://upload.wikimedia.org/wikipedia/commons/4/47/Overview_of_metagenomic_methods.jpg)

> Chemical biology is a scientific discipline spanning the fields of chemistry and biology. The discipline involves the application of chemical techniques, analysis, and often small molecules produced through synthetic chemistry, to the study and manipulation of biological systems. In contrast to biochemistry, which involves the study of the chemistry of biomolecules and regulation of biochemical pathways within and between cells, chemical biology deals with chemistry applied to biology (synthesis of biomolecules, simulation of biological systems etc.).

## Excited? 


##### Some References on Biology and Chemistry correlation 
###### Thanks to Tensor Girl for posting on BMS competition
- [Deep learning and generative methods in cheminformatics and chemical biology: navigating small molecule space intelligently](https://portlandpress.com/biochemj/article/477/23/4559/227194/Deep-learning-and-generative-methods-in)
- [Learning Drug Functions from Chemical Structures with Convolutional Neural Networks and Random Forests](https://pubs.acs.org/doi/10.1021/acs.jcim.9b00236)
- [Deep Learning of Atomically Resolved Scanning Transmission Electron Microscopy Images: Chemical Identification and Tracking Local Transformations](https://www.osti.gov/servlets/purl/1427646)
- [Chemception: Deep Learning from 2D Chemical Structure Images](https://depth-first.com/articles/2019/02/04/chemception-deep-learning-from-2d-chemical-structure-images/)
- [Molecular Structure Extraction From Documents Using Deep Learning](https://arxiv.org/ftp/arxiv/papers/1802/1802.04903.pdf)
- [CheMixNet: Mixed DNN Architectures for Predicting Chemical Properties using Multiple Molecular Representations](http://cucis.eecs.northwestern.edu/publications/pdf/PJA18.pdf)

### One more source to check for good info on molecules is [PubChem](https://pubchem.ncbi.nlm.nih.gov/)
> PubChem is the world's largest collection of freely accessible chemical information. Search chemicals by name, molecular formula, structure, and other identifiers. Find chemical and physical properties, biological activities, safety and toxicity information, patents, literature citations and more.

### Let us check the analysis

# Table of Contents
<a id="table-of-contents"></a>
- [1 Introduction](#1)
- [2 Preparations](#2)
- [3 Datasets Overview](#3)
    - [3.1 Train dataset](#3.1)
    - [3.2 Test dataset](#3.2)
    - [3.3 Submission](#3.3)
- [4 Features](#4)
    - [4.1 Missing values](#4.1)
       - [4.1.1 Preparation](#4.1.1)
       - [4.1.2 Individual features](#4.1.2)
       - [4.1.3 Individual rows](#4.1.3)
       - [4.1.4 Dealing with missing values (reference)](#4.1.4)
    - [4.2 Distribution](#4.2)

[back to top](#table-of-contents)
<a id="1"></a>
# 1 Introduction

Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new to their data science journey. In the past, Kaggle has launched many Playground competitions that are more approachable than Featured competitions, and thus more beginner-friendly.

The goal of these competitions is to provide a fun, but less challenging, tabular dataset. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition.

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using [CTGAN](https://github.com/sdv-dev/CTGAN). The original dataset deals with *predicting the **biological response of molecules** given various chemical properties*. Although the features are anonymized, they have properties relating to real-world features.

This competition asks you to predict the biological response of molecules given various chemical properties. The ground truth target is binary valued, but a prediction may be any number from 0.0 to 1.0, representing the probability of a positive response. The features in this dataset have been anonymized; missing values are checked in section 4.1.

Submissions are evaluated on **area under the ROC curve** between the predicted probability and the observed target.
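
To make the metric concrete, here is a minimal sketch of the area under the ROC curve computed by pairwise comparison: the share of (positive, negative) pairs in which the positive row gets the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not competition data. In practice a library routine such as `sklearn.metrics.roc_auc_score` would normally be used instead.

```python
def roc_auc(y_true, y_score):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration only)
y_true = [0, 0, 1, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8, 0.45]
# The same predictions thresholded at 0.5 into hard 0/1 labels
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = roc_auc(y_true, probabilities)
auc_from_labels = roc_auc(y_true, hard_labels)
print(auc_from_probabilities, auc_from_labels)
```

Because the metric depends only on how rows are ranked, thresholding the scores into hard labels throws away ranking information and lowers the AUC on this toy example.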

In [None]:

# import packages
import os
import joblib
import numpy as np
import pandas as pd
import warnings

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns

# setting up options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('float_format', '{:f}'.format)
warnings.filterwarnings('ignore')

# import datasets
train_file = '../input/tabular-playground-series-oct-2021/train.csv'
test_file = '../input/tabular-playground-series-oct-2021/test.csv'
sub_file = '../input/tabular-playground-series-oct-2021/sample_submission.csv'

In [None]:
train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)
submission = pd.read_csv(sub_file)

[back to top](#table-of-contents)
<a id="2"></a>
# 2 Preparations
This section prepares the packages and data used in the analysis. The packages loaded are mainly for data manipulation, data visualization and modeling. Two datasets are used in the analysis: train and test. The train dataset is used to fit models, which are then used to predict on the test dataset. The sample submission file shows participants the expected submission format. *(to see the details, please expand)*

In [None]:
train_df.head()

[back to top](#table-of-contents)
<a id="3"></a>
# 3 Dataset Overview
The intent of the overview is to get a feel for the data and its structure in the train, test and submission files. The overview of the train and test datasets includes a quick look at missing values and basic statistics, while the sample submission is loaded to see the expected submission format.

<a id="3.1"></a>
## 3.1 Train dataset
As stated before, the train dataset is mainly used to train a predictive model, since it includes the target variable. It is also used to explore the data itself, including the relationship between each predictor and the target variable.

**Observations:**
- `target` is the target variable and is only available in the `train` dataset.
- There are `287` columns: `285` features, `1` target variable (`target`) and `1` `id` column.
- The `train` dataset contains `1,000,000` rows and the `test` dataset contains `500,000` rows.
- `f0` to `f241`: continuous features (242)
- `f242` to `f284`: binary features (43)
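
The continuous/binary split above can be checked rather than assumed. The sketch below counts distinct values per feature on a small toy frame and treats columns with at most two distinct values as binary. The toy frame and its column names are assumptions for illustration; on the real data the same lines would be run on `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df (values are made up for illustration)
toy = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "f0": [0.12, 0.48, 0.33, 0.91, 0.27, 0.65],
    "f1": [0.05, 0.77, 0.42, 0.19, 0.58, 0.83],
    "f2": [0, 1, 0, 1, 1, 0],
    "target": [1, 0, 0, 1, 0, 1],
})

# Exclude the identifier and the label, then count distinct values per feature
feature_cols = [c for c in toy.columns if c not in ("id", "target")]
n_unique = toy[feature_cols].nunique()

# Heuristic: at most two distinct values means the feature is binary
binary_cols = n_unique[n_unique <= 2].index.tolist()
continuous_cols = n_unique[n_unique > 2].index.tolist()

print("binary:", binary_cols)
print("continuous:", continuous_cols)
```

Counting distinct values is more robust than relying on dtype alone, since a binary column can be stored as a float.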

### 3.1.1 Quick view
Below are the number of rows, the number of columns and the total count of missing values in the train dataset:

In [None]:
print(f'Number of rows: {train_df.shape[0]}')
print(f'Number of columns: {train_df.shape[1]}')
print(f'No of missing values: {sum(train_df.isna().sum())}')

### 3.1.2 Basic statistics
Below are the basic statistics for each variable: `count`, `mean`, `standard deviation`, `minimum`, `1st quartile`, `median`, `3rd quartile` and `maximum`.

In [None]:
train_df.describe()

[back to top](#table-of-contents)
<a id="3.2"></a>
## 3.2 Test dataset
Below are the first 5 rows of the test dataset:

In [None]:
test_df.head()

[back to top](#table-of-contents)
<a id="3.3"></a>
## 3.3 Submission
The submission file is expected to have two columns: `id` and `target`.

Below are the first 5 rows of the submission file:

In [None]:
submission.head()
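
As a rough illustration of the expected format, the sketch below builds a tiny submission frame and runs a few sanity checks on it. `check_submission` is a hypothetical helper, and the ids and probabilities are made-up values, not taken from the competition files.

```python
import numpy as np
import pandas as pd


def check_submission(df):
    """Hypothetical helper: basic sanity checks on a submission frame."""
    has_columns = list(df.columns) == ["id", "target"]
    # Predictions are probabilities, so they must lie in [0, 1]
    in_range = bool(df["target"].between(0.0, 1.0).all())
    no_missing = not df.isna().any().any()
    return has_columns and in_range and no_missing


# Made-up ids and predicted probabilities
toy_submission = pd.DataFrame({
    "id": [1000000, 1000001, 1000002],
    "target": np.array([0.12, 0.87, 0.50]),
})

is_valid = check_submission(toy_submission)
print(toy_submission.head())
print("valid:", is_valid)
```

A check like this is cheap to run before writing the CSV and catches out-of-range predictions or a wrong column order early.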

[back to top](#table-of-contents)
<a id="4"></a>
# 4 Features
There are `285` features available for building a prediction model.

<a id="4.1"></a>
## 4.1 Missing values
Count the missing values in each feature and their share of the total observations, for both the train and test datasets.

<a id="4.1.1"></a>
### 4.1.1 Preparation
Prepare the train and test datasets for data analysis and visualization. *(to see the details, please expand)*

In [None]:
train_df.loc[:, 'f0':'f284'].describe().T.style.bar(subset=['mean'], color='#205ff2')\
                            .background_gradient(subset=['std'], cmap='Greens')\
                            .background_gradient(subset=['25%'], cmap='Spectral')\
                            .background_gradient(subset=['50%'], cmap='seismic')\
                            .background_gradient(subset=['75%'], cmap='viridis')\
                            .background_gradient(subset=['mean'], cmap='cubehelix')\
                            .background_gradient(subset=['min'], cmap='Reds')\
                            .background_gradient(subset=['max'], cmap='Blues')


## 4.1.2 Memory Reduction

Let us use datatable to reduce memory usage.

##### Source Credit: https://towardsdatascience.com/how-to-work-with-million-row-datasets-like-a-pro-76fb5c381cdd
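
No datatable code is shown in this section, so here is a pandas-only sketch of one common way to cut memory use: downcasting numeric columns to smaller dtypes. This is a stand-in technique, not the datatable approach itself. The helper name `downcast_numeric` and the toy frame are assumptions for illustration, and casting to `float32` trades some precision for memory.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy of df with float64 -> float32 and int64 -> smallest int."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == np.float64:
            # Halves the memory of the column at the cost of some precision
            out[col] = out[col].astype(np.float32)
        elif out[col].dtype == np.int64:
            # Picks the smallest signed integer dtype that holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Toy frame: one continuous column and one binary column
toy = pd.DataFrame({
    "f0": np.linspace(0.0, 1.0, 1000),
    "f242": np.tile(np.array([0, 1], dtype=np.int64), 500),
})

slim = downcast_numeric(toy)
bytes_before = int(toy.memory_usage(deep=True).sum())
bytes_after = int(slim.memory_usage(deep=True).sum())
print(f"{bytes_before} bytes -> {bytes_after} bytes")
```

On a frame with a million rows and hundreds of columns, the same idea would be applied to `train_df` and `test_df` right after loading.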

### Missing Data Analysis 

In [None]:
missing_train_df = pd.DataFrame(train_df.isna().sum())
missing_train_df = missing_train_df.drop(['id', 'target']).reset_index()
missing_train_df.columns = ['feature', 'count']

missing_train_percent_df = missing_train_df.copy()
missing_train_percent_df['count'] = missing_train_df['count']/train_df.shape[0]

missing_test_df = pd.DataFrame(test_df.isna().sum())
missing_test_df = missing_test_df.drop(['id']).reset_index()
missing_test_df.columns = ['feature', 'count']

missing_test_percent_df = missing_test_df.copy()
missing_test_percent_df['count'] = missing_test_df['count']/test_df.shape[0]

features = [feature for feature in train_df.columns if feature not in ['id', 'target']]
missing_train_row = train_df[features].isna().sum(axis=1)
missing_train_row = pd.DataFrame(missing_train_row.value_counts()/train_df.shape[0]).reset_index()
missing_train_row.columns = ['no', 'count']

missing_test_row = test_df[features].isna().sum(axis=1)
missing_test_row = pd.DataFrame(missing_test_row.value_counts()/test_df.shape[0]).reset_index()
missing_test_row.columns = ['no', 'count']

In [None]:
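# Split the feature columns by dtype: float64 columns are treated as numerical,
# everything else (e.g. integer-coded columns) as categorical.
# `id` and `target` are skipped because they are not features.
cat_features = []
num_features = []

for col in train_df.columns:
    if col in ('id', 'target'):
        continue
    if train_df[col].dtype == 'float64':
        num_features.append(col)
    else:
        cat_features.append(col)
print('Categorical features: ', cat_features)
display(len(cat_features))
print('Numerical features: ', num_features)
display(len(num_features))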

In [None]:
# Compare train and test distributions for the first 28 numerical features
nrow = 7
ncol = 4
plot_features = num_features[0:nrow * ncol]

fig, axes = plt.subplots(nrow, ncol, figsize=(16, 20))
fig.subplots_adjust(top=0.95)
# One subplot per feature, with the train and test KDEs overlaid
for ax, feature in zip(axes.flatten(), plot_features):
    sns.kdeplot(train_df[feature], fill=True, color='lime', alpha=0.9, label='Train DS', ax=ax)
    sns.kdeplot(test_df[feature], fill=True, color='fuchsia', alpha=0.9, label='Test DS', ax=ax)
    ax.set_xlabel(feature, fontsize=7)
    ax.legend()
plt.suptitle('Distribution of some numerical features ==> Training & Test datasets ', fontsize=18)
plt.show()

<a id="4.1.2"></a>
### 4.1.2 Individual features
Count the missing values in each feature of the `train` and `test` datasets to see if there is any similarity between them.

**Observations:**
- Every feature in the `train` and `test` datasets has a missing-value share of around `0.0%`.
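
The per-feature comparison can also be written compactly, since `isna().mean()` gives the share of missing values per column directly. The sketch below lines up the train and test shares side by side on two toy frames whose values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train_df and test_df (values are made up)
toy_train = pd.DataFrame({
    "f0": [0.1, np.nan, 0.3, 0.4],
    "f1": [1.0, 0.0, 1.0, 0.0],
})
toy_test = pd.DataFrame({
    "f0": [0.2, 0.5],
    "f1": [np.nan, 1.0],
})

# isna().mean() is the fraction of missing values in each column
missing_share = pd.DataFrame({
    "train": toy_train.isna().mean(),
    "test": toy_test.isna().mean(),
})
print(missing_share)
```

A feature whose share differs sharply between the two columns would be worth a closer look before imputing.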


## 4.1.3 - Analysis using Autoviz

In [None]:
#!pip install sweetviz autoviz xlrd

In [None]:
'''
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df = AV.AutoViz(train_file, depVar='target',verbose = 1, 
                lowess = False, chart_format ='png', 
                max_rows_analyzed = 150000)
'''

#More to come! Thanks to the notebook outline from TPS SEP 2021 from Sharito Cope

# 5 - Simple LGBM model for evaluation

In [None]:
#test_df = dt.fread(test_file).to_pandas()
#train_df = dt.fread(train_file).to_pandas()

In [None]:
# Separate the features from the label; `id` is a row identifier, not a feature
training_data = train_df.drop(["id", "target"], axis=1)
# Keep the target values for training
training_label = train_df["target"].copy()

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Preprocessing: impute missing values, then standardize each feature
lgbm_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="most_frequent")),
    ('std_scaler', StandardScaler()),
])
training_prepared = lgbm_pipeline.fit_transform(training_data)

In [None]:
import lightgbm as lgb
lgbm_cls = lgb.LGBMClassifier()
lgbm_cls.fit(training_prepared, training_label)

In [None]:
# Reuse the imputer and scaler fitted on the training data (transform only, no refit)
testing_prepared = lgbm_pipeline.transform(test_df.drop("id", axis=1))
# AUC is computed from scores, so predict the probability of the positive class
test_predictions = lgbm_cls.predict_proba(testing_prepared)[:, 1]

In [None]:
submission = pd.read_csv(sub_file)
submission['target'] = list(map(float, test_predictions))
submission.to_csv('submission.csv', index=False)
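
The model above is fit on every training row, so the notebook has no local estimate of the competition metric. A minimal holdout sketch follows, under loud assumptions: the data is synthetic, and `LogisticRegression` stands in for the LightGBM classifier so the example runs with scikit-learn alone. Putting the classifier inside the pipeline means the imputer and scaler are fitted on the training split only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the label depends on the first feature plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

# Hold out 20% of the rows, keeping the class balance in both splits
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stand-in model: same preprocessing steps, simpler classifier
model = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("std_scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Score the held-out rows with positive-class probabilities
valid_proba = model.predict_proba(X_valid)[:, 1]
valid_auc = roc_auc_score(y_valid, valid_proba)
print(f"Validation AUC: {valid_auc:.4f}")
```

With the real data, the same split would be applied to the prepared training features before fitting `lgb.LGBMClassifier`, giving a score to compare against the leaderboard.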