# [Part 2] Exploratory Data Analysis

## Install rdkit

In [1]:
! pip install rdkit



## Load bioactivity data

In [2]:
import pandas as pd

In [None]:
df = pd.read_csv("acetylcholinesterase_03_bioactivity_data_curated.csv")
df

In [None]:
df_no_smiles = df.drop(columns="canonical_smiles")

In [None]:
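# Keep only the longest dot-separated fragment of each SMILES string,
# which is typically the parent compound rather than a counterion.
smiles = []

for i in df.canonical_smiles.tolist():
    cpd = str(i).split(".")
    cpd_longest = max(cpd, key=len)
    smiles.append(cpd_longest)

smiles = pd.Series(smiles, name="canonical_smiles")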

In [None]:
df_clean_smiles = pd.concat([df_no_smiles, smiles], axis=1)
df_clean_smiles
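
To make the cleaning step above concrete, here is a minimal sketch of what it does to a single record. The helper name `longest_fragment` and the example SMILES string are illustrative assumptions and do not come from the dataset. Note that string length is only a rough proxy for fragment size; it does not count atoms.

```python
def longest_fragment(smiles):
    """Return the longest dot-separated component of a SMILES string.

    Hypothetical helper mirroring the loop above: str() is applied first,
    so a missing value (NaN) would come back as the string "nan".
    """
    return max(str(smiles).split("."), key=len)


# Illustrative input: an acetate anion paired with a sodium counterion.
example = "CC(=O)[O-].[Na+]"
print(longest_fragment(example))
```

When two fragments have the same length, `max` returns the first one it encounters.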

## Calculate Lipinski descriptors

Christopher Lipinski, a scientist at Pfizer, proposed a set of rules of thumb for evaluating the druglikeness of compounds. Druglikeness here is assessed in terms of Absorption, Distribution, Metabolism and Excretion (ADME), also known as the pharmacokinetic profile. Lipinski analyzed all orally active FDA-approved drugs to formulate what is now known as the Rule of Five, or Lipinski's Rule.

Lipinski's Rule states the following:
  * Molecular weight < 500 Dalton
  * Octanol-water partition coefficient (LogP) < 5
  * Hydrogen bond donors < 5
  * Hydrogen bond acceptors < 10
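
As a rough illustration, the four thresholds listed above can be written as a small check that counts how many criteria a compound fails. This is only a sketch: the class `LipinskiDescriptors`, the function `count_violations`, and the descriptor values below are hypothetical and are not taken from the bioactivity data. The strict inequalities follow the list as written here.

```python
from dataclasses import dataclass


@dataclass
class LipinskiDescriptors:
    """Hypothetical container for the four Rule-of-Five descriptors."""
    mol_weight: float   # molecular weight in Dalton
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors


def count_violations(d):
    """Count how many of the four criteria listed above are not met."""
    checks = [
        d.mol_weight < 500,
        d.log_p < 5,
        d.h_donors < 5,
        d.h_acceptors < 10,
    ]
    return sum(not ok for ok in checks)


# Made-up descriptor values, chosen only to exercise the check.
small = LipinskiDescriptors(mol_weight=320.4, log_p=2.1, h_donors=2, h_acceptors=5)
large = LipinskiDescriptors(mol_weight=712.9, log_p=6.3, h_donors=3, h_acceptors=12)

print(count_violations(small))  # 0
print(count_violations(large))  # 3
```

Counting violations rather than returning a single pass/fail flag keeps the result usable for filtering at whatever strictness is preferred.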

## Import libraries

In [3]:
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

## Calculate descriptors