# Data Curation and Analysis
This notebook takes the data table created in the previous notebook (bbb_permeability_data_table) and filters it to remove inconsistent or invalid entries. After curation, molecular descriptors (molecular weight, logP, and TPSA) are calculated with RDKit and added to the data table. The table is then split into two groups according to whether each molecule is BBB permeable or not.

## Imports
Import the Python libraries and modules needed to curate the dataset and calculate the molecular descriptors.

In [6]:
from data_table import LightBBB, MoleculeNet, DeePred, B3BD
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen, rdMolDescriptors
import pandas as pd 

## Variables 
Initialize the counters and lists used throughout the notebook.

In [7]:
BBB_permeable_number = 0
BBB_nonpermeable_number = 0
invalid_molecule = []
molecular_weights = []
logPs = []
TPSAs = []
BBB_permeable = []
BBB_nonpermeable = []
unique_multiple_BBB_compound = []
invalid_multiple_BBB_compound = []

## Construct a Data Table
Construct a data table containing each molecule's SMILES and BBB classification from the four datasets: LightBBB, MoleculeNet, DeePred, and B3BD.

In [8]:
data_table = pd.concat([LightBBB[['SMILES','BBclass']], MoleculeNet[['SMILES','BBclass']], DeePred[['SMILES','BBclass']], 
                        B3BD[['SMILES','BBclass']]], ignore_index=True)
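
The concatenation above can be tried on toy data first. This is a minimal sketch, assuming each dataset is a pandas DataFrame with `SMILES` and `BBclass` columns. The frames `toy_a` and `toy_b` and the extra `Source` column are hypothetical and not part of the notebook; recording the source is optional, but it can make later disagreements between datasets easier to trace.

```python
import pandas as pd

# Hypothetical stand-ins for two of the source datasets
toy_a = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1"], "BBclass": [1, 1], "Extra": ["x", "y"]})
toy_b = pd.DataFrame({"SMILES": ["CCO", "CC(=O)O"], "BBclass": [1, 0]})

frames = {"toy_a": toy_a, "toy_b": toy_b}

# Keep only the two shared columns, tag each row with its source, then stack the frames
toy_table = pd.concat(
    [df[["SMILES", "BBclass"]].assign(Source=name) for name, df in frames.items()],
    ignore_index=True,
)
print(toy_table)
```

With `ignore_index=True` the combined table gets a fresh 0..n-1 index, so row labels from the individual datasets do not collide.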

## Determine the Number of Duplicate and Unique SMILES
Count the number of SMILES that appear more than once and the number of SMILES that appear only once.

In [5]:
'''Determine the Number of Duplicate SMILES'''
SMILES_count = data_table['SMILES'].value_counts() # Count the amount of times each SMILES occurs
duplicates = SMILES_count[SMILES_count > 1]

'''Determine the Number of Unique SMILES'''  
unique_compounds = SMILES_count[SMILES_count < 2] 

'''Print the Number of Duplicate and Unique SMILES'''
print(f"There are a total of {duplicates.count()} duplicate SMILES in the data table") 
print(f"There are a total of {unique_compounds.count()} unique SMILES in the data table")

There are a total of 1463 duplicate SMILES in the data table
There are a total of 10635 unique SMILES in the data table
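
The counting logic can be checked on a small hypothetical Series. The sketch below uses made-up SMILES and illustrative variable names; it shows that `value_counts` yields one entry per distinct SMILES string, so both totals count distinct strings rather than rows.

```python
import pandas as pd

# Hypothetical SMILES column: ethanol twice, acetic acid once, benzene three times
toy_smiles = pd.Series(["CCO", "CCO", "CC(=O)O", "c1ccccc1", "c1ccccc1", "c1ccccc1"])

counts = toy_smiles.value_counts()       # occurrences of each distinct string
n_duplicated = int((counts > 1).sum())   # distinct SMILES seen more than once
n_single = int((counts == 1).sum())      # distinct SMILES seen exactly once

print(n_duplicated, n_single)
```

Note that the comparison is on raw strings, so two different SMILES spellings of the same molecule (for example `CCO` and `OCC`) are counted as different entries.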


## Create a Table of Duplicate SMILES
Create a sub-table containing every instance of each duplicate SMILES, then count how many distinct BBB classifications were reported for each one.

In [None]:
'''Table of Duplicate SMILES and Reported BBclass Data'''
duplicate_SMILES = duplicates.index.to_numpy() # Makes a numpy array of all the duplicate SMILES
duplicate_SMILES_table = data_table[data_table['SMILES'].isin(duplicate_SMILES)] # Makes a table of every instance of duplicate SMILES 
duplicate_SMILES_table = duplicate_SMILES_table.groupby('SMILES')['BBclass'].nunique() # Makes a table that shows the number of unique reported values
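
To see what the `groupby(...).nunique()` step produces, here is a minimal sketch on hypothetical data. A value of 1 means every report of that SMILES agrees; a value of 2 means the sources disagree. The frame `toy` and the names `dup_mask` and `label_counts` are illustrative only.

```python
import pandas as pd

# Hypothetical table: ethanol reported consistently, acetic acid reported inconsistently
toy = pd.DataFrame({
    "SMILES": ["CCO", "CCO", "CC(=O)O", "CC(=O)O", "c1ccccc1"],
    "BBclass": [1, 1, 1, 0, 1],
})

# Rows whose SMILES occurs more than once (all occurrences are flagged)
dup_mask = toy["SMILES"].duplicated(keep=False)

# Number of distinct BBclass values reported for each duplicated SMILES
label_counts = toy[dup_mask].groupby("SMILES")["BBclass"].nunique()
print(label_counts)
```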

## Curate Duplicate SMILES Table
Based on the consistency of the reported BBB classification data, either keep or discard each duplicate SMILES. If a duplicate SMILES has consistent BBB classification data (for example, every instance of the SMILES states that the molecule is BBB permeable), keep the SMILES in the dataset. If a duplicate SMILES has inconsistent BBB classification data (for example, one instance states that the molecule is BBB permeable while another instance of the same SMILES states that it is BBB nonpermeable), discard the SMILES.

In [None]:
'''Filtering Duplicates based on whether the reported values are consistent or not'''
for i in range(len(duplicate_SMILES_table)):
    if duplicate_SMILES_table.iloc[i] == 1: 
        unique_multiple_BBB_compound.append(duplicate_SMILES_table.index[i])
    else: 
        invalid_multiple_BBB_compound.append(duplicate_SMILES_table.index[i])
unique_SMILES_table = data_table[~data_table['SMILES'].isin(invalid_multiple_BBB_compound)]
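
The same filter can be written without an explicit loop. The sketch below runs on hypothetical data and is an alternative, not the notebook's own code. It also shows an optional extra step, `drop_duplicates`, which is an assumption on my part: as written, the cell above appears to keep every row of a consistent duplicate, so a SMILES reported twice would be counted twice in later steps.

```python
import pandas as pd

# Hypothetical table: ethanol is consistent, acetic acid is conflicting
toy = pd.DataFrame({
    "SMILES": ["CCO", "CCO", "CC(=O)O", "CC(=O)O", "c1ccccc1"],
    "BBclass": [1, 1, 1, 0, 1],
})

# SMILES with more than one distinct reported class are conflicting
label_counts = toy.groupby("SMILES")["BBclass"].nunique()
conflicting = label_counts[label_counts > 1].index

# Drop every row of a conflicting SMILES
curated = toy[~toy["SMILES"].isin(conflicting)]

# Optional (assumption, not in the notebook): keep a single row per remaining SMILES
curated_one_row = curated.drop_duplicates(subset="SMILES").reset_index(drop=True)
print(curated_one_row)
```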

## Determine the Molecular Weight for SMILES in Data Table 
Determine the molecular weight of each molecule using RDKit. If RDKit cannot parse a SMILES string, remove that SMILES from the data table. Add molecular weight as a new column in the data table.

In [None]:
'''Find Molecular Weight and remove invalid SMILES'''
for smiles in unique_SMILES_table['SMILES']:
    molecule = Chem.MolFromSmiles(smiles) # Parse the SMILES into an RDKit Mol object (None if parsing fails)
    if molecule is not None: 
        molecular_weight = Descriptors.MolWt(molecule) # Find the molecular weight for each SMILES
        molecular_weights.append(molecular_weight) # Add molecular weight to list
    else:
        invalid_molecule.append(smiles) # Add invalid SMILES to invalid_molecule  
unique_SMILES_table = unique_SMILES_table[~unique_SMILES_table['SMILES'].isin(invalid_molecule)].copy() # Keep all valid SMILES; copy so new columns can be added safely
unique_SMILES_table['Molecular Weight (amu)'] = molecular_weights # Add molecular weights to Data Table

## Determine the logP Value for SMILES in Data Table 
Determine the logP value of each molecule using RDKit's Crippen module. SMILES that RDKit cannot parse were already removed in the molecular weight step, so every remaining SMILES is expected to parse here. Add logP as a new column in the data table.

In [None]:
'''Find logP value for each remaining (valid) SMILES'''
for smiles in unique_SMILES_table['SMILES']:
    molecule = Chem.MolFromSmiles(smiles) # Parse the SMILES into an RDKit Mol object
    logP = Crippen.MolLogP(molecule) # Find the logP value for each SMILES 
    logPs.append(logP) 
unique_SMILES_table['LogP Value'] = logPs # Add logP value to the Data Table

## Determine the TPSA for SMILES in Data Table 
Determine the TPSA (topological polar surface area) of each molecule using RDKit. SMILES that RDKit cannot parse were already removed in the molecular weight step, so every remaining SMILES is expected to parse here. Add TPSA as a new column in the data table.

In [None]:
'''Find TPSA for each remaining (valid) SMILES'''
for smiles in unique_SMILES_table['SMILES']:
    molecule = Chem.MolFromSmiles(smiles) # Parse the SMILES into an RDKit Mol object
    TPSA = rdMolDescriptors.CalcTPSA(molecule) # Find the TPSA for each SMILES 
    TPSAs.append(TPSA) 
unique_SMILES_table['TPSA Value'] = TPSAs # Add TPSA to the Data Table
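
Each of the three descriptor cells parses every SMILES again. As an alternative, the sketch below parses each SMILES once and computes all three descriptors together. The helper `compute_descriptors` and the toy table are hypothetical; the RDKit calls are the same ones used in the cells above. It assumes RDKit and pandas are installed.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen, rdMolDescriptors

def compute_descriptors(smiles):
    """Return (molecular weight, logP, TPSA), or None if RDKit cannot parse the SMILES."""
    molecule = Chem.MolFromSmiles(smiles)
    if molecule is None:
        return None
    return (
        Descriptors.MolWt(molecule),
        Crippen.MolLogP(molecule),
        rdMolDescriptors.CalcTPSA(molecule),
    )

# Hypothetical table: two valid SMILES and one string RDKit cannot parse
toy = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "not_a_smiles"], "BBclass": [1, 1, 0]})

results = toy["SMILES"].map(compute_descriptors)  # tuple per valid row, None otherwise
valid = results.notna()

# Build a descriptor frame aligned on the index of the valid rows, then join it on
descriptor_table = pd.DataFrame(
    results[valid].tolist(),
    index=results[valid].index,
    columns=["Molecular Weight (amu)", "LogP Value", "TPSA Value"],
)
toy_with_descriptors = toy[valid].join(descriptor_table)
print(toy_with_descriptors)
```

Because the descriptor frame is joined on the row index, the values stay aligned with their SMILES even when invalid rows are dropped in between.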

## Determine the Number of BBB+ and BBB- Molecules
Determine the number of molecules in the dataset that are BBB permeable and BBB nonpermeable.

In [None]:
'''Find # of BBB permeable and nonpermeable Molecules'''
for value in unique_SMILES_table['BBclass']: 
    if value == 1: 
        BBB_permeable_number = BBB_permeable_number + 1
    elif value == 0: 
        BBB_nonpermeable_number = BBB_nonpermeable_number + 1
print(f'The number of molecules in the data table that are BBB permeable is {BBB_permeable_number}') 
print(f'The number of molecules in the data table that are BBB nonpermeable is {BBB_nonpermeable_number}')
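
The same counts can be obtained with `value_counts` instead of a loop. This is a minimal sketch on a hypothetical `BBclass` column, assuming the labels are coded as 1 (permeable) and 0 (nonpermeable) as in the cell above.

```python
import pandas as pd

# Hypothetical labels: three permeable, two nonpermeable
toy = pd.DataFrame({"BBclass": [1, 1, 0, 1, 0]})

class_counts = toy["BBclass"].value_counts()

# .get() falls back to 0 if a class is absent from the table
n_permeable = int(class_counts.get(1, 0))
n_nonpermeable = int(class_counts.get(0, 0))

print(f"BBB permeable: {n_permeable}, BBB nonpermeable: {n_nonpermeable}")
```

Unlike the loop, this does not depend on counters initialized in an earlier cell, so re-running the cell does not inflate the totals.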

## Organize the Data Table into BBB+ and BBB-
Separate the molecules into two tables, where one table consists of all the molecules that are BBB permeable and the other table consists of all the molecules that are BBB nonpermeable.

In [None]:
'''Distribute SMILES into separate tables based on BBB permeability'''
BBB_permeable = unique_SMILES_table[unique_SMILES_table['BBclass'] == 1]['SMILES'].tolist()
BBB_nonpermeable = unique_SMILES_table[unique_SMILES_table['BBclass'] == 0]['SMILES'].tolist()
BBB_permeable_table = unique_SMILES_table[unique_SMILES_table['BBclass'] == 1] # Keep all BBB+ (permeable) molecules
BBB_nonpermeable_table = unique_SMILES_table[unique_SMILES_table['BBclass'] == 0] # Keep all BBB- (nonpermeable) molecules

## Store the TPSA for BBB+ and BBB- molecules as a list
Create two variables, where one variable is a list of the TPSA for BBB permeable molecules while the other variable is a list of the TPSA for BBB nonpermeable molecules.

In [None]:
'''Get TPSA Values for BBB+ and BBB- molecules individually'''
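tpsa_positive = BBB_permeable_table['TPSA Value'].tolist() # TPSA values for BBB+ molecules as a list
tpsa_negative = BBB_nonpermeable_table['TPSA Value'].tolist() # TPSA values for BBB- molecules as a list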

## Store the logP for BBB+ and BBB- molecules as a list
Create two variables, where one variable is a list of the logP for BBB permeable molecules while the other variable is a list of the logP for BBB nonpermeable molecules.

In [None]:
'''Get logP Values for BBB+ and BBB- molecules individually'''
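logP_positive = BBB_permeable_table['LogP Value'].tolist() # logP values for BBB+ molecules as a list
logP_negative = BBB_nonpermeable_table['LogP Value'].tolist() # logP values for BBB- molecules as a list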
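
Once the two groups are separated, a quick per-class summary can help confirm that the split worked. The sketch below uses a hypothetical table with made-up descriptor values (not real measurements) and the column names from this notebook; `groupby` on `BBclass` gives the same grouping as the two tables above.

```python
import pandas as pd

# Hypothetical table with invented descriptor values, for illustration only
toy = pd.DataFrame({
    "BBclass": [1, 1, 0, 0],
    "TPSA Value": [20.0, 40.0, 90.0, 110.0],
    "LogP Value": [2.0, 3.0, 0.5, 1.5],
})

# Median of each descriptor within each BBB class
class_medians = toy.groupby("BBclass")[["TPSA Value", "LogP Value"]].median()
print(class_medians)
```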