# **ChEMBL Data Import and Preparation**

## Objectives:

The first notebook focuses on setting up the data pipeline by:

1.  Reading the ChEMBL dataset directly from the URL.

2.  Performing initial data cleaning and transformation.

3.  Conducting exploratory analysis to identify critical tables, columns, and relationships.

4.  Uploading the cleaned data to an Azure MySQL Database.



### Section 1: Import Libraries and Establish Project Root for Directory

##### First, let's set our working directory to the root of the project.  Anchoring every path to an absolute project root is a reliable way to stay out of trouble with directory issues, and future notebooks can refer back to project_root when needed.  While we are doing this, we will also create a path variable for our data folder, since we will be referring back to it often.

In [1]:
import os

# Define project root
project_root = "/home/azureuser/cloudfiles/code/Users/kalpha1865/BioPred"

# Validate the directory
if not os.path.exists(project_root):
    raise FileNotFoundError(f"Project root not found: {project_root}")

# Change working directory to project root if not already
if os.getcwd() != project_root:
    os.chdir(project_root)

print(f"Project root set to: {os.getcwd()}")

Project root set to: /mnt/batch/tasks/shared/LS_root/mounts/clusters/kalpha18651/code/Users/kalpha1865/BioPred
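
The cell above sets the project root but does not yet create the data folder variable mentioned in the text. A minimal sketch of one way to do that is shown here; the helper name `build_data_paths` and the dictionary keys are assumptions for illustration, while the `data/extracted` and `data/processed` folder names come from paths used later in this notebook.

```python
import tempfile
from pathlib import Path


def build_data_paths(project_root):
    """Return the data folders used by the notebooks, creating them if needed.

    Hypothetical helper: the folder names follow the paths used later on
    (data/extracted and data/processed).
    """
    root = Path(project_root).resolve()
    paths = {
        "data": root / "data",
        "extracted": root / "data" / "extracted",
        "processed": root / "data" / "processed",
    }
    for folder in paths.values():
        folder.mkdir(parents=True, exist_ok=True)
    return paths


# Demo against a throwaway directory; in the notebook you would pass project_root.
with tempfile.TemporaryDirectory() as tmp:
    paths = build_data_paths(tmp)
    created = {name: folder.is_dir() for name, folder in paths.items()}
    print(created)
```

Building every later path from these entries (for example `paths["processed"] / "data_query_1.csv"`) keeps file names consistent between the cells that write a file and the cells that read it back.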


##### Now we can import the rest of our libraries, as well as establish a reference point to our Config file for our database credentials.

In [2]:
import sys
import requests
import pandas as pd
import tarfile
from sqlalchemy import create_engine
import sqlite3
import pyodbc
import mysql.connector

# Referencing the config file for Azure MySQL Database credentials.
config_dir = os.path.join(project_root, "Config")
sys.path.append(config_dir)

from config import MYSQL_CONFIG
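
The contents of `Config/config.py` are not shown in this notebook. As a rough sketch only, one way to build a `MYSQL_CONFIG` dictionary without hard-coding credentials is to read them from environment variables. The `BIOPRED_MYSQL_*` variable names, the `load_mysql_config` helper, and the default `ssl_mode` value are all hypothetical; the dictionary keys mirror the ones read from `MYSQL_CONFIG` in Section 5.

```python
import os


def load_mysql_config(env=os.environ):
    """Build a MYSQL_CONFIG-style dict from environment variables.

    The variable names (BIOPRED_MYSQL_*) are hypothetical; the dictionary
    keys mirror the ones the notebook reads from MYSQL_CONFIG.
    """
    required = ["hostname", "username", "password", "database"]
    config = {key: env.get(f"BIOPRED_MYSQL_{key.upper()}") for key in required}
    missing = [key for key, value in config.items() if not value]
    if missing:
        raise KeyError(f"Missing MySQL settings: {', '.join(missing)}")
    config["port"] = int(env.get("BIOPRED_MYSQL_PORT", "3306"))  # MySQL's default port
    config["ssl_mode"] = env.get("BIOPRED_MYSQL_SSL_MODE", "REQUIRED")  # assumed default
    return config


# Demo with an in-memory mapping instead of real credentials.
demo_env = {
    "BIOPRED_MYSQL_HOSTNAME": "example.mysql.database.azure.com",
    "BIOPRED_MYSQL_USERNAME": "demo_user",
    "BIOPRED_MYSQL_PASSWORD": "not-a-real-password",
    "BIOPRED_MYSQL_DATABASE": "chembl",
}
MYSQL_CONFIG = load_mysql_config(demo_env)
print({k: v for k, v in MYSQL_CONFIG.items() if k != "password"})
```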


### Section 2: Read and Extract Data from URL

##### Now we will download the ChEMBL data from the EBI FTP site and extract it into our data/extracted subfolder for now.

In [6]:
data_url = "https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_35_sqlite.tar.gz"
extract_path = "./data/extracted"

os.makedirs(extract_path, exist_ok=True)

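# Download the file from the URL, streaming it so the archive is not loaded into memory all at once
response = requests.get(data_url, stream=True, timeout=60)
response.raise_for_status()  # fail early on an HTTP error status

# "r|gz" reads the tar stream sequentially, so it can be extracted as it downloads
with tarfile.open(fileobj = response.raw, mode = "r|gz") as tar:
    tar.extractall(path = extract_path)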

print(f"Files extracted to: {extract_path}")

ConnectionError: HTTPSConnectionPool(host='ftp.ebi.ac.uk', port=443): Max retries exceeded with url: /pub/databases/chembl/ChEMBLdb/latest/chembl_35_sqlite.tar.gz (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4741671ff0>: Failed to establish a new connection: [Errno 111] Connection refused'))
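
The recorded run above ended in a `ConnectionError`, and because the archive was streamed straight into `tarfile`, nothing was left on disk to retry from. A possible alternative, sketched here under assumptions, is to save the archive to disk first and extract it in a second step, so that a failed download can be retried without touching the extraction. The helper names are hypothetical, and the sketch uses the standard library's `urllib.request` in place of `requests` only to stay dependency-free. The demo builds a tiny local archive rather than contacting the server.

```python
import os
import shutil
import tarfile
import tempfile
import urllib.request


def download_to_disk(url, dest_path, timeout=60):
    """Stream a URL to a local file, skipping the download if it already exists."""
    if os.path.exists(dest_path):
        return dest_path
    tmp_path = dest_path + ".part"
    with urllib.request.urlopen(url, timeout=timeout) as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp_path, dest_path)  # only fully downloaded files get the final name
    return dest_path


def extract_archive(archive_path, extract_path):
    """Extract a .tar.gz archive into extract_path and return the member names."""
    os.makedirs(extract_path, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=extract_path)
    return names


# Offline demo: build a tiny archive locally and extract it.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.txt")
    with open(src, "w") as fh:
        fh.write("ChEMBL placeholder")
    archive = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="demo/hello.txt")
    extracted = extract_archive(archive, os.path.join(tmp, "extracted"))
    found = os.path.exists(os.path.join(tmp, "extracted", "demo", "hello.txt"))
    print(extracted, found)
```

In the notebook, `download_to_disk(data_url, "./data/chembl_35_sqlite.tar.gz")` followed by `extract_archive(..., extract_path)` would play the role of the streaming cell; the local archive file name here is an assumption.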

##### Let's take a preliminary look at the tables and see what is contained in the ChEMBL database.  We will use the provided schema to reference what information is available in the dataset and some of the existing relationships.

<img src="https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_35_schema.png" alt = "ChEMBL Schema" width = 2000>

### Section 3: Data Exploration and Filtering

SQLite is used to inspect the database schema and query relationships for efficient preparation of data for downstream processes in PySpark.  To avoid unnecessary overhead and ensure efficient data handling, the molfile column from the compound_structures table will be excluded from modeling workflows.  Instead, canonical_smiles will serve as the primary representation for molecular structures, as it is compact and fully compatible with RDKit and GNN workflows.

Here are the tables we will be primarily interested in storing at this phase of the project:

**compound_structures**:
Contains molecule identifiers (SMILES, InChI) essential for molecular modeling.
**WHY**: SMILES strings are the standard input format for cheminformatics tools and models.  They are compact, efficient, and encode the molecular structure needed for advanced analyses.

**activities**:
Provides bioactivity metrics (e.g., IC50, Ki), which are critical for model labels.
**WHY**: Bioactivity metrics form the labels for supervised learning models, helping predict the effectiveness or potency of molecules.

**target_dictionary**:
Contains target-level details, such as target type and associated proteins.
**WHY**: Understanding the biological context of targets allows for more interpretable and biologically relevant predictions.

**molecule_hierarchy**:
Provides parent-child relationships between molecules (e.g., salts, hydrates, or parents).
**WHY**: These relationships are useful for grouping related molecules and ensuring consistent labeling in models.

**compound_properties**:
Includes physicochemical attributes of molecules (e.g., molecular weight, logP, PSA).
**WHY**: These descriptors enhance molecular feature sets and are commonly used in cheminformatics for predicting bioactivity or drug-likeness.


##### Let's load the SQLite database and perform some initial exploration.

In [3]:
# Make the initial connection to the SQLite db with our path to the ChEMBL data.
db_path = "/home/azureuser/cloudfiles/code/Users/kalpha1865/BioPred/data/extracted/chembl_35.db"
conn = sqlite3.connect(db_path)

In [4]:
# Function to inspect the schema and preview data for an individual table
def inspect_table(table_name, limit = 5):
    schema_query = f"PRAGMA table_info({table_name});"
    schema = conn.execute(schema_query).fetchall()
    print(f"Schema for {table_name}:", schema)
    
    data_query = f"SELECT * FROM {table_name} LIMIT {limit};"
    data = pd.read_sql_query(data_query, conn)
    print(f"Data from {table_name}:", data.head())

# Call the function for our specified tables above.
inspect_table("compound_structures")
inspect_table("activities")
inspect_table("target_dictionary")
inspect_table("molecule_hierarchy")
inspect_table("compound_properties")

Schema for compound_structures: [(0, 'molregno', 'BIGINT', 1, None, 1), (1, 'molfile', 'TEXT', 0, None, 0), (2, 'standard_inchi', 'VARCHAR(4000)', 0, None, 0), (3, 'standard_inchi_key', 'VARCHAR(27)', 1, None, 0), (4, 'canonical_smiles', 'VARCHAR(4000)', 0, None, 0)]
Data from compound_structures:    molregno                                            molfile  \
0         1  \n     RDKit          2D\n\n 24 26  0  0  0  0...   
1         2  \n     RDKit          2D\n\n 25 27  0  0  0  0...   
2         3  \n     RDKit          2D\n\n 25 27  0  0  0  0...   
3         4  \n     RDKit          2D\n\n 23 25  0  0  0  0...   
4         5  \n     RDKit          2D\n\n 24 26  0  0  0  0...   

                                      standard_inchi  \
0  InChI=1S/C17H12ClN3O3/c1-10-8-11(21-17(24)20-1...   
1  InChI=1S/C18H12N4O3/c1-11-8-14(22-18(25)21-16(...   
2  InChI=1S/C18H16ClN3O3/c1-10-7-14(22-18(25)21-1...   
3  InChI=1S/C17H13N3O3/c1-11-2-4-12(5-3-11)16(22)...   
4  InChI=1S/C17H12ClN3O3
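
Before inspecting individual tables, it can help to list everything the database contains. The sketch below queries SQLite's `sqlite_master` catalog for table names and reduces `PRAGMA table_info` to name and type pairs, which is easier to scan than the raw tuples printed above. The helper names are hypothetical, and the demo runs on an in-memory toy table rather than on `chembl_35.db`.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]


def describe_table(conn, table_name):
    """Return (column_name, declared_type) pairs for one table."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    info = conn.execute(f"PRAGMA table_info({table_name});").fetchall()
    return [(row[1], row[2]) for row in info]


# Demo on an in-memory database with a toy table; the real notebook would
# pass the connection to chembl_35.db instead.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE compound_structures ("
    "molregno BIGINT PRIMARY KEY, canonical_smiles VARCHAR(4000));"
)
tables = list_tables(demo_conn)
columns = describe_table(demo_conn, "compound_structures")
demo_conn.close()

print(tables)
print(columns)
```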

##### Looking at the output from the selected tables, let's pick our key features for this phase of the project.

**compound_structures**:  molregno, canonical_smiles, standard_inchi_key

**activities**:  molregno, standard_value, standard_type (the target identifier, tid, is reached through the assays table)

**target_dictionary**:  tid, pref_name, target_type

**molecule_hierarchy**:  molregno, parent_molregno

**compound_properties**:  molregno, full_mwt, alogp, psa, hba, hbd, rtb
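
A feature list like the one above is easy to get slightly wrong, for instance by naming a column that lives in a different table. A small check against `PRAGMA table_info` can catch that before any long query runs. This is only a sketch: `KEY_FEATURES` holds a subset of the selection for illustration, `missing_columns` is a hypothetical helper, and the demo tables are toys in which `parent_molregno` is deliberately left out.

```python
import sqlite3

# Illustrative subset of the selection above; extend it with the other tables.
KEY_FEATURES = {
    "compound_structures": ["molregno", "canonical_smiles", "standard_inchi_key"],
    "molecule_hierarchy": ["molregno", "parent_molregno"],
}


def missing_columns(conn, selection):
    """Return {table: [columns not present]} for every table with a problem."""
    problems = {}
    for table, wanted in selection.items():
        # Column 1 of each PRAGMA table_info row is the column name.
        present = {row[1] for row in conn.execute(f"PRAGMA table_info({table});")}
        absent = [col for col in wanted if col not in present]
        if absent:
            problems[table] = absent
    return problems


# Toy schema: molecule_hierarchy is deliberately missing parent_molregno.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE compound_structures "
    "(molregno INTEGER, canonical_smiles TEXT, standard_inchi_key TEXT);"
)
demo_conn.execute("CREATE TABLE molecule_hierarchy (molregno INTEGER);")
problems = missing_columns(demo_conn, KEY_FEATURES)
demo_conn.close()
print(problems)
```

An empty dictionary means every selected column was found.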

### Section 4:  Joining and Data Cleaning

This section joins data from the prioritized tables into a unified dataset, then cleans it so that it is ready for the next step(s) and for ingestion into Azure MySQL.  We will break the joins into small steps and review the result after each one to make sure we are progressing as expected.  Queries are run in chunks for batch processing, and each intermediate result is saved to CSV, so that if an error occurs during one of the longer queries we can pick back up where we left off.
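
The chunk-and-checkpoint idea described above can be sketched with the standard library alone. The version below is an illustration under assumptions, not the approach used in the cells that follow: it pages by key (`WHERE molregno > ? ORDER BY molregno`) instead of `LIMIT ... OFFSET`, appends each chunk to the CSV as it arrives, and on restart resumes after the last key already saved. SQLite does not guarantee row order without an `ORDER BY`, so ordering by a key also makes the paging deterministic. The function name and the toy table contents are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile


def export_in_chunks(conn, csv_path, chunk_size=2):
    """Append rows to csv_path in molregno order, resuming after the last saved key."""
    last_key = -1
    file_exists = os.path.exists(csv_path)
    if file_exists:
        with open(csv_path, newline="") as fh:
            for row in csv.DictReader(fh):
                last_key = int(row["molregno"])  # ends as the last key written so far
    with open(csv_path, "a", newline="") as fh:
        writer = csv.writer(fh)
        if not file_exists:
            writer.writerow(["molregno", "canonical_smiles"])
        while True:
            # Keyset pagination: seek past the last key instead of using OFFSET.
            rows = conn.execute(
                "SELECT molregno, canonical_smiles FROM compound_structures "
                "WHERE molregno > ? ORDER BY molregno LIMIT ?;",
                (last_key, chunk_size),
            ).fetchall()
            if not rows:
                break
            writer.writerows(rows)
            last_key = rows[-1][0]
    return last_key


demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE compound_structures (molregno INTEGER PRIMARY KEY, canonical_smiles TEXT);"
)
demo_conn.executemany(
    "INSERT INTO compound_structures VALUES (?, ?);",
    [(1, "CCO"), (2, "CCN"), (3, "c1ccccc1"), (4, "CC(=O)O"), (5, "CO")],
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.csv")
    export_in_chunks(demo_conn, path)
    export_in_chunks(demo_conn, path)  # second call finds nothing new to add
    with open(path, newline="") as fh:
        saved_rows = list(csv.DictReader(fh))
demo_conn.close()
print(len(saved_rows))
```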

##### Step 1:  Join compound_structures and compound_properties

These tables are joined using molregno to combine molecular identifiers with physicochemical properties.

In [9]:
# Use chunking to avoid kernel crashes and to make batch processing easier.
# The chunk size is kept small because of the large dataset and multiple joins; execution takes more time as a result.
chunk_size = 50000
data_chunks = []
offset = 0

join_query_1 = """
SELECT DISTINCT cs.canonical_smiles, cs.standard_inchi_key,
    MIN(cp.full_mwt) AS full_mwt, MIN(cp.alogp) AS alogp,
    MIN(cp.psa) AS psa, MIN(cp.hba) AS hba, MIN(cp.hbd) AS hbd,
    MIN(cp.aromatic_rings) AS aromatic_rings, MIN(cp.heavy_atoms) AS heavy_atoms,
    MIN(cp.rtb) AS rtb, cp.molecular_species, cs.molregno
FROM compound_structures cs
LEFT JOIN compound_properties cp ON cs.molregno = cp.molregno
WHERE cp.full_mwt BETWEEN 200 AND 500
    AND cp.rtb <= 10
    AND cp.alogp BETWEEN -1 AND 5
    AND cp.psa <= 140
    AND cp.hbd <= 5
    AND cp.hba <= 10
GROUP BY cs.canonical_smiles, cs.standard_inchi_key
LIMIT ? OFFSET ?;
"""

while True:
    chunk = pd.read_sql_query(join_query_1, conn, params=(chunk_size, offset))
    if chunk.empty:
        break
    data_chunks.append(chunk)
    offset += chunk_size

data_query_1 = pd.concat(data_chunks, ignore_index=True)
print(data_query_1.head())
print(data_query_1.shape)

data_query_1.to_csv("./data/processed/data_query_1.csv", index=False)

                     canonical_smiles           standard_inchi_key  full_mwt  \
0       Br.Br.Br.N=C(CN)Nc1cccc(CN)c1  OSNUXCHNXGTCLF-UHFFFAOYSA-N    420.97   
1                 Br.Br.Br.NCCCNOCCCN  GVINPSHTUJENLH-UHFFFAOYSA-N    389.96   
2     Br.Br.CC(=N)NCc1ccc(NC(C)=N)cc1  XQZFTLYUQGEYRI-UHFFFAOYSA-N    366.10   
3  Br.Br.CC(=N)Nc1ccc(-c2csc(N)n2)cc1  DRTLBWIJKHAZHV-UHFFFAOYSA-N    394.14   
4       Br.Br.CC(=N)Nc1ccc(CN(C)C)cc1  BUWHTWAGPYSQFK-UHFFFAOYSA-N    353.10   

   alogp    psa  hba  hbd  aromatic_rings  heavy_atoms  rtb molecular_species  \
0   0.49  87.92    3    4               1           13    3              BASE   
1  -0.79  73.30    4    3               0           10    7              BASE   
2   2.18  71.76    2    4               1           15    3              BASE   
3   2.80  74.79    4    3               2           16    2              BASE   
4   2.16  39.12    2    2               1           14    3              BASE   

   molregno  
0    546221  
1   
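
The `WHERE` clause in `join_query_1` keeps only compounds whose properties fall inside a set of thresholds that resemble common drug-likeness rules of thumb. To make those thresholds easy to read and test outside SQL, the same conditions can be written as a small Python predicate. The function name is hypothetical; the numeric limits are copied from the query. Note that SQL drops rows where a property is NULL, whereas this sketch expects all six values to be present.

```python
def passes_property_filter(props):
    """Mirror the WHERE clause of join_query_1 for one compound's properties.

    SQL BETWEEN is inclusive on both ends, so chained comparisons with <= match it.
    """
    return (
        200 <= props["full_mwt"] <= 500
        and props["rtb"] <= 10
        and -1 <= props["alogp"] <= 5
        and props["psa"] <= 140
        and props["hbd"] <= 5
        and props["hba"] <= 10
    )


# First row of the output above.
row_0 = {"full_mwt": 420.97, "alogp": 0.49, "psa": 87.92, "hba": 3, "hbd": 4, "rtb": 3}
# Made-up compound that is too heavy and too lipophilic (illustrative values only).
too_large = {"full_mwt": 650.0, "alogp": 6.2, "psa": 95.0, "hba": 8, "hbd": 2, "rtb": 9}

print(passes_property_filter(row_0))
print(passes_property_filter(too_large))
```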

In [22]:
# Read in the CSV file from query_1
data_query_1 = pd.read_csv("./data/processed/queried_data_step_1.csv")

# Error checking, drop any duplicates based on our target in canonical_smiles
data_query_1 = data_query_1.drop_duplicates(subset = 'canonical_smiles')


chunk_size = 100000
data_chunks = []
offset = 0

# Adjusted query to include canonical_smiles via a join
join_query_2 = """
SELECT cs.canonical_smiles,
    MIN(a.standard_value) AS min_standard_value
FROM activities a
LEFT JOIN compound_structures cs ON a.molregno = cs.molregno
GROUP BY cs.canonical_smiles
LIMIT ? OFFSET ?;
"""

while True:
    # Fetch data in chunks
    chunk = pd.read_sql_query(join_query_2, conn, params=(chunk_size, offset))
    if chunk.empty:
        break

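    # Merge the chunk with the CSV dataframe.
    # NOTE: this left-merges the full data_query_1 against every chunk, so each
    # pass appends another complete copy of data_query_1 to data_chunks. That is
    # why the shape printed below is 38,445,672 = 24 x 1,601,903 rows.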
    merged_chunk = pd.merge(
        data_query_1,
        chunk,
        on='canonical_smiles',
        how='left'
    )
    data_chunks.append(merged_chunk)
    offset += chunk_size

# Combine all chunks
data_query_2 = pd.concat(data_chunks, ignore_index=True)
print(data_query_2.shape)
print(data_query_2.head())

# Save the final dataset
data_query_2.to_csv("./data/processed/data_query_2.csv", index=False)

(38445672, 14)
   molregno                                   canonical_smiles  \
0     27502                 CC(C)(C)OC(=O)NCC/N=C(\NN)NCC(=O)O   
1     32210                 CC(=O)OCCn1cnc2c1c(=O)n(C)c(=O)n2C   
2     47802  C[C@H]1[C@H](NC(=O)Cc2csc(N)n2)C(=O)N1OCC(=O)[...   
3     73646                    O=c1[nH]c(=O)n(C2COC(CO)O2)cc1I   
4     73647                        Cc1cn(C2COC(CO)O2)c(=O)nc1N   

            standard_inchi_key  full_mwt  alogp     psa  hba  hbd  \
0  OYJKAGYHSGFSGQ-UHFFFAOYSA-N    275.31   -1.0  138.07    5    5   
1  CQKZJBMFMGGBHJ-UHFFFAOYSA-N    266.26   -1.0   88.12    8    0   
2  QRVMRNMOBASNLF-WFZUHFMFSA-M    352.41   -1.0  134.85    7    3   
3  GFRQBEMRUYIKAQ-UHFFFAOYSA-N    340.07   -1.0   93.55    6    2   
4  MJGOECXTQVPSJN-UHFFFAOYSA-N    227.22   -1.0   99.60    7    2   

   aromatic_rings  np_likeness_score  heavy_atoms  rtb molecular_species  \
0               0              -0.44           19    5        ZWITTERION   
1               2    
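
The row count above (38,445,672) is exactly 24 times the 1,601,903 unique compounds, which is what happens when the full compound frame is left-merged against each chunk and the results are concatenated. Dropping duplicates afterwards keeps the first copy of each compound, so any activity value that arrived in a later chunk can be lost. A possible alternative, sketched here on toy data, is to concatenate the chunks first and merge once. The variable names and values are made up; `validate="one_to_one"` is a pandas `merge` option that raises if either side has repeated keys.

```python
import pandas as pd

# Toy stand-ins for data_query_1 and for two chunks returned by join_query_2.
compounds = pd.DataFrame(
    {"canonical_smiles": ["CCO", "CCN", "CO"], "full_mwt": [46.07, 45.08, 32.04]}
)
chunks = [
    pd.DataFrame({"canonical_smiles": ["CCO"], "min_standard_value": [10.0]}),
    pd.DataFrame({"canonical_smiles": ["CCN"], "min_standard_value": [250.0]}),
]

# Pattern from the cell above: merge the full frame against every chunk.
per_chunk = pd.concat(
    [compounds.merge(chunk, on="canonical_smiles", how="left") for chunk in chunks],
    ignore_index=True,
)
per_chunk_deduped = per_chunk.drop_duplicates(subset="canonical_smiles")

# Alternative: combine the chunks first, then merge a single time.
activity = pd.concat(chunks, ignore_index=True)
merged_once = compounds.merge(
    activity, on="canonical_smiles", how="left", validate="one_to_one"
)

print(per_chunk.shape, merged_once.shape)
print(per_chunk_deduped.set_index("canonical_smiles")["min_standard_value"])
print(merged_once.set_index("canonical_smiles")["min_standard_value"])
```

In the toy data the value for `CCN` only appears in the second chunk, so it survives in `merged_once` but is missing after the merge-per-chunk-then-deduplicate route.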

In [18]:
# Checking to make sure before we run this query that our merge column doesn't have duplicates.
print(f"There are {data_query_2['canonical_smiles'].duplicated().sum()} duplicated rows in the canonical_smiles feature.")

# Adding check to see how many unique values we have for our target modifier in canonical_smiles 
print(f"Unique canonical_smiles in query_1: {data_query_1['canonical_smiles'].nunique()}")
print(f"Unique canonical_smiles in query_2: {data_query_2['canonical_smiles'].nunique()}")

# Unique values hold, drop duplicates before next query
data_query_2 = data_query_2.drop_duplicates(subset = 'canonical_smiles')

# Verify shape
print(data_query_2.shape)

There are 0 duplicated rows in the canonical_smiles feature.
Unique canonical_smiles in query_1: 1601903
Unique canonical_smiles in query_2: 1601903
(1601903, 15)


##### Step 2: Add Bioactivity Data from activities table.

This step brings in the bioactivity metrics standard_value and standard_type, building on the result of Step 1 (s1) rather than starting over.  Note that the data are also reduced based on standard_value.  This helps with processing and helps us determine how effective a molecule is at interacting with a biological target: lower values indicate higher potency (i.e., the molecule is effective at lower concentrations).
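
Because lower values mean higher potency, concentrations are often converted to a negative log scale so that larger numbers mean more potent compounds. The sketch below computes pIC50 = -log10(IC50 in molar). It assumes the value is an IC50 expressed in nanomolar, which this notebook does not check; the corresponding `standard_type` and units should be confirmed before applying it. The function name is hypothetical.

```python
import math


def to_pic50(standard_value_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar).

    Assumes the input is in nM: 1 nM = 1e-9 M, so pIC50 = 9 - log10(value_nM).
    """
    if standard_value_nm is None or standard_value_nm <= 0:
        return None  # undefined for missing or non-positive values
    return 9.0 - math.log10(standard_value_nm)


for value_nm in (10.0, 1000.0, 0.0):
    print(value_nm, to_pic50(value_nm))
```

A compound with an IC50 of 10 nM ends up with a higher pIC50 than one at 1000 nM, matching the "lower is more potent" reading of standard_value.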

In [19]:
# Step 3.1: Add hierarchy data from molecule_hierarchy.
data_query_2 = pd.read_csv("./data/processed/data_query_2.csv")

# The logic is a little different this time: since this is a quick join, we load the table in one query and merge afterwards.
molecule_hierarchy = pd.read_sql_query("SELECT molregno, parent_molregno FROM molecule_hierarchy;", conn)

# Now set up the merge on our PK 'molregno'
data_query_3 = pd.merge(
    data_query_2,
    molecule_hierarchy[['molregno', 'parent_molregno']],
    on = 'molregno',
    how = 'left'
)

print(data_query_3.shape)
print(data_query_3.head())

data_query_3.to_csv("./data/processed/data_query_3.csv", index=False)


(38447520, 16)
   molregno                                   canonical_smiles  \
0     27502                 CC(C)(C)OC(=O)NCC/N=C(\NN)NCC(=O)O   
1     32210                 CC(=O)OCCn1cnc2c1c(=O)n(C)c(=O)n2C   
2     47802  C[C@H]1[C@H](NC(=O)Cc2csc(N)n2)C(=O)N1OCC(=O)[...   
3     73646                    O=c1[nH]c(=O)n(C2COC(CO)O2)cc1I   
4     73647                        Cc1cn(C2COC(CO)O2)c(=O)nc1N   

            standard_inchi_key  full_mwt  alogp     psa  hba  hbd  \
0  OYJKAGYHSGFSGQ-UHFFFAOYSA-N    275.31   -1.0  138.07    5    5   
1  CQKZJBMFMGGBHJ-UHFFFAOYSA-N    266.26   -1.0   88.12    8    0   
2  QRVMRNMOBASNLF-WFZUHFMFSA-M    352.41   -1.0  134.85    7    3   
3  GFRQBEMRUYIKAQ-UHFFFAOYSA-N    340.07   -1.0   93.55    6    2   
4  MJGOECXTQVPSJN-UHFFFAOYSA-N    227.22   -1.0   99.60    7    2   

   aromatic_rings  np_likeness_score  heavy_atoms  rtb molecular_species  \
0               0              -0.44           19    5        ZWITTERION   
1               2    
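
A left merge returns more rows than its left input whenever a key matches more than one row on the right, so comparing shapes before and after a merge is a useful sanity check. pandas can enforce this directly: passing `validate="many_to_one"` to `merge` raises a `MergeError` if the right-hand keys are not unique. The sketch below uses toy frames in which molregno 2 is repeated on purpose; the helper name and data are assumptions for illustration.

```python
import pandas as pd

left = pd.DataFrame({"molregno": [1, 2, 3], "canonical_smiles": ["CCO", "CCN", "CO"]})
# Toy hierarchy table; molregno 2 is repeated on purpose.
hierarchy = pd.DataFrame({"molregno": [1, 2, 2], "parent_molregno": [1, 10, 11]})


def checked_left_merge(left_df, right_df, key):
    """Left-merge and fail loudly if the right side has repeated keys."""
    return left_df.merge(right_df, on=key, how="left", validate="many_to_one")


try:
    checked_left_merge(left, hierarchy, "molregno")
    merge_ok = True
except pd.errors.MergeError as exc:
    merge_ok = False
    print(f"Merge rejected: {exc}")

# After removing repeated keys on the right, the row count is preserved.
deduped = hierarchy.drop_duplicates(subset="molregno")
result = checked_left_merge(left, deduped, "molregno")
print(result.shape)
```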

In [None]:
# Checking to make sure before we run this query that our merge column doesn't have duplicates.
print(f"There are {data_query_3['canonical_smiles'].duplicated().sum()} duplicated rows in the canonical_smiles feature.")

# Check how many unique values we have for our merge key, canonical_smiles
print(f"Unique canonical_smiles in query_2: {data_query_2['canonical_smiles'].nunique()}")
print(f"Unique canonical_smiles in query_3: {data_query_3['canonical_smiles'].nunique()}")

# Unique values hold, drop duplicates before next query
data_query_3 = data_query_3.drop_duplicates(subset = 'canonical_smiles')

# Verify shape
print(data_query_3.shape)

In [None]:
# Now add TID from assays table as a bridge for us to get to the target_dictionary table
data_query_3 = pd.read_csv("./data/processed/data_query_3.csv")

chunk_size = 100000
offset = 0
data_chunks = []

# Query to pull doc_id and tid from the assays table
join_query_3 = """
SELECT doc_id, tid
FROM assays
LIMIT ? OFFSET ?;
"""

while True:
    chunk = pd.read_sql_query(join_query_3, conn, params=(chunk_size, offset))

    if chunk.empty:
        break
    
    data_chunks.append(chunk)
    offset += chunk_size

assays_data = pd.concat(data_chunks, ignore_index = True)

data_query_4 = pd.merge(
    data_query_3,
    assays_data,
    on = 'doc_id',
    how = 'left'
)

print(data_query_4.head())
print(data_query_4.shape)

data_query_4.to_csv("./data/processed/data_query_4.csv", index=False)
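
The bridge from activities to targets can also be expressed as a single SQL join, which avoids pulling whole tables into pandas. The sketch below assumes that `activities` carries an `assay_id` column that links to `assays.assay_id`, and that `assays.tid` links to `target_dictionary.tid`; check these against the schema diagram before relying on them, since the cell above merges on `doc_id` instead. All table contents here are toy data in an in-memory database.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript(
    """
    CREATE TABLE activities (
        activity_id INTEGER PRIMARY KEY, assay_id INTEGER,
        molregno INTEGER, standard_value REAL);
    CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, tid INTEGER);
    CREATE TABLE target_dictionary (
        tid INTEGER PRIMARY KEY, pref_name TEXT, target_type TEXT);
    INSERT INTO activities VALUES (1, 100, 546221, 12.5), (2, 101, 546221, 300.0);
    INSERT INTO assays VALUES (100, 7), (101, 8);
    INSERT INTO target_dictionary VALUES
        (7, 'Toy kinase', 'SINGLE PROTEIN'),
        (8, 'Toy receptor', 'SINGLE PROTEIN');
    """
)

# activities -> assays (via assay_id, assumed) -> target_dictionary (via tid)
bridge_query = """
SELECT a.molregno, a.standard_value, s.tid, t.pref_name, t.target_type
FROM activities a
JOIN assays s ON a.assay_id = s.assay_id
JOIN target_dictionary t ON s.tid = t.tid
ORDER BY a.activity_id;
"""
rows = conn_demo.execute(bridge_query).fetchall()
conn_demo.close()

for row in rows:
    print(row)
```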

In [None]:
# Checking to make sure before we run this query that our merge column doesn't have duplicates.
print(f"There are {data_query_4['canonical_smiles'].duplicated().sum()} duplicated rows in the canonical_smiles feature.")

# Check how many unique values we have for our merge key, canonical_smiles
print(f"Unique canonical_smiles in query_3: {data_query_3['canonical_smiles'].nunique()}")
print(f"Unique canonical_smiles in query_4: {data_query_4['canonical_smiles'].nunique()}")

# Unique values hold, drop duplicates before next query
data_query_4 = data_query_4.drop_duplicates(subset = 'canonical_smiles')

# Verify shape
print(data_query_4.shape)

In [None]:
# Read in our csv file once more
data_query_4 = pd.read_csv("./data/processed/data_query_4.csv")

# No need for the doc_id column anymore, just drop now.
data_query_4 = data_query_4.drop(columns = ['doc_id'])

chunk_size = 100000
offset = 0
data_chunks = []

join_query_4 = """
SELECT pref_name, target_name, target_type, tid
FROM target_dictionary
LIMIT ? OFFSET ?;
"""

while True:
    chunk = pd.read_sql_query(join_query_4, conn, params=(chunk_size, offset))
    if chunk.empty:
        break
    
    data_chunks.append(chunk)
    offset += chunk_size
    
target_dictionary_data = pd.concat(data_chunks, ignore_index=True)

data_query_5 = pd.merge(
    data_query_4,
    target_dictionary_data,
    on = 'tid',
    how = 'left'
)

print(data_query_5.head())
print(data_query_5.shape)

# Last minute error checking

data_query_5.to_csv("./data/processed/eda_db.csv", index=False)

In [None]:
# Important!  Close the connection
conn.close()

### Section 5: Connection to Azure MySQL Database to Upload Data and Data Review

We now have the data we need to send to our Azure MySQL database.  We will connect to the database and upload the dataset so that we can use it at will during future phases of the project.  Before doing so, however, we will quickly review our features and prune any that are redundant.  Once that is done, we will upload to the server and move on to the EDA portion of the project.

In [None]:
# Let's read in our eda_db file and review the features.
eda_db = pd.read_csv("./data/processed/eda_db.csv")

print(eda_db.columns)

In [None]:
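# Connect to the Azure MySQL Database:
try:
    conn = mysql.connector.connect(
        host = MYSQL_CONFIG["hostname"],  # connect() takes 'host' rather than 'hostname'
        port = MYSQL_CONFIG["port"],
        user = MYSQL_CONFIG["username"],
        password = MYSQL_CONFIG["password"],
        database = MYSQL_CONFIG["database"],
        # NOTE: 'ssl_mode' may not be accepted by every mysql.connector version;
        # 'ssl_disabled' and 'ssl_ca' are the documented SSL options.
        ssl_mode = MYSQL_CONFIG["ssl_mode"]
    )
    print("Connected to the Azure MySQL Database successfully!")
except mysql.connector.Error as e:
    print(f"Error connecting to the database: {e}")
    raise  # stop here rather than continue without a connection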

In [None]:
try:
    cursor = conn.cursor()
    
    # Need to create a new table in the database,
    # this will provide the foundational structure for our pending data upload.
    table_name = "chembl_data"
    create_table_query = f"""
    CREATE TABLE IF NOT EXISTS {table_name} (
        {', '.join([f'{col} VARCHAR(255)' for col in eda_db.columns])}
    );
    """
    cursor.execute(create_table_query)
    print(f"Table '{table_name}' created or verified successfully!")
    
    # Now we will insert and upload our data to our table we just created.
    insert_query = f"""
    INSERT INTO {table_name} ({', '.join(eda_db.columns)})
    VALUES ({', '.join(['%s' for _ in eda_db.columns])});
    """
    
    cursor.executemany(insert_query, eda_db.values.tolist())
    conn.commit()
    
    print(f"Data uploaded successfully to table '{table_name}'.")
except mysql.connector.Error as e:
    print(f"Error uploading data: {e}")
finally:
    cursor.close()
    conn.close()
    print("Database connection closed.")
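
Two practical points come up when uploading a large DataFrame this way: sending every row in one `executemany` call means nothing is saved if the call fails, and pandas represents missing values as `NaN`, which is better sent as SQL `NULL`. The sketch below shows one way to handle both, with hypothetical helper names. It uses an in-memory `sqlite3` database as a stand-in for the MySQL connection so that it runs anywhere; `sqlite3` uses `?` placeholders where `mysql.connector` uses `%s`.

```python
import math
import sqlite3


def clean_row(row):
    """Replace float NaN with None so the driver stores SQL NULL."""
    return tuple(None if isinstance(v, float) and math.isnan(v) else v for v in row)


def iter_batches(rows, batch_size):
    """Yield successive lists of at most batch_size cleaned rows."""
    batch = []
    for row in rows:
        batch.append(clean_row(row))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Stand-in for the MySQL connection (see the note on placeholders above).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE chembl_data (canonical_smiles TEXT, min_standard_value REAL);")

rows = [("CCO", 10.0), ("CCN", float("nan")), ("CO", 3.5)]  # toy rows
for batch in iter_batches(rows, batch_size=2):
    demo.executemany("INSERT INTO chembl_data VALUES (?, ?);", batch)
    demo.commit()  # committing per batch keeps earlier batches if a later one fails

total = demo.execute("SELECT COUNT(*) FROM chembl_data;").fetchone()[0]
null_count = demo.execute(
    "SELECT COUNT(*) FROM chembl_data WHERE min_standard_value IS NULL;"
).fetchone()[0]
demo.close()
print(total, null_count)
```

In the notebook, `eda_db.values.tolist()` would take the place of `rows`, and the batch size would be chosen to suit the server.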
    
    