## Read and upload chemical clustering of PubChem cids
* Clustering was done outside of this script, using Morgan2 fingerprints from RDKit with a Tanimoto cutoff of 0.5, followed by MCL clustering with an inflation of 1.8.
* In general, any clustering method should work here as long as it yields relatively granular clusters, i.e. it does not join compounds that would not be considered to belong together. This allows automated decisions per cluster without the risk of mixing the profiles / MoAs of multiple chemotypes.
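
As a rough illustration of the similarity step described above, the sketch below computes Tanimoto similarity on sets of on-bit indices and keeps pairs at or above the 0.5 cutoff as weighted edges, which is the kind of input a graph clustering tool such as MCL consumes. It is not the code that produced the clusters: the fingerprints are toy sets rather than real Morgan fingerprints, and the names `tanimoto` and `similarity_edges` are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_edges(fingerprints, cutoff=0.5):
    """Return (cid_a, cid_b, similarity) for all pairs at or above the cutoff."""
    edges = []
    pairs = combinations(sorted(fingerprints.items()), 2)
    for (cid_a, fp_a), (cid_b, fp_b) in pairs:
        sim = tanimoto(fp_a, fp_b)
        if sim >= cutoff:
            edges.append((cid_a, cid_b, sim))
    return edges


# Toy fingerprints keyed by made-up cids (illustrative values only).
toy_fps = {
    1: {1, 2, 3, 4},
    2: {1, 2, 3, 5},
    3: {7, 8, 9},
}
edges = similarity_edges(toy_fps)
```

The all-pairs loop is quadratic and only suitable for small examples; at PubChem scale a dedicated similarity search tool would be needed.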

In [None]:
import sqlite3 
import pandas as pd
import glob
import os

In [None]:
df = pd.read_csv("pubchem_cids.chemfp_1.8_clusters.csv")

In [None]:
conn = sqlite3.connect('../pubchem_gcm.db')

In [None]:
conn.execute('''DROP TABLE IF EXISTS gcm_clusters;''')

# Create the table with its primary key first, then append the rows via pandas
conn.execute('''
CREATE TABLE gcm_clusters(
         inchi_key TEXT,  -- InChIKeys are strings, not integers
         gcm_cluster INT,
         cluster_size INT,
         cid INT,
         smiles TEXT,
         PRIMARY KEY(gcm_cluster, cid)
         );
         ''')

In [None]:
df.to_sql('gcm_clusters', conn, if_exists='append', index=False) 
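
Because the table is created with `PRIMARY KEY(gcm_cluster, cid)` before the rows are appended, a duplicate (cluster, cid) pair in the CSV would make the insert fail instead of silently adding a second row. The following minimal sketch shows that behaviour on a reduced, hypothetical copy of the table in an in-memory database; the row values are made up.

```python
import sqlite3

# In-memory stand-in for the real database (assumption for illustration).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("""
CREATE TABLE gcm_clusters(
    gcm_cluster INT,
    cid INT,
    smiles TEXT,
    PRIMARY KEY(gcm_cluster, cid)
)""")

row = (1, 100, "CCO")  # made-up cluster id, cid and SMILES
conn_demo.execute("INSERT INTO gcm_clusters VALUES (?, ?, ?)", row)

try:
    # Same (gcm_cluster, cid) pair again: violates the primary key.
    conn_demo.execute("INSERT INTO gcm_clusters VALUES (?, ?, ?)", row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

n_rows = conn_demo.execute("SELECT COUNT(*) FROM gcm_clusters").fetchone()[0]
conn_demo.close()
```

If the real upload raises an `IntegrityError`, checking the CSV for repeated (cluster, cid) pairs is a reasonable first step.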

In [None]:
conn.execute('''CREATE INDEX gcm_cluster_cid_index ON gcm_clusters (cid);''')
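
The index that backs the primary key leads with `gcm_cluster`, so lookups by `cid` alone benefit from the separate index created above. One way to check that SQLite actually uses it is `EXPLAIN QUERY PLAN`. The sketch below does this on a reduced, hypothetical copy of the table in an in-memory database.

```python
import sqlite3

# In-memory stand-in for the real database (assumption for illustration).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("""
CREATE TABLE gcm_clusters(
    gcm_cluster INT,
    cid INT,
    smiles TEXT,
    PRIMARY KEY(gcm_cluster, cid)
)""")
conn_demo.execute("CREATE INDEX gcm_cluster_cid_index ON gcm_clusters (cid)")

# Ask SQLite how it would execute a lookup by cid (the cid value is made up).
plan = conn_demo.execute(
    "EXPLAIN QUERY PLAN SELECT gcm_cluster, smiles FROM gcm_clusters WHERE cid = ?",
    (100,),
).fetchall()

# The human-readable description is the last of the four columns in each plan row.
plan_details = [plan_row[3] for plan_row in plan]
uses_cid_index = any("gcm_cluster_cid_index" in detail for detail in plan_details)
conn_demo.close()
```

The exact wording of the plan text differs between SQLite versions, so the check only looks for the index name.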

### Stats

In [None]:
pd.read_sql('SELECT COUNT(*) AS n_rows FROM gcm_clusters', conn)
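
Beyond the total row count, the number of clusters and the distribution of cluster sizes give a quick impression of how granular the clustering is. The queries below are a sketch on made-up toy rows in an in-memory database. Against the real table, the same SQL could be passed to `pd.read_sql`.

```python
import sqlite3

# In-memory stand-in with toy data (assumption for illustration).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE gcm_clusters(gcm_cluster INT, cid INT, PRIMARY KEY(gcm_cluster, cid))"
)
toy_rows = [(1, 100), (1, 101), (1, 102), (2, 200), (3, 300), (3, 301)]
conn_demo.executemany("INSERT INTO gcm_clusters VALUES (?, ?)", toy_rows)

# Total rows and number of distinct clusters.
n_rows, n_clusters = conn_demo.execute(
    "SELECT COUNT(*), COUNT(DISTINCT gcm_cluster) FROM gcm_clusters"
).fetchone()

# How many clusters exist for each cluster size.
size_distribution = conn_demo.execute("""
    SELECT n_members, COUNT(*) AS n_clusters
    FROM (
        SELECT gcm_cluster, COUNT(*) AS n_members
        FROM gcm_clusters
        GROUP BY gcm_cluster
    )
    GROUP BY n_members
    ORDER BY n_members
""").fetchall()
conn_demo.close()
```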

In [None]:
conn.close()