# PubChem scraping 

The aim of this notebook is to extract the melting points of various molecules from the PubChem website. PubChem lets you download the CIDs of all molecules for which at least one melting point has been recorded. The corresponding data are in `PubChem_comp_with_mp_list_.xlsx`.



Because the datasets obtained from this notebook will only be used to get an idea of the model's performance, not for training, we simplify and extract only one melting point per molecule, even though PubChem sometimes records several.

## Scraping 

The aim now is to scrape the PubChem website to extract the melting points. To do this, we use the previously downloaded CIDs and the fact that the URL of each compound page contains its CID. We simply loop over the pages and extract the melting point from each one.

*Note: scraping with `requests` and the JSON page of each compound could be faster, but the JSON path to the melting point is not consistent across pages.*
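One way to cope with an inconsistent JSON path is to search the record recursively for the section heading instead of hard-coding the path. The sketch below is not the method used in this notebook. It assumes the PUG View endpoint and the field names `TOCHeading`, `Information`, `Value`, `StringWithMarkup` and `String`; the helpers `find_sections`, `first_melting_point` and `fetch_melting_point` are hypothetical names introduced for illustration. It only handles string-valued entries, so records that store the melting point differently would return `None`.

```python
import json
from urllib.request import urlopen

# Assumed PUG View endpoint; the heading filter narrows the record to one section.
PUG_VIEW_URL = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"
    "{cid}/JSON?heading=Melting+Point"
)


def find_sections(node, heading):
    """Yield every dict nested under `node` whose TOCHeading equals `heading`."""
    if isinstance(node, dict):
        if node.get("TOCHeading") == heading:
            yield node
        for value in node.values():
            yield from find_sections(value, heading)
    elif isinstance(node, list):
        for item in node:
            yield from find_sections(item, heading)


def first_melting_point(record):
    """Return the first melting point string found in a record, or None."""
    for section in find_sections(record, "Melting Point"):
        for info in section.get("Information", []):
            # Assumed layout: Value -> StringWithMarkup -> [{"String": ...}]
            for markup in info.get("Value", {}).get("StringWithMarkup", []):
                text = markup.get("String")
                if text:
                    return text
    return None


def fetch_melting_point(cid, timeout=10):
    """Download the record for one CID and extract its first melting point."""
    with urlopen(PUG_VIEW_URL.format(cid=cid), timeout=timeout) as response:
        return first_melting_point(json.load(response))
```

Because the search does not depend on how deeply the section is nested, the same function can be applied to records with different layouts, at the cost of visiting the whole JSON tree.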



The following code takes a long time to run, so it is disabled by wrapping it in a string. The resulting DataFrame is saved in the Data folder, and you can simply load it:

In [1]:
"""
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Load the data
base_df = pd.read_excel("../Data/scrapping/PubChem_comp_with_mp_list_.xlsx")

# Get the list of CID
cid_list = base_df["cid"].to_list()

driver = webdriver.Chrome()

# Collect one record per CID; building the df once at the end
# avoids repeated pd.concat calls inside the loop
rows = []

# Loop to access each CID URL
for cid in cid_list[:12000]:
    url = f"https://pubchem.ncbi.nlm.nih.gov/compound/{cid}#section=Melting-Point&fullscreen=true"
    driver.get(url)

    try:
        # Wait up to 3 s for the Melting Point section to load, then read the first value
        element = WebDriverWait(driver, 3).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, "#Melting-Point .break-words.space-y-1"))
        )
        melting_point = element.text
        print(f"Accessed page: {url} - Melting Point: {melting_point}")
    except Exception as e:
        print(f"Failed to extract melting point for CID {cid}: {e}")
        melting_point = 'Not Found'

    rows.append({'CID': cid, 'mpC': melting_point})

# Store the results in a df
df = pd.DataFrame(rows, columns=['CID', 'mpC'])


# Close
driver.quit()
"""
import pandas as pd

#Load the data
df = pd.read_csv("../Data/scrapping/pubchem_scraped_mpC.csv")

In [2]:
df.head(5)

Unnamed: 0,cid,complexity,inchi,isosmiles,canonicalsmiles,inchikey,iupacname,pubchem_mpC
0,118856773,18600,InChI=1S/C287H440N80O111S6/c1-24-132(17)225-28...,CC[C@H](C)[C@H]1C(=O)N[C@H](C(=O)NCC(=O)N[C@H]...,CCC(C)C1C(=O)NC(C(=O)NCC(=O)NC(C(=O)NC(C(=O)NC...,FIBJDTSHOUXTKV-BRHMIFOHSA-N,(2S)-5-amino-2-[[(2S)-2-[[(2S)-2-[[(2S)-2-[[(2...,65°C
1,139600825,18600,InChI=1S/C287H440N80O111S6/c1-24-132(17)225-28...,CCC(C)C1C(=O)NC(C(=O)NCC(=O)NC(C(=O)NC(C(=O)NC...,CCC(C)C1C(=O)NC(C(=O)NCC(=O)NC(C(=O)NC(C(=O)NC...,FIBJDTSHOUXTKV-UHFFFAOYSA-N,5-amino-2-[[2-[[2-[[2-[[2-[[1-[2-[[2-[[2-[[2-[...,65°C
2,16130295,16700,InChI=1S/C284H432N84O79S7/c1-21-144(9)222-271(...,CC[C@H](C)[C@H]1C(=O)N[C@H](C(=O)N[C@H](C(=O)N...,CCC(C)C1C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)N...,ZPNFWUPYTFPOJU-LPYSRVMUSA-N,"(3S)-4-[[(2S)-1-[[(1R,2aS,4S,5aS,8aS,11aS,13S,...",>100 °C
3,22833874,16700,InChI=1S/C284H432N84O79S7/c1-21-144(9)222-271(...,CCC(C)C1C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)N...,CCC(C)C1C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)N...,ZPNFWUPYTFPOJU-UHFFFAOYSA-N,"4-[[1-[[29a,62a,69,84-tetrakis(4-aminobutyl)-3...",>100 °C
4,129627711,16700,InChI=1S/C284H432N84O79S7/c1-21-144(9)222-271(...,CC[C@H](C)[C@H]1C(=O)N[C@@H](C(=O)N[C@@H](C(=O...,CCC(C)C1C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)N...,ZPNFWUPYTFPOJU-IAUBLLHJSA-N,"(3S)-4-[[(2R)-1-[[(1S,2aS,4S,5aR,8aS,11aS,13R,...",>100 °C
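
The `pubchem_mpC` values above are raw strings such as `65°C` and `>100 °C`, so they need to be converted before they can be compared with numeric predictions. The following is a minimal sketch of one possible conversion, not necessarily the cleaning applied later. The helper name `parse_mp_celsius` is hypothetical, and the function assumes that the first number in the string is the value of interest and that a missing unit means degrees Celsius. Inequality signs are dropped, so `>100 °C` becomes `100.0` even though it is only a lower bound.

```python
import re

# First signed decimal number in the string.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
# First explicit unit, written as a degree sign followed by C or F.
_UNIT = re.compile(r"°\s*([CF])")


def parse_mp_celsius(raw):
    """Return the first number in `raw` as degrees Celsius, or None."""
    if not isinstance(raw, str):
        return None
    number = _NUMBER.search(raw)
    if number is None:
        return None
    value = float(number.group())
    unit = _UNIT.search(raw)
    # Assumption: values without an explicit unit are already in Celsius.
    if unit is not None and unit.group(1) == "F":
        value = (value - 32) * 5 / 9
    return value
```

Applied to the loaded data, this could be used as `df["pubchem_mpC"].map(parse_mp_celsius)`; rows where no number is found, such as `Not Found`, map to `None`.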
