# PubChem scraping 

The aim of this notebook is to extract the melting points of various molecules from the PubChem site. The PubChem website allows downloading the CIDs of all molecules for which at least one melting point has been recorded. The corresponding data are in PubChem_comp_with_mp_list_.xlsx.



Because the datasets obtained from this notebook will only be used to get a rough idea of the model's performance, several approximations are made, such as extracting only one melting point per molecule and taking the average when the melting point is given as a range.

In [2]:
import pandas as pd

# Load the data
base_df = pd.read_excel("PubChem_comp_with_mp_list_.xlsx")

In [3]:
# Get the list of CID
cid_list = base_df["cid"].to_list()

In [4]:
# Initialize a df to store the data
df = pd.DataFrame(columns=['CID', 'Melting Point 1'])
df.shape

## Scraping 

The aim now is to scrape the PubChem site to extract the melting points. To do this, we use the previously extracted CIDs and the fact that the URL of each page listing a melting point contains the CID. We loop through each page and extract the melting point. To simplify the extraction, only the first value in the melting point table is extracted.



*Note: Scraping with requests and the PubChem JSON page of each compound could be faster, but the path to the melting point depends on the compound and the source. Using Selenium with a CSS selector is one way around this problem.*
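As a point of comparison for the note above, the JSON route could be made less path-dependent by searching the record recursively for a section whose heading is "Melting Point" instead of following a fixed path. The sketch below assumes the PUG View layout (nested `Section` lists carrying `TOCHeading`, `Information`, `Value` and `StringWithMarkup`) and its `heading` query parameter. The helper names `fetch_record` and `melting_point_strings` are hypothetical and not part of this notebook, and the field names should be checked against a real response before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed PUG View endpoint; the `heading` filter restricts the response to one section.
PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading={heading}"


def fetch_record(cid, heading="Melting Point"):
    """Download the PUG View record of one compound (network access required)."""
    url = PUG_VIEW.format(cid=cid, heading=quote(heading))
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def melting_point_strings(node, heading="Melting Point"):
    """Recursively collect the text values stored under sections named `heading`."""
    found = []
    if isinstance(node, dict):
        if node.get("TOCHeading") == heading:
            for info in node.get("Information", []):
                value = info.get("Value", {})
                for item in value.get("StringWithMarkup", []):
                    found.append(item["String"])
        # Keep descending: the section may sit at any depth
        for child in node.values():
            found.extend(melting_point_strings(child, heading))
    elif isinstance(node, list):
        for child in node:
            found.extend(melting_point_strings(child, heading))
    return found


# Hand-written stand-in for a parsed response (not real PubChem data).
sample = {
    "Record": {
        "Section": [{
            "TOCHeading": "Chemical and Physical Properties",
            "Section": [{
                "TOCHeading": "Melting Point",
                "Information": [
                    {"Value": {"StringWithMarkup": [{"String": "125 - 130 °C"}]}},
                    {"Value": {"StringWithMarkup": [{"String": "128 °C"}]}},
                ],
            }],
        }]
    }
}
print(melting_point_strings(sample))
```

With a live connection, `melting_point_strings(fetch_record(cid))` would be the equivalent call. Values that a source stores as a number plus a unit instead of a string are ignored by this sketch.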

In [36]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# CSS selector of the first value in the Melting Point section
MP_SELECTOR = "#Melting-Point .break-words.space-y-1"

driver = webdriver.Chrome()

# Collect one row per CID and build the DataFrame once at the end
# (calling pd.concat inside the loop copies the whole frame on every iteration)
rows = []

try:
    # Loop to access each CID URL
    for cid in cid_list:
        url = f"https://pubchem.ncbi.nlm.nih.gov/compound/{cid}#section=Melting-Point&fullscreen=true"
        driver.get(url)

        try:
            # Wait up to 3 s for the Melting Point information to load;
            # until() returns the located element
            element = WebDriverWait(driver, 3).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, MP_SELECTOR))
            )
            melting_point = element.text
            print(f"Accessed page: {url} - Melting Point: {melting_point}")
        except Exception as e:
            print(f"No extraction for {cid}: {e}")
            melting_point = 'Not Found'

        rows.append({'CID': cid, 'Melting Point 1': melting_point})
finally:
    # Close the browser even if the loop is interrupted
    driver.quit()

# Store the data in a df
df = pd.DataFrame(rows, columns=['CID', 'Melting Point 1'])

Accessed page: https://pubchem.ncbi.nlm.nih.gov/compound/118856773#section=Melting-Point&fullscreen=true - Melting Point: 65°C
Accessed page: https://pubchem.ncbi.nlm.nih.gov/compound/139600825#section=Melting-Point&fullscreen=true - Melting Point: 65°C
Accessed page: https://pubchem.ncbi.nlm.nih.gov/compound/16130295#section=Melting-Point&fullscreen=true - Melting Point: >100 °C
Accessed page: https://pubchem.ncbi.nlm.nih.gov/compound/22833874#section=Melting-Point&fullscreen=true - Melting Point: >100 °C
Accessed page: https://pubchem.ncbi.nlm.nih.gov/compound/129627711#section=Melting-Point&fullscreen=true - Melting Point: >100 °C
Accessed page: https://pubchem.ncbi.nlm.nih.gov/compound/146027575#section=Melting-Point&fullscreen=true - Melting Point: >100 °C
Accessed page: https://pubchem.ncbi.nlm.nih.gov/compound/156024996#section=Melting-Point&fullscreen=true - Melting Point: >100 °C
Accessed page: https://pubchem.ncbi.nlm.nih.gov/compound/166610792#section=Melting-Point&fullscree

In [45]:
# For rows 0 to 2000, concatenate base_df with the scraped melting points column-wise (rows are paired by index)
final_df = pd.concat([base_df, df[["Melting Point 1"]]], axis=1)

final_df


Unnamed: 0,cid,complexity,inchi,isosmiles,canonicalsmiles,inchikey,iupacname,Melting Point 1
0,118856773,18600,InChI=1S/C287H440N80O111S6/c1-24-132(17)225-28...,CC[C@H](C)[C@H]1C(=O)N[C@H](C(=O)NCC(=O)N[C@H]...,CCC(C)C1C(=O)NC(C(=O)NCC(=O)NC(C(=O)NC(C(=O)NC...,FIBJDTSHOUXTKV-BRHMIFOHSA-N,(2S)-5-amino-2-[[(2S)-2-[[(2S)-2-[[(2S)-2-[[(2...,65°C
1,139600825,18600,InChI=1S/C287H440N80O111S6/c1-24-132(17)225-28...,CCC(C)C1C(=O)NC(C(=O)NCC(=O)NC(C(=O)NC(C(=O)NC...,CCC(C)C1C(=O)NC(C(=O)NCC(=O)NC(C(=O)NC(C(=O)NC...,FIBJDTSHOUXTKV-UHFFFAOYSA-N,5-amino-2-[[2-[[2-[[2-[[2-[[1-[2-[[2-[[2-[[2-[...,65°C
2,16130295,16700,InChI=1S/C284H432N84O79S7/c1-21-144(9)222-271(...,CC[C@H](C)[C@H]1C(=O)N[C@H](C(=O)N[C@H](C(=O)N...,CCC(C)C1C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)N...,ZPNFWUPYTFPOJU-LPYSRVMUSA-N,"(3S)-4-[[(2S)-1-[[(1R,2aS,4S,5aS,8aS,11aS,13S,...",>100 °C
3,22833874,16700,InChI=1S/C284H432N84O79S7/c1-21-144(9)222-271(...,CCC(C)C1C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)N...,CCC(C)C1C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)N...,ZPNFWUPYTFPOJU-UHFFFAOYSA-N,"4-[[1-[[29a,62a,69,84-tetrakis(4-aminobutyl)-3...",>100 °C
4,129627711,16700,InChI=1S/C284H432N84O79S7/c1-21-144(9)222-271(...,CC[C@H](C)[C@H]1C(=O)N[C@@H](C(=O)N[C@@H](C(=O...,CCC(C)C1C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)N...,ZPNFWUPYTFPOJU-IAUBLLHJSA-N,"(3S)-4-[[(2R)-1-[[(1S,2aS,4S,5aR,8aS,11aS,13R,...",>100 °C
...,...,...,...,...,...,...,...,...
1995,435277,910,InChI=1S/C30H50O5/c1-25(2)14-18-17-8-9-20-27(5...,CC1(CC2C3=CCC4C5(CCC(C(C5CCC4(C3(CC(C2(C(C1O)O...,CC1(CC2C3=CCC4C5(CCC(C(C5CCC4(C3(CC(C2(C(C1O)O...,AYDKOFQQBHRXEW-UHFFFAOYSA-N,"4a-(hydroxymethyl)-2,2,6a,6b,9,9,12a-heptameth...",302 - 308 °C
1996,12310208,910,InChI=1S/C20H24O10/c1-5-6-8(27-13(5)24)10(22)1...,CC1C2C(C(C34C25C(=O)OC3C(C(C46C(C(=O)OC6O5)O)C...,CC1C2C(C(C34C25C(=O)OC3C(C(C46C(C(=O)OC6O5)O)C...,KDKROYXEHCYLJQ-UHFFFAOYSA-N,"8-tert-butyl-6,9,12-trihydroxy-16-methyl-2,4,1...",280 °C
1997,131750868,910,InChI=1S/C22H36N6O12S2/c23-11(21(37)38)1-3-15(...,C(CC(=O)NC(CSSCC(C(=O)NCCC(=O)O)NC(=O)CCC(C(=O...,C(CC(=O)NC(CSSCC(C(=O)NCCC(=O)O)NC(=O)CCC(C(=O...,NHIHYSIMMYLVDO-UHFFFAOYSA-N,2-amino-5-[[3-[[2-[(4-amino-4-carboxybutanoyl)...,194 - 197 °C
1998,131752289,910,InChI=1S/C32H52O2/c1-21(33)34-26-13-14-30(7)22...,CC(=O)O[C@H]1CC[C@]2(C(C1(C)C)CCC3=C2CC[C@@]4(...,CC(=O)OC1CCC2(C(C1(C)C)CCC3=C2CCC4(C3(CCC5(C4C...,IQPSCJJRYFMIOC-VVFZLFBHSA-N,"[(3S,6aS,6bS,8aR,12aR,14bS)-4,4,6a,6b,8a,11,11...",216 - 217 °C
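
`pd.concat(..., axis=1)` pairs rows by index, so the result is only correct if `df` holds one row per entry of `base_df`, in the same order. A minimal sketch of a more defensive alternative, joining on the CID itself, is given below. The two small frames are made-up stand-ins for `base_df` and `df`, and `validate="one_to_one"` assumes each CID occurs only once in both.

```python
import pandas as pd

# Made-up stand-ins for base_df and the scraped df (values are illustrative only).
base_demo = pd.DataFrame({"cid": [101, 102, 103], "complexity": [10, 20, 30]})
scraped_demo = pd.DataFrame({"CID": [103, 101], "Melting Point 1": ["280 °C", "65°C"]})

# Left join keeps every row of base_demo; CIDs that were not scraped get NaN.
merged = base_demo.merge(
    scraped_demo,
    how="left",
    left_on="cid",
    right_on="CID",
    validate="one_to_one",  # raises if a CID is duplicated on either side
).drop(columns="CID")

print(merged)
```

The join no longer depends on row order, and a duplicated CID raises an error instead of silently shifting values.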


In [None]:
import re

# Signed integer or decimal number, e.g. 125, -10 or 125.5
# (a bare \d+ would read "125.5 °C" as 5 and drop minus signs)
NUMBER = r"(-?\d+(?:\.\d+)?)"

def extract_melting_point(melting_point):
    # Missing values (NaN) are not strings and cannot be parsed
    if not isinstance(melting_point, str):
        return None

    # Check for the presence of "decompose" or "decomposes" in the string
    if "decompose" in melting_point.lower():
        return None

    # Check for a range of temperatures
    range_match = re.search(NUMBER + r"\s*-\s*" + NUMBER + r"\s*°C", melting_point)
    if range_match:
        # Calculate the mean of the range
        low_temp = float(range_match.group(1))
        high_temp = float(range_match.group(2))
        return (low_temp + high_temp) / 2

    # Check for a single temperature
    single_match = re.search(NUMBER + r"\s*°C", melting_point)
    if single_match:
        return float(single_match.group(1))

    return None  # Return None if no pattern matches


# Example usage
example_single_no_space = extract_melting_point("125°C")
example_single_space = extract_melting_point("125 °C")
example_range_no_space = extract_melting_point("125-130°C")
example_range_space = extract_melting_point("125 - 130 °C")
example_no_match = extract_melting_point("decomposes at 125°C")
# Prints: 125.0 125.0 127.5 127.5 None
print(example_single_no_space, example_single_space, example_range_no_space, example_range_space, example_no_match)
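
Some scraped values, such as `>100 °C` in the output above, are bounds and not measurements, yet the parser turns them into a plain number. If that distinction matters for the evaluation, such entries could be flagged first. The helper `is_bound` below is a hypothetical addition, not part of the original notebook, and the set of inequality symbols it recognises is an assumption.

```python
import re

# Assumed set of inequality markers that may precede a temperature.
BOUND_PATTERN = re.compile(r"^\s*(>=|<=|>|<|≥|≤)")


def is_bound(melting_point):
    """Return True if the raw string starts with an inequality sign."""
    return isinstance(melting_point, str) and bool(BOUND_PATTERN.match(melting_point))


# Strings taken from the scraped output, plus the 'Not Found' placeholder
for raw in [">100 °C", "65°C", "302 - 308 °C", "Not Found"]:
    print(f"{raw!r}: bound={is_bound(raw)}")
```

Flagged rows could then be dropped or kept in a separate column, depending on how strict the comparison with the model needs to be.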

In [47]:
# Save the DataFrame to a CSV file
final_df.to_csv("pubChem_melting_points.csv", index=False)