# Data Mining for OSEL Regulatory Science Programmatic Planning

###### This project uses data mining techniques to support regulatory science initiatives within OSEL, with a strong emphasis on planning and strategic implementation. The goal is to pinpoint regulatory science gaps and identify potential research collaborators. The dataset is obtained by web scraping PubMed.

## Web Scraping for the Search Terms "Hemolysis" OR "Blood Damage"

In [2]:
## Import libraries
import csv
import re
from datetime import datetime

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [3]:
# PubMed Website URL
baseURL = 'https://pubmed.ncbi.nlm.nih.gov/?term=%22Hemolysis%22%5Btiab%5D+or+%22blood+damage%22%5Btiab%5D&filter=years.2015-2024'
response = requests.get(baseURL)
print(response)

<Response [200]>
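The search URL above is percent-encoded by hand, which is easy to get wrong when the query changes. As a minimal sketch, the same URL can be assembled from readable parts with the standard library. The helper name `build_search_url` and its parameters are assumptions made for this illustration, and the operator is written as `OR` rather than the lowercase `or` used in the cell above.

```python
from urllib.parse import urlencode

PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/"


def build_search_url(phrases, field="tiab", start_year=2015, end_year=2024, page=1):
    """Build a PubMed search URL (hypothetical helper, not part of the notebook)."""
    # Quote each phrase and tag it with the search field, e.g. "Hemolysis"[tiab]
    term = " OR ".join(f'"{phrase}"[{field}]' for phrase in phrases)
    query = {
        "term": term,
        "filter": f"years.{start_year}-{end_year}",
        "page": page,
    }
    # urlencode percent-encodes quotes and brackets and turns spaces into '+'
    return PUBMED_BASE + "?" + urlencode(query)


url = build_search_url(["Hemolysis", "blood damage"], page=2)
print(url)
```

Keeping the phrases, field tag, and year filter as separate arguments makes it simpler to reuse the scraper for other search terms.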


In [4]:
# Parse the response body with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')  # html.parser is Python's built-in HTML parser (lxml is an optional alternative)
print(soup)

<!DOCTYPE html>

<html lang="en">
<head itemscope="" itemtype="http://schema.org/WebPage" prefix="og: http://ogp.me/ns#">
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<!-- Mobile properties -->
<meta content="True" name="HandheldFriendly"/>
<meta content="320" name="MobileOptimized"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="https://cdn.ncbi.nlm.nih.gov" rel="preconnect"/>
<link href="https://www.ncbi.nlm.nih.gov" rel="preconnect"/>
<link href="https://www.google-analytics.com" rel="preconnect"/>
<link href="https://cdn.ncbi.nlm.nih.gov/pubmed/1f83d27a-2e41-417f-9836-9839abbecc07/CACHE/css/output.5ecf62baa0fa.css" rel="stylesheet" type="text/css"/>
<link href="https://cdn.ncbi.nlm.nih.gov/pubmed/1f83d27a-2e41-417f-9836-9839abbecc07/CACHE/css/output.452c70ce66f7.css" rel="stylesheet" type="text/css"/>
<link href="https://cdn.ncbi.nlm.nih.gov/pubmed/1f83d27a-2e41-417f-9836-9839abbecc07/CACHE/css/output.55dd827ca

In [5]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# Base URL
base_url = 'https://pubmed.ncbi.nlm.nih.gov/?term=%22Hemolysis%22%5Btiab%5D+or+%22blood+damage%22%5Btiab%5D&filter=years.2015-2024'
# Create a list to hold all data
all_data = []

# Return the stripped text of an element, or "" when the element is missing
def text_or_empty(element):
    return element.text.strip() if element else ""

# Function to scrape a single page of search results
def scrape_page(page_url):
    response = requests.get(page_url, timeout=30)
    if response.status_code != 200:
        print(f"Failed to retrieve page: {page_url}")
        return

    soup = BeautifulSoup(response.content, 'html.parser')
    articles = soup.find_all('div', class_='docsum-content')

    for article in articles:
        # Extract data for each article; a missing element yields "" instead of raising
        title = text_or_empty(article.find('a', class_='docsum-title'))
        authors = text_or_empty(article.find('span', class_='docsum-authors full-authors'))
        # The PMID text keeps its label, e.g. 'PMID: 32119415'
        pmid = text_or_empty(article.find('span', class_='citation-part'))

        journal = text_or_empty(article.find('span', class_='docsum-journal-citation full-journal-citation'))
        # The first four-digit run in the citation is taken as the publication year
        match = re.search(r'\d{4}', journal)
        publication_year = match.group() if match else ""

        # These two fields come back empty in the output below;
        # affiliations are collected later from each article page
        cited_by = text_or_empty(article.find('ul', class_='articles-list'))
        affiliations = text_or_empty(article.find('span', class_='docsum-affiliation'))

        # Append data to all_data list
        all_data.append({
            'Title': title,
            'Authors': authors,
            'Journal': journal,
            'PMID': pmid,
            'Publication_year': publication_year,
            'Affiliations': affiliations,
            'Cited_by': cited_by
        })

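# Scrape multiple pages; adjust the range to the number of result pages to scrape.
# The run shown below ends at 10,000 rows, which is consistent with PubMed's
# web interface serving at most 10,000 results for a single search.
for page_num in range(1, 1044):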
    url = f"{base_url}&page={page_num}"
    scrape_page(url)

# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(all_data)

# Print the DataFrame
print(df)



                                                  Title  \
0     Managing hemolyzed samples in clinical laborat...   
1                               Streptococcus Pyogenes.   
2           Blood damage in ventricular assist devices.   
3     Causes, consequences and management of sample ...   
4     Clinical Applications of Hemolytic Markers in ...   
...                                                 ...   
9995  Growth Promotion and Immune Stimulation in Nil...   
9996  Assessing clinical implications and perspectiv...   
9997  Construction of Hemocompatible and Histocompat...   
9998  Cholesterol Binding to the Transmembrane Regio...   
9999  α-Tocopherol Succinate-Anchored PEGylated Poly...   

                                                Authors  \
0     Simundic AM, Baird G, Cadamuro J, Costelloe SJ...   
1                                   Kanwal S, Vaitla P.   
2                  Simmonds M, Thamsen B, Kertzscher U.   
3     Heireman L, Van Geel P, Musger L, Heylen E, Uy...
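The loop above issues roughly a thousand requests back to back, and a single network error would stop the whole run. The following is a minimal sketch of retrying with an exponentially growing pause. `fetch_with_retry` and `flaky_fetch` are hypothetical names introduced here, and the fetcher is passed in as a parameter so the example runs offline; in the notebook it could be a thin wrapper around `requests.get`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or the retries run out.

    fetch is any callable that returns the page body or raises an exception.
    The wait doubles after every failed attempt: base_delay, 2 * base_delay, ...
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # give up after the last allowed attempt
            sleep(base_delay * 2 ** attempt)


# Offline demonstration: a stand-in fetcher that fails twice, then succeeds
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"


waits = []  # records the pauses instead of actually sleeping
result = fetch_with_retry(flaky_fetch, "https://pubmed.ncbi.nlm.nih.gov/?page=1",
                          sleep=waits.append)
print(result, waits)
```

Injecting `sleep` keeps the demonstration instant; with the default `time.sleep` the pauses are real, which also spaces out requests to the server.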

In [34]:
df

|  | Title | Authors | Journal | PMID | Publication_year | Affiliations | Cited_by |
|---|---|---|---|---|---|---|---|
| 0 | Streptococcus Pyogenes. | Kanwal S, Vaitla P. | 2023 Jul 31. In: StatPearls [Internet]. Treasu... | PMID: 32119415 | 2023 |  |  |
| 1 | Managing hemolyzed samples in clinical laborat... | Simundic AM, Baird G, Cadamuro J, Costelloe SJ... | Crit Rev Clin Lab Sci. 2020 Jan;57(1):1-21. do... | PMID: 31603708 | 2020 |  |  |
| 2 | Causes, consequences and management of sample ... | Heireman L, Van Geel P, Musger L, Heylen E, Uy... | Clin Biochem. 2017 Dec;50(18):1317-1322. doi: ... | PMID: 28947321 | 2017 |  |  |
| 3 | Clinical Applications of Hemolytic Markers in ... | Barcellini W, Fattizzo B. | Dis Markers. 2015;2015:635670. doi: 10.1155/20... | PMID: 26819490 | 2015 |  |  |
| 4 | Mechanism of megaloblastic anemia combined wit... | Wu Q, Liu J, Xu X, Huang B, Zheng D, Li J. | Bioengineered. 2021 Dec;12(1):6703-6712. doi: ... | PMID: 34542005 | 2021 |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Platelet count and indices as postpartum hemor... | van Dijk WEM, Nijdam JS, Haitjema S, de Groot ... | J Thromb Haemost. 2021 Nov;19(11):2873-2883. d... | PMID: 34339085 | 2021 |  |  |
| 9996 | Simultaneous and rapid determination of 12 tyr... | Zhou L, Wang S, Chen M, Huang S, Zhang M, Bao ... | J Chromatogr B Analyt Technol Biomed Life Sci.... | PMID: 33991955 | 2021 |  |  |
| 9997 | Immunological Hallmarks of Inflammatory Status... | Silva-Junior AL, Garcia NP, Cardoso EC, Dias S... | Front Immunol. 2021 Mar 11;12:559925. doi: 10.... | PMID: 33776989 | 2021 |  |  |
| 9998 | Can First-trimester AST to Platelet Ratio Inde... | Tolunay HE, Kahraman NC, Varli EN, Reis YA, Ce... | J Coll Physicians Surg Pak. 2021 Feb;31(2):188... | PMID: 33645187 | 2021 |  |  |
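The `Publication_year` column is taken from the first four-digit run in the `Journal` citation. That works for the rows shown, but it would pick up the wrong digits if a citation contained a longer number before the year. A slightly stricter pattern is sketched below. `extract_year` is a name made up for this example, the restriction to years starting with 19 or 20 is an assumption, and the last test string is an invented citation used only to show the difference.

```python
import re

# A standalone four-digit number starting with 19 or 20 (assumed year range)
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_year(citation):
    """Return the first plausible publication year in a citation, or ""."""
    match = YEAR_PATTERN.search(citation)
    return match.group() if match else ""


# Citation strings as they appear in the Journal column above
print(extract_year("Crit Rev Clin Lab Sci. 2020 Jan;57(1):1-21."))
print(extract_year("Dis Markers. 2015;2015:635670."))

# Invented citation: the plain \d{4} pattern stops at the first four digits
invented = "J Example 12345. 2019 Mar;1(2):3-4."
print(re.search(r"\d{4}", invented).group(), extract_year(invented))
```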


In [35]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
if duplicates > 0:
    print(f"Found {duplicates} duplicates.")
else:
    print("No duplicates found.")

No duplicates found.
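`df.duplicated()` only flags rows that match in every column. Since the PMID identifies an article, it can also be useful to look for repeated PMIDs on their own. The sketch below does this with the standard library on a plain list such as `df['PMID'].tolist()`; the helper name `find_duplicate_pmids` is an assumption, and in pandas the same idea can be expressed with `df.duplicated(subset=['PMID'])`.

```python
from collections import Counter


def find_duplicate_pmids(pmids):
    """Return {pmid: count} for every PMID that occurs more than once."""
    # Strip stray whitespace so 'PMID: 1 ' and 'PMID: 1' count as the same entry
    counts = Counter(pmid.strip() for pmid in pmids)
    return {pmid: n for pmid, n in counts.items() if n > 1}


# Small hand-made sample in the notebook's 'PMID: <digits>' format
sample = ["PMID: 32119415", "PMID: 31603708", "PMID: 32119415 "]
print(find_duplicate_pmids(sample))
```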


In [36]:
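## Create the URL for a single article and check that it resolves

baseurl = 'https://pubmed.ncbi.nlm.nih.gov/'
# The PMID column holds strings such as 'PMID: 32119415'; [6:] drops the 'PMID: ' prefix
first_pmid = df.iloc[0, 3][6:]
test_response = requests.get(f'{baseurl}{first_pmid}')
test_response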

In [37]:
## Create a list of article URLs from the PMID column
affiliations_url = []
for pmid in df['PMID']:
    # Each entry looks like 'PMID: 32119415'; [6:] drops the 'PMID: ' prefix
    search_url = baseurl + pmid[6:]
    print(search_url)
    affiliations_url.append(search_url)
affiliations_url

https://pubmed.ncbi.nlm.nih.gov/32119415
https://pubmed.ncbi.nlm.nih.gov/31603708
https://pubmed.ncbi.nlm.nih.gov/28947321
https://pubmed.ncbi.nlm.nih.gov/26819490
https://pubmed.ncbi.nlm.nih.gov/34542005
https://pubmed.ncbi.nlm.nih.gov/27034320
https://pubmed.ncbi.nlm.nih.gov/30862276
https://pubmed.ncbi.nlm.nih.gov/31986168
https://pubmed.ncbi.nlm.nih.gov/27323915
https://pubmed.ncbi.nlm.nih.gov/30773236
https://pubmed.ncbi.nlm.nih.gov/35491057
https://pubmed.ncbi.nlm.nih.gov/32000921
https://pubmed.ncbi.nlm.nih.gov/37031086
https://pubmed.ncbi.nlm.nih.gov/30073891
https://pubmed.ncbi.nlm.nih.gov/37827140
https://pubmed.ncbi.nlm.nih.gov/37216133
https://pubmed.ncbi.nlm.nih.gov/25840802
https://pubmed.ncbi.nlm.nih.gov/33571008
https://pubmed.ncbi.nlm.nih.gov/27235204
https://pubmed.ncbi.nlm.nih.gov/36949568
https://pubmed.ncbi.nlm.nih.gov/31039274
https://pubmed.ncbi.nlm.nih.gov/30797470
https://pubmed.ncbi.nlm.nih.gov/25899978
https://pubmed.ncbi.nlm.nih.gov/33989208
https://pubmed.n

['https://pubmed.ncbi.nlm.nih.gov/32119415',
 'https://pubmed.ncbi.nlm.nih.gov/31603708',
 'https://pubmed.ncbi.nlm.nih.gov/28947321',
 'https://pubmed.ncbi.nlm.nih.gov/26819490',
 'https://pubmed.ncbi.nlm.nih.gov/34542005',
 'https://pubmed.ncbi.nlm.nih.gov/27034320',
 'https://pubmed.ncbi.nlm.nih.gov/30862276',
 'https://pubmed.ncbi.nlm.nih.gov/31986168',
 'https://pubmed.ncbi.nlm.nih.gov/27323915',
 'https://pubmed.ncbi.nlm.nih.gov/30773236',
 'https://pubmed.ncbi.nlm.nih.gov/35491057',
 'https://pubmed.ncbi.nlm.nih.gov/32000921',
 'https://pubmed.ncbi.nlm.nih.gov/37031086',
 'https://pubmed.ncbi.nlm.nih.gov/30073891',
 'https://pubmed.ncbi.nlm.nih.gov/37827140',
 'https://pubmed.ncbi.nlm.nih.gov/37216133',
 'https://pubmed.ncbi.nlm.nih.gov/25840802',
 'https://pubmed.ncbi.nlm.nih.gov/33571008',
 'https://pubmed.ncbi.nlm.nih.gov/27235204',
 'https://pubmed.ncbi.nlm.nih.gov/36949568',
 'https://pubmed.ncbi.nlm.nih.gov/31039274',
 'https://pubmed.ncbi.nlm.nih.gov/30797470',
 'https://
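Slicing with `[6:]` relies on every entry starting with exactly `PMID: `. A small sketch of a more tolerant conversion is shown below: it pulls the digits out with a regular expression and fails loudly when there are none. The function name `pmid_to_url` is an assumption for this example.

```python
import re

BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def pmid_to_url(raw_pmid):
    """Turn 'PMID: 32119415', '32119415', or 32119415 into an article URL."""
    match = re.search(r"\d+", str(raw_pmid))
    if match is None:
        raise ValueError(f"No PMID digits found in {raw_pmid!r}")
    return BASE_URL + match.group()


print(pmid_to_url("PMID: 32119415"))
```

Raising on malformed input surfaces bad rows early instead of producing a URL that later returns a 404.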

In [38]:
# Extract the 'PMID' column of df as a list
pmids_list = df['PMID'].tolist()

# Print the list of PMIDs
print(pmids_list)

['PMID: 32119415', 'PMID: 31603708', 'PMID: 28947321', 'PMID: 26819490', 'PMID: 34542005', 'PMID: 27034320', 'PMID: 30862276', 'PMID: 31986168', 'PMID: 27323915', 'PMID: 30773236', 'PMID: 35491057', 'PMID: 32000921', 'PMID: 37031086', 'PMID: 30073891', 'PMID: 37827140', 'PMID: 37216133', 'PMID: 25840802', 'PMID: 33571008', 'PMID: 27235204', 'PMID: 36949568', 'PMID: 31039274', 'PMID: 30797470', 'PMID: 25899978', 'PMID: 33989208', 'PMID: 35152829', 'PMID: 33740880', 'PMID: 30847662', 'PMID: 31754583', 'PMID: 37109037', 'PMID: 29956069', 'PMID: 30918601', 'PMID: 29608016', 'PMID: 27034315', 'PMID: 33402176', 'PMID: 36347061', 'PMID: 37317847', 'PMID: 35030287', 'PMID: 31403730', 'PMID: 33136218', 'PMID: 36134430', 'PMID: 31809514', 'PMID: 35369808', 'PMID: 29396875', 'PMID: 31693564', 'PMID: 37123255', 'PMID: 31766155', 'PMID: 28643335', 'PMID: 29075076', 'PMID: 33985796', 'PMID: 28803746', 'PMID: 26521297', 'PMID: 34596538', 'PMID: 36571824', 'PMID: 32673671', 'PMID: 32843935', 'PMID: 33

In [42]:
import requests
from bs4 import BeautifulSoup

# Base URL for PubMed
baseurl = 'https://pubmed.ncbi.nlm.nih.gov/'

# Initialize empty lists to store all affiliations and cited_by
affiliations_list = []
#cited_by_list = []

# DataFrame 'df' with a column 'PMID'
pmids_list = df['PMID'].tolist()

# Loop through each pmid in the pmids_list
for pmid in pmids_list:
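    # Construct the search URL for each entry; pmid still carries its label
    # (e.g. 'PMID: 32119415') and is passed to PubMed as a search term
    url = baseurl + f'?term={pmid}'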

    # Send a GET request to the URL
    response = requests.get(url)

    if response.status_code == 200:
        # Parse the content of the request with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the ul element with class "item-list" for affiliations
        ul_element = soup.find('ul', class_='item-list')

        # Extract and process the affiliations
        if ul_element:
            # Find all the li elements within the ul element
            li_elements = ul_element.find_all('li')

            # Extracting the affiliations as a single string
            affiliations = ", ".join([li.text for li in li_elements])
            affiliations_list.append(affiliations)
        else:
            # Find the affiliation element if the 'item-list' is not found
            affiliation_element = soup.find('div', class_='affiliations')

            if affiliation_element:
                affiliations_list.append(affiliation_element.text.strip())
            else:
                affiliations_list.append("No affiliations found")

        # Find the ul element with class "articles-list" for cited_by
        #articles_list_element = soup.find('ul', class_='articles-list')

        # Extract and process the cited_by
        #if articles_list_element:
            #articles = articles_list_element.find_all('li')
            #cited_by = ", ".join([article.text for article in articles])
            #if cited_by.strip():
                #cited_by_list.append(cited_by)
        #else:
            #cited_by_list.append("No cited by information found")
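    else:
        # Record a placeholder so affiliations_list stays aligned with the rows of df
        print(f"Failed to retrieve {url}. Status code: {response.status_code}")
        affiliations_list.append("Request failed")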

# Print all the extracted affiliations
for affiliation in affiliations_list:
    print(affiliation)

# Print all the extracted cited_by
#for cited_by in cited_by_list:
    #print(cited_by)

1 Allama Iqbal Medical College, Lahore, Pakistan, 2 University of Mississippi Medical Center
1 Department of Medical Laboratory Diagnostics, University Hospital "Sveti Duh", University of Zagreb, Faculty of Pharmacy and Biochemistry, Zagreb, Croatia., 2 Department of Laboratory Medicine, University of Washington, Seattle, WA, USA., 3 Department of Laboratory Medicine, Paracelsus Medical University Salzburg, Salzburg, Austria., 4 Department of Clinical Biochemistry, Cork University Hospital, Cork, Republic of Ireland., 5 Section of Clinical Biochemistry, University of Verona, Verona, Italy.
1 Department of Laboratory Medicine, Ziekenhuis Netwerk Antwerpen, Antwerp, Belgium. Electronic address: heireman.laura@gmail.com., 2 Department of Orthopedic Surgery, Ziekenhuis Netwerk Antwerpen, Antwerp, Belgium., 3 Department of Laboratory Medicine, Ziekenhuis Netwerk Antwerpen, Antwerp, Belgium.
1 U.O. Oncoematologia, Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico di Milano, Via Franc
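Each line above is one article's affiliations joined into a single string, with a running number in front of every institution. To compare institutions across articles, the string has to be split again. The sketch below does that under the assumption that a new affiliation always starts with ", " followed by a number and a space; an address fragment with the same shape would be split incorrectly. `split_affiliations` is a name chosen for this example.

```python
import re


def split_affiliations(joined):
    """Split '1 Inst A, City, 2 Inst B' into ['Inst A, City', 'Inst B']."""
    # Split only at commas that are followed by '<number><space>'
    parts = re.split(r",\s+(?=\d+\s)", joined.strip())
    # Remove the leading running number from each part and skip empty parts
    return [re.sub(r"^\d+\s+", "", part).strip() for part in parts if part.strip()]


# First line of the output above
line = ("1 Allama Iqbal Medical College, Lahore, Pakistan, "
        "2 University of Mississippi Medical Center")
print(split_affiliations(line))
```

Storing the `li` texts as a list in the scraping loop, instead of joining them, would avoid the need for this split altogether.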

In [49]:
import pandas as pd

# Make sure the length of the affiliations_list matches the length of df['PMID']
if len(affiliations_list) == len(df): # and len(cited_by_list) == len(df):
    # Fill the 'Affiliations' column with the scraped values
    df['Affiliations'] = affiliations_list
    #df['Cited_By'] = cited_by_list
else:
    print("Lengths of the lists do not match the length of the DataFrame.")

# Display the updated DataFrame with new columns
print(df)


                                                  Title  \
0                               Streptococcus Pyogenes.   
1     Managing hemolyzed samples in clinical laborat...   
2     Causes, consequences and management of sample ...   
3     Clinical Applications of Hemolytic Markers in ...   
4     Mechanism of megaloblastic anemia combined wit...   
...                                                 ...   
9995  Platelet count and indices as postpartum hemor...   
9996  Simultaneous and rapid determination of 12 tyr...   
9997  Immunological Hallmarks of Inflammatory Status...   
9998  Can First-trimester AST to Platelet Ratio Inde...   
9999  Polysorbate-80 surface modified nano-stearylam...   

                                                Authors  \
0                                   Kanwal S, Vaitla P.   
1     Simundic AM, Baird G, Cadamuro J, Costelloe SJ...   
2     Heireman L, Van Geel P, Musger L, Heylen E, Uy...   
3                             Barcellini W, Fattizzo B.

In [51]:
# Drop the empty 'Cited_by' column from the DataFrame
df.drop('Cited_by', axis=1, inplace=True)

In [52]:
df

|  | Title | Authors | Journal | PMID | Publication_year | Affiliations |
|---|---|---|---|---|---|---|
| 0 | Streptococcus Pyogenes. | Kanwal S, Vaitla P. | 2023 Jul 31. In: StatPearls [Internet]. Treasu... | PMID: 32119415 | 2023 | 1 Allama Iqbal Medical College, Lahore, Pakist... |
| 1 | Managing hemolyzed samples in clinical laborat... | Simundic AM, Baird G, Cadamuro J, Costelloe SJ... | Crit Rev Clin Lab Sci. 2020 Jan;57(1):1-21. do... | PMID: 31603708 | 2020 | 1 Department of Medical Laboratory Diagnostics... |
| 2 | Causes, consequences and management of sample ... | Heireman L, Van Geel P, Musger L, Heylen E, Uy... | Clin Biochem. 2017 Dec;50(18):1317-1322. doi: ... | PMID: 28947321 | 2017 | 1 Department of Laboratory Medicine, Ziekenhui... |
| 3 | Clinical Applications of Hemolytic Markers in ... | Barcellini W, Fattizzo B. | Dis Markers. 2015;2015:635670. doi: 10.1155/20... | PMID: 26819490 | 2015 | 1 U.O. Oncoematologia, Fondazione IRCCS Ca' Gr... |
| 4 | Mechanism of megaloblastic anemia combined wit... | Wu Q, Liu J, Xu X, Huang B, Zheng D, Li J. | Bioengineered. 2021 Dec;12(1):6703-6712. doi: ... | PMID: 34542005 | 2021 | 1 Department of Hematology, The First Affliate... |
| ... | ... | ... | ... | ... | ... | ... |
| 9995 | Platelet count and indices as postpartum hemor... | van Dijk WEM, Nijdam JS, Haitjema S, de Groot ... | J Thromb Haemost. 2021 Nov;19(11):2873-2883. d... | PMID: 34339085 | 2021 | 1 Benign Hematology, Van Creveldkliniek, Unive... |
| 9996 | Simultaneous and rapid determination of 12 tyr... | Zhou L, Wang S, Chen M, Huang S, Zhang M, Bao ... | J Chromatogr B Analyt Technol Biomed Life Sci.... | PMID: 33991955 | 2021 | 1 Department of Clinical Pharmacy, Shanghai Ge... |
| 9997 | Immunological Hallmarks of Inflammatory Status... | Silva-Junior AL, Garcia NP, Cardoso EC, Dias S... | Front Immunol. 2021 Mar 11;12:559925. doi: 10.... | PMID: 33776989 | 2021 | 1 Programa de Pós-Graduação em Ciências Aplica... |
| 9998 | Can First-trimester AST to Platelet Ratio Inde... | Tolunay HE, Kahraman NC, Varli EN, Reis YA, Ce... | J Coll Physicians Surg Pak. 2021 Feb;31(2):188... | PMID: 33645187 | 2021 | 1 Perinatology Clinic, Etlik Zubeyde Hanim Mat... |


In [53]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
if duplicates > 0:
    print(f"Found {duplicates} duplicates.")
else:
    print("No duplicates found.")

No duplicates found.
