<a href="https://colab.research.google.com/github/Mingyang0816/Startup-Founders-Graph-Analysis/blob/main/Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web-Scraping Notebook

*   **Citations:** Given a journal article, extract the list of references it cites. Each citation is parsed with regular expressions and stored as a dictionary (the PDF function returns raw citation strings instead). A separate function handles each source format: IEEE (two citation styles), Nature, PLOS and PDF.

*   **Cited By:** Given a journal article, extract the list of all other articles that cite it. Each run scrapes at most 250 citations; restart the notebook to continue scraping.

*   **Profile:** Given a founder, extract basic information from their Google Scholar profile page.

*   **Publication:** Given a founder, extract the list of all their published articles from their Google Scholar profile page. Each run scrapes at most 250 articles; restart the notebook to continue scraping.

*   **Pubs without Profile:** If a founder has no Google Scholar profile page, use this code to scrape their list of published articles. Each run scrapes at most 250 articles; restart the notebook to continue scraping.


In [None]:
pip install pymupdf

Collecting pymupdf
  Downloading PyMuPDF-1.24.9-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.9 (from pymupdf)
  Downloading PyMuPDFb-1.24.9-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.9-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
Downloading PyMuPDFb-1.24.9-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.9 MB)
Installing collected packages: PyMuPDFb, pymupdf
Successfully installed PyMuPDFb-1.24.9 pymupdf-1.24.9


In [None]:
# Import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import fitz
import re

# Citations

## IEEE Version 1

In [None]:
def ieee_ref1(text_file):
    '''
    Retrieve citations from IEEE journal.
    Citation format: authors, title, pub, year

    Parameters
    ----------
    text_file: Text file that contains citations/references of IEEE journal.

    Return
    ------
    references: list of citations, each citation is a dictionary consisting of
                author, journal title, publication journal and year published.
    '''

    # Read file
    with open(text_file, "r") as file:
        ref_text = file.read().splitlines()

    # List of citations
    references = []

    for line in ref_text:

        if line != "":

            # Extract authors
            split_author = line.split(', "')
            author = split_author[0].strip()
            author = re.sub(r'^\d+\.\s', '', author)

            # Parse multiple authors into list
            authors = []
            first_split = author.split(", and ")
            for split in first_split:
                second_split = re.split(r',\s', split)
                for sub_split in second_split:
                    authors.extend(re.split(r'\s+and\s+', sub_split))

            # Extract title
            title_pub_year = ', "'.join(split_author[1:])
            split_tpy = re.split(r',?"\s*,?\s*', title_pub_year)
            title = split_tpy[0]

            # Extract publication and year
            pub_year = split_tpy[1].split(", ")
            year = pub_year[-1][-5:-1]
            pub = ", ".join(pub_year[:-1])

            # Store each citation as dictionary
            reference = {
                "author": authors,
                "title": title,
                "publication": pub,
                "year": year
            }

            references.append(reference)

    return references
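
The slice `pub_year[-1][-5:-1]` in `ieee_ref1` assumes that every citation ends with a four-digit year followed by exactly one character, such as a period. A more tolerant alternative, sketched here as an assumption and not as the notebook's own method, searches the field for a four-digit year. The helper name `extract_year` is hypothetical.

```python
import re

def extract_year(field):
    """Return the last 4-digit year (1900-2099) found in field, or "" if none."""
    years = re.findall(r'\b(?:19|20)\d{2}\b', field)
    return years[-1] if years else ""

# The slice only works when one trailing character follows the year
print("2012."[-5:-1])                  # 2012
print("2012"[-5:-1])                   # 201 (last digit lost)

# The regex-based helper handles both shapes
print(extract_year("2012."))           # 2012
print(extract_year("vol.31, 2012"))    # 2012
```

The pattern can still pick up a page range that happens to look like a year (for example `1990-2005`), so it is a trade-off, not a guaranteed improvement.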

In [None]:
# Retrieve citations of IEEE journal
ieee_refs = ieee_ref1("References.txt")
for ref in ieee_refs:
    print(ref)

{'author': ['R. Bernier', 'M. Bissonnette', 'P. Poitevin'], 'title': 'Dsa radar-development report', 'publication': 'in UAVSI', 'year': '2005'}
{'author': ['A. Bachrach', 'R. He', 'N. Roy'], 'title': 'Autonomous flight in unknown indoor environments', 'publication': 'International Journal of Micro Air Vehicles', 'year': '2009'}
{'author': ['A. Bry', 'A. Bachrach', 'N. Roy'], 'title': 'State estimation for aggressive flight in gps-denied environments using onboard sensing', 'publication': 'in ICRA', 'year': '2012'}
{'author': ['S. Scherer', 'S. Singh', 'L. Chamberlain', 'S. Saripalli'], 'title': 'Flying fast and low among obstacles', 'publication': 'in ICRA', 'year': '2007'}
{'author': ['A. Bachrach', 'S. Prentice', 'R. He', 'P. Henry', 'A. S. Huang', 'M. Krainin', 'D. Maturana', 'D. Fox', 'N. Roy'], 'title': 'Estimation, planning, and mapping for autonomous flight using an rgb-d camera in gps-denied environments', 'publication': 'Int.J.Rob.Res., vol.31', 'year': '2012'}
{'author': ['R.

## IEEE Version 2

In [None]:
def ieee_ref2(text_file):
    '''
    Retrieve citations from IEEE journal.
    Citation format: authors, year, title, (pub)

    Parameters
    ----------
    text_file: Text file that contains citations/references of IEEE journal.

    Return
    ------
    references: list of citations, each citation is a dictionary consisting of
                author, journal title, publication journal and year published.
    '''

    # Read file
    with open(text_file, "r") as file:
        ref_text = file.read().splitlines()

    # List of citations
    references = []

    for line in ref_text:

        if line != "":

            # Extract authors + year
            split_author_year = line.split(").")
            author_year = split_author_year[0]

            # Extract authors
            split_author = author_year.split("(")
            author = split_author[0].strip()
            author = re.sub(r'^\d+\.\s', '', author)

            # Parse multiple authors into list
            authors = []
            first_split = author.split(", and ")
            for split in first_split:
                second_split = re.split(r'(?<=\.),\s', split)
                for sub_split in second_split:
                    authors.extend(re.split(r'\s+and\s+', sub_split))

            # Extract year
            year = split_author[1]

            # Extract title
            title_pub = re.split(r'(?<=[a-z])\.\s(?=[A-Z])', ").".join(split_author_year[1:]))
            title = title_pub[0]

            # Extract publication (if exists)
            pub = "No pub"
            if len(title_pub) > 1:
                pub = ". ".join(title_pub[1:])

            # Store each citation as dictionary
            reference = {
                "author": authors,
                "title": title,
                "publication": pub,
                "year": year
            }

            references.append(reference)

    return references
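
In this citation style each author is written surname first (`Chung, Y.-C.`), so a plain comma split would separate surnames from initials. The lookbehind `(?<=\.)` in `ieee_ref2` splits only at commas that follow a period. The sketch below isolates that step; the helper name `split_surname_first` is hypothetical and the sample author field is modelled on the output shown further down.

```python
import re

def split_surname_first(author_field):
    """Split 'Surname, I., Surname, I., and Surname, I.' into one string per author."""
    authors = []
    for chunk in author_field.split(", and "):
        # Split only at ", " that directly follows a period (end of the initials)
        for part in re.split(r'(?<=\.),\s', chunk):
            authors.extend(re.split(r'\s+and\s+', part))
    return authors

field = "Chung, Y.-C., Wang, J.-M., and Chen, S.-W."

# A plain comma split breaks every author in two
print(re.split(r',\s', field))
# ['Chung', 'Y.-C.', 'Wang', 'J.-M.', 'and Chen', 'S.-W.']

# The lookbehind keeps surname and initials together
print(split_surname_first(field))
# ['Chung, Y.-C.', 'Wang, J.-M.', 'Chen, S.-W.']
```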

In [None]:
# Retrieve citations of IEEE journal
ieee_refs = ieee_ref2("References.txt")
for ref in ieee_refs:
    print(ref)

{'author': ['Audi MediaInfo'], 'title': ' Travolution promotes eco-friendly driving', 'publication': 'Available at http://www.audiusanews.com/newsrelease.do?id=1016=76.', 'year': '2008'}
{'author': ['Chung, Y.-C.', 'Wang, J.-M.', 'Chen, S.-W.'], 'title': ' A vision-based traffic light detection system at intersections', 'publication': 'J. Taiwan Normal University: Mathematics, Science  Technology, 47(1):67-86.', 'year': '2002'}
{'author': ['de Charette, R.', 'Nashashibi, F.'], 'title': ' Traffic light recognition using image processing compared to learning processes', 'publication': 'In Proc. IROS 2009, pages 333-338.', 'year': '2009'}
{'author': ['Fang, C. Y.', 'Fuh, C. S.', 'Yen, P. S.', 'Cherng, S.', 'Chen, S. W.'], 'title': ' An automatic road sign recognition system based on a computational model of human recognition processing', 'publication': 'Comput. Vis. Image Underst., 96(2):237-268.', 'year': '2004'}
{'author': ['Fleyeh, H.'], 'title': ' Road and traffic sign color detection

In [None]:
# Convert list into dataframe
ieee_refs_df = pd.DataFrame(ieee_refs)
ieee_refs_df.head(5)

Unnamed: 0,author,title,publication,year
0,"[R. Bernier, M. Bissonnette, P. Poitevin]",Dsa radar-development report,in UAVSI,2005
1,"[A. Bachrach, R. He, N. Roy]",Autonomous flight in unknown indoor environments,International Journal of Micro Air Vehicles,2009
2,"[A. Bry, A. Bachrach, N. Roy]",State estimation for aggressive flight in gps-...,in ICRA,2012
3,"[S. Scherer, S. Singh, L. Chamberlain, S. Sari...",Flying fast and low among obstacles,in ICRA,2007
4,"[A. Bachrach, S. Prentice, R. He, P. Henry, A....","Estimation, planning, and mapping for autonomo...","Int.J.Rob.Res., vol.31",2012


In [None]:
# Save dataframe as Excel sheet
ieee_refs_df.to_excel("references.xlsx", index = False)

## Nature

In [None]:
def nature_ref(url):
    '''
    Retrieve citations from Nature journal webpage.

    Parameters
    ----------
    url: Link to Nature journal.

    Return
    ------
    references: list of citations, each citation is a dictionary consisting of
                author, journal title, publication journal and year published.
    '''

    # Parse webpage
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }
    response = requests.get(url, headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # List of citations
    references = []

    # Extract citations
    for ref in soup.find_all("li", class_="c-article-references__item"):

        ref_text = ref.get_text(strip = True)

        # Clean end of citation
        if ref_text.endswith("ArticleCASGoogle Scholar"):
            ref_text = ref_text[:-25]

        elif ref_text.endswith("ArticleGoogle Scholar"):
            ref_text = ref_text[:-22]

        elif ref_text.endswith("PubMedGoogle Scholar"):
            ref_text = ref_text[:-20]

        # Extract authors
        split_author = re.split(r'\.\s(?=[A-Z][^.])', ref_text)
        author = split_author[0].strip() + "."

        if author.endswith("et al."):
            author = author[:-7].rstrip()

        authors = re.split(r'\s*&\s*|\s*(?<=\.),\s*', author)
        authors = [author.strip() for author in authors]

        # Extract title
        title_pub_year = '. '.join(split_author[1:])
        split_title = title_pub_year.split(".")
        title = split_title[0]

        # Extract publication
        pub_year = '.'.join(split_title[1:])
        split_year = pub_year.split(" (")
        pub = split_year[0]

        # Extract year
        year = ""
        if len(split_year) > 1:
            year = split_year[1]
            year = re.sub(r'[^0-9]+$', '', year)

        # Store each citation as dictionary
        reference = {
            "author": authors,
            "title": title,
            "publication": pub,
            "year": year
        }

        references.append(reference)

    return references
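
`nature_ref` removes the link labels at the end of each reference by checking three fixed suffixes and slicing a hard-coded number of characters. One possible alternative, shown as a sketch under the assumption that the labels are limited to the ones appearing in those three suffixes, removes any trailing run of labels with a single regular expression. The names `_LINK_LABELS` and `strip_link_labels` are hypothetical.

```python
import re

# Labels taken from the three suffixes handled in nature_ref (assumed complete)
_LINK_LABELS = re.compile(r'(?:Article|CAS|PubMed|Google Scholar)+$')

def strip_link_labels(ref_text):
    """Remove a trailing run of link labels; leave other text untouched."""
    return _LINK_LABELS.sub('', ref_text)

print(strip_link_labels("Traffic14, 767–777 (2013).ArticleCASGoogle Scholar"))
# Traffic14, 767–777 (2013).
print(strip_link_labels("Curr. Opin. Cell Biol.14, 476–482 (2002).PubMedGoogle Scholar"))
# Curr. Opin. Cell Biol.14, 476–482 (2002).
```

The trailing period stays in place, which is harmless here because the year cleanup in `nature_ref` already strips non-digit characters from the end of the year.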

In [None]:
# Retrieve citations from Nature webpage
nature_refs = nature_ref("https://www.nature.com/articles/nbt0509-485b")
for ref in nature_refs:
    print(ref)

In [None]:
# Convert list into dataframe
nature_refs_df = pd.DataFrame(nature_refs)
nature_refs_df.head(5)

Unnamed: 0,author,title,publication,year
0,"[Kim, Y. E., Hipp, M. S., Bracher, A., Hayer-H...",Molecular chaperone functions in protein foldi...,"Annu. Rev. Biochem.82, 323–355",2013
1,"[Hampton, R. Y.]",ER-associated degradation in protein quality c...,"Curr. Opin. Cell Biol.14, 476–482",2002
2,"[Amm, I., Sommer, T., Wolf, D. H.]",Protein quality control and elimination of pro...,"Biochim. Biophys. Acta1843, 182–196",2014
3,"[Finley, D.]",Recognition and processing of ubiquitin-protei...,"Annu. Rev. Biochem.78, 477–513",2009
4,"[Merulla, J., Fasana, E., Soldà, T., Molinari,...",Specificity and regulation of the endoplasmic ...,"Traffic14, 767–777",2013


In [None]:
# Save dataframe as Excel sheet
nature_refs_df.to_excel("references.xlsx", index = False)

## PLOS

In [None]:
def plos_ref(url):
    '''
    Retrieve citations from PLOS journal webpage.

    Parameters
    ----------
    url: Link to PLOS journal.

    Return
    ------
    references: list of citations, each citation is a dictionary consisting of
                author, journal title, publication journal and year published.
    '''

    # Parse webpage
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }
    response = requests.get(url, headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # List of citations
    references = []

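    # Extract reference section
    ref_section = soup.find("ol", class_ = "references")

    # No reference list found (e.g. blocked request or changed page layout)
    if ref_section is None:
        return references

    for ref in ref_section.find_all("li"):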

        # Extract citation
        ref_text = ref.get_text(strip = True)

        if ref_text != "View Article" and ref_text != "Google Scholar":

            # Clean end of citation
            if ref_text.endswith("View ArticleGoogle Scholar"):
                ref_text = ref_text[:-26]

            # Extract authors + year
            split_author_year = ref_text.split(") ")
            author_year = split_author_year[0]

            # Extract authors
            split_author = author_year.split("(")
            author = split_author[0].strip()
            author = re.sub(r'^\d+\.', '', author)

            if author.endswith(", et al."):
                author = author[:-8].rstrip()

            authors = author.split(", ")

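            # Extract year; entries without a "(year)" group get an empty year
            # instead of raising IndexError
            year = split_author[1] if len(split_author) > 1 else ""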

            # Extract title
            title_pub = re.split(r'\.\s|\?\s', ") ".join(split_author_year[1:]))
            title = title_pub[0]

            # Extract publication (if exists)
            pub = "No pub"
            if len(title_pub) > 1:
                pub = title_pub[1]

            # Store each citation as dictionary
            reference = {
                "author": authors,
                "title": title,
                "publication": pub,
                "year": year
            }

            references.append(reference)

    return references

In [None]:
# Retrieve citations from PLOS webpage
plos_refs = plos_ref("https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001535")
for ref in plos_refs:
    print(ref)

IndexError: list index out of range
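
A `list index out of range` error like the one recorded above typically comes from indexing past the result of `str.split` when the delimiter is absent, for example a reference entry that has no `(year)` group. The following is a minimal sketch of a guarded split; the helper name `split_author_year` and the sample strings are hypothetical.

```python
def split_author_year(author_year):
    """Split 'Authors (2010' into (authors, year); year is "" when no '(' is present."""
    parts = author_year.split("(", 1)
    author = parts[0].strip()
    year = parts[1] if len(parts) > 1 else ""
    return author, year

print(split_author_year("McElrath MJ, Haynes BF (2010"))
# ('McElrath MJ, Haynes BF', '2010')

# No "(" in the entry: the unguarded parts[1] would raise IndexError
print(split_author_year("World Health Organization. Annual report"))
# ('World Health Organization. Annual report', '')
```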

In [None]:
# Convert list into dataframe
plos_refs_df = pd.DataFrame(plos_refs)
plos_refs_df.head(5)

Unnamed: 0,author,title,publication,year
0,"[McElrath MJ, Haynes BF]",Induction of immunity to human immunodeficienc...,Immunity 33: 542–554.,2010
1,[Luciw PA],"Fields' Virology; Fields BN, Knipe DM, Howley ...",Philadelphia: Lippincott-Raven.,2002
2,"[Haffar OK, Dowbenko DJ, Berman PW]",Topogenic analysis of the human immunodeficien...,J Cell Biol 107: 1677–1687.,1988
3,"[Miyauchi K, Komano J, Yokomaku Y, Sugiura W, ...",Role of the specific amino acid sequence of th...,J Virol 79: 4720–4729.,2005
4,"[Shang L, Hunter E]",Residues in the membrane-spanning domain core ...,Virology 404: 158–167.,2010


In [None]:
# Save dataframe as Excel sheet
plos_refs_df.to_excel("references.xlsx", index = False)

## PDF

In [None]:
def pdf_ref(pdf_path):
    '''
    Retrieve citations from journal in PDF format.

    Parameters
    ----------
    pdf_path: file path of PDF file, with respect to current file.

    Return
    ------
    references: list of citations

    '''

    # Open PDF file
    pdf_document = fitz.open(pdf_path)

    # Extract text from all pages
    text = ''
    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        text += page.get_text()

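    # Find references section (last occurrence of the heading)
    heading = 'references'
    heading_idx = text.lower().rfind(heading)
    if heading_idx == -1:
        return []
    # Skip the heading and the character that follows it
    references_text = text[heading_idx + len(heading) + 1:]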

    # Use regex to find citation entries.
    # Only one pattern is applied: keep the line that matches the citation
    # style of the PDF uncommented (the numbered style is active here).

    # For references that start with Cap Letter
    # pattern = re.compile(r'\n(?=[A-Z])')

    # For references that start with (Cap Letter) followed by .
    # pattern = re.compile(r'\n(?=[A-Z]\.)')

    # For references that start with [
    # pattern = re.compile(r'\n(?=\[)')

    # For references that start with digit
    pattern = re.compile(r'\n(?=[0-9])')

    references_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', references_text)

    # Parse references into list
    references = pattern.split(references_text)
    references = [ref.strip() for ref in references if ref.strip()]
    references = [ref.replace('\n', '') for ref in references]

    return references
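
`pdf_ref` lists four split patterns, one per citation style. One way to make the choice explicit is to pass the style as a parameter. This is a sketch, not the notebook's implementation; `SPLIT_PATTERNS`, `split_references` and the style names are hypothetical, and the patterns themselves are copied from `pdf_ref`.

```python
import re

# One pattern per citation style, copied from pdf_ref
SPLIT_PATTERNS = {
    "capital": r'\n(?=[A-Z])',     # entries start with a capital letter
    "initial": r'\n(?=[A-Z]\.)',   # entries start with an initial, e.g. "A."
    "bracket": r'\n(?=\[)',        # entries start with "[1]"
    "digit":   r'\n(?=[0-9])',     # entries start with "1."
}

def split_references(references_text, style="digit"):
    """Split the text of a reference section into one string per entry."""
    parts = re.split(SPLIT_PATTERNS[style], references_text)
    # Collapse line breaks and repeated spaces inside each entry
    return [" ".join(part.split()) for part in parts if part.strip()]

numbered = "1. First ref line one\ncontinued here.\n2. Second ref.\n"
print(split_references(numbered, "digit"))
# ['1. First ref line one continued here.', '2. Second ref.']

bracketed = "[1] A. Author, Title,\n2020.\n[2] B. Author."
print(split_references(bracketed, "bracket"))
# ['[1] A. Author, Title, 2020.', '[2] B. Author.']
```

A continuation line that happens to begin with the style's leading character (a line starting with a page number under the `digit` style, for instance) is still split off as a separate entry, so the result needs a quick manual check.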

In [None]:
# Retrieve citations from journal PDF
ref = pdf_ref("Emily3.pdf")
for r in ref:
    print(r)

1.Sulston JE, Horvitz HR: Post-embryonic cell lineages of the nematode,Caenorhabditis elegans. Dev Biol 1977, 56:110–156.
2.Lodish HF: Molecular cell biology. In Molecular Cell Biology. 6th edition.New York: W.H. Freeman; 2008.
3.Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB,Frietze S, Harrow J, Kaul R, Khatun J, Lajoie BR, Landt SG, Lee BK, Pauli F,Rosenbloom KR, Sabo P, Safi A, Sanyal A, Shoresh N, Simon JM, Song L,Trinklein ND, Altshuler RC, Birney E, Brown JB, Cheng C, Djebali S, Dong X,Dunham I, et al: An integrated encyclopedia of DNA elements in thehuman genome. Nature 2012, 489:57–74.
4.Gerstein MB, Lu ZJ, van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY,Robilotto R, Rechtsteiner A, Ikegami K, Alves P, Chateigner A, Perry M, Morris M,Auerbach RK, Feng X, Leng J, Vielle A, Niu W, Rhrissorrakrai K, Agarwal A,Alexander RP, Barber G, Brdlik CM, Brennan J, Brouillet JJ, Carr A, Cheung MS,Clawson H, Contrino S, et al: Integrative analysis of the Caeno

# Cited By

In [None]:
def get_citations(soup):
    '''
    Retrieve citations from current page.

    Parameters
    ----------
    soup: BeautifulSoup object - parsed HTML webpage

    Return
    ------
    citations: list of publications, each publication is a dictionary consisting of
          title, authors, publication journal and snippet.

    '''

    # List of citations
    citations = []

    for element in soup.select(".gs_ri"):

        # Check if title present
        title_ele = element.select_one(".gs_rt a")

        if title_ele:

            # Title
            title = title_ele.text

            # Authors + Publication
            authors_ele = element.select_one(".gs_a")

            if authors_ele:

                # Format authors into list
                authors_pub = authors_ele.text.split("- ")
                authors = authors_pub[0].split(", ")

                # Remove trailing ellipsis and whitespace (incl. non-breaking spaces)
                authors[-1] = re.sub(r'(?:\.\.\.|…|\s)+$', '', authors[-1])

                # Publication (second "- " separated part, if present)
                if len(authors_pub) > 1:
                    publication = authors_pub[1]
                else:
                    publication = "No publication"

            else:
                authors = "No authors"
                publication = "No publication"

            # Snippet
            snippet_ele = element.select_one(".gs_rs")
            snippet = snippet_ele.text if snippet_ele else "No snippet"

            # Store each citation as dictionary
            cite = {
                "title": title,
                "authors": authors,
                "publication": publication,
                'snippet': snippet
            }

            # Add to list of citations
            citations.append(cite)

    return citations
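
The `.gs_a` line is parsed by splitting on `"- "` and then on `", "`. The sketch below isolates that step on a plain string, assuming the byline has the shape `authors - publication - source` suggested by the sample output in this notebook. The helper name `parse_byline` and the source part of the sample string (`academic.oup.com`) are hypothetical.

```python
import re

def parse_byline(text):
    """Split a Google Scholar byline into (authors, publication)."""
    parts = text.split("- ")
    # Drop a trailing ellipsis and whitespace (incl. non-breaking spaces) per author
    authors = [re.sub(r'(?:\.\.\.|…|\s)+$', '', name) for name in parts[0].split(", ")]
    publication = parts[1].strip() if len(parts) > 1 else "No publication"
    return authors, publication

byline = "RAL Cardiff, ID Faulkner, JG Beal\u00a0- Nucleic Acids\u00a0\u2026, 2024 - academic.oup.com"
authors, publication = parse_byline(byline)
print(authors)       # ['RAL Cardiff', 'ID Faulkner', 'JG Beal']
print(publication)
```

The publication is looked up from the number of `"- "` separated parts, so a single-author entry still gets its publication.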

In [None]:
def get_all_citations(cite_url, single_page):
    '''
    Retrieve citations (publications that cited original publication) in specified link.

    Parameters
    ----------
    cite_url: Google Scholar webpage of citations.
    single_page: only retrieve citations in current page.

    Return
    ------
    all_citations: list of all publications that cited original publication.

    '''

    # List of all citations
    all_citations = []

    # Parse webpage
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }

    # Only retrieve citations in current page
    if single_page:
        response = requests.get(cite_url, headers = headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        all_citations.extend(get_citations(soup))

    # Retrieve up to 250 citations (25 pages of 10 results each)
    else:
        curr_url = cite_url
        counter = 0

        while curr_url:

            # Reached maximum limit of 250
            if counter == 25:
                break

            # Parse current webpage
            response = requests.get(curr_url, headers = headers)
            print(response)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Retrieve citations in current webpage
            all_citations.extend(get_citations(soup))

            # Find "next" button (the cell is missing on the last page or when blocked)
            next_cell = soup.find('td', {'align': 'left', 'nowrap': ''})
            next_button = next_cell.find('a') if next_cell else None
            if next_button:

                # Move to next page
                curr_url = "https://scholar.google.com" + next_button['href']
                time.sleep(5)
                counter += 1

            else:
                curr_url = None

            print(curr_url)

    return all_citations
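
The page cap in `get_all_citations` (25 pages, hence at most 250 results) can be tested without network access by separating the loop from the fetching. The sketch below does that with a hypothetical callback `fetch_page(url)` that returns the items on a page and the URL of the next page; `collect_pages` and `FAKE_SITE` are illustrative names, not part of the notebook.

```python
def collect_pages(fetch_page, start_url, max_pages=25):
    """Follow next-page links, stopping after max_pages or when no next page exists.

    fetch_page(url) must return (items, next_url), with next_url None on the last page.
    """
    results = []
    url = start_url
    pages = 0
    while url and pages < max_pages:
        items, url = fetch_page(url)
        results.extend(items)
        pages += 1
    return results

# Fake three-page site standing in for network requests
FAKE_SITE = {
    "p1": (["a", "b"], "p2"),
    "p2": (["c", "d"], "p3"),
    "p3": (["e"], None),
}

print(collect_pages(FAKE_SITE.get, "p1"))               # ['a', 'b', 'c', 'd', 'e']
print(collect_pages(FAKE_SITE.get, "p1", max_pages=2))  # ['a', 'b', 'c', 'd']
```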

In [None]:
# Retrieve citations of webpage
citations = get_all_citations("https://scholar.google.com/scholar?cites=13730419627007653383&as_sdt=5,39&sciodt=0,39&hl=en", False)
for pub in citations:
    print(pub)

{'title': 'CRISPR-Cas tools for simultaneous transcription & translation control in bacteria', 'authors': ['RAL Cardiff', 'ID Faulkner', 'JG Beal'], 'publication': 'Nucleic Acids\xa0…, 2024 ', 'snippet': 'Robust control over gene translation at arbitrary mRNA targets is an outstanding challenge in microbial synthetic biology. The development of tools that can regulate translation will\xa0…'}


In [None]:
# Convert list into dataframe
citations_df = pd.DataFrame(citations)
citations_df.head(5)

Unnamed: 0,title,authors,publication,snippet
0,CRISPR-Cas tools for simultaneous transcriptio...,"[RAL Cardiff, ID Faulkner, JG Beal]","Nucleic Acids …, 2024",Robust control over gene translation at arbitr...


In [None]:
# Save dataframe as Excel sheet
citations_df.to_excel("citations.xlsx", index = False)

# Profile

In [None]:
def get_author_info(author_url):
    '''
    Retrieve basic information of author from Google Scholar profile page.

    Parameters
    ----------
    author_url: Google Scholar profile page of specified author.

    Return
    ------
    author_info: dictionary consisting of author's name, affiliation, fields, number of citations, h-index and i10-index.

    '''

    # Parse webpage
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }
    response = requests.get(author_url, headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Name
    name = soup.select_one("#gsc_prf_in").text

    # Affiliation
    affiliation = soup.select_one("#gsc_prf_inw+ .gsc_prf_il").text

    # Field
    field_ele = soup.select('.gsc_prf_inta')
    if field_ele:
        fields = []
        for element in field_ele:
            fields.append(element.text)
    else:
        fields = "No field"

    # Num citations, h-index and i10-index
    stats_ele = soup.select('.gsc_rsb_std')
    citations = stats_ele[0].text
    h_idx = stats_ele[2].text
    i10_idx = stats_ele[4].text

    # Store each author as dictionary
    author_info = {
        "name": name,
        "affiliation": affiliation,
        "field": fields,
        "num_citations": int(citations),
        "h_index": int(h_idx),
        "i10_index": int(i10_idx)
    }

    return author_info
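
`get_author_info` reads the statistics from cells 0, 2 and 4 of `.gsc_rsb_std`, presumably because the statistics table interleaves all-time values with a second column of recent values. Under that assumption, the sketch below picks the same cells from a plain list of strings and returns `None` for any value that is missing or non-numeric instead of raising. The helper name `parse_stats` and the sample cell values in the odd positions are hypothetical.

```python
def parse_stats(cells):
    """Map the statistics cells (in page order) to all-time metrics.

    Assumes cells 0, 2, 4 hold citations, h-index and i10-index.
    """
    labels = ["num_citations", "h_index", "i10_index"]
    stats = {}
    for label, idx in zip(labels, (0, 2, 4)):
        text = cells[idx] if idx < len(cells) else ""
        stats[label] = int(text) if text.isdigit() else None
    return stats

print(parse_stats(["2354", "1200", "23", "18", "30", "25"]))
# {'num_citations': 2354, 'h_index': 23, 'i10_index': 30}

# A profile page without a statistics table yields None values
print(parse_stats([]))
# {'num_citations': None, 'h_index': None, 'i10_index': None}
```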

In [None]:
# List of founder profile pages
founder_links = [
    "https://scholar.google.com/citations?user=MbcVEVwAAAAJ&hl=en",
    "https://scholar.google.com/citations?user=25wlvX8AAAAJ&hl=en&oi=ao",
    "https://scholar.google.com/citations?user=wFfKQlwAAAAJ&hl=en&oi=sra",
    "https://scholar.google.com/citations?user=ccnt9J4AAAAJ&hl=en&oi=sra",
    "https://scholar.google.com/citations?user=kDZB-mQAAAAJ&hl=en",
    "https://scholar.google.com/citations?user=tXD1nAcAAAAJ&hl=en",
    "https://scholar.google.com/citations?user=k0DUP3kAAAAJ",
    "https://scholar.google.com/citations?user=Cx1PHjgAAAAJ&hl=en&oi=ao",
    "https://scholar.google.com/citations?hl=en&user=iNlMvmsAAAAJ",
    "https://scholar.google.com/citations?user=4BhZfkEAAAAJ&hl=en&oi=ao",
    "https://scholar.google.com/citations?user=zgQ1xzkAAAAJ&hl=en&oi=ao"
]

In [None]:
# Generate founder profiles
founder_profiles = []
for link in founder_links:
    founder_profiles.append(get_author_info(link))

In [None]:
# Convert list into dataframe
founder_profiles_df = pd.DataFrame(founder_profiles)
founder_profiles_df.head(5)

Unnamed: 0,name,affiliation,field,num_citations,h_index,i10_index
0,Leonard Charles Jarrott,Lawrence Livermore National Laboratory,[Physics],2354,23,30
1,Austin Draycott,Graduate Student @ Yale,[Molecular Biology],375,2,2
2,David Weinberg,Freenome Inc.,No field,3414,14,17
3,Christopher R. Carlson,Unknown affiliation,"[Biochemistry, RNA]",512,7,6
4,Margaret Kocherga,UNC Charlotte,No field,96,7,5


In [None]:
# Save dataframe as Excel sheet
founder_profiles_df.to_excel("founder_profiles.xlsx", index = False)

In [None]:
# List of entrepreneur profile pages
entre_links = [
    "https://scholar.google.com/citations?user=nrxHZ50AAAAJ&hl=en&oi=ao",
    "https://scholar.google.com/citations?user=0b4S7moAAAAJ&hl=en&oi=ao",
    "https://scholar.google.com/citations?user=4bKmV08AAAAJ&hl=en&oi=ao",
    "https://scholar.google.com/citations?user=ouKJUyEAAAAJ&hl=en"
]

In [None]:
# Generate entrepreneur profiles
entre_profiles = []
for link in entre_links:
    entre_profiles.append(get_author_info(link))

In [None]:
# Convert list into dataframe
entre_profiles_df = pd.DataFrame(entre_profiles)
entre_profiles_df.head(5)

Unnamed: 0,name,affiliation,field,num_citations,h_index,i10_index
0,Emily Leproust,Twist Bioscience,"[DNA Synthesis, Synthetic Biology, Next Genera...",16976,53,86
1,Jonathan D. Steckbeck,Peptilogics,"[Peptide Antibiotics, Membrane Protein Biochem...",1620,18,23
2,Stephen Balaban,"CEO, Lambda Labs","[Deep Learning, Face Recognition, Convolutiona...",290,2,2
3,William Red Whittaker,Unknown affiliation,No field,5890,30,58


In [None]:
# Save dataframe as Excel sheet
entre_profiles_df.to_excel("entre_profiles.xlsx", index = False)

# Publication

In [None]:
def get_pubs(soup, author):
    '''
    Retrieve publications of author from current page.

    Parameters
    ----------
    soup: BeautifulSoup object - parsed HTML webpage
    author: first + last name of researcher.

    Return
    ------
    pubs: list of publications, each publication is a dictionary consisting of
          title, authors, author position, publication journal, year published, number of citations and citation link.

    '''

    # List of publications
    pubs = []

    for element in soup.select("#gsc_a_b .gsc_a_tr"):

        # Title
        title = element.select_one(".gsc_a_at").text

        # List of all authors (up to first 6)
        authors = element.select_one(".gsc_a_at+ .gs_gray").text.split(", ")
        if authors[-1] == "...":
            authors = authors[:-1]

        # Author position
        last_name = author.split(" ")[-1]
        position = next((idx for idx, name in enumerate(authors) if last_name in name), 6) + 1

        # Publication journal
        publication_ele = element.select_one(".gs_gray+ .gs_gray").text
        publication = publication_ele if publication_ele != "" else "No publication"

        # Year published
        year_ele = element.select_one(".gsc_a_h").text
        year = year_ele if year_ele else "No year"

        # Number of citations
        citation_ele = element.select_one(".gsc_a_ac")
        citations = int(citation_ele.text) if citation_ele.text.isdigit() else 0

        # Citation link
        citation_link = citation_ele.get('href') if citation_ele.text.isdigit() else "No citations"

        # Store each publication as dictionary
        pub = {
            "researcher": author,
            "title": title,
            "authors": authors,
            "position": position,
            "publication": publication,
            "year": year,
            "num_citations": citations,
            "citation_link": citation_link
        }

        # Add to list of publications
        pubs.append(pub)

    return pubs
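
The author position in `get_pubs` is the 1-based index of the first listed author whose name contains the researcher's last name, with a fallback of 7 when no listed author matches (the list on the profile page may be truncated). The sketch below isolates that rule; the helper name `author_position` and the sample author lists are hypothetical.

```python
def author_position(authors, researcher, not_found=6):
    """Return the 1-based position of researcher in authors, or not_found + 1."""
    last_name = researcher.split(" ")[-1]
    idx = next((i for i, name in enumerate(authors) if last_name in name), not_found)
    return idx + 1

listed = ["AB Zylstra", "OA Hurricane", "LC Jarrott"]
print(author_position(listed, "Leonard Charles Jarrott"))       # 3

# Researcher not among the listed authors: fallback position
print(author_position(listed[:2], "Leonard Charles Jarrott"))   # 7

# Substring matching can produce false positives
print(author_position(["J Wongsa", "F Wong"], "Felix Wong"))    # 1
```

A position of 7 therefore means "not among the listed authors", not "seventh author".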

In [None]:
def get_author_pubs(base_url, author):
    '''
    Retrieve all publications of specified author.

    Parameters
    ----------
    base_url: Google Scholar webpage of specified author.
    author: first + last name of researcher.

    Return
    ------
    all_pubs: list of all of author's publications.

    '''

    # Parse webpage
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }
    response = requests.get(base_url, headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Get all publications of author
    all_pubs = get_pubs(soup, author)

    while True:

        # Find "load_more" button
        load_more_button = soup.select_one("#gsc_bpf_more")
        if load_more_button and "disabled" not in load_more_button.attrs:

            # Get url of next page
            next_url = base_url + "&cstart=" + str(len(all_pubs)) + "&pagesize=100"
            response = requests.get(next_url, headers = headers)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Append new articles; stop if the page returned none,
            # otherwise the same URL would be requested forever
            new_pubs = get_pubs(soup, author)
            if not new_pubs:
                break
            all_pubs.extend(new_pubs)

            # Avoid sending too many requests
            time.sleep(1)

        else:
            break

    return all_pubs
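
`get_author_pubs` builds the next page URL by appending `&cstart=...&pagesize=100` to the base URL. A sketch of the same step with `urllib.parse` follows; it overwrites `cstart` and `pagesize` if they are already present, so the parameters are not duplicated. The helper name `next_page_url` is hypothetical, and the parameter names are the ones used in the notebook.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def next_page_url(base_url, start, page_size=100):
    """Return base_url with cstart and pagesize set (replacing existing values)."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query["cstart"] = str(start)
    query["pagesize"] = str(page_size)
    return urlunsplit(parts._replace(query=urlencode(query)))

url = next_page_url("https://scholar.google.com/citations?user=MbcVEVwAAAAJ&hl=en", 20)
print(url)
# https://scholar.google.com/citations?user=MbcVEVwAAAAJ&hl=en&cstart=20&pagesize=100

# Applying it again replaces cstart, it does not append a second one
print(next_page_url(url, 120))
# https://scholar.google.com/citations?user=MbcVEVwAAAAJ&hl=en&cstart=120&pagesize=100
```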

In [None]:
# List of Google Scholar webpages for founders
founders = [("https://scholar.google.com/citations?user=MbcVEVwAAAAJ&hl=en", "Leonard Charles Jarrott"),
           ("https://scholar.google.com/citations?user=25wlvX8AAAAJ&hl=en&oi=ao", "Austin Draycott"),
           ("https://scholar.google.com/citations?user=wFfKQlwAAAAJ&hl=en&oi=sra", "David Weinberg"),
           ("https://scholar.google.com/citations?user=ccnt9J4AAAAJ&hl=en&oi=sra", "Christopher R Carlson"),
           ("https://scholar.google.com/citations?user=kDZB-mQAAAAJ&hl=en", "Margaret Kocherga"),
           ("https://scholar.google.com/citations?user=tXD1nAcAAAAJ&hl=en", "Felix Wong"),
           ("https://scholar.google.com/citations?user=k0DUP3kAAAAJ", "Maxwell Z Wilson"),
           ("https://scholar.google.com/citations?user=Cx1PHjgAAAAJ&hl=en&oi=ao", "Daniele Foresti"),
           ("https://scholar.google.com/citations?hl=en&user=iNlMvmsAAAAJ", "Stuart Diller"),
           ("https://scholar.google.com/citations?user=4BhZfkEAAAAJ&hl=en&oi=ao", "Galen Clark Haynes"),
           ("https://scholar.google.com/citations?user=zgQ1xzkAAAAJ&hl=en&oi=ao", "Jason Fontana")
           ]

In [None]:
# Retrieve publications of all founders
founder_pubs = []

for (link, founder) in founders:
    founder_pubs.extend(get_author_pubs(link, founder))
    print(f"{founder} completed")

Leonard Charles Jarrott completed
Austin Draycott completed
David Weinberg completed
Christopher R Carlson completed
Margaret Kocherga completed
Felix Wong completed
Maxwell Z Wilson completed
Daniele Foresti completed
Stuart Diller completed
Galen Clark Haynes completed
Jason Fontana completed


In [None]:
# Convert list into dataframe
founder_pubs_df = pd.DataFrame(founder_pubs)
founder_pubs_df.head(5)

Unnamed: 0,researcher,title,authors,position,publication,year,num_citations,citation_link
0,Leonard Charles Jarrott,Burning plasma achieved in inertial fusion,"[AB Zylstra, OA Hurricane, DA Callahan, AL Kri...",7,"Nature 601 (7894), 542-548, 2022",2022,400,https://scholar.google.com/scholar?oi=bibs&hl=...
1,Leonard Charles Jarrott,Constraints on sub-GeV dark-matter–electron sc...,"[P Agnes, IFM Albuquerque, T Alexander, AK Alt...",7,"Physical review letters 121 (11), 111303, 2018",2018,274,https://scholar.google.com/scholar?oi=bibs&hl=...
2,Leonard Charles Jarrott,Lawson criterion for ignition exceeded in an i...,"[H Abu-Shawareb, R Acree, P Adams, J Adams, B ...",7,"Physical review letters 129 (7), 075001, 2022",2022,270,https://scholar.google.com/scholar?oi=bibs&hl=...
3,Leonard Charles Jarrott,Focusing of short-pulse high-intensity laser-a...,"[T Bartal, ME Foord, C Bellei, MH Key, KA Flip...",7,"Nature Physics 8 (2), 139-142, 2012",2012,170,https://scholar.google.com/scholar?oi=bibs&hl=...
4,Leonard Charles Jarrott,Design of inertial fusion implosions reaching ...,"[AL Kritcher, CV Young, HF Robey, CR Weber, AB...",7,"Nature Physics 18 (3), 251-258, 2022",2022,137,https://scholar.google.com/scholar?oi=bibs&hl=...


In [None]:
# Save dataframe as Excel sheet
founder_pubs_df.to_excel("founder_pubs.xlsx", index = False)

In [None]:
# List of Google Scholar webpages for entrepreneurs
entrepreneurs = [("https://scholar.google.com/citations?user=nrxHZ50AAAAJ&hl=en&oi=ao", "Emily Leproust"),
                 ("https://scholar.google.com/citations?user=0b4S7moAAAAJ&hl=en&oi=ao", "Jonathan D Steckbeck"),
                 ("https://scholar.google.com/citations?user=4bKmV08AAAAJ&hl=en&oi=ao", "Stephen Balaban"),
                 ("https://scholar.google.com/citations?user=ouKJUyEAAAAJ&hl=en", "William Red Whittaker")]

In [None]:
# Retrieve publications of all entrepreneurs
entrepreneur_pubs = []

for (link, entrepreneur) in entrepreneurs:
    entrepreneur_pubs.extend(get_author_pubs(link, entrepreneur))
    print(f"{entrepreneur} completed")

Emily Leproust completed
Jonathan D Steckbeck completed
Stephen Balaban completed
William Red Whittaker completed


In [None]:
# Convert list into dataframe
entrepreneur_pubs_df = pd.DataFrame(entrepreneur_pubs)
entrepreneur_pubs_df.head(5)

Unnamed: 0,researcher,title,authors,position,publication,year,num_citations,citation_link
0,Emily Leproust,Solution hybrid selection with ultra-long olig...,"[A Gnirke, A Melnikov, J Maguire, P Rogov, EM ...",7,"Nature biotechnology 27 (2), 182-189, 2009",2009,1731,https://scholar.google.com/scholar?oi=bibs&hl=...
1,Emily Leproust,The DNA-encoded nucleosome organization of a e...,"[N Kaplan, IK Moore, Y Fondufe-Mittendorf, AJ ...",7,"Nature 458 (7236), 362-366, 2009",2009,1427,https://scholar.google.com/scholar?oi=bibs&hl=...
2,Emily Leproust,Targeted and genome-scale strategies reveal ge...,"[MP Ball, JB Li, Y Gao, JH Lee, EM LeProust, I...",7,"Nature biotechnology 27 (4), 361-368, 2009",2009,1241,https://scholar.google.com/scholar?oi=bibs&hl=...
3,Emily Leproust,"Towards practical, high-capacity, low-maintena...","[N Goldman, P Bertone, S Chen, C Dessimoz, EM ...",7,"nature 494 (7435), 77-80, 2013",2013,1210,https://scholar.google.com/scholar?oi=bibs&hl=...
4,Emily Leproust,Mapping long-range promoter contacts in human ...,"[B Mifsud, F Tavares-Cadete, AN Young, R Sugar...",7,"Nature genetics 47 (6), 598-606, 2015",2015,1023,https://scholar.google.com/scholar?oi=bibs&hl=...


In [None]:
# Save dataframe as Excel sheet
entrepreneur_pubs_df.to_excel("entrepreneur_pubs.xlsx", index = False)

# Pubs without Profile

In [None]:
def get_pubs_without_profile(soup, author):
    '''
    Retrieve publications of author from current page.

    Parameters
    ----------
    soup: BeautifulSoup object - parsed HTML webpage
    author: first + last name of researcher.

    Return
    ------
    pubs: list of publications, each publication is a dictionary consisting of
          title, authors, author position, publication journal, year published, number of citations and citation link.

    '''

    # List of publications
    pubs = []

    for element in soup.select(".gs_ri"):

        # Check if title present
        title_ele = element.select_one(".gs_rt")

        if title_ele:

            # Title
            title = title_ele.text.strip()

            # Authors + Publication + Year
            all_ele = element.select_one(".gs_a")

            if all_ele:

                info_list = all_ele.text.strip().split("- ")

                # Format authors into list
                authors = info_list[0].strip().split(", ")

                # Remove starting characters
                if authors[0] == "…":
                    authors = authors[1:]

                # Remove trailing characters (guard against an empty list)
                while authors and authors[-1].endswith(("…", "\xa0")):
                    authors[-1] = authors[-1][:-1]

                # Author position
                last_name = author.split(" ")[-1]
                position = next((idx for idx, name in enumerate(authors) if last_name in name), 6) + 1

                # Publication + Year
                if len(info_list) > 1:
                    pub_year = info_list[1].strip().split(", ")
                    year = pub_year[-1]
                    publication = pub_year[0] if len(pub_year) > 1 else "No publication"

                    # Remove starting characters
                    if publication.startswith("…\xa0"):
                        publication = publication[2:]

                    # Remove trailing characters
                    if publication.endswith("\xa0…"):
                        publication = publication[:-2]

                else:
                    publication = "No publication"
                    year = "No year"

            else:
                authors = "No authors"
                position = 7  # same "not found" value the author lookup falls back to
                publication = "No publication"
                year = "No year"

            # Citation (defaults apply when the result has no "Cited by" link)
            num_citations = 0
            citation_link = "No citation link"
            citation_ele = element.select_one('.gs_fl a:-soup-contains("Cited by")')
            if citation_ele:
                num_citations = int(citation_ele.text.split("Cited by")[-1].strip())
                citation_link = citation_ele["href"]

            # Store each publication as dictionary (only when a title was found)
            pub = {
                "researcher": author,
                "title": title,
                "authors": authors,
                "position": position,
                "publication": publication,
                "year": year,
                "num_citations": num_citations,
                "citation_link": citation_link
            }

            # Add to list of publications
            pubs.append(pub)

    return pubs
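
The author/publication/year parsing used in `get_pubs_without_profile` can be tried offline on a single byline string. The sketch below is only an illustration: the helper name `parse_byline` and the sample string are hypothetical (the sample is made up to match the shape the parser expects, not scraped output), and the 7 returned for a missing author mirrors the fallback in the function above.

```python
def parse_byline(byline, author):
    """Hypothetical helper: split a Scholar-style byline into its fields."""
    info_list = byline.strip().split("- ")

    # Authors: drop a leading "…" entry and trailing "…" / non-breaking spaces
    authors = info_list[0].strip().split(", ")
    if authors and authors[0] == "…":
        authors = authors[1:]
    while authors and authors[-1].endswith(("…", "\xa0")):
        authors[-1] = authors[-1][:-1]

    # 1-based author position; 7 when the last name is not in the visible list
    last_name = author.split(" ")[-1]
    position = next((i for i, name in enumerate(authors) if last_name in name), 6) + 1

    # Publication and year come from the second "- "-separated field
    publication, year = "No publication", "No year"
    if len(info_list) > 1:
        pub_year = info_list[1].strip().split(", ")
        year = pub_year[-1]
        if len(pub_year) > 1:
            publication = pub_year[0]
        if publication.startswith("…\xa0"):
            publication = publication[2:]
        if publication.endswith("\xa0…"):
            publication = publication[:-2]

    return {"authors": authors, "position": position,
            "publication": publication, "year": year}


# Made-up sample in the shape the parser expects (not scraped output)
sample = "C Urmson, J Anhalt, D Bagnell, C Baker…\xa0- Journal of field\xa0…, 2008 - example.com"
parsed = parse_byline(sample, "Chris Urmson")
print(parsed)
```

Because Scholar truncates long author lists and venue names, the parsed `publication` is often a fragment, which is why values such as "Journal of field" appear in the scraped results.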

In [None]:
def get_all_pubs_without_profile(cite_url, author, num_pages, single_page):
    '''
    Retrieve all of author's publications in specified link.

    Parameters
    ----------
    cite_url: Google Scholar webpage of publications.
    author: first + last name of researcher.
    num_pages: number of webpages to scrape. Maximum value: 25.
    single_page: only retrieve publications in current page.

    Return
    ------
    all_pubs: list of all publications of specified author.

    '''

    # List of all publications
    all_pubs = []

    # Parse webpage
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }

    # Only retrieve publications in current page
    if single_page:
        response = requests.get(cite_url, headers = headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        all_pubs.extend(get_pubs_without_profile(soup, author))

    # Retrieve following publications (max 250)
    else:
      curr_url = cite_url
      counter = 0

      while curr_url:

          # Reached maximum limit
          if counter == num_pages:
              break

          # Parse current webpage
          response = requests.get(curr_url, headers = headers)
          print(response)
          soup = BeautifulSoup(response.text, 'html.parser')

          # Retrieve publications in current webpage
          all_pubs.extend(get_pubs_without_profile(soup, author))

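          # Find "next" button (the navigation cell is absent on the last page or when blocked)
          next_cell = soup.find('td', {'align': 'left', 'nowrap': ''})
          next_button = next_cell.find('a') if next_cell else None
          if next_button: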

              # Move to next page
              curr_url = "https://scholar.google.com" + next_button['href']
              time.sleep(5)
              counter += 1

          else:
              curr_url = None

          print(curr_url)

    return all_pubs
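
`get_all_pubs_without_profile` moves through results by following the "next" link on each page. As an alternative way to see which pages a given `num_pages` would cover, the sketch below builds the page URLs directly from a `start` offset. This is an assumption-laden illustration, not the notebook's method: `page_urls` and `RESULTS_PER_PAGE` are hypothetical names, and the step of 10 is inferred from the `start=10`, `start=20`, ... URLs printed by the scraper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: 10 results per page, inferred from the printed "start=" values
RESULTS_PER_PAGE = 10


def page_urls(search_url, num_pages):
    """Hypothetical helper: list the URLs of the first num_pages result pages."""
    parts = urlsplit(search_url)

    # Keep every query parameter except an existing "start" offset
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key != "start"]

    urls = []
    for page in range(num_pages):
        query = urlencode(params + [("start", str(page * RESULTS_PER_PAGE))])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls


urls = page_urls("https://scholar.google.com/scholar?hl=en&q=chris+urmson", 3)
for url in urls:
    print(url)
```

With the documented maximum of 25 pages, this offset scheme would cover at most 250 results, which matches the per-run limit stated for this notebook.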

In [None]:
# Retrieve publications of Chris Urmson
urmson_pubs = get_all_pubs_without_profile("https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=chris+urmson&btnG=&oq=chris+urmson", "Chris Urmson", 7, False)
for pub in urmson_pubs:
    print(pub)

<Response [200]>
https://scholar.google.com/scholar?start=10&q=chris+urmson&hl=en&as_sdt=0,5
<Response [200]>
https://scholar.google.com/scholar?start=20&q=chris+urmson&hl=en&as_sdt=0,5
<Response [200]>
https://scholar.google.com/scholar?start=30&q=chris+urmson&hl=en&as_sdt=0,5
<Response [200]>
https://scholar.google.com/scholar?start=40&q=chris+urmson&hl=en&as_sdt=0,5
<Response [200]>
https://scholar.google.com/scholar?start=50&q=chris+urmson&hl=en&as_sdt=0,5
<Response [200]>
https://scholar.google.com/scholar?start=60&q=chris+urmson&hl=en&as_sdt=0,5
<Response [200]>
https://scholar.google.com/scholar?start=70&q=chris+urmson&hl=en&as_sdt=0,5
{'researcher': 'Chris Urmson', 'title': 'Autonomous driving in urban environments: Boss and the urban challenge', 'authors': ['C Urmson', 'J Anhalt', 'D Bagnell', 'C Baker'], 'position': 1, 'publication': 'Journal of field', 'year': '2008', 'num_citations': 2429, 'citation_link': '/scholar?cites=10041822319387343277&as_sdt=2005&sciodt=0,5&hl=en'}
{'researcher': 'C

In [None]:
# Convert list into dataframe
urmson_pubs_df = pd.DataFrame(urmson_pubs)
urmson_pubs_df.head(5)

Unnamed: 0,researcher,title,authors,position,publication,year,num_citations,citation_link
0,Chris Urmson,Autonomous driving in urban environments: Boss...,"[C Urmson, J Anhalt, D Bagnell, C Baker]",1,Journal of field,2008,2429,/scholar?cites=10041822319387343277&as_sdt=200...
1,Chris Urmson,Motion planning for autonomous driving with a ...,"[M McNaughton, C Urmson, JM Dolan]",2,on Robotics and,2011,528,/scholar?cites=4848134244943377378&as_sdt=2005...
2,Chris Urmson,Traffic light mapping and detection,"[N Fairfield, C Urmson]",2,2011 IEEE international conference on,2011,270,/scholar?cites=16774767332325565196&as_sdt=200...
3,Chris Urmson,Approaches for heuristically biasing RRT growth,"[C Urmson, R Simmons]",1,Robots and Systems (IROS 2003)(Cat,2003,481,/scholar?cites=17227151616455674617&as_sdt=200...
4,Chris Urmson,[PDF][PDF] Tartan racing: A multi-modal approa...,"[C Urmson, JA Bagnell, C Baker, M Hebert, A Ke...",1,No publication,2007,140,/scholar?cites=17777002848663963207&as_sdt=200...
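
The `citation_link` values in this table are relative paths (`/scholar?cites=...`), whereas the profile-based tables earlier hold full URLs. A minimal sketch of one way to make them absolute before saving, assuming `https://scholar.google.com` as the base (the same prefix the scraper uses for the "next" link); `absolutize_links` is a hypothetical helper name, and the sample record reuses the first row shown above.

```python
from urllib.parse import urljoin

# Assumed base URL, matching the prefix used when following the "next" link
SCHOLAR_BASE = "https://scholar.google.com"


def absolutize_links(pubs, base=SCHOLAR_BASE):
    """Hypothetical helper: return copies of pubs with absolute citation links."""
    fixed = []
    for pub in pubs:
        pub = dict(pub)  # copy so the input list is left unchanged
        link = pub.get("citation_link")
        # Only rewrite relative paths; placeholders and full URLs stay as-is
        if isinstance(link, str) and link.startswith("/"):
            pub["citation_link"] = urljoin(base, link)
        fixed.append(pub)
    return fixed


sample_pubs = [{
    "researcher": "Chris Urmson",
    "citation_link": "/scholar?cites=10041822319387343277&as_sdt=2005&sciodt=0,5&hl=en",
}]
fixed_pubs = absolutize_links(sample_pubs)
print(fixed_pubs[0]["citation_link"])
```

Applying this to `urmson_pubs` before building the dataframe would keep the link column consistent with the founder and entrepreneur sheets.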


In [None]:
# Save dataframe as Excel sheet
urmson_pubs_df.to_excel("urmson.xlsx", index = False)