# Introduction to Web Scraping with Python

Welcome to this Jupyter Notebook tutorial on web scraping with Python! In this lesson, we'll explore the fascinating world of web scraping, a technique used to extract data from websites. 

## What is Web Scraping?

Web scraping is the process of extracting information from websites using automated tools or scripts. It's a powerful technique used in various fields such as data science, research, and business intelligence to gather data from the web for analysis and decision-making.

## Why Learn Web Scraping?

Understanding web scraping opens up a world of possibilities for accessing and utilizing data available on the internet. Whether you're a data scientist looking to gather research data, a business analyst tracking market trends, or a hobbyist curious about extracting information from your favorite websites, web scraping skills are invaluable.

## What Will You Learn in This Tutorial?

In this tutorial, we'll cover the following topics:
- How to set up your environment for web scraping
- Using popular Python libraries such as BeautifulSoup and Selenium for scraping
- Navigating through web pages and extracting data
- Handling different types of web content like HTML, XML, and JSON
- Best practices and ethical considerations in web scraping
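
As a small illustration of the last point, a scraper can consult a site's `robots.txt` rules before fetching a page. The sketch below uses only the standard library and parses a made-up rule set from memory. The rules, the `tutorial-bot` user agent, and the `example.org` URLs are assumptions for illustration and do not describe NCBI's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site)
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules from memory; no network request is made

# Ask whether a (hypothetical) user agent may fetch two example URLs
allowed = parser.can_fetch("tutorial-bot", "https://example.org/assembly/?term=test")
blocked = parser.can_fetch("tutorial-bot", "https://example.org/private/report")
delay = parser.crawl_delay("tutorial-bot")

print("Search page allowed:", allowed)
print("Private page allowed:", blocked)
print("Seconds to wait between requests:", delay)
```

Against a real site, you would point the parser at the site's own file with `set_url(...)` and `read()` instead of calling `parse(...)` on a hand-written list.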

By the end of this lesson, you'll have a solid understanding of the fundamentals of web scraping and be equipped with the knowledge to start scraping data from the web on your own!

Let's dive in!


In [None]:
# Importing necessary libraries
from bs4 import BeautifulSoup
from selenium import webdriver

# Your scraping code goes here


In [10]:
# After importing libraries, let's understand the webpage structure.
# We'll load the webpage and use an HTML parser to examine its structure.

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

try:
    # Construct the search URL for assembly
    search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

    # Navigate to the search URL
    driver.get(search_url)

    # Get the page source after Selenium waits for the page to fully load
    page_source = driver.page_source

    # Use BeautifulSoup to parse the page source
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find all dl elements containing organism information
    organism_dl_elements = soup.find_all("dl")

    if organism_dl_elements:
        print("Organism DL elements found on the webpage:")
        # Printing out each dl element containing organism information
        for dl_element in organism_dl_elements:
            print(dl_element)

    else:
        print("No organism DL elements found on the webpage.")

except Exception as e:
    print("An error occurred:", e)

finally:
    # Close the browser
    driver.quit()


Organism DL elements found on the webpage:
<dl class="details"><dt>Organism: </dt><dd><b>Streptomyces anthocyanicus</b> (high G+C Gram-positive bacteria)</dd></dl>
<dl class="details"><dt>Infraspecific name: </dt><dd>Strain: JCM 4739</dd></dl>
<dl class="details"><dt>Submitter: </dt><dd>WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)</dd></dl>
<dl class="details"><dt>Date: </dt><dd>2020/09/12</dd></dl>
<dl class="details"><dt>Assembly level: </dt><dd>Scaffold</dd></dl>
<dl class="details"><dt>Genome representation: </dt><dd>full</dd></dl>
<dl class="details"><dt>RefSeq category: </dt><dd>representative genome</dd></dl>
<dl class="details"><dt>Relation to type material: </dt><dd>assembly from synonym type material</dd></dl>
<dl class="details"><dt>GenBank assembly accession: </dt><dd>GCA_014650795.1 (<b>latest</b>) </dd></dl>
<dl class="details"><dt>RefSeq assembly accession: </dt><dd>GCF_014650795.1 (<b>latest</b>) </dd></dl>
<dl class="rprtid"><dt>IDs:</dt> <dd><span>8120941 [

Next, we will extract specific information from the HTML content using BeautifulSoup.

Consider the strain "JCM 5058" of the organism "Streptomyces anthocyanicus" in the HTML.
Once "JCM 5058" is located, we will retrieve the organism name, GenBank accession, and RefSeq accession associated with that strain.
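
Before running this against the live page, the label/value pairing can be tried offline. The sketch below is a stand-in that uses the standard library's `html.parser` rather than BeautifulSoup, so it runs without a browser or extra packages. The fragment is copied from the printed output above, and the class name `DescriptionListParser` is hypothetical.

```python
from html.parser import HTMLParser

# Fragment copied from the printed output above (first search result)
SNIPPET = (
    '<dl class="details"><dt>Organism: </dt><dd><b>Streptomyces anthocyanicus</b>'
    ' (high G+C Gram-positive bacteria)</dd></dl>'
    '<dl class="details"><dt>Infraspecific name: </dt><dd>Strain: JCM 4739</dd></dl>'
    '<dl class="details"><dt>GenBank assembly accession: </dt>'
    '<dd>GCA_014650795.1 (<b>latest</b>) </dd></dl>'
)


class DescriptionListParser(HTMLParser):
    """Collect (label, value) pairs from <dt>/<dd> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []       # finished (dt text, dd text) pairs
        self._current = None  # "dt" or "dd" while inside one of them
        self._label = ""
        self._value = ""

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._current, self._label = "dt", ""
        elif tag == "dd":
            self._current, self._value = "dd", ""

    def handle_data(self, data):
        # Text inside nested tags such as <b> is appended as well
        if self._current == "dt":
            self._label += data
        elif self._current == "dd":
            self._value += data

    def handle_endtag(self, tag):
        if tag == "dd":
            self.pairs.append((self._label.strip(), self._value.strip()))
        if tag in ("dt", "dd"):
            self._current = None


parser = DescriptionListParser()
parser.feed(SNIPPET)
parser.close()

fields = dict(parser.pairs)
print(fields["Infraspecific name:"])
print(fields["GenBank assembly accession:"].split()[0])
```

Once the pairs are in a dictionary, checking the strain and reading the accession are plain lookups, which is the same idea the next cell applies with BeautifulSoup.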


In [11]:
from selenium import webdriver
from bs4 import BeautifulSoup

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

# Get the page source after Selenium waits for the page to fully load
page_source = driver.page_source

# Close the browser now that the HTML has been captured
driver.quit()

# Use BeautifulSoup to parse the page source
soup = BeautifulSoup(page_source, 'html.parser')

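# Initialize variables to store extracted information
organism = None
genbank = None
refseq = None
last_organism = None       # Organism of the record currently being read
looking_after_jcm = False  # Flag to track if we've seen "JCM 5058"

# Each <dt> label is followed by its <dd> value, so pair them via the sibling
for dt in soup.find_all("dt"):
    dd = dt.find_next_sibling("dd")
    if dd is None:
        continue
    dt_text = dt.get_text(strip=True)
    dd_text = dd.get_text(strip=True)

    if dt_text == "Organism:":
        if looking_after_jcm:
            break  # The next record has started, so stop scanning
        # "Organism:" comes before the strain line, so remember it for later
        bold = dd.find("b")
        last_organism = bold.get_text(strip=True) if bold else dd_text
    elif dt_text == "Infraspecific name:" and "JCM 5058" in dd_text:
        looking_after_jcm = True
        organism = last_organism
    elif looking_after_jcm:
        if dt_text == "GenBank assembly accession:":
            genbank = dd_text.split()[0]
        elif dt_text == "RefSeq assembly accession:":
            refseq = dd_text.split()[0]

        # Stop once both accessions of the matched record are collected
        if genbank and refseq:
            break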

# Print the extracted information
if organism and genbank and refseq:
    print("Organism:", organism)
    print("GenBank:", genbank)
    print("RefSeq:", refseq)
else:
    print("Information not found.")


Organism: Streptomyces anthocyanicus
GenBank: GCA_014651155.1
RefSeq: GCF_014651155.1


The code looks for the strain "JCM 5058" of "Streptomyces anthocyanicus" in the parsed page.
Once it has gathered the organism name and both accessions for that strain, it stops processing further data and prints the extracted details.
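
The flag-based scan can be tried without a browser on plain `(label, value)` rows. This is a minimal sketch: the rows are copied (organism values abridged) from two records in this notebook's search results, and the function name `find_strain` is hypothetical.

```python
# (label, value) rows for two records, copied from this notebook's search results
ROWS = [
    ("Organism:", "Streptomyces anthocyanicus"),
    ("Infraspecific name:", "Strain: JCM 3037"),
    ("GenBank assembly accession:", "GCA_014647695.1 (latest)"),
    ("RefSeq assembly accession:", "GCF_014647695.1 (latest)"),
    ("Organism:", "Streptomyces anthocyanicus"),
    ("Infraspecific name:", "Strain: JCM 5058"),
    ("GenBank assembly accession:", "GCA_014651155.1 (latest)"),
    ("RefSeq assembly accession:", "GCF_014651155.1 (latest)"),
]


def find_strain(rows, strain):
    """Return organism and accessions for the record whose strain matches, or None."""
    last_organism = None  # organism of the record currently being read
    found = None          # becomes a dict once the strain line is seen
    for label, value in rows:
        if label == "Organism:":
            if found is not None:
                break  # the next record has started, so stop scanning
            last_organism = value
        elif label == "Infraspecific name:" and strain in value:
            found = {"organism": last_organism, "genbank": None, "refseq": None}
        elif found is not None:
            if label == "GenBank assembly accession:":
                found["genbank"] = value.split()[0]
            elif label == "RefSeq assembly accession:":
                found["refseq"] = value.split()[0]
    return found


result = find_strain(ROWS, "JCM 5058")
print(result)
```

Remembering the most recent organism matters because the "Organism:" line comes before the strain line within each record.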

Next, we reach the same result with a different strategy. Instead of scanning every `dt`/`dd` pair on the page, we collect each search result (a `div` with class `rprt`) and read only the one that mentions "JCM 5058".
Selenium still loads the page and BeautifulSoup still parses it:


In [13]:
from selenium import webdriver
from bs4 import BeautifulSoup

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

# Set an implicit wait: later find_element calls will retry for up to 10 seconds.
# This call does not pause the script; driver.get() above already waited for the page to load.
driver.implicitly_wait(10)

# Get the page source after Selenium waits for the page to fully load
page_source = driver.page_source

# Use BeautifulSoup to parse the page source
soup = BeautifulSoup(page_source, 'html.parser')

# Find all div elements containing assembly information
assembly_divs = soup.find_all("div", class_="rprt")

# Initialize variables to store extracted information
organism = None
genbank = None
refseq = None

# Loop through each div and check if it contains the desired information
for div in assembly_divs:
    if "JCM 5058" in div.get_text():
        # Extract information from the div
        info_text = div.get_text().strip()
        if "Organism:" in info_text:
            organism = info_text.split("Organism:")[1].split("(")[0].strip()
        if "GenBank assembly accession:" in info_text:
            genbank = info_text.split("GenBank assembly accession:")[1].split("(")[0].strip()
        if "RefSeq assembly accession:" in info_text:
            refseq = info_text.split("RefSeq assembly accession:")[1].split("(")[0].strip()
        break
else:
    print("No matched section found on the webpage.")

# Close the browser
driver.quit()

# Print the extracted information
print(f"Organism: {organism}")
print(f"GenBank Accession Number: {genbank}")
print(f"RefSeq assembly accession: {refseq}")


Organism: Streptomyces anthocyanicus
GenBank Accession Number: GCA_014651155.1
RefSeq assembly accession: GCF_014651155.1


# Similarly, you can extract each field step by step, as shown below

In [9]:
from selenium import webdriver
from bs4 import BeautifulSoup

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

try:
    # Construct the search URL for assembly
    search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

    # Navigate to the search URL
    driver.get(search_url)

    # Get the page source after Selenium waits for the page to fully load
    page_source = driver.page_source

    # Use BeautifulSoup to parse the page source
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find all div elements containing assembly information
    assembly_divs = soup.find_all("div", class_="rprt")

    # Loop through each div and check if it contains the desired information
    for div in assembly_divs:
        if "JCM 5058" in div.get_text():
            # Extract relevant information; every label sits in a <dt> element
            organism_name_elem = div.find("dt", string=lambda text: text and "Organism:" in text)
            if organism_name_elem:
                organism_name = organism_name_elem.find_next_sibling("dd").get_text().strip()
            else:
                organism_name = "N/A"
            assembly_level_elem = div.find("dt", string=lambda text: text and "Assembly level:" in text)
            if assembly_level_elem:
                assembly_level = assembly_level_elem.find_next_sibling("dd").get_text().strip()
            else:
                assembly_level = "N/A"
            genome_representation_elem = div.find("dt", string=lambda text: text and "Genome representation:" in text)
            if genome_representation_elem:
                genome_representation = genome_representation_elem.find_next_sibling("dd").get_text().strip()
            else:
                genome_representation = "N/A"
            genbank_accession_elem = div.find("dt", string=lambda text: text and "GenBank assembly accession:" in text)
            if genbank_accession_elem:
                genbank_accession = genbank_accession_elem.find_next_sibling("dd").get_text().strip()
            else:
                genbank_accession = "N/A"
            refseq_accession_elem = div.find("dt", string=lambda text: text and "RefSeq assembly accession:" in text)
            if refseq_accession_elem:
                refseq_accession = refseq_accession_elem.find_next_sibling("dd").get_text().strip()
            else:
                refseq_accession = "N/A"
            print("Organism:", organism_name)
            print("Assembly Level:", assembly_level)
            print("Genome Representation:", genome_representation)
            print("GenBank Assembly Accession:", genbank_accession)
            print("RefSeq Assembly Accession:", refseq_accession)
            break
    else:
        print("No matched section found on the webpage.")

finally:
    # Close the browser
    driver.quit()


Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Assembly Level: Scaffold
Genome Representation: full
GenBank Assembly Accession: GCA_014651155.1 (latest)
RefSeq Assembly Accession: GCF_014651155.1 (latest)




# Next, we will navigate to a sibling element by position
Another approach is to locate the matching element and then move forward a fixed number of sibling elements. In the JCM 5058 record, the GenBank assembly accession is in the 6th `dl` element after the one containing the strain name, so the XPath below jumps straight to it.

This relies on the record's layout: a record with more or fewer fields would put a different field in 6th position.

In [14]:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a Chrome browser
driver = webdriver.Chrome()

# Load the webpage
driver.get("https://www.ncbi.nlm.nih.gov/assembly/?term=Streptomyces+anthocyanicus+JCM+5058")

# Find the element containing the GenBank assembly accession using XPath
genbank_element = driver.find_element(By.XPATH, "//dl[contains(., 'JCM 5058')]/following-sibling::dl[6]") # move to 6th line

# Extract the GenBank assembly accession text
genbank_accession = genbank_element.text.split(": ")[1]

# Print the GenBank assembly accession
print(genbank_accession)

# Close the browser
driver.quit()


GCA_014651155.1 (latest)
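
The positional lookup above depends on how many fields a record has. The sketch below models that with plain lists instead of a browser: the labels are copied in page order from two records in this notebook's search results, and the helper name `label_at_offset` is hypothetical.

```python
# Field labels in page order, copied from two records in this notebook's search results
RECORD_JCM_5058 = [
    "Organism", "Infraspecific name", "Submitter", "Date", "Assembly level",
    "Genome representation", "Relation to type material",
    "GenBank assembly accession", "RefSeq assembly accession", "IDs",
]
RECORD_JCM_4739 = [
    "Organism", "Infraspecific name", "Submitter", "Date", "Assembly level",
    "Genome representation", "RefSeq category", "Relation to type material",
    "GenBank assembly accession", "RefSeq assembly accession", "IDs",
]


def label_at_offset(labels, anchor, offset):
    """Return the label `offset` positions after `anchor`, like following-sibling::dl[offset]."""
    return labels[labels.index(anchor) + offset]


sixth_for_5058 = label_at_offset(RECORD_JCM_5058, "Infraspecific name", 6)
sixth_for_4739 = label_at_offset(RECORD_JCM_4739, "Infraspecific name", 6)

print("JCM 5058, 6th sibling:", sixth_for_5058)
print("JCM 4739, 6th sibling:", sixth_for_4739)
```

For JCM 4739 the extra "RefSeq category" field shifts everything by one, so the 6th sibling is no longer the GenBank accession. One way to avoid counting, untested against the live page, is to select by label instead, for example `//dl[contains(., 'JCM 5058')]/following-sibling::dl[contains(., 'GenBank assembly accession')][1]`.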


# Navigating and parsing NCBI pages for a genome with a single entry
When a genome has only one assembly or a single genome entry, clicking on it opens a different page. For example, if you search for 'Streptomyces anthocyanicus NBC 01687' in the NCBI search bar at https://www.ncbi.nlm.nih.gov/ across all databases, you may see only one assembly listed. Scrolling down and clicking on it opens a new page, so we need to inspect and parse the HTML of that page again, as we did for 'Streptomyces anthocyanicus JCM 5058'.
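
Since only the search term changes between examples, the URL construction used in every cell can be wrapped in a small helper. This is a minimal sketch using the standard library's `urlencode`, which turns spaces into `+` like the `replace` call in the cells and also escapes characters such as `&`. The helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ncbi.nlm.nih.gov/assembly/"


def build_search_url(term):
    """Return the assembly search URL for a free-text search term."""
    return f"{BASE_URL}?{urlencode({'term': term})}"


search_term = "Streptomyces anthocyanicus NBC 01687"
url = build_search_url(search_term)
print(url)

# Same result as the replace-based construction used in the cells
manual_url = f"{BASE_URL}?term={search_term.replace(' ', '+')}"
print(url == manual_url)
```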


In [16]:
from bs4 import BeautifulSoup
from selenium import webdriver

# Define the search term
search_term = "Streptomyces anthocyanicus NBC 01687"

# Open a Chrome browser
driver = webdriver.Chrome()

try:
    # Construct the search URL for assembly
    search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

    # Navigate to the search URL
    driver.get(search_url)

    # Get the page source after Selenium waits for the page to fully load
    page_source = driver.page_source

    # Use BeautifulSoup to parse the page source
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find all dl elements containing organism information
    organism_dl_elements = soup.find_all("dl")

    if organism_dl_elements:
        print("Organism DL elements found on the webpage:")
        for dl_element in organism_dl_elements:
            print(dl_element)

    else:
        print("No organism DL elements found on the webpage.")

except Exception as e:
    print("An error occurred:", e)

finally:
    # Close the browser
    driver.quit()


Organism DL elements found on the webpage:
<dl data-testid="description_list"><dt>NCBI RefSeq assembly</dt><dd><span>GCF_036226945.1 <div class="MuiBox-root css-6kj2h2"><button aria-controls="show-link-menu-menu" aria-haspopup="true" class="MuiButtonBase-root MuiIconButton-root MuiIconButton-sizeMedium css-n5u0h3" data-ga-action="click_open_menu" data-ga-label="actions_menu_refseq" id="show-link-menu-button" style="padding: 4px;" tabindex="0" type="button"><svg aria-hidden="true" class="MuiSvgIcon-root MuiSvgIcon-fontSizeMedium css-18t5o0c" data-testid="MoreVertIcon" focusable="false" viewbox="0 0 24 24"><path d="M12 8c1.1 0 2-.9 2-2s-.9-2-2-2-2 .9-2 2 .9 2 2 2m0 2c-1.1 0-2 .9-2 2s.9 2 2 2 2-.9 2-2-.9-2-2-2m0 6c-1.1 0-2 .9-2 2s.9 2 2 2 2-.9 2-2-.9-2-2-2"></path></svg><span class="MuiTouchRipple-root css-w0pj6f"></span></button></div></span></dd><dt>Submitted GenBank assembly</dt><dd><span>GCA_036226945.1 <div class="MuiBox-root css-6kj2h2"><button aria-controls="show-link-menu-menu" ar

# As you can see, this page looks different, so we will use a different approach
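
Because the same search URL can lead to either kind of page, it may help to check which layout came back before choosing a parser. The sketch below is a simple string check, assuming the two markers seen in the printed outputs of this notebook are enough to tell the layouts apart. The function name `detect_layout` and the returned labels are hypothetical.

```python
# Markers copied from the printed outputs in this notebook
RESULTS_MARKER = '<dl class="details">'
SINGLE_MARKER = '<dl data-testid="description_list">'


def detect_layout(page_source):
    """Guess the page layout from markers observed in this notebook."""
    if SINGLE_MARKER in page_source:
        return "single_assembly"
    if RESULTS_MARKER in page_source:
        return "search_results"
    return "unknown"


results_html = '<dl class="details"><dt>Organism: </dt><dd>...</dd></dl>'
single_html = '<dl data-testid="description_list"><dt>NCBI RefSeq assembly</dt></dl>'

print(detect_layout(results_html))
print(detect_layout(single_html))
print(detect_layout("<p>nothing relevant</p>"))
```

The markers reflect the pages as they looked when this notebook was run, so they would need updating if the site's HTML changes.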



In [23]:
from bs4 import BeautifulSoup

# HTML copied from the output above; no request is sent in this cell
page_content = """
<dl data-testid="description_list">
    <dt>NCBI RefSeq assembly</dt>
    <dd><span>GCF_036226945.1 <div class="MuiBox-root css-6kj2h2"><button aria-controls="show-link-menu-menu" aria-haspopup="true" class="MuiButtonBase-root MuiIconButton-root MuiIconButton-sizeMedium css-n5u0h3" data-ga-action="click_open_menu" data-ga-label="actions_menu_refseq" id="show-link-menu-button" style="padding: 4px;" tabindex="0" type="button"><svg aria-hidden="true" class="MuiSvgIcon-root MuiSvgIcon-fontSizeMedium css-18t5o0c" data-testid="MoreVertIcon" focusable="false" viewbox="0 0 24 24"><path d="M12 8c1.1 0 2-.9 2-2s-.9-2-2-2-2 .9-2 2 .9 2 2 2m0 2c-1.1 0-2 .9-2 2s.9 2 2 2 2-.9 2-2-.9-2-2-2m0 6c-1.1 0-2 .9-2 2s.9 2 2 2 2-.9 2-2-.9-2-2-2"></path></svg><span class="MuiTouchRipple-root css-w0pj6f"></span></button></div></span></dd>
    <dt>Submitted GenBank assembly</dt>
    <dd><span>GCA_036226945.1 <div class="MuiBox-root css-6kj2h2"><button aria-controls="show-link-menu-menu" aria-haspopup="true" class="MuiButtonBase-root MuiIconButton-root MuiIconButton-sizeMedium css-n5u0h3" data-ga-action="click_open_menu" data-ga-label="actions_menu_genbank" id="show-link-menu-button" style="padding: 4px;" tabindex="0" type="button"><svg aria-hidden="true" class="MuiSvgIcon-root MuiSvgIcon-fontSizeMedium css-18t5o0c" data-testid="MoreVertIcon" focusable="false" viewbox="0 0 24 24"><path d="M12 8c1.1 0 2-.9 2-2s-.9-2-2-2-2 .9-2 2 .9 2 2 2m0 2c-1.1 0-2 .9-2 2s.9 2 2 2 2-.9 2-2-.9-2-2-2"></path></svg><span class="MuiTouchRipple-root css-w0pj6f"></span></button></div></span></dd>
</dl>"""

# Parse the HTML content
soup = BeautifulSoup(page_content, 'html.parser')

# Find the dt and dd elements containing assembly information
# (string= is the current name of the deprecated text= argument)
assembly_dt = soup.find("dt", string="Submitted GenBank assembly")
assembly_dd = assembly_dt.find_next_sibling("dd")

# Extract the GenBank and RefSeq assembly accessions
genbank = assembly_dd.text.split()[0]  # Extract the first part of the text
refseq = soup.find("dt", string="NCBI RefSeq assembly").find_next_sibling("dd").text.split()[0]  # Extract the first part of the text

# Print the extracted information
print("GenBank:", genbank)
print("RefSeq:", refseq)


GenBank: GCA_036226945.1
RefSeq: GCF_036226945.1
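
Taking the first whitespace-separated token works here, but a pattern check can guard against picking up unrelated text. The sketch below assumes accessions look like the ones in this notebook: `GCA_` (GenBank) or `GCF_` (RefSeq), nine digits, a dot, and a version number. The pattern and the function name `extract_accessions` are assumptions for illustration.

```python
import re

# Assumed pattern, based on the accessions that appear in this notebook
ACCESSION_RE = re.compile(r"\bGC[AF]_\d{9}\.\d+\b")


def extract_accessions(text):
    """Return the first GenBank (GCA_) and RefSeq (GCF_) accession found in text."""
    found = {"GenBank": None, "RefSeq": None}
    for accession in ACCESSION_RE.findall(text):
        key = "GenBank" if accession.startswith("GCA_") else "RefSeq"
        if found[key] is None:
            found[key] = accession
    return found


# Visible text of the description list, as in the page content above
sample_text = "NCBI RefSeq assembly GCF_036226945.1 Submitted GenBank assembly GCA_036226945.1"
accessions = extract_accessions(sample_text)
print("GenBank:", accessions["GenBank"])
print("RefSeq:", accessions["RefSeq"])
```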




# Using the browser
Above, we worked on a copied HTML example. Now we will load the live page in the browser and extract the same information.


In [25]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

# Define the search term
search_term = "Streptomyces anthocyanicus NBC 01687"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

# Get the page source after Selenium waits for the page to fully load
page_source = driver.page_source

# Use BeautifulSoup to parse the page source
soup = BeautifulSoup(page_source, 'html.parser')

# Find the dl element containing assembly information
assembly_dl = soup.find("dl", {"data-testid": "description_list"})

# Extract GenBank and RefSeq assembly accessions
genbank_dd = assembly_dl.find("dt", string="Submitted GenBank assembly").find_next_sibling("dd")
genbank_accession = genbank_dd.text.strip().split()[0] if genbank_dd else None

refseq_dd = assembly_dl.find("dt", string="NCBI RefSeq assembly").find_next_sibling("dd")
refseq_accession = refseq_dd.text.strip().split()[0] if refseq_dd else None

# Close the browser
driver.quit()

# Create a DataFrame
data = [{"GenBank Accession Number": genbank_accession, "RefSeq assembly accession": refseq_accession}]
df = pd.DataFrame(data)

# Print and save the DataFrame
print(df)
df.to_csv("genome_data.csv", index=False)


  GenBank Accession Number RefSeq assembly accession
0          GCA_036226945.1           GCF_036226945.1
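
When several strains are scraped, their results can be collected as one row per strain before saving. This is a minimal sketch with the standard library's `csv` module, writing to an in-memory buffer rather than a file. The accessions are the ones obtained in this notebook, and the extra "Strain" column is an assumption added for illustration.

```python
import csv
import io

# One dictionary per strain; accessions copied from the outputs in this notebook
records = [
    {
        "Strain": "JCM 5058",
        "GenBank Accession Number": "GCA_014651155.1",
        "RefSeq assembly accession": "GCF_014651155.1",
    },
    {
        "Strain": "NBC 01687",
        "GenBank Accession Number": "GCA_036226945.1",
        "RefSeq assembly accession": "GCF_036226945.1",
    },
]

# Write header and rows to an in-memory text buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

The same list of dictionaries can be passed to `pd.DataFrame(records)` if you prefer to keep using pandas as in the cell above.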




# How to view the main content and extract the desired data
Besides the HTML source, we can also read the main content of a webpage, that is, the text a visitor actually sees.

Previously, we processed the HTML source code to extract data.
Now we will print the main content of the webpage and extract the relevant data from that text.



In [28]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Define the search term
search_term = "Streptomyces anthocyanicus NBC 01687"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

try:
  # Wait for the main content to be visible
  main_content = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "maincontent")))

  # Find the assembly information
  assembly_info = main_content.text if main_content else "Assembly information not found"
  print(assembly_info)


except TimeoutException:
  print("Elements not found or timed out waiting for them to appear.")

# Close the browser
driver.quit()


Genome assembly ASM3622694v1
Download
datasets
curl
Actions
NCBI RefSeq assembly
GCF_036226945.1
Submitted GenBank assembly
GCA_036226945.1
Taxon
Streptomyces anthocyanicus
Strain
NBC 01687
Submitter
Technical University of Denmark
Date
Jan 25, 2024
View the legacy Assembly page
Assembly statistics
RefSeq GenBank
Genome size 8.9 Mb 8.9 Mb
Total ungapped length 8.9 Mb 8.9 Mb
Number of chromosomes 2 2
Number of scaffolds 2 2
Scaffold N50 8.5 Mb 8.5 Mb
Scaffold L50 1 1
Number of contigs 2 2
Contig N50 8.5 Mb 8.5 Mb
Contig L50 1 1
GC percent 72 72
Genome coverage 11.0x 11.0x
Assembly level Complete Genome Complete Genome
Sample details
BioSample ID
SAMN30553413
Description
G1000 actinobacteria
Comment
Keywords: GSC:MIxS;MIGS:6.0
Owner name
Technical University of Denmark
Strain
NBC 01687
Collection date
2021-05
Depth
NA
View more
Assembly methods
Sequencing technology
Oxford Nanopore MinION and Illumina
Comment
The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (PG

In [29]:
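# Imports repeated so this cell runs on its own; NoSuchElementException is needed below
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException

# Define the search term
search_term = "Streptomyces anthocyanicus NBC 01687"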

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

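# Default value so the line-by-line parsing below still runs if the wait times out
assembly_info = ""

try: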
    # Wait for the main content to be visible
    main_content = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "maincontent")))

    # Find the assembly information
    assembly_info = main_content.text if main_content else "Assembly information not found"
    #print(assembly_info)

    # Extract GenBank and RefSeq assembly IDs if the assembly widget is present
    try:
        assembly_table = driver.find_element(By.CLASS_NAME, "assembly-widget")
        rows = assembly_table.find_elements(By.TAG_NAME, "tr")
        for row in rows:
            cells = row.find_elements(By.TAG_NAME, "td")
            if len(cells) == 3:
                label = cells[1].text.strip()
                assembly_id = cells[2].text.strip()
                if label == "NCBI RefSeq assembly":
                    print("NCBI RefSeq assembly:", assembly_id)
                elif label == "Submitted GenBank assembly":
                    print("Submitted GenBank assembly:", assembly_id)
    except NoSuchElementException:
        print("Assembly information widget not found.")

except TimeoutException:
    print("Elements not found or timed out waiting for them to appear.")

# Initialize variables to store assembly IDs
genbank_assembly = None
refseq_assembly = None

# Split the assembly information into lines; each label is followed by its value on the next line
lines = assembly_info.split("\n")
for label_line, value_line in zip(lines, lines[1:]):
    if "NCBI RefSeq assembly" in label_line:
        refseq_assembly = value_line.strip()
    elif "Submitted GenBank assembly" in label_line:
        genbank_assembly = value_line.strip()

# Print the assembly IDs if found
if refseq_assembly:
    print("NCBI RefSeq assembly:", refseq_assembly)
if genbank_assembly:
    print("Submitted GenBank assembly:", genbank_assembly)


# Close the browser
driver.quit()

Assembly information widget not found.
NCBI RefSeq assembly: GCF_036226945.1
Submitted GenBank assembly: GCA_036226945.1
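
The "label on one line, value on the next" pattern can be wrapped in a small reusable function and tested without a browser. This is a minimal sketch: the text is copied from the main content printed above, and the function name `value_after` is hypothetical.

```python
# Lines copied from the main content printed earlier in this notebook
MAIN_CONTENT = """NCBI RefSeq assembly
GCF_036226945.1
Submitted GenBank assembly
GCA_036226945.1
Taxon
Streptomyces anthocyanicus
Strain
NBC 01687"""


def value_after(lines, label):
    """Return the line following the first line equal to label, or None if absent."""
    for current, following in zip(lines, lines[1:]):
        if current.strip() == label:
            return following.strip()
    return None


content_lines = MAIN_CONTENT.split("\n")
print("RefSeq:", value_after(content_lines, "NCBI RefSeq assembly"))
print("GenBank:", value_after(content_lines, "Submitted GenBank assembly"))
print("Strain:", value_after(content_lines, "Strain"))
```

Pairing each line with its successor through `zip` means a label on the very last line simply yields no match instead of raising an `IndexError`.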


# Now we will do the same for the second strain, Streptomyces anthocyanicus JCM 5058
Similarly, we can view the main content for the second strain, whose search opens a list of multiple assemblies in NCBI.

In [31]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

try:
  # Wait for the main content to be visible
  main_content = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "maincontent")))

  # Find the assembly information
  assembly_info = main_content.text if main_content else "Assembly information not found"
  print(assembly_info)


except TimeoutException:
  print("Elements not found or timed out waiting for them to appear.")

# Close the browser
driver.quit()


Summary20 per pageSort by Significance
Send to:
Download Assemblies


Search results
Items: 6
Filters activated: Latest, Exclude anomalous. Clear all to show 6 items.
1.
ASM1465079v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 4739
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
RefSeq category: representative genome
Relation to type material: assembly from synonym type material
GenBank assembly accession: GCA_014650795.1 (latest)
RefSeq assembly accession: GCF_014650795.1 (latest)
IDs: 8120941 [UID] 22193998 [GenBank] 22711958 [RefSeq]
2.
ASM3591765v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: NBC 01777
Submitter: Technical University of Denmark
Date: 2024/01/20
Assembly level: Complete Genome
Genome representation: full
GenBank assembly accession: GCA_035917655.1 (latest)
RefSeq asse

In [32]:
import re

# Define the search term
search_term = "JCM 5058"

# Define the search pattern
pattern = re.compile(r'{}.*?GenBank assembly accession:\s+(.*?)\s+\(latest\).*?RefSeq assembly accession:\s+(.*?)\s+\(latest\)'.format(re.escape(search_term)), re.DOTALL)

# Define the text containing the search results
search_results = """
Summary20 per pageSort by Significance
Send to:
Download Assemblies


Search results
Items: 6
Filters activated: Latest, Exclude anomalous. Clear all to show 6 items.
1.
ASM1465079v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 4739
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
RefSeq category: representative genome
Relation to type material: assembly from synonym type material
GenBank assembly accession: GCA_014650795.1 (latest)
RefSeq assembly accession: GCF_014650795.1 (latest)
IDs: 8120941 [UID] 22193998 [GenBank] 22711958 [RefSeq]
2.
ASM3591765v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: NBC 01777
Submitter: Technical University of Denmark
Date: 2024/01/20
Assembly level: Complete Genome
Genome representation: full
GenBank assembly accession: GCA_035917655.1 (latest)
RefSeq assembly accession: GCF_035917655.1 (latest)
IDs: 21001241 [UID] 52901348 [GenBank] 52903568 [RefSeq]
3.
ASM3622694v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: NBC 01687
Submitter: Technical University of Denmark
Date: 2024/01/25
Assembly level: Complete Genome
Genome representation: full
GenBank assembly accession: GCA_036226945.1 (latest)
RefSeq assembly accession: GCF_036226945.1 (latest)
IDs: 21158631 [UID] 53231598 [GenBank] 53363738 [RefSeq]
4.
ASM1464769v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 3037
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
Relation to type material: assembly from synonym type material
GenBank assembly accession: GCA_014647695.1 (latest)
RefSeq assembly accession: GCF_014647695.1 (latest)
IDs: 8119251 [UID] 22190898 [GenBank] 22211148 [RefSeq]
5.
ASM1465115v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
Relation to type material: assembly from type material
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]
6.
ASM2618401v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: IPS92w
Submitter: IBPM RAS
Date: 2022/11/14
Assembly level: Contig
Genome representation: full
GenBank assembly accession: GCA_026184015.1 (latest)
RefSeq assembly accession: GCF_026184015.1 (latest)
IDs: 14409421 [UID] 37391658 [GenBank] 37487378 [RefSeq]
Summary20 per pageSort by Significance
Send to:
"""

# Find all matches
matches = pattern.findall(search_results)

# Extract the relevant information
for match in matches:
    genbank_accession = match[0]
    refseq_accession = match[1]
    print("GenBank Assembly Accession:", genbank_accession)
    print("RefSeq Assembly Accession:", refseq_accession)


GenBank Assembly Accession: GCA_014651155.1
RefSeq Assembly Accession: GCF_014651155.1
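
An alternative to one long regular expression is to split the text into numbered records first and then read each record's `label: value` lines. This is a sketch under the assumption that every record starts with a line holding only a number and a dot, as in the text above. The sample is an abridged copy of records 4 and 5, and the function name `parse_records` is hypothetical.

```python
import re

# Abridged copy of records 4 and 5 from the search results above
SEARCH_TEXT = """4.
ASM1464769v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 3037
GenBank assembly accession: GCA_014647695.1 (latest)
RefSeq assembly accession: GCF_014647695.1 (latest)
5.
ASM1465115v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
"""


def parse_records(text):
    """Split the text into numbered records and return one dict of fields per record."""
    records = []
    # A line holding only "<number>." starts each record
    for block in re.split(r"^\d+\.[ \t]*$", text, flags=re.MULTILINE):
        fields = {}
        for line in block.splitlines():
            label, separator, value = line.partition(": ")
            if separator:  # keep only "label: value" lines
                fields[label] = value
        if fields:
            records.append(fields)
    return records


parsed = parse_records(SEARCH_TEXT)
match = next(r for r in parsed if "JCM 5058" in r.get("Infraspecific name", ""))
print("GenBank Assembly Accession:", match["GenBank assembly accession"].split()[0])
print("RefSeq Assembly Accession:", match["RefSeq assembly accession"].split()[0])
```

Working record by record keeps a match from spilling over into the following record, which a lazy `.*?` pattern with `re.DOTALL` could do if the matched record lacked one of the expected fields.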


# Now we will apply this to the live webpage
Above, we pasted the main content of the page into a string and used plain Python to extract the accessions. Now we will run the same extraction on the text read from the webpage.

In [6]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import re

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

try:
    # Wait for the main content to be visible
    main_content = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "maincontent")))

    # Find the assembly information
    assembly_info = main_content.text if main_content else "Assembly information not found"
    #print(assembly_info)
    # Narrow the search term to the strain identifier used in the regex
    search_term = "JCM 5058"
    
    # Define the search pattern
    pattern = re.compile(r'{}.*?GenBank assembly accession:\s+(.*?)\s+\(latest\).*?RefSeq assembly accession:\s+(.*?)\s+\(latest\)'.format(re.escape(search_term)), re.DOTALL)
    
    # Find all matches
    matches = pattern.findall(assembly_info)

    # Extract the relevant information
    for match in matches:
        genbank_accession = match[0]
        refseq_accession = match[1]
        print("GenBank Assembly Accession:", genbank_accession)
        print("RefSeq Assembly Accession:", refseq_accession)
        
except TimeoutException:
    print("Elements not found or timed out waiting for them to appear.")

# Close the browser
driver.quit()


GenBank Assembly Accession: GCA_014651155.1
RefSeq Assembly Accession: GCF_014651155.1


# Finally, we will try using XPath
With XPath, we can either move step by step to a location or search by matching text. Here we find an element by its text and then move forward (to a sibling) or backward (to a parent) from that location.

In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Define the search term
search_term = "Streptomyces anthocyanicus NBC 01687"

# Open a Chrome browser
driver = webdriver.Chrome()

try:
    # Construct the search URL for assembly
    search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

    # Navigate to the search URL
    driver.get(search_url)

    # Text of the label to search for on the page
    label_text = "NCBI RefSeq assembly"

    # Find elements whose text contains the label
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{label_text}')]")

    if elements:
        print(f"Text '{label_text}' found on the webpage.")
        # Loop through the elements containing the label
        for element in elements:
            # Move to the parent element: ".." selects the parent, "../.." the grandparent,
            # and "following-sibling::*[1]" the next sibling
            parent_element = element.find_element(By.XPATH, "..")
            # Print the text content of the parent element
            print("Parent element:")
            print(parent_element.text)

    else:
        print(f"Text '{label_text}' not found on the webpage.")

except Exception as e:
    print("An error occurred:", e)

finally:
    # Quit the browser
    driver.quit()

Text 'NCBI RefSeq assembly' found on the webpage.
Parent element:
NCBI RefSeq assembly
GCF_036226945.1
Submitted GenBank assembly
GCA_036226945.1
Taxon
Streptomyces anthocyanicus
Strain
NBC 01687
Submitter
Technical University of Denmark
Date
Jan 25, 2024
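
The parent and sibling moves can be practiced offline on a small document. The sketch below uses the standard library's `xml.etree.ElementTree`, which has no sibling axis, so the moves are done in plain Python. The snippet is a simplified, well-formed version of the description list printed earlier (buttons and icons omitted), and the wrapping `div` is an assumption.

```python
import xml.etree.ElementTree as ET

# Simplified version of the description list seen earlier; the wrapping <div> is assumed
SNIPPET = """<div id="maincontent">
  <dl>
    <dt>NCBI RefSeq assembly</dt><dd>GCF_036226945.1</dd>
    <dt>Submitted GenBank assembly</dt><dd>GCA_036226945.1</dd>
  </dl>
</div>"""

root = ET.fromstring(SNIPPET)

# ElementTree elements do not store their parent, so build a child -> parent map
parent_of = {child: parent for parent in root.iter() for child in parent}

# Locate the element by its text, like //*[contains(text(), '...')]
label = next(dt for dt in root.iter("dt") if dt.text == "NCBI RefSeq assembly")

parent = parent_of[label]        # like XPath ".."
grandparent = parent_of[parent]  # like XPath "../.."

# The next sibling is the element right after the label in the parent's children
siblings = list(parent)
next_sibling = siblings[siblings.index(label) + 1]  # like "following-sibling::*[1]"

print("Parent tag:", parent.tag)
print("Grandparent tag:", grandparent.tag)
print("Next sibling text:", next_sibling.text)
```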
