# Automating Web Requests and Interactions in Python for NCBI
# Introduction
Welcome to this tutorial on automating web requests and interactions in Python. We'll cover two popular libraries: Requests, for making HTTP requests, and Selenium, for automating interactions with a web browser, using the NCBI website as the running example.
By the end, you will know how to make HTTP requests with the Requests library and automate web interactions with Selenium in Python. These libraries open up a world of possibilities for web scraping, testing, and automation tasks.

Part 1: Making HTTP Requests with Requests
Step 1: Importing the Requests Library

In [49]:
import requests

Step 2: Sending a GET Request

In [50]:
url = "https://www.ncbi.nlm.nih.gov/"
response = requests.get(url)

Step 3: Handling the Response

In [51]:
if response.status_code == 200:
    print("Website opened successfully!")
    # You can print or manipulate the content of the website here
else:
    print("Failed to open the website. Status code:", response.status_code)


Website opened successfully!
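
The status-code check above only covers the case where a response comes back. The following is a minimal sketch of a slightly more defensive version, assuming a hypothetical helper named `check_site` (it is not part of the original notebook). It passes a `timeout` to the request and uses `raise_for_status()` so that network errors and 4xx/5xx responses are handled in one place. The `get` parameter is an illustrative addition that lets the helper be tried without network access.

```python
import requests


def check_site(url, timeout=10.0, get=requests.get):
    """Return an (ok, message) pair describing a GET request to `url`.

    `check_site` and its `get` parameter are hypothetical names used for
    illustration; `get` defaults to requests.get.
    """
    try:
        response = get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and HTTP error statuses
        return False, f"Failed to open the website: {exc}"
    return True, f"Website opened successfully (status {response.status_code})"
```

Calling `check_site("https://www.ncbi.nlm.nih.gov/")` performs the same request as the cell above and returns the outcome as a pair instead of printing it.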


# Web Automation with Selenium
We will use Selenium for the following steps because it is flexible and opens the web page in a real browser.

In [52]:
# Import the library
from selenium import webdriver

# Open a Chrome browser
driver = webdriver.Chrome()

# Navigate to a web page
url = "https://www.ncbi.nlm.nih.gov/"
driver.get(url)


# Fetching Web Page Content with Requests
In this section, we'll learn how to use the Requests library in Python to fetch the HTML content of a webpage.
Fetching the raw HTML helps us understand the structure of the web page before and after it loads.

In [53]:
import requests

# URL of the webpage
url = "https://www.example.com"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Print the HTML content of the webpage
    print(response.text)
else:
    print("Failed to fetch the webpage. Status code:", response.status_code)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai
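
Printing the whole page is hard to read once the HTML gets long. As a minimal sketch of how the structure could be summarized instead, the block below uses the standard-library `html.parser` module to collect the page title and count the tags. The `StructureParser` class, the `summarize_html` helper, and the shortened `sample_html` string are assumptions made for illustration; in the notebook, `response.text` would be passed in instead.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects the <title> text and a count of every opening tag."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize_html(html_text):
    """Return (title, tag_counts) for an HTML string."""
    parser = StructureParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), dict(parser.tag_counts)


# Shortened stand-in for response.text (illustrative only)
sample_html = """<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
</head>
<body>
<div>
    <h1>Example Domain</h1>
</div>
</body>
</html>"""

title, counts = summarize_html(sample_html)
print("Title:", title)
print("Tag counts:", counts)
```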

Step 1: Importing Necessary Modules

In [70]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver


Step 2: Opening a Chrome Browser

In [71]:
# Open a Chrome browser
driver = webdriver.Chrome()


Step 3: Navigating to a Web Page

In [72]:
# URL of the website
url = "https://www.ncbi.nlm.nih.gov/"

# Open the website
print("Opening NCBI...")
driver.get(url)
print("Opened NCBI")


Opening NCBI...
Opened NCBI


Step 4: Waiting for Elements to Load


In [73]:
# Wait for the search box to be present
print("Waiting for the search box to appear...")
search_box = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "term"))
)
print("Search box found")


Waiting for the search box to appear...
Search box found
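
To make the idea of an explicit wait more concrete, here is a simplified pure-Python sketch of the polling loop behind a call like `WebDriverWait(driver, 10).until(...)`. It is not Selenium's implementation: `wait_until` and `search_box_ready` are hypothetical names, and the "element" is just a string that becomes available on the third check. The point is only that the condition is re-evaluated at intervals until it returns something truthy or the timeout expires.

```python
import time


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    This is an illustrative stand-in, not Selenium's WebDriverWait.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("Condition was not met before the timeout")
        time.sleep(poll_frequency)


# Fake condition: the "search box" only appears on the third check
attempts = {"count": 0}


def search_box_ready():
    attempts["count"] += 1
    return "search box" if attempts["count"] >= 3 else None


element = wait_until(search_box_ready, timeout=2.0, poll_frequency=0.01)
print(f"Found {element!r} after {attempts['count']} checks")
```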


Step 5: Interacting with Web Elements

In [74]:
# Fill in the search box with "Streptomyces anthocyanicus"
print("Filling in the search box...")
search_box.send_keys("Streptomyces anthocyanicus")
print("Filled in the search box")

# Find and click the search button
print("Waiting for the search button to be clickable...")
search_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "search"))
)
print("Search button is clickable")
print("Clicking the search button...")
search_button.click()
print("Clicked the search button")


Filling in the search box...
Filled in the search box
Waiting for the search button to be clickable...
Search button is clickable
Clicking the search button...
Clicked the search button


In [75]:
# Wait for the "Genomes" link to be clickable
print("Waiting for the 'Genomes' link to be present...")
genome_link = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "genome_table"))
)

# Click the "Genomes" link
print("Clicking the 'Genomes' link...")
genome_link.click()
print("Clicked the 'Genomes' link")



Waiting for the 'Genomes' link to be present...
Clicking the 'Genomes' link...
Clicked the 'Genomes' link


# Get the current URL so we can view the HTML page later
It is used below as html_file_path.

In [76]:
# Get the current URL at the end
current_url = driver.current_url
print("Current URL:", current_url)

Current URL: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=68174
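
The URL printed above carries the organism as a `taxon` query parameter. As a small sketch, assuming that parameter is what selects the organism on this page, the standard-library `urllib.parse` module can pull the taxon ID out of the URL and build the same kind of URL for another ID. The helper name `genome_url_for_taxon` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# The URL printed by driver.current_url in the cell above
current_url = "https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=68174"

# Split the URL and read the "taxon" query parameter
parts = urlparse(current_url)
query = parse_qs(parts.query)
taxon_id = query["taxon"][0]
print("Taxon ID:", taxon_id)


def genome_url_for_taxon(taxon_id):
    """Build a genome-listing URL for a taxon ID (hypothetical helper)."""
    base = "https://www.ncbi.nlm.nih.gov/datasets/genome/"
    return base + "?" + urlencode({"taxon": taxon_id})


print(genome_url_for_taxon(taxon_id))
```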


In [78]:
# You can also open the page in the default web browser (optional; this step can be skipped)
import webbrowser

# Here we use the current URL obtained previously instead of a path to a local HTML file
html_file_path = "https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=68174"

# Open the page in the default web browser
webbrowser.open(html_file_path)

True

# After the page loads, you can check what its HTML looks like (skipped for now)

In [64]:
#import requests

# URL of the webpage
#url = "https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=68174"


# Send a GET request to the URL
#response = requests.get(url)

# Check if the request was successful (status code 200)
#if response.status_code == 200:
    # Print the HTML content of the webpage
    #print(response.text)
#else:
    #print("Failed to fetch the webpage. Status code:", response.status_code)


In [65]:
# After importing libraries, let's understand the webpage structure.
# We'll load the webpage and use an HTML parser to examine its structure.

# Define the search term
#search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
#driver = webdriver.Chrome()

# Construct the search URL for assembly
#search_url = "https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=68174"

# Navigate to the search URL
#driver.get(search_url)

# Get the page source after Selenium waits for the page to fully load
#page_source = driver.page_source

# Use BeautifulSoup to parse the page source
#soup = BeautifulSoup(page_source, 'html.parser')

#print(soup)
    

# First, we will print the main content
<span style="font-size:18px;">We use Selenium here.</span>

In [80]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Define the search term (for reference only; it is not used below, since the URL already selects the taxon)
search_term = "Streptomyces anthocyanicus NBC 01687"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = 'https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=68174' #this current url obtained previously

# Navigate to the search URL
driver.get(search_url)

try:
    # Wait for the main content to be visible
    main_content = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.ID, "maincontent"))
    )

    # Get the visible text of the main content, which includes the assembly table
    assembly_info = main_content.text
    print(assembly_info)

except TimeoutException:
    print("Elements not found or timed out waiting for them to appear.")

finally:
    # Close the browser, even if the wait timed out
    driver.quit()

Genome
Download a genome data package including genome, transcript and protein sequence, annotation and a data report
Selected taxa
Streptomyces anthocyanicus
Filters
Download
Select columns
6 Genomes
Rows per page
20
1-6 of 6
Assembly
GenBank
RefSeq
Scientific name
Modifier
Annotation
Size (Mb)
Level
Release Date
WGS accession
Action
ASM1465079v1 GCA_014650795.1 GCF_014650795.1 Streptomyces anthocyanicus JCM 4739 (strain)
NCBI RefSeq
Submitter
8.675 Scaffold Sep, 2020 BMVQ01
ASM3591765v1 GCA_035917655.1 GCF_035917655.1 Streptomyces anthocyanicus NBC 01777 (strain)
NCBI RefSeq
Submitter
8.876 Complete Jan, 2024
ASM3622694v1 GCA_036226945.1 GCF_036226945.1 Streptomyces anthocyanicus NBC 01687 (strain)
NCBI RefSeq
Submitter
8.887 Complete Jan, 2024
ASM1464769v1 GCA_014647695.1 GCF_014647695.1 Streptomyces anthocyanicus JCM 3037 (strain)
NCBI RefSeq
Submitter
9.335 Scaffold Sep, 2020 BMPR01
ASM1465115v1 GCA_014651155.1 GCF_014651155.1 Streptomyces anthocyanicus JCM 5058 (strain)
NCBI RefS

**Now that we have seen what assembly_info looks like, we will process it to extract the desired information.**

*See below.*


In [82]:
# Split the assembly info into lines
lines = assembly_info.splitlines()

# Find the line containing the desired organism
target_line_index = None
for i, line in enumerate(lines):
    if "Streptomyces anthocyanicus JCM 5058" in line:
        target_line_index = i
        break

if target_line_index is not None:
    # Extract GenBank and RefSeq numbers from the target line
    target_line = lines[target_line_index]
    words = target_line.split()
    genbank_number = words[1]
    refseq_number = words[2]
    print(f"GenBank number: {genbank_number}")
    print(f"RefSeq number: {refseq_number}")

    # Extract genome size from the third line after the target line
    # (the "NCBI RefSeq" and "Submitter" lines come in between)
    genome_size_line = lines[target_line_index + 3]
    genome_size_words = genome_size_line.split()
    genome_size = genome_size_words[0]
    print(f"Genome size (Mb): {genome_size}")
else:
    print("Streptomyces anthocyanicus JCM 5058 not found")


GenBank number: GCA_014651155.1
RefSeq number: GCF_014651155.1
Genome size (Mb): 8.187
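
The same line-splitting idea can be applied to every row rather than a single strain. The block below is a sketch under stated assumptions: `parse_assemblies` is a hypothetical helper, `sample_text` copies the first two rows of the output shown earlier, and the layout (accessions in the second and third columns, genome size three lines below the row) is assumed to match that output.

```python
def parse_assemblies(text):
    """Return one dict per assembly row found in `text` (hypothetical helper)."""
    lines = text.splitlines()
    records = []
    for i, line in enumerate(lines):
        words = line.split()
        # An assembly row has GenBank (GCA_) and RefSeq (GCF_) accessions
        # in its second and third columns
        if len(words) >= 4 and words[1].startswith("GCA_") and words[2].startswith("GCF_"):
            size = None
            # The size is expected three lines below; guard against truncated text
            if i + 3 < len(lines):
                size_words = lines[i + 3].split()
                if size_words:
                    try:
                        size = float(size_words[0])
                    except ValueError:
                        size = None
            records.append({
                "assembly": words[0],
                "genbank": words[1],
                "refseq": words[2],
                "organism": " ".join(words[3:]).replace(" (strain)", ""),
                "size_mb": size,
            })
    return records


# First two rows of the assembly_info output shown earlier
sample_text = """ASM1465079v1 GCA_014650795.1 GCF_014650795.1 Streptomyces anthocyanicus JCM 4739 (strain)
NCBI RefSeq
Submitter
8.675 Scaffold Sep, 2020 BMVQ01
ASM3591765v1 GCA_035917655.1 GCF_035917655.1 Streptomyces anthocyanicus NBC 01777 (strain)
NCBI RefSeq
Submitter
8.876 Complete Jan, 2024"""

for record in parse_assemblies(sample_text):
    print(record)
```

In the notebook, `parse_assemblies(assembly_info)` would be called on the full text instead of `sample_text`.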


<span style="color:blue;">We can also extract these values with a regular-expression pattern instead of splitting and looping.</span>

In [83]:
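import re

# Note: re.search returns only the first match, so these patterns pick up
# the first strain listed in assembly_info rather than a specific one
# Pattern for extracting GenBank and RefSeq numbers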
genbank_refseq_pattern = r'(GCA_\d+\.\d+)\s+(GCF_\d+\.\d+)\s+Streptomyces anthocyanicus\s+\w+\s+\d+\s*\(strain\)\s*NCBI RefSeq\s*Submitter'

# Pattern for extracting genome size
genome_size_pattern = r'Streptomyces anthocyanicus\s+\w+\s+\d+\s*\(strain\)\s*NCBI RefSeq\s*Submitter\s*(\d+\.\d+)'

# Find GenBank and RefSeq numbers
genbank_refseq_match = re.search(genbank_refseq_pattern, assembly_info)
if genbank_refseq_match:
    genbank_number = genbank_refseq_match.group(1)
    refseq_number = genbank_refseq_match.group(2)
    print(f"GenBank number: {genbank_number}")
    print(f"RefSeq number: {refseq_number}")
else:
    print("GenBank and RefSeq numbers not found")

# Find genome size
genome_size_match = re.search(genome_size_pattern, assembly_info)
if genome_size_match:
    genome_size = genome_size_match.group(1)
    print(f"Genome size (Mb): {genome_size}")
else:
    print("Genome size not found")


GenBank number: GCA_014650795.1
RefSeq number: GCF_014650795.1
Genome size (Mb): 8.675
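
The patterns above are not tied to a particular strain, so `re.search` returns the first row that matches, which is why the output shows the JCM 4739 accessions. As a minimal sketch of one way to target a chosen strain, the pattern below embeds the strain name with `re.escape`. The `find_strain` helper is hypothetical, and `sample_text` copies the first two rows of the output shown earlier.

```python
import re


def find_strain(text, strain):
    """Return accessions and genome size for `strain`, or None if absent.

    Hypothetical helper; assumes the row layout seen in assembly_info.
    """
    pattern = (
        r"(GCA_\d+\.\d+)\s+(GCF_\d+\.\d+)\s+"
        + re.escape(strain)  # match the strain name literally
        + r"\s*\(strain\)\s*NCBI RefSeq\s*Submitter\s*(\d+\.\d+)"
    )
    match = re.search(pattern, text)
    if match is None:
        return None
    genbank, refseq, size = match.groups()
    return {"genbank": genbank, "refseq": refseq, "size_mb": float(size)}


# First two rows of the assembly_info output shown earlier
sample_text = """ASM1465079v1 GCA_014650795.1 GCF_014650795.1 Streptomyces anthocyanicus JCM 4739 (strain)
NCBI RefSeq
Submitter
8.675 Scaffold Sep, 2020 BMVQ01
ASM3591765v1 GCA_035917655.1 GCF_035917655.1 Streptomyces anthocyanicus NBC 01777 (strain)
NCBI RefSeq
Submitter
8.876 Complete Jan, 2024"""

print(find_strain(sample_text, "Streptomyces anthocyanicus NBC 01777"))
```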
