<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CIS9650/blob/main/Selenium_SimplePage_EXAMPLE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Selenium Simple Page Example  
  
<b>This notebook uses Selenium to read Dr. Chuck's simple web page at https://www.dr-chuck.com/page1.htm </b>
  
  
*by Professor Patrick, 2024*  


## Install Necessary Packages:   
  
Install **Selenium** and the `google-colab-selenium` package, which simplifies the setup process in **Colab** environments.

In [None]:
!pip install selenium google-colab-selenium

There is no need to set up ChromeDriver manually: the `google-colab-selenium` package manages the installation and configuration of `ChromeDriver`, ensuring compatibility with the Colab environment.

## Initialize the WebDriver and Access the Web Page:  
  
Use the `google_colab_selenium` module to create a `Chrome WebDriver` instance and navigate to the desired URL (https://www.dr-chuck.com/page1.htm).

### 0. Housekeeping  

Import the package as `gcs` to save typing.

In [None]:
import google_colab_selenium as gcs

The `google-colab-selenium` package simplifies the setup of **Selenium** and **ChromeDriver** in **Google Colab** environments. However, it doesn't include all **Selenium** components. Specifically, the `By` class, which provides the locator strategy constants used to find web elements, is part of **Selenium** itself and isn't included in `google-colab-selenium`. Therefore, we need to import it directly from **Selenium** using:  
  
```
from selenium.webdriver.common.by import By
```

In [None]:
from selenium.webdriver.common.by import By

### 1. Initialize Web Driver  
  
Using the `Chrome` class in `google_colab_selenium`, which we aliased as `gcs`, we create a variable `driver` to hold our web driver.   

You will notice some delay when we initialize the web driver in a Google Colab notebook using `gcs.Chrome()`, because the following sequence of actions occurs, each printing a message:  
  

<b><u>APT Update and Upgrade:</u></b>  
  
- The system's package list is updated (apt-get update), and installed packages are upgraded (apt-get upgrade).  
  
*This results in the printed message:* "Updated and upgraded APT"  
  
<b><u>Google Chrome Download:</u></b>    
  
- The latest version of Google Chrome is downloaded and installed.  
  
*This results in the printed message:* "Downloaded Google Chrome"  
  
<b><u>ChromeDriver Initialization:</u></b>    
  
- ChromeDriver, which enables Selenium to control Chrome, is initialized and configured.  
  
*This results in the printed message:* "Initialized Chromedriver"  
  
These steps ensure that both **Google Chrome** and **ChromeDriver** are properly installed and compatible, allowing **Selenium** to function correctly within the **Colab** environment.  


In [None]:
# Initialize the WebDriver
driver = gcs.Chrome()

### 2. Navigate to Desired Page  
  
We pass a string ("https://www.dr-chuck.com/page1.htm") with the URL of the desired web page to our web driver's `.get()` method, and the browser navigates to that page by issuing an **HTTP** `GET` request.  

In [None]:
# Navigate to the target page
driver.get("https://www.dr-chuck.com/page1.htm")
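To see what a `GET` request looks like without launching a browser, the standard library can build one and report its method. This is only a sketch using `urllib.request.Request`; it is not what Selenium runs internally, and nothing is sent over the network.

```python
from urllib.request import Request

# Build (but do not send) a request for the page used in this notebook.
request = Request("https://www.dr-chuck.com/page1.htm")

# With no request body attached, the method defaults to GET.
print(request.get_method())

# The URL is split into the host to contact and the path to ask for.
print(request.host)
print(request.selector)
```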

### 3. Retrieve the Page Title  
  
If invoking the `.get()` method was successful, we should be able to extract the **title** from the **HTML** of the target page.  Unfortunately for us, Dr. Chuck's ['The First Page'](https://www.dr-chuck.com/page1.htm) has no **HTML** `<title>` tag, so expect `driver.title` to print as an empty string.   
  


In [None]:
# Retrieve and print the page title
print(driver.title)
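Because a page without a `<title>` element should yield an empty string, the cell above prints what looks like a blank line. A small helper can make that case visible. `title_or_placeholder` is a hypothetical name introduced for this sketch; it is not part of Selenium.

```python
def title_or_placeholder(title, placeholder="(no title)"):
    """Return the title, or a placeholder when it is empty or only whitespace."""
    cleaned = (title or "").strip()
    return cleaned if cleaned else placeholder


# Plain strings stand in for driver.title so this runs without a browser.
print(title_or_placeholder(""))
print(title_or_placeholder("The First Page"))
```

In the notebook, the call would be `print(title_or_placeholder(driver.title))`.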

#### <b>Selenium's `.find_elements()`</b>  
  
Selenium's `find_elements()` method locates multiple web elements on a webpage. It returns a list of all elements matching a specified locator strategy, so the results can be processed together in a loop.  
  
<b><u>Key Features of find_elements():</u></b>  
  
- Multiple Element Retrieval: Unlike `find_element()`, which retrieves only the first matching element, `find_elements()` fetches all elements that match the given criteria.  
  
- Flexible Locator Strategies: Utilizes various strategies to locate elements, such as by ID, name, class name, tag name, link text, partial link text, CSS selector, and XPath.  
  
<b><u>Common Locator Strategies:</u></b>  
  
- <b>By.ID:</b> Locates elements with a specific id attribute.  
- <b>By.NAME:</b> Targets elements with a specific name attribute.  
- <b>By.CLASS_NAME:</b> Selects elements with a specific **CSS** class.  
- <b>By.TAG_NAME:</b> Finds elements with a specific **HTML** tag.  
- <b>By.LINK_TEXT:</b> Identifies links (`<a>` tags) with exact matching text.  
- <b>By.PARTIAL_LINK_TEXT:</b> Finds links containing the specified text.  
- <b>By.CSS_SELECTOR:</b> Uses **CSS** selectors to locate elements.  
- <b>By.XPATH:</b> Employs **XPath** expressions to find elements.  
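One practical consequence of the difference between the two methods: when nothing matches, `find_elements()` returns an empty list, whereas `find_element()` raises `NoSuchElementException`. The sketch below relies on the empty-list behavior to pick a first match safely. `first_or_none` is a hypothetical helper name, and the plain lists stand in for the result of a `find_elements()` call so the block runs without a browser.

```python
def first_or_none(elements):
    """Return the first item of a find_elements() result, or None if it is empty."""
    return elements[0] if elements else None


# An empty result (no match) versus a result with two matches.
print(first_or_none([]))
print(first_or_none(["first match", "second match"]))
```

With a live driver, the call might look like `first_or_none(driver.find_elements(By.TAG_NAME, 'h1'))`.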

In the following example, we call `.find_elements()` and then iterate through the elements it returns, in two steps:   
1. **Element Location:** Calls `.find_elements(By.TAG_NAME, f'h{level}')`, where `level` runs from 1 to 6 as provided by `range(1, 7)` in the for-loop statement, to retrieve all heading elements at that level.  
2. **Data Extraction:** Iterates through each heading element, extracting and printing its `.text` attribute.  


### 4. Retrieve Up to 6 Heading Levels  


In [None]:
# Extract and print all headings (h1 to h6)
for level in range(1, 7):
    headings = driver.find_elements(By.TAG_NAME, f'h{level}')
    for heading in headings:
        print(f"{heading.tag_name}: {heading.text}")
    print(f"The variable `headings` is a {type(headings)} with {len(headings)} elements at level {level}.")

The `By` class imported in the housekeeping step supplies the locator strategy used in this loop (`By.TAG_NAME`). The same call pattern works with other strategies, such as class name or CSS selector.
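As an alternative to six separate lookups, a single CSS selector such as `"h1, h2, h3, h4, h5, h6"` can return every heading in document order. The sketch below groups such a result by tag name. `group_text_by_tag` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for Selenium elements (they only mimic `.tag_name` and `.text`) so the block can run without a browser.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_text_by_tag(elements):
    """Group element text by tag name, keeping the order the elements arrive in."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[element.tag_name].append(element.text)
    return dict(grouped)


# Stand-ins for what
#   driver.find_elements(By.CSS_SELECTOR, "h1, h2, h3, h4, h5, h6")
# might return; the heading text here is made up for illustration.
sample = [
    SimpleNamespace(tag_name="h1", text="Main heading"),
    SimpleNamespace(tag_name="h2", text="First section"),
    SimpleNamespace(tag_name="h2", text="Second section"),
]

print(group_text_by_tag(sample))
```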

### 5. Extract All Links (URLs)  


In the following example, we call `.find_elements()` and then iterate through the elements it returns, in two steps:   
1. **Element Location:** Calls `.find_elements(By.TAG_NAME, 'a')` to retrieve all anchor (`<a>`) elements.  
2. **Data Extraction:** Iterates through each anchor element, extracting and printing the `'href'` (link) attribute.  

In [None]:
# Locate all elements with the tag name 'a' (hyperlinks)
links = driver.find_elements(By.TAG_NAME, 'a')

# Iterate through the list and print the href attribute of each link
for link in links:
    print(link.get_attribute('href'))
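An `<a>` tag without an `href` makes `.get_attribute('href')` return `None`, and the same URL can appear more than once on a page. The sketch below filters out missing values and removes duplicates. `unique_hrefs` and `FakeLink` are hypothetical names; `FakeLink` only mimics `.get_attribute()` so the block runs without a browser, and the URLs are made up.

```python
class FakeLink:
    """Stand-in for a Selenium element; supports only get_attribute()."""

    def __init__(self, **attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        # Like Selenium, return None when the attribute is absent.
        return self._attributes.get(name)


def unique_hrefs(links):
    """Return each non-empty href once, in first-seen order."""
    seen = []
    for link in links:
        href = link.get_attribute("href")
        if href and href not in seen:
            seen.append(href)
    return seen


sample_links = [
    FakeLink(href="https://example.com/a"),
    FakeLink(name="anchor-without-href"),
    FakeLink(href="https://example.com/a"),
    FakeLink(href="https://example.com/b"),
]

print(unique_hrefs(sample_links))
```

With a live driver, the call might look like `unique_hrefs(driver.find_elements(By.TAG_NAME, 'a'))`.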

  
The following script uses `.find_elements()` with an **XPath** expression to locate only the anchor (`<a>`) tags that have an `href` attribute. It then iterates through the list of elements, retrieving the value of each element's `href` attribute using the `.get_attribute("href")` method.  

In [None]:
# Find all anchor tags with href attributes
links = driver.find_elements(By.XPATH, "//a[@href]")

# Extract and print the href attribute of each link
for link in links:
    href = link.get_attribute("href")
    print(href)

### 6. Housecleaning: Close the Driver  
  
We invoke the `driver`'s `.quit()` method to end the browser session, which closes every window it opened and shuts down ChromeDriver.  


In [None]:
# Close the WebDriver
driver.quit()
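If an earlier cell raises an exception, `driver.quit()` may never run and the browser process stays alive. One way to guard against that is a context manager that always quits on exit. This is a sketch under stated assumptions: `managed_driver` and `FakeDriver` are hypothetical names, and `FakeDriver` only records calls so the block runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs whether the body finished normally or raised.
        driver.quit()


class FakeDriver:
    """Stand-in that records whether quit() was called."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


with managed_driver(FakeDriver) as demo:
    pass  # a real session would call demo.get(...) here

print(demo.quit_called)
```

In the notebook, this might read `with managed_driver(gcs.Chrome) as driver:` followed by the `driver.get(...)` and `driver.find_elements(...)` calls.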



---



## Exercises for Practice  

Try solving these tasks to deepen your understanding:  

1. Extract all paragraph (`<p>`) tags from the webpage.
2. Extract elements from [Dr. Chuck's Page 2](http://www.dr-chuck.com/page2.htm)  



---



## Summary  
  
In summary, while `google-colab-selenium` streamlines the setup process in Colab, it does not include every Selenium component. Importing `By` directly from Selenium gives you the locator strategies needed to find elements.



---






