Web Scraping with Selenium

In the other repository, I introduced how to use Beautiful Soup for a variety of quick and simple web-scraping tasks. Beautiful Soup is great, but it has its limitations: sometimes the content we want to scrape is hidden behind buttons and links that we cannot reach directly through a URL. There are also more sophisticated browsing tasks, like filling in forms and text boxes, that we might want to automate. For these purposes it is good to reach for another package: Selenium.

Selenium is actually much more than a Python package; it's a whole framework for automating web browsers for the purpose of testing web applications, and it has been ported to a variety of programming languages in addition to Python. The main reason to be aware of this is that, if you ever need to Google something about Selenium, you should include both Selenium and Python in your search query; otherwise you will probably get a lot of results in Java (and who wants that). Also, if you ever have questions about Selenium, the unofficial documentation is always a good place to start.


Install Selenium

Open bash and run:

pip install selenium

# Standard imports
import pandas as pd
import time

# For web scraping
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.by import By

Download driver

In order to use Selenium, you must download a driver to interface with your chosen browser. Currently, Selenium supports Chrome, Firefox, Safari, and Edge. You can find a link to the driver for the browser of your choice here. Be sure to download the driver that matches the version of your chosen browser!

For the purposes of this kickoff, we'll be using Selenium with Chrome. To do this, we must first download ChromeDriver. For example, if your current Chrome browser version is 89, you have to download ChromeDriver version 89. (You can check your version at chrome://settings/help.)

The ChromeDriver file, once unzipped, is a single executable called chromedriver. You may keep this file anywhere on your computer, but it is best to place it in an easy-to-reference location. For example, if you save chromedriver in your Downloads folder, the directory path would look like /Users/<username>/Downloads/chromedriver. We can then assign that directory path to a variable:

# Save the path to the chromedriver executable in a variable
chrome_path = '/Users/quanganhpham/Downloads/chromedriver'  # directory path to your chromedriver
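
If you would rather not manage the driver binary by hand, the third-party webdriver-manager package can download a matching ChromeDriver for you. This is an optional alternative, not part of this tutorial; it assumes you have run pip install webdriver-manager:

# A minimal sketch, assuming the webdriver-manager package is installed
from webdriver_manager.chrome import ChromeDriverManager

# Downloads a driver matching your Chrome version and returns its path
chrome_path = ChromeDriverManager().install()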

Today we will be scraping information (name / brand / review contents) about the product Solaray, Vitamin D3 + K2, Soy-Free, 125 mcg (5000 IU), 60 VegCaps from this LINK.

# Initialize two Chrome drivers: one for the main product page,
# one for the review pages
driver = webdriver.Chrome(chrome_path)
driver1 = webdriver.Chrome(chrome_path)


# Assign the product URL to a variable
product_url = "https://ca.iherb.com/pr/Solaray-Vitamin-D3-K2-Soy-Free-125-mcg-5000-IU-60-VegCaps/70098"

# Navigate the first driver to the product page
driver.get(product_url)
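
Note: passing the driver path positionally, as above, works on Selenium 3 but is deprecated in Selenium 4. If you are on Selenium 4, the equivalent is to wrap the path in a Service object:

# Selenium 4 style: wrap the driver path in a Service object
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(chrome_path))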

First, we can get the product brand on the main product page by using this syntax: find_element_by_xpath('.//*[@id="brand"]/a/span/bdi').get_attribute('textContent'). This finds the element whose XPath starts from id="brand" and follows the tags a -> span -> bdi; then we use the function get_attribute() with 'textContent' to parse the product brand, which is Solaray.
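
Putting that together:

# Parse the product brand from the main product page
brand = driver.find_element_by_xpath('.//*[@id="brand"]/a/span/bdi').get_attribute('textContent')
print(brand)  # Solaray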

(We could also parse the product name from the main page; however, we won't, because we want to try another function, find_element_by_css_selector, later.)

We want to scrape the review contents of the product. However, all the review contents live behind the View All Reviews link rather than on the main product page. Therefore, we have to get that link by using Selenium to parse the attribute which contains the URL to all the reviews.

As for the product name, we can find it on the View All Reviews page. This is where find_element_by_css_selector comes in handy, using the following syntax: find_element_by_css_selector('[class="nav-product-link-text"] span').text. This finds the element with class="nav-product-link-text", descends to its span tag, and extracts the text from that tag with .text.

# Set an explicit wait on the driver (up to 4 seconds;
# raises TimeoutException if the element never appears)
wait = WebDriverWait(driver, 4)

# Locate the `View All Reviews` link
link = wait.until(expected_conditions.presence_of_element_located((By.CSS_SELECTOR, "span.all-reviews-link > a")))

# Get the `View All Reviews` URL
x = link.get_attribute("href")

# Check the link
x
'https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098'
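
With the reviews URL in hand, we can load it in the second browser and pull the product name with the CSS selector described above (a quick sketch; the loop below does the same thing):

# Navigate the second driver to the reviews page and parse the product name
driver1.get(x)
name = driver1.find_element_by_css_selector('[class="nav-product-link-text"] span').text
print(name)  # Vitamin D3 + K2, Soy-Free, 60 VegCaps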

Now we create two for loops:
- In the outer loop, we build the URL of each review page we want to scrape.
- In the inner loop, we scrape the data we need from every review on that page.

# Create lists for the dataframe
item_name = []
item_brand = []
review_contents = []

# Scrape at most 3 pages in the review section
max_page_num = 3

for page_num in range(1, max_page_num + 1):

    review_url = x + "?&p=" + str(page_num)
    print(review_url)

    # Navigate the second driver to review_url
    driver1.get(review_url)

    # Get all the review elements on the page
    review_containers = driver1.find_elements_by_class_name('review-row')

    for container in review_containers:
        # Add the review contents
        review_contents.append(container.find_element_by_class_name('review-text').text)
        # Add the product name (from the review page)
        item_name.append(driver1.find_element_by_css_selector('[class="nav-product-link-text"] span').text)
        # Add the product brand (from the main product page, still open in `driver`)
        item_brand.append(driver.find_element_by_xpath('.//*[@id="brand"]/a/span/bdi').get_attribute('textContent'))

    # Pause between pages to be polite to the site
    time.sleep(4)
https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098?&p=1
https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098?&p=2
https://ca.iherb.com/r/Solaray-Vitamin-D3-K2-Soy-Free-60-VegCaps/70098?&p=3
# Create a dataframe
df_product = pd.DataFrame({'item_brand': item_brand,
                           'item_name': item_name,
                           'review_contents': review_contents})

# Check the dataframe shape
df_product.shape
(20, 3)
# Check the dataframe
df_product.head(15)
item_brand item_name review_contents
0 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps Everyone around said that in Russia everyone, ...
1 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps So far I can not appreciate the dignity of thi...
2 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps I am surprised by the reviews of people who de...
3 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps very cool product, I
4 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps Very cool product, I recommend it to everyone
5 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps cool very cool product recommend it
6 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps I recommend
7 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps Because of the large cans, they noticed that t...
8 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps Very, very cool product, I recommend
9 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps After a course of these vitamins, as my nutrit...
10 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps I love that this supplement contains vitamin K...
11 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps The excellent formula of this drug will provid...
12 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps Simply the best vitamin D3 complex! The dosage...
13 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps After reading reviews about the lack of vitami...
14 Solaray Vitamin D3 + K2, Soy-Free, 60 VegCaps They drank the whole family. Raises vitamin D ...
# Let's make a CSV file from the dataframe
df_product.to_csv('product_review.csv', index=False, header=True)
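
To double-check the export, you can read the file back:

# Read the CSV back and confirm the shape matches df_product
df_check = pd.read_csv('product_review.csv')
df_check.shape  # expect (20, 3)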

Lastly, you can use Selenium to close your browsers. (Or you can simply close the browser windows yourself.)

driver.quit()
driver1.quit()
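
As a side note, recent versions of Selenium also let you use the driver as a context manager, which quits the browser automatically when the block ends (a minimal sketch, not part of this tutorial's flow):

# The browser quits automatically when the with-block exits
with webdriver.Chrome(chrome_path) as d:
    d.get(product_url)
    # ... scrape here ...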

END
