Selenium can be used to scrape websites that cannot be crawled with Beautiful Soup and urllib alone, typically because they load their content dynamically.
<br>
<b>Selenium WebDriver</b> is a tool for automating web applications. It can drive browsers such as Chrome, Firefox, Safari and so on.
<br>
<b>ActionChains</b> automates low-level interactions such as mouse movements, mouse button actions, key presses and context menu interactions.
<br>
<b>WebDriverWait</b> waits, up to a given timeout, until an element has loaded or some other expected condition holds. Read more about the available conditions <a href="https://selenium-python.readthedocs.io/waits.html">here</a>.
<br>
<b>expected_conditions</b> lets us check whether conditions are met, such as whether an item is present or visible.
<br>
<b>TimeoutException</b> is raised when an element is not found within the specified time period.
<br>
<b>NoSuchElementException</b> is raised when an element cannot be found.
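
To make the waiting behaviour concrete, here is a minimal sketch of the idea behind an explicit wait: call a condition repeatedly until it returns something truthy, and raise an error once the timeout passes. It uses only the standard library and is not Selenium's own implementation. The names `wait_until` and `WaitTimeout` are hypothetical stand-ins for `WebDriverWait(...).until(...)` and `TimeoutException`.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition is not met in time (stand-in for TimeoutException)."""


def wait_until(condition, timeout, poll_interval=0.5):
    """Call `condition` repeatedly until it is truthy or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Usage: a fake "element" that only becomes available on the third poll.
attempts = {"count": 0}


def element_is_present():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_is_present, timeout=2, poll_interval=0.01)
print(found, attempts["count"])
```

In the scraper below, the condition is an `expected_conditions` check such as `presence_of_element_located`, and the timeout is the `DELAY` constant.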

In [2]:
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, StaleElementReferenceException
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

Create a class whose objects will later store the scraped data.

In [1]:
class Item():
    def __init__(self, product_name, original_price, current_price, discount, rating, no_of_rating, image_source, product_url):
        self.product_name = product_name
        self.original_price = original_price
        self.current_price = current_price
        self.discount = discount
        self.rating = rating
        self.no_of_rating = no_of_rating
        self.image_source = image_source
        self.product_url = product_url
    def to_dict(self):
        return {
            'match score': self.product_name,  # holds the product name, despite the key
            'original price': self.original_price,
            'current price': self.current_price,
            'discount': self.discount,
            'rating': self.rating,
            'number of rating': self.no_of_rating,
            'image source': self.image_source,
            'product url': self.product_url
        }
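
As a possible alternative, the same kind of record can be written as a dataclass, which generates `__init__` and provides `asdict` for the dictionary conversion. This is only a sketch: `ProductRecord` is a hypothetical name, the sample values are made up, and `asdict` uses the field names as keys rather than the spaced keys returned by `to_dict` above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProductRecord:  # hypothetical name, to avoid clashing with Item
    product_name: str
    product_url: str
    image_source: Optional[str] = None
    original_price: Optional[str] = None
    current_price: Optional[str] = None
    discount: Optional[str] = None
    rating: Optional[int] = None
    no_of_rating: Optional[str] = None


# Made-up sample values; fields that were not scraped default to None.
record = ProductRecord(
    product_name="Example laptop",
    product_url="https://example.com/item",
    current_price="639.99",
    rating=4,
)
row = asdict(record)  # plain dict, one key per field
print(row)
```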

Specify the maximum time, in seconds, to wait for an element to appear, along with the URL of the site.

In [3]:
DELAY = 25
URL = 'https://www.newegg.com/'

Specify the list of search keywords.

In [4]:
keywords = ['acer', 'dell']

Load the Chrome driver and open the URL. Wait for the search box to appear, enter a keyword and submit it. Sort the results, step through the result pages and extract the product data. Once every keyword has been processed, close the driver.

In [26]:
def get_rawdata(driver, keyword):
    product_list = []
    count = 0
    try:
        if count == 0:
            searchbar =  WebDriverWait(driver, DELAY).until(EC.presence_of_element_located((By.ID, "haQuickSearchBox")))
            searchbar.send_keys(keyword)
            searchbar.submit()
        
        dropdown = WebDriverWait(driver, DELAY).until(EC.presence_of_element_located((By.ID, "Order_top")))
        best_selling = dropdown.find_element(By.XPATH, '//option[text()="Best Selling"]')
        best_selling.click()
        
        driver.implicitly_wait(10)
        
        isNext = True
        while isNext:
            try:
                nextbtn =  WebDriverWait(driver, DELAY).until(EC.presence_of_element_located((By.XPATH, '//button[contains(@title, "Next")]')))
                if nextbtn.is_enabled():
                    nextbtn.click()
                else:
                    isNext = False
                count += 1
                if count >5:
                    isNext = False
            except StaleElementReferenceException:
                driver.implicitly_wait(10)                        
            except NoSuchElementException:
                isNext = False
        # wait for the result grid to load, then collect every grid container
        WebDriverWait(driver, DELAY).until(EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "is-grid")]')))
        containers = driver.find_elements(By.XPATH, '//div[contains(@class, "is-grid")]')
        # parse each container's HTML with BeautifulSoup
        for container in containers:
            rawData = container.get_attribute('outerHTML')
            bs = BeautifulSoup(rawData, 'html.parser')
            item_lists = bs.select('div.item-container')
            for item in item_lists:
                product_url_container = item.select_one('a.item-title')
                product_url = product_url_container['href']
                product_name = product_url_container.get_text().strip()
                product_image = item.find('img')['src']
                rating_value = np.nan
                previous_price_value = np.nan
                current_price_value = np.nan
                discount_value = np.nan
                num_rating = np.nan
                
                rating_container = item.find('i', {'class': 'rating'})
                if rating_container is not None:
                    rating_value = int(rating_container['class'][-1].split('-')[-1])
                num_rating_container = item.select_one('span.item-rating-num')  
                if num_rating_container is not None:
                    num_rating = num_rating_container.get_text().strip().replace('(','').replace(')','') 
                
                previous_price = item.select_one('span.price-was-data')
                if previous_price is not None:
                    previous_price_value = previous_price.get_text().strip()
                current_price = item.select_one('li.price-current')
                if current_price is not None:
                    dollars = current_price.find('strong').get_text().strip()
                    cents = current_price.find('sup').get_text().strip()
                    current_price_value = dollars+cents
                discount = item.select_one('span.price-save-percent')
                if discount is not None:
                    discount_value = discount.get_text().strip()
                product = Item(product_name, previous_price_value, current_price_value, discount_value, rating_value, num_rating, product_image, product_url)
                product_list.append(product)
    except TimeoutException:
        print("Loading took too much time!")
    return product_list
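
The page-traversal step can also be sketched independently of Selenium. If results from every page are wanted, the loop has to read the current page before advancing to the next one. The sketch below assumes two hypothetical callbacks, `read_page` and `go_to_next_page`, and an assumed page cap `max_pages`, similar in spirit to the `count` limit above. It runs against an in-memory stand-in rather than a real browser.

```python
def collect_all_pages(read_page, go_to_next_page, max_pages=6):
    """Collect items page by page.

    read_page()       -> list of items on the current page
    go_to_next_page() -> True if it moved to a next page, False if there is none
    """
    items = []
    for _ in range(max_pages):
        items.extend(read_page())   # read the current page first
        if not go_to_next_page():   # then try to advance
            break
    return items


# Usage with an in-memory stand-in for a paginated result set.
pages = [["a1", "a2"], ["b1"], ["c1", "c2"]]
state = {"index": 0}


def read_page():
    return pages[state["index"]]


def go_to_next_page():
    if state["index"] + 1 < len(pages):
        state["index"] += 1
        return True
    return False


results = collect_all_pages(read_page, go_to_next_page)
print(results)
```

With Selenium, `read_page` would parse the grid containers of the current page and `go_to_next_page` would click the Next button if it is enabled.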

The raw HTML obtained is parsed into the required fields. Each product becomes an object of the class created above, and the objects are collected in a list.
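
The field extraction can be illustrated on its own with plain strings. The helpers below are a sketch with hypothetical names. They assume the formats that the function above relies on: the rating is encoded at the end of the last CSS class name, the number of ratings is wrapped in parentheses, and the price is split into a dollar part and a cent part.

```python
def parse_rating(css_classes):
    """Return the rating encoded in the last class name, e.g. ['rating', 'rating-4'] -> 4."""
    return int(css_classes[-1].split("-")[-1])


def parse_rating_count(text):
    """Return the number of ratings from text such as '(12)' as an int."""
    return int(text.strip().strip("()").replace(",", ""))


def join_price(dollars, cents):
    """Join the dollar and cent fragments, e.g. '1,299' and '.00' -> '1,299.00'."""
    return dollars.strip() + cents.strip()


# Sample inputs in the assumed formats.
print(parse_rating(["rating", "rating-4"]))
print(parse_rating_count(" (12) "))
print(join_price("1,299", ".00"))
```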

A simple function combines the steps above so that the code is more modular.

In [20]:
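def get_all_data(url, keywords):
    product_list = []
    driver = webdriver.Chrome()
    try:
        for keyword in keywords:
            # start each search from the home page
            driver.get(url)
            product_list.extend(get_rawdata(driver, keyword))
    finally:
        # quit() ends the browser session even if scraping raised an error
        driver.quit()
    return product_list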

Get the list of products using the combined function.

In [27]:
product_list = get_all_data(URL, keywords)

<b>Create a dataframe and clean the data</b>
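
The scraped prices and discounts are strings such as '1,299.00' and '32%', so cleaning usually means converting them to numbers. The helpers below are a minimal sketch with hypothetical names (`clean_price`, `clean_discount`). They assume that missing values arrive as non-string values such as NaN and that discounts are plain percentages.

```python
import math


def clean_price(value):
    """Convert a price string such as '1,299.00' to 1299.0; non-strings become NaN."""
    if not isinstance(value, str):
        return math.nan
    return float(value.replace("$", "").replace(",", "").strip())


def clean_discount(value):
    """Convert a percentage string such as '32%' to 0.32; non-strings become NaN."""
    if not isinstance(value, str):
        return math.nan
    return float(value.strip().rstrip("%")) / 100


# Sample values in the assumed formats.
print(clean_price("1,299.00"))
print(clean_discount("32%"))
print(clean_price(float("nan")))
```

Once the dataframe exists, helpers like these could be applied column by column, for example with `df['current price'].map(clean_price)`.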

In [28]:
df = pd.DataFrame.from_records([product_item.to_dict() for product_item in product_list])
df.to_csv('new_egg_data.csv', index = False)
print(df.head())

  current price discount                                       image source  \
0      1,299.00      32%  https://c1.neweggimages.com/ProductImageCompre...   
1        219.73      NaN  https://c1.neweggimages.com/ProductImageCompre...   
2        889.99      19%  https://c1.neweggimages.com/ProductImageCompre...   
3        639.99      19%  https://c1.neweggimages.com/ProductImageCompre...   
4      1,049.00      38%  https://c1.neweggimages.com/ProductImageCompre...   

                                         match score number of rating  \
0  Acer Nitro 50 Gaming Desktop,8th Gen Intel Cor...              NaN   
1  Acer TravelMate B1 B118-M TMB118-M-C80T 11.6" ...              NaN   
2  Acer Nitro 50 Desktop, Intel 6-Core i5-8400 Up...              NaN   
3  Acer Aspire TC-885 Desktop, Intel 6-Core i5-84...                1   
4  Acer Aspire GX Gaming Desktop,AMD Ryzen 7 1700...              NaN   

  original price                                        product url  rating  
0       