# Web Scraping for Infinite Scrolling Websites Using Selenium
Many websites now use a feature called [infinite scroll](https://en.wiktionary.org/wiki/infinite_scroll), which automatically loads more content as the user scrolls down, so that users don't need to click "next page" to see more. While this is convenient for users, it makes web scraping harder. In this notebook, I show the code I developed to scrape infinite scrolling web pages automatically, and demonstrate how to use it with the Sephora makeup shopping page as an example.

## Required Python Libraries
- **time** (Python built-in library)
- **Selenium** (I use ChromeDriver for this project. Downloadable from this [site](https://chromedriver.chromium.org/downloads))
- **BeautifulSoup** from bs4

## Problem Definition
Let's say that my objective is to obtain the URLs of all the products on the [Sephora makeup shopping page](https://www.sephora.com/ca/en/beauty/new-makeup?country_switch=ca&lang=en). Inspecting the page shows that the hyperlink for each product is in an a-tag with class "css-ix8km1", under a div-tag with class "css-1s223mm". The standard procedure would be to find the a-tag with class "css-ix8km1" under each div-tag with class "css-1s223mm" and extract the hyperlink, as shown below.

In [1]:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import urljoin

In [2]:
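# Note: executable_path is Selenium 3 syntax; Selenium 4 deprecated it in favor of webdriver.Chrome(service=Service(path))
driver = webdriver.Chrome(executable_path=r"E:\Chromedriver\chromedriver_win32\chromedriver.exe")
driver.get("https://www.sephora.com/ca/en/beauty/new-makeup")
base = "https://www.sephora.com/"
urls = []
soup = BeautifulSoup(driver.page_source, "html.parser")
for parent in soup.find_all(class_="css-1s223mm"):
    a_tag = parent.find("a", class_="css-ix8km1")
    if a_tag is None:  # skip containers that hold no product link
        continue
    link = a_tag.attrs['href']
    urls.append(urljoin(base, link))  # turn the relative href into an absolute URL
len(urls)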

20
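
The extraction step can also be checked without opening a browser. The sketch below runs the same `find_all` / `find` / `urljoin` logic on a small hand-written HTML snippet. The snippet, the product paths in it, and the helper name `extract_product_urls` are made up for illustration; only the class names come from the page inspected above. Skipping a parent that has no matching a-tag is an assumption about how incomplete markup should be handled.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.sephora.com/"

# Hypothetical markup that mimics the structure described above
SAMPLE_HTML = """
<div class="css-dkxsdo">
  <div class="css-1s223mm"><a class="css-ix8km1" href="/ca/en/product/example-a">A</a></div>
  <div class="css-1s223mm"><a class="css-ix8km1" href="/ca/en/product/example-b">B</a></div>
  <div class="css-1s223mm"><span>placeholder without a link</span></div>
</div>
"""

def extract_product_urls(html, base=BASE_URL):
    """Return the absolute product URLs found in `html` (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for parent in soup.find_all(class_="css-1s223mm"):
        a_tag = parent.find("a", class_="css-ix8km1")
        # Assumption: a container without a link is skipped rather than treated as an error
        if a_tag is None or not a_tag.get("href"):
            continue
        urls.append(urljoin(base, a_tag["href"]))
    return urls

print(extract_product_urls(SAMPLE_HTML))
```

Passing `driver.page_source` instead of `SAMPLE_HTML` would apply the same logic to the live page.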

You can see that we only get 20 URLs. This is the number of products shown on the page before scrolling down, but the actual number is much larger. When Selenium opens the page, it is freshly loaded, and only five div-tags with class "css-dkxsdo" can be found (each "css-dkxsdo" tag contains four "css-1s223mm" tags, which gives 5 × 4 = 20 products). ![image](20items.PNG)

However, if we keep scrolling down, new items show up. This is also reflected in the HTML, where more than five div-tags with class "css-dkxsdo" can now be found. ![image2](100items.PNG)

Thus, we need to ask the browser to scroll down the page for us. Once all the HTML has loaded, we can grab the page source and use BeautifulSoup to obtain the URLs.

## Code

The code to scrape an infinite scrolling page is as follows:

In [4]:
# Web scraper for an infinite scrolling page
# Note: executable_path is Selenium 3 syntax; Selenium 4 deprecated it in favor of webdriver.Chrome(service=Service(path))
driver = webdriver.Chrome(executable_path=r"E:\Chromedriver\chromedriver_win32\chromedriver.exe")
driver.get("https://www.sephora.com/ca/en/beauty/new-makeup")
time.sleep(2)  # Allow 2 seconds for the web page to open
scroll_pause_time = 1  # You can set your own pause time. My laptop is a bit slow so I use 1 sec
screen_height = driver.execute_script("return window.screen.height;")  # height of the screen in pixels, used as the scroll step
i = 1
while True:
    # scroll one screen height each time
    driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))  
    i += 1
    time.sleep(scroll_pause_time)
    # update the scroll height after each scroll, as it can grow when new content loads
    scroll_height = driver.execute_script("return document.body.scrollHeight;")  
    # Break the loop when the height we need to scroll to is larger than the total scroll height
    if (screen_height) * i > scroll_height:
        break

base = "https://www.sephora.com/"
urls = []
soup = BeautifulSoup(driver.page_source, "html.parser")
for parent in soup.find_all(class_="css-1s223mm"):
    a_tag = parent.find("a", class_="css-ix8km1")
    if a_tag is None:  # skip containers that hold no product link
        continue
    link = a_tag.attrs['href']
    urls.append(urljoin(base, link))  # turn the relative href into an absolute URL
    
len(urls)

135

As we can see, this code automatically extracts all the URLs from the page: we now get 135 URLs instead of 20.
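
The loop above stops once the next scroll target exceeds the page's scroll height. A common variation, sketched here as an alternative rather than as the method used in this notebook, is to jump to the bottom repeatedly and stop when `document.body.scrollHeight` no longer grows. The function name `scroll_until_stable` and the `max_rounds` safety cap are hypothetical; the only driver method it relies on is `execute_script`, which is already used above.

```python
import time

def scroll_until_stable(driver, pause=1.0, max_rounds=200):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `max_rounds` is an assumed
    safety cap so that a page which never stops loading cannot loop forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight;")
    for round_no in range(1, max_rounds + 1):
        # Jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            # No new content appeared after this scroll, so assume we reached the end
            return round_no
        last_height = new_height
    return max_rounds
```

Calling `scroll_until_stable(driver, pause=scroll_pause_time)` would take the place of the `while True` loop. Jumping straight to the bottom may behave differently from scrolling one screen at a time on pages that load items only as they enter the viewport, and a slow connection can make the height look stable too early, so the pause may need tuning.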