# Web Scraping with Selenium

Web scraping is a simple way to get images (or other kinds of data) from the web. With the Selenium library, we can scrape a site in a few lines of code.

## Download ChromeDriver
To scrape images from the web, Selenium needs a browser driver. ChromeDriver is recommended (drivers for other browsers also work); to download it, [click here](https://chromedriver.chromium.org/).

For this example, we'll scrape images from Facebook.

## Importing Libraries

In [None]:
import os
import time

import wget
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

## Step 1: Log into a Facebook account

Before starting a web scraping task, we must know the website: how to navigate its pages and where notifications or pop-ups appear. If log-in is required, we also need the browser's developer tools to find the XPath of each element we interact with. For these reasons, we use ChromeDriver. The first steps to scrape images from Facebook are:

- Define the web driver
- Disable notifications
- Set the path to your browser
- Set the path to ChromeDriver
- Open the web page to scrape
- Locate the clickable elements on the page

In [None]:
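from selenium.webdriver.chrome.service import Service  # Selenium 4 driver service

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications": 2}  # 2 = block notifications
chrome_options.add_experimental_option("prefs", prefs)
chrome_options.binary_location = '/usr/bin/google-chrome'  # path to the browser
# Selenium 4 takes the driver path through a Service object and the options through `options=`
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'), options=chrome_options)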

driver.get("https://www.facebook.com") # to open the web page

# To get past the message "Accept cookies from Facebook on this browser?"
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[2]/div/div/div/div/div[3]/button[2]"))).click()

# To get the log-in elements
username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='email']")))
password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='pass']")))

# Log-in
username.clear()
username.send_keys("xxxxx@xxx.xxx")
password.clear()
password.send_keys("****")

# To get past the message asking to save the email and password
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[1]/div[2]/div[1]/div/div/div/div[2]/div/div[1]/form/div[2]/button"))).click()
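
The log-in cell above types the email and password as literal strings. As a minimal sketch, assuming the credentials are stored in two environment variables, they could be read at run time instead. The variable names `FB_EMAIL` and `FB_PASSWORD` and the helper `load_credentials` are hypothetical, chosen only for this illustration.

```python
import os


def load_credentials(env=os.environ):
    """Return (email, password) read from environment variables.

    FB_EMAIL and FB_PASSWORD are hypothetical names used for this sketch;
    they are not defined by Selenium or by Facebook.
    """
    email = env.get("FB_EMAIL")
    password = env.get("FB_PASSWORD")
    if not email or not password:
        raise RuntimeError("Set FB_EMAIL and FB_PASSWORD before running the scraper")
    return email, password
```

With this helper, the log-in lines would become `email, pw = load_credentials()`, followed by `username.send_keys(email)` and `password.send_keys(pw)`.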

## Step 2: Extracting the images from a Facebook page

Once we're logged in to Facebook, these are the steps to scrape images from a specific Facebook page:

- Create an empty list to store the image links
- Open the target page on Facebook
- Scroll down the page in a loop (Facebook loads its pages dynamically)
- Collect all anchor (a) elements on the page, which is where a Facebook page exposes its links
- Use a list comprehension to get all possible links
- Use another list comprehension to keep only the photo links
- Loop over the photo links: open each photo page, find the image, and append its link to the images list
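
The scrolling step can also stop early once the page no longer grows, instead of always running a fixed number of times. The following is a minimal sketch of that variation, not the method used in this notebook: the helper name `scroll_until_stable` and its parameters are hypothetical, and it only assumes that `driver` offers Selenium's `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=5.0, max_scrolls=500):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `pause` is the time in
    seconds to wait after each scroll so new content can load.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for scrolls in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so we assume the end of the page
            return scrolls
        last_height = new_height
    return max_scrolls
```

On a slow connection the height may stay unchanged simply because content is still loading, so a longer `pause` makes the stop condition more reliable.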

In [None]:
# wait 5 seconds to allow the new page to load
time.sleep(5)
images = []

driver.get("https://www.facebook.com/page/photos")
time.sleep(10)  # wait for the page to load

# scroll 500 times; how many photos get loaded depends on the connection
for i in range(500):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

# target all the link elements on the page
anchors = driver.find_elements(By.TAG_NAME, 'a')
anchors = [a.get_attribute('href') for a in anchors]
# narrow down all links to photo links only
anchors = [a for a in anchors if str(a).startswith("https://www.facebook.com/page/photos/")]

for a in anchors:
    driver.get(a)
    time.sleep(5)
    img = driver.find_elements(By.TAG_NAME, "img")  # list of image elements
    if img:
        # the photo is usually the first image on the page, but this can change
        images.append(img[0].get_attribute("src"))

print('The total amount of found links: ' + str(len(images)))
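
The same photo link can appear more than once among the anchors, which would make the loop open the same photo page several times. As a small sketch, assuming we only want each link once, the filtering step could be wrapped in a helper that also removes duplicates while keeping the original order. The function name `photo_links` is hypothetical.

```python
def photo_links(hrefs, prefix):
    """Keep the unique links that start with `prefix`, preserving order.

    `hrefs` may contain None, which is what get_attribute('href')
    returns for anchors without an href.
    """
    seen = set()
    result = []
    for href in hrefs:
        if href and href.startswith(prefix) and href not in seen:
            seen.add(href)
            result.append(href)
    return result
```

It would be called as `anchors = photo_links(anchors, "https://www.facebook.com/page/photos/")` right after the `href` values are collected.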

## Step 3: Downloading the images

The final step is to download the images.

- Set the path to store the images
- A loop over the images list
- Download the images

In [None]:
path = os.path.join(os.getcwd(), "scraped_imgs")  # folder to store the images, inside the current directory
os.makedirs(path, exist_ok=True)  # create the folder if it does not exist yet

counter = 1
for image in images:
    save_as = os.path.join(path, 'photo.' + str(counter) + '.jpg')  # file path for this image
    wget.download(image, save_as)  # download the image
    counter += 1  # increase the counter
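
A single broken link stops the loop above, because the download raises an error. The sketch below shows one possible way to skip failed downloads and continue. Note that it swaps `wget` for the standard library's `urllib.request.urlretrieve`; the helper name `download_images` and its `fetch` parameter are hypothetical and exist only to keep the example self-contained.

```python
import os
import urllib.request


def download_images(urls, folder, fetch=urllib.request.urlretrieve):
    """Save each URL as photo.<n>.jpg in `folder` and return the saved paths.

    `fetch` is any callable taking (url, destination_path); it defaults to
    urllib.request.urlretrieve instead of the wget package used above.
    """
    os.makedirs(folder, exist_ok=True)
    saved = []
    for counter, url in enumerate(urls, start=1):
        save_as = os.path.join(folder, "photo." + str(counter) + ".jpg")
        try:
            fetch(url, save_as)
        except OSError as exc:  # urllib's URLError and HTTPError derive from OSError
            print("Skipping " + url + ": " + str(exc))
            continue
        saved.append(save_as)
    return saved
```

Calling `download_images(images, path)` would then replace the loop, and the returned list tells us which files were actually written. The numbering follows the position in `images`, so a skipped link leaves a gap in the file names.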