# Web Scraper
This notebook builds a dataset of images that is later used to train and test the DCGAN model, which generates new images based on those inputs.

For this we will use `selenium`, a browser automation tool for controlling web pages programmatically, as well as `beautifulsoup`, a Python package that parses HTML and XML documents and extracts specified data.

## Mount Drive
Mount Google Drive so that it is accessible in Colab when transferring the scraped images.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Downloading and Importing Packages
Install `chromedriver`, a standalone server that implements the open WebDriver protocol for the Chrome browser, along with `selenium`, `beautifulsoup` and `lxml` for handling the HTML pages.

In [2]:
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install beautifulsoup4
!pip install selenium
!pip install lxml
!pip install webdriver_manager

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:4 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ Packages [73.9 kB]
Ign:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:6 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:7 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:8 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:9 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:12 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:14 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]

Importing required packages and libraries.

In [3]:
from urllib.request import urlretrieve
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import os
import time

## Scraping Function
This function takes the variables set by the user and generates the corresponding output dataset. Here `chromedriver` is set up along with the URL for **shutterstock.com**, which is where the images will be scraped from.

For each requested page, the function scrolls to the bottom once, waits four seconds for additional images to load, and collects the matching images into a list. Every image on that list is then downloaded to the given directory before the function moves on to the next page.
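The `scrapper` function in this notebook issues a single scroll per page. If a page keeps loading content as it is scrolled, one possible approach is to repeat the scroll until the page height stops changing. The sketch below is an illustration, not part of the original notebook: `scroll_until_stable`, `pause` and `max_rounds` are hypothetical names, and it only assumes that `driver` exposes Selenium's `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll down until the page height stops changing.

    `driver` is any object with Selenium's `execute_script` method.
    Returns the number of scrolls that were performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so the bottom has been reached
        last_height = new_height
    return rounds
```

The `max_rounds` cap keeps the loop from running forever on pages that never stop growing.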

In [4]:
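def scrapper():
    """Scrape Shutterstock result pages and save the images to `dataset_path`.

    Reads the globals `search`, `image_var`, `page_num` and `dataset_path`.
    """
    driver = None
    try:
        # Make sure the output directory exists before downloading into it.
        os.makedirs(dataset_path, exist_ok=True)

        # Headless Chrome configuration for running inside Colab.
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--incognito')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        # Newer Selenium 4 releases dropped the positional driver path; there, pass
        # service=Service(path) (from selenium.webdriver.chrome.service) instead.
        driver = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver', options=chrome_options)
        driver.maximize_window()

        for i in range(1, page_num + 1):
            url = "https://www.shutterstock.com/search?searchterm=" + search + "&sort=popular&image_type=" + image_var + "&search_source=base_landing_page&language=en&page=" + str(i)
            driver.get(url)
            # Scroll to the bottom once so lazily loaded images get a chance to appear.
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(4)

            driver_data = driver.execute_script("return document.documentElement.outerHTML")
            print("Page " + str(i) + " is being scrapped...")

            # Parse the rendered HTML and collect the result thumbnails.
            scraper = BeautifulSoup(driver_data, "lxml")
            img_list = scraper.find_all("img", {"class":"z_h_9d80b z_h_2f2f0"})

            # The last match is skipped, as in the recorded run (about 19 images
            # per page); iterate over img_list itself to include it.
            for img in img_list[:-1]:
                img_act = img.get("src")
                if not img_act:
                    continue  # no usable source URL on this tag

                try:
                    urlretrieve(img_act, os.path.join(dataset_path, os.path.basename(img_act)))

                except Exception as e:
                    print(e)

    except Exception as e:
        print(e)

    finally:
        # Always shut the browser down, even if scraping failed part-way.
        if driver is not None:
            driver.quit()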
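
The download step names each file with `os.path.basename(img_act)`, which keeps anything that follows the file name in the URL, such as a query string. Whether the scraped image URLs carry such suffixes is not stated here, so the following is only a sketch of a more defensive alternative. `filename_from_url` is a hypothetical helper and the `example.com` URL is a placeholder.

```python
import os
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of `url`, ignoring query string and fragment."""
    path = urlparse(url).path           # drops "?query" and "#fragment"
    return os.path.basename(unquote(path))  # decode %-escapes such as %20


if __name__ == "__main__":
    print(filename_from_url("https://example.com/images/photo-123.jpg?size=large"))
```

The helper could replace `os.path.basename(img_act)` in the `urlretrieve` call if the saved file names turn out to contain unwanted suffixes.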

## Variable Declarations and Main Run
Here the user can adjust the search variables to control what kind of dataset is produced. The variables are as follows:

`dataset_path` - The directory where the output images will be stored.

`search` - The term to search images for. (Ex. "forest", "landscape", "portrait", etc.)

`image_var` - The type of images to be scraped. (Ex. "all", "photo", etc.)

`page_num` - Number of pages to scrape. (NOTE: The number of images scraped from each page varies. Based on testing results, ~19 images are scraped per page.)

After the variables are set, the scraping function runs, and once it has finished the number of files in the output directory is reported.
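
The scraping function builds its URL by concatenating `search` and `image_var` directly, which works for single words such as "landscape" but leaves spaces and special characters unescaped. As a minimal sketch, assuming the same query parameters that appear in the notebook's URL, the standard library's `urlencode` can assemble the query string. `build_search_url` and `BASE_URL` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.shutterstock.com/search"


def build_search_url(search, image_var, page):
    """Assemble the search URL for one result page with escaped parameters."""
    params = {
        "searchterm": search,
        "sort": "popular",
        "image_type": image_var,
        "search_source": "base_landing_page",
        "language": "en",
        "page": page,
    }
    # urlencode escapes each value, e.g. spaces become "+".
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("mountain lake", "photo", 1))
```

For a single-word term the result is identical to the concatenated URL, so the helper only changes behavior for terms that need escaping.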

In [5]:
dataset_path = "drive/MyDrive/Colab Notebooks/dataset/output"

search = "landscape"
image_var = 'photo'
page_num = 20

scrapper()

print("...Scrapping complete.")

img_num = len(os.listdir(dataset_path))
print("Number of images scrapped: " + str(img_num))

  


Page 1 is being scrapped...
Page 2 is being scrapped...
Page 3 is being scrapped...
Page 4 is being scrapped...
Page 5 is being scrapped...
Page 6 is being scrapped...
Page 7 is being scrapped...
Page 8 is being scrapped...
Page 9 is being scrapped...
Page 10 is being scrapped...
Page 11 is being scrapped...
Page 12 is being scrapped...
Page 13 is being scrapped...
Page 14 is being scrapped...
Page 15 is being scrapped...
Page 16 is being scrapped...
Page 17 is being scrapped...
Page 18 is being scrapped...
Page 19 is being scrapped...
Page 20 is being scrapped...
...Scrapping complete.
Number of images scrapped: 380
