# Web Scraper

Here we generate a dataset of images that the DCGAN model will later use for training and testing, and as the basis for generating new images.

For this we will use `selenium`, an open-source browser automation tool for controlling web pages, as well as `beautifulsoup`, a Python package that parses HTML and XML documents and extracts specified data.

## Downloading and Importing Packages

Install `webdriver_manager`, which downloads `chromedriver`, a standalone server that implements the open-source `webdriver` protocol for the Chrome browser. Also install `selenium`, `beautifulsoup` and `lxml` for handling the HTML pages.

In [1]:
!pip install beautifulsoup4
!pip install selenium
!pip install lxml
!pip install webdriver_manager



ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

conda 4.10.3 requires ruamel_yaml_conda>=0.11.14, which is not installed.
requests 2.24.0 requires urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you'll have urllib3 1.26.7 which is incompatible.
huggingface-hub 0.1.0 requires packaging>=20.9, but you'll have packaging 20.4 which is incompatible.



Collecting urllib3[secure]~=1.26
  Using cached urllib3-1.26.7-py2.py3-none-any.whl (138 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.25.11
    Uninstalling urllib3-1.25.11:
      Successfully uninstalled urllib3-1.25.11
Successfully installed urllib3-1.26.7
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Using cached urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.7
    Uninstalling urllib3-1.26.7:
      Successfully uninstalled urllib3-1.26.7
Successfully installed urllib3-1.25.11


ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

conda 4.10.3 requires ruamel_yaml_conda>=0.11.14, which is not installed.
selenium 4.1.0 requires urllib3[secure]~=1.26, but you'll have urllib3 1.25.11 which is incompatible.
huggingface-hub 0.1.0 requires packaging>=20.9, but you'll have packaging 20.4 which is incompatible.


Importing required packages and libraries.

In [2]:
from urllib.request import urlretrieve
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import os
import time

## Scraping Function

This cell implements the function that turns the input variables into the output dataset. Here `chromedriver` is set up along with the URL for **shutterstock.com**, which is where the images will be scraped from.

For each requested page, the function scrolls to the bottom, waits a few seconds for the page to finish loading, and collects all matching images into a list. Each image on the list is then downloaded into the given directory before the function moves on to the next page.

In [3]:
def scrapper():
    driver = None
    try:
        # Make sure the output directory exists before downloading into it.
        os.makedirs(dataset_path, exist_ok=True)

        chrome_options = Options()
        chrome_options.add_argument("--no-sandbox")
        # NOTE: Selenium 4 deprecates passing the driver path positionally;
        # newer releases expect a Service object instead.
        driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
        driver.maximize_window()

        for i in range(1, page_num + 1):
            url = "https://www.shutterstock.com/search?searchterm=" + search + "&sort=popular&image_type=" + image_var + "&search_source=base_landing_page&language=en&page=" + str(i)
            driver.get(url)
            # Scroll to the bottom once and wait for lazy-loaded thumbnails.
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(4)

            driver_data = driver.execute_script("return document.documentElement.outerHTML")
            print("Page " + str(i) + " is being scrapped...")

            # Parse the rendered HTML and collect the thumbnail <img> tags.
            scraper = BeautifulSoup(driver_data, "lxml")
            img_list = scraper.find_all("img", {"class":"z_h_9d80b z_h_2f2f0"})

            # Download every image found on the page.
            for img in img_list:
                img_act = img.get("src")
                if not img_act:
                    continue  # skip tags without a usable src

                try:
                    urlretrieve(img_act, os.path.join(dataset_path, os.path.basename(img_act)))
                except Exception as e:
                    print(e)

    except Exception as e:
        print(e)

    finally:
        # Always shut the browser down, even if a page fails.
        if driver is not None:
            driver.quit()
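
`scrapper()` issues a single `scrollTo` call per page and then sleeps for a fixed time. If the results page keeps loading thumbnails as it grows, one option is to keep scrolling until the page height stops changing. The following is a minimal sketch of that idea, not part of the original notebook: `scroll_to_bottom`, `pause` and `max_rounds` are hypothetical names, and the only thing assumed about `driver` is Selenium's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazy-loaded content time to appear
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so the page is fully extended
        last_height = new_height
    return rounds
```

Inside `scrapper()`, a call such as `scroll_to_bottom(driver)` could stand in for the single scroll and the fixed `time.sleep(4)`. The `max_rounds` cap is there so that a page with endless scrolling cannot stall the run.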

## Variable Declarations and Main Run

Here the user can adjust the search variables to choose what kind of dataset they want. The variables are as follows:

`dataset_path` - The directory where the output images will be stored.

`search` - The term to search images for. (Ex. "forest", "landscape", "portrait", etc.)

`image_var` - The type of images to be scraped. (Ex. "all", "photo", etc.)

`page_num` - Number of pages to scrape. (NOTE: The number of images scraped from each page varies. Based on testing results, ~100 images are scraped per page.)

After the variables are set, the scraping function runs over the requested pages, and the cell then reports how many images were saved.
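
These variables are concatenated directly into the query string, so a search term containing spaces or characters such as `&` would not be encoded. A small sketch of building the same URL with `urllib.parse.urlencode` follows. `build_search_url` and `BASE_URL` are hypothetical names, and the parameter names are simply copied from the URL used in `scrapper()`; whether the site still accepts them is not verified here.

```python
from urllib.parse import urlencode

# Base address taken from the URL string used in scrapper().
BASE_URL = "https://www.shutterstock.com/search"


def build_search_url(search, image_var, page):
    """Return the search URL for one results page, with encoded parameters."""
    params = {
        "searchterm": search,
        "sort": "popular",
        "image_type": image_var,
        "search_source": "base_landing_page",
        "language": "en",
        "page": page,
    }
    # urlencode escapes spaces and reserved characters in each value.
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("pine forest", "photo", 2))
```

For single-word terms such as "landscape" this produces the same URL as the string concatenation; the difference only shows up for multi-word or special-character terms.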

In [4]:
dataset_path = "Project/output"

search = "landscape"
image_var = 'photo'
page_num = 40

scrapper()

print("...Scrapping complete.")

img_num = len(os.listdir(dataset_path))
print("Number of images scrapped: " + str(img_num))



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/96.0.4664.45/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\ksukh\.wdm\drivers\chromedriver\win32\96.0.4664.45]
  driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)


Page 1 is being scrapped...
Page 2 is being scrapped...
Page 3 is being scrapped...
Page 4 is being scrapped...
Page 5 is being scrapped...
Page 6 is being scrapped...
Page 7 is being scrapped...
Page 8 is being scrapped...
Page 9 is being scrapped...
Page 10 is being scrapped...
Page 11 is being scrapped...
Page 12 is being scrapped...
Page 13 is being scrapped...
Page 14 is being scrapped...
Page 15 is being scrapped...
Page 16 is being scrapped...
Page 17 is being scrapped...
Page 18 is being scrapped...
Page 19 is being scrapped...
Page 20 is being scrapped...
Page 21 is being scrapped...
Page 22 is being scrapped...
Page 23 is being scrapped...
Page 24 is being scrapped...
Page 25 is being scrapped...
Page 26 is being scrapped...
Page 27 is being scrapped...
Page 28 is being scrapped...
Page 29 is being scrapped...
Page 30 is being scrapped...
Page 31 is being scrapped...
Page 32 is being scrapped...
Page 33 is being scrapped...
Page 34 is being scrapped...
Page 35 is being scrapp