# Basic Google image scraper using Selenium

This image scraper has been put together using the following resources:

- https://www.selenium.dev/
- https://medium.com/@nithishreddy0627/a-beginners-guide-to-image-scraping-with-python-and-selenium-38ec419be5ff
- https://github.com/StatsGary/Tensorflow_Tutorials/blob/main/01_CV_Classification/01_Download_Images.py
- https://stackoverflow.com/questions/75750522/i-can-not-get-more-than-20-images-from-google-images-with-selenium

Please feel free to edit as needed.

### 1. Uncomment to install selenium

In [1]:
#!pip install selenium

### 2. Import dependencies

In [2]:
import time
from PIL import Image
import requests
import base64
import io
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.support import expected_conditions as EC
import os
import sys

### 3. Download Chromedriver
Follow the instructions from: 
https://medium.com/@nithishreddy0627/a-beginners-guide-to-image-scraping-with-python-and-selenium-38ec419be5ff

### 4. Note the path to chromedriver, and replace the path below with your own.

In [3]:
PATH =  "/Users/shirleneliew/Documents/chromedriver-mac-x64"
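
Before starting the browser, it can help to confirm that the path really leads to a chromedriver executable. The sketch below is only an illustration: `resolve_chromedriver` is a hypothetical helper (not part of Selenium), and it assumes that a downloaded folder contains a binary named "chromedriver" (on Windows the name would be "chromedriver.exe").

```python
import shutil
from pathlib import Path


def resolve_chromedriver(path):
    """Return the chromedriver executable for a file or folder path, or None."""
    candidate = Path(path).expanduser()
    # Assumption: if a folder was given, the binary inside it is named "chromedriver".
    if candidate.is_dir():
        candidate = candidate / "chromedriver"
    if candidate.is_file():
        return str(candidate)
    # Fall back to any chromedriver found on the system PATH (None if there is none).
    return shutil.which("chromedriver")
```

Calling `resolve_chromedriver(PATH)` returns a usable file path, or `None` if nothing was found, which is easier to debug than a browser that fails to start.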

### 5. Initialise chromedriver.

In [4]:
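from selenium.webdriver.chrome.service import Service

options = ChromeOptions()
options.add_argument("--start-maximized")
# options.add_argument("--headless")  # uncomment to run without a visible window
options.add_argument("--no-sandbox")
# Suppress Chrome's automation extension and the automation switch
options.add_experimental_option("useAutomationExtension", False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])

# The driver location is passed through a Service object; assigning
# options.executable_path has no effect in Selenium 4.
# Assumption: if PATH is a folder, the binary inside it is named "chromedriver".
driver_path = os.path.join(PATH, "chromedriver") if os.path.isdir(PATH) else PATH
driver = webdriver.Chrome(service=Service(executable_path=driver_path), options=options)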

### 6. Define the scrape_image function. 

Essentially, the scraper finds every image on the page by its "img" tag and reads the element's src attribute.
This returns the source of all images on the page, including thumbnails and small logos.

We want to exclude these thumbnails and small logos, so we set a minimum image size of 100000 bytes. The size is measured on the decoded pixel data in memory, not on the file size on disk.
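
To get a feel for what the 100000-byte threshold means in pixels: the function measures `sys.getsizeof(img.tobytes())`, which is the raw pixel data plus a small fixed object overhead. The sketch below ignores that overhead and assumes 8 bits per channel; `raw_size_bytes` and `min_square_side` are hypothetical helper names used only for this illustration.

```python
import math


def raw_size_bytes(width, height, channels=3):
    """Bytes of uncompressed pixel data, assuming 8 bits per channel."""
    return width * height * channels


def min_square_side(threshold_bytes, channels=3):
    """Smallest square side whose raw pixel data exceeds threshold_bytes."""
    return math.isqrt(threshold_bytes // channels) + 1


print(raw_size_bytes(183, 183))   # 100467, just above the threshold
print(raw_size_bytes(182, 182))   # 99372, just below the threshold
print(min_square_side(100000))    # 183
```

So, under these assumptions, an RGB image needs to be roughly 183 x 183 pixels or larger to pass the filter.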

Google Images loads results dynamically as you scroll, so we get the browser to scroll a number of times, based on the number of images desired, to load enough images. If num_images is larger than the number of images loaded on the page, no further images can be downloaded.
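
The function scrolls `num_images // 20` times, which rounds down. The sketch below compares that with rounding up; both helper names are hypothetical, and the figure of 20 images per scroll is simply the divisor used in the function, not a documented Google value.

```python
import math


def scroll_count_floor(num_images, per_scroll=20):
    """Number of scrolls when rounding down, as the function does."""
    return num_images // per_scroll


def scroll_count_ceil(num_images, per_scroll=20):
    """Number of scrolls when rounding up, so a partial batch still gets a scroll."""
    return math.ceil(num_images / per_scroll)


for n in (15, 200, 210):
    print(n, scroll_count_floor(n), scroll_count_ceil(n))
# 15 0 1
# 200 10 10
# 210 10 11
```

With rounding down, a request for fewer than 20 images does not scroll at all, which is fine as long as the first screen already shows enough results.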

You can experiment with driver.find_elements to see whether other HTML/CSS elements or selectors are more useful for isolating the images you want to download.
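
As one possible direction for that experiment, the sketch below filters elements by their rendered size before anything is downloaded. It relies on two things Selenium elements provide: `get_attribute("src")` and the `size` property, a dict with "width" and "height". The helper name `large_image_sources` and the 100-pixel default are assumptions made for this illustration.

```python
def large_image_sources(elements, min_side=100):
    """Collect src values from elements rendered at least min_side pixels wide and tall."""
    sources = []
    for el in elements:
        src = el.get_attribute("src")
        size = el.size  # dict with "width" and "height" in pixels
        if src and size["width"] >= min_side and size["height"] >= min_side:
            sources.append(src)
    return sources
```

It could be called as `large_image_sources(driver.find_elements(By.TAG_NAME, 'img'))`. Note that the rendered size of a thumbnail says nothing about the resolution of the full image behind it, so this is a pre-filter rather than a replacement for the byte-size check.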

In [5]:

from urllib.parse import quote_plus

def scrape_images(query, num_images, save_path):
    # Create a Google Images search URL; the query is URL-encoded so spaces and symbols are safe
    search_url = f"https://www.google.com/search?q={quote_plus(query)}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"

    # Open the Google Images search page
    driver.get(search_url)

    # Scroll down to load more images, pausing briefly so each batch can start loading
    for _ in range(num_images // 20):
        driver.execute_script("window.scrollBy(0,10000)")
        time.sleep(1)

    # Wait for the last batch of images to load
    time.sleep(2)
    #WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "img.Q4LuWd")))

    # Get all image elements
    img_elements = driver.find_elements(By.TAG_NAME, 'img')
    src = [img.get_attribute('src') for img in img_elements]

    # Create the save directory on the desktop
    desktop_path = os.path.join(os.path.expanduser("~"), "Desktop")
    save_path = os.path.join(save_path, f"{query}")
    save_path = os.path.join(desktop_path, save_path)
    os.makedirs(save_path, exist_ok=True)

    # Define minimum image size to skip thumbnails
    min_img_size = 100000

    # Loop through the first num_images image sources
    # (slicing avoids an IndexError when fewer images were found)
    for i, img_src in enumerate(src[:num_images]):
        # Skip <img> elements that have no src attribute
        if not img_src:
            continue
        try:
            img_name = f"{query}_{i+1}.jpg"
            img_path = os.path.join(save_path, img_name)

            if img_src.startswith('data'):
                # Inline base64 image: decode the payload after the comma
                imgdata = base64.b64decode(img_src.split(',')[1])
                img = Image.open(io.BytesIO(imgdata))
            else:
                # Image URL: fetch it with a timeout so a stalled request cannot hang the loop
                response = requests.get(img_src, timeout=10)
                img = Image.open(io.BytesIO(response.content)).convert('RGB')

            # Size of the decoded pixel data in memory (plus a small object overhead)
            img_size = sys.getsizeof(img.tobytes())
            print(f"img {i+1} size in memory in bytes: ", img_size)
            if img_size > min_img_size:
                # JPEG cannot store transparency or palettes, so save as RGB
                img.convert('RGB').save(img_path)
                print(f"Image {i+1} downloaded successfully")
            else:
                print(f"Image {i+1} too small to download")
        except Exception as e:
            print(f"Failed to download image {i+1}: {e}")
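
The base64 branch above splits the src on the first comma and decodes what follows. A slightly stricter version of that step, shown here as a standalone sketch using only the standard library, also checks the header and returns the MIME type. `decode_data_uri` is a hypothetical helper name, and the sketch assumes the data URI is base64-encoded, which is the form the function handles.

```python
import base64


def decode_data_uri(src):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, payload = src.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)
```

The returned bytes can be passed to `Image.open(io.BytesIO(raw_bytes))` exactly as in the function, and the MIME type makes it easy to skip formats you do not want.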

### 7. Scrape images!
Now we just need to define the arguments and run the function.


In [6]:
# This is the Google Images search query you want to run.
# I suggest you try the search term on google first, 
# so you can tell if the search term gives you what you want.
query = "alsatian dogs"

# This is the maximum number of images that the scraper will try to download. 
# Note that a bit more than half of the found images will not meet the image size requirement.
# Suggested maximum: num_images = 500
num_images = 200

# This is the folder name that the images will be saved to
save_path = "dogs"

# Run the function.
scrape_images(query, num_images, save_path)

img 1 size in memory in bytes:  33153
Image 1 too small to download
img 8 size in memory in bytes:  11073
Image 8 too small to download
img 9 size in memory in bytes:  8451
Image 9 too small to download
img 10 size in memory in bytes:  9555
Image 10 too small to download
img 11 size in memory in bytes:  8727
Image 11 too small to download
img 12 size in memory in bytes:  6381
Image 12 too small to download
img 13 size in memory in bytes:  34
Image 13 too small to download
img 14 size in memory in bytes:  34
Image 14 too small to download
img 15 size in memory in bytes:  34
Image 15 too small to download
img 16 size in memory in bytes:  34
Image 16 too small to download
img 17 size in memory in bytes:  34
Image 17 too small to download
img 18 size in memory in bytes:  34
Image 18 too small to download
img 19 size in memory in bytes:  34
Image 19 too small to download
img 20 size in memory in bytes:  34
Image 20 too small to download
img 21 size in memory in bytes:  34
Image 21 too small

When you're done with image scraping, you can close the browser.

In [7]:
driver.quit()
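
If the scraping cell raises an error, the browser stays open until `driver.quit()` is run by hand. One way to make the cleanup automatic is a small context manager, sketched below; `quitting` is a hypothetical name, and it works with any object that has a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit() afterwards, even on errors."""
    try:
        yield driver
    finally:
        driver.quit()
```

It would be used as `with quitting(webdriver.Chrome(options=options)) as driver:` followed by the scraping calls inside the block. The trade-off is that the browser no longer stays open between notebook cells.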

## Important note
This is a fast way to download many images, but it <i> doesn't guarantee </i> that all images will be relevant to training your model. 

We recommend you <b> check all downloaded images </b> to ensure your dataset is as clean as possible.
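
Part of that check can be automated. The sketch below flags files in the download folder whose contents are byte-for-byte identical, using only the standard library. `find_exact_duplicates` is a hypothetical helper, and it only catches exact copies: it says nothing about whether an image is relevant, so a visual check is still needed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(folder):
    """Group files in folder by SHA-256 digest; return the groups with more than one file."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path.name)
    return [names for names in by_digest.values() if len(names) > 1]
```

Each returned group lists file names with identical contents, so all but one file in a group can be removed.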