# Google Images Crawler 

A quick Jupyter notebook for scraping Google Images. Google Images now uses infinite scroll, so we're going to use Selenium to drive a real browser and load more results.

In [11]:
from bs4 import BeautifulSoup
import requests
import urllib.request
import sys
import os
import traceback
import re
import time
import json
import unidecode

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

We're going to try with a single set of headers first. If this doesn't work, we'll try proxy rotation and multiple headers to hide our identity. But since we are not going to scrape a lot of data, we should be able to get by without them. Google Images should not raise an issue though.


In [2]:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0'}
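The `headers` dictionary only has an effect if it is attached to the requests that fetch the images. As a minimal sketch (the helper names `build_request` and `download` are hypothetical, not part of the notebook), the standard library lets us pass it through a `urllib.request.Request` object:

```python
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0'}

def build_request(url, headers):
    """Wrap a URL in a Request that carries our custom headers (hypothetical helper)."""
    return urllib.request.Request(url, headers=headers)

def download(url, file_name, headers, timeout=10):
    """Fetch url with the given headers and write the body to file_name (hypothetical helper)."""
    req = build_request(url, headers)
    with urllib.request.urlopen(req, timeout=timeout) as response, open(file_name, "wb") as out:
        out.write(response.read())
```

Calling `download(img_url, file_name, headers)` would then send the Firefox User-Agent with the request, where a bare `urlretrieve` call sends Python's default one.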

Choose the kind of images you want to get and the path where your data will be saved. Here, we're going to try pants (cargo, chinos, casual and dungarees). We show an example for our first search, 'Pantalons Cargos'.

In [8]:
# Trailing slash needed: the folder name is appended by plain string concatenation below
DownloadPath = "/media/arthur/DATA/Databases/GoogleImagesScraping/Pantalons/"

# Parameters
words_to_search = ['Pantalons Cargos', 'Pantalons Chinos', 'Pantalons Casuals', 'Salopette']
nb_to_download = [1000, 1000, 1000, 1000]
first_image_position = [0, 0, 0, 0]

# Number of scrolls (because of infinite scroll)
number_of_scrolls = int((nb_to_download[0] + first_image_position[0])/ 400 + 1)
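The cell above only computes the scroll count for the first search term. A small sketch of the same formula applied to every term, assuming (as the notebook does) that one scroll loads roughly 400 images; the function name `scrolls_needed` is ours:

```python
def scrolls_needed(nb_images, first_position=0, images_per_scroll=400):
    """Same formula as the notebook: one extra scroll on top of the integer quotient."""
    return int((nb_images + first_position) / images_per_scroll + 1)

nb_to_download = [1000, 1000, 1000, 1000]
first_image_position = [0, 0, 0, 0]

# One scroll count per search term
scrolls_per_search = [
    scrolls_needed(nb, first)
    for nb, first in zip(nb_to_download, first_image_position)
]
print(scrolls_per_search)
```

The figure of 400 images per scroll is a rough estimate and may need adjusting to what the page actually loads.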

Create the folder where the images of the first search will be saved.

In [14]:
if not os.path.exists(DownloadPath + words_to_search[0].replace(" ", "_")):
    os.makedirs(DownloadPath + words_to_search[0].replace(" ", "_"))
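To prepare a folder for every search term and not only the first one, something like the following could work. This is a sketch: `make_folders` is a hypothetical helper, and it uses `os.path.join` and `exist_ok=True` instead of the string concatenation and explicit existence check used in the notebook.

```python
import os

def make_folders(base_path, search_terms):
    """Create one sub-folder per search term and return the list of folder paths."""
    folders = []
    for term in search_terms:
        folder = os.path.join(base_path, term.replace(" ", "_"))
        os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
        folders.append(folder)
    return folders
```

For example, `make_folders(DownloadPath, words_to_search)` would create `Pantalons_Cargos`, `Pantalons_Chinos`, `Pantalons_Casuals` and `Salopette` under the download path.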

We are now ready to start. Selenium is going to create an instance of our web browser (in my case, Mozilla Firefox), which will be controlled by this script.
Installation tip: go to https://github.com/mozilla/geckodriver/releases and get the geckodriver release corresponding to your browser.

In [35]:
url = "https://www.google.co.in/search?q=" + words_to_search[0] + "&source=lnms&tbm=isch"
driver = webdriver.Firefox()
driver.get(url)
extensions = {"jpg", "jpeg", "png"}
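The search terms contain spaces, which the browser tolerates but which are not valid in a URL as such. A minimal sketch of building the same search URL with proper encoding, using the query parameters from the cell above (`build_search_url` is a hypothetical name):

```python
from urllib.parse import urlencode

def build_search_url(query, base="https://www.google.co.in/search"):
    """Build a Google Images search URL with the query string properly encoded."""
    params = {"q": query, "source": "lnms", "tbm": "isch"}
    return base + "?" + urlencode(params)

print(build_search_url("Pantalons Cargos"))
# https://www.google.co.in/search?q=Pantalons+Cargos&source=lnms&tbm=isch
```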

Now, we automatically scroll the page to load all the images we want to download. We run JavaScript's window.scrollBy through Selenium's execute_script.
Tip: put a sleep between scrolls so the images have time to load on the page. If not, your crawler will stop early, because the page has not had time to grow and there is nothing left to scroll.

In [36]:
for _ in range(5):
    driver.execute_script("window.scrollBy(0,1000000)")
    time.sleep(1)
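A fixed number of scrolls may be too few or too many. One possible variant, sketched here under the assumption that the page height stops growing once no more results load, compares `document.body.scrollHeight` before and after each scroll (`scroll_to_bottom` is a hypothetical helper):

```python
import time

def scroll_to_bottom(driver, max_scrolls=5, pause=1.0):
    """Scroll until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for n in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollBy(0, 1000000)")
        time.sleep(pause)  # give the newly requested images time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return n  # nothing new was loaded, so we are at the bottom
        last_height = new_height
    return max_scrolls
```

With the Firefox driver created earlier, `scroll_to_bottom(driver, max_scrolls=10)` would stop as soon as a scroll no longer changes the page height. It does not handle a "show more results" button, if the page displays one.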

Now that our crawler is at the bottom of the page, we can download the images. Our crawler finds 303 images, whereas we only had 103 without scrolling down.
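Each result on the page carries a small JSON blob from which the download cell reads the `"ou"` key (image URL) and the `"ity"` key (image type). The sketch below isolates that parsing step and also applies the `extensions` set defined earlier, which the notebook itself never uses. `parse_meta` and the sample string are hypothetical, made up for illustration.

```python
import json

extensions = {"jpg", "jpeg", "png"}

def parse_meta(inner_html, allowed=extensions):
    """Return (image_url, image_type), or None if the type is missing or not allowed."""
    meta = json.loads(inner_html)
    img_url = meta.get("ou")
    img_type = (meta.get("ity") or "").lower()
    if not img_url or img_type not in allowed:
        return None
    return img_url, img_type

# Hypothetical metadata string, shaped like the blobs the notebook parses
sample = '{"ou": "https://example.com/cargo.jpg", "ity": "jpg"}'
print(parse_meta(sample))
```

Skipping entries with an unknown type avoids saving files with an empty or unexpected extension.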

In [52]:
# The generic find_elements call also works on Selenium releases without find_elements_by_xpath
images = driver.find_elements("xpath", '//div[contains(@class, "rg_meta")]')
print(len(images))
for i, img in enumerate(images, start=1):
    print("Dowloading image no", i)
    # Parse the metadata once: "ou" is the image URL, "ity" the image type
    meta = json.loads(img.get_attribute('innerHTML'))
    img_url = meta["ou"]
    img_type = meta["ity"]
    file_name = DownloadPath + words_to_search[0].replace(" ", "_") + "/" + str(i) + "." + img_type
    try:
        # urlretrieve writes the file itself, so no separate open() is needed
        urllib.request.urlretrieve(img_url, file_name)
    except Exception:
        print("Downloading Unauthorized")

303
Dowloading image no 1
Dowloading image no 2
Dowloading image no 3
Dowloading image no 4
Dowloading image no 5
Dowloading image no 6
Dowloading image no 7
Dowloading image no 8
Dowloading image no 9
Dowloading image no 10
Dowloading image no 11
Dowloading image no 12
Dowloading image no 13
Dowloading image no 14
Dowloading image no 15
Dowloading image no 16
Dowloading image no 17
Dowloading image no 18
Dowloading image no 19
Dowloading image no 20
Dowloading image no 21
Dowloading image no 22
Dowloading image no 23
Dowloading image no 24
Dowloading image no 25
Dowloading image no 26
Dowloading image no 27
Dowloading image no 28
Dowloading image no 29
Dowloading image no 30
Downloading Unauthorized
Dowloading image no 31
Dowloading image no 32
Dowloading image no 33
Dowloading image no 34
Dowloading image no 35
Dowloading image no 36
Dowloading image no 37
Dowloading image no 38
Dowloading image no 39
Dowloading image no 40
Dowloading image no 41
Dowloading image no 42
Dowloading ima