# WEB SCRAPING IMAGES FROM GOOGLE

I came across two methods for scraping images:
1. Using Beautiful Soup
2. Using Selenium

I got stuck with Beautiful Soup, so I switched to Selenium. I found Selenium a bit more complex than Beautiful Soup, but its visual output (you can watch the browser carry out each step) makes it a fascinating tool.

The Selenium package automates what a human user would normally do in a web browser. In this case, I want to go to Google Images, search for images of dogs, and store them on my desktop. Selenium automates this process once we specify the search keyword and how many images to download.

### Importing libraries

In [158]:
import selenium
from selenium import webdriver
import time
import os
import requests
from PIL import Image
import io
import hashlib

### Chrome Driver Path

To use Selenium with Google Chrome, we need to download ChromeDriver. The ChromeDriver version to install depends on the installed Google Chrome version.
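As a rough illustration of that version dependency, the sketch below compares the major version numbers of the browser and the driver. It assumes version text in the dotted four-part form that `chromedriver --version` typically prints; the helper names `major_version` and `versions_match` and the sample strings are hypothetical and not part of this notebook.

```python
import re

def major_version(version_output: str) -> int:
    """Return the major version from text such as 'ChromeDriver 114.0.5735.90'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))

def versions_match(chrome_output: str, driver_output: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(chrome_output) == major_version(driver_output)

# Hypothetical version strings, for illustration only
print(versions_match("Google Chrome 114.0.5735.198", "ChromeDriver 114.0.5735.90"))
print(versions_match("Google Chrome 115.0.5790.102", "ChromeDriver 114.0.5735.90"))
```

A check like this could run before `webdriver.Chrome(...)` is called, so that a mismatch is reported before the browser fails to start.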

In [159]:
DRIVER_PATH =  '/Users/apurvasalvi/Desktop/GauguinBot/chromedriver'
wd = webdriver.Chrome(executable_path=DRIVER_PATH)
wd.get('https://google.com') 
search_box = wd.find_element_by_css_selector('input.gLFyf') #input box selector
search_box.send_keys('dogs')
wd.quit()

The lines above only open the browser, type the query into the search box, and quit.

The second phase is to run the search, go to the Images section, and collect the image links using CSS selectors.
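Not every `src` value found this way is a downloadable link: some are missing, and some are inline `data:` URIs. The sketch below shows one way to keep only the remote links. The helper `keep_remote_urls` and the sample values are hypothetical, and it uses a `startswith` check, which is slightly stricter than the `'http' in src` test used in this notebook.

```python
from typing import Iterable, Optional, Set

def keep_remote_urls(src_values: Iterable[Optional[str]]) -> Set[str]:
    """Keep only src values that are http(s) links, dropping None and inline data URIs."""
    return {
        src for src in src_values
        if src and src.startswith(("http://", "https://"))
    }

# Hypothetical src values, as get_attribute('src') might return them
samples = [
    "https://example.com/dog1.jpg",
    "data:image/jpeg;base64,/9j/4AAQ",  # inline thumbnail, not a link
    None,                               # element without a src attribute
    "https://example.com/dog1.jpg",     # duplicate, collapsed by the set
]
print(keep_remote_urls(samples))
```

Returning a set means a link that appears under several thumbnails is stored only once.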

The third phase is to download the images from those links to the local computer.
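When saving, the code in this notebook names each file after the first 10 hex digits of the SHA-1 hash of its content. The sketch below isolates that naming step; the helper name `image_file_path` and the sample bytes are hypothetical.

```python
import hashlib
import os

def image_file_path(folder: str, image_bytes: bytes) -> str:
    """Build a .jpg path whose name is the first 10 hex digits of the content's SHA-1."""
    digest = hashlib.sha1(image_bytes).hexdigest()[:10]
    return os.path.join(folder, digest + ".jpg")

# Hypothetical byte strings standing in for downloaded image content
path_a = image_file_path("images", b"same bytes")
path_b = image_file_path("images", b"same bytes")
path_c = image_file_path("images", b"other bytes")
print(path_a == path_b)  # identical content always maps to the same file name
print(path_a, path_c)
```

Because the name depends only on the content, downloading the same image twice overwrites one file instead of creating a second copy.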

### Code for Web Scraping from Google Images

In [None]:
def fetch_image_urls(plot:str, max_links_to_fetch:int, wd:webdriver.Chrome):
    # build the google query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"
    wd.get(search_url.format(q=plot))
    image_urls = set() #used so duplicates won't be added
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        #find elements based on the tag and class name using css selector
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd") 
        number_links = len(thumbnail_results)
        print(f"Found: {number_links} search results. Extracting links from {results_start}:{number_links}")
        for img in thumbnail_results[results_start:number_links]:
            #try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(1)
            except Exception:
                continue
            #extract image urls    
            actual_images = wd.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                    image_urls.add(actual_image.get_attribute('src'))
            image_count = len(image_urls)
            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
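        else:
            #the for loop finished without collecting enough links
            print("Found:", len(image_urls), "image links, looking for more ...")
            if number_links == results_start:
                #no new thumbnails appeared since the last pass, so return what was collected
                return image_urls
            #scroll to the bottom so that more thumbnails are loaded
            wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
            #click the "show more results" button if it is present; find_elements returns [] when it is not
            if wd.find_elements_by_css_selector(".mye4qd"):
                wd.execute_script("document.querySelector('.mye4qd').click();")
                time.sleep(2)
        results_start = len(thumbnail_results)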
    return image_urls

def download_image(folder_path:str,file_name:str,url:str):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        return  #nothing to save if the download failed
        
    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')  #Opens and identifies the given image file
        folder_path = os.path.join(folder_path,file_name) #one sub-folder per search keyword
        os.makedirs(folder_path, exist_ok=True) #create the folder (and any missing parents) if it does not exist yet
        #name the file after the first 10 hex digits of the SHA-1 hash of its content
        file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:  #'wb': write in binary mode, truncating the file if it already exists
            image.save(f, "JPEG", quality=85)  
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")

if __name__ == '__main__':
    wd = webdriver.Chrome(executable_path=DRIVER_PATH)  #controls chrome driver and allows you to drive the browser
    plot_names = ["cats"]  #list of search keywords
    for plot in plot_names:
        wd.get('https://google.com')   #loads webpage in the current browser session
        search_box = wd.find_element_by_css_selector('input.gLFyf')   #finds an element by css selector and returns it if found
        search_box.send_keys(plot)   #simulates typing into the element
        links = fetch_image_urls(plot,10,wd)  #gets image urls
        images_path = '/Users/apurvasalvi/Desktop/GauguinBot/images' #folder to save the images in
        for i in links:
            download_image(images_path,plot,i)   #downloads images to the specified path
    wd.quit()
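`fetch_image_urls` inserts the keyword into the search URL as-is, which works for a single word such as "cats". For keywords with spaces or special characters, one option is to URL-encode the keyword first. The sketch below does this with `urllib.parse.quote_plus`; the helper name `build_search_url` is hypothetical and the URL template is copied from the code above.

```python
from urllib.parse import quote_plus

# Same template as in fetch_image_urls
SEARCH_URL = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

def build_search_url(keyword: str) -> str:
    """Return the image search URL with the keyword safely URL-encoded."""
    encoded = quote_plus(keyword)  # spaces become '+', '&' becomes '%26'
    return SEARCH_URL.format(q=encoded)

print(build_search_url("golden retriever puppies"))
print(build_search_url("cats & dogs"))
```

Inside `fetch_image_urls`, the call `wd.get(search_url.format(q=plot))` could then become `wd.get(build_search_url(plot))`.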

### Contribution

External Source: 70%
Personal Contribution: 30%

### Citations

1. Article title:	Web Scraping Images from Google

   Website title:	Medium
   
   URL          :	https://medium.com/@wwwanandsuresh/web-scraping-images-from-google-9084545808a2
   
   
2. Article title:	Msalmannasir/Google_image_scraper

   Website title:	GitHub
   
   URL          :	https://github.com/Msalmannasir/Google_image_scraper/blob/master/google_img.

### Conclusion

This task helped me learn how to use Selenium to scrape images from Google Images. The task is divided into three steps: opening the web browser using Selenium, collecting the image URLs, and downloading the images from these URLs. The code above can download a requested number of images for each search keyword, as long as Google returns that many results. Thus, I have automated the process of getting images from the web using Selenium.