# WEB SCRAPING IMAGES FROM GOOGLE

Came across two methods:
1. Using Beautiful Soup
2. Using Selenium

Got stuck on Beautiful Soup so had to switch to Selenium. I found Selenium a bit more complex than Beautiful Soup, but I think it's visual output feature makes it a fascinating tool.

The Selenium package does what a human user would normally do on the web browser. In this case I want to go to Google Images and search for images of Dogs and store it on my desktop, Selenium would automate the process for us by specifying which image you want to download and how many images you want to download. 

### Importing libraries

In [6]:
import selenium
from selenium import webdriver
import time
import os
import requests
from PIL import Image
from PIL.ExifTags import TAGS
import io
import hashlib
import pandas as pd
from openpyxl import load_workbook

### Chrome Driver Path

To use Selenium with Google Chrome we need to download a Chrome Driver, depending on the Google Chrome Version the Chrome Driver is installed. 

In [7]:
DRIVER_PATH =  '/Users/apurvasalvi/Desktop/GauguinBot/chromedriver'
wd = webdriver.Chrome(executable_path=DRIVER_PATH)
wd.get('https://google.com') 
#search_box = wd.find_element_by_css_selector('input.gLFyf') #input box selector
wd.quit()

The above lines of code only opens the browser and gives the input query and quits.

The second phase would involve to search for the query, go to the image section and get the respective image links using css selectors. 

The third phase of the code will be to download the images from the link onto your local computer. 

### Code for Web Scraping from Google Images

In [17]:
def fetch_image_urls(plot:str, max_links_to_fetch:int, wd:webdriver):
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
    # build the google plot
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"
    
    wd.get(search_url.format(q=plot))
    
    image_urls = set() #duplicates won't be added
    image_count = 0
    results_start = 0
    image_tags = set()
    while image_count < max_links_to_fetch:
        scroll_to_end(wd)
        
        #find elements based on the tag and class name using css selector
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd") 
        number_links = len(thumbnail_results)
        print(f"Found: {number_links} search results. Extracting links from {results_start}:{number_links}")
        
        for img in thumbnail_results[results_start:number_links]:
            #try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(1)
            except Exception:
                continue
            
            #extract image urls    
            actual_images = wd.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                    image_urls.add(actual_image.get_attribute('src'))
                    if actual_image.get_attribute('alt'):
                        image_tags.add(actual_image.get_attribute('alt'))
                    else:
                        image_tags.add("Image Tag not found")
                        
            image_count = len(image_urls)
            
            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
        else:
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(10)
            show_more_results = wd.find_element_by_css_selector(".mye4qd")
            if show_more_results:
                wd.execute_script("document.querySelector('.mye4qd').click();")

        # move the result startpoint further down
        results_start = len(thumbnail_results)
    
    return image_urls, image_tags

def download_image(folder_path:str,file_name:str,url:str):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        
    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')  #Opens and identifies the given image file
        folder_path = os.path.join(folder_path,file_name) #Joins 2 or more pathname components
        if os.path.exists(folder_path):
            #if the path exists, add file to the folder path
            file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
            image_id = hashlib.sha1(image_content).hexdigest()[:10]
        else:
            #else create a new folder and add file to the new folder
            os.mkdir(folder_path)
            file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:  #'wb': mode for binary random access, opens and truncates the file to 0 bytes
            image.save(f, "JPEG", quality=85)  
        print(f"SUCCESS - saved {url} - as {file_path}")
        if image_id is not None:
            return image_id
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")

if __name__ == '__main__':
    wd = webdriver.Chrome(executable_path=DRIVER_PATH)  #controls chrome driver and allows you to drive the browser
    plot_names = ["Nightingale Rose Chart"]  #list of search keywords
    for plot in plot_names:
        wd.get('https://google.com')   #loads webpage in the current browser session
        search_box = wd.find_element_by_css_selector('input.gLFyf')   #finds an element by css selector and returns it if found
        search_box.send_keys(plot)   #simulates typing into the element
        links, tags = fetch_image_urls(plot,500,wd)  #gets image urls
        images_path = '/Users/apurvasalvi/Desktop/GauguinBot/images' #folder to save the element
        metadata = pd.DataFrame(columns=["Image ID", "Image Source", "Image Tag", "Image Format", "Image Mode", "Image Size", "Image Palette"])
        for i,j in zip(links, tags):
            image_id = download_image(images_path,plot,i)   #downloads images to the specified pathNone:
            if image_id is not None:
                file_path = str(images_path) + "/" + str(plot) + "/" + str(image_id) + ".jpg"
                image = Image.open(file_path)
            else: 
                continue
            data = {"Image ID": image_id, "Image Source": i, "Image Tag": j, "Image Format": image.format, "Image Mode": image.mode, "Image Size": image.size, "Image Palette": image.palette}
            metadata = metadata.append(data, ignore_index=True) 
        out_path = "/Users/apurvasalvi/Desktop/GauguinBot/Metadata.xlsx"
        writer = pd.ExcelWriter(out_path , engine='openpyxl', mode='a')
        metadata.to_excel(writer, sheet_name=plot)
        writer.save()
        writer.close()
    wd.quit()

Found: 100 search results. Extracting links from 0:100
Found: 167 image links, looking for more ...
Found: 312 search results. Extracting links from 100:312
Found: 500 image links, done!
SUCCESS - saved https://d2mvzyuse3lwjc.cloudfront.net/images/WikiWeb/Graphing/Graphing_SmithChart.png?v=11016 - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/971e33425b.jpg
ERROR - Could not save https://d2mvzyuse3lwjc.cloudfront.net/images/WikiWeb/Graphing/Graphing_SmithChart.png?v=11016 - local variable 'image_id' referenced before assignment
SUCCESS - saved https://miro.medium.com/fit/c/160/160/1*l5ScCuzT0SKFbZ8ko5skDg@2x.jpeg - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/8b32a87f83.jpg
SUCCESS - saved https://www.page45.com/world/wp-content/uploads/2014/08/Corpse-Talk-Florence-Nightingale.jpg - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/97a1c065fd.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd

SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRRf_jxIRjKqAWoj7v6Yze6eHO0lIEJOeHu_w&usqp=CAU - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/6f94751167.jpg
SUCCESS - saved https://nightingalebenefits.com/wp-content/uploads/2020/03/Nightingale-Benefits-member-savings-chart.png - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/e0cabf612c.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSZVcHWohft-KafFCV0jRs2boH-z3_CxaI0Lw&usqp=CAU - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/3ab0b3d38d.jpg
SUCCESS - saved https://upload.wikimedia.org/wikipedia/commons/thumb/3/39/Piechart.svg/300px-Piechart.svg.png - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/66ca0a037d.jpg
SUCCESS - saved https://www.amcharts.com/wp-content/uploads/2018/11/demo_12356_none.png - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/283e5b2a2d.jpg
SUCCE

SUCCESS - saved https://vizzlo.com/img/vizzards/examples/nightingales-rose-chart/the-expanding-demand-for-coding-skills-400x.png? - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/07a60fe7c4.jpg
SUCCESS - saved https://image.slidesharecdn.com/slideshare-121114031408-phpapp02/95/open-source-web-charts-19-638.jpg?cb=1352937594 - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/9835372095.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQHX_OB31e0gVqeBsTMT60__vHI1bv01SxK7Q&usqp=CAU - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/b24ca93993.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQteBWssqTzL9x0-abi7eB7mU3v0zNCpXb8Lw&usqp=CAU - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/8fa9c454a6.jpg
SUCCESS - saved https://uploads-ssl.webflow.com/5dc27bf865f41cedb4b7b9a5/5ec13c8b7b4d114444f12d5c_FWc9t4IYxTiwQNvi-XkDCGQpfNHIzCkUDezCBxszD_BAgJ5h41Ot

SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQNnL-bSiYLvD8Dk_ekyKRf_KyR088ekxsI1A&usqp=CAU - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/21973b807e.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQ_B2LOdZQQP3I1M_QTM3C03WehUKPkLWUXjg&usqp=CAU - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/0de36728e7.jpg
SUCCESS - saved https://online.visual-paradigm.com/repository/images/b5e5eb25-1eba-4908-8e10-111f1f824dd4.png - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/9a658eab36.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRVIfrBmTqP33eQS9LQHxEIMoGUIPhuWN0njA&usqp=CAU - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/29eb6e7ae8.jpg
SUCCESS - saved https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/Doughnut_shape_Pie_Chart.jpg/220px-Doughnut_shape_Pie_Chart.jpg - as /Users/apurvasalvi/Desktop/GauguinBot/ima

SUCCESS - saved https://www.finereport.com/en/wp-content/uploads/2020/03/01.gif - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/19160e1fb6.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQEDvDJYuvLuDRwNhQY-Z0_oF6wHgxcBrdyRw&usqp=CAU - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/1b5d6d4949.jpg
SUCCESS - saved https://sites.google.com/site/distantyetneversoclose/_/rsrc/1326438595627/excel-charts/florence-nightingale-circumplex-chart/Flo%20Nightingle%20Circumplex%20Chart.png?height=237&width=400 - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/377587d422.jpg
SUCCESS - saved https://www.thedataschool.co.uk/wp-content/uploads/2017/01/pathid-points.png - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/c41c681c19.jpg
SUCCESS - saved https://preview.anychart.com/pg/l2lnp24g.png - as /Users/apurvasalvi/Desktop/GauguinBot/images/Nightingale Rose Chart/d860a885a7.jpg
SUCC

In [9]:
metadata

Unnamed: 0,Image ID,Image Source,Image Tag,Image Format,Image Mode,Image Size,Image Palette
0,5567760fa9,https://i.pinimg.com/originals/ef/97/b5/ef97b5...,A Step-by-Step Guide to Making a Choropleth Ma...,JPEG,RGB,"(642, 316)",
1,22ede0538d,https://miro.medium.com/max/3354/1*UIL5MxvErJU...,Choropleth maps: theory behind making a map on...,JPEG,RGB,"(1677, 827)",
2,a056d3f388,https://upload.wikimedia.org/wikipedia/commons...,Make a Covid-19 Choropleth Map in Mapbox | by ...,JPEG,RGB,"(917, 633)",
3,c12e5def2a,https://populationeducation.org/sites/default/...,Choropleth Map | Data Visualization Standards,JPEG,RGB,"(605, 376)",
4,4a96a648db,https://xdgov.github.io/data-design-standards/...,Bivariate map - Wikipedia,JPEG,RGB,"(1200, 901)",


In [2]:
import matplotlib.pyplot as plt
import numpy as np
import os
import PIL
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

In [4]:
import pathlib
data_dir = pathlib.Path("/Users/apurvasalvi/Desktop/GauguinBot/images/Arc Diagram")

In [5]:
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)

0


In [3]:
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url


ModuleNotFoundError: No module named 'httplib'

### Contribution

External Source: 70%
Personal Contribution: 30%

### Citations

1. Article title:	Web Scraping Images from Google

   Website title:	Medium
   
   URL          :	https://medium.com/@wwwanandsuresh/web-scraping-images-from-google-9084545808a2
   
   
2. Article title:	Msalmannasir/Google_image_scraper

   Website title:	GitHub
   
   URL          :	https://github.com/Msalmannasir/Google_image_scraper/blob/master/google_img.

### Conclusion

This task helped me learn how to use Selenium to web scrape any image from the internet. The task is divided into 3 steps: Opening the Web Browser using Selenium, Getting the URLs, and Downloading the images using these URLs. The above lines of codes can be used to download any number of images from the internet. Thus, I have automated the process of getting images from the web using Selenium. 