# Download bulk images from Google


## Why do we need this?

Downloading a few images from Google Image Search by hand is easy, but it takes time because every image has to be saved manually. That is fine for a handful of images, but what if we need to download images in bulk?

Another approach is to use dedicated software, but finding the right tool and installing its dependencies is very time-consuming. So why not write a script that fetches the image links from Google and then downloads the images into a chosen folder?

## Objective

Download images of a specific subject from the internet.

## Usage

I use this script to download images and build datasets for machine learning models.

Nowadays <b>Deep Learning</b> is widely used in the <b>Data Science</b> field for image classification, object detection, and related tasks.

To create your own dataset, run this script and specify the keyword for the images you want to download.

## Terminology

It uses the concept of a web crawler, also known as a web spider or web robot.

### What is a web crawler?

A web crawler is a kind of bot that uses an internet connection to crawl/scrape data from different websites.

A crawler visits a website one page at a time until all of its pages have been indexed.
Web crawlers help collect information about a website and the links related to it, and can also help validate HTML code and hyperlinks.

Web crawlers collect information such as the URL of the website, the meta tag information, the web page content, the links in the page and the destinations those links lead to, the page title, and any other relevant information.

## How does it work?

The script is a web crawler: it accepts a search keyword from the user and sends a request to Google Image Search.
Once it gets the response from the Google server, it parses the response page, extracts the image hyperlinks one by one, and stores them in the <b>urls.txt</b> file.

Once fetching is done, it requests each image URL and downloads the images into the images folder.

## Let's put the logic of web crawling into code 

In [1]:
import requests  # Used to send requests and read the server's response
import time  # Used to put some delay into the code

## This function will help us fetch URLs from the response page

The response is the raw HTML of the page. Parsing it and pulling out the URLs with the BeautifulSoup and lxml libraries is not as easy as it usually is, because Google changes its HTML tags very frequently and keeps the image URLs inside embedded JSON data.

To get those hidden URLs, I wrote this function to handle the JSON data.

In [2]:
def subString(mainTxt, spos, sTag, eTag):
    # Return the text found between sTag and eTag, searching from position spos.
    # An empty string is returned when either tag cannot be found.
    try:
        tagStartPos = mainTxt.index(sTag, int(spos))
    except ValueError:
        return ""

    valueStart = tagStartPos + len(sTag)
    tagEndPos = mainTxt.find(eTag, valueStart)
    if tagEndPos == -1:
        return ""

    return str(mainTxt[valueStart:tagEndPos])
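
As a quick illustration of the start-tag/end-tag search described above, the sketch below applies the same idea to a made-up fragment. The name `extract_between` and the sample text are hypothetical and exist only for this example; the notebook's own helper remains `subString`.

```python
def extract_between(text, start_pos, start_tag, end_tag):
    """Return the text between start_tag and end_tag, searching from start_pos."""
    start = text.find(start_tag, start_pos)
    if start == -1:
        return ""
    value_start = start + len(start_tag)
    end = text.find(end_tag, value_start)
    if end == -1:
        return ""
    return text[value_start:end]


# Made-up fragment shaped like JSON embedded in a results page.
sample = '{"ity":"jpg","x":"https://example.com/a.jpg","w":800}'

# Everything after the first "http" up to the next double quote.
tail = extract_between(sample, 0, 'http', '"')
url = 'http' + tail
```

Because the start tag itself is not part of the returned text, the caller has to put `http` back in front of the result, which is what the notebook does when it builds each URL.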

## Enter your search keyword 

e.g. Enter search keyword: iPhone xs max

In [3]:
def get_search_keyword():
    return str(input("Enter search keyword:")).replace(' ','+')
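
Replacing spaces with `+` works for simple phrases, but characters such as `&` or `#` would change the meaning of the query string. A minimal sketch of a more general encoding, assuming the standard library's `urllib.parse.quote_plus` is acceptable here (the helper name `encode_keyword` is hypothetical):

```python
from urllib.parse import quote_plus


def encode_keyword(keyword):
    """Encode a search phrase for use in a URL query string."""
    return quote_plus(keyword.strip())


encoded = encode_keyword("iPhone xs max")
special = encode_keyword("black & white cat")
```

For the plain phrase the result is the same as the `replace(' ', '+')` approach; the difference only shows up when the keyword contains reserved characters.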

## Storing all urls

In [4]:
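def get_url_from_responsetext(responsetext):
    # Each JPEG result in the page is tagged with this marker.
    marker = '"ity":"jpg"'
    urls_list = []
    ipos = responsetext.find(marker, 1)

    # Walk through every marker; stop as soon as no further marker is found.
    while ipos != -1:
        # The image URL is the first http... string after the marker,
        # ending at the next double quote.
        tempUrl = subString(responsetext, ipos, 'http', '"')
        # '\x3d' is the '=' character, so this strips '=' signs and single quotes.
        tempUrl = tempUrl.replace('\x3d', '').replace("'", "")
        # Skip markers for which no URL could be extracted.
        if tempUrl:
            urls_list.append(f'http{tempUrl}')
        ipos = responsetext.find(marker, ipos + 10)
    return urls_list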
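
The same marker-then-URL scan can also be written with a regular expression. The sketch below is an alternative under assumptions, not the notebook's method: the sample fragment is made up, the field name `"x"` is a placeholder, and `extract_jpg_urls` is a hypothetical name. It also turns the `\u003d` escape, which appears in some of the recorded URLs, back into `=`.

```python
import re

# Matches the marker, lazily skips ahead to the next http... string,
# and captures that string up to its closing double quote.
URL_AFTER_MARKER = re.compile(r'"ity":"jpg".*?(http[^"]*)"', re.DOTALL)


def extract_jpg_urls(page_text):
    """Return the first URL found after each "ity":"jpg" marker."""
    urls = []
    for raw in URL_AFTER_MARKER.findall(page_text):
        # Turn the escaped form of "=" back into a literal "=".
        urls.append(raw.replace('\\u003d', '='))
    return urls


# Made-up page fragment; only the "ity" marker comes from the notebook.
sample_page = (
    '{"ity":"jpg","x":"https://example.com/a.jpg"}'
    '{"ity":"png","x":"https://example.com/b.png"}'
    '{"ity":"jpg","x":"http://example.com/c.jpg?v\\u003d1"}'
)
found = extract_jpg_urls(sample_page)
```

Either approach depends on the marker staying in the page source, so both are likely to need adjusting whenever the page format changes.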

## The main logic of requesting your search keyword is here.

When hitting any website with a bot, we need to follow some protocols and rules in order to get the correct response.

Here are a few things that need to be set before requesting any website with a bot:

1. <b>Headers</b>
    1. They play a very important role in web scraping/crawling.
    2. We set some parameters in our crawler because smarter websites require them, such as:
        - what kind of content and language the client accepts;
        - in which language the response should be returned;
        - which user agent (a particular browser or mobile device) we are using to request the website;
        - cookies (sometimes required to maintain the session).
2. <b>Proxy</b>
    1. If you are running the crawler from servers, you need to set this in your code.
    2. It helps us vary our IP address and lets us hit the website from different locations.
3. <b>Timeout</b>
    1. If we don't set it in our code, the request is sent and the script waits until the website responds. Sometimes that takes a very long time or gets stuck partway.
    2. Setting a timeout makes the code wait for a response only for a specific amount of time.
        

### Example

In this example I am using the <b>GET method</b>.

### Setting Parameters


### Headers

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-language': 'en-US,en;q=0.9',
        'Content-type': 'text/html; charset=UTF-8',
        'Referer': 'https://images.google.com/',
        'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    }

### Proxy

    HTTPS:
        proxy = {'https': 'https://username:password@ipaddress:port'}  # If you have credentials.
        proxy = {'https': 'https://ipaddress:port'}  # If you don't have credentials.

    HTTP:
        proxy = {'http': 'http://username:password@ipaddress:port'}  # If you have credentials.
        proxy = {'http': 'http://ipaddress:port'}  # If you don't have credentials.
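
Writing the proxy URL by hand gets error-prone once the password contains characters such as `@` or `:`. A minimal sketch of building the mapping programmatically, assuming the dictionary is later passed to the `proxies=` argument of `requests.get`; the function name `build_proxies`, its parameters, and the address used below are all hypothetical:

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None, scheme="http"):
    """Build a proxies mapping keyed by the scheme of the target URL."""
    if username and password:
        # Percent-encode credentials so "@" or ":" inside them stay unambiguous.
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    else:
        auth = ""
    proxy_url = f"{scheme}://{auth}{host}:{port}"
    # Route both plain and TLS traffic through the same proxy in this sketch.
    return {"http": proxy_url, "https": proxy_url}


with_auth = build_proxies("10.0.0.1", 8080, "user", "p@ss")
without_auth = build_proxies("10.0.0.1", 8080)
```

The keys of the mapping refer to the scheme of the site being requested, while the scheme inside each value describes how to reach the proxy itself.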

### URL

base_url = 'https://www.google.com'

### Status
The status code tells us the outcome of the request we made.

<h3><b>Status Code:</b></h3><br>
   1. <b>200</b> (OK) => The request succeeded.<br>
   2. <b>301, 302</b> (Redirection) => The URL has moved to another location.<br>
   3. <b>403</b> (Forbidden) => The website is refusing our request, for example because expected headers are missing.<br>
   4. <b>404</b> (Not Found) => The page does not exist at this URL.<br>
   5. <b>407</b> (Proxy Authentication Required) => The proxy requires authentication. Supply your username and password; with Basic authentication they are sent Base64-encoded.<br>
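
The status codes listed above can be turned into a small lookup that a crawler prints or logs. This is only a sketch: the function name `describe_status` and the suggested actions are assumptions made for illustration, while the reason phrases come from the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the reason phrase and a suggested next step for a status code."""
    actions = {
        200: "parse the response body",
        301: "follow the Location header to the new URL",
        302: "follow the Location header to the new URL",
        403: "review the headers, cookies, or request rate",
        404: "check the URL",
        407: "supply proxy credentials",
    }
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about.
        phrase = "Unknown status"
    return f"{code} {phrase}: {actions.get(code, 'no specific handling')}"


summary = describe_status(407)
```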

### Implementation of headers and proxy in our request

    try:
        # requests expects the keyword argument `proxies`, not `proxy`.
        response = requests.get(url=base_url, headers=headers, proxies=proxy, timeout=50)
        print(response.status_code)
        if response.status_code == 200:
            print('Connection established')
    except Exception as e:
        print(e)

In [5]:
def fetch_urls():
    base_url = 'https://www.google.com/search?safe=active&tbm=isch&source=hp&biw=628&bih=657&ei=Px4LXdqLH9fLrQGk-5nwDQ&q='
    
    headers = {
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
               'Accept-language': 'en-US,en;q=0.9',
               'Content-type':'text/html; charset=UTF-8',
               'Referer': 'https://images.google.com/',
               'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
              }
    # Returned as-is when the request or the parsing fails.
    all_list_urls = []
    try:
        response = requests.get(url=f'{base_url}{get_search_keyword()}', headers=headers, timeout=50)
    except Exception as e:
        print(e)
        return all_list_urls
    print('Response status:', response.status_code)
    if response.status_code == 200:
        try:
            all_list_urls = get_url_from_responsetext(response.text)
            # Append the URLs to urls.txt; the file is closed automatically.
            with open('urls.txt', 'a') as writFile:
                for urls in all_list_urls:
                    writFile.write(urls + '\n')
        except Exception as e:
            print('Error while parsing:', e)
    return all_list_urls

## Requesting our search keyword and storing the image urls in text file

In [6]:
fetched_urls = fetch_urls()

Enter search keyword:iPhone xs max
Response status: 200


## Downloading all images into the images folder with random names

In [7]:
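import os
import urllib.request

def download_images_using(url, img_no):
    # Make sure the target folder exists before saving into it.
    os.makedirs('images', exist_ok=True)
    try:
        # Lines read from urls.txt end with a newline, so strip it first.
        urllib.request.urlretrieve(url.strip(), f"images/img_{img_no}.jpg")
    except Exception as e:
        print(f'Error while downloading image {img_no}', e)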
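
Random numbers give each image a different name in most cases, but the names change on every run and two images can still collide. One alternative, sketched here as an assumption rather than as part of the notebook, is to derive the file name from a hash of the URL. The function name `filename_for` and the example URLs are hypothetical.

```python
import hashlib
import os
from urllib.parse import urlparse


def filename_for(url, folder="images"):
    """Derive a repeatable file path for an image URL."""
    url = url.strip()
    # Keep the extension from the URL path when it has one; default to .jpg.
    ext = os.path.splitext(urlparse(url).path)[1].lower() or ".jpg"
    # A hash of the URL gives the same name on every run for the same URL.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return os.path.join(folder, f"img_{digest}{ext}")


path_a = filename_for("https://example.com/photos/phone.jpg\n")
path_b = filename_for("https://example.com/photos/phone.jpg")
path_c = filename_for("https://example.com/photos/other.png?size=large")
```

With names like these, re-running the script overwrites an image with itself instead of saving a second copy under a new number.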

In [8]:
with open('urls.txt','r') as images_link:
    print(images_link.read())

https://cdn.tmobile.com/content/dam/t-mobile/en-p/cell-phones/apple/apple-iphone-xs-max/gold/Apple-iPhoneXsMax-Gold-2-3x.jpg
https://static.toiimg.com/photo/65786818/Apple-iPhone-XS-Max.jpg
http://cdn.shopify.com/s/files/1/1043/3082/products/iPhoneXSMax_line_up_1200x630.jpg?v\u003d1538647006
http://d176tvmxv7v9ww.cloudfront.net/product/cache/12/image/9df78eab33525d08d6e5fb8d27136e95/i/p/iphone-xs-max-space-select-2018_av2_4.jpg
https://images-na.ssl-images-amazon.com/images/I/61pu8v9oXrL._SX569_.jpg
https://icdn2.digitaltrends.com/image/iphone-xs-max-review-1-1500x994.jpg
https://zdnet3.cbsistatic.com/hub/i/r/2018/09/27/bd8a6105-e956-460a-afa3-8f7453854654/thumbnail/770x433/0599cef31267df04cf5501ab97e3b77d/iphone-xs-max-1.jpg
https://www.imediastores.com/wp-content/uploads/2018/09/Apple-iPhone-XS-27.jpg
https://cdnblob.moshi.com/uploadedfiles/photo/v3/productImages/1063/01.jpg
https://cnet3.cbsistatic.com/img/neLbs059DWMEZSz0j9VEGLY2s1w\u003d/2018/09/17/84430c77-b39e-48bd-b3ed-752e4b54

### Read each image link from the text file and pass it to the download function to save the image in the target folder.



In [9]:
import random
img_no=1
with open('urls.txt','r') as f:
    for url in f:
        print(url)
        download_images_using(url,img_no)
        # Step by at least 1 so that two images never share a number.
        img_no=img_no+random.randint(1,100000)

https://cdn.tmobile.com/content/dam/t-mobile/en-p/cell-phones/apple/apple-iphone-xs-max/gold/Apple-iPhoneXsMax-Gold-2-3x.jpg

https://static.toiimg.com/photo/65786818/Apple-iPhone-XS-Max.jpg

http://cdn.shopify.com/s/files/1/1043/3082/products/iPhoneXSMax_line_up_1200x630.jpg?v\u003d1538647006

http://d176tvmxv7v9ww.cloudfront.net/product/cache/12/image/9df78eab33525d08d6e5fb8d27136e95/i/p/iphone-xs-max-space-select-2018_av2_4.jpg

https://images-na.ssl-images-amazon.com/images/I/61pu8v9oXrL._SX569_.jpg

Error while downloading image 242628 <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>
https://icdn2.digitaltrends.com/image/iphone-xs-max-review-1-1500x994.jpg

https://zdnet3.cbsistatic.com/hub/i/r/2018/09/27/bd8a6105-e956-460a-afa3-8f7453854654/thumbnail/770x433/0599cef31267df04cf5501ab97e3b77d/iphone-xs-max-1.jpg

https://www.imediastores.com/wp-content/uploads/2018/09/Apple-iPhone-XS-27.jpg

Error while downloading image 372936 <urlopen error 

https://pisces.bbystatic.com/image2/BestBuy_US/images/products/6284/6284011_rd.jpg

https://dbrand.com/sites/default/files/images/shop/device-gallery/matte-black-iphone-xs-max-skins.jpg

Error while downloading image 2924467 <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>
https://cdn.idropnews.com/wp-content/uploads/2018/09/14114222/Win-an-iPhone-XS-Max-iDrop-News-Apple-iPhone-XS-Max-Giveaway.jpg

https://www.incase.com/media/catalog/product/cache/1/small_image/9df78eab33525d08d6e5fb8d27136e95/i/n/inph220553-rgd_incase_iphonexsmax_protectiveclearcover_a.jpg

https://www.supcase.com/media/catalog/product/cache/3/image/666x666/9df78eab33525d08d6e5fb8d27136e95/u/b/ubstyle_1/i-Blason-iphone-xs-max-unicorn-beetle-style-slim-clear-case-black-31.jpg?rand\u003d0.6047268631604974

https://cdn.shoplightspeed.com/shops/606657/files/12972783/apple-iphone-xs-max-64gb-space-grey.jpg

Error while downloading image 3151574 <urlopen error [SSL: CERTIFICATE_VERIFY

### Some downloads fail with a certificate verification error (CERTIFICATE_VERIFY_FAILED); those images are skipped and the rest are still downloaded
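
A minimal sketch of skipping such failures explicitly, assuming the standard library's `urllib.request.urlopen` with an SSL context is used instead of `urlretrieve`. The function name `download_image` and its parameters are hypothetical; certificate verification stays switched on, and a failing URL is reported and skipped.

```python
import shutil
import ssl
import urllib.request


def download_image(url, dest, context, timeout=30):
    """Download url to dest; return True on success, False if it was skipped."""
    url = url.strip()
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp, \
                open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)
        return True
    except (OSError, ValueError) as e:
        # URLError, SSLError and timeouts are OSError subclasses;
        # ValueError covers strings that are not valid URLs.
        print(f"Skipping {url}: {e}")
        return False


# The default context verifies certificates against the system trust store.
context = ssl.create_default_context()
```

If the errors come from a missing or outdated set of root certificates on the machine, `ssl.create_default_context` also accepts a `cafile` argument pointing at a CA bundle, which may resolve them without turning verification off.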