<h1>Scraping images</h1>
<p>We are going to build a neural network to classify big cats (lions, tigers, and jaguars). To do this, we have to train the network on thousands of images of big cats so that it learns to classify them correctly. Downloading thousands of images from the internet by hand would take <strong>hours</strong>, so we are going to scrape them instead.</p>

<h2>Part I: Download images</h2>
<p>First let's import the necessary libraries.</p>

In [1]:
import requests
from bs4 import BeautifulSoup
import urllib.request
from time import sleep
from random import randint

<p>I will be downloading images from <a href='https://www.dreamstime.com/photos-images/tiger.html'>dreamstime</a>, which has many pages of images of big cats. To do this, I have defined the function below, which takes the following parameters:</p>
<ul>
    <li>animal_name: Name of the big cat, so that downloaded images can be named correctly.</li>
    <li>start_link: The link to the first search page, without the page number at the end.</li>
    <li>page_qty: Number of pages to scrape.</li>
</ul>
<p>It returns the number of images that were downloaded successfully.</p>

In [2]:
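def scrape_images(animal_name, start_link, page_qty):
    """Download the images on the first page_qty search pages; return how many were saved."""
    counter = 0
    for num in range(1, page_qty + 1):
        # the page number is appended to the end of the search link
        page_link = start_link + str(num)
        current_page = requests.get(page_link)
        print(current_page)
        # pause for a random 8-15 seconds between page requests
        sleep(randint(8, 15))
        soup = BeautifulSoup(current_page.content, 'html.parser')
        # extract the first element with class "showonload", which holds the images
        animal = soup.find_all(class_="showonload")
        if not animal:
            print('No image container found on page %d' % num)
            continue
        # download all images on the current page
        for image in animal[0].find_all("img"):
            img_name = '%s%d.jpg' % (animal_name, counter)
            try:
                # the image URL is read from the 'data-src' attribute
                urllib.request.urlretrieve(image['data-src'], img_name)
                # only count images that were actually saved
                counter += 1
            except Exception:
                print('Image Not Found')

    return counter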
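
One detail of scrape_images is easy to miss: the counter only advances after a successful download, so the saved files are numbered without gaps, from 0 up to one less than the returned total. The sketch below isolates that naming logic. The name_downloads helper and its list of success flags are hypothetical stand-ins for the real downloads, not part of the notebook.

```python
def name_downloads(animal_name, results):
    """Mimic the file naming used in scrape_images.

    `results` is a hypothetical list of booleans, one per image:
    True means the download succeeded, False means it failed.
    """
    counter = 0
    saved = []
    for ok in results:
        img_name = '%s%d.jpg' % (animal_name, counter)
        if ok:
            saved.append(img_name)
            counter += 1  # only advance after a successful download
    return saved, counter


saved, total = name_downloads('lion', [True, False, True, True])
print(saved)   # ['lion0.jpg', 'lion1.jpg', 'lion2.jpg']
print(total)   # 3
```

A failed download reuses its number for the next image, which is why the total returned by scrape_images can later be used as the upper bound when picking image numbers at random.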

Now I create two lists: one with the name of each cat, and one with the matching search link. Each link ends in pg= so that the page number can be appended to it.

In [3]:
names = ['jaguar', 'lion', 'tiger']
links = ['https://www.dreamstime.com/search.php?srh_field=jaguar%20animal&s_all=n&s_ph=y&s_il=n&s_video=n&s_audio=n&s_ad=n&s_sl0=y&s_sl1=y&s_sl2=y&s_sl3=y&s_sl4=y&s_sl5=y&s_rf=y&s_ed=y&s_orp=y&s_orl=y&s_ors=y&s_orw=y&s_clc=y&s_clm=y&s_rsf=0&s_rst=7&s_st=new&s_sm=all&s_mrg=1&s_mrc1=y&s_mrc2=y&s_mrc3=y&s_mrc4=y&s_mrc5=y&s_exc=&pg=',
        'https://www.dreamstime.com/search.php?srh_field=lion&s_all=n&s_ph=y&s_il=n&s_video=n&s_audio=n&s_ad=n&s_sl0=y&s_sl1=y&s_sl2=y&s_sl3=y&s_sl4=y&s_sl5=y&s_rf=y&s_ed=y&s_orp=y&s_orl=y&s_ors=y&s_orw=y&s_clc=y&s_clm=y&s_rsf=0&s_rst=7&s_st=new&s_sm=all&s_mrg=1&s_mrc1=y&s_mrc2=y&s_mrc3=y&s_mrc4=y&s_mrc5=y&s_exc=&pg=',
        'https://www.dreamstime.com/photos-images/tiger.html?pg=']
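
To see why every link ends in an empty pg= parameter, here is a small sketch of how the page URLs are built from a start link. The page_links helper is a hypothetical name used for illustration only; it repeats the string concatenation done inside scrape_images and then parses the result to check where the page number ends up.

```python
from urllib.parse import urlsplit, parse_qs


def page_links(start_link, page_qty):
    """Return the URLs of pages 1..page_qty, built the same way as in scrape_images."""
    return [start_link + str(num) for num in range(1, page_qty + 1)]


tiger_link = 'https://www.dreamstime.com/photos-images/tiger.html?pg='
pages = page_links(tiger_link, 3)
print(pages[-1])
# https://www.dreamstime.com/photos-images/tiger.html?pg=3

# parse the query string to confirm the page number is the value of 'pg'
query = parse_qs(urlsplit(pages[-1]).query)
print(query['pg'])  # ['3']
```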

Now download the pictures using the scrape_images function. Only the first 10 pages are downloaded (approximately 800 pictures for each cat).

In [5]:
number_images = []
for name, link in zip(names, links):
    total = scrape_images(name, link, 10)
    number_images.append(total)

<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
Image Not Found
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>


<h2>Part II: Organise data</h2>
<p>Right now all the images are downloaded into the same folder as the notebook. Let's split them into a training set and a test set: 20% of the images will go in the test folder, while the remaining 80% will be used to train the neural network.</p>
<p>First, you need to create folders to store the training and test images. The folders need to be organised in the following format:</p>

 dataset --> training_set-----> lion
         |                 |--> jaguar
         |                 |--> tiger
         |
         |--> test_set--------> lion
                          |--> jaguar
                          |--> tiger

Make directories

In [6]:
# -p creates missing parent folders and does not complain if a folder already exists
%mkdir -p dataset/training_set/lion
%mkdir -p dataset/training_set/jaguar
%mkdir -p dataset/training_set/tiger
%mkdir -p dataset/test_set/lion
%mkdir -p dataset/test_set/jaguar
%mkdir -p dataset/test_set/tiger
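
As a portable alternative to the shell commands, the same folder layout can be created from Python. This is a minimal sketch, not the notebook's own method: make_dataset_dirs is a hypothetical helper, and its default arguments simply mirror the folder names used above.

```python
import os


def make_dataset_dirs(root='dataset', animals=('lion', 'jaguar', 'tiger')):
    """Create root/training_set/<animal> and root/test_set/<animal> for each animal."""
    created = []
    for split in ('training_set', 'test_set'):
        for animal in animals:
            path = os.path.join(root, split, animal)
            # exist_ok=True makes the call safe to repeat on existing folders
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```

Calling make_dataset_dirs() from the notebook folder would create the six leaf folders and their parents, and running it a second time leaves the existing folders untouched.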


In [9]:
import glob
import os
import random
import shutil

def organise_data(animal, total_pics):
    """Move a random 20% of an animal's images to the test set and the rest to the training set."""
    test_dir = './dataset/test_set/%s' % animal
    train_dir = './dataset/training_set/%s' % animal
    if not os.path.exists(test_dir):
        print('There is no /dataset/test_set/%s folder' % animal)
        return
    if not os.path.exists(train_dir):
        print('There is no /dataset/training_set/%s folder' % animal)
        return
    # pick 20% of the image numbers at random for the test set
    end = int(0.2 * total_pics)
    test_data = random.sample(range(total_pics), end)
    for number in test_data:
        img_name = '%s%s.jpg' % (animal, number)
        shutil.move(img_name, os.path.join(test_dir, img_name))
    # every image still left in the notebook folder goes to the training set
    for img_name in glob.glob('%s*.jpg' % animal):
        shutil.move(img_name, os.path.join(train_dir, img_name))
        

In [7]:
number_images

[796, 799, 799]

In [10]:
for name, total in zip(names, number_images):
    organise_data(name, total)
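
To check what the 80/20 split means in numbers, the sketch below repeats the selection step of organise_data without touching any files, using the image counts reported above (796, 799, 799). The split_indices helper and its seed parameter are hypothetical additions for illustration; the notebook itself does not fix a seed.

```python
import random


def split_indices(total_pics, test_fraction=0.2, seed=None):
    """Return (test, train) lists of image numbers, chosen as in organise_data."""
    rng = random.Random(seed)  # seed is a hypothetical addition for repeatability
    end = int(test_fraction * total_pics)
    test = sorted(rng.sample(range(total_pics), end))
    test_set = set(test)
    train = [n for n in range(total_pics) if n not in test_set]
    return test, train


for name, total in zip(['jaguar', 'lion', 'tiger'], [796, 799, 799]):
    test, train = split_indices(total, seed=0)
    print(name, len(test), len(train))
# jaguar 159 637
# lion 159 640
# tiger 159 640
```

Because int() rounds down, the test set is slightly smaller than 20% whenever the count is not a multiple of five.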