# Downloading data from Google Image search 

*Based on lesson 2 of fastai course (v3) by: Francisco Ingham and Jeremy Howard. Inspired by [Adrian Rosebrock](https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/)*

Steps for creating an image dataset of penguins.

In [1]:
from fastai.vision import *

## Get a list of URLs

### Search and scroll

Go to [Google Images](http://images.google.com) and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.

Scroll down until you've seen all the images you want to download, or until you see a button that says 'Show more results'. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.

### Download into file

Now you must run some JavaScript code in your browser, which will save the URLs of all the images you want for your dataset.

Press <kbd>Ctrl</kbd><kbd>Shift</kbd><kbd>J</kbd> on Windows/Linux or <kbd>Cmd</kbd><kbd>Opt</kbd><kbd>J</kbd> on Mac, and a small window, the JavaScript 'Console', will appear. That is where you will paste the JavaScript commands.

You will need to get the URL of each image. You can do this by running the following commands:

```javascript
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
```

Save each downloaded file as class_name.txt (for example, adelie.txt) in a folder called data/penguins underneath this notebook.

In [2]:
path = Path('data/penguins')
path.mkdir(parents=True, exist_ok=True)

### Create directory and download images

Now copy all the downloaded URL files into the data/penguins folder (if you are not running locally, you can use Upload from the Jupyter notebook).

In [3]:
classes = ['adelie', 'african', 'chinstrap', 'emperor', 'erect_crested', 'fiordland', 'galapagos', 'gentoo', 'humboldt', 
           'king', 'little', 'macaroni', 'magellanic', 'rockhopper', 'royal', 'snares', 'yellow_eyed']
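
Before downloading, it can help to confirm that every class has a URL file and that each file contains a reasonable number of URLs. The following is a minimal sketch using only the standard library; `count_urls` is a hypothetical helper, not part of fastai, and it assumes one URL per line, as produced by the JavaScript snippet above.

```python
from pathlib import Path

def count_urls(folder, class_names):
    """Map each class name to the number of non-blank lines in
    <folder>/<class_name>.txt, or to None if that file is missing.

    Hypothetical helper for illustration; assumes one URL per line.
    """
    counts = {}
    for name in class_names:
        url_file = Path(folder) / f"{name}.txt"
        if not url_file.is_file():
            counts[name] = None  # no URL file saved for this class yet
            continue
        with open(url_file, encoding="utf-8") as f:
            counts[name] = sum(1 for line in f if line.strip())
    return counts
```

In this notebook, calling `count_urls(path, classes)` would show which classes still lack a URL file (value `None`) and which have only a few URLs.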

The following code loops through each class, creates a folder for it, and downloads the images.

In [None]:
for penguin_class in classes:
    file = path/f'{penguin_class}.txt'
    dest = path/penguin_class
    dest.mkdir(parents=True, exist_ok=True)
    download_images(file, dest, max_pics=200)

Then we can remove any images that can't be opened. Passing `max_size=500` also resizes any image with a side longer than 500 pixels:

In [None]:
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)
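
After verification, it is worth checking how many usable images remain in each class folder, since broken downloads are deleted and the classes may end up unbalanced. This is a rough sketch with the standard library only; `count_images` and the `IMAGE_SUFFIXES` set are assumptions for illustration, not fastai functions.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the files actually downloaded.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

def count_images(folder, class_names):
    """Map each class name to the number of image files in <folder>/<class_name>.

    Hypothetical helper for illustration; a missing folder counts as 0.
    """
    counts = {}
    for name in class_names:
        class_dir = Path(folder) / name
        if not class_dir.is_dir():
            counts[name] = 0
            continue
        counts[name] = sum(
            1 for p in class_dir.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Here, `count_images(path, classes)` would give a per-class tally, which makes it easy to spot classes that need another search or a higher `max_pics`.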

## Notes on data cleaning

Things I did to improve my image dataset:

* Removing images that were not penguins.
* Removing cartoons, drawings, images of toys.
* Moving images that were in the wrong folder (e.g. a king penguin in the emperor folder).
* Removing images taken from a distance, where the penguins are too small.
* Removing images of chicks, as they look very different to the adults (I hope to include penguin chicks in a future version).

I also noticed that I had a lot of images containing multiple birds, so I cropped individuals out into their own images where possible. This process took about half an hour for each set of photos, which was quite a lot of manual effort compared to the machine learning training time. It's just as well that I like looking at pictures of penguins!

