# Creating your own dataset from Google Images
*Inspired by Francisco Ingram, Jeremy Howard, and [Adrian Rosebrock](https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/)*

This notebook can help you find and download large numbers of potentially relevant images using [Google Images](https://images.google.com) with the help of the [fastai library](https://github.com/fastai/fastai). See the [readme](https://github.com/fastai/fastai/blob/master/README.md) in the fastai repository for instructions on fastai library installation.

In [1]:
from fastai.vision import *

## Get a list of URLs

### Search and scroll 

Go to [Google Images](https://images.google.com) and search for the images you are interested in.

Scroll down until you have seen all of the images you want to download, or until you see the 'Show More Results' button. All of the images you scrolled past are now available for download. To get more, click on the button and continue scrolling. Google Images will show a maximum of 700 images.

You'll probably want to exclude things you know you won't want from your search query. For example, if you are searching for the Great White Shark, "Selachimorpha Carcharodon carcharias", you may get better results by excluding other variants:

    "carcharodon carcharias" -prionace -sphyrnidae -pristiophoridae
    
Limit results to photos by clicking on Tools and selecting Photos from the Type dropdown.
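
The exclusion syntax from the shark example can also be assembled programmatically, which helps when you collect several categories. This is only a small sketch: `build_query` is a hypothetical helper, not part of fastai or of any Google API, and it just produces a string to paste into the search box.

```python
def build_query(phrase, exclude=()):
    """Quote the exact phrase and prefix each unwanted term with a minus sign."""
    parts = ['"{}"'.format(phrase)]
    parts += ['-{}'.format(term) for term in exclude]
    return ' '.join(parts)

query = build_query('carcharodon carcharias',
                    exclude=['prionace', 'sphyrnidae', 'pristiophoridae'])
print(query)  # "carcharodon carcharias" -prionace -sphyrnidae -pristiophoridae
```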

### Download URLs into file 

Now you must run some JavaScript in your browser which will save the URLs of all the images you want for your dataset.

Press <kbd>Ctrl</kbd><kbd>Shift</kbd><kbd>J</kbd> on Windows/Linux or <kbd>Cmd</kbd><kbd>Opt</kbd><kbd>J</kbd> on Mac, and a small JavaScript 'Console' window will appear (note that if you are using Safari you must use <kbd>Alt</kbd><kbd>Cmd</kbd><kbd>C</kbd>). Paste your JavaScript commands here.

To get the URLs for your images, run the following commands:

```javascript
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
```

(On Safari this will open the results in another tab. Press <kbd>Cmd</kbd><kbd>S</kbd> to save. It will force you to save it as a `.csv`, but just rename it to `.txt` and you're gold.)
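
Before uploading, it can be worth tidying the saved file, since the browser export may contain blank lines or repeated URLs. The following is a minimal sketch using only the Python standard library. `clean_url_file` and the file name `download.csv` are hypothetical names chosen for illustration.

```python
from pathlib import Path

def clean_url_file(src, dst):
    """Keep unique http(s) URLs from src, in order, and write one per line to dst."""
    seen = set()
    urls = []
    for line in Path(src).read_text(encoding='utf-8').splitlines():
        url = line.strip()
        # Skip blank lines, non-URL lines, and URLs that were already seen
        if url.startswith(('http://', 'https://')) and url not in seen:
            seen.add(url)
            urls.append(url)
    Path(dst).write_text(''.join(u + '\n' for u in urls), encoding='utf-8')
    return urls

# Example usage (file names are placeholders):
# urls = clean_url_file('download.csv', 'urls_wampeters.txt')
# print(len(urls), 'unique URLs kept')
```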

### Create a directory and upload URLs file onto your server

Choose appropriate names for your labeled images. Run these steps as many times as necessary to collect items for all of your labels.

In [2]:
folder = 'wampeters'
file = 'urls_wampeters.txt'

In [3]:
folder = 'foma'
file = 'urls_foma.txt'

In [4]:
folder = 'granfalloons'
file = 'urls_granfalloons.txt'

You will need to run the following cell once for each category above:

In [None]:
path = Path('data/bokononisms')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

In [None]:
path.ls()
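
Instead of re-running the cell by hand for each category, the folders could be created in one loop. This is a sketch using only `pathlib`. `make_class_dirs` is a hypothetical helper, and the class names simply repeat the ones used in this notebook.

```python
from pathlib import Path

def make_class_dirs(path, classes):
    """Create one sub-folder per class under path and return them by class name."""
    path = Path(path)
    dests = {}
    for c in classes:
        dest = path / c
        # parents=True creates missing parent folders; exist_ok=True makes re-runs harmless
        dest.mkdir(parents=True, exist_ok=True)
        dests[c] = dest
    return dests

# Example usage:
# dests = make_class_dirs('data/bokononisms', ['wampeters', 'foma', 'granfalloons'])
# print(dests['foma'])
```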

Finally, upload your URLs file. Note that when using Jupyter notebook you can do this by pressing 'Upload' in your working directory and selecting your file. Then click 'Upload' for each of the displayed files.

## Download images

Use the fastai library's `download_images` function to download images from their respective URLs. It will download the specified images if they can be opened, and ignore them if they cannot. Note that we can specify a maximum number of images to be downloaded.

You will need to run this once for each category.

In [None]:
classes = ['wampeters','foma','granfalloons']

In [None]:
download_images(path/file, dest, max_pics=200)

In [None]:
# If you have problems downloading, try with `max_workers=0` to see exceptions:
download_images(path/file, dest, max_pics=20, max_workers=0)
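
Since the download has to be repeated per category, one option is to build the list of (URLs file, destination folder) pairs up front and loop over it. The sketch below assumes each URLs file is named `urls_<class>.txt` and sits directly under `path`. `download_jobs` is a hypothetical helper, and the fastai call is shown only as a comment so the block runs without fastai installed.

```python
from pathlib import Path

def download_jobs(path, classes, url_file_pattern='urls_{}.txt'):
    """Return one (urls_file, dest_folder) pair per class."""
    path = Path(path)
    return [(path / url_file_pattern.format(c), path / c) for c in classes]

jobs = download_jobs('data/bokononisms', ['wampeters', 'foma', 'granfalloons'])
for urls_file, dest in jobs:
    print(urls_file, '->', dest)
    # With fastai imported as in this notebook, each pair could then be passed to:
    # download_images(urls_file, dest, max_pics=200)
```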

Then remove any images that can't be opened.

In [None]:
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)

## View data

In [None]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [None]:
# If you already cleaned your data, run this cell instead of the one before
np.random.seed(42)
data = ImageDataBunch.from_csv(".", folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

In [None]:
# Variant of the previous cell: reads cleaned.csv relative to `path` instead of the current directory
# PKXH mod: changed path, added label_delim as per: https://forums.fast.ai/t/after-cleaning-imagedatabunch-from-csv-causes-errors/36598
np.random.seed(42)
data = ImageDataBunch.from_csv(path, folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
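
The cells above fix the random seed before asking for `valid_pct=0.2`, so that the same 20% of images end up in the validation set on every run. The idea can be illustrated with a plain-Python sketch. This is not fastai's implementation, and `split_indices` is a hypothetical name.

```python
import random

def split_indices(n_items, valid_pct=0.2, seed=42):
    """Shuffle item indices with a fixed seed and hold out valid_pct of them."""
    rng = random.Random(seed)       # a seeded generator gives the same shuffle every time
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    n_valid = int(n_items * valid_pct)
    return idxs[n_valid:], idxs[:n_valid]   # (train, valid)

train_a, valid_a = split_indices(100)
train_b, valid_b = split_indices(100)
print(len(train_a), len(valid_a))   # 80 20
print(valid_a == valid_b)           # True
```

Without a fixed seed, each run would hold out a different subset, which makes validation metrics harder to compare between runs.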

Let's take a look at some of our pictures.

In [None]:
data.classes

In [None]:
data.show_batch(rows=3, figsize=(7,8))

In [None]:
data.classes, data.c, len(data.train_ds), len(data.valid_ds)

## Cleaning up

There are probably images in our dataset that shouldn't be there, so use the `ImageCleaner` widget from `fastai.widgets` to weed out this noise.

In [5]:
from fastai.widgets import *

Note that the widget will not actually delete the image files but will create a `cleaned.csv` file from which a new `ImageDataBunch` with the correct labels can be created.
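
Once `cleaned.csv` exists, a quick way to check what survived the cleaning is to count the rows per label. The sketch below uses only the standard library and assumes the file has a header row with `name` and `label` columns. Check your own file, since that layout is an assumption here. `label_counts` is a hypothetical helper.

```python
import csv
from collections import Counter

def label_counts(csv_path, label_col='label'):
    """Count how many rows of the CSV carry each label."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        return Counter(row[label_col] for row in csv.DictReader(f))

# Example usage (path is a placeholder):
# print(label_counts('data/bokononisms/cleaned.csv'))
```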

In [None]:
ds, idxs = DatasetFormatter().

In [7]:
??DatasetFormatter