In [1]:
from fastai.vision import *

## Get a list of URLs

### Search and scroll

Go to [Google Images](http://images.google.com) and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.

Scroll down until you've seen all the images you want to download, or until you see a button that says 'Show more results'. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.

It is a good idea to put things you want to exclude into the search query. For instance, if you are searching for the Eurasian wolf, "canis lupus lupus", you might exclude other variants:

    "canis lupus lupus" -dog -arctos -familiaris -baileyi -occidentalis

You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown.
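The exclusion query shown above can also be assembled programmatically, which helps when you repeat the search for several classes. The sketch below is only an illustration: `build_query` is a hypothetical helper, not part of fastai or of any Google API.

```python
def build_query(phrase, exclude=()):
    """Return a search string: an exact phrase followed by excluded terms."""
    parts = [f'"{phrase}"']                       # quotes request an exact phrase
    parts.extend(f"-{term}" for term in exclude)  # a leading minus excludes a term
    return " ".join(parts)


query = build_query(
    "canis lupus lupus",
    exclude=["dog", "arctos", "familiaris", "baileyi", "occidentalis"],
)
print(query)
```

You would paste the printed string into the Google Images search box.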

### Download into file

Now you must run some JavaScript code in your browser which will save the URLs of all the images you want for your dataset.

In Google Chrome, press <kbd>Ctrl</kbd><kbd>Shift</kbd><kbd>j</kbd> on Windows/Linux or <kbd>Cmd</kbd><kbd>Opt</kbd><kbd>j</kbd> on macOS, and a small window, the JavaScript 'Console', will appear. In Firefox, press <kbd>Ctrl</kbd><kbd>Shift</kbd><kbd>k</kbd> on Windows/Linux or <kbd>Cmd</kbd><kbd>Opt</kbd><kbd>k</kbd> on macOS. That is where you will paste the JavaScript commands.

You will need to get the URL of each of the images. Before running the following commands, you may want to disable ad-blocking extensions (uBlock, AdBlockPlus, etc.) in Chrome; otherwise the `window.open()` command doesn't work. Then you can run the following commands:

```javascript
urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
```

### Scrape mountain names with bs4

In [11]:
import urllib.request

In [12]:
url = "https://en.wikipedia.org/wiki/List_of_mountains_of_Switzerland"
page = urllib.request.urlopen(url)

In [14]:
from bs4 import BeautifulSoup

In [15]:
# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")

In [18]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of mountains of Switzerland - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3f9507f0-5128-4229-bb96-8ecad5cfea7d","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_mountains_of_Switzerland","wgTitle":"List of mountains of Switzerland","wgCurRevisionId":948395209,"wgRevisionId":948395209,"wgArticleId":19717114,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Lists of coordinates","Geographic coordinate lists","Articl

In [20]:
all_tables=soup.find_all("table")
all_tables

[<table class="noprint infobox" id="GeoGroup" style="width: 23em; font-size: 88%; line-height: 1.5em">
 <tbody><tr>
 <td><b>Map all coordinates using:</b> <a class="external text" href="//tools.wmflabs.org/osm4wiki/cgi-bin/wiki/wiki-osm.pl?project=en&amp;article=List_of_mountains_of_Switzerland">OpenStreetMap</a> 
 </td></tr>
 <tr>
 <td><b>Download coordinates as:</b> <a class="external text" href="//tools.wmflabs.org/kmlexport?article=List_of_mountains_of_Switzerland">KML</a> <b>·</b> <a class="external text" href="http://tripgang.com/kml2gpx/http%3A%2F%2Ftools.wmflabs.org%2Fkmlexport%3Farticle%3DList_of_mountains_of_Switzerland?gpx=1" rel="nofollow">GPX</a>
 </td></tr></tbody></table>,
 <table class="wikitable sortable" style="margin: 1em auto 1em auto;">
 <caption>Distribution by height of the mountains with at least 300 metres of prominence
 </caption>
 <tbody><tr>
 <th>Canton</th>
 <th>-999m</th>
 <th>1000-<br/>1499m</th>
 <th>1500-<br/>1999m</th>
 <th>2000-<br/>2499m</th>
 <th>25

In [21]:
right_table=soup.find('table', class_='wikitable sortable')
right_table

<table class="wikitable sortable" style="margin: 1em auto 1em auto;">
<caption>Distribution by height of the mountains with at least 300 metres of prominence
</caption>
<tbody><tr>
<th>Canton</th>
<th>-999m</th>
<th>1000-<br/>1499m</th>
<th>1500-<br/>1999m</th>
<th>2000-<br/>2499m</th>
<th>2500-<br/>2999m</th>
<th>3000-<br/>3499m</th>
<th>3500-<br/>3999m</th>
<th>4000m+</th>
<th>Total</th>
<th>Summits/100 km<sup>2</sup>
</th></tr>
<tr>
<td align="left"><a class="mw-redirect" href="/wiki/Aargau" title="Aargau">Aargau</a>
</td>
<td align="right">1
</td>
<td align="right">0
</td>
<td align="right">0
</td>
<td align="right">0
</td>
<td align="right">0
</td>
<td align="right">0
</td>
<td align="right">0
</td>
<td align="right">0
</td>
<td align="right"><b>1</b>
</td>
<td align="right">0.07
</td></tr>
<tr>
<td align="left"><a class="mw-redirect" href="/wiki/Appenzell_Ausserrhoden" title="Appenzell Ausserrhoden">Appenzell A.</a>
</td>
<td align="right">0
</td>
<td align="right">2
</td>
<td al
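The cells above stop at locating a table. One way to pull the first-column names out of such a table is sketched below. To keep the example self-contained, it uses the standard library's `html.parser` instead of bs4, and it runs on a small inline sample modelled on the output above. `FirstColumnParser` and `first_column` are hypothetical names. Note that the table printed above lists cantons, so the values extracted here are canton names, not mountain names.

```python
from html.parser import HTMLParser


class FirstColumnParser(HTMLParser):
    """Collect the text of the first <td> cell of every table row."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._cell_index = -1        # position of the current <td> in its row
        self._in_first_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell_index = -1    # a new row starts, reset the cell counter
        elif tag == "td":
            self._cell_index += 1
            self._in_first_cell = self._cell_index == 0
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_first_cell:
            text = "".join(self._buffer).strip()
            if text:
                self.names.append(text)
            self._in_first_cell = False

    def handle_data(self, data):
        if self._in_first_cell:
            self._buffer.append(data)


def first_column(html):
    parser = FirstColumnParser()
    parser.feed(html)
    return parser.names


# Inline sample, shortened from the notebook output (header row uses <th>).
sample_html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Canton</th>
<th>Total</th>
</tr>
<tr>
<td align="left"><a href="/wiki/Aargau" title="Aargau">Aargau</a>
</td>
<td align="right">1
</td></tr>
<tr>
<td align="left"><a href="/wiki/Appenzell_Ausserrhoden" title="Appenzell Ausserrhoden">Appenzell A.</a>
</td>
<td align="right">0
</td></tr></tbody></table>
"""

names = first_column(sample_html)
print(names)
```

Header cells are skipped because they use `<th>`, so only data rows contribute a name.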

### Create directory and upload urls file into your server

Choose an appropriate name for your labeled images. You can run these steps multiple times to create different labels.

In [2]:
folder = 'matterhorn'
file = 'urls_matterhorn.csv'

In [3]:
path = Path('data/mountains')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

In [4]:
temp_csv_file = file[:-4] + '_2' + file[-4:]

with open(path/folder/file, 'r') as inp, open(path/folder/temp_csv_file, 'w') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        if row:
            writer.writerow(row)
      
os.remove(path/folder/file)
os.rename(path/folder/temp_csv_file, path/folder/file)

In [5]:
with open(path/folder/file, 'r') as fp:
    reader = csv.reader(fp)
    print('nb of urls: ', len(list(reader)))

nb of urls:  573
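Because the blank-row cleanup is repeated for every class, it may be convenient to wrap it in a function. The sketch below is one possible refactoring, not fastai functionality: `drop_empty_rows` is a hypothetical helper, it swaps the files with `os.replace` instead of the `os.remove`/`os.rename` pair used above, and the demo URLs are placeholders.

```python
import csv
import os
import tempfile
from pathlib import Path


def drop_empty_rows(csv_path):
    """Rewrite csv_path in place without blank rows; return the rows kept."""
    csv_path = Path(csv_path)
    if not csv_path.exists():
        # Fail early with a hint instead of a bare FileNotFoundError.
        raise FileNotFoundError(f"Upload {csv_path} before cleaning it")
    tmp_path = csv_path.with_name(csv_path.stem + "_2" + csv_path.suffix)
    kept = 0
    with open(csv_path, newline="") as inp, open(tmp_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in csv.reader(inp):
            if row:                  # csv.reader yields [] for a blank line
                writer.writerow(row)
                kept += 1
    os.replace(tmp_path, csv_path)   # swap the cleaned file into place
    return kept


# Demo on a throwaway file with one blank line between two placeholder URLs.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "urls_demo.csv"
    demo.write_text("http://example.com/a.jpg\n\nhttp://example.com/b.jpg\n")
    kept_rows = drop_empty_rows(demo)
    cleaned_text = demo.read_text()

print(kept_rows)
```

In the notebook you would call it as `drop_empty_rows(path/folder/file)` once the urls file has been uploaded.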


In [6]:
download_images(path/folder/file, dest, max_pics=1000)

In [7]:
import os
print('nb of downloaded images: ', len(os.listdir(path/folder))-2)
idx = len(os.listdir(path/folder))-2

nb of downloaded images:  573
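Subtracting 2 from the directory listing only works if the folder holds exactly two non-image entries. Counting by file extension is less brittle. The sketch below assumes a small set of image suffixes; both `IMAGE_SUFFIXES` and `count_images` are hypothetical names introduced for illustration.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust it to the formats you expect.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def count_images(folder):
    """Count regular files in folder whose suffix looks like an image."""
    return sum(
        1
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Demo on a throwaway folder: two images and one urls file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["00000000.jpg", "00000001.PNG", "urls_matterhorn.csv"]:
        (root / name).touch()
    n_images = count_images(root)

print(n_images)
```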


In [None]:
#from shutil import copyfile
#copyfile(path/folder/file, path/folder/'urls_matterhorn_2.csv')

In [8]:
folder = 'weisshorn'
file = 'urls_weisshorn.csv'

In [9]:
path = Path('data/mountains')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

In [10]:
temp_csv_file = file[:-4] + '_2' + file[-4:]
with open(path/folder/file, 'r') as inp, open(path/folder/temp_csv_file, 'w') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        if row:
            writer.writerow(row)

os.remove(path/folder/file)
os.rename(path/folder/temp_csv_file, path/folder/file)

FileNotFoundError: [Errno 2] No such file or directory: 'data/mountains/weisshorn/urls_weisshorn.csv'

In [None]:
download_images(path/folder/file, dest, max_pics=200)

In [None]:
folder = 'piz_bernina'
file = 'urls_piz_bernina.csv'

In [None]:
path = Path('data/mountains')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)

In [None]:
download_images(path/folder/file, dest, max_pics=200)

Finally, upload your urls file. You just need to press 'Upload' in your working directory and select your file, then click 'Upload' for each of the displayed files.

![uploaded file](images/download_images/upload.png)

## Download images

Now you will need to download your images from their respective urls.

fast.ai has a function that allows you to do just that. You just have to specify the urls filename as well as the destination folder, and this function will download and save all images that can be opened. Images that cannot be opened are not saved.

Let's download our images! Notice you can choose a maximum number of images to be downloaded. In this case we will not download all the urls.

You will need to run this line once for every category.
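Rather than editing `folder` and `file` by hand for each category, the folder and urls-file names can be derived from the class name. The sketch below only prepares the directories and paths. `class_targets` is a hypothetical helper, and the `urls_<class>.csv` naming is taken from the cells above.

```python
import tempfile
from pathlib import Path


def class_targets(root, classes):
    """Create one folder per class and return (url_file, dest) pairs."""
    targets = []
    for name in classes:
        dest = Path(root) / name
        dest.mkdir(parents=True, exist_ok=True)
        targets.append((dest / f"urls_{name}.csv", dest))
    return targets


# Demo in a throwaway directory standing in for data/mountains.
with tempfile.TemporaryDirectory() as tmp:
    targets = class_targets(tmp, ["matterhorn", "weisshorn", "piz_bernina"])
    created = [dest.is_dir() for _, dest in targets]
    url_names = [url_file.name for url_file, _ in targets]

print(url_names)
```

In the notebook, each pair would then be passed to `download_images(url_file, dest, max_pics=200)`, provided the corresponding urls file has been uploaded.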

In [None]:
classes = ['matterhorn','weisshorn','piz_bernina']

Then we can remove any images that can't be opened:

In [None]:
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)
    # check that urls actually lead to an image

## View data

In [None]:
path

In [None]:
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
# We have no separate train/valid/test folders, so 'train="."' says that the current folder contains the training data,
# and 'valid_pct=0.2' randomly sets aside 20% of it for validation.
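Fixing the random seed before building the data bunch is what keeps the validation set the same from run to run. The sketch below shows the idea of a seeded percentage split in plain Python. It is not fastai's implementation, and `split_by_pct` is a hypothetical name.

```python
import random


def split_by_pct(items, valid_pct=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed and split off valid_pct."""
    rng = random.Random(seed)        # private generator, so the split is repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(valid_pct * len(shuffled))
    return shuffled[cut:], shuffled[:cut]   # (train, valid)


files = [f"{i:08d}.jpg" for i in range(100)]
train_a, valid_a = split_by_pct(files)
train_b, valid_b = split_by_pct(files)

print(len(train_a), len(valid_a))
print(valid_a == valid_b)            # same seed, same validation set
```

Without a fixed seed, each run would validate on different images, which makes results from successive runs hard to compare.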

In [None]:
# If you already cleaned your data, run this cell instead of the one before
# np.random.seed(42)
# data = ImageDataBunch.from_csv(path, folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
#         ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)

Good! Let's take a look at some of our pictures then.

In [None]:
data.classes

In [None]:
data.show_batch(rows=3, figsize=(7,8))

In [None]:
print(data.classes)
print(data.c)
print(len(data.train_ds))
print(len(data.valid_ds))

## Train model

In [None]:
#learn = cnn_learner(data, models.resnet34, metrics=error_rate, pretrained=True)

In [None]:
learn = cnn_learner(data, models.resnet34, metrics=error_rate)

In [None]:
learn.fit_one_cycle(30)

In [None]:
learn.save('stage-1')

In [None]:
learn.unfreeze()

In [None]:
learn.lr_find()

In [None]:
# If the plot is not showing try to give a start and end learning rate
# learn.lr_find(start_lr=1e-5, end_lr=1e-1)
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(10, max_lr=slice(1e-5,3e-3))
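Passing a `slice` as `max_lr` is generally understood to give the earliest layers the lower rate and the last layers the higher one, with the rates in between spaced geometrically. The sketch below illustrates that spacing under those assumptions. `even_mults` here is a stand-alone reimplementation written for this example, and the number of layer groups (3) is assumed, not read from the model.

```python
def even_mults(start, stop, n):
    """Return n values from start to stop, evenly spaced on a log scale."""
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))   # constant ratio between neighbours
    return [start * step ** i for i in range(n)]


# Three layer groups is an assumption made for this illustration.
group_lrs = even_mults(1e-5, 3e-3, 3)
for lr in group_lrs:
    print(f"{lr:.2e}")
```

The middle value is the geometric mean of the two ends, so each group trains at a constant multiple of the previous group's rate.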

In [None]:
learn.save('stage-2')

## Interpretation

In [None]:
learn.load('stage-2');

In [None]:
interp = ClassificationInterpretation.from_learner(learn)

In [None]:
interp.plot_confusion_matrix()
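To read the plot, it helps to recall how a confusion matrix is built: each row is an actual class, each column a predicted class, and each cell counts how often that pair occurred. The sketch below computes one in plain Python from invented toy labels. It is not what `ClassificationInterpretation` does internally, and `confusion_matrix` is a hypothetical helper.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


classes = ["matterhorn", "piz_bernina", "weisshorn"]
# Invented toy labels, for illustration only.
actual = ["matterhorn", "matterhorn", "weisshorn", "piz_bernina", "weisshorn"]
predicted = ["matterhorn", "weisshorn", "weisshorn", "piz_bernina", "matterhorn"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(f"{name:12s} {row}")
```

Counts on the diagonal are correct predictions; everything off the diagonal is a confusion between two classes.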

## Cleaning Up

Some of our top losses aren't due to bad performance by our model. There are images in our data set that shouldn't be.

Using the `ImageCleaner` widget from `fastai.widgets` we can prune our top losses, removing photos that don't belong.

In [None]:
from fastai.widgets import *

First we need to get the file paths from our top_losses. We can do this with `.from_toplosses`. We then feed the top losses indexes and corresponding dataset to `ImageCleaner`.

Notice that the widget will not delete images directly from disk but it will create a new csv file `cleaned.csv` from where you can create a new ImageDataBunch with the corrected labels to continue training your model.

In order to clean the entire set of images, we need to create a new dataset without the split. The video lecture demonstrated the use of the `ds_type` param, which no longer has any effect. See [the thread](https://forums.fast.ai/t/duplicate-widget/30975/10) for more details.

In [None]:
db = (ImageList.from_folder(path)
                   .split_none()
                   .label_from_folder()
                   .transform(get_transforms(), size=224)
                   .databunch()
     )

In [None]:
# If you already cleaned your data using indexes from `from_toplosses`,
# run this cell instead of the one before to proceed with removing duplicates.
# Otherwise all the results of the previous step would be overwritten by
# the new run of `ImageCleaner`.

# db = (ImageList.from_csv(path, 'cleaned.csv', folder='.')
#                    .split_none()
#                    .label_from_df()
#                    .transform(get_transforms(), size=224)
#                    .databunch()
#      )

Then we create a new learner to use our new databunch with all the images.

In [None]:
learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)

learn_cln.load('stage-2');

In [None]:
ds, idxs = DatasetFormatter().from_toplosses(learn_cln)

Make sure you're running this notebook in Jupyter Notebook, not Jupyter Lab. That is accessible via [/tree](/tree), not [/lab](/lab). Running the `ImageCleaner` widget in Jupyter Lab is [not currently supported](https://github.com/fastai/fastai/issues/1539).

In [None]:
# Don't run this in google colab or any other instances running jupyter lab.
# If you do run this on Jupyter Lab, you need to restart your runtime and
# runtime state including all local variables will be lost.
ImageCleaner(ds, idxs, path)


If the code above shows only text output instead of a GUI with images and buttons rendered by widgets, the cause may be a configuration problem with ipywidgets. Try the solution in this [link](https://github.com/fastai/fastai/issues/1539#issuecomment-505999861) to solve it.


Flag photos for deletion by clicking 'Delete'. Then click 'Next Batch' to delete flagged photos and keep the rest in that row. `ImageCleaner` will show you a new row of images until there are no more to show. In this case, the widget will show you images until there are none left from `top_losses`.

You can also find duplicates in your dataset and delete them! To do this, you need to run `.from_similars` to get the potential duplicates' ids and then run `ImageCleaner` with `duplicates=True`. The API works in a similar way as with misclassified images: just choose the ones you want to delete and click 'Next Batch' until there are no more images left.

Make sure to recreate the databunch and `learn_cln` from the `cleaned.csv` file. Otherwise the file would be overwritten from scratch, losing all the results from cleaning the data from toplosses.

In [None]:
ds, idxs = DatasetFormatter().from_similars(learn_cln)

In [None]:
ImageCleaner(ds, idxs, path, duplicates=True)

Remember to recreate your ImageDataBunch from your `cleaned.csv` to include the changes you made in your data!

In [None]:
path/'cleaned.csv'

In [None]:
data_cln = ImageDataBunch.from_csv(path, folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
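Before retraining, it can be useful to check how many images per class survived the cleaning. The sketch below counts labels in a CSV, assuming `cleaned.csv` has a header row with `name` and `label` columns. That layout is an assumption, so check your own file first. The sample rows are invented, and `label_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file):
    """Count rows per label; assumes a header row with a 'label' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


# Invented sample standing in for cleaned.csv (assumed columns: name,label).
sample = io.StringIO(
    "name,label\n"
    "matterhorn/00000000.jpg,matterhorn\n"
    "matterhorn/00000001.jpg,matterhorn\n"
    "piz_bernina/00000008.jpg,piz_bernina\n"
)
counts = label_counts(sample)
print(counts)
```

On the real file you would pass an open file handle, for example `open(path/'cleaned.csv', newline='')`.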

## Putting your model in production

First things first, let's export the content of our `Learner` object for production:

In [None]:
learn.export()

This will create a file named 'export.pkl' in the directory where we were working that contains everything we need to deploy our model (the model, the weights but also some metadata like the classes or the transforms/normalization used).

You probably want to use CPU for inference, except at massive scale (and you almost certainly don't need to train in real-time). If you don't have a GPU that happens automatically. You can test your model on CPU like so:

In [None]:
defaults.device = torch.device('cpu')
print(defaults.device)

In [None]:
img = open_image(path/'piz_bernina'/'00000008.jpg')
img

We create our `Learner` in the production environment like this. Just make sure that `path` contains the file 'export.pkl' from before.

In [None]:
learn = load_learner(path)

In [None]:
pred_class,pred_idx,outputs = learn.predict(img)
pred_class.obj

So you might create a route something like this ([thanks](https://github.com/simonw/cougar-or-not) to Simon Willison for the structure of this code):

```python
@app.route("/classify-url", methods=["GET"])
async def classify_url(request):
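    # get_bytes is a helper from the linked project that fetches the image at the URL
    image_bytes = await get_bytes(request.query_params["url"])
    img = open_image(BytesIO(image_bytes))
    _,_,losses = learner.predict(img)
    return JSONResponse({
        "predictions": sorted(
            # use the same learner object for the class names and the prediction
            zip(learner.data.classes, map(float, losses)),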
            key=lambda p: p[1],
            reverse=True
        )
    })
```

(This example is for the [Starlette](https://www.starlette.io/) web app toolkit.)
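The sorting step inside the route can be tried in isolation, without a web server or a model. The sketch below pairs class names with scores and orders them from most to least likely. The scores are invented for illustration, and `rank_predictions` is a hypothetical helper.

```python
def rank_predictions(classes, probabilities):
    """Pair each class with its score and sort from highest to lowest."""
    return sorted(
        zip(classes, map(float, probabilities)),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Invented scores, for illustration only.
ranked = rank_predictions(
    ["matterhorn", "piz_bernina", "weisshorn"],
    [0.15, 0.80, 0.05],
)
print(ranked)
```

Converting each score with `float` matters in the route because the values coming from the model are tensor elements, which a JSON response cannot serialise directly.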