# Scraping data
In this notebook we test different scraping options for collecting the images used to train our model.\
We'll try getting our images from Google Images, Bing, and DuckDuckGo.\
Our aim is to have about 2000 images per class.

In [1]:
from pathlib import Path
data_path = Path("../data")
plants_file = data_path / "plants_not_safe_for_cats.txt"

if plants_file.exists():
    with open(plants_file, "r") as f:
        plants_not_safe_for_cats = f.read().splitlines()
else:
    # Fall back to an empty list so later cells do not fail with a NameError
    plants_not_safe_for_cats = []
    print("File 'plants_not_safe_for_cats.txt' not found")


Our txt file lists plant names that are not safe for cats. Some plants also go by other names (aliases), which are given in parentheses.\
We want a dictionary that maps each plant name to a list of its aliases.

In [2]:
plants_not_safe_for_cats[10:15]

['Tiger lily (lilium lancifolium, tigrinum)',
 'Western or wood lily (lilium umbellatum)',
 'Senecio (daisy bush, Brachyglottis greyi)',
 'Sweet Pea (Lathyrus)',
 'Chrysanthemum']

In [2]:
dict_plants_not_safe_for_cats = {}
for line in plants_not_safe_for_cats:
    # Split "Name (alias1, alias2)" into the name and the text in parentheses
    parts = line.split("(")
    plant_name = parts[0].strip()
    # Empty string when the line has no parentheses
    alias_text = parts[1].strip().rstrip(")").strip() if len(parts) > 1 else ""
    aliases = [s.strip() for s in alias_text.split(",")] if alias_text else []

    if plant_name in dict_plants_not_safe_for_cats:
        # Keep the first entry and report the repeated name
        print(f"Duplicate key: {plant_name}")
    else:
        dict_plants_not_safe_for_cats[plant_name] = aliases


Duplicate key: Amaryllis
Duplicate key: Anthurium


In [3]:
len(dict_plants_not_safe_for_cats)

96

Our txt file lists 2 plants twice, which is why the dictionary has two fewer keys than the file has lines.
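The parsing cell keeps the first entry for a repeated name and ignores the aliases on the later line. If those aliases should be kept, one option is to merge them instead. Below is a minimal sketch of that idea; `parse_plant_lines` and the sample lines are hypothetical and only illustrate the approach, they are not taken from the real file.

```python
def parse_plant_lines(lines):
    """Parse 'Name (alias1, alias2)' lines, merging aliases of repeated names."""
    plants = {}
    for line in lines:
        name, _, alias_text = line.partition("(")
        name = name.strip()
        if not name:
            continue  # skip blank lines
        alias_text = alias_text.strip().rstrip(")").strip()
        aliases = [a.strip() for a in alias_text.split(",") if a.strip()]
        # Reuse the existing alias list for a repeated name, or start a new one
        merged = plants.setdefault(name, [])
        for alias in aliases:
            if alias not in merged:
                merged.append(alias)
    return plants


# Hypothetical sample lines; "example alias" is a made-up placeholder
sample_lines = [
    "Amaryllis (Hippeastrum)",
    "Chrysanthemum",
    "Amaryllis (example alias, Hippeastrum)",
]
merged_plants = parse_plant_lines(sample_lines)
print(merged_plants)
```

With this variant the number of keys stays the same as before; only the alias lists of repeated names can grow.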

In [5]:
dict_plants_not_safe_for_cats

{'Asiatic lily': ['Lilium x asiatica'],
 'Convallaria': ['Lily of the Valley'],
 'Easter lily': ['lilium longiforum'],
 'Japanese showy lily': ['lilium hydridum'],
 'Madonna lily': ['lilium candidum'],
 'Roselily': [],
 'Royal lily': ['lilium regale'],
 'Rubrum lily': ['lilium rubrum'],
 'Star of Bethlehem': ['Ornithogalum'],
 'Stargazer lily': ['lilium orientalis'],
 'Tiger lily': ['lilium lancifolium', 'tigrinum'],
 'Western or wood lily': ['lilium umbellatum'],
 'Senecio': ['daisy bush', 'Brachyglottis greyi'],
 'Sweet Pea': ['Lathyrus'],
 'Chrysanthemum': [],
 'Delphinium': ['larkspur'],
 'Achillea': ['yarrow', 'milfoil', 'carpenter’s weed'],
 'Allium': ['ornamental onion', 'garlic', 'lily leek'],
 'Alstroemeria': ['Peruvian lily', 'Lily of Incas'],
 'Amaryllis': ['Hippeastrum'],
 'Ammi': ['Queen Anne’s Lace', 'bullwort', 'common bishop’s weed'],
 'Anemone': ['windflower'],
 'Anthurium': ['flamingo flower'],
 'Bird of Paradise': ['Strelitzia', 'Crane flower'],
 'Broom': ['Cytisus']

In [6]:
# Loop through the dictionary
for key, value in dict_plants_not_safe_for_cats.items():
    print(f"{key}: {value}")

Asiatic lily: ['Lilium x asiatica']
Convallaria: ['Lily of the Valley']
Easter lily: ['lilium longiforum']
Japanese showy lily: ['lilium hydridum']
Madonna lily: ['lilium candidum']
Roselily: []
Royal lily: ['lilium regale']
Rubrum lily: ['lilium rubrum']
Star of Bethlehem: ['Ornithogalum']
Stargazer lily: ['lilium orientalis']
Tiger lily: ['lilium lancifolium', 'tigrinum']
Western or wood lily: ['lilium umbellatum']
Senecio: ['daisy bush', 'Brachyglottis greyi']
Sweet Pea: ['Lathyrus']
Chrysanthemum: []
Delphinium: ['larkspur']
Achillea: ['yarrow', 'milfoil', 'carpenter’s weed']
Allium: ['ornamental onion', 'garlic', 'lily leek']
Alstroemeria: ['Peruvian lily', 'Lily of Incas']
Amaryllis: ['Hippeastrum']
Ammi: ['Queen Anne’s Lace', 'bullwort', 'common bishop’s weed']
Anemone: ['windflower']
Anthurium: ['flamingo flower']
Bird of Paradise: ['Strelitzia', 'Crane flower']
Broom: ['Cytisus']
Bupleurum: ['Bulpleurum rotundifolium Griffithii', 'hare’s ear']
Calla Lily: ['Zantedeschia', 'tru

In [5]:
# Count the total number of search terms (plant names plus aliases)
total = 0
for key, value in dict_plants_not_safe_for_cats.items():
    total += 1 + len(value)
total

245
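The same count can be written with `sum`, and flattening the dictionary into (folder name, search term) pairs gives one entry per download call. A small sketch on three entries copied from the dictionary output above; `sample` and `search_terms` are names introduced here for illustration only.

```python
# Three entries copied from the dictionary shown earlier
sample = {
    "Tiger lily": ["lilium lancifolium", "tigrinum"],
    "Senecio": ["daisy bush", "Brachyglottis greyi"],
    "Chrysanthemum": [],
}

# One search term for the main name plus one per alias
total_terms = sum(1 + len(aliases) for aliases in sample.values())

# Flatten into (folder name, search term) pairs
search_terms = [
    (plant_name, term)
    for plant_name, aliases in sample.items()
    for term in [plant_name] + aliases
]

print(total_terms)
print(search_terms[:3])
```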

## Image scraping
Now that we have a dictionary of plants that are not safe for cats and their aliases, we can start getting some data!

In [7]:
# Bing images
from bing_image_downloader import downloader
from tqdm import tqdm

# Download images for each plant
for plant_name, aliases in tqdm(dict_plants_not_safe_for_cats.items()):
    for plant in [plant_name] + aliases:
        downloader.download(
            plant,
            limit=3,
            output_dir=data_path / "bing_images" / plant_name,
            adult_filter_off=False,
            force_replace=False,
            timeout=60,
            verbose=True
        )

  0%|          | 0/96 [00:00<?, ?it/s]

[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Asiatic lily


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] Downloading Image #1 from https://trulyexperiences.com/blog/wp-content/uploads/2020/12/asiatic_lily_flower-2048x1638.jpg
[%] File Downloaded !

[%] Downloading Image #2 from http://www.whatgrowsthere.com/grow/wp-content/uploads/2016/07/Lilium-2.jpg
[%] File Downloaded !

[%] Downloading Image #3 from https://www.whiteflowerfarm.com/mas_assets/cache/image/5/1/a/e/20910.Jpg
[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Asiatic lily


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] Downloading Image #1 from https://static.wixstatic.com/media/6a8e04_d8a542de1d354478b331155a55ae6087~mv2.jpe/v1/fill/w_3648,h_2432,al_c,q_90/Asiatic_Lilium.jpe
[%] File Downloaded !

[%] Downloading Image #2 from https://live.stati

  1%|          | 1/96 [00:03<06:00,  3.79s/it]

[%] File Downloaded !

[%] Downloading Image #3 from https://live.staticflickr.com/7496/15665723193_432c2d77fe_b.jpg
[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Convallaria


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] Downloading Image #1 from https://img.crocdn.co.uk/images/products2/pl/20/00/01/80/pl2000018093.jpg?width=940&amp;height=940
[%] File Downloaded !

[%] Downloading Image #2 from https://i.etsystatic.com/27541807/r/il/5eb03b/2911654658/il_1588xN.2911654658_2c7i.jpg
[%] File Downloaded !

[%] Downloading Image #3 from https://cdn11.bigcommerce.com/s-1b9100svju/images/stencil/2560w/products/1969/1368/DETA3-384__54459.1624568575.jpg?c=1
[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Convallaria


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%

  2%|▏         | 2/96 [00:09<07:19,  4.68s/it]

[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Easter lily


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] Downloading Image #1 from https://upload.wikimedia.org/wikipedia/commons/d/dd/Lilium_longiflorum_(Easter_Lily).JPG
[%] File Downloaded !

[%] Downloading Image #2 from https://www.thespruce.com/thmb/OYZsYHWeY53t4jt33bYY6MKHifk=/5000x3300/filters:fill(auto,1)/Easter-lily-bloom-big-56a586f43df78cf77288b21c.jpg
[%] File Downloaded !

[%] Downloading Image #3 from https://2.bp.blogspot.com/-_ygnUpGfk4U/VvvLsvMLLAI/AAAAAAAAPYA/ObV4FZiB5loWxGaQZiIiYMzRhY4xbHaaQ/s1600/Easter%2BLily%2B1.jpg
[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Easter lily


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] Downloading Image #1 from https://worldoffloweringplants.com/wp-

  3%|▎         | 3/96 [00:11<05:27,  3.52s/it]

[%] File Downloaded !

[%] Downloading Image #3 from https://www.picturethisai.com/wiki-image/1080/DB4E9D5F6EF247BC8454B884EA28D9D1.jpeg
[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Japanese showy lily


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] Downloading Image #1 from https://apps.rhs.org.uk/plantselectorimages/detail/RHS_WSYD0017871_7586.JPG
[%] File Downloaded !

[%] Downloading Image #2 from https://trulyexperiences.com/blog/wp-content/uploads/2020/12/asiatic_lily_flower-2048x1638.jpg
[%] File Downloaded !

[%] Downloading Image #3 from https://images.fineartamerica.com/images-medium-large-5/showy-stargazer-lily-elisabeth-ann.jpg
[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Japanese showy lily


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] D

  4%|▍         | 4/96 [01:15<41:53, 27.32s/it]

[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Madonna lily


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] Downloading Image #1 from https://www.gardeningknowhow.com/wp-content/uploads/2020/10/madonna-lily.jpg
[!] Issue getting: https://www.gardeningknowhow.com/wp-content/uploads/2020/10/madonna-lily.jpg
[!] Error:: HTTP Error 500: Internal Server Error
[%] Downloading Image #1 from http://www.hartsnursery.co.uk/images/D/N1903258_80.jpg
[%] File Downloaded !

[%] Downloading Image #2 from https://trulyexperiences.com/blog/wp-content/uploads/2020/12/AdobeStock_288897634-scaled.jpeg
[%] File Downloaded !



[!!]Indexing page: 2

[%] Indexed 35 Images on Page 2.


[%] Downloading Image #3 from https://www.thespruce.com/thmb/YR9tUaKejBprtPMfKrcMezSIhWY=/2250x0/filters:no_upscale():max_bytes(150000):strip_icc()/growing-madonna-lily-lilium-candidum-5100935-02-ca918293c7b34122

  5%|▌         | 5/96 [01:31<35:26, 23.36s/it]

[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Roselily


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] Downloading Image #1 from https://cdn.shopify.com/s/files/1/1419/7120/products/Roselily_Isabella.FH.jpg?v=1571439616
[%] File Downloaded !

[%] Downloading Image #2 from https://3.bp.blogspot.com/-Uggf6iK0NH8/VdTtIv9C__I/AAAAAAAAFeg/PBDIdWGPaU4/s1600/roselily%2Bbelonica%2B.JPG
[%] File Downloaded !

[%] Downloading Image #3 from https://s3.amazonaws.com/cdn.brecks.com/images/800/62828.jpg


  6%|▋         | 6/96 [01:33<24:11, 16.13s/it]

[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Royal lily


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] Downloading Image #1 from https://3.bp.blogspot.com/-aifkMvyJ3Bc/Vc5cpDoyVwI/AAAAAAAAFcw/Y2KzpoOgcio/s1600/white%2Broyal%2Blily.JPG
[%] File Downloaded !

[%] Downloading Image #2 from https://www.gardenmandy.com/wp-content/uploads/2020/09/Royal-Lily-Flower.jpg
[!] Issue getting: https://www.gardenmandy.com/wp-content/uploads/2020/09/Royal-Lily-Flower.jpg
[!] Error:: Remote end closed connection without response
[%] Downloading Image #2 from https://1.bp.blogspot.com/-_fECwgtZN0g/WBpJcXa1N-I/AAAAAAAAIuU/2iO6lKMLwpIMuUvRJLvNqsjJ4kiCwTpGACLcB/s1600/Royal%2BSunset.jpg
[%] File Downloaded !



[!!]Indexing page: 2

[%] Indexed 35 Images on Page 2.


[%] Downloading Image #3 from https://www.gardenmandy.com/wp-content/uploads/2020/09/Royal-Lily-Photo.jpg
[!] Issue getting

  7%|▋         | 7/96 [01:41<19:50, 13.37s/it]

[%] File Downloaded !



[%] Done. Downloaded 3 images.
[%] Downloading Images to /home/kaka/repo/plants-toxic-for-cats/notebooks/../data/bing_images/Rubrum lily


[!!]Indexing page: 1

[%] Indexed 3 Images on Page 1.


[%] Downloading Image #1 from https://i.pinimg.com/originals/dd/9b/53/dd9b531050a0a2ce924812b80ec63479.jpg
[%] File Downloaded !

[%] Downloading Image #2 from https://worldoffloweringplants.com/wp-content/uploads/2017/12/Lilium-speciosum-var.-rubrum-Japanese-Lily3.jpg
[%] File Downloaded !

[%] Downloading Image #3 from https://worldoffloweringplants.com/wp-content/uploads/2017/12/Lilium-speciosum-var.-rubrum-Japanese-Lily1.jpg


  7%|▋         | 7/96 [01:42<21:48, 14.70s/it]


KeyboardInterrupt: 

I have modified the bing_image_downloader package so that it downloads images straight into `output_dir` instead of creating a subdirectory for each search term. Each image is named with the search term and a number.
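To make the resulting layout concrete, here is a rough sketch of how such a flat file name could be built. The helper `image_path`, the `<search term>_<number>` pattern, and the `.jpg` default are assumptions for illustration; the modified package may use a different pattern.

```python
from pathlib import Path


def image_path(output_dir, search_term, index, suffix=".jpg"):
    """Build <output_dir>/<search term>_<index><suffix> (assumed naming pattern)."""
    return Path(output_dir) / f"{search_term}_{index}{suffix}"


# Images for an alias land in the folder of the main plant name
path = image_path("../data/bing_images/Tiger lily", "lilium lancifolium", 1)
print(path.parent.name)
print(path.name)
```

Because the search term is part of the file name, images found under different aliases can share one plant folder without overwriting each other.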

Scraping images from Google Images and DuckDuckGo turned out to be harder than it used to be, so for now we only scrape from Bing. The rest of our data will come from the Pl@ntNet-300K image dataset.