### Image Scraping

In this notebook we are going to scrape images of food recipes. We get the image URLs from the recipe `.json` files, which can be found in [my gist repository](https://gist.github.com/CrispenGari/794a10de80b0bc3f5ff3a7b99ebb88de).


Let's first import all the required packages that we are going to use in this notebook.

In [16]:
import os
import requests
import uuid
import tqdm
import json
import multiprocessing
import shutil

from concurrent.futures import ThreadPoolExecutor, as_completed

The next step is to define the path we load our recipe files from and the path we save our images to. Our recipe files are located in the `data` folder, and we are going to save our images in the `recipe_images` folder with file names generated by `uuid`.

In [3]:
data_path = 'data'
save_path = 'recipe_images'

# Create the output folder if it does not exist yet.
os.makedirs(save_path, exist_ok=True)

assert os.path.exists(data_path), f"The path '{data_path}' does not exist."
assert os.path.exists(save_path), f"The path '{save_path}' does not exist."

Next we are going to load all the `json` files and collect the image URL of each recipe in a list.

In [4]:
images_urls = list()
for file in tqdm.tqdm(os.listdir(data_path), desc="loading..."):
    with open(os.path.join(data_path, file)) as f:
        data = json.loads(f.read())
        for recipe in data:
            images_urls.append(recipe.get('image'))

loading...: 100%|████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 36.77it/s]
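
Some recipes may lack an `image` field, in which case `recipe.get('image')` returns `None`, and the same URL may appear more than once. A minimal sketch of a filtering step, assuming each recipe is a dict; the helper name `collect_image_urls` and the sample data are hypothetical and not part of the original notebook:

```python
def collect_image_urls(recipes):
    """Return the unique, non-empty image URLs from a list of recipe dicts."""
    seen = set()
    urls = []
    for recipe in recipes:
        url = recipe.get("image")
        # Skip recipes without an image and URLs that were already collected.
        if not url or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


# Hypothetical sample data, only for illustration.
sample = [
    {"name": "a", "image": "https://example.com/a.jpg"},
    {"name": "b"},  # no image field
    {"name": "c", "image": "https://example.com/a.jpg"},  # duplicate URL
]
print(collect_image_urls(sample))
```

Filtering before downloading keeps `None` values out of the worker threads, where they would otherwise only surface as skipped URLs.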


The `save_image` function below takes in a URL and saves the image in the `recipe_images` folder with a unique file name. We also keep track of the images that were skipped because of network issues so that we can try to save them again.

In [5]:
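skipped = list()
def save_image(url):
    try:
        image_name = f"{uuid.uuid4()}.{url.split('.')[-1]}"
        # A timeout keeps a stalled connection from blocking a worker forever
        # (30 seconds is an arbitrary choice).
        response = requests.get(url, timeout=30)
        # Treat HTTP errors (404, 500, ...) as failures instead of saving an error page.
        response.raise_for_status()
        save_name = os.path.join(save_path, image_name)
        with open(save_name, 'wb') as fp:
            fp.write(response.content)
    except Exception:
        # Remember the URL so that the download can be retried later.
        skipped.append(url)
        print("url skipped:", url)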
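
Taking the extension with `url.split('.')[-1]` works for the URLs in this dataset, but it would break on a URL that carries a query string or has no extension. A possible sketch of a more defensive naming helper, using only the standard library; the name `unique_image_name` and the `.jpg` fallback are assumptions made for illustration:

```python
import os
import uuid
from urllib.parse import urlparse


def unique_image_name(url, default_ext=".jpg"):
    """Build a uuid-based file name that keeps the URL's file extension."""
    # urlparse separates the path from any query string or fragment.
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lower()
    # Assumption: fall back to default_ext when the URL has no extension.
    return f"{uuid.uuid4()}{ext or default_ext}"


print(unique_image_name("https://example.com/img/cake.PNG?width=300"))
```

The printed name is different on every run because it is generated by `uuid.uuid4()`, but it always ends in `.png` for this URL.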

Next we are going to download the images and save them in the `recipe_images` folder. We use the `ThreadPoolExecutor` from `concurrent.futures`, which runs the downloads concurrently in multiple threads (not in multiple processes). First let's check the number of CPUs in this computer, which we use as the number of worker threads.

In [6]:
num_workers = multiprocessing.cpu_count()
print("CPUs: {}".format(num_workers))

CPUs: 12


Now we can download the images concurrently in the code cell below. Alternatively, we could use a process pool with the following code:

```py
with multiprocessing.Pool(num_workers) as pool:
    for _ in tqdm.tqdm(pool.imap_unordered(save_image, images_urls), total=len(images_urls), desc="downloading..."):
        pass
print("Done!!")
```
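However, this code generally only works when it is wrapped in `if __name__ == '__main__':` and `save_image` can be imported by the worker processes, which means it suits Python files better than notebooks. Each worker process would also append to its own copy of `skipped`, so the list in the notebook would stay empty.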

In [7]:
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    # Submit one download task per URL.
    futures = [executor.submit(save_image, url) for url in images_urls]
    # as_completed yields each future as soon as its download finishes.
    for future in tqdm.tqdm(as_completed(futures), total=len(futures), desc="downloading..."):
        pass

print("Done!!")

downloading...:   8%|████▉                                                          | 235/3017 [02:57<34:16,  1.35it/s]

url skipped: https://images.immediate.co.uk/production/volatile/sites/30/2020/08/fun-cake-9735009.jpg


downloading...:  17%|██████████▋                                                    | 512/3017 [06:17<22:39,  1.84it/s]

url skipped: https://images.immediate.co.uk/production/volatile/sites/30/2020/08/recipe-image-legacy-id-849607_11-aaaf1ea.jpg


downloading...:  24%|███████████████▏                                               | 727/3017 [08:55<27:42,  1.38it/s]

url skipped: https://images.immediate.co.uk/production/volatile/sites/30/2021/03/Sausage-pasta-bake-f71108a.jpg


downloading...:  49%|██████████████████████████████▍                               | 1479/3017 [17:58<10:49,  2.37it/s]

url skipped: https://images.immediate.co.uk/production/volatile/sites/30/2020/08/halloumi-with-lemony-lentil-salad-a57237c.jpg


downloading...:  63%|███████████████████████████████████████▏                      | 1904/3017 [22:58<13:31,  1.37it/s]

url skipped: https://images.immediate.co.uk/production/volatile/sites/30/2020/08/epic-summer-salad-000aded.jpg


downloading...:  63%|███████████████████████████████████████▏                      | 1906/3017 [22:59<11:59,  1.54it/s]

url skipped: https://images.immediate.co.uk/production/volatile/sites/30/2020/08/potato-salad-main-272de70.jpg


downloading...: 100%|██████████████████████████████████████████████████████████████| 3017/3017 [32:14<00:00,  1.56it/s]

Done!!
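
The cell above records failures in the global `skipped` list. An alternative is to let the task raise and read the failure from each future. A minimal sketch of that pattern follows; it assumes a task function that raises on failure (unlike `save_image`, which catches its own exceptions), and the names `run_all` and `fake_download` are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(func, items, max_workers=8):
    """Apply func to every item in a thread pool; return the items that raised."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its input so failures can be reported.
        future_to_item = {executor.submit(func, item): item for item in items}
        for future in as_completed(future_to_item):
            # exception() returns None when the task finished without raising.
            if future.exception() is not None:
                failed.append(future_to_item[future])
    return failed


def fake_download(url):
    # Stand-in for a real download: fails for URLs containing "bad".
    if "bad" in url:
        raise ConnectionError(url)


failed = run_all(fake_download, ["good-1", "bad-1", "good-2", "bad-2"])
print(sorted(failed))
```

Returning the failed items keeps the bookkeeping local to the call, so a retry does not depend on shared state.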





Next we try to download the skipped images.

In [8]:
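# Retry a snapshot of the failed URLs and clear the list first,
# so that afterwards it holds only the URLs that failed again.
retry_urls = list(skipped)
skipped.clear()
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [executor.submit(save_image, url) for url in retry_urls]
    for future in tqdm.tqdm(as_completed(futures), total=len(retry_urls), desc="downloading..."):
        pass
print("Done!!")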

downloading...:   0%|▏                                                                | 6/3017 [00:01<13:36,  3.69it/s]

Done!!
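
A single retry may not be enough when the network is unreliable. One way to bound the retries is sketched below, assuming a downloader that returns `True` on success and `False` on failure (which differs from `save_image` above); `retry_failed` and `flaky_download` are hypothetical names used only for illustration:

```python
def retry_failed(download, urls, max_rounds=3):
    """Retry download(url) for each URL, for at most max_rounds rounds.

    download is assumed to return True on success and False on failure.
    Returns the URLs that still fail after the last round.
    """
    remaining = list(urls)
    for _ in range(max_rounds):
        if not remaining:
            break
        # Keep only the URLs that failed again in this round.
        remaining = [url for url in remaining if not download(url)]
    return remaining


attempts = {}


def flaky_download(url):
    # Stand-in downloader: every URL succeeds on its second attempt,
    # except "dead", which never succeeds.
    attempts[url] = attempts.get(url, 0) + 1
    return url != "dead" and attempts[url] >= 2


print(retry_failed(flaky_download, ["a", "b", "dead"]))  # prints ['dead']
```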





Next we are going to create two folders inside `recipe_images`:

1. `train`
2. `test`

Then we split our images by moving `20%` of them into the `test` folder and the remaining images into the `train` folder.

In [20]:
train_path = os.path.join(save_path, 'train')
test_path = os.path.join(save_path, 'test')

if not os.path.exists(train_path):
    os.mkdir(train_path)
if not os.path.exists(test_path):
    os.mkdir(test_path)

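# Count only files, so that the new train/test folders are not included.
n_images = sum(os.path.isfile(os.path.join(save_path, name)) for name in os.listdir(save_path))
test_fraction = int(.20 * n_images)  # number of images in the test set
print("Test Fraction: ", test_fraction)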

Test Fraction:  603
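
Note that `os.listdir` returns entries in arbitrary order, so taking the first `20%` of its result gives a split that depends on the file system. If a reproducible random split is preferred, one option is to shuffle the file names with a fixed seed before slicing. A minimal sketch, where the helper name `split_names`, the seed, and the sample file names are assumptions:

```python
import random


def split_names(names, test_ratio=0.2, seed=42):
    """Shuffle file names reproducibly and split them into (train, test)."""
    # Sort first so the result does not depend on the order of os.listdir.
    names = sorted(names)
    random.Random(seed).shuffle(names)
    n_test = int(test_ratio * len(names))
    return names[n_test:], names[:n_test]


# Hypothetical file names, only for illustration.
files = [f"img_{i}.jpg" for i in range(10)]
train, test = split_names(files)
print(len(train), len(test))  # prints 8 2
```

The two returned lists could then be passed to a move step like the one used in this notebook.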


Next we are going to move the first `20%` of images into the test set.

In [21]:
def move(img, trg):
    src = os.path.join(save_path, img)
    if not os.path.isdir(src):
        shutil.move(src, trg)
    
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [executor.submit(move, img, test_path) for img in os.listdir(save_path)[:test_fraction]]
    # Use len(futures) as the total instead of listing the directory again while files are moving.
    for future in tqdm.tqdm(as_completed(futures), total=len(futures), desc="moving to test directory..."):
        pass
print("Done!!")
    

moving to test directory...: 100%|███████████████████████████████████████████████████| 603/603 [00:11<00:00, 50.56it/s]

Done!!





The rest of the images will be moved to the `train` folder.

In [23]:
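with ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [executor.submit(move, img, train_path) for img in os.listdir(save_path)]
    # Use len(futures) as the total: listing the directory again here would undercount,
    # because some files have already been moved by the time the total is computed.
    for future in tqdm.tqdm(as_completed(futures), total=len(futures), desc="moving to train directory..."):
        pass
print("Done!!")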
    

moving to train directory...: 2416it [00:54, 44.46it/s]                                                                

Done!!



