### Image Scraping

In this notebook we scrape images of food recipes. The image URLs come from the recipe `.json` files, which can be found in [this gist](https://gist.github.com/CrispenGari/794a10de80b0bc3f5ff3a7b99ebb88de).


Let's first import all the required packages that we are going to use in this notebook.

In [31]:
import os
import requests
import uuid
import tqdm
import json
import multiprocessing
import shutil
import pandas as pd

from concurrent.futures import ThreadPoolExecutor, as_completed

Next we define the paths we load the recipe files from and save the images to. The recipe files are located in the `data` folder, and the images are saved in the `nutrients` folder under a file name taken from the last segment of each image URL.

In [32]:
data_path = 'data'
save_path = 'nutrients'

# Create the output folder if it does not exist yet.
os.makedirs(save_path, exist_ok=True)

assert os.path.exists(data_path), f"The path '{data_path}' does not exist."
assert os.path.exists(save_path), f"The path '{save_path}' does not exist."

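Next we load all the `json` files and collect the image URL and nutrient values of each recipe in a list. The helper below converts a nutrient string to a float and returns `0.0` when the value is missing or cannot be parsed.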

In [33]:
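def get_nutient_value(nutrients, col: str) -> float:
    # Strip the gram unit (e.g. '48g' -> '48') and convert to float.
    try:
        v = nutrients.get(col).replace('g', '')
        return float(v)
    except Exception:
        # Missing nutrients, missing key or unparsable value.
        return 0.0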
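
To see what `get_nutient_value` does with typical inputs, here is a small self-contained check. The `sample` dictionary is made up for illustration; it only assumes, as the code above does, that nutrient values are strings such as `'48g'`.

```python
def get_nutient_value(nutrients, col: str) -> float:
    # Same logic as the notebook cell: strip the unit and convert to float.
    try:
        v = nutrients.get(col).replace('g', '')
        return float(v)
    except Exception:
        return 0.0

# Hypothetical sample; the real values come from the recipe JSON files.
sample = {'carbs': '48g', 'salt': '1.2g', 'kcal': '246'}

carbs = get_nutient_value(sample, 'carbs')      # unit stripped
salt = get_nutient_value(sample, 'salt')        # decimals are kept
kcal = get_nutient_value(sample, 'kcal')        # no unit to strip
missing = get_nutient_value(sample, 'sugars')   # key is absent
no_data = get_nutient_value(None, 'carbs')      # recipe has no nutrients

print(carbs, salt, kcal, missing, no_data)
```

Note that a missing value and a genuine zero both come back as `0.0`, so the two cases cannot be told apart later on.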

In [34]:
columns = ['image', 'carbs', 'fat', 'fibre', 'kcal', 'protein', 'salt', 'saturates', 'sugars']
rows = []
image_urls = []

for file in tqdm.tqdm(os.listdir(data_path), desc="loading..."):
    with open(os.path.join(data_path, file)) as f:
        data = json.load(f)  # each file holds a list of recipes
    for recipe in data:
        nutrients = recipe.get('nutrients')
        # One float per nutrient column (every column after 'image').
        n = [get_nutient_value(nutrients, col) for col in columns[1:]]
        image_urls.append(recipe.get('image'))
        rows.append([recipe.get('image')] + n)


dataframe = pd.DataFrame(rows, columns=columns)
dataframe.head()

loading...: 100%|████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 24.92it/s]


|   | image | carbs | fat | fibre | kcal | protein | salt | saturates | sugars |
|---|-------|-------|-----|-------|------|---------|------|-----------|--------|
| 0 | https://images.immediate.co.uk/production/vola... | 48.0 | 2.0 | 2.0 | 246.0 | 8.0 | 1.2 | 0.0 | 1.0 |
| 1 | https://images.immediate.co.uk/production/vola... | 30.0 | 11.0 | 2.0 | 275.0 | 17.0 | 1.99 | 6.0 | 4.0 |
| 2 | https://images.immediate.co.uk/production/vola... | 38.0 | 7.0 | 2.0 | 240.0 | 8.0 | 1.37 | 2.0 | 1.0 |
| 3 | https://images.immediate.co.uk/production/vola... | 22.0 | 17.0 | 1.0 | 294.0 | 13.0 | 1.5 | 9.0 | 1.0 |
| 4 | https://images.immediate.co.uk/production/vola... | 32.0 | 9.0 | 2.0 | 250.0 | 12.0 | 1.0 | 5.0 | 3.0 |


The `save_image` function below takes a URL and saves the image in the `nutrients` folder, using the last segment of the URL as the file name. We also keep track of the images that were skipped, for example because of network issues, so that we can try to download them again.

In [36]:
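skipped = list()
def save_image(url):
    try:
        # Use the last segment of the URL as the file name.
        image_name = url.split('/')[-1]
        # The 30-second timeout is an arbitrary choice so a stalled request cannot hang a worker.
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # treat HTTP error responses as failures
        save_name = os.path.join(save_path, image_name)
        with open(save_name, 'wb') as fp:
            fp.write(response.content)
    except Exception:
        print("url skipped:", url)
        skipped.append(url)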
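
The file name is derived with `url.split('/')[-1]`, which works for URLs that end in the image name. As a hedged aside, the sketch below shows what happens if a URL carries a query string and one possible way to drop it. `filename_from_url` is a hypothetical helper, and the `example.com` URLs are placeholders, not URLs from the dataset.

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url: str) -> str:
    # Hypothetical helper: keep only the path part of the URL,
    # then take its last segment.
    return posixpath.basename(urlparse(url).path)

# Placeholder URLs for illustration only.
plain = "https://example.com/images/EasyBreadRolls-e4a7255.jpg"
with_query = plain + "?quality=90&resize=440,400"

naive_plain = plain.split('/')[-1]        # just the file name
naive_query = with_query.split('/')[-1]   # file name plus the query string
clean_query = filename_from_url(with_query)

print(naive_plain)
print(naive_query)
print(clean_query)
```

If the URLs in the recipe files never contain query strings, the simpler `split` used in `save_image` is enough.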

Next we download the images and save them in the `nutrients` folder. We use the `ThreadPoolExecutor` from `concurrent.futures` to download and save the images concurrently with multiple threads. First let's check the number of CPUs in this computer, which we use as the number of worker threads.

In [37]:
num_workers = multiprocessing.cpu_count()
print("CPUs: {}".format(num_workers))

CPUs: 12


Now we can download the images concurrently. One alternative is a `multiprocessing.Pool`:

```py
with multiprocessing.Pool(num_workers) as pool:
    for _ in tqdm.tqdm(pool.imap_unordered(save_image, image_urls), total=len(image_urls), desc="downloading..."):
        pass
print("Done!!")
```
However, this code only works reliably when it is wrapped in an `if __name__ == '__main__':` guard, so it suits Python script files better than notebooks. In this notebook we use the `ThreadPoolExecutor` instead:

In [38]:
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [executor.submit(save_image, i) for i in image_urls]
    for future in tqdm.tqdm(as_completed(futures), desc="downloading...", total=len(futures)):
        pass
print("Done!!")

downloading...: 100%|██████████████████████████████████████████████████████████████| 3017/3017 [10:18<00:00,  4.88it/s]

Done!!
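
The submit-and-wait pattern used above can be tried without any network access. The sketch below is an illustration under stated assumptions: `fake_download` is a stand-in for `save_image` that only sleeps, the URLs are placeholders, and `max_workers=10` is an arbitrary choice. Because downloading is mostly waiting on I/O, the number of threads does not have to match the CPU count.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(url: str) -> str:
    # Stand-in for save_image: sleep to imitate waiting on the network.
    time.sleep(0.05)
    return url

# Placeholder URLs for illustration only.
urls = [f"https://example.com/img-{i}.jpg" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(fake_download, u) for u in urls]
    # as_completed yields futures as they finish, not in submission order.
    done = [f.result() for f in as_completed(futures)]
elapsed = time.perf_counter() - start

print(f"{len(done)} simulated downloads in {elapsed:.2f}s")
```

Run one after another, the 20 simulated downloads would sleep for about one second in total; with the thread pool the waits overlap.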





Next we try to download the skipped images.

In [39]:
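# Copy and reset the list first: save_image appends to `skipped` on failure.
retry_urls = list(skipped)
skipped.clear()
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [executor.submit(save_image, i) for i in retry_urls]
    for future in tqdm.tqdm(as_completed(futures), total=len(retry_urls), desc="downloading..."):
        pass
print("Done!!")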

downloading...: 0it [00:00, ?it/s]

Done!!
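
In this run nothing was skipped, so the retry had nothing to do. When downloads do fail, a single retry pass may not be enough. The following is a minimal sketch of a bounded retry loop, not the notebook's method: `retry_failed`, `flaky_fetch` and `max_rounds` are hypothetical names, the loop is sequential for simplicity, and the failures are simulated.

```python
def retry_failed(fetch, urls, max_rounds=3):
    """Call fetch(url) for each URL, retrying failures for up to max_rounds passes."""
    pending = list(urls)
    for _ in range(max_rounds):
        failed = []
        for url in pending:
            try:
                fetch(url)
            except Exception:
                failed.append(url)
        pending = failed
        if not pending:
            break
    return pending  # URLs that still failed after the last pass

attempts = {}

def flaky_fetch(url):
    # Simulated downloader: 'b.jpg' fails once, 'c.jpg' always fails.
    attempts[url] = attempts.get(url, 0) + 1
    if url == "b.jpg" and attempts[url] < 2:
        raise ConnectionError("temporary failure")
    if url == "c.jpg":
        raise ConnectionError("permanent failure")

remaining = retry_failed(flaky_fetch, ["a.jpg", "b.jpg", "c.jpg"])
print(remaining)  # ['c.jpg']
```

Capping the number of passes keeps a permanently broken URL from being retried forever.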





In [42]:
# Replace each image URL with the file name the image was saved under.
dataframe.image = dataframe.image.apply(lambda url: url.split('/')[-1])

In [43]:
dataframe.head()

|   | image | carbs | fat | fibre | kcal | protein | salt | saturates | sugars |
|---|-------|-------|-----|-------|------|---------|------|-----------|--------|
| 0 | EasyBreadRolls-e4a7255.jpg | 48.0 | 2.0 | 2.0 | 246.0 | 8.0 | 1.2 | 0.0 | 1.0 |
| 1 | recipe-image-legacy-id-46013_11-99b8eda.jpg | 30.0 | 11.0 | 2.0 | 275.0 | 17.0 | 1.99 | 6.0 | 4.0 |
| 2 | recipe-image-legacy-id-743466_11-e87df17.jpg | 38.0 | 7.0 | 2.0 | 240.0 | 8.0 | 1.37 | 2.0 | 1.0 |
| 3 | recipe-image-legacy-id-1119465_11-4aebb21.jpg | 22.0 | 17.0 | 1.0 | 294.0 | 13.0 | 1.5 | 9.0 | 1.0 |
| 4 | recipe-image-legacy-id-1201816_10-7f0a38f.jpg | 32.0 | 9.0 | 2.0 | 250.0 | 12.0 | 1.0 | 5.0 | 3.0 |


Next we keep only the rows in which every nutrient value is non-zero.

In [47]:
columns_to_check = ['carbs', 'fat', 'fibre', 'kcal', 'protein', 'salt', 'saturates', 'sugars']
filtered_rows = dataframe[(dataframe[columns_to_check] != 0.0).all(axis=1)]
len(filtered_rows)

1258
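
The boolean mask used in the filter can be checked on a tiny frame. The data below is invented for illustration and uses only two nutrient columns; the expression is the same as in the cell above.

```python
import pandas as pd

# Made-up rows: 'b.jpg' has a zero in the 'carbs' column.
df = pd.DataFrame({
    'image': ['a.jpg', 'b.jpg', 'c.jpg'],
    'carbs': [48.0, 0.0, 30.0],
    'fat': [2.0, 5.0, 11.0],
})
cols = ['carbs', 'fat']

# Element-wise comparison gives a frame of booleans;
# .all(axis=1) is True only where every column in the row is non-zero.
mask = (df[cols] != 0.0).all(axis=1)
kept = df[mask]

print(kept['image'].tolist())  # ['a.jpg', 'c.jpg']
```

A single zero in any checked column is enough to drop the row, which is why the filter above keeps 1258 of the 3017 recipes.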

In [48]:
filtered_rows.reset_index().to_csv(os.path.join(save_path, 'data.csv'), index=False)
print('Done')

Done
