### Image Dataset - `BeetleVsWeevilDataset`

In this notebook we are going to scrape images from [this site](https://www.insectimages.org/index.cfm) to build an image classification dataset.

### Installation

First we need to install `selenium` if it is not already installed.

In [4]:
pip install selenium -q

Note: you may need to restart the kernel to use updated packages.


The following is the path to the Chrome driver that was downloaded from [this website](https://googlechromelabs.github.io/chrome-for-testing/), matching the version of Chrome that is installed on this machine.

In [2]:
driver_path = 'C:\\chrome-win64\\chrome.exe'

### Imports
Next we are going to import all the packages that we will use in this notebook for this task.

In [14]:
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
import pandas as pd
import requests
import tqdm
import random
import shutil

from concurrent.futures import ThreadPoolExecutor, as_completed
import multiprocessing

Next we are going to set the seed for random operations for reproducibility.

In [2]:
SEED = 23

random.seed(SEED)

Next we are going to create an instance of a Chrome driver.

In [82]:
driver = webdriver.Chrome()
driver

<selenium.webdriver.chrome.webdriver.WebDriver (session="437658d0d9911db920dbee2f86b4c745")>

We are then going to define the URL of the page that we want to get the insect images from.

In [83]:
urls_to_images ='https://www.insectimages.org/browse/taxthumb.cfm?order=39'

In [84]:
driver.get(urls_to_images)

The images on the website are loaded page by page as you scroll, so we need infinite scrolling. The function:

```py
def scroll_down(driver, scroll_pause_time=1.0):
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause_time)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
scroll_down(driver, 20)
```
is used to scroll the page automatically with `selenium` until no new content is loaded.
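
As a variation on the loop above, here is a minimal sketch that caps the number of scrolls so the loop cannot run forever on a very long page. The `max_rounds` parameter and the `FakeDriver` class are assumptions added for illustration; `FakeDriver` only mimics the two `execute_script` calls so that the loop can be exercised without a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded thumbnails time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was loaded, so we are at the bottom
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver (illustration only)."""

    def __init__(self, heights):
        self.heights = list(heights)
        self.position = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.position]
        # A scroll request: the page grows until the last height is reached.
        self.position = min(self.position + 1, len(self.heights) - 1)


fake = FakeDriver([1000, 2000, 3000])
rounds = scroll_to_bottom(fake, pause=0, max_rounds=10)
capped = scroll_to_bottom(FakeDriver([1, 2, 3, 4, 5]), pause=0, max_rounds=2)
```

With a real driver the call would look like `scroll_to_bottom(driver, pause=20, max_rounds=100)`; a suitable cap depends on how many pages you want to load.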


The following function scrapes a single page and returns a list of dictionaries with the `image` and `class` of each insect.

In [86]:
def scrape_page(soup):
    res = list()
    img_container = soup.find('div', {'id': 'imagecontainer'})
    for row in img_container.find_all('div', {'class': 'row vertical'}):
        for item in row.find_all('div', {'class': 'pointer item text-center well'}):
            try:
                img = f"https:{item.find('img')['src']}"
                class_ = item.find('div', {'class': 'img-foot'}).contents[0].split(' ')[-1].strip().lower()
                res.append({'class': class_, 'image': img})
            except Exception:
                pass
    return res
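
The class name above is taken as the last word of the image caption. The sketch below isolates that step so it can be checked on its own. The helper name `label_from_caption` and the sample captions are made up for illustration; it uses `split()` without an argument so that a trailing space does not produce an empty label.

```python
def label_from_caption(caption):
    """Return the last whitespace-separated word of a caption, lowercased."""
    words = caption.split()  # split() with no argument ignores extra spaces
    return words[-1].lower() if words else None


# Hypothetical captions, not taken from the website.
captions = ["Asian longhorned Beetle", "boll weevil ", "emerald ash borer", ""]
labels = [label_from_caption(c) for c in captions]
```

Because only the last word is used, a caption such as "leaf beetles" yields `beetles` rather than `beetle`, which is why labels like `beetles` appear in the scraped data.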

We are going to define the number of pages we want to scroll through and start scraping to get the images and their class names.

In [87]:
number_of_pages = 1663
data = []
for page in range(number_of_pages):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    res = scrape_page(soup)
    data.extend(res)
    print(f"Got {len(res)} for page {page+1}...")


Got 24 for page 1...
Got 48 for page 2...
Got 72 for page 3...
Got 96 for page 4...
Got 120 for page 5...
Got 144 for page 6...
Got 163 for page 7...
Got 187 for page 8...
Got 211 for page 9...
Got 230 for page 10...
Got 249 for page 11...
Got 268 for page 12...
Got 292 for page 13...
Got 316 for page 14...
Got 340 for page 15...
Got 364 for page 16...
Got 388 for page 17...
Got 412 for page 18...
Got 436 for page 19...
Got 460 for page 20...
Got 484 for page 21...
Got 496 for page 22...
Got 520 for page 23...
Got 544 for page 24...
Got 568 for page 25...
Got 592 for page 26...
Got 615 for page 27...
Got 639 for page 28...
Got 654 for page 29...
Got 677 for page 30...
Got 701 for page 31...
Got 725 for page 32...
Got 749 for page 33...
Got 773 for page 34...
Got 793 for page 35...
Got 816 for page 36...
Got 840 for page 37...
Got 864 for page 38...
Got 888 for page 39...
Got 912 for page 40...
Got 936 for page 41...
Got 960 for page 42...
Got 984 for page 43...
Got 1008 for page 44...


KeyboardInterrupt: 

We can check the first `2` examples in the scraped data.

In [89]:
data[:2]

[{'class': 'beetle',
  'image': 'https://bugwoodcloud.org/images/384x256/9009089.jpg'},
 {'class': 'borer',
  'image': 'https://bugwoodcloud.org/images/384x256/9009087.jpg'}]

In the following code cell we are going to create a dataframe based on the data that we have scraped.

In [91]:
paired = [list(r.values()) for r in data]

df = pd.DataFrame(paired, columns=['class', 'image'])
df.head()

| | class | image |
|---|---|---|
| 0 | beetle | https://bugwoodcloud.org/images/384x256/900908... |
| 1 | borer | https://bugwoodcloud.org/images/384x256/900908... |
| 2 | borer | https://bugwoodcloud.org/images/384x256/900908... |
| 3 | borer | https://bugwoodcloud.org/images/384x256/900908... |
| 4 | beetles | https://bugwoodcloud.org/images/384x256/900905... |


Next we are going to drop duplicate rows based on the `image` column and keep only `2` insect labels, which are:

1. `beetle`
2. `weevil`


In [105]:
df_unique = df.drop_duplicates(subset='image')
df_filtered = df_unique[df_unique['class'].isin(["beetle", "weevil"])]
df_filtered.head()

| | class | image |
|---|---|---|
| 0 | beetle | https://bugwoodcloud.org/images/384x256/900908... |
| 8 | beetle | https://bugwoodcloud.org/images/384x256/900904... |
| 9 | beetle | https://bugwoodcloud.org/images/384x256/900903... |
| 10 | beetle | https://bugwoodcloud.org/images/384x256/900903... |
| 11 | beetle | https://bugwoodcloud.org/images/384x256/900903... |
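
To make the two steps explicit, here is a rough pure-Python equivalent of dropping duplicates by image and then keeping only the wanted labels. The function name `unique_by_image` and the `example.com` URLs are placeholders for illustration, not part of the notebook.

```python
def unique_by_image(records, keep_classes):
    """Keep the first record per image URL, then keep only the wanted classes."""
    seen = set()
    result = []
    for rec in records:
        if rec["image"] in seen:
            continue  # a later duplicate of an image we already handled
        seen.add(rec["image"])
        if rec["class"] in keep_classes:
            result.append(rec)
    return result


# Placeholder records that mimic the shape of the scraped data.
records = [
    {"class": "beetle", "image": "https://example.com/1.jpg"},
    {"class": "borer", "image": "https://example.com/2.jpg"},
    {"class": "beetle", "image": "https://example.com/1.jpg"},
    {"class": "weevil", "image": "https://example.com/3.jpg"},
]
kept = unique_by_image(records, {"beetle", "weevil"})
```

Deduplication matters here because the scraping loop parses the whole page source on every pass, so images loaded earlier are collected again each time.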


In [107]:
df_filtered.to_csv('pickle.csv', index=False)
print("Saved")



Saved


In [3]:
df_filtered = pd.read_csv('pickle.csv')

In [4]:
df_filtered['class'].value_counts()

class
beetle    8270
weevil    1498
Name: count, dtype: int64

We can observe that we have `8270` unique `beetle` images and `1498` unique `weevil` images. Next we are going to download these images into their respective directories based on their class names. Before we download them we want to make sure that the classes are balanced.

In [5]:
min_size = df_filtered['class'].value_counts().min()
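# random_state makes the pandas sampling repeatable; random.seed alone does not affect it
df_balanced = df_filtered.groupby('class').apply(lambda x: x.sample(min_size, random_state=SEED)).reset_index(drop=True)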
df_balanced['class'].value_counts()

class
beetle    1498
weevil    1498
Name: count, dtype: int64
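
The balancing step above downsamples every class to the size of the smallest one. Below is a minimal standard-library sketch of the same idea, assuming the data is a list of `(class, url)` tuples; the function name `downsample_to_smallest` and the file names are illustrative only.

```python
import random
from collections import defaultdict


def downsample_to_smallest(rows, seed=23):
    """Sample every class down to the size of the smallest class."""
    by_class = defaultdict(list)
    for label, url in rows:
        by_class[label].append(url)
    min_size = min(len(urls) for urls in by_class.values())
    rng = random.Random(seed)  # a local generator keeps the result repeatable
    balanced = []
    for label in sorted(by_class):
        for url in rng.sample(by_class[label], min_size):
            balanced.append((label, url))
    return balanced


# Illustrative data: 10 beetle rows and 3 weevil rows.
rows = [("beetle", f"b{i}.jpg") for i in range(10)]
rows += [("weevil", f"w{i}.jpg") for i in range(3)]
balanced = downsample_to_smallest(rows)
```

Downsampling discards most of the larger class (here `8270 - 1498 = 6772` beetle images), which is the price paid for equal class sizes.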

In [6]:
base_dir = "dataset"
beetle_dir = os.path.join(base_dir, 'beetle')
weevil_dir = os.path.join(base_dir, 'weevil')

if not os.path.exists(base_dir):
    os.mkdir(base_dir)
    
if not os.path.exists(beetle_dir):
    os.mkdir(beetle_dir)

if not os.path.exists(weevil_dir):
    os.mkdir(weevil_dir)

We are going to use the `ThreadPoolExecutor` from `concurrent.futures` to download and save the images concurrently using multiple threads. First let's check the number of CPUs in this computer, which we will use as the number of worker threads.

In [7]:
num_workers = multiprocessing.cpu_count()
print("CPUs: {}".format(num_workers))

CPUs: 12
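
`ThreadPoolExecutor` runs tasks in threads rather than separate processes, which suits downloads because most of the time is spent waiting on the network. The sketch below shows the submit and `as_completed` pattern in isolation; `fake_download` is a stand-in that just sleeps, and the `example.com` URLs are placeholders.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(url):
    """Stand-in for a network request: wait briefly, then report the thread used."""
    time.sleep(0.05)
    return url, threading.current_thread().name


urls = [f"https://example.com/{i}.jpg" for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fake_download, u) for u in urls]
    # as_completed yields futures in the order they finish, not the order submitted.
    results = [future.result() for future in as_completed(futures)]

threads_used = {name for _, name in results}
```

Since the work is I/O-bound, the CPU count is only a convenient default for `max_workers`; a different value may work just as well.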


Next we convert the classes and image URLs from the pandas dataframe into a Python list of tuples.

In [8]:
img_urls = list(zip(
    df_balanced['class'].values,
    df_balanced['image'].values,
))

random.shuffle(img_urls)

In [9]:
img_urls[:2]

[('beetle', 'https://bugwoodcloud.org/images/384x256/5600013.jpg'),
 ('beetle', 'https://bugwoodcloud.org/images/384x256/5616726.jpg')]

The following function saves a single image of an insect in its respective folder.

In [10]:
skipped = list()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def save_image(row):
    class_, url = row
    try:
        image_name = f"{url.split('/')[-1]}"
        res = requests.get(url, headers=headers, timeout=30)  # avoid hanging on a stalled connection
        if res.status_code == 200:
            save_name = os.path.join(base_dir, class_, image_name)
            with open(save_name, 'wb') as fp:
                fp.write(res.content)
        else:
            print("Failed to download the image: STATUS CODE {}".format(res.status_code))
    except Exception as e:
        print(e)
        print("url skipped:", url)
        skipped.append(url)
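
The file name above comes from the last segment of the URL. As a small sketch of that step, the helper below derives the name with `urllib.parse` so that a query string would not end up in the file name, and writes the bytes into a per-class folder. The name `save_bytes`, the `?size=small` query string and the fake content are assumptions for illustration; no request is made.

```python
import os
import tempfile
from urllib.parse import urlparse


def save_bytes(class_, url, content, base_dir):
    """Write already-downloaded bytes to base_dir/class_/<file name from URL>."""
    image_name = os.path.basename(urlparse(url).path)  # drops any query string
    class_dir = os.path.join(base_dir, class_)
    os.makedirs(class_dir, exist_ok=True)
    save_name = os.path.join(class_dir, image_name)
    with open(save_name, "wb") as fp:
        fp.write(content)
    return save_name


with tempfile.TemporaryDirectory() as tmp:
    # The query string is hypothetical; the scraped URLs shown above have none.
    url = "https://bugwoodcloud.org/images/384x256/9009089.jpg?size=small"
    path = save_bytes("beetle", url, b"fake image bytes", tmp)
    saved_name = os.path.basename(path)
    saved_parent = os.path.basename(os.path.dirname(path))
    saved_ok = os.path.exists(path)
```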

Next we are going to download the images from the urls.

In [11]:
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [executor.submit(save_image, i) for i in img_urls]
    for future in tqdm.tqdm(as_completed(futures), desc="downloading...", total=len(img_urls)):
        pass
print("Done!!")

downloading...: 100%|██████████████████████████████████████████████████████████████| 2996/2996 [39:59<00:00,  1.25it/s]

Done!!





After downloading these images, we want to split our dataset into `2` sets, `train` and `test`. So we are going to create the following folder structure in our `dataset` directory.


```
📁 dataset
    📁  train
      📁  beetle
         - 0.jpg
         - 1.jpg
         ....
      📁  weevil
         - 0.jpg
         - 1.jpg
         ....
    📁  test
      📁  beetle
         - 0.jpg
         - 1.jpg
         ....
      📁  weevil
         - 0.jpg
         - 1.jpg
         ....
```

In [12]:
train_beetle_folder = os.path.join(base_dir, 'train', 'beetle')
train_weevil_folder = os.path.join(base_dir, 'train', 'weevil')
test_beetle_folder = os.path.join(base_dir, 'test', 'beetle')
test_weevil_folder = os.path.join(base_dir, 'test', 'weevil')

if not os.path.exists(train_beetle_folder):
    os.makedirs(train_beetle_folder)
if not os.path.exists(train_weevil_folder):
    os.makedirs(train_weevil_folder)

if not os.path.exists(test_beetle_folder):
    os.makedirs(test_beetle_folder)
if not os.path.exists(test_weevil_folder):
    os.makedirs(test_weevil_folder)

We will take the first `298` images of each class and put them in the test folder of the respective class; the rest will go to the train folder of the respective class.

In [16]:
def move_to_final_destination(trg, class_:str, limit=None):
    class_dir = os.path.join(base_dir, class_)
    total = len(os.listdir(class_dir)) if limit is None else limit
    if limit is None:
        for image_name in tqdm.tqdm(os.listdir(class_dir), total=total, desc=f"moving images to the {class_} folder..."):
            src = os.path.join(class_dir, image_name)
            shutil.move(src, trg)
    else:
        for image_name in tqdm.tqdm(os.listdir(class_dir)[:limit], total=total, desc=f"moving images to the {class_} folder..."):
            src = os.path.join(class_dir, image_name)
            shutil.move(src, trg)

move_to_final_destination(test_weevil_folder, 'weevil', limit=298)
move_to_final_destination(train_weevil_folder, 'weevil', limit=None)

move_to_final_destination(test_beetle_folder, 'beetle', limit=298)
move_to_final_destination(train_beetle_folder, 'beetle', limit=None)

moving images to the weevil folder...:   0%|                                                   | 0/298 [00:00<?, ?it/s]
moving images to the weevil folder...: 0it [00:00, ?it/s]
moving images to the beetle folder...:   0%|                                                   | 0/298 [00:00<?, ?it/s]
moving images to the beetle folder...: 0it [00:00, ?it/s]
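
If all `1498` images of a class were downloaded, moving `298` to the test folder leaves `1498 - 298 = 1200` for training, roughly an 80/20 split. The sketch below performs the same kind of split in a single pass on a temporary directory; the function name `split_class` and the dummy files are illustrative, and the listing is sorted so that the split is repeatable.

```python
import os
import shutil
import tempfile


def split_class(class_dir, test_dir, train_dir, n_test):
    """Move the first n_test files to test_dir and the rest to train_dir."""
    os.makedirs(test_dir, exist_ok=True)
    os.makedirs(train_dir, exist_ok=True)
    names = sorted(os.listdir(class_dir))  # sorted so the split is repeatable
    for i, name in enumerate(names):
        target = test_dir if i < n_test else train_dir
        shutil.move(os.path.join(class_dir, name), os.path.join(target, name))
    return min(n_test, len(names)), max(len(names) - n_test, 0)


with tempfile.TemporaryDirectory() as base:
    src = os.path.join(base, "weevil")
    os.makedirs(src)
    for i in range(10):  # ten dummy files standing in for downloaded images
        with open(os.path.join(src, f"{i}.jpg"), "wb") as fp:
            fp.write(b"x")
    test_dir = os.path.join(base, "test", "weevil")
    train_dir = os.path.join(base, "train", "weevil")
    counts = split_class(src, test_dir, train_dir, n_test=2)
    n_test_files = len(os.listdir(test_dir))
    n_train_files = len(os.listdir(train_dir))
    n_left = len(os.listdir(src))
```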


Next we are going to remove the `weevil` and `beetle` directories from the dataset.

In [19]:
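# The second positional argument of rmtree is ignore_errors, not a permission mode.
shutil.rmtree(weevil_dir, ignore_errors=True)
shutil.rmtree(beetle_dir, ignore_errors=True)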

### Refs
1. [github.com/CrispenGari](https://github.com/CrispenGari/web-scrapping-python/blob/main/selenium/selenium.ipynb)