### Image Dataset - `TickVrsMite`

In this notebook we are going to scrape images of ticks and mites from [this site](https://www.insectimages.org/index.cfm) to build an image classification dataset.

### Installation

First we need to install `selenium` if it is not already installed.

In [4]:
# pip install selenium -q

Note: you may need to restart the kernel to use updated packages.


### Imports
Next we are going to import all the packages that we will use in this notebook for this task.

In [57]:
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
import pandas as pd
import requests
import tqdm
import random
import shutil

from concurrent.futures import ThreadPoolExecutor, as_completed
import multiprocessing

Next we are going to set the seed for random operations for reproducibility.

In [58]:
SEED = 23

random.seed(SEED)
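
To illustrate what the seed buys us, here is a small sketch (the helper `shuffled` is hypothetical and not part of the notebook): two shuffles driven by the same seed produce the same order. Note that seeding `random` only affects Python's built-in generator; libraries such as NumPy and pandas keep their own random state.

```python
import random

SEED = 23

def shuffled(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out

first = shuffled(range(10), SEED)
second = shuffled(range(10), SEED)
print(first == second)  # same seed, same order
```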

Next we are going to create an instance of a Chrome driver.

In [105]:
driver = webdriver.Chrome()
driver

<selenium.webdriver.chrome.webdriver.WebDriver (session="130c802a10357cde800f972a1ed01e32")>

We are then going to define the URL of the page that we want to get the images from.

In [106]:
urls_to_images = 'https://www.insectimages.org/browse/taxthumb.cfm?order=131'

In [107]:
driver.get(urls_to_images)

The site loads more images as the page is scrolled (infinite scrolling), so the page has to be scrolled automatically. The following function uses `selenium` to keep scrolling until the page height stops changing:

```py
def scroll_down(driver, scroll_pause_time=1.0):
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Jump to the bottom of the page and give new content time to load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause_time)
        new_height = driver.execute_script("return document.body.scrollHeight")
        # No growth means nothing more was loaded, so we can stop.
        if new_height == last_height:
            break
        last_height = new_height

scroll_down(driver, 20)
```

The scraping loop further down applies the same scroll step one page at a time.
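
The loop in `scroll_down` only ends when the page height stops changing, which can take a long time on a large gallery. A bounded variant is sketched below as one possible safeguard. The names `scroll_down_bounded`, `max_scrolls`, and `FakeDriver` are assumptions made for this example; `FakeDriver` merely imitates the single `execute_script` method used here so the sketch runs without a browser.

```python
import time

def scroll_down_bounded(driver, scroll_pause_time=1.0, max_scrolls=50):
    """Scroll until the height stops growing or max_scrolls is reached.

    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for scrolls_done in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause_time)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return scrolls_done
        last_height = new_height
    return max_scrolls


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver: the page grows three times."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 4000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            # Each scroll "loads" the next height until the list is exhausted.
            self.index = min(self.index + 1, len(self.heights) - 1)
            return None
        return self.heights[self.index]


scrolls = scroll_down_bounded(FakeDriver(), scroll_pause_time=0)
print(scrolls)
```

With a real driver the call would look like `scroll_down_bounded(driver, scroll_pause_time=20, max_scrolls=900)`, where both values are placeholders to tune.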


The following function scrapes a single page and returns a list of dictionaries, each holding the `image` URL and the `class` of the pictured specimen.

In [108]:
def scrape_page(soup):
    res = list()
    img_container = soup.find('div', {'id': 'imagecontainer'})
    for row in img_container.find_all('div', {'class': 'row vertical'}):
        for item in row.find_all('div', {'class': 'pointer item text-center well'}):
            try:
                img = f"https:{item.find('img')['src']}"
                class_ = item.find('div', {'class': 'img-foot'}).contents[0].split(' ')[-1].strip().lower()
                res.append({'class': class_, 'image': img})
            except Exception:
                pass
    return res
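
The label in `scrape_page` is taken as the last space-separated word of the image caption. The sketch below isolates that expression to show how the shape of a caption affects the result. The captions are made up for illustration, not copied from the site, and `last_word_label` is a hypothetical helper. Outcomes like these may explain labels such as `acari)` and the empty string that show up in the class counts later in this notebook.

```python
def last_word_label(caption):
    """Mirror the notebook's expression: take the last space-separated token."""
    return caption.split(' ')[-1].strip().lower()

# Hypothetical captions, chosen to show three kinds of outcome.
captions = [
    "American dog tick",        # clean label
    "mites (Order Acari)",      # trailing parenthesis survives
    "twospotted spider mite ",  # trailing space yields an empty token
]
labels = [last_word_label(c) for c in captions]
print(labels)  # ['tick', 'acari)', '']
```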
    

We are going to define the number of pages we want to scroll through and start scraping the images and their class names.

In [None]:
number_of_pages = 900
data = []
for page in range(number_of_pages):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    res = scrape_page(soup)
    data.extend(res)
    print(f"Got {len(res)} for page {page+1}...")


Got 24 for page 1...
Got 48 for page 2...
Got 72 for page 3...
Got 95 for page 4...
Got 119 for page 5...
Got 138 for page 6...
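
The counts above grow from one iteration to the next, most likely because each pass parses the entire page source again, so images found earlier are appended to `data` once more. The notebook removes these repeats later with `drop_duplicates`. As an alternative, here is a minimal sketch that keeps only unseen image URLs while scraping; the helper `add_new_items` and the sample snapshots are hypothetical.

```python
def add_new_items(unique_data, seen, page_items):
    """Append items whose image URL has not been seen; return how many were added."""
    added = 0
    for item in page_items:
        if item['image'] not in seen:
            seen.add(item['image'])
            unique_data.append(item)
            added += 1
    return added

# Two made-up snapshots of a growing page: the second repeats the first.
snapshot_1 = [{'class': 'tick', 'image': 'https://example.com/1.jpg'},
              {'class': 'mite', 'image': 'https://example.com/2.jpg'}]
snapshot_2 = snapshot_1 + [{'class': 'mite', 'image': 'https://example.com/3.jpg'}]

unique_data, seen = [], set()
for snapshot in (snapshot_1, snapshot_2):
    new_count = add_new_items(unique_data, seen, snapshot)
    print(f"Got {new_count} new items, {len(unique_data)} in total")
```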


We can check the first `2` examples in the scraped data.

In [64]:
data[:2]

[{'class': 'tick',
  'image': 'https://bugwoodcloud.org/images/384x256/5626438.jpg'},
 {'class': 'mite',
  'image': 'https://bugwoodcloud.org/images/384x256/5625783.jpg'}]

In [65]:
len(data)

349792

In the following code cell we are going to create a dataframe from the data that we have scraped.

In [66]:
paired = [list(r.values()) for r in data]

df = pd.DataFrame(paired, columns=['class', 'image'])
df.head()

|   | class | image |
|---|-------|-------|
| 0 | tick | https://bugwoodcloud.org/images/384x256/562643... |
| 1 | mite | https://bugwoodcloud.org/images/384x256/562578... |
| 2 |  | https://bugwoodcloud.org/images/384x256/562537... |
| 3 |  | https://bugwoodcloud.org/images/384x256/562537... |
| 4 |  | https://bugwoodcloud.org/images/384x256/562537... |


Let's first check how many examples each class has. The `rename_class` function that follows then renames `plural` class names to their `singular` form.

In [67]:
df['class'].value_counts()

class
mite           333379
mites            7109
tick             5705
grapes           1171
                 1019
chiggers          676
acari)            502
ticks             179
dermacentor        30
parasitoid         22
Name: count, dtype: int64

In [75]:
def rename_class(class_):
    # Map plural labels to singular; any other label is returned unchanged.
    obj = {'ticks': 'tick', 'mites': 'mite'}
    return obj.get(class_, class_)

df['class'] = df['class'].apply(rename_class)

In [76]:
df['class'].unique()

array(['tick', 'mite', '', 'chiggers', 'acari)', 'grapes', 'parasitoid',
       'dermacentor'], dtype=object)

Next we are going to drop duplicate rows based on the `image` column and keep only the `2` labels we are interested in, which are:

1. `tick`
2. `mite`

In [77]:
df_unique = df.drop_duplicates(subset='image')
df_unique['class'].value_counts()

class
mite           3739
tick            176
grapes           20
acari)           15
                  6
chiggers          4
dermacentor       3
parasitoid        1
Name: count, dtype: int64

In [78]:
df_filtered = df_unique[df_unique['class'].isin(['tick', 'mite'])]
df_filtered.head()

|   | class | image |
|---|-------|-------|
| 0 | tick | https://bugwoodcloud.org/images/384x256/562643... |
| 1 | mite | https://bugwoodcloud.org/images/384x256/562578... |
| 5 | tick | https://bugwoodcloud.org/images/384x256/561910... |
| 6 | tick | https://bugwoodcloud.org/images/384x256/561909... |
| 7 | tick | https://bugwoodcloud.org/images/384x256/561909... |


In [92]:
df_filtered.to_csv('pickle.csv', index=False)
print("Saved")

Saved


In [93]:
df_filtered = pd.read_csv('pickle.csv')

In [94]:
df_filtered['class'].value_counts()

class
mite    3739
tick     176
Name: count, dtype: int64

Next we are going to download these images into their respective directories based on their class names. Before we download them we want to make sure that the classes are balanced, so we undersample each class to the size of the smallest one.

In [95]:
min_size = df_filtered['class'].value_counts().min()
# random_state ties the sampling to SEED; random.seed() alone does not affect pandas.
df_balanced = df_filtered.groupby('class').sample(n=min_size, random_state=SEED).reset_index(drop=True)
df_balanced['class'].value_counts()

class
mite    176
tick    176
Name: count, dtype: int64
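
Balancing here means undersampling: every class is cut down to the size of the smallest one. For readers who want to see the idea without pandas, a minimal standard-library sketch follows. The function `undersample` and the toy rows are illustrative assumptions, not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, seed=23):
    """Randomly keep the same number of rows per class (the minority count)."""
    by_class = defaultdict(list)
    for class_, url in rows:
        by_class[class_].append((class_, url))
    min_size = min(len(items) for items in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, min_size))
    return balanced

# Toy data: 10 mites and 3 ticks.
rows = [('mite', f'mite_{i}.jpg') for i in range(10)] + \
       [('tick', f'tick_{i}.jpg') for i in range(3)]
balanced = undersample(rows)
print(Counter(class_ for class_, _ in balanced))
```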

In [96]:
base_dir = "dataset"

for class_ in ['tick', 'mite']:
    # makedirs also creates base_dir; exist_ok=True makes the cell safe to re-run.
    os.makedirs(os.path.join(base_dir, class_), exist_ok=True)


We are going to use the `ThreadPoolExecutor` from `concurrent.futures` to download and save the images concurrently using multiple threads. First let's check the number of CPUs on this computer, which we will use as the number of worker threads.

In [97]:
num_workers = multiprocessing.cpu_count()
print("CPUs: {}".format(num_workers))

CPUs: 12
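
Downloads spend most of their time waiting on the network, which is why threads (rather than processes) are sufficient here and why the pool size does not have to match the CPU count. The sketch below imitates that waiting with `time.sleep`; `fake_download` is a hypothetical stand-in, and the measured time will vary between machines.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(item_id):
    """Pretend to wait on the network, then return the id."""
    time.sleep(0.1)
    return item_id

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    # map() returns results in the order of the inputs.
    results = list(executor.map(fake_download, range(8)))
elapsed = time.perf_counter() - start

print(results)
print(f"8 simulated downloads took {elapsed:.2f}s")
```

Run one after another, the eight waits would add up to about 0.8 seconds; with eight threads they overlap.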


Next we convert the classes and image URLs from the pandas dataframe into a Python list of `(class, url)` tuples and shuffle it.

In [98]:
img_urls = list(zip(
    df_balanced['class'].values,
    df_balanced['image'].values,
))

random.shuffle(img_urls)

In [99]:
img_urls[:2]

[('tick', 'https://bugwoodcloud.org/images/384x256/5488679.jpg'),
 ('tick', 'https://bugwoodcloud.org/images/384x256/5574613.jpg')]

The following function saves a single image in its respective class folder.

In [100]:
skipped = list()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

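def save_image(row):
    """Download one image and save it under base_dir/<class>/<file name>."""
    class_, url = row
    try:
        image_name = url.split('/')[-1]
        # A timeout keeps a stalled request from blocking a worker thread forever.
        res = requests.get(url, headers=headers, timeout=30)
        if res.status_code == 200:
            save_name = os.path.join(base_dir, class_, image_name)
            with open(save_name, 'wb') as fp:
                fp.write(res.content)
        else:
            print("Failed to download the image: STATUS CODE {}".format(res.status_code))
    except Exception as e:
        # Record the URL so that failed downloads can be inspected afterwards.
        print(e)
        print("url skipped:", url)
        skipped.append(url)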

Next we are going to download the images from the urls.

In [101]:
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [executor.submit(save_image, i) for i in img_urls]
    for future in tqdm.tqdm(as_completed(futures), desc="downloading...", total=len(img_urls)):
        pass
print("Done!!")

downloading...: 100%|████████████████████████████████████████████████████████████████| 352/352 [03:51<00:00,  1.52it/s]

Done!!
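
The loop above only waits for the futures and ignores what they return. One possible variation reads each outcome through `future.result()` so that failures end up in one list. It is sketched here with a hypothetical `fetch` function and placeholder URLs, and it makes no network requests.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    """Hypothetical download step: fail for URLs containing 'bad'."""
    if 'bad' in url:
        raise ValueError(f"cannot fetch {url}")
    return url

urls = ['https://example.com/a.jpg', 'https://example.com/bad.jpg',
        'https://example.com/c.jpg']
failed = []
with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to its URL so failures can be reported by URL.
    futures = {executor.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):
        try:
            future.result()  # re-raises any exception from the worker
        except ValueError:
            failed.append(futures[future])

print(failed)
```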





After downloading these images, we want to split our dataset into `2` sets: `train` and `test`. So we are going to create the following folder structure in our `dataset` directory.


```
📁 dataset
    📁  train
      📁  class_1
         - 0.jpg
         - 1.jpg
         ....
      📁  class_2
         - 0.jpg
         - 1.jpg
         ....
      .....
    📁  test
      📁  class_1
         - 0.jpg
         - 1.jpg
         ....
      📁  class_2
         - 0.jpg
         - 1.jpg
         ....
    ....
```

In [102]:
for class_ in ["tick", "mite"]:
    train_dir = os.path.join(base_dir,'train', class_)
    test_dir = os.path.join(base_dir,'test', class_)
    if not os.path.exists(train_dir):
        os.makedirs(train_dir)
    if not os.path.exists(test_dir):
        os.makedirs(test_dir)

We will take the first `55` images of each class and move them to the test folder of the respective class; the remaining images go to the train folder of the respective class.

In [103]:
def move_to_final_destination(trg, class_: str, limit=None):
    """Move up to `limit` images (all of them if limit is None) from the class folder to trg."""
    class_dir = os.path.join(base_dir, class_)
    # Slicing with [:None] keeps the whole list, so one loop covers both cases.
    image_names = os.listdir(class_dir)[:limit]
    for image_name in tqdm.tqdm(image_names, desc=f"moving images to the {class_} folder..."):
        src = os.path.join(class_dir, image_name)
        shutil.move(src, trg)

for class_ in ['tick', 'mite']:
    train_dir = os.path.join(base_dir,'train', class_)
    test_dir = os.path.join(base_dir,'test', class_)
    move_to_final_destination(test_dir, class_, limit=55)
    move_to_final_destination(train_dir, class_, limit=None)


moving images to the tick folder...: 100%|████████████████████████████████████████████| 55/55 [00:00<00:00, 222.85it/s]
moving images to the tick folder...: 100%|██████████████████████████████████████████| 121/121 [00:00<00:00, 590.07it/s]
moving images to the mite folder...: 100%|████████████████████████████████████████████| 55/55 [00:00<00:00, 664.63it/s]
moving images to the mite folder...: 100%|██████████████████████████████████████████| 121/121 [00:00<00:00, 647.06it/s]
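
As a self-contained illustration of the same move-based split, the sketch below works in a temporary directory with empty placeholder files. The function `split_class` and the file names are assumptions made for the example. Sorting the file names first makes the selection repeatable, since `os.listdir` returns entries in arbitrary order.

```python
import os
import shutil
import tempfile

def split_class(base_dir, class_, test_size):
    """Move the first test_size files to test/<class_> and the rest to train/<class_>."""
    class_dir = os.path.join(base_dir, class_)
    names = sorted(os.listdir(class_dir))
    for subset, subset_names in (('test', names[:test_size]),
                                 ('train', names[test_size:])):
        target = os.path.join(base_dir, subset, class_)
        os.makedirs(target, exist_ok=True)
        for name in subset_names:
            shutil.move(os.path.join(class_dir, name), target)
    os.rmdir(class_dir)  # the class folder is empty once everything is moved

with tempfile.TemporaryDirectory() as tmp:
    # Create ten empty placeholder "images" for one class.
    os.makedirs(os.path.join(tmp, 'tick'))
    for i in range(10):
        open(os.path.join(tmp, 'tick', f'{i}.jpg'), 'wb').close()
    split_class(tmp, 'tick', test_size=3)
    n_test = len(os.listdir(os.path.join(tmp, 'test', 'tick')))
    n_train = len(os.listdir(os.path.join(tmp, 'train', 'tick')))

print(n_test, n_train)  # 3 7
```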


Next we are going to remove the now-empty `class` directories from the top level of the dataset folder and close the browser.

In [104]:
for class_ in ['tick', 'mite']:
    _dir = os.path.join(base_dir, class_)
    # The second positional argument of rmtree is ignore_errors, not a permission mode.
    shutil.rmtree(_dir, ignore_errors=True)

driver.quit()

### Refs
1. [github.com/CrispenGari](https://github.com/CrispenGari/web-scrapping-python/blob/main/selenium/selenium.ipynb)