### Image Dataset - `Grasshoppers, Crickets, and Katydids`

In this notebook we are going to scrape images from [this site](https://www.insectimages.org/index.cfm) to build an image classification dataset.

### Installation

First we need to install `selenium` if it is not already installed.

In [4]:
# pip install selenium -q

Note: you may need to restart the kernel to use updated packages.


### Imports
Next we are going to import all the packages that we will use in this notebook for this task.

In [1]:
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
import pandas as pd
import requests
import tqdm
import random
import shutil

from concurrent.futures import ThreadPoolExecutor, as_completed
import multiprocessing

Next we are going to set the seed for random operations for reproducibility.

In [2]:
SEED = 23

random.seed(SEED)
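
Note that `random.seed` only affects Python's built-in `random` module; NumPy and pandas keep their own random state. As a small illustration (the helper name `shuffled_copy` is hypothetical and not part of the original notebook), a dedicated `random.Random` instance makes a shuffle repeatable without touching global state:

```python
import random

SEED = 23


def shuffled_copy(items, seed=SEED):
    """Return a shuffled copy of `items` that is identical for a given seed."""
    rng = random.Random(seed)  # private generator, independent of random.seed()
    result = list(items)
    rng.shuffle(result)
    return result


first = shuffled_copy(range(10))
second = shuffled_copy(range(10))
print(first == second)  # True: same seed, same order
```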

Next we are going to create an instance of a Chrome driver.

In [3]:
driver = webdriver.Chrome()
driver

<selenium.webdriver.chrome.webdriver.WebDriver (session="b7e62b998fc3dd857cd74fc0a8d2785c")>

We then define the URL of the page that we want to get the insect images from.

In [4]:
urls_to_images = 'https://www.insectimages.org/browse/taxthumb.cfm?order=159'

In [5]:
driver.get(urls_to_images)

On the website the images load in batches as the page is scrolled, so we need to do infinite scrolling. The following function is used to scroll the page automatically with `selenium` until the page height stops changing:

```py
def scroll_down(driver, scroll_pause_time=1.0):
    # Height of the page before the first scroll.
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Jump to the bottom and wait for new content to load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause_time)
        new_height = driver.execute_script("return document.body.scrollHeight")
        # Stop once scrolling no longer makes the page taller.
        if new_height == last_height:
            break
        last_height = new_height

scroll_down(driver, 20)
```
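
A loop that only stops when the page height stops changing can run for a long time on a long feed. As a hedged sketch, the variant below adds an upper bound on the number of scrolls. The `max_rounds` parameter and the `FakeDriver` class are assumptions introduced here so the example runs without a browser; a real Selenium driver exposes the same `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom until the height stops growing or max_rounds is hit.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more thumbnails
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Hypothetical stand-in for a driver: the height follows a fixed list, then stops growing."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._index]
        # A scroll request "loads" the next chunk, if any is left.
        self._index = min(self._index + 1, len(self._heights) - 1)
        return None


rounds = scroll_until_stable(FakeDriver([1000, 2000, 3000]), pause=0)
print(rounds)
```

With a real driver, a call such as `scroll_until_stable(driver, pause=20, max_rounds=100)` would play the same role as `scroll_down(driver, 20)`, with a guaranteed exit.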


The following function scrapes a single page and returns a list of dictionaries with the `image` URL and the `class` of each insect.

In [7]:
def scrape_page(soup):
    res = list()
    img_container = soup.find('div', {'id': 'imagecontainer'})
    for row in img_container.find_all('div', {'class': 'row vertical'}):
        for item in row.find_all('div', {'class': 'pointer item text-center well'}):
            try:
                img = f"https:{item.find('img')['src']}"
                class_ = item.find('div', {'class': 'img-foot'}).contents[0].split(' ')[-1].strip().lower()
                res.append({'class': class_, 'image': img})
            except Exception:
                pass
    return res
    

We are going to define the number of pages we want to scroll and start scraping to get the images and their class names.

In [19]:
number_of_pages = 100
data = []
for page in range(number_of_pages):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    res = scrape_page(soup)
    data.extend(res)
    print(f"Got {len(res)} for page {page+1}...")


Got 216 for page 1...
Got 240 for page 2...
Got 264 for page 3...
Got 288 for page 4...
Got 312 for page 5...
Got 336 for page 6...
Got 360 for page 7...
Got 384 for page 8...
Got 408 for page 9...
Got 432 for page 10...
Got 456 for page 11...
Got 480 for page 12...
Got 504 for page 13...
Got 528 for page 14...
Got 552 for page 15...
Got 576 for page 16...
Got 600 for page 17...
Got 624 for page 18...
Got 648 for page 19...
Got 672 for page 20...
Got 696 for page 21...
Got 720 for page 22...
Got 744 for page 23...
Got 768 for page 24...
Got 792 for page 25...
Got 816 for page 26...
Got 840 for page 27...
Got 864 for page 28...
Got 888 for page 29...
Got 912 for page 30...
Got 936 for page 31...
Got 960 for page 32...
Got 984 for page 33...
Got 1008 for page 34...
Got 1032 for page 35...
Got 1056 for page 36...
Got 1080 for page 37...
Got 1104 for page 38...
Got 1128 for page 39...
Got 1152 for page 40...
Got 1176 for page 41...
Got 1200 for page 42...
Got 1224 for page 43...
Got 1248 f

KeyboardInterrupt: 
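
The counts above grow by 24 on every pass because `driver.page_source` contains every thumbnail loaded so far, so each pass collects the earlier items again. This is why `data` ends up with many repeated entries that have to be dropped later. A minimal sketch of keeping only unseen URLs while scraping is shown below; the helper name `add_new_items` and the sample items are made up for illustration.

```python
def add_new_items(data, seen_urls, batch):
    """Append items from `batch` whose image URL has not been seen yet.

    Returns how many items were actually added.
    """
    added = 0
    for item in batch:
        url = item["image"]
        if url in seen_urls:
            continue  # already collected on an earlier pass
        seen_urls.add(url)
        data.append(item)
        added += 1
    return added


data, seen_urls = [], set()
first_pass = [{"class": "katydid", "image": "a.jpg"},
              {"class": "cricket", "image": "b.jpg"}]
# The second pass repeats the first two thumbnails and adds one new one.
second_pass = first_pass + [{"class": "grasshopper", "image": "c.jpg"}]

added_first = add_new_items(data, seen_urls, first_pass)
added_second = add_new_items(data, seen_urls, second_pass)
print(added_first, added_second, len(data))
```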

We can check the first `2` examples in the scraped data.

In [20]:
data[:2]

[{'class': 'katydid',
  'image': 'https://bugwoodcloud.org/images/384x256/9009025.jpg'},
 {'class': 'katydid',
  'image': 'https://bugwoodcloud.org/images/384x256/9009024.jpg'}]

In [21]:
len(data)

60114

In the following code cell we are going to create a dataframe based on the data that we have scraped.

In [26]:
paired = [list(r.values()) for r in data]

df = pd.DataFrame(paired, columns=['class', 'image'])
df.head()

| | class | image |
|---|---|---|
| 0 | katydid | https://bugwoodcloud.org/images/384x256/900902... |
| 1 | katydid | https://bugwoodcloud.org/images/384x256/900902... |
| 2 | katydid | https://bugwoodcloud.org/images/384x256/900900... |
| 3 | grasshopper | https://bugwoodcloud.org/images/384x256/562780... |
| 4 | grasshopper | https://bugwoodcloud.org/images/384x256/562780... |


The following function renames `plural` class names to `singular` and fixes a few misspellings.

In [30]:
def rename_class(class_):
    obj = {'katydids': 'katydid', 'grasshoppers': 'grasshopper', 'grashoppers': 'grasshopper',
           'crickets': 'cricket', 'grasshoppper' :'grasshopper',
          }
    try:
        return obj[class_]
    except KeyError:
        return class_
df['class'] = df['class'].apply(rename_class)

In [32]:
df['class'].unique()

array(['katydid', 'grasshopper', 'cricket', 'locust', 'bush-cricket',
       'acrididae)', 'boopie', 'subfamily)', 'conehead', 'grasshopper,',
       'wetas', 'weta', 'gomphocerinae)'], dtype=object)
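
Labels such as `'acrididae)'` and `'grasshopper,'` appear because the class is taken as the last word of the caption, so trailing punctuation survives. As a hedged sketch, punctuation could be stripped before the plural mapping is applied. The helper name `normalize_label` and the example captions in the usage lines are assumptions, not captions taken from the site.

```python
import string

# Same plural and misspelling fixes as rename_class above.
SINGULAR = {
    "katydids": "katydid",
    "grasshoppers": "grasshopper",
    "grashoppers": "grasshopper",
    "grasshoppper": "grasshopper",
    "crickets": "cricket",
}


def normalize_label(caption):
    """Take the last word of a caption, drop surrounding punctuation, singularize."""
    words = caption.split()
    if not words:
        return ""
    last_word = words[-1].strip(string.punctuation).lower()
    return SINGULAR.get(last_word, last_word)


print(normalize_label("field Crickets"))       # cricket
print(normalize_label("lubber grasshopper,"))  # grasshopper
print(normalize_label("(family Acrididae)"))   # acrididae
```

Labels like `acrididae` are still not one of the target classes, so filtering to the wanted labels remains necessary.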

Next we are going to drop duplicate rows based on the `image` column and keep only the `3` insect labels that we are interested in, which are:

1. `grasshopper`
2. `cricket`
3. `katydid`


In [38]:
df_unique['class'].value_counts()

class
grasshopper       1093
cricket            271
katydid            155
locust              30
weta                25
conehead            11
wetas                9
bush-cricket         7
boopie               7
subfamily)           6
acrididae)           4
gomphocerinae)       3
grasshopper,         1
Name: count, dtype: int64

In [40]:
df_unique = df.drop_duplicates(subset='image')
df_filtered = df_unique[df_unique['class'].isin(["grasshopper", "cricket", 'katydid'])]
df_filtered.head()

| | class | image |
|---|---|---|
| 0 | katydid | https://bugwoodcloud.org/images/384x256/900902... |
| 1 | katydid | https://bugwoodcloud.org/images/384x256/900902... |
| 2 | katydid | https://bugwoodcloud.org/images/384x256/900900... |
| 3 | grasshopper | https://bugwoodcloud.org/images/384x256/562780... |
| 4 | grasshopper | https://bugwoodcloud.org/images/384x256/562780... |


In [41]:
df_filtered.to_csv('pickle.csv', index=False)
print("Saved")

Saved


In [42]:
df_filtered = pd.read_csv('pickle.csv')

In [43]:
df_filtered['class'].value_counts()

class
grasshopper    1093
cricket         271
katydid         155
Name: count, dtype: int64

Next we are going to download these images into their respective directories based on their class names. Before we download them we want to make sure that the classes are balanced, so we sample each class down to the size of the smallest one.

In [44]:
min_size = df_filtered['class'].value_counts().min()
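# random_state makes the sample repeatable; random.seed() does not affect pandas.
df_balanced = df_filtered.groupby('class').apply(lambda x: x.sample(min_size, random_state=SEED)).reset_index(drop=True)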
df_balanced['class'].value_counts()

class
cricket        155
grasshopper    155
katydid        155
Name: count, dtype: int64
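
To make the balancing step concrete, here is a pandas-free sketch of the same idea: group the rows by class, find the size of the rarest class, and draw that many rows from each class with a fixed seed. The function name `downsample_to_smallest` and the toy rows are assumptions for illustration. In pandas, `sample` accepts a `random_state` argument for the same purpose.

```python
import random
from collections import Counter, defaultdict


def downsample_to_smallest(rows, seed=23):
    """Keep the same number of rows per class, sized to the rarest class.

    `rows` is a non-empty list of (class_name, url) tuples.
    """
    by_class = defaultdict(list)
    for class_name, url in rows:
        by_class[class_name].append(url)

    min_size = min(len(urls) for urls in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    balanced = []
    for class_name in sorted(by_class):
        for url in rng.sample(by_class[class_name], min_size):
            balanced.append((class_name, url))
    return balanced


rows = ([("grasshopper", f"g{i}.jpg") for i in range(5)]
        + [("cricket", f"c{i}.jpg") for i in range(3)]
        + [("katydid", f"k{i}.jpg") for i in range(2)])
balanced = downsample_to_smallest(rows)
print(Counter(class_name for class_name, _ in balanced))
```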

In [46]:
base_dir = "dataset"

if not os.path.exists(base_dir):
    os.mkdir(base_dir)

for class_ in ["grasshopper", "cricket", 'katydid']:
    class_dir = os.path.join(base_dir, class_)
    if not os.path.exists(class_dir):
        os.mkdir(class_dir)


We are going to use the `ThreadPoolExecutor` from `concurrent.futures` to download and save the images concurrently using multiple threads. First let's check the number of CPUs in this computer, which we will use as the number of worker threads.

In [47]:
num_workers = multiprocessing.cpu_count()
print("CPUs: {}".format(num_workers))

CPUs: 12


Next we convert the classes and image URLs from the pandas dataframe into a Python list of tuples and shuffle it.

In [48]:
img_urls = list(zip(
    df_balanced['class'].values,
    df_balanced['image'].values,
))

random.shuffle(img_urls)

In [49]:
img_urls[:2]

[('grasshopper', 'https://bugwoodcloud.org/images/384x256/5552856.jpg'),
 ('katydid', 'https://bugwoodcloud.org/images/384x256/5178032.jpg')]

The following function saves a single image of an insect in its respective folder.

In [50]:
skipped = list()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def save_image(row):
    class_, url = row
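    try:
        # The file name is the last path segment of the URL, e.g. "9009025.jpg".
        image_name = url.split('/')[-1]
        # The timeout keeps a stalled request from blocking a worker forever.
        res = requests.get(url, headers=headers, timeout=30)
        if res.status_code == 200:
            save_name = os.path.join(base_dir, class_, image_name)
            with open(save_name, 'wb') as fp:
                fp.write(res.content)
        else:
            print("Failed to download the image: STATUS CODE {}".format(res.status_code))
            skipped.append(url)  # record failed downloads as well
    except Exception as e:
        print(e)
        print("url skipped:", url)
        skipped.append(url)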
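
Taking the last `/`-separated piece of the URL works for the URLs seen in this notebook. If a URL ever carried a query string, that piece would include it. The sketch below shows one way to build the destination path from the URL path only; the helper name `destination_for` and the `?size=large` suffix are made up for illustration.

```python
import os
from urllib.parse import urlparse


def destination_for(base_dir, class_, url):
    """Build the local path for an image URL, ignoring any query string."""
    image_name = os.path.basename(urlparse(url).path)
    return os.path.join(base_dir, class_, image_name)


path = destination_for(
    "dataset",
    "katydid",
    "https://bugwoodcloud.org/images/384x256/9009025.jpg?size=large",
)
print(path)
```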

Next we are going to download the images from the urls.

In [51]:
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [executor.submit(save_image, i) for i in img_urls]
    for future in tqdm.tqdm(as_completed(futures), desc="downloading...", total=len(img_urls)):
        pass
print("Done!!")

downloading...: 100%|████████████████████████████████████████████████████████████████| 465/465 [05:32<00:00,  1.40it/s]

Done!!
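
Threads suit this step because downloading is mostly waiting on the network. As a hedged alternative to recording failures in a shared list from inside the workers, the outcome of each future can be collected in the main thread. In the sketch below, `download_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(rows, fetch, max_workers=4):
    """Run `fetch(row)` for every row in a thread pool.

    Returns (succeeded, failed), where failed holds (row, error message) pairs.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_row = {executor.submit(fetch, row): row for row in rows}
        for future in as_completed(future_to_row):
            row = future_to_row[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(row)
            except Exception as exc:
                failed.append((row, str(exc)))
    return succeeded, failed


def fake_fetch(row):
    """Hypothetical stand-in for a real download; fails on one URL."""
    class_, url = row
    if url.endswith("broken.jpg"):
        raise ValueError("simulated download error")
    return url


rows = [("cricket", "a.jpg"), ("katydid", "broken.jpg"), ("grasshopper", "c.jpg")]
succeeded, failed = download_all(rows, fake_fetch)
print(len(succeeded), "ok,", len(failed), "failed")
```

The `failed` list can then be retried or inspected after the pool has finished.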





After downloading these images, we want to split our dataset into `2` sets: `train` and `test`. So we are going to create the following folder structure in our `dataset` directory.


```
📁 dataset
    📁  train
      📁  class_1
         - 0.jpg
         - 1.jpg
         ....
      📁  class_2
         - 0.jpg
         - 1.jpg
         ....
      .....
    📁  test
      📁  class_1
         - 0.jpg
         - 1.jpg
         ....
      📁  class_2
         - 0.jpg
         - 1.jpg
         ....
    ....
```

In [54]:
for class_ in ["grasshopper", "cricket", 'katydid']:
    train_dir = os.path.join(base_dir,'train', class_)
    test_dir = os.path.join(base_dir,'test', class_)
    if not os.path.exists(train_dir):
        os.makedirs(train_dir)
    if not os.path.exists(test_dir):
        os.makedirs(test_dir)
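
The existence checks above can also be folded into `os.makedirs` with `exist_ok=True`, which makes the cell safe to re-run. The sketch below is a small illustration that works in a temporary directory; the helper name `make_split_dirs` is an assumption.

```python
import os
import tempfile


def make_split_dirs(base_dir, classes, splits=("train", "test")):
    """Create base_dir/<split>/<class> for every combination; safe to re-run."""
    created = []
    for split in splits:
        for class_ in classes:
            path = os.path.join(base_dir, split, class_)
            os.makedirs(path, exist_ok=True)  # no error if it already exists
            created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    paths = make_split_dirs(tmp, ["grasshopper", "cricket", "katydid"])
    # Calling it a second time does not raise.
    paths_again = make_split_dirs(tmp, ["grasshopper", "cricket", "katydid"])
    all_exist = all(os.path.isdir(p) for p in paths)

print(len(paths), all_exist)
```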

We will move the first `55` images of each class into the test folder of that class, and the remaining images into the train folder of that class.

In [55]:
def move_to_final_destination(trg, class_: str, limit=None):
    class_dir = os.path.join(base_dir, class_)
    # Slicing with None keeps every file, so one loop covers both cases.
    image_names = os.listdir(class_dir)[:limit]
    for image_name in tqdm.tqdm(image_names, desc=f"moving images to the {class_} folder..."):
        src = os.path.join(class_dir, image_name)
        shutil.move(src, trg)

for class_ in ["grasshopper", "cricket", 'katydid']:
    train_dir = os.path.join(base_dir,'train', class_)
    test_dir = os.path.join(base_dir,'test', class_)
    move_to_final_destination(test_dir, class_, limit=55)
    move_to_final_destination(train_dir, class_, limit=None)


moving images to the grasshopper folder...: 100%|█████████████████████████████████████| 55/55 [00:00<00:00, 237.71it/s]
moving images to the grasshopper folder...: 100%|███████████████████████████████████| 100/100 [00:00<00:00, 588.21it/s]
moving images to the cricket folder...: 100%|█████████████████████████████████████████| 55/55 [00:00<00:00, 640.25it/s]
moving images to the cricket folder...: 100%|███████████████████████████████████████| 100/100 [00:00<00:00, 632.88it/s]
moving images to the katydid folder...: 100%|█████████████████████████████████████████| 55/55 [00:00<00:00, 768.33it/s]
moving images to the katydid folder...: 100%|███████████████████████████████████████| 100/100 [00:00<00:00, 653.60it/s]
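
Because `os.listdir` returns names in arbitrary order, which images count as "the first `55`" depends on the filesystem. A minimal sketch of a repeatable split is shown below: sort the names, shuffle them with a fixed seed, then slice. The helper name `split_names` and the generated file names are assumptions for illustration.

```python
import random


def split_names(names, test_size, seed=23):
    """Return (test, train) lists; the same inputs always give the same split."""
    ordered = sorted(names)  # remove dependence on listing order
    random.Random(seed).shuffle(ordered)
    return ordered[:test_size], ordered[test_size:]


names = [f"{i}.jpg" for i in range(155)]
test_names, train_names = split_names(names, test_size=55)
print(len(test_names), len(train_names))  # 55 100
```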


Next we are going to remove the now-empty per-class directories from the top level of the dataset folder.

In [56]:
for class_ in ["grasshopper", "cricket", 'katydid']:
    _dir = os.path.join(base_dir, class_)
    # The second positional argument of rmtree is ignore_errors, not a mode.
    shutil.rmtree(_dir, ignore_errors=True)

### Refs
1. [github.com/CrispenGari](https://github.com/CrispenGari/web-scrapping-python/blob/main/selenium/selenium.ipynb)