# Install and Import

In this notebook, I scrape Wikimedia Commons to build an image dataset that can be used to train machine learning models.

The first part installs and imports all the required libraries. Selenium and the Chrome driver are not part of the default Google Colab environment, so I start by installing them. I use Selenium to interact with dynamic elements in the DOM, and then Beautiful Soup to extract the image links.

Along with Selenium, requests and Beautiful Soup, I import the string and re libraries, which help with formatting file names when I don't want to keep the original name of a file.

In [None]:
!apt-get update -qq
!apt install chromium-chromedriver -qq
!pip install selenium -qq

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import requests
import string
import time
import re
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
import pandas as pd

from urllib.parse import quote
import os
import glob
import random
from PIL import Image

Next, I set up the Chrome driver and use Options() to start Chrome in headless mode.

# Setup Chrome Driver

In [27]:
def setup_driver():
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    return webdriver.Chrome(options=chrome_options)

driver = setup_driver()

The category pages on Wikimedia Commons are often organised as a tree, and when a page first loads, only the parent categories are visible. To scrape the child pages, I need Selenium to interact with the dynamic elements on the page.

The expand_all function finds every category that has a collapsed subcategory list. It looks for elements with class="CategoryTreeToggle CategoryTreeToggleHandlerAttached" and the attribute aria-expanded="false", then clicks them one at a time to expand them. It also highlights each expanded element in yellow, which proved quite useful when I was debugging and testing the code with Chrome in normal (non-headless) mode.


# Expand Collapsible Section

In [None]:
def expand_all(driver):
    # Find all elements where class=CategoryTreeToggle CategoryTreeToggleHandlerAttached and aria-expanded="false"
    buttons = driver.find_elements(By.CSS_SELECTOR, 'a.CategoryTreeToggle.CategoryTreeToggleHandlerAttached[aria-expanded="false"]')

    print("Buttons found:", len(buttons))

    # Loop through each button and click it
    for button in buttons:
        parent_div = button.find_element(By.XPATH, './ancestor::div[1]')  # nearest enclosing div

        # Highlight the parent div by changing its style (useful while developing and testing)
        driver.execute_script("arguments[0].style.backgroundColor = 'yellow'; arguments[0].style.color = 'blue';", parent_div)

        # Click on the button to expand
        button.click()

    time.sleep(1)

A page can contain several levels of subcategories, and expanding one level reveals new collapsed toggles. The function check_remaining_buttons therefore counts the toggles that are still collapsed, runs expand_all if there are any, and then calls itself recursively until none remain.

# Expand all Sections

In [None]:
def check_remaining_buttons(driver):
    # Check if there are any more expandable buttons
    remaining_buttons = driver.find_elements(By.CSS_SELECTOR, 'a.CategoryTreeToggle.CategoryTreeToggleHandlerAttached[aria-expanded="false"]')

    print("Remaining buttons:", len(remaining_buttons))

    # If there are remaining buttons, expand them
    if remaining_buttons:  # a non-empty list is truthy; an empty list ends the recursion
        expand_all(driver)
        check_remaining_buttons(driver)
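
The recursion above stops only when no collapsed toggles remain, so a toggle that never expands would keep it going until Python's recursion limit is hit. As a minimal sketch of an alternative (not what this notebook uses), the same loop can be written iteratively with a cap on the number of rounds. The names expand_until_done, count_collapsed, expand_once and max_rounds are my own, introduced only for illustration.

```python
def expand_until_done(count_collapsed, expand_once, max_rounds=50):
    """Run expand_once() until count_collapsed() returns 0 or max_rounds is reached.

    count_collapsed: callable returning how many toggles are still collapsed.
    expand_once:     callable that clicks the currently collapsed toggles.
    Returns the number of expansion rounds that were run.
    """
    rounds = 0
    while rounds < max_rounds and count_collapsed() > 0:
        expand_once()
        rounds += 1
    return rounds


# Hypothetical wiring with the functions defined in this notebook:
# SELECTOR = 'a.CategoryTreeToggle.CategoryTreeToggleHandlerAttached[aria-expanded="false"]'
# expand_until_done(
#     lambda: len(driver.find_elements(By.CSS_SELECTOR, SELECTOR)),
#     lambda: expand_all(driver),
# )
```

Passing the two steps in as callables keeps the loop independent of Selenium, which also makes it easy to try out with stand-in functions.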

Once all the buttons are expanded, the links to the category pages can be collected. The simplest way is to find every div with class="CategoryTreeItem" and select the anchors within it whose href contains "wiki/Category:". These elements are stored in category_links, and their hrefs are then appended to the list all_links.

# Extract Page Links

In [None]:
def find_category_links(driver):
    all_links = []

    category_links = driver.find_elements(By.CSS_SELECTOR, 'div.CategoryTreeItem a[href*="wiki/Category:"]')

    print("Category links found:", len(category_links))

    # Collect and print the href of each category link
    for link in category_links:
        all_links.append(link.get_attribute('href'))
        print(link.get_attribute('href'))

    return all_links
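
A tree this broad can pull in categories that are only loosely related to the aircraft itself (people, events and so on). The sketch below shows one way I could turn the collected links into readable names and drop unwanted ones by keyword before scraping. The helper names category_name and filter_links and the exclude_keywords parameter are hypothetical, not part of the notebook.

```python
from urllib.parse import unquote

CATEGORY_MARKER = "/wiki/Category:"


def category_name(link):
    """Return a readable category name from a Commons category URL."""
    name = link.split(CATEGORY_MARKER, 1)[-1]
    return unquote(name).replace("_", " ")


def filter_links(links, exclude_keywords=()):
    """Keep the first occurrence of each link, skipping names that contain an excluded keyword."""
    seen = set()
    kept = []
    for link in links:
        name = category_name(link).lower()
        if link in seen or any(k.lower() in name for k in exclude_keywords):
            continue
        seen.add(link)
        kept.append(link)
    return kept


# Example: drop categories about people before scraping
# all_links = filter_links(all_links, exclude_keywords=["people", "aircrew"])
```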

# Test 1

Let's see this in action.

Suppose I am building a dataset to train a model that identifies the aircraft type in a photo. I will need thousands of examples for each type to have any hope of training the model. The code above walks through the category tree and collects the links to every category page related to a particular aircraft.

I will use the Airbus A300 as an example. I searched for an A300 photo on Wikimedia Commons and, through the category links at the bottom of the page, navigated to "Airbus_A300_by_variant", which sits higher in the tree. This page should lead to all of the A300 pages. When I run the code on it, I get close to 900 category pages in return. Collecting these by hand would be tedious and error-prone.

In [28]:
driver.get("https://commons.wikimedia.org/wiki/Category:Airbus_A300_by_variant")
check_remaining_buttons(driver)
all_links = find_category_links(driver)
driver.quit()

Remaining buttons: 22
Buttons found: 22
Remaining buttons: 21
Buttons found: 21
Remaining buttons: 8
Buttons found: 8
Remaining buttons: 2
Buttons found: 2
Remaining buttons: 0
Category links found: 887
https://commons.wikimedia.org/wiki/Category:Airbus_A300B1
https://commons.wikimedia.org/wiki/Category:F-OCAZ_(aircraft)
https://commons.wikimedia.org/wiki/Category:F-WUAB_(Airbus_A300B1)
https://commons.wikimedia.org/wiki/Category:Airbus_A300_maiden_flight
https://commons.wikimedia.org/wiki/Category:Airbus_A300_maiden_flight_aircrew
https://commons.wikimedia.org/wiki/Category:People_at_Airbus_A300_maiden_flight
https://commons.wikimedia.org/wiki/Category:Airbus_A300_rollout
https://commons.wikimedia.org/wiki/Category:F-WUAC_(aircraft)
https://commons.wikimedia.org/wiki/Category:Airbus_A300B2-100
https://commons.wikimedia.org/wiki/Category:F-BUAD_(aircraft)
https://commons.wikimedia.org/wiki/Category:Airbus_A300_ZERO-G
https://commons.wikimedia.org/wiki/Category:COVID-19_vaccination_in_A

In [29]:
len(all_links)

887

To explain the next part, I will choose a page that doesn't have too many children: the category page for one particular aircraft, registered F-WUAB.

In [14]:
def main():
    # Start a fresh driver; the previous one was closed with driver.quit()
    driver = setup_driver()

    # Open the desired webpage
    driver.get('https://commons.wikimedia.org/wiki/Category:F-WUAB_(Airbus_A300B1)')

    # Start the process by checking and expanding all buttons
    check_remaining_buttons(driver)

    # After expanding all buttons, find the category links
    all_links = find_category_links(driver)

    # Close the driver after completion
    driver.quit()

    return all_links

if __name__ == "__main__":
    all_links = main()

Remaining buttons: 1
Buttons found: 1
Remaining buttons: 0
Category links found: 4
https://commons.wikimedia.org/wiki/Category:Airbus_A300_maiden_flight
https://commons.wikimedia.org/wiki/Category:Airbus_A300_maiden_flight_aircrew
https://commons.wikimedia.org/wiki/Category:People_at_Airbus_A300_maiden_flight
https://commons.wikimedia.org/wiki/Category:Airbus_A300_rollout


# Extracting Image URLs and Test 2

There are 4 category pages that I need to scrape. The code below first collects the link to each image's own file page by finding every anchor with class="galleryfilename galleryfilename-truncate" on a category page. It then visits each file page and reads the href of the anchor with class="internal", which points to the original file. This gives me the highest-resolution images to work with.

For each category page, the code also prints the category name and two counters: PAGES is the number of file pages found on that category, and IMGURLS is the running total of image URLs found so far.

In [16]:
# Lists to store image URLs and page links
l_t_p = []  # Stores page links for images
img_urls = []  # Stores final image URLs

# Loop through all the URLs (which are in `all_links`)
for url in all_links:
    # Print the current category or page being processed (removes "https://commons.wikimedia.org/wiki/Category:")
    print()
    print(url.replace("https://commons.wikimedia.org/wiki/Category:", ""))
    print()

    # Fetch the source code of the page
    source_code = requests.get(url, allow_redirects=False)

    # Encode the source code in ASCII to handle any non-ASCII characters (replace them if needed)
    plain_text = source_code.text.encode('ascii', 'replace')

    # Parse the page content using BeautifulSoup
    soup = BeautifulSoup(plain_text, 'html.parser')

    # Find all links for images in the current category page
    for link in soup.findAll('a', {'class': 'galleryfilename galleryfilename-truncate'}):
        # Build the file page URL on Commons (the href already starts with '/wiki/')
        image_link = 'https://commons.wikimedia.org' + link.get('href')

        # Append the image page link to the list
        l_t_p.append(image_link)

    # Iterate over all image page links collected in l_t_p
    for l_t_p1 in l_t_p:
        # Now fetch the image page (not the category page)
        url = l_t_p1
        source_code = requests.get(url, allow_redirects=False)

        # Encode the source code in ASCII to handle any non-ASCII characters (replace them if needed)
        plain_text = source_code.text.encode('ascii', 'replace')

        # Parse the image page content using BeautifulSoup
        soup = BeautifulSoup(plain_text, 'html.parser')

        # Find all links for the actual image URLs on the image page
        for l_t_p2 in soup.findAll('a', {'class': 'internal'}):
            # Get the href attribute for the image link
            href = l_t_p2.get('href')

            # Construct the full image URL
            img_url = 'https:' + str(href)

            # Print the image URL (can be used for debugging or confirmation)
            print(img_url)

            # Append the image URL to the list
            img_urls.append(img_url)

    # Print summary for each category page processed
    print()
    print(f"IMGURLS:{len(img_urls)}")  # Shows how many image URLs have been found
    print(f"PAGES:{len(l_t_p)}")  # Shows how many image pages were processed
    l_t_p = []  # Clear the list for the next category page



Airbus_A300_maiden_flight

https://upload.wikimedia.org/wikipedia/commons/a/af/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1974.jpg
https://upload.wikimedia.org/wikipedia/commons/e/ee/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1975.jpg
https://upload.wikimedia.org/wikipedia/commons/7/7c/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1976.jpg
https://upload.wikimedia.org/wikipedia/commons/d/d4/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1977.jpg
https://upload.wikimedia.org/wikipedia/commons/5/5b/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1978.jpg
https://upload.wikimedia.org/wikipedia/commons/d/d1/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1979_%28cropped%29.jpg
https://upload.wikimedia.org/wikipedia/commons/4/4a/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1979_%28cropped%2C_restored%29.jpg
https://upload.wikimedia.org/wikipedia/commons/f/ff/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1979.jpg
https://upload.wikimedia.org/wikipedia/commons/7/77/28.10.72_1er_Vol_d%27Airbus_%281
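
The loop above builds absolute URLs by string concatenation, which only works if every href has exactly the expected form. A slightly more forgiving sketch uses urljoin from the standard library, which resolves site-relative, protocol-relative and already-absolute hrefs against a base URL. The helper name absolute_url and the constant COMMONS_BASE are my own, and the href shapes in the comments are assumptions based on the concatenation used above.

```python
from urllib.parse import urljoin

COMMONS_BASE = "https://commons.wikimedia.org"


def absolute_url(href, base=COMMONS_BASE):
    """Resolve an href from a scraped page into an absolute URL."""
    return urljoin(base, href)


# Site-relative href (assumed shape of the file page links)
print(absolute_url("/wiki/File:Example.jpg"))
# Protocol-relative href (assumed shape of the original-file links)
print(absolute_url("//upload.wikimedia.org/wikipedia/commons/a/af/Example.jpg"))
```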

I use Python's set() to remove any duplicate image URLs. Duplicates can occur when I start from a page that is very high in the tree, because the same image may be filed under several categories.

# Remove Duplicate Images

In [23]:
unique_img_urls = list(set(img_urls))
print(f"Duplicate Links:{len(img_urls)-len(unique_img_urls)}")

Duplicate Links:0
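
One side effect of set() is that the original order of the URLs is lost. If the order matters, for example to keep images grouped by the category they came from, a small variation keeps the first occurrence of each URL in place. This is only a sketch; the helper name dedupe_keep_order is my own.

```python
def dedupe_keep_order(urls):
    """Remove duplicates while keeping the first occurrence of each URL in order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


# unique_img_urls = dedupe_keep_order(img_urls)
```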


# Method 1: Download Images Using Python

Now the images can be simply downloaded to a specified directory.

In [None]:
def download_image(url, prefix, save_dir):
    # Extract the filename from the URL
    filename = url.split('/')[-1]

    # Add the prefix to the filename
    new_filename = prefix + filename

    # Create the full path to save the image
    file_path = os.path.join(save_dir, new_filename)

    # Send an HTTP GET request to fetch the image
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file in binary write mode and save the image
        with open(file_path, 'wb') as f:
            f.write(response.content)
        print(f"Downloaded: {new_filename}")
    else:
        print(f"Failed to download: {url}")

prefix = "Airbus A300 "
save_dir = "/Users/[YOURUSERNAME]/Downloads/TRAIN/RAW/Airbus_A300"
os.makedirs(save_dir, exist_ok=True)

# Download each unique URL once
for url in unique_img_urls:
    download_image(url, prefix, save_dir)
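
The filename taken from the URL is still percent-encoded, so a file saved this way ends up with sequences like %27 and %28 in its name. If readable names are preferred, the sketch below decodes the last path segment and replaces characters that are not allowed in file names on common filesystems. The helper name local_filename and the exact set of replaced characters are my own choices, not something the notebook prescribes.

```python
import posixpath
import re
from urllib.parse import unquote, urlsplit


def local_filename(url, prefix=""):
    """Build a readable local file name from an image URL."""
    # Take the last path segment, ignoring any query string or fragment
    raw = posixpath.basename(urlsplit(url).path)
    # Decode percent-escapes such as %27 (') and %28 (()
    name = unquote(raw)
    # Replace characters that are invalid on common filesystems
    name = re.sub(r'[\\/:*?"<>|]', "_", name)
    return prefix + name


# Possible use inside download_image:
# new_filename = local_filename(url, prefix)
```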


# Method 2: Use JDownloader to download

In practice, I usually use a batch download utility like JDownloader for large sets of images. It has a robust feature set and allows pausing and resuming downloads.

Adding a prefix to the filename was trivial in the Python code above, but with JDownloader I need a different technique. JDownloader has a useful feature: appending #filename=name to the end of a URL sets the downloaded file's name to "name", or whatever I choose it to be.

I take the filename from each image URL (the part after the last "/") and prepend the prefix to it in this way. Since URLs cannot contain a space character, any spaces in the prefix are written as "%20". JDownloader understands this representation and substitutes a space for %20 when it saves the file.

In [24]:
# Function to generate the new URL
def generate_new_url(url):
    # Extract the part of the URL after the last '/'
    filename = url.split('/')[-1]

    # Replace spaces in the prefix with '%20'
    prefix_nospace = prefix.replace(" ", "%20")

    # Construct the new URL
    new_url = url + "#filename=" + prefix_nospace + filename

    return new_url

prefix = "Airbus A300 "

new_img_urls = [generate_new_url(url) for url in unique_img_urls]

# Print the new URLs
for new_url in new_img_urls:
    print(new_url)


https://upload.wikimedia.org/wikipedia/commons/f/f0/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi2017.jpg#filename=Airbus%20A300%2028.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi2017.jpg
https://upload.wikimedia.org/wikipedia/commons/4/49/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1993.jpg#filename=Airbus%20A300%2028.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1993.jpg
https://upload.wikimedia.org/wikipedia/commons/8/89/05.02.72_Airbus_roule_pour_la_1%C3%A8re_fois_%281972%29_-_53Fi1949.jpg#filename=Airbus%20A300%2005.02.72_Airbus_roule_pour_la_1%C3%A8re_fois_%281972%29_-_53Fi1949.jpg
https://upload.wikimedia.org/wikipedia/commons/f/f2/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi2025.jpg#filename=Airbus%20A300%2028.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi2025.jpg
https://upload.wikimedia.org/wikipedia/commons/b/b2/28.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1982.jpg#filename=Airbus%20A300%2028.10.72_1er_Vol_d%27Airbus_%281972%29_-_53Fi1982.jpg
https://upload.wikimedia.org/wikipedia/co
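
Replacing only spaces works for a simple prefix like the one above, but a prefix containing characters such as "#" or "&" would break the URL fragment. As a sketch, quote from the standard library percent-encodes all of these in one step. The helper name jdownloader_url is my own, and I am assuming JDownloader decodes other percent-escapes the same way it decodes %20, which I have only verified for spaces.

```python
from urllib.parse import quote


def jdownloader_url(url, prefix):
    """Append a #filename= fragment with a percent-encoded prefix to an image URL."""
    # The last path segment is already percent-encoded, so it is reused as-is
    filename = url.rsplit("/", 1)[-1]
    return f"{url}#filename={quote(prefix)}{filename}"


print(jdownloader_url(
    "https://upload.wikimedia.org/wikipedia/commons/f/f0/Example.jpg",
    "Airbus A300 ",
))
```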

Now all that is left is to bring all the images to the same resolution and, if needed, apply transformations such as rotation or horizontal or vertical mirroring to a small percentage of them. Such transformations can help an image recognition model generalise better.

# Transform Image

In [None]:
def process_image(image_path, output_path, target_width, target_height, flip_type=None, rotate_degrees=None, flip_percent=0, rotate_percent=0):
    """
    Crop, resize, flip, and rotate image based on the given parameters.

    Parameters:
        image_path (str): Path to the input image.
        output_path (str): Path to save the processed image.
        target_width (int): Target width for the crop.
        target_height (int): Target height for the crop.
        flip_type (str): 'horizontal' or 'vertical' to flip the image (default None).
        rotate_degrees (int): Rotation angle in degrees; if None, one of 90, 180 or 270 is picked at random (default None).
        flip_percent (int): The probability (in %) of flipping the image.
        rotate_percent (int): The probability (in %) of rotating the image.
    """
    with Image.open(image_path) as img:
        img_width, img_height = img.size

        # Crop the image to the target resolution (centered crop)
        left = (img_width - target_width) / 2
        top = (img_height - target_height) / 2
        right = (img_width + target_width) / 2
        bottom = (img_height + target_height) / 2

        img = img.crop((left, top, right, bottom))

        # Resize the image to the target resolution
        img = img.resize((target_width, target_height))

        # Flip the image with a certain probability (flip_percent)
        if flip_percent > 0 and random.random() < flip_percent / 100:
            if flip_type == 'horizontal':
                img = img.transpose(Image.FLIP_LEFT_RIGHT)
            elif flip_type == 'vertical':
                img = img.transpose(Image.FLIP_TOP_BOTTOM)

        # Rotate the image with a certain probability (rotate_percent)
        if rotate_percent > 0 and random.random() < rotate_percent / 100:
            if rotate_degrees is None:  # If rotate_degrees is None, pick randomly from [90, 180, 270]
                rotate_degrees = random.choice([90, 180, 270])
            img = img.rotate(rotate_degrees)

        # Save the processed image
        img.save(output_path)
        print(f"Processed image saved to: {output_path}")

def process_directory(input_dir, output_dir, target_width, target_height, flip_type=None, rotate_degrees=None, flip_percent=0, rotate_percent=0):
    """
    Process all images in the input directory.

    Parameters:
        input_dir (str): Directory containing input images.
        output_dir (str): Directory to save processed images.
        target_width (int): Target width for cropping and resizing.
        target_height (int): Target height for cropping and resizing.
        flip_type (str): 'horizontal' or 'vertical' to flip images (default None).
        rotate_degrees (int): Rotation degrees (90, 180, or 270) (default None).
        flip_percent (int): The probability (in %) of flipping images.
        rotate_percent (int): The probability (in %) of rotating images.
    """
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)

    # Get all image files in the directory
    image_files = glob.glob(os.path.join(input_dir, '*.*'))

    # Supported image formats (optional, you can add more)
    valid_formats = ['.jpg', '.jpeg', '.png', '.bmp', '.gif']

    # Filter the list to include only valid image files
    image_files = [file for file in image_files if os.path.splitext(file)[1].lower() in valid_formats]

    # Process each image
    for image_path in image_files:
        # Prepare output path
        output_image_path = os.path.join(output_dir, os.path.basename(image_path))

        # Process the image
        process_image(image_path, output_image_path, target_width, target_height, flip_type, rotate_degrees, flip_percent, rotate_percent)

# Example usage
input_dir = '/Users/[YOURUSERNAME]/Downloads/TRAIN/RAW/Airbus_A300'
output_dir = '/Users/[YOURUSERNAME]/Downloads/TRAIN/TRANSFORMED/Airbus_A300'
target_width = 800
target_height = 600
flip_type = 'horizontal'
rotate_degrees = None
flip_percent = 10
rotate_percent = 15

process_directory(input_dir, output_dir, target_width, target_height, flip_type, rotate_degrees, flip_percent, rotate_percent)
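
One thing to be aware of: process_image cuts a window of exactly target_width x target_height out of the centre of the photo, so for a large original only a small central patch survives and the resize that follows changes nothing. If the goal is to keep as much of the aircraft as possible, an alternative is to crop the largest centred box with the target aspect ratio and let the resize do the scaling. The sketch below computes that box; the function name center_crop_box is my own and this is not the method used above.

```python
def center_crop_box(img_width, img_height, target_width, target_height):
    """Return (left, top, right, bottom) of the largest centred box
    that has the same aspect ratio as the target size."""
    target_ratio = target_width / target_height
    if img_width / img_height > target_ratio:
        # Image is too wide: keep the full height and trim the sides
        crop_h = img_height
        crop_w = round(img_height * target_ratio)
    else:
        # Image is too tall (or already matches): keep the full width
        crop_w = img_width
        crop_h = round(img_width / target_ratio)
    left = (img_width - crop_w) // 2
    top = (img_height - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


# Possible use with Pillow inside process_image:
# box = center_crop_box(*img.size, target_width, target_height)
# img = img.crop(box).resize((target_width, target_height))
```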


And that's it. To build an image dataset, I now just need to provide a URL, and the code does the rest.