# Image Dataset Collection Project
- **Objective:**  Automate the collection of images from Google Images for 20 predefined categories,
  download 50 images per category, store metadata, and organize them into labeled folders.
 
## **Dataset Name:** `ImageDataset`

### **Use Case:**
- This dataset can be used to train and evaluate image classification models.
- For example, you can build a convolutional neural network that learns to distinguish between different
  categories such as "Nature", "Animals", "Cars", etc.
- The metadata (image URL, filename, and resolution) can be used during preprocessing (e.g., resizing, augmentation) before training your model.

## Step 1: Import Required Libraries 
- We import libraries for web automation (Selenium), HTTP requests (Requests), HTML parsing (BeautifulSoup), image processing (Pillow), and OS/file operations.

In [2]:
pip install selenium

In [4]:
import os
import time
import requests
import csv
from io import BytesIO

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

from bs4 import BeautifulSoup
from PIL import Image

## Step 2: Define Categories and Create Folders
- We define a list of 20 categories and then create a base directory (`ImageDataset`) with one subfolder per category.


In [5]:
categories = [
    "Nature", "Animals", "Cars", "Flowers", "Mountains", "Beaches", "Food", "Architecture",
    "Sports", "Fashion", "Technology", "Art", "Cityscapes", "Insects", "Birds",
    "Underwater", "Landscapes", "Urban", "Space", "Portraits"
]

# Create the base directory and one subfolder per category.
# exist_ok=True makes this cell safe to re-run.
base_dir = "ImageDataset"
os.makedirs(base_dir, exist_ok=True)
for cat in categories:
    os.makedirs(os.path.join(base_dir, cat), exist_ok=True)


## Step 3: Setup Selenium WebDriver
- We configure the Selenium WebDriver to run in headless mode for automation.
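- **Make sure** that ChromeDriver is installed and available on your system `PATH`. (Selenium 4.6 and later can usually locate or download a matching driver automatically through Selenium Manager.)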

In [None]:
import tempfile
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Create a temporary directory for Chrome user data
temp_user_data_dir = tempfile.mkdtemp()

# Setup Chrome WebDriver options
chrome_options = Options()
chrome_options.add_argument("--headless")  # run in headless mode
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument(f"--user-data-dir={temp_user_data_dir}")  # Use a unique user data directory

# Initialize the driver (ensure 'chromedriver' is in your PATH)
driver = webdriver.Chrome(options=chrome_options)


## Step 4: Define a Function to Download Images for a Category

- **The function `download_images_for_category` will:**
  - Open a Google Images search page for the given category.
  - Scroll down a few times to load more images.
  - Parse the page with BeautifulSoup to extract image URLs.
  - Download up to 50 images.
  - Save each image in its respective category folder.
  - Extract image resolution using Pillow.
  - Return metadata for all downloaded images.

In [None]:
def download_images_for_category(category, num_images=50):
    query = category
    # Construct the Google Images search URL (using the "tbm=isch" parameter)
    search_url = f"https://www.google.com/search?q={query}&tbm=isch"
    driver.get(search_url)
    time.sleep(2)  # Wait for the page to load
    
    # Scroll to load more images (adjust the range as needed)
    for i in range(3):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
    
    # Parse the page source using BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    image_elements = soup.find_all("img")
    print(f"Found {len(image_elements)} image elements for '{category}'.")
    
    downloaded = 0
    metadata = []
    for idx, img in enumerate(image_elements):
        if downloaded >= num_images:
            break
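        img_url = img.get("src")
        # Skip missing sources and inline "data:" URIs, which requests cannot fetch
        if not img_url or not img_url.startswith("http"):
            continue
        try:
            # Download the image data and fail early on HTTP errors
            response = requests.get(img_url, timeout=10)
            response.raise_for_status()
            img_data = response.content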
            
            # Open the image with Pillow to get its resolution
            image = Image.open(BytesIO(img_data))
            width, height = image.size
            
            # Define a filename and save the image in the corresponding folder
            filename = f"{category}_{downloaded+1}.jpg"
            file_path = os.path.join(base_dir, category, filename)
            with open(file_path, "wb") as f:
                f.write(img_data)
            
            # Append metadata info
            metadata.append({
                "category": category,
                "image_url": img_url,
                "filename": filename,
                "resolution": f"{width}x{height}"
            })
            
            downloaded += 1
            print(f"Downloaded: {filename}")
        except Exception as e:
            print(f"Error downloading image {idx} for '{category}': {e}")
    return metadata
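
The 20 categories defined in Step 2 are single words, so interpolating them directly into the search URL works. If multi-word or punctuated categories are added later, the query would need URL-encoding. A minimal sketch using only the standard library; `build_search_url` is a hypothetical helper, not part of the original notebook:

```python
from urllib.parse import urlencode

def build_search_url(category):
    """Return a Google Images search URL with the query safely encoded."""
    # urlencode escapes spaces, ampersands, and other reserved characters
    params = urlencode({"q": category, "tbm": "isch"})
    return f"https://www.google.com/search?{params}"

print(build_search_url("Classic Cars"))
```

Inside `download_images_for_category`, the result of this helper could replace the hand-built `search_url` string.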

## Step 5: Download Images for All Categories and Save Metadata to CSV
- For each category, we call the `download_images_for_category` function.
- Then, we combine all metadata and save it into a CSV file for easy reference.

In [None]:
all_metadata = []
for cat in categories:
    print(f"\nProcessing category: {cat}")
    cat_metadata = download_images_for_category(cat, num_images=50)
    all_metadata.extend(cat_metadata)

# Save all metadata to a CSV file
csv_filename = os.path.join(base_dir, "metadata.csv")
with open(csv_filename, "w", newline='', encoding="utf-8") as csvfile:
    fieldnames = ["category", "image_url", "filename", "resolution"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # Write every metadata row in one call
    writer.writerows(all_metadata)

print(f"\nMetadata saved to {csv_filename}")
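
The `resolution` column is stored as a `WIDTHxHEIGHT` string, so it has to be parsed before it can drive preprocessing decisions. A minimal sketch that drops images below a minimum side length; `parse_resolution`, `filter_min_size`, and the 64-pixel threshold are illustrative assumptions, and the rows are inline sample data rather than real results:

```python
import csv
from io import StringIO

def parse_resolution(text):
    """Convert a 'WIDTHxHEIGHT' string such as '640x480' into (640, 480)."""
    width, height = text.lower().split("x")
    return int(width), int(height)

def filter_min_size(rows, min_side=64):
    """Keep rows whose shorter side is at least min_side pixels."""
    return [r for r in rows if min(parse_resolution(r["resolution"])) >= min_side]

# Inline sample in the same column layout as metadata.csv (made-up values)
sample = StringIO(
    "category,image_url,filename,resolution\n"
    "Nature,https://example.com/a.jpg,Nature_1.jpg,259x194\n"
    "Nature,https://example.com/b.jpg,Nature_2.jpg,40x40\n"
)
rows = list(csv.DictReader(sample))
kept = filter_min_size(rows)
print([r["filename"] for r in kept])
```

To run this against the real file, open `metadata.csv` from the `ImageDataset` folder and pass the file object to `csv.DictReader` in place of `sample`.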

## Step 6: Clean Up
- After downloading the images, we close the Selenium WebDriver.

In [None]:
driver.quit()
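
`driver.quit()` closes the browser, but the temporary profile directory created with `tempfile.mkdtemp()` in Step 3 is not removed automatically. A minimal sketch of deleting it with the standard library; `remove_profile_dir` is a hypothetical helper, and the demo uses a fresh temporary directory as a stand-in for `temp_user_data_dir`:

```python
import os
import shutil
import tempfile

def remove_profile_dir(path):
    """Delete a temporary profile directory; report whether it is gone."""
    # ignore_errors=True avoids raising if the directory was already removed
    shutil.rmtree(path, ignore_errors=True)
    return not os.path.exists(path)

# Stand-in for temp_user_data_dir from Step 3
demo_dir = tempfile.mkdtemp()
removed = remove_profile_dir(demo_dir)
print(removed)
```

In the notebook itself, the call would be `remove_profile_dir(temp_user_data_dir)` after `driver.quit()`.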

## Summary and Use Case

### **Dataset Name:** `ImageDataset`
 
### **Description:** 
- This dataset consists of images from 20 different categories (e.g., Nature, Animals, Cars, etc.), with up to 50 images per category.
- All images are stored in labeled folders, and the accompanying CSV file contains metadata such as the image URL, filename, and resolution.
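
Because the download function stops at 50 images but can save fewer when downloads fail, it may be worth checking the per-category counts before training. A minimal sketch over the metadata rows; `count_per_category`, `short_categories`, and the sample rows are illustrative assumptions:

```python
from collections import Counter

def count_per_category(rows):
    """Count metadata rows for each category."""
    return Counter(row["category"] for row in rows)

def short_categories(rows, categories, expected=50):
    """Return {category: count} for categories with fewer than expected images."""
    counts = count_per_category(rows)
    # Counter returns 0 for categories that have no rows at all
    return {c: counts[c] for c in categories if counts[c] < expected}

# Made-up rows standing in for all_metadata
rows = [{"category": "Nature"}] * 3 + [{"category": "Cars"}] * 2
print(short_categories(rows, ["Nature", "Cars", "Space"], expected=3))
```

With the notebook's variables, the call would be `short_categories(all_metadata, categories)`.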

### **Use Case:** 
- The dataset is suited to training image classification models with deep learning. Researchers and developers can use it to build and evaluate convolutional neural networks (CNNs) that classify images into their respective categories. The metadata can assist in preprocessing tasks such as resizing or augmentation.
