<a href="https://colab.research.google.com/github/Aakaey181/RA_Porject_ML1/blob/main/Module1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Module 1 of 3: Automating Image Data Collection & Preprocessing** - *Collect and prepare “potholes vs. clean road” images using the DuckDuckGo Image Search API for end-to-end ML pipelines*
---



# **Introduction** <br>
This module walks through the first steps of an end-to-end machine learning workflow: collecting image data from the web using a programmatic API and preparing it for model training. You’ll learn how to automate bulk downloads of images of roads with and without potholes via the DuckDuckGo Image Search API, organize them into a clear folder structure, and apply basic cleaning and resizing so that standard deep-learning libraries can ingest the data with minimal effort. <br>

Following the existing notebook, we’ll:

- Mount and configure the environment (Google Drive, dependencies, imports).

- Download images automatically via the DuckDuckGo Image Search API.

- Organize files into positive and negative sample folders.

- Clean the dataset, removing corrupted files and ensuring consistent labeling.

- Build a tf.data.Dataset pipeline for efficient batching and preprocessing.

By the end of this module, you’ll have a clean, folder-structured dataset ready for model training in Module 2.

<br>


# **1. Environment Setup and Data Collection**



### 1.0 Basic Setup

Mount Google Drive to access and persist your data:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Installing Dependencies in a New Environment: <br>
When you work in a new environment, it is important to install all the necessary dependencies before running your code. In Google Colab, we use the !pip install command to install Python packages directly from a notebook cell. This ensures that all the required libraries are available for this notebook.

Here are the key packages needed in this notebook:

In [2]:
!pip install numpy pandas matplotlib
!pip install scikit-learn
!pip install tensorflow
!pip install tensorflow-addons
!pip install duckduckgo-search

Collecting tensorflow-addons
  Downloading tensorflow_addons-0.23.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting typeguard<3.0.0,>=2.7 (from tensorflow-addons)
  Downloading typeguard-2.13.3-py3-none-any.whl.metadata (3.6 kB)
Downloading tensorflow_addons-0.23.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (611 kB)
Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Installing collected packages: typeguard, tensorflow-addons
  Attempting uninstall: typeguard
    Found existing installation: typeguard 4.4.2
    Uninstalling typeguard-4.4.2:
      Successfully uninstalled typeguard-4.4.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
inflect 7.5.0 requires typeguard>=4.0.1, b

Importing Libraries: <br>
Import libraries such as requests, os, shutil, random, and TensorFlow/Keras components.

In [3]:
import os
import requests
import time
import shutil
import random
import numpy as np
import tensorflow as tf
from duckduckgo_search import DDGS

### 1.1 Dataset Directory Setup

In [4]:
# Base directory (where images will be saved)
BASE_DATA_DIR = "drive/MyDrive/ra/data"
RAW_IMAGE_DIR = os.path.join(BASE_DATA_DIR, "raw_images")
os.makedirs(RAW_IMAGE_DIR, exist_ok=True)

All raw downloads live under `RAW_IMAGE_DIR`, with one subfolder per search query, so subsequent steps can target each class by folder name.
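As a small illustration of how folder names are derived from queries, the sketch below mirrors the `search_query.replace(" ", "_")` convention used in `download_images`. The helper name `category_dir_for` is hypothetical and is not part of the notebook.

```python
import os

# Same layout as the notebook: <base>/raw_images/<query_with_underscores>
BASE_DATA_DIR = "drive/MyDrive/ra/data"
RAW_IMAGE_DIR = os.path.join(BASE_DATA_DIR, "raw_images")

def category_dir_for(search_query, save_dir=RAW_IMAGE_DIR):
    """Return the subfolder path for a search query (hypothetical helper)."""
    return os.path.join(save_dir, search_query.replace(" ", "_"))

print(category_dir_for("rural road"))
print(category_dir_for("potholes"))
```

Keeping this mapping in one place makes it easier to find a query's folder again when the negative-class folders are merged later.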

### 1.2 Image Downloading and Processing Logic
We begin by collecting images for both classes:

- Positive class: road images with potholes

- Negative class: road images without potholes (e.g., roads, highways, etc.)

A custom web-scraping script using the DuckDuckGo API downloads images and saves them to our Google Drive. Negative samples from various road-related queries are merged into a unified folder (e.g. road_without_potholes). The helper function below searches for, downloads, and saves the images:

In [5]:
def download_images(search_query, num_images=10, save_dir=RAW_IMAGE_DIR):
    """
    Downloads images from DuckDuckGo Image Search.

    Parameters:
    - search_query: (str) The search keyword (e.g., "potholes" or "rural road").
    - num_images: (int) Number of images to download.
    - save_dir: (str) Directory where images will be stored.
    """
    # Create a folder for the search category
    category_dir = os.path.join(save_dir, search_query.replace(" ", "_"))
    os.makedirs(category_dir, exist_ok=True)

    print(f"Searching for {num_images} images of '{search_query}'...")
    with DDGS() as ddgs:
        # Request more images than needed to filter out invalid ones
        image_results = ddgs.images(search_query, max_results=num_images * 2)

    downloaded = 0
    seen_urls = set()  # To avoid duplicate downloads

    for i, result in enumerate(image_results):
        url = result["image"]
        if url in seen_urls:
            continue
        seen_urls.add(url)

        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                # Save image to file
                file_path = os.path.join(category_dir, f"{search_query.replace(' ', '_')}_{i}.jpg")
                with open(file_path, "wb") as f:
                    f.write(response.content)
                downloaded += 1
                print(f"Downloaded: {file_path}")
        except Exception as e:
            print(f"Failed to download {url}: {e}")

        # Stop when enough images are downloaded
        if downloaded >= num_images:
            break

        time.sleep(1)  # Delay to avoid getting blocked

    print(f"Successfully downloaded {downloaded}/{num_images} images for '{search_query}' in {category_dir}")
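A response with status code 200 is not guaranteed to contain image data, yet `download_images` saves every such response with a `.jpg` name. One possible safeguard, sketched here as an assumption rather than as part of the notebook, is to inspect the first bytes of the payload before writing it to disk. The function name `looks_like_image` is hypothetical, and the signature check is only a heuristic (the two-byte BMP prefix in particular is loose).

```python
def looks_like_image(data: bytes) -> bool:
    """Heuristic: True if the payload starts with a common image signature."""
    signatures = (
        b"\xff\xd8\xff",       # JPEG
        b"\x89PNG\r\n\x1a\n",  # PNG
        b"GIF87a",             # GIF (1987 spec)
        b"GIF89a",             # GIF (1989 spec)
        b"BM",                 # BMP
    )
    # bytes.startswith accepts a tuple of candidate prefixes
    return data.startswith(signatures)

print(looks_like_image(b"\xff\xd8\xff\xe0" + b"\x00" * 16))  # JPEG-like header
print(looks_like_image(b"<!DOCTYPE html><html>"))            # an HTML error page
```

Inside `download_images`, a check such as `looks_like_image(response.content)` could be placed before the file is opened for writing, which would reduce the number of undecodable files that the cleaning step has to remove.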

**Note**: If your target changes (e.g., detecting zebra crossings instead of potholes), replace the query string 'potholes' with 'zebra crossing'. Then uncomment or adjust the corresponding `download_images(...)` call to fetch the appropriate images.

**Class Balance Reminder**: Keep positive and negative examples roughly equal (e.g., you may try 100 each, total 200) to help your model learn both classes effectively.

Positive samples: `download_images('potholes', 100)` <br>
Negative samples: run `download_images` with queries such as 'road' and 'highway', then merge the resulting subfolders into road_without_potholes/ using shutil.move (next step).<br>
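To keep the classes balanced when the negative class is spread over several queries, the per-query count can be derived from the positive-class target. The sketch below is one way to do that; `split_quota` is a hypothetical helper, not part of the notebook.

```python
def split_quota(total, queries):
    """Split `total` images as evenly as possible across `queries`."""
    base, extra = divmod(total, len(queries))
    # The first `extra` queries receive one additional image each.
    return {q: base + (1 if i < extra else 0) for i, q in enumerate(queries)}

neg_queries = ["road", "highway", "county road", "rural road", "empty road"]
quota = split_quota(100, neg_queries)
for query, n in quota.items():
    print(f"{query}: {n} images")
```

Each value could then be passed as `num_images` to `download_images` for the corresponding query.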



In [9]:
# TODO:
# Download Positive class: images with potholes
# Uncomment the following line of code
# download_images("potholes", num_images=100)


# TODO:
# Download Negative class: images without potholes
# Uncomment the following lines of code
# download_images("road", num_images=20)
# download_images("highway", num_images=20)
# download_images("county road", num_images=20)
# download_images("rural road", num_images=20)
# download_images("empty road", num_images=20)

Searching for 40 images of 'rural road'...
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_6.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_8.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_9.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_10.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_14.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_15.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_16.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_17.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_18.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_19.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_20.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_21.jpg
Downloaded: drive/MyDrive/ra/data/raw_images/rural_road/rural_road_22.jp

The folder structure will be:<br>

```plaintext
data
└── raw_images/
    ├── potholes/
    ├── road/
    ├── highway/
    └── .../
```



In [10]:
# TODO: Negative class queries (`neg_queries`):
#    - Edit the `neg_queries = [...]` list to include keywords for images **without** your target.
#    - For potholes: `neg_queries = ["road", "highway", "county road", "rural road", "empty road"]`.
neg_queries = ["road", "highway", "county road", "rural road", "empty road"]

#TODO:  Negative folder naming (`NEG_FOLDER_NAME`):
#    - Use `road_without_<target>` format. E.g., `NEG_FOLDER_NAME = "road_without_potholes"`.
#    - For zebra crossings, use `NEG_FOLDER_NAME = "road_without_zebra_crossing"`.
NEG_FOLDER_NAME = "road_without_potholes"

common_neg_folder = os.path.join(BASE_DATA_DIR, "raw_images", NEG_FOLDER_NAME)
os.makedirs(common_neg_folder, exist_ok=True)

for query in neg_queries:
    src_folder = os.path.join(BASE_DATA_DIR, "raw_images", query.replace(" ", "_"))
    if os.path.exists(src_folder):
        for file in os.listdir(src_folder):
            src_file = os.path.join(src_folder, file)
            dst_file = os.path.join(common_neg_folder, file)
            shutil.move(src_file, dst_file)
        os.rmdir(src_folder)

print("Merged negative samples into:", common_neg_folder)

Merged negative samples into: drive/MyDrive/ra/data/raw_images/road_without_potholes
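The merge above relies on file names being unique across query folders, which holds here because each file name is prefixed with its query. As a hedged sketch for cases where that is not guaranteed (for example, manually added files), the hypothetical helper `merge_folders` below renames a file when the destination name is already taken. The demo runs in a temporary directory so it does not touch Drive.

```python
import os
import shutil
import tempfile

def merge_folders(src_folders, dst_folder):
    """Move all files from src_folders into dst_folder, renaming on name clashes."""
    os.makedirs(dst_folder, exist_ok=True)
    moved = 0
    for src in src_folders:
        if not os.path.isdir(src):
            continue  # skip queries that produced no folder
        for name in sorted(os.listdir(src)):
            stem, ext = os.path.splitext(name)
            dst = os.path.join(dst_folder, name)
            n = 1
            # Append a counter until the destination name is free
            while os.path.exists(dst):
                dst = os.path.join(dst_folder, f"{stem}_{n}{ext}")
                n += 1
            shutil.move(os.path.join(src, name), dst)
            moved += 1
        os.rmdir(src)  # source folder is empty at this point
    return moved

# Demo: two folders that both contain a file named img_0.jpg
with tempfile.TemporaryDirectory() as root:
    for folder in ("road", "highway"):
        os.makedirs(os.path.join(root, folder))
        with open(os.path.join(root, folder, "img_0.jpg"), "wb") as f:
            f.write(b"demo")
    merged = os.path.join(root, "road_without_potholes")
    sources = [os.path.join(root, "road"), os.path.join(root, "highway")]
    count = merge_folders(sources, merged)
    merged_files = sorted(os.listdir(merged))
    print(count, merged_files)
```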


### 1.3 Data Cleaning <br>
After downloading, check for and remove files with invalid extensions or corrupted images. For example, iterate through the image file list, attempt to decode each file using TensorFlow functions, and delete any that fail to decode. Finally, print the number of valid images remaining.

- Assume `RAW_IMAGE_DIR` contains two subfolders: `potholes` (positive class) and `road_without_potholes` (negative class). We read all file paths and their labels into arrays.

#### 1.3.1 Class discovery and labeling

Ensure a clear folder structure under raw_images/: <br>
raw_images/ <br>
  ├── potholes/       # positive samples <br>
  └── road_without_potholes/    # negative samples <br>

In [15]:
# TODO: Set TARGET to the object you want to detect.
#       e.g. TARGET = "potholes"         → folders “potholes” & “road_without_potholes”
#            TARGET = "zebra_crossing"   → folders “zebra_crossing” & “road_without_zebra_crossing”
# pos_pattern and neg_pattern are built from TARGET;
# make sure exactly one folder matches each before moving on.
TARGET = "potholes"

pos_pattern = f"{TARGET}"
neg_pattern = f"road_without_{TARGET}"

all_folders = os.listdir(RAW_IMAGE_DIR)
pos_folders = [f for f in all_folders if f == pos_pattern]
neg_folders = [f for f in all_folders if f == neg_pattern]

if len(pos_folders) != 1 or len(neg_folders) != 1:
    raise ValueError(
        f"Expect one pos folder matching '{pos_pattern}' and one neg matching '{neg_pattern}'.\n"
        f"Found pos: {pos_folders}, neg: {neg_folders}"
    )

class_names = pos_folders + neg_folders
print("Classes found:", class_names)

file_paths, labels = [], []
for cls_name in class_names:
    cls_folder = os.path.join(RAW_IMAGE_DIR, cls_name)
    label = 1 if cls_name in pos_folders else 0
    for fname in os.listdir(cls_folder):
        file_paths.append(os.path.join(cls_folder, fname))
        labels.append(label)

file_paths = np.array(file_paths)
labels     = np.array(labels)

print(f"Total {TARGET} images:", len(file_paths),
      "| positives:", int(labels.sum()), "negatives:", int(len(labels)-labels.sum()))

Classes found: ['potholes', 'road_without_potholes']
Total potholes images: 428 | positives: 228 negatives: 200
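Section 1.3 mentions removing files with invalid extensions, which the cells here do not do explicitly. A minimal sketch of such a filter follows. The set `VALID_EXTENSIONS` and the helper `has_valid_extension` are assumptions for illustration, and the example paths are made up. Because `download_images` names every file `.jpg`, this check mainly matters for files added to the folders by other means.

```python
import os

# Assumed list of acceptable extensions (lowercase)
VALID_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

def has_valid_extension(path):
    """True if the file extension (case-insensitive) is in VALID_EXTENSIONS."""
    return os.path.splitext(path)[1].lower() in VALID_EXTENSIONS

# Made-up example paths and labels
paths = ["potholes/potholes_1.jpg", "potholes/notes.txt", "potholes/potholes_2.PNG"]
labels = [1, 1, 1]

kept = [(p, l) for p, l in zip(paths, labels) if has_valid_extension(p)]
print(f"Kept {len(kept)} of {len(paths)} files")
```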


#### 1.3.2 Remove Corrupted Files: <br>
Iterate through all file paths, check if they are valid images (using extensions and TensorFlow’s decoding functions), print details when a file is removed, and update our list of file paths.

In [16]:
# Check each file and remove it if TensorFlow cannot decode it
def check_corrupt(path):
    """Return True if the file decodes as an image; otherwise delete it and return False."""
    try:
        data = tf.io.read_file(path)
        _ = tf.image.decode_jpeg(data, channels=3)
        return True
    except Exception as e:
        print(f"Corrupted file removed: {path} - Error: {e}")
        os.remove(path)
        return False

final_paths = []
final_labels = []
for p, l in zip(file_paths, labels):
    if check_corrupt(p):
        final_paths.append(p)
        final_labels.append(l)

file_paths = np.array(final_paths)
labels = np.array(final_labels)
print("After removing corrupted files:", len(file_paths))


Corrupted file removed: drive/MyDrive/ra/data/raw_images/potholes/potholes_3.jpg - Error: {{function_node __wrapped__DecodeJpeg_device_/job:localhost/replica:0/task:0/device:CPU:0}} Unknown image file format. One of JPEG, PNG, GIF, BMP required. [Op:DecodeJpeg]
Corrupted file removed: drive/MyDrive/ra/data/raw_images/potholes/potholes_6.jpg - Error: {{function_node __wrapped__DecodeJpeg_device_/job:localhost/replica:0/task:0/device:CPU:0}} Unknown image file format. One of JPEG, PNG, GIF, BMP required. [Op:DecodeJpeg]
Corrupted file removed: drive/MyDrive/ra/data/raw_images/potholes/potholes_9.jpg - Error: {{function_node __wrapped__DecodeJpeg_device_/job:localhost/replica:0/task:0/device:CPU:0}} Unknown image file format. One of JPEG, PNG, GIF, BMP required. [Op:DecodeJpeg]
Corrupted file removed: drive/MyDrive/ra/data/raw_images/potholes/potholes_13.jpg - Error: {{function_node __wrapped__DecodeJpeg_device_/job:localhost/replica:0/task:0/device:CPU:0}} Unknown image file format. One 

In [18]:
# Re-check the class sizes after cleaning
print(f"Total {TARGET} images:", len(file_paths),
      "| positives:", int(labels.sum()), "negatives:", int(len(labels)-labels.sum()))

Total potholes images: 408 | positives: 215 negatives: 193
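Since cleaning removes a different number of files from each class, it can help to express the balance as a share rather than as raw counts. The sketch below uses the counts printed above; `class_balance` is a hypothetical helper, not part of the notebook.

```python
def class_balance(labels):
    """Return (positives, negatives, positive share) for a sequence of 0/1 labels."""
    labels = list(labels)
    positives = sum(1 for l in labels if l == 1)
    negatives = len(labels) - positives
    share = positives / len(labels) if labels else 0.0
    return positives, negatives, share

# Counts taken from the output above: 215 positives, 193 negatives
demo_labels = [1] * 215 + [0] * 193
pos, neg, share = class_balance(demo_labels)
print(f"positives: {pos}, negatives: {neg}, positive share: {share:.1%}")
```

If the share drifts far from 50%, downloading a few more images for the smaller class is one option.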


### 1.4 Wrapping Up Module 1: Creating a Dataset with tf.data.Dataset <br>

To wrap up this module, we package the image loading, preprocessing, and dataset-building helper functions into a single utility script, `ml_utils.py`. In Module 2 you’ll import and call its `create_dataset(...)` function directly, so you can jump straight into building and training your model without repeating the setup steps.

Run the following cell before moving on to Module 2.

In [19]:
%%writefile /content/drive/MyDrive/ra/ml_utils.py
import os, time, random, shutil, requests, numpy as np, tensorflow as tf
# from duckduckgo_search import DDGS
import tensorflow.keras.applications.mobilenet_v2 as mobilenet_v2

IMG_SIZE   = (224, 224)
BATCH_SIZE = 32

def load_and_preprocess(path, label):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    img = mobilenet_v2.preprocess_input(img)
    return img, label

def create_dataset(paths, labs):
    ds = tf.data.Dataset.from_tensor_slices((paths, labs))
    ds = ds.map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
    return ds


Overwriting /content/drive/MyDrive/ra/ml_utils.py
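To make the effect of `load_and_preprocess` and `create_dataset` more concrete, the pure-Python sketch below reproduces two pieces of arithmetic: the pixel scaling that MobileNetV2's `preprocess_input` applies (from [0, 255] to [-1, 1]) and the number of batches produced for a given dataset size. The helper names `scale_pixel` and `num_batches` are hypothetical. The batch count assumes the final partial batch is kept, which is the default for `Dataset.batch`.

```python
import math

BATCH_SIZE = 32  # same value as in ml_utils.py

def scale_pixel(value):
    """Map a pixel value from [0, 255] to [-1, 1] (x / 127.5 - 1)."""
    return value / 127.5 - 1.0

def num_batches(num_images, batch_size=BATCH_SIZE):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_images / batch_size)

print([scale_pixel(v) for v in (0, 127.5, 255)])
print(num_batches(408))  # 408 is the image count left after cleaning above
```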


In [20]:
# Verify it really is a plain file
!file /content/drive/MyDrive/ra/ml_utils.py
!head -n 3 /content/drive/MyDrive/ra/ml_utils.py

/content/drive/MyDrive/ra/ml_utils.py: Python script, ASCII text executable
import os, time, random, shutil, requests, numpy as np, tensorflow as tf
# from duckduckgo_search import DDGS
import tensorflow.keras.applications.mobilenet_v2 as mobilenet_v2
