# Counterfeit Product Image Detection - Colab Setup

This notebook sets up the environment for data collection on Google Colab.


## Optional: Authenticate with Hugging Face
If you added `HF_TOKEN` to your Colab secrets, run this cell to log in. Authenticated requests are less likely to hit rate limits, and a token is required to access gated datasets.

In [None]:
from google.colab import userdata
from huggingface_hub import login

try:
    # Retrieve token from Secrets
    hf_token = userdata.get('HF_TOKEN')

    # Login
    login(token=hf_token)
    print("\n✓ Successfully logged in to Hugging Face Hub")

except Exception as e:
    print(f"Login skipped or failed: {e}")
    print("Tip: Add 'HF_TOKEN' in the Secrets tab (🔑) on the left if you need authenticated access.")


✓ Successfully logged in to Hugging Face Hub


In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Step 1: Mount Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## Step 2: Navigate to Project Directory


In [None]:
import os
from pathlib import Path

# Set project directory (adjust path if needed)
PROJECT_DIR = Path('/content/drive/MyDrive/Masters/UIUC_FA2025/CS441/project/Counterfeit-Product-Image-Detection')

# Create project directory if it doesn't exist
PROJECT_DIR.mkdir(parents=True, exist_ok=True)

# Change to project directory
os.chdir(PROJECT_DIR)

print(f"Working in: {PROJECT_DIR}")
print(f"Current directory: {os.getcwd()}")


✓ Working in: /content/drive/MyDrive/Masters/UIUC_FA2025/CS441/project/Counterfeit-Product-Image-Detection
✓ Current directory: /content/drive/MyDrive/Masters/UIUC_FA2025/CS441/project/Counterfeit-Product-Image-Detection


## Step 3: Install Dependencies


In [None]:
# beautifulsoup4 provides the bs4 module imported by the scraper in Step 6
%pip install -q icrawler datasets requests pillow beautifulsoup4

print("Dependencies installed (icrawler added)")

✓ Dependencies installed (icrawler added)


## Step 4: Create Data Directories


In [None]:
# Define the 10 target brands
BRANDS = [
    'nike', 'adidas', 'jordan', 'yeezy', 'new balance',
    'converse', 'puma', 'reebok', 'under armour', 'vans'
]

# Create nested directory structure
data_dir = PROJECT_DIR / 'data' / 'sneakers'

for brand in BRANDS:
    # Create authentic and fake folders for each brand
    (data_dir / 'authentic' / brand).mkdir(parents=True, exist_ok=True)
    (data_dir / 'fake' / brand).mkdir(parents=True, exist_ok=True)

print(f"Directory structure created for {len(BRANDS)} brands under {data_dir}")
print(f"  Example: {data_dir / 'authentic' / 'nike'}")

✓ Directory structure created for 10 brands under /content/drive/MyDrive/Masters/UIUC_FA2025/CS441/project/Counterfeit-Product-Image-Detection/data/sneakers
  Example: /content/drive/MyDrive/Masters/UIUC_FA2025/CS441/project/Counterfeit-Product-Image-Detection/data/sneakers/authentic/nike


## Step 5: Check Storage


In [None]:
import shutil

# Check Drive storage
total, used, free = shutil.disk_usage("/content/drive/MyDrive")
print(f"Google Drive Storage:")
print(f"  Total: {total / (1024**3):.2f} GB")
print(f"  Used: {used / (1024**3):.2f} GB")
print(f"  Free: {free / (1024**3):.2f} GB")

# Check data folder size if it exists
data_path = PROJECT_DIR / 'data'
if data_path.exists():
    total_size = sum(f.stat().st_size for f in data_path.rglob('*') if f.is_file())
    print(f"\nCurrent data folder size: {total_size / (1024**3):.2f} GB")
else:
    print("\nData folder not found (will be created during data collection)")


Google Drive Storage:
  Total: 225.83 GB
  Used: 47.65 GB
  Free: 178.19 GB

Current data folder size: 0.04 GB


## Step 6: Run Data Collection

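The cell below collects images and saves them under `data/sneakers` in the project directory on Google Drive. Each run deletes and recreates a brand's folder before downloading, so earlier images for that brand are not kept.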


In [None]:
import os
import shutil
import requests
from bs4 import BeautifulSoup
import time
import urllib.parse
from pathlib import Path
from icrawler.builtin import BingImageCrawler
import logging
import re

# Suppress verbose icrawler logs
logging.getLogger('icrawler').setLevel(logging.ERROR)

# --- Configuration ---
# Ensure BRANDS and data_dir are defined from previous cells
# BRANDS = [...]
# data_dir = ...

NUM_REAL = 30
NUM_FAKE = 10

# URLs for authentic sites
OFFICIAL_SITES = {
    'nike': 'https://www.nike.com/w?q=sneakers&vst=sneakers',
    'adidas': 'https://www.adidas.com/us/men-athletic_sneakers',
    'jordan': 'https://www.nike.com/w/mens-jordan-shoes-37eefznik1zy7ok',
    'yeezy': 'https://www.shoebacca.com/collections/adidas-yeezy',
    'new balance': 'https://www.newbalance.com/shoes/?searchKey=sneakers&sm=Search%20Bar%20and%20Type%20Text',
    'converse': 'https://www.converse.com/search?q=sneakers',
    'puma': 'https://us.puma.com/us/en/search?q=Sneakers&offset=24',
    'reebok': 'https://www.reebok.com/pages/search-results?q=sneakers&current=2',
    'under armour': 'https://www.underarmour.com/en-us/c/mens/shoes/running/',
    'vans': 'https://www.vans.com/en-us/c/mens/shoes-1100?icn=subnav&filters={%22Product+Category%22:[%22Sneakers%22]}&sort=bestMatches'
}

# --- Helper Functions ---
def download_single_image(src, save_path):
    """Download src to save_path; return True only if an image was written."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(src, timeout=15, headers=headers)
        if response.status_code == 200:
            # Accept only responses the server labels as an image. The target
            # filename always ends in .jpg, so it says nothing about the content.
            content_type = response.headers.get("content-type", "")
            if content_type.lower().startswith("image/"):
                with open(save_path, "wb") as f:
                    f.write(response.content)
                return True
    except Exception:
        # Network and write errors count as a failed download
        pass
    return False

# --- 1. Real Scraper (Official + Bing Fallback) ---
def scrape_real_images(brand):
    print(f"\n[REAL] {brand}...")
    save_dir = data_dir / 'authentic' / brand

    # Clean directory first
    if save_dir.exists():
        shutil.rmtree(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)

    count = 0

    # A. Try Official Site First
    url = OFFICIAL_SITES.get(brand)
    if url:
        print(f"  - Scraping official site: {url}...")
        try:
            headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
            }
            page = requests.get(url, headers=headers, timeout=15)
            soup = BeautifulSoup(page.text, 'html.parser')
            images = soup.find_all("img")

            # Regex for price like $100 or $100.99
            price_pattern = re.compile(r'[$€£]\s*(\d+(?:\.\d{2})?)')

            for img in images:
                if count >= NUM_REAL: break

                src = img.get("src") or img.get("data-src")
                if not src: continue

                # Filter small icons/logos
                if "logo" in src.lower() or "icon" in src.lower() or ".svg" in src.lower():
                    continue

                # Attempt to find price in parent containers
                price_str = "0_00"
                try:
                    curr = img
                    # Traverse up to 4 parents to find a price
                    for _ in range(4):
                        if curr.parent:
                            curr = curr.parent
                            txt = curr.get_text()
                            match = price_pattern.search(txt)
                            if match:
                                # Format: 105.99 -> 105_99
                                price_str = match.group(1).replace('.', '_')
                                break
                except Exception:
                    pass

                # Absolute URL
                if src.startswith("//"):
                    src = "https:" + src
                elif src.startswith("/"):
                    # Handle relative URLs
                    parsed_url = urllib.parse.urlparse(url)
                    base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
                    src = urllib.parse.urljoin(base_url, src)

                # Download with price in filename
                # Format: brand_index_price.jpg
                filename = save_dir / f"{brand}_{count}_{price_str}.jpg"
                if download_single_image(src, filename):
                    count += 1
                    print(".", end="")
            print(f"\n    Got {count} images from official site.")
        except Exception as e:
            print(f"    Error scraping official site: {e}")

    # B. Fallback to Bing (if count < NUM_REAL)
    if count < NUM_REAL:
        needed = NUM_REAL - count
        print(f"  - Target not met ({count}/{NUM_REAL}). Scraping {needed} via Bing...")
        try:
            crawler = BingImageCrawler(storage={'root_dir': str(save_dir)}, downloader_threads=4)
            # IMPROVED QUERY: added 'footwear' and 'product' to disambiguate 'under armour' from 'under the table'
            query = f"{brand} sneakers footwear product photo white background"

            # file_idx_offset='auto' continues numbering after the highest file index already in save_dir
            crawler.crawl(keyword=query, max_num=needed, file_idx_offset='auto')
            print(f"    Bing crawler finished.")
        except Exception as e:
            print(f"    Error scraping Bing: {e}")


# --- 2. Fake Scraper (AliExpress + Reddit + iOffer + Bing Fallback) ---
def scrape_aliexpress(brand):
    print(f"  - Searching AliExpress for '{brand}'...")
    query = urllib.parse.quote(f"{brand} sneakers")
    url = f"https://www.aliexpress.com/wholesale?SearchText={query}&g=y"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Referer": "https://www.google.com/"
    }

    found_imgs = set()
    try:
        resp = requests.get(url, headers=headers, timeout=12)
        if resp.status_code == 200:
            soup = BeautifulSoup(resp.text, "html.parser")
            for img in soup.find_all("img"):
                raw = img.get("src") or img.get("data-src") or ""
                if not raw or "data:image" in raw or "base64" in raw: continue
                if raw.startswith("//"): raw = "https:" + raw

                # Loose filter for product images
                if ".jpg" in raw or ".png" in raw:
                    found_imgs.add(raw)
    except Exception:
        pass
    return list(found_imgs)

def scrape_reddit_bs4(brand):
    print(f"  - Searching Reddit for '{brand}'...")
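    # URL-encode the query so brands with spaces (e.g. 'new balance') form a valid URL
    query = urllib.parse.quote_plus(f"{brand} sneakers")
    url = f"https://old.reddit.com/r/Repsneakers/search?q={query}&restrict_sr=on&include_over_18=on&sort=relevance&t=all"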
    headers = {"User-Agent": "Mozilla/5.0"}
    images = []
    try:
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 200:
            soup = BeautifulSoup(resp.text, 'html.parser')
            for img in soup.find_all("img"):
                src = img.get("src")
                if src and "preview" in src and "redd.it" in src:
                    images.append(src)
    except Exception:
        pass
    return images

def scrape_ioffer(brand):
    # Collect image URLs from iOffer search results, keeping only those that answer a HEAD request with status 200
    print(f"  - Searching iOffer for '{brand}'...")

    # Search endpoint; the full query string is assembled per page below
    base_url = "https://www.ioffer.com/products_s"
    keywords = urllib.parse.quote_plus(f"{brand} sneakers")

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5"
    }

    valid_urls = []

    # Iterate through pagination (limit to 2 pages)
    for page in range(1, 3):
        # Format: https://www.ioffer.com/products_s?dropdown_selector=%23bottom_products_search_dropdown&keywords=nike+sneakers
        target_url = f"{base_url}?dropdown_selector=%23bottom_products_search_dropdown&keywords={keywords}&page={page}"

        try:
            # Handle redirects and timeouts
            resp = requests.get(target_url, headers=headers, timeout=10, allow_redirects=True)

            # Check if page is valid
            if resp.status_code != 200:
                print(f"    [Log] iOffer Page {page} failed: Status {resp.status_code}")
                continue

            soup = BeautifulSoup(resp.text, 'html.parser')

            # Find listing images (generic approach)
            images = soup.find_all("img")

            for img in images:
                src = img.get('src')
                if not src: continue

                # Filter out obvious UI elements
                if "logo" in src or "pixel" in src or "blank" in src:
                    continue

                if src.startswith("//"): src = "https:" + src
                if not src.startswith("http"): continue

                # Quick validity check (HEAD request)
                try:
                    head = requests.head(src, headers=headers, timeout=3)
                    if head.status_code == 200:
                        valid_urls.append(src)
                    else:
                        print(f"    [Log] Discarding broken image: {src} (Status {head.status_code})")
                except Exception:
                    print(f"    [Log] Validation failed for image: {src}")

        except Exception as e:
            print(f"    [Log] iOffer scraping error on page {page}: {e}")
            break

    return list(set(valid_urls))

def scrape_fake_images(brand):
    print(f"\n[FAKE] {brand}...")
    save_dir = data_dir / 'fake' / brand

    # Clean directory first
    if save_dir.exists():
        shutil.rmtree(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)

    # 1. Gather URLs from multiple sources
    imgs = scrape_aliexpress(brand)

    # Add Reddit if needed
    if len(imgs) < NUM_FAKE:
        imgs += scrape_reddit_bs4(brand)

    # Add iOffer if needed
    if len(imgs) < NUM_FAKE:
        imgs += scrape_ioffer(brand)

    # 2. Download
    count = 0
    seen = set()
    for url in imgs:
        if count >= NUM_FAKE: break
        if url in seen: continue
        seen.add(url)

        path = save_dir / f"{brand}_fake_{count}.jpg"
        if download_single_image(url, path):
            count += 1
            print(".", end="")

    print(f"\n    Downloaded {count} fake images from scraped sources.")

    # 3. Bing Fallback
    if count < NUM_FAKE:
        print(f"  - Target not met. Running Bing Crawler fallback...")
        try:
            crawler = BingImageCrawler(storage={'root_dir': str(save_dir)}, downloader_threads=4)
            q = f"{brand} fake sneakers replica"
            crawler.crawl(keyword=q, max_num=NUM_FAKE - count, file_idx_offset='auto')
        except Exception as e:
            print(f"    Bing fallback failed: {e}")

# --- MAIN EXECUTION ---
print("Starting Unified Scraper (Official/AliExpress/Reddit/iOffer -> Bing Fallback)...")
for brand in BRANDS:
    scrape_real_images(brand)
    # scrape_fake_images(brand)  # disabled here; uncomment to also collect fake images

print("\n✓ Collection process finished")

Starting Unified Scraper (Official/AliExpress/Reddit/iOffer -> Bing Fallback)...

[REAL] nike...
  - Scraping official site: https://www.nike.com/w?q=sneakers&vst=sneakers...
.........................
    Got 25 images from official site.
  - Target not met (25/30). Scraping 5 via Bing...


ERROR:downloader:Response status code 403, file https://sneakerbardetroit.com/wp-content/uploads/2023/01/Nike-Dunk-Low-Chicago-Split-University-Red-DZ2536-600-Release-Date.jpg
ERROR:downloader:Exception caught when downloading file https://www.sportsdirect.com/images/imgzoom/13/13120641_xxl.jpg, error: HTTPSConnectionPool(host='www.sportsdirect.com', port=443): Read timed out. (read timeout=5), remaining retry times: 2


    Bing crawler finished.

[REAL] adidas...
  - Scraping official site: https://www.adidas.com/us/men-athletic_sneakers...

    Got 0 images from official site.
  - Target not met (0/30). Scraping 30 via Bing...
    Bing crawler finished.

[REAL] jordan...
  - Scraping official site: https://www.nike.com/w/mens-jordan-shoes-37eefznik1zy7ok...
.........................
    Got 25 images from official site.
  - Target not met (25/30). Scraping 5 via Bing...
    Bing crawler finished.

[REAL] yeezy...
  - Scraping official site: https://www.shoebacca.com/collections/adidas-yeezy...
..
    Got 2 images from official site.
  - Target not met (2/30). Scraping 28 via Bing...


ERROR:downloader:Response status code 403, file https://sneakerbardetroit.com/wp-content/uploads/2020/12/adidas-Yeezy-Boost-350-V2-Ash-Blue.jpg
ERROR:downloader:Response status code 403, file https://sneakerbardetroit.com/wp-content/uploads/2017/03/adidas-yeezy-boost-350-v2-zebra.jpg
ERROR:downloader:Response status code 403, file https://sneakerbardetroit.com/wp-content/uploads/2020/01/adidas-Yeezy-Boost-350-V2-Earth-FX9033-Release-Date-On-Feet-6.jpg
ERROR:downloader:Response status code 404, file https://image.goat.com/transform/v1/attachments/product_template_additional_pictures/images/075/775/178/original/924555_01.jpg


    Bing crawler finished.

[REAL] new balance...
  - Scraping official site: https://www.newbalance.com/shoes/?searchKey=sneakers&sm=Search%20Bar%20and%20Type%20Text...

    Got 0 images from official site.
  - Target not met (0/30). Scraping 30 via Bing...
    Bing crawler finished.

[REAL] converse...
  - Scraping official site: https://www.converse.com/search?q=sneakers...

    Got 0 images from official site.
  - Target not met (0/30). Scraping 30 via Bing...


ERROR:downloader:Response status code 403, file https://www.converse.com/dw/image/v2/BCZC_PRD/on/demandware.static/-/Sites-cnv-master-catalog/default/dw3d987bc4/images/a_107/560845C_A_107X1.jpg
ERROR:downloader:Response status code 403, file https://www.converse.com.au/media/wysiwyg/19622_L_107X1.jpg
ERROR:downloader:Response status code 403, file https://www.converse.com/on/demandware.static/-/Library-Sites-SharedLibrary/default/dwaa9aecdd/firstspirit/media/store_locator___details/gbl_store_locator/keep_it_classic/2020_07_17/D-FA20-Store-classic-shoes.jpg
ERROR:downloader:Response status code 403, file https://www.converse.com/dw/image/v2/BJJF_PRD/on/demandware.static/-/Sites-cnv-master-catalog-we/default/dw5fd10496/images/k_08/162050C_K_08X1.jpg
ERROR:downloader:Response status code 403, file https://www.converse.com/dw/image/v2/BCZC_PRD/on/demandware.static/-/Sites-cnv-master-catalog/default/dw8711f6cf/images/d_08/M9160_D_08X1.jpg
ERROR:downloader:Response status code 403, file http

    Bing crawler finished.

[REAL] puma...
  - Scraping official site: https://us.puma.com/us/en/search?q=Sneakers&offset=24...
..............................
    Got 30 images from official site.

[REAL] reebok...
  - Scraping official site: https://www.reebok.com/pages/search-results?q=sneakers&current=2...
.............................
    Got 29 images from official site.
  - Target not met (29/30). Scraping 1 via Bing...
    Bing crawler finished.

[REAL] under armour...
  - Scraping official site: https://www.underarmour.com/en-us/c/mens/shoes/running/...

    Got 0 images from official site.
  - Target not met (0/30). Scraping 30 via Bing...


ERROR:downloader:Response status code 403, file https://static.vecteezy.com/system/resources/previews/011/773/363/original/preposition-of-place-illustration-little-girl-sitting-on-and-under-the-table-english-vocabulary-words-flashcard-set-for-education-vector.jpg
ERROR:downloader:Response status code 403, file https://static.vecteezy.com/system/resources/previews/011/773/367/non_2x/preposition-of-place-illustration-little-boy-sitting-on-and-under-the-table-english-vocabulary-words-flashcard-set-for-education-vector.jpg
ERROR:downloader:Response status code 403, file https://static.vecteezy.com/system/resources/previews/021/159/365/original/under-preposition-english-line-icon-illustration-vector.jpg
ERROR:downloader:Response status code 403, file https://static.vecteezy.com/system/resources/previews/016/558/276/large_2x/underwater-scene-ocean-coral-reef-underwater-sea-world-under-water-background-free-photo.jpg
ERROR:downloader:Response status code 400, file https://media.istockphoto.co

    Bing crawler finished.

[REAL] vans...
  - Scraping official site: https://www.vans.com/en-us/c/mens/shoes-1100?icn=subnav&filters={%22Product+Category%22:[%22Sneakers%22]}&sort=bestMatches...

    Got 0 images from official site.
  - Target not met (0/30). Scraping 30 via Bing...


ERROR:downloader:Exception caught when downloading file https://www.sportsdirect.com/images/imgzoom/24/24628018_xxl_a2.jpg, error: HTTPSConnectionPool(host='www.sportsdirect.com', port=443): Read timed out. (read timeout=5), remaining retry times: 2


    Bing crawler finished.

✓ Collection process finished
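
The log above reports how many images came from each official site, but not how many files the Bing fallback actually saved after its 403 and timeout errors. The following is a minimal sketch for checking the result on disk, assuming the `label/brand/` layout created in Step 4 and the `brand_index_price.jpg` naming used for official-site downloads. The helper names `count_images` and `price_from_filename` and the `IMAGE_SUFFIXES` set are hypothetical and not part of the notebook.

```python
from pathlib import Path
from typing import Optional

# Assumption: only these extensions are counted as images
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root: Path) -> dict:
    """Return {(label, brand): file_count} for a root laid out as label/brand/files."""
    counts = {}
    for label_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for brand_dir in sorted(p for p in label_dir.iterdir() if p.is_dir()):
            counts[(label_dir.name, brand_dir.name)] = sum(
                1
                for f in brand_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def price_from_filename(name: str) -> Optional[float]:
    """Decode the price from '<brand>_<index>_<whole>[_<cents>].jpg'.

    Returns None for the '0_00' placeholder and for names that do not follow
    the pattern, such as numerically named Bing downloads or '<brand>_fake_<n>.jpg'.
    """
    parts = Path(name).stem.split("_")
    # parts[0] is the brand (may contain spaces); the rest must be digits
    if len(parts) not in (3, 4) or not all(p.isdigit() for p in parts[1:]):
        return None
    whole = parts[2]
    cents = parts[3] if len(parts) == 4 else "00"
    price = float(f"{whole}.{cents}")
    return price if price > 0 else None
```

In the notebook, `count_images(data_dir)` would give one count per label and brand to compare against `NUM_REAL` and `NUM_FAKE`, and `price_from_filename(f.name)` could be applied to the files in an `authentic/<brand>` folder to see which images carry a scraped price.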
