
# IoT Product Scraper

This notebook extracts IoT product details from product page links.

**Features:**
- Extracts product name, supplier details, prices, and images  
- Generates unique product IDs based on technology abbreviations  
- Saves results into Excel, Word, and Text files


# Imports

Explanation:

os: Checks whether the Excel, Word, and text output files already exist.

requests: Downloads the HTML of the product page.

BeautifulSoup: Parses the HTML so that the product name, supplier, prices, and images can be extracted.

pandas: Holds the scraped row as a table and writes it to Excel (writing .xlsx files also requires the openpyxl package).

docx.Document: Writes the product details to a Word file (provided by the python-docx package).

In [4]:

import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
from docx import Document
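
Before running the cells below, it can help to confirm that the third-party packages are installed. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical and not part of the notebook, and the list assumes the usual import names (`bs4` for BeautifulSoup, `docx` for python-docx, `openpyxl` as the engine pandas uses for .xlsx files).

```python
from importlib.util import find_spec

def missing_packages(module_names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in module_names if find_spec(name) is None]

# Import names assumed for this notebook's dependencies.
REQUIRED = ["requests", "bs4", "pandas", "docx", "openpyxl"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```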


# Abbreviation Mapping

Explanation:

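Creates a dictionary that maps IoT technology keywords to the abbreviations used in product IDs.

Example: a product name containing "Bluetooth" gets the abbreviation "BLE" in its ID.

The dictionary is searched in insertion order and the first matching keyword wins, so specific keywords such as "NB-IoT" must appear before general ones such as "IoT".

This keeps product IDs consistent across differently worded product names.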

In [5]:

# Abbreviation mapping for IoT keywords
TECH_ABBREVIATIONS = {
    "LoRaWAN": "LoRaWAN",
    "Wi-Fi HaLow": "WiFiHaLow",
    "Z-Wave": "ZWave",
    "BLE": "BLE",
    "Bluetooth": "BLE",   # in case product name has Bluetooth
    "RFID": "RFID",
    "UHF": "UHF",
    "NFC": "NFC",
    "LF": "LF",
    "NB-IoT": "NBIoT",
    "NB IoT": "NBIoT",
    "GPS": "GPS",
    "IoT": "IoT"
}
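
Because lookup order matters, a quick sanity check on the mapping can catch a keyword that would never be reached. The sketch below is illustrative only: `find_shadowed_keywords` is a hypothetical helper, not part of the notebook, and it assumes the first-match, case-insensitive substring lookup described in this notebook.

```python
def find_shadowed_keywords(mapping):
    """Return (earlier, later) pairs where `earlier` is a substring of `later`.

    With first-match-wins substring lookup, `later` can never be selected,
    because any name that contains it also contains `earlier`.
    """
    originals = list(mapping)
    keys = [k.lower() for k in originals]
    shadowed = []
    for i, earlier in enumerate(keys):
        for j in range(i + 1, len(keys)):
            if earlier in keys[j]:
                shadowed.append((originals[i], originals[j]))
    return shadowed

good_order = {"NB-IoT": "NBIoT", "IoT": "IoT"}
bad_order = {"IoT": "IoT", "NB-IoT": "NBIoT"}

print(find_shadowed_keywords(good_order))  # []
print(find_shadowed_keywords(bad_order))   # [('IoT', 'NB-IoT')]
```

A shadowed pair is harmless when both keywords map to the same abbreviation, so the result is best read as a list of entries to review rather than a list of errors.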


# Product ID Generator

🔹 Explanation:

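Maintains an in-memory counter for each abbreviation so that IDs increment within a session. The counters reset whenever the kernel restarts, so numbering starts again at 001 in a new session.

The first keyword in TECH_ABBREVIATIONS found in the product name determines the abbreviation. The match is a case-insensitive substring test, so short keywords such as "BLE" or "LF" also match inside longer words such as "Portable" or "Shelf".

Example:

First BLE product → NCRTek-BLE-001

Second BLE product → NCRTek-BLE-002

If no keyword matches, a generic ID with the abbreviation GEN is assigned.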

In [6]:

# Counter memory for product IDs
product_counters = {}

def generate_product_id(product_name):
    for keyword, abbr in TECH_ABBREVIATIONS.items():
        if keyword.lower() in product_name.lower():
            product_counters.setdefault(abbr, 0)
            product_counters[abbr] += 1
            return f"NCRTek-{abbr}-{product_counters[abbr]:03d}"
    # fallback if no keyword matched
    product_counters.setdefault("GEN", 0)
    product_counters["GEN"] += 1
    return f"NCRTek-GEN-{product_counters['GEN']:03d}"
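
Substring matching means that a name such as "Portable Cable Organizer" would be classified as BLE. One possible alternative, sketched here under the assumption that keywords should only match as whole tokens, uses regular-expression lookarounds. `find_abbreviation` and `SAMPLE_ABBREVIATIONS` are hypothetical names introduced for illustration. The notebook itself uses the plain `in` test shown above.

```python
import re

# Trimmed copy of the notebook's mapping, kept small for illustration.
SAMPLE_ABBREVIATIONS = {
    "BLE": "BLE",
    "Bluetooth": "BLE",
    "LF": "LF",
    "NB-IoT": "NBIoT",
    "IoT": "IoT",
}

def find_abbreviation(product_name, mapping=SAMPLE_ABBREVIATIONS):
    """Return the abbreviation of the first keyword that appears as a whole token."""
    for keyword, abbr in mapping.items():
        # Lookarounds reject matches that are glued to other letters or digits.
        pattern = rf"(?<![A-Za-z0-9]){re.escape(keyword)}(?![A-Za-z0-9])"
        if re.search(pattern, product_name, flags=re.IGNORECASE):
            return abbr
    return None

print(find_abbreviation("Portable Cable Organizer"))  # None
print(find_abbreviation("BLE 5.0 Beacon"))            # BLE
print(find_abbreviation("NB-IoT Smart Meter"))        # NBIoT
print(find_abbreviation("Shelf Label with LF Tag"))   # LF
```

The trade-off is that inflected forms such as "BLEs" no longer match, which may or may not be desirable for a given catalogue.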


# Main Scraper Function

🔹 Explanation:

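Downloads the page with requests, sending a browser-like User-Agent header and using a 10-second timeout.

Extracts the following elements:

Product Name → the h1 inside div.product-title-container.

Supplier Info → the a inside span.company-name (link text and href).

Prices → loops over each div.price-item and joins the quantity range with its price (for example, 1 piece - 99 pieces $175).

Images → collects every img inside div[data-testid='media-image'], skips 80x80 thumbnails, and prefixes protocol-relative URLs with https:.

Returns a six-element tuple (product_name, product_id, supplier_name, supplier_link, all_prices, image_urls). Every element is None if the response status is not 200 or an exception is raised.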

In [7]:

def extract_product_details(product_url):
    try:
        headers = {"User-Agent": "Mozilla/5.0"}
        response = requests.get(product_url, headers=headers, timeout=10)
        
        if response.status_code != 200:
            return None, None, None, None, None, None
        
        soup = BeautifulSoup(response.text, "html.parser")
        
        # ---- Product Name ----
        product_name = None
        product_id = None
        title_container = soup.find("div", class_="product-title-container")
        if title_container:
            h1_tag = title_container.find("h1")
            if h1_tag:
                product_name = h1_tag.text.strip()
                product_id = generate_product_id(product_name)
        
        # ---- Supplier ----
        supplier_name, supplier_link = None, None
        supplier_span = soup.find("span", class_="company-name")
        if supplier_span:
            supplier_tag = supplier_span.find("a")
            if supplier_tag and supplier_tag.get("href"):
                supplier_link = supplier_tag["href"]
                supplier_name = supplier_tag.text.strip()
        
        # ---- Prices ----
        prices = []
        price_blocks = soup.find_all("div", class_="price-item")
        for block in price_blocks:
            qty_block = block.find("div", class_="id-mb-2")
            qty_text = " ".join(qty_block.stripped_strings) if qty_block else ""
            
            price_span = block.find("span")
            price_text = price_span.text.strip() if price_span else ""
            
            if qty_text and price_text:
                prices.append(f"{qty_text} {price_text}")
        all_prices = "\n".join(prices) if prices else None

        # ---- Images ----
        image_urls = []
        main_images = soup.select("div[data-testid='media-image'] img")
        for img in main_images:
            src = img.get("src")
            if src and not src.endswith("80x80.jpg"):  # skip thumbnails
                if src.startswith("//"):
                    src = "https:" + src
                image_urls.append(src)

        return product_name, product_id, supplier_name, supplier_link, all_prices, image_urls
    except Exception:
        return None, None, None, None, None, None


# Execution + Saving

🔹 Explanation:

Takes input → the product URL.

Runs the scraper → gets all details.

If a supplier name and link were found, appends the results to three files (creating each one on first use):

Excel (.xlsx) → Table format for analysis.

Word (.docx) → Formal supplier report.

Text (.txt) → Quick reference log.

Prints a success or failure message.

In [9]:

product_url = input("Enter the product page link: ").strip()

product_name, product_id, supplier_name, supplier_link, all_prices, image_urls = extract_product_details(product_url)

if supplier_name and supplier_link:
    # ---------- Excel ----------
    excel_filename = "product_data.xlsx"
    new_row = pd.DataFrame([[product_url, product_name, product_id, supplier_name, supplier_link, all_prices]], 
                           columns=["Product URL", "Product Name", "Product ID", "Supplier Name", "Supplier Link", "Prices"])
    
    if os.path.exists(excel_filename):
        df = pd.read_excel(excel_filename)
        df = pd.concat([df, new_row], ignore_index=True)
    else:
        df = new_row
    df.to_excel(excel_filename, index=False)

    # ---------- Word ----------
    doc_filename = "product_data.docx"
    if os.path.exists(doc_filename):
        doc = Document(doc_filename)
    else:
        doc = Document()
        doc.add_heading("Supplier & Product Information", 0)
    
    doc.add_paragraph(f"Product URL: {product_url}")
    doc.add_paragraph(f"Supplier Name: {supplier_name}")
    doc.add_paragraph(f"Supplier Link: {supplier_link}")
    if all_prices:
        doc.add_paragraph("Prices:\n" + all_prices)
    if product_name:
        doc.add_paragraph(f"Product Name: {product_name}")
    if product_id:
        doc.add_paragraph(f"Product ID: {product_id}")

    if image_urls:
        doc.add_paragraph("Product Images:")
        for url in image_urls:
            doc.add_paragraph(url)

    doc.add_paragraph("-" * 40)
    doc.save(doc_filename)

    # ---------- Text ----------
    txt_filename = "product_data.txt"
    # Append mode creates the file if it does not exist yet
    with open(txt_filename, "a", encoding="utf-8") as f:
        f.write("Supplier & Product Information\n")
        f.write(f"Product URL: {product_url}\n")
        f.write(f"Supplier Name: {supplier_name}\n")
        f.write(f"Supplier Link: {supplier_link}\n")
        if all_prices:
            f.write("Prices:\n" + all_prices + "\n")
        if image_urls:
            f.write("Product Images:\n")
            f.write("\n".join(image_urls) + "\n")
        if product_name:
            f.write(f"Product Name: {product_name}\n")
        if product_id:
            f.write(f"Product ID: {product_id}\n")
        f.write("-" * 40 + "\n")

    print("\n✅ Product details extracted successfully!\n")
else:
    print("\n⚠️ Could not extract supplier information from this link.\n")


Enter the product page link:  https://www.alibaba.com/product-detail/Skylab-Environmental-ABS-PC-4G-Wireless_60835110608.html?spm=a2700.details.buy_together.11.4a43534aERaTrt



✅ Product details extracted successfully!
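
The ID counters live only in memory, so a new session starts numbering at 001 again even though product_data.xlsx keeps the earlier rows. One way to avoid duplicate IDs, sketched here as an assumption rather than as part of the notebook, is to rebuild the counters from previously saved IDs. `seed_counters` and `ID_PATTERN` are hypothetical names, and the pattern assumes the NCRTek-ABBR-NNN format produced by `generate_product_id`.

```python
import re

ID_PATTERN = re.compile(r"^NCRTek-([A-Za-z0-9]+)-(\d+)$")

def seed_counters(existing_ids):
    """Build a counter dict holding the highest number seen per abbreviation."""
    counters = {}
    for value in existing_ids:
        match = ID_PATTERN.match(str(value))
        if not match:
            continue  # ignore blanks, NaN, and IDs in another format
        abbr, number = match.group(1), int(match.group(2))
        counters[abbr] = max(counters.get(abbr, 0), number)
    return counters

saved_ids = ["NCRTek-BLE-001", "NCRTek-BLE-007", "NCRTek-GPS-002", None, "n/a"]
print(seed_counters(saved_ids))  # {'BLE': 7, 'GPS': 2}
```

In the notebook, the result could be merged into `product_counters` (for example with `product_counters.update(...)`) before the first call to `generate_product_id`, using the values from the "Product ID" column of product_data.xlsx.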

