vape-finder-scraper

Description

A collection of web scraping scripts to facilitate content discovery for the BC Vape Finder project

How To Use

Requires Node.js and Git.

# Clone this repository
$ git clone https://github.com/MiguellDomingues/vape-finder-scraper

# Go into the repository directory
$ cd vape-finder-scraper

# Install dependencies
$ npm install

# Run the app
$ node init

On the first run, directories containing the input and output files are created in the project root:

scraper/
├── inventory/
│   ├── surreyvapes.JSON
│   └── ...
├── logs/
│   ├── surreyvapes.log
│   └── ...
└── raw_pages/
    ├── surreyvapes/
    │   └── products.JSON
    └── .../
        └── products.JSON
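The scripts create these directories themselves; purely for illustration, first-run setup along these lines would produce the layout above. This is a sketch only, assuming Node's built-in fs and path modules, and ensureDirs is a hypothetical helper rather than actual project code:

const fs = require("fs");
const path = require("path");

// Hypothetical helper: create each output directory if it does not exist yet.
// { recursive: true } also creates intermediate directories and is a no-op
// when the directory is already present.
function ensureDirs(root, dirs) {
  dirs.forEach((dir) => fs.mkdirSync(path.join(root, dir), { recursive: true }));
}

ensureDirs(__dirname, [
  "inventory",
  "logs",
  "raw_pages/surreyvapes",
  "raw_pages/ezvapes",
  "raw_pages/thunderbirdvapes",
]);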

Configuration

Control which scripts run on each execution by commenting or uncommenting the corresponding require(..)(..) statement in init.js:

Promise.all([
    // web scrapers for each of the target websites
    require("./scripts/ezvape")(ezvapes_config),
    require("./scripts/thunderbirdvapes")(tbvapes_config),
    require("./scripts/surreyvapes")(surreyvapes_config)
]).then(() => {
    // write to the database: reads the JSON files in scraper/inventory,
    // runs after all of the above scripts have completed
    require("./scripts/inventory")(inventory_config)
})
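Note that Promise.all rejects as soon as any one scraper fails, so the inventory step is skipped entirely if a single site is unreachable. A possible alternative (not what init.js does today) is Promise.allSettled, which waits for every scraper to finish regardless of outcome:

Promise.allSettled([
    require("./scripts/ezvape")(ezvapes_config),
    require("./scripts/thunderbirdvapes")(tbvapes_config),
    require("./scripts/surreyvapes")(surreyvapes_config)
]).then((results) => {
    // log any scrapers that failed, then write whatever inventory succeeded
    results
      .filter((r) => r.status === "rejected")
      .forEach((r) => console.error("scraper failed:", r.reason));
    return require("./scripts/inventory")(inventory_config)
})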

Each script is passed a config object containing control flags:

const tbvapes_config = {
  //      run the web scraping function
  //      save results to scraper/raw_pages/thunderbirdvapes/products.JSON
  execute_scrape:     true,

  //      read products.JSON, clean the scraped products
  //      save to scraper/inventory/thunderbirdvapes.JSON
  execute_inventory:  true
}
const surreyvapes_config = {
  //      run the web scraping function
  //      save results to scraper/raw_pages/surreyvapes/products.JSON
  execute_scrape:      true,

  //      read products.JSON, clean the scraped products
  //      save to scraper/inventory/surreyvapes.JSON
  execute_inventory:   true
}
const ezvapes_config = {
  //      run the web scraping function to generate 3 files in scraper/raw_pages/ezvapes:
  //      products.JSON, brand_links.JSON, category_links.JSON
  exec_scrape_products__category_brand_links:   true,

  //      read brand_links.JSON to generate crawlable links,
  //      run the web scraping function to generate brand_ids.JSON
  exec_scrape_brand_ids:                        true,

  //      read category_links.JSON to generate crawlable links,
  //      run the web scraping function to generate category_ids.JSON
  exec_scrape_category_ids:                     true,

  //      read products.JSON, brand_ids.JSON, category_ids.JSON,
  //      add brand and category to matching products,
  //      clean the scraped products, save to scraper/inventory/ezvapes.JSON
  exec_inventory:                               true,
}
const inventory_config = {
  //      write collections to database/collections/(timestamp).JSON
  write_collections_JSON:  true,

  //      true writes to the local mongodb instance, false writes to the Atlas instance
  write_local_db:          true,

  //      true writes/overwrites the collections in the db
  execute_db_write:        true
}
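The per-site scripts implement these stages differently, but a module consuming its flags has roughly the shape below. The sketch is illustrative only: scrapeSite and cleanProducts are stand-ins for the real crawl and cleaning logic, not functions from this repository.

const fs = require("fs");

// Hypothetical stand-ins for the real per-site crawl and cleaning logic.
async function scrapeSite() { return []; }
function cleanProducts(rawProducts) { return rawProducts; }

module.exports = async function run(config) {
  if (config.execute_scrape) {
    // stage 1: crawl the site and persist the raw results
    const products = await scrapeSite();
    fs.writeFileSync("scraper/raw_pages/surreyvapes/products.JSON",
                     JSON.stringify(products, null, 2));
  }
  if (config.execute_inventory) {
    // stage 2: re-read the raw results, clean them, save the inventory file
    const raw = JSON.parse(
      fs.readFileSync("scraper/raw_pages/surreyvapes/products.JSON", "utf8"));
    fs.writeFileSync("scraper/inventory/surreyvapes.JSON",
                     JSON.stringify(cleanProducts(raw), null, 2));
  }
};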

Libraries

Potential Improvements

  • Instead of cleaning and categorizing scraped products manually with code, we could pass the raw scrapes as inputs to a data cleaning tool (or a web service such as https://trudo.ai)
  • Use TypeScript to enforce strict type checking on function call inputs and return values
  • E-Juices could be sub-categorized by flavour type
  • Clean up brand names (GCORE, G-CORE and GCore should all be 'Gcore'); this would deduplicate some brand names across vendors (a normalization sketch follows this list)
  • Save and self-host scraped images to minimize bandwidth usage for the original image providers (~1,500 products × ~15 kB per image ≈ 22.5 MB of raw storage)
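As a sketch of the brand cleanup idea, a simple normalizer could strip punctuation and whitespace and then title-case the result (normalizeBrand is illustrative, not existing project code):

// Strip non-alphanumerics, lowercase, then capitalize the first letter,
// so "GCORE", "G-CORE" and "GCore" all collapse to "Gcore".
function normalizeBrand(name) {
  const squashed = name.replace(/[^a-z0-9]/gi, "").toLowerCase();
  return squashed.charAt(0).toUpperCase() + squashed.slice(1);
}

console.log(["GCORE", "G-CORE", "GCore"].map(normalizeBrand)); // [ 'Gcore', 'Gcore', 'Gcore' ]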
