vape-finder-scraper

Description

A collection of web scraping scripts that facilitate content discovery for the BC Vape Finder project.

How To Use

Requires Node.js and Git.

# Clone this repository
$ git clone https://github.com/MiguellDomingues/vape-finder-scraper

# Go into the repository dir
$ cd vape-finder-scraper

# Install dependencies
$ npm install

# Run the app
$ node init

On first run, the following directories are created in the project root to hold the input and output files:

scraper/
├── inventory/
│   ├── surreyvapes.JSON
│   └── ...
├── logs/
│   ├── surreyvapes.log
│   └── ...
└── raw_pages/
    ├── surreyvapes/
    │   └── products.JSON
    └── ...../
        └── products.JSON

Configuration

Control which scripts run on each execution by commenting or uncommenting the corresponding require(..)(..) statement in init.js:

Promise.all([
  // web scrapers for each of the target websites
  require("./scripts/ezvape")(ezvapes_config),
  require("./scripts/thunderbirdvapes")(tbvapes_config),
  require("./scripts/surreyvapes")(surreyvapes_config)
]).then(() => {
  // write to the database by reading the JSON files in scraper/inventory;
  // runs after all of the scripts above have completed
  require("./scripts/inventory")(inventory_config)
})
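Note that Promise.all rejects as soon as any one of its inputs rejects, so a single failing scraper would skip the inventory step entirely. The sketch below shows one possible alternative using Promise.allSettled, which waits for every scraper and reports the failures. It assumes each script module returns a promise; runScrapers and the demoScrapers stand-ins are hypothetical names that are not part of this repository.

```javascript
// Hypothetical helper (not in the repository): run every scraper to completion
// and return the names of the ones that failed.
// `scrapers` is an array of { name, run } objects where run() returns a promise.
async function runScrapers(scrapers) {
  const results = await Promise.allSettled(scrapers.map((s) => s.run()));
  const failed = [];
  results.forEach((result, i) => {
    if (result.status === "rejected") failed.push(scrapers[i].name);
  });
  return failed;
}

// Stand-in scrapers for illustration only. In init.js, run could instead be
// () => require("./scripts/surreyvapes")(surreyvapes_config), and so on.
const demoScrapers = [
  { name: "surreyvapes", run: () => Promise.resolve() },
  { name: "ezvape", run: () => Promise.reject(new Error("network error")) },
];

runScrapers(demoScrapers).then((failed) => {
  console.log("failed scrapers:", failed);
});
```

With this shape, the inventory step could still run for the vendors that succeeded, at the cost of having to decide what to do with stale files from the vendors that failed.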

Each script is passed in a config object containing control flags:

const tbvapes_config = {
  //      run the web scraping function
  //      save results to scraper/thunderbirdvapes/products.JSON 
  execute_scrape:     true,

  //      read products.JSON , clean the scraped products
  //      save to scraper/inventory/thunderbirdvapes.JSON     
  execute_inventory:  true     
}

const surreyvapes_config = {
  //       run the web scraping function
  //       save results to scraper/surreyvapes/products.JSON
  execute_scrape:      true,

  //      read products.JSON, clean the scraped products
  //      save to scraper/inventory/surreyvapes.JSON    
  execute_inventory:   true     
}

const ezvapes_config = {
  //      run the web scraping function to generate 3 files in scraper/ezvapes
  //      products.JSON, brand_links.JSON, category_links.JSON
  exec_scrape_products__category_brand_links:   true,
  
  //      read brand_links.JSON to generate crawlable links
  //      run the web scraping function to generate brand_ids.JSON
  exec_scrape_brand_ids:                        true,
  
  //      read category_links.JSON to generate crawlable links,
  //      run the web scraping function to generate category_ids.JSON
  exec_scrape_category_ids:                     true,
  
  //      read products.JSON, brand_ids.JSON, category_ids.JSON,
  //      add brand, category to matching products
  //      clean the scraped products, save to scraper/inventory/ezvapes.JSON  
  exec_inventory:                               true,
}

const inventory_config = {
  //      write collections to database/collections/(timestamp).JSON
  write_collections_JSON:  true,

  //      true writes to the local MongoDB instance, false writes to the Atlas instance
  write_local_db:          true,

  //      true writes/overwrites collections in the db
  execute_db_write:        true
}
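The flags above suggest that each script runs its steps conditionally. As a rough sketch of that pattern (not the repository's actual implementation), a script could check each flag before running the matching step. makeScript, scrape, and buildInventory are hypothetical names introduced here for illustration.

```javascript
// Hypothetical factory (not in the repository): builds a script function that
// runs only the steps enabled in the config object it is called with.
// `scrape` and `buildInventory` are placeholder async functions.
function makeScript({ scrape, buildInventory }) {
  return async function run(config) {
    const stepsRun = [];
    if (config.execute_scrape) {
      await scrape(); // e.g. fetch pages and save products.JSON
      stepsRun.push("scrape");
    }
    if (config.execute_inventory) {
      await buildInventory(); // e.g. clean products.JSON and save the inventory file
      stepsRun.push("inventory");
    }
    return stepsRun;
  };
}

// Usage with stand-in steps: skip the scrape and only rebuild the inventory.
const demoScript = makeScript({
  scrape: async () => {},
  buildInventory: async () => {},
});

demoScript({ execute_scrape: false, execute_inventory: true }).then((stepsRun) => {
  console.log("steps run:", stepsRun);
});
```

Under this assumption, setting execute_scrape to false would let the cleaning step be re-run against a previously saved products.JSON without requesting the vendor's site again.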

Libraries

Potential Improvements

Instead of cleaning and categorizing scraped products manually in code, raw scrapes could be passed as inputs to a data cleaning tool (or a web service such as https://trudo.ai)
Use TypeScript to enforce strict type checking on function call inputs and return values
E-Juices could be sub-categorized across flavour types
Clean up the brand names (GCORE, G-CORE, GCore should all be 'Gcore'); this would deduplicate some brand names across vendors
Save and self-host scraped images to minimize bandwidth usage for the original image providers (~1,500 products at ~15 kB per image ≈ 22.5 MB of raw storage)
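
The brand name cleanup mentioned above could start from something like the following minimal sketch. It assumes that variants of one brand differ only in letter case and punctuation; brandKey and canonicalBrand are hypothetical helpers that do not exist in this repository.

```javascript
// Hypothetical helper: reduce a brand name to a comparison key by
// lowercasing it and dropping everything except letters and digits.
function brandKey(name) {
  return name.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Hypothetical helper: derive a single display form from the key by
// capitalizing its first character.
function canonicalBrand(name) {
  const key = brandKey(name);
  if (key === "") return name.trim(); // nothing usable to normalize
  return key.charAt(0).toUpperCase() + key.slice(1);
}

const variants = ["GCORE", "G-CORE", "GCore"];
console.log(variants.map(canonicalBrand)); // every entry becomes "Gcore"
```

Because the key also drops spaces, multi-word brand names would be merged into one word, so a real cleanup would likely pair the key with a lookup table of preferred display names.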
