jpigla/MREIDs-from-SERPs

Get MREIDs from Google SERPs

⚠ Disclaimer

This software is not authorized by Google and does not follow Google's robots.txt. Scraping without Google's explicit written permission violates their terms and conditions on scraping and could lead to a lawsuit.

Requirements

Local Environment

NPM-Packages

Installation

  1. Download the latest project release, extract it, and (if desired) move the folder to your home directory
  2. Check whether Node and NPM are already installed
  • Open Terminal
  • Type node -v in Terminal to check the NodeJS version number (and whether it is installed)
  • Type npm -v in Terminal to check the NPM version number (and whether it is installed)
  • If not, install Homebrew (from https://brew.sh/index_de; Mac) and then NodeJS with brew update && brew install node
  3. In Terminal, move to the project folder (type cd folder/ if you named the project folder "folder")
  4. Install the required NPM packages by typing npm install in Terminal

Usage

Run the script with arguments (--headless=false is optional)

  • npm run scrape -- --kw=<KEYWORD> [--headless=false]
  • node get_mreids.js --kw=<KEYWORD> [--headless=false]

Examples

  • npm run scrape -- --kw=firefox --headless=false
  • node get_mreids.js --kw=firefox --headless=false
  • npm run scrape -- --kw=barack+obama
  • node get_mreids.js --kw=barack+obama
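
The flags above follow a simple --key=value pattern. As a rough illustration of how such arguments could be read in Node, here is a minimal sketch. The function name parseArgs is hypothetical, and the default of running headless unless --headless=false is passed is an assumption; the actual get_mreids.js may handle this differently.

```javascript
// Hypothetical helper (not taken from get_mreids.js): parse "--key=value" flags.
function parseArgs(argv) {
  // Assumed default: run headless unless --headless=false is given.
  const options = { kw: null, headless: true };
  for (const arg of argv) {
    const match = /^--([^=]+)=(.*)$/.exec(arg);
    if (!match) continue; // ignore anything that is not --key=value
    const key = match[1];
    const value = match[2];
    if (key === 'kw') options.kw = value;
    if (key === 'headless') options.headless = value !== 'false';
  }
  if (!options.kw) {
    throw new Error('Missing required argument --kw=<KEYWORD>');
  }
  return options;
}

// In a real script the input would be process.argv.slice(2).
console.log(parseArgs(['--kw=firefox', '--headless=false']));
```

With a plain node call, process.argv.slice(2) holds exactly the flags typed after the script name. With npm run scrape, the extra -- is what makes npm forward the flags to the script.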

What happens here

  • Puppeteer (headless browser; Chromium) opens the first SERP for the input keyword
  • We extract the MREID if available and look for the "Über XX weitere ansehen" link (German for "View over XX more") in the Knowledge Graph
  • We click that link and get a carousel of entities on the next SERP
  • We extract the URLs and names of the entities from the carousel
  • We open each URL in a new tab, wait for the load event and extract the MREID
  • We close the browser and export the list to CSV / Terminal
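
Parts of the steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in get_mreids.js: all function names are hypothetical, the search URL format is assumed, and the sketch assumes the entity ID can be read from a kgmid= URL parameter in the page markup, which may not match how the script really finds it. The carousel step (clicking the link and reading entries via XPath) is left out because the selectors are not documented here.

```javascript
// Sketch only: helper names, the search URL and the "kgmid" pattern are assumptions.

// Build the URL of the first SERP; "+" in the keyword is treated as a space.
function buildSerpUrl(kw) {
  const params = new URLSearchParams({ q: kw.replace(/\+/g, ' ') });
  return `https://www.google.com/search?${params}`;
}

// Return the first entity ID found in a "kgmid=" parameter, or null if none.
function extractMreid(html) {
  const match = /kgmid=((?:\/|%2F)[mg](?:\/|%2F)[0-9a-z_]+)/i.exec(html);
  return match ? decodeURIComponent(match[1]) : null;
}

// Serialize rows as CSV, quoting fields that contain commas, quotes or newlines.
function toCsv(rows) {
  const quote = (value) => {
    const text = value == null ? '' : String(value);
    return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const lines = rows.map((r) => [r.name, r.url, r.mreid].map(quote).join(','));
  return ['name,url,mreid', ...lines].join('\n');
}

// Open each entity URL in its own tab, wait for the load event, read the ID.
async function scrapeEntities(entities, headless = true) {
  const puppeteer = require('puppeteer'); // loaded lazily; requires npm install
  const browser = await puppeteer.launch({ headless });
  try {
    const rows = [];
    for (const { name, url } of entities) {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'load' });
      rows.push({ name, url, mreid: extractMreid(await page.content()) });
      await page.close();
    }
    return rows;
  } finally {
    await browser.close(); // always close the browser, even on errors
  }
}
```

The pure helpers (buildSerpUrl, extractMreid, toCsv) can be tried without a browser. scrapeEntities needs the puppeteer package and would be called with a list such as the one extracted from the carousel, for example scrapeEntities([{ name, url }]).then((rows) => console.log(toCsv(rows))).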

Help & Information

Changelog

22.10.2019 (1.1.3, 1.2.3)

  • Fix XPath for carousel extraction
  • Add argument for headless (optional)

16.10.2019 (1.1.2)

  • Fix XPath for carousel extraction

11.10.2019 (1.1.1)

  • Enhance extraction of MREID from SERP (some entity SERPs show MREID differently, now catch 'em all!)

04.10.2019 (1.0.1)

  • Fix extraction of MREID from SERP (different approach because of layout change in SERP)

02.10.2019 (1.0)

  • Initial Upload
  • Functional version

License

All assets and code are under the GPL v3 License unless specified otherwise.

About

Get MREIDs of Entities from Google SERPs with Puppeteer (2019)
