A web scraping microservice
- ADA - gives list of locations [scraped DE, IL, partial PA]
- set target['state'] variable; TODO - scrape all states
- run from the node_scraper root folder:
  $ node app/dentists_ada.js
- YELP - IL, PA (202 without address / 11,756 with address)
- YELLOW PAGES - a. scrape directionsURL on first attempt
- GOOGLE MAPS [LATER]
- SECRETARY OF STATE [LATER]
- BUSINESS LICENSING DATABASE [LATER] - https://delpros.delaware.gov/OH_VerifyLicense
- add postal abbrev https://pe.usps.com/text/pub28/28apc_002.htm
- add National Address Database from Dept of Transportation: https://www.transportation.gov/gis/national-address-database/national-address-database-nad-disclaimer
- http://us-cities.survey.okfn.org/dataset/property-transfers
- 2captcha is $0.50 for 1000 (5% of a cent per captcha)
- scraperapi is $29 for 250K requests per month (1.16% of a cent per request)
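
The per-unit figures above can be checked with a few lines of Node. This is only a sketch: the prices are the ones quoted in these notes and may be out of date, and `centsPerUnit`, `estimateCostUsd`, and the 10% captcha rate are hypothetical names and values chosen for illustration.

```javascript
// Convert a bundle price in USD into cents per single unit (captcha or request).
function centsPerUnit(priceUsd, units) {
  return (priceUsd * 100) / units;
}

// Prices as quoted in the notes (assumed still current).
const captchaCents = centsPerUnit(0.5, 1000);   // 2captcha: $0.50 per 1000 captchas
const proxyCents = centsPerUnit(29, 250000);    // scraperapi: $29 per 250K requests

// Hypothetical estimate: every page costs one proxied request, and a
// fraction `captchaRate` of pages additionally needs one solved captcha.
function estimateCostUsd(pages, captchaRate) {
  const cents = pages * proxyCents + pages * captchaRate * captchaCents;
  return cents / 100;
}

console.log('cents per captcha:', captchaCents.toFixed(4));
console.log('cents per request:', proxyCents.toFixed(4));
console.log('USD for 10,000 pages at a 10% captcha rate:',
  estimateCostUsd(10000, 0.1).toFixed(2));
```

At these prices the proxy is the larger cost unless the captcha rate is above roughly 23%, the point where captcha spend per page equals proxy spend per page.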
- Kaiser Permanente
- Blue Cross Blue Shield
- UnitedHealthCare - https://dentalsearch.yourdentalplan.com/providersearch
- Aetna
- Cigna
- HCSC
- Molina Healthcare
- Anthem
- Centene
- Humana
- CVS Health
- MCNA Health Care
- WellCare
Notes from https://www.youtube.com/watch?v=ITPDmVaOou0&t=932s
- Proxy Server (Datacenter, Residential, Specialized, Super Proxy)
- Proxy rotation and session management - the same cookie should always be sent from the same IP; residential proxies can be reserved for high-value targets only
- Headless browsers (JS rendering) - Puppeteer, Playwright, Secret Agent
- Fingerprinting - browser, operating system, browser extensions, audio/video hardware; the fingerprint needs to match the browser and be paired to a session
- Public API scraping (avoid parsing HTML, may require a session which can be created/refreshed from a browser)
- Mobile API scraping using a man-in-the-middle proxy to reverse engineer mobile APIs - install a custom certificate on the phone and man-in-the-middle software on your laptop
- Solving Challenges/Recaptcha -
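
The session-management and fingerprinting notes above (same cookie from the same IP, fingerprint paired to a session) could be sketched as below. This is a minimal illustration, not code from this repo: `SessionPool`, `requestOptions`, the proxy URLs, and the user-agent strings are all hypothetical, and a real fingerprint covers much more than a User-Agent header.

```javascript
// Pins each logical session to one proxy, one user agent, and one cookie jar,
// so the same cookies are always sent from the same IP with the same fingerprint.
class SessionPool {
  constructor(proxies, userAgents) {
    this.proxies = proxies;       // hypothetical proxy endpoints
    this.userAgents = userAgents; // stand-in for a fuller browser fingerprint
    this.sessions = new Map();
    this.next = 0;                // round-robin counter used for rotation
  }

  // Return the session for a key, creating it on first use.
  get(key) {
    if (!this.sessions.has(key)) {
      const i = this.next++;
      this.sessions.set(key, {
        proxy: this.proxies[i % this.proxies.length],
        userAgent: this.userAgents[i % this.userAgents.length],
        cookies: new Map(),
      });
    }
    return this.sessions.get(key);
  }

  // Drop a blocked session; the next get() starts fresh on the next proxy.
  retire(key) {
    this.sessions.delete(key);
  }
}

// Build the options a request would carry for this session.
// The shape of the returned object is an assumption, not a specific HTTP client's API.
function requestOptions(session) {
  const cookie = [...session.cookies].map(([k, v]) => `${k}=${v}`).join('; ');
  return {
    proxy: session.proxy,
    headers: { 'User-Agent': session.userAgent, Cookie: cookie },
  };
}

const pool = new SessionPool(
  ['http://proxy-1.example:8000', 'http://proxy-2.example:8000', 'http://proxy-3.example:8000'],
  ['UA-1', 'UA-2']
);
const yelp = pool.get('yelp');
yelp.cookies.set('sid', 'abc');
console.log(requestOptions(pool.get('yelp')));
```

Calling `pool.retire('yelp')` after a block discards the cookies together with the proxy assignment, so a retired cookie is never replayed from a different IP.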