A Python scrapy project that crawls the websites of several organizations for store information, company information, or club information.

mikeym88/Store-Information-Crawler

Store Info Web Crawler

This crawler fetches data from the websites of various organizations (e.g. clubs, companies) to collect information about their store locations, clubs, or other company details, such as store name, location, coordinates, phone number, and operating hours. See the results folder for the crawler output.

General Notes

  • Either crawler 1 or crawler 2 was not working because robots.txt was being misread: the website's robots.txt allowed crawlers to access the specific URL, but Scrapy did not interpret it that way.
    • Workaround: set ROBOTSTXT_OBEY to False in settings.py
    • Further investigation needed.
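
The workaround above amounts to a one-line change in the project's settings.py. A minimal sketch follows; the comments are a reading of the note above, not text from the project's actual settings file:

```python
# settings.py (excerpt)
# Workaround from the note above: skip the robots.txt check so the spider
# is not blocked by the rule that Scrapy appeared to misread.
# Caution: this disables robots.txt handling for every spider in the project.
ROBOTSTXT_OBEY = False
```

To limit the change to the affected spider only, Scrapy also accepts a per-spider `custom_settings` class attribute, e.g. `custom_settings = {"ROBOTSTXT_OBEY": False}`.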

Running the crawlers

Use the following commands to run the crawlers.

Output as JSON file:

scrapy crawl <name> -o results/<name>.json

Output as CSV file:

scrapy crawl <name> -o results/<name>.csv -t csv
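
As an alternative to passing -o on every run, the outputs can be declared once in settings.py through Scrapy's FEEDS setting. This is only a sketch: it assumes Scrapy 2.1 or later and the same results folder used above, and it is not part of the project's current configuration. The `%(name)s` placeholder is replaced by the spider name.

```python
# settings.py (excerpt), assuming Scrapy 2.1 or later where FEEDS is available
FEEDS = {
    # One JSON file and one CSV file per spider, named after the spider.
    "results/%(name)s.json": {"format": "json"},
    "results/%(name)s.csv": {"format": "csv"},
}
```

With this in place, a plain `scrapy crawl <name>` would write both files.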

Crawlers

The crawlers need to be tested and updated regularly to make sure they still work.

Name                       Last Run
towncaredental             2020-07-15
rickysalldaygrillcanada    2020-07-15
jockey                     2020-07-15
rentking                   2020-07-15
uae_free                   2020-07-18
marketwatch_ipo            2020-07-15
maac                       2020-07-15

Pipelines

  • XlsxWriterPipeline takes the items from a spider and places them in an Excel spreadsheet. If the spider yields multiple items, they are placed in separate sheets of the Excel file.
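
The general idea of a sheet-per-item pipeline can be sketched as follows. This is not the project's XlsxWriterPipeline: the class name `SheetPerItemTypePipeline`, the output path, and the reading of "multiple items" as "multiple item classes" are all assumptions made for illustration. It relies on the third-party xlsxwriter package only when the spider closes.

```python
from collections import defaultdict


class SheetPerItemTypePipeline:
    """Hypothetical sketch: collect items per item class, one sheet per class."""

    def __init__(self, path="results/output.xlsx"):
        self.path = path  # assumed output location
        self.rows = defaultdict(list)  # sheet name -> list of item dicts

    def process_item(self, item, spider):
        # Group by the item's class name; Excel limits sheet names to 31 chars.
        sheet = type(item).__name__[:31]
        self.rows[sheet].append(dict(item))
        return item

    def close_spider(self, spider):
        # Third-party dependency, imported here so collecting items works without it.
        import xlsxwriter

        workbook = xlsxwriter.Workbook(self.path)
        for sheet, items in self.rows.items():
            worksheet = workbook.add_worksheet(sheet)
            # Header row: union of all field names, in first-seen order.
            fields = list(dict.fromkeys(key for it in items for key in it))
            worksheet.write_row(0, 0, fields)
            # Assumes field values are scalars (strings, numbers) that xlsxwriter can write.
            for row, it in enumerate(items, start=1):
                worksheet.write_row(row, 0, [it.get(f, "") for f in fields])
        workbook.close()
```

A pipeline like this would be enabled through the ITEM_PIPELINES setting in settings.py, using the module path where the class lives in the project.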

Notes

Crawler 5 "uae_free"

Resources

  1. Scrapy module for Python: https://docs.scrapy.org/en/latest/. Quick start-to-finish example: https://www.codementor.io/andy995/writing-a-simple-web-scraper-using-scrapy-myb7vrmgx
  2. XPath syntax: https://devhints.io/xpath. Use Google Chrome Inspector (Dev tools) to test XPath to access HTML nodes of a website; example: https://yizeng.me/2014/03/23/evaluate-and-validate-xpath-css-selectors-in-chrome-developer-tools/
  3. Network Log details/demo: https://developers.google.com/web/tools/chrome-devtools/network/
