Web Crawling 101 - Ongoing Project
Wondering what this is all about? Take 2 minutes to read our short Open Code, Open Data Manifesto.
This project is structured as a series of classes focused on bootstrapping your data-mining / web-crawling knowledge. Some of the topics covered here:
- Anatomy of a Crawler (Policies and Behaviors)
- Understanding HTTP Requests
- Scraping / Parsing data out of HTML pages
- Tooling (Frameworks and custom-made libraries)
- Finding your public source of data
- Modeling your objects
- Storing your results
- Scaling up your crawler
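To give a taste of the "Scraping / Parsing" topic above, here is a minimal sketch of pulling links out of an HTML page using only Python's standard library. The HTML sample and the `LinkExtractor` class are illustrative, not part of this project's codebase; the classes in this repo cover fetching real pages and richer tooling.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects every href value found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A tiny inline page so the example runs without hitting the network
sample_html = '<html><body><a href="/page1">One</a> <a href="/page2">Two</a></body></html>'

parser = LinkExtractor()
parser.feed(sample_html)
print(parser.links)  # ['/page1', '/page2']
```

In a real crawler you would feed the parser the body of an HTTP response instead of a hard-coded string, and apply politeness policies (covered in the "Anatomy of a Crawler" chapter) before fetching.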
How do I Start?
Keep this project's Wiki open at all times, since most of the text and references will be there for you to read as you advance through the chapters/classes of this project.
Start each chapter by going to the Wiki first, and only after reading its text, proceed to the code.
Take your time, read the code comments, run it, modify it and run it again to understand the impact of each change.
Happy hacking :)
- Install pip (using a terminal/command prompt, navigate to the "Setup" directory and run the install script)
- Reload your terminal/command prompt (open and close)
- Make sure pip is installed by running, for example: pip --version
- If it is, you can now install the needed dependencies by running from the root of the project:
pip install -U -r Setup/requirements.txt
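As a quick sanity check after the steps above, you can verify from Python itself that pip is visible to your interpreter. This is just one way to check, assuming you run it with the same Python that pip installs into:

```python
import importlib.util

# find_spec returns None when the module cannot be imported
spec = importlib.util.find_spec("pip")
print("pip is available" if spec is not None else "pip is MISSING")
```

If this reports pip as missing, your terminal's `pip` may belong to a different Python installation than the one you are running.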
Marcello Lins is passionate about technology and crunching data for fun. Feel free to connect with him through LinkedIn and find out more about what he is working on via his AboutMe profile. Visit https://techflow.me/ for more awesomeness!