The scraping corner is a repository where I keep track of all my scraping projects in a single place. Since I've been working on a bunch of them, I found it the most practical way to operate. Anytime I need scraping in a project, I develop the scraping part here and use it as a third-party API from the main project.
In this README file, I'll introduce the structure of this repo and the way I work with it, and keep a list of the websites covered.
Disclaimer : websites are living objects that evolve, which means that a scraping process working on a given date might run into difficulties the following day because the structure of the source website has changed. The spiders available in this repo have been designed for specific uses and are only maintained when necessary. That's why I'm also working on spider testing, to easily detect which selectors should be modified to fix an old spider.
Structure of the repo
├── .gitignore                <- lists the files not to upload to the git repository
├── README.md                 <- the top-level README : repo description
├── docker-compose.yml        <- running scraping env through docker and docker-compose
├── Dockerfile                <- running scraping env through docker and docker-compose
├── requirements.txt          <- contains necessary packages (incorporated in Dockerfile)
├── notebook                  <== notebooks used to prepare the css selectors before writing a spider
│   ├── *.ipynb               <- usually one notebook per spider or per website
│   ├── *.ipynb               <- usually one notebook per spider or per website
│   ├── ...
├── scrapy_project            <== contains all scripts relative to scrapy
│   ├── scrapy.cfg            <- default scrapy file. Used to deal with multiple projects.
│   ├── run_spider.py         <- starts the scraping process
│   ├── spider_dispatch.json  <- configures the scraping process
│   ├── ProjectWebsite1       <== folder created by scrapy for a first project (see "Using scrapy" for more explanations)
│   ├── ProjectWebsite2       <== folder created by scrapy for a second project (see "Using scrapy" for more explanations)
│   ├── ...
├── scripts                   <== contains specific scripts for common tasks while scraping
│   ├── count_line.py         <- counts the number of items released by a running scraping
│   ├── display_info.py       <- displays information about the items released by a running scraping
│   ├── jl_to_df.py           <- converts a json line file into a pandas DataFrame
├── selenium                  <== contains all scripts relative to selenium
│   ├── selenium_basics.py    <- basic selenium actions wrapped into a Python class to simplify the exploration process
│   ├── chromedriver          <- not uploaded, but this is its location
├── template                  <== template files
│   ├── ...                   <- (TBD)
├── tor                       <== experimentation to crawl the darkweb
Using scrapy
The main tool I use to perform web scraping is Scrapy. The scrapy_project folder contains all the scrapy projects created with the scrapy startproject project_name command provided by the framework.
The list of these projects is given in a later section, with brief details and a status for each of them.
For the bigger scraping projects, I try to maintain a README file directly in the project folder.
Default Scrapy files
When you launch a new project, scrapy automatically creates the following structure :
├── project_name/
│   ├── scrapy.cfg
│   ├── project_name/
│   │   ├── spiders
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
- scrapy.cfg : this configuration file has the structure of an INI file and contains two sections, [settings] and [deploy]. It is also used to manage multiple projects.
- It is necessary to have it at the root folder of a project to perform the
- spiders : the objects that perform the scraping itself. We have to create spider_*.py files in this folder.
- items.py : defines the structure of the items released while scraping
- settings.py : defines the crawling parameters
- middlewares.py : hooks that customise how requests and responses are processed
- pipelines.py : post-processes the released items (cleaning, validation, storage)
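To make the scrapy.cfg layout described above more concrete, here is a minimal sketch that parses such a file with Python's standard configparser module. The project name project_name and the commented-out deploy URL are placeholder values, and read_scrapy_cfg is a hypothetical helper that is not part of this repo.

```python
import configparser

# Example scrapy.cfg content; "project_name" is a placeholder value.
SCRAPY_CFG = """
[settings]
default = project_name.settings

[deploy]
#url = http://localhost:6800/
project = project_name
"""


def read_scrapy_cfg(text):
    """Return the sections of a scrapy.cfg-style INI text as plain dicts."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


config = read_scrapy_cfg(SCRAPY_CFG)
# The [settings] section maps a project alias to its settings module.
print(config["settings"]["default"])
```

Lines starting with # are treated as comments by configparser, so the deploy URL is ignored until it is uncommented.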
Custom files to leverage the scraping-corner
- run_spider.py : used to run any spider from the repo with the following bash command
python run_spider.py -s SPIDER_ACRONYM
SPIDER_ACRONYM being a reference to the
- When running a spider, a new folder scraped_data will be created at the root of the scraping folder. Files will be saved in it.
- It is also possible to change the log level while crawling. The library used is logzero, and the log levels (Debug/Info/Warn/Debug) are defined in the spider class. Default value in here is
- Finally, if you want a different starting point for the crawl, you will need to change the self.start_urls values directly inside the spiders' constructors.
- count_line.py : as running a spider might take a very long time (up to days if no limit is specified), this script lets you know the size of the file you are creating.
- Use the following command to get your answer :
python count_line.py scraped_data\your_file.py
- jl_to_df.py : here you will find a few lines to convert your jl file into an easy-to-manipulate pandas DataFrame, which is much more convenient for data science analysis.
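The two helper tasks described above (counting released items and loading a jl file for analysis) can be approximated as follows. This is only a sketch, not the repo's actual scripts: the function names count_lines, jl_to_records and jl_to_df are hypothetical, and it assumes the jl file holds one JSON object per line.

```python
import json
import tempfile
from pathlib import Path


def count_lines(path):
    """Count the non-empty lines, i.e. the items written so far to a jl file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def jl_to_records(path):
    """Parse a JSON Lines file into a list of dicts, one per released item."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def jl_to_df(path):
    """Load the records into a pandas DataFrame (requires pandas)."""
    import pandas as pd  # imported lazily so the helpers above work without it

    return pd.DataFrame(jl_to_records(path))


# Small demonstration on a throwaway file with two made-up items.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_items.jl"
    sample.write_text(
        '{"title": "a", "price": 1}\n{"title": "b", "price": 2}\n',
        encoding="utf-8",
    )
    n_items = count_lines(sample)
    records = jl_to_records(sample)

print(n_items, records[0]["title"])
```

Counting lines works while a spider is still running because each released item is appended as its own line.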
About scraped data
The scraped_data folder does not exist in the git repo and is listed in the .gitignore file, so that it never appears. It is created automatically during a spider run, and you will find your data in it.
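As an illustration of the behaviour described above, the sketch below parses the -s option and creates the scraped_data folder on demand. It is an assumption about how such a launcher could be written, not the content of run_spider.py: the helpers parse_args and output_path and the output file naming are hypothetical.

```python
import argparse
import tempfile
from pathlib import Path


def parse_args(argv=None):
    """Parse the -s option used by `python run_spider.py -s SPIDER_ACRONYM`."""
    parser = argparse.ArgumentParser(description="Run a spider by acronym.")
    parser.add_argument("-s", dest="spider", required=True,
                        help="spider acronym, e.g. PV")
    return parser.parse_args(argv)


def output_path(root, acronym):
    """Create scraped_data under root if needed and return the output file path."""
    data_dir = Path(root) / "scraped_data"
    data_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return data_dir / f"{acronym.lower()}.jl"  # file naming is an assumption


# Demonstration in a temporary directory instead of the real repo root.
args = parse_args(["-s", "PV"])
with tempfile.TemporaryDirectory() as tmp:
    target = output_path(tmp, args.spider)
    folder_created = target.parent.is_dir()

print(target.name, folder_created)
```

Because the folder is created with exist_ok=True, running several spiders in a row reuses the same scraped_data folder.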
About Spider testing
As mentioned in the introduction
- Spider works
- Full project : Hackathon for XHEC students 2020
- Needs further processing
- Tried to scrape with Splash (useful lines still in spider and settings code)
- Spider does not work
- Uses Splash
- Code to improve
LBC : leboncoin
- Spider is 80% working
- Trying to repair it, but temporarily banned from LBC
- Unit testing in progress (to be finished when the ban is over or after using proxies)
PV : ParuVendu
- Works Fine
- Logging not updated
- Code to clean
- No unit testing (not in a hurry, as the structure has not changed in more than one year)
SL : SeLoger
- Does not work : seems to be a lot of things to improve
TA : TripAdvisor
- Multiple spiders
- airlines : not working
- hotels : not working
- restaurant (information about restos) : working but quality to be improved
- restaurant (reviews of restos) : working but stops after a few iterations (only 30k reviews for Paris => understand why)
- Some parts are working
To-Do & Improvements
- Create the jl_to_df.py file
- Add a way to set max_page (at each level)
- Add a config file to move the parameters out of the scripts (start_urls would also be in it)
- Define and create an interface to select the previously mentioned parameters and run spiders.
- Data Camp : a course about how to use scrapy
- ParseHub : a basic tool to perform scraping without coding
- Selenium : a tool to simulate user interaction with a browser
- scraping wikipedia : definitions about scraping
- Scrapy website : a huge treasure for scrapers
- yielding multiple items in scrapy : https://stackoverflow.com/questions/39227277/can-scrapy-yield-different-kinds-of-items