Scrapple |version| documentation
The Internet is a huge source of information. Many people use data from the Internet for activities such as research and analysis. Data extraction is a primary step in data mining and analysis, so extracting content from structured web pages is a vital task when the Internet is the principal source of data.
The current standards in web structure involve the use of CSS selectors or XPath expressions to select particular tags from which information can be extracted. Web pages are structured as element trees which can be parsed to traverse through the tags. This tree structure, which represents tags as parent/children/siblings, is very useful when tags should be represented in terms of the rest of the web page structure.
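The element tree idea described above can be sketched with Python's standard library. This is only a minimal illustration, not Scrapple's implementation: the sample page, the ``post`` class name and the variable names are assumptions made for the example, and ``xml.etree.ElementTree`` supports only a limited subset of XPath and requires well-formed markup, so real-world pages generally need a more lenient HTML parser.

```python
import xml.etree.ElementTree as ET

# A small, well-formed page used only for illustration (hypothetical content).
PAGE = """
<html>
  <body>
    <div class="post">
      <h1>First title</h1>
      <a href="/first">Read more</a>
    </div>
    <div class="post">
      <h1>Second title</h1>
      <a href="/second">Read more</a>
    </div>
  </body>
</html>
"""

# Parse the markup into an element tree; `root` is the <html> element.
root = ET.fromstring(PAGE)

# XPath-style selection: every <h1> that is a child of a <div class="post">.
titles = [h1.text for h1 in root.findall(".//div[@class='post']/h1")]

# The same kind of expression selects links; the data lives in an attribute.
links = [a.get("href") for a in root.findall(".//div[@class='post']/a")]

# Parent/child traversal: iterating an element yields its direct children.
child_tags = [[child.tag for child in div] for div in root.iter("div")]

print(titles)
print(links)
print(child_tags)
```

The selector expressions describe where a tag sits relative to the rest of the page (a heading inside a ``div`` of a given class), which is the property that makes the tree representation useful for extraction.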
Scrapple is a project aimed at designing a framework for building web content extractors. Scrapple uses key-value based configuration files to define parameters to be considered in generating the extractor. It considers the base page URL, selectors for the data to be extracted, and the selector for the links to be crawled through. At its core, Scrapple abstracts the implementation of the extractor, focussing more on representing the selectors for the required tags. Scrapple can be used to generate single page content extractors or link crawlers.
Web content extraction is a common task when collecting data for analysis, and several existing frameworks aid in this task. This chapter gives a brief introduction to Scrapple, with instructions on setting up the development machine to run it.
.. toctree::
   :maxdepth: 2
   :hidden:

   intro/overview
   intro/existing
   intro/requirements
   intro/install

- :doc:`intro/overview`
- An introduction to Scrapple
- :doc:`intro/existing`
- A review of existing systems
- :doc:`intro/requirements`
- Hardware and software requirements to run Scrapple
- :doc:`intro/install`
- Instructions for installing Scrapple and the required dependencies
Creating web content extractors requires a good understanding of the following topics: web page structure, tag selector expressions, and the data formats involved in handling data.
In this chapter, a brief overview of the concepts behind Scrapple is given.
.. toctree::
   :maxdepth: 2
   :hidden:

   concepts/structure
   concepts/selectors
   concepts/formats

- :doc:`concepts/structure`
- The basics of web page structure and element trees
- :doc:`concepts/selectors`
- An introduction to tag selector expressions
- :doc:`concepts/formats`
- The primary data formats involved in handling data
This section deals with how Scrapple works: the architecture of the Scrapple framework, the commands and options provided by the framework, and the specification of the configuration file.
.. toctree::
   :maxdepth: 2
   :hidden:

   framework/basic
   framework/commands
   framework/config

- :doc:`framework/basic`
- The architecture of the Scrapple framework
- :doc:`framework/commands`
- Commands provided by the Scrapple CLI
- :doc:`framework/config`
- The configuration file which is used by Scrapple to implement the required extractor/crawler
This section deals with the implementation of the Scrapple framework. This includes an explanation of the classes involved in the framework, the interaction scenarios for each of the commands supported by Scrapple, and utility functions that form a part of the implementation of the extractor.
.. toctree::
   :maxdepth: 2
   :hidden:

   implementation/classes
   implementation/interaction
   implementation/cli
   implementation/commands
   implementation/selectors
   implementation/utils

- :doc:`implementation/classes`
- The classes involved in the implementation of Scrapple
- :doc:`implementation/interaction`
- Interaction scenarios in the implementation of each of the Scrapple commands
- :doc:`implementation/cli`
- The Scrapple command line interface
- :doc:`implementation/commands`
- The implementation of the command classes
- :doc:`implementation/selectors`
- The implementation of the selector classes
- :doc:`implementation/utils`
- Utility functions that support the implementation of the extractor
In this section, some experiments with Scrapple are provided. There are two main types of tools that can be implemented with the Scrapple framework: single page linear extractors and link crawlers.
Once you've :doc:`installed Scrapple <intro/install>`, you can see the list of available :ref:`commands <framework-commands>` and the related options using the command::

    $ scrapple --help

The :ref:`configuration file <framework-config>` is the backbone of Scrapple. It specifies the base page URL, selectors for the data extraction, the follow link for the link crawler and several other parameters.
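To make the role of the configuration file concrete, the following is a rough sketch of what a key-value configuration covering these parameters might look like. Every key name here (``project_name``, ``selector_type``, ``scraping``, ``url``, ``data``, ``next``, ``follow_link``), the example URL and the use of JSON as the file format are assumptions made for illustration; the authoritative specification is the :ref:`configuration file <framework-config>` reference.

```python
import json

# Hypothetical configuration: all key names and values below are illustrative
# assumptions, not the documented Scrapple schema.
config = {
    "project_name": "example_crawler",
    "selector_type": "xpath",
    "scraping": {
        # Base page URL where extraction starts.
        "url": "http://example.com/articles",
        # Selectors for the data to extract from the base page.
        "data": [
            {"field": "heading", "selector": "//h1"},
        ],
        # Follow link for a link crawler, with selectors applied on each
        # page reached through that link.
        "next": [
            {
                "follow_link": "//a[@class='article']",
                "scraping": {
                    "data": [
                        {"field": "title", "selector": "//h1"},
                        {"field": "body", "selector": "//div[@id='content']"},
                    ]
                },
            }
        ],
    },
}

# Serialize to a key-value text format and read it back, as a tool would.
text = json.dumps(config, indent=2)
loaded = json.loads(text)

# A single page extractor would simply have no links to follow.
is_crawler = bool(loaded["scraping"].get("next"))
print(is_crawler)
```

Keeping the extractor's behaviour in a declarative file like this means the user only states which tags matter and which links to follow, while the framework supplies the fetching and traversal logic.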
Examples for each type are given.
.. toctree::
   :maxdepth: 2
   :hidden:

   intro/tutorials/single_linear
   intro/tutorials/link_crawler
   intro/tutorials/results

- :doc:`intro/tutorials/single_linear`
- Tutorial for single page linear extractors
- :doc:`intro/tutorials/link_crawler`
- Tutorial for link crawlers
Scrapple is on GitHub!
.. toctree::
   :maxdepth: 2
   :hidden:

   contributing/authors
   contributing/history
   contributing/guide

- :doc:`contributing/authors`
- The creators of Scrapple
- :doc:`contributing/history`
- History of Scrapple releases
- :doc:`contributing/guide`
- The Scrapple contribution guide
The goal of Scrapple is to provide a generalized solution to the problem of web content extraction. The framework requires a basic understanding of web page structure, which is needed to write the selector expressions. Using these selector expressions, the required web content extractors can be implemented to generate the desired datasets.
Experimentation with a wide range of websites gave consistently accurate results in terms of the generated dataset. However, larger crawl jobs took a long time to complete, and they had to be run in one stretch. Scrapple could be improved to provide restartable crawlers, using caching mechanisms to keep track of the position in the URL frontier. Tag recommendation systems could also be implemented using complex learning algorithms, though there would be a trade-off in accuracy.
.. toctree::
   :maxdepth: 2
   :hidden: