Scrapple |version| documentation

The Internet is a huge source of information. Many people use data from the Internet for activities such as research or analysis. Data extraction is a primary step in data mining and analysis, so extracting content from structured web pages is a vital task when the Internet is the principal source of data.

The current standards in web structure involve the use of CSS selectors or XPath expressions to select the particular tags from which information can be extracted. Web pages are structured as element trees, which can be parsed to traverse the tags. This tree structure, which represents tags as parents, children and siblings, is useful when a tag has to be located relative to the rest of the web page structure.
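As a minimal sketch of the idea above, the following snippet parses a small page into an element tree and selects tags with an XPath-style expression. It uses only the Python standard library (``xml.etree.ElementTree``, which supports a limited XPath subset and needs well-formed markup); the sample page and variable names are made up for illustration and are not part of Scrapple.

```python
import xml.etree.ElementTree as ET

# A small, well-formed page used only for illustration.
PAGE = """
<html>
  <body>
    <div id="content">
      <h1>Sample title</h1>
      <ul>
        <li><a href="/first">First</a></li>
        <li><a href="/second">Second</a></li>
      </ul>
    </div>
  </body>
</html>
"""

# Parse the markup into an element tree.
root = ET.fromstring(PAGE)

# XPath-style selector: every <a> tag below the <div> whose id is "content".
links = root.findall(".//div[@id='content']//a")
hrefs = [a.get("href") for a in links]
texts = [a.text for a in links]

# The tree exposes parent/child relationships: iterate over the children of <ul>.
ul = root.find(".//ul")
child_tags = [child.tag for child in ul]

# A single value can be picked out the same way.
title = root.find(".//h1").text

print(title, hrefs, texts, child_tags)
```

Real-world pages are rarely well-formed XML, so a practical extractor would typically rely on a lenient HTML parser instead; the selection logic stays the same.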

Scrapple is a project aimed at designing a framework for building web content extractors. Scrapple uses key-value based configuration files to define parameters to be considered in generating the extractor. It considers the base page URL, selectors for the data to be extracted, and the selector for the links to be crawled through. At its core, Scrapple abstracts the implementation of the extractor, focussing more on representing the selectors for the required tags. Scrapple can be used to generate single page content extractors or link crawlers.
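To make the key-value idea concrete, here is a hedged sketch of what such a configuration could look like when expressed as a Python dictionary and serialized to JSON. The key names (``url``, ``selectors``, ``follow_links``) and the helper ``extractor_kind`` are hypothetical and chosen for illustration only; the actual keys Scrapple expects are described in the configuration file specification.

```python
import json


def extractor_kind(config):
    """Return which kind of tool the (hypothetical) configuration describes."""
    # A configuration with a link selector describes a crawler;
    # without one, it describes a single page extractor.
    return "link crawler" if config.get("follow_links") else "single page extractor"


# Base page URL plus selectors for the data to be extracted.
single_page = {
    "url": "http://example.com/items",
    "selectors": {
        "title": "//h1",
        "price": "//span[@class='price']",
    },
}

# The same configuration, extended with a selector for links to crawl through.
crawler = dict(single_page, follow_links="//a[@class='item']")

# Key-value configurations serialize naturally to a text format such as JSON.
config_text = json.dumps(crawler, indent=2, sort_keys=True)

print(extractor_kind(single_page))
print(extractor_kind(crawler))
print(config_text)
```

The point of the sketch is that the user only declares selectors and URLs; the code that fetches pages and walks the tree stays inside the framework.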

Overview

Web content extraction is a common task in the process of collecting data for data analysis. There are several existing frameworks that aid in this task. This chapter gives a brief introduction to Scrapple, with instructions on setting up a development machine to run it.

.. toctree::
   :maxdepth: 2
   :hidden:

   intro/overview
   intro/existing
   intro/requirements
   intro/install

:doc:`intro/overview`
   An introduction to Scrapple
:doc:`intro/existing`
   A review of existing systems
:doc:`intro/requirements`
   Hardware and software requirements to run Scrapple
:doc:`intro/install`
   Instructions for installing Scrapple and the required dependencies

Concepts

Creating web content extractors requires a good understanding of web page structure, tag selector expressions and the data formats involved in handling data.

This chapter gives a brief overview of these concepts behind Scrapple.

.. toctree::
   :maxdepth: 2
   :hidden:

   concepts/structure
   concepts/selectors
   concepts/formats

:doc:`concepts/structure`
   The basics of web page structure and element trees
:doc:`concepts/selectors`
   An introduction to tag selector expressions
:doc:`concepts/formats`
   The primary data formats involved in handling data

The Scrapple framework

This section deals with how Scrapple works: the architecture of the framework, the commands and options it provides, and the specification of the configuration file.

.. toctree::
   :maxdepth: 2
   :hidden:

   framework/basic
   framework/commands
   framework/config

:doc:`framework/basic`
   The architecture of the Scrapple framework
:doc:`framework/commands`
   Commands provided by the Scrapple CLI
:doc:`framework/config`
   The configuration file used by Scrapple to implement the required extractor/crawler

Implementation

This section deals with the implementation of the Scrapple framework. This includes an explanation of the classes involved in the framework, the interaction scenarios for each of the commands supported by Scrapple, and utility functions that form a part of the implementation of the extractor.

.. toctree::
   :maxdepth: 2
   :hidden:

   implementation/classes
   implementation/interaction
   implementation/cli
   implementation/commands
   implementation/selectors
   implementation/utils

:doc:`implementation/classes`
   The classes involved in the implementation of Scrapple
:doc:`implementation/interaction`
   Interaction scenarios in the implementation of each of the Scrapple commands
:doc:`implementation/cli`
   The Scrapple command line interface
:doc:`implementation/commands`
   The implementation of the command classes
:doc:`implementation/selectors`
   The implementation of the selector classes
:doc:`implementation/utils`
   Utility functions that support the implementation of the extractor

Experimentation & Results

This section presents some experiments with Scrapple. Two main types of tools can be implemented with the Scrapple framework: single page linear extractors and link crawlers.

Once you've :doc:`installed Scrapple <intro/install>`, you can see the list of available :ref:`commands <framework-commands>` and the related options using the command::

   $ scrapple --help

The :ref:`configuration file <framework-config>` is the backbone of Scrapple. It specifies the base page URL, selectors for the data extraction, the follow link for the link crawler and several other parameters.

Examples for each type are given.

.. toctree::
   :maxdepth: 2
   :hidden:

   intro/tutorials/single_linear
   intro/tutorials/link_crawler
   intro/tutorials/results

:doc:`intro/tutorials/single_linear`
   Tutorial for single page linear extractors
:doc:`intro/tutorials/link_crawler`
   Tutorial for link crawlers

Contributing to Scrapple

Scrapple is on GitHub!

.. toctree::
   :maxdepth: 2
   :hidden:

   contributing/authors
   contributing/history
   contributing/guide

:doc:`contributing/authors`
   The creators of Scrapple
:doc:`contributing/history`
   History of Scrapple releases
:doc:`contributing/guide`
   The Scrapple contribution guide

The goal of Scrapple is to provide a generalized solution to the problem of web content extraction. The framework requires a basic understanding of web page structure, which is needed to write the required selector expressions. Using these selector expressions, web content extractors can be implemented to generate the desired datasets.

Experimentation with a wide range of websites gave consistently accurate results in terms of the generated dataset. However, larger crawl jobs took a long time to complete and had to be run in one stretch. Scrapple could be improved to provide restartable crawlers, using caching mechanisms to keep track of the position in the URL frontier. Tag recommendation systems could also be implemented using complex learning algorithms, though there would be a trade-off on accuracy.
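The restartable crawler mentioned above is not part of Scrapple; the following is only a rough sketch of how the position in the URL frontier could be checkpointed to disk so that an interrupted crawl resumes where it stopped. The class ``CheckpointedFrontier``, its methods and the JSON state file are all hypothetical names introduced for this illustration.

```python
import json
import os
import tempfile
from collections import deque


class CheckpointedFrontier:
    """A URL frontier whose state is saved to a JSON file (hypothetical sketch)."""

    def __init__(self, path, seeds=()):
        self.path = path
        if os.path.exists(path):
            # Resume: restore the pending queue and the set of known URLs.
            with open(path) as f:
                state = json.load(f)
            self.pending = deque(state["pending"])
            self.seen = set(state["seen"])
        else:
            # Fresh start: begin from the seed URLs.
            self.pending = deque(seeds)
            self.seen = set(seeds)

    def add(self, url):
        """Queue a newly discovered URL unless it was already seen."""
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def peek(self):
        """Return the next URL to crawl without removing it, or None."""
        return self.pending[0] if self.pending else None

    def mark_done(self):
        """Remove the current URL once its page is fully processed, then save."""
        if self.pending:
            self.pending.popleft()
        self.save()

    def save(self):
        # Write to a temporary file first so a crash cannot corrupt the state.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"pending": list(self.pending), "seen": sorted(self.seen)}, f)
        os.replace(tmp, self.path)


# Simulate a crawl that stops after one page and is then restarted.
state_file = os.path.join(tempfile.mkdtemp(), "frontier.json")
frontier = CheckpointedFrontier(state_file, seeds=["http://example.com/"])
frontier.add("http://example.com/a")  # links found on the seed page
frontier.add("http://example.com/b")
frontier.mark_done()                  # seed page finished, state saved

resumed = CheckpointedFrontier(state_file)  # a later run picks up the saved state
print(resumed.peek())
```

Because a URL is only removed after its page has been processed, a crash mid-page means that page is fetched again on restart rather than skipped.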

.. toctree::
   :maxdepth: 2
   :hidden: