Bit&Black Document Crawler

Extract different parts of an HTML or XML document.

Installation

This library is made for use with Composer. Add it to your project by running $ composer require bitandblack/document-crawler.

Usage

Using Crawlers to extract parts of a document

The Bit&Black Document Crawler library provides several crawlers that extract information from a document. The following crawlers currently exist:

IconsCrawler: Crawls and extracts all icons in a document that are declared with <link rel="icon" ... />.
ImagesCrawler: Crawls and extracts all images in a document that are declared with <img ... />.
LanguageCodeCrawler: Crawls and extracts the language code of a document that is declared with <html lang="...">.
MetaTagsCrawler: Crawls and extracts all meta tags in a document that are declared with <meta ... />.
TitleCrawler: Crawls and extracts the title of a document that is declared with <title>...</title>.

All of these crawlers work the same way. Each one needs a Symfony DomCrawler object (Crawler) that contains the document:

<?php

use BitAndBlack\DocumentCrawler\ContentCrawler\TitleCrawler;
use Symfony\Component\DomCrawler\Crawler;

$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Hello world</h1>
    </body>
</html>
HTML;

$crawler = new Crawler($document);

$titleCrawler = new TitleCrawler($crawler);
$titleCrawler->crawlContent();

// This will output `Test`.
echo $titleCrawler->getTitle();

You can create a custom Crawler by implementing the CrawlerInterface.
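
As a rough illustration of the idea, the sketch below builds a small crawler that extracts the first <h1> headline. The methods that CrawlerInterface actually requires are not listed here, so the sketch declares its own stand-in interface with a single crawlContent() method, mirroring the call in the example above. ExampleCrawlerInterface, HeadlineCrawler and getHeadline() are hypothetical names. PHP's built-in DOMDocument is used instead of the Symfony Crawler only to keep the sketch free of dependencies.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the library's CrawlerInterface.
// The real interface may declare different or additional methods.
interface ExampleCrawlerInterface
{
    public function crawlContent(): void;
}

// Hypothetical custom crawler that extracts the first <h1> of a document.
final class HeadlineCrawler implements ExampleCrawlerInterface
{
    private ?string $headline = null;

    public function __construct(private DOMDocument $document)
    {
    }

    public function crawlContent(): void
    {
        // item(0) returns null when the document has no <h1>.
        $node = $this->document->getElementsByTagName('h1')->item(0);
        $this->headline = $node === null ? null : trim($node->textContent);
    }

    public function getHeadline(): ?string
    {
        return $this->headline;
    }
}

// Collect libxml parser notices internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML(
    '<!doctype html><html lang="en"><head><title>Test</title></head>'
    . '<body><h1>Hello world</h1></body></html>'
);

$headlineCrawler = new HeadlineCrawler($document);
$headlineCrawler->crawlContent();

// This will output `Hello world`.
echo $headlineCrawler->getHeadline();
```

A real custom crawler would implement the library's own CrawlerInterface and would most likely receive the Symfony Crawler object in its constructor, like the built-in crawlers do.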

Handling resources

In some cases, a crawler finds resources that you may want to handle in a specific way. To achieve this, each crawler uses a so-called Resource Handler. The following handlers currently exist:

The FileSystemDownloadHandler: This one loads resources and writes them to the file system. There are different Downloaders available to fetch resources:
- The HttpDiscoveryDownloader is the default one and uses whichever HTTP library your project already provides to download resources.
- The ReactDownloader needs the react/http library and fetches resources asynchronously.
- You can also create a custom Downloader by implementing the FileSystemDownloaderInterface.
The PassiveResourceHandler: This handler does nothing and is the default one.

You can create a custom Resource Handler by implementing the ResourceHandlerInterface.
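
The handler idea can be sketched as follows. The methods of ResourceHandlerInterface are not shown here, so the stand-in interface, its handleResource() method, the two handler classes and the crawlImages() function below are all hypothetical. They only illustrate how a crawler could pass each resource it finds to an exchangeable handler.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the library's ResourceHandlerInterface.
// The real interface may use different method names and parameters.
interface ExampleResourceHandlerInterface
{
    public function handleResource(string $resourceUrl): void;
}

// Does nothing, comparable in spirit to the PassiveResourceHandler.
final class ExamplePassiveHandler implements ExampleResourceHandlerInterface
{
    public function handleResource(string $resourceUrl): void
    {
    }
}

// Remembers every resource URL it is given, without downloading anything.
final class CollectingResourceHandler implements ExampleResourceHandlerInterface
{
    /** @var list<string> */
    private array $resourceUrls = [];

    public function handleResource(string $resourceUrl): void
    {
        $this->resourceUrls[] = $resourceUrl;
    }

    /** @return list<string> */
    public function getResourceUrls(): array
    {
        return $this->resourceUrls;
    }
}

// Hypothetical crawl step: pass the src of each <img> to the handler.
function crawlImages(DOMDocument $document, ExampleResourceHandlerInterface $handler): void
{
    foreach ($document->getElementsByTagName('img') as $image) {
        $source = $image->getAttribute('src');
        if ($source !== '') {
            $handler->handleResource($source);
        }
    }
}

// Collect libxml parser notices internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML('<html><body><img src="/a.png" alt=""><img src="/b.png" alt=""></body></html>');

$handler = new CollectingResourceHandler();
crawlImages($document, $handler);

// Prints the two collected paths, one per line.
echo implode(PHP_EOL, $handler->getResourceUrls()), PHP_EOL;
```

Swapping CollectingResourceHandler for ExamplePassiveHandler leaves the crawl step unchanged. That is the point of keeping resource handling behind an interface.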

Crawling everything at once

If you don't want to set anything up yourself, the HolisticDocumentCrawler does all the work for you:

<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

$holisticDocumentCrawler = new HolisticDocumentCrawler('https://www.bitandblack.com');

// Get all icons:
$icons = $holisticDocumentCrawler->getIcons();

// Get all images:
$images = $holisticDocumentCrawler->getImages();

// Get the language code:
$languageCode = $holisticDocumentCrawler->getLanguageCode();

// Get all meta tags:
$metaTags = $holisticDocumentCrawler->getMetaTags();

// Get the title:
$title = $holisticDocumentCrawler->getTitle();

Help

If you have any questions, feel free to contact us at hello@bitandblack.com.

Further information about Bit&Black can be found at www.bitandblack.com.
