BitAndBlack/document-crawler

Bit&Black Document Crawler

Extract different parts of an HTML or XML document.

Installation

This library is meant to be used with Composer. Add it to your project by running $ composer require bitandblack/document-crawler.

Usage

Using Crawlers to extract parts of a document

The Bit&Black Document Crawler library provides several crawlers for extracting information from a document. The following crawlers currently exist:

  • IconsCrawler: Extracts all icons that a document declares with <link rel="icon" ... />.
  • ImagesCrawler: Extracts all images that a document declares with <img ... />.
  • LanguageCodeCrawler: Extracts the language code that a document declares with <html lang="...">.
  • MetaTagsCrawler: Extracts all meta tags that a document declares with <meta ... />.
  • TitleCrawler: Extracts the title that a document declares with <title>...</title>.

All of these crawlers work the same way. Each one needs a Symfony DomCrawler object that contains the document:

<?php

use BitAndBlack\DocumentCrawler\ContentCrawler\TitleCrawler;
use Symfony\Component\DomCrawler\Crawler;

$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Hello world</h1>
    </body>
</html>
HTML;

$crawler = new Crawler($document);

$titleCrawler = new TitleCrawler($crawler);
$titleCrawler->crawlContent();

// This will output `Test`.
echo $titleCrawler->getTitle();

You can create a custom Crawler by implementing the CrawlerInterface.
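
To give a rough idea of the kind of lookups the crawlers listed above perform, here is a minimal sketch that uses only the Symfony DomCrawler component. It is not this library's implementation and does not use its classes. The sample document, the variable names and the XPath expressions are assumptions made for illustration.

```php
<?php

// Assumption: Composer's autoloader lives next to this script.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample document containing one of each element of interest.
$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
        <link rel="icon" href="/favicon.ico" />
    </head>
    <body>
        <img src="/logo.png" alt="Logo" />
    </body>
</html>
HTML;

$crawler = new Crawler($document);

// XPath is used instead of CSS selectors, so symfony/css-selector is not needed.
$title = $crawler->filterXPath('//title')->text();
$languageCode = $crawler->filterXPath('//html')->attr('lang');

// each() runs the callback once per matched node and collects the return values.
$icons = $crawler->filterXPath('//link[@rel="icon"]')->each(
    static fn (Crawler $node): ?string => $node->attr('href')
);
$images = $crawler->filterXPath('//img')->each(
    static fn (Crawler $node): ?string => $node->attr('src')
);
```

The library's crawlers wrap this kind of lookup behind a common interface, so in most cases you would use them rather than querying the document by hand.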

Handling resources

In some cases, the crawled content includes resources that you may want to handle in a specific way. To make this possible, each crawler uses a so-called Resource Handler. The following handlers currently exist:

You can create a custom Resource Handler by implementing the ResourceHandlerInterface.

Crawling everything at once

If you don't want to set anything up yourself, the HolisticDocumentCrawler does all the work for you:

<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

$holisticDocumentCrawler = new HolisticDocumentCrawler('https://www.bitandblack.com');

// Get all icons:
$icons = $holisticDocumentCrawler->getIcons();

// Get all images:
$images = $holisticDocumentCrawler->getImages();

// Get the language code:
$languageCode = $holisticDocumentCrawler->getLanguageCode();

// Get all meta tags:
$metaTags = $holisticDocumentCrawler->getMetaTags();

// Get the title:
$title = $holisticDocumentCrawler->getTitle();

Help

If you have any questions, feel free to contact us at hello@bitandblack.com.

Further information about Bit&Black can be found at www.bitandblack.com.
