Extract different parts of an HTML or XML document.
This library is intended for use with Composer. Add it to your project by running $ composer require bitandblack/document-crawler.
The Bit&Black Document Crawler library provides several crawlers to extract information from a document. The following crawlers currently exist:
- IconsCrawler: Crawls and extracts all icons in a document that have been declared with <link rel="icon" ... />.
- ImagesCrawler: Crawls and extracts all images in a document that have been declared with <img ... />.
- LanguageCodeCrawler: Crawls and extracts the language code of a document that has been declared with <html lang="...">.
- MetaTagsCrawler: Crawls and extracts all meta tags in a document that have been declared with <meta ... />.
- TitleCrawler: Crawls and extracts the title of a document that has been declared with <title>...</title>.
All of these crawlers work the same way. Each one needs a Dom Crawler object that contains the document:
<?php
use BitAndBlack\DocumentCrawler\ContentCrawler\TitleCrawler;
use Symfony\Component\DomCrawler\Crawler;
$document = <<<HTML
<!doctype html>
<html lang="en">
<head>
<title>Test</title>
</head>
<body>
<h1>Hello world</h1>
</body>
</html>
HTML;
$crawler = new Crawler($document);
$titleCrawler = new TitleCrawler($crawler);
$titleCrawler->crawlContent();
// This will output `Test`.
echo $titleCrawler->getTitle();
You can create a custom crawler by implementing the CrawlerInterface.
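For orientation, the markup that the crawlers listed above target can also be located with PHP's bundled DOM extension alone. The following is only a sketch of that lookup and not the library's implementation. The sample document, the variable names and the choice to key meta tags by their name attribute are assumptions made for illustration.

```php
<?php

declare(strict_types=1);

// Sample document (an assumption for this sketch), containing one of each
// element that the crawlers described above look for.
$html = <<<HTML
<!doctype html>
<html lang="en">
<head>
<title>Test</title>
<meta name="description" content="An example page">
<link rel="icon" href="/favicon.ico">
</head>
<body>
<h1>Hello world</h1>
<img src="/logo.png" alt="Logo">
</body>
</html>
HTML;

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// string(...) yields the text of the first match, or '' if nothing matches.
$title = $xpath->evaluate('string(//title)');
$languageCode = $xpath->evaluate('string(/html/@lang)');

// Icons: the href of every <link rel="icon" ... />.
$icons = [];
foreach ($xpath->query('//link[@rel="icon"]/@href') as $attribute) {
    $icons[] = $attribute->nodeValue;
}

// Images: the src of every <img ... />.
$images = [];
foreach ($xpath->query('//img/@src') as $attribute) {
    $images[] = $attribute->nodeValue;
}

// Meta tags: name => content for every <meta ... /> that has a name.
$metaTags = [];
foreach ($xpath->query('//meta[@name]') as $meta) {
    $metaTags[$meta->getAttribute('name')] = $meta->getAttribute('content');
}

echo $title, PHP_EOL;
echo $languageCode, PHP_EOL;
```

The dedicated crawlers are still the better choice in a project, because they work with a resource handler and share one interface. The sketch only shows which elements and attributes are involved.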
In some cases, a crawler finds resources that you may want to handle in a specific way. For this purpose, each crawler uses a so-called Resource Handler. The following handlers currently exist:
- The FileSystemDownloadHandler: This handler loads resources and writes them to the file system. Different Downloaders are available to fetch resources:
  - The HttpDiscoveryDownloader is the default one and makes use of whatever library your project uses to download resources.
  - The ReactDownloader needs the react/http library and fetches resources asynchronously.
  - You can also create a custom Downloader by implementing the FileSystemDownloaderInterface.
- The PassiveResourceHandler: This handler does nothing and is the default one.
You can create a custom Resource Handler by implementing the ResourceHandlerInterface.
If you don't want to set anything up yourself, the HolisticDocumentCrawler does all the work for you:
<?php
use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;
$holisticDocumentCrawler = new HolisticDocumentCrawler('https://www.bitandblack.com');
// Get all icons:
$icons = $holisticDocumentCrawler->getIcons();
// Get all images:
$images = $holisticDocumentCrawler->getImages();
// Get the language code:
$languageCode = $holisticDocumentCrawler->getLanguageCode();
// Get all meta tags:
$metaTags = $holisticDocumentCrawler->getMetaTags();
// Get the title:
$title = $holisticDocumentCrawler->getTitle();
If you have any questions, feel free to contact us at hello@bitandblack.com.
Further information about Bit&Black can be found at www.bitandblack.com.