Skip to content

makryl/AParser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

AParser

Simple text parser. Any text format. Low memory usage (~ 2 buffers) for large files.

Usage

Basic

$parser = new ListParser();
$parser->open('http://google.com/sitemap.xml');
$parser->parseList(
    '<urlset',
    '<url>',
    function() use ($parser) {
        printf(
            "loc: %s\npriority: %s\n\n",
            $parser->parseBetween('<loc>', '</'),
            $parser->parseBetween('<priority>', '</')
        );
    }
);

To parse few files with one parser use parseFiles method.

$parser = new ListParser();
$parser->parseFiles('
        http://google.com/sitemap.xml
        http://php.net/sitemap.xml
    ',
    '<urlset',
    '<url>',
    function() use ($parser) {
        printf(
            "loc: %s\n",
            $parser->parseBetween('<loc>', '</')
        );
    }
);

To store results in array (large result array can cause high memory usage):

$parser = new ListParser();
$parser->open('http://google.com/sitemap.xml');
$result = $parser->parseList(
    '<urlset',
    '<url>',
    function() use ($parser) {
        return [
            'loc' => $parser->parseBetween('<loc>', '</'),
            'priority' => $parser->parseBetween('<priority>', '</'),
        ];
    }
);

Storing results in array works also with parseFiles method.

Except method parseBetween there are two methods seekTo and parseTo.

Method parseTo returns string from current position to specified string, but can't parse string longer than 1 buffer length.

Method seekTo moves file pointer to specified string, and can seek over any amount of buffers, using less than 2 buffers memory.

Method parseBetween uses these methods: seeks to first argument and parses to second argument.

Extending

It may be much flexible to extend class ListParser or AParser with you own:

class MyParser extends ListParser
{
    public $buffer = 4096;
    public $encoding = 'UTF-8';

    public $beginOfList = '<urlset';
    public $beginOfItem = '<url>';

    public function parseItem()
    {
        printf(
            "loc: %s\npriority: %s\n\n",
            $this->parseBetween('<loc>', '</'),
            $this->parseBetween('<priority>', '</')
        );
    }
}

$myParser = new MyParser();
$myParser->open('http://google.com/sitemap.xml');
$myParser->parseList();

Parse images

To parse images use class ImageParser and make parseItem handler so that it return array with two elements: src - url of image to download, dest - local path to save.

Let's download some cute foxes and cats from google:

$parser = new ImageParser();
$parser->parseFiles('
        http://www.google.ru/search?tbm=isch&q=cat
        http://www.google.ru/search?tbm=isch&q=fox
    ',
    '<table class="images_table',
    '<td style="width:25%;word-wrap:break-word">',
    function() use ($parser) {
        return [
            'src' => $parser->parseBetween('src="', '"'),
            'dest' => 'parsed/google-cat-and-fox/' .
                preg_replace(
                    '/[^\w\d]+/',
                    '.',
                    html_entity_decode(
                        strip_tags(
                            $parser->parseBetween('</a>', '</td>')
                        )
                    )
                ),
        ];
    }
);

License

Copyright © 2014 Krylosov Maksim Aequiternus@gmail.com

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

About

Simple text parser. Any text format. Low memory usage for large files.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages