Lets Use QueryPath instead #26

Open
nickl- opened this Issue Jun 4, 2012 · 22 comments

4 participants
Member

nickl- commented Jun 4, 2012

Let me put this as a question and let the title serve as the answer.

Why exactly are we using Zend's DOM Library?

We already have problems adding it as a dependency. Respect/Template#23
It does not help towards HTML5 capabilities. Respect/Template#15
It only enables a small subset of CSS selectors.
It has a really horrible license =(

QueryPath is mature, it made its first commit on github 3 years ago.
QueryPath has an extensive community, +QueryPath and a drupal module
QueryPath has NO problems getting installed with pear, composer, git, download
QueryPath takes part in GSoC
QueryPath supports CSS3, XPath, XML Namespace as well as Pseudo-class and pseudo-element selectors
The last mention of HTML5 issues I could find technosophos/querypath#42 was 2 years ago
QueryPath has extensive documentation
QueryPath is still very active: last commit 3 days ago (at time of writing)
QueryPath gives you the choice between two equally decent licenses

This package is licensed under an MIT license or, at your option, the LGPL version 2.1 or later.

So I don't know guys this is difficult, what do you say? =)

Owner

alganet commented Jun 4, 2012

+1 for QueryPath, great project. Happy to see they've implemented support for PSR-0, that was a blocker way back when I saw the project for the first time!

Owner

augustohp commented Jun 4, 2012

+1 for the QueryPath! \o/

Owner

henriquemoody commented Jun 4, 2012

+1 for the QueryPath! \o/

Owner

henriquemoody commented Jun 4, 2012

Who will do the refactoring?
:p

Member

nickl- commented Jun 4, 2012

I'm already on it.

Wow, 3 out of 3, that's a positive result. Glad to see we're on the same wavelength.

Anyone willing to assist with unit tests? I still need tests for the doctype patch. : / It doesn't hurt to ask... =)

Owner

henriquemoody commented Jun 5, 2012

I can help you, @nickl-
Tell me what you need, or if you prefer you can add me on GTalk: henriquemoody@gmail.com

Owner

augustohp commented Jun 5, 2012

@nickl- Hey, you can add me too: augusto.hp@gmail.com

And count on us to write tests ;) I'm always happy to do that!

Happy Panda

Owner

henriquemoody commented Jun 5, 2012

LOL

Member

nickl- commented Jun 5, 2012

Ah you brought us little helpers =) so cute! Agreed they don't look like they are busy at all.

These are the tests in question which are holding up the doctype patch. Over here is a write-up about the implementation which might come in handy. But if I recall I also added more doc comments so there should be more than enough to go on.

@henriquemoody I added your jabber id thanx, that should speed things up and this is also my primary means of communication so you should be able to get hold of me easier than my clients ;D

It's not that I mind doing the tests, I just don't know when I am going to get around to it. I agree we should include tests with the patches and I will keep this in mind in the future; best to do them while you are at it. I am busy working on a site that needs to move from a proprietary CMS, and I decided to mirror it statically and then inject my solution, and Template is perfect for this. Since I have to do this in any case and we all agreed that QueryPath is the bee's knees, I might as well pay it forward.

What I really want to get busy with is the HttpHeader/Foundation/RestProtocol solution which I've been tinkering over with help from Alexandre, for a week now. @augustohp I think you were mentioned ahh yes, I would really appreciate to hear your thoughts after you get a chance to scan through what is very quickly becoming the next Stephen King novel by the looks of it.

I stumbled across phpQuery, which I was delighted to see has put quite a bit of effort into making a home at Google Code, one of the few projects there that doesn't look like someone is going to abandon it at any moment, if they haven't already.

But I am happy with QueryPath and think we made the right choice, see how cool is this. =)

Thanks guys, rock on! ...and see you later teddy bears...

Member

nickl- commented Jun 5, 2012

@augustohp I was looking at the bears and didn't see your jabber id, got it now.

Owner

alganet commented Jun 5, 2012

Please add me as well! Jabber/MSN/email is alexandre@gaigalas.net. I believe we should get an IRC channel soon.

Member

nickl- commented Jun 6, 2012

@alganet Added, IRC is not a half bad idea, will lift the project status that we are serious and in business. =)

Owner

augustohp commented Jun 7, 2012

@nickl- What is your progress with it?

I am willing to take a look here today. The good part is that merging Query Path inside the current Template API is not that difficult, it is quite easy actually (from what I already saw)...

Member

nickl- commented Jun 8, 2012

Hi,

Yes hardly anything stays behind =)

I've been caught up with other crises and haven't been able to give it my full attention. Like missing your msg for 6 hours, sheesh. Are you merging vanilla QueryPath or QPTPL, which makes it even easier to work with arrays? =)

You're probably done already, can I merge and check? ;)

Member

nickl- commented Jun 14, 2012

@augustohp done yet? =) hehehe

nickl- added a commit to nickl-/Template that referenced this issue Jun 14, 2012

Template refactor simple_html_dom.
As per discussions at Respect/Template#26
7c38cba

Member

nickl- commented Jun 14, 2012

Ok ok I know not exactly what we were discussing but it's not all my fault see it actually happened like this:

It was late, sometime Sunday night and I was getting really annoyed with DOMDocument errors plus QueryPath was not making my life refactoring any easier.
Whether to use html, xml, or is it xhtml?... grrrr, where is the html5 already, and why is this thing constantly complaining or adding the <?xml directive? Frustrating!!!
I eventually took the shortest route and just changed Html class to talk to QueryPath directly thinking that I could work my way back from there.
So I modified the constructor, map and find, and reworked render to process the template array, and when I was happy I fired up example/simple.php, at which point QueryPath choked on the few selectors we are using there.

I did say not my fault: Awestruck and amazed I turn to http://querypath.org/ and?... it was down. As was http://api.querypath.org/docs/ the whole querypath.org gone!
Not a huge problem they have phing build scripts so I can just quickly fire up the doxygen right?... wrong one dependency after the other, custom phing classes from none other than, you guessed it QueryPath. What happened to do one thing and do it right, frustrating!!! Commenting out the dependencies as I only need the docs target but alas even that used a custom class which I needed querypath.org for to get. =(

Now I specifically set this time apart to work on this and I hit a dead end. When I remembered there was another parser I wanted to look at a while ago... simple_html_dom

Member

nickl- commented Jun 14, 2012

A quick glance over the documentation I instantly spotted what they refer to as the W3C STANDARD Camel naming conventions and thought to myself hey I recognize those

string $e->getAttribute ( $name )                   
void $e->setAttribute ( $name, $value ) 
bool $e->hasAttribute ( $name ) 
void $e->removeAttribute ( $name )  
element $e->getElementById ( $id )  
mixed $e->getElementsById ( $id [,$index] )
element $e->getElementByTagName ($name )
mixed $e->getElementsByTagName ( $name [, $index] )
element $e->parentNode ()   
mixed $e->childNodes ( [$index] )
element $e->firstChild ()
element $e->lastChild ()
element $e->nextSibling ()
element $e->previousSibling ()

I was immediately suspicious and started looking for our "friend" DOMDocument but he is truly nowhere to be found. Can this be true, no more libxml? Indeed my friends, no more libxml and its constant bickering and modifying your source (breaking it) and the frustrations from its unnatural implementations and insistence on having the DOMDocument create the element while you can only append children.... ag I don't want to go there, it's annoying.

Where QueryPath tries to hide the dom simple_html_dom exposes everything to you and I have yet to even go look at what kind of selectors work as everything just works. Especially because it exposes the DOM like methods it stands to very easily be a drop in replacement and did I say it just works. All the time I spent was making Template work with simple_html_dom it just allows you to do the most amazing things and so easy.

What is the catch? ... to be continued.

Member

nickl- commented Jun 14, 2012

At least it is object orientated and it does funky things with magic to get to the element attributes, which I'll get to in a bit but it is still focused on that require_once obscurity, what was that for again?

Getting autoloading on is a piece of cake and added it to composer as classpath for now.
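For anyone wondering what replacing require_once with a class map amounts to, here is a rough sketch in plain PHP. Everything in it is made up for illustration: the class toy_legacy_lib, the temp file and the $classMap array are hypothetical stand-ins, not the real library or the map Composer generates. The idea is only that a class name is looked up in a map and the file is loaded on first use.

```php
<?php
// Hypothetical legacy file, written to a temp location so the sketch runs on its own.
$file = sys_get_temp_dir() . '/toy_legacy_lib_' . uniqid() . '.php';
file_put_contents($file, '<?php class toy_legacy_lib { function hello() { return "hi"; } }');

// Hypothetical class map: class name => file that defines it.
$classMap = array('toy_legacy_lib' => $file);

// Load the mapped file the first time an unknown class is requested.
spl_autoload_register(function ($class) use ($classMap) {
    if (isset($classMap[$class])) {
        require_once $classMap[$class];
    }
});

$lib = new toy_legacy_lib(); // triggers the autoloader, no manual require_once
echo $lib->hello(), "\n";

unlink($file); // clean up the temp file; the class stays loaded
```

The calling code never mentions the file path, which is the whole point for a library that does not follow PSR-0 naming.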

It is brilliant at scraping and reading from a document.
Traversal is awesome and as mentioned before it seems the selectors are all working.
Very easy to change things: they have "magic" properties tag, innertext and outertext, which you can just change to whatever html you want and it's done. But as it turns out there is a catch.

If you look closer at the list you will notice that several of the most commonly used functions, especially in Respect/Template, have not been mapped, like createElement, hasChildNodes, appendChild. Appending the usual way using their "magic" properties works as such:

$html->find('#something-wants-changing', 0)->innertext .= '<p>append me</p>';

or

$html->find('#something-wants-changing', 0)->innertext = '<p>replace contents</p>';

To replace the element itself, rocket science I tell you, parent uhm.. find child, create element, replace child uhm... any of that ring a bell? None of that:

$html->find('#something-wants-changing', 0)->outertext = '<div><p>I am new awesome stuff and the old is gone</p></div>';

So where's the catch? These "magic" properties are merely text placeholders that overwrite the rendered content if they have been set; they are not reflected in the children count, nor can the new content be traversed. So the current implementation needed some tweaking to work for our heavy-duty requirements, which turned out to be just as simple due to the awesome design.
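To make the catch concrete, here is a minimal sketch in plain PHP. ToyNode is a hypothetical stand-in, not simple_html_dom's real node class; it only mimics the behaviour described above, where a text override replaces what gets rendered while the parsed children collection stays untouched.

```php
<?php
// Hypothetical stand-in for a parsed node; NOT the real simple_html_dom class.
class ToyNode
{
    public $tag;
    public $children = array();
    // Text placeholder: when set, it wins over the children at render time.
    public $innertext = null;

    public function __construct($tag, array $children = array())
    {
        $this->tag = $tag;
        $this->children = $children;
    }

    public function render()
    {
        if ($this->innertext !== null) {
            // The override is emitted as-is and never parsed into nodes.
            $inner = $this->innertext;
        } else {
            $inner = '';
            foreach ($this->children as $child) {
                $inner .= $child->render();
            }
        }
        return '<' . $this->tag . '>' . $inner . '</' . $this->tag . '>';
    }
}

$div = new ToyNode('div', array(new ToyNode('p')));
$div->innertext = '<p>one</p><p>two</p>';

echo $div->render(), "\n";        // the override is what gets rendered
echo count($div->children), "\n"; // still only the one originally parsed child
```

The rendered output contains two paragraphs, yet the tree still reports a single child, so anything that walks the children never sees the new content.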

Member

nickl- commented Jun 14, 2012

I couldn't find any licensing mention of any kind but this is already a port from a previous implementation so I would consider it public domain. They obviously need help with 108 open issues in the bug tracker so I don't foresee much resistance to aid if we offer it.

Things that needed adding which I have implemented to the simple_html_dom.php file in the pull request are as follows.

  • Faking inheritance so that the dom object acts like a node, in the same way that DOMDocument is a DOMNode, without complicating things since the overload method was not implemented:
<?php
    // to create the illusion of dom_node inheritance:
    // forward unknown method calls to the root node when it implements them
    function __call($name, $arguments) {
        if (isset($this->root) && method_exists($this->root, $name)) {
            // call_user_method_array() is deprecated; use call_user_func_array()
            return call_user_func_array(array($this->root, $name), $arguments);
        }
    }

Now all node/element methods are exposed from the root element.
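The forwarding trick can be tried in isolation with a small sketch. ToyElement and ToyDocument are hypothetical classes invented for this example, not the ones in simple_html_dom.php. The sketch forwards with call_user_func_array, and unlike the patch it throws on an unknown method instead of silently returning null; that is my own choice here, not something the patch does.

```php
<?php
// Hypothetical toy classes; the names only echo the ones in this thread.
class ToyElement
{
    public $tag;

    public function __construct($tag)
    {
        $this->tag = $tag;
    }

    public function nodeName()
    {
        return $this->tag;
    }

    public function describe($prefix, $suffix)
    {
        return $prefix . $this->tag . $suffix;
    }
}

class ToyDocument
{
    public $root;

    public function __construct(ToyElement $root)
    {
        $this->root = $root;
    }

    // Forward calls to methods the document lacks to the root element.
    public function __call($name, $arguments)
    {
        if (isset($this->root) && method_exists($this->root, $name)) {
            return call_user_func_array(array($this->root, $name), $arguments);
        }
        // Assumption of this sketch: fail loudly rather than return null.
        throw new BadMethodCallException("Unknown method: $name");
    }
}

$doc = new ToyDocument(new ToyElement('html'));
echo $doc->nodeName(), "\n";          // answered by the root element
echo $doc->describe('<', '>'), "\n";  // arguments are passed through unchanged
```

Failing loudly makes typos in method names visible early, at the cost of behaving differently from code that relied on getting null back.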

  • I first wanted to fix what I saw as broken, the innertext not being parsed, but that is actually awesome just the way it is so I left it. If the innertext is not overwritten, the nodes in the nodes collection (which is actually exposed to us btw) get processed, so what we actually needed was:
  • appendChild() functionality to update the nodes and children collections.
  • createElement() implementation to create elements from their tag names instead of providing the actual html (which is not a bad thing, who wants to use createElement anyway when you can just add html, that's awesome) =)
  • createTextNode for when you don't really want a really really real node I guess.
  • hasChildNodes and nodeName mappings because we use them, so let's simplify.
<?php
    function createElement($name, $value=null) {return @str_get_html('<'.$name.'>'.$value.'</'.$name.'>')->first_child();}
    function createTextNode($value) {return @end(str_get_html($value)->nodes);}

    function appendChild($node) {$node->parent($this); return $node;}
    function nodeName() {return $this->tag;}
    function hasChildNodes() {return $this->has_child();}

As you can see these were no-brainers to implement. With appendChild I was already updating the nodes and children collections and thought that if you wanted to traverse up you would need access to the parent property, and this is protected. There is an existing implementation of a parent() method to retrieve the parent, so why not do this then:

<?php

   // returns the parent of node
    function parent($parent=null)
    {
        if (isset($parent)) {
            $this->parent = $parent;
            $this->parent->nodes[] = $this;
            $this->parent->children[] = $this;
        }
        return $this->parent;
    }

Works like a charm and it is really simple to add. Referring you back to my frustrations with simply generating documentation for QueryPath: this does not have all that bloat and is very simple, as the name says.
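The parent()/appendChild() idea can be tried on its own with a toy class. ToyTreeNode below is hypothetical and only mirrors the shape of the patch (a parent() that doubles as a setter, plus nodes and children collections); it is not the real simple_html_dom node. Like the snippet above, it does not detach a node from any previous parent.

```php
<?php
// Hypothetical toy node, mirroring only the parent()/appendChild() idea.
class ToyTreeNode
{
    public $tag;
    public $nodes = array();
    public $children = array();
    protected $parent = null;

    public function __construct($tag)
    {
        $this->tag = $tag;
    }

    // Getter when called without an argument; with one, it re-parents this node
    // and registers it in the new parent's collections.
    public function parent($parent = null)
    {
        if ($parent !== null) {
            $this->parent = $parent;
            $parent->nodes[] = $this;
            $parent->children[] = $this;
        }
        return $this->parent;
    }

    public function appendChild($node)
    {
        $node->parent($this);
        return $node;
    }

    public function hasChildNodes()
    {
        return count($this->children) > 0;
    }
}

$list = new ToyTreeNode('ul');
$item = $list->appendChild(new ToyTreeNode('li'));

echo count($list->children), "\n"; // the appended node is now counted
echo $item->parent()->tag, "\n";   // and it can traverse back up to its parent
```

Because the appended node lands in the collections, it is counted and traversable, unlike content set through the text placeholders.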

It will still require some more benchmarking and rigorous testing, which is why I made the pull request.

  • Completely ported Respect/Template to work with simple_html_dom
  • All tests working and all the functionality revised
  • Added a few extra asserts for things that broke which weren't evident or tested for
  • Nothing was removed: all the previous functionality has been commented out where it was replaced, to make it easy to verify and assess the changes.
<?php
    /**
     * Excerpt from HtmlElement class
     */
    public function getDOMNode($dom, $current=null) // DOMDocument $dom, $current=null)
    {
//        if (is_string($current))
//            return new DOMText($current);

        $current = $current ?: $this ;
        $html = new simple_html_dom();
        return $html->load($current);

//        $node    = $dom->createElement($current->nodeName);
//        foreach ($current->attributes as $name=>$value)
//            $node->setAttribute($name, $value);
//
//        if (!count($current->childrenNodes))
//            return $node;
//
//        foreach ($current->childrenNodes as $child)
//            $node->appendChild($this->getDOMNode($dom, $child));
//
//        return $node;
    }

Please take it for a test drive and don't be shy to be skeptical. Especially try and break it that is the point actually no use in considering something that is not robust. I will continue writing some motivations and comparing weighing this up against QueryPath. Any other suggestions?

Owner

augustohp commented Jul 5, 2012

Just an update on this: we are currently evaluating alternatives other than QueryPath. I am creating different branches for each implementation. @nickl- already presented us with a branch: simple-html-dom site.

I am really focused on getting a list of pros and cons so we can better choose from our options.

@nickl-: Sorry for yesterday, my net was just gone for a while. =(

Member

nickl- commented Jul 7, 2012

@augustohp same happened here funny enough =)

I am not trying QueryPath again. I have tried 3 times and had no joy. It is heavily dependent on that qp function of theirs, which is not very PSR-0, so you hit a brick wall just as you pick up momentum. Trying to hack around or over it causes even more crap. The last time I tried extending the QueryDocument, I think, which started off OK but had the same problems in the end.

And then you go work with simple_html_dom again and it is just so sweet. I'm kind of getting a bad taste in my mouth for PHP's DOMDocument in all its flavours. Besides, it chokes on HTML5, so it's going out, no use denying it.

I would like to enhance functionality already what other options do you want to compare still?

Member

nickl- commented Jul 7, 2012

Patch upstream@sourceforge for the changes made to simple_html_dom. If you have more suggestions please add them; let's see if we get some interest there...
