[TwitterBridge] Fully decode item #926

Merged
merged 4 commits into from
Nov 15, 2018

triatic
Contributor

@triatic triatic commented Nov 15, 2018

Fully decode item. Some incidences of &quot; appear in the RSS output.

Fix line length
Member

@logmanoriginal logmanoriginal left a comment

Thanks for the PR!

Please find a few comments below. Could you also provide a sample query for testing?

@@ -148,7 +148,7 @@ public function collectData(){
  // extract fullname (pseudonym)
  $item['fullname'] = $tweet->getAttribute('data-name');
  // get author
- $item['author'] = $item['fullname'] . ' (@' . $item['username'] . ')';
+ $item['author'] = htmlspecialchars_decode($item['fullname'] . ' (@' . $item['username'] . ')', ENT_QUOTES);
Member

That doesn't make sense.
$item['fullname'] and $item['username'] may still contain special chars, so they should be decoded beforehand.
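
A minimal sketch of what this comment suggests, assuming the scraped attribute values arrive entity-encoded. The sample strings and the `$rawFullname` / `$rawUsername` names are hypothetical and are not taken from the bridge; each field is decoded on its own, so the combined author string needs no further decoding:

```php
<?php
// Hypothetical attribute values, as they might appear in the scraped markup.
$rawFullname = 'Tom &amp; Jerry&#039;s &quot;Show&quot;';
$rawUsername = 'tomandjerry';

$item = array();

// Decode each field individually so that it is clean wherever it is reused.
$item['fullname'] = htmlspecialchars_decode($rawFullname, ENT_QUOTES);
$item['username'] = htmlspecialchars_decode($rawUsername, ENT_QUOTES);

// The combined string is built from already decoded parts.
$item['author'] = $item['fullname'] . ' (@' . $item['username'] . ')';

echo $item['author'], PHP_EOL; // prints: Tom & Jerry's "Show" (@tomandjerry)
```

Decoding only the concatenated author string would leave `$item['fullname']` and `$item['username']` themselves still encoded for any other consumer.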

@@ -158,7 +158,8 @@ public function collectData(){
  // extract tweet timestamp
  $item['timestamp'] = $tweet->find('span.js-short-timestamp', 0)->getAttribute('data-time');
  // generate the title
- $item['title'] = strip_tags($this->fixAnchorSpacing($tweet->find('p.js-tweet-text', 0), '<a>'));
+ $item['title'] = htmlspecialchars_decode(
+     strip_tags($this->fixAnchorSpacing($tweet->find('p.js-tweet-text', 0), '<a>')), ENT_QUOTES);
Member

htmlspecialchars_decode should be called directly on $tweet->find('p.js-tweet-text', 0), before calling $this->fixAnchorSpacing(...) and strip_tags(...).
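
A rough sketch of the ordering suggested here, using a plain string in place of the DOM node. The `$tweetHtml` sample is hypothetical, and `fixAnchorSpacing()` is left out because its implementation is not shown in this thread:

```php
<?php
// Hypothetical tweet markup; in the bridge this would come from
// $tweet->find('p.js-tweet-text', 0).
$tweetHtml = '<p class="js-tweet-text">She said &quot;hi&quot; &amp; left '
    . '<a href="#">link</a></p>';

// 1. Decode the entities on the raw markup first.
$decoded = htmlspecialchars_decode($tweetHtml, ENT_QUOTES);

// 2. fixAnchorSpacing() would run here in the bridge (omitted in this sketch).

// 3. Strip the remaining tags to obtain the plain-text title.
$title = strip_tags($decoded);

echo $title, PHP_EOL; // prints: She said "hi" & left link
```

One trade-off to keep in mind: decoding first turns an encoded `&lt;` into a literal `<`, which strip_tags() may then treat as the start of a tag.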

@triatic
Contributor Author

triatic commented Nov 15, 2018

Sample query containing &amp;quot; in output (basically any content with double quotes):

php index.php action=display bridge=Twitter u=rachelparris format=Atom

@logmanoriginal
Member

logmanoriginal commented Nov 15, 2018

Ah okay, you use the CLI.

Note that HTML content must be encoded inside XML, otherwise parsers don't know where the XML ends and the HTML starts. This is also clearly declared in the output data (i.e. <content type="html">...).

If you want to access the "raw" text, I suggest you opt for either format=Json or format=Plaintext. Here are the different results for comparison:

Atom

<content type="html">&lt;div style="display: inline-block; vertical-align: top;"&gt;
	&lt;a href="https://twitter.com/rachelparris"&gt;
&lt;img
	style="align:top; width:75px; border:1px solid black;"
	alt="rachelparris"
	src="https://pbs.twimg.com/profile_images/947121157307854848/7HzYN27O_bigger.jpg"
	title="Rachel Parris" /&gt;
&lt;/a&gt;
&lt;/div&gt;
&lt;div style="display: inline-block; vertical-align: top;"&gt;
	&lt;blockquote&gt;Hey folks! Watch this if you like Earth a bit!  &lt;a href="https://twitter.com/hashtag/TheMashReport?src=hash" dir="ltr" &gt;&lt;s&gt;#&lt;/s&gt;&lt;b&gt;TheMashReport&lt;/b&gt;&lt;/a&gt;  &lt;a href="https://twitter.com/hashtag/climatechange?src=hash" dir="ltr" &gt;&lt;s&gt;#&lt;/s&gt;&lt;b&gt;climatechange&lt;/b&gt;&lt;/a&gt; &lt;a href="https://twitter.com/BBCTwo/status/1060110060410601472" dir="ltr" &gt;&lt;span class="tco-ellipsis"&gt;&lt;/span&gt;&lt;span class="js-display-url"&gt;twitter.com/BBCTwo/status/&lt;/span&gt;&lt;span class="tco-ellipsis"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div style="display: block; vertical-align: top;"&gt;
	&lt;blockquote&gt;&lt;/blockquote&gt;
&lt;/div&gt;
&lt;hr&gt;
&lt;div style="display: inline-block; vertical-align: top;"&gt;
	&lt;blockquote&gt;With just 12 years left to save the planet, here&amp;#39;s &lt;span class="twitter-atreply pretty-link js-nav" dir="ltr" data-mentioned-user-id="23759767" &gt;&lt;s&gt;@&lt;/s&gt;&lt;b&gt;RachelParris&lt;/b&gt;&lt;/span&gt; on why we CAN&amp;#39;T let the world go floppy! &lt;img class="Emoji Emoji--forText" src="https://abs.twimg.com/emoji/v2/72x72/1f30d.png" draggable="false" alt="🌍" title="Europa-Afrika auf dem Globus" aria-label="Emoji: Europa-Afrika auf dem Globus" style=" height: 1em;"&gt; &lt;span data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" &gt;&lt;s&gt;#&lt;/s&gt;&lt;b&gt;TheMashReport&lt;/b&gt;&lt;/span&gt; &lt;span class="twitter-timeline-link u-hidden" data-pre-embedded="true" dir="ltr" &gt;pic.twitter.com/RyoI19u2Ed&lt;/span&gt;&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div style="display: block; vertical-align: top;"&gt;
	&lt;blockquote&gt;&lt;a href="https://pbs.twimg.com/amplify_video_thumb/1060104074606059520/img/WnBmUi13811Y3r1C.jpg:orig"&gt;
&lt;img
	style="align:top; max-width:558px; border:1px solid black;"
	src="https://pbs.twimg.com/amplify_video_thumb/1060104074606059520/img/WnBmUi13811Y3r1C.jpg:thumb" /&gt;
&lt;/a&gt;&lt;/blockquote&gt;
&lt;/div&gt;</content>

JSON

"content": "<div style=\"display: inline-block; vertical-align: top;\">\n\t<a href=\"https:\/\/twitter.com\/rachelparris\">\n<img\n\tstyle=\"align:top; width:75px; border:1px solid black;\"\n\talt=\"rachelparris\"\n\tsrc=\"https:\/\/pbs.twimg.com\/profile_images\/947121157307854848\/7HzYN27O_bigger.jpg\"\n\ttitle=\"Rachel Parris\" \/>\n<\/a>\n<\/div>\n<div style=\"display: inline-block; vertical-align: top;\">\n\t<blockquote>Hey folks! Watch this if you like Earth a bit!  <a href=\"https:\/\/twitter.com\/hashtag\/TheMashReport?src=hash\" dir=\"ltr\" ><s>#<\/s><b>TheMashReport<\/b><\/a>  <a href=\"https:\/\/twitter.com\/hashtag\/climatechange?src=hash\" dir=\"ltr\" ><s>#<\/s><b>climatechange<\/b><\/a> <a href=\"https:\/\/twitter.com\/BBCTwo\/status\/1060110060410601472\" dir=\"ltr\" ><span class=\"tco-ellipsis\"><\/span><span class=\"js-display-url\">twitter.com\/BBCTwo\/status\/<\/span><span class=\"tco-ellipsis\">\u2026<\/span><\/a><\/blockquote>\n<\/div>\n<div style=\"display: block; vertical-align: top;\">\n\t<blockquote><\/blockquote>\n<\/div>\n<hr>\n<div style=\"display: inline-block; vertical-align: top;\">\n\t<blockquote>With just 12 years left to save the planet, here&#39;s <span class=\"twitter-atreply pretty-link js-nav\" dir=\"ltr\" data-mentioned-user-id=\"23759767\" ><s>@<\/s><b>RachelParris<\/b><\/span> on why we CAN&#39;T let the world go floppy! <img class=\"Emoji Emoji--forText\" src=\"https:\/\/abs.twimg.com\/emoji\/v2\/72x72\/1f30d.png\" draggable=\"false\" alt=\"\ud83c\udf0d\" title=\"Europa-Afrika auf dem Globus\" aria-label=\"Emoji: Europa-Afrika auf dem Globus\" style=\" height: 1em;\"> <span data-query-source=\"hashtag_click\" class=\"twitter-hashtag pretty-link js-nav\" dir=\"ltr\" ><s>#<\/s><b>TheMashReport<\/b><\/span> <span class=\"twitter-timeline-link u-hidden\" data-pre-embedded=\"true\" dir=\"ltr\" >pic.twitter.com\/RyoI19u2Ed<\/span><\/blockquote>\n<\/div>\n<div style=\"display: block; vertical-align: top;\">\n\t<blockquote><a href=\"https:\/\/pbs.twimg.com\/amplify_video_thumb\/1060104074606059520\/img\/WnBmUi13811Y3r1C.jpg:orig\">\n<img\n\tstyle=\"align:top; max-width:558px; border:1px solid black;\"\n\tsrc=\"https:\/\/pbs.twimg.com\/amplify_video_thumb\/1060104074606059520\/img\/WnBmUi13811Y3r1C.jpg:thumb\" \/>\n<\/a><\/blockquote>\n<\/div>"

Edit: The JSON is formatted that way because of my browser; it should return regular text on the CLI.

Plaintext

[content] => <div style="display: inline-block; vertical-align: top;">
	<a href="https://twitter.com/rachelparris">
<img
	style="align:top; width:75px; border:1px solid black;"
	alt="rachelparris"
	src="https://pbs.twimg.com/profile_images/947121157307854848/7HzYN27O_bigger.jpg"
	title="Rachel Parris" />
</a>
</div>
<div style="display: inline-block; vertical-align: top;">
	<blockquote>Hey folks! Watch this if you like Earth a bit!  <a href="https://twitter.com/hashtag/TheMashReport?src=hash" dir="ltr" ><s>#</s><b>TheMashReport</b></a>  <a href="https://twitter.com/hashtag/climatechange?src=hash" dir="ltr" ><s>#</s><b>climatechange</b></a> <a href="https://twitter.com/BBCTwo/status/1060110060410601472" dir="ltr" ><span class="tco-ellipsis"></span><span class="js-display-url">twitter.com/BBCTwo/status/</span><span class="tco-ellipsis">…</span></a></blockquote>
</div>
<div style="display: block; vertical-align: top;">
	<blockquote></blockquote>
</div>
<hr>
<div style="display: inline-block; vertical-align: top;">
	<blockquote>With just 12 years left to save the planet, here&#39;s <span class="twitter-atreply pretty-link js-nav" dir="ltr" data-mentioned-user-id="23759767" ><s>@</s><b>RachelParris</b></span> on why we CAN&#39;T let the world go floppy! <img class="Emoji Emoji--forText" src="https://abs.twimg.com/emoji/v2/72x72/1f30d.png" draggable="false" alt="🌍" title="Europa-Afrika auf dem Globus" aria-label="Emoji: Europa-Afrika auf dem Globus" style=" height: 1em;"> <span data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" ><s>#</s><b>TheMashReport</b></span> <span class="twitter-timeline-link u-hidden" data-pre-embedded="true" dir="ltr" >pic.twitter.com/RyoI19u2Ed</span></blockquote>
</div>
<div style="display: block; vertical-align: top;">
	<blockquote><a href="https://pbs.twimg.com/amplify_video_thumb/1060104074606059520/img/WnBmUi13811Y3r1C.jpg:orig">
<img
	style="align:top; max-width:558px; border:1px solid black;"
	src="https://pbs.twimg.com/amplify_video_thumb/1060104074606059520/img/WnBmUi13811Y3r1C.jpg:thumb" />
</a></blockquote>
</div>

@triatic
Contributor Author

triatic commented Nov 15, 2018

Please check this tweet: https://twitter.com/rachelparris/status/1063121390856007685

@logmanoriginal
Member

It looks fine in my browser:

image

Since it is HTML inside XML, it is encoded twice:

&amp;quot; is the HTML-in-XML form that decodes to &quot;, which is HTML that in turn decodes to ". With htmlspecialchars_decode added, the feed still contains &quot;. That is fine, but it may not be what you want?
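
The two encoding layers can be reproduced with a small sketch. The sample text and variable names are hypothetical; htmlspecialchars() stands in here for whatever escaping the bridge and the feed format actually apply:

```php
<?php
// Plain text containing a double quote.
$text = 'She said "hi"';

// First layer: the text is escaped as HTML.
$asHtml = htmlspecialchars($text, ENT_QUOTES);
// $asHtml is: She said &quot;hi&quot;

// Second layer: the HTML is escaped again for embedding in XML.
$inXml = htmlspecialchars($asHtml, ENT_QUOTES);
// $inXml is: She said &amp;quot;hi&amp;quot;

// An XML parser undoes the outer layer and hands HTML to the reader...
$afterXmlParse = htmlspecialchars_decode($inXml, ENT_QUOTES);

// ...and the HTML renderer undoes the inner layer.
$rendered = htmlspecialchars_decode($afterXmlParse, ENT_QUOTES);

echo $inXml, PHP_EOL;
echo $afterXmlParse, PHP_EOL;
echo $rendered, PHP_EOL;
```

Each decode step removes exactly one layer, which is why a feed that shows &amp;quot; in its raw source can still render as a plain double quote.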

@logmanoriginal logmanoriginal merged commit e5a6baa into RSS-Bridge:master Nov 15, 2018
@logmanoriginal
Member

Merged. Thanks for the fix 👍

@triatic
Contributor Author

triatic commented Nov 15, 2018

&quot; works fine for me.

@triatic triatic deleted the patch-9 branch November 21, 2018 16:36
infominer33 pushed a commit to web-work-tools/rss-bridge that referenced this pull request Apr 17, 2020
Removes duplicate encoding like &amp;quot; (should be &quot;).