
Link parsing: Pinboard private feeds don't seem to get parsed properly #106

Closed
drpfenderson opened this issue Oct 18, 2018 · 19 comments
Labels
status: needs followup (Work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers), type: bug report

Comments

@drpfenderson

I would love to have the cron job that monitors my Pocket feed also monitor my private Pinboard feed. However, no matter which method I use to pass the feed to bookmark-archiver per the instructions, each fails in its own way.

If I pass a public feed, like http://feeds.pinboard.in/rss/u:username/, it works fine. But if I pass a private feed, like https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/, it errors out. I have tried the RSS, JSON, and Text feeds, and none work.

Examples here (I've simply replaced the actual feed I used for testing with the demo URL Pinboard provides):
./archive "https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/"

[*] [2018-10-18 21:14:03] Downloading https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897243.txt
[X] No links found :(

./archive "https://feeds.pinboard.in/json/secret:xxxx/u:username/private/"

[*] [2018-10-18 21:13:46] Downloading https://feeds.pinboard.in/json/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897226.txt
Traceback (most recent call last):
  File "./archive", line 161, in <module>
    links = merge_links(archive_path=out_dir, import_path=source)
  File "./archive", line 53, in merge_links
    raw_links = parse_links(import_path)
  File "/home/USERNAME/datahoarding/bookmark-archiver/archiver/parse.py", line 54, in parse_links
    links += list(parser_func(file))
  File "/home/USERNAME/bookmark-archiver/archiver/parse.py", line 108, in parse_json_export
    url = erg['url']
KeyError: 'url'

./archive "https://feeds.pinboard.in/text/secret:xxxx/u:username/private/"

[*] [2018-10-18 21:17:57] Downloading https://feeds.pinboard.in/text/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897477.txt
[X] No links found :(

Even though the script says that links are not found, they are definitely there, and simply pasting the URL into a browser outputs the feed in the proper format. I have used this script successfully with other methods, like the Pinboard manual export, the Pocket manual export and RSS feed, and browser export. Is this just not a supported method for importing/monitoring?
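A parser that tolerates the private feed's key naming would sidestep the KeyError above; a minimal sketch (`parse_pinboard_json` is a hypothetical name, and the `'href'` fallback is an assumption based on the key naming Pinboard's JSON exports use, where the parser in the traceback expected `'url'`):

```python
import json

def parse_pinboard_json(json_text):
    """Yield link dicts from a Pinboard JSON feed/export (hypothetical
    sketch). Tolerates either 'url' or 'href' as the link key, since the
    private feed appears to use 'href' where the parser expected 'url'."""
    for entry in json.loads(json_text):
        url = entry.get('url') or entry.get('href')
        if not url:
            continue  # skip malformed entries instead of raising KeyError
        yield {
            'url': url,
            # Pinboard stores the bookmark title in 'description'
            'title': entry.get('description') or None,
            'tags': entry.get('tags', ''),
            'timestamp': entry.get('time'),
        }
```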

@pirate
Member

pirate commented Oct 19, 2018

Looks like there's some difference in the output JSON format for private feeds that's breaking the parser. Can you post a copy of output/sources/feeds.pinboard.in-1539897226.txt in a gist somewhere (redacted/edited to hide the links if you want)?

@pirate pirate added type: bug report status: needs followup Work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers labels Oct 19, 2018
@pirate pirate changed the title Pinboard live (private) feeds don't seem to get parsed properly Link parsing: Pinboard private feeds don't seem to get parsed properly Oct 19, 2018
@drpfenderson
Author

@pirate Here is a link to the output of that file.

https://gist.github.com/drpfenderson/245c99f148b30cbf83dd3588c2fb0885

@f0086
Contributor

f0086 commented Oct 19, 2018

I ran into the same problem. I solved it with a little Go program that logs in to Pinboard and clicks the actual "backup my bookmarks in legacy Netscape format" button -- which works fine for me.

package main

import (
  "gopkg.in/headzoo/surf.v1"
  "os"
  "flag"
)

var username = flag.String("username", "", "pinboard username")
var password = flag.String("password", "", "pinboard password")

func main() {
  flag.Parse()

  bow := surf.NewBrowser()
  err := bow.Open("https://pinboard.in/")
  if err != nil {
    panic(err)
  }

  form, formErr := bow.Form("form[name=login]")
  if formErr != nil {
    panic(formErr)
  }

  form.Input("username", *username)
  form.Input("password", *password)
  if err := form.Submit(); err != nil {
    panic(err)
  }

  err = bow.Open("https://pinboard.in/export/format:html/")
  if err != nil {
    panic(err)
  }

  bow.Download(os.Stdout)
}
$ export GOPATH=.
$ go get gopkg.in/headzoo/surf.v1
$ go build src/aaron-fischer.net/fupin/main.go
$ ./fuPin -username=[USERNAME] -password=[PASSWORD] > bookmarks.html

@drpfenderson
Author

Do you still need my Gist up for this? Or can I make it private?

@pirate
Member

pirate commented Nov 12, 2018

I only need one or two links in the file to debug this, so if you can keep a version up with just one or two links (can be example.com) in the same format, that would be helpful.

@f0086
Contributor

f0086 commented Nov 19, 2018

From the settings->backup page:

Legacy HTML (seems to be broken HTML/XML?)

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Pinboard Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL>
<p>

<DT><A HREF="https://github.com/trailofbits/algo" ADD_DATE="1542616733" PRIVATE="1" TOREAD="1" TAGS="vpn,scripts,toread">Algo VPN scripts</A>
<DT><A HREF="http://www.ulisp.com/" ADD_DATE="1542374412" PRIVATE="1" TOREAD="1" TAGS="arduino,avr,embedded,lisp,toread">uLisp</A>

</DL>
</p>

XML

<?xml version="1.0" encoding="UTF-8"?>
	<posts user="aaronmueller">
<post href="https://github.com/trailofbits/algo" time="2018-11-19T08:38:53Z" description="Algo VPN scripts" extended="" tag="vpn scripts" hash="18d708f67bb26d843b1cac4530bb52aa"  shared="no" toread="yes" />
<post href="http://www.ulisp.com/" time="2018-11-16T13:20:12Z" description="uLisp" extended="" tag="arduino avr embedded lisp" hash="2a17ae95925a03a5b9bb38cf7f6c6f9b"  shared="no" toread="yes" />
</posts>

JSON

[{"href":"https:\/\/github.com\/trailofbits\/algo","description":"Algo VPN scripts","extended":"","meta":"62325ba3b577683aee854d7f191034dc","hash":"18d708f67bb26d843b1cac4530bb52aa","time":"2018-11-19T08:38:53Z","shared":"no","toread":"yes","tags":"vpn scripts"},
{"href":"http:\/\/www.ulisp.com\/","description":"uLisp","extended":"","meta":"7bd0c0ef31f69d1459e3d37366e742b3","hash":"2a17ae95925a03a5b9bb38cf7f6c6f9b","time":"2018-11-16T13:20:12Z","shared":"no","toread":"yes","tags":"arduino avr embedded lisp"}]

Private RSS feed:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://web.resource.org/cc/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://pinboard.in">
    <title>Pinboard (private aaronmueller)</title>
    <link>https://pinboard.in/u:aaronmueller/private/</link>
    <description></description>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="https://mehkee.com/"/>
        <rdf:li rdf:resource="https://qmk.fm/"/>
      </rdf:Seq>
    </items>
  </channel>

  <item rdf:about="https://mehkee.com/">
    <title>Mehkee - Mechanical Keyboard Parts &amp; Accessories</title>
    <dc:date>2018-11-08T21:29:32+00:00</dc:date>
    <link>https://mehkee.com/</link>
    <dc:creator>aaronmueller</dc:creator>
    <dc:subject>keyboard gadget diy</dc:subject>
    <dc:source>http://pinboard.in/</dc:source>
    <dc:identifier>http://pinboard.in/u:aaronmueller/b:xxx/</dc:identifier>
    <taxo:topics>
      <rdf:Bag>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:keyboard"/>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:gadget"/>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:diy"/>
      </rdf:Bag>
    </taxo:topics>
  </item>
  <item rdf:about="https://qmk.fm/">
    <title>QMK Firmware - An open source firmware for AVR and ARM based keyboards</title>
    <dc:date>2018-11-06T22:36:21+00:00</dc:date>
    <link>https://qmk.fm/</link>
    <dc:creator>aaronmueller</dc:creator>
    <dc:subject>firmware keyboard</dc:subject>
    <dc:source>http://pinboard.in/</dc:source>
    <dc:identifier>http://pinboard.in/u:aaronmueller/b:xxx/</dc:identifier>
    <taxo:topics>
      <rdf:Bag>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:firmware"/>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:keyboard"/>
      </rdf:Bag>
    </taxo:topics>
  </item>
</rdf:RDF>

@pirate
Member

pirate commented Feb 4, 2019

Can you try the latest master? It might work now... although it might try to import all the extra pinboard links that aren't articles too.

@f0086
Contributor

f0086 commented Feb 4, 2019

Sorry, it does not work (or am I missing something?). It downloads the bookmarks but then hangs forever. This is the traceback after hitting CTRL+C:

└─ $ ▶ FETCH_PDF=False TIMEOUT=20 ONLY_NEW=True SUBMIT_ARCHIVE_DOT_ORG=False CHECK_SSL_VALIDITY=False WGET_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/2010" ./archive "https://feeds.pinboard.in/rss/secret:xxx/u:yyy/"
[*] [2019-02-04 20:23:46] Downloading https://feeds.pinboard.in/rss/secret:xxx/u:yyy/ > output/sources/feeds.pinboard.in-1549308226.txt
^CTraceback (most recent call last):                                                                                                                                     
  File "./archive", line 189, in <module>
    links = merge_links(archive_path=out_dir, import_path=source, only_new=False)
  File "./archive", line 62, in merge_links
    raw_links = parse_links(import_path)
  File "/tmp/ArchiveBox/archivebox/parse.py", line 59, in parse_links
    links += list(parser_func(file))
  File "/tmp/ArchiveBox/archivebox/parse.py", line 271, in parse_plain_text
    'title': fetch_page_title(url),
  File "/tmp/ArchiveBox/archivebox/util.py", line 236, in fetch_page_title
    html_content = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 1345, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.7/urllib/request.py", line 1320, in do_open
    r = h.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 257, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt

@pirate
Member

pirate commented Feb 4, 2019

I'm assuming you're importing a lot of links; if so, that's normal. It can take up to 10s per link to fetch the title if no title was found in the Pinboard import.

@f0086
Contributor

f0086 commented Feb 4, 2019

You are right, I just needed to wait. But it did not work: the archiver tried to download each tag(!) for each bookmark, like "http://pinboard.in/u:yyy/t:lectures". Currently I do not have time to debug this further :(
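One way to keep Pinboard's own user/tag pages out of an import is to filter them by URL pattern before archiving. A hedged sketch (the regex is an assumption based on the URLs quoted in this thread, not ArchiveBox's actual filter):

```python
import re

# Hypothetical filter: Pinboard's own user/tag/bookmark-detail pages look
# like http://pinboard.in/u:<user>/t:<tag> or /u:<user>/b:<hash> and are
# feed metadata, not bookmarks worth archiving.
PINBOARD_META_URL = re.compile(r'^https?://(?:www\.)?pinboard\.in/u:[^/]+/')

def is_bookmark_url(url):
    """Return True for URLs worth archiving, False for Pinboard's own
    user/tag pages that leak into the feed as rdf:resource/dc:identifier."""
    return PINBOARD_META_URL.match(url) is None
```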

@pirate
Member

pirate commented Feb 5, 2019

Ok, I just made a bunch of fixes and tested against all four of the snippets you posted above. All of them parsed correctly and only extracted the article links, without all the other Pinboard tag URLs.

Give the latest version of master a try.

@f0086
Contributor

f0086 commented Feb 5, 2019

I am very sorry, but it does not work. You are using the wrong URLs. You need to use the URL in the <link></link> tag. I will have a look at this.

#123 seems related to this :)

EDIT: Ok, I had a quick look at the code but did not find a proper solution. I think the xml.etree.ElementTree module is not behaving as expected, but I am not a Python guy, so I'm not sure about that. My setup (see above) works great for me, so I have no interest in spending an evening debugging this for now, sorry :( Maybe it is not worth it anyway, because of #123?
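For what it's worth, the ElementTree suspicion is plausible: RSS 1.0 feeds like the ones above put `<item>` and `<link>` in a default namespace, so unqualified tag lookups silently return nothing. A minimal sketch of pulling just the bookmark `<link>` URLs (`extract_links` is a hypothetical helper, assuming the feed structure shown earlier in this thread):

```python
import xml.etree.ElementTree as ET

# RSS 1.0 declares xmlns="http://purl.org/rss/1.0/" on <rdf:RDF>, so every
# un-prefixed element (item, link, title) lives in this namespace.
RSS1_NS = '{http://purl.org/rss/1.0/}'

def extract_links(xml_text):
    """Return the <link> URL of each <item> in a Pinboard RSS 1.0 feed.

    Lookups must be namespace-qualified: root.findall('item') would
    return an empty list, which is one way a parser can 'see' no links
    even though they are clearly in the file.
    """
    root = ET.fromstring(xml_text)
    return [
        item.find(RSS1_NS + 'link').text
        for item in root.findall(RSS1_NS + 'item')
    ]
```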

@drpfenderson
Author

drpfenderson commented Feb 5, 2019

Seems to work for me on the most recent master (ce25794). :) Thanks a ton.

My original issue doesn't seem to be the same problem that @f0086 is dealing with.

@pirate
Member

pirate commented Feb 7, 2019

@f0086 when you get a chance, do you mind pulling the latest master and trying it? I've made a bunch of fixes to the parsers in the last 3 days, and now it'll tell you exactly why the parser fails if you uncomment this line:

archivebox/parse.py:75

# print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err))

If it still doesn't work, after uncommenting that line you can copy/paste the error output here and I'll debug it for you :)
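The fallback chain behind that debug line can be sketched as follows (hypothetical names; ArchiveBox's real parse_links differs in detail):

```python
def try_parsers(text, parsers):
    """Try each (name, parser) pair in order and return the links from the
    first one that yields anything, logging why the others failed.
    (Hypothetical sketch of the fallback pattern, not ArchiveBox's code.)"""
    for name, parser in parsers:
        try:
            links = list(parser(text))
            if links:
                return name, links
        except Exception as err:
            # Mirrors the commented-out debug line in parse.py
            print('[!] Parser {} failed: {} {}'.format(
                name, err.__class__.__name__, err))
    return None, []
```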

@f0086
Contributor

f0086 commented Feb 10, 2019

Here we go:

└─ $ ▶ FETCH_PDF=False TIMEOUT=20 ONLY_NEW=True SUBMIT_ARCHIVE_DOT_ORG=False CHECK_SSL_VALIDITY=False WGET_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/2010" ./archive "https://feeds.pinboard.in/rss/secret:yyy/u:zzz/private"
[*] [2019-02-10 21:17:13] Downloading https://feeds.pinboard.in/rss/secret:yyy/u:zzz/private > output/sources/feeds.pinboard.in-xxx.txt
[*] [2019-02-10 21:17:14] Parsing new links from output/sources/feeds.pinboard.in-xxx.txt and fetching titles...                                                  
    [!] Parser Pinboard JSON failed: JSONDecodeError Expecting value: line 1 column 1 (char 0)
[!] Parser RSS failed: IndexError list index out of range
[!] Parser Pinboard RSS failed: AttributeError 'NoneType' object has no attribute 'text'
[!] Parser Medium RSS failed: AttributeError 'NoneType' object has no attribute 'findall'

@pirate
Member

pirate commented Mar 1, 2019

I think part of the issue was that I was fetching page titles without showing progress, so it looked like it was hanging forever or breaking when it was actually doing work.

That's all been changed significantly now: title fetching is treated like any other archive method instead of happening during the parsing phase.

Try pulling the latest master and running it again. If you're still having issues, I'll need two things to debug it:

  1. A redacted copy of the failing import file output/sources/feeds.pinboard.in-xxx.txt
  2. The terminal output with that print statement on parse.py:56 uncommented

@f0086
Contributor

f0086 commented Mar 9, 2019

└─ $ ▶ FETCH_PDF=False TIMEOUT=20 ONLY_NEW=True SUBMIT_ARCHIVE_DOT_ORG=False CHECK_SSL_VALIDITY=False WGET_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/2010" ./archive "https://feeds.pinboard.in/rss/secret:XXX/u:YYY/private"
[*] [2019-03-09 17:43:21] Downloading https://feeds.pinboard.in/rss/secret:XXX/u:YYY/private
    > output/sources/feeds.pinboard.in-xxx.txt                                                                                                                    
[*] [2019-03-09 17:43:23] Parsing new links from output/sources/feeds.pinboard.in-xxx.txt...
[!] Parser Pinboard JSON failed: JSONDecodeError Expecting value: line 1 column 1 (char 0)
[!] Parser RSS failed: IndexError list index out of range
[!] Parser Pinboard RSS failed: AttributeError 'NoneType' object has no attribute 'text'
[!] Parser Medium RSS failed: AttributeError 'NoneType' object has no attribute 'findall'
    > Adding 207 new links to index (parsed import as Plain Text)
[*] [2019-03-09 17:43:23] Updating main index files...
...


<?xml version="1.0" encoding="UTF-8"?>
 <rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://web.resource.org/cc/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://pinboard.in">
    <title>Pinboard (private YYY)</title>
    <link>https://pinboard.in/u:YYY/private/</link>
    <description></description>
    <items>
      <rdf:Seq>
	<rdf:li rdf:resource="https://bugs.archlinux.org/task/56957"/>
	<rdf:li rdf:resource="https://www.oldunreal.com/wiki/index.php?title=Beginner%27s_Guide_to_Unrealscript"/>
      </rdf:Seq>
    </items>
    </channel>
<item rdf:about="https://bugs.archlinux.org/task/56957">
    <title>FS#56957 : [systemd] systemd-networkd crash after updating to linux 4.14.11</title>
    <dc:date>2019-02-10T19:46:52+00:00</dc:date>
    <link>https://bugs.archlinux.org/task/56957</link>
    <dc:creator>YYY</dc:creator><description><![CDATA[<blockquote>Flyspray, a Bug Tracking System written in PHP.</blockquote>]]></description>
<dc:identifier>http://pinboard.in/u:YYY/b:ZZZ/</dc:identifier>
</item>
<item rdf:about="https://www.oldunreal.com/wiki/index.php?title=Beginner%27s_Guide_to_Unrealscript">
    <title>UnrealScript Beginners Guide</title>
    <dc:date>2019-02-08T14:24:34+00:00</dc:date>
    <link>https://www.oldunreal.com/wiki/index.php?title=Beginner%27s_Guide_to_Unrealscript</link>
    <dc:creator>YYY</dc:creator><dc:subject>unreal</dc:subject>
<dc:source>http://pinboard.in/</dc:source>
<dc:identifier>http://pinboard.in/u:YYY/b:ZZZ/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="http://pinboard.in/u:YYY/t:unreal"/>
</rdf:Bag></taxo:topics>
</item>
</rdf:RDF>

@pirate pirate added status: wip Work is in-progress / has already been partially completed and removed status: needs followup Work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers labels Mar 12, 2019
@pirate
Member

pirate commented Mar 19, 2019

Fixed in f9a7c53, give the latest master a shot and let me know if it works.

@pirate pirate added status: needs followup Work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers and removed status: wip Work is in-progress / has already been partially completed labels Mar 20, 2019
@f0086
Contributor

f0086 commented Mar 21, 2019

Looking good.
This will finally fix this issue, thank you!

@pirate pirate closed this as completed Mar 21, 2019
pirate added a commit that referenced this issue Mar 14, 2024
Fixes #1171
Fixes #870 (probably, would need to test against a Wallabag Atom file to
Fixes #135
Fixes #123
Fixes #106