
Link parsing: Pinboard private feeds don't seem to get parsed properly #106

Closed
drpfenderson opened this issue Oct 18, 2018 · 19 comments
Labels
status: needs followup (Work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers), type: bug report

Comments

@drpfenderson

I would love to have the cron job that monitors my Pocket feed also monitor my private Pinboard feed. However, no matter which method I use to pass the feed to bookmark-archiver per the instructions, each fails in its own way.

If I pass a public feed, like http://feeds.pinboard.in/rss/u:username/, it works fine. But if I pass a private feed, like https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/, it errors out. I have tried the RSS, JSON, and Text feeds, and none work.

Examples here (I've simply replaced the actual feed I used for testing with the demo URL Pinboard provides):
./archive "https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/"

[*] [2018-10-18 21:14:03] Downloading https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897243.txt
[X] No links found :(

./archive "https://feeds.pinboard.in/json/secret:xxxx/u:username/private/"

[*] [2018-10-18 21:13:46] Downloading https://feeds.pinboard.in/json/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897226.txt
Traceback (most recent call last):
  File "./archive", line 161, in <module>
    links = merge_links(archive_path=out_dir, import_path=source)
  File "./archive", line 53, in merge_links
    raw_links = parse_links(import_path)
  File "/home/USERNAME/datahoarding/bookmark-archiver/archiver/parse.py", line 54, in parse_links
    links += list(parser_func(file))
  File "/home/USERNAME/bookmark-archiver/archiver/parse.py", line 108, in parse_json_export
    url = erg['url']
KeyError: 'url'

./archive "https://feeds.pinboard.in/text/secret:xxxx/u:username/private/"

[*] [2018-10-18 21:17:57] Downloading https://feeds.pinboard.in/text/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897477.txt
[X] No links found :(

Even though the script says that links are not found, they are definitely there, and simply pasting the URL into a browser outputs the feed in the proper format. I have used this script successfully with other methods, like the Pinboard manual export, the Pocket manual export and RSS feed, and browser export. Is this just not a supported method for importing/monitoring?
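A parser that tolerates the private feed's key naming would sidestep the KeyError above; a minimal sketch (`parse_pinboard_json` is a hypothetical name, and the `'href'` fallback is an assumption based on the key naming Pinboard's JSON exports use, where the parser in the traceback expected `'url'`):

```python
import json

def parse_pinboard_json(json_text):
    """Yield link dicts from a Pinboard JSON feed/export (hypothetical
    sketch). Tolerates either 'url' or 'href' as the link key, since the
    private feed appears to use 'href' where the parser expected 'url'."""
    for entry in json.loads(json_text):
        url = entry.get('url') or entry.get('href')
        if not url:
            continue  # skip malformed entries instead of raising KeyError
        yield {
            'url': url,
            # Pinboard stores the bookmark title in 'description'
            'title': entry.get('description') or None,
            'tags': entry.get('tags', ''),
            'timestamp': entry.get('time'),
        }
```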

@pirate
Member

pirate commented Oct 19, 2018

Looks like there's some difference in the output JSON format for private feeds that's breaking the parser. Can you post a copy of output/sources/feeds.pinboard.in-1539897226.txt in a gist somewhere (redacted/edited to hide the links if you want)?

@pirate pirate added type: bug report status: needs followup Work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers labels Oct 19, 2018
@pirate pirate changed the title Pinboard live (private) feeds don't seem to get parsed properly Link parsing: Pinboard private feeds don't seem to get parsed properly Oct 19, 2018
@drpfenderson
Author

@pirate Here is a link to the output of that file.

https://gist.github.com/drpfenderson/245c99f148b30cbf83dd3588c2fb0885

@f0086
Contributor

f0086 commented Oct 19, 2018

I ran into the same problem. I solved it with a little Go program that logs in to Pinboard and clicks the actual "backup my bookmarks in legacy Netscape format" button -- which works fine for me.

package main

import (
  "gopkg.in/headzoo/surf.v1"
  "os"
  "flag"
)

var username = flag.String("username", "", "pinboard username")
var password = flag.String("password", "", "pinboard password")

func main() {
  flag.Parse()

  bow := surf.NewBrowser()
  err := bow.Open("https://pinboard.in/")
  if err != nil {
    panic(err)
  }

  form, formErr := bow.Form("form[name=login]")
  if formErr != nil {
    panic(formErr)
  }

  form.Input("username", *username)
  form.Input("password", *password)
  if err := form.Submit(); err != nil {
    panic(err)
  }

  err = bow.Open("https://pinboard.in/export/format:html/")
  if err != nil {
    panic(err)
  }

  bow.Download(os.Stdout)
}
$ export GOPATH=.
$ go get gopkg.in/headzoo/surf.v1
$ go build src/aaron-fischer.net/fupin/main.go
$ ./fuPin -username=[USERNAME] -password=[PASSWORD] > bookmarks.html

@drpfenderson
Author

Do you still need my Gist up for this? Or can I make it private?

@pirate
Member

pirate commented Nov 12, 2018

I only need one or two links in the file to debug this, so if you can keep a version up with just one or two links (can be example.com) in the same format, that would be helpful.

@f0086
Contributor

f0086 commented Nov 19, 2018

From the settings->backup page:

Legacy HTML (seems to be broken HTML/XML?)

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Pinboard Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL>
<p>

<DT><A HREF="https://github.com/trailofbits/algo" ADD_DATE="1542616733" PRIVATE="1" TOREAD="1" TAGS="vpn,scripts,toread">Algo VPN scripts</A>
<DT><A HREF="http://www.ulisp.com/" ADD_DATE="1542374412" PRIVATE="1" TOREAD="1" TAGS="arduino,avr,embedded,lisp,toread">uLisp</A>

</DL>
</p>

XML

<?xml version="1.0" encoding="UTF-8"?>
	<posts user="aaronmueller">
<post href="https://github.com/trailofbits/algo" time="2018-11-19T08:38:53Z" description="Algo VPN scripts" extended="" tag="vpn scripts" hash="18d708f67bb26d843b1cac4530bb52aa"  shared="no" toread="yes" />
<post href="http://www.ulisp.com/" time="2018-11-16T13:20:12Z" description="uLisp" extended="" tag="arduino avr embedded lisp" hash="2a17ae95925a03a5b9bb38cf7f6c6f9b"  shared="no" toread="yes" />
</posts>

JSON

[{"href":"https:\/\/github.com\/trailofbits\/algo","description":"Algo VPN scripts","extended":"","meta":"62325ba3b577683aee854d7f191034dc","hash":"18d708f67bb26d843b1cac4530bb52aa","time":"2018-11-19T08:38:53Z","shared":"no","toread":"yes","tags":"vpn scripts"},
{"href":"http:\/\/www.ulisp.com\/","description":"uLisp","extended":"","meta":"7bd0c0ef31f69d1459e3d37366e742b3","hash":"2a17ae95925a03a5b9bb38cf7f6c6f9b","time":"2018-11-16T13:20:12Z","shared":"no","toread":"yes","tags":"arduino avr embedded lisp"}]

Private RSS feed:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://web.resource.org/cc/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://pinboard.in">
    <title>Pinboard (private aaronmueller)</title>
    <link>https://pinboard.in/u:aaronmueller/private/</link>
    <description></description>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="https://mehkee.com/"/>
        <rdf:li rdf:resource="https://qmk.fm/"/>
      </rdf:Seq>
    </items>
  </channel>

  <item rdf:about="https://mehkee.com/">
    <title>Mehkee - Mechanical Keyboard Parts &amp; Accessories</title>
    <dc:date>2018-11-08T21:29:32+00:00</dc:date>
    <link>https://mehkee.com/</link>
    <dc:creator>aaronmueller</dc:creator>
    <dc:subject>keyboard gadget diy</dc:subject>
    <dc:source>http://pinboard.in/</dc:source>
    <dc:identifier>http://pinboard.in/u:aaronmueller/b:xxx/</dc:identifier>
    <taxo:topics>
      <rdf:Bag>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:keyboard"/>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:gadget"/>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:diy"/>
      </rdf:Bag>
    </taxo:topics>
  </item>
  <item rdf:about="https://qmk.fm/">
    <title>QMK Firmware - An open source firmware for AVR and ARM based keyboards</title>
    <dc:date>2018-11-06T22:36:21+00:00</dc:date>
    <link>https://qmk.fm/</link>
    <dc:creator>aaronmueller</dc:creator>
    <dc:subject>firmware keyboard</dc:subject>
    <dc:source>http://pinboard.in/</dc:source>
    <dc:identifier>http://pinboard.in/u:aaronmueller/b:xxx/</dc:identifier>
    <taxo:topics>
      <rdf:Bag>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:firmware"/>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:keyboard"/>
      </rdf:Bag>
    </taxo:topics>
  </item>
</rdf:RDF>

@pirate
Member

pirate commented Feb 4, 2019

Can you try the latest master? It might work now... although it might try to import all the extra pinboard links that aren't articles too.

@f0086
Contributor

f0086 commented Feb 4, 2019

Sorry, it does not work (or am I missing something?). It downloads the bookmarks but then hangs forever. This is the traceback after hitting CTRL+C:

└─ $ ▶ FETCH_PDF=False TIMEOUT=20 ONLY_NEW=True SUBMIT_ARCHIVE_DOT_ORG=False CHECK_SSL_VALIDITY=False WGET_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/2010" ./archive "https://feeds.pinboard.in/rss/secret:xxx/u:yyy/"
[*] [2019-02-04 20:23:46] Downloading https://feeds.pinboard.in/rss/secret:xxx/u:yyy/ > output/sources/feeds.pinboard.in-1549308226.txt
^CTraceback (most recent call last):                                                                                                                                     
  File "./archive", line 189, in <module>
    links = merge_links(archive_path=out_dir, import_path=source, only_new=False)
  File "./archive", line 62, in merge_links
    raw_links = parse_links(import_path)
  File "/tmp/ArchiveBox/archivebox/parse.py", line 59, in parse_links
    links += list(parser_func(file))
  File "/tmp/ArchiveBox/archivebox/parse.py", line 271, in parse_plain_text
    'title': fetch_page_title(url),
  File "/tmp/ArchiveBox/archivebox/util.py", line 236, in fetch_page_title
    html_content = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 1345, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.7/urllib/request.py", line 1320, in do_open
    r = h.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 257, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt

@pirate
Member

pirate commented Feb 4, 2019

I'm assuming you're importing a lot of links; if so, that's normal. It can take up to 10s per link to fetch the title if no title was found in the Pinboard import.

@f0086
Contributor

f0086 commented Feb 4, 2019

You are right, I just needed to wait. But it did not work: the archiver tried to download each tag(!) for each bookmark, like "http://pinboard.in/u:yyy/t:lectures". Currently I do not have time to debug this further :(
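One way to keep Pinboard's own user/tag pages out of an import is to filter them by URL pattern before archiving. A hedged sketch (the regex is an assumption based on the URLs quoted in this thread, not ArchiveBox's actual filter):

```python
import re

# Hypothetical filter: Pinboard's own user/tag/bookmark-detail pages look
# like http://pinboard.in/u:<user>/t:<tag> or /u:<user>/b:<hash> and are
# feed metadata, not bookmarks worth archiving.
PINBOARD_META_URL = re.compile(r'^https?://(?:www\.)?pinboard\.in/u:[^/]+/')

def is_bookmark_url(url):
    """Return True for URLs worth archiving, False for Pinboard's own
    user/tag pages that leak into the feed as rdf:resource/dc:identifier."""
    return PINBOARD_META_URL.match(url) is None
```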

@pirate
Member

pirate commented Feb 5, 2019

Ok, I just made a bunch of fixes and tested against all four of the snippets you posted above. All of them parsed correctly and only extracted the article links, without all the other Pinboard tag URLs.

Give the latest version of master a try.

@f0086
Contributor

f0086 commented Feb 5, 2019

I am very sorry, but it does not work. You are using the wrong URLs. You need to use the URL in the <link></link> tag. I will have a look at this.

#123 seems related to this :)

EDIT: Ok, I had a quick look at the code but did not find a proper solution. I think the xml.etree.ElementTree module is not behaving as expected, but I am not a Python guy, so I'm not sure about that. My setup (see above) works great for me, so I have no interest in spending an evening debugging this for now, sorry :( Maybe it is not worth it anyway, because of #123?
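For what it's worth, the ElementTree suspicion is plausible: RSS 1.0 feeds like the ones above put `<item>` and `<link>` in a default namespace, so unqualified tag lookups silently return nothing. A minimal sketch of pulling just the bookmark `<link>` URLs (`extract_links` is a hypothetical helper, assuming the feed structure shown earlier in this thread):

```python
import xml.etree.ElementTree as ET

# RSS 1.0 declares xmlns="http://purl.org/rss/1.0/" on <rdf:RDF>, so every
# un-prefixed element (item, link, title) lives in this namespace.
RSS1_NS = '{http://purl.org/rss/1.0/}'

def extract_links(xml_text):
    """Return the <link> URL of each <item> in a Pinboard RSS 1.0 feed.

    Lookups must be namespace-qualified: root.findall('item') would
    return an empty list, which is one way a parser can 'see' no links
    even though they are clearly in the file.
    """
    root = ET.fromstring(xml_text)
    return [
        item.find(RSS1_NS + 'link').text
        for item in root.findall(RSS1_NS + 'item')
    ]
```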

@drpfenderson
Author

drpfenderson commented Feb 5, 2019

Seems to work for me on the most recent master (ce25794). :) Thanks a ton.

My original issue doesn't seem to be the same problem that @f0086 is dealing with.

@pirate
Member

pirate commented Feb 7, 2019

@f0086 when you get a chance, do you mind pulling the latest master and trying it? I've made a bunch of fixes to the parsers in the last 3 days, and now it'll tell you exactly why the parser fails if you uncomment this line:

archivebox/parse.py:75

# print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err))

If it still doesn't work, after uncommenting that line you can copy/paste the error output here and I'll debug it for you :)
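The fallback chain behind that debug line can be sketched as follows (hypothetical names; ArchiveBox's real parse_links differs in detail):

```python
def try_parsers(text, parsers):
    """Try each (name, parser) pair in order and return the links from the
    first one that yields anything, logging why the others failed.
    (Hypothetical sketch of the fallback pattern, not ArchiveBox's code.)"""
    for name, parser in parsers:
        try:
            links = list(parser(text))
            if links:
                return name, links
        except Exception as err:
            # Mirrors the commented-out debug line in parse.py
            print('[!] Parser {} failed: {} {}'.format(
                name, err.__class__.__name__, err))
    return None, []
```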

@f0086
Contributor

f0086 commented Feb 10, 2019

Here we go:

└─ $ ▶ FETCH_PDF=False TIMEOUT=20 ONLY_NEW=True SUBMIT_ARCHIVE_DOT_ORG=False CHECK_SSL_VALIDITY=False WGET_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/2010" ./archive "https://feeds.pinboard.in/rss/secret:yyy/u:zzz/private"
[*] [2019-02-10 21:17:13] Downloading https://feeds.pinboard.in/rss/secret:yyy/u:zzz/private > output/sources/feeds.pinboard.in-xxx.txt
[*] [2019-02-10 21:17:14] Parsing new links from output/sources/feeds.pinboard.in-xxx.txt and fetching titles...                                                  
    [!] Parser Pinboard JSON failed: JSONDecodeError Expecting value: line 1 column 1 (char 0)
[!] Parser RSS failed: IndexError list index out of range
[!] Parser Pinboard RSS failed: AttributeError 'NoneType' object has no attribute 'text'
[!] Parser Medium RSS failed: AttributeError 'NoneType' object has no attribute 'findall'

@pirate
Member

pirate commented Mar 1, 2019

I think part of the issue was that I was fetching page titles without showing progress, so it looked like it was hanging forever or breaking when it was actually doing work.

That's all been changed significantly now: title fetching is treated like any other archive method instead of happening during the parsing phase.

Try pulling the latest master and running it again. If you're still having issues, I'll need two things to debug it:

  1. A redacted copy of the failing import file output/sources/feeds.pinboard.in-xxx.txt
  2. The terminal output with that print statement on parse.py:56 uncommented

@f0086
Contributor

f0086 commented Mar 9, 2019

└─ $ ▶ FETCH_PDF=False TIMEOUT=20 ONLY_NEW=True SUBMIT_ARCHIVE_DOT_ORG=False CHECK_SSL_VALIDITY=False WGET_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/2010" ./archive "https://feeds.pinboard.in/rss/secret:XXX/u:YYY/private"
[*] [2019-03-09 17:43:21] Downloading https://feeds.pinboard.in/rss/secret:XXX/u:YYY/private
    > output/sources/feeds.pinboard.in-xxx.txt                                                                                                                    
[*] [2019-03-09 17:43:23] Parsing new links from output/sources/feeds.pinboard.in-xxx.txt...
[!] Parser Pinboard JSON failed: JSONDecodeError Expecting value: line 1 column 1 (char 0)
[!] Parser RSS failed: IndexError list index out of range
[!] Parser Pinboard RSS failed: AttributeError 'NoneType' object has no attribute 'text'
[!] Parser Medium RSS failed: AttributeError 'NoneType' object has no attribute 'findall'
    > Adding 207 new links to index (parsed import as Plain Text)
[*] [2019-03-09 17:43:23] Updating main index files...
...


<?xml version="1.0" encoding="UTF-8"?>
 <rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://web.resource.org/cc/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://pinboard.in">
    <title>Pinboard (private YYY)</title>
    <link>https://pinboard.in/u:YYY/private/</link>
    <description></description>
    <items>
      <rdf:Seq>
	<rdf:li rdf:resource="https://bugs.archlinux.org/task/56957"/>
	<rdf:li rdf:resource="https://www.oldunreal.com/wiki/index.php?title=Beginner%27s_Guide_to_Unrealscript"/>
      </rdf:Seq>
    </items>
    </channel>
<item rdf:about="https://bugs.archlinux.org/task/56957">
    <title>FS#56957 : [systemd] systemd-networkd crash after updating to linux 4.14.11</title>
    <dc:date>2019-02-10T19:46:52+00:00</dc:date>
    <link>https://bugs.archlinux.org/task/56957</link>
    <dc:creator>YYY</dc:creator><description><![CDATA[<blockquote>Flyspray, a Bug Tracking System written in PHP.</blockquote>]]></description>
<dc:identifier>http://pinboard.in/u:YYY/b:ZZZ/</dc:identifier>
</item>
<item rdf:about="https://www.oldunreal.com/wiki/index.php?title=Beginner%27s_Guide_to_Unrealscript">
    <title>UnrealScript Beginners Guide</title>
    <dc:date>2019-02-08T14:24:34+00:00</dc:date>
    <link>https://www.oldunreal.com/wiki/index.php?title=Beginner%27s_Guide_to_Unrealscript</link>
    <dc:creator>YYY</dc:creator><dc:subject>unreal</dc:subject>
<dc:source>http://pinboard.in/</dc:source>
<dc:identifier>http://pinboard.in/u:YYY/b:ZZZ/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="http://pinboard.in/u:YYY/t:unreal"/>
</rdf:Bag></taxo:topics>
</item>
</rdf:RDF>

@pirate pirate added status: wip Work is in-progress / has already been partially completed and removed status: needs followup Work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers labels Mar 12, 2019
@pirate
Member

pirate commented Mar 19, 2019

Fixed in f9a7c53, give the latest master a shot and let me know if it works.

@pirate pirate added status: needs followup Work is stalled awaiting a follow-up from the original issue poster or ArchiveBox maintainers and removed status: wip Work is in-progress / has already been partially completed labels Mar 20, 2019
@f0086
Contributor

f0086 commented Mar 21, 2019

Looking good.
This will finally fix this issue, thank you!

@pirate pirate closed this as completed Mar 21, 2019
pirate added a commit that referenced this issue Mar 14, 2024
Fixes #1171
Fixes #870 (probably, would need to test against a Wallabag Atom file to
Fixes #135
Fixes #123
Fixes #106