PyPocketExplore - Unofficial API to Pocket Explore data

PyPocketExplore is a CLI-based and web-based API to access Pocket Explore data. It can be used to collect data about the most popular Pocket items for different topics.

One example use is to crawl the data and use it as a training set for predicting the number of Pocket saves for a web page.

Usage

The easiest way to install the package is through PyPI. This should get you up and running pretty quickly.

$ pip install PyPocketExplore

Through the CLI there are two modes: topic and batch.

With the first one (pypocketexplore topic) you can download items from specific topics and output them to a nicely formatted JSON file.

Usage: pypocketexplore topic [OPTIONS] [LABEL]...

  Download items for specific topics

Options:
  --limit INTEGER  Limit items to download
  --out TEXT       JSON output filepath
  --nlp            If set, also downloads the page and applies NLP (through
                   NLTK)

For example, this command

$ pypocketexplore topic python data sex books --nlp --out life_topics.json

will go through the corresponding pages: https://getpocket.com/explore/python, https://getpocket.com/explore/data, https://getpocket.com/explore/sex, https://getpocket.com/explore/books one-by-one and then:

  • scrape and extract the immediately available data for each item (item_id, title, save count, excerpt and url)
  • run each item url through the awesome Newspaper library (in parallel)
  • apply NLP to each item's text
  • save the results to life_topics.json
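
The steps above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: `explore_urls`, `enrich_items` and the `fetch_text` callable are hypothetical names, and `fetch_text` stands in for whatever downloads and parses a page (the role Newspaper plays in the real tool).

```python
from concurrent.futures import ThreadPoolExecutor

EXPLORE_URL = "https://getpocket.com/explore/{label}"


def explore_urls(labels):
    """Map topic labels to their Pocket Explore page URLs."""
    return [EXPLORE_URL.format(label=label) for label in labels]


def enrich_items(items, fetch_text, max_workers=8):
    """Fetch the text behind each item's url in parallel and attach it.

    `fetch_text` is a hypothetical callable (url -> str); it is an
    assumption of this sketch, not part of PyPocketExplore's API.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so texts line up with items.
        texts = list(pool.map(fetch_text, (item["url"] for item in items)))
    # Return new dicts so the scraped items are left untouched.
    return [dict(item, text=text) for item, text in zip(items, texts)]
```

With Newspaper installed, `fetch_text` could be a small wrapper that builds an `Article(url)`, calls `download()` and `parse()`, and returns its `text`.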

In the end you'll have a rich dataset full of text to play with and of course a popularity metric - pretty cool to experiment with. You can check it out here
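
As a rough idea of how such a dataset could feed a model, here is a small sketch that turns items into (text, target) pairs. It assumes the output file is a JSON list of objects using the same field names as the web API response shown further down (title, excerpt, saves_count); that layout is an assumption, and `load_items` and `to_training_pairs` are hypothetical helpers.

```python
import json
import math


def load_items(path):
    """Load a JSON file assumed to contain a list of item objects."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def to_training_pairs(items):
    """Build (text, log-scaled saves) pairs, skipping items without a count."""
    pairs = []
    for item in items:
        saves = item.get("saves_count")
        if saves is None:
            continue
        # Join whichever of title/excerpt are present into one text field.
        text = " ".join(filter(None, (item.get("title"), item.get("excerpt"))))
        # log1p tames the long tail of very popular items.
        pairs.append((text, math.log1p(saves)))
    return pairs
```

For example, `to_training_pairs(load_items("life_topics.json"))` would give a list ready to hand to a text regressor of your choice.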

For each topic on Pocket Explore, there is a set of related topics, which one can crawl pretty easily in a recursive way. For example, after scraping https://getpocket.com/explore/python one can then scrape the related topics: programming javascript google windows java linux data science python 3 developer.

This essentially means that one can crawl the whole graph of topics by following the related topics as edges. To do this, one of course needs a set of seed topics to start the crawl. To get these seeds, the pypocketexplore batch mode fetches the taxonomy labels provided by IBM Watson and then walks through the graph. (I guess Pocket uses IBM Watson to label its items, so this kind of reverse engineering makes sense. Sorry, Pocket guys.)
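
The graph walk can be pictured as a plain breadth-first search. The sketch below is illustrative only: `crawl_topics` and the `get_related` callable (topic -> list of related topics) are hypothetical, and the real batch mode may order or limit its traversal differently.

```python
from collections import deque


def crawl_topics(seeds, get_related, max_topics=None):
    """Breadth-first walk over the topic graph; returns topics in visit order."""
    queue = deque(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(queue)
    visited = []
    while queue:
        if max_topics is not None and len(visited) >= max_topics:
            break
        topic = queue.popleft()
        visited.append(topic)
        for neighbour in get_related(topic):
            # Related topics are the edges; follow each one only once.
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return visited
```

The `seen` set is what keeps the crawl from looping forever, since related topics usually point back at each other.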

Usage: pypocketexplore batch [OPTIONS]

  Download items for all topics recursively.  USE WITH CAUTION!

Options:
  --n INTEGER      Max number of total items to download
  --limit INTEGER  Limit items to download per topic
  --out TEXT       JSON output filepath
  --nlp            If set, also downloads the page and applies NLP (through
                   NLTK)
  --mongo TEXT     Mongo DB URI to save items
  --help           Show this message and exit.

CAUTION: With all the goodies enabled, this mode will take a few days to run and will collect around 300k unique items across 8k topics. I have tried to space out the requests to Pocket's servers and to handle rate limit errors, but one can never be sure with such things.
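
To give an idea of what spacing requests and handling rate limits can look like, here is a generic sketch using a fixed pause plus exponential backoff. It is not the package's actual logic: `polite_fetch`, the `RateLimited` exception and the `fetch` callable are all hypothetical names introduced for illustration.

```python
import time


class RateLimited(Exception):
    """Hypothetical error a fetcher raises when the server says slow down."""


def polite_fetch(fetch, url, delay=1.0, max_retries=5, sleep=time.sleep):
    """Call fetch(url), pausing before each try and backing off on rate limits."""
    wait = delay
    for attempt in range(max_retries + 1):
        sleep(wait)  # space out every request, including the first
        try:
            return fetch(url)
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            wait *= 2  # double the pause before trying again
```

Passing `sleep` as a parameter keeps the function easy to test without actually waiting.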

Web API

To have access to a standalone web API you need to clone the repo locally first.

$ git clone git@github.com:Florents-Tselai/PyPocketExplore.git
$ cd PyPocketExplore
$ pip install -r requirements.txt

To run the API application, use the flask command, as described in the Flask Quickstart.

$ cd PyPocketExplore
$ export FLASK_APP=./pypocketexplore/api/api.py
$ export FLASK_DEBUG=1  # only if you want to run in debug mode
$ flask run
 * Running on http://localhost:5000/

Web API Documentation

Topic

  • GET /api/topic/{topic} - Get topic data

Example topics: python, finance, business and more

Example GET /api/topic/python

Response

[
    {
        "excerpt": "For part 1, see here. All the software written for this project is in Python. I’m not an expert python programmer, far from it but the huge number of available libraries and the fact that I can make some sense of it all without having spent a lifetime in Python made this a fairly obvious choice.",
        "image": "https://d33ypg4xwx0n86.cloudfront.net/direct?url=https%3A%2F%2Fjacquesmattheij.com%2Fusb-microscope.jpg&resize=w750",
        "item_id": "1731527024",
        "saves_count": 223,
        "title": "Sorting 2 Tons of Lego, The software Side · Jacques Mattheij",
        "topic": "python",
        "url": "https://jacquesmattheij.com/sorting-lego-the-software-side"
    },
    
    {
        "excerpt": "There are lots of free resources for learning Python available now. I wrote about some of them way back in 2013, but there’s even more now then there was then! In this article, I want to share these resources with you.",
        "image": "https://d33ypg4xwx0n86.cloudfront.net/direct?url=https%3A%2F%2Fdz2cdn1.dzone.com%2Fstorage%2Farticle-thumb%2F5158392-thumb.jpg&resize=w750",
        "item_id": "1727350036",
        "saves_count": 59,
        "title": "Free Python Resources",
        "topic": "python",
        "url": "https://dzone.com/articles/free-python-resources"
    },
    
    {
        "excerpt": "A surprisingly versatile Swiss Army knife — with very long blades!TL;DRWe (an investment bank in the Eurozone) are deploying Jupyter and the Python scientific stack in a corporate environment to provide employees and contractors with an interactive computing environment with to help them leve",
        "image": "https://d33ypg4xwx0n86.cloudfront.net/direct?url=https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2AmeN9gfB_nuwmGGwLQzhVQA.png&resize=w750",
        "item_id": "1726489646",
        "saves_count": 41,
        "title": "Jupyter & Python in the corporate LAN",
        "topic": "python",
        "url": "https://medium.com/@olivier.borderies/jupyter-python-in-the-corporate-lan-109e2ffde897"
    },
    ...
]
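
A small client for this endpoint might look like the sketch below. It assumes the server is running locally on port 5000 as in the `flask run` output above; `topic_url`, `fetch_topic` and `most_saved` are hypothetical helpers, and whether the API accepts percent-encoded multi-word topics is an assumption.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # assumed local dev server address


def topic_url(topic, base_url=BASE_URL):
    """Build the GET /api/topic/{topic} URL, escaping the topic name."""
    return f"{base_url}/api/topic/{quote(topic, safe='')}"


def fetch_topic(topic, base_url=BASE_URL):
    """Fetch and decode the JSON list of items for one topic."""
    with urlopen(topic_url(topic, base_url)) as resp:
        return json.load(resp)


def most_saved(items, n=3):
    """Return the n items with the highest saves_count."""
    return sorted(items, key=lambda item: item["saves_count"], reverse=True)[:n]
```

With the server up, `most_saved(fetch_topic("python"))` would list the most popular Python items first.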

License

Copyright (c) 2017 Florents Tselai. Licensed under the MIT license.