
newsfeed-corpus

What

An ongoing attempt at tying together various ML techniques for trending topic and sentiment analysis, coupled with some experimental Python async coding, a distributed architecture, EventSource and lots of Docker goodness.

Why

I needed a readily available corpus for doing text analytics and sentiment analysis, so I decided to make one from my RSS feeds.

Things escalated quickly from there on several fronts:

  • I decided I wanted this to be fully distributed, so I split the logic into several worker processes that coordinate via Redis queues, orchestrated (and deployed) using docker-compose (see the sketch after this list)
  • I decided to take a ride on the bleeding edge and refactored everything to use asyncio/uvloop (as well as Sanic for the web front-end)
  • Rather than just consuming cognitive APIs, I also decided to implement a few NLP processing techniques (I started with a RAKE keyword extractor, and am now looking at NLTK-based tagging)
  • Instead of using React, I went with RiotJS, largely because I wanted to be able to deploy new visual components without a build step.
  • I also started using this as a "complex" Docker/Kubernetes demo, which meant some flashy features (like a graphical dashboard) started taking precedence.
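
To make the "fully distributed" part concrete, each worker essentially boils down to an asyncio loop that blocks on a Redis list and handles one queued job at a time. The sketch below is illustrative only: the queue name, the payload shape, and the use of redis-py's asyncio client plus uvloop's event-loop policy are assumptions for the example, not the repo's actual code.

```python
# Illustrative queue-driven worker (not the repo's actual code).
# Assumes redis-py >= 4.2 for its asyncio client; the real workers may use
# a different Redis library, queue names, and payload format.
import asyncio
import json

import redis.asyncio as redis
import uvloop

QUEUE = "work_queue"  # hypothetical queue name

async def worker() -> None:
    conn = redis.Redis(host="redis", port=6379, decode_responses=True)
    while True:
        # BLPOP blocks until another process RPUSHes a job onto the list
        _key, raw = await conn.blpop(QUEUE)
        job = json.loads(raw)
        print("processing", job.get("url"))
        # ...fetching/parsing/storing would happen here...

if __name__ == "__main__":
    # Swap in uvloop's faster event loop, then run the worker forever
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    asyncio.get_event_loop().run_until_complete(worker())
```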

How

This was originally the "dumb" part of the pipeline: the corpus was fed into Azure ML and the Cognitive Services APIs for the nice stuff, so this project started out focused mostly on fetching, parsing, and filing away feeds.

It's now a rather more complex beast than I originally bargained for. Besides acting as a technology demonstrator for a number of things (including odds and ends like how to bundle NLTK datasets inside Docker), it is currently pushing the envelope on my Python Docker containers, which now feature Python 3.6.3 atop Ubuntu LTS.
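
As an aside on the NLTK bundling mentioned above, the usual trick is to download the datasets at image build time into a directory on NLTK's default search path, so containers work without fetching anything at start-up. The corpus names, script name, and target directory below are illustrative assumptions, not necessarily what this repo ships.

```python
# download_nltk_data.py -- illustrative build-time helper, invoked from the
# Dockerfile (e.g. "RUN python download_nltk_data.py") so the datasets are
# baked into the image rather than downloaded when the container starts.
import nltk

# /usr/share/nltk_data is on NLTK's default search path inside the image;
# the corpora listed here are assumptions about what the parser might need.
for package in ("punkt", "stopwords", "averaged_perceptron_tagger"):
    nltk.download(package, download_dir="/usr/share/nltk_data")
```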

ToDo

Architecture

  • import.py is a one-shot OPML importer (you should place your own feeds.opml in the root directory)
  • metrics.py keeps tabs on various stats and pushes them out every few seconds
  • scheduler.py iterates through the database and queues feeds for fetching
  • fetcher.py fetches feeds and stores them in DocumentDB/MongoDB (a rough sketch of this step follows the list)
  • parser.py parses updated feeds into separate items and performs:
      [x] language detection
      [x] keyword extraction (using langkit.py)
      [ ] basic sentiment analysis
  • cortana.py (WIP) will do topic detection and sentiment analysis
  • web.py provides a simple web front-end for live status updates via SSE (also sketched below).
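
To give a flavour of the fetcher stage, here is a rough sketch of fetching one feed asynchronously and upserting it into MongoDB. It assumes aiohttp for HTTP and motor as the async MongoDB driver, and invents the database, collection, and document fields for the sake of the example; the real fetcher.py differs in detail.

```python
# Illustrative fetch-and-store step (not the actual fetcher.py logic).
# Assumes aiohttp and motor are installed; the "newsfeed.feeds" collection
# and the document fields are made up for this example.
import asyncio
from datetime import datetime, timezone

import aiohttp
from motor.motor_asyncio import AsyncIOMotorClient

async def fetch_feed(url: str) -> None:
    mongo = AsyncIOMotorClient("mongodb://db:27017")
    collection = mongo.newsfeed.feeds  # hypothetical database/collection

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            body = await response.text()

    # Upsert keyed by feed URL so repeated fetches update the document in place
    await collection.update_one(
        {"url": url},
        {"$set": {"raw": body, "fetched_at": datetime.now(timezone.utc)}},
        upsert=True,
    )

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(
        fetch_feed("http://example.com/feed.xml")
    )
```

On the delivery end, the SSE front-end amounts to a handler that keeps a response stream open and pushes "data:" frames as status updates arrive. The sketch below uses the streaming API from recent Sanic releases and a made-up route and payload; Sanic's streaming interface has changed over the years, so the actual web.py will look different.

```python
# Illustrative server-sent events handler (not the repo's actual web.py).
# Uses request.respond() from recent Sanic releases; older versions expose
# streaming responses differently. Route name and payload are invented.
import asyncio
import json

from sanic import Sanic

app = Sanic("status")

@app.route("/events")
async def events(request):
    response = await request.respond(content_type="text/event-stream")
    while True:
        # In the real service this would relay metrics read from Redis
        payload = json.dumps({"status": "ok"})
        await response.send(f"data: {payload}\n\n")
        await asyncio.sleep(5)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```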

Processes are written to leverage asyncio/uvloop and interact via Redis (they previously coordinated via ZeroMQ, but I'm already playing around with deploying this on Swarm and an Azure VM scale set).

A docker-compose file is supplied for running the entire stack locally. You can tweak it up to version 3 of the compose format and get things running on Swarm if you manually push the images to a private registry first, but I'll automate that once things are a little more stable.