
nexTop

My project for the Insight Data Engineering Summer 2017 session is a topic-based content discovery system for news articles.
The general idea is to recommend content to users that is relevant but also new and fresh. The main component used to do this is Elasticsearch and its flexible scoring system. Elasticsearch made it possible to implement and compare two recommendation systems: one using a more traditional tf-idf relevance score and another, the nexTop scoring system, designed to deliver recommendations with a mix of topics, some relevant and some new to the user (details here).

Slides of the presentation.

The architecture

The main components are:

  • The ingestion pipeline collecting data from the GDELT Dataset into Elasticsearch
  • The user simulation component, generating user clicks based on the two recommendation systems available and collecting statistics
  • The Flask API matching user topics with document topics in Elasticsearch and returning recommendations

(Architecture diagram: nexTop architecture)

The ingestion pipeline

To load data into Elasticsearch, I created a Kafka producer that reads data from the GDELT S3 dataset and parses it into an Avro schema.
Each parsed record (news URL, date, related topics) is published to a Kafka topic.
The messages are read by a Spark Streaming consumer and stored in Elasticsearch via the native client library elasticsearch-hadoop.
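
The producer and consumer code live in the linked repositories; the snippet below is only a minimal sketch of the producer side, assuming the kafka-python and fastavro libraries, a Kafka topic named gdelt-news, and an illustrative three-field record schema (the real Avro schema is defined in the repository).

```python
import io
import fastavro
from kafka import KafkaProducer

# Illustrative Avro schema for a parsed GDELT record; the actual schema lives in the repo.
RECORD_SCHEMA = fastavro.parse_schema({
    "name": "GdeltRecord",
    "type": "record",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "date", "type": "string"},
        {"name": "topics", "type": {"type": "array", "items": "string"}},
    ],
})

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def serialize(record):
    """Encode one record as schemaless Avro binary."""
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, RECORD_SCHEMA, record)
    return buf.getvalue()

def publish(records, topic="gdelt-news"):
    """Publish parsed GDELT records (url, date, topics) to the Kafka topic."""
    for record in records:
        producer.send(topic, serialize(record))
    producer.flush()
```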

Ingestion pipeline repositories:

The user simulation component

This component generates user behavior (sequences of user clicks) by repeatedly requesting recommendations from the Flask API. It collects the simulated user statistics and, using a Kafka producer, publishes them as Avro messages (schema) to a Kafka topic. A Spark Streaming consumer reads from the Kafka topic and stores the statistics in Elasticsearch.
There are two groups of simulated users, one receiving traditional recommendations and one receiving nexTop recommendations.
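
As an illustration of the simulation loop, here is a rough sketch in Python, assuming the requests library and a particular JSON shape for the API responses (a "topics" field on each article); the actual simulation code is in the linked repositories.

```python
import random
import requests

API_URL = "http://localhost:5000"  # assumed location of the Flask API

def simulate_user(recommendation_type, n_clicks=20):
    """Simulate one user session: start from a random article, then keep
    clicking one of the recommended articles and track newly seen topics."""
    seen_topics = set(requests.get(f"{API_URL}/random").json()["topics"])
    new_topic_counts = []

    for _ in range(n_clicks):
        resp = requests.post(
            f"{API_URL}/topics",
            json={"topics": list(seen_topics), "type": recommendation_type},
        )
        recommendations = resp.json()
        clicked = random.choice(recommendations)           # pick one recommendation as the "click"
        new_topics = set(clicked["topics"]) - seen_topics  # topics the user had not seen before
        new_topic_counts.append(len(new_topics))
        seen_topics |= new_topics

    return new_topic_counts
```

The per-click counts of newly seen topics are the kind of statistic that feeds the comparison between the two user groups.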

Simulation component repositories:

The Flask API

The Flask API (code) provides an interface to Elasticsearch (example calls follow the endpoint list):

  • GET /random returns a random document
  • POST /topics takes a list of user topics and the desired recommendation type and returns a set of news recommendations
  • GET /stats returns the last data points of the user statistics
  • GET / returns a dashboard showing how many new topics the two groups of users are exposed to
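
A few example calls, assuming the API runs locally on port 5000 and exchanges JSON as sketched here (the exact request and response formats are defined in the API code, and the topic names are only illustrative):

```python
import requests

API_URL = "http://localhost:5000"  # wherever the Flask API is running

# A random article, e.g. to seed a simulated user.
article = requests.get(f"{API_URL}/random").json()

# Recommendations for a list of user topics; "simple" or "custom" selects the scorer.
recs = requests.post(
    f"{API_URL}/topics",
    json={"topics": ["ENV_CLIMATECHANGE", "TAX_POLICY"], "type": "custom"},
).json()

# Latest data points of the simulated user statistics (these feed the dashboard at GET /).
stats = requests.get(f"{API_URL}/stats").json()
```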

Supported recommendations

For the /topics endpoint there are two recommendation (and scoring) types implemented at the moment:

  • simple - the default Elasticsearch score (Okapi BM25, a tf-idf-like similarity score)
  • custom - a scripted custom score

Both recommendation types receive a list of user topics and score all documents that have at least one matching topic.
The simple recommendation type (code) computes its score by summing the Elasticsearch scores of the matching user topics. In this setup a less frequent topic contributes more than a common topic to the final score, and the top scorers are the documents matching the most user topics.
In the custom recommendation (code), a news article's score is the ratio of matched user topics to the total number of topics associated with the document. Each matching topic therefore contributes equally to the final score, and all scores fall in the range 0 to 1. The recommended articles are not the top scorers but the documents whose scores are closest to a configurable target score (0.75 at the moment). This means a recommendation requires a minimum number of matching topics but also a minimum number of non-matching, fresh topics.
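
To make the custom scoring concrete, here is a minimal sketch using the official Python Elasticsearch client, assuming a recent Elasticsearch version with Painless, an index named news, and a keyword field named topics; it ranks documents by how close their matched-topic ratio is to the target, which is not necessarily how the repository implements it.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster

# Painless script: ratio of matched user topics to the document's topic count,
# then closeness of that ratio to the configurable target (0.75 by default).
SCORE_SCRIPT = """
double total = doc['topics'].size();
if (total == 0) { return 0; }
double matched = 0;
for (t in params.user_topics) {
  if (doc['topics'].contains(t)) { matched += 1; }
}
double ratio = matched / total;
return 1.0 - Math.abs(ratio - params.target);
"""

def custom_recommendations(user_topics, target=0.75, size=10):
    """Return documents whose matched-topic ratio is closest to the target."""
    body = {
        "size": size,
        "query": {
            "function_score": {
                # Candidate set: every document with at least one matching topic.
                "query": {"terms": {"topics": user_topics}},
                "script_score": {
                    "script": {
                        "source": SCORE_SCRIPT,
                        "params": {"user_topics": user_topics, "target": target},
                    }
                },
                # Use the script score instead of combining it with the query score.
                "boost_mode": "replace",
            }
        },
    }
    return es.search(index="news", body=body)
```

The simple type needs no script: a bool query with one match clause per user topic already sums the per-topic BM25 scores by default.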
