Nechaev Yaroslav edited this page Jun 28, 2017 · 4 revisions

SocialLink source code

The code is split into a set of executable steps, described below. Consult the help output of each class for the options needed to run it.

Installation

Check out the repository and run mvn package to produce the Java .jar file. The latest release can also be downloaded from the releases page.

Code Organisation and pipeline description

The align component contains everything needed to recreate the resource from scratch. The software covers the four main blocks described in our recent paper: the data acquisition phase, which populates the User Index and the Entity Index from raw Twitter data and DBpedia chapters; the candidate acquisition phase, which populates the list of candidate profiles for each entity; and the candidate selection phase, which produces the final alignments. This module also contains an evaluation pipeline used to benchmark and debug the approach.

[Figure: overview of the approach]

Data Acquisition

The User Index is built from raw Twitter data and stored in a PostgreSQL database via an Apache Flink pipeline. All the code for building and testing the User Index is located in the eu.fbk.fm.alignments.index package.

  • The BuildUserIndex class populates the user_index and user_objects tables, which contain all the necessary information about users.
  • The BuildUserLSA class is a supplementary pipeline that gathers all content related to a specific user and computes a dense vector representation of this content using Latent Semantic Analysis. The results are stored in the user_text table.

Both pipelines are straightforward to run. Each writes its results either to PostgreSQL, which requires database access, or to a specified file. Each reads files containing raw Twitter Streaming API data, which may optionally be gzipped. BuildUserLSA also requires a precomputed LSA model from the eu.fbk.utils.lsa package of the fbk/utils library.
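
As an illustration of the "optionally gzipped" input, here is a minimal sketch of reading such a file transparently. It is not SocialLink code: the class RawStreamReader and its methods are hypothetical, and it assumes the input holds one JSON record per line. Compression is detected from the gzip magic bytes rather than from the file name.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of SocialLink.
final class RawStreamReader {

    /** Wraps the stream in a GZIPInputStream only if it starts with the gzip magic bytes 0x1f 0x8b. */
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);
        int first = in.read();
        int second = in.read();
        in.reset(); // rewind so the consumer sees the stream from its first byte
        boolean gzipped = first == 0x1f && second == 0x8b;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    /** Reads non-empty lines; assumes one JSON record per line. */
    static List<String> readRecords(InputStream raw) {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(maybeGunzip(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    records.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    /** Compresses text in memory; used only to build demo input. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String sample = "{\"id\":1}\n\n{\"id\":2}\n";
        byte[] plain = sample.getBytes(StandardCharsets.UTF_8);
        // Both calls yield the same two records.
        System.out.println(readRecords(new ByteArrayInputStream(plain)));
        System.out.println(readRecords(new ByteArrayInputStream(gzip(sample))));
    }
}
```

For files on disk, the same method can be given a java.io.FileInputStream.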

The rest of the sources in this package include:

  • UserLSAInteractive, a script used to test the three tables (user_index, user_objects and user_text) once the index has been built.
  • FillFromIndex, a class that provides index content to the rest of the component.

Candidate Acquisition

The Candidate Acquisition phase is represented by the eu.fbk.fm.alignments.pipeline.SubmitEntities class. Its job is to submit all entity-candidate pairs to the alignments table in preparation for scoring. It uses the eu.fbk.fm.alignments.query package to construct the query that retrieves the list of candidates from the User Index.

Candidate Selection

The Candidate Selection phase is represented by the eu.fbk.fm.alignments.pipeline.ScoreEntities class. It takes all entity-candidate pairs from the database, extracts features, queries the DNN API, which is installed separately (see the pokedem-models repository), selects the best candidate and saves everything back to the database. The eu.fbk.fm.alignments.scorer package contains implementations of all features and feature set strategies.

From this point on, the alignments table in the database contains the resource, which can be queried directly or extracted in various formats such as RDF or CSV.

Production pipeline

Once the index is populated and the Virtuoso endpoint is set up, the resource is produced by running two consecutive scripts from the eu.fbk.fm.alignments.pipeline package:

  • SubmitEntities reads the list of entities to resolve, runs them through the Candidate Acquisition phase and submits the unscored candidates to the database.
  • ScoreEntities reads the entity-candidate pairs from the database and scores them.

After that, the is_alignment flag is set using the following query:

UPDATE alignments SET is_alignment = true FROM
(SELECT resource_id, MAX(uid), MAX(max_score) AS score FROM (
    SELECT 
        resource_id, uid,
        nth_value(score, 1) OVER w AS max_score, 
        nth_value(score, 2) OVER w AS next_score
    FROM alignments 
    WINDOW w AS (PARTITION BY resource_id ORDER BY score DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) AS a
GROUP BY resource_id
HAVING MAX(max_score) > :min_score AND MAX(max_score) - MAX(next_score) > :min_improvement) AS b
WHERE alignments.resource_id = b.resource_id AND alignments.score = b.score

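Note the two thresholds supplied as parameters to the query: :min_score is the minimum score the top candidate must exceed, and :min_improvement is the minimum margin by which it must beat the second-best candidate.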
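
The selection rule encoded by the query can be sketched in plain Java for a single entity. This is only an illustration of the logic, not code from the repository: the class AlignmentSelector and its method are hypothetical, and candidates are assumed to be given as a map from uid to score. As in the query, an entity with a single candidate is never aligned, because there is no second score to compare against.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical illustration of the thresholding logic, not part of SocialLink.
final class AlignmentSelector {

    /**
     * Returns the uid of the top-scoring candidate if its score exceeds minScore
     * and beats the second-best score by more than minImprovement.
     */
    static OptionalLong select(Map<Long, Double> scoresByUid, double minScore, double minImprovement) {
        // With fewer than two candidates the SQL margin is NULL and the HAVING clause fails.
        if (scoresByUid.size() < 2) {
            return OptionalLong.empty();
        }
        long bestUid = 0L;
        double best = Double.NEGATIVE_INFINITY;
        double second = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Long, Double> entry : scoresByUid.entrySet()) {
            double score = entry.getValue();
            if (score > best) {
                second = best;
                best = score;
                bestUid = entry.getKey();
            } else if (score > second) {
                second = score; // also covers a tie with the best score
            }
        }
        boolean accepted = best > minScore && best - second > minImprovement;
        return accepted ? OptionalLong.of(bestUid) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        Map<Long, Double> candidates = new HashMap<>();
        candidates.put(1L, 0.9);
        candidates.put(2L, 0.5);
        System.out.println(select(candidates, 0.4, 0.3)); // accepted: candidate 1
        System.out.println(select(candidates, 0.4, 0.5)); // rejected: margin too small
    }
}
```

Raising either threshold trades recall for precision: fewer entities receive an alignment, but the accepted ones are less ambiguous.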

Evaluation

The eu.fbk.fm.alignments.Evaluate class is a self-contained pipeline that uses a file of gold-standard alignments to produce libsvm-compatible feature dumps for training. It also evaluates the entire system.
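
For readers unfamiliar with the libsvm format, each line holds a label followed by sparse index:value pairs with 1-based, ascending indices. The sketch below shows how one feature vector could be serialised that way. It is not the Evaluate class's actual code: the class LibSvmLine is hypothetical, and the convention of label 1 for a correct entity-candidate pair and 0 for an incorrect one is an assumption.

```java
// Hypothetical illustration of the libsvm line format, not part of SocialLink.
final class LibSvmLine {

    /** Formats a label and a dense feature vector as one sparse libsvm line. */
    static String format(int label, double[] features) {
        StringBuilder line = new StringBuilder();
        line.append(label);
        for (int i = 0; i < features.length; i++) {
            if (features[i] == 0.0) {
                continue; // zero-valued features are omitted in the sparse format
            }
            // libsvm feature indices start at 1
            line.append(' ').append(i + 1).append(':').append(features[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Assumed convention: 1 = correct pair, 0 = incorrect pair.
        System.out.println(format(1, new double[] {0.5, 0.0, 2.0})); // prints "1 1:0.5 3:2.0"
    }
}
```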
