SocialLink source code
The code is split into a set of executable steps that are described below. Please consult the help output of each class to run the system.
Checkout and run
Run `mvn package` to produce the Java .jar file. The latest release is also available for download on the releases page.
Code Organisation and pipeline description
The `align` component contains everything needed to recreate the resource from scratch. The software covers the main blocks described in our recent paper: the data acquisition phase, which populates the User Index and the Entity Index from raw Twitter data and DBpedia chapters; the candidate acquisition phase, which populates the list of candidate profiles for each entity; and the candidate selection phase, which produces the final alignments. This module also contains an evaluation pipeline used to benchmark and debug the approach.
The User Index is built from raw Twitter data and stored in a PostgreSQL database via an Apache Flink pipeline. The code for building and testing the User Index consists of the following classes:

- the `BuildUserIndex` class, used to populate the `user_objects` and related tables that contain all the necessary information about users;
- the `BuildUserLSA` class, a supplementary pipeline that gathers all content related to a specific user and computes a dense vector representation of this content using Latent Semantic Analysis, storing the results in a dedicated table.
Both pipelines are straightforward to run: they either need access to a PostgreSQL database or dump their results into a specified file, and they take as input files containing raw Twitter Streaming API data, optionally gzipped. `BuildUserLSA` additionally requires a precomputed LSA model from the `eu.fbk.utils.lsa` package of the fbk/utils library.
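For illustration, here is a minimal sketch of what a Flink batch job over raw Twitter data looks like. The class and the output layout are hypothetical simplifications, not the actual `BuildUserIndex` implementation, which also fills the other index tables:

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Hypothetical simplification of a user-indexing job, not the real BuildUserIndex
public class UserIndexSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.readTextFile(args[0]) // Flink decompresses gzipped input transparently
            .flatMap(new FlatMapFunction<String, Tuple2<Long, String>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<Long, String>> out) {
                    try {
                        // each line is one Streaming API message; keep only tweets with a user
                        JsonObject tweet = new JsonParser().parse(line).getAsJsonObject();
                        JsonObject user = tweet.getAsJsonObject("user");
                        out.collect(Tuple2.of(user.get("id").getAsLong(),
                                user.get("screen_name").getAsString()));
                    } catch (Exception ignored) {
                        // skip delete notices, limit messages and malformed lines
                    }
                }
            })
            .distinct(0)          // one row per user id
            .writeAsCsv(args[1]); // a JDBC sink would target PostgreSQL instead
        env.execute("user-index-sketch");
    }
}
```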
The rest of the sources in this package include:

- the `UserLSAInteractive` script, used to test the index tables once the index has been built;
- the `FillFromIndex` class, which serves index content to the rest of the component.
The Candidate Acquisition phase is represented by the `eu.fbk.fm.alignments.pipeline.SubmitEntities` class. Its job is to submit all entity-candidate pairs into the `alignments` table in preparation for scoring. It uses the `eu.fbk.fm.alignments.query` package to construct the query to the User Index that retrieves the list of candidates.
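As a rough sketch of what this submission step amounts to: the `resource_id` and `uid` columns below are confirmed by the query further down, but the exact statement and schema details are assumptions, not the actual `SubmitEntities` code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

// Illustrative only: batches one entity's candidate profiles into the
// alignments table, leaving the score to be filled in by the scoring step.
public class SubmitSketch {
    public static void submit(Connection conn, String resourceId, List<Long> candidateUids)
            throws Exception {
        try (PreparedStatement stmt = conn.prepareStatement(
                "INSERT INTO alignments (resource_id, uid) VALUES (?, ?)")) {
            for (long uid : candidateUids) {
                stmt.setString(1, resourceId);
                stmt.setLong(2, uid);
                stmt.addBatch();
            }
            stmt.executeBatch(); // one round trip for all candidates of this entity
        }
    }
}
```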
The Candidate Selection phase is represented by the `eu.fbk.fm.alignments.pipeline.ScoreEntities` class. It takes all entity-candidate pairs from the database, extracts features, queries the DNN API that is installed separately (see the pokedem-models repository), selects the best candidate, and saves everything back to the database. The `eu.fbk.fm.alignments.scorer` package contains implementations of all features and feature set strategies.
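The scoring call itself is an HTTP round trip to the separately installed model server. The sketch below shows the general shape of such a client; the endpoint, payload, and response format are placeholders, not the actual pokedem-models API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative client for an externally installed scoring service;
// the URL and payload shape are hypothetical.
public class ScoreClient {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static String score(String featuresJson) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5000/score")) // placeholder endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(featuresJson))
                .build();
        // the response is assumed to carry one score per candidate
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```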
From this point on, the database contains the resource in the `alignments` table, which can be queried directly (e.g., selecting the `uid` values for which `is_alignment` is set) or extracted in various formats such as RDF or CSV.
Once the index is populated and the Virtuoso endpoint is set up, the resource is produced by running two consecutive scripts from the `eu.fbk.fm.alignments.pipeline` package:

- `SubmitEntities`, which reads the list of entities to resolve, runs them through the Candidate Acquisition phase, and submits the unscored candidates to the database;
- `ScoreEntities`, which reads the entity-candidate pairs from the database and scores them.
After that, the `is_alignment` flag is set using the following query:
```sql
UPDATE alignments SET is_alignment = true
FROM (
  SELECT resource_id, MAX(uid), MAX(max_score) AS score
  FROM (
    SELECT resource_id, uid,
           nth_value(score, 1) OVER w AS max_score,
           nth_value(score, 2) OVER w AS next_score
    FROM alignments
    WINDOW w AS (PARTITION BY resource_id ORDER BY score DESC
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
  ) AS a
  GROUP BY resource_id
  HAVING MAX(max_score) > :min_score
     AND MAX(max_score) - MAX(next_score) > :min_improvement
) AS b
WHERE alignments.resource_id = b.resource_id AND alignments.score = b.score
```
Note the two thresholds supplied as parameters to the query: `:min_score`, the minimum score required for the top candidate, and `:min_improvement`, the minimum gap between the top candidate's score and the runner-up's. For example, with `:min_score = 0.5` and `:min_improvement = 0.2`, a top candidate scoring 0.9 against a runner-up at 0.3 is accepted, while one scoring 0.9 against a runner-up at 0.8 is not.
The `eu.fbk.fm.alignments.Evaluate` class is a self-contained pipeline that uses a file with gold-standard alignments to produce libsvm-compatible feature dumps for training, and also performs the evaluation of the entire system.
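The libsvm format itself is simple: one example per line, a label followed by sparse `index:value` pairs. A minimal writer, with feature semantics left to the configured feature set, could look like this:

```java
import java.io.PrintWriter;

// Writes one entity-candidate pair per line in libsvm format:
// "<label> <index>:<value> ...", with 1-based indices and zeros omitted.
public class LibsvmDump {
    static void writeExample(PrintWriter out, boolean isAlignment, double[] features) {
        StringBuilder sb = new StringBuilder(isAlignment ? "1" : "0");
        for (int i = 0; i < features.length; i++) {
            if (features[i] != 0.0) { // libsvm is sparse: skip zero features
                sb.append(' ').append(i + 1).append(':').append(features[i]);
            }
        }
        out.println(sb);
    }
}
```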