Synerise at ACM Twitter RecSys Challenge 2021

Implementation of our 2nd-place solution to the Twitter RecSys Challenge 2021. The goal of the competition was to predict user engagement with 1 billion tweets selected by Twitter. An additional challenge was the test phase: models were evaluated in a very constrained environment - just 1 CPU core, no GPU, and a 24-hour time limit for all predictions, which leaves about 6 ms per single tweet prediction.

The challenge focuses on the real-world task of tweet engagement prediction in a dynamic environment. It covers four engagement types: Like, Retweet, Quote, and Reply.

Approach

Getting Started

  1. Register and download the training and validation sets from the competition website.

  2. Set up a configuration file config.yaml (an illustrative example follows the option list):

    • working_dir - path where all preprocessed files will be saved
    • recsys_data - path to the directory with the uncompressed training data parts
    • validation_part - path to the uncompressed validation part
    • max_n_parts - maximum number of training parts used for training; lower it to speed up training
    • max_n_parts_in_memory - number of training parts loaded into memory at the same time; lower it to limit RAM usage
    • authors_similarity_top_N - maximum number of users considered similar to the current one
    • authors_similarity_threshold - user similarity threshold
    • validation_percentage - percentage of the validation set used for testing; the remaining part is used for finetuning
    • num_validation_chunks - number of validation chunks
    • sketch_width - sketch width for tweet sketches
    • sketch_depth - sketch depth for tweet sketches
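
A minimal config.yaml might look like the following. All paths and numeric values below are placeholders meant only to show the expected keys, not recommended settings:

    working_dir: /path/to/working_dir
    recsys_data: /path/to/training_parts
    validation_part: /path/to/validation_part
    max_n_parts: 100
    max_n_parts_in_memory: 10
    authors_similarity_top_N: 50
    authors_similarity_threshold: 0.5
    validation_percentage: 0.5
    num_validation_chunks: 10
    sketch_width: 128
    sketch_depth: 10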
  3. Finetune DistilBERT and precompute token sketches:

    python bert_finetuning.py

BERT checkpoints will be saved periodically. You can run sketch computation on any chosen checkpoint:

Prepare sketches of tokens from the checkpoint trained with the above script. Change ./distilbert_checkpoints/checkpoint-1000 to the path of the most recent checkpoint:

    python prepare_token_embeddings.py --checkpoint-path ./distilbert_checkpoints/checkpoint-1000
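
For orientation, the snippet below shows one way token embeddings could be pulled from a fine-tuned checkpoint, assuming the HuggingFace Trainer checkpoint layout suggested by the checkpoint-1000 directory name; prepare_token_embeddings.py may extract them differently (e.g. contextual embeddings):

    # Illustrative only - not the repository's implementation.
    from transformers import DistilBertModel

    checkpoint = "./distilbert_checkpoints/checkpoint-1000"
    model = DistilBertModel.from_pretrained(checkpoint)

    # The input embedding matrix holds one vector per vocabulary token;
    # these per-token vectors are what EMDE later turns into sketches.
    token_embeddings = model.get_input_embeddings().weight.detach()
    print(token_embeddings.shape)  # (vocab_size, hidden_dim)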

Apply EMDE to compute sketches:

    python emde.py
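
As a rough illustration of what a sketch is, the snippet below builds an EMDE-style density sketch by bucketing embeddings with random-hyperplane LSH, reusing the sketch_width and sketch_depth settings from config.yaml. It is a simplified stand-in, not the code in emde.py:

    # Illustrative only - a simplified EMDE-style sketch, not emde.py itself.
    import numpy as np

    def build_sketch(embeddings, sketch_width=128, sketch_depth=10, seed=0):
        """Aggregate a set of embedding vectors into a flat density sketch."""
        rng = np.random.default_rng(seed)
        dim = embeddings.shape[1]
        n_bits = int(np.ceil(np.log2(sketch_width)))
        sketch = np.zeros((sketch_depth, sketch_width), dtype=np.float32)
        for d in range(sketch_depth):
            # Random hyperplanes hash each embedding into one of sketch_width buckets.
            planes = rng.standard_normal((dim, n_bits))
            bits = (embeddings @ planes > 0).astype(np.int64)
            buckets = (bits @ (1 << np.arange(n_bits))) % sketch_width
            np.add.at(sketch[d], buckets, 1.0)
        return sketch.reshape(-1)

    # Example: sketch 20 random 768-dimensional token embeddings.
    sketch = build_sketch(np.random.randn(20, 768))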

Steps 2 and 3 can be run simultaneously.

  4. Preprocess the dataset and compute user interactions:
    python interactions_extraction.py
  5. Train the model (an illustrative model sketch follows the list):
    python train.py
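
To make the four-target setup concrete, the sketch below shows a minimal multi-head classifier over concatenated sketch/feature vectors with one output per engagement type. The architecture, input dimension, and layer sizes are assumptions for illustration only and do not reflect what train.py implements:

    # Illustrative only - placeholder architecture, not the model in train.py.
    import torch
    import torch.nn as nn

    class EngagementModel(nn.Module):
        def __init__(self, input_dim: int, hidden_dim: int = 512):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
            )
            # One logit per engagement type: Reply, Retweet, Quote, Like.
            self.head = nn.Linear(hidden_dim, 4)

        def forward(self, x):
            return self.head(self.backbone(x))

    model = EngagementModel(input_dim=4 * 10 * 128)   # e.g. four depth-10, width-128 sketches
    criterion = nn.BCEWithLogitsLoss()                # four independent binary targets
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)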
