Zhihu Machine Learning Challenge 2017 (https://biendata.com/competition/zhihu/)

Zhihu Machine Learning Challenge 2017



Abstract

In the Zhihu Machine Learning Challenge 2017, we were asked to build a model that automatically and accurately tags Zhihu content with topics. Our final submission was a two-stage process that scored 0.43436 on the Public LB and 0.43273 on the Private LB, ranking 3rd among all teams. This document describes our team's solution, which can be divided into two parts:

  1. Deep Learning: build various DL models to score and sort all topics.
  2. Learning To Rank: build a RankGBM model to re-rank the ten most likely topics.

Learning To Rank

In the first stage, we obtain the DL model prediction results for each <instance, topic> pair. In the second stage, we vote over the model results for every instance. After voting, each instance is associated with its ten most likely topics. We then build a RankGBM model to re-rank these ten candidate topics.
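The voting step can be illustrated with a minimal sketch. The source does not specify the voting scheme used by `bin.rank.vote`, so this example assumes a weighted sum of per-model scores. The function name `vote_top_k`, the `weights` parameter, and the toy topic ids are hypothetical and not part of the project.

```python
from collections import defaultdict


def vote_top_k(model_scores, weights=None, k=10):
    """Combine per-model topic scores for ONE instance and keep the top k topics.

    model_scores: list of dicts, one per model, mapping topic id -> score.
    weights: optional per-model weights (assumption: equal weights by default).
    """
    if weights is None:
        weights = [1.0] * len(model_scores)

    totals = defaultdict(float)
    for scores, weight in zip(model_scores, weights):
        for topic, score in scores.items():
            totals[topic] += weight * score

    # Highest combined score first; ties broken by topic id for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [topic for topic, _ in ranked[:k]]


# Toy predictions from two hypothetical models for a single instance.
model_a = {"t1": 0.9, "t2": 0.4, "t3": 0.1}
model_b = {"t1": 0.3, "t2": 0.7, "t4": 0.2}

candidates = vote_top_k([model_a, model_b], k=2)
print(candidates)
```

With `k=10`, the returned list would play the role of the ten candidate topics that the ranking model later re-orders.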

The ranking stage can be run with the following steps:

  1. Enter the root directory of the project:

    cd zhihu-machine-learning-challenge-2017/
  2. Vote for the offline and online datasets:

    python -m bin.rank.vote conf/rank_v29.conf vote offline
    python -m bin.rank.vote conf/rank_v29.conf vote online
  3. Generate features for the offline and online datasets:

    # generate <instance, topic> pair features
    python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_model offline
    python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_model online
    # generate <instance> features
    python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_instance offline
    python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_instance online
    # generate <topic> features
    python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_topic offline
    python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_topic online
  4. Generate rank data files for the offline and online datasets:

    python -m bin.rank.rankgbm.rank_data conf/rank_v29.conf generate_offline
    python -m bin.rank.rankgbm.rank_data conf/rank_v29.conf generate_online
  5. Train a RankGBM model on the offline dataset and predict for the online dataset:

    # 3-fold cross validation
    python -m bin.rank.rankgbm.run conf/rank_v29.conf train 0 rank_v29
    python -m bin.rank.rankgbm.run conf/rank_v29.conf train 1 rank_v29
    python -m bin.rank.rankgbm.run conf/rank_v29.conf train 2 rank_v29
    # predict for online dataset
    python -m bin.rank.rankgbm.run out/rank_v29/conf/featwheel.conf test
  6. Finally, you can find the submission file here:

    vim out/rank_v29/pred/rank_submit.online.29
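
For convenience, steps 2 to 5 could be chained from a small driver script. The following is only a sketch: the module names, arguments, and paths are copied from the commands above, while the helper names `build_commands` and `run_all`, the `python` interpreter name, and the dry-run default are assumptions. It must be run from the project root.

```python
import subprocess

CONF = "conf/rank_v29.conf"
TAG = "rank_v29"
DATASETS = ("offline", "online")


def build_commands(python="python"):
    """Return the commands of steps 2-5 in the order listed above."""
    cmds = []
    # Step 2: vote for both datasets.
    for dataset in DATASETS:
        cmds.append([python, "-m", "bin.rank.vote", CONF, "vote", dataset])
    # Step 3: <instance, topic> pair, <instance>, and <topic> features.
    for source in ("model", "instance", "topic"):
        for dataset in DATASETS:
            action = "generate_featwheel_feature_from_" + source
            cmds.append([python, "-m", "bin.rank.feature", CONF, action, dataset])
    # Step 4: rank data files.
    for dataset in DATASETS:
        cmds.append([python, "-m", "bin.rank.rankgbm.rank_data", CONF,
                     "generate_" + dataset])
    # Step 5: 3-fold cross validation, then prediction for the online dataset.
    for fold in range(3):
        cmds.append([python, "-m", "bin.rank.rankgbm.run", CONF,
                     "train", str(fold), TAG])
    cmds.append([python, "-m", "bin.rank.rankgbm.run",
                 "out/rank_v29/conf/featwheel.conf", "test"])
    return cmds


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


run_all(dry_run=True)
```

The dry run only prints the commands, which makes it easy to compare them with the steps above before calling `run_all(dry_run=False)`.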