GitHub - jeremybarnes/github_contest: Github contest

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 187 Commits
blenders		blenders
doc		doc
download		download
jgraph		jgraph
svdlibc		svdlibc
.gitignore		.gitignore
Makefile		Makefile
README		README
analyze_keywords.cc		analyze_keywords.cc
authors.txt		authors.txt
candidate_source.cc		candidate_source.cc
candidate_source.h		candidate_source.h
config.txt		config.txt
data.cc		data.cc
data.h		data.h
decompose.cc		decompose.cc
decompose.h		decompose.h
exception_hook.cc		exception_hook.cc
exception_hook.h		exception_hook.h
fake-results.txt		fake-results.txt
fake-results.txt.log		fake-results.txt.log
get_repo.sh		get_repo.sh
get_user.sh		get_user.sh
github.cc		github.cc
jml		jml
keywords.cc		keywords.cc
keywords.h		keywords.h
loadbuild.mk		loadbuild.mk
notes.txt		notes.txt
prob-results.txt.log		prob-results.txt.log
process_repos.awk		process_repos.awk
ranker-classifier-training-config.txt		ranker-classifier-training-config.txt
ranker.cc		ranker.cc
ranker.h		ranker.h
raw_repo_data.txt		raw_repo_data.txt
raw_repo_data2.txt		raw_repo_data2.txt
raw_repo_data3.txt		raw_repo_data3.txt
repo_descriptions.txt		repo_descriptions.txt
results.txt		results.txt
results.txt.log		results.txt.log
siamese.cc		siamese.cc
stop_words.txt		stop_words.txt
Repository files navigation

This is Jeremy Barnes's Github recommendation engine.


Advertisement
=============

I'm currently looking for consulting contracts.  I do pretty much any data mining, machine learning or early-stage technology stuff, but I'm particularly experienced in Natural Language Processing.

You can contact me on jeremy (at) barneso (dot) com.


Copying
=======

Feel free to browse my code and take my ideas for your system, if ever they are useful.  You can also use my results.txt however you like.

If ever you want to use my code, you can do so under the Affero GPL v3.  Note that getting it to work might be difficult; it relies on JML (Jeremy's Machine Learning Library) at http://bitbucket.org/jeremy_barnes/jml/.  This code needs to be downloaded and compiled and the jml symlink in the github repo set to point to it.  I build it on 32 and 64 bit Ubuntu 8.10/9.04 machines; anything else might be difficult.  There are also a *lot* of supporting libraries to install (all of which come with Ubuntu, but many of which aren't installed by default).


For Blenders
============

I have included the following files for the benefit of blenders (in the blenders directory):
* training-50-repos.txt: 50 results/user ranked for the held-out user set from Daniel Haran's repo;
* testing-50-repos.txt: 50 results/user ranked for the official test set
* prob-results.txt: 100 results/user + probability from the ranking classifier for the official test set

Feel free to incorporate them any way you want to, though you should give credit.


Features
========

The system that I have implemented has a few unique features that might be interesting.
* Probabilistic predictions: the system provides probabilistic predictions of the score of the repos.  The major advantage of this is that we can see the confidence in our predictions.
* Ability to explain predictions: using the explain functionality of the machine learning library (which has been designed and used commercially, but is not in the current version of the library), we can tell the user why we though that they would like a particular repo.
* Practicality: it takes about 8 minutes to generate the predictions for all 4788 users on a single CPU core, which means that about 10 users/second/CPU core can be processed.  The training process takes about 35 minutes.
* Malleability: there is essentially no hand-tuning that has been done on the system, and it is very easy to add extra features to take advantage of further information.  It would also be relatively easy to take out of the context of the competition and put into production.


System Architecture
===================

The system architecture for prediction is as follows:

1.  We generate a set of candidate predictions, based upon the following things:
    - parents of watched repositories
    - ancestors (ie, grandparents or older) of watched repositories
    - repos by authors of watched repositories
    - repos with the same name (but a different author) as the watched repositories
    - children of watched repositories
    - repos that are in the same cluster as a watched repo (see below)
    - repos of users that are in the same cluster as the user we are predicting (see below)
    - repos that were co-suggested with a watched repo by a given user
    - repos that are watched by users that had co-suggested a watched repo (I guess I should include a picture)
    - repos that have an ID in the range of the user-repo line (see below)
    - popular repositories

    Each of these is done by a candidate source, which will generate a large set of possible candidates and features to rank them with.  These features are then passed to a classifier which identifies a subset to be passed through to the next step.

2.  We merge all of the candidate predictions (normally the top 100) from each source, and take a superset of about 400 possible candidates to process further.

3.  We run each candidate prediction through a ranking classifier, which predicts the probability that the candidate is correct.  The probability depends upon a lot of features that model all kinds of aspects of the process.  The training of the classifier and the features that are used are described below.

4.  We filter out predictions for already watched repos, sort by the score, and dump the top 100 with their probabilities to prob-results.txt.

5.  We take the 10 top predictions from this file for the results.txt, without the probabilities.


Algorithms
==========

The following algorithms are used:
- a SVD in order to map each user and repo onto a low dimensional representation;
- k-means clustering in order to cluster the users and repos;
- bagged boosted decision trees in order to generate the ranking classifier
- a stochastic random walk process to determine a popularity ranking of the repos.  It mostly gives the same results as ranking by the number of watchers, though.
- a (programming) language similarity model, which takes the cosine between normalized language vectors
- a co-occurrence model, where weight is awarded when two repositories are recommended together by the same user, or when two users both watch the same repository together. 
- a keyword model, where tf and tf-idf vectors of the repository names and descriptions are used.


Extra Information
=================

I downloaded information for each author about the date that the person joined and the number of followed and following people, as well as the original github ID (which is completely useless, and ignored).  It's in the authors.txt file, which anyone can feel free to use however they want.  Personally, I got nearly zero benefit from this information.

I also have a file with the descriptions of (most of) the repositories in repo_descriptions.txt.  I used this to generate fuller tf-idf vectors of the repository name.  Note that this file has a lot of syntax errors (escaping, ...) due to the not-ideally-robust nature of the scripts used to generate it.


Candidate Source Feature Vector Contents
========================================

Each candidate source calculates a feature vector (which is a list of variables with values) for each of the candidates.  These candidate feature vectors are used to reduce the list of (potentially) 1000s of candidates into set of 100 or so *likely* candidates.

The challenge here was to improve the diversity of the results.  For example, if the user already watched some repos by rails, then we didn't want to add another 200 rails/ repos at the expense of other repos from other authors (this could happen as the rails repos are all very popular).  By ranking the results within each set and providing features about this, the classifier was able to learn that we only need a few results per author from several authors rather than an exhaustive list from one single author.

We also want each candidate source to provide different results; we don't just want all of them saying that "parents of watched repos are likely".

In order to deal with these challenges, we use a two-part feature vector.  

- Features from an early attempt to trace the user/repo ID line; nearly useless:
  * density of points in this part of the userid/repoid space
  * the actual user ID
  * the ratio of the user ID to the repo ID;

- General features about the user/repo IDs:
  * the number of repos that the user is watching
  * the number of users watching the proposed repo
  * the number of lines of code in the repo

- Features from the stochastic random walk:
  * the probability of the repo as determined by the stochastic random walk
  * the rank of this probability over all repos
  * the probability of the user as determined by the stochastic random walk
  * the rank of this probability over all users
  * the product of the user and the repo probabilities

- Features about the family tree:
  * does the repo have a parent?
  * how many child repos does the repo have?
  * how many ancestor repos does the repo have?
  * how many sibling repos does the repo have?
  * how many watchers does the parent repo have?

Specific to coocs and coocs2:

        result.add_feature("cooc_total_score", Feature_Info::REAL);
        result.add_feature("cooc_max_score", Feature_Info::REAL);
        result.add_feature("cooc_avg_score", Feature_Info::REAL);
        result.add_feature("cooc_num_scores", Feature_Info::REAL);


Specific to repos by same author:

        result.add_feature("author_already_watched_num", Feature_Info::REAL);
        result.add_feature("author_unwatched_num", Feature_Info::REAL);
        result.add_feature("author_already_watched_prop",
                           Feature_Info::REAL);
        result.add_feature("author_num_watchers_already",
                           Feature_Info::REAL);
        result.add_feature("author_prop_watchers_already",
                           Feature_Info::REAL);
        result.add_feature("author_abs_rank", Feature_Info::REAL);
        result.add_feature("author_abs_percentile", Feature_Info::REAL);
        result.add_feature("author_unwatched_rank", Feature_Info::REAL);
        result.add_feature("author_unwatched_percentile", Feature_Info::REAL);

Specific to repos with same name:

        result.add_feature("same_name_already_watched_num", Feature_Info::REAL);
        result.add_feature("same_name_unwatched_num", Feature_Info::REAL);
        result.add_feature("same_name_already_watched_prop",
                           Feature_Info::REAL);
        result.add_feature("same_name_num_watchers_already",
                           Feature_Info::REAL);
        result.add_feature("same_name_prop_watchers_already",
                           Feature_Info::REAL);
        result.add_feature("same_name_abs_rank", Feature_Info::REAL);
        result.add_feature("same_name_abs_percentile", Feature_Info::REAL);
        result.add_feature("same_name_unwatched_rank", Feature_Info::REAL);
        result.add_feature("same_name_unwatched_percentile",
                           Feature_Info::REAL);

Specific to in_cluster_repo:

        result.add_feature("rcluster_num_watched_in_cluster",
                           Feature_Info::REAL);
        result.add_feature("rcluster_prop_watched_in_cluster",
                           Feature_Info::REAL);
        result.add_feature("rcluster_rank_in_cluster",
                           Feature_Info::REAL);
        result.add_feature("rcluster_best_dp_in_cluster",
                           Feature_Info::REAL);
        result.add_feature("rcluster_best_norm_dp_in_cluster",
                           Feature_Info::REAL);

Specific to in_cluster_user:

        result.add_feature("ucluster_num_watchers", Feature_Info::REAL);
        result.add_feature("ucluster_watcher_score", Feature_Info::REAL);
        result.add_feature("ucluster_highest_dp", Feature_Info::REAL);
        result.add_feature("ucluster_highest_dp_norm", Feature_Info::REAL);


Performance of Candidate Selection
==================================

For each candidate selection algorithm, I have two tables of statisics.

The first provides statistics about generation of candidates by each of the candidate sources, before the candidate source classifiers (the generalized linear models) run:

                     fired  nrcl   rcl%  watch  totalnum peruser    prec
parents_of_watched    2971  1458(49.07%)  4695     8143(   2.74) 75.5618
ancestors_of_watched   246     3( 1.22%)    24      288(   1.17)  9.3750
by_watched_authors    3426  1245(36.34%)     0   562516( 164.19)  0.2213
same_name             4242  1617(38.12%)     0  1470496( 346.65)  0.1100
children_of_watched   3080   175( 5.68%)  5072  1078146( 350.05)  0.4867
in_cluster_repo       4788  1300(27.15%)     0  2318617( 484.26)  0.0561
in_cluster_user       4788  1982(41.40%)     0  2391605( 499.50)  0.0829
in_id_range           4788   340( 7.10%)  2219    43466(   9.08)  5.8874
coocs                 3722  2103(56.50%) 40747  9915205(2663.95)  0.4322
coocs2                3754  2552(67.98%) 48276 24308132(6475.26)  0.2091
most_watched          4788   795(16.60%)  7754   478800( 100.00)  1.7855

The columns are:
fired: number of the 2788 users for which at least one was generated
nrcl:  number of held-out (target) examples that were produced
rcl%:  which percentage of times it fired did it produce the correct answer?
watch: how many of the results produced were already watched?
totalnum: how many results were produced in total?
peruser:  average number of candidates produced pre user
prec:     (precision) how many of the results produced were either correct or already watched

We can see that choosing parents of the watched algorithms was by far the most effective result.  We can also see that there is a *lot* of filtering necessary to reduce the average of 6475 candidates from the cooc source to a reasonable number, and to improve the small precision numbers.

The next table is for after the results have been classified and the top 100 have been kept for each source.  Note that in this table the order is important, as the values in the incremental part only count *new* results that weren't already added by an earlier algorithm.

                      --------- absolute ---------------  -------- incremental --------
                     fired  corr   rcl%   nadded     avg   corr   rcl%   nadded     avg    maxsz   
parents_of_watched    2971  1458(49.07%)    3448(   1.16)  1458(49.07%)    3448(   1.16)      16
ancestors_of_watched   246     3( 1.22%)     264(   1.07)     3( 1.22%)     264(   1.07)       4
by_watched_authors    3426  1233(35.99%)  265847(  77.60)  1096(31.99%)  265512(  77.50)     200
same_name             4242  1606(37.86%)  246959(  58.22)   120( 2.83%)  241763(  56.99)     100
children_of_watched   3080   164( 5.32%)  182070(  59.11)     7( 0.23%)   69443(  22.55)     100
in_cluster_repo       4788   679(14.18%)  477770(  99.78)   104( 2.17%)  435446(  90.95)     100
in_cluster_user       4788  1344(28.07%)  478800( 100.00)   645(13.47%)  465371(  97.20)     100
in_id_range           4788   340( 7.10%)   41247(   8.61)   104( 2.17%)   40617(   8.48)      57
coocs                 3722  1296(34.82%)  255004(  68.51)   130( 3.49%)  143509(  38.56)     100
coocs2                3754  1386(36.92%)  275257(  73.32)    59( 1.57%)   69956(  18.64)     100
most_watched          4788   795(16.60%)  471046(  98.38)    24( 0.50%)  212581(  44.40)     100

The columns:
fired:         number of times the rule fired
corr:          number of times the held out value was in the set
rcl%:          % of times where candidates produced that the held out value was there
nadded:        total number of added candidates
avg:           average number of candidates added per user where it fired

It is interesting to look at the incremental results to see how the different candidate sources succeeded in adding diversity.  We can also look back to the previous table to see how much of the potential was removed by filtering down to 500 candidates.  For example, in the coocs2 source, we went from 67.98% to 36.92% after taking the top 100, and down to 1.57% when we exclude those that were added in previous rules.

The in_cluster results are particularly disappointing as the GLZ for them actually hurts rather than helps (I kept it in anyway, but I have a heuristic pre-filtering stage that limits the damage).  This is probably due to the way that I generated the training data to train the generalized linear model.  (UPDATE: it's due to a bug in the optimization of the classifier.  It works properly when fixed, but doesn't make a huge difference to the results).


Ranker Feature Vector Contents
==============================

The goal of the ranker feature vector is to take the 400 or so unique candidates that are left after the candidate sources are merged, and choose the 10 most likely to be correct.  In order to do this, we use very powerful features and a very high capacity classifier.

The final results are pretty good (over a held-out set):
fake test results:
     total:      real: 2944/4788 =  61.49%  poss: 3750/4788 =  78.32%  avg num: 406.8

So, by reducing the 406 candidates/user to 10, we keep 2944/3750 = 82.4% of the good answers.

The features from all of the candidate sources are included in the feature vector.  To those, we add the following:

- User features: we try to model the behaviour of the user: does the user follow popular repos or non-popular ones?  Does he tend to be one of the only watchers or does he watch repos that other people watch?  Does he tend to follow a few repos from lots of authors or lots of the repos of a few authors?  Does he tend to follow lots of forks of a project or a few forks of lots of projects?  These features provided a very small boost.

- Heuristic features, that calculates a heuristics score based upon the following:
  * First, add the parent repositories of repositories that the user is already watching, in order of popularity
  * Then, those other repositories by the same author as the current one
  * Then, add ancestor repositories (ie, more than one parent away)
  * Finally, add in order of popularity

- Date features.  These features were nearly useless.
- Author following/followed features.  These features were nearly useless.

- The dot-product and normalized dot product between the programming language vectors for the user's watched repos and the suggested repo (in order to see if it's in a language that the user is interested in).  Worth about 2%.
- Features based upon the name of the repo (note that these require us to match up authors of repos and users)
  * Does the user's name include the repo's name?
  * Does the repo's name include the user's name?

- Features about the author and their repos, to get an idea of their popularity:
  * How many repos does the author have?
  * How many watchers does this author's repos have?
  * How many repos are there with the same name?
  * How many watchers does this author's repos have in total
  * Were we able to infer an author for this user?
  * Were we able to infer a unique author for this user?

- Features about the cooccurrences, both using just a TF model (coocs) and an TF-IDF-like model (coocs2).  These were worth 1% or so.

- Features based upon the SVD.  These could do with a better algorithm than SVD (for example, PLSA or one of the methods in Investigation of Various Matrix Factorization Methods for Large Recommender Systems):
  * Dot products, normalized and unnormalized, between the user's topic space representation and the repo's tpoic space representation, and the highest value within the dot product
  * Same between the repo and the user's average repo (in the topic space)

- Features about the range of repo IDs that we though should be in this user id's set (see Side Channel Information).  Worth about 1.5%.

- Features about keyword matches, both tf and tf-idf, between the user's repos and the repos.  These are based upon the repo descriptions in repo_descriptions.txt and the names of the repos (eg, rails-authentication-plugin becomes "rails authentication plugin").  Note that the Natural Language Processing used was extremely basic: no removal of punctuation, no contractions, no stemming, etc.  Note also that some kind of smoothing and some domain-specific tokenization (eg, RailsAuthenticationPlugin --> rails authentication plugin) could easily improve the information.


Side Channel Information
========================

There are also features that model the particular dataset that we have here,
and do not actually contribute to making a recommendation.  These features
include:
- Modeling the density of the user/repo space.  If you look at doc/user_repo.png in my repo, you will see that there is a line that goes up the middle.
  Further investivation shows that this is a mapping between user IDs and repo IDs.  It must have been created by the process that created the dataset; this process only created a user or repo ID at the moment when it first encountered it, and the shape of the line shows the relationship between the new user creation and the new repo creation.
  If we look further, by plotting the *lowest* watching user ID against the repo ID, we can see that almost everything is below this curve, except for a sparse set of scattered points above the line (doc/repo_id_vs_lowest_user_id.png).  Most of these points above the line are due to a point on the line having been removed for the testing set.  See the comment in data.cc for more data/investigation.  By using nothing but the joint distribution of user IDs and the repo IDs, it is possible to score nearly 5.77%.
  By tracing this line (which isn't that easy, as there is a lot of noise) and proposing repositories that cover the missing points, as well as exploiting the scattered points above, we manage to pick up about 2% of repositories that we wouldn't have got before.
- Matching authors to users.  If we see that user 2839 is the only watcher on repos 6879, 7039 and 23405, and that these have the same author, then we can be pretty sure that user 2839 *is* that author.  Using this algorithm, we map 35873 of the 56521 users unambiguously to a single author id, and a further 535 users to several authors (I haven't yet investigated what happens in this case; I suspect that one person has created more than one account to get around size limits or there are user/project accounts both controlled by the same person).

Note that I decided not to look at the watcher counts in the data from the API.  One could, with some added noise, probably guess which repos had been removed somewhere with this information; I consider this a little too close to cheating.


Testing
=======

I do testing locally by generating a fake test set (the code is in data.cc).  This allows me to test my ideas without needing to always test against the official test set, which stops me from overfitting too badly.  Overfitting is, of course, a benefit in this competition but my goal is to write a good recommendation engine, not just to get the highest score.

Unfortunately, the test results that I get, regardless of the random seed that I use, tend to be about 10% better than the results I get when I submit to the official scoring program.  I have checked that the repos that I take out are similar to the official ones and they are; I am at a loss to explain it.  But there must be some systematic bias due to me not having implemented one of the details of the way the repos were held out in the generation of the official test set.

Due to this problem, I sometimes see an improvement on my test set that is not reflected on the official test set.

One of the other useful things about local testing is that I can test both my set of candidate repos and the reduced set once they have been ranked.  This testing allows me to see that for about 25% of users I never propose the correct repo; on those where I do propose the correct one, the classifier performance is quite good.

I perform training in the same manner, by holding out some more of the repos and testing the ability of the algorithm to predict what I held out.  This is particularly useful to generate the feature vector for the ranking classifier.


Training
========

The training process works roughly as follows:

1.  Perform the (slow) k-means clustering to generate clusters of users and repos.  We hold out 20,000 watches whilst doing the clustering so that we can train the rest on these without biasing.

2.  For each of the candidate sources:

    2.a  Generate a feature vector file with the features about the possible candidate and whether it was correct or not.  We use 10,000 of the 20,000 held out watches for this.

    2.b  Train a generalized linear model to provide a proability for the candidate

2.  Generate a feature vector file for the ranking classifier, with entries for each user where we had the correct result somewhere in their candidate set.  Here we use more powerful but expensive features.  For each user we include the feature vector for the held-out repo (with label 1 and weight 1.0), and a random sample of 20 feature vectors for non-held-out (and so incorrect) repos, with a weight of 1/20.  We use the other 10,000 held-out watches for this.

3.  Train a classifier on the feature vector file.  We use a much higher capacity classifier here (bagged boosted decision trees, with a tree depth of 3, 50 iterations of boosting and 5 bags).

4.  Run the classifier over both a fake testing set to get non-official results, and the official testing set to get a results.txt file to submit.

The training process is in the loadbuild.mk file.


Unbiasing
=========

One of the problems with a system like this is systematic biasing.  If we generate clusters knowing the missing repos, the clusters are likely to represent the missing repos.  Then when we train a classifier it will learn to trust the clustering more than it really should, and this will lead to poor results.

In order to avoid this problem, we remove the classifier training data from the dataset before we perform any of the steps in training, including the clustering.


Code
====

The code is organized as follows:

Compiling: (make or make -jx where x is number of cores on machine)
github.cc: main driver program; --help shows main modes (or look in loadbuild.mk)
data.{h,cc}: data structures to read the data and save the results of analysis
candidate_source.{h,cc}: implementation of the base class and specializations for the candidate generators.
ranker.{h,cc}: data structures and algorithms for the ranking of hypotheses received from the candidate_source.
decompose.{h,cc}: algorithms for decomposition and clustering
exception_hook.{h,cc}: simple file to cause a backtrace to be printed at the point where an exception is thrown
keywords.{h,c}: functionality to process keywords in the repo names/descriptions
siamese.cc: unfinished implementation of a siamese network to perform a better vector space mapping.

svdlibc/: the external library of this name; code has been patched to work with our version of GCC.  Note that there is no license with the code, so it's redistribution status is unknown.
jml/: a symlink to the jml library's source tree.  This is used for the machine learning, basic system services and build system.

Makefile: main makefile; includes logic from the make system in jml
build/: binaries go into this directory

Training: (make -j2 loadbuild)
loadbuild.mk: makefile to generate results.txt (make -j2 loadbuild to generate a results.txt file)
ranker-classifier-training-config.txt: configuration file for the classifier generator for the ranker
config.txt: configuration file for the system; not many options have been put in it for the moment
download/: contains the original data for the competition; not included in the repo (others will need to download it for themselves)
authors.txt: information about repo authors, obtained from the API
repo_descriptions.txt: descriptions of the repos, obtained from the API


Bugs
====

Apart from the systematic biasing that gives me 10% more on my internal testing than on the official test set, there are also a few other bugs that I didn't manage to fix.

The main one is that the candidate source classifiers for the cluster sources (in_cluster_user and in_cluster_repo) cause more harm than good, as the probabilizer on the output inverts the label classes.  I suspect that this is due to the baseline accuracy being too low; whatever the cause this means that when we reduce the 1000s of possibilities down to 100, we tend to take 100 *really bad* ones.  For this reason, I had to undo most of the improvements that I had made to this; it should be possible to do much better once this problem has been sorted out.  It's also possible that the problems are due to some kind of systematic biasing.  (UPDATE: it's due to a bug in the optimization of the GLZ classifier, which led to each label having a random bias added to it.  Fixing it however doesn't make that much of a difference; the in_cluster predictor is still not specific enough with the features it has available).


Thoughts on the Contest
=======================

Firstly, thanks to GitHub for running the contest.

What I liked about it:
* It was nice to have a contest with a limited scope and duration; I avoided NetFlix precisely because I knew that it would become an obsession for several years, and real life (especially with a 2 1/2 year old) makes it difficult to deal with.  A 3 week contest (I spent about this time from go to whoa) on the other hand felt much more focused and was more fun.
* It was nice to have a real dataset.  I hope that the held-out data will also be published.
* It was nice to have a reasonable amount of data: not so much that one needed to spend a lot of time on data structures to fit it in memory, but not so little that there wasn't any rich structure to be found.
* It was actually good to have some strong heuristics that could provide decent results without too much work.  It meant that people actually concentrated on understanding the data early on, rather than blindly applying the Machine Learning technique du jour.

Now for the things that were unfortunate.  These are mostly obvious in hindsight, but it's worth enumerating them anyhow.

* Possibilities to hill-climb.  NetFlix avoided this by requiring a submission of a large set of predictions, the score on only 1/2 of which was given when a submission was made.  The score on the other 1/2 was kept by NetFlix to evaluate the grand prize, which meant that someone who had overfitted would have a gap between their public scores and the internal NetFlix one.  Indeed, it seems that the team on top of the leaderboard had overfitted more than the second team, which lead to the second placed team being considered for the grand prize.  In GitHub, however, it would be possible to perform hill-climbing (overfitting) without any generalization improvement.  Unfortunately, looking at other submissions there seems to have been a degree of this happening.
* Possibility to take other people's result files.  Blending is a valid approach and fine by me, but this means that the highest results will probably be useless in terms of creating a recommendation engine.  It would have been nice to have either a separate category or to have required co-operation rather than simply taking and using the top 20 on the leaderboard.
* Possibility to model the dataset.  See above (Side Channel Information) for my description of how I modeled the user ID/repo ID generation process.  This part of my solution, while fun, was also completely irrelevant to the task of creating a good recommendation engine.
* If there had have been a way to reduce the effect of the heuristics on the final score.  One way to do this (based upon the Boosting algorithm) would have been to keep track of a weight for each of the users, which would have increased as people submitted results that got that user wrong and decreased as people submitted results that got the user right.  Providing a weighted score like this would have encouraged systems that provided diverse results (not the same as the other systems).
* Probably, in hindsight, anyone who wants to run a competition like this without a lot of experience should give themself a week in which they're allowed to change the rules, and then do so based upon what people come up with that's not in the hoped direction.

Things that would have been nice:
* To have included two held-out sets, so that blending algorithms could have been trained on one of them (rather than using the blind weighted approaches that seem to have been favored).
* An official discussion forum or mailing list.
* Publication of the code that was used to split the training and testing samples so that we could produce other held-out sets that would have had comparable performance.
* It was a bit of a shame to see that hardly anyone tried any of the advanced techniques out there (I would have liked to try a deep siamese neural network and better matrix factorization models).  I guess that this is a combination of a) burnout from those who participated in NetFlix, b) the short timeframe and c) the relative power of heuristic techniques.

Credits
=======

Thanks to:
* Daniel Haran's code and README for the idea of using parent repositories, and him personally for discussions of the solutions
* asciiarmor's code (thanks for publishing it) for the basic features necessary to get to reasonable performance
* xlvector's code for the idea of clustering by cooccurrences (now why didn't I think of that!) and the man himself for the idea of increasing the diversity of the predictions


Todo
====

Future things that I will or may try:

* Use a classifier instead of a heuristic to rank the set of potential repositories (this will help us where the user has a lot of watched repositories) (done)
* Look at language matches (done)
* Look at keywords in the repository names (eg, "rails" says a lot about which kind of repos that people like) (done)
* Add a simple knn implementation based upon a dot product between repos to find new ones (will need pre-computation is it will be SLLOOOOWWWW!) (done).  Only for cluster members.  Didn't provide that much of an improvement; we need a better matrix factorization.
* EM on a graphical model to learn a smoothed user/repo matrix (done)
* Embed both repositories and users into the same feature space using a deep neural network and use cosine similarity between embedded representations to learn similarity.
* Potentially, other (simpler) matrix factorization models (done)
* Try with no probabilizer on the classifier(s) (done)
* Fix boosting/decision tree bug causing us to stick on language_cosine < 1 (done)
* Find where dataset is biased between fake/real test sets
* Do some reasonable processing of the keywords file
* Multithreading to produce the results file and training (done)
* Perform a SVD of the repo descriptions to smooth them
* Diversity: don't allow all of the predictions to come from the same set.  Penalize those that are too similar to the previous source.
* Cheating: hill-climb (using a Nelder-Mead or similar multivariate optimization technique).  Guaranteed to win, but not at all useful.
* Why is repo.author == -1 in some cases?
* Find and use collaborators (two users that watch each other's repositories)
* SVD on the keywords to smooth
* More exploration!  I had time to implement a bunch of ideas, but not to follow them through to check that they worked as expected.
* Split the by_this_author rule from by_same_author to allow more precision (done).  Small improvement.

Jeremy Barnes
28 August 2009