public
Description: Github contest
Homepage:
Clone URL: git://github.com/jeremybarnes/github_contest.git
name age message
file .gitignore Tue Sep 08 10:48:52 -0700 2009 don't keep copying prob-results.txt [jeremybarnes]
file Makefile Wed Aug 26 09:29:51 -0700 2009 started refactoring to classify each source sep... [Jeremy Barnes]
file README Wed Sep 02 18:30:48 -0700 2009 added support to read test results with attache... [Jeremy Barnes]
file analyze_keywords.cc Sat Aug 22 18:06:12 -0700 2009 refactored keyword processing code [Jeremy Barnes]
file authors.txt Thu Aug 20 09:14:38 -0700 2009 added date features [Jeremy Barnes]
directory blenders/ Sat Aug 29 12:56:55 -0700 2009 added files for blenders; updated README furthe... [Jeremy Barnes]
file candidate_source.cc Fri Sep 04 08:31:16 -0700 2009 removed user and author matches from coocs and ... [jeremybarnes]
file candidate_source.h Wed Aug 26 15:58:40 -0700 2009 half way through restructure [Jeremy Barnes]
file config.txt Thu Sep 03 20:26:00 -0700 2009 finished SVD for keywords and various improveme... [jeremybarnes]
file data.cc Mon Sep 14 09:09:49 -0700 2009 added follower graph to the information; minor ... [jeremybarnes]
file data.h Mon Sep 14 09:09:49 -0700 2009 added follower graph to the information; minor ... [jeremybarnes]
file decompose.cc Fri Sep 04 07:55:43 -0700 2009 use both keywords and watcher SVD for repo clus... [jeremybarnes]
file decompose.h Mon Aug 17 20:32:13 -0700 2009 implemented kmeans clustering; not currently used [Jeremy Barnes]
directory doc/ Thu Aug 20 06:17:32 -0700 2009 added more docs about user id/repo id relationship [Jeremy Barnes]
directory download/ Mon Sep 14 09:54:12 -0700 2009 added missing data file [jeremybarnes]
file exception_hook.cc Thu Aug 06 19:58:29 -0700 2009 Parsing and results but not much else [jeremybarnes]
file exception_hook.h Thu Aug 06 19:58:29 -0700 2009 Parsing and results but not much else [jeremybarnes]
file fake-results.txt Mon Sep 14 09:09:49 -0700 2009 added follower graph to the information; minor ... [jeremybarnes]
file fake-results.txt.log Mon Sep 14 09:09:49 -0700 2009 added follower graph to the information; minor ... [jeremybarnes]
file get_repo.sh Sat Aug 22 18:06:12 -0700 2009 refactored keyword processing code [Jeremy Barnes]
file get_user.sh Thu Aug 20 11:19:04 -0700 2009 test of ID inference [Jeremy Barnes]
file github.cc Tue Sep 01 19:17:20 -0700 2009 multithreading fixes; seems to work properly now [jeremybarnes]
file jml Thu Aug 06 19:58:51 -0700 2009 added jml symlink [jeremybarnes]
file keywords.cc Fri Sep 04 07:10:33 -0700 2009 use binary features for keyword svd; big improv... [jeremybarnes]
file keywords.h Thu Sep 03 20:26:00 -0700 2009 finished SVD for keywords and various improveme... [jeremybarnes]
file loadbuild.mk Wed Sep 02 20:35:10 -0700 2009 added collaborators; again it didn't make much ... [jeremybarnes]
file notes.txt Fri Sep 04 08:31:16 -0700 2009 removed user and author matches from coocs and ... [jeremybarnes]
file prob-results.txt.log Mon Sep 14 09:09:49 -0700 2009 added follower graph to the information; minor ... [jeremybarnes]
file process_repos.awk Fri Aug 21 04:23:09 -0700 2009 added cooccurrence suggestions; big difference ... [Jeremy Barnes]
file ranker-classifier-training-config.txt Mon Sep 14 18:34:14 -0700 2009 use random forests (from the latest jml update) [Jeremy Barnes]
file ranker.cc Mon Sep 14 09:09:49 -0700 2009 added follower graph to the information; minor ... [jeremybarnes]
file ranker.h Tue Sep 01 10:36:28 -0700 2009 multithreaded; still some problems to debug wit... [jeremybarnes]
file raw_repo_data.txt Sat Aug 22 18:06:12 -0700 2009 refactored keyword processing code [Jeremy Barnes]
file raw_repo_data2.txt Sat Aug 22 18:06:12 -0700 2009 refactored keyword processing code [Jeremy Barnes]
file raw_repo_data3.txt Sat Aug 22 18:06:12 -0700 2009 refactored keyword processing code [Jeremy Barnes]
file repo_descriptions.txt Sat Aug 22 19:29:58 -0700 2009 first use of keywords (very quick and dirty) [Jeremy Barnes]
file results.txt Mon Sep 14 09:09:49 -0700 2009 added follower graph to the information; minor ... [jeremybarnes]
file results.txt.log Mon Aug 24 10:15:35 -0700 2009 probabilistic output format then post-processed... [Jeremy Barnes]
file siamese.cc Mon Aug 31 12:04:09 -0700 2009 re-running on other machine with clustering spe... [jeremybarnes]
file stop_words.txt Tue Aug 25 11:57:44 -0700 2009 added stopword list [Jeremy Barnes]
directory svdlibc/ Sun Aug 16 19:28:05 -0700 2009 performing SVD; no analysis of results yet [Jeremy Barnes]
README
This is Jeremy Barnes's Github recommendation engine.


Advertisement
=============

I'm currently looking for consulting contracts.  I do pretty much any data mining, machine learning or early-stage 
technology stuff, but I'm particularly experienced in Natural Language Processing.

You can contact me on jeremy (at) barneso (dot) com.


Copying
=======

Feel free to browse my code and take my ideas for your system, if ever they are useful.  You can also use my results.txt 
however you like.

If ever you want to use my code, you can do so under the Affero GPL v3.  Note that getting it to work might be 
difficult; it relies on JML (Jeremy's Machine Learning Library) at http://bitbucket.org/jeremy_barnes/jml/.  This code 
needs to be downloaded and compiled and the jml symlink in the github repo set to point to it.  I build it on 32 and 64 
bit Ubuntu 8.10/9.04 machines; anything else might be difficult.  There are also a *lot* of supporting libraries to 
install (all of which come with Ubuntu, but many of which aren't installed by default).


For Blenders
============

I have included the following files for the benefit of blenders (in the blenders directory):
* training-50-repos.txt: 50 results/user ranked for the held-out user set from Daniel Haran's repo;
* testing-50-repos.txt: 50 results/user ranked for the official test set
* prob-results.txt: 100 results/user + probability from the ranking classifier for the official test set

Feel free to incorporate them any way you want to, though you should give credit.


Features
========

The system that I have implemented has a few unique features that might be interesting.
* Probabilistic predictions: the system provides probabilistic predictions of the score of the repos.  The major 
advantage of this is that we can see the confidence in our predictions.
* Ability to explain predictions: using the explain functionality of the machine learning library (which has been 
designed and used commercially, but is not in the current version of the library), we can tell the user why we though 
that they would like a particular repo.
* Practicality: it takes about 8 minutes to generate the predictions for all 4788 users on a single CPU core, which 
means that about 10 users/second/CPU core can be processed.  The training process takes about 35 minutes.
* Malleability: there is essentially no hand-tuning that has been done on the system, and it is very easy to add extra 
features to take advantage of further information.  It would also be relatively easy to take out of the context of the 
competition and put into production.


System Architecture
===================

The system architecture for prediction is as follows:

1.  We generate a set of candidate predictions, based upon the following things:
    - parents of watched repositories
    - ancestors (ie, grandparents or older) of watched repositories
    - repos by authors of watched repositories
    - repos with the same name (but a different author) as the watched repositories
    - children of watched repositories
    - repos that are in the same cluster as a watched repo (see below)
    - repos of users that are in the same cluster as the user we are predicting (see below)
    - repos that were co-suggested with a watched repo by a given user
    - repos that are watched by users that had co-suggested a watched repo (I guess I should include a picture)
    - repos that have an ID in the range of the user-repo line (see below)
    - popular repositories

    Each of these is done by a candidate source, which will generate a large set of possible candidates and features to 
    rank them with.  These features are then passed to a classifier which identifies a subset to be passed through to 
    the next step.

2.  We merge all of the candidate predictions (normally the top 100) from each source, and take a superset of about 400 
possible candidates to process further.

3.  We run each candidate prediction through a ranking classifier, which predicts the probability that the candidate is 
correct.  The probability depends upon a lot of features that model all kinds of aspects of the process.  The training 
of the classifier and the features that are used are described below.

4.  We filter out predictions for already watched repos, sort by the score, and dump the top 100 with their 
probabilities to prob-results.txt.

5.  We take the 10 top predictions from this file for the results.txt, without the probabilities.


Algorithms
==========

The following algorithms are used:
- a SVD in order to map each user and repo onto a low dimensional representation;
- k-means clustering in order to cluster the users and repos;
- bagged boosted decision trees in order to generate the ranking classifier
- a stochastic random walk process to determine a popularity ranking of the repos.  It mostly gives the same results as 
ranking by the number of watchers, though.
- a (programming) language similarity model, which takes the cosine between normalized language vectors
- a co-occurrence model, where weight is awarded when two repositories are recommended together by the same user, or 
when two users both watch the same repository together. 
- a keyword model, where tf and tf-idf vectors of the repository names and descriptions are used.


Extra Information
=================

I downloaded information for each author about the date that the person joined and the number of followed and following 
people, as well as the original github ID (which is completely useless, and ignored).  It's in the authors.txt file, 
which anyone can feel free to use however they want.  Personally, I got nearly zero benefit from this information.

I also have a file with the descriptions of (most of) the repositories in repo_descriptions.txt.  I used this to 
generate fuller tf-idf vectors of the repository name.  Note that this file has a lot of syntax errors (escaping, ...) 
due to the not-ideally-robust nature of the scripts used to generate it.


Candidate Source Feature Vector Contents
========================================

Each candidate source calculates a feature vector (which is a list of variables with values) for each of the candidates. 
 These candidate feature vectors are used to reduce the list of (potentially) 1000s of candidates into set of 100 or so 
*likely* candidates.

The challenge here was to improve the diversity of the results.  For example, if the user already watched some repos by 
rails, then we didn't want to add another 200 rails/ repos at the expense of other repos from other authors (this could 
happen as the rails repos are all very popular).  By ranking the results within each set and providing features about 
this, the classifier was able to learn that we only need a few results per author from several authors rather than an 
exhaustive list from one single author.

We also want each candidate source to provide different results; we don't just want all of them saying that "parents of 
watched repos are likely".

In order to deal with these challenges, we use a two-part feature vector.  

- Features from an early attempt to trace the user/repo ID line; nearly useless:
  * density of points in this part of the userid/repoid space
  * the actual user ID
  * the ratio of the user ID to the repo ID;

- General features about the user/repo IDs:
  * the number of repos that the user is watching
  * the number of users watching the proposed repo
  * the number of lines of code in the repo

- Features from the stochastic random walk:
  * the probability of the repo as determined by the stochastic random walk
  * the rank of this probability over all repos
  * the probability of the user as determined by the stochastic random walk
  * the rank of this probability over all users
  * the product of the user and the repo probabilities

- Features about the family tree:
  * does the repo have a parent?
  * how many child repos does the repo have?
  * how many ancestor repos does the repo have?
  * how many sibling repos does the repo have?
  * how many watchers does the parent repo have?

Specific to coocs and coocs2:

        result.add_feature("cooc_total_score", Feature_Info::REAL);
        result.add_feature("cooc_max_score", Feature_Info::REAL);
        result.add_feature("cooc_avg_score", Feature_Info::REAL);
        result.add_feature("cooc_num_scores", Feature_Info::REAL);


Specific to repos by same author:

        result.add_feature("author_already_watched_num", Feature_Info::REAL);
        result.add_feature("author_unwatched_num", Feature_Info::REAL);
        result.add_feature("author_already_watched_prop",
                           Feature_Info::REAL);
        result.add_feature("author_num_watchers_already",
                           Feature_Info::REAL);
        result.add_feature("author_prop_watchers_already",
                           Feature_Info::REAL);
        result.add_feature("author_abs_rank", Feature_Info::REAL);
        result.add_feature("author_abs_percentile", Feature_Info::REAL);
        result.add_feature("author_unwatched_rank", Feature_Info::REAL);
        result.add_feature("author_unwatched_percentile", Feature_Info::REAL);

Specific to repos with same name:

        result.add_feature("same_name_already_watched_num", Feature_Info::REAL);
        result.add_feature("same_name_unwatched_num", Feature_Info::REAL);
        result.add_feature("same_name_already_watched_prop",
                           Feature_Info::REAL);
        result.add_feature("same_name_num_watchers_already",
                           Feature_Info::REAL);
        result.add_feature("same_name_prop_watchers_already",
                           Feature_Info::REAL);
        result.add_feature("same_name_abs_rank", Feature_Info::REAL);
        result.add_feature("same_name_abs_percentile", Feature_Info::REAL);
        result.add_feature("same_name_unwatched_rank", Feature_Info::REAL);
        result.add_feature("same_name_unwatched_percentile",
                           Feature_Info::REAL);

Specific to in_cluster_repo:

        result.add_feature("rcluster_num_watched_in_cluster",
                           Feature_Info::REAL);
        result.add_feature("rcluster_prop_watched_in_cluster",
                           Feature_Info::REAL);
        result.add_feature("rcluster_rank_in_cluster",
                           Feature_Info::REAL);
        result.add_feature("rcluster_best_dp_in_cluster",
                           Feature_Info::REAL);
        result.add_feature("rcluster_best_norm_dp_in_cluster",
                           Feature_Info::REAL);

Specific to in_cluster_user:

        result.add_feature("ucluster_num_watchers", Feature_Info::REAL);
        result.add_feature("ucluster_watcher_score", Feature_Info::REAL);
        result.add_feature("ucluster_highest_dp", Feature_Info::REAL);
        result.add_feature("ucluster_highest_dp_norm", Feature_Info::REAL);


Performance of Candidate Selection
==================================

For each candidate selection algorithm, I have two tables of statisics.

The first provides statistics about generation of candidates by each of the candidate sources, before the candidate 
source classifiers (the generalized linear models) run:

                     fired  nrcl   rcl%  watch  totalnum peruser    prec
parents_of_watched    2971  1458(49.07%)  4695     8143(   2.74) 75.5618
ancestors_of_watched   246     3( 1.22%)    24      288(   1.17)  9.3750
by_watched_authors    3426  1245(36.34%)     0   562516( 164.19)  0.2213
same_name             4242  1617(38.12%)     0  1470496( 346.65)  0.1100
children_of_watched   3080   175( 5.68%)  5072  1078146( 350.05)  0.4867
in_cluster_repo       4788  1300(27.15%)     0  2318617( 484.26)  0.0561
in_cluster_user       4788  1982(41.40%)     0  2391605( 499.50)  0.0829
in_id_range           4788   340( 7.10%)  2219    43466(   9.08)  5.8874
coocs                 3722  2103(56.50%) 40747  9915205(2663.95)  0.4322
coocs2                3754  2552(67.98%) 48276 24308132(6475.26)  0.2091
most_watched          4788   795(16.60%)  7754   478800( 100.00)  1.7855

The columns are:
fired: number of the 2788 users for which at least one was generated
nrcl:  number of held-out (target) examples that were produced
rcl%:  which percentage of times it fired did it produce the correct answer?
watch: how many of the results produced were already watched?
totalnum: how many results were produced in total?
peruser:  average number of candidates produced pre user
prec:     (precision) how many of the results produced were either correct or already watched

We can see that choosing parents of the watched algorithms was by far the most effective result.  We can also see that 
there is a *lot* of filtering necessary to reduce the average of 6475 candidates from the cooc source to a reasonable 
number, and to improve the small precision numbers.

The next table is for after the results have been classified and the top 100 have been kept for each source.  Note that 
in this table the order is important, as the values in the incremental part only count *new* results that weren't 
already added by an earlier algorithm.

                      --------- absolute ---------------  -------- incremental --------
                     fired  corr   rcl%   nadded     avg   corr   rcl%   nadded     avg    maxsz   
parents_of_watched    2971  1458(49.07%)    3448(   1.16)  1458(49.07%)    3448(   1.16)      16
ancestors_of_watched   246     3( 1.22%)     264(   1.07)     3( 1.22%)     264(   1.07)       4
by_watched_authors    3426  1233(35.99%)  265847(  77.60)  1096(31.99%)  265512(  77.50)     200
same_name             4242  1606(37.86%)  246959(  58.22)   120( 2.83%)  241763(  56.99)     100
children_of_watched   3080   164( 5.32%)  182070(  59.11)     7( 0.23%)   69443(  22.55)     100
in_cluster_repo       4788   679(14.18%)  477770(  99.78)   104( 2.17%)  435446(  90.95)     100
in_cluster_user       4788  1344(28.07%)  478800( 100.00)   645(13.47%)  465371(  97.20)     100
in_id_range           4788   340( 7.10%)   41247(   8.61)   104( 2.17%)   40617(   8.48)      57
coocs                 3722  1296(34.82%)  255004(  68.51)   130( 3.49%)  143509(  38.56)     100
coocs2                3754  1386(36.92%)  275257(  73.32)    59( 1.57%)   69956(  18.64)     100
most_watched          4788   795(16.60%)  471046(  98.38)    24( 0.50%)  212581(  44.40)     100

The columns:
fired:         number of times the rule fired
corr:          number of times the held out value was in the set
rcl%:          % of times where candidates produced that the held out value was there
nadded:        total number of added candidates
avg:           average number of candidates added per user where it fired

It is interesting to look at the incremental results to see how the different candidate sources succeeded in adding 
diversity.  We can also look back to the previous table to see how much of the potential was removed by filtering down 
to 500 candidates.  For example, in the coocs2 source, we went from 67.98% to 36.92% after taking the top 100, and down 
to 1.57% when we exclude those that were added in previous rules.

The in_cluster results are particularly disappointing as the GLZ for them actually hurts rather than helps (I kept it in 
anyway, but I have a heuristic pre-filtering stage that limits the damage).  This is probably due to the way that I 
generated the training data to train the generalized linear model.  (UPDATE: it's due to a bug in the optimization of 
the classifier.  It works properly when fixed, but doesn't make a huge difference to the results).


Ranker Feature Vector Contents
==============================

The goal of the ranker feature vector is to take the 400 or so unique candidates that are left after the candidate 
sources are merged, and choose the 10 most likely to be correct.  In order to do this, we use very powerful features and 
a very high capacity classifier.

The final results are pretty good (over a held-out set):
fake test results:
     total:      real: 2944/4788 =  61.49%  poss: 3750/4788 =  78.32%  avg num: 406.8

So, by reducing the 406 candidates/user to 10, we keep 2944/3750 = 82.4% of the good answers.

The features from all of the candidate sources are included in the feature vector.  To those, we add the following:

- User features: we try to model the behaviour of the user: does the user follow popular repos or non-popular ones?  
Does he tend to be one of the only watchers or does he watch repos that other people watch?  Does he tend to follow a 
few repos from lots of authors or lots of the repos of a few authors?  Does he tend to follow lots of forks of a project 
or a few forks of lots of projects?  These features provided a very small boost.

- Heuristic features, that calculates a heuristics score based upon the following:
  * First, add the parent repositories of repositories that the user is already watching, in order of popularity
  * Then, those other repositories by the same author as the current one
  * Then, add ancestor repositories (ie, more than one parent away)
  * Finally, add in order of popularity

- Date features.  These features were nearly useless.
- Author following/followed features.  These features were nearly useless.

- The dot-product and normalized dot product between the programming language vectors for the user's watched repos and 
the suggested repo (in order to see if it's in a language that the user is interested in).  Worth about 2%.
- Features based upon the name of the repo (note that these require us to match up authors of repos and users)
  * Does the user's name include the repo's name?
  * Does the repo's name include the user's name?

- Features about the author and their repos, to get an idea of their popularity:
  * How many repos does the author have?
  * How many watchers does this author's repos have?
  * How many repos are there with the same name?
  * How many watchers does this author's repos have in total
  * Were we able to infer an author for this user?
  * Were we able to infer a unique author for this user?

- Features about the cooccurrences, both using just a TF model (coocs) and an TF-IDF-like model (coocs2).  These were 
worth 1% or so.

- Features based upon the SVD.  These could do with a better algorithm than SVD (for example, PLSA or one of the methods 
in Investigation of Various Matrix Factorization Methods for Large Recommender Systems):
  * Dot products, normalized and unnormalized, between the user's topic space representation and the repo's tpoic space 
  representation, and the highest value within the dot product
  * Same between the repo and the user's average repo (in the topic space)

- Features about the range of repo IDs that we though should be in this user id's set (see Side Channel Information).  
Worth about 1.5%.

- Features about keyword matches, both tf and tf-idf, between the user's repos and the repos.  These are based upon the 
repo descriptions in repo_descriptions.txt and the names of the repos (eg, rails-authentication-plugin becomes "rails 
authentication plugin").  Note that the Natural Language Processing used was extremely basic: no removal of punctuation, 
no contractions, no stemming, etc.  Note also that some kind of smoothing and some domain-specific tokenization (eg, 
RailsAuthenticationPlugin --> rails authentication plugin) could easily improve the information.


Side Channel Information
========================

There are also features that model the particular dataset that we have here,
and do not actually contribute to making a recommendation.  These features
include:
- Modeling the density of the user/repo space.  If you look at doc/user_repo.png in my repo, you will see that there is 
a line that goes up the middle.
  Further investivation shows that this is a mapping between user IDs and repo IDs.  It must have been created by the 
  process that created the dataset; this process only created a user or repo ID at the moment when it first encountered 
  it, and the shape of the line shows the relationship between the new user creation and the new repo creation.
  If we look further, by plotting the *lowest* watching user ID against the repo ID, we can see that almost everything 
  is below this curve, except for a sparse set of scattered points above the line (doc/repo_id_vs_lowest_user_id.png).  
  Most of these points above the line are due to a point on the line having been removed for the testing set.  See the 
  comment in data.cc for more data/investigation.  By using nothing but the joint distribution of user IDs and the repo 
  IDs, it is possible to score nearly 5.77%.
  By tracing this line (which isn't that easy, as there is a lot of noise) and proposing repositories that cover the 
  missing points, as well as exploiting the scattered points above, we manage to pick up about 2% of repositories that 
  we wouldn't have got before.
- Matching authors to users.  If we see that user 2839 is the only watcher on repos 6879, 7039 and 23405, and that these 
have the same author, then we can be pretty sure that user 2839 *is* that author.  Using this algorithm, we map 35873 of 
the 56521 users unambiguously to a single author id, and a further 535 users to several authors (I haven't yet 
investigated what happens in this case; I suspect that one person has created more than one account to get around size 
limits or there are user/project accounts both controlled by the same person).

Note that I decided not to look at the watcher counts in the data from the API.  One could, with some added noise, 
probably guess which repos had been removed somewhere with this information; I consider this a little too close to 
cheating.


Testing
=======

I do testing locally by generating a fake test set (the code is in data.cc).  This allows me to test my ideas without 
needing to always test against the official test set, which stops me from overfitting too badly.  Overfitting is, of 
course, a benefit in this competition but my goal is to write a good recommendation engine, not just to get the highest 
score.

Unfortunately, the test results that I get, regardless of the random seed that I use, tend to be about 10% better than 
the results I get when I submit to the official scoring program.  I have checked that the repos that I take out are 
similar to the official ones and they are; I am at a loss to explain it.  But there must be some systematic bias due to 
me not having implemented one of the details of the way the repos were held out in the generation of the official test 
set.

Due to this problem, I sometimes see an improvement on my test set that is not reflected on the official test set.

One of the other useful things about local testing is that I can test both my set of candidate repos and the reduced set 
once they have been ranked.  This testing allows me to see that for about 25% of users I never propose the correct repo; 
on those where I do propose the correct one, the classifier performance is quite good.

I perform training in the same manner, by holding out some more of the repos and testing the ability of the algorithm to 
predict what I held out.  This is particularly useful to generate the feature vector for the ranking classifier.


Training
========

The training process works roughly as follows:

1.  Perform the (slow) k-means clustering to generate clusters of users and repos.  We hold out 20,000 watches whilst 
doing the clustering so that we can train the rest on these without biasing.

2.  For each of the candidate sources:

    2.a  Generate a feature vector file with the features about the possible candidate and whether it was correct or 
    not.  We use 10,000 of the 20,000 held out watches for this.

    2.b  Train a generalized linear model to provide a proability for the candidate

2.  Generate a feature vector file for the ranking classifier, with entries for each user where we had the correct 
result somewhere in their candidate set.  Here we use more powerful but expensive features.  For each user we include 
the feature vector for the held-out repo (with label 1 and weight 1.0), and a random sample of 20 feature vectors for 
non-held-out (and so incorrect) repos, with a weight of 1/20.  We use the other 10,000 held-out watches for this.

3.  Train a classifier on the feature vector file.  We use a much higher capacity classifier here (bagged boosted 
decision trees, with a tree depth of 3, 50 iterations of boosting and 5 bags).

4.  Run the classifier over both a fake testing set to get non-official results, and the official testing set to get a 
results.txt file to submit.

The training process is in the loadbuild.mk file.


Unbiasing
=========

One of the problems with a system like this is systematic biasing.  If we generate clusters knowing the missing repos, 
the clusters are likely to represent the missing repos.  Then when we train a classifier it will learn to trust the 
clustering more than it really should, and this will lead to poor results.

In order to avoid this problem, we remove the classifier training data from the dataset before we perform any of the 
steps in training, including the clustering.


Code
====

The code is organized as follows:

Compiling: (make or make -jx where x is number of cores on machine)
github.cc: main driver program; --help shows main modes (or look in loadbuild.mk)
data.{h,cc}: data structures to read the data and save the results of analysis
candidate_source.{h,cc}: implementation of the base class and specializations for the candidate generators.
ranker.{h,cc}: data structures and algorithms for the ranking of hypotheses received from the candidate_source.
decompose.{h,cc}: algorithms for decomposition and clustering
exception_hook.{h,cc}: simple file to cause a backtrace to be printed at the point where an exception is thrown
keywords.{h,c}: functionality to process keywords in the repo names/descriptions
siamese.cc: unfinished implementation of a siamese network to perform a better vector space mapping.

svdlibc/: the external library of this name; code has been patched to work with our version of GCC.  Note that there is 
no license with the code, so it's redistribution status is unknown.
jml/: a symlink to the jml library's source tree.  This is used for the machine learning, basic system services and 
build system.

Makefile: main makefile; includes logic from the make system in jml
build/: binaries go into this directory

Training: (make -j2 loadbuild)
loadbuild.mk: makefile to generate results.txt (make -j2 loadbuild to generate a results.txt file)
ranker-classifier-training-config.txt: configuration file for the classifier generator for the ranker
config.txt: configuration file for the system; not many options have been put in it for the moment
download/: contains the original data for the competition; not included in the repo (others will need to download it for 
themselves)
authors.txt: information about repo authors, obtained from the API
repo_descriptions.txt: descriptions of the repos, obtained from the API


Bugs
====

Apart from the systematic biasing that gives me 10% more on my internal testing than on the official test set, there are 
also a few other bugs that I didn't manage to fix.

The main one is that the candidate source classifiers for the cluster sources (in_cluster_user and in_cluster_repo) 
cause more harm than good, as the probabilizer on the output inverts the label classes.  I suspect that this is due to 
the baseline accuracy being too low; whatever the cause this means that when we reduce the 1000s of possibilities down 
to 100, we tend to take 100 *really bad* ones.  For this reason, I had to undo most of the improvements that I had made 
to this; it should be possible to do much better once this problem has been sorted out.  It's also possible that the 
problems are due to some kind of systematic biasing.  (UPDATE: it's due to a bug in the optimization of the GLZ 
classifier, which led to each label having a random bias added to it.  Fixing it however doesn't make that much of a 
difference; the in_cluster predictor is still not specific enough with the features it has available).


Thoughts on the Contest
=======================

Firstly, thanks to GitHub for running the contest.

What I liked about it:
* It was nice to have a contest with a limited scope and duration; I avoided NetFlix precisely because I knew that it 
would become an obsession for several years, and real life (especially with a 2 1/2 year old) makes it difficult to deal 
with.  A 3 week contest (I spent about this time from go to whoa) on the other hand felt much more focused and was more 
fun.
* It was nice to have a real dataset.  I hope that the held-out data will also be published.
* It was nice to have a reasonable amount of data: not so much that one needed to spend a lot of time on data structures 
to fit it in memory, but not so little that there wasn't any rich structure to be found.
* It was actually good to have some strong heuristics that could provide decent results without too much work.  It meant 
that people actually concentrated on understanding the data early on, rather than blindly applying the Machine Learning 
technique du jour.

Now for the things that were unfortunate.  These are mostly obvious in hindsight, but it's worth enumerating them 
anyhow.

* Possibilities to hill-climb.  NetFlix avoided this by requiring a submission of a large set of predictions, the score 
on only 1/2 of which was given when a submission was made.  The score on the other 1/2 was kept by NetFlix to evaluate 
the grand prize, which meant that someone who had overfitted would have a gap between their public scores and the 
internal NetFlix one.  Indeed, it seems that the team on top of the leaderboard had overfitted more than the second 
team, which lead to the second placed team being considered for the grand prize.  In GitHub, however, it would be 
possible to perform hill-climbing (overfitting) without any generalization improvement.  Unfortunately, looking at other 
submissions there seems to have been a degree of this happening.
* Possibility to take other people's result files.  Blending is a valid approach and fine by me, but this means that the 
highest results will probably be useless in terms of creating a recommendation engine.  It would have been nice to have 
either a separate category or to have required co-operation rather than simply taking and using the top 20 on the 
leaderboard.
* Possibility to model the dataset.  See above (Side Channel Information) for my description of how I modeled the user 
ID/repo ID generation process.  This part of my solution, while fun, was also completely irrelevant to the task of 
creating a good recommendation engine.
* If there had have been a way to reduce the effect of the heuristics on the final score.  One way to do this (based 
upon the Boosting algorithm) would have been to keep track of a weight for each of the users, which would have increased 
as people submitted results that got that user wrong and decreased as people submitted results that got the user right.  
Providing a weighted score like this would have encouraged systems that provided diverse results (not the same as the 
other systems).
* Probably, in hindsight, anyone who wants to run a competition like this without a lot of experience should give 
themself a week in which they're allowed to change the rules, and then do so based upon what people come up with that's 
not in the hoped direction.

Things that would have been nice:
* To have included two held-out sets, so that blending algorithms could have been trained on one of them (rather than 
using the blind weighted approaches that seem to have been favored).
* An official discussion forum or mailing list.
* Publication of the code that was used to split the training and testing samples so that we could produce other 
held-out sets that would have had comparable performance.
* It was a bit of a shame to see that hardly anyone tried any of the advanced techniques out there (I would have liked 
to try a deep siamese neural network and better matrix factorization models).  I guess that this is a combination of a) 
burnout from those who participated in NetFlix, b) the short timeframe and c) the relative power of heuristic 
techniques.

Credits
=======

Thanks to:
* Daniel Haran's code and README for the idea of using parent repositories, and him personally for discussions of the 
solutions
* asciiarmor's code (thanks for publishing it) for the basic features necessary to get to reasonable performance
* xlvector's code for the idea of clustering by cooccurrences (now why didn't I think of that!) and the man himself for 
the idea of increasing the diversity of the predictions


Todo
====

Future things that I will or may try:

* Use a classifier instead of a heuristic to rank the set of potential repositories (this will help us where the user 
has a lot of watched repositories) (done)
* Look at language matches (done)
* Look at keywords in the repository names (eg, "rails" says a lot about which kind of repos that people like) (done)
* Add a simple knn implementation based upon a dot product between repos to find new ones (will need pre-computation is 
it will be SLLOOOOWWWW!) (done).  Only for cluster members.  Didn't provide that much of an improvement; we need a 
better matrix factorization.
* EM on a graphical model to learn a smoothed user/repo matrix (done)
* Embed both repositories and users into the same feature space using a deep neural network and use cosine similarity 
between embedded representations to learn similarity.
* Potentially, other (simpler) matrix factorization models (done)
* Try with no probabilizer on the classifier(s) (done)
* Fix boosting/decision tree bug causing us to stick on language_cosine < 1 (done)
* Find where dataset is biased between fake/real test sets
* Do some reasonable processing of the keywords file
* Multithreading to produce the results file and training (done)
* Perform a SVD of the repo descriptions to smooth them
* Diversity: don't allow all of the predictions to come from the same set.  Penalize those that are too similar to the 
previous source.
* Cheating: hill-climb (using a Nelder-Mead or similar multivariate optimization technique).  Guaranteed to win, but not 
at all useful.
* Why is repo.author == -1 in some cases?
* Find and use collaborators (two users that watch each other's repositories)
* SVD on the keywords to smooth
* More exploration!  I had time to implement a bunch of ideas, but not to follow them through to check that they worked 
as expected.
* Split the by_this_author rule from by_same_author to allow more precision (done).  Small improvement.

Jeremy Barnes
28 August 2009