Today we’re announcing our 2009 GitHub Contest. Since the Netflix prize is now over, we figured you guys needed something to do. Here is your chance to contribute to the open source canon, make GitHub better, and possibly win two of the best prizes probably ever offered by a contest: a bottle of Pappy Van Winkle and a large GitHub account for life! We would estimate the value here, but, honestly, they’re priceless. Also, hopefully you’ll have some fun.
So, the problem is this: when you log into GitHub, we want to recommend repositories that you’ll love. How do we find the perfect projects for you? I wanted to just look at networks of what people were watching and figure out what you might like from what your friends liked. In researching collaborative filtering and recommendation systems papers, I found, oddly, little that is really helpful for this sort of problem, and very little open source code. Most papers I found online (for free, because I’m cheap – why aren’t all academic papers free and open, btw?) are based on explicit rating systems (like the Netflix prize – figuring out what you would rate something on a 1-X scale based on previous ratings), not item-based collaborative filters for binary implicit voting (like recommending new items based on past purchasing history), which seems way more useful to most websites to me.
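To make "item-based collaborative filtering for binary implicit voting" a bit more concrete, here is a minimal sketch of one common approach: count how often two repositories are watched by the same user, then suggest the repositories that co-occur most with what a user already watches. This is only an illustration, not how GitHub does or will do it; the toy `watches` data and the names `build_cooccurrence` and `recommend` are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(watches):
    """Count, for each pair of repos, how many users watch both."""
    co = defaultdict(lambda: defaultdict(int))
    for repos in watches.values():
        for a, b in combinations(sorted(repos), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(user, watches, co, top_n=3):
    """Rank unwatched repos by summed co-occurrence with the user's watches."""
    seen = watches.get(user, set())
    scores = defaultdict(int)
    for repo in seen:
        for other, count in co[repo].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken by repo id so the output is stable.
    ranked = sorted(scores, key=lambda r: (-scores[r], r))
    return ranked[:top_n]


# Hypothetical toy data: user id -> set of watched repo ids.
watches = {
    1: {10, 20, 30},
    2: {10, 20},
    3: {20, 30, 40},
    4: {10, 40},
}
co = build_cooccurrence(watches)
print(recommend(2, watches, co))
```

A watch is treated as a plain yes/no signal here, with no rating attached. Raw co-occurrence counts tend to favor very popular repositories, so a real entry would probably want some normalization on top of this.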
Anyhow, so we figured perhaps you can do this better than we can. I extracted a dataset of all the repository watches in our database – close to half a million – and withheld a sample of them. I then created a test file listing the users I held watches back from. If you can write a program to analyze our dataset and best guess the watches we held back, you win our amazing prizes.
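If you want to check your approach locally before pushing, one option is to repeat the same trick on the data you do have: hold back a watch from some users, guess, and count hits. The sketch below assumes a lot, so treat it as a starting point only: the in-memory `watches` dict, the one-held-watch-per-user rule, and the hit-rate `score` function are all assumptions for illustration, not the contest's actual sampling or scoring method.

```python
import random


def hold_out(watches, fraction=0.2, seed=0):
    """Hold back one watch from a random subset of users.

    Assumption: users with a single watch are never sampled, so every
    test user still appears in the training data.
    """
    rng = random.Random(seed)
    train, held = {}, {}
    for user, repos in watches.items():
        ordered = sorted(repos)
        train[user] = set(ordered)
        if len(ordered) > 1 and rng.random() < fraction:
            pick = rng.choice(ordered)
            train[user].discard(pick)
            held[user] = pick
    return train, held


def score(guesses, held):
    """Fraction of held-back watches found in each user's guess list."""
    if not held:
        return 0.0
    hits = sum(1 for user, repo in held.items() if repo in guesses.get(user, []))
    return hits / len(held)


# Hypothetical toy data: user id -> set of watched repo ids.
watches = {
    1: {10, 20, 30},
    2: {10, 20},
    3: {20, 30, 40},
    4: {40},
}
train, held = hold_out(watches, fraction=1.0)
perfect = {user: [repo] for user, repo in held.items()}
print(score(perfect, held), score({}, held))
```

Fixing the `seed` keeps the split reproducible, which makes it easier to compare two versions of a guesser against the same held-back watches.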

To enter the contest, check out our contest website. Basically you just put your guesses into a file named ‘results.txt’ and push it to a public GitHub project that has “http://contest.github.com” as a post-receive hook. On each push, our site will see if you’ve changed your ‘results.txt’ file, then download and score it if you have. At the end of the contest, your source code has to be released under an OSI-compatible license so nobody ever has to worry about this problem again. Whoever has the highest score at noon PST on Aug 30, 2009 wins. Good luck!



This is seriously awesome
Awesome idea.
what about a list of forks?
@ssayer - what? i don't understand the question.
odoyle rulez
nice! will defo have a think about it.
Does the code used to generate the results have to be in the repository before the end of the contest?
@bcochran - it doesn't have to, but we would prefer, in the spirit of openness, that you just use a "nobody can use this until Aug 30" license so that people can't copy you and we don't have to try to hunt down people that didn't win after the contest ends to ask them to upload their code.
for the curious, the reason it ends on Aug 30 is not because I didn't know that August has 31 days in it, but because my birthday is Aug 31 and I didn't want to have to work that day. :)
awesome!
Forking a repo automatically sets you to watch the forkee, apparently. Knowing whether a watched repo is also a repo the watching user has forked would be useful.
You haven't told us how many repository watches were withheld, unless I've missed something. I assume this is intentional.
That notwithstanding, can you give us any information that might help us construct our own test sets from the known data?
hey, just read half of it and i didn't finish..here's my idea, why not match suggestions against watched repos, repo names and feed keywords..then, search existing repos, and find a good match..then, list them out to a user..that sounds fun..like google adwords or something..:)
Oh, I already did beat schacon :)
I wish the contest were run with fully open algorithms. Anyone who can improve someone else's code deserves to climb up the score.
oh btw, if you run your scripts and find a buggy data set after a few minutes, it helps to move that one's id to the top.
You guys are just awesome!
@schacon - any chance you would create a wiki page or something that lists the research papers that you found interesting? Then maybe others could help add to the list as they found additional material? That would speed up the googling process for us all.
PAPPY VAN KICKS ASS!
This looks seriously awesome!
AWESOME! re-posted on http://news.ycombinator.com/item?id=731933
I thought this was a contest to make the site faster and got really excited. I'd find more projects I love if I could get to their code quicker ;-)
Is there a place to report errors in the dataset?
User 26734 is asked for in the test file, line 3215, but doesn't exist in the file data.txt. I'm guessing the user had only one watched repo, so when it was removed from the training set, there was no longer a mention of the user.
Here's another bug/quirk in the dataset: in "lang.txt", sometimes the line counts get repeated, e.g.
Considering that a user's interests may change over time, would it be possible to get dates on the watches?
Dates would be good, as well as userid:username mappings and userid:userid follow relationships. I'm more likely to follow a repo if it's by someone I already follow.
This is a great idea and thank you very much for posting the data and making the contest.
Sorry if I'm a bit dense or just missed it, but what's the scoring function used? Is it just a simple count of "number correct" / "total number" or do you punish for bad guesses?
The reason why I ask is because, if you don't give a negative score for bad guesses, the scoring system would be ripe for abuse...
It's very sad to see that many people do not publish their code, as this contest is about empowering a community and the spirit of open source.
To only require the winner's code to be released once the contest has ended is a major mistake, unless GitHub is more interested in the final code than in letting its community watch and learn from such a contest.
I certainly would have enjoyed watching the attempts of many teams over a month, and seeing mostly empty repositories (other than the results file) is very frustrating.
I don't share code, because I don't have any :)
My #2 entry is mostly piggy-backing on xlvector's results. We co-operate, and I made that clear in my README.
Also, I've updated my repo with some utilities, hoping others will find them useful.
I do agree with maia. It should be mandatory to share the code.
What UTC time does the contest end?