• The 2009 GitHub Contest

    schacon 29 Jul 2009

    Today we’re announcing our 2009 GitHub Contest. Since the Netflix prize is now over, we figured you guys needed something to do. Here is your chance to contribute to the open source canon, make GitHub better, and possibly win two of the best prizes probably ever offered by a contest: a bottle of Pappy Van Winkle and a large GitHub account for life! We would estimate the value here, but, honestly, they’re priceless. Hopefully you’ll have some fun, too.

    So, the problem: when you log into GitHub, we want to recommend repositories that you’ll love. How do we find the perfect projects for you? I wanted to just look at networks of what people were watching and figure out what you might like by what your friends liked. In researching collaborative filtering and recommendation systems papers, I found little that is really helpful for this sort of problem, oddly, and very little open source code. Most papers I found online (for free, because I’m cheap – why aren’t all academic papers free and open, btw?) are based on explicit rating systems (like the Netflix prize – figuring out what you would rate something on a 1-X scale based on previous ratings), not item-based collaborative filters for binary implicit voting (like recommending new items based on past purchasing history), which seems way more useful to most websites to me.
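
    To make the distinction concrete, here is a minimal sketch of an item-based filter over binary watch data. It is an illustration only, not the code behind any GitHub feature: the `(user_id, repo_id)` pair format, the function names `item_similarity` and `recommend`, and the choice of cosine similarity are all assumptions.

```python
from collections import defaultdict
from math import sqrt


def item_similarity(users_by_repo, a, b):
    """Cosine similarity between two repos, treating each as a binary
    vector of watchers: |A & B| / sqrt(|A| * |B|)."""
    ua = users_by_repo.get(a, set())
    ub = users_by_repo.get(b, set())
    if not ua or not ub:
        return 0.0
    return len(ua & ub) / sqrt(len(ua) * len(ub))


def recommend(watches, user, top_n=10):
    """Rank repos the user does not watch yet.

    watches: iterable of (user_id, repo_id) pairs (hypothetical format).
    A candidate's score is the sum of its similarity to every repo the
    user already watches.
    """
    repos_by_user = defaultdict(set)
    users_by_repo = defaultdict(set)
    for u, r in watches:
        repos_by_user[u].add(r)
        users_by_repo[r].add(u)

    seen = repos_by_user.get(user, set())

    # Only repos that share at least one watcher with a seen repo can
    # have a non-zero score, so restrict the candidates to those.
    candidates = set()
    for r in seen:
        for other in users_by_repo[r]:
            candidates |= repos_by_user[other]
    candidates -= seen

    scores = {
        c: sum(item_similarity(users_by_repo, r, c) for r in seen)
        for c in candidates
    }
    # Highest score first; ties broken by repo id for a stable order.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:top_n]
```

    There are no ratings anywhere in this: a watch is either present or absent, which is what makes the data "binary implicit" rather than an explicit 1-X score.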

    Anyhow, so we figured perhaps you can do this better than we can. I extracted a dataset of all the repository watches in our database – close to half a million – and withheld a sample of them. I then created a test file listing the users I held watches back from. If you can write a program to analyze our dataset and best guess the watches we held back, you win our amazing prizes.
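
    If you want a local test set of your own, you can mimic this kind of split on the data you do have. The sketch below is a guess at the general approach, not the procedure actually used for the contest: `hold_out`, its parameters, and the rule of withholding exactly one watch per sampled user are assumptions.

```python
import random


def hold_out(watches, n_users, seed=0):
    """Withhold one watch from each of n_users randomly chosen users.

    watches: list of (user_id, repo_id) pairs (hypothetical format).
    Returns (training_pairs, held) where held maps user_id -> repo_id.
    Only users with at least two watches are eligible, so every test
    user still appears in the training data.
    """
    rng = random.Random(seed)

    by_user = {}
    for u, r in watches:
        by_user.setdefault(u, []).append(r)

    eligible = sorted(u for u, repos in by_user.items() if len(repos) >= 2)
    chosen = rng.sample(eligible, min(n_users, len(eligible)))

    # Pick one watch per chosen user to hide from the training set.
    held = {u: rng.choice(by_user[u]) for u in chosen}

    training = [(u, r) for u, r in watches if held.get(u) != r]
    return training, held
```

    You can then score your guesses for the users in `held` against the withheld repos before pushing anything.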

    To enter the contest, check out our contest website. Basically you just put your guesses into a file named ‘results.txt’ and push it to a public GitHub project that has “http://contest.github.com” as a post-receive hook. On each push, our site will check whether you’ve changed your ‘results.txt’ file and, if so, download and score it. At the end of the contest, your source code has to be released under an OSI-compatible license so nobody ever has to worry about this problem again. Whoever has the highest score at noon PST on Aug 30, 2009 wins. Good luck!

  • Comments

    foca Wed Jul 29 13:56:03 -0700 2009

    This is seriously awesome

    mdarby Wed Jul 29 13:56:43 -0700 2009

    Awesome idea.

    ssayer Wed Jul 29 14:09:59 -0700 2009

    what about a list of forks?

    schacon Wed Jul 29 14:13:43 -0700 2009

    @ssayer - what? i don't understand the question.

    jfernandez Wed Jul 29 14:14:18 -0700 2009

    odoyle rulez

    derekorgan Wed Jul 29 14:14:20 -0700 2009

    nice! will defo have a think about it.

    bcochran Wed Jul 29 14:28:43 -0700 2009

    Does the code used to generate the results have to be in the repository before the end of the contest?

    schacon Wed Jul 29 14:56:19 -0700 2009

    @bcochran - it doesn't have to, but we would prefer, in the spirit of openness, that you just use a "nobody can use this until Aug 30" license so that people can't copy you and we don't have to try to hunt down people that didn't win after the contest ends to ask them to upload their code.

    schacon Wed Jul 29 15:07:20 -0700 2009

    for the curious, the reason it ends on Aug 30 is not because I didn't know that August has 31 days in it, but because my birthday is Aug 31 and I didn't want to have to work that day. :)

    awesome Wed Jul 29 15:19:57 -0700 2009

    awesome!

    automatthew Wed Jul 29 15:54:33 -0700 2009

    Forking a repo automatically sets you to watch the forkee, apparently. Knowing whether a watched repo is also a repo the watching user has forked would be useful.

    automatthew Wed Jul 29 16:06:48 -0700 2009

    You haven't told us how many repository watches were withheld, unless I've missed something. I assume this is intentional.

    That notwithstanding, can you give us any information that might help us construct our own test sets from the known data?

    jmrocela Wed Jul 29 16:15:18 -0700 2009

    hey, just read half of it and i didnt finish..here's my idea, why not match suggestions against, watched repos, repo names and feed keywords..then, search existing repos, and find a good match..then, list them out to a user..that sounds fun..like google adwords or something..:)

    TomK32 Wed Jul 29 17:06:39 -0700 2009

    Oh, I already did beat schacon :)

    I wish the contest would be with fully open algorithms. Anyone who can improve someone else's code deserves to climb up the score.

    TomK32 Wed Jul 29 17:12:41 -0700 2009

    oh btw, if you run your scripts and find a buggy data set after a few minutes, it helps to move that one's id to the top.

    brupm Wed Jul 29 17:24:34 -0700 2009

    You guys are just awesome!

    charliek Wed Jul 29 19:46:22 -0700 2009

    @schacon - any chance you would create a wiki page or something that lists the research papers that you found interesting? Then maybe others could help add to the list as they found additional material? That would speed up the googling process for us all.

    drummel Wed Jul 29 22:01:56 -0700 2009

    PAPPY VAN KICKS ASS!

    Narnach Thu Jul 30 03:39:26 -0700 2009

    This looks seriously awesome!

    pageman Thu Jul 30 04:04:14 -0700 2009

    mixonic Thu Jul 30 05:44:45 -0700 2009

    I thought this was a contest to make the site faster and got really excited. I'd find more projects I love if I could get to their code quicker ;-)

    IanCal Thu Jul 30 07:36:34 -0700 2009

    Is there a place to report errors in the dataset?

    User 26734 is asked for in the test file, line 3215, but doesn't exist in the file data.txt. I'm guessing the user had only one watched repo, so when it was removed from the training set, there was no longer a mention of the user.

    patperry Thu Jul 30 10:01:54 -0700 2009

    Here's another bug/quirk in the dataset: in "lang.txt", sometimes the line counts get repeated, e.g.

    24792:Shell;2035,FORTRAN;8946,C++;22396,Python;3573971,C;4445864,Shell;2035,FORTRAN;8946,C++;22396,Python;3573971,C;4445864
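
    One way to tolerate the repetition when reading the file is to collect each line into a dictionary, so a repeated language simply overwrites itself. This is only a sketch: it assumes every line has the shape `repo_id:Language;line_count,...` seen above, and `parse_lang_line` is a made-up helper name.

```python
def parse_lang_line(line):
    """Parse one line of the form 'repo_id:Lang;count,Lang;count,...'.

    Returns (repo_id, {language: line_count}). Repeated entries for the
    same language collapse into one key, so duplicated runs are harmless.
    """
    repo_id, _, rest = line.strip().partition(":")
    counts = {}
    for entry in rest.split(","):
        if not entry:
            continue  # skip empty fields, e.g. from a trailing comma
        language, _, lines = entry.partition(";")
        counts[language] = int(lines)
    return int(repo_id), counts
```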

    rich Thu Jul 30 12:31:48 -0700 2009

    Considering that a user's interests may change over time, would it be possible to get dates on the watches?

    danielharan Fri Jul 31 10:25:45 -0700 2009

    Dates would be good, as well as userid:username mappings and userid:userid follow relationships. I'm more likely to follow a repo if it's by someone I already follow.

    dorkusmonkey Sat Aug 01 13:46:46 -0700 2009

    This is a great idea and thank you very much for posting the data and making the contest.

    Sorry if I'm a bit dense or just missed it, but what's the scoring function used? Is it just a simple count of "number correct" / "total number" or do you punish for bad guesses?

    The reason why I ask is because, if you don't give a negative score for bad guesses, the scoring system would be ripe for abuse...

    maia Sun Aug 02 07:05:40 -0700 2009

    It's very sad to see that many people do not publish their code, as this contest is about empowering a community and the spirit of open source.

    To only require the winner code to be released once the contest ended is a major mistake, unless github is more interested in the final code than in letting its community watch and learn from such a contest.

    I certainly would have enjoyed watching the attempts of many teams over a month, and seeing mostly empty repositories (other than the results file) is very frustrating.

    danielharan Mon Aug 03 10:46:47 -0700 2009

    I don't share code, because I don't have any :)

    My #2 entry is mostly piggy-backing on xlvector's results. We co-operate, and I made that clear in my README.

    Also, I've updated my repo with some utilities, hoping others will find them useful.

    micz Thu Aug 06 01:46:56 -0700 2009

    I do agree with maia. It should be mandatory to share the code.

    soult Fri Aug 28 07:31:29 -0700 2009

    What UTC time does the contest end?
