

I wanted to build something in Scala, but didn't know what people typically use Scala for. GitHub tags each repository with its primary language, so I had the idea to cluster programming languages based on keywords in repositories' READMEs, and to do it in Scala. The project has two core components: extraction and analysis. It's a bit of a mess, because I started out in Scala and switched to Python when I began working on the analysis component.


  1. I created a Scala wrapper for the GitHub API.
  2. Using Slick, I wrote an incremental extraction system to fetch results from the API.
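To illustrate what "incremental" can mean here, below is a minimal sketch in Python rather than the Scala/Slick code this project actually uses. It assumes the API is paged by a "last seen id" cursor (GitHub's public repository listing accepts a `since` parameter of this kind). The table layout, the `fetch_page` callback, and the fake data are all hypothetical and exist only to show the resume logic without making network calls.

```python
import sqlite3


def extract_incrementally(conn, fetch_page, max_pages=10):
    """Fetch pages of repositories, resuming after the highest id already stored.

    `fetch_page(since)` is a hypothetical callback that returns a list of dicts
    with "id" and "full_name" keys for repositories whose id is greater than `since`.
    Returns the number of rows fetched during this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS repos (id INTEGER PRIMARY KEY, full_name TEXT)"
    )
    fetched = 0
    for _ in range(max_pages):
        # The cursor is derived from stored data, so a restarted run picks up
        # where the previous one stopped (0 on the very first run).
        since = conn.execute("SELECT COALESCE(MAX(id), 0) FROM repos").fetchone()[0]
        page = fetch_page(since)
        if not page:
            break
        with conn:  # commit each page as one transaction
            conn.executemany(
                "INSERT OR IGNORE INTO repos (id, full_name) VALUES (?, ?)",
                [(r["id"], r["full_name"]) for r in page],
            )
        fetched += len(page)
    return fetched


# Stand-in for the real API: seven fake repositories, served three at a time.
FAKE_REPOS = [{"id": i, "full_name": f"user/repo{i}"} for i in range(1, 8)]


def fake_fetch(since, per_page=3):
    return [r for r in FAKE_REPOS if r["id"] > since][:per_page]
```

Calling `extract_incrementally(sqlite3.connect("repos.db"), fake_fetch, max_pages=1)` repeatedly stores one page per call and never re-fetches what is already in the table. That is the property that matters when working under an API rate limit.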


  1. I made a simple Tokenizer that used the Java Wiktionary Library to filter words by part of speech.
  2. I implemented the TextRank algorithm to extract keywords. This algorithm proved to be much slower than tf-idf, so I ended up not using it.
  3. At this point, I decided to switch to Python to take advantage of the libraries it already had. I used nltk for tokenizing, and scikit-learn to vectorize repositories with tf-idf and then cluster them using k-means. I played around with the parameters to improve the clustering results. Below is a sample, listing the top ten stems for each cluster along with the number of repositories for each language in each cluster. Some clusters were a lot more defined than others (e.g. the last cluster is clearly about machine learning).

[Figure: Clustering results on GitHub repository data]
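For readers unfamiliar with the TextRank step mentioned in the list above, here is a minimal sketch of the idea in plain Python. It is not the Scala implementation from this repository. The regex tokenizer, the window size, the damping factor, and the fixed iteration count are all assumptions chosen for brevity. Words that occur near each other are linked in a graph, and PageRank-style scores are computed over that graph.

```python
import re
from collections import defaultdict
from itertools import combinations


def textrank_keywords(text, top_n=5, window=4, damping=0.85, iterations=50,
                      stopwords=frozenset()):
    """Return up to `top_n` words ranked by a TextRank-style score."""
    # A simple regex stands in for a real tokenizer and part-of-speech filter.
    words = [w for w in re.findall(r"[a-z][a-z+#-]*", text.lower())
             if w not in stopwords]

    # Undirected co-occurrence graph: two distinct words are linked if they
    # appear together inside a sliding window of `window` tokens.
    neighbors = defaultdict(set)
    for i in range(len(words)):
        for a, b in combinations(set(words[i:i + window]), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)

    # PageRank by power iteration: each word shares its score equally
    # among its neighbors on every round.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Each document needs its own graph and its own iterative ranking pass, whereas tf-idf only needs term counts. That is one plausible reason for the speed difference noted above.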