- SOCluster is a tool based on the Sentence-BERT vectorizer for creating intent clusters using a graph-based clustering algorithm.
- The current version clusters StackOverflow questions that do not contain any image, code snippet or table.
- Support for StackOverflow questions containing images or code snippets can be added to SOCluster in the future.
- SOCluster uses intent as a key concept to cluster the StackOverflow questions.
- It uses a graph-based clustering algorithm. (State-of-the-art clustering methods are often based on graphical representations of the relationships among data points [see here])
- It evaluates the clusters on three evaluation metrics (Silhouette coefficient, Calinski-Harabasz index and Davies-Bouldin index) and prints how the clusters are spread over different sizes.
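To make the evaluation step more concrete, here is a minimal pure-Python sketch of one of the three metrics, the Silhouette coefficient, together with a cluster-size spread. It is an illustration only and not the tool's own code: the function names, the use of Euclidean distance and the toy data are assumptions. In practice, scikit-learn provides `silhouette_score`, `calinski_harabasz_score` and `davies_bouldin_score` in `sklearn.metrics` for all three metrics.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points; singleton clusters score 0."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def cluster_size_spread(labels):
    """Map each cluster size to the number of clusters having that size."""
    sizes = Counter(labels).values()
    return dict(Counter(sizes))


# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
print(silhouette_coefficient(toy_points, toy_labels))
print(cluster_size_spread(toy_labels))
```

A value close to 1 indicates compact, well-separated clusters, while values near 0 or below suggest overlapping clusters.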
There are many unanswered questions on StackOverflow. The main reasons behind more than 50% of these are: failing to attract an expert member; being too short or hard to follow; and being a duplicate question.
Developers can use SOCluster to cluster the StackOverflow questions - including both answered and unanswered ones.
These Intent-based clusters can be leveraged to answer unanswered questions using other answered questions in the same cluster.
Also, SOCluster evaluates these clusters to indicate how suitable the selected StackOverflow dataset is for our intended goal of Automatic Question Answering.
SOCluster can be divided into three main steps as shown:
- Dataset Generation and Pre-processing :-
  - Data Dump - We downloaded SO post data from the Stack Exchange data dump archives [link].
  - Pre-processing - We filtered and pre-processed the database, ignoring questions that contained images, code snippets, tables, etc.
  - Feature Vectorization - We used Sentence-BERT to generate 768-dimensional feature vectors.
- Graph Construction :-
  - Similarity Index - SOCluster uses cosine similarity as its metric to calculate the similarity between two vectors.
  - Graph Generation - It creates a weighted undirected graph using the feature vectors as nodes and the cosine similarity between them as the edge weights.
- Clustering :- In this step, SOCluster uses a graph-based clustering algorithm that takes the weighted undirected graph as input and produces a set of clusters as output. It uses a threshold similarity as a parameter to invalidate edges whose weight is less than the given threshold.
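The graph construction and clustering steps above can be sketched roughly as follows. This is a simplified illustration, not the code in graph_clustering.py: the function names are hypothetical, the short toy vectors stand in for the 768-dimensional Sentence-BERT embeddings, and treating each connected component of the thresholded graph as a cluster is an assumption, since the exact clustering algorithm is not described here.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_graph(vectors):
    """Weighted undirected graph stored as {(i, j): similarity} with i < j."""
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


def cluster_by_threshold(num_nodes, edges, threshold):
    """Ignore edges below the threshold, then return connected components."""
    adjacency = {node: set() for node in range(num_nodes)}
    for (i, j), weight in edges.items():
        if weight >= threshold:  # edges with weight < threshold are invalidated
            adjacency[i].add(j)
            adjacency[j].add(i)

    seen, clusters = set(), []
    for start in range(num_nodes):
        if start in seen:
            continue
        # Depth-first traversal collects one component at a time.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Hypothetical 2-D vectors standing in for question embeddings.
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
graph = build_graph(toy_vectors)
print(cluster_by_threshold(len(toy_vectors), graph, threshold=0.65))
# prints [[0, 1], [2, 3]]
```

Raising the threshold removes more edges and tends to produce more, smaller clusters. Lowering it merges questions with weaker similarity into the same cluster.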
Here each image represents questions clustered by SOCluster in the same cluster. Notice the similarity in the intents of the questions clustered together...
Inside the data directory, the "database_script.sql" file contains the code to handle the StackOverflow data dump and create well-organized SQL tables.
Inside the clustering directory, the "graph_clustering.py" script contains the source code of the SOCluster tool.
The result_script.sh file is a bash script that can be used to reproduce the experiment done in the paper.
- Download this repository to your local machine.
- Unzip the folder and extract it to a location of your choice on your PC.
- Also, download the StackOverflow data dump [link] (only the Posts zip file) from the Stack Exchange archives to your PC and extract it.
- Inside the SOCluster repository, open the data/database_script.sql file. There, provide the local path to the Posts.xml data dump file at the appropriate location (at the end of the file).
- Run the database_script.sql file.
- Now, open the clustering/graph_clustering.py file and provide your MySQL user credentials (username and password) at the required place.
- Run the graph_clustering.py file using the command:
  python3 graph_clustering.py -n NUMBER_OF_QUES -t THRESHOLD_SIMILARITY --tag-list="[TAG1,TAG2,..]"
  A sample command would be:
  python3 graph_clustering.py -n 10000 -t 0.65 --tag-list="[javascript,python]"
- The clusters will be printed on your screen.
- You can also run the result_script.sh bash file to repeat the experiment done in the paper.
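To clarify what the command-line flags above mean, the sketch below shows one way such arguments could be parsed with Python's standard argparse module. It is a hypothetical illustration rather than the parsing code used in graph_clustering.py: the helper names, the destination names (`num_questions`, `threshold`, `tags`) and the help texts are assumptions.

```python
import argparse


def parse_tag_list(raw):
    """Turn a string like "[javascript,python]" into ["javascript", "python"]."""
    inner = raw.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    return [tag.strip() for tag in inner.split(",") if tag.strip()]


def build_parser():
    """Hypothetical parser mirroring the -n, -t and --tag-list flags."""
    parser = argparse.ArgumentParser(prog="graph_clustering.py")
    parser.add_argument("-n", dest="num_questions", type=int, required=True,
                        help="number of questions to cluster")
    parser.add_argument("-t", dest="threshold", type=float, required=True,
                        help="threshold similarity for keeping an edge")
    parser.add_argument("--tag-list", dest="tags", type=parse_tag_list,
                        default=[], help="tags to select, e.g. [javascript,python]")
    return parser


# Parse the sample command's arguments without touching sys.argv.
args = build_parser().parse_args(
    ["-n", "10000", "-t", "0.65", "--tag-list=[javascript,python]"]
)
print(args.num_questions, args.threshold, args.tags)
```

Quoting the tag list on the shell, as in the sample command, keeps the square brackets from being interpreted as a glob pattern.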
You can find the walkthrough of the tool here
We will be very happy to receive any kind of contribution. In case of a bug, an enhancement idea or a feature improvement idea, please open an issue or a pull request. In case of any queries, or if you would like to give any suggestions, please feel free to contact Abhishek Kumar (cs17b002@iittp.ac.in), Deep Ghadiyali (cs17b011@iittp.ac.in) or Sridhar Chimalakonda (ch@iittp.ac.in) of RISHA Lab, IIT Tirupati, India.