An application to extract skills from personal communication data by making use of the Stack Exchange dataset

Extracting Skills from Personal Communication Data using StackExchange Dataset

This project shows how to use the publicly available Stack Exchange data dump to extract skills from personal communication data. It was implemented on an OpenStack Linux platform.

Downloading the dataset

First, download the Stack Exchange dataset from here. There are many Stack Exchange websites, such as stackoverflow, cs, datascience, physics and history. You can download only the compressed files you need, or the entire dump using torrents. More information about downloading torrent files from the command line can be found here. Each 7z file corresponds to one Stack Exchange website. Since we are interested only in technical websites, delete the 7z files for websites such as japanese.stackexchange and spanish.stackexchange. After downloading, extract the 7z files (this can be done in one script).
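The extraction step can be scripted along the following lines. This is a minimal sketch, not the project's own script: it assumes the `7z` command-line tool is installed and on the PATH, that the archives sit in one directory, and that non-technical sites can be recognised by a filename prefix. The names `SKIP_PREFIXES`, `plan_extractions` and `extract_all` are hypothetical. Instead of deleting the unwanted archives, the sketch simply skips them.

```python
import subprocess
from pathlib import Path

# Hypothetical skip list: filename prefixes of the non-technical sites
# mentioned in the text. Extend it to match the archives you downloaded.
SKIP_PREFIXES = ("japanese.", "spanish.")


def plan_extractions(dump_dir, skip_prefixes=SKIP_PREFIXES):
    """Return (archive, output_dir) pairs for each .7z file to extract."""
    plans = []
    for archive in sorted(Path(dump_dir).glob("*.7z")):
        if archive.name.startswith(tuple(skip_prefixes)):
            continue  # non-technical site, leave it packed
        # cs.stackexchange.com.7z -> cs.stackexchange.com/
        plans.append((archive, archive.with_suffix("")))
    return plans


def extract_all(dump_dir):
    """Extract every planned archive into its own folder using the 7z CLI."""
    for archive, out_dir in plan_extractions(dump_dir):
        # "7z x" extracts with full paths, "-o<dir>" sets the output
        # directory (no space after -o), "-y" answers yes to prompts.
        subprocess.run(
            ["7z", "x", str(archive), f"-o{out_dir}", "-y"],
            check=True,
        )
```

Calling `extract_all(".")` from the download directory would then produce one folder per website, which is the layout the indexing step below expects.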

Building the Knowledge Base

In this project, we implemented a K-NN multi-label classification model using Lucene. A very nice explanation of setting up PyLucene is given here. To build this search engine, we first index all the posts with two fields, 'text' and 'tags'. Indexing runs over all the extracted folders and writes into a single index on the file system. Run the program in the directory where all the extracted folders are present. This may take several hours to complete, but it only needs to be run once.


A folder called index is created.
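Before a post can be indexed, its 'text' and 'tags' values have to be pulled out of the dump. The sketch below shows one way this could be done with the standard library only. It is an illustration rather than the project's indexer, and it rests on several assumptions: each extracted folder contains a `Posts.xml` file whose `<row>` elements carry `PostTypeId`, `Title`, `Body` and `Tags` attributes, tags are stored as `<python><pandas>`, and only questions (`PostTypeId="1"`) carry tags. The function names are hypothetical, and the PyLucene indexing call itself is deliberately left out.

```python
import html
import re
import xml.etree.ElementTree as ET

HTML_TAG_RE = re.compile(r"<[^>]+>")


def split_tags(raw):
    """Turn '<python><pandas>' into ['python', 'pandas'] (assumed format)."""
    return re.findall(r"<([^<>]+)>", raw)


def strip_html(body):
    """Drop HTML markup from a post body and collapse whitespace."""
    plain = html.unescape(HTML_TAG_RE.sub(" ", body))
    return " ".join(plain.split())


def iter_questions(posts_xml):
    """Yield {'text': ..., 'tags': [...]} for each question in Posts.xml."""
    # iterparse streams the file, so large dumps are not loaded at once.
    for _, elem in ET.iterparse(posts_xml, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            title = elem.get("Title", "")
            body = strip_html(elem.get("Body", ""))
            yield {
                "text": (title + " " + body).strip(),
                "tags": split_tags(elem.get("Tags", "")),
            }
        elem.clear()  # release the attributes of rows already processed
```

Each yielded dictionary maps directly onto the two index fields, so an indexer could loop over `iter_questions(...)` for every extracted folder and add one document per question.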

Extracting the skills

After building the knowledge base, the next step is to use searcher_text() to extract the skills. The index folder is used for the lookup. Open Python and run:

In [1]: from searcher import *
In [2]: searcher_text('''I want to get a list of the column headers from a pandas DataFrame. The DataFrame will come from user input so I won't know how many columns there will be or what they will be called.''')

 u'python 3.x',
 u'python 2.7']
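The K-NN multi-label idea behind such a result can be sketched as follows: retrieve the k posts most similar to the input text, then let each retrieved post vote for its own tags. The README does not say how searcher_text() combines the tags of the retrieved posts, so the score-weighted vote below is an assumption, and `rank_tags` is a hypothetical helper rather than part of the project.

```python
from collections import defaultdict


def rank_tags(neighbors, top_n=5):
    """Rank tags by score-weighted votes from the nearest posts.

    neighbors: list of (score, tags) pairs, e.g. search hits with their
    relevance score and the tag list stored alongside each post.
    """
    votes = defaultdict(float)
    for score, tags in neighbors:
        for tag in tags:
            votes[tag] += score  # closer posts contribute more
    # Highest vote first; ties are broken alphabetically for stable output.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_n]]
```

With hits such as `[(2.0, ["python", "pandas"]), (1.5, ["python", "dataframe"])]`, the tag shared by both posts accumulates the largest vote and is ranked first.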

Testing the application on Apache Spark Mailing Lists Dataset

Download the mbox file from ( Run the file. The output should look like the contents of apache_spark_output.txt.
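For readers who want to feed their own mailing-list archive to the searcher, the sketch below shows one way to read an mbox file with Python's standard `mailbox` module and recover the subject and plain-text body of each message. It is not the project's test script. The helper name `iter_message_texts` is hypothetical, and skipping non-text parts and falling back to UTF-8 are assumptions made for the example.

```python
import mailbox


def iter_message_texts(mbox_path):
    """Yield (subject, body) for every message in an mbox file."""
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            chunks = []
            # walk() visits the message itself and, for multipart mail,
            # every nested part; only plain-text parts are kept.
            for part in msg.walk():
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True)
                if payload is None:
                    continue
                charset = part.get_content_charset() or "utf-8"
                try:
                    chunks.append(payload.decode(charset, errors="replace"))
                except LookupError:
                    # Unknown charset label: fall back to UTF-8 (assumption).
                    chunks.append(payload.decode("utf-8", errors="replace"))
            yield str(msg.get("Subject", "")), "\n".join(chunks).strip()
    finally:
        box.close()
```

Each body could then be passed to searcher_text() in a loop, one message at a time.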