A Comparison between rule-based and CRF-based Tokenizer for Posts from Stack Overflow

CZ4045 Natural Language Processing Project


This project serves as partial fulfillment of the course CZ4045: Natural Language Processing at Nanyang Technological University, Singapore.

We developed two tokenizers for Stack Overflow posts: one based on regular expressions, the other on Conditional Random Fields (CRF).

The difficulty in tokenizing Stack Overflow data is that its content is highly unstructured, comprising both English text and code snippets. The tokenizers designed and developed by the team are capable of

1. tokenizing code sections into smaller meaningful units,
2. identifying irregular named entities such as "Albert Einstein", and
3. identifying file paths like "src/main/resources",

which greatly improves the accuracy of tokenization and thus enhances the performance of further analysis.
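For illustration, a minimal regex-based tokenizer in this spirit might look like the sketch below. The pattern names and their ordering are our own illustration of the idea (match the longest special units first), not the project's actual rules.

```python
import re

# Alternatives are tried in order, so multi-character units (paths,
# multi-word names) are matched before plain words and punctuation.
TOKEN_PATTERN = re.compile(r"""
      [A-Za-z_]\w*(?:/[\w.]+)+        # file paths like src/main/resources
    | [A-Z][a-z]+(?:\s[A-Z][a-z]+)+   # capitalized names like Albert Einstein
    | \w+                             # plain words, identifiers, numbers
    | [^\w\s]                         # single punctuation / operator characters
""", re.VERBOSE)

def tokenize(text):
    """Return tokens matched left-to-right; whitespace is skipped."""
    return TOKEN_PATTERN.findall(text)
```

For example, `tokenize("Albert Einstein wrote x=1")` keeps the name as one token while splitting the code fragment into `x`, `=`, `1`.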

In the end, our CRF-based tokenizer achieved an F1 score of 0.9483 under 5-fold cross-validation, and the regex-based tokenizer achieved an F1 score of 0.9653.
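The CRF approach casts tokenization as character-level sequence labeling: each character gets a label (B = token begins here, I = inside a token, O = outside any token), and a model such as `sklearn_crfsuite.CRF` is trained on per-character feature dictionaries. The features and labeling helpers below are a hypothetical sketch of that setup, not the project's actual feature set.

```python
def char_features(text, i):
    """Feature dict for the character at position i (illustrative features only)."""
    ch = text[i]
    return {
        "char": ch,
        "is_alnum": ch.isalnum(),
        "is_space": ch.isspace(),
        "is_punct": not ch.isalnum() and not ch.isspace(),
        "prev_char": text[i - 1] if i > 0 else "<BOS>",
        "next_char": text[i + 1] if i < len(text) - 1 else "<EOS>",
    }

def bio_labels(text, tokens):
    """Label each character: B = token start, I = token interior, O = outside."""
    labels = ["O"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # tokens must appear in order in text
        labels[start] = "B"
        for j in range(start + 1, start + len(tok)):
            labels[j] = "I"
        pos = start + len(tok)
    return labels

# Training would then look roughly like (not executed here):
#   crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
#   X = [[char_features(t, i) for i in range(len(t))] for t in texts]
#   crf.fit(X, [bio_labels(t, toks) for t, toks in zip(texts, gold_tokens)])
```

At prediction time, the labeled character sequence is cut back into tokens at each B label.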


Chen Hailin @Chen-Hailin , Deng Yue @spenceryue97 , Liu Hualin @liuhualin333 , Shi Ziji @stevenshi-23


We have tested our program on Python 3.

Third-party library installation commands (use `pip3 install` if the default `pip` targets Python 2.7):

- BeautifulSoup 4: `pip install bs4`
- matplotlib: `pip install matplotlib`
- nltk: `pip install nltk`
- numpy: `pip install numpy`
- scipy: `pip install scipy`
- scikit-learn: `pip install scikit-learn`
- sklearn: `pip install sklearn`
- sklearn_crfsuite: `pip install sklearn_crfsuite`


Alternatively, install all dependencies at once: `pip install -U -r requirements.txt`

Dataset Download Link

Installation Guide

  1. Install Python 3 and the third-party libraries according to the previous instructions.
  2. Open a Python interpreter by running `python`. Then run the following commands to download the nltk resources: `import nltk`, `nltk.download('stopwords')`, `nltk.download('averaged_perceptron_tagger')`. Last, press Ctrl + Z to exit.
  3. Download the datasets from the link given and put them into the Data/ folder.
  4. Navigate to the SourceCode/ folder.
  5. Run the following command to tokenize all sentences in the dataset: python3
  6. Run the following command and follow the program instructions to run the stemmer and POS tagging: python3
  7. Run the following command to compute the top 4 keywords in all question posts: python3

Explanations of data

- all_posts_clean.txt: all question posts with tags removed
- all_answers_clean.txt: all answer posts with tags removed
- posts_training_clean.txt: training data from question posts with tags removed
- answers_training_clean.txt: training data from answer posts with tags removed
- posts_manual_tokenized.txt: all annotated training data from question posts
- answers_manual_tokenized.txt: all annotated training data from answer posts
- all_posts_top_4_keywords.txt: top 4 keywords of all question posts

Explanations of source code

- main application: uses the nltk package for stemming, POS tagging, and section 3.4
- takes a "clean" version of the dataset and tokenises both code and text
- utility functions that can be shared among scripts
- evaluation helper function

Performance Summary

Results on our annotated corpus:

|                 | precision | recall | f1-score |
|-----------------|----------:|-------:|---------:|
| Regex tokenizer |    0.9578 | 0.9729 |   0.9653 |
| CRF tokenizer   |    0.9478 | 0.9490 |   0.9483 |
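Token-level scores like these are typically computed by span matching: a predicted token counts as correct only if its exact character span also appears in the gold tokenization. A minimal sketch of that evaluation (our own illustration, not the project's evaluation script):

```python
def token_spans(text, tokens):
    """Map a token list back to (start, end) character spans in text."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # tokens must appear in order
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def precision_recall_f1(gold_spans, pred_spans):
    """A predicted span is a true positive iff it is exactly a gold span."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, if the gold tokenization of `"a b.c"` is `a`, `b`, `.`, `c` but a tokenizer predicts `a`, `b.c`, only the span of `a` matches, giving precision 0.5 and recall 0.25.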