Vietnamese news classification project for a Data Mining course. Visit the project webpage:

Vietnamese Text Classification Using TF-IDF and Naive Bayes

Built with

  • Java
  • IntelliJ IDEA 2017

Preprocessing

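  • Use regular expressions to strip unwanted characters and the JSON field wrappers from each record (pattern -> replacement; null means the match is deleted)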
\\"|\\r|\\t|\\n -> null
{"type":".+","title":"(.+)","content":" -> $1 + 1 space('$1 ')
","url":".+ -> null
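The three replacements above can be sketched in Java roughly as follows. This is a minimal illustration, not the project's actual preprocessing code: the class name `PreprocessSketch`, the method `clean`, and the sample record are all hypothetical. It also assumes that each input line is one raw JSON record, and that `\\"`, `\\r`, `\\t`, `\\n` are meant as regular expressions matching a literal backslash followed by the given character.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the three regex replacements listed in this README.
class PreprocessSketch {
    // Regex \\"|\\r|\\t|\\n : a literal backslash followed by ", r, t or n
    // (assumption: these are escape sequences as they appear in raw JSON text).
    private static final Pattern ESCAPES =
            Pattern.compile("\\\\\"|\\\\r|\\\\t|\\\\n");

    // Leading JSON wrapper; group 1 captures the title.
    // The opening brace is escaped because Java rejects a bare '{' in a pattern.
    private static final Pattern HEAD =
            Pattern.compile("\\{\"type\":\".+\",\"title\":\"(.+)\",\"content\":\"");

    // Trailing wrapper: everything from the url field to the end of the line.
    private static final Pattern TAIL = Pattern.compile("\",\"url\":\".+");

    static String clean(String line) {
        String s = ESCAPES.matcher(line).replaceAll("");   // -> null (deleted)
        s = HEAD.matcher(s).replaceAll("$1 ");             // -> title + one space
        s = TAIL.matcher(s).replaceAll("");                // -> null (deleted)
        return s;
    }

    public static void main(String[] args) {
        // Made-up record; "\\n" here is a backslash followed by 'n' in the data.
        String raw = "{\"type\":\"sport\",\"title\":\"Tin A\","
                + "\"content\":\"Dong 1.\\nDong 2.\",\"url\":\"http://example.com/a\"}";
        System.out.println(clean(raw)); // prints: Tin A Dong 1.Dong 2.
    }
}
```

Because the escape sequences are deleted rather than replaced by a space, text on either side of a removed `\n` is joined directly, as the sample output shows.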

Directory structure

NewsClassifySystem(root)
--- data
    --- data/dictionary
    --- data/pre1
    --- data/pre2
    --- data/test
--- dictionary
--- models
--- src
    --- src/main/java/com/classify/crawler
    --- src/main/java/com/classify/dictionary
    --- src/main/java/com/classify/preprocess
--- uetsegmenter.jar
--- vnstopword.txt

How to run the test

  • Set up a Java environment
  • Clone or download the source code
  • Edit the configuration in IConfig.java (com.classify.dictionary), in particular MAX_NUMBER_OF_NEW
  • Run com.classify.dictionary.Runner to train the model
  • Run com.classify.test.Checker to test the model
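For readers unfamiliar with interface-based configuration in Java, the configuration step might look something like the sketch below. Only the constant name MAX_NUMBER_OF_NEW comes from this README; the interface name `IConfigSketch`, the value 1000, and the described meaning of the constant are placeholders and assumptions, and the real IConfig.java may contain other settings.

```java
// Hypothetical sketch of an interface used as a constants holder.
// Fields declared in a Java interface are implicitly public static final.
interface IConfigSketch {
    // Placeholder value (assumption): presumably the maximum number of
    // news items to use; adjust it in the real IConfig.java before training.
    int MAX_NUMBER_OF_NEW = 1000;
}

class ConfigDemo {
    public static void main(String[] args) {
        System.out.println("MAX_NUMBER_OF_NEW = " + IConfigSketch.MAX_NUMBER_OF_NEW);
    }
}
```

After changing the constant, the classes that reference it need to be recompiled, since the Java compiler inlines compile-time constants at their use sites.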

Prepare data

Bugs

Features

  • The all_news list can have a size smaller than TOTAL_NEWS: some news items contain no useful words after they are split
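One way this can happen is sketched below, under the assumption that "useful words" means tokens that survive stop-word filtering and that items with no surviving tokens are skipped. The class `EmptyNewsSketch`, its methods, and the tiny stop-word set are hypothetical; the project presumably loads its stop words from vnstopword.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: items with no useful words are dropped, so the
// resulting list can be shorter than the number of input news items.
class EmptyNewsSketch {
    // Made-up stop words, written without diacritics to keep the source ASCII.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("va", "la", "cua"));

    // Split on whitespace and keep only tokens that are not stop words.
    static List<String> usefulWords(String news) {
        List<String> words = new ArrayList<>();
        for (String token : news.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token.toLowerCase())) {
                words.add(token);
            }
        }
        return words;
    }

    // Collect the token lists, skipping items that end up empty.
    static List<List<String>> collect(List<String> rawNews) {
        List<List<String>> allNews = new ArrayList<>();
        for (String news : rawNews) {
            List<String> words = usefulWords(news);
            if (!words.isEmpty()) {
                allNews.add(words);
            }
        }
        return allNews;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("bong da va the thao", "va la cua", "kinh te");
        List<List<String>> allNews = collect(raw);
        // prints: 2 of 3 news items kept
        System.out.println(allNews.size() + " of " + raw.size() + " news items kept");
    }
}
```

Code that indexes all_news by position should therefore rely on the list's actual size rather than on TOTAL_NEWS.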