The goal of the Fake News Challenge is to explore how artificial intelligence technologies, particularly machine learning and natural language processing, might be leveraged to combat the fake news problem. Given a headline and the body of an article, the system classifies the body as one of {irrelevant, discusses, agrees, disagrees} with respect to the headline.
- Stemming: Porter stemming
- Stop-word elimination: default English stop-word list
- Tokenization
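The preprocessing steps above can be sketched in plain Java. This is a minimal illustration, not the project's code: the class name `Preprocessor`, the tiny stop-word list, and the order of the steps are assumptions, and the stemmer is passed in as a function so that a real Porter stemmer can be plugged in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical helper: tokenize, drop stop words, then stem each token.
final class Preprocessor {
    // Tiny illustrative stop list (assumption); the project uses a default English list.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "to", "of", "in", "on", "and", "is", "are");

    static List<String> preprocess(String text, UnaryOperator<String> stemmer) {
        List<String> tokens = new ArrayList<>();
        // Lowercase, then split on anything that is not a letter, digit, or apostrophe.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue; // skip empty fragments and stop words
            }
            tokens.add(stemmer.apply(raw)); // a Porter stemmer would be applied here
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Identity "stemmer" keeps the example free of external dependencies.
        System.out.println(preprocess("The minister agrees to travel", UnaryOperator.identity()));
    }
}
```

Passing the stemmer as a parameter keeps the sketch dependency-free; in practice the identity function would be replaced by a Porter stemmer implementation.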
Uses a multi-level classifier made up of 3 classifiers:
- Classifier 1 - classifies whether the body is related to the headline or not
- Classifier 2 - classifies whether the body merely discusses the headline or takes a position on it (either agrees or disagrees)
- Classifier 3 - classifies whether the body agrees or disagrees with the stance taken in the headline
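The three-level decision can be sketched as a simple cascade. This is only an outline of the control flow under stated assumptions: the class `StanceCascade` and its three predicates are hypothetical stand-ins for the trained classifiers, which are not shown here.

```java
import java.util.function.BiPredicate;

// Hypothetical cascade: each predicate takes (headline, body).
final class StanceCascade {
    enum Stance { IRRELEVANT, DISCUSSES, AGREES, DISAGREES }

    private final BiPredicate<String, String> isRelated;     // stand-in for classifier 1
    private final BiPredicate<String, String> isOpinionated; // stand-in for classifier 2
    private final BiPredicate<String, String> agrees;        // stand-in for classifier 3

    StanceCascade(BiPredicate<String, String> isRelated,
                  BiPredicate<String, String> isOpinionated,
                  BiPredicate<String, String> agrees) {
        this.isRelated = isRelated;
        this.isOpinionated = isOpinionated;
        this.agrees = agrees;
    }

    Stance classify(String headline, String body) {
        if (!isRelated.test(headline, body)) {
            return Stance.IRRELEVANT;   // stop at level 1
        }
        if (!isOpinionated.test(headline, body)) {
            return Stance.DISCUSSES;    // related, but takes no position
        }
        return agrees.test(headline, body) ? Stance.AGREES : Stance.DISAGREES;
    }
}
```

Later classifiers only see pairs that earlier ones let through, so each level can be trained on a narrower question.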
- Classifier 1
  - Bag of words (Jaccard's coefficient)
  - Tf-Idf
  - Binary co-occurrence
  - N-grams (bigrams and trigrams)
- Classifier 2
  - Hedge words {e.g. hypothesis, suppose, probably, presumably, suggestive}
  - Word spins {supporting (positive) / refuting (negative) / neutral (nil)}
- Classifier 3
  - Word spins {supporting (positive) / refuting (negative) / neutral (nil)}
  - Supporting words {e.g. surely, evidently, affirmative, agree, support, accept}
  - Refuting words {e.g. against, fake, fraud, hoax, false, deny, doubt, retract}
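Two of the simpler features listed above, the Jaccard coefficient between headline and body tokens and a count of lexicon hits (hedge, supporting, or refuting words), might look roughly like this. The class and method names are hypothetical, and the hedge list holds only the example words given above.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical feature helpers operating on already preprocessed tokens.
final class LexicalFeatures {
    // Example hedge words taken from the list above; the full lexicon is not shown.
    static final Set<String> HEDGE_WORDS =
            Set.of("hypothesis", "suppose", "probably", "presumably", "suggestive");

    // Jaccard coefficient: |A intersect B| / |A union B| over distinct tokens.
    static double jaccard(Collection<String> headline, Collection<String> body) {
        Set<String> union = new HashSet<>(headline);
        union.addAll(body);
        if (union.isEmpty()) {
            return 0.0; // both inputs empty: define the overlap as 0
        }
        Set<String> intersection = new HashSet<>(headline);
        intersection.retainAll(new HashSet<>(body));
        return (double) intersection.size() / union.size();
    }

    // Number of tokens (counting repeats) that appear in the given lexicon.
    static long countHits(Collection<String> tokens, Set<String> lexicon) {
        return tokens.stream().filter(lexicon::contains).count();
    }
}
```

A high Jaccard value suggests the body is related to the headline, while the hit counts feed the later classifiers.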
When we see similar words in both the headline and the body, we know that they are related.
Next, when we notice words like agree or disagree, we know that the body most likely takes a stand on the headline rather than just discussing it.
But we don't yet know which stance it is taking: agree or disagree.
To understand this better, suppose the word 'travel' appears in both the headline and the body. We still don't know what stance the body takes relative to the headline. To find out, we try to establish a little more context by looking at n-grams, i.e., the words appearing before 'travel'.
Looking at bi-grams, we notice the phrase 'agrees to travel'. [the word 'to' would be removed because it's a stop word]
We now know that it adds a positive spin to the word 'travel'.
Looking at tri-grams, the phrase becomes 'doesn't agree to travel'.
This negates the earlier positive spin, so the word now gets a negative spin.
In this way, the feature helps us better understand the stance taken by the body.
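The 'travel' example can be sketched as a small function that looks at the n-1 tokens before a shared word. This is an illustrative assumption about how such a feature could be computed, not the project's implementation: the class `WordSpin`, the three small word sets, and the rule that a negation flips the sign are all hypothetical, and tokens are shown unstemmed for readability.

```java
import java.util.List;
import java.util.Set;

// Hypothetical spin feature: +1 supporting, -1 refuting, 0 neutral.
final class WordSpin {
    // Small illustrative lexicons (assumptions, not the project's lists).
    private static final Set<String> SUPPORTING = Set.of("agree", "agrees", "support", "accept");
    private static final Set<String> REFUTING = Set.of("deny", "doubt", "fake", "hoax");
    private static final Set<String> NEGATIONS = Set.of("not", "no", "never", "don't", "doesn't");

    // Spin of tokens.get(index), judged from the (n - 1) tokens preceding it.
    static int spin(List<String> tokens, int index, int n) {
        int sign = 0;
        boolean negated = false;
        for (int i = Math.max(0, index - (n - 1)); i < index; i++) {
            String token = tokens.get(i);
            if (NEGATIONS.contains(token)) {
                negated = !negated;      // each negation flips the spin
            } else if (SUPPORTING.contains(token)) {
                sign = 1;
            } else if (REFUTING.contains(token)) {
                sign = -1;
            }
        }
        return negated ? -sign : sign;
    }

    public static void main(String[] args) {
        // "doesn't agree to travel" after stop-word removal
        List<String> tokens = List.of("doesn't", "agree", "travel");
        System.out.println(spin(tokens, 2, 2)); // bi-gram window: "agree travel"
        System.out.println(spin(tokens, 2, 3)); // tri-gram window: "doesn't agree travel"
    }
}
```

With the bi-gram window only 'agree' is visible, so the spin is positive; widening to the tri-gram brings in the negation and flips it. This assumes negation words survive stop-word removal.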
- Stanford CoreNLP
- Apache Lucene
- WordNet (extJWNL)
- SVM (LIBLINEAR)
- Apache Commons - CSV Parser
Use Maven to compile and run:
- Compile: `mvn compile`
- Run: `mvn exec:java`
- The FNC score for this dataset lies in the range [2565.5, 7585.25]. My score: 5879.25
- Accuracy: 81.54% on a dataset of 16,000 articles.
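For context on how an FNC score differs from plain accuracy, here is a sketch of the weighted scoring. It assumes the weighting used by the challenge's official scorer as I understand it: 0.25 for getting related vs. unrelated right, and a full 1.0 for an exact match on a related stance. The class `FncScore` is hypothetical, and the label `IRRELEVANT` follows the naming used in this README.

```java
import java.util.List;

// Hypothetical scorer following the assumed FNC weighting.
final class FncScore {
    enum Stance { IRRELEVANT, DISCUSSES, AGREES, DISAGREES }

    static double score(List<Stance> gold, List<Stance> predicted) {
        if (gold.size() != predicted.size()) {
            throw new IllegalArgumentException("gold and predicted must have the same length");
        }
        double total = 0.0;
        for (int i = 0; i < gold.size(); i++) {
            Stance g = gold.get(i);
            Stance p = predicted.get(i);
            boolean goldRelated = g != Stance.IRRELEVANT;
            boolean predRelated = p != Stance.IRRELEVANT;
            if (g == p) {
                total += 0.25;          // exact match
                if (goldRelated) {
                    total += 0.50;      // bonus for the correct related stance
                }
            }
            if (goldRelated && predRelated) {
                total += 0.25;          // both related, even if the stance differs
            }
        }
        return total;
    }
}
```

Under this weighting a correct 'irrelevant' prediction earns 0.25, a correct related stance earns 1.0, and a related pair with the wrong stance still earns 0.25.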