In [1]:
# !wget https://github.com/HeadHunter483/data-science/raw/master/fake_news_classification/news_sample.csv

--2020-09-29 21:51:02--  https://github.com/HeadHunter483/data-science/raw/master/fake_news_classification/news_sample.csv
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/HeadHunter483/data-science/master/fake_news_classification/news_sample.csv [following]
--2020-09-29 21:51:03--  https://raw.githubusercontent.com/HeadHunter483/data-science/master/fake_news_classification/news_sample.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1172837 (1.1M) [text/plain]
Saving to: ‘news_sample.csv’


2020-09-29 21:51:03 (8.06 MB/s) - ‘news_sample.csv’ saved [1172837/1172837]



In [2]:
from pprint import pprint

import numpy as np
import pandas as pd
from sklearn.preprocessing import label_binarize
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer
)
from sklearn.datasets import fetch_20newsgroups

## Fake news classification based on statistical data

### 1. Data

The experiments use a sample of [FakeNewsCorpus](https://github.com/several27/FakeNewsCorpus) consisting of 238 tagged news articles. The target feature is the "type" of an article, which can take the values "fake", "reliable", "unreliable" and some others. For this task the target feature is binarized: "1" for "fake", "0" for anything that is not "fake".
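
As a minimal sketch of the binarization described above, the toy series below (hypothetical labels, not the real data) is mapped to 1 for "fake" and 0 for every other value. The name `toy_types` is an assumption made for illustration only.

```python
import pandas as pd

# Hypothetical labels standing in for the "type" column.
toy_types = pd.Series(["fake", "reliable", "unreliable", "fake", "clickbait"])

# 1 where the label is exactly "fake", 0 for every other label.
toy_target = (toy_types == "fake").astype(int)
print(toy_target.tolist())  # [1, 0, 0, 1, 0]
```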

In [3]:
# let's read and look at the data
df = pd.read_csv("news_sample.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,id,domain,type,url,content,scraped_at,inserted_at,updated_at,title,authors,keywords,meta_keywords,meta_description,tags,summary
0,0,141,awm.com,unreliable,http://awm.com/church-congregation-brings-gift...,Sometimes the power of Christmas will make you...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Church Congregation Brings Gift to Waitresses ...,Ruth Harris,,[''],,,
1,1,256,beforeitsnews.com,fake,http://beforeitsnews.com/awakening-start-here/...,AWAKENING OF 12 STRANDS of DNA – “Reconnecting...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,AWAKENING OF 12 STRANDS of DNA – “Reconnecting...,Zurich Times,,[''],,,
2,2,700,cnnnext.com,unreliable,http://www.cnnnext.com/video/18526/never-hike-...,Never Hike Alone: A Friday the 13th Fan Film U...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Never Hike Alone - A Friday the 13th Fan Film ...,,,[''],Never Hike Alone: A Friday the 13th Fan Film ...,,
3,3,768,awm.com,unreliable,http://awm.com/elusive-alien-of-the-sea-caught...,"When a rare shark was caught, scientists were ...",2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Elusive ‘Alien Of The Sea ‘ Caught By Scientis...,Alexander Smith,,[''],,,
4,4,791,bipartisanreport.com,clickbait,http://bipartisanreport.com/2018/01/21/trumps-...,Donald Trump has the unnerving ability to abil...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Trump’s Genius Poll Is Complete & The Results ...,Gloria Christie,,[''],,,


In [4]:
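# now binarize "type" column: 1 for "fake", 0 for every other label
df = df.dropna(subset=['type']).copy()
# a direct comparison gives integer 0/1 on any pandas version
# (get_dummies returns booleans by default in pandas 2.x)
df['type'] = (df['type'] == 'fake').astype(int)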
df.head()

Unnamed: 0.1,Unnamed: 0,id,domain,type,url,content,scraped_at,inserted_at,updated_at,title,authors,keywords,meta_keywords,meta_description,tags,summary
0,0,141,awm.com,0,http://awm.com/church-congregation-brings-gift...,Sometimes the power of Christmas will make you...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Church Congregation Brings Gift to Waitresses ...,Ruth Harris,,[''],,,
1,1,256,beforeitsnews.com,1,http://beforeitsnews.com/awakening-start-here/...,AWAKENING OF 12 STRANDS of DNA – “Reconnecting...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,AWAKENING OF 12 STRANDS of DNA – “Reconnecting...,Zurich Times,,[''],,,
2,2,700,cnnnext.com,0,http://www.cnnnext.com/video/18526/never-hike-...,Never Hike Alone: A Friday the 13th Fan Film U...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Never Hike Alone - A Friday the 13th Fan Film ...,,,[''],Never Hike Alone: A Friday the 13th Fan Film ...,,
3,3,768,awm.com,0,http://awm.com/elusive-alien-of-the-sea-caught...,"When a rare shark was caught, scientists were ...",2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Elusive ‘Alien Of The Sea ‘ Caught By Scientis...,Alexander Smith,,[''],,,
4,4,791,bipartisanreport.com,0,http://bipartisanreport.com/2018/01/21/trumps-...,Donald Trump has the unnerving ability to abil...,2018-01-25 16:17:44.789555,2018-02-02 01:19:41.756632,2018-02-02 01:19:41.756664,Trump’s Genius Poll Is Complete & The Results ...,Gloria Christie,,[''],,,


Now the dataset is split into train and test sets in a 7:3 ratio: the first 70% of the rows are used for training and the rest for testing.

In [5]:
div = int(len(df)*0.7)
train_set = df[:div]
test_set = df[div:]
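
The split above keeps the rows in file order. As an alternative sketch (not what this notebook ran), `train_test_split` from scikit-learn can shuffle the rows and keep the share of "fake" articles equal in both parts. The toy frame `toy_df` and the seed value are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the notebook's `df`.
toy_df = pd.DataFrame({
    "content": [f"article {i}" for i in range(20)],
    "type": [1] * 6 + [0] * 14,
})

# Shuffle the rows and stratify on the label so both parts
# contain roughly the same proportion of each class.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.3, stratify=toy_df["type"], random_state=42
)
print(len(toy_train), len(toy_test))
print(toy_train["type"].mean(), toy_test["type"].mean())
```

Results obtained with such a split would differ from the numbers reported in this notebook, since the test articles would be different.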

### 2. Experiments with different classifiers

The `content` column is used for classification. The text is converted to vectors using the `Bag of Words` model, and `TF-IDF` weighting is then applied for normalization. The resulting features are used to train a classification algorithm.

The experiments below compare base classifiers with the best configurations found by `grid search` over classifier hyper-parameters, word n-grams, stop words and the use of idf.
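
To make the two feature steps concrete, here is a small sketch on a hypothetical two-document corpus. It shows the count matrix produced by `CountVectorizer` and the re-weighted, length-normalized matrix produced by `TfidfTransformer` with its default settings. The documents and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy documents.
toy_docs = [
    "fake news spreads fast",
    "reliable news spreads slowly",
]

# Bag of Words: one column per vocabulary term, values are raw counts.
vect = CountVectorizer()
counts = vect.fit_transform(toy_docs)
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
print(counts.toarray())

# TF-IDF: terms that occur in every document get a lower weight,
# and each row is scaled to unit L2 norm (the default).
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(3))
```

In the first document, "fake" ends up with a higher weight than "news", because "news" appears in both documents.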

In [6]:
def evaluate(clf, test_features, test_labels):
    predicted = clf.predict(test_features)
    acc = np.mean(predicted == test_labels)
    print(f"Accuracy: {acc * 100:.2f}%")
    return acc

In [7]:
def grid_search(clf, parameters, train_features, train_labels):
    gs = GridSearchCV(clf, parameters, n_jobs=-1)
    gs = gs.fit(train_features, train_labels)
    print(f"Accuracy: {gs.best_score_ * 100:.2f}%")
    print("Best parameters:")
    pprint(gs.best_params_)
    return gs

In [8]:
def eval_grid_search_best(clf, parameters, train_features, train_labels,
                          test_features, test_labels):
    print("Grid Search:")
    gs = grid_search(clf, parameters, train_features, train_labels)
    best_clf = gs.best_estimator_

    print("\nBest Estimator on Test Set:")
    acc = evaluate(best_clf, test_features, test_labels)
    return gs

#### 2.1. Naive Bayes based classifier

In [9]:
# building classifier based on Naive Bayes
text_clf = Pipeline([
    ('vect', CountVectorizer()), 
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(train_set['content'], train_set['type'])
print("Base Classifier:")
acc = evaluate(text_clf, test_set['content'], test_set['type'])

Base Classifier:
Accuracy: 69.44%


In [10]:
# grid search on model parameters for increasing accuracy
parameters = {
    'vect__ngram_range': [(1,1), (1,2), (1,3)],  # ngrams to use
    'vect__stop_words': (None, 'english'), # removing stopwords or not
    'tfidf__use_idf': (True, False), # use idf or not
    'clf__alpha': [10**(-(i + 1)) for i in range(5)], # alpha parameter for NB
    'clf__fit_prior': (True, False) # use uniform or real distribution of the data
}
gs_clf = eval_grid_search_best(
    text_clf, parameters, 
    train_set['content'], train_set['type'], 
    test_set['content'], test_set['type'])

Grid Search:
Accuracy: 83.19%
Best parameters:
{'clf__alpha': 0.01,
 'clf__fit_prior': False,
 'tfidf__use_idf': False,
 'vect__ngram_range': (1, 1),
 'vect__stop_words': None}

Best Estimator on Test Set:
Accuracy: 80.56%


NB with default parameters does not reach high accuracy, but grid search increased test accuracy by about 11 percentage points, from 69.44% to 80.56%, which is quite a good result.
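
Note that the "Accuracy" printed under "Grid Search:" is `best_score_`, the mean cross-validated accuracy on the training data, while the number under "Best Estimator on Test Set:" comes from the held-out test set, so the two are not directly comparable. The sketch below shows both quantities on a hypothetical toy corpus; the texts, labels and `cv=2` are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = "fake", 0 = not "fake".
train_texts = [
    "shocking miracle cure", "aliens secret cure",
    "miracle secret revealed", "shocking aliens revealed",
    "council approves budget", "court publishes ruling",
    "council publishes report", "court approves budget",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["secret miracle cure revealed", "council approves report"]
test_labels = [1, 0]

pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
gs = GridSearchCV(pipe, {"clf__alpha": [0.01, 0.1, 1.0]}, cv=2)
gs.fit(train_texts, train_labels)

# Mean accuracy over the validation folds of the training data.
cv_accuracy = gs.best_score_
# Accuracy of the refitted best pipeline on unseen test data.
test_accuracy = gs.best_estimator_.score(test_texts, test_labels)
print(f"CV accuracy: {cv_accuracy:.2%}, test accuracy: {test_accuracy:.2%}")
```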

#### 2.2. SVM

In [11]:
# building classifier based on SVM
text_clf_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-svm', SGDClassifier(
        loss='hinge', penalty='l2', alpha=1e-3, 
        n_iter_no_change=5, random_state=42)),
])
text_clf_svm = text_clf_svm.fit(train_set['content'], train_set['type'])
print("Base Classifier:")
acc = evaluate(text_clf_svm, test_set['content'], test_set['type'])

Base Classifier:
Accuracy: 80.56%


In [12]:
# grid search on model parameters for SVM
parameters_svm = {
    'vect__ngram_range': [(1,1), (1,2), (1,3)],  # ngrams to use
    'vect__stop_words': (None, 'english'), # removing stopwords or not
    'tfidf__use_idf': (True, False), # use idf or not
    'clf-svm__alpha': [10**(-(i + 1)) for i in range(5)], # alpha parameter for SVM
}
gs_clf_svm = eval_grid_search_best(
    text_clf_svm, parameters_svm, train_set['content'], train_set['type'],
    test_set['content'], test_set['type']
)

Grid Search:
Accuracy: 86.19%
Best parameters:
{'clf-svm__alpha': 0.001,
 'tfidf__use_idf': False,
 'vect__ngram_range': (1, 3),
 'vect__stop_words': 'english'}

Best Estimator on Test Set:
Accuracy: 91.67%


SVM shows good results with default parameters, and with grid search test accuracy increases by about 11 percentage points, from 80.56% to 91.67%.
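
The "SVM" in this section is `SGDClassifier` with `loss='hinge'`, that is, a linear SVM trained with stochastic gradient descent. As a rough sketch on a hypothetical toy corpus, the fitted model exposes a signed score per document through `decision_function`, and the predicted class is 1 exactly when that score is positive.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = "fake", 0 = not "fake".
texts = [
    "shocking miracle cure", "aliens secret cure",
    "miracle secret revealed", "shocking aliens revealed",
    "council approves budget", "court publishes ruling",
    "council publishes report", "court approves budget",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

svm = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf-svm", SGDClassifier(loss="hinge", penalty="l2",
                              alpha=1e-3, random_state=42)),
])
svm.fit(texts, labels)

# Signed distance-like score from the linear decision boundary.
scores = svm.decision_function(texts)
preds = svm.predict(texts)
for text, score, pred in zip(texts, scores, preds):
    print(f"{score:+.3f} -> {pred}  ({text})")
```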

#### 2.3. Decision Tree

Since decision tree training involves some randomness, let's try different random states and find the highest and lowest accuracies.

In [13]:
# getting accuracies for Decision Tree based on different random states
for i in range(43):
    text_clf_dt = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf-dt', DecisionTreeClassifier(random_state=i))
    ])
    text_clf_dt = text_clf_dt.fit(train_set['content'], train_set['type'])
    print("Base Classifier:")
    acc = evaluate(text_clf_dt, test_set['content'], test_set['type'])

Base Classifier:
Accuracy: 88.89%
Base Classifier:
Accuracy: 87.50%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 87.50%
Base Classifier:
Accuracy: 88.89%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 88.89%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 86.11%
Base Classifier:
Accuracy: 88.89%
Base Classifier:
Accuracy: 86.11%
Base Classifier:
Accuracy: 87.50%
Base Classifier:
Accuracy: 86.11%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 87.50%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 87.50%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 90.28%
[output truncated]

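In the output shown, the lowest accuracy is 86.11% and the highest is 91.67%. The experiments below use `random_state=1` (87.50%) as a low-accuracy case and `random_state=9` (91.67%) as a high-accuracy case.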
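
Reading the extremes off a long printout is error-prone. One possible alternative, sketched here on synthetic data rather than on the news articles, is to collect the accuracies in a list and summarize them. The dataset from `make_classification` and the choice of 10 seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized articles (hypothetical data).
X, y = make_classification(n_samples=240, n_features=50,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train one tree per seed and keep its test accuracy.
accs = []
for seed in range(10):
    tree = DecisionTreeClassifier(random_state=seed)
    tree.fit(X_tr, y_tr)
    accs.append(tree.score(X_te, y_te))
accs = np.array(accs)

print(f"min {accs.min():.2%}, max {accs.max():.2%}, "
      f"mean {accs.mean():.2%}, std {accs.std():.2%}")
print("seed of lowest:", int(accs.argmin()),
      "seed of highest:", int(accs.argmax()))
```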

In [14]:
# building classifier based on Decision Tree with a low base accuracy (random_state=1)
text_clf_dt = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-dt', DecisionTreeClassifier(random_state=1))
])
text_clf_dt = text_clf_dt.fit(train_set['content'], train_set['type'])
print("Base Classifier:")
acc = evaluate(text_clf_dt, test_set['content'], test_set['type'])

Base Classifier:
Accuracy: 87.50%


In [15]:
# grid search on model parameters for Decision Tree
parameters_dt = {
    'vect__ngram_range': [(1,1), (1,2), (1,3)], # ngrams to use
    'vect__stop_words': (None, 'english'), # removing stopwords or not
    'tfidf__use_idf': (True, False), # use idf or not
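    # num of features for best split; 'auto' (same as 'sqrt' here) was removed
    # in scikit-learn 1.3, so drop it when running on newer versions
    'clf-dt__max_features': (None, 'auto', 'sqrt', 'log2'),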
    'clf-dt__criterion': ('gini', 'entropy'), # measure for quality of split
    'clf-dt__min_samples_split': (2, 3, 4), # minimum number of samples required to split
    'clf-dt__random_state': [1], # initial state
}
gs_clf_dt = eval_grid_search_best(
    text_clf_dt, parameters_dt, train_set['content'], train_set['type'],
    test_set['content'], test_set['type']
)

Grid Search:
Accuracy: 93.44%
Best parameters:
{'clf-dt__criterion': 'gini',
 'clf-dt__max_features': None,
 'clf-dt__min_samples_split': 4,
 'clf-dt__random_state': 1,
 'tfidf__use_idf': True,
 'vect__ngram_range': (1, 1),
 'vect__stop_words': None}

Best Estimator on Test Set:
Accuracy: 87.50%


In [16]:
# building classifier based on Decision Tree with highest base accuracy
text_clf_dt = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-dt', DecisionTreeClassifier(random_state=9))
])
text_clf_dt = text_clf_dt.fit(train_set['content'], train_set['type'])
print("Base Classifier:")
acc = evaluate(text_clf_dt, test_set['content'], test_set['type'])

Base Classifier:
Accuracy: 91.67%


In [17]:
# grid search on model parameters for Decision Tree
parameters_dt = {
    'vect__ngram_range': [(1,1), (1,2), (1,3)], # ngrams to use
    'vect__stop_words': (None, 'english'), # removing stopwords or not
    'tfidf__use_idf': (True, False), # use idf or not
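    # num of features for best split; 'auto' (same as 'sqrt' here) was removed
    # in scikit-learn 1.3, so drop it when running on newer versions
    'clf-dt__max_features': (None, 'auto', 'sqrt', 'log2'),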
    'clf-dt__criterion': ('gini', 'entropy'), # measure for quality of split
    'clf-dt__min_samples_split': (2, 3, 4), # minimum number of samples required to split
    'clf-dt__random_state': [9], # initial state
}
gs_clf_dt = eval_grid_search_best(
    text_clf_dt, parameters_dt, train_set['content'], train_set['type'],
    test_set['content'], test_set['type']
)

Grid Search:
Accuracy: 93.44%
Best parameters:
{'clf-dt__criterion': 'gini',
 'clf-dt__max_features': None,
 'clf-dt__min_samples_split': 2,
 'clf-dt__random_state': 9,
 'tfidf__use_idf': True,
 'vect__ngram_range': (1, 1),
 'vect__stop_words': None}

Best Estimator on Test Set:
Accuracy: 91.67%


For both random states, grid search did not improve test accuracy: it stayed at 87.50% and 91.67%, respectively.

#### 2.4. Random Forest

The random forest experiments follow the same procedure as for the decision tree.

In [18]:
# getting accuracies for Random Forest on different random states
for i in range(43):
    text_clf_rf = Pipeline([
        ('vect', CountVectorizer()), 
        ('tfidf', TfidfTransformer()),
        ('clf-rf', RandomForestClassifier(n_estimators=100, random_state=i)),
    ])
    text_clf_rf = text_clf_rf.fit(train_set['content'], train_set['type'])
    print("Base Classifier:")
    acc = evaluate(text_clf_rf, test_set['content'], test_set['type'])

Base Classifier:
Accuracy: 95.83%
Base Classifier:
Accuracy: 93.06%
Base Classifier:
Accuracy: 95.83%
Base Classifier:
Accuracy: 88.89%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 93.06%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 94.44%
Base Classifier:
Accuracy: 93.06%
Base Classifier:
Accuracy: 94.44%
Base Classifier:
Accuracy: 94.44%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 88.89%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 88.89%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 93.06%
Base Classifier:
Accuracy: 93.06%
Base Classifier:
Accuracy: 87.50%
Base Classifier:
Accuracy: 88.89%
Base Classifier:
Accuracy: 94.44%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 90.28%
Base Classifier:
Accuracy: 94.44%
Base Classifier:
Accuracy: 91.67%
Base Classifier:
Accuracy: 95.83%
[output truncated]

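In the output shown, random forest accuracy ranges from 87.50% to 95.83%. The experiments below use `random_state=3` (88.89%) as a low-accuracy case and `random_state=0` (95.83%) as a high-accuracy case.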

In [19]:
# building classifier based on Random Forest with a low base accuracy (random_state=3)
text_clf_rf = Pipeline([
    ('vect', CountVectorizer()), 
    ('tfidf', TfidfTransformer()),
    ('clf-rf', RandomForestClassifier(n_estimators=100, random_state=3)),
])
text_clf_rf = text_clf_rf.fit(train_set['content'], train_set['type'])
print("Base Classifier:")
acc = evaluate(text_clf_rf, test_set['content'], test_set['type'])

Base Classifier:
Accuracy: 88.89%


In [20]:
# grid search on model parameters for Random Forest
parameters_rf = {
    'vect__ngram_range': [(1,1), (1,2), (1,3)], # ngrams to use
    'vect__stop_words': (None, 'english'), # removing stopwords or not
    'tfidf__use_idf': (True, False), # use idf or not
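    # num of features for best split; 'auto' (same as 'sqrt' here) was removed
    # in scikit-learn 1.3, so drop it when running on newer versions
    'clf-rf__max_features': ('auto', 'sqrt', 'log2'),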
    'clf-rf__criterion': ('gini', 'entropy'), # measure for quality of split
    'clf-rf__min_samples_split': (2,3,4), # minimum number of samples required to split
    'clf-rf__n_estimators': [50*(i+1) for i in range(6)], # number of trees in the forest
    'clf-rf__random_state': [3], # initial state
}
gs_clf_rf = eval_grid_search_best(
    text_clf_rf, parameters_rf, train_set['content'], train_set['type'],
    test_set['content'], test_set['type']
)

Grid Search:
Accuracy: 92.85%
Best parameters:
{'clf-rf__criterion': 'gini',
 'clf-rf__max_features': 'auto',
 'clf-rf__min_samples_split': 4,
 'clf-rf__n_estimators': 200,
 'clf-rf__random_state': 3,
 'tfidf__use_idf': True,
 'vect__ngram_range': (1, 3),
 'vect__stop_words': None}

Best Estimator on Test Set:
Accuracy: 88.89%


In [21]:
# building classifier based on Random Forest with highest base accuracy
text_clf_rf = Pipeline([
    ('vect', CountVectorizer()), 
    ('tfidf', TfidfTransformer()),
    ('clf-rf', RandomForestClassifier(n_estimators=100, random_state=0)),
])
text_clf_rf = text_clf_rf.fit(train_set['content'], train_set['type'])
print("Base Classifier:")
acc = evaluate(text_clf_rf, test_set['content'], test_set['type'])

Base Classifier:
Accuracy: 95.83%


In [22]:
# grid search on model parameters for Random Forest
parameters_rf = {
    'vect__ngram_range': [(1,1), (1,2), (1,3)], # ngrams to use
    'vect__stop_words': (None, 'english'), # removing stopwords or not
    'tfidf__use_idf': (True, False), # use idf or not
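    # num of features for best split; 'auto' (same as 'sqrt' here) was removed
    # in scikit-learn 1.3, so drop it when running on newer versions
    'clf-rf__max_features': ('auto', 'sqrt', 'log2'),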
    'clf-rf__criterion': ('gini', 'entropy'), # measure for quality of split
    'clf-rf__min_samples_split': (2,3,4), # minimum number of samples required to split
    'clf-rf__n_estimators': [50*(i+1) for i in range(6)], # number of trees in the forest
    'clf-rf__random_state': [0], # initial state
}
gs_clf_rf = eval_grid_search_best(
    text_clf_rf, parameters_rf, train_set['content'], train_set['type'],
    test_set['content'], test_set['type']
)

Grid Search:
Accuracy: 94.60%
Best parameters:
{'clf-rf__criterion': 'entropy',
 'clf-rf__max_features': 'auto',
 'clf-rf__min_samples_split': 2,
 'clf-rf__n_estimators': 100,
 'clf-rf__random_state': 0,
 'tfidf__use_idf': True,
 'vect__ngram_range': (1, 1),
 'vect__stop_words': None}

Best Estimator on Test Set:
Accuracy: 93.06%


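With base parameters, the low-accuracy random forest case (88.89%) is still higher than the low-accuracy decision tree case (87.50%) and the base SVM (80.56%) and NB (69.44%) results, and grid search does not increase it. The high-accuracy case reaches 95.83%, and grid search decreases it by about 2.8 percentage points to 93.06%.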
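
When comparing these numbers it helps to remember how small the test set is. The arithmetic sketch below assumes the split was applied to exactly the 238 articles mentioned in section 1, which gives 72 test articles; under that assumption every reported test accuracy corresponds to a whole number of correctly classified articles, and a single article is worth about 1.39 percentage points.

```python
n_articles = 238                  # corpus size stated in section 1 (assumed to be len(df))
n_train = int(n_articles * 0.7)   # same rule as in the split cell
n_test = n_articles - n_train

def correct_count(accuracy_percent, n):
    """Number of correct predictions implied by a rounded accuracy."""
    return round(accuracy_percent / 100 * n)

# Test accuracies reported in this notebook.
for acc in (69.44, 80.56, 91.67, 95.83):
    k = correct_count(acc, n_test)
    print(f"{acc:.2f}% -> {k}/{n_test} = {k / n_test:.2%}")

# Accuracy change caused by one article.
one_article = 100 / n_test
print(f"one article = {one_article:.2f} percentage points")
```

Under this reading, the gap between 93.06% and 95.83% is two test articles.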

### Conclusion

Four machine learning algorithms were considered for the task of classifying fake news from the text of news articles. The best result in the experiments came from random forest with base parameters, with an accuracy of 95.83%. It should be noted that SVM also shows good results: 91.67% after grid search, which is about 4 percentage points below random forest and equal to the highest decision tree accuracy. Since decision tree and random forest results depend on the random state, SVM can be considered a good, stable classifier.