# Proposal: FakeNews Detection Algorithm

## Problem Statement

Can news articles be grouped into different levels of credibility using their language and text feature frequencies? This project seeks to build a classification model that can determine credibility and identify the specific text features that are important in that determination. The model will be trained on vectorized word counts of news article titles, and of body text where available. Success will be judged primarily by the F1 score (the harmonic mean of precision and recall), though precision and recall will also be considered individually. Baseline accuracy will be the proportion of the largest label group, that is, the accuracy of always predicting the most common class.
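
As a rough illustration of these metrics, the sketch below computes a majority-class baseline accuracy along with precision, recall, and F1 for a binary labeling. The label lists are invented toy data, and the helper names `baseline_accuracy` and `precision_recall_f1` are hypothetical, not part of the project code.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and their harmonic mean (F1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration (1 = reliable, 0 = unreliable).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

print("baseline accuracy:", baseline_accuracy(y_true))
print("precision, recall, f1:", precision_recall_f1(y_true, y_pred))
```

A model is only useful if its scores clear the baseline, which is why the baseline is tied to the largest class rather than to chance.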

##### Grouping criteria have yet to be determined

## Methods and models

- The raw text data will be processed using a TF-IDF vectorizer and a count vectorizer to extract text features
- Multiple classification models will be trained to find which model is able to predict credibility with the greatest success (harmonic mean, recall, precision)
    - Preliminary list of models
        - Support Vector Classifier
        - Logistic Regression
        - Random Forest Classifier
        - K Nearest Neighbors Classifier
        - Multinomial and Gaussian Naive Bayes Classifiers
- A randomized search will be performed over each model to determine the best hyperparameters for classification
- Logistic regression will be used specifically to see whether certain words are important in determining the credibility of an article
- A convolutional neural network will also be trained to classify articles and provide a likelihood of belonging to each category

## Risks and Assumptions

- A key assumption of this analysis is that the labels in the training data are correctly assigned
    - This project uses a precollected dataset, so I do not have direct control of the labelling process
    - A way around this might be to run a separate unsupervised clustering process on the data to see how well it matches the labels that I am provided with
- A risk that follows from that assumption is that the model may be too reliant on domain
    - Labels were assigned based on domain credibility rather than on the content itself, so the model will actually be looking for linguistic patterns in the way a domain creates content, and it relies on those labels being accurate and fairly assigned
    - The model might be biased towards or against certain domains
- Another risk is how the model will handle satire
    - Removing satirical articles from the dataset is risky, but including them may make it difficult for the model to distinguish between what is intentionally false for the purpose of satire and what is false for the purpose of misleading readers
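
The clustering check mentioned above could be sketched as follows. This is only one possible approach: it assumes K-means on TF-IDF features and uses the adjusted Rand index to measure agreement with the provided labels. The titles and labels are invented toy data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Toy titles and provided labels, invented for illustration.
titles = [
    "senate passes budget bill", "court rules on election law",
    "city council approves budget", "governor signs election bill",
    "shocking secret cure revealed", "aliens control the secret government",
    "miracle cure doctors hate", "shocking truth about aliens",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = TfidfVectorizer(stop_words="english").fit_transform(titles)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A score of 1.0 means the clusters reproduce the labels exactly;
# scores near 0 mean the agreement is no better than chance.
ari = adjusted_rand_score(labels, clusters)
print(f"adjusted Rand index: {ari:.2f}")
```

A low score would not prove the labels are wrong, since clusters may form around topic rather than credibility, but it would be a prompt to inspect the labels more closely.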

## Data Source

### The FakeNewsCorpus provided by https://github.com/several27

- Corpus link: https://github.com/several27/FakeNewsCorpus
- Prelabeled csv file containing 9.5 million news articles
- Fields include title text, content, author, url, domain, keywords/meta keywords, and summary
- Labels (type)
    - Fake
    - Satire
    - Bias
    - Conspiracy
    - State
    - Junksci
    - Hate
    - Clickbait
    - Unreliable
    - Political
    - Reliable

## Preliminary EDA

##### Data was too large to load into memory so a sampling method was implemented

#### Sampling method: skiprows

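```python
import numpy as np
import pandas as pd

filename = 'news_cleaned_2018_02_13.csv'
nlinesfile = 9408908          # number of lines in the full csv
nlinesrandomsample = 100000   # number of lines to keep
# Randomly choose which line numbers (1 to nlinesfile) to skip
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)
```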

Reads a random sample of roughly 100,000 rows from the csv by skipping a randomly chosen list of row indices; row 0, the header, is never skipped.
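
The `skiprows` approach has to build an array of several million row numbers before reading. An alternative, shown here only as a sketch, is to read the file in chunks and keep a random fraction of each chunk. The helper name `sample_csv_in_chunks` is hypothetical, and the in-memory csv is an invented stand-in for the real file.

```python
import io

import pandas as pd


def sample_csv_in_chunks(source, frac, chunksize, random_state=0):
    """Read a CSV in chunks and keep a random fraction of each chunk."""
    pieces = []
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # Vary the seed per chunk so every chunk is sampled differently.
        pieces.append(chunk.sample(frac=frac, random_state=random_state + i))
    return pd.concat(pieces, ignore_index=True)


# Small in-memory stand-in for the real file (invented data).
rows = "\n".join(f"{i},title {i}" for i in range(1000))
toy_csv = io.StringIO("id,title\n" + rows)

sample = sample_csv_in_chunks(toy_csv, frac=0.1, chunksize=100)
print(sample.shape)  # (100, 2)
```

Only one chunk is held in memory at a time, so the peak memory use is set by `chunksize` rather than by the size of the file.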

Reading in sample

**Warning**
The csv file was moved outside the repository because of its size, so running this exact code will not reproduce the read.
Edit the file path passed to the read function to point to your local copy.

In [106]:
import pandas as pd
import numpy as np
from nltk.tokenize import RegexpTokenizer
from nltk import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
pd.options.display.max_rows = 999


In [50]:


filepath = '../datasets/news_cleaned_2018_02_13.csv'
nlinesfile = 9408908
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filepath, skiprows=lines2skip)


df.shape

(9097, 17)

In [51]:
df.drop(columns = ['Unnamed: 0'], inplace= True)
df = df.reindex(index = range(0, df.shape[0]))

In [52]:
df.head()

|   | id | domain | type | url | content | scraped_at | inserted_at | updated_at | title | authors | keywords | meta_keywords | meta_description | tags | summary | source |
|---|----|--------|------|-----|---------|------------|-------------|------------|-------|---------|----------|---------------|------------------|------|---------|--------|
| 0 | 2210 | beehivebugle.com | satire | http://beehivebugle.com/2014/04/07/a-former-mo... | Hi there, I’ve got a wonderful message that I ... | 2018-01-25 16:17:44.789555 | 2018-02-02 01:19:41.756632 | 2018-02-02 01:19:41.756664 | A Former Mormon’s 1st Annual Objective Respons... | View All Posts Lawrence B. Riff, Lawrence B. Riff |  | [''] | With flagrant zeal you enter the Conference Ce... | Jesus, heavenly father, Musician, Food, LDS Ch... |  |  |
| 1 | 2388 | christianpost.com | reliable | https://www.christianpost.com/news/changing-cu... | WASHINGTON — Building a culture that values li... | 2018-01-25 16:17:44.789555 | 2018-02-02 01:19:41.756632 | 2018-02-02 01:19:41.756664 | Changing a Culture of Self to a Culture of Life |  |  | [''] | Building a culture that values life will be a ... |  |  |  |
| 2 | 5133 | beforeitsnews.com | fake | http://beforeitsnews.com/gold-and-precious-met... | New Home Sales Explode Higher: Biggest Monthly... | 2018-01-25 16:17:44.789555 | 2018-02-02 01:19:41.756632 | 2018-02-02 01:19:41.756664 | New Home Sales Explode Higher: Biggest Monthly... |  |  | [''] |  |  |  |  |
| 3 | 6296 | collectivelyconscious.net | junksci | http://collectivelyconscious.net/tag/america/ | On the first part of the journey I was looking... | 2018-01-25 16:17:44.789555 | 2018-02-02 01:19:41.756632 | 2018-02-02 01:19:41.756664 | America |  |  | [''] | Hive Mind for the Awakened | Joe Rogan, Revolution, Basic Income, Food & Ag... |  |  |
| 4 | 6963 | naturalnews.com | junksci | https://www.naturalnews.com/048620_glyphosate_... | Impact on pineal gland, gut health explained\n... | 2018-01-25 16:17:44.789555 | 2018-02-02 01:19:41.756632 | 2018-02-02 01:19:41.756664 | Glyphosate could combine with aluminum to incr... | Friday, February, Jennifer Lilley |  | ['glyphosate', 'aluminum', 'gut flora', 'alumi... | Glyphosate could combine with aluminum to incr... |  |  |  |


Certain columns have a lot of null values.

The important columns, type and title, are each less than 5% null so those rows can be dropped.

In [53]:
df.isna().mean()

id                  0.000000
domain              0.000000
type                0.048587
url                 0.000000
content             0.000000
scraped_at          0.000000
inserted_at         0.000000
updated_at          0.000000
title               0.009124
authors             0.450588
keywords            1.000000
meta_keywords       0.041772
meta_description    0.532373
tags                0.770254
summary             1.000000
source              0.784215
dtype: float64

In [54]:
# Drop rows that are missing either the label (type) or the title
df = df.dropna(subset=['type', 'title'])

In [55]:
df.isna().mean()

id                  0.000000
domain              0.000000
type                0.000000
url                 0.000000
content             0.000000
scraped_at          0.000000
inserted_at         0.000000
updated_at          0.000000
title               0.000000
authors             0.446570
keywords            1.000000
meta_keywords       0.044330
meta_description    0.543047
tags                0.779865
summary             1.000000
source              0.770999
dtype: float64

### Looking at the type (label) counts

In [56]:
df['type'].value_counts(normalize = True)

reliable      0.232851
political     0.212902
bias          0.142324
fake          0.110826
conspiracy    0.094610
rumor         0.056346
unknown       0.042231
unreliable    0.037214
clickbait     0.029748
junksci       0.017032
satire        0.014116
hate          0.009799
Name: type, dtype: float64

Sample percentages vs full dataset percentages (as provided by origin repository)

| Label      | Sample | Full |
|------------|--------|------|
| reliable   | 23.3%  | 20%  |
| political  | 21.3%  | 26%  |
| bias       | 14.2%  | 14%  |
| fake       | 11.1%  | 10%  |
| conspiracy | 9.5%   | 9%   |
| rumor      | 5.6%   | NA   |
| unknown    | 4.2%   | NA   |
| unreliable | 3.7%   | 3%   |
| clickbait  | 3.0%   | 3%   |
| junksci    | 1.7%   | 2%   |
| satire     | 1.4%   | 1%   |
| hate       | 1.0%   | 1%   |

The sample proportions are close to the full-dataset proportions; the largest gaps are political (21.3% vs 26%) and reliable (23.3% vs 20%).
In both the full dataset and the sample, the 12 classes are unbalanced.
Another concern is that the rumor and unknown labels are not listed in the GitHub repository or its data dictionary.

Removing labels

- Rumor and unknown
    - Not listed in data dictionary
- Satire
    - May complicate initial models

In [57]:
# Drop the two undocumented labels and satire in a single filter
df = df[~df['type'].isin(['rumor', 'unknown', 'satire'])]

In [58]:
df['type'].value_counts(normalize = True)

reliable      0.262424
political     0.239942
bias          0.160400
fake          0.124901
conspiracy    0.106626
unreliable    0.041941
clickbait     0.033526
junksci       0.019195
hate          0.011044
Name: type, dtype: float64

### Combining different labels into binary classification

- 1 if reliable (~54%)
    - reliable
    - political
    - clickbait
- 0 if unreliable (~46%)
    - unreliable
    - conspiracy
    - hate
    - fake
    - bias
    - junksci

In [59]:
df['reliable'] = df['type'].map({'reliable' : 1, 'political' : 1, 'clickbait' : 1, 'unreliable' : 0,
                                 'conspiracy' : 0, 'hate' : 0, 'fake' : 0, 'bias' : 0, 'junksci' : 0})

df['reliable'].value_counts(normalize = True)

1    0.535893
0    0.464107
Name: reliable, dtype: float64

### Engineering title data for basic model

- Tokenizing and lemmatizing the article titles for use in a vectorizer

In [60]:
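# Build the tokenizer and lemmatizer once rather than on every call
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()

def tokenize(x):
    # Split a title into word tokens, dropping punctuation
    return tokenizer.tokenize(x)

df['tokens'] = df['title'].map(tokenize)

def lemmatize(x):
    # Lemmatize each token and rejoin the tokens into one string
    return ' '.join([lemmatizer.lemmatize(word) for word in x])

df['lemma'] = df['tokens'].map(lemmatize)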



In [85]:
df['lemma'].sample(10)

8584    In California a Wary Eye on Hillsides a Rain F...
2171    Jim DeMint leaving the Senate to head Heritage...
2620             Hard reboot news article and information
578                      Daily Kos Comments by cedar park
7521                            Fleeting Glory Is No More
7816                              Laurence Amir Jon Regen
2655                             Earned Income Tax Credit
6363                           Ancient Chinese secret huh
5141          Pope writes to Muslims about mutual respect
7596               Which Way Out of Our National Quagmire
Name: lemma, dtype: object

In [61]:
cvec = CountVectorizer(stop_words='english', ngram_range=(1, 2) )

In [111]:
cvec.fit_transform(df['title'])



<7606x43818 sparse matrix of type '<class 'numpy.int64'>'
	with 77633 stored elements in Compressed Sparse Row format>

In [112]:
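vocab_df = pd.DataFrame(data = cvec.vocabulary_.values(), index= cvec.vocabulary_.keys(), columns= ['column_index'])

In [124]:
vocab_df.sort_values(by = 'column_index', ascending=False).head(300);

Note that `cvec.vocabulary_` maps each term to its column index in the matrix, not to how often it occurs, and the indices follow the sorted order of the terms. Sorting by that value in descending order therefore lists the terms that sort last rather than the most common ones. The top 300+ entries are words and 2 word phrases in foreign languages, which shows that the corpus contains non-English titles.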

Trying a higher minimum document frequency (min_df) threshold

In [120]:
cvec_2 = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=.01 )

In [121]:
cvec_2.fit_transform(df['title'])

vocab_df_2 = pd.DataFrame(data = cvec_2.vocabulary_.values(), index= cvec_2.vocabulary_.keys(), columns= ['column_index'])

In [123]:
vocab_df_2.sort_values(by = 'column_index', ascending=False).head(100);

With a min_df of .01, a term must appear in at least 1% of the titles to be kept in the vocabulary, and the foreign language terms no longer appear.
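
The effect of `min_df` can be seen on a tiny invented example. In scikit-learn an integer `min_df` is an absolute document count, while a float is a proportion of documents. The threshold of 0.5 below is chosen only so the effect is visible on four titles.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy titles, invented for illustration; one is not in English.
titles = [
    "budget vote delayed",
    "senate budget vote",
    "budget talks resume",
    "новости дня",
]

vocabularies = {}
for min_df in (1, 0.5):
    cvec = CountVectorizer(min_df=min_df)
    cvec.fit(titles)
    # Keep the surviving terms in sorted order for easy comparison.
    vocabularies[min_df] = sorted(cvec.vocabulary_)

for min_df, vocab in vocabularies.items():
    print(min_df, len(vocab), vocab)
```

With `min_df=0.5` only terms found in at least two of the four titles survive, so the rare terms, including the non-English ones, drop out of the vocabulary.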

### Insights from EDA

- The sampling method is an effective way to get a representative sample data frame that is small enough to load into memory
- Using a heavy-handed approach to class combination, the binary reliable and unreliable classes are roughly equal in size (~54% and ~46%)
    - Intend to spend more time developing more carefully considered class combinations
    - Perhaps tiered multiclass classification
- min_df is an important hyperparameter that will be searched over during modeling
    - Removing non-English words could have negative effects on the model's ability to predict credibility
- Next steps in EDA and feature engineering
    - Begin to look into sentiment analysis
    - Locate non-English article titles
    - Classify and handle satirical articles
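
As a starting point for locating non-English titles, a crude heuristic is to flag titles in which a large share of the letters fall outside ASCII. The sketch below is an assumption-laden first pass, not a language detector: the helper names and the 0.3 threshold are hypothetical, and titles in Latin-script languages such as Spanish or French would mostly slip through.

```python
def non_ascii_letter_ratio(title):
    """Fraction of alphabetic characters that fall outside ASCII."""
    letters = [ch for ch in title if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(not ch.isascii() for ch in letters) / len(letters)


def looks_non_english(title, threshold=0.3):
    """Crude flag: a high share of non-ASCII letters suggests a non-English title."""
    return non_ascii_letter_ratio(title) > threshold


# Toy titles, two taken from the style of the sample above and one invented.
titles = [
    "Pope writes to Muslims about mutual respect",
    "Новости дня",
    "Café owners protest new tax",
]
flags = [looks_non_english(title) for title in titles]
print(flags)  # [False, True, False]
```

Applied to the data frame, this could be used as `df['title'].map(looks_non_english)` to count or inspect the flagged rows before deciding how to handle them.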