# Fake News Classifier

In this project we build a classification model that predicts whether a given news article is "fake news".

### Fake News

The term "fake news" was popularised by Donald Trump in the lead-up to his 2016 US election win and has since become a prominent theme in politics and media.

**Fake news** is any false or misleading information presented as news. It is commonly spread on the internet through social media platforms such as Twitter and Facebook, and is usually created to influence political views or as a joke.

# Method

### Training data  
First we shall read in our training data set, which consists of a variety of news headlines and texts aligned with a label stating whether the article is fake news or not.

In [2]:
import pandas as pd
df = pd.read_csv('news.csv')
df.head(10)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


In [3]:
# Let's check the size of our training data set.
len(df)

6335

In [4]:
# Here we shall check whether our data set contains any null values.
check_nan_in_df = df.isnull()
print(check_nan_in_df)

      Unnamed: 0  title   text  label
0          False  False  False  False
1          False  False  False  False
2          False  False  False  False
3          False  False  False  False
4          False  False  False  False
...          ...    ...    ...    ...
6330       False  False  False  False
6331       False  False  False  False
6332       False  False  False  False
6333       False  False  False  False
6334       False  False  False  False

[6335 rows x 4 columns]


In [5]:
# We see that the data set contains no null/missing values.
df.isnull().values.any()

False
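
Printing the full Boolean frame is hard to read at this size, so a per-column count of missing values can be a handier summary of the same check. A minimal sketch on a toy frame (the data below is hypothetical and only stands in for `news.csv`):

```python
import pandas as pd

# Toy frame standing in for news.csv; the missing title is deliberate.
toy = pd.DataFrame({
    "title": ["Headline A", None, "Headline C"],
    "text": ["Body A", "Body B", "Body C"],
    "label": ["REAL", "FAKE", "REAL"],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Single yes/no answer, as in the cell above.
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```

On the real data set every count should be zero, matching the `False` result above.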

### Pre-processing our data

We now want to convert our training data set into a series of input-output pairs, in order to build our logistic regression model.

In [6]:
# First let's combine the title and text columns to create our input x
df['total_text'] = df['title'] + ' ' + df['text']
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,total_text
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,You Can Smell Hillary’s Fear Daniel Greenfield...
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,Watch The Exact Moment Paul Ryan Committed Pol...
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,Kerry to go to Paris in gesture of sympathy U....
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,Bernie supporters on Twitter erupt in anger ag...
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,The Battle of New York: Why This Primary Matte...


We now have our inputs x within the column 'total_text' and the corresponding outputs y in the 'label' column.  
However, we would like to alter our data into a more interpretable form.

First, let's convert our labels into binary form.

In [7]:
# We set Real to 1 and Fake to 0.
as_binary = {'REAL':1,'FAKE':0}
df['label'] = df['label'].map(as_binary)
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,total_text
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",0,You Can Smell Hillary’s Fear Daniel Greenfield...
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,0,Watch The Exact Moment Paul Ryan Committed Pol...
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,1,Kerry to go to Paris in gesture of sympathy U....
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",0,Bernie supporters on Twitter erupt in anger ag...
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,1,The Battle of New York: Why This Primary Matte...
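
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a stray label (for example a differently cased `Real`) would silently become missing. A minimal sketch of a check for this, on a toy Series whose values are hypothetical:

```python
import pandas as pd

as_binary = {"REAL": 1, "FAKE": 0}

# Toy labels; the last one has unexpected casing on purpose.
labels = pd.Series(["REAL", "FAKE", "Real"])

mapped = labels.map(as_binary)

# Any label that was not a dictionary key is now NaN.
unmapped = labels[mapped.isnull()]
print(unmapped.tolist())
```

An empty list here would mean every label was mapped to 0 or 1.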


We now want to clean up our text to make it easier to analyse. We will use the Natural Language Toolkit (NLTK) package to do this.

In [8]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/maxkirwan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/maxkirwan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/maxkirwan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

First, let's try removing stop words from an example sentence.

In [11]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [12]:
stop_words = stopwords.words('english')

In [13]:
example_sentence = "This is an example sentence. Hello World! I wonder if this won't work."
word_tokens = word_tokenize(example_sentence)

In [14]:
filtered_sentence = [w for w in word_tokens if w not in stop_words]
filtered_sentence

['This',
 'example',
 'sentence',
 '.',
 'Hello',
 'World',
 '!',
 'I',
 'wonder',
 'wo',
 "n't",
 'work',
 '.']
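
Note that `'This'` and `'I'` survive in the output above: NLTK's English stop-word list is all lowercase and the membership test is case-sensitive. One way to catch capitalised stop words too is to compare lowercase forms. A minimal sketch, in which the small `STOP_WORDS` set and the regex tokenizer are hypothetical stand-ins for `stopwords.words('english')` and `word_tokenize` so that the example runs without NLTK data:

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {"this", "is", "an", "i", "if"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Return the tokens of `sentence` whose lowercase form is not a stop word."""
    # Simplified tokenizer: runs of word characters/apostrophes, or single punctuation marks.
    tokens = re.findall(r"[\w']+|[^\w\s]", sentence)
    return [t for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words(
    "This is an example sentence. Hello World! I wonder if this won't work."
)
print(filtered)
```

The original casing of the surviving tokens is preserved; only the comparison is lowercased.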

That removed the lowercase stop words. Now let's do the same for each row of our data set.

In [15]:
from nltk.stem import WordNetLemmatizer

In [16]:
# A set gives faster membership tests than a list
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

for index, row in df.iterrows():
    filter_sentence = ''
    sentence = row['total_text']
    # Tokenization
    words = nltk.word_tokenize(sentence)
    # Stopwords removal (case-sensitive, so capitalised stop words are kept)
    words = [w for w in words if w not in stop_words]
    # Lemmatization, then lowercasing
    for word in words:
        filter_sentence = filter_sentence + ' ' + lemmatizer.lemmatize(word).lower()

    df.loc[index, 'total_text'] = filter_sentence
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,total_text
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",0,you can smell hillary ’ fear daniel greenfiel...
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,0,watch the exact moment paul ryan committed po...
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,1,kerry go paris gesture sympathy u.s. secretar...
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",0,bernie supporter twitter erupt anger dnc : 'w...
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,1,the battle new york : why this primary matter...


We now have our dataset as a list of inputs and their corresponding output labels.

In [17]:
df_input = df['total_text']
df_output = df['label']

### Vectorization

We need to map each text in our corpus to a corresponding vector of real numbers so that we can measure similarities between texts.
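
To make this mapping concrete, here is a minimal pure-Python sketch of TF-IDF weighting with L2 normalisation on a toy corpus. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's default, but splits on whitespace instead of using scikit-learn's tokenizer. The function names and the three example texts are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows), where rows are L2-normalised TF-IDF vectors."""
    tokenised = [doc.split() for doc in docs]  # simplification: whitespace tokens
    vocab = sorted({t for toks in tokenised for t in toks})
    n_docs = len(docs)
    # Document frequency: number of texts that contain each term.
    doc_freq = Counter(t for toks in tokenised for t in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

def cosine(u, v):
    # Rows are unit length, so the dot product equals the cosine similarity.
    return sum(a * b for a, b in zip(u, v))

docs = ["kerry go paris", "kerry visit paris", "trump win election"]  # toy corpus
vocab, rows = tfidf_rows(docs)
print(vocab)
print(cosine(rows[0], rows[1]), cosine(rows[0], rows[2]))
```

Texts that share terms get a positive similarity, while texts with no terms in common score zero.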

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Build the vocabulary and the term-count matrix in one step
count_vectorizer = CountVectorizer()
freq_term_matrix = count_vectorizer.fit_transform(df_input)

# Re-weight the counts by TF-IDF and L2-normalise each row
tfidf = TfidfTransformer(norm="l2")
tf_idf_matrix = tfidf.fit_transform(freq_term_matrix)

print(tf_idf_matrix)

  (0, 63465)	0.026753191864679532
  (0, 63456)	0.029127230661968165
  (0, 63329)	0.017574068463097534
  (0, 63043)	0.0355978832627074
  (0, 63032)	0.023161109883121193
  (0, 62994)	0.035107806500506475
  (0, 62951)	0.02642747375066598
  (0, 62946)	0.008338732795996964
  (0, 62852)	0.015361736862855021
  (0, 62768)	0.01415977077238262
  (0, 62674)	0.027209931227428756
  (0, 62673)	0.0161121219800601
  (0, 62610)	0.037523294771871676
  (0, 62551)	0.04100357361375188
  (0, 62421)	0.03343829289162133
  (0, 62342)	0.034324696573588824
  (0, 62317)	0.01764470347226561
  (0, 62207)	0.013343695286012296
  (0, 62193)	0.013915650398710271
  (0, 62170)	0.05938126091694468
  (0, 62045)	0.011083466210046834
  (0, 62004)	0.025968437181222048
  (0, 61909)	0.04131271254666581
  (0, 61827)	0.031496376386581454
  (0, 61766)	0.022598295988027487
  :	:
  (6334, 5684)	0.034466651267223564
  (6334, 5674)	0.09538401268982359
  (6334, 5670)	0.11078944910604521
  (6334, 5576)	0.018711107807023963
  (6334, 5133

In [25]:
tf_idf_matrix

<6335x64374 sparse matrix of type '<class 'numpy.float64'>'
	with 1844605 stored elements in Compressed Sparse Row format>
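
The summary above implies that the matrix is very sparse. A quick calculation from the reported shape and stored-element count (the variable names are illustrative):

```python
# Values taken from the sparse-matrix summary above.
n_rows, n_cols, stored = 6335, 64374, 1844605

# Fraction of entries that are non-zero.
density = stored / (n_rows * n_cols)

# Average number of non-zero entries per row, i.e. distinct vocabulary terms per article.
avg_terms_per_article = stored / n_rows

print(f"{density:.2%} of entries are non-zero")
print(f"{avg_terms_per_article:.0f} distinct terms per article on average")
```

This is why a compressed sparse format is used: storing every entry densely would mean keeping hundreds of millions of values, almost all of them zero.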