# Predicting Fake and Real News

Karim El-Shammaa

_______________________________________________________________________________________________________________________________
# Data Importing and Preparation

Let us start by importing the libraries and the data files.

In [59]:
# Importing the Libraries
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [60]:
# Loading the data files
train = pd.read_csv('fake_or_real_news_training.csv')
test = pd.read_csv('fake_or_real_news_test.csv')

In [61]:
train.head()

Unnamed: 0,ID,title,text,label,X1,X2
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,,
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,,
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,,
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,,
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,,


First, we check for rows where data was incorrectly entered into the X1 and X2 columns.

In [62]:
train[(train.X1.notna()) | (train.X2.notna())]

Unnamed: 0,ID,title,text,label,X1,X2
192,599,Election Day: No Legal Pot In Ohio,Democrats Lose In The South,Election Day: No Legal Pot In Ohio; Democrats ...,REAL,
308,10194,Who rode it best? Jesse Jackson mounts up to f...,Leonardo DiCaprio to the rescue?,Who rode it best? Jesse Jackson mounts up to f...,FAKE,
382,356,Black Hawk crashes off Florida,human remains found,(CNN) Thick fog forced authorities to suspend ...,REAL,
660,2786,Afghanistan: 19 die in air attacks on hospital,U.S. investigating,(CNN) Aerial bombardments blew apart a Doctors...,REAL,
889,3622,Al Qaeda rep says group directed Paris magazin...,US issues travel warning,A member of Al Qaeda's branch in Yemen said Fr...,REAL,
911,7375,Shallow 5.4 magnitude earthquake rattles centr...,shakes buildings in Rome,00 UTC © USGS Map of the earthquake's epicent...,FAKE,
1010,9097,ICE Agent Commits Suicide in NYC,Leaves Note Revealing Gov’t Plans to Round-up...,Email Print After writing a lengthy suicide no...,FAKE,
1043,9203,Political Correctness for Yuengling Brewery,What About Our Opioid Epidemic?,We Are Change \n\nIn today’s political climate...,FAKE,
1218,1602,Poll gives Biden edge over Clinton against GOP...,VP meets with Trumka,A new national poll shows Vice President Biden...,REAL,
1438,4562,Russia begins airstrikes in Syria,U.S. warns of new concerns in conflict,Russian warplanes began airstrikes in Syria on...,REAL,


In two rows the label was mistakenly placed in X2. We drop these rows.

In [63]:
train = train.drop(train[train.X2.notna()].index)

We will create a new column "all" that concatenates the title with the text.

However, for the rows where the label is mistakenly put into X1, we will make sure to put it correctly under "label".

In [64]:
# Find the indices of the rows where the label is put into X1
x1_wrong_rows = train[train['X1'].notna()].index

# Find the indices of the rest of the rows
x1_normal_rows = train[train['X1'].isna()].index

# Converting the label column to string like the rest of the columns
train['label'] = train['label'].astype(str)

# Looping over the incorrect rows: the title was split across "title" and "text"
# and the article body landed in "label", so all three are concatenated under
# "all" before the real label is copied over from X1
for i in x1_wrong_rows:
    train.loc[i,'all'] = train.loc[i,'title'] + " " + train.loc[i,'text'] + " " + train.loc[i,'label']
    train.loc[i,'label'] = train.loc[i,'X1']

# Looping over the normal rows to concatenate the title with the text under the column "all"
for i in x1_normal_rows:
    train.loc[i,'all'] = train.loc[i,'title'] + " " + train.loc[i,'text']

# Printing the incorrect rows to make sure they are fixed
train.loc[x1_wrong_rows]

Unnamed: 0,ID,title,text,label,X1,X2,all
192,599,Election Day: No Legal Pot In Ohio,Democrats Lose In The South,REAL,REAL,,Election Day: No Legal Pot In Ohio Democrats ...
308,10194,Who rode it best? Jesse Jackson mounts up to f...,Leonardo DiCaprio to the rescue?,FAKE,FAKE,,Who rode it best? Jesse Jackson mounts up to f...
382,356,Black Hawk crashes off Florida,human remains found,REAL,REAL,,Black Hawk crashes off Florida human remains ...
660,2786,Afghanistan: 19 die in air attacks on hospital,U.S. investigating,REAL,REAL,,Afghanistan: 19 die in air attacks on hospital...
889,3622,Al Qaeda rep says group directed Paris magazin...,US issues travel warning,REAL,REAL,,Al Qaeda rep says group directed Paris magazin...
911,7375,Shallow 5.4 magnitude earthquake rattles centr...,shakes buildings in Rome,FAKE,FAKE,,Shallow 5.4 magnitude earthquake rattles centr...
1010,9097,ICE Agent Commits Suicide in NYC,Leaves Note Revealing Gov’t Plans to Round-up...,FAKE,FAKE,,ICE Agent Commits Suicide in NYC Leaves Note ...
1043,9203,Political Correctness for Yuengling Brewery,What About Our Opioid Epidemic?,FAKE,FAKE,,Political Correctness for Yuengling Brewery W...
1218,1602,Poll gives Biden edge over Clinton against GOP...,VP meets with Trumka,REAL,REAL,,Poll gives Biden edge over Clinton against GOP...
1438,4562,Russia begins airstrikes in Syria,U.S. warns of new concerns in conflict,REAL,REAL,,Russia begins airstrikes in Syria U.S. warns ...
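
The row-by-row loops above could also be written with a boolean mask. The following is only a minimal sketch on a made-up three-row frame (`toy` and its contents are hypothetical), assuming the same column layout as `train`:

```python
import pandas as pd

# Hypothetical toy frame that mimics a shifted row: in row 1 the title was
# split across "title" and "text", the article body landed in "label", and
# the real label ended up in "X1".
toy = pd.DataFrame({
    "title": ["Good title", "Split title", "Other title"],
    "text": ["body a", "rest of title", "body c"],
    "label": ["REAL", "body b", "FAKE"],
    "X1": [None, "FAKE", None],
})

shifted = toy["X1"].notna()

# Default case: title + text
toy["all"] = toy["title"] + " " + toy["text"]

# Shifted rows: append the body stored in "label", then restore the label
toy.loc[shifted, "all"] = toy.loc[shifted, "all"] + " " + toy.loc[shifted, "label"]
toy.loc[shifted, "label"] = toy.loc[shifted, "X1"]

print(toy[["label", "all"]])
```

The mask version avoids one `.loc` assignment per row, which tends to matter once the frame has thousands of rows.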


At this point, the "X1", "X2", and "ID" columns are not needed, so we remove them.

In [65]:
del train['X1']
del train['X2']
del train['ID']

The next step is to split the dataset into training (80%) and validation (20%) sets.

In [66]:
# create training and validation vars
y = train['label']
del train['label']
X_train, X_val, y_train, y_val = train_test_split(train['all'], y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)

(3197,) (3197,)
(800,) (800,)
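
The split above is purely random. If the REAL/FAKE ratio should be the same in both parts, `train_test_split` accepts a `stratify` argument. A small sketch on made-up labels (the toy series below are hypothetical, not the notebook's data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents, 6 REAL and 4 FAKE
docs = pd.Series([f"document {i}" for i in range(10)])
labels = pd.Series(["REAL"] * 6 + ["FAKE"] * 4)

# stratify=labels asks for the same REAL/FAKE ratio in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    docs, labels, test_size=0.5, random_state=42, stratify=labels
)

print(y_tr.value_counts().to_dict())
print(y_va.value_counts().to_dict())
```

Whether stratification changes anything for this dataset depends on how balanced the labels are, which the notebook does not report.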


We will first experiment with a Count Vectorizer, followed by a TFIDF transformation, and then fit a Naive Bayes classifier. We will use a Pipeline to chain these steps.
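
To make the weighting concrete, here is a hand-rolled sketch of TFIDF on three made-up sentences. It assumes the smoothed formula that scikit-learn documents for its default settings (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). It is an illustration only, not the library's implementation:

```python
import math
from collections import Counter

# Hypothetical toy corpus
docs = ["trump wins vote", "clinton wins vote", "vote vote vote"]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in tokenised for term in set(doc))

def tfidf(doc):
    """Return L2-normalised TFIDF weights for one tokenised document."""
    counts = Counter(doc)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
           for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(tokenised[0])
print(weights)
```

In the first document, "trump" appears in one document only and therefore gets the largest weight, while "vote" appears everywhere and gets the smallest.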

# Naive Bayes with TFIDF

In [67]:
# Creating the Pipeline
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

# Fitting the model
text_clf = text_clf.fit(X_train, y_train)

# Predicting on the validation set
predicted = text_clf.predict(X_val)

# Computing the accuracy
np.mean(predicted == y_val)

0.7625

76.25% validation accuracy, which is low.
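
Accuracy alone does not show which class is being misclassified. A confusion matrix can help here. The sketch below uses made-up labels (not the notebook's predictions) to show the calls:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, not taken from the notebook's data
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "REAL"]
y_pred = ["REAL", "REAL", "FAKE", "REAL", "REAL", "REAL", "REAL", "FAKE"]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order given
cm = confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])

print(acc)
print(cm)
```

In the notebook, the same calls would take `y_val` and `predicted` as arguments.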

Next, we will experiment with a linear SVM (an SGDClassifier with hinge loss) with Count Vectorizer and TFIDF.
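
As a rough illustration of what a hinge-loss classifier with an L2 penalty minimizes, the sketch below evaluates the objective by hand on three made-up points. It assumes the form given in the scikit-learn user guide (mean hinge loss plus `alpha * 0.5 * ||w||^2`). The function name and the data are hypothetical:

```python
def hinge_objective(w, b, X, y, alpha):
    """Mean hinge loss plus an L2 penalty; targets in y must be -1 or +1."""
    losses = []
    for features, target in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, features)) + b
        # Zero loss once the point is on the correct side with margin >= 1
        losses.append(max(0.0, 1.0 - target * score))
    penalty = 0.5 * alpha * sum(wj * wj for wj in w)
    return sum(losses) / len(losses) + penalty

# Hypothetical toy data
X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y = [+1, -1, +1]
w, b = [2.0, -2.0], 0.0

value = hinge_objective(w, b, X, y, alpha=1e-3)
print(value)
```

Only the third point contributes to the loss, because its score of 0 lies inside the margin. A larger `alpha` pushes the weights toward zero more strongly.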

# SVM with TFIDF

In [68]:
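# Creating the Pipeline
# (max_iter=5 with tol=None replaces n_iter=5, which newer scikit-learn releases no longer accept)
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42))])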

# Fitting the model
text_clf = text_clf.fit(X_train, y_train)

# Predicting on the validation set
predicted = text_clf.predict(X_val)

# Computing the accuracy
np.mean(predicted == y_val)



0.89

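89%. A large improvement over Naive Bayes. Note that this pipeline also removes English stop words, so the gain is not due to the classifier alone.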

Next, we experiment with Grid Search.
We set the parameters for the Grid Search to try out two n-gram ranges, TFIDF with and without IDF weighting, and two alpha values for the Naive Bayes.
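
To see how much work the search implies, the grid used in this section can be enumerated by hand. This is a sketch with `itertools`. The fold count is an assumption, since the default number of cross-validation folds in `GridSearchCV` depends on the scikit-learn version:

```python
from itertools import product

grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-2, 1e-3],
}

# Every combination of one value per parameter
names = list(grid)
combos = [dict(zip(names, values)) for values in product(*grid.values())]

n_folds = 3  # assumption; check the default of the installed scikit-learn
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")
```

Each extra parameter value multiplies the number of fits, which is why the grid is kept small.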

# Naive Bayes with Grid Search

In [69]:
# Creating the Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

# Setting the parameters for Grid Search
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-2, 1e-3),}

# Creating the model
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

# Fitting the model
gs_clf = gs_clf.fit(X_train, y_train)

# Printing the best score and parameters
print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.9111667187988739
{'clf__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}


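91.1%. Much better than before, although this is the mean cross-validated score on the training set rather than accuracy on the validation set, so the figures are not directly comparable. The best parameters were:
       alpha: 0.001
       IDF weighting disabled (use_idf=False); term frequencies are still normalized
       N-gram range: (1,2)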
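
Setting `use_idf=False` does not remove the transformer from the pipeline. The counts are still rescaled to unit length (the default `norm='l2'`), only the IDF factor is dropped. A small sketch on a made-up corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus
docs = ["vote vote trump", "clinton vote", "trump clinton debate"]
counts = CountVectorizer().fit_transform(docs)

tf_only = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()
with_idf = TfidfTransformer(use_idf=True).fit_transform(counts).toarray()

# Each row has length 1 in both cases because of the L2 normalization
row_norms = np.linalg.norm(tf_only, axis=1)
print(row_norms)

# The weights differ wherever terms in a document have different IDF values
print(np.allclose(tf_only, with_idf))
```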
       
Next, we will try Grid Search but with SVM this time.

# SVM with Grid Search

In [70]:
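# Creating the Pipeline
# (max_iter=5 with tol=None replaces n_iter=5, which newer scikit-learn releases no longer accept)
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42))])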

# Setting the parameters for Grid Search
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf-svm__alpha': (1e-2, 1e-3),}

# Creating the model
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

# Fitting the model
gs_clf = gs_clf.fit(X_train, y_train)

# Printing the best score and parameters
print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.9042852674382234
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}




90.4%. Slightly lower than the 91.1% obtained with Naive Bayes.

Let us experiment with stemming next. We will use the Snowball Stemmer and pair it with Naive Bayes. We will not use Grid Search with stemming because it takes too much computing time.
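
Stemming maps inflected forms of a word to a common stem, so they end up as a single feature in the vocabulary. A short sketch with a few example words (the word list is made up):

```python
from nltk.stem.snowball import SnowballStemmer

# No stop-word handling here, so no NLTK corpus download is needed for this demo
demo_stemmer = SnowballStemmer("english")

words = ["running", "runs", "elections", "election"]
stems = [demo_stemmer.stem(w) for w in words]

print(stems)
```

Fewer distinct tokens means a smaller feature matrix, at the price of occasionally merging words that differ in meaning.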

# Naive Bayes with Snowball Stemmer

In [37]:
# Importing the Stemmer
from nltk.stem.snowball import SnowballStemmer

# Initializing the Stemmer; ignore_stopwords=True leaves stop words unstemmed (needs the NLTK stopwords corpus)
stemmer = SnowballStemmer("english", ignore_stopwords=True)

# Creating a class for the Stemmer and the Count Vectorizer
class StemmedCountVectorizer(CountVectorizer):
    
    def build_analyzer(self):
        
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        
        # Returning the stems
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

# Initializing a new variable to call the class    
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

# Creating the pipeline
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer()),
                             ('clf', MultinomialNB(alpha=1e-3))])

# Fitting the Model
text_mnb_stemmed = text_mnb_stemmed.fit(X_train, y_train)

# Predicting on the validation set
predicted_mnb_stemmed = text_mnb_stemmed.predict(X_val)

# Calculating the accuracy
np.mean(predicted_mnb_stemmed == y_val)

0.90875

90.9%. A little less than our best value (91.1%), although that figure is a cross-validated score rather than validation accuracy.

Let us try the Snowball Stemmer with SVM.

# SVM with Snowball Stemmer

In [58]:
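# Creating the pipeline (the variable keeps its earlier name although the classifier is now an SVM)
# (max_iter=5 with tol=None replaces n_iter=5, which newer scikit-learn releases no longer accept)
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer()),
                             ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42))])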

# Fitting the Model
text_mnb_stemmed = text_mnb_stemmed.fit(X_train, y_train)

# Predicting on the validation set
predicted_mnb_stemmed = text_mnb_stemmed.predict(X_val)

# Calculating the accuracy
np.mean(predicted_mnb_stemmed == y_val)



0.8975

89.75%. Less than the 90.9% obtained with Naive Bayes and stemming.

However, if we lower the alpha value (the regularization strength) for the SVM to 0.0001, the model gives a noticeably higher accuracy.

In [39]:
# Creating the pipeline and changing the value of alpha to 1e-4
# (max_iter=5 with tol=None replaces n_iter=5, which newer scikit-learn releases no longer accept)
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer()),
                             ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4, max_iter=5, tol=None, random_state=42))])

# Fitting the Model
text_mnb_stemmed = text_mnb_stemmed.fit(X_train, y_train)

# Predicting on the validation set
predicted_mnb_stemmed = text_mnb_stemmed.predict(X_val)

# Calculating the accuracy
np.mean(predicted_mnb_stemmed == y_val)



0.93

93%! Our highest validation accuracy so far. We will use this model to predict the test data.

# Test Data Prediction

We will add the title to the text in the test data as we did with the training data.

In [49]:
# Getting the indices of the test set
test_row_indices = test.index

# Looping through the set to add the title to the text
for i in test_row_indices:
    test.loc[i,'all'] = test.loc[i,'title'] + " " + test.loc[i,'text']
    
test.head()

Unnamed: 0,ID,title,text,all
0,10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...,September New Homes Sales Rise——-Back To 1992 ...
1,2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...,Why The Obamacare Doomsday Cult Can't Admit It...
2,864,"Sanders, Cruz resist pressure after NY losses,...",The Bernie Sanders and Ted Cruz campaigns vowe...,"Sanders, Cruz resist pressure after NY losses,..."
3,4128,Surviving escaped prisoner likely fatigued and...,Police searching for the second of two escaped...,Surviving escaped prisoner likely fatigued and...
4,662,Clinton and Sanders neck and neck in Californi...,No matter who wins California's 475 delegates ...,Clinton and Sanders neck and neck in Californi...
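
The same concatenation can be done on whole columns. With plain `+` on columns, a missing title or text would make the combined string missing as well, so this sketch fills gaps with an empty string first. The frame below is made up, and whether the real test file contains missing values is not known from the notebook:

```python
import pandas as pd

# Hypothetical toy frame with one missing title
toy_test = pd.DataFrame({
    "title": ["Headline one", None],
    "text": ["Body one", "Body two"],
})

# fillna("") keeps a missing field from turning the whole string into NaN
toy_test["all"] = toy_test["title"].fillna("") + " " + toy_test["text"].fillna("")

print(toy_test["all"].tolist())
```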


Now let's predict the labels in the test dataset and save our predictions to a csv file.

In [52]:
# Using our final model to predict the labels on the test set
predictions = text_mnb_stemmed.predict(test['all'])

# Adding the predictions into a dataframe with the column title "label"
submission = pd.DataFrame(predictions, columns=['label'])

# Inserting the IDs from the test set as the first column of the new dataframe
submission.insert(0, 'id', test['ID'])

# Outputting the predictions into a csv file.
submission.to_csv('fake_vs_real_news_predictions.csv', index = False)
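
Before submitting, it can be worth reading the file back and checking its shape and contents. A minimal sketch using an in-memory buffer instead of a file. The three IDs are taken from the `test.head()` output above, while the labels are made up:

```python
import io
import pandas as pd

# IDs from the test.head() output; labels are hypothetical
toy_submission = pd.DataFrame({
    "id": [10498, 2439, 864],
    "label": ["FAKE", "REAL", "REAL"],
})

# Write to an in-memory buffer and read it back, as one would with the csv file
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
print(reloaded["label"].isin(["FAKE", "REAL"]).all())
print(reloaded["id"].is_unique)
```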