In [None]:
import pandas as pd
import string

from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords

In this notebook I clean the dataset by removing duplicates, stop words and special symbols. After cleaning, I tokenize and lemmatize the text so that it is ready for training.

In [2]:
data = pd.read_csv('../../../data/raw/fake_or_real_news.csv')
data = data[['title', 'text', 'label']]
data.head()

Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


I'll remove duplicate rows, i.e. articles whose title, text and label are all identical.

In [3]:
print("Initial dataset size: ", data.shape[0])
data.drop_duplicates(inplace=True)
print("Dataset size after removing duplicates: ", data.shape[0])

Initial dataset size:  6335
Dataset size after removing duplicates:  6306
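As a side note, `drop_duplicates()` without arguments only drops rows that match in every column. The sketch below uses a made-up toy dataframe (not the real dataset) to show how this differs from deduplicating on the `text` column alone via the `subset` parameter. Whether that stricter variant is appropriate for this dataset is an assumption I have not checked.

```python
import pandas as pd

# Toy data (hypothetical): the same article body appears under two different titles.
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "text": ["same body", "same body", "same body"],
    "label": ["FAKE", "FAKE", "FAKE"],
})

# Default behaviour: a row is a duplicate only if all columns match.
full_row = toy.drop_duplicates()

# Stricter variant: treat rows with the same article body as duplicates.
by_text = toy.drop_duplicates(subset="text")

print(len(toy), len(full_row), len(by_text))
```

With the full-row comparison the retitled copy survives, while `subset="text"` keeps only the first occurrence of each body.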


I'm going to concatenate the title and the article content in order to get a single piece of text. This will result in a dataframe containing just the text and its label (fake/real).

In [4]:
data['text'] = data['title'] + " " + data['text']
data = data[['text', 'label']]
data.head()

Unnamed: 0,text,label
0,You Can Smell Hillary’s Fear Daniel Greenfield...,FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,FAKE
2,Kerry to go to Paris in gesture of sympathy U....,REAL
3,Bernie supporters on Twitter erupt in anger ag...,FAKE
4,The Battle of New York: Why This Primary Matte...,REAL
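One thing to keep in mind with this kind of concatenation: if either column contains a missing value, the `+` operator produces a missing value for the whole row. I have not verified whether this dataset has missing titles or texts, so the following is only a sketch on toy data showing one possible guard with `fillna("")`.

```python
import pandas as pd

# Toy data (hypothetical): the second row has no title.
toy = pd.DataFrame({
    "title": ["Headline", None],
    "text": ["Body one", "Body two"],
})

# Naive concatenation: the missing title makes the whole result missing.
naive = toy["title"] + " " + toy["text"]

# Guarded concatenation: replace missing values with empty strings first,
# then strip the stray space left behind by an empty title.
safe = (toy["title"].fillna("") + " " + toy["text"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```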


Next, I'll lowercase the text and remove stop words, punctuation marks and other special symbols, because they carry little information that is useful for training.  
Before lemmatization, I need to tokenize the articles' contents. This means splitting the blocks of text into individual words.

In [5]:
stop = set(stopwords.words('english') + list(string.punctuation))
data['text'] = data['text'].apply(lambda x: [token.lower() for token in word_tokenize(x) if token.lower() not in stop and token.isalnum()])
data.head()

Unnamed: 0,text,label
0,"[smell, hillary, fear, daniel, greenfield, shi...",FAKE
1,"[watch, exact, moment, paul, ryan, committed, ...",FAKE
2,"[kerry, go, paris, gesture, sympathy, secretar...",REAL
3,"[bernie, supporters, twitter, erupt, anger, dn...",FAKE
4,"[battle, new, york, primary, matters, primary,...",REAL
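To make the filtering rule easier to inspect, here is a small sketch that applies the same condition (lowercase, not in the stop set, `isalnum()`) to a hand-written token list. The token list and the tiny stop set are made up for illustration and stand in for the output of `word_tokenize` and the NLTK stop word list, so the sketch runs without any NLTK data.

```python
import string

# Hypothetical, deliberately tiny stop list; the notebook uses NLTK's English list.
STOP = {"the", "to", "in", "of", "a"} | set(string.punctuation)

def filter_tokens(tokens, stop=STOP):
    """Lowercase tokens and keep only alphanumeric ones that are not stop words."""
    return [t.lower() for t in tokens if t.lower() not in stop and t.isalnum()]

# Hand-written tokens standing in for tokenizer output.
tokens = ["Kerry", "to", "go", "to", "Paris", ",", "U.S.", "officials", "said", "well-known"]

print(filter_tokens(tokens))
# ['kerry', 'go', 'paris', 'officials', 'said']
```

Note that `isalnum()` rejects any token containing a non-alphanumeric character, so abbreviations such as "U.S." and hyphenated words such as "well-known" are dropped along with the punctuation.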


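Now I can continue with the lemmatization step, which converts the tokens obtained in the previous step to their base form by removing inflectional endings. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so plural nouns are reduced (e.g. "supporters" becomes "supporter") while verb forms such as "committed" are left unchanged.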

In [6]:
lemmatizer = WordNetLemmatizer()
data['text'] = data['text'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

In [7]:
data.head()

Unnamed: 0,text,label
0,"[smell, hillary, fear, daniel, greenfield, shi...",FAKE
1,"[watch, exact, moment, paul, ryan, committed, ...",FAKE
2,"[kerry, go, paris, gesture, sympathy, secretar...",REAL
3,"[bernie, supporter, twitter, erupt, anger, dnc...",FAKE
4,"[battle, new, york, primary, matter, primary, ...",REAL


Finally, I save the preprocessed data in a separate file that will be used later for training.

In [9]:
data.to_csv('../../../data/processed/fake_or_real_news.csv')
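Because the `text` column now holds Python lists, `to_csv` writes each list as its string representation, and reading the file back yields plain strings rather than lists. The sketch below shows one way to restore the lists with `ast.literal_eval`, using a toy dataframe and an in-memory buffer instead of the real file. It also passes `index=False`, which differs from the cell above and is my assumption about what the training notebook might prefer, since writing the index adds an extra unnamed column on reload.

```python
import ast
import io

import pandas as pd

# Toy data (hypothetical) shaped like the preprocessed dataframe.
toy = pd.DataFrame({
    "text": [["kerry", "go", "paris"], ["battle", "new", "york"]],
    "label": ["REAL", "REAL"],
})

# Round-trip through CSV using an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After loading, each cell is a string such as "['kerry', 'go', 'paris']".
raw_first = loaded.loc[0, "text"]

# Parse the string representation back into a real list of tokens.
loaded["text"] = loaded["text"].apply(ast.literal_eval)

print(type(raw_first).__name__, loaded.loc[0, "text"])
```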