# Fake News Detection NLP - Data Preparation

**Goal:** To make an NLP algorithm that can detect whether an article is fake news with a high degree of accuracy. This first notebook processes my training and testing data prior to building the model, starting from the original source files, which are not yet split into training and testing sets. To begin, I am starting with the ISOT Fake News Dataset, which is described here:

https://onlineacademiccommunity.uvic.ca/isot/wp-content/uploads/sites/7295/2023/02/ISOT_Fake_News_Dataset_ReadMe.pdf

Specific details about the goals of my model and my initial hypotheses are given in _Notebook 2 - Preliminary Model_.

### Initialization
I want to import my libraries and load my data.

In [1]:
##================#
# Libraries
##================#
import pandas as pd
from pandas.errors import ParserError
import numpy as np
import regex as re
import string

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Salan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
##================#
# Load ISOT Data
##================#
real_df = pd.read_csv('data/ISOT/True.csv')
fake_df = pd.read_csv('data/ISOT/Fake.csv')

### Quick EDA
I am mainly just doing this to get certain aspects of the datasets written out in this notebook (plus it's a good habit to get into).

In [167]:
# Get shape and 5 random articles
print(f"True DF shape: {real_df.shape}")
real_df.sample(5)

True DF shape: (21417, 4)


Unnamed: 0,title,text,subject,date
11658,U.S. creating 'sensational hype' over China's ...,BEIJING (Reuters) - The United States has crea...,worldnews,"December 21, 2017"
3105,U.S. Attorney General Sessions hires private a...,WASHINGTON (Reuters) - U.S. Attorney General J...,politicsNews,"June 21, 2017"
5230,Ex-President Bush says hopeful despite 'pretty...,(Reuters) - Former U.S. president George W. Bu...,politicsNews,"February 28, 2017"
3333,Autos need devices to curb child heatstroke de...,(Reuters) - Automakers could help prevent acci...,politicsNews,"June 7, 2017"
8344,California lawmakers send governor bill author...,"SACRAMENTO, Calif. (Reuters) - California woul...",politicsNews,"August 29, 2016"


In [168]:
# Get shape and 5 random articles
print(f"Fake DF shape: {fake_df.shape}")
fake_df.sample(5)

Fake DF shape: (23481, 4)


Unnamed: 0,title,text,subject,date
10514,WOW! TRUMP REVEALS Embarrassing Story About An...,"Some day, when things calm down, I'll tell the...",politics,"Jun 29, 2017"
4092,Curt Schilling Wants Elizabeth Warren’s Seat ...,"Curt Schilling, who is challenging Elizabeth W...",News,"October 21, 2016"
20759,CA DEMOCRATS HAVE SOLUTION TO MASSIVE Health C...,When a state has legislated themselves into an...,left-news,"Apr 9, 2016"
16914,“Unvetted” Illegals Turn Germany Into A Third-...,Take Notes America! Germany s Angela Merkel ha...,Government News,"Dec 16, 2015"
22531,Active Shooter Drill Suddenly ‘Goes Live’ at J...,SEE ALSO: 21WIRE calls out US media for hyping...,US_News,"June 30, 2016"


In [5]:
# Get subject categories
print(real_df['subject'].unique().tolist())
print(fake_df['subject'].unique().tolist())

['politicsNews', 'worldnews']
['News', 'politics', 'Government News', 'left-news', 'US_News', 'Middle-east']


In [6]:
# Print info
print("True DF:\n", real_df.info())
print('-' * 40)
print("Fake DF:\n", fake_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB
True DF:
 None
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB
Fake DF:
 None


In [7]:
# See date ranges
print(real_df['date'].sort_values())
print('-' * 40)
print(fake_df['date'].sort_values())

10099        April 1, 2016 
10019        April 1, 2016 
10020        April 1, 2016 
10092        April 1, 2016 
10094        April 1, 2016 
                ...        
20543    September 9, 2017 
20542    September 9, 2017 
20540    September 9, 2017 
20555    September 9, 2017 
20549    September 9, 2017 
Name: date, Length: 21417, dtype: object
----------------------------------------
9084                                             14-Feb-18
9075                                             15-Feb-18
9076                                             15-Feb-18
9077                                             15-Feb-18
9078                                             15-Feb-18
                               ...                        
17433    https://fedup.wpengine.com/wp-content/uploads/...
15840    https://fedup.wpengine.com/wp-content/uploads/...
15839    https://fedup.wpengine.com/wp-content/uploads/...
17432    https://fedup.wpengine.com/wp-content/uploads/...
21869    https://fed

So overall, we have a few core discoveries to act on:
1. Dates are of object dtype, not date. While I may not need to worry about that just yet, I will update them regardless to be on the safe side.
2. Many of the Reuters articles begin their text field with a "CITYNAME (Reuters) - " prefix. I definitely need to remove that because I don't want the model assuming that something is factual just for mentioning Reuters (or any other source).
3. There are different categories between the two datasets, although all of the `fake_df` categories can be lumped into similarly named joint categories if need be. While I am not yet certain if I will have to do that or if I should instead retain them, I can at least make them all lowercase.
4. We have a couple more issues with dates. 
    - The dates are formatted differently between both datasets.
    - While working on the next section of this notebook, I noticed that `fake_df` has some `date` values that are not dates at all; the field has been filled with a URL instead.
        - I opened the .csv manually to look at this particular case, and every cell in that row is just a URL; the same goes for the row before it. Considering how small a contribution these two records appear to make to the dataset, I'm fine with leaving them alone, because a portion of the URL still forms a somewhat cohesive statement that the NLP algorithm can either use or ignore. However, I still want to be able to convert the column into datetime.

There are also no null values in any of the rows in either DF; this is great!
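
The date problems noted above can be counted before any conversion. The following is a minimal sketch on a toy Series rather than the real `fake_df['date']` column; the URL value is a hypothetical stand-in for the bad rows, and parsing value by value is my own choice so that one date format does not dictate how the others are read.

```python
import pandas as pd

# Toy stand-in for fake_df['date']; the formats mimic those seen in the output above
dates = pd.Series([
    "December 31, 2017",
    "Jun 29, 2017",
    "14-Feb-18",
    "https://example.com/not-a-date",  # hypothetical URL standing in for the bad rows
])

# Parse each value on its own; anything that is not a date becomes NaT
parsed = dates.apply(lambda value: pd.to_datetime(value, errors="coerce"))

# Collect the original strings that failed to parse
bad_rows = dates[parsed.isna()]
print(f"{len(bad_rows)} unparseable date value(s)")
print(bad_rows.tolist())
```

Running the same check on the real column would show how many rows end up receiving the default date later on.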

### Data Revisions
Now that I have identified a few areas where I can improve the data prior to preprocessing it, I'll just spend a short bit of time taking care of that.

My first step is to create a new column that contains the original, unmodified article text. I also want to add a column for whether an article is "real" (int = 0) or "fake" (int = 1).

In [2]:
##================#
# Gen Functions 1
##================#
def define_type(df, type_arg):
    df['fake_news'] = type_arg
    return df

def clone_text(df, column):
    df['og_article'] = df[column]
    return df

def prepare_news(df, column, type_arg):
    df = define_type(df, type_arg)
    df = clone_text(df, column)
    return df

In [4]:
real_df = prepare_news(real_df, 'text', 0)
fake_df = prepare_news(fake_df, 'text', 1)
fake_df.head()

Unnamed: 0,title,text,subject,date,fake_news,og_article
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",1,Donald Trump just couldn t wish all Americans ...
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",1,House Intelligence Committee Chairman Devin Nu...
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",1,"On Friday, it was revealed that former Milwauk..."
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",1,"On Christmas day, Donald Trump announced that ..."
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",1,Pope Francis used his annual Christmas Day mes...


In [3]:
##================#
# Gen Functions 2
##================#
def text_lower(df, column):
    df[column] = df[column].apply(str.lower)
    return

def convert_date(df, column):
    default_date = pd.to_datetime('2018-02-19')  # Default date to use for invalid values
    df[column] = pd.to_datetime(df[column], errors='coerce')
    # Assign the result back; fillna(inplace=True) on a selected column may not update df in newer pandas
    df[column] = df[column].fillna(default_date)
    return df

##================#
# Clean pipelines
##================#
# For fake_df
def format_data_fake(df):
    text_lower(df, 'subject')
    text_lower(df, 'title')
    text_lower(df, 'text')
    convert_date(df, 'date')
    return

# For real_df; also remove Reuters prefixes
def format_data_real(df):
    df['text'] = df['text'].apply(lambda x: x.lstrip().split("(Reuters) - ", 1)[-1] if isinstance(x, str) else x)
    text_lower(df, 'subject')
    text_lower(df, 'title')
    text_lower(df, 'text')
    convert_date(df, 'date')
    return

Here is a quick demonstration; I also decided to cast my article text into lowercase here, seeing as I already had the function set up:

In [6]:
real_copy = real_df.copy(deep=True)
format_data_real(real_copy)
print(real_copy['text'].iloc[0])
real_copy.head()

the head of a conservative republican faction in the u.s. congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on sunday and urged budget restraint in 2018. in keeping with a sharp pivot under way among republicans, u.s. representative mark meadows, speaking on cbs’ “face the nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in january. when they return from the holidays on wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the november congressional election campaigns approach in which republicans will seek to keep control of congress. president donald trump and his republicans want a big budget increase in military spending, while democrats also want proportional increases for non-defense “discretionary” spending on programs that support education, scientific research

Unnamed: 0,title,text,subject,date,fake_news,og_article
0,"as u.s. budget fight looms, republicans flip t...",the head of a conservative republican faction ...,politicsnews,2017-12-31,0,WASHINGTON (Reuters) - The head of a conservat...
1,u.s. military to accept transgender recruits o...,transgender people will be allowed for the fir...,politicsnews,2017-12-29,0,WASHINGTON (Reuters) - Transgender people will...
2,senior u.s. republican senator: 'let mr. muell...,the special counsel investigation of links bet...,politicsnews,2017-12-31,0,WASHINGTON (Reuters) - The special counsel inv...
3,fbi russia probe helped by australian diplomat...,trump campaign adviser george papadopoulos tol...,politicsnews,2017-12-30,0,WASHINGTON (Reuters) - Trump campaign adviser ...
4,trump wants postal service to charge 'much mor...,president donald trump called on the u.s. post...,politicsnews,2017-12-29,0,SEATTLE/WASHINGTON (Reuters) - President Donal...
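
One caveat with splitting on "(Reuters) - " is that it also cuts any article that happens to contain that string further down in its body. A hedged alternative is to anchor the match to the start of the text. This is only a sketch: the `strip_reuters_prefix` name and the 60-character dateline limit are my assumptions, and it uses the standard-library `re` module rather than `regex`.

```python
import re

# Optional dateline (no parentheses, at most 60 characters) followed by "(Reuters) -",
# matched only at the very start of the text. The 60-character cap is an assumption.
REUTERS_PREFIX = re.compile(r"^\s*[^()]{0,60}\(Reuters\)\s*-\s*")

def strip_reuters_prefix(text):
    """Remove a leading 'CITY (Reuters) - ' dateline; leave all other text untouched."""
    if not isinstance(text, str):
        return text
    return REUTERS_PREFIX.sub("", text, count=1)

late_mention = (
    "The senator said on Monday that the bill would pass, "
    "according to a later wire report (Reuters) - which was disputed."
)

print(strip_reuters_prefix("WASHINGTON (Reuters) - The head of a conservative faction"))
print(strip_reuters_prefix("(Reuters) - Automakers could help"))
print(strip_reuters_prefix(late_mention) == late_mention)
```

The function could be passed to `df['text'].apply(...)` in place of the lambda in `format_data_real`, if that behavior is preferred.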


In [7]:
# Format actual DFs
format_data_real(real_df)
format_data_fake(fake_df)

### Preprocess
I need to split and tokenize my text, and remove all of my stopwords.

Note: I already made all of my text lowercase in the previous segment, just to keep my functions (and their use) organized.

In [4]:
##================#
# Gen Functions 3
##================#
def tokenize_key_words(text):
    stop_words = set(stopwords.words('english'))
    tokenized_text = []
    for entry in text:
        tokens = word_tokenize(str(entry))
        # Remove stopwords
        filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
        # Remove punctuation from each word
        filtered_tokens = [re.sub(r'[^\w\s]', '', word) for word in filtered_tokens]
        # Remove empty strings and any non-alphabetic characters
        filtered_tokens = [word.lower() for word in filtered_tokens if word.isalpha() and len(word) > 1]
        tokenized_text.append(filtered_tokens)
    return tokenized_text

# Turn column into list
def list_text(df, column):
    text_list = df[column].tolist()
    return text_list

##================#
# Tokenize pipeline
##================#
def clear_junk(df, column):
    text = list_text(df, column)
    text = tokenize_key_words(text)
    return text
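
To make the order of the filtering steps easier to follow, here is a small dependency-free sketch of the same three steps. It is not the function above: a whitespace split stands in for `word_tokenize`, and `STOP_WORDS` is a tiny hand-written list standing in for NLTK's English stopwords, so results will differ from the real pipeline (for example, NLTK splits off possessive endings).

```python
import re

# Tiny stand-in stopword list (an assumption); the notebook uses NLTK's full English list
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on"}

def filter_tokens(text):
    # Whitespace split stands in for word_tokenize
    tokens = str(text).split()
    # 1. Drop stopwords
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    # 2. Strip punctuation from inside each token
    tokens = [re.sub(r"[^\w\s]", "", t) for t in tokens]
    # 3. Keep only purely alphabetic tokens longer than one character
    return [t.lower() for t in tokens if t.isalpha() and len(t) > 1]

sample = "The U.S. president's 21st speech, a big one!"
print(filter_tokens(sample))
```

Because punctuation is stripped after the stopword check, a token such as "U.S." survives as "us", which matches what shows up in the token columns further down.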

In [9]:
# Get token lists
real_text = clear_junk(real_df, 'text')
fake_text = clear_junk(fake_df, 'text')

In [10]:
# Turn lists into dataframe columns
real_tokens = pd.DataFrame({'tokens':real_text})
fake_tokens = pd.DataFrame({'tokens':fake_text})

# Add as column
real_df['tokens'] = real_tokens
fake_df['tokens'] = fake_tokens

print(real_df.info(), "\n", fake_df.info())

# Show results
real_df.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   title       21417 non-null  object        
 1   text        21417 non-null  object        
 2   subject     21417 non-null  object        
 3   date        21417 non-null  datetime64[ns]
 4   fake_news   21417 non-null  int64         
 5   og_article  21417 non-null  object        
 6   tokens      21417 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   title       23481 non-null  object        
 1   text        23481 non-null  object        
 2   subject     23481 non-null  object        
 3   date        23481 non-null  datetime64

Unnamed: 0,title,text,subject,date,fake_news,og_article,tokens
15934,uk minister says u.s. will drive hard bargain ...,foreign minister boris johnson said on wednesd...,worldnews,2017-11-01,0,LONDON (Reuters) - Foreign minister Boris John...,"[foreign, minister, boris, johnson, said, wedn..."
2227,merkel sees no military solution to u.s. dispu...,there is no military solution to the united st...,politicsnews,2017-08-11,0,BERLIN (Reuters) - There is no military soluti...,"[military, solution, united, states, dispute, ..."
20374,"u.s. calls on myanmar to stop violence, displa...",the white house said on monday the violent dis...,worldnews,2017-09-11,0,WASHINGTON (Reuters) - The White House said on...,"[white, house, said, monday, violent, displace..."
1457,"trump interviews four for fed chair job, to de...",u.s. president donald trump is ramping up his ...,politicsnews,2017-09-29,0,WASHINGTON (Reuters) - U.S. President Donald T...,"[us, president, donald, trump, ramping, search..."
9855,u.s. lawmaker wants hearing on bill to curb sh...,a u.s. lawmaker has called for a congressional...,politicsnews,2016-04-20,0,(Reuters) - A U.S. lawmaker has called for a c...,"[us, lawmaker, called, congressional, committe..."


### Shuffle and Prepare Data
The semifinal segment of this notebook exists to join both DataFrames together, shuffle all of the rows, then split them into train and test sets for export.

In [12]:
joint_df = pd.concat([real_df, fake_df], axis=0)
joint_df = joint_df.sample(frac=1).reset_index(drop=True)
print(joint_df.info())
joint_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   title       44898 non-null  object        
 1   text        44898 non-null  object        
 2   subject     44898 non-null  object        
 3   date        44898 non-null  datetime64[ns]
 4   fake_news   44898 non-null  int64         
 5   og_article  44898 non-null  object        
 6   tokens      44898 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 2.4+ MB
None


Unnamed: 0,title,text,subject,date,fake_news,og_article,tokens
0,hillary clinton’s forked tongue answer on e-ma...,remember when slick willy said it depends on ...,politics,2015-10-14,1,Remember when Slick Willy said It depends on ...,"[remember, slick, willy, said, depends, meanin..."
1,revealed: list of people president elect trump...,21st century wire says important: we are parti...,middle-east,2016-11-11,1,21st Century Wire says IMPORTANT: We are parti...,"[century, wire, says, important, particularly,..."
2,u.n. says still determining if myanmar crisis ...,the united nations has yet to determine whethe...,worldnews,2017-10-18,0,GENEVA (Reuters) - The United Nations has yet ...,"[united, nations, yet, determine, whether, vio..."
3,trump fires adviser bannon,president donald trump fired chief strategist ...,politicsnews,2017-08-18,0,"WASHINGTON/HAGERSTOWN, Md. (Reuters) - Preside...","[president, donald, trump, fired, chief, strat..."
4,trump tweets thanks to steve bannon for servic...,u.s. president donald trump praised his former...,politicsnews,2017-08-19,0,(Reuters) - U.S. President Donald Trump praise...,"[us, president, donald, trump, praised, former..."
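
A note on reproducibility: the shuffle above is unseeded, so the row order (and therefore any later seeded split on the reset index) can change between runs. Below is a minimal sketch of a seeded version on toy frames; the `join_and_shuffle` helper and the seed value are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for real_df and fake_df
real = pd.DataFrame({"text": ["r1", "r2", "r3"], "fake_news": 0})
fake = pd.DataFrame({"text": ["f1", "f2", "f3"], "fake_news": 1})

def join_and_shuffle(frames, seed):
    """Concatenate the frames, shuffle all rows with a fixed seed, and rebuild the index."""
    joint = pd.concat(frames, axis=0)
    return joint.sample(frac=1, random_state=seed).reset_index(drop=True)

first = join_and_shuffle([real, fake], seed=23)
second = join_and_shuffle([real, fake], seed=23)

# With the same seed, both runs produce the same row order
print(first.equals(second))
```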


Small hurdle: In cycling the previous code a few times, I discovered that some articles have no text or tokens whatsoever, but they are not NaNs. I'm removing these from the DataFrame because there's no text to classify, and a headline alone doesn't provide much information to go off of. I also realized that I still have some punctuation and individual characters in my text column, which should probably be removed to ensure parity between the text vocabulary and the token list.

In [13]:
# Remove rows where the 'tokens' column contains empty lists
joint_df = joint_df[joint_df['tokens'].apply(lambda x: len(x) > 0)]
print(joint_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44266 entries, 0 to 44897
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   title       44266 non-null  object        
 1   text        44266 non-null  object        
 2   subject     44266 non-null  object        
 3   date        44266 non-null  datetime64[ns]
 4   fake_news   44266 non-null  int64         
 5   og_article  44266 non-null  object        
 6   tokens      44266 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 2.7+ MB
None
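
Before dropping the empty rows, it can be useful to see which class they come from. The sketch below works on a toy frame and is only one possible way to write the filter; the added `.copy()` is my suggestion, meant to keep later column assignments on the filtered frame from raising a chained-assignment warning.

```python
import pandas as pd

# Toy stand-in for joint_df
toy = pd.DataFrame({
    "text": ["some article text", "", "another article"],
    "tokens": [["article", "text"], [], ["another", "article"]],
    "fake_news": [0, 1, 1],
})

# Length of each token list; .str.len() works element-wise on lists as well as strings
token_counts = toy["tokens"].str.len()

# How many empty rows each class would lose
print(toy.loc[token_counts == 0, "fake_news"].value_counts())

# Keep only rows with at least one token, as an independent copy
filtered = toy[token_counts > 0].copy()
print(len(filtered))
```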


In [5]:
##================#
# Gen Functions 4
##================#
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_single_characters(text):
    return ' '.join([word for word in text.split() if len(word) > 1])

##================#
# Scrub pipeline
##================#
def final_clean(df, column):
    df[column] = df[column].apply(remove_punctuation).apply(remove_single_characters)
    return df

In [15]:
joint_df = final_clean(joint_df, 'text')
joint_df = final_clean(joint_df, 'title')
joint_df.head()

Unnamed: 0,title,text,subject,date,fake_news,og_article,tokens
0,hillary clinton’s forked tongue answer on emai...,remember when slick willy said it depends on w...,politics,2015-10-14,1,Remember when Slick Willy said It depends on ...,"[remember, slick, willy, said, depends, meanin..."
1,revealed list of people president elect trump ...,21st century wire says important we are partic...,middle-east,2016-11-11,1,21st Century Wire says IMPORTANT: We are parti...,"[century, wire, says, important, particularly,..."
2,un says still determining if myanmar crisis is...,the united nations has yet to determine whethe...,worldnews,2017-10-18,0,GENEVA (Reuters) - The United Nations has yet ...,"[united, nations, yet, determine, whether, vio..."
3,trump fires adviser bannon,president donald trump fired chief strategist ...,politicsnews,2017-08-18,0,"WASHINGTON/HAGERSTOWN, Md. (Reuters) - Preside...","[president, donald, trump, fired, chief, strat..."
4,trump tweets thanks to steve bannon for servic...,us president donald trump praised his former c...,politicsnews,2017-08-19,0,(Reuters) - U.S. President Donald Trump praise...,"[us, president, donald, trump, praised, former..."


Lastly, we need to split our data and export it. I also want to be sure that the proportion of real news to fake news isn't too skewed toward one or another (i.e. I want balanced datasets). 

In [18]:
# Split the DataFrame into two new DataFrames, so that I have training and testing data ready
df1 = joint_df.sample(frac=0.7, random_state=23)    # A seed is set for consistency
df2 = joint_df.drop(df1.index)

# Verify lengths; df1 should hold about 70% of the rows and df2 the remaining 30%
print(df1.info())
print("-" * 30)
print(df2.info())

# Count numbers of real to fake news articles
print(f"Number of rows with value 1 in df1's 'fake_news' column: {df1['fake_news'].value_counts().get(1, 0)}")
print(f"Number of rows with value 1 in df2's 'fake_news' column: {df2['fake_news'].value_counts().get(1, 0)}")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30986 entries, 16217 to 34832
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   title       30986 non-null  object        
 1   text        30986 non-null  object        
 2   subject     30986 non-null  object        
 3   date        30986 non-null  datetime64[ns]
 4   fake_news   30986 non-null  int64         
 5   og_article  30986 non-null  object        
 6   tokens      30986 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.9+ MB
None
------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13280 entries, 0 to 44883
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   title       13280 non-null  object        
 1   text        13280 non-null  object        
 2   subject     13280 non-null  object        
 3 

In [190]:
df1.head()

Unnamed: 0,title,text,subject,date,fake_news,og_article,tokens
16219,senators introduce bill to block expansion of ...,small group of bipartisan senators introduced ...,politicsnews,2016-05-19,0,WASHINGTON (Reuters) - A small group of bipart...,"[small, group, bipartisan, senators, introduce..."
15933,exfbi chief comey to testify next week in russ...,former fbi director james comey will testify n...,politicsnews,2017-06-01,0,WASHINGTON (Reuters) - Former FBI Director Jam...,"[former, fbi, director, james, comey, testify,..."
26626,germanys greens all but rule out threeway jama...,germany greens on saturday all but ruled out t...,worldnews,2017-09-09,0,BERLIN (Reuters) - Germany s Greens on Saturda...,"[germany, greens, saturday, ruled, threeway, c..."
29859,humiliating democrats use russian warships as ...,maybe hillary russian uranium deal included ca...,politics,2016-07-29,1,Maybe Hillary s Russian uranium deal included ...,"[maybe, hillary, russian, uranium, deal, inclu..."
3307,report melania and barron trump will not be mo...,presidentelect donald trump recently asked his...,news,2016-11-20,1,President-elect Donald Trump recently asked hi...,"[presidentelect, donald, trump, recently, aske..."
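
The split above is purely random, so the class balance of each half is left to chance (with this many rows it should land close anyway). If an exact match is wanted, one option is to sample within each class. This is a sketch on toy data, assuming a pandas version that provides `DataFrameGroupBy.sample` (1.1 or later); it is an alternative, not what the notebook actually ran.

```python
import pandas as pd

# Toy frame with a 60/40 class mix, standing in for joint_df
toy = pd.DataFrame({
    "tokens": [["w"]] * 100,
    "fake_news": [1] * 60 + [0] * 40,
})

# Sample 70% within each class so both splits keep the original class ratio
train = toy.groupby("fake_news").sample(frac=0.7, random_state=23)
test = toy.drop(train.index)

# Compare class proportions in the two splits
print(train["fake_news"].value_counts(normalize=True))
print(test["fake_news"].value_counts(normalize=True))
```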


### Select Features
To put it bluntly, some features just don't get used in the Word2Vec models I have decided to use. I can export `joint_df` into its own file, then clean the remaining two for use in Notebook 2.

In [21]:
# Reset the index so each DataFrame is numbered from 0 again
joint_df = joint_df.reset_index(drop=True)
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# Drop unused columns; they take up space and go totally unused
df1 = df1.drop(['og_article', 'title', 'text', 'subject', 'date'], axis=1)
df2 = df2.drop(['og_article', 'title', 'text', 'subject', 'date'], axis=1)

# Export data
df1.to_csv('data/processedISOT/Train.csv')
df2.to_csv('data/processedISOT/Test.csv')
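
One thing to keep in mind when reading these files back: CSV stores each token list as its string representation, so the `tokens` column has to be parsed again after loading. The sketch below shows the round trip on a toy frame with an in-memory buffer; passing `index=False` is my own choice here and differs from the export above, which also writes the index.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({
    "fake_news": [0, 1],
    "tokens": [["white", "house"], ["wire", "says"]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
raw_cell = loaded["tokens"].iloc[0]
print(type(raw_cell))  # the cell is text after the round trip, not a list

# Convert the string representation back into a Python list
loaded["tokens"] = loaded["tokens"].apply(ast.literal_eval)
print(loaded["tokens"].iloc[0])
```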

### Prepare Secondary Test Set
By design, my models have been prepared using only the ISOT dataset. This is because I want to use additional data sources to further test and explore the models and what we can learn from them. For this purpose, I have chosen to grab a Kaggle dataset, sourced from here: https://www.kaggle.com/competitions/fake-news/data?select=train.csv

In [16]:
##================#
# Load Kaggle Data
##================#
kaggle_df = pd.read_csv('data/kaggle_dataset/kaggle_train.csv')
kaggle_df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


The models were trained using Word2Vec, with only the text and label columns being used from the ISOT dataset. This is because I wanted to avoid the chance of having models that were trained to recognize a specific source as valid or invalid; the goal is to attempt to classify news as real or fake, and we must be cautious of assuming that a given source will always provide real or fake news.

In [17]:
##================#
# Ready Kaggle
##================#
kaggle_df = kaggle_df.drop(['id', 'title', 'author'], axis=1)
kaggle_df = kaggle_df.rename(columns={'label':'fake_news'})
kaggle_df['text'] = kaggle_df['text'].astype(str)
kaggle_df['text'] = kaggle_df['text'].apply(str.lower)
kaggle_df = final_clean(kaggle_df, 'text')
kaggle_df.sample(5)

Unnamed: 0,text,fake_news
3379,does golf lead to uncontrollable rage by tanne...,1
11971,waitress was fired after retrieving gun for du...,0
2453,dont worry the people are coming to take back ...,1
11408,political analyst of the daily mail introduced...,1
5245,until the last mouse click of “to build better...,0
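
A hedged observation about the `astype(str)` step: in the pandas versions I am familiar with, casting a missing value to `str` produces the literal text "nan", which would then pass through tokenization as a one-word article instead of being dropped as empty. If the Kaggle `text` column contains missing values (I have not verified this here), dropping them first avoids that. The sketch below uses a toy frame in place of `kaggle_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for kaggle_df, with one missing article body
toy = pd.DataFrame({
    "text": ["Real article body", np.nan, "Another body"],
    "label": [0, 1, 1],
})

# Drop rows with missing text first, so no placeholder string is created for them
clean = toy.dropna(subset=["text"]).copy()
clean["text"] = clean["text"].astype(str).str.lower()

print(len(toy), "->", len(clean))
print(clean["text"].tolist())
```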


In [18]:
##================#
# Tokenize Kaggle
##================#
kaggle_text = clear_junk(kaggle_df, 'text')
kaggle_tokens = pd.DataFrame({'tokens':kaggle_text})
kaggle_df['tokens'] = kaggle_tokens
print(kaggle_df.info())
kaggle_df.tail()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       20800 non-null  object
 1   fake_news  20800 non-null  int64 
 2   tokens     20800 non-null  object
dtypes: int64(1), object(2)
memory usage: 487.6+ KB
None


Unnamed: 0,text,fake_news,tokens
20795,rapper unloaded on black celebrities who met w...,0,"[rapper, unloaded, black, celebrities, met, do..."
20796,when the green bay packers lost to the washing...,0,"[green, bay, packers, lost, washington, redski..."
20797,the macy’s of today grew from the union of sev...,0,"[macy, today, grew, union, several, great, nam..."
20798,nato russia to hold parallel exercises in balk...,1,"[nato, russia, hold, parallel, exercises, balk..."
20799,david swanson is an author activist journalist...,1,"[david, swanson, author, activist, journalist,..."


In [19]:
##================#
# Clean Kaggle
##================#
kaggle_df = kaggle_df[kaggle_df['tokens'].apply(lambda x: len(x) > 0)]
print(kaggle_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20719 entries, 0 to 20799
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       20719 non-null  object
 1   fake_news  20719 non-null  int64 
 2   tokens     20719 non-null  object
dtypes: int64(1), object(2)
memory usage: 647.5+ KB
None


I found that an issue I could not identify was causing problems when testing my models with this data, so I decided to drop the last 10,000 rows (nearly 50%) of the dataset. I only need a smallish subset for testing anyhow, so nothing important is lost by doing this.

In [20]:
kaggle_df = kaggle_df.drop(kaggle_df.tail(10000).index)
kaggle_df = kaggle_df.dropna()
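
For reference, dropping the last rows by label is equivalent to a positional slice as long as the index values are unique, and it is worth checking the class mix of what remains. This is a small sketch on toy data, where `n_drop` stands in for the 10,000 rows removed above.

```python
import pandas as pd

# Toy frame with a non-default index, as kaggle_df has after filtering
toy = pd.DataFrame({"fake_news": [0, 1] * 10}, index=range(100, 120))
n_drop = 5  # stands in for the 10,000 rows dropped above

by_label = toy.drop(toy.tail(n_drop).index)   # approach used in the notebook
by_position = toy.iloc[:-n_drop]              # positional slice

# Both approaches keep the same rows when index values are unique
print(by_label.equals(by_position))

# Class proportions in the remaining subset
print(by_position["fake_news"].value_counts(normalize=True))
```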

In [21]:
# Verify number of articles flagged as fake news
kaggle_df.loc[kaggle_df.fake_news == 1].count()

text         5354
fake_news    5354
tokens       5354
dtype: int64

In [22]:
##================#
# Export Kaggle
##================#
kaggle_df = kaggle_df.drop(['text'], axis=1)
kaggle_df.to_csv('data/processedKaggle/Test_2.csv')