# Importing the Necessary Libraries

pandas is used to load and manipulate the DataFrames and CSV files. shuffle from sklearn.utils randomizes the row order of each dataset so that the rows are not stored in any meaningful sequence. train_test_split is used to split the dataset into training, validation and test sets. os is used to remove the temporary intermediate file at the end.

In [1]:
import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import os

In [2]:
fake_news = pd.read_csv('fake_news.csv')
real_news = pd.read_csv('real_news.csv')

In [3]:
fake_news = shuffle(fake_news, random_state=42)
real_news = shuffle(real_news, random_state=42)

In [4]:
fake_news

Unnamed: 0,title,text,label
23962,Iceland Election: Pirate Party prepares for ma...,Click Here To Learn More About Alexandra's Per...,1
4984,"When Obama Admin Went After Banks, It Forced T...",Share on Twitter \nA new study shows that afte...,1
34975,ANTI-HILLARY HALLOWEEN HOUSE Gets Violent Thre...,Via: TMZ,1
8944,HILLARY IS FURIOUS OVER EMAIL HACKS…Openly Thr...,But the media s concerned Trump is the threat ...,1
441,This University Will Punish You For Being Rap...,Imagine being a woman going to a university wh...,1
...,...,...,...
16850,US Coalition Airstrike on Syrian Army in Al-Ta...,"Bouthaina Shaaban, Political & Media Advisor t...",1
6265,"HUD OFFICIAL Spends $366,000 in Fed Funds on B...",WHY IS THIS CRIMINAL ALLOWED TO JUST SKATE BY ...,1
11284,CONGRESS JUST DEALT A BIG BLOW To Obama And Hi...,Obama has shown favoritism towards the Muslim ...,1
860,TUCKER CARLSON Exposes Radical Middle School T...,"Tonight, Tucker Carlson took on Yvette Felarca...",1


In [5]:
real_news

Unnamed: 0,title,text,label
20340,Two People Die after Eating Raw-Milk Cheese Ma...,Two people have died following an outbreak of ...,0
27348,DUP blames Sinn Fein for Northern Ireland talk...,BELFAST (Reuters) - A senior member of Norther...,0
11114,Exclusive: Former top Brazil prosecutor says s...,BRASILIA (Reuters) - Three senior Brazilian la...,0
23881,U.S. intelligence chief says Russia involvemen...,WASHINGTON (Reuters) - U.S. Director of Nation...,0
9561,Macri's coalition poised to win key Argentina ...,BUENOS AIRES (Reuters) - Argentine President M...,0
...,...,...,...
16850,NY Assembly speaker drives home priorities ahe...,NEW YORK (Reuters) - New York’s top Democratic...,0
6265,"As Reproductive Rights Hang In The Balance, De...",WASHINGTON -- Forty-three years after the Supr...,0
11284,Report Accuses Facebook of Gender Bias Against...,A newly published report has accused Facebook ...,0
860,2 Valedictorians in Texas Declare Undocumented...,"When Mayte Lara Ibarra, the valedictorian of h...",0


In [6]:
combined_news = pd.concat([fake_news, real_news])

In [7]:
combined_news.to_csv('combined.csv', index=False)

In [8]:
combined_news = pd.read_csv('combined.csv')

# Creating the Train, Test and Validation Sets

An 80-20 split is a common convention, but since a validation set is also needed, the data is split 80-10-10: 80% for training, 10% for validation and 10% for testing. The split is done with train_test_split. It is essential that an article that appears in the training set does not also appear in the validation or test set, and likewise that nothing is shared between the validation and test sets. This keeps the validation and test sets truly "unseen" relative to the training set while remaining representative of the data the model will encounter. Additional checks for this are carried out later in the notebook.

In [9]:
# Add a unique identifier column to combined_news DataFrame
combined_news['unique_id'] = range(len(combined_news))

# Define the percentages for splitting
train_percent = 0.8
val_percent = 0.1
test_percent = 0.1

# Group the data by the 'label' column
grouped = combined_news.groupby('label')

# Collect the per-label splits in lists and concatenate once at the end
train_parts, val_parts, test_parts = [], [], []

# Iterate over each label group so every split keeps the same class balance
for label, group in grouped:
    # Split the group into train and remaining data
    train, remaining = train_test_split(group, train_size=train_percent, random_state=42)

    # Split the remaining data into validation and test sets (0.1 / 0.2 = 0.5)
    val, test = train_test_split(remaining, test_size=test_percent / (val_percent + test_percent), random_state=42)

    # Append the splits to the respective lists
    train_parts.append(train)
    val_parts.append(val)
    test_parts.append(test)

# Concatenate the per-label splits into the final DataFrames
train_data = pd.concat(train_parts)
val_data = pd.concat(val_parts)
test_data = pd.concat(test_parts)

# Reset the indices for the DataFrames
train_data.reset_index(drop=True, inplace=True)
val_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

# Save datasets to CSV files
train_data.to_csv('train.csv', index=False)
val_data.to_csv('val.csv', index=False)
test_data.to_csv('test.csv', index=False)
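
The per-label loop above keeps the class balance the same in every split. As a minimal sketch of an alternative, assuming the stratify argument of train_test_split is acceptable for this dataset, the same 80-10-10 scheme can be written without grouping by hand. The function name stratified_three_way_split and the toy DataFrame are hypothetical and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_three_way_split(df, label_col="label", train_size=0.8, seed=42):
    """Split df into train/val/test, preserving label proportions in each part."""
    # First cut: train vs. everything else, stratified on the label
    train, rest = train_test_split(
        df, train_size=train_size, stratify=df[label_col], random_state=seed
    )
    # Second cut: halve the remainder into validation and test
    val, test = train_test_split(
        rest, test_size=0.5, stratify=rest[label_col], random_state=seed
    )
    return (
        train.reset_index(drop=True),
        val.reset_index(drop=True),
        test.reset_index(drop=True),
    )


# Toy data (made up for illustration): 40 rows of label 0, 60 rows of label 1
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(100)],
    "label": [0] * 40 + [1] * 60,
})

toy_train, toy_val, toy_test = stratified_three_way_split(toy)
print(len(toy_train), len(toy_val), len(toy_test))
print(toy_train["label"].value_counts().to_dict())
```

Stratifying only controls the label proportions. It does not by itself prevent identical articles from landing in different splits if the source data contains duplicates.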

In [10]:
train_data

Unnamed: 0,title,text,label,unique_id
0,Trump Accused of ‘Environmental Racism’ in Wit...,"In a leap of reason, a several journals and ...",0,60151
1,Factbox: Where the bookies and trading exchang...,NEW YORK (Reuters) - A Reuters/Ipsos tracking ...,0,55787
2,Chinese official calls Trump 'irrational' on t...,WASHINGTON (Reuters) - Chinese Finance Ministe...,0,66255
3,"Factbox: Five facts about Mike Pence, Trump's ...",(Reuters) - Republican Donald Trump will name ...,0,46220
4,Mnangagwa told Mugabe he will be safe in Zimba...,HARARE (Reuters) - Incoming Zimbabwe leader Em...,0,62789
...,...,...,...,...
57670,OBAMA’S ARROGANCE: WATCH As He Admonishes Repo...,Being called out for his utter incompetence as...,1,16850
57671,IV Varitas,Mail with questions or comments about this sit...,1,6265
57672,BLACK FEMALE DEMOCRAT ON HER SUPPORT FOR TRUMP...,An Atlanta Democrat makes some very valid poin...,1,11284
57673,“They Will Kill Him Before They Let Him Become...,This letter was originally attributed to well-...,1,860


In [11]:
train_data.value_counts("label")

label
1    29653
0    28022
dtype: int64

In [12]:
val_data

Unnamed: 0,title,text,label,unique_id
0,Turkey's Erdogan angers critics with plan to r...,ISTANBUL (Reuters) - President Tayyip Erdogan ...,0,39429
1,Austin Schools Jump on ’Sanctuary’ Bandwagon,"In a special session held Monday evening, Aust...",0,49957
2,Why the way we pick our VPs is terrible,Mike Pence and Tim Kaine will take the stage f...,0,48566
3,Trump unlikely to be able to renegotiate clima...,COLOGNE (Reuters) - Donald Trump would be “hig...,0,56707
4,Key justice Kennedy wavers as Supreme Court co...,WASHINGTON (Reuters) - A closely divided Supre...,0,60961
...,...,...,...,...
7205,Smart Meter Case Testimony Before the Pennsylv...,By Catherine J. Frompovich This is the continu...,1,127
7206,CLOAKED IN CONSPIRACY: Overview of JFK Files R...,Shawn Helton 21st Century WireSince late Octob...,1,24891
7207,Revolutions and Reforms: An Eroded Culture,“Those who make peaceful revolution impossible...,1,2140
7208,A Sick Little Boy Just Found a Heart Donor Hou...,First Lady Melania Trump has been busy praying...,1,5455


In [13]:
val_data.value_counts("label")

label
1    3707
0    3503
dtype: int64

In [14]:
test_data

Unnamed: 0,title,text,label,unique_id
0,Mexico Vows $50 Million Legal Fund to Fight U....,MEXICO CITY (AP) — Mexico’s top diplomat sa...,0,58400
1,"EU, U.S. high-level meeting on laptop ban to b...",BRUSSELS (Reuters) - The European Union and th...,0,65393
2,"In valedictory speech, Obama takes note of val...",ATHENS (Reuters) - President Barack Obama exto...,0,52740
3,"Harold Hayes, Survivor of Secret World War II ...","Harold Hayes, the last surviving member of a b...",0,47619
4,Zimbabwe's Mugabe sacks VP seen as top success...,HARARE (Reuters) - Zimbabwe s President Robert...,0,44991
...,...,...,...,...
7205,“YOU’RE HIRED!” Trump Pulls Unemployed Vet Fro...,"No matter which candidate you support, this mo...",1,10600
7206,Project Veritas 4: Robert Creamer's Illegal $2...,Project Veritas 4: Robert Creamer's Illegal ...,1,22253
7207,LOL! FAKE INDIAN Elizabeth Warren Returns DNA ...,"In February, Boston-based entrepreneur and inv...",1,9192
7208,Hollande set to lose French presidency after c...,"UK Express October 26, 2016 \nThe European lea...",1,27875


In [15]:
test_data.value_counts("label")

label
1    3707
0    3503
dtype: int64

In [16]:
# Load the CSV files into DataFrames
train_data = pd.read_csv('train.csv')
val_data = pd.read_csv('val.csv')
test_data = pd.read_csv('test.csv')

# Get unique IDs for each dataset
train_unique_ids = set(train_data['unique_id'])
val_unique_ids = set(val_data['unique_id'])
test_unique_ids = set(test_data['unique_id'])

# Find unique IDs in train not in val or test
train_not_in_val = train_unique_ids - val_unique_ids
train_not_in_test = train_unique_ids - test_unique_ids

# Find unique IDs in val not in train or test
val_not_in_train = val_unique_ids - train_unique_ids
val_not_in_test = val_unique_ids - test_unique_ids

# Find unique IDs in test not in train or val
test_not_in_train = test_unique_ids - train_unique_ids
test_not_in_val = test_unique_ids - val_unique_ids

print(f"Rows in train_data: {len(train_data)}")
print(f"Rows in train_data not in val_data: {len(train_not_in_val)}")
print(f"Rows in train_data not in test_data: {len(train_not_in_test)}")
print("")

print(f"Rows in val_data: {len(val_data)}")
print(f"Rows in val_data not in train_data: {len(val_not_in_train)}")
print(f"Rows in val_data not in test_data: {len(val_not_in_test)}")
print("")

print(f"Rows in test_data: {len(test_data)}")
print(f"Rows in test_data not in train_data: {len(test_not_in_train)}")
print(f"Rows in test_data not in val_data: {len(test_not_in_val)}")

Rows in train_data: 57675
Rows in train_data not in val_data: 57675
Rows in train_data not in test_data: 57675

Rows in val_data: 7210
Rows in val_data not in train_data: 7210
Rows in val_data not in test_data: 7210

Rows in test_data: 7210
Rows in test_data not in train_data: 7210
Rows in test_data not in val_data: 7210


In [17]:
os.remove("combined.csv")

In [18]:
# Check overlap between train_data and val_data
overlap_train_val = len(set(train_data[['title','text']].apply(tuple, axis=1)).intersection(val_data[['title','text']].apply(tuple, axis=1)))
print(f"Overlap between train_data and val_data: {overlap_train_val} common texts")

# Check overlap between train_data and test_data
overlap_train_test = len(set(train_data[['title','text']].apply(tuple, axis=1)).intersection(test_data[['title','text']].apply(tuple, axis=1)))
print(f"Overlap between train_data and test_data: {overlap_train_test} common texts")

# Check overlap between val_data and test_data
overlap_val_test = len(set(val_data[['title','text']].apply(tuple, axis=1)).intersection(test_data[['title','text']].apply(tuple, axis=1)))
print(f"Overlap between val_data and test_data: {overlap_val_test} common texts")

Overlap between train_data and val_data: 1325 common texts
Overlap between train_data and test_data: 1286 common texts
Overlap between val_data and test_data: 167 common texts
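
The three checks above repeat the same long expression. As a sketch, the comparison could be wrapped in a small helper. The name count_shared_rows and the toy frames are hypothetical and made up for illustration. The helper counts distinct (title, text) pairs found in both frames; how missing values compare was not checked here, so results on data with NaN entries may differ from the expression above.

```python
import pandas as pd


def count_shared_rows(left, right, cols=("title", "text")):
    """Count distinct value combinations of `cols` that occur in both frames."""
    cols = list(cols)
    # itertuples(index=False, name=None) yields plain tuples, one per row
    left_keys = set(left[cols].itertuples(index=False, name=None))
    right_keys = set(right[cols].itertuples(index=False, name=None))
    return len(left_keys & right_keys)


# Toy frames (hypothetical): two (title, text) pairs appear in both
frame_a = pd.DataFrame({"title": ["t1", "t2", "t3"], "text": ["x", "y", "z"]})
frame_b = pd.DataFrame({"title": ["t2", "t9", "t3"], "text": ["y", "z", "z"]})

shared_ab = count_shared_rows(frame_a, frame_b)
print(f"Overlap between frame_a and frame_b: {shared_ab} common rows")
```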


In [19]:
duplicates_count = train_data.duplicated(subset=['text']).sum()
print(f"Number of duplicate entries in the 'text' column of train_data: {duplicates_count}")

duplicates_count = val_data.duplicated(subset=['text']).sum()
print(f"Number of duplicate entries in the 'text' column of val_data: {duplicates_count}")

duplicates_count = test_data.duplicated(subset=['text']).sum()
print(f"Number of duplicate entries in the 'text' column of test_data: {duplicates_count}")

Number of duplicate entries in the 'text' column of train_data: 6241
Number of duplicate entries in the 'text' column of val_data: 195
Number of duplicate entries in the 'text' column of test_data: 212
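
The overlap and duplicate counts above are non-zero, which suggests that the same article text occurs more than once in the source data, so splitting by row alone cannot keep the splits disjoint by content. One possible remedy, which the original notebook does not apply, is to drop duplicate texts before splitting. The sketch below uses a made-up toy DataFrame, and the dd_ variable names are hypothetical. Keeping the first occurrence of each text is an assumption and may not be appropriate if duplicates carry conflicting labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 50 distinct texts, each appearing twice
toy_news = pd.DataFrame({
    "title": [f"title {i}" for i in range(100)],
    "text": [f"story {i}" for i in range(50)] * 2,
    "label": [i % 2 for i in range(100)],
})

# Keep only the first occurrence of each text (assumption: later copies are redundant)
deduped = toy_news.drop_duplicates(subset=["text"]).reset_index(drop=True)

# Same 80-10-10 scheme as in the notebook, applied to the deduplicated rows
dd_train, dd_rest = train_test_split(deduped, train_size=0.8, random_state=42)
dd_val, dd_test = train_test_split(dd_rest, test_size=0.5, random_state=42)

# Collect any text that is shared between two splits; this should be empty
train_texts, val_texts, test_texts = (
    set(dd_train["text"]), set(dd_val["text"]), set(dd_test["text"])
)
shared_texts = (train_texts & val_texts) | (train_texts & test_texts) | (val_texts & test_texts)

print(f"Rows before/after deduplication: {len(toy_news)}/{len(deduped)}")
print(f"Split sizes: {len(dd_train)}, {len(dd_val)}, {len(dd_test)}")
print(f"Texts shared between splits: {len(shared_texts)}")
```

Exact-match deduplication only catches identical strings. Near-duplicates, such as the same article with different whitespace or a different title, would need a separate normalization step.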
