# Importing the Necessary Libraries

pandas is used to load, modify and save the DataFrames and CSV files. shuffle from sklearn.utils randomizes the row order of each dataset, with a fixed random_state so the result is reproducible. train_test_split is used to split the dataset into training, validation and test sets. os is used to remove the temporary, intermediate file at the end.

In [1]:
import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import os

In [2]:
fake_news = pd.read_csv('../BaseDataset/fake_news.csv')
real_news = pd.read_csv('../BaseDataset/real_news.csv')

In [3]:
fake_news = shuffle(fake_news, random_state=42)
real_news = shuffle(real_news, random_state=42)
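
As a minimal sketch of what the shuffle step does, the snippet below uses a small hypothetical DataFrame (toy is an illustrative stand-in, not part of the dataset). It shows that a fixed random_state gives the same order on every run and that rows keep their original index labels, which is why the index column of the shuffled frames appears out of order.

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical toy frame standing in for fake_news / real_news
toy = pd.DataFrame({"title": ["a", "b", "c", "d", "e"],
                    "label": [1, 1, 1, 1, 1]})

first = shuffle(toy, random_state=42)
second = shuffle(toy, random_state=42)

# The same seed produces the same row order on both calls
same_order = first.index.tolist() == second.index.tolist()

# Rows are only reordered: every original index label is still present
same_rows = sorted(first.index.tolist()) == toy.index.tolist()

print(same_order, same_rows)
```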

In [4]:
fake_news

Unnamed: 0,title,text,label
12907,"watch: trump defends russia, so john mccain ha...","donald trump is vladimir putin s puppet, so of...",1
12408,donald trump’s new campaign slogan is literall...,"seriously, he really picked this as his 2020 s...",1
196,over 500 russian and egyptian troops train to ...,media skeptic over 500 russian and egyptian tr...,1
24255,gop presidential candidate marco rubio casts d...,one more republican doing his part to aid obam...,1
15387,bruce campbell shatters bloodied ‘trump suppor...,while it s usually donald trump s fans that ma...,1
...,...,...,...
21575,leaked: hillary drunk in the afternoon…needs t...,"an aug. 8, 2015, email exchange between clinto...",1
5390,how safe is food packaging to our health?,"keywords: food packaging , packaging one of th...",1
860,amazing meditative session right here about wh...,amazing meditative session right here about wh...,1
15795,republicans assault the poor with a new plan t...,paul ryan and the republican leadership are ro...,1


In [5]:
real_news

Unnamed: 0,title,text,label
15961,"in trump's ohio bastion, supporters dismiss up...","little hocking, ohio (reuters) - while revelat...",0
14652,trump discusses immigration ideas in dinner wi...,washington (reuters) - president donald trump ...,0
10021,"for those keeping score, american women domina...",rio de janeiro — the very idea of a medal coun...,0
14552,u.s. small-cap firms look to spend tax savings...,new york (reuters) - a texas-based chain of st...,0
6612,why land and homes actually tend to be disappo...,buy land: they’re not making it anymore. that ...,0
...,...,...,...
16850,trump: special counsel appointment 'hurts our ...,washington (reuters) - president donald trump ...,0
6265,"scarborough: 9th circuit ruling is ’laughable,...","friday on msnbc’s “morning joe,” host joe scar...",0
11284,"trump, rnc announce joint fundraising deal",the move will help trump consolidate the repub...,0
860,"racist note that led to protests, cancelled cl...",administrators at st. olaf’s college revealed ...,0


In [6]:
combined_news = pd.concat([fake_news, real_news])

In [7]:
combined_news.to_csv('combined.csv', index=False)

In [8]:
combined_news = pd.read_csv('combined.csv')

# Creating the Train, Test and Validation Sets

An 80-20 split is a commonly used default, but since a validation set is also needed, the remaining 20% is divided equally, giving an 80-10-10 split. The split is done with train_test_split, applied separately to each label so that every set keeps roughly the same ratio of real to fake news. It is paramount that if an instance of fake news appears in the training set, it does not also appear in the validation or test set. This ensures that both the validation and test sets are truly "unseen" data relative to the training set while still being representative of the type of data the model will encounter. The same holds between the validation and test sets: each row must occur in exactly one set. An additional check to ensure this will be undertaken in the coming sections.
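
The arithmetic behind the 80-10-10 split can be sketched in plain Python. This assumes that train_test_split rounds the training count down when given a fractional train_size and rounds the test count up when given a fractional test_size, which is my understanding of scikit-learn's behavior rather than something stated in this notebook. split_sizes is a hypothetical helper, and the class totals are the sums of the value_counts outputs reported later in this notebook.

```python
import math

def split_sizes(n_rows, train_fraction=0.8, test_fraction_of_rest=0.5):
    """Return (train, val, test) row counts for one label group."""
    # First call: train_size=0.8 (assumed to round down)
    n_train = math.floor(train_fraction * n_rows)
    # Whatever is left is shared between validation and test
    n_rest = n_rows - n_train
    # Second call: test_size=0.5 (assumed to round up)
    n_test = math.ceil(test_fraction_of_rest * n_rest)
    n_val = n_rest - n_test
    return n_train, n_val, n_test

# Class totals taken from the value_counts outputs in this notebook
real_total = 27359 + 3420 + 3420   # label 0
fake_total = 21280 + 2660 + 2661   # label 1

print(split_sizes(real_total))
print(split_sizes(fake_total))
```

Under these assumptions the computed sizes agree with the per-label counts shown for the train, validation and test sets, including the one extra fake-news row in the test set.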

In [9]:
# Add a unique identifier column to combined_news DataFrame
combined_news['unique_id'] = range(len(combined_news))

# Define the percentages for splitting
train_percent = 0.8

# Group the data by the 'label' column
grouped = combined_news.groupby('label')

# Collect the per-label splits, then concatenate once at the end
train_parts, val_parts, test_parts = [], [], []

# Iterate over each label group
for label, group in grouped:
    # Split the group into train (80%) and remaining (20%) data
    train, remaining = train_test_split(group, train_size=train_percent,
                                        random_state=42)

    # Split the remaining data evenly into validation and test sets
    val, test = train_test_split(remaining, test_size=0.5, random_state=42)

    # Store the splits for this label
    train_parts.append(train)
    val_parts.append(val)
    test_parts.append(test)

# Combine the per-label splits into the final DataFrames
train_data = pd.concat(train_parts)
val_data = pd.concat(val_parts)
test_data = pd.concat(test_parts)

# Reset the indices for the DataFrames
train_data.reset_index(drop=True, inplace=True)
val_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

# Save datasets to CSV files
train_data.to_csv('train.csv', index=False)
val_data.to_csv('val.csv', index=False)
test_data.to_csv('test.csv', index=False)

# Save datasets to CSV files in BaseDataset, to be used for other Linguistic Feature Calculations
train_data.to_csv('../BaseDataset/train.csv', index=False)
val_data.to_csv('../BaseDataset/val.csv', index=False)
test_data.to_csv('../BaseDataset/test.csv', index=False)

In [10]:
train_data

Unnamed: 0,title,text,label,unique_id
0,"iranian general, assad discuss joint military ...",beirut (reuters) - iran s military chief met w...,0,27401
1,top u.s. official visits vietnam to assess hum...,hanoi (reuters) - a top u.s. envoy began a two...,0,41572
2,senators want probe of allergan transfer deal ...,(reuters) - four u.s. senators have asked the ...,0,54822
3,will the comey bombshell really shake up the 2...,first read is a morning briefing from meet the...,0,26745
4,"egypt election in view, sisi supporters fire u...",cairo (reuters) - six months before egypt s el...,0,42204
...,...,...,...,...
48634,loretta lynch makes disturbing video encouragi...,obama s former ag loretta lynch released a vid...,1,21575
48635,come on down to hole-suckers on southside and ...,come on down to hole-suckers on southside and ...,1,5390
48636,please make it stop,"posted on november 7, 2016 by walter brasch wi...",1,860
48637,report: trump’s mind shattered in the face of ...,"going into election day, trump s campaign is w...",1,15795


In [11]:
train_data.value_counts("label")

label
0    27359
1    21280
dtype: int64

In [12]:
val_data

Unnamed: 0,title,text,label,unique_id
0,u.s. allied syrian groups form civilian counci...,"deir al zor, syria (reuters) - u.s.-allied mil...",0,36938
1,ga congressional dem candidate ossoff: not an ...,dem. candidate for georgia congressional seat ...,0,40217
2,u.s. lawmakers ask trump to turn over any come...,washington (reuters) - u.s. lawmakers on sunda...,0,52849
3,meeting between bill clinton and loretta lynch...,washington — an airport encounter this week be...,0,51808
4,"merkel, bavaria allies agree on migrant policy...",berlin (reuters) - german chancellor angela me...,0,33337
...,...,...,...,...
6075,oops: donald trump’s debate guest supports ter...,this lack of oversight proves that donald trum...,1,139
6076,spot on! fox sports host calls out espn’s libe...,tucker carlson responded to an espn anchor cal...,1,11500
6077,obama commencement speech to black graduates: ...,because getting something for nothing is all t...,1,13781
6078,comment on europe’s forgotten ‘hitler’ killed ...,black emanuelle fixed all that in 1976. attila...,1,3328


In [13]:
val_data.value_counts("label")

label
0    3420
1    2660
dtype: int64

In [14]:
test_data

Unnamed: 0,title,text,label,unique_id
0,"hypothetically speaking, u.s. admiral says rea...",melbourne (reuters) - the u.s. pacific fleet c...,0,55612
1,south africa's anc says party officials barred...,johannesburg (reuters) - south africa s africa...,0,55993
2,seattle judge says trump travel ban case shoul...,seattle (reuters) - a u.s. federal judge on mo...,0,58277
3,outspoken lieutenant general named trump's top...,"west palm beach, fla./washington (reuters) - u...",0,52137
4,trump calls for worldwide action against north...,seoul (reuters) - u.s. president donald trump ...,0,47382
...,...,...,...,...
6076,watch: wolf blitzer silences trump supporter’s...,trump campaign manager kellyanne conway ran in...,1,21903
6077,mass nye sexual assaults in europe explained: ...,this is possibly the most disturbing video we ...,1,7363
6078,"pro abortion pac, emily’s list doing its part ...",of course emily s list is going to support hil...,1,22076
6079,trump raises concern over members of urban com...,nation puts 2016 election into perspective by ...,1,21417


In [15]:
test_data.value_counts("label")

label
0    3420
1    2661
dtype: int64

In [16]:
# Load the CSV files into DataFrames
train_data = pd.read_csv('train.csv')
val_data = pd.read_csv('val.csv')
test_data = pd.read_csv('test.csv')

# Get unique IDs for each dataset
train_unique_ids = set(train_data['unique_id'])
val_unique_ids = set(val_data['unique_id'])
test_unique_ids = set(test_data['unique_id'])

# Find unique IDs in train not in val or test
train_not_in_val = train_unique_ids - val_unique_ids
train_not_in_test = train_unique_ids - test_unique_ids

# Find unique IDs in val not in train or test
val_not_in_train = val_unique_ids - train_unique_ids
val_not_in_test = val_unique_ids - test_unique_ids

# Find unique IDs in test not in train or val
test_not_in_train = test_unique_ids - train_unique_ids
test_not_in_val = test_unique_ids - val_unique_ids

print(f"Rows in train_data: {len(train_data)}")
print(f"Rows in train_data not in val_data: {len(train_not_in_val)}")
print(f"Rows in train_data not in test_data: {len(train_not_in_test)}")
print("")

print(f"Rows in val_data: {len(val_data)}")
print(f"Rows in val_data not in train_data: {len(val_not_in_train)}")
print(f"Rows in val_data not in test_data: {len(val_not_in_test)}")
print("")

print(f"Rows in test_data: {len(test_data)}")
print(f"Rows in test_data not in train_data: {len(test_not_in_train)}")
print(f"Rows in test_data not in val_data: {len(test_not_in_val)}")

Rows in train_data: 48639
Rows in train_data not in val_data: 48639
Rows in train_data not in test_data: 48639

Rows in val_data: 6080
Rows in val_data not in train_data: 6080
Rows in val_data not in test_data: 6080

Rows in test_data: 6081
Rows in test_data not in train_data: 6081
Rows in test_data not in val_data: 6081
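
The check above compares unique_id values, which confirms that no row was assigned to two sets. It would not notice two different rows that carry the same article text. A minimal sketch of a content-based check follows; shared_values is a hypothetical helper and the toy frames are illustrative stand-ins for train_data and test_data, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy stand-ins for train_data and test_data
train_toy = pd.DataFrame({"unique_id": [0, 1, 2],
                          "text": ["story a", "story b", "story c"]})
test_toy = pd.DataFrame({"unique_id": [3, 4],
                         "text": ["story b", "story d"]})

def shared_values(left, right, column):
    """Return the values of `column` that occur in both DataFrames."""
    return set(left[column]) & set(right[column])

# The row IDs are disjoint, as in the check above
id_overlap = shared_values(train_toy, test_toy, "unique_id")

# The same text still appears on both sides
text_overlap = shared_values(train_toy, test_toy, "text")

print(len(id_overlap), len(text_overlap))
```

Whether such a check is needed depends on whether the base CSV files contain repeated articles, which this notebook does not state.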


In [17]:
os.remove("combined.csv")