# 02 Data Cleaning
**Author:** Fu Zhenhui  
**Input:** WELFake_EnglishOnly.csv  
**Output:** WELFake_Cleaned.csv  
**Last updated:** June 2025

In [2]:
import pandas as pd

## Load Dataset

In [3]:
df = pd.read_csv("WELFake_EnglishOnly.csv")
df

Unnamed: 0,title,text,label
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
2,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
3,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1
4,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1
...,...,...,...
68067,Russians steal research on Trump in hack of U....,WASHINGTON (Reuters) - Hackers believed to be ...,0
68068,WATCH: Giuliani Demands That Democrats Apolog...,"You know, because in fantasyland Republicans n...",1
68069,Migrants Refuse To Leave Train At Refugee Camp...,Migrants Refuse To Leave Train At Refugee Camp...,0
68070,Trump tussle gives unpopular Mexican leader mu...,MEXICO CITY (Reuters) - Donald Trump’s combati...,0


In [4]:
df["label"].value_counts()
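# Stated mapping: 0 = fake, 1 = real.
# Note: the Reuters rows in the preview above carry label 0, so verify this mapping before modelling.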

label
0    34052
1    34020
Name: count, dtype: int64

In [5]:
# Display the summary table: dtype, null count, and null percentage per column.
# pd.concat is used instead of the private DataFrame._append method.
tab_info = pd.concat([
    pd.DataFrame(df.dtypes).T.rename(index={0: "column type"}),
    pd.DataFrame(df.isnull().sum()).T.rename(index={0: "null values (nb)"}),
    pd.DataFrame((df.isnull().sum() / df.shape[0]) * 100).T.rename(index={0: "null values (%)"}),
])
tab_info

Unnamed: 0,title,text,label
column type,object,object,int64
null values (nb),0,0,0
null values (%),0.0,0.0,0.0


## Remove Leading/Trailing Whitespace from `title` (spaces, tabs, \r, \n, etc.)

In [6]:
df["title"] = df["title"].str.strip()
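
As a minimal sketch of what this step does: called without arguments, `Series.str.strip()` removes every kind of leading and trailing whitespace, not only spaces. The titles below are made up for illustration. Note that only `title` is stripped here; `text` is left as loaded.

```python
import pandas as pd

# Toy titles (invented for illustration) padded with different whitespace characters
titles = pd.Series(["  Breaking news\n", "\tMarket update \r\n", "No padding"])

# With no argument, str.strip() removes all leading/trailing whitespace characters
stripped = titles.str.strip()

print(stripped.tolist())
```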

## ❓Check for Duplicates in the "text" Column

In [7]:
# Count surplus rows: total rows minus distinct "text" values (duplicates excluding the first occurrence of each)
num_total_text = df.shape[0]
num_unique_text = df["text"].nunique(dropna=False) 
num_duplicates_text = num_total_text - num_unique_text

print(f"Total rows:       {num_total_text}")
print(f"Unique 'text'     {num_unique_text}")
print(f"Duplicate rows:   {num_duplicates_text}")

Total rows:       68072
Unique 'text'     59904
Duplicate rows:   8168
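
The figure above counts surplus rows, i.e. rows beyond the first copy of each `text`. The sketch below, on a made-up toy frame, shows how that count relates to the two `duplicated()` variants; the variable names are illustrative only.

```python
import pandas as pd

# Toy frame (invented): "a" appears three times, "b" twice, "c" once
toy = pd.DataFrame({"text": ["a", "b", "a", "c", "b", "a"]})

# Rows beyond the first copy of each value (the quantity computed above)
surplus = len(toy) - toy["text"].nunique(dropna=False)

# The same count, obtained from a boolean mask
surplus_alt = int(toy["text"].duplicated(keep="first").sum())

# Every row that belongs to a duplicated group, first occurrences included
involved = int(toy["text"].duplicated(keep=False).sum())

print(surplus, surplus_alt, involved)
```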


## ✂️Remove Duplicates Based on `text` by Keeping the Longest `title`
> For each duplicated `text`, keep the row with the longest `title`, treating title length as a proxy for how **informative** it is.

In [8]:
# Compute the length of each title
title_lengths = df['title'].str.len()

# For each unique text, find the index of the row whose title is longest
longest_per_text = title_lengths.groupby(df["text"]).idxmax()

# Select and reset index
df = df.loc[longest_per_text].reset_index(drop=True)

print(f"Rows after keeping longest title per duplicate text: {df.shape[0]}")

Rows after keeping longest title per duplicate text: 59904


In [9]:
assert df.shape[0] == df['text'].nunique()
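
To make the `groupby(...).idxmax()` pattern concrete, here is a small sketch on an invented three-row frame. Two things are worth knowing: `idxmax` returns the first index when lengths tie, and `groupby` sorts its keys by default, so the surviving rows come back ordered by `text` rather than in their original order.

```python
import pandas as pd

# Toy data (invented): the first two rows share the same text
toy = pd.DataFrame({
    "title": ["Short", "A much longer title", "Only one"],
    "text":  ["same body", "same body", "unique body"],
    "label": [1, 1, 0],
})

# Length of each title, grouped by the text it belongs to
title_lengths = toy["title"].str.len()
keep_idx = title_lengths.groupby(toy["text"]).idxmax()

# Keep one row per distinct text: the one with the longest title
deduped = toy.loc[keep_idx].reset_index(drop=True)

print(deduped)
```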

In [10]:
df["label"].value_counts()
# 0 = fake, 1 = real

label
0    33683
1    26221
Name: count, dtype: int64
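
The two `value_counts()` outputs show that this step removed rows very unevenly across labels. The short calculation below uses only the counts printed in this notebook to make the shift in class balance explicit.

```python
# Label counts copied from the two value_counts() outputs in this notebook
before = {0: 34052, 1: 34020}
after = {0: 33683, 1: 26221}

# Rows removed per label by the text-based deduplication
removed = {label: before[label] - after[label] for label in before}

# Share of each label in the deduplicated data
total_after = sum(after.values())
share_after = {label: after[label] / total_after for label in after}

print(removed)
print({label: round(share, 3) for label, share in share_after.items()})
```

Almost all removed rows carry label 1, so the dataset goes from nearly balanced to roughly 56% / 44%.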

## ❓Check for Duplicates in the "title" Column

In [11]:
# Count surplus rows: total rows minus distinct "title" values (duplicates excluding the first occurrence of each)
num_total_title   = df.shape[0]
num_unique_title  = df["title"].nunique(dropna=False)
num_duplicates_title = num_total_title - num_unique_title

print(f"Total rows:          {num_total_title}")
print(f"Unique 'title' count: {num_unique_title}")
print(f"Duplicate rows:      {num_duplicates_title}")

Total rows:          59904
Unique 'title' count: 59144
Duplicate rows:      760


In [12]:
dup_mask_title = df["title"].duplicated(keep="first")
print(f"Number of rows flagged as duplicates in 'title' (excluding first): {dup_mask_title.sum()}")

sample_title_dups = df[dup_mask_title].head(10)
print("Sample of duplicate rows in 'title' (excluding first occurrences):")
sample_title_dups

Number of rows flagged as duplicates in 'title' (excluding first): 760
Sample of duplicate rows in 'title' (excluding first occurrences):


Unnamed: 0,title,text,label
299,Turkish Disinfo: Daesh terrorist leader Baghda...,Ian Greenhalgh is a photographer and histori...,1
531,Exclusive: Trump names career diplomat to head...,((This December 4 story has been corrected to...,0
548,Cuba tells U.S. suspension of visas is hurting...,(Corrects paragraph 7 to show Trump issued a ...,0
707,Prof Michel Chossudovsky discusses Hillary Cli...,21st Century Wire says… Amid great mainstrea...,1
710,Mexico’s Richest Oligarch Loses Billions on Ne...,21st Century Wire says… Mexico’s billionaire...,1
713,Hillary Clinton Jumps the Shark with ‘Trump’s ...,"21st Century Wire says… Yesterday, WikiLeaks...",1
978,FORMER FBI ASST DIRECTOR: “Jim Comey ‘Danced W...,He threw the reputation of the FBI under the ...,1
1736,"""Top Five Clinton Donors Are Jewish"" - How Ant...","""Top Five Clinton Donors Are Jewish"" - How Ant...",1
1792,Public vs. Media on War,(128 fans) - Advertisement - A new poll from ...,1
1793,Michael Moore Owes Me $4.99,(128 fans) - Advertisement - Michael Moore ha...,1


🔍 **Observation**
- Even with all exact-`text` duplicates removed, 760 rows still share a `title` with an earlier row.

### *One example of a duplicated "title"*:

In [13]:
sample_title_dups["title"].iloc[1]

'Exclusive: Trump names career diplomat to head Cuban embassy - sources'

In [14]:
df[df["title"]=="Exclusive: Trump names career diplomat to head Cuban embassy - sources"]

Unnamed: 0,title,text,label
530,Exclusive: Trump names career diplomat to head...,((This December 4 story has been corrected to...,0
531,Exclusive: Trump names career diplomat to head...,((This December 4 story has been corrected to...,0


In [15]:
# Check the “text” values for duplicated titles
dup_title_530 = df[df["title"]=="Exclusive: Trump names career diplomat to head Cuban embassy - sources"]
print(dup_title_530.at[530, "text"])
print("------------")
print(dup_title_530.at[531, "text"])

 ((This December 4 story has been corrected to change  last year  to 2015 in sixth paragraph, June to July in 11th paragraph)) By Marc Frank HAVANA (Reuters) - The Trump administration has named career diplomat Philip Goldberg to head the all-but-abandoned U.S. embassy in Havana, according to three sources familiar with the matter, at a time of heightened tensions between the United States and Cuba. Goldberg has lengthy experience in a number of countries, and was described by a U.S. congressional aide on Monday as  career and the best of the best . But his appointment may ruffle feathers in Havana. He was expelled from Cuba s socialist ally Bolivia in 2008 for what President Evo Morales claimed was fomenting social unrest. The appointment has not been publicly announced. If approved by Cuba, Goldberg will arrive at a low moment in bilateral relations. The embassy was reopened in 2015 for the first time since 1961, as part of a fragile detente by former Democratic U.S. president Barack

🔍 **Observation**
- Subtle differences, such as extra whitespace and varying punctuation, prevent the two `text` strings from matching exactly, so the earlier exact-match deduplication on `text` did not catch them.
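
One way to confirm this kind of near-duplicate is to compare the strings after normalization. The sketch below is not part of the cleaning pipeline; `normalize` is a hypothetical helper, and the two strings are invented variants of a sentence from the article above.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace, trim ends

# Invented variants: same sentence, different spacing and punctuation
a = " The embassy was reopened in 2015,  for the first time since 1961. "
b = "The embassy was reopened in 2015 for the first time since 1961"

print(a == b)                        # raw strings differ
print(normalize(a) == normalize(b))  # normalized strings match
```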

### *Another example of a duplicated "title"*:

In [16]:
sample_title_dups["title"].iloc[3]

'Prof Michel Chossudovsky discusses Hillary Clinton’s foreign policy & emerging nuclear risks'

In [17]:
df[df["title"]=="Prof Michel Chossudovsky discusses Hillary Clinton’s foreign policy & emerging nuclear risks"]

Unnamed: 0,title,text,label
688,Prof Michel Chossudovsky discusses Hillary Cli...,21st Century Wire says Amid great mainstream ...,1
707,Prof Michel Chossudovsky discusses Hillary Cli...,21st Century Wire says… Amid great mainstrea...,1


In [18]:
# Check the “text” values for duplicated titles
dup_title_688 = df[df["title"]=="Prof Michel Chossudovsky discusses Hillary Clinton’s foreign policy & emerging nuclear risks"]
print(dup_title_688.at[688, "text"])
print("------------")
print(dup_title_688.at[707, "text"])

 21st Century Wire says Amid great mainstream media and Democratic Party fanfare, Hillary Clinton s candidacy has been based on a claim she is  most experienced  candidate in history, and highlighting her foreign policy record in particular. What would a Clinton foreign policy really look like? On Episode #158 of the SUNDAY WIRE, host Patrick Henningsen spoke with a special guest, Professor Michel Chossudovsky, founder & editor of www.globalresearch.ca about  Hillary s hawks  and what a Clinton presidency will look like in terms of US-backed wars around the globe, as well as Washington s current fetish with  sustainable  nuclear conflicts.Listen to this excellent discussion:[soundcloud url= https://api.soundcloud.com/tracks/291631622  params= auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&visual=true  width= 100%  height= 450  iframe= true  /] . See more of Michel s recent book,  The Globalization of War: America s Long War Against Humanity. REA

🔍 **Observation**
- The two `text` values are nearly identical (for example, one has "…" after "21st Century Wire says" and the other does not) and refer to the same article.
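
"Very similar" can be quantified with the standard library's `difflib`. This is only an illustrative sketch: the two strings are the opening words of the texts shown in the row previews above, and a ratio close to 1.0 indicates near-identical strings.

```python
from difflib import SequenceMatcher

# Opening words of the two texts, as shown in the row previews above
a = "21st Century Wire says Amid great"
b = "21st Century Wire says… Amid great"

# ratio() returns a similarity score between 0.0 and 1.0
ratio = SequenceMatcher(None, a, b).ratio()

print(f"similarity ratio: {ratio:.3f}")
```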

## ✂️Remove Duplicates Based on `title` by Keeping the Longest `text`
> For each duplicated `title`, keep the row with the longest `text`, treating text length as a proxy for how **informative** it is.

In [19]:
# Compute the length of each text
text_lengths = df["text"].str.len()

# For each unique title, find the index of the row whose text is longest
longest_per_title = text_lengths.groupby(df["title"]).idxmax()

# Select and reset index
df = df.loc[longest_per_title].reset_index(drop=True)

print(f"Rows after keeping longest text per duplicate title: {df.shape[0]}")

Rows after keeping longest text per duplicate title: 59144


In [20]:
assert df.shape[0] == df["title"].nunique()
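
The same selection can also be written as a sort followed by `drop_duplicates`. The sketch below compares both approaches on invented data; `_len` is a temporary helper column introduced for illustration. The two approaches may resolve ties differently, so treat them as interchangeable only when lengths within a group are distinct.

```python
import pandas as pd

# Toy data (invented): the first two rows share the same title
toy = pd.DataFrame({
    "title": ["Same headline", "Same headline", "Other headline"],
    "text":  ["short body", "a considerably longer body", "body"],
    "label": [0, 0, 1],
})

# Approach used above: index of the longest text within each title group
via_idxmax = toy.loc[toy["text"].str.len().groupby(toy["title"]).idxmax()]

# Alternative: sort longest-first, then keep the first row per title
via_sort = (
    toy.assign(_len=toy["text"].str.len())
       .sort_values("_len", ascending=False, kind="stable")
       .drop_duplicates(subset="title", keep="first")
       .drop(columns="_len")
)

# Both approaches keep the same original rows
print(sorted(via_idxmax.index) == sorted(via_sort.index))
```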

In [21]:
df["label"].value_counts()
# 0 = fake, 1 = real

label
0    33318
1    25826
Name: count, dtype: int64

## Combine `title` with `text` to create a new column `content` 

In [22]:
# Create the "content" column by combining "title" and "text"
df["content"] = df["title"] + " " + df["text"]

# Reorder the columns
desired_column_order = ["title", "text", "content", "label"]
df = df[desired_column_order]
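
The null check earlier reported zero missing values in `title` and `text`, so plain concatenation is safe here. As a hedged sketch of why that check matters, the invented example below shows that a missing title would otherwise turn the whole `content` value into NaN, and one possible way to guard against it.

```python
import numpy as np
import pandas as pd

# Toy data (invented): the second row has a missing title
toy = pd.DataFrame({
    "title": ["Headline A", np.nan],
    "text":  ["Body A", "Body B"],
})

# Plain concatenation: a missing value on either side yields NaN
plain = toy["title"] + " " + toy["text"]

# Guarded variant: treat missing values as empty strings, then trim the stray space
safe = (toy["title"].fillna("") + " " + toy["text"].fillna("")).str.strip()

print(plain.tolist())
print(safe.tolist())
```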

In [23]:
df

Unnamed: 0,title,text,content,label
0,"""Allahu Akbar, the Russians are here!"": Aleppo...","November 7, 2016 - Fort Russ News - RusVesna ...","""Allahu Akbar, the Russians are here!"": Aleppo...",1
1,"""America has a simple ideology"": how one of Ru...",The United States comes up constantly when you...,"""America has a simple ideology"": how one of Ru...",0
2,"""Authoritarianism"": How the West demonizes str...","November 22, 2016 - Deena Stryker, Katehon - ...","""Authoritarianism"": How the West demonizes str...",1
3,"""Blue Alerts"" to be used to keep the 'War on C...","""Blue Alerts"" to be used to keep the 'War on C...","""Blue Alerts"" to be used to keep the 'War on C...",1
4,"""CANADA READY TO RECEIVE 250",Ottawa | Prime Minister Justin Trudeau is gett...,"""CANADA READY TO RECEIVE 250 Ottawa | Prime Mi...",1
...,...,...,...,...
59139,"“You Ruined Your Own Communities, Don’t Ruin O...","- < “You Ruined Your Own Communities, Don’t Ru...","“You Ruined Your Own Communities, Don’t Ruin O...",1
59140,“Your little brother is not the ultimate autho...,Anyone old enough to remember that election ni...,“Your little brother is not the ultimate autho...,0
59141,“You’re Not Alone – We’re With You” – Video Ai...,We understand that what happened on Tuesday is...,“You’re Not Alone – We’re With You” – Video Ai...,1
59142,“You’re Not Welcome!” Obama As Welcome At Rose...,Roseberg residents and families of victims are...,“You’re Not Welcome!” Obama As Welcome At Rose...,1


## Export Cleaned Dataset
The exported file contains only English rows (the input was already filtered to English), with duplicates removed and a combined `content` column added.

In [24]:
df.to_csv("WELFake_Cleaned.csv", index=False)
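
A quick round-trip check can confirm that an export survives quoting and commas in the titles. The sketch below uses an in-memory buffer and invented rows rather than the real file, so it runs without touching `WELFake_Cleaned.csv`.

```python
import io
import pandas as pd

# Toy frame (invented) with quotes and commas, which CSV writers must escape
toy = pd.DataFrame({
    "title": ['"Quoted", with comma', "Plain title"],
    "text": ["first body", "second body"],
    "content": ['"Quoted", with comma first body', "Plain title second body"],
    "label": [1, 0],
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Compare shape, column order, and values
print(restored.shape == toy.shape)
print(restored["title"].tolist() == toy["title"].tolist())
```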