# 01 Language Filtering
**Author:** Fu Zhenhui  
**Dataset:** [WELFake Dataset](https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification)  
**Output:** WELFake_EnglishOnly.csv  
**Last updated:** June 2025

In [17]:
import pandas as pd
from langdetect import detect
import time

## Load Dataset

In [18]:
# The CSV's first column is an unnamed row index. Passing `index_col=0` uses it as the
# DataFrame index instead of loading it as an extra "Unnamed: 0" column.
df = pd.read_csv("WELFake_Dataset.csv", index_col=0)
df

| | title | text | label |
|---|---|---|---|
| 0 | LAW ENFORCEMENT ON HIGH ALERT Following Threat... | No comment is expected from Barack Obama Membe... | 1 |
| 1 | NaN | Did they post their votes for Hillary already? | 1 |
| 2 | UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO... | Now, most of the demonstrators gathered last ... | 1 |
| 3 | Bobby Jindal, raised Hindu, uses story of Chri... | A dozen politically active pastors came here f... | 0 |
| 4 | SATAN 2: Russia unvelis an image of its terrif... | The RS-28 Sarmat missile, dubbed Satan 2, will... | 1 |
| ... | ... | ... | ... |
| 72129 | Russians steal research on Trump in hack of U.... | WASHINGTON (Reuters) - Hackers believed to be ... | 0 |
| 72130 | WATCH: Giuliani Demands That Democrats Apolog... | You know, because in fantasyland Republicans n... | 1 |
| 72131 | Migrants Refuse To Leave Train At Refugee Camp... | Migrants Refuse To Leave Train At Refugee Camp... | 0 |
| 72132 | Trump tussle gives unpopular Mexican leader mu... | MEXICO CITY (Reuters) - Donald Trump’s combati... | 0 |


### 🔍 Observation
- The dataset has 72,134 rows with 3 columns: `title`, `text`, and `label`.
- Some entries in `title` or `text` are missing (`NaN`).
- The dataset contains a mix of English and non-English textual content.
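
The row, column, and missing-value observations can be checked programmatically rather than read off the preview. A minimal sketch, using a small stand-in frame (the three rows are invented for illustration and are not taken from WELFake; `sample` is a hypothetical name):

```python
import pandas as pd

# Hypothetical stand-in for the loaded frame; values are illustrative only.
sample = pd.DataFrame(
    {
        "title": ["Some headline", None, "Another headline"],
        "text": ["Body text", "Body without a title", None],
        "label": [1, 1, 0],
    }
)

# Row and column counts.
n_rows, n_cols = sample.shape

# Number of missing (NaN/None) entries in each column.
missing_per_column = sample.isna().sum()

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
```

Running the same two calls on `df` would give the figures quoted in the observation list.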

## Clean Line Breaks and Handle Missing Values

In [19]:
# Replace embedded line breaks ('\n') with spaces so that each DataFrame row
# occupies a single physical line in the exported CSV file.
df["title"] = df["title"].str.replace("\n", " ")
df["text"] = df["text"].str.replace("\n", " ")
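
The replacement above targets `\n` only. If the data also contains Windows-style `\r\n` or bare `\r` line endings (an assumption; this notebook does not check for them), a single regular expression could cover all three. A minimal sketch on made-up strings:

```python
import pandas as pd

# Illustrative strings only; not taken from the dataset.
s = pd.Series(["first line\nsecond line", "windows\r\nstyle", "no break", None])

# Replace each run of CR/LF characters with one space.
# Missing values are left as missing.
cleaned = s.str.replace(r"[\r\n]+", " ", regex=True)

print(cleaned.tolist())
```

Collapsing a run of line-break characters into one space also avoids the double spaces that replacing `\r` and `\n` separately would leave behind.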

In [20]:
# Check missing value patterns
only_title_missing = df["title"].isna() & df["text"].notna()
only_text_missing = df["text"].isna() & df["title"].notna()
both_missing = df["title"].isna() & df["text"].isna()

print(f"Rows missing only title: {only_title_missing.sum()}")
print(f"Rows missing only text: {only_text_missing.sum()}")
print(f"Rows missing both: {both_missing.sum()}")

Rows missing only title: 558
Rows missing only text: 39
Rows missing both: 0


In [21]:
# No row is missing both fields, and only 597 of 72,134 rows (about 0.8%) are
# missing one of them, so dropping these rows is safe.
df = df.dropna(subset=["title", "text"])
print(f"Rows after dropping rows with missing values: {df.shape[0]}")

Rows after dropping rows with missing values: 71537


## Filter for English Content Using `langdetect`
- `langdetect` is a lightweight Python library for language detection. Its results are reasonably accurate on longer passages, though detection on short or ambiguous strings (such as some titles) is less reliable.
- We retain rows where both `title` and `text` are detected as English (`'en'`).

<span style="color:red; font-weight:bold">⚠️ This cell takes a long time to run (about 36 minutes in the run recorded below). Run it only once, and reuse the saved file next time.</span>

In [20]:
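def is_english(text):
    """Return True if `text` is detected as English, False otherwise."""
    try:
        return detect(text) == "en"
    except Exception:
        # detect() raises when the input has no usable features
        # (for example an empty string or digits only); treat that as non-English.
        return False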

start_time = time.time()

# Keep only rows where both title and text are English
df_english = df[df["title"].apply(is_english) & df["text"].apply(is_english)]

end_time = time.time()

print(f"Time taken for language filtering: {end_time - start_time:.2f} seconds.")
print(f"Remaining rows after language filtering: {df_english.shape}")

Time taken for language filtering: 2179.73 seconds.
Remaining rows after language filtering: (68072, 3)
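
`langdetect` is non-deterministic by default on short or ambiguous input, so repeated runs of the filter may keep slightly different rows. The library's documented remedy is to set `DetectorFactory.seed` before the first detection. The sketch below shows one way the filter could be made reproducible; it is not the code that produced the numbers above, and `make_is_english`, `filter_english`, and `seeded_langdetect` are hypothetical helper names. The detector is passed in as an argument so the filtering logic can be tried out without running the slow detection.

```python
import pandas as pd


def make_is_english(detect_fn):
    """Wrap a detector so that any detection failure counts as 'not English'."""
    def is_english(text):
        try:
            return detect_fn(text) == "en"
        except Exception:
            # e.g. langdetect raises on empty or feature-less input
            return False
    return is_english


def filter_english(frame, detect_fn, columns=("title", "text")):
    """Keep rows where every listed column is detected as English."""
    is_english = make_is_english(detect_fn)
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        mask = mask & frame[col].apply(is_english).astype(bool)
    return frame[mask]


def seeded_langdetect(seed=0):
    """Return langdetect's detect() after fixing its random seed."""
    # Imported lazily so the helpers above work without langdetect installed.
    from langdetect import DetectorFactory, detect
    DetectorFactory.seed = seed  # must be set before the first detect() call
    return detect
```

With these helpers, the filtering step would read `df_english = filter_english(df, seeded_langdetect())`.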


## Export English-Only Dataset
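Saving the filtered data avoids rerunning the expensive language detection step in later sessions. The file is written without the index column, so it can be read back with a plain `pd.read_csv` call (no `index_col=0`).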

In [22]:
df_english.to_csv("WELFake_EnglishOnly.csv", index=False)
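
One way to enforce the "run once, reuse afterwards" rule is a small load-or-build helper: read the saved file when it exists, otherwise compute the result and save it. This is a sketch, not part of the original notebook; `load_or_build` and `build_fn` are hypothetical names.

```python
from pathlib import Path

import pandas as pd


def load_or_build(path, build_fn):
    """Read the cached CSV if it exists; otherwise build, save, and return it."""
    path = Path(path)
    if path.exists():
        # The cache was written with index=False, so no index_col is needed.
        return pd.read_csv(path)
    frame = build_fn()
    frame.to_csv(path, index=False)
    return frame
```

Here `build_fn` would be a zero-argument function that performs the language filtering and returns the filtered DataFrame, for example `load_or_build("WELFake_EnglishOnly.csv", run_language_filter)` with a hypothetical `run_language_filter`.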