# Basic Dataframe Clean-Up

Loading the dataframe with a custom function. \
(Because of its size, it is split into separate pieces, each smaller than 50 MB, the size above which GitHub warns about large files in a repository.)

In [1]:
from custom_utils import load_and_concatenate_parquet_files

df = load_and_concatenate_parquet_files('./data/raw_fake_news_df')
display(df)

Unnamed: 0,title,text,label
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,,Did they post their votes for Hillary already?,1
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1
...,...,...,...
72129,Russians steal research on Trump in hack of U....,WASHINGTON (Reuters) - Hackers believed to be ...,0
72130,WATCH: Giuliani Demands That Democrats Apolog...,"You know, because in fantasyland Republicans n...",1
72131,Migrants Refuse To Leave Train At Refugee Camp...,Migrants Refuse To Leave Train At Refugee Camp...,0
72132,Trump tussle gives unpopular Mexican leader mu...,MEXICO CITY (Reuters) - Donald Trump’s combati...,0


Checking for Null Values

In [2]:
df.info()
print("Null values per column:")
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72134 entries, 0 to 72133
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   71576 non-null  object
 1   text    72095 non-null  object
 2   label   72134 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
Null values per column:


title    558
text      39
label      0
dtype: int64

Dropping rows with null values, as they are rare (fewer than 1% of the 72,134 rows)

In [3]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 71537 entries, 0 to 72133
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   71537 non-null  object
 1   text    71537 non-null  object
 2   label   71537 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 2.2+ MB
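
As a quick sanity check on the two `df.info()` outputs above, the row counts can be compared with the per-column null counts. The sketch below is plain Python with the printed figures typed in by hand (no dataframe is needed); the variable names are only illustrative.

```python
# Figures copied from the outputs above
rows_before = 72134
rows_after = 71537
null_title = 558
null_text = 39

dropped = rows_before - rows_after
# A row lacking both title and text would be counted twice in the sum
rows_missing_both = null_title + null_text - dropped
share_dropped = dropped / rows_before

print(f"Rows dropped: {dropped}")
print(f"Rows missing both title and text: {rows_missing_both}")
print(f"Share of rows dropped: {share_dropped:.2%}")
```

The number of dropped rows equals the sum of the two null counts, so no row was missing both fields, and well under 1% of the data is lost.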


Adding a readable label column for clarity

In [4]:
df["label_names"] = df["label"].apply(lambda x: "real" if x == 1 else "fake")

Merging title and text into a single text column to combine their information

In [5]:
df["full_text"] = df["title"] + df["text"]
df = df[['full_text', 'label', 'label_names']]
df.head()

Unnamed: 0,full_text,label,label_names
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,1,real
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,1,real
3,"Bobby Jindal, raised Hindu, uses story of Chri...",0,fake
4,SATAN 2: Russia unvelis an image of its terrif...,1,real
5,About Time! Christian Group Sues Amazon and SP...,1,real
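
The cell above joins the two columns with `+` and no separator. A toy sketch in plain Python (the strings are made up, not taken from the dataset) shows what that does to the token at the title/text boundary, next to a hypothetical variant that inserts a space:

```python
title = "Breaking news"
text = "Officials said on Monday"

fused = title + text          # same operation as df["title"] + df["text"]
spaced = title + " " + text   # hypothetical variant with a separator

print(fused.split())
print(spaced.split())
```

In the first case the last title word and the first text word end up as one token (`newsOfficials`), which a later tokenizer or lemmatizer is unlikely to recognize. Whether this matters depends on the downstream model.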


Checking for duplicate entries. \
These are expected, as this dataset merges four individual datasets.

In [6]:
df.duplicated().sum()

8416

Dropping duplicate rows

In [7]:
df = df.drop_duplicates().reset_index(drop=True)
df.shape

(63121, 3)

# Text Clean-Up

Basic regex clean-up: lowercasing, then removing URLs and every character that is not a letter, whitespace, or an apostrophe

In [8]:
df.loc[:, "full_text"] = (
    df["full_text"]
    .str.lower()                                    # Convert to lowercase
    .replace(r'http[\w:/\.]+', ' ', regex=True)     # Remove URLs
    .replace(r"[^a-z\s'’]", " ", regex=True)        # Remove everything except lowercase letters, spaces, and apostrophes
    .replace(r'\s\s+', ' ', regex=True)             # Collapse multiple spaces
    .str.strip()                                    # Remove leading/trailing spaces
)
display(df["full_text"])

0        law enforcement on high alert following threat...
1        unbelievable obama’s attorney general says mos...
2        bobby jindal raised hindu uses story of christ...
3        satan russia unvelis an image of its terrifyin...
4        about time christian group sues amazon and spl...
                               ...                        
63116    wikileaks email shows clinton foundation funds...
63117    russians steal research on trump in hack of u ...
63118    watch giuliani demands that democrats apologiz...
63119    migrants refuse to leave train at refugee camp...
63120    trump tussle gives unpopular mexican leader mu...
Name: full_text, Length: 63121, dtype: object
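
To make the order of the regex steps easier to follow, here is a minimal sketch that applies the same patterns to a single made-up string with the standard `re` module. `clean_text` is a hypothetical helper name; the notebook itself works on the whole pandas column.

```python
import re

def clean_text(text: str) -> str:
    """Apply the clean-up steps from the cell above to one string."""
    text = text.lower()                          # lowercase first, so [a-z] is enough
    text = re.sub(r"http[\w:/\.]+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s'’]", " ", text)      # keep letters, whitespace, apostrophes
    text = re.sub(r"\s\s+", " ", text)           # collapse runs of whitespace
    return text.strip()

sample = "BREAKING!!! Obama’s plan: 3 steps https://example.com/news don't miss"
print(clean_text(sample))
# prints: breaking obama’s plan steps don't miss
```

The URL step has to run before the character filter. Otherwise `:`, `/` and `.` would be replaced by spaces first, and fragments such as `https example com news` would survive as ordinary words.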

Showcasing the `contractions` library

In [9]:
import contractions
sample_text = "I can't believe it's not true! They're going to the park."
fixed_text = contractions.fix(sample_text)
print("Original text:", sample_text)
print("Fixed text:", fixed_text)

Original text: I can't believe it's not true! They're going to the park.
Fixed text: I cannot believe it is not true! They are going to the park.


Replacing contractions using the library shown above

In [10]:
df["full_text"] = df["full_text"].apply(
    lambda x: contractions.fix(x) if isinstance(x, str) else x
)
display(df["full_text"])

0        law enforcement on high alert following threat...
1        unbelievable obama’s attorney general says mos...
2        bobby jindal raised hindu uses story of christ...
3        satan russia unvelis an image of its terrifyin...
4        about time christian group sues amazon and spl...
                               ...                        
63116    wikileaks email shows clinton foundation funds...
63117    russians steal research on trump in hack of yo...
63118    watch giuliani demands that democrats apologiz...
63119    migrants refuse to leave train at refugee camp...
63120    trump tussle gives unpopular mexican leader mu...
Name: full_text, Length: 63121, dtype: object
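
Comparing this output with the previous one, row 63117 changed from `hack of u ...` to `hack of yo...`, which suggests that a standalone `u` was expanded as well. The toy sketch below illustrates the general idea of dictionary-based expansion with a tiny hand-written mapping. It is not the `contractions` library's implementation, and `TOY_MAP` and `expand` are hypothetical names.

```python
import re

# Tiny hand-written mapping, for illustration only
TOY_MAP = {
    "can't": "cannot",
    "it's": "it is",
    "they're": "they are",
    "u": "you",  # slang-style entry, assumed here to mirror the observed output
}

# Longest keys first, so that longer entries win over shorter ones
_pattern = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(TOY_MAP, key=len, reverse=True))
    + r")\b"
)

def expand(text: str) -> str:
    """Replace every whole-word match with its expansion."""
    return _pattern.sub(lambda match: TOY_MAP[match.group(1)], text)

print(expand("i can't believe it's not true"))
print(expand("hack of u s election"))
```

The second call returns `hack of you s election`: a single letter left over from an abbreviation is treated like slang. This kind of side effect is worth keeping in mind when expansion runs after punctuation has already been stripped.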

Removing leftover typographic apostrophes (’) that did not belong to any contraction

In [11]:
df['full_text'] = df['full_text'].replace(r"’", "", regex=True)
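
Note that the pattern above targets only the typographic apostrophe `’`; the straight apostrophe `'` is a different character and is not matched. A minimal sketch with made-up strings, using plain `str.replace` as a stand-in for the pandas call:

```python
curly = "obama’s attorney general"
straight = "obama's attorney general"

print(curly.replace("’", ""))     # typographic apostrophe removed
print(straight.replace("’", ""))  # straight apostrophe left in place
```

This is consistent with the final table, where `obama’s` appears as `obamas`.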

## Apply Lemmatization in Parallel

Loading the spaCy model, downloading it first if it is not installed

In [None]:
import spacy
from spacy.cli import download
# Load the spaCy model; download it first if it is not already installed
model_name = "en_core_web_sm"

try:
    nlp = spacy.load(model_name, disable=['parser', 'ner'])
    print(f"Successfully loaded model: {model_name}")
except OSError:
    print(f"Model '{model_name}' not found. Downloading...")
    download(model_name)
    nlp = spacy.load(model_name, disable=['parser', 'ner'])
    print(f"Successfully downloaded and loaded model: {model_name}")

nlp.add_pipe('sentencizer')

Successfully loaded model: en_core_web_sm


<spacy.pipeline.sentencizer.Sentencizer at 0x7fd52b381980>

Lemmatization can be computationally expensive.\
I have therefore parallelized it to speed up the process.

In [None]:
from joblib import Parallel, delayed
from spacy.lang.en.stop_words import STOP_WORDS
import gc
import numpy as np

# Function to lemmatize a single document
def lemmatize_doc(doc):
    return ' '.join(
        tok.lemma_.lower()
        for tok in doc
        if tok.is_alpha and tok.text.lower() not in STOP_WORDS
    )

# Function to chunk an iterable into chunks of size chunksize
def chunker(iterable, total_length, chunksize):
    for pos in range(0, total_length, chunksize):
        yield iterable[pos: pos + chunksize]
        
# Flatten a list of lists
def flatten(list_of_lists):
    return [item for sublist in list_of_lists for item in sublist]

# Process a chunk of texts in parallel
def process_chunk(texts):
    return [lemmatize_doc(doc) for doc in nlp.pipe(texts, batch_size=20)]

# Main preprocessing function for parallel processing
def batch_text_lemmatization(texts, chunksize=100, num_parts=5):
    # Process the Series in num_parts pieces, each with its own worker pool
    split_texts = np.array_split(texts, num_parts)
    lemmatized_parts = []
    for number, part in enumerate(split_texts):
        print(f"Processing part {number + 1}/{num_parts}...")
        part_texts = part.tolist()  # convert to a list once per part

        with Parallel(n_jobs=-1, backend='multiprocessing', prefer="processes") as executor:
            tasks = (
                delayed(process_chunk)(chunk)
                for chunk in chunker(part_texts, len(part_texts), chunksize=chunksize)
            )
            result = executor(tasks)
        gc.collect()

        lemmatized_parts.extend(flatten(result))
    return lemmatized_parts

df["full_text"] = batch_text_lemmatization(df["full_text"], chunksize=100, num_parts=5)

Successfully loaded model: en_core_web_sm
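
The interplay of `chunker` and `flatten` can be tried out without spaCy or joblib. The sketch below reuses the two helpers from the cell above and swaps in a hypothetical stand-in, `fake_process_chunk`, that only upper-cases each text. The chunks are processed sequentially here, not in parallel.

```python
def chunker(iterable, total_length, chunksize):
    for pos in range(0, total_length, chunksize):
        yield iterable[pos: pos + chunksize]

def flatten(list_of_lists):
    return [item for sublist in list_of_lists for item in sublist]

def fake_process_chunk(texts):
    # Hypothetical stand-in for process_chunk: no lemmatization, only upper-casing
    return [text.upper() for text in texts]

texts = ["a", "b", "c", "d", "e", "f", "g"]
chunks = list(chunker(texts, len(texts), chunksize=3))
result = flatten([fake_process_chunk(chunk) for chunk in chunks])

print(chunks)   # the last chunk is shorter than chunksize
print(result)   # one output per input, in the original order
```

As long as the per-chunk results come back in submission order, flattening them restores one output per input row, which is what allows the result to be assigned back to `df["full_text"]`.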


Removing texts that the clean-up left completely empty

In [15]:
row_count_before = df.shape[0]
df = df[df["full_text"].str.strip() != ""].reset_index(drop=True)
print(f"Removed {row_count_before - df.shape[0]} rows with empty text")

Removed 42 rows with empty text
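
The row counts reported along the way can be chained into a simple bookkeeping check. This is plain arithmetic on the printed figures, assuming the lemmatization cell ran on the 63,121 deduplicated rows; the variable names are only illustrative.

```python
rows_after_dropna = 71537    # from df.info() after dropna()
duplicates = 8416            # from df.duplicated().sum()
rows_after_dedup = rows_after_dropna - duplicates

empty_after_cleanup = 42     # from the message above
rows_final = rows_after_dedup - empty_after_cleanup

print(rows_after_dedup)
print(rows_final)
```

The first result matches the `(63121, 3)` shape shown after `drop_duplicates()`. The second is the row count to expect in the saved dataset under the stated assumption.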


Saving the preprocessed dataset with a custom function.\
(To keep each file smaller than 50 MB)

In [17]:
from custom_utils import save_dataframe_as_parquet

save_dataframe_as_parquet(df, folder_path='data', folder_name='preprocessed_df')
display(df)

Dataframe saved in 3 files under the folder: data/preprocessed_df


Unnamed: 0,full_text,label,label_names
0,law enforcement on high alert following threat...,1,real
1,unbelievable obamas attorney general says most...,1,real
2,bobby jindal raised hindu uses story of christ...,0,fake
3,satan russia unvelis an image of its terrifyin...,1,real
4,about time christian group sues amazon and spl...,1,real
...,...,...,...
63116,wikileaks email shows clinton foundation funds...,1,real
63117,russians steal research on trump in hack of yo...,0,fake
63118,watch giuliani demands that democrats apologiz...,1,real
63119,migrants refuse to leave train at refugee camp...,0,fake
