NLP LAB 1 – CORPUS CONSTRUCTION
Project: NLP-Labs
Dataset: Fake and Real News Dataset (Kaggle)
Source: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

------------------------------------------------------------
OBJECTIVE:
The objective of this lab is to construct a structured text corpus
for downstream Natural Language Processing (NLP) tasks.

We use the Fake and Real News dataset, which contains two separate
CSV files:
    • Fake.csv  – Fake news articles
    • True.csv  – Real news articles

TASKS PERFORMED:
1. Load both datasets into Pandas.
2. Assign binary labels:
       Fake News  → 0
       Real News  → 1
3. Merge the datasets into a single unified corpus.
4. Retain only the relevant columns (title, text, label).
5. Remove empty or short articles (text of 100 characters or fewer)
   to ensure data quality.
6. Randomly sample a manageable subset (5000 articles) with a fixed
   seed (random_state=42) for reproducibility.
7. Store the corpus in structured formats:
       • JSONL (for NLP pipelines)
       • CSV   (for general analysis)
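
The steps above can be sketched on a tiny in-memory example, without
the Kaggle files. The frames `fake_toy` and `true_toy` and the helper
`build_corpus` are hypothetical names used only for illustration. The
column names and the 100-character threshold mirror the task list, but
stripping whitespace before measuring length is an added assumption.

```python
import pandas as pd


def build_corpus(fake, true, min_chars=100, n=None, seed=42):
    """Label, merge, filter, and optionally sample two article frames."""
    fake = fake.assign(label=0)  # fake news -> 0
    true = true.assign(label=1)  # real news -> 1
    df = pd.concat([fake, true], ignore_index=True)
    df = df[["title", "text", "label"]]
    # Drop missing, whitespace-only, and short articles.
    lengths = df["text"].fillna("").str.strip().str.len()
    df = df[lengths > min_chars]
    if n is not None:
        # Cap n so sampling never asks for more rows than exist.
        df = df.sample(n=min(n, len(df)), random_state=seed)
    return df.reset_index(drop=True)


# Toy inputs (assumed data, not taken from the real dataset).
fake_toy = pd.DataFrame({"title": ["A", "B"], "text": ["x" * 150, "   "]})
true_toy = pd.DataFrame({"title": ["C", "D"], "text": ["y" * 200, None]})

corpus = build_corpus(fake_toy, true_toy)
print(corpus["label"].tolist())
```

Only the two articles with more than 100 characters survive the filter,
one per label.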

OUTPUT FILES:
    data/news_corpus.jsonl
    data/news_corpus.csv

This corpus will be used in subsequent labs for:
    • Text preprocessing (NLTK)
    • Sentiment analysis
    • Machine learning models
    • Format conversion tasks
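
A later lab could load the JSONL output roughly as follows. This is a
minimal sketch: it writes a two-row stand-in file to a temporary
directory (assumed data) so that it runs on its own. In practice the
path would be data/news_corpus.jsonl.

```python
import os
import tempfile

import pandas as pd

# Stand-in for data/news_corpus.jsonl so the sketch is self-contained.
sample = pd.DataFrame(
    {
        "title": ["T1", "T2"],
        "text": ["first article", "second article"],
        "label": [0, 1],
    }
)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "news_corpus.jsonl")
sample.to_json(path, orient="records", lines=True)

# Each line of a JSONL file is one JSON object, i.e. one article.
loaded = pd.read_json(path, lines=True)
print(loaded.shape)
print(loaded["label"].value_counts().to_dict())
```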

------------------------------------------------------------
Author: Mithil Pillai
Course: NLP Lab
Date: 22/02/2026

In [1]:
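import os

import pandas as pd

DATA_PATH = "data/"

# Load the two source files from the Kaggle dataset.
fake_df = pd.read_csv(os.path.join(DATA_PATH, "Fake.csv"))
true_df = pd.read_csv(os.path.join(DATA_PATH, "True.csv"))

# Assign binary labels: fake = 0, real = 1.
fake_df["label"] = 0
true_df["label"] = 1

# Merge into one corpus and keep only the columns needed downstream.
df = pd.concat([fake_df, true_df], ignore_index=True)
df = df[["title", "text", "label"]]

# Keep articles longer than 100 characters; rows with missing text
# compare as False and are dropped as well.
df = df[df["text"].str.len() > 100]

# Draw a reproducible random sample of 5000 articles.
df = df.sample(n=5000, random_state=42).reset_index(drop=True)

# Save as JSONL (one JSON object per line) and as CSV.
df.to_json(os.path.join(DATA_PATH, "news_corpus.jsonl"), orient="records", lines=True)
df.to_csv(os.path.join(DATA_PATH, "news_corpus.csv"), index=False)

print("Corpus created successfully!")
print(df.head())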

Corpus created successfully!
                                               title  \
0  German Social Democrats face pressure over coa...   
1  U.S. diplomatic delays, Trump agenda snarl Ita...   
2  Trump taps Retired General Kelly to lead Homel...   
3   Texas Republicans Cut Environmental Regulatio...   
4  Trump attacks FBI on leakers of Russia reports...   

                                                text  label  
0  BERLIN (Reuters) - Germany s Social Democrats ...      1  
1  ROME (Reuters) - Italy’s preparations for host...      1  
2  WASHINGTON (Reuters) - Republican U.S. Preside...      1  
3  A disgusting black sludge is coming out of res...      0  
4  WASHINGTON (Reuters) - U.S. President Donald T...      1  
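
Because the sample is drawn without regard to label, it may be worth
checking the class balance and a few basic quality indicators before
using the corpus. The helper `summarize_corpus` below is a hypothetical
name, and the toy frame is assumed data. With the real corpus, `df`
from the cell above would be passed instead.

```python
import pandas as pd


def summarize_corpus(df):
    """Return basic quality indicators for a labeled text corpus."""
    lengths = df["text"].str.len()
    return {
        "rows": len(df),
        "label_counts": df["label"].value_counts().sort_index().to_dict(),
        "min_chars": int(lengths.min()),
        "duplicate_texts": int(df["text"].duplicated().sum()),
    }


# Toy corpus with one repeated text (assumed data).
toy = pd.DataFrame(
    {
        "title": ["a", "b", "c"],
        "text": ["x" * 120, "y" * 300, "x" * 120],
        "label": [0, 1, 0],
    }
)
summary = summarize_corpus(toy)
print(summary)
```

A strongly skewed `label_counts` or a nonzero `duplicate_texts` would
suggest stratifying the sample or deduplicating before later labs.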
