# Create Wikipedia Topic Mix Dataset for CVDD
This notebook builds a new anomaly detection dataset by scraping Wikipedia articles for normal and anomalous topics.

In [1]:
!pip install wikipedia-api pandas numpy tqdm scikit-learn



#### 1. Setup/Library Imports and Configuration Parameters
- Installs and imports all required libraries for scraping (wikipedia-api), data manipulation (pandas, numpy), and progress tracking (tqdm).

- Sets up filesystem paths for saving outputs.

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import re
import random
from pathlib import Path
import wikipediaapi

# ---------- CONFIG ----------
DATA_DIR = Path("./data_wiki_mix")
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Categories to scrape 
CATEGORIES = ["Astronomy", "Cooking", "Politics", "Music", "History"]

# Caps to keep size manageable
MAX_PER_CATEGORY = 500     # limit number of pages per category
MAX_DEPTH = 1              # category recursion depth (0 or 1 is enough)

# Filtering
MIN_CHARS = 300            # drop very short/empty texts
USE_SUMMARY_FIRST = True   # summaries are shorter/cleaner

# Randomness
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

print("Config OK.")

Config OK.


#### 2. Wikipedia Client, Cleaning, and Fetching Helpers
- This part of the code connects to Wikipedia, cleans the text, and collects articles from the selected categories.
- It first imports the needed libraries: wikipediaapi to read Wikipedia pages and re for text cleaning. It then sets up a Wikipedia client (wiki) whose user-agent string includes a contact email, so that requests to Wikipedia are properly identified.

- The clean_text() function removes parenthesized passages (the parentheses and everything inside them), square-bracket characters, and URLs, collapses runs of whitespace into single spaces, and converts everything to lowercase so the text is clean and uniform.

- The fetch_category_articles() function goes to a given Wikipedia category (like "Astronomy") and collects article titles and text. With USE_SUMMARY_FIRST enabled, it fetches the summary and falls back to the full text when the summary is shorter than MIN_CHARS; otherwise it fetches the full text directly. It then cleans the text, skips short or repeated pages, and also descends into subcategories up to a set depth.

- In the end, it returns a list of cleaned (title, text) pairs ready to be used for building the dataset, and the message "Helpers ready" confirms that the cell ran without errors.
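
To make the cleaning rules concrete, here is a small offline sketch that applies the same four steps as clean_text() to a made-up sentence. The sample string and the example.org URL are illustrative assumptions, not data from the scrape.

```python
import re

def clean_text(text: str) -> str:
    """Same steps as the notebook helper, repeated so this runs standalone."""
    text = re.sub(r"\([^)]*\)", " ", text)            # drop parenthesized passages
    text = text.replace("[", " ").replace("]", " ")   # drop the bracket characters only
    text = re.sub(r"http\S+|www\.\S+", " ", text)     # drop URLs
    text = re.sub(r"\s+", " ", text).strip()          # collapse whitespace
    return text.lower()

# Hypothetical input, chosen to exercise every rule once
sample = "The Sun (Latin: Sol) is a star [1]. See https://example.org/sun  for more."
cleaned = clean_text(sample)
print(cleaned)  # the sun is a star 1 . see for more.
```

Only the bracket characters are removed, not what they enclose, so a citation marker such as [1] leaves a stray "1" in the output.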

In [3]:
import wikipediaapi
import re

#  Define Wikipedia client with a user-agent
wiki = wikipediaapi.Wikipedia(
    language="en",
    user_agent="CVDDResearchDataset/1.0 (pavaniaddagalla.2704@gmail.com)"  
)

def clean_text(text: str) -> str:
    """Removes parenthesized passages, bracket characters, URLs, and extra spaces, then lowercases."""
    text = re.sub(r"\([^)]*\)", " ", text)
    text = text.replace("[", " ").replace("]", " ")
    text = re.sub(r"http\S+|www\.\S+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

def fetch_category_articles(category_name: str, max_articles: int, max_depth: int = 1):
    """Recursively fetches Wikipedia pages from a given category."""
    cat_page = wiki.page(f"Category:{category_name}")
    if not cat_page.exists():
        print(f"[WARN] Category not found: {category_name}")
        return []

    pages = []
    seen_titles = set()

    def _crawl(cat, depth):
        nonlocal pages, seen_titles
        if depth > max_depth or len(pages) >= max_articles:
            return
        for member in cat.categorymembers.values():
            if len(pages) >= max_articles:
                break
            if member.ns == wikipediaapi.Namespace.MAIN:
                if member.title not in seen_titles:
                    # Mark the title as seen right away so that a page which is
                    # too short is not fetched again via another subcategory.
                    seen_titles.add(member.title)
                    text = member.summary if USE_SUMMARY_FIRST else member.text
                    if (not text or len(text) < MIN_CHARS) and USE_SUMMARY_FIRST:
                        text = member.text
                    text = clean_text(text or "")
                    if len(text) >= MIN_CHARS:
                        pages.append((member.title, text))
            elif member.ns == wikipediaapi.Namespace.CATEGORY:
                _crawl(member, depth + 1)

    _crawl(cat_page, depth=0)
    return pages

print("Helpers ready ")


Helpers ready 
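
The depth and size caps in fetch_category_articles() can be checked without network access. The sketch below mirrors the traversal logic on a toy tree; the Node class and the page names are hypothetical stand-ins for wikipediaapi page objects, and the text fetching and length filter are left out.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical stand-in for a Wikipedia page or category."""
    title: str
    is_category: bool = False
    members: list = field(default_factory=list)

def crawl(root, max_articles, max_depth):
    """Depth-limited traversal with the same stopping rules as the notebook's _crawl."""
    pages, seen = [], set()

    def _crawl(cat, depth):
        if depth > max_depth or len(pages) >= max_articles:
            return
        for member in cat.members:
            if len(pages) >= max_articles:
                break
            if member.is_category:
                _crawl(member, depth + 1)
            elif member.title not in seen:
                seen.add(member.title)
                pages.append(member.title)

    _crawl(root, 0)
    return pages

# Toy tree: Root -> Sub -> Sub-sub, with "Top page" listed twice
deep = Node("Sub-sub", True, [Node("Deep page")])
sub = Node("Sub", True, [Node("Sub page"), Node("Top page"), deep])
root = Node("Root", True, [Node("Top page"), sub])

print(crawl(root, 10, 0))  # ['Top page']
print(crawl(root, 10, 1))  # ['Top page', 'Sub page']
print(crawl(root, 1, 1))   # ['Top page']
```

With max_depth=1 the pages directly inside a subcategory are collected, but anything one level deeper ("Deep page") is not reached.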


#### 3. Fetching Articles and Creating the Dataset
- This part of the code collects the cleaned Wikipedia articles from each chosen category and combines them into a single dataset.
- It starts with an empty list called rows, then loops through each category name in CATEGORIES (like Astronomy, Cooking, etc.).
- For each category, it prints a message showing which one is being processed and uses the fetch_category_articles() function to get the articles.
- For every article found, it stores the title, cleaned text, and its category in the rows list.
- After all categories are processed, the list is converted into a pandas DataFrame called df, which becomes the main dataset.
- The call drop_duplicates(subset=["text"]) removes rows whose text exactly matches an earlier row (keeping the first occurrence), and reset_index(drop=True) renumbers the rows.
- Finally, it prints how many total articles were collected and shows the first few rows of the dataset to confirm that the data looks correct.

In [4]:
rows = []
for cat in CATEGORIES:
    print(f"\nFetching category: {cat}")
    pages = fetch_category_articles(cat, max_articles=MAX_PER_CATEGORY, max_depth=MAX_DEPTH)
    print(f"Fetched {len(pages)} pages from {cat}")
    for title, text in pages:
        rows.append({"title": title, "text": text, "category": cat})

df = pd.DataFrame(rows).drop_duplicates(subset=["text"]).reset_index(drop=True)
print("\n Total articles:", len(df))
print(df.head(3))



Fetching category: Astronomy
Fetched 500 pages from Astronomy

Fetching category: Cooking
Fetched 500 pages from Cooking

Fetching category: Politics
Fetched 500 pages from Politics

Fetching category: Music
Fetched 500 pages from Music

Fetching category: History
Fetched 500 pages from History

 Total articles: 2471
                   title                                               text  \
0  Glossary of astronomy  this glossary of astronomy is a list of defini...   
1   Outline of astronomy  the following outline is provided as an overvi...   
2   96P sungrazer family  the 96p sungrazer family is a small group of s...   

    category  
0  Astronomy  
1  Astronomy  
2  Astronomy  
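
The five categories return 500 pages each (2,500 in total), yet the DataFrame holds 2,471 rows, so 29 rows were dropped because their text matched an earlier row. The toy sketch below (made-up rows, not scraped data) shows how drop_duplicates(subset=["text"]) behaves when the same text appears under two categories.

```python
import pandas as pd

# Hypothetical rows: the same text is reachable from two categories
rows = [
    {"title": "A", "text": "shared text", "category": "Astronomy"},
    {"title": "B", "text": "only history", "category": "History"},
    {"title": "A", "text": "shared text", "category": "History"},
]

toy = pd.DataFrame(rows).drop_duplicates(subset=["text"]).reset_index(drop=True)
print(len(toy), toy["category"].tolist())  # 2 ['Astronomy', 'History']
```

Because the first occurrence is kept, a page shared between categories ends up labeled with whichever category comes first in CATEGORIES.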


In [5]:
FULL_CSV = DATA_DIR / "wikipedia_topic_mix.csv"
df.to_csv(FULL_CSV, index=False)
print(f"\n💾 Saved full dataset to: {FULL_CSV.resolve()}")


💾 Saved full dataset to: /Users/pavani/Library/CloudStorage/OneDrive-MacquarieUniversity/Session2_2025/Applications_of_Datascience/GithubClone/CVDD-PyTorch/data/data_wiki_mix/wikipedia_topic_mix.csv


#### 4. Splitting the Dataset into Train and Test Sets
- This part divides the collected dataset into training and testing sets while keeping a balanced number of samples for each category.
- It first creates two empty lists, train_rows and test_rows, to store the data for both splits.
- For each category in CATEGORIES, it takes all articles belonging to that category, randomly shuffles them using a fixed seed for reproducibility, and calculates 80% of the data for training.
- The first 80% of rows go into the training list, and the remaining 20% go into the test list.

- After processing all categories, the training and test parts are combined into two separate DataFrames (train_df and test_df), which are again shuffled to mix articles from all topics.
- These are saved as two CSV files, wikipedia_topic_mix_train.csv and wikipedia_topic_mix_test.csv, inside the data folder.

- Finally, the code prints how many samples are in each split and confirms that both files have been successfully saved.

In [6]:
# ---------- Split train/test ----------
# Random split per category to keep balance
train_rows, test_rows = [], []

for cat in CATEGORIES:
    cat_df = df[df["category"] == cat].sample(frac=1, random_state=RANDOM_SEED)
    n_train = int(0.8 * len(cat_df))
    train_rows.append(cat_df.iloc[:n_train])
    test_rows.append(cat_df.iloc[n_train:])

train_df = pd.concat(train_rows).sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)
test_df  = pd.concat(test_rows).sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

TRAIN_CSV = DATA_DIR / "wikipedia_topic_mix_train.csv"
TEST_CSV  = DATA_DIR / "wikipedia_topic_mix_test.csv"

train_df.to_csv(TRAIN_CSV, index=False)
test_df.to_csv(TEST_CSV, index=False)

print(f"\n Train set: {len(train_df)} | Test set: {len(test_df)}")
print(f" Saved to:\n  {TRAIN_CSV.resolve()}\n  {TEST_CSV.resolve()}")



 Train set: 1974 | Test set: 497
 Saved to:
  /Users/pavani/Library/CloudStorage/OneDrive-MacquarieUniversity/Session2_2025/Applications_of_Datascience/GithubClone/CVDD-PyTorch/data/data_wiki_mix/wikipedia_topic_mix_train.csv
  /Users/pavani/Library/CloudStorage/OneDrive-MacquarieUniversity/Session2_2025/Applications_of_Datascience/GithubClone/CVDD-PyTorch/data/data_wiki_mix/wikipedia_topic_mix_test.csv
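
The split sizes can be reproduced by hand. The per-category totals below are obtained by adding the train and test counts printed by the folder-export cell later in the notebook; applying int(0.8 * n) to each gives the reported 1974 / 497 split.

```python
# Per-category totals after de-duplication (train count + test count
# as printed by the export cell of this notebook)
totals = {"History": 494, "Cooking": 498, "Astronomy": 486,
          "Politics": 497, "Music": 496}

# int() truncates, so the train share is rounded down per category
split = {cat: (int(0.8 * n), n - int(0.8 * n)) for cat, n in totals.items()}

n_train = sum(train for train, _ in split.values())
n_test = sum(test for _, test in split.values())
print(split["Astronomy"])  # (388, 98)
print(n_train, n_test)     # 1974 497
```

Since the rounding happens once per category, the test set is slightly larger than 20% of the 2,471 rows.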


#### 5. Creating Synthetic Mixed Samples (Optional Step)
- This part of the code adds extra data by creating synthetic or mixed-category text samples, which can be useful for testing how well the model detects anomalies or mixed contexts.

- The function create_synthetic_mixes() takes a list of DataFrames (df_list), where each one represents a different category. It then creates new samples by combining short text pieces from all categories. For sample i, it takes the i-th article from each category, keeps up to max_chars_each characters from each, and joins them into a single mixed paragraph. The number of samples is capped at n_samples or the size of the smallest category, whichever is lower. Each mixed sample is labeled with the category "synthetic_mix" and stored in a new DataFrame.

- Next, the code prepares one DataFrame per category and calls the function to create 150 synthetic samples. These are then combined with the original dataset using pd.concat() to form an augmented dataset that includes both real and mixed samples. The combined dataset is shuffled and saved as wikipedia_topic_mix_augmented.csv.

- Finally, the script prints how many synthetic samples were generated, confirms that the file has been saved, and displays "Done." to indicate that the data augmentation process is complete.

In [7]:
#---------- Optional: Create Synthetic Mixes ----------
def create_synthetic_mixes(df_list, n_samples=100, max_chars_each=500):
    """Creates synthetic mixed-text samples from different categories."""
    if len(df_list) < 2:
        return pd.DataFrame()
    syn = []
    for i in range(min(n_samples, min(len(d) for d in df_list))):
        texts = [d.iloc[i]["text"][:max_chars_each] for d in df_list]
        mixed = " ".join(texts)
        syn.append({"title": f"SYN_{i}", "text": mixed, "category": "synthetic_mix"})
    return pd.DataFrame(syn)


dfs = [df[df["category"] == cat].reset_index(drop=True) for cat in CATEGORIES]
synthetic_df = create_synthetic_mixes(dfs, n_samples=150)

AUG_CSV = DATA_DIR / "wikipedia_topic_mix_augmented.csv"
aug_df = pd.concat([df, synthetic_df], ignore_index=True).sample(frac=1, random_state=RANDOM_SEED)
aug_df.to_csv(AUG_CSV, index=False)

print(f"\n Synthetic samples created: {len(synthetic_df)}")
print(f" Saved augmented dataset to: {AUG_CSV.resolve()}")
print("Done.")


 Synthetic samples created: 150
 Saved augmented dataset to: /Users/pavani/Library/CloudStorage/OneDrive-MacquarieUniversity/Session2_2025/Applications_of_Datascience/GithubClone/CVDD-PyTorch/data/data_wiki_mix/wikipedia_topic_mix_augmented.csv
Done.
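
To see what a synthetic sample looks like, the sketch below runs a copy of create_synthetic_mixes() on two tiny made-up DataFrames. The toy texts and the very small max_chars_each are assumptions chosen to keep the output readable.

```python
import pandas as pd

def create_synthetic_mixes(df_list, n_samples=100, max_chars_each=500):
    """Copy of the notebook function so the sketch runs standalone."""
    if len(df_list) < 2:
        return pd.DataFrame()
    syn = []
    for i in range(min(n_samples, min(len(d) for d in df_list))):
        texts = [d.iloc[i]["text"][:max_chars_each] for d in df_list]
        syn.append({"title": f"SYN_{i}", "text": " ".join(texts),
                    "category": "synthetic_mix"})
    return pd.DataFrame(syn)

# Hypothetical category frames with 3 and 2 rows
toy_a = pd.DataFrame({"text": ["sun is hot", "moon is cold", "mars is red"]})
toy_b = pd.DataFrame({"text": ["egg is round", "tea is warm"]})

mix = create_synthetic_mixes([toy_a, toy_b], n_samples=5, max_chars_each=3)
print(mix["text"].tolist())  # ['sun egg', 'moo tea']
```

Two things stand out: the sample count is limited by the smallest frame (2, not 5), and the character-based cut can end mid-word ("moo"). Rows are paired by position, so in the notebook the 150 mixes come from the first 150 articles of each category in crawl order.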


#### 6. Exporting Text Files into Category Folders

- This part of the code organizes and saves the dataset into a structured folder format so that each article is stored as a separate .txt file inside its corresponding category folder.

- The export_to_folders() function takes three inputs: the base directory (base_dir), the DataFrame to export (df), and the name of the split (split_name, such as "train" or "test"). It first creates a main folder for the split (for example, data_wiki_mix/train/) and then makes a subfolder for each category (like Astronomy, Cooking, etc.).

- For every category, it loops through all the articles, creates safe filenames by replacing each run of characters other than letters, digits, and underscores with a single underscore (truncated to 100 characters), and saves each article's cleaned text as a separate .txt file. This ensures that each text document can be easily accessed later for training or evaluation.

- After defining the function, it is called twice, once for the training set and once for the test set, to export all data into their respective folders. The message "Text export complete." confirms that the export loop finished for both splits.

In [8]:
import re
from pathlib import Path

# ---------- Create folder structure and save text files ----------
def export_to_folders(base_dir, df, split_name):
    """
    Exports texts into subfolders per category.
    Each article becomes its own .txt file.
    """
    split_dir = Path(base_dir) / split_name
    split_dir.mkdir(parents=True, exist_ok=True)
    
    for cat in df["category"].unique():
        cat_dir = split_dir / cat
        cat_dir.mkdir(parents=True, exist_ok=True)
        
        cat_df = df[df["category"] == cat].reset_index(drop=True)
        print(f"Saving {len(cat_df)} texts to {cat_dir} ...")
        
        for i, row in cat_df.iterrows():
            # Create safe file name (remove slashes or illegal chars)
            safe_title = re.sub(r"[^A-Za-z0-9_]+", "_", row["title"])[:100]
            file_path = cat_dir / f"{safe_title or f'article_{i}'}.txt"
            
            with open(file_path, "w", encoding="utf-8") as f:
                f.write(row["text"])

# Export both train and test splits
export_to_folders(DATA_DIR, train_df, "train")
export_to_folders(DATA_DIR, test_df, "test")

print("\nText export complete.")


Saving 395 texts to data_wiki_mix/train/History ...
Saving 398 texts to data_wiki_mix/train/Cooking ...
Saving 388 texts to data_wiki_mix/train/Astronomy ...
Saving 397 texts to data_wiki_mix/train/Politics ...
Saving 396 texts to data_wiki_mix/train/Music ...
Saving 99 texts to data_wiki_mix/test/History ...
Saving 98 texts to data_wiki_mix/test/Astronomy ...
Saving 100 texts to data_wiki_mix/test/Politics ...
Saving 100 texts to data_wiki_mix/test/Cooking ...
Saving 100 texts to data_wiki_mix/test/Music ...

Text export complete.
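
One caveat with the filename rule: two different titles can map to the same safe name, in which case the later file silently overwrites the earlier one. Whether that happened in this run is not visible from the output. The sketch below shows the collision and one possible way around it; unique_names() is a hypothetical helper, not part of the notebook.

```python
import re

def safe_name(title: str) -> str:
    """Same substitution and 100-character cap as in export_to_folders()."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", title)[:100]

def unique_names(titles):
    """Hypothetical helper: append a counter when a safe name is already taken."""
    used, out = set(), []
    for title in titles:
        base = safe_name(title) or "article"
        name, k = base, 1
        while name in used:
            name = f"{base}_{k}"
            k += 1
        used.add(name)
        out.append(name)
    return out

# Made-up titles: "C++" and "C#" collapse to the same safe name
print(safe_name("C++"), safe_name("C#"))  # C_ C_
names = unique_names(["C++", "C#", "AC/DC"])
print(names)  # ['C_', 'C__1', 'AC_DC']
```

Comparing the number of .txt files in each folder with the counts printed above is a quick way to check whether any files were overwritten.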


In [9]:
import pandas as pd
train = pd.read_csv("./data_wiki_mix/wikipedia_topic_mix_train.csv")
print(train.head())
print("\nCategories:", train['category'].unique())

                            title  \
0      Venus tablet of Ammisaduqa   
1          Counterfactual history   
2                  Pastry blender   
3  List of astronomical societies   
4          History of agriculture   

                                                text   category  
0  the venus tablet of ammisaduqa is the record o...    History  
1  counterfactual history is a form of historiogr...    History  
2  a pastry blender, or pastry cutter, is a devic...    Cooking  
3  a list of notable groups devoted to promoting ...  Astronomy  
4  agriculture began independently in different p...    History  

Categories: ['History' 'Cooking' 'Astronomy' 'Politics' 'Music']
