# About

Here we collect our dataset for detecting AI-generated content. 

-------------------

A list of possible domains to fetch content from:

- Politics
    - [German Bundestag](https://www.bundestag.de/) ✔
    - [House of Commons](https://reshare.ukdataservice.ac.uk/854292/) ✔
- Student/School
    - [Kaggle Student Essays](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/data?select=train.csv) ✔
    - Take the English dataset and translate it?
- Research
    - [Arxiv](https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=language+model&terms-0-field=all&classification-physics_archives=all&classification-include_cross_list=include&date-year=&date-filter_by=date_range&date-from_date=2005&date-to_date=2020&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first) ✔
    - [I recently saw a paper that generates fully AI-written papers. Maybe I can use it.](https://www.kaggle.com/discussions/general/527817#2958636)
- News
    - [Spiegel Online](https://www.spiegel.de/) ✔
    - [CNN Articles](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail) ✔
- Blogs/Tutorials/Forums
- Law
    - [German Open Legal Data](https://de.openlegaldata.io/) ✔
    - [European Court of Human Rights Cases](https://www.kaggle.com/datasets/mathurinache/ecthrnaacl2021/data) ✔
- Philosophy
    - Gutenberg Project ([ENG](https://www.gutenberg.org/) | [GER](https://www.projekt-gutenberg.org/)) ✔
- Literature
    - Gutenberg Project ([ENG](https://www.gutenberg.org/) | [GER](https://www.projekt-gutenberg.org/)) ✔
- Blogs, Food and Lifestyle
    - [Food Blogs](https://detailed.com/food-blogs/)
    - [WebBlogs](https://www.kaggle.com/datasets/rtatman/blog-authorship-corpus?select=blogtext.csv) (ENG) ✔
- Religion
    - [Bible](https://github.com/mxw/grmr/blob/master/src/finaltests/bible.txt) (ENG|GER) ✔
- Gaming

Interesting Languages:

- English
- German

In [1]:
class CONFIG:
    SRC_ROOT_PATH = '/home/staff_homes/kboenisc/home/prismAI/PrismAI/src'
    SRC_ROOT_PATH_COLL = '/home/staff_homes/kboenisc/home/prismAI/PrismAI/src/data_collector'
    DATA_ROOT_PATH = '/storage/corpora/prismAI/data'

In [2]:
import sys
import importlib
import os
import concurrent.futures
import threading
import queue
import pandas as pd

from dotenv import load_dotenv
from tqdm.notebook import tqdm

# Add the source folders to sys.path so that the local imports below resolve.
sys.path.insert(0, CONFIG.SRC_ROOT_PATH)
sys.path.insert(0, CONFIG.SRC_ROOT_PATH_COLL)

load_dotenv()

True

In [3]:
import collector
import collected_item
import collectors.bundestag_collector
import collectors.house_of_commons_collector
import collectors.student_essays_collector
import collectors.arxiv_collector
import collectors.spiegel_collector
import collectors.cnn_news_collector
import collectors.open_legal_data_collector
import collectors.euro_court_cases_collector
import collectors.religion_collector
import collectors.gutenberg_collector
import collectors.blog_corpus_collector

import data_collector.agents.ai_agent
import data_collector.agents.openai_agent

def reload():
    importlib.reload(collector)
    importlib.reload(collected_item)
    importlib.reload(collectors.bundestag_collector)
    importlib.reload(collectors.house_of_commons_collector)
    importlib.reload(collectors.student_essays_collector)
    importlib.reload(collectors.arxiv_collector)
    importlib.reload(collectors.spiegel_collector)
    importlib.reload(collectors.cnn_news_collector)
    importlib.reload(collectors.open_legal_data_collector)
    importlib.reload(collectors.euro_court_cases_collector)
    importlib.reload(collectors.religion_collector)
    importlib.reload(collectors.gutenberg_collector)
    importlib.reload(collectors.blog_corpus_collector)

    importlib.reload(data_collector.agents.ai_agent)
    importlib.reload(data_collector.agents.openai_agent)

reload()
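
The import list and `reload()` above name every module twice, so the two can drift apart. A minimal sketch of one way to avoid that is to keep a single list of modules and reload it in a loop. `MODULES` and `reload_all` are hypothetical names, and the standard-library modules below are only stand-ins for the collector and agent modules.

```python
import importlib
import json
import textwrap

# Stand-in modules (assumption): in the notebook this list would hold the
# collector and agent modules imported above.
MODULES = [json, textwrap]


def reload_all(modules):
    """Reload every module in order and return the reloaded module objects."""
    return [importlib.reload(module) for module in modules]


reloaded = reload_all(MODULES)
print([module.__name__ for module in reloaded])
```

Adding a new collector would then only need one new entry in the list.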

## Init the Collectors

In [4]:

collection = [
    collectors.bundestag_collector.BundestagCollector(CONFIG.DATA_ROOT_PATH),
    collectors.house_of_commons_collector.HouseOfCommonsCollector(CONFIG.DATA_ROOT_PATH),
    collectors.student_essays_collector.StudentEssaysCollector(CONFIG.DATA_ROOT_PATH),
    collectors.arxiv_collector.ArxivCollector(CONFIG.DATA_ROOT_PATH),
    collectors.spiegel_collector.SpiegelCollector(CONFIG.DATA_ROOT_PATH),
    collectors.cnn_news_collector.CNNNewsCollector(CONFIG.DATA_ROOT_PATH),
    #collectors.open_legal_data_collector.OpenLegalDataCollector(CONFIG.DATA_ROOT_PATH),
    collectors.euro_court_cases_collector.EuroCourtCasesCollector(CONFIG.DATA_ROOT_PATH),
    #collectors.religion_collector.ReligionCollector(CONFIG.DATA_ROOT_PATH),
    collectors.gutenberg_collector.GutenbergCollector(CONFIG.DATA_ROOT_PATH),
    collectors.blog_corpus_collector.BlogCorpusCollector(CONFIG.DATA_ROOT_PATH)
]

In [5]:
total_items = 0

for coll in collection:
    try:
        coll.init()
        coll.collect()
        total_items += coll.get_count()
    except Exception as ex:
        print('ERROR: Current collection failed due to an error: ')
        print(ex)
        print('\n ***** Continuing with the other collectors. ***** \n')

print('\n\n ==================================================== \n\n')
print(f'All collectors finished. Total data items: {total_items}')




33949 speeches were already collected at 2024-12-22 16:07:26.476688, hence we skip a redundant collection.
If you'd like to collect anyway, set the force variable to True.



53814 speeches were already collected at 2024-11-22 12:47:58.466366, hence we skip a redundant collection.
If you'd like to collect anyway, set the force variable to True.



115372 essays were already collected at 2024-11-22 13:02:19.983103, hence we skip a redundant collection.
If you'd like to collect anyway, set the force variable to True.



9947 papers were already collected at 2024-11-23 15:24:25.733882, hence we skip a redundant collection.
If you'd like to collect anyway, set the force variable to True.



105281 articles were already collected at 2024-11-23 05:43:32.662209, hence we skip a redundant collection.
If you'd like to collect anyway, set the force variable to True.



287226 news articles were already collected at 2024-11-24 18:53:06.344810, hence we skip a redundant collection.
If you'd lik

[nltk_data] Downloading package punkt to
[nltk_data]     /home/staff_homes/kboenisc/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/staff_homes/kboenisc/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
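
The log messages above show that a collector skips work it has already done unless it is forced. The collectors' real implementation is not shown in this notebook; the following is only a sketch of that idea using a marker file. `collect_once`, the `collected.json` file name, and `fake_fetch` are all hypothetical.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def collect_once(folder, fetch, force=False):
    """Run `fetch` unless a marker file shows the data was already collected."""
    marker = Path(folder) / "collected.json"  # hypothetical marker file name
    if marker.exists() and not force:
        meta = json.loads(marker.read_text())
        print(f"{meta['count']} items were already collected at "
              f"{meta['collected_at']}, skipping.")
        return meta["count"]

    items = fetch()
    meta = {"count": len(items), "collected_at": str(datetime.now())}
    marker.write_text(json.dumps(meta))
    return len(items)


calls = []


def fake_fetch():
    """Stand-in for a real download; records how often it was called."""
    calls.append(1)
    return ["a", "b", "c"]


with tempfile.TemporaryDirectory() as folder:
    first = collect_once(folder, fake_fetch)              # fetches, writes marker
    second = collect_once(folder, fake_fetch)             # skipped, marker exists
    third = collect_once(folder, fake_fetch, force=True)  # fetches again

print(len(calls))
```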


# Generate the AI Content

After we've collected the dataset, we want to create its AI-generated counterpart. We do this on multiple levels:

- Injecting passages of AI-generated content
- Rewriting the whole text as an AI agent
- Trying different models?
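
To make the first level concrete, here is a rough sketch of how a passage could be injected into a text that is already split into sentences. This is not the agents' actual method: `inject_passage` is a hypothetical helper, and `generate` stands in for a model call that receives the preceding sentences and returns replacement sentences.

```python
import random


def inject_passage(sentences, generate, span=2, seed=0):
    """Replace `span` consecutive sentences with text from `generate`.

    Returns the mixed sentence list and the (start, end) indices of the
    injected span, which can later serve as a label for detection.
    """
    if len(sentences) <= span:
        raise ValueError("text is too short to inject a passage")

    rng = random.Random(seed)
    # Keep at least the first sentence human-written as context.
    start = rng.randrange(1, len(sentences) - span + 1)
    replacement = generate(sentences[:start])
    mixed = sentences[:start] + replacement + sentences[start + span:]
    return mixed, (start, start + len(replacement))


def fake_generate(context):
    """Stand-in for a model call (assumption): returns one fixed sentence."""
    return ["[AI] A generated sentence."]


sentences = ["One.", "Two.", "Three.", "Four.", "Five."]
mixed, (start, end) = inject_passage(sentences, fake_generate)
print(mixed[start:end])
```

Returning the span indices keeps track of which part of the mixed text is synthetic.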

In [6]:
agents = [
    data_collector.agents.openai_agent.OpenAIAgent(name='gpt-4o-mini', api_key=os.getenv('OPENAI_API_KEY'))
]

For each agent, we go through all collectors and synthesize their texts.

In [7]:
take_per_collector = 2000
force_synth = True

progress_queue = queue.Queue()

I've never done parallelization in Python before, but here we go!

In [8]:
def process_one_collector(coll, agents, take_per_collector, force_synth=False):
    '''
    Processes a single collector, chunk by chunk, and sends progress updates
    to the global progress_queue after each chunk.
    '''
    
    items_dfs = coll.get_collected_items_dfs()
    items_count = 0
    df_count = 1
    
    for df in items_dfs:
        stored_path = os.path.join(coll.get_synthesized_path(), f"items_{df_count}.json")
        
        # If chunk already exists & not forcing, skip it
        if not force_synth and os.path.exists(stored_path):
            df_count += 1
            continue
        
        if items_count >= take_per_collector:
            break
        
        synth_items = []
        
        for index, row in df.iterrows():
            if items_count >= take_per_collector:
                break
            
            # Build the item
            item = collected_item.CollectedItem.from_dict(row)
            
            # For each agent, do the synth
            synthesized = False
            for agent in agents:
                try:
                    item.synthetization.append({
                        'agent': agent.name,
                        'synth_obj': agent.synthesize_collected_item(item, coll),
                    })
                    synthesized = True
                except Exception as ex:
                    print(f"Error in collector {coll.get_folder_path()}, chunk {df_count}: {ex}")

            # Store and count the item once, no matter how many agents succeeded
            if synthesized:
                synth_items.append(item)
                items_count += 1
        
        # Save chunk
        out_df = pd.DataFrame([item.__dict__ for item in synth_items])
        out_df.to_json(stored_path)
        df_count += 1
        
        # ---- SEND PROGRESS UPDATE ----
        # e.g. "I finished 1 chunk"
        progress_queue.put((coll.get_folder_path(), 1))

    return coll.get_folder_path()

In [9]:
def monitor_progress(num_chunks_total):
    '''
    Listens on the queue for chunk updates. We track how many chunks have finished out of total, 
    and update a single progress bar (that's the plan at least).
    '''
    with tqdm(total=num_chunks_total, desc='All Chunks', position=0) as pbar:
        chunks_done = 0
        
        while chunks_done < num_chunks_total:
            # Block until an update arrives
            folder_path, chunk_count = progress_queue.get()

            # Update progress bar
            pbar.update(chunk_count)
            chunks_done += chunk_count
            pbar.set_postfix_str(f'Last update from: {folder_path}')
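
The monitor loop above exits only once `chunks_done` reaches `num_chunks_total`, and chunks that are skipped or never reached do not report anything. One alternative, sketched here with hypothetical names (`DONE`, `worker`, `monitor`), is to end the loop with a sentinel that the main thread sends once all workers have finished.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

DONE = object()  # sentinel: tells the monitor that no more updates will come


def worker(name, chunks, updates):
    """Pretend to process `chunks` chunks, reporting each one on the queue."""
    for _ in range(chunks):
        updates.put((name, 1))
    return name


def monitor(updates, seen):
    """Consume updates until the sentinel arrives, counting chunks per worker."""
    while True:
        update = updates.get()
        if update is DONE:
            break
        name, count = update
        seen[name] = seen.get(name, 0) + count


updates = queue.Queue()
seen = {}
monitor_thread = threading.Thread(target=monitor, args=(updates, seen), daemon=True)
monitor_thread.start()

with ThreadPoolExecutor(max_workers=4) as executor:
    # Worker "b" reports nothing, like a collector whose chunks are all skipped.
    jobs = [("a", 3), ("b", 0), ("c", 2)]
    futures = [executor.submit(worker, name, chunks, updates) for name, chunks in jobs]
    finished = [future.result() for future in futures]

updates.put(DONE)      # all workers are done, so release the monitor
monitor_thread.join()  # returns even though "b" never reported a chunk
print(seen)
```

With a sentinel, the monitor no longer needs to know the total number of chunks in order to terminate; the total is only needed for the progress bar itself.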

In [None]:
# 1) Figure out how many total chunks exist across all collectors
# so we know how many times we'll update the bar in total.
num_chunks_total = 0
for coll in collection:
    num_chunks_total += len(coll.get_collected_items_dfs())

# 2) Start the monitor thread that updates the tqdm bar in the background
# (Maybe this is wasted effort, but I really like a real-time update)
monitor_thread = threading.Thread(target=monitor_progress, args=(num_chunks_total,), daemon=True)
monitor_thread.start()

# 3) Launch parallel tasks
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    futures = {
        executor.submit(process_one_collector, coll, agents, take_per_collector, force_synth): coll
        for coll in collection
    }

    # wait for them to complete
    for future in concurrent.futures.as_completed(futures):
        coll = futures[future]
        try:
            res = future.result()
        except Exception as ex:
            print(f"Collector {coll.get_folder_path()} failed: {ex}")

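# 4) Now that all collectors are done, we know we won't get more queue updates.
# Skipped chunks and failed collectors never report progress, so the monitor may
# never reach its total. Join with a timeout instead of blocking forever
# (the thread is a daemon, so it won't keep the process alive).
monitor_thread.join(timeout=5)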

print("All collectors & chunks are done!")

All Chunks:   0%|          | 0/1954 [00:00<?, ?it/s]

Collector blog_authorship_corpus failed: [Errno 28] No space left on device
Collector bundestag failed: [Errno 28] No space left on device
Collector gutenberg failed: [Errno 28] No space left on device
Collector spiegel_articles failed: [Errno 28] No space left on device
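
The `[Errno 28] No space left on device` errors above only surface after work has already been done on a chunk. As a sketch, a small pre-flight check with `shutil.disk_usage` could stop a collector before it synthesizes a chunk it cannot save. `has_free_space` and the threshold are hypothetical, and the temp directory is a stand-in for the synthesized-output folder.

```python
import shutil
import tempfile


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` has enough free bytes."""
    return shutil.disk_usage(path).free >= min_free_bytes


target = tempfile.gettempdir()  # stand-in for the synthesized-output folder

if has_free_space(target, min_free_bytes=1024):
    print("Enough space, safe to write the next chunk.")
else:
    print("Low on disk space, stopping before the next chunk.")
```

A sensible threshold depends on the chunk size, which this notebook does not state.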
