# Data processing

In this notebook we will combine the different data sources collected for the project.

We had to add external data sources to our initial ones because the original dataset was composed almost entirely of AI-generated text.

## Imports

In [1]:
# Set root path
import sys

sys.path.append("..")

import os
import re
import logging
from typing import List

logger = logging.getLogger(__name__)

import polars as pl
from cfg import CFG

from data import *
from nlp import *

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


## Check sources

In [2]:
for source in [
    "train_prompts.csv",
    "machine-dev.csv",
    "machine-test.csv",
    "machine-train.csv",
    "sample_submission.csv",
    "train_drcat_01.csv",
    "train_drcat_02.csv",
    "train_drcat_03.csv",
    "train_drcat_04.csv",
    "train_essays.csv",
    "argugpt.csv",
    "essay_forum_real.csv",
    "test",
    "drcat_v3.csv",
    "ivypanda.csv",
]:
    if source not in os.listdir(CFG.data_dir):
        raise FileNotFoundError(f"{source} not found in {CFG.data_dir}")
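
The loop above stops at the first missing file and lists the directory again for every name. A minimal sketch of an alternative, assuming the data directory is a plain folder, lists it once and reports every missing source together. `find_missing_sources` is a hypothetical helper, not part of the project's modules, and the temporary directory only stands in for `CFG.data_dir`.

```python
import os
import tempfile
from typing import Iterable, List


def find_missing_sources(data_dir: str, expected: Iterable[str]) -> List[str]:
    """Return the expected entries that are absent from data_dir, sorted."""
    present = set(os.listdir(data_dir))  # list the directory once
    return sorted(name for name in expected if name not in present)


# Demo on a throwaway directory (a stand-in for CFG.data_dir).
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "train_essays.csv"), "w").close()
    missing = find_missing_sources(
        demo_dir, ["train_essays.csv", "argugpt.csv", "test"]
    )
    print(missing)
```

Raising a single `FileNotFoundError` with the full `missing` list would then show everything that needs to be downloaded in one pass.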

## Merge sources

In [3]:
sources = load_and_merge_sources()

In [4]:
# Print the value counts of each column
for col in sources.columns:
    print(sources[col].value_counts())

shape: (228_699, 2)
┌─────────────────────────────────┬───────┐
│ text                            ┆ count │
│ ---                             ┆ ---   │
│ str                             ┆ u32   │
╞═════════════════════════════════╪═══════╡
│ "In 'The Challenge of Explorin… ┆ 1     │
│ Nursing Community Based Interv… ┆ 1     │
│ Dear ____ State Senator,        ┆ 1     │
│                                 ┆       │
│ I am…                           ┆       │
│ DNA Evidence: The Case of the … ┆ 1     │
│ Concreteness of Words and Free… ┆ 1     │
│ …                               ┆ …     │
│ Culture and Background Effect … ┆ 1     │
│ The Role of Individuals in Soc… ┆ 1     │
│ Thank you so much! Prompt:Beyo… ┆ 1     │
│ As a NASA scientist, I have ha… ┆ 1     │
│ Lessons learnt from Les Misera… ┆ 1     │
└─────────────────────────────────┴───────┘
shape: (2, 2)
┌───────────┬────────┐
│ generated ┆ count  │
│ ---       ┆ ---    │
│ i8        ┆ u32    │
╞═══════════╪════════╡
│ 1         ┆ 2

In [5]:
sources

| text | generated | source |
| --- | --- | --- |
| str | i8 | str |
| "There are a variety of opinion…" | 1 | "machine-dev.csv" |
| "The university education is no…" | 1 | "machine-dev.csv" |
| "I believe that the university …" | 1 | "machine-dev.csv" |
| "University education is a topi…" | 1 | "machine-dev.csv" |
| "The purpose of university educ…" | 1 | "machine-dev.csv" |
| … | … | … |
| "There has been a fuss about th…" | 0 | "train_essays.csv" |
| "Limiting car usage has many ad…" | 0 | "train_essays.csv" |
| "There's a new trend that has b…" | 0 | "train_essays.csv" |
| "As we all know cars are a big …" | 0 | "train_essays.csv" |


## Tokenize

In [6]:
# get random text
text = sources["text"].sample(1).to_list()[0]
text

'People v. O’Neil Supreme Court Desicion Report\n\nFacts: On February 10, 1983, Stefan Golab, a worker at Film Recovery, fell ill while performing his duties. He was rushed to the hospital after losing consciousness and foaming at the mouth. Upon arrival at the hospital, however, he was pronounced dead and an autopsy was performed to determine the cause of his death. The autopsy results indicated that Golab had died of acute cyanide poisoning that he had inhaled while working at the company’s plant. The defendant, Steven O’Neil who was a senior manager at the company was charged with murder together with two fellow managers. The grand jury argued that as senior officials at the company, the three individuals had knowingly created an environment that led to Golab’s death by failing to advise and train him on the dangerous chemicals and provide him with protective equipment. Likewise, the company, Film Recovery, and her sister company, Metallic Marketing, were charged with involuntary ma

In [7]:
tokens = tokenize(text, remove_stopwords=True, lemmatize=True)
print(len(tokens))
tokens

Token indices sequence length is longer than the specified maximum sequence length for this model (704 > 512). Running this sequence through the model will result in indexing errors


384


['[CLS]',
 'people',
 'v',
 'neil',
 'supreme',
 'court',
 'de',
 '##icio',
 '##n',
 'report',
 'fact',
 'february',
 '10',
 '1983',
 'stefan',
 'go',
 '##lab',
 'worker',
 'film',
 'recovery',
 'fell',
 'ill',
 'performing',
 'duty',
 'rushed',
 'hospital',
 'losing',
 'consciousness',
 'foam',
 '##ing',
 'mouth',
 'upon',
 'arrival',
 'hospital',
 'however',
 'pronounced',
 'dead',
 'autopsy',
 'performed',
 'determine',
 'cause',
 'death',
 'autopsy',
 'result',
 'indicated',
 'go',
 '##lab',
 'died',
 'acute',
 'cy',
 '##ani',
 '##de',
 'poisoning',
 'inhaled',
 'working',
 'company',
 'plant',
 'defendant',
 'steven',
 'neil',
 'senior',
 'manager',
 'company',
 'charged',
 'murder',
 'together',
 'two',
 'fellow',
 'manager',
 'grand',
 'jury',
 'argued',
 'senior',
 'official',
 'company',
 'three',
 'individual',
 'knowing',
 '##ly',
 'created',
 'environment',
 'led',
 'go',
 '##lab',
 'death',
 'failing',
 'advise',
 'train',
 'dangerous',
 'chemical',
 'provide',
 'protectiv
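
The `##` prefixes in the list above mark sub-word continuation pieces, which is how WordPiece-style tokenizers split words that are not in their vocabulary as a whole. The project's `tokenize` function is not shown, so the following is only a sketch of greedy longest-match-first splitting over a toy vocabulary made of pieces visible in the output. `wordpiece` and `toy_vocab` are hypothetical names.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one lowercase word by greedy longest-match-first lookup."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary built from pieces visible in the output above.
toy_vocab = {"go", "##lab", "foam", "##ing", "cy", "##ani", "##de", "worker"}

for word in ["golab", "foaming", "cyanide", "worker"]:
    print(word, wordpiece(word, toy_vocab))
```

With this toy vocabulary, `golab` splits into `go` and `##lab`, matching the pieces shown for the name Golab.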

In [8]:
tokens_encoded = encode(tokens)
tokens_encoded

[101,
 2111,
 1058,
 6606,
 4259,
 2457,
 2139,
 27113,
 2078,
 3189,
 2755,
 2337,
 2184,
 3172,
 8852,
 2175,
 20470,
 7309,
 2143,
 7233,
 3062,
 5665,
 4488,
 4611,
 6760,
 2902,
 3974,
 8298,
 17952,
 2075,
 2677,
 2588,
 5508,
 2902,
 2174,
 8793,
 2757,
 24534,
 2864,
 5646,
 3426,
 2331,
 24534,
 2765,
 5393,
 2175,
 20470,
 2351,
 11325,
 22330,
 7088,
 3207,
 16149,
 15938,
 2551,
 2194,
 3269,
 13474,
 7112,
 6606,
 3026,
 3208,
 2194,
 5338,
 4028,
 2362,
 2048,
 3507,
 3208,
 2882,
 6467,
 5275,
 3026,
 2880,
 2194,
 2093,
 3265,
 4209,
 2135,
 2580,
 4044,
 2419,
 2175,
 20470,
 2331,
 7989,
 18012,
 3345,
 4795,
 5072,
 3073,
 9474,
 3941,
 10655,
 2194,
 2143,
 7233,
 2905,
 2194,
 12392,
 5821,
 5338,
 26097,
 2158,
 17298,
 13900,
 2121,
 6685,
 2194,
 4895,
 18447,
 4765,
 19301,
 2135,
 7129,
 2175,
 20470,
 11265,
 25394,
 11461,
 2552,
 2472,
 2804,
 4028,
 26097,
 2158,
 17298,
 13900,
 2121,
 3715,
 6467,
 2036,
 5338,
 3265,
 5971,
 13474,
 18555,
 6204,
 3277,
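
The project's `encode` function is not shown either. Comparing the two outputs suggests that it maps each token to an integer id from the tokenizer vocabulary. A minimal sketch under that assumption follows, using token/id pairs copied from the outputs above and capping the length at 512, the limit named in the earlier warning. `encode_tokens`, `toy_vocab`, and the `UNK_ID` fallback are hypothetical.

```python
from typing import Dict, List

# Token/id pairs copied from the two outputs above; a real vocabulary is far larger.
toy_vocab: Dict[str, int] = {
    "[CLS]": 101,
    "people": 2111,
    "v": 1058,
    "neil": 6606,
    "supreme": 4259,
    "court": 2457,
    "go": 2175,
    "##lab": 20470,
    "worker": 7309,
}
UNK_ID = 100  # assumption: id used for tokens missing from this toy vocabulary


def encode_tokens(
    tokens: List[str], vocab: Dict[str, int], max_len: int = 512
) -> List[int]:
    """Look up each token's id and keep at most max_len ids."""
    ids = [vocab.get(token, UNK_ID) for token in tokens]
    return ids[:max_len]


print(encode_tokens(["[CLS]", "people", "v", "neil", "supreme", "court"], toy_vocab))
```

Truncating after the lookup keeps the sequence within the model limit, at the cost of dropping the tail of long essays.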