<a href="https://colab.research.google.com/github/Amanda9805/Detecting-Machine-Generated-Texts/blob/development/COS_760_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (1

In [8]:
import pandas as pd
from huggingface_hub import login

from datasets import load_dataset


#shona_data = load_dataset("DigitalUmuganda/AfriVoice", "sn", streaming=True, split="train[:10%]")


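# Stream the English split of OSCAR so the full corpus is never downloaded.
english_data = load_dataset("oscar-corpus/OSCAR-2201", language="en", streaming=True)

# Keep only the first `num_samples` documents from the stream.
num_samples = 1000
samples = []
for i, example in enumerate(english_data["train"]):
    if i >= num_samples:
        break
    samples.append(example['text'])

# One row per document, with the raw text in a single column.
eng_df = pd.DataFrame(samples, columns=['text'])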
#print(eng_df.head())

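# Load the English-isiZulu sentence-aligned Vukuzenzele corpus.
zulu_data = load_dataset("dsfsi/vukuzenzele-sentence-aligned", "eng-zul", streaming=True)
# Building the DataFrame iterates over every record in the streamed split.
zul_df = pd.DataFrame(zulu_data["train"])
# Give the leftover index column a readable name.
zul_df = zul_df.rename(columns={'__index_level_0__': 'sentence_id'})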
#print(zul_df.head())
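
The sampling loop above stops after `num_samples` records. The same pattern can be written with `itertools.islice`, which works on any iterable, including a streaming split. A minimal sketch, assuming each record is a dict with a `text` field; `take_texts` and `fake_stream` are hypothetical names used only for illustration, and the generator stands in for the real stream so the example runs offline:

```python
from itertools import islice

def take_texts(stream, n, field="text"):
    """Return the `field` value of the first `n` records in `stream`."""
    return [record[field] for record in islice(stream, n)]

# Hypothetical stand-in for a streaming split: a lazy generator of dicts.
fake_stream = ({"text": f"document {i}"} for i in range(1_000_000))

# Only the first three records are ever produced by the generator.
first_three = take_texts(fake_stream, 3)
print(first_three)
```

Because `islice` stops pulling from the iterator once `n` items have been read, the rest of the stream is never touched.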

In [4]:
import re
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Keep only ASCII letters and whitespace (drops digits, punctuation, and accented characters)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

eng_df['cleaned_text'] = eng_df['text'].apply(clean_text)
eng_df['tokens'] = eng_df['cleaned_text'].apply(word_tokenize)
print(eng_df.head())

zul_df['cleaned_text'] = zul_df['zul'].apply(clean_text)
zul_df['tokens'] = zul_df['cleaned_text'].apply(word_tokenize)
print(zul_df.head())

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


                                                text  \
0  Founded in 2015, Golden Bees is a leading prog...   
1  For your team The Smarter pms Channel Manager ...   
2  Importing Viagra To Us, Sildenafil Citrate 100...   
3  The topic we will be looking at today is praye...   
4  Empire Events | Asian Wedding Services East Lo...   

                                        cleaned_text  \
0  founded in golden bees is a leading programmat...   
1  for your team the smarter pms channel manager ...   
2  importing viagra to us sildenafil citrate mg i...   
3  the topic we will be looking at today is praye...   
4  empire events asian wedding services east lond...   

                                              tokens  
0  [founded, in, golden, bees, is, a, leading, pr...  
1  [for, your, team, the, smarter, pms, channel, ...  
2  [importing, viagra, to, us, sildenafil, citrat...  
3  [the, topic, we, will, be, looking, at, today,...  
4  [empire, events, asian, wedding, services, eas..
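
The effect of `clean_text` is easiest to see on a single string. The sketch below repeats the function so that it runs on its own; the sample sentence and the `example.com` URL are made up for illustration. Note that the character filter keeps only ASCII letters, so accented letters are dropped along with digits and punctuation.

```python
import re

def clean_text(text):
    # Same steps as the notebook cell: strip URLs, keep ASCII letters and
    # whitespace, lowercase, then collapse runs of whitespace.
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    return ' '.join(text.split())

sample = "Founded in 2015, visit https://example.com NOW!"
cleaned = clean_text(sample)
print(cleaned)                  # founded in visit now

# "\u00e9" (e with acute accent) is outside a-z, so it is removed entirely.
print(clean_text("caf\u00e9"))  # caf
```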

In [7]:
from transformers import pipeline
import json

en_generator = pipeline('text-generation', model='gpt2')


english_prompts = [
    "Explain the significance of lobola in Southern Africa",
    "Write a short dialogue between two friends in Johannesburg",
    "Describe linguistic features that make isiZulu agglutinative"
]

zulu_prompts = [
    "Chaza ngokubaluleka kwesiko lwelobola eNingizimu Afrika",
    "Bhala inkulumo emfushane phakathi kwabangani ababili eGoli",
    "Landela indaba yamaZulu ngokomlando",
    "Chaza ngamasiko amasha eZulu eskhathini samanje",
    "Bhala inganekwane ethi 'UNogwaja noFudu'"
]

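def generate_with_prompts(prompts, generator, language, samples_per_prompt=3):
    # Substrings that mark a prompt as cultural. This list is an assumption
    # ('lobola', and the isiZulu stem 'siko' as in 'isiko'/'amasiko'); a Shona
    # keyword such as "tsika" never matches these English and isiZulu prompts.
    cultural_keywords = ('lobola', 'siko')
    data = []
    for prompt in prompts:
        is_cultural = any(k in prompt.lower() for k in cultural_keywords)
        for _ in range(samples_per_prompt):
            # Sample one continuation; max_length counts the prompt tokens too.
            output = generator(prompt, max_length=100, do_sample=True, temperature=0.7)
            data.append({
                'prompt': prompt,
                'text': output[0]['generated_text'],
                'label': 'machine',
                'language': language,
                'prompt_type': 'cultural' if is_cultural else 'linguistic'  # Tag for analysis
            })
    return pd.DataFrame(data)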


eng_mg_df = generate_with_prompts(english_prompts, en_generator, 'English')
eng_mg_df['cleaned_text'] = eng_mg_df['text'].apply(clean_text)
eng_mg_df['tokens'] = eng_mg_df['cleaned_text'].apply(word_tokenize)
print(eng_mg_df['text'].iloc[0][:300])

with open('zulu_mg_text.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

zul_mg_df = pd.DataFrame(data)
zul_mg_df['text'] = zul_mg_df['text'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
zul_mg_df['cleaned_text'] = zul_mg_df['text'].apply(clean_text)
zul_mg_df['tokens'] = zul_mg_df['cleaned_text'].apply(word_tokenize)
print(zul_mg_df.head())

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Explain the significance of lobola in Southern Africa from the first and most recent days of the last century, and how they may have provided a useful platform for the colonization of Africa. The following is a list of relevant documents and other resources that have been provided.

References:

[1]
                                              prompt  \
0  Chaza ngokubaluleka kwesiko lwelobola eNingizi...   
1  Bhala inkulumo emfushane phakathi kwabangani a...   
2                Landela indaba yamaZulu ngokomlando   
3    Chaza ngamasiko amasha eZulu eskhathini samanje   
4           Bhala inganekwane ethi 'UNogwaja noFudu'   

                                                text    label language  \
0  Ilobolo liyisiko elibalulekile eliqinisekisa u...  machine     Zulu   
1  Sipho: Sawubona Thabo, kudala singabonani! \nT...  machine     Zulu   
2  AmaZulu aqala ukukhula njengohlanga olukhulu n...  machine     Zulu   
3  Ezulwini lanamuhla, amaZulu asebenzisa izindle...  machine     
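
The `' '.join(x) if isinstance(x, list) else x` step in the cell above handles records whose `text` field is stored as a list of strings rather than a single string. A minimal sketch of that normalisation, using only the standard library; the two records are hypothetical placeholders, not the contents of `zulu_mg_text.json`, and `normalise_text` is an illustrative name:

```python
import json

# Hypothetical records with the same columns as the printed DataFrame.
raw = '''
[
  {"prompt": "p1", "text": ["first line", "second line"], "label": "machine", "language": "Zulu"},
  {"prompt": "p2", "text": "already a string", "label": "machine", "language": "Zulu"}
]
'''

def normalise_text(value):
    """Join a list of strings with spaces; return a string unchanged."""
    return ' '.join(value) if isinstance(value, list) else value

records = json.loads(raw)
for record in records:
    record["text"] = normalise_text(record["text"])

print([r["text"] for r in records])
```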