<a href="https://colab.research.google.com/github/Pacozabala/CSCI199.X-TestSpace/blob/main/data_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Requirements to run
1. TAS-BERT requires a GPU, so change your runtime type to GPU before running.
2. In the first code cell, upload your Kaggle API key (`kaggle.json`).
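
A quick way to confirm the first requirement is to ask the NVIDIA driver tool whether it can list a device. This is only a sketch: it assumes `nvidia-smi` is on the PATH whenever a GPU runtime is attached, and the helper name `gpu_visible` is made up for illustration.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not installed: most likely a CPU-only runtime.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, switch the runtime type to GPU before running the training cell.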

In [None]:
!pip install -q kaggle
from google.colab import files
_ = files.upload()  # assign the result so the key contents are not echoed in the output

Saving kaggle.json to kaggle.json

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d asaniczka/reddit-on-israel-palestine-daily-updated
!unzip reddit-on-israel-palestine-daily-updated.zip

Dataset URL: https://www.kaggle.com/datasets/asaniczka/reddit-on-israel-palestine-daily-updated
License(s): ODC Attribution License (ODC-By)
Downloading reddit-on-israel-palestine-daily-updated.zip to /content
 95% 1.20G/1.27G [00:16<00:02, 29.4MB/s]
100% 1.27G/1.27G [00:17<00:00, 80.0MB/s]
Archive:  reddit-on-israel-palestine-daily-updated.zip
  inflating: legacy/pse_isr_reddit_comments.csv  
  inflating: reddit_opinion_PSE_ISR.csv  


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

In [None]:
df = pd.read_csv("reddit_opinion_PSE_ISR.csv", dtype={10: str})

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3476679 entries, 0 to 3476678
Data columns (total 24 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   comment_id                  object 
 1   score                       int64  
 2   self_text                   object 
 3   subreddit                   object 
 4   created_time                object 
 5   post_id                     object 
 6   author_name                 object 
 7   controversiality            int64  
 8   ups                         int64  
 9   downs                       int64  
 10  user_is_verified            object 
 11  user_account_created_time   object 
 12  user_awardee_karma          float64
 13  user_awarder_karma          float64
 14  user_link_karma             float64
 15  user_comment_karma          float64
 16  user_total_karma            float64
 17  post_score                  int64  
 18  post_self_text              object 
 19  post_title           

In [None]:
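# filter posts created between Oct and Dec 2023
# note: end_date resolves to 2023-12-31 00:00:00, so posts made later on Dec 31 are excluded;
# compare with `< pd.to_datetime('2024-01-01')` instead to include the whole day
df['post_created_time'] = pd.to_datetime(df['post_created_time'])

start_date = pd.to_datetime('2023-10-01')
end_date = pd.to_datetime('2023-12-31')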

df_dated = df[
    (df['post_created_time'] >= start_date) &
    (df['post_created_time'] <= end_date)
]

print(df_dated['post_created_time'].min())
print(df_dated['post_created_time'].max())

2023-10-01 10:52:13
2023-12-30 23:20:36
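
The maximum timestamp above falls on Dec 30 because a date-only bound such as `'2023-12-31'` is parsed as midnight at the start of that day. The sketch below shows the same comparison with the standard library's `datetime`; the `late_post` value is a hypothetical timestamp, not a row from the dataset.

```python
from datetime import datetime

# A date-only bound means 00:00:00 on that day.
end_inclusive = datetime(2023, 12, 31)
# An exclusive bound on the following day covers all of Dec 31.
end_exclusive = datetime(2024, 1, 1)

# Hypothetical post made in the evening of Dec 31.
late_post = datetime(2023, 12, 31, 18, 30)

kept_by_inclusive_bound = late_post <= end_inclusive
kept_by_exclusive_bound = late_post < end_exclusive

print(kept_by_inclusive_bound)  # False: the post is after midnight
print(kept_by_exclusive_bound)  # True
```

Using a half-open range (`>= start` and `< day after end`) avoids having to spell out `23:59:59` for the last day.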


In [None]:
# keep only subreddits with at least 1,000 rows in the date range
subreddit_counts = df_dated['subreddit'].value_counts()
valid_subreddits = subreddit_counts[subreddit_counts >= 1000].index
df_dated = df_dated[df_dated['subreddit'].isin(valid_subreddits)]
df_dated.info()

<class 'pandas.core.frame.DataFrame'>
Index: 577116 entries, 2881774 to 3474681
Data columns (total 24 columns):
 #   Column                      Non-Null Count   Dtype         
---  ------                      --------------   -----         
 0   comment_id                  577116 non-null  object        
 1   score                       577116 non-null  int64         
 2   self_text                   577113 non-null  object        
 3   subreddit                   577116 non-null  object        
 4   created_time                577116 non-null  object        
 5   post_id                     577116 non-null  object        
 6   author_name                 577116 non-null  object        
 7   controversiality            577116 non-null  int64         
 8   ups                         577116 non-null  int64         
 9   downs                       577116 non-null  int64         
 10  user_is_verified            577116 non-null  object        
 11  user_account_created_time   543460 no

In [None]:
# drop rows whose post body (post_self_text) is missing
df_dated = df_dated.dropna(subset=['post_self_text'])
display(df_dated[['post_self_text']].head())

Unnamed: 0,post_self_text
2881774,Are you counting the 8-17 year olds that have ...
2885666,"Hello everyone, I hope you all are doing well...."
2885704,After 54 days in captivity- Mia Schem had been...
2885720,Discussion is going to be centralized here.\n\...
2885743,After 54 days in captivity- Mia Schem had been...


In [None]:
# get a random sample of 100
df_sample = df_dated.sample(n=100, random_state=42) # using a random state for reproducibility
display(df_sample.head())

Unnamed: 0,comment_id,score,self_text,subreddit,created_time,post_id,author_name,controversiality,ups,downs,...,user_link_karma,user_comment_karma,user_total_karma,post_score,post_self_text,post_title,post_upvote_ratio,post_thumbs_ups,post_total_awards_received,post_created_time
3311851,k9om2n4,0,&gt;Smoking gun evidence won’t be released unt...,IsraelPalestine,2023-11-17 20:49:23,17xgmy0,CulturalCranberry960,0,0,0,...,53.0,348.0,401.0,35,"editors note: if you liked my article segment,...",The IDF says they have found an “operational H...,0.75,35,0,2023-11-17 14:45:41
2947282,keq6tl4,4,Banned in Iran already.,IsraelPalestine,2023-12-24 10:45:45,18plfym,Less-Plant-4099,0,4,0,...,1.0,5209.0,5210.0,28,Found this free simulation game from 2006 that...,Peacemaker: Peace Simulation Video Game,0.92,28,0,2023-12-24 02:33:08
3074349,kclhvks,1,The UN is a kangaroo court. Israel should be ...,IsraelPalestine,2023-12-09 04:45:49,18drn6w,jwilens,0,1,0,...,110.0,1682.0,1800.0,76,I've been following this account on Twitter (s...,The casualty numbers in Gaza are completely fa...,0.6,76,0,2023-12-08 17:17:37
3373906,k8zu7nh,4,The replies are there for you to read yourself.,IsraelPalestine,2023-11-12 23:23:15,17tizfn,mikebenb,0,4,0,...,507.0,15810.0,16455.0,59,Is exposing people's true feelings about Jews....,The only think to thank Hamas for,0.71,59,0,2023-11-12 12:10:57
3058333,kcsis97,4,"If you actually cared about ""vile, depraved"" w...",IsraelPalestine,2023-12-10 18:02:50,18f8k0d,AhsokaSolo,0,4,0,...,1.0,64980.0,64981.0,0,Although I strongly disagree that being a pos ...,I have a question for those who think that any...,0.35,0,0,2023-12-10 17:28:43


In [None]:
# cleaning text
# remove html tags, user mentions, subreddit references

import re
from bs4 import BeautifulSoup

def clean_text(text):
    if pd.isna(text):
        return ""

    # 1. remove HTML tags, CSS styles
    text = BeautifulSoup(text, "html.parser").get_text()

    # 2. remove user mentions like "u/username"
    text = re.sub(r"u/[A-Za-z0-9_-]+", "", text)

    # 3. remove subreddit mentions like "r/subreddit"
    text = re.sub(r"r/[A-Za-z0-9_-]+", "", text)

    # 4. remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)

    # 5. collapse runs of whitespace and line breaks into single spaces
    text = re.sub(r"\s+", " ", text).strip()

    # 6. lowercase text
    text = text.lower()

    # 7. remove punctuation, but keep periods, question marks, and exclamation points
    text = re.sub(r"[^\w\s.?!]", "", text)

    return text

df_cleaned = df_sample.copy()
df_cleaned['cleaned_text'] = df_sample['post_self_text'].apply(clean_text)
display(df_cleaned[['post_self_text', 'cleaned_text']].head(10))

Unnamed: 0,post_self_text,cleaned_text
3311851,"editors note: if you liked my article segment,...",editors note if you liked my article segment c...
2947282,Found this free simulation game from 2006 that...,found this free simulation game from 2006 that...
3074349,I've been following this account on Twitter (s...,ive been following this account on twitter sor...
3373906,Is exposing people's true feelings about Jews....,is exposing peoples true feelings about jews. ...
3058333,Although I strongly disagree that being a pos ...,although i strongly disagree that being a pos ...
3012086,Discussion is going to be centralized here.\n\...,discussion is going to be centralized here. mo...
3165579,Prime Minister Benjamin Netanyahu has reported...,prime minister benjamin netanyahu has reported...
3104922,What is happening in the palestinian territori...,what is happening in the palestinian territori...
3027945,Background then question:\n\n I didn’t realize...,background then question i didnt realize i was...
3054959,It’s crazy to me that Hamas attacked on Octobe...,its crazy to me that hamas attacked on october...
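
One thing to watch in the mention patterns: `u/[A-Za-z0-9_-]+` has no boundary on the left, so it can also match inside an ordinary word that happens to contain `u/`. The sketch below compares that pattern with a stricter variant that uses a lookbehind. The stricter pattern and the sample sentence are assumptions for illustration, not part of the notebook's pipeline.

```python
import re

# Pattern as used in clean_text: no left boundary.
MENTION_LOOSE = re.compile(r"u/[A-Za-z0-9_-]+")
# Assumed alternative: require that "u/" is not preceded by a word character.
MENTION_STRICT = re.compile(r"(?<![A-Za-z0-9_])u/[A-Za-z0-9_-]+")

# Hypothetical input containing a real mention and a false positive.
text = "thanks u/some_user, see the menu/items list"

loose_result = MENTION_LOOSE.sub("", text)
strict_result = MENTION_STRICT.sub("", text)

print(loose_result)   # thanks , see the men list
print(strict_result)  # thanks , see the menu/items list
```

The same lookbehind idea would apply to the `r/...` subreddit pattern.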


In [None]:
# get needed columns
df_column = df_cleaned[['cleaned_text']]
df_column.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 3311851 to 2942716
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   cleaned_text  100 non-null    object
dtypes: object(1)
memory usage: 1.6+ KB


In [None]:
# eliminate duplicates
df_unique = df_column.drop_duplicates(subset=['cleaned_text'])
df_unique.info()  # info() prints directly and returns None, so display() is not needed

<class 'pandas.core.frame.DataFrame'>
Index: 95 entries, 3311851 to 2942716
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   cleaned_text  95 non-null     object
dtypes: object(1)
memory usage: 1.5+ KB

In [None]:
# sentence segmentation and punctuation removal
import nltk
from nltk.tokenize import sent_tokenize
import string

# Download all 'punkt' related resources if they haven't already been downloaded
try:
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
except Exception as e:
    print(f"Error downloading 'punkt' resources: {e}")


def segment_and_clean_sentences(text):
    if pd.isna(text):
        return []
    sentences = sent_tokenize(text)
    cleaned_sentences = []
    for sentence in sentences:
        # Remove all punctuation from the sentence
        sentence_no_punct = sentence.translate(str.maketrans('', '', string.punctuation))
        cleaned_sentences.append(sentence_no_punct)
    return cleaned_sentences

# Create a new list to store the individual sentences
sentences_list = []
for index, row in df_unique.iterrows():
    cleaned_sentences = segment_and_clean_sentences(row['cleaned_text'])
    for sentence in cleaned_sentences:
        sentences_list.append({'sentence': sentence})

# Create a new DataFrame from the list of sentences
df_sentences = pd.DataFrame(sentences_list)
display(df_sentences.head())
df_sentences.info()

Unnamed: 0,sentence
0,editors note if you liked my article segment c...
1,reported by cnn the israel defense forces idf ...
2,a video released by the idf displays a substan...
3,however there was no footage supplied of the c...
4,idf spokesman daniel hagari said army engineer...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  1094 non-null   object
dtypes: object(1)
memory usage: 8.7+ KB
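
The punctuation step relies on `str.translate` with a deletion table built from `string.punctuation`. A minimal sketch of that step on its own is shown below, with the table built once rather than per sentence. The helper name `strip_punctuation` is hypothetical. Note that `string.punctuation` only covers ASCII characters, so typographic quotes survive this step unless an earlier step removed them.

```python
import string

# Build the deletion table once; it maps every ASCII punctuation mark to None.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(sentence: str) -> str:
    """Remove ASCII punctuation from a sentence, leaving other characters as-is."""
    return sentence.translate(_PUNCT_TABLE)


ascii_example = strip_punctuation("is this true? yes!")
curly_example = strip_punctuation("don\u2019t stop.")

print(ascii_example)  # is this true yes
print(curly_example)  # the curly apostrophe (U+2019) is kept, the period is removed
```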


In [None]:
# dataset columns
# sentence_id: Unique ID for each sentence
# sentence: The text (can have leading space)
# target: The aspect target term (or "NULL" if implicit)
# category: The aspect category (e.g., "food quality", "service general")
# polarity: positive/negative/neutral
# category_polarity: category + space + polarity
# entailed: "yes" if this row represents an actual opinion, "no" otherwise
# start: 1-based word index where target starts (0 for NULL)
# end: 1-based word index one past the last target word, i.e. exclusive (0 for NULL)
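
The `start`/`end` convention is easiest to see by recovering a target from its indices. The sketch below follows the slicing used in the mock data cell (`words[start - 1:end - 1]`), which treats `start` as 1-based and `end` as exclusive. The helper name and the sample sentence are hypothetical.

```python
def extract_target(sentence: str, start: int, end: int) -> str:
    """Recover the target span from 1-based start and exclusive end word indices."""
    if start == 0 and end == 0:
        # Implicit target: no span in the sentence.
        return "NULL"
    words = sentence.split()
    return " ".join(words[start - 1:end - 1])


# Hypothetical sentence; word positions are 1: the, 2: soup, 3: was, 4: cold.
one_word = extract_target("the soup was cold", 2, 3)
two_words = extract_target("the soup was cold", 3, 5)
implicit = extract_target("the soup was cold", 0, 0)

print(one_word)   # soup
print(two_words)  # was cold
print(implicit)   # NULL
```

With this convention, `end - start` equals the number of words in the target.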

In [None]:
# mock data generation
mft_categories = [
    "care/harm", "fairness/cheating", "loyalty/betrayal",
    "authority/subversion", "purity/degradation", "none"
]
polarities = ["positive", "negative", "neutral"]

data = []

for i, row in df_sentences.iterrows():
    # Ensure consistent spacing (TAS-BERT tokenizes on spaces)
    sentence = " ".join(row["sentence"].strip().split())

    # Randomly assign category and polarity
    category = random.choice(mft_categories)
    polarity = random.choice(polarities)
    category_polarity = f"{category} {polarity}"

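    # ~70% of rows are entailed ("yes"); the rest are "no" with a NULL target
    # (entailed rows with 3 words or fewer also get a NULL target below)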
    entailed = "yes" if random.random() > 0.3 else "no"

    if entailed == "yes":
        words = sentence.split()
        if len(words) > 3:
            # Randomly select a span
            span_length = random.randint(1, min(3, len(words)))
            start = random.randint(1, len(words) - span_length + 1)
            end = start + span_length  # end is EXCLUSIVE for TAS-BERT
            target = " ".join(words[start - 1:end - 1])  # match TAS-BERT slicing
        else:
            start, end, target = 0, 0, "NULL"
    else:
        start, end, target = 0, 0, "NULL"

    sentence_id = f"{1000000 + i}:0"
    data.append({
        "sentence_id": sentence_id,
        "sentence": sentence,
        "target": target,
        "category": category,
        "polarity": polarity,
        "category_polarity": category_polarity,
        "entailed": entailed,
        "start": start,
        "end": end
    })

df_mock = pd.DataFrame(data)
df_mock.head()


Unnamed: 0,sentence_id,sentence,target,category,polarity,category_polarity,entailed,start,end
0,1000000:0,editors note if you liked my article segment c...,,authority/subversion,negative,authority/subversion negative,no,0,0
1,1000001:0,reported by cnn the israel defense forces idf ...,reported by cnn,authority/subversion,positive,authority/subversion positive,yes,1,4
2,1000002:0,a video released by the idf displays a substan...,ground,purity/degradation,positive,purity/degradation positive,yes,13,14
3,1000003:0,however there was no footage supplied of the c...,of the,loyalty/betrayal,positive,loyalty/betrayal positive,yes,7,9
4,1000004:0,idf spokesman daniel hagari said army engineer...,,purity/degradation,positive,purity/degradation positive,no,0,0
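
The labels above come from the global `random` module without a seed, so rerunning the cell produces different mock data. If repeatable mock labels are wanted, one option is a dedicated seeded generator. This is a sketch under that assumption: the `draw_labels` helper and the seed value are illustrative only.

```python
import random

MFT_CATEGORIES = [
    "care/harm", "fairness/cheating", "loyalty/betrayal",
    "authority/subversion", "purity/degradation", "none",
]
POLARITIES = ["positive", "negative", "neutral"]


def draw_labels(seed: int, n: int = 5) -> list[tuple[str, str]]:
    """Draw n (category, polarity) pairs from a private, seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return [(rng.choice(MFT_CATEGORIES), rng.choice(POLARITIES)) for _ in range(n)]


first_run = draw_labels(42)
second_run = draw_labels(42)
print(first_run == second_run)  # True: same seed, same labels
```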


In [None]:
# shuffle dataframe
df_shuffled = df_mock.sample(frac=1, random_state=42).reset_index(drop=True)

# calculate split point
split_point = int(len(df_shuffled) * 0.8)

# split into training and testing sets
df_train = df_shuffled[:split_point]
df_test = df_shuffled[split_point:]

# save to CSV files
df_train.to_csv('df_mock_train.csv', index=False)
df_test.to_csv('df_mock_test.csv', index=False)

# save also as TSV
df_train.to_csv('df_mock_train.tsv', index=False, sep='\t')
df_test.to_csv('df_mock_test.tsv', index=False, sep='\t')

print("Training set shape:", df_train.shape)
print("Test set shape:", df_test.shape)

Training set shape: (875, 9)
Test set shape: (219, 9)
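
The shapes above follow directly from the split arithmetic: `int(len * 0.8)` truncates, and the test set takes the remainder. A small sketch of that calculation, using the 1094 sentences reported earlier (the helper name `split_sizes` is hypothetical):

```python
def split_sizes(n_rows: int, train_frac: float = 0.8) -> tuple[int, int]:
    """Return (train_rows, test_rows) for a truncating split point."""
    split_point = int(n_rows * train_frac)  # int() truncates toward zero
    return split_point, n_rows - split_point


train_rows, test_rows = split_sizes(1094)
print(train_rows, test_rows)  # 875 219
```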


In [None]:
# run all cells above before continuing

In [None]:
# clone TAS-BERT repo
# This fork of TAS-BERT accepts a custom dataset; for now it only supports mock_data.
!git clone https://github.com/Pacozabala/TAS-BERT

Cloning into 'TAS-BERT'...
remote: Enumerating objects: 82, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 82 (delta 13), reused 9 (delta 8), pack-reused 61 (from 1)
Receiving objects: 100% (82/82), 789.99 KiB | 11.62 MiB/s, done.
Resolving deltas: 100% (34/34), done.


In [None]:
# install pytorch-crf dependency
!pip install pytorch-crf
import torchcrf
print("torchcrf imported successfully!")

Collecting pytorch-crf
  Downloading pytorch_crf-0.7.2-py3-none-any.whl.metadata (2.4 kB)
Downloading pytorch_crf-0.7.2-py3-none-any.whl (9.5 kB)
Installing collected packages: pytorch-crf
Successfully installed pytorch-crf-0.7.2
torchcrf imported successfully!


In [None]:
# download uncased BERT model
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip -o uncased_L-12_H-768_A-12.zip -d TAS-BERT/

--2025-11-11 09:54:39--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.210.207, 74.125.26.207, 173.194.212.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.210.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407727028 (389M) [application/zip]
Saving to: ‘uncased_L-12_H-768_A-12.zip’


2025-11-11 09:54:42 (136 MB/s) - ‘uncased_L-12_H-768_A-12.zip’ saved [407727028/407727028]

Archive:  uncased_L-12_H-768_A-12.zip
   creating: TAS-BERT/uncased_L-12_H-768_A-12/
  inflating: TAS-BERT/uncased_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: TAS-BERT/uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: TAS-BERT/uncased_L-12_H-768_A-12/vocab.txt  
  inflating: TAS-BERT/uncased_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: TAS-BERT/uncased_L-12_H-768_A-12/bert_config.json  


In [None]:
# command to create BERT-pytorch-model
!python TAS-BERT/convert_tf_checkpoint_to_pytorch.py \
--tf_checkpoint_path TAS-BERT/uncased_L-12_H-768_A-12/bert_model.ckpt \
--bert_config_file TAS-BERT/uncased_L-12_H-768_A-12/bert_config.json \
--pytorch_dump_path TAS-BERT/uncased_L-12_H-768_A-12/pytorch_model.bin

2025-11-11 09:55:00.857389: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1762854901.321088    2722 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1762854901.426515    2722 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1762854902.300241    2722 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1762854902.300285    2722 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1762854902.300292    2722 computation_placer.cc:177] computation placer alr

In [None]:
# data pre-prep command
# !cd TAS-BERT && cd data && python data_preprocessing_for_TAS.py --dataset semeval2015 && python data_preprocessing_for_TAS.py --dataset semeval2016

In [None]:
# data pre-prep with custom data
import os

# Create the directory if it doesn't exist
os.makedirs('TAS-BERT/data/mock_data/', exist_ok=True)

# Move the TSV files to the new directory
os.rename('/content/df_mock_train.tsv', 'TAS-BERT/data/mock_data/df_mock_train.tsv')
os.rename('/content/df_mock_test.tsv', 'TAS-BERT/data/mock_data/df_mock_test.tsv')

print("TSV files moved successfully.")

TSV files moved successfully.


In [None]:
# Modify data_preprocessing_for_TAS.py so that it also accepts mock_data:
# 1. in main, add 'mock_data' to the choices of the --dataset argument
# 2. in the conditional that follows, add the mock_data training and testing file names

In [None]:
# pre-process custom data
# Modify the data_preprocessing_for_TAS.py file to add 'mock_data' to the choices for the --dataset argument
!sed -i "s/choices={'semeval2015', 'semeval2016'}/choices={'semeval2015', 'semeval2016', 'mock_data'}/" TAS-BERT/data/data_preprocessing_for_TAS.py

# Rename the .tsv files to .txt
!mv TAS-BERT/data/mock_data/df_mock_train.tsv TAS-BERT/data/mock_data/df_mock_train.txt
!mv TAS-BERT/data/mock_data/df_mock_test.tsv TAS-BERT/data/mock_data/df_mock_test.txt

!cd TAS-BERT/data && python data_preprocessing_for_TAS.py --dataset mock_data

entity_sum:  621
max_sen_len:  120
sample ratio:  620 - 15112
entity_sum:  161
max_sen_len:  134
sample ratio:  161 - 3763


In [None]:
import os

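def trim_problematic_lines(filepath, expected_columns=5):
    """Rewrite filepath in place, dropping blank lines and rows whose
    tab-separated column count differs from expected_columns."""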
    if not os.path.exists(filepath):
        print(f"File not found: {filepath}")
        return

    temp_filepath = filepath + ".temp"
    problematic_count = 0
    total_lines = 0

    with open(filepath, 'r', encoding='utf-8') as infile, \
         open(temp_filepath, 'w', encoding='utf-8') as outfile:
        for line in infile:
            total_lines += 1
            line = line.strip()
            if not line:
                problematic_count += 1
                continue

            line_arr = line.split('\t')
            if len(line_arr) == expected_columns:
                outfile.write(line + '\n')
            else:
                problematic_count += 1

    os.replace(temp_filepath, filepath)
    print(f"Trimmed {problematic_count} lines from {filepath}. Total lines processed: {total_lines}")

# Trim the training and testing TSV files
trim_problematic_lines('/content/TAS-BERT/data/mock_data/three_joint/BIO/train_TAS.tsv')
trim_problematic_lines('/content/TAS-BERT/data/mock_data/three_joint/BIO/test_TAS.tsv')

Trimmed 54 lines from /content/TAS-BERT/data/mock_data/three_joint/BIO/train_TAS.tsv. Total lines processed: 15733
Trimmed 54 lines from /content/TAS-BERT/data/mock_data/three_joint/BIO/test_TAS.tsv. Total lines processed: 3925


In [None]:
# command to train + test model
!cd TAS-BERT && CUDA_VISIBLE_DEVICES=0 python TAS_BERT_joint.py \
--data_dir data/mock_data/three_joint/BIO/ \
--output_dir results/mock_data/three_joint/BIO/my_result \
--vocab_file uncased_L-12_H-768_A-12/vocab.txt \
--bert_config_file uncased_L-12_H-768_A-12/bert_config.json \
--init_checkpoint uncased_L-12_H-768_A-12/pytorch_model.bin \
--tokenize_method word_split \
--use_crf \
--eval_test \
--do_lower_case \
--max_seq_length 128 \
--train_batch_size 24 \
--eval_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 1.0

2025-11-11 09:57:16.658576: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1762855036.933503    3358 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1762855037.003169    3358 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1762855037.538403    3358 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1762855037.538458    3358 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1762855037.538466    3358 computation_placer.cc:177] computation placer alr