# Project Setup

Follow these steps to set up the necessary files and structure for the project.

## Folder Structure and Files

1. **Create a `data` Folder**: In the main project directory, create a folder named `data` and place the following files inside it:
   - `Reddit-Threads_2020-2021.csv`
   - `Reddit-Threads_2022-2023.csv`

2. **Create a `.env` File**: In the main project directory, create a file named `.env`.

3. **Add Your Hugging Face API Key**:
   - Open the `.env` file and add the following line:

     ```plaintext
     HUGGINGFACE_API_KEY=your_api_key_here
     ```

   - Replace `your_api_key_here` with your actual Hugging Face API key.
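
Before running the notebook, it can help to confirm that the layout described above is in place. The following is a minimal sketch using only the standard library. The helper name `missing_setup_items` is hypothetical, and it assumes it is given the main project directory from the steps above.

```python
from pathlib import Path

# File names taken from the setup steps above.
REQUIRED_FILES = [
    Path("data") / "Reddit-Threads_2020-2021.csv",
    Path("data") / "Reddit-Threads_2022-2023.csv",
    Path(".env"),
]


def missing_setup_items(project_dir):
    """Return the required paths (relative to project_dir) that do not exist yet."""
    root = Path(project_dir)
    return [rel.as_posix() for rel in REQUIRED_FILES if not (root / rel).is_file()]


# Assumes the current working directory is the main project directory.
print(missing_setup_items("."))
```

An empty list means every file named in the steps was found. The check does not validate the contents of `.env`.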

In [1]:
# Standard Library Imports
import os
import json
import time
import random

# Environment Variables (third-party: python-dotenv)
from dotenv import load_dotenv

# Data Handling
import pandas as pd
import numpy as np
from datasets import Dataset

# NLP and Transformers
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from nltk.corpus import stopwords
import torch
import torch.nn as nn
from sklearn.metrics import f1_score, precision_score, recall_score

# API and Hugging Face Integration
import requests
from huggingface_hub import login

# Visualization
import matplotlib.pyplot as plt

# Utilities
from tqdm import tqdm
import ast

# Hugging Face API key: load .env first so the variable is visible to os.getenv
load_dotenv()
hf_api_key = os.getenv('HUGGINGFACE_API_KEY')
login(token=hf_api_key)

if torch.cuda.is_available():
    device = torch.device("cuda")
    device_name = torch.cuda.get_device_name(torch.cuda.current_device())
    print(f'Device in use: {device_name}')
else:
    device = torch.device("cpu")
    print('Device in use: CPU')

load_dotenv()



The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ethan/.cache/huggingface/token
Login successful
Device in use: NVIDIA GeForce RTX 2070 SUPER


True

# Reading in data

In [None]:
###   FULL DATASET   ###
chunk_size = 10000
csv_paths = [
    '../data/Reddit-Threads_2020-2021.csv',
    '../data/Reddit-Threads_2022-2023.csv',
]

# Read each file in chunks and concatenate once at the end;
# calling pd.concat inside the loop re-copies the growing frame every time.
chunks = []
for path in csv_paths:
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        print(chunk.head())
        chunks.append(chunk)
df = pd.concat(chunks)
###   FULL DATASET   ###


print(df.shape)


                                                text            timestamp  \
0                                      STI chiong ah  2020-05-14 12:35:30   
1  Look on the bright side - you'll never make th...  2020-02-09 17:23:24   
2  For posts flaired as such (by OP), we will be ...  2021-04-06 18:08:59   
3  sounds q fucked up if no concern for each othe...  2021-01-22 14:22:42   
4  Chinese media reported a while ago: https://ww...  2020-03-26 04:51:22   

         username                                               link  \
0       iamabear1  /r/singapore/comments/gjjem5/covid19_8663_busi...   
1          lkc159  /r/singapore/comments/f15aks/did_i_just_get_sc...   
2   AutoModerator  /r/singapore/comments/maajuo/a_compilation_of_...   
3       [deleted]  /r/singapore/comments/l28wfr/rsingapore_random...   
4  localinfluenza  /r/singapore/comments/fp5hgu/pcf_cluster_anoth...   

     link_id   parent_id       id subreddit_id  \
0  t3_gjjem5   t3_gjjem5  fqljinp     t5_2qh8c   
1  t
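
Because each CSV is read separately, the row labels restart at 0 for the second file, so the concatenated frame can contain duplicate index values. The sketch below uses small in-memory CSVs as stand-ins for the real files and shows one way to read in chunks while building a unique index with `ignore_index=True`. The helper name `read_csvs_in_chunks` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd


def read_csvs_in_chunks(sources, chunk_size):
    """Read each CSV source chunk by chunk and return one frame with a fresh 0..n-1 index."""
    chunks = []
    for source in sources:
        for chunk in pd.read_csv(source, chunksize=chunk_size):
            chunks.append(chunk)
    # Concatenating once avoids re-copying a growing frame on every iteration.
    return pd.concat(chunks, ignore_index=True)


# Toy stand-ins for the two Reddit CSV files.
csv_a = io.StringIO("text,username\nfirst,a\nsecond,b\nthird,c\n")
csv_b = io.StringIO("text,username\nfourth,d\nfifth,e\n")

combined = read_csvs_in_chunks([csv_a, csv_b], chunk_size=2)
print(combined.shape)
print(combined.index.is_unique)
```

A unique index matters for any later step that writes by label, such as `df.at[index, ...]`, since a duplicated label would address more than one row.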

# Cleaning

In [6]:
df_normalized = df

### removing deleted or removed text ###
df_normalized = df_normalized[df_normalized['text'] != '[deleted]']
df_normalized = df_normalized[df_normalized['text'] != '[removed]']
df_normalized = df_normalized.dropna(subset=['text'])
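
The three filters above can also be expressed as a single boolean mask. The sketch below does that on a toy frame. The helper name `drop_unusable_text` is hypothetical, and dropping blank strings is an added assumption: the original cell keeps them, and the classification loop skips them later.

```python
import pandas as pd

PLACEHOLDERS = ["[deleted]", "[removed]"]


def drop_unusable_text(frame, column="text"):
    """Keep rows whose text is present, not a placeholder, and not blank."""
    text = frame[column]
    is_present = text.notna()
    is_placeholder = text.isin(PLACEHOLDERS)
    # Blank strings are not removed by the original cell; dropping them here is an assumption.
    is_blank = text.astype(str).str.strip() == ""
    return frame[is_present & ~is_placeholder & ~is_blank].copy()


toy = pd.DataFrame({"text": ["hello", "[deleted]", None, "[removed]", "   ", "world"]})
cleaned = drop_unusable_text(toy)
print(cleaned["text"].tolist())
```

Returning a copy keeps the cleaned frame independent of the original, so adding columns to it later does not raise a `SettingWithCopyWarning`.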

In [8]:
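# .copy() makes each piece an independent DataFrame, so adding a column later does not trigger SettingWithCopyWarning
df1 = df_normalized.iloc[:1000000].copy()
df2 = df_normalized.iloc[1000000:2000000].copy()
df3 = df_normalized.iloc[2000000:3000000].copy()
df4 = df_normalized.iloc[3000000:4000000].copy()
df5 = df_normalized.iloc[4000000:].copy()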

dataframes = [df1, df2, df3, df4, df5]
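
The hard-coded slice boundaries above can be generated from the row count instead. The following is a small sketch with a hypothetical helper, `chunk_bounds`. The row count of 4,509,972 used in the example is inferred from the progress bars in this notebook (four pieces of 1,000,000 rows and a final piece of 509,972).

```python
def chunk_bounds(n_rows, chunk_rows):
    """Return (start, stop) pairs covering range(n_rows) in pieces of at most chunk_rows."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    return [
        (start, min(start + chunk_rows, n_rows))
        for start in range(0, n_rows, chunk_rows)
    ]


bounds = chunk_bounds(4_509_972, 1_000_000)
print(len(bounds), bounds[-1])
```

With these bounds, the list of pieces could be built as `[df_normalized.iloc[a:b].copy() for a, b in chunk_bounds(len(df_normalized), 1_000_000)]`, which adapts automatically if the dataset size changes.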

In [None]:
hate_classifier = pipeline("text-classification", model="sileod/deberta-v3-base-tasksource-toxicity", return_all_scores=True, device=0)

error_df = pd.DataFrame()

for i, df in enumerate(dataframes, 1):
    df['BERT_2_hate'] = False
  
    # Iterate over each row in the DataFrame
    for index, row in tqdm(df.iterrows(), total=df.shape[0], desc=f"Classifying hate speech for DataFrame {i}"):
        text = row['text']

        # Check if text is valid (not empty or in an unexpected format)
        if not isinstance(text, str) or text.strip() == "":
            print(f"Invalid text at index {index}. Skipping row.")
            error_df = pd.concat([error_df, pd.DataFrame([row])], ignore_index=True)
            continue

        try:
            # Classify the text
            hate_prediction = hate_classifier(text)
            for pred in hate_prediction[0]:
                label = pred['label']
                score = pred['score']
                if label == 'hate' and score >= 0.01:  # threshold of 0.01 was chosen by testing against the expert-labelled data
                    df.at[index, 'BERT_2_hate'] = True

        except Exception as e:
            # Print error message and log the problematic row in error_df
            print(f"Error processing toxicity at index {index}: {e}")
            error_df = pd.concat([error_df, pd.DataFrame([row])], ignore_index=True)

    # Save the processed DataFrame to CSV
    output_filename = f'../data/deberta_v3_labelled_3_{i}.csv'
    df.to_csv(output_filename, index=False)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['BERT_2_hate'] = False
Classifying hate speech for DataFrame 1:   9%|▉         | 91884/1000000 [27:10<4:45:44, 52.97it/s] 

Error processing toxicity at index 101214: CUDA out of memory. Tried to allocate 828.00 MiB. GPU 0 has a total capacity of 7.79 GiB of which 399.62 MiB is free. Including non-PyTorch memory, this process has 7.39 GiB memory in use. Of the allocated memory 5.21 GiB is allocated by PyTorch, and 2.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Classifying hate speech for DataFrame 1:  98%|█████████▊| 984560/1000000 [4:49:39<04:55, 52.29it/s]  

Error processing toxicity at index 1090209: CUDA out of memory. Tried to allocate 1.05 GiB. GPU 0 has a total capacity of 7.79 GiB of which 283.62 MiB is free. Including non-PyTorch memory, this process has 7.50 GiB memory in use. Of the allocated memory 6.34 GiB is allocated by PyTorch, and 1.03 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Classifying hate speech for DataFrame 1: 100%|██████████| 1000000/1000000 [4:54:10<00:00, 56.65it/s]
Classifying hate speech for DataFrame 2:  51%|█████     | 511671/1000000 [2:30:46<2:34:44, 52.59it/s]

Error processing toxicity at index 1673520: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 0 has a total capacity of 7.79 GiB of which 283.62 MiB is free. Including non-PyTorch memory, this process has 7.50 GiB memory in use. Of the allocated memory 5.96 GiB is allocated by PyTorch, and 1.42 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Classifying hate speech for DataFrame 2:  55%|█████▌    | 550180/1000000 [2:42:04<2:22:40, 52.55it/s]

Error processing toxicity at index 1716166: CUDA out of memory. Tried to allocate 1.05 GiB. GPU 0 has a total capacity of 7.79 GiB of which 283.62 MiB is free. Including non-PyTorch memory, this process has 7.50 GiB memory in use. Of the allocated memory 6.34 GiB is allocated by PyTorch, and 1.03 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Classifying hate speech for DataFrame 2:  89%|████████▉ | 893479/1000000 [4:22:16<40:21, 44.00it/s]  

Error processing toxicity at index 2096337: CUDA out of memory. Tried to allocate 880.00 MiB. GPU 0 has a total capacity of 7.79 GiB of which 283.62 MiB is free. Including non-PyTorch memory, this process has 7.50 GiB memory in use. Of the allocated memory 5.45 GiB is allocated by PyTorch, and 1.93 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Classifying hate speech for DataFrame 2: 100%|██████████| 1000000/1000000 [4:53:26<00:00, 56.80it/s]
Classifying hate speech for DataFrame 3:  45%|████▌     | 450038/1000000 [2:11:33<2:51:54, 53.32it/s]

Error processing toxicity at index 2712215: CUDA out of memory. Tried to allocate 1.37 GiB. GPU 0 has a total capacity of 7.79 GiB of which 283.62 MiB is free. Including non-PyTorch memory, this process has 7.50 GiB memory in use. Of the allocated memory 6.23 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Classifying hate speech for DataFrame 3: 100%|██████████| 1000000/1000000 [4:51:15<00:00, 57.22it/s] 
Classifying hate speech for DataFrame 4:   6%|▌         | 60474/1000000 [17:32<6:01:56, 43.26it/s] 

Error processing toxicity at index 421381: CUDA out of memory. Tried to allocate 1.42 GiB. GPU 0 has a total capacity of 7.79 GiB of which 283.62 MiB is free. Including non-PyTorch memory, this process has 7.50 GiB memory in use. Of the allocated memory 6.44 GiB is allocated by PyTorch, and 966.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Classifying hate speech for DataFrame 4:   7%|▋         | 69156/1000000 [20:02<4:40:01, 55.40it/s]

Error processing toxicity at index 430758: CUDA out of memory. Tried to allocate 960.00 MiB. GPU 0 has a total capacity of 7.79 GiB of which 283.62 MiB is free. Including non-PyTorch memory, this process has 7.50 GiB memory in use. Of the allocated memory 5.82 GiB is allocated by PyTorch, and 1.56 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Classifying hate speech for DataFrame 4: 100%|██████████| 1000000/1000000 [4:49:07<00:00, 57.65it/s]  
Classifying hate speech for DataFrame 5:  75%|███████▌  | 384871/509972 [1:51:11<39:08, 53.26it/s]  

Error processing toxicity at index 1848510: CUDA out of memory. Tried to allocate 824.00 MiB. GPU 0 has a total capacity of 7.79 GiB of which 283.62 MiB is free. Including non-PyTorch memory, this process has 7.50 GiB memory in use. Of the allocated memory 5.19 GiB is allocated by PyTorch, and 2.19 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Classifying hate speech for DataFrame 5: 100%|██████████| 509972/509972 [2:27:24<00:00, 57.66it/s]  
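
The per-row decision in the loop above reduces to a threshold check on the `hate` score. The sketch below isolates that check so it can be tested without a GPU. It also passes `truncation=True` and `max_length` when calling the classifier. The Hugging Face text-classification pipeline forwards such tokenizer arguments, and capping the input length may avoid the out-of-memory errors logged above if they are caused by very long comments. That cause is an assumption, not something the log confirms. The names `is_hate`, `classify_text`, and `fake_classifier`, the `not_hate` label, and the 512-token cap are all illustrative.

```python
def is_hate(prediction, threshold=0.01):
    """Return True if any {'label', 'score'} dict is labelled 'hate' at or above threshold."""
    return any(
        pred["label"] == "hate" and pred["score"] >= threshold
        for pred in prediction
    )


def classify_text(classifier, text, threshold=0.01, max_length=512):
    """Call the classifier on one text with truncation enabled and apply the threshold."""
    # Assumption: the classifier returns a nested list, [[{label, score}, ...]],
    # as the pipeline in this notebook does with return_all_scores=True.
    scores = classifier(text, truncation=True, max_length=max_length)[0]
    return is_hate(scores, threshold)


def fake_classifier(text, **tokenizer_kwargs):
    """Hypothetical stand-in for the real pipeline, returning fixed scores."""
    return [[{"label": "hate", "score": 0.02}, {"label": "not_hate", "score": 0.98}]]


print(classify_text(fake_classifier, "example comment"))
```

With the real pipeline, the call would be `classify_text(hate_classifier, text)`. Note that truncation changes what the model sees for long comments, so flags for those rows may differ from an untruncated run.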


In [11]:
del hate_classifier
torch.cuda.empty_cache()  # release GPU memory cached by PyTorch once the pipeline is gone

In [None]:
# Save the rows that raised errors so the problematic texts can be inspected
error_df.to_csv('../data/error.csv', index=False)
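
One way to inspect the saved rows is to look at how long their texts are, since unusually long inputs are a plausible (but unconfirmed) cause of the out-of-memory errors. The sketch below summarises character lengths on a toy frame. The helper name `text_length_summary` and the toy data are hypothetical, and the function assumes at least one non-missing text.

```python
import pandas as pd


def text_length_summary(frame, column="text"):
    """Return character-length statistics for the text column, ignoring missing values."""
    lengths = frame[column].dropna().astype(str).str.len()
    return {
        "rows": int(lengths.shape[0]),
        "min_chars": int(lengths.min()),
        "median_chars": float(lengths.median()),
        "max_chars": int(lengths.max()),
    }


# Toy stand-in for the rows saved to error.csv.
toy_errors = pd.DataFrame({"text": ["short", "x" * 20000, None, "y" * 50000]})
print(text_length_summary(toy_errors))
```

For the real data, the same function could be applied to `pd.read_csv('../data/error.csv')` and compared against a summary of the full dataset.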