### **Data Annotation - Sentence paraphrasing**

After merging the scraped data into a single file, we used the "chatgpt_paraphraser_on_T5_base" model from Hugging Face to generate paraphrases of the questions we collected. These paraphrases let us build annotated question pairs. The paraphraser rewrites each original question with varied wording and structure, and we generate five paraphrases per original question. The first two paraphrases are paired with the original question to form similar question pairs. The remaining three paraphrases are shuffled to create a set of dissimilar questions.
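
The pairing step described above can be sketched as follows. This is a minimal illustration and not the code used in the project: the function name `build_pairs`, the labels `1`/`0`, and the use of a random cyclic shift to "shuffle" the last three paraphrases across questions are all assumptions. The column names `q1` and `d1` to `d5` match the CSV files written later in this notebook.

```python
import random

def build_pairs(rows, seed=0):
    """Build (question_a, question_b, label) tuples from paraphrase rows.

    rows: list of dicts with keys "q1" and "d1".."d5" (hypothetical in-memory
    form of the saved CSV rows). Label 1 = similar, 0 = dissimilar.
    """
    rng = random.Random(seed)
    pairs = []

    # Similar pairs: each original question with its own first two paraphrases.
    for row in rows:
        for col in ("d1", "d2"):
            pairs.append((row["q1"], row[col], 1))

    # Dissimilar pairs (assumed interpretation of "shuffled"): shift each of the
    # remaining columns by a random non-zero offset, so every original question
    # is matched with a paraphrase that belongs to a different question.
    n = len(rows)
    if n > 1:
        for col in ("d3", "d4", "d5"):
            offset = rng.randrange(1, n)
            for i, row in enumerate(rows):
                other = rows[(i + offset) % n]
                pairs.append((row["q1"], other[col], 0))
    return pairs

# Tiny made-up example with three questions.
example_rows = [
    {"q1": f"question {i}", **{f"d{k}": f"question {i} / paraphrase {k}" for k in range(1, 6)}}
    for i in range(3)
]
example_pairs = build_pairs(example_rows)
print(len(example_pairs))  # 3 * 2 similar + 3 * 3 dissimilar = 15
```

A cyclic shift is used here instead of a plain random shuffle because it guarantees that no paraphrase stays aligned with its own original. It does not guard against two different rows that happen to contain near-identical questions.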

In [1]:
# Import the pandas library as pd for data manipulation
import pandas as pd

# Read the CSV file containing the merged questions data into a DataFrame named df
# (a forward slash avoids the invalid "\m" escape sequence and works on all platforms)
df = pd.read_csv("data/merged_questions.csv")

In [9]:
# Paraphraser Reference: https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base

# Import necessary modules from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Set the device to use for processing: GPU if available, otherwise CPU
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer from the pre-trained "chatgpt_paraphraser_on_T5_base" model
tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")

# Load the model for sequence-to-sequence generation from the pre-trained "chatgpt_paraphraser_on_T5_base" model
# and move it to the specified device (GPU or CPU)
model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

# Define a function for paraphrasing input questions
def paraphrase(
    question,
    num_beams=5,
    num_beam_groups=5,
    num_return_sequences=5,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=128
):
    # Tokenize the input question and prepare it for model input
    input_ids = tokenizer(
        f'paraphrase: {question}',  # Add prefix "paraphrase:" to indicate paraphrasing
        return_tensors="pt", padding="longest",
        max_length=max_length,
        truncation=True,
    ).input_ids.to(device)
    
    # Generate paraphrases using the loaded model
    outputs = model.generate(
        input_ids, temperature=temperature, repetition_penalty=repetition_penalty,
        num_return_sequences=num_return_sequences, no_repeat_ngram_size=no_repeat_ngram_size,
        num_beams=num_beams, num_beam_groups=num_beam_groups,
        max_length=max_length, diversity_penalty=diversity_penalty
    )

    # Decode the generated outputs and remove special tokens
    res = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Return the paraphrased results
    return res
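
The defaults of `paraphrase` (five beams in five groups, five returned sequences, a positive diversity penalty) select diverse (group) beam search in `generate`. As a hedged aid for changing these values, the sketch below checks the constraints that, to our understanding, the transformers generation code enforces for this mode. The helper `check_beam_settings` is hypothetical and not part of transformers, so verify the rules against the installed version.

```python
def check_beam_settings(num_beams, num_beam_groups, num_return_sequences, diversity_penalty):
    """Return a list of problems with a diverse-beam-search configuration.

    Hypothetical helper: the rules below reflect our understanding of the
    checks applied by transformers, not an official API.
    """
    problems = []
    # Beams are split evenly across groups.
    if num_beams % num_beam_groups != 0:
        problems.append("num_beams must be divisible by num_beam_groups")
    # Only the final beams can be returned.
    if num_return_sequences > num_beams:
        problems.append("num_return_sequences must not exceed num_beams")
    # Without a penalty, all groups would produce the same hypotheses.
    if num_beam_groups > 1 and diversity_penalty <= 0:
        problems.append("diversity_penalty should be positive when num_beam_groups > 1")
    return problems

# The defaults used by paraphrase() pass all three checks.
print(check_beam_settings(num_beams=5, num_beam_groups=5,
                          num_return_sequences=5, diversity_penalty=3.0))  # []
```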

In [11]:
# Import tqdm library for progress tracking
import tqdm

# Ignore warnings to avoid cluttering the output
import warnings 
warnings.filterwarnings("ignore")

# Initialize an empty list to store paraphrased question pairs
l = []

# Initially we did a sanity-check run, which already created and saved paraphrases for the first 3000 samples.
# Now we generate paraphrases for the rest of the questions by looping over indices from 3000 up to the length of the DataFrame
for ind in tqdm.tqdm(range(3000, len(df))):
    
    # Every 1000 iterations, print the current index and start a new, empty batch
    if (ind % 1000) == 0:
        print(ind)
        l = []
    
    try:
        # Get the text of the question from the DataFrame
        text = df.loc[ind, "q1"]
        
        # Generate paraphrases for the question text using the paraphrase function
        l.append([ind, text] + paraphrase(text))
    
    except Exception as e:
        # Print an error message if an exception occurs during paraphrasing
        print("ERROR!!! @ ", ind, repr(e))
    
    # Save the current batch of paraphrased questions to a CSV file every 1000 iterations
    if ((ind + 1) % 1000) == 0:
        d = pd.DataFrame(l, columns=["idx", "q1", "d1", "d2", "d3", "d4", "d5"])
        d.to_csv("pairs/pairs_{}.csv".format(ind))
    
    # Save the final partial batch when reaching the last row of the DataFrame
    # (the last index produced by range() is len(df) - 1)
    elif ind == len(df) - 1:
        d = pd.DataFrame(l, columns=["idx", "q1", "d1", "d2", "d3", "d4", "d5"])
        d.to_csv("pairs/pairs_{}.csv".format(ind))

  0%|          | 0/17100 [00:00<?, ?it/s]

3000


  6%|▌         | 1000/17100 [11:02<2:52:06,  1.56it/s]

4000


 12%|█▏        | 2000/17100 [23:45<4:39:20,  1.11s/it]

5000


 18%|█▊        | 3000/17100 [34:17<2:41:55,  1.45it/s]

6000


 23%|██▎       | 4000/17100 [44:56<2:06:03,  1.73it/s]

7000


 29%|██▉       | 5000/17100 [54:24<2:10:37,  1.54it/s]

8000


 35%|███▌      | 6000/17100 [1:03:37<1:36:15,  1.92it/s]

9000


 41%|████      | 7000/17100 [1:13:00<2:11:08,  1.28it/s]

10000


 47%|████▋     | 8000/17100 [1:23:57<1:21:10,  1.87it/s]

11000


 53%|█████▎    | 9000/17100 [1:34:07<1:51:22,  1.21it/s]

12000


 58%|█████▊    | 10000/17100 [1:43:56<58:08,  2.04it/s] 

13000


 64%|██████▍   | 11000/17100 [1:53:46<57:49,  1.76it/s]  

14000


 70%|███████   | 12000/17100 [2:02:57<43:02,  1.98it/s]  

15000


 76%|███████▌  | 13000/17100 [2:12:41<30:36,  2.23it/s]  

16000


 82%|████████▏ | 14000/17100 [2:23:33<33:41,  1.53it/s]  

17000


 88%|████████▊ | 15000/17100 [2:34:48<24:31,  1.43it/s]

18000


 94%|█████████▎| 16000/17100 [2:48:01<12:44,  1.44it/s]  

19000


 99%|█████████▉| 17000/17100 [3:01:00<01:10,  1.41it/s]  

20000


100%|██████████| 17100/17100 [3:02:07<00:00,  1.56it/s]
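
The checkpointing pattern used in the loop (flush a batch every 1000 rows, plus one final flush for the leftover rows) can be sketched in isolation as below. This is a simplified stand-in and not the notebook's code: `run_with_checkpoints` and `fake_paraphrase` are hypothetical names, the standard `csv` module replaces pandas, and a small chunk size is used so the behaviour is easy to follow.

```python
import csv
import os
import tempfile

def run_with_checkpoints(items, process, out_dir, start=0, chunk_size=1000):
    """Process items[start:] and write one CSV per chunk, including the last partial chunk."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    buffer = []
    for ind in range(start, len(items)):
        try:
            buffer.append([ind, items[ind]] + process(items[ind]))
        except Exception as exc:
            # Skip the failing row but keep going, as the loop above does.
            print(f"ERROR at index {ind}: {exc!r}")

        at_chunk_end = (ind + 1) % chunk_size == 0
        at_last_item = ind == len(items) - 1  # range() never reaches len(items)
        if buffer and (at_chunk_end or at_last_item):
            path = os.path.join(out_dir, f"pairs_{ind}.csv")
            with open(path, "w", newline="", encoding="utf-8") as f:
                csv.writer(f).writerows(buffer)
            written.append(path)
            buffer = []
    return written

def fake_paraphrase(question):
    # Placeholder for the real paraphrase() call; returns two dummy variants.
    return [question.upper(), question.lower()]

with tempfile.TemporaryDirectory() as tmp:
    files = run_with_checkpoints([f"q{i}" for i in range(25)], fake_paraphrase,
                                 tmp, start=3, chunk_size=10)
    print([os.path.basename(p) for p in files])
    # ['pairs_9.csv', 'pairs_19.csv', 'pairs_24.csv']
```

Testing the end-of-data condition against `len(items) - 1` is what ensures that the rows after the last full chunk are also written to disk.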
