### **Data Annotation - Sentence pair creation**

Once we've generated paraphrased sentences for each question (five paraphrases per question, stored in columns `d1` to `d5`), we pair the first two paraphrases with the original question to create similar question pairs. We then shuffle the remaining three paraphrases across questions to produce dissimilar question pairs. Because the shuffle is random, a paraphrase can occasionally land next to its own question, or next to another question with the same meaning, so two similar questions end up labeled as dissimilar. Conversely, the similar set can contain two dissimilar questions that are labeled as similar because the original question was ambiguous. We detect and remove these anomalies from the dataset during the data preprocessing stage.


In [2]:
# Import necessary libraries/modules
import numpy as np  # NumPy for numerical operations
import pandas as pd  # Pandas for data manipulation
import glob  # Glob for file searching and manipulation
import tqdm  # tqdm for progress tracking


#### **Create question pairs using a single set - sanity check**

In [3]:
# Read the CSV file named "pairs_3999.csv" into a DataFrame named df
df = pd.read_csv("pairs/pairs_3999.csv")

# Exclude the first column (index column) from the DataFrame
df = df[df.columns[1:]]

# Display the first few rows of the DataFrame
df.head()


Unnamed: 0,idx,q1,d1,d2,d3,d4,d5
0,3000,Is it a good idea to go for masters in data sc...,Is it advisable to pursue a data science maste...,"With 7 years of experience in Indian banking, ...","Would it be wise for me, as a btech graduate w...",As a BTech graduate with 7+ years of experienc...,Should I consider studying data science abroad...
1,3001,How can a mini cloud storage device be built o...,What is the process for creating a miniature c...,How can one go about constructing or assemblin...,Can you explain the steps involved in building...,Is it possible to create or manufacture a mini...,In what way can a miniature cloud storage devi...
2,3002,How can we use data analytics or data science ...,In what ways can data analytics or data scienc...,How can data analytics or data science be appl...,What are the potential applications of data an...,Can data analytics or Data Science be applied ...,What are some practical uses of data analytics...
3,3003,Why am I not getting sum option in Power BI? I...,I am not able to access the sum option in Powe...,How come I am not able to access the sum funct...,"Power BI is not providing the sum option, and ...",Why isn't the Sum option available in Power BI...,What could be the reason for not being able to...
4,3004,"As a research assistant, you were assigned to ...",Your role as a research assistant involved col...,What are the potential issues with the timelin...,"As a research assistant, you are responsible f...","In your capacity as a research assistant, you ...",You work as a research assistant. We are going...


In [90]:
# Display paraphrased samples for index 0
df.values[0]

array([3000,
       'Is it a good idea to go for masters in data science abroad after a 7+ years experience in Indian banking? I’m a btech graduate and I’m trying for a career switch. I recently completed Google data analytics course and found this field interesting.',
       "Is it advisable to pursue a data science master's degree abroad as btech graduate who has worked in Indian banking for over 7 years, and am now considering pursuing specialized training through my Google data analytics course?",
       "With 7 years of experience in Indian banking, is it worth considering a master's degree program in data science abroad? I recently completed specialized training on Google analytics.",
       'Would it be wise for me, as a btech graduate with 7 years of experience in Indian banking, to pursue specialized training abroad after completing & initiating an Google data analytics course, on the cusp of switching my career?',
       "As a BTech graduate with 7+ years of experience in Ind

In [91]:
# Create a new DataFrame named df_similar containing specific columns from the original DataFrame df
# The selected columns include "idx" (index), "q1" (original question), "d1", and "d2" (paraphrased questions)
df_similar = df[["idx", "q1", "d1", "d2"]]

# Display the first few rows of the new DataFrame df_similar
df_similar.head()


Unnamed: 0,idx,q1,d1,d2
0,3000,Is it a good idea to go for masters in data sc...,Is it advisable to pursue a data science maste...,"With 7 years of experience in Indian banking, ..."
1,3001,How can a mini cloud storage device be built o...,What is the process for creating a miniature c...,How can one go about constructing or assemblin...
2,3002,How can we use data analytics or data science ...,In what ways can data analytics or data scienc...,How can data analytics or data science be appl...
3,3003,Why am I not getting sum option in Power BI? I...,I am not able to access the sum option in Powe...,How come I am not able to access the sum funct...
4,3004,"As a research assistant, you were assigned to ...",Your role as a research assistant involved col...,What are the potential issues with the timelin...


In [92]:
# Create a new DataFrame named df_similar1 containing specific columns from the DataFrame df_similar
# The selected columns include "idx" (index), "q1" (original question), and "d1" (first paraphrased question)
df_similar1 = df_similar[["idx", "q1", "d1"]]

# Rename the columns of df_similar1 to match the format of the question pairs
df_similar1.columns = ["idx", "q1", "q2"]

# Create another new DataFrame named df_similar2 containing specific columns from the DataFrame df_similar
# The selected columns include "idx" (index), "q1" (original question), and "d2" (second paraphrased question)
df_similar2 = df_similar[["idx", "q1", "d2"]]

# Rename the columns of df_similar2 to match the format of the question pairs
df_similar2.columns = ["idx", "q1", "q2"]

# Concatenate df_similar1 and df_similar2 to combine the similar question pairs
# The ignore_index=True parameter resets the index of the concatenated DataFrame
df_similar = pd.concat([df_similar1, df_similar2], ignore_index=True)

# Add a new column named "labels" to df_similar and assign a value of 1 to indicate similar question pairs
df_similar["labels"] = 1

In [93]:
# Display a few similar samples
df_similar.sort_values(by=["q1"]).head()

Unnamed: 0,idx,q1,q2,labels
1524,3524,66 marks in GAT test are good enough for MS in...,Would a GAT score of 66 be enough to qualify f...,1
524,3524,66 marks in GAT test are good enough for MS in...,Is a GAT score of 66 sufficient for admission ...,1
958,3958,A BA graduate with no work-ex should do MBA in...,For a BA graduate who has no experience in wor...,1
1958,3958,A BA graduate with no work-ex should do MBA in...,If you have graduated with a BA and no prior w...,1
1067,3067,A Tik-Tok video claims if a forecast states a ...,"According to the Tik-Tok video, what is the pe...",1
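
The two-slice-and-concatenate steps above can also be written as a single reshape. The following is a minimal sketch on a toy frame, assuming the same column names (`idx`, `q1`, `d1`, `d2`); the toy data and the name `similar` are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the paraphrase table (hypothetical data).
toy = pd.DataFrame({
    "idx": [0, 1],
    "q1": ["orig A", "orig B"],
    "d1": ["A para 1", "B para 1"],
    "d2": ["A para 2", "B para 2"],
})

# melt stacks d1 and d2 into one "q2" column, keeping idx and q1 on every row.
similar = (
    toy.melt(id_vars=["idx", "q1"], value_vars=["d1", "d2"], value_name="q2")
       .drop(columns="variable")   # drop the column that records d1/d2
       .assign(labels=1)           # 1 marks a similar pair
)
print(similar)
```

The row order matches the concatenation used in the notebook: all `d1` pairs first, then all `d2` pairs.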


In [4]:
# Create a new DataFrame named df_notsimilar containing specific columns from the original DataFrame df
# The selected columns include "idx" (index), "q1" (original question), and paraphrased questions "d3", "d4", and "d5"
df_notsimilar = df[["idx", "q1", "d3", "d4", "d5"]]

# Display the first few rows of the new DataFrame df_notsimilar
df_notsimilar.head()

Unnamed: 0,idx,q1,d3,d4,d5
0,3000,Is it a good idea to go for masters in data sc...,"Would it be wise for me, as a btech graduate w...",As a BTech graduate with 7+ years of experienc...,Should I consider studying data science abroad...
1,3001,How can a mini cloud storage device be built o...,Can you explain the steps involved in building...,Is it possible to create or manufacture a mini...,In what way can a miniature cloud storage devi...
2,3002,How can we use data analytics or data science ...,What are the potential applications of data an...,Can data analytics or Data Science be applied ...,What are some practical uses of data analytics...
3,3003,Why am I not getting sum option in Power BI? I...,"Power BI is not providing the sum option, and ...",Why isn't the Sum option available in Power BI...,What could be the reason for not being able to...
4,3004,"As a research assistant, you were assigned to ...","As a research assistant, you are responsible f...","In your capacity as a research assistant, you ...",You work as a research assistant. We are going...


In [5]:
# Create a new DataFrame named df_notsimilar1 containing specific columns from the DataFrame df_notsimilar
# The selected columns include "idx" (index), "q1" (original question), and "d3" (first paraphrased question)
df_notsimilar1 = df_notsimilar[["idx", "q1", "d3"]]

# Rename the columns of df_notsimilar1 to match the format of the question pairs
df_notsimilar1.columns = ["idx", "q1", "q2"]

# Create another new DataFrame named df_notsimilar2 containing specific columns from the DataFrame df_notsimilar
# The selected columns include "idx" (index), "q1" (original question), and "d4" (second paraphrased question)
df_notsimilar2 = df_notsimilar[["idx", "q1", "d4"]]

# Rename the columns of df_notsimilar2 to match the format of the question pairs
df_notsimilar2.columns = ["idx", "q1", "q2"]

# Create another new DataFrame named df_notsimilar3 containing specific columns from the DataFrame df_notsimilar
# The selected columns include "idx" (index), "q1" (original question), and "d5" (third paraphrased question)
df_notsimilar3 = df_notsimilar[["idx", "q1", "d5"]]

# Rename the columns of df_notsimilar3 to match the format of the question pairs
df_notsimilar3.columns = ["idx", "q1", "q2"]

In [106]:
df_notsimilar1.head()

Unnamed: 0,idx,q1,q2
0,3000,Is it a good idea to go for masters in data sc...,"Would it be wise for me, as a btech graduate w..."
1,3001,How can a mini cloud storage device be built o...,Can you explain the steps involved in building...
2,3002,How can we use data analytics or data science ...,What are the potential applications of data an...
3,3003,Why am I not getting sum option in Power BI? I...,"Power BI is not providing the sum option, and ..."
4,3004,"As a research assistant, you were assigned to ...","As a research assistant, you are responsible f..."


In [107]:
df_notsimilar2.head()

Unnamed: 0,idx,q1,q2
0,3000,Is it a good idea to go for masters in data sc...,As a BTech graduate with 7+ years of experienc...
1,3001,How can a mini cloud storage device be built o...,Is it possible to create or manufacture a mini...
2,3002,How can we use data analytics or data science ...,Can data analytics or Data Science be applied ...
3,3003,Why am I not getting sum option in Power BI? I...,Why isn't the Sum option available in Power BI...
4,3004,"As a research assistant, you were assigned to ...","In your capacity as a research assistant, you ..."


In [108]:
df_notsimilar3.head()

Unnamed: 0,idx,q1,q2
0,3000,Is it a good idea to go for masters in data sc...,Should I consider studying data science abroad...
1,3001,How can a mini cloud storage device be built o...,In what way can a miniature cloud storage devi...
2,3002,How can we use data analytics or data science ...,What are some practical uses of data analytics...
3,3003,Why am I not getting sum option in Power BI? I...,What could be the reason for not being able to...
4,3004,"As a research assistant, you were assigned to ...",You work as a research assistant. We are going...


In [109]:
# Shuffle the "q2" column of each dissimilar DataFrame so that, in most rows, q1 is paired with a paraphrase of a different question
# np.random.permutation returns a shuffled copy; shuffling through .values in place can fail when the array is read-only (e.g. with pandas Copy-on-Write)
df_notsimilar1 = df_notsimilar1.assign(q2=np.random.permutation(df_notsimilar1["q2"].to_numpy()))

# Same shuffle for the "q2" column of df_notsimilar2
df_notsimilar2 = df_notsimilar2.assign(q2=np.random.permutation(df_notsimilar2["q2"].to_numpy()))

# Same shuffle for the "q2" column of df_notsimilar3
df_notsimilar3 = df_notsimilar3.assign(q2=np.random.permutation(df_notsimilar3["q2"].to_numpy()))

In [110]:
df_notsimilar1.head()

Unnamed: 0,idx,q1,q2
0,3000,Is it a good idea to go for masters in data sc...,"In the world, what is the percentage of indivi..."
1,3001,How can a mini cloud storage device be built o...,Can you suggest some male names that sound low...
2,3002,How can we use data analytics or data science ...,How does Concordia University of Edmonton's ma...
3,3003,Why am I not getting sum option in Power BI? I...,I'm a computer science engineering student in ...
4,3004,"As a research assistant, you were assigned to ...",Will ISRO accept applications from final year ...


In [111]:
df_notsimilar2.head()

Unnamed: 0,idx,q1,q2
0,3000,Is it a good idea to go for masters in data sc...,With 3 years of experience in an MNC and a sal...
1,3001,How can a mini cloud storage device be built o...,Is there a way to export and import your Googl...
2,3002,How can we use data analytics or data science ...,Which degree in engineering is the most benefi...
3,3003,Why am I not getting sum option in Power BI? I...,How would you define CDB AIA at Cognizant and ...
4,3004,"As a research assistant, you were assigned to ...",Could you suggest some brief one-liners on the...


In [112]:
df_notsimilar3.head()

Unnamed: 0,idx,q1,q2
0,3000,Is it a good idea to go for masters in data sc...,"Hey all, I'm Karim from Morocco and I want to ..."
1,3001,How can a mini cloud storage device be built o...,What are the steps to follow when creating a d...
2,3002,How can we use data analytics or data science ...,Is it better to join Newgen or Capgemini as a ...
3,3003,Why am I not getting sum option in Power BI? I...,Why does IT industry is experiencing recession?
4,3004,"As a research assistant, you were assigned to ...","In what ways are linear models, interactive mo..."
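
A random shuffle does not guarantee that every paraphrase moves to a different row, so some rows may still pair a question with its own paraphrase. One way to catch those rows is to carry the source `idx` along with the shuffled text. This is a sketch under assumptions: the toy frame, the seeded generator, and the helper column `q2_idx` are hypothetical and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the shuffle is reproducible

# Toy stand-in for one dissimilar slice (hypothetical data).
toy = pd.DataFrame({
    "idx": np.arange(6),
    "q1": [f"question {i}" for i in range(6)],
    "q2": [f"paraphrase of question {i}" for i in range(6)],
})

# perm[i] is the source row whose paraphrase ends up in row i.
perm = rng.permutation(len(toy))
shuffled = toy.assign(
    q2=toy["q2"].to_numpy()[perm],
    q2_idx=toy["idx"].to_numpy()[perm],  # remember where each q2 came from
)

# Rows where q2 still belongs to its own q1 would be mislabeled as dissimilar.
self_paired = shuffled["idx"] == shuffled["q2_idx"]
clean = shuffled.loc[~self_paired].drop(columns="q2_idx")
print(f"dropped {int(self_paired.sum())} self-paired rows")
```

This only catches exact self-pairings; two different questions that happen to mean the same thing would still need a similarity-based check.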


In [113]:
# Concatenate the DataFrames df_notsimilar1, df_notsimilar2, and df_notsimilar3 along the row axis
# to combine the dissimilar question pairs into a single DataFrame df_notsimilar
# The ignore_index=True parameter resets the index of the concatenated DataFrame
df_notsimilar = pd.concat([df_notsimilar1, df_notsimilar2, df_notsimilar3], ignore_index=True)

# Add a new column named "labels" to df_notsimilar and assign a value of 0 to indicate dissimilar question pairs
df_notsimilar["labels"] = 0

In [114]:
# Display a few dissimilar samples
df_notsimilar.sort_values(by=["q1"]).head()

Unnamed: 0,idx,q1,q2,labels
2524,3524,66 marks in GAT test are good enough for MS in...,"Which is more lucrative, a Health Data Science...",0
1524,3524,66 marks in GAT test are good enough for MS in...,I am looking to hire a complete digital market...,0
524,3524,66 marks in GAT test are good enough for MS in...,"Which career choice, mechanical or data engine...",0
2958,3958,A BA graduate with no work-ex should do MBA in...,"Can online certification courses, such as busi...",0
958,3958,A BA graduate with no work-ex should do MBA in...,When is it difficult to interpret the informat...,0


In [115]:
# Concatenate the DataFrames df_similar and df_notsimilar along the row axis
# to combine both similar and dissimilar question pairs into a single DataFrame df_set
# The ignore_index=True parameter resets the index of the concatenated DataFrame
df_set = pd.concat([df_similar, df_notsimilar], ignore_index=True)

# Display the concatenated DataFrame df_set, which contains both similar and dissimilar question pairs
df_set

Unnamed: 0,idx,q1,q2,labels
0,3000,Is it a good idea to go for masters in data sc...,Is it advisable to pursue a data science maste...,1
1,3001,How can a mini cloud storage device be built o...,What is the process for creating a miniature c...,1
2,3002,How can we use data analytics or data science ...,In what ways can data analytics or data scienc...,1
3,3003,Why am I not getting sum option in Power BI? I...,I am not able to access the sum option in Powe...,1
4,3004,"As a research assistant, you were assigned to ...",Your role as a research assistant involved col...,1
...,...,...,...,...
4995,3995,I am working in a private sector but not happy...,What are the steps to enroll in a PG program w...,0
4996,3996,Which all government jobs have provision for h...,How many employers and companies recognize the...,0
4997,3997,Which are the different colleges in Bangalore ...,How can I improve my chances of landing an ana...,0
4998,3998,Which is the best option for an electrical eng...,Despite being an undergraduate student in info...,0


In [120]:
# Remove duplicate rows from the DataFrame df_set based on the combination of columns "q1" and "q2",
# then sort by the original question and reset the index
df_set = df_set.drop_duplicates(subset=["q1","q2"]).sort_values(by=["q1"]).reset_index(drop=True)
df_set.head()

Unnamed: 0,idx,q1,q2,labels
0,3524,66 marks in GAT test are good enough for MS in...,"Which is more lucrative, a Health Data Science...",0
1,3524,66 marks in GAT test are good enough for MS in...,I am looking to hire a complete digital market...,0
2,3524,66 marks in GAT test are good enough for MS in...,Is a GAT score of 66 sufficient for admission ...,1
3,3524,66 marks in GAT test are good enough for MS in...,"Which career choice, mechanical or data engine...",0
4,3524,66 marks in GAT test are good enough for MS in...,Would a GAT score of 66 be enough to qualify f...,1


In [124]:
# Display a few question pairs
df_set.values[:3]

array([[3524,
        '66 marks in GAT test are good enough for MS in nust in field of data science and artificial intelligence?',
        'Which is more lucrative, a Health Data Science course or Public Health?',
        0],
       [3524,
        '66 marks in GAT test are good enough for MS in nust in field of data science and artificial intelligence?',
        'I am looking to hire a complete digital marketing strategist with no minimum or upper limit, but within £5k. Who should I bring on board and what is the procedure?',
        0],
       [3524,
        '66 marks in GAT test are good enough for MS in nust in field of data science and artificial intelligence?',
        'Is a GAT score of 66 sufficient for admission to MS in data science and artificial intelligence program at NUSTI?',
        1]], dtype=object)

#### **Create question pairs from all the paraphrased CSVs**

In [133]:
# Get a list of file paths matching the pattern "pairs/*pairs_*.csv" using the glob.glob() function
csv_list = glob.glob("pairs/*pairs_*.csv")

# Initialize an empty list to store cleaned DataFrames
set_list = []

# Iterate through each file path in csv_list using tqdm for progress tracking
for path in tqdm.tqdm(csv_list):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(path)
    
    # Exclude the first column (index column) from the DataFrame
    df = df[df.columns[1:]]
    
    # Extract similar question pairs and preprocess them
    df_similar = df[["idx", "q1", "d1", "d2"]]
    df_similar1 = df_similar[["idx", "q1", "d1"]]
    df_similar1.columns = ["idx", "q1", "q2"]
    df_similar2 = df_similar[["idx", "q1", "d2"]]
    df_similar2.columns = ["idx", "q1", "q2"]
    df_similar = pd.concat([df_similar1, df_similar2], ignore_index=True)
    df_similar["labels"] = 1
    
    # Extract dissimilar question pairs and preprocess them
    df_notsimilar = df[["idx", "q1", "d3", "d4", "d5"]]
    df_notsimilar1 = df_notsimilar[["idx", "q1", "d3"]]
    df_notsimilar1.columns = ["idx", "q1", "q2"]
    df_notsimilar2 = df_notsimilar[["idx", "q1", "d4"]]
    df_notsimilar2.columns = ["idx", "q1", "q2"]
    df_notsimilar3 = df_notsimilar[["idx", "q1", "d5"]]
    df_notsimilar3.columns = ["idx", "q1", "q2"]
    
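    # Shuffle each "q2" column across rows so that q1 is paired with a paraphrase of a different question
    # (np.random.permutation returns a shuffled copy instead of writing into .values in place)
    df_notsimilar1 = df_notsimilar1.assign(q2=np.random.permutation(df_notsimilar1["q2"].to_numpy()))
    df_notsimilar2 = df_notsimilar2.assign(q2=np.random.permutation(df_notsimilar2["q2"].to_numpy()))
    df_notsimilar3 = df_notsimilar3.assign(q2=np.random.permutation(df_notsimilar3["q2"].to_numpy()))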
    
    # Concatenate the shuffled dissimilar question pairs into a single DataFrame
    df_notsimilar = pd.concat([df_notsimilar1, df_notsimilar2, df_notsimilar3], ignore_index=True)
    df_notsimilar["labels"] = 0
    
    # Concatenate the similar and dissimilar question pairs into a single DataFrame
    df_set = pd.concat([df_similar, df_notsimilar], ignore_index=True)
    
    # Remove duplicate question pairs, sort by the original question, and reset the index
    df_set = df_set.drop_duplicates(subset=["q1","q2"]).sort_values(by=["q1"]).reset_index(drop=True)
    
    # Append the cleaned DataFrame to the set_list
    set_list.append(df_set)

100%|██████████| 20/20 [00:00<00:00, 23.16it/s]
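
The per-file steps in the loop above could be collected into one helper, so that the single-file sanity check and the full run share the same code path. The sketch below assumes the same column layout (`idx`, `q1`, `d1` to `d5`); the function name `build_pairs`, the toy data, and the seeded generator are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def build_pairs(df, rng):
    """Return labeled (q1, q2) pairs for one paraphrase table."""
    def stack(cols, label, shuffle):
        parts = []
        for col in cols:
            q2 = df[col].to_numpy()
            if shuffle:
                q2 = rng.permutation(q2)  # break the q1/q2 alignment
            parts.append(pd.DataFrame({"idx": df["idx"], "q1": df["q1"], "q2": q2}))
        return pd.concat(parts, ignore_index=True).assign(labels=label)

    pairs = pd.concat(
        [stack(["d1", "d2"], 1, False),        # similar pairs, label 1
         stack(["d3", "d4", "d5"], 0, True)],  # shuffled pairs, label 0
        ignore_index=True,
    )
    return (pairs.drop_duplicates(subset=["q1", "q2"])
                 .sort_values(by=["q1"])
                 .reset_index(drop=True))

# Toy table with five questions and five paraphrases each (hypothetical data).
toy = pd.DataFrame({
    "idx": range(5),
    "q1": [f"question {i}" for i in range(5)],
    **{f"d{k}": [f"paraphrase {k} of question {i}" for i in range(5)]
       for k in range(1, 6)},
})
pairs = build_pairs(toy, np.random.default_rng(42))
print(pairs["labels"].value_counts())
```

Passing a seeded generator makes the dissimilar pairs reproducible between runs, which the unseeded `np.random` calls in the notebook do not guarantee.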


In [135]:
# Concatenate all DataFrames in set_list into a single DataFrame named sentence_pairs
sentence_pairs = pd.concat(set_list, ignore_index=True)

# Remove duplicate question pairs from the DataFrame based on the combination of columns "q1" and "q2"
# Sort the DataFrame by the values in the "q1" column and reset the index
sentence_pairs = sentence_pairs.drop_duplicates(subset=["q1","q2"]).sort_values(by=["q1"]).reset_index(drop=True)

In [137]:
# Export the DataFrame sentence_pairs to a CSV file named "sentence-pairs.csv" in the "pairs" directory
sentence_pairs.to_csv("pairs/sentence-pairs.csv", index=False)

#### **Adding the cosine similarity score of each question pair to sentence-pairs.csv**

In [142]:
# Import the SentenceTransformer class and the util module from the sentence_transformers library
import torch  # installed as a dependency of sentence_transformers
from sentence_transformers import SentenceTransformer, util

# Initialize a SentenceTransformer model named model
# The model used here is 'sentence-transformers/all-MiniLM-L6-v2'
# Use the GPU ("cuda") when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)

In [148]:
# Initialize an empty list named scores to store cosine similarity scores
scores = []

# Iterate through each row index in the sentence_pairs DataFrame using tqdm for progress tracking
for ind in tqdm.tqdm(range(len(sentence_pairs))):
    # Extract the two sentences (question pairs) from the current row
    sentences = [sentence_pairs.loc[ind,"q1"], sentence_pairs.loc[ind, "q2"]]
    
    # Encode the sentences into numerical representations using the SentenceTransformer model
    embeddings = model.encode(sentences)
    
    # Calculate the cosine similarity between the embeddings of the two sentences
    # Append the calculated similarity score to the scores list
    scores.append(util.cos_sim(embeddings[0],embeddings[1]))

100%|██████████| 94733/94733 [10:08<00:00, 155.58it/s]
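
The loop above encodes one pair per call. A common alternative is to encode the whole `q1` and `q2` columns in two calls and then compute the similarity row by row on the resulting arrays. The sketch below shows only the row-wise step with NumPy; the helper `rowwise_cosine` and the small stand-in embeddings are assumptions for illustration, and in the notebook the two arrays would come from the model.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two (n, d) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    num = np.sum(a * b, axis=1)                                   # row-wise dot products
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)   # product of row norms
    return num / den

# Stand-in embeddings (hypothetical); row i of emb_q1 pairs with row i of emb_q2.
emb_q1 = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
emb_q2 = np.array([[2.0, 0.0], [1.0, -1.0], [0.0, -1.0]])

scores = rowwise_cosine(emb_q1, emb_q2)
print(scores)
```

The result is already a flat array of floats, so it can be assigned to a DataFrame column without a per-element conversion.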


In [153]:
# Convert the cosine similarity scores from the scores list to float data type
scoresf = [float(i) for i in scores]

# Add a new column named "cos_sim" to the sentence_pairs DataFrame and assign the cosine similarity scores
sentence_pairs["cos_sim"] = scoresf

# Export the updated DataFrame sentence_pairs to a CSV file named "sentence-pairs-cos.csv" in the "pairs" directory
sentence_pairs.to_csv("pairs/sentence-pairs-cos.csv", index=False)

In [157]:
# Validate the annotated question pairs by comparing the mean cosine similarity of each label
sentence_pairs.groupby("labels")["cos_sim"].mean()

labels
0    0.147896
1    0.871013
Name: cos_sim, dtype: float64
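
The gap between the two means suggests that `cos_sim` could be used to flag the mislabeled pairs mentioned at the start of this section. The following is a minimal sketch, not the preprocessing actually used: the thresholds `SIM_MIN` and `DIS_MAX` and the toy rows are hypothetical and would need tuning against the real score distribution.

```python
import pandas as pd

SIM_MIN = 0.5  # hypothetical: similar pairs scoring below this look mislabeled
DIS_MAX = 0.8  # hypothetical: dissimilar pairs scoring above this look mislabeled

# Toy stand-in for sentence_pairs (hypothetical data).
toy = pd.DataFrame({
    "labels":  [1, 1, 0, 0],
    "cos_sim": [0.91, 0.22, 0.10, 0.95],
})

# A pair is suspect when its score contradicts its label.
suspect = (
    ((toy["labels"] == 1) & (toy["cos_sim"] < SIM_MIN))
    | ((toy["labels"] == 0) & (toy["cos_sim"] > DIS_MAX))
)
filtered = toy.loc[~suspect].reset_index(drop=True)
print(filtered)
```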

In [165]:
# Check sample distribution
sentence_pairs.groupby("labels")["cos_sim"].count()

labels
0    59999
1    34734
Name: cos_sim, dtype: int64