# Building a Paraphrase Classifier

This is an assignment for a Virtual Assistants course done in February 2022.

The goal of this assignment is to build a simple paraphrase classifier: given two input sentences, it outputs yes or no depending on whether they are paraphrases of each other.

The dataset used to train and test the classifier can be found here: https://github.com/cocoxu/SemEval-PIT2015. It was used for an international competition at SemEval 2015 called *Paraphrase and Semantic Similarity in Twitter*.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

## Exploring the dataset

In [2]:
train_data = pd.read_csv("./SemEval-PIT2015-master/data/SemEval-PIT2015-github/SemEval-PIT2015-github/data/train.data", sep='\t', names=['Topic_Id', 'Topic_Name', 'Sent_1', 'Sent_2', 'Label', 'Sent_1_tag', 'Sent_2_tag'])
dev_data = pd.read_csv("./SemEval-PIT2015-master/data/SemEval-PIT2015-github/SemEval-PIT2015-github/data/dev.data", sep='\t', names=['Topic_Id', 'Topic_Name', 'Sent_1', 'Sent_2', 'Label', 'Sent_1_tag', 'Sent_2_tag'])
test_data = pd.read_csv("./SemEval-PIT2015-master/data/SemEval-PIT2015-github/SemEval-PIT2015-github/data/test.data", sep='\t', names=['Topic_Id', 'Topic_Name', 'Sent_1', 'Sent_2', 'Label', 'Sent_1_tag', 'Sent_2_tag'])

Use the "Label" column to generate the new columns "Class", "Positive votes" and "Negative votes".

Assumptions made:
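- A majority vote in the "Label" column is what generates the "Class": a pair is a paraphrase ("P") only when positive votes outnumber negative votes
- In the test set, where "Label" is a single number, a value above 2 is treated as a paraphrase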

In [3]:
c = []
p_votes = []
n_votes = []
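for _, row in train_data.iterrows():
    # 'Label' looks like "(3, 2)": character 1 is the positive vote count, character 4 the negative one
    positive, negative = int(row['Label'][1]), int(row['Label'][4])
    p_votes.append(positive)
    n_votes.append(negative)
    # Majority vote: ties and negative majorities both count as non-paraphrase
    c.append("P" if positive > negative else "NP")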
        
train_data['Class'] = c
train_data['Positive votes'] = p_votes
train_data['Negative votes'] = n_votes

c = []
p_votes = []
n_votes = []
for _, row in dev_data.iterrows():
    # Same "(positive, negative)" label format as the training set
    positive, negative = int(row['Label'][1]), int(row['Label'][4])
    p_votes.append(positive)
    n_votes.append(negative)
    c.append("P" if positive > negative else "NP")
        
dev_data['Class'] = c
dev_data['Positive votes'] = p_votes
dev_data['Negative votes'] = n_votes

c = []
for row in test_data.iterrows():
    row = row[1]
    if (int(row['Label']) > 2):
        c.append("P")
    else:
        c.append("NP")
        
test_data['Class'] = c
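
The loops for the training and dev sets share the same majority-vote logic. Below is a more compact sketch, assuming labels are strings of the form `(3, 2)` (positive votes, then negative votes), as the character indexing above implies. `parse_votes` and `vote_class` are hypothetical helper names, not part of the original notebook.

```python
import re

def parse_votes(label):
    """Return (positive, negative) vote counts from a label such as '(3, 2)'."""
    positive, negative = (int(n) for n in re.findall(r"\d+", label))
    return positive, negative

def vote_class(label):
    """Majority vote: 'P' only when positive votes outnumber negative votes."""
    positive, negative = parse_votes(label)
    return "P" if positive > negative else "NP"

print(vote_class("(4, 1)"))  # P
print(vote_class("(2, 3)"))  # NP
```

With these helpers, the class column could be built with `train_data['Label'].apply(vote_class)`. Unlike character indexing, the regular expression would also cope with vote counts of more than one digit.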

## Developing a baseline algorithm

Our baseline algorithm can be simple: it measures how many words the two sentences share. Each sentence is lowercased and split on whitespace into a set of words. The overlap is then scored as twice the number of shared words divided by the combined size of the two word sets (the Dice coefficient). If the score is 0.5 or higher, we classify the pair as "paraphrase"; otherwise we classify it as "non-paraphrase".

We'll evaluate results using the Dev set.

In [4]:
pred = []
for row in dev_data.iterrows():
    row = row[1]
    words1 = set(row['Sent_1'].lower().split())
    words2 = set(row['Sent_2'].lower().split())
    same = words1.intersection(words2)
    score = 2*len(same)/(len(words1) + len(words2))
    if score >= 0.5:
        pred.append('P')
    else:
        pred.append('NP')
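
The score computed in the cell above is the Dice coefficient over the two word sets. Here is a minimal sketch of the same rule as a reusable function, with a small worked example. `overlap_score` is a hypothetical name, and the guard for two empty sentences is an added assumption that the original cell does not include.

```python
def overlap_score(sent_1, sent_2):
    """Dice coefficient over the sets of lowercased, whitespace-split words."""
    words_1 = set(sent_1.lower().split())
    words_2 = set(sent_2.lower().split())
    total = len(words_1) + len(words_2)
    if total == 0:  # both sentences empty: define the score as 0
        return 0.0
    return 2 * len(words_1 & words_2) / total

# Two shared words ("the", "cat") out of 4 + 4 words: 2 * 2 / 8 = 0.5
score = overlap_score("The cat sat down", "the cat ran away")
print(score)                          # 0.5
print("P" if score >= 0.5 else "NP")  # P
```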

Now let's calculate precision, recall and accuracy for our predictions against the Dev set labels.

Precision and recall can be calculated like so:

![Image](https://miro.medium.com/max/444/1*7J08ekAwupLBegeUI8muHA.png)

Source: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9?gi=2925b9472506
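
Written out, the two metrics computed in the next cell are given below. $TP$, $FP$ and $FN$ are the counts of true positives, false positives and false negatives, and "paraphrase" (`P`) is treated as the positive class:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
$$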

In [5]:
actual = dev_data['Class'].tolist()

tp = 0
fp = 0
fn = 0
acc = 0
for pair in zip(pred, actual):
#     print(pair[0], pair[1])
    if pair[0] == 'P' and pair[0] == pair[1]:
        tp += 1
    elif pair[0] == 'NP' and pair[0] != pair[1]:
        fn += 1
    elif pair[0] == 'P' and pair[0] != pair[1]:
        fp += 1
        
    if pair[0] == pair[1]:
        acc += 1
    
precision = tp/(tp+fp)
recall = tp/(tp+fn)
acc = acc / len(actual)

print('Precision: ', precision)
print('Recall: ', recall)
print('Accuracy: ', acc)

Precision:  0.6976
Recall:  0.2965986394557823
Accuracy:  0.7412735350116353
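
The same metric computation is repeated for each model later on, so it could be wrapped in a function. This is a sketch only: `evaluate` is a hypothetical helper name, and returning 0.0 when a denominator is zero is an assumption the original cell does not make.

```python
def evaluate(predicted, actual, positive="P"):
    """Return (precision, recall, accuracy) for a binary labelling."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    correct = sum(p == a for p, a in pairs)
    # Guard against empty denominators, e.g. when nothing is predicted positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return precision, recall, accuracy

# One true positive, one false positive, one false negative, one true negative
print(evaluate(["P", "P", "NP", "NP"], ["P", "NP", "P", "NP"]))  # (0.5, 0.5, 0.5)
```

In the notebook it would be called as `evaluate(pred, dev_data['Class'].tolist())`.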


Based on these results, the baseline performs reasonably well for such a simple approach: about 70% of the pairs it labels as paraphrases are correct, and overall accuracy is about 74%.

Recall is the weakest metric at about 0.30, which means the baseline produces a large number of false negatives: it misses roughly 70% of the true paraphrases.

Let's try a different approach and see if that gets better results.

## Applying a classification model

Now, we can try using a simple classification algorithm from sklearn and see if that is a better approach. 

We'll use a similar approach of splitting each sentence into words. Each sentence is converted into a word-count vector using a technique known as Bag of Words, the two vectors of a pair are added together, and the result is used as the feature vector for the classification algorithm. A k-nearest neighbours (KNN) classifier then performs the classification.

Source: https://www.mygreatlearning.com/blog/bag-of-words/
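
To make the representation concrete, here is a plain-Python illustration of Bag of Words on one sentence pair. It does not use `CountVectorizer` itself. The tokenisation (lowercasing, tokens of two or more word characters) is an assumption meant to mirror `CountVectorizer`'s defaults, and `bag_of_words` and `to_vector` are hypothetical helper names.

```python
import re
from collections import Counter

def bag_of_words(sentence):
    """Count lowercased tokens of two or more word characters."""
    return Counter(re.findall(r"\b\w\w+\b", sentence.lower()))

def to_vector(counts, vocabulary):
    """Lay the counts out in a fixed vocabulary order."""
    return [counts[word] for word in vocabulary]

sent_1 = "the game was great"
sent_2 = "great game tonight"
vocabulary = sorted(set(bag_of_words(sent_1)) | set(bag_of_words(sent_2)))
vec_1 = to_vector(bag_of_words(sent_1), vocabulary)
vec_2 = to_vector(bag_of_words(sent_2), vocabulary)
# The pair is represented by the element-wise sum of the two sentence vectors
pair_vector = [a + b for a, b in zip(vec_1, vec_2)]

print(vocabulary)   # ['game', 'great', 'the', 'tonight', 'was']
print(pair_vector)  # [2, 2, 1, 1, 1]
```

In the summed vector a count of 2 suggests a word shared by both sentences, but it could equally come from a word repeated within one sentence, so the sum only loosely encodes overlap.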

In [6]:
CountVec = CountVectorizer()
sentences = train_data['Sent_1'].tolist() + train_data['Sent_2'].tolist()
count_data = CountVec.fit(sentences)

In [7]:
transformed_sent1 = CountVec.transform(train_data['Sent_1'].tolist())
transformed_sent2 = CountVec.transform(train_data['Sent_2'].tolist())

train_data['T_Sent_1'] = transformed_sent1
train_data['T_Sent_2'] = transformed_sent2

In [8]:
clf = KNeighborsClassifier()
data = np.add(transformed_sent1.toarray(),transformed_sent2.toarray())
X = pd.DataFrame(data,columns=CountVec.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
y = train_data[['Class']]
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=, random_state=1110)
clf.fit(X, y.values.ravel())
# score = clf.score(X_test, y_test.values.ravel())
# print(score)

KNeighborsClassifier()

Now that the model is trained, we can evaluate it on the Dev set.

In [9]:
transformed_sent1 = CountVec.transform(dev_data['Sent_1'].tolist())
transformed_sent2 = CountVec.transform(dev_data['Sent_2'].tolist())
data = np.add(transformed_sent1.toarray(),transformed_sent2.toarray())
X = pd.DataFrame(data,columns=CountVec.get_feature_names_out())
# y = dev_data[['Class']]
# score = clf.score(X, y.values.ravel())
# print(score)

In [10]:
predictions = clf.predict(X)

In [11]:
actual = dev_data['Class'].tolist()

tp = 0
fp = 0
fn = 0
acc = 0
for pair in zip(predictions, actual):
#     print(pair[0], pair[1])
    if pair[0] == 'P' and pair[0] == pair[1]:
        tp += 1
    elif pair[0] == 'NP' and pair[0] != pair[1]:
        fn += 1
    elif pair[0] == 'P' and pair[0] != pair[1]:
        fp += 1
        
    if pair[0] == pair[1]:
        acc += 1
    
precision = tp/(tp+fp)
recall = tp/(tp+fn)
acc = acc / len(actual)

print('Precision: ', precision)
print('Recall: ', recall)
print('Accuracy: ', acc)

Precision:  0.31746031746031744
Recall:  0.05442176870748299
Accuracy:  0.6695578591072562


Based on the above results, this new approach performs worse than the baseline. Precision is less than half of the baseline's (0.32 versus 0.70), recall is less than a fifth (0.05 versus 0.30), and overall accuracy dropped by approximately 7 percentage points (0.67 versus 0.74).

A potential reason could be that, with over 8000 features, the vectors are very sparse, so the representations of different inputs differ only in a small number of positions.

One possible remedy is to remove stop words (such as 'a', 'at' and 'to'). This reduces the number of features and focuses the model on the words that carry meaning in each sentence.

## Improving using stopwords

For this, all we need to do is make our CountVectorizer omit stop words by passing `stop_words='english'`.
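
As an illustration of what stop-word removal does to a single sentence, here is a plain-Python sketch. The stop list is a hypothetical, deliberately tiny one chosen for the example, whereas scikit-learn's built-in English list is much longer. `content_tokens` is likewise a hypothetical name.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only
STOP_WORDS = {"a", "at", "to", "the", "is", "was"}

def content_tokens(sentence, stop_words=STOP_WORDS):
    """Lowercase, tokenise, and drop any token found in stop_words."""
    tokens = re.findall(r"\b\w\w+\b", sentence.lower())
    return [token for token in tokens if token not in stop_words]

print(content_tokens("The game was at a great stadium"))
# ['game', 'great', 'stadium']
```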

In [12]:
CountVec = CountVectorizer(stop_words='english')
sentences = train_data['Sent_1'].tolist() + train_data['Sent_2'].tolist()
count_data = CountVec.fit(sentences)

transformed_sent1 = CountVec.transform(train_data['Sent_1'].tolist())
transformed_sent2 = CountVec.transform(train_data['Sent_2'].tolist())

train_data['T_Sent_1'] = transformed_sent1
train_data['T_Sent_2'] = transformed_sent2

clf = KNeighborsClassifier()
data = np.add(transformed_sent1.toarray(),transformed_sent2.toarray())
X = pd.DataFrame(data,columns=CountVec.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
y = train_data[['Class']]
clf.fit(X, y.values.ravel())

transformed_sent1 = CountVec.transform(dev_data['Sent_1'].tolist())
transformed_sent2 = CountVec.transform(dev_data['Sent_2'].tolist())
data = np.add(transformed_sent1.toarray(),transformed_sent2.toarray())
X = pd.DataFrame(data,columns=CountVec.get_feature_names_out())

predictions = clf.predict(X)

In [13]:
actual = dev_data['Class'].tolist()

tp = 0
fp = 0
fn = 0
acc = 0
for pair in zip(predictions, actual):
#     print(pair[0], pair[1])
    if pair[0] == 'P' and pair[0] == pair[1]:
        tp += 1
    elif pair[0] == 'NP' and pair[0] != pair[1]:
        fn += 1
    elif pair[0] == 'P' and pair[0] != pair[1]:
        fp += 1
        
    if pair[0] == pair[1]:
        acc += 1
    
precision = tp/(tp+fp)
recall = tp/(tp+fn)
acc = acc / len(actual)

print('Precision: ', precision)
print('Recall: ', recall)
print('Accuracy: ', acc)

Precision:  0.518348623853211
Recall:  0.07687074829931972
Accuracy:  0.6907129257457161


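As we can see, all three evaluation metrics improved over the previous KNN model, with precision improving the most (0.32 to 0.52). However, all three are still below the baseline (precision 0.70, recall 0.30, accuracy 0.74).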
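
To put the three runs side by side, the Dev set scores printed above can be compared directly. In the sketch below the scores are copied from the notebook outputs and rounded to four decimals. The model labels in the dictionary are names chosen here for illustration.

```python
# Dev-set scores copied from the outputs above, rounded to four decimals
results = {
    "baseline":         {"precision": 0.6976, "recall": 0.2966, "accuracy": 0.7413},
    "knn":              {"precision": 0.3175, "recall": 0.0544, "accuracy": 0.6696},
    "knn + stop words": {"precision": 0.5183, "recall": 0.0769, "accuracy": 0.6907},
}

for name, scores in results.items():
    # Absolute difference to the baseline for each metric
    deltas = {m: round(v - results["baseline"][m], 4) for m, v in scores.items()}
    print(f"{name:<18}", deltas)
```

A negative difference means the model scores below the baseline on that metric.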