**Paraphrase Detection**: The task of comparing two pieces of text and deciding whether they have the same or a similar meaning. We assume that all sentences are in English.

**Research Questions**

* How can we use NLP to check whether two pieces of text have the same meaning?
* What are the applications of this task?
* Where is paraphrase detection already being used in real life?

A prominent everyday example of paraphrase detection is a plagiarism detector such as Turnitin, which many schools use to determine whether a student's assignment is original or plagiarised. In line with the definition of paraphrase detection, a plagiarism detector like Turnitin takes a submitted assignment and checks whether it contains wording similar to that of other published works. Paraphrase detection could also help find synonyms, and ideally antonyms, for certain words in a text, similar to what the app Grammarly offers.

**Datasets**
Each example consists of two sentences and a label. Labels are either true or false, indicating whether the pair is a paraphrase.
* Microsoft Research Paraphrase Corpus:
    * Contains about 5,800 sentence pairs extracted from news sources on the web.
    * Human-annotated: each pair is labelled according to whether it captures a paraphrase/semantic equivalence relationship.
* TwitterPPDB corpus:
    * Consists of 51,524 sentence-level paraphrase pairs collected from Twitter by linking tweets through shared URLs.
    * Human-annotated.
    * Can grow by about 30,000 new sentential paraphrases per month with ~70% precision.


In [8]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from nltk.tokenize import WordPunctTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords


In [9]:
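# Load the MSRP training split. The file is tab-separated and its first line is a
# header, which header=0 skips in favour of the column names given below.
# Note: pandas' default quote handling can merge lines that contain stray quotation
# marks; passing quoting=csv.QUOTE_NONE is a common workaround, but it would change
# the row counts and the results reported below.
train_data = pd.read_csv('data/msr_paraphrase_corpus/msr_paraphrase_train.txt', sep='\t',
                         header=0, names=['class', 'id1', 'id2', 'text1', 'text2'])
# Concatenate both sentences into one string for the bag-of-words model.
train_data['text'] = train_data['text1'] + ' ' + train_data['text2']
train_data = train_data.drop(columns=['id1', 'id2'])
train_data = train_data.dropna()
train_data['class'] = train_data['class'].astype(int)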


In [10]:
# Load the MSRP test split with the same preprocessing as the training split.
test_data = pd.read_csv('data/msr_paraphrase_corpus/msr_paraphrase_test.txt', sep='\t',
                        header=0, names=['class', 'id1', 'id2', 'text1', 'text2'])
test_data['text'] = test_data['text1'] + ' ' + test_data['text2']
test_data = test_data.drop(columns=['id1', 'id2'])
test_data = test_data.dropna()
test_data['class'] = test_data['class'].astype(int)

# Training rows first, then test rows; later cells rely on this order.
final_data = pd.concat([train_data, test_data])


In [11]:
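# Build the stopword set once instead of re-reading the list for every token.
STOPWORDS = set(stopwords.words('english'))

def tok_helper(word):
    """Lowercase a token and strip periods and commas."""
    return word.replace(".", "").replace(",", "").lower()

def lemma_tokenizer(text):
    """Tokenize text, drop stopwords, and lemmatize the remaining tokens."""
    wpt = WordPunctTokenizer()
    lemmatizer = WordNetLemmatizer()
    # The stopword check runs before lowercasing, so on raw text capitalized
    # stopwords (e.g. "The") are kept.
    return [lemmatizer.lemmatize(tok_helper(w)) for w in wpt.tokenize(text) if w not in STOPWORDS]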


# classifier.fit

In [12]:
# Word overlap: Jaccard similarity of the two lemmatized token sets
def predict_overlap(t1, t2):
    t1_tok = set(lemma_tokenizer(t1))
    t2_tok = set(lemma_tokenizer(t2))
    union = t1_tok | t2_tok
    if not union:
        return False  # neither sentence produced any tokens
    # Predict "paraphrase" when more than half of the combined tokens are shared.
    return len(t1_tok & t2_tok) / len(union) > 0.5

over_pred = [predict_overlap(t1, t2) for t1, t2 in zip(test_data['text1'], test_data['text2'])]

# Accuracy: fraction of predictions that match the gold labels
acc = np.mean(np.array(over_pred) == test_data['class'].values)
print(acc)

0.6630103595368677
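
The word-overlap rule above is a thresholded Jaccard similarity. The following is a minimal, dependency-free sketch of that idea, assuming plain whitespace tokenization in place of `lemma_tokenizer`; the helper names `jaccard` and `is_paraphrase` and the two example sentences are illustrative and not part of the notebook.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity |A & B| / |A | B| between two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    if not union:
        return 0.0  # both inputs empty: treat as no overlap
    return len(a & b) / len(union)


def is_paraphrase(text_a, text_b, threshold=0.5):
    """Flag a pair as a paraphrase when its Jaccard score exceeds the threshold."""
    # Assumption: whitespace tokenization stands in for the lemmatizing tokenizer.
    return jaccard(text_a.lower().split(), text_b.lower().split()) > threshold


# Hypothetical example pair (not taken from the dataset)
s1 = "the cat sat on the mat"
s2 = "the cat sat on a mat"
score = jaccard(s1.split(), s2.split())
print(score, is_paraphrase(s1, s2))
```

Here the two sentences share 5 of 6 distinct tokens, so the score is 5/6 and the pair is flagged as a paraphrase at the 0.5 threshold.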


In [13]:
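# LogReg: bag-of-words logistic regression on the concatenated sentence pair
vectorizer = CountVectorizer(tokenizer=lemma_tokenizer)
# The vocabulary is built from train and test text together;
# only the training rows are used to fit the classifier.
final_vector = vectorizer.fit_transform(final_data['text'])

n_train = len(train_data)  # training rows come first in final_data (3941 here)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(final_vector[:n_train], train_data['class'].values)

classifier.score(final_vector[n_train:], test_data['class'])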

0.6489945155393053

In [14]:
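# N-gram overlap: Jaccard similarity over unigrams, bigrams, and trigrams
n_gram_count = CountVectorizer(tokenizer=lemma_tokenizer, ngram_range=(1,3))
# build_analyzer() returns the preprocessing + tokenizing + n-gram function,
# so the vectorizer does not have to be refit for every sentence.
n_gram_analyzer = n_gram_count.build_analyzer()

def predict_overlap_n(t1, t2):
    t1_tok = set(n_gram_analyzer(t1))
    t2_tok = set(n_gram_analyzer(t2))
    union = t1_tok | t2_tok
    if not union:
        return False  # neither sentence produced any n-grams
    return len(t1_tok & t2_tok) / len(union) > 0.5

over_pred = [predict_overlap_n(t1, t2) for t1, t2 in zip(test_data['text1'], test_data['text2'])]

# Accuracy: fraction of predictions that match the gold labels
acc = np.mean(np.array(over_pred) == test_data['class'].values)
print(acc)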



0.4673979280926264
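
One possible reason for the lower accuracy is that adding bigrams and trigrams lowers the Jaccard score of a pair, so fewer pairs clear the same 0.5 threshold. The sketch below illustrates the effect on a single made-up pair, assuming whitespace tokenization; the helpers `ngrams` and `jaccard` are hypothetical and not the notebook's functions.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return the set of all n-grams (as tuples) with n_min <= n <= n_max."""
    return {
        tuple(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical example pair (not taken from the dataset)
t1 = "the cat sat on the mat".split()
t2 = "the cat sat on a mat".split()

unigram_score = jaccard(ngrams(t1, 1, 1), ngrams(t2, 1, 1))
trigram_score = jaccard(ngrams(t1, 1, 3), ngrams(t2, 1, 3))
print(unigram_score, trigram_score)
```

For this pair the score falls from 5/6 with unigrams only to 10/19 with n-grams up to length 3, because a single differing word breaks every bigram and trigram that contains it.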


In [17]:
from sklearn import metrics
preds = classifier.predict(final_vector[len(train_data):])  # test rows follow the training rows
report = metrics.classification_report(test_data['class'], preds, target_names=['0','1'])
print(report)

              precision    recall  f1-score   support

           0       0.47      0.39      0.43       549
           1       0.72      0.78      0.75      1092

    accuracy                           0.65      1641
   macro avg       0.59      0.59      0.59      1641
weighted avg       0.64      0.65      0.64      1641
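
As a sanity check on the report, the averages and a majority-class baseline can be recomputed from the per-class numbers. This is a small sketch that uses only the support and F1 values printed above; the variable names are illustrative.

```python
# Per-class values copied from the classification report above
support = {0: 549, 1: 1092}
f1 = {0: 0.43, 1: 0.75}

total = sum(support.values())

# Accuracy of a trivial classifier that always predicts the most frequent class
majority_baseline = max(support.values()) / total

# Macro average: unweighted mean over classes
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes, weighted by support
weighted_f1 = sum(f1[c] * support[c] for c in support) / total

print(f"majority baseline accuracy: {majority_baseline:.3f}")
print(f"macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}")
```

Always predicting class 1 gives an accuracy of about 0.665, which is slightly above both the word-overlap score (0.663) and the logistic regression score (0.649) reported above, so the current models do not yet beat this trivial baseline.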



In [16]:
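# classification_report returns a string by default; output_dict=True returns
# a dict that pandas can tabulate.
report_dict = metrics.classification_report(test_data['class'], preds, target_names=['0','1'], output_dict=True)
pd.DataFrame(report_dict).transpose()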

**Timeline**

* Nov 29th - Add more datasets if found
* Dec 8th  - Rework NLP model to increase accuracy
* Dec 10th - Build slides (up to the initial creation of the model)
* Dec 13th - Complete the report
* Dec 15th - Check, fix, extra finishing touches
* Dec 16th - Buffer day + submit

**Work Allocation**

* Tanishk Jain   : Research on NLP models to improve accuracy, start on slides
* Ameya Jain     : Research on NLP models to improve accuracy, start on report
* Shubh Vashisht : Extend the current NLP model, improve accuracy with feature manipulation, start on slides and report
* Abhi Chalasani : Extend the current NLP model, find datasets if possible, start on slides and report

All of the deadlines listed above apply to everyone. Because we are all building on each other's work, these are the only foundational deadlines: each one must be met before the next portion (report + slides) can begin.