**Paraphrase Detection**: The task of comparing two pieces of text and deciding whether they have the same meaning. We assume that the sentences are in English.

**Research Questions**
* How can we use NLP to check whether two different pieces of text have the same meaning?
* What are possible applications of this task?
* Where is paraphrase detection already being used in real life?

**Datasets**
Each example is a pair of sentences with a label that is either true (the pair is a paraphrase) or false (it is not).
* Microsoft Research Paraphrase Corpus:
    * This contains roughly 5,800 pairs of sentences extracted from news sources on the web.
    * This dataset has been human-annotated.
        * Annotators judged whether each pair captures a paraphrase/semantic equivalence relationship.
* TwitterPPDB corpus:
    * This consists of 51,524 sentence-level paraphrase pairs collected from Twitter by linking tweets through shared URLs.
    * This corpus is human-annotated.
    * It can grow by 30,000 new sentential paraphrases per month at ~70% precision.


In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from nltk.tokenize import WordPunctTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords


In [2]:
# Load the MSR Paraphrase training split (tab-separated).
train_data = pd.read_csv(
    'data/msr_paraphrase_corpus/msr_paraphrase_train.txt',
    sep="\t", header=None, names=['class', 'id1', 'id2', 'text1', 'text2'],
)
train_data = train_data.drop([0])  # row 0 is the file's own header line
# Join both sentences into one string; each pair is treated as a single text.
train_data['text'] = train_data['text1'] + ' ' + train_data['text2']
train_data = train_data.drop(['id1', 'id2'], axis=1)
train_data = train_data.dropna()  # discard rows with missing values
train_data['class'] = train_data['class'].astype(int)


In [3]:
# Load the MSR Paraphrase test split the same way as the training split.
test_data = pd.read_csv(
    'data/msr_paraphrase_corpus/msr_paraphrase_test.txt',
    sep="\t", header=None, names=['class', 'id1', 'id2', 'text1', 'text2'],
)
test_data = test_data.drop([0])  # row 0 is the file's own header line
test_data['text'] = test_data['text1'] + ' ' + test_data['text2']
test_data = test_data.drop(['id1', 'id2'], axis=1)
test_data = test_data.dropna()  # discard rows with missing values
test_data['class'] = test_data['class'].astype(int)

# Stack train and test so one vectorizer covers both; training rows come first.
final_data = pd.concat([train_data, test_data])

In [4]:
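def tok_helper(word):
    # Strip periods and commas from a token.
    return word.replace(".", "").replace(",", "")

# Build the tokenizer pieces once instead of on every call.
wpt = WordPunctTokenizer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def lemma_tokenizer(text):
    # Tokenize, drop English stopwords, then lemmatize what remains.
    return [lemmatizer.lemmatize(tok_helper(w)) for w in wpt.tokenize(text) if w not in stop_words]

# The vocabulary is fitted on train and test together.
vectorizer = CountVectorizer(tokenizer=lemma_tokenizer)
final_vector = vectorizer.fit_transform(final_data['text'])

# The first len(train_data) rows (3941) of final_vector are the training pairs.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(final_vector[:len(train_data)], train_data['class'].values)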

LogisticRegression(max_iter=1000)
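
Because `text1` and `text2` are joined into one string before vectorizing, the model above sees which words occur in a pair but not whether they occur in both sentences. One possible direction for the planned rework is to add features that compare the two sentences directly. The following is only a minimal sketch under that assumption: `tokens` and `pair_features` are hypothetical helper names, the regular expression is a simple stand-in for the NLTK tokenizer used above, and the two example sentences are made up rather than taken from the corpus.

```python
import re

def tokens(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def pair_features(text1, text2):
    """Return simple overlap features for one sentence pair."""
    a, b = tokens(text1), tokens(text2)
    shared = a & b
    union = a | b
    # Jaccard similarity: shared vocabulary divided by combined vocabulary.
    jaccard = len(shared) / len(union) if union else 0.0
    return {
        "jaccard": jaccard,
        "shared": len(shared),
        "length_diff": abs(len(a) - len(b)),
    }

# Made-up example pair, for illustration only.
features = pair_features(
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
)
print(features)
```

Features like these could be computed per row from `text1` and `text2` and placed alongside the bag-of-words counts; whether they improve accuracy on this corpus would need to be measured.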

In [5]:
np.shape(final_vector)

(5582, 14348)

In [6]:
# test_vector = vectorizer.transform(test_data['text'])
# final_vector[3941:]

In [7]:
# Rows after the training rows are the test pairs.
preds = classifier.predict(final_vector[len(train_data):])

In [8]:
classifier.score(final_vector[len(train_data):], test_data['class'])

0.6489945155393053

In [9]:
from sklearn import metrics

report = metrics.classification_report(test_data['class'], preds, target_names=['0','1'])

In [10]:
print(report)

              precision    recall  f1-score   support

           0       0.47      0.39      0.43       549
           1       0.72      0.78      0.75      1092

    accuracy                           0.65      1641
   macro avg       0.59      0.59      0.59      1641
weighted avg       0.64      0.65      0.64      1641
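
As a sanity check on these figures, the support column of the report gives the class balance of the test split, and from that the accuracy of a trivial classifier that always predicts the more frequent label. The sketch below only redoes that arithmetic with numbers copied from the report and from the `classifier.score` output; the variable names are illustrative.

```python
# Support counts copied from the classification report.
support = {0: 549, 1: 1092}
# Accuracy copied from the classifier.score output.
reported_accuracy = 0.6489945155393053

total = sum(support.values())
majority_label = max(support, key=support.get)
# Accuracy obtained by predicting the majority label for every pair.
majority_baseline = support[majority_label] / total

print(f"test pairs: {total}")
print(f"majority label: {majority_label}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
print(f"reported model accuracy:          {reported_accuracy:.4f}")
print(f"model beats baseline: {reported_accuracy > majority_baseline}")
```

With these counts, always predicting label 1 would score about 0.665, slightly above the 0.649 the current model reaches, so this baseline is a useful reference point when reworking the model.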

**Timeline**

* Add more datasets if found - by Nov 29th
* Rework NLP Model to increase accuracy - by Dec 8th
* Build slides (up to the initial creation of the model) - by Dec 10th
* Report work to be completed - by Dec 13th
* Check, fix, extra finishing touches - by Dec 15th
* Buffer day + submit - Dec 16th

**Work Allocation**

* Tanishk Jain: Research on NLP models to improve accuracy, start on slides
* Ameya Jain: Research on NLP models to improve accuracy, start on report
* Shubh Vashisht: Extend the current NLP model, improve accuracy with feature manipulation, start on slides and report
* Abhi Chalasani: Extend the current NLP model, find datasets if possible, start on slides and report

All deadlines are the same for everyone. Because we are all building on each other's work, the only fixed deadlines are the ones listed above, which mark when to move on to the next portion (report + slides).