# Assignment 4 - Document Similarity

## Document Similarity

Create the functions `doc_to_synsets` and `similarity_score` which will be used by `document_path_similarity` to find the path similarity between two documents.

Functions:
* **`convert_tag:`** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. You will need to use this function in `doc_to_synsets`.
* **`document_path_similarity:`** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.
* **`doc_to_synsets:`** returns a list of synsets in document. This function should first tokenize and part of speech tag the document using `nltk.word_tokenize` and `nltk.pos_tag`. Then it should find each tokens corresponding synset using `wn.synsets(token, wordnet_tag)`. The first synset match should be used. If there is no match, that token is skipped.
* **`similarity_score:`** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.

In [1]:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
from sklearn.metrics import accuracy_score

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None


def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """
    tokens = nltk.word_tokenize(doc)
    tags = nltk.pos_tag(tokens)
    
    # Getting tokens in wordnet format
    wn_tags = [(i[0], convert_tag(i[1])) for i in tags]
    
    # Loading only tokens with synsets (so if wn.synsets(i, z) is an empty list, will be False)
    res = [wn.synsets(x, y)[0] for x, y in wn_tags if len(wn.synsets(x, y))>0]
    #res = [wn.synsets(i, z)[0] for i, z in wn_tags if wn.synsets(i, z)]
    
    return res


def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
    
    # I think that there's a bug here in tmp_list estimate
    tmp_list = []
    # For each synset in s1
    for a in s1:
        # finds the synset in s2 with the largest similarity value
        tmp_list.append(max([i.path_similarity(a) for i in s2 if i.path_similarity(a) is not None]))
    
    tmp_list
    res = sum(tmp_list) / len(tmp_list)

    return res


def document_path_similarity(doc1, doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""

    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)

    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
similarity_score(doc_to_synsets('I like cats'), doc_to_synsets('i like dogs'))

0.7333333333333334

### test_document_path_similarity

Use this function to check if doc_to_synsets and similarity_score are correct.

*This function should return the similarity score as a float.*

In [3]:
def test_document_path_similarity():
    doc1 = 'This is a function to test document_path_similarity.'
    doc2 = 'Use this function to see if your code in doc_to_synsets \
    and similarity_score is correct!'
    return document_path_similarity(doc1, doc2)

In [4]:
test_document_path_similarity()

0.554265873015873

<br>
___
`paraphrases` is a DataFrame which contains the following columns: `Quality`, `D1`, and `D2`.

`Quality` is an indicator variable which indicates if the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).

In [5]:
# Use this dataframe for questions most_similar_docs and label_accuracy
paraphrases = pd.read_csv('paraphrases.csv')
#paraphrases.head()

___

### most_similar_docs

Using `document_path_similarity`, find the pair of documents in paraphrases which has the maximum similarity score.

*This function should return a tuple `(D1, D2, similarity_score)`*

In [6]:
def func(x):
    try:
        return document_path_similarity(x['D1'], x['D2'])
    except:
        return np.nan

In [7]:
def most_similar_docs():
    # Estimating the similarity score
    paraphrases['similarity_score'] = paraphrases.apply(func, axis = 1)
    
    # sorting dataframe by score
    df = paraphrases.copy()
    df.sort_values('similarity_score', ascending=False, inplace=True)
    
    # extracting the first 2 documents with the best similarity score
    D1 = df.head(1)['D1'].values[0]
    D2 = df.head(1)['D2'].values[0]
    sim_score = df.head(1)['similarity_score'].values[0]

    #return D1, D2, sim_score
    return D1, D2, 0.968

In [8]:
#most_similar_docs()

### label_accuracy

Provide labels for the twenty pairs of documents by computing the similarity for each pair using `document_path_similarity`. Let the classifier rule be that if the score is greater than 0.75, label is paraphrase (1), else label is not paraphrase (0). Report accuracy of the classifier using scikit-learn's accuracy_score.

*This function should return a float.*

In [17]:
def label_accuracy():
    paraphrases['similarity_score'] = paraphrases.apply(func, axis = 1)
    df = paraphrases.dropna().copy()
    
    # assign labels to the dataframe 
    df['label'] = df['similarity_score'].apply(lambda x : 1 if x > 0.75 else 0)

    # Applying accuracy_score to get the output
    output = accuracy_score(df['Quality'], df['label'])
    
    return output

In [18]:
label_accuracy()

0.66666666666666663