# Assignment 4: Natural Language Processing

## Q1: Extract data using regular expression (2 points)
Suppose you have scraped the text shown below from an online source. 
Define an **extract** function which:
- takes a piece of text (in the format shown below) as an input
- extracts data into a list of tuples using a regular expression, e.g.  [('Grant Cornwell', 'College of Wooster', '2015', '911,651'), ...]
- returns the list of tuples
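
As a small illustration of the mechanism, `re.findall` returns a list of tuples whenever the pattern contains more than one capture group. The sketch below uses a made-up sample text and a made-up pattern (the names, the "joined in" wording, and the amounts are hypothetical and only serve the example); it is not the pattern required for the text in this question.

```python
import re

# Hypothetical sample text, invented for illustration only.
sample = """Ada Lovelace, Analytical Society (joined in 1833): $1,200
Alan Turing, King's College (joined in 1931): $3,450"""

# Four capture groups: name, organization, year, amount.
# [^(]+? lazily matches the organization up to the space before "(".
pattern = r"(\w+ \w+), ([^(]+?) \(joined in (\d{4})\): \$([\d,]+)"

# With several groups, findall yields one tuple per match.
rows = re.findall(pattern, sample)
for name, org, year, amount in rows:
    print(name, "|", org, "|", year, "|", amount)
```

Each tuple holds the groups in the order they appear in the pattern, so the layout of the output is controlled entirely by where the parentheses are placed.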

In [None]:
text='''Following is total compensation for other presidents at private colleges in Ohio in 2015:

Grant Cornwell, College of Wooster (left in 2015): $911,651
Marvin Krislov, Oberlin College (left in 2016):  $829,913
Mark Roosevelt, Antioch College, (left in 2015): $507,672
Laurie Joyner, Wittenberg University (left in 2015): $463,504
Richard Giese, University of Mount Union (left in 2015): $453,800'''


## Q2: Find duplicate questions by similarity (8 points)

A data file 'quora_duplicate_question_500.csv' has been provided as shown below. Each sample in this dataset has two questions, denoted as $(q_1, q_2)$ (i.e. the "q1" and "q2" columns). A value of 1 in the "is_duplicate" column indicates that the two questions are indeed duplicates; otherwise, they are not duplicates, although they may look similar. This dataset has 500 question pairs in total.

In [311]:
import pandas as pd
data=pd.read_csv("../../dataset/quora_duplicate_question_500.csv",header=0)
data.head(3)

| Unnamed: 0 | q1 | q2 | is_duplicate |
|---|---|---|---|
| 0 | How do you take a screenshot on a Mac laptop? | How do I take a screenshot on my MacBook Pro? ... | 1.0 |
| 1 | Is the US election rigged? | Was the US election rigged? | 1.0 |
| 2 | How scary is it to drive on the road to Hana g... | Do I need a four-wheel-drive car to drive all ... | 0.0 |


**Q2.1.** Define a function **"tokenize"**  as follows (3 points):
   - takes three parameters: 
       - *text*: input string.
       - *lemmatized*: an optional boolean parameter to indicate if tokens are lemmatized. The default value is False.
       - *no_stopword*: an optional boolean parameter to remove stop words. The default value is False. 
   - splits the input text into unigrams and also cleans up tokens as follows:
       - if *lemmatized* is turned on, lemmatize all unigrams. (1 point)
       - if *no_stopword* is turned on, remove all stop words. (1 point)
   - returns the list of unigrams after all the processing. (Hint: you can use the spaCy package for this task. For reference, check https://spacy.io/api/token#attributes) (1 point for all the others)

**Q2.2.** Define a function **get_similarity** as follows: (3 points)

   - takes the following inputs: two lists of strings (i.e. list of q1 and list of q2), and boolean parameters *lemmatized* and *no_stopword* as defined in (Q2.1).
   - tokenizes each question from both lists using the "**tokenize**" function defined in (Q2.1). (1 point)
   - generates a **tf_idf matrix** from the tokens obtained from the questions in both lists (hint: refer to the tf_idf function defined in Section 8.5 in the lecture notes. You need to concatenate q1 and q2) (1 point)
   - calculates the **cosine similarity** of the question pair $(q_1, q_2)$ in each sample using the tf_idf matrix (1 point)
   - returns similarity scores for the 500 question pairs
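
The steps above can be sketched on a toy corpus as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the questions are already tokenized by hand, `tfidf_matrix` is a hypothetical helper name, every document is assumed to be non-empty, and the smoothed idf $\log\frac{N+1}{df+1}+1$ is one common variant that may differ from the function in the lecture notes.

```python
import numpy as np

def tfidf_matrix(token_lists):
    """Return an L2-normalized TF-IDF matrix for pre-tokenized, non-empty documents."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    col = {tok: j for j, tok in enumerate(vocab)}
    counts = np.zeros((len(token_lists), len(vocab)))
    for i, toks in enumerate(token_lists):
        for tok in toks:
            counts[i, col[tok]] += 1
    tf = counts / counts.sum(axis=1, keepdims=True)       # term frequency per document
    df = (counts > 0).sum(axis=0)                         # document frequency per term
    idf = np.log((len(token_lists) + 1) / (df + 1)) + 1   # smoothed idf (assumed variant)
    weighted = tf * idf
    return weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Hand-made token lists standing in for tokenized q1 and q2 columns.
q1_tokens = [["is", "the", "election", "rigged"], ["how", "to", "bake", "bread"]]
q2_tokens = [["was", "the", "election", "rigged"], ["why", "is", "sky", "blue"]]

# Concatenate so that both lists share one vocabulary and one idf vector.
matrix = tfidf_matrix(q1_tokens + q2_tokens)
n = len(q1_tokens)

# Row i of the first half pairs with row i of the second half.
sim = np.sum(matrix[:n] * matrix[n:], axis=1)
print(sim)
```

Concatenating before building the matrix matters: if q1 and q2 were vectorized separately, their columns would not refer to the same terms and the row-wise product would be meaningless.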

**Q2.3.** Define a function **predict** as follows: (2 points)
   - takes three inputs, i.e. a list of similarity scores, the "is_duplicate" column, and a *threshold* with a default value of 0.5
   - if a similarity > *threshold*, then predicts the question pair is duplicate (1 point)
   - calculates the percentage of duplicate question pairs that are successfully identified, i.e. $$\frac{count~(prediction = 1~\&~ is\_duplicate =1)}{count~(is\_duplicate =1)}$$ (1 point)
   - returns the predicted values and the percentage
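
A worked instance of the formula above, using four made-up similarity scores and labels (the numbers are hypothetical and not taken from the dataset):

```python
import numpy as np

sim = np.array([0.9, 0.4, 0.7, 0.2])    # made-up similarity scores
is_duplicate = np.array([1, 1, 0, 0])   # made-up ground truth
threshold = 0.5

# A pair is predicted as duplicate when its similarity exceeds the threshold.
prediction = (sim > threshold).astype(int)

# Numerator: true duplicates that were also predicted as duplicates.
identified = np.sum((prediction == 1) & (is_duplicate == 1))

# Denominator: all true duplicates.
percentage = identified / np.sum(is_duplicate == 1)
print(prediction, percentage)
```

Here two pairs are true duplicates and only the first one scores above the threshold, so half of the duplicates are identified.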

**Q2.4. Test**: 
 - Test your solution using different options in the tokenize function, i.e. with or without lemmatization, with or without removing stop words, to see how these options may affect the accuracy. 
 - Analyze why a given option works best (or worst). Write your analysis in a pdf file.

## Q3 (Bonus): More analysis (3 points)

**Q3.1**. Define a function "**evaluate**" as follows: (1 point)

   - takes three inputs, i.e. a list of similarity scores, the "is_duplicate" column, and a *threshold* with a default value of 0.5
   - if a similarity > *threshold*, then predicts the question pair is duplicate, i.e. prediction = 1
   - calculates two metrics:
     - *recall*: the percentage of duplicate question pairs that are correctly identified, i.e. $$\frac{count~(prediction = 1~\&~ is\_duplicate =1)}{count~(is\_duplicate =1)}$$
     - *precision*: the percentage of question pairs identified as duplicates that are indeed duplicates, i.e. $$\frac{count~(prediction = 1~\&~ is\_duplicate =1)}{count~(prediction =1)}$$
   - returns the precision and recall
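
One possible way to compute both metrics directly from counts is sketched below. `precision_recall` is a hypothetical name, the scores are made up, and returning `nan` when a denominator is zero is a design choice of this sketch, not something the assignment prescribes.

```python
import numpy as np

def precision_recall(sim, truth, threshold=0.5):
    """Hypothetical helper: precision and recall of thresholded similarity scores."""
    sim = np.asarray(sim, dtype=float)
    truth = np.asarray(truth).astype(int)
    pred = (sim > threshold).astype(int)

    tp = int(np.sum((pred == 1) & (truth == 1)))   # predicted 1 and truly 1
    n_pred = int(np.sum(pred == 1))                # count(prediction = 1)
    n_true = int(np.sum(truth == 1))               # count(is_duplicate = 1)

    # Guard against empty denominators, e.g. a threshold so high that nothing is predicted.
    precision = tp / n_pred if n_pred else float("nan")
    recall = tp / n_true if n_true else float("nan")
    return precision, recall

print(precision_recall([0.9, 0.4, 0.7, 0.2], [1, 1, 0, 0]))
print(precision_recall([0.9, 0.4, 0.7, 0.2], [1, 1, 0, 0], threshold=0.95))
```

The guard is relevant for high thresholds: when no pair is predicted as duplicate, precision has no defined value, while recall is simply 0.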

**Q3.2**. Analyze the following questions 
   - If you change the similarity threshold from 0.1 to 0.9, how do precision and recall change? (0.5 point)
     - <font color="blue"> A: Ideally, precision and recall curves should be plotted against threshold values. </font>
   - Considering both precision and recall, which options (i.e. lemmatization, removing stop words, similarity threshold) do you think give the best performance? (0.5 point)
     - <font color="blue">A: Lemmatization in general helps, but removing stop words does not seem helpful. Perhaps the list of stop words needs to be customized.</font>
     - <font color="blue">A: Model performance really depends on the actual need for precision or recall. The similarity threshold gives flexibility in controlling which of the two is favored.</font>
   - What kind of duplicates can be easily found? What kind can be difficult to find? (0.5 point)
     - <font color="blue">A: If two questions share the same set of keywords, then TF-IDF is effective. If they have to be matched based on semantics (i.e. meaning of words or phrases) but do not share the same set of keywords, then TF-IDF is not effective.</font>
   - Do you think the TF-IDF approach is successful in finding duplicate questions? (0.5 point)
     - <font color="blue">A: It is considered moderately effective.</font>

These are open questions. Just show your analysis with necessary support from the dataset, and save your analysis in a pdf file.
   - <font color="blue">The answers provided above are just for reference. Any reasonable analysis should be fine.</font>
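
To make the threshold question concrete, the sketch below sweeps thresholds from 0.1 to 0.9 over eight made-up scores and labels (hypothetical values, not results from the dataset) and prints precision and recall at each step. It uses NumPy broadcasting to evaluate all thresholds at once.

```python
import numpy as np

# Made-up scores and labels, for illustration only.
sim = np.array([0.95, 0.85, 0.65, 0.55, 0.45, 0.35, 0.15, 0.05])
truth = np.array([1, 1, 0, 1, 1, 0, 0, 0])
thresholds = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

# Broadcasting: row k holds the predictions made with thresholds[k].
pred = sim[None, :] > thresholds[:, None]

tp = (pred & (truth == 1)).sum(axis=1)   # true positives per threshold
n_pred = pred.sum(axis=1)                # predicted positives per threshold

recall = tp / (truth == 1).sum()
# Precision is undefined (nan) when nothing is predicted as duplicate.
precision = np.where(n_pred > 0, tp / np.maximum(n_pred, 1), np.nan)

for t, p, r in zip(thresholds, precision, recall):
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold can only remove predicted duplicates, so recall never increases along the sweep; precision has no such guarantee and has to be read off the data.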

In [1]:
import pandas as pd
import nltk
from sklearn.metrics import pairwise_distances
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import normalize
import re
import spacy

In [2]:
def extract(text):
    return re.findall(r'(\w+ \w+), ([\w ]+)[, ]+\(left in (\d{4})\): +\$([\d,]+)',text)

In [3]:
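# the 'en' shortcut is only available in spaCy v2; spaCy v3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')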
    
def tokenize(doc, lemmatized=False, no_stopword=False):
    
    tokens =[]
    
    d = nlp(doc)  
    for token in d:
        t = token.text
        if lemmatized:
            t = token.lemma_
            
        if no_stopword:
            if not token.is_stop:
                tokens.append(t)
        else:
            tokens.append(t) 
            
    return tokens

def tfidf(docs, lemmatized=False, no_stopword=False):
    
    docs_tokens={idx:nltk.FreqDist(tokenize(doc, lemmatized, no_stopword)) \
             for idx,doc in enumerate(docs)}

    dtm=pd.DataFrame.from_dict(docs_tokens, orient="index" )
    dtm=dtm.fillna(0)
    
    tf=dtm.values
    doc_len=tf.sum(axis=1)
    tf=np.divide(tf, doc_len[:,None])
    
    df=np.sum(np.where(tf>0,1,0), axis=0)
    
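    # smoothed idf: add 1 to numerator and denominator to avoid division by zero
    smoothed_idf=np.log(np.divide(len(docs)+1, df+1))+1

    # L2-normalize each row so that the dot product of two rows is their cosine similarity
    smoothed_tf_idf=normalize(tf*smoothed_idf)

    return smoothed_tf_idf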

In [25]:
def get_similarity(q1, q2, lemmatized=False, no_stopword=False):
    
    all_q = q1 + q2
    
    tf_idf = tfidf(all_q, lemmatized, no_stopword)
    print(tf_idf)
    print("aaaaa", tf_idf[0:len(q1)].shape )
    print(tf_idf[len(q1):].shape)
    # rows are L2-normalized, so the row-wise dot product is the cosine similarity of each (q1, q2) pair
    sim = np.sum(tf_idf[0:len(q1)]*tf_idf[len(q1):], axis =1)
    
    return sim
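
The row-wise product used in `get_similarity` relies on the rows having unit length. The following check, on random vectors generated only for this illustration, compares the full cosine formula with the dot product of L2-normalized rows:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((3, 5))   # stand-in for the q1 rows
b = rng.random((3, 5))   # stand-in for the q2 rows

# Full cosine formula: dot product divided by the product of the norms.
cos_full = np.sum(a * b, axis=1) / (
    np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
)

# Normalize each row first, then take the plain row-wise dot product.
a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
cos_dot = np.sum(a_unit * b_unit, axis=1)

print(np.allclose(cos_full, cos_dot))
```

Because the matrix is normalized once up front, the similarity of 500 pairs costs a single element-wise multiplication and a row sum instead of a full pairwise distance matrix.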

In [26]:
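def evaluate(sim, ground_truth, threshold=0.5):
    
    # predict 1 (duplicate) when the similarity exceeds the threshold
    pred = (sim>threshold).astype(int) 
    # confusion matrix: rows are predictions, columns are ground truth
    conf = pd.crosstab(index=pred, columns=ground_truth).values
    # precision = TP / predicted positives; recall = TP / actual positives
    prec = conf[1,1]/sum(conf[1])
    recall = conf[1,1]/sum(conf[:,1])
    return prec, recall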

In [27]:
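def predict(sim, ground_truth, threshold=0.5):
    
    # predict 1 (duplicate) when the similarity exceeds the threshold
    pred = (sim>threshold).astype(int) 
    
    # share of true duplicates that are also predicted as duplicates
    per = sum((pred==ground_truth) & (pred==1))/sum(ground_truth)
    
    return per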

In [28]:
if __name__ == "__main__":  
    
    # Test Q1
    
    text='''Following is total compensation for other presidents at private colleges in Ohio in 2015:

Grant Cornwell, College of Wooster (left in 2015): $911,651
Marvin Krislov, Oberlin College (left in 2016):  $829,913
Mark Roosevelt, Antioch College, (left in 2015): $507,672
Laurie Joyner, Wittenberg University (left in 2015): $463,504
Richard Giese, University of Mount Union (left in 2015): $453,800'''

    print("Test Q1")
    print(extract(text))

    data=pd.read_csv("/Users/haodong/Desktop/quora_duplicate_question_500.csv",header=0)
    result=[]
    # Test Q2
    print("\nTest Q2 & Q3")
    q1 = data["q1"].values.tolist()
    q2 = data["q2"].values.tolist()
    
    print("\nlemmatized: No, no_stopword: No")
    sim = get_similarity(q1,q2)
    print(predict(sim, data["is_duplicate"].values))
    prec, rec = evaluate(sim, data["is_duplicate"].values)
    
    print("\nlemmatized: Yes, no_stopword: No")
    sim = get_similarity(q1,q2, True)
    print(predict(sim, data["is_duplicate"].values))
    for s in [0.1,0.2,0.3, 0.4, 0.5,0.6,0.7,0.8,0.9]:
        prec, rec = evaluate(sim, data["is_duplicate"].values, s)
        result.append((s, prec, rec))
    
    print("\nlemmatized: No, no_stopword: Yes")
    sim = get_similarity(q1,q2, False, True)
    print(predict(sim, data["is_duplicate"].values))
    prec, rec = evaluate(sim, data["is_duplicate"].values)
    
    print("\nlemmatized: Yes, no_stopword: Yes")
    sim = get_similarity(q1,q2, True, True)
    print(predict(sim, data["is_duplicate"].values))
    prec, rec = evaluate(sim, data["is_duplicate"].values)
    
    # show relationship between precision and recall
    df = pd.DataFrame(result, columns=["sim","prec","rec"])
    df = df.set_index("sim")
    df.plot()

Test Q1
[('Grant Cornwell', 'College of Wooster', '2015', '911,651'), ('Marvin Krislov', 'Oberlin College', '2016', '829,913'), ('Mark Roosevelt', 'Antioch College', '2015', '507,672'), ('Laurie Joyner', 'Wittenberg University', '2015', '463,504'), ('Richard Giese', 'University of Mount Union', '2015', '453,800')]

Test Q2 & Q3

lemmatized: No, no_stopword: No
[[0.1576264  0.17571945 0.22067266 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.1224191  0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.55438578 0.         0.        ]
 [0.11736007 0.         0.         ... 0.         0.35114818 0.35114818]
 [0.16807402 0.18736629 0.         ... 0.         0.         0.        ]]
aaaaa (500, 2722)
(500, 2722)
0.6304347826086957

lemmatized: Yes, no_stopword: No
[[0.15748588 0.1509764  0.11821796 ... 0.         0.         0.        ]
 [0.         0.         0.         ..

KeyboardInterrupt: 