# Assignment 3: Natural Language Processing

## Q1: Extract data using regular expressions (2 points)
Suppose you have scraped the text shown below from an online source.
Define a function `extract(text)` which:
- takes a piece of text (in the format shown below) as input
- extracts data into a list of tuples using a regular expression, e.g. `[('Consumer Price Index', '+0.2%', 'Sep 2020'), ...]`
- returns the list of tuples
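
As an illustration of the extraction idea, here is a minimal sketch that uses named groups and `re.VERBOSE` so each part of the pattern can carry a comment. The two-entry `sample` string and the group names (`name`, `change`, `period`) are assumptions made for this example; it is one possible pattern, not the required one.

```python
import re

# Hypothetical two-entry sample in the same layout as the assignment text.
sample = """Consumer Price Index:
        +0.2% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020"""

# Verbose pattern: whitespace outside character classes is ignored,
# so each piece can be documented on its own line.
pattern = re.compile(
    r"""
    (?P<name>[A-Za-z][A-Za-z ]*):   # indicator name, up to the colon
    \s+                             # line break and indentation
    (?P<change>[+-]\d+\.\d+%)       # signed percentage such as +0.2%
    \s+in\s+                        # the word 'in' between value and period
    (?P<period>[^\n]+)              # rest of the line, such as Sep 2020
    """,
    re.VERBOSE,
)

# One tuple per match, in the order (name, change, period).
rows = [m.group("name", "change", "period") for m in pattern.finditer(sample)]
print(rows)
```

Escaping the decimal point (`\.`) matters here: an unescaped `.` would also accept values such as `+0x2%`.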

In [1]:
import re

In [2]:
text='''Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020

        Producer Price Index:
        +0.4% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020

        Productivity:
        +10.1% in 2nd Qtr of 2020

        Import Price Index:
        +0.3% in Sep 2020

        Export Price Index:
        +0.6% in Sep 2020'''



## Q2: Develop a QA system for COVID-19 (8 points)

A curated COVID-19 Question and Answer (QA) dataset has been provided. You are required to develop a QA system that can search for the best answers to any question related to COVID-19. The first few rows of the dataset are shown below.

In [3]:
import pandas as pd
df=pd.read_csv("03_data/03_covid_qa.csv")
df.head(5)

| Unnamed: 0 | question | answer |
|---|---|---|
| 0 | Can I get COVID-19 from animals when travellin... | Although the current spread and growth of the ... |
| 1 | How can I protect myself and others? | The best way to prevent illness from COVID-19 ... |
| 2 | Where did COVID-19 come from? | It was first found in Wuhan City, Hubei Provin... |
| 3 | Can my pet or other animals get sick from COVI... | There is currently no evidence to suggest that... |
| 4 | How can I protect my child from COVID-19? | By having them practice the same things you ha... |


**Q2.1.** Define a function `tokenize(doc, lemmatized = True, stopword = True, punctuation = True)`  as follows:
   - Take four parameters: 
       - `doc`: an input string (e.g. a question)
       - `lemmatized`: an optional boolean parameter to indicate whether tokens are lemmatized. The default value is True (i.e. tokens are lemmatized).
       - `stopword`: an optional boolean parameter to indicate whether stop words are kept. The default value is True (i.e. keep stop words).
       - `punctuation`: an optional boolean parameter to indicate whether punctuation is kept. The default value is True (i.e. keep all punctuation).
   - Split the input text into unigrams and also clean up tokens as follows:
       - if `lemmatized` is turned on, lemmatize all unigrams.
       - if `stopword` is set to False, remove all stop words.
       - if `punctuation` is set to False, remove all punctuation.
   - Convert all unigrams to lower case and remove empty ones
   - Return the list of unigrams after all the processing. (Hint: you can use spacy package for this task. For reference, check https://spacy.io/api/token#attributes)

**Q2.2.** Define a function `compute_tf_idf(docs, lemmatized = True, stopword = True, punctuation = True)` as follows: 
- Take the following inputs:
    - `docs`: a corpus consisting of a list of strings (e.g. all questions)
    - `lemmatized, stopword, punctuation`:  similar to those defined in Q2.1
- Tokenize each string in `docs` using the function defined in Q2.1. Note, you need to pass the value of `lemmatized, stopword, punctuation` to `tokenize` function.
- Calculate tf_idf weights as shown in lecture notes (Hint: you can reuse the last code segment in NLP Lecture Notes (II))
- Return the following three variables:
    - a smoothed normalized `tf_idf` array, 
    - the list of `words` (i.e. unigrams) in the vocabulary of the corpus, and 
    - inverse document frequency (`idf`) 
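
To make the calculation concrete, here is a minimal pure-Python sketch on a toy corpus. It assumes the smoothed form idf = ln((N + 1) / (df + 1)) + 1 (the form used by the solution code later in this notebook) followed by L2 normalization of each row. The toy `corpus` and the helper name `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus of three documents.
corpus = [["covid", "mask", "mask"], ["covid", "travel"], ["travel", "plane"]]

vocab = sorted({tok for doc in corpus for tok in doc})
n_docs = len(corpus)

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(tok for doc in corpus for tok in set(doc))

# Smoothed idf (assumed form): ln((N + 1) / (df + 1)) + 1
idf = {w: math.log((n_docs + 1) / (doc_freq[w] + 1)) + 1 for w in vocab}

def tfidf_row(tokens):
    """Hypothetical helper: one L2-normalized tf-idf row for a token list."""
    counts = Counter(tokens)
    raw = [counts[w] / len(tokens) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

# One row per document, one column per vocabulary word.
tf_idf = [tfidf_row(doc) for doc in corpus]
```

Words that occur in fewer documents (such as `mask` here) receive a larger idf than words that occur in more documents (such as `covid`).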

**Q2.3.** Define a function `vectorize_doc(doc, words, idf, lemmatized = True, stopword = True, punctuation = True)`  to calculate the tf_idf weights for a document, as follows: 
- Take the following inputs:
   - `doc`: a new document (e.g. a new question)
   - `words`: the list of words from the corpus (i.e. the return from Q2.2)
   - `idf`: inverse document frequency from the corpus (i.e. the return from Q2.2)
   - `lemmatized, stopword, punctuation`:  similar to those defined in Q2.1
- Tokenize `doc` using the `tokenize` function as defined in Q2.1.
- Compute the term frequency of the document
- Calculate the smoothed normalized tf_idf weights for the single document
- Return the tf_idf weight vector, which should have the same shape as `idf`
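
The key point of this step is that a new document is projected onto the existing vocabulary, so tokens the corpus has never seen are simply ignored. A minimal sketch follows; the `words` and `idf` values are made-up stand-ins for the outputs of Q2.2, and `vectorize_tokens` is a hypothetical helper that takes an already-tokenized document.

```python
import math
from collections import Counter

# Hypothetical vocabulary and idf values standing in for the Q2.2 outputs.
words = ["covid", "mask", "plane", "travel"]
idf = [1.29, 1.69, 1.69, 1.29]

def vectorize_tokens(tokens, words, idf):
    """Hypothetical helper: tf-idf vector of `tokens` over a fixed vocabulary."""
    counts = Counter(tokens)
    # Only tokens found in the vocabulary contribute to the vector.
    in_vocab = sum(counts[w] for w in words)
    if in_vocab == 0:
        # No overlap with the corpus vocabulary: return an all-zero vector.
        return [0.0] * len(words)
    weights = [counts[w] / in_vocab * w_idf for w, w_idf in zip(words, idf)]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

tokens = ["is", "it", "safe", "to", "travel", "by", "plane", "?"]
vect = vectorize_tokens(tokens, words, idf)
```

The result has one entry per vocabulary word, so it can be compared directly with any row of the corpus `tf_idf` array.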

**Q2.4.** Define a function `find_answer(doc_vect, tf_idf, docs)` as follows: 

   - Take three inputs: 
      - `doc_vect`: A tf_idf weight vector for a new question. This is the return from Q2.3. 
      - `tf_idf`: A tf_idf array. This is a return from Q2.2
      - `docs`: the set of documents from which `tf_idf` was created. Note, if there are `m` documents, `n` words, the shape of `tf_idf` is `(m,n)`, the shape of `doc_vect` should be `(n,)`. 
   - Calculate the cosine similarity between `doc_vect` and each row of `tf_idf`. This returns a vector of shape `(m,)`, indicating the similarities between the question and each document in `docs`
   - Find the indexes of the top-3 similarities 
   - Return the documents corresponding to these indexes
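
The ranking step can be sketched in pure Python as follows. The toy `tf_idf` rows, the `docs` labels, and the helper names `cosine` and `top_k` are assumptions for illustration only; a real solution would more likely use NumPy or scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(doc_vect, tf_idf, docs, k=3):
    """Hypothetical helper: the k documents most similar to doc_vect, best first."""
    sims = [cosine(doc_vect, row) for row in tf_idf]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return [docs[i] for i in order]

# Toy data: four documents over a three-word vocabulary.
tf_idf = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8], [0.0, 0.0, 1.0]]
docs = ["doc A", "doc B", "doc C", "doc D"]
query = [0.0, 0.0, 1.0]

best = top_k(query, tf_idf, docs)
print(best)
```

Guarding against a zero norm matters in practice: a question that shares no words with the corpus produces an all-zero vector, and the unguarded formula would divide by zero.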

**Q2.5. Test**: 
- For a better match, you can concatenate each question and answer pair into a single document (see code below)
- Test your solution using different options in the `tokenize` function, i.e. with or without lemmatization, with or without removing stop words/punctuation, to see how these options may affect the accuracy of answers. 
- Test your solution with different questions to see how effective your system is.

## Q3 (bonus): Analysis

Perhaps this could be your first QA system. If you try this system with different questions, you may notice this solution is not perfect and it may not find the right answers to some questions. 
- Please summarize three drawbacks you observed in this QA system. Use examples to illustrate each drawback.
- Research possible solutions to each of the drawbacks
- You do not need to implement these solutions. Just explain them conceptually.
- This is an open-ended question. Just show your observation and thinking here.

In [4]:
import re
import string

import nltk
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# import spacy
# nltk.download()

In [5]:
# Q.1

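def extract(text):
    # Raw string so the backslashes reach the regex engine unchanged;
    # the decimal point is escaped so it matches only a literal '.'.
    # Groups: (indicator name), (signed percentage), (period after ' in ').
    a = re.findall(r'(\w[a-zA-Z ]+):\s+(\+\d+\.\d%)\sin\s([\w ]+)', text)

    return a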

In [6]:
# Q2.1

def tokenize(doc, lemmatized=True, stopword=True, punctuation=True):
    """
    Args:
    -----
    doc: string
        a sentence, question, or answer
    lemmatized: bool
        If True, tokens are lemmatized.
    stopword: bool
        If False, stop words are removed.
    punctuation: bool
        If False, punctuation is removed.

    Returns:
    --------
    tokens: list
    """
    # lowercase, then split the text into word and punctuation tokens
    tokens = nltk.word_tokenize(doc.lower())

    # stopword=False removes stop words; True keeps them
    if not stopword:
        stop_words = set(stopwords.words("english"))
        tokens = [word for word in tokens if word not in stop_words]

    # punctuation=False strips punctuation; True keeps it
    if not punctuation:
        tokens = [token.strip(string.punctuation) for token in tokens]

    # remove empty tokens (e.g. tokens that were pure punctuation)
    tokens = [token.strip() for token in tokens if token.strip() != '']

    if lemmatized:
        wordnet_lemmatizer = WordNetLemmatizer()
        # Every token is lemmatized as a verb. Tagging each token with its
        # part of speech first would give better lemmas (e.g. for nouns).
        tokens = [wordnet_lemmatizer.lemmatize(token, wordnet.VERB) for token in tokens]

    return tokens

In [7]:
# Q2.2
def compute_tfidf(docs, lemmatized=True, stopword=True, punctuation=True):
    """
    Args:
    -----
    docs: list
        a corpus of strings, e.g. list(df['question'])
    lemmatized, stopword, punctuation: bool
        passed through to tokenize()

    Returns:
    --------
    smoothed_tf_idf: array of shape (m, n)
        smoothed, L2-normalized tf-idf weights for m documents and n words
    smoothed_idf: array of shape (n,)
        smoothed inverse document frequency of each word
    words: list
        the vocabulary, in the column order of smoothed_tf_idf
    """
    # STEP 1. tokenize each document; tokenize() works on one string at a time
    token_list = [tokenize(doc, lemmatized=lemmatized, stopword=stopword,
                           punctuation=punctuation) for doc in docs]

    # STEP 2. count tokens per document: {doc index: {token: count}}
    tokens_docs = {idx: nltk.FreqDist(tokens) for idx, tokens in enumerate(token_list)}

    # STEP 3. build the document-term matrix: one row per document, one
    # column per token, each value the token's count in that document.
    # orient='index' turns each outer key into a row; tokens missing from
    # a document come out as NaN and are filled with 0.
    df_corpus = pd.DataFrame.from_dict(tokens_docs, orient='index')
    df_corpus = df_corpus.fillna(0)
    # sort by index (i.e. doc id) so rows line up with docs
    df_corpus = df_corpus.sort_index(axis=0)

    # STEP 4. normalized term frequency: divide each row by its document length
    tf = df_corpus.values
    doc_len = tf.sum(axis=1)
    tf = np.divide(tf, doc_len[:, None])

    # STEP 5. smoothed idf
    # df_idf[i, j] is 1 if word j appears in document i, else 0, so the
    # column sums are the document frequencies.
    df_idf = np.where(tf > 0, 1, 0)
    # idf = ln((N + 1) / (df + 1)) + 1, where N is the number of documents;
    # adding 1 to both counts avoids division by zero
    smoothed_idf = np.log(np.divide(len(token_list) + 1, np.sum(df_idf, axis=0) + 1)) + 1

    # STEP 6. tf-idf: tf is (m, n) and idf is (n,), so tf * idf broadcasts
    # across rows; normalize() scales each row to unit L2 length
    smoothed_tf_idf = normalize(tf * smoothed_idf)

    # words = vocabulary = feature set, in matrix column order
    return smoothed_tf_idf, smoothed_idf, list(df_corpus.columns)

In [8]:
# Q2.3

def vectorize_doc(doc, words, idf, lemmatized=True, stopword=True, punctuation=True):
    """
    Args:
    -----
    doc: string
        a new sentence
    words: list
        unique words from the corpus (returned by compute_tfidf)
    idf: vector
        inverse document frequency of each word in `words`
    lemmatized, stopword, punctuation: bool
        passed through to tokenize()

    Returns:
    --------
    vect: array of shape (len(words),)
        smoothed, L2-normalized tf-idf weights of `doc`
    """
    # tokenize doc using the tokenize() function
    tokens = tokenize(doc, lemmatized=lemmatized, stopword=stopword, punctuation=punctuation)

    # count each token; FreqDist returns 0 for words that do not occur
    token_count = nltk.FreqDist(tokens)

    # term counts aligned with the corpus vocabulary; tokens that are not
    # in `words` are ignored
    tf = np.array([token_count[word] for word in words], dtype=float)

    # normalize by the number of in-vocabulary tokens, guarding against a
    # document that shares no words with the corpus
    doc_len = tf.sum()
    if doc_len > 0:
        tf = tf / doc_len

    # weight by the corpus idf and scale to unit L2 length;
    # normalize() expects a 2-D array, so reshape to (1, n) and back to (n,)
    vect = normalize((tf * idf).reshape(1, -1)).reshape(len(words),)

    return vect

In [9]:
# doc = 'Is it safe to travel by plane?'  # What kind of masks should I use?
# vect= vectorize_doc(doc, words, idf, lemmatized=True, stopword=True, punctuation =  False)

In [10]:
# print([(words[idx], i) for idx, i in enumerate(vect) if i>0])

In [11]:
# Q2.4

def find_answer(doc_vect, tf_idf, docs):
    """
    doc_vect: A tf_idf weight vector for a new question. This is the return from Q2.3.
    tf_idf: A tf_idf array. This is a return from Q2.2
    docs: the set of documents from which tf_idf was created. Note, if there are m documents, n words, 
    the shape of tf_idf is (m,n), the shape of doc_vect should be (n,).
    
    Returns the three documents most similar to doc_vect, best match first.
    """
    # cosine similarity between the question and every document: shape (m,)
    sims = cosine_similarity(doc_vect.reshape(1, -1), tf_idf)[0]
    # indexes of the top-3 similarities
    top_index = pd.Series(sims).nlargest(3).index
    # works for a list or for a Series with the default integer index
    top_docs = [docs[i] for i in top_index]
    return top_docs

In [12]:
if __name__ == "__main__":  
    
    # Test Q1
    
    text='''Consumer Price Index:
            +0.2% in Sep 2020

            Unemployment Rate:
            +7.9% in Sep 2020

            Producer Price Index:
            +0.4% in Sep 2020

            Employment Cost Index:
            +0.5% in 2nd Qtr of 2020

            Productivity:
            +10.1% in 2nd Qtr of 2020

            Import Price Index:
            +0.3% in Sep 2020

            Export Price Index:
            +0.6% in Sep 2020'''
    
    print("\n==================\n")
    print("Test Q1")
    print(extract(text))
      
    data=pd.read_csv("03_data/03_covid_qa.csv")
    # concatenate a pair of question and answer as a single doc
    docs = data.apply(lambda x: x["question"] + " " + x["answer"], axis = 1) 
    
    print("\n==================\n")
    print("Test Q2.1 - Try different parameter values to make sure all options work\n")
    
    # Let's tokenize the first document
    doc = docs[0]
    
    print("===Lemmatize words, keep stop words/punctuations===\n")
    tokens = tokenize(doc, lemmatized=True, stopword=True, punctuation = True)
    print(tokens)
    print("\n")
    
    print("===Lemmatize words, remove stop words/punctuations==\n")
    tokens = tokenize(doc, lemmatized=True, stopword=False, punctuation = False)
    print(tokens)
    print("\n")
    
    print("===Do not lemmatize words, remove stop words, but keep punctuations===\n")
    tokens = tokenize(doc, lemmatized=False, stopword=False, punctuation = True)
    print(tokens)
     
    print("\n==================\n")
    print("Test Q2.2")
    tf_idf, idf, words = compute_tfidf(docs, lemmatized=True, stopword=True, punctuation = False)
    print("TF_IDF Shape: ", tf_idf.shape)
    print("IDF Shape: ", idf.shape)

    print("\n==================\n")
    print("Test Q2.3 -- You can try different questions related to Covid-19 here")
    doc = 'Is it safe to travel by plane?'  # What kind of masks should I use?
    vect = vectorize_doc(doc, words, idf, lemmatized=True, stopword=True, punctuation = True)
    # print words with non-zero tf_idf weights
    print([(words[idx], i) for idx, i in enumerate(vect) if i>0])
  
    print("\n==================\n")
    print("Test Q2.4")
    answers = find_answer(vect, tf_idf, docs)
    for a in answers:
        print(a, "\n")



Test Q1
[('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Unemployment Rate', '+7.9%', 'Sep 2020'), ('Producer Price Index', '+0.4%', 'Sep 2020'), ('Employment Cost Index', '+0.5%', '2nd Qtr of 2020'), ('Productivity', '+10.1%', '2nd Qtr of 2020'), ('Import Price Index', '+0.3%', 'Sep 2020'), ('Export Price Index', '+0.6%', 'Sep 2020')]


Test Q2.1 - Try different parameter values to make sure all options work

===Lemmatize words, keep stop words/punctuations===

['can', 'i', 'get', 'covid-19', 'from', 'animals', 'when', 'travel', 'to', 'other', 'countries', '?', 'although', 'the', 'current', 'spread', 'and', 'growth', 'of', 'the', 'covid-19', 'outbreak', 'be', 'primarily', 'associate', 'with', 'spread', 'from', 'person', 'to', 'person', ',', 'experts', 'agree', 'that', 'the', 'virus', 'likely', 'originate', 'from', 'bat', 'and', 'may', 'have', 'pass', 'through', 'an', 'intermediary', 'animal', 'source', '(', 'currently', 'unknown', ')', 'in', 'china', 'before', 'be', 'transmit', 'to