# HW 4: Preprocessing

## Q1: Extract data using regular expressions (2 points)
Suppose you have scraped the text shown below from an online source (https://www.google.com/finance/). 
Define an `extract` function that:
- takes a piece of text (in the format shown below) as input
- extracts the data into a DataFrame with columns 'Ticker', 'Name', 'Article', 'Media', 'Time', 'Price', and 'Change' using regular expressions
- returns the DataFrame
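
As a minimal sketch of the underlying idea (not the expected solution): when a pattern has several capture groups, `re.findall` returns one tuple per match, and a list of such tuples maps directly onto DataFrame rows. The two-record `sample` string and the names `sample`, `pattern`, and `rows` are made up for illustration.

```python
import re

# Hypothetical two-record sample in a simplified "ticker / price" layout
sample = "AAA\n$10.50\nBBB\n$7.25"

# Two capture groups: an all-caps ticker line followed by a dollar price line
pattern = r"([A-Z]+)\n(\$\d+\.\d{2})"

rows = re.findall(pattern, sample)
print(rows)  # [('AAA', '$10.50'), ('BBB', '$7.25')]
```

Each tuple can then become one row, for example with `pd.DataFrame(rows, columns=[...])`.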

In [1]:
import pandas as pd
import nltk
from sklearn.metrics import pairwise_distances
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import normalize
import re
import spacy

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


In [2]:
text = '''QQQ
Invesco QQQ Trust Series 1
Invesco Expands QQQ Innovation Suite to Include Small-Cap ETF
PR Newswire • 4 hours ago
$265.62
1.13%
add_circle_outline
AAPL
Apple Inc
Estimating The Fair Value Of Apple Inc. (NASDAQ:AAPL)
Yahoo Finance • 4 hours ago
$140.41
1.50%
add_circle_outline
TSLA
Tesla Inc
Could This Tesla Stock Unbalanced Iron Condor Return 23%?
Investor's Business Daily • 1 hour ago
$218.30
0.49%
add_circle_outline
AMZN
Amazon.com, Inc.
The Regulators of Facebook, Google and Amazon Also Invest in the Companies' Stocks
Wall Street Journal • 2 days ago
$110.91
1.76%
add_circle_outline'''



In [3]:
def extract(text):

    # One capture group per column; each record spans six consecutive lines
    col = ('Ticker', 'Name', 'Article', 'Media', 'Time', 'Price', 'Change')
    match = re.findall(
        r'([A-Z]+)\n(.+)\n(.+)\n(.+) • (.+)\n(\$\d+\.\d{2})\n(\d+\.\d{2}%)', text)

    result = pd.DataFrame(match, columns=col)

    return result


In [4]:
extract(text)

| | Ticker | Name | Article | Media | Time | Price | Change |
|---|---|---|---|---|---|---|---|
| 0 | QQQ | Invesco QQQ Trust Series 1 | Invesco Expands QQQ Innovation Suite to Includ... | PR Newswire | 4 hours ago | $265.62 | 1.13% |
| 1 | AAPL | Apple Inc | Estimating The Fair Value Of Apple Inc. (NASDA... | Yahoo Finance | 4 hours ago | $140.41 | 1.50% |
| 2 | TSLA | Tesla Inc | Could This Tesla Stock Unbalanced Iron Condor ... | Investor's Business Daily | 1 hour ago | $218.30 | 0.49% |
| 3 | AMZN | Amazon.com, Inc. | The Regulators of Facebook, Google and Amazon ... | Wall Street Journal | 2 days ago | $110.91 | 1.76% |


## Q2: Analyze a document (8 points)

Given a long document, you would like to:
- Quantify how concrete a sentence is
- Create a concise summary while preserving its key information content and overall meaning. Let's implement an `extractive method` based on the concept of TF-IDF. The idea is to identify the key sentences in an article and use them as a summary. 


Carefully follow the steps below to achieve these two goals.

### Q2.1. Preprocess the input document (4 points, each step 0.5 point (see below), overall function and logic 2 points)

Define a function `preprocess(doc, lemmatized = True, remove_stopword = True, lower_case = True, remove_punctuation = True, pos_tag = False)` 
- Takes six parameters:
    - `doc`: an input string (e.g. a document)
    - `lemmatized`: an optional boolean parameter to indicate if tokens are lemmatized. The default value is True (i.e. tokens are lemmatized).
    - `remove_stopword`: an optional boolean parameter to remove stop words. The default value is True, i.e., remove stop words. 
    - `remove_punctuation`: optional boolean parameter to remove punctuation. The default value is True, i.e., remove all punctuation.
    - `lower_case`: optional boolean parameter to convert all tokens to lower case. The default option is True, i.e., lowercase all tokens.
    - `pos_tag`: optional boolean parameter to add a POS tag for each token. The default option is False, i.e., no POS tagging.  
       
- Split the input `doc` into sentences. Hint: typically, "\n\n" is used to separate paragraphs. Make sure no sentence spans two paragraphs. (0.5 point)


- Tokenize each sentence into unigram tokens and also process the tokens as follows:
    - If `lemmatized` is True, lemmatize all unigrams. (0.5 point)
    - If `remove_stopword` is set to True, remove all stop words. (0.5 point)
    - If `remove_punctuation` is set to True, remove all punctuation. (0.5 point)
    - If `lower_case` is set to True, convert all tokens to lower case (0.5 point)
    - If `pos_tag` is set to True, find the POS tag for each token and form a tuple for each token, e.g., ('recently', 'ADV'). Either Penn tags or Universal tags are fine. See mapping of these two tagging systems here: https://universaldependencies.org/tagset-conversion/en-penn-uposf.html

- Return the original sentence list (`sents`) and also the tokenized (or tagged) sentence list (`tokenized_sents`). 
   
(Hint: you can use the [nltk](https://www.nltk.org/api/nltk.html) and [spacy](https://spacy.io/api/token#attributes) packages for this task.)
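
One way to keep sentences from crossing paragraph boundaries is to split on `"\n\n"` first and then split each paragraph on its own. The sketch below assumes a naive regex splitter as a stand-in for a spaCy or NLTK sentence splitter; the function name `split_sentences_by_paragraph` and the `demo` string are hypothetical.

```python
import re

def split_sentences_by_paragraph(doc):
    """Split on blank lines first, then split each paragraph into sentences."""
    sentences = []
    for paragraph in doc.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Naive splitter: break after ., ! or ? followed by whitespace.
        # A real solution would call a spaCy or NLTK sentence splitter here.
        parts = re.split(r"(?<=[.!?])\s+", paragraph)
        sentences.extend(p.strip() for p in parts if p.strip())
    return sentences

demo = "A Short Title\n\nFirst sentence. Second sentence!\n\nLast one?"
print(split_sentences_by_paragraph(demo))
# ['A Short Title', 'First sentence.', 'Second sentence!', 'Last one?']
```

Because each paragraph is processed separately, a heading without a final period still ends up as its own sentence.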

In [5]:
nlp = spacy.load("en_core_web_sm")


def preprocess(doc, lemmatized=True, pos_tag=False, remove_stopword=True, lower_case=True, remove_punctuation=True):

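    # Turn paragraph breaks into sentence boundaries before parsing.
    # Note: ". " is inserted even when a paragraph already ends with
    # punctuation, so some sentences end in ".." in the output.
    docs = nlp(doc.replace('\n\n', '. '))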
    sents = list(docs.sents)
    tokenized_sents = []
    for sent in docs.sents:
        tokenized_sent = []
        for token in sent:
            text = token.text
            if lemmatized:
                text = token.lemma_
            if remove_stopword:
                if token.is_stop:
                    continue
            if remove_punctuation:
                if token.is_punct:
                    continue
            if lower_case:
                text = text.lower()
            if pos_tag:
                text = (text, token.pos_)

            tokenized_sent.append(text)

        tokenized_sents.append(tokenized_sent)

    return sents, tokenized_sents


In [6]:
# load test document

text = open("power_of_nlp.txt", "r", encoding='utf-8').read()

In [7]:
# test with all default options:

sents, tokenized_sents = preprocess(text)
for i in range(3):
    print(sents[i], "\n",tokenized_sents[i],"\n\n" )

The Power of Natural Language Processing. 
 ['power', 'natural', 'language', 'processing'] 


Until recently, the conventional wisdom was that while AI was better than humans at data-driven decision making tasks, it was still inferior to humans for cognitive and creative ones. 
 ['recently', 'conventional', 'wisdom', 'ai', 'well', 'human', 'data', 'drive', 'decision', 'make', 'task', 'inferior', 'human', 'cognitive', 'creative', 'one'] 


But in the past two years language-based AI has advanced by leaps and bounds, changing common notions of what this technology can do.. . 
 ['past', 'year', 'language', 'base', 'ai', 'advance', 'leap', 'bound', 'change', 'common', 'notion', 'technology'] 




In [8]:
# process text without removing stop words or punctuation and without lowercasing, but with POS tagging

sents, tokenized_sents = preprocess(text, lemmatized = False, pos_tag = True, 
                                    remove_stopword=False, remove_punctuation = False, 
                                    lower_case = False)

for i in range(3):
    print(sents[i], "\n",tokenized_sents[i],"\n\n" )

The Power of Natural Language Processing. 
 [('The', 'DET'), ('Power', 'PROPN'), ('of', 'ADP'), ('Natural', 'PROPN'), ('Language', 'PROPN'), ('Processing', 'PROPN'), ('.', 'PUNCT')] 


Until recently, the conventional wisdom was that while AI was better than humans at data-driven decision making tasks, it was still inferior to humans for cognitive and creative ones. 
 [('Until', 'ADP'), ('recently', 'ADV'), (',', 'PUNCT'), ('the', 'DET'), ('conventional', 'ADJ'), ('wisdom', 'NOUN'), ('was', 'AUX'), ('that', 'SCONJ'), ('while', 'SCONJ'), ('AI', 'PROPN'), ('was', 'AUX'), ('better', 'ADJ'), ('than', 'ADP'), ('humans', 'NOUN'), ('at', 'ADP'), ('data', 'NOUN'), ('-', 'PUNCT'), ('driven', 'VERB'), ('decision', 'NOUN'), ('making', 'VERB'), ('tasks', 'NOUN'), (',', 'PUNCT'), ('it', 'PRON'), ('was', 'AUX'), ('still', 'ADV'), ('inferior', 'ADJ'), ('to', 'ADP'), ('humans', 'NOUN'), ('for', 'ADP'), ('cognitive', 'ADJ'), ('and', 'CCONJ'), ('creative', 'ADJ'), ('ones', 'NOUN'), ('.', 'PUNCT')] 


Bu

### Q2.2. Quantify sentence concreteness


`Concreteness` can increase a message's persuasiveness. Concreteness can be measured by:
- the use of `articles` (e.g., a, an, and the), 
- `adpositions` (e.g., in, at, of, on, etc), and
- `quantifiers`, i.e., adjectives before nouns.


Define a function `compute_concreteness(tagged_sent)` as follows:
- Input argument is `tagged_sent`, a list with (token, pos_tag) tuples as shown above.
- Find the three types of tokens: `articles`, `adpositions`, and `quantifiers`.
- Compute the `concreteness` score as:  `(the sum of the counts of the three types of tokens)/(total non-punctuation tokens)`.
- Return the concreteness score and the articles, adpositions, and quantifiers lists.


Find the most concrete and the least concrete sentences from the article. 


Reference: Peer to Peer Lending: The Relationship Between Language Features, Trustworthiness, and Persuasion Success, https://socialmedialab.sites.stanford.edu/sites/g/files/sbiybj22976/files/media/file/larrimore-jacr-peer-to-peer.pdf

In [9]:
def compute_concreteness(tagged_sent):

    # add your codes
    articles = []
    adpositions = []
    quantifier = []
    non_punct = 0

    for idx, token in enumerate(tagged_sent):
        (word, tag) = token
        if word.lower() in ['a', 'an', 'the']:
            articles.append(token)
        if tag == 'ADP':
            adpositions.append(token)
        if tag == 'ADJ' and idx < len(tagged_sent) - 1 and tagged_sent[idx + 1][1] == 'NOUN':
            quantifier.append(token)

        if tag != 'PUNCT':
            non_punct += 1

    # Guard against sentences that contain only punctuation
    total = len(articles) + len(adpositions) + len(quantifier)
    concreteness = total / non_punct if non_punct else 0.0
    return concreteness, articles, adpositions, quantifier


In [10]:
# tokenize with POS tags, leaving the text otherwise unchanged

sents, tokenized_sents = preprocess(text, lemmatized = False, pos_tag = True, 
                                    remove_stopword=False, remove_punctuation = False, 
                                    lower_case = False)



In [11]:
# Test with one sentence

idx = 1
x = tokenized_sents[idx]
concreteness, articles, adpositions,quantifier = compute_concreteness(x)
sents[idx]
concreteness, articles, adpositions,quantifier

Until recently, the conventional wisdom was that while AI was better than humans at data-driven decision making tasks, it was still inferior to humans for cognitive and creative ones.

(0.26666666666666666,
 [('the', 'DET')],
 [('Until', 'ADP'),
  ('than', 'ADP'),
  ('at', 'ADP'),
  ('to', 'ADP'),
  ('for', 'ADP')],
 [('conventional', 'ADJ'), ('creative', 'ADJ')])
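
As a quick sanity check, the score above can be reproduced by hand from the counts in the output. The variable names are illustrative only; the token and punctuation counts are read off the tagged sentence shown earlier (34 tuples, of which 4 are tagged `PUNCT`).

```python
# Counts read off the output above
n_articles = 1       # 'the'
n_adpositions = 5    # 'Until', 'than', 'at', 'to', 'for'
n_quantifiers = 2    # 'conventional', 'creative'
n_tokens = 34        # all (token, tag) tuples in the tagged sentence
n_punct = 4          # two commas, one hyphen, one period

score = (n_articles + n_adpositions + n_quantifiers) / (n_tokens - n_punct)
print(round(score, 4))  # 0.2667
```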

In [12]:
# Find the most concrete and the least concrete sentences from the article

concrete = [compute_concreteness(x)[0] for x in tokenized_sents]
max_id = np.argmax(np.array(concrete))
min_id = np.argmin(np.array(concrete))
print (f"The most concrete sentence:  {sents[max_id]}, {concrete[max_id]:.3f}\n")
print (f"The least concrete sentence:  {sents[min_id]}, {concrete[min_id]:.3f}")

The most concrete sentence:  Large foundation models like GPT-3 exhibit abilities to generalize to a large number of tasks without any task-specific training., 0.450

The least concrete sentence:  What NLP Can Do., 0.000


### Q2.3. Generate TF-IDF representations for sentences (1 point,  0.5 point for use_idf option, 0.5 point for overall)

Define a function `compute_tf_idf(sents, use_idf)` as follows: 


- Take the following two inputs:
    - `sents`: tokenized sentences returned from Q2.1. These sentences form a corpus for you to calculate `TF-IDF` vectors.
    - `use_idf`: if this option is true, return smoothed normalized `TF_IDF` vectors for all sentences; otherwise, just return normalized `TF` vector for each sentence.
    
    
- Calculate `TF-IDF` vectors as shown in the lecture notes (Hint: you can slightly modify code segment 7.5 in NLP Lecture Notes (II) for this task)

- Return the `TF-IDF` vectors  if `use_idf` is True.  Return the `TF` vectors if `use_idf` is False.
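
The toy example below sketches the computation in plain Python so each step is visible. It assumes TF is the raw count divided by sentence length and that the smoothed IDF is `ln((N + 1) / (df + 1)) + 1`; whether this matches the lecture notes exactly is an assumption. The corpus `toy_sents` and all variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-tokenized sentences
toy_sents = [["ai", "tool", "ai"], ["tool", "task"], ["language", "ai"]]

vocab = sorted({tok for sent in toy_sents for tok in sent})
n_docs = len(toy_sents)

# TF: raw counts divided by sentence length
tf = []
for sent in toy_sents:
    counts = Counter(sent)
    tf.append([counts[tok] / len(sent) for tok in vocab])

# Document frequency: number of sentences containing each token
df = [sum(1 for sent in toy_sents if tok in sent) for tok in vocab]

# Smoothed IDF: ln((N + 1) / (df + 1)) + 1
idf = [math.log((n_docs + 1) / (d + 1)) + 1 for d in df]

# TF-IDF: scale each TF column by the IDF of its token
toy_tf_idf = [[t * w for t, w in zip(row, idf)] for row in tf]

for tok, d, w in zip(vocab, df, idf):
    print(f"{tok}: df={d}, idf={w:.3f}")
```

Tokens that appear in fewer sentences receive a larger IDF, so they contribute more to a sentence's vector than tokens that appear almost everywhere.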

In [13]:
def compute_tf_idf(sents, use_idf=True):

    # Count tokens per sentence: {sentence index: {token: count}}
    docs_tokens = {idx: {token: sent.count(token)
                         for token in set(sent)}
                   for idx, sent in enumerate(sents)}

    # Build the document-term matrix; absent tokens get a count of 0
    dtm = pd.DataFrame.from_dict(docs_tokens, orient="index")
    dtm = dtm.fillna(0)
    dtm = dtm.sort_index(axis=0)

    # Normalize counts by sentence length to get TF
    tf = dtm.values
    doc_len = tf.sum(axis=1, keepdims=True)
    tf = np.divide(tf, doc_len)

    # Document frequency indicator: 1 where a token occurs in a sentence
    df = np.where(tf > 0, 1, 0)

    # Smoothed IDF: log((N + 1) / (df + 1)) + 1
    smoothed_idf = np.log(np.divide(len(sents)+1, np.sum(df, axis=0)+1))+1
    smoothed_tf_idf = tf*smoothed_idf

    return smoothed_tf_idf if use_idf else tf


In [14]:
sents, tokenized_sents = preprocess(text)
tf_idf = compute_tf_idf(tokenized_sents, use_idf = True)

# show shape of TF-IDF
tf_idf.shape

(80, 488)

### Q2.4. Identify key sentences as summary 

`2 points, 0.5 point for steps 1-3 each, 0.5 for overall logic. Due to the different packages used, the output summary sentences may not match the sample output. If they differ, please review the code to verify that the logic is correct.`

The basic idea is that, in a coherent article, all sentences should center around some key ideas. If we can identify a subset of sentences, denoted as $S_{key}$, that precisely captures the key ideas, then $S_{key}$ can be used as a summary. Moreover, $S_{key}$ should have high similarity to all the other sentences on average, because all sentences are centered around the key ideas contained in $S_{key}$. Therefore, we can identify whether a sentence belongs to $S_{key}$ by its similarity to all the other sentences.


Define a function `get_summary(tf_idf, sents, topN = 5)`  as follows:

- This function takes three inputs:
    - `tf_idf`: the TF-IDF vectors of all the sentences in a document
    - `sents`: the original sentences corresponding to the TF-IDF vectors
    - `topN`: the number of top-ranked sentences to include in the generated summary

- Steps:
    1. Calculate the cosine similarity for every pair of TF-IDF vectors (0.5 point)
    1. For each sentence, calculate its average similarity to all the others (0.5 point)
    1. Select the sentences with the `topN` largest average similarity (0.5 point)
    1. Print the indices of the `topN` sentences
    1. Return these sentences as the summary
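
Steps 1 to 3 can be sketched in plain Python on a tiny made-up matrix. This is an illustration under stated assumptions, not the reference solution: the helper `cosine`, the matrix `vectors`, and the other names are hypothetical, and the average here explicitly excludes each sentence's similarity to itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical 3-sentence, 3-term TF-IDF matrix
vectors = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
n = len(vectors)

# Step 1: pairwise similarity matrix
sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Step 2: average similarity to all the other sentences (self excluded)
avg_sim = [sum(sim[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]

# Step 3: indices of the top-N sentences by average similarity
top_n = 2
top_idx = sorted(range(n), key=lambda i: avg_sim[i], reverse=True)[:top_n]
print(top_idx)  # [0, 1]
```

Sentence 0 shares a term with each of the other two sentences, so it has the highest average similarity and ranks first.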

In [85]:
def get_summary(tf_idf, sents, topN=5):

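    # Step 1: pairwise cosine similarity between all sentence vectors
    similarity = 1 - pairwise_distances(tf_idf, metric='cosine')

    # Step 2: average similarity per sentence; the self-similarity term is
    # included, which shifts every average equally and leaves the ranking unchanged
    avg_similarity = similarity.sum(axis=0) / len(sents)

    # Step 3: indices of the topN most similar sentences, best first
    index = np.argsort(avg_similarity)[::-1][:topN]
    summary = [sents[idx] for idx in index]

    return summary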


In [86]:
# put everything together and test with different options

sents, tokenized_sents = preprocess(text)
tf_idf = compute_tf_idf(tokenized_sents, use_idf = True)
summary = get_summary(tf_idf, sents, topN = 5)

for sent in summary:
    print(sent,"\n")

Begin incorporating new language-based AI tools for a variety of tasks to better understand their capabilities.. 

Models like GPT-3 are considered to be foundation models — an emerging AI research area — which also work for other types of data such as images and video. 

Powerful generalizable language-based AI tools like Elicit are here, and they are just the tip of the iceberg; multimodal foundation model-based tools are poised to transform business in ways that are still difficult to predict. 

A Language-Based AI Research Assistant. 

Understand how you might leverage AI-based language technologies to make better decisions or reorganize your skilled labor.. Language-based AI won’t replace jobs, but it will automate many tasks, even for decision makers. 



In [87]:
# test with the option lemmatized=False, remove_stopword=False

sents, tokenized_sents = preprocess(text, lemmatized=False, remove_stopword=False, remove_punctuation = True )
tf_idf = compute_tf_idf(tokenized_sents, use_idf = True)
summary = get_summary(tf_idf, sents, topN = 5)
for sent in summary:
   print(sent,"\n")



It is difficult to anticipate just how these tools might be used at different levels of your organization, but the best way to get an understanding of this tech may be for you and other leaders in your firm to adopt it yourselves. 

Powerful generalizable language-based AI tools like Elicit are here, and they are just the tip of the iceberg; multimodal foundation model-based tools are poised to transform business in ways that are still difficult to predict. 

And don’t forget to adopt these technologies yourself — this is the best way for you to start to understand their future roles in your organization. 

Don’t bet the boat on it because some of the tech may not work out, but if your team gains a better understanding of what is possible, then you will be ahead of the competition. 

You are certainly aware of the value of data, but you still may be overlooking some essential data assets if you are not utilizing text analytics and NLP throughout your organization. 



In [88]:
# test with the option use_idf = False

sents, tokenized_sents = preprocess(text)
tf_idf = compute_tf_idf(tokenized_sents, use_idf = False)
summary = get_summary(tf_idf, sents, topN = 5)
for sent in summary:
   print(sent,"\n")

Powerful generalizable language-based AI tools like Elicit are here, and they are just the tip of the iceberg; multimodal foundation model-based tools are poised to transform business in ways that are still difficult to predict. 

Begin incorporating new language-based AI tools for a variety of tasks to better understand their capabilities.. 

Models like GPT-3 are considered to be foundation models — an emerging AI research area — which also work for other types of data such as images and video. 

Understand how you might leverage AI-based language technologies to make better decisions or reorganize your skilled labor.. Language-based AI won’t replace jobs, but it will automate many tasks, even for decision makers. 

A Language-Based AI Research Assistant. 



### Q2.5. Analysis (1 point, 0.5 point for Q1, and 0.5 for all the others)

- Do you think this way of quantifying concreteness makes sense? Do you have other ideas for measuring concreteness or abstractness? Share your ideas in a PDF or Markdown.




- Do you think this method is able to generate a good summary? What pros or cons have you observed? (0.5 point)




- Do these options `lemmatized, remove_stopword, remove_punctuation, use_idf` matter? 
- Why do you think these options matter or do not matter? 
- If these options matter, what are the best values for these options?





#### answers:
- In my opinion, this way of quantifying concreteness makes sense.

- These methods produce a reasonably good summary. Their shortcoming, however, is that the measurement criteria are limited.

- These options matter because different settings produce results with different precision.

## Q2.5. (Bonus 1 point). 

`If you propose an idea, you must also implement it.`


- Can you think of a way to improve this extractive summary method? Explain the method you propose, implement it, use it to generate a new summary, and demonstrate what is improved in the new summary.

**Sample Answer**


A: If an article contains sentences that repeat themselves and one of the repeated sentences is selected, the others will be selected too, so it is hard to ensure the diversity of the sentences in the summary. To ensure diversity, this algorithm can be improved using, for example, the **max-min** method: 
1. Select the top 10 (or more) sentences as candidates, as before. 
1. Add the top 1 candidate to the summary. 
1. Gradually add other sentences such that each of them is **least similar** to those already added. 

An implementation is provided. See if it's better!
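
The three steps above can be sketched as a greedy selection over a precomputed similarity matrix. This is one possible reading of "least similar to those already added" (pick the candidate whose highest similarity to any selected sentence is smallest), not necessarily the provided implementation. The function name `max_min_select` and the toy matrix are hypothetical.

```python
def max_min_select(sim, avg_sim, top_n=5, n_candidates=10):
    """Greedy max-min selection over a precomputed similarity matrix.

    sim: square list of lists with pairwise cosine similarities
    avg_sim: average similarity of each sentence to the others
    Returns the indices of the selected sentences, in selection order.
    """
    # Step 1: keep the n_candidates sentences with the highest average similarity
    ranked = sorted(range(len(avg_sim)), key=lambda i: avg_sim[i], reverse=True)
    candidates = ranked[:n_candidates]

    # Step 2: seed the summary with the top-ranked candidate
    selected = [candidates[0]]
    remaining = candidates[1:]

    # Step 3: repeatedly add the candidate whose closest selected sentence
    # is the least similar (smallest maximum similarity)
    while remaining and len(selected) < top_n:
        best = min(remaining, key=lambda c: max(sim[c][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical 4-sentence example: sentences 0 and 1 are near-duplicates
sim = [
    [1.0, 0.9, 0.4, 0.2],
    [0.9, 1.0, 0.3, 0.2],
    [0.4, 0.3, 1.0, 0.1],
    [0.2, 0.2, 0.1, 1.0],
]
avg_sim = [sum(row[j] for j in range(4) if j != i) / 3 for i, row in enumerate(sim)]
print(max_min_select(sim, avg_sim, top_n=2, n_candidates=3))  # [0, 2]
```

Ranking by average similarity alone would pick sentences 0 and 1 here; the max-min step replaces the near-duplicate sentence 1 with sentence 2.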




- Or, you can research other extractive summarization methods and implement one here. Compare it with the one you implemented in Q2.1-Q2.3 and show the pros and cons of each method.

*Another algorithm is to select the sentences with the largest total word TF-IDF scores. For an implementation, see https://towardsdatascience.com/text-summarization-using-tf-idf-e64a0644ace3*
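
A minimal sketch of this total-score idea, assuming each sentence is a row of TF-IDF weights and its score is simply the row sum. The function name `top_sentences_by_total_score` and the toy `rows` are hypothetical, and this is not the linked article's code.

```python
def top_sentences_by_total_score(tf_idf_rows, top_n=2):
    """Return indices of the top_n rows with the largest summed TF-IDF weight."""
    totals = [sum(row) for row in tf_idf_rows]
    ranked = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    # Restore document order so the summary reads naturally
    return sorted(ranked[:top_n])

# Hypothetical TF-IDF rows for three sentences
rows = [[0.5, 0.0, 0.7], [0.2, 0.1, 0.0], [0.0, 0.9, 0.6]]
print(top_sentences_by_total_score(rows, top_n=2))  # [0, 2]
```

Unlike the similarity-based ranking, this score looks at each sentence in isolation, so it does nothing by itself to avoid selecting near-duplicate sentences.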

In [19]:
def get_summary_with_diversity(tf_idf, sents, topN = 5):
    
   
    #add your codes



    return summary 

In [20]:
summary = get_summary_with_diversity(tf_idf, sents, topN = 5)

for sent in summary:
    print(sent,"\n")

Think of finance. 

How Can Organizations Prepare for the Future?. 

The original suggestion itself wasn’t perfect, but it reminded me of some critical topics that I had overlooked, and I revised the article accordingly. 

What NLP Can Do. 

Yet while these stunts may be attention grabbing, are they really indicative of what this tech can do for businesses?. 

