<a href="https://colab.research.google.com/github/AbeHandler/AbeHandler.github.io/blob/master/Phrases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Moving from words to phrases when doing NLP
- [Abe Handler](https://www.abehandler.com/) University of Colorado, Boulder
- [Shufan Wang](https://people.cs.umass.edu/~shufanwang/) University of  Massachusetts, Amherst

## Introduction

If you have found this tutorial, you have probably done NLP projects where you (a) start with documents, (b) break them into individual words, and then (c) use computation to draw conclusions about the words in the documents. For instance, in your last project, maybe you took a collection of documents, broke the documents into individual words and then ran a topic model to find groups of words that tended to appear together in the documents. 

Breaking documents into individual words implicitly represents text in terms of single-word units called **unigrams**. The unordered collection of all the unigrams in a document is often called a   **bag of words** (see [Jurafsky and Martin](https://web.stanford.edu/~jurafsky/slp3/4.pdf)). Representing text using the unigram bag of words has many advantages. For one, analyzing unigrams is easy and fast; you can take a big text document and break it into a bunch of single-word observations, so you can observe useful statistical properties from the text. 

However, breaking documents into single words for downstream analysis does have downsides. One limitation is that some concepts or linguistic units within documents consist of multiple words, and so get lost or discarded when you use a unigram representation. For instance, the string "New York" refers to a particular city (or state). If you break this string into the single-word units "New" and "York", and put each of these unigrams into your bag of words, your representation of the text does not really represent the concept "New York". That means when you draw conclusions about lexical units during downstream analysis, you won't be able to draw conclusions about "New York".

For this reason, it sometimes may make sense for you to analyze groups of words instead of unigrams. This tutorial will show you how to do NLP with groups of words, which we will call **phrases** or **multi-word expressions**. We will show how to (A) extract phrases from documents and (B) use these phrases for downstream analysis. 

## High-level takeaways

1. **You can use phrases when you do NLP.** There are many existing tools and methods for extracting phrases (e.g. [PyATE](https://github.com/kevinlu1248/pyate)). In this tutorial, we will explore using phrases extracted via the Python package [phrasemachine](https://github.com/slanglab/phrasemachine), which you can install using `pip install phrasemachine`. Phrasemachine is based on the method described in the paper [Bag of What](https://aclanthology.org/W16-5615.pdf).
2. **You can be creative and define phrases in a way that makes sense for your problem.**  `phrasemachine` uses a grammar over part-of-speech tags to extract phrases. The particular phrasemachine patterns are often useful. But in your work, think of other kinds of phrasal patterns you might want to extract using regular expressions. For instance, if you are analyzing the political valence of economic theories, you might want to search economics papers for a pattern like "theory of \$ADJ\?(NOUN|PROPN)+" (e.g. "Theory of Monetary Policy").
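
To make the second takeaway concrete, here is a minimal sketch of matching a pattern like `theory of ADJ?(NOUN|PROPN)+` over part-of-speech tags with Python's `re` module. It is not how `phrasemachine` is implemented. The helper names `encode` and `extract_theory_phrases` are hypothetical, and the tokens are tagged by hand; in practice a part-of-speech tagger would supply the tags.

```python
import re

def encode(token, tag):
    """Map one (token, tag) pair to a single character.

    Using one character per token means regex offsets equal token offsets.
    """
    word = token.lower()
    if word == "theory":
        return "T"
    if word == "of":
        return "F"
    if tag == "ADJ":
        return "A"
    if tag in ("NOUN", "PROPN"):
        return "N"
    return "O"  # any other token

def extract_theory_phrases(tagged_tokens):
    """Return phrases matching: theory of ADJ?(NOUN|PROPN)+"""
    encoded = "".join(encode(tok, tag) for tok, tag in tagged_tokens)
    phrases = []
    for m in re.finditer(r"TFA?N+", encoded):
        tokens = [tok for tok, _ in tagged_tokens[m.start():m.end()]]
        phrases.append(" ".join(tokens))
    return phrases

# Hand-tagged example sentence (the tags are an assumption, not tagger output)
tagged = [("The", "DET"), ("paper", "NOUN"), ("develops", "VERB"),
          ("a", "DET"), ("theory", "NOUN"), ("of", "ADP"),
          ("monetary", "ADJ"), ("policy", "NOUN"), ("today", "ADV")]

print(extract_theory_phrases(tagged))
```

Encoding each token as one character is a design choice that keeps the mapping from regex matches back to tokens trivial; patterns that need many distinct tags would need a larger alphabet.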

#### Unigram bag of words

Let's start with a single (short) document and break it into a unigram bag of words. It's easy to find packages for this online, but we will just use vanilla Python for this. A few notes:

- We will define words using a whitespace delimiter below, but note that there are other, better ways to do [tokenization](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf).
- Note that a [bag](https://en.wikipedia.org/wiki/Multiset) is a set that allows duplicates; notice that the word `guarantee` appears two times in the `unigram_bag_of_words`.
- Note that each item in our bag is a unigram (single word).

In [149]:
from collections import defaultdict
document =  "Solyndra received a loan guarantee. The Department of Energy offered the guarantee.".replace(".","")
unigram_bag_of_words = defaultdict(int)
for word in document.split():
    unigram_bag_of_words[word] += 1

unigram_bag_of_words

defaultdict(int,
            {'Department': 1,
             'Energy': 1,
             'Solyndra': 1,
             'The': 1,
             'a': 1,
             'guarantee': 2,
             'loan': 1,
             'of': 1,
             'offered': 1,
             'received': 1,
             'the': 1})
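
Splitting on whitespace after deleting periods is a blunt tokenizer. As a slightly more careful sketch, assuming that runs of letters, digits, and apostrophes are an acceptable definition of a word, the same bag can be built with `re.findall` and `collections.Counter` from the standard library. The variable names here are illustrative only.

```python
import re
from collections import Counter

document = "Solyndra received a loan guarantee. The Department of Energy offered the guarantee."

# Treat each run of letters, digits, or apostrophes as one token (an assumption;
# real tokenizers handle many more cases)
tokens = re.findall(r"[A-Za-z0-9']+", document)

unigram_counts = Counter(tokens)

# Lowercasing first merges "The" and "the" into a single entry
lowercased_counts = Counter(token.lower() for token in tokens)

print(unigram_counts["guarantee"], lowercased_counts["the"])
```

Whether to lowercase depends on the analysis: it merges sentence-initial words with their lowercase forms, but it also erases distinctions such as proper nouns.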

### Discussion: what are some phrases that get missed?

### Adding phrases to the bag of words

In [150]:
import phrasemachine
text = "Solyndra received a loan guarantee. The Department of Energy offered the guarantee.".replace(".","")
out = phrasemachine.get_phrases(text)
out

{'counts': Counter({'department of energy': 1, 'loan guarantee': 1}),
 'num_tokens': 12}

In [151]:
# here we are adding the phrases to a copy of the unigram bag of words,
# so that unigram_bag_of_words itself is left unchanged

enriched_bag_of_words = unigram_bag_of_words.copy()
for phrase in out["counts"]:
    enriched_bag_of_words[phrase] = out['counts'][phrase]

enriched_bag_of_words

defaultdict(int,
            {'Department': 1,
             'Energy': 1,
             'Solyndra': 1,
             'The': 1,
             'a': 1,
             'department of energy': 1,
             'guarantee': 2,
             'loan': 1,
             'loan guarantee': 1,
             'of': 1,
             'offered': 1,
             'received': 1,
             'the': 1})
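
An alternative way to combine the two bags, sketched here with `collections.Counter`, is to add the counters together; addition returns a new `Counter` and leaves both inputs untouched. The phrase counts are copied by hand from the phrasemachine output shown above so that the sketch runs without phrasemachine installed.

```python
from collections import Counter

text = "Solyndra received a loan guarantee The Department of Energy offered the guarantee"
unigram_counts = Counter(text.split())

# Phrase counts copied from the phrasemachine output shown above
phrase_counts = Counter({"department of energy": 1, "loan guarantee": 1})

# Counter addition builds a new Counter; neither input is modified
enriched_counts = unigram_counts + phrase_counts

print(len(unigram_counts), len(enriched_counts))
```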

In [4]:
! pip3 install convokit
! wget https://zissou.infosci.cornell.edu/convokit/datasets/supreme-corpus/cases.jsonl -O cases.jsonl
! pip install phrasemachine
! pip install tqdm

--2022-02-10 21:28:57--  https://zissou.infosci.cornell.edu/convokit/datasets/supreme-corpus/cases.jsonl
Resolving zissou.infosci.cornell.edu (zissou.infosci.cornell.edu)... 128.253.51.178
Connecting to zissou.infosci.cornell.edu (zissou.infosci.cornell.edu)|128.253.51.178|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13337468 (13M) [application/octet-stream]
Saving to: ‘cases.jsonl’


2022-02-10 21:28:58 (16.4 MB/s) - ‘cases.jsonl’ saved [13337468/13337468]



## Using phrases for downstream analysis

Now that we know how to extract phrases using phrasemachine, we will see how to use such phrases for downstream analysis. Specifically, we will analyze the ideological orientation of words and phrases in U.S. Supreme Court oral arguments.

At a high level, we will ask: what kinds of things do liberal and conservative justices tend to bring up during oral arguments? What would you expect liberals and conservatives to talk about?

##### Corpus
We will use the [`convokit`](https://convokit.cornell.edu/documentation/supreme.html) corpus of Supreme Court oral arguments. In this notebook, we will only examine comments from liberal and conservative justices from the years 2010-2019.

In [11]:
from convokit import Corpus, download
corpus = Corpus(filename=download("supreme-corpus")) # download the corpus

Downloading supreme-corpus to /root/.convokit/downloads/supreme-corpus
Downloading supreme-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/supreme-corpus/supreme-corpus.zip (1255.8MB)... Done


In this analysis, we will investigate which phrases are used by liberal (L) and conservative (C) justices. So we need a mapping of justices to ideologies, which we construct manually below.

In [7]:
judge2ideology = {'j__john_g_roberts_jr': "C", 
                  'j__samuel_a_alito_jr': "C",
                  'j__ruth_bader_ginsburg': "L",
                  'j__sonia_sotomayor': "L",
                  'j__antonin_scalia': "C",
                  'j__stephen_g_breyer': "L",
                  'j__anthony_m_kennedy': "C",
                  'j__elena_kagan': "L",
                  'j__clarence_thomas': "C",
                  'j__neil_gorsuch': "C",
                  'j__brett_m_kavanaugh': "C"
                  }

We also build a set of all justices in the dataset.

In [10]:
import json

def get_justices(input_file="cases.jsonl"):
    '''Get names of all justices in the dataset'''
    all_justices = set()
    with open(input_file, "r") as inf:
        for j in inf:
            j = json.loads(j)
            if j["votes"] is not None:
                for justice in j["votes"].keys():
                    all_justices.add(justice)
    return all_justices

all_justices = get_justices()

The next step is to extract words and phrases for the justices. In computing this information, we do two things to make the computational requirements more manageable:
1. We limit our analysis to court cases from 2010-2019, which is why you see `u.meta["case_id"][0:3] == "201"` below. 
2. We also only extract phrases from the first 500 characters of each utterance.
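
The two restrictions above are both plain string slices. Here is a small sketch on made-up values; the case ids are hypothetical and only assume, as the filter implies, that a case id begins with a four-digit year.

```python
# Hypothetical case ids (not taken from the corpus)
case_ids = ["2009_08-1234", "2010_09-5678", "2019_18-1111", "2020_19-2222"]

# The first three characters are "201" exactly for the years 2010 through 2019
kept = [case_id for case_id in case_ids if case_id[0:3] == "201"]

# Slicing never raises on short strings: it returns at most 500 characters
long_text = "x" * 700
short_text = "a short remark"
truncated_long = long_text[0:500]
truncated_short = short_text[0:500]

print(kept, len(truncated_long), truncated_short)
```

Note that truncating at a fixed character offset can cut a word or phrase in half, which is one cost of this shortcut.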

In [20]:
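from tqdm.notebook import tqdm # progress bar, imported here because this cell is the first to use it

utterances = [] # build a list of the utterances we are interested in
for utterance_id in tqdm(corpus.get_utterance_ids()):
    u = corpus.get_utterance(utterance_id)
    # keep utterances spoken by a justice in a case from 2010-2019
    if u.speaker.id in all_justices and u.meta["case_id"][0:3] == "201":
        utterances.append(u)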

  0%|          | 0/1700789 [00:00<?, ?it/s]

### Extracting phrases

The code below extracts and counts the phrases used by liberal and conservative justices.

In [67]:
from collections import defaultdict # https://docs.python.org/3/library/collections.html#collections.defaultdict
from tqdm.notebook import tqdm

justice2phrases = defaultdict(lambda: defaultdict(int))

for u in tqdm(utterances):
    phrases = phrasemachine.get_phrases(u.text[0:500])["counts"] # roughly 97.5% are less than 500 chars, and runs way faster
    for p in phrases:
        # we can filter out some filler/stop phrases here, e.g. when the record notes laughter
        if "justice" not in p and "mr." not in p and "minutes" not in p and "laughter" not in p:
            justice2phrases[judge2ideology[u.speaker.id]][p] += phrases[p]

phrasecounts = justice2phrases # count of each phrase, keyed by ideology ("L" or "C")

  0%|          | 0/78598 [00:00<?, ?it/s]

### Extracting words

The code below extracts and counts the unigrams used by liberal and conservative justices.

In [83]:
justice2words = defaultdict(lambda: defaultdict(int))

for u in tqdm(utterances):
    words = u.text[0:500].split()
    for word in words:
        # we can filter out some filler/stop words here, e.g. when the record notes laughter;
        # compare in lowercase because, unlike the phrasemachine output, raw words keep their case
        lowered = word.lower()
        if "justice" not in lowered and "mr." not in lowered and "minutes" not in lowered and "laughter" not in lowered:
            justice2words[judge2ideology[u.speaker.id]][word] += 1

wordcounts = justice2words # count of each word, keyed by ideology ("L" or "C")

  0%|          | 0/78598 [00:00<?, ?it/s]

## Analyzing word use

Now that we have counted words and phrases from liberal and conservative justices, we will analyze differences in the political orientation of words and phrases. Specifically we will:
- Compute statistics about how frequently liberal and conservative justices use particular words and phrases
- Display this information on a plot for analysis 

Our approach is based on the [Fightin' Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf) method from Monroe et al. Specifically, we will use the word importance score from Section 3.2.2 of Fightin' Words. If you are curious, the paper describes other word importance scores.
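
As a rough illustration of the score the code in this section computes (a phrase's share of all liberal counts minus its share of all conservative counts), here is a sketch on toy data. The phrases and counts are invented for illustration and are not drawn from the corpus; the helper name `score` is hypothetical.

```python
from collections import Counter

# Toy counts, invented for illustration only
counts = {
    "L": Counter({"voting rights": 3, "federal government": 1}),
    "C": Counter({"voting rights": 1, "federal government": 3, "text of the statute": 4}),
}

# Total number of counted items per wing: 4 for L, 8 for C
totals = {wing: sum(wing_counts.values()) for wing, wing_counts in counts.items()}

def score(phrase):
    """Normalized frequency among liberals minus normalized frequency among conservatives."""
    return counts["L"][phrase] / totals["L"] - counts["C"][phrase] / totals["C"]

for phrase in ["voting rights", "federal government", "text of the statute"]:
    print(phrase, score(phrase))
```

Under this definition a positive score means the phrase makes up a larger share of liberal speech, and a negative score means it makes up a larger share of conservative speech. Normalizing by each wing's total keeps the comparison fair when one group simply talks more.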

In [133]:
import pandas as pd
from collections import defaultdict

def compute_normalize_counts(_countdict, n):
    '''Divide each count by n[wing], the total count for that wing'''
    normalized_counts = defaultdict(lambda: defaultdict(int))

    for wing in _countdict.keys():
        for p in _countdict[wing]:
            normalized_counts[wing][p] = _countdict[wing][p]/n[wing]

    return normalized_counts

def compute_phrase_scores(normalized, _countdict):
    '''Score each phrase as its normalized L frequency minus its normalized C frequency'''
    df = []
    for wing in list(_countdict.keys()):
        for phrase in list(_countdict[wing]): # list(): the lookups below can add keys to the defaultdicts
            df.append({"score": normalized["L"][phrase] - normalized["C"][phrase], 
                    "phrase": phrase,
                    "count": _countdict["C"][phrase] + _countdict["L"][phrase]}) # http://languagelog.ldc.upenn.edu/myl/Monroe.pdf, 3.2.2
    return pd.DataFrame(df).drop_duplicates()

def getK(_df, k=20):
    '''k > 0: the k lowest scores (most conservative); k < 0: the |k| highest scores (most liberal)'''
    if k > 0:
        return _df.sort_values("score")[0:k].copy()
    else:
        return _df.sort_values("score")[k:].copy()

def get_top_K_df(counts):

    countdict = counts

    # total number of counted items for each wing
    n = {}  
    n["C"] = sum(countdict["C"].values())
    n["L"] = sum(countdict["L"].values())

    normalized_counts = compute_normalize_counts(countdict, n)

    df = compute_phrase_scores(normalized_counts, countdict)

    df = df[df["count"] < 200].copy() # exclude high-count lexical items, roughly stop words

    # add an absolute value of the score
    df["score_abs"] = df["score"].abs()

    tops = pd.concat([getK(df, k=-20), getK(df, k=20)])

    # add a label field to the data frame for altair; every row kept here gets labeled
    tops["label"] = tops["phrase"]

    return tops


tops_phrases = get_top_K_df(phrasecounts)

tops_words = get_top_K_df(wordcounts)

In [147]:
import altair as alt
import pandas as pd


def make_plot(source):

    height = 1000

    points = alt.Chart(source).mark_circle().encode(
        x='count:Q',
        y='score:Q',
        size='score_abs',
        color=alt.Color('score:Q', scale=alt.Scale(scheme='redyellowblue'))
    ).properties(
        width=1200,
        height=height
    )

    text = alt.Chart(source).mark_text(
        align='left',
        baseline='middle',
        dx=7
    ).encode(
        x='count:Q',
        y='score:Q',
        text='label'
    ).properties(
        width=1200,
        height=height
    )

    return points + text

make_plot(tops_phrases)

In [148]:
make_plot(tops_words)

### Discussion

- Comparing the unigram plot to the plot with phrases, what do you notice? 

- Which plot gives you a clearer sense of what justices tend to talk about? This is sometimes called being more "[interpretable](https://arxiv.org/abs/1702.08608)".