### Working with unstructured data (Module 07)


In [59]:
from ptb import TreebankWordTokenizer
# This is NLTK's Penn Treebank tokenizer, packaged as a single file
tok = TreebankWordTokenizer()

#### Adding a covariate

- In text processing, a variable that occurs along with words (but is not a word) is sometimes called a covariate 

- This is similar to the idea of "metadata"

- Example covariates: day of the week, time of day, number of upvotes, author, political party 
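
As a rough sketch of the idea, a covariate can be stored as an extra column next to the word counts. The toy data, the `upvotes` values, and the name `toy` below are made up for illustration and are not taken from the subreddit files.

```python
import pandas as pd

# Hypothetical toy term-document matrix: one row per document, one column per word
toy = pd.DataFrame({"tax": [3, 0, 1], "union": [0, 2, 1]})

# Covariates describe each document but are not words
toy["source_reddit"] = ["libertarian", "socialism", "socialism"]
toy["upvotes"] = [12, 40, 7]  # made-up values

# A covariate lets us aggregate word counts by group
totals = toy.groupby("source_reddit")[["tax", "union"]].sum()
print(totals)
```

Grouping by a covariate is one reason to keep it in the same table as the counts.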

#### Questions

- Make a term document matrix for `libertarian.jsonl`
- Make a term document matrix for `socialism.jsonl`
- This is mostly review from last time, but you should share the vocabulary across the two subreddits
- There are two changes since last time... 

### Change 1: Raw counts 

- Last time we made a binary matrix. This was just to get started.
- This time, instead of a binary matrix, replace the 1s with the counts of each word
- So if a word occurs 5x in a document, the number should be 5 in the term-document matrix
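
The difference between the two matrices can be sketched on a toy corpus. The documents below are hypothetical and already tokenized; `binary` and `counts` are illustrative names, not part of the exercise code.

```python
from collections import Counter

import numpy as np

# Hypothetical toy documents, already tokenized (not from the subreddit files)
docs = [
    ["tax", "tax", "liberty", "tax"],
    ["liberty", "market"],
]

# Fixed vocabulary order so that each word maps to one column
vocab = sorted({token for doc in docs for token in doc})
col = {word: j for j, word in enumerate(vocab)}

binary = np.zeros((len(docs), len(vocab)), dtype=int)
counts = np.zeros((len(docs), len(vocab)), dtype=int)

for i, doc in enumerate(docs):
    for word, n in Counter(doc).items():
        binary[i, col[word]] = 1  # last time: is the word present or not?
        counts[i, col[word]] = n  # this time: how many times does it occur?

print(vocab)
print(binary)
print(counts)
```

Here "tax" occurs three times in the first document, so its cell holds 3 in `counts` but only 1 in `binary`.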

### Change 2: Stop words 

Before we start, we will add one new thing: stop words. A stop word is a common word (such as "the" or "of") that is excluded from analysis; excluding stop words is standard practice in NLP. There are many stop word lists out there. We will use a [common list](https://gist.github.com/sebleier/554280) from NLTK. 

Start off by downloading the list using requests. Hint: click "Raw" on GitHub to get a link to the raw data.

Once you have a stop word list, ignore the stop words as you read in the tokens from your file. Lowercase each token before checking whether it is a stop word.
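
The filtering step might look roughly like the sketch below. The mini stop word list and the helper `drop_stop_words` are hypothetical stand-ins; the real list comes from the downloaded file.

```python
import string

# Hypothetical mini stop word list; the real one comes from the downloaded file
toy_stop_words = {"the", "is", "a", "of"} | set(string.punctuation)


def drop_stop_words(tokens, stop_words):
    """Keep tokens whose lowercased form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["The", "state", "is", "a", "monopoly", "of", "force", "."]
kept = drop_stop_words(tokens, toy_stop_words)
print(kept)
```

Note that "The" is dropped even though the list only contains "the", because the comparison uses the lowercased token.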

In [80]:
import requests
import string

url = "https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords"
r = requests.get(url).text
# Use a set for fast membership tests; skip blank lines and treat punctuation as stop words
stop_words = {w.strip().lower() for w in r.split("\n") if w.strip()} | set(string.punctuation)

In [81]:
import json

def get_D_and_V(fn):
    V = set()
    D = 0

    with open(fn, "r") as inf:
        for doc in inf:
            D += 1
            doc = json.loads(doc)
            for token in tok.tokenize(doc["body"]):
                if token.lower() not in stop_words:
                    V.add(token)

    V = sorted(V)  # sets have no guaranteed order, so sort for a consistent one
    
    return D, V

D1, V1 = get_D_and_V(fn="libertarian.jsonl")
D2, V2 = get_D_and_V(fn="socialism.jsonl")

V = sorted(set(V1) | set(V2))  # shared vocabulary across both subreddits, in a fixed order

n2v = {k:v for k, v in enumerate(V)}
v2n = {v:k for k, v in enumerate(V)}

### Make a tdm for each subreddit, sharing vocabulary
- Remember to skip stop words
- Remember to make a matrix of counts, not a binary matrix

In [82]:
import numpy as np
import pandas as pd

def make_tdm(_D, _V, fn):
    # One row per document, one column per vocabulary word
    out = np.zeros((_D, len(_V)))
    col = {word: j for j, word in enumerate(_V)}  # word -> column index

    with open(fn, "r") as inf:
        for docno, doc in enumerate(inf):
            doc = json.loads(doc)
            for token in tok.tokenize(doc["body"]):
                if token.lower() not in stop_words:
                    out[docno, col[token]] += 1  # count every occurrence

    return pd.DataFrame(data=out, columns=_V)

tdm_libertarian_df = make_tdm(D1, V, fn="libertarian.jsonl")
tdm_socialism_df = make_tdm(D2, V, fn="socialism.jsonl")

In [83]:
### Add in covariates, and make a big dataframe

tdm_libertarian_df["source_reddit"] = "libertarian"
tdm_socialism_df["source_reddit"] = "socialism"

tdm = pd.concat([tdm_libertarian_df, tdm_socialism_df])

### Question 

- What does the tdm represent, now that we added the `source_reddit` covariate?

### Question 

- What are the top terms, based on raw count in the libertarian subreddit?
- What are the top terms, based on raw count in the socialism subreddit?
- Do you think it helps that you removed stop words?
- Can you think of ways to expand the stop word list that might help?

In [84]:
counts_socialism = tdm[tdm["source_reddit"] == "socialism"]
counts_socialism = counts_socialism.drop(columns=["source_reddit"])
counts_socialism = counts_socialism.sum(axis=0)
counts_socialism = pd.DataFrame(counts_socialism)
counts_socialism = counts_socialism.rename(columns={0: "frequency"})
counts_socialism.sort_values(by="frequency", ascending=False)

term,frequency
n't,5185.0
's,4784.0
people,2707.0
'',2613.0
``,2394.0
...,...
transference,0.0
epitome,0.0
**man**^,0.0
backhoe,0.0


In [85]:
counts_libertarian = tdm[tdm["source_reddit"] == "libertarian"]
counts_libertarian = counts_libertarian.drop(columns=["source_reddit"])
counts_libertarian = counts_libertarian.sum(axis=0)
counts_libertarian = pd.DataFrame(counts_libertarian)
counts_libertarian = counts_libertarian.rename(columns={0: "frequency"})
counts_libertarian.sort_values(by="frequency", ascending=False)

term,frequency
n't,4617.0
's,4290.0
gt,2607.0
'',2606.0
``,2316.0
...,...
percieve,0.0
circlejerking,0.0
medlem,0.0
determinism.,0.0


#### tf-idf

- Intuition: if a word is common everywhere, then a high frequency is not so meaningful.
- The fact that "because" shows up a lot in the libertarian subreddit is not that important
- In text processing, it is common to discount a word's score by dividing it by the number of documents in which the word occurs
- We can build some intuition for this on the whiteboard
- For our purposes, we will define the inverse document frequency (idf) as 1/Dw, where Dw is the number of documents that contain the word across the whole corpus (both subreddits)
- If 10 documents contain the word "#jurynullification", what is Dw for "#jurynullification"?
- What do you think the Dw for "because" would be?
- To get a tf-idf score, we multiply the term frequency (tf) by the idf score.
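
A small worked example may help with the arithmetic. The toy matrix below is hypothetical (4 documents, 2 words), and the names `toy_tdm`, `tf`, `Dw`, and `tfidf` are illustrative; the idf is the 1/Dw definition from this section, whereas many libraries use a logarithmic variant instead.

```python
import pandas as pd

# Hypothetical toy TDM: 4 documents (rows), 2 words (columns)
toy_tdm = pd.DataFrame(
    {"because": [2, 1, 3, 1], "#jurynullification": [0, 0, 4, 0]}
)

tf = toy_tdm.sum(axis=0)        # total count of each word
Dw = (toy_tdm > 0).sum(axis=0)  # number of documents containing each word
idf = 1 / Dw                    # the 1/Dw definition used in this section
tfidf = tf * idf

print(pd.DataFrame({"tf": tf, "Dw": Dw, "idf": idf, "tfidf": tfidf}))
```

In this toy data "because" has the higher raw count (7 versus 4), but it appears in every document, so its score is discounted to 7 × 1/4 = 1.75, while "#jurynullification" keeps 4 × 1/1 = 4.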

### Compute idf 

- Compute the idf for each word from the TDM

In [86]:
# Dw: number of documents (rows) in which each word occurs at least once
mx = (tdm.drop(columns=["source_reddit"]) > 0).sum(axis=0)
D = tdm.shape[0]
idf = pd.DataFrame({"idf": 1 / mx})

### Compute tf-idf scores

- Compute tf-idf scores for each word from the TDM

In [91]:
counts_libertarian["tfidf"] = counts_libertarian["frequency"] * idf["idf"]
counts_socialism["tfidf"] = counts_socialism["frequency"] * idf["idf"]

### Handling rare words 

- tf-idf and other metrics tend to boost word importance scores for rare words (why?)
- One way to handle this is to ignore words that occur in only 1 or 2 documents
- Modify your code to ignore rare words
- There are a few ways to do this
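
One possible approach, sketched on hypothetical toy data, is to drop rare columns from the matrix before computing any scores. The cutoff `min_docs = 3` and the names `toy_tdm` and `filtered_tdm` are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical toy TDM: 4 documents, 3 words
toy_tdm = pd.DataFrame(
    {"people": [1, 2, 1, 3], "backhoe": [0, 5, 0, 0], "state": [1, 0, 2, 1]}
)

min_docs = 3  # assumed cutoff: keep words that occur in at least 3 documents

# Document frequency: in how many documents does each word occur?
doc_freq = (toy_tdm > 0).sum(axis=0)

# Keep only the columns whose document frequency meets the cutoff
common_words = doc_freq[doc_freq >= min_docs].index
filtered_tdm = toy_tdm[common_words]

print(list(filtered_tdm.columns))
```

Filtering columns up front removes the rare words everywhere at once, whereas setting their idf to 0 keeps them in the matrix but gives them a score of 0.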

In [92]:
# idf > 0.25 means Dw < 4, so this zeroes out words that occur in 3 or fewer documents
idf.loc[idf["idf"] > 0.25, "idf"] = 0
counts_libertarian["tfidf"] = counts_libertarian["frequency"] * idf["idf"]
counts_socialism["tfidf"] = counts_socialism["frequency"] * idf["idf"]

In [93]:
counts_socialism.sort_values("tfidf", ascending=False)[0:100].index

Index(['Mr', 'feat.', 'Turkish', 'hip', 'Trots', 'ft.', 'hop.', 'Poland',
       'Bookchin', 'EU', 'Harvey', 'cites', 'Community', 'absentee', 'Kliman',
       'Bolivar', 'Rosa', 'deed', 'Proudhon', 'Berlin', 'bourgeosie', 'coca',
       'Proyect', 'hop', 'Bioshock', 'FARC', 'LTRPF', 'er', 'stereotypes',
       'Hamas', 'Eurozone', 'moneyless', 'intrinsically', 'Costs', 'Rise',
       'Verso', 'Tijoux', 'soc', 'Buddhist', 'marxists', 'delusions',
       'left-liberal', 'Scottish', 'WSWS', 'Technique', 'Immortal', 'Cuban',
       'Labour', 'Yugoslavia', 'Guevara', 'rap', 'parliaments', 'IRC', 'Venus',
       'editors', 'battalion', 'uncompromising', 'Alice', 'tribalism', 'USSR.',
       'obesity', 'vouchers', 'Cuba', 'organising', 'NATO', 'blah', 'Marxism',
       'innovation.', 'Vox', 'PKK', 'halls', 'accumulate', 'Comunista',
       'Reconstrucción', 'patriarchal', 'comfort/entertainment', 'achievable',
       'Ana', 'Feminists', 'deficit', 'Mos', 'rewarding', 'Lupe', 'Buddhism',
    

In [94]:
counts_libertarian.sort_values("tfidf", ascending=False)[0:100].index

Index(['^|', 'Nexium', 'quot', 'hijackers', 'mg', '009', 'ESA', 'glasses',
       'Prilosec', 'Gilded', 'NRA', 'passengers', 'temp', 'amp', 'DNA',
       'airlines', '039', 'incestuous', '/message/compose', 'to=autowikibot',
       'Conley', 'WTC', '^or', 'OTC', 'incest', 'message=', '^delete',
       'subject=AutoWikibot', 'herd', 'stickers', 'FISA', 'prescription',
       '*****', 'soda', 'ID', 'sorority', 'Barr', '^libertarianism', 'FICA',
       '2F', 'MJ', 'elective', '^of', 'encryption', 'plane.', 'Gardner',
       'exchanges', 'tl', '//en.wikipedia.org/wiki/Libertarianism', 'refund',
       'charter', 'sexuality.', 'heroin', 'Musk', 'temps', 'Pentagon',
       'marriage', 'lenses', 'airline', 'GM', 'planes', 'Kasich',
       'announcement', 'Air', 'Towers', 'nihilist', 'nuisance', 'vaccination',
       'Flight', 'statute', 'Biden', 'blah', '10th', 'standard.', 'resolved',
       '14th', 'evaluating', 'dwelling', 'SWAT', 'joints', 'Josh', 'weed.',
       'encrypted', 'Richmond', 

### Comparing raw counts w/ tf-idf

- Compare the raw counts with the tf-idf scaled words
- Which seems to do a better job capturing what is distinctive about each subreddit?

### Phrases 

- If we have time, we can consider how to redo this analysis using phrases 
- Why might we want to do that? 
- Some [software](https://github.com/slanglab/phrasemachine) for this...
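
As a very rough starting point, adjacent word pairs (bigrams) can stand in for phrases. This is a simpler technique than what the linked software does, and the helper `bigrams` and the toy sentence are hypothetical.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs joined by a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


# Hypothetical tokenized document
tokens = ["minimum", "wage", "laws", "raise", "the", "minimum", "wage"]
phrase_counts = Counter(bigrams(tokens))

print(phrase_counts.most_common(3))
```

The bigram counts could then fill the columns of a term-document matrix in place of single-word counts, so that "minimum wage" is scored as one unit.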