### Working with unstructured data (Module 07)


In [1]:
from ptb import TreebankWordTokenizer
# The Penn Treebank tokenizer from NLTK, packaged as a single file
tok = TreebankWordTokenizer()

#### Adding a covariate

- In text processing, a variable that occurs along with words (but is not a word) is sometimes called a covariate 

- This is similar to the idea of "metadata"

- Example covariates: day of the week, time of day, number of upvotes, author, political party 
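
As a rough illustration of the idea, the sketch below stores a covariate as an extra, non-word column next to the word counts. The toy counts and the `toy`, `word_columns`, and `totals` names are hypothetical; only the `source_reddit` column name comes from this notebook.

```python
import pandas as pd

# Hypothetical toy data: two documents, two vocabulary terms
toy = pd.DataFrame({"tax": [2, 0], "worker": [0, 3]})

# The covariate is a non-word column stored alongside the word counts
toy["source_reddit"] = ["libertarian", "socialism"]

# Keep track of which columns are words and which are covariates
word_columns = [c for c in toy.columns if c != "source_reddit"]

# A covariate lets us aggregate word counts by group
totals = toy.groupby("source_reddit")[word_columns].sum()
print(totals)
```

Keeping the covariate in the same dataframe makes it easy to filter or group documents later, as long as it is dropped before doing arithmetic on the word columns.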

Questions:

- Make a term-document matrix for `libertarian.jsonl`
- Make a term-document matrix for `socialism.jsonl`
- Share one vocabulary between the two subreddits
- This is mostly review from last time, with two changes: raw counts and stop words

### Raw counts 

- Instead of a binary matrix, replace the 1s with the counts of each word
- So if a word occurs 5x in a document, the number should be 5 in the term-document matrix
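
A minimal sketch of the difference between a binary row and a raw-count row, assuming a hypothetical tokenized document and a hypothetical four-word vocabulary:

```python
from collections import Counter

# Hypothetical tokenized document and vocabulary
tokens = ["free", "market", "free", "trade", "free"]
vocab = ["free", "market", "trade", "union"]

counts = Counter(tokens)  # missing words count as 0

# Binary row: 1 if the word occurs at all
binary_row = [1 if counts[w] > 0 else 0 for w in vocab]

# Raw-count row: how many times the word occurs
count_row = [counts[w] for w in vocab]

print(binary_row)
print(count_row)
```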

### Stop words 

Before we start, we will add one new thing: stop words. A stop word is a common word that is excluded from analysis in text processing, and excluding stop words is common practice in NLP. There are many stop word lists out there. We will use a [common list](https://gist.github.com/sebleier/554280) from NLTK.

Start by downloading the list using `requests`. Hint: click "Raw" on GitHub to get a link to the raw data.

Then add all of the punctuation characters from the `string` module.

In [2]:
import requests
import string

url = "https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords"
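r = requests.get(url).text
# One stop word per line in the gist, plus every punctuation character
stop_words = r.split("\n") + list(string.punctuation)
# Lowercase everything and store as a set for fast membership tests
stop_words = {i.lower() for i in stop_words}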

- When you read in the tokens from your file, this time ignore the stop words. Lowercase each token before checking whether it is a stop word.
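
The filtering step can be sketched on a toy example. The miniature stop word set and the token list are hypothetical stand-ins for the downloaded list and a tokenized post:

```python
import string

# Hypothetical miniature stop word list plus punctuation
toy_stop_words = {"the", "is", "a"} | set(string.punctuation)

# Hypothetical tokenized document
tokens = ["The", "market", "is", "free", "."]

# Lowercase only for the check; keep the token's original casing
kept = [t for t in tokens if t.lower() not in toy_stop_words]
print(kept)
```

Note that "The" is removed even though only "the" is in the list, because the comparison is done on the lowercased token.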

In [3]:
import json

def get_D_and_V(fn):
    V = set()
    D = 0

    with open(fn, "r") as inf:
        for doc in inf:
            D += 1
            doc = json.loads(doc)
            for token in tok.tokenize(doc["body"]):
                if token.lower() not in stop_words:
                    V.add(token)

    V = sorted(V)  # sets are unordered, so sort to get a consistent order
    
    return D, V

D1, V1 = get_D_and_V(fn="libertarian.jsonl")
D2, V2 = get_D_and_V(fn="socialism.jsonl")

V = sorted(set(V1 + V2))  # shared vocabulary, in a consistent order

n2v = {k:v for k, v in enumerate(V)}
v2n = {v:k for k, v in enumerate(V)}
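
To see what the shared vocabulary and the two lookup dictionaries look like, here is a small sketch with hypothetical two-word vocabularies. The `_toy` names are illustrative only, and sorting is used here so the column order is reproducible.

```python
# Hypothetical per-subreddit vocabularies
V1_toy = ["tax", "market"]
V2_toy = ["worker", "market"]

# Union of both vocabularies; "market" appears only once
V_toy = sorted(set(V1_toy + V2_toy))

# Word -> column number, and column number -> word
v2n_toy = {v: k for k, v in enumerate(V_toy)}
n2v_toy = {k: v for k, v in enumerate(V_toy)}

print(V_toy)
print(v2n_toy)
```

Because both subreddits use the same word-to-column mapping, column `j` means the same word in both term-document matrices, which is what allows them to be stacked later.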

### Make a tdm for each subreddit, sharing vocabulary

In [None]:
import numpy as np
import pandas as pd

def make_tdm(_D, _V, fn):
    # One row per document, one column per vocabulary term
    out = np.zeros((_D, len(_V)))

    with open(fn, "r") as inf:
        for docno, doc in enumerate(inf):
            doc = json.loads(doc)
            for token in tok.tokenize(doc["body"]):
                if token.lower() not in stop_words:
                    # Raw counts: add 1 each time the token occurs
                    out[docno][v2n[token]] += 1

    out = pd.DataFrame(data=out, columns=_V)

    return out

tdm_libertarian_df =  make_tdm(D1, V, fn="libertarian.jsonl")
tdm_socialism_df =  make_tdm(D2, V, fn="socialism.jsonl")

In [None]:
### Add in covariates, and make a big dataframe

tdm_libertarian_df["source_reddit"] = "libertarian"
tdm_socialism_df["source_reddit"] = "socialism"

tdm = pd.concat([tdm_libertarian_df, tdm_socialism_df])

#### Question 

- What does the tdm represent? 

### Question 

- What are the top terms, based on raw count in the libertarian subreddit?
- What are the top terms, based on raw count in the socialism subreddit?
- Do you think it helps that you removed stop words?
- Can you think of ways to expand the stop word list that might help?

In [None]:
counts_socialism = tdm[tdm["source_reddit"] == "socialism"]
counts_socialism = counts_socialism.drop(columns=["source_reddit"])
counts_socialism = counts_socialism.sum(axis=0)  # total count of each term
counts_socialism = pd.DataFrame(counts_socialism)
counts_socialism = counts_socialism.rename(columns={0: "frequency"})
counts_socialism.sort_values(by="frequency", ascending=False)

In [None]:
counts_libertarian = tdm[tdm["source_reddit"] == "libertarian"]
counts_libertarian = counts_libertarian.drop(columns=["source_reddit"])
counts_libertarian = counts_libertarian.sum(axis=0)  # total count of each term
counts_libertarian = pd.DataFrame(counts_libertarian)
counts_libertarian = counts_libertarian.rename(columns={0: "frequency"})
counts_libertarian.sort_values(by="frequency", ascending=False)

In [None]:
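# Document frequency: the number of documents each term appears in
mx = (tdm.drop(columns=["source_reddit"]) > 0).sum(axis=0)
D = tdm.shape[0]  # total number of documents across both subreddits
idf = pd.DataFrame({"idf": np.log(D / mx)})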
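
The inverse document frequency computed here is the log of the total number of documents divided by the number of documents containing the term. A small sketch on a hypothetical four-document matrix (the `toy_tdm`, `doc_freq`, `n_docs`, and `toy_idf` names are illustrative) shows how a rarer term gets a larger weight:

```python
import numpy as np
import pandas as pd

# Hypothetical counts: 4 documents, 2 terms
toy_tdm = pd.DataFrame({"market": [2, 1, 0, 4], "union": [0, 0, 3, 0]})

# Number of documents in which each term appears at least once
doc_freq = (toy_tdm > 0).sum(axis=0)

n_docs = toy_tdm.shape[0]

# idf = log(number of documents / document frequency)
toy_idf = np.log(n_docs / doc_freq)
print(toy_idf)
```

Here "market" appears in 3 of 4 documents and "union" in only 1, so "union" receives the higher idf.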

In [None]:
counts_libertarian["tfidf"] = counts_libertarian["frequency"] * idf["idf"]
counts_socialism["tfidf"] = counts_socialism["frequency"] * idf["idf"]

In [None]:
counts_socialism.sort_values("tfidf", ascending=False)[0:100].index

In [None]:
counts_libertarian.sort_values("tfidf", ascending=False)[0:100].index

In [5]:
counts_libertarian
