# Using RAKE to extract keywords from documents

One quirk to be aware of: the python-rake implementation treats newlines as ordinary whitespace, whereas a newline in the OCR'd documents may or may not mark a semantic break. Extracted phrases can therefore run across line or paragraph boundaries.
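One possible way to reduce this ambiguity is to split each document on blank lines before extracting keywords. The sketch below assumes that a blank line marks a paragraph break and that a single newline is only a line wrap; that assumption has not been checked against these documents. `split_paragraphs` is a hypothetical helper, not part of python-rake.

```python
import re

def split_paragraphs(text):
    """Split text on blank lines and collapse other whitespace runs to one space."""
    # Assumption: a blank line is a semantic break, a single newline is a line wrap.
    paragraphs = re.split(r'\n\s*\n', text)
    cleaned = (re.sub(r'\s+', ' ', p).strip() for p in paragraphs)
    return [p for p in cleaned if p]

sample = "Chicago Transit\nAuthority\n\nRegular meeting\nof the board"
print(split_paragraphs(sample))
```

Each resulting paragraph could then be passed to the extractor separately, so that no phrase crosses a paragraph boundary.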

This is a quick way to get a sense of a small corpus. RAKE's multi-word keywords tend to be more informative than single-token approaches such as unigram tf-idf.

Here, we round RAKE's keyword scores to integers so that they can be summed as counts across documents, giving a crude weighting of keyword importance.
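As a small illustration of that weighting, the sketch below uses made-up `(keyword, score)` pairs in place of real RAKE output; the phrases and scores are hypothetical and chosen only to show how rounding and accumulation interact.

```python
from collections import Counter

# Hypothetical (keyword, score) pairs standing in for RAKE output on two documents.
doc_scores = [
    [('transit board', 4.4), ('public comment', 3.6)],
    [('transit board', 4.2), ('executive session', 0.4)],
]

weights = Counter()
for scored in doc_scores:
    # Counter.update with a mapping adds the values to the existing counts.
    weights.update({k: round(v) for k, v in scored})

print(weights.most_common(2))
```

A phrase that scores well in several documents accumulates weight, while a score below 0.5 rounds to zero and contributes nothing, which is part of what makes the scheme crude.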

In [22]:
import os

DATADIR = '../data/DocumentCloud/subset'

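def documents(datadir=DATADIR):
    """Yield the full text of each file in datadir."""
    for fn in os.listdir(datadir):
        # Use a context manager so each file handle is closed after reading.
        with open(os.path.join(datadir, fn)) as f:
            yield f.read()

docs = list(documents())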

In [21]:
import RAKE
from collections import Counter

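keywords = Counter()
rake = RAKE.Rake(RAKE.SmartStopList())
for doc in docs:
    # Round each score to an integer weight; skip phrases that span a blank line.
    keywords.update({k: round(v) for k, v in rake.run(doc) if '\n\n' not in k})
# Replace any remaining newlines inside a phrase with spaces for display.
print([k.replace('\n', ' ') for k, v in keywords.most_common(30)])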

['chicago transit authority office', 'chicago transit authority', '567 west lake street', 'unanimous voice vote', 'sales award recommendations', 'chicago', 'employee retirement review committee', 'ordinance authorizing', 'cta related policy matters', 'employee retirement review committee 567', 'secretary notice', 'committee', 'transit board', 'lake street board room', 'meeting', 'regular meeting', '“transit board meetings”', 'chicago transit board', '“meeting notices', 'public comment', 'directors terry peterson', 'posted agenda', 'strategic planning', 'sales award recommendations 2', 'executive session', 'real estate matters', 'capital construction projects', 'general operations issues 4', 'chairman peterson asked', 'arabel alva rosales']
