## prep

Step 1: request an API key

https://regulationsgov.github.io/developers/key/

Assign the key as a quoted string called `API_KEY`.

In [1]:
from my_secrets.keys import regulations as API_KEY

Step 2: download all NLTK data. This is necessary for the algorithms to work since they are based on *corpora* that are used in pattern matching.

http://www.nltk.org/data.html

In [2]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## get comments

Let's pick a docket id and search for all comments.

In [7]:
import requests

In [52]:
document_url = 'https://api.data.gov/regulations/v3/documents.json'

payload = {
    'api_key': API_KEY,
    'dckid': 'EPA-HQ-OAR-2010-0505',
    'dct': 'PS',
    'rpp': 200
}

In [53]:
response = requests.get(document_url, params=payload)
documents = response.json()
documents.keys()

dict_keys(['documents', 'totalNumRecords'])

This docket has over 6 million public submissions.

With the potentially large volume of public comments it is best to filter first by using the `'s'` search query in the `payload`.

In [54]:
documents['totalNumRecords']

6564595

## natural language

Let's explore the 200 we received.

In [55]:
comments = documents['documents']
len(comments)

200

In [56]:
a_comment = comments[3]
a_comment

{'agencyAcronym': 'HHS',
 'allowLateComment': False,
 'attachmentCount': 0,
 'commentDueDate': '2011-09-30T23:59:59-04:00',
 'commentStartDate': '2011-08-03T00:00:00-04:00',
 'commentText': 'Pregnancy is not a disease, and drugs and surgeries to prevent it are not basic health care that the government should require all Americans to purchase.  Please remove sterilization and prescription contraceptives from the list of "preventive services" the federal government is mandating in private health plans.  It is especially important to exclude any drug that may cause an early abortion, and to fully respect religious freedom as other federal laws do.  The narrow religious exemption in HHS\'s new rule protects almost no one. I urge you to allow all organizations and individuals to offer, sponsor and obtain health coverage that does not violate their moral and religious convictions.',
 'docketId': 'HHS-OS-2011-0023',
 'docketTitle': 'Group Health Plans and Health Insurance Issuers Relating to 

In [57]:
text = a_comment['commentText']

Now let's import some useful tools.

In [58]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stoplist = stopwords.words('english')

It is considered best practice to first normalize the case of the text, unless performing parts of speech. 

Notice that punctuation are separated from the words.

In [59]:
words = word_tokenize(text.lower())
words[:30]

['pregnancy',
 'is',
 'not',
 'a',
 'disease',
 ',',
 'and',
 'drugs',
 'and',
 'surgeries',
 'to',
 'prevent',
 'it',
 'are',
 'not',
 'basic',
 'health',
 'care',
 'that',
 'the',
 'government',
 'should',
 'require',
 'all',
 'americans',
 'to',
 'purchase',
 '.',
 'please',
 'remove']

To filter out to important words we need to
1. remove short words, called `stopwords`
2. remove punctuation

In [60]:
important_words = []
for word in words:
    
    # skip anything that does not contain letters
    if not word.isalpha():
        continue
    
    if word not in stoplist:
        important_words.append(word)

In [61]:
important_words[:10]

['pregnancy',
 'disease',
 'drugs',
 'surgeries',
 'prevent',
 'basic',
 'health',
 'care',
 'government',
 'require']

Let's iterate over all of the comments and find the important words.

In [62]:
def important(comment):
    words = word_tokenize(comment.lower())
    
    important_words = []
    for word in words:
    
        # skip anything that does not contain letters
        if not word.isalpha():
            continue
    
        if word not in stoplist:
            important_words.append(word)
    
    return important_words

In [63]:
all_important_words = []

for comment in comments:
    
    # some items have not text?
    if 'commentText' in comment:
        text = comment['commentText']
    
    all_important_words.extend(important(text))   

In [64]:
len(all_important_words)

13568

## common words

Now that we have a flat list of all of the important words let's perform some statistical analysis.

The `FreqDist` method let's create frequency distributions to determine the most common words. The stop words and punctuation have already been filtered out.

In [65]:
from nltk import FreqDist

In [66]:
distributions = FreqDist(all_important_words)

Here are the 10 most common words.

In [67]:
distributions.most_common(10)

[('national', 138),
 ('public', 95),
 ('would', 89),
 ('health', 75),
 ('monuments', 73),
 ('please', 67),
 ('one', 64),
 ('act', 61),
 ('proposed', 57),
 ('use', 55)]

And the 10 most infrequently used words.

In [68]:
distributions.most_common()[-10:]

[('mainland', 1),
 ('hawaii', 1),
 ('island', 1),
 ('listen', 1),
 ('deaf', 1),
 ('interim', 1),
 ('whats', 1),
 ('fructose', 1),
 ('syrup', 1),
 ('catch', 1)]