# Natural Language Processing Techniques on Public Records Requests Data

In [1]:
import spacy
import pandas as pd
import xlrd
#from sklearn.manifold import TSNE
from spacy.tokenizer import Tokenizer
from gensim.models import Word2Vec, ldamodel
from gensim.corpora import Dictionary
from collections import Counter
import itertools
from spacy.lang.en.stop_words import STOP_WORDS
import re
import time
import pprint
pp = pprint.PrettyPrinter()

# this isn't strictly an import, but it's used globally
# so it's up here. change the model for different word vectors
nlp = spacy.load('en_core_web_lg')

# Cleaning and structuring data for analysis

At the onset of this project, we obtained data extracts from a number of institutions and municipalities of different sizes, with the intention of collating them into one repository for analysis. Upon receipt, there were a handful of things that we realized, which dramatically reduced the kind of analysis that we could perform on the data.

1. **Government bodies have different reporting standards for metadata.** I believe this is related to the platform/infrastructure the institution is using to fulfill and manage requests. The fields of interest, as well as the types of data accessible within those fields, are often so distinct that they cannot be combined in a straightforward way. For example, some municipalities have tons of great metadata reporting, especially around the estimated amount of time and actual amount of time that it took to fulfill a request.

2. **Inputs are not standardized.** Both people and institutions submit public records requests, which means that there are differing degrees of detail, complexity, and formatting. Some people submit a few words or a case number in their request, while others copy-paste several spreadsheet columns. 

3. **There is personal identifying information everywhere.** The more I dug into records requests, the more I saw people signing their requests with their name, address, and contact information. It's hard to strip out this information with software, especially because requests ask for information on a particular person or named entity. This is problematic for open data projects, namely because 

4. **All data are equal, but some data are more equal than others.** This project charts out some methods that can be used across departments that have different numbers of records requests, but there's little we can do (and little inference we can provide) with a solitary request.

The confluence of all of these factors effectively means that it's hard to look for trends. *Not great.*

It's really important to clean and harmonize the data as best we can. As such, we ended up throwing away most of the data that we had and working with a much smaller subset. After importing the data into a pandas dataframe, we kept:

- The department name (a string);
- Record creation date (a datetime object)
- Request summary (a string)

In [2]:
# this section loads an excel sheet.
# it will be updated to connect to a database in the future.
#
data = pd.read_excel('reformatted.xlsx')

In [3]:
# reindex the data above, using only the columns we care about
truncated = data[['department_name', 'create_date', 'request_summary']]

# then drop all of the null values
truncated = truncated.dropna()

# number of entries in the truncated data block
len(truncated)

#sum(truncated.request_type_description != truncated.spd_overall_rec_req_description)

23399

In [5]:
# here's a list of all of the possible departments we can choose from
# 
depts = truncated.department_name.value_counts()
deptsdf = pd.DataFrame(depts)
deptsdf['pct'] = (deptsdf['department_name'] / len(truncated)).round(5)
deptsdf

Unnamed: 0,department_name,pct
SPD,13789,0.5893
SFD,3976,0.16992
Site Administrator,939,0.04013
SCI,809,0.03457
FAS,740,0.03163
DOT,667,0.02851
LAW,297,0.01269
LEG,294,0.01256
SCL,279,0.01192
SPU,270,0.01154


In [33]:
# departments to handle differently because they have a large number of req's
# SPD; FAS; SFD; HSD; DOT; LAW
spd = truncated[truncated.department_name != 'SPD']
zzz = truncated[truncated.department_name == 'ZZZ']
#truncated.groupby(['department_name', 'request_type_description']).count()

# Project 0: Helper functions and generating a word cloud for each department

In [34]:
def sentences(df, col):
    # ARGUMENTS: df; a pandas dataframe
    # col, a string; a column of a particular pandas dataframe
    #
    sents = []
    #
    # this function takes information out of a table
    # and dumps it into a list for further processing
    for row in df[col]:
        sents.append(str(row))
    return sents

def nlp_proc(data):
    # ARGUMENTS: data, either a string or a list
    # OUTPUTS: spacy processed text data
    #
    # this checks to see if the data we are processing is a list
    # and if it is, runs the NLP function on the different huge
    # strings in the list, returning the list
    if type(data) == list:
        nlplist = [ nlp(i) for i in data ]
        return nlplist
    else:
        # otherwise it just processes the data; just a backup
        return nlp(data)
    
def filter_noise(token, mtl=3, cs=False):
    # ARGUMENTS: token; a spacy token
    # optional: mtl (minimum token length); an int
    # optional: cs (custom stop); a boolead
    # OUTPUTS: T/F
    #
    is_noise = False
    # this function performs a series of checks to see if 
    # a token is noise and then returns t/f
    # essentially it's a giant switch
    #
    #
    noisy_pos_tags = ["PROP", "PRON"]
    #
    # here's a regular expression for matching dates/times from a string
    # spacy doesn't handle that task well
    dates = re.compile('\d{1,2}(?P<sep>[-/])\d{1,2}(?P=sep)\d{2,4}')
    times = re.compile(r'\d{1,2}(:\d{1,2})?(am|pm)?')
    #
    # these are some stop words that occur pretty frequently across docs;
    # it might make sense to expand these further
    custom_stop_words = ['this', 'that', 'please', 'be', 'file',
                         'copy', 'with', 'from', 'document', 'like',
                         'have', 'other', 'thank', 'and/or', 'record']
    #
    #
    # cs refers to 'custom stop', and adds the stopwords to the spacy
    # list if true
    if cs == True:
        for stopword in custom_stop_words:
            lexeme = nlp.vocab[stopword]
            lexeme.is_stop = True
    #
    # measures length of token; default is 3
    if len(token.text) <= mtl:
        is_noise = True
    # filters based on list above
    elif token.pos_ in noisy_pos_tags:
        is_noise = True
    elif token.text == '-PRON-':
        is_noise = True
    elif bool(dates.findall(token.text)) == True:
        is_noise = True
    elif bool(times.findall(token.text)) == True:
        is_noise = True
    # filters stop words
    elif token in STOP_WORDS:
        is_noise = True
    elif token.is_stop == True:
        is_noise = True
    # filters things that are/look like numbers
    elif token.is_digit == True:
        is_noise = True
    elif token.is_currency == True:
        is_noise = True
    elif token.like_num == True:
        is_noise = True
    # filters web stuff
    elif token.like_url == True:
        is_noise = True
    elif token.like_email == True:
        is_noise = True
    # filters punctuation
    elif token.is_punct == True:
        is_noise = True
    elif token.is_left_punct == True:
        is_noise = True
    elif token.is_right_punct == True:
        is_noise = True
    elif token.is_bracket == True:
        is_noise = True
    elif token.is_quote == True:
        is_noise = True
    elif token.is_space == True:
        is_noise = True
    return is_noise 


def lem_stop(text, l=True):
    # ARGUMENTS: text; a list containing lists, containing spacy tokens
    # l; a boolean flag, True by default
    # OUTPUTS: a list containing lists, containing lemmas of tokens
    #
    # empty list, later to become a list of lists
    txt = []
    #
    for sentences in text:
        if l == True:
            # if the default isn't changed,
            # check to see if a word is noise; if not, keep the lemma form in the list
            sent = [token.lemma_ for token in sentences if filter_noise(token) == False]
        else:
            # if the defaults are different, just add the complete token
            sent = [token for token in sentences if filter_noise(token) == False] 
        # then append the sentence to the list
        txt.append(sent)
    return txt

def ner(text):
    # ARGUMENTS: text; a list containing lists, containing spacy tokens
    # OUTPUTS: a list containing lists, containing named entities
    #
    # empty list, later to become a list of lists
    txt = []
    #
    for sentences in text:
        sent = [entity for entity in sentences.ents] 
        # then append the sentence to the list
        txt.append(sent)
    return txt

In [48]:
start = time.time()
# this line takes all the text in the request summary column
# and extracts each entry into a list
o = sentences(spd, 'request_summary')

# extract the information, producing a list of sentences (really, a list of tokens)
docs = nlp_proc(o)

# finally, we:
# remove stopwords
# remove punctuation
# lemmatize (reduce to simplest form for purposes of similarity)

texts = lem_stop(docs)

end = time.time()
# this should take approximately 10min to run on the full seattle dataset

In [74]:
print(end-start)
countents = Counter(itertools.chain(*ents))
edf = pd.DataFrame.from_dict(countents, orient='index', columns=['count'])
edf = edf.sort_values(by='count', ascending=True).reset_index()
edf

255.81355595588684


Unnamed: 0,index,count
0,(TRN),1
1,(Mobile),1
2,"(the, City)",1
3,"(the, the, City, Centre)",1
4,"(DPD, Project)",1
5,(4624),1
6,(S.),1
7,"(Dates):9510, 28TH)",1
8,"(Ryan, C., Nute)",1
9,"(AVE, NW)",1


In [31]:
counter = Counter(itertools.chain(*texts))
cdf = pd.DataFrame.from_dict(counter, orient='index', columns=['count'])

# percentage term frequency compared to rest of document
cdf['pct'] = cdf['count'] / sum(cdf['count'])

# normalization to reweight the "importance" of a word, invariate to document size
cdf['prp'] = cdf['count'] / cdf['count'].max()

cdf = cdf.sort_values(by='count', ascending=False)
cdf.head(50)
#prr2vec_data = [ lem_stop(i) for i in docs ]
# the output here is a list of all of the processed, filtered text
#
# this is wrapped in a list in order to help word2vec work with the word embeddings
# rather than just the characters of the text. normally word2vec takes sentences
# rather than just words

Unnamed: 0,count,pct,prp
record,4431,0.020289,1.0
request,4305,0.019712,0.971564
seattle,3982,0.018233,0.898668
city,2757,0.012624,0.622207
include,2225,0.010188,0.502144
public,2126,0.009735,0.479801
storage,2016,0.009231,0.454976
email,1983,0.00908,0.447529
underground,1892,0.008663,0.426992
tanks,1784,0.008169,0.402618


In [21]:
# Create Dictionary
id2word = Dictionary(texts)

# Term Document Frequency
corpus = [id2word.doc2bow(txt) for txt in texts]

dictionary = Dictionary(texts) 
passes = 45
rs = 7
num_topics = 10

lda_40 = ldamodel.LdaModel(corpus, 
                        num_topics=num_topics, 
                        id2word = id2word, 
                        passes=passes, 
                        random_state=rs)

# Project 1: Clustering using LDA

### What is Latent Dirichlet Allocation (LDA)?
Latent dirichlet allocation (LDA) is a tool for finding implicit relationships in a large body of text. The algorithm produces topics, which essentially are groupings of words that we--statistically speaking--expect to have some degree of association (represented through co-occurrence). As such, topics are not explicitly related through semantics or knowledge content--it is through later inference that we understand the emergent higher-level categories.

For example, consider a body of text that contains keywords like 'China', 'black', 'white', 'spotted', 'Croatia', 'cute', and 'bamboo'. It could be the case that a subset of these words are explicitly (and exclusively) related to a particular category, while others occur within each category with about the same frequency distribution.

Using LDA, we are able to categorize all of these words into groups that make sense. The computer groups 'China', 'bamboo', 'black', 'white', and 'cute' together, and our human inference suggests *'panda'*. In contrast, if we see 'Croatia', 'spotted', 'black', 'white', and 'cute', in the computer output, we think *'dalmatian'*.

More complicated algorithms can be used to assign actual labels to topic categories.

In [22]:
pp.pprint(lda_40.print_topics())

[(0,
  '0.031*"violation" + 0.028*"avenue" + 0.028*"code" + 0.028*"complaint" + '
  '0.025*"please" + 0.023*"property" + 0.022*"seattle" + 0.021*"fire" + '
  '0.020*"provide" + 0.019*"copy"'),
 (1,
  '0.059*"record" + 0.038*"public" + 0.027*"request" + 0.019*"development" + '
  '0.018*"produce" + 0.018*"project" + 0.018*"shoreline" + 0.015*"sdot" + '
  '0.015*"permit" + 0.015*"exemption"'),
 (2,
  '0.044*"report" + 0.035*"relate" + 0.029*"include" + 0.029*"fire" + '
  '0.026*"record" + 0.019*"document" + 0.018*"incident" + 0.017*"seattle" + '
  '0.012*"limit" + 0.011*"case"'),
 (3,
  '0.053*"property" + 0.049*"site" + 0.046*"assessment" + '
  '0.044*"environmental" + 0.024*"record" + 0.022*"seattle" + '
  '0.021*"information" + 0.020*"permit" + 0.018*"request" + 0.017*"building"'),
 (4,
  '0.020*"would" + 0.015*"request" + 0.013*"be" + 0.012*"information" + '
  '0.011*"seattle" + 0.011*"thank" + 0.010*"look" + 0.009*"number" + '
  '0.009*"street" + 0.009*"-PRON-"'),
 (5,
  '0.043*"city

# Project 2: Cosine Similarity Using Word2Vec

### What is Word2Vec?

Word2vec is an algorithmic system used to produce word embeddings, which may need some explanation or unpacking. Think back to the LDA section of this document. The computer produced a clustering of words, and we used our associative human creativity to establish patterns and relationships between them. What if it were possible to determine the similarity or semantic relationship between words programmatically?

Word2vec achieves this by taking a large body of text and representing it as a vector space. Each word contained within that vector space is encoded as a vector, comprised of a 1 where the word is and 0's everywhere else). There is a hidden filter layer which compresses the size of this vector while minimizing information loss, as smaller vectors are less computationally complex to compare. Finally, the word vectors are all positioned in the vector space relative to each other, with more similar words clustered together.

This is quite abstract, so let's try out an example. Assume that we have the sentences "Bananas and apples are delicious," and "Durian and jackfruit are unpleasant." We can represent each of these as a list of words:

`document1 = ['Bananas', 'and', 'apples', 'are', 'delicious', '.']`

`document2 = ['Durian', 'and', 'jackfruit', 'are', 'unpleasant', '.']`

And then as a vector space, like so:

In [87]:
d = {'bananas': [1, 0], 'durian': [0, 1], 'and': [1, 1],
     'apples': [1, 0], 'jackfruit': [0, 1], 'are': [1, 1],
    'delicious': [1, 0], 'unpleasant': [0, 1], '.': [1, 1]}

s  = pd.Series(d,index=d.keys())
s

bananas       [1, 0]
durian        [0, 1]
and           [1, 1]
apples        [1, 0]
jackfruit     [0, 1]
are           [1, 1]
delicious     [1, 0]
unpleasant    [0, 1]
.             [1, 1]
dtype: object

While we have produced an excellent vector space here, we can make the vector space more sparse and easier to work with by dropping items that do not have semantic relevance. That includes words like 'and' and punctuation.

In [83]:
e = {'bananas': [1, 0], 'durian': [0, 1],
     'apples': [1, 0], 'jackfruit': [0, 1],
    'delicious': [1, 0], 'unpleasant': [0, 1]}

t  = pd.Series(e,index=e.keys())
t

bananas       [1, 0]
durian        [0, 1]
apples        [1, 0]
jackfruit     [0, 1]
delicious     [1, 0]
unpleasant    [0, 1]
dtype: object

Again, this is our vector *space*. A *word vector* is the representation of the word with regard to the entire vecor space. The following word vectors are represented like this, relative to their presence within the vector space, and the documents in which they appear:

`bananas = [1, 0, 0, 0, 0, 0]`

`durian = [0, 1, 0, 0, 0, 0]`

`delicious = [0, 0, 0, 0, 1, 0]`

`unpleasant = [0, 0, 0, 0, 0, 1]`

This is a simplistic picture of how a word vector operates. There is little insight that we can derive from this, other than comparing direct equivalence of vectors. But things become interesting when there is a significantly large corpus of documents, with different uses and contexts for words.

Word2vec takes word vectors for every word that appears in a corpus (as above) and represents their contexts as a series of weights. Think of it like creating a dictionary, where each definition is composed of a little piece of every other definition, but to varying degrees. Because each definition of a word is created relationally, it is possible to capture conceptual or syntactic meaning an a really robust, fascinating (almost surprising) way.

Supposing we had more data (a huge set of other documents), the previous word vectors might be transformed to contain all of the "definitions" of the other words within the dataset:

`bananas = [0.89123, 0.66545, 0.19842, 0.11901, 0.09113, 0.07221]`

`durian = [0.12311, 0.71834, 0.22142, 0.13452, 0.08721, 0.067881]`

`delicious = [0.13317, 0.17004, 0.21891, 0.66311, 0.88313, 0.70019]`

`unpleasant = [0.14141, 0.16167, 0.22212, 0.57719, 0.77311, 0.87123]`

Other algorithms can be used to reduce the dimensionality of the vector space, such that each of these vectors can be plotted in 2D space. For a naive explanation of how this works, check out the bolded components of each of these vectors:

`bananas = [`**0.89123`, `0.66545**`, 0.19842, 0.11901, 0.09113, 0.07221]`

`durian = [`**0.71311`, `0.86834**`, 0.22142, 0.13452, 0.08721, 0.067881]`

`delicious = [0.13317, 0.17004, 0.21891, 0.38311, `**0.88313`, `0.70019**`]`

`unpleasant = [0.14141, 0.16167, 0.22212, 0.57719, `**0.77311`, `0.87123**`]`

In [45]:
# word2vec
#
# error messages of note, in case further problems arise:
# https://stackoverflow.com/questions/33989826/python-gensim-runtimeerror-you-must-first-build-vocabulary-before-training-th/33991111
# 

# the number of dimensions of generated vectors. this is a good number to
# play around with. some people suggest square-root length of vocabulary
# conceptually this might map onto principle components, or number of topics
size = 50

# terms that occur less than min_count number 
# of times are ignored in calculations
# may want to change this depending on reimplementation of
# lem_stop function above
min_count = 1

# terms that occur within this window of text are associated with it
# during the training of the model. if the corpus of text contains large
# sentences then it may be a good idea to change this to something larger.
# the documentation suggests 10 as an upper bound and 4-7 as a good range.
window = 4

# skip-gram technique: boolean that determines skipgram vs continuous bag of words
# model. the default is 1, skipgram
sg = 1

prr2vec = Word2Vec(
    [prr2vec_data],
    sg=sg,
    size=size,
    min_count=min_count,
    window=window
)

In [48]:
#prr2vec.wv['zone']
#prr2vec.wv.most_similar('state')
#prr2vec.wv.vocab

[('sincerely', 0.9992898106575012),
 ('act', 0.9992607235908508),
 ('category', 0.9992291331291199),
 ('which', 0.9991952180862427),
 ('each', 0.9991693496704102),
 ("'s", 0.9991583228111267),
 ('khandelwal@kingcounty', 0.9991531372070312),
 ('boise', 0.9991462230682373),
 ('be', 0.9991364479064941),
 ('in', 0.9991249442100525)]

In [129]:
def tsne_plot(model):
    "Creates and TSNE model and plots it"
    labels = []
    tokens = []

    for word in model.wv.vocab:
        tokens.append(model[word])
        labels.append(word)
    
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

In [130]:
tsne_plot(prr2vec)

  import sys


NameError: name 'TSNE' is not defined

#ONCE we have vectors
#step 3 - build model
#3 main tasks that vectors help with
#DISTANCE, SIMILARITY, RANKING

# Dimensionality of the resulting word vectors.
#more dimensions, more computationally expensive to train
#but also more accurate
#more dimensions = more generalized
num_features = 300
# Minimum word count threshold.
min_word_count = 3

# Context window length.
context_size = 7

# Downsample setting for frequent words.
#0 - 1e-5 is good for this
downsampling = 1e-3

# Seed for the RNG, to make the results reproducible.
#random number generator
#deterministic, good for debugging
seed = 1

In [197]:
prr2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=2,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

In [208]:
prr2vec.build_vocab(docs.text)

token_count = sum([len(doc) for doc in docs.text])
print("The corpus contains {0:,} tokens".format(token_count))

RuntimeError: cannot sort vocabulary after model weights already initialized.

In [214]:
prr2vec.train(docs.text, total_examples=prr2vec.corpus_count, epochs=prr2vec.epochs)

ValueError: You must specify an explict epochs count. The usual value is epochs=model.epochs.

In [8]:
#define some parameters  
noisy_pos_tags = ['PROP']
min_token_length = 2

#Function to check if the token is a noise or not  
def isNoise(token):     
    is_noise = False
    if token.pos_ in noisy_pos_tags:
        is_noise = True 
    elif token.is_stop == True:
        is_noise = True
    elif len(token.string) <= min_token_length:
        is_noise = True
    return is_noise 

def cleanup(token, lower = True):
    if lower:
       token = token.lower()
    return token.strip()

cleaned_list = [cleanup(word.string) for word in docs if not isNoise(word)]
cleaned_list

['requesting',
 'copies',
 'of',
 'hsd',
 "'s",
 'service',
 'agreements',
 'with',
 'compass',
 'housing',
 'alliance',
 'and',
 'solid',
 'ground',
 'as',
 'well',
 'as',
 'correspondence',
 'emails',
 'and',
 'other',
 'documents',
 'that',
 'reference',
 'beatrice',
 'holbert',
 'carolyn',
 'kinniebrow',
 'or',
 'carolyn',
 'bilal',
 'reports',
 'regarding',
 'financial',
 'reviews',
 'of',
 'share',
 'in',
 '2009',
 '2011',
 'and',
 '2013from',
 'september',
 '2010',
 'all',
 'email',
 'correspondence',
 'from',
 'anyone',
 'at',
 'hsd',
 'that',
 'include',
 'any',
 'of',
 'the',
 'terms',
 'below',
 'peggy',
 'hotes”“scott',
 'morrow',
 'nate',
 'martin',
 'share',
 'wheel',
 'seattle',
 'housing',
 'and',
 'resource',
 'effort"“marvin',
 'futrell”"low',
 'income',
 'housing',
 'institute""michelle',
 'marchand""steven',
 'isaacson""lantz',
 'rowland',
 'jarvis',
 'capucion',
 'nickelsville""sharon',
 'lee""tent',
 'city"(1',
 'documents',
 'and',
 'communications',
 'that',
 'w

In [8]:
# str x in lambda to ensure that everything is processable by spacy
#civ['proc'] = civ['request_summary'].apply(lambda x: nlp(str(x)))
#listy = civ.proc.tolist()
s = ''
for row in civ['request_summary']:
    s += str(row)
    
doc = nlp(s)
tokenizer = Tokenizer(doc.vocab)

#for sentence in enumerate(text.sents):
#    print(sentence)
#    print('')

words = [token.lemma_ for token in doc if token.is_stop != True and token.is_punct != True]

# noun tokens that arent stop words or punctuations
nouns = [token.lemma_ for token in doc if token.is_stop != True and token.is_punct != True and token.pos_ == "NOUN"]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)

In [49]:
tokens = []
lemma = []
pos = []

for doc in nlp.pipe(civ['request_summary'].astype('unicode').values, batch_size=50,
                        n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # We want to make sure that the lists of parsed results have the
        # same number of entries of the original Dataframe, so add some blanks in case the parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

civ['tokens'] = tokens
civ['lemma'] = lemma
civ['pos'] = pos

KeyboardInterrupt: 

In [None]:
# ideas for analysis
is it possible to predict the department that a request goes to, given the body of a request?
https://towardsdatascience.com/machine-learning-for-text-classification-using-spacy-in-python-b276b4051a49

is it possible to do LDA or topic modeling, to gauge what people are typically writing about in PRRs?

is it possible to discover the frequency of a keyword over time, and plot it?

it it possible to collect n-grams, and uncover phrases of relevance?