# Natural Language Processing Techniques on Public Records Requests Data

In [375]:
import spacy
import pandas as pd
import xlrd
#from sklearn.manifold import TSNE
from spacy.tokenizer import Tokenizer
from gensim.models import Word2Vec, ldamodel
from gensim.models.phrases import Phrases, Phraser
from gensim.corpora import Dictionary
from collections import Counter
import itertools
from spacy.lang.en.stop_words import STOP_WORDS
import re
import numpy as np
import matplotlib.pyplot as plt
import pprint
import time
import os
from sklearn.manifold import TSNE

# this isn't strictly an import, but it's used globally
pp = pprint.PrettyPrinter()

# change the model for different word vectors
nlp = spacy.load('en_core_web_lg')

# and finally, suppress an annoying warning that pandas throws
# when using straightforward indexing methods
pd.options.mode.chained_assignment = None

# Cleaning and structuring data for analysis

At the onset of this project, we obtained data extracts from a number of institutions and municipalities of different sizes, with the intention of collating them into one repository for analysis. Upon receipt, there were a handful of things that we realized, which dramatically reduced the kind of analysis that we could perform on the data.

1. **Government bodies have different reporting standards for metadata.** I believe this is related to the platform/infrastructure the institution is using to fulfill and manage requests. The fields of interest, as well as the types of data accessible within those fields, are often so distinct that they cannot be combined in a straightforward way. For example, some municipalities have tons of great metadata reporting, especially around the estimated amount of time and actual amount of time that it took to fulfill a request.

2. **Inputs are not standardized.** Both people and institutions submit public records requests, which means that there are differing degrees of detail, complexity, and formatting. Some people submit a few words or a case number in their request, while others copy-paste several spreadsheet columns. 

3. **There is personal identifying information everywhere.** The more I dug into records requests, the more I saw people signing their requests with their name, address, and contact information. It's hard to strip out this information with software, especially because requests ask for information on a particular person or named entity. This is problematic for open data projects, namely because 

4. **All data are equal, but some data are more equal than others.** This project charts out some methods that can be used across departments that have different numbers of records requests, but there's little we can do (and little inference we can provide) for a department that has but one solitary request.

The confluence of all of these factors effectively means that it's hard to look for trends. *Not great.*

It's really important to clean and harmonize the data as best we can. As such, we ended up throwing away most of the data that we had and working with a much smaller subset. After importing the data into a pandas dataframe, we kept:

- The department name (a string);
- Record creation date (a datetime object)
- Request summary (a string)

It's important to keep the orthography of the names the same, otherwise the information flow in the code here will break.

## Future Metadata Standards
Though we are only working with essentially three fields here, I would really like to see the following standard fields as metadata for each request:

- Unique Identifier for Record/Case (hash)
- Creation Date (datetime object)
- Assigned Department (string)
- Public Record Request (string; in a perfect world this would be plaintext, to prevent people from copy-pasting in spreadsheets)
- Close Date (datetime object)
- Attached Objects (myriad of formats)

In addition to those fields, these are some extensible fields that would be nice to see:

- Estimated Completion Date
- Actual Completion Date
- Type of Requester (controlled vocabulary; based on user research)

Notice that most of these fields are standard to the GovQA or WebQA platforms. If, through the Open Data Alliance, we were to shift to a 'data lake' model for collating and sharing this information, I would also like to see:

- Source Name (string; the jurisdiction from which the records request comes from)
- Source Type (controlled vocabulary, with possible choices like 'town', 'city', 'state', etc.)
- Tags (string; prepopulated based on text mining, and later expanded based on record officer input)

## Project 0: Cleaning the Data and Cursory Analysis
Let's load things up!

In [65]:
# this section loads an excel sheet.
# in an ideal scenario, all of these disparate fiels would live in a database with
# a set of standard fields
#
seadata = pd.read_excel('reformatted.xlsx')
olydata = pd.read_excel('StudentPRRLog_06212018_2.xlsx')
pordata = pd.read_excel('Port Orchard PRR Data.xls', header=9)

In [66]:
# this is a helper function to clean text in place
def clean(text):
    # ARGUMENTS: text; a string of whatever length
    # OUTPUTS: cleaned text; a string of whatever length
    #
    # this is a regular expression for processing dates
    #
    datesplit = re.compile('\d+|\D+')    
    # more complex formatting (i.e.
    # unicode characters) has to be handled properly
    text = text.replace("\t", " ")
    text = text.replace("\p", " ")
    text = text.replace('\uf07f', " ")
    text = text.replace('\uf0b7', " ")
    text = text.replace('\uf071', " ")
    return text

In [67]:
# reindex the data above, using only the columns we care about, for olympia
oly_trunc = olydata[['Assigned Dept', 'Create Date', 'Public Record Desired']]
oly_trunc.columns = ['department_name', 'create_date', 'request_summary']

# convert everything in the request_summary column to a spring, so no errors
# are thrown when doing natural language processing on data later
oly_trunc['request_summary'] = oly_trunc['request_summary'].astype(str)
oly_trunc['request_summary'] = oly_trunc['request_summary'].apply(lambda x: clean(x))

# then drop all of the null values
oly_trunc = oly_trunc.dropna().reset_index().set_index('create_date')
del oly_trunc['index']

# and similar, but for port orchard
droplist = [column for column in pordata.columns if 'Unnamed' in column]
pordata.drop(droplist, axis=1, inplace=True)
por_trunc = pordata[['Assigned Dept', 'Create Date', 'Public Record Desired']]
por_trunc.columns = ['department_name', 'create_date', 'request_summary']
por_trunc['request_summary'] = por_trunc['request_summary'].astype(str)
por_trunc['request_summary'] = por_trunc['request_summary'].apply(lambda x: clean(x))
por_trunc = por_trunc.dropna().reset_index().set_index('create_date')
del por_trunc['index']

# same, but for seattle
sea_trunc = seadata[['department_name', 'create_date', 'request_summary']]
sea_trunc['request_summary'] = sea_trunc['request_summary'].astype(str)
sea_trunc['request_summary'] = sea_trunc['request_summary'].apply(lambda x: clean(x))
sea_trunc = sea_trunc.dropna().reset_index().set_index('create_date')
del sea_trunc['index']

# number of entries in the truncated data blocks
print("olympia data entries:", len(oly_trunc),
      "\nseattle data entries:", len(sea_trunc),
      "\nport orchard data entries:", len(por_trunc))
#sum(truncated.request_type_description != truncated.spd_overall_rec_req_description)

# later on, consider using NA values as test set for topic modeling. might be overkill?

olympia data entries: 7640 
seattle data entries: 23414 
port orchard data entries: 246


In [68]:
# here's a list of all of the possible departments we can choose from
# as well as the overall percentage that they make up
#

def dept_count_pct(df):
    depts = df['department_name'].value_counts()
    deptsdf = pd.DataFrame(depts)
    deptsdf['pct'] = (deptsdf['department_name'] / len(df)).round(5)
    return deptsdf

dept_count_pct(oly_trunc)

Unnamed: 0,department_name,pct
Olympia Police Department,5118,0.6699
Community Planning and Development,1435,0.18783
Multiple Departments,317,0.04149
Fire,253,0.03312
Administrative Services,183,0.02395
Public Works,169,0.02212
Human Resources,77,0.01008
Legal,43,0.00563
Executive,24,0.00314
Parks,18,0.00236


In [69]:
dept_count_pct(sea_trunc)

Unnamed: 0,department_name,pct
SPD,13803,0.58952
SFD,3976,0.16981
Site Administrator,939,0.0401
SCI,809,0.03455
FAS,740,0.03161
DOT,667,0.02849
LAW,297,0.01268
LEG,294,0.01256
SCL,279,0.01192
SPU,270,0.01153


In [70]:
dept_count_pct(por_trunc)

Unnamed: 0,department_name,pct
City Clerk's Office,183,0.7439
Community Development,45,0.18293
Public Works Department,14,0.05691
Site Administrator,3,0.0122
Finance Department,1,0.00407


Now that the data is clean and massaged into a form that we can easily work with, we should try to gather some descriptive information about records. Namely, how long are they, on average? We'll measure this in terms of characters, as well as tokens (words).

In [71]:
# this is a helper function that gives us the average length in characters
# for each request for each department, and then a higher-level descriptive table
# of the same
def avglen(df):
    # ARGUMENTS: df; a dataframe
    # OUTPUTS: a dict; containing column name, average len in chars, number of requests
    #
    
    # get a shortlist of all the names of departments
    names = set(df.department_name.values)
    
    # for 
    varlist = [df[df.department_name == i]['request_summary'] 
               for i in names]
    
    l = [len(i) for i in varlist]
    s = []    
    for i in varlist:
        s.append(sum([len(str(j)) for j in i]))
    
    avg = [i / j for i,j in zip(s, l)]
    
    lendict = {n : (v, e ) for n, v, e in zip(names, avg, l)}

    return pd.DataFrame.from_dict(lendict, orient='index',
                             columns=['average length (chars)', 'requests per dept'])

olyagl = avglen(oly_trunc)
olyagl

Unnamed: 0,average length (chars),requests per dept
Executive,316.291667,24
Fire,101.988142,253
Administrative Services,250.065574,183
Legal,170.395349,43
Olympia Police Department,152.511333,5118
Human Resources,297.311688,77
Community Planning and Development,177.438328,1435
Parks,241.333333,18
Municipal Court,53.0,3
Multiple Departments,436.861199,317


In [72]:
poragl = avglen(por_trunc)
poragl

Unnamed: 0,average length (chars),requests per dept
Finance Department,192.0,1
Community Development,267.333333,45
City Clerk's Office,388.480874,183
Public Works Department,284.785714,14
Site Administrator,3.0,3


In [73]:
seaagl = avglen(sea_trunc)
seaagl

Unnamed: 0,average length (chars),requests per dept
MOS,571.803571,224
LAW,427.646465,297
ART,697.875,8
PCD,557.231884,69
CEN,756.638889,36
OSE,432.294118,17
HSD,470.911111,90
OFH,647.030303,33
CIV,251.625,8
EEC,350.590909,44


In [74]:
desc = {'seattle': seaagl.describe()[1:],
                'olympia': olyagl.describe()[1:],
                'port orchard' : poragl.describe()[1:]}

descriptions = pd.concat(desc)
descriptions

Unnamed: 0,Unnamed: 1,average length (chars),requests per dept
olympia,mean,226.336325,694.545455
olympia,std,109.130513,1522.56549
olympia,min,53.0,3.0
olympia,25%,161.453341,33.5
olympia,50%,241.333333,169.0
olympia,75%,294.907323,285.0
olympia,max,436.861199,5118.0
port orchard,mean,227.119984,49.2
port orchard,std,143.564417,76.838792
port orchard,min,3.0,1.0


In [75]:
# this performs natural language processing
# on the requests, casting them to new columns

# watch out when you run it, because it takes about 15min to process
por_trunc['tokens'] = por_trunc['request_summary'].apply(lambda x: nlp(x))
oly_trunc['tokens'] = oly_trunc['request_summary'].apply(lambda x: nlp(x))
sea_trunc['tokens'] = sea_trunc['request_summary'].apply(lambda x: nlp(x))

# Project 1: Generating a word cloud for each department

Word clouds, while occasionally superficial, are handy proof-of-concept precursors for more complex linguistic analysis. In order to generate a word cloud, you have to go through most of the same processes (cleaning and splitting data, breaking into constituent parts, etc.) that you do when working on a machine learning pipeline, for example. To that end, here's a wall of code that is used to generate a word cloud.

In [426]:
def filter_noise(token):
    # ARGUMENTS: token; a spacy token
    # optional: mtl (minimum token length); an int
    # optional: cs (custom stop); a boolead
    # OUTPUTS: T/F
    #
    is_noise = False
    # this function performs a series of checks to see if 
    # a token is noise and then returns t/f
    # essentially it's a giant switch
    #
    #
    #
    # here's a regular expression for matching dates/times from a string
    # spacy doesn't handle that task well
    dates = re.compile('\d{1,2}(?P<sep>[-/])\d{1,2}(?P=sep)\d{2,4}')
    times = re.compile(r'\d{1,2}(:\d{1,2})?(am|pm)?')
    
    #
    # filters stop words
    if token.is_stop == True:
        is_noise = True
    elif token in STOP_WORDS:
        is_noise = True
        
    # measures length of token; default is 3
    elif len(token.text) <= 3:
        is_noise = True
    
    # regex filters
    elif bool(dates.findall(token.text)) == True:
        is_noise = True
    elif bool(times.findall(token.text)) == True:
        is_noise = True
    elif token.text == '-PRON-':
        is_noise = True
        
    # filters things that are/look like numbers
    elif token.is_digit == True:
        is_noise = True
    elif token.is_currency == True:
        is_noise = True
    elif token.like_num == True:
        is_noise = True
    # filters web stuff
    elif token.like_url == True:
        is_noise = True
    elif token.like_email == True:
        is_noise = True
        
    # filters punctuation
    elif token.is_punct == True:
        is_noise = True
    elif token.is_left_punct == True:
        is_noise = True
    elif token.is_right_punct == True:
        is_noise = True
    elif token.is_bracket == True:
        is_noise = True
    elif token.is_quote == True:
        is_noise = True
    elif token.is_space == True:
        is_noise = True
    elif token.is_alpha == False:
        is_noise = True
    return is_noise 


# after we transform the text from raw strings into lemmas, we have
# to do a secondary layer of filtering to remove additional strings that we
# don't really care about, or things that don't contribute in a strong way
# to our analysis

def filter_unicode_lemmas(text):
    # ARGUMENTS: text; a unicode string
    # OUTPUTS: T/F
    #
    #
    is_noise = False
    # these are some stop words that occur pretty frequently across docs;
    # it might make sense to expand these further
    custom_stop_words = ['this', 'that', 'please', 'be', 'file',
                         'copy', 'with', 'from', 'like',
                         'have', 'other', 'thank', 'and/or',
                        'with', '-PRON-', 'jan', 'feb', 'mar', 'apr',
                        'jun', 'jul', 'aug', 'sep', 'sept', 'oct', 'nov',
                        'dec', 'seattle', 'report', 'city', 'document',
                         'there', 'request', 'will', 'these']
    
    if text in custom_stop_words:
        is_noise = True
    return is_noise


def lem_stop(text, a=False):
    # ARGUMENTS: text; a tuple containing lists, containing spacy tokens
    # OPTIONAL ARGUMENTS: a; returns just the lemmas
    # OUTPUTS: a list containing strings
    text = [token.lemma_ for token in text if filter_noise(token) == False]
    if a == True:
        return text
    return [token for token in text if filter_unicode_lemmas(token) == False]

def entities(text):
    # ARGUMENTS: text; a list containing spacy tokens
    # OUTPUTS: a list containing named entities
    return [entity.text for entity in text.ents] 
    
def export(df, fn):
    # ARGUMENTS: df; a dataframe
    # fn; string, a filename
    #
    # take the file name and append CSV
    filename = fn+"-data.csv"
    # and then write it to the current directory
    df.to_csv(filename, encoding='utf-8')

In [79]:
# this filters out a bunch of words, getting the raw text into a new column

por_trunc['lemmas'] = por_trunc['tokens'].apply(lambda x: lem_stop(x))
oly_trunc['lemmas'] = oly_trunc['tokens'].apply(lambda x: lem_stop(x))
sea_trunc['lemmas'] = sea_trunc['tokens'].apply(lambda x: lem_stop(x))

In [81]:
# this bit of code concatenates all of the different cleaned data into one dataframe

cities = {'seattle': sea_trunc,
              'olympia': oly_trunc,
              'port orchard' : por_trunc}

city_export = pd.concat(cities)

# this reshapes the large dataframe into a time-series dataframe, so we can work with
# words over time (more on this later)

timeseries = city_export['lemmas'].apply(pd.Series).stack().reset_index(level=2, drop=True)
words_over_time = pd.DataFrame(timeseries)

#export(words_over_time, 'words_over_time')

In [82]:
# this bit of code runs the above pipeline for each of the different entries in the dataset

# and this helper function generates a dataframe of counts for each word,
# as well as proportion

def countdf(df):
    counter = Counter(itertools.chain(*df))
    cdf = pd.DataFrame.from_dict(counter, orient='index', columns=['count'])
    
    # percentage term frequency compared to rest of document
    cdf['pct'] = cdf['count'] / sum(cdf['count'])

    # normalization to reweight the "importance" of a word, invariate to document size
    cdf['prp'] = np.log2(cdf['count'])

    cdf = cdf.sort_values(by='count', index=True, ascending=False)

    return cdf

## Project 1.1: Frequency of Named Entities

In our previous sections, we had broken up the body of texts into one huge list of words. Something potentially more interesting would be looking for entites (i.e. nouns or noun phrases of note) that appear frequently within a records request.

Having this information might suggest items that could be proactively disclosed. For example, if it turns out that one of the most frequently named noun phrases is 'building permits', it might make sense to see which type of building permits are out there, and further if there is a particular class of them that can be easily released to the public.

Moreover, if this information is already available, a department recieving these requests might want to redirect users to the place where the information already lives.

In [427]:
def filter_ents(text):
    # ARGUMENTS: lemma; a unicode string
    # OUTPUTS: T/F
    #
    is_noise = False
    # these are some stop words that occur pretty frequently across docs;
    # it might make sense to expand these further
    remove_ents = [' ', '', str([])]
    for ent in text:
        if ent in remove_ents:
            is_noise = True
    return is_noise

ents = entities(docs)
collapse_ents = [' '.join(e) for e in ents if filter_ents(e) == False]

edf = pd.DataFrame(ents)#, columns=['string'])
#ecounter = pd.DataFrame(edf.value_counts())
#collapse = [str(e) for e in ents for i in e]
#ecounter
#edf.values.reshape(edf.shape, 1)
edf


NameError: name 'docs' is not defined

# Project 2: Clustering using LDA

### What is Latent Dirichlet Allocation (LDA)?
Latent dirichlet allocation (LDA) is a tool for finding implicit relationships in a large body of text. The algorithm produces topics, which essentially are groupings of words that we--statistically speaking--expect to have some degree of association (represented through co-occurrence within a document). As such, topics are not explicitly related through semantics or knowledge content--it is through later inference that we understand the emergent higher-level categories.

For example, consider a body of text that contains keywords like 'China', 'black', 'white', 'spotted', 'Croatia', 'cute', and 'bamboo'. It could be the case that a subset of these words are explicitly (and exclusively) related to a particular category, while others occur within each category with about the same frequency distribution.

Using LDA, we are able to categorize all of these words into groups that make sense. The computer groups 'China', 'bamboo', 'black', 'white', and 'cute' together, and our human inference suggests *'panda'*. In contrast, if we see 'Croatia', 'spotted', 'black', 'white', and 'cute', in the computer output, we think *'dalmatian'*. It's okay that the different topics overlap--as in human languages, semantic meaning is often distributed across different words.

More complicated algorithms can be used to assign actual labels to topic categories.

In [109]:
# in order to do this LDA work, we have to pull all of our data out of the dataframe
def unpack(df, col):
    strings = []
    for i in df[col]:
        strings.append(i)
    return strings

alldict = unpack(city_export, 'lemmas')

In [98]:
# Create Dictionary for everything
id2word = Dictionary(alldict)

# Term Document Frequency
corpus = [id2word.doc2bow(txt) for txt in alldict]

In [99]:
# running this version of the model will take about 5min on the data
passes = 15
rs = 7
# how do we decide on a good number of topics
num_topics = 10
chunksize = 1000

everything_model = ldamodel.LdaModel(corpus, 
                        num_topics = num_topics, 
                        chunksize = chunksize,
                        update_every = 2,
                        id2word = id2word, 
                        passes = passes, 
                        random_state = rs)

In [103]:
pp.pprint(everything_model.print_topics())
# subtract the sum of all of the topic coefficients from 1 in order to get the most viable topic model

[(0,
  '0.071*"incident" + 0.057*"police" + 0.037*"video" + 0.022*"record" + '
  '0.021*"involve" + 0.019*"audio" + 0.019*"statement" + 0.018*"relate" + '
  '0.018*"officer" + 0.017*"witness"'),
 (1,
  '0.040*"property" + 0.035*"fire" + 0.027*"permit" + 0.022*"avenue" + '
  '0.020*"building" + 0.020*"violation" + 0.017*"record" + 0.016*"street" + '
  '0.015*"locate" + 0.014*"code"'),
 (2,
  '0.030*"would" + 0.017*"need" + 0.016*"information" + 0.015*"know" + '
  '0.013*"number" + 0.011*"what" + 0.011*"send" + 0.011*"name" + 0.011*"about" '
  '+ 0.010*"time"'),
 (3,
  '0.036*"case" + 0.036*"driver" + 0.035*"accident" + 0.034*"insurance" + '
  '0.027*"date" + 0.026*"number" + 0.024*"location" + 0.022*"company" + '
  '0.018*"olympia" + 0.018*"county"'),
 (4,
  '0.031*"between" + 0.025*"include" + 0.022*"relate" + 0.019*"communication" '
  '+ 0.014*"regard" + 0.014*"january" + 0.014*"record" + 0.014*"office" + '
  '0.012*"email" + 0.011*"correspondence"'),
 (5,
  '0.043*"officer" + 0.031*"

# Project 2: Cosine Similarity Using Word2Vec

### What is Word2Vec?

Word2vec is an algorithmic system used to produce word embeddings, which may need some explanation or unpacking. Think back to the LDA section of this document. The computer produced a clustering of words, and we used our associative human creativity to establish patterns and relationships between them. What if it were possible to determine the similarity or semantic relationship between words programmatically, based on their context?

Word2vec achieves this by taking a large body of text and representing it as a vector space. Each word contained within that vector space is encoded as a vector, comprised of a 1 where the word is and 0's everywhere else). There is a hidden filter layer which compresses the size of this vector while minimizing information loss, as smaller vectors are less computationally complex to compare. Finally, the word vectors are all positioned in the vector space relative to each other, with more similar words clustered together.

This is quite abstract, so let's try out an example. Assume that we have the sentences "Bananas and apples are delicious," and "Durian and jackfruit are unpleasant." We can represent each of these as a list of words:

`document1 = ['Bananas', 'and', 'apples', 'are', 'delicious', '.']`

`document2 = ['Durian', 'and', 'jackfruit', 'are', 'unpleasant', '.']`

And then as a vector space, like so:

In [87]:
d = {'bananas': [1, 0], 'durian': [0, 1], 'and': [1, 1],
     'apples': [1, 0], 'jackfruit': [0, 1], 'are': [1, 1],
    'delicious': [1, 0], 'unpleasant': [0, 1], '.': [1, 1]}

s  = pd.Series(d,index=d.keys())
s

bananas       [1, 0]
durian        [0, 1]
and           [1, 1]
apples        [1, 0]
jackfruit     [0, 1]
are           [1, 1]
delicious     [1, 0]
unpleasant    [0, 1]
.             [1, 1]
dtype: object

While we have produced an excellent vector space here, we can make the vector space more sparse and easier to work with by dropping items that do not have semantic relevance. That includes words like 'and' and punctuation.

In [83]:
e = {'bananas': [1, 0], 'durian': [0, 1],
     'apples': [1, 0], 'jackfruit': [0, 1],
    'delicious': [1, 0], 'unpleasant': [0, 1]}

t  = pd.Series(e,index=e.keys())
t

bananas       [1, 0]
durian        [0, 1]
apples        [1, 0]
jackfruit     [0, 1]
delicious     [1, 0]
unpleasant    [0, 1]
dtype: object

Again, this is our vector *space*. A *word vector* is the representation of the word with regard to the entire vecor space. The following word vectors are represented like this, relative to their presence within the vector space, and the documents in which they appear:

`bananas = [1, 0, 0, 0, 0, 0]`

`durian = [0, 1, 0, 0, 0, 0]`

`delicious = [0, 0, 0, 0, 1, 0]`

`unpleasant = [0, 0, 0, 0, 0, 1]`

This is a simplistic picture of how a word vector operates. There is little insight that we can derive from this, other than comparing direct equivalence of vectors. But things become interesting when there is a significantly large corpus of documents, with different uses and contexts for words.

Word2vec takes word vectors for every word that appears in a corpus (as above) and represents their contexts as a series of weights. Think of it like creating a dictionary, where each definition is composed of a little piece of every other definition, but to varying degrees. Because each definition of a word is created relationally, it is possible to capture conceptual or syntactic meaning an a really robust, fascinating (almost surprising) way.

Supposing we had more data (a huge set of other documents), the previous word vectors might be transformed to contain all of the "definitions" of the other words within the dataset:

`bananas = [0.89123, 0.66545, 0.19842, 0.11901, 0.09113, 0.07221]`

`durian = [0.12311, 0.71834, 0.22142, 0.13452, 0.08721, 0.067881]`

`delicious = [0.13317, 0.17004, 0.21891, 0.66311, 0.88313, 0.70019]`

`unpleasant = [0.14141, 0.16167, 0.22212, 0.57719, 0.77311, 0.87123]`

Other algorithms can be used to reduce the dimensionality of the vector space, such that each of these vectors can be plotted in 2D space. For a naive explanation of how this works, check out the bolded components of each of these vectors:

`bananas = [`**0.89123`, `0.66545**`, 0.19842, 0.11901, 0.09113, 0.07221]`

`durian = [`**0.71311`, `0.86834**`, 0.22142, 0.13452, 0.08721, 0.067881]`

`delicious = [0.13317, 0.17004, 0.21891, 0.38311, `**0.88313`, `0.70019**`]`

`unpleasant = [0.14141, 0.16167, 0.22212, 0.57719, `**0.77311`, `0.87123**`]`

By performing vector addition and subtraction, we can see which words have relationships with other words.

In [335]:
# in order to do this w2v work, we have to pull all of our data out of the dataframe
# similar to how we did with LDA

unpacked_sents = unpack(city_export, 'tokens')

def flatten(ls): return [i for s in ls for i in s]

def strip_filter(s):
    disallowed = "!#$%^&*()<>?,[]-{}~`;|+=_ :.@/\'"
    filterone = ''.join(c for c in s if c not in disallowed)
    return ''.join(f for f in filterone if f.isdigit() == False)

def unpack_flatten(textlist):
    texts = []
    # takes in a huge list of lists
    # breaks the list into individual sentences
    for sentence in textlist:
        data = [sent.string.strip().lower().split() for sent in sentence.sents]
        data = flatten(data)
        spaces = [clean(datum) for datum in data if len(datum) > 1]
        nopunct = [strip_filter(datum) for datum in data]
        texts.append(nopunct)
    return texts

sentence_list = unpack_flatten(unpacked_sents)
# necessary to do further cleaning here, like numbers, dates, etc.

sentence_list

[['copy', 'of', 'police', 'report'],
 ['pursuant',
  'to',
  'the',
  'public',
  'records',
  'act',
  'i',
  'am',
  'requesting',
  'records',
  'of',
  'the',
  'following',
  '',
  'denial',
  'letters',
  'issued',
  'to',
  'each',
  'individual',
  'denied',
  'a',
  'concealed',
  'pistol',
  'license',
  'application',
  'by',
  'your',
  'police',
  'department',
  '',
  'cpl',
  'revocation',
  'letters',
  'issued',
  'to',
  'each',
  'individual',
  'whose',
  'cpl',
  'was',
  'revoked',
  'by',
  'your',
  'police',
  'department',
  '',
  'denial',
  'letters',
  'issued',
  'to',
  'each',
  'individual',
  'or',
  'dealer',
  'denied',
  'a',
  'pistol',
  'transfer',
  'application',
  'by',
  'your',
  'police',
  'department',
  'if',
  'the',
  'police',
  'department',
  'does',
  'not',
  'issue',
  'denial',
  'or',
  'revocation',
  'letters',
  'then',
  'any',
  'and',
  'all',
  'records',
  'pertaining',
  'to',
  'the',
  'denial',
  'or',
  'revocation

In [336]:
# word2vec
#
# error messages of note, in case further problems arise:
# https://stackoverflow.com/questions/33989826/python-gensim-runtimeerror-you-must-first-build-vocabulary-before-training-th/33991111
# 

# the number of dimensions of generated vectors. this is a good number to
# play around with. some people suggest square-root length of vocabulary
# conceptually this might map onto principle components, or number of topics
size = 50

# terms that occur less than min_count number 
# of times are ignored in calculations
# may want to change this depending on reimplementation of
# lem_stop function above
min_count = 3

# terms that occur within this window of text are associated with it
# during the training of the model. if the corpus of text contains large
# sentences then it may be a good idea to change this to something larger.
# the documentation suggests 10 as an upper bound and 4-7 as a good range.
window = 4

# skip-gram technique: boolean that determines skipgram vs continuous bag of words
# model. the default is 1, skipgram
sg = 1

seed = 1

epochs=20

downsampling = 1e-3

prr2vec = Word2Vec(
    sentence_list,
    sg=sg,
    seed=seed,
    size=size,
    min_count=min_count,
    window=window,
    sample=downsampling,
    iter=epochs
)


"""prr2vec = Word2Vec(
    alldict,
    sg=sg,
    size=size,
    min_count=min_count,
    window=window
)"""

'prr2vec = Word2Vec(\n    alldict,\n    sg=sg,\n    size=size,\n    min_count=min_count,\n    window=window\n)'

In [337]:
prr2vec.train(sentence_list,
              total_examples=prr2vec.corpus_count,
              epochs=prr2vec.epochs)

# save output
def save_w2v_model(model):
    if not os.path.exists("trained"):
        os.makedirs("trained")
    model.save(os.path.join("trained", "prr2vec-numspunct-filtered.w2v"))

save_w2v_model(prr2vec)
#print("Word2Vec vocabulary length:", len(prr2vec.vocab))
#prr2vec.train(alldict, total_examples=prr2vec.corpus_count)
#prr2vec.wv['zone']
#prr2vec.wv.most_similar('state')
#prr2vec.wv.vocab


#prr2vec_data = [ lem_stop(i) for i in docs ]
# the output here is a list of all of the processed, filtered text
#
# this is wrapped in a list in order to help word2vec work with the word embeddings
# rather than just the characters of the text. normally word2vec takes sentences
# rather than just words

In [339]:
prr2vec = Word2Vec.load(os.path.join("trained", "prr2vec-numspunct-filtered.w2v"))

tsne = TSNE(n_components=2, random_state=0)
all_word_vectors_matrix = prr2vec.wv.vectors
all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)

In [152]:
points = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in [
            (word, all_word_vectors_matrix_2d[prr2vec.wv.vocab[word].index])
            for word in prr2vec.wv.vocab
        ]
    ],
    columns=["word", "x", "y"]
)

In [270]:
export(points, 'word-coords-nopunctnums')

In [355]:
def analogy(word0='A', word2='Y', word3='Z'):
    if word0 == 'A':
        calc = 'B'
    else:
        calc = prr2vec.wv.most_similar(positive=[word0, word2],
                negative=[word3], topn=1)
    # "King" - "Man" + "Queen" == "Woman"
    #print(w2v_model.wv.most_similar(positive=['woman', 'king'], negative=['man']))
    return "{0} is to {1} as {2} is to {3}.".format(word0, calc[0][0], word2, word3)

# amazing word embeddings

#analogy('surveillance', 'audio', 'video')
#analogy('dashcam', 'underground', 'tanks')
#analogy('tank', 'request', 'record')
#analogy('permit', 'parcel', 'zoning')
#analogy('port', 'olympia', 'seattle')
#analogy('state', 'county', 'thurston')
#analogy()

'bodycam is to icvs as underground is to tanks.'

In [333]:
prr2vec.wv.most_similar('archaeology')

[('preservationrecords', 0.9620466232299805),
 ('managementrick', 0.9250814318656921),
 ('andersonrickandersondahpwagovcity', 0.8919956088066101),
 ('regents', 0.8355648517608643),
 ('officeking', 0.8338093161582947),
 ('palardydirector’s', 0.8316220641136169),
 ('attorneykimberly', 0.811776876449585),
 ('engineersbase', 0.8107672929763794),
 ('millskimberlymillsseattlegovdepartment', 0.7876366972923279),
 ('“city', 0.7872108221054077)]

# Project 3: Bi-grams and Tri-Grams

Since we are now able to figure out the 

In [482]:
bigram = Phrases(sentence_list, min_count=1, threshold=1)
phraser = Phraser(bigram)


In [485]:
bpl = list(phraser[sentence_list])[:2]

def find_ngrams(input_list):
    return zip(*[input_list[i:] for i in range(len(input_list))])

find_ngrams(bpl)

<zip at 0x1a5ad1d648>

In [435]:
def count_list(ls):
    counter = Counter()
    for item in ls:
        counter.update(item)
    return counter

count_list(bgp).most_common()[:20]

[('and', 16530),
 ('the', 14672),
 ('of', 9340),
 ('to', 7091),
 ('or', 5726),
 ('for', 5327),
 ('of_the', 4758),
 ('a', 3546),
 ('i_am', 3471),
 ('all', 3373),
 ('any', 3247),
 ('records', 3221),
 ('police_report', 3147),
 ('for_the', 2808),
 ('to_the', 2795),
 ('in', 2671),
 ('report', 2466),
 ('on', 2441),
 ('i_would', 2381),
 ('any_and', 2304)]

the deeper that we go into this count of bi-grams, the less important/viable this seems for analysis. perhaps future research is needed (i.e. increasing the size of the grams.)

In [208]:
# check at the end of this for some good ideas working with ngrams over time
##
###
#https://stackoverflow.com/questions/21844546/forming-bigrams-of-words-in-list-of-sentences-with-python

RuntimeError: cannot sort vocabulary after model weights already initialized.

In [214]:
prr2vec.train(docs.text, total_examples=prr2vec.corpus_count, epochs=prr2vec.epochs)

ValueError: You must specify an explict epochs count. The usual value is epochs=model.epochs.

In [8]:
#define some parameters  
noisy_pos_tags = ['PROP']
min_token_length = 2

#Function to check if the token is a noise or not  
def isNoise(token):     
    is_noise = False
    if token.pos_ in noisy_pos_tags:
        is_noise = True 
    elif token.is_stop == True:
        is_noise = True
    elif len(token.string) <= min_token_length:
        is_noise = True
    return is_noise 

def cleanup(token, lower = True):
    if lower:
       token = token.lower()
    return token.strip()

cleaned_list = [cleanup(word.string) for word in docs if not isNoise(word)]
cleaned_list

['requesting',
 'copies',
 'of',
 'hsd',
 "'s",
 'service',
 'agreements',
 'with',
 'compass',
 'housing',
 'alliance',
 'and',
 'solid',
 'ground',
 'as',
 'well',
 'as',
 'correspondence',
 'emails',
 'and',
 'other',
 'documents',
 'that',
 'reference',
 'beatrice',
 'holbert',
 'carolyn',
 'kinniebrow',
 'or',
 'carolyn',
 'bilal',
 'reports',
 'regarding',
 'financial',
 'reviews',
 'of',
 'share',
 'in',
 '2009',
 '2011',
 'and',
 '2013from',
 'september',
 '2010',
 'all',
 'email',
 'correspondence',
 'from',
 'anyone',
 'at',
 'hsd',
 'that',
 'include',
 'any',
 'of',
 'the',
 'terms',
 'below',
 'peggy',
 'hotes”“scott',
 'morrow',
 'nate',
 'martin',
 'share',
 'wheel',
 'seattle',
 'housing',
 'and',
 'resource',
 'effort"“marvin',
 'futrell”"low',
 'income',
 'housing',
 'institute""michelle',
 'marchand""steven',
 'isaacson""lantz',
 'rowland',
 'jarvis',
 'capucion',
 'nickelsville""sharon',
 'lee""tent',
 'city"(1',
 'documents',
 'and',
 'communications',
 'that',
 'w

In [8]:
# str x in lambda to ensure that everything is processable by spacy
#civ['proc'] = civ['request_summary'].apply(lambda x: nlp(str(x)))
#listy = civ.proc.tolist()
s = ''
for row in civ['request_summary']:
    s += str(row)
    
doc = nlp(s)
tokenizer = Tokenizer(doc.vocab)

#for sentence in enumerate(text.sents):
#    print(sentence)
#    print('')

words = [token.lemma_ for token in doc if token.is_stop != True and token.is_punct != True]

# noun tokens that arent stop words or punctuations
nouns = [token.lemma_ for token in doc if token.is_stop != True and token.is_punct != True and token.pos_ == "NOUN"]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)

In [49]:
tokens = []
lemma = []
pos = []

for doc in nlp.pipe(civ['request_summary'].astype('unicode').values, batch_size=50,
                        n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # We want to make sure that the lists of parsed results have the
        # same number of entries of the original Dataframe, so add some blanks in case the parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

civ['tokens'] = tokens
civ['lemma'] = lemma
civ['pos'] = pos

KeyboardInterrupt: 

# ideas for analysis
is it possible to predict the department that a request goes to, given the body of a request?
https://towardsdatascience.com/machine-learning-for-text-classification-using-spacy-in-python-b276b4051a49

is it possible to do LDA or topic modeling, to gauge what people are typically writing about in PRRs?

is it possible to discover the frequency of a keyword over time, and plot it?

it it possible to collect n-grams, and uncover phrases of relevance?

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

http://nicschrading.com/project/Intro-to-NLP-in-Python/