# Searching Bill de Blasio's Emails with the Universal Sentence Encoder

By Jeremy B. Merrill, [Quartz](https://www.qz.com).

As part of Quartz's participation in the Luanda Leaks investigation, we built a system for searching large, heterogeneous document sets with AI. For more info, LINK TK.

This is the interactive part of the demo. To use it, you need to split the provided sample documents (or your own) into sentence-length chunks with the `to_sentences.py` script, then index the chunks with `to_es.py` and `to_annoy.py`. _Then_, come back here and explore with this notebook.

The indexing scripts do two things. First, the `to_es.py` script indexes the documents and the chunks for ordinary keyword-level searching and document-retrieval. Second, the `to_annoy.py` script embeds each sentence into a 512-dimensional vector using the Universal Sentence Encoder, then indexes those vectors with Annoy, which allows quick nearest-neighbor searches.
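
To make the nearest-neighbor idea concrete, here is a minimal pure-Python sketch. It is not how Annoy works internally (Annoy builds a forest of trees so it doesn't have to compare the query against every vector); it is a brute-force stand-in on made-up 3-dimensional vectors. The names `angular_distance` and `nearest` are hypothetical. The distance formula, sqrt(2 * (1 - cosine similarity)), is the one Annoy documents for its `'angular'` metric.

```python
import math

def cosine_similarity(u, v):
    # dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_distance(u, v):
    # clamp to [-1, 1] to guard against tiny floating-point overshoot
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.sqrt(2.0 * (1.0 - cos))

def nearest(query, vectors, n=2):
    # brute force: rank every stored vector by its distance to the query
    ranked = sorted(range(len(vectors)),
                    key=lambda i: angular_distance(query, vectors[i]))
    return ranked[:n]

# toy stand-ins for sentence embeddings (real USE vectors have 512 dimensions)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
]
print(nearest([0.8, 0.6, 0.0], toy_vectors, n=2))  # prints [1, 0]
```

Brute force is fine for a handful of vectors, but it compares the query against everything; an approximate index like Annoy is what keeps lookups quick across tens of thousands of chunks.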

The script for splitting the sample text splits each document into roughly sentence-length chunks. Universal Sentence Encoder ignores any words after the 128th -- and it's very slow at long sentences. So we split the documents into single sentences or overlapping groups of short sentences and index those.
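
A rough sketch of that splitting strategy follows. It is not the code in `to_sentences.py`: the function names, the regex-based sentence splitter, the 8-word threshold for a "short" sentence and the truncation at 128 words are all assumptions chosen for illustration.

```python
import re

def split_sentences(text):
    # naive splitter: break after ., ! or ? when followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def to_chunks(text, short_words=8, max_words=128):
    sentences = split_sentences(text)
    chunks = []
    for i, sentence in enumerate(sentences):
        words = sentence.split()
        # a short sentence borrows its successor, so neighboring chunks overlap
        if len(words) < short_words and i + 1 < len(sentences):
            words = words + sentences[i + 1].split()
        # words past max_words are dropped in this sketch
        chunks.append(" ".join(words[:max_words]))
    return chunks

sample = ("Are we all going? I don't care for Indian food but I'll sit "
          "and have a diet coke or a lemonade. Sounds good.")
for number, chunk in enumerate(to_chunks(sample)):
    print(number, chunk)
```

In this sketch the very short first sentence is grouped with the one after it, which gives the encoder a little more context than "Are we all going?" would carry on its own.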

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import tf_sentencepiece
import time
import json
import numpy as np
import faiss
from annoy import AnnoyIndex
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import unicodedata
import csv
from os.path import basename
from to_sentences import *

In [2]:
# just to make sure everything's working right, let's make sure we can talk to the GPU
# if we can't, things'll just be real slow.
# you should get 
#   [[22. 28.]
#   [49. 64.]]
# as the result here.

with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print (sess.run(c))


[[22. 28.]
 [49. 64.]]


In [3]:
use_module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/1"

g = tf.Graph()
with g.as_default():
    text_input = tf.placeholder(dtype=tf.string, shape=[None])
    embed_module = hub.Module(use_module_url)
    embedded_text = embed_module(text_input)
    init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
g.finalize()

session = tf.Session(graph=g)
session.run(init_op)

In [4]:
def generate_embeddings(messages_in):
    # run the USE graph on a list of strings; returns one 512-dimensional vector per string
    return session.run(embedded_text, feed_dict={text_input: messages_in})

In [5]:
# let's just see what happens if we use USE to translate a sentence into a 512-dimensional vector.
# the value in each of the 512 columns is meaningless by itself, but,
# as a whole, similar sentences tend to have similar vectors

generate_embeddings(["my favorite kind of bagel is whole wheat"])

array([[-1.86588373e-02, -7.35830739e-02, -6.27417117e-02,
         5.43986633e-02, -6.33138269e-02,  3.12137231e-03,
        -3.39234248e-03,  8.04598851e-04, -6.56886101e-02,
         2.53207833e-02,  2.40778998e-02,  5.04001193e-02,
        -1.83276180e-02,  1.76155288e-02,  4.74402793e-02,
         3.57116014e-02,  7.41831139e-02, -6.37539616e-03,
        -2.11162958e-02,  5.88243641e-02, -3.31615061e-02,
        -2.33161990e-02,  2.62841098e-02,  5.01824468e-02,
         6.79138079e-02, -7.55049512e-02, -1.56724490e-02,
        -1.95526164e-02,  6.99534863e-02,  1.26652876e-02,
         3.94687504e-02, -6.20715693e-02, -3.33373831e-03,
         7.46464059e-02,  3.09899393e-02,  6.86386041e-03,
        -4.23100889e-02,  5.27019799e-02, -4.48292792e-02,
        -1.91675182e-02,  4.28366847e-02, -2.75708437e-02,
         6.98972717e-02,  3.40210572e-02,  7.39742890e-02,
        -3.54908220e-02, -7.66176032e-03,  5.59488079e-03,
        -3.07705756e-02,  3.51644754e-02, -2.83770934e-0

In [6]:
# let's just tell our notebook where to find our search index, our USE index and the mapping of names to IDs.

ES_INDEX_FULL_TEXT = "nycdocs-use"
ES_INDEX_CHUNK = "nycdocs-use-chunk128" 
vector_dims = 512
vector_index = AnnoyIndex(vector_dims, 'angular')
annoy_fn = 'nycdocs-chunk15_annoy.bin'
vector_index.load(annoy_fn) # super fast, will just mmap the file

es = Elasticsearch()

with open("nycdocs-use-chunk128_idx_name.json", 'r') as f:
    idx_name = json.load(f)
with open("nycdocs-use-chunk128_name_idx.json", 'r') as f:
    name_idx = json.load(f)

vec_cnt = vector_index.get_n_items()
print("sanity checks")
print(f"vectors: {vec_cnt}" )
print(f"docs: {list(idx_name.keys())[-1]}")
assert vec_cnt > 3700
assert vec_cnt == int(list(idx_name.keys())[-1]) + 1 # the first index is zero, so the *count* of indices should be 1 + the largest index.
print("yep, we're all ok")

sanity checks
vectors: 37281
docs: 37280
yep, we're all ok
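
One caveat about the check above: it treats the last key of `idx_name` as the largest index, which holds only if the JSON file was written in ascending order. JSON object keys are always strings, so a more defensive variant converts them to integers before comparing. The sketch below runs on a tiny made-up mapping, not the real index files; `check_mapping` is a hypothetical helper, and it assumes `name_idx` is simply the inverse of `idx_name`.

```python
import json

def check_mapping(idx_name, name_idx, vector_count):
    # every index from 0 to vector_count - 1 should appear exactly once
    indices = sorted(int(k) for k in idx_name)
    assert indices == list(range(vector_count)), "indices should run 0..count-1"
    # the two mappings should be inverses of each other (an assumption)
    for idx, name in idx_name.items():
        assert int(name_idx[name]) == int(idx), f"mismatch for {name}"
    return True

# made-up example data in the same shape as the JSON files
idx_name = json.loads('{"0": "p1c0", "1": "p1c1", "2": "p2c0"}')
name_idx = {name: int(idx) for idx, name in idx_name.items()}
print(check_mapping(idx_name, name_idx, vector_count=3))  # prints True
```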


## Querying

Let's run some searches. Since this is a runnable notebook, you can run your own searches, but I've got a few samples to show you.

In [11]:
# let's set the stage for the fun part.
# feel free to dig into search.py to see how the plumbing works.
from search import *
searcher = QzUSESearchFactory(vector_index, idx_name, name_idx, es, ES_INDEX_FULL_TEXT, ES_INDEX_CHUNK, generate_embeddings)

### Stage 1: text searches

Let's see if de Blasio's released documents include anything about "establishing a new corporation".

In [12]:
searcher.query_by_text("establishing a new corporation")

p888
https://example.com/p888/
build start-up companies and new products.

p3658
https://example.com/p3658/
graduates hustling to commercialize a new idea, to start-ups on the verge of explosive

p1743
https://example.com/p1743/
new public investments that complement new affordable

p1760
https://example.com/p1760/
new public investments that complement new affordable

p2938
https://example.com/p2938/
innovative efforts and lead the nation in providing free public Wi-Fi."

p3739
https://example.com/p3739/
Today, participating companies came together at

p3657
https://example.com/p3657/
companies, early stage investors, and government to spark innovation and help improve

p3814
https://example.com/p3814/
requests for proposals, and City worked with Enterprise to fund a number of emerging

p1693
https://example.com/p1693/
The new model also includes

p2231
https://example.com/p2231/
entity/individual had business before the city, was lobbying



Yes, there are! The AI system knows that "start-up companies" are like "new corporations", which is why, presumably, it returned the first result -- a fragment of an email.

### Stage 2: similar-document searches

Let's find a few example documents manually, then use the AI system to find similar documents. I did a few quick keyword searches in the PDF to find a couple of page chunks about *food*. We provide those documents to the model, which'll provide us with some "similar" documents.

The output prints out a document ID, a URL (which of course doesn't actually work here) and some sanity checks -- some words we might expect to be in many of the documents. It's okay if they're all False -- it may mean that the model is more creative than we thought.
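
The sanity checks can be thought of as one boolean per expected word. Here is a minimal sketch, assuming a simple case-insensitive substring test; the actual logic lives in `search.py` and may differ, and `keyword_checks` is a hypothetical name.

```python
def keyword_checks(text, expected_words):
    # one True/False per expected word, matched case-insensitively as a substring
    lowered = text.lower()
    return [word.lower() in lowered for word in expected_words]

chunk = "Subject: Re: Ruchi Indian Cuisine Deal - NYC - Downtown: Amazon Local"
print(keyword_checks(chunk, ["bagel", "pizza", "burger", "cuisine"]))
# prints [False, False, False, True]
```

A chunk about lunch plans that never mentions any of the four words would come back all False, which is exactly the kind of result a keyword search would miss.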

In [13]:
a_few_docs_about_food = ["p603c4", "p610c4", "p610c1", "p1938c0", "p1046c5", "p1046c9"]
search = searcher.query_by_docs(a_few_docs_about_food, 
                                ["bagel", "pizza", "burger", "cuisine"], 5000)
# search.to_csv()
search.show(show_seed_docs=False)

p70c2
http://example.com/p70
sanity checks: ([False, False, False, True])

p1359c0
http://example.com/p1359
sanity checks: ([False, False, False, False])

p1939c4
http://example.com/p1939
sanity checks: ([False, False, False, False])

p1503c0
http://example.com/p1503
sanity checks: ([False, False, False, False])

p1236c0
http://example.com/p1236
sanity checks: ([False, False, False, False])

p556c1
http://example.com/p556
sanity checks: ([False, False, False, False])

p557c2
http://example.com/p557
sanity checks: ([False, False, False, False])

p1586c1
http://example.com/p1586
sanity checks: ([False, False, False, False])

p761c5
http://example.com/p761
sanity checks: ([False, False, False, False])

p1058c9
http://example.com/p1058
sanity checks: ([False, False, False, False])

p1059c8
http://example.com/p1059
sanity checks: ([False, False, False, False])

p1940c1
http://example.com/p1940
sanity checks: ([False, False, False, False])

p1416c8
http://example.com/p1416
sanity checks: ([F

p193c3
http://example.com/p193
sanity checks: ([False, False, False, False])

p2219c12
http://example.com/p2219
sanity checks: ([False, False, False, False])

p1344c0
http://example.com/p1344
sanity checks: ([False, False, False, False])

p1509c2
http://example.com/p1509
sanity checks: ([False, False, False, False])

p2474c7
http://example.com/p2474
sanity checks: ([False, False, False, False])

p773c0
http://example.com/p773
sanity checks: ([False, False, False, False])

p1026c0
http://example.com/p1026
sanity checks: ([False, False, False, False])

p827c0
http://example.com/p827
sanity checks: ([False, False, False, False])

p1450c5
http://example.com/p1450
sanity checks: ([False, False, False, False])

p4225c0
http://example.com/p4225
sanity checks: ([False, False, False, False])

p2286c1
http://example.com/p2286
sanity checks: ([False, False, False, False])

p2226c0
http://example.com/p2226
sanity checks: ([False, False, False, False])

p613c8
http://example.com/p613
sanity checks:


p1590c0
http://example.com/p1590
sanity checks: ([False, False, False, False])

p2412c10
http://example.com/p2412
sanity checks: ([False, False, False, False])

p120c0
http://example.com/p120
sanity checks: ([False, False, False, False])

p1890c3
http://example.com/p1890
sanity checks: ([False, False, False, False])

p1773c9
http://example.com/p1773
sanity checks: ([False, False, False, False])

p1776c8
http://example.com/p1776
sanity checks: ([False, False, False, False])

p920c3
http://example.com/p920
sanity checks: ([False, False, False, False])

p1142c10
http://example.com/p1142
sanity checks: ([False, False, False, False])

p2438c14
http://example.com/p2438
sanity checks: ([False, False, False, False])

p2397c15
http://example.com/p2397
sanity checks: ([False, False, False, False])

p2412c22
http://example.com/p2412
sanity checks: ([False, False, False, False])

p1144c5
http://example.com/p1144
sanity checks: ([False, False, False, False])

p1150c5
http://example.com/p1150
sanit

sanity checks: ([False, False, False, False])

p3694c1
http://example.com/p3694
sanity checks: ([False, False, False, False])

p1493c0
http://example.com/p1493
sanity checks: ([False, False, False, False])

p1118c0
http://example.com/p1118
sanity checks: ([False, False, False, False])

p3936c4
http://example.com/p3936
sanity checks: ([False, False, False, False])

p107c4
http://example.com/p107
sanity checks: ([False, False, False, False])

p94c3
http://example.com/p94
sanity checks: ([False, False, False, False])

p1857c1
http://example.com/p1857
sanity checks: ([False, False, False, False])

p1623c2
http://example.com/p1623
sanity checks: ([False, False, False, False])

p2644c3
http://example.com/p2644
sanity checks: ([False, False, False, False])

p1065c6
http://example.com/p1065
sanity checks: ([False, False, False, False])

p1306c2
http://example.com/p1306
sanity checks: ([False, False, False, False])

p826c0
http://example.com/p826
sanity checks: ([False, False, False, False])

p

p2692c4
http://example.com/p2692
sanity checks: ([False, False, False, False])

p2490c0
http://example.com/p2490
sanity checks: ([False, False, False, False])

p2089c0
http://example.com/p2089
sanity checks: ([False, False, False, False])

p3279c0
http://example.com/p3279
sanity checks: ([False, False, False, False])

p1991c0
http://example.com/p1991
sanity checks: ([False, False, False, False])

p1994c0
http://example.com/p1994
sanity checks: ([False, False, False, False])

p1997c0
http://example.com/p1997
sanity checks: ([False, False, False, False])

p71c2
http://example.com/p71
sanity checks: ([False, False, False, False])

p2398c5
http://example.com/p2398
sanity checks: ([False, False, False, False])

p2413c10
http://example.com/p2413
sanity checks: ([False, False, False, False])

p2439c5
http://example.com/p2439
sanity checks: ([False, False, False, False])

p527c0
http://example.com/p527
sanity checks: ([False, False, False, False])

p279c2
http://example.com/p279
sanity checks:

p1687c0
http://example.com/p1687
sanity checks: ([False, False, False, False])

p2077c1
http://example.com/p2077
sanity checks: ([False, False, False, False])

p1481c0
http://example.com/p1481
sanity checks: ([False, False, False, False])

p1032c0
http://example.com/p1032
sanity checks: ([False, False, False, False])

p3236c2
http://example.com/p3236
sanity checks: ([False, False, False, False])

p94c6
http://example.com/p94
sanity checks: ([False, False, False, False])

p98c0
http://example.com/p98
sanity checks: ([False, False, False, False])

p3088c0
http://example.com/p3088
sanity checks: ([False, False, False, False])

p1017c0
http://example.com/p1017
sanity checks: ([False, False, False, False])

p2580c1
http://example.com/p2580
sanity checks: ([False, False, False, False])

p705c4
http://example.com/p705
sanity checks: ([False, False, False, False])

p1804c4
http://example.com/p1804
sanity checks: ([False, False, False, False])

p203c11
http://example.com/p203
sanity checks: ([F

p206c2
http://example.com/p206
sanity checks: ([False, False, False, False])

p312c3
http://example.com/p312
sanity checks: ([False, False, False, False])

p1167c1
http://example.com/p1167
sanity checks: ([False, False, False, False])

p3861c7
http://example.com/p3861
sanity checks: ([False, False, False, False])

p3297c3
http://example.com/p3297
sanity checks: ([False, False, False, False])

p3316c2
http://example.com/p3316
sanity checks: ([False, False, False, False])

p3332c2
http://example.com/p3332
sanity checks: ([False, False, False, False])

p763c2
http://example.com/p763
sanity checks: ([False, False, False, False])

p1485c0
http://example.com/p1485
sanity checks: ([False, False, False, False])

p751c0
http://example.com/p751
sanity checks: ([False, False, False, False])

p3997c4
http://example.com/p3997
sanity checks: ([False, False, False, False])

p1056c3
http://example.com/p1056
sanity checks: ([False, False, False, False])

p1669c0
http://example.com/p1669
sanity checks: 

p1870c0
http://example.com/p1870
sanity checks: ([False, False, False, False])

p1033c5
http://example.com/p1033
sanity checks: ([False, False, False, False])

p1034c7
http://example.com/p1034
sanity checks: ([False, False, False, False])

p1511c4
http://example.com/p1511
sanity checks: ([False, False, False, False])

p2568c0
http://example.com/p2568
sanity checks: ([False, False, False, False])

p785c14
http://example.com/p785
sanity checks: ([False, False, False, False])

p883c1
http://example.com/p883
sanity checks: ([False, False, False, False])

p1043c0
http://example.com/p1043
sanity checks: ([False, False, False, False])

p1904c0
http://example.com/p1904
sanity checks: ([False, False, False, False])

p3860c1
http://example.com/p3860
sanity checks: ([False, False, False, False])

p3903c4
http://example.com/p3903
sanity checks: ([False, False, False, False])

p3945c2
http://example.com/p3945
sanity checks: ([False, False, False, False])

p2489c6
http://example.com/p2489
sanity che


p1037c4
http://example.com/p1037
sanity checks: ([False, False, False, False])

p1040c5
http://example.com/p1040
sanity checks: ([False, False, False, False])

p1808c0
http://example.com/p1808
sanity checks: ([False, False, False, False])

p2767c1
http://example.com/p2767
sanity checks: ([False, False, False, False])

p2529c2
http://example.com/p2529
sanity checks: ([False, False, False, False])

p1832c4
http://example.com/p1832
sanity checks: ([False, False, False, False])

p1861c6
http://example.com/p1861
sanity checks: ([False, False, False, False])

p1826c1
http://example.com/p1826
sanity checks: ([False, False, False, False])

p1849c4
http://example.com/p1849
sanity checks: ([False, False, False, False])

p1884c6
http://example.com/p1884
sanity checks: ([False, False, False, False])

p2090c3
http://example.com/p2090
sanity checks: ([False, False, False, False])

p2666c1
http://example.com/p2666
sanity checks: ([False, False, False, False])

p2078c5
http://example.com/p2078
sanity

http://example.com/p21
sanity checks: ([False, False, False, False])

p4012c0
http://example.com/p4012
sanity checks: ([False, False, False, False])

p2783c4
http://example.com/p2783
sanity checks: ([False, False, False, False])

p2818c8
http://example.com/p2818
sanity checks: ([False, False, False, False])

p2493c2
http://example.com/p2493
sanity checks: ([False, False, False, False])

p2325c4
http://example.com/p2325
sanity checks: ([False, False, False, False])

p766c3
http://example.com/p766
sanity checks: ([False, False, False, False])

p1819c8
http://example.com/p1819
sanity checks: ([False, False, False, False])

p1037c6
http://example.com/p1037
sanity checks: ([False, False, False, False])

p1040c7
http://example.com/p1040
sanity checks: ([False, False, False, False])

p1060c10
http://example.com/p1060
sanity checks: ([False, False, False, False])

p1506c4
http://example.com/p1506
sanity checks: ([False, False, False, False])

p445c0
http://example.com/p445
sanity checks: ([Fal

sanity checks: ([False, False, False, False])

p1911c3
http://example.com/p1911
sanity checks: ([False, False, False, False])

p1668c8
http://example.com/p1668
sanity checks: ([False, False, False, False])

p1691c1
http://example.com/p1691
sanity checks: ([False, False, False, False])

p1869c2
http://example.com/p1869
sanity checks: ([False, False, False, False])

p2524c7
http://example.com/p2524
sanity checks: ([False, False, False, False])

p2997c2
http://example.com/p2997
sanity checks: ([False, False, False, False])

p3398c4
http://example.com/p3398
sanity checks: ([False, False, False, False])

p3362c8
http://example.com/p3362
sanity checks: ([False, False, False, False])

p1188c20
http://example.com/p1188
sanity checks: ([False, False, False, False])

p1192c25
http://example.com/p1192
sanity checks: ([False, False, False, False])

p1544c3
http://example.com/p1544
sanity checks: ([False, False, False, False])

p1550c1
http://example.com/p1550
sanity checks: ([False, False, False, 


p259c0
http://example.com/p259
sanity checks: ([False, False, False, False])

p955c4
http://example.com/p955
sanity checks: ([False, False, False, False])

p993c4
http://example.com/p993
sanity checks: ([False, False, False, False])

p1008c8
http://example.com/p1008
sanity checks: ([False, False, False, False])

p2404c12
http://example.com/p2404
sanity checks: ([False, False, False, False])

p2141c7
http://example.com/p2141
sanity checks: ([False, False, False, False])

p2167c3
http://example.com/p2167
sanity checks: ([False, False, False, False])

p2175c5
http://example.com/p2175
sanity checks: ([False, False, False, False])

p2289c5
http://example.com/p2289
sanity checks: ([False, False, False, False])

p2301c12
http://example.com/p2301
sanity checks: ([False, False, False, False])

p2311c5
http://example.com/p2311
sanity checks: ([False, False, False, False])

p2320c0
http://example.com/p2320
sanity checks: ([False, False, False, False])

p2340c3
http://example.com/p2340
sanity che

### Useful tools:

Here's a way to look at the full text of a document (so you can find a chunk number). After that is a way to see just the text of a particular chunk.
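
Chunk IDs in the output above appear to follow the pattern `p<document>c<chunk>`; for example, `p70c4` looks like chunk 4 of document `p70`. Assuming that pattern holds, a small hypothetical helper, `split_chunk_id`, can recover both parts:

```python
import re

# assumed ID format: "p", digits, "c", digits (inferred from the sample output)
CHUNK_ID = re.compile(r"^(p\d+)c(\d+)$")

def split_chunk_id(chunk_id):
    match = CHUNK_ID.match(chunk_id)
    if match is None:
        raise ValueError(f"not a chunk ID: {chunk_id!r}")
    # return the document ID as a string and the chunk number as an int
    return match.group(1), int(match.group(2))

print(split_chunk_id("p70c4"))  # prints ('p70', 4)
```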

In [10]:
# a quick way to look at the components of a document, for refining the lists of search docs.

doc_id = "p70"
full_text_res = es.get(index=ES_INDEX_FULL_TEXT, id=doc_id)
full_text = full_text_res["_source"]["text"]
print(full_text[:1000])
print(full_text_res["_source"]["routing"])
[(i, graf) for i, graf in enumerate(to_short_paragraphs(full_text))]

From:Litvak, Gwendolyn
To:"Jason Chiusano"
; 
Connie Chung
Cc:Daria Siegel
; 
; 
Nicole.Kolinsky@berlinrosen.comSubject:RE: Ruchi Indian Cuisine Deal - NYC - Downtown: Amazon Local
Date:Monday, April 07, 2014 12:59:02 PM
I seemed to have missed this. Qusano- could you send me
  sound clip si vous plait
?  From: Jason Chiusano
 Sent: Monday, April 07, 2014 12:52 PM
To: Connie Chung
Cc: Daria Siegel; Litvak, Gwendolyn; 
 Nicole.Kolinsky@berlinrosen.comSubject: Re: Ruchi Indian Cuisine Deal - NYC - Downtown: Amazon Local
 I get caught up on the W. 
My Long Island tongue doesn't allow for the proper pronunciation. 
 
 
 On Mon, Apr 7, 2014 at 12:50 PM, Connie Chung <CChung@hraadvisors.com> wrote:mulligaTOAHnyand I'm in
 CONNIE J. C
HUNGSenior Analyst
  |  HR&A A
DVISORS, INC.Direct: 
(646) 545-6248
  |  Mobile: 
(917) 699-7893
 From: Daria Siegel [mailto:
dsiegel@downtownny.com] Sent: Monday, April 07, 2014 10:57 AM
To: Litvak, Gwendolyn; '
'; 
Cc: 'Nicole.Kolinsky@berlinrosen.com'; Connie

[(0,
  'From:Litvak, Gwendolyn To:"Jason Chiusano" ; Connie Chung Cc:Daria Siegel ; ; Nicole.Kolinsky@berlinrosen.comSubject:RE: Ruchi Indian Cuisine Deal - NYC - Downtown: Amazon Local Date:Monday, April 07, 2014 12:59:02 PM I seemed to have missed this.'),
 (1,
  "From: Jason Chiusano Sent: Monday, April 07, 2014 12:52 PM To: Connie Chung Cc: Daria Siegel; Litvak, Gwendolyn; Nicole.Kolinsky@berlinrosen.comSubject: Re: Ruchi Indian Cuisine Deal - NYC - Downtown: Amazon Local I get caught up on the W. My Long Island tongue doesn't allow for the proper pronunciation."),
 (2,
  "From: Litvak, Gwendolyn [ mailto:GLitvak@cityhall.nyc.gov] Sent: Monday, April 07, 2014 10:54 AM To: '; ' 'Cc: 'Nicole.Kolinsky@berlinrosen.com'; Daria Siegel; ' cchung@hraadvisors.com'Subject: Re: Ruchi Indian Cuisine Deal - NYC - Downtown: Amazon Local I don't care for indian food but ill sit and have a diet coke or a lemonade."),
 (3, 'Are we all going?'),
 (4,
  'From: Nick Cotz Sent: Monday, April 07, 2014 1

In [14]:
# chunk text

doc_id = "p70c4"
res = es.get(index=ES_INDEX_CHUNK, id=doc_id)
res["_source"]["text"]

'From: Nick Cotz Sent: Monday, April 07, 2014 10:53 AM To: Jason Chiusano Cc: Nicole Kolinsky < Nicole.Kolinsky@berlinrosen.com>; dsiegel@downtownny.com; cchung@hraadvisors.com ; Litvak, Gwendolyn Subject: Re: Ruchi Indian Cuisine Deal - NYC - Downtown: Amazon Local'