# Doc2vec for Investigative Journalism


[Read our article about using doc2vec for the Mauritius Leaks](https://qz.com/1670632/how-quartz-used-ai-to-help-reporters-search-the-mauritius-leaks/)

Doc2vec is a powerful tool we've used to help search the Mauritius Leaks documents. But it's useful more generally: it gives you a multipurpose notion of **similarity** that you can use to find documents about the same thing as documents you already have.


### Goal: Find docs about _homelessness_

#### That don't contain the word "homelessness".

In this tutorial, we're going to look at some emails from the office of New York City mayor Bill de Blasio that were released under the Freedom of Information Law. (The emails were part of the "Agent of the City" hubbub; you can download the original file [here](https://a860-openrecords.nyc.gov/response/120252?token=c784372fd140497081b4bfcff9f0e3a0).)

Let's **pretend** that we're reporters interested in writing about what the de Blasio administration has said internally, over time, about its plan to address **homelessness** in the city. And let's pretend that we have a keyword search system for these emails. For this one file, the search bar in our PDF viewer would work fine, but larger leaks with heterogeneous file types (like the Mauritius Leaks) would need a more heavyweight software solution.

Since we have a keyword search system, we can just search for "homelessness" and get a lot of results. But what if we want to find the ones we're missing, the ones that don't contain that keyword?
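
To make that gap concrete, here's a minimal sketch of a keyword search over a few toy pages. The page text and the `keyword_search` helper are invented for illustration; they aren't taken from the emails or from the rest of this notebook.

```python
# Toy pages, invented for illustration; not taken from the de Blasio emails.
pages = {
    "a": "The mayor's plan to address homelessness opens 90 new shelters.",
    "b": "Talking points on shelter capacity and families in temporary housing.",
    "c": "Scheduling note: ribbon cutting for the new ferry route.",
}

def keyword_search(pages, term):
    """Return the IDs of pages whose text contains the term (case-insensitive)."""
    return [page_id for page_id, text in pages.items() if term.lower() in text.lower()]

hits = keyword_search(pages, "homelessness")
missed = [page_id for page_id in pages if page_id not in hits]
print(hits)    # pages the keyword search finds
print(missed)  # everything else, including page "b", which is on topic
```

Page "b" is the kind of document we're after: it's about the same subject, but it never uses the keyword.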


### Dependencies
You'll need to have the following dependencies installed to run this notebook.

* `pip install pypdf2 gensim scipy requests tqdm elasticsearch elasticsearch-dsl`
* `brew install elasticsearch` (or `apt-get install elasticsearch`; I'm sure there's a way to do it on Windows, too).



## Step 0. Convert our documents to JSONL

As part of the training process, we'll have to iterate over our documents several times. Rather than re-parse the PDF each time, we'll parse it once and write the results to a file. I've chosen JSONL to store the text of each page: it's just a plain text file with one JSON object per line. The shape of each object (`{"_source": {"content": "the actual content"}}`) is meant to mimic the output of Apache Tika.

In [1]:
import PyPDF2
import json
from os.path import exists
jsonl_file = "nyc_docs.jsonl"
if not exists(jsonl_file):
    # parse the PDF once, writing one JSON object per page.
    # (PdfFileReader/getNumPages/getPage/extractText are the pre-3.0 PyPDF2 names.)
    with open('2018.05.24_BerlinRosen_Responsive_Records.pdf', 'rb') as pdf_file, open(jsonl_file, 'w') as f:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        for page_num in range(read_pdf.getNumPages()):
            page = read_pdf.getPage(page_num)
            page_content = page.extractText()
            f.write(json.dumps({"_source": {"content": page_content}, "_id": f"p{page_num}"}) + "\n")
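
Reading the file back is the mirror image of writing it. Here's a small sketch of how one might iterate over a JSONL file in this format; `iter_pages` is a hypothetical helper (not part of the repo), and the sample file with made-up page text exists only so the example runs on its own.

```python
import json
import os
import tempfile

def iter_pages(path):
    """Yield (page_id, text) for each line of a JSONL file in the format used above."""
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            yield record["_id"], record["_source"]["content"]

# A tiny sample file so the sketch runs on its own; the page text is made up.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(sample_path, "w") as f:
    for page_num, text in enumerate(["first page", "second page"]):
        f.write(json.dumps({"_source": {"content": text}, "_id": f"p{page_num}"}) + "\n")

pages = list(iter_pages(sample_path))
print(pages)
```

Because each line is a complete JSON object, we never have to hold the whole file in memory, which matters more for big leaks than for this one PDF.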

## Step 1. Train a model with Doc2vec

The resulting model is included in the repo. (Training it from scratch takes about 2 minutes and 30 seconds.)

In [50]:
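from utilities import *  # the repo's helpers (SugarcaneJsonlIterator is assumed to come from here)
from gensim.models import Doc2Vec
from os import makedirs
from os.path import dirname, exists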
import logging
from tqdm import tqdm_notebook
from datetime import datetime
start = datetime.now()

model_filename = "models/nyc_docs.jsonl.model"

if not exists(model_filename):
    # these are some settings you might want to change, under certain circumstances.
    num_epochs = 20
    vector_size = 100
    window = 5
    alpha = 0.025 # aka learning rate
    min_count = 5

    model = Doc2Vec(
        vector_size=vector_size, 
        dbow_words = 1, 
        dm=0,
        epochs=1, 
        workers=4,
        window=window, 
        seed=1337, 
        min_count=min_count,
        alpha=alpha, 
        min_alpha=alpha
      )
    makedirs(dirname(model_filename), exist_ok=True)

    vocab_iterator = SugarcaneJsonlIterator(jsonl_file, ngrams_type="bigrams")
    model.build_vocab(vocab_iterator)

    for epoch in tqdm_notebook(range(num_epochs)):
        model.train(SugarcaneJsonlIterator(jsonl_file, ngrams_type="bigrams"),
            total_examples=model.corpus_count, 
            epochs=1
          )
        model.save(model_filename)

    elapsed = int((datetime.now() - start).total_seconds())
    print("finished training doc2vec at " + str(datetime.now()))
    print("training doc2vec took {} seconds ({}h {}m)".format(elapsed, elapsed // 3600, (elapsed % 3600) // 60))
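
`SugarcaneJsonlIterator` lives in the repo's `utilities` module and isn't shown here. As a rough idea of what an iterator like it has to provide: doc2vec trains on a stream of tagged documents (a list of tokens plus a list of tags each), and the stream has to be restartable, since we pass over it once to build the vocabulary and once per epoch. The sketch below is an assumption about the general shape, not the repo's implementation: the tokenizer, the underscore-joined bigrams, the `JsonlPageIterator` name and the `TaggedPage` stand-in (which mirrors the `words` and `tags` fields of gensim's `TaggedDocument`) are all hypothetical.

```python
import json
import re
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument, which has the same two fields.
TaggedPage = namedtuple("TaggedPage", ["words", "tags"])

def tokenize_with_bigrams(text):
    """Lowercase word tokens, followed by adjacent pairs joined with an underscore."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams

class JsonlPageIterator:
    """Re-iterable stream of tagged pages read from JSONL lines."""

    def __init__(self, lines):
        self.lines = lines  # any re-iterable of JSONL strings, e.g. a list

    def __iter__(self):
        for line in self.lines:
            record = json.loads(line)
            words = tokenize_with_bigrams(record["_source"]["content"])
            yield TaggedPage(words, [record["_id"]])

# One made-up page, in the same JSONL shape as nyc_docs.jsonl.
lines = [json.dumps({"_source": {"content": "Homeless shelter plan"}, "_id": "p0"})]
pages = JsonlPageIterator(lines)
print(list(pages))
```

Tagging each page with its ID is what lets us ask the trained model for "pages similar to `p1154`" later on.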


## Step 2. Put documents into ElasticSearch

We need a way to search the documents (to simulate what a reporter with a keyword-search tool could do) and a way to look up the individual documents whose IDs are returned by the doc2vec model. I chose to do this with ElasticSearch, but PostgreSQL or other tools would work fine.

This assumes ElasticSearch is running locally on port 9200. (If you installed ElasticSearch with `brew`, you may have to run `brew services start elasticsearch` to actually start ElasticSearch.)

In [14]:
import requests

index_name = "nycdocs"

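# create the index. if it already exists, ElasticSearch answers with an HTTP 400,
# which requests hands back as an ordinary response (it doesn't raise), so we can ignore it.
# a connection error here means ElasticSearch isn't running, and we want to see that.
requests.put(f"http://localhost:9200/{index_name}")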

with open(jsonl_file, 'r') as f:
    for line in f:
        line = json.loads(line)
        source = line["_source"]
        source["id"] = line["_id"]
        requests.post(f"http://localhost:9200/{index_name}/_doc/", data=json.dumps(source), headers={"Content-Type": "application/json"})

In [15]:
# let's make sure that there's data in the ElasticSearch index before we go forward.
ret = requests.get(f"http://localhost:9200/{index_name}/_count")
assert json.loads(ret.content)["count"] > 0
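
To illustrate the point that other tools would work fine, here's a sketch of the same two operations (keyword search and lookup by ID) using Python's built-in `sqlite3` instead. This is a swapped-in alternative, not what the rest of this notebook uses; the table layout, the helper names and the page text are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, just for the sketch
conn.execute("CREATE TABLE pages (id TEXT PRIMARY KEY, content TEXT)")

# Made-up page text; in the notebook these rows would come from nyc_docs.jsonl.
conn.executemany(
    "INSERT INTO pages (id, content) VALUES (?, ?)",
    [("p0", "Plan to address homelessness"), ("p1", "Shelter capacity update")],
)

def keyword_search(conn, term):
    """IDs of pages whose content contains the term (LIKE is case-insensitive for ASCII)."""
    rows = conn.execute(
        "SELECT id FROM pages WHERE content LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in rows]

def get_content(conn, page_id):
    """Content of a single page, or None if the ID is unknown."""
    row = conn.execute("SELECT content FROM pages WHERE id = ?", (page_id,)).fetchone()
    return row[0] if row else None

print(keyword_search(conn, "HOMELESSNESS"))
print(get_content(conn, "p1"))
```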

## Step 3. Query our model

In [16]:
from gensim.models import Doc2Vec
model = Doc2Vec.load(model_filename)

array([ 0.14892638,  0.84225196,  0.5657305 , -0.05731661,  1.0092884 ,
       -0.16890779, -0.9475695 , -0.31454888,  0.42846683,  0.9906296 ,
        1.3255194 , -0.9393652 ,  1.0227212 ,  0.05691548,  0.288966  ,
        0.4966944 , -0.3193189 ,  0.52391315,  0.23011431, -0.55094516,
        0.5415337 ,  0.53747773, -0.29408413, -0.4506765 ,  0.58461386,
       -1.0946444 ,  0.58335054, -0.16434972, -0.24865314, -0.15691808,
       -0.08979879, -0.23274064, -0.55003446,  0.8960646 ,  0.24233627,
       -0.45479947, -0.1096103 , -0.22644855, -0.04939383, -0.65178746,
       -0.32949883,  0.54339314, -0.9402834 , -0.10056758,  0.27990496,
       -0.52813953, -0.38317814,  0.54720885, -0.05945592,  0.9192372 ,
        0.36784324, -0.09500701, -0.08845058,  1.002136  , -0.80019456,
        0.3984145 ,  0.21191886, -0.36903706,  0.03561793,  0.72101   ,
       -0.03929467,  0.22953278, -0.21876392,  0.7107592 ,  0.28367153,
       -0.7346413 , -0.26492935,  0.7894046 , -0.06782962, -0.49

In [48]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import csv 
class Doc2vecSearcher:
    def __init__(self, client, d2v_model):
        self.client = client
        self.model = d2v_model
        
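    def has_term_in_content(self, term, es_id):
        """check whether the page with this ID contains the term (case-insensitive)"""
        # use the client we were given, not whatever global `client` happens to exist.
        r = Search(using=self.client, index=index_name) \
        .query("match", id=es_id)
        resp = r.execute()
        return term.lower() in resp[0].content.lower()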

    def search_for_documents_like_this_that_search_would_miss(self, documents, search_term, topn=300):
        """find documents similar to those found by a given search, but which are missing the search term"""
        maybe_matches = self.model.docvecs.most_similar(documents, topn=topn)

        # filter out filepaths (since those are useless to me.)
        maybe_matches = [id_score for id_score in maybe_matches if '/' not in id_score[0]]

        # add metadata to rows.
        output = []
        for match in maybe_matches:
            row = {
                "page_num": match[0],
                "url": "http://localhost:9200/nycdocs/_search?q=id:" + match[0],
                "score": match[1],
                "in_documents": match[0] in documents,
                "matches_search_term": self.has_term_in_content(search_term, match[0]),
            }
            output.append(row)
        return output

    def to_csv(self, search_results, search_term=None, csv_fn=None):
        """write search results to a CSV, named after the search term unless csv_fn is given"""
        # build the column header once, so the header row and the data rows always agree.
        match_col = f"matches search term (\"{search_term}\")" if search_term else "matches search term"
        if not csv_fn:
            # search_term may be None, so fall back to a generic filename.
            csv_fn = (search_term.replace(" ", "_") if search_term else "search_results") + ".csv"
        with open(csv_fn, 'w', newline='') as csvfile:
            fieldnames = ["page_num", "url", "score", match_col]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')

            writer.writeheader()
            for row in search_results:
                row = row.copy()
                row[match_col] = row["matches_search_term"]
                writer.writerow(row)
client = Elasticsearch()
doc_searcher = Doc2vecSearcher(client, model)
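
It can help to see what a "most similar" query boils down to. Roughly speaking, when you hand the model several example documents, it ranks every other document by cosine similarity to the average of the examples' vectors. The sketch below imitates that idea in plain Python on made-up 2-dimensional vectors; `unit` and `rank_similar` are hypothetical helpers, and this is an approximation of the idea rather than gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_similar(vectors, seed_ids, topn=3):
    """Rank non-seed IDs by cosine similarity to the mean of the seed vectors."""
    seeds = [unit(vectors[s]) for s in seed_ids]
    mean = unit([sum(col) / len(seeds) for col in zip(*seeds)])
    scores = [
        (doc_id, sum(a * b for a, b in zip(mean, unit(vec))))
        for doc_id, vec in vectors.items()
        if doc_id not in seed_ids  # the examples themselves are left out
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-dimensional vectors; the real doc2vec vectors here have 100 dimensions.
vectors = {
    "p1": [1.0, 0.1],
    "p2": [0.9, 0.2],
    "p3": [0.8, 0.3],   # points roughly the same way as p1 and p2
    "p4": [-0.2, 1.0],  # points somewhere else entirely
}
ranked = rank_similar(vectors, ["p1", "p2"])
print(ranked)
```

Nothing in this ranking looks at the words on the page, which is why it can surface documents a keyword search would miss.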

### Finding emails about homelessness


In [38]:
known_pages_about_homelessness = [ 
    "p1154",
    "p1262",
    "p3813"
]

other_homelessness_docs = doc_searcher.search_for_documents_like_this_that_search_would_miss(known_pages_about_homelessness, "homelessness")
matches = [doc for doc in other_homelessness_docs if not doc["in_documents"] and not doc["matches_search_term"]]
print("here are 5 documents that the model thinks are most similar to our examples")
print("BUT which don't contain the word 'homelessness'")
matches[0:5]

here are 5 documents that the model thinks are most similar to our examples
BUT which don't contain the word 'homelessness'

[{'page_num': 'p1972',
  'url': 'http://localhost:9200/nycdocs/_search?q=id:p1972',
  'score': 0.5875276327133179,
  'in_documents': False,
  'matches_search_term': False},
 {'page_num': 'p1693',
  'url': 'http://localhost:9200/nycdocs/_search?q=id:p1693',
  'score': 0.5547754168510437,
  'in_documents': False,
  'matches_search_term': False},
 {'page_num': 'p1103',
  'url': 'http://localhost:9200/nycdocs/_search?q=id:p1103',
  'score': 0.5315240621566772,
  'in_documents': False,
  'matches_search_term': False},
 {'page_num': 'p1054',
  'url': 'http://localhost:9200/nycdocs/_search?q=id:p1054',
  'score': 0.5300723910331726,
  'in_documents': False,
  'matches_search_term': False},
 {'page_num': 'p1077',
  'url': 'http://localhost:9200/nycdocs/_search?q=id:p1077',
  'score': 0.5211280584335327,
  'in_documents': False,
  'matches_search_term': False}]

If we want to, we can write these docs to a CSV to look at them later.

In [49]:
doc_searcher.to_csv(other_homelessness_docs, "homelessness")
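
When we come back to the CSV later, we'll probably want just the rows the keyword search would have missed. Here's a sketch of reading the file back with the standard `csv` module. To keep the example self-contained it reads from an in-memory string holding two made-up rows (the page IDs and scores are invented) instead of the real `homelessness.csv`.

```python
import csv
import io

# Two made-up rows in the same shape that to_csv writes for the term "homelessness".
csv_text = (
    'page_num,url,score,"matches search term (""homelessness"")"\n'
    "p10,http://localhost:9200/nycdocs/_search?q=id:p10,0.61,True\n"
    "p11,http://localhost:9200/nycdocs/_search?q=id:p11,0.58,False\n"
)

reader = csv.DictReader(io.StringIO(csv_text))
match_col = reader.fieldnames[-1]  # the "matches search term (...)" column
# the csv module stores booleans as the strings "True"/"False", so compare to the string.
missed_by_keyword = [row["page_num"] for row in reader if row[match_col] == "False"]
print(missed_by_keyword)
```

For the real file, swap `io.StringIO(csv_text)` for `open("homelessness.csv", newline="")`.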
