# Ad Hoc Relevance Ranking

Main task: Extract document segments relevant to a set of keywords

Secondary tasks:
* [ ] Select words within the document segment to highlight for easier reading
* [ ] Select the top-k most relevant document segments for each user given their keywords
* [ ] Keep track of which document segments have been sent to which users and don't duplicate

## Segmenting

TODO: Make this more sophisticated so that the conceptual breaks between sections are respected

Input: A set of documents

Output:

* All tokens from all documents
* All tokens for each document
* A set of document segments:
    * the original text of the segment
    * the tokens from that segment
    * the page number that the start of the segment comes from (TODO: fix this, right now it's the end)
    * the document filename that the segment comes from
    
Method:

Go through each document line by line and collect nonoverlapping sets of lines that are each at least W tokens long. (W = 100)

## Relevance score

Input:

* Set of candidate document segments
* Set of keywords

Output:

A numeric score for each document segment, where a higher score is a better match to the keywords.
Some documents may have a score of None if there were no useful tokens to compare to the keywords.

Method:

1. Compute inverse document frequency for each keyword

    * I actually used inverse document proportion... but it's the same by a scaling factor
    
    * $\frac{1}{\mbox{# documents in which keyword appears + smoothing}}$
    
    * smoothing = 1

2. Vectorize keywords and tokens in each document_segment

3. For each document segment:

    a. Get pairwise cosine similarity between each keyword and each token in the document segment
    
    b. Get average cosine similarity (across document segment tokens) for each keyword
    
    c. Sum average cosine similarity for each keyword, weighted by that keyword's inverse document frequency



## Evaluation

### Precision

This is easier to evaluate

### Recall

This is harder to evaluate

## TODO

* [x] narrow down document set by metadata
    * `metadata.csv`
    * use `DocumentManager`
* [x] mock-up of input data
* [ ] clean up code
* [ ] piece together with holden's interface
* [ ] keep track of users
* [ ] add more users and keywords
* [ ] figure out appropriate casing & smoothing
* [ ] adapt for different-length sections
* [ ] figure out sectioning so we get more coherent sections
* [ ] work out evaluation metrics
* [ ] try different vectors? bigrams? contextual??

In [514]:
input_data = [
  {
    "userid": "0912341",
    "queries": [
      {
        "keywords": [
          "affordable",
          "housing",
          "ADU",
          "vote",
          "residential",
          "homeless"
        ],
        "start_date": "2019-05-01",
        "end_date": None,
        "municipalities": [
          "San Jose",
          "Cupertino",
          "Sunnyvale",
          "Palo Alto",
          "Mountain View"
        ]
      },
      {
        "keywords": [
          "affordable",
          "housing",
          "ADU",
          "vote",
          "residential",
          "homeless"
        ],
        "start_date": "2019-05-01",
        "end_date": None,
        "municipalities": [
          "San Jose",
          "Cupertino",
          "Sunnyvale",
          "Palo Alto",
          "Mountain View"
        ]
      }
    ]
  },
  {
    "userid": "1029834",
    "queries": [
      {
        "keywords": [
          "affordable",
          "housing",
          "ADU",
          "vote",
          "residential",
          "homeless"
        ],
        "start_date": "2019-05-01",
        "end_date": None,
        "municipalities": [
          "Biggs",
          "Gridley"
        ]
      },
    ]
  },
  {
    "userid": "1241234",
    "queries": [
      {
        "keywords": [
          "affordable",
          "housing",
          "ADU",
          "vote",
          "residential",
          "homeless"
        ],
        "start_date": "2019-05-01",
        "end_date": None,
        "municipalities": [
          "San Jose",
          "Cupertino",
          "Sunnyvale",
          "Palo Alto",
          "Mountain View"
        ]
      },

      {
        "keywords": [
          "affordable",
          "housing",
          "ADU",
          "vote",
          "residential",
          "homeless"
        ],
        "start_date": "2019-05-01",
        "end_date": None,
        "municipalities": [
          "San Jose",
          "Cupertino",
          "Sunnyvale",
          "Palo Alto",
          "Mountain View"
        ]
      },

      {
        "keywords": [
          "affordable",
          "housing",
          "ADU",
          "vote",
          "residential",
          "homeless"
        ],
        "start_date": "2019-05-01",
        "end_date": None,
        "municipalities": [
          "San Jose",
          "Cupertino",
          "Sunnyvale",
          "Palo Alto",
          "Mountain View"
        ]
      }
    ]
  }
]

In [447]:
import pandas as pd
import numpy as np
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

print("loading language model")
# load language model (this takes a few minutes)
model = api.load('word2vec-google-news-300')
print("model loaded")

from datetime import datetime
import os
from autolocal.nlp import Tokenizer
from gensim.parsing.preprocessing import *

In [None]:
vectors = model.wv
del model
vectors.init_sims(True) # normalize the vectors (!), so we can use the dot product as similarity measure

print('embeddings loaded ')
print('loading docs ... ')

In [None]:
# vectors[["ADU", "housing"]]

In [465]:
# this is an example set of keywords for a single query
keywords = ['housing', 'affordable', 'homelessness', 'ADU', 'residential', 'homeless', 'vote']

In [466]:
# this is read metadata
metadata = pd.read_csv("../data/index_sfbay_small/metadata.csv")
def get_txt_name(f):
    path_parts = os.path.split(f)
    fname = path_parts[1]
    local_dir = os.path.basename(path_parts[0])
    return os.path.join(local_dir, fname)[:-4]+".txt"
metadata["txt_file"] = [get_txt_name(f) for f in metadata["local_path_pdf"]]
metadata = metadata[["txt_file", "city", "committee", "date"]]
metadata["date"] = [datetime.strptime(d, '%Y-%m-%d') for d in metadata["date"]]

In [467]:
# this is a relevance filter
# we'll have to modify to like, last week or something, or only agendas, or whatever
this_year = datetime(2019, 1, 1)
past_six_months = datetime(2019, 5, 1)
cities = ["Cupertino", "Sunnyvale", "Palo Alto", "San Jose"]
potential_documents = metadata
potential_documents = potential_documents[potential_documents["date"] >= past_six_months]
potential_documents = potential_documents[[(c in cities) for c in potential_documents["city"]]]

In [468]:
tokenizer = Tokenizer()

# TODO: is lowercasing necessary?
preprocess_filters = [
    lambda x: x.lower(),
    strip_punctuation,
    strip_numeric,
    strip_non_alphanum,
    strip_multiple_whitespaces,
    strip_numeric,
    remove_stopwords,
    strip_short
]

In [499]:
# this mess is the tokenizing, the reading document, the segmenting documents
# and also selecting relevant docs

documents = []
document_sections = []
# section_length = 20 # lines
section_length = 100 # tokens
min_section_length = 5

docs_tokens = []

document_files = metadata["txt_file"]
directory = '../data/docs'
for filename in document_files:
    try:
        f = open(os.path.join(directory, filename))
        document_tokens = []
        document_str = f.read()
        document_segment_lines = []
        document_segment_tokens = []
        document_tokens = []
        pages = document_str.split('\f')
        for p, page in enumerate(pages): 
            lines = page.split('\n')
            for line in lines:
                line_tokens = preprocess_string(line, filters=preprocess_filters)
                document_tokens += line_tokens
                docs_tokens += line_tokens
                potential_docs = [x for x in potential_documents["txt_file"]]
#                 print(potential_documents["txt_file"][0])
#                 print(filename)
#                 print(filename in potential_docs)
                if filename in potential_docs:
                    document_segment_lines.append(line)
                    document_segment_tokens += line_tokens
                    if len(document_segment_tokens) >= section_length:
                        document_sections.append((
                            document_segment_tokens,
                            p,
                            filename,
                            "\n".join(document_segment_lines)
                        ))
                        document_segment_lines = []
                        document_segment_tokens = []
            if len(document_segment_tokens) >= min_section_length:
                document_sections.append((
                    document_segment_tokens,
                    p,
                    filename,
                    "\n".join(document_segment_lines)
                ))
                document_segment_lines = []
                document_segment_tokens = []
            documents.append(document_tokens)
        if i%1000 == 0:
            print(i)          
        i+=1
    except FileNotFoundError:
        pass

26000
27000
28000
29000
30000
31000


In [510]:
# i think this is idf, but it might be something different...
# yeahhhh it's totally word frequency
# it should be idf. let's write that and see if it still works
from collections import Counter
word_counts = Counter(docs_tokens)
smoothing = 1
# smoothing = 100
doc_freqs = {}
doc_freq_total = 0

all_keywords = keywords
words_to_check = all_keywords

for w in word_counts:
    if w in vectors:
#     if w in vectors and w in words_to_check:
        doc_freqs[w] = word_counts[w] + smoothing
        doc_freq_total += word_counts[w] + smoothing
word_counts = None
doc_props = {w: doc_freqs[w]/doc_freq_total for w in doc_freqs}
inverse_doc_props = {w: (1/doc_props[w] if doc_props[w]>0 else 0) for w in doc_props}
# inverse_doc_props

In [511]:
# this is the scoring funcrtion
keyword_vectors = np.array([vectors[t] for t in keywords if t in inverse_doc_props])
keyword_weights = np.array([inverse_doc_props[t] for t in keywords if t in inverse_doc_props])
document_section_scores = []
for s, section in enumerate(document_sections):
        score = None
        section_tokens = section[0]
        # TODO: Zipf to figure out what the cutoff should be for normal communication
        if len(set(section_tokens))<20:
            score = 0
        else:
            section_vectors = np.array([vectors[t] for t in section_tokens if t in inverse_doc_props])
            if section_vectors.shape[0]>0:
    #             section_weights = np.array([inverse_doc_props[t] for t in section_tokens if t in inverse_doc_props])
                similarities = cosine_similarity(section_vectors, keyword_vectors)
    #             similarities = similarities * section_weights
    #             similarities = similarities*(similarities>0.2)
                keyword_similarities = np.mean(similarities, axis=0)
    #             keyword_similarities = np.average(similarities, axis=0, weights=section_weights)
                score = np.sum(keyword_similarities*keyword_weights)
        document_section_scores.append(score)

In [512]:
max_score = np.max(np.array([s for s in document_section_scores if s!=None]))
print(max_score)
best_doc_index = [i for i, s in enumerate(document_section_scores) if s==max_score][0]
print(document_sections[best_doc_index][3])

8326.09524166195
2.  Cross-reference this report to the September 24, 2019, City Council 
meeting.
.
Memorandum
Attachment A
Attachment B
Presentation

Homelessness and Housing Insecurity Among Families with Children.  
[REFERRAL FROM JUNE 13, 2019 MEETING]  (Housing)
Accept the report on Homelessness and Housing Insecurity for 
Homeless Families with Children.
Memorandum
Presentation

City of San José

Page 3 

Printed on 9/12/2019




In [513]:
# this is *like* top k, but not quite
max_score = np.max(np.array([s for s in document_section_scores if s!=None]))
best_doc_indices = [i for i, s in enumerate(document_section_scores) if s!=None and s>5000]
# print(len(best_doc_indices))
for i in best_doc_indices:
    print(document_sections[i][2])
    print(document_sections[i][3])
    print("~~~~")
# print(document_sections[best_doc_index])

sunnyvale/Sunnyvale_2019-11-21_City-Council:-Community-Meeting-Notice_Agenda.txt
City of Sunnyvale

Notice and Agenda

City Council: Community Meeting Notice

Thursday, November 21, 2019

6:00 PM

Sunnyvale Community Center
Ballroom
550 East Remington Drive
Sunnyvale, CA 94087

Special Meeting of the City Council: "Housing Strategy Final Open House: Prioritizing 

Sunnyvale’s Housing Needs"

This is a notice that a community meeting will be held on Thursday, November 21, 2019 at
6:00 p.m. and a quorum of the City Council may be present.

HOUSING STRATEGY FINAL OPEN HOUSE: PRIORITIZING SUNNYVALE'S HOUSING 
NEEDS

ADJOURNMENT

NOTICE TO THE PUBLIC

Pursuant to the Americans with Disabilities Act, if you need special assistance in 
this meeting, please contact the Office of the City Clerk at (408) 
730-7483.Notification of 48 hours prior to the meeting will enable the City to make 
reasonable arrangements to ensure accessibility to this meeting. (28 CFR 35.160 
(b) (1))

City of Sunnyvale