# Ad Hoc Relevance Ranking System

Main task: Extract document segments relevant to a set of keywords

Secondary tasks:
* Select words within the document segment to highlight for easier reading
* Select the top-k most relevant document segments for each user given their keywords
* Keep track of which document segments have been sent to which users, so that no segment is sent to the same user twice
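
The last two secondary tasks can be sketched together. This is a minimal illustration, not the project's implementation: the `sent_segments` store, the `select_top_k` helper, and the segment ids are all hypothetical names, and a real system would presumably persist the sent record rather than keep it in memory.

```python
import heapq
from collections import defaultdict

# Hypothetical in-memory record of which segment ids each user already received.
sent_segments = defaultdict(set)

def select_top_k(user_id, scored_segments, k=3):
    """Return the k highest-scoring segment ids not yet sent to this user.

    scored_segments maps segment id -> relevance score (None means unscored).
    """
    candidates = [
        (score, segment_id)
        for segment_id, score in scored_segments.items()
        if score is not None and segment_id not in sent_segments[user_id]
    ]
    top = [segment_id for score, segment_id in heapq.nlargest(k, candidates)]
    sent_segments[user_id].update(top)  # remember what was sent
    return top

scores = {"seg-a": 0.9, "seg-b": 0.4, "seg-c": None, "seg-d": 0.7}
first_batch = select_top_k("user-1", scores, k=2)
second_batch = select_top_k("user-1", scores, k=2)
print(first_batch)   # ['seg-a', 'seg-d']
print(second_batch)  # ['seg-b']
```

The second call returns only `seg-b` because the two best segments were already sent and `seg-c` has no score.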

## Segmenting

TODO: Make this more sophisticated so that the conceptual breaks between sections are respected

Input: A set of documents

Output:

* All tokens from all documents
* All tokens for each document
* A set of document segments:
    * the original text of the segment
    * the tokens from that segment
    * the page number that the start of the segment comes from (TODO: fix this, right now it's the end)
    * the document filename that the segment comes from
    
Method:

Go through each document line by line and collect nonoverlapping sets of lines that are each at least W tokens long. (W = 100)
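
The grouping rule can be sketched on toy input as follows. This is a simplified illustration under assumptions: `segment_lines` is a hypothetical helper, whitespace splitting stands in for the real preprocessing, page tracking is omitted, and the trailing remainder is always kept here, whereas the notebook code applies a minimum length to it.

```python
def segment_lines(lines, tokenize, min_tokens=100):
    """Group consecutive lines into non-overlapping segments of >= min_tokens tokens."""
    segments = []
    current_lines, current_tokens = [], []
    for line in lines:
        current_lines.append(line)
        current_tokens.extend(tokenize(line))
        if len(current_tokens) >= min_tokens:
            segments.append(("\n".join(current_lines), current_tokens))
            current_lines, current_tokens = [], []
    if current_tokens:  # keep whatever is left over as a shorter final segment
        segments.append(("\n".join(current_lines), current_tokens))
    return segments

toy_lines = ["one two three", "four five", "six", "seven eight nine ten"]
toy_segments = segment_lines(toy_lines, str.split, min_tokens=4)
for text, tokens in toy_segments:
    print(len(tokens), repr(text))
```

Because a segment is only closed at a line boundary, segments can exceed W by up to one line's worth of tokens.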

## Relevance score

Input:

* Set of candidate document segments
* Set of keywords

Output:

A numeric score for each document segment, where a higher score is a better match to the keywords.
A document segment's score is None when none of its tokens has a word vector to compare with the keywords.

Method:

1. Compute inverse document frequency for each keyword

    * I actually used inverse document proportion, which is the same up to a constant scaling factor
    
    * $\frac{1}{(\text{number of documents in which the keyword appears}) + \text{smoothing}}$
    
    * smoothing = 1

2. Vectorize keywords and tokens in each document segment

3. For each document segment:

    a. Get pairwise cosine similarity between each keyword and each token in the document segment
    
    b. Get average cosine similarity (across document segment tokens) for each keyword
    
    c. Sum average cosine similarity for each keyword, weighted by that keyword's inverse document frequency
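
Steps 3a to 3c can be sketched with plain NumPy on toy 2-D vectors. This is an illustrative sketch only: `relevance_score` is a hypothetical helper, and the vectors and weights are made up rather than taken from the word2vec model or the corpus.

```python
import numpy as np

def relevance_score(segment_vectors, keyword_vectors, keyword_weights):
    """Weighted sum over keywords of the mean cosine similarity to segment tokens."""
    if len(segment_vectors) == 0:
        return None  # no usable tokens in this segment
    seg = segment_vectors / np.linalg.norm(segment_vectors, axis=1, keepdims=True)
    kw = keyword_vectors / np.linalg.norm(keyword_vectors, axis=1, keepdims=True)
    similarities = seg @ kw.T                 # (a) shape: (n_tokens, n_keywords)
    per_keyword = similarities.mean(axis=0)   # (b) average over segment tokens
    return float(per_keyword @ keyword_weights)  # (c) weighted sum over keywords

# Toy data: two keywords with made-up inverse-frequency weights.
toy_keyword_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_keyword_weights = np.array([2.0, 0.5])
toy_segment_vectors = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

toy_score = relevance_score(toy_segment_vectors, toy_keyword_vectors, toy_keyword_weights)
print(toy_score)  # 1.25
```

Each keyword has an average similarity of 0.5 here, so the score is 0.5 × 2.0 + 0.5 × 0.5 = 1.25.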



## Evaluation

### Precision

Precision is the easier of the two to evaluate: only the segments the system actually returns need a relevance judgment.
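
One common way to measure precision for a ranked list is precision at k, which only needs labels for the returned segments. The sketch below assumes hand-labelled relevance judgments; `precision_at_k`, the segment ids, and the labels are hypothetical and not part of this notebook.

```python
def precision_at_k(ranked_segment_ids, relevant_ids, k):
    """Fraction of the top-k returned segments that were judged relevant."""
    top_k = ranked_segment_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for segment_id in top_k if segment_id in relevant_ids)
    return hits / len(top_k)

ranked = ["seg-4", "seg-1", "seg-7", "seg-2"]  # hypothetical ranking output
judged_relevant = {"seg-4", "seg-7"}           # hypothetical hand labels
p_at_3 = precision_at_k(ranked, judged_relevant, k=3)
print(round(p_at_3, 3))  # 0.667
```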

### Recall

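Recall is harder to evaluate: it requires knowing every relevant segment in the corpus, including the ones that were never returned.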

## TODO

* [ ] figure out appropriate casing
* [ ] narrow down document set by metadata
    * `metadata.csv`
    * use `DocumentManager`
* [ ] add more keywords
* [ ] figure out sectioning so we get more coherent sections
* [ ] try different vectors? e.g. contextual??

In [None]:
import pandas as pd
import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

print("loading language model")
# load language model (this takes a few minutes)
model = api.load('word2vec-google-news-300')
print("model loaded")

In [2]:
# api.load already returns the KeyedVectors object, so no `.wv` lookup is needed
vectors = model
del model
# L2-normalize the vectors in place (!), so the dot product can be used as cosine similarity
vectors.init_sims(replace=True)

print('embeddings loaded ')
print('loading docs ... ')

embeddings loaded 
loading docs ... 
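
Why normalizing helps: once every vector has unit length, the dot product of two vectors equals their cosine similarity. A small check on random toy vectors (not the word2vec embeddings) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(3, 5))
# Scale each row to unit length, which is what in-place normalization does.
unit_vectors = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)

a, b = toy_vectors[0], toy_vectors[1]
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = unit_vectors[0] @ unit_vectors[1]
print(np.isclose(cosine, dot_of_units))  # True
```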


In [25]:
# vectors[["ADU", "housing"]]

array([[-3.75537425e-02, -8.87633935e-02,  4.03469950e-02,
        -3.66226584e-02, -2.70014517e-02, -4.50024195e-02,
         9.00048390e-03, -4.56231423e-02, -9.99364033e-02,
        -1.02807244e-03, -2.25012098e-02,  6.89002573e-02,
        -4.03469950e-02,  1.15454480e-01, -2.77773552e-02,
        -4.15884405e-02, -2.21908484e-02, -3.78641039e-02,
         6.12963969e-03,  1.37334969e-02, -6.26930222e-02,
         2.07942203e-02,  6.98313415e-02, -5.49339876e-02,
        -1.01177849e-01, -1.13475928e-03,  2.99498849e-02,
         1.31593272e-01, -4.00366336e-02,  2.52168719e-03,
         6.83765218e-04, -6.82795346e-02,  2.18804870e-02,
        -4.96578403e-02,  2.09494010e-02, -1.78457871e-02,
         6.05204925e-02, -1.03971101e-02, -7.75903761e-02,
        -2.76221745e-02,  6.54862747e-02,  9.31084529e-02,
        -5.86583242e-02,  1.31593272e-01, -7.64265191e-03,
         2.59927753e-03, -2.05614488e-03, -5.83479628e-02,
        -4.55843459e-04, -6.85898960e-02, -1.43542197e-0

In [224]:
keywords = ['housing', 'affordable', 'homelessness', 'ADU']

In [323]:
directory = '../data/legistar_corpus'
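import os
from autolocal.nlp import Tokenizer
# import the preprocessing helpers explicitly instead of using a wildcard
from gensim.parsing.preprocessing import (
    preprocess_string,
    remove_stopwords,
    strip_multiple_whitespaces,
    strip_non_alphanum,
    strip_numeric,
    strip_punctuation,
    strip_short,
)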

tokenizer = Tokenizer()

documents = []
document_sections = []
# section_length = 20 # lines
section_length = 100 # tokens
min_section_length = 5

docs_tokens = []

# TODO: is lowercasing necessary?
# (lowercasing means a case-sensitive keyword such as 'ADU' never matches a corpus token)
preprocess_filters = [
    lambda x: x.lower(),
    strip_punctuation,
    strip_numeric,
    strip_non_alphanum,
    strip_multiple_whitespaces,
    remove_stopwords,
    strip_short
]

filenames = os.listdir(directory)
print(len(filenames))
for i, filename in enumerate(filenames, start=1):
    with open(os.path.join(directory, filename)) as f:
        document_str = f.read()
    document_segment_lines = []
    document_segment_tokens = []
    document_tokens = []
    # pages are separated by form feeds in the extracted text
    for p, page in enumerate(document_str.split('\f')):
        for line in page.split('\n'):
            line_tokens = preprocess_string(line, filters=preprocess_filters)
            document_segment_lines.append(line)
            document_segment_tokens += line_tokens
            document_tokens += line_tokens
            docs_tokens += line_tokens
            # close the segment once it holds at least section_length tokens
            if len(document_segment_tokens) >= section_length:
                document_sections.append((
                    document_segment_tokens,
                    p,  # page on which the segment ends
                    filename,
                    "\n".join(document_segment_lines)
                ))
                document_segment_lines = []
                document_segment_tokens = []
        # at a page break, flush the remainder if it is long enough;
        # shorter remainders carry over onto the next page
        if len(document_segment_tokens) >= min_section_length:
            document_sections.append((
                document_segment_tokens,
                p,
                filename,
                "\n".join(document_segment_lines)
            ))
            document_segment_lines = []
            document_segment_tokens = []
    # one token list per document
    documents.append(document_tokens)
    if i % 100 == 0:
        print(i)


30647

In [324]:
print(len(document_sections))
print(document_sections[0])

405326
(['city', 'santa', 'clara', 'canceled', 'planning', 'commission', 'wednesday', 'july', 'city', 'hall', 'council', 'chambers', 'meeting', 'canceled', 'planning', 'commission', 'meeting', 'scheduled', 'july', 'canceled', 'items', 'scheduled', 'regular', 'scheduled', 'meeting', 'wednesday', 'july', 'city', 'council', 'chambers', 'city', 'santa', 'clara', 'city', 'santa', 'clara', 'page', 'printed'], 0, 'Santa-Clara_2018-07-11_Planning-Commission_Agenda.txt', 'City of Santa Clara\n\nCANCELED\n\n Planning Commission\n\nWednesday, July 11, 2018\n\n7:00 PM\n\nCity Hall Council Chambers\n\n– MEETING CANCELED –\n\nThe Planning Commission Meeting \n\nscheduled for July 11, 2018, at 7:00 p.m. has been canceled.\n\n(No Items Scheduled)\n\nThe next regular scheduled meeting will be on Wednesday, July 25, 2018, at \n\n7:00 p.m. in the City Council Chambers of the City of Santa Clara.\n\nCity of Santa Clara\n\nPage 1 of 1 \n\nPrinted on 7/2/2018\n\n')


In [325]:
# print(document_sections[0])
# print(document_sections[1])

* Inverse document frequency for each keyword
* For each document section, for each keyword, get average document similarity
    * for each word in the document section, for each keyword, get similarity of word vectors
    * average across the document
* For each document section, sum average document similarity, weighted by keyword inverse document frequency

In [326]:
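from collections import Counter

# NOTE: these are corpus-wide token counts, not the number of documents
# containing each word, so the weights below are inverse token proportions.
word_counts = Counter(docs_tokens)
smoothing = 1
# smoothing = 100
# keep only words that have an embedding
doc_freqs = {w: count + smoothing for w, count in word_counts.items() if w in vectors}
doc_freq_total = sum(doc_freqs.values())
del word_counts  # free memory
doc_props = {w: freq / doc_freq_total for w, freq in doc_freqs.items()}
# every smoothed count is positive, so the inverse is always defined
inverse_doc_props = {w: 1 / prop for w, prop in doc_props.items()}
# inverse_doc_props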

In [None]:
inverse_doc_props
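
Counting tokens across the whole corpus is not the same as counting the documents that contain a word. The toy corpus below (made-up documents, not the Legistar data) shows how the two can diverge for a word that is repeated heavily in a single document:

```python
from collections import Counter

toy_documents = [
    ["housing", "housing", "housing", "budget"],
    ["budget", "parking"],
    ["parking", "budget"],
]

# Corpus-wide token counts: every occurrence counts.
token_counts = Counter(t for doc in toy_documents for t in doc)
# Document counts: each document contributes at most once per word.
document_counts = Counter(t for doc in toy_documents for t in set(doc))

print(token_counts["housing"], document_counts["housing"])  # 3 1
print(token_counts["budget"], document_counts["budget"])    # 3 3
```

Under token counts, "housing" and "budget" would get the same weight, while under document counts "housing" is the rarer and therefore more heavily weighted word.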

In [332]:
# keywords absent from inverse_doc_props are skipped; tokens are lowercased
# during preprocessing, so a case-sensitive keyword such as 'ADU' is dropped here
scored_keywords = [t for t in keywords if t in inverse_doc_props]
keyword_vectors = np.array([vectors[t] for t in scored_keywords])
keyword_weights = np.array([inverse_doc_props[t] for t in scored_keywords])
document_section_scores = []
for section in document_sections:
    score = None
    section_tokens = section[0]
    # TODO: Zipf to figure out what the cutoff should be for normal communication
    if len(set(section_tokens)) < 20:
        score = 0  # too few distinct tokens
    else:
        section_vectors = np.array([vectors[t] for t in section_tokens if t in inverse_doc_props])
        if section_vectors.shape[0] > 0:
            # section_weights = np.array([inverse_doc_props[t] for t in section_tokens if t in inverse_doc_props])
            # rows: section tokens, columns: keywords
            similarities = cosine_similarity(section_vectors, keyword_vectors)
            # similarities = similarities * section_weights
            # similarities = similarities*(similarities>0.2)
            keyword_similarities = np.mean(similarities, axis=0)
            # keyword_similarities = np.average(similarities, axis=0, weights=section_weights)
            score = np.sum(keyword_similarities * keyword_weights)
    document_section_scores.append(score)

In [333]:
scored = [(i, s) for i, s in enumerate(document_section_scores) if s is not None]
# max() returns the first of any tied top-scoring sections
best_doc_index, max_score = max(scored, key=lambda pair: pair[1])
print(max_score)
print(document_sections[best_doc_index])

2670.7668517555016
(['residents', 'living', 'poverty', 'level', 'families', 'incomes', 'poverty', 'level', 'specifically', 'extremely', 'low', 'low', 'incomes', 'greatest', 'risk', 'homeless', 'require', 'assistance', 'meeting', 'rent', 'mortgage', 'obligations', 'order', 'prevent', 'homelessness', 'census', 'data', 'suggest', 'percent', 'cupertino', 'residents', 'living', 'poverty', 'level', 'specifically', 'percent', 'family', 'households', 'percent', 'families', 'children', 'living', 'poverty', 'level', 'households', 'require', 'specific', 'housing', 'solutions', 'deeper', 'income', 'targeting', 'subsidies', 'housing', 'supportive', 'services', 'single', 'room', 'occupancy', 'units', 'rent', 'subsidies', 'vouchers', 'homeless', 'demand', 'emergency', 'transitional', 'shelter', 'cupertino', 'difficult', 'determine', 'given', 'episodic', 'nature', 'homelessness', 'generally', 'episodes', 'homelessness', 'families', 'individuals', 'occur', 'single', 'event', 'periodically', 'county', '

In [345]:
score_threshold = 1500  # keep only sections scoring above this cutoff
best_doc_indices = [
    i for i, s in enumerate(document_section_scores)
    if s is not None and s > score_threshold
]
# for i in best_doc_indices:
#     print(document_sections[i][3])
#     print("~~~~")
best_doc_segments = []
for i in best_doc_indices:
    tokens, p, filename, orig = document_sections[i]
    best_doc_segments.append({
        "page_number": p,
        "filename": filename,
        "original_text": orig,
        "relevance_score": document_section_scores[i]
    })
pd.DataFrame(best_doc_segments).to_csv("best_doc_segments.csv")

In [None]:
all_keywords = keywords
# How many documents does each keyword occur in?
keyword_doc_counts = {k: 0 for k in all_keywords}
keyword_doc_total = 0
for document in documents:
    document_vocab = set(document)  # set membership is O(1) per keyword
    for k in keyword_doc_counts:
        if k in document_vocab:
            keyword_doc_counts[k] += 1
            keyword_doc_total += 1

# How many sections does each keyword occur in?
keyword_section_counts = {k: 0 for k in all_keywords}
keyword_section_total = 0
for section in document_sections:
    section_vocab = set(section[0])
    for k in keyword_section_counts:
        if k in section_vocab:
            keyword_section_counts[k] += 1
            keyword_section_total += 1

In [None]:
keyword_doc_proportions = {k: keyword_doc_counts[k]/keyword_doc_total for k in keyword_doc_counts}
keyword_section_proportions = {k: keyword_section_counts[k]/keyword_section_total for k in keyword_section_counts}
print(keyword_doc_proportions)
print(keyword_section_proportions)

In [None]:
inverse_doc_freq = {k: 1/(keyword_doc_proportions[k]) if keyword_doc_proportions[k]>0 else 0 for k in keyword_doc_proportions}
inverse_doc_freq

In [None]:
keyword_vectors = np.array([vectors[t] for t in keywords if t in vectors])
keyword_weights = np.array([inverse_doc_freq[t] for t in keywords if t in vectors])
document_section_scores = []
for section in document_sections:
    score = None
    section_tokens = section[0]
    # look up word vectors, skipping out-of-vocabulary tokens
    section_vectors = np.array([vectors[t] for t in section_tokens if t in vectors])
    if section_vectors.shape[0] > 0:
        similarities = cosine_similarity(section_vectors, keyword_vectors)
        # zero out weak similarities (<= 0.1) so unrelated tokens add nothing
        similarities = similarities * (similarities > 0.1)
        keyword_similarities = np.mean(similarities, axis=0)
        score = np.sum(keyword_similarities * keyword_weights)
    document_section_scores.append(score)
# print(document_section_scores)

In [None]:
scored = [(i, s) for i, s in enumerate(document_section_scores) if s is not None]
# max() returns the first of any tied top-scoring sections
best_doc_index, max_score = max(scored, key=lambda pair: pair[1])
print(max_score)
print(document_sections[best_doc_index])

# max_score = np.max(np.array([s for s in document_section_scores if s!=None]))
# best_doc_indices = [i for i, s in enumerate(document_section_scores) if s!=None and s>20]
# best_doc_indices
# for i in best_doc_indices:
#     print(document_sections[i])
# # print(document_sections[best_doc_index])

In [None]:
# why are these the same?
print(" ".join(document_sections[0][0]))
print(" ".join(document_sections[1][0]))

In [147]:
sims = cosine_similarity(np.array(vectors[["ADU", "applesauce"]]), np.array(vectors[["housing", "after", "attachment", "cooking"]]))
sims*(sims>0)

array([[ 0.1902841 , -0.        ,  0.12652467,  0.01557165],
       [ 0.06678405,  0.04545134,  0.05078512,  0.29745492]],
      dtype=float32)
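
The `-0.` in the output above comes from multiplying a negative similarity by a `False` mask. An equivalent way to zero out negative similarities, shown here on toy values rather than real embeddings, is `np.maximum`, which yields a plain `0.` instead:

```python
import numpy as np

toy_sims = np.array([[0.19, -0.09], [0.07, 0.05]], dtype=np.float32)

masked = toy_sims * (toy_sims > 0)   # negative entries become -0.0
clipped = np.maximum(toy_sims, 0)    # negative entries become 0.0

print(np.array_equal(masked, clipped))                       # True
print(np.signbit(masked[0, 1]), np.signbit(clipped[0, 1]))   # True False
```

The two results compare equal, so the choice only affects how the zeros are displayed.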

In [171]:
print(all_keywords)
cosine_similarity(np.array(vectors[all_keywords]), np.array(vectors[["housing", "after", "attachment", "cooking"]]))

['housing', 'affordable', 'homelessness', 'accessory', 'dwelling', 'unit', 'ADU']


array([[ 1.        ,  0.0360411 ,  0.03927992,  0.09836677],
       [ 0.31637853, -0.06969636,  0.02391504,  0.14009611],
       [ 0.45313504, -0.01476945,  0.01381858,  0.09197982],
       [ 0.0892075 ,  0.03608286,  0.21938115,  0.07302914],
       [ 0.40325326,  0.02292117,  0.17923255,  0.15913229],
       [ 0.16630656,  0.12939759,  0.13552879,  0.03668242],
       [ 0.1902841 , -0.09115382,  0.12652467,  0.01557165]],
      dtype=float32)