# Language and Topic models

A common suggestion to users for coming up with good queries is to think of words that would likely appear in a relevant document, and to use those words as the query. The language modeling approach to IR directly models this idea: a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often. 

Today we will use language models to score documents against a user query, and we will also get some hands-on experience with topic modeling.

## Loading data

In this class we will use the dataset we already used once - [this topic-modeling dataset](https://code.google.com/archive/p/topic-modeling-tool/downloads).

In [1]:
!wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_music_2084docs.txt
!wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_economy_2073docs.txt
!wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_fuel_845docs.txt
!wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_braininjury_10000docs.txt

--2020-04-12 16:38:22--  https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_music_2084docs.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.28.128, 2607:f8b0:400e:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.28.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13985603 (13M) [application/octet-stream]
Saving to: ‘testdata_news_music_2084docs.txt’


2020-04-12 16:38:23 (45.2 MB/s) - ‘testdata_news_music_2084docs.txt’ saved [13985603/13985603]

--2020-04-12 16:38:26--  https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_economy_2073docs.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.142.128, 2607:f8b0:400e:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.142.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Leng

In [2]:
# TODO: read the dataset
import nltk
nltk.download('punkt')

def read_dataset(file_path):
    """Read one document per line and return a list of lowercased token lists."""
    docs = []
    with open(file_path) as fp:
        for line in fp:
            docs.append(nltk.word_tokenize(line.lower()))
    return docs

fuel_data = read_dataset("testdata_news_fuel_845docs.txt")
brain_inj_data = read_dataset("testdata_braininjury_10000docs.txt")
economy_data = read_dataset("testdata_news_economy_2073docs.txt")
music_data = read_dataset("testdata_news_music_2084docs.txt")

all_data = fuel_data + brain_inj_data + economy_data + music_data

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
print("# of documents", len(all_data))
assert len(all_data) == 15002

# of documents 15002


## 1. Ranking Using Language Models
Our goal is to rank documents by *P(d|q)*, where the probability of a document is interpreted as the likelihood that it is relevant to the query. 

Using Bayes' rule: *P(d|q) = P(q|d)P(d)/P(q)*

*P(q)* is the same for all documents, so it can be ignored. The prior probability of a document *P(d)* is often treated as uniform across all *d*, so it can be ignored as well.

This means that ranking documents by *P(q|d)* gives the same order as ranking them by *P(d|q)*: comparing *P(q|d)* across documents tells us how relevant each of them is to the query. How can we estimate *P(q|d)*?

*P(q|d)* can be estimated as:
![](https://i.imgur.com/BEIMAC1.png)

where M<sub>d</sub> is the language model of document *d*, tf<sub>t,d</sub> is the term frequency of term *t* in document *d*, and L<sub>d</sub> is the number of tokens in document *d*. That is, we just count how often each query word occurs in document *d* and divide by the total number of words in *d*. The first thing we need to do is build a term-document matrix for our dataset.
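
Before building the full matrix, the estimate can be tried on a single toy document. The following is a minimal sketch, not the notebook's implementation: `mle_query_likelihood` and `toy_doc` are hypothetical names introduced only for illustration, and the document is made up rather than taken from the dataset.

```python
from collections import Counter

def mle_query_likelihood(query_tokens, doc_tokens):
    """Unsmoothed P(q | M_d): product over query terms of tf_{t,d} / L_d."""
    tf = Counter(doc_tokens)          # term frequencies tf_{t,d}
    doc_len = len(doc_tokens)         # document length L_d
    likelihood = 1.0
    for term in query_tokens:
        likelihood *= tf[term] / doc_len
    return likelihood

# Hypothetical toy document (8 tokens, "piano" occurs twice).
toy_doc = "the piano concert was a great piano evening".split()

# (2/8) * (1/8) for the query "piano concert"
print(mle_query_likelihood(["piano", "concert"], toy_doc))
```

Each query term contributes one factor, so a document that repeats the query words more often relative to its length receives a higher score.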

In [0]:
# TODO: build term-document matrix for the dataset
import numpy as np
from collections import Counter

# word_index maps term -> {'number_in_collection': count in the whole collection,
#                          doc_id: count in that document, ...}
word_index = {}
doc_lenths = {}   # doc_id -> number of tokens in the document
total_words = 0   # number of tokens in the whole collection

for doc_id, doc_content in enumerate(all_data):
  terms = Counter(doc_content)
  doc_lenths[doc_id] = len(doc_content)
  total_words += len(doc_content)
  for term, count in terms.items():
    entry = word_index.setdefault(term, {'number_in_collection': 0})
    entry[doc_id] = count
    entry['number_in_collection'] += count

### Smoothing

Now you need to implement the logic described above in the `lm_rank_documents` function below. Do you see any potential problems?

Yes, data sparsity: we do not expect every term to occur in every document, so in most cases a document will get a zero score, which is not what we want.

The solution is smoothing.

One option is *additive smoothing*: add a small number (between 0 and 1) to the observed counts and renormalize to obtain a probability distribution.
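
As a rough illustration of the renormalization step, here is a minimal sketch on a toy document. The names `additive_prob`, `toy_doc` and `vocab` are hypothetical, and the vocabulary is assumed to be the document's own words plus one unseen word.

```python
from collections import Counter

def additive_prob(term, doc_tokens, vocab_size, alpha=0.5):
    """(tf + alpha) / (L_d + alpha * |V|): add alpha to every count, then renormalize."""
    tf = Counter(doc_tokens)
    return (tf[term] + alpha) / (len(doc_tokens) + alpha * vocab_size)

toy_doc = "the piano concert was a great piano evening".split()
# Assumed vocabulary: the 7 distinct words of the document plus the unseen word "violin".
vocab = set(toy_doc) | {"violin"}

p_seen = additive_prob("piano", toy_doc, len(vocab))     # (2 + 0.5) / (8 + 0.5 * 8)
p_unseen = additive_prob("violin", toy_doc, len(vocab))  # (0 + 0.5) / (8 + 0.5 * 8)
total = sum(additive_prob(t, toy_doc, len(vocab)) for t in vocab)

print(p_seen, p_unseen, total)
```

The unseen word now has a small non-zero probability, and the probabilities over the assumed vocabulary still sum to one. The `lm_rank_documents` cell in this notebook takes a simpler route and adds `param` directly to the relative frequency, which gives a ranking score rather than a normalized distribution.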

Another option is Jelinek-Mercer smoothing. The idea is simple and works well in practice: use a mixture of the document-specific distribution and a distribution estimated from the entire collection:

![](https://i.imgur.com/8Qv41Wp.png)

where 0 < λ < 1 and M<sub>c</sub> is a language model built from the entire document collection.
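
The mixture can be sketched on a two-document toy collection. This is only an illustration under stated assumptions: `jm_prob`, `doc_a`, `doc_b` and `collection` are hypothetical names, and λ weights the document model, matching the description above.

```python
from collections import Counter

def jm_prob(term, doc_tokens, collection_tokens, lam=0.8):
    """lam * P(t | M_d) + (1 - lam) * P(t | M_c), both estimated by relative frequency."""
    p_doc = Counter(doc_tokens)[term] / len(doc_tokens)
    p_col = Counter(collection_tokens)[term] / len(collection_tokens)
    return lam * p_doc + (1 - lam) * p_col

# Hypothetical two-document collection.
doc_a = "piano concert tonight".split()
doc_b = "violin concert tomorrow".split()
collection = doc_a + doc_b

print(jm_prob("piano", doc_a, collection))  # the document contains the term
print(jm_prob("piano", doc_b, collection))  # the term comes only from the collection model
```

A document that lacks the term still gets a non-zero probability from the collection model, while a document that contains it is scored higher.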

Refer to *Chapter 12* for the detailed explanation.


You are going to apply both in your `lm_rank_documents` function. The function takes the term-document index as input, builds a language model for each document, and returns, as each document's score, the relative probability that the query was generated by that document.

In [0]:
import numpy as np
def lm_rank_documents(query, index, doc_lenths, smoothing='additive', param=0.001):
  # TODO: score each document in tdm using this document's language model
  # implement two types of smoothing. Looks up term frequencies in tdm
  # return document scores in a convenient form
  """
  index: term -> {'number_in_collection': term count in the whole collection,
                  doc_id: term count in that document, ...}
  doc_lenths: doc_id -> number of tokens in the document
  param: alpha for additive / lambda for Jelinek-Mercer
  Returns a dict doc_id -> score. Uses the global total_words for the collection model.
  """
  assert smoothing in ['additive', 'jm'], "smoothing parameter should be set to 'additive' or 'jm' for Jelinek-Mercer"

  scores = {}
  query_terms = Counter(nltk.word_tokenize(query.lower()))

  for term in query_terms:
    if term not in index:  # a term that never occurs in the corpus is ignored
      continue
    tfm = index[term]
    for doc_id, doc_lenth in doc_lenths.items():
      # maximum-likelihood estimate tf / L_d; 0 if the term is absent or the document is empty
      doc_score = tfm.get(doc_id, 0) / doc_lenth if doc_lenth else 0

      if smoothing == 'additive':
        # param is added to the relative frequency directly (no renormalization)
        score = doc_score + param
      else:
        collection_score = tfm['number_in_collection'] / total_words
        score = param * doc_score + (1 - param) * collection_score

      # multiply the per-term probabilities together
      scores[doc_id] = scores.get(doc_id, 1) * score
  return scores

### Testing

Check whether this type of ranking gives meaningful results. For each query, output the document category, doc_id, score, and the beginning of the document, as shown below. Then check whether the categories and contents match the queries.

In [6]:
def process_query(raw_query, smoothing, param, top_n = 5):
    print(f'Query: {raw_query} smoothing {smoothing}')
    # TODO: process user query and print search results including document category, id, score, and some part of it
    scores = lm_rank_documents(raw_query, word_index, doc_lenths, smoothing=smoothing, param=param)
    # choose the top n best documents
    top_docs = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]

    # print the id, score and text of each selected document
    for doc_id, score in top_docs:
        print(f'{doc_id} {score}')
        print(' '.join(all_data[doc_id]))

user_queries = ["piano concert", "symptoms of head trauma", "wall street journal"]
for q in user_queries:
    process_query(q, smoothing='additive', param=0.001)
    print("\n")
    process_query(q, smoothing='jm', param=0.8)
    print("\n")

Query: piano concert smoothing additive
13834 6.841047880402486e-05
sometimes the most satisfying renovation the one that doesn happen two years ago geoffrey menin bought loft fifth avenue near 20th street drawn the large open duplex because would provide perfect setting for his exquisite bosendorfer piano friend recommendation hired architect and began discussing ways reconfigure the apartment thought both the kitchen and bath needed redone but beyond that wasn sure what wanted months went and became clear that and the architect didn have the same taste they parted and hired another one this architect suggested significant alterations the space including putting the piano platform moving the door the apartment and expanding and renovating the kitchen menin partner the law firm levine plotkin amp menin which represents people the music fashion and movie businesses well executives employment negotiations even consulted with sound engineer because was considering creating vaulted ceiling

## 2. Topic modeling

Now let's use *Latent Dirichlet Allocation* to identify topics in this collection and check if they match the original topics (fuel, economy, etc.). Go through the tutorial [here](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0) and apply the ideas there to our dataset. 

In [8]:
# TODO: apply LDA to our dataset and output the resulting categories 
from sklearn.feature_extraction.text import CountVectorizer
import re

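# 1. Preprocessing: strip punctuation and lowercase.
# Note: this redefines read_dataset (and all_data below) to hold raw strings instead of token lists.
def read_dataset(file_path):
    docs = []
    with open(file_path) as fp:
        for line in fp:
            docs.append(re.sub(r'[,.!?]', '', line).lower())
    return docs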

fuel_data = read_dataset("testdata_news_fuel_845docs.txt")
brain_inj_data = read_dataset("testdata_braininjury_10000docs.txt")
economy_data = read_dataset("testdata_news_economy_2073docs.txt")
music_data = read_dataset("testdata_news_music_2084docs.txt")

all_data = fuel_data + brain_inj_data + economy_data + music_data

count_vectorizer = CountVectorizer(stop_words='english')
count_data = count_vectorizer.fit_transform(all_data)

# 2. Load the model
from sklearn.decomposition import LatentDirichletAllocation as LDA
number_topics = 4
number_words = 10


# 3. Create and fit the model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=4, n_jobs=-1,
                          perp_tol=0.1, random_state=None,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=0)

In [9]:
# 4. Print the topics found by the LDA model
def print_topics(model, count_vectorizer, n_top_words):
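    # get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0;
    # use whichever one the installed version provides
    get_names = getattr(count_vectorizer, "get_feature_names_out", None) or count_vectorizer.get_feature_names
    words = get_names()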
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)

Topics found via LDA:

Topic #0:
patients injury brain traumatic study tbi results injuries trauma head

Topic #1:
said new year enron percent bush company news president york

Topic #2:
new atlanta said york like news journal music time year

Topic #3:
said people bush afghanistan military united war states officials new
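
To check more systematically whether the topics match the original categories, one could assign each document its highest-weight topic and count how often each (category, topic) pair occurs. The sketch below is only an illustration: `dominant_topics` and `crosstab` are hypothetical helpers, and the toy weights stand in for the document-topic matrix that `lda.transform(count_data)` would return.

```python
from collections import Counter

def dominant_topics(doc_topic_rows):
    """Index of the highest-weight topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic_rows]

def crosstab(categories, topics):
    """Count how often each (category, dominant topic) pair occurs."""
    return Counter(zip(categories, topics))

# Hypothetical document-topic weights (one row per document, one column per topic).
toy_doc_topics = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.15, 0.75, 0.05],
    [0.80, 0.10, 0.05, 0.05],
]
# Hypothetical true categories of the same four documents.
toy_categories = ["brain injury", "economy", "music", "brain injury"]

table = crosstab(toy_categories, dominant_topics(toy_doc_topics))
for (category, topic), count in sorted(table.items()):
    print(f"{category:>12}  topic {topic}: {count}")
```

If a category maps almost entirely to one topic, the topic has recovered that category; a category spread over several topics suggests the model has split or merged it.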


In [10]:
!pip install pyldavis

Collecting pyldavis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
     |████████████████████████████████| 1.6MB 5.0MB/s 
Collecting funcy
  Downloading https://files.pythonhosted.org/packages/ce/4b/6ffa76544e46614123de31574ad95758c421aae391a1764921b8a81e1eae/funcy-1.14.tar.gz (548kB)
     |████████████████████████████████| 552kB 42.8MB/s 
Building wheels for collected packages: pyldavis, funcy
  Building wheel for pyldavis (setup.py) ... done
  Created wheel for pyldavis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97711 sha256=04c045848866ed135ae088b40a1b5a8598f851410b18c65860a30834b32e2f94
  Stored in directory: /root/.cache/pip/wheels/98/71/24/513a99e58bb6b8465bae4d2d5e9dba8f0bef8179e3051ac414
  Building wheel for funcy (setup.py) ... done
  Created wheel for funcy: filename=funcy-1.14-py2.py3-none-any.whl size=32042 sha256=ca54495e

In [11]:
%%time
# visualization as in the tutorial (the %%time cell magic has to be the first line of the cell)

from pyLDAvis import sklearn as sklearn_lda
import pickle 
import pyLDAvis
import os

# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_prepared = sklearn_lda.prepare(lda, count_data, count_vectorizer)

CPU times: user 1min 2s, sys: 150 ms, total: 1min 3s
Wall time: 1min 18s


In [12]:
pyLDAvis.display(LDAvis_prepared)