# Language and Topic models

A common suggestion to users for coming up with good queries is to think of words that would likely appear in a relevant document, and to use those words as the query. The language modeling approach to IR directly models this idea: a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often. 

Today we will score documents with respect to user query using language models and also get some experience with topic modelling.

## Import all required modules 

In [1]:
from urllib.request import urlopen
import os, tqdm
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer

## Loading data

In this class we will use the dataset we already used once - [this topic-modeling dataset](https://code.google.com/archive/p/topic-modeling-tool/downloads).

In [2]:
def read_data(data_dir = './data'):
    all_data = {}
    try :
        assert os.path.isdir(data_dir), 'file not found'
        print(f'Reading data from : {data_dir}')
        (_, _, filenames)  = next(os.walk('./data'))
        for doc in filenames:
            try:
                with open(os.path.join(data_dir,doc),'r') as file:
                    text = file.readlines()
                    topic = doc.split('.')[0]
                    for i , line in enumerate(text,1):
                        all_data[f'{topic}_{i}'] = line
            except:
                pass
                        
    except :
        print("Downloading data ...")
        os.mkdir(data_dir)
        links = {"fuel":"https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_fuel_845docs.txt",
                "economy":"https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_economy_2073docs.txt",
                "music": "https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_music_2084docs.txt",
                "brain_injury":"https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_braininjury_10000docs.txt"}
        
        for topic, link in tqdm.tqdm_notebook(links.items()):
            f = urlopen(link)
            myfile = f.readlines()
            file = open('./data/'+topic+'.txt','w+')
            for i , doc in enumerate(myfile,1):
                decoded_doc = doc.decode('utf-8').strip()
                all_data[f'{topic}_{i}'] = decoded_doc
                file.write(decoded_doc+"\n") # save the file to local dir
    return all_data

In [3]:
all_data = read_data()
print("# of documents", len(all_data))
assert len(all_data) == 15002

Reading data from : ./data
# of documents 15002


## 1. Ranking Using Language Models
Our goal is to rank documents by *P(d|q)*, where the probability of a document is interpreted as the likelihood that it is relevant to the query. 

Using Bayes rule: *P(d|q) = P(q|d)P(d)/P(q)*

*P(q)* is the same for all documents, and so can be ignored. The prior probability of a document *P(d)* is often treated as uniform across all *d* and so it can also be ignored. What does it mean? 

It means that comparing *P(q|d)* between different documents we can compare how relevant are they to the query. How can we estimate *P(q|d)*?

*P(q|d)* can be estimated as:
![](https://i.imgur.com/BEIMAC1.png)

where M<sub>d</sub> is the language model of document *d*, tf<sub>t,d</sub> is the term frequency of term *t* in document *d*, and L<sub>d</sub> is the number of tokens in document *d*. That is, we just count up how often each word occurred, and divide by the total number of words in the document *d*. The first thing we need to do is to build a term-document matrix for tour dataset.

In [4]:
# build term-document matrix for the dataset
vectorizer = CountVectorizer(token_pattern = r"(?u)\b\w+\b")
tdm = vectorizer.fit_transform(list(all_data.values())).toarray().T

### Smoothing

Now, you need to implement the abovementioned logic in the `lm_rank_documents` function below. Do you see any potential problems?

Yes, data sparsity - we don't expect to meet each term in each doc, so, in most cases, we will get zero scores, which is not what we really want.

The solution is smooting.

One option is *additive smoothing* - adding a small number (0 to 1) to the observed counts and renormalizing to give a probability distribution.

Another option is called Jelinek-Mercer smoothing - a simple idea that works well in practice is to use a mixture between a document-specific distribution and distribution estimated from the entire collection:

![](https://i.imgur.com/8Qv41Wp.png)

where 0 < λ < 1 and M<sub>c</sub> is a language model built from the entire document collection.

Refer to *Chapter 12* for the detailed explanation.


You are going to apply both in your `lm_rank_documents` function. This function takes as an input tdm matrix, and ranks all documents "building" a language model for each document, returning relative probabilities of query being generated by a document as a document's score.

In [5]:
def lm_rank_documents(query, tdm, terms_list, doc_names,smoothing='additive', param=0.001):
    result = []
    if smoothing == 'additive':
        #additive smoothing
        alpha = param
        tdm += alpha
        tdm /= tdm.sum(axis=1).reshape(-1,1) #Renormalizing
        
        # Sort & return results
        return sorted(result, key= lambda x : x[1], reverse=True) #Sort re
    
    else:
        # Jelinek-Mercer smoothing
        full_ptd = {}
        doc_len = tdm.sum(axis=0)

        for w, c in tqdm.tqdm_notebook(zip(np.array(terms_list),tdm)):
            #Probability of term in each document -- Takes forever to finish
            full_ptd.update({ w : {doc_names[i] : c[i]/doc_len[i] for i in np.nonzero(c)[0]}})
            
        # get probability of term in full context
        full_ptc = dict(zip(terms_list,tdm.sum(axis=1)/tdm.sum())) 

        lambd = param
        result = []
        for doc in tqdm.tqdm_notebook(doc_names):
            score = 1.0
            # score*=p(t|d)
            for qw in query:
                ptd = full_ptd.get(qw,{}).get(doc,1) #P(t|Md)
                ptc = full_ptc.get(qw,1) #P(t|Mc)
                pd = lambd * ptd + (1-lambd) * ptc 
                score *= pd
            
            # Add document score to result
            result.append((doc, score))
            
        # Sort & return results
        return  sorted(result, key= lambda x : x[1], reverse=True)
    

### Testing

Check if this type of ranking gives meaningful results. For each query output document category, doc_id, score, and the beginning of the document, as it is shown below. Analyze if categories and contents match the queries. 

## <center><font color='red'>Implementation Not Finished !!!</font></center>

In [7]:
def process_query(raw_query):
    # TODO: process user query and print search results including document category, id, score, and some part of it
    query = query.split(" ")
    #lm_rank = lm_rank_documents()
    

user_queries = ["piano concert", "symptoms of head trauma", "wall street journal"]
for q in user_queries:
    process_query(q)
    print("\n")

user query: [95mpiano concert[0m
prepared query: Counter({'piano': 1, 'concert': 1})

search results:
[1mmusic[0m 13330 0.012384164490679759
atlanta prominent midtown intersection one step closer becoming major cultural landmark the atlanta ...
[1meconomy[0m 11335 0.012384164490679759
atlanta prominent midtown intersection one step closer becoming major cultural landmark the atlanta ...
[1mmusic[0m 12926 0.011382499792705511
felt like was going church marry guy never met said the jazz violinist regina carter the metaphorica...
[1mmusic[0m 14390 0.010661589922122
hailed los angeles brightest flower its flashiest ship sail its keenest architectural triumph perhap...
[1mmusic[0m 13818 0.010549141787975117
everything was finished sept the super bowl logo would reflection new orleans featuring streetcar an...


user query: [95msymptoms of head trauma[0m
prepared query: Counter({'symptoms': 1, 'head': 1, 'trauma': 1})

search results:
[1mbrain_injury[0m 7319 0.060228773783760

## 2. Topic modeling

Now let's use *Latent Dirichlet Allocation* to identify topics in this collection and check if they match the original topics (fuel, economy, etc.). Go through the tutorial [here](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0) and apply the ideas there to our dataset. 

In [6]:
# Results Print function
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

In [7]:
#prepare data for LDA Model fit
number_topics = 4
vectorizer = CountVectorizer(token_pattern = r"(?u)\b\w+\b")
data = vectorizer.fit_transform(list(all_data.values()))

In [8]:
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(data)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=4, n_jobs=-1,
                          perp_tol=0.1, random_state=None,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=0)

In [9]:
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, vectorizer, 50)

Topics found via LDA:

Topic #0:
the and that for said with new its are have from company has was but percent more year will york they not enron their than which had about last companies this times been would million were market one when some business his who economy feb also stock other out years

Topic #1:
the and that for with was his but from you she are said they her who have has this one not new about when all had their there more like can were what out people which will been music just says time two than into after first some its would

Topic #2:
the and with patients was were injury for brain this after that traumatic from study are tbi results injuries trauma head treatment had between severe not outcome clinical patient these group methods cases years all may children case who than associated have during age fractures using more data following one

Topic #3:
the and that for with said are new will has from news but have his atlanta was they bush this not who service their mor