### Introduction

Latent Dirichlet Allocation (LDA) is a three-level hierarchical Bayesian model. Put differently, it is a generative statistical model that explains how a collection of text documents can be described by a set of unobserved topics. Each document in the collection is modelled as a finite mixture over a latent set of topics, while each topic is characterized by a distribution over words. LDA generalizes the pLSA model; it differs primarily by placing a Dirichlet prior on the per-document topic mixture, which leads to more reasonable mixtures and less susceptibility to overfitting. LDA is an important model in NLP and is used for topic discovery, similarity comparison, document modeling and classification.

In more detail, LDA is a bag-of-words model built on word co-occurrence. It treats a piece of text as a collection of words and ignores their order. One text can contain many topics, while each word in it is generated by exactly one topic. The generative process is as follows. First, for document i, a topic distribution theta(i) is drawn from a Dirichlet distribution with parameter alpha. Then, for the j-th word position of document i, a topic z(i,j) is drawn from theta(i). Next, the word distribution of topic z(i,j), itself drawn from a Dirichlet distribution with parameter beta, is used to sample the final word w(i,j). The model is fitted with Gibbs sampling or an EM-style (variational) procedure, iterating until it converges to the final result. In summary, alpha and beta are corpus-level parameters, theta is a document-level variable, and z and w are word-level variables.
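
The generative process described above can be sketched in a few lines of NumPy. This is only an illustration under assumed toy sizes: `n_topics`, `vocab_size`, `doc_len`, the hyperparameter values and the helper `generate_document` are all hypothetical. It samples a document from the model rather than fitting one, and it is not how gensim trains LDA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and hyperparameters (assumptions for illustration only)
n_topics, vocab_size, doc_len = 3, 8, 20
alpha = np.full(n_topics, 0.5)   # Dirichlet parameter over topics
beta = np.full(vocab_size, 0.5)  # Dirichlet parameter over words

# Corpus level: one word distribution per topic, each drawn from Dirichlet(beta)
phi = rng.dirichlet(beta, size=n_topics)  # shape (n_topics, vocab_size)


def generate_document(phi, alpha, doc_len, rng):
    """Sample one document (as word ids) from the LDA generative process."""
    # Document level: topic mixture theta(i) drawn from Dirichlet(alpha)
    theta = rng.dirichlet(alpha)
    # Word level: a topic z(i,j) for every word position, drawn from theta(i)
    z = rng.choice(len(alpha), size=doc_len, p=theta)
    # Word level: each word w(i,j) drawn from the word distribution of its topic
    w = np.array([rng.choice(phi.shape[1], p=phi[k]) for k in z])
    return theta, z, w


theta, z, w = generate_document(phi, alpha, doc_len, rng)
```

Fitting LDA runs this process in reverse: only the words `w` are observed, and `theta`, `z` and `phi` have to be inferred.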

![LDA](Latent_Dirichlet_allocation.svg.png)

### Libraries

In [1]:
import numpy as np
import pandas as pd
import os

from gensim import corpora, models
from gensim.parsing.preprocessing import preprocess_string

### Data and Preprocessing

In [2]:
data = np.load('literature.npy', allow_pickle=True).item()

In [3]:
# Extract papers from dictionary and save in a list
texts = []

for _, sections in data.items():
    full_text = " ".join(sections.values())
    texts.append(full_text)

# gensim's preprocess_string applies its default filters: lowercasing, stripping tags,
# punctuation and digits, removing stop words and short tokens, and stemming
processed_texts = [preprocess_string(text) for text in texts]

In [4]:
# Build the dictionary (token -> integer id) from the preprocessed papers
dictionary = corpora.Dictionary(processed_texts)
# Build the corpus: each paper becomes a list of (word_id, word_count) tuples
corpus = [dictionary.doc2bow(text) for text in processed_texts]
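
To make the `(word_id, word_count)` representation concrete, here is a plain-Python sketch of the bag-of-words encoding on two toy token lists. The helpers `build_vocab` and `to_bow` are hypothetical and only mimic the shape of what `corpora.Dictionary` and `doc2bow` return; the ids that gensim actually assigns may differ.

```python
from collections import Counter


def build_vocab(token_lists):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (word_id, word_count) pairs; tokens not in vocab are skipped."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Toy stemmed tokens (not taken from the real corpus)
toy_texts = [["migrat", "climat", "migrat"], ["climat", "household"]]
vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(tokens, vocab) for tokens in toy_texts]
```

Here `toy_corpus` is `[[(0, 2), (1, 1)], [(1, 1), (2, 1)]]`: the word order is gone and only the counts remain, which is exactly the information LDA works with.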

### LDA

When using LDA we usually have to train the model on our own data, especially for niche disciplines. Otherwise the learned topics will not match the corpus well.

In [5]:
# models.LdaMulticore(..., workers=8, ...) can be used instead to train in parallel
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=50, random_state=42)

### Results

lda_model.show_topics(num_topics=10, num_words=50)

In [37]:
# Print the (topic_id, probability) pairs of each paper
for bow in corpus:
    print(lda_model.get_document_topics(bow))

[(33, np.float32(0.96535563)), (42, np.float32(0.01416937))]
[(7, np.float32(0.045853946)), (19, np.float32(0.74599874)), (31, np.float32(0.020226821)), (40, np.float32(0.16536564))]
[(31, np.float32(0.0266248)), (42, np.float32(0.0571256)), (49, np.float32(0.91036904))]
[(31, np.float32(0.9819842))]
[(1, np.float32(0.59377176)), (4, np.float32(0.10625016)), (7, np.float32(0.027269533)), (21, np.float32(0.13394631)), (25, np.float32(0.0102562765)), (29, np.float32(0.08460806)), (35, np.float32(0.04128924))]
[(31, np.float32(0.9997854))]
[(1, np.float32(0.047283813)), (4, np.float32(0.88994294)), (19, np.float32(0.04369402)), (35, np.float32(0.018097706))]
[(8, np.float32(0.5319542)), (25, np.float32(0.14291236)), (31, np.float32(0.061975256)), (35, np.float32(0.2308555))]
[(8, np.float32(0.9853141))]
[(25, np.float32(0.964285)), (42, np.float32(0.03551389))]
[(7, np.float32(0.99975896))]
[(25, np.float32(0.01275665)), (31, np.float32(0.011655295)), (40, np.float32(0.96094334))]
[(42, n

In [None]:
topics = []

for bow in corpus:
    # minimum_probability=0 requests the weight of every topic, not only the dominant ones
    dist = lda_model.get_document_topics(bow, minimum_probability=0)
    topic = np.array([prob for _, prob in dist])
    topics.append(topic)
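
The loop above assumes that `get_document_topics` returns an entry for every topic when `minimum_probability=0`. gensim appears to clamp this threshold to a very small positive value internally, so a topic with a vanishingly small weight could in principle be dropped, which would shorten the vector. A sketch that does not depend on this fills a fixed-length vector from the sparse `(topic_id, prob)` pairs. The helper `to_dense` and the toy input are hypothetical; `gensim.matutils.sparse2full` offers similar functionality.

```python
import numpy as np


def to_dense(sparse_dist, num_topics):
    """Turn a list of (topic_id, prob) pairs into a vector of length num_topics."""
    dense = np.zeros(num_topics, dtype=np.float32)
    for topic_id, prob in sparse_dist:
        dense[topic_id] = prob
    return dense


# Toy sparse output with the same shape as what get_document_topics returns
toy_dist = [(1, 0.25), (3, 0.75)]
vec = to_dense(toy_dist, num_topics=5)

# Index of the topic with the largest weight
dominant_topic = int(np.argmax(vec))
```

With the model trained above, the call would look like `to_dense(lda_model.get_document_topics(bow, minimum_probability=0), lda_model.num_topics)`, so every paper gets a vector of the same length.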

### Evaluation

LDA can automate part of the text classification. However, it cannot assign externally defined codes to a text for a top-down paper review, especially when the predefined class names do not appear in the training data or in the dictionary built from it. We therefore usually have to annotate the papers manually, based on the topics we obtain, the top words of those topics, and the topics matched to each paper. Even so, LDA lowers the workload compared with fully manual annotation. Here we first take paper 0 for evaluation. According to the result, its topics are most likely 33 (weight about 0.97) and 42 (weight about 0.01).

In [19]:
print(lda_model.print_topic(33, topn=200))

0.040*"migrat" + 0.024*"environment" + 0.023*"household" + 0.016*"event" + 0.013*"individu" + 0.010*"climat" + 0.010*"migrant" + 0.010*"commun" + 0.009*"stressor" + 0.009*"chang" + 0.009*"intent" + 0.008*"zone" + 0.008*"relat" + 0.008*"head" + 0.008*"studi" + 0.007*"peopl" + 0.007*"transit" + 0.007*"like" + 0.007*"level" + 0.006*"respond" + 0.006*"affect" + 0.006*"term" + 0.006*"ghana" + 0.006*"econom" + 0.006*"decis" + 0.006*"variabl" + 0.005*"educ" + 0.005*"countri" + 0.005*"differ" + 0.005*"result" + 0.005*"factor" + 0.005*"model" + 0.004*"land" + 0.004*"adapt" + 0.004*"non" + 0.004*"impact" + 0.004*"member" + 0.004*"forest" + 0.004*"percept" + 0.004*"major" + 0.004*"savannah" + 0.004*"includ" + 0.003*"area" + 0.003*"effect" + 0.003*"associ" + 0.003*"long" + 0.003*"rural" + 0.003*"intern" + 0.003*"year" + 0.003*"black" + 0.003*"ag" + 0.003*"crop" + 0.003*"agricultur" + 0.003*"influenc" + 0.003*"survei" + 0.003*"sudden" + 0.003*"stai" + 0.003*"control" + 0.003*"develop" + 0.003*"loca

In [38]:
print(lda_model.print_topic(42, topn=200))

0.025*"hawaw" + 0.016*"peopl" + 0.015*"return" + 0.012*"project" + 0.011*"livelihood" + 0.010*"group" + 0.010*"migrat" + 0.010*"area" + 0.009*"local" + 0.008*"jawasir" + 0.008*"forc" + 0.008*"right" + 0.007*"tradit" + 0.007*"land" + 0.007*"opportun" + 0.007*"new" + 0.007*"agricultur" + 0.006*"irrig" + 0.006*"anim" + 0.006*"drought" + 0.006*"nile" + 0.006*"import" + 0.005*"displac" + 0.005*"sudan" + 0.005*"nomad" + 0.005*"famili" + 0.005*"differ" + 0.005*"pastoralist" + 0.005*"cultiv" + 0.004*"northern" + 0.004*"establish" + 0.004*"farm" + 0.004*"women" + 0.004*"stai" + 0.004*"success" + 0.004*"refuge" + 0.004*"work" + 0.004*"situat" + 0.004*"wadi" + 0.004*"live" + 0.004*"interview" + 0.004*"sub" + 0.004*"homeland" + 0.004*"leader" + 0.004*"develop" + 0.004*"labour" + 0.004*"institut" + 0.004*"time" + 0.004*"possibl" + 0.003*"addit" + 0.003*"process" + 0.003*"environment" + 0.003*"surviv" + 0.003*"rain" + 0.003*"provid" + 0.003*"secur" + 0.003*"number" + 0.003*"rainfal" + 0.003*"famin" 

In [41]:
# Codes for paper 0 derived from its LDA topics (1 = code applies, 0 = not), one per column below
codes = [[1, 1, 1, 1,
          1, 1, 1, 0, 0,
          1, 1, 1, 1,
          1, 0, 1, 1,
          1, 1, 1,
          0, 1, 1, 0, 0, 0,
          0, 0, 1, 1, 1, 1, 0,
          0, 1, 0]]
columns = ['Qualitative method', 'Quantitative method', 'Socio-demo-economic data', 'Environmental data',
           'Individuals', 'Households', 'Subnational groups', 'National groups', 'International groups',
           'Urban', 'Rural', 'Time frame considered', 'Foresight',
           'Rainfall pattern / Variability', 'Temperature change', 'Food scarcity / Famine / Food security', 'Drought / Aridity / Desertification',
           'Floods', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation', 'Self assessment / Perceived environment',
           'Labour migration', 'Marriage migration', 'Refugees', 'International migration', 'Cross-border migration', 'Internal migration',
           'Rural to urban', 'Rural to rural', 'Circular / Seasonal', 'Long distance', 'Short distance', 'Temporal', 'Permanent',
           'Age', 'Gender', 'Ethnicity / Religion']
result = pd.DataFrame(codes, columns=columns).astype(str)

In [29]:
manual_result = pd.read_excel('manual.xlsx').astype(str)

In [30]:
manual_result = manual_result.iloc[[3]].drop(columns=['ID', 'AUTHOR', 'TITLE']).reset_index(drop=True)
manual_result

Unnamed: 0,Qualitative method,Quantitative method,Socio-demo-economic data,Environmental data,Individuals,Households,Subnational groups,National groups,International groups,Urban,...,Rural to urban,Rural to rural,Circular / Seasonal,Long distance,Short distance,Temporal,Permanent,Age,Gender,Ethnicity / Religion
0,0,1,1,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0


In [42]:
bool_result = (result.iloc[0] == manual_result.iloc[0])
bool_result.mean(axis=0)

np.float64(0.5277777777777778)
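
The agreement rate above counts matching zeros and matching ones alike, so it does not show whether the LDA-assisted coding tends to over-assign or under-assign codes. A small sketch, assuming both codings are available as 0/1 vectors, separates the two kinds of disagreement. The function `agreement_breakdown` and the toy vectors are hypothetical and are not the notebook's data.

```python
import numpy as np


def agreement_breakdown(predicted, reference):
    """Compare two 0/1 code vectors and report agreement, precision and recall."""
    predicted = np.asarray(predicted, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    tp = int(np.sum(predicted & reference))    # code assigned in both
    fp = int(np.sum(predicted & ~reference))   # assigned here, not in the reference
    fn = int(np.sum(~predicted & reference))   # missed here, present in the reference
    tn = int(np.sum(~predicted & ~reference))  # absent in both
    return {
        "agreement": (tp + tn) / predicted.size,
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


# Toy vectors for illustration only
toy_pred = [1, 1, 1, 0, 1, 0]
toy_ref = [1, 0, 1, 0, 0, 1]
scores = agreement_breakdown(toy_pred, toy_ref)
```

The notebook stores both rows as strings, so they would need converting first, for example `agreement_breakdown(result.iloc[0].astype(int), manual_result.iloc[0].astype(int))`.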