### Introduction

Latent Dirichlet Allocation (LDA) is a three-level hierarchical Bayesian model. Put differently, it is a generative statistical model that explains how a collection of text documents can be described by a set of unobserved topics. Each document in the collection is modelled as a finite mixture over a latent set of topics, and each topic is characterized by a distribution over words. LDA generalizes the pLSA model; it differs primarily by placing a Dirichlet prior on the per-document topic mixture, which leads to more reasonable mixtures and less susceptibility to overfitting. LDA is an important model in NLP and is used for topic discovery, similarity comparison, document modeling and classification.

In more detail, LDA is a bag-of-words model built on word co-occurrence. It treats a piece of text as a collection of words and ignores their order. One piece of text can have many topics, and each word in it is generated by one of those topics. The generative process runs as follows. First, document i receives a topic distribution theta(i) drawn from a Dirichlet distribution with parameter alpha. Then, for each word position j in document i, a topic z(i,j) is drawn from theta(i). Next, the word distribution of topic z(i,j), itself drawn from a Dirichlet distribution with parameter beta, is used to sample the final word w(i,j). The model is then fitted by maximum likelihood using an EM-type algorithm. In summary, alpha and beta are corpus-level parameters, theta is a document-level variable, and z and w are word-level variables.

![LDA](Latent_Dirichlet_allocation.svg.png)
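
The generative process described above can be sketched in a few lines. This is only an illustration under stated assumptions: the sizes and prior values are toy choices, `phi` is a name introduced here for the per-topic word distributions, and a Dirichlet sample is obtained by normalizing independent Gamma draws. It is not how gensim fits the model.

```python
import random

random.seed(0)

def sample_dirichlet(concentration):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs):
    """Draw an index according to the given probabilities."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy sizes and priors, chosen only for illustration
n_topics, vocab_size, doc_len = 3, 8, 10
alpha = [0.5] * n_topics     # prior over topics within a document
beta = [0.1] * vocab_size    # prior over words within a topic

# One word distribution per topic, each drawn from Dirichlet(beta)
phi = [sample_dirichlet(beta) for _ in range(n_topics)]

# Topic mixture theta(i) for a single document, drawn from Dirichlet(alpha)
theta = sample_dirichlet(alpha)

# For each word position j: draw a topic z(i,j), then a word w(i,j) from that topic
z = [sample_index(theta) for _ in range(doc_len)]
w = [sample_index(phi[k]) for k in z]

print("theta:", [round(p, 3) for p in theta])
print("topic per word:", z)
print("word ids:", w)
```

Fitting LDA runs this process in reverse: given only the observed words `w`, it estimates plausible values for `theta` and the topic-word distributions.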

### Libraries

In [1]:
import numpy as np
import pandas as pd
import os

from gensim import corpora, models
from gensim.parsing.preprocessing import preprocess_string

### Data and Preprocessing

In [2]:
data = np.load('literature.npy', allow_pickle=True).item()

In [3]:
# Extract papers from dictionary and save in a list
texts = []

for _, sections in data.items():
    full_text = " ".join(sections.values())
    texts.append(full_text)

# preprocess_string applies gensim's default filters: lower-casing, removal of tags,
# punctuation, digits, stop words and very short tokens, followed by stemming
processed_texts = [preprocess_string(text) for text in texts]

In [4]:
# Map each unique token in the papers to an integer id
dictionary = corpora.Dictionary(processed_texts)
# Convert each paper into a bag of words: a list of (word_id, word_count) tuples
corpus = [dictionary.doc2bow(text) for text in processed_texts]
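
To make the `(word_id, word_count)` representation concrete, here is a minimal stand-in written with the standard library. The two toy documents and the names `token2id` and `to_bow` are hypothetical, and the ids gensim assigns to tokens may differ from the ones produced here.

```python
from collections import Counter

# Two toy tokenized documents (hypothetical, not taken from the corpus)
toy_docs = [
    ["migrat", "climat", "household", "migrat"],
    ["climat", "drought", "household", "climat"],
]

# Assign an integer id to each distinct token, in order of first appearance
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return a sorted list of (word_id, word_count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Word order is lost in this representation, which is exactly the bag-of-words assumption LDA relies on.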

### LDA

In [5]:
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=50, random_state=42)

### Results

In [6]:
# Full topic distribution of each paper, stored as a vector of topic probabilities
topics = []

for bow in corpus:
    dist = lda_model.get_document_topics(bow, minimum_probability=0)
    topic = np.array([prob for _, prob in dist])
    topics.append(topic)
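
One possible use of these per-paper topic vectors is similarity comparison between papers. The sketch below computes cosine similarity on three made-up topic vectors; `toy_topics` and `cosine_similarity` are illustrative names, and the numbers are not taken from the fitted model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up topic distributions for three papers over three topics
toy_topics = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]

sim_01 = cosine_similarity(toy_topics[0], toy_topics[1])
sim_02 = cosine_similarity(toy_topics[0], toy_topics[2])
print(f"paper 0 vs paper 1: {sim_01:.3f}")
print(f"paper 0 vs paper 2: {sim_02:.3f}")
```

Papers whose topic mixtures look alike score close to 1, while papers dominated by different topics score much lower.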

In [14]:
for i, bow in enumerate(corpus):
    dist = lda_model.get_document_topics(bow, minimum_probability=0)
    topic = max(dist, key=lambda x: x[1])[0]

    print(f"paper {i}'s topic might be {topic}")

paper 0's topic might be 33
paper 1's topic might be 19
paper 2's topic might be 49
paper 3's topic might be 31
paper 4's topic might be 1
paper 5's topic might be 31
paper 6's topic might be 4
paper 7's topic might be 8
paper 8's topic might be 8
paper 9's topic might be 25
paper 10's topic might be 7
paper 11's topic might be 40
paper 12's topic might be 42
paper 13's topic might be 23
paper 14's topic might be 35
paper 15's topic might be 29
paper 16's topic might be 25
paper 17's topic might be 5
paper 18's topic might be 33
paper 19's topic might be 8
paper 20's topic might be 7
paper 21's topic might be 29
paper 22's topic might be 1
paper 23's topic might be 35
paper 24's topic might be 7
paper 25's topic might be 35
paper 26's topic might be 14
paper 27's topic might be 21
paper 28's topic might be 8
paper 29's topic might be 21
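
The assignments printed above can be summarized by grouping papers under their dominant topic. The sketch below does this with the standard library on the 30 values copied from that output; the variable names are illustrative.

```python
from collections import defaultdict

# Dominant topic of papers 0..29, copied from the printed output
dominant_topics = [
    33, 19, 49, 31, 1, 31, 4, 8, 8, 25,
    7, 40, 42, 23, 35, 29, 25, 5, 33, 8,
    7, 29, 1, 35, 7, 35, 14, 21, 8, 21,
]

# Map each topic id to the list of papers it dominates
papers_by_topic = defaultdict(list)
for paper_id, topic_id in enumerate(dominant_topics):
    papers_by_topic[topic_id].append(paper_id)

# Show the largest groups first
for topic_id, papers in sorted(papers_by_topic.items(), key=lambda kv: -len(kv[1])):
    print(f"topic {topic_id}: papers {papers}")

print(f"{len(papers_by_topic)} distinct dominant topics")
```

Only 17 of the 50 topics are the dominant topic of at least one paper, and topic 8 covers four papers (7, 8, 19 and 28).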


### Evaluation

LDA can automate part of the text classification, but it cannot assign externally defined codes to a text, especially when the pre-defined class names are not present in the dictionary. We therefore usually have to annotate the papers manually, based on the topics we get, the top words of those topics, and the matching topic of each paper. Even so, this lowers the workload compared with fully manual annotation. Here we first take paper 0 for evaluation. According to the result above, its most likely topic is 33.

In [19]:
print(lda_model.print_topic(33, topn=200))

0.040*"migrat" + 0.024*"environment" + 0.023*"household" + 0.016*"event" + 0.013*"individu" + 0.010*"climat" + 0.010*"migrant" + 0.010*"commun" + 0.009*"stressor" + 0.009*"chang" + 0.009*"intent" + 0.008*"zone" + 0.008*"relat" + 0.008*"head" + 0.008*"studi" + 0.007*"peopl" + 0.007*"transit" + 0.007*"like" + 0.007*"level" + 0.006*"respond" + 0.006*"affect" + 0.006*"term" + 0.006*"ghana" + 0.006*"econom" + 0.006*"decis" + 0.006*"variabl" + 0.005*"educ" + 0.005*"countri" + 0.005*"differ" + 0.005*"result" + 0.005*"factor" + 0.005*"model" + 0.004*"land" + 0.004*"adapt" + 0.004*"non" + 0.004*"impact" + 0.004*"member" + 0.004*"forest" + 0.004*"percept" + 0.004*"major" + 0.004*"savannah" + 0.004*"includ" + 0.003*"area" + 0.003*"effect" + 0.003*"associ" + 0.003*"long" + 0.003*"rural" + 0.003*"intern" + 0.003*"year" + 0.003*"black" + 0.003*"ag" + 0.003*"crop" + 0.003*"agricultur" + 0.003*"influenc" + 0.003*"survei" + 0.003*"sudden" + 0.003*"stai" + 0.003*"control" + 0.003*"develop" + 0.003*"loca
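
The printed topic string is hard to scan at this length. As a sketch, it can be parsed into `(word, weight)` pairs with a regular expression; `parse_topic_string` is a hypothetical helper and the input is only the first four terms of the output above. gensim's `show_topic` method returns such pairs directly, so this parser is only useful when nothing but the printed string is available.

```python
import re

# First four terms of the printed topic, copied from the output above
topic_string = '0.040*"migrat" + 0.024*"environment" + 0.023*"household" + 0.016*"event"'

def parse_topic_string(text):
    """Turn 'weight*"word" + ...' into a list of (word, weight) pairs."""
    pattern = re.compile(r'([0-9.]+)\*"([^"]+)"')
    return [(word, float(weight)) for weight, word in pattern.findall(text)]

top_words = parse_topic_string(topic_string)
for word, weight in top_words:
    print(f"{word:<12} {weight:.3f}")
```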

In [20]:
# Codes assigned to paper 0 (1 = applies, 0 = does not apply), one value per column below
columns = [
    'Qualitative method', 'Quantitative method', 'Socio-demo-economic data', 'Environmental data',
    'Individuals', 'Households', 'Subnational groups', 'National groups', 'International groups',
    'Urban', 'Rural', 'Time frame considered', 'Foresight',
    'Rainfall pattern / Variability', 'Temperature change', 'Food scarcity / Famine / Food security', 'Drought / Aridity / Desertification',
    'Floods', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation', 'Self assessment / Perceived environment',
    'Labour migration', 'Marriage migration', 'Refugees', 'International migration', 'Cross-border migration', 'Internal migration',
    'Rural to urban', 'Rural to rural', 'Circular / Seasonal', 'Long distance', 'Short distance', 'Temporal', 'Permanent',
    'Age', 'Gender', 'Ethnicity / Religion',
]
values = [
    1, 1, 1, 1,
    1, 1, 1, 0, 0,
    1, 1, 1, 1,
    1, 0, 1, 1,
    1, 1, 1,
    0, 1, 0, 0, 0, 0,
    0, 0, 0, 1, 1, 0, 0,
    0, 1, 0,
]
result = pd.DataFrame([values], columns=columns).astype(str)

In [29]:
manual_result = pd.read_excel('manual.xlsx').astype(str)

In [30]:
manual_result = manual_result.iloc[[3]].drop(columns=['ID', 'AUTHOR', 'TITLE']).reset_index(drop=True)
manual_result

Unnamed: 0,Qualitative method,Quantitative method,Socio-demo-economic data,Environmental data,Individuals,Households,Subnational groups,National groups,International groups,Urban,...,Rural to urban,Rural to rural,Circular / Seasonal,Long distance,Short distance,Temporal,Permanent,Age,Gender,Ethnicity / Religion
0,0,1,1,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0


In [34]:
bool_result = (result.iloc[0] == manual_result.iloc[0])
bool_result.mean(axis=0)

np.float64(0.6111111111111112)
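
The value 0.611 is the share of the 36 codes on which the two annotations agree, that is 22 of 36. Plain agreement counts matching zeros as well as matching ones, so it can help to look at precision and recall on the positive codes too. The sketch below shows one way to compute all three on toy data; `agreement_report` and the two toy lists are hypothetical and are not the notebook's actual annotations.

```python
def agreement_report(predicted, reference):
    """Compare two equal-length lists of 0/1 codes."""
    if len(predicted) != len(reference):
        raise ValueError("both lists must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # code set in both
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # set only in predicted
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # set only in reference
    agreement = sum(1 for p, r in pairs if p == r) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"agreement": agreement, "precision": precision, "recall": recall}

# Toy codes, for illustration only
toy_predicted = [1, 1, 0, 1, 0, 0]
toy_reference = [1, 0, 0, 1, 1, 0]

report = agreement_report(toy_predicted, toy_reference)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

With many codes that are zero in both annotations, agreement can look high even when few positive codes match, which is why the two extra figures are worth reporting alongside it.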