# Topic Modelling

In this session we will discuss an approach to discover underlying "topics" in your text collection. But before we start, let's discuss the task more broadly.

There are two scenarios when identifying topics in a collection of texts, depending on what you know in advance:

1. you know which topics you are looking for
2. you don't know which topics you are looking for

In this session we look at the second scenario. If you are in the first one, a later session will cover text classification.

## What do we mean by "topics"?

1. Groups of tokens that are likely to appear in **the same context**
2. A **hidden structure** that determines how tokens appear in the corpus

The first is what you observe (tokens co-occurring), the second is what you assume when you use a topic modelling approach.

![](images/lda.png)

## How do we get these topics?

Many ways:
- Latent Semantic Analysis
- Probabilistic Latent Semantic Analysis
- Latent Dirichlet Allocation (LDA), the most widely adopted approach

In the last ten years, LDA has been a highly popular approach in digital humanities for corpus exploration, due to its flexibility (it can be applied to any language, given a tokenizer). You "just" need to choose, in advance, the number of topics you want to discover.
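
The "hidden structure" that LDA assumes can be read as a recipe for generating documents: each document has its own mixture of topics, and each token is produced by first picking a topic and then picking a word from that topic. The sketch below simulates this recipe with the standard library only. The vocabulary, the two hand-written topics and the function names are hypothetical, and in actual LDA the topic-word distributions are themselves learned rather than fixed by hand.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocab, alpha, length, rng):
    """Generate one document by following the LDA generative story."""
    # 1. Each document gets its own mixture over topics.
    topic_mixture = sample_dirichlet(alpha, rng)
    tokens = []
    for _ in range(length):
        # 2. Pick a topic for this token position from the document's mixture.
        topic = rng.choices(range(len(topic_mixture)), weights=topic_mixture)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        tokens.append(word)
    return topic_mixture, tokens


rng = random.Random(0)
vocab = ["engine", "steam", "factory", "garden", "flower", "rain"]
# Two hand-written topics (hypothetical): one about machines, one about nature.
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],
]
mixture, doc = generate_document(
    topic_word_probs, vocab, alpha=[0.5, 0.5], length=8, rng=rng
)
print([round(p, 2) for p in mixture], doc)
```

Training an LDA model runs this story in reverse: given only the tokens, it estimates the topic mixtures and the topic-word distributions that are most likely to have produced them. The number of topics (two in this sketch) has to be fixed before training.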

[Slides to go through together](https://docs.google.com/presentation/d/1u5Fs1C6vwdfsv93H-iX3c4jkwyjgijxShsUjoGZPwcc/edit?usp=sharing) (starting from slide 15)

In [6]:
def clean_animacy_text(snippet:str)-> str:
    """
    Remove specific tags from sentences in the animacy dataset

    Args:
        snippet (str): a snippet of text from the dataset

    Returns:
        str: the same snippet, without the tags
    """
    assert type(snippet) is str, 'The input is not a string'
    snippet = snippet.replace('[SEP]','')
    snippet = snippet.replace('***','')
    return snippet


In [27]:
string = 'Äî the spinning-jenny and the steam- ***engine*** have been long known to be inimical'

clean_str = clean_animacy_text(string)
print(clean_str)

Äî the spinning-jenny and the steam- engine have been long known to be inimical


In [5]:
import spacy

# We only need spaCy's tokenizer here, so the small English model is enough.
# If the model is not installed yet, uncomment the next line to download it
# spacy.cli.download("en_core_web_sm")

# Load the small English model
nlp = spacy.load("en_core_web_sm")

def tokenize_str(text: str) -> list[str]:
    """Split a string into a list of token strings using spaCy."""
    assert type(text) is str
    processed_text = nlp(text)
    tokenised_text = [token.text for token in processed_text]
    return tokenised_text

In [28]:
tokenize_str(clean_str)

['Äî',
 'the',
 'spinning',
 '-',
 'jenny',
 'and',
 'the',
 'steam-',
 'engine',
 'have',
 'been',
 'long',
 'known',
 'to',
 'be',
 'inimical']

In [7]:
import pandas as pd

animacy_df = pd.read_csv('data/LwM-nlp-animacy-annotations-machines19thC.tsv',sep='\t')
snippets = animacy_df['SentenceCtxt'].to_list()

snippets = [clean_animacy_text(snippet) for snippet in snippets]
tokenised_snippets = [tokenize_str(snippet) for snippet in snippets]

print(len(tokenised_snippets))

594


In [8]:
from gensim import corpora, models

# for running LDA in gensim we need a dictionary of all the words
dictionary = corpora.Dictionary(tokenised_snippets)
# and to count the word frequency in each sentence
X = [dictionary.doc2bow(text) for text in tokenised_snippets]
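
As a rough illustration of what the dictionary and the bag-of-words step produce, here is a standard-library sketch of the same idea. The helper names `build_vocabulary` and `to_bag_of_words` are hypothetical, and the ids are assigned by first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocabulary(tokenised_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenised_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bag_of_words(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


toy_docs = [["the", "steam", "engine"], ["the", "engine", "and", "the", "boiler"]]
token2id = build_vocabulary(toy_docs)
bows = [to_bag_of_words(doc, token2id) for doc in toy_docs]
print(token2id)
print(bows)
```

Each document becomes a list of `(token_id, count)` pairs, so word order is discarded and only the frequencies remain. This is the representation that `X` holds for the snippets above.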



In [9]:
# we decide on the number of topics and the training settings
num_topics = 5
chunksize = 100
passes = 50
iterations = 1000

# and we run topic models
ldamodel = models.ldamodel.LdaModel(
    X, num_topics=num_topics, id2word=dictionary, chunksize=chunksize,
    alpha='auto', eta='auto', update_every=1,
    iterations=iterations, passes=passes,
)

In [10]:

# let's get the ten most relevant words for each topic
get_topics = ldamodel.show_topics(num_topics=num_topics, num_words=10, formatted=False)

# keep only the words, dropping their probabilities
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in get_topics]

topic_ids = []

# print each topic id followed by its words
for topic, words in topics_words:
    print(str(topic) + "::" + str(words))
    topic_ids.append(topic)

0::[',', 'the', 'of', '.', 'and', 'a', 'to', ' ', 'in', 'that']
1::['°', 'blue', 'door', 'narrow', 'eyes', '29', 'past', 'broke', 'dashed', 'ad']
2::['HISTORY', 'consumption', 'strong', 'press', 'Russian', '350', 'while', '17', 'principal', 'factory']
3::['I', 'you', 'my', "'", 'me', "n't", '\t', 'mind', "'s", 'am']
4::[',', 'the', 'of', 'and', '.', ' ', ';', 'in', 'is', 'The']
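
Several of these topics are dominated by punctuation, whitespace and function words, because every token produced by the tokenizer was kept. One possible remedy is to filter the tokens before building the dictionary. The sketch below uses only the standard library, with a deliberately tiny, hand-picked stop-word list that is an assumption for illustration. With spaCy, the token attributes `is_stop`, `is_punct` and `is_space` could serve the same purpose inside `tokenize_str`.

```python
import string

# A deliberately tiny stop-word list, chosen by hand for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "that", "is", "be", "have", "been"}


def keep_content_tokens(tokens, stop_words=STOP_WORDS):
    """Drop whitespace, punctuation-only tokens and stop words; lowercase the rest."""
    kept = []
    for token in tokens:
        stripped = token.strip()
        if not stripped:
            continue  # whitespace tokens such as ' ' or '\t'
        if all(char in string.punctuation for char in stripped):
            continue  # tokens made only of punctuation, e.g. ',' or '-'
        lowered = stripped.lower()
        if lowered in stop_words:
            continue  # function words that carry little topical meaning
        kept.append(lowered)
    return kept


example = ["The", "spinning", "-", "jenny", "and", "the", "steam-",
           "engine", " ", ",", "have", "been", "known"]
print(keep_content_tokens(example))
```

Lowercasing also merges variants such as `the` and `The`, which appear as separate entries in the output above. Whether to lowercase, and which words to treat as stop words, depends on the corpus and the research question.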


In [13]:
%matplotlib inline
import pyLDAvis
import warnings
import pyLDAvis.gensim_models as gensimvis
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pyLDAvis.enable_notebook()

# feed the LDA model into the pyLDAvis instance
topicData = gensimvis.prepare(ldamodel, X, dictionary, mds='mmds')   

pyLDAvis.display(topicData)  

