# Text Analysis - Finding Themes and Patterns
## Pt 2 - Topic Models

What if you have texts that don't come with handy labels to distinguish between them? What if you want to discover the themes of discussion across text, and you have so many documents you can't hope to read them all? One approach is to use *Unsupervised Machine Learning*, a type of data analysis that is essentially the computer user saying "look at all this data, and tell me what patterns you see".

In this notebook we'll be using a particular type of unsupervised learning called *Topic Modelling*. Topic modelling looks particularly at the words and phrases used in texts and works out, based on how often words appear in different texts, what themes there might be across a collection of documents.

Some limitations to keep in mind...
- Topic modelling doesn't consider the ordering of words, just the existence or absence of words
- Topic modelling doesn't understand the meaning of words, just the existence or absence of words.
- Topic modelling doesn't implicitly know how many topics are in a collection of texts, you have to tell it (but there are ways around this).
- Topic modelling can produce junk topics.
- There is no objective way to determine if your topic modelling is 'good', it relies a lot on qualitative assessment and knowledge of the documents themselves.
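
To make the first limitation concrete, here is a minimal sketch using only the standard library. The two sentences are invented for illustration. Once each is reduced to word counts (a "bag of words"), they can no longer be told apart, even though they mean opposite things.

```python
from collections import Counter

# Two made-up sentences with opposite meanings but identical vocabulary.
doc_a = "the dog bit the man".split()
doc_b = "the man bit the dog".split()

# A bag-of-words representation keeps only how often each word occurs.
bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(bag_a)
print(bag_a == bag_b)
```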

In [1]:
import pandas as pd
import spacy
import gensim

# For later plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# FUNCTIONS
def train_phraser(texts, stopwords):
    sentences = [
        [token.lemma_.lower() for token in sentence if token.lemma_.lower().isalpha()]
        for doc in texts 
        for sentence in doc.sents]
    
    bigram_phraser = gensim.models.Phrases(sentences, common_terms=stopwords)
    return bigram_phraser


def filter_text(spacy_doc, phraser, stopwords):
    transformed_doc = []
    for sentence in spacy_doc.sents:
        sentence_tokens = [token.lemma_.lower() for token in sentence if token.lemma_.lower().isalpha()]
        transformed = phraser[sentence_tokens]
        transformed_doc.extend(transformed)
    tokens = [token for token in transformed_doc if token.lower() not in stopwords]
    return tokens

def dummy_function(doc):
    return doc

def top_terms(df, group_name=None, top_n=5):
    if group_name is not None:
        df = df.drop(columns=[group_name])
        
    return df.sum().sort_values(ascending=False).head(top_n)

## Setup 
Run all the cells from here to "SETUP END!" to load, and preprocess our data. You can do this quickly by clicking to the left of this cell to select it, holding shift and then clicking to select the cell "SETUP END!". This should highlight all cells in between, then just click the run button. Whilst you're waiting for it to run have a read of the intro to the next section.

### 1. Loading our Sample Data
See part 1 for full details on this process

In [3]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_to_fetch = ['alt.atheism', 'talk.religion.misc','comp.graphics', 'sci.space']

news_set = fetch_20newsgroups(subset='all', 
                              categories=newsgroups_to_fetch,
                              remove=('headers', 'footers', 'quotes'))

In [4]:
df = pd.DataFrame({'text':news_set['data'], 'category_num':news_set['target']})

In [5]:
category_lookup = {position: item for position, item in enumerate(news_set['target_names'])}
category_lookup

{0: 'alt.atheism', 1: 'comp.graphics', 2: 'sci.space', 3: 'talk.religion.misc'}

In [6]:
df['category_label'] = df['category_num'].apply(lambda category_number: category_lookup[category_number])

In [7]:
df.head()

   text                                                category_num  category_label
0  My point is that you set up your views as the ...   0             alt.atheism
1  \nBy '8 grey level images' you mean 8 items of...   1             comp.graphics
2  FIRST ANNUAL PHIGS USER GROUP CONFERENCE\n\n ...    1             comp.graphics
3  I responded to Jim's other articles today, but...   3             talk.religion.misc
4  \nWell, I am placing a file at my ftp today th...   1             comp.graphics


### 2. Preparing our Text

In [8]:
nlp = spacy.load('en_core_web_md')

In [9]:
stop_list = nlp.Defaults.stop_words


In [10]:
%time df['text_nlp'] = list(nlp.pipe(df['text'],n_process=7))

CPU times: user 29.1 s, sys: 1.57 s, total: 30.6 s
Wall time: 1min 21s


In [11]:
phraser = train_phraser(df['text_nlp'], stopwords=stop_list)

In [12]:
df['cleaned_tokens'] = df['text_nlp'].apply(filter_text, stopwords=stop_list, phraser=phraser)

In [13]:
df['cleaned_tokens']

0       [point, set, view, way, believe, eveil, world,...
1       [grey, level, image, mean, item, image, work, ...
2       [annual, phigs, user, group, conference, annua...
3       [respond, jim, article, today, neglect, respon...
4       [place, file, ftp, today, contain, polygonal, ...
                              ...                        
3382    [work, program, display, wireframe, model, use...
3383    [russian, ill, fate, phobos, mission, year_ago...
3384    [oh, gee, billion, dollar, cover, cost, feasab...
3385    [look, software, run, brand, new, know, site, ...
3386    [month, look, job, computer_graphic, software,...
Name: cleaned_tokens, Length: 3387, dtype: object

In [14]:
print(df.loc[2, 'cleaned_tokens'])

['annual', 'phigs', 'user', 'group', 'conference', 'annual', 'phigs', 'user', 'group', 'conference', 'hold', 'march', 'orlando', 'florida', 'conference', 'organize', 'laer', 'design', 'research_center', 'co', 'operation', 'ieee', 'graph', 'attendee', 'come', 'country', 'span', 'tinent', 'good', 'cross_section', 'phigs', 'community', 'represent', 'conference', 'participant', 'include', 'phigs', 'user', 'workstation', 'vendor', 'party', 'phigs', 'implementor', 'dard', 'committee', 'member', 'researcher', 'industry', 'academia', 'opening', 'speaker', 'richard', 'puk', 'challenge', 'phigs', 'user', 'charge', 'phigs', 'participate', 'phigs', 'standardization', 'activity', 'communicate', 'need', 'phigs', 'implementor', 'close', 'speaker', 'andries', 'van_dam', 'describe', 'vision', 'future', 'graphic', 'standard', 'phigs', 'technical', 'paper', 'session', 'conference', 'cover', 'follow', 'topic', 'phigs', 'x', 'application', 'toolkits', 'application', 'issues', 'texture_mapping', 'nurbs', 'p

# SETUP END!

## Unsupervised Machine Learning



In our data so far we have known which documents fall into different categories because we have labels for them. What do we do however, if we do not have helpful labels, or if we want to ask the computer whether it thinks that a corpus of documents might be divided up thematically somehow?

Here we are going to use various techniques that come under the heading of "Unsupervised Machine Learning". Supervised machine learning is when we know the categories that our data might fall into, and we can give the computer examples of data along with their labels, and the computer works out how to best predict, just given the data, what the label is likely to be.

Unsupervised machine learning essentially turns the relationship on its head, where we have the data, and we say to the computer, "you tell me how we should divide these up". In text, we use unsupervised machine learning to try and distinguish between documents about different topics, or to extract particular themes that might run across topics, without necessarily knowing these distinctions or themes in advance. 

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/unsupervised.png?raw=true" width="700">

In my research I tend to use these techniques as a first step to informing further qualitative analysis, discovering themes and then diving in to understand why those themes are prominent and what they actually mean.

[Image source](https://www.edureka.co/blog/introduction-to-machine-learning/)

## Unsupervised Learning: Topic Modelling
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/topic_modelling.png?raw=true" align="right" width="400">

Topic modelling is a well established standard in text analysis, and one of the most often used techniques in social science research. Topic modelling takes a set of documents and looks at what words tend to co-occur in the same documents.

At the start of the process, the "topics" are like empty bubbles, they are there in the data but we don't yet know what they are. Based on the co-occurrence of words and the separation of those co-occurrences into different documents the algorithm starts to work out...

- Which documents are more similar to one another and which dissimilar.
- Which words tend to co-occur more, and which less.

Based on these observations the algorithm creates two matrices.

- Term to Topic matrix - a matrix of scores that indicate how strongly each **term** is affiliated with each topic
- Document to Topic matrix - a matrix of scores that indicate how strongly each **document** is affiliated with each topic.

Topic modelling is what we call a 'soft clustering' method, as items can have a strong affiliation with more than one topic or category. Hard clustering requires an item be associated with only one category.
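
The difference between soft and hard clustering can be sketched with a single row of a document to topic matrix. The scores and the threshold below are hypothetical values chosen for illustration, not output from the model in this notebook.

```python
# Hypothetical document-to-topic scores for one document across four topics.
doc_scores = [0.45, 0.40, 0.10, 0.05]

# Soft clustering keeps every score, so we can list all topics above a threshold.
threshold = 0.25  # arbitrary cut-off chosen for this illustration
soft_topics = [topic for topic, score in enumerate(doc_scores) if score >= threshold]

# Hard clustering keeps only the single strongest topic.
hard_topic = max(range(len(doc_scores)), key=lambda topic: doc_scores[topic])

print(soft_topics)
print(hard_topic)
```

Here the document is almost as strongly affiliated with topic 1 as with topic 0, which a hard assignment would hide.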

### Latent Dirichlet Allocation (LDA) Topic Modelling
*Other Topic Modelling Algorithms are available*

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
count_vec = CountVectorizer(min_df=0.01, max_df=0.999, preprocessor=dummy_function, analyzer=dummy_function)

In [None]:
count_matrix = count_vec.fit_transform(df['cleaned_tokens'])
count_matrix

In [None]:
lda_model = LatentDirichletAllocation(n_components=4) # components meaning topics

#### Quick look at the two matrices
We'll come back to these more later

In [None]:
document_topic_matrix = lda_model.fit_transform(count_matrix)

print(document_topic_matrix.shape)
document_topic_matrix

In [None]:
topic_term_matrix = lda_model.components_

print(topic_term_matrix.shape)
topic_term_matrix
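
The values in `components_` are not probabilities. The scikit-learn documentation describes them as pseudocounts that become a per-topic word distribution once each row is divided by its total. A minimal sketch of that normalisation, using a small made-up matrix and vocabulary in place of the real ones:

```python
# Toy stand-in for lda_model.components_: 2 topics (rows) x 3 terms (columns).
toy_components = [
    [8.0, 1.0, 1.0],
    [2.0, 2.0, 6.0],
]
toy_terms = ["space", "image", "god"]  # hypothetical vocabulary

# Divide each row by its total so that every topic's scores sum to 1.
topic_term_probs = [
    [value / sum(row) for value in row]
    for row in toy_components
]

for topic, row in enumerate(topic_term_probs):
    print(topic, dict(zip(toy_terms, row)))
```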

### Exploring your Topic Models 
#### Quick and Easy: LDAVis

LDAVis is a visualisation tool specifically designed to allow quick interrogation of topic models. It offers an easy interactive interface to your models to allow you to quickly explore your data.

In [None]:
# we need to import and then also enable notebook mode so we can see the visual inside Jupyter

import pyLDAvis
# Note: in pyLDAvis 3.4 and later this module was renamed to pyLDAvis.lda_model
from pyLDAvis.sklearn import prepare as ldavis_prepare
pyLDAvis.enable_notebook()

In [None]:
#LDAvis has a special method for models built with scikit learn, 
# it takes the lda model we used to make the model, 
# the original count vectorizer matrix 
# and the vectorizer that made the matrix

# we initialise it here
visualiser = ldavis_prepare(lda_model, count_matrix, count_vec, sort_topics=False) # ignore the deprecation warning - it is hopefully going to be fixed in an upcoming version

#### Interpreting LDAvis
Run the cell below to open the visual.

On the left of the screen are the separate topics. 
- They are positioned closer, or further from each other, depending on how much the topics overlap. For example we can see that some topics overlap a lot, whilst others are similar, but do not necessarily overlap, indicating there is some distinction between them. 
- The size of the bubbles indicates how significant those topics are within the overall corpus.
- The numbers refer to the topic number but the numbers begin at 1, rather than 0 (helpfully).

On the right is the term information for the topics.
- If no topic is selected it gives the overall top terms for the corpus
- If a topic is selected it shows you the top terms for that topic, including an estimate of how frequent that term is in that topic (red) compared to its overall frequency (blue).
- Adjusting the slider at the top right allows you to tweak the measures to show terms more relevant to the topic itself.
- Slide all the way to the left to see terms that are highly specific to the topic but to the point that they might be too niche to be meaningful.
- Slide all the way to the right for terms that are broader but may be too generic as to not really distinguish the topics.
- A good rule of thumb is to set the slider around 0.6 for a balanced output.
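
The slider corresponds to the λ (lambda) value in the relevance measure described by the authors of LDAvis (Sievert and Shirley): λ × log p(term | topic) + (1 − λ) × log(p(term | topic) / p(term)). The sketch below applies that formula to two terms with made-up probabilities to show how the ranking flips between the two ends of the slider. The function and variable names are my own and are not part of pyLDAvis.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term to a topic for a slider value lam between 0 and 1."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

# Made-up probabilities: (p(term | topic), p(term) across the whole corpus).
toy_terms = {
    "people": (0.05, 0.04),   # frequent in the topic, but frequent everywhere
    "phigs": (0.01, 0.001),   # rarer, but almost exclusive to this topic
}

def rank_terms(lam):
    """Return the toy terms ordered from most to least relevant."""
    return sorted(toy_terms, key=lambda term: relevance(*toy_terms[term], lam), reverse=True)

print(rank_terms(1.0))  # slider fully right
print(rank_terms(0.0))  # slider fully left
```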

What is interesting is that if we choose more topics than there are categories (say 10, knowing that really there are only 3 or 4 separate categories), we still see different varieties of topic within each topical area. This indicates the variation of discussions going on within the different discussion groups.

In [None]:
visualiser

In [None]:
pyLDAvis.save_html(visualiser,'count_my_topic_model_vis.html')

# Choosing the number of Topics

We chose 4 topics because we knew there were four separate newsgroups. The result implies that really there are only three topics, as one topic seems to be junk and there is not a clear distinction between the atheism and religion discussions.

However, what if those different discussion groups have a variety of topics **within** them? This is the true value of unsupervised learning for social science, to highlight themes across textual datasets that would either
- take significant academic labour to unearth (months of qualitative close reading) or 
- would be hidden due to the sheer scale of the datasets being interrogated.

Nothing is stopping us from just changing the number of topics when we initialise the LDA model, however how do we know how many topics to choose, or whether that is a valid decision?

### Gensim

Unfortunately there is no single library that covers all our needs for text analysis. Whilst Scikit learn has some methods for measuring how good topic models are, Gensim's `CoherenceModel` is far superior and it is worth using the library just for this feature. 

It should be noted that everything that we have done in the Text Analysis classes could have been done solely in Gensim, but it is a less accessible library for learners than scikit.

In [None]:
import gensim

In [None]:
# Gensim relies on custom built dictionaries and a special object that
# they refer to as a corpus but is quite different from how we understand a corpus

gs_dict = gensim.corpora.Dictionary(df['cleaned_tokens'])
gs_corpus = [gs_dict.doc2bow(text) for text in df['cleaned_tokens']]
# in order to get the coherence score we need to create a gensim version of an LDA model using their own functions

gs_lda = gensim.models.LdaMulticore(num_topics=4,
                                    corpus=gs_corpus,
                                    id2word=gs_dict)


# which we can then finally feed to their CoherenceModel to get a score

coherence_model = gensim.models.CoherenceModel(model=gs_lda,
                                               texts=df['cleaned_tokens'],
                                              dictionary=gs_dict)

In [None]:
coherence_model.get_coherence()
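
To give a feel for what a coherence score rewards, here is a deliberately simplified sketch. It is **not** the measure Gensim's `CoherenceModel` computes by default. It is a UMass-style score based on how often a topic's top words appear in the same documents. The function name and the tiny corpus are made up for illustration, and the sketch assumes every top word appears in at least one document.

```python
import math
from itertools import combinations

def cooccurrence_coherence(top_words, documents):
    """Average log ratio of joint to single document frequency over word pairs."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(word in doc for word in words) for doc in doc_sets)

    pair_scores = [
        math.log((doc_freq(first, second) + 1) / doc_freq(first))
        for first, second in combinations(top_words, 2)
    ]
    return sum(pair_scores) / len(pair_scores)

# Made-up tokenised documents.
toy_docs = [
    ["space", "orbit", "launch"],
    ["space", "launch", "nasa"],
    ["image", "pixel", "graphic"],
    ["image", "graphic", "render"],
]

coherent = cooccurrence_coherence(["space", "launch"], toy_docs)
mixed = cooccurrence_coherence(["space", "image"], toy_docs)
print(coherent, mixed)
```

Words that tend to turn up together score higher than words that never share a document, which is the intuition behind coherence as a guide to topic quality.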

## Choosing the number of Topics: Testing Models

In general the best way to find the right model, is to create lots of them, and get the coherence scores for each, then we can use the scores to help us make a decision about what models might be worth taking further.

In [None]:
# let's build a coherence_scorer function that can take care of the whole process for us.

def coherence_scorer(tokens, min_topics, max_topics, step=1):
    gs_dict = gensim.corpora.Dictionary(tokens)
    gs_corpus = [gs_dict.doc2bow(text) for text in tokens]
    
    topic_range = list(range(min_topics, max_topics, step))
    
    results = []
    
    for topic_n in topic_range:
        gs_lda = gensim.models.LdaMulticore(num_topics=topic_n,
                                        corpus=gs_corpus,
                                        id2word=gs_dict)
        coherence_model = gensim.models.CoherenceModel(model=gs_lda,
                                                   texts=tokens,
                                                  dictionary=gs_dict)
        score = coherence_model.get_coherence()
        results.append(score)
    
    results_df = pd.DataFrame({'score':results}, index=topic_range)
    results_df.index.name = 'topic_n'
    return results_df

In [None]:
results = coherence_scorer(df['cleaned_tokens'], min_topics=2, max_topics=30, step=2)

In [None]:
results.sort_values('score', ascending=False)

In [None]:
results.plot()

### Choosing the number of Topics: Using AVERAGE Coherence
Because LDA is a probabilistic algorithm, there is a degree of randomness to the results. 
This means the coherence of each model may be slightly different each time. Helpful!

To combat this let's tweak our coherence scorer so that it tries out each number of topics multiple times, and then gives us the mean score. This will unfortunately multiply the amount of time our testing takes to run, but it will give us more confidence in the choice of model.
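
As a small illustration of why repeated runs help, the sketch below summarises several scores per topic count with both a mean and a standard deviation. The scores are invented purely for the example. The spread gives a rough sense of how much a single run could mislead.

```python
from statistics import mean, stdev

# Made-up coherence scores from three runs at each topic count (illustration only).
runs_by_topic_n = {
    4: [0.41, 0.45, 0.43],
    8: [0.48, 0.52, 0.47],
}

# Summarise each topic count by its average score and its run-to-run spread.
summary = {
    topic_n: {"mean": mean(scores), "stdev": stdev(scores)}
    for topic_n, scores in runs_by_topic_n.items()
}

for topic_n, stats in summary.items():
    print(topic_n, round(stats["mean"], 3), round(stats["stdev"], 3))
```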



In [None]:
# we add a new keyword argument, runs, and set it to 1 so by default our function will only try each number of topics once. 

def avg_coherence_scorer(tokens, min_topics, max_topics, step=1, runs=1):
    gs_dict = gensim.corpora.Dictionary(tokens)
    gs_corpus = [gs_dict.doc2bow(text) for text in tokens]
    
    topic_range = list(range(min_topics, max_topics, step))
    
    results = []
    
    for topic_n in topic_range:
        
        # here we create an extra list to store the scores of multiple runs of the same topic number
        subset_results = [] 
        
        # Inside our first loop, we create another loop that runs as many times as we have set the variable 'runs'.
        # We're using range to control how many times we loop, but we name the value range gives us (the number) _ to indicate
        # that it doesn't matter to the code.
        
        for _ in range(runs): 
            gs_lda = gensim.models.LdaMulticore(num_topics=topic_n,
                                            corpus=gs_corpus,
                                            id2word=gs_dict)
            coherence_model = gensim.models.CoherenceModel(model=gs_lda,
                                                       texts=tokens,
                                                      dictionary=gs_dict)
            score = coherence_model.get_coherence()
            # we run our modelling and scoring as normal, but append the score to the subset list instead
            subset_results.append(score)
        # once the loop has ended we get the mean of all the values in the list
        # (pd.np was removed from recent versions of pandas, so we use a pandas Series instead)
        mean_score = pd.Series(subset_results).mean()
        # and append THAT score to our main results list
        results.append(mean_score)
    
    results_df = pd.DataFrame({'score':results}, index=topic_range)
    results_df.index.name = 'topic_n'
    return results_df

In [None]:
results = avg_coherence_scorer(df['cleaned_tokens'], min_topics=2, max_topics=30, step=2, runs=3)

In [None]:
results.sort_values('score', ascending=False)

In [None]:
results.plot()

### Choosing the number of Topics: Interpreting the scores
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/plot.png?raw=true" align="right" width="500">

It is not simply the case that the highest score is best. Some things to consider...

#### Influences on the score
- How much data you have.
- How well preprocessed your text was - steps such as finding common phrases, as we did with the phraser during setup, all affect the result.
- There are other parameters to tweak beyond the number of topics. Details on these are beyond the scope of this module.

#### How good is my score?
- In general you're looking for a score between 0.3 and 0.7. Anything below is too low, anything above is so high it is probably an error!
- Ideally the score should be around 0.5+ but with a corpus this small (yes this is small) it is unlikely to improve.

#### Which model should I choose?

- In general it is best to directly examine the results of different models.
- The coherence score can act as a guide telling you which models score highly.
- However it is not just the highest scoring model that is best. Often we look for the 'elbow' in the data, the model that causes a sudden jump in coherence.
- Models with more topics and higher scores aren't necessarily better. 
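
One rough way to locate an 'elbow' is to look for the largest improvement between neighbouring topic counts. The sketch below does this on invented scores. It is a heuristic to support your own judgement, not a rule, and it assumes the scores are keyed by number of topics.

```python
# Made-up mean coherence scores keyed by number of topics (illustration only).
toy_scores = {2: 0.31, 4: 0.33, 6: 0.42, 8: 0.44, 10: 0.45}

topic_counts = sorted(toy_scores)

# Improvement gained by moving from each topic count to the next one up.
jumps = {
    later: toy_scores[later] - toy_scores[earlier]
    for earlier, later in zip(topic_counts, topic_counts[1:])
}

# The candidate elbow is the topic count reached by the largest single jump.
elbow = max(jumps, key=jumps.get)
print(elbow)
```

In this toy data the highest score belongs to 10 topics, but most of the gain has already arrived by 6.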


In [None]:
chosen_model = LatentDirichletAllocation(n_components=8)

In [None]:
chosen_model.fit(count_matrix)

In [None]:
visualiser = ldavis_prepare(chosen_model, count_matrix, count_vec, sort_topics=False)
# ignore the deprecation warning - it is hopefully going to be fixed in an upcoming version

In [None]:
visualiser

# Extracting Model Information
Whilst the visualiser is useful, you may have other questions about your data which are only answerable by getting access to the results directly.

#### Top words per topic

In [None]:
# we can access the topic to term matrix using the .components_ attribute of the model, and
# mix it with the feature names of the vectorizer to make ourselves a topic to term score sheet.
# Note: get_feature_names_out() needs scikit-learn 1.0 or later; older versions used get_feature_names()
topic_term_df = pd.DataFrame(chosen_model.components_, columns=count_vec.get_feature_names_out())
topic_term_df

In [None]:
# if we group by the index of this dataframe, i.e group into individual topics we can use our trusty top terms function
# these should match the visualiser when relevance slider was set to 1.0

top_words = topic_term_df.groupby(topic_term_df.index).apply(top_terms, top_n=20)

pd.set_option('display.max_rows',200)
top_words


### Wordclouds
Whilst lists of words and scores are great for computers, humans have a tough time interpreting them. Here we will use a function to create wordclouds from our `top_words` list. Creating these wordclouds is a little tricky so the function has been premade for you.
- Required: `top_words` ---The list of top words as we created above (must be in this same format to work).
- Option: `save` ---set save=True to save the wordclouds to disk for inclusion in your reports.
- Option: `width` and `height` --- specify the size of each cloud.

In [None]:
def create_wordclouds(top_words, save=False, width=1000, height=500):
    from wordcloud import WordCloud
    import seaborn as sns
    n_topics = top_words.index.get_level_values(0).max() +1
    palette = sns.color_palette("husl", n_topics)
    
    groups = top_words.reset_index().groupby('level_0')
    
    for topic,data in groups:
        word_scores = data.set_index('level_1')[0].to_dict()
        wc = WordCloud(background_color='white',color_func=lambda *args, **kwargs: palette.as_hex()[topic],width=width, height=height).generate_from_frequencies(word_scores)
        a = plt.gca()
        a.axis('off')
        plt.imshow(wc, interpolation='bilinear')
        plt.title(f"Topic:{topic}",loc='left')
        if save:
            plt.savefig(f"topic_{topic}.png")
        plt.show()

In [None]:
create_wordclouds(top_words,save=True, width=1000)

#### Top Topic per document

In [None]:
# The document to topic matrix is created by transforming the count matrix created by our vectoriser using our trained LDA model.
document_topic_df = chosen_model.transform(count_matrix)
document_topic_df = pd.DataFrame(document_topic_df)
document_topic_df

The `.idxmax()` method finds the highest value in each row or each column, and then returns the index label where that highest value sits.
Because in Pandas rows have index names, and technically column names are indexes too, this means we can ask `.idxmax()` to return us the names of the `columns` with the highest value for each row.

In [None]:
document_topic_df.idxmax(axis='columns')

In [None]:
df['top_topic'] = document_topic_df.idxmax(axis='columns')

In [None]:
df.head()

Here we can create a crosstab table to count how many documents of each category end up assigned to each topic. This helps us to get a sense of how well the model is working. Is there a lot of overlap between topics or do they broadly separate nicely into distinctive groups?

In [None]:
count_comparison = pd.crosstab(index=df['category_label'],
                         columns=df['top_topic'],
                         values=df['text'],
                         aggfunc='count')
count_comparison

In [None]:
# using the normalize keyword argument we can get the numbers to reflect 
# proportions rather than raw counts which is useful if you have variation in the number of samples for each category
pct_comparison = pd.crosstab(index=df['category_label'],
                         columns=df['top_topic'],
                         values=df['text'], 
                         aggfunc='count',
                         normalize='index')
pct_comparison

Using Seaborn we can make a heatmap of these numbers to get a quicker sense of where the larger numbers fall.

To see the range of available colour palettes have a look at the [Python Graph Gallery](https://python-graph-gallery.com/python-colors/)

In [None]:


# we create a figure of a particular size first, (width,height) in inches
plt.figure(figsize=(8,6))

#seaborn is based on matplotlib so it will pick up the new figure automatically and use it to create the rest


sns.heatmap(count_comparison, cmap='PuBu', linewidths=1, annot=True, fmt='d') 
# cmap is colour map see more colour options at the link above.
# linewidths is the size of the lines between boxes
# annot switches on annotations of individual boxes
# fmt=d indicates that the annotation should be formatted as a simple number

In [None]:
plt.figure(figsize=(8,6))


sns.heatmap(pct_comparison, cmap='Greens', linewidths=1, annot=True, fmt='.1%') # cmap is the colour map, see the link above for more options
# fmt='.1%' indicates the numbers should be formatted as percentages and rounded to one decimal place. 
# Notice that choosing this formatting has automatically changed our 0.159... numbers into proper percentages (15.9%)

### Topic Affiliation Strengths
However, because topic models are soft clustering algorithms, they recognise that individual documents may express a variety of topics to different degrees.

We can visualise this by grouping by the category label for each document, and then taking the average topic score for each topic. This is easy with a `.groupby`.

In [None]:
df.head()

In [None]:
document_topic_df['category_label'] = df['category_label']

In [None]:
document_topic_df.head()

In [None]:
document_topic_df

In [None]:
topic_proportions = document_topic_df.groupby('category_label').mean()
topic_proportions

In [None]:
plt.figure(figsize=(8,6))


sns.heatmap(topic_proportions, cmap='Purples', linewidths=1, annot=True, fmt='.1%') # cmap is the colour map, see the link above for more options
# fmt='.1%' indicates the numbers should be formatted as percentages and rounded to one decimal place. 
# Notice that choosing this formatting has automatically changed our 0.159... numbers into proper percentages (15.9%)

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(pct_comparison, cmap='Greens', linewidths=1, annot=True, fmt='.1%')

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/hm.png?raw=true" align="right" width="400">
Compare this with our previous heatmap that showed the percentage of documents that most strongly affiliated with each topic per category. The two heatmaps show quite different things.

- The green heatmap shows the proportion of documents, per category, that most strongly affiliate with each topic. We gave each document a top topic, regardless of how strong the affiliation with that topic was, just so long as it was the strongest.

- The purple heatmap shows the mean affiliation strength, this is a more nuanced score because it is showing us the average strength of affiliation to each topic, across all documents. This affiliation strength is based on the kinds of words used by the document. 
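
The two measures can disagree, which the following sketch shows on three invented documents from a single category. The numbers are hypothetical and chosen to exaggerate the effect.

```python
# Hypothetical topic scores for three documents from one category, over two topics.
toy_docs = [
    [0.51, 0.49],
    [0.51, 0.49],
    [0.10, 0.90],
]
n_topics = 2

# "Green" view: share of documents whose strongest topic is each topic.
top_topics = [row.index(max(row)) for row in toy_docs]
top_topic_share = [top_topics.count(topic) / len(toy_docs) for topic in range(n_topics)]

# "Purple" view: mean score for each topic across all documents.
mean_affiliation = [
    sum(row[topic] for row in toy_docs) / len(toy_docs)
    for topic in range(n_topics)
]

print(top_topic_share)
print(mean_affiliation)
```

Topic 0 wins the most documents, but only narrowly each time, so topic 1 has the higher average affiliation.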

The way we would interpret and talk about this kind of heatmap (using the orange heatmap to the right as an example):
 - The content of documents from alt.atheism most strongly expressed topic 6, however these documents also expressed themes identified by topics 3 and 0.
 - The content of documents from sci.space was also strongly affiliated with topic 0. This indicates some intersection between the discussions on alt.atheism and sci.space.

We would then need to look at the top terms for topic 0 to discern what the topic was about, and perhaps look at documents that were most strongly affiliated with one topic, but also had a high score in another topic (How? Suggestion: filter the df by top topic first, then sort by topic score on whichever other topic you were interested in).
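
The suggestion above (filter by top topic, then sort by another topic's score) might look something like the sketch below. It uses a small made-up dataframe in place of `document_topic_df`, and `secondary_affiliation` is a hypothetical helper name, not something defined earlier in the notebook.

```python
import pandas as pd

# Toy stand-in for document_topic_df: one row per document, one column per topic.
scores = pd.DataFrame(
    [[0.70, 0.20, 0.10],
     [0.50, 0.45, 0.05],
     [0.10, 0.80, 0.10],
     [0.60, 0.05, 0.35]],
    columns=[0, 1, 2],
)
scores["top_topic"] = scores[[0, 1, 2]].idxmax(axis="columns")

def secondary_affiliation(frame, top_topic, other_topic, n=5):
    """Documents whose top topic is `top_topic`, ranked by their score on `other_topic`."""
    subset = frame[frame["top_topic"] == top_topic]
    return subset.sort_values(other_topic, ascending=False).head(n)

# Documents mainly about topic 0 that also lean towards topic 1.
mixed = secondary_affiliation(scores, top_topic=0, other_topic=1)
print(mixed)
```

The index of the result tells you which documents to go back and read in the original `df`.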

# Summary
The above demonstrates the effectiveness of topic modelling for finding themes within large amounts of text. At no point did we tell the topic modelling algorithm which documents came from which newsgroup, we just provided it the preprocessed text and asked it to work it out.

Interestingly, our testing processes indicated that there were more topics than the number of newsgroups we had, indicating it had discovered distinct themes of discussion within newsgroups. If used on more homogeneous text, provided there are enough samples of the text, topic modelling can be used to tease out different themes.

LDA is a common approach in computational social science, but because it relies on raw word frequencies it is not as nuanced as approaches that consider word significance (as TF-IDF does), or techniques that consider the semantic similarity of texts, i.e. two documents may use different words but be similar in that they are talking about similar things. As a result LDA works best on very large datasets, and the more homogeneous the texts, the more samples it needs to distinguish differences between them.