In [None]:
!pip install gensim pyLDAvis

# SC207 Text Mining
## LDA Topic Modelling
### Discovering the latent topics that exist across a corpus

More advanced forms of text analysis require that text documents are converted into numerical values or features. In this section we will examine:

* different methods for representing a collection of texts as numbers
* the decisions we need to make when generating a particular representation as well as the kinds of insights each numerical representation can give us.

## Tools
- [Gensim](https://radimrehurek.com/gensim/): A library designed for all manner of text processing. Whilst some of its features exist in scikit-learn, Gensim provides a more comprehensive range of text analysis models.



In this notebook we'll be using a particular type of unsupervised learning called *Topic Modelling*. Topic modelling looks at the words and phrases used in texts and works out, based on how often words appear in different texts, what themes there might be across a collection of documents. Crucially, LDA topic modelling recognises that a single document may express a range of different topics. This is useful for research questions that ask how different groups deploy, or even connect, different discourses.

Some limitations to keep in mind...
- Topic modelling doesn't consider the ordering of words, just how often each word appears in a document.
- Topic modelling doesn't understand the meaning of words, just how often each word appears in a document.
- Topic modelling doesn't automatically know how many topics are in a collection of texts. You have to tell it, and it may be worth varying this number depending on your research question.
- Topic modelling can produce junk topics.
- There is no objective way to determine if your topic modelling is 'good'. Whilst we have some assessment measures, it relies on a lot of qualitative assessment and knowledge of the documents themselves.
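
The first limitation can be seen in a minimal sketch using only Python's standard library. The two example sentences are made up for illustration: once word order is discarded, two sentences with opposite meanings reduce to exactly the same representation.

```python
from collections import Counter

# Two made-up sentences that use the same words in a different order
doc_a = "the dog bit the man".split()
doc_b = "the man bit the dog".split()

# A bag-of-words representation keeps only how often each word occurs
bow_a = Counter(doc_a)
bow_b = Counter(doc_b)

# The sentences mean different things, but their representations are identical
same_representation = bow_a == bow_b
print(same_representation)
```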

In [None]:
import gensim
import pandas as pd
import pyLDAvis

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:

df = pd.read_csv('sample_news_large_with_tokens.csv')
df.info()

Gensim works a little differently to scikit-learn, but it is relatively easy to get up and running.
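
As a rough illustration of what a Gensim dictionary does, the sketch below builds a word-to-id mapping in plain Python and drops rare and very common words by document frequency. The toy documents and the names `doc_freq`, `min_docs`, `max_share` and `word_to_id` are hypothetical and are not part of Gensim; the real `Dictionary` class does considerably more.

```python
from collections import Counter

# Made-up tokenised documents: a list of documents, each a list of tokens
toy_docs = [
    ["election", "vote", "the"],
    ["vote", "party", "the"],
    ["match", "goal", "the"],
    ["goal", "vote", "the"],
]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

# Keep words that appear in at least min_docs documents
# and in no more than max_share of all documents
min_docs = 2
max_share = 0.75
kept = sorted(
    word for word, n in doc_freq.items()
    if n >= min_docs and n / len(toy_docs) <= max_share
)

# Give each surviving word an integer id
word_to_id = {word: i for i, word in enumerate(kept)}
print(word_to_id)
```

Here "the" is dropped for appearing in every document, and words that appear only once are dropped for being too rare.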

In [None]:
# Gensim relies on custom-built dictionaries and a special object that
# it calls a corpus, which is quite different from how we usually understand a corpus.

# The dictionary object expects to receive a list of documents, where each document is itself a list of tokens.
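# ASSUMPTION: the tokens are stored as space-separated strings in a column named 'tokens';
# change the column name and the split to match your file.
tokenised_docs = df['tokens'].str.split().tolist()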
tokenised_docs[0:2]

In [None]:
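# The dictionary creates a mapping between each word and a reference number (an integer id)
gs_dict = gensim.corpora.Dictionary(tokenised_docs)

# The dictionary's filter_extremes method works like the min_df and max_df arguments from our sklearn vectorisers.
# The values below are Gensim's defaults; adjust them to suit your corpus.
gs_dict.filter_extremes(no_below=5, no_above=0.5)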


In [None]:
# We use the dictionary to create a gensim corpus, which is essentially a
# list where each entry contains a list of (word reference number, frequency) pairs for that document.

gs_corpus = [gs_dict.doc2bow(doc) for doc in tokenised_docs]
gs_corpus[:1] # see the first document
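
To see what one entry of such a corpus looks like, here is a plain-Python sketch of the conversion for a single document. The mapping `word_to_id` and the helper `to_bow` are hypothetical stand-ins; in Gensim this step is done by the dictionary's `doc2bow` method.

```python
from collections import Counter

# Hypothetical word-to-id mapping, standing in for a Gensim dictionary
word_to_id = {"goal": 0, "match": 1, "vote": 2}

def to_bow(tokens, mapping):
    """Return sorted (word_id, count) pairs, ignoring words not in the mapping."""
    counts = Counter(mapping[t] for t in tokens if t in mapping)
    return sorted(counts.items())

# "referee" is not in the mapping, so it is silently dropped
toy_doc = ["goal", "goal", "match", "referee"]
bow = to_bow(toy_doc, word_to_id)
print(bow)
```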

In [None]:
# We create our topic model object by passing it this corpus, the dictionary and setting the number of topics.

n_topics = 3

gs_lda = gensim.models.LdaMulticore(num_topics=n_topics,
                                    corpus=gs_corpus,
                                    id2word=gs_dict)

In [None]:
# Once it has run we can examine the model...
gs_lda.print_topics()

Two key things to understand...
1. Every document has a score indicating how much it expresses *each* topic. A document could be highly associated with more than one topic.
2. Each *word* in the corpus has a score indicating how strongly it is associated with *each* topic.

We can see these scores like so...

In [None]:
idx = 12
df.loc[idx,'title']

In [None]:
# Document to topic matrix...
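doc_topic_matrix = gs_lda.get_document_topics(gs_corpus)
doc_topic_matrix[idx]  # topic scores for the document whose title we looked up above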


In [None]:
# term to topic matrix - note we have one row per topic, and then one column per word in the dictionary
term_topic = gs_lda.get_topics()
term_topic
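
Both kinds of scores can be read in a similar way. The sketch below uses made-up numbers (not output from the model above) and hypothetical names to show how a list of (topic, score) pairs for one document, and one row of a term-topic matrix, might be inspected.

```python
# Made-up (topic id, score) pairs for one document, in the shape Gensim returns;
# topics with very low scores may be left out of the list
doc_topics = [(0, 0.55), (2, 0.40)]
n_topics = 3

# Expand to one score per topic, using 0.0 for topics that were left out
dense_row = [0.0] * n_topics
for topic_id, score in doc_topics:
    dense_row[topic_id] = score

# Made-up term scores for a single topic (one row of a term-topic matrix)
vocab = ["goal", "match", "party", "vote"]
topic_row = [0.05, 0.10, 0.35, 0.50]

# Rank the words by their score for this topic, highest first
top_words = [w for _, w in sorted(zip(topic_row, vocab), reverse=True)][:2]
print(dense_row, top_words)
```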

## Visualising your topics

#### Interpreting LDAvis
Run the cell below to save the visual, then go and open it outside of Jupyter. It should open in your web browser.

On the left of the screen are the separate topics.
- They are positioned closer, or further from each other, depending on how much the topics overlap. For example we can see that some topics overlap a lot, whilst others are similar, but do not necessarily overlap, indicating there is some distinction between them.
- The size of the bubbles indicates how significant those topics are within the overall corpus.
- The numbers refer to the topic number but the numbers begin at 1, rather than 0 (helpfully).

On the right is the term information for the topics.
- If no topic is selected it gives the overall top terms for the corpus
- If a topic is selected it shows you the top terms for that topic, including an estimate of how frequent that term is in that topic (red) compared to its overall frequency (blue).
- Adjusting the slider at the top right changes how terms are ranked, so you can show terms that are more relevant to the topic itself.
- Slide all the way to the left to see terms that are highly specific to the topic, possibly to the point that they are too niche to be meaningful.
- Slide all the way to the right for terms that are broader but may be too generic to really distinguish the topics.
- A good rule of thumb is to set the slider around 0.6 for a balanced output.
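
The slider corresponds to a weight, usually written λ, in the relevance measure proposed by Sievert and Shirley for LDAvis: relevance = λ · log p(word | topic) + (1 − λ) · log(p(word | topic) / p(word)). A minimal sketch with made-up probabilities shows how the ranking shifts as λ changes; the function and variable names are hypothetical.

```python
import math

# Made-up probabilities: p(word | topic) and overall p(word) in the corpus
p_word_given_topic = {"said": 0.050, "election": 0.030, "ballot": 0.004}
p_word = {"said": 0.050, "election": 0.010, "ballot": 0.0005}

def relevance(word, lam):
    """Weighted mix of topic probability and lift, both on a log scale."""
    p_wt = p_word_given_topic[word]
    lift = p_wt / p_word[word]
    return lam * math.log(p_wt) + (1 - lam) * math.log(lift)

def ranked(lam):
    """Words ordered from most to least relevant for a given weight."""
    return sorted(p_word_given_topic, key=lambda w: relevance(w, lam), reverse=True)

print(ranked(1.0))  # ordered purely by probability within the topic
print(ranked(0.0))  # ordered purely by how specific the word is to the topic
print(ranked(0.6))  # a balance between the two
```

With these toy values, a generic word such as "said" leads at λ = 1, a niche word such as "ballot" leads at λ = 0, and the intermediate setting favours "election".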

In [None]:
from pyLDAvis.gensim_models import prepare
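vis = prepare(topic_model=gs_lda,
              corpus=gs_corpus,
              dictionary=gs_dict)

pyLDAvis.save_html(vis, 'ldavis.html')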



## How many topics?
Once again the question is: what is the right value for the number of topics? As before, we will run the model multiple times and score each run to see which values may be most appropriate.

To score the models we can use Gensim's `u_mass` coherence score. This approach looks at how often words co-occur in documents. Topics whose top words frequently co-occur are considered to be more coherent.
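
The intuition behind this kind of score can be sketched in plain Python. The version below is a simplified illustration of the UMass idea, summing log((co-document count + 1) / document count) over pairs of a topic's top words, on made-up documents. It is not Gensim's exact implementation, and the function name is hypothetical.

```python
import math
from itertools import combinations

# Made-up documents, each reduced to its set of distinct words
toy_docs = [
    {"vote", "election", "party"},
    {"vote", "election"},
    {"goal", "match"},
    {"goal", "match", "vote"},
]

def umass_like(top_words):
    """Sum log((co-document count + 1) / document count) over word pairs."""
    score = 0.0
    for earlier, later in combinations(top_words, 2):
        n_earlier = sum(1 for d in toy_docs if earlier in d)
        n_both = sum(1 for d in toy_docs if earlier in d and later in d)
        score += math.log((n_both + 1) / n_earlier)
    return score

coherent = umass_like(["vote", "election"])  # words that often share documents
mixed = umass_like(["vote", "match"])        # words that rarely co-occur
print(coherent, mixed)
```

Scores of this kind are usually negative, and values closer to zero suggest more coherent topics.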

In [None]:
# The top_topics method lists the best topics in the model based on each topic's coherence.
# The method outputs a list of topic tuples: the first item in the tuple is a list of topic words and their topic scores, the second is the coherence of that topic.
top_topics = gs_lda.top_topics(corpus=gs_corpus)
top_topics

In [None]:
coherence_scores = [item[1] for item in top_topics]
coherence_scores


In [None]:
average_coherence_score = sum(coherence_scores) / n_topics
average_coherence_score

In [None]:
k_range = range(1,10)
scores = []
for k in k_range:
    model = gensim.models.LdaMulticore(num_topics=k,
                                        corpus=gs_corpus,
                                        id2word=gs_dict,)
    coherence_scores = [item[1] for item in model.top_topics(corpus=gs_corpus)]
    avg = sum(coherence_scores) / k

    scores.append(avg)

In [None]:
scores
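
One simple way to read such a list is to pair each score with its value of k and look for the highest (least negative) average. The numbers below are invented purely for illustration, and the highest score is only a starting point for the qualitative checks described earlier.

```python
# Invented average coherence scores for k = 1..5 (not real model output)
k_values = list(range(1, 6))
toy_scores = [-2.9, -2.1, -1.6, -1.8, -2.4]

# Pair each k with its score and pick the k whose score is highest
best_k, best_score = max(zip(k_values, toy_scores), key=lambda pair: pair[1])
print(best_k, best_score)
```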

In [None]:
import seaborn as sns
sns.set(rc={'figure.figsize':(8.2,5.8)})
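sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

# Plot the average coherence score for each number of topics
sns.lineplot(x=list(k_range), y=scores)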


## Examine our top models

In [None]:
n_topics = 8

chosen_model = gensim.models.LdaMulticore(num_topics=n_topics,
                                          corpus=gs_corpus,
                                          id2word=gs_dict)

In [None]:
# print topics
chosen_model.print_topics()

In [None]:
vis = prepare(topic_model=chosen_model,
              corpus=gs_corpus,
              dictionary=gs_dict)

pyLDAvis.save_html(vis, 'ldavis.html')

In [None]:
# Topic assignments

doc_topic_matrix = chosen_model.get_document_topics(gs_corpus)
assignments = []
for i in range(len(df)):
    doc_assignments = doc_topic_matrix[i]
    high_score_topic, high_score = max(doc_assignments, key=lambda x: x[1])
    assignments.append(high_score_topic)

In [None]:
assignments

In [None]:
dat = {'topic':assignments,
       'query':df['query'].tolist()}

hm_data = pd.DataFrame(dat)
hm_data['count'] = 1
counts = hm_data.groupby(['topic','query'], as_index=True).count().unstack()
sns.heatmap(counts, annot=True, linewidths=1)

## Summary
Whilst LDA topic modelling is well established, it often struggles to produce clearly coherent topics, particularly when there is significant overlap between those topics. In part this is due to its reliance on word frequency rather than more nuanced approaches. In the next session we'll look at the cutting edge of topic modelling, which utilises pre-trained models and adjusted TF-IDF scoring for more coherent results.