# Introduction to Topic Modelling

---
---
## Recap of Python Basics
Welcome back! Before we get started, let's recap the Python that we learnt last time.

## What exactly is topic modelling?
Topic modelling is an **unsupervised** **classification** technique. 

**Natural language processing**
... 

**Model**
... 

**Supervised and unsupervised techniques**
... 

**Classification**
... 


### Alternatives to topic modelling in Python
If you are just looking to explore the topics of a few documents in a casual way, you can use the online digital texts environment [Voyant](), which allows you to upload or copy-and-paste texts and explore a corpus with a number of graphical tools, including topics.

For serious research, a well-known topic modelling tool is [MALLET](http://mallet.cs.umass.edu/topics.php), a program written in Java that you download to your computer. You have to type commands to use MALLET, but it handles most of the underlying work for you. [Getting Started with Topic Modeling and MALLET](https://programminghistorian.org/en/lessons/topic-modeling-and-mallet) from Programming Historian gives a step-by-step tutorial on MALLET.

There is a graphical interface for MALLET called [Topic Modeling Tool](https://github.com/senderle/topic-modeling-tool) that is a bit easier to use. The [Quickstart Guide](https://senderle.github.io/topic-modeling-tool/documentation/2017/01/06/quickstart.html) will get you up and running.

If you are looking to use R rather than Python, then `tidytext` is a popular NLP library that will help you work with the `topicmodels` package. The book _Text Mining with R_ devotes [chapter 6](https://www.tidytextmining.com/topicmodeling.html) to this.

With the alternatives out of the way, let's see how we can do topic modelling in Python!

## More Python
As necessary as we go along. Maybe some clarification and extension of last time's.

## Worked example
* Corpus of more than one text
* Remember cleaning from last time (may not actually do this in notebooks, but rather just remember)

### Bag of words

A bag-of-words (BoW) corpus is a _vocabulary_ of the known words in the corpus together with some _measure_ of how often they occur. The measurement may be:
* binary (presence or absence)
* count (how many times the word occurs)
* frequency (count divided by the total number of words).  

Applied to a single document, this combination of vocabulary and measurement gives a **document vector**.

#### Example

Here is a simplified example to demonstrate the principles of creating a vector from a document.

Document (20 words):

>'No room to poise the lance or bend the bow;
> But hand to hand, and man to man, they grow:'
 
 (from _The Iliad of Homer_, translated by Alexander Pope (1899)) 
  
Vocabulary of unique words (15 words):

* no
* room
* to
* poise
* the
* lance
* or
* bend
* bow
* but
* hand
* and
* man
* they
* grow

Count measurements (how many times each word appears in the document):

* no = 1
* room = 1
* to = 3
* poise = 1
* the = 2
* lance = 1
* or = 1
* bend = 1
* bow = 1
* but = 1
* hand = 2
* and = 1
* man = 2
* they = 1
* grow = 1

If we treat this vocabulary as a list with a fixed order, we can just extract the counts into a list. This is the document vector.

`[1, 1, 3, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1]`

In order to compare other documents with this one for similarity, we could generate a document vector with the same vocabulary list for each document, or expand the vocabulary list to cover all the words in all the documents we are interested in.
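The worked example above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the variable names and the simple letters-only tokenisation are assumptions made for illustration, not part of any particular NLP library.

```python
import re
from collections import Counter

document = (
    "No room to poise the lance or bend the bow; "
    "But hand to hand, and man to man, they grow:"
)

# Lowercase the text and keep only runs of letters as tokens.
tokens = re.findall(r"[a-z]+", document.lower())

# Vocabulary: the unique words, in order of first appearance.
vocabulary = list(dict.fromkeys(tokens))

# Count measurement: how many times each vocabulary word occurs.
counts = Counter(tokens)
count_vector = [counts[word] for word in vocabulary]

# The other two measurements: presence/absence and relative frequency.
binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]
frequency_vector = [counts[word] / len(tokens) for word in vocabulary]

print(len(tokens), "words,", len(vocabulary), "unique")
print(count_vector)
```

Because the vocabulary keeps a fixed order, the same `vocabulary` list could be reused to build comparable vectors for other documents.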

#### The 'bag' in bag of words

In this most basic BoW model, all information about the order and location of the words is discarded. For example, it does not matter if the words 'red' and 'nose' are adjacent ('red nose'), or at the beginning or end of a sentence; BoW just treats the words individually. It is like a 'bag' of Scrabble™ tiles, where each tile is a word, all rattling around together in no particular order.

It is possible to create a BoW corpus that uses two or more adjacent words. For example, if you measure all pairs of words in our example document (above) you might end up with a vocabulary that looks like this:

* no room
* room to
* to poise
* poise the
* the lance
* lance or
* or bend
* bend the
* the bow
* bow but
* but hand
* hand to
* to hand
* hand and
* and man
* man to
* to man
* man they
* they grow

#### n-grams

A pair of adjacent words like this is known as a **bigram**. The earlier case, where we took just one word at a time, is called a **unigram**. Three adjacent words make a **trigram**, and so on. These are all special cases of the **n-gram**, where _n_ is the number of adjacent words.
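As a rough illustration, n-grams can be generated by sliding a window of _n_ words along the token list. The helper name `ngrams` below is hypothetical (NLTK and other libraries provide their own versions); this sketch only shows the idea.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ("no room to poise the lance or bend the bow "
          "but hand to hand and man to man they grow").split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:3])
print(trigrams[:2])
```

With _n_ = 1 the function simply returns the original tokens, which is the unigram case.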

#### Vocabulary choice

As you may have suspected by now, the size and nature of the vocabulary you choose is vitally important. A large vocabulary will take more computational power and memory to analyse. A vocabulary with many rare words (so the count for these words is 0 in most documents) creates what is called a _sparse_ vector, which is mostly zeros and carries little useful information for its size. Likewise, very common but largely meaningless words are often wasteful to include. For example, we would probably want to exclude a list of **stopwords**.
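To make the effect of vocabulary choice concrete, here is a small sketch that builds a shared vocabulary for two toy documents, with and without stopwords. The documents, the tiny `STOPWORDS` set, and the helper names are all invented for illustration; real stopword lists (such as the one in NLTK) are much longer.

```python
# A tiny, hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "on", "and", "to"}

documents = [
    "the cat sat on the mat".split(),
    "the astronaut landed on the moon".split(),
]


def build_vocabulary(docs, stopwords=frozenset()):
    """Sorted list of every distinct word in docs that is not a stopword."""
    return sorted({word for doc in docs for word in doc if word not in stopwords})


def count_vector(doc, vocabulary):
    """Count of each vocabulary word in doc (0 if the word is absent)."""
    return [doc.count(word) for word in vocabulary]


full_vocab = build_vocabulary(documents)
small_vocab = build_vocabulary(documents, STOPWORDS)

print(len(full_vocab), "words with stopwords,", len(small_vocab), "without")
for doc in documents:
    print(count_vector(doc, small_vocab))
```

Even in this toy case half of each vector is zeros, because each document uses only some of the shared vocabulary. With thousands of documents and words, the vectors become very sparse indeed.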

#### Term Frequency–Inverse Document Frequency (TF-IDF)
If you measure word frequency, highly frequent words come to dominate your results and yet they may not be as meaningful or interesting as rarer words. For example, if you are looking at articles about the history of the Moon landings, even if you have removed all the stopwords, you may well find that the words 'lunar', 'moon', 'landing', 'orbit', and 'earth' predominate. Subtle differences in topic between documents may be lost.

One way to deal with this is to use a _weighting factor_ called **TF-IDF**. A value is calculated for each word that reflects:
* Term frequency (TF) - the number of times the word appears in the document
* Document frequency (DF) - the number of documents in the corpus that contain the word

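The weight rises with the term frequency and falls as the document frequency grows (this is the 'inverse' in the name). For example, if a very uncommon word is present in only two documents, this word is weighted more highly than a word that is present in all documents in a corpus.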
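The weighting can be sketched in plain Python. This version multiplies the raw count by the logarithm of (number of documents / document frequency), which is one common textbook form; libraries such as Gensim and scikit-learn apply their own variants (smoothing, normalisation), so treat the numbers as illustrative. The toy documents and the function name `tf_idf` are assumptions.

```python
import math

# Three toy 'documents' about the Moon landings, already tokenised.
documents = [
    ["lunar", "orbit", "landing", "moon"],
    ["moon", "landing", "crew", "training"],
    ["moon", "rocket", "engine", "moon"],
]


def tf_idf(term, doc, docs):
    """Term frequency multiplied by log(number of documents / document frequency)."""
    tf = doc.count(term)                       # times the term appears in this document
    df = sum(1 for d in docs if term in d)     # documents that contain the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


for term in ["moon", "landing", "rocket"]:
    weights = [round(tf_idf(term, doc, documents), 3) for doc in documents]
    print(term, weights)
```

Here 'moon' appears in every document, so its weight drops to zero everywhere, while 'rocket', which appears in only one document, receives the highest weight.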

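The following cells build a small LDA topic model with spaCy, NLTK, and Gensim. First, load spaCy:

```python
import spacy
# spaCy 3 removed the 'en' shortcut, so load a named, installed model instead.
spacy.load('en_core_web_sm')
```

```
<spacy.lang.en.English at 0x7fba8c6f7748>
```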

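```python
from spacy.lang.en import English

# A blank English pipeline: tokenisation only, no trained model needed.
parser = English()

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            # Skip whitespace-only tokens.
            continue
        elif token.like_url:
            # Replace URLs with a placeholder.
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            # Replace @-handles with a placeholder.
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens
```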

```python
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma(word):
    # Use WordNet's morphy lookup; fall back to the word itself.
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

def get_lemma2(word):
    # Alternative lemmatiser; not used by the cells shown here.
    return WordNetLemmatizer().lemmatize(word)
```

```
[nltk_data] Downloading package wordnet to /home/mary/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
```


```python
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
```

```
[nltk_data] Downloading package stopwords to /home/mary/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
```


```python
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    # Drop short words (four characters or fewer).
    tokens = [token for token in tokens if len(token) > 4]
    # Drop stopwords.
    tokens = [token for token in tokens if token not in en_stop]
    # Reduce each remaining word to its lemma.
    tokens = [get_lemma(token) for token in tokens]
    return tokens
```

```python
import os
import random

dataset = os.path.join('data', 'dataset.csv')

text_data = []
with open(dataset) as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        # Keep a random sample of roughly 1% of the lines.
        if random.random() > .99:
            print(tokens)
            text_data.append(tokens)
```

```
['domain', 'gradient', 'detection', 'architecture', 'analog', 'motion', 'sensor']
['phase', 'noise', 'oscillator', 'implantable', 'biomedical', 'application']
['protocol', 'level', 'performance', 'analysis', 'collision', 'protocol', 'system']
['optimization', 'method', 'joint', 'allocation', 'modulation', 'scheme', 'coding', 'rates', 'resource', 'block', 'power', 'organize', 'network']
['similarity', 'estimation', 'using', 'locality', 'sensitive', 'hash']
['multi', 'hysteresis', 'application', 'multi', 'scroll', 'chaotic', 'oscillator']
['normal', 'measurement', 'visual', 'motion', 'sensor']
['social', 'network', 'extraction', 'conference', 'participant']
['session', 'base', 'overload', 'control', 'aware', 'server']
['stochastic', 'learning', 'algorithm', 'application', 'contextual', 'advertising']
['munica', 'advance', 'social', 'network', 'device', 'greeting', 'cards']
['breath', 'energy']
['modelling', 'analysis', 'multicell', 'converter', 'using', 'discrete', 'model']
['wireless', 
```

```python
from gensim import corpora
dictionary = corpora.Dictionary(text_data)
dictionary
```

```
<gensim.corpora.dictionary.Dictionary at 0x7fba51b0cc18>
```

```python
corpus = [dictionary.doc2bow(text) for text in text_data]
corpus[:100]
```

```
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(13, 1), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1)],
 [(19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1)],
 [(32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1)],
 [(7, 1), (11, 1), (38, 1), (39, 1), (40, 2), (41, 1)],
 [(5, 1), (6, 1), (42, 1), (43, 1), (44, 1)],
 [(25, 1), (45, 1), (46, 1), (47, 1), (48, 1)],
 [(49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1)],
 [(7, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1)],
 [(25, 1), (48, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1)],
 [(65, 1), (66, 1)],
 [(13, 1), (37, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1)],
 [(21, 1), (25, 1), (72, 1), (73, 1), (74, 1)],
 [(75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1)],
 [(69, 1), (76, 1), (81, 1), (82, 1), (83, 1)],
 [(84, 1), (85, 1), (86, 1)],
 [(20, 1), (21, 1), (83, 1), (
```
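Each pair in the output above is `(word id, count)`: the bag-of-words idea from earlier, but storing only the words that actually occur in each document. The following pure-Python sketch imitates that behaviour for the first three sampled documents. It is not Gensim's implementation; the function name `build_bow_corpus` is hypothetical, and the id-numbering rule (unseen words numbered alphabetically within each document) is an assumption inferred from the output shown.

```python
from collections import Counter


def build_bow_corpus(docs):
    """Return (token2id, corpus); corpus holds sorted (id, count) pairs per document."""
    token2id = {}
    corpus = []
    for doc in docs:
        counts = Counter(doc)
        # Give unseen words the next free ids, in alphabetical order.
        for word in sorted(counts):
            if word not in token2id:
                token2id[word] = len(token2id)
        corpus.append(sorted((token2id[word], n) for word, n in counts.items()))
    return token2id, corpus


# The first three documents printed by the sampling cell.
docs = [
    ['domain', 'gradient', 'detection', 'architecture', 'analog', 'motion', 'sensor'],
    ['phase', 'noise', 'oscillator', 'implantable', 'biomedical', 'application'],
    ['protocol', 'level', 'performance', 'analysis', 'collision', 'protocol', 'system'],
]

token2id, bow_corpus = build_bow_corpus(docs)
for bow in bow_corpus:
    print(bow)
```

In the third document 'protocol' occurs twice, which is why one pair in that row has a count of 2.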

```python
import pickle

# Save the corpus; the with-block closes the file once it is written.
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
```

```python
dictionary.save('dictionary.gensim')
```
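Saving is only useful if the data can be read back later. As a minimal sketch, the round trip for a pickled corpus looks like this; the small stand-in corpus and the temporary directory are assumptions so that the example runs on its own.

```python
import os
import pickle
import tempfile

# A small stand-in for the real corpus of (word id, count) pairs.
corpus = [[(0, 1), (1, 1)], [(1, 2), (2, 1)]]

# Write to a temporary directory so the example leaves the working folder untouched.
path = os.path.join(tempfile.mkdtemp(), 'corpus.pkl')

with open(path, 'wb') as f:
    pickle.dump(corpus, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == corpus)
```

The saved Gensim dictionary should be reloadable in a similar way with `corpora.Dictionary.load('dictionary.gensim')`.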

```python
import gensim

NUM_TOPICS = 5
# Train an LDA model with 5 topics, making 15 passes over the corpus.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
# Show the four highest-weighted words for each topic.
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
```

```
(0, '0.026*"application" + 0.026*"oscillator" + 0.026*"motion" + 0.026*"sensor"')
(1, '0.042*"model" + 0.042*"algorithm" + 0.042*"compression" + 0.023*"image"')
(2, '0.050*"network" + 0.035*"coding" + 0.035*"social" + 0.034*"block"')
(3, '0.040*"efficient" + 0.022*"domain" + 0.022*"detection" + 0.022*"architecture"')
(4, '0.045*"analysis" + 0.045*"protocol" + 0.025*"system" + 0.025*"program"')
```


```python
# Classify a new, unseen document with the trained model.
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))
```

```
[(26, 1), (56, 1), (58, 1)]
[(0, 0.050024964), (1, 0.55387896), (2, 0.29605168), (3, 0.05002095), (4, 0.05002346)]
```
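The second line of output is a list of `(topic id, probability)` pairs for the new document. One simple way to read it is to rank the topics by probability, as sketched below; the values are copied from the output above and the variable names are illustrative.

```python
# Topic distribution copied from the output above.
doc_topics = [(0, 0.050024964), (1, 0.55387896), (2, 0.29605168),
              (3, 0.05002095), (4, 0.05002346)]

# Sort the topics from most to least probable.
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
best_topic, best_prob = ranked[0]

print(f"Most likely topic: {best_topic} (probability {best_prob:.2f})")
```

For this document the model favours topic 1, with topic 2 some way behind, and the probabilities sum to roughly 1.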


---
---
## Choosing the Right Text-Mining Techniques

Table/Q&A

---
---
## Summary


The next notebook ...

---
---
## What's Next?
If you have decided that text-mining in Python is for you, then here are some more resources to study in your own time:

* Go further in natural language processing, python...
* Follow a more in-depth set of Jupyter notebooks: [The Art of Literary Text Analysis](https://github.com/sgsinclair/alta/blob/master/ipynb/ArtOfLiteraryTextAnalysis.ipynb).
* Install Python using Anaconda on your computer: [Installing Anaconda on Windows](https://www.datacamp.com/community/tutorials/installing-anaconda-windows) or [Installing Anaconda on Mac](https://www.datacamp.com/community/tutorials/installing-anaconda-mac-os-x).

Even if you are not sure programming is for you, [Cambridge Digital Humanities](https://www.cdh.cam.ac.uk/) (CDH) has a number of resources to support your research. 

* CDH Learning - [training events/workshops](https://www.cdh.cam.ac.uk/learning/cdh-events) and mentoring programme
* CDH Lab - email [lab@cdh.cam.ac.uk](mailto:lab@cdh.cam.ac.uk) for advice on your project, whether you are just getting started, somewhere in the middle, or thinking about the future
