# Topic Modelling
## 1. Overview
* Topic modelling is the process of identifying groups of words that tend to occur together across documents, so that each group can be interpreted as a specific topic
* For example, if the words 'dog', 'cat', 'vet' etc. occurred commonly in a document, you could say that a suitable topic for that document is 'pets'
* Topics can be thought of mathematically as a distribution of words, where certain words occur far more frequently than others and give meaning to the topic selected
* Documents can be thought of as distributions of topics (each document can contain more than one topic), and you can assign a single topic to a document by picking the topic that carries the most weight in it
* Topic modelling is an unsupervised method: the user must choose the number of topics to extract and the label to assign to each topic (this is a judgement call, so the result is never perfect)
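
The two "distribution" ideas above can be made concrete with a toy example. This is only a sketch: the topic names, words and probabilities below are all invented for illustration and do not come from the NPR data.

```python
# Toy illustration: every number below is invented for the example.
# Each topic is a probability distribution over words.
topics = {
    "pets":     {"dog": 0.40, "cat": 0.40, "vet": 0.15, "vote": 0.05},
    "politics": {"dog": 0.05, "cat": 0.05, "vet": 0.10, "vote": 0.80},
}

# Each document is a probability distribution over topics.
doc_topic_mix = {"pets": 0.7, "politics": 0.3}

# The chance of seeing a word in the document is the topic-weighted
# mix of that word's probability under each topic.
vocab = ["dog", "cat", "vet", "vote"]
word_probs = {
    word: sum(doc_topic_mix[t] * topics[t][word] for t in topics)
    for word in vocab
}

# The topic we would assign to the document is the one with the most weight.
dominant_topic = max(doc_topic_mix, key=doc_topic_mix.get)

print(word_probs)
print(dominant_topic)
```

Because both the topics and the document mix are proper distributions, the resulting word probabilities also sum to 1.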

In [1]:
# load libraries
import pandas as pd

# load text data
npr = pd.read_csv('NLP Course Files/TextFiles/npr.csv')

# peek at data
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [2]:
# examine first article length
len(npr['Article'][0])

7646

In [3]:
# examine df
npr.shape

(11992, 1)

## 2. Latent Dirichlet Allocation
* There are many options for topic modelling, but a well-tested one in Python is Latent Dirichlet Allocation (LDA), named after the Dirichlet distribution it uses as a prior
* The process for topic modelling with LDA is as follows:
    * User picks n_topics to extract from the data
    * LDA algorithm calculates the probability of each topic in each document
    * It also calculates the probability of each word in each topic
    * It then iteratively (many, many times!) re-assigns new, more accurate topics to each word based on the proportion of words throughout all documents that have been assigned to each topic
    * This essentially homes in on the best topic to assign to each word, and therefore the best topic to assign to each document based on the frequency and importance of words across all docs
    * The user then analyses the most frequent words within the selected topic of a document and labels the topic accordingly
* Before the LDA algorithm can process the text, the words are converted into numerical counts, producing the document term matrix (DTM)
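
The iterative re-assignment described above can be sketched with a small collapsed Gibbs sampler. Note the assumptions: this is a different inference method from the one scikit-learn's `LatentDirichletAllocation` uses (variational Bayes), the function name `gibbs_lda`, the toy documents and the `alpha`/`beta` values are all made up for illustration, and the code favours readability over speed.

```python
import random

def gibbs_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=42):
    """Toy collapsed Gibbs sampler for LDA (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # tokens in doc d assigned to topic k
    topic_word = [[0] * n_vocab for _ in range(n_topics)]   # times word v is assigned to topic k
    topic_total = [0] * n_topics                            # total tokens assigned to topic k
    assignments = []

    # start from a random topic for every word occurrence
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                k_old = assignments[d][n]
                # take the current token out of the counts
                doc_topic[d][k_old] -= 1
                topic_word[k_old][v] -= 1
                topic_total[k_old] -= 1
                # weight of topic k ~ (topic k in this doc) * (this word in topic k)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + beta * n_vocab)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                # put the token back under its newly sampled topic
                assignments[d][n] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][v] += 1
                topic_total[k_new] += 1
    return vocab, doc_topic, topic_word

# invented toy corpus, already tokenized
docs = [
    ["dog", "cat", "vet", "dog"],
    ["cat", "vet", "cat"],
    ["vote", "election", "party"],
    ["election", "vote", "vote", "party"],
]
vocab, doc_topic, topic_word = gibbs_lda(docs, n_topics=2)

for d, counts in enumerate(doc_topic):
    print(f"doc {d}: tokens per topic = {counts}")
```

Each pass removes a word from the counts, scores every topic by how common it is in the document and how common the word is in the topic, and samples a new topic from those scores. Repeating this many times is what lets words that co-occur settle into the same topic.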

In [4]:
# load libraries
from sklearn.feature_extraction.text import CountVectorizer # get word/token frequencies

# create vectorizer instance
# max_df = discard words that show up in more than 90% of docs (e.g. really common words)
# min_df = discard words that occur in < 2 docs (e.g. typos, mis-spellings)
# stop_words = remove all English stop words (e.g. the, is, a...)
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')

# fit model to data and transform into vectors
# dtm = document term matrix (docs x words)
dtm = cv.fit_transform(npr['Article'])

# check output (should be sparse matrix)
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
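
To see what `max_df` and `min_df` are doing, here is a minimal pure-Python sketch of document-frequency filtering followed by counting. It is an approximation for illustration, not scikit-learn's implementation: the toy sentences are invented, tokenizing is a plain `split()` (the real vectorizer also lowercases and drops single-character tokens), and no stop word list is applied.

```python
from collections import Counter

# invented toy corpus
docs = [
    "the dog saw the vet",
    "the cat saw the vet",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# document frequency = number of docs containing the word (not total occurrences)
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

max_df, min_df = 0.9, 2  # same settings as the vectorizer above
vocab = sorted(
    word for word, df in doc_freq.items()
    if min_df <= df <= max_df * n_docs
)

# document term matrix: one row per doc, one column per kept word
dtm_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm_rows.append([counts[word] for word in vocab])

print(vocab)
for row in dtm_rows:
    print(row)
```

Here 'the' is dropped because it appears in every document, and 'bird', 'sang' and 'chased' are dropped because each appears in only one document.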

In [5]:
# load libraries
from sklearn.decomposition import LatentDirichletAllocation

# create object instance
# no right or wrong for n_components (i.e. topics), you can experiment here
lda = LatentDirichletAllocation(n_components=7, random_state=42)

# fit object to data (can take a while as it's an iterative process)
lda.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

### Extracting Latent Topics
* They are known as latent topics because the LDA algorithm hasn't defined labels or descriptions for these topics; it has simply inferred similarity between groups of words
* We can look up the word behind each column of the DTM via the vectorizer's **get_feature_names()** method (replaced by **get_feature_names_out()** in newer scikit-learn versions), and the per-topic word weights via **lda.components_**
* Here, each word has been assigned a weight for each of the defined topics (the weights become probabilities once each topic's row is normalized to sum to 1)
* We can sort by word weight within specific topics to see the most important words for that topic and begin to define the topic labels
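
As a small sketch of that sorting step, the snippet below ranks words within each topic and normalizes the weights into probabilities. The `components` values and the five-word vocabulary are hypothetical stand-ins for `lda.components_` and the vectorizer's feature names, and `top_words` is a helper invented for this example.

```python
# Hypothetical stand-ins for the vectorizer vocabulary and lda.components_
# (2 topics x 5 words); the numbers are invented.
feature_names = ["cat", "dog", "election", "vet", "vote"]
components = [
    [12.0, 15.0, 0.5, 9.0, 0.5],   # topic 0
    [0.5, 0.5, 20.0, 1.0, 18.0],   # topic 1
]

def top_words(weights, names, n=3):
    """Return the n highest-weighted words with their normalized probability."""
    total = sum(weights)
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return [(name, round(weight / total, 3)) for name, weight in ranked[:n]]

topic_top_words = [top_words(row, feature_names) for row in components]

for i, words in enumerate(topic_top_words):
    print(f"topic {i}: {words}")
```

Unlike `argsort()[-15:]`, which lists the most important word last, this helper sorts in descending order so the most important word comes first.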

In [6]:
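# vectorizer feature names is just the vocabulary: one entry per unique word across all our docs
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (>= 1.0)
# returns an ndarray, so wrap it in list() to keep working with a plain list
feature_names = list(cv.get_feature_names_out())
print(type(feature_names))

# check size of vocabulary
# length should be identical to # unique words (columns) in our dtm (above)
# this is because our vectorizer has one column per unique word
print(len(feature_names))

# we can therefore get words directly from it
feature_names[38755]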

<class 'list'>
54777


'psychoactive'

In [7]:
# investigate topics
print(len(lda.components_)) # number of topics
print(type(lda.components_)) # ndarray of topic-word weights (pseudo-counts, not normalized probabilities)
print(lda.components_.shape) # shape is # topics x # words
print('\n')

# look up the vocabulary once rather than on every loop iteration
# (get_feature_names() was removed in scikit-learn 1.2 in favour of get_feature_names_out())
feature_names = cv.get_feature_names_out()

# iterate through topics, extracting top 15 words per topic
for i, topic in enumerate(lda.components_):
    print(f'TOP 15 WORDS FOR TOPIC #{i}')

    # argsort() returns index positions sorted from least to greatest value
    # the last 15 are the highest-weighted words for the topic (most important word last)
    print([feature_names[index] for index in topic.argsort()[-15:]])
    print('\n')

7
<class 'numpy.ndarray'>
(7, 54777)


TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', '

### Assigning Topics
* Once we have fitted our LDA model, we can assign the highest probability topic to each of the articles/documents in our input data
* We can use our LDA model to transform our DTM into probabilities that show how strongly each document is associated with each topic
* If you've selected n_topics, you will have n_topics probabilities per document (summing to 1)
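
The assignment step boils down to an argmax over each document's topic probabilities. Here is a minimal sketch in plain Python: the first row reuses the rounded probabilities of the first article (shown in the output of the next cell), the second row is invented, and `top_topics` is a hypothetical helper for also inspecting the runner-up topic.

```python
# Document-topic probabilities: one row per document, one column per topic.
topic_probs = [
    [0.02, 0.68, 0.0, 0.0, 0.30, 0.0, 0.0],      # first article (rounded)
    [0.10, 0.05, 0.60, 0.05, 0.05, 0.10, 0.05],  # invented second document
]

def top_topics(row, n=2):
    """Return the n (topic index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# equivalent of argmax(axis=1): index of the highest probability in each row
assigned = [max(range(len(row)), key=row.__getitem__) for row in topic_probs]

print(assigned)
print(top_topics(topic_probs[0]))
```

Looking at the runner-up is a cheap way to spot documents that genuinely mix two topics, such as the first article, which is mostly topic 1 with a sizeable share of topic 4.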

In [8]:
# calculate probabilities of topics per document/article
topic_results = lda.transform(dtm)

# check first result (probability of each topic for the first article)
topic_results[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

In [9]:
# assign topics to each article/document
# argmax() = get index position of highest probability
npr['Topic'] = topic_results.argmax(axis=1)

# check new data
npr.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2


## 3. Non-Negative Matrix Factorization
* NMF is an alternative method to LDA
* Both allow you to perform topic modelling, but they use different methods
* LDA is a probabilistic method, calculating the likelihood of words occurring within topics and optimizing topic and document topic assignment via iterative reassignment
* NMF is a linear algebraic method which performs dimensionality reduction and clustering simultaneously (similar in spirit to PCA) in order to determine topic assignment and coefficients for words and documents
* Both have their pros and cons; we will now look at NMF to contrast it with the earlier LDA code
* [LDA vs. NMF](https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df)

In [10]:
# load libraries
from sklearn.feature_extraction.text import TfidfVectorizer

# load text data
npr = pd.read_csv('NLP Course Files/TextFiles/npr.csv')

# create vectorizer instance
# filter out words that occur in > 95% of docs or in < 2 docs
# remove stop words to prevent noise
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

# create dtm using vectorizer
dtm = tfidf.fit_transform(npr['Article'])

# check dtm
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
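
For intuition about what the TF-IDF weighting does to the counts, here is a pure-Python sketch. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn's `TfidfVectorizer` uses by default, and the tiny tokenized corpus is invented for the example.

```python
import math
from collections import Counter

# invented toy corpus, already tokenized
docs = [
    ["dog", "cat", "dog"],
    ["cat", "vet"],
    ["vote", "party", "vote"],
]
n_docs = len(docs)
vocab = sorted({w for doc in docs for w in doc})

# document frequency: number of docs each word appears in
doc_freq = Counter(w for doc in docs for w in set(doc))

# smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

tfidf_rows = []
for doc in docs:
    counts = Counter(doc)
    row = [counts[w] * idf[w] for w in vocab]   # term frequency x idf
    norm = math.sqrt(sum(x * x for x in row))   # L2 norm of the row
    tfidf_rows.append([x / norm if norm else 0.0 for x in row])

print(vocab)
for row in tfidf_rows:
    print([round(x, 3) for x in row])
```

Words that are frequent in one document but rare across the corpus (like 'dog' in the first document) end up with the largest weights, and every value stays non-negative, which is what NMF requires of its input.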

Notes:
* The first step is to convert our raw text into a TF-IDF vectorized sparse matrix (document term matrix - DTM)
* During this process, stop words are removed, raw text is converted into numerical vectors and weighting is applied to handle the frequency of word occurrence across all documents (TF-IDF)
* This DTM is then our non-negative matrix input for the NMF stage
* By specifying the number of topics we would like to determine, the NMF model approximately factorizes our DTM into a clustered matrix (words by topic) and a coefficient matrix (documents by topic) where all values are non-negative coefficients (not probabilities)
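
The factorization itself can be sketched in a few lines of plain Python using multiplicative updates. This is an illustration under stated assumptions rather than what scikit-learn runs: its `NMF` defaults to a coordinate descent solver, the function names here (`matmul`, `transpose`, `frob_error`, `nmf`) are hypothetical helpers, and the small matrix `v` is an invented stand-in for the DTM.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frob_error(v, w, h):
    """Squared Frobenius distance between v and the product w @ h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=42, eps=1e-9):
    """Toy NMF via multiplicative updates: v (docs x words) ~ w @ h."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() for _ in range(n_topics)] for _ in range(n_docs)]   # docs x topics
    h = [[rng.random() for _ in range(n_words)] for _ in range(n_topics)]  # topics x words
    errors = [frob_error(v, w, h)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_words)]
             for k in range(n_topics)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
        errors.append(frob_error(v, w, h))
    return w, h, errors

# invented 4 docs x 4 words matrix with two obvious word groups
v = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
]
w, h, errors = nmf(v, n_topics=2)

print("reconstruction error:", round(errors[0], 3), "->", round(errors[-1], 3))
for d, row in enumerate(w):
    print(f"doc {d} topic coefficients:", [round(x, 2) for x in row])
```

Because every update only multiplies by non-negative ratios, both factors stay non-negative throughout, which is why the resulting values are read as coefficients rather than probabilities.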

**NOTE:**
   * Fitting NMF to the data will likely be quicker than LDA, because NMF reduces to matrix operations that numpy handles efficiently
   * In general, NMF tends to be faster than LDA, particularly for very large datasets
   * The topics returned by NMF will not be identical to those from LDA, even on the same dataset. They might be similar, but because the algorithms are different, the results will differ too

In [11]:
# load libraries
from sklearn.decomposition import NMF

# create model instance (aim for 7 topics)
nmf_model = NMF(n_components=7, random_state=42)

# fit model to dtm
nmf_model.fit(dtm)

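# look up the vocabulary once rather than on every loop iteration
# (get_feature_names() was removed in scikit-learn 1.2 in favour of get_feature_names_out())
feature_names = tfidf.get_feature_names_out()

# investigate topics by words
for index, topic in enumerate(nmf_model.components_):
    # get top 15 words per topic (highest coefficient last)
    print(f'THE TOP 15 WORDS FOR TOPIC # {index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')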

THE TOP 15 WORDS FOR TOPIC # 0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 15 WORDS FOR TOPIC # 1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC # 2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC # 3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 15 WORDS FOR TOPIC # 4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC # 5
['love', 've', 'don

In [12]:
# calculate topic coefficients for each document in dtm
# each row holds one coefficient per topic (NMF coefficients, not probabilities)
topic_results = nmf_model.transform(dtm)

# assign highest value topic label for each document
npr['Topic'] = topic_results.argmax(axis=1)

# create topic labels dict
topic_dict = {0:'Health', 1:'Election 1', 2:'Healthcare', 3:'Politics', 4:'Election 2', 5:'Music', 6:'Education'}

# assign labels to topics in df
npr['Topic Label'] = npr['Topic'].map(topic_dict)

# check output
npr.head()

Unnamed: 0,Article,Topic,Topic Label
0,"In the Washington of 2016, even when the polic...",1,Election 1
1,Donald Trump has used Twitter — his prefe...,1,Election 1
2,Donald Trump is unabashedly praising Russian...,1,Election 1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,Politics
4,"From photography, illustration and video, to d...",6,Education
