# P40 Topic Modelling

Automatic topic detection in large corpora

## Topics

* Word frequencies
* Topic models
* (Sentiment analysis)


<img align="middle" src="./png/text_mining.png" width="800"/>


# Word frequencies

## Word frequencies

What is the simplest analysis we can do on a document?

Count the frequencies of all words used in the document!

Let’s see where this takes us…

## Documents as a Bag of words

<img align="middle" src="./png/bow.png" width="300"/>

* First step is to transform text into a ‘Bag-Of-Words’
* This is a matrix with all the unique words and their frequencies (how often they occur) per document
* Each word is a feature in this matrix

<img align="middle" src="./png/bow_table.png" width="800"/>
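
As a minimal sketch of the idea, the snippet below builds a bag-of-words table for two made-up sentences using only the standard library. The toy documents, the `tokenize` helper and the naive lowercase/whitespace tokenisation are illustrative assumptions and not part of the Top 2000 data.

```python
from collections import Counter

# Two toy documents (made up for illustration)
docs = {
    "doc1": "love me do you know I love you",
    "doc2": "all you need is love",
}

def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

# One Counter per document: word -> frequency
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

# The vocabulary is the sorted set of all unique words: one column per word
vocabulary = sorted(set(word for c in counts.values() for word in c))

# Bag-of-words matrix: one row per document, one column per word
bow = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in bow.items():
    print(name, row)
```

Each row has the same length as the vocabulary, and most entries become zero as the vocabulary grows, which is why such matrices are usually stored in a sparse format.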


## Term frequency and inverse document frequency

One way to measure how important a word is in a document is by counting the *term frequency* for this word.

On its own, this gives the highest weight to words that occur very frequently but carry little meaning, such as “the”, “is” and “of”.

Another way to assess the importance of a word in a corpus of documents is its *inverse document frequency*. It decreases the weight of words that occur in many documents and increases the weight of words that occur in only a few.

So for inverse document frequency we define:

$$idf(term) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)$$

Multiplying the term frequency by the inverse document frequency gives:

$$tfidf = tf \times idf$$
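
The formulas can be tried out by hand on a few made-up sentences. The sketch below is only an illustration: the three toy documents are invented, and the term frequency is taken as a relative frequency (count divided by document length), which is one common convention and an assumption here.

```python
import math
from collections import Counter

# Three toy documents (made up for illustration)
docs = [
    "the sun is shining",
    "the rain is falling",
    "the sun and the rain",
]
tokenized = [doc.split() for doc in docs]
n_documents = len(tokenized)

def tf(term, tokens):
    """Term frequency: share of the tokens in one document equal to `term`."""
    return Counter(tokens)[term] / len(tokens)

def idf(term):
    """Inverse document frequency: ln(n_documents / n_documents containing term).

    Only defined for terms that occur in at least one document.
    """
    n_containing = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_documents / n_containing)

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

# Scores for three words of the first document
for term in ["the", "sun", "shining"]:
    print(term, round(tfidf(term, tokenized[0]), 3))
```

The word “the” occurs in every document, so its idf is ln(1) = 0 and its tf-idf is zero, while “shining” occurs in a single document and receives the highest score of the three.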

## Top 2000 dataset

<img align="middle" src="./png/top2000.png" width="600"/>


In this class we will use the Top 2000 dataset.

It can be found in the zipfile data_for_windows.zip

## Wordcloud

Which song is visualised in this image?

<img align="middle" src="./png/song.png" width="800"/>


## Make your own wordcloud

Install and import required modules

    ! pip install wordcloud
    ! pip install nltk


In [None]:
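# we need to download some data from nltk
import nltk
# download the resources used later in this notebook
# (calling nltk.download() without arguments opens a GUI to pick them by hand)
nltk.download('punkt')      # sentence tokenizer; newer nltk releases may ask for 'punkt_tab' instead
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # data for the WordNet lemmatizer
nltk.download('omw-1.4')    # multilingual WordNet data, needed by some nltk versions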

In [None]:
# load all libraries we need
import wordcloud as wc
import matplotlib.pyplot as plt
from os.path import isfile, join
from os import listdir
import zipfile
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.corpus import stopwords

## Extract and load Top2000 data

Extract the data we will be using from the zip file:

In [None]:
# adjust to your liking
path_to_zip_file = "./data/top2000.zip"
directory_to_extract_to = "./data/t2000"
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

In [None]:
import os
os.getcwd()




In [None]:
# read one song
data_path_file_name = directory_to_extract_to + "/top2000/lyrics/The_Beatles_Can_t_Buy_Me_Love.txt"
with open(data_path_file_name) as f:
    text_raw = f.read()
print(text_raw)


### Stopwords
Stopwords are usually excluded, because these very frequent but uninformative words would otherwise dominate the result.

In [None]:
wc_stopwords = set(wc.STOPWORDS)
print(wc_stopwords)

In [None]:
wordcloud = wc.WordCloud(stopwords = wc_stopwords, 
                max_words = 20,
                collocations = False,
                max_font_size=80).generate(text_raw)
plt.imshow(wordcloud, interpolation='bilinear') 
plt.axis("off")
plt.show()                 


##  Data structures

Text data can be structured in different ways. The structures below are the ones most often used in text mining:

* *String* : Text can, of course, be stored as strings.
* *Corpus* : These types of objects typically contain raw strings annotated with additional metadata and details.
* *Document-term matrix* : This is a sparse matrix describing a collection (i.e., a corpus) of documents with one row for each document and one column for each term. The value in the matrix is typically word count or tf-idf.
* *Tidy text* : A table with one token per row. The [tidy data principles](https://towardsdatascience.com/what-is-tidy-data-d58bb9ad2458) known from the R language also apply to text.
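
To make the difference between these structures concrete, the sketch below stores the same two made-up texts as strings, as a tidy table of counts and as a document-term matrix. The file names and texts are invented for illustration, and plain Python containers stand in for the corpus and dataframe objects used in the rest of this notebook.

```python
from collections import Counter

# String: the raw text of each (made-up) document
documents = {
    "song_a.txt": "hello hello goodbye",
    "song_b.txt": "goodbye yellow brick road",
}

# Tidy table: one row per (document, term) pair with its count
tidy_rows = [
    (doc_id, term, n)
    for doc_id, text in documents.items()
    for term, n in sorted(Counter(text.split()).items())
]

# Document-term matrix: one row per document, one column per term
terms = sorted({term for _, term, _ in tidy_rows})
dtm = {doc_id: [0] * len(terms) for doc_id in documents}
for doc_id, term, n in tidy_rows:
    dtm[doc_id][terms.index(term)] = n

print(tidy_rows)
print(terms)
print(dtm)
```

The tidy table only stores the combinations that actually occur, whereas the document-term matrix also holds a zero for every term a document does not contain.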


## Corpus

If we have to deal with a lot of documents we can create a structured object for it.

We will be using the nltk corpus reader package: https://www.nltk.org/api/nltk.corpus.reader.html

Structuring the text documents in a corpus can be done like so:



In [None]:
corpus_root = directory_to_extract_to + "/top2000/lyrics/"
file_ext = ".txt"
file_ids = [f for f in listdir(corpus_root) if isfile(join(corpus_root, f)) and f.lower().endswith(file_ext)]

In [None]:
listdir(corpus_root)


In [None]:

corpus = PlaintextCorpusReader(corpus_root, file_ids)
print("The number of documents:", len(corpus.fileids()))
print("The number of sentences =", len(corpus.sents()))
print("The number of words =", len([word for sentence in corpus.sents() for word in sentence]))
print("The number of characters =", len([char for sentence in corpus.sents() for word in sentence for char in word]))

## Document-term matrix

A document-term matrix has one row per document and one column per term, and holds the frequency of each term in each document of the corpus.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

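# note: an integer max_df=2 drops every term that occurs in more than 2 documents,
# so only rare terms are kept; raise or remove it to keep common terms as well
count_vect = CountVectorizer(max_df=2)
# sparse document-term matrix: one row per document, one column per term
term_document_matrix = count_vect.fit_transform([corpus.raw(i) for i in file_ids])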
df_dtm = pd.DataFrame(term_document_matrix.toarray(), columns=count_vect.get_feature_names_out())
df_dtm['file_ids'] = file_ids
df_dtm=df_dtm.set_index('file_ids')
df_dtm

## Frequent terms

Filter most frequent terms in the corpus

In [None]:
# FreqDist requires a list of words as input
# We lowercase the text of each document in the corpus, join all documents into one long string
# and finally split that string into words and store them in a list
freq = nltk.FreqDist(' '.join([corpus.raw(i).lower() for i in file_ids]).split())
top_words = freq.most_common(10)
top_words


A lot of stopwords! What about ‘ain’ and ‘don’?

## Clean text

Let’s remove stopwords, whitespace, numbers and punctuation from the text.

There is also a package named [textcleaner](https://pypi.org/project/textcleaner/) that you can use.

The code below is based on [this SO answer](https://stackoverflow.com/questions/54396405/how-can-i-preprocess-nlp-text-lowercase-remove-special-characters-remove-numb).

In [None]:
# takes some time
df = pd.DataFrame({'text': [corpus.raw(i) for i in file_ids],
                   'file_ids': file_ids})

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
import re # regular expressions
# optional lemmatizer
lemmatizer = WordNetLemmatizer()
# optional stemmer
stemmer = PorterStemmer()
# build these once instead of once per word, which is much faster
english_stopwords = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')

def preprocess(sentence):
    sentence = str(sentence).lower()
    sentence = sentence.replace('{html}', "")
    cleantext = re.sub(r'<.*?>', '', sentence)       # remove html tags
    rem_url = re.sub(r'http\S+', '', cleantext)      # remove urls
    rem_num = re.sub(r'[0-9]+', '', rem_url)         # remove digits
    tokens = tokenizer.tokenize(rem_num)             # text > sequence of words, drops punctuation
    # drop short words (1 or 2 characters) and stopwords
    filtered_words = [w for w in tokens if len(w) > 2 and w not in english_stopwords]
    stem_words = [stemmer.stem(w) for w in filtered_words]
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(lemma_words)


df['clean_text'] = df['text'].map(preprocess)

In [None]:
# check results
df.head()

In [None]:
corpus_clean = df[['file_ids','clean_text']]

In [None]:
df.head()

In [None]:
# example
corpus_clean['clean_text']

## Find popular terms after cleaning

Popular terms in the Top 2000. Notice that we now pass a dataframe column (the cleaned text) to `FreqDist`:

In [None]:
freq = nltk.FreqDist(' '.join(corpus_clean['clean_text']).split())
topWords = freq.most_common(20)
topWords

## Lines per song

In [None]:
# here we are back to the raw corpus
newline_count_file = [[corpus.raw(i).count('\n'), i] for i in file_ids]
newline_count_file_sorted = sorted(newline_count_file, key=lambda x: -x[0])[0:10]
ys, xs = [*zip(*newline_count_file_sorted)]

In [None]:
import numpy as np
plt.barh(xs, ys)

## Most used word in one song

In [None]:
df1 = pd.DataFrame(columns=['word', 'n', 'total'])
for i in file_ids:
    list_with_words = corpus.raw(i).lower().split()
    freq = nltk.FreqDist(list_with_words)
    # most frequent word in this song and how often it occurs
    word, n = freq.most_common(1)[0]
    df1.loc[i] = [word, n, len(list_with_words)]
df1.sort_values("n", ascending=False).head()

## Term frequencies

In [None]:
from collections import Counter

# example songs
song_list = ["Pearl_Jam_Black.txt", "James_Brown_Sex_Machine.txt",
               "The_Blues_Brothers_Everybody_Needs_Somebody_To_Love.txt",
               "Justin_Timberlake_Cry_Me_A_River.txt"]

for song in song_list:
    words = corpus.raw(song).lower().split()
    total_words = len(words)
    # count how often each word occurs in this song
    cnt = Counter(words)

    word_freq = pd.DataFrame(cnt.most_common(20), columns=['words', 'count'])
    word_freq["total_words"] = total_words
    # share of all words in the song
    word_freq["n_total"] = (word_freq["count"] / word_freq["total_words"]).round(2)
    fig, ax = plt.subplots(figsize=(12, 3))

    # Plot a vertical bar chart of the 20 most common words
    word_freq.sort_values(by='count').plot.bar(x='words',
                        y='count',
                        ax=ax,
                        color="brown")
    ax.set_title(song)
    plt.show()

### Sentiment Analysis

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach.
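
The idea of summing word scores can be sketched in a few lines. The lexicon below is hand-made with hypothetical scores (it is not taken from AFINN or VADER), and the `sentiment_score` helper is an illustrative name rather than a library function.

```python
# A tiny hand-made lexicon (hypothetical scores, not taken from AFINN or VADER)
toy_lexicon = {"love": 3, "happy": 3, "good": 2, "bad": -3, "cry": -2, "alone": -2}

def sentiment_score(text, lexicon):
    """Sum the scores of all words found in the lexicon; unknown words count as 0."""
    words = text.lower().split()
    return sum(lexicon.get(word, 0) for word in words)

print(sentiment_score("I love this happy song", toy_lexicon))    # 3 + 3
print(sentiment_score("I cry alone on a bad day", toy_lexicon))  # -2 - 2 - 3
print(sentiment_score("not good", toy_lexicon))                  # only "good" is scored
```

The last call shows the main weakness of a plain sum: “not good” still gets a positive score, because the negation is ignored.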

In [None]:
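# !pip install vaderSentiment
# the vaderSentiment package ships with its own lexicon;
# the nltk download below is only needed if you use nltk.sentiment.vader instead
# nltk.download("vader_lexicon")
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer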

In [None]:
vader = SentimentIntensityAnalyzer()

word_list = ["yes", "no", "abort", "not yes", "upset", "happy", "angry", "holiday"]
for word in word_list:
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' scores
    sentiments = vader.polarity_scores(word)
    print(f"\nWord: {word}")
    print(f"Sentiment {sentiments}")


### Sentiment lexicons

There are general-purpose lexicons available for doing sentiment analysis with Python such as:

* AFINN from Finn Årup Nielsen,
* Vader, and
* pre-trained AI sentiment models 

VADER stands for Valence Aware Dictionary and sEntiment Reasoner. Its model can deal with problem text like “not great” (i.e., negations) and is also sensitive to the intensity of language, such as amplifiers (“very happy” vs “happy”).

The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.  AFINN preprocesses text by removing the punctuation and converting all the words to lower-case.

There are many pre-trained AI models available that may better suit your language or use case. These mostly require TensorFlow or PyTorch.

### Explore the sentiment lexicons yourself

Dictionary-based methods like the ones we are discussing find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text.

In [None]:
!pip install afinn
from afinn import Afinn
afinn = Afinn(language='en')
# The score method returns the sum of word valence scores for a text string.
afinn.score('I had a bad day.')


### Sentiment scores with inner join

We use four songs to do the sentiment analysis

In [None]:
print(song_list)
df_corps = df.set_index('file_ids', inplace=False)
df_sentiment_selected = df_corps.loc[song_list]
df_sentiment_selected
sentiments = [vader.polarity_scores(document) for document in df_sentiment_selected['clean_text']]
sentiments

We take sections of ten lines and calculate sentiment on each section.

In [None]:
# ! pip install more_itertools
from more_itertools import grouper

def group_lines(text, n=10):
    # split the text into lines, group them per n lines and drop empty lines
    # note: recent more_itertools versions expect grouper(iterable, n)
    return ["\n".join(line for line in lines if line)
            for lines in grouper(text.split("\n"), n, fillvalue="")]

group_lines(df_corps.loc['Pearl_Jam_Black.txt']['text'])


## Sentiment scores

Now we can plot these sentiment scores across the duration of the song.


In [None]:
fig, axes = plt.subplots(ncols=2, nrows=2, sharex=True, sharey=True, figsize=(12,8))

# one subplot per song
for ax, song in zip(axes.flatten(), song_list):

    split_lines = group_lines(df_corps.loc[song]['text'])

    split_lines_sentiments = [vader.polarity_scores(document) for document in split_lines]
    split_lines_sentiments_compound = [item["compound"] for item in split_lines_sentiments]

    ax.bar([*range(1, len(split_lines)+1, 1)], split_lines_sentiments_compound)
    ax.set_title(song)

plt.show()

### Homework

Explore the impact of choosing a different lexicon.

### Topic Modeling



In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd
import numpy as np
%matplotlib inline

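In [None]:
# max_features limits the number of features to use
# stop_words must be a list of words: ['english','dutch'] would only remove
# the two literal words "english" and "dutch", so we use the nltk stopword lists
stop_words_en_nl = stopwords.words('english') + stopwords.words('dutch')
vect = CountVectorizer(max_features=1000, ngram_range=(1,1), stop_words=stop_words_en_nl)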

In [None]:
# build a document term matrix
dtm=vect.fit_transform(corpus_clean['clean_text'])

In [None]:
# document term matrix
dtm

In [None]:
pd.DataFrame(dtm.toarray(),columns=vect.get_feature_names_out())

### Latent Dirichlet Allocation


Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. It is also a topic model that is used for discovering abstract topics in a collection of documents.

The graphical model of LDA is a three-level generative model:

<img align="middle" src="./png/lda_model_graph.png" width="600"/>


Notes on the notation used in the graphical model above, which can be found in Hoffman et al. (2013):

* The corpus is a collection of D documents.
* A document is a sequence of N words.
* There are K topics in the corpus.
* The boxes represent repeated sampling.

In the graphical model, each node is a random variable and has a role in the generative process. A shaded node indicates an observed variable and an unshaded node indicates a hidden (latent) variable. In this case, words in the corpus are the only data that we observe. The latent variables determine the random mixture of topics in the corpus and the distribution of words in the documents. The goal of LDA is to use the observed words to infer the hidden topic structure.
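
The generative story can be sketched with the standard library alone. Everything concrete below is an assumption made for illustration: the six-word vocabulary, the sizes K, N and D, and the symmetric Dirichlet parameter of 0.5. It generates documents from hidden topics and is not the inference procedure that scikit-learn uses to recover them.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_dirichlet(alpha):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

vocabulary = ["love", "heart", "night", "road", "car", "drive"]
K = 2  # number of topics
N = 8  # words per document
D = 3  # number of documents

# One word distribution per topic (latent)
topic_word = [sample_dirichlet([0.5] * len(vocabulary)) for _ in range(K)]

corpus = []
for _ in range(D):
    # Topic mixture of this document (latent)
    doc_topic = sample_dirichlet([0.5] * K)
    words = []
    for _ in range(N):
        # Pick a topic for this word position (latent) ...
        z = random.choices(range(K), weights=doc_topic)[0]
        # ... then pick the word from that topic's distribution (observed)
        words.append(random.choices(vocabulary, weights=topic_word[z])[0])
    corpus.append(words)

for doc in corpus:
    print(" ".join(doc))
```

Only the printed words would be visible in real data. Fitting LDA means working backwards from such words to plausible values of `topic_word` and of each document's `doc_topic`.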

[more on scikitlearn](https://scikit-learn.org/stable/modules/decomposition.html#latent-dirichlet-allocation-lda)


In [None]:
# how many topics do we want to find
lda=LatentDirichletAllocation(n_components=10)

In [None]:
# fit the model
lda.fit_transform(dtm)

### Visualization of topics

In [None]:
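import pyLDAvis
# note: in pyLDAvis 3.4 and later this module is named pyLDAvis.lda_model;
# if the import below fails, use that name here and in the prepare() calls
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()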

In [None]:
zit=pyLDAvis.sklearn.prepare(lda,dtm,vect)

In [None]:
pyLDAvis.display(zit)

In [None]:
# function to get relevant words that define the topics
def get_model_topics(model, vectorizer, topics, n_top_words=5):
    word_dict = {}
    feature_names = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
        top_features = [feature_names[i] for i in top_features_ind]
        word_dict[topics[topic_idx]] = top_features

    return pd.DataFrame(word_dict)

## What do the topics mean?

So now we have found a latent clustering of relevant words into topics.

And we can use this to predict which topics a new document talks about.

In [None]:
# what do the topics mean?
topics = ['T1','T2','T3','T4','T5','T6','T7','T8','T9','T10']

In [None]:
# the most relevant words describe the topic
get_model_topics(lda,vect,topics,n_top_words=8)

### Use the lda model to predict what topic is discussed in a new document

In [None]:
def get_inference(model, vectorizer, topics, text, threshold):
    v_text = vectorizer.transform([text])
    score = model.transform(v_text)

    labels = set()
    for i in range(len(score[0])):
        if score[0][i] > threshold:
            labels.add(topics[i])

    if not labels:
        return 'None', -1, set()

    return topics[np.argmax(score)], score, labels

In [None]:
# this is a Dutch song
text = corpus_clean.iloc[3]['clean_text']
# there is a topic that is about 'Dutch' songs ...
(topic, scores, topic_labels) = get_inference(lda, vect, topics, text, threshold=0.0)
topic

In [None]:
text

In [None]:
# get topic scores for each document
doc_topic_dist_unnormalized = np.matrix(lda.transform(dtm))

# normalize the distribution (only needed if you want to work with the probabilities)
doc_topic_dist = doc_topic_dist_unnormalized/doc_topic_dist_unnormalized.sum(axis=1)

In [None]:
# find the topic with highest probability
doc_topic_dist.argmax(axis=1)[0:10]

## Exercise: keep only the Dutch songs and fit a topic model on them

In [None]:
# number of Dutch songs
nl_topic = 5  # this can be different every time, as we cannot predict the order in which the topics are found
nl_filter = doc_topic_dist.argmax(axis=1)==nl_topic
corpus_clean_nl = corpus_clean[nl_filter.A1]
corpus_clean_nl.shape

In [None]:
# max_features limits the number of features to use
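# stop_words=['dutch'] would only remove the literal word "dutch", so we pass the nltk Dutch stopword list
vect2 = CountVectorizer(max_features=500, ngram_range=(1,1), stop_words=stopwords.words('dutch'))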
# build a document term matrix
dtm_nl = vect2.fit_transform(corpus_clean_nl['clean_text'])
# how many topics do we want to find
lda_nl = LatentDirichletAllocation(n_components=10)
# fit the model
lda_nl.fit_transform(dtm_nl)

In [None]:
# the most relevant words describe the topic
get_model_topics(lda_nl,vect2,topics,n_top_words=8)

In [None]:
zit2 = pyLDAvis.sklearn.prepare(lda_nl,dtm_nl,vect2)

In [None]:
pyLDAvis.display(zit2)