# I) Cluster Analysis

This analysis explores possible topic clusters using an unsupervised algorithm.

We want to cluster topics based on text tokens and their co-occurrence in a text corpus made up of 530 article abstracts.
__________________

### Topic Modelling explained

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently.

Topic models are also referred to as probabilistic topic models, which refer to statistical algorithms for discovering the latent semantic structures of an extensive text body. 

Topic models are useful for document clustering.

Finding good topics depends on the quality of text processing, the choice of the topic modeling algorithm, and the number of topics specified in the algorithm.

___________________

### LDA-Analysis

Here, we apply LDA (short for Latent Dirichlet Allocation), an unsupervised machine-learning model that takes documents as input and finds topics as output. The model also reports what share of each document is attributable to each topic. A topic is represented as a weighted list of words.

There are 3 main parameters of the model:

- the number of topics
- the number of words per topic (influenced by `eta`)
- the number of topics per document (influenced by `alpha`)

One application of LDA in machine learning - specifically, topic discovery, a sub-problem in natural language processing - is to discover topics in a collection of documents, and then automatically classify any individual document within the collection in terms of how "relevant" it is to each of the discovered topics. A topic is considered to be a set of terms (i.e., individual words or phrases) that, taken together, suggest a shared theme.
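
To make the two outputs of LDA more concrete, here is a minimal sketch with made-up numbers (the topic names, words and weights are illustrative assumptions, not results from this corpus). It shows how a document's topic shares and the topics' word weights combine into the probability of seeing a word in that document.

```python
# Hypothetical topics: each topic is a weighted list of words (weights sum to 1).
topics = {
    "ownership": {"shareholder": 0.5, "ownership": 0.4, "risk": 0.1},
    "pay": {"compensation": 0.6, "bonus": 0.3, "risk": 0.1},
}

# Hypothetical document: 70% about "ownership", 30% about "pay".
doc_topic_shares = {"ownership": 0.7, "pay": 0.3}


def word_probability(word, doc_topic_shares, topics):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        share * topics[topic].get(word, 0.0)
        for topic, share in doc_topic_shares.items()
    )


vocabulary = sorted({word for weights in topics.values() for word in weights})
for word in vocabulary:
    print(f"{word:>12}: {word_probability(word, doc_topic_shares, topics):.2f}")
```

LDA works in the opposite direction: it only observes the words of each document and estimates both the topic shares and the word weights from them.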

## 1) Data-Preprocessing

In [3]:
# Load library
import pandas as pd

# Import df
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Portfolio_Projects/03_PhD_Analysen/04_NLP_CG_VBM/Rohdaten_0603.csv", sep = ";", index_col = 0)

# View first rows
df.head()

Unnamed: 0_level_0,Journal,Title,Year,Abstract
No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Management Review Quarterly,Determinants and effects of sustainable CEO co...,2019,Sustainability-oriented CEO compensation is be...
2,Management Review Quarterly,A governance puzzle to be solved? A systematic...,2020,"To address global sustainability challenges, a..."
3,Journal of Economics and Finance,The analysis of corporate governance policy an...,2016,The main purpose of this study is to investiga...
4,Journal of Economics and Finance,The impact of governance characteristics on th...,2014,The study examines the relationship between th...
5,Journal of Economics and Finance,Board independence and market reactions around...,2011,This study focuses on whether board independen...


In [4]:
# Subset df: No., Journal and Abstract
df = df[["Journal", "Abstract"]].copy()  # .copy() avoids pandas' SettingWithCopyWarning
df["index"] = df.index

# View first rows
df.head()


Unnamed: 0_level_0,Journal,Abstract,index
No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Management Review Quarterly,Sustainability-oriented CEO compensation is be...,1
2,Management Review Quarterly,"To address global sustainability challenges, a...",2
3,Journal of Economics and Finance,The main purpose of this study is to investiga...,3
4,Journal of Economics and Finance,The study examines the relationship between th...,4
5,Journal of Economics and Finance,This study focuses on whether board independen...,5


In [5]:
# Print number of occurrences per journal
df["Journal"].value_counts()

Corporate Governance: An International Review         123
Journal of Management and Governance                   78
International Journal of Disclosure And Governance     39
Journal of Management                                  35
Strategic Management Journal                           35
British Journal of Management                          32
Managerial and Decision Economics                      27
Journal of Management Studies                          19
BRQ Business Research Quarterly                        17
Review of Managerial Science                           15
European Management Review                             13
International Studies of Management & Organization     12
International Journal of Management Reviews            12
Journal of Economics and Finance                       11
Journal of Business Economics and Management           11
Cogent Business & Management                            8
Journal of International Business Studies               6
Journal of Gen

In [6]:
# Print number of journals
print(f"In total, we consider {df['Journal'].nunique()} peer-reviewed journals")


In total, we consider 29 peer-reviewed journals


## 2) Text Preprocessing

### 2a) Tokenization, Removing stopwords, Retain alphabetics, Lowercasing

In [7]:
# Create one string by combining all abstracts
abstracts = " ".join(abstract for abstract in df.Abstract)
# Note: len() of a string counts characters, not words
print(f"There are {len(abstracts)} characters in the combination of all abstracts.")

# Remove Corporate Governance terms
stopwords_cg = ["Corporate", "Governance", "corporate", "governance", "the", "CEO", "ceos", "level", "find", "findings", "related", "paper", "listed", "CG", "effect", 
            "finding", "result", "study", "boards", "based", "board", "Board", "firm", "firms", "family", "performance", "director", "directors", "companies",
            "member", "results", "sample", "suggest", "show", "literature", "research", "model", "management", "article", "one", ".", ",", "(", ")", "The", "We", "In",
            "-", "This", "this", "'", "Our", "’", "In", "also", "evidence", "empirical", "better", "high", "use", "that", "That", "than", "using", "agency", "may", 
            "institutional", "resource", "relationship", "theory", "executive", "different", "data", "analysis", "whether", "non", "new"]

# Import required function: tokenize from gensim.utils
from gensim.utils import tokenize

# Define function to filter out common Corporate Governance terms and tokenize simultaneously
# (tokens are lowercased first, so only the lowercase entries of stopwords_cg can match)
def remove_cg_words(text):
    return [word for word in tokenize(text, to_lower = True) if word not in stopwords_cg]

# Apply filter-function to abstracts
abstracts = remove_cg_words(abstracts)

# Retain only alphabetic words
abstracts_lofialph = [token for token in abstracts if token.isalpha()]

# Remove english stopwords (defined by default)
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
stop_words = nltk.corpus.stopwords.words('english')

# Append stopword-list with corporate governance specific terminologies
new_words = ("ceo", "governance")
for word in new_words:
    stop_words.append(word)

# Print first 10 stop_words
print(f"\n\nHere we see 10 sample stopwords {stop_words[:10]}\n")

# Filter Abstracts for stopwords and remove them
abstracts_lofialph2 = [token for token in abstracts_lofialph if token not in stop_words]

# Retain only words that have at least 3 characters
abstracts_lofialph3 = [token for token in abstracts_lofialph2 if len(token) >= 3]

print(f"Here are ten preprocessed tokens: {abstracts_lofialph3[:10]}")


There are 687159 characters in the combination of all abstracts.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Here we see 10 sample stopwords ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Here are ten preprocessed tokens: ['sustainability', 'oriented', 'compensation', 'widely', 'discussed', 'among', 'policy', 'makers', 'practice', 'academia']


### 2b) Text Normalization

In [8]:
# Download the WordNet corpus (required by the WordNetLemmatizer)
nltk.download("wordnet")

# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Instantiate the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list containing a single document: abstracts_lemm
abstracts_lemm = [[lemmatizer.lemmatize(token) for token in abstracts_lofialph3]]
type(abstracts_lemm)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


list

### 2c) Formatting Text-Corpus

In [9]:
# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the abstracts: abstracts_dict
abstracts_dict = Dictionary(abstracts_lemm)
print(f"{abstracts_dict}\n")

# View the tokens and their respective id
list(abstracts_dict.token2id.items())[:10]

Dictionary(5070 unique tokens: ['abandoned', 'aberrant', 'ability', 'able', 'abnormal']...)



[('abandoned', 0),
 ('aberrant', 1),
 ('ability', 2),
 ('able', 3),
 ('abnormal', 4),
 ('abolish', 5),
 ('abolition', 6),
 ('abroad', 7),
 ('absence', 8),
 ('absorb', 9)]

**Create a gensim corpus**
A Gensim corpus is a list of lists: each inner list represents one document, and each document is a series of tuples.
The first element of a tuple is the token_id, the second element is the token_frequency. Because all abstracts were joined into one string in 2a), the corpus built here contains a single document.
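
The structure described above can be sketched in plain Python. This is only an illustration of the data format, not gensim's implementation: the helper names `build_token2id` and `to_bow` and the toy documents are assumptions, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_token2id(documents):
    """Map every unique token to an integer id (alphabetical order for readability)."""
    vocabulary = sorted({token for doc in documents for token in doc})
    return {token: token_id for token_id, token in enumerate(vocabulary)}


def to_bow(document, token2id):
    """Return a sorted list of (token_id, token_frequency) tuples for one document."""
    counts = Counter(token for token in document if token in token2id)
    return sorted((token2id[token], freq) for token, freq in counts.items())


# Two tiny, made-up documents (already tokenized)
documents = [
    ["risk", "ownership", "risk"],
    ["ownership", "compensation"],
]

token2id = build_token2id(documents)
corpus = [to_bow(doc, token2id) for doc in documents]

print(token2id)  # {'compensation': 0, 'ownership': 1, 'risk': 2}
print(corpus)    # [[(1, 1), (2, 2)], [(0, 1), (1, 1)]]
```

The word order inside each document is lost; only the counts per token remain.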

### 2d) Build BoW for Corpus

In [10]:
# Create a gensim corpus: 
abstracts_corpus = [abstracts_dict.doc2bow(doc) for doc in abstracts_lemm]

# Keep word and frequency; the original word order is lost
[[(abstracts_dict[token_id], freq) for token_id, freq in doc] for doc in abstracts_corpus[:1]]

[[('abandoned', 2),
  ('aberrant', 2),
  ('ability', 46),
  ('able', 25),
  ('abnormal', 13),
  ('abolish', 1),
  ('abolition', 1),
  ('abroad', 1),
  ('absence', 6),
  ('absorb', 2),
  ('absorptive', 2),
  ('abundant', 2),
  ('abuse', 1),
  ('abusively', 1),
  ('ac', 1),
  ('academia', 2),
  ('academic', 35),
  ('academically', 1),
  ('academician', 1),
  ('accelerating', 1),
  ('accentuate', 1),
  ('accentuated', 3),
  ('accentuating', 1),
  ('accept', 2),
  ('acceptance', 4),
  ('accepted', 6),
  ('accepting', 1),
  ('access', 20),
  ('accommodating', 1),
  ('accompanied', 1),
  ('accomplish', 1),
  ('accord', 1),
  ('accordance', 3),
  ('according', 14),
  ('accordingly', 3),
  ('account', 31),
  ('accountability', 19),
  ('accountable', 3),
  ('accounting', 51),
  ('accrual', 6),
  ('accuracy', 2),
  ('accurate', 3),
  ('accurately', 2),
  ('achieve', 12),
  ('achieved', 2),
  ('achievement', 4),
  ('achieving', 7),
  ('aci', 2),
  ('acknowledged', 1),
  ('acknowledgement', 1),
  

## 4) Build Topic Model

### 4a) Train Corpus and create Tfidf-Model

A tf-idf corpus differs from the regular bag-of-words corpus in that it down-weights tokens, i.e. words, that appear frequently across documents. During initialisation, the tf-idf model expects a training corpus with integer counts (such as a bag-of-words corpus).

1. Step: Build the training corpus (bag-of-words corpus)
2. Step: Fit the tf-idf model on that corpus with **models.TfidfModel()**
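
The down-weighting can be illustrated with a small, self-contained sketch. It uses `tf * log2(N / df)`, a common textbook variant; this is an assumption for illustration, and the exact weighting and normalisation applied by gensim's `TfidfModel` (for example under `smartirs = "ntc"`) may differ. The function name and toy corpus are hypothetical.

```python
import math


def tfidf_weights(corpus_bow):
    """Compute tf-idf for a bag-of-words corpus given as lists of (token_id, count).

    Uses tf * log2(N / df), where N is the number of documents and df the
    number of documents that contain the token.
    """
    n_docs = len(corpus_bow)
    doc_freq = {}
    for doc in corpus_bow:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    return [
        [(token_id, count * math.log2(n_docs / doc_freq[token_id])) for token_id, count in doc]
        for doc in corpus_bow
    ]


# Toy corpus: token 0 occurs in every document, tokens 1 and 2 in one document each
toy_corpus = [
    [(0, 3), (1, 2)],
    [(0, 1), (2, 1)],
]

for doc in tfidf_weights(toy_corpus):
    print(doc)
```

Token 0 receives a weight of 0 in both documents because it appears everywhere. Under this variant, a corpus that consists of a single document would give every token a weight of 0, since `log2(1 / 1) = 0`.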

In [11]:
# Print each document of abstracts_corpus as a list of [word, frequency] pairs
for doc in abstracts_corpus:
    print([[abstracts_dict[token_id], freq] for token_id, freq in doc])

# Train the tfidf model on the corpus
from gensim import models

abstracts_tfidf = models.TfidfModel(abstracts_corpus, smartirs = "ntc")




### 4b) Compute LDA Model

The model describes in what percentage each document talks about each topic. A topic is represented as a weighted list of words.  

Now, we compute this weighted list of words below.

In [12]:
# Train the LDA model (no random_state is set, so results vary between runs)
abstracts_lda = models.LdaModel(abstracts_corpus, id2word=abstracts_dict, num_topics = 10)

# Show Model results
from pprint import pprint
pprint(abstracts_lda.show_topics())

[(0,
  '0.008*"ownership" + 0.006*"financial" + 0.005*"shareholder" + '
  '0.005*"mechanism" + 0.005*"risk" + 0.005*"impact" + 0.005*"role" + '
  '0.005*"market" + 0.005*"control" + 0.004*"compensation"'),
 (1,
  '0.006*"risk" + 0.006*"shareholder" + 0.005*"financial" + 0.005*"value" + '
  '0.005*"market" + 0.005*"impact" + 0.004*"ownership" + 0.004*"mechanism" + '
  '0.004*"influence" + 0.004*"country"'),
 (2,
  '0.006*"financial" + 0.006*"shareholder" + 0.005*"market" + 0.005*"risk" + '
  '0.005*"value" + 0.005*"ownership" + 0.004*"mechanism" + '
  '0.004*"compensation" + 0.004*"influence" + 0.004*"practice"'),
 (3,
  '0.005*"shareholder" + 0.005*"value" + 0.005*"market" + 0.005*"risk" + '
  '0.005*"mechanism" + 0.004*"financial" + 0.004*"practice" + '
  '0.004*"ownership" + 0.004*"role" + 0.004*"structure"'),
 (4,
  '0.007*"shareholder" + 0.006*"financial" + 0.006*"risk" + 0.005*"structure" '
  '+ 0.005*"ownership" + 0.005*"market" + 0.004*"control" + 0.004*"value" + '
  '0.004*"mec

### 4c) Visualize LDA Model

To visualize the LDA results, we use **`pyLDAvis`**, an interactive Python package for visualizing LDA models.

The **area of each circle** represents the **importance of the topic over the entire corpus**; the **distance** between the centers of the circles **indicates the similarity between topics**.

For each topic, the bar chart on the right-hand side lists the **30 most relevant terms**.

Each bubble on the left-hand side represents a topic. The larger the bubble, the more prevalent or dominant the topic is. A good topic model has fairly big topics scattered across different quadrants rather than clustered in one quadrant.
- A model with too many topics will have many overlapping, small bubbles clustered in one region of the chart.
- If you move the cursor over the different bubbles, you can see the keywords associated with each topic.

In [13]:
# Import pyLDAvis 
!pip install pyLDAvis==2.1.2
import pyLDAvis.gensim

# Visualize 
pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim.prepare(abstracts_lda, abstracts_corpus, abstracts_dict, R = 30) # arguments "R", "mds"

vis





### 4d) Interpreting the LDA Visualization

**`LDAvis`** attempts to answer some basic questions about the fitted model:
- (1) What is the meaning of each topic?
- (2) How prevalent is each topic?
- (3) How do the topics relate to each other?
- (4) Are our topics **interpretable**?
- (5) Are our topics **unique**? (two different topics have different words)
- (6) Are our topics **exhaustive**? (are all your documents well represented by these topics?)

Different visual components answer each of these questions, some of which are original, and some of which are borrowed from existing tools.

#### **Intertopic Distance Map**: 
presents a global view of the topic model and answers questions 2 and 3.

#### **Horizontal Bar chart**: 
represents the individual terms that are most useful for interpreting the currently selected topic on the left. A pair of overlaid bars represents both the corpus-wide frequency of a given term and its topic-specific frequency.

The left and right panels of the visualization are linked such that selecting a topic (on the left) reveals the most useful terms (on the right) for interpreting the selected topic.

In addition, selecting a term (on the right) reveals the conditional distribution over topics (on the left) for the selected term.

A key innovation of LDAvis is how it determines the most useful terms for interpreting a given topic, and how it allows users to navigate the model interactively.
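
How the "most useful terms" are ranked can be sketched with the relevance score used by LDAvis, which blends a term's probability within the topic with its lift over the corpus-wide probability: `relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w))`. The probabilities and the value `lam = 0.6` below are made-up values for illustration only.

```python
import math


def relevance(p_word_given_topic, p_word, lam=0.6):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(
        p_word_given_topic / p_word
    )


# Hypothetical probabilities for two terms within one topic
terms = {
    # term: (p(term | topic), p(term) in the whole corpus)
    "market": (0.050, 0.050),    # frequent everywhere, not specific to the topic
    "clawback": (0.020, 0.002),  # rarer overall, but concentrated in the topic
}

for lam in (1.0, 0.6, 0.0):
    ranking = sorted(terms, key=lambda t: relevance(*terms[t], lam=lam), reverse=True)
    print(f"lambda={lam}: {ranking}")
```

With `lam = 1.0` terms are ranked purely by their probability within the topic, so generally frequent words come first. Lower values give more weight to words that are specific to the topic. This corresponds to the slider in the interactive view.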

## 5) Evaluate the LDA Model

Now that we have clustered the text corpus into ten topics based on the co-occurrence of tokens, we want to know how dominant each topic is across the entire text corpus.


**Print the share of each topic in a document**

In [14]:
# Get the share of each topic in the first document of the corpus
abstracts_lda[abstracts_corpus[0]]

[(5, 0.18911093), (6, 0.06745454), (7, 0.7434062)]

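Because all abstracts were joined into a single document in 2a), `abstracts_corpus[0]` is the entire text corpus. The percentages recorded below do not match the output above (topics 5, 6 and 7), presumably because they stem from a different run: the model in 4b) is trained without a fixed `random_state`, so topic ids and shares change between runs. In that run, the text corpus was heavily concerned with topic 0, which made up ca. 62% of the content:
- Topic 0: 61.91%
- Topic 1:  3.23%
- Topic 2: 11.52%
- Topic 4:  3.69%
- Topic 6:  7.24%
- Topic 8: 12.04%

-> The other topics were not significant for the text corpus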
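
When a document is queried, gensim's `LdaModel` only returns topics whose share reaches `minimum_probability` (0.01 by default), which is why fewer than ten topics appear in the output. The following sketch mimics that filtering step in plain Python; the function name and the full distribution are hypothetical and only chosen to resemble the output above.

```python
def filter_topics(topic_shares, minimum_probability=0.01):
    """Keep only (topic_id, share) pairs whose share reaches the threshold."""
    return [
        (topic_id, share)
        for topic_id, share in enumerate(topic_shares)
        if share >= minimum_probability
    ]


# Hypothetical full distribution over ten topics for one document (sums to 1)
full_distribution = [0.001, 0.001, 0.001, 0.001, 0.001, 0.189, 0.067, 0.738, 0.0005, 0.0005]

print(filter_topics(full_distribution))
# [(5, 0.189), (6, 0.067), (7, 0.738)]
```

To inspect all topics in gensim, a call such as `abstracts_lda.get_document_topics(abstracts_corpus[0], minimum_probability=0)` should return the shares of the remaining topics as well.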


## Hyperparameter-Tuning the Model

- Include **bi- and tri-grams** to capture more relevant information

- Another classic preparation step is to **use only nouns and verbs** via POS tagging (POS: Part-Of-Speech)

- Adding **stop words** that are too frequent in your topics and re-running your model is a common step.

- **Alpha, Eta**: If you’re not into technical stuff, forget about these. Otherwise, you can tweak alpha and eta to adjust your topics. Start with ‘auto’, and if the topics are not relevant, try other values. I recommend using low values of Alpha and Eta to have a small number of topics in each document and a small number of relevant words in each topic.

- Increase the **number of passes** to obtain a better model: `passes` controls how often the model is trained on the entire corpus
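
Why low values of Alpha lead to few topics per document can be illustrated by sampling topic mixtures from a symmetric Dirichlet distribution, the prior that LDA places on document-topic shares. This is a standalone sketch using only the standard library; the function names, sample size and seed are arbitrary assumptions, and no LDA model is trained here.

```python
import random


def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet via normalised gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, num_topics=10, num_samples=500, seed=42):
    """Average share of the dominant topic across many sampled documents."""
    rng = random.Random(seed)
    return sum(
        max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(num_samples)
    ) / num_samples


for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha}: dominant topic holds {mean_largest_share(alpha):.0%} on average")
```

The smaller alpha is, the more of a sampled document is taken up by a single topic; large values spread each document almost evenly over all topics. Eta plays the same role for the words within a topic.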



# II) N-grams Cluster Analysis

In this second part of our Cluster Analysis, we enhance topic modelling by including n-grams (bi- and tri-grams).

In [44]:
# Import pandas library
import pandas as pd

# Load dataframe (on_bad_lines="skip" replaces the deprecated error_bad_lines=False; requires pandas >= 1.3)
df2 = pd.read_csv("/content/drive/MyDrive/01_Promotion/01_ZCG_Veröffentlichung/Rohdaten_0603.csv", sep = ";", index_col = 0, on_bad_lines = "skip")

# View first rows
df2.head()





Unnamed: 0_level_0,Journal,Title,Year,Abstract
No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Management Review Quarterly,Determinants and effects of sustainable CEO co...,2019,Sustainability-oriented CEO compensation is be...
2,Management Review Quarterly,A governance puzzle to be solved? A systematic...,2020,"To address global sustainability challenges, a..."
3,Journal of Economics and Finance,The analysis of corporate governance policy an...,2016,The main purpose of this study is to investiga...
4,Journal of Economics and Finance,The impact of governance characteristics on th...,2014,The study examines the relationship between th...
5,Journal of Economics and Finance,Board independence and market reactions around...,2011,This study focuses on whether board independen...


## 1) Text Preprocessing

After downloading the `en_core_web_lg` pipeline, we must restart the runtime and re-execute all prior cells. Then, we create nlp objects and proceed as usual.

#### Tokenization, Lemmatization, POS-Tagging, Stopword Removal

In [45]:
# Download the spaCy pipeline
! python -m spacy download en_core_web_lg 

# Import spacy and load nlp-pipeline
import spacy
nlp = spacy.load("en_core_web_lg") # unused pipeline components can be disabled to run the script more quickly


In [None]:
# Create nlp objects (spaCy Doc) for every abstract
df2["nlp_abstract"] = [nlp(abstract) for abstract in df2["Abstract"]]
print(df2.head(3))

In [None]:
# Define allowed Part-of-Speech-Tags
allowed_postags = ["NOUN", "VERB", "ADJ"]

# Lemmatize, retaining only alphabetic non-stopwords whose POS tag is in allowed_postags
def lemmatize_abstract(abstract):
    return [token.lemma_ for token in nlp(abstract)
            if not token.is_stop and token.is_alpha and token.pos_ in allowed_postags]

df2["nlp_abstract"] = df2["Abstract"].apply(lemmatize_abstract)

In [47]:
# View first rows of df2
df2.head(5)

Unnamed: 0_level_0,Journal,Title,Year,Abstract,nlp_abstract
No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Management Review Quarterly,Determinants and effects of sustainable CEO co...,2019,Sustainability-oriented CEO compensation is be...,"(Sustainability, -, oriented, CEO, compensatio..."
2,Management Review Quarterly,A governance puzzle to be solved? A systematic...,2020,"To address global sustainability challenges, a...","(To, address, global, sustainability, challeng..."
3,Journal of Economics and Finance,The analysis of corporate governance policy an...,2016,The main purpose of this study is to investiga...,"(The, main, purpose, of, this, study, is, to, ..."
4,Journal of Economics and Finance,The impact of governance characteristics on th...,2014,The study examines the relationship between th...,"(The, study, examines, the, relationship, betw..."
5,Journal of Economics and Finance,Board independence and market reactions around...,2011,This study focuses on whether board independen...,"(This, study, focuses, on, whether, board, ind..."


In [48]:
# Check the number of abstracts
len(df2["nlp_abstract"])

530

In [51]:
# Create List of lists
abstracts_array = df2["nlp_abstract"].to_list()
print(abstracts_array[:5])
print(type(abstracts_array))

[['sustainability', 'orient', 'ceo', 'compensation', 'discuss', 'policy', 'maker', 'corporate', 'practice', 'academia', 'date', 'management', 'literature', 'yield', 'grow', 'body', 'empirical', 'result', 'determinant', 'effect', 'sustainable', 'ceo', 'compensation', 'primarily', 'empirical', 'study', 'analyze', 'extent', 'sustainability', 'relate', 'issue', 'determine', 'design', 'sustainable', 'ceo', 'compensation', 'sustainability', 'orient', 'ceo', 'compensation', 'impact', 'corporate', 'performance', 'scatter', 'nature', 'research', 'field', 'impede', 'overarching', 'empirical', 'substantiation', 'argument', 'favor', 'sustainable', 'ceo', 'compensation', 'structured', 'literature', 'review', 'address', 'gap', 'analyze', 'empirical', 'study', 'key', 'determinant', 'effect', 'sustainable', 'ceo', 'compensation', 'multi', 'level', 'analysis', 'contribute', 'discussion', 'sustainable', 'ceo', 'compensation', 'identify', 'central', 'empirical', 'insight', 'methodological', 'content', 'r

In [52]:
# Create flat list
abstracts_flat = [token for abstract in abstracts_array for token in abstract]
print(abstracts_flat[:10])

['sustainability', 'orient', 'ceo', 'compensation', 'discuss', 'policy', 'maker', 'corporate', 'practice', 'academia']


#### Computing n-grams

Gensim’s `Phrases` model detects frequently co-occurring word pairs (bigrams); applying it repeatedly yields trigrams, quadgrams and longer n-grams.
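
The idea behind phrase detection can be sketched in plain Python. The score below follows the default scoring formula documented for `Phrases`, `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`, in a simplified form: the vocabulary size counts unigrams only, and the helper names, toy documents and threshold are assumptions for illustration.

```python
from collections import Counter


def bigram_scores(documents, min_count=1):
    """Score adjacent word pairs: (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(token for doc in documents for token in doc)
    bigrams = Counter(pair for doc in documents for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


def merge_bigrams(document, scores, threshold):
    """Join adjacent tokens with "_" when their pair score exceeds the threshold."""
    merged, i = [], 0
    while i < len(document):
        pair = tuple(document[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(document[i])
            i += 1
    return merged


# Made-up tokenized documents
docs = [
    ["ceo", "compensation", "policy"],
    ["ceo", "compensation", "design"],
    ["policy", "design", "ceo", "compensation"],
]

scores = bigram_scores(docs, min_count=1)
print([merge_bigrams(doc, scores, threshold=0.4) for doc in docs])
# [['ceo_compensation', 'policy'], ['ceo_compensation', 'design'], ['policy', 'design', 'ceo_compensation']]
```

A lower `threshold` accepts more pairs as phrases, and a higher `min_count` requires a pair to occur more often before it is considered.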

In [53]:
# Load libraries and packages
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# Train bigram and trigram detectors
bigram_phrases = Phrases(abstracts_array, min_count=3, threshold = 2)
trigram_phrases = Phrases(bigram_phrases[abstracts_array], min_count=3, threshold = 2)

# Faster way to get a sentence formatted as a trigram/bigram
bigram = Phraser(bigram_phrases)
trigram = Phraser(trigram_phrases)

# Aggregate Uni-, Bi- and Tri-grams (one token list per abstract)
abstracts_array_new = [trigram[bigram[abstract]] for abstract in abstracts_array]



In [54]:
# View the new text corpus
abstracts_array_new[0]

['sustainability',
 'orient',
 'ceo_compensation',
 'discuss',
 'policy_maker',
 'corporate',
 'practice',
 'academia',
 'date',
 'management_literature',
 'yield',
 'grow_body',
 'empirical_result',
 'determinant',
 'effect',
 'sustainable_ceo_compensation',
 'primarily',
 'empirical',
 'study_analyze',
 'extent',
 'sustainability',
 'relate',
 'issue',
 'determine',
 'design',
 'sustainable_ceo_compensation',
 'sustainability',
 'orient',
 'ceo_compensation',
 'impact_corporate',
 'performance',
 'scatter',
 'nature',
 'research_field',
 'impede',
 'overarching',
 'empirical',
 'substantiation',
 'argument',
 'favor',
 'sustainable_ceo_compensation',
 'structured',
 'literature_review',
 'address',
 'gap',
 'analyze',
 'empirical_study',
 'key',
 'determinant',
 'effect',
 'sustainable_ceo_compensation',
 'multi',
 'level',
 'analysis',
 'contribute',
 'discussion',
 'sustainable_ceo_compensation',
 'identify',
 'central',
 'empirical',
 'insight',
 'methodological',
 'content',
 're

#### Removing low-value words with TF-IDF

TF-IDF, which stands for Term Frequency and Inverse Document Frequency, is a scoring measure widely used in information retrieval (IR) and summarization. TF-IDF is intended to reflect how relevant a term is in a given document. Words that occur in nearly every text receive a low score, which lets us identify and remove them.

Therefore we follow this **approach**: 
1. `Dictionary`: Create a dictionary as a mapping between words and their integer ids
2. `doc2bow`: Convert each document into the bag-of-words (BoW) format = list of (token_id, token_count) tuples.

In [55]:
# Import Dictionary: a mapping between words and their integer ids.
from gensim.corpora import Dictionary

# Create a Dictionary
dictionary = Dictionary(abstracts_array_new)

# Create a Text corpus
corpus = [dictionary.doc2bow(abstract) for abstract in abstracts_array_new]

# Print results
print("Number of unique tokens: %d" % len(dictionary))
print("Number of documents: %d" % len(corpus))

Number of unique tokens: 4753
Number of documents: 530


**Compute Tfidf-Model**

The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

In [56]:
# Import TfidfModel
from gensim.models import TfidfModel

# Instantiate TfidfModel
tfidf = TfidfModel(corpus, id2word = dictionary)

# Collect the ids of words with a low tf-idf weight, based on a threshold of 0.1
low_value = 0.1
low_value_words = []
for abstract in corpus:
    low_value_words += [token_id for token_id, value in tfidf[abstract] if value < low_value]

# View 10 low_value_words
low_value_words[:10]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 11]

These are the ids of the first 10 `low_value_words`

In [57]:
# View tokens and frequencies of the first 10 abstracts in the corpus
[[(dictionary[token_id], freq) for token_id, freq in abstract] for abstract in corpus[:10]]

[[('academia', 1),
  ('academic', 1),
  ('addition', 1),
  ('address', 1),
  ('analysis', 1),
  ('analyze', 1),
  ('area', 1),
  ('argument', 1),
  ('board', 1),
  ('central', 1),
  ('ceo_compensation', 3),
  ('content', 1),
  ('contribute', 1),
  ('corporate', 1),
  ('current', 1),
  ('date', 1),
  ('design', 2),
  ('determinant', 3),
  ('determine', 1),
  ('discuss', 1),
  ('discussion', 1),
  ('effect', 3),
  ('empirical', 3),
  ('empirical_result', 1),
  ('empirical_study', 1),
  ('evidence', 1),
  ('extent', 1),
  ('favor', 1),
  ('foci', 1),
  ('future_research', 1),
  ('gap', 1),
  ('grow_body', 1),
  ('identify', 1),
  ('impact_corporate', 1),
  ('impede', 1),
  ('insight', 1),
  ('investor', 1),
  ('issue', 1),
  ('key', 1),
  ('level', 1),
  ('literature_review', 1),
  ('management', 1),
  ('management_literature', 1),
  ('methodological', 1),
  ('multi', 1),
  ('nature', 1),
  ('orient', 3),
  ('overarching', 1),
  ('path', 1),
  ('performance', 1),
  ('policy_maker', 1),
  

Before running the Cluster-Analysis, we must take care of the low_value_words identified above. These words can be excluded using the `.filter_tokens()` method of gensim's `Dictionary`.

In [58]:
# Filter low value words out of the dictionary before running LDA
dictionary.filter_tokens(bad_ids = low_value_words)

Create a Bag-of-Words corpus, i.e. unique tokens as keys and frequencies as values.

In [59]:
# Recompute corpus now that low value words are filtered out
new_corpus = [dictionary.doc2bow(abstract) for abstract in abstracts_array_new]
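
One consequence of this filtering step is worth spelling out: the ids are pooled across all abstracts, so a token that falls below the threshold in any single abstract is removed from every abstract. The sketch below illustrates this with hypothetical weights and counts. The helper names are assumptions, and it keeps the original ids for readability, whereas gensim may renumber the remaining ids after filtering.

```python
def collect_low_value_ids(tfidf_corpus, low_value=0.1):
    """Pool the ids of all tokens whose weight is below the threshold in any document."""
    return {
        token_id
        for doc in tfidf_corpus
        for token_id, weight in doc
        if weight < low_value
    }


def drop_ids(corpus_bow, bad_ids):
    """Remove the flagged token ids from every document of a bag-of-words corpus."""
    return [[(tid, cnt) for tid, cnt in doc if tid not in bad_ids] for doc in corpus_bow]


# Hypothetical tf-idf weights and matching bag-of-words counts for two documents
toy_tfidf = [[(0, 0.05), (1, 0.80)], [(0, 0.60), (2, 0.75)]]
toy_bow = [[(0, 1), (1, 4)], [(0, 5), (2, 3)]]

bad_ids = collect_low_value_ids(toy_tfidf)
print(bad_ids)                     # {0}
print(drop_ids(toy_bow, bad_ids))  # [[(1, 4)], [(2, 3)]]
```

Token 0 is dropped from the second document as well, although its weight there (0.60) is well above the threshold. If that is not intended, the filter could be applied per document instead.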

## 2) Computing LdaModel

In [60]:
# Import LdaModel package
from gensim.models import LdaModel

# Instantiate and customize LdaModel
lda_model = LdaModel(corpus = new_corpus,
                     id2word = dictionary,
                     num_topics = 10,            # number of topics
                     update_every = 1,           # update the model after every chunk of documents (online learning)
                     random_state = 42,          # important for reproducibility
                     minimum_probability = 0.05) # topics with a probability lower than this threshold will be filtered out
         
           

# Show Model results
from pprint import pprint
pprint(lda_model.show_topics())


[(0,
  '0.011*"board_compensation" + 0.009*"share_price" + 0.007*"venture" + '
  '0.007*"seo" + 0.007*"hostile_takeover" + 0.007*"transnational_interlock" + '
  '0.006*"social_tie" + 0.006*"fraud_case" + 0.006*"continuity" + '
  '0.006*"prestigious"'),
 (1,
  '0.012*"ceo_pay" + 0.011*"foreign_subsidiary" + 0.010*"external_pressure" + '
  '0.010*"female_director" + 0.008*"second_layer" + 0.008*"elasticity" + '
  '0.008*"stakeholder_value" + 0.007*"wrongdoing" + 0.007*"layer" + '
  '0.007*"financial_fraud"'),
 (2,
  '0.021*"contracting" + 0.011*"faultline" + 0.009*"strategic_control" + '
  '0.008*"payout" + 0.008*"sustainable_ceo_compensation" + 0.008*"govern_firm" '
  '+ 0.008*"backdate_stock_option" + 0.008*"election_period" + '
  '0.008*"alliance" + 0.007*"organizational_discretion"'),
 (3,
  '0.012*"cross_list" + 0.012*"clawback_provision" + 0.011*"depositary" + '
  '0.009*"premium" + 0.008*"coalition" + 0.007*"proximity" + '
  '0.007*"corporate_governance_regime" + 0.006*"subprime_l

## 3) Visualize LDA Model

In [64]:
# Import pyLDAvis 
# !pip install pyLDAvis==2.1.2
import pyLDAvis.gensim

# Visualize LdaModel
pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim.prepare(lda_model, new_corpus, dictionary, R = 30) # arguments "R", "mds"

# Show the LDA-Model
vis



We see that the ten clusters are evenly distributed, both in terms of their distance to each other and in terms of their share of the total text corpus.

## 4) Distributing the Cluster-Analysis
Once the model has been created, we can save the visualization as an HTML file and share it with customers, management and colleagues.

In [65]:
# save_html writes the file to disk; its return value is not needed
pyLDAvis.save_html(vis, "/content/drive/MyDrive/Colab Notebooks/Portfolio_Projects/03_PhD_Analysen/04_NLP_CG_VBM/LDA_Cluster_Analysis_2022.html")