# Introduction

Before running the code, install Tesseract on your computer (https://github.com/UB-Mannheim/tesseract/wiki).

The following code consists of 4 blocks:

- Block 0 installs all the Python libraries that are needed to run the code. This part needs to be run only the very first time you use the code. __Once the libraries are installed, you do not need to reinstall them.__

- Block 1 defines the user-specific inputs. Here you need to input the path to the folder containing the PDF files you want to analyze and the keywords (search terms) that define your topic of interest. __You need to run this block every time you use this code.__

- Block 2 extracts the text from the PDF files. Once the text is extracted it will be stored as a .txt file. Accessing text from a .txt file is much faster than from a PDF file. __You need to run this part of code only when you are working on a new set of PDF files.__

- Block 3 selects relevant papers using TF-IDF and runs LDA. __You must run this block every time you want to analyze (or re-analyze) a corpus of papers.__


# Block 0 - Before starting, install the necessary Python libraries

In [None]:
#install necessary libraries
#this step is required only the very first time you use this code
import sys
!conda install --yes --prefix {sys.prefix} -c conda-forge poppler
!{sys.executable} -m pip install --user pdf2image
!conda install --yes --prefix {sys.prefix} -c conda-forge pytesseract
!{sys.executable} -m pip install --user opencv-python
!{sys.executable} -m pip install --user pyldavis
!{sys.executable} -m pip install -U pip setuptools wheel
!{sys.executable} -m pip install -U spacy
!{sys.executable} -m spacy download en_core_web_sm

# Block 1 - Define user specific inputs (INPUT REQUIRED)

Important: in the `#USER INPUTS` section of the code cell below, replace the path with the path to your own folder of PDF files, and replace the search terms (keywords) with your own.
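
The user-input cell below expects forward slashes and, because file names are appended directly to `path_to_folder`, a trailing slash. If you copy a path from Windows Explorer, it can be converted first. This is a minimal sketch using only the standard library; `normalize_folder_path` is a hypothetical helper that is not part of the notebook:

```python
from pathlib import PureWindowsPath

def normalize_folder_path(raw_path):
    """Return raw_path with forward slashes and exactly one trailing slash."""
    # PureWindowsPath understands both "\" and "/" separators on any operating system
    posix_style = PureWindowsPath(raw_path).as_posix()
    return posix_style.rstrip('/') + '/'

# a raw string (r'...') keeps the backslashes from being read as escape sequences
print(normalize_folder_path(r'C:\Users\Desktop\Literature_review'))
# C:/Users/Desktop/Literature_review/
```

The returned string can then be used as the value of `path_to_folder` in the cell below.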

In [None]:
import pandas as pd
import os
from nltk import SnowballStemmer
import warnings
warnings.filterwarnings('ignore')

#USER INPUTS
#folder where the PDF files of all the papers are stored;
#make sure the folder contains no files other than the PDFs of the papers you are interested in
path_to_folder='C:/Users/Desktop/Literature_review/' #INPUT HERE: use forward slashes ("/") and end the path with "/"; backslashes ("\") will cause an error

#list of words used to search for papers
search_keywords=['innovation','innovations','new product','new products', 'culture','cultures'] #INPUT HERE


#Stemming keywords - this step is important to identify relevant papers later
stemmer = SnowballStemmer('english')
keyword_stems=[]
for keyword in search_keywords:
    stems=[stemmer.stem(token) for token in keyword.split()]
    for s in stems:
        keyword_stems.append(s)

print(set(keyword_stems))

Stemming reduces the variation in the terms that papers use to refer to the concept you are interested in. The stems are used later to distinguish relevant from irrelevant papers. Without stemming, "innovation", "innovations", "innovative", "innovate" and so on are completely different strings to the computer, whereas a human reader sees that they all signal a discussion of "innovation". Stemming lets the computer capture this commonality by mapping such variants to a shared stem.

Read the list of stems printed above and decide which stems, or combinations of stems, best represent what you are interested in. __You must now input the final list of stems you consider relevant.__ If a concept is represented by a combination of stems, add it to the list as "stem1_stem2", for example "new_product" in our case.
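
The "stem1_stem2" convention can be sketched as follows. The `TOY_STEMS` lookup table is a hypothetical stand-in for the Snowball stemmer used above, so that the example runs without NLTK, and `keyword_to_stem_token` is likewise an illustrative helper rather than part of the notebook:

```python
# Hypothetical lookup table standing in for SnowballStemmer('english').stem
TOY_STEMS = {
    'innovation': 'innov', 'innovations': 'innov',
    'new': 'new', 'product': 'product', 'products': 'product',
    'culture': 'cultur', 'cultures': 'cultur',
}

def toy_stem(word):
    """Look the word up in the toy table; unknown words are returned unchanged."""
    return TOY_STEMS.get(word, word)

def keyword_to_stem_token(keyword, stem=toy_stem):
    """Stem every word of a keyword and join the stems with underscores."""
    return '_'.join(stem(word) for word in keyword.lower().split())

search_keywords = ['innovation', 'innovations', 'new product', 'new products', 'culture', 'cultures']
candidate_stems = sorted({keyword_to_stem_token(k) for k in search_keywords})
print(candidate_stems)  # ['cultur', 'innov', 'new_product']
```

In the notebook itself, any callable that maps a word to its stem (such as `stemmer.stem` from the cell above) could be passed as `stem`.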

In [None]:
keywords_stems=['innov','cultur','new_product'] #INPUT HERE: replace with the stems you are interested in

#get the list of PDF files in the folder
papers_list = []
for file in os.listdir(path_to_folder):
    if file.endswith('.pdf'):
        papers_list.append(file)
papers_ann = pd.DataFrame(papers_list, columns=['File Name'])

#generate folders where outputs will be stored
folders_names=["Pages_to_images","Results","Txt_files"]
for folder in folders_names:
    if not os.path.exists(os.path.join(path_to_folder, folder)):
        os.mkdir(os.path.join(path_to_folder, folder))

# Block 2 - Extracting text from PDF files

The first step in analyzing papers is extracting text from the PDFs. In our experience, the most reliable way to do this is to first convert each page into an image and only then retrieve the text from each image using optical character recognition (OCR). Although this is more computationally intensive, OCR libraries produce, in our experience, more accurate results than PDF-reading libraries.

The following code snippet converts each page of the PDFs to an image. Each image is saved in the same folder as the PDF files and is named after the PDF file name and the page number (counting from 0).


In [None]:
from pdf2image import convert_from_path

def convert_pdf_to_images(pdf_path):
    images = convert_from_path(pdf_path)
    for index, image in enumerate(images):
        image.save(f'{pdf_path[0: -4]}-{index}.png')
        

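#search the drive that contains path_to_folder for the Tesseract executable (this can take a while)
drive_root = path_to_folder[:2] + "\\"
tess_path = None
for r,d,f in os.walk(drive_root):
    if "tesseract.exe" in f:
        tess_path = os.path.join(r,"tesseract.exe")
        break #stop walking the drive as soon as the executable is found
if tess_path is None:
    raise FileNotFoundError("tesseract.exe not found: install Tesseract or set tess_path manually")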

for p in range(len(papers_ann)):
    path = path_to_folder + papers_ann['File Name'][p]
    print(papers_ann['File Name'][p])
    convert_pdf_to_images(path)
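
The naming scheme used by `convert_pdf_to_images` can be illustrated on its own. This is a minimal sketch using only the standard library; `page_image_path` is a hypothetical helper, and it uses `os.path.splitext` instead of the `pdf_path[0: -4]` slice, which is an alternative that does not depend on the extension being exactly four characters long:

```python
import os

def page_image_path(pdf_path, page_index):
    """Build the PNG path for one page: '<PDF path without extension>-<page index>.png'."""
    base, _extension = os.path.splitext(pdf_path)
    return f'{base}-{page_index}.png'

# page indices start at 0 because the notebook numbers pages with enumerate()
print(page_image_path('C:/Users/Desktop/Literature_review/paper1.pdf', 0))
# C:/Users/Desktop/Literature_review/paper1-0.png
```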



The following code snippet extracts text from each of the previously created images and generates two .txt files for each paper. The first one is named after the original PDF file and contains the entire main body of the paper. The second one is named as "[original PDF file name]-ref.txt" and contains the list of references in the paper. 

If the paper’s main body does not start on the first page, the code will include the cover page information (e.g., title, author(s), source, published by, stable URL, etc.) in the main body .txt file. In our sample, 75% of the papers start on the first page. Hence, this approach substantially reduces the amount of manual work needed, while adding only a negligible amount of irrelevant text to a minority of papers.

We separate the reference list from the main text because reference lists might affect word counts. For example, in Hurley and Hult (1998) “innovation” is mentioned 156 times including the reference list and 127 times excluding it. 
Reference lists are saved because they can help identify other potentially relevant papers. If you are certain that all the papers use the same citation style, the reference files can be used to generate useful statistics. Because each reference in the reference list file appears on a new line, it is easy to import the references into Excel and work on them as a list. We did not build this functionality into the code because online tools that analyze citation networks already exist (see, e.g., https://www.connectedpapers.com/).
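
The effect of separating the reference list on word counts can be sketched with the standard library alone. The page texts below are made up, `split_references` and `count_innovation` are hypothetical helpers, and the sketch assumes that the reference list starts on the last page on which a "References" heading appears:

```python
import re

REFERENCE_HEADING = re.compile(r'R[eE][fF][eE][rR][eE][nN][cC][eE][sS]')

def split_references(pages):
    """Split a list of page texts into (main_body, reference_list)."""
    # scan from the last page backwards and split at the first heading on that page
    for idx in range(len(pages) - 1, -1, -1):
        parts = REFERENCE_HEADING.split(pages[idx], maxsplit=1)
        if len(parts) == 2:
            main_body = ' '.join(pages[:idx] + [parts[0]])
            references = ' '.join([parts[1]] + pages[idx + 1:])
            return main_body, references
    return ' '.join(pages), ''  # no heading found: everything is main body

def count_innovation(text):
    """Count case-insensitive occurrences of 'innovation'."""
    return len(re.findall(r'innovation', text.lower()))

pages = [  # made-up page texts
    'Innovation matters. Firms pursue innovation.',
    'We conclude. References Smith (2000) Innovation studies.',
    'Jones (2001) Product innovation.',
]
main_body, references = split_references(pages)
print(count_innovation(' '.join(pages)), count_innovation(main_body))  # 4 2
```

In this toy example, half of the mentions come from the reference list, which is why the main body and the references are stored in separate files.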


In [None]:
import cv2
import pytesseract
import numpy as np
import re
#import os.path
import pandas as pd

for p in range(len(papers_ann)):
    path=path_to_folder+ papers_ann['File Name'][p]
    print(papers_ann['File Name'][p])
    pagestxt = []
    i=0
    while os.path.exists(f'{path[0: -4]}-{str(i)}.png') == True:
        img_path=f'{path[0: -4]}-{str(i)}.png'
        #print(img_path)
        img = cv2.imread(img_path)


        kernel = np.ones((2, 1), np.uint8)
        img = cv2.erode(img, kernel, iterations=1)
        img = cv2.dilate(img, kernel, iterations=1)
        pytesseract.pytesseract.tesseract_cmd = tess_path
        out_below = pytesseract.image_to_string(img)
        pagestxt.append(out_below)
        #print(pagestxt)
        i+=1
    #scan the pages from last to first and split the paper at the last page that contains a "References" heading
    papertext = pagestxt  #default: no heading found, so the whole text is treated as main body
    reftxt = None
    for idx in range(len(pagestxt)-1, -1, -1):
        a = re.split(r'R[eE][fF][eE][rR][eE][nN][cC][eE][sS]', pagestxt[idx], maxsplit=1)
        if len(a) != 1:
            papertext = pagestxt[:idx] + [a[0]]   #pages before the heading plus the text preceding it
            reftxt = [a[1]] + pagestxt[idx+1:]    #text following the heading plus all later pages
            break

    txt_path = f'{path[0: -4]}.txt'
    with open(txt_path,"w+") as file2:
        file2.write(" ".join(papertext))

    #The following 4 lines generate the reference file; if you don't want to store references, you can comment them out.
    if reftxt is not None:
        ref_path = f'{path[0: -4]}-ref.txt'
        with open(ref_path,"w+") as file3:
            file3.write(" ".join(reftxt))


In [None]:
# moves images and txts to dedicated folders to declutter the original folder
import shutil
source = os.listdir(path_to_folder)
destination = path_to_folder+'Pages_to_images'
destination1 = path_to_folder+"Txt_files"
for files in source:
    if files.endswith('.png'):
        shutil.move(os.path.join(path_to_folder,files), os.path.join(destination,files))
    elif files.endswith('.txt'):
        shutil.move(os.path.join(path_to_folder,files), os.path.join(destination1,files))
        
papers_ann.to_excel(path_to_folder+"Results/"+'Papers_list.xlsx')

# Block 3 - Selecting relevant papers using TF-IDF and running LDA
## Step 3.1 - Compute TF-IDF values 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.parsing.preprocessing import remove_stopwords
from nltk import SnowballStemmer
#import pandas as pd
import string
import spacy
import numpy as np
from nltk import BigramCollocationFinder, BigramAssocMeasures,pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer,snowball

#COLLOCATIONS
def join_collocations(element, known_collocations):
    result = []
    is_collocation = False
    current_chain = []
    for i, w in enumerate(element):
        if i < len(element) - 1 and (w, element[i + 1]) in known_collocations:
            if current_chain == []:
                current_chain = [w, element[i + 1]]
            else:
                current_chain.append(element[i+1])
            is_collocation = True
        else:
            if is_collocation:
                result.append('_'.join(current_chain))
                current_chain = []
            else:
                result.append(w)
            is_collocation = False
    return result

def apply_collocations(sentence, set_colloc):
    for b1,b2 in set_colloc:
        sentence = sentence.replace("%s %s" % (b1 ,b2), "%s_%s" % (b1 ,b2))
    return sentence


nlp = spacy.load('en_core_web_sm')
documents=[]

for i in range(len(papers_ann)):
    path = path_to_folder +"Txt_files/"+ papers_ann['File Name'][i]
    print(papers_ann['File Name'][i])
    txt_path = f'{path[0: -4]}.txt'
    with open(txt_path,'r') as a: #the file is closed automatically after reading
        documents.append(a.read())
  
exclist = string.punctuation + string.digits
stemmer = SnowballStemmer('english')
for i in range(len(documents)):
    # remove punctuations and digits from oldtext
    table_ = str.maketrans('', '', exclist)
    documents[i] = documents[i].translate(table_)
    documents[i]=remove_stopwords(documents[i].lower())
    documents[i]=" ".join([token.lemma_ for token in nlp(documents[i])])
    documents[i] = " ".join([stemmer.stem(token) for token in documents[i].split()])
    words=[token for token in documents[i].split()]

    finder = BigramCollocationFinder.from_words(words)
    bgm = BigramAssocMeasures()
    collocations = [b for b, f in finder.score_ngrams(bgm.mi_like) if f > 1]

    
    col = pd.DataFrame({'col': collocations})
    documents[i] = apply_collocations(documents[i], set_colloc=collocations)

vect = TfidfVectorizer(analyzer='word', sublinear_tf=True)
tfidf_matrix = vect.fit_transform(documents)
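#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
feature_names = vect.get_feature_names_out() if hasattr(vect, 'get_feature_names_out') else vect.get_feature_names()
df = pd.DataFrame(tfidf_matrix.toarray(), columns = feature_names)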


## Step 3.2 - Identify relevant papers according to TF-IDF values

In [None]:
from itertools import combinations

for k in keywords_stems:
    #tf-idf of stem k in each paper (0 if the stem never appears in the corpus)
    papers_ann['tfidf_'+k] = df[k].values if k in df.columns else 0.0
    #a paper is relevant for k if its tf-idf value is at least the mean over all papers
    papers_ann['relevant_'+k] = np.where(papers_ann['tfidf_'+k] >= papers_ann['tfidf_'+k].mean(), 'Y', 'N')

res = list(combinations(keywords_stems, 2))
for r in res:
    #a paper is relevant for a pair of stems if it is relevant for at least one of the two
    above_mean_0 = papers_ann['tfidf_'+r[0]] >= papers_ann['tfidf_'+r[0]].mean()
    above_mean_1 = papers_ann['tfidf_'+r[1]] >= papers_ann['tfidf_'+r[1]].mean()
    papers_ann['relevant_'+str(r)] = np.where(above_mean_0 | above_mean_1, 'Y', 'N')


papers_ann.to_excel(path_to_folder +'Results/'+'tfidf-results.xlsx')
papers_ann.head()

## Step 3.3 -  Topic Modeling using LDA (INPUT REQUIRED)

In this stage you need to select which criterion (e.g., which tf-idf threshold) you find more informative for distinguishing relevant vs. irrelevant papers. To validate our approach, we first manually annotated papers and then compared tf-idf results to manual annotation, finding that the combination ("innov", "new_product") was the most accurate. Hence, based on our experience, we suggest the following when determining which criteria to focus on: 
1) combinations of stems are likely to provide more comprehensive lists; 
2) focus on combinations of synonyms of the concept; and 
3) avoid considering all your stems together.

If you do not know which criterion to choose, you can run the code multiple times, using a different criterion each time, and then check which criterion produces the results you find most informative for your research.
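
The criteria compared here are the "relevant_..." columns computed in Step 3.2: a paper is flagged for a single stem when its tf-idf value is at least the sample mean, and for a pair of stems when it passes that test for at least one of the two. This is a minimal pure-Python sketch with made-up tf-idf values; the function names are illustrative and do not appear in the notebook:

```python
from statistics import mean

def flag_relevant(scores):
    """Flag each score 'Y' if it is at least the mean of all scores, else 'N'."""
    threshold = mean(scores)
    return ['Y' if s >= threshold else 'N' for s in scores]

def flag_relevant_pair(scores_a, scores_b):
    """Flag 'Y' if a paper passes the mean threshold for at least one of two stems."""
    flags_a, flags_b = flag_relevant(scores_a), flag_relevant(scores_b)
    return ['Y' if 'Y' in (a, b) else 'N' for a, b in zip(flags_a, flags_b)]

# made-up tf-idf values for four papers
tfidf_innov = [0.5, 0.0, 0.25, 0.0]        # mean 0.1875
tfidf_new_product = [0.0, 0.5, 0.0, 0.0]   # mean 0.125

print(flag_relevant(tfidf_innov))                          # ['Y', 'N', 'Y', 'N']
print(flag_relevant(tfidf_new_product))                    # ['N', 'Y', 'N', 'N']
print(flag_relevant_pair(tfidf_innov, tfidf_new_product))  # ['Y', 'Y', 'Y', 'N']
```

The pair criterion flags more papers than either single stem, which illustrates why combinations of stems tend to give more comprehensive lists.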


In [None]:
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
try:
    import pyLDAvis.lda_model #pyLDAvis >= 3.4.0
except ImportError:
    import pyLDAvis.sklearn #pyLDAvis < 3.4.0
import matplotlib.pyplot as plt

# Import Data

listofpapers=[]

#INPUT HERE: replace the column name with the criterion you think is most relevant
#reset_index() returns a new table, so papers_ann itself is left unchanged
papers_rel=papers_ann[papers_ann["relevant_('innov', 'new_product')"]=="Y"].reset_index()
print(len(papers_rel.index))
papers_rel.to_excel(path_to_folder +'Results/'+'paper_rel.xlsx')
papers_rel.head()

In [None]:
for i in range(len(papers_rel)):
    path = path_to_folder +"Txt_files/"+ papers_rel['File Name'][i]
    txt_path = f'{path[0: -4]}.txt'
    with open(txt_path,'r') as a: #the file is closed automatically after reading
        listofpapers.append(a.read())
print(len(listofpapers))

# Rejoin words hyphenated across line breaks, then replace new line characters with spaces
data = listofpapers
data = [sent.replace('-\n', '') for sent in data]
data = [sent.replace('\n', ' ') for sent in data]
# Collapse repeated whitespace
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # tokenizes and lowercases, dropping punctuation; deacc=True also removes accent marks

data_words = list(sent_to_words(data))



def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] and len(token.lemma_)>2 else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

# Load the spaCy English model, disabling the parser and NER components (for efficiency)
# The model is installed in Block 0 (python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])



vectorizer = CountVectorizer(analyzer='word',
                             min_df=10,                        # keep only words that appear in at least 10 documents
                             #max_df=2000,
                             stop_words='english',             # remove stop words
                             lowercase=True,                   # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}',  # keep tokens of at least 3 characters
                             # max_features=50000,             # max number of unique words
                            )

data_vectorized = vectorizer.fit_transform(data_lemmatized)
voc = vectorizer.vocabulary_
print(voc)


# Materialize the sparse data
data_dense = data_vectorized.todense()

# Compute Sparsicity = Percentage of Non-Zero cells
print("Sparsicity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")

# Build LDA Model
lda_model = LatentDirichletAllocation(n_components=10,               # Number of topics
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )
lda_output = lda_model.fit_transform(data_vectorized)

print(lda_model)  # Model attributes

# Log Likelihood: the higher, the better
print("Log Likelihood: ", lda_model.score(data_vectorized))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))

# See model parameters
pprint(lda_model.get_params())

# Define Search Param
search_params = {'n_components': [2,3, 5, 7, 10, 12, 15,20], 'learning_decay': [.5, .7, .9]}

# Init the Model
lda = LatentDirichletAllocation()

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params, n_jobs=-1,cv=10)

# Do the Grid Search
model.fit(data_vectorized)

# How to see the best topic model and its parameters?
# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))


# Compare LDA Model Performance Scores
# Get Log Likelihoods from Grid Search Output
n_topics = [2,3, 5, 7, 10, 12, 15,20]
grids=pd.DataFrame(model.cv_results_)

log_likelyhoods_5 = list(grids[grids['param_learning_decay']==0.5]['mean_test_score'])
log_likelyhoods_7 = list(grids[grids['param_learning_decay']==0.7]['mean_test_score'])
log_likelyhoods_9 = list(grids[grids['param_learning_decay']==0.9]['mean_test_score'])
# Show graph
plt.figure(figsize=(12, 8))
plt.plot(n_topics, log_likelyhoods_5, label='0.5')
plt.plot(n_topics, log_likelyhoods_7, label='0.7')
plt.plot(n_topics, log_likelyhoods_9, label='0.9')
plt.title("Choosing Optimal LDA Model")
plt.xlabel("Num Topics")
plt.ylabel("Log Likelihood Scores")
plt.legend(title='Learning decay', loc='best')
plt.show()

# How to see the dominant topic in each document?

# Create Document - Topic Matrix
lda_output = best_lda_model.transform(data_vectorized)

# column names
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]

# index names
docnames = [ str(i) for i in range(len(data))]
# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)


df_document_topic.to_excel(path_to_folder+'Results/'+"main_topics-dominant_topics.xlsx")
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics

#Review topics distribution across documents
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

#Visualize the LDA model
#pyLDAvis >= 3.4.0 renamed pyLDAvis.sklearn to pyLDAvis.lda_model
try:
    import pyLDAvis.lda_model as lda_vis
except ImportError:
    import pyLDAvis.sklearn as lda_vis
pyLDAvis.enable_notebook()
panel = lda_vis.prepare(best_lda_model, data_vectorized, vectorizer,mds='tsne',sort_topics=False)
pyLDAvis.save_html(panel, path_to_folder+'Results/'+'lda.html')
panel

all_topics = {}
num_terms = 10 # Adjust number of words to represent each topic
lambd = 0.6 # Adjust this accordingly based on tuning above
for i in range(1,3): #Adjust this to reflect number of topics chosen for final LDA model
    topic = panel.topic_info[panel.topic_info.Category == 'Topic'+str(i)].copy()
    topic['relevance'] = topic['loglift']*(1-lambd)+topic['logprob']*lambd
    all_topics['Topic '+str(i)] = topic.sort_values(by='relevance', ascending=False).Term[:num_terms].values
    
pd.DataFrame(all_topics).T

# Topic’s keywords
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)

# Assign Column and Index
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
df_topic_keywords.columns = vectorizer.get_feature_names_out() if hasattr(vectorizer, 'get_feature_names_out') else vectorizer.get_feature_names()
df_topic_keywords.index = topicnames

# View
df_topic_keywords.head()

# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    #get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() when available
    feature_names = vectorizer.get_feature_names_out() if hasattr(vectorizer, 'get_feature_names_out') else vectorizer.get_feature_names()
    keywords = np.array(feature_names)
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=10)        

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords.to_excel(path_to_folder+'Results/'+"top10words.xlsx", sheet_name="Main Topics")
df_topic_keywords

## Step 3.4 - Subtopics for loop (optional)
You now have the topics for your entire sample stored in the Results folder. You can stop your analysis here, if you find these results informative enough.
Alternatively, the following code snippet allows you to identify subtopics within the sets of papers allocated to each topic.
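
The loop in the next cell starts by grouping papers according to their dominant topic. To illustrate that grouping step only (not the full subtopic analysis), here is a minimal sketch with made-up topic weights; `group_by_dominant_topic` is a hypothetical helper that does not appear in the notebook:

```python
from collections import defaultdict

def group_by_dominant_topic(doc_topic_weights):
    """Map each topic index to the documents for which that topic has the largest weight."""
    groups = defaultdict(list)
    for doc_index, weights in enumerate(doc_topic_weights):
        # on ties the first maximum wins, which matches the behavior of np.argmax
        dominant = max(range(len(weights)), key=weights.__getitem__)
        groups[dominant].append(doc_index)
    return dict(groups)

# made-up document-topic weights (rows: documents, columns: topics)
doc_topic_weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
print(group_by_dominant_topic(doc_topic_weights))
# {0: [0, 2], 1: [1], 2: [3]}
```

Each group of documents is then modeled separately, so a topic with very few papers may be too small for the grid search used in the cell below.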

In [None]:
topics_list=df_document_topic['dominant_topic'].unique() #get the list of topics identified in the sample
for tpc in topics_list:
    list_index_topic0 = df_document_topic[df_document_topic['dominant_topic']==tpc]
    list_index_topic0.reset_index(inplace=True)
    list_index_topic0.head()
    docs_in_topic0=list(list_index_topic0['index'])
    print(docs_in_topic0)
    print(len(docs_in_topic0))
    
    
    data_topic1=[]
    for i in docs_in_topic0:
        data_topic1.append(data_lemmatized[int(i)])

    vectorizer = CountVectorizer(analyzer='word',
                                 min_df=1,                         # keep words that appear in at least 1 document
                                 stop_words='english',             # remove stop words
                                 lowercase=True,                   # convert all words to lowercase
                                 token_pattern='[a-zA-Z0-9]{3,}',  # keep tokens of at least 3 characters
                                 # max_features=50000,             # max number of unique words
                                )

    data_vectorized = vectorizer.fit_transform(data_topic1)
    
    
    data_dense = data_vectorized.todense()

    # Compute Sparsicity = Percentage of Non-Zero cells
    print("Sparsicity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")

    # Build LDA Model
    lda_model = LatentDirichletAllocation(n_components=10,               # Number of topics
                                          max_iter=10,               # Max learning iterations
                                          learning_method='online',   
                                          random_state=100,          # Random state
                                          batch_size=128,            # n docs in each learning iter
                                          evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                          n_jobs = -1,               # Use all available CPUs
                                         )
    lda_output = lda_model.fit_transform(data_vectorized)

    print(lda_model)  # Model attributes
    
    # Log Likelihood: the higher, the better
    print("Log Likelihood: ", lda_model.score(data_vectorized))

    # Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
    print("Perplexity: ", lda_model.perplexity(data_vectorized))

    # See model parameters
    pprint(lda_model.get_params())
    
    # Define Search Param
    search_params = {'n_components': [2,3, 5, 7, 10, 12, 15,20], 'learning_decay': [.5, .7, .9]}

    # Init the Model
    lda = LatentDirichletAllocation()

    # Init Grid Search Class
    model = GridSearchCV(lda, param_grid=search_params, n_jobs=-1,cv=10)

    # Do the Grid Search
    model.fit(data_vectorized)
    
    #How to see the best topic model and its parameters?
    # Best Model
    best_lda_model = model.best_estimator_

    # Model Parameters
    print("Best Model's Params: ", model.best_params_)

    # Log Likelihood Score
    print("Best Log Likelihood Score: ", model.best_score_)

    # Perplexity
    print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))
    
    # Compare LDA Model Performance Scores
    # Get Log Likelihoods from Grid Search Output
    n_topics = [2, 3, 5, 7, 10, 12, 15,20]
    grids=pd.DataFrame(model.cv_results_)
    log_likelyhoods_5 = list(grids[grids['param_learning_decay']==0.5]['mean_test_score'])
    log_likelyhoods_7 = list(grids[grids['param_learning_decay']==0.7]['mean_test_score'])
    log_likelyhoods_9 = list(grids[grids['param_learning_decay']==0.9]['mean_test_score'])
    # Show graph
    plt.figure(figsize=(12, 8))
    plt.plot(n_topics, log_likelyhoods_5, label='0.5')
    plt.plot(n_topics, log_likelyhoods_7, label='0.7')
    plt.plot(n_topics, log_likelyhoods_9, label='0.9')
    plt.title("Choosing Optimal LDA Model")
    plt.xlabel("Num Topics")
    plt.ylabel("Log Likelihood Scores")
    plt.legend(title='Learning decay', loc='best')
    plt.show()
    
    # How to see the dominant topic in each document?
    # Create Document - Topic Matrix
    lda_output = best_lda_model.transform(data_vectorized)

    # column names
    topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]

    # index names
    #docnames = ["Doc" + str(i) for i in range(len(data))]
    docnames = [ docs_in_topic0[i] for i in range(len(data_topic1))]
    # Make the pandas dataframe
    df_document_topic1 = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

    # Get dominant topic for each document
    dominant_topic1 = np.argmax(df_document_topic1.values, axis=1)
    df_document_topic1['dominant_topic'] = dominant_topic1

    # Styling
    def color_green(val):
        color = 'green' if val > .1 else 'black'
        return 'color: {col}'.format(col=color)

    def make_bold(val):
        weight = 700 if val > .1 else 400
        return 'font-weight: {weight}'.format(weight=weight)

    # Apply Style
    df_document_topic1.to_excel(path_to_folder+'Results/'+"dominant_topics-subtopic"+str(tpc)+".xlsx")
    df_document_topics1 = df_document_topic1.head(20).style.applymap(color_green).applymap(make_bold)
    df_document_topics1
    
    # Review topics distribution across documents
    df_topic_distribution1 = df_document_topic1['dominant_topic'].value_counts().reset_index(name="Num Documents")
    df_topic_distribution1.columns = ['Topic Num', 'Num Documents']
    df_topic_distribution1
    
    # How to visualize the LDA model with pyLDAvis?
    #pyLDAvis >= 3.4.0 renamed pyLDAvis.sklearn to pyLDAvis.lda_model
    try:
        import pyLDAvis.lda_model as lda_vis
    except ImportError:
        import pyLDAvis.sklearn as lda_vis
    pyLDAvis.enable_notebook()
    panel = lda_vis.prepare(best_lda_model, data_vectorized, vectorizer,mds='tsne',sort_topics=False)
    pyLDAvis.save_html(panel, path_to_folder+'Results/'+'subtopics_within_'+str(tpc)+'.html')
    panel
    
    
    all_topics = {}
    num_terms = 10 # Adjust number of words to represent each topic
    lambd = 0.6 # Adjust this accordingly based on tuning above
    for i in range(1,3): #Adjust this to reflect number of topics chosen for final LDA model
        topic = panel.topic_info[panel.topic_info.Category == 'Topic'+str(i)].copy()
        topic['relevance'] = topic['loglift']*(1-lambd)+topic['logprob']*lambd
        all_topics['Topic '+str(i)] = topic.sort_values(by='relevance', ascending=False).Term[:num_terms].values
        
    pd.DataFrame(all_topics).T
    
    
    # Topic’s keywords
    # Topic-Keyword Matrix
    df_topic_keywords = pd.DataFrame(best_lda_model.components_)

    # Assign Column and Index
    #get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
    df_topic_keywords.columns = vectorizer.get_feature_names_out() if hasattr(vectorizer, 'get_feature_names_out') else vectorizer.get_feature_names()
    df_topic_keywords.index = topicnames

    # View
    df_topic_keywords.head()
    
    
    # Show top n keywords for each topic
    def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
        #get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() when available
        feature_names = vectorizer.get_feature_names_out() if hasattr(vectorizer, 'get_feature_names_out') else vectorizer.get_feature_names()
        keywords = np.array(feature_names)
        topic_keywords = []
        for topic_weights in lda_model.components_:
            top_keyword_locs = (-topic_weights).argsort()[:n_words]
            topic_keywords.append(keywords.take(top_keyword_locs))
        return topic_keywords

    topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=10)        

    # Topic - Keywords Dataframe
    df_topic_keywords = pd.DataFrame(topic_keywords)
    df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
    df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
    #print(topic)
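    #one file per topic, so that neither the main-topic results nor earlier subtopic results are overwritten
    df_topic_keywords.to_excel(path_to_folder+'Results/'+"top10words-subtopics_"+str(tpc)+".xlsx", sheet_name="Subtopics_"+str(tpc))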
    #df_topic_keywords