# Topic Extraction from PDF using NLP and Topic Modeling

This notebook demonstrates the process of extracting topics from a single PDF document using a combination of natural language processing (NLP) and topic modeling techniques. The workflow integrates several libraries, including `gensim` for topic modeling, `PyMuPDF` for PDF parsing, `spaCy` for preprocessing and tokenization, and `pyLDAvis` for interactive topic visualization.

## Key Steps

1. **PDF Parsing**  
   The notebook uses `PyMuPDF` to read the contents of the PDF file, extracting raw text from its pages.

2. **Text Preprocessing**  
   The extracted text is cleaned and processed using `spaCy`, which includes tokenization, lemmatization, and removal of stop words, punctuations, and irrelevant tokens.

3. **Topic Modeling**  
   Using the `gensim` library, the preprocessed text is transformed into a bag-of-words representation and analyzed through Latent Dirichlet Allocation (LDA) to identify latent topics within the document.

4. **Visualization**  
   The results of the topic modeling are visualized using `pyLDAvis`, allowing users to interact with and explore the discovered topics and their key terms.

## Outputs
- **Preprocessed Text:** The cleaned version of the PDF content after NLP processing.  
- **Topic Models:** A set of topics generated by LDA, including the most representative words for each topic.  
- **Interactive Visualization:** An interactive plot to explore the relationships between topics and their distributions.

This notebook provides a structured approach to extract insights from unstructured text documents, making it useful for applications like summarization, content analysis, and knowledge discovery.


In [1]:
from src.pdf_reader import *
from src.config import *
import os

reader_config = ReaderConfig()
pdf_reader = PdfReader(reader_config)

import spacy
nlp = spacy.load('en_core_web_trf')
nlp.max_length = 3000000

  return torch._C._cuda_getDeviceCount() > 0
  model.load_state_dict(torch.load(filelike, map_location=device))


In [2]:
folders = os.listdir("./data")
folders.sort()
for folder in folders:
    print(folder)

.DS_Store
Abstracts 1986-2005
ISOF 2003 Papers
ISOF 2004 Papers
ISOF 2005 Papers
ISOF 2006 Papers
ISOF 2007 Papers
ISOF 2008 Papers
ISOF 2009 Papers
ISOF 2010 Papers
ISOF 2011 Papers
ISOF 2012 Papers
ISOF 2013 Papers
ISOF 2014 Papers
ISOF 2015 Papers
ISOF 2016 Papers
ISOF 2017 Papers
ISOF 2018 Papers
ISOF 2019 Papers
ISOF 2021 Papers
ISOF 2022 Papers
ISOF 2023 Papers
ISOF 2024 Papers
Icon


In [3]:
import re
directory = "./data/ISOF 2021 Papers"
year = re.findall(r'\d+', directory)[0]
print(year)

2021


In [221]:
files = []
directory = os.path.join(os.path.expanduser("~"), directory)
if os.path.isdir(directory):
    for file in os.listdir(os.path.join(os.path.expanduser("~"), directory)):
        if file.endswith(".pdf"):
            files.append(os.path.join(directory,file))
files.sort(key = lambda x: x.split(os.sep)[-1])
for i, file in enumerate(files):
    print(i, file.split(os.sep)[-1])

0 20_10_Incivility_FranchiseOperationsManuals.pdf
1 20_11_THE FAITHLESS FRANCHISOR.pdf
2 20_12_Monopoly Control.pdf
3 20_1_The reasonable person in standard form franchise agreements – the Australian perspective.pdf
4 20_2_What does it mean to be a franchisee.pdf
5 20_3_ To do or to teach.pdf
6 20_4_Franchisors Communication of Risk and Return.pdf
7 20_5_The Impact of Ownership Exploratory Case Studies in Governance of Franchise Retail Organizations.pdf
8 20_6_Perfomance Implications of authorative contractual and normative control mechanisms.pdf
9 20_7_Multi-brand Franchisees.pdf
10 20_8_TC_The didn't give a Frappe Teaching case of Retail Food Group.pdf
11 20_9_Determiners of the franchise model failure empirical results in Mexico.pdf
12 21_10_Stepping-up? The UNCITRAL and the developement of mandatory disclosure regulation.pdf
13 21_11_A Systematic Review of Power and Control in Marketing Channels The Case of the Automotive Industry.pdf
14 21_12_Nurture the Business Relationship befo

In [222]:
indices = input("Do you want to remove files? Enter the indices (separated by a comma): ").split(",")
indices.sort(reverse=True)
if indices != ['']:
    for i in indices:
        files.pop(int(i))
    for i, file in enumerate(files):
        print(i, file.split(os.sep)[-1])

In [223]:
text_t = ""
for file_path in files:
    pdf_reader.set_path(file_path)
    pdf_reader.open()
    text = pdf_reader.read()
    text_t += text
    print(text)

FRANCHISING LESSONS IN THE AGE OF INCIVILITY:  OPERATIONS MANUALS AND TRADE SECRETS © Robert W. Emerson, 2021* Abstract The framework for a successful franchise relationship governs procedures, performance, and standards. The franchisor agrees to lend, in effect, its intellectual property and guidance, among other things, in exchange for the franchisee’s royalties and other payments. Before entering an agreement, franchisors disclose a large bundle of information to the prospective franchisee. These data may include operational insights necessary for a franchise’s success. In practice, though, franchise operations manuals only become available to franchisees once they pay for and are bound to the franchise system. This timing, and the centrality of the manual, is the key to many franchise disputes. For example, franchisees may allege they were harmed by vague, precontractual representations about the contents of operations manuals, which in turn franchisors would justify as a way to pr

In [224]:
doc = nlp(text_t[:])
doc

FRANCHISING LESSONS IN THE AGE OF INCIVILITY:  OPERATIONS MANUALS AND TRADE SECRETS © Robert W. Emerson, 2021* Abstract The framework for a successful franchise relationship governs procedures, performance, and standards. The franchisor agrees to lend, in effect, its intellectual property and guidance, among other things, in exchange for the franchisee’s royalties and other payments. Before entering an agreement, franchisors disclose a large bundle of information to the prospective franchisee. These data may include operational insights necessary for a franchise’s success. In practice, though, franchise operations manuals only become available to franchisees once they pay for and are bound to the franchise system. This timing, and the centrality of the manual, is the key to many franchise disputes. For example, franchisees may allege they were harmed by vague, precontractual representations about the contents of operations manuals, which in turn franchisors would justify as a way to pr

In [None]:
from spacy.tokens.token import Token

scientific_common_words = {
    "figure", "fig", "table", "et", "al", "etc", "eg", "i.e", "ie", 
    "eq", "equation", "dataset", "data", "analysis", "results", 
    "conclusion", "introduction", "method", "methods", "study", 
    "research", "author", "authors", "paper", "work", "approach", 
    "model", "models", "proposed", "presented", "based", "using", 
    "performed", "obtained", "observed", "used", "shown", "reported", 
    "significant", "important", "novel", "investigation", "algorithm", 
    "algorithms", "technique", "techniques", "parameters", 
    "parameter", "experimental", "experiments", "performance", 
    "standard", "implemented", "implementation", "similar", 
    "different", "result", "respectively", "compare", "compared", 
    "comparison", "additional", "respectively"
}
nlp.Defaults.stop_words.add(scientific_common_words)

def custom_stopwards(text: Token, stop_wards):
    if stop_wards is not None:
        for token in stop_wards:
            if token in text.text:
                return False
    return True

In [226]:
stop_words = ["al", "et", "non", "ltd", "pty","kelly", "washington", "kaufmann", "university"]
_tokens = []
capitalise = False
if not capitalise:
    f = str.lower
else:
    f = lambda x: x
new_text = ""
for token in doc:
    if not token.is_punct and not token.is_stop and not token.is_digit and token.is_alpha and custom_stopwards(token, stop_words):
        new_text += f(token.lemma_) + ' '
doc = nlp(new_text)

In [227]:
# Get only bigrams and trigrams
tokens = []
bigrams = []
trigrams = []
unigrams = []
for chunk in doc.noun_chunks:
    chunk_ = f(chunk.text)
    unms = chunk_.split(" ")
    length = len(unms)
    if length == 1:
        tokens.append(chunk_)
    if length == 2:
        bigrams.append(chunk_)
    if length == 3: 
        trigrams.append(chunk_)
tokens = unigrams + bigrams + trigrams
tokens

['agreement franchisor',
 'molly manners',
 'llc franchisor',
 'molly manners',
 'franchise agreement',
 'franchising franchisors',
 'molly manners',
 'ftca onclusion',
 'franchisor franchisee',
 'human creature',
 'framework turn',
 'molly manners',
 'molly manners',
 'reference civility',
 'molly manners',
 'direct evidence',
 'molly manners',
 'worldwide supp',
 'contact civility',
 'molly manners',
 'civility lesson',
 'molly manners',
 'molly manners',
 'molly manners',
 'reasonable jury',
 'molly manners',
 'worldwide supp',
 'complementary definition',
 'incredibly court',
 'record civility',
 'molly manners',
 'merely process',
 'party policymaker',
 'copyright infringement',
 'worldwide opinion',
 'molly manners',
 'right context',
 'modern franchise',
 'howard johnson',
 'right people',
 'franchising draft',
 'robert emerson',
 'kerry pipes',
 'antitrust conspiracies',
 'franchise systems',
 'robinson patman',
 'disfavored franchisee',
 'separation power',
 'franchisee consen

In [228]:
# Remove bigrams already in trigrams
trigram_bigrams = set()
for trigram in trigrams:
    words = trigram.split()
    trigram_bigrams.add(f"{words[0]} {words[1]}")
    trigram_bigrams.add(f"{words[1]} {words[2]}")
    
trigram_unigrams = set()
for trigram in trigrams:
    words = trigram.split()
    trigram_unigrams.add(f"{words[0]}")
    trigram_unigrams.add(f"{words[1]}")
    trigram_unigrams.add(f"{words[2]}")
    
print(trigram_bigrams)
# Remove bigrams that are part of trigrams
filtered_bigrams = [bigram for bigram in bigrams if bigram not in trigram_bigrams]
filtered_unigrams = [unigram for unigram in unigrams if unigram not in trigram_unigrams]
print(filtered_bigrams)

{'vegan range', 'work hire', 'litigation case', 'large population', 'resource exchange', 'source use', 'account research', 'law franchisor', 'far thing', 'present second', 'ad toy', 'control possible', 'december result', 'hotel owner', 'product household', 'datum transfer', 'clause promotion', 'miller hcp', 'weitz jap', 'modern restaurant', 'uniformity table', 'highly troubling', 'pmuf geodisp', 'french context', 'utsa dtsa', 'employee efficient', 'system upgrade', 'growth momentum', 'good thing', 'chain cover', 'prevent franchisor', 'franchisee mean', 'cambridge law', 'large diverse', 'recovery franchise', 'hotel sum', 'new car', 'ginsberg perlmutter', 'concept franchisor', 'payment level', 'manufacturer information', 'note capacity', 'governance mode', 'franchising literature', 'income statement', 'franchise auto', 'franchise growth', 'independent counsel', 'interested open', 'franchise rule', 'government task', 'ragin necessity', 'people term', 'french danish', 'dolomiti ski', 'inap

# Run Topic Modelling with Gensim

In [229]:
import gensim.corpora as corpora 
from gensim.models import LdaModel

num_topics = 20
passes = 400
no_below = 0
no_above = 0.5
keep_n = 10000
id2word = corpora.Dictionary([trigrams+filtered_bigrams])
if no_below > 0:
    id2word.filter_extremes(no_below=no_below, no_above=no_above, keep_n=keep_n)
corpus = [id2word.doc2bow(tokens)]
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics, passes=passes)

In [230]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
output = f"./output/lda_model_{year}.html"
vis_data = gensimvis.prepare(lda_model, corpus, id2word, mds="mmds")
pyLDAvis.save_html(vis_data, output)

In [231]:
topics = lda_model.print_topics() 
for topic in topics: 
    print(topic)

(0, '0.000*"paper time" + 0.000*"paper industry" + 0.000*"part investment" + 0.000*"parsimonious solution" + 0.000*"paramount importance" + 0.000*"para regulation" + 0.000*"para cch" + 0.000*"participant industry" + 0.000*"paper straw" + 0.000*"paper force"')
(1, '0.000*"paper time" + 0.000*"paper industry" + 0.000*"part investment" + 0.000*"parsimonious solution" + 0.000*"paramount importance" + 0.000*"para regulation" + 0.000*"para cch" + 0.000*"participant industry" + 0.000*"paper straw" + 0.000*"paper force"')
(2, '0.000*"paper time" + 0.000*"paper industry" + 0.000*"part investment" + 0.000*"parsimonious solution" + 0.000*"paramount importance" + 0.000*"para regulation" + 0.000*"para cch" + 0.000*"participant industry" + 0.000*"paper straw" + 0.000*"paper force"')
(3, '0.000*"paper time" + 0.000*"paper industry" + 0.000*"part investment" + 0.000*"parsimonious solution" + 0.000*"paramount importance" + 0.000*"para regulation" + 0.000*"para cch" + 0.000*"participant industry" + 0.00