# PDF Processing with Python

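This notebook demonstrates basic PDF processing in Python: text extraction with the `pypdf` library, followed by TF-IDF vectorization, LDA topic modeling, and cosine similarity with scikit-learn. `pandas` is imported in the first cell but is not used in the cells that follow.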

In [40]:
import pandas as pd

In [41]:
!pip install pypdf


# PDF Text Extraction with PdfReader

This script extracts and prints text from each page of a given PDF file using the `PdfReader` class.


In [42]:
from pypdf import PdfReader

# List of PDF files
pdf_files = ['acupunctureploicy.pdf']

def extract_text_from_pdf(file_path):
    text = ""
    try:
        # Create a PdfReader object for the given file path
        reader = PdfReader(file_path)
        
        # Print number of pages in the PDF file
        num_pages = len(reader.pages)
        print(f"Number of pages in {file_path}: {num_pages}")
        
        # Extract text from each page
        for i, page in enumerate(reader.pages):
            page_text = page.extract_text() or ""
            text += f"--- Page {i+1} ---\n{page_text}\n"
        
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
    
    return text

# Process each PDF file
for pdf_file in pdf_files:
    print(f"Processing file: {pdf_file}")
    pdf_text = extract_text_from_pdf(pdf_file)
    print(pdf_text)  # Optionally save or further process the extracted text


Processing file: acupunctureploicy.pdf
Number of pages in acupunctureploicy.pdf: 38
--- Page 1 ---
Acupuncture (CPG 024) 
Page 1 of 38 Cigna Medical Coverage Policy - Therapy Services  
Acupuncture  
 
Effective  Date:  4/15/2024 
Next Review Date: 4/15/2025 
 
 
    
 
 
INSTRUCTIONS FOR USE  
 
Cigna / ASH Medical Coverage Policies are intended to provide guidance in interpreting certain standard benefit plans adminis tered by 
Cigna Companies. Please note, the terms of a customer’s particular benefit plan document may differ significantly from the  standard 
benefit plans upon which these Cigna / ASH Medical Coverage Policies are based. In the event of a conflict, a customer’s bene fit plan 
document always supersedes the information in the Cigna / ASH Medical Coverage Policy. In the absence of a controlling federal or state coverage mandate, benefits are ultimately determined by the terms of the applicable benefit plan document.  Determinations in  each specific 
instance may requi
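
The comment in the processing loop above mentions saving the extracted text. A minimal sketch of one way to do that is shown here. The helper name `save_extracted_text` and the default output directory `extracted_text` are assumptions made for illustration; neither appears in the original notebook.

```python
from pathlib import Path

def save_extracted_text(text, pdf_path, out_dir="extracted_text"):
    """Write `text` to <out_dir>/<pdf stem>.txt and return the output path.

    Hypothetical helper: the name and directory layout are assumptions.
    """
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    # Create the output directory if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Inside the loop, this could be called as `save_extracted_text(pdf_text, pdf_file)` in place of, or in addition to, printing the text.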

In [43]:
# Extract text from all PDF files
texts = [extract_text_from_pdf(pdf_file) for pdf_file in pdf_files]

Number of pages in acupunctureploicy.pdf: 38


# Text Preprocessing

This script preprocesses text by replacing runs of non-word characters (anything other than letters, digits, and underscores) with a space, converting the result to lowercase, and removing English stopwords.


In [44]:
# Preprocess Text
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess_text(text):
    # Replace runs of non-word characters (not letters, digits, or underscore) with a space
    text = re.sub(r'\W+', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    words = [word for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return ' '.join(words)

# Preprocess the extracted texts
preprocessed_texts = [preprocess_text(text) for text in texts]
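
To see what the three steps do, here is a small sketch on a made-up sentence. It assumes a tiny hand-picked stopword set (`DEMO_STOP_WORDS`, hypothetical) instead of scikit-learn's `ENGLISH_STOP_WORDS`, so that it runs with the standard library alone; the function name `preprocess_demo` is likewise an illustration.

```python
import re

# Assumption: a tiny illustrative stopword set; the notebook itself uses
# scikit-learn's much larger ENGLISH_STOP_WORDS.
DEMO_STOP_WORDS = {"the", "is", "of", "a", "in", "and"}

def preprocess_demo(text, stop_words=DEMO_STOP_WORDS):
    text = re.sub(r"\W+", " ", text)  # runs of non-word characters -> one space
    text = text.lower()               # normalize case
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "Page 1 of 38: Acupuncture (CPG 024) is the policy_name!"
result = preprocess_demo(sample)
print(result)  # page 1 38 acupuncture cpg 024 policy_name
```

Note that digits and the underscore survive, because `\W` matches only characters that are not letters, digits, or underscores.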


# Text Preprocessing and TF-IDF Vectorization

This script preprocesses text, then vectorizes it using TF-IDF, and prints the matrix shape and first 20 terms.


In [45]:
# preprocess_text and texts are defined in earlier cells
from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocess the extracted texts
preprocessed_texts = [preprocess_text(text) for text in texts]

# Vectorize Text using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X = vectorizer.fit_transform(preprocessed_texts)

# Print the shape of the TF-IDF matrix
print(f"TF-IDF matrix shape: {X.shape}")

# Optionally, print feature names (terms) to understand the vocabulary
print(f"Feature names: {vectorizer.get_feature_names_out()[:20]}")  # Print first 20 terms

TF-IDF matrix shape: (1, 1000)
Feature names: ['01' '024' '10' '100' '102' '11' '12' '13' '14' '15' '159' '16' '17' '18'
 '19' '1976' '20' '2000' '2005' '2009']
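
The matrix has a single row because `texts` holds one document. With scikit-learn's default smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, every term in a one-document corpus gets an IDF of exactly 1, so the TF-IDF weights reduce to L2-normalized term counts. The sketch below reimplements only that formula in plain Python to make the point; `smoothed_idf` is a hypothetical helper and the example strings are made up.

```python
import math
from collections import Counter

def smoothed_idf(docs):
    """IDF per term using ln((1 + n) / (1 + df)) + 1, the smoothed form
    that TfidfVectorizer applies by default (smooth_idf=True)."""
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

one_doc = ["acupuncture pain treatment pain"]
two_docs = ["acupuncture pain treatment", "abortion policy treatment"]

idf_single = smoothed_idf(one_doc)  # every term has idf 1.0
idf_pair = smoothed_idf(two_docs)   # terms unique to one document score higher
```

IDF only starts to discriminate between terms once the corpus contains more than one document.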


# Topic Modeling with LDA

This script applies Latent Dirichlet Allocation (LDA) to discover topics in text data and prints the top words for each topic.


In [46]:
from sklearn.decomposition import LatentDirichletAllocation

# Number of topics
num_topics = 3

# Apply LDA
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(X)

# Function to display topics
def print_top_words(model, feature_names, n_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[-n_words:][::-1]
        top_words = [feature_names[i] for i in top_words_idx]
        print(f"Topic #{topic_idx}: {' '.join(top_words)}")

# Print top words for each topic
print("Topics found via LDA:")
print_top_words(lda, vectorizer.get_feature_names_out())


Topics found via LDA:
Topic #0: pain treatment acupuncture evidence sprain low quality region encounter al
Topic #1: acupuncture pain treatment evidence sprain low quality region encounter al
Topic #2: pain treatment acupuncture evidence sprain low quality region encounter al
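
All three topics list the same ten words in almost the same order. That is consistent with `X` having only one row: LDA has a single document to learn from, so there is little to separate the topics. One option, offered as a sketch rather than as what the notebook does, is to treat each page as its own document by splitting on the `--- Page N ---` markers that `extract_text_from_pdf` inserts. The helper name `split_into_pages` is hypothetical.

```python
import re

# Matches the marker line written by extract_text_from_pdf, e.g. "--- Page 3 ---"
PAGE_MARKER = re.compile(r"^--- Page \d+ ---\n", flags=re.MULTILINE)

def split_into_pages(text):
    """Split marker-delimited text into a list of per-page strings,
    dropping pages that are empty after stripping whitespace."""
    pages = PAGE_MARKER.split(text)
    return [page.strip() for page in pages if page.strip()]

demo = "--- Page 1 ---\nAcupuncture policy\n--- Page 2 ---\nLow back pain\n"
pages = split_into_pages(demo)
print(pages)  # ['Acupuncture policy', 'Low back pain']
```

The split has to happen before `preprocess_text`, which strips the dashes from the markers. The resulting per-page strings could then be preprocessed and vectorized as before. LDA is usually fit on raw term counts (for example from `CountVectorizer`) rather than on TF-IDF weights, so that may also be worth trying.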


# PDF Text Extraction with PdfReader

This script extracts and prints text from each page of a specified PDF file using the `PdfReader` class.


In [47]:
from pypdf import PdfReader

# List of PDF files
pdf_files = ['abortionpolicy.pdf']

def extract_text_from_pdf(file_path):
    text = ""
    try:
        # Create a PdfReader object for the given file path
        reader = PdfReader(file_path)
        
        # Print number of pages in the PDF file
        num_pages = len(reader.pages)
        print(f"Number of pages in {file_path}: {num_pages}")
        
        # Extract text from each page
        for i, page in enumerate(reader.pages):
            page_text = page.extract_text() or ""
            text += f"--- Page {i+1} ---\n{page_text}\n"
        
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
    
    return text

# Process each PDF file
for pdf_file in pdf_files:
    print(f"Processing file: {pdf_file}")
    pdf_text = extract_text_from_pdf(pdf_file)
    print(pdf_text)  # Optionally save or further process the extracted text


Processing file: abortionpolicy.pdf
Number of pages in abortionpolicy.pdf: 4
--- Page 1 ---
Page 1 of 4 
Administrative  Policy:  A006    Administrative Policy  
 
Effective  Date  ....................  3/15/2024 
Next Review Date  .............. 3/15 /2025 
Coverage Policy Number  ............. A 006 
 
Abortion  
Table of Contents  
 
Administrative Policy  .............................  1 
General Background  .............................  1 
Coding Information  ...............................  2 
References  .......................................... 4  Related Coverage Resources  
 
Comparative Genomic Hybridization 
(CGH)/Chromosomal Microarray Analysis 
(CMA) for Selected Hereditary Conditions  
Genetic Testing for Reproductive Carrier 
Screening and Prenatal Diagnosis  
 
 
PURPOSE  
Administrative Policies are intended to provide further information about the administration of standard Cigna benefit plans. In the event of a conflict, a customer’s benefit plan document 
always 

In [48]:
# Extract text from all PDF files
texts = [extract_text_from_pdf(pdf_file) for pdf_file in pdf_files]

Number of pages in abortionpolicy.pdf: 4


In [49]:
# Preprocess Text
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess_text(text):
    # Replace runs of non-word characters (not letters, digits, or underscore) with a space
    text = re.sub(r'\W+', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    words = [word for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return ' '.join(words)

# Preprocess the extracted texts
preprocessed_texts = [preprocess_text(text) for text in texts]


In [50]:
# preprocess_text and texts are defined in earlier cells
from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocess the extracted texts
preprocessed_texts = [preprocess_text(text) for text in texts]

# Vectorize Text using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X = vectorizer.fit_transform(preprocessed_texts)

# Print the shape of the TF-IDF matrix
print(f"TF-IDF matrix shape: {X.shape}")

# Optionally, print feature names (terms) to understand the vocabulary
print(f"Feature names: {vectorizer.get_feature_names_out()[:20]}")  # Print first 20 terms

TF-IDF matrix shape: (1, 407)
Feature names: ['001488' '006' '01' '10' '100' '1243' '15' '17' '200' '2005' '202' '2022'
 '2023' '2024' '2025' '24' '25' '28' '29' '31']
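
The first 20 feature names are all numeric because the vocabulary is returned in sorted order and digits sort before letters. If page numbers, dates, and policy codes are not of interest, one option is to drop tokens that contain digits before vectorizing. The sketch below assumes a hypothetical helper, `drop_numeric_tokens`, applied to already preprocessed text.

```python
def drop_numeric_tokens(text):
    """Remove whitespace-separated tokens that contain any digit."""
    return " ".join(
        token for token in text.split()
        if not any(ch.isdigit() for ch in token)
    )

cleaned = drop_numeric_tokens("policy a006 effective 3 15 2024 abortion")
print(cleaned)  # policy effective abortion
```

This also removes mixed tokens such as `a006`, which may or may not be desirable depending on whether policy identifiers matter for the analysis.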


In [51]:
from sklearn.decomposition import LatentDirichletAllocation

# Number of topics
num_topics = 3

# Apply LDA
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(X)

# Function to display topics
def print_top_words(model, feature_names, n_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[-n_words:][::-1]
        top_words = [feature_names[i] for i in top_words_idx]
        print(f"Topic #{topic_idx}: {' '.join(top_words)}")

# Print top words for each topic
print("Topics found via LDA:")
print_top_words(lda, vectorizer.get_feature_names_out())


Topics found via LDA:
Topic #0: pregnancy policy treatment abortion administrative induced ectopic health surgical cigna
Topic #1: abortion pregnancy policy treatment administrative induced ectopic health surgical cigna
Topic #2: pregnancy policy treatment abortion administrative induced ectopic health surgical cigna


# Topic Modeling and Similarity Analysis of PDF Documents

This script extracts text from multiple PDF files, applies LDA for topic modeling, and computes a cosine similarity matrix between the topic distributions of the documents.


In [52]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pypdf import PdfReader

# Function to extract text from PDF
def extract_text_from_pdf(file_path):
    text = ""
    try:
        reader = PdfReader(file_path)
        for page in reader.pages:
            text += page.extract_text() or ""
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
    return text

# List of PDF files
pdf_files = ['acupunctureploicy.pdf', 'abortionpolicy.pdf', 'pollenpolicy.pdf']

# Extract text from all PDF files
texts = [extract_text_from_pdf(pdf_file) for pdf_file in pdf_files]

# Vectorize text data
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

# Number of topics
num_topics = 3

# Apply LDA
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(X)

# Transform documents to topic distributions
topic_distributions = lda.transform(X)

# Print topic distributions
print("Topic Distributions for Each Document:")
print(topic_distributions)

# Compute cosine similarity matrix between topic distributions
cosine_sim = cosine_similarity(topic_distributions)

print("\nCosine Similarity Matrix:")
print(cosine_sim)


Topic Distributions for Each Document:
[[0.95583156 0.02200873 0.02215971]
 [0.0313098  0.03058691 0.93810329]
 [0.93250321 0.03352722 0.03396956]]

Cosine Similarity Matrix:
[[1.         0.05721728 0.99982933]
 [0.05721728 1.         0.07080647]
 [0.99982933 0.07080647 1.        ]]
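
Each row of the topic-distribution matrix sums to 1, and its largest entry marks that document's dominant topic. The sketch below reads the dominant topic off the printed values; the numbers are copied from the output above, and rows are paired with filenames in the order of `pdf_files`.

```python
pdf_files = ["acupunctureploicy.pdf", "abortionpolicy.pdf", "pollenpolicy.pdf"]

# Values copied from the printed topic distributions
topic_distributions = [
    [0.95583156, 0.02200873, 0.02215971],
    [0.0313098, 0.03058691, 0.93810329],
    [0.93250321, 0.03352722, 0.03396956],
]

# Index of the largest probability in each row
dominant = {
    name: max(range(len(row)), key=row.__getitem__)
    for name, row in zip(pdf_files, topic_distributions)
}
print(dominant)
# {'acupunctureploicy.pdf': 0, 'abortionpolicy.pdf': 2, 'pollenpolicy.pdf': 0}
```

The first and third documents share topic 0 as their dominant topic, while the second is dominated by topic 2.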


The cosine similarity matrix compares the topic distributions of the documents. Because topic proportions are non-negative, the values range from 0 to 1, where 1 indicates identical distributions. Documents are numbered in the order of `pdf_files`: Document 1 is `acupunctureploicy.pdf`, Document 2 is `abortionpolicy.pdf`, and Document 3 is `pollenpolicy.pdf`.

Document 1 vs. Document 2: 0.0572 (Very low similarity: Document 1 is dominated by topic 0 and Document 2 by topic 2.)
Document 1 vs. Document 3: 0.9998 (Very high similarity: both documents place over 93% of their weight on topic 0, so their distributions are almost identical.)
Document 2 vs. Document 3: 0.0708 (Low similarity: their dominant topics differ.)
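
The reported values can be checked by hand: cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below uses only the standard library, with the topic distributions copied from the printed output; the function name `cosine` is an illustration, not scikit-learn's API.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Topic distributions copied from the printed output
doc1 = [0.95583156, 0.02200873, 0.02215971]
doc2 = [0.0313098, 0.03058691, 0.93810329]
doc3 = [0.93250321, 0.03352722, 0.03396956]

sim_12 = cosine(doc1, doc2)
sim_13 = cosine(doc1, doc3)
sim_23 = cosine(doc2, doc3)
print(f"{sim_12:.4f} {sim_13:.4f} {sim_23:.4f}")  # 0.0572 0.9998 0.0708
```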