In [None]:
pip install sentence-transformers



In [None]:
import fitz  # PyMuPDF for PDF text extraction
import spacy
import numpy as np
from sklearn.cluster import KMeans
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

# Download sentence tokenizer
nltk.download('punkt')
nltk.download('punkt_tab')

# Step 1: Extract Text from PDF
def extract_text_from_pdf(pdf_path):
    # Open the PDF with a context manager so the file handle is released afterwards
    with fitz.open(pdf_path) as doc:
        text = " ".join(page.get_text("text") for page in doc)
    return text.strip()

# Provide the path to your PDF file
pdf_path = "/content/nlp_ner_news_summ.pdf"
document_text = extract_text_from_pdf(pdf_path)

# Step 2: Split Text into Sentences
sentences = sent_tokenize(document_text)



# Load spaCy Named Entity Recognition Model
nlp = spacy.load("en_core_web_sm")

# Step 3: Perform Named Entity Recognition (NER) and Filter Sentences
def filter_sentences_with_entities(sentences):
    # nlp.pipe processes the sentences in batches, which is faster than calling nlp() one at a time
    docs = nlp.pipe(sentences)
    return [sent for sent, doc in zip(sentences, docs) if doc.ents]

filtered_sentences = filter_sentences_with_entities(sentences)
print(f"\n🔹 Sentences after NER filtering ({len(filtered_sentences)} sentences):\n")
for sent in filtered_sentences:
    print("-", sent)

# Load Sentence-BERT Model
sbert_model = SentenceTransformer("all-MiniLM-L6-v2")  # Efficient SBERT model

# Step 4: Convert Sentences to SBERT Embeddings
sentence_embeddings = sbert_model.encode(filtered_sentences, convert_to_numpy=True)

# Step 5: Cluster the Sentences
num_clusters = 5  # Adjust as needed
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
clusters = kmeans.fit_predict(sentence_embeddings)

# Step 6: Print Clusters and their Sentences
print("\n🔹 Clusters and their Sentences:\n")
cluster_dict = {i: [] for i in range(num_clusters)}
for i, sentence in enumerate(filtered_sentences):
    cluster_dict[clusters[i]].append(sentence)

for cluster_id, cluster_sentences in cluster_dict.items():
    print(f"\n📌 Cluster {cluster_id + 1}:")
    for sent in cluster_sentences:
        print("-", sent)

# Step 7: Generate Summary (take the first sentence of each non-empty cluster as its representative)
summary = [cluster_dict[cluster_id][0] for cluster_id in range(num_clusters) if cluster_dict[cluster_id]]

# Print the final summary
print("\n🔹 Generated Summary:")
for sentence in summary:
    print("-", sentence)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.



🔹 Sentences after NER filtering (21 sentences):

- The recent parliamentary elections in Washington, D.C. resulted in a coalition government, 
shifting the balance of power.
- The new administration, led by the Democratic Party, proposed 
a tax reform bill, sparking intense debates across the country.
- Opposition leaders from the 
Republican Party called for revisions, claiming the bill favors large corporations like Amazon 
and Google over citizens.
- Protests erupted in major cities such as New York and Los Angeles 
as activists demanded a more transparent legislative process.
- The national football team of Brazil won a dramatic final match in Rio de Janeiro, clinching 
their first championship title in years.
- Analysts from ESPN praised the coach's strategic 
substitutions, crediting them for the late-game turnaround.
- Meanwhile, FIFA announced an 
expansion plan, adding two new teams next season.
- A highly anticipated Hollywood blockbuster produced by Warner Bros. broke bo


🔹 Clusters and their Sentences:


📌 Cluster 1:
- The national football team of Brazil won a dramatic final match in Rio de Janeiro, clinching 
their first championship title in years.
- Meanwhile, FIFA announced an 
expansion plan, adding two new teams next season.
- Meanwhile, director Christopher Nolan announced a sequel, promising an even more thrilling 
storyline.
- Streaming platforms like Netflix and Disney+ rushed to acquire exclusive rights, 
intensifying competition in the industry.

📌 Cluster 2:
- Analysts from ESPN praised the coach's strategic 
substitutions, crediting them for the late-game turnaround.
- A highly anticipated Hollywood blockbuster produced by Warner Bros. broke box office 
records, earning the highest opening weekend revenue.
- The film’s success boosted the 
studio’s stock prices on the New York Stock Exchange, reflecting strong investor confidence.
- Fans praised lead actor Leonardo DiCaprio’s performance, fueling awards season speculation
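
Step 7 above takes the first sentence of each cluster, so the choice depends on document order rather than on how typical the sentence is for its cluster. One alternative is to pick the sentence whose embedding lies closest to the cluster centroid. The sketch below assumes Euclidean distance; the helper name `pick_representatives` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np

def pick_representatives(sentences, embeddings, labels, centers):
    """Return {cluster_id: sentence closest (Euclidean) to that cluster's center}."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    representatives = {}
    for cluster_id, center in enumerate(centers):
        member_idx = np.flatnonzero(labels == cluster_id)
        if member_idx.size == 0:
            continue  # skip clusters that received no sentences
        distances = np.linalg.norm(embeddings[member_idx] - center, axis=1)
        representatives[cluster_id] = sentences[member_idx[np.argmin(distances)]]
    return representatives

# Toy 2-D vectors standing in for SBERT embeddings (illustrative only)
toy_sentences = ["a", "b", "c", "d"]
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
toy_labels = [0, 0, 1, 1]
toy_centers = [[0.9, 0.0], [10.0, 10.5]]
toy_result = pick_representatives(toy_sentences, toy_embeddings, toy_labels, toy_centers)
print(toy_result)
```

With the variables from the cell above, the call would presumably be `pick_representatives(filtered_sentences, sentence_embeddings, clusters, kmeans.cluster_centers_)`.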

In [None]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.25.4-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.4-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.4
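
The sentence listings in this notebook contain hard line breaks carried over from the PDF layout (for example, after "strategic" in the ESPN sentence). A minimal sketch, assuming it is acceptable to discard all line breaks, collapses whitespace before the text reaches `sent_tokenize`. `normalize_whitespace` is a hypothetical helper, not part of PyMuPDF or NLTK, and it does not rejoin words hyphenated across lines.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace (including PDF line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

raw = ("Analysts from ESPN praised the coach's strategic \n"
       "substitutions, crediting them for the late-game turnaround.")
cleaned = normalize_whitespace(raw)
print(cleaned)
```

In the pipeline this would be applied to `document_text` right after `extract_text_from_pdf`.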


# **CLUSTERING USING HMM**

In [None]:
import fitz  # PyMuPDF for PDF text extraction
import spacy
import numpy as np
from hmmlearn.hmm import GaussianHMM
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

# Download sentence tokenizer
nltk.download('punkt')
nltk.download('punkt_tab')

# Step 1: Extract Text from PDF
def extract_text_from_pdf(pdf_path):
    # Open the PDF with a context manager so the file handle is released afterwards
    with fitz.open(pdf_path) as doc:
        text = " ".join(page.get_text("text") for page in doc)
    return text.strip()

# Provide the path to your PDF file
pdf_path = "/content/nlp_ner_news_summ.pdf"
document_text = extract_text_from_pdf(pdf_path)

# Step 2: Split Text into Sentences
sentences = sent_tokenize(document_text)

# Load spaCy Named Entity Recognition Model
nlp = spacy.load("en_core_web_sm")

# Step 3: Perform Named Entity Recognition (NER) and Filter Sentences
def filter_sentences_with_entities(sentences):
    # nlp.pipe processes the sentences in batches, which is faster than calling nlp() one at a time
    docs = nlp.pipe(sentences)
    return [sent for sent, doc in zip(sentences, docs) if doc.ents]

filtered_sentences = filter_sentences_with_entities(sentences)
print(f"\n🔹 Sentences after NER filtering ({len(filtered_sentences)} sentences):\n")
for sent in filtered_sentences:
    print("-", sent)

# Load Sentence-BERT Model
sbert_model = SentenceTransformer("all-MiniLM-L6-v2")  # Efficient SBERT model

# Step 4: Convert Sentences to SBERT Embeddings
sentence_embeddings = sbert_model.encode(filtered_sentences, convert_to_numpy=True)

# Step 5: Apply Hidden Markov Model (HMM) for Clustering
num_clusters = 5  # Adjust as needed

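# Create and fit the HMM model.
# Without a `lengths` argument, fit() treats the rows as one ordered sequence,
# so state assignments depend on sentence order as well as on embedding similarity.
hmm_model = GaussianHMM(n_components=num_clusters, covariance_type="diag", n_iter=1000, random_state=42)
hmm_model.fit(sentence_embeddings)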

# Step 6: Predict the Hidden States (Cluster Assignments)
clusters = hmm_model.predict(sentence_embeddings)

# Step 7: Print Clusters and their Sentences
print("\n🔹 Clusters and their Sentences (Using HMM):\n")
cluster_dict = {i: [] for i in range(num_clusters)}
for i, sentence in enumerate(filtered_sentences):
    cluster_dict[clusters[i]].append(sentence)

for cluster_id, cluster_sentences in cluster_dict.items():
    print(f"\n📌 Cluster {cluster_id + 1}:")
    for sent in cluster_sentences:
        print("-", sent)

# Step 8: Generate Summary (take the first sentence of each non-empty cluster as its representative)
summary = [cluster_dict[cluster_id][0] for cluster_id in range(num_clusters) if cluster_dict[cluster_id]]

# Print the final summary
print("\n🔹 Generated Summary (Using HMM):")
for sentence in summary:
    print("-", sentence)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.



🔹 Sentences after NER filtering (21 sentences):

- The recent parliamentary elections in Washington, D.C. resulted in a coalition government, 
shifting the balance of power.
- The new administration, led by the Democratic Party, proposed 
a tax reform bill, sparking intense debates across the country.
- Opposition leaders from the 
Republican Party called for revisions, claiming the bill favors large corporations like Amazon 
and Google over citizens.
- Protests erupted in major cities such as New York and Los Angeles 
as activists demanded a more transparent legislative process.
- The national football team of Brazil won a dramatic final match in Rio de Janeiro, clinching 
their first championship title in years.
- Analysts from ESPN praised the coach's strategic 
substitutions, crediting them for the late-game turnaround.
- Meanwhile, FIFA announced an 
expansion plan, adding two new teams next season.
- A highly anticipated Hollywood blockbuster produced by Warner Bros. broke bo




🔹 Clusters and their Sentences (Using HMM):


📌 Cluster 1:
- The national football team of Brazil won a dramatic final match in Rio de Janeiro, clinching 
their first championship title in years.
- Meanwhile, FIFA announced an 
expansion plan, adding two new teams next season.
- Meanwhile, director Christopher Nolan announced a sequel, promising an even more thrilling 
storyline.
- Streaming platforms like Netflix and Disney+ rushed to acquire exclusive rights, 
intensifying competition in the industry.

📌 Cluster 2:
- Analysts from ESPN praised the coach's strategic 
substitutions, crediting them for the late-game turnaround.
- A highly anticipated Hollywood blockbuster produced by Warner Bros. broke box office 
records, earning the highest opening weekend revenue.
- The film’s success boosted the 
studio’s stock prices on the New York Stock Exchange, reflecting strong investor confidence.
- Fans praised lead actor Leonardo DiCaprio’s performance, fueling awards season
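
A rough parameter count can help judge whether 5 hidden states are reasonable for this amount of data. The sketch below counts the free parameters of a Gaussian HMM with diagonal covariances, assuming 384-dimensional embeddings (the output size of `all-MiniLM-L6-v2`) and the 21 filtered sentences reported above. `count_diag_hmm_params` is a hypothetical helper, not an hmmlearn function.

```python
def count_diag_hmm_params(n_states, n_features):
    """Free parameters of a Gaussian HMM with diagonal covariances."""
    start = n_states - 1                      # initial distribution sums to 1
    transitions = n_states * (n_states - 1)   # each transition-matrix row sums to 1
    means = n_states * n_features             # one mean vector per state
    variances = n_states * n_features         # one variance per feature per state
    return start + transitions + means + variances

n_states, n_features, n_sentences = 5, 384, 21
n_params = count_diag_hmm_params(n_states, n_features)
print(n_params, "free parameters for", n_sentences, "sentences")
print(round(n_sentences / n_states, 1), "sentences per state on average")
```

With only about four sentences per state, each per-feature variance is estimated from very few observations, so the state assignments may be unstable. Lowering `num_clusters` or projecting the embeddings to fewer dimensions before fitting are options worth testing.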

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Step 1: Function to compute cosine similarity between all sentence embeddings in a cluster
def compute_cosine_similarities(embeddings):
    return cosine_similarity(embeddings)

# Step 2: Function to find the most central (most similar) sentence in each cluster
def find_cluster_summary(cluster_sentences, cluster_embeddings):
    # Compute cosine similarities between sentences in the same cluster
    cosine_sim = compute_cosine_similarities(cluster_embeddings)

    # Calculate the average similarity for each sentence in the cluster
    avg_similarities = cosine_sim.mean(axis=1)

    # Find the index of the sentence with the highest average similarity
    summary_index = np.argmax(avg_similarities)

    # Return the sentence with the highest average similarity
    return cluster_sentences[summary_index]

# Step 3: Print the summary for each cluster based on cosine similarity
print("\n🔹 Cluster Summaries Based on Cosine Similarity:\n")
for cluster_id, cluster_sentences in cluster_dict.items():
    if not cluster_sentences:
        continue  # an empty cluster has no sentence to summarize

    # Select embeddings by cluster label; matching on sentence text would misalign rows if a sentence repeats
    cluster_embeddings = sentence_embeddings[clusters == cluster_id]

    # Get the summary sentence for this cluster
    summary_sentence = find_cluster_summary(cluster_sentences, cluster_embeddings)

    print(f"\n📌 Cluster {cluster_id + 1} Summary:")
    print("-", summary_sentence)



🔹 Cluster Summaries Based on Cosine Similarity:


📌 Cluster 1 Summary:
- Meanwhile, FIFA announced an 
expansion plan, adding two new teams next season.

📌 Cluster 2 Summary:
- The film’s success boosted the 
studio’s stock prices on the New York Stock Exchange, reflecting strong investor confidence.

📌 Cluster 3 Summary:
- Regulators from the 
European Union launched an inquiry into potential ethical implications of AI integration.

📌 Cluster 4 Summary:
- The new administration, led by the Democratic Party, proposed 
a tax reform bill, sparking intense debates across the country.

📌 Cluster 5 Summary:
- However, privacy concerns emerged as critics 
questioned data security measures implemented by Meta and Google.
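
The cluster summaries above are listed by cluster id, which need not match the order in which the sentences appear in the document. If a running summary is wanted, one option is to sort the selected sentences by their original position. A minimal sketch follows; `order_by_position` and the toy sentences are hypothetical, and the helper assumes every selected sentence occurs in the full sentence list.

```python
def order_by_position(selected, all_sentences):
    """Sort selected sentences by their first position in all_sentences."""
    position = {}
    for idx, sentence in enumerate(all_sentences):
        position.setdefault(sentence, idx)  # keep the first occurrence only
    return sorted(selected, key=lambda s: position[s])

# Toy stand-ins for filtered_sentences and the per-cluster picks
toy_document = ["Politics sentence.", "Sports sentence.", "Film sentence.", "Tech sentence."]
toy_selected = ["Film sentence.", "Politics sentence.", "Tech sentence."]
ordered = order_by_position(toy_selected, toy_document)
print(ordered)
```

In this notebook the second argument would be `filtered_sentences`, and the first would be the list of per-cluster summary sentences collected in the loop.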


In [None]:
pip install hmmlearn


Collecting hmmlearn
  Downloading hmmlearn-0.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Downloading hmmlearn-0.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (165 kB)
Installing collected packages: hmmlearn
Successfully installed hmmlearn-0.3.3
