# **Themes Extraction and Clustering with pke**

This notebook will guide you through:
- [**Installation**](#installation) of necessary libraries (`pke`, `spacy`, etc.).
- [**Extraction of keyphrases**](#extraction) (themes) from documents using `pke`.
- [**Embedding**](#embedding) of these themes to convert them into numerical vectors (using `SentenceTransformer`).
- [**Clustering**](#clustering) the documents based on their thematic vectors (via K-Means and Hierarchical Clustering).
- Visualizing and interpreting the results.

---

## **1. Introduction**

- **What is `pke`?**  
  [`pke`](https://github.com/boudinfl/pke) is a Python toolkit for keyphrase (theme) extraction. Keyphrases are short phrases (often nouns or noun phrases) that capture the main topics of a document.

- **Why do we extract keyphrases (themes)?**  
  By extracting them, you can:
  - Summarize documents in concise terms.
  - Compare documents based on extracted topics.
  - Use them for search, classification, clustering, and other NLP applications.

In this notebook, we will use **TopicRank** (an unsupervised method provided by `pke`) to find the top 10 keyphrases per document. Then, we will represent each document by the embeddings of these keyphrases and cluster the documents to see how they group together.

---


## **2. Installation of `pke`** <a id="installation"></a>

To extract the themes, we will use the [pke - Python keyphrase extraction](https://github.com/boudinfl/pke) toolkit. pke requires [spaCy](https://spacy.io/usage) and a spaCy model for the language of the document.
Let's install spaCy and pke first.

If you run Python from the command line rather than in a notebook, you can use the following commands instead of the notebook cell below:

```
pip install spacy
python -m spacy download en_core_web_sm
pip install git+https://github.com/boudinfl/pke.git
```
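
Whichever installation route you take, it can help to confirm that the packages are visible to the Python interpreter you are actually running. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this example, and it only checks that a package can be found, not that it works.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# spaCy models are installed as regular packages, so the model can be checked by name too
print(missing_packages(["spacy", "pke", "en_core_web_sm"]))
```

An empty list means all three packages were found. Any name that is printed still needs to be installed for this interpreter.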

In [4]:
import sys

# Install spacy
!{sys.executable} -m pip install spacy

# Download the English SpaCy model
!{sys.executable} -m spacy download en_core_web_sm

# Install pke from GitHub
!{sys.executable} -m pip install git+https://github.com/boudinfl/pke.git



## **3. Environment Setup and Sample Documents**

### **3.1. Import Libraries**

In [2]:
# Libraries import
import os
import pke
import pandas as pd
import numpy as np
from collections import defaultdict

# For computing embeddings
from sentence_transformers import SentenceTransformer

# For clustering, similarity, and visualization
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

import matplotlib.pyplot as plt
import seaborn as sns

# Matplotlib & Seaborn configurations
%matplotlib inline
sns.set_theme(style="whitegrid")

### **3.2. Define Sample Documents**

The sample documents cover three topics:
- Document 1: Technology
- Document 2: Sports  
- Document 3: Food  
- Document 4: Technology
- Document 5: Sports

In [3]:

documents = [
    "In today's digital era, technology drives innovation with state-of-the-art systems and breakthrough solutions. Emphasizing connectivity and creative problem-solving, this field transforms everyday life with a dynamic blend of precision and ingenuity.",
    
    "In today's digital era, sports drive innovation with state-of-the-art training techniques and breakthrough strategies. Emphasizing connectivity and creative teamwork, this realm transforms athletic performance with a dynamic blend of precision and ingenuity.",
    
    "Culinary arts celebrate a fusion of tradition and innovation, blending carefully selected ingredients into dynamic, flavor-rich experiences. Chefs artfully combine technique and creativity, crafting recipes that echo the spirit of progress found in other innovative fields.",
    
    "Embodying the essence of digital innovation, technology reshapes the future with state-of-the-art systems and breakthrough methodologies. This modern realm thrives on connectivity and creative problem-solving, offering a dynamic platform for progress and technical excellence.",
    
    "In the competitive world of sports, every match is a vibrant display of athletic prowess and strategic teamwork. The arena pulses with energy and precise coordination, as athletes challenge themselves to overcome obstacles and secure triumph through relentless determination."
]

    
themes = ["Tech", "Sports", "Food", "Tech", "Sports"]
print("Sample Documents:\n")

for i, doc in enumerate(documents):
    print(f"Document {i+1}:", doc)

Sample Documents:

Document 1: In today's digital era, technology drives innovation with state-of-the-art systems and breakthrough solutions. Emphasizing connectivity and creative problem-solving, this field transforms everyday life with a dynamic blend of precision and ingenuity.
Document 2: In today's digital era, sports drive innovation with state-of-the-art training techniques and breakthrough strategies. Emphasizing connectivity and creative teamwork, this realm transforms athletic performance with a dynamic blend of precision and ingenuity.
Document 3: Culinary arts celebrate a fusion of tradition and innovation, blending carefully selected ingredients into dynamic, flavor-rich experiences. Chefs artfully combine technique and creativity, crafting recipes that echo the spirit of progress found in other innovative fields.
Document 4: Embodying the essence of digital innovation, technology reshapes the future with state-of-the-art systems and breakthrough methodologies. This modern

## **4. Extracting Keyphrases with `pke.TopicRank`** <a id="extraction"></a>

**`pke`** uses a standard procedure:
1. **Candidate Selection**: Identify candidate words or phrases.
2. **Candidate Weighting**: Score the candidates (in `TopicRank`, this is done via a graph-based method).
3. **N-best Selection**: Select the top `N` keyphrases.
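
To make the three steps concrete, here is a deliberately naive sketch in plain Python. It is not how TopicRank works: the stopword list, the `toy_keyphrases` name, and the frequency-based weighting are all assumptions made for illustration. pke relies on spaCy's linguistic analysis for candidate selection and, for TopicRank, on a graph-based ranking for weighting.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not pke's list)
STOPWORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "to", "with"}

def toy_keyphrases(text, n=3):
    """Toy three-step keyphrase pipeline; for illustration only."""
    # Tokens are lowercase words (hyphens allowed) or single punctuation marks
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*|[.,;:!?]", text.lower())

    # 1. Candidate selection: maximal runs of tokens between stopwords/punctuation
    candidates, current = [], []
    for tok in tokens + ["."]:  # trailing sentinel flushes the last run
        if tok in STOPWORDS or not tok[0].isalpha():
            if current:
                candidates.append(" ".join(current))
            current = []
        else:
            current.append(tok)

    # 2. Candidate weighting: raw frequency (TopicRank uses a graph instead)
    weights = Counter(candidates)

    # 3. N-best selection: the n highest-weighted candidates
    return weights.most_common(n)

print(toy_keyphrases("The latest smartphone by Apple offers cutting-edge "
                     "features and a sleek design."))
```

Because this toy version has no part-of-speech information, the verb "offers" ends up inside a candidate ("apple offers cutting-edge features"). Restricting candidates to sequences of nouns and adjectives avoids that kind of error.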

Let's see how pke works on a short example sentence. We first import the module and initialize the keyphrase extraction model (here: TopicRank):

In [4]:
import pke

extractor = pke.unsupervised.TopicRank()

Load the content of the document. Here the input is raw text, passed as a Python string. The document is automatically preprocessed and analyzed with spaCy, using the language given in the `language` parameter:

In [5]:
text = "The latest smartphone by Apple offers cutting-edge features and a sleek design."
extractor.load_document(text, language='en')

The keyphrase extraction consists of three steps:

1. Candidate selection:  
With TopicRank, the default candidates are sequences of nouns and adjectives (i.e. `(Noun|Adj)*`)

2. Candidate weighting:  
With TopicRank, this is done using a random walk algorithm.

3. N-best candidate selection:  
The 10 highest-scored candidates are selected. They are returned as (keyphrase, score) tuples.

In [6]:
extractor.candidate_selection()
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=10)

print("Extracted themes:")
print("=================")
for keyphrase in keyphrases:
    print(f'{keyphrase[1]:.5f}   {keyphrase[0]}')

Extracted themes:
0.30331   apple
0.28524   cutting-edge features
0.23409   latest smartphone
0.17736   sleek design
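
The random walk used for candidate weighting can be illustrated with a small PageRank-style iteration on a weighted graph. This is a sketch under stated assumptions rather than pke's implementation: the function name `random_walk_scores`, the damping factor of 0.85 (a conventional PageRank value), and the toy graph are all chosen for illustration, and every node is assumed to have at least one edge.

```python
def random_walk_scores(weights, damping=0.85, n_iter=100):
    """Score the nodes of a weighted graph with a damped random walk.

    weights[i][j] is the edge weight between nodes i and j (0 means no edge).
    Assumes every node has at least one edge.
    """
    n = len(weights)
    out_strength = [sum(row) for row in weights]  # total edge weight at each node
    scores = [1.0 / n] * n                        # start from a uniform distribution
    for _ in range(n_iter):
        scores = [
            (1 - damping) / n
            + damping * sum(weights[j][i] / out_strength[j] * scores[j]
                            for j in range(n) if out_strength[j] > 0)
            for i in range(n)
        ]
    return scores

# Toy star graph: node 0 is connected to nodes 1 and 2
star = [[0, 1, 1],
        [1, 0, 0],
        [1, 0, 0]]
print(random_walk_scores(star))
```

The central node receives the highest score, and the scores sum to 1. The four scores printed above also add up to 1, which is what one would expect from a normalized ranking of this kind.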


You can also try the other keyphrase extraction models available in pke (statistical, graph-based, and supervised) and compare the themes they extract. If your texts are in languages other than English, test the extraction on them and assess the quality. Is this something you might want to use for your final project?

You can read more about the pke toolkit from their paper ([Boudin, 2016](https://aclanthology.org/C16-2015.pdf)).

**Let's put the code in a function!**

In [7]:
def extract_topics(text, n_keyphrases=10, language='en'):
    """
    Extract the top n_keyphrases from a text using pke's TopicRank.

    Parameters:
    -----------
    text : str
        The text from which to extract keyphrases.
    n_keyphrases : int, default=10
        Number of keyphrases to return.
    language : str, default='en'
        Language for SpaCy processing. Commonly 'en' for English.

    Returns:
    --------
    keyphrases : list of (phrase, score) tuples
    """
    # Initialize the TopicRank extractor
    extractor = pke.unsupervised.TopicRank()

    # Load the text into the extractor
    extractor.load_document(input=text, language=language)

    # Candidate selection
    extractor.candidate_selection()

    # Candidate weighting (graph-based)
    extractor.candidate_weighting()

    # Retrieve the top n keyphrases
    keyphrases = extractor.get_n_best(n=n_keyphrases)

    return keyphrases

### **4.2. Apply the Function to Each Document**

In [8]:
all_docs_topics = []

for i, doc_text in enumerate(documents):
    keyphrases = extract_topics(doc_text, n_keyphrases=10, language='en')
    all_docs_topics.append(keyphrases)
    
    # Print a summary of extracted keyphrases
    print(f"Document {i+1} - Extracted Themes:")
    print("----------------------------------")
    for kp in keyphrases:
        print(f"Score: {kp[1]:.5f} | Keyphrase: \"{kp[0]}\"")
    print("\n")

Document 1 - Extracted Themes:
----------------------------------
Score: 0.08775 | Keyphrase: "state-of-the-art systems"
Score: 0.08509 | Keyphrase: "innovation"
Score: 0.08321 | Keyphrase: "technology"
Score: 0.08145 | Keyphrase: "breakthrough solutions"
Score: 0.08093 | Keyphrase: "creative problem-solving"
Score: 0.07981 | Keyphrase: "digital era"
Score: 0.07801 | Keyphrase: "connectivity"
Score: 0.07782 | Keyphrase: "everyday life"
Score: 0.07654 | Keyphrase: "field"
Score: 0.07547 | Keyphrase: "dynamic blend"


Document 2 - Extracted Themes:
----------------------------------
Score: 0.09244 | Keyphrase: "state-of-the-art training techniques"
Score: 0.09021 | Keyphrase: "creative teamwork"
Score: 0.08949 | Keyphrase: "sports drive innovation"
Score: 0.08753 | Keyphrase: "athletic performance"
Score: 0.08702 | Keyphrase: "breakthrough strategies"
Score: 0.08612 | Keyphrase: "connectivity"
Score: 0.08582 | Keyphrase: "realm"
Score: 0.08508 | Keyphrase: "dynamic blend"
Score: 0.08341 

---
# Extra Material to complement the reading material on Moodle

## **5. Embedding the Extracted Keyphrases** <a id="embedding"></a>

Instead of simply using **TF-IDF** on keyphrases (which may not appear verbatim in the documents), we want to capture the **semantic meaning** of each keyphrase. To do this, we convert every extracted topic (keyphrase) into a **vector representation (embedding)**.
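
To see why exact string matching is limiting, consider a small sketch that compares a few of the keyphrases reported above for Documents 1 and 2 using Jaccard overlap. The `jaccard` helper is a hypothetical name, and only a subset of the extracted keyphrases is used.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of strings (exact matches only)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# A few keyphrases taken from the extraction output for Documents 1 and 2
doc1_phrases = ["state-of-the-art systems", "breakthrough solutions",
                "connectivity", "dynamic blend"]
doc2_phrases = ["state-of-the-art training techniques", "breakthrough strategies",
                "connectivity", "dynamic blend"]

print(jaccard(doc1_phrases, doc2_phrases))
```

Only the two identical phrases count as shared. Near matches such as "breakthrough solutions" and "breakthrough strategies" contribute nothing, even though they are clearly related. Embeddings are meant to capture that kind of closeness.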

By assuming that the overall meaning of a document is approximated by the **collective meaning** of its keyphrases, we can obtain a **single vector** for each document by **aggregating** (e.g., averaging) the embeddings of its keyphrases.

**Steps**:  
1. **Embed** each keyphrase using a [**SentenceTransformer**](https://www.sbert.net/) model (e.g., `all-MiniLM-L6-v2`).  
2. **Weight** each embedding by the keyphrase’s **importance score** (from pke).  
3. **Aggregate** these weighted embeddings (e.g., by taking their mean) to form a **single document vector**.
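
The arithmetic of steps 2 and 3 can be checked by hand on a toy example. The 2-dimensional "embeddings" and the scores below are made-up numbers chosen so the result is easy to verify; real sentence embeddings have hundreds of dimensions.

```python
import numpy as np

# Hypothetical embeddings for three keyphrases (one row per keyphrase)
embeddings = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
# Hypothetical importance scores for the same three keyphrases
scores = np.array([0.5, 0.3, 0.2])

# Step 2 - Weight: scale each row by its score (broadcast over columns)
weighted = embeddings * scores[:, None]

# Step 3 - Aggregate: plain mean of the weighted rows
doc_vector = weighted.mean(axis=0)

# Alternative aggregation (an assumption, not used in this notebook):
# divide by the sum of the scores to get a normalized weighted average
doc_vector_normalized = weighted.sum(axis=0) / scores.sum()

print(doc_vector, doc_vector_normalized)
```

Both vectors point in the same direction and differ only in length. Measures based on direction are therefore unaffected by the choice, while measures based on Euclidean distance can be, because the scaling factor may differ from one document to the next.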


In [None]:
# Load a SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
doc = documents[0]
topics = extract_topics(doc)
print("Document : ", doc)
print("Topics (topic, weight) : ", topics)

In [None]:
# Separate the phrases from their TopicRank scores
phrases = [t[0] for t in topics]
weights = np.array([t[1] for t in topics])

# Step 1 - Embed: one vector per keyphrase
topics_embedding = model.encode(phrases)

# Step 2 - Weight: scale each embedding by its keyphrase score
weighted_topics_embedding = topics_embedding * weights[:, None]

# Step 3 - Aggregate: average the weighted embeddings into one document vector
document_embedding = np.mean(weighted_topics_embedding, axis=0)

# Shapes: (n_keyphrases, embedding_dim) and (embedding_dim,)
weighted_topics_embedding.shape, document_embedding.shape

In [None]:
document_embeddings = []

for doc_topics in all_docs_topics:
    # doc_topics is a list of (keyphrase_string, score)
    phrases = [t[0] for t in doc_topics]
    scores = np.array([t[1] for t in doc_topics])

    # Encode each keyphrase
    phrase_embeddings = model.encode(phrases)

    # Weight the embeddings by their scores
    scores = scores.reshape(-1, 1)
    weighted_embeddings = phrase_embeddings * scores

    # Aggregate (mean) to get a single vector per document
    doc_embedding = np.mean(weighted_embeddings, axis=0)

    document_embeddings.append(doc_embedding)

print("Number of document embeddings:", len(document_embeddings))
print("Dimension of each embedding:", len(document_embeddings[0]))

## **6. Clustering the Documents** <a id="clustering"></a>

We now have 5 **document vectors** derived from their keyphrases. Let's cluster them.

#### What is Document Clustering?

Document clustering (or text clustering) is an **unsupervised approach** (i.e., no labels are required for the documents we have). The goal is to **group documents into clusters (groups)** so that:

- Documents in the **same cluster** are **similar** to each other.
- Documents in **different clusters** are **dissimilar** from each other.

#### What Do We Need for Document Clustering?

We need two key components:

1. **Document Representation:**
   - We need to represent documents **numerically**.
   - Example: Compute **TF-IDF** to represent documents based on the terms they contain.
   - **Another method**: Extract themes and represent each document as a **vector**, where each dimension represents a theme.

2. **A Similarity Measure:**
   - We need to assess how similar two documents are in content.
   - Common measures include **cosine similarity** and **Euclidean distance** (the latter is a distance, so smaller values mean more similar documents).

We already have the document representation, so let's compute the similarity using **cosine similarity**.
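
For two vectors, cosine similarity is their dot product divided by the product of their lengths, so it ranges from -1 (opposite directions) to 1 (same direction). Here is a minimal pure-Python sketch; the `cosine_sim` helper is written for illustration, while the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_sim([1.0, 2.0], [2.0, 4.0]))  # same direction, different lengths
```

Because the result depends only on direction, multiplying a document vector by a positive constant does not change its cosine similarity to other documents.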

In [None]:
similarities = cosine_similarity(document_embeddings)
similarities

In [None]:
# Display the heatmap
plt.imshow(similarities, cmap='viridis', vmin=0, vmax=1)


plt.colorbar() # Add a colorbar (legend)

# Create labels for each document (Doc1, Doc2, ...)
num_docs = similarities.shape[0]
fig_labels = [f"Doc{i+1}" for i in range(num_docs)]

# Use the labels on the x-axis and y-axis
plt.xticks(np.arange(num_docs), fig_labels)
plt.yticks(np.arange(num_docs), fig_labels)


plt.title("Document Similarities") # Add a title

# Overlay the numeric values on each cell
for i in range(num_docs):
    for j in range(num_docs):
        # Format values with two decimals
        plt.text(j, i, f"{similarities[i, j]:.2f}",
                 ha='center', va='center', color='white')

plt.tight_layout()
plt.grid(False)
plt.show()

### **6.1. K-Means Clustering**

We choose `k = 3` clusters for demonstration. Adjust as needed.

In [None]:
num_clusters = 3  # number of clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels_kmeans = kmeans.fit_predict(document_embeddings)

df_clusters_kmeans = pd.DataFrame({
    'Document': [f"Doc_{i+1}" for i in range(len(documents))],
    'Cluster': labels_kmeans,
    'Themes': themes, 
})

df_clusters_kmeans
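
For intuition about what K-Means does internally, here is a toy version of Lloyd's algorithm, which alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The helper `lloyd_kmeans` is hypothetical and is not scikit-learn's implementation, which among other things uses a smarter initialization (k-means++) by default.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Toy K-Means (Lloyd's algorithm); for illustration only."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups in 2-D
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
toy_labels, toy_centroids = lloyd_kmeans(points, k=2)
print(toy_labels)
```

The cluster numbers themselves are arbitrary: what matters is which points share a label. The same holds when reading the `Cluster` column in the table above.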

### **6.2. Hierarchical Clustering**

We will compute the linkage matrix for various linkage criteria. Then, we'll plot dendrograms to observe how documents group at different distance thresholds.

In [None]:
def perform_hierarchical_clustering(method_of_choice, num_clusters, documents, document_embeddings, themes):
    """
    Performs hierarchical clustering on document embeddings using the specified linkage method.
    
    Parameters:
        method_of_choice (str): Linkage method ('single', 'complete', 'average', 'ward').
        num_clusters (int): The number of clusters to form.
        documents (list of str): List of document texts.
        document_embeddings (list or array-like): Document embeddings (each should be a 1D array).
        themes (list): List of themes corresponding to each document.
    """
    # Convert embeddings into a 2D NumPy array
    X = np.vstack(document_embeddings)
    
    # Compute the linkage matrix for the chosen method
    Z = linkage(X, method=method_of_choice)
    
    # Plot the dendrogram for the chosen method
    plt.figure(figsize=(8, 4))
    dendrogram(Z, labels=[f"Doc_{i+1}" for i in range(len(documents))], leaf_rotation=90)
    plt.title(f"Hierarchical Clustering - {method_of_choice.capitalize()} Linkage")
    plt.tight_layout()
    plt.show()
    
    # Generate cluster labels using the 'maxclust' criterion
    labels_hier = fcluster(Z, t=num_clusters, criterion='maxclust')
    
    # Create a DataFrame to display clustering results
    df_clusters_hier = pd.DataFrame({
        'Document': [f"Doc_{i+1}" for i in range(len(documents))],
        'Cluster': labels_hier,
        'Themes': themes,
        'Text': documents
    })
    
    
    # Group documents by cluster, keeping each document's original name
    clusters = defaultdict(list)
    for _, row in df_clusters_hier.iterrows():
        clusters[row['Cluster']].append((row['Document'], row['Text']))
    
    for cluster, docs in clusters.items():
        print(f"\n--- Cluster {cluster} ---\n")
        for name, doc in docs:
            # For readability, print only the first 500 characters of each document.
            print(f"[{name}]: {doc[:500]}...\n")
        print("="*80)



In [None]:
perform_hierarchical_clustering("single", num_clusters, documents, document_embeddings, themes)

In [None]:
perform_hierarchical_clustering("complete", num_clusters, documents, document_embeddings, themes)

In [None]:
perform_hierarchical_clustering("average", num_clusters, documents, document_embeddings, themes)

In [None]:
perform_hierarchical_clustering("ward", num_clusters, documents, document_embeddings, themes)
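
The `single`, `complete`, and `average` criteria used above differ only in how they measure the distance between two clusters. The sketch below spells out the three definitions in plain Python on made-up 2-D points; the function names are chosen for illustration and are not SciPy functions. Ward linkage follows a different idea (it merges the pair of clusters that least increases within-cluster variance) and is not shown.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """All Euclidean distances between a point in cluster_a and a point in cluster_b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    # Distance between the two closest points
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    # Distance between the two farthest points
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    # Mean of all pairwise distances
    d = pairwise_distances(cluster_a, cluster_b)
    return sum(d) / len(d)

# Two hypothetical clusters of 2-D points
a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(a, b), average_linkage(a, b), complete_linkage(a, b))
```

For any pair of clusters, the single-linkage distance is at most the average-linkage distance, which is at most the complete-linkage distance. This is one reason the dendrograms can merge documents at different heights depending on the method.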

## **7. Conclusion**

In this notebook, we:
1. **Installed** and imported `pke`, `spacy`, and `sentence-transformers`.
2. Used **`pke.TopicRank`** to extract keyphrases (themes) from each sample document.
3. **Embedded** these keyphrases into vectors and aggregated them per document.
4. **Clustered** using K-Means and Hierarchical Clustering.

### **Possible Next Steps**
- **Try different pke algorithms** (e.g., `YAKE`, `TextRank`, `MultipartiteRank`).
- **Tune** clustering parameters (e.g., different values of `k`, different linkage criteria).
- Apply these methods to **larger or real-world datasets** and evaluate.

----