# Text mining

**Data Science with AIML**<br>
MITES Summer 2025<br>
2025-07-02 T

Two labs ago, we found a way to convert text (unstructured data) into a more numerical form, which gave us some perspective on topics scraped from Wikipedia. In this lab, we build on that using **text embedding** models, **dimensionality reduction**, and **clustering**.

**Imports**

Make sure these packages are installed. (You only need to do this once for your venv.)

In [None]:
# %pip install -U pip matplotlib "numpy<2"
# %pip install -U ipywidgets sentence-transformers
# %pip install -U umap-learn hdbscan keybert bertopic
# %pip install wikipedia-api

These are the Python imports we're using:

In [None]:
import pickle
from pathlib import Path
import time
from statistics import median_high as median
import warnings
import random
warnings.simplefilter(action="ignore", category=(UserWarning, FutureWarning))

import wikipediaapi
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search
import umap
import hdbscan
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer

USER_AGENT = "MITES/0.0 (+https://sum.mit.edu/course/mlj25)"

wiki_wiki = wikipediaapi.Wikipedia(
    user_agent=USER_AGENT,
    language="en",
)

wiki_cache_file = Path("./wiki_cache.pkl")

def get_wiki_text(page_title):
    """Scrape the text from a Wikipedia page
    
    Enhanced with caching, so we're not unnecessarily spamming
    their servers. (:

    Args:
        page_title (str): Title of the Wikipedia article

    Returns:
        str: The text of that article
    """
    if not wiki_cache_file.exists():
        memo = {}
    else:
        with open(wiki_cache_file, "rb") as fp:
            memo = pickle.load(fp)

    if page_title not in memo:
        page = wiki_wiki.page(page_title)
        time.sleep(1)  # avoid spamming the server
        memo[page_title] = page.text
        with open(wiki_cache_file, "wb") as fp:
            pickle.dump(memo, fp)

    return memo[page_title]
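
To see the caching idea in isolation, here is a minimal sketch that swaps the Wikipedia request for a made-up `slow_fetch` function and writes to a temporary file. The names `slow_fetch`, `cached_fetch`, `cache_file`, and `calls` are hypothetical and only for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# temporary cache file, so this demo does not touch wiki_cache.pkl
cache_file = Path(tempfile.mkdtemp()) / "toy_cache.pkl"
calls = {"count": 0}  # how many times we "hit the server"

def slow_fetch(key):
    """Hypothetical stand-in for a network request."""
    calls["count"] += 1
    return f"text for {key}"

def cached_fetch(key):
    """Load the cache from disk, fetch only on a miss, then save."""
    if cache_file.exists():
        memo = pickle.loads(cache_file.read_bytes())
    else:
        memo = {}
    if key not in memo:
        memo[key] = slow_fetch(key)
        cache_file.write_bytes(pickle.dumps(memo))
    return memo[key]

first = cached_fetch("China")
second = cached_fetch("China")  # served from the cache file
print(calls["count"])
```

The second call returns the same text without calling `slow_fetch` again, which is the behavior `get_wiki_text` relies on.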

## Web scraping

Pick your own collection of Wikipedia pages to scrape, maybe three to five. This code scrapes those pages and combines their text into one long string, `ALL_THE_TEXT`.

In [None]:
page_titles = [
    "United States",
    "China",
    "Russia"
]

texts = []
for page_title in page_titles:
    text = get_wiki_text(page_title)
    texts.append(text)

ALL_THE_TEXT = "\n\n".join(texts)

print(f"{len(ALL_THE_TEXT)} characters")

Before, we tokenized text into words. Now we need to tokenize our text corpus into *sentences*.

In [None]:
def break_into_sentences(text):
    """Tokenize a given text into a list of sentences

    Args:
        text (str): All the text

    Returns:
        list[str]: List of all sentences
    """
    text = text.replace(". ", ".\n\n")
    while "\n\n\n" in text:
        text = text.replace("\n\n\n", "\n\n")
    return text.split("\n\n")

sentences = break_into_sentences(ALL_THE_TEXT)
sentences = np.array(sentences)  # enhance our list into a NumPy array

m = median(len(s.split(" ")) for s in sentences)
print(f"{len(sentences)} sentences")
print(f"Median sentence length is {m} words")

# check first five sentences
sentences[:5]

## Text embedding

Pick a sentence transformer model from those described [on this page at sbert.net](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#original-models). The code below uses `all-MiniLM-L6-v2`, which offers a good balance of quality and speed; `all-MiniLM-L12-v2` is a larger alternative.

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
embeddings = model.encode(sentences)

print(embeddings.shape)
embeddings

### Query search

How do **search queries** work, like when you look something up online? Since we can now convert text into numbers, we can use math to measure *how close* a search query is to each of our sentences. By default, `semantic_search` scores closeness with cosine similarity. Example:

In [None]:
query = "life span"

query_embeddings = model.encode([query])
results = semantic_search(query_embeddings, embeddings, top_k=5)[0]

for result in results:
    sentence = sentences[result["corpus_id"]]
    score = result["score"]
    print(f"(score: {score})\n{sentence}\n")
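
To get a feel for what the search is doing, here is a minimal sketch of the same idea in plain Python: score every vector against the query with cosine similarity, then keep the best few. The tiny 3-dimensional vectors are made up, and `cosine_similarity` and `top_k_search` are hypothetical helpers, not the library's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_search(query_vec, corpus_vecs, top_k=2):
    """Score every corpus vector against the query and keep the best."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]

# made-up "embeddings"; real ones have hundreds of dimensions
toy_corpus = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.2, 0.1],
]
toy_query = [1.0, 0.0, 0.0]

hits = top_k_search(toy_query, toy_corpus, top_k=2)
for hit in hits:
    print(hit)
```

The two vectors pointing in nearly the same direction as the query come out on top, and the one pointing elsewhere is dropped.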

## Dimensionality reduction

OK, but 300+ dimensions is too many to visualize. Let's bring this down to 2 dimensions and then plot it.

In [None]:
reducer = umap.UMAP(random_state=23)

In [None]:
reduced_embeddings = reducer.fit_transform(embeddings)

print(reduced_embeddings.shape)
reduced_embeddings
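
UMAP's internals are beyond this lab, but the general idea of squeezing many dimensions into two can be sketched with a simpler, different technique: PCA, which projects the data onto its two directions of greatest variance. This is not what UMAP does (UMAP is nonlinear); it is only an illustration, using made-up random data and hypothetical variable names.

```python
import numpy as np

rng = np.random.default_rng(23)
toy_embeddings = rng.normal(size=(100, 10))  # 100 made-up points, 10 dimensions

# PCA: center the data, then project onto the top-2 directions of variance
centered = toy_embeddings - toy_embeddings.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
toy_reduced = centered @ components[:2].T

print(toy_embeddings.shape, "->", toy_reduced.shape)
```

Either way, the output has one row per input point and just two columns, which is what makes a scatter plot possible.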

In [None]:
# fancy title for your plot
topics = ", ".join(page_titles[:3])  # show at most three titles
if len(page_titles) > 3:
    topics += ", ..."
plot_title = f"Topics: {topics}"

x = reduced_embeddings[:, 0]
y = reduced_embeddings[:, 1]

plt.scatter(x, y)
plt.title(plot_title)
plt.show()

Does this shape look interesting?

## Clustering

Chances are, there's *some* shape to your data, but what are those data points? We can use clustering methods to sort the data into meaningful groups, and then we'll plot again with some color.

In [None]:
clusterer = hdbscan.HDBSCAN()

In [None]:
clusterer.fit(reduced_embeddings)

labels = [int(i) for i in sorted(set(clusterer.labels_))]

print(labels)
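
HDBSCAN gives the label `-1` to points it considers noise, meaning they don't belong to any cluster. A quick way to see how big each group is, sketched here on a made-up list of labels (`toy_labels` is hypothetical; with real data you could count `clusterer.labels_` the same way):

```python
from collections import Counter

# made-up labels in the style of clusterer.labels_ (-1 marks noise)
toy_labels = [0, 0, 1, -1, 1, 1, -1, 0, 2, 2]

sizes = Counter(toy_labels)
for label in sorted(sizes):
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {sizes[label]} points")

# noise is not a real cluster, so leave it out of the count
n_clusters = len([label for label in sizes if label != -1])
print(f"{n_clusters} clusters")
```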

Let's try plotting again, this time coloring the dots by the cluster.

In [None]:
plt.scatter(x, y, c=clusterer.labels_)
plt.title(plot_title)
plt.show()

Cool! But now what even are these clusters about? Remember, each dot represents **one sentence** from your text corpus. Let's randomly sample a few from each cluster to see what they're about.

In [None]:
# organize clusters into list
clusters = []
for label in labels:
    mask = clusterer.labels_ == label
    clusters.append(sentences[mask])

In [None]:
# randomly sample k sentences from each cluster
# (note: same sentence might appear more than once
# for small clusters)
k = 5

# pair each cluster with its own HDBSCAN label (-1 means noise)
for i, cluster in zip(labels, clusters):
    print(f"Cluster {i}:")
    random_sentences = random.choices(cluster, k=k)
    for sentence in random_sentences:
        print(f"- {sentence}")
    print()

print(clusters[0])
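
The note about repeated sentences comes from how `random.choices` works: it samples *with replacement*. If you would rather never see the same sentence twice, `random.sample` is one alternative, as long as you don't ask for more items than the cluster has. A small sketch on a made-up cluster:

```python
import random

random.seed(0)  # only so the demo is repeatable
small_cluster = ["sentence A", "sentence B", "sentence C"]
k = 5

# with replacement: always k items, repeats are possible
with_replacement = random.choices(small_cluster, k=k)

# without replacement: no repeats, so cap k at the cluster size
without_replacement = random.sample(small_cluster, k=min(k, len(small_cluster)))

print(with_replacement)
print(without_replacement)
```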

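## Naming the clusters

One simple way to name a cluster is by its most frequent words. We clean and tokenize each sentence, drop common stop words, count what's left, and use the top three words as the cluster's name.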



In [None]:
def clean(text):
    """Standardize the text

    Make lowercase, separate punctuation, fix spacing.

    Args:
        text (str): The text to standardize

    Returns:
        str: The cleaned up text
    """
    text = text.lower()
    text = text.replace("\n", " ")
    text = text.replace("!", "  ")
    text = text.replace("?", "  ")
    text = text.replace(". ", "  ")
    if text.endswith("."):  # a sentence-final period has no space after it
        text = text[:-1]
    text = text.replace(",", "  ")
    text = text.replace('''"''', '''  ''') # Min-Jae added this
    text = text.replace("(", "  ") # Min-Jae added this
    text = text.replace(")", "  ") # Min-Jae added this
    text = text.replace("\u00a0", "  ") # non-breaking space (assumed intent); Min-Jae added this

    while "  " in text:
        text = text.replace("  ", " ")

    return text

def tokenize(text):
    """Clean & tokenize the text

    Args:
        text (str): The text to tokenize

    Returns:
        list[str]: The tokenized text, as a list of str
    """
    text = clean(text)
    # split() with no argument drops empty strings, so blank or
    # whitespace-only text gives [] instead of raising an IndexError
    return text.split()

stop_words = ["the", "and", "is", "are", "of", "in", "a", "to", "as", "or", "such", "for", "at", "was", "that", "their", "can", "with"]

def remove_stop_words(word_list):
    for stop_word in stop_words:
        while stop_word in word_list:
            word_list.remove(stop_word)
    return word_list

In [None]:
# find most common words in each cluster
# pair each cluster with its own HDBSCAN label (-1 means noise)
for i, cluster in zip(labels, clusters):
    word_counter = {}
    
    for sentence in cluster:
        tokens = tokenize(sentence)  # tokenize() already calls clean()
        unstopped_tokens = remove_stop_words(tokens)
        
        for word in unstopped_tokens:
            if word not in word_counter:
                word_counter[word] = 0
            word_counter[word] += 1
            
    sorted_words = sorted(word_counter, key=lambda word: word_counter[word], reverse=True)
    word_counter = {word: word_counter[word] for word in sorted_words}
    word_df = pd.DataFrame({"Word": word_counter.keys(), "Frequency": word_counter.values()})
    print(f"Cluster {i} Name:")
    print(*word_df.head(3)["Word"].values)
    print()
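
The counting loop above can also be written with `collections.Counter` from the standard library. Here is a minimal sketch on a made-up two-sentence cluster; `toy_cluster` and `toy_stop_words` are hypothetical, and the splitting is simpler than our `tokenize` function.

```python
from collections import Counter

toy_stop_words = {"the", "of", "is", "and"}
toy_cluster = [
    "the economy of the country is large",
    "the economy is growing and the trade is growing",
]

counts = Counter()
for sentence in toy_cluster:
    words = [w for w in sentence.lower().split() if w not in toy_stop_words]
    counts.update(words)  # add this sentence's words to the running tally

# most_common(n) returns the n highest-count (word, count) pairs
top_words = [word for word, _ in counts.most_common(2)]
print(top_words)
```

Using a set for the stop words also makes each `not in` check fast, which starts to matter on larger corpora.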
    