# Text Mining of BBC News Data

## Part 3: Document Clustering and Topic Modeling


## Document Clustering

In [None]:
from pathlib import Path

text_filepaths = sorted(Path("bbc").glob("*/*.txt"))
categories = [p.parent.name for p in text_filepaths]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vectorizer = TfidfVectorizer(
    input="filename", encoding="utf-8", decode_error="ignore",
    min_df=5, max_df=0.7)

tfidf_docs = tfidf_vectorizer.fit_transform(text_filepaths)

In [None]:
%%time
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=1).fit(tfidf_docs)

In [None]:
kmeans.cluster_centers_.shape

**Questions**:

- Run the previous clustering a second time. What do you observe?
- Can you suggest why this is the case?

**Exercise**:

- Use the array of weights of the center of the first cluster (`kmeans.cluster_centers_[0]`) and the feature names from the vectorizer to print the top 10 most important words of that cluster.

- What is the main topic of that cluster centroid?

Hint: feel free to use a pandas `DataFrame`.

In [None]:
# %load notebook_solutions/cluster_weights.py

## Evaluating the Quality of the K-Means Predictions

In [None]:
kmeans_predictions = kmeans.predict(tfidf_docs)
len(kmeans_predictions)

In [None]:
tfidf_docs.shape

In [None]:
kmeans_predictions[:10]

In [None]:
kmeans_predictions[-10:]

In [None]:
categories[:10]

In [None]:
categories[-10:]

In [None]:
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(kmeans_predictions, categories)

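The **Adjusted Rand Index** (ARI) considers all possible pairs of samples in the dataset and counts the pairs on which the two labelings agree: pairs placed in the same cluster by both, and pairs placed in different clusters by both. The count is then adjusted for chance, so that random labelings score close to 0 and identical partitions (up to a renaming of the labels) score 1.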

In [None]:
adjusted_rand_score([0, 1, 1, 0], ["a", "b", "b", "a"])

In [None]:
adjusted_rand_score([2, 0, 0, 2], ["a", "b", "b", "a"])

In [None]:
adjusted_rand_score([1, 1, 0, 0], ["a", "b", "b", "a"])

In [None]:
adjusted_rand_score([1, 0, 0, 2], ["a", "b", "b", "a"])
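
The pair counting behind the four calls above can be sketched in pure Python. This is a minimal illustration, not scikit-learn's implementation: the helper name `pair_counting_ari` is made up for this example, and it assumes at least two samples. It should agree with `adjusted_rand_score` on small inputs such as the ones above.

```python
from collections import Counter
from math import comb


def pair_counting_ari(labels_a, labels_b):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(labels_a)  # assumed to be >= 2
    # How many samples share each (label_a, label_b) combination.
    contingency = Counter(zip(labels_a, labels_b))
    # Pairs of samples grouped together by both labelings.
    together_in_both = sum(comb(c, 2) for c in contingency.values())
    # Pairs of samples grouped together by each labeling taken separately.
    together_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Value expected for random labelings with the same group sizes.
    expected = together_in_a * together_in_b / comb(n, 2)
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one group.
        return 1.0
    return (together_in_both - expected) / (maximum - expected)


truth = ["a", "b", "b", "a"]
print(pair_counting_ari([0, 1, 1, 0], truth))  # same partition, labels renamed
print(pair_counting_ari([1, 1, 0, 0], truth))  # no pair grouped together by both
```

Because only pairs are counted, the actual label values never matter: renaming the clusters leaves the score unchanged.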

Some (supervised) clustering metrics:
    
- Adjusted Rand Index
- Adjusted / Normalized Mutual Information
- V-measure (homogeneity and completeness)

When we don't have ground truth labels (which is most often the case, otherwise why not use a supervised classifier?), there is no single best way to quantify cluster quality. One could use the following metrics, but each of them makes different assumptions about what a "good" clustering result is:

- Measure inter-cluster or intra-cluster average / min / max distances (Euclidean, cosine, or a custom metric).
- Measure clustering stability across resamplings of the dataset and under small perturbations of the data.
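
As a rough illustration of the first kind of metric, the sketch below computes the mean distance of each point to its own centroid (intra-cluster) and the smallest distance between two centroids (inter-cluster). It uses made-up 2D points and a hypothetical helper `cluster_distances`; it assumes Euclidean distance and at least two clusters, and is not a scikit-learn function.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [mean(coords) for coords in zip(*points)]


def cluster_distances(points, labels):
    """Return (mean intra-cluster distance, min inter-centroid distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    centroids = {label: centroid(members) for label, members in clusters.items()}
    # Intra-cluster: average Euclidean distance from each point to its own centroid.
    intra = mean(dist(p, centroids[label]) for p, label in zip(points, labels))
    # Inter-cluster: smallest Euclidean distance between two distinct centroids.
    names = list(centroids)
    inter = min(dist(centroids[a], centroids[b])
                for i, a in enumerate(names) for b in names[i + 1:])
    return intra, inter


# Two well-separated groups of toy points.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(cluster_distances(points, [0, 0, 1, 1]))  # groups nearby points together
print(cluster_distances(points, [0, 1, 0, 1]))  # groups distant points together
```

A low intra-cluster distance combined with a high inter-cluster distance matches the intuition of compact, well-separated clusters, but that intuition is itself an assumption about what "good" means.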

**Exercises**

- Find the documentation of clustering metrics on scikit-learn.org;
- What is the meaning of homogeneity and completeness?
- On a toy dataset with only 4 samples and 2 "true" classes, find a clustering that is homogeneous but not complete, and the converse;
- Compute the homogeneity, completeness and V-measure scores for the results of the KMeans algorithm above.

In [None]:
# %load notebook_solutions/homogeneity_vs_completeness.py

## Faster Clustering with Dimensionality Reduction

In [None]:
%%time
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection

rp_kmeans = make_pipeline(GaussianRandomProjection(n_components=500),
                          KMeans(n_clusters=5, n_init=10))
rp_kmeans_predictions = rp_kmeans.fit_predict(tfidf_docs)

In [None]:
adjusted_rand_score(rp_kmeans_predictions, categories)

**Questions**:
    
- Try to reduce the dimension further. What do you observe?
- What is the number of tunable parameters of the KMeans model in this case?

In [None]:
%%time
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD

svd_kmeans = make_pipeline(TruncatedSVD(n_components=50),
                           KMeans(n_clusters=5, n_init=10))
svd_kmeans_predictions = svd_kmeans.fit_predict(tfidf_docs)

In [None]:
adjusted_rand_score(svd_kmeans_predictions, categories)

**Exercise**:

- Compute the homogeneity and completeness scores for this pipeline;
- Change the parameter `n_clusters`. How can you explain the results?

**Questions**:

- How can text clustering help data scientists?
- What are some "real world" applications of unsupervised text clustering?
- What is the main limitation of clustering when trying to organize documents by topic?

## Topic Modeling

In [None]:
%%time

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, max_iter=15,
                                learning_method='online',
                                learning_offset=50.).fit(tfidf_docs)

In [None]:
import pandas as pd


def print_components(components, feature_names, n=15, format="compact"):
    for i, component in enumerate(components):
        df = pd.DataFrame({"word": feature_names, "weight": component})
        top = df.nlargest(n, "weight")
        if format == "compact":
            print(f"Topic #{i}: {', '.join(top['word'])}")
        else:
            print(f"Topic #{i}:")
            print(top)
            print()
                  
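# Vocabulary learned by the TF-IDF vectorizer, aligned with the columns of
# tfidf_docs (get_feature_names_out requires scikit-learn >= 1.0).
feature_names = tfidf_vectorizer.get_feature_names_out()
print_components(lda.components_, feature_names, format="full")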

Once trained, the Latent Dirichlet Allocation model can infer the probability distribution over topics for each document, whether or not that document was part of the training set.

In [None]:
lda_docs = lda.transform(tfidf_docs)
lda_docs.shape

In [None]:
lda_docs[:5]

In [None]:
lda_docs[:5].sum(axis=1)

In [None]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=10).fit(tfidf_docs)

In [None]:
print_components(nmf.components_, feature_names, format="full")

NMF can also embed the documents in its own latent space, but the result cannot be directly interpreted as a probabilistic model of topics, as is the case for LDA:

In [None]:
nmf_docs = nmf.transform(tfidf_docs)

In [None]:
nmf_docs.shape

In [None]:
nmf_docs[:5]

In [None]:
nmf_docs[:5].sum(axis=1)
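
One way to see why NMF rows are not probabilities: their entries are non-negative, but nothing constrains them to sum to 1. The sketch below uses made-up numbers standing in for rows of `nmf_docs` (not actual output) and a hypothetical helper `relative_shares` that rescales a row so its entries sum to 1. This only gives relative shares that are easier to compare across documents; it does not turn NMF into a probabilistic topic model.

```python
def relative_shares(weights):
    """Rescale a row of non-negative weights so that it sums to 1."""
    total = sum(weights)
    if total == 0:
        # A document with no weight on any component has no meaningful shares.
        return [0.0 for _ in weights]
    return [w / total for w in weights]


# Made-up rows standing in for nmf_docs[:2]; real values will differ.
toy_nmf_rows = [[0.0, 0.12, 0.04], [0.3, 0.0, 0.0]]
for row in toy_nmf_rows:
    # Print the raw row sum next to the rescaled row.
    print(sum(row), relative_shares(row))
```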

**Exercises**

- Use `TruncatedSVD(n_components=10)` and print the most heavily weighted words for each component, as we did for LDA and NMF.
- Can it be used as a topic model?
- Look at the 10 smallest weights for LDA, NMF and SVD. What difference do you observe?
- Transform the TF-IDF documents into SVD space and look at the first 5 documents. What do you observe?
- How would you define the difference between a "topic" dimension and other kinds of "latent" dimensions, such as the ones extracted by Singular Value Decomposition?
- What are applications for topic models?