# Exploring documents with K-means clustering

In [210]:
import os
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

## Load up the document set

We will use a subset of the documents. Clustering is not likely to yield interesting results for the entire dataset. Before moving on, use the `tools.subset` tool or other means to create a data subset specified by document titles or other metadata.

Here, we'll define a generator to yield all the documents of the subset so we can avoid too much memory overhead.

In [211]:
DATADIR = '../data/DocumentCloud/subset'

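def documents(datadir=DATADIR):
    """Yield the text of each file in `datadir`, one document at a time."""
    for fn in os.listdir(datadir):
        # Use a context manager so each file is closed as soon as it is read.
        with open(os.path.join(datadir, fn)) as f:
            yield f.read()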

In [212]:
def find_docs(substring, limit=None):
    """A crude document search utility"""
    count = 0
    for doc in documents():
        if substring.lower() in doc.lower():
            count += 1
            yield doc
            if limit is not None and count >= limit:
                break

You may want to experiment with even more specific subsets of the data. The `find_docs` function defined above is just the simplest approach; you may find other ways of subsetting the data that suit your exploration better.
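As one possible alternative to plain substring matching, here is a minimal sketch of a regular-expression search. The name `find_docs_regex` and the toy documents are hypothetical and not part of the original notebook; the function takes any iterable of strings, so the output of `documents()` could be passed in the same way.

```python
import re

def find_docs_regex(pattern, docs, limit=None):
    """Yield documents matching a case-insensitive regular expression (hypothetical helper)."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    count = 0
    for doc in docs:
        if regex.search(doc):
            count += 1
            yield doc
            if limit is not None and count >= limit:
                break

# Toy stand-ins for real documents.
toy_docs = [
    "Finance Audit and Budget Committee",
    "Elevator outage report",
    "budget amendment",
]
matches = list(find_docs_regex(r"\bbudget\b", toy_docs))
print(len(matches))
```

Word boundaries (`\b`) keep the search from matching longer words that merely contain the search term, which a substring test cannot do.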

In [213]:
search_string = ''

if search_string:
    docs = list(find_docs(search_string))
else:
    # this could get heavy if you have a lot of docs in your exploration set
    docs = list(documents())
len(docs)

380

In [214]:
n = 10_000  # the maximum number of features to extract. For TfidfVectorizer this caps the
            # vocabulary size; for HashingVectorizer it is the number of hash buckets,
            # so a small value will lead to hash collisions

k = 5  # the number of clusters to find. You will want to experiment with this number for
       # each document set you explore

### Vectorize the document text

The [HashingVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) will convert a collection of text documents to a matrix of token occurrences. This effectively tokenizes the text and vectorizes in a single step.

In [215]:
hasher = HashingVectorizer(
    n_features=n, stop_words='english',
    alternate_sign=False, norm=None)
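To see what the hasher produces, here is a small sketch on two made-up sentences. The toy documents and the tiny `n_features=16` are assumptions chosen only to keep the output readable; with so few buckets, collisions between distinct tokens are quite possible.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents, invented for illustration.
toy_docs = [
    "the board approved the budget",
    "the committee reviewed the budget report",
]

# Same settings as the hasher above, but with a deliberately tiny feature space.
toy_hasher = HashingVectorizer(
    n_features=16, stop_words='english',
    alternate_sign=False, norm=None)

# HashingVectorizer is stateless, so transform() works without a prior fit().
toy_counts = toy_hasher.transform(toy_docs)
print(toy_counts.shape)
print(toy_counts.toarray())
```

Each row is one document and each column is one hash bucket. Because `alternate_sign=False` and `norm=None`, the entries are plain non-negative token counts.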


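The [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) converts the raw document text directly into a matrix of tf-idf features, doing its own tokenizing and counting. Note that the `hasher` defined above is not used in the cells that follow; the feature matrix comes from the `TfidfVectorizer` alone.

tf-idf is a term weighting approach that scores the relative importance of terms in a document. For more info, see the [Wikipedia entry](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).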
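To make the weighting concrete, here is a toy sketch (the three mini-documents are invented for illustration). A term that appears in fewer documents gets a higher inverse document frequency than one that appears in many.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "budget" occurs in two documents, "elevator" in only one.
toy_docs = [
    "budget budget meeting",
    "budget report",
    "elevator repair report",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

vocab = toy_vectorizer.vocabulary_   # maps term -> column index
idf = toy_vectorizer.idf_            # one idf weight per column

# The rarer term carries the larger idf weight.
print("idf(elevator):", idf[vocab["elevator"]])
print("idf(budget):  ", idf[vocab["budget"]])
```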

In [216]:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=n,
    min_df=2, stop_words='english', use_idf=True)
X = vectorizer.fit_transform(docs)

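Our feature matrix `X` now holds one term vector per document, with each term weighted by its "importance". The shape of `X` is (number of documents, number of extracted features). Because `min_df` and `max_df` filter out very rare and very common terms, the feature count can be well below `n`.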

In [217]:
X.shape

(380, 3385)

`X` is stored as a sparse matrix, which makes the data representation much lighter, since any given document contains only a small subset of the extracted terms.

In [218]:
X[0]

<1x3385 sparse matrix of type '<class 'numpy.float64'>'
	with 156 stored elements in Compressed Sparse Row format>
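In the output above, the first document stores 156 of 3385 possible values, or roughly 4.6%. The same kind of density check can be sketched on a toy matrix; the matrix below is invented for illustration and is not the notebook's data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-empty toy matrix: 4 "documents" by 1000 "terms".
dense = np.zeros((4, 1000))
dense[0, [3, 17]] = 0.5
dense[2, 999] = 1.0

sparse = csr_matrix(dense)

# nnz is the number of explicitly stored (non-zero) entries.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, density)
```

Applied to the real feature matrix, the same expression would be `X.nnz / (X.shape[0] * X.shape[1])`.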

### Fit the model to the documents

Now we are ready to cluster.

In [219]:
km = MiniBatchKMeans(
    n_clusters=k, init='k-means++', n_init=1,
    init_size=1000, batch_size=1000, verbose=False)

In [220]:
km.fit(X)

MiniBatchKMeans(batch_size=1000, compute_labels=True, init='k-means++',
                init_size=1000, max_iter=100, max_no_improvement=10,
                n_clusters=5, n_init=1, random_state=None,
                reassignment_ratio=0.01, tol=0.0, verbose=False)
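The cell defining `k` suggests experimenting with the number of clusters. One possible way to compare candidate values, sketched here on synthetic data, is the silhouette score. This is an assumption about what might be useful, not something the notebook itself does; `make_blobs`, the range of candidates, and the variable names are all illustrative.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D points standing in for the tf-idf matrix.
toy_X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)

scores = {}
for candidate_k in range(2, 8):
    model = MiniBatchKMeans(n_clusters=candidate_k, n_init=3, random_state=0)
    labels = model.fit_predict(toy_X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[candidate_k] = silhouette_score(toy_X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("highest silhouette at k =", best_k)
```

On real document collections the scores are often low and close together, so treat them as a rough guide alongside reading the top terms of each cluster.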

The model `km` now contains the clustering information. By looking at the labels that the model has applied to the documents, you can see that there are `k` clusters, with a cluster ID associated with each document.

In [221]:
km.labels_

array([3, 3, 4, 1, 2, 3, 0, 1, 3, 3, 4, 3, 1, 2, 2, 4, 2, 4, 0, 1, 1, 2,
       4, 3, 3, 3, 0, 1, 3, 3, 1, 4, 2, 2, 4, 0, 1, 3, 3, 1, 3, 0, 4, 4,
       0, 2, 1, 4, 1, 1, 2, 3, 2, 4, 2, 2, 4, 3, 4, 0, 0, 4, 3, 3, 0, 2,
       1, 3, 1, 4, 2, 0, 3, 3, 3, 0, 1, 3, 1, 4, 4, 0, 4, 1, 4, 1, 0, 4,
       1, 2, 2, 3, 0, 2, 3, 3, 1, 4, 1, 4, 4, 1, 0, 3, 1, 3, 0, 3, 4, 0,
       3, 0, 3, 3, 4, 4, 2, 4, 1, 2, 2, 0, 1, 4, 0, 4, 0, 2, 4, 3, 0, 2,
       2, 1, 3, 3, 4, 0, 2, 4, 4, 1, 3, 2, 3, 3, 1, 3, 1, 2, 4, 1, 3, 1,
       4, 2, 1, 3, 2, 3, 4, 4, 1, 0, 2, 0, 2, 3, 3, 2, 0, 1, 3, 3, 0, 2,
       1, 3, 4, 3, 2, 4, 0, 1, 4, 2, 0, 2, 1, 1, 2, 4, 2, 3, 4, 0, 2, 3,
       3, 0, 1, 0, 4, 1, 3, 4, 1, 1, 4, 4, 2, 2, 3, 1, 0, 4, 1, 4, 1, 4,
       3, 1, 2, 2, 3, 2, 4, 1, 4, 0, 2, 4, 2, 4, 1, 4, 2, 1, 3, 1, 1, 2,
       2, 1, 4, 2, 1, 2, 3, 3, 4, 0, 0, 4, 3, 3, 2, 2, 0, 3, 4, 4, 3, 0,
       2, 0, 1, 3, 3, 2, 0, 0, 2, 2, 3, 0, 3, 1, 2, 3, 1, 3, 1, 2, 4, 2,
       4, 2, 4, 4, 0, 4, 2, 3, 1, 4, 3, 2, 2, 2, 1,
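A quick way to summarize labels like these is to count how many documents fall in each cluster. The sketch below uses only the first ten labels shown above as a stand-in; with the fitted model, `km.labels_` would be passed instead.

```python
import numpy as np

# First ten labels from the output above, used as toy data.
toy_labels = np.array([3, 3, 4, 1, 2, 3, 0, 1, 3, 3])

# bincount returns the number of occurrences of each cluster ID;
# minlength guarantees an entry even for a cluster with no documents.
sizes = np.bincount(toy_labels, minlength=5)
print(sizes)
```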

### Inspect the clusters

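To get an intuitive sense of what is in each cluster, we can extract the top terms from each. The cell below takes the 100 highest-weighted terms of each cluster centroid and prints only those that do not appear in any other cluster's top 100.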

In [222]:
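print("Cluster # | Doc count | Terms")
# Sort each centroid's feature indices by descending weight.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0,
# use vectorizer.get_feature_names() instead.
terms = vectorizer.get_feature_names_out()
topics = []

# Collect the 100 highest-weighted terms for each cluster.
for i in range(k):
    topics.append({terms[ind] for ind in order_centroids[i, :100]})

# Print only the terms that appear in no other cluster's top 100.
for i, topic in enumerate(topics):
    unique = set(topic)
    for j, other in enumerate(topics):
        if i != j:
            unique -= other
    print(i, list(km.labels_).count(i), ' '.join(unique), sep=' | ')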

Cluster # | Doc count | Terms
0 | 48 | stations phone accommodation denied asap efficiencies month corner afternoon requests people accessibility 60661 vs reference processing clock station paid additional facilitator elevator reasonable announcements jefferson information provided intake advisory plan total tuesday including training se occurred totaled complaints representative customer davis escalator 2016 headquarters request streets
1 | 78 | coverage participants payment results specialist 22 litigation cancelled account conference bills december premium healthcare supplemental 17 16 compensation discuss room 23 cancellation 18 pm special 21 given irs application deferred hardship performance nd 24 education 27 retirement underpayments 25 investment 28 potential market rescheduled 19 preliminary administration 20 expense employee pending thursday scheduled 2nd 26
2 | 82 | relating projects contact listed open called policy terry related legal estate grant location matters aserpe r
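The terms printed for each cluster are those that survive a set difference against every other cluster's top terms. Here is a minimal sketch of that step on invented term sets (the words and names are illustrative only):

```python
# Toy "top terms" for three clusters; "budget" and "report" are shared.
toy_topics = [
    {"budget", "report", "elevator", "station"},
    {"budget", "report", "premium", "retirement"},
    {"budget", "grant", "legal"},
]

unique_terms = []
for i, topic in enumerate(toy_topics):
    # Union of the terms from every cluster except cluster i.
    others = set().union(*(t for j, t in enumerate(toy_topics) if j != i))
    unique_terms.append(topic - others)

for i, terms_i in enumerate(unique_terms):
    print(i, sorted(terms_i))
```

Shared terms drop out, which is why the printed lists can be much shorter than 100 terms and differ in length from cluster to cluster.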

Now, given that one of these clusters looks interesting, you might want to take a look at the documents.

In [223]:
cluster_id = 3
# Select the documents whose label matches the chosen cluster.
indexes = [i for i, v in enumerate(km.labels_) if v == cluster_id]
cluster_docs = [docs[i] for i in indexes]

In [224]:
print(cluster_docs[0])

MINUTES: Finance Audit and Budget Committee. thOctober
, 2018. 10
NOTICED: Following earlier committee. Commenced: 9:24 AM.

AGENDA: The posted agenda for the meeting can be found at www.transitchicago.com
, “About CTA”,
“Transit Board Meetings”, “Meeting Notices, Agendas, and Minutes”, “10/10/2018”, “Committee o
Finance Audit, and Budget.

ROLL CALL: Chairman Silva, Irvine, Peterson, Miller, Patterson, Youngblood. There was a quorum
committee members present, and one (Alva Rosales) absent.

COMMITTEE ACTION: The committee reviewed the Finance report and approved th
the
, September 1
2018 committee minutes.

Then, after extensive review by the committee, Chairman Silva asked for a motion to place all
recommended approved items, the four ordinances and the 14 contracts, on the omnibus for Boar
approval. Moved and seconded, the motion to recommend Board approval of the omnibus was
approved with six yes votes.
The approved items are as follows:

1 An ordinance authorizing an intergovernme