# Chapter 6: Clustering for Text Similarity

## Clustering by Document Similarity
### Partitive Clustering

#### k-means clustering
Initialize the NLTK `KMeansClusterer` with our desired number of clusters (*k*) and our preferred distance measure (`cosine_distance`), and avoid a result iwth clusters that contain no documents.  
Then we'll add our no-op `fit()` method and a `transform()` method that calls the internal `KMeansClusterer` model's `cluster()` method, specifying that each document should be assigned a cluster. 

In [1]:
from nltk.cluster import KMeansClusterer
from sklearn.base import BaseEstimator, TransformerMixin

class KMeansClusters(BaseEstimator, TransformerMixin):
    
    def __init__(self, k=7):
        '''
        k is the number of clusters
        model is the implementation of Kmeans
        '''
        self.k = k
        self.distance = nltk.cluster.util.cosine_distance
        self.model = KMeansClusterer(self.k, self.distance, 
                                     avoid_empty_clusters=True)
    
    def fit(self, documents, labels=None):
        return self
    
    def transform(self, documents):
        '''
        Fits the K-Means model to one-hot vectorized documents
        '''
        return self.model.cluster(documents, assign_clusters=True)

Normalize and vectorize documents for our `KMeansClusters` class.  
Instead of returning a representation of documents as bags-of-words, this version of the `TextNormailzer` will perfrom stopwords removal and lemmatization and return a string for each document.  
*(Note: most of this is from Ch4, p72)*

In [2]:
from nltk.stem.wordnet import WordNetLemmatizer
import unicodedata
from nltk.corpus import wordnet as wn

class TextNormalizer(BaseEstimator, TransformerMixin):

    def __init__(self, language='english'):
        self.stopwords  = set(nltk.corpus.stopwords.words(language))
        self.lemmatizer = WordNetLemmatizer()

    def is_punct(self, token):
        return all(
            unicodedata.category(char).startswith('P') for char in token
        )

    def is_stopword(self, token):
        return token.lower() in self.stopwords

    def normalize(self, document):
        return [
            self.lemmatize(token, tag).lower()
            for paragraph in document
            for sentence in paragraph
            for (token, tag) in sentence
            if not self.is_punct(token) and not self.is_stopword(token)
        ]

    def lemmatize(self, token, pos_tag):
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(pos_tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)

    def fit(self, X, y=None):
        return self

    def transform(self, documents):
        return [
            self.normalize(document)
            for document in documents
        ]

Vectorize documents after normalization (before clustering) with `OneHotVectorizer` class.  
Use Scikit-Learn's `CountVectorizer` with `binary=True`, which will wrap both frequency encoding and binarization.  
The `transform()` method will return a representation of each doucment as a one-hot vectorized array.

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer

class OneHotVectorizer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        self.vectorizer = CountVectorizer(binary=True)
        
    def fit(self, documents, labels=None):
        return self
    
    def transform(self, documents):
        freqs = self.vectorizer.fit_transform(documents)
        return [freq.toarray()[0] for freq in freqs]

Now, create a `Pipeline` inside our `main()` execution to perform *k*-means clustering.  
Initialize a `PickledCorpusReader` (Ch3, p51), specifying use of only the "news" category.  
Initialize a pipeline to streamline our custom `TextNormalizer`, `OneHotVectorizer`, and `KMeansClusters` classes. By calling `fit_transform()` on the pipeline, we perfrom each of the steps in sequence.

In [4]:
# nltk.download('wordnet')

In [5]:
if __name__ == '__main__':
    from reader import PickledCorpusReader
    from sklearn.pipeline import Pipeline
    import nltk
    
    corpus = PickledCorpusReader('../ATAwP/corpus')
    docs = corpus.docs(categories=['news']) 
    
    model = Pipeline([
        ('norm', TextNormalizer()), 
        ('vect', OneHotVectorizer()), 
        ('clusters', KMeansClusters(k=7))
    ])
    
    clusters = model.fit_transform(docs)
    pickles = list(corpus.fileids(categories=['news']))
    for idx, cluster in enumerate(clusters):
        print(f"Document '{pickles[idx]}' assigned to cluster {cluster}.")

AttributeError: 'list' object has no attribute 'lower'

#### Optimizing k-means
Experiment with different values of k.  
Tune other parts of the pipeline, such as switching to TF-IDF vectorization instead of one-hot encoding.  
We could use a feature selector in place of our `TextNormalizer`.  
Optimize for speed by switching from `nltk.cluster` module to `sklearn.cluster`'s `MiniBatchKMeans`, a k-means variant that uses randomly sampled subsets.

In [6]:
from sklearn.cluster import MiniBatchKMeans
from sklearn.base import BaseEstimator, TransformerMixin

class KMeansClusters(BaseEstimator, TransformerMixin):
    
    def __init__(self, k=7):
        '''
        k is the number of clusters
        model is the implementation of Kmeans
        '''
        self.k = k
        self.model = MiniBatchKMeans(self.k)
    
    def fit(self, documents, labels=None):
        return self
    
    def transform(self, documents):
        return self.model.fit_predict(documents)

In [7]:
if __name__ == '__main__':
    from reader import PickledCorpusReader
    from sklearn.pipeline import Pipeline
    import nltk
    
    corpus = PickledCorpusReader('../ATAwP/corpus')
    print("corpus type:", type(corpus))
    docs = corpus.docs(categories=['news'])
    print("docs type:", type(docs))
    
    model = Pipeline([
        ('norm', TextNormalizer()), 
        ('vect', OneHotVectorizer()), 
        ('clusters', KMeansClusters(k=7))
    ])
    print("model type:", type(model))
    
    clusters = model.fit_transform(docs)
    pickles = list(corpus.fileids(categories=['news']))
    for idx, cluster in enumerate(clusters):
        print(f"Document '{pickles[idx]}' assigned to cluster {cluster}.")

corpus type: <class 'reader.PickledCorpusReader'>
docs type: <class 'generator'>
model type: <class 'sklearn.pipeline.Pipeline'>


AttributeError: 'list' object has no attribute 'lower'