# Modern Data Science 
**(Module 05: Deep Learning)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


# Session D - Text Analytics (1) : Classification and Clustering

**This session introduces how to work with textual data, which is produced in enormous volumes every day. In this practical session, we cover the following topics:**

1. Text pre-processing techniques, also called text normalization, which involves using a variety of techniques to convert raw text into well defined sequences of linguistic components that have standard structure and notation.

2. Text classification or categorization which involves trying to organize text documents into various categories, based on inherent properties or attributes of each text document.

3. Document clustering uses unsupervised ML algorithms to group the documents into various clusters.

**References and additional reading and resources**
- [Mining Twitter Data with Python: 7 parts](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)
- [texttk -- Text Preprocessing in Python](https://github.com/fmpr/texttk)


---





## 1. Text Preprocessing (or Normalization)

Machine learning (ML) algorithms usually work with input features that are numeric in nature. You need to clean, normalize, and pre-process the initial textual data. Textual data are usually in native raw format which is not well formatted and standardized. Text pre-processing involves using a variety of techniques to convert raw text into well-defined sequences of linguistic components that have standard structure and notation.

In this section, we introduce the most popular text pre-processing techniques used in text analytics:
- Expanding contractions
- Lemmatization
- Removing special characters and symbols
- Removing stopwords


We are starting with a small corpus including 3 documents:

In [None]:
corpus=["The brown fox wasn't that quick and he couldn't win the race", 
        "Hey that's a great deal! I just bought  phones for 199", 
        "You'll learn a lot in the book. Python is an amazing language!"]

### 1.1 Expanding contractions

*Contractions* are shortened versions of words or syllables, e.g., isn’t, won’t. We have created a vocabulary
of contractions and their corresponding expanded forms, stored as a Python dictionary in the file **contractions.py**; part of it is shown here:
<img src="https://github.com/tuliplab/mds/raw/master/Jupyter/image/contraction.PNG" width=300>

In this snippet, we create a function called ``expand_contractions``, which contains the helper ``expand_match``. We build a ``regex`` pattern out of all the contractions in our ``contraction_mapping`` dictionary; whenever the pattern matches a contraction, ``expand_match`` substitutes the corresponding expanded version while retaining the case of the first character.

In [None]:
# This routine expands the contractions in a text using a pre-defined mapping.
# See the list of English contractions at https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions
# You can add your own contractions to extend the list.
# The pre-defined contractions are stored in CONTRACTION_MAP in the contractions.py file.
import re  # regular expression library

# find every contraction in the text and replace it with its expanded form
def expand_contractions(text, contraction_mapping):
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),flags=re.IGNORECASE|re.DOTALL)
    
    # this function returns each expanded contraction
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]

        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())  

        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
    
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text    

To expand the contractions in the corpus, we call ``expand_contractions`` on each document. The output shows each contraction replaced by its expanded form.

In [None]:
from contractions import CONTRACTION_MAP

n_corpus=[expand_contractions(doc,CONTRACTION_MAP) for doc in corpus]
print(n_corpus)
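
If ``contractions.py`` is not at hand, the same regex-substitution idea can be tried in isolation. The following is a minimal sketch, not the notebook's own routine: ``TOY_CONTRACTIONS`` and ``expand_toy_contractions`` are hypothetical names, and the mapping is deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny mapping; the real CONTRACTION_MAP is much larger.
TOY_CONTRACTIONS = {
    "wasn't": "was not",
    "couldn't": "could not",
    "that's": "that is",
    "you'll": "you will",
}

def expand_toy_contractions(text, mapping=TOY_CONTRACTIONS):
    # Build one alternation pattern from all keys; re.escape guards against
    # keys that contain regex metacharacters.
    pattern = re.compile("|".join(re.escape(key) for key in mapping),
                         flags=re.IGNORECASE)

    def replace(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # keep the capitalisation of the first character of the original text
        return found[0] + expanded[1:]

    return pattern.sub(replace, text)

toy_expanded = expand_toy_contractions("You'll see that's easy")
print(toy_expanded)
```

Looking up ``found.lower()`` keeps the mapping itself lowercase-only, while copying the first character back preserves sentence-initial capitals.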

### 1.2 Lemmatization
[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) removes word affixes to obtain the base form of a word. This base form, known as the root word or lemma, is always present in the dictionary. Lemmatization usually depends on the part of speech (POS) of each word. We use the ``pos_tag`` function from the Natural Language Toolkit (nltk) to determine the POS of each word. The function ``pos_tag_text()`` assigns a POS tag to every token (word) in the given ``text``. Note that ``pos_tag_text()`` uses ``tokenize_text()`` to split the text into words.

In [None]:
from nltk import pos_tag
from nltk.corpus import wordnet as wn
import nltk

# Annotate text tokens with POS tags
def pos_tag_text(text):
    
    def to_wn_tags(pos_tag):
        if pos_tag.startswith('J'):
            return wn.ADJ
        elif pos_tag.startswith('V'):
            return wn.VERB
        elif pos_tag.startswith('N'):
            return wn.NOUN
        elif pos_tag.startswith('R'):
            return wn.ADV
        else:
            return None
    
    tagged_text = pos_tag(tokenize_text(text))
    tagged_lower_text = [(word.lower(), to_wn_tags(pos_tag))
                         for word, pos_tag in
                         tagged_text]
    return tagged_lower_text
    
def tokenize_text(text):
    tokens = nltk.word_tokenize(text) 
    tokens = [token.strip() for token in tokens]
    return tokens

The main function is ``lemmatize_text()``, which takes in a body of text data and lemmatizes each word of the text based on its POS tag if it is present and then returns the lemmatized text back to the user.

In [None]:
# use the lemmatizer based on Wordnet dictionary
from nltk.stem import WordNetLemmatizer

# lemmatize text based on POS tags  
def lemmatize_text(text):
    wnl = WordNetLemmatizer()
    pos_tagged_text = pos_tag_text(text)
    lemmatized_tokens = [wnl.lemmatize(word, pos_tag) if pos_tag else word  for word, pos_tag in pos_tagged_text]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text


To lemmatize the text in the corpus, we call the ``lemmatize_text()`` function on each document. The output shows each word reduced to its lemma.

<img src="https://github.com/tuliplab/mds/raw/master/Jupyter/image/warning.png" width="40" align="left"></img> When you run the code for the first time, you might get an error that asks for

- ``Resource 'tokenizers/punkt/english.pickle' not found ``  OR
- ``Resource 'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found``

You can (remove comment and) run the following code to download the resources.


In [None]:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download("wordnet")
# nltk.download('stopwords')  # needed for the stopword list used in Section 1.4

In [None]:
sn_corpus=[lemmatize_text(doc) for doc in n_corpus]
print(sn_corpus)
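
The tag conversion performed inside ``pos_tag_text()`` can be looked at on its own. The sketch below assumes hand-written tagger output and spells out WordNet's single-letter POS constants (``'a'``, ``'v'``, ``'n'``, ``'r'``) so that it runs without any NLTK download; ``penn_to_wordnet`` and ``hand_tagged`` are hypothetical names.

```python
# WordNet's POS constants are single letters: 'a' (adjective), 'v' (verb),
# 'n' (noun) and 'r' (adverb). They are written out here to avoid an NLTK import.
WN_TAGS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    return WN_TAGS.get(penn_tag[:1])

# Hand-tagged tokens standing in for real tagger output
hand_tagged = [("he", "PRP"), ("was", "VBD"), ("running", "VBG"), ("quickly", "RB")]
converted = [(word, penn_to_wordnet(tag)) for word, tag in hand_tagged]
print(converted)
```

Tokens that map to ``None`` (here the pronoun) are the ones ``lemmatize_text()`` leaves unchanged.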

### 1.3 Removing special characters and symbols

We tokenize the text so that special characters can be stripped from each token, including apostrophes left behind by contractions that the first step failed to expand. We remove all symbols defined in ``string.punctuation`` from our text using regular expression matches.

In [None]:
import string

def remove_special_characters(text):
    tokens = tokenize_text(text)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

We can now apply this function to the text obtained in the previous step. Note that the symbol ``!`` has been removed.

In [None]:
rsn_corpus=[remove_special_characters(doc) for doc in sn_corpus]  # continue from the lemmatized corpus
print(rsn_corpus)
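
To see the effect of the punctuation pattern without NLTK, the sketch below applies the same character-class regex to whitespace-separated tokens. Using ``str.split()`` in place of ``tokenize_text()`` is an assumption made to keep the example self-contained, and ``strip_punctuation`` is a hypothetical name.

```python
import re
import string

# one character class that matches any single punctuation symbol
punct_pattern = re.compile("[{}]".format(re.escape(string.punctuation)))

def strip_punctuation(text):
    # str.split() stands in for tokenize_text() in this sketch
    cleaned = (punct_pattern.sub("", token) for token in text.split())
    # drop tokens that became empty, e.g. a standalone "!"
    return " ".join(token for token in cleaned if token)

stripped = strip_punctuation("Hey, what a great deal ! It cost $199.")
print(stripped)
```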

### 1.4 Removing stopwords

[Stopwords](https://en.wikipedia.org/wiki/Stop_words) are words that have little or no significance. They are usually removed from text during processing so as to retain the words that carry the most significance and context. Stopwords tend to be the most frequent words when you split a corpus into single tokens and count their frequencies. In the following code, we use the English stopword list from the NLTK library.

In [None]:
stopword_list = nltk.corpus.stopwords.words('english')
def remove_stopwords(text):
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

We can filter stopwords out of our corpus.

In [None]:
rrsn_corpus=[remove_stopwords(doc) for doc in rsn_corpus]
print(rrsn_corpus)
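
The claim that stopwords dominate token frequencies is easy to check on a toy sentence. This sketch uses only the standard library; the sentence and the two-word ``toy_stopwords`` set are made up for illustration and are not NLTK's list.

```python
from collections import Counter

tokens = "the fox and the dog and the cat saw the bird".split()

# count how often each token occurs and look at the most frequent ones
counts = Counter(tokens)
top_two = counts.most_common(2)
print(top_two)

toy_stopwords = {"the", "and"}  # hypothetical stopword set for this sketch
content_tokens = [t for t in tokens if t not in toy_stopwords]
print(content_tokens)
```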

###  1.5 Combining all steps to normalize corpus
Now that we have all our functions defined, we can build our text normalization pipeline by chaining all these functions one after another. The following function implements this, where it takes in a corpus of text documents and normalizes them and
returns a normalized corpus of text documents. 

In [None]:
# Combining all steps to normalize corpus
def normalize_corpus(corpus, tokenize=False):    
    normalized_corpus = []    
    for text in corpus: # we will process every document
        text = expand_contractions(text, CONTRACTION_MAP)
        text = lemmatize_text(text)
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        if tokenize:
            # return each document as a list of tokens instead of a string
            text = tokenize_text(text)
        normalized_corpus.append(text)  # append each document exactly once
            
    return normalized_corpus

In [None]:
norm_corpus=normalize_corpus(corpus)
print("Before normalizing:\n ", corpus)
print("\nAfter normalizing:\n ", norm_corpus)

## 2. Feature Extraction for Text data

There are various feature-extraction techniques that can be applied to text data, but before we jump into them, let us consider what we mean by features. Why do we need them, and how are they useful? A dataset typically contains many data points. Usually the rows of the dataset are the data points (observations) and the columns are the features or properties, with a specific value for each row. In ML terminology, features are unique, measurable attributes or properties of each observation or data point in a dataset. Features are usually numeric; they can be absolute numeric values, or categorical values that are encoded as one binary feature per category using a process called one-hot encoding. The process of extracting and selecting features is both art and science, and it is called feature extraction or feature engineering.

Now we will look at some feature-extraction concepts and techniques specially aligned towards text data. We will be talking about and implementing the following feature-extraction techniques:
- Bag of Words model
- TF-IDF model
- Word vectorization models

### 2.1 Bag of Words Model

The Bag of Words model is perhaps one of the simplest yet most powerful techniques to extract features from text documents. The essence of this model is to convert each text document into a vector whose entries count how often each distinct word of the corpus vocabulary occurs in that document.

The following code snippet gives us a function that implements a Bag of Words–based feature-extraction model:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def bow_extractor(corpus, ngram_range=(1,1)):    
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

Before applying the BOW extractor, we need to normalize our corpus using the routine introduced in Section 1.

In [None]:
norm_corpus=normalize_corpus(corpus) # normalize corpus

bow_vectorizer, bow_features = bow_extractor(norm_corpus) # call above routine to extract features
features = bow_features.todense() # feature matrix
feature_names = bow_vectorizer.get_feature_names_out() # feature names (use get_feature_names() on scikit-learn < 1.0)
print(feature_names)
print(features)
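
What the vectorizer computes can also be reproduced by hand on a toy corpus. The sketch below is a plain-Python illustration, not scikit-learn's implementation: the two documents are made up, and whitespace splitting replaces the vectorizer's own tokenizer.

```python
from collections import Counter

toy_docs = ["brown fox quick", "fox win race fox"]
tokenized = [doc.split() for doc in toy_docs]

# the vocabulary is the sorted set of distinct words across all documents
toy_vocab = sorted({tok for doc in tokenized for tok in doc})

# one row per document, one column per vocabulary word, entries are counts
toy_bow = [[Counter(doc)[term] for term in toy_vocab] for doc in tokenized]

print(toy_vocab)
for row in toy_bow:
    print(row)
```

Note how ``fox`` gets a count of 2 in the second document, while words absent from a document get 0.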

### 2.2 TF-IDF Model
The Bag of Words model is good, but the vectors are completely based on absolute
frequencies of word occurrences. This has some potential problems where words that
may tend to occur a lot across all documents in the corpus will have higher frequencies
and will tend to overshadow other words that may not occur as frequently but may
be more interesting and effective as features to identify specific categories for the
documents. This is where TF-IDF comes into the picture. TF-IDF stands for Term
Frequency-Inverse Document Frequency, a combination of two metrics: term frequency
and inverse document frequency. This technique was originally developed as a metric for
ranking functions for showing search engine results based on user queries and has come
to be a part of information retrieval and text feature extraction now.

The following code snippet shows an implementation of getting the tfidf-based feature vectors, considering we have our Bag of Words feature vectors we obtained in the previous section:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
   
def tfidf_transformer(bow_matrix):
    transformer = TfidfTransformer(norm='l2',smooth_idf=True,use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix

Note that ``bow_features`` and ``feature_names`` are the BOW matrix and feature list created in the previous section. We also created a function called ``display_features`` to display the feature matrix.

In [None]:
import numpy as np
import pandas as pd
def display_features(features, feature_names):
    df = pd.DataFrame(data=features,columns=feature_names)
    print(df)

tfidf_trans, tfidf_features = tfidf_transformer(bow_features) # compute tf-idf for each word
features = np.round(tfidf_features.todense(), 2) # round values to 2 decimal places
display_features(features, feature_names) # use the function above to display the feature matrix with its feature names
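
For intuition, the weighting can be worked through by hand. The sketch below assumes the smoothed formula that scikit-learn documents for ``smooth_idf=True``, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row; the toy counts are made up and the helper names are hypothetical.

```python
import math

toy_vocab = ["brown", "fox", "quick", "race", "win"]
toy_counts = [[1, 1, 1, 0, 0],
              [0, 2, 0, 1, 1]]

n_docs = len(toy_counts)
# document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in toy_counts if row[j] > 0)
            for j in range(len(toy_vocab))]
# smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalise(row):
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

# multiply term counts by idf, then scale each row to unit length
toy_tfidf = [l2_normalise([tf * w for tf, w in zip(row, idf)])
             for row in toy_counts]

for row in toy_tfidf:
    print([round(v, 2) for v in row])
```

Because ``fox`` appears in both documents, its idf is the minimum value of 1, so in the first document it ends up with a lower weight than ``brown`` or ``quick`` despite the same raw count.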


### 2.3 Word2vec models

We will use the gensim library, which provides a Python implementation of ``word2vec`` with several high-level interfaces for building these models. The basic idea is to provide a corpus of documents as input and get feature vectors as output. Internally, the model constructs a vocabulary from the input documents and learns a vector representation for each word; once training is complete, the model can be used to look up the vector of any word in its vocabulary. Using weighting schemes such as plain averaging or TF-IDF weighting, we can then combine the word vectors of a document into a single document vector.

In [None]:
import gensim

tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]
# build the word2vec model on our training corpus
# (gensim >= 4.0 calls the dimensionality vector_size; older versions call it size)
model = gensim.models.Word2Vec(tokenized_corpus, vector_size=10, window=10, min_count=2, sample=1e-3)

Once we have built a model, we combine the word vectors of each text document using a weighting scheme. We will implement the following two techniques:
- Averaged word vectors
- TF-IDF weighted word vectors

#### Averaged Word Vectors

In this technique, we use a simple averaging scheme: for each text document we extract all of its tokens, and for each token we look up the corresponding word vector if the token is present in the vocabulary. We sum up all the word vectors and divide the result by the number of words matched in the vocabulary to get an averaged word vector representation for the text document.

In [None]:
# define function to average word vectors for a text document
def average_word_vectors(words, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,),dtype="float64")
    nwords = 0.
    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model.wv[word])  # word vectors live on model.wv
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
    return feature_vector

# generalize above function for a corpus of documents
def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index_to_key)  # called index2word on gensim < 4.0
    features = [average_word_vectors(tokenized_sentence, model, vocabulary,num_features)
                for tokenized_sentence in corpus]
    return np.array(features)

The following snippet shows our function in action on our sample corpus:

In [None]:
avg_word_vec_features = averaged_word_vectorizer(corpus=tokenized_corpus,model=model,num_features=10)
print(np.round(avg_word_vec_features, 3))
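
The averaging step itself does not depend on gensim. The sketch below uses a hypothetical dictionary of 2-dimensional vectors in place of a trained model to show how out-of-vocabulary words are skipped and how the divisor is the number of matched words.

```python
# hypothetical 2-d embeddings standing in for a trained word2vec model
toy_vectors = {"fox": [1.0, 0.0], "race": [0.0, 1.0]}

def average_vectors(words, vectors, dim):
    total = [0.0] * dim
    matched = 0
    for word in words:
        if word in vectors:  # skip words without a vector
            matched += 1
            total = [t + v for t, v in zip(total, vectors[word])]
    # divide by the number of matched words, not by len(words)
    return [t / matched for t in total] if matched else total

doc_vector = average_vectors(["fox", "wins", "race"], toy_vectors, dim=2)
print(doc_vector)
```

A document with no known words keeps the all-zero vector, which mirrors the ``if nwords:`` guard in ``average_word_vectors()``.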

#### TF-IDF Weighted Averaged Word Vectors

This technique weights each matched word vector by the word's TF-IDF score, sums the weighted word vectors of a document, and divides the result by the sum of the TF-IDF weights of the matched words. This gives us a TF-IDF weighted averaged word vector for each document.

In [None]:
# define function to compute tfidf weighted averaged word vector for a document
def tfidf_wtd_avg_word_vectors(words, tfidf_vector, tfidf_vocabulary, model, num_features):
    # look up each word's TF-IDF weight; test membership explicitly so that the
    # word stored at column index 0 is not mistaken for a missing word
    word_tfidfs = [tfidf_vector[0, tfidf_vocabulary[word]]
                   if word in tfidf_vocabulary
                   else 0 for word in words]
    word_tfidf_map = {word:tfidf_val for word, tfidf_val in zip(words, word_tfidfs)}
    feature_vector = np.zeros((num_features,),dtype="float64")
    vocabulary = set(model.wv.index_to_key)  # called index2word on gensim < 4.0
    wts = 0.
    for word in words:
        if word in vocabulary:
            word_vector = model.wv[word]
            weighted_word_vector = word_tfidf_map[word] * word_vector
            wts = wts + word_tfidf_map[word]
            feature_vector = np.add(feature_vector, weighted_word_vector)
    if wts:
        feature_vector = np.divide(feature_vector, wts)
    return feature_vector

#generalize above function for a corpus of documents
def tfidf_weighted_averaged_word_vectorizer(corpus, tfidf_vectors,tfidf_vocabulary, model, num_features):
    docs_tfidfs = [(doc, doc_tfidf) for doc, doc_tfidf in zip(corpus, tfidf_vectors)]
    features = [tfidf_wtd_avg_word_vectors(tokenized_sentence, tfidf, tfidf_vocabulary,model, num_features)
                for tokenized_sentence, tfidf in docs_tfidfs]
    return np.array(features)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf_extractor(corpus, ngram_range=(1,1)):
    vectorizer = TfidfVectorizer(min_df=1,norm='l2',smooth_idf=True,use_idf=True,ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

We can see our implemented function in action on our sample corpus using the following snippet:

In [None]:
tfidf_vectorizer, tdidf_features = tfidf_extractor(corpus)
# get tfidf weights and vocabulary from earlier results and compute result
corpus_tfidf = tdidf_features
vocab = tfidf_vectorizer.vocabulary_
wt_tfidf_word_vec_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_corpus, tfidf_vectors=corpus_tfidf,
                                                                     tfidf_vocabulary=vocab, model=model,num_features=10)
print(np.round(wt_tfidf_word_vec_features, 3))
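
The weighted variant differs from plain averaging only in the weights and the divisor. In the sketch below, the 2-dimensional vectors and the weights are made-up values standing in for a trained model and real TF-IDF scores.

```python
def weighted_average_vectors(words, vectors, weights, dim):
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        if word in vectors and word in weights:
            w = weights[word]
            # scale the word vector by its weight before adding it
            total = [t + w * v for t, v in zip(total, vectors[word])]
            weight_sum += w
    # divide by the sum of the weights of the matched words
    return [t / weight_sum for t in total] if weight_sum else total

toy_vectors = {"fox": [1.0, 0.0], "race": [0.0, 1.0]}  # hypothetical embeddings
toy_weights = {"fox": 3.0, "race": 1.0}                # hypothetical TF-IDF scores

weighted_doc_vector = weighted_average_vectors(
    ["fox", "race"], toy_vectors, toy_weights, dim=2)
print(weighted_doc_vector)
```

With equal weights this reduces to the plain average; here the heavier weight pulls the document vector towards ``fox``.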

In the subsequent sections, we will put everything together and apply it to real-world data to build a multi-class text classification system. For this, we will use the 20 newsgroups dataset, which can be downloaded through scikit-learn. The 20 newsgroups dataset comprises around 18,000 newsgroup posts spread across 20 different categories or topics, making this a 20-class classification problem.

## 3. Text Classification and Clustering with Real-world dataset
### 3.1 Dataset: 20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It was originally collected by Ken Lang, probably for his [Newsweeder: Learning to filter netnews](http://qwone.com/~jason/20Newsgroups/lang95.bib) paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. *comp.sys.ibm.pc.hardware / comp.sys.mac.hardware*), while others are highly unrelated (e.g. *misc.forsale / soc.religion.christian*). Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:

- ``comp.graphics``, ``comp.os.ms-windows.misc``, ``comp.sys.ibm.pc.hardware``, ``comp.sys.mac.hardware``, ``comp.windows.x``
- ``rec.autos``, ``rec.motorcycles``, ``rec.sport.baseball``, ``rec.sport.hockey``
- ``sci.crypt``, ``sci.electronics``, ``sci.med``, ``sci.space``
- ``misc.forsale``
- ``talk.politics.misc``, ``talk.politics.guns``, ``talk.politics.mideast``
- ``talk.religion.misc``, ``alt.atheism``, ``soc.religion.christian``


### Data preparation
To reduce processing time, we use only 5 classes of documents in this section. Note that we will reuse the functions defined in Sections 1 and 2. Let us start by loading the dataset and defining functions for building the training and testing datasets.

We define two functions: ``prepare_datasets()`` to split given documents into testing and training data; ``remove_empty_docs()`` to remove empty documents:

In [None]:
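from sklearn.model_selection import train_test_split

def prepare_datasets(corpus, labels, test_data_proportion=0.3):
    # hold out test_data_proportion of the documents for testing
    train_X, test_X, train_Y, test_Y = train_test_split(corpus, labels, test_size=test_data_proportion, random_state=42)
    return train_X, test_X, train_Y, test_Y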

def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for doc, label in zip(corpus, labels):
        if doc.strip():
            filtered_corpus.append(doc)
            filtered_labels.append(label)
    return filtered_corpus, filtered_labels

We can now fetch the data, list the classes in our dataset, and split the data into training and test datasets using the following snippets. If the data has not been downloaded yet, scikit-learn fetches it from the Internet, which may take some time.

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism','rec.sport.baseball','talk.politics.mideast','comp.graphics', 'sci.space']
dataset = fetch_20newsgroups(subset='all',categories=categories, remove=('headers', 'footers', 'quotes'))


In [None]:
# print all the classes
print(dataset.target_names)

In [None]:
# get corpus of documents and their corresponding labels
corpus, labels = dataset.data, dataset.target 
corpus, labels = remove_empty_docs(corpus, labels)

In [None]:
# see sample document and its label index, name
print('Sample document:\n', corpus[10])
print('Class label:\n',labels[10])
print('Actual class label:\n', dataset.target_names[labels[10]])


The preceding snippet shows what a sample document and its label look like. Each document has its own class label, which is one of the 5 topics it is categorized into. The labels are numbers, but we can map them back to the original category names through ``dataset.target_names``, as the snippet does. Next, we split our data into training and test datasets, where the test dataset is 30 percent of the total data. We will build our model on the training data and test its performance on the test data.

In [None]:
from sklearn.model_selection import train_test_split
# prepare train and test datasets
train_corpus, test_corpus, train_labels, test_labels = prepare_datasets(corpus,labels, test_data_proportion=0.3)
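
Conceptually, the split shuffles the documents with a fixed seed and holds out a fraction of them, keeping each document paired with its label. The sketch below illustrates that idea with the standard library only; it is not how ``train_test_split`` is implemented, and ``toy_train_test_split`` is a hypothetical helper.

```python
import random

def toy_train_test_split(items, labels, test_proportion=0.3, seed=42):
    # shuffle the positions, not the items, so items and labels stay aligned
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_proportion))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

toy_items = ["doc%d" % i for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
tr_X, te_X, tr_y, te_y = toy_train_test_split(toy_items, toy_labels)
print(len(tr_X), len(te_X))
```

Fixing the seed makes the split reproducible, which is the role ``random_state=42`` plays in ``prepare_datasets()``.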

In [None]:
print(train_corpus[0])

In [None]:
print(test_corpus[0])

Remember that the normalization routine we implemented earlier runs several steps on every document in the corpus, so it may take some time to complete. Once we have normalized documents, we will use the feature extractors built earlier to extract features from our documents. We will build models on Bag of Words and TF-IDF features and compare their performance.

<img src="https://github.com/tuliplab/mds/raw/master/Jupyter/image/warning.png" width="40" align="left"></img> This step might take time, please be patient.

In [None]:
norm_train_corpus = normalize_corpus(train_corpus)
norm_test_corpus = normalize_corpus(test_corpus)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def bow_extractor(corpus):    
    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf_extractor(corpus, ngram_range=(1,1)):
    vectorizer = TfidfVectorizer(min_df=1,norm='l2',smooth_idf=True,use_idf=True,ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

In [None]:
# bag of words features
bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)
bow_test_features = bow_vectorizer.transform(norm_test_corpus)

In [None]:
# tfidf features
tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)

### 3.2 Text Classification

Once we have extracted the features from our text documents using the preceding feature extractors, we define a function for evaluating our classification models on four metrics (accuracy, precision, recall, and F1 score), as shown in the following snippet:

In [None]:
from sklearn import metrics
import numpy as np
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:', np.round(metrics.accuracy_score(true_labels,predicted_labels),2))
    print('Precision:', np.round(metrics.precision_score(true_labels,predicted_labels,average='weighted'),2))
    print('Recall:', np.round(metrics.recall_score(true_labels,predicted_labels,average='weighted'),2))
    print('F1 Score:', np.round(metrics.f1_score(true_labels,predicted_labels,average='weighted'),2))

def train_predict_evaluate_model(classifier, train_features, train_labels,test_features, test_labels):
    # build model
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features)
    # evaluate model prediction performance
    get_metrics(true_labels=test_labels,predicted_labels=predictions)
    return predictions
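
The ``average='weighted'`` option computes the metric per class and then averages the per-class values weighted by support, i.e. the number of true instances of each class. The following hand computation for precision is a sketch on made-up labels; ``weighted_precision`` is a hypothetical helper, not part of scikit-learn.

```python
def weighted_precision(true_labels, predicted_labels):
    classes = sorted(set(true_labels))
    total = len(true_labels)
    score = 0.0
    for c in classes:
        # true labels of all documents that were predicted as class c
        predicted_c = [t for t, p in zip(true_labels, predicted_labels) if p == c]
        precision_c = predicted_c.count(c) / len(predicted_c) if predicted_c else 0.0
        support_c = true_labels.count(c)  # number of true instances of class c
        score += precision_c * support_c / total
    return score

toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_precision = weighted_precision(toy_true, toy_pred)
print(round(toy_accuracy, 2), round(toy_precision, 2))
```

Here class 0 has precision 1/2 with support 2 and class 1 has precision 2/3 with support 3, so the weighted precision is (2 × 1/2 + 3 × 2/3) / 5 = 0.6.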

We now train a multinomial Naive Bayes classifier, first on the Bag of Words features and then on the TF-IDF features, so that we can compare the two feature sets.

In [None]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(bow_train_features, train_labels)
predictions = mnb.predict(bow_test_features)
print("Multinomial Naive Bayes with bag of words features")
get_metrics(test_labels,predictions)

In [None]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(tfidf_train_features, train_labels)
predictions = mnb.predict(tfidf_test_features)
print("Multinomial Naive Bayes with tfidf features")
get_metrics(test_labels,predictions)


### 3.3 Text Clustering

Now we use all normalized documents to extract features:

In [None]:
norm_corpus=normalize_corpus(corpus)

tfidf_vectorizer, tfidf_features = tfidf_extractor(norm_corpus)
bow_vectorizer, bow_features = bow_extractor(norm_corpus)

In [None]:
from sklearn.cluster import KMeans

num_clusters = 5
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_features)

In [None]:
print("Homogeneity (Purity): %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Normalized mutual information: %0.3f" % metrics.normalized_mutual_info_score(labels, km.labels_))
print("Adjusted Rand-Index: %0.3f" % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(tfidf_features, km.labels_, sample_size=1000))  # score the same features the model was fitted on

We are now ready to analyze the cluster results of our k-means clustering. The following code snippet depicts the detailed analysis results for k-means clustering:

In [None]:
vocab=bow_vectorizer.get_feature_names_out() # use get_feature_names() on scikit-learn < 1.0
print("Top terms per cluster:")
print()
# for each centroid, sort the term indices by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

for i in range(num_clusters):
    print("Cluster %d words:" % i, end=' ')
    
    for ind in order_centroids[i, :10]: # change 10 to show n words per cluster
        print(' %s' % vocab[ind], end='')
    print() #add whitespace
    print() #add whitespace
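
The ranking step is simply "sort each centroid's term weights in descending order and read off the first few terms". The sketch below shows this on two made-up centroids with a four-word vocabulary, using plain Python sorting in place of ``argsort()[:, ::-1]``; all names and values are hypothetical.

```python
toy_vocab = ["ball", "game", "orbit", "space"]
toy_centroids = [[0.1, 0.7, 0.0, 0.2],   # hypothetical centroid of a "sports" cluster
                 [0.0, 0.1, 0.5, 0.4]]   # hypothetical centroid of a "space" cluster

def top_terms(centroid, vocab, n):
    # term indices ordered from the largest centroid weight to the smallest
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [vocab[i] for i in order[:n]]

toy_top = [top_terms(c, toy_vocab, 2) for c in toy_centroids]
for i, terms in enumerate(toy_top):
    print("Cluster %d words:" % i, " ".join(terms))
```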
    


## 4. Exercises
1. Use three other classifiers for text classification in Section 3.2 and report the results.
2. Use [AffinityPropagation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) to cluster the text documents in Section 3.3 and report the results.
3. Use the Word2vec models to extract both kinds of features (averaged and TF-IDF weighted word vectors) for the classification and clustering tasks in Sections 3.2 and 3.3 and in Exercises 1 and 2, and compare the results.
4. Use the 20 newsgroups dataset from Section 3 with all categories and report the results.