### Machine Learning Technique 3 - Clustering

# 1. Group Newsgroup Posts by Topic

This notebook demonstrates how to use basic scikit-learn functionality to cluster objects.

The goal is to load many newsgroup posts that cover different topics and to discover clusters of similar documents. We will load the topic labels too, but will not feed them to the algorithm: the method is unsupervised, and we will see how "magic" such an approach is.

#### Data:

This time, the dataset is a toy dataset that is **available through the scikit-learn library**. Hence, you do not need to download anything yourself. You do, however, need an **Internet connection** so that sklearn can download it for you.

#### Links:

- [Clustering Scikit Demo](http://scikit-learn.org/stable/auto_examples/text/document_clustering.html)
- [Newsgroup Documentation 1](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)
- [Newsgroup Documentation 2](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups)

# 2. Import Libraries

In [None]:
# Starting by importing our beloved libraries: pandas, numpy, matplotlib.pyplot
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

# 3. Load the Data

For this notebook we are using the `fetch_20newsgroups` function from `sklearn.datasets`. We just need to import the function and call it. We will load a training set and a test set.

**Initial Dataset:**

The dataset returned by `fetch_20newsgroups` contains a total of 18846 entries in 20 different categories. Those categories are:

- alt.atheism
- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- misc.forsale
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- soc.religion.christian
- talk.politics.guns
- talk.politics.mideast
- talk.politics.misc
- talk.religion.misc

We are going to select only a few of those to reduce the size of the dataset, because that will make the data easier to plot later. :)

**Training dataset:**

For the training data, `fetch_20newsgroups` allows us to request it directly and it allows us to remove unneeded parts of the posts. Here, we will remove the headers, footers and quotes.

**Test dataset:**

For the test data, `fetch_20newsgroups` also allows us to request it directly. However, here we will leave all parts in. We will see later what it looks like.

In [None]:
# Import the sklearn example function
from sklearn.datasets import fetch_20newsgroups

# Define our subset of categories
categories = ['sci.space', 'comp.windows.x', 'rec.sport.baseball', 'soc.religion.christian']

# Request the training data
newsgroup_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories)

# Request the test data
newsgroup_test = fetch_20newsgroups(subset='test', categories=categories)

# 4. Explore a little

In [None]:
# Looks like a dictionary, so let's print the keys


In [None]:
# Train: Print the size of different attributes
print('- Train:')
print('{:10} documents'.format( ... ))
print('{:10} targets'.format( ... )) # Should be same as #documents
print('{:10} categories'.format( ... ))

# Test: Print the size of different attributes
print('\n- Test:')
print('{:10} documents'.format( ... ))
print('{:10} targets'.format( ... )) # Should be same as #documents
print('{:10} categories'.format( ... ))

In [None]:
# Show categories (This should be the same as we defined)




# 5. Slice / Split the data into training and validation

Since the helper function in this example already gave us the training and test data, we do not need to split it again. We are ready for the next steps.

As a reminder, if you had to split the data yourself, you could use the `train_test_split` function as in the following example:

    from sklearn.model_selection import train_test_split
    train, test = train_test_split(dataset, test_size=0.3)
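
As a runnable illustration of the snippet above, here is a minimal sketch on a made-up list of ten strings. The `dataset` contents and the `random_state` value are assumptions chosen only for this example; in current scikit-learn releases the function is imported from `sklearn.model_selection`.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: ten short "documents"
dataset = ['doc {}'.format(i) for i in range(10)]

# Hold out 30% of the rows; random_state makes the split reproducible
train, test = train_test_split(dataset, test_size=0.3, random_state=42)

print('{} training rows, {} test rows'.format(len(train), len(test)))
```

Every row ends up in exactly one of the two lists, so nothing is lost or duplicated by the split.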

# 6. Pre-Processing Text Data - Feature Extraction

Let's start with pre-processing the data. We are going to focus on 2 aspects:

1. Convert the data into a DataFrame. The current data is in a dictionary whose values are lists. We prefer a DataFrame because it makes larger datasets easier to work with.
2. Process the text and extract features from the text.
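
To make the first point concrete, here is a small sketch of the conversion on a hand-made dictionary. The `bunch` dictionary is a hypothetical stand-in that only mimics the `data`, `target` and `target_names` keys of the object returned by `fetch_20newsgroups`, and the column names `text`, `target` and `category` are assumptions for this example, not necessarily the ones used in the cells below.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary-like object returned by fetch_20newsgroups
bunch = {
    'data': ['The shuttle launch was delayed.', 'The pitcher threw a no-hitter.'],
    'target': [1, 0],
    'target_names': ['rec.sport.baseball', 'sci.space'],
}

# One row per post: the raw text and its numeric label
df = pd.DataFrame({'text': bunch['data'], 'target': bunch['target']})

# Translate the numeric label into a readable category name
df['category'] = df['target'].map(lambda i: bunch['target_names'][i])

print(df)
```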

## 6.1 Data to DataFrame

In [None]:
# The targets are the labelled categories of the dataset
# We are lucky to have them, but won't use them much.


In [None]:
# First save the targets for later


# Just convert the train data to a dataframe


In [None]:
# First 10 posts of training


In [None]:
# First 10 posts of testing


In [None]:
# Show the first message


In [None]:
# Let's make use of the random module to just browse through some of the rows


This is cleaner.

**There are a few things to note:**

- The training data has been imported via a function that removed headers, footers and quotes from the original messages. We chose not to spend too much time on advanced techniques for cleaning text data, because it is better to focus on the core, which is tokenizing and extracting features.
- The test data is original and has not been pre-processed or cleaned.

## 6.2 Feature Extraction using TF-IDF

To train any kind of machine learning model using text data, we need to, somehow, transform the text into another format that allows an algorithm to do computations.

One way to extract features would be to take all words and count them. Counting is really basic, and not very good for machine learning, because words like 'a', 'an', 'the', 'for', ... naturally appear a lot in English text.

This is why we remove stopwords from the list of extracted words. 

Still, this does not solve the problem, because there are words that are not important, or less meaningful, but appear a lot, and words that are very important, or very meaningful, but unfortunately appear only a few times.

Using a count technique, we basically make the most frequent word win!
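
The effect of plain counting, with and without stopword removal, can be sketched on two made-up sentences. The `docs` list and the helper `term_counts` are hypothetical names introduced for this illustration, and scikit-learn's built-in `'english'` stopword list is used here instead of the hand-written list from this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus
docs = [
    'the rocket and the launch pad',
    'the pitcher and the catcher',
]

def term_counts(documents, stop_words=None):
    """Return a {term: total count over all documents} dictionary."""
    vectorizer = CountVectorizer(stop_words=stop_words)
    counts = vectorizer.fit_transform(documents)
    # Sum each column of the sparse count matrix to get one total per term
    totals = np.asarray(counts.sum(axis=0)).ravel()
    return {term: int(totals[idx]) for term, idx in vectorizer.vocabulary_.items()}

raw = term_counts(docs)
filtered = term_counts(docs, stop_words='english')

print('Most frequent term (raw counts):', max(raw, key=raw.get))
print('Terms left after stopword removal:', sorted(filtered))
```

With raw counts the winner is a word that says nothing about the topic. After stopword removal the topical words remain, but they all have the same count, so counting alone still cannot tell which of them matters most.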

**TF-IDF:**

TF-IDF stands for *term frequency-inverse document frequency*.

I will not go too deep into the details. Check the links below for more explanations and details about the method.
The basic things to know are:

- You take the frequency of each word in each document (term frequency)
- You take, for each word, the number of documents containing that word (document frequency)

The document frequency is high for words that appear in many documents (common/boring words) and low for words that are rare.
So, by dividing the term frequency by the document frequency (in practice, multiplying by the logarithm of the inverse document frequency), we get a low score for boring words and a high score for rare words.
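
The idea can be worked through by hand on a tiny corpus. The sketch below uses the textbook variant `tf * log(N / df)`. The function `tf_idf` and the three token lists are hypothetical, and scikit-learn's `TfidfVectorizer` applies a smoothed inverse document frequency and L2 normalization by default, so its numbers will differ from these.

```python
import math

# Hypothetical corpus, already tokenized
docs = [
    ['the', 'rocket', 'reached', 'orbit'],
    ['the', 'pitcher', 'threw', 'the', 'ball'],
    ['the', 'church', 'choir', 'sang'],
]

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF; assumes the term occurs somewhere in the corpus."""
    tf = doc.count(term) / len(doc)            # term frequency in this document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    idf = math.log(len(corpus) / df)           # inverse document frequency
    return tf * idf

score_the = tf_idf('the', docs[1], docs)
score_rocket = tf_idf('rocket', docs[0], docs)

print('the    : {:.3f}'.format(score_the))
print('rocket : {:.3f}'.format(score_rocket))
```

The word 'the' appears in every document, so its inverse document frequency is log(1) = 0 and its score vanishes, while 'rocket' appears in only one document and keeps a positive score.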

**Links:**

- [TF-IDF Simple Explanation](http://planspace.org/20150524-tfidf_is_about_what_matters/)

- [sklearn documentation of TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)
- [TF-IDF Explanation 1](http://www.tfidf.com)
- [TF-IDF Explanation Wikipedia](https://en.wikipedia.org/wiki/Tf–idf)

In [None]:
import re
import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [None]:
# Import TF-IDF from sklearn
...

# We can give stopwords as parameters already
# We can try later what happens if we don't remove stopwords
stopwords_english = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

# Initialize the TF-IDF object.
...

# Extract features from the text data
...

# Print the shape
...

# 7. Creating, Training & Measuring the Model

The next step is to create and train our model and see how we can create our first clusters.

The algorithm we are going to use is called **K-Means**, and it is an unsupervised clustering algorithm.
It can take as an input the TF-IDF matrix and also needs an **initial number of clusters**.

To understand a bit how the algorithm works, these are the basic steps that it performs:

1. Randomly choose k items and define them as initial centroids (= center of cluster). 
2. For each point, find the nearest centroid and assign the point to the cluster associated with the nearest centroid. 
3. Update the centroid of each cluster based on the items in that cluster. Typically, the new centroid will be the average of all points in the cluster. 
4. Repeat steps 2 and 3 until no point switches clusters.

**Performance:**



**Links:**

- [KMeans Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)
- [KMeans Explanation 1](http://bigdata-madesimple.com/possibly-the-simplest-way-to-explain-k-means-algorithm/)
- [KMeans Explanation 2 - with animation](https://codeahoy.com/2017/02/19/cluster-analysis-using-k-means-explained/)
- [KMeans Explanation 3 - YouTube](https://www.youtube.com/watch?v=RD0nNK51Fp8)

## 7.1 Create and Train the Model

*The next cell might take a few minutes*

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

def create_cluster(data, n_clusters, max_iter=300, n_init=10):
    print('Creating KMeans Model:')
    print('  #Clusters        : {:5}'.format(n_clusters))
    print('  Max Iterations   : {:5}'.format(max_iter))
    print('  #Initialisations : {:5}'.format(n_init))
    
    print('\nIn Progress...\n')
    
    # Create the model (Other option: init='k-means++')
    ...

    # Train the model ("%time" computes the execution time in Jupyter notebooks)
    ...

    print('\nDone!')
    
    ...

In [None]:
# Define the number of clusters
# We fetched 4 categories, so we are going to use 4 clusters




# Get the computed clusters for each document / row


## 7.2 Check the Clusters

In [None]:
# We can check the clusters for the first 10 documents


In [None]:
# Add it to the dataframe


In [None]:
# We can count the values:


In [None]:
def print_clusters(model, tfidf, n_clusters):
    # For each cluster center, get the term indices sorted by descending weight
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    
    # Get all the terms
    ...

    # For each cluster, print the top 50 words
    

In [None]:
from sklearn.decomposition import PCA

def plot_cluster(matrix, model):
    ''' Plot the cluster in 2D. 
    '''
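    # Convert the sparse matrix to a dense NumPy array (PCA does not accept sparse input in older versions)
    X = matrix.toarray()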

    pca = PCA(n_components=2).fit(X)
    data2D = pca.transform(X)
    
    plt.figure(figsize=(20,20))
    plt.scatter(data2D[:,0], 
                data2D[:,1], 
                c=model.labels_, 
                label="Documents")
    plt.legend()

    centers2D = pca.transform(model.cluster_centers_)

    # Draw the centroids on the same axes (plt.hold was removed in matplotlib 3.0)
    plt.scatter(centers2D[:,0], 
                centers2D[:,1], 
                marker='x', 
                s=200, 
                linewidths=3, 
                c='r',
                label="Cluster Centers")
    
    plt.legend()

    plt.show()

# 8. Model Evaluation

Let's finish by evaluating our model. We know we need to find some quality metrics in order to have a basis to improve.

The challenge here is that evaluating clustering results is much harder than computing the accuracy of a supervised algorithm.
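
One reason for this difficulty is that cluster numbers are arbitrary. The sketch below uses two made-up label lists, `true_topics` and `cluster_ids`, in which the clustering is perfect but the cluster numbers are swapped. It compares plain accuracy with two label-invariant scores from `sklearn.metrics`.

```python
from sklearn import metrics

# Hypothetical labels: the grouping is identical, only the numbering differs
true_topics = [0, 0, 0, 1, 1, 1]
cluster_ids = [1, 1, 1, 0, 0, 0]

accuracy = metrics.accuracy_score(true_topics, cluster_ids)
ari = metrics.adjusted_rand_score(true_topics, cluster_ids)
homogeneity = metrics.homogeneity_score(true_topics, cluster_ids)

print('Accuracy            : {:.3f}'.format(accuracy))
print('Adjusted Rand index : {:.3f}'.format(ari))
print('Homogeneity         : {:.3f}'.format(homogeneity))
```

Accuracy compares the numbers position by position and therefore reports total failure here, whereas the adjusted Rand index and homogeneity only look at which documents are grouped together and report a perfect result.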

So, for this lesson, we are not going to go into the details of this part, because it is more complex than what we plan to cover here. However, the scikit-learn website contains many explanations for those who are interested. Check the links below:

**Links:**

- [Documentation - Clustering Evaluation](http://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation)
- [Documentation - Clustering Metrics](http://scikit-learn.org/stable/modules/model_evaluation.html#clustering-metrics)

In [None]:
from sklearn import metrics

labels = train_target

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, model.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, model.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, model.labels_))
print("Adjusted Rand-Index: %.3f" % metrics.adjusted_rand_score(labels, model.labels_))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(tfidf_matrix, model.labels_, sample_size=1000))


In [None]:
# Extract features from the text data


In [None]:
newsgroup_test['text'][0]

In [None]:
newsgroup_test['text'][1]

In [None]:
newsgroup_test['text'][2]

# Trying different models and different data

Overview of categories:

- alt.atheism
- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- misc.forsale
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- soc.religion.christian
- talk.politics.guns
- talk.politics.mideast
- talk.politics.misc
- talk.religion.misc



In [None]:
categories = ['sci.space', 'comp.windows.x', 'rec.sport.baseball', 'soc.religion.christian']

# Request the full dataset (subset='all' combines training and test posts)
data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'), categories=categories)

In [None]:
# Initialize the TF-IDF object.
vectorizer = TfidfVectorizer(max_df=0.5, 
                             min_df=2, 
                             max_features=10000, 
                             stop_words=stopwords_english) 
#                             tokenizer=tokenize_and_stem)

# Extract features from the text data
# fetch_20newsgroups stores the raw posts under the 'data' key
tfidf = vectorizer.fit_transform( data['data'] )

In [None]:
# Define the number of clusters
# We fetched 4 categories, so we are going to use 4 clusters
n_clusters = len(categories)

model_2 = create_cluster(tfidf, n_clusters)

# Get the computed clusters for each document / row
clusters = model_2.labels_.tolist()

In [None]:
print_clusters(model_2, vectorizer, n_clusters)

In [None]:
plot_cluster(tfidf, model_2)