# Clustering News Articles to Predict Categories

### Load Data

First, let's load the 20 Newsgroups data, which contains news articles and their categorical labels. We picked 5 newsgroups for this exercise.

In [1]:
from sklearn.datasets import fetch_20newsgroups

categories = ['comp.graphics', 'rec.motorcycles', 'sci.space', 'talk.politics.mideast', 'talk.religion.misc']

print("Loading 20 newsgroups dataset for categories:")
print(categories)

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

Loading 20 newsgroups dataset for categories:
['comp.graphics', 'rec.motorcycles', 'sci.space', 'talk.politics.mideast', 'talk.religion.misc']


### Text feature preprocessing

Our text processing is similar to the one in the k-NN exercise.

Scikit-learn provides a class for computing TF-IDF features, called `TfidfVectorizer`. We are going to create a `TfidfVectorizer` object and use its `fit_transform` method to generate the input matrix for our clustering algorithm.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, max_features=2500, stop_words='english', use_idf=True, sublinear_tf=True)
x = vectorizer.fit_transform(dataset.data)
y = dataset['target']
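To make the shape of the vectorizer output concrete, here is a minimal sketch on a made-up three-document corpus. The corpus and the `toy_` names are illustrative assumptions and are not part of the exercise data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "the rocket reached orbit",
    "the motorcycle engine roared",
    "the rocket engine failed",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_x = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per retained term.
print(toy_x.shape)

# English stop words such as "the" are dropped from the vocabulary.
print(sorted(toy_vectorizer.vocabulary_))

# With the default norm='l2', every row has unit Euclidean length.
toy_row_norms = np.sqrt(toy_x.multiply(toy_x).sum(axis=1))
print(toy_row_norms)
```

The matrix `x` built above has the same layout: one sparse row per article, with at most 2500 columns because of `max_features=2500`.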

### K-Means Clustering

#### 1. Let's dive in and run a KMeans clustering algorithm to cluster our data `x`.

Please create a variable called `km` to hold your KMeans object. We will use it later.
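If you are unsure where to start, the sketch below fits scikit-learn's `KMeans` on a tiny made-up corpus. The documents, the `toy_` names, and the parameter values (`n_init=10`, `max_iter=100`, `random_state=0`) are assumptions chosen for illustration, not the required settings for this exercise.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: two space documents, two motorcycle documents.
toy_docs = [
    "rocket launch reached orbit",
    "satellite orbit around the moon",
    "motorcycle engine and helmet",
    "new helmet for my motorcycle ride",
]
toy_x = TfidfVectorizer().fit_transform(toy_docs)

# KMeans accepts the sparse TF-IDF matrix directly.
toy_km = KMeans(n_clusters=2, init='k-means++', n_init=10,
                max_iter=100, random_state=0)
toy_km.fit(toy_x)

# One cluster label per document, and one centroid per cluster.
print(toy_km.labels_)
print(toy_km.cluster_centers_.shape)
```

For the exercise itself, one option is to make the same call with `n_clusters=len(categories)`, fit it on `x`, and store the result in `km`.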

#### 2. Measuring the performance of the clusters

To save you some time, we have written these performance metrics for you.

In [None]:
from sklearn import metrics

print("Homogeneity: %0.3f" % metrics.homogeneity_score(y, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(y, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(y, km.labels_))
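To build intuition for these three scores, here is a small worked example with hand-made labels (the `toy_` arrays are assumptions for illustration). The predicted clustering splits one true class into two clusters, so every cluster is pure (perfect homogeneity), but one class is spread over several clusters (imperfect completeness).

```python
from sklearn import metrics

# Hypothetical ground truth: two classes with two samples each.
toy_true = [0, 0, 1, 1]
# Hypothetical clustering: class 1 is split across clusters 1 and 2.
toy_pred = [0, 0, 1, 2]

toy_h = metrics.homogeneity_score(toy_true, toy_pred)
toy_c = metrics.completeness_score(toy_true, toy_pred)
toy_v = metrics.v_measure_score(toy_true, toy_pred)

print("Homogeneity: %0.3f" % toy_h)    # 1.000
print("Completeness: %0.3f" % toy_c)   # 0.667
print("V-measure: %0.3f" % toy_v)      # 0.800

# The scores ignore how clusters are numbered: swapping the
# cluster ids of a perfect clustering still scores 1.0.
toy_swapped = metrics.v_measure_score(toy_true, [1, 1, 0, 0])
print("V-measure with swapped ids: %0.3f" % toy_swapped)
```

The V-measure is the harmonic mean of homogeneity and completeness, which is why it falls between the two values here.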

#### 3. Calculate the Silhouette Coefficient of these clusters

In [None]:
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(x, km.labels_, sample_size=1000))
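The Silhouette Coefficient does not need the true labels: for each sample it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below uses four made-up 2-D points (an assumption for illustration) to show a good and a bad assignment.

```python
import numpy as np
from sklearn import metrics

# Hypothetical points: two tight pairs that are far apart.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0],
                       [10.0, 0.0], [10.0, 1.0]])

# Labels that match the two pairs: score close to 1.
toy_good = metrics.silhouette_score(toy_points, [0, 0, 1, 1])

# Labels that mix the pairs: score below 0.
toy_bad = metrics.silhouette_score(toy_points, [0, 1, 0, 1])

print("Good assignment: %0.3f" % toy_good)
print("Bad assignment: %0.3f" % toy_bad)
```

Note that with `sample_size=1000` the score on `x` is computed on a random subsample, so it can vary slightly between runs unless you also pass `random_state` to `silhouette_score`.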

#### 4. Run steps 1-3 again and see whether you get different results.
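K-means starts from randomly chosen centroids, so separate runs can converge to different solutions unless `random_state` is fixed. The sketch below illustrates this on made-up random 2-D points. The data, the `run_once` helper, and the choice of `init='random'` with `n_init=1` are assumptions made to expose the effect of the seed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 200 uniformly random points in the unit square.
rng = np.random.RandomState(0)
toy_points = rng.uniform(size=(200, 2))

def run_once(seed):
    """Fit KMeans once with the given seed and return its inertia."""
    model = KMeans(n_clusters=5, init='random', n_init=1, random_state=seed)
    model.fit(toy_points)
    return model.inertia_

# Different seeds may end in different local optima.
for seed in (0, 1, 2):
    print("seed %d: inertia %0.3f" % (seed, run_once(seed)))

# The same seed reproduces the same result.
print(np.isclose(run_once(0), run_once(0)))
```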

#### 5. Let's look at the terms within each cluster. From the result below, do you think the algorithm is doing a good job at clustering news?

You can get the terms associated with the columns of `x` by running this command:

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
print(terms[500:520])

For each cluster, this command takes the centroid vector and sorts it from the largest value to the smallest. It returns the indices of the terms, ordered from the highest to the lowest centroid weight.

In [None]:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
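To see what `argsort()[:, ::-1]` does, here is a small example on a made-up centroid matrix with two clusters and four terms (the values and the `toy_` names are assumptions for illustration).

```python
import numpy as np

# Hypothetical vocabulary and centroid weights (rows = clusters).
toy_terms = ['bike', 'orbit', 'ride', 'space']
toy_centers = np.array([[0.0, 0.7, 0.1, 0.5],
                        [0.6, 0.1, 0.4, 0.0]])

# argsort() orders indices by ascending weight within each row;
# [:, ::-1] reverses every row so the largest weight comes first.
toy_order = toy_centers.argsort()[:, ::-1]
print(toy_order)
# [[1 3 2 0]
#  [0 2 1 3]]

# Map the two highest-weight indices of each cluster back to terms.
toy_top = [[toy_terms[ind] for ind in row[:2]] for row in toy_order]
print(toy_top)
# [['orbit', 'space'], ['bike', 'ride']]
```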

Then we can print the highest-weighted terms in each cluster.

In [None]:
print("Top terms per cluster:")

# Iterate over the clusters actually fitted, in case km was not built with len(categories) clusters
for i in range(km.n_clusters):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()