## Introduction:
In this workshop we show you an example of a workflow in data science from initial data ingestion, cleaning, modeling, and ultimately clustering. In this example we scrape the news feed of of [NIST](www.nist.gov). For those not in the know, NIST is the National Institute of Standards and Technology. It is comprised of multiple research centers which include: 
* Center for Nanoscale Science and Technology (CNST)
* Engineering Laboratory (EL)
* Information Technology Laboratory (ITL)
* NIST Center for Neutron Research (NCNR)
* Material Measurement Laboratory (MML)
* Physical Measurement Laboratory (PML)

This makes it an easy target in topic modeling.

You can use also this guide to scrape other data from a webpage: http://docs.python-guide.org/en/latest/scenarios/scrape/

# Clustering NIST headlines and description

### Import the necessary modules for the workshop. 
* [lxml](http://lxml.de/) is a package for processing XML and HTML
    - If trouble installing on OSX, try running 'xcode-select --install'
* [requests](http://docs.python-requests.org/en/master/) is a package for processing HTTP requests
* [future](https://docs.python.org/2/library/__future__.html) to make a print function
* [scikit-learn](http://scikit-learn.org/stable/index.html) is a package with broad tool sets for machine learning
    - [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to vectorize raw documents into a TF-IDF matrix
    - [KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)
    - [MiniBatchKMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)
* [time](https://docs.python.org/2/library/time.html) to time our clustering
* [wordcloud](https://github.com/amueller/word_cloud) to generatea visualization our data
* [matplotlib](http://matplotlib.org/) to show the resulting image

In [1]:
from lxml import html
import requests
from __future__ import print_function
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, MiniBatchKMeans
from time import time
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
import matplotlib.pyplot as plt

### Building the list of headlines and descriptions

We request NIST news based on the following URL, 'http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014'. For this workshop, we look at only 2014 news articles posted on the NIST website. 

We then pass that retrieved content to our HTML parser and search for a specific div class, "select_portal_module_wrapper" which is assigned to every headline and every description. The difference being that headlines receive a strong tag and descriptions receive a p tag.

We then merge both the headline and description into one entry in the list because we don't need to differentiate between title and description

In [2]:
print("Retrieving data from NIST")

#retrieve the data from the web page
page = requests.get('http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014') 
#use html module to parse it out and store in tree
tree = html.fromstring(page.content)

#create list of news headlines and descriptions. 
#This required obtaining the XPath of the elements by examining the web page.
list_of_headlines = tree.xpath('//div[@class="select_portal_module_wrapper"]/a/strong/text()')
list_of_descriptions = tree.xpath('//div[@class="select_portal_module_wrapper"]/p/text()')
#combine each headline and description into one value in a list
news=[]
for each_headline in list_of_headlines:
	for each_description in list_of_descriptions:
		news.append(each_headline+each_description)

print("Last item in list retrieved: %s" % news[-1])

Retrieving data from NIST
Last item in list retrieved: A New NIST Online Database: The NIST Polycyclic Aromatic Hydrocarbon Structure Index Recently, a new website containing a wealth of information on polycyclic aromatic hydrocarbons (PAHs) was made publicly available by NIST. PAHs are compounds that are produced during the … 


# Visually explore the NIST News

Before we classify our NIST news dataset, we can visually explore the data that we pulled.

In [10]:
stopwords = STOPWORDS.copy()
stopwords.add('nist')
stopwords.add('national')
stopwords.add('institute')
stopwords.add('standard')
stopwords.add('standards')
stopwords.add('technology')
stopwords.add('new')
wordcloud = WordCloud(background_color='#2c3e50',
                      width=1280, 
                      height=800, 
                      scale=1, 
                      stopwords=stopwords).generate(''.join(news))
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.savefig('word_cloud.png', dpi=100)

![wordcloud](word_cloud.png)

### Convert collection of documents to TF-IDF matrix

We now call a TF-IDF vectorizer to create a sparse matrix with term frequency-inverse document frequency weights. We constrain vectorizer by:
* the maximum document frequency to half the total documents,
* the minimum document frequency to two documents,
* and toss out common english stop words.

We also time the whole thing to see how long it takes to vectorize it.

In [4]:
print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
#create a sparse word occurrence frequency matrix of the most frequent words
vectorizer = TfidfVectorizer(input=news, max_df=0.5, min_df=2, stop_words='english')
X = vectorizer.fit_transform(news) 

print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

Extracting features from the training dataset using a sparse vectorizer
done in 4.046627s
n_samples: 110224, n_features: 12197



### Let's do some clustering

We cheat and set the number of clusters to 15 since we know there are 15 subject areas in NIST. We call the KMeans classifier from sklearn and set an upper bound to the number of iterations for fitting the data to the classifier. We again time the process to see how long it takes to fit. Finally we list out each centroid and the top 10 terms associated with each centroid.

In [5]:
k = 15
km = KMeans(n_clusters=k, init='k-means++', max_iter=100)

print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X) 								#what's happening here??
print("done in %0.3fs" % (time() - t0))
print()

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()
for i in range(k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=15, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
done in 85.226s

Top terms per cluster:
Cluster 0: chip initiative image pure 2014 ready gas time measurements world
Cluster 1: scientists produced crystal case using ptir photothermal lateral pahs resonance
Cluster 2: research technology committee institute national standards visiting advanced agency primary
Cluster 3: cnst nanoscale center test developed new used technology laser science
Cluster 4: baldrige excellence award performance program malcolm quality organizations 2014 penny
Cluster 5: department commerce released technology standards national institute report president year
Cluster 6: forensic science committees osac organization area scientific members new standards
Cluster 7: 00 dna wednesday sequencer et participate analysts webinar invited free
Cluster 8: award house researchers