---

# Data Mining: 
### Exercises - Topic modeling

---

We extract topics from unstructured text.

In [None]:
from sklearn.feature_extraction import text
from sklearn import datasets, decomposition

n_samples = 1000
n_features = 1000
n_topics = 6
n_top_words = 20

# Loading the dataset
The 20 Newsgroups data set

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian).



In [None]:
dataset = datasets.fetch_20newsgroups(shuffle=True, random_state=1)

# From News to Feature Vectors
We need to transform our text data into feature vectors, numerical representations which are suitable for performing statistical analysis. The most common way to do this is to apply a bag-of-words approach, where the number of times a word occurs in a document becomes a feature for our model.
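
As a minimal sketch of the bag-of-words idea, the following builds count vectors by hand for two made-up sentences. The toy documents and the helper name `bag_of_words` are illustrative assumptions, not part of the exercise data; whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Two made-up documents (assumption: whitespace tokenization is good enough here)
docs = [
    "the shuttle reached the moon",
    "the chip uses a secret key",
]

# Vocabulary: every distinct token, sorted so the column order is reproducible
vocabulary = sorted({token for doc in docs for token in doc.split()})

def bag_of_words(doc, vocabulary):
    """Return the count of each vocabulary term in doc (hypothetical helper)."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes one row and each vocabulary term one column; word order is discarded, which is what "bag" refers to.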


## Term Frequency-Inverse Document Frequency

We want to consider the relative importance of particular words, so we'll use term frequency–inverse document frequency (tf–idf) as a weighting factor. This controls for the fact that some words occur in almost every document and therefore say little about any particular topic.

## Mathematical details

tf–idf is the product of two statistics, term frequency and inverse document
frequency. Various ways for determining the exact values of both statistics
exist. In the case of the **term frequency** $\mathrm{tf}(t,d)$, the simplest
choice is to use the *raw frequency* of a term in a document, i.e. the
number of times that term $t$ occurs in document $d$. If we denote the raw
frequency of $t$ by $\mathrm{f}(t,d)$, then the simple tf scheme is
$\mathrm{tf}(t,d) = \mathrm{f}(t,d)$. Other possibilities
include:

  * Boolean "frequencies": $\mathrm{tf}(t,d) = 1$ if $t$ occurs in $d$ and 0 otherwise; 
  * logarithmically scaled frequency: $\mathrm{tf}(t,d) = \log(\mathrm{f}(t,d) + 1)$; 
  * augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the maximum raw frequency of any term in the document: $\mathrm{tf}(t,d) = 0.5 + \frac{0.5 \times \mathrm{f}(t, d)}{\max\{\mathrm{f}(w, d):w \in d\}}$
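
The four term-frequency schemes above can be compared side by side. This is a sketch only: the function name `tf_variants` and the token list are made up for illustration, and the document is assumed to be non-empty.

```python
import math
from collections import Counter

def tf_variants(term, tokens):
    """Compute the tf schemes listed above for one term in one tokenized document.

    Hypothetical helper; assumes tokens is a non-empty list of strings.
    """
    counts = Counter(tokens)
    raw = counts[term]                  # f(t, d): raw frequency
    max_raw = max(counts.values())      # max{f(w, d) : w in d}
    return {
        "raw": raw,
        "boolean": 1 if raw > 0 else 0,
        "log": math.log(raw + 1),
        "augmented": 0.5 + 0.5 * raw / max_raw,
    }

tokens = "space shuttle launch space station space".split()
space_tf = tf_variants("space", tokens)
shuttle_tf = tf_variants("shuttle", tokens)
print(space_tf)
print(shuttle_tf)
```

Because "space" is the most frequent term in this toy document, its augmented frequency is exactly 1, while every term that occurs at least once scores above 0.5.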

The **inverse document frequency** is a measure of whether the term is
common or rare across all documents. It is obtained by dividing the total
number of documents by the number of documents containing the
term, and then taking the logarithm of that quotient.

$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D: t \in d\}|}$$

with

  * $|D| $: cardinality of D, or the total number of documents in the corpus 
  * $|\{d \in D: t \in d\}|$ : number of documents where the term $t$ appears (i.e., $\mathrm{tf}(t,d) \neq 0$). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to $1 + |\{d \in D: t \in d\}|$. 

Mathematically the base of the log function does not matter and constitutes a
constant multiplicative factor towards the overall result.
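
The change-of-base identity makes this explicit. For any base $b > 1$,

$$\mathrm{idf}_b(t, D) = \log_b \frac{|D|}{|\{d \in D: t \in d\}|} = \frac{1}{\ln b} \, \ln \frac{|D|}{|\{d \in D: t \in d\}|}$$

so switching the base rescales every idf value by the same factor $1/\ln b$ and leaves the ranking of terms unchanged.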

Then tf–idf is calculated as

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t, D)$$
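
To see the formulas at work, here is a small worked example using raw term frequency and the unsmoothed idf defined above. The three tokenized documents and the function names are made up for illustration.

```python
import math

# Toy corpus: each document is already tokenized (illustrative data only)
docs = [
    ["space", "shuttle", "space"],
    ["space", "chip"],
    ["chip", "key", "key"],
]

def tf(term, doc):
    """Raw frequency f(t, d)."""
    return doc.count(term)

def idf(term, docs):
    """log(|D| / number of documents containing the term).

    Raises ZeroDivisionError for a term that appears in no document,
    which is the case the 1 + ... adjustment is meant to avoid.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

space_score = tfidf("space", docs[0], docs)      # frequent in doc 0, but in 2 of 3 docs
shuttle_score = tfidf("shuttle", docs[0], docs)  # once in doc 0, and only there
print(space_score, shuttle_score)
```

Although "space" occurs twice in the first document and "shuttle" only once, "shuttle" receives the higher weight because it appears in a single document of the corpus.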

In [None]:
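# Vectorize the data, keeping only the n_features most frequent terms.
# max_df=0.95 drops terms that occur in more than 95% of the documents,
# and stop_words='english' removes common English stop words.
# The raw counts are then re-weighted with TF-IDF.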

vectorizer = text.CountVectorizer(max_df=0.95,
                                  max_features=n_features,
                                  stop_words='english')
counts = vectorizer.fit_transform(dataset.data[:n_samples])
tfidf = text.TfidfTransformer().fit_transform(counts)
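
Note that `TfidfTransformer` with its default settings does not use the textbook formula exactly: according to the scikit-learn documentation it applies a smoothed idf, $\ln\frac{1 + |D|}{1 + \mathrm{df}(t)} + 1$, and then scales each document vector to unit Euclidean length. The sketch below reproduces that computation by hand on a made-up three-document corpus and compares it with the library output.

```python
import numpy as np
from sklearn.feature_extraction import text

# Toy corpus (illustrative data only)
docs = ["space shuttle space", "space chip", "chip key key"]

counts = text.CountVectorizer().fit_transform(docs)
tfidf_sklearn = text.TfidfTransformer().fit_transform(counts).toarray()

# Manual computation under the default settings (smooth_idf=True, norm='l2')
dense = counts.toarray().astype(float)
n_docs = dense.shape[0]
df = (dense > 0).sum(axis=0)                   # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed idf
manual = dense * idf                           # tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

print(np.allclose(manual, tfidf_sklearn))
```

The smoothing means no term ever gets a weight of exactly zero from the idf factor, and the normalization keeps long and short documents on a comparable scale.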

# Topics extraction

In [None]:
# Fit the NMF model
print("Fitting the NMF model with n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
nmf = decomposition.NMF(n_components=n_topics)

nmf.fit(tfidf)
# W holds the document-topic weights, H the topic-term weights
W = nmf.transform(tfidf)
H = nmf.components_

Fitting the NMF model with n_samples=1000 and n_features=1000...


In [None]:
H.shape

(6, 1000)
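
NMF approximates a non-negative documents-by-terms matrix $V$ by a product $W H$ of two smaller non-negative matrices: `W` has one row per document and one column per topic, and `H` has one row per topic and one column per term, which is why `H.shape` is `(n_topics, n_features)`. The following sketch illustrates this on a tiny hand-made matrix with two obvious "topics"; the matrix and the choices `init="nndsvda"`, `random_state=0` and `max_iter=1000` are assumptions made for reproducibility, not settings from the exercise.

```python
import numpy as np
from sklearn import decomposition

# Toy documents-by-terms matrix: rows 0-1 use only terms 0-1,
# rows 2-3 use only terms 2-3, and row 4 mixes both groups.
V = np.array([
    [2.0, 2.0, 0.0, 0.0],
    [4.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
])

model = decomposition.NMF(n_components=2, init="nndsvda",
                          random_state=0, max_iter=1000)
W_toy = model.fit_transform(V)   # document-topic weights, shape (5, 2)
H_toy = model.components_        # topic-term weights, shape (2, 4)

# The column with the largest weight in each row of W is that document's dominant topic
dominant_topic = W_toy.argmax(axis=1)

print(W_toy.shape, H_toy.shape)
print(dominant_topic)
```

The same `argmax` over the rows of `W` can be applied to the newsgroup data to assign each post to its strongest topic.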

In [None]:
# Map each column index back to its vocabulary term
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = vectorizer.get_feature_names_out()

Show the top `n_top_words` words in each topic.

In [None]:
for topic_idx, topic in enumerate(H):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

Topic #0:
people god don think like just good time way know say life israel jesus christian bible want did does going

Topic #1:
edu university cs host posting article nntp writes cc reply distribution cwru state uiuc game john washington new baseball michael

Topic #2:
com hp article writes netcom sun corp stratus posting ca nntp host portal news jim att org distribution systems support

Topic #3:
windows uk ac help drive problem thanks use monitor dos software using card file window mail color application pc drivers

Topic #4:
clipper key chip encryption government keys public secure use enforcement house law secret brad standard algorithm phone people pat security

Topic #5:
nasa gov space jpl center research shuttle moon program laboratory earth distribution henry brian data article sci world long posting
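
The slice `topic.argsort()[:-n_top_words - 1:-1]` used above is compact but easy to misread: `argsort()` returns indices in ascending order of weight, and the negative-step slice walks backwards from the end to pick the `n_top_words` largest. A small sketch with made-up weights and terms shows that it matches the more explicit "sort descending, take the first n" form.

```python
import numpy as np

# Made-up topic weights and vocabulary (illustrative only)
weights = np.array([0.1, 0.7, 0.0, 0.4, 0.2])
terms = ["chip", "space", "key", "moon", "drive"]
n_top = 2

# Form used in the loop above: ascending argsort, then read the tail backwards
top_by_slice = weights.argsort()[:-n_top - 1:-1]

# Equivalent, more explicit form: sort by negated weight, keep the first n_top
top_by_negation = np.argsort(-weights)[:n_top]

top_terms = [terms[i] for i in top_by_slice]
print(top_by_slice, top_by_negation, top_terms)
```

With tied weights the two forms may order the tied indices differently, so the equivalence is only guaranteed when the top weights are distinct.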
