In [1]:
# SparkContext is already defined as sc
HDFS = 'hdfs://scut0:9000/'

# Extracting the right features from your data

## Introduction to Feature hashing

Feature hashing is a technique to deal with **high-dimensional data** and is often **used with text and categorical datasets** where the features can take on many unique values (often many millions of values)

Up until now, we have often used a simple approach of collecting the distinct feature values and zipping this collection with
a set of indices to create a map of feature value to index. This mapping is then
broadcast (either explicitly in our code or implicitly by Spark) to each worker.

When dealing with huge feature dimensions in the tens of millions or more
that are common when working with text, this approach can be slow and can require
signifcant memory and network resources, both on the Spark master (to collect the
unique values) and workers (to broadcast the resulting mapping to each worker,
which keeps it in memory to allow it to apply the feature encoding to its local piece
of the input data)


Also, building and using 1-of-K feature encoding requires us to keep a mapping of each possible feature value to an index in a vector. Furthermore, the process of creating the mapping itself requires at least one additional pass through the dataset and can be tricky to do in parallel scenarios.


**Feature hashing works by assigning the vector index for a feature based on the value obtained by hashing this feature to a number (usually, an integer value) using a hash function. **

This encoding works the same way as mapping-based encoding, except that we choose a size for our feature vector upfront.

Feature hashing has the advantage that we do not need to build a mapping and keep it in memory. It is also easy to implement, very fast, and can be done online and in real time, thus not requiring a pass through our dataset frst. 

However, there are two important drawbacks

1. As we don't create a mapping of features to index values, we also cannot do
the reverse mapping of feature index to value. This makes it harder to, for
example, determine which features are most informative in our models.

2. As we are restricting the size of our feature vectors, we might experience
hash collisions. This happens when two different features are hashed into
the same index in our feature vector. **Surprisingly, this doesn't seem to have
a severe impact on model performance as long as we choose a reasonable
feature vector dimension relative to the dimension of the input data**


## Extracting the TF-IDF features from the 20 Newsgroups dataset

To illustrate the concepts in this chapter, we will use a well-known text dataset called
20 Newsgroups; this dataset is commonly used for text-classifcation tasks. This is a
collection of newsgroup messages posted across 20 different topics. There are various
forms of data available. For our purposes, we will use the bydate version of the
dataset, which is available at http://qwone.com/~jason/20Newsgroups.

### Exploring the 20 Newsgroups data

In [2]:
textFilePairs = sc.wholeTextFiles(HDFS + '20_newsgroup/*')
print(textFilePairs.count())
print(textFilePairs.first())

19997
(u'hdfs://scut0:9000/20_newsgroup/alt.atheism/49960', u'Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126\nPath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew\nFrom: mathew <mathew@mantis.co.uk>\nNewsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers\nSubject: Alt.Atheism FAQ: Atheist Resources\nSummary: Books, addresses, music -- anything related to atheism\nKeywords: FAQ, atheism, books, music, fiction, addresses, contacts\nMessage-ID: <19930329115719@mantis.co.uk>\nDate: Mon, 29 Mar 1993 11:57:19 GMT\nExpires: Thu, 29 Apr 1993 11:57:19 GMT\nFollowup-To: alt.atheism\nDistribution: world\nOrganization: Mantis Consultants, Cambridge. UK.\nApproved: news-answers-request@mit.edu\nSupersedes: <19930301143317@mantis.co.uk>\nLines: 290\n\nArchive-name: a

In [3]:
newsGroup = textFilePairs.map(lambda (path, content) : (path.split('/')[-2], 1)).reduceByKey(lambda a, b: a + b)
print(newsGroup.count())
sortedNewsGroup = newsGroup.sortBy(lambda x:x[1], ascending = False)
for ng in sortedNewsGroup.collect():
    print(ng)

20
(u'sci.crypt', 1000)
(u'comp.sys.mac.hardware', 1000)
(u'sci.med', 1000)
(u'comp.windows.x', 1000)
(u'misc.forsale', 1000)
(u'talk.politics.guns', 1000)
(u'comp.os.ms-windows.misc', 1000)
(u'sci.space', 1000)
(u'rec.sport.baseball', 1000)
(u'rec.motorcycles', 1000)
(u'talk.politics.misc', 1000)
(u'comp.graphics', 1000)
(u'talk.religion.misc', 1000)
(u'talk.politics.mideast', 1000)
(u'comp.sys.ibm.pc.hardware', 1000)
(u'alt.atheism', 1000)
(u'rec.sport.hockey', 1000)
(u'sci.electronics', 1000)
(u'rec.autos', 1000)
(u'soc.religion.christian', 997)


### Tokenization

In [4]:
# split the text and convert all words to lowercase
texts = textFilePairs.map(lambda (path, text):text.encode('utf8'))
totalWords = texts.flatMap(lambda text:map(lambda word:word.lower(), text.replace('\n', ' ').split()))
print(totalWords.distinct().count())
print(totalWords.take(100))

425542
['xref:', 'cantaloupe.srv.cs.cmu.edu', 'alt.atheism:49960', 'alt.atheism.moderated:713', 'news.answers:7054', 'alt.answers:126', 'path:', 'cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew', 'from:', 'mathew', '<mathew@mantis.co.uk>', 'newsgroups:', 'alt.atheism,alt.atheism.moderated,news.answers,alt.answers', 'subject:', 'alt.atheism', 'faq:', 'atheist', 'resources', 'summary:', 'books,', 'addresses,', 'music', '--', 'anything', 'related', 'to', 'atheism', 'keywords:', 'faq,', 'atheism,', 'books,', 'music,', 'fiction,', 'addresses,', 'contacts', 'message-id:', '<19930329115719@mantis.co.uk>', 'date:', 'mon,', '29', 'mar', '1993', '11:57:19', 'gmt', 'expires:', 'thu,', '29', 'apr', '1993', '11:57:19', 'gmt', 'followup-to:', 'alt.atheism', 'distribution:', 'world', 'organization:', 'mantis', 'consultants,', 'cambridge.', 'uk.',

In [5]:
# The preceding simple approach results in a lot of tokens and does not flter out many nonword characters 
#  We can do this by splitting each raw document on nonword characters using a regular expression pattern
import re
noPunctuationWords = texts.flatMap(lambda text:map(lambda word:word.lower(), re.split('\W+', text)))
print(noPunctuationWords.take(100))

['xref', 'cantaloupe', 'srv', 'cs', 'cmu', 'edu', 'alt', 'atheism', '49960', 'alt', 'atheism', 'moderated', '713', 'news', 'answers', '7054', 'alt', 'answers', '126', 'path', 'cantaloupe', 'srv', 'cs', 'cmu', 'edu', 'crabapple', 'srv', 'cs', 'cmu', 'edu', 'bb3', 'andrew', 'cmu', 'edu', 'news', 'sei', 'cmu', 'edu', 'cis', 'ohio', 'state', 'edu', 'magnus', 'acs', 'ohio', 'state', 'edu', 'usenet', 'ins', 'cwru', 'edu', 'agate', 'spool', 'mu', 'edu', 'uunet', 'pipex', 'ibmpcug', 'mantis', 'mathew', 'from', 'mathew', 'mathew', 'mantis', 'co', 'uk', 'newsgroups', 'alt', 'atheism', 'alt', 'atheism', 'moderated', 'news', 'answers', 'alt', 'answers', 'subject', 'alt', 'atheism', 'faq', 'atheist', 'resources', 'summary', 'books', 'addresses', 'music', 'anything', 'related', 'to', 'atheism', 'keywords', 'faq', 'atheism', 'books', 'music', 'fiction', 'addresses', 'contacts', 'message', 'id']


In [6]:
# filter out string with digits
noDigitWords = noPunctuationWords.filter(lambda word: not re.search(r'\d', word))
print(noDigitWords.take(100))

['xref', 'cantaloupe', 'srv', 'cs', 'cmu', 'edu', 'alt', 'atheism', 'alt', 'atheism', 'moderated', 'news', 'answers', 'alt', 'answers', 'path', 'cantaloupe', 'srv', 'cs', 'cmu', 'edu', 'crabapple', 'srv', 'cs', 'cmu', 'edu', 'andrew', 'cmu', 'edu', 'news', 'sei', 'cmu', 'edu', 'cis', 'ohio', 'state', 'edu', 'magnus', 'acs', 'ohio', 'state', 'edu', 'usenet', 'ins', 'cwru', 'edu', 'agate', 'spool', 'mu', 'edu', 'uunet', 'pipex', 'ibmpcug', 'mantis', 'mathew', 'from', 'mathew', 'mathew', 'mantis', 'co', 'uk', 'newsgroups', 'alt', 'atheism', 'alt', 'atheism', 'moderated', 'news', 'answers', 'alt', 'answers', 'subject', 'alt', 'atheism', 'faq', 'atheist', 'resources', 'summary', 'books', 'addresses', 'music', 'anything', 'related', 'to', 'atheism', 'keywords', 'faq', 'atheism', 'books', 'music', 'fiction', 'addresses', 'contacts', 'message', 'id', 'mantis', 'co', 'uk', 'date', 'mon']


### Remove stop words

We can take a look at some of the tokens in our corpus that have the highest occurrence 
across all documents to get an idea about some other stop words to exclude

In [7]:
wordCount = noDigitWords.map(lambda word:(word, 1)).reduceByKey(lambda a, b : a+b)
sortedWordCount = wordCount.sortBy(lambda (k, v):v, ascending = False)
print(sortedWordCount.take(20))

[('the', 256555), ('edu', 164007), ('to', 133963), ('of', 122352), ('a', 111811), ('and', 102358), ('i', 92113), ('in', 87008), ('is', 75575), ('that', 70765), ('ax', 62416), ('it', 58816), ('cmu', 52409), ('for', 50392), ('com', 50158), ('you', 48181), ('cs', 45142), ('from', 39705), ('s', 38681), ('on', 35559)]


In [8]:
stopWords = {"the","a","s","an","of","or","in","for","by","on","but", "is", "not","with", "as", "was", "if",
             "they", "are", "this", "and", "it", "have", "from", "at", "my","be", "that", "to"}
print('before filtering stop words: {0}'.format(sortedWordCount.count()))
filteredSortedWordCount = sortedWordCount.filter(lambda (k, v): k not in stopWords)
print('after filtering stop words:{0}'.format(filteredSortedWordCount.count()))
print(filteredSortedWordCount.take(20))
rareWords = set(filteredSortedWordCount.filter(lambda (k,v): v==1).collect())
print(type(rareWords), len(rareWords))

before filtering stop words: 111525
after filtering stop words:111496
[('edu', 164007), ('i', 92113), ('ax', 62416), ('cmu', 52409), ('com', 50158), ('you', 48181), ('cs', 45142), ('news', 34309), ('srv', 32359), ('t', 32121), ('cantaloupe', 26048), ('net', 25459), ('message', 21954), ('subject', 21589), ('lines', 20894), ('date', 20787), ('id', 20695), ('apr', 20510), ('newsgroups', 20404), ('path', 20369)]
(<type 'set'>, 40808)


One other fltering step that we will use is removing any tokens that are only
one character in length. The reasoning behind this is similar to removing stop
words—these single-character tokens are unlikely to be informative in our text
model and can further reduce the feature dimension and model size

In [9]:
filteredSortedWordCount = filteredSortedWordCount.filter(lambda (k, v): len(k) > 1)
print(filteredSortedWordCount.count())
print(filteredSortedWordCount.take(20))

111470
[('edu', 164007), ('ax', 62416), ('cmu', 52409), ('com', 50158), ('you', 48181), ('cs', 45142), ('news', 34309), ('srv', 32359), ('cantaloupe', 26048), ('net', 25459), ('message', 21954), ('subject', 21589), ('lines', 20894), ('date', 20787), ('id', 20695), ('apr', 20510), ('newsgroups', 20404), ('path', 20369), ('can', 20028), ('organization', 19840)]


### Excluding terms based on frequency

It is also a common practice to exclude terms during tokenization when their overall
occurrence in the corpus is very low

In [10]:
print(filteredSortedWordCount.takeOrdered(20, lambda (k, v):v))

[('yermut', 1), ('sowell', 1), ('trawling', 1), ('kalmar', 1), ('jjjjjjc', 1), ('igua', 1), ('propounded', 1), ('_xogkyrzaup', 1), ('naturopathic', 1), ('neimoller', 1), ('_kk', 1), ('_km', 1), ('mc_rssqp_fqod', 1), ('inpropable', 1), ('fachtagung', 1), ('macedoine', 1), ('durocher', 1), ('antannas', 1), ('mycdb', 1), ('rlii', 1)]


As we can see, there are many terms that only occur once in the entire corpus.
Since typically we want to use our extracted features for other tasks such as
document similarity or machine learning models, tokens that only occur once are
not useful to learn from, as we will not have enough training data relative to these
tokens. We can apply another filter to exclude these rare tokens

In [11]:
print('before filtering words that appear once: {0}'.format(filteredSortedWordCount.count()))
filteredSortedWordCount = filteredSortedWordCount.filter(lambda (k, v) : v > 1)
print('after filtering words that appear once:{0}'.format(filteredSortedWordCount.count()))
print(filteredSortedWordCount.takeOrdered(20, lambda (k, v):v))

before filtering words that appear once: 111470
after filtering words that appear once:70662
[('tilton', 2), ('netcdf', 2), ('outragious', 2), ('fawr', 2), ('sation', 2), ('yougoslavie', 2), ('gruel', 2), ('originality', 2), ('_ki', 2), ('xjudging', 2), ('lmx', 2), ('centimeter', 2), ('phenylanine', 2), ('wiseguy', 2), ('natured', 2), ('naviagtion', 2), ('vecchio', 2), ('sig_alrm', 2), ('toowoomba', 2), ('millimetres', 2)]


In [12]:
# combine all the processing procedure above to  tranfrom each document to a sequence of tokens
def tokenize(text):
    global stopWords, rareWords
    wordList = [word.lower() for word in re.split('\W+', text)]
    wordList = filter(lambda word : not re.search(r'\d', word), wordList)
    wordList = filter(lambda word : word not in stopWords, wordList)
    wordList = filter(lambda word : word not in rareWords, wordList)
    wordList = filter(lambda word : len(word) > 1, wordList)
    return wordList
    
tokens_ = texts.map(lambda text :map(lambda word:word.lower(), re.split('\W+', text)))\
               .map(lambda wordList: filter(lambda word: not re.search(r'\d', word), wordList))\
               .map(lambda wordList: filter(lambda word: word not in stopWords, wordList))\
               .map(lambda wordList: filter(lambda word: word not in rareWords, wordList))\
               .map(lambda wordList: filter(lambda word: len(word)>1, wordList))
                
tokens = texts.map(lambda text : tokenize(text))
print(tokens.count(), tokens_.count())
print(len(tokens.first()), len(tokens_.first()))
print(tokens.first())

(19997, 19997)
(1264, 1264)
['xref', 'cantaloupe', 'srv', 'cs', 'cmu', 'edu', 'alt', 'atheism', 'alt', 'atheism', 'moderated', 'news', 'answers', 'alt', 'answers', 'path', 'cantaloupe', 'srv', 'cs', 'cmu', 'edu', 'crabapple', 'srv', 'cs', 'cmu', 'edu', 'andrew', 'cmu', 'edu', 'news', 'sei', 'cmu', 'edu', 'cis', 'ohio', 'state', 'edu', 'magnus', 'acs', 'ohio', 'state', 'edu', 'usenet', 'ins', 'cwru', 'edu', 'agate', 'spool', 'mu', 'edu', 'uunet', 'pipex', 'ibmpcug', 'mantis', 'mathew', 'mathew', 'mathew', 'mantis', 'co', 'uk', 'newsgroups', 'alt', 'atheism', 'alt', 'atheism', 'moderated', 'news', 'answers', 'alt', 'answers', 'subject', 'alt', 'atheism', 'faq', 'atheist', 'resources', 'summary', 'books', 'addresses', 'music', 'anything', 'related', 'atheism', 'keywords', 'faq', 'atheism', 'books', 'music', 'fiction', 'addresses', 'contacts', 'message', 'id', 'mantis', 'co', 'uk', 'date', 'mon', 'mar', 'gmt', 'expires', 'thu', 'apr', 'gmt', 'followup', 'alt', 'atheism', 'distribution', 'w

### A note about stemming

A common step in text processing and tokenization is stemming. **This is the conversion
of whole words to a base form (called a word stem)**. For example, plurals might be
converted to singular (dogs becomes dog), and forms such as walking and walker might
become walk. Stemming can become quite complex and is typically handled with
specialized NLP or search engine software (such as NLTK, OpenNLP, and Lucene,
for example). We will ignore stemming for the purpose of our example here.

# Training a TF-IDF model

We will now use MLlib to transform each document, in the form of processed
tokens, into a vector representation. The frst step will be to **use the HashingTF
implementation, which makes use of feature hashing to map each token in the input
text to an index in the vector of term frequencies.** Then, we will compute the global
IDF and use it to transform the term frequency vectors into TF-IDF vectors

In [13]:
from pyspark.mllib.feature import HashingTF, IDF
# ref: https://spark.apache.org/docs/2.1.1/mllib-feature-extraction.html
dim = pow(2, 17) # 2^17 > 70662

# The transform function of HashingTF maps each input document 
# that is, a sequence of tokens to an MLlib Vector.
hashingTF = HashingTF(dim)
tf = hashingTF.transform(tokens)

# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes:
# First to compute the IDF vector and second to scale the term frequencies by IDF.
tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)

In [14]:
TFIDFDoc = tfidf.first()
print(type(TFIDFDoc), TFIDFDoc.size, TFIDFDoc.values.size)
print(TFIDFDoc.values[:10])
print(TFIDFDoc.indices[:10])

(<class 'pyspark.mllib.linalg.SparseVector'>, 131072, 788)
[  1.2445212   13.05128389   2.9383072    7.60080245   6.60755068
   1.53191401   2.13460871   3.18437439   2.71951683   4.32743844]
[ 180  308 1025 1358 1542 1580 1595 2263 2424 2805]


We can see that the dimension of each sparse vector of term frequencies is 262144 (or 2^18 as we specifed).
However, the number on non-zero entries in the vector is only 788. 
The last two lines of the output show the tfidf and vector indexes for the first few entries in the vector

In [15]:
# spark.mllib's IDF implementation provides an option for ignoring terms
# which occur in less than a minimum number of documents.
# In such cases, the IDF for these terms is set to 0.
# This feature can be used by passing the minDocFreq value to the IDF constructor.
idfIgnore = IDF(minDocFreq=2).fit(tf)
tfidfIgnore = idfIgnore.transform(tf)

## Analyzing the TF-IDF weightings

In [16]:
# First, we can compute the minimum and maximum TF-IDF weights across the entire corpus
minMaxTFIDF = tfidf.map(lambda tfidfDoc: (min(tfidfDoc.values), max(tfidfDoc.values)))\
                   .reduce(lambda (min1, max1), (min2, max2): (min(min1, min2), max(max1, max2)))
print(minMaxTFIDF)

(0.0, 71443.118724658925)


In [17]:
# TF-IDF weighting will tend to assign a lower weighting to common terms. 
# To see this, we can compute the TF-IDF representation for a few of the terms that appear in
# the list of top occurrences that we previously computed, such as you, do, and we
common = sc.parallelize(['you', 'do', 'we'])
tfCommon = hashingTF.transform(common)
tfidfCommon = idf.transform(tfCommon)
print(tfidfCommon.first().values)

[ 8.11162808  7.01301579  9.21024037]


In [18]:
# Now, let's apply the same transformation to a few less common terms that we might
# intuitively associate with being more linked to specifc topics or concepts
uncommon = sc.parallelize(['telescope', 'legislation', 'investment'])
tfUncommon = hashingTF.transform(uncommon)
tfidfUncommon = idf.transform(tfUncommon)
print(tfidfUncommon.first().values)

[  9.21024037   9.90338755  23.11848891   8.80477526   9.90338755
   9.21024037   8.51709319]


# Using a TF-IDF model

**While we often refer to training a TF-IDF model, it is actually a feature extraction process or transformation rather than a machine learning model. **TF-IDF weighting is often used as a preprocessing step for other models, such as dimensionality reduction, classifcation, or regression.

To illustrate the potential uses of TF-IDF weighting, we will explore two examples.
The frst is using the TF-IDF vectors to compute document similarity, while the
second involves training a multilabel classifcation model with the TF-IDF vectors as
input features

## Document similarity with the 20 Newsgroups dataset and TF-IDF features

In [19]:
hockeyText = textFilePairs.filter(lambda (path, content) : 'hockey' in path).map(lambda (path, content): content.encode('utf8'))
print(hockeyText.first())

Newsgroups: rec.sport.hockey
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!spool.mu.edu!torn!newshub.ccs.yorku.ca!ists!stpl.ists.ca!dchhabra
From: dchhabra@stpl.ists.ca (Deepak Chhabra)
Subject: Superstars and attendance (was Teemu Selanne, was +/- leaders)
Message-ID: <1993Apr5.182124.17415@ists.ists.ca>
Sender: news@ists.ists.ca (News Subsystem)
Nntp-Posting-Host: stpl.ists.ca
Organization: Solar Terresterial Physics Laboratory, ISTS
Distribution: na
Date: Mon, 5 Apr 93 18:21:24 GMT
Lines: 115


Dean J. Falcione (posting from jrmst+8@pitt.edu) writes:
[I wrote:]

>>When the Pens got Mario, granted there was big publicity, etc, etc,
>>and interest was immediately generated.  Gretzky did the same thing for LA. 
>>However, imnsho, neither team would have seen a marked improvement in
>>attendance if the team record did not improve.  In the year before Lemieux
>>came, Pittsburgh finished with 38 points.  Following his

In [20]:
hockeyTF = hockeyText.map(lambda text:hashingTF.transform(tokenize(text)))
hockeyTFIDF = idf.transform(hockeyTF)
print(hockeyTFIDF.first())

(131072,[1025,1331,1580,1595,2424,2575,2667,2738,2831,2956,3592,4240,4291,4387,4415,4513,5024,5041,5348,5706,5722,5873,6141,6447,6890,7652,7775,8314,8616,8798,8982,9799,10117,10362,10503,10779,10818,10843,10859,10903,11080,12928,13308,13630,13693,13735,13898,14204,14480,15569,16324,16818,16927,17361,17718,17946,17949,18052,18219,19199,19460,19542,19634,20844,20885,21012,21477,21767,22065,22404,23370,23663,23972,24127,24165,24166,24591,24870,25134,25353,25687,25805,25817,26540,27587,28000,28074,28147,28396,28459,29155,29474,29822,29898,30984,31385,31391,31691,32036,32191,33254,33433,33559,33707,33926,33941,33950,34439,34935,35288,35409,35893,35994,36034,36056,36617,36737,36883,36893,37093,37481,37826,38034,38305,38355,38605,39219,40202,40216,41537,41538,42049,42236,43343,43741,44616,44891,44941,45128,45739,45760,46915,47128,47163,47810,48141,48167,48314,48611,48650,48702,48991,49337,49338,49468,49477,49543,49909,50163,50394,50546,50589,50671,51398,52087,52474,52654,53796,53808,54636,549

In [21]:
hockey1 = hockeyTFIDF.sample(True, 0.1, 41).first()
hockey2 = hockeyTFIDF.sample(True, 0.1, 42).first()
print(type(hockey1), type(hockey2))

(<class 'pyspark.mllib.linalg.SparseVector'>, <class 'pyspark.mllib.linalg.SparseVector'>)


In [22]:
# cosin similarity of hockey 1 and hockey2
from pyspark.ml.linalg import Vectors
from scipy.spatial.distance import cosine
h1, h2 = Vectors.dense(hockey1), Vectors.dense(hockey2)
print(h1.dot(h2)/(h1.norm(2)*h2.norm(2)))
print(1-cosine(h1, h2))

0.0433910331401
0.0433910331401


While this might seem quite low, recall that the effective dimensionality of our
features is high due to the large number of unique terms that is typical when dealing
with text data. Hence, we can expect that any two documents might have a relatively
low overlap of terms even if they are about the same topic, and therefore would have
a lower absolute similarity score.

By contrast, we can compare this similarity score to the one computed between
one of our hockey documents and another document chosen randomly from the
comp.graphics newsgroup, using the same methodology

In [23]:
graphText = textFilePairs.filter(lambda (path, content) : 'comp.graphics' in path)\
                         .map(lambda (path, content): content.encode('utf8'))
graphicsTF =  hashingTF.transform(tokenize(graphText.first()))
graphicsTFIDF = idf.transform(graphicsTF)
g1 = Vectors.dense(graphicsTFIDF)
print(h1.dot(g1)/(h1.norm(2)*g1.norm(2)))
print(h2.dot(g1)/(h2.norm(2)*g1.norm(2)))

0.00109625504927
0.00310798348079


Finally, it is likely that a document from another sports-related topic might be more
similar to our hockey document than one from a computer-related topic. However,
we would probably expect a baseball document to not be as similar as our hockey
document. Let's see whether this is the case by computing the similarity between a
random message from the baseball newsgroup and our hockey document

In [24]:
baseballText = textFilePairs.filter(lambda (path, content) : 'baseball' in path)\
                         .map(lambda (path, content): content.encode('utf8'))
baseballTF =  hashingTF.transform(tokenize(baseballText.first()))
baseballTFIDF = idf.transform(baseballTF)
b1 = Vectors.dense(baseballTFIDF)
print(h1.dot(b1)/(h1.norm(2)*b1.norm(2)))
print(h2.dot(b1)/(h2.norm(2)*b1.norm(2)))

0.0338025271309
0.0197523934721


## Training a text classifer on the 20 Newsgroups dataset using TF-IDF

When using TF-IDF vectors, we expected that the cosine similarity measure would
capture the similarity between documents, based on the overlap of terms between
them. **In a similar way, we would expect that a machine learning model, such as a
classifer, would be able to learn weightings for individual terms; this would allow
it to distinguish between documents from different classes. **That is, it should be
possible to learn a mapping between the presence (and weighting) of certain terms
and a specifc topic.

In the 20 Newsgroups example, each newsgroup topic is a class, and we can train a
classifer using our TF-IDF transformed vectors as input.

In [25]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.evaluation import MulticlassMetrics

In [26]:
topics = newsGroup.map(lambda (c, count):c).collect()

In [27]:
topicsMap = {cla:idx for idx, cla in enumerate(topics)}
print(topicsMap)

{u'sci.med': 2, u'comp.sys.mac.hardware': 1, u'talk.politics.misc': 10, u'soc.religion.christian': 11, u'rec.motorcycles': 9, u'comp.graphics': 12, u'comp.windows.x': 3, u'misc.forsale': 4, u'talk.politics.guns': 5, u'talk.politics.mideast': 14, u'sci.crypt': 0, u'sci.space': 7, u'rec.sport.hockey': 17, u'comp.sys.ibm.pc.hardware': 15, u'rec.sport.baseball': 8, u'alt.atheism': 16, u'comp.os.ms-windows.misc': 6, u'talk.religion.misc': 13, u'rec.autos': 19, u'sci.electronics': 18}


In [28]:
textTopics = textFilePairs.map(lambda (path, content) : path.split('/')[-2])
trainData = textTopics.zip(tfidf).map(lambda (topic, vector):LabeledPoint(topicsMap[topic], vector))
trainData.cache()

PythonRDD[95] at RDD at PythonRDD.scala:48

In the preceding code snippet, we took the textTopics RDD, where each element
is the topic, and used the zip function to combine it with each element in our tfidf
RDD of TF-IDF vectors. We then mapped over each key-value element in our new
zipped RDD and created a LabeledPoint instance, where label is the class index
and features is the TF-IDF vector

In [29]:
model = NaiveBayes.train(trainData, 0.1)

In [30]:
# training accuracy
predictionsVSLabels = trainData.map(lambda point:(point.label, model.predict(point.features)))
correctCount = predictionsVSLabels.filter(lambda (label, prediction):label == prediction).count()
print('training accuracy:{0}'.format(float(correctCount)/trainData.count()))

training accuracy:0.963194479172


# Evaluating the impact of text processing

## Comparing raw features with processed TF-IDF features on the 20 Newsgroups dataset

In this example, we will simply apply the hashing term frequency transformation
to the raw text tokens obtained using a simple whitespace splitting of the document
text. We will train a model on this data and evaluate the performance on the test set
as we did for the model trained with TF-IDF features

In [31]:
rawText = texts.map(lambda text : hashingTF.transform(text.split()))
rawTrainData = textTopics.zip(rawText).map(lambda (topic, vector):LabeledPoint(topicsMap[topic], vector))
rawModel = NaiveBayes.train(rawTrainData, 0.1)
rawPredictionsVSLabels = rawTrainData.map(lambda point:(point.label, model.predict(point.features)))
rawCorrectCount = rawPredictionsVSLabels.filter(lambda (label, prediction):label == prediction).count()
print('training accuracy:{0}'.format(float(rawCorrectCount)/rawTrainData.count()))

training accuracy:0.508826323949


# Word2Vec models

Until now, we have used a bag-of-words vector, optionally with some weighting
scheme such as TF-IDF to represent the text in a document. Another recent class
of models that has become popular is related to representing individual words
as vectors.

These are generally based in some way on the co-occurrence statistics between the
words in a corpus. Once the vector representation is computed, we can use these
vectors in ways similar to how we might use TF-IDF vectors (such as using them
as features for other machine learning models). One such common use case is
computing the similarity between two words with respect to their meanings, based
on their vector representations.

In [32]:
from pyspark.mllib.feature import Word2Vec
word2vec = Word2Vec()
word2vecModel = word2vec.fit(tokens)

Once trained, we can easily fnd the top 20 synonyms for a given term (that is, the
most similar term to the input term, computed by cosine similarity between the word
vectors). For example, to fnd the 20 most similar terms to hockey, use the following
lines of code:

In [33]:
synonyms = word2vecModel.findSynonyms('basketball', 20)
for word, cosine_distance in synonyms:
    print("{}: {}".format(word, cosine_distance))

football: 0.781936278997
telecasts: 0.7474623067
franchises: 0.72790238083
scny: 0.7240534571
scoreboard: 0.717500848617
cept: 0.717376210802
devellano: 0.716233564392
quintin: 0.714930884453
goaltenders: 0.711373542351
burgh: 0.710741334258
staub: 0.709629381769
skriko: 0.70589596071
sweep: 0.704942337701
outplayed: 0.704855835902
nba: 0.703270833792
players: 0.702700878737
skater: 0.701928838807
bats: 0.696367061423
islander: 0.695730831734
baseball: 0.695713528589

# Summary

In this chapter, we took a deeper look into more complex text processing and
explored MLlib's text feature extraction capabilities, in particular the TF-IDF term
weighting schemes. We covered examples of using the resulting TF-IDF feature
vectors to compute document similarity and train a newsgroup topic classifcation
model. Finally, you learned how to use MLlib's cutting-edge Word2Vec model to
compute a vector representation of words in a corpus of text and use the trained
model to fnd words with contextual meaning that is similar to a given word