# Data collection and preprocessing

## File descriptions

`{0,1,2,3,4}.zip` contain all the HTML files. The files listed in `train_v2.csv` are the training files; its columns are:

 - `file` - the raw file name
 - `sponsored` - the label: 0 for organic content, 1 for sponsored content

We are going to use a fraction of fake data for exploratory analysis with Spark in the IPython notebook. A test with the full dataset will be illustrated during the live demo.

The label dataset is structured as follows:

```json
{"namespace": "html.avro",
 "type": "record",
 "name": "Html",
 "fields": [
     {"name": "id", "type": "string"},
     {"name": "label", "type": "double"}
 ]
}
```

The corresponding documents are in the zip files:  

```bash
{0,1,2,3,4}.zip
```

After preprocessing, the documents will have the following structure:

```json
{"namespace": "html.avro",
 "type": "record",
 "name": "Html",
 "fields": [
     {"name": "id", "type": "string"},
     {"name": "images",  "type": {"type":"array", "items":"string"}},
     {"name": "links", "type": {"type":"array", "items":"string"}},
     {"name": "text", "type": "string"},
     {"name": "title", "type": {"type":"array", "items":"string"}}
 ]
}
```
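
As a rough illustration of the record layout this schema describes, the sketch below checks a plain Python dictionary against the field names and types listed above (Python 3, standard library only). The `SCHEMA_FIELDS` table, the `matches_schema` helper, and the example record are made up for illustration; this is not an Avro library call and it does not read or write Avro files.

```python
# Field layout copied from the preprocessed-document schema above.
SCHEMA_FIELDS = {
    "id": str,
    "images": list,
    "links": list,
    "text": str,
    "title": list,
}


def matches_schema(record):
    """Return True if `record` has exactly the schema's fields with matching types."""
    if set(record) != set(SCHEMA_FIELDS):
        return False
    for name, expected in SCHEMA_FIELDS.items():
        value = record[name]
        if not isinstance(value, expected):
            return False
        # Array fields must contain only strings.
        if expected is list and not all(isinstance(v, str) for v in value):
            return False
    return True


# Hypothetical record, invented for illustration only.
example = {
    "id": "0000001",
    "images": ["http://example.com/a.png"],
    "links": ["/", "#page"],
    "text": "some page text",
    "title": ["Example title"],
}
```

A check like this can catch malformed preprocessing output early, before the JSON files are handed to an Avro writer.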

## Web scraping

The term "scraping" refers to getting unstructured data and turning it into something usable. The tools available through Python are mature and easy to use. In our case the source data comes in HTML form.

### The basic workflow is:

1. Find the data you want on the web *(in our case it is provided to us)*.
1. Inspect the page you're dealing with to figure out how to zoom in on the content you want. This will involve some combination of looking at the source code of the page (especially if it is simple) and figuring out the structure of the HTML parse tree. This step is much easier with something like *Chrome Developer Tools*.
1. Write code to get out what you want:
    - If the page is very simple, treat it as a bunch of text => string manipulation / regular expressions in Python.
    - If the page is more complicated (and/or written in good style), use the HTML parse tree => BeautifulSoup in Python.
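
The two styles from the last step can be sketched with the standard library alone (Python 3). This is only an illustration under stated assumptions: the tiny `PAGE` string is invented, and `html.parser.HTMLParser` is used as an event-based stand-in for the parse-tree approach so that the snippet runs without BeautifulSoup.

```python
import re
from html.parser import HTMLParser

# Tiny hypothetical page, invented for illustration.
PAGE = """<html><head><title>Greek flatbread</title></head>
<body><p class="intro">Hot in <b>Atlanta</b>.</p><div><p>No cooking needed.</p></div></body></html>"""

# Very simple page: a regular expression is enough to pull out one field.
title = re.search(r"<title>(.*?)</title>", PAGE, re.S).group(1)


class ParagraphText(HTMLParser):
    """Collect the text inside every <p> tag, including nested inline tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0        # how many <p> tags we are currently inside
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while we are inside a paragraph.
        if self.depth > 0:
            self.paragraphs[-1] += data


parser = ParagraphText()
parser.feed(PAGE)
parser.close()
paragraphs = [p.strip() for p in parser.paragraphs]
```

The regular expression breaks as soon as the markup varies (attributes on the tag, several titles), which is why a real parser is preferred for anything beyond trivial pages.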

### CSS selectors

A common pattern when walking the parse tree is a series of nested finds, each given by conditions on tag type, id, and class. It's so common that there is a special convenience language for such traversals: CSS selectors.
BeautifulSoup supports a form of CSS selectors, which lets us write such nested finds in a concise and expressive way:
>    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')

All selectors work like a `find_all`. Some basic building blocks of selectors are:

 - `'mytag'` picks out all tags of type `mytag`.
 - `'#myid'` picks out all tags whose id is equal to `myid`.
 - `'.myclass'` picks out all tags whose class includes `myclass`.
 - `'mytag#myid'` picks out all tags of type `mytag` with id equal to `myid` (analogously for `'mytag.myclass'`).
 - If `'selector1'` and `'selector2'` are two selectors, then there is another selector `'selector1 selector2'`. It picks out all tags satisfying `selector2` that are descendants(*) of something satisfying `selector1`, i.e., it's like a nested find.

(*) It doesn't have to be a direct descendant, i.e., it can be a grand-grand-..-grand-child of something satisfying `selector1`. For direct descendants we'd instead write `'selector1 > selector2'`.

## Example with our data

First, make sure that your notebook is running the Python from your Anaconda environment:

In [1]:
import sys
sys.version

'2.7.11 |Continuum Analytics, Inc.| (default, Dec  6 2015, 18:57:58) \n[GCC 4.2.1 (Apple Inc. build 5577)]'

In [2]:
from bs4 import BeautifulSoup
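document = open("../preprocess/data/1118089_raw_html.txt","r").read()
# Name the parser explicitly; otherwise BeautifulSoup picks one and prints a warning
soup = BeautifulSoup(document, "html.parser")
print soup.prettify()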

<!DOCTYPE html>
<html lang="en-US" prefix="og: http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <!-- MOBIFY - DO NOT ALTER - PASTE IMMEDIATELY AFTER OPENING HEAD TAG -->
  <script>
   (function(window, document, mjs) {
    window.Mobify = {points: [+new Date], tagVersion: [1, 0]};
    var isMobile = /ip(hone|od|ad)|android|blackberry.*applewebkit|bb1\d.*mobile/i.test(navigator.userAgent);
    isBlackList=/Android 2\.|Silk|iPhone OS 4_|iPad; CPU OS 4_/i.test(navigator.userAgent);
    var optedOut = /mobify-path=($|;)/.test(document.cookie);
    if (!isMobile || optedOut || isBlackList) {
        return;
    }
 document.write('&lt;plaintext style=display:none&gt;');
    setTimeout(function() {
        var mobifyjs = document.createElement('script');
        var script = document.getElementsByTagName('script')[0];

        mobifyjs.src = mjs;
        script.parentNode.insertBefore(mobifyjs, script);
    });
    })(this, document, 'http://bundle.padsquad.com/almostsu



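(output truncated)

If no parser is named when constructing the soup, BeautifulSoup prints a warning that recommends changing this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")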


### Find specific tags

In [3]:
text_tags = soup.find_all('p')
print text_tags

[<p class="site-title" itemprop="headline">\n<a href="http://www.almostsupermom.com/">\n        Almost Supermom\n       </a>\n</p>, <p class="site-description" itemprop="description">\n       Able to Leap Piles of Laundry in a Single Bound\n      </p>, <p class="entry-meta">\n<time class="entry-time" datetime="2015-07-13T12:15:48+00:00" itemprop="datePublished">\n          July 13, 2015\n         </time>\n         by\n         <span class="entry-author" itemprop="author" itemscope="itemscope" itemtype="http://schema.org/Person">\n<a class="entry-author-link" href="http://www.almostsupermom.com/author/almostsu" itemprop="url" rel="author">\n<span class="entry-author-name" itemprop="name">\n            Jordyn\n           </span>\n</a>\n</span>\n<span class="entry-comments-link">\n<a href="http://www.almostsupermom.com/2015/07/gluten-free-greek-flatbread.html#respond">\n           Leave a Comment\n          </a>\n</span>\n</p>, <p>\n         Necessity is the mother of invention, so is the

In [5]:
#import os, sys, logging, string, glob
#import cssutils as cu
#import json
#import re

def parse_text(soup):
    """ parameters:
            - soup: BeautifulSoup4 parsed HTML page
        out:
            - textdata: a list of text strings collected from <div class="text"> and <p> tags
        note:
            - could use soup.get_text() instead, but the output is noisier """
    textdata = ['']

    # Text held in <div class="text"> containers
    for tag in soup.find_all("div", {"class":"text"}):
        try:
            textdata.append(tag.text.encode('ascii','ignore').strip())
        except Exception:
            continue

    # Text held in paragraph tags
    for tag in soup.find_all('p'):
        try:
            textdata.append(tag.text.encode('ascii','ignore').strip())
        except Exception:
            continue

    return textdata

In [6]:
parse_text(soup)

['',
 'Almost Supermom',
 'Able to Leap Piles of Laundry in a Single Bound',
 'July 13, 2015\n         \n         by\n         \n\n\n            Jordyn\n           \n\n\n\n\n           Leave a Comment',
 'Necessity is the mother of invention, so is the case with this scrumptious\n         \n          gluten free greek flatbread.',
 'It is so hot here in Atlanta. We are practically melting in the oppressive, sticky heat that the south is known for. Its the kind of heat that makes you want to sit on the couch and do nothing until the sun goes down. Turning on the oven is a repulsive thought as it will contribute to the overwhelming heat, so we nourish ourselves with salads and cold sandwiches in attempt to avoid turning into a puddle.',
 'The problem with salads and sandwiches is they can get boring. Really boring. In attempt to shake things up but stay cool, Ive been experimenting with different recipes that taste delicious, but require no cooking , at least not in the heat of the day.'

We skip the preprocessing step to save time. You are encouraged to complete this step offline to obtain the JSON files.

## Converting JSON to Avro

The following step converts the output JSON files to Avro using the prepared schema files:

```bash

```

# Loading and exploring preprocessed data 


Download the preprocessed sample here:
labels: https://www.dropbox.com/s/lhi8kbgtharwn2x/labels.avro?dl=0
data: https://www.dropbox.com/s/5f1zy74o4igxsgd/html.avro?dl=0



In [7]:
print "Ingest data..."

train_label_df = sqlContext.read.format('com.databricks.spark.avro').load("./data/labels.avro")
input_df = sqlContext.read.format('com.databricks.spark.avro').load("./data/html.avro")
input_df.printSchema()
train_label_df.printSchema()
input_df.show()

Ingest data...
root
 |-- id: string (nullable = false)
 |-- images: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- links: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- text: string (nullable = false)
 |-- title: array (nullable = false)
 |    |-- element: string (containsNull = false)

root
 |-- id: string (nullable = false)
 |-- label: double (nullable = false)

+-------+--------------------+--------------------+--------------------+--------------------+
|     id|              images|               links|                text|               title|
+-------+--------------------+--------------------+--------------------+--------------------+
|1309896|[http://www.2tout...|[http://www.2tout...|insolite art et a...|[Un lphant retour...|
|1309926|[http://static.bb...|[/, #page, /acces...|israeli missile n...|[Yo app warns Isr...|
|1309968|                  []|                  []|the page you are ...|         [Not Found]|
|1310

# Prepare feature and label dataframe

In [8]:
train_wlabels_df = input_df.join(train_label_df,"id")
train_wlabels_df.printSchema()

root
 |-- id: string (nullable = false)
 |-- images: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- links: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- text: string (nullable = false)
 |-- title: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- label: double (nullable = false)



# Bootstrap and stratified sampling

Stratified sampling is a sampling method in which one divides the population into separate groups, called strata. Then, a probability sample (often a simple random sample) is drawn from each group.

Our problem is a two-class classification problem: the possible labels are 0 and 1. Let us take a look at the fractions of labels in our sample:  

In [7]:
train_wlabels_df.where(train_wlabels_df.label == 1).count()/float(train_wlabels_df.count())*100.

9.971772880251873

In [8]:
train_wlabels_df.where(train_wlabels_df.label == 0).count()/float(train_wlabels_df.count())

0.9002822711974813

Normally we need to use stratification (stratified sampling) or bootstrap sampling when working with unbalanced data (which is clearly the case here: only about 10% of the documents carry label 1). In short, these methods look at the class labels at the training/cross-validation stage and try to prepare a sample whose label distribution is similar to what one could draw from the whole population.


The transformation that gives exact stratum sizes in Spark is `sampleByKeyExact`, from the `PairRDDFunctions` class:

sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed) 

It returns a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).

The DataFrame method `sampleBy(col, fractions, seed)` is different: it keeps each row independently with the probability given for its stratum (without replacement), so the size of each stratum in the sample is only approximately its fraction times the number of items in that stratum.

In [22]:
# Label 1 is the under-represented class: keep all of it and about half of label 0
fractions = {1.0:1.0, 0.0:0.5}
stratified = train_wlabels_df.sampleBy("label", fractions, 36)
train, cv = stratified.randomSplit([0.7, 0.3])
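
To make the difference between per-row and exact stratified sampling concrete, here is a minimal plain-Python sketch (Python 3, standard library only). It is not Spark code: the function names `sample_by` and `sample_by_key_exact` merely echo the Spark methods, the rows are synthetic, and only sampling without replacement is covered.

```python
import math
import random


def sample_by(rows, key, fractions, rng):
    """Keep each row independently with probability fractions[key(row)]."""
    return [r for r in rows if rng.random() < fractions.get(key(r), 0.0)]


def sample_by_key_exact(rows, key, fractions, rng):
    """Keep exactly ceil(n_k * fraction_k) rows from each stratum k."""
    strata = {}
    for r in rows:
        strata.setdefault(key(r), []).append(r)
    out = []
    for k in sorted(strata):
        group = strata[k]
        size = int(math.ceil(len(group) * fractions.get(k, 0.0)))
        out.extend(rng.sample(group, size))
    return out


rng = random.Random(36)
# Synthetic rows (id, label) with 10% positives, invented for illustration.
rows = [(i, 1.0 if i % 10 == 0 else 0.0) for i in range(1000)]
fractions = {1.0: 1.0, 0.0: 0.5}

approx = sample_by(rows, lambda r: r[1], fractions, rng)
exact = sample_by_key_exact(rows, lambda r: r[1], fractions, rng)
```

With these fractions the exact variant always returns 100 positives and 450 negatives, while the per-row variant returns all 100 positives and a number of negatives that fluctuates around 450 from seed to seed.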




# Feature engineering

Most of our features are text. There are several common types of features and approaches that form the "starting point" for NLP and text-based data problems. Here are a few of the common ones:

 - Bag of words
 - TF-IDF
 - n-grams
 - Stemming / part-of-speech tagging / etc.
 - Feature hashing

## Some useful Python tools are:

http://scikit-learn.org/stable/modules/feature_extraction.html
http://www.nltk.org/
http://www.nltk.org/howto/wordnet.html


## Prepare text features

We are going to try using all of the above techniques, but with **spark.ml**

### Extract tokens

Our first step is to tokenize the data. The simplest way to tokenize a string (sentence) in Python is to use the `split()` method:

In [9]:
input_document = "fruit eat tasty pie leaf cook tree computer computers laptop tech technology ceo jobs ipad iphone announce announced mac company companies employee employees user software released"
tokens = input_document.split(" ")
print tokens

['fruit', 'eat', 'tasty', 'pie', 'leaf', 'cook', 'tree', 'computer', 'computers', 'laptop', 'tech', 'technology', 'ceo', 'jobs', 'ipad', 'iphone', 'announce', 'announced', 'mac', 'company', 'companies', 'employee', 'employees', 'user', 'software', 'released']


Spark ML has two tokenizers: the default `Tokenizer` and `RegexTokenizer`, which lets you specify a custom pattern to tokenize on. The latter is more flexible, so we are going to use it:

In [11]:
from pyspark.ml.feature import RegexTokenizer, Tokenizer

print "Prepare text features..."
#tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W")
tokenized_df = tokenizer.transform(train_wlabels_df)
tokenized_df.select("words").show()

Prepare text features...
+--------------------+
|               words|
+--------------------+
|[insolite, art, e...|
|[israeli, missile...|
|[the, page, you, ...|
|[advertisement, b...|
|[make, more, mone...|
|[view, comments, ...|
|[posted, in, wind...|
|[voor, veel, mens...|
|[whats, holding, ...|
|[get, the, best, ...|
|[v, e, n, m, s, u...|
|[monthly, mixes, ...|
|[youre, helping, ...|
|                  []|
|[visitscotland, u...|
|[share, pin, twee...|
|[advertisement, f...|
|[suds, is, a, lig...|
|[t, m, l, l, g, e...|
|[a, critical, cro...|
+--------------------+
only showing top 20 rows
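
For intuition, roughly the same tokenization can be written in plain Python (Python 3, standard library only). This is a sketch, not the Spark implementation: the helper `regex_tokenize` is hypothetical, and it assumes the behaviour of `RegexTokenizer` with its default settings (the pattern marks the gaps between tokens, text is lowercased, empty tokens are dropped). Python's `\W` is Unicode-aware, so results on non-ASCII text may differ from Spark's.

```python
import re


def regex_tokenize(text, pattern=r"\W"):
    """Lowercase `text`, split on `pattern`, and drop empty tokens."""
    return [tok for tok in re.split(pattern, text.lower()) if tok]


tokens = regex_tokenize("It's so hot here in Atlanta -- really hot!")
```

Note how the apostrophe splits "It's" into two tokens; this is the source of the one-letter tokens visible in some rows of the output above.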



## N-grams

Instead of looking at just single words, it is also useful to look at n-grams: these are n-word long sequences of words (e.g., each of "farmer's market", "market share", and "farm share" is a 2-gram).
The exact same tokenization techniques apply.

In [14]:
from pyspark.ml.feature import NGram

print "Try ngrams instead, or in addition..."
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngram_df = ngram.transform(tokenized_df)
ngram_df.select("ngrams").show()

Try ngrams instead, or in addition...
+--------------------+
|              ngrams|
+--------------------+
|[insolite art, ar...|
|[israeli missile,...|
|[the page, page y...|
|[advertisement bd...|
|[make more, more ...|
|[view comments, c...|
|[posted in, in wi...|
|[voor veel, veel ...|
|[whats holding, h...|
|[get the, the bes...|
|[v e, e n, n m, m...|
|[monthly mixes, m...|
|[youre helping, h...|
|                  []|
|[visitscotland us...|
|[share pin, pin t...|
|[advertisement fo...|
|[suds is, is a, a...|
|[t m, m l, l l, l...|
|[a critical, crit...|
+--------------------+
only showing top 20 rows
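
What the n-gram step computes can be sketched in a couple of lines of plain Python (standard library only). The helper name `ngrams` is made up for illustration; joining the words with a single space mirrors the format of the output shown above.

```python
def ngrams(tokens, n=2):
    """Return the list of n-word sequences, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(["the", "page", "you", "are"], n=2)

# A token list shorter than n yields no n-grams at all.
empty = ngrams(["hello"], n=2)
```

A document with k tokens produces k - n + 1 n-grams, so the vocabulary of distinct n-grams grows quickly with n.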



## Remove stopwords

It's common to want to omit certain common words when doing these counts -- "a", "an", and "the" are common enough so that their counts do not tend to give us any hints as to the meaning of documents. Such words that we want to omit are called stop words (they don't stop anything, though).

Spark ML contains a standard list of such stop words for English. One can include any custom stopwords, if need be.

In [15]:
from pyspark.ml.feature import StopWordsRemover

print "Remove stopwords"
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
filtered_df = remover.transform(tokenized_df)
#filtered_df.printSchema()
filtered_df.select("filtered").show()

Remove stopwords
+--------------------+
|            filtered|
+--------------------+
|[insolite, art, e...|
|[israeli, missile...|
|[page, trying, vi...|
|[advertisement, b...|
|[make, money, tod...|
|[view, comments, ...|
|[posted, windows,...|
|[voor, veel, mens...|
|[whats, holding, ...|
|[best, free, news...|
|[v, e, n, m, s, u...|
|[monthly, mixes, ...|
|[youre, helping, ...|
|                  []|
|[visitscotland, u...|
|[share, pin, twee...|
|[advertisement, f...|
|[suds, lightweigh...|
|[t, m, l, l, g, e...|
|[critical, crossr...|
+--------------------+
only showing top 20 rows



## Feature hashing, TF-IDF


### Feature hashing

When doing "bag of words" type techniques on a *large* corpus and without an existing vocabulary, there is a simple trick that is often useful.  The issue (and solution) is as follows: 

 - The output is a feature vector, so whenever we encounter a word we must look up which coordinate slot it is in.  A naive way would be to keep a list of all the words encountered so far, and look up each word when it is encountered.  Whenever we encounter a word, we check whether we've already seen it and, if not, assign it a new number.  This requires storing all the words that we have seen in memory, cannot be done in parallel (because we'd have to share the hash-table of seen words), etc.
 - A **hash function** takes as input something complicated (like a string) and spits out a number, with the desired property being that different inputs *usually* produce different outputs.  (This is how hash tables are implemented, as the name suggests.)
 - So -- rather than exactly looking up the coordinate of a given word, we can just use its hash value (modulo a big size that we choose).  This is fast and parallelizes easily.  (There are some downsides: different words can collide in the same slot, and you cannot tell, after the fact, which word each of your features actually corresponds to!)
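
The trick described in the list above can be sketched in plain Python (standard library only). Two assumptions are worth flagging: `zlib.crc32` is used here only as a convenient hash that is stable across runs (Python's built-in `hash` for strings is salted per process), and it is not the hash function Spark uses; the helper names `hashed_index` and `hashing_tf` are invented for this example.

```python
import zlib


def hashed_index(word, num_features):
    """Map a word to a coordinate in [0, num_features) using a stable hash."""
    return (zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF) % num_features


def hashing_tf(tokens, num_features=20):
    """Return a sparse term-count vector as {index: count}."""
    vector = {}
    for tok in tokens:
        idx = hashed_index(tok, num_features)
        vector[idx] = vector.get(idx, 0) + 1
    return vector


vec = hashing_tf(["hot", "salad", "hot", "oven"], num_features=20)
```

No vocabulary is stored anywhere, so each document can be vectorized independently; the price is that with a small `num_features` several words will share a slot.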
 
 
### TF-IDF weighting 

With single-word vocabularies, we can probably do an okay job of coming up with a reasonable (if short) list of words that distinguish between two documents.  With n-grams, even for $n=2$, it is better to let a computer help us.  

We would like to find words that are common in one document, but not common in all of them.  This is the goal of the __tf-idf weighting__.  A precise definition is:

  1. If $d$ denotes a document and $t$ denotes a term, then the _raw term frequency_ $\mathrm{tf}^{raw}(t,d)$ is
  $$ \mathrm{tf}^{raw}(t,d) = \text{the number of times the term $t$ occurs in the document $d$} $$
  The vector of all term frequencies can optionally be _normalized_, either by dividing by the sum of all word occurrence counts ($L^1$) or by the Euclidean length of the vector of word occurrence counts ($L^2$).  Scikit-learn uses the $L^2$ option by default:
  $$ \mathrm{tf}(t,d) = \mathrm{tf}^{L^2}(t,d) = \frac{\mathrm{tf}^{raw}(t,d)}{\sqrt{\sum_{t'} \mathrm{tf}^{raw}(t',d)^2}} $$
  2. If $D$ is the set of all documents (the corpus), then the _inverse document frequency_ is
  $$ \mathrm{idf}^{naive}(t,D) = \log \frac{\# D}{\# \{d \in D : t \in d\}} \\
  = \log \frac{\text{count of all documents}}{\text{count of those documents containing the term $t$}} $$
  with a common variant being
  $$ \mathrm{idf}(t, D) = \log \frac{\# D}{1 + \# \{d \in D : t \in d\}} \\
   = \log \frac{\text{count of all documents}}{1 + \text{count of those documents containing the term $t$}} $$
  (Without this tweak we would omit the $1+$ in the denominator and have to worry about dividing by zero if $t$ is not found in any documents.  Libraries differ in the exact smoothing: scikit-learn's default adds one to both the numerator and the denominator and then adds 1 to the logarithm, and Spark's `IDF` uses $\log \frac{\# D + 1}{\# \{d \in D : t \in d\} + 1}$.)
  3. Finally, the weight that we assign to the term $t$ appearing in document $d$ and depending on the corpus of all documents $D$ is
  $$ \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \mathrm{idf}(t,D) $$
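
The three definitions above translate almost line by line into plain Python (standard library only). This sketch follows the formulas as written here, with $L^2$-normalized term frequency and the $1+$ variant of the inverse document frequency; it is not the scikit-learn or Spark implementation, and the four-document corpus is invented for illustration.

```python
import math
from collections import Counter


def tf_l2(doc_tokens):
    """Raw term counts divided by the Euclidean length of the count vector."""
    counts = Counter(doc_tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def idf(term, corpus):
    """log(#D / (1 + number of documents containing the term))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1.0 + containing))


def tfidf(doc_tokens, corpus):
    """Weight of every term of one document, relative to the corpus."""
    return {t: w * idf(t, corpus) for t, w in tf_l2(doc_tokens).items()}


# Toy corpus, made up for this example.
corpus = [["hot", "salad", "hot"], ["cold", "salad"], ["cold", "sandwich"], ["oven"]]
weights = tfidf(corpus[0], corpus)
```

In the first document "hot" is both frequent and rare across the corpus, so it outweighs "salad". With this particular variant a term that occurs in every document gets a slightly negative weight, which is one reason libraries use differently smoothed formulas.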
  

In [17]:
from pyspark.ml.feature import HashingTF, IDF

#Hashing
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=20)
featurized_df = hashingTF.transform(filtered_df)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurized_df)
rescaled_df = idfModel.transform(featurized_df)
rescaled_df.select("features").show()

+--------------------+
|            features|
+--------------------+
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[3,19],[0.495...|
|(20,[0,1,2,3,4,5,...|
|(20,[1,2,4,5,6,7,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,9,10...|
|(20,[0,1,2,3,4,6,...|
|(20,[0,1,2,3,4,6,...|
|(20,[1,5,6,7,8,9,...|
|(20,[0,1,2,3,4,5,...|
|          (20,[],[])|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,5,...|
|(20,[0,1,2,3,4,6,...|
|(20,[0,1,2,3,4,5,...|
+--------------------+
only showing top 20 rows



## Stemming

Let us look at the two words "computer" and "computers".  It would be useful to identify them as one word.

This is not limited to just trailing "s" characters: e.g., the words "carry", "carries", "carrying", and "carried" all carry -- roughly -- the same meaning.  The process of replacing them by a common root, or **stem**, is called stemming -- the stem will not, in general, be a full word itself.

There's a related process called **lemmatization**: The analog of the "stem" here _is_ an actual word.  
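
To show the idea only, here is a deliberately naive suffix-stripping function in plain Python. It is a toy: the suffix list and the `toy_stem` helper are invented for this example, and it is not the Porter algorithm or any stemmer shipped with NLTK, which handle far more cases.

```python
# Longer suffixes are listed first so that they win over shorter ones.
SUFFIXES = ("ying", "ies", "ied", "ing", "es", "s", "y")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ["carry", "carries", "carrying", "carried"]]
```

All four forms collapse to the same stem, which is not itself an English word. A lemmatizer would instead map them to the dictionary form "carry".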

# Building a model


## Decision Trees

A decision tree is a binary tree.  At each of the internal nodes, it chooses a feature $i$ and a threshold $t$.  Each leaf has a value.  Evaluation of the model is just traversal of the tree from the root.  At each node, for an example $j$ (a row $X_j$ of the data), we go down the left branch if $X_{ji} \le t$ and the right branch otherwise.  The value of the model $f(X_j)$ is the value at the terminating leaf of this traversal.  In a picture of a small decision tree (for instance, one trained on the iris data set), each internal node carries a decision criterion and each leaf carries the breakdown of label classes left at that leaf.  For a geometric picture of a decision tree, take a look at this [blog post](https://shapeofdata.wordpress.com/2013/07/02/decision-trees/).
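
The evaluation rule just described fits in a few lines of plain Python. The `Node` and `Leaf` classes and the hand-built tree are hypothetical and exist only to show the traversal; a trained model would store its nodes differently.

```python
class Leaf(object):
    def __init__(self, value):
        self.value = value


class Node(object):
    def __init__(self, feature, threshold, left, right):
        self.feature = feature      # index i of the feature tested at this node
        self.threshold = threshold  # threshold t
        self.left = left            # subtree taken when x[i] <= t
        self.right = right          # subtree taken when x[i] > t


def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's value."""
    node = tree
    while isinstance(node, Node):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


# Hand-built two-level tree over a 2-feature input; the thresholds are made up.
tree = Node(0, 2.5, Leaf("a"), Node(1, 1.0, Leaf("b"), Leaf("c")))
```

Prediction costs one comparison per level, so it is proportional to the depth of the tree rather than to the number of training examples.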


### Decision Tree Training Algorithm and Tuning Parameters

The algorithm to construct a decision tree builds the tree structure recursively.  At each node, it finds the split (the feature and threshold level) that maximizes the improvement in a criterion (in this case, the decrease in the Gini index).  In scikit-learn, this algorithm is controlled by four major parameters:

<table>
	<tr>
    <th>Parameter</th>
    <th>Description</th>
	</tr>

	<tr>
    <td>`max_features`</td>
    <td>The number of features to consider when choosing a split for an internal node</td>
	</tr>

	<tr>
    <td>`max_depth`</td>
    <td>The maximum depth of tree from the root</td>
	</tr>

	<tr>
    <td>`min_samples_split`</td>
    <td>Minimum number of samples required for a split to be considered</td>
	</tr>

	<tr>
    <td>`min_samples_leaf`</td>
    <td>Minimum number of samples required for each leaf</td>
	</tr>
</table>
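
The split search at a single node can be sketched as follows in plain Python (standard library only). It handles one numeric feature and uses the Gini impurity $1 - \sum_k p_k^2$ as the criterion; the helper names are invented, and real implementations are considerably more efficient and also apply the stopping parameters from the table.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 (0 for a pure node)."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, impurity decrease) for the best split on one feature."""
    parent = gini(labels)
    n = float(len(labels))
    best = (None, 0.0)
    # Splitting at the largest value would leave the right side empty, so skip it.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        child = len(left) / n * gini(left) + len(right) / n * gini(right)
        if parent - child > best[1]:
            best = (t, parent - child)
    return best


threshold, gain = best_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

Here the split at 2.0 separates the two classes perfectly, so the impurity drops from 0.5 to 0. Growing a tree repeats this search over the candidate features at every node until a stopping rule is hit.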

## Random Forests

A random forest is just an ensemble of decision trees.  The predicted value is the average over the trees (for both regression and classification problems; for classification problems, it is the predicted probabilities that are averaged).  You can adjust `n_estimators` to change the number of trees in the forest.  If each tree is trained on the same data, why aren't they identical?  Two reasons:
1. **Subsampling**: each tree is actually trained on a randomly selected (with replacement) subset of the data (i.e. a bootstrap sample).
1. **Maximum Features**: the optimal split comes from a randomly selected subset of the features.  In scikit-learn, this is controlled by `max_features`.
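
Both sources of randomness are easy to sketch in plain Python (standard library only). The helper names are hypothetical and the snippet only draws the random samples; it does not train any trees.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rows[rng.randrange(len(rows))] for _ in rows]


def random_feature_subset(num_features, max_features, rng):
    """Pick max_features distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(num_features), max_features))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))
sample_a = bootstrap_sample(rows, rng)
sample_b = bootstrap_sample(rows, rng)
features = random_feature_subset(num_features=20, max_features=4, rng=rng)
```

Because rows are drawn with replacement, a bootstrap sample typically repeats some rows and leaves others out, so each tree sees a slightly different data set.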


## Extremely Random Forests
Instead of searching for the optimal threshold for each feature in a (randomly selected) subset, we draw thresholds at random for each candidate feature and pick the best of these randomly generated thresholds as the split.  While the first two sources of randomness are options in scikit-learn's random forests, this one is implemented in `ExtraTreesClassifier`.

**Question**: What happens to the bias and variance of the individual trees in the averaging process of Random Forests and Extremely Random Forests?  How would you change your parameters to compensate?

You can read more about these [here](http://scikit-learn.org/0.12/modules/ensemble.html).

## Random Forest Training Algorithm and Tuning Parameters

Random Forests and Extremely Random Forests are pretty straightforward to train once you know how a Decision Tree works.  They have an extra parameter `n_estimators` (the number of trees), and because the trees are built independently of each other, their construction can be parallelized by setting the parameter `n_jobs`.


### Decision Trees
1. Increasing `max_features` and `max_depth` and decreasing `min_samples_split` and `min_samples_leaf` tends to build more complex models (increasing variance and reducing bias).
1. Straightforward.

### Random Forests
The variances of the different trees tend to cancel each other while the biases reinforce each other.  That is, because the trees are different, they tend to overfit in different ways, but when they underfit, they underfit the same way.  So you want to use higher-variance, lower-bias parameters than you would with a single decision tree.

In [18]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 2 distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=2)

rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures",numTrees=10,impurity="gini",maxDepth=4,maxBins=32)

# Putting it all into a machine learning pipeline

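In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow, like the one built in this section, might include several stages:

 - split each document's text into words (`tokenizer`);
 - remove stop words (`remover`);
 - convert the remaining words into a numerical feature vector (`hashingTF`, `idf`);
 - learn a prediction model from the feature vectors and labels (`rf`).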

## How it works
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator (see slides for the meaning). These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s transform() method is called on the DataFrame.

In [19]:
from pyspark.ml import Pipeline
#from pyspark.sql import Row

pipeline = Pipeline(stages=[tokenizer, remover, hashingTF, idf, labelIndexer, featureIndexer, rf])
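
The fit/transform mechanics described under "How it works" can be imitated in plain Python to make the control flow explicit. Everything in this sketch is hypothetical (the classes, the `fit_pipeline` and `transform_pipeline` helpers, the toy data); it mimics the behaviour described above on lists of strings and is not the `pyspark.ml` implementation.

```python
class Lowercase(object):
    """A Transformer: it only has transform()."""

    def transform(self, rows):
        return [r.lower() for r in rows]


class VocabularyEstimator(object):
    """An Estimator: fit() learns from the data and returns a Transformer."""

    def fit(self, rows):
        vocab = sorted(set(w for r in rows for w in r.split()))
        return VocabularyModel(vocab)


class VocabularyModel(object):
    """The Transformer produced by VocabularyEstimator.fit()."""

    def __init__(self, vocab):
        self.index = {w: i for i, w in enumerate(vocab)}

    def transform(self, rows):
        return [[self.index[w] for w in r.split() if w in self.index] for r in rows]


def fit_pipeline(stages, rows):
    """Run stages in order; fit Estimators and keep the resulting Transformers."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)
        rows = stage.transform(rows)
        fitted.append(stage)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage, in order, to new data."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


model = fit_pipeline([Lowercase(), VocabularyEstimator()], ["Hot salad", "cold salad"])
encoded = transform_pipeline(model, ["COLD oven salad"])
```

The fitted pipeline contains only Transformers, which is why the same object can later be applied to held-out data without refitting anything.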

# Evaluating a model

## Metrics for Classification

There is a plethora of metrics for classification, and which ones apply depends on whether the predictions are given as class labels or as probabilities.

### Metrics for Class Predictions

Let's start with the simplest.

Recall this well-known table

|                     | Observation Positive     | Observation Negative    |
|---------------------|:------------------------:|:-----------------------:|
| Prediction Positive |     True Positive        | False Positive (Type I) |
| Prediction Negative | False Negative (Type II) |     True Negative       |

There are many summary statistics one can compute from this table:
1. The **Accuracy** gives the fraction of labels correctly predicted (True Positives and True Negatives over everything).  
1. The **Hamming Loss** gives the fraction of labels incorrectly predicted.  It is 1 - Accuracy.
1. The **Precision** is true positives divided by all positive predictions.
1. The **Recall** is true positives divided by all positive observations.
1. There is also the **F-beta** score, which gives a weighted harmonic mean of the precision and recall (as a function of $\beta$); the **F-1** score is the special case when $\beta = 1$.
1. The **Jaccard Similarity Coefficient** is the true positives divided by the sum of true positives, false negatives, and false positives.  

In [21]:
from sklearn import metrics
# Accuracy, Hamming loss, precision, recall, F1 and Jaccard:

y_obs  = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 0, 0, 1]

print "Accuracy:", metrics.accuracy_score(y_obs, y_pred)
print "Hamming Loss:", metrics.hamming_loss(y_obs, y_pred)
print "Precision:", metrics.precision_score(y_obs, y_pred)
print "Recall:", metrics.recall_score(y_obs, y_pred)
print "F1:", metrics.f1_score(y_obs, y_pred)
print "Jaccard:", metrics.jaccard_similarity_score(y_obs, y_pred)

ImportError: No module named sklearn
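
The same summary statistics follow directly from the four counts of the confusion table, so they can be computed without scikit-learn. The sketch below is plain Python (standard library only) applied to the `y_obs` and `y_pred` lists from the cell above; the helper names `confusion_counts` and `f_beta` are made up for this example.

```python
def confusion_counts(y_obs, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_obs, y_pred))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    return tp, fp, fn, tn


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


y_obs  = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_obs, y_pred)

accuracy = (tp + tn) / float(tp + fp + fn + tn)
hamming_loss = 1.0 - accuracy
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
f1 = f_beta(precision, recall)
jaccard = tp / float(tp + fp + fn)
```

For these lists there is a single error, one false negative, so precision is perfect while recall, F1 and the Jaccard coefficient are all pulled below 1.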

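# Binary and multiclass classification evaluators in spark.ml

Spark ML provides `BinaryClassificationEvaluator` and `MulticlassClassificationEvaluator` in `pyspark.ml.evaluation`; the cell below uses the binary one.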

In [25]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Note that the evaluator here is a BinaryClassificationEvaluator and its default metric
# is areaUnderROC.
# metricName options are: areaUnderROC | areaUnderPR
metricName = "areaUnderPR"
ev = BinaryClassificationEvaluator(metricName=metricName)
# Alternative: use the multiclass classification evaluator
# (needs: from pyspark.ml.evaluation import MulticlassClassificationEvaluator)
# metricName options are f1, precision, recall
#ev = MulticlassClassificationEvaluator(metricName="f1")

# Fit the pipeline to training documents.
model = pipeline.fit(train)

print "Evaluate model on test instances and compute test error..."
prediction = model.transform(cv)
#prediction = labelConverter.transform(prediction)
prediction.select("label", "text", "probability", "prediction").show(100)

result = ev.evaluate(prediction)
print metricName,": ", result

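# Fraction of rows where the prediction matches the label is the accuracy; the error is 1 - accuracy
cvAcc = prediction.filter(prediction.label == prediction.prediction).count() / float(cv.count())
print 'CV Error = ' + str(1.0 - cvAcc)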

Evaluate model on test instances and compute test error...
+-----+--------------------+--------------------+----------+
|label|                text|         probability|prediction|
+-----+--------------------+--------------------+----------+
|  0.0|posted in windows...|[0.79218950737233...|       0.0|
|  0.0|whats holding you...|[0.86048818178579...|       0.0|
|  0.0|advertisement for...|[0.79426406386225...|       0.0|
|  0.0|posted by admin c...|[0.77296578368813...|       0.0|
|  0.0|     whatever trevor|[0.81057143515216...|       0.0|
|  0.0|                    |[0.88411739579962...|       0.0|
|  1.0|who knew having a...|[0.80308980630361...|       0.0|
|  0.0|                    |[0.88411739579962...|       0.0|
|  0.0|                    |[0.88411739579962...|       0.0|
|  0.0|                    |[0.88411739579962...|       0.0|
|  1.0|win your dream we...|[0.81222170288967...|       0.0|
|  0.0|we may not have h...|[0.80868031849939...|       0.0|
|  0.0|bruce lee ping po..

# Hyper parameter search

# Adding custom features to the model