
Matrix exercises
chicagoscala committed Sep 23, 2012
1 parent 47a9f89 commit 928322a
Showing 5 changed files with 136 additions and 7 deletions.
42 changes: 35 additions & 7 deletions Workshop.md
@@ -16,6 +16,7 @@ This document will explain many features of Scalding and Cascading. The scri
* [Scalding Wiki](https://github.com/twitter/scalding/wiki).
* Scalding Scaladocs are not online, but they can be built from the [Scalding Repo](https://github.com/twitter/scalding). For convenience, we have included these files in the workshop as `api.zip`. Unzip the file and open the [index](api/index.html).
* [Movie Recommendations](http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/) is a fantastic blog post with detailed, non-trivial examples using Scalding.
* [Scalding Example Project](https://github.com/snowplow/scalding-example-project) is a full example designed to run on Hadoop, specifically on Amazon's EMR (Elastic MapReduce) platform.

## A Disclaimer...

@@ -444,22 +445,49 @@ Context ngrams are a special case of ngrams, where you just find the most common n
Let's revisit the exercise to join stock and dividend records and generalize it to read in multiple sets of data, for different companies, and process them as one stream. A complication is that the data files don't contain the stock ("instrument") symbol, so we'll see another way to add data to tuples.

run.rb scripts/StocksDividendsRevisited8.scala \
--stocks-root-path data/stocks/ \
--dividends-root-path data/dividends/ \
--symbols AAPL,INTC,GE,IBM \
--output output/stocks-dividends-join.txt

# Matrix API

Scalding's *Matrix API*, provided by `com.twitter.scalding.mathematics.Matrix`, lets you express linear-algebra computations over large data sets. The next two exercises use it for graph similarity and text analysis.

## Jaccard Similarity and Adjacency Matrices

*Adjacency matrices* record the similarities between pairs of things. For example, the "things" might be users who have rated movies, and the *adjacency* might be how many movies a pair of users have reviewed in common. Higher adjacency numbers indicate more likely similarity of interests. Note that this simple representation says nothing about whether the two users rated those movies in a similar way.
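
To make that concrete, here is a tiny in-memory sketch (plain Scala; the sparse-map representation is just for illustration) of how the triples in `data/matrix/graph.tsv`, used below, describe an adjacency matrix:

```scala
// Each (row, column) -> value triple from graph.tsv is one non-zero
// entry of a sparse matrix; absent pairs are implicitly zero.
val adjacency: Map[(Int, Int), Double] = Map(
  (1, 2) -> 1.0, (1, 3) -> 1.0,
  (3, 2) -> 1.0, (4, 2) -> 2.0)

// Row 1 of the matrix, i.e., everything node 1 points at:
val row1 = adjacency.collect { case ((1, j), v) => j -> v } // Map(2 -> 1.0, 3 -> 1.0)
```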

Once you have adjacency data, you need a *similarity measure* to determine how similar two things (e.g., people) really are. One is *Jaccard Similarity*:

![](images/JaccardSimilarity.png)

This is set notation: the size of the intersection of two sets over the size of the union, i.e., J(A, B) = |A ∩ B| / |A ∪ B|. It can be generalized and is similar to the cosine of two vectors normalized by length. Note that the corresponding *Jaccard distance* would be 1 - similarity.
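
Here is a minimal plain-Scala sketch of the measure (the users and movie sets are made up for illustration):

```scala
// Jaccard similarity of two sets: |A intersect B| / |A union B|.
def jaccard[T](a: Set[T], b: Set[T]): Double =
  (a intersect b).size.toDouble / (a union b).size

// Hypothetical users and the movies they have rated:
val alice = Set("Alien", "Brazil", "Casablanca")
val bob   = Set("Brazil", "Casablanca", "Dune")

println(jaccard(alice, bob)) // 2 in common / 4 in the union = 0.5
// The corresponding Jaccard distance is 1 - 0.5 = 0.5.
```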

Run the script this way on a small matrix:

run.rb scripts/MatrixJaccardSimilarity9.scala \
--input data/matrix/graph.tsv \
--output output/jaccardSim.tsv

## Term Frequency-Inverse Document Frequency (TF*IDF)

TF*IDF is a widely used *Natural Language Processing* tool for analyzing text. It's useful for indexing documents, e.g., for web search engines. Naively, you might calculate the *frequency* of words in a corpus of documents and assume that if a word appears more frequently in one document, then that document is probably a "definitive" place for that word, much as when you search for web pages on a particular topic. Similarly, the most frequent words indicate the primary topics for a document.

There's a problem, though. Very common words, e.g., articles like "the", "a", etc., will appear very frequently, undermining results, so we want to remove them somehow. Fortunately, they tend to appear frequently in *every* document, so you can reduce the ranking of a particular word if you *divide* its frequency in a given document by its frequency in *all* documents. That's the essence of TF*IDF.
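
Here is a toy plain-Scala sketch of the idea, assuming the common `tf * log(N / df)` weighting (real implementations, including the matrix version below, differ in the details):

```scala
val docs = Seq(
  Seq("the", "cat", "sat"),
  Seq("the", "dog", "ran"),
  Seq("the", "cat", "and", "the", "dog"))

val n = docs.size.toDouble
// df(word) = the number of documents containing the word
val df = docs.flatMap(_.distinct).groupBy(identity).mapValues(_.size.toDouble)

def tfIdf(doc: Seq[String], word: String): Double = {
  val tf = doc.count(_ == word).toDouble / doc.size // term frequency in this doc
  tf * math.log(n / df(word))                       // damped by document frequency
}

println(tfIdf(docs(0), "cat")) // > 0: "cat" appears in only 2 of the 3 documents
println(tfIdf(docs(0), "the")) // = 0: "the" appears in every document
```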

For more information, see the [Wikipedia](http://en.wikipedia.org/wiki/Tf*idf) page.

Run the script this way on a small matrix:

run.rb scripts/TfIdf10.scala \
--input data/matrix/docBOW.tsv \
--output output/featSelectedMatrix.tsv \
--nWords 300

# Type-Safe API

So far, we have been using the more mature *Fields-Based API*, which emphasizes naming fields and uses a relatively dynamic approach to typing. This is consistent with Cascading's model.
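
For reference, here is a minimal word-count job in the fields-based style, adapted from the canonical example in the Scalding documentation (the `--input` and `--output` arguments are supplied on the command line):

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                   // one tuple per line, with a 'line field
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("""\s+""") }
    .groupBy('word) { _.size }              // adds a 'size field with the group count
    .write(Tsv(args("output")))
}
```

Note how every stage refers to fields by Symbol name ('line, 'word) rather than by static type.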

There is now a newer, more experimental *Type-Safe API* that attempts to more fully exploit the type safety provided by Scala. We won't discuss it here, but refer you to the [Type-Safe API Wiki page](https://github.com/twitter/scalding/wiki/Type-safe-api-reference).

# Using Scalding with Hadoop

@@ -500,8 +528,8 @@ Pig has very similar capabilities, with notable advantages and disadvantages.

#### Advantages

* *A custom language* - A purpose-built language for a particular domain can optimize expressiveness for common scenarios.
* *Lazy evaluation* - You define the workflow, then Pig compiles, optimizes, and runs it when output is required. Scalding, following Scala, uses eager evaluation.
* *Describe* - The describe feature is very helpful when learning how each Pig statement defines a new schema.

#### Disadvantages
9 changes: 9 additions & 0 deletions data/matrix/docBOW.tsv
@@ -0,0 +1,9 @@
1 hello 2
1 twitter 1
2 conversation 1
2 celebrities 1
2 twitter 1
3 elections 1
3 debate 1
3 twitter 1
3 political 1
4 changes: 4 additions & 0 deletions data/matrix/graph.tsv
@@ -0,0 +1,4 @@
1 2 1
1 3 1
3 2 1
4 2 2
46 changes: 46 additions & 0 deletions scripts/MatrixJaccardSimilarity9.scala
@@ -0,0 +1,46 @@
import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

/*
* MatrixJaccardSimilarity9.scala
*
* Adapted from "MatrixTutorial5" in the tutorials that come with Scalding.
*
* Loads a directed graph adjacency matrix where a[i,j] = 1 if there is an edge from node i to node j,
* and computes the Jaccard similarity between all pairs of row vectors.
*
* You invoke the script like this:
* run.rb scripts/MatrixJaccardSimilarity9.scala \
* --input data/matrix/graph.tsv \
* --output output/jaccardSim.tsv
*
*/

class MatrixJaccardSimilarity9(args : Args) extends Job(args) {

import Matrix._

val adjacencyMatrix = Tsv(args("input"), ('user1, 'user2, 'rel))
.read
.toMatrix[Long,Long,Double]('user1, 'user2, 'rel)

val aBinary = adjacencyMatrix.binarizeAs[Double]

// intersectMat(i,j) holds the size of the intersection of row i and row j
val intersectMat = aBinary * aBinary.transpose
val aSumVct = aBinary.sumColVectors
val bSumVct = aBinary.sumRowVectors

// Use zip to repeat the row- and column-vector values on the right-hand side
// for all non-zeroes in the left-hand matrix
val xMat = intersectMat.zip(aSumVct).mapValues( pair => pair._2 )
val yMat = intersectMat.zip(bSumVct).mapValues( pair => pair._2 )

val unionMat = xMat + yMat - intersectMat
// We are guaranteed to have Doubles in both the intersection and the union matrices
intersectMat.zip(unionMat)
.mapValues(pair => pair._1 / pair._2)
.write(Tsv(args("output")))

}
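// Sanity check against data/matrix/graph.tsv: after binarizing, node 1 links to
// {2, 3} while nodes 3 and 4 each link only to {2}, so the job should emit, for
// example, Jaccard(1, 3) = |{2}| / |{2, 3}| = 0.5 and Jaccard(3, 4) = 1/1 = 1.0.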

42 changes: 42 additions & 0 deletions scripts/TfIdf10.scala
@@ -0,0 +1,42 @@
import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

/*
* TfIdf10.scala
*
* Adapted from "MatrixTutorial6" in the tutorials that come with Scalding.
*
* Loads a document-to-word matrix where a[i,j] = the frequency of word j in document i,
* computes the TF-IDF score of each word w.r.t. each document, and keeps the top nWords in each document
* (see http://en.wikipedia.org/wiki/Tf*idf for more info)
*
* You invoke the script like this:
* run.rb scripts/TfIdf10.scala \
* --input data/matrix/docBOW.tsv \
* --output output/featSelectedMatrix.tsv \
* --nWords 300
*/
class TfIdf10(args : Args) extends Job(args) {

import Matrix._

val docWordMatrix = Tsv( args("input"), ('doc, 'word, 'count) )
.read
.toMatrix[Long,String,Double]('doc, 'word, 'count)

// compute the overall frequency of each word across all documents (summing the doc rows)
val docFreq = docWordMatrix.sumRowVectors

// compute the inverse document frequency vector: L1-normalize the word frequencies,
// then take log2(1/x) so that rarer words get higher weights
val invDocFreqVct = docFreq.toMatrix(1).rowL1Normalize.mapValues( x => log2(1/x) )

// zip the row vector along the entire document-word matrix
val invDocFreqMat = docWordMatrix.zip(invDocFreqVct.getRow(1)).mapValues( pair => pair._2 )

// multiply the term frequency with the inverse document frequency and keep the top nrWords
docWordMatrix.hProd(invDocFreqMat).topRowElems( args("nWords").toInt ).write(Tsv( args("output") ))

def log2(x : Double) = scala.math.log(x)/scala.math.log(2.0)

}
