
Matrix exercises
chicagoscala committed Sep 23, 2012
1 parent 47a9f89 commit 928322a
Showing 5 changed files with 136 additions and 7 deletions.
42 changes: 35 additions & 7 deletions Workshop.md
@@ -16,6 +16,7 @@ This document will explain many features of Scalding and Cascading. The scri
* [Scalding Wiki](https://github.com/twitter/scalding/wiki).
* Scalding Scaladocs are not online, but they can be built from the [Scalding Repo](https://github.com/twitter/scalding). For convenience, we have included these files in the workshop as `api.zip`. Unzip the file and open the [index](api/index.html).
* [Movie Recommendations](http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/) is a fantastic blog post with detailed, non-trivial examples using Scalding.
* [Scalding Example Project](https://github.com/snowplow/scalding-example-project) is a full example designed to run on Hadoop, specifically on Amazon's EMR (Elastic MapReduce) platform.

## A Disclaimer...

@@ -444,22 +445,49 @@ Context ngrams are a special case of ngrams, where you just find the most common n
Let's revisit the exercise to join stock and dividend records and generalize it to read in multiple sets of data, for different companies, and process them as one stream. A complication is that the data files don't contain the stock ("instrument") symbol, so we'll see another way to add data to tuples.

run.rb scripts/StocksDividendsRevisited8.scala \
--stocks-root-path data/stocks/ \
--dividends-root-path data/dividends/ \
--symbols AAPL,INTC,GE,IBM \
--output output/stocks-dividends-join.txt

# Matrix API

Scalding's *Matrix API*, provided by `com.twitter.scalding.mathematics.Matrix`, lets you express linear-algebra computations over large data sets. The next two exercises use it for graph similarity and text analysis.

## Jaccard Similarity and Adjacency Matrices

*Adjacency matrices* record the similarities between pairs of things. For example, the "things" might be users who have rated movies, and the *adjacency* might be how many movies a pair of users have reviewed in common. Higher adjacency numbers indicate more likely similarity of interests. Note that this simple representation says nothing about whether the two users rated those movies in a similar way.
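
To make that concrete, here is a tiny in-memory sketch (plain Scala; the sparse-map representation is just for illustration) of how the triples in `data/matrix/graph.tsv`, used below, describe an adjacency matrix:

```scala
// Each (row, column) -> value triple from graph.tsv is one non-zero
// entry of a sparse matrix; absent pairs are implicitly zero.
val adjacency: Map[(Int, Int), Double] = Map(
  (1, 2) -> 1.0, (1, 3) -> 1.0,
  (3, 2) -> 1.0, (4, 2) -> 2.0)

// Row 1 of the matrix, i.e., everything node 1 points at:
val row1 = adjacency.collect { case ((1, j), v) => j -> v } // Map(2 -> 1.0, 3 -> 1.0)
```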

Once you have adjacency data, you need a *similarity measure* to determine how similar two things (e.g., people) really are. One is *Jaccard Similarity*:

![](images/JaccardSimilarity.png)

This is set notation: the size of the intersection of two sets over the size of the union, i.e., J(A, B) = |A ∩ B| / |A ∪ B|. It can be generalized and is similar to the cosine of two vectors normalized by length. Note that the corresponding *Jaccard distance* would be 1 - similarity.
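
Here is a minimal plain-Scala sketch of the measure (the users and movie sets are made up for illustration):

```scala
// Jaccard similarity of two sets: |A intersect B| / |A union B|.
def jaccard[T](a: Set[T], b: Set[T]): Double =
  (a intersect b).size.toDouble / (a union b).size

// Hypothetical users and the movies they have rated:
val alice = Set("Alien", "Brazil", "Casablanca")
val bob   = Set("Brazil", "Casablanca", "Dune")

println(jaccard(alice, bob)) // 2 in common / 4 in the union = 0.5
// The corresponding Jaccard distance is 1 - 0.5 = 0.5.
```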

Run the script this way on a small matrix:

run.rb scripts/MatrixJaccardSimilarity9.scala \
--input data/matrix/graph.tsv \
--output output/jaccardSim.tsv

## Term Frequency-Inverse Document Frequency (TF*IDF)

TF*IDF is a widely used *Natural Language Processing* tool for analyzing text. It's useful for indexing documents, e.g., for web search engines. Naively, you might calculate the *frequency* of words in a corpus of documents and assume that if a word appears more frequently in one document, then that document is probably a "definitive" place for that word, much as when you search for web pages on a particular topic. Similarly, the most frequent words indicate the primary topics for a document.

There's a problem, though. Very common words, e.g., articles like "the", "a", etc., will appear very frequently, undermining results, so we want to remove them somehow. Fortunately, they tend to appear frequently in *every* document, so you can reduce the ranking of a particular word if you *divide* its frequency in a given document by its frequency in *all* documents. That's the essence of TF*IDF.
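
Here is a toy plain-Scala sketch of the idea, assuming the common `tf * log(N / df)` weighting (real implementations, including the matrix version below, differ in the details):

```scala
val docs = Seq(
  Seq("the", "cat", "sat"),
  Seq("the", "dog", "ran"),
  Seq("the", "cat", "and", "the", "dog"))

val n = docs.size.toDouble
// df(word) = the number of documents containing the word
val df = docs.flatMap(_.distinct).groupBy(identity).mapValues(_.size.toDouble)

def tfIdf(doc: Seq[String], word: String): Double = {
  val tf = doc.count(_ == word).toDouble / doc.size // term frequency in this doc
  tf * math.log(n / df(word))                       // damped by document frequency
}

println(tfIdf(docs(0), "cat")) // > 0: "cat" appears in only 2 of the 3 documents
println(tfIdf(docs(0), "the")) // = 0: "the" appears in every document
```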

For more information, see the [Wikipedia](http://en.wikipedia.org/wiki/Tf*idf) page.

Run the script this way on a small matrix:

run.rb scripts/TfIdf10.scala \
--input data/matrix/docBOW.tsv \
--output output/featSelectedMatrix.tsv \
--nWords 300

# Type-Safe API

So far, we have been using the more mature *Fields-Based API*, which emphasizes naming fields and uses a relatively dynamic approach to typing. This is consistent with Cascading's model.
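
For reference, here is a minimal word-count job in the fields-based style, adapted from the canonical example in the Scalding documentation (the `--input` and `--output` arguments are supplied on the command line):

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                   // one tuple per line, with a 'line field
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("""\s+""") }
    .groupBy('word) { _.size }              // adds a 'size field with the group count
    .write(Tsv(args("output")))
}
```

Note how every stage refers to fields by Symbol name ('line, 'word) rather than by static type.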

There is now a newer, more experimental *Type-Safe API* that attempts to more fully exploit the type safety provided by Scala. We won't discuss it here, but refer you to the [Type-Safe API Wiki page](https://github.com/twitter/scalding/wiki/Type-safe-api-reference).

# Using Scalding with Hadoop

@@ -500,8 +528,8 @@ Pig has very similar capabilities, with notable advantages and disadvantages.

#### Advantages

* *A custom language* - A purpose-built language for a particular domain can optimize expressiveness for common scenarios.
* *Lazy evaluation* - You define the workflow, then Pig compiles, optimizes, and runs it when output is required. Scalding, following Scala, uses eager evaluation.
* *Describe* - The describe feature is very helpful when learning how each Pig statement defines a new schema.

#### Disadvantages
9 changes: 9 additions & 0 deletions data/matrix/docBOW.tsv
@@ -0,0 +1,9 @@
1 hello 2
1 twitter 1
2 conversation 1
2 celebrities 1
2 twitter 1
3 elections 1
3 debate 1
3 twitter 1
3 political 1
4 changes: 4 additions & 0 deletions data/matrix/graph.tsv
@@ -0,0 +1,4 @@
1 2 1
1 3 1
3 2 1
4 2 2
46 changes: 46 additions & 0 deletions scripts/MatrixJaccardSimilarity9.scala
@@ -0,0 +1,46 @@
import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

/*
* MatrixJaccardSimilarity9.scala
*
* Adapted from "MatrixTutorial5" in the tutorials that come with Scalding.
*
* Loads a directed graph adjacency matrix where a[i,j] = 1 if there is an edge from node i to node j,
* and computes the Jaccard similarity between all pairs of row vectors.
*
* You invoke the script like this:
* run.rb scripts/MatrixJaccardSimilarity9.scala \
* --input data/matrix/graph.tsv \
* --output output/jaccardSim.tsv
*
*/

class MatrixJaccardSimilarity9(args : Args) extends Job(args) {

import Matrix._

val adjacencyMatrix = Tsv(args("input"), ('user1, 'user2, 'rel))
.read
.toMatrix[Long,Long,Double]('user1, 'user2, 'rel)

val aBinary = adjacencyMatrix.binarizeAs[Double]

// intersectMat(i,j) holds the size of the intersection of row i and row j
val intersectMat = aBinary * aBinary.transpose
val aSumVct = aBinary.sumColVectors
val bSumVct = aBinary.sumRowVectors

// Use zip to repeat the row- and column-vector values on the right-hand side
// for all non-zeroes in the left-hand matrix
val xMat = intersectMat.zip(aSumVct).mapValues( pair => pair._2 )
val yMat = intersectMat.zip(bSumVct).mapValues( pair => pair._2 )

val unionMat = xMat + yMat - intersectMat
// We are guaranteed to have Doubles in both the intersection and the union matrices
intersectMat.zip(unionMat)
.mapValues(pair => pair._1 / pair._2)
.write(Tsv(args("output")))

}
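// Sanity check against data/matrix/graph.tsv: after binarizing, node 1 links to
// {2, 3} while nodes 3 and 4 each link only to {2}, so the job should emit, for
// example, Jaccard(1, 3) = |{2}| / |{2, 3}| = 0.5 and Jaccard(3, 4) = 1/1 = 1.0.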

42 changes: 42 additions & 0 deletions scripts/TfIdf10.scala
@@ -0,0 +1,42 @@
import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

/*
* TfIdf10.scala
*
* Adapted from "MatrixTutorial6" in the tutorials that come with Scalding.
*
* Loads a document-to-word matrix where a[i,j] = the frequency of word j in document i,
* computes the TF-IDF score of each word w.r.t. each document, and keeps the top nWords in each document
* (see http://en.wikipedia.org/wiki/Tf*idf for more info)
*
* You invoke the script like this:
* run.rb scripts/TfIdf10.scala \
* --input data/matrix/docBOW.tsv \
* --output output/featSelectedMatrix.tsv \
* --nWords 300
*/
class TfIdf10(args : Args) extends Job(args) {

import Matrix._

val docWordMatrix = Tsv( args("input"), ('doc, 'word, 'count) )
.read
.toMatrix[Long,String,Double]('doc, 'word, 'count)

// compute the overall frequency of each word across all documents (summing the doc rows)
val docFreq = docWordMatrix.sumRowVectors

// compute the inverse document frequency vector: L1-normalize the word frequencies,
// then take log2(1/x) so that rarer words get higher weights
val invDocFreqVct = docFreq.toMatrix(1).rowL1Normalize.mapValues( x => log2(1/x) )

// zip the row vector along the entire document-word matrix
val invDocFreqMat = docWordMatrix.zip(invDocFreqVct.getRow(1)).mapValues( pair => pair._2 )

// multiply the term frequency with the inverse document frequency and keep the top nrWords
docWordMatrix.hProd(invDocFreqMat).topRowElems( args("nWords").toInt ).write(Tsv( args("output") ))

def log2(x : Double) = scala.math.log(x)/scala.math.log(2.0)

}
