# **Entity Resolution**

#### Entity resolution is the process of linking records in one data source to records in another that describe the same entity.

#### We will use data files from the [metric-learning](https://code.google.com/p/metric-learning/) project:

* **Google.txt**, the Google Products dataset
* **Amazon.txt**, the Amazon dataset
* **Google_small.txt**, 200 records sampled from the Google data
* **Amazon_small.txt**, 200 records sampled from the Amazon data
* **stopwords.txt**, a list of common English words

#### **EXERCISE:** Load the datasets directly into RDDs using sc.textFile. Use 4 partitions of the data.

#### **_The answer should be_:**
<pre><code>
Loaded googleSmall dataset with size 200
Loaded amazonSmall dataset with size 200
Loaded google dataset with size 3226
Loaded amazon dataset with size 1363
</code></pre>

In [1]:
googleSmallRDD = sc.textFile('data/googleSmall.txt', 4)          
print "Loaded googleSmall dataset with size %d" % googleSmallRDD.count()

amazonSmallRDD = sc.textFile('data/amazonSmall.txt', 4)          
print "Loaded amazonSmall dataset with size %d" % amazonSmallRDD.count()

googleRDD = sc.textFile('data/google.txt', 4)          
print "Loaded google dataset with size %d" % googleRDD.count()

amazonRDD = sc.textFile('data/amazon.txt', 4)          
print "Loaded amazon dataset with size %d" % amazonRDD.count()


Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/vagrant/synced_folder/Notebooks/Topic Modeling/notebook/data/googleSmall.txt
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:209)
	at java.lang.Thread.run(Thread.java:745)


#### Let's examine some of the lines in the RDDs.

In [6]:
print "This is the aspect of lines from the google dataset:"
print " "

for line in googleSmallRDD.take(3):
    fields = line.split(';')
    print "ID: %s" % fields[0]
    print "CONTENT: %s" % fields[1]
    print " "

print "This is the aspect of lines from the amazon dataset:"
print " "

for line in amazonSmallRDD.take(3):
    fields = line.split(';')
    print "ID: %s" % fields[0]
    print "CONTENT: %s" % fields[1]
    print " "


This is the aspect of lines from the google dataset:
 
ID: http://www.google.com/base/feeds/snippets/11448761432933644608
CONTENT: spanish vocabulary builder "expand your vocabulary! contains fun lessons that both teach and entertain you'll quickly find yourself mastering new terms. includes games and more!" 
 
ID: http://www.google.com/base/feeds/snippets/8175198959985911471
CONTENT: topics presents: museums of world "5 cd-rom set. step behind the velvet rope to examine some of the most treasured collections of antiquities art and inventions. includes the following the louvre - virtual visit 25 rooms in full screen interactive video detailed map of the louvre ..." 
 
ID: http://www.google.com/base/feeds/snippets/18445827127704822533
CONTENT: sierrahome hse hallmark card studio special edition win 98 me 2000 xp "hallmark card studio special edition (win 98 me 2000 xp)" "sierrahome"
 
This is the aspect of lines from the amazon dataset:
 
ID: b000jz4hqo
CONTENT: clickart 950 000 - premi

#### In what follows we will use the ID as the key, and the content as the text used to match products. The similarity measure will be the cosine similarity between [bag-of-words][bag-of-words] representations. In the next sections we will learn how to implement it.
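As a small illustration of the bag-of-words idea, the sketch below reduces a text to a multiset of its words with `collections.Counter`. It is plain Python that runs without Spark, and the two sample strings are made up for this example rather than taken from the datasets.

```python
from collections import Counter

# Two made-up product descriptions (hypothetical, not from the data files).
doc_a = "instant immersion spanish audio book"
doc_b = "spanish audio book instant immersion"

# A bag of words records how often each word occurs and ignores word order.
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

# The two bags compare equal although the word order differs.
print(bag_a == bag_b)
```

Because order is discarded, two descriptions that list the same words in a different sequence end up with identical representations, which is the property the matching step relies on.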

#### First of all, we need to build a function to transform a string into a list of terms (tokens). 

#### **EXERCISE:** Complete the definition of the 'tokeniza' function, and do not forget: 
* to lower case the text
* to remove punctuation signs
* to eliminate empty tokens

#### You may want to use regular expressions. At [regex101](https://regex101.com/) you can test regular expressions against sample strings.
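The following sketch shows one possible use of a regular expression for this task, and why the empty tokens need to be removed afterwards. The sample string is invented for illustration.

```python
import re

sample = "Hello, World! (Spark)"  # made-up sample string

# Splitting on runs of non-word characters drops the punctuation, but a
# separator at the end of the string leaves an empty token behind.
parts = re.split(r'\W+', sample.lower())
print(parts)

# Filtering on truthiness removes the empty strings.
tokens = [p for p in parts if p]
print(tokens)
```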

[bag-of-words]: https://en.wikipedia.org/wiki/Bag-of-words_model


#### **_The answer should be_:**
<pre><code>
This is the tokenized text:

['the', 'bag', 'of', 'words', 'model', 'is', 'a', 'simplifying', 'representation', 'used', 'in', 'natural', 'language', 'processing', 'and', 'information', 'retrieval', 'ir', 'in', 'this', 'model', 'a', 'text', 'such', 'as', 'a', 'sentence', 'or', 'a', 'document', 'is', 'represented', 'as', 'the', 'bag', 'multiset', 'of', 'its', 'words', 'disregarding', 'grammar', 'and', 'even', 'word', 'order', 'but', 'keeping', 'multiplicity']
</code></pre>

In [7]:
import re

# Adjacent string literals inside parentheses are concatenated into a single string.
texto = ("The bag-of-words model is a simplifying representation used in natural language processing and information "
         "retrieval (IR).  In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of "
         "its words, disregarding grammar and even word order but keeping multiplicity")

def tokeniza(string):
    RE = r'\W+'  # one or more non-word characters, i.e. anything outside [a-zA-Z0-9_]; splitting on it drops punctuation
    tokens = re.split(RE, string.lower())
    tokens = [t for t in tokens if len(t) > 0]  # discard empty tokens
    return tokens

print "This is the tokenized text:\n"
print tokeniza(texto)

This is the tokenized text:

['the', 'bag', 'of', 'words', 'model', 'is', 'a', 'simplifying', 'representation', 'used', 'in', 'natural', 'language', 'processing', 'and', 'information', 'retrieval', 'ir', 'in', 'this', 'model', 'a', 'text', 'such', 'as', 'a', 'sentence', 'or', 'a', 'document', 'is', 'represented', 'as', 'the', 'bag', 'multiset', 'of', 'its', 'words', 'disregarding', 'grammar', 'and', 'even', 'word', 'order', 'but', 'keeping', 'multiplicity']


#### It is important to remove [stopwords][stopwords], common words that contribute little to the meaning of a document (e.g., "the", "a", "is", "to").

#### **EXERCISE:** Load the file "stopwords.txt" and use its contents to improve the tokeniza function.
[stopwords]: https://en.wikipedia.org/wiki/Stop_words
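A minimal sketch of stopword filtering follows. The name `remove_stopwords` and the four-word stopword set are hypothetical, chosen only for this example; the exercise itself loads the real list from stopwords.txt.

```python
import re

# A tiny hand-written stopword set, used only for this sketch.
sample_stopwords = {"the", "a", "is", "to"}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword collection."""
    return [t for t in tokens if t not in stopwords]

tokens = re.findall(r'\w+', "The cat is next to a dog".lower())
filtered = remove_stopwords(tokens, sample_stopwords)
print(filtered)
```

Storing the stopwords in a set rather than a list makes each membership test a constant-time lookup, which matters once the filter is applied to every token of every record.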


#### **_The answer should be_:**
<pre><code>
This is the tokenized text without stopwords:

['bag', 'words', 'model', 'simplifying', 'representation', 'used', 'natural', 'language', 'processing', 'information', 'retrieval', 'ir', 'model', 'text', 'sentence', 'document', 'represented', 'bag', 'multiset', 'words', 'disregarding', 'grammar', 'even', 'word', 'order', 'keeping', 'multiplicity']
</code></pre>


In [8]:
stopfile = 'data/stopwords.txt'
# Keep the stopwords in a set so that the membership tests in tokeniza are fast.
stopwords = set(sc.textFile(stopfile).collect())
print 'These are the english stopwords:\n'
for sw in stopwords:
    print sw, 
print " \n\n"

def tokeniza(string, stopwords):
    RE = r'\W+' # match any non-word character [^a-zA-Z0-9_]
    tokens = re.split(RE, string.lower())
    tokens = [t for t in tokens if len(t) > 0 and t not in stopwords]
    return tokens

print "This is the tokenized text without stopwords:\n"
print tokeniza(texto, stopwords)

These are the english stopwords:

all just being over both through yourselves its before with had should to only under ours has do them his very they not during now him nor did these t each where because doing theirs some are our ourselves out what for below does above between she be we after here hers by on about of against s or own into yourself down your from her whom there been few too themselves was until more himself that but off herself than those he me myself this up will while can were my and then is in am it an as itself at have further their if again no when same any how other which you who most such why a don i having so the yours once  


This is the tokenized text without stopwords:

['bag', 'words', 'model', 'simplifying', 'representation', 'used', 'natural', 'language', 'processing', 'information', 'retrieval', 'ir', 'model', 'text', 'sentence', 'document', 'represented', 'bag', 'multiset', 'words', 'disregarding', 'grammar', 'even', 'word', 'order', 'keeping', 'multipli

### Tokenizing the RDDs

#### We want to restructure the RDDs so that every item has the format (ID, [list of tokens]).

#### **EXERCISE:**  Use the 'tokeniza' function to tokenize the amazonSmall RDD and count the number of unique tokens in that RDD. Repeat for the other datasets. It is best to define a 'tokenizeRDD' function that takes an RDD (plus the stopwords) as arguments and returns the transformed RDD.
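The per-record transformation can be sketched in plain Python before applying it to an RDD. The helper name `parse_record` and the sample line are hypothetical; the use of `maxsplit=1` is an assumption of this sketch that guards against a `;` inside the content, and it is not confirmed to be needed for these data files.

```python
import re

def parse_record(line):
    """Split an 'ID;content' line into (ID, list of lowercase tokens)."""
    # maxsplit=1 keeps any further ';' inside the content field
    # (an assumption of this sketch, not a property of the datasets).
    record_id, content = line.split(';', 1)
    tokens = [t for t in re.split(r'\W+', content.lower()) if t]
    return (record_id, tokens)

# Made-up line in the same 'ID;content' layout as the data files.
line = 'x001;Photo Editor 2.0 "deluxe; boxed"'
record = parse_record(line)
print(record)
```

A function of this shape can be passed directly to an RDD `map`, since it takes one line and returns one (ID, tokens) pair.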

#### **_The answer should be_:**
<pre><code>
These are the first 5 elements of the processed amazonSmallRDD:

('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund'])
 
('b0006zf55o', ['ca', 'international', 'arcserve', 'lap', 'desktop', 'oem', '30pk', 'oem', 'arcserve', 'backup', 'v11', '1', 'win', '30u', 'laptops', 'desktops', 'computer', 'associates'])
 
('b00004tkvy', ['noah', 'ark', 'activity', 'center', 'jewel', 'case', 'ages', '3', '8', 'victory', 'multimedia'])
 
('b000g80lqo', ['peachtree', 'sage', 'premium', 'accounting', 'nonprofits', '2007', 'peachtree', 'premium', 'accounting', 'nonprofits', '2007', 'affordable', 'easy', 'use', 'accounting', 'solution', 'provides', 'donor', 'grantor', 'management', 're', 'like', 'nonprofit', 'organizations', 're', 'constantly', 'striving', 'maximize', 'every', 'dollar', 'annual', 'operating', 'budget', 'financial', 'reporting', 'programs', 'funds', 'advanced', 'operational', 'reporting', 'rock', 'solid', 'core', 'accounting', 'features', 'made', 'peachtree', 'choice', 'hundreds', 'thousands', 'small', 'businesses', 'result', 'accounting', 'solution', 'tailor', 'made', 'challenges', 'operating', 'nonprofit', 'organization', 'keep', 'audit', 'trail', 'record', 'report', 'changes', 'made', 'transactions', 'improve', 'data', 'integrity', 'prior', 'period', 'locking', 'archive', 'organization', 'data', 'snap', 'shots', 'data', 'closed', 'year', 'set', 'individual', 'user', 'profiles', 'password', 'protection', 'peachtree', 'restore', 'wizard', 'restores', 'backed', 'data', 'files', 'plus', 'web', 'transactions', 'customized', 'forms', 'includes', 'standard', 'accounting', 'features', 'general', 'ledger', 'accounts', 'receivable', 'accounts', 'payable'])
 
('b0006se5bq', ['singing', 'coach', 'unlimited', 'singing', 'coach', 'unlimited', 'electronic', 'learning', 'products', 'win', 'nt', '2000', 'xp', 'carry', 'tune', 'technologies'])
</code></pre>


In [9]:
def tokenizeRDD(RDD, stopwords):
    # Split every 'ID;content' line and tokenize the content field.
    tokenizedRDD = (RDD
                    .map(lambda x: x.split(';'))
                    .map(lambda x: (x[0], tokeniza(x[1], stopwords)))
                    .cache()
                   )
    return tokenizedRDD

amazonSmallRDDtokenized = tokenizeRDD(amazonSmallRDD, stopwords)
print "Finished with amazonSmallRDD."
amazonRDDtokenized = tokenizeRDD(amazonRDD, stopwords)
print "Finished with amazonRDD."
googleSmallRDDtokenized = tokenizeRDD(googleSmallRDD, stopwords)
print "Finished with googleSmallRDD."
googleRDDtokenized = tokenizeRDD(googleRDD, stopwords)
print "Finished with googleRDD.\n"

print "These are the first 5 elements of the processed amazonSmallRDD:\n"
elements = amazonSmallRDDtokenized.take(5)
for l in elements:
    print l
    print " "

Finished with amazonSmallRDD.
Finished with amazonRDD.
Finished with googleSmallRDD.
Finished with googleRDD.

These are the first 5 elements of the processed amazonSmallRDD:

(u'b000jz4hqo', [u'clickart', u'950', u'000', u'premier', u'image', u'pack', u'dvd', u'rom', u'broderbund'])
 
(u'b0006zf55o', [u'ca', u'international', u'arcserve', u'lap', u'desktop', u'oem', u'30pk', u'oem', u'arcserve', u'backup', u'v11', u'1', u'win', u'30u', u'laptops', u'desktops', u'computer', u'associates'])
 
(u'b00004tkvy', [u'noah', u'ark', u'activity', u'center', u'jewel', u'case', u'ages', u'3', u'8', u'victory', u'multimedia'])
 
(u'b000g80lqo', [u'peachtree', u'sage', u'premium', u'accounting', u'nonprofits', u'2007', u'peachtree', u'premium', u'accounting', u'nonprofits', u'2007', u'affordable', u'easy', u'use', u'accounting', u'solution', u'provides', u'donor', u'grantor', u'management', u're', u'like', u'nonprofit', u'organizations', u're', u'constantly', u'striving', u'maximize', u'every', u'd

#### Counting tokens: we want to know the vocabulary size in every one of the datasets. 

#### **EXERCISE:**  Count the number of tokens in every dataset. Also count the number of UNIQUE tokens in every dataset.
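The difference between the two counts can be seen on a toy example. The records below are made up and only mimic the (ID, [tokens]) format.

```python
# Made-up tokenized records in the (ID, [tokens]) format.
records = [
    ("id1", ["spanish", "audio", "book", "audio"]),
    ("id2", ["audio", "editor"]),
]

# Total tokens: add up the length of every token list.
total_tokens = sum(len(tokens) for _, tokens in records)

# Unique tokens: collect all tokens into one set and measure its size.
vocabulary = set(token for _, tokens in records for token in tokens)

print("%d tokens, %d unique" % (total_tokens, len(vocabulary)))
```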

#### **_The answer should be_:**

<pre><code>
amazonSmall has 14052 tokens.
amazon has 133267 tokens.
googleSmall has 5710 tokens.
google has 98588 tokens.
 
amazonSmall has 3700 unique tokens.
amazon has 11505 unique tokens.
googleSmall has 2305 unique tokens.
google has 11176 unique tokens.
</code></pre>

In [10]:
NtokensamazonSmall = amazonSmallRDDtokenized.map(lambda x: len(x[1])).sum()
print "amazonSmall has %d tokens." % NtokensamazonSmall

Ntokensamazon = amazonRDDtokenized.map(lambda x: len(x[1])).sum()
print "amazon has %d tokens." % Ntokensamazon

NtokensgoogleSmall = googleSmallRDDtokenized.map(lambda x: len(x[1])).sum()
print "googleSmall has %d tokens." % NtokensgoogleSmall

Ntokensgoogle = googleRDDtokenized.map(lambda x: len(x[1])).sum()
print "google has %d tokens." % Ntokensgoogle

print " "

# flatMap flattens the token lists; distinct().count() then gives the vocabulary size.
NtokensuniqueamazonSmall = amazonSmallRDDtokenized.flatMap(lambda x: x[1]).distinct().count()
print "amazonSmall has %d unique tokens." % NtokensuniqueamazonSmall

Ntokensuniqueamazon = amazonRDDtokenized.flatMap(lambda x: x[1]).distinct().count()
print "amazon has %d unique tokens." % Ntokensuniqueamazon

NtokensuniquegoogleSmall = googleSmallRDDtokenized.flatMap(lambda x: x[1]).distinct().count()
print "googleSmall has %d unique tokens." % NtokensuniquegoogleSmall

Ntokensuniquegoogle = googleRDDtokenized.flatMap(lambda x: x[1]).distinct().count()
print "google has %d unique tokens." % Ntokensuniquegoogle


amazonSmall has 14052 tokens.
amazon has 133267 tokens.
googleSmall has 5710 tokens.
google has 98588 tokens.
 
amazonSmall has 3700 unique tokens.
amazon has 11505 unique tokens.
googleSmall has 2305 unique tokens.
google has 11176 unique tokens.


#### Largest and smallest item in amazon dataset. 

#### **EXERCISE:**  Build an 'amazonCountRDD' that stores (ID, [Tokens], Number of tokens). Print the largest content (the record with the largest number of tokens) and the smallest content (the one with the smallest number of tokens). 
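Selecting the extremes by a key can be sketched locally with `heapq`, which follows the same idea as ordering records by their count field. The three records are made up for illustration.

```python
import heapq

# Made-up (ID, [tokens], number of tokens) records.
counted = [
    ("id1", ["monopoly", "encore"], 2),
    ("id2", ["photo", "editor", "deluxe", "edition"], 4),
    ("id3", ["audio", "book", "spanish"], 3),
]

# Pick one record with the smallest and one with the largest token count.
smallest = heapq.nsmallest(1, counted, key=lambda r: r[2])
largest = heapq.nlargest(1, counted, key=lambda r: r[2])

print("smallest: %s, largest: %s" % (smallest[0][0], largest[0][0]))
```

On an RDD, a key function that returns the negated count turns a smallest-first ordering into a largest-first one.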

#### **_The answer should be_:**

<pre><code>
amazon smallest item, with length 2, is:
[('b000hlt5g2', ['monopoly', 'encore'], 2)]
 
amazon largest item, with length 1520, is:
[('b0007lw22q', ['apple', 'ilife', '06', 'family', 'pack', 'mac', 'dvd', 'older', ....
</code></pre>

In [11]:
amazonCountRDD = amazonRDDtokenized.map(lambda x: (x[0], x[1], len(x[1]))).cache()

smallestItem = amazonCountRDD.takeOrdered(1, lambda x: x[2])
largestItem = amazonCountRDD.takeOrdered(1, lambda x: -x[2])

print "amazon smallest item, with length %d, is:" % smallestItem[0][2]
print smallestItem
print " "

print "amazon largest item, with length %d, is:" % largestItem[0][2]
print largestItem
print " "



amazon smallest item, with length 2, is:
[(u'b000hlt5g2', [u'monopoly', u'encore'], 2)]
 
amazon largest item, with length 1520, is:
[(u'b0007lw22q', [u'apple', u'ilife', u'06', u'family', u'pack', u'mac', u'dvd', u'older', u'version', u'ilife', u'06', u'easiest', u'way', u'make', u'every', u'bit', u'digital', u'life', u'use', u'mac', u'collect', u'organize', u'edit', u'various', u'elements', u'transform', u'mouth', u'watering', u'masterpieces', u'apple', u'designed', u'templates', u'share', u'magic', u'moments', u'beautiful', u'books', u'colorful', u'calendars', u'dazzling', u'dvds', u'perfect', u'podcasts', u'attractive', u'online', u'journals', u'starring', u'family', u'pack', u'lets', u'install', u'ilife', u'06', u'five', u'apple', u'computers', u'household', u'easier', u'ever', u'edit', u'photos', u'perfection', u'photos', u'one', u'place', u'iphoto', u'6', u'rebuilt', u'blazing', u'performance', u'iphoto', u'makes', u'sharing', u'photos', u'faster', u'simpler', u'cooler', u'ever'

#### Transforming the list of tokens into a sparse structure (dictionary). This will allow us to efficiently compute scalar products between sparse vectors. 

#### **EXERCISE:**  Complete the definition of the "compute_tf" function that takes as input a list of tokens and produces a dictionary {key: value}, with one key for every unique element in the list of tokens. The value for every token is the number of times that token appears in the list, divided by the total number of tokens. This measurement is known as **TF**, term frequency. 
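A worked example of the term-frequency definition, written as a sketch with `collections.Counter`. The function name `term_frequencies` and the token list are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import Counter

def term_frequencies(tokens):
    """Map each distinct token to count / total number of tokens."""
    total = float(len(tokens))
    return {token: count / total for token, count in Counter(tokens).items()}

# Made-up token list: 'audio' appears twice out of four tokens.
tf = term_frequencies(["audio", "book", "audio", "spanish"])
print("audio: %.2f, book: %.2f" % (tf["audio"], tf["book"]))
```

The values of a TF dictionary always add up to one, which is a quick sanity check for any implementation.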

#### **_The answer should be_:**

<pre><code>
{'sentence': 0.037037037037037035, 'text': 0.037037037037037035, 'ir': 0.037037037037037035, 'multiset': 0.037037037037037035, 'even': 0.037037037037037035, 'information': 0.037037037037037035, 'document': 0.037037037037037035, 'used': 0.037037037037037035, 'processing': 0.037037037037037035, 'grammar': 0.037037037037037035, 'words': 0.07407407407407407, 'represented': 0.037037037037037035, 'word': 0.037037037037037035, 'retrieval': 0.037037037037037035, 'keeping': 0.037037037037037035, 'natural': 0.037037037037037035, 'language': 0.037037037037037035, 'multiplicity': 0.037037037037037035, 'disregarding': 0.037037037037037035, 'bag': 0.07407407407407407, 'simplifying': 0.037037037037037035, 'representation': 0.037037037037037035, 'model': 0.07407407407407407, 'order': 0.037037037037037035}
</code></pre>

In [12]:
def compute_tf(tokens):
    tf_dict = {}
    # Count how many times each token occurs.
    for token in tokens:
        tf_dict[token] = tf_dict.get(token, 0.0) + 1.0

    # Normalize the counts by the total number of tokens.
    Total = len(tokens)
    for key in tf_dict.keys():
        tf_dict[key] /= Total
    return tf_dict

print compute_tf(tokeniza(texto, stopwords)) 


{'sentence': 0.037037037037037035, 'text': 0.037037037037037035, 'ir': 0.037037037037037035, 'multiset': 0.037037037037037035, 'even': 0.037037037037037035, 'information': 0.037037037037037035, 'document': 0.037037037037037035, 'used': 0.037037037037037035, 'processing': 0.037037037037037035, 'grammar': 0.037037037037037035, 'words': 0.07407407407407407, 'represented': 0.037037037037037035, 'word': 0.037037037037037035, 'retrieval': 0.037037037037037035, 'keeping': 0.037037037037037035, 'natural': 0.037037037037037035, 'language': 0.037037037037037035, 'multiplicity': 0.037037037037037035, 'disregarding': 0.037037037037037035, 'bag': 0.07407407407407407, 'simplifying': 0.037037037037037035, 'representation': 0.037037037037037035, 'model': 0.07407407407407407, 'order': 0.037037037037037035}


#### Let us implement now the cosine similarity function between two items using the TF measurement.  It is defined as:

#### $$ cosim(a,b) = \frac{a \cdot b}{\|a\| \|b\|} = \frac{\sum a_i b_i}{\sqrt{\sum a_i^2} \sqrt{\sum b_i^2}} $$

#### **EXERCISE:** The formula requires a dot product function (between two sparse dictionaries) and a norm function.  Complete the definition of the "dotprod" and "norm" functions below. 
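To make the formula concrete, here is a small numeric sketch on two made-up sparse vectors stored as dictionaries. The vectors and variable names are chosen only for illustration.

```python
import math

# Two made-up sparse vectors stored as {token: weight} dictionaries.
a = {"audio": 0.5, "book": 0.5}
b = {"audio": 1.0}

# Only tokens present in both dictionaries contribute to the dot product.
dot = sum(a[t] * b[t] for t in a if t in b)
norm_a = math.sqrt(sum(v * v for v in a.values()))
norm_b = math.sqrt(sum(v * v for v in b.values()))

similarity = dot / (norm_a * norm_b)
print("%.6f" % similarity)
```

Here the dot product is 0.5, the norms are the square root of 0.5 and 1, so the similarity equals one over the square root of two: the vectors share one of two directions.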


#### **_The answer should be_:**

<pre><code>
The dot product between tf_dict1 and tf_dict2 is 0.055556

The norm of tf_dict1 is 0.212762

The norm of tf_dict2 is 0.408248

The cosim of a vector with itself must be one: 1.000000

The cosim of a vector with itself must be one: 1.000000

The cosim between two vectors does not depend of the order: 0.639602 = 0.639602
</code></pre>

In [13]:
import math

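def dotprod(tf_dict1, tf_dict2):
    # Only tokens present in both dictionaries contribute to the sum.
    common_tokens = set(tf_dict1.keys()).intersection(tf_dict2.keys())
    dotProd = 0
    for token in common_tokens:
        dotProd += tf_dict1[token]*tf_dict2[token]
    return dotProd

def norm(tf_dict1):
    # Euclidean norm: square root of the dot product of the vector with itself.
    dp = dotprod(tf_dict1, tf_dict1)
    norma = math.sqrt(dp)
    return norma

def cosim(tf_dict1, tf_dict2):
    dp = dotprod(tf_dict1, tf_dict2)
    norm1 = norm(tf_dict1)
    norm2 = norm(tf_dict2)
    # An empty token list gives a zero norm; return 0.0 instead of dividing by zero.
    if norm1 == 0 or norm2 == 0:
        return 0.0
    cs = dp / norm1 / norm2
    return cs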


tf_dict1 = compute_tf(tokeniza(texto, stopwords))
tf_dict2 = compute_tf(tokeniza(texto[0:60], stopwords))

print "The dot product between tf_dict1 and tf_dict2 is %f\n" % dotprod(tf_dict1, tf_dict2) 
print "The norm of tf_dict1 is %f\n" % norm(tf_dict1) 
print "The norm of tf_dict2 is %f\n" % norm(tf_dict2) 
print "The cosim of a vector with itself must be one: %f\n" % cosim(tf_dict1, tf_dict1) 
print "The cosim of a vector with itself must be one: %f\n" % cosim(tf_dict2, tf_dict2) 
print "The cosim between two vectors does not depend of the order: %f = %f\n" % (cosim(tf_dict1, tf_dict2) , cosim(tf_dict2, tf_dict1))

The dot product between tf_dict1 and tf_dict2 is 0.055556

The norm of tf_dict1 is 0.212762

The norm of tf_dict2 is 0.408248

The cosim of a vector with itself must be one: 1.000000

The cosim of a vector with itself must be one: 1.000000

The cosim between two vectors does not depend of the order: 0.639602 = 0.639602



#### We will use these functions to process the datasets and find the two most similar records between amazon and google. 

#### **EXERCISE:** Generate a new "allPairsRDD" that contains all possible combinations between elements of googleSmallRDDtokenized and amazonSmallRDDtokenized. You may want to use the "cartesian" transformation. Transform "allPairsRDD" to obtain a new RDD with the format (googleID, amazonID, cosim). Finally, print the contents with the largest cosine similarity.
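The structure produced by a cartesian product can be previewed locally with `itertools.product`. The records below are made up and stand in for the two tokenized RDDs.

```python
from itertools import product

# Made-up tokenized records standing in for the two RDDs.
google_items = [("g1", ["audio", "book"]), ("g2", ["photo", "editor"])]
amazon_items = [("a1", ["audio", "book"]), ("a2", ["card", "studio"]), ("a3", ["editor"])]

# Every google record is paired with every amazon record: 2 * 3 = 6 pairs,
# each shaped as ((googleID, tokens), (amazonID, tokens)).
all_pairs = list(product(google_items, amazon_items))

print(len(all_pairs))
print(all_pairs[0])
```

Given this nesting, the google ID of a pair `x` is `x[0][0]`, the amazon ID is `x[1][0]`, and the two token lists are `x[0][1]` and `x[1][1]`.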

#### **_The answer should be_:**

<pre><code>
The allPairsRDD has 40000 elements.

This is one of the elements in allPairsRDD:

[(('http://www.google.com/base/feeds/snippets/11448761432933644608', ['spanish', 'vocabulary', 'builder', 'expand', 'vocabulary', 'contains', 'fun', 'lessons', 'teach', 'entertain', 'll', 'quickly', 'find', 'mastering', 'new', 'terms', 'includes', 'games']), ('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund']))]

</code></pre>

In [14]:
allPairsRDD = (googleSmallRDDtokenized
              .cartesian(amazonSmallRDDtokenized)
              .cache())

print "The allPairsRDD has %d elements.\n" % allPairsRDD.count()

print "This is one of the elements in allPairsRDD:\n"
print allPairsRDD.take(1)

The allPairsRDD has 40000 elements.

This is one of the elements in allPairsRDD:

[((u'http://www.google.com/base/feeds/snippets/11448761432933644608', [u'spanish', u'vocabulary', u'builder', u'expand', u'vocabulary', u'contains', u'fun', u'lessons', u'teach', u'entertain', u'll', u'quickly', u'find', u'mastering', u'new', u'terms', u'includes', u'games']), (u'b000jz4hqo', [u'clickart', u'950', u'000', u'premier', u'image', u'pack', u'dvd', u'rom', u'broderbund']))]


#### **EXERCISE:** Transform "allPairsRDD" to obtain a new "cosimRDD" with the format (googleID, amazonID, cosim). 

#### **_The answer should be_:**

<pre><code>
This is the first element in cosimRDD:

[('http://www.google.com/base/feeds/snippets/11448761432933644608', 'b000jz4hqo', 0.0)]
</code></pre>

In [15]:
# Each element of allPairsRDD is ((googleID, googleTokens), (amazonID, amazonTokens)).
cosimRDD = (allPairsRDD
              .map(lambda x: (x[0][0], x[1][0], cosim(compute_tf(x[0][1]), compute_tf(x[1][1]))))
              .cache())

print "This is the first element in cosimRDD:\n"
print cosimRDD.take(1)

This is the first element in cosimRDD:

[(u'http://www.google.com/base/feeds/snippets/11448761432933644608', u'b000jz4hqo', 0.0)]


#### **EXERCISE:**  Now, find the element with the largest similarity.

#### **_The answer should be_:**

<pre><code>
This is the element in cosimRDD with the largest similarity:

[('http://www.google.com/base/feeds/snippets/18411875162562199123', 'b000j4k804', 0.9712858623572642)]
</code></pre>

In [16]:
mostSimilarPair = (cosimRDD
              .takeOrdered(1, lambda x: -x[2])
              )

print "This is the element in cosimRDD with the largest similarity:\n"
print mostSimilarPair

This is the element in cosimRDD with the largest similarity:

[(u'http://www.google.com/base/feeds/snippets/18411875162562199123', u'b000j4k804', 0.9712858623572642)]


#### **EXERCISE:**  As a final step, we will print the contents in amazon and google corresponding to that element, to check if they are similar or not.

#### **_The answer should be_:**

<pre><code>

The google ID is: http://www.google.com/base/feeds/snippets/18411875162562199123

The amazon ID is: b000j4k804

The google content is: topics entertainment 40248 instant immersion spanish audio book audio book "instant immersion spanish (audio book) (audio book)" "topics entertainment"

The amazon content is: instant immersion spanish (audio book) "instant immersion spanish (audio book) (audio book)" "topics entertainment"

</code></pre>

In [17]:
googleID = mostSimilarPair[0][0]
print "The google ID is: %s\n" % googleID
amazonID = mostSimilarPair[0][1]
print "The amazon ID is: %s\n" % amazonID

googleContent = (googleSmallRDD
                         .map(lambda x: x.split(';'))
                         .filter(lambda x: x[0] == googleID)
                         .map(lambda x: x[1])
                         .first()
                )

amazonContent = (amazonSmallRDD
                         .map(lambda x: x.split(';'))
                         .filter(lambda x: x[0] == amazonID)
                         .map(lambda x: x[1])
                         .first()
                )

print "The google content is:\n"
print googleContent
print "\nThe amazon content is:\n"
print amazonContent


The google ID is: http://www.google.com/base/feeds/snippets/18411875162562199123

The amazon ID is: b000j4k804

The google content is:

topics entertainment 40248 instant immersion spanish audio book audio book "instant immersion spanish (audio book) (audio book)" "topics entertainment"

The amazon content is:

instant immersion spanish (audio book) "instant immersion spanish (audio book) (audio book)" "topics entertainment"


#### **EXERCISE:**  Repeat the computations with the full datasets. Open localhost:4040 in a new browser tab to monitor the progress of the tasks.
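Before launching the full run, it can help to estimate how many pairs the cartesian product will contain. The sketch below only uses the record counts reported when the datasets were loaded; the variable names are chosen for this example.

```python
# Record counts reported when the datasets were loaded.
n_google = 3226
n_amazon = 1363
n_small = 200

# The cartesian product contains one pair per (google, amazon) combination.
full_pairs = n_google * n_amazon
small_pairs = n_small * n_small

ratio = full_pairs / float(small_pairs)
print("%d pairs, about %.0f times the small experiment" % (full_pairs, ratio))
```

The number of pairs grows with the product of the dataset sizes, so the full run does roughly a hundred times more similarity computations than the 200 x 200 experiment.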

#### **_The answer should be_:**

<pre><code>
The allPairsRDD has 4397038 elements.

This is one of the elements in allPairsRDD:

[(('http://www.google.com/base/feeds/snippets/11125907881740407428', ['learning', 'quickbooks', '2007', 'learning', 'quickbooks', '2007', 'intuit']), ('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund']))]

This is the first element in cosimRDD:

[('http://www.google.com/base/feeds/snippets/11125907881740407428', 'b000jz4hqo', 0.0)]

This is the element in cosimRDD with the largest similarity:

[('http://www.google.com/base/feeds/snippets/17521446718236049500', 'b000v9yxj4', 1.0)]

The google ID is: http://www.google.com/base/feeds/snippets/17521446718236049500

The amazon ID is: b000v9yxj4

The google content is: nero inc nero 8 ultra edition  

The amazon content is: nero 8 ultra edition  "nero inc."
</code></pre>

In [18]:
allPairsRDD = (googleRDDtokenized
              .cartesian(amazonRDDtokenized)
              .cache())

print "The allPairsRDD has %d elements.\n" % allPairsRDD.count()

print "This is one of the elements in allPairsRDD:\n"
print allPairsRDD.take(1)

cosimRDD = (allPairsRDD
              .map(lambda x: (x[0][0], x[1][0],  cosim(compute_tf(x[0][1]) ,compute_tf(x[1][1]) )   ))
              .cache())

print "\nThis is the first element in cosimRDD:\n"
print cosimRDD.take(1)

mostSimilarPair = (cosimRDD
              .takeOrdered(1, lambda x: -x[2])
              )

print "\nThis is the element in cosimRDD with the largest similarity:\n"
print mostSimilarPair

googleID = mostSimilarPair[0][0]
print "The google ID is: %s\n" % googleID
amazonID = mostSimilarPair[0][1]
print "The amazon ID is: %s\n" % amazonID

googleContent = (googleRDD
                         .map(lambda x: x.split(';'))
                         .filter(lambda x: x[0] == googleID)
                         .map(lambda x: x[1])
                         .first()
                )

amazonContent = (amazonRDD
                         .map(lambda x: x.split(';'))
                         .filter(lambda x: x[0] == amazonID)
                         .map(lambda x: x[1])
                         .first()
                )

print "The google content is: %s\n" % googleContent
print "The amazon content is: %s\n" % amazonContent

The allPairsRDD has 4397038 elements.

This is one of the elements in allPairsRDD:

[((u'http://www.google.com/base/feeds/snippets/11125907881740407428', [u'learning', u'quickbooks', u'2007', u'learning', u'quickbooks', u'2007', u'intuit']), (u'b000jz4hqo', [u'clickart', u'950', u'000', u'premier', u'image', u'pack', u'dvd', u'rom', u'broderbund']))]

This is the first element in cosimRDD:

[(u'http://www.google.com/base/feeds/snippets/11125907881740407428', u'b000jz4hqo', 0.0)]

This is the element in cosimRDD with the largest similarity:

[(u'http://www.google.com/base/feeds/snippets/17521446718236049500', u'b000v9yxj4', 1.0)]
The google ID is: http://www.google.com/base/feeds/snippets/17521446718236049500

The amazon ID is: b000v9yxj4

The google content is: nero inc nero 8 ultra edition  

The amazon content is: nero 8 ultra edition  "nero inc."

