# Text Analysis

#### The idea of this exercise to perform simple text analysis, a popular concept used in many cutting-edge applications. Also, known as Text Mining - the idea is to retrieve high-quality information from the text. Some of the text mining tasks are: text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization etc

#### Based on a custom query, we will try to find the similar documents from our pool of documents

In [1]:
from pyspark import SparkContext
sc = SparkContext()

In [2]:
# Load the text file in zipped format, yes that's possible!
t = sc.textFile('data/test.ft.txt.bz2')

In [3]:
t.count()

400000

In [4]:
# Take a look how the data looks like
t.take(10)

['__label__2 Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I\'m in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life\'s hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"',
 "__label__2 One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too ma

##### Stopwords: The list of most frequenty used words in a specific language. Stopwords do not offer any useful information about a chunk of text, so we generally remove them from the text before progressing further

In [5]:
# Execute this cell to download the list of English stopwords
import urllib.request as urllib
urllib.urlretrieve ("https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt", "stopwords.txt")

('stopwords.txt', <http.client.HTTPMessage at 0x7f20b9d28b38>)

In [6]:
stopwords = sc.textFile("stopwords.txt").collect()

#### Split the total dataset into two parts, if needed

In [7]:
train,test = t.randomSplit(weights=[0.9, 0.1], seed=1)

In [8]:
# Check the number of partitions
train.getNumPartitions()

2

In [9]:
# Increase the number of partitions
train = train.repartition(10)

In [10]:
train.getNumPartitions() # Check again

10

In [11]:
train.persist() # Store the RDD in memory for quicker operations

MapPartitionsRDD[10] at coalesce at NativeMethodAccessorImpl.java:0

In [12]:
# Split the text into 'tokens' (individual words) by whitespace
traw = train.map(lambda x: x.split(' '))

In [45]:
# Discard the first token(word) and take rest
tdata = traw.map(lambda x: x[1:])

In [53]:
xxx = tdata.take(1)

In [56]:
for i in xxx:
    print(xxx)

[['very', 'disappointing:', 'The', 'movie', 'is', 'vulgar', 'and', 'not', 'meant', 'for', 'children.', 'It', 'is', 'a', 'typical', 'Adam', 'Sandler', 'movie,', 'with', 'foul', 'language', 'and', 'raunchy', 'humor.', 'Not', 'enjoyable', 'at', 'all.']]


In [79]:
# Create a function which tries to eliminate all the special characters in tokens(words)
# Also, only take words which have length more than 2!
# Hint: Use regex, the module in python is re
# Input: x -> list of words/tokens
# Outout: list of words/tokens with length more than 2 and without any special characters
import re
def replace_special_chars(x):
    return [re.sub('[^a-zA-Z0-9]|\.', '', nelem) for nelem in x if len(nelem)>2] 

In [119]:
t_semi_clean = tdata.map(replace_special_chars)
t_semi_clean.take(10)

[['very',
  'disappointing',
  'The',
  'movie',
  'vulgar',
  'and',
  'not',
  'meant',
  'for',
  'children',
  'typical',
  'Adam',
  'Sandler',
  'movie',
  'with',
  'foul',
  'language',
  'and',
  'raunchy',
  'humor',
  'Not',
  'enjoyable',
  'all'],
 ['Sandler',
  'Strikes',
  'Out',
  'Crazy',
  'Nights',
  'might',
  'have',
  'been',
  'sweet',
  'film',
  'with',
  'good',
  'message',
  'for',
  'kids',
  'but',
  'the',
  'scatological',
  'humor',
  'offensive',
  'language',
  'and',
  'explicit',
  'sexual',
  'references',
  'made',
  'unsuitable',
  'for',
  '10year',
  'old',
  'The',
  'plot',
  'the',
  'other',
  'hand',
  'while',
  'fine',
  'for',
  '10year',
  'olds',
  'was',
  'too',
  'obvious',
  'and',
  'simplistic',
  'for',
  'most',
  'the',
  'adults',
  'the',
  'audience',
  'result',
  'while',
  'its',
  'probably',
  'not',
  'the',
  'worst',
  'film',
  'the',
  'year',
  'certainly',
  'the',
  'running'],
 ['Not',
  'the',
  'worst',
  '

In [113]:
# Create a function that would make the tokens(words) lowercase and then check if it's a stopword or not.
# If stopword, then discard it
# Input: x -> list of words/tokens
# Outout: list of words/tokens without stopwords
def remove_sw(x):
    # Write your code here
    return [elm.lower() for elm in x if elm.lower() not in stopwords]

In [112]:
t_clean = t_semi_clean.map(remove_sw)
t_clean.take(1)

[['disappointing',
  'movie',
  'vulgar',
  'meant',
  'children',
  'typical',
  'adam',
  'sandler',
  'movie',
  'foul',
  'language',
  'raunchy',
  'humor',
  'enjoyable']]

## Term Frequency (TF): The number of times a specific word occurs in a record

#### TF of term 't' in a document 'd' = Number of times term 't' occurs in a document or record 'd'

In [114]:
# Write a function which takes the rdd item (record) and 
# then tries to count the occurances of a specific word in the whole record
# Input: record -> list of words/tokens
# Output: list of (word, frequency of occurance)
def tf(record):
    counts = {}
    # Write your code here
    for word in record:  # Looping, Why?
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return list(counts.items()) 

In [115]:
tokens_with_tfs = t_clean.map(tf)
tokens_with_tfs.take(10)

[[('disappointing', 1),
  ('movie', 2),
  ('vulgar', 1),
  ('meant', 1),
  ('children', 1),
  ('typical', 1),
  ('adam', 1),
  ('sandler', 1),
  ('foul', 1),
  ('language', 1),
  ('raunchy', 1),
  ('humor', 1),
  ('enjoyable', 1)],
 [('sandler', 1),
  ('strikes', 1),
  ('crazy', 1),
  ('nights', 1),
  ('might', 1),
  ('sweet', 1),
  ('film', 2),
  ('good', 1),
  ('message', 1),
  ('kids', 1),
  ('scatological', 1),
  ('humor', 1),
  ('offensive', 1),
  ('language', 1),
  ('explicit', 1),
  ('sexual', 1),
  ('references', 1),
  ('made', 1),
  ('unsuitable', 1),
  ('10year', 2),
  ('old', 1),
  ('plot', 1),
  ('hand', 1),
  ('fine', 1),
  ('olds', 1),
  ('obvious', 1),
  ('simplistic', 1),
  ('adults', 1),
  ('audience', 1),
  ('result', 1),
  ('probably', 1),
  ('worst', 1),
  ('year', 1),
  ('certainly', 1),
  ('running', 1)],
 [('worst', 2),
  ('far', 1),
  ('good', 1),
  ('film', 2),
  ('dubious', 1),
  ('honor', 1),
  ('included', 1),
  ('book', 1),
  ('films', 1),
  ('time', 1),
  

## Inverse Document Frequency (IDF): How important is a specific word in the whole corpus

#### Calculation of IDF is not as straightforward as TF. 
#### IDF score of term 't' = log(total number of documents / number of documents containing 't')

In [136]:
#Take out the unique words per record from 't_clean' 
# Hint: Use python 'set' function

unique_words_per_record = t_clean.map(lambda x: list(set(x)))

In [137]:
# Write a helper function to attach '1' to every word
# Input: record -> list of words
# Output: list of tuples where each tuple is (word, 1)
def attach_1_to_words(record):
    # Your code here
    return [(elm,1) for elm in record]

In [142]:
# You need to attach '1' to each and every word across all records of RDD 'unique_words_per_record'
# And Return as a single list. 
# Which transformation should we use?
unique_words_per_record_with_1 = t_clean.flatMap(attach_1_to_words)# unique_words_per_record. YOUR CODE HERE

In [144]:
unique_words_per_record_with_1.take(5)

[('disappointing', 1),
 ('movie', 1),
 ('vulgar', 1),
 ('meant', 1),
 ('children', 1)]

In [145]:
# We need to add up the '1's together for same words 
# which is basically counting the number of documents where a specific word occurs!
# Which transformation?
tokens_with_docs_count = unique_words_per_record_with_1.reduceByKey(lambda a,b:a+b)
tokens_with_docs_count.take(5)

[('language', 3853),
 ('strikes', 319),
 ('crazy', 2225),
 ('sexual', 1091),
 ('references', 1198)]

In [148]:
# Now, count the total number of documents
docs = t_clean.count()

In [149]:
docs

359963

In [150]:
# You have the counts for the words in the whole document set, now try to calculate IDF
# Hint: use python module "math" and then math.log for logarithm
# Return: RDD of (token, idf_score)
import math
tokens_with_idfs = tokens_with_docs_count.map(lambda x: (x[0], math.log(docs/x[1])))

In [151]:
# Sort the result on the basis of idf scores and take just 10. Which 'action' do we use?
tokens_with_idfs.takeOrdered(10, lambda s: s[1])

[('book', 0.6277301356605689),
 ('one', 1.0030553967683506),
 ('great', 1.2283240049232733),
 ('like', 1.2675693884191037),
 ('good', 1.2718713210308168),
 ('just', 1.3527353106939637),
 ('will', 1.6148503100624352),
 ('get', 1.6623843504641405),
 ('read', 1.6656115793176205),
 ('time', 1.7240915000058556)]

In [152]:
# Calculate the idfs for each of the tokens (words) as a python dict (because we need to use it over and over again)
tokens_with_idfs_dict = tokens_with_idfs.collectAsMap()

In [154]:
len(tokens_with_idfs_dict)

456313

In [156]:
tokens_with_tfs.take(1)

[[('disappointing', 1),
  ('movie', 2),
  ('vulgar', 1),
  ('meant', 1),
  ('children', 1),
  ('typical', 1),
  ('adam', 1),
  ('sandler', 1),
  ('foul', 1),
  ('language', 1),
  ('raunchy', 1),
  ('humor', 1),
  ('enjoyable', 1)]]

In [157]:
tokens_with_idfs_dict['disappointing']

4.017898212574991

### TFIDF score of a term in a specific document = TF of the term in a specific doc x IDF of the term 

In [171]:
# Write the function tfidf which would take the rdd which has the token counts per document
# and then muliply with the IDF score of that term
# Input: record -> list of (word, term frequency)
# Output: list of (word, tfidf score)
def tfidf(record):
    return [(elm[0],tokens_with_idfs_dict[elm[0]]*elm[1]) for elm in record if elm[0] in tokens_with_idfs_dict]  

In [172]:
tfidf_docs = tokens_with_tfs.map(tfidf)
tfidf_docs.take(5)

[[('disappointing', 4.017898212574991),
  ('movie', 3.622732635577382),
  ('vulgar', 7.505489496677982),
  ('meant', 5.358318507557967),
  ('children', 3.87136493007666),
  ('typical', 5.138839479524196),
  ('adam', 6.858862331752929),
  ('sandler', 7.97347496176748),
  ('foul', 7.244680442477297),
  ('language', 4.53714918274636),
  ('raunchy', 8.13031743326045),
  ('humor', 4.67108850402611),
  ('enjoyable', 4.621309709029738)],
 [('sandler', 7.97347496176748),
  ('strikes', 7.028565424587673),
  ('crazy', 5.086244332772177),
  ('nights', 6.228491557337156),
  ('might', 3.510444542940512),
  ('sweet', 4.885737082740047),
  ('film', 5.8561493791445765),
  ('good', 1.2718713210308168),
  ('message', 4.803857152429578),
  ('kids', 3.6310320093379977),
  ('scatological', 11.001997058144463),
  ('humor', 4.67108850402611),
  ('offensive', 6.476591840625233),
  ('language', 4.53714918274636),
  ('explicit', 7.1661354136818805),
  ('sexual', 5.798906541539447),
  ('references', 5.7053477486

### Calculate cosine similarity :  measure of similarity of two documents i.e. the document vectors and the query vector. The document vectors are the vector representation of our documents which we have already calculated and the query vector will be calcultated based on a custom query

#### https://en.wikipedia.org/wiki/Cosine_similarity

In [164]:
# The cosine similarity function
# Input: doc_record: data rdd record, query: query rdd record
# Output: tuple of (doc_record, cosine similarity score)
def cosine_similarity(doc_record, query):
    dot_prod = 0.0
    norm_record = []
    norm_query = []
    for query_term in query:
        norm_query.append(query[query_term])
    for word_tfidf in doc_record:
        word = word_tfidf[0]
        tfidf = word_tfidf[1]
        norm_record.append(tfidf**2)
        
        if word in query:
            dot_prod += query[word] * tfidf
        res = dot_prod / math.sqrt(sum(norm_record)) / math.sqrt(sum(norm_query))
        return (doc_record, res)

In [165]:
def tuples_to_dict(record):
    output = {}
    for word_tfidf in record:
        word = word_tfidf[0]
        tfidf = word_tfidf[1]
        output[word] = tfidf
    return output

# condense all previous steps

In [178]:
def querybuilder(querystr=""):
    query_rdd_raw = sc.parallelize([tuple(querystr.split(' '))])
    query_rs = query_rdd_raw.map(replace_special_chars)
    query_sw = query_rs.map(remove_sw)
    query_rdd_tf = query_sw.map(tf)
    query_rdd_tfidf = query_rdd_tf.map(tfidf)
    query_dict = query_rdd_tfidf.map(tuples_to_dict).collect()[0]
    return query_dict

In [179]:
test.take(1)

["__label__2 Simple, Durable, Fun game for all ages: This is an AWESOME game! Almost everyone know tic-tac-toe so it is EASY to learn and quick to play. You can't play just once! The twist is that your pieces are slightly different sizes - just big enough to gobble up your opponent. The first person to make tic-tac-toe wins, but it's not as easy as it looks when you're stuck in the mindset of just making three in a row and forget about the gobbling possibilities! My 4 and 5 year olds will beat me even when I'm trying to win! Excellent beginning critical thinking game. Grandparents loved playing it with the kids too."]

#### Now we will build the 'query' which would be used to find similar documents

In [180]:
# query = querybuilder("") # You can build the query by passing a string OR
query = querybuilder(test.take(1)[0])  # Build the query from the test RDD using any of the documents
query

{'ages': 5.357728711020669,
 'almost': 3.351431799416643,
 'awesome': 3.819518307874936,
 'beat': 4.77414373297225,
 'beginning': 4.355823016941912,
 'big': 3.1509197082371396,
 'critical': 6.025263315723888,
 'different': 3.090428689760224,
 'durable': 5.121929729493736,
 'easy': 5.728033207384647,
 'enough': 3.01651270141064,
 'even': 1.9861925346418126,
 'everyone': 3.834830588677574,
 'excellent': 2.9850742821103013,
 'first': 1.9780866710067773,
 'forget': 4.850683810094584,
 'fun': 3.102039939855632,
 'game': 8.119220815571568,
 'gobble': 9.960543183316302,
 'gobbling': 11.407462166252627,
 'grandparents': 7.573400702294192,
 'just': 4.058205932081891,
 'kids': 3.6310320093379977,
 'know': 2.475546845465548,
 'learn': 3.967462296131199,
 'looks': 3.638400463729094,
 'loved': 3.3540519922690684,
 'make': 2.4545980246711587,
 'making': 3.774455602810012,
 'mindset': 7.796544253608403,
 'olds': 6.517113038030873,
 'opponent': 8.387037280108265,
 'person': 3.926470003383102,
 'pieces

In [181]:
r = tfidf_docs.map(lambda x: cosine_similarity(x, query)) # Calculate the cosine similarity

In [182]:
r.takeOrdered(10, key=lambda s: -s[1])

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 185.0 failed 1 times, most recent failure: Lost task 0.0 in stage 185.0 (TID 183, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/spark/python/pyspark/rdd.py", line 346, in func
    return f(iterator)
  File "/usr/local/spark/python/pyspark/rdd.py", line 1290, in <lambda>
    return self.mapPartitions(lambda it: [heapq.nsmallest(num, it, key)]).reduce(merge)
  File "/opt/conda/lib/python3.6/heapq.py", line 516, in nsmallest
    k = key(elem)
  File "<ipython-input-182-476a741fe417>", line 1, in <lambda>
TypeError: 'NoneType' object is not subscriptable

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/spark/python/pyspark/rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/spark/python/pyspark/rdd.py", line 346, in func
    return f(iterator)
  File "/usr/local/spark/python/pyspark/rdd.py", line 1290, in <lambda>
    return self.mapPartitions(lambda it: [heapq.nsmallest(num, it, key)]).reduce(merge)
  File "/opt/conda/lib/python3.6/heapq.py", line 516, in nsmallest
    k = key(elem)
  File "<ipython-input-182-476a741fe417>", line 1, in <lambda>
TypeError: 'NoneType' object is not subscriptable

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


#### The rest of the section is optional and could be used if needed

In [183]:
r.filter(lambda x: x is None).count()

1

In [185]:
r = r.filter(lambda x: x is not None)

In [186]:
# Attach the document id and then sort
r.zipWithIndex().takeOrdered(5, key=lambda s: -s[0][1])

[(([('game', 2.706406938523856),
    ('sweet', 4.885737082740047),
    ('want', 7.631857485077642),
    ('codes', 14.038409963656218),
    ('got', 2.4623902718913246),
    ('one', 1.0030553967683506),
    ('sweetyou', 12.793756527372517),
    ('get', 1.6623843504641405),
    ('rocket', 7.342718073806817),
    ('launcher', 8.786423342140047),
    ('typing', 7.210260218590818),
    ('code', 10.489347633120463),
    ('hunting', 6.700186757327382),
    ('just', 1.3527353106939637),
    ('kill', 5.376776905991363),
    ('blow', 5.92059269316),
    ('away', 3.2705781630634356),
    ('bhbbq', 12.793756527372517)],
   0.4934579857224185),
  429),
 (([('game', 8.119220815571568),
    ('aged', 7.038014313785605),
    ('well', 5.863291650359596),
    ('went', 3.649235496433628),
    ('back', 2.42032784754386),
    ('original', 3.3589135280444937),
    ('360', 13.00808191292704),
    ('games', 8.549929229185166),
    ('see', 5.099182615465329),
    ('ones', 3.908315615664932),
    ('worth', 2.7958

In [187]:
def get_original_record_ids(result_rdd, number):
    ids = []
    r_rdd = result_rdd.zipWithIndex()
    r_rdd_sorted = r_rdd.takeOrdered(number, key=lambda s: -s[0][1])
    i = 0
    for rec in r_rdd_sorted:
        ids.append((rec[1], i))
        i = i+1
    return ids

def filter_records_on_ids(training_record, oids):
    position = training_record[1]
    for oid in oids:
        if position == oid[0]:
            return True
    return False

def map_final_records(training_record, oids):
    position = training_record[1]
    for oid in oids:
        if position == oid[0]:
            return (training_record, oid[1])
    return None

In [188]:
oids = get_original_record_ids(r, 10)
oids

[(429, 0),
 (1503, 1),
 (3310, 2),
 (4787, 3),
 (5364, 4),
 (5396, 5),
 (5423, 6),
 (7429, 7),
 (10421, 8),
 (10662, 9)]

In [189]:
#Get the full content of the matched documents
train.zipWithIndex().filter(lambda x: filter_records_on_ids(x, oids)).map(lambda x: map_final_records(x, oids)).takeOrdered(10, lambda s: s[1])

[(("__label__2 This game is Sweet.: If you want codes I got codes only one but its Sweet.You can get a rocket launcher by typing in this code when your hunting and you don't just want to kill it you want to blow it away that code is BHB-BQ",
   429),
  0),
 (("__label__1 This Game Hasn't Aged Well: I went back through the original 360 games to see which ones were worth still playing.This game isn't all that bad but lacks in quite a few areas. The story tutorial doesn't guide you through the moves very well. You will be facing challenges needed to level up without any guidance on how to complete them. The characters are very hokey and the dialogue is pretty stiff. The animations are worse as when they talk you can see clear down their throats.If you love Tony Hawk games then this one will probably suit you well. If on the other hand you are a casual gamer and not specifically fan of the genre then dont pick up this game and try one of the 6 others for the 360.",
   1503),
  1),
 (("__la