# Analyzing Hamlet and Transforming It Into a Data Set

## Objective

Use Spark transformation and action functions to clean up the text of _Hamlet_ and transform it into a usable data set.

## Data Set

The file hamlet.txt contains the entire text of [Shakespeare's play Hamlet](https://en.wikipedia.org/wiki/Hamlet). Shakespeare is well known for his unique writing style and is arguably one of the most influential writers in history. Hamlet is one of his most popular plays.

The file is in plain text format and is not yet ready for analysis.

## Reading In The Data

In [1]:
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext()

raw_hamlet = sc.textFile("hamlet.txt")
raw_hamlet.take(5)

['\tHAMLET', '', '', '\tDRAMATIS PERSONAE', '']

In [2]:
split_hamlet = raw_hamlet.map(lambda line: line.split('\t'))
split_hamlet.take(5)

[['', 'HAMLET'], [''], [''], ['', 'DRAMATIS PERSONAE'], ['']]
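
To see where the empty strings in this output come from, the sketch below applies the same `split('\t')` call to a few sample lines in plain Python, with no Spark required. The sample lines are copied from the `take(5)` output above; the names `sample_lines` and `split_sample` are only illustrative.

```python
# Plain-Python illustration of what map(lambda line: line.split('\t'))
# does to each element. `sample_lines` is a hypothetical stand-in for the RDD.
sample_lines = ['\tHAMLET', '', '\tDRAMATIS PERSONAE']

# A leading tab produces an empty first field; an empty line produces [''].
split_sample = [line.split('\t') for line in sample_lines]
print(split_sample)
```

A line that starts with a tab splits into an empty first field followed by the text, and an empty line splits into a list holding a single empty string.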

#### Create an RDD of tuples that pair the first field of each line (treated here as the line ID) with the text "hamlet speaketh!", but only for the elements of the RDD that have "HAMLET" as one of their values.

In [3]:
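def hamlet_speaks(line):
    # Treat the first field of the split line as its ID.
    line_id = line[0]

    # `line` is a list, so this matches only a field that is exactly "HAMLET".
    if "HAMLET" in line:
        yield line_id, "hamlet speaketh!"

# flatMap flattens the generator output, so lines that yield nothing are dropped.
hamlet_spoken = split_hamlet.flatMap(hamlet_speaks)
hamlet_spoken.take(10)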

[('', 'hamlet speaketh!'),
 ('HAMLET', 'hamlet speaketh!'),
 ('', 'hamlet speaketh!'),
 ('', 'hamlet speaketh!'),
 ('HAMLET', 'hamlet speaketh!'),
 ('HAMLET', 'hamlet speaketh!'),
 ('HAMLET', 'hamlet speaketh!'),
 ('HAMLET', 'hamlet speaketh!'),
 ('HAMLET', 'hamlet speaketh!'),
 ('HAMLET', 'hamlet speaketh!')]
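
The reason `flatMap` fits here is that the generator emits either one tuple or nothing, and `flatMap` flattens those per-line results into a single sequence. A minimal plain-Python sketch of that behavior, assuming a small hypothetical list `sample` in place of the RDD:

```python
from itertools import chain

def hamlet_speaks(line):
    # Same logic as the notebook function: emit one tuple, or nothing at all.
    if "HAMLET" in line:
        yield line[0], "hamlet speaketh!"

# Hypothetical sample of already-split lines.
sample = [
    ['', 'HAMLET'],
    [''],
    ['HAMLET', 'son to the late, and nephew to the present king.'],
]

# flatMap behaves like mapping the function over every element
# and then flattening the resulting iterables into one sequence.
flattened = list(chain.from_iterable(hamlet_speaks(line) for line in sample))
print(flattened)
```

The middle element produces an empty generator, so it contributes nothing to the result; with `map` it would instead appear as one generator object per line.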

## Filter Using A Named Function

Extract the original lines where Hamlet spoke. 

In [4]:
def filter_hamlet_speaks(line):
    # Keep the line only if one of its fields is exactly "HAMLET".
    return "HAMLET" in line

hamlet_spoken_lines = split_hamlet.filter(filter_hamlet_speaks)
hamlet_spoken_lines.take(5)

[['', 'HAMLET'],
 ['HAMLET', 'son to the late, and nephew to the present king.'],
 ['', 'HAMLET'],
 ['', 'HAMLET'],
 ['HAMLET', '[Aside]  A little more than kin, and less than kind.']]
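
Because each element is a list of fields, `"HAMLET" in line` tests for a field that equals "HAMLET" exactly, not for the substring anywhere in the text. The sketch below contrasts the two checks in plain Python; the `sample` rows and the helper `mentions_hamlet` are hypothetical and not part of the notebook.

```python
def filter_hamlet_speaks(line):
    # List membership: True only when a whole field equals "HAMLET".
    return "HAMLET" in line

def mentions_hamlet(line):
    # Hypothetical alternative: substring search inside every field.
    return any("HAMLET" in field for field in line)

# Made-up rows: a speaker tag, a passing mention, and a different speaker.
sample = [
    ['HAMLET', 'a line spoken by the prince'],
    ['', 'a line that mentions HAMLET in passing'],
    ['HORATIO', 'a line spoken by someone else'],
]

kept_exact = [line for line in sample if filter_hamlet_speaks(line)]
kept_substring = [line for line in sample if mentions_hamlet(line)]
print(len(kept_exact), len(kept_substring))
```

The exact-match check keeps only the first row, while the substring check also keeps the second, so the choice between them changes what counts as a line "where Hamlet spoke".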

## Compute the number of elements in hamlet_spoken_lines

In [5]:
# count() is an action: it triggers the computation and returns the number of elements.
spoken_count = hamlet_spoken_lines.count()
spoken_count

286

In [6]:
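# collect() returns every element to the driver as a Python list.
spoken_collect = hamlet_spoken_lines.collect()
# Index 100 is the 101st element.
spoken_101 = spoken_collect[100]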

In [7]:
spoken_101

['HAMLET', 'A goodly one; in which there are many confines,']
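
`collect()` copies the whole RDD to the driver just to read one element. That is harmless for 286 lines, but for larger data one alternative is to pair each element with its position and filter on that position; PySpark provides `RDD.zipWithIndex()` for the pairing step. The sketch below only mimics the idea with `enumerate` in plain Python, on a hypothetical list `lines` of placeholder elements.

```python
# Hypothetical stand-in for hamlet_spoken_lines: 286 placeholder elements.
lines = [['HAMLET', f'placeholder line {i}'] for i in range(286)]

# zipWithIndex() yields (element, index) pairs; enumerate yields
# (index, element), so the order is swapped here to match.
indexed = ((element, index) for index, element in enumerate(lines))

# Keep only the element at index 100, i.e. the 101st element.
line_101 = [element for element, index in indexed if index == 100]
print(line_101)
```

In Spark the filter would run on the executors, so only the matching element would need to travel back to the driver.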