
# Create a base RDD and transform it

- The volume of unstructured data (log lines, images, binary files) is growing dramatically, and PySpark is an excellent framework for analyzing this type of data through RDDs. In this three-part exercise, you will write code that finds the most common words in the [Complete Works of William Shakespeare](http://www.gutenberg.org/ebooks/100).

- The word-counting program follows these steps:

> - Create a base RDD from the `Complete_Shakespeare.txt` file.
> - Use an RDD transformation to create a long list of words from the elements of the base RDD.
> - Remove stop words from your data.
> - Create a pair RDD in which each element is a tuple `(w, 1)`, where `w` is a word.
> - Group the elements of the pair RDD by key (word) and add up their values.
> - Swap the keys (words) and values (counts) so that the count becomes the key and the word becomes the value.
> - Finally, sort the RDD by count in descending order and print the 10 most frequent words and their frequencies.
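
The steps above can be sketched in plain Python on a tiny in-memory sample, which makes each stage easy to inspect before moving to RDDs. This is only an illustration, not the exercise solution: `sample_lines` and `stop_words` are hypothetical stand-ins, lowercasing the words before the stop-word check is an assumption, and the comments name the RDD operation that would usually play the same role in PySpark.

```python
from collections import Counter

# Step 1 stand-in: hypothetical sample lines instead of the real text file.
sample_lines = [
    "To be or not to be",
    "that is the question",
    "The rest is silence",
]

# Hypothetical, deliberately tiny stop-word list (a real one is much longer).
stop_words = {"to", "or", "not", "is", "the", "that"}

# Step 2: split every line and flatten into one list of words (flatMap).
words = [word for line in sample_lines for word in line.split()]

# Step 3: drop stop words, comparing in lower case (filter).
kept = [word.lower() for word in words if word.lower() not in stop_words]

# Step 4: pair each word with a count of 1 (map).
pairs = [(word, 1) for word in kept]

# Step 5: add up the values for each key (reduceByKey).
counts = Counter()
for word, one in pairs:
    counts[word] += one

# Step 6: swap to (count, word) so that the count becomes the key (map).
swapped = [(count, word) for word, count in counts.items()]

# Step 7: sort by count in descending order and keep at most 10 entries.
top = sorted(swapped, key=lambda pair: pair[0], reverse=True)[:10]

for count, word in top:
    print(word, "has", count, "counts")
```

On this sample the most frequent remaining word is `be`, which appears twice; every other kept word appears once.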

- In this first exercise, you'll create a base RDD from the `Complete_Shakespeare.txt` file and transform it to create a long list of words.

- Remember, a `SparkContext` named `sc` is already available in your workspace. A `file_path` variable (the path to the `Complete_Shakespeare.txt` file) is also loaded for you.


## Instructions
- Create an RDD called `baseRDD` that reads lines from `file_path`.
- Transform the `baseRDD` into a long list of words and create a new `splitRDD`.
- Count the total words in `splitRDD`.

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2.
# The driver and the workers should run the same Python minor version.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
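
The `PYSPARK_SUBMIT_ARGS` value used above is a `--packages` flag followed by comma-separated Maven coordinates and the `pyspark-shell` token. As a minimal sketch, the string can be assembled from a list so that packages are easier to add or remove. The helper name `build_submit_args` is hypothetical and not part of PySpark.

```python
# Maven coordinates copied from the initialization cell above.
packages = [
    "com.databricks:spark-xml_2.11:0.6.0",
    "org.apache.spark:spark-avro_2.11:2.4.3",
]


def build_submit_args(package_list):
    """Hypothetical helper: join the coordinates with commas (no spaces)
    and append the trailing pyspark-shell token."""
    return "--packages " + ",".join(package_list) + " pyspark-shell"


submit_args = build_submit_args(packages)
print(submit_args)
```

The resulting string can then be assigned to `os.environ['PYSPARK_SUBMIT_ARGS']` before the `SparkSession` is created.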

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

In [8]:
file_path = "file:///home/talentum/test-jupyter/P2/M2/SM4/Dataset/Complete_Shakespeare.txt"

# Create a baseRDD from the file path
baseRDD = sc.textFile(file_path)

# Split the lines of baseRDD into words
splitRDD = baseRDD.flatMap(lambda x: x.split())

# Count the total number of words
print("Total number of words in splitRDD:", splitRDD.count())

Total number of words in splitRDD: 128576
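
The cell above uses `flatMap` rather than `map` because splitting a line produces a list, and the goal is one flat collection of words. The difference can be sketched in plain Python without Spark. The `lines` sample below is hypothetical, and the two list expressions are only analogues of the RDD operations, not the PySpark implementation.

```python
from itertools import chain

# Hypothetical sample, including an empty line like the blank lines in the file.
lines = ["William Shakespeare", "", "Title: The Complete Works"]

# map analogue: one output element per input line, so the result is a list of lists.
mapped = [line.split() for line in lines]

# flatMap analogue: the per-line lists are concatenated into one flat list.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)
print(flat_mapped)
print(len(mapped), len(flat_mapped))
```

With the `map` analogue the length equals the number of lines (3 here, including an empty list for the blank line), while with the `flatMap` analogue it equals the number of words (6 here). Counting words therefore requires the flattened form.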


In [10]:
baseRDD.map(lambda x: x.split()).collect()

[['The',
  'Project',
  'Gutenberg',
  'EBook',
  'of',
  'The',
  'Complete',
  'Works',
  'of',
  'William',
  'Shakespeare,',
  'by'],
 ['William', 'Shakespeare'],
 [],
 ['This',
  'eBook',
  'is',
  'for',
  'the',
  'use',
  'of',
  'anyone',
  'anywhere',
  'at',
  'no',
  'cost',
  'and',
  'with'],
 ['almost',
  'no',
  'restrictions',
  'whatsoever.',
  'You',
  'may',
  'copy',
  'it,',
  'give',
  'it',
  'away',
  'or'],
 ['re-use',
  'it',
  'under',
  'the',
  'terms',
  'of',
  'the',
  'Project',
  'Gutenberg',
  'License',
  'included'],
 ['with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org'],
 [],
 ['**',
  'This',
  'is',
  'a',
  'COPYRIGHTED',
  'Project',
  'Gutenberg',
  'eBook,',
  'Details',
  'Below',
  '**'],
 ['**',
  'Please',
  'follow',
  'the',
  'copyright',
  'guidelines',
  'in',
  'this',
  'file.',
  '**'],
 [],
 ['Title:', 'The', 'Complete', 'Works', 'of', 'William', 'Shakespeare'],
 [],
 ['Author:', 'William', 'Shakespeare'],
 [],
 [

In [12]:
baseRDD.flatMap(lambda x: x.split()).collect()

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Complete',
 'Works',
 'of',
 'William',
 'Shakespeare,',
 'by',
 'William',
 'Shakespeare',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever.',
 'You',
 'may',
 'copy',
 'it,',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'eBook',
 'or',
 'online',
 'at',
 'www.gutenberg.org',
 '**',
 'This',
 'is',
 'a',
 'COPYRIGHTED',
 'Project',
 'Gutenberg',
 'eBook,',
 'Details',
 'Below',
 '**',
 '**',
 'Please',
 'follow',
 'the',
 'copyright',
 'guidelines',
 'in',
 'this',
 'file.',
 '**',
 'Title:',
 'The',
 'Complete',
 'Works',
 'of',
 'William',
 'Shakespeare',
 'Author:',
 'William',
 'Shakespeare',
 'Posting',
 'Date:',
 'September',
 '1,',
 '2011',
 '[EBook',
 '#100]',
 'Release',
 'Date:',
 