# Remove stop words and reduce the dataset

- After splitting the lines of the file into a long list of words with the `flatMap()` transformation, the next step is to remove stop words from your data. Stop words are common words that usually carry little meaning for analysis, for example "I", "the", and "a". You could remove many obvious stop words with a list of your own, but for this exercise you will use the curated list `stop_words` provided in your environment.

- After removing stop words, you'll create a pair RDD in which each element is a tuple `(k, v)`, where `k` is the key and `v` is the value. In this example, the pair RDD is composed of `(w, 1)` tuples, where `w` is a word in the RDD and `1` is its initial count. Finally, you'll combine the values that share the same key using the `reduceByKey()` operation.

- Remember that a `SparkContext` `sc` and `splitRDD` are already available in your workspace.
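
The three steps described above (filter, pair, reduce by key) can be sketched in plain Python without Spark. This is only an illustration of the semantics, not how Spark executes the job: the `words` list and `sample_stop_words` set are made-up sample data, and the dictionary loop stands in for `reduceByKey(lambda x, y: x + y)`.

```python
from collections import defaultdict

# Hypothetical sample data; the real exercise uses splitRDD and the full stop_words list.
words = ["The", "quick", "fox", "and", "the", "lazy", "fox"]
sample_stop_words = {"the", "and"}

# filter(): keep only words whose lower-case form is not a stop word
kept = [w for w in words if w.lower() not in sample_stop_words]

# map(): pair each remaining word with the number 1
pairs = [(w, 1) for w in kept]

# reduceByKey(lambda x, y: x + y): add up the values that share a key
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value

print(sorted(counts.items()))
# [('fox', 2), ('lazy', 1), ('quick', 1)]
```

In Spark the same additions happen per key across partitions, which is why the function passed to `reduceByKey()` should be associative and commutative, as addition is.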


## Instructions
- Convert the words in `splitRDD` to lower case and then remove the stop words listed in `stop_words`.
- Create a pair RDD of tuples, each containing a word and the number 1, from the word elements in `splitRDD`.
- Count the number of occurrences of each word (word frequency) in the pair RDD using `reduceByKey()`.

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

In [3]:
stop_words = ['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 'can',
 'will',
 'just',
 'don',
 'should',
 'now']


In [13]:
file_path = "file:///home/talentum/test-jupyter/P2/M2/SM4/Dataset/Complete_Shakespeare.txt"

# Create a baseRDD from the file path
baseRDD = sc.textFile(file_path)

# Split the lines of baseRDD into words
splitRDD = baseRDD.flatMap(lambda x: x.split(' '))

# Remove stop words; the comparison uses the lower-case form, but each kept word retains its original casing
splitRDD_no_stop = splitRDD.filter(lambda x: x.lower() not in stop_words)

# Create a tuple of the word and 1 
splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (w, 1))

# Count the number of occurrences of each word
resultRDD = splitRDD_no_stop_words.reduceByKey(lambda x, y: x + y)
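
The sorted result printed further down starts with `('', 65498)`, meaning the empty string is counted as a word. A likely cause is that `split(' ')` returns empty strings for blank lines and for runs of consecutive spaces. The plain-Python sketch below shows the difference between `split(' ')` and `split()`; the sample `line` is made up for illustration.

```python
line = "To be,  or not"   # note the two spaces after "be,"

print(line.split(' '))    # ['To', 'be,', '', 'or', 'not']
print(line.split())       # ['To', 'be,', 'or', 'not']

# A blank line behaves differently under the two calls as well
print("".split(' '))      # ['']
print("".split())         # []
```

If the empty-string entry is unwanted, one option (an assumption, not part of the original exercise) is to split with `x.split()` or to add a condition such as `x != ''` to the filter step.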


In [17]:
resultRDD.sortByKey().collect()

[('', 65498),
 ('"', 77),
 ('"AS-IS".', 1),
 ('"Defect"', 1),
 ('"Pro-', 1),
 ('"Right', 1),
 ('"Small', 2),
 ('"Tis', 2),
 ('"never."', 1),
 ('"not"', 1),
 ('"small', 1),
 ('"then"', 1),
 ('#100]', 1),
 ('&c.', 5),
 ("''Tis", 1),
 ("'A", 4),
 ("'After", 1),
 ("'Agrippa,", 1),
 ("'Ah", 1),
 ("'Among", 1),
 ("'Antony!", 1),
 ("'Antony'", 1),
 ("'Art", 1),
 ("'Ay,'", 1),
 ("'Be", 1),
 ("'Before", 2),
 ("'Beware", 1),
 ("'But", 1),
 ("'Caesar'-", 1),
 ("'Call", 1),
 ("'Came", 1),
 ("'Come", 1),
 ("'Death!'", 1),
 ("'Demand", 1),
 ("'Dian,", 1),
 ("'Do", 1),
 ("'Fine!'-", 1),
 ("'First", 1),
 ("'Five", 1),
 ("'Fly", 1),
 ("'Fool", 1),
 ("'Fore", 2),
 ("'From", 1),
 ("'Gainst", 1),
 ("'God", 4),
 ("'Good", 3),
 ("'Had", 1),
 ("'Hang", 1),
 ("'Have", 1),
 ("'He", 1),
 ("'Ho!'", 1),
 ("'Holla'", 1),
 ("'I", 14),
 ("'If", 2),
 ("'It", 2),
 ("'Let", 1),
 ("'Look,", 1),
 ("'Make", 1),
 ("'Many", 1),
 ("'Marcius,", 1),
 ("'My", 5),
 ("'No", 1),
 ("'No'", 1),
 ("'No,", 1),
 ("'Now", 1),
 ("'O", 3)