# Word Count Problem: All Together

So far we have worked through the MapReduce framework in PySpark one operation at a time.

Now we are going to combine those operations into a single word-count pipeline.

In [None]:
import pyspark
from pyspark.sql import SparkSession

# Start a local Spark session with 4 worker threads and get its SparkContext
ss = SparkSession.builder.master("local[4]").appName("FlatMap-ReduceByKey").getOrCreate()
sc = ss.sparkContext



lines = [
    "word count from Wikipedia the free encyclopedia",
    "the word count is the number of words in a document or passage of text Word counting may be needed when a text",
    "is required to stay within certain numbers of words This may particularly be the case in academia legal",
    "proceedings journalism and advertising Word count is commonly used by translators to determine the price for"
]

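# Create an RDD from the list using parallelize
rdd = sc.parallelize(lines)
# Split each line on single spaces and flatten the results into one RDD of words
rddFlatMap = rdd.flatMap(lambda line: line.split(" "))
# Pair each word with an initial count of 1
rddMap = rddFlatMap.map(lambda word: (word, 1))
# Sum the counts of identical words
rddReduced = rddMap.reduceByKey(lambda x, y: x + y)
# Return the (word, count) pairs to the driver
rddReduced.collect()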
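
To see what each stage of the pipeline above contributes, it can help to mirror the three steps in plain Python. The sketch below is only a stand-in for illustration, not Spark code: it uses `itertools.chain` and a `defaultdict` in place of `flatMap` and `reduceByKey`, and the two shortened sample lines are assumptions chosen to keep the result easy to check by hand.

```python
from collections import defaultdict
from itertools import chain

# Shortened, hypothetical sample lines (not the notebook's full input)
lines = [
    "the word count is the number of words",
    "word count is commonly used",
]

# flatMap: split every line and flatten the pieces into one sequence of words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair each word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: merge the values of identical keys with x + y
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(dict(counts))
```

The difference in Spark is that the same steps run in parallel over the partitions of the RDD, and nothing is computed until an action such as `collect()` is called.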





## Exercise 1

Read the input from the data.txt file and count the occurrences of each word.

In [1]:
import pyspark
from pyspark.sql import SparkSession

# Create a new Spark session
ss = SparkSession.builder.master("local[4]").appName("WordCount").getOrCreate()
sc = ss.sparkContext

# Read the content of the input file
file_path = "textData.txt"  # You can specify the path to the file here
rdd = sc.textFile(file_path)

# Use flatMap to split the lines into words
rddFlatMap = rdd.flatMap(lambda line: line.split())

# Map each word to a tuple (word, 1)
rddMap = rddFlatMap.map(lambda word: (word, 1))

# Use reduceByKey to count the occurrences of each word
rddReduced = rddMap.reduceByKey(lambda x, y: x + y)

# Collect the result and print it
result = rddReduced.collect()

for word, count in result:
    print(f"Word: {word}, Count: {count}")


Word: Alice’s, Count: 18
Word: by, Count: 18
Word: Lewis, Count: 18
Word: Carroll, Count: 18
Word: eBook, Count: 27
Word: for, Count: 27
Word: use, Count: 27
Word: of, Count: 27
Word: anyone, Count: 27
Word: at, Count: 27
Word: no, Count: 27
Word: cost, Count: 27
Word: and, Count: 27
Word: with, Count: 27
Word: Project, Count: 9
Word: Gutenberg’s, Count: 9
Word: Adventures, Count: 18
Word: in, Count: 18
Word: Wonderland, Count: 18
Word: This, Count: 27
Word: is, Count: 27
Word: the, Count: 27
Word: anywhere, Count: 27
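
Note that the first cell splits with `line.split(" ")` while this exercise uses `line.split()`. The two are not equivalent on real text files, which often contain tabs, repeated spaces, or trailing spaces. The following plain-Python sketch shows the difference on a made-up line (the sample string is an assumption for illustration only):

```python
# Hypothetical line with a double space, a tab, and a trailing space
line = "word  count from\tWikipedia "

# Explicit single-space separator: repeated and trailing spaces produce
# empty strings, and the tab is not treated as a separator
by_space = line.split(" ")

# No argument: split on any run of whitespace and drop empty strings
by_whitespace = line.split()

print(by_space)
print(by_whitespace)
```

With `split(" ")`, the empty strings would be counted as a "word" of their own by `reduceByKey`, so `split()` is usually the safer choice when reading from a file.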


## Exercise 2

Load testData.txt into an RDD and add all the values together.

In [3]:
import pyspark
from pyspark.sql import SparkSession

# Create a new Spark session
ss = SparkSession.builder.master("local[4]").appName("SumValues").getOrCreate()
sc = ss.sparkContext

# Read the content of the input file
file_path = "testDataM.txt"  # You can specify the path to the file here
rdd = sc.textFile(file_path)

# Split each line into tokens and convert the string values into integers
rdd_int = rdd.flatMap(lambda line: line.split()).map(int)

# Sum all the values
total_sum = rdd_int.reduce(lambda x, y: x + y)

print(f"Total Sum of Values: {total_sum}")

Total Sum of Values: 680
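
The `reduce` step above folds the values pairwise with `x + y`. A minimal plain-Python sketch of the same idea, using `functools.reduce` as a stand-in for the RDD method, is shown below. The sample lines are hypothetical and are not the contents of the notebook's input file.

```python
from functools import reduce

# Hypothetical file content: whitespace-separated integers on several lines
lines = ["10 20 30", "40", "5 15"]

# flatMap + map(int): flatten the tokens and convert each one to an integer
values = [int(token) for line in lines for token in line.split()]

# reduce: fold the values pairwise with x + y
total_sum = reduce(lambda x, y: x + y, values)

print(f"Total Sum of Values: {total_sum}")
```

For a plain sum in PySpark, `rdd_int.sum()` should give the same total as the explicit `reduce`, and it reads more directly.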
