**AIM**: Count how often each word occurs in a given text file using Apache Spark. Word count is a fundamental example of distributed data processing: the input is split across the cluster, processed in parallel, and the partial counts are combined.

**STEPS:**

In [None]:
from pyspark.sql import SparkSession

1. Initialize SparkSession

Purpose: Creates a SparkSession, which is the entry point for interacting with
Spark functionality.

appName: Specifies the name of the application ("WordCount").

In [None]:
# Initialize SparkSession
spark = SparkSession.builder.appName("WordCount").getOrCreate()

2. Load the Text File


*   Purpose: Reads the input text file into an RDD (Resilient Distributed Dataset).
*   RDD: A distributed collection of elements that can be processed in parallel across the cluster.
*   textFile: Reads the file and returns an RDD of strings, with one element per line of the file.

In [None]:
# Load the text file into an RDD
input_file = "/content/hi.txt"  # Replace with your file path
text_rdd = spark.sparkContext.textFile(input_file)

3. Split Lines into Words


*   Purpose: Breaks each line of text into individual words.
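*   flatMap: Applies `line.split()` to each line and flattens the resulting lists into a single RDD of words. `split()` with no arguments splits on runs of whitespace only, so punctuation stays attached to the words.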
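
To see why `flatMap` is used here rather than `map`, the difference can be mimicked in plain Python without Spark. This is only a sketch of the semantics, not of how Spark executes the step, and the sample `lines` are made up for illustration.

```python
from itertools import chain

# Hypothetical sample lines standing in for the elements of text_rdd
lines = ["to be or", "not to be"]

# map-like: one output element per input line, so the result is a list of lists
mapped = [line.split() for line in lines]

# flatMap-like: the per-line lists are flattened into one sequence of words
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)       # [['to', 'be', 'or'], ['not', 'to', 'be']]
print(flat_mapped)  # ['to', 'be', 'or', 'not', 'to', 'be']
```

With `map`, the next step would receive whole lists instead of individual words, so the counts would be per line rather than per word.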

4. Map Words to Key-Value Pairs


*   Purpose: Maps each word to a tuple (word, 1).
*   Key-Value Pair: Enables grouping by word to perform aggregation later.

5. Count Word Frequencies


*   Purpose: Aggregates the values (counts) for each unique key (word).
*   reduceByKey: Combines the counts for each word by summing the 1s.
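
Steps 4 and 5 can be modelled in plain Python to show what `reduceByKey` computes. Spark combines values inside each partition first and then merges the partial results, which is why the function passed to it should be associative and commutative. The sketch below assumes two made-up partitions, and `reduce_by_key` is a hypothetical helper, not a Spark API.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func (a local stand-in for reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Hypothetical partitions: each inner list is the slice of words held by one worker
partitions = [["to", "be", "or"], ["not", "to", "be"]]
add = lambda a, b: a + b

# Step 4 maps every word to (word, 1); step 5 first combines inside each partition
partial = [reduce_by_key([(w, 1) for w in part], add) for part in partitions]

# Step 5 then merges the per-partition results with the same function
all_pairs = [pair for counts in partial for pair in counts.items()]
totals = reduce_by_key(all_pairs, add)

print(totals)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because addition gives the same total regardless of grouping or order, the per-partition combine does not change the final counts.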

In [None]:
# Transformations to perform word count
word_counts = (
    text_rdd.flatMap(lambda line: line.split())  # Split lines into words
    .map(lambda word: (word, 1))  # Map each word to (word, 1)
    .reduceByKey(lambda a, b: a + b)  # Reduce by key to count occurrences
)

6. Collect Results

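*   Purpose: Gathers the results from the distributed RDD to the driver node.
*   collect: An action that triggers execution of the lazy transformations defined above and returns all (word, count) pairs as a local Python list. The full result must fit in driver memory, so it suits small outputs like this one.
*   Print Loop: Iterates through the list and prints each word and its count. The pairs are not returned in sorted order.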
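
Because `collect` brings every pair to the driver, a large vocabulary is often reduced to a small ordered subset instead. In PySpark, something like `word_counts.takeOrdered(3, key=lambda kv: -kv[1])` is one option. The plain Python sketch below shows the same selection on a hypothetical list of pairs, with an alphabetical tie-break added as an assumption.

```python
import heapq

# Hypothetical (word, count) pairs, shaped like the list returned by collect()
pairs = [("we", 4), ("like", 3), ("curd", 1), ("the", 4), ("give", 3)]

# Keep only the 3 most frequent words; ties are broken alphabetically
top3 = heapq.nsmallest(3, pairs, key=lambda kv: (-kv[1], kv[0]))

for word, count in top3:
    print(f"{word}: {count}")
```

For these sample pairs the loop prints `the: 4`, `we: 4`, and `give: 3`.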



In [None]:
# Collect and print results
for word, count in word_counts.collect():
    print(f"{word}: {count}")

And: 2
we: 4
want,: 1
is: 2
feel: 1
like: 3
are: 2
beggars.: 1
when: 2
i: 1
use: 1
serve: 1
curd: 1
sundays),: 1
only: 2
curd,: 1
give: 3
accordingly.: 1
stand: 1
cauliflower: 1
anything: 1
wish.: 1
Do: 1
have: 1
question: 1
too?: 1
Then: 1
queue: 1
there,: 1
waiting: 1
behind,: 1
attitude.: 1
attitude: 1
for: 2
selecting: 1
the: 4
chicken,: 1
its: 2
our: 1
choice: 1
which: 1
piece: 2
asking: 1
a: 1
better: 1
made: 1
to: 4
some: 3
Personally: 1
(on: 1
people: 3
prefer: 1
need: 1
onion: 1
and: 3
so: 1
on.: 1
I: 1
it: 2
fried: 1
rice,: 1
pick: 1
according: 1
their: 1
also: 1
how: 1
you: 1
guys: 1
statements: 1
with: 2
So: 1
want: 2
keep: 1
your: 1
yourself.: 1


7. Stop the SparkSession

In [None]:
# Stop SparkSession
spark.stop()

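**RESULT**: The program prints every distinct whitespace-separated token in the file together with its count. Because no normalization is applied, case and punctuation variants are counted separately (for example `and`/`And`, `curd`/`curd,`, `want`/`want,`). The `flatMap`, `map`, and `reduceByKey` steps run in parallel across partitions, so the same code scales to larger inputs, although `collect()` should then be replaced by an action such as `take` or `saveAsTextFile` that does not bring the full result to the driver.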
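
One way to merge variants such as `curd` and `curd,` in the output above is to normalize each line before counting. The sketch below is a possible tokenizer, not part of the original notebook: `tokenize` is a hypothetical helper, the regular expression is one of many reasonable choices, and the sample sentence is made up.

```python
import re

def tokenize(line):
    """Lowercase a line and return its runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", line.lower())

print(tokenize("And we serve curd, only curd (on Sundays)."))
# ['and', 'we', 'serve', 'curd', 'only', 'curd', 'on', 'sundays']
```

In the pipeline, passing this function as `text_rdd.flatMap(tokenize)` in place of the `line.split()` lambda would apply the same normalization to every line.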