<a href="https://colab.research.google.com/github/Manya123-max/Big-Data-Framework/blob/main/BDF5_WORD_COUNT_USING_SPARK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Aim:**
The primary aim of this code is to perform a word count analysis on a text file using Apache Spark.

**Step 1:** Import Libraries

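The necessary PySpark libraries are imported to create a Spark session and perform DataFrame transformations. Specifically, SparkSession is used to create the Spark application, while the functions explode, split, and col are used for text preprocessing and transformations.

In [None]:
# Import Libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

**Step 2:** Create the SparkSession

A SparkSession named "Word Count Example" is created to run the Spark application.
The master setting local[*] runs Spark locally using all available CPU cores, and getOrCreate() reuses an existing session if one is already running.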

In [None]:
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Word Count Example") \
    .master("local[*]") \
    .getOrCreate()

**Step 3:** Load Text Data

The text file is loaded into a Spark DataFrame where each line is represented as a single record under the column "value."

Spark reads the file in partitions that can be processed in parallel, so the same call scales to larger files.

In [None]:
# Load Text Data
# Path to the input text file; change this to the location of your own file
file_path = "/content/NLP_adventure_of_sherlock_holmes.txt"
text_df = spark.read.text(file_path)

**Step 4:** Display the Original Text

Displays the first five rows of the text file for preview. truncate=False ensures that long lines are not truncated in the output.

In [None]:
# Show the Loaded Text
print("Original Text Data:")
text_df.show(5, truncate=False)

Original Text Data:
+-----------------------+
|value                  |
+-----------------------+
|                       |
|I. A SCANDAL IN BOHEMIA|
|                       |
|                       |
|I.                     |
+-----------------------+
only showing top 5 rows



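**Step 5:** Split Text into Words

The split function splits each line into words using the regular expression \\s+, which matches one or more whitespace characters.

The explode function flattens the resulting list of words into individual rows, one word per row.
Each word is given the alias "word" for readability.

Blank lines and lines that begin with whitespace produce empty-string tokens, which appear as a blank word in the counts. Punctuation and letter case are left unchanged, so "The" and "the" are counted separately.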

In [None]:
# Split Text into Words
words_df = text_df.select(explode(split(col("value"), "\\s+")).alias("word"))
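
To see what split followed by explode produces, the following pure-Python sketch mimics the two steps on three sample lines. It uses re.split as a stand-in for Spark's split, which is an assumption made for illustration; the sample lines and the helper name split_and_explode are hypothetical and not part of the notebook. It suggests why a blank line or leading whitespace ends up as an empty token.

```python
import re

def split_and_explode(lines):
    """Mimic splitting each line on whitespace, then flattening to one token per item."""
    words = []
    for line in lines:
        # re.split keeps empty strings at the edges: "" -> [""] and "  To" -> ["", "To"]
        words.extend(re.split(r"\s+", line))
    return words

# Hypothetical sample lines: a title, a blank line, and an indented line
sample_lines = ["I. A SCANDAL IN BOHEMIA", "", "  To Sherlock Holmes"]
tokens = split_and_explode(sample_lines)
print(tokens)

# Dropping empty tokens before counting would remove the blank entry
non_empty = [t for t in tokens if t != ""]
print(non_empty)
```

In the Spark pipeline, a comparable cleanup would be a filter on the word column before grouping; that step is not part of the original notebook.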

**Step 6:** Count Words

The groupBy function groups the data by each unique word.

The count function computes the frequency of each word.

The resulting DataFrame is ordered in descending order of word count using orderBy.


In [None]:
# Count Words
word_count_df = words_df.groupBy("word").count().orderBy(col("count").desc())
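
As a small local reference for what the grouping, counting, and ordering steps compute, the sketch below does the same thing on an in-memory list with collections.Counter. It is not Spark code; the helper name count_words and the sample words are hypothetical, and it is only meant for sanity-checking results on a small sample.

```python
from collections import Counter

def count_words(words):
    """Return (word, count) pairs sorted by descending count."""
    counts = Counter(words)          # plays the role of groupBy("word").count()
    return counts.most_common()      # plays the role of orderBy(col("count").desc())

# Hypothetical sample; counting is case-sensitive, as in the Spark version
sample_words = ["the", "a", "the", "of", "the", "a"]
result = count_words(sample_words)
print(result)
```

One difference to keep in mind: Counter.most_common keeps tied words in first-seen order, whereas Spark does not guarantee any particular order among words with equal counts.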

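**Step 7:** Display Word Counts

Displays the top 20 words along with their counts for analysis. The blank entry in the second row of the output is the empty-string token produced by blank lines and leading whitespace during splitting.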

In [None]:
# Show Word Counts
print("Word Counts:")
word_count_df.show(20)

Word Counts:
+----+-----+
|word|count|
+----+-----+
| the| 2144|
|    | 1154|
| and| 1135|
|   a| 1131|
|  of| 1114|
|  to| 1064|
|   I| 1044|
|  in|  669|
|that|  578|
| was|  535|
| his|  436|
|  is|  422|
|  my|  402|
|  it|  399|
| you|  398|
|  he|  378|
|have|  357|
| had|  328|
|with|  320|
|  as|  317|
+----+-----+
only showing top 20 rows



**Step 8:** Save Word Counts to Disk

The word counts are saved in CSV format to the specified directory.

mode("overwrite") ensures that any existing data in the output directory is replaced.

header=True adds column names to the CSV file.

In [None]:
# Save Word Counts to Disk
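# Replace 'path/to/output/directory' with the actual output directory
output_path = "path/to/output/directory"
# Overwrite any existing output and write the column names as a header row
word_count_df.write.mode("overwrite").csv(output_path, header=True)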
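
Spark treats the output path as a directory and writes one or more part files into it rather than a single CSV file. The sketch below shows one way to read those files back with the standard library. It assumes the part files match the pattern part-*.csv and that each starts with a header row (as header=True produces); the helper name read_word_counts is hypothetical, and the demo uses two hand-written files in a temporary directory instead of real Spark output.

```python
import csv
import glob
import os
import tempfile

def read_word_counts(output_dir):
    """Collect (word, count) rows from every part-*.csv file in a directory."""
    rows = []
    for part_file in sorted(glob.glob(os.path.join(output_dir, "part-*.csv"))):
        with open(part_file, newline="", encoding="utf-8") as handle:
            # The header row in each part file supplies the dictionary keys
            for record in csv.DictReader(handle):
                rows.append((record["word"], int(record["count"])))
    return rows

# Demo: two hand-written part files standing in for Spark's output directory
with tempfile.TemporaryDirectory() as demo_dir:
    demo_files = [
        ("part-00000.csv", ["word,count", "the,3"]),
        ("part-00001.csv", ["word,count", "a,2"]),
    ]
    for name, lines in demo_files:
        path = os.path.join(demo_dir, name)
        with open(path, "w", newline="", encoding="utf-8") as handle:
            handle.write("\n".join(lines) + "\n")
    merged = read_word_counts(demo_dir)

print(merged)
```

Files are read in sorted name order, which for a sorted Spark result usually follows the partition order, though that is worth verifying on real output.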


**Step 9:** Stop the SparkSession

Stops the Spark application to free up resources.

In [None]:
# Stop the SparkSession
spark.stop()

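**RESULT:** The notebook counts word frequencies in the text file with Spark's DataFrame API and saves them in CSV format. Here it runs in local mode (local[*]), so the work is parallelized across the local CPU cores; the same code can run on a cluster, where Spark distributes the computation across nodes.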