# Map Reduce with Spark - Individual Assignment

### The code below uses map reduce in Spark to produce a word count. Your task is to copy and modify the code to generate a similar analysis of a book of your choice.

 * Start by picking a book and finding its plain-text (.txt) file on Project Gutenberg: https://www.gutenberg.org/
 * Ensure the starter code works. The starter code finds and lists the _most frequent single words_ that appear in the book.
 * Create three different map reduce operations:
   * (1) One task that finds and lists the _longest single words_ used in the book. For example, the word "three" has a length of 5.
   * (2) Another task that finds and lists the _most frequent bigrams_. A bigram is a pair of words that appear next to each other. For example, the phrase "one two three four" has the bigrams "one two", "two three", and "three four".
   * (3) Another task that finds and lists a _customized statistic_. Pick any other text-counting statistic that you want to compute with map reduce. Explain it and then implement it in this notebook.
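
For orientation, the word-count pattern that the starter code is described as using can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the sample lines and the helper name `word_count` are made up here, and a `Counter` stands in for what Spark's `reduceByKey` would do across partitions.

```python
import re
from collections import Counter

def word_count(lines):
    """Hypothetical helper: count words in the map-reduce style."""
    # "flatMap": strip punctuation, lowercase, and split every line into words
    words = [w for line in lines
             for w in re.sub(r"[^a-zA-Z0-9\s]", "", line).lower().split()]
    # "map": pair every word with the count 1
    pairs = [(w, 1) for w in words]
    # "reduce by key": add up the 1s for each distinct word
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return counts

sample_lines = ["One two three four.", "Two three, two!"]  # made-up sample text
counts = word_count(sample_lines)
print(counts.most_common(2))
```

The same three steps (split, pair with 1, sum per key) are what each of the three tasks adapts: only the key and the value being reduced change.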

In [None]:

import findspark
findspark.init()
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
import re

# Initialize SparkSession
spark = SparkSession.builder.master("local").appName("MapReduceOperations").getOrCreate()
sc = spark.sparkContext

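# Load your book text from Project Gutenberg.
# sc.textFile() targets Hadoop-compatible file systems (local disk, HDFS, ...);
# reading straight from an https URL fails on many Spark setups, so download
# the text on the driver first and turn its lines into an RDD.
import urllib.request
book_url = "https://www.gutenberg.org/cache/epub/228/pg228.txt"
with urllib.request.urlopen(book_url) as response:
    book_lines = response.read().decode("utf-8-sig").splitlines()
text_file = sc.parallelize(book_lines)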

# Preprocessing: clean the text by removing non-alphanumeric characters
def clean_text(line):
    return re.sub(r'[^a-zA-Z0-9\s]', '', line).lower().split()

cleaned_text = text_file.flatMap(clean_text)

# Task 1: Find the longest single words
def find_longest_words(cleaned_text):
    # Map each word to its length
    word_lengths = cleaned_text.map(lambda word: (word, len(word)))
    # Find the maximum length
    max_length = word_lengths.map(lambda x: x[1]).reduce(lambda a, b: max(a, b))
    # Filter words that have the maximum length
    longest_words = word_lengths.filter(lambda x: x[1] == max_length).map(lambda x: x[0]).distinct().collect()
    return longest_words

# Task 2: Find the most frequent bigrams
def find_most_frequent_bigrams(cleaned_text, top_n=10):
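    # Index every word, then join each word (keyed by its index i) with the
    # following word (re-keyed from i + 1 to i) to get (word, next_word) pairs.
    # The words form one flat stream, so bigrams also span line boundaries.
    indexed = cleaned_text.zipWithIndex().map(lambda x: (x[1], x[0]))
    next_words = indexed.map(lambda x: (x[0] - 1, x[1]))
    bigrams = indexed.join(next_words).values()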
    # Count the frequency of each bigram
    bigram_counts = bigrams.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
    # Get the top N bigrams by frequency
    return bigram_counts.takeOrdered(top_n, key=lambda x: -x[1])

# Task 3: Custom Statistic - Find the average word length
def average_word_length(cleaned_text):
    # Map each word to its length
    word_lengths = cleaned_text.map(lambda word: len(word))
    # Calculate the total length and the number of words
    total_length = word_lengths.reduce(lambda a, b: a + b)
    word_count = cleaned_text.count()
    # Calculate the average word length
    average_length = total_length / word_count
    return average_length

# Execution
longest_words = find_longest_words(cleaned_text)
print("Longest Words:", longest_words)

most_frequent_bigrams = find_most_frequent_bigrams(cleaned_text)
print("Most Frequent Bigrams:")
for bigram, count in most_frequent_bigrams:
    print(f"{bigram[0]} {bigram[1]}: {count}")

average_length = average_word_length(cleaned_text)
print(f"Average Word Length: {average_length}")

# Stop SparkSession
spark.stop()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
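# Create a bar plot using seaborn.
# NOTE: `top_n_df` (a pandas DataFrame with 'word' and 'count' columns) and
# `top_n` are not defined in this notebook; they presumably come from the
# starter word-count cell, which must run before this one.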
plt.figure(figsize=(10, 20))  # Adjust width and height as needed
sns.barplot(x='count', y='word', data=top_n_df, orient='h')
plt.title(f'Top {top_n} Most Frequent Words')
plt.xlabel('Frequency')
plt.ylabel('Word')
plt.show()

## Task 1: Longest Single Words 
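
A minimal plain-Python sketch of the idea behind `find_longest_words`, assuming a small made-up word list in place of the book. It follows the same three steps as the Spark version (map each word to its length, reduce to the maximum, filter on that maximum), with `functools.reduce` standing in for the RDD `reduce`.

```python
from functools import reduce

words = ["one", "two", "three", "four", "seven", "three"]  # made-up sample

# Map: pair each word with its length
word_lengths = [(w, len(w)) for w in words]

# Reduce: keep the larger of two lengths until one value is left
max_length = reduce(max, (length for _, length in word_lengths))

# Filter: keep the distinct words whose length equals the maximum
longest_words = sorted({w for w, length in word_lengths if length == max_length})

print(max_length, longest_words)
```

Several words can tie for the maximum length, which is why the result is a list rather than a single word.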

## Task 2: Most Frequent Bigrams
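
As a quick check of the bigram definition, the assignment's own example phrase can be run through a plain-Python sketch. The helper name `make_bigrams` is hypothetical, and `zip` over the list and its one-step shift plays the role that the index join plays in the Spark code.

```python
from collections import Counter

def make_bigrams(words):
    """Hypothetical helper: pair every word with the word that follows it."""
    return list(zip(words, words[1:]))

words = "one two three four".split()
pairs = make_bigrams(words)

# Count how often each bigram occurs (the "reduce by key" step)
bigram_counts = Counter(pairs)

for (first, second), count in bigram_counts.most_common():
    print(f"{first} {second}: {count}")
```

A phrase of n words always yields n - 1 bigrams, which is a handy sanity check on the Spark result as well.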

## Task 3: Customized Statistic
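
The custom statistic implemented above is the average word length. One possible variation, sketched here in plain Python on a made-up word list, computes the total length and the word count in a single reduce by carrying `(length, 1)` pairs, instead of the two separate passes (`reduce` plus `count`) used in `average_word_length`. This is an alternative design, not the notebook's method.

```python
from functools import reduce

words = ["one", "two", "three", "four"]  # made-up sample

# Map: turn each word into a (length, 1) pair
pairs = [(len(w), 1) for w in words]

# Reduce: add the pairs component-wise to get (total_length, word_count)
total_length, word_count = reduce(
    lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs, (0, 0)
)

# Guard against an empty input to avoid dividing by zero
average_length = total_length / word_count if word_count else 0.0

print(f"Average Word Length: {average_length}")
```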