## Initializing Spark Context

This code cell initializes a Spark Context for running Spark jobs.

- **Setting Python Executables**: `os.environ['PYSPARK_PYTHON']` and `os.environ['PYSPARK_DRIVER_PYTHON']` set the Python executable paths for Spark.
- **Creating Spark Context**: `ps.SparkContext('local[*]')` creates a Spark Context using all available CPU cores on the local machine.
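- **Spark Context Creation Message**: `"Just created a SparkContext"` is printed once the Spark Context has been created.
- **Handling an Existing Context**: If a Spark Context is already running, `ps.SparkContext` raises a `ValueError`, which is caught and reported as a warning instead of stopping the notebook.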

Initializing the Spark Context is the first step in setting up a Spark environment for distributed data processing and analysis.


In [None]:
#import findspark
#findspark.init()
import os
import sys
import pyspark as ps
import warnings
from pyspark.sql import SQLContext
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
try:
  # create SparkContext on all CPUs available: in my case I have 4 CPUs on my laptop
  sc = ps.SparkContext('local[*]')
  #sqlContext = SQLContext(sc)
  print("Just created a SparkContext")
except ValueError:
  warnings.warn("SparkContext already exists in this scope")

## Running Unit Tests with PySpark

This code cell defines and runs unit tests using the `unittest` framework with PySpark.

- **Defining Test Case**: `TestRdd` class is defined inheriting from `unittest.TestCase`, containing a test method `test_take`.
- **Test Method**: `test_take` tests the `take` method on an RDD created from a parallelized collection.
- **Running Tests**: `run_tests()` function loads the test suite from the `TestRdd` class and runs it using `unittest.TextTestRunner`.

Unit testing with PySpark allows for validating the behavior of Spark operations and transformations, ensuring correctness and reliability of Spark code.
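
The runner pattern described above can be tried without a Spark cluster. The sketch below is a minimal stand-in, assuming a hypothetical `FakeRdd` class that mimics only `take`. It also keeps the `TestResult` object returned by `TextTestRunner.run`, which a notebook can inspect to check whether the suite passed.

```python
import io
import unittest


class FakeRdd:
    """Hypothetical stand-in for an RDD; it only mimics take()."""

    def __init__(self, items):
        self._items = list(items)

    def take(self, n):
        return self._items[:n]


class TestTake(unittest.TestCase):
    def test_take(self):
        rdd = FakeRdd([1, 2, 3, 4])
        self.assertEqual([1, 2, 3, 4], rdd.take(4))
        self.assertEqual([1, 2], rdd.take(2))


def run_tests(stream):
    suite = unittest.TestLoader().loadTestsFromTestCase(TestTake)
    # run() returns a TestResult, which records failures and errors
    return unittest.TextTestRunner(verbosity=1, stream=stream).run(suite)


report = io.StringIO()  # collect the runner's text output in memory
result = run_tests(report)
print(result.testsRun, result.wasSuccessful())
```

With a real Spark Context, `FakeRdd([1, 2, 3, 4])` would be replaced by `sc.parallelize([1, 2, 3, 4])`.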


In [None]:
import unittest
import sys

class TestRdd(unittest.TestCase):
  def test_take(self):
    # Avoid the name "input", which would shadow the Python built-in
    rdd = sc.parallelize([1,2,3,4])
    self.assertEqual([1,2,3,4], rdd.take(4))


def run_tests():
  suite = unittest.TestLoader().loadTestsFromTestCase( TestRdd )
  unittest.TextTestRunner(verbosity=1,stream=sys.stderr).run( suite )

run_tests()


Python's built-in `help()` function prints the documentation for an object or module. Calling `help(sc)` in a notebook or Python shell where PySpark is initialized displays the documentation for the Spark Context object bound to `sc`, including its available methods.

In [None]:
help(sc)

## Data Validation and Processing with PySpark

This code cell defines a function `validate()` to check the presence of specific fields in JSON lines and processes JSON data using PySpark.

- **Fields Definition**: `fields`, `fields2`, `fields3`, and `fields4` list field names expected in the JSON data. Only `fields2` is used in this cell.
- **Validation Function**: `validate()` receives a parsed record and returns `True` if every field in `fields2` is present as a key, otherwise `False`.
- **Reading JSON Data**: The file 'data/movies.json' is read with `sc.textFile()`, one line per record, and stored in `reviews_raw`.
- **Data Processing**: Each line is parsed with `json.loads()`, and the parsed records are filtered with `validate()` so that only records containing all required fields are retained.
- **Caching**: `reviews.cache()` marks the resulting RDD for caching, so it is kept in memory after the first action computes it and is reused by later operations.

This code demonstrates data validation and processing using PySpark, ensuring that only records with the required fields are retained for further analysis.
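
The validation step can be illustrated in plain Python, without Spark. The sketch below is a minimal version under stated assumptions: the sample lines are made up, and `parse_line` is a hypothetical helper (not part of the notebook) that additionally tolerates malformed JSON, which the notebook's `json.loads` call would not.

```python
import json

# Same required fields as fields2 in the notebook
REQUIRED_FIELDS = ['product_id', 'user_id', 'review', 'profile_name',
                   'helpfulness', 'score', 'time']


def parse_line(line):
    """Return the parsed record, or None if the line is not a JSON object."""
    try:
        record = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return record if isinstance(record, dict) else None


def validate(record):
    """True only if every required field is a key of the record."""
    return record is not None and all(f in record for f in REQUIRED_FIELDS)


# Made-up sample lines: one complete, one missing fields, one malformed
complete = {f: 'x' for f in REQUIRED_FIELDS}
lines = [json.dumps(complete), '{"product_id": "B000"}', 'not json']
valid = [r for r in map(parse_line, lines) if validate(r)]
print(len(valid))
```

On an RDD, the same two functions could be applied as `reviews_raw.map(parse_line).filter(validate)`.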


In [None]:
import json

fields = ['product_id','user_id','score','time']

fields2 = ['product_id','user_id','review','profile_name','helpfulness','score','time']
fields3 = ['product_id','user_id','time']
fields4 = ['user_id','score','time']

def validate(line):
  for field in fields2:
    if field not in line: return False
  return True

reviews_raw = sc.textFile('data/movies.json')

reviews = reviews_raw.map(lambda line: json.loads(line)).filter(validate)

reviews.cache()

In [None]:
reviews.take(1)

## Analyzing Review Data

This code cell calculates various statistics on the review data using PySpark operations.

- **Number of Reviews**: `num_entries = reviews.count()` counts the total number of reviews in the dataset.
- **Number of Unique Movies**: `num_movies = reviews.groupBy(lambda entry: entry['product_id']).count()` groups the reviews by product ID and counts the number of unique movies.
- **Number of Unique Users**: `num_users = reviews.groupBy(lambda entry: entry['user_id']).count()` groups the reviews by user ID and counts the number of unique users.
- **Printing Statistics**: The statistics are printed, showing the total number of reviews, unique movies, and unique users.

This code provides basic insights into the size and diversity of the review dataset, including the number of reviews, unique movies, and unique users.
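
Counting unique movies and users amounts to counting distinct keys. The plain-Python sketch below shows the idea on a few made-up records; the `sample_reviews` list and its IDs are illustrative only. On an RDD, `reviews.map(lambda e: e['product_id']).distinct().count()` would be one alternative to grouping, since it only needs the IDs rather than the grouped records.

```python
# Made-up reviews standing in for the parsed records
sample_reviews = [
    {'product_id': 'P1', 'user_id': 'U1'},
    {'product_id': 'P1', 'user_id': 'U2'},
    {'product_id': 'P2', 'user_id': 'U1'},
]

sample_entries = len(sample_reviews)
# A set keeps one copy of each ID, so its size is the number of distinct IDs
sample_movies = len({r['product_id'] for r in sample_reviews})
sample_users = len({r['user_id'] for r in sample_reviews})

summary = (f"{sample_entries} reviews of {sample_movies} movies "
           f"by {sample_users} different people.")
print(summary)
```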


In [None]:
num_movies = reviews.groupBy(lambda entry: entry['product_id']).count()
num_users = reviews.groupBy(lambda entry: entry['user_id']).count()
num_entries = reviews.count()

print(f"{num_entries} reviews of {num_movies} movies by {num_users} different people.")

## Generating Movie Suggestions

This code cell ranks movies by the number of reviews they received, as a simple popularity-based suggestion list.

- **Map Phase**: `r1 = reviews.map(lambda r: ((r['product_id'],), 1))` maps each review to a pair whose key is a one-element tuple holding the product ID and whose value is 1.
- **Reduce Phase**: `avg3 = r1.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))` turns each value into `(1, 1)` and sums both components per key, so each movie ends up with the pair `(review_count, review_count)`.
- **Filtering**: `avg3 = avg3.filter(lambda x: x[1][1] > 20)` keeps only movies with more than 20 reviews.
- **Sorting**: `avg3 = avg3.map(lambda x: ((x[1][0],), x[0])).sortByKey(ascending=False)` moves the review count into the key and sorts the movies from most to least reviewed.

The result is a list of frequently reviewed movies ordered by review count. No ratings are involved at this stage.


In [None]:
# Count the reviews per movie: each review contributes 1 to its product ID
r1 = reviews.map(lambda r: ((r['product_id'],), 1))
avg3 = r1.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

# Keep movies with more than 20 reviews, then sort by review count (descending)
avg3 = avg3.filter(lambda x: x[1][1] > 20 )
avg3 = avg3.map(lambda x: ((x[1][0],), x[0])).sortByKey(ascending=False)
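
The map, reduce, filter, and sort steps of this cell can be mirrored in plain Python to see what each one contributes. The sketch below is only an analogy, assuming made-up product IDs and a hypothetical threshold `MIN_REVIEWS` that is lowered so the tiny sample produces output.

```python
from collections import Counter

# Made-up product IDs, one entry per review
reviewed = ['P1', 'P2', 'P1', 'P3', 'P1', 'P2']
MIN_REVIEWS = 1  # hypothetical threshold; the notebook uses 20

# Counter plays the role of mapping each review to (key, 1) and summing per key
counts = Counter(reviewed)

# Keep keys above the threshold, then sort by count, largest first
ranked = sorted(
    ((n, pid) for pid, n in counts.items() if n > MIN_REVIEWS),
    reverse=True,
)
print(ranked)
```

As in the notebook, the count is placed first in each pair so that sorting orders the entries by count.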

## Top 10 Movie Suggestions

This code takes the first 10 entries of `avg3`, that is, the 10 most-reviewed movies, and prints the Amazon URL of each movie together with its count.

- **Iteration**: `for movie in avg3.take(10):` iterates through the first 10 entries of the sorted `avg3` RDD.
- **Printing**: `print ("http://www.amazon.com/dp/" + movie[1][0] + " WATCHED BY : " + str(movie[0][0]) + " PEOPLE")` prints the Amazon URL for the movie (`movie[1][0]`) along with the count stored in `movie[0][0]`. The label says "WATCHED BY", but the figure is derived from the number of reviews, not from viewing data.

The output gives direct links to the 10 most-reviewed movies on Amazon.


In [None]:
for movie in avg3.take(10):
  print ("http://www.amazon.com/dp/" + movie[1][0] + " WATCHED BY : " + str(movie[0][0]) + " PEOPLE")

## Generating User Suggestions

This code cell ranks users by the number of reviews they wrote.

- **Map Phase**: `r2 = reviews.map(lambda ru: ((ru['user_id'],), 1))` maps each review to a pair whose key is a one-element tuple holding the user ID and whose value is 1.
- **Reduce Phase**: `avg2 = r2.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))` turns each value into `(1, 1)` and sums both components per key, so each user ends up with the pair `(review_count, review_count)`.
- **Filtering**: `avg2 = avg2.filter(lambda x: x[1][1] > 20)` keeps only users with more than 20 reviews.
- **Sorting**: `avg2 = avg2.map(lambda x: ((x[1][0],), x[0])).sortByKey(ascending=False)` moves the review count into the key and sorts the users from most to least active.

The result is a list of the most active reviewers ordered by their total review count.


In [None]:
# Count the reviews per user: each review contributes 1 to its user ID
r2 = reviews.map(lambda ru: ((ru['user_id'],), 1))
avg2 = r2.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

# Keep users with more than 20 reviews, then sort by review count (descending)
avg2 = avg2.filter(lambda x: x[1][1] > 20 )
avg2 = avg2.map(lambda x: ((x[1][0],), x[0])).sortByKey(ascending=False)

## Top 10 User Suggestions

This code takes the first 10 entries of `avg2`, that is, the 10 most active reviewers, and prints a URL built from each user ID together with the user's count.

- **Iteration**: `for user in avg2.take(10):` iterates through the first 10 users in the sorted `avg2` RDD.
- **Printing**: `print ("http://www.amazon.com/dp/" + user[1][0] + " WATCHED : " + str(user[0][0]) + " MOVIES")` prints a URL containing the user ID (`user[1][0]`) along with the count stored in `user[0][0]`. The figure is derived from the number of reviews, not from viewing data. Note also that `/dp/` is the path Amazon uses for product pages, so these links may not resolve to user profiles.

The output lists the 10 most active reviewers along with their counts.


In [None]:
for user in avg2.take(10):
  print ("http://www.amazon.com/dp/" + user[1][0] + " WATCHED : " + str(user[0][0]) + " MOVIES")

## Review Search for "George"

This code cell searches for reviews written by users with "George" in their profile name and displays relevant information.

- **Filtering**: `filtered = reviews.filter(lambda entry: "George" in entry['profile_name'])` filters reviews based on whether "George" is present in the profile name.
- **Counting Entries**: `print ("Found " + str(filtered.count()) + " entries.\n")` prints the total number of entries found after filtering.
- **Iterating and Printing**: The code iterates through the filtered reviews and prints the rating, helpfulness, Amazon URL, summary, and full review text for each matching review.

This code allows for searching and retrieving reviews written by users with "George" in their profile name, providing details such as rating, helpfulness, and review text.


In [None]:
# Find reviews whose author's profile name contains "George"
filtered = reviews.filter(lambda entry: "George" in entry['profile_name'])
print ("Found " + str(filtered.count()) + " entries.\n")

for review in filtered.collect():
  print ("Rating: " + str(review['score']) + " and helpfulness: " + str(review['helpfulness']))
  print ("http://www.amazon.com/dp/" + review['product_id'])
  # 'summary' is not among the validated fields, so fall back to an empty string
  print (review.get('summary', ''))
  print (review['review'])
  print ("\n")

## Best and Worst Rated Movies

This code cell calculates the average rating for each movie and identifies the best and worst rated movies.

- **Map Phase**: `reviews_by_movie = reviews.map(lambda r: ((r['product_id'],), r['score']))` maps each review to a tuple containing the product ID as a key and the score as the value.
- **Reduce Phase**: `avg = reviews_by_movie.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))` reduces the tuples by key, summing up the scores and counting the number of reviews.
- **Filtering**: `avg = avg.filter(lambda x: x[1][1] > 20 )` keeps only movies with more than 20 reviews.
- **Sorting**: `avg = avg.map(lambda x: ((x[1][0]/x[1][1],), x[0])).sortByKey(ascending=True)` divides the score sum by the review count to get each movie's average rating and sorts the movies in ascending order, so the lowest-rated movies come first.

This code ranks movies by average user rating, restricted to movies with more than 20 reviews.
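
The reduce step carries a `(sum, count)` pair per movie because averages cannot be merged directly, while sums and counts can. The plain-Python sketch below illustrates the pattern on made-up `(product_id, score)` pairs; the data and variable names are illustrative only.

```python
# Made-up (product_id, score) pairs
scores = [('P1', 5.0), ('P1', 3.0), ('P2', 1.0), ('P2', 2.0), ('P2', 3.0)]

# Accumulate (sum_of_scores, review_count) per product
totals = {}
for pid, score in scores:
    s, n = totals.get(pid, (0.0, 0))
    totals[pid] = (s + score, n + 1)

# Divide sum by count, then sort ascending so the lowest average comes first
averages = sorted((s / n, pid) for pid, (s, n) in totals.items())
print(averages)
```

Sorting the same pairs with `reverse=True` would put the best-rated product first instead.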


In [None]:
# Get best and worst rated movies
reviews_by_movie = reviews.map(lambda r: ((r['product_id'],), r['score']))
avg = reviews_by_movie.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

avg = avg.filter(lambda x: x[1][1] > 20 )
avg = avg.map(lambda x: ((x[1][0]/x[1][1],), x[0])).sortByKey(ascending=True)

## Top 10 Worst Rated Movies

This code prints the first 10 entries of `avg`. Because `avg` is sorted in ascending order of average rating, these are the 10 lowest-rated movies. Sorting with `ascending=False` instead would list the best-rated movies first.

- **Iteration**: `for movie in avg.take(10):` iterates through the first 10 movies in the `avg` RDD.
- **Printing**: `print ("http://www.amazon.com/dp/" + movie[1][0] + " Rating: " + str(movie[0][0]))` prints the Amazon URL for the movie (`movie[1][0]`) along with its average rating (`movie[0][0]`).

This code provides direct links to these movies on Amazon along with their average ratings.


In [None]:
for movie in avg.take(10):
  print ("http://www.amazon.com/dp/" + movie[1][0] + " Rating: " + str(movie[0][0]))

## Spark and Pandas

## Creating Time Series RDD

This code cell creates a time series RDD from the reviews RDD by mapping each entry to a dictionary containing the score and time attributes.

- **Mapping**: `timeseries_rdd = reviews.map(lambda entry: {'score': entry['score'],'time': datetime.fromtimestamp(entry['time'])})` maps each entry in the reviews RDD to a dictionary with 'score' and 'time' attributes. The 'time' attribute is converted from a Unix timestamp to a datetime object using `datetime.fromtimestamp()`.

This code aims to create a time series RDD for further analysis and visualization of review scores over time.
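
The timestamp conversion can be tried on its own. The sketch below uses made-up Unix timestamps and, unlike the notebook cell, passes an explicit UTC time zone; without the `tz` argument, `datetime.fromtimestamp` converts to the local time zone of the machine running the code. The per-year count at the end is only a rough plain-Python analogue of an annual resample.

```python
from collections import Counter
from datetime import datetime, timezone

# Made-up Unix timestamps (seconds since the epoch)
timestamps = [0, 86400, 1000000000, 1500000000]

# Passing tz makes the conversion independent of the machine's local time zone
times = [datetime.fromtimestamp(t, tz=timezone.utc) for t in timestamps]

# Count entries per calendar year
per_year = Counter(t.year for t in times)
print(sorted(per_year.items()))
```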


In [None]:
from datetime import datetime

timeseries_rdd = reviews.map(lambda entry: {'score': entry['score'],'time': datetime.fromtimestamp(entry['time'])})

## Time Series Analysis and Visualization

This code cell performs time series analysis and visualization using Pandas and Matplotlib on the time series RDD created earlier.

- **Sampling**: `timeseries_rdd.sample(...)` draws a random sample of roughly 20,000 entries without replacement, so that the collected data stays small enough for Pandas.
- **Pandas DataFrame**: `timeseries = pd.DataFrame(sample.collect(),columns=['score', 'time'])` converts the sampled RDD into a Pandas DataFrame with 'score' and 'time' columns.
- **Data Types**: `timeseries['score'] = timeseries['score'].astype('float64')` converts the 'score' column to float64. The result is assigned back because `astype` returns a new Series rather than modifying the column in place.
- **Indexing**: `timeseries.set_index('time', inplace=True)` sets the 'time' column as the index of the DataFrame.
- **Resampling**: `Rsample = timeseries.score.resample('Y').count()` resamples the data annually ('Y'), counting the number of reviews in each year.
- **Plotting**: `Rsample.plot()` plots the annual counts using Matplotlib.
- **Further Resampling and Plotting**: The same resampling and plotting are repeated for monthly ('M') and quarterly ('Q') intervals. All three series are drawn in the same cell.

This code shows how the number of sampled reviews changes over time at annual, monthly, and quarterly resolution.


In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# Sample about 20,000 entries; the fraction is capped at 1.0 for small datasets
sample = timeseries_rdd.sample(withReplacement=False, fraction=min(1.0, 20000.0/num_entries), seed=1134)
timeseries = pd.DataFrame(sample.collect(),columns=['score', 'time'])
print(timeseries.head(3))
# astype returns a new Series, so assign it back to keep the conversion
timeseries['score'] = timeseries['score'].astype('float64')

timeseries.set_index('time', inplace=True)
# Count reviews per year, month, and quarter
# (pandas 2.2 and later prefer the aliases 'YE', 'ME', and 'QE')
Rsample = timeseries.score.resample('Y').count()
Rsample.plot()
Rsample2 = timeseries.score.resample('M').count()
Rsample2.plot()
Rsample3 = timeseries.score.resample('Q').count()
Rsample3.plot()

## Matrix factorization

## Histogram of Average Ratings for Top Movies

This code cell draws a bar chart of the average ratings of the first four movies in the `avg` RDD. Because `avg` is sorted in ascending order of average rating, these are the four lowest-rated movies.

- **Iteration**: `for movie in avg.take(4):` iterates through the first 4 movies in the `avg` RDD.
- **Plotting**: `plt.bar(movie[1][0],movie[0][0])` adds one bar per movie, with the movie on the x-axis and its average rating on the y-axis.
- **Title, Labels**: `plt.title('Histogram of \'AVERAGE RATING OF MOVIE\'')`, `plt.xlabel('MOVIE')`, and `plt.ylabel('AVGRATING')` set the title and axis labels. Despite the title, the figure is a bar chart with one bar per movie, not a histogram.

This code visualizes the average ratings of these movies side by side for easy comparison.


In [None]:
for movie in avg.take(4):
  plt.bar(movie[1][0],movie[0][0])
# The title and axis labels only need to be set once, after the bars are drawn
plt.title('Histogram of \'AVERAGE RATING OF MOVIE\'')
plt.xlabel('MOVIE')
plt.ylabel('AVGRATING')

## Histogram of Number of Movies Reviewed by Top Users

This code cell draws a bar chart of the review counts of the three most active users in the `avg2` RDD.

- **Iteration**: `for user in avg2.take(3):` iterates through the first 3 users in the `avg2` RDD.
- **Plotting**: `plt.bar(user[1][0],user[0][0])` adds one bar per user, with the user on the x-axis and the count from `avg2` on the y-axis.
- **Title, Labels**: `plt.title('Histogram of \'NUMBER OF MOVIES REVIEWED BY USER\'')`, `plt.xlabel('USER')`, and `plt.ylabel('MOVIE COUNT')` set the title and axis labels. Despite the title, the figure is a bar chart with one bar per user, not a histogram.

This code visualizes the reviewing activity of the most active users side by side for easy comparison.


In [None]:
for user in avg2.take(3):
  plt.bar(user[1][0],user[0][0])
# The title and axis labels only need to be set once, after the bars are drawn
plt.title('Histogram of \'NUMBER OF MOVIES REVIEWED BY USER\'')
plt.xlabel('USER')
plt.ylabel('MOVIE COUNT')

## Histogram of Movies Reviewed by Number of Users

This code cell draws a bar chart of the review counts of the four most-reviewed movies in the `avg3` RDD.

- **Iteration**: `for movie in avg3.take(4):` iterates through the first 4 movies in the `avg3` RDD.
- **Plotting**: `plt.bar(movie[1][0],movie[0][0])` adds one bar per movie, with the movie on the x-axis and the count from `avg3` on the y-axis.
- **Title, Labels**: `plt.title('Histogram of \'MOVIES REVIEWED BY NUMBER OF USERS\'')`, `plt.xlabel('MOVIE')`, and `plt.ylabel('USER COUNT')` set the title and axis labels. Despite the title, the figure is a bar chart with one bar per movie, not a histogram.

This code visualizes how often the most popular movies were reviewed, providing insights into their popularity among users.


In [None]:
for movie in avg3.take(4):
  plt.bar(movie[1][0],movie[0][0])
# The title and axis labels only need to be set once, after the bars are drawn
plt.title('Histogram of \'MOVIES REVIEWED BY NUMBER OF USERS\'')
plt.xlabel('MOVIE')
plt.ylabel('USER COUNT')

## Collaborative Filtering with Alternating Least Squares (ALS)

This code cell implements collaborative filtering using Alternating Least Squares (ALS) for recommendation.

- **Import**: `from pyspark.mllib.recommendation import ALS` imports the `ALS` class.
- **Hashing**: `get_hash` maps user and product IDs to integers below 10^8 by taking a SHA-1 digest modulo `10 ** 8`, so that the string IDs can be used as integer IDs for modeling. Distinct IDs can, in principle, collide after the modulo.
- **Ratings**: `ratings = reviews.map(lambda entry: tuple([ get_hash(entry['user_id'].encode('utf-8')),get_hash(entry['product_id'].encode('utf-8')),int(entry['score']) ]))` transforms each entry in the reviews RDD into a tuple of (user ID, product ID, rating).
- **Train-Test Split**: `train_data` and `test_data` are obtained by filtering on `(user hash + product hash) % 10`: remainders 2 to 9 go to training and remainders 0 and 1 go to testing, giving roughly an 80/20 split.
- **Caching**: `train_data.cache()` caches the train data for efficient processing.
- **Print Stats**: The numbers of train and test samples are printed for inspection.

This code sets up the data for collaborative filtering with ALS, allowing for the training and evaluation of recommendation models.
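
The hashing and splitting logic runs in plain Python as well. The sketch below reuses the notebook's `get_hash` and wraps the split condition in a hypothetical helper `is_train`; the IDs are made up. A digest from `hashlib` is used rather than Python's built-in `hash()`, which for strings can vary between interpreter runs, so the same ID always lands on the same side of the split.

```python
import hashlib


def get_hash(s):
    """Map a byte string to a stable integer below 10**8."""
    return int(hashlib.sha1(s).hexdigest(), 16) % (10 ** 8)


def is_train(user_hash, product_hash):
    # Remainders 2..9 go to training, 0..1 to testing
    return (user_hash + product_hash) % 10 >= 2


# Made-up IDs, for illustration only
pairs = [(f'user{i}', f'product{j}') for i in range(20) for j in range(20)]
hashed = [(get_hash(u.encode('utf-8')), get_hash(p.encode('utf-8')))
          for u, p in pairs]

train = [h for h in hashed if is_train(*h)]
test = [h for h in hashed if not is_train(*h)]
print(len(train), len(test))
```

Every pair falls into exactly one of the two sets, because the two filters use complementary conditions.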


In [None]:
from pyspark.mllib.recommendation import ALS
from numpy import array
import hashlib
import math

def get_hash(s):
  return int(hashlib.sha1(s).hexdigest(), 16) % (10 ** 8)


#Input format: [user, product, rating]
ratings = reviews.map(lambda entry: tuple([ get_hash(entry['user_id'].encode('utf-8')),get_hash(entry['product_id'].encode('utf-8')),int(entry['score']) ]))

train_data = ratings.filter(lambda entry: ((entry[0]+entry[1]) % 10) >=2 )
test_data = ratings.filter(lambda entry: ((entry[0]+entry[1]) % 10) < 2 )
train_data.cache()
#train_data.union(train_data)

print ("Number of train samples: " + str(train_data.count()))
print ("Number of test samples: " + str(test_data.count()))

## Building and Evaluating the Recommendation Model

This code cell builds a recommendation model using Alternating Least Squares (ALS) and evaluates it on the test data.

- **Model Parameters**: `rank = 20` and `numIterations = 20` set the number of latent factors and the number of ALS iterations.
- **Training**: `model = ALS.train(train_data, rank, numIterations)` trains the ALS model on the train data.
- **Evaluation**:
  - **Prediction Input**: `unknown = test_data.map(lambda entry: (int(entry[0]), int(entry[1])))` reduces the test data to (user, product) pairs.
  - **Predictions**: `predictions = model.predictAll(unknown).map(lambda r: ((int(r[0]), int(r[1])), r[2]))` generates predictions for those pairs and keys them by (user, product).
  - **True and Predicted Values**: `true_and_predictions = test_data.map(lambda r: ((int(r[0]), int(r[1])), r[2])).join(predictions)` joins true ratings with predictions. Because this is an inner join, pairs without a prediction drop out.
  - **Mean Squared Error (MSE)**: `MSE = true_and_predictions.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()` averages the squared difference between true and predicted ratings.

This code trains the recommendation model using ALS and evaluates its predictive accuracy using MSE.


In [None]:
# Build the recommendation model using Alternating Least Squares
from math import sqrt
rank = 20
numIterations = 20
model = ALS.train(train_data, rank, numIterations)

def convertToFloat(lines):
  returnedLine = []
  for x in lines:
    returnedLine.append(float(x))
  return returnedLine

# Evaluate the model on test data
unknown = test_data.map(lambda entry: (int(entry[0]), int(entry[1])))
predictions = model.predictAll(unknown).map(lambda r: ((int(r[0]), int(r[1])), r[2]))
true_and_predictions = test_data.map(lambda r: ((int(r[0]), int(r[1])), r[2])).join(predictions)
# Mean squared error: average of (true - predicted)^2 over the joined pairs
MSE = true_and_predictions.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()
print("Mean Squared Error = " + str(MSE))
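
The evaluation step can be checked by hand on a tiny example. The plain-Python sketch below uses made-up ratings keyed by (user, product) and imitates the join with a dictionary comprehension. It is meant only to show how the mean squared error is formed, not how Spark computes it.

```python
# Made-up ratings keyed by (user, product)
true_ratings = {(1, 10): 5.0, (1, 11): 3.0, (2, 10): 4.0}
predicted = {(1, 10): 4.0, (1, 11): 3.0, (3, 12): 2.0}

# Inner join on the key: only pairs present on both sides survive
joined = {k: (true_ratings[k], predicted[k])
          for k in true_ratings if k in predicted}

# Mean of the squared differences between true and predicted ratings
squared_errors = [(t - p) ** 2 for t, p in joined.values()]
mse = sum(squared_errors) / len(squared_errors)
print(mse)
```

The pair `(2, 10)` has no prediction and `(3, 12)` has no true rating, so neither contributes to the error.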

This output shows the first 10 elements of the `true_and_predictions` RDD. Each element has the form `((user ID, product ID), (true rating, predicted rating))`.

In [None]:
true_and_predictions.take(10)

## No demo without a word count example!

## Filtering Reviews for Word Analysis

This code cell filters reviews based on their ratings to analyze words used in positive (5.0) and negative (1.0) reviews.

- **Parameters**:
  - `min_occurrences = 10`: Minimum occurrences required for a word to be considered.

- **Positive Reviews**:
  - `good_reviews`: Keeps reviews with a score of 5.0.
  - `good_words`: Splits each review on spaces and counts the occurrences of each word. Only words occurring more than `min_occurrences` times are kept. `num_good_words` holds the total number of words before this filter is applied.

- **Negative Reviews**:
  - `bad_reviews`: Keeps reviews with a score of 1.0.
  - `bad_words`: Splits each review on spaces and counts the occurrences of each word. Only words occurring more than `min_occurrences` times are kept. `num_bad_words` holds the total number of words before this filter is applied.
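
The counting and filtering steps listed above follow the classic word-count pattern. The plain-Python sketch below mirrors them on a few made-up review texts, with a hypothetical threshold `MIN_OCCURRENCES` lowered to 1 so that the small sample produces output. It also shows the relative frequency that is derived from these counts.

```python
from collections import Counter

MIN_OCCURRENCES = 1  # hypothetical threshold; the notebook uses 10

# Made-up review texts
texts = ['great great movie', 'great plot', 'boring movie']

# Split on single spaces and strip each token
words = [w.strip() for text in texts for w in text.split(' ')]
num_words = len(words)  # total number of words, taken before filtering

# Keep only words seen more than MIN_OCCURRENCES times
counts = {w: n for w, n in Counter(words).items() if n > MIN_OCCURRENCES}

# Relative frequency: count divided by the total number of words
frequencies = {w: n / num_words for w, n in counts.items()}
print(frequencies)
```

Splitting on a single space leaves punctuation attached to words, so "great" and "great!" would be counted separately.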


In [None]:
min_occurrences = 10

good_reviews = reviews.filter(lambda line: line['score']==5.0)
bad_reviews = reviews.filter(lambda line: line['score']==1.0)

good_words = good_reviews.flatMap(lambda line: line['review'].split(' '))
num_good_words = good_words.count()

good_words = good_words.map(lambda word: (word.strip(), 1)).reduceByKey(lambda a, b: a+b).filter(lambda word_count: word_count[1] > min_occurrences)

bad_words = bad_reviews.flatMap(lambda line: line['review'].split(' '))
num_bad_words = bad_words.count()

bad_words = bad_words.map(lambda word: (word.strip(), 1)).reduceByKey(lambda a, b: a+b).filter(lambda word_count: word_count[1] > min_occurrences)

## Calculating Word Frequencies

This code cell calculates the frequencies of words found in positive and negative reviews.

- **Positive Reviews**:
  - `frequency_good`: Calculates the frequency of each word in positive reviews by dividing its count by the total number of words in positive reviews.

- **Negative Reviews**:
  - `frequency_bad`: Calculates the frequency of each word in negative reviews by dividing its count by the total number of words in negative reviews.


In [None]:
# Calculate the word frequencies
frequency_good = good_words.map(lambda word: ((word[0],), float(word[1])/num_good_words))
frequency_bad = bad_words.map(lambda word: ((word[0],), float(word[1])/num_bad_words))

## Joining Word Frequencies

This code cell joins the word frequencies calculated for positive and negative reviews.

- **Input**:
  - `frequency_good`: Frequencies of words in positive reviews.
  - `frequency_bad`: Frequencies of words in negative reviews.

- **Output**:
  - `joined_frequencies`: The result of an inner join keyed by word. It contains only words that appear in both sets, each paired with its frequency in positive and in negative reviews.


In [None]:
# Join the word frequencies of the good and bad reviews
joined_frequencies = frequency_good.join(frequency_bad)

## Calculating Relative Difference of Word Frequencies

This code cell calculates, for each word, the relative difference between its frequency in positive and in negative reviews, `|f_good - f_bad| / f_good`, and sorts the words in descending order of that value. The words at the top are the ones whose usage differs most between positively and negatively rated movies.
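
A small worked example makes the measure concrete. The plain-Python sketch below applies the notebook's `relative_difference` to made-up frequencies, imitating the join with a dictionary comprehension. Note that the measure is not symmetric: it divides by the frequency in positive reviews, so swapping the arguments changes the result.

```python
import math


def relative_difference(a, b):
    return math.fabs(a - b) / a


# Made-up word frequencies for positive and negative reviews
frequency_good = {'great': 0.5, 'movie': 0.25, 'superb': 0.1}
frequency_bad = {'great': 0.25, 'movie': 0.25, 'awful': 0.2}

# Inner join: only words present in both dictionaries are compared
joined = {w: (frequency_good[w], frequency_bad[w])
          for w in frequency_good if w in frequency_bad}

# Sort by relative difference, largest first
result = sorted(((relative_difference(g, b), w) for w, (g, b) in joined.items()),
                reverse=True)
print(result)
```

Here "superb" and "awful" are absent from the result because each appears in only one of the two dictionaries.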


In [None]:
# Calculate the relative difference of each word frequency in the good and bad reviews.
# Sort the dataset to get the most significant expressions for the characterization of either a positively
# or negatively rated movie.

import math

def relative_difference(a, b):
  return math.fabs(a-b)/a

result = joined_frequencies.map(lambda f: ((relative_difference(f[1][0], f[1][1]),), f[0][0]) ).sortByKey(ascending=False)

In [None]:
result.take(50)

## Histogram of Sentiment Analysis

This code cell draws a bar chart of the 7 words with the largest relative difference in frequency between positive and negative reviews.

- **X-axis**: Word
- **Y-axis**: Relative difference of the word's frequency


In [None]:
for entry in result.take(7):
  plt.bar(entry[1],entry[0][0])
# The title and axis labels only need to be set once, after the bars are drawn
plt.title('Histogram of \'SENTIMENT ANALYSIS\'')
plt.xlabel('WORD')
plt.ylabel('RELATIVE DIFFERENCE')