<h2>Movie Recommender</h2>

<ul>
    <li>This code snippet sets up a PySpark environment in a Python script.</li><li> It first imports os, sys, warnings, pyspark (as ps), and the SQLContext class.</li><li> Then, it sets the Python executable for both the PySpark workers and the driver to the one running the script.</li>
    <li>Next, it attempts to create a SparkContext configured with local[*], which uses all available CPU cores on the local machine.</li><li> If the context is created, it prints a message. If a SparkContext already exists, the resulting ValueError is caught and a warning is issued instead.</li>
    </ul>

In [None]:
import os
import sys
import pyspark as ps
import warnings
from pyspark.sql import SQLContext
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
try:
    sc = ps.SparkContext('local[*]')
    print("Just created a SparkContext")
except ValueError:
    warnings.warn("SparkContext already exists in this scope")

<ul><li>
This code defines a unit test for a PySpark RDD operation using the unittest framework. It includes a test case TestRdd with a method test_take that checks if the take(4) method on an RDD created from [1, 2, 3, 4] returns the same list [1, 2, 3, 4].</li><li> The run_tests function loads this test case into a test suite and runs it, displaying the results with a verbosity level of 1.</li></ul>

In [None]:
import unittest
import sys
class TestRdd(unittest.TestCase):
    def test_take(self):
        input = sc.parallelize([1,2,3,4])
        self.assertEqual([1,2,3,4], input.take(4))
def run_tests():
    suite = unittest.TestLoader().loadTestsFromTestCase( TestRdd )
    unittest.TextTestRunner(verbosity=1,stream=sys.stderr).run( suite )
run_tests()

<ul><li>This code snippet reads a JSON file containing movie reviews into a PySpark RDD. It defines different sets of fields that each review should have.</li><li> It then filters the RDD to keep only the reviews that contain all the fields specified in one of the field sets (fields2). The filtered RDD is then cached for faster access.</li></ul>

In [None]:
import json
fields = ['product_id',
 'user_id',
 'score',
 'time']
fields2 = ['product_id',
 'user_id',
 'review',
'profile_name',
 'helpfulness',
 'score',
 'time']
fields3 = ['product_id',
 'user_id',
 'time']
fields4 = ['user_id',
 'score',
 'time']
 
def validate(line):
    for field in fields2:
        if field not in line: return False
    return True
reviews_raw = sc.textFile('data/movies.json')
reviews = reviews_raw.map(lambda line: json.loads(line)).filter(validate)
reviews.cache()
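
<ul><li>The field check above can be tried without Spark. The sketch below is a plain-Python illustration, not part of the original notebook: the two JSON lines and the helper name has_required_fields are made up, and the field list simply mirrors fields2.</li></ul>

```python
import json

# Hypothetical required fields, mirroring fields2 from the cell above.
REQUIRED_FIELDS = ['product_id', 'user_id', 'review', 'profile_name',
                   'helpfulness', 'score', 'time']

def has_required_fields(record, required=REQUIRED_FIELDS):
    """Return True when every required key is present in the parsed record."""
    return all(field in record for field in required)

# Two made-up JSON lines; the second one lacks 'helpfulness'.
sample_lines = [
    '{"product_id": "B001", "user_id": "U1", "review": "Great", '
    '"profile_name": "Ann", "helpfulness": "1/1", "score": 5.0, "time": 1100000000}',
    '{"product_id": "B002", "user_id": "U2", "review": "Bad", '
    '"profile_name": "Bob", "score": 1.0, "time": 1200000000}',
]

# Same shape as the RDD pipeline: parse each line, then keep complete records.
parsed = [json.loads(line) for line in sample_lines]
valid = [record for record in parsed if has_required_fields(record)]
print(len(valid))
```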

<b>This code calculates the number of unique movies, unique users, and total entries in the reviews RDD.</b><ul><li> It groups the reviews by 'product_id' (movie ID) and, separately, by 'user_id' (user ID) using groupBy; calling count() on each grouped RDD returns the number of groups, which is the number of distinct IDs.</li><li> Finally, it prints the total number of reviews, unique movies, and unique users in the dataset.</li></ul>

In [None]:
num_movies = reviews.groupBy(lambda entry: entry['product_id']).count()
num_users = reviews.groupBy(lambda entry: entry['user_id']).count()
num_entries = reviews.count()
 
print (str(num_entries) + " reviews of " + str(num_movies) + " movies by " +str(num_users) + " different people.")

<b>This code snippet counts the reviews for each product in the reviews RDD and ranks the products by that count:</b>
<ul><li>
It first maps each review to a tuple ((product_id,), 1), where the one-element tuple (product_id,) is the key and 1 is the value.</li><li>
It then uses mapValues to convert each value to a (value, 1) tuple; since the value is already 1, both elements act as counters.</li><li>
    It uses reduceByKey to sum up the counters for each product.</li><li>
    It keeps only products that have more than 20 reviews.</li><li>
It maps each entry to ((review_count,), (product_id,)) and sorts by review count in descending order.</li></ul>

In [None]:
r1 = reviews.map(lambda r: ((r['product_id'],), 1))
avg3 = r1.mapValues(lambda x: (x, 1)) \
 .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
# Keep only products with more than 20 reviews
avg3 = avg3.filter(lambda x: x[1][1] > 20 )
# Key by the review count (x[1][1]); adding x[1][0] as well would double it
avg3 = avg3.map(lambda x: ((x[1][1],), x[0])) \
 .sortByKey(ascending=False)

<ul><li>This code snippet iterates over the first 10 elements of the avg3 RDD, whose elements are tuples of the form ((review_count,), (product_id,)). For each movie it prints the URL of the movie's Amazon product page (which contains the movie's ID) followed by the number of reviews for the movie.</li></ul>

In [None]:
for movie in avg3.take(10):
    print ("http://www.amazon.com/dp/" + movie[1][0] + " " + str(movie[0][0]) + " PEOPLE")

<b>This code counts the reviews written by each user in a set of movie reviews.</b><ul><li> It first maps each review to a tuple containing the user ID and a value of 1.</li><li> Then, it accumulates the count of reviews for each user using reduceByKey.</li><li> It keeps only users with more than 20 reviews.</li><li> Finally, it sorts the results by review count in descending order.</li></ul>

In [None]:
r2 = reviews.map(lambda ru: ((ru['user_id'],), 1))
avg2= r2.mapValues(lambda x: (x, 1)) \
 .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
# Keep only users with more than 20 reviews
avg2 = avg2.filter(lambda x: x[1][1] > 20 )
# Key by the review count (x[1][1]); adding x[1][0] as well would double it
avg2 = avg2.map(lambda x: ((x[1][1],), x[0])) \
 .sortByKey(ascending=False )

<ul><li>This code snippet iterates over the first 10 elements of the avg2 RDD, whose elements are tuples of the form ((review_count,), (user_id,)). For each user it prints a URL built from the user's ID followed by the number of movies the user reviewed. Note that /dp/ is the prefix of Amazon product pages, so the printed link mainly serves to display the user ID.</li></ul>

In [None]:
for movie in avg2.take(10):
    print ("http://www.amazon.com/dp/" + movie[1][0] + " " + str(movie[0][0]) + " MOVIES")

<b>This code filters the reviews RDD to keep only the entries where the 'profile_name' contains the string "George".</b><ul><li> It then counts the number of such entries and prints the count.</li><li> Next, it collects and iterates over all the filtered reviews, printing various details for each review, such as rating, helpfulness, product ID (with an Amazon URL), summary, and full review text.</li></ul>


In [None]:
filtered = reviews.filter(lambda entry: "George" in entry['profile_name'])
print ("Found " + str(filtered.count()) + " entries.\n")

for review in filtered.collect():
    print ("Rating: " + str(review['score']) + " and helpfulness: " + str(review['helpfulness']))
    print ("http://www.amazon.com/dp/" + review['product_id'])
    # 'summary' is not among the validated fields, so fall back to an empty string
    print (review.get('summary', ''))
    print (review['review'])
    print ("\n")

<b>This code calculates the average score for each movie in a set of reviews.</b><ul><li> It first maps each review to a tuple containing the movie ID and the score.</li><li> Then, it calculates the total score and the count of reviews for each movie using reduceByKey.</li><li> It keeps only movies with more than 20 reviews and divides the total score by the review count to get each movie's average.</li><li> Finally, it sorts the results by average score in ascending order, so the lowest-rated movies come first.</li></ul>

In [None]:
reviews_by_movie = reviews.map(lambda r: ((r['product_id'],), r['score']))
avg = reviews_by_movie.mapValues(lambda x: (x, 1)) \
 .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
avg = avg.filter(lambda x: x[1][1] > 20 )
avg = avg.map(lambda x: ((x[1][0]/x[1][1],), x[0])) \
 .sortByKey(ascending=True)
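
<ul><li>The (sum, count) pattern used with mapValues and reduceByKey can be traced by hand. The following plain-Python sketch only emulates those RDD steps on made-up (product_id, score) pairs; the data and the helper name merge are illustrative assumptions, and the more-than-20-reviews filter is left out to keep the example small.</li></ul>

```python
from collections import defaultdict
from functools import reduce

# Made-up (product_id, score) pairs standing in for the reviews RDD.
pairs = [("A", 5.0), ("A", 3.0), ("B", 4.0), ("A", 4.0), ("B", 2.0)]

def merge(x, y):
    """Combine two (score_sum, review_count) pairs, as the reduceByKey lambda does."""
    return (x[0] + y[0], x[1] + y[1])

# mapValues step: turn every score into a (score, 1) pair, grouped by key.
grouped = defaultdict(list)
for key, score in pairs:
    grouped[key].append((score, 1))

# reduceByKey step: fold the pairs of each key into one (score_sum, review_count).
totals = {key: reduce(merge, values) for key, values in grouped.items()}

# Final map and ascending sort: (average, key), lowest average first.
averages = sorted((total / count, key) for key, (total, count) in totals.items())
print(averages)
```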

<ul><li>This code snippet iterates over the first 10 elements of the avg RDD (assuming avg is an RDD containing tuples of the form ((average_score,), movie_id)). It then prints a formatted string for each movie, including a URL to Amazon's product page for that movie, the movie's ID, and the average rating.</li></ul>

In [None]:
for movie in avg.take(10):
    print ("http://www.amazon.com/dp/" + movie[1][0] + " Rating: " + str(movie[0][0]))

<h3>Spark and Pandas</h3>

<ul><li>This code creates a new RDD called timeseries_rdd by mapping each entry in the reviews RDD to a dictionary containing the score and the timestamp converted to a datetime object.</li></ul>

In [None]:
from datetime import datetime
timeseries_rdd = reviews.map(lambda entry: {'score': entry['score'],'time': datetime.fromtimestamp(entry['time'])})

<b>This code snippet uses the pandas library to build a time series from a sample of the timeseries_rdd RDD:</b>
<ul><li>
    It defines a parser function that builds a datetime from a year-month string; the function is not called anywhere in this cell.</li><li>
It samples the timeseries_rdd RDD, aiming for about 20,000 entries, to reduce the number of entries for plotting purposes.</li><li>
It creates a Pandas DataFrame from the sampled RDD, specifying the columns as 'score' and 'time'.</li><li>
It prints the first 3 rows of the DataFrame.</li><li>
It sets the index of the DataFrame to the 'time' column.</li><li>
It resamples the 'score' column of the DataFrame to count the scores per year, per month, and per quarter.</li><li>
    It plots the resampled data for each frequency (yearly, monthly, and quarterly).</li></ul>

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

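# Aim for roughly 20,000 sampled entries; cap the fraction at 1.0 for small datasets
sample = timeseries_rdd.sample(withReplacement=False, fraction=min(1.0, 20000.0/num_entries), seed=1134)
timeseries = pd.DataFrame(sample.collect(),
columns=['score', 'time'])
print(timeseries.head(3))
# astype returns a new Series, so assign it back to keep the conversion
timeseries['score'] = timeseries.score.astype('float64')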

timeseries.set_index('time', inplace=True)
Rsample = timeseries.score.resample('Y').count()
Rsample.plot()
Rsample2 = timeseries.score.resample('M').count()
Rsample2.plot()
Rsample3 = timeseries.score.resample('Q').count()
Rsample3.plot()
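
<ul><li>To see what the per-period counts amount to, the sketch below reproduces the yearly and monthly counting with the standard library only. It is an illustration under assumptions: the four timestamps are made up, and UTC is used so the result does not depend on the local time zone (the notebook's datetime.fromtimestamp call uses local time). Unlike pandas resample, this version does not emit empty periods.</li></ul>

```python
from collections import Counter
from datetime import datetime, timezone

# Made-up Unix timestamps: one review in 2000, two in January 2001, one in 2002.
timestamps = [946684800, 978307200, 978393600, 1009843200]

# Convert to timezone-aware datetimes (UTC keeps the example deterministic).
times = [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in timestamps]

# Yearly counts: one bucket per calendar year.
per_year = Counter(t.year for t in times)

# Monthly counts: one bucket per (year, month) pair.
per_month = Counter((t.year, t.month) for t in times)

for year in sorted(per_year):
    print(year, per_year[year])
```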

<h3>Matrix Factorization</h3>

<b>This code snippet creates a bar chart showing the average rating for each of the first 4 movies in the avg RDD.</b><ul><li> It iterates over the first 4 elements of avg, whose elements are tuples of the form ((average_score,), (movie_id,)). Because avg is sorted in ascending order, these are the 4 lowest-rated movies.</li><li> For each movie, it uses plt.bar to draw a bar where the x-axis represents the movie ID and the y-axis represents the average rating.</li><li> The title of the plot is set to 'Histogram of 'AVERAGE RATING OF MOVIE'', the x-axis label is 'MOVIE', and the y-axis label is 'AVGRATING'.</li></ul>

In [None]:
for movie in avg.take(4):
    plt.bar(movie[1][0],movie[0][0])
    plt.title('Histogram of \'AVERAGE RATING OF MOVIE\'')
    plt.xlabel('MOVIE')
    plt.ylabel('AVGRATING')

<b>This code generates a bar chart displaying the number of movies reviewed by the top 3 users. </b><ul><li>It loops through the first 3 elements of the avg2 RDD, where each element represents a user and their respective movie count.</li><li> For each user, a bar is plotted with the user ID on the x-axis and the movie count on the y-axis. </li><li>The plot is titled 'Histogram of 'NUMBER OF MOVIES REVIEWED BY USER'', with 'USER' and 'MOVIE COUNT' labels for the x-axis and y-axis, respectively.</li></ul>

In [None]:
for movie in avg2.take(3):
    plt.bar(movie[1][0],movie[0][0])
    plt.title('Histogram of \'NUMBER OF MOVIES REVIEWED BY USER\'')
    plt.xlabel('USER')
    plt.ylabel('MOVIE COUNT')

<b>This code generates a bar chart showing the number of reviews for each of the 4 most-reviewed movies.</b><ul><li> It loops through the first 4 elements of the avg3 RDD, where each element represents a movie and the number of reviews it received.</li><li> For each movie, a bar is plotted with the movie ID on the x-axis and the review count on the y-axis. </li><li>The plot is titled 'Histogram of 'MOVIES REVIEWED BY NUMBER OF USERS'', with 'MOVIE' and 'USER COUNT' labels for the x-axis and y-axis, respectively.</li></ul>

In [None]:
for movie in avg3.take(4):
    plt.bar(movie[1][0],movie[0][0])
    plt.title('Histogram of \'MOVIES REVIEWED BY NUMBER OF USERS\'')
    plt.xlabel('MOVIE')
    plt.ylabel('USER COUNT')

<b>This code snippet prepares data for collaborative filtering using the Alternating Least Squares (ALS) algorithm from PySpark's pyspark.mllib.recommendation module:</b>
<ul><li>
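    It defines a function get_hash that maps user and product ID strings to integers below 10^8, because the ALS implementation expects integer IDs; distinct IDs can occasionally collide.</li><li>
It maps each review in reviews RDD to a tuple of hashed user ID, hashed product ID, and the rating.</li><li>
    It splits the data deterministically into training and testing sets: entries whose hashed IDs sum to a value ending in 2 through 9 go to training (roughly 80%), the rest to testing (roughly 20%).</li><li>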
    It caches the training data for faster access.</li><li>
    It prints the number of samples in the training and testing datasets.</li></ul>

In [None]:
from pyspark.mllib.recommendation import ALS
from numpy import array
import hashlib
import math
def get_hash(s):
    return int(hashlib.sha1(s).hexdigest(), 16) % (10 ** 8)

ratings = reviews.map(lambda entry: tuple([ get_hash(entry['user_id'].encode('utf-8')),get_hash(entry['product_id'].encode('utf-8')),int(entry['score']) ]))

train_data = ratings.filter(lambda entry: ((entry[0]+entry[1]) % 10) >=2 )
test_data = ratings.filter(lambda entry: ((entry[0]+entry[1]) % 10) < 2 )
train_data.cache()

print ("Number of train samples: " + str(train_data.count()))
print ("Number of test samples: " + str(test_data.count()))
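
<ul><li>The hash-based split can be checked in isolation. The sketch below reuses the get_hash definition from the cell above on made-up user and movie IDs (the ID strings and list names are assumptions for illustration) and confirms that every pair lands in exactly one of the two sets.</li></ul>

```python
import hashlib

def get_hash(s):
    """Map a byte string to an integer in [0, 10**8), as in the cell above."""
    return int(hashlib.sha1(s).hexdigest(), 16) % (10 ** 8)

# Made-up (user_id, product_id) pairs standing in for the reviews.
pairs = [("user%d" % i, "movie%d" % (i % 7)) for i in range(1000)]

hashed = [(get_hash(u.encode('utf-8')), get_hash(p.encode('utf-8')))
          for u, p in pairs]

# Last digit of the summed hashes decides the set: 2-9 train, 0-1 test.
train_part = [h for h in hashed if (h[0] + h[1]) % 10 >= 2]
test_part = [h for h in hashed if (h[0] + h[1]) % 10 < 2]

print(len(train_part), len(test_part))
```

<ul><li>Because the assignment depends only on the IDs, rerunning the split always puts the same user-movie pair in the same set.</li></ul>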

<b>This code evaluates a recommendation model by predicting ratings for test data and calculating the Mean Squared Error (MSE) between predicted and actual ratings. </b><ul><li>It first trains an ALS model using train_data, specifying the rank and number of iterations.</li><li> Then, it maps the test data to user-product pairs and uses the model to predict ratings.</li><li> These predictions are joined with the actual ratings, and MSE is calculated by comparing predicted and actual ratings for each pair.</li><li> The code returns the first 10 pairs of actual and predicted ratings for examination.</li></ul>

In [None]:
from math import sqrt
rank = 20
numIterations = 20
model = ALS.train(train_data, rank, numIterations)
def convertToFloat(lines):
    returnedLine = []
    for x in lines:
        returnedLine.append(float(x))
    return returnedLine
 # Evaluate the model on test data
unknown = test_data.map(lambda entry: (int(entry[0]), int(entry[1])))
predictions= model.predictAll(unknown).map(lambda r: ((int(r[0]), int(r[1])), r[2]))
true_and_predictions = test_data.map(lambda r: ((int(r[0]), int(r[1])), r[2])).join(predictions)
# Mean of the squared differences between actual (r[1][0]) and predicted (r[1][1]) ratings
MSE = true_and_predictions.map(lambda r: (r[1][0] - r[1][1]) ** 2).reduce(lambda x, y: x + y) / true_and_predictions.count()
true_and_predictions.take(10)
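
<ul><li>As a sanity check on the metric, here is the same MSE computation on three made-up ((user, product), (actual, predicted)) records in plain Python. The numbers are invented for illustration; the square root is added only to show how an RMSE would follow from the MSE.</li></ul>

```python
import math

# Made-up joined records: ((user, product), (actual_rating, predicted_rating)).
joined = [
    ((1, 10), (5.0, 4.5)),
    ((1, 11), (3.0, 3.5)),
    ((2, 10), (1.0, 2.0)),
]

# Squared error per record: (actual - predicted) ** 2.
squared_errors = [(actual - predicted) ** 2 for _, (actual, predicted) in joined]

# MSE is the mean of the squared errors; RMSE is its square root.
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)
print(mse, rmse)
```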

<h3>No demo without word count example</h3>

<b>This code snippet analyzes reviews to identify frequently occurring words in both positive (score 5.0) and negative (score 1.0) reviews:</b>
<ul><li>
    It filters the reviews to separate positive and negative reviews based on their scores.</li><li>
It extracts individual words from the reviews and counts the occurrences of each word.</li><li>
It keeps, separately for the positive and the negative reviews, only the words that occur more than min_occurrences times.</li><li>
The result is a list of frequently occurring words in positive and negative reviews, along with their respective counts.</li></ul>

In [None]:
min_occurrences = 10
good_reviews = reviews.filter(lambda line: line['score']==5.0)
bad_reviews = reviews.filter(lambda line: line['score']==1.0)

good_words = good_reviews.flatMap(lambda line: line['review'].split(' '))
num_good_words = good_words.count()
good_words = good_words.map(lambda word: (word.strip(), 1)) \
 .reduceByKey(lambda a, b: a+b) \
 .filter(lambda word_count: word_count[1] > min_occurrences)
bad_words = bad_reviews.flatMap(lambda line: line['review'].split(' '))
num_bad_words = bad_words.count()
bad_words = bad_words.map(lambda word: (word.strip(), 1)) \
 .reduceByKey(lambda a, b: a+b) \
 .filter(lambda word_count: word_count[1] > min_occurrences)
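
<ul><li>The flatMap, map, reduceByKey, and filter chain is the classic word count. A minimal plain-Python equivalent on made-up review texts follows; the texts and the lowered threshold are assumptions chosen so the small example produces output.</li></ul>

```python
from collections import Counter

# Made-up review texts standing in for the 'review' field.
reviews_text = ["great great movie", "great fun", "boring movie"]
min_occurrences = 1  # lowered from 10 so the tiny sample keeps some words

# flatMap step: split every review on spaces and flatten into one word list.
words = [w.strip() for text in reviews_text for w in text.split(' ')]
num_words = len(words)

# map + reduceByKey step: count occurrences per word.
counts = Counter(words)

# filter step: keep words that occur more than min_occurrences times.
frequent = {w: c for w, c in counts.items() if c > min_occurrences}
print(num_words, frequent)
```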

<b>This code snippet calculates the normalized frequency of words in positive (good) and negative (bad) reviews.</b><ul><li> It iterates over the words in each set of reviews and divides the count of each word by the total number of words in that set.</li><li> This normalization allows comparison of word frequencies between the two sets.</li></ul>

In [None]:
frequency_good = good_words.map(lambda word: ((word[0],), float(word[1])/num_good_words))
frequency_bad = bad_words.map(lambda word: ((word[0],), float(word[1])/num_bad_words))

<ul><li>This code joins the normalized word frequencies for good and bad reviews based on the words.</li><li> It creates a new RDD joined_frequencies by combining the frequencies of words that appear in both good and bad reviews.</li></ul>

In [None]:
joined_frequencies = frequency_good.join(frequency_bad)

<b>This code calculates the relative difference in word frequencies between positive (good) and negative (bad) reviews and sorts them in descending order.</b><ul><li> It first defines a function to compute the relative difference between two values. </li><li>Then, it maps each word's frequency to a tuple containing the relative difference and the word. </li><li>Finally, it sorts the results based on the relative difference, showing which words are more characteristic of one type of review compared to the other.</li></ul>

In [None]:
import math
def relative_difference(a, b):
    return math.fabs(a-b)/a
result = joined_frequencies.map(lambda f: ((relative_difference(f[1][0], f[1][1]),), f[0][0]) ) \
 .sortByKey(ascending=False)
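
<ul><li>The join and the relative difference can be followed on a handful of made-up frequencies. In this plain-Python sketch the two dictionaries are invented for illustration; relative_difference is the function from the cell above. Note that the measure divides by the first argument only, so it is not symmetric in the two frequencies.</li></ul>

```python
import math

def relative_difference(a, b):
    """Absolute difference of a and b, relative to a (as in the cell above)."""
    return math.fabs(a - b) / a

# Made-up normalized frequencies for positive and negative reviews.
frequency_good = {"great": 0.04, "movie": 0.02, "fun": 0.01}
frequency_bad = {"great": 0.01, "movie": 0.02, "boring": 0.03}

# Inner join: only words present in both dictionaries survive.
joined = {w: (frequency_good[w], frequency_bad[w])
          for w in frequency_good if w in frequency_bad}

# Rank words by relative difference, largest first.
ranked = sorted(((relative_difference(g, b), w) for w, (g, b) in joined.items()),
                reverse=True)
print(ranked)
```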

<b>This code creates a bar chart displaying the relative differences in word frequencies between positive (good) and negative (bad) reviews for the top 7 words.</b><ul><li> It loops through the first 7 elements of the result RDD, where each element holds a relative difference in frequency and the corresponding word.</li><li> For each word, a bar is plotted with the word on the x-axis and the relative difference on the y-axis.</li><li> The plot is titled 'Histogram of 'SENTIMENT ANALYSIS'', with 'WORD' and 'NUMBER OF OCCURRENCES' labels for the x-axis and y-axis, respectively, although the plotted value is the relative difference rather than a raw count.</li></ul>


In [None]:
for movie in result.take(7):
    plt.bar(movie[1],movie[0][0])
    plt.title('Histogram of \'SENTIMENT ANALYSIS\'')
    plt.xlabel('WORD')
    plt.ylabel('NUMBER OF OCCURRENCES')