PKA-MOVIE-RECOMMENDER

- Importing necessary libraries and setting up PySpark environment.
- Creating a SparkContext object.

In [None]:
# import findspark
# findspark.init()
import os
import sys
import pyspark as ps
import warnings
from pyspark.sql import SQLContext

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

try:
    # create SparkContext on all CPUs available: in my case I have 4 CPUs on my laptop
    sc = ps.SparkContext('local[*]')
    # sqlContext = SQLContext(sc)
    print("Just created a SparkContext")
except ValueError:
    warnings.warn("SparkContext already exists in this scope")



- Importing necessary libraries for unit testing.
- Defining a test case class `TestRdd`.
- Defining a test method `test_take` to test the `take` function of an RDD.
- Running the tests using `unittest.TextTestRunner`.


In [None]:
import unittest
import sys

class TestRdd(unittest.TestCase):
    def test_take(self):
        # Named `rdd` rather than `input` to avoid shadowing the built-in
        rdd = sc.parallelize([1,2,3,4])
        self.assertEqual([1,2,3,4], rdd.take(4))

def run_tests():
    suite = unittest.TestLoader().loadTestsFromTestCase(TestRdd)
    unittest.TextTestRunner(verbosity=1, stream=sys.stderr).run(suite)

run_tests()


In [None]:
help(sc)


- Importing the `json` module.
- Defining lists `fields`, `fields2`, `fields3`, and `fields4` containing field names.
- Defining a function `validate` to check if all fields in `fields2` are present in a line.
- Loading a JSON file `movies.json` as an RDD.
- Mapping each line of the RDD to a JSON object and filtering lines based on the `validate` function.
- Caching the filtered RDD so that later actions reuse it instead of re-reading and re-parsing the file.


In [None]:
import json

fields = ['product_id', 'user_id', 'score', 'time']
fields2 = ['product_id', 'user_id', 'review', 'profile_name', 'helpfulness', 'score', 'time']
fields3 = ['product_id', 'user_id', 'time']
fields4 = ['user_id', 'score', 'time']

def validate(line):
    for field in fields2:
        if field not in line:
            return False
    return True

reviews_raw = sc.textFile('data/movies.json')
reviews = reviews_raw.map(lambda line: json.loads(line)).filter(validate)
reviews.cache()
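
The parse-and-validate step can be tried on a few in-memory lines without Spark. The following is a minimal sketch in plain Python; the two sample records and the name `REQUIRED_FIELDS` are made up for illustration and are not taken from `movies.json`.

```python
import json

# Same field list as `fields2` in the cell above.
REQUIRED_FIELDS = ['product_id', 'user_id', 'review', 'profile_name',
                   'helpfulness', 'score', 'time']

def validate(record):
    # Keep a record only if every required field is present as a key.
    return all(field in record for field in REQUIRED_FIELDS)

# Hypothetical sample lines: the first is complete, the second lacks 'review'.
sample_lines = [
    '{"product_id": "P1", "user_id": "U1", "review": "Great", '
    '"profile_name": "Ann", "helpfulness": "1/2", "score": 5.0, "time": 1000}',
    '{"product_id": "P2", "user_id": "U2", "profile_name": "Bob", '
    '"helpfulness": "0/0", "score": 1.0, "time": 2000}',
]

parsed = [json.loads(line) for line in sample_lines]
valid = [record for record in parsed if validate(record)]
print(len(valid))
```

Only the first record survives, because the second one has no `review` key. Note that `json.loads` raises an exception on a malformed line, so this sketch (like the cell above) assumes every line is valid JSON.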



In [None]:
reviews.take(1)



- Counting the number of unique movies (distinct `product_id` values).
- Counting the number of unique users (distinct `user_id` values).
- Counting the total number of reviews.
- Printing the total number of reviews, the number of unique movies, and the number of unique users in a formatted string.


In [None]:
# distinct() avoids materializing every group of reviews just to count the keys
num_movies = reviews.map(lambda entry: entry['product_id']).distinct().count()
num_users = reviews.map(lambda entry: entry['user_id']).distinct().count()
num_entries = reviews.count()
print(str(num_entries) + " reviews of " + str(num_movies) + " movies by " +
      str(num_users) + " different people.")
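
Counting unique movies and users amounts to counting distinct keys. A minimal sketch in plain Python, using made-up records (the sample IDs are hypothetical):

```python
# Hypothetical review records, for illustration only.
sample_reviews = [
    {'product_id': 'A', 'user_id': 'u1'},
    {'product_id': 'A', 'user_id': 'u2'},
    {'product_id': 'B', 'user_id': 'u1'},
]

# A set keeps each ID once, so its size is the number of distinct IDs.
num_movies = len({r['product_id'] for r in sample_reviews})
num_users = len({r['user_id'] for r in sample_reviews})
num_entries = len(sample_reviews)

print("%d reviews of %d movies by %d different people."
      % (num_entries, num_movies, num_users))
```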



- Mapping each review to a tuple where the key is the product_id and the value is 1.
- Reducing by key to sum the ones, giving the number of reviews per movie.
- Keeping only movies with more than 20 reviews.
- Mapping each tuple to a new tuple where the key is the review count and the value is the product_id.
- Sorting the results by the key in descending order.


In [None]:
# Count reviews per movie. Keys are kept as 1-tuples because later cells index them with [0].
r1 = reviews.map(lambda r: ((r['product_id'],), 1))
avg3 = r1.reduceByKey(lambda x, y: x + y)
avg3 = avg3.filter(lambda x: x[1] > 20)
# Swap to (count,) -> (product_id,) so that sortByKey orders by count.
avg3 = avg3.map(lambda x: ((x[1],), x[0])) \
           .sortByKey(ascending=False)
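
The map, reduce-by-key, filter, and sort steps of this cell can be mirrored in plain Python to check the logic on a small input. This is only a sketch: the review pairs and the threshold `MIN_REVIEWS` are hypothetical stand-ins, not values from the dataset.

```python
from collections import Counter

# Hypothetical (product_id, user_id) review pairs, for illustration only.
sample_reviews = [('A', 'u1'), ('A', 'u2'), ('A', 'u3'),
                  ('B', 'u1'), ('B', 'u4'), ('C', 'u2')]

MIN_REVIEWS = 1  # stands in for the threshold of 20 used in the notebook

# Equivalent of mapping each review to (key, 1) and reducing by key:
# every review contributes exactly one to its product's count.
counts = Counter(product_id for product_id, _ in sample_reviews)

# Equivalent of the filter step: keep products with more than MIN_REVIEWS reviews.
popular = {p: n for p, n in counts.items() if n > MIN_REVIEWS}

# Equivalent of sorting by count in descending order.
ranking = sorted(popular.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Product `C` is dropped by the threshold, and the remaining products are ordered by their review counts (3 for `A`, 2 for `B`).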



- Iterating over the first 10 elements of `avg3`.
- Printing a formatted string containing the URL of the movie on Amazon and the number of people who watched it.


In [None]:
for movie in avg3.take(10):
    print("http://www.amazon.com/dp/" + movie[1][0] + " WATCHED BY : " +
          str(movie[0][0]) + " PEOPLE")



- Mapping each review to a tuple where the key is the user_id and the value is 1.
- Reducing by key to sum the ones, giving the number of reviews per user.
- Keeping only users with more than 20 reviews.
- Mapping each tuple to a new tuple where the key is the review count and the value is the user_id.
- Sorting the results by the key in descending order.
- Iterating over the first 10 elements of `avg2`.
- Printing a formatted string containing the user ID and the number of movies reviewed by that user.


In [None]:
r2 = reviews.map(lambda ru: ((ru['user_id'],), 1))
avg2 = r2.reduceByKey(lambda x, y: x + y)
avg2 = avg2.filter(lambda x: x[1] > 20)
avg2 = avg2.map(lambda x: ((x[1],), x[0])) \
           .sortByKey(ascending=False)
for user in avg2.take(10):
    # user[1][0] is a user ID, not a product ID, so it is not appended to a product URL
    print("USER " + user[1][0] + " WATCHED : " +
          str(user[0][0]) + " MOVIES")



- Filtering reviews to find entries where "George" is in the profile name.
- Printing the count of filtered entries.
- Iterating over the filtered reviews and printing information about each review, including rating, helpfulness, Amazon product link, summary, and review text.


In [None]:
# Has someone named George written a review?
filtered = reviews.filter(lambda entry: "George" in entry['profile_name'])
print("Found " + str(filtered.count()) + " entries.\n")
for review in filtered.collect():
    print("Rating: " + str(review['score']) + " and helpfulness: " +
          str(review['helpfulness']))
    print("http://www.amazon.com/dp/" + review['product_id'])
    # 'summary' is not among the validated fields, so fall back to an empty string
    print(review.get('summary', ''))
    print(review['review'])
    print("\n")



- Mapping each review to a dictionary containing the score and the time converted to a datetime object.
- Importing necessary libraries for data visualization.
- Defining a function `parser` to parse datetime values.
- Sampling the `timeseries_rdd` to reduce data size for plotting.
- Converting the sampled RDD to a DataFrame.
- Printing the first 3 rows of the DataFrame.
- Converting the 'score' column to float64 type.
- Setting the 'time' column as the index of the DataFrame.
- Resampling the 'score' column by year ('Y'), month ('M'), and quarter ('Q') and plotting the results.


In [None]:
from datetime import datetime

timeseries_rdd = reviews.map(lambda entry: {'score': entry['score'],
                                             'time': datetime.fromtimestamp(entry['time'])})

import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

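def parser(x):
    # Not used below: expects strings such as '1-01' and parses them as '1901-01'
    return datetime.strptime('190'+x, '%Y-%m')

# Aim for roughly 20,000 sampled reviews; the fraction must not exceed 1.0
sample = timeseries_rdd.sample(withReplacement=False, fraction=min(1.0, 20000.0/num_entries), seed=1134)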

timeseries = pd.DataFrame(sample.collect(), columns=['score', 'time'])
print(timeseries.head(3))

timeseries.score = timeseries.score.astype('float64')
#timeseries.time = timeseries.time.astype('datetime64')
timeseries.set_index('time', inplace=True)

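# Number of sampled reviews per year, month, and quarter, each in its own figure
Rsample = timeseries.score.resample('Y').count()
Rsample.plot()
plt.show()

Rsample2 = timeseries.score.resample('M').count()
Rsample2.plot()
plt.show()

Rsample3 = timeseries.score.resample('Q').count()
Rsample3.plot()
plt.show()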
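
What the resampling counts can be illustrated with the standard library alone: each timestamp is assigned to a year, month, or quarter bucket, and the bucket sizes are tallied. The timestamps below are hypothetical, and the sketch converts them in UTC for reproducibility, whereas the cell above uses the local time zone.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical Unix timestamps in seconds, for illustration only:
# 2000-01-01, 2000-01-02, 2001-01-01, and 2001-02-02 (all UTC).
sample_times = [946684800, 946771200, 978307200, 981072000]

dates = [datetime.fromtimestamp(t, tz=timezone.utc) for t in sample_times]

# One bucket per calendar year, per (year, month), and per (year, quarter).
per_year = Counter(d.year for d in dates)
per_month = Counter((d.year, d.month) for d in dates)
per_quarter = Counter((d.year, (d.month - 1) // 3 + 1) for d in dates)

print(per_year)
print(per_month)
print(per_quarter)
```

One difference from pandas is worth keeping in mind: `resample(...).count()` also emits the empty periods between the first and last timestamp with a count of 0, while a `Counter` only holds buckets that received at least one entry.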



- Iterating over the first 4 elements of the `avg` RDD.
- Creating a bar plot for each movie with its average rating.
- Adding title, xlabel, and ylabel to the plot.
- Displaying each plot individually.


In [None]:
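# `avg` is not defined in any earlier cell. Assumed here: (average score,) -> (product_id,) for movies with more than 20 reviews.
avg = reviews.map(lambda r: ((r['product_id'],), (float(r['score']), 1))) \
             .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
             .filter(lambda x: x[1][1] > 20) \
             .map(lambda x: ((x[1][0] / x[1][1],), x[0])) \
             .sortByKey(ascending=False)
for movie in avg.take(4):
    plt.bar(movie[1][0], movie[0][0])
    plt.title('Histogram of \'AVERAGE RATING OF MOVIE\'')
    plt.xlabel('MOVIE')
    plt.ylabel('AVGRATING')
    plt.show()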


In [None]:
for movie in avg2.take(3):
    plt.bar(movie[1][0], movie[0][0])
    plt.title('Histogram of \'NUMBER OF MOVIES REVIEWED BY USER\'')
    plt.xlabel('USER')
    plt.ylabel('MOVIE COUNT')
    plt.show()


In [None]:
for movie in avg3.take(4):
    plt.bar(movie[1][0], movie[0][0])
    plt.title('Histogram of \'MOVIES REVIEWED BY NUMBER OF USERS\'')
    plt.xlabel('MOVIE')
    plt.ylabel('USER COUNT')
    plt.show()



- Importing necessary modules and libraries for collaborative filtering with ALS.
- Defining a function `get_hash` to hash strings.
- Mapping each review to a tuple containing hashed user_id, hashed product_id, and rating.
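- Splitting the data into training and testing sets with a hash-based criterion: entries whose hashed user and product IDs sum to a value ending in 0 or 1 go to the test set (roughly 20%), and the rest go to the training set.
- Caching the training data so that repeated passes over it do not recompute it.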
- Printing the number of samples in the training and testing sets.


In [None]:
from pyspark.mllib.recommendation import ALS
from numpy import array
import hashlib
import math

def get_hash(s):
    return int(hashlib.sha1(s).hexdigest(), 16) % (10 ** 8)

# Input format: [user, product, rating]
ratings = reviews.map(lambda entry: tuple([get_hash(entry['user_id'].encode('utf-8')),
                                           get_hash(entry['product_id'].encode('utf-8')),
                                           int(entry['score'])]))

train_data = ratings.filter(lambda entry: ((entry[0] + entry[1]) % 10) >= 2)
test_data = ratings.filter(lambda entry: ((entry[0] + entry[1]) % 10) < 2)
train_data.cache()

print("Number of train samples: " + str(train_data.count()))
print("Number of test samples: " + str(test_data.count()))
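
The hash-based split can be checked without Spark. The sketch below reuses `get_hash` from the cell above on hypothetical user and product IDs (the `user…`/`movie…` names are invented for illustration) and applies the same modulo rule.

```python
import hashlib

def get_hash(s):
    # Map a byte string to a stable integer in [0, 10**8).
    return int(hashlib.sha1(s).hexdigest(), 16) % (10 ** 8)

# Hypothetical (user_id, product_id) pairs, for illustration only.
pairs = [('user%d' % i, 'movie%d' % (i % 7)) for i in range(1000)]

hashed = [(get_hash(u.encode('utf-8')), get_hash(p.encode('utf-8')))
          for u, p in pairs]

# Same rule as in the notebook: last digit of the sum decides the side.
train = [h for h in hashed if (h[0] + h[1]) % 10 >= 2]
test = [h for h in hashed if (h[0] + h[1]) % 10 < 2]

print(len(train), len(test))
```

Because the rule depends only on the hashed IDs, the same (user, product) pair lands on the same side in every run, which a random split would not guarantee without a fixed seed. Reducing the hash modulo `10 ** 8` means that two distinct IDs can, in principle, collide and be treated as the same user or product.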



- Building a recommendation model using Alternating Least Squares (ALS) with specified parameters: `rank` and `numIterations`.
- Defining a function `convertToFloat` to convert lines to float values.
- Mapping test data to contain only user and product IDs.
- Making predictions using the trained ALS model on the test data.
- Joining true ratings with predicted ratings.
- Calculating Mean Squared Error (MSE) for evaluation.


In [None]:
# Build the recommendation model using Alternating Least Squares
from math import sqrt

rank = 20
numIterations = 20
model = ALS.train(train_data, rank, numIterations)

def convertToFloat(lines):
    returnedLine = []
    for x in lines:
        returnedLine.append(float(x))
    return returnedLine

# Evaluate the model on test data
unknown = test_data.map(lambda entry: (int(entry[0]), int(entry[1])))
predictions = model.predictAll(unknown).map(lambda r: ((int(r[0]), int(r[1])), r[2]))

true_and_predictions = test_data.map(lambda r: ((int(r[0]), int(r[1])), r[2])).join(predictions)

# Keep the predictions as floats: casting them to int would truncate them and distort the error
MSE = true_and_predictions.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()
print("Mean Squared Error = " + str(MSE))
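
For reference, the mean squared error is the average of the squared differences between true and predicted ratings. The following sketch computes it in plain Python on hypothetical (true, predicted) pairs, and also shows how much the value shifts if the predictions are truncated to integers first.

```python
# Hypothetical (true rating, predicted rating) pairs, for illustration only.
pairs = [(5, 4.5), (3, 3.5), (1, 2.0)]

# MSE on the raw floating-point predictions.
squared_errors = [(true - predicted) ** 2 for true, predicted in pairs]
mse = sum(squared_errors) / len(squared_errors)

# Root mean squared error, in the same units as the ratings.
rmse = mse ** 0.5

# The same computation with predictions truncated to int, for comparison.
truncated_errors = [(true - int(predicted)) ** 2 for true, predicted in pairs]
mse_truncated = sum(truncated_errors) / len(truncated_errors)

print(mse, rmse, mse_truncated)
```

On this toy input the two variants disagree (0.5 versus about 0.667), which is why ALS predictions, being real-valued, are best compared to the true ratings without integer conversion.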


In [None]:
true_and_predictions.take(10)



- Filtering reviews to obtain good reviews with a score of 5.0 and bad reviews with a score of 1.0.
- Splitting the review texts into individual words and mapping each word to a tuple with a count of 1.
- Reducing by key to count the occurrences of each word.
- Keeping only words that occur more than `min_occurrences` times.
- Calculating the total number of good and bad words.
- Calculating the frequency of each word in good and bad reviews by dividing the count of occurrences by the total number of words.


In [None]:
min_occurrences = 10

good_reviews = reviews.filter(lambda line: line['score'] == 5.0)
bad_reviews = reviews.filter(lambda line: line['score'] == 1.0)

good_words = good_reviews.flatMap(lambda line: line['review'].split(' '))
num_good_words = good_words.count()
good_words = good_words.map(lambda word: (word.strip(), 1)) \
                       .reduceByKey(lambda a, b: a + b) \
                       .filter(lambda word_count: word_count[1] > min_occurrences)

bad_words = bad_reviews.flatMap(lambda line: line['review'].split(' '))
num_bad_words = bad_words.count()
bad_words = bad_words.map(lambda word: (word.strip(), 1)) \
                     .reduceByKey(lambda a, b: a + b) \
                     .filter(lambda word_count: word_count[1] > min_occurrences)

# Calculate the word frequencies
frequency_good = good_words.map(lambda word: ((word[0],), float(word[1]) / num_good_words))
frequency_bad = bad_words.map(lambda word: ((word[0],), float(word[1]) / num_bad_words))



- Joining the word frequencies of good and bad reviews.
- Defining a function `relative_difference` to calculate the relative difference between two values.
- Mapping each word to the relative difference of its frequency in the good and bad reviews.
- Sorting the result in descending order to get the most significant expressions for characterizing either a positively or negatively rated movie.
- Taking the top 50 significant expressions for characterizing movie reviews.


In [None]:
# Join the word frequencies of the good and bad reviews
joined_frequencies = frequency_good.join(frequency_bad)

# Calculate the relative difference of each word frequency in the good and bad reviews
import math

def relative_difference(a, b):
    return math.fabs(a - b) / a

result = joined_frequencies.map(lambda f: ((relative_difference(f[1][0], f[1][1]),), f[0][0])) \
                           .sortByKey(ascending=False)

result.take(50)
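
The frequency comparison can be traced by hand on a tiny input. The sketch below is plain Python with hypothetical review texts; `frequencies` is a helper introduced only for this illustration, while `relative_difference` matches the function in the cell above.

```python
import math
from collections import Counter

# Hypothetical review texts, for illustration only.
good_texts = ['great great movie', 'great fun']
bad_texts = ['boring movie', 'movie boring great']

def frequencies(texts):
    # Relative frequency of each word among all words of the given texts.
    words = [w for text in texts for w in text.split(' ')]
    counts = Counter(words)
    return {w: n / len(words) for w, n in counts.items()}

def relative_difference(a, b):
    return math.fabs(a - b) / a

freq_good = frequencies(good_texts)
freq_bad = frequencies(bad_texts)

# Like the RDD join, keep only words that occur in both sets of reviews.
shared = set(freq_good) & set(freq_bad)

ranked = sorted(((relative_difference(freq_good[w], freq_bad[w]), w)
                 for w in shared), reverse=True)
print(ranked)
```

Two properties of this measure follow from the definition. Words that appear only in good or only in bad reviews drop out at the join. And because the difference is divided by the frequency in good reviews, a word that is more common in good reviews scores below 1, whereas a word that is more common in bad reviews can score arbitrarily high, so the top of the ranking tends to favor words typical of bad reviews.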


In [None]:
for entry in result.take(7):
    # entry is ((relative difference,), word)
    plt.bar(entry[1], entry[0][0])

    plt.title('Histogram of \'SENTIMENT ANALYSIS\'')
    plt.xlabel('WORD')
    plt.ylabel('RELATIVE DIFFERENCE')
    plt.show()
