Similarity analysis
In this stage, you are asked to perform similarity analysis on the review sentences. The analysis involves segmenting each review body into sentences and encoding each sentence as a vector, so that the distance between any pair of sentences can be computed.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.2-bin-hadoop2.7"
import findspark
findspark.init()

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
    .builder \
    .appName("Python Spark Stage three") \
    .getOrCreate()
music_data = 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Music_v1_00.tsv.gz'
musics= spark.read.csv(music_data,header=True,sep='\t')

In [None]:
import nltk
nltk.download('punkt')
from pyspark.sql.functions import *
from nltk.tokenize import sent_tokenize
def sent_token(s):
    # A null review body yields no sentences (str(None) would produce the sentence "None")
    return sent_tokenize(s) if s is not None else []
seg = udf(sent_token, ArrayType(StringType()))
musics = musics.withColumn('sentences', seg(musics.review_body))
musics.show()
musics_review = musics.select(musics.review_id, musics.star_rating, explode(musics.sentences).alias("sentence"))

For a given product, consider all reviews with a star rating of 4 and above as positive reviews, and all reviews with a star rating of 2 and below as negative reviews. You are asked to pick one product from the top 10 products you found in Stage One. The positive class is constructed by

• extracting all reviews with a rating of 4 and above
• for each review, extracting the review body and segmenting it into sentences.

The negative class is constructed in a similar manner, except that we extract all reviews with a rating of 2 and below.
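
The class construction rule can be sketched in plain Python, independent of Spark. The helper `label_review` and the sample rows are hypothetical and only illustrate the thresholds described above; reviews with a rating of 3 fall into neither class.

```python
def label_review(star_rating):
    """Map a star rating to 'positive', 'negative', or None (rating 3, excluded)."""
    rating = int(star_rating)  # ratings may arrive as strings when a TSV is read without a schema
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return None

# Hypothetical (review_id, star_rating, sentence) rows, for illustration only
rows = [
    ("R1", "5", "Great album."),
    ("R1", "5", "Every track is a gem."),
    ("R2", "3", "It is okay."),
    ("R3", "1", "The disc arrived scratched."),
]

positive = [(rid, s) for rid, stars, s in rows if label_review(stars) == "positive"]
negative = [(rid, s) for rid, stars, s in rows if label_review(stars) == "negative"]
```

Each sentence keeps the id of the review it came from, which is needed later to report where the center sentence and its neighbours originate.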

In [None]:
musics_review = musics_review.filter(length(musics_review.sentence)>1)
musics_review.show()

In [None]:
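# star_rating is read as a string, so cast it before comparing
musicsP = musics_review.filter(musics_review.star_rating.cast('int') >= 4)
musicsN = musics_review.filter(musics_review.star_rating.cast('int') <= 2)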

In [None]:
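# Note: monotonically_increasing_id() gives unique, increasing ids, but they are
# consecutive (0, 1, 2, ...) only when the data sits in a single partition
musicsP = musicsP.withColumn("id", monotonically_increasing_id())
musicsN = musicsN.withColumn("id", monotonically_increasing_id())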

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
def review_embed(rev_text_partition):
    module_url = "https://tfhub.dev/google/universal-sentence-encoder/2" #@param ["https://tfhub.dev/google/universal-sentence-encoder/2", "https://tfhub.dev/google/universal-sentence-encoder-large/3"]
    # use /2 on CPU
    embed = hub.Module(module_url)
    # mapPartitions supplies the elements of a partition as a generator,
    # which TensorFlow cannot consume directly, so collect them into a list first
    rev_text_list = []
    for text in  rev_text_partition:
        for i in text:
            rev_text_list.append(i)
    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        message_embeddings = session.run(embed(rev_text_list))
    return message_embeddings


In [None]:
rev_text_P = musicsP.select('sentence')
rev_clean_text_rdd_P = rev_text_P.rdd.filter(lambda data: data is not None).cache()
rev_clean_text_rdd_P.count()

In [None]:
rev_text_N = musicsN.select('sentence')
rev_clean_text_rdd_N = rev_text_N.rdd.filter(lambda data: data is not None).cache()
rev_clean_text_rdd_N.count()

In [None]:
review_embedding_P = rev_clean_text_rdd_P.mapPartitions(review_embed).cache()
review_embedding_P.count()

In [None]:
review_embedding_N = rev_clean_text_rdd_N.mapPartitions(review_embed).cache()
review_embedding_N.count()

In [None]:
list_P = review_embedding_P.collect()
list_N = review_embedding_N.collect()

We want to find out whether sentences in the same category are closely related to each other. Closeness is measured by the average distance between points in the class. Here, a point is a sentence encoding and the pairwise distance is the cosine distance, computed as “1 − CosineSimilarity”, which takes a value between 0 and 2.
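
As a minimal sketch of this measure, assuming the sentence encodings are available as plain NumPy arrays, the cosine distance and the average over all distinct pairs could be computed as follows. The function names `cosine_distance` and `average_pairwise_distance` are illustrative and not part of the assignment.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 - cosine similarity, in the range [0, 2]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def average_pairwise_distance(points):
    """Average cosine distance over all unordered pairs (i < j) of points."""
    X = np.asarray(points, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    dist = 1.0 - X @ X.T                              # full distance matrix
    upper = np.triu_indices(len(X), k=1)              # indices of pairs with i < j
    return dist[upper].mean()

same = cosine_distance([1, 0], [1, 0])        # same direction
orthogonal = cosine_distance([1, 0], [0, 1])  # perpendicular
opposite = cosine_distance([1, 0], [-1, 0])   # opposite direction
```

The three sample calls cover the extremes of the range: identical directions give 0, perpendicular vectors give 1, and opposite directions give 2.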


In [None]:
def calculate_similarity(rew1,rew2):
    mul=np.dot(rew1,rew2)
    norm=np.linalg.norm(rew1)*np.linalg.norm(rew2)
    return mul/norm

In [None]:
def min_sim(List):
    for i in range(0,len(List)):
        sumOfsim = 0
        for j in range(0,len(List)):
            sumOfsim = sumOfsim + (1 - calculate_similarity(List[i],List[j]))        
        if (i == 0 ):
            min_sum = sumOfsim
            index = 0
        else:
            if (sumOfsim < min_sum):
                min_sum = sumOfsim
                index = i
    return min_sum , index

In [None]:
min_P, index_P = min_sim(list_P)
min_N, index_N = min_sim(list_N)

In [None]:
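def TenNeighbors(List , index):
    distances = []  # avoid shadowing the built-in name `list`
    for i in range(len(List)):
        distances.append(1 - calculate_similarity(List[i], List[index]))
    indexes = np.argsort(distances)
    # the center itself (distance 0) normally sorts first, followed by its 10 nearest neighbours
    return indexes[:11]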

In [None]:
indexes_P = TenNeighbors(list_P , index_P)
indexes_N = TenNeighbors(list_N , index_N)
print(indexes_P,indexes_N)

In [None]:
musicsP.select('review_id','sentence').filter(musicsP.id.isin(indexes_P.tolist())).collect()

In [None]:
musicsN.select('review_id','sentence').filter(musicsN.id.isin(indexes_N.tolist())).collect()

Find the class center and its 10 closest neighbours for the positive and negative classes respectively. We define the class center as the point that has the smallest average distance to the other points in the class. Again, a point is a sentence encoding and the pairwise distance is measured by cosine distance.

The result should show the text of the center sentence, the id of the review it belongs to, and the text of its 10 closest neighbouring sentences with their respective review ids.
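
The definition above can also be expressed with a single distance matrix instead of nested loops. The following is a sketch under the assumption that all encodings of one class fit in memory as a 2-D NumPy array; `class_center_and_neighbours` and the toy data are hypothetical names and values used only for illustration.

```python
import numpy as np

def class_center_and_neighbours(embeddings, k=10):
    """Return the index of the class center and the indices of its k nearest neighbours."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    dist = 1.0 - X @ X.T                              # pairwise cosine distances
    # The smallest total distance is also the smallest average distance,
    # because every row is divided by the same number of points.
    center = int(np.argmin(dist.sum(axis=1)))
    order = np.argsort(dist[center], kind="stable")
    neighbours = [int(i) for i in order if i != center][:k]
    return center, neighbours

# Toy 2-D "encodings" at angles 0, 10, 20 and 90 degrees
angles = np.radians([0, 10, 20, 90])
toy = np.column_stack([np.cos(angles), np.sin(angles)])
center, neighbours = class_center_and_neighbours(toy, k=2)
```

With the returned positions, the sentence text and review id can be looked up in lists kept in the same order as the encodings, which avoids relying on ids that may not be consecutive.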