# Online Review Analysis

In [1]:
#import related packages and library
from pyspark.sql import SparkSession
from pyspark.sql import types
from pyspark.sql.types import StructType, StructField, StringType,IntegerType, FloatType
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.ml.feature import Word2Vec

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
import re
from sklearn.metrics.pairwise import cosine_similarity

VBox()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,application_1558511762013_0001,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
spark = SparkSession \
    .builder \
    .appName("Spark Text Encoder example") \
    .config("spark.sql.shuffle.partitions", "100")\
    .getOrCreate()

VBox()

## Load data from S3

We use review data published by Amazon and releaed in S3 bucket. The data is published as tsv file with many columns. 

In [3]:
# import raw data
rev_data = "s3://amazon-reviews-pds/tsv/amazon_reviews_us_Music_v1_00.tsv.gz"

revs = spark.read.csv(rev_data,header=True,sep='\t')\
            .select('customer_id','product_id','star_rating', 'review_id','review_body')

VBox()

### Overview of the raw data

We show the first five rows of the data to inspect the format.

In [4]:
revs.show(5)

VBox()

+-----------+----------+-----------+--------------+--------------------+
|customer_id|product_id|star_rating|     review_id|         review_body|
+-----------+----------+-----------+--------------+--------------------+
|   10140119|B00TXH4OLC|          5|R3LI5TRP3YIDQL|Love this CD alon...|
|   27664622|B00B6QXN6U|          5|R3LGC3EKEG84PX|This is the album...|
|   45946560|B001GCZXW6|          5| R9PYL3OYH55QY|  Excellent / thanks|
|   15146326|B000003EK6|          3|R3PWBAWUS4NT0Q|Nice variety of c...|
|   16794688|B00N1F0BKK|          5|R15LYP3O51UU9E|Purchased as a gi...|
+-----------+----------+-----------+--------------+--------------------+
only showing top 5 rows

## Reviews by User, by Product

### 1.1 The total number of reviews in the file.

In [5]:
# show the count of lines
revs.count()

VBox()

4751577

### 1.2 Unique Users

The following cell demonstrates usage of the groupBy operation. We try to find the number of reviews of each publisher. The result is showed.

In [6]:
users_revs = revs.groupBy('customer_id').count().withColumnRenamed("count", "c_reviews").cache()
users_revs.count()

VBox()

1940732

#### 1.2.1 The total number of unique users.

In [8]:
users_revs.select('customer_id').distinct().count()

VBox()

1940732

#### 1.2.2 The top 10 users ranked by the number of reviews they publish.

We constract the number of reviews published by users with descending sort and show the top ten users.

In [7]:
users_revs.sort("c_reviews",ascending=False).show(10)

VBox()

+-----------+---------+
|customer_id|c_reviews|
+-----------+---------+
|   50736950|     7168|
|   38214553|     5412|
|   51184997|     5369|
|   18116317|     4222|
|   23267387|     4023|
|   50345651|     3793|
|   14539589|     2896|
|   15725862|     2842|
|   19380211|     2592|
|   20018062|     2568|
+-----------+---------+
only showing top 10 rows

#### 1.2.3 The largest number and the median number of reviews published by a user

We use summary function to extract the largest number of reviews which is the max number of reviews and the median number of reviews in which is 50 percent of total numbers.

In [9]:
users_revs.select('c_reviews').summary().show()

VBox()

+-------+------------------+
|summary|         c_reviews|
+-------+------------------+
|  count|           1940732|
|   mean|2.4483426871922553|
| stddev|15.898599280984767|
|    min|                 1|
|    25%|                 1|
|    50%|                 1|
|    75%|                 2|
|    max|              7168|
+-------+------------------+

### 1.3 Unique Products

The following cell demonstrates usage of the groupBy operation. We try to find the number of reviews of each product. The result is showed.

In [10]:
product_revs = revs.groupBy('product_id').count().withColumnRenamed("count", "p_reviews").cache()
product_revs.count()

VBox()

782326

#### 1.3.1 The number of unique products

In [11]:
product_revs.select('product_id').distinct().count()

VBox()

782326

#### 1.3.2 The top 10 products ranked by the number of reviews they have

We constract the number of reviews belonging to each product with descending sort and show the top ten products.

In [12]:
product_revs_sort = product_revs.sort("p_reviews",ascending=False)
product_revs_sort.show(10)

VBox()

+----------+---------+
|product_id|p_reviews|
+----------+---------+
|B00008OWZG|     3936|
|B0000AGWEC|     3326|
|B00MIA0KGY|     2699|
|B00NEJ7MMI|     2420|
|B000089RVX|     2376|
|B004EBT5CU|     2106|
|B0026P3G12|     2080|
|B00009PRZF|     2026|
|B00004XONN|     1901|
|B00006J6VG|     1793|
+----------+---------+
only showing top 10 rows

#### 1.3.3 The largest number and the median number of reviews a product has

We use summary function to extract the largest number of reviews which is the max number of reviews and the median number of reviews in which is 50 percent of total numbers.

In [13]:
product_revs.select('p_reviews').summary().show()

VBox()

+-------+------------------+
|summary|         p_reviews|
+-------+------------------+
|  count|            782326|
|   mean| 6.073653438592096|
| stddev|25.781486507093998|
|    min|                 1|
|    25%|                 1|
|    50%|                 2|
|    75%|                 4|
|    max|              3936|
+-------+------------------+

## Number of review sentences

### 2.1 Reviews published by users with less than median number of reviews published

In [14]:
users_r = users_revs.filter("c_reviews > 1")
users_r.show(5)

VBox()

+-----------+---------+
|customer_id|c_reviews|
+-----------+---------+
|   16794688|        3|
|   49997672|        5|
|   15963400|        8|
|   35087168|        2|
|   47840769|        4|
+-----------+---------+
only showing top 5 rows

### 2.2 Reviews from products with less than median number of reviews received

In [15]:
product_r = product_revs.filter("p_reviews > 2")
product_r.show(5)

VBox()

+----------+---------+
|product_id|p_reviews|
+----------+---------+
|B001MYIPWS|       43|
|B00MRHANNI|     1513|
|B000ROAL8U|       71|
|B00000G1JD|       13|
|B00SMMJBO4|       57|
+----------+---------+
only showing top 5 rows

### 2.3 Reviews with less than two sentences in the review body

We use the sent_tokenize function to calculate the number of sentences of each review_body.

In [16]:
join = revs.drop('star_rating','review_id')\
            .join(users_r, 'customer_id','inner')\
            .drop('c_reviews')\
            .join(product_r, 'product_id','inner')\
            .drop('p_reviews')

VBox()

In [17]:
def processReview(review_body):
    if review_body is not None:
        review_body = re.sub(r'[^A-Za-z0-9,.?! ]','',re.sub(r'<br />','',review_body))
    return str(review_body)

VBox()

In [18]:
data = join.rdd.map(lambda x: types.Row(
            customer_id = x.customer_id, 
            product_id = x.product_id,
            n_sentence = len(sent_tokenize(processReview(x.review_body)))))\
            .filter(lambda x: x.n_sentence > 1)
data.take(5)

VBox()

[Row(customer_id='13165145', n_sentence=8, product_id='1424338476'), Row(customer_id='40933981', n_sentence=5, product_id='1424338476'), Row(customer_id='22840380', n_sentence=4, product_id='1424338476'), Row(customer_id='52495199', n_sentence=6, product_id='B0000001NY'), Row(customer_id='39694876', n_sentence=4, product_id='B0000001NY')]

### 2.4 The result after removing the above

In [19]:
new_revs = data.toDF().cache()
new_revs.count()
new_revs.show(5)

VBox()

+-----------+----------+----------+
|customer_id|n_sentence|product_id|
+-----------+----------+----------+
|   40933981|         5|1424338476|
|   22840380|         4|1424338476|
|   13165145|         8|1424338476|
|   45351250|         3|B0000001NY|
|   52495199|         6|B0000001NY|
+-----------+----------+----------+
only showing top 5 rows

We combine three dataframe and join them into a dataframe. The result is showed after filtering out the above reviews.

#### 2.4.1 Define functions to calculate the median number of sentences in the reviews

In [20]:
def pairKToV(data):
    return (data[0], [data[1]])

def mergeValue(x,y):
    x.extend(y)
    return x

def median(line):
    key, value = line
    value = int(np.median(value))
    return (key, value)

def sortVaule(line):
    key,value = line
    return value

VBox()

#### 2.4.2 Top 10 users ranked by median number of sentences in the reviews they have published;

In [21]:
user_rdd = new_revs.select('customer_id','n_sentence').rdd\
                    .map(pairKToV)\
                    .reduceByKey(mergeValue)\
                    .map(median)\
                    .sortBy(sortVaule,False)

VBox()

In [22]:
df_user = user_rdd.toDF(['customer_id','median_n_sentence'])
df_user.show(10)

VBox()

+-----------+-----------------+
|customer_id|median_n_sentence|
+-----------+-----------------+
|   25628286|              248|
|   37118941|              227|
|   51865782|              226|
|   29580246|              200|
|   50595705|              196|
|   17821650|              183|
|   43879820|              181|
|   15585529|              177|
|   23717536|              158|
|   46097534|              154|
+-----------+-----------------+
only showing top 10 rows

#### 2.4.3 Top 10 products ranked by median number of sentences in the reviews they have received;

In [23]:
product_rdd = new_revs.select('product_id','n_sentence').rdd\
                    .map(pairKToV)\
                    .reduceByKey(mergeValue)\
                    .map(median)\
                    .sortBy(sortVaule,False)

VBox()

In [24]:
df_product = product_rdd.toDF(['product_id','median_n_sentence'])
df_product.show(10)

VBox()

+----------+-----------------+
|product_id|median_n_sentence|
+----------+-----------------+
|B00LTQ5EVY|              984|
|B000003G29|              319|
|B00T7TYTCK|              252|
|B000BCH5PK|              209|
|B0000C0FEW|              205|
|B0002IJNGC|              171|
|B000RY431G|              166|
|B0000020FQ|              162|
|B009SF2GZU|              161|
|B00AP5M4WM|              160|
+----------+-----------------+
only showing top 10 rows

## Similarity analysis with Sentence Embedding

### 3.1 Positive reviews

#### 3.1.1 Select positive review data

In [25]:
# Filter the product_id = B00006J6VG from stage one and extract all reviews with rate four and above as positive 
# Select the review_id and review_body to create a new dataframe 
product_id = list(product_revs_sort.select("product_id").take(10))
positive = revs.filter("product_id = '%s' and star_rating >= 4" % product_id[9][0])\
                .select("review_id","review_body")
positive.take(1)

VBox()

[Row(review_id='R3R7MRNK5HPULY', review_body="Good Charlotte's Hiatus in late 2009 made this album controversial in the sense that most of the band members were taking credit on the production of this record their best effort so far.")]

In [26]:
# Define a sentence segmentation function that will be used with map function below
# segment sentences with nltk sent_tokenize function
def sentence_segmentation(line):
    review_id, review_body = line
    review_body = sent_tokenize(review_body)
    return types.Row(review_id = review_id,
                     review_body = review_body)

# convert dataframe to rdd format,\
# segment review texts into multiple sentences and map each sentence with review id using flatMap 
# clean the sentences (remove punctuations and special charactors) and filter empty string  
# convert rdd into dataframe format
def get_data_df(data):
    data_df = data.rdd.map(sentence_segmentation)\
                .flatMap(lambda x: [(x.review_id, re.sub(r'[^A-Za-z,.?! ]','',re.sub(r'<br />','',t))) for t in x.review_body])\
                .filter(lambda x: x[1] is not '')\
                .toDF(["review_id","sentence"])
    return data_df

VBox()

In [27]:
# Extract positive review sentences with review id
positive_data_df = get_data_df(positive).repartition(1000)
positive_data_df.show(5)

VBox()

+--------------+--------------------+
|     review_id|            sentence|
+--------------+--------------------+
|R2VF8VV93YU5WX|A new begging is ...|
|R2SBFGR3NQ98EX|I just wanna scre...|
|R3OPD0QICE1VUJ|Ok well Good Char...|
|R20W9X7H6YOEKC|I really like thi...|
|R39NLSHHX99MNO|i give my props t...|
+--------------+--------------------+
only showing top 5 rows

#### 3.1.2 Convert text to vectors

In [28]:
# Define a function to convert text to vectors using tensorflow
def review_embed(rev_text_partition):
    module_url = "https://tfhub.dev/google/universal-sentence-encoder/2" 
    embed = hub.Module(module_url)
    rev_text_list = [text for text in rev_text_partition]
    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        embeddings = session.run(embed(rev_text_list))
    return embeddings

# convert texts to embeddings using mapPartitions together with tensorflow
# only select review sentence column and convert to rdd format
def get_sentence_embeddings(data_df):
    embeddings = data_df.select("sentence").rdd.map(lambda row: str(row)).mapPartitions(review_embed).cache()
    return embeddings 

VBox()

In [10]:
# Convert sentences into vectors
positive_embeddings = get_sentence_embeddings(positive_data_df)
positive_embeddings.take(1)

VBox()

[array([-2.16722675e-03,  1.35801230e-02, -6.41393960e-02, -6.07020222e-02,
       -4.73380387e-02, -3.45437117e-02,  1.12012094e-02,  2.75574643e-02,
        4.56465967e-02,  5.55792376e-02,  7.59716406e-02, -4.37334888e-02,
       -2.09210347e-02,  8.61330926e-02,  3.03803161e-02,  3.16098944e-04,
       -7.79331177e-02,  2.45641414e-02,  6.12513535e-02, -2.20675785e-02,
        4.34800871e-02, -8.60454962e-02,  3.92793268e-02,  2.74228621e-02,
        1.74887441e-02,  5.07017672e-02, -4.83388826e-02, -3.73001844e-02,
        1.24751106e-02, -2.57413406e-02,  3.22404914e-02,  7.06426576e-02,
        1.97638236e-02, -1.39964307e-02,  6.49445057e-02, -2.41210684e-02,
       -4.08982188e-02,  2.41026152e-02,  1.40444199e-02, -5.25969490e-02,
        1.96626168e-02,  1.81780048e-02, -3.65597382e-02, -8.88614857e-04,
       -7.92986527e-02,  3.66979949e-02,  1.32006826e-02, -5.46549112e-02,
       -1.47038847e-02, -1.33299595e-02,  2.61292867e-02,  4.63009924e-02,
       -3.33528519e-02, 

#### 3.1.3 Create all posible sentence vector pairs

In [31]:
# Create all possible vector pairs, return: index1_vector1 - index2_vector2 pairs
def Create_vector_pairs(embeddings):  
    n = embeddings.count()
    vec_idx = embeddings.map(lambda v: v.tolist()).zipWithIndex() # add index to embedding rdd
    vec_idx_1 = vec_idx.toDF(['vector_a','index1']).coalesce(100) # convert to dataframe
    # map all possible index pairs between vectors
    idx_pair = vec_idx.flatMap(lambda x: [(t,[x[1], x[0]]) for t in range(n-1) if t > x[1]])
    vec_idx_pair = idx_pair.map(lambda x: (x[0], x[1][0], x[1][1])) # map key value pairs
    vec_idx_2 = vec_idx_pair.toDF(['index1','index0', 'vector']) # convert to dataframe
    vector_pairs = vec_idx_2.join(vec_idx_1, 'index1', 'inner') # join vectors using paired indices
    return vector_pairs

VBox()

In [76]:
p_vector_pairs = Create_vector_pairs(positive_embeddings).repartition(1000)
p_vector_pairs.show(5)

VBox()

+------+------+--------------------+--------------------+
|index1|index0|              vector|            vector_a|
+------+------+--------------------+--------------------+
|   270|   174|[0.04034562036395...|[0.06719374656677...|
|   270|   235|[0.05574147775769...|[0.06719374656677...|
|   293|   149|[0.02960725314915...|[0.03029567562043...|
|   222|    32|[0.00909040216356...|[0.02643903344869...|
|   293|   268|[0.07378774136304...|[0.03029567562043...|
+------+------+--------------------+--------------------+
only showing top 5 rows

#### 3.1.4 Calculate cosine distances of each sentence vector pairs

In [13]:
# Define a function to calculate cosine distance, return index pairs with calculated distance
def calculate_cosine_distance(line):
    index1,index0,vector0,vector1 = line 
    # convert vector to numpy array and reshape the vector
    vector0 = np.array(vector0).reshape(1,np.array(vector0).shape[0])
    vector1 = np.array(vector1).reshape(1,np.array(vector1).shape[0])
    # calculate cosine distance between two vectors
    cosine_distance = 1-cosine_similarity(vector0,vector1)
    return (index0, index1, float(cosine_distance))

VBox()

In [14]:
# Convert vector pairs into rdd format and calculate cosine distance using map function with user defined function above
p_distance_pairs = p_vector_pairs.rdd.map(calculate_cosine_distance).repartition(1000).cache()
p_distance_pairs.count()
p_distance_pairs.take(5)

VBox()

[(41, 1636, 0.5753198721058271), (1530, 1636, 0.8080835573178853), (1236, 1636, 0.9001891774222969), (725, 1636, 0.8196900275843757), (762, 1636, 0.6249230001518802)]

#### 3.1.5 Calculate average distance of each sentence vector

In [15]:
# calculate average inside the map function
def average(line): 
    key,value = line
    value = np.mean(value)
    return (key,value)

# sum the values inside the reduceByKey function
def mergeValue(x,y): 
    x.extend(y)
    return x
 
# Main function to calculate average distance
# 1. flatMap index and distance value; 
# 2.reduce by index to calculate sum of the distance 
# 3. map to calculate average of the distance 
def Average_distance(vector_pairs):   
    all_distances = vector_pairs.flatMap(lambda x: [(key, [x[2]]) for key in [x[0], x[1]]])
    avg_distance = all_distances.reduceByKey(mergeValue).map(average)
    return avg_distance

VBox()

In [16]:
p_avg_distance = Average_distance(p_distance_pairs).repartition(1000)
p_avg_distance.take(5)

VBox()

[(7811, 0.6929154163259487), (4355, 0.5457921000166329), (2627, 0.7314162652081974), (8675, 0.6394836722994217), (4643, 0.8198567378104787)]

#### 3.1.6 Extract indexes of center point and top10 similar sentences based on average distances

In [17]:
# 1.Sort average distances select the index of minimum value as center point; 
# 2.Filter out all distances with the center point;
# 3.Sort by values and select 10 indices with smallest values;
# Return center point index and top 10 closest indices. 
def Center_Top10_Index(distance_pairs, avg_distance):   

    center_idx = avg_distance.sortBy(lambda x: x[1]).first()[0]
    distances_from_center = distance_pairs.filter(lambda x: x[0]==center_idx or x[1]==center_idx)\
                                            .sortBy(lambda x: x[2])
    # select top 10 indices
    def top10_index(line):
        if line[0] != center_idx:
            idx = line[0]
        if line[1] != center_idx:
            idx = line[1]
        return idx 

    top10_idx = distances_from_center.zipWithIndex().filter(lambda x:x[1]<10).keys().map(top10_index).collect()    

    return center_idx, top10_idx

VBox()

In [18]:
p_center_idx, p_top10_idx = Center_Top10_Index(p_distance_pairs, p_avg_distance)
print('Index of center sentences: ', p_center_idx)
print('Indexes of top10 similar sentences', p_top10_idx)

VBox()

Index of center sentences:  7846
Indexes of top10 similar sentences [3824, 4116, 7444, 8730, 1174, 3212, 44, 6164, 4411, 9958]

#### 3.1.7 Extract review sentences and corresponding review id based on index

In [19]:
# Convert positive data into rdd format and add indices
p_rdd = positive_data_df.rdd.zipWithIndex().cache() 

# Select review id and sentences from all positive data based on indices.
positive_center = p_rdd.filter(lambda x:x[1] == p_center_idx).keys().collect()
positive_similar_top10 = p_rdd.filter(lambda x:x[1] in p_top10_idx).keys().toDF(["review_id", "review_sentence"])

VBox()

#### 3.1.8 The results of positive reviews -Stage 3

In [21]:
# print out the center sentence of positive review
print("POSITIVE - center sentence: \n --review_id--  ----review_sentence----")
for center in positive_center:
    print(center[0], center[1])

# print out top10 similar centences     
positive_similar_top10.show(truncate=False)

VBox()

POSITIVE - center sentence: 
 --review_id--  ----review_sentence----
R1QUF2LEOUW3NY The songs are so totally rad.
+--------------+----------------------------------------------+
|review_id     |review_sentence                               |
+--------------+----------------------------------------------+
|R32U2AEZWXU5X0|Almost all the songs on here are great.       |
|R2USLMK3AI02O7|The songs are so fun and catchy.              |
|R3FYTSF6HUXQAD|All their songs are awsome!!                  |
|R27HL4EDA5S9D3|The songs are great.                          |
|R3M3TEOCSKPLZW|The songs are great.                          |
|R3OEA639KL4BNJ|Their music rocks and their songs are awesome.|
|R2ZMZHGGKCH19V|and their other songs r so awesome!           |
|R1ZAV0KB4C8FY5|The songs are all great!                      |
|R2MOCYGXTDHVEE|its songs are amazing!                        |
|R3OMI3V71OYD0Q|But theres some great songs so far.           |
+--------------+--------------------------------------

### 3.2 Negative reviews

#### 3.2.1 Select negative review data

In [24]:
# Filter the product_id = B00006J6VG from stage one and extract all reviews with rate two and below as negative
# Select the review_id and review_body to create a new dataframe 
product_id = list(product_revs_sort.select("product_id").take(10))
negative = revs.filter("product_id = '%s' and star_rating <= 2" % product_id[9][0])\
                .select("review_id","review_body")
negative.take(1)

VBox()

[Row(review_id='R2F6WAB05QY47M', review_body='I would rather go get a coffee enema than sit through this formulaic mainstream bs. This is the garbage that Rolling Stone magazine and the record industry is peddling to the pimple faced emo kid who goes to buy the album from the band he just saw on MTV(sic).')]

In [25]:
# Extract negative review sentences with review id
negative_data_df = get_data_df(negative)
negative_data_df.show(5)

VBox()

+--------------+--------------------+
|     review_id|            sentence|
+--------------+--------------------+
|R2F6WAB05QY47M|I would rather go...|
|R2F6WAB05QY47M|This is the garba...|
|R2LKFD8AW9N76U|Im listening to t...|
|R2LKFD8AW9N76U|If this was five ...|
|R2LKFD8AW9N76U|That being said t...|
+--------------+--------------------+
only showing top 5 rows

#### 3.2.2 Convert text to vectors

In [26]:
# Convert sentences into vectors
negative_embeddings = get_sentence_embeddings(negative_data_df)
negative_embeddings.take(1)

VBox()

[array([ 0.03918048,  0.03105982,  0.03235339,  0.01579139, -0.02399617,
        0.03708548, -0.03718183,  0.0382824 ,  0.00989581,  0.04228171,
        0.03556698,  0.01481014, -0.01122175,  0.03181321,  0.03372408,
        0.0145869 ,  0.0759725 ,  0.02310706,  0.01451915, -0.03981126,
       -0.02837968, -0.01391553,  0.05004405,  0.0392571 ,  0.05040007,
       -0.03248855, -0.03459056, -0.01369295,  0.07842439, -0.02500102,
        0.0817729 , -0.0012256 ,  0.08245012, -0.07296173,  0.0755223 ,
        0.00716521, -0.05865304, -0.01330853, -0.0327848 , -0.02093267,
        0.04406163, -0.0459072 , -0.04588456,  0.0143362 , -0.04337143,
        0.02115093,  0.0480163 , -0.05982846,  0.05947711, -0.03726853,
       -0.04188664, -0.03226871, -0.03640187, -0.06396554,  0.05971521,
        0.0502161 ,  0.03946881, -0.01615177,  0.06396203,  0.02251009,
       -0.00783303, -0.0532188 , -0.07300755,  0.02608615, -0.03282412,
        0.02758986,  0.01002579,  0.01320507,  0.01328125,  0.0

#### 3.2.3 Create all posible sentence vector pairs

In [27]:
n_vector_pairs = Create_vector_pairs(negative_embeddings).repartition(1000)
n_vector_pairs.show(5)

VBox()

+------+------+--------------------+--------------------+
|index1|index0|              vector|            vector_a|
+------+------+--------------------+--------------------+
|  1289|  1110|[-0.0205773524940...|[0.03598988428711...|
|  1289|   216|[0.03776592761278...|[0.03598988428711...|
|  2515|   479|[-0.0095890052616...|[-0.0074121477082...|
|   228|   101|[-0.0052495957352...|[-0.0397398769855...|
|   274|    22|[0.04166766256093...|[-0.0104109682142...|
+------+------+--------------------+--------------------+
only showing top 5 rows

#### 3.2.4 Calculate cosine distances of each sentence vector pairs

In [50]:
# Convert vector pairs into rdd format and calculate cosine distance using map function with user defined function above
n_distance_pairs = n_vector_pairs.rdd.map(calculate_cosine_distance).repartition(1000)
n_distance_pairs.take(5)

VBox()

[(214, 487, 0.9181180023957322), (1596, 2706, 0.7323821704009261), (1519, 2323, 0.5459362687151805), (783, 823, 0.9301455548585617), (1601, 2828, 0.8547286859674169)]

#### 3.2.5 Calculate average distance of each sentence vector

In [29]:
n_avg_distance = Average_distance(n_distance_pairs).repartition(1000)
n_avg_distance.take(5)

VBox()

[(2003, 0.7302155212502809), (2867, 0.6318418534381124), (563, 0.5851419175200856), (2625, 0.7975127724099806), (897, 0.6211570917270516)]

#### 3.2.6 Extract indexes of center point and top10 similar sentences based on average distances

In [30]:
n_center_idx, n_top10_idx = Center_Top10_Index(n_distance_pairs, n_avg_distance)
print('Index of center sentences: ', n_center_idx)
print('Indices of top10 similar sentences', n_top10_idx)

VBox()

Index of center sentences:  2331
Indices of top10 similar sentences [1760, 2886, 1320, 980, 1919, 34, 1097, 1823, 2887, 441]

#### 3.2.7 Extract review sentences and corresponding review id based on index

In [31]:
# Convert positive data into rdd format and add indices
n_rdd = negative_data_df.rdd.zipWithIndex().cache() 

# Select review id and sentences from all positive data based on indices.
negative_center = n_rdd.filter(lambda x:x[1] == n_center_idx).keys().collect()
negative_similar_top10 = n_rdd.filter(lambda x:x[1] in n_top10_idx).keys().toDF(["review_id", "review_sentence"])

VBox()

#### 3.2.8 The results of negative reviews -Stage 3

In [32]:
# print out the center sentence of negative review
print("NEGATIVE - center sentence: \n --review_id--  ----review_sentence----")
for center in negative_center:
    print(center[0], center[1])

# print out top10 similar centences     
negative_similar_top10.show(truncate=False)

VBox()

NEGATIVE - center sentence: 
 --review_id--  ----review_sentence----
R3MGK1ZXH61TIS its not even poppunk.
+--------------+-----------------------------------------------------------------------------------------------------------+
|review_id     |review_sentence                                                                                            |
+--------------+-----------------------------------------------------------------------------------------------------------+
|R2L59UF4845DWY|But there are also a lot of bands in this genre, that are almost as bad as Good Charlotte.                 |
|R3UKZ7D8HCZFM |It is pop punk and i despise pop punk.                                                                     |
|R10EFJ8V3F35IC|They are the ones to be repsect, not the immitators like this band is!I hate Good Charlotte with a passion.|
|R38HLY2JJ8UX9Q|They are one of the worst bands i have ever heard along with simple plan and new found glory.              |
|R34EYTYWCR0SBL|I d

## Similarity analysis with Spark Word2Vec

### Spark Word2Vec
#### Combine positive and negative review sentences as training data of word2vec model.

In [28]:
# Tokenize positive sentences
positive_tokens = positive_data_df.select("sentence") \
                    .withColumn("tokens", F.array_remove(F.split(positive_data_df.sentence, " "), ''))\
                    .select("tokens")
# Tokenize negative sentences
negative_tokens = negative_data_df.select("sentence") \
                    .withColumn("tokens", F.array_remove(F.split(negative_data_df.sentence, " "), ''))\
                    .select("tokens")
# Combine positive and negative tokens 
all_tokens = positive_tokens.union(negative_tokens)
all_tokens.show(5)

VBox()

+--------------------+
|              tokens|
+--------------------+
|[Gotta, love, moc...|
|[If, you, like, g...|
|[Thats, what, the...|
|[Id, suggest, get...|
|[there, music, is...|
+--------------------+
only showing top 5 rows

#### Training Word2Vec Model

In [29]:
# Learn a mapping from words to vectors using Spark built in Word2Vec package
word2vec = Word2Vec(vectorSize=300, inputCol="tokens", outputCol="vectors")
w2v_model = word2vec.fit(all_tokens)

VBox()

### 4.1 Positive Reviews

#### 4.1.1 Use the positive data tokenized in the previous section.

#### 4.1.2 Convert review sentences into vectors using Word2Vec

In [30]:
# Convert sentences into embeddings with trained model
p_w2v_embedding = w2v_model.transform(positive_tokens).select("vectors").rdd\
                    .map(lambda vector: np.reshape(vector, -1))
p_w2v_embedding.take(1)

VBox()

[array([ 3.63519285e-02, -5.77875241e-03,  1.65805502e-02,  5.03333164e-03,
        1.15673000e-02,  3.08192484e-03, -8.01540338e-03, -3.85059640e-02,
        6.31333515e-03,  1.47057926e-02, -1.59396507e-02,  4.74832701e-03,
       -2.84482211e-02, -6.25622962e-03,  1.01936368e-02,  1.01769037e-02,
       -1.81037342e-02, -2.43037464e-03,  1.94003641e-02,  2.12085394e-02,
       -5.09521663e-03, -1.01973064e-02,  1.26099032e-02,  2.02603471e-02,
       -1.11511819e-02,  1.90538198e-02,  7.44140348e-03, -7.64141195e-03,
       -2.21740372e-02, -9.33796149e-03,  2.82737277e-03, -1.12647215e-02,
       -5.03928196e-03,  2.88877971e-03,  2.75453828e-02, -5.03110037e-02,
       -8.88692737e-03,  2.08724383e-02, -2.14774642e-02, -1.96734097e-02,
       -4.91627641e-03, -6.89913840e-03,  2.23920759e-02,  4.79810406e-04,
       -2.99670090e-02, -9.77530722e-03,  1.42662827e-02,  2.47128502e-02,
        2.06708552e-02, -2.90772542e-03,  7.86442086e-03, -2.74854298e-03,
        5.82515299e-03, 

#### 4.1.3 Create all posible sentence vector pairs

In [31]:
p_vector_pairs_stage4 = Create_vector_pairs(p_w2v_embedding).repartition(1000)
p_vector_pairs_stage4.show(5)

VBox()

+------+------+--------------------+--------------------+
|index1|index0|              vector|            vector_a|
+------+------+--------------------+--------------------+
|  3835|  3612|[0.02099321743783...|[0.00902031800326...|
|  6259|  2430|[0.01515886603322...|[0.03315434470557...|
|  9351|  8204|[0.04008870945544...|[0.01479616293606...|
|  6259|  3077|[0.03038003092321...|[0.03315434470557...|
|  6722|  4118|[0.04488368514770...|[0.01928596779841...|
+------+------+--------------------+--------------------+
only showing top 5 rows

#### 4.1.4 Calculate cosine distances of each sentence vector pairs

In [32]:
# Convert vector pairs into rdd format and calculate cosine distance using map function with user defined function 
p_distance_pairs_stage4 = p_vector_pairs_stage4.rdd.map(calculate_cosine_distance).repartition(1000)
p_distance_pairs_stage4.take(5)

VBox()

[(4978, 6645, 1.1788256284138603), (3288, 7720, 1.1126417999652365), (3560, 8070, 0.4815487471868033), (5535, 7838, 0.6604533377275099), (2888, 5018, 0.6154258497284311)]

#### 4.1.5 Calculate average distance of each sentence vector

In [33]:
p_avg_distance_stage4 = Average_distance(p_distance_pairs_stage4)
p_avg_distance_stage4.take(5)

VBox()

[(7776, 0.5529389196620226), (576, 0.7279303023526731), (8064, 0.7246439862766181), (2016, 0.7171658470750248), (3168, 0.5158682697053306)]

#### 4.1.6 Extract indexes of center point and top10 similar sentences based on average distances

In [34]:
p_center_idx_stage4, p_top10_idx_stage4 = Center_Top10_Index(p_distance_pairs_stage4, p_avg_distance_stage4)
print('Index of center sentences: ', p_center_idx_stage4)
print('Indexes of top10 similar sentences', p_top10_idx_stage4)

VBox()

Index of center sentences:  5350
Indexes of top10 similar sentences [2394, 4915, 6130, 3486, 915, 8694, 5770, 9243, 8246, 6344]

#### 4.1.7 Extract review sentences and corresponding review id based on index

In [35]:
# Convert positive data into rdd format and add indices
p_rdd = positive_data_df.rdd.zipWithIndex().cache() 
p_rdd.count()
# Select review id and sentences from all positive data based on indices.
positive_center_stage4 = p_rdd.filter(lambda x:x[1] == p_center_idx_stage4).keys().collect()
positive_similar_top10_stage4 = p_rdd.filter(lambda x:x[1] in p_top10_idx_stage4).keys()\
                                    .toDF(["review_id", "review_sentence"])

VBox()

#### 4.1.8 The results of positive reviews -Stage 4

In [38]:
# print out the center sentence of positive review
print("POSITIVE - center sentence: \n --review_id--  ----review_sentence----")
for center in positive_center_stage4:
    print(center[0], center[1])

# print out top10 similar centences     
positive_similar_top10_stage4.show(truncate=False)

VBox()

POSITIVE - center sentence: 
 --review_id--  ----review_sentence----
R1OKU7XPP54KCZ These are some truly talented artists of the punk rock genre.
+--------------+--------------------+
|     review_id|     review_sentence|
+--------------+--------------------+
|R1NC7QCQBST3UK|The prechorus was...|
|R1TPOP45R0I93J|Good Charlotte is...|
|R1OGG8CH4AXDIJ|it is really good...|
|R2D9GJKVK42XDV|When I was feelin...|
|R2Z83HAYSJ4UP5|Forget the image ...|
|R37U2GX41AYFFX|Good Charlotte Ro...|
|R1JB8P9ZKKKIPB|Id have to say i ...|
|R1UPGK03N6GNUY|They have come so...|
|R1RE7EIJ0T5I9W|As a very big fan...|
| RW4HAE4HGJ69L|and even if you a...|
+--------------+--------------------+

### 4.2 Negative Reviews

#### 4.2.1 Use the negative data tokenized in the previous section.

#### 4.2.2 Convert review sentences into vectors using Word2Vec

In [39]:
# Convert sentences into embeddings with trained model
n_w2v_embedding = w2v_model.transform(negative_tokens).select("vectors").rdd\
                    .map(lambda vector: np.reshape(vector, -1)).repartition(1000)
n_w2v_embedding.take(1)

VBox()

[array([-1.69599495e-02, -5.12491265e-02, -6.25911766e-04,  5.48528104e-02,
       -1.54560875e-02,  8.14741037e-03,  8.03272026e-03, -1.66637256e-02,
        6.99344116e-03, -9.05940033e-03, -1.15462499e-02, -6.10355093e-03,
       -1.91547970e-02, -9.03689185e-03,  1.70491321e-02,  1.18451234e-02,
       -1.05184469e-02,  1.66268921e-03,  2.80658697e-02,  2.02178253e-02,
       -1.66754667e-02,  6.18930471e-03,  7.72561555e-03,  7.92322494e-03,
       -1.51904402e-02,  1.89348883e-02, -2.87616234e-03,  7.76005142e-03,
        9.91778072e-03,  3.19145278e-03, -5.06206801e-02,  3.76133726e-02,
       -1.06819226e-02,  1.74676806e-02, -3.07303072e-02, -6.61912893e-03,
       -2.54300962e-02,  2.47634992e-02,  8.48270437e-03,  1.11153115e-02,
       -9.04286653e-03, -1.87829216e-02,  3.26201069e-02,  2.59971206e-02,
       -2.69255582e-03,  8.84344340e-03,  1.02517667e-02,  1.11472048e-02,
        3.12291499e-02, -2.00567253e-02,  2.61095933e-03, -1.92558724e-02,
       -3.39948643e-03, 

#### 4.2.3 Create all posible sentence vector pairs

In [40]:
n_vector_pairs_stage4 = Create_vector_pairs(n_w2v_embedding).repartition(1000)
n_vector_pairs_stage4.show(5)

VBox()

+------+------+--------------------+--------------------+
|index1|index0|              vector|            vector_a|
+------+------+--------------------+--------------------+
|  1444|   629|[0.03841874026693...|[0.03723155241459...|
|  1444|  1078|[-0.0279344741332...|[0.03723155241459...|
|   884|   489|[-0.0330888023599...|[-0.0301497410317...|
|   520|   136|[0.01623912721333...|[0.02871226106071...|
|  2984|   877|[-0.0122688690475...|[0.02366009080672...|
+------+------+--------------------+--------------------+
only showing top 5 rows

#### 4.2.4 Calculate cosine distances of each sentence vector pairs

In [41]:
# Convert vector pairs into rdd format and calculate cosine distance using map function with user defined function 
n_distance_pairs_stage4 = n_vector_pairs_stage4.rdd.map(calculate_cosine_distance).repartition(1000)
n_distance_pairs_stage4.take(5)

VBox()

[(1469, 2756, 0.5633809421939431), (1682, 2652, 0.3888757253101396), (837, 2594, 0.3687945195999507), (1467, 1706, 0.5388735899180961), (182, 2435, 0.810293379594003)]

#### 4.2.5 Calculate average distance of each sentence vector

In [42]:
n_avg_distance_stage4 = Average_distance(n_distance_pairs_stage4).repartition(1000)
n_avg_distance_stage4.take(5)

VBox()

[(1091, 0.4415327604587037), (2819, 0.7159338492188829), (515, 0.7690280451760207), (227, 0.472166127970975), (2531, 0.5836842320410964)]

#### 4.2.6 Extract indexes of center point and top10 similar sentences based on average distances

In [43]:
n_center_idx_stage4, n_top10_idx_stage4 = Center_Top10_Index(n_distance_pairs_stage4, n_avg_distance_stage4)
print('Index of center sentences: ', n_center_idx_stage4)
print('Indexes of top10 similar sentences', n_top10_idx_stage4)

VBox()

Index of center sentences:  1140
Indexes of top10 similar sentences [212, 1888, 1818, 726, 959, 2835, 1819, 2548, 1323, 3003]

#### 4.2.7 Extract review sentences and corresponding review id based on index

In [44]:
# Convert positive data into rdd format and add indices
n_rdd = negative_data_df.rdd.zipWithIndex().cache() 

# Select review id and sentences from all positive data based on indices.
negative_center_stage4 = n_rdd.filter(lambda x:x[1] == n_center_idx_stage4).keys().collect()
negative_similar_top10_stage4 = n_rdd.filter(lambda x:x[1] in n_top10_idx_stage4).keys()\
                                    .toDF(["review_id", "review_sentence"])

VBox()

#### 4.2.8 The results of negative reviews -Stage 4

In [46]:
# print out the center sentence of negative review
print("NEGATIVE - center sentence: \n --review_id--  ----review_sentence----")
for center in negative_center_stage4:
    print(center[0], center[1])

# print out top10 similar centences     
negative_similar_top10_stage4.show()

VBox()

NEGATIVE - center sentence: 
 --review_id--  ----review_sentence----
R25XBVBWHQ5Y2Y I dont care how punk they supposedly look I dont define myself by the way I dress or the music I listen to although others obsess over it, I dont care if they SAID theyre punk or not, thats irrelevant, its the fact that they get raving reviews for lyrics shallower than a hot tub and music that has no emotional substance to it and in which the guitars are twodimensional if they play them at all.BRWhat this band needs to do to be any goodBR Lose the corny name bands named after someones mothers name isnt the in thingBR Take instrument playing BR Learn that theres more than one chord or riff possible to play on the guitarBR Lose the egotistical attitude and stop making their fans so narrowminded musicallyBR Write some lyrics with depthUntil then, all this deserves is one star, because thats the lowest rating I can give.Try these rock albums for some good musicNumetalgeneral rockBRFlaw  Endangered SpeciesBR