# Using Text to Group Amazon Reviews

The goal of this demo session is to cluster a set of reviews scraped from the Amazon.com e-commerce platform. We will use the textual content of each review to extract features with the TF-IDF technique. The k-means clustering algorithm then uses these features to group reviews with similar content.

### Downloading the dataset from GitHub

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import os

In [None]:
if os.path.exists('out.json'):
    df = pd.read_json('out.json')
else:
    df = pd.read_json('https://raw.githubusercontent.com/InfoTUNI/joda2022/master/koodiesimerkit/out.json')

df

### Data Preprocessing

First, we preprocess the textual content of each rating. There are many techniques we could use to clean the data, but given the relatively short length of the average rating's content, we only convert the text field of each rating to lowercase.

In [None]:
df['text'] = df['text'].apply(str.lower)

___
Next, let's make sure that we convert the rating attribute to a floating point value.

In [None]:
df['rating'].unique()

In [None]:
# Parse the first two characters of the rating string as a float
conv_rating = lambda rating: float(rating[:2])

df['rating'] = df['rating'].apply(conv_rating)

df.info()

____
There are various methods to extract features out of textual data. Such techniques are known as word representation techniques. Here we list two popular and relatively simple approaches to the word representation problem:

- Among the simplest ones is the Bag-of-Words (BoW) technique. With this approach, a text (such as a sentence or a document) is represented as the bag of its words, disregarding grammar and word order. <br/> Ex: "the quick brown fox jumps over the lazy dog" ==> <br/>{'the':2, 'quick':1, 'brown':1, 'fox':1, 'jumps':1, 'over':1, 'lazy':1, 'dog':1}
- TF-IDF stands for "Term Frequency - Inverse Document Frequency". This approach is built on top of the BoW technique. The count (frequency) of each word in a document is scaled down according to the number of documents in the collection that contain that word. In this way, common words that appear in many documents get a lower tf-idf score.
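To make the two representations concrete, here is a minimal sketch in plain Python. The toy corpus, the helper names `bag_of_words` and `tf_idf`, and the unsmoothed weighting `tf * log(N / df)` are assumptions chosen for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed variant of the formula and normalizes each vector, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration; not taken from the review dataset)
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks",
    "a quick review",
]

def bag_of_words(text):
    """Count how often each whitespace-separated token occurs."""
    return Counter(text.split())

def tf_idf(docs):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    bows = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for bow in bows for word in bow)
    return [
        {word: count * math.log(n_docs / doc_freq[word]) for word, count in bow.items()}
        for bow in bows
    ]

bow = bag_of_words(docs[0])
scores = tf_idf(docs)
print(bow["the"])                           # 2
print(scores[0]["the"] < scores[0]["fox"])  # True
```

Although "the" occurs twice in the first document, it also appears in a second document, so in this sketch it scores lower than "fox", which occurs once and only in that document.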

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
# Suppress NumPy warnings raised when dividing by 0 (e.g., for zero-length vectors)
np.seterr(divide='ignore', invalid='ignore')


In [None]:
vectorizer = TfidfVectorizer()
reprs = vectorizer.fit_transform(df['text'])
vectorizer.get_feature_names_out()

In [None]:
# Convert the sparse matrix to a dense array and normalize each row vector of reprs to unit length.
# Our clustering algorithm uses Euclidean distance to group similar vectors together. For unit-length
# vectors, Euclidean distance ranks pairs in the same order as cosine distance, so the grouping behaves
# as if it were based on cosine distance.
# Note: TfidfVectorizer already applies L2 normalization by default (norm='l2'), so this step mainly acts as a safeguard.
reprs = reprs.toarray()
length = np.sqrt((reprs**2).sum(axis=1))[:,None]
reprs = reprs / length

# Replace Inf and NaN values (produced when a vector has zero length) with 0
reprs = np.nan_to_num(reprs, nan=0.0, posinf=0.0, neginf=0.0)

print('The shape of the matrix containing the word representations is:', reprs.shape)
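
The comment in the cell above relies on a relation between Euclidean and cosine distance for unit-length vectors: the squared Euclidean distance equals twice the cosine distance. The sketch below checks this numerically in plain Python. The helper names and the two example vectors are illustrative assumptions, not part of the original notebook.

```python
import math

def unit(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two arbitrary example vectors (illustrative values)
a, b = [1.0, 2.0, 0.0], [3.0, 0.0, 4.0]
lhs = squared_euclidean(unit(a), unit(b))
rhs = 2.0 * cosine_distance(a, b)
print(math.isclose(lhs, rhs))  # True
```

Because the two quantities differ only by a constant factor, the nearest centroid under Euclidean distance is also the nearest one under cosine distance once all vectors have unit length.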

In [None]:
# print example features
reprs[0][:50]

### Clustering the ratings 

K-means is an unsupervised clustering algorithm. Given a set of **n** vectors, k-means partitions them into **k** groups (clusters) while trying to keep the variance within each group small, that is, the squared distance between each vector and the center of its group. For our use case, we will use the tf-idf representations as the input vectors.
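As a rough illustration of what happens inside k-means, the following sketch implements the basic alternating procedure (often called Lloyd's algorithm) in plain Python. The function names, the random initialization from input points, the fixed iteration count, and the toy data are all assumptions made for this example. scikit-learn's `KMeans` is more elaborate (for instance, it uses a smarter initialization by default).

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = mean(members)
    return labels, centroids

# Two well-separated toy groups (made-up 2-D points)
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster ids themselves are arbitrary: which group is labeled 0 and which is labeled 1 depends on the initialization, so only the grouping is meaningful.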

In [None]:
from sklearn.cluster import KMeans
import numpy as np

k = 6
kmeans = KMeans(n_clusters=k, random_state=11)
clusters = kmeans.fit_predict(reprs)

print(clusters)

In [None]:
df['cluster_id'] = clusters

df['cluster_id'].value_counts()

### Visualizing the results

In [None]:
# Average only the numeric columns; the text column cannot be averaged
average_ratings = df.groupby('cluster_id').mean(numeric_only=True)

ax = average_ratings.plot.bar(figsize=(10,5))

____
Now we will plot each rating on a 2D scatter plot. Currently, the representation vectors have more than 3000 dimensions. We will use the PCA dimensionality reduction technique to reduce the number of dimensions to 2. In the [tutorials folder](https://github.com/InfoTUNI/joda2022/tree/master/tutorials) you may find a more detailed explanation of the PCA technique.
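
For intuition, the sketch below shows one way to compute a PCA projection by hand: center the data, build the covariance matrix, and extract the directions of largest variance with power iteration and deflation. All function names and the toy data are assumptions for illustration. scikit-learn's `PCA` is based on singular value decomposition rather than power iteration, and the sign of each component may differ between implementations.

```python
import math

def center(rows):
    """Subtract the column means so every feature has zero mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)] for i in range(d)]

def top_eigenvector(matrix, n_iter=500):
    """Power iteration: repeatedly multiply and renormalize a vector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue that belongs to v
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue

def pca(rows, n_components=2):
    """Project rows onto the n_components directions of largest variance."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        v, lam = top_eigenvector(cov)
        components.append(v)
        # Deflation: remove the found direction so the next one can be extracted
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components] for row in centered]

# Made-up 3-D points that vary mostly along the first axis
data = [[2.0, 0.1, 0.0], [4.0, -0.1, 0.1], [6.0, 0.2, -0.1], [8.0, -0.2, 0.0]]
reduced = pca(data, n_components=2)
print(len(reduced), len(reduced[0]))  # 4 2
```

Each 3-dimensional point is replaced by 2 coordinates, and the first coordinate carries most of the variance of the original data.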



In [None]:
from sklearn.decomposition import PCA

# Initialize PCA with 2 components, for 2d visualising.
pca = PCA(n_components=2)

In [None]:
# Reduce representation vectors
reducedX = pca.fit_transform(reprs)

In [None]:
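fig, ax = plt.subplots(figsize=(20, 10))

# Generate k colors, one per cluster.
# plt.get_cmap is used because plt.cm.get_cmap was deprecated and later removed from Matplotlib.
cmap = plt.get_cmap('hsv', k)
color_map = lambda cid: cmap(cid)

# Draw all points in a single call, colored by their cluster id
ax.scatter(reducedX[:, 0], reducedX[:, 1], s=50,
           color=[color_map(cid) for cid in clusters])

ax.legend(handles=[
    mpatches.Patch(color=color_map(cid), label='Cluster ' + str(cid)) for cid in range(k)
])
plt.show()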

In [None]:
df[df['cluster_id']==5]

In [None]:
print('Thank you')