<a href="https://colab.research.google.com/github/EmanueleGiavardi/AMD_project/blob/main/reviews_link_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q kaggle pyspark

In [2]:
import pyspark
from pyspark.sql import functions as F
import pandas as pd
import numpy as np
import os
import time
from google.colab import files
from collections import Counter

In [3]:
# please upload your kaggle.json file here
files.upload()
!ls -lha kaggle.json
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
-rw-r--r-- 1 root root 72 Jun 10 07:34 kaggle.json


In [4]:
!kaggle datasets download -d "mohamedbakhet/amazon-books-reviews"
!unzip amazon-books-reviews.zip
!rm -r amazon-books-reviews.zip

Dataset URL: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews
License(s): CC0-1.0
Downloading amazon-books-reviews.zip to /content
 99% 1.05G/1.06G [00:06<00:00, 159MB/s]
100% 1.06G/1.06G [00:06<00:00, 184MB/s]
Archive:  amazon-books-reviews.zip
  inflating: Books_rating.csv        
  inflating: books_data.csv          


In [5]:
start_time = time.time()

In [6]:
spark = pyspark.sql.SparkSession.builder.master("local[*]").appName("reviews_link_analysis").getOrCreate()
sc = spark.sparkContext

In [7]:
books_rating_df = spark.read.csv("Books_rating.csv", header=True, inferSchema=True)
books_data_df = spark.read.csv("books_data.csv", header=True, inferSchema=True)

# **Link Analysis: finding influential/authoritative users**

## **Dataset subsampling**

In the original dataset, reviews are ordered by the book being reviewed. This means we have $K$ rows related to book $B$, followed by $K'$ rows for book $B'$, and so on.
There are two main strategies for subsampling the dataset:

1. **Selecting the first N rows without any shuffling**: \\
This approach favors the inclusion of reviews for the same book, resulting in a smaller number of distinct books but a higher number of user-user connections.

2. **Random sampling, where each row is included in the sample with probability p (using a random seed for reproducibility)**: \\
This method leads to a broader variety of books in the sample, but it becomes less likely that two users have reviewed the same book, thus reducing the number of connected users.

Although the second approach may be more statistically appropriate in terms of representative sampling, the first strategy is chosen for this project. The goal is to explore links between users who have reviewed the same book, and the first strategy yields a denser graph with a more substantial number of user connections.
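
The trade-off between the two strategies can be illustrated on toy data. The following is a minimal pure-Python sketch, not the notebook's Spark code: the synthetic `rows` list and the helper `count_user_pairs` are hypothetical and exist only for this illustration.

```python
import random
from collections import defaultdict
from itertools import combinations

def count_user_pairs(rows):
    """Count unordered pairs of users that share at least one book."""
    users_by_book = defaultdict(set)
    for book, user in rows:
        users_by_book[book].add(user)
    pairs = set()
    for users in users_by_book.values():
        pairs.update(combinations(sorted(users), 2))
    return len(pairs)

# Hypothetical toy data: 100 books with 10 reviews each, ordered by book,
# every review written by a different user.
rows = [(f"book{b}", f"user{b}_{u}") for b in range(100) for u in range(10)]

n = 50
first_n = rows[:n]                                # strategy 1: first N rows, no shuffling

rng = random.Random(42)                           # fixed seed for reproducibility
p = n / len(rows)
sampled = [r for r in rows if rng.random() < p]   # strategy 2: keep each row with probability p

pairs_first_n = count_user_pairs(first_n)
pairs_sampled = count_user_pairs(sampled)
print(pairs_first_n, pairs_sampled)
```

On this toy data the first strategy keeps five complete books, so every pair of co-reviewers survives, while random sampling rarely keeps two reviews of the same book.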

In [8]:
ratings_count = books_rating_df.count()

# keeps the first (sampling_frac*100)% of the lines
sampling_frac = 0.01
books_rating_df_sub = books_rating_df.limit(int(sampling_frac * ratings_count))
print(f"sample has {books_rating_df_sub.count()} lines")

sample has 30000 lines


## **Graph creation**

- **Nodes** represent users.
- **Edges** represent links between users who have reviewed the same book.

The graph is **directed**: a directed edge from $u2$ to $u1$ exists if both $u1$ and $u2$ reviewed the same book, and the **helpfulness score** of $u1$'s review is higher than that of $u2$'s review for the same book.

We focus only on a subset of all available books. In the *books_data* CSV file, the ```categories``` column contains:

- In most cases, a string representing a list of categories, e.g., ```"['Religion', 'Politics', ...]"```
- In some cases, invalid or irrelevant data, such as ```None``` values or links to the Google Books Store

To ensure the use of meaningful data, we only consider books for which the ```categories``` value matches the regular expression ```^\[.*\]$```, allowing us to parse it as an actual list of strings.

This decision is motivated by the intent to apply **Topic-Sensitive PageRank** to the graph. In this context, the "topic" associated with each node (i.e., user) corresponds to their *preferred literary genre*, defined as the most frequently reviewed genre by that user.

```
# graph creation pseudocode:

for each book b (with well-formatted categories):
    for each (u1, u2) such that both u1 and u2 reviewed b:
        if (helpfulness(u1, b)) > (helpfulness(u2, b)):
            add edge from u2 to u1
```
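
As a concrete counterpart to the pseudocode, here is a small pure-Python sketch on hypothetical toy reviews (the `reviews` list and its scores are invented for illustration; the notebook itself builds the edges with Spark joins).

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy reviews: (title, user_id, helpfulness score in [0, 1])
reviews = [
    ("Book A", "u1", 0.9),
    ("Book A", "u2", 0.5),
    ("Book A", "u3", 0.5),
    ("Book B", "u2", 0.8),
    ("Book B", "u1", 0.1),
]

# group the reviews by book
by_title = defaultdict(list)
for title, user, score in reviews:
    by_title[title].append((user, score))

edges = []
for title, entries in by_title.items():
    # every ordered pair of distinct reviews of the same book
    for (user_a, score_a), (user_b, score_b) in permutations(entries, 2):
        if score_a > score_b:
            edges.append((user_b, user_a))   # edge from the less helpful to the more helpful reviewer

print(sorted(edges))
# [('u1', 'u2'), ('u2', 'u1'), ('u3', 'u1')]
```

Reviews with equal scores (`u2` and `u3` on "Book A") produce no edge, and two users can be linked in both directions when they outrank each other on different books.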


---------------

Given $R'$, the review table $R$ restricted to the Title, User_id and helpfulness attributes, we construct the following new table by joining $R'$ with itself:
$$
J = \sigma_{\text{helpfulness}_1 > \text{helpfulness}_2}(R' \Join_{\text{Title}} R')
$$

This table has the schema
```
root
 |-- User_id_1
 |-- Title
 |-- User_id_2
 |-- Helpfulness_1
 |-- Helpfulness_2
```
It is constructed such that both ```User_id_1``` and ```User_id_2``` reviewed the book named ```Title``` and ```Helpfulness_1``` $>$ ```Helpfulness_2```.

From this table, we build the directed graph based on the criteria described above.


**NOTE**:
> The helpfulness scores of each review are not on a common scale as they appear in formats like "0/0", "4/5", "8/10", "78/82", and so on.
For the purpose of this project, we simply convert each "X/Y" string into a float by evaluating the fraction.
However, a finer interpretation would treat "X/Y" as "number of people who found the review helpful / total number of voters".
Under this assumption, it's important to consider not only the fraction itself but also how many people voted, since a high ratio based on few votes may be less reliable than a slightly lower ratio based on many votes.
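
To make the difference concrete, the sketch below compares the plain ratio with one possible vote-aware alternative, add-one (Laplace) smoothing. The smoothed score is only an illustrative assumption and is not used anywhere in this notebook; both function names are hypothetical.

```python
def helpfulness_ratio(s: str) -> float:
    """Plain ratio X/Y, with 0.0 when Y is 0 (the convention adopted in this project)."""
    x, y = (int(part) for part in s.split("/"))
    return x / y if y != 0 else 0.0

def smoothed_helpfulness(s: str, prior_up: float = 1.0, prior_total: float = 2.0) -> float:
    """Hypothetical alternative: (X + 1) / (Y + 2), which pulls scores based on few votes towards 0.5."""
    x, y = (int(part) for part in s.split("/"))
    return (x + prior_up) / (y + prior_total)

for s in ["0/0", "1/1", "4/5", "78/82"]:
    print(s, round(helpfulness_ratio(s), 3), round(smoothed_helpfulness(s), 3))
```

With the plain ratio, "1/1" outranks "78/82"; with the smoothed score the order is reversed, because a single vote carries little evidence.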

In [9]:
from pyspark.sql.functions import split, col, when

def get_helpfulness_score(col_name: str):
    """
    Returns the numerical helpfulness score associated with the ```"X/Y"``` string of the input column.

    Parameters
    ----------
    col_name: str
        the name of the column

    Returns
    ----------
    k: pyspark.sql.Column
        a column holding the numerical ratio associated with ```"X/Y"``` if Y is not 0, ```0.0``` otherwise
    """

    # split the "X/Y" string once and cast both parts to float
    parts = split(col(col_name), "/")
    num = parts.getItem(0).cast("float")
    den = parts.getItem(1).cast("float")
    return when(den != 0, num / den).otherwise(0.0)



R = books_rating_df_sub.select(["Title", "User_id", "review/helpfulness"]).filter(col('User_id').isNotNull())
R1 = R.alias("R1")
R2 = R.alias("R2")

# J schema: | Title | User_id_1 | User_id_2 | helpfulness_1 | helpfulness_2 |
J = R1.join(R2, on="Title") \
      .filter(col("R1.User_id") != col("R2.User_id")) \
      .select(
          col("R1.Title").alias("Title"),
          col("R1.User_id").alias("User_id_1"),
          col("R2.User_id").alias("User_id_2"),
          get_helpfulness_score("R1.review/helpfulness").alias("helpfulness_1"),
          get_helpfulness_score("R2.review/helpfulness").alias("helpfulness_2")
      ).filter(col("helpfulness_1") > col("helpfulness_2"))

In [10]:
# finding books with well-formatted "categories" value

from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, StringType

genres_schema = ArrayType(StringType())

# output example: Row(Title='Wonderful Worship in Smaller Churches', Genres=['Religion', 'Politics']),
genres = books_data_df.filter((col("categories").isNotNull()) & (col("categories").rlike(r"^\[.*\]$"))) \
        .withColumn("Genres", from_json("categories", genres_schema)).select("Title", "Genres")

# filtering J keeping only books with well-formatted categories
# J_filtered schema: | Title | User_id_1 | User_id_2 | helpfulness_1 | helpfulness_2 | Genres
J_filtered = J.join(genres, on="Title")

The idea now is to associate an **increasing integer value from $0$ to $N-1$** to each one of the $N$ user ids. In this way:
- An edge is simply represented as a pair of integers $(i, j)$, where $i$ is the integer value of the user with the outgoing connection and $j$ is the integer value of the user with the incoming connection.
- PageRank values are stored in a simple array $V$ of $N$ elements, such that $V[i]$ is the PageRank value of the user associated with integer value $i$.
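
A minimal pure-Python sketch of this encoding follows. The user ids are hypothetical, and sorting them is an assumption made only to get a deterministic mapping; any bijection between user ids and $0, \dots, N-1$ works equally well.

```python
# Hypothetical edges expressed with raw user ids: (source, destination)
edges_by_id = [("A3", "A1"), ("A2", "A1"), ("A1", "A2")]

# assign an integer in 0..N-1 to every distinct user id
user_ids = sorted({user for edge in edges_by_id for user in edge})
index_of = {user_id: i for i, user_id in enumerate(user_ids)}
N = len(index_of)

# edges as pairs of integers (i, j): i has the outgoing link, j the incoming one
edges_idx = [(index_of[src], index_of[dst]) for src, dst in edges_by_id]

# PageRank vector: V[i] is the score of the user mapped to integer i
V = [1.0 / N] * N

print(index_of)    # {'A1': 0, 'A2': 1, 'A3': 2}
print(edges_idx)   # [(2, 0), (1, 0), (0, 1)]
```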



In [11]:
# NOTE: (User_id_1 U User_id_2) excludes all the users that reviewed a certain book by themselves!

unique_users = J_filtered.select(col("User_id_1").alias("User_id")) \
    .union(J_filtered.select(col("User_id_2").alias("User_id"))) \
    .distinct()
N = unique_users.count()

# [('user1_id', integer1), ... ('userN_id', integerN)]
user_ids_rdd = unique_users.rdd.map(lambda row: row["User_id"]).zipWithIndex()

There are two ways of creating the $(i, j)$ pairs:
1. Build a DataFrame with schema ```[User_id, Integer_value]``` from ```user_ids_rdd``` and use join operations to attach the ```Integer_value``` to both ```User_id_1``` and ```User_id_2```.
2. Convert ```user_ids_rdd``` into a dictionary that is broadcast to every computing node of the cluster, so that the ```Integer_value``` of a given ```User_id``` can be retrieved with a simple key lookup.

The number $N$ of distinct users is not expected to be very large, so we assume that a dictionary with $N$ entries fits in main memory and adopt the second approach.

In [12]:
user_ids_dict = user_ids_rdd.collectAsMap()
user_ids_bdcast = sc.broadcast(user_ids_dict)

In [13]:
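# list of pairs (node_src, node_dst): the edge goes from User_id_2 (less helpful review) to User_id_1 (more helpful review)
edges = J_filtered.rdd.map(lambda row : (user_ids_bdcast.value[row["User_id_2"]], user_ids_bdcast.value[row["User_id_1"]]))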

In [14]:
# list of pairs (node_src, [iterable of dst nodes])
adjacency_list = edges.groupByKey()

In [15]:
#print(f"Transition graph has {N} nodes and {edges.count()} edges")

## **Graph Analysis**

Here we look for spider traps (in their simplest form, i.e. two nodes that link only to each other) and dead ends (nodes with no outgoing edges) in the graph.
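
Both notions can be checked by hand on a small graph. The following pure-Python sketch uses a hypothetical edge list and is independent of the Spark cells in this section; it treats a node as a dead end when it never appears as the source of an edge.

```python
from collections import defaultdict

# Hypothetical directed edges (src, dst) over integer node ids
edges = [(0, 1), (1, 0), (2, 0), (3, 4), (4, 3), (3, 5)]

out_neighbours = defaultdict(set)
nodes = set()
for src, dst in edges:
    out_neighbours[src].add(dst)
    nodes.update((src, dst))

# dead ends: nodes that never appear as the source of an edge
dead_ends = sorted(n for n in nodes if not out_neighbours.get(n))

# two-node spider traps: u and v link to each other and to nobody else
two_node_traps = sorted(
    (u, v)
    for u, targets in out_neighbours.items()
    for v in targets
    if u < v and targets == {v} and out_neighbours.get(v) == {u}
)

print(dead_ends)        # [5]
print(two_node_traps)   # [(0, 1)]
```

Nodes 3 and 4 also link to each other, but node 3 has a second outgoing edge, so the pair is not a trap.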

In [16]:
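# dead ends: nodes with no outgoing edges
# NOTE: adjacency_list is built with groupByKey on the edge sources, so it only contains nodes
# with at least one outgoing edge; nodes that appear only as destinations are not inspected here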
dead_ends = adjacency_list.filter(lambda x: len(x[1]) == 0).count()
print(f"{dead_ends} dead ends have been found in this graph")

0 dead ends have been found in this graph


In [17]:
from pyspark.sql.functions import least, greatest

# spider traps
edges_df = edges.toDF(["src", "dst"])

# tuples (u1, u2) such that u1 <==> u2
mutual_edges = edges_df.alias("e1") \
    .join(edges_df.alias("e2"),
          (col("e1.src") == col("e2.dst")) &
          (col("e1.dst") == col("e2.src"))) \
    .select(
        least(col("e1.src"), col("e1.dst")).alias("u1"),
        greatest(col("e1.src"), col("e1.dst")).alias("u2")
    ).distinct()

In [18]:
out_degrees = adjacency_list.map(lambda x: (x[0], len(x[1]))).toDF(["node", "out_degree"])

# u1 | u2 | out_degree_1 | out_degree_2
joined = mutual_edges \
    .join(out_degrees.withColumnRenamed("node", "u1").withColumnRenamed("out_degree", "out_degree_1"), on="u1") \
    .join(out_degrees.withColumnRenamed("node", "u2").withColumnRenamed("out_degree", "out_degree_2"), on="u2")

# spider trap pairs are those u1 <==> u2 such that
# - u1 has only ONE outgoing edge (the one connecting it to u2)
# - u2 has only ONE outgoing edge (the one connecting it to u1)
# the number of spider trap pairs in this context is expected to be small, so we can collect them in main memory
spider_traps = joined.filter((col("out_degree_1") == 1) & (col("out_degree_2") == 1)).select("u1", "u2").collect()

In [19]:
print(f"{len(spider_traps)} spider traps have been found in this graph: ")
for t in spider_traps: print(f"{t[0]} <==> {t[1]}")

1 spider traps have been found in this graph: 
1605 <==> 3883


## **Topic vector creation**

Topic-Sensitive PageRank requires a topic assigned to each node of the graph. In this case, each user is associated with the literary genre they reviewed most often.
This information is stored in an array whose $i$-th element is the "preferred" literary genre of the user mapped to integer $i$.
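
The following pure-Python sketch shows how such an array can be built with `collections.Counter`. The `genres_by_user` dictionary is hypothetical toy data, keyed directly by the integer assigned to each user.

```python
from collections import Counter

# Hypothetical input: integer user index -> list of genre lists, one per reviewed book
genres_by_user = {
    0: [["Fiction", "Drama"], ["Religion"], ["Religion", "Politics"]],
    1: [["History"], ["Fiction"]],
    2: [],
}
N = 3

preferred = [None] * N
for i, book_genres in genres_by_user.items():
    # flatten the per-book genre lists into a single list
    flat = [genre for genres in book_genres for genre in genres]
    if flat:
        # most frequent genre; among ties, Counter returns the one encountered first
        preferred[i] = Counter(flat).most_common(1)[0][0]

print(preferred)   # ['Religion', 'History', None]
```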

In [20]:
T1 = books_rating_df_sub.join(unique_users, "User_id").select(col("User_id"), col("Title")).alias("T1")
T2 = genres.alias("T2")

genres_per_review = T1.join(T2, on="Title").select("User_id", "T1.Title", "Genres")
genres_per_review.take(5)

[Row(User_id='A31WHFXF6T06DR', Title='Isaac Asimov: Master of Science Fiction (People to Know)', Genres=['Juvenile Nonfiction']),
 Row(User_id='A14B2NR2XELLQ0', Title='Isaac Asimov: Master of Science Fiction (People to Know)', Genres=['Juvenile Nonfiction']),
 Row(User_id='A7FAM0VNL7F4B', Title='Iridescent Soul', Genres=['Hummingbirds']),
 Row(User_id='A2YZWB84LMU5CC', Title='Iridescent Soul', Genres=['Hummingbirds']),
 Row(User_id='AUNJJ9J53PMGH', Title='Iridescent Soul', Genres=['Hummingbirds'])]

In [21]:
# when setting the user as key and grouping by key we have something like:
# ('A3U1XS6XK6YUU4' -> [['Fiction', 'Drama'], ['Religion'], ['Religion', 'Politics']]).
# So the value is a nested list: each internal list represents the genres of one book reviewed by the user

# [genre for book_genres in external_list for genre in book_genres] just flattens the list:
# same as:
# for book_genres in external_list:
#     for genre in book_genres:
#         append genre to resulting list
#
# [['Fiction', 'Drama'], ['Religion'], ['Religion', 'Politics']] => ['Fiction', 'Drama', 'Religion', 'Religion', 'Politics']

# the Counter simply returns the most common genre (in the example: 'Religion')
# if the count is the same for two distinct genres, the one encountered first in the flattened list is returned

genres_per_user = genres_per_review.rdd.map(lambda row: (row['User_id'], row['Genres'])).groupByKey() \
    .mapValues(lambda external_list: [genre for book_genres in external_list for genre in book_genres]).mapValues(
    lambda genres: Counter(genres).most_common(1)[0][0] if genres else None
).map(lambda x: (user_ids_dict[x[0]], x[1]))

genres_per_user_array = [None] * N

# genres_per_user_array[i] = preferred genre for user mapped to integer i
for index, genre in genres_per_user.collect(): genres_per_user_array[index] = genre

Again, we assume that an $N$-element array can be kept in main memory without problems.

## **Topic-Sensitive PageRank**

First of all, we need to initialize the transition matrix $M$ so that $M_{ij} = \frac{1}{\alpha}$ if there is a link from $j$ to $i$, where $\alpha$ is the number of outgoing edges of node $j$, and $M_{ij} = 0$ otherwise.


Now ```adjacency_list``` is an RDD in which each element has the form ```(node, [neighbours])```. Since the transition matrix $M$ is very sparse, we represent it using triplets $(i, j, M_{ij})$, stored only when $M_{ij} \neq 0$.

**NOTE**: in this setting, there can be many arcs from a node $A$ to another node $B$, because there can be many books for which user $B$ wrote a more helpful review than user $A$.

The idea in this case is to collapse all the arcs from $A$ to $B$ into a single arc, whose transition probability is weighted by the number of books for which $B$ obtained a better score than $A$.

For this reason, triplets are actually stored in the form $((i, j), M_{ij})$: it becomes easy to group the triplets with the same key $(i, j)$ and sum all the contributions $M_{ij}$ associated with the same source and destination nodes.

_Example:_
- ```edges = [(A, B), (A, B), (A, C)]```
- ```adjacency_list = [(A, [B, B, C])]```
- triplets (before grouping):
```
[
    ((B, A), 1/3),
    ((B, A), 1/3),
    ((C, A), 1/3)
]
```
- triplets (after grouping):
```
[
    ((B, A), 2/3),
    ((C, A), 1/3)
]
```
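
The same example can be reproduced in a few lines of pure Python. This is only a sketch of the grouping logic, using exact fractions for readability; the Spark version in the next cell works on floats.

```python
from collections import defaultdict
from fractions import Fraction

edges = [("A", "B"), ("A", "B"), ("A", "C")]

# adjacency list: source -> list of destinations (duplicates kept)
adjacency = defaultdict(list)
for src, dst in edges:
    adjacency[src].append(dst)

# triplets keyed by (dst, src): each outgoing arc contributes 1/out_degree,
# and contributions with the same key are summed
M = defaultdict(Fraction)
for src, neighbours in adjacency.items():
    for dst in neighbours:
        M[(dst, src)] += Fraction(1, len(neighbours))

print(dict(M))
# {('B', 'A'): Fraction(2, 3), ('C', 'A'): Fraction(1, 3)}
```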



In [22]:
# if el is an element of the adjacency_list rdd,
# el[0] => node
# el[1] => list of neighbours of that node

# NOTE: now the semantics of the triplets is (i, j, Mij) => (dst_node, src_node, value)
# it's a bit counterintuitive, but COLUMNS REPRESENT SOURCE NODES, while ROWS REPRESENT DESTINATION NODES

triplets = adjacency_list.flatMap(
    lambda el: [((neighbour, el[0]), 1.0/len(el[1])) for neighbour in el[1] if len(el[1]) > 0]
).reduceByKey(lambda x, y: x + y)

In [23]:
# mapping back to canonical (i, j, Mij) form, for simplicity (and for coherence with lecture notes)
M = triplets.map(lambda triplet: (triplet[0][0], triplet[0][1], triplet[1])).cache()

In [24]:
# CHECK: M should be column-wise stochastic, so column values should sum up to 1
# m[1] -> column index j
# m[2] -> transition probability M[i, j]
check = M.map(lambda m: (m[1], m[2])).reduceByKey(lambda x, y: x+y)
# now we have key-value pairs such that key => column index j and value = SUM(M[i,j]) for i = 0, ..., # rows - 1.
# NOTE: only columns with at least one non-zero entry appear here.
# Check if some values are far from 1 (with a tolerance of epsilon)
epsilon = 1e-6
far_from_one = check.filter(lambda pair: abs(pair[1] - 1.0) > epsilon).count()
if far_from_one == 0: print(f"✅ M is column-wise stochastic")
else: print(f"⚠️ M IS NOT column-wise stochastic: there are {far_from_one} columns that don't sum up to 1")

✅ M is column-wise stochastic


**Topic-Sensitive PageRank (with damping factor β):**

$$
\begin{equation}
    \begin{cases}
        v(0) = \frac{1}{N}\underline{1} \\
        v(t+1) = \beta Mv(t) + (1-\beta)\frac{e_S}{|S|}  
    \end{cases}\,
\end{equation}
$$

where $S$ is a given literary genre (e.g. "Fiction"), $|S|$ is the number of users whose preferred genre is $S$, and $e_S$ is a vector such that
$$
\begin{equation}
    \begin{cases}
        e_S[i] = 1 & \text{if the user associated with integer value $i$ has $S$ as preferred genre} \\
        e_S[i] = 0 & \text{otherwise}
    \end{cases}\,
\end{equation}
$$
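
As a sanity check of the recurrence, here is a minimal pure-Python sketch on a hypothetical 3-node cycle. The function name `topic_sensitive_pagerank` and its parameters are illustrative assumptions and not the notebook's implementation; it assumes a column-stochastic $M$ given as $(i, j, M_{ij})$ triplets.

```python
def topic_sensitive_pagerank(triplets, N, topic_set, beta=0.8, tol=1e-10, max_iter=1000):
    """
    triplets: list of (i, j, M_ij) describing a column-stochastic matrix M
    topic_set: set of node indices whose preferred genre is S
    """
    # teleport vector e_S / |S|
    teleport = [1.0 / len(topic_set) if i in topic_set else 0.0 for i in range(N)]
    v = [1.0 / N] * N
    for _ in range(max_iter):
        # sparse matrix-vector product M v
        Mv = [0.0] * N
        for i, j, m_ij in triplets:
            Mv[i] += m_ij * v[j]
        # v(t+1) = beta * M v(t) + (1 - beta) * e_S / |S|
        new_v = [beta * Mv[i] + (1 - beta) * teleport[i] for i in range(N)]
        dist = sum((a - b) ** 2 for a, b in zip(new_v, v)) ** 0.5
        v = new_v
        if dist < tol:
            break
    return v

# Hypothetical graph: cycle 0 -> 1 -> 2 -> 0, with only node 0 in the topic set
triplets = [(1, 0, 1.0), (2, 1, 1.0), (0, 2, 1.0)]
v = topic_sensitive_pagerank(triplets, N=3, topic_set={0})
print([round(x, 4) for x in v])
```

On this cycle the fixed point can be derived by hand: $v_0 = 0.2 / (1 - \beta^3)$, $v_1 = \beta v_0$ and $v_2 = \beta v_1$, so the topic node 0 receives the highest score.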

In [35]:
def create_topic_vector(N: int, genres_per_user_array: list, topic: str):
    """
    Creates the vector used for taxation in Topic-Sensitive PageRank.

    Parameters
    ----------
    N: int
        the number of distinct users
    genres_per_user_array: list
        a list containing the "preferred" literary genre for each user
    topic: str
        the chosen topic for PageRank computation (```None``` for plain PageRank)

    Returns
    ----------
    a vector containing ```1/N``` for each user if ```topic``` is ```None```,
    ```e_S / |S|``` otherwise
    """
    if topic is None: return np.ones(N) / N
    e_S = np.zeros(N)
    for index, genre in enumerate(genres_per_user_array):
        if genre == topic: e_S[index] = 1

    # sanity check: the number of occurrences of topic in genres_per_user_array (which is |S|) should be equal to the number of ones in e_S
    S_card = Counter(genres_per_user_array)[topic]
    if S_card != int(e_S.sum()): raise ValueError("ERROR: something went wrong during topic vector creation")
    return e_S / S_card

In [26]:
def PageRank(M: pyspark.rdd.PipelinedRDD, v: np.ndarray, topic_vector: np.ndarray, max_iterations=100, tolerance=10e-5, beta=0.8):
    """
    Computes the PageRank score for the graph described by ```M```

    Parameters
    ----------
    M: pyspark.rdd.PipelinedRDD
        an rdd containing triplets (i, j, M_ij) describing the transition matrix
    v: np.ndarray
        a vector containing PageRank scores for each user
    topic_vector: np.ndarray:
        a vector containing the taxation probability for each user (uniform for non-topic sensitive PageRank)
    max_iterations: int
        the maximum number of iterations allowed
    tolerance: float
        the threshold under which the difference between the previous PageRank vector and the current PageRank vector is irrelevant
    beta: float
        the taxation parameter

    Returns
    ----------
    v: np.ndarray
        a vector containing PageRank scores for each user
    """
    iteration = 0
    while iteration < max_iterations:
        prev_v = v.copy()

        # broadcast v to each node of the cluster
        v_bdcast = sc.broadcast(v)

        # matrix - vector multiplication (distributed)
        pr_scores = M.map(lambda m: (m[0], m[2]*v_bdcast.value[m[1]])).reduceByKey(lambda x, y: x + y).collect()
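        # update vector v (local)
        # NOTE: only nodes that receive at least one incoming edge appear in pr_scores;
        # every other node keeps its previous value in v
        for (user, pr_score) in pr_scores: v[user] = beta * pr_score + (1 - beta) * topic_vector[user]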

        dist = np.linalg.norm(v - prev_v)

        if dist < tolerance:
            print(f"Convergence reached after {iteration} iterations with distance {dist}")
            break

        print(f"iteration {iteration}: distance = {dist}")
        iteration += 1
    return v

In [27]:
# just to show some topics:

Counter(genres_per_user_array).most_common(30)

[('Fiction', 5648),
 ('Book burning', 1065),
 ('Business & Economics', 575),
 ('Education', 564),
 ('Biography & Autobiography', 487),
 ('Religion', 471),
 ('American fiction', 454),
 ('History', 367),
 ('True Crime', 350),
 ('Body, Mind & Spirit', 320),
 ('Health & Fitness', 310),
 ('Juvenile Fiction', 291),
 ('Cooking', 207),
 ('Copyright', 199),
 ('Sports & Recreation', 189),
 ('Study Aids', 187),
 ('Family & Relationships', 177),
 ('Poetry', 176),
 ('Computers', 157),
 ('Philosophy', 125),
 ('Crafts & Hobbies', 119),
 ('Language Arts & Disciplines', 117),
 ('Psychology', 113),
 ('Travel', 90),
 ('Music', 89),
 ('Juvenile Nonfiction', 89),
 ('Science', 87),
 ('England', 85),
 ('Games', 77),
 ('Self-Help', 73)]

In [39]:
# for non-topic sensitive PageRank computation, just set topic to None
topic = None

if topic is not None and topic not in genres_per_user_array: raise ValueError("Error: please select a valid literary genre")

v = np.ones(N) / N
topic_vector = create_topic_vector(N, genres_per_user_array, topic)

max_iterations = 100
tolerance = 10e-5
beta = 0.8

In [40]:
%%time
pg_scores = PageRank(M, v, topic_vector, max_iterations=max_iterations, tolerance=tolerance, beta=beta)

iteration 0: distance = 0.004508294288083763
iteration 1: distance = 0.001392548552744607
iteration 2: distance = 0.0005269557574718849
iteration 3: distance = 0.00022173774433604593
iteration 4: distance = 0.00011146929390574273
Convergence reached after 5 iterations with distance 5.9924181026157786e-05
CPU times: user 650 ms, sys: 24.4 ms, total: 675 ms
Wall time: 28.8 s


In [41]:
# finding profileNames for the k most authoritative users

k = 5
pagerank_top_k_users = np.argsort(pg_scores)[-k:][::-1]

# inverting user_ids_dict into an array: user_ids_array[i] = User_id mapped to integer i
user_ids_array = [None] * N
for user_id, i in user_ids_dict.items(): user_ids_array[i] = user_id

# here we're collecting in main memory a dict with only k elements (User_id -> profileName)
profileNames = books_rating_df_sub.filter(books_rating_df_sub.User_id.isin([user_ids_array[user] for user in pagerank_top_k_users]))  \
    .select('User_id', 'profileName').distinct().rdd.collectAsMap()

print(f"PageRank top {k} users [topic: {topic}]")
for user in pagerank_top_k_users: print(f"username: {profileNames[user_ids_array[user]]} -> PageRank score: {pg_scores[user]}")

PageRank top 5 users [topic: None]
username: Midwest Book Review -> PageRank score: 0.0013815026485379056
username: "E. A Solinas ""ea_solinas""" -> PageRank score: 0.0008588802397190114
username: "Terri J. Rice ""ricepaper""" -> PageRank score: 0.0008068922060201434
username: Harriet Klausner -> PageRank score: 0.0006907966501004307
username: "booksforabuck ""BooksForABuck""" -> PageRank score: 0.0005986556961684647


With ```topic = None```, the most influential/authoritative user appears to be "[Midwest Book Review](https://www.midwestbookreview.com/)", which is a fairly well-known organization focused on book reviews. So, in this example, the PageRank result seems quite reasonable.



In [31]:
end_time = time.time()
elapsed_time = (end_time - start_time) / 60
print(f"Global runtime: {elapsed_time:.2f} minutes")

Global runtime: 10.44 minutes
