<a href="https://colab.research.google.com/github/EmanueleGiavardi/AMD_project/blob/main/src/amd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q kaggle pyspark

In [2]:
import pyspark
from pyspark.sql import functions as F
import pandas as pd
import numpy as np
import os
from google.colab import files
from collections import Counter

In [3]:
# handling kaggle.json file

files.upload()
!ls -lha kaggle.json
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
-rw-r--r-- 1 root root 72 May 23 08:02 kaggle.json


In [4]:
!kaggle datasets download -d "mohamedbakhet/amazon-books-reviews"
!unzip amazon-books-reviews.zip
!rm -r amazon-books-reviews.zip

Dataset URL: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews
License(s): CC0-1.0
Downloading amazon-books-reviews.zip to /content
 98% 1.04G/1.06G [00:06<00:00, 204MB/s]
100% 1.06G/1.06G [00:06<00:00, 178MB/s]
Archive:  amazon-books-reviews.zip
  inflating: Books_rating.csv        
  inflating: books_data.csv          


In [5]:
spark = pyspark.sql.SparkSession.builder.master("local[*]").appName("AMD_project").getOrCreate()
sc = spark.sparkContext

In [6]:
books_rating_df = spark.read.csv("Books_rating.csv", header=True, inferSchema=True)
books_rating_df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- profileName: string (nullable = true)
 |-- review/helpfulness: string (nullable = true)
 |-- review/score: string (nullable = true)
 |-- review/time: string (nullable = true)
 |-- review/summary: string (nullable = true)
 |-- review/text: string (nullable = true)



In [7]:
# subsampling

random_state = 42
count = books_rating_df.count()
sampling_frac = 0.01

# probabilistic approach: keeps each line with prob = fraction
#books_rating_df_sub = books_rating_df.sample(fraction=sampling_frac, seed=random_state)

# keeps exactly int(sampling_frac * count) rows; limit() is not random (it typically returns the first rows read), so this assumes the file is already in random order
books_rating_df_sub = books_rating_df.limit(int(sampling_frac * count))
print(f"sample has {int(sampling_frac * count)} lines")

sample has 30000 lines

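The difference between the two subsampling strategies can be sketched in plain Python, without Spark. This is only an illustration on a hypothetical list of row indices (`rows`, `bernoulli_sample` and `head_sample` are made-up names): Bernoulli sampling keeps each row independently, so the sample size is only approximately `fraction * count`, while taking a prefix gives exactly that many rows but is representative only if the data is already shuffled.

```python
import random

fraction = 0.01
rows = list(range(10_000))  # hypothetical stand-in for the rows of the dataset

# Bernoulli sampling (the idea behind DataFrame.sample): each row is kept with probability = fraction
rng = random.Random(42)
bernoulli_sample = [r for r in rows if rng.random() < fraction]

# prefix sampling (the idea behind DataFrame.limit): the first int(fraction * count) rows
head_sample = rows[:int(fraction * len(rows))]

print(len(bernoulli_sample), len(head_sample))
```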

### **Exercise**
Count the word occurrences in the review summaries.

In [8]:
import re

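def normalize_string(s):
    # lowercase and split on single spaces; a missing (null) summary yields no tokens
    #s = re.sub('[^A-Za-z0-9]+', '', s)
    if s is None:
        return []
    return s.lower().split(" ")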

In [9]:
# map and flatMap take a function that is applied to each element of the rdd
# books_rating_df_sub.select("review/summary") creates a new Spark DataFrame containing just the "review/summary" column, and .rdd converts it into an rdd
# each element of this rdd is a Row, so our function extracts the "review/summary" field of the row, which is a string, and then tokenizes it
# with map we would get one list of words per row; flatMap flattens these lists, so the output is a single rdd containing
# all the words of all the rows
s = books_rating_df_sub.select("review/summary").rdd.flatMap(lambda row: normalize_string(row["review/summary"]))

In [10]:
# thanks to the flatMap operation, each element of s is simply a word. Here we count the words, then swap each (word, count) pair to (count, word) so that we can
# sortByKey in descending order, and then swap back to (word, count) so that we can visualize the first 10 elements
s.map(lambda word:(word, 1)).reduceByKey(lambda x, y: x + y).map(lambda couple: (couple[1], couple[0])).sortByKey(ascending=False).map(lambda couple: (couple[1], couple[0])).take(10)

[('the', 6198),
 ('a', 6044),
 ('of', 4423),
 ('book', 3447),
 ('and', 2695),
 ('to', 2067),
 ('great', 2061),
 ('for', 1967),
 ('this', 1662),
 ('read', 1507)]

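The same word count can be reproduced without Spark on a few made-up summaries, which makes the difference between `map` and `flatMap` visible. This is a plain-Python sketch (the `summaries` list is invented for illustration): the list comprehension plays the role of `map`, `chain.from_iterable` the role of the flattening done by `flatMap`, and `Counter` the role of `map` followed by `reduceByKey`.

```python
from collections import Counter
from itertools import chain

summaries = ["A great book", "great read", "a book to read"]  # hypothetical data

# "map": one list of tokens per summary
tokens_per_row = [s.lower().split(" ") for s in summaries]

# "flatMap": the per-row lists are flattened into a single list of words
all_tokens = list(chain.from_iterable(tokens_per_row))

# "map + reduceByKey": count the occurrences of each word
word_counts = Counter(all_tokens)

# "sortByKey(ascending=False) + take": most frequent words first
print(word_counts.most_common(3))
```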
In [11]:
books_rating_df_sub.show()

+----------+--------------------+-----+--------------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|        Id|               Title|Price|       User_id|         profileName|review/helpfulness|review/score|review/time|      review/summary|         review/text|
+----------+--------------------+-----+--------------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|1882931173|Its Only Art If I...| NULL| AVCGYZL8FQQTD|"Jim of Oz ""jim-...|               7/7|         4.0|  940636800|Nice collection o...|This is only for ...|
|0826414346|Dr. Seuss: Americ...| NULL|A30TK6U7DNS82R|       Kevin Killian|             10/10|         5.0| 1095724800|   Really Enjoyed It|I don't care much...|
|0826414346|Dr. Seuss: Americ...| NULL|A3UH4UZ4RSVO82|        John Granger|             10/11|         5.0| 1078790400|Essential for eve...|"If people become...|
|0826414346|Dr. Seuss: Ameri

In [12]:
words = ["house", "dog", "cat", "dog", "garden", "cat", "dog",]

# create rdd

rdd = sc.parallelize(words)

rdd.map(lambda x:(x, 1)).reduceByKey(lambda x,y: x+y).collect()


[('dog', 3), ('house', 1), ('cat', 2), ('garden', 1)]

# **Link Analysis: finding influential/authoritative users**

**Graph**:
- nodes → users
- edges → links between users if two users reviewed the same book

The graph is **directed**: a link from ```u2``` to ```u1``` exists if ```u1``` and ```u2``` reviewed the same book and the score (helpfulness) of ```u1```'s review of that book is higher than the score that ```u2``` obtained for his/her review of the same book. Edges therefore point from the less helpful reviewer to the more helpful one.



## **Graph creation**

Given the review table $R$ and its projection $R' = \Pi_{Title,\ User\_id,\ helpfulness}(R)$, we create the table

$$J = \sigma_{helpfulness_1 > helpfulness_2}(R' \bowtie_{Title} R')$$

This table has the schema
```
root
 |-- Title
 |-- User_id_1
 |-- User_id_2
 |-- helpfulness_1
 |-- helpfulness_2
```
and it is built so that both ```User_id_1``` and ```User_id_2``` reviewed the book named ```Title```, with ```helpfulness_1``` $>$ ```helpfulness_2```. Pairs of a user with themselves are excluded.

Starting from this table we create the graph according to the criterion explained above.

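The construction of $J$ can be sketched on a handful of invented reviews in plain Python. This is only an illustration of the self-join and the selection, under the assumption that the helpfulness has already been turned into a number; the `reviews` list and its values are hypothetical.

```python
# (Title, User_id, helpfulness score) -- hypothetical toy reviews
reviews = [
    ("Book A", "u1", 1.0),
    ("Book A", "u2", 0.5),
    ("Book A", "u3", 0.0),
    ("Book B", "u1", 0.2),
    ("Book B", "u2", 0.9),
]

# self-join on Title, drop pairs of a user with themselves,
# keep only the rows with helpfulness_1 > helpfulness_2
J = [
    (t1, u1, u2, h1, h2)
    for (t1, u1, h1) in reviews
    for (t2, u2, h2) in reviews
    if t1 == t2 and u1 != u2 and h1 > h2
]

for row in J:
    print(row)
```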
**NOTE**:

The helpfulness values of the reviews do not share a common scale (we have things like 0/0, 4/5, 8/10, 78/82 ...).
For now, the score is obtained by parsing the string "X/Y" and computing the ratio X/Y as a float, with 0 used when Y is 0.

> TODO NEXT: find a cleverer way to deal with helpfulness. The "X/Y" could be interpreted as "people who found the review useful/total people who voted", even though this is not clear from the dataset specifications (or from the Amazon website). With this assumption, however, it becomes important to take into account the number of people who voted, instead of just considering the fraction of appreciation.

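One possible way to take the number of voters into account, in the spirit of the TODO above, is the lower bound of the Wilson score interval, which penalizes ratios backed by few votes. This is a sketch of one option, not necessarily what the project will adopt: the function name `wilson_lower_bound` and the value `z = 1.96` (roughly 95% confidence) are assumptions, and it relies on the same "helpful votes / total votes" reading of "X/Y".

```python
import math

def wilson_lower_bound(helpful, total, z=1.96):
    """Lower bound of the Wilson score interval for helpful/total votes (0.0 when there are no votes)."""
    if total == 0:
        return 0.0
    p = helpful / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / (1 + z * z / total)

# same ratio, different number of voters: 8/10 scores higher than 4/5
for votes in ["0/0", "4/5", "8/10", "78/82"]:
    helpful, total = (int(x) for x in votes.split("/"))
    print(votes, round(wilson_lower_bound(helpful, total), 3))
```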


In [13]:
from pyspark.sql.functions import split, col, when

# TODO: replace with normalized version
def get_helpfulness_score(col_name):
    num = split(col(col_name), "/").getItem(0).cast("float")
    den = split(col(col_name), "/").getItem(1).cast("float")
    return when(den != 0, num / den).otherwise(0.0)

R_first = books_rating_df_sub.select(["Title", "User_id", "review/helpfulness"])
R1 = R_first.alias("R1")
R2 = R_first.alias("R2")

J = R1.join(R2, col("R1.Title") == col("R2.Title")) \
      .filter(col("R1.User_id") != col("R2.User_id")) \
      .select(
          col("R1.Title").alias("Title"),
          col("R1.User_id").alias("User_id_1"),
          col("R2.User_id").alias("User_id_2"),
          get_helpfulness_score("R1.review/helpfulness").alias("helpfulness_1"),
          get_helpfulness_score("R2.review/helpfulness").alias("helpfulness_2")
      )

J_filtered = J.filter(col("helpfulness_1") > col("helpfulness_2"))

In [14]:
J_filtered.take(10)

[Row(Title='Dr. Seuss: American Icon', User_id_1='A30TK6U7DNS82R', User_id_2='A3VA4XFS5WNJO3', helpfulness_1=1.0, helpfulness_2=0.6),
 Row(Title='Dr. Seuss: American Icon', User_id_1='A30TK6U7DNS82R', User_id_2='A25MD5I2GUIW6W', helpfulness_1=1.0, helpfulness_2=0.0),
 Row(Title='Dr. Seuss: American Icon', User_id_1='A30TK6U7DNS82R', User_id_2='A2RSSXTDZDUSH4', helpfulness_1=1.0, helpfulness_2=0.0),
 Row(Title='Dr. Seuss: American Icon', User_id_1='A30TK6U7DNS82R', User_id_2='A14OJS0VWMOSWO', helpfulness_1=1.0, helpfulness_2=0.75),
 Row(Title='Dr. Seuss: American Icon', User_id_1='A30TK6U7DNS82R', User_id_2='A3UH4UZ4RSVO82', helpfulness_1=1.0, helpfulness_2=0.9090909090909091),
 Row(Title='Dr. Seuss: American Icon', User_id_1='A3UH4UZ4RSVO82', User_id_2='A3VA4XFS5WNJO3', helpfulness_1=0.9090909090909091, helpfulness_2=0.6),
 Row(Title='Dr. Seuss: American Icon', User_id_1='A3UH4UZ4RSVO82', User_id_2='A25MD5I2GUIW6W', helpfulness_1=0.9090909090909091, helpfulness_2=0.0),
 Row(Title='Dr. 

The idea now is to associate an **integer from $0$ to $N-1$** with each of the $N$ user ids. In this way:
- An edge is simply represented as a pair of integers $(i, j)$, where $i$ is the integer associated with the user having the outgoing connection and $j$ is the integer associated with the user having the incoming connection
- PageRank values are stored in a simple array $V$ of $N$ elements, such that $V[i]$ is the PageRank value of the user associated with integer $i$

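A minimal plain-Python sketch of this encoding on invented data follows. The `pairs` list is hypothetical, the ids are sorted only to make the numbering reproducible (Spark's `zipWithIndex` follows the partition order instead), and the uniform initial value of `V` is an assumption.

```python
pairs = [
    # (Title, User_id_1, User_id_2): User_id_1 has the higher helpfulness
    ("Book A", "u1", "u2"),
    ("Book A", "u1", "u3"),
    ("Book B", "u2", "u1"),
]

# every user that appears on either side of a pair gets an integer in [0, N-1]
users = sorted({u for _, u1, u2 in pairs for u in (u1, u2)})
user_index = {user: i for i, user in enumerate(users)}
N = len(user_index)

# edges point from the less helpful reviewer (User_id_2) to the more helpful one (User_id_1)
edges = [(user_index[u2], user_index[u1]) for _, u1, u2 in pairs]

# one PageRank value per user, here initialised uniformly (an assumption)
V = [1.0 / N] * N

print(user_index, edges)
```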


In [15]:
unique_users = J_filtered.select(col("User_id_1").alias("User_id")) \
    .union(J_filtered.select(col("User_id_2").alias("User_id"))) \
    .distinct()

user_ids_rdd = unique_users.rdd.map(lambda row: row["User_id"]).zipWithIndex()
N = user_ids_rdd.count()
print(f"There are {N} unique users")

There are 19805 unique users


In [16]:
user_ids_rdd.take(5)

[('AT3C9SZ3MB4U9', 0),
 ('A2CIIL55BUQWBG', 1),
 ('A140XH16IKR4B0', 2),
 ('A3EN6NDS6S7N9N', 3),
 ('AKDP4PZ94N2E1', 4)]

There are two ways of creating the $(i, j)$ pairs:
1. create from ```user_ids_rdd``` a DataFrame with schema ```[User_id, Integer_value]``` and use join operations to associate an ```Integer_value``` with both ```User_id_1``` and ```User_id_2```.
2. convert ```user_ids_rdd``` into a dictionary that is broadcast to every computing node of the cluster, so that the ```Integer_value``` can be retrieved by looking up the specific ```User_id``` key.

Since the number of unique users is not expected to be _that high_, the second option is chosen.

In [17]:
user_ids_dict = user_ids_rdd.collectAsMap()
bdcast = sc.broadcast(user_ids_dict)

In [18]:
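# each edge goes from the user with the lower helpfulness (User_id_2) to the user with the higher one (User_id_1)
edges = J_filtered.rdd.map(lambda row: (bdcast.value[row["User_id_2"]], bdcast.value[row["User_id_1"]]))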
edges.take(10)

[(3418, 12229),
 (18836, 12229),
 (19123, 12229),
 (355, 12229),
 (5800, 12229),
 (3418, 5800),
 (18836, 5800),
 (19123, 5800),
 (355, 5800),
 (3418, 8946)]

## **PageRank**

First of all we need to compute the adjacency list, so that we can initialize the transition matrix $M$: $M_{ij} = \frac{1}{\alpha_j}$ if there is a link from node $j$ to node $i$, where $\alpha_j$ is the number of outgoing edges of node $j$, and $M_{ij} = 0$ otherwise.

In [26]:
# TODO: find a more efficient way
adjacency_list = edges.groupByKey()
adjacency_list.take(5)

[(3418, <pyspark.resultiterable.ResultIterable at 0x7d7879fd9fd0>),
 (18836, <pyspark.resultiterable.ResultIterable at 0x7d7879e57f90>),
 (19123, <pyspark.resultiterable.ResultIterable at 0x7d787a128790>),
 (355, <pyspark.resultiterable.ResultIterable at 0x7d7879e52ad0>),
 (5800, <pyspark.resultiterable.ResultIterable at 0x7d7879f36510>)]

Now ```adjacency_list``` is an RDD in which each element has the form ```(node, [neighbours])```. Since the transition matrix $M$ is very sparse, we represent it using triplets $(i, j, M_{ij})$, storing only the entries with $M_{ij} \neq 0$.

**NOTE**: in this setting there can be many arcs from a certain node $A$ to another node $B$, because there can be many books for which user $B$ wrote a better review than user $A$.

The idea in this case is to collapse all the possible arcs from $A$ to $B$ into a single arc, weighting the corresponding entry of the transition matrix by the number of books for which $B$ obtained a better score than $A$.

Triplets are therefore stored in the form $((i, j), M_{ij})$, where $i$ is the destination node and $j$ the source node, which makes it easy to group the triplets with the same key $(i, j)$ and sum all the contributions $M_{ij}$ associated with the same source and destination nodes.

_Example:_
- ```edges = [(A, B), (A, B), (A, C)]```
- ```adjacency_list = [(A, [B, B, C])]```
- ```
triplets (before grouping) = [
    ((B, A), 1/3)
    ((B, A), 1/3)
    ((C, A), 1/3)
]
```
- ```
triplets (after grouping) = [
    ((B, A), 2/3)
    ((C, A), 1/3)
]
```

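The example above can be checked with a few lines of plain Python. This sketch uses exact fractions and ordinary dictionaries in place of the RDD operations (`groupByKey` and `reduceByKey`); the variable names are chosen here for illustration only.

```python
from collections import defaultdict
from fractions import Fraction

edges = [("A", "B"), ("A", "B"), ("A", "C")]

# adjacency list: source -> list of destinations (repeats kept)
adjacency = defaultdict(list)
for src, dst in edges:
    adjacency[src].append(dst)

# one triplet per arc, keyed by (destination, source);
# arcs with the same key are summed, as reduceByKey would do
triplets = defaultdict(Fraction)
for src, neighbours in adjacency.items():
    for dst in neighbours:
        triplets[(dst, src)] += Fraction(1, len(neighbours))

print(dict(triplets))
```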


In [37]:
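# each element of the adjacency_list rdd is a pair (node, neighbours):
# node       => source node
# neighbours => iterable with the destinations of its outgoing edges (repeats allowed)
def to_triplets(el):
    node, neighbours = el
    out_degree = len(neighbours)
    # key is (destination, source), value is 1 / out-degree of the source
    return [((neighbour, node), 1.0 / out_degree) for neighbour in neighbours]

triplets = adjacency_list.flatMap(to_triplets).reduceByKey(lambda x, y: x + y)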

In [38]:
print(f"transition graph has {N} nodes and {triplets.count()} edges")

transition graph has 19805 nodes and 2363186 edges

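As a hint of how this sparse representation can be used, the sketch below performs PageRank updates directly on $((i, j), M_{ij})$ triplets in plain Python. It is an illustration under stated assumptions, not the implementation of this notebook: the damping factor `beta = 0.85`, the uniform teleport term, the number of iterations and the tiny 3-node graph are all chosen here for the example, and nodes without outgoing edges are not handled.

```python
def pagerank_step(triplets, v, beta=0.85):
    """One update v' = beta * M v + (1 - beta) / N, with M given as {(dest, src): weight}."""
    n = len(v)
    new_v = [(1.0 - beta) / n] * n
    for (dest, src), weight in triplets.items():
        new_v[dest] += beta * weight * v[src]
    return new_v

# hypothetical graph: 0 -> 1, 1 -> 0, 1 -> 2, 2 -> 0 (each column of M sums to 1)
toy_triplets = {(1, 0): 1.0, (0, 1): 0.5, (2, 1): 0.5, (0, 2): 1.0}

v = [1.0 / 3] * 3
for _ in range(100):
    v = pagerank_step(toy_triplets, v)

print([round(x, 4) for x in v])
```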