# Amazon reviews K-NN

1.   Download the data from [amazon reviews](https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews?resource=download) or from this mirror: [amazon review dropbox](https://www.dropbox.com/scl/fi/ucfoh391qalha3lz0bzjx/amazon_review_polarity_csv.tgz.zip?rlkey=m3a0bbp2ep4sh2qisaz0xwo1w&dl=0)


2.   The dataset consists of two CSV files, train and test. Each file contains the following fields:
  *   polarity - 1 for negative and 2 for positive
  *   title - review heading
  *   text - review body

3.  Generate sparse vectors by applying q-shingling with q=3 to the training data.
4. Apply MinHash LSH to the sparse vectors of the training dataset.
5. Use the test file to run a k-nearest-neighbor search: apply the same hashing to the test data, take k=3, and assign each test review the polarity that occurs most often among its neighbors.


6. *Identify clusters of reviews. Each cluster contains the pairs of reviews whose similarity is greater than 0.6. To be done after the introduction to networks.
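Step 6 can be sketched in plain Python on a handful of strings. This is only one possible reading of the task: it assumes that "similarity" means Jaccard similarity on character 3-shingles and that a cluster is a connected component of the graph whose edges are the pairs above the threshold. The function names and the toy reviews are made up for illustration, and the all-pairs loop would not scale to the full dataset.

```python
from itertools import combinations


def shingles(text: str, q: int = 3) -> set:
    """Return the set of character q-shingles of `text`."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_reviews(texts, threshold=0.6, q=3):
    """Group review indices into connected components of the similarity graph."""
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [shingles(t, q) for t in texts]
    for i, j in combinations(range(len(texts)), 2):
        if jaccard(sets[i], sets[j]) > threshold:
            parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical toy data, not taken from the dataset
toy_reviews = [
    "great soundtrack, I love it",
    "great soundtrack, I loved it",
    "battery died within a year",
]
clusters = cluster_reviews(toy_reviews)
print(clusters)
```

On the real data the candidate pairs would come from the LSH model rather than from an exhaustive comparison; the grouping step stays the same.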







In [1]:
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.mllib import *
from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark.sql.functions import *  # note: shadows Python builtins such as round()
from pyspark.sql.functions import col, lit
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.ml.feature import CountVectorizer, MinHashLSH
from pyspark.ml.linalg import Vectors
from collections import Counter

In [None]:
!pip install -q kaggle
from google.colab import files
files.upload()
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets list

In [3]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("kritanjalijain/amazon-reviews")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/kritanjalijain/amazon-reviews?dataset_version_number=2...


100%|██████████| 1.29G/1.29G [00:43<00:00, 31.6MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2


In [5]:
!ls /root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2

amazon_review_polarity_csv.tgz	test.csv  train.csv


In [6]:
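!mkdir -p datasets
# Move the CSV files themselves, so that they end up at datasets/train.csv and datasets/test.csv
!mv /root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2/* datasets/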

In [7]:
# configure Spark (web UI on port 4050)
conf = SparkConf().set("spark.ui.port", "4050")

# create the context, then the session on top of it
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

# Getting the datasets

In [8]:
base_path = 'datasets/'
# base_path = ''

# The CSV files have no header row, so the column names are set here
schema = StructType([
    StructField('polarity', IntegerType(), True),
    StructField('title', StringType(), True),
    StructField('description', StringType(), True),
])
df_train = spark.read.format("csv") \
            .option("header", "false") \
            .option("escape", '"') \
            .option("quote", '"') \
            .schema(schema) \
            .load(base_path + "train.csv")

# Show the DataFrame schema
df_train.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)



In [9]:
df_train.count()

3600000

In [10]:
df_test = spark.read.format("csv") \
            .option("header", "false") \
            .option("escape", '"') \
            .option("quote", '"') \
            .schema(schema) \
            .load(base_path + "test.csv")

# Show the DataFrame schema
df_test.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)



In [11]:
df_test.count()

400000

# Q-shingle
$q = 3$
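As a small illustration of what 3-shingling produces, the sketch below shingles two short, made-up strings and encodes each one as a binary vector over their shared vocabulary. The helper name `char_shingles` and the sample strings are assumptions for illustration only; the notebook does the same job at scale with a Spark UDF and `CountVectorizer`.

```python
def char_shingles(text: str, q: int = 3) -> set:
    """Return the distinct character q-shingles of `text`."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


a = char_shingles("good cd")   # 7 characters -> 5 shingles
b = char_shingles("good dvd")  # 8 characters -> 6 shingles

# The vocabulary fixes one vector position per distinct shingle.
vocabulary = sorted(a | b)

# Binary (multi-hot) encoding: 1 if the shingle occurs in the text, else 0.
vec_a = [1 if s in a else 0 for s in vocabulary]
vec_b = [1 if s in b else 0 for s in vocabulary]

print(vocabulary)
print(vec_a)
print(vec_b)
```

With a vocabulary of hundreds of thousands of shingles, almost every position is 0, which is why the vectors are stored in sparse form.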

In [12]:
def shingle(text: str, q: int):
    # Null descriptions produce an empty shingle list
    if text is None:
        return []

    # Collect the distinct character q-grams of the text
    shingle_set = {text[i:i + q] for i in range(len(text) - q + 1)}
    return list(shingle_set)

shingle_udf = F.udf(shingle, ArrayType(StringType()))

q = 3
df_train = df_train.withColumn("shingles", shingle_udf(F.col("description"), F.lit(q)))

# Show the DataFrame with the shingles
df_train.select('shingles').show(truncate=False)


In [13]:
# Create the CountVectorizer with binary=True to get a binary vector (multi-hot encoding)
cv = CountVectorizer(inputCol="shingles", outputCol="one_hot_shingles", binary=True)

# Fit the model on the training set and transform it
cv_model = cv.fit(df_train)
df_train = cv_model.transform(df_train)

# Show the result
df_train.select("one_hot_shingles").show(truncate=False)


# Min-hash
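The idea behind MinHash can be sketched without Spark: for a random hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity, so the fraction of matching positions in two signatures estimates it. The code below is a minimal illustration under assumptions of its own (a linear hash family modulo a prime, `zlib.crc32` to turn shingles into integers, made-up strings); it is not meant to reproduce how `MinHashLSH` is implemented internally.

```python
import random
import zlib

PRIME = 2_147_483_647  # 2**31 - 1, used as the modulus of the hash family


def char_shingles(text: str, q: int = 3) -> set:
    """Return the distinct character q-shingles of `text`."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def make_hash_params(num_hashes: int, seed: int = 42):
    """Draw (a, b) coefficients for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set: set, hash_params) -> list:
    """Keep, for each hash function, the minimum hash over the set.

    Raises ValueError on an empty set, which has no minimum.
    """
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which the two sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


params = make_hash_params(128)
sig_a = minhash_signature(char_shingles("great soundtrack, I love it"), params)
sig_b = minhash_signature(char_shingles("great soundtrack, I loved it"), params)
est = estimate_jaccard(sig_a, sig_b)
print(f"Estimated Jaccard similarity: {est:.2f}")
```

More hash functions give a tighter estimate at the cost of longer signatures.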

In [14]:
mh = MinHashLSH(inputCol="one_hot_shingles", outputCol="hashes", numHashTables=3)
model = mh.fit(df_train)

df_train = model.transform(df_train)
df_train.show()

+--------+--------------------+--------------------+--------------------+--------------------+--------------------+
|polarity|               title|         description|            shingles|    one_hot_shingles|              hashes|
+--------+--------------------+--------------------+--------------------+--------------------+--------------------+
|       2|Stuning even for ...|This sound track ...|[ tr, ist, wel,  ...|(262144,[0,1,2,3,...|[[5.2217124E7], [...|
|       2|The best soundtra...|I'm reading a lot...|[uy , g I, ney, o...|(262144,[0,1,2,3,...|[[1.1451894E7], [...|
|       2|            Amazing!|This soundtrack i...|[mor,  tr, ins, r...|(262144,[0,1,2,3,...|[[1804792.0], [11...|
|       2|Excellent Soundtrack|I truly like this...|[ tr, Gal, pea, c...|(262144,[0,1,2,3,...|[[581166.0], [117...|
|       2|Remember, Pull Yo...|If you've played ...|[da , t a, rot, i...|(262144,[0,1,2,3,...|[[8734212.0], [11...|
|       2|an absolute maste...|I am quite sure a...|[ tr, mor, tly, o...
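The choice `numHashTables=3` trades recall for speed. As a rough sketch, assume that each table holds one independent MinHash value and that two reviews become candidates as soon as they agree in at least one table. Under that assumption a pair with Jaccard similarity `s` is caught with probability `1 - (1 - s)**L` for `L` tables. The function name below is made up for illustration.

```python
def candidate_probability(similarity: float, num_tables: int) -> float:
    """Probability that at least one of `num_tables` independent MinHash values matches."""
    return 1.0 - (1.0 - similarity) ** num_tables


# Compare a few similarity levels for 1, 3 and 10 tables
for s in (0.1, 0.3, 0.6, 0.9):
    row = ", ".join(f"L={L}: {candidate_probability(s, L):.3f}" for L in (1, 3, 10))
    print(f"s={s}: {row}")
```

Adding tables raises the chance of finding truly similar reviews, but it also lets more dissimilar pairs through and increases the cost of hashing.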

# K-NN
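For reference, the classification rule can be written as an exact, brute-force k-NN in plain Python: rank the training reviews by Jaccard distance to the query and take the most frequent polarity among the first k. This is a sketch on made-up data with hypothetical helper names; the notebook replaces the exhaustive ranking with an approximate LSH lookup, so its neighbors may differ.

```python
from collections import Counter


def char_shingles(text: str, q: int = 3) -> set:
    """Return the distinct character q-shingles of `text`."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """1 - Jaccard similarity; defined as 1.0 when both sets are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def knn_predict(query: str, train, k: int = 3, q: int = 3) -> int:
    """Majority polarity among the k training reviews closest to `query`."""
    query_set = char_shingles(query, q)
    ranked = sorted(
        train,
        key=lambda item: jaccard_distance(query_set, char_shingles(item[0], q)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: (text, polarity) with 1 = negative, 2 = positive
toy_train = [
    ("great soundtrack, love it", 2),
    ("love this soundtrack", 2),
    ("battery died fast", 1),
    ("battery stopped working", 1),
    ("awful battery", 1),
]

print(knn_predict("great soundtrack", toy_train))
print(knn_predict("battery died", toy_train))
```

With two classes and an odd k the vote cannot tie as long as k neighbors are actually found.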

In [15]:
df_test = df_test.withColumn("shingles", shingle_udf(F.col("description"), F.lit(q)))
df_test.select('shingles').show(truncate=False)


In [16]:
df_test = cv_model.transform(df_test)
df_test.select('one_hot_shingles').show(truncate=False)


In [17]:
df_test.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)
 |-- shingles: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- one_hot_shingles: vector (nullable = true)



In [22]:
rows = df_test.limit(100).collect()

In [None]:
k = 3

predictions = []
for row in rows:
    print("Looking for", row["title"], "...")
    # Approximate k nearest training reviews (Jaccard distance on the shingle vectors)
    neighbors = model.approxNearestNeighbors(df_train, row["one_hot_shingles"], k)

    # Polarity is 1 or 2, so the rounded mean over 3 neighbors is their majority vote
    result_row = neighbors.select(F.round(F.avg(F.col("polarity"))).alias('pred_polarity')).first()

    if result_row and result_row[0] is not None:
        pred_polarity = int(result_row[0])
    else:
        pred_polarity = None

    predictions.append((row["title"], row["description"], row["polarity"], pred_polarity))


Looking for Great CD ...
Looking for One of the best game music soundtracks - for a game I didn't really play ...
Looking for Batteries died within a year ... ...


In [None]:
df_predictions = spark.createDataFrame(predictions, ["title", "description", "polarity", "pred_polarity"])

In [None]:
# Null-safe comparison, so that a missing prediction counts as an error
errors = df_predictions.filter(~col("polarity").eqNullSafe(col("pred_polarity"))).count()
total = df_predictions.count()
error_rate = errors / total

print(f"Error Rate: {error_rate:.4f}")

In [None]:
df_predictions.show()