# Amazon reviews K-NN

1.   Download the data from this URL: [amazon reviews](https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews?resource=download) or from this one: [amazon review dropbox](https://www.dropbox.com/scl/fi/ucfoh391qalha3lz0bzjx/amazon_review_polarity_csv.tgz.zip?rlkey=m3a0bbp2ep4sh2qisaz0xwo1w&dl=0)


2.   The dataset consists of two CSV files, train and test. Each file contains the following fields:
  *   polarity - 1 for negative and 2 for positive
  *   title - review heading
  *   text - review body

3.  Generate the sparse vectors by applying q-shingling to the training data with q=3.
4. Apply MinHash LSH to the sparse vectors of the training dataset.
5. Take the test file, apply the same hashing to it, and run a k-nearest-neighbor search for the test data. Use k=3 and classify each element of the test set with the polarity that occurs most often among its neighbors.


6. *Identify the clusters of reviews. Each cluster contains the pairs of reviews whose similarity is greater than 0.6. To be done after the introduction to networks.
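
As a rough illustration of what steps 3 and 6 compute, the sketch below builds character 3-shingles and the exact Jaccard similarity in plain Python, without Spark. The helper names `shingles` and `jaccard` and the two sample strings are hypothetical and are not taken from the dataset.

```python
def shingles(text: str, q: int = 3) -> set:
    """Return the set of character q-shingles of `text`."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|, taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up review snippets, for illustration only.
r1 = "great book"
r2 = "great books"
s1, s2 = shingles(r1), shingles(r2)
similarity = jaccard(s1, s2)

print(sorted(s1))
print(f"Jaccard similarity: {similarity:.3f}")
# Under step 6, a pair belongs to the same cluster when its similarity exceeds 0.6.
print("same cluster" if similarity > 0.6 else "different clusters")
```

MinHash LSH approximates this similarity without comparing every pair of shingle sets directly, which is what makes it usable on the full dataset.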







In [1]:
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.ml.feature import CountVectorizer, MinHashLSH
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# `round` is Spark's column function here and shadows Python's built-in round().
from pyspark.sql.functions import avg, col, lit, round
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

In [2]:
# Spark configuration: serve the Spark UI on port 4050
conf = SparkConf().set("spark.ui.port", "4050")

# Create the context first, then the session that wraps it
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

# Getting the datasets

In [9]:
schema = StructType([
    StructField('polarity', IntegerType(), True),
    StructField('title', StringType(), True),
    StructField('description', StringType(), True),
])
df_train = spark.read.format("csv") \
               .option("header", "false") \
               .schema(schema) \
               .load("train.csv")

# Show the DataFrame schema
df_train.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)



In [10]:
df_train.count()

25755

In [5]:
df_test = spark.read.format("csv") \
               .option("header", "false") \
               .schema(schema) \
               .load("test.csv")

# Show the DataFrame schema
df_test.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)



In [6]:
df_test.count()

160249

# Q-shingle
$q = 3$

In [None]:
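def shingle(text: str, q: int):
    """Return the distinct character q-shingles of `text` (empty list for missing text)."""
    if text is None:
        return []
    # The set comprehension removes duplicates; texts shorter than q yield no shingles.
    return list({text[i:i + q] for i in range(len(text) - q + 1)})

shingle_udf = F.udf(shingle, ArrayType(StringType()))

q = 3
# NOTE: limit(10) keeps only 10 training rows, so everything downstream is fitted on that sample.
df_train = df_train.limit(10).withColumn("shingles", shingle_udf(F.col("description"), F.lit(q)))

# Show the DataFrame with the shingles
df_train.select('shingles').show(truncate=False)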

In [None]:
shingles_df = df_train.select(F.explode(F.col('shingles')).alias('shingle')).distinct().orderBy("shingle")
shingles_df.show()

In [None]:
shingles_df.count()

In [None]:
# Create the CountVectorizer with binary=True to obtain a binary vector (multi-hot encoding)
cv = CountVectorizer(inputCol="shingles", outputCol="one_hot_shingles", binary=True)

# Fit the model on the training set and transform it
cv_model = cv.fit(df_train)
df_train = cv_model.transform(df_train)

# Show the result
df_train.select("one_hot_shingles").show(truncate=False)
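
To make the multi-hot encoding concrete, here is a minimal plain-Python sketch of the idea: a vocabulary is built from the training shingles, and each document becomes the sorted list of vocabulary indices it contains. The names `build_vocabulary` and `multi_hot` are hypothetical, and the alphabetical index order is a simplification chosen for this sketch; it is not meant to reproduce how `CountVectorizer` orders its vocabulary.

```python
def build_vocabulary(docs):
    """Map each distinct shingle to an integer index (alphabetical order in this sketch)."""
    distinct = set().union(*docs)
    return {s: i for i, s in enumerate(sorted(distinct))}


def multi_hot(doc, vocab):
    """Sorted indices of the vocabulary shingles present in `doc`; unseen shingles are dropped."""
    return sorted({vocab[s] for s in doc if s in vocab})


# Toy shingle lists, for illustration only.
train_docs = [["abc", "bcd"], ["bcd", "cde"]]
vocab = build_vocabulary(train_docs)
train_vectors = [multi_hot(d, vocab) for d in train_docs]

# A test document keeps only the shingles already seen during fitting.
test_vector = multi_hot(["cde", "xyz"], vocab)
empty_vector = multi_hot(["xyz"], vocab)

print(vocab)
print(train_vectors, test_vector, empty_vector)
```

The last case matters later: a test review that shares no shingle with the training vocabulary is encoded as an all-zero vector, which MinHash cannot hash.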

# Min-hash

In [None]:
mh = MinHashLSH(inputCol="one_hot_shingles", outputCol="hashes", numHashTables=3)
model = mh.fit(df_train)

df_train = model.transform(df_train)
df_train.show()
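
The idea behind MinHash can be sketched in plain Python: for each of several hash functions, keep the minimum hash value over the set, and estimate the Jaccard similarity as the fraction of positions where two signatures agree. The sketch below is an assumption-laden illustration and not Spark's implementation: `seeded_hash`, `minhash_signature` and `estimate_jaccard` are hypothetical names, and BLAKE2b from `hashlib` stands in for a family of hash functions.

```python
import hashlib


def seeded_hash(item: str, seed: int) -> int:
    """Deterministic 64-bit hash of `item`; each seed acts as a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: set, num_hashes: int) -> list:
    """For each hash function, the minimum hash value over all items of the set."""
    if not items:
        raise ValueError("MinHash is undefined for an empty set")
    return [min(seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig1: list, sig2: list) -> float:
    """Fraction of signature positions on which the two sets agree."""
    return sum(h1 == h2 for h1, h2 in zip(sig1, sig2)) / len(sig1)


# Two synthetic sets sharing 60 of their 100 distinct elements.
set_a = {f"s{i}" for i in range(0, 80)}
set_b = {f"s{i}" for i in range(20, 100)}
exact = len(set_a & set_b) / len(set_a | set_b)

sig_a = minhash_signature(set_a, num_hashes=500)
sig_b = minhash_signature(set_b, num_hashes=500)
estimate = estimate_jaccard(sig_a, sig_b)

print(f"exact={exact:.2f} estimate={estimate:.2f}")
```

More hash functions give a tighter estimate at a higher cost; the `numHashTables=3` used in the cell above keeps only three hash values per review.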

# K-NN

In [None]:
df_test = df_test.withColumn("shingles", shingle_udf(F.col("description"), F.lit(q)))
df_test.select('shingles').show(truncate=False)

In [None]:
# Encode the test shingles with the vocabulary fitted on the training set
df_test = cv_model.transform(df_test)
df_test.select('one_hot_shingles').show(truncate=False)

In [None]:
df_test.printSchema()

In [None]:
# Query points come from the test set (step 5); the training set is only searched.
rows = df_test.limit(1000).collect()

In [None]:
k = 3

predictions = []
for row in rows:
    features = row["one_hot_shingles"]
    pred_polarity = None

    # MinHash cannot hash an all-zero vector (no shingle in the training vocabulary),
    # so such rows are left unclassified.
    if features.numNonzeros() > 0:
        neighbors = model.approxNearestNeighbors(df_train, features, k)

        # Polarity is 1 or 2, so the rounded mean over the neighbors is the majority label
        # (a tie, possible when fewer than 3 neighbors are returned, rounds up to 2).
        result_row = neighbors.select(F.round(F.avg(F.col("polarity"))).alias("pred_polarity")).first()
        if result_row is not None and result_row[0] is not None:
            pred_polarity = int(result_row[0])

    if pred_polarity is None:
        print("Warning: no usable neighbors for row; leaving pred_polarity as None.")

    predictions.append((row["title"], row["description"], row["polarity"], pred_polarity))
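
Step 5 asks for the polarity that occurs most often among the k neighbors. A minimal plain-Python sketch of an explicit majority vote is given here; the function name `majority_vote` and the label lists are hypothetical. Unlike a rounded mean, this form also works when there are more than two labels.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None when there are no neighbors.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


# Made-up neighbor polarities for three query reviews.
print(majority_vote([1, 2, 2]))
print(majority_vote([1, 1, 2]))
print(majority_vote([]))
```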

In [None]:
df_predictions = spark.createDataFrame(predictions, ["title", "description", "polarity", "pred_polarity"])

In [None]:
df_predictions.show()

# Evaluation

In [None]:
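# Rows without a prediction are excluded: a comparison with NULL is never true,
# so they would otherwise be counted in the total but never as errors.
evaluated = df_predictions.filter(F.col("pred_polarity").isNotNull())
errors = evaluated.filter(F.col("polarity") != F.col("pred_polarity")).count()
total = evaluated.count()
error_rate = errors / total if total else float("nan")

print(f"Error Rate: {error_rate:.4f}")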
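
As a sanity check on the metric, the error rate can be recomputed in plain Python from (true, predicted) pairs. The sketch below assumes that missing predictions (`None`) are skipped and reported separately rather than counted as correct; the helper name `error_rate_from_pairs` and the sample pairs are hypothetical.

```python
def error_rate_from_pairs(pairs):
    """Return (error_rate, skipped) for a list of (true_label, predicted_label) pairs.

    Pairs whose prediction is None are skipped and counted separately.
    """
    evaluated = [(t, p) for t, p in pairs if p is not None]
    skipped = len(pairs) - len(evaluated)
    if not evaluated:
        return float("nan"), skipped
    errors = sum(t != p for t, p in evaluated)
    return errors / len(evaluated), skipped


# Made-up (polarity, pred_polarity) pairs, for illustration only.
pairs = [(1, 1), (2, 2), (1, 2), (2, 2), (2, None)]
rate, skipped = error_rate_from_pairs(pairs)
print(f"Error rate: {rate:.4f} ({skipped} skipped)")
```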