# Amazon reviews K-NN

Notebook steps:
1. **Download the dataset** from [Amazon Reviews on Kaggle](https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews?resource=download)  
2. **Dataset Description:**  
   The dataset consists of two CSV files: `train` and `test`, each containing the following fields:  
   - **polarity**: 1 for negative, 2 for positive  
   - **title**: review heading  
   - **text**: review body  

3. **Sparse Vectors Generation:**  
   - Apply **q-shingle** with `q=3` on the training data.

4. **MinHashing & LSH:**  
   - Apply **MinHashing Locality-Sensitive Hashing (LSH)** on the sparse vectors of the training dataset.

5. **K-Nearest Neighbors Classification:**  
   - Use the **test set** and apply **k-nearest neighbors (k=3)** on the hashed training data.  
   - Classify each test instance based on the majority polarity of its `k` nearest neighbors.

6. **Cluster Identification:**  
   - Identify **clusters of reviews** where each pair of reviews has a **similarity greater than 0.6**.  
   - This step should be performed after the introduction to **network analysis**.

In [2]:
import pyspark
from pyspark.mllib import *
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.ml.feature import CountVectorizer
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from collections import Counter
from pyspark.sql.functions import lit
import datetime;
import kagglehub

In [3]:
# create the session
conf = SparkConf().set("spark.ui.port", "4050")

# create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

In [4]:
!pip install -q kaggle
from google.colab import files
files.upload()
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets list

Saving kaggle.json to kaggle.json
ref                                                         title                                                   size  lastUpdated                 downloadCount  voteCount  usabilityRating  
----------------------------------------------------------  ------------------------------------------------  ----------  --------------------------  -------------  ---------  ---------------  
atharvasoundankar/chocolate-sales                           Chocolate Sales Data 📊🍫                                14473  2025-03-19 03:51:40.270000          18265        308  1.0              
adilshamim8/student-depression-dataset                      Student Depression Dataset                            467020  2025-03-13 03:12:30.423000           9329        135  1.0              
khushikyad001/finance-and-economics-dataset-2000-present    Finance & Economics Dataset (2000 - Present)          204142  2025-03-29 18:51:33.840000           1018         23  1.0           

# Getting the datasets

In [5]:
# Download latest version
path = kagglehub.dataset_download("kritanjalijain/amazon-reviews")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/kritanjalijain/amazon-reviews?dataset_version_number=2...


100%|██████████| 1.29G/1.29G [00:14<00:00, 97.3MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2


In [6]:
!mkdir datasets
!mv "/root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2" datasets

In [36]:
schema = StructType([
    StructField('polarity', IntegerType(), True),
    StructField('title', StringType(), True),
    StructField('description', StringType(), True),
])
df_train = spark.read.format("csv") \
               .option("header", "false") \
               .schema(schema) \
               .load("datasets/2/train.csv")

# Mostra lo schema del DataFrame
df_train.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)



In [37]:
df_test = spark.read.format("csv") \
               .option("header", "false") \
               .schema(schema) \
               .load("datasets/2/test.csv")

# Mostra lo schema del DataFrame
df_test.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)



# Q-shingle
$q = 3$

In [38]:
def shingle(text: str, q: int):
    shingle_set = []
    if(text is None):
        return list()

    for i in range(len(text) - q + 1):
        shingle_set.append(text[i:i+q])
    return list(set(shingle_set))

shingle_udf = F.udf(shingle, ArrayType(StringType()))

q = 3
df_train = df_train.withColumn("shingles", shingle_udf(F.col("description"), F.lit(q)))

# Mostra il DataFrame con i shingles
df_train.select('shingles').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [39]:
shingles_df = df_train.select(F.explode(F.col('shingles')).alias('shingle')).distinct().orderBy("shingle")
shingles_df.show()

+-------+
|shingle|
+-------+
|    al|
|     J|
|     R|
|     l|
|     s|
|     t|
|    --|
|    3k|
|    He|
|    WO|
|    it|
|    of|
|    th|
|    üY|
|     a|
|    , |
|    qP|
|   \a,t|
|   \b h|
|   \bHe|
+-------+
only showing top 20 rows



In [40]:
# Creazione del CountVectorizer con binary=True per ottenere un vettore binario (multi-hot encoding)
cv = CountVectorizer(inputCol="shingles", outputCol="one_hot_shingles", binary=True)

# Fit del modello sul dataset e trasformazione
cv_model = cv.fit(df_train)
df_train = cv_model.transform(df_train)

# Visualizzare il risultato
df_train.select("one_hot_shingles").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Min-hash

In [41]:
mh = MinHashLSH(inputCol="one_hot_shingles", outputCol="hashes", numHashTables=3)
model = mh.fit(df_train)

df_train = model.transform(df_train)
df_train.show()

+--------+--------------------+--------------------+--------------------+--------------------+--------------------+
|polarity|               title|         description|            shingles|    one_hot_shingles|              hashes|
+--------+--------------------+--------------------+--------------------+--------------------+--------------------+
|       2|Stuning even for ...|This sound track ...|[ tr, ist, wel,  ...|(253024,[0,1,2,3,...|[[3745976.0], [50...|
|       2|The best soundtra...|I'm reading a lot...|[uy , g I, ney, o...|(253024,[0,1,2,3,...|[[1767667.0], [47...|
|       2|            Amazing!|"This soundtrack ...|[mor, nte, u'v, i...|(253024,[0,1,2,3,...|[[3745976.0], [50...|
|       2|Excellent Soundtrack|I truly like this...|[ tr, Gal, pea, c...|(253024,[0,1,2,3,...|[[2132677.0], [50...|
|       2|Remember, Pull Yo...|If you've played ...|[da , t a, rot, i...|(253024,[0,1,2,3,...|[[2132677.0], [58...|
|       2|an absolute maste...|I am quite sure a...|[ tr, mor, tly, o...

# K-NN

In [42]:
df_test = df_test.withColumn("shingles", shingle_udf(F.col("description"), F.lit(q)))
df_test.select('shingles').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [43]:
# Mostra il DataFrame con i shingles
df_test = cv_model.transform(df_test)
df_test.select('one_hot_shingles').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [61]:
df_test.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)
 |-- shingles: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- one_hot_shingles: vector (nullable = true)



In [69]:
rows = df_test.limit(10).collect()

In [70]:
k = 3

predictions = []
for row in rows:
	print(f"{row['title']}")

	neighbors = model.approxNearestNeighbors(df_train, row["one_hot_shingles"], k)

	neighbors.createOrReplaceTempView("neighbors")

	query = """
	SELECT AVG(polarity) as pred_polarity
	FROM neighbors
	"""

	result = spark.sql(query).first()[0]

	if result:
			pred_polarity = int(result)
	else:
			pred_polarity = None

	predictions.append((row["title"], row["description"], row["polarity"], pred_polarity))
	print(f"\tPredicted polarity: {pred_polarity} - {datetime.datetime.now().strftime('%d/%m/%Y %H:%M')}")

Great CD
One of the best game music soundtracks - for a game I didn't really play
Batteries died within a year ...
works fine, but Maha Energy is better
Great for the non-audiophile
DVD Player crapped out after one year
Incorrect Disc
DVD menu select problems
Unique Weird Orientalia from the 1930's
"Not an ""ultimate guide"""


In [90]:
df_predictions = spark.createDataFrame(predictions, ["title", "description", "polarity", "pred_polarity"])

In [92]:
df_predictions.show()

+--------------------+--------------------+--------+-------------+
|               title|         description|polarity|pred_polarity|
+--------------------+--------------------+--------+-------------+
|            Great CD|"My lovely Pat ha...|       2|            2|
|One of the best g...|Despite the fact ...|       2|            1|
|Batteries died wi...|I bought this cha...|       1|            1|
|works fine, but M...|Check out Maha En...|       2|            2|
|Great for the non...|Reviewed quite a ...|       2|            3|
|DVD Player crappe...|I also began havi...|       1|            1|
|      Incorrect Disc|I love the style ...|       1|            2|
|DVD menu select p...|I cannot scroll t...|       1|            1|
|Unique Weird Orie...|"Exotic tales of ...|       2|            2|
|"Not an ""ultimat...|Firstly,I enjoyed...|       1|            2|
+--------------------+--------------------+--------+-------------+



# Evaluation

In [93]:
errors = df_predictions.filter(col("polarity") != col("pred_polarity")).count()
total = df_predictions.count()
error_rate = errors / total

print(f"Error Rate: {error_rate:.4f}")

Error Rate: 0.4000
