# Amazon reviews K-NN

Notebook steps:
1. **Download the dataset** from [Amazon Reviews on Kaggle](https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews?resource=download)  
2. **Dataset Description:**  
   The dataset consists of two CSV files: `train` and `test`, each containing the following fields:  
   - **polarity**: 1 for negative, 2 for positive  
   - **title**: review heading  
   - **text**: review body  

3. **Sparse Vectors Generation:**  
   - Apply **q-shingle** with `q=3` on the training data.

4. **MinHashing & LSH:**  
   - Apply **MinHashing Locality-Sensitive Hashing (LSH)** on the sparse vectors of the training dataset.

5. **K-Nearest Neighbors Classification:**  
   - Use the **test set** and apply **k-nearest neighbors (k=3)** on the hashed training data.  
   - Classify each test instance based on the majority polarity of its `k` nearest neighbors.

6. **Cluster Identification:**  
   - Identify **clusters of reviews** where each pair of reviews has a **similarity greater than 0.6**.  
   - This step should be performed after the introduction to **network analysis**.

In [1]:
import pyspark
from pyspark.mllib import *
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.ml.feature import CountVectorizer
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from collections import Counter
from pyspark.sql.functions import lit
import datetime;
import kagglehub
import re
import spacy
nlp = spacy.load("en_core_web_sm")

In [2]:
# create the session
conf = SparkConf().set("spark.ui.port", "4050")

# create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

In [3]:
!pip install -q kaggle
from google.colab import files
files.upload()
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets list

Saving kaggle.json to kaggle (2).json
mkdir: cannot create directory ‘/root/.kaggle’: File exists
ref                                                         title                                               size  lastUpdated                 downloadCount  voteCount  usabilityRating  
----------------------------------------------------------  --------------------------------------------  ----------  --------------------------  -------------  ---------  ---------------  
atharvasoundankar/chocolate-sales                           Chocolate Sales Data 📊🍫                            14473  2025-03-19 03:51:40.270000          18919        316  1.0              
adilshamim8/student-depression-dataset                      Student Depression Dataset                        467020  2025-03-13 03:12:30.423000           9853        141  1.0              
atharvasoundankar/impact-of-ai-on-digital-media-2020-2025   🌍 Impact of AI on Digital Media (2020-2025)         5812  2025-04-03 09:12:25.0700

# Datasets

In [4]:
# Download latest version
path = kagglehub.dataset_download("kritanjalijain/amazon-reviews")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2


In [5]:
!mkdir datasets
!mv "/root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2" datasets

mkdir: cannot create directory ‘datasets’: File exists
mv: cannot move '/root/.cache/kagglehub/datasets/kritanjalijain/amazon-reviews/versions/2' to 'datasets/2': Directory not empty


In [6]:
schema = StructType([
    StructField('polarity', IntegerType(), True),
    StructField('title', StringType(), True),
    StructField('description', StringType(), True),
])
df_train = spark.read.format("csv") \
               .option("header", "false") \
               .schema(schema) \
               .load("datasets/2/train.csv") \
               .limit(10_000)

df_train.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)



In [18]:
df_test = spark.read.format("csv") \
               .option("header", "false") \
               .schema(schema) \
               .load("datasets/2/test.csv") \
               .limit(50)

df_test.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)



In [19]:
print(f"Train set size: {df_train.count()}")
print(f"Test set size: {df_test.count()}")

Train set size: 10000
Test set size: 50


# Q-shingle
$q = 3$

In [9]:
def shingle(text: str, q: int):
    shingle_set = []
    if(text is None):
        return list()

    text = re.sub(r'[^a-zA-Z\s]', '', text).lower()

    for i in range(len(text) - q + 1):
        shingle_set.append(text[i:i+q])
    return list(set(shingle_set))

shingle_udf = F.udf(shingle, ArrayType(StringType()))

q = 3
df_train = df_train.withColumn("shingles", shingle_udf(F.col("description"), F.lit(q)))

df_train.select('shingles').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [10]:
shingles_df = df_train.select(F.explode(F.col('shingles')).alias('shingle')).distinct().orderBy("shingle")
shingles_df.count()

8824

one-hot encoding of shingles

In [11]:
cv = CountVectorizer(inputCol="shingles", outputCol="one_hot_shingles", binary=True)

cv_model = cv.fit(df_train)
df_train = cv_model.transform(df_train)

df_train.select("one_hot_shingles").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Min-hash

In [12]:
mh = MinHashLSH(inputCol="one_hot_shingles", outputCol="hashes", numHashTables=3)
model = mh.fit(df_train)

df_train = model.transform(df_train)
df_train.show()

+--------+--------------------+--------------------+--------------------+--------------------+--------------------+
|polarity|               title|         description|            shingles|    one_hot_shingles|              hashes|
+--------+--------------------+--------------------+--------------------+--------------------+--------------------+
|       2|Stuning even for ...|This sound track ...|[ tr, ist, wel,  ...|(8824,[0,1,2,3,4,...|[[7260164.0], [1....|
|       2|The best soundtra...|I'm reading a lot...|[uy , ney, ot , t...|(8824,[0,1,2,3,4,...|[[7260164.0], [33...|
|       2|            Amazing!|"This soundtrack ...|[mor, nte, ist,  ...|(8824,[0,1,2,3,4,...|[[8150141.0], [51...|
|       2|Excellent Soundtrack|I truly like this...|[ tr, orp, e d, p...|(8824,[0,1,2,3,4,...|[[2752531.0], [59...|
|       2|Remember, Pull Yo...|If you've played ...|[ tr, pap, wel, d...|(8824,[0,1,2,3,4,...|[[7260164.0], [51...|
|       2|an absolute maste...|I am quite sure a...|[ tr, mor, rst, t...

# K-NN

In [20]:
df_test = df_test.withColumn("shingles", shingle_udf(F.col("description"), F.lit(q)))
df_test.select('shingles').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [21]:
df_test = cv_model.transform(df_test)
df_test.select('one_hot_shingles').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [22]:
df_test.printSchema()

root
 |-- polarity: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- description: string (nullable = true)
 |-- shingles: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- one_hot_shingles: vector (nullable = true)



In [23]:
rows = df_test.collect()

In [24]:
k = 3

predictions = []
for row in rows:
	print(f"{row['title']}")

	neighbors = model.approxNearestNeighbors(df_train, row["one_hot_shingles"], k)

	neighbors.createOrReplaceTempView("neighbors")

	query = """
	SELECT AVG(polarity) as pred_polarity
	FROM neighbors
	"""

	result = spark.sql(query).first()[0]

	if result:
			pred_polarity = int(result)
	else:
			pred_polarity = None

	predictions.append((row["title"], row["description"], row["polarity"], pred_polarity))

Great CD
One of the best game music soundtracks - for a game I didn't really play
Batteries died within a year ...
works fine, but Maha Energy is better
Great for the non-audiophile
DVD Player crapped out after one year
Incorrect Disc
DVD menu select problems
Unique Weird Orientalia from the 1930's
"Not an ""ultimate guide"""
Great book for travelling Europe
Not!
A complete Bust
TRULY MADE A DIFFERENCE!
didn't run off of USB bus power
Don't buy!
Simple, Durable, Fun game for all ages
Review of Kelly Club for Toddlers
SOY UN APASIONADO DEL BOX
Some of the best fiddle playing I have heard in a long time
Long and boring
Dont like it
one of the last in the series to collect !
Sony Hi8 Camcorder with 2.5 LCD
Don't Take the Chance - Get the SE Branded Cable
Waste of money!
works great
Has No Range
wish i had gotten this sooner!
Three Days of Use and It Broke
This is the all time best book!
Mary Ash
Tha BOMB of a book!!
This book was a great book that i have read many times!
They'd watch it n

In [25]:
df_predictions = spark.createDataFrame(predictions, ["title", "description", "polarity", "pred_polarity"])

# Evaluation

In [26]:
errors = df_predictions.filter(col("polarity") != col("pred_polarity")).count()
total = df_predictions.count()
error_rate = errors / total

print(f"Error Rate: {error_rate:.4f}")

Error Rate: 0.3200
