
PROJECT: Finding similar items. Implement a detector of pairs of similar book reviews


LIBRARIES

In [1]:
import pandas as pd
import zipfile
from itertools   import combinations
from collections import Counter
import re
import html
import os

In [2]:
# Supply your own Kaggle credentials; do not commit a real API key to a shared notebook.
os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"

!pip install -q kaggle

from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

!kaggle datasets download -d mohamedbakhet/amazon-books-reviews

Dataset URL: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews
License(s): CC0-1.0
amazon-books-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


DOWNLOAD OF THE DATASET FROM KAGGLE

UNZIP OF THE FILE

In [3]:
zip_path = "amazon-books-reviews.zip"
extract_dir = "amazon_books_reviews"
os.makedirs(extract_dir, exist_ok=True)

with zipfile.ZipFile(zip_path, "r") as z:
    z.extractall(extract_dir)

for root, dirs, files in os.walk(extract_dir):
    for file in files:
        print(os.path.join(root, file))

amazon_books_reviews/books_data.csv
amazon_books_reviews/Books_rating.csv


FIRST LOOK AT THE DATASET: columns, first 5 reviews

In [4]:
folder = "amazon_books_reviews"
csv_path = os.path.join(folder, "Books_rating.csv")

df = pd.read_csv(csv_path, nrows=30000)

print("\nColumns in Books_rating.csv:")
print(df.columns.tolist())


Columns in Books_rating.csv:
['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness', 'review/score', 'review/time', 'review/summary', 'review/text']


In [5]:
reviews = df['review/text']
print("\nFirst 5 reviews:")
print(reviews.head())


First 5 reviews:
0    This is only for Julie Strain fans. It's a col...
1    I don't care much for Dr. Seuss but after read...
2    If people become the books they read and if "t...
3    Theodore Seuss Geisel (1904-1991), aka &quot;D...
4    Philip Nel - Dr. Seuss: American IconThis is b...
Name: review/text, dtype: object


FIRST STEP FOR CLASSIFYING THE REVIEWS.
DIVIDE THE REVIEWS BASED ON THE 'REVIEW/SCORE' VARIABLE.

In [6]:
def label_sentiment(df: pd.DataFrame, score_col: str) -> pd.DataFrame:
    sentiments = []
    for score in df[score_col]:
        if score <= 3:
            sentiments.append('negative')
        else:
            sentiments.append('positive')
    df['sentiment'] = sentiments
    return df

In [7]:
df = label_sentiment(df,'review/score')
print("\'sentiment' added to columns :", df.columns.tolist())



'sentiment' added to columns : ['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness', 'review/score', 'review/time', 'review/summary', 'review/text', 'sentiment']


CREATING SUBSETS FOR POSITIVE AND NEGATIVE REVIEWS

In [8]:
positive_reviews = df.loc[df['sentiment'] == 'positive', 'review/text']
negative_reviews = df.loc[df['sentiment'] == 'negative', 'review/text']

positive_reviews = positive_reviews.dropna()
negative_reviews = negative_reviews.dropna()

subset_reviews = positive_reviews.head(5).reset_index(drop=True)
subset_reviews_n = negative_reviews.head(5).reset_index(drop=True)



STRIPPING THE REVIEWS OF THE STOPWORDS

In [9]:
STOPWORDS = set([
    "the", "and", "is", "in", "it", "this", "that", "was", "for", "to", "of",
    "with", "a", "an", "on", "my", "but", "at", "as", "by", "be", "are", "from", "not"
])


COMPUTE JACCARD SIMILARITY: FIRST TRY

In [10]:
def jaccard_similarity(text1, text2):
    # strip basic punctuation from each word and convert it to lower case
    words1 = set(word.strip(".,!?").lower() for word in text1.split())
    words2 = set(word.strip(".,!?").lower() for word in text2.split())
    words1 = words1 - STOPWORDS
    words2 = words2 - STOPWORDS
    #intersection = words that appear in both texts
    intersection = words1.intersection(words2)
    #union = unique words in both texts
    union = words1.union(words2)
    if not union:
        similarity = 0.0
    else:
        similarity = len(intersection) / len(union)
    return similarity, intersection
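
To make the computation concrete, here is a minimal worked example of the same word-level Jaccard similarity. It is only a sketch: the two sentences are invented (they are not taken from the dataset) and the stopword list is a shortened, assumed version of STOPWORDS so that the block runs on its own.

```python
# Shortened stopword list, assumed here only to keep the example self-contained.
STOPWORDS = {"the", "and", "is", "a", "this", "it", "was"}


def jaccard_similarity(text1, text2):
    # Lowercase, strip basic punctuation, and drop stopwords.
    words1 = {w.strip(".,!?").lower() for w in text1.split()} - STOPWORDS
    words2 = {w.strip(".,!?").lower() for w in text2.split()} - STOPWORDS
    intersection = words1 & words2
    union = words1 | words2
    if not union:
        return 0.0, intersection
    return len(intersection) / len(union), intersection


# Two invented sentences, used for illustration only.
a = "This book is a great read!"
b = "A great book, and a quick read."

sim, common = jaccard_similarity(a, b)
print(sorted(common))  # ['book', 'great', 'read']
print(sim)             # 0.75  (3 shared words / 4 distinct words)
```

After stopword removal the first sentence keeps {book, great, read} and the second keeps {book, great, quick, read}, so the intersection has 3 words and the union has 4.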


LOOP THROUGH ALL REVIEW PAIRS OF THE SUBSETS TO CHECK THE JACCARD SIMILARITY

POSITIVE REVIEWS

In [11]:
positive_pairs = list(combinations(range(len(subset_reviews)), 2))
positive_results = []

for i, j in positive_pairs:
    sim, common_words = jaccard_similarity(
        subset_reviews.loc[i],
        subset_reviews.loc[j]
    )
    positive_results.append({
        'Review 1 index': i,
        'Review 2 index': j,
        'Jaccard similarity': sim,
        'Common words': ', '.join(sorted(common_words)),
    })


In [12]:
positive_similarity_df = pd.DataFrame(positive_results)

print("\nPairwise Jaccard similarities between positive reviews:\n")
print(positive_similarity_df)


Pairwise Jaccard similarities between positive reviews:

   Review 1 index  Review 2 index  Jaccard similarity  \
0               0               1            0.021390   
1               0               2            0.024155   
2               0               3            0.031700   
3               0               4            0.036269   
4               1               2            0.113971   
5               1               3            0.069048   
6               1               4            0.098113   
7               2               3            0.105882   
8               2               4            0.138686   
9               3               4            0.087886   

                                        Common words  
0                                 book, i, like, one  
1                             book, find, i, if, you  
2  better, book, go, however, i, literary, one, o...  
3                about, book, fans, i, if, like, you  
4  all, book, both, care, children, dr,

NEGATIVE REVIEWS

In [13]:
negative_pairs = list(combinations(range(len(subset_reviews_n)), 2))
negative_results = []

for i, j in negative_pairs:
    sim, common_words = jaccard_similarity(
        subset_reviews_n.loc[i],
        subset_reviews_n.loc[j]
    )
    negative_results.append({
        'Review 1 index': i,
        'Review 2 index': j,
        'Jaccard similarity': sim,
        'Common words': ', '.join(sorted(common_words)),
    })


In [14]:
negative_similarity_df = pd.DataFrame(negative_results)

print("\nPairwise Jaccard similarities between negative reviews:\n")
print(negative_similarity_df)


Pairwise Jaccard similarities between negative reviews:

   Review 1 index  Review 2 index  Jaccard similarity  \
0               0               1            0.038889   
1               0               2            0.023810   
2               0               3            0.032258   
3               0               4            0.048193   
4               1               2            0.019608   
5               1               3            0.046512   
6               1               4            0.071942   
7               2               3            0.069444   
8               2               4            0.070588   
9               3               4            0.089286   

                                        Common words  
0           book, can't, i, know, never, quite, some  
1                              book, disappointed, i  
2                             all, book, i, page, so  
3             all, book, good, her, i, read, she, so  
4                                      

FIND THE PAIR WITH THE HIGHEST JACCARD SIMILARITY

In [15]:
max_sim_row = positive_similarity_df.loc[positive_similarity_df['Jaccard similarity'].idxmax()]

# Get the index and the text of the pair
idx1 = int(max_sim_row['Review 1 index'])
idx2 = int(max_sim_row['Review 2 index'])

review1 = subset_reviews[idx1]
review2 = subset_reviews[idx2]

print(f"\nReview {idx1}:\n{review1}")
print(f"\nReview {idx2}:\n{review2}")


Review 2:
If people become the books they read and if "the child is father to the man," then Dr. Seuss (Theodor Seuss Geisel) is the most influential author, poet, and artist of modern times. For me, a daddy to a large family who learned to read with Dr. Seuss and who has memorized too many of the books via repeated readings to young children, Prof. Nel's brilliant 'American Icon' is a long awaited treat. At last a serious treatment of this remarkable genius that is both an engaging read and filled with remarkable insights! I especially enjoyed (and learned more than I care to admit from) Prof. Nel's discussions of the Disneyfication of Seuss - which Nel links to failings in American copyright law, "the other sides of Dr. Seuss" - all of which sides were new to me, and the political genesis of his secular morality in the WWII cartoon work he did at PM magazine. The chapters on Geisel's poetry and artwork and the link Nel makes between Seuss and the historical avant guarde alone make t

In [16]:
# and also the common words
common_words = max_sim_row['Common words']

print(f"\nCommon words: {common_words if common_words else 'None'}")


Common words: -, american, book, children, children's, come, did, dr, good, has, he, him, his, i, if, knowledge, literature, magazine, make, makes, many, me, more, most, nel, new, other, pm, poetry, read, reader, seuss, style, were, will, writing, you, your


In [17]:
#same for the negative reviews
max_sim_row_n = negative_similarity_df.loc[negative_similarity_df['Jaccard similarity'].idxmax()]

idx1_n = int(max_sim_row_n['Review 1 index'])
idx2_n = int(max_sim_row_n['Review 2 index'])

review1_n = subset_reviews_n[idx1_n]
review2_n = subset_reviews_n[idx2_n]

print(f"\nReview {idx1_n}:\n{review1_n}")
print(f"\nReview {idx2_n}:\n{review2_n}")



Review 3:
I guess you have to be a romance novel lover for this one, and not a very discerning one. All others beware! It is absolute drivel. I figured I was in trouble when a typo is prominently featured on the back cover, but the first page of the book removed all doubt. Wait - maybe I'm missing the point. A quick re-read of the beginning now makes it clear. This has to be an intentional churning of over-heated prose for satiric purposes. Phew, so glad I didn't waste $10.95 after all.

Review 4:
I feel I have to write to keep others from wasting their money. This book seems to have been written by a 7th grader with poor grammatical skills for her age! As another reviewer points out, there is a misspelling on the cover, and I believe there is at least one per chapter. For example, it was mentioned twice that she had a "lean" on her house. I was so distracted by the poor writing and weak plot, that I decided to read with a pencil in hand to mark all of the horrible grammar and spellin

In [18]:
common_words_n = max_sim_row_n['Common words']

print(f"\nCommon words: {common_words_n if common_words_n else 'None'}")


Common words: all, book, cover, have, i, now, one, others, so, waste


THE RESULTS OF JACCARD SIMILARITY ARE NOT BAD CONSIDERING THAT ONLY 5 REVIEWS ARE BEING CONSIDERED. SINCE WE HAVE TO ANALYZE A GREATER NUMBER OF REVIEWS, THE RESULTS MAY CHANGE AS MORE WORDS COME INTO PLAY. LET'S STRIP THE REVIEWS OF THOSE WORDS THAT CREATE NOISE AND ARE NOT SIGNIFICANT FOR THE ANALYSIS.

In [19]:
def get_word_counts(texts):
    all_words = []
    for text in texts:
        words = [word.strip(".,!?").lower() for word in text.split()]
        words = [w for w in words if w not in STOPWORDS]
        all_words.extend(words)
    return Counter(all_words)

positive_counts = get_word_counts(positive_reviews)
negative_counts = get_word_counts(negative_reviews)

# show the 20 most common words
print("\n Most common words in positive reviews:")
for word, count in positive_counts.most_common(20):
    print(f"{word}: {count}")

print("\n Most common words in negative reviews:")
for word, count in negative_counts.most_common(20):
    print(f"{word}: {count}")


 Most common words in positive reviews:
i: 51443
book: 43229
you: 22544
read: 18735
his: 18523
he: 17415
have: 16910
one: 14886
all: 13063
about: 11962
who: 11376
has: 11220
so: 10372
or: 10285
they: 10244
what: 9777
her: 9584
story: 9362
more: 9268
will: 9116

 Most common words in negative reviews:
i: 16469
book: 11982
have: 5046
you: 4921
read: 4114
he: 3960
his: 3931
about: 3395
one: 3386
or: 3219
if: 3139
all: 3003
so: 2991
would: 2845
like: 2687
her: 2661
more: 2660
what: 2568
who: 2507
they: 2498


ADDITIONAL STOPWORDS

In [20]:
custom_stopwords = set([
    "the", "and", "is", "in", "it", "this", "that", "was", "for", "to", "of",
    "with", "a", "an", "on", "my", "but", "at", "as", "by", "be", "are", "from",
    "not", "did", "has", "have", "you", "your", "he", "his", "her", "him",
    "they", "them", "we", "us", "our", "i", "me", "book", "read", "review",
    "author", "writing", "will", "one", "also", "many", "more", "all", "so",
    "what", "who", "or", "if", "like", "would", "about", "story"
])

ADDITIONALLY, LET'S IMPLEMENT THIS BASIC TOKENIZATION FUNCTION WITH ADDED BIGRAMS.
THIS HELPS CAPTURE WORD PAIRS THAT MAY CARRY STRONGER OR MORE SPECIFIC MEANING THAN SINGLE WORDS.

In [21]:
def clean_text(text):
    text = html.unescape(text)
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()
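
As a quick check of what clean_text does, the sketch below runs it on an invented string that mimics the HTML entities visible in the raw reviews (for example `&quot;` and `&amp;`). Note that, as the notebook is written, the tokenizer in the next cell does not call clean_text; whether to apply it before tokenizing is left as an open choice.

```python
import html
import re


def clean_text(text):
    text = html.unescape(text)                   # "&quot;" -> '"', "&amp;" -> "&"
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # any other symbol -> space
    text = re.sub(r"\s+", " ", text)             # collapse repeated whitespace
    return text.strip().lower()


# Invented input, for illustration only.
raw = "aka &quot;Dr. Seuss&quot; &amp; friends!"
cleaned = clean_text(raw)
print(cleaned)  # aka dr seuss friends
```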


In [22]:
def get_tokens(text):
    words = [word.strip(".,!?").lower() for word in str(text).split() if word.strip()]
    words = [w for w in words if w not in custom_stopwords]
    tokens = words.copy()
    # bigrams
    tokens += [f"{words[i]} {words[i+1]}" for i in range(len(words)-1)]
    return tokens
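
A small example helps to see which tokens get_tokens produces. This is a sketch with an invented sentence and a reduced, assumed stopword set (the real custom_stopwords is longer), so the block is runnable on its own.

```python
# Reduced stopword set, assumed only for this example.
custom_stopwords = {"the", "a", "is", "this", "i", "book"}


def get_tokens(text):
    words = [word.strip(".,!?").lower() for word in str(text).split() if word.strip()]
    words = [w for w in words if w not in custom_stopwords]
    tokens = words.copy()
    # Bigrams are built from adjacent words that survive stopword removal.
    tokens += [f"{words[i]} {words[i+1]}" for i in range(len(words) - 1)]
    return tokens


tokens = get_tokens("This book is a truly wonderful adventure!")
print(tokens)
# ['truly', 'wonderful', 'adventure', 'truly wonderful', 'wonderful adventure']
```

Because stopwords are removed before the bigrams are formed, a bigram can join two words that were separated by a stopword in the original text.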


THIS FUNCTION COMPUTES THE MOST FREQUENT TOKENS (UNIGRAMS + BIGRAMS) ACROSS THE REVIEWS.

In [23]:
# Limit each class to its first 1000 reviews
positive_reviews = positive_reviews.iloc[:1000]
negative_reviews = negative_reviews.iloc[:1000]

def get_counts(texts):
    counter = Counter()
    for text in texts:
        tokens = get_tokens(text)
        counter.update(tokens)
    return counter

positive_counts = get_counts(positive_reviews)
negative_counts = get_counts(negative_reviews)

print(f"\nPositive token total: {sum(positive_counts.values())}")
print(f"Negative token total: {sum(negative_counts.values())}")

# 20 most frequent tokens in each class
print("\n Top 20 positive reviews tokens:")
for word, count in positive_counts.most_common(20):
    print(f"{word}: {count}")

print("\n Top 20 negative reviews tokens:")
for word, count in negative_counts.most_common(20):
    print(f"{word}: {count}")



Positive token total: 133292
Negative token total: 171066

 Top 20 positive reviews tokens:
books: 359
she: 359
great: 311
very: 310
can: 276
just: 273
love: 273
when: 253
some: 252
there: 252
out: 241
only: 239
up: 239
other: 233
first: 233
their: 227
had: 220
good: 217
well: 213
how: 211

 Top 20 negative reviews tokens:
she: 457
very: 439
there: 423
some: 374
just: 374
out: 344
had: 341
no: 335
good: 326
up: 326
when: 314
much: 305
can: 303
reading: 300
time: 300
only: 297
which: 288
other: 287
been: 273
even: 273


CREATE SUBSET OF 200 TO TEST JACCARD SIMILARITY

In [24]:
# positive reviews subset
positive_reviews = df.loc[df["review/score"] > 3, "review/text"].dropna().reset_index(drop=True)
subset_size = 200
subset_reviews = positive_reviews.iloc[:subset_size]

#tokens
tokens_list = [set(get_tokens(text)) for text in subset_reviews]

# Jaccard_similarity
def jaccard_similarity(tokens1, tokens2):
    intersection = tokens1 & tokens2
    union = tokens1 | tokens2
    # Check for an empty union before dividing to avoid a ZeroDivisionError
    if not union:
        return 0.0, intersection
    similarity = len(intersection) / len(union)
    return similarity, intersection

# combinations
pairs = list(combinations(range(len(tokens_list)), 2))
results = []
threshold = 0.2

for i, j in pairs:
    sim, common = jaccard_similarity(tokens_list[i], tokens_list[j])
    if sim >= threshold:
        common_bigrams = [tok for tok in common if ' ' in tok]
        results.append({
            'Review 1 index': i,
            'Review 2 index': j,
            'Jaccard similarity': sim,
            'Review 1 text': subset_reviews.iloc[i][:200],
            'Review 2 text': subset_reviews.iloc[j][:200]
        })

# dataframe with similar reviews
similar_df = pd.DataFrame(results)

if similar_df.empty:
    print("\n no similar pairs of reviews found.")
else:
    pd.set_option('display.max_colwidth', None)
    print("\n similar pairs of reviews found:")
    print(similar_df.sort_values(by='Jaccard similarity', ascending=False))



 similar pairs of reviews found:
   Review 1 index  Review 2 index  Jaccard similarity  \
0             132             134            0.987421   
2             174             175            0.761905   
1             160             163            0.554688   

                                                                                                                                                                                              Review 1 text  \
0  Kurt Seligmann, Surrealist artist par excellence, admitted &amp; unashamed bibliophile, has ravaged his occult library in a miraculous marriage giving birth to this classic historical account of Magic   
2  Wonderful! Karen Cummings writes a book that tells you all the basics in cat care and tailors it to the Birman breed. The excellent photographs enable the reader to see what a Birmans really looks lik   
1  Dr Baker explains clearly and engagingly how one can improve one's life by changing your subconscious pattern thr

In [25]:
negative_reviews = df[df["review/score"] <= 3]["review/text"].dropna().reset_index(drop=True)
subset_size = 200
subset_reviews_neg = negative_reviews.iloc[:subset_size]

tokens_list_neg = [set(get_tokens(text)) for text in subset_reviews_neg]

pairs_neg = list(combinations(range(len(tokens_list_neg)), 2))
results_neg = []
threshold = 0.2

for i, j in pairs_neg:
    sim, common = jaccard_similarity(tokens_list_neg[i], tokens_list_neg[j])
    if sim >= threshold:
        common_bigrams = [tok for tok in common if ' ' in tok]
        results_neg.append({
            'Review 1 index': i,
            'Review 2 index': j,
            'Jaccard similarity': sim,
            'Review 1 text': subset_reviews_neg.iloc[i][:200],
            'Review 2 text': subset_reviews_neg.iloc[j][:200]
        })


similar_df_neg = pd.DataFrame(results_neg)

if not similar_df_neg.empty:
    print("\n Similar pairs of negative reviews:")
    print(similar_df_neg.sort_values(by='Jaccard similarity', ascending=False))
else:
    print(" No similar pairs found in negative reviews.")


 Similar pairs of negative reviews:
   Review 1 index  Review 2 index  Jaccard similarity  \
0              76              77                 1.0   
1              91              92                 1.0   
2             122             124                 1.0   

                                                                                                                                                                                              Review 1 text  \
0  Unless you are under obligation to read this for some sort of class, I would not recomend wasting yout time trying to wade through the quagmire of redundantly long, boring text. If I could pay attenti   
1  Generally speaking a great book if you are not familiar with management accounting and turning heaps of information into valuable reports for your top management group. No doubt about it: This book wi   
2  This book was a bit different for me perhaps because of its historical setting. It started out well, but i soo

SINCE I SPOTTED SOME DUPLICATES, I INCLUDE A SIMILARITY THRESHOLD TO IDENTIFY DUPLICATES

In [26]:
duplicate_threshold = 0.95
similarity_threshold = 0.2
results = []
duplicates = set()


for i, j in pairs:
    sim, common = jaccard_similarity(tokens_list[i], tokens_list[j])
    if sim >= similarity_threshold:
        common_bigrams = [tok for tok in common if " " in tok]
        results.append({
            'Review 1 index': i,
            'Review 2 index': j,
            'Jaccard similarity': sim,
            'Review 1 text': subset_reviews.iloc[i][:200],
            'Review 2 text': subset_reviews.iloc[j][:200]
        })
    if sim >= duplicate_threshold:
        duplicates.add(j)

#similar pairs
similar_df = pd.DataFrame(results)

if similar_df.empty:
    print("\n No similar pair found.")
else:
    pd.set_option('display.max_colwidth', None)
    print("\n Pair of similar reviews:")
    print(similar_df.sort_values(by='Jaccard similarity', ascending=False))

# duplicates
if duplicates:
    print(f"\n indexes of duplicates {sorted(duplicates)}")

    print("\n Duplicates:")
    for j in sorted(duplicates):
        print(f"\nReview {j}:\n{subset_reviews.iloc[j][:200]}")
else:
    print("\n No duplicates >= 0.95.")

# delete duplicates
subset_reviews_cleaned = subset_reviews.drop(index=duplicates).reset_index(drop=True)
print(f"\n number of reviews after duplicate removal: {len(subset_reviews_cleaned)}")


 Pair of similar reviews:
   Review 1 index  Review 2 index  Jaccard similarity  \
0             132             134            0.987421   
2             174             175            0.761905   
1             160             163            0.554688   

                                                                                                                                                                                              Review 1 text  \
0  Kurt Seligmann, Surrealist artist par excellence, admitted &amp; unashamed bibliophile, has ravaged his occult library in a miraculous marriage giving birth to this classic historical account of Magic   
2  Wonderful! Karen Cummings writes a book that tells you all the basics in cat care and tailors it to the Birman breed. The excellent photographs enable the reader to see what a Birmans really looks lik   
1  Dr Baker explains clearly and engagingly how one can improve one's life by changing your subconscious pattern through th

SINCE THERE ARE STILL SOME DUPLICATES, I LOWER THE THRESHOLD.
IT SEEMS LIKE I FOUND AN INTERVAL IN WHICH THE SIMILARITY IS ENOUGH TO CAPTURE THE SIMILAR PAIRS BUT EXCLUDE THE DUPLICATES
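
The interval idea can be sketched as a small helper that places a Jaccard score into one of three bands. The function name `classify_pair` and its parameter names are hypothetical; the bounds 0.2 and 0.7 are the ones used in the filtering cells, and the three scores are those printed for the positive subset above.

```python
def classify_pair(sim, similar_low=0.2, duplicate_high=0.7):
    """Place a Jaccard score into a band: duplicate, similar, or different."""
    if sim > duplicate_high:
        return "duplicate"   # above the interval: treated as a (near-)duplicate
    if sim >= similar_low:
        return "similar"     # inside the interval: worth a manual look
    return "different"       # below the interval: not reported


# Scores taken from the positive-review output above.
for score in (0.987421, 0.761905, 0.554688):
    print(f"{score:.3f} -> {classify_pair(score)}")
```

Under these bounds only the pair with similarity 0.555 falls inside the interval, which matches the single pair reported by the filter.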

In [27]:
high_sim_df = similar_df[
    (similar_df['Jaccard similarity'] >= 0.2) &
    (similar_df['Jaccard similarity'] <= 0.7)
]

if not high_sim_df.empty:
    print("\n Reviews with Jaccard similarity between 0.2 and 0.7:")
    for _, row in high_sim_df.iterrows():
        i = row['Review 1 index']
        j = row['Review 2 index']
        sim = row['Jaccard similarity']
        print(f"\n--- Similarity: {sim:.3f} ---")
        print(f"Review {i}:\n{subset_reviews.iloc[i][:200]}\n")
        print(f"Review {j}:\n{subset_reviews.iloc[j][:200]}\n")
else:
    print(" No review pairs with similarity in the 0.2–0.7 range.")



 Reviews with Jaccard similarity between 0.2 and 0.7:

--- Similarity: 0.555 ---
Review 160:
Dr Baker explains clearly and engagingly how one can improve one's life by changing your subconscious pattern through the spiritual technique called treatment. The essence of treatment is this: When t

Review 163:
Dr Baker was one of those great 20th century metaphysicians like Emmet Fox, Ernest Holmes &Thomas Troward, who understood the working of the mind long before psychotherapy became popular. This approac



In [28]:
high_sim_neg_df = similar_df_neg[
    (similar_df_neg['Jaccard similarity'] >= 0.2) &
    (similar_df_neg['Jaccard similarity'] <= 0.7)
]

if not high_sim_neg_df.empty:
    print("\n Reviews with Jaccard similarity between 0.2 and 0.7:")
    for _, row in high_sim_neg_df.iterrows():
        i = row['Review 1 index']
        j = row['Review 2 index']
        sim = row['Jaccard similarity']
        print(f"\n--- Similarity: {sim:.3f} ---")
        print(f"Review {i}:\n{subset_reviews_neg.iloc[i][:200]}\n")
        print(f"Review {j}:\n{subset_reviews_neg.iloc[j][:200]}\n")
else:
    print(" No review pairs with similarity in the 0.2–0.7 range.")


 No review pairs with similarity in the 0.2–0.7 range.


MAIN PART OF THE CODE:

1. Extracting a subset of positive reviews (with scores > 3).
2. Tokenizing each review by removing stopwords and punctuation, then generating unigrams and bigrams.
3. Computing Jaccard similarity between all unique review pairs.
4. Identifying:
   - Pairs of reviews with high similarity (similarity ≥ 0.2).
   - Near-duplicate reviews (similarity ≥ 0.7, lowered from the 0.95 used in the first attempt).
5. Removing duplicates from the dataset.
6. Displaying reviews with moderate similarity (0.2 ≤ similarity ≤ 0.7) for manual inspection.
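
The steps above can be sketched end to end on a handful of invented reviews. This is a minimal illustration, not the notebook's exact code: `tokens_of`, `jaccard`, `find_pairs` and the reduced stopword set `STOP` are hypothetical names, and the toy reviews are not from the dataset.

```python
from itertools import combinations

# Reduced stopword set, assumed for illustration.
STOP = {"the", "a", "is", "this", "i", "it", "and"}


def tokens_of(text):
    # Step 2: strip punctuation, drop stopwords, add bigrams.
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return set(words + bigrams)


def jaccard(t1, t2):
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0


def find_pairs(reviews, similarity_threshold=0.2, duplicate_threshold=0.7):
    token_sets = [tokens_of(r) for r in reviews]
    similar, duplicates = [], set()
    # Step 3: compare every unique pair once.
    for i, j in combinations(range(len(token_sets)), 2):
        sim = jaccard(token_sets[i], token_sets[j])
        if sim >= similarity_threshold:      # step 4a: similar pair
            similar.append((i, j, sim))
        if sim >= duplicate_threshold:       # step 4b: near-duplicate
            duplicates.add(j)                # keep the earlier review, mark the later one
    return similar, duplicates


# Invented reviews, for illustration only.
toy_reviews = [
    "A wonderful story with great characters.",
    "A wonderful story with great characters!",
    "Great characters and a wonderful story, but slow.",
    "Terrible pacing and flat dialogue.",
]

similar, duplicates = find_pairs(toy_reviews)
# Step 5: drop the reviews marked as duplicates.
kept = [r for k, r in enumerate(toy_reviews) if k not in duplicates]

for i, j, sim in similar:
    print(f"{i} & {j}: {sim:.3f}")
print("duplicates:", sorted(duplicates))
print("kept:", len(kept))
```

Reviews 0 and 1 differ only in punctuation, so they score 1.0 and review 1 is marked as a duplicate; review 2 shares some words and bigrams with both and lands in the 0.2 to 0.7 band.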

In [29]:
def get_tokens(text):
    words = [word.strip(".,!?").lower() for word in str(text).split() if word.strip()]
    words = [w for w in words if w not in custom_stopwords]
    tokens = words.copy()
    tokens += [f"{words[i]} {words[i+1]}" for i in range(len(words)-1)]
    return tokens


# Positive reviews subset
positive_reviews = df.loc[df["review/score"] > 3, "review/text"].dropna().reset_index(drop=True)
subset_size = 1000
subset_reviews = positive_reviews.iloc[:subset_size]

tokens_list = [set(get_tokens(text)) for text in subset_reviews]

# Jaccard similarity function
def jaccard_similarity(tokens1, tokens2):
    intersection = tokens1 & tokens2
    union = tokens1 | tokens2
    if not union:
        return 0.0, intersection
    similarity = len(intersection) / len(union)
    return similarity, intersection

# Combinations
pairs = list(combinations(range(len(tokens_list)), 2))

similarity_threshold = 0.2
duplicate_threshold = 0.7

results = []
duplicates = set()

# Process all pairs
for i, j in pairs:
    sim, common = jaccard_similarity(tokens_list[i], tokens_list[j])

    if sim >= similarity_threshold:
        results.append({
            'Review 1 index': i,
            'Review 2 index': j,
            'Jaccard similarity': sim,
          })

    if sim >= duplicate_threshold:
        duplicates.add(j)

similar_df = pd.DataFrame(results)

#  similar pairs
if similar_df.empty:
    print("\nNo similar pairs of reviews found.")
else:
    pd.set_option('display.max_colwidth', None)
    print(similar_df.sort_values(by='Jaccard similarity', ascending=False))

if duplicates:
    print(f"\nIndexes of duplicates: {sorted(duplicates)}")
else:
    print("\nNo duplicates with similarity ≥ 0.7.")

# Remove duplicates
subset_reviews_cleaned = subset_reviews.drop(index=duplicates).reset_index(drop=True)
print(f"\nNumber of reviews after duplicate removal: {len(subset_reviews_cleaned)}")

# Jaccard similarity between 0.2 and 0.7
high_sim_df = similar_df[
    (similar_df['Jaccard similarity'] >= 0.2) &
    (similar_df['Jaccard similarity'] <= 0.7)
]

if not high_sim_df.empty:
    print("\nReviews with Jaccard similarity between 0.2 and 0.7:")
    print(high_sim_df.sort_values(by='Jaccard similarity', ascending=False))
else:
    print("No review pairs with similarity in the 0.2–0.7 range.")


    Review 1 index  Review 2 index  Jaccard similarity
4              215             220            1.000000
13             737             741            1.000000
3              207             210            1.000000
6              357             358            1.000000
5              351             352            1.000000
8              615             664            1.000000
15             794             795            1.000000
16             866             867            1.000000
9              617             618            1.000000
0              132             134            0.987421
7              567             568            0.784314
2              174             175            0.761905
1              160             163            0.554688
11             711             786            0.433962
14             749             857            0.338028
10             646             678            0.250000
12             722             935            0.200000

Indexes o

SAME FUNCTIONS IMPLEMENTED FOR THE NEGATIVE AND WIDER SUBSET.

In [30]:
def get_tokens(text):
    words = [word.strip(".,!?").lower() for word in str(text).split() if word.strip()]
    words = [w for w in words if w not in custom_stopwords]
    tokens = words.copy()
    tokens += [f"{words[i]} {words[i+1]}" for i in range(len(words)-1)]
    return tokens

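# Negative reviews subset
# Note: this cell selects score < 3, so 3-star reviews (labelled negative earlier via score <= 3) are excluded here.
negative_reviews = df.loc[df["review/score"] < 3, "review/text"].dropna().reset_index(drop=True)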
subset_size = 10000
subset_reviews = negative_reviews.iloc[:subset_size]

# Tokenization
tokens_list = [set(get_tokens(text)) for text in subset_reviews]

# Jaccard similarity function
def jaccard_similarity(tokens1, tokens2):
    intersection = tokens1 & tokens2
    union = tokens1 | tokens2
    if not union:
        return 0.0, intersection
    similarity = len(intersection) / len(union)
    return similarity, intersection


pairs = list(combinations(range(len(tokens_list)), 2))

similarity_threshold = 0.2
duplicate_threshold = 0.7


results = []
duplicates = set()

# Compare all pairs
for i, j in pairs:
    sim, common = jaccard_similarity(tokens_list[i], tokens_list[j])

    if sim >= similarity_threshold:
        results.append({
            'Review 1 index': i,
            'Review 2 index': j,
            'Jaccard similarity': sim,
        })

    if sim >= duplicate_threshold:
        duplicates.add(j)


similar_df = pd.DataFrame(results)

# results
if similar_df.empty:
    print("\nNo similar pairs of reviews found.")
else:
    print("\nSimilar pairs of reviews found:")
    print(similar_df.sort_values(by='Jaccard similarity', ascending=False))

#duplicates
if duplicates:
    print(f"\nIndexes of duplicates: {sorted(duplicates)}")
else:
    print(f"\nNo duplicates with similarity ≥ {duplicate_threshold}.")

# Remove duplicates
subset_reviews_cleaned = subset_reviews.drop(index=duplicates).reset_index(drop=True)
print(f"\nNumber of reviews after duplicate removal: {len(subset_reviews_cleaned)}")

# Jaccard similarity between 0.2 and 0.7
high_sim_df = similar_df[
    (similar_df['Jaccard similarity'] >= 0.2) &
    (similar_df['Jaccard similarity'] <= 0.7)
]

if not high_sim_df.empty:
    print("\nReview index pairs with Jaccard similarity between 0.2 and 0.7:")
    print(high_sim_df.sort_values(by='Jaccard similarity', ascending=False))
else:
    print("No review pairs with similarity in 0.2–0.7 range.")


Similar pairs of reviews found:
     Review 1 index  Review 2 index  Jaccard similarity
0                43            1451            1.000000
1                44              45            1.000000
2                54              55            1.000000
3               170            1468            1.000000
4               171            1469            1.000000
..              ...             ...                 ...
106            2312            2415            0.464968
87             1236            1247            0.414729
8               310            2792            0.200000
10              371            2792            0.200000
93             1456            2792            0.200000

[125 rows x 3 columns]

Indexes of duplicates: [45, 55, 371, 442, 443, 444, 536, 849, 851, 881, 1078, 1255, 1297, 1298, 1299, 1300, 1301, 1302, 1303, 1304, 1305, 1306, 1307, 1308, 1309, 1310, 1311, 1312, 1313, 1314, 1315, 1316, 1317, 1318, 1319, 1320, 1321, 1322, 1323, 1324, 1325, 1326, 1327, 

TO HELP WITH THE VISUALIZATION OF THE RESULTS

In [31]:

def print_filtered_similarities(df, reviews):
    filtered_df = df[
        (df['Jaccard similarity'] >= 0.2) &
        (df['Jaccard similarity'] <= 0.7)
    ]

    if filtered_df.empty:
        print("\nNo review pairs with similarity in the 0.2–0.7 range.")
        return

    filtered_df = filtered_df.sort_values(by='Jaccard similarity', ascending=False)
    print("\nSimilar review pairs (with text):")

    for _, row in filtered_df.iterrows():
        i = int(row['Review 1 index'])
        j = int(row['Review 2 index'])
        sim = row['Jaccard similarity']

        review_i = str(reviews.iloc[i])[:300].replace('\n', ' ')
        review_j = str(reviews.iloc[j])[:300].replace('\n', ' ')

        print(f"\nSimilarity: {sim:.3f} | Reviews {i} & {j}")
        print(f"Review {i}: {review_i}...")
        print(f"Review {j}: {review_j}...")


print_filtered_similarities(similar_df, subset_reviews)



Similar review pairs (with text):

Similarity: 0.601 | Reviews 876 & 877
Review 876: If I hadnt actually lived in Japan i could see how i could mistake this thing for authoritive, but it amazes me that anyone who has lived out here more than a year could see this as much more than the bag of wind it is. With its pretentious title and lofty quotations of translated haikus, Feiler pro...
Review 877: I can understand how people who haven't lived in Japan could mistake this book as authoritive, but it amazes me that anyone who has lived out here more than a year could see this as much more than the bag of wind it is. With its pretentious title and lofty quotations of translated haikus, Feiler pro...

Similarity: 0.521 | Reviews 2821 & 2822
Review 2821: I bought this book with the honest wish to encounter a real criticism of Nietzsche's thought. I believe it is in the interest of us all, and especially of us Americans, for Socialism to take on this Herculean critic of itself and of the 'he

In [33]:
print("Full Review 2312:\n")
print(subset_reviews.iloc[2312])

print("\nFull Review 2415:\n")
print(subset_reviews.iloc[2415])

Full Review 2312:

Marquez' book takes the reader through the lives and times of what I hope is an atypical Columbian (?) family, covering from roughly the 1830s to the 1930s. It was fun in parts, but oddly repetitive, with a style that seemed half magic realism, half Freud. And, of course, the standard enervating anti-American diatribes were thrown in. I guess a Latin American author even in magic realism mode can't overcome his predilictions.

Full Review 2415:

Marquez' book takes the reader through the lives and times of what *may* be an atypical Colombian (?) family, covering from roughly the 1830s to the 1930s. It was fun in parts, but oddly repetitive, with a style that seemed half magic realism, half Freud. And, of course, the standard enervating anti-American diatribes were thrown in. I guess a Latin American author even in magic realism mode can't overcome his predilections.And speaking of predilections, Marquez, along with some other of intellectualdom's best and brightest, 