# Duplicate Reviews

Detects and removes duplicate reviews, defined as multiple reviews that share the same user/product pair.

Although it seems reasonable to assume that reviews have a primary key that automatically prevents duplication, our review datasets do not include unique identifiers for reviews. They also include anonymous reviews with no user id, so we cannot use the combination of user id and product id as a unique identifier.

It is unclear exactly how duplicate user reviews got into the database in the first place. One likely explanation is that some Amazon products are classified under multiple categories and therefore appear in more than one of the review datasets in our McAuley Amazon Review data.

In [43]:
import os

import pandas as pd
import sqlite3 as sql

conn = sql.connect('../products.sql')

In [27]:
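# To verify data before and after deletion, it is very useful to check total review counts before committing changes:
def total_review_count() -> int:
    # Cast to a plain int so the return value matches the annotation (pandas returns a NumPy integer).
    return int(pd.read_sql("SELECT COUNT(*) FROM review", conn).iloc[0, 0])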

total_reviews = total_review_count()
total_reviews

7501225

In [25]:
# This query is built for speed: it lists duplicated user/product pairs without fetching the review text
def find_duplicate_reviews() -> pd.DataFrame:
    return pd.read_sql("""
    SELECT user_id, product_id, COUNT(*) AS count
    FROM review
    WHERE user_id IS NOT NULL
    GROUP BY user_id, product_id
    HAVING COUNT(*) > 1
    """, conn)

dupes = find_duplicate_reviews()
dupe_review_count = dupes['count'].sum()
print(dupe_review_count, 'duplicate reviews')
dupes

161382 duplicate reviews


Unnamed: 0,user_id,product_id,count
0,A02660181QI9HHAVFK06O,B000NDSX6C,2
1,A03816223LL3Q1P48HRU,B000NDSX6C,2
2,A07532193EC6QULM9ZSPJ,B000GQG5MA,2
3,A08604952BQMNOFP9OIUX,B000GQG5MA,2
4,A08854851CPI1JRVJJQVQ,B000GQG5MA,2
...,...,...,...
72196,AZY5BVFFFZ9P2,B000HDF5GE,2
72197,AZYH1MWUQC6RQ,0595165990,2
72198,AZYJI5WR8KWEE,0613916301,2
72199,AZYYHHWSLTDVS,0613103572,2
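
As a minimal sketch of what the query above does, the following runs the same `GROUP BY` / `HAVING` pattern against a throwaway in-memory table. The rows and the connection name `toy` are made up for illustration; only the column names come from the real `review` table.

```python
import sqlite3

# Hypothetical toy table with only the columns the query needs.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE review (user_id TEXT, product_id TEXT, rating INTEGER)")
toy.executemany(
    "INSERT INTO review VALUES (?, ?, ?)",
    [
        ("u1", "p1", 5),
        ("u1", "p1", 5),  # same user/product pair twice: a duplicate
        ("u2", "p1", 4),
        (None, "p1", 3),  # anonymous reviews are excluded by the WHERE clause
        (None, "p1", 3),
    ],
)

rows = toy.execute("""
    SELECT user_id, product_id, COUNT(*) AS n
    FROM review
    WHERE user_id IS NOT NULL
    GROUP BY user_id, product_id
    HAVING COUNT(*) > 1
""").fetchall()
toy.close()

print(rows)  # [('u1', 'p1', 2)]
```

The two anonymous rows share a product but are not reported, because without a user id there is no way to tell whether they come from the same reviewer.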


In [18]:
# Let's inspect an example duplicate review
pd.read_sql("SELECT * FROM review WHERE user_id = ? AND product_id = ?", conn, params = ['A02660181QI9HHAVFK06O', 'B000NDSX6C'])

Unnamed: 0,user_id,product_id,title,review,rating,upvotes,downvotes,timestamp
0,A02660181QI9HHAVFK06O,B000NDSX6C,Excellent read!,I thoroughly enjoyed reading The Hobbit. Mr. B...,5,0,0,1360281600
1,A02660181QI9HHAVFK06O,B000NDSX6C,Excellent read!,I thoroughly enjoyed reading The Hobbit. Mr. B...,5,0,0,1360281600


In [None]:
# This variation retrieves the full rows for inspection, but is somewhat slower.
# Note that the length matches the dupe_review_count computed earlier.
# CONCAT requires SQLite 3.44 or newer.
dupe_reviews = pd.read_sql("""
SELECT * FROM review WHERE CONCAT(user_id, product_id) IN (
  SELECT CONCAT(user_id, product_id) FROM review
  WHERE user_id IS NOT NULL
  GROUP BY user_id, product_id
  HAVING COUNT(*) > 1
)
""", conn)
dupe_reviews


Unnamed: 0,user_id,product_id,title,review,rating,upvotes,downvotes,timestamp
0,AMKC1EJBUXDS2,0517150328,a bibliophiliac wedding of the SURREALIST &amp...,"Kurt Seligmann, Surrealist artist par excellen...",5,12,1,989107200
1,AMKC1EJBUXDS2,0517150328,a bibliophiliac wedding of the SURREALIST &amp...,"Kurt Seligmann, Surrealist artist par excellen...",5,2,0,989107200
2,A1RJD10TTI568L,B0007DVHU2,CLEAR AND INSPIRING EXPLANATION OF SPIRITUAL T...,Dr Baker explains clearly and engagingly how o...,5,36,1,973987200
3,A1RJD10TTI568L,B0007DVHU2,Harnessing thought-power,Dr Baker was one of those great 20th century m...,5,4,0,1239321600
4,A332U346E9T5PU,B0000630MU,Not as good as the 2nd Edition,This is a great book but I didn't think it was...,4,4,0,991267200
...,...,...,...,...,...,...,...,...
161377,AHPPLXCO2B6T67NCMCN4ZKL5DLFA,B07F9VF1Z8,Broken and no photocard,Got it broken and no photocard :(,1,0,0,1624931440608
161378,AGSR2QHPWTDGLFQUNPYMIGBYYBMA,B00E3M477S,Case was broken!!!,The case was broken!!,3,0,0,1641697511695
161379,AGSR2QHPWTDGLFQUNPYMIGBYYBMA,B00E3M477S,Case was broken!!!,The case was broken!!,3,0,0,1641697511695
161380,AEVPTNKJV4NGZPAIDBT3XZMXTRHQ,B091PV3SHZ,Llego sin ninguna proteccion,El album llego bien pero aun asi no traia ning...,3,1,0,1622579146841
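
The output above suggests that not every duplicated pair is an exact copy: some rows differ only in their vote counts, while others have different titles and timestamps. One possible way to tell these cases apart, sketched here on hypothetical in-memory data, is to count the distinct timestamps within each pair. The rows and the name `sample` are assumptions for illustration, and this check is not part of the cleanup performed in this notebook.

```python
import sqlite3

# Hypothetical data: one exact copy and one pair of genuinely different reviews.
sample = sqlite3.connect(":memory:")
sample.execute(
    "CREATE TABLE review (user_id TEXT, product_id TEXT, title TEXT, timestamp INTEGER)"
)
sample.executemany(
    "INSERT INTO review VALUES (?, ?, ?, ?)",
    [
        ("u1", "p1", "Great", 100),
        ("u1", "p1", "Great", 100),        # exact copy
        ("u2", "p2", "First take", 100),
        ("u2", "p2", "Second take", 200),  # same pair, written at a different time
    ],
)

pairs = sample.execute("""
    SELECT user_id, product_id,
           COUNT(*) AS copies,
           COUNT(DISTINCT timestamp) AS distinct_timestamps
    FROM review
    WHERE user_id IS NOT NULL
    GROUP BY user_id, product_id
    HAVING COUNT(*) > 1
    ORDER BY user_id, product_id
""").fetchall()
sample.close()

print(pairs)  # [('u1', 'p1', 2, 1), ('u2', 'p2', 2, 2)]
```

A pair with more than one distinct timestamp is likely a second review rather than a re-ingested copy, so it may deserve a manual look before it is collapsed into one row.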


In [26]:
dupe_reviews.drop_duplicates(subset = ['user_id', 'product_id'], inplace = True)
dupe_reviews

Unnamed: 0,user_id,product_id,title,review,rating,upvotes,downvotes,timestamp
0,AMKC1EJBUXDS2,0517150328,a bibliophiliac wedding of the SURREALIST &amp...,"Kurt Seligmann, Surrealist artist par excellen...",5,12,1,989107200
2,A1RJD10TTI568L,B0007DVHU2,CLEAR AND INSPIRING EXPLANATION OF SPIRITUAL T...,Dr Baker explains clearly and engagingly how o...,5,36,1,973987200
4,A332U346E9T5PU,B0000630MU,Not as good as the 2nd Edition,This is a great book but I didn't think it was...,4,4,0,991267200
6,A2A9BTNYLA9EA0,050552421X,Wonderful!!!,Wonderful book! Very few fans of Feehan's Dark...,5,0,0,984528000
7,A1PURG5ASALH79,050552421X,a good start for a new series,I first started reading mrs. feehan thru her w...,4,0,0,984182400
...,...,...,...,...,...,...,...,...
161372,AFGM33DK3PDVJ5FF62GMCQ47YAVA,B00190J6HI,Don't go to heaven without watching this!!!,Absolutely fantastic! The awesome greatness of...,5,0,0,1222413214000
161373,AFGM33DK3PDVJ5FF62GMCQ47YAVA,B0018O3P54,Indescribable indeed,The message is challenging and it put many Scr...,5,0,0,1220495127000
161376,AHPPLXCO2B6T67NCMCN4ZKL5DLFA,B07F9VF1Z8,Broken and no photocard,Got it broken and no photocard :(,1,0,0,1624931440608
161378,AGSR2QHPWTDGLFQUNPYMIGBYYBMA,B00E3M477S,Case was broken!!!,The case was broken!!,3,0,0,1641697511695


In [None]:
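# Delete the dupes!

# But first, some precautions:

# In case anything goes wrong, save one copy of each duplicated review to a temporary file so that it can be restored if needed.
temp_file = 'duplicate_reviews_temp.csv'
dupe_reviews.to_csv(temp_file, index = False)

# Use a transaction so that results can be verified before committing changes.
# Note: the DELETE removes every row of each duplicated user/product pair, including the copy we want to keep;
# the kept copies in dupe_reviews are re-inserted in a later cell.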
# Performance: 161K records @ 1min 13s = 2.2K records / sec
conn.execute("BEGIN TRANSACTION;")
print('Removing duplicates...')
conn.execute("""
DELETE FROM review WHERE CONCAT(user_id, product_id) IN (
  SELECT CONCAT(user_id, product_id) FROM review
  WHERE user_id IS NOT NULL
  GROUP BY user_id, product_id
  HAVING COUNT(*) > 1
)
""")
print('Finished removing duplicates')


Removing duplicates...
Finished removing duplicates


In [30]:
# verify dupes were removed:
new_dupes = find_duplicate_reviews()
assert new_dupes.empty
new_dupes

Unnamed: 0,user_id,product_id,count


In [None]:
# verify nothing else was removed:
new_total = total_review_count()
deletions = total_reviews - new_total
assert deletions == dupe_review_count
deletions

161382


In [None]:
# If all assertions so far have passed, we are ready to commit the deletions. Let's add one more fail-safe though:
if deletions == dupe_review_count:
    conn.commit()
else:
    conn.rollback()


In [40]:
# Don't forget to restore one copy of each duplicated review (the DELETE above removed every copy).
# The data size in this case is small enough to use DataFrame.to_sql instead of our ingest utilities designed for very large batches.
dupe_reviews.to_sql('review', conn, if_exists='append', index = False)
conn.commit()
if os.path.exists(temp_file):
    os.remove(temp_file)
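
The delete-then-restore steps above could also be expressed as a single statement that keeps one row per pair. This is an alternative technique, not what this notebook ran. It assumes the table has SQLite's implicit `rowid` (that is, it was not created `WITHOUT ROWID`) and that keeping the row with the lowest `rowid` is acceptable. A minimal sketch on hypothetical toy data, where `demo` and its rows are made up for illustration:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE review (user_id TEXT, product_id TEXT, rating INTEGER)")
demo.executemany(
    "INSERT INTO review VALUES (?, ?, ?)",
    [("u1", "p1", 5), ("u1", "p1", 5), ("u2", "p1", 4), (None, "p1", 3), (None, "p1", 3)],
)
demo.commit()

before = demo.execute("SELECT COUNT(*) FROM review").fetchone()[0]

# The connection as a context manager commits on success and rolls back
# if an exception (including a failed assertion) is raised inside the block.
with demo:
    demo.execute("""
        DELETE FROM review
        WHERE user_id IS NOT NULL
          AND rowid NOT IN (
              SELECT MIN(rowid) FROM review
              WHERE user_id IS NOT NULL
              GROUP BY user_id, product_id
          )
    """)
    after = demo.execute("SELECT COUNT(*) FROM review").fetchone()[0]
    assert before - after == 1, "unexpected number of deletions"

remaining = demo.execute(
    "SELECT user_id, product_id FROM review ORDER BY rowid"
).fetchall()
demo.close()

print(before, "->", after)  # 5 -> 4
```

Because the surviving row never leaves the table, this variant needs no temporary CSV or re-insert step. The trade-off is that it silently decides which copy to keep, whereas the approach used above leaves room to inspect `dupe_reviews` first.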

In [None]:
# A final summary for clarity and peace of mind:
final_total = total_review_count()
deletions = total_reviews - final_total
print(f"""
Finished removing {deletions} duplicates. 
Total review count: {total_reviews} -> {final_total}     
""")


Finished removing 89181 duplicates. 
Total review count: 7501225 -> 7412044     



### Cleanup

In [46]:
conn.close()