# Data Sanitization

This notebook provides utilities for finding and removing unwanted data.

Although the best way to avoid unwanted data is to filter out low-quality records during ingestion, some unwanted data is inevitable. This is especially true with SQLite, which by default does not enforce foreign key constraints unless `PRAGMA foreign_keys = ON` is set on the connection, and even then cascades deletions only when the constraint is declared with `ON DELETE CASCADE`. For example, removing products leaves orphaned product reviews behind despite the foreign key constraint on review.product_id.
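
The effect of the foreign key setting can be reproduced on a throwaway in-memory database. The sketch below is only an illustration under assumptions: the two-column schema and the helper name `reviews_left_after_product_delete` are hypothetical and much simpler than the real `product` and `review` tables, and it assumes a standard SQLite build with foreign key support compiled in.

```python
import sqlite3

def reviews_left_after_product_delete(enforce_fk: bool) -> int:
    """Delete a product and report how many of its reviews survive."""
    conn = sqlite3.connect(":memory:")
    if enforce_fk:
        # Enforcement is per connection and must be enabled outside a transaction.
        conn.execute("PRAGMA foreign_keys = ON")
    # Hypothetical minimal schema, for illustration only.
    conn.execute("CREATE TABLE product (id TEXT PRIMARY KEY)")
    conn.execute("""
        CREATE TABLE review (
            id INTEGER PRIMARY KEY,
            product_id TEXT REFERENCES product(id) ON DELETE CASCADE
        )
    """)
    conn.execute("INSERT INTO product VALUES ('p1')")
    conn.execute("INSERT INTO review (product_id) VALUES ('p1')")
    conn.execute("DELETE FROM product WHERE id = 'p1'")
    remaining = conn.execute("SELECT COUNT(*) FROM review").fetchone()[0]
    conn.close()
    return remaining

orphans_without_fk = reviews_left_after_product_delete(enforce_fk=False)
orphans_with_fk = reviews_left_after_product_delete(enforce_fk=True)
```

With enforcement off, the review outlives its product and becomes an orphan. With enforcement on and `ON DELETE CASCADE` declared, the review is removed together with the product.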

In [2]:
import sqlite3 as sql
import pandas as pd

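def find_orphaned_reviews(conn: sql.Connection, ids_only: bool = False) -> pd.DataFrame:
    """Return reviews whose product_id has no matching row in the product table.

    With ids_only=True, return only the distinct orphaned product ids.
    """
    selection = "DISTINCT product_id" if ids_only else "*"
    return pd.read_sql_query(f"""
    WITH product_ids AS (SELECT id FROM product)
    SELECT {selection} FROM review WHERE product_id NOT IN product_ids
    """, conn)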

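def remove_orphaned_reviews(conn: sql.Connection) -> None:
    """Delete every review whose product_id is missing from the product table.

    The caller is responsible for committing the transaction.
    """
    # Note: rewriting the query below to first build a list of product ids
    # that are NOT in the product table does not improve performance.
    conn.execute("""
    WITH product_ids AS (SELECT id FROM product)
    DELETE FROM review WHERE product_id NOT IN product_ids
    """)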

In [3]:
conn = sql.connect('../products.sql')

In [4]:
# Retrieve orphaned reviews so we can verify them and conduct EDA before deleting
orphaned_reviews = find_orphaned_reviews(conn)

In [5]:
orphaned_reviews

,user_id,product_id,title,review,rating,upvotes,downvotes,timestamp
0,A223508619VLG1,B0007F8Q9K,40 Years of Life; Lincoln Reconsidered fires t...,Lincoln Reconsidered is a collection of provac...,5,25,3,913852800
1,A2K3XIR3EA9L5O,B0007F8Q9K,Lincoln for our times,Lincoln Reconsidered is a series of essays wri...,4,0,0,1357776000
2,A1E8EIKF5T05BO,B0007F8Q9K,Not your ordinary Lincoln biography,"According to the author's Preface, the publish...",5,0,0,1356739200
3,A2TXR85WQLE32N,B0007F8Q9K,Reconsidering Lincoln...,Famous Lincoln scholar David Herbert Donald fi...,4,0,0,1338940800
4,A19Q6JEEEBNOAU,B0007F8Q9K,Lincoln and his time,For the mpost part any book by David Donald is...,4,1,1,1187827200
...,...,...,...,...,...,...,...,...
458448,AGLQY3IGI43MK7TQJ4SQNCPGFLOQ,B0019VSNNA,MLMers listen to Jim.,I strongly believe that every MLM'er should li...,5,2,0,1312210738000
458449,AGXQUTQBTALZ7ZBCMEL4A6MIPFJA,B07C88R67M,Awesome Album,Amazing Album🖤 If you are a BTS fan you will a...,5,0,0,1534056358574
458450,AHCUX5TVEU2OH6CVZGWPCBV4DFPA,1938774272,Four Stars,"I use this cd every day,it meets all of my exp...",4,0,0,1429715098000
458451,AEOA6Z2MGYKYXE2ZDSXZBGYNO7ZQ,B0009PUCUE,Enjoyed listening to this program.,It fit my needs on hearing how you can be succ...,4,0,0,1389091102000


In [6]:
remove_orphaned_reviews(conn)
conn.commit()

In [7]:
conn.close()
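
As a sanity check on the workflow above, the number of orphaned reviews can be counted before and after the deletion and compared with the cursor's `rowcount`. This is a minimal sketch on an in-memory database, not the notebook's real data: the reduced schema, the sample rows, and the `ORPHAN_FILTER` constant are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical reduced schema and sample rows, for illustration only.
conn.executescript("""
    CREATE TABLE product (id TEXT PRIMARY KEY);
    CREATE TABLE review (product_id TEXT, rating INTEGER);
    INSERT INTO product VALUES ('p1');
    INSERT INTO review VALUES ('p1', 5), ('gone', 4), ('gone', 2);
""")

# Same predicate as the notebook's queries, written as a plain subquery.
ORPHAN_FILTER = "product_id NOT IN (SELECT id FROM product)"

before = conn.execute(f"SELECT COUNT(*) FROM review WHERE {ORPHAN_FILTER}").fetchone()[0]

# rowcount on the cursor reports how many rows the DELETE removed.
cursor = conn.execute(f"DELETE FROM review WHERE {ORPHAN_FILTER}")
deleted = cursor.rowcount
conn.commit()

after = conn.execute(f"SELECT COUNT(*) FROM review WHERE {ORPHAN_FILTER}").fetchone()[0]
kept = conn.execute("SELECT COUNT(*) FROM review").fetchone()[0]
conn.close()
```

If `deleted` matches `before` and `after` is zero, the deletion removed exactly the orphaned rows and left the reviews of existing products in place.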