# Missing Data

In [2]:
import pandas as pd
import sqlite3 as sql 

import shared.query as q

conn = q.connect()

## Find Missing Data

In [3]:
# Let's look up the fields we have to work with. Also: does the database already enforce NOT NULL on any fields? Should it?
pd.read_sql_query("PRAGMA table_info(product)", conn)

,cid,name,type,notnull,dflt_value,pk
0,0,id,VARCHAR(36),1,,1
1,1,title,TEXT,0,'',0
2,2,title_search,TEXT,0,'',0
3,3,creator,TEXT,0,'',0
4,4,creator_search,TEXT,0,'',0
5,5,publisher,TEXT,0,'',0
6,6,description,TEXT,0,'',0
7,7,category,TEXT,0,'',0
8,8,subcategory,TEXT,0,'',0
9,9,release_date,DATE,0,,0
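
The output above answers the first question: only `id` has `notnull` set. A minimal sketch of reading that off programmatically follows. It runs against an in-memory table with a hypothetical, cut-down schema (the `demo_conn` connection and the three columns are assumptions for illustration, not the real `product` table):

```python
import sqlite3

import pandas as pd

# Hypothetical, cut-down stand-in for the real product table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE product ("
    " id VARCHAR(36) NOT NULL PRIMARY KEY,"
    " title TEXT DEFAULT '',"
    " release_date DATE)"
)

table_info = pd.read_sql_query("PRAGMA table_info(product)", demo_conn)

# Columns on which the database itself enforces NOT NULL.
# Bracket access is needed because DataFrame.notnull is also a method name.
not_null_columns = table_info.loc[table_info["notnull"] == 1, "name"].tolist()
print(not_null_columns)
```

Pointing the same query at the real `conn` would list every column the database already protects against NULLs. Empty strings would still need a separate check.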


In [5]:
def is_missing(x):
    # pd.isnull is an alias of pd.isna, so one null check is enough
    return pd.isna(x) or x == ''

product_data = pd.read_sql_query("SELECT id, title, creator, category, subcategory FROM product", conn).set_index('id')
missing_values = product_data.map(is_missing)
missing_values.sum()

title              40
creator         29294
category            0
subcategory    768224
dtype: int64
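
The element-wise `map(is_missing)` above calls a Python function once per cell. A vectorized check is one possible alternative on a table this size; the sketch below shows the idea on a small made-up frame (`toy` and its values are invented for illustration, not taken from the database):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for product_data (values are made up).
toy = pd.DataFrame(
    {
        "title": ["Album A", None, ""],
        "creator": ["Artist X", "", np.nan],
    },
    index=pd.Index(["a1", "a2", "a3"], name="id"),
)

# True where a value is null/NaN or an empty string.
toy_missing = toy.isna() | toy.eq("")
toy_missing_counts = toy_missing.sum()
print(toy_missing_counts)
```

The result has the same shape as the output of `map(is_missing)`, so the later `groupby` step would work on it unchanged.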

Let's see how our missing data breaks down by dataset, which is to say, by category:

In [6]:
print("MISSING DATA BY CATEGORY")
missing_values['category'] = product_data['category']
missing_values.groupby('category').sum()

MISSING DATA BY CATEGORY


category,title,creator,subcategory
Books,0,0,0
Music,40,29294,768224
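
Raw counts do not show how large each gap is relative to its category. Because the mean of a boolean column is the fraction of `True` values, swapping `sum()` for `mean()` gives the share of missing values instead. A small sketch on made-up flags (the `flags` frame is hypothetical, standing in for `missing_values`):

```python
import pandas as pd

# Made-up boolean flags standing in for missing_values.
flags = pd.DataFrame(
    {
        "category": ["Books", "Books", "Music", "Music", "Music", "Music"],
        "title": [False, False, True, False, False, False],
        "creator": [False, False, True, True, False, False],
    }
)

# Fraction of missing values per category and column.
missing_share = flags.groupby("category").mean()
print(missing_share)
```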


Our book data has no missing values in these fields. Our music data has a few gaps beyond the known one (there is no genre, i.e. subcategory, information for music):

In [7]:
music = product_data[product_data.category == 'Music']
missing_title_ids = music[music.title.map(is_missing)].index
missing_titles = q.get_product_details(missing_title_ids, conn)
missing_titles

product details 40: 40 results in 0.016 seconds


id,title,title_search,creator,creator_search,publisher,description,category,subcategory,release_date
B000000H0U,,,Meshuggah,meshuggah,,,Music,,"February 10, 2007"
B000005BST,,,Various Artists,variousartists,,BRAND NEW.,Music,,"February 10, 2007"
B00000DTSG,,,Various Artists,variousartists,,,Music,,"February 10, 2007"
B00000IQ5N,,,Adam Guettel,adamguettel,,"Amazon.com, Until this disc, Adam Guettel was ...",Music,,"January 9, 2007"
B00000JN16,,,"Martinu, Bohuslav",bohuslavmartinu,,,Music,,"December 4, 2006"
B00002MRGH,,,Bing Crosby,bingcrosby,,,Music,,"November 8, 2006"
B00005LBSH,,,Various Artists,variousartists,,Today's Hottest Dance Songs Done Mickey Style!...,Music,,"February 11, 2007"
B00005QVZW,,,Elvis Presley,elvispresley,,,Music,,"December 16, 2006"
B00005RRKV,,,Fulanito,fulanito,,,Music,,"January 12, 2007"
B000091F7I,,,Kingdom Come,kingdomcome,,Kingdom Come (80s) Kingdom Come UK vinyl LP,Music,,"December 15, 2006"


Nothing stands out about the above data. Note that these titles are missing (null) rather than empty strings. 40 missing titles out of roughly 768K music products (about 0.005%) is insignificant overall. Let's move on:

In [8]:
missing_artist_ids = music[music.creator.map(is_missing)].index
missing_artists = q.get_product_details(missing_artist_ids, conn)
missing_artists.sample(20)

product details 29294: 29294 results in 1.876 seconds


id,title,title_search,creator,creator_search,publisher,description,category,subcategory,release_date
B009HK901E,"CHAISSON, TIM - OTHER SIDE : AUSTRALIAN EDITION",timorsideaustralianeditionchaisson,,,,Tracks: 1. Beat this heart2. Blast your way ou...,Music,,"October 3, 2012"
B0002OJPKK,Tt 1510,tt1510,,,,,Music,,"June 30, 2015"
B00E7WLS7Q,MOTOWN 25 HITS 1962-1971,motown25hits19621971,,,,,Music,,"July 29, 2013"
B01AXMAV1C,On The Eastern Front by Little Feat,oneasternfrontbylittlefeat,,,,,Music,,"March 17, 2016"
B00294IQ10,Nihilistic Purity,nihilisticpurity,,,,,Music,,"September 12, 2012"
B004GPDCZU,Start Somewhere,startsomewhere,,,,,Music,,"December 18, 2010"
B003NZZXN8,Ricky Segall And The Segalls,rickysegallandsegalls,,,,,Music,,"August 20, 2014"
B000OE9TSQ,Harveys Bristol Blues Collection,harveysbristolbluescollection,,,,Tracklist . 1 Live It To The Full 2 Let's Get ...,Music,,"March 14, 2007"
B000TQXI8Q,The Sound of Christmas,soundofchristmas,,,,,Music,,"July 16, 2007"
B001T4JMKA,Live,live,,,,,Music,,"August 17, 2012"


Most of these look like obscure releases or compilations.

If needed, it is safe to remove the music products that have a missing title or creator. (Subcategory cannot be part of that filter, because it is missing for every music product.) As it stands, though, this data is not interfering with our algorithms or results.
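
If that cleanup is ever needed, a minimal sketch might look like the following. It assumes only `title` and `creator` count as required fields, since `subcategory` is empty for every music product, and it runs on made-up rows (`toy_music`) rather than the real table:

```python
import pandas as pd

# Made-up rows standing in for the music subset of product_data.
toy_music = pd.DataFrame(
    {
        "title": ["Album A", None, "Album C"],
        "creator": ["", "Artist Y", "Artist Z"],
        "subcategory": ["", "", ""],
    },
    index=pd.Index(["m1", "m2", "m3"], name="id"),
)

# Assumption: only these fields are required; subcategory is deliberately excluded.
required = ["title", "creator"]

# A row is incomplete if any required field is null or an empty string.
incomplete = (toy_music[required].isna() | toy_music[required].eq("")).any(axis=1)
clean_music = toy_music[~incomplete]
print(clean_music.index.tolist())
```

This filters a DataFrame copy only; deleting rows from the database itself would be a separate step.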

## To do: Investigate Missing Authors

Our product data notebook showed a high prevalence of missing authors in popular books, yet the data above show no books with a missing creator. To do: find the root cause of the missing authors in the popular books query.
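
One hypothesis worth ruling out first (an assumption, not a finding) is that some creators hold whitespace-only strings, which `is_missing` would not flag because it only compares against `''`. A stricter check could be sketched as below; `is_blank` and the sample values are hypothetical:

```python
import pandas as pd


def is_blank(value):
    """True for null values and for strings that are empty after stripping whitespace."""
    if pd.isna(value):
        return True
    return isinstance(value, str) and value.strip() == ""


# Made-up creator values; the real ones would come from the product table.
creators = pd.Series(["Jane Doe", " ", "", None, "\t"])
blank_flags = creators.map(is_blank)
print(int(blank_flags.sum()))
```

If this check also reports no blank creators for books, the cause more likely lies in the popular books query itself than in the stored data.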