# 2 - Missing Values
In this second step I'll show you how to detect and clean missing values in the <code>purchase</code> column.

Each step below shows a short explanation, the Python code, and its output.

In [1]:
import pandas as pd

df = pd.read_json("customer_data.json", convert_dates = False)
df.head()

Unnamed: 0,amount,category,city,customer_id,date,frequently_bought_together,lat_lon,purchase,related_items,state,zip_code
0,24.64,household,Chicago,100191,1-Jan-14,towels,"41.86,-87.619",soap,towels,IL,60605
1,35.0,clothing,Dallas,100199,2-Jan-14,sandals,"32.924,-96.547",shorts,belts,TX,75089
2,89.72,outdoor,Philadelphia,100170,3-Jan-14,lawn bags,"40.002,-75.118",lawn_mower,shovels,PA,19019
3,51.32,electronics,Chicago,100124,4-Jan-14,headphones,"41.88,-87.63",laptop,headphones,IL,60603
4,81.75,outdoor,Philadelphia,100173,5-Jan-14,sponge,"39.953,-75.166",car wash,sponge,PA,19102


Let's take a look at the unique values and the missing values.

The <code>.isna()</code> method only detects standard missing values (<code>None</code> and <code>NaN</code>), so we'll use the <code>.unique()</code> method to spot non-standard ones.

In [2]:
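# unique categories, including any placeholder strings
print(df["purchase"].unique())
# number of standard missing values (None/NaN)
print(df["purchase"].isna().sum())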

['soap' 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower' 'grill'
 'household cleaner' 'slow cooker' 'camera' 'snow shovel' 'shoes'
 'blender' 'shirts' 'toaster' 'detergent' 'tv' 'paper products' 'tools'
 'pants' 'audio' 'microwave' 'food processor' 'jackets' 'cell phone'
 'flower pot' None '?' '__' 'na' '--' 'cell' 'cell_phone' 'lawnmower'
 '1111']
36
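
To make the difference concrete, here is a minimal sketch on a made-up column (the `toy` Series is an assumption for illustration, not part of `customer_data.json`). It shows that `.isna()` counts only `None`/`NaN`, while placeholder strings such as `?` or `--` pass through as ordinary values.

```python
import pandas as pd

# Hypothetical toy column, invented for this illustration
toy = pd.Series(["soap", None, "?", "na", "--", "laptop"], name="purchase")

# .isna() flags only standard missing values (None / NaN)
standard_missing = int(toy.isna().sum())

# Placeholder strings are regular values, so they only show up
# when we inspect the distinct values ourselves
distinct_values = set(toy.dropna().unique())

# value_counts(dropna=False) lists both kinds side by side
value_counts = toy.value_counts(dropna=False)

print(standard_missing)
print(distinct_values)
print(value_counts)
```

Looking at `value_counts(dropna=False)` instead of `.unique()` can also help on a real column, because it shows how often each suspicious value occurs.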


Let's change the standard missing values to a new category called <code>unavailable</code>.

To double-check that it worked, we can count the missing values with the <code>.isna()</code> method after filling them with the <code>.fillna()</code> method. If it worked, the count should be 0.

We can also take a look at the unique values again to see if this new <code>unavailable</code> shows up.

In [3]:
# changing standard missing values to "unavailable"
# (assigning the result back avoids relying on inplace=True on a column selection)
df["purchase"] = df["purchase"].fillna("unavailable")

# double checking that it worked by counting missing values, and looking at unique categories
print(df["purchase"].isna().sum())
print(df["purchase"].unique())

0
['soap' 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower' 'grill'
 'household cleaner' 'slow cooker' 'camera' 'snow shovel' 'shoes'
 'blender' 'shirts' 'toaster' 'detergent' 'tv' 'paper products' 'tools'
 'pants' 'audio' 'microwave' 'food processor' 'jackets' 'cell phone'
 'flower pot' 'unavailable' '?' '__' 'na' '--' 'cell' 'cell_phone'
 'lawnmower' '1111']


We can see several non-standard missing value types.

Let's identify those with a list called <code>missing_values</code>, which contains <code>["?", "__", "na", "--"]</code>.

We'll replace them by looping through the <code>purchase</code> column and overwriting each non-standard missing value using the <code>.loc</code> indexer.

In [5]:
# list of non-standard missing values
missing_values = ["?", "__", "na", "--"]

# replacing the missing values with the new category, "unavailable"
# .items() yields (index label, value) pairs, so .loc always gets the row's real label
for idx, value in df["purchase"].items():
    if value in missing_values:
        df.loc[idx, "purchase"] = "unavailable"

print(df["purchase"].unique())

['soap' 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower' 'grill'
 'household cleaner' 'slow cooker' 'camera' 'snow shovel' 'shoes'
 'blender' 'shirts' 'toaster' 'detergent' 'tv' 'paper products' 'tools'
 'pants' 'audio' 'microwave' 'food processor' 'jackets' 'cell phone'
 'flower pot' 'unavailable' 'cell' 'cell_phone' 'lawnmower' '1111']
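
The same replacement can be written without an explicit loop. The sketch below is an alternative to the cell above, not the approach used in this notebook; the `toy` DataFrame and its index labels are made up for illustration. It builds a boolean mask with `.isin()` and assigns through `.loc`, which works regardless of how the index is labelled.

```python
import pandas as pd

# Hypothetical toy frame with a non-default index, invented for this illustration
toy = pd.DataFrame(
    {"purchase": ["soap", "?", "__", "laptop", "na", "--"]},
    index=[10, 11, 12, 13, 14, 15],
)
missing_values = ["?", "__", "na", "--"]

# Boolean mask: True wherever the value is one of the placeholders
mask = toy["purchase"].isin(missing_values)

# Overwrite only the masked rows of the "purchase" column
toy.loc[mask, "purchase"] = "unavailable"

print(toy["purchase"].tolist())
```

On larger data this is usually faster than a Python-level loop, since the comparison and the assignment both run inside pandas.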


We could also convert missing values when we read in the data.

The <code>pd.read_json()</code> function does not have an <code>na_values</code> parameter like <code>pd.read_csv()</code> does.

We'll need to read in the <code>json</code> file, convert it to a <code>csv</code> file, and then read it back in using <code>.read_csv()</code>.

In [7]:
# reading in the json file
df = pd.read_json("customer_data.json", convert_dates = False)

# writing the DataFrame to a csv file
# (index=False keeps the row index from coming back as an extra column)
df.to_csv("customer_data2.csv", index=False)

# reading the csv file back in, replacing missing values
df2 = pd.read_csv("customer_data2.csv", na_values=missing_values)

print(df2["purchase"].unique())

['soap' 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower' 'grill'
 'household cleaner' 'slow cooker' 'camera' 'snow shovel' 'shoes'
 'blender' 'shirts' 'toaster' 'detergent' 'tv' 'paper products' 'tools'
 'pants' 'audio' 'microwave' 'food processor' 'jackets' 'cell phone'
 'flower pot' nan 'cell' 'cell_phone' 'lawnmower' '1111']


The non-standard missing values have all been converted to <code>NaN</code>.
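
If writing an intermediate `csv` file is not desirable, a similar result can be reached in memory. The sketch below is one possible alternative, shown on a made-up `toy` DataFrame rather than the real data: `.mask()` replaces every value where the condition is true with a missing value, and `.isin()` supplies the condition. Unlike passing a plain list to `na_values`, which applies to every column, this touches only the `purchase` column.

```python
import pandas as pd

# Hypothetical toy frame, invented for this illustration
toy = pd.DataFrame(
    {
        "purchase": ["soap", "?", "__", "laptop", "na", "--"],
        "city": ["Chicago", "Dallas", "--", "Chicago", "Dallas", "Chicago"],
    }
)
missing_values = ["?", "__", "na", "--"]

# Where the condition is True, .mask() swaps the value for a missing value
toy["purchase"] = toy["purchase"].mask(toy["purchase"].isin(missing_values))

# Count the standard missing values now present in the column
n_missing = int(toy["purchase"].isna().sum())

print(n_missing)
print(toy)
```

The `--` in the `city` column is deliberately left alone here, to show that the replacement is scoped to one column.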

Let's finish up by replacing the missing values with the new category, <code>unavailable</code>.

In [8]:
# changing missing values to "unavailable"
# (assigning the result back avoids relying on inplace=True on a column selection)
df2["purchase"] = df2["purchase"].fillna("unavailable")

# double checking that it worked by counting missing values, and looking at unique categories
print(df2["purchase"].isna().sum())
print(df2["purchase"].unique())

0
['soap' 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower' 'grill'
 'household cleaner' 'slow cooker' 'camera' 'snow shovel' 'shoes'
 'blender' 'shirts' 'toaster' 'detergent' 'tv' 'paper products' 'tools'
 'pants' 'audio' 'microwave' 'food processor' 'jackets' 'cell phone'
 'flower pot' 'unavailable' 'cell' 'cell_phone' 'lawnmower' '1111']
