# Data Cleaning Notebook - Example
This is the more detailed notebook, walking through the data cleaning for two features: "category" and "amount". I will leave the cleaning of the rest of the features as an exercise for you.

## Importing the data

In [2]:
import json

with open("customer_data_example.json") as f_in:
    data = json.load(f_in)

# taking a look at the JSON data
print(json.dumps(data, indent=2))

[
  {
    "customer_id": 100191,
    "date": "1-Jan-14",
    "purchase": null,
    "category": "household",
    "amount": "24.64",
    "related_items": "towels",
    "frequently_bought_together": "towels",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60605,
    "lat_lon": "41.86,-87.619"
  },
  {
    "customer_id": 100199,
    "date": "2-Jan-14",
    "purchase": "shorts",
    "category": "clothing",
    "amount": "35",
    "related_items": "belts",
    "frequently_bought_together": "sandals",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75089,
    "lat_lon": "32.924,-96.547"
  },
  {
    "customer_id": 100170,
    "date": "3-Jan-14",
    "purchase": "lawn_mower",
    "category": "outdoor",
    "amount": "89.72",
    "related_items": "shovels",
    "frequently_bought_together": "lawn bags",
    "city": "Philadelphia",
    "state": "PA",
    "zip_code": 19019,
    "lat_lon": "40.002,-75.118"
  },
  {
    "customer_id": 100124,
    "date": "4-Jan-14",
    "purchas

In [15]:
import pandas as pd

# reading the JSON data into a dataframe
df = pd.read_json("customer_data_example.json")

df

Unnamed: 0,amount,category,city,customer_id,date,frequently_bought_together,lat_lon,purchase,related_items,state,zip_code
0,24.64,household,Chicago,100191,2014-01-01,towels,"41.86,-87.619",,towels,IL,60605
1,35.00,clothing,Dallas,100199,2014-01-02,sandals,"32.924,-96.547",shorts,belts,TX,75089
2,89.72,outdoor,Philadelphia,100170,2014-01-03,lawn bags,"40.002,-75.118",lawn_mower,shovels,PA,19019
3,51.32,electronics,Chicago,100124,2014-01-04,headphones,"41.88,-87.63",laptop,headphones,IL,60603
4,81.75,outdoor,Philadelphia,100173,2014-01-05,sponge,"39.953,-75.166",car wash,sponge,PA,19102
5,29.16,outdoor,San Diego,100116,2014-01-06,fertilizer,"33.143,-117.03",lawn mower,rakes,CA,92027
6,50.71,outdoor,Dallas,100105,2014-01-07,bbq sauce,"32.745,-96.46",grill,grill cleaner,TX,75126
7,35.03,household,San Antonio,100148,2014-01-08,spray bottles,"29.502,-98.306",household cleaner,spray bottles,TX,78109
8,30.55,appliances,Philadelphia,100118,2014-01-09,pot holders,"39.953,-75.166",,tupperware,PA,19102
9,92.01,electronics,Dallas,100106,2014-01-10,camera lens,"32.917,-96.973",camera,lens cleaner,TX,75126


In [16]:
# taking a look at the types for each feature
df.dtypes

amount                               float64
category                              object
city                                  object
customer_id                            int64
date                          datetime64[ns]
frequently_bought_together            object
lat_lon                               object
purchase                              object
related_items                         object
state                                 object
zip_code                               int64
dtype: object

In [12]:
# parsing the XML data and putting it into a dataframe
import xml.etree.ElementTree as ET

tree = ET.parse('location_data.xml')
root = tree.getroot()

# root tag
print(root.tag)

# child tag
print(root[0].tag)

# number of children elements
# (Element.getchildren() was removed in Python 3.9; len() works directly)
num_children = len(root)
print(num_children)

# number of subchildren elements
num_subchildren = len(root[0])
print(num_subchildren)
    
    
# pulling out all of the subchildren tags
tags = []
for subchild in root[0]:
    tags.append(subchild.tag)
print(tags)
    

# creating an empty dictionary to store the data
d = {}
for tag in tags:
    d[tag] = []
print(d)


# pulling out all of the data
for i in range(0, num_children):
    for j in range(0, num_subchildren):
        value = root[i][j].text
        d[tags[j]].append(value)
        
#print(d)


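# converting to a dataframe; a separate name keeps the customer
# dataframe `df` intact for the cleaning steps
location_df = pd.DataFrame(data=d)

print(location_df)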

data-set
record
50
3
['City', 'Zipcode', 'Latitude_Longitude']
{'City': [], 'Zipcode': [], 'Latitude_Longitude': []}
             City Zipcode Latitude_Longitude
0   New York City   10012     40.726,-73.998
1   New York City   10013     40.721,-74.005
2   New York City   10004     40.699,-74.041
3   New York City   10128      40.782,-73.95
4   New York City   10002     40.717,-73.987
5     Los Angeles   90001    33.973,-118.249
6     Los Angeles   90016     34.03,-118.353
7     Los Angeles   90008     34.01,-118.337
8     Los Angeles   90020    34.066,-118.309
9     Los Angeles   90029     34.09,-118.295
10        Chicago   60610     41.899,-87.637
11        Chicago   60611     41.905,-87.625
12        Chicago   60605      41.86,-87.619
13        Chicago   60602     41.883,-87.629
14        Chicago   60603       41.88,-87.63
15        Houston   77001      29.813,-95.31
16        Houston   77005     29.718,-95.428
17        Houston   77009     29.793,-95.367
18        Houston   77004   
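
The index-based loops above assume every record has the same children in the same order. Here is a sketch that iterates over the elements directly and keys on tag names instead. The XML string is a tiny made-up sample in the same shape as the printed output, not the real `location_data.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record sample mirroring the structure printed above.
sample_xml = """
<data-set>
  <record>
    <City>Chicago</City>
    <Zipcode>60605</Zipcode>
    <Latitude_Longitude>41.86,-87.619</Latitude_Longitude>
  </record>
  <record>
    <City>Houston</City>
    <Zipcode>77001</Zipcode>
    <Latitude_Longitude>29.813,-95.31</Latitude_Longitude>
  </record>
</data-set>
"""

root = ET.fromstring(sample_xml.strip())

# One dict per <record>, keyed by child tag name.
records = [{child.tag: child.text for child in record} for record in root]

print(records)
```

Passing `records` to `pd.DataFrame` should then give a dataframe with the same three columns.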

## Cleaning the "category" feature

In [17]:
# summing the missing values
print(sum(df["category"].isna()))

0


It doesn't look like there are any standard missing values. Let's take a look at the unique values.

In [21]:
# unique values
print(df["category"].unique())

['household' 'clothing' 'outdoor' 'electronics' 'appliances' 'house'
 'elect^ronics']
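
`unique()` lists the labels but not how often each one occurs. Here is a short sketch using `value_counts()` on a made-up Series (the counts are illustrative, not taken from the customer data). Labels that occur only rarely are often the typos.

```python
import pandas as pd

# Hypothetical category column containing the same kinds of problems.
category = pd.Series(
    [
        "household", "household",
        "outdoor", "outdoor",
        "electronics", "electronics",
        "house", "elect^ronics",
    ]
)

# Frequency of each label; dropna=False also counts NaN if any are present.
counts = category.value_counts(dropna=False)
print(counts)

# Labels that appear only once are worth a closer look.
rare_labels = sorted(counts[counts == 1].index)
print(rare_labels)
```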


There's a caret "^" character in the "elect^ronics" value. I'll need to remove that.

In [23]:
# removing incorrect "^" character from the strings

bad_characters = ["^"]

cnt = 0
for row in df["category"]:
    for character in bad_characters:
        if character in row:
            df.loc[cnt, "category"] = row.replace(character, "")
    cnt+=1

print(df["category"].unique())

['household' 'clothing' 'outdoor' 'electronics' 'appliances' 'house']
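
The same cleanup can be done without a manual row counter by using the vectorized string methods. This is a minimal sketch on a made-up Series standing in for `df["category"]`. In a regular expression "^" is the start-of-string anchor rather than a literal character, so the sketch passes `regex=False`.

```python
import pandas as pd

# Hypothetical stand-in for df["category"].
category = pd.Series(["household", "elect^ronics", "outdoor"])

bad_characters = ["^"]

# regex=False makes str.replace treat each character literally.
for character in bad_characters:
    category = category.str.replace(character, "", regex=False)

print(category.tolist())
```

Because each pass works on the already-cleaned column, this also handles a value that contains more than one of the bad characters.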


It looks like the labels aren't all consistent. "household" and "house" refer to the same category. I'll change "house" to "household" so that the category labels are consistent.

In [24]:
# labels that should be mapped to "household"
inconsistent_labels = ["house"]

cnt = 0

for row in df["category"]:
    if row in inconsistent_labels:
        df.loc[cnt, "category"] = "household"
    cnt+=1
    
print(df["category"].unique())

['household' 'clothing' 'outdoor' 'electronics' 'appliances']
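
When several labels need to be merged, a mapping passed to `Series.replace` is one alternative to the loop. The sketch below uses a made-up Series, and the `label_fixes` name is only illustrative. The replacement matches whole values, so "household" itself is left untouched.

```python
import pandas as pd

# Hypothetical stand-in for df["category"].
category = pd.Series(["household", "house", "clothing", "house"])

# Map each inconsistent label to its canonical form; labels that are
# not keys in the mapping are left unchanged.
label_fixes = {"house": "household"}
category = category.replace(label_fixes)

print(category.unique())
```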


The data in the "category" column has now been cleaned.

## Cleaning the "amount" feature

In [26]:
# looking for missing values
print(sum(df["amount"].isna()))

0


It doesn't look like there are any standard missing values.

Rather than look at unique values, I'll take a look at the type first. 

I would expect the type to be a float. If the type is an object (string), there's probably some additional dirty data that needs to be cleaned, such as dollar signs "$" or unexpected characters.

In [30]:
# checking the data type
df["amount"].dtype

dtype('float64')

The type is "float", so it looks like we should be all set.
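
Had the type come back as object instead, the column would need converting. The following is a sketch of one way to do that, using made-up dirty values (none of them come from the dataset): strip the dollar signs and whitespace, then let `pd.to_numeric` flag whatever still fails to parse.

```python
import pandas as pd

# Hypothetical dirty amounts, for illustration only.
amount = pd.Series(["$24.64", "35", " 89.72 ", "n/a"])

# Strip dollar signs and surrounding whitespace, then convert.
# errors="coerce" turns anything unparseable into NaN instead of raising.
cleaned = pd.to_numeric(
    amount.str.replace("$", "", regex=False).str.strip(),
    errors="coerce",
)

print(cleaned.dtype)
print(cleaned.isna().sum())
```

Counting the NaN values before and after the conversion shows how many entries could not be parsed and need a manual look.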

## Exporting the clean data

I'll finish by exporting the clean data to a CSV file called "customer_data_cleaned.csv".

In [31]:
df.to_csv("customer_data_cleaned.csv")
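
By default `to_csv` also writes the row index, which comes back as an extra "Unnamed: 0" column when the file is read again. Here is a small sketch of the difference, using a made-up two-row dataframe and in-memory buffers instead of real files.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the cleaned dataframe.
clean_df = pd.DataFrame(
    {"category": ["household", "clothing"], "amount": [24.64, 35.0]}
)

# Write to in-memory buffers instead of files for this sketch.
with_index = io.StringIO()
clean_df.to_csv(with_index)
without_index = io.StringIO()
clean_df.to_csv(without_index, index=False)

# Read both back and compare the column names.
with_index.seek(0)
without_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the index carries no meaning, passing `index=False` keeps the exported file to just the data columns.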