# Data Cleaning

### Dataset

Source: https://archive.ics.uci.edu/ml/datasets/Online+Retail

- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction.
  - If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantity of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
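
The cancellation rule in the InvoiceNo description can be turned into a boolean flag. The following is a minimal sketch on made-up rows, not part of the original notebook; the `sample` frame and the `is_cancellation` column name are hypothetical, and matching the prefix case-insensitively is an assumption.

```python
import pandas as pd

# Toy rows that mimic the column layout described above (values are made up).
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381", "c536383"],
    "Quantity": [6, -1, 4, -2],
})

# Cast to string first so the .str accessor works even if some invoice
# numbers were read as integers, then compare the prefix case-insensitively.
sample["is_cancellation"] = (
    sample["InvoiceNo"].astype(str).str.upper().str.startswith("C")
)

print(sample)
```

A flag like this makes it possible to separate cancellations from regular sales before any aggregation, without altering the InvoiceNo column itself.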

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
os.chdir(r'D:\Data\Projects\Business Analytics\E-Commerce Data')
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [3]:
from warnings import filterwarnings
filterwarnings('ignore')

In [4]:
df_ = pd.read_csv('e-commerce data.csv', encoding = 'ISO-8859-1')
print(df_.shape)
df_.head()

(541909, 8)


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |


### Missing Values

In [5]:
missing = pd.DataFrame(df_.isnull().sum()).rename(columns = {0: 'total'})
missing['percent'] = missing['total'] / len(df_)*100
#missing = missing[missing.total != 0]
missing.sort_values('percent', ascending = False)

| | total | percent |
|---|---|---|
| CustomerID | 135080 | 24.927 |
| Description | 1454 | 0.268 |
| InvoiceNo | 0 | 0.0 |
| StockCode | 0 | 0.0 |
| Quantity | 0 | 0.0 |
| InvoiceDate | 0 | 0.0 |
| UnitPrice | 0 | 0.0 |
| Country | 0 | 0.0 |


In [6]:
# There is no way to obtain the missing values from CustomerID or Description,
# so the rows will be dropped
df = df_.dropna(how='any')
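
As a variation on the step above, the drop can be limited to the two columns that actually contain gaps, and the number of removed rows can be reported. This is a sketch on a made-up frame; `toy`, `cleaned`, and `rows_dropped` are hypothetical names, and restricting the check with `subset` is an assumption about intent rather than what the notebook does.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing Description and one missing CustomerID.
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "CustomerID": [17850.0, 17850.0, np.nan, 13047.0],
    "Quantity": [6, 6, 8, 2],
})

rows_before = len(toy)

# Only rows with a gap in one of the listed columns are removed.
# .copy() makes the result an independent frame before further edits.
cleaned = toy.dropna(subset=["Description", "CustomerID"]).copy()

rows_dropped = rows_before - len(cleaned)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

On the real data, `how='any'` over all columns gives the same result as long as only these two columns contain missing values, which is what the table above shows.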

### Duplicates

In [7]:
df.loc[df.duplicated(), :].shape

(5225, 8)

In [8]:
# There are 5225 duplicates; all eight columns are compared by default,
# so no explicit subset is needed
df = df.drop_duplicates()
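
Before dropping duplicates it can help to look at them. The sketch below uses made-up rows (the `toy` frame is hypothetical) to show how `duplicated(keep=False)` lists every copy in a duplicate group, and that `drop_duplicates()` without arguments behaves the same as passing all columns as `subset`.

```python
import pandas as pd

# Made-up rows: the first two are exact copies of each other.
toy = pd.DataFrame({
    "InvoiceNo": ["536409", "536409", "536409", "536412"],
    "StockCode": ["21866", "21866", "22866", "21866"],
    "Quantity": [1, 1, 1, 1],
})

# keep=False marks every member of a duplicate group, not just the repeats.
all_copies = toy[toy.duplicated(keep=False)]

# With no subset, every column is compared and the first occurrence is kept.
deduped = toy.drop_duplicates()

# Listing all columns explicitly yields the same frame.
same_as_explicit = deduped.equals(toy.drop_duplicates(subset=list(toy.columns)))

print(all_copies)
print(len(deduped), same_as_explicit)
```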

### Datatypes

In [9]:
df.dtypes.sort_values()

Quantity         int64
UnitPrice      float64
CustomerID     float64
InvoiceNo       object
StockCode       object
Description     object
InvoiceDate     object
Country         object
dtype: object

In [10]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
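
Passing an explicit format avoids ambiguity between day-first and month-first dates. The following sketch assumes the month/day/year layout seen in the preview rows (for example `12/1/2010 8:26`) holds for the whole file, which has not been verified here; `raw` and `parsed` are hypothetical names.

```python
import pandas as pd

# Two strings in the layout shown in the preview, plus one invalid entry.
raw = pd.Series(["12/1/2010 8:26", "12/10/2010 14:05", "not a date"])

# Assumed format: month/day/year hour:minute.
# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

# Count the entries that could not be parsed.
n_failed = int(parsed.isna().sum())
print(parsed)
print(n_failed)
```

Counting the `NaT` values afterwards is a cheap way to confirm that the assumed format matches every row.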

In [11]:
# Cast to int first to drop the trailing '.0', then to string
df['CustomerID'] = df['CustomerID'].astype('int64').astype('str')
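
The two-step cast matters because CustomerID was read as float. A small sketch on made-up IDs (the `ids` series is hypothetical) shows the difference between casting directly and casting via `int64`.

```python
import pandas as pd

# CustomerID values as they appear after loading: floats.
ids = pd.Series([17850.0, 13047.0, 12583.0])

# A direct cast keeps the float formatting, e.g. "17850.0".
direct = ids.astype(str)

# Going through int64 first drops the fractional part, e.g. "17850".
via_int = ids.astype("int64").astype(str)

print(direct.tolist())
print(via_int.tolist())
```

The `int64` cast raises an error on missing values, so this step relies on the rows with a missing CustomerID having been dropped earlier.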

In [12]:
# Replace commas in descriptions with spaces
df['Description'] = df['Description'].str.replace(',', ' ')

In [13]:
# Save clean dataset
# df.to_csv('dfclean.csv', sep=',', encoding='utf-8', index=False)
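
If the cleaned frame is written to CSV as in the commented-out line above, the dtypes set earlier are not stored in the file. The sketch below is a hypothetical round trip through an in-memory buffer (`clean`, `buffer`, and `reloaded` are made-up names) showing one way to restore them on reload.

```python
import io

import pandas as pd

# Made-up stand-in for the cleaned frame: string IDs and parsed dates.
clean = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379"],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26", "2010-12-01 09:41"]),
    "CustomerID": ["17850", "14527"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# Without these arguments, CustomerID would come back as integers
# and InvoiceDate as plain strings.
reloaded = pd.read_csv(
    buffer,
    dtype={"InvoiceNo": str, "CustomerID": str},
    parse_dates=["InvoiceDate"],
)

print(reloaded.dtypes)
```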