<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Packages-+-Data" data-toc-modified-id="Import-Packages-+-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Packages + Data</a></span></li><li><span><a href="#Explore-+-Clean-Data" data-toc-modified-id="Explore-+-Clean-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Explore + Clean Data</a></span><ul class="toc-item"><li><span><a href="#Check-For-Null-Values" data-toc-modified-id="Check-For-Null-Values-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Check For Null Values</a></span></li><li><span><a href="#Check-For-Duplicates" data-toc-modified-id="Check-For-Duplicates-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Check For Duplicates</a></span></li><li><span><a href="#Add-TotalPrice-Column" data-toc-modified-id="Add-TotalPrice-Column-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Add TotalPrice Column</a></span></li><li><span><a href="#Remove-Outliers" data-toc-modified-id="Remove-Outliers-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Remove Outliers</a></span></li></ul></li></ul></div>

# Import Packages + Data

In [1]:
# Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Import data & convert to df
data = pd.read_excel('Data/Online_Retail.xlsx')
df = pd.DataFrame(data)

# Preview
df.head()

We can see here that each invoice contains multiple items and quantities. I will create another column that shows the total spent on each line item, i.e. Quantity * UnitPrice. That way we can group by invoice number, customer, etc. and see the total spent per invoice and per item.
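
To make the idea concrete, here is a minimal sketch on a tiny made-up frame (the values are invented for illustration and are not taken from the dataset). It computes the line-level total and then rolls it up per invoice.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "InvoiceNo": [1001, 1001, 1002],
    "CustomerID": [1.0, 1.0, 2.0],
    "Quantity": [2, 1, 5],
    "UnitPrice": [2.50, 4.00, 1.00],
})

# Line-level spend: quantity times unit price
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Invoice-level spend: sum the line totals within each invoice
invoice_totals = toy.groupby("InvoiceNo")["TotalPrice"].sum()
print(invoice_totals)
```

Swapping `"InvoiceNo"` for `"CustomerID"` in the `groupby` gives the total per customer instead.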

We're also going to add Recency, Frequency and Monetary columns so we can conduct an RFM analysis and segment customers that way as well.
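
As a rough sketch of what those three columns could look like, the example below builds them on toy data. The snapshot date (one day after the latest invoice) and the choice to count distinct invoices as Frequency are my assumptions for illustration, not decisions made in this notebook yet.

```python
import pandas as pd

# Toy data for illustration only (made-up values).
toy = pd.DataFrame({
    "InvoiceNo": [1, 1, 2, 3],
    "CustomerID": [10, 10, 10, 20],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-01-10", "2011-01-05"]
    ),
    "TotalPrice": [5.0, 4.0, 6.0, 20.0],
})

# Assumption: recency is measured from the day after the latest invoice.
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy.groupby("CustomerID").agg(
    # Days since the customer's most recent purchase
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # Number of distinct invoices
    Frequency=("InvoiceNo", "nunique"),
    # Total amount spent
    Monetary=("TotalPrice", "sum"),
)
print(rfm)
```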

Let's take a look at some of the basics before we hop into it. 

# Explore + Clean Data

In [None]:
# Info
df.info()

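**InvoiceNo** is currently an object. I'd like to change that to an integer so grouping and sorting by invoice number behave consistently, although the conversion will only succeed if every invoice number is purely numeric.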
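
Before converting, it may be worth checking whether every value actually parses as a number. This is a minimal sketch on a toy column; the value `"C1003"` is a made-up example of a non-numeric invoice number.

```python
import pandas as pd

# Toy column for illustration; "C1003" is a made-up non-numeric invoice number.
invoice_no = pd.Series([1001, 1002, "C1003"], dtype="object")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
as_number = pd.to_numeric(invoice_no, errors="coerce")

# Values that failed to parse
non_numeric = invoice_no[as_number.isna()]
print(non_numeric)

# Only convert when every value parsed cleanly
if non_numeric.empty:
    invoice_no = as_number.astype("int64")
```

If `non_numeric` is not empty, the column has to stay an object (or those rows need handling first).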

**StockCode** can stay an object; I'm guessing it holds strings.

It's great that **InvoiceDate** is already in datetime format, because we can peek at some time series in the EDA to see if we can collect any further insights.
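
As a rough sketch of the kind of time-series peek a datetime column allows, the example below totals quantity per calendar month on toy data. Monthly granularity is just an assumption for illustration.

```python
import pandas as pd

# Toy data for illustration only (made-up values).
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2011-01-05", "2011-01-20", "2011-02-03"]),
    "Quantity": [3, 2, 10],
})

# Bucket each row by calendar month, then total the quantity per month
monthly = toy.groupby(toy["InvoiceDate"].dt.to_period("M"))["Quantity"].sum()
print(monthly)
```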

## Check For Null Values

In [None]:
# Check for missing values

df.isnull().sum()

It looks like we have a good amount of Null values for **CustomerID** and **Description**. Let's see how much of the total this accounts for.

In [None]:
# Description
print('Description Percent Null Values:')
print(f"{((df.Description.isnull().sum())/len(df.Description)*100).round(4)} % \n")

# CustomerID
print('CustomerID Percent Null Values:')
print(f"{((df.CustomerID.isnull().sum())/len(df.CustomerID)*100).round(4)} % \n")

print('==============================')
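
The same percentages can be computed for every column at once: the mean of a boolean mask is the share of `True` values. A minimal sketch on toy data (made-up values):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Description": ["mug", None, "lamp", "bag"],
    "CustomerID": [1.0, np.nan, np.nan, 4.0],
})

# isnull() gives a True/False mask; its column-wise mean is the null share
null_pct = (toy.isnull().mean() * 100).round(4)
print(null_pct)
```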

The share of missing values in the **Description** column is small, but for the **CustomerID** column it is large, at almost 25%. I'm curious how many customers there are, so let's take a look at the number of unique values.

In [None]:
# Unique CustomerIDs

print(f'No. of unique CustomerIDs: \n{len(df.CustomerID.value_counts())}')

Since we still have data from over 4,300 customers, and we don't have any way of identifying the customers with a Null **CustomerID** field, it only makes sense to remove those rows. And since the number of Null **Description** fields is low, we will remove those as well.
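
A plain `dropna()` drops a row if any column is Null. If the intent is to drop only on these two columns, `subset` makes that explicit. This is a minimal sketch on toy data; the `Country` gap is made up to show the difference.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (made-up values).
toy = pd.DataFrame({
    "Description": ["mug", None, "lamp", "bag"],
    "CustomerID": [1.0, 2.0, np.nan, 4.0],
    "Country": [None, "UK", "UK", "UK"],
})

# Drop rows only when CustomerID or Description is missing
cleaned = toy.dropna(subset=["CustomerID", "Description"])

# For comparison: drop rows with a missing value in any column
cleaned_any = toy.dropna()

print(len(cleaned), len(cleaned_any))
```

When Nulls only occur in **CustomerID** and **Description**, both calls give the same result.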

In [None]:
# Drop rows w/null fields
df = df.dropna()

In [None]:
len(df)

In [None]:
df.head()

## Check For Duplicates

In [None]:
#df[df.duplicated()]
df[df.InvoiceNo == 536412].duplicated()

In [None]:
df.iloc[617:622]

None of these seem to be duplicates, so we're going to leave these here. 
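
For a fuller check, `duplicated(keep=False)` flags every copy of a repeated row, which makes it easier to inspect them side by side. A minimal sketch on toy data (made-up values):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "InvoiceNo": [1, 1, 1, 2],
    "StockCode": ["A", "A", "B", "A"],
    "Quantity": [1, 1, 1, 1],
})

# keep=False marks all copies of a repeated row, not just the later ones
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# The default (keep="first") counts only the extra copies
n_extra = int(toy.duplicated().sum())
print(n_extra)
```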

In [None]:
# Summary statistics

df.describe().round(2)

## Add TotalPrice Column

In [None]:
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

In [None]:
df.describe().round(2)

In [None]:
df.TotalPrice.hist();

## Remove Outliers

It looks like there are some major outliers in our dataset. Let's remove them. 

In [None]:
# Percentiles for Quantity, UnitPrice and TotalPrice

# Define percentiles
percentiles = [0, 2.5, 97.5, 100]

# Print them out, one column at a time
for col in ['Quantity', 'UnitPrice', 'TotalPrice']:
    for i in percentiles:
        q = i / 100
        print("{} percentile {}: {}".format(q, col, df[col].quantile(q=q)))

I'm going to remove what may be returns, i.e. rows with negative **Quantity** values: the lower 1% is -2.0 and the lower 2.5% is 1.0. We also removed all of the negative **UnitPrice** values when we removed the rows with Null **CustomerID** values.

We're also going to keep only rows with a **UnitPrice** greater than 0.0, meaning the item actually has a price. The minimum such value is 0.001.

In [None]:
# Remove extreme outliers in the lower and upper 1%

# Get original length to see percent removed
orig_tot = len(df)

# Subset to remove extreme outliers
# Quantity
df = df[(df.Quantity > 0.0) & (df.Quantity <= 120.0)]
# UnitPrice
df = df[(df.UnitPrice > 0.0) & (df.UnitPrice <= 15.0)]

# Calculate percent removed (multiply the fraction by 100 to get a percentage)
print('Percent removed:', round((orig_tot - len(df)) / orig_tot * 100, 2))

We saw that removing the rows with Null **CustomerID** values also removed the negative **UnitPrice** values. I'm wondering if it would be best to remove the rows with negative **Quantity** values as well: the 1% value is -2.0, the 2% value is -1.0 and the 2.5% value is 1.0.

I will keep it standard for now with percentiles, however it ma

It seems returns are extremely rare, which we can see from the **Quantity** 0.01 quantile (1st percentile) being -2.0. I'm wondering if returns should be removed altogether since they are rare, or if there are certain segments of customers who are more prone to returns.
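
One way to explore that question is to compute the share of return lines per customer. The sketch below uses toy data and assumes that a negative **Quantity** marks a return, which is my reading of the data rather than something confirmed here.

```python
import pandas as pd

# Toy data for illustration only (made-up values).
toy = pd.DataFrame({
    "CustomerID": [10, 10, 10, 20, 20, 30],
    "Quantity": [5, -1, 3, 2, 4, -2],
})

# Assumption: a negative Quantity marks a return
toy["IsReturn"] = toy["Quantity"] < 0

# Share of each customer's lines that are returns, highest first
return_rate = (
    toy.groupby("CustomerID")["IsReturn"]
    .mean()
    .sort_values(ascending=False)
)
print(return_rate)
```

This check would need to run before the negative quantities are filtered out, otherwise there is nothing left to count.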

In [None]:
# Pass the column as x= explicitly so the call works across seaborn versions
sns.boxplot(x=df.Quantity)
plt.show()

sns.boxplot(x=df.UnitPrice)
plt.show()

sns.boxplot(x=df.TotalPrice)
plt.show()

In [None]:
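# distplot is deprecated in recent seaborn releases;
# histplot with kde=True is the closest replacement
sns.histplot(x=df.Quantity, kde=True)
plt.show()
sns.histplot(x=df.UnitPrice, kde=True)
plt.show()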

It seems that removing the Null **CustomerID** data also removed all of the negative **UnitPrice** values. 

We can see visually that there are some major outliers. With this data set the outliers are easy to spot by eye, so I could remove them that way, but I'm going to remove them by cutting off the upper and lower percentiles instead.
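
A minimal sketch of percentile-based trimming on toy data is shown below. The 2.5% and 97.5% cutoffs mirror the percentiles printed earlier, but treating them as the trimming limits is an assumption for illustration.

```python
import pandas as pd

# Toy data for illustration only (made-up values with two obvious extremes).
toy = pd.DataFrame({"Quantity": [-50, 1, 2, 3, 4, 5, 6, 7, 8, 1000]})

# Assumed cutoffs: the 2.5th and 97.5th percentiles of the column
lower, upper = toy["Quantity"].quantile([0.025, 0.975])

# Keep only rows that fall between the two cutoffs (inclusive)
trimmed = toy[toy["Quantity"].between(lower, upper)]
print(trimmed)
```

Unlike hard-coded limits, the cutoffs here are recomputed from the data each time, so they shift if the dataset changes.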

In [None]:
# Pairplot
sns.pairplot(df);