In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

# Input data files are available in the "../input/" directory.
import os
ecom_data = pd.read_csv('../input/ecommerce-data/data.csv', encoding='latin1')

In [None]:

ecom_data.head()


In [None]:
ecom_data.describe()

In [None]:
ecom_data.info()

We have 541909 item purchases, but the description is missing for around 1500 of them and the CustomerID for around 135,000.

Quantity and price also contain invalid values, as both fields have negative entries. The median quantity is 3 (the 50% row in the table above), but the mean is about 9, which implies that a few sales are very large. We will confirm this in further analysis. At least a quarter of the rows (the 25% row) have a quantity of just 1. Prices are mostly low, but there seem to be some very expensive items. We will see whether those are erroneous once we analyze the distributions.
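
The counts above can also be computed directly instead of being read off the `info()` and `describe()` output. The following is a minimal sketch on a small made-up frame: `toy` and `quality_summary` are hypothetical names, and the `UnitPrice` column name is an assumption about how the price field is called in `ecom_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ecom_data (all values are made up for illustration).
toy = pd.DataFrame({
    "Description": ["MUG", None, "BAG", "damaged"],
    "CustomerID": [17850.0, np.nan, 13047.0, np.nan],
    "Quantity": [6, 1, -2, -50],
    "UnitPrice": [2.55, 0.0, 1.25, -1.0],  # column name is an assumption
})

def quality_summary(df):
    """Count missing and negative values in the columns discussed above."""
    return {
        "missing_description": int(df["Description"].isna().sum()),
        "missing_customer_id": int(df["CustomerID"].isna().sum()),
        "negative_quantity": int((df["Quantity"] < 0).sum()),
        "negative_price": int((df["UnitPrice"] < 0).sum()),
    }

summary = quality_summary(toy)
print(summary)
```

If the column names match, calling `quality_summary(ecom_data)` should give the same four counts for the real data.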

### Distributions and analysis of each data variable

This section is only an initial look at the distribution of each variable. A more detailed analysis that combines variables will come in a future notebook. We will also skip Description for now, as it would require splitting each entry into several attributes (for instance, color and type of product).

### 1. Invoice nº

Although the first rows suggest that the data is numeric, the `info()` output shows that the column is actually of object type.
We will check how many distinct invoices exist, to get an idea of the total number of purchases.

In [None]:
ecom_data['InvoiceNo'].value_counts()

There are too many invoices to plot comfortably. We will summarize them with the describe function and plot only a random 25% sample.

In [None]:
purchase_size = ecom_data['InvoiceNo'].value_counts()
purchase_size.describe()

In [None]:
sns.distplot(purchase_size.sample(frac=0.25))

In [None]:
sns.boxplot(purchase_size.sample(frac=0.25))

In [None]:
purchase_size.quantile(0.5)

Half of the purchases contain 10 or fewer item lines (the median is 10). A few purchases contain a very large number of different items.
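
Note that `value_counts()` on `InvoiceNo` counts rows per invoice, which equals the number of different items only if no stock code is repeated within an invoice. A minimal sketch of the distinction, on a made-up frame (`toy`, `lines_per_invoice` and `items_per_invoice` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for ecom_data (values are made up for illustration).
toy = pd.DataFrame({
    "InvoiceNo": ["100001", "100001", "100001", "100002", "C100003"],
    "StockCode": ["S1", "S2", "S1", "S3", "S4"],
})

# Rows per invoice: this is what value_counts() on InvoiceNo measures.
lines_per_invoice = toy["InvoiceNo"].value_counts()

# Distinct stock codes per invoice: a repeated code is counted once.
items_per_invoice = toy.groupby("InvoiceNo")["StockCode"].nunique()

print(lines_per_invoice.sort_index())
print(items_per_invoice)
```

Comparing `lines_per_invoice.describe()` with `items_per_invoice.describe()` on the real data would show whether the two measures differ in practice.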

### 2. Stock code

In [None]:
ecom_data['StockCode'].value_counts()

In [None]:
ecom_data['StockCode'].value_counts().describe()

In [None]:
sns.distplot(ecom_data['StockCode'].value_counts())

In [None]:
sns.boxplot(ecom_data['StockCode'].value_counts())

We have 4070 unique stock codes. A typical item appears around 62 times over the whole period covered by the data set. Again, some codes appear a very large number of times.

### 3. Quantity

In [None]:
sns.distplot(ecom_data['Quantity'])

In [None]:
sns.boxplot(ecom_data['Quantity'])

The very large negative and positive quantities may indicate purchases that were entered by mistake and then cancelled. We will test this hypothesis by inspecting the negative quantities and looking for matching sales. We could probably remove the suspicious rows entirely and move on, but I prefer to test my hypotheses.

In [None]:
ecom_data[ecom_data['Quantity'] < -1000]

Visual inspection of this data gives some interesting results. A large number of entries do not have a CustomerID. Most of the time, a NaN in CustomerID comes with a description such as "destroyed", "lost", "sold as sets" or "damaged". It seems that NaN in CustomerID is used as an identifier for store operations, but we will check whether it has been used exclusively for that.
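
One way to check whether NaN in CustomerID is used exclusively for store operations is to look for ordinary-looking sales among the rows without a customer. This is a minimal sketch on a made-up frame; `toy`, `no_customer` and `positive_without_customer` are hypothetical names, and treating a positive quantity as a sign of a regular sale is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ecom_data (values are made up for illustration).
toy = pd.DataFrame({
    "Description": ["damaged", "lost", "WHITE MUG", "RED BAG", "RED BAG"],
    "Quantity": [-1200, -3000, 4, 12, 2],
    "CustomerID": [np.nan, np.nan, np.nan, 15838.0, 12346.0],
})

# Rows that have no customer attached.
no_customer = toy[toy["CustomerID"].isna()]

# Assumption: if NaN marked only store operations, we would expect
# no rows with a positive quantity in this subset.
positive_without_customer = no_customer[no_customer["Quantity"] > 0]

# Most frequent descriptions among the rows without a customer.
top_descriptions = no_customer["Description"].value_counts().head(10)

print(len(no_customer), len(positive_without_customer))
print(top_descriptions)
```

A non-empty `positive_without_customer` on the real data would suggest that NaN also appears on regular sales, not only on stock adjustments.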

The second noticeable thing is that our hypothesis may be correct, and that cancelled purchases can likely be identified from InvoiceNo. The C in InvoiceNo seems to indicate that the purchase was indeed cancelled. Let's check some instances where InvoiceNo contains a C.
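
The relation between the C marker and negative quantities can be counted rather than inspected by eye. The sketch below assumes that the C appears as the first character of `InvoiceNo`, and it uses a made-up frame; `toy`, `is_cancelled` and the three counters are hypothetical names.

```python
import pandas as pd

# Toy stand-in for ecom_data (values are made up for illustration).
toy = pd.DataFrame({
    "InvoiceNo": ["100001", "C100002", "100003", "100004"],
    "Quantity": [100, -100, 6, -30],
})

# Assumption: cancellations are marked by a leading "C".
# astype(str) guards against invoice numbers parsed as integers.
is_cancelled = toy["InvoiceNo"].astype(str).str.startswith("C")
is_negative = toy["Quantity"] < 0

cancelled_and_negative = int((is_cancelled & is_negative).sum())
negative_not_cancelled = int((is_negative & ~is_cancelled).sum())
cancelled_not_negative = int((is_cancelled & ~is_negative).sum())

print(cancelled_and_negative, negative_not_cancelled, cancelled_not_negative)
```

On the real data, a large `negative_not_cancelled` count would point to negative quantities that come from something other than cancellations.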

In [None]:
print('First instance')
print(ecom_data[(ecom_data['StockCode'] == '84347') & (ecom_data['CustomerID'] == 15838)])
print('Second Instance')
print(ecom_data[(ecom_data['StockCode'] == '23166') & (ecom_data['CustomerID'] == 12346)])
print('Third Instance')
print(ecom_data[(ecom_data['StockCode'] == '47566B') & (ecom_data['CustomerID'] == 15749)])
print(ecom_data[(ecom_data['StockCode'] == '85123A') & (ecom_data['CustomerID'] == 15749)])
print('Fourth Instance')
print(ecom_data[(ecom_data['StockCode'] == '22920') & (ecom_data['CustomerID'] == 16938)])

The results are mixed. In some cases it is easy to find the corresponding cancelled order, as they share the number, but in other cases there does not seem to be any cancelled order (the third instance, for example).
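
Instead of checking instances one by one, cancellations could be matched against sales in bulk. The following is a rough sketch under stated assumptions, not a definitive matching rule: it assumes that a cancellation has a leading C in `InvoiceNo` and that it mirrors a sale with the same `CustomerID` and `StockCode` and the opposite `Quantity`. The frame and the names `toy`, `sales`, `cancels`, `matched` and `unmatched` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for ecom_data (values are made up for illustration).
toy = pd.DataFrame({
    "InvoiceNo": ["100001", "C100002", "100003", "C100004"],
    "StockCode": ["S1", "S1", "S2", "S3"],
    "CustomerID": [1.0, 1.0, 2.0, 3.0],
    "Quantity": [500, -500, 10, -40],
})

# Assumption: a leading "C" marks a cancelled invoice.
is_cancelled = toy["InvoiceNo"].astype(str).str.startswith("C")

sales = toy[~is_cancelled].rename(columns={"InvoiceNo": "SaleInvoice"})

# Flip the sign so a cancellation of -500 lines up with a sale of 500.
cancels = (
    toy[is_cancelled]
    .assign(Quantity=lambda d: -d["Quantity"])
    .rename(columns={"InvoiceNo": "CancelInvoice"})
)

# Left merge keeps every cancellation; _merge tells us if a sale was found.
matched = cancels.merge(
    sales,
    on=["CustomerID", "StockCode", "Quantity"],
    how="left",
    indicator=True,
)
unmatched = matched[matched["_merge"] == "left_only"]

print(matched[["CancelInvoice", "SaleInvoice", "_merge"]])
```

This rule misses partial cancellations, where the cancelled quantity differs from the purchased one, so rows in `unmatched` are candidates for a closer look rather than proven errors.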

In [None]:
ecom_data[ecom_data['StockCode'] == '23166']