# Customer segmentation with E-Commerce Data



## Context
Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made this dataset, which contains actual transactions from 2010 and 2011, publicly available. The dataset is maintained on their site, where it can be found under the title "Online Retail".

## Content
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

## Acknowledgements
Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

## Source

Dr. Daqing Chen, Course Director: MSc Data Science. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.


[Dataset](https://www.kaggle.com/carrie1/ecommerce-data/data#)

# Imports

In [38]:
import plotly.express as px
import pandas as pd
import numpy as np

# Load the data

In [2]:
e_comerce_data = pd.read_csv('data/data.csv', encoding='ISO-8859-1')
e_comerce_data.head()

|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|-----------|-----------|:------------|----------|-------------|-----------|------------|---------|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |


# Dataset description

| Variable Name | Type  | Description |
|---------------|-------|:------------|
|InvoiceNo      |Nominal|Invoice number. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.|
|StockCode      |Nominal|Product (item) code. A 5-digit integral number uniquely assigned to each distinct product.|
|Description    |Nominal|Product (item) name.|
|Quantity       |Numeric|The quantities of each product (item) per transaction.|
|InvoiceDate    |Numeric|Invoice date and time. The day and time when a transaction was generated.|
|UnitPrice      |Numeric|Unit price. Product price per unit in sterling (£).|
|CustomerID     |Nominal|Customer number. A 5-digit integral number uniquely assigned to each customer.|
|Country        |Nominal|Country name. The name of the country where a customer resides.|
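
The column descriptions above suggest two small preparation steps: parsing `InvoiceDate` into real timestamps and flagging cancellations from the `InvoiceNo` prefix. The following is a minimal sketch on hypothetical rows, assuming the month/day/year order that the sample rows suggest. The `sample` frame, its second row, and the `is_canceled` column name are illustrative and not part of the dataset.

```python
import pandas as pd

# Hypothetical rows shaped like the dataset; the second row is made up.
sample = pd.DataFrame({
    'InvoiceNo': ['536365', 'C000001'],
    'InvoiceDate': ['12/1/2010 8:26', '12/1/2010 9:41'],
})

# Assumption: dates are stored as month/day/year hour:minute.
sample['InvoiceDate'] = pd.to_datetime(sample['InvoiceDate'],
                                       format='%m/%d/%Y %H:%M')

# The description names a 'c' prefix; the check ignores case so that an
# upper-case 'C' is caught as well.
sample['is_canceled'] = (sample['InvoiceNo'].astype(str)
                         .str.lower()
                         .str.startswith('c'))

print(sample)
```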

# Helper functions

In [3]:
def get_duplicates(dataframe):
    return dataframe[dataframe.duplicated()]

# Data preprocessing

In [4]:
print(f'Removed {len(get_duplicates(e_comerce_data))} duplicated entries')
e_comerce_data.drop_duplicates(inplace=True)

Removed 5268 duplicated entries
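
To make the behaviour of `get_duplicates` concrete, here is a small sketch on a hypothetical three-row frame. `DataFrame.duplicated()` marks a row only when every column repeats an earlier row, so the first occurrence is kept. The `toy` frame is illustrative.

```python
import pandas as pd

def get_duplicates(dataframe):
    # Rows that repeat an earlier row in every column.
    return dataframe[dataframe.duplicated()]

# Hypothetical data: rows 0 and 1 are identical, row 2 differs in StockCode.
toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', '536365'],
    'StockCode': ['85123A', '85123A', '71053'],
    'Quantity': [6, 6, 6],
})

duplicates = get_duplicates(toy)       # only the second copy is reported
deduplicated = toy.drop_duplicates()   # the first copy survives

print(len(duplicates), len(deduplicated))
```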


# Data Structures

In [3]:
class Customer:
    def __init__(self, customer_id, country):
        self.customer_id = customer_id
        self.country = country
        self.transactions = []

    def addTransaction(self, transaction):
        # Record a transaction made by this customer.
        self.transactions.append(transaction)
                       
    def __str__(self):
        return f'{self.customer_id} : {self.country}'
    

                       
class Product:
    def __init__(self, price, code, description):
        self.price = price
        self.code = code
        self.description = description
        
    def __str__(self):
        return f'{self.description} : {self.code} : {self.price}£'
    
    
                       
class Transaction:
    def __init__(self, product, quantity, date, canceled):
        self.product = product
        self.quantity = quantity
        self.date = date
        self.canceled = canceled
                       
    def __str__(self):
        status = "Canceled" if self.canceled else "Successful"
        return f'{self.product} x {self.quantity} : {self.date} : Status: {status}'
    

                       
class ECommerce:
    def __init__(self):
        self.customers = []
        self.products = []
        self.transactions = []
                       
    def addCustomer(self, customer):
        self.customers.append(customer)
                       
    def addProduct(self, product):
        self.products.append(product)
                       
    def addTransaction(self, transaction):
        self.transactions.append(transaction)
                       
    def customer_exist(self, customer_id):
        return any(customer.customer_id == customer_id for customer in self.customers)

    def product_exist(self, code):
        return any(product.code == code for product in self.products)

    def is_canceled(self, invoice_nr):
        # Cancelled invoices carry a 'c' prefix; compare case-insensitively
        # in case the data stores it as an upper-case 'C'.
        return str(invoice_nr)[:1].lower() == 'c'

    def getCustomer(self, customer_id):
        for customer in self.customers:
            if customer.customer_id == customer_id:
                return customer
        raise NameError(f'Unknown customer: {customer_id}')

    def getProduct(self, code):
        for product in self.products:
            if product.code == code:
                return product
        raise NameError(f'Unknown product: {code}')

    def __str__(self):
        return (f'Customers: {len(self.customers)}\n'
                f'Products: {len(self.products)}\n'
                f'Transactions: {len(self.transactions)}')
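
The existence checks and getters above scan a list on every call, which becomes slow as the number of customers grows. One possible alternative, shown here only as a sketch, keeps customers in a dictionary keyed by `customer_id`. `CustomerRegistry` and `get_or_create` are hypothetical names, and the `Customer` dataclass is a simplified stand-in for the class defined above.

```python
from dataclasses import dataclass, field

@dataclass
class Customer:
    # Simplified stand-in for the notebook's Customer class.
    customer_id: float
    country: str
    transactions: list = field(default_factory=list)

class CustomerRegistry:
    """Hypothetical helper: customers stored in a dict keyed by id."""

    def __init__(self):
        self._by_id = {}

    def get_or_create(self, customer_id, country):
        # A dict lookup replaces the linear scan over a list.
        customer = self._by_id.get(customer_id)
        if customer is None:
            customer = Customer(customer_id, country)
            self._by_id[customer_id] = customer
        return customer

    def __len__(self):
        return len(self._by_id)

registry = CustomerRegistry()
first = registry.get_or_create(17850.0, 'United Kingdom')
again = registry.get_or_create(17850.0, 'United Kingdom')
other = registry.get_or_create(13047.0, 'United Kingdom')

print(len(registry))
```

The same pattern would apply to products keyed by `code`.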

# Data initialization

In [4]:
e_commerce = ECommerce()

def add_row_to_ecommerce(row):
    invoice_nr, code, description, quantity, date, price, customer_id, country = row.values
    # Reuse the customer if it is already known, otherwise register it.
    if not e_commerce.customer_exist(customer_id):
        customer = Customer(customer_id, country)
        e_commerce.addCustomer(customer)
    else:
        customer = e_commerce.getCustomer(customer_id)
    # Same for the product, keyed by its stock code.
    if not e_commerce.product_exist(code):
        product = Product(price, code, description)
        e_commerce.addProduct(product)
    else:
        product = e_commerce.getProduct(code)
    is_canceled = e_commerce.is_canceled(invoice_nr)
    transaction = Transaction(product, quantity, date, is_canceled)
    # Attach the transaction to both the customer and the global list.
    customer.addTransaction(transaction)
    e_commerce.addTransaction(transaction)

for row_idx in range(len(e_comerce_data)//100):
    add_row_to_ecommerce(e_comerce_data.iloc[row_idx])
    
print(e_commerce)

Customers: 21175
Products: 2908
Transactions: 54190


# Dataset analysis

In [5]:
e_comerce_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 536641 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      536641 non-null object
StockCode      536641 non-null object
Description    535187 non-null object
Quantity       536641 non-null int64
InvoiceDate    536641 non-null object
UnitPrice      536641 non-null float64
CustomerID     401604 non-null float64
Country        536641 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 36.8+ MB
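
The `info()` output shows fewer non-null values for `Description` and `CustomerID` than for the other columns. A quick way to count the gaps per column, and to keep only rows with a known customer, might look like the sketch below. The `toy` frame is hypothetical, and whether rows without a `CustomerID` should be dropped is an assumption that depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same kind of gaps as the real data.
toy = pd.DataFrame({
    'Description': ['WHITE METAL LANTERN', None, 'CREAM CUPID HEARTS COAT HANGER'],
    'CustomerID': [17850.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Keep only rows that can be attributed to a customer.
with_customer = toy.dropna(subset=['CustomerID'])

print(missing_per_column)
```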


In [6]:
e_comerce_data.describe()

|       | Quantity   | UnitPrice  | CustomerID   |
|-------|------------|------------|--------------|
| count | 536641.0   | 536641.0   | 401604.0     |
| mean  | 9.620029   | 4.632656   | 15281.160818 |
| std   | 219.130156 | 97.233118  | 1714.006089  |
| min   | -80995.0   | -11062.06  | 12346.0      |
| 25%   | 1.0        | 1.25       | 13939.0      |
| 50%   | 3.0        | 2.08       | 15145.0      |
| 75%   | 10.0       | 4.13       | 16784.0      |
| max   | 80995.0    | 38970.0    | 18287.0      |
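
The summary shows negative minimum values for both `Quantity` and `UnitPrice`. If only ordinary sales are of interest, one option is to keep rows where both are positive. This is a sketch on hypothetical values; `valid_sales` is an illustrative name, and filtering this way is an assumption rather than a step taken in this notebook.

```python
import pandas as pd

# Hypothetical rows: one ordinary sale, one negative quantity, one negative price.
toy = pd.DataFrame({
    'Quantity': [6, -2, 3],
    'UnitPrice': [2.55, 3.39, -1.0],
})

# Keep rows where both the quantity and the unit price are positive.
valid_sales = toy[(toy['Quantity'] > 0) & (toy['UnitPrice'] > 0)]

print(valid_sales)
```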


In [81]:
e_comerce_data['spent'] = e_comerce_data.Quantity * e_comerce_data.UnitPrice
# Mean amount per row for each country, kept as a numeric column so that
# the continuous colour scale applies.
average_spending = (e_comerce_data.groupby('Country')['spent']
                    .mean()
                    .reset_index(name='Average Spending'))
fig = px.choropleth(average_spending,
                    locations='Country',
                    color='Average Spending',
                    locationmode='country names',
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
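
The numbers behind the map can also be inspected as a plain table. The sketch below repeats the per-country aggregation on a hypothetical three-row frame and sorts the result. The `toy` data are illustrative.

```python
import pandas as pd

# Hypothetical rows; 'spent' is quantity times unit price, as above.
toy = pd.DataFrame({
    'Country': ['United Kingdom', 'United Kingdom', 'France'],
    'Quantity': [6, 2, 10],
    'UnitPrice': [2.50, 1.00, 1.50],
})
toy['spent'] = toy['Quantity'] * toy['UnitPrice']

# Mean 'spent' per country, highest first.
average_spending = (toy.groupby('Country')['spent']
                    .mean()
                    .reset_index(name='Average Spending')
                    .sort_values('Average Spending', ascending=False)
                    .reset_index(drop=True))

print(average_spending)
```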