# Customer Segmentation Analysis

## Introduction

The past several years have seen a sharp rise in e-commerce sales. Retail e-commerce sales reached $3.53 trillion in 2019, and e-retail revenues are expected to reach $6.54 trillion by 2022, according to Statista. This huge increase suggests that consumers' purchasing habits have changed significantly. When compared to traditional sales, e-commerce has the distinct advantage that all transaction data, such as the goods, pricing, and shopping time, can be precisely recorded and saved. By grouping consumers into meaningful categories based on extensive transaction data, the firm may better understand the habits and preferences of its customers and meet their requirements more quickly.

Enhancing customer retention and corporate profitability requires an enterprise to recognize customer behavior and target the right customers. Practical information about the product, seen from the customer's perspective and in terms of the benefits customers expect, has been captured through the many interactions between the business and its clients. Grouping customers with similar characteristics is vital to increasing customer satisfaction and reducing operating expenses. Grouping customers also reduces the complexity of product design. Consequently, understanding customer characteristics allows the business to manage product planning and create goods that suit its consumers.



## Data and Sources
This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

The online shop that is the subject of this article is a registered non-store firm with 80 employees that is situated in the UK. The business was founded in 1981 and specializes in offering unusual presents for every occasion. For many years the merchant took orders over the phone and relied mostly on direct-mail catalogs. Just two years prior, the business established its own website and made the full transition to the internet. Since then, the business has maintained a consistent and healthy clientele from around Europe and the United Kingdom, and it has amassed a vast quantity of consumer data. Additionally, the business markets and sells its goods via Amazon.co.uk.

Table 1 displays the 8 variables that make up the customer transaction dataset that the merchant owns. The dataset comprises all of the transactions that happened in the years 2010 and 2011. There were a total of 22,190 legitimate transactions throughout that time frame, linked to 4,381 valid unique postcodes. The dataset contains 406,830 instances (record rows) that correspond to these transactions, each of which represents a specific item that was a part of a transaction. It should be mentioned that the variable PostCode is essential to the operation of the company since it provides the information that identifies and tracks each individual customer, allowing for some in-depth analysis to be conducted in the current research.

On average, each postcode is associated with five transactions, that is, each customer has purchased a product from the online retailer about once every 2 months.

In addition, only consumers from the United Kingdom are analysed. It is interesting to notice that the average number of distinct products (items) contained in each transaction occurring in 2011 was 18.3 (= 406,830 / 22,190). This seems to suggest that many of the consumers of the business were organizational customers rather than individual customers.
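
The averages quoted above follow directly from the counts reported for the dataset. A minimal sketch of the arithmetic, using only those figures (the variable names are illustrative, and treating the covered period as roughly 12 months is an assumption):

```python
# Counts quoted in the text above
valid_transactions = 22_190   # legitimate transactions
unique_postcodes = 4_381      # valid unique postcodes
item_rows = 406_830           # record rows, one per item in a transaction

# Average number of transactions per postcode (i.e. per customer)
transactions_per_postcode = valid_transactions / unique_postcodes

# Average number of distinct items per transaction
items_per_transaction = item_rows / valid_transactions

# Assumption: the data span roughly 12 months, so this approximates
# the gap between purchases for an average customer
months_between_purchases = 12 / transactions_per_postcode

print(f"{transactions_per_postcode:.1f} transactions per postcode")
print(f"{items_per_transaction:.1f} items per transaction")
print(f"about one purchase every {months_between_purchases:.1f} months")
```

The result is consistent with the rounded figures in the text: about five transactions per postcode and roughly one purchase every two months.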


In [1]:
# import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import re
import warnings
warnings.filterwarnings('ignore')


In [2]:
# load the data and print the first few rows
online_retail = pd.read_excel('Online Retail.xlsx')
online_retail.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


## Initial Exploration

+ Missing 25% of CustomerID: Since market/customer segmentation requires assigning each individual client to a category, the absence of a unique customer identifier may pose a problem.

+ Negative Unit Prices: A negative UnitPrice is unusual, as it would mean a cash outflow from the company. It could result from an incorrect discount configuration, a refund, a cancelled order, or a bad-debt write-off incurred by the business.

+ Potential data reversal: Incorrect data mapping or formatting during data import from another system may reverse signs and produce negative values. Further investigation is needed to understand the nature of such reversals and determine the best way to handle them.

+ CustomerID column should be an object, not a float.
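
The observations in this list can be checked with a few pandas expressions. The sketch below runs on a small made-up frame rather than the real data; the frame `sample` and its values are assumptions for illustration only, and the same expressions could be applied to `online_retail`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the retail data (values are made up for illustration)
sample = pd.DataFrame({
    'UnitPrice': [2.55, 3.39, -11062.06, 1.25],
    'CustomerID': [17850.0, 17850.0, np.nan, 13047.0],
})

# Share of rows without a customer identifier
missing_share = sample['CustomerID'].isna().mean()

# Rows with a negative unit price, kept aside for manual inspection
negative_rows = sample[sample['UnitPrice'] < 0]

# Store CustomerID as text rather than float, keeping missing values missing
customer_id_text = sample['CustomerID'].map(
    lambda v: str(int(v)) if pd.notna(v) else np.nan
)

print(f"missing CustomerID: {missing_share:.0%}")
print(negative_rows)
print(customer_id_text.tolist())
```

Converting through `int` first removes the trailing `.0` that a direct float-to-string conversion would leave in each identifier.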




In [3]:
# check summary statistics
online_retail.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


In [4]:
# check for data types and missing values
online_retail.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [5]:
# make a copy of the data 
online_retail_data = online_retail.copy()

### Helper Functions

In [9]:
# define a function to filter a dataframe on a single column condition
def filter_records(df, column, criterion, operator = 'equal'):
    '''
    This function accepts a dataframe and filters it based on a condition on one column
    
    df: Dataframe in question
    column: column of interest
    criterion: value to compare the column against
    operator: type of comparison: 'equal' (==), 'less' (<=) or 'greater' (>=)
    
    return : a dataframe containing only the rows that satisfy the condition
    '''
    
    if operator == 'equal':
        return df[df[column] == criterion]
    elif operator == 'less':
        return df[df[column] <= criterion]
    elif operator == 'greater':
        return df[df[column] >= criterion]
    else:
        # fail loudly instead of silently returning None
        raise ValueError(f"unknown operator: {operator!r}")

In [10]:
# define a function to remove records that match a given value
def remove_records(df, column, criterion):
    '''
    This function accepts a dataframe and removes the records whose column value equals the criterion
    
    df: Dataframe in question
    column: column of interest
    criterion: value whose matching rows are removed
    
    return : a dataframe with the matching rows removed
    '''
    return df[df[column] != criterion]
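
To illustrate how helpers of this kind behave, here is a condensed, self-contained sketch on made-up data. The names `filter_by`, `remove_by`, and `COMPARISONS` are hypothetical stand-ins for the functions above, and the dictionary lookup is just one possible alternative to the `if`/`elif` chain.

```python
import operator as op

import pandas as pd

# Hypothetical lookup table: the keys mirror the operator names used above
COMPARISONS = {'equal': op.eq, 'less': op.le, 'greater': op.ge}

def filter_by(df, column, criterion, how='equal'):
    """Keep the rows where `column` compares to `criterion` as requested."""
    if how not in COMPARISONS:
        raise ValueError(f"unknown comparison: {how!r}")
    return df[COMPARISONS[how](df[column], criterion)]

def remove_by(df, column, criterion):
    """Drop the rows where `column` equals `criterion`."""
    return df[df[column] != criterion]

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'Country': ['United Kingdom', 'France', 'United Kingdom'],
    'Quantity': [6, -1, 12],
})

uk_only = filter_by(toy, 'Country', 'United Kingdom')   # rows 0 and 2
bulk = filter_by(toy, 'Quantity', 10, how='greater')    # row 2 only
no_france = remove_by(toy, 'Country', 'France')         # rows 0 and 2

print(uk_only, bulk, no_france, sep='\n\n')
```

Note that 'less' and 'greater' are inclusive comparisons (`<=` and `>=`), matching the behaviour of `filter_records`.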

## Data Preprocessing and Feature Engineering

### 1. InvoiceDate: Separate Date and Time information from InvoiceDate

The `InvoiceDate` column contains both the date and the time of the transaction. These are separated into individual columns to facilitate future feature engineering and data manipulation. We also create additional date-time columns such as year, month, month name, and weekday.

In [6]:
# split date and time from InvoiceDate
online_retail_data['Date'] = online_retail_data.InvoiceDate.dt.date
online_retail_data['Time'] = online_retail_data.InvoiceDate.dt.time
online_retail_data['Year'] = online_retail_data.InvoiceDate.dt.year
online_retail_data['Month'] = online_retail_data.InvoiceDate.dt.month
online_retail_data['MonthName'] = online_retail_data.InvoiceDate.dt.month_name()
online_retail_data['WeedDay'] = online_retail_data.InvoiceDate.dt.weekday

# remove InvoiceDate
online_retail_data.drop(['InvoiceDate'], axis = 1, inplace = True)

# verify
online_retail_data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Date,Time,Year,Month,MonthName,WeedDay
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2.55,17850.0,United Kingdom,2010-12-01,08:26:00,2010,12,December,2
1,536365,71053,WHITE METAL LANTERN,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00,2010,12,December,2
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2.75,17850.0,United Kingdom,2010-12-01,08:26:00,2010,12,December,2
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00,2010,12,December,2
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00,2010,12,December,2
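
The weekday value of 2 in the output above corresponds to a Wednesday, because `dt.weekday` counts from Monday = 0 to Sunday = 6. No weekend indicator is created in the cell above; if one is needed, it could be derived from the weekday as in the sketch below. The toy timestamps and the column name `IsWeekend` are assumptions for illustration.

```python
import pandas as pd

# Toy timestamps (made up): a Wednesday, a Saturday and a Sunday
toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime([
        '2010-12-01 08:26:00',
        '2010-12-04 10:15:00',
        '2010-12-05 16:40:00',
    ])
})

# dt.weekday counts from Monday = 0 to Sunday = 6
toy['WeekDay'] = toy['InvoiceDate'].dt.weekday
toy['DayName'] = toy['InvoiceDate'].dt.day_name()

# Hypothetical extra column: True for Saturday (5) and Sunday (6)
toy['IsWeekend'] = toy['WeekDay'] >= 5

print(toy)
```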


### 2. InvoiceNo: Extract Transaction status from `InvoiceNo`

InvoiceNo contains both the transaction status (a letter such as 'C', which denotes a cancelled transaction) and the transaction identifier (the numeric invoice number). Extracting these into separate columns facilitates further feature engineering.


In [7]:
# separate order status and invoice number from InvoiceNo
# (str.extract returns NaN where the pattern does not match)
invoice_text = online_retail_data['InvoiceNo'].astype(str)
online_retail_data['OrderCategory'] = invoice_text.str.extract(r'([A-Z])', expand=False)
online_retail_data['InvoiceNum'] = invoice_text.str.extract(r'(\d+)', expand=False)

# remove InvoiceNo 
online_retail_data.drop(['InvoiceNo'], axis = 1, inplace = True)

# verify 
online_retail_data.head()

Unnamed: 0,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Date,Time,Year,Month,MonthName,WeedDay,OrderCategory,InvoiceNum
0,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2.55,17850.0,United Kingdom,2010-12-01,08:26:00,2010,12,December,2,,536365
1,71053,WHITE METAL LANTERN,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00,2010,12,December,2,,536365
2,84406B,CREAM CUPID HEARTS COAT HANGER,8,2.75,17850.0,United Kingdom,2010-12-01,08:26:00,2010,12,December,2,,536365
3,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00,2010,12,December,2,,536365
4,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00,2010,12,December,2,,536365
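
The `OrderCategory` column is empty in the rows above because ordinary invoice numbers contain no letter. The sketch below applies the same two patterns to single values with the standard `re` module; the function name `split_invoice` is hypothetical and the sample invoice numbers are used only for illustration.

```python
import re

def split_invoice(invoice_no):
    """Return (status letter or None, digits or None) for one invoice number."""
    text = str(invoice_no)
    letter = re.search(r'[A-Z]', text)   # first upper-case letter, if any
    digits = re.search(r'\d+', text)     # first run of digits, if any
    return (letter.group() if letter else None,
            digits.group() if digits else None)

# A plain invoice, a cancelled one and a bad-debt adjustment
for raw in [536365, 'C536379', 'A563185']:
    print(raw, '->', split_invoice(raw))
```

An invoice without a letter yields `None` for the status, which plays the same role as the missing `OrderCategory` values in the table.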


#### Drop Rows with Bad Debts
The 'A' category marks bad-debt adjustments. These rows do not represent actual sales and are not tied to any specific customer, so we will remove them.

In [14]:
# check the unique order category
online_retail_data.OrderCategory.unique()

array([nan, 'C', 'A'], dtype=object)

In [8]:
# filter for rows with category A and check the description
online_retail_data[online_retail_data['OrderCategory'] == 'A']

Unnamed: 0,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Date,Time,Year,Month,MonthName,WeedDay,OrderCategory,InvoiceNum
299982,B,Adjust bad debt,1,11062.06,,United Kingdom,2011-08-12,14:50:00,2011,8,August,4,A,563185
299983,B,Adjust bad debt,1,-11062.06,,United Kingdom,2011-08-12,14:51:00,2011,8,August,4,A,563186
299984,B,Adjust bad debt,1,-11062.06,,United Kingdom,2011-08-12,14:52:00,2011,8,August,4,A,563187
