# A practical analytical project from Quantium

## Overview

This project analyzes a customer transaction dataset to identify purchasing behavior patterns and turn them into commercial insights for the client.

### Context

You are part of Quantium’s retail analytics team and have been approached by your client, the Category Manager for chips, who wants to better understand the types of customers who purchase chips and their purchasing behaviour within the region.

The insights from your analysis will feed into the supermarket’s strategic plan for the chip category in the next half year.

## Project Goals

Here are the main points of this project:
- Examine and clean transaction and customer data.
- Identify customer segments based on purchasing behavior.
- Create charts and graphs to present data insights.
- Derive commercial recommendations from the data analysis.

## Actions

- Analyze transaction and customer data. 
- Develop metrics and examine sales drivers.
- Segment customers based on purchasing behavior.
- Create visualizations.
- Formulate a clear recommendation for the client's strategy.

## Data

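There are two datasets provided for this project:
1. `QVI_transaction_data.xlsx` - Transaction records: date, store number, loyalty card number, transaction ID, product number, product name, quantity, and total sales.
2. `QVI_purchase_behaviour.csv` - Customer attributes per loyalty card number: lifestage and premium customer segment.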

## Analysis

1. Examine transaction data
1. Examine customer data
1. Data analysis and customer segments
1. Define recommendation by customer segments

## Data preparation and customer analytics

In [1]:
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# setting options
pd.set_option('display.max_columns', None)
pd.set_option("display.float_format", "{:.2f}".format)
pd.set_option('max_colwidth', 0)

### Transaction dataset

Let's start by examining the transaction dataset.

In [3]:
# load the transaction dataset (only the first 10,000 rows, via nrows)

transactions = pd.read_excel('QVI_transaction_data.xlsx', nrows=10000)

In [4]:
# shape of the dataset
print("Transactions dataset shape:", transactions.shape)

Transactions dataset shape: (10000, 8)


In [5]:
# inspect the first few rows of the transactions dataset
transactions.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES
0,43390,1,1000,1,5,Natural Chip Compny SeaSalt175g,2,6.0
1,43599,1,1307,348,66,CCs Nacho Cheese 175g,3,6.3
2,43605,1,1343,383,61,Smiths Crinkle Cut Chips Chicken 170g,2,2.9
3,43329,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion 175g,5,15.0
4,43330,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili 150g,3,13.8


In [6]:
# rename columns for better readability
transactions.rename(columns={'DATE': 'date',
                             'STORE_NBR': 'store_number',
                             'LYLTY_CARD_NBR': 'loyalty_card_number',
                             'TXN_ID': 'transaction_id',
                             'PROD_NBR': 'product_number',
                             'PROD_NAME': 'product_name',
                             'PROD_QTY': 'product_quantity',
                             'TOT_SALES': 'total_sales'}, inplace=True)
# inspect the first few rows after renaming columns
transactions.head()

Unnamed: 0,date,store_number,loyalty_card_number,transaction_id,product_number,product_name,product_quantity,total_sales
0,43390,1,1000,1,5,Natural Chip Compny SeaSalt175g,2,6.0
1,43599,1,1307,348,66,CCs Nacho Cheese 175g,3,6.3
2,43605,1,1343,383,61,Smiths Crinkle Cut Chips Chicken 170g,2,2.9
3,43329,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion 175g,5,15.0
4,43330,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili 150g,3,13.8


In [7]:
# checking for data types
transactions.dtypes

date                   int64  
store_number           int64  
loyalty_card_number    int64  
transaction_id         int64  
product_number         int64  
product_name           object 
product_quantity       int64  
total_sales            float64
dtype: object

In [8]:
# checking for missing values
transactions.isna().sum()

date                   0
store_number           0
loyalty_card_number    0
transaction_id         0
product_number         0
product_name           0
product_quantity       0
total_sales            0
dtype: int64

In [9]:
# convert the 'date' column from Excel serial numbers to datetime format
# Excel serial dates count days from 1899-12-30 (1900 date system)
transactions['date'] = pd.to_datetime(transactions['date'], origin='1899-12-30', unit='D')

In [10]:
# checking for data types again
transactions.dtypes

date                   datetime64[ns]
store_number           int64         
loyalty_card_number    int64         
transaction_id         int64         
product_number         int64         
product_name           object        
product_quantity       int64         
total_sales            float64       
dtype: object

In [11]:
# inspect the first few rows of transactions after date conversion
transactions.head()

Unnamed: 0,date,store_number,loyalty_card_number,transaction_id,product_number,product_name,product_quantity,total_sales
0,2018-10-17,1,1000,1,5,Natural Chip Compny SeaSalt175g,2,6.0
1,2019-05-14,1,1307,348,66,CCs Nacho Cheese 175g,3,6.3
2,2019-05-20,1,1343,383,61,Smiths Crinkle Cut Chips Chicken 170g,2,2.9
3,2018-08-17,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion 175g,5,15.0
4,2018-08-18,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili 150g,3,13.8
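The conversion can be checked in isolation. The sketch below takes three serial numbers from the `DATE` column shown earlier and converts them the same way; the variable names `serials` and `dates` are only illustrative. The origin is 1899-12-30 rather than 1900-01-01 because Excel's 1900 date system numbers 1900-01-01 as day 1 and also counts a non-existent 29 February 1900.

```python
import pandas as pd

# Serial numbers copied from the first three rows of the DATE column
serials = pd.Series([43390, 43599, 43605])

# Count whole days from the 1899-12-30 origin
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")

print(dates.dt.strftime("%Y-%m-%d").tolist())
# ['2018-10-17', '2019-05-14', '2019-05-20']
```

The printed dates match the first three rows of the converted table above.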


In [12]:
# show info about the dataset
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 10000 non-null  datetime64[ns]
 1   store_number         10000 non-null  int64         
 2   loyalty_card_number  10000 non-null  int64         
 3   transaction_id       10000 non-null  int64         
 4   product_number       10000 non-null  int64         
 5   product_name         10000 non-null  object        
 6   product_quantity     10000 non-null  int64         
 7   total_sales          10000 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 625.1+ KB


In [13]:
# inspecting unique product names
transactions['product_name'].unique()[:10]

array(['Natural Chip        Compny SeaSalt175g',
       'CCs Nacho Cheese    175g',
       'Smiths Crinkle Cut  Chips Chicken 170g',
       'Smiths Chip Thinly  S/Cream&Onion 175g',
       'Kettle Tortilla ChpsHny&Jlpno Chili 150g',
       'Old El Paso Salsa   Dip Tomato Mild 300g',
       'Smiths Crinkle Chips Salt & Vinegar 330g',
       'Grain Waves         Sweet Chilli 210g',
       'Doritos Corn Chip Mexican Jalapeno 150g',
       'Grain Waves Sour    Cream&Chives 210G'], dtype=object)

In [14]:
# cleaning product names column; removing leading/trailing spaces and extra spaces between words
transactions['product_name'] = transactions['product_name'].str.strip() \
    .str.replace(r'\s+', ' ', regex=True)

transactions['product_name'].head(10)


0    Natural Chip Compny SeaSalt175g         
1    CCs Nacho Cheese 175g                   
2    Smiths Crinkle Cut Chips Chicken 170g   
3    Smiths Chip Thinly S/Cream&Onion 175g   
4    Kettle Tortilla ChpsHny&Jlpno Chili 150g
5    Old El Paso Salsa Dip Tomato Mild 300g  
6    Smiths Crinkle Chips Salt & Vinegar 330g
7    Grain Waves Sweet Chilli 210g           
8    Doritos Corn Chip Mexican Jalapeno 150g 
9    Grain Waves Sour Cream&Chives 210G      
Name: product_name, dtype: object
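As a small standalone check of the whitespace cleanup, the sketch below applies the same `strip` and `\s+` replacement to a few names. The first two strings come from the output above; the padding around the third is added here purely for illustration. An alternative that splits on whitespace and re-joins is shown for comparison and is not part of the original notebook.

```python
import pandas as pd

raw = pd.Series([
    "Natural Chip        Compny SeaSalt175g",
    "CCs Nacho Cheese    175g",
    "  Grain Waves Sour    Cream&Chives 210G  ",  # leading/trailing padding added for illustration
])

# Same approach as the cell above: trim the ends, then collapse runs of whitespace
clean = raw.str.strip().str.replace(r"\s+", " ", regex=True)

# Alternative: split on any whitespace and re-join with single spaces
alt = raw.str.split().str.join(" ")

print(clean.tolist())
print(alt.tolist() == clean.tolist())
```

Both approaches should produce identical strings; the regex version keeps the intent explicit, while the split/join version needs no pattern.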

In [15]:
# extracting weight (in grams) from product names; expand=False returns a Series
transactions['product_weight'] = transactions['product_name'] \
    .str.extract(r'(\d+)[gG]', expand=False) \
    .astype(float)

transactions.dtypes

date                   datetime64[ns]
store_number           int64         
loyalty_card_number    int64         
transaction_id         int64         
product_number         int64         
product_name           object        
product_quantity       int64         
total_sales            float64       
product_weight         float64       
dtype: object
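To see what the weight pattern does on individual names, here is a minimal sketch. The first two names are from the dataset; the third is a hypothetical name without a weight, included only to show that unmatched rows become `NaN`. Passing `expand=False` is a choice made in this sketch so that a Series is returned.

```python
import pandas as pd

names = pd.Series([
    "Natural Chip Compny SeaSalt175g",
    "Grain Waves Sour Cream&Chives 210G",   # upper-case G is matched by [gG]
    "Hypothetical Product No Weight",       # invented name, not in the dataset
])

# Capture the digits that directly precede 'g' or 'G'
weights = names.str.extract(r"(\d+)[gG]", expand=False).astype(float)

print(weights.tolist())
```

Converting to `float` rather than `int` lets the column hold `NaN` for names where no weight is found.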

In [16]:
# removing weight from product names
transactions['product_name'] = transactions['product_name'] \
    .str.replace(r'(\d+)[gG]', '', regex=True) \
    .str.strip()

Product names were cleaned by removing extra whitespace and moving the product weight into its own `product_weight` column.
Brand extraction was considered but not performed because brand naming is inconsistent and no reliable rule-based approach was found.
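The naming inconsistency can be illustrated with a naive rule. The sketch below is not part of the original analysis: it takes the first word of four product names from this dataset as a hypothetical "brand". The assumption that `RRD` and `NCC` abbreviate "Red Rock Deli" and "Natural Chip Compny" is an inference from the names, not something the data confirms.

```python
import pandas as pd

names = pd.Series([
    "Red Rock Deli Thai ChilliLime",
    "RRD Pc Sea Salt",
    "Natural Chip Compny SeaSalt",
    "NCC Sour Cream  Garden Chives",
])

# Naive rule: treat the first word as the brand
first_word = names.str.split().str[0]

print(first_word.tolist())
# ['Red', 'RRD', 'Natural', 'NCC']
```

The rule yields four distinct values for what look like two brands, and it truncates multi-word brands, which is the kind of problem a simple rule-based extraction would have to work around.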

In [17]:
# checking results
transactions.tail()

Unnamed: 0,date,store_number,loyalty_card_number,transaction_id,product_number,product_name,product_quantity,total_sales,product_weight
9995,2019-01-30,106,106188,107888,11,RRD Pc Sea Salt,2,6.0,165.0
9996,2019-03-31,106,106188,107889,33,Cobs Popd Swt/Chlli &Sr/Cream Chips,2,7.6,110.0
9997,2019-06-02,106,106188,107890,27,WW Supreme Cheese Corn Chips,2,3.8,200.0
9998,2019-06-29,106,106188,107891,71,Twisties Cheese Burger,2,8.6,250.0
9999,2018-09-14,106,106234,108132,75,Cobs Popd Sea Salt Chips,2,7.6,110.0


Let's perform some basic text analysis on the `product_name` column. As a first step towards summarising the individual words, we strip punctuation and list the unique product names.

In [42]:
# create a function to remove punctuation in text
import string

def remove_punctuation(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text


products = transactions['product_name'].apply(remove_punctuation).unique()
products

array(['Natural Chip Compny SeaSalt', 'CCs Nacho Cheese',
       'Smiths Crinkle Cut Chips Chicken',
       'Smiths Chip Thinly SCreamOnion',
       'Kettle Tortilla ChpsHnyJlpno Chili',
       'Old El Paso Salsa Dip Tomato Mild',
       'Smiths Crinkle Chips Salt  Vinegar', 'Grain Waves Sweet Chilli',
       'Doritos Corn Chip Mexican Jalapeno',
       'Grain Waves Sour CreamChives', 'Kettle Sensations Siracha Lime',
       'Twisties Cheese', 'WW Crinkle Cut Chicken',
       'Thins Chips Light Tangy', 'CCs Original', 'Burger Rings',
       'NCC Sour Cream  Garden Chives',
       'Doritos Corn Chip Southern Chicken', 'Cheezels Cheese Box',
       'Smiths Crinkle Original', 'Infzns Crn Crnchers Tangy Gcamole',
       'Kettle Sea Salt And Vinegar', 'Smiths Chip Thinly Cut Original',
       'Kettle Original', 'Red Rock Deli Thai ChilliLime',
       'Pringles Sthrn FriedChicken', 'Pringles SweetSpcy BBQ',
       'Red Rock Deli SR Salsa  Mzzrlla', 'Thins Chips Originl saltd',
       'Red Ro
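One way to go from punctuation-free names to a summary of individual words is to split, explode, and count. The sketch below is an assumption about how that could look, not the notebook's own code. Unlike `remove_punctuation` above, it replaces punctuation with a space, so that a name such as `S/Cream&Onion` splits into separate words instead of fusing into `SCreamOnion`.

```python
import string
import pandas as pd

names = pd.Series([
    "Smiths Crinkle Cut Chips Chicken",
    "Smiths Chip Thinly S/Cream&Onion",
    "Old El Paso Salsa Dip Tomato Mild",
    "Smiths Crinkle Chips Salt & Vinegar",
])

# Map every punctuation character to a space
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

# One word per row, then count occurrences
words = names.str.translate(table).str.split().explode()
word_counts = words.value_counts()

print(word_counts.head())
```

On the full column, the most frequent words give a quick view of brands, flavours, and product types that appear in the data.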

In [19]:
# describing the dataset
transactions.describe()

Unnamed: 0,date,store_number,loyalty_card_number,transaction_id,product_number,product_quantity,total_sales,product_weight
count,10000,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,2018-12-29 21:29:05.280000,100.0,100399.93,99218.37,56.6,1.89,7.35,182.01
min,2018-07-01 00:00:00,1.0,1000.0,1.0,1.0,1.0,1.5,70.0
25%,2018-09-26 00:00:00,45.0,45134.0,41172.75,28.0,2.0,5.4,150.0
50%,2018-12-31 00:00:00,81.0,81357.0,81312.5,55.0,2.0,7.4,170.0
75%,2019-04-03 00:00:00,149.0,149247.0,148937.75,86.0,2.0,9.2,175.0
max,2019-06-30 00:00:00,272.0,2330211.0,270128.0,114.0,5.0,29.5,380.0
std,,72.34,75705.11,73720.5,33.06,0.39,2.6,63.99


In [20]:
# checking the duplicates
transactions[transactions.duplicated(keep=False)]

Unnamed: 0,date,store_number,loyalty_card_number,transaction_id,product_number,product_name,product_quantity,total_sales,product_weight


In [21]:
# dropping duplicates
transactions.drop_duplicates(inplace=True)

In [22]:
# checking the duplicates again
transactions.duplicated().sum()

0
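The three duplicate-related calls used above behave slightly differently, which a toy frame makes visible. The data below is invented for illustration only.

```python
import pandas as pd

# Toy data: the second and third rows are exact copies of each other
toy = pd.DataFrame({
    "transaction_id": [1, 2, 2, 3],
    "product_number": [5, 66, 66, 61],
    "product_quantity": [2, 3, 3, 2],
})

all_copies = toy[toy.duplicated(keep=False)]  # every row that has an exact twin
n_extra = toy.duplicated().sum()              # only the repeats after the first occurrence
deduped = toy.drop_duplicates()               # keeps the first occurrence of each row

print(len(all_copies), n_extra, len(deduped))
# 2 1 3
```

`keep=False` is useful for inspection because it shows the whole duplicate group, while the default counts only the rows that `drop_duplicates` would remove.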

In [23]:
transactions.nunique()

date                   364 
store_number           258 
loyalty_card_number    3104
transaction_id         9940
product_number         114 
product_name           114 
product_quantity       5   
total_sales            84  
product_weight         21  
dtype: int64
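The counts above show 9,940 distinct `transaction_id` values across 10,000 rows, so some IDs appear on more than one row even though no fully duplicated rows were found. One possible explanation is that a single transaction contains several products. The sketch below shows, on invented toy data, how such IDs could be listed.

```python
import pandas as pd

# Toy data: transaction 2 spans two rows with different products
toy = pd.DataFrame({
    "transaction_id": [1, 2, 2, 3],
    "product_number": [5, 66, 61, 69],
})

# Number of rows recorded under each transaction id
rows_per_txn = toy.groupby("transaction_id").size()

# Keep only ids that occur more than once
multi = rows_per_txn[rows_per_txn > 1]

print(multi)
```

Running the same grouping on `transactions` would indicate whether the repeated IDs correspond to multi-product transactions or need further checking.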

### Purchase behaviour dataset

Let's explore the purchase behaviour dataset.

In [24]:
# load the purchase behaviour dataset
purchase_behaviour = pd.read_csv('QVI_purchase_behaviour.csv')

In [25]:
# inspect the first few rows of the purchase behaviour dataset
purchase_behaviour.head()

Unnamed: 0,LYLTY_CARD_NBR,LIFESTAGE,PREMIUM_CUSTOMER
0,1000,YOUNG SINGLES/COUPLES,Premium
1,1002,YOUNG SINGLES/COUPLES,Mainstream
2,1003,YOUNG FAMILIES,Budget
3,1004,OLDER SINGLES/COUPLES,Mainstream
4,1005,MIDAGE SINGLES/COUPLES,Mainstream


In [26]:
# shape and columns of dataset
print("Purchase behaviour dataset shape:", purchase_behaviour.shape)
print("Purchase behaviour dataset columns:", purchase_behaviour.columns.tolist())

Purchase behaviour dataset shape: (72637, 3)
Purchase behaviour dataset columns: ['LYLTY_CARD_NBR', 'LIFESTAGE', 'PREMIUM_CUSTOMER']


In [27]:
# info about purchase behaviour dataset
purchase_behaviour.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   LYLTY_CARD_NBR    72637 non-null  int64 
 1   LIFESTAGE         72637 non-null  object
 2   PREMIUM_CUSTOMER  72637 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB


In [28]:
# rename columns for better readability
purchase_behaviour.rename(columns={'LYLTY_CARD_NBR': 'loyalty_card_number',
                                  'LIFESTAGE': 'lifestage',
                                  'PREMIUM_CUSTOMER': 'customer_type'}, inplace=True)
# checking the first few rows after renaming columns
purchase_behaviour.head()

Unnamed: 0,loyalty_card_number,lifestage,customer_type
0,1000,YOUNG SINGLES/COUPLES,Premium
1,1002,YOUNG SINGLES/COUPLES,Mainstream
2,1003,YOUNG FAMILIES,Budget
3,1004,OLDER SINGLES/COUPLES,Mainstream
4,1005,MIDAGE SINGLES/COUPLES,Mainstream


In [29]:
# checking for data types
purchase_behaviour.dtypes

loyalty_card_number    int64 
lifestage              object
customer_type          object
dtype: object

In [30]:
# count of unique values
purchase_behaviour.nunique()

loyalty_card_number    72637
lifestage              7    
customer_type          3    
dtype: int64
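The number of unique `loyalty_card_number` values equals the number of rows (72,637), which suggests one row per customer. A direct way to assert this is `Series.is_unique`; the sketch below uses a few rows copied from the table above, plus an artificially repeated row to show the failing case.

```python
import pandas as pd

customers = pd.DataFrame({
    "loyalty_card_number": [1000, 1002, 1003, 1004, 1005],
    "customer_type": ["Premium", "Mainstream", "Budget", "Mainstream", "Mainstream"],
})

# True when every card number occurs exactly once
key_is_unique = customers["loyalty_card_number"].is_unique

# Append a copy of the first row to simulate a repeated card number
with_repeat = pd.concat([customers, customers.iloc[[0]]], ignore_index=True)
repeat_is_unique = with_repeat["loyalty_card_number"].is_unique

print(key_is_unique, repeat_is_unique)
# True False
```

A unique key matters when this table is later matched to transactions, because a repeated card number would multiply the matching transaction rows.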

In [31]:
# count of customers in each lifestage
purchase_behaviour['lifestage'].value_counts()

lifestage
RETIREES                  14805
OLDER SINGLES/COUPLES     14609
YOUNG SINGLES/COUPLES     14441
OLDER FAMILIES            9780 
YOUNG FAMILIES            9178 
MIDAGE SINGLES/COUPLES    7275 
NEW FAMILIES              2549 
Name: count, dtype: int64

In [32]:
# Count of each customer type
purchase_behaviour['customer_type'].value_counts()

customer_type
Mainstream    29245
Budget        24470
Premium       18922
Name: count, dtype: int64
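Raw counts are easier to compare as shares. The sketch below recomputes proportions from the three counts printed above; the names `counts` and `shares` are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({"Mainstream": 29245, "Budget": 24470, "Premium": 18922}, name="count")

# Share of all customers in each type
shares = counts / counts.sum()

print(shares.round(3))
```

On the DataFrame itself, `purchase_behaviour['customer_type'].value_counts(normalize=True)` should give the same proportions in one call. Mainstream is the largest group at about 40% of customers.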

In [33]:
# checking for missing values
purchase_behaviour.isna().sum()

loyalty_card_number    0
lifestage              0
customer_type          0
dtype: int64

In [34]:
# checking for duplicates
purchase_behaviour.duplicated().sum()

0

In [35]:
# converting lifestage to lowercase
purchase_behaviour['lifestage'] = purchase_behaviour['lifestage'].str.lower()
purchase_behaviour.head()

Unnamed: 0,loyalty_card_number,lifestage,customer_type
0,1000,young singles/couples,Premium
1,1002,young singles/couples,Mainstream
2,1003,young families,Budget
3,1004,older singles/couples,Mainstream
4,1005,midage singles/couples,Mainstream
