# Quantium Data Analytics #

#### Analysis of Customer Segments and Chip purchasing behaviour


Importing the necessary Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Loading the data in the Pandas DataFrame

In [2]:
transaction_data=pd.read_csv('QVI_transaction_data.csv')
purchase_behaviour_data=pd.read_csv('QVI_purchase_behaviour.csv')

Now let's understand both datasets and look at their summaries.

In [3]:
transaction_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   DATE            264836 non-null  int64  
 1   STORE_NBR       264836 non-null  int64  
 2   LYLTY_CARD_NBR  264836 non-null  int64  
 3   TXN_ID          264836 non-null  int64  
 4   PROD_NBR        264836 non-null  int64  
 5   PROD_NAME       264836 non-null  object 
 6   PROD_QTY        264836 non-null  int64  
 7   TOT_SALES       264836 non-null  float64
dtypes: float64(1), int64(6), object(1)
memory usage: 16.2+ MB


In [4]:
transaction_data.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES
0,43390,1,1000,1,5,Natural Chip Compny SeaSalt175g,2,6.0
1,43599,1,1307,348,66,CCs Nacho Cheese 175g,3,6.3
2,43605,1,1343,383,61,Smiths Crinkle Cut Chips Chicken 170g,2,2.9
3,43329,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion 175g,5,15.0
4,43330,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili 150g,3,13.8


Observation:
The transaction dataset contains 264836 rows and 8 columns.
The only apparent problem is the format of DATE: it should be a date, but it is stored as an integer.

Converting Date from Integer to date format

In [5]:
# The values in the DATE column look like Excel serial date numbers, where dates are stored as the number of days since December 30, 1899.
transaction_data['DATE']=pd.to_datetime(transaction_data['DATE'], origin='1899-12-30', unit='D')
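
As a quick sanity check of the conversion, the sketch below applies the same call to the first three DATE values from the head() output. The `serials` and `converted` names are only illustrative and are not part of the notebook.

```python
import pandas as pd

# Illustrative sample: the first three DATE values shown by head() above.
serials = pd.Series([43390, 43599, 43605])

# Interpret each integer as a number of days counted from 1899-12-30.
converted = pd.to_datetime(serials, origin='1899-12-30', unit='D')

print(converted.dt.strftime('%Y-%m-%d').tolist())
# ['2018-10-17', '2019-05-14', '2019-05-20']
```

The results agree with the dates displayed after the conversion, which suggests the Excel serial date assumption is reasonable for this file.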

In [6]:
transaction_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   DATE            264836 non-null  datetime64[ns]
 1   STORE_NBR       264836 non-null  int64         
 2   LYLTY_CARD_NBR  264836 non-null  int64         
 3   TXN_ID          264836 non-null  int64         
 4   PROD_NBR        264836 non-null  int64         
 5   PROD_NAME       264836 non-null  object        
 6   PROD_QTY        264836 non-null  int64         
 7   TOT_SALES       264836 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB


In [7]:
transaction_data.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES
0,2018-10-17,1,1000,1,5,Natural Chip Compny SeaSalt175g,2,6.0
1,2019-05-14,1,1307,348,66,CCs Nacho Cheese 175g,3,6.3
2,2019-05-20,1,1343,383,61,Smiths Crinkle Cut Chips Chicken 170g,2,2.9
3,2018-08-17,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion 175g,5,15.0
4,2018-08-18,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili 150g,3,13.8


In [8]:
transaction_data.describe()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_QTY,TOT_SALES
count,264836,264836.0,264836.0,264836.0,264836.0,264836.0,264836.0
mean,2018-12-30 00:52:12.879215616,135.08011,135549.5,135158.3,56.583157,1.907309,7.3042
min,2018-07-01 00:00:00,1.0,1000.0,1.0,1.0,1.0,1.5
25%,2018-09-30 00:00:00,70.0,70021.0,67601.5,28.0,2.0,5.4
50%,2018-12-30 00:00:00,130.0,130357.5,135137.5,56.0,2.0,7.4
75%,2019-03-31 00:00:00,203.0,203094.2,202701.2,85.0,2.0,9.2
max,2019-06-30 00:00:00,272.0,2373711.0,2415841.0,114.0,200.0,650.0
std,,76.78418,80579.98,78133.03,32.826638,0.643654,3.083226


---

In [9]:
purchase_behaviour_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   LYLTY_CARD_NBR    72637 non-null  int64 
 1   LIFESTAGE         72637 non-null  object
 2   PREMIUM_CUSTOMER  72637 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB


In [10]:
purchase_behaviour_data.head()

Unnamed: 0,LYLTY_CARD_NBR,LIFESTAGE,PREMIUM_CUSTOMER
0,1000,YOUNG SINGLES/COUPLES,Premium
1,1002,YOUNG SINGLES/COUPLES,Mainstream
2,1003,YOUNG FAMILIES,Budget
3,1004,OLDER SINGLES/COUPLES,Mainstream
4,1005,MIDAGE SINGLES/COUPLES,Mainstream


Observation:
The purchase behaviour data has 3 features and only 72637 rows. All the features appear to be in the correct format.

---

Now let's first focus on transaction_data.

Product Name

In [11]:
transaction_data['PROD_NAME'].describe()

count                                     264836
unique                                       114
top       Kettle Mozzarella   Basil & Pesto 175g
freq                                        3304
Name: PROD_NAME, dtype: object

In [12]:
transaction_data['PROD_NAME'].unique()

array(['Natural Chip        Compny SeaSalt175g',
       'CCs Nacho Cheese    175g',
       'Smiths Crinkle Cut  Chips Chicken 170g',
       'Smiths Chip Thinly  S/Cream&Onion 175g',
       'Kettle Tortilla ChpsHny&Jlpno Chili 150g',
       'Old El Paso Salsa   Dip Tomato Mild 300g',
       'Smiths Crinkle Chips Salt & Vinegar 330g',
       'Grain Waves         Sweet Chilli 210g',
       'Doritos Corn Chip Mexican Jalapeno 150g',
       'Grain Waves Sour    Cream&Chives 210G',
       'Kettle Sensations   Siracha Lime 150g',
       'Twisties Cheese     270g', 'WW Crinkle Cut      Chicken 175g',
       'Thins Chips Light&  Tangy 175g', 'CCs Original 175g',
       'Burger Rings 220g', 'NCC Sour Cream &    Garden Chives 175g',
       'Doritos Corn Chip Southern Chicken 150g',
       'Cheezels Cheese Box 125g', 'Smiths Crinkle      Original 330g',
       'Infzns Crn Crnchers Tangy Gcamole 110g',
       'Kettle Sea Salt     And Vinegar 175g',
       'Smiths Chip Thinly  Cut Original 175g', 'K

Here we can see that this dataset is not entirely about chips, as some salsa products appear as well. Since we are only dealing with chips, anything that mentions salsa is removed.

In [13]:
transaction_data['SALSA']=transaction_data['PROD_NAME'].str.lower().str.contains('salsa', na=False)
transaction_data['SALSA'].sum()

np.int64(18094)

In [14]:
ss=len(transaction_data['PROD_NAME'])
ss

264836

Of the 264836 rows, 18094 have a PROD_NAME that mentions salsa, so we are going to remove them.

In [15]:
transaction_data=transaction_data[~transaction_data['SALSA']].copy()
transaction_data=transaction_data.drop('SALSA',axis=1)
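
The same filter can be written without adding and then dropping a helper column. The following is a minimal sketch on a made-up three-row frame; `sample`, `is_salsa` and `chips_only` are illustrative names, not objects from this notebook.

```python
import pandas as pd

# Hypothetical miniature of the transaction table.
sample = pd.DataFrame({
    'PROD_NAME': ['Old El Paso Salsa   Dip Tomato Mild 300g',
                  'CCs Nacho Cheese    175g',
                  'Burger Rings 220g'],
    'PROD_QTY': [1, 3, 2],
})

# Case-insensitive match; na=False treats missing names as non-salsa.
is_salsa = sample['PROD_NAME'].str.contains('salsa', case=False, na=False)

# ~ negates the boolean mask, so only the non-salsa rows are kept.
chips_only = sample[~is_salsa].copy()

print(len(sample), int(is_salsa.sum()), len(chips_only))
# 3 1 2
```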

In [16]:
transaction_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246742 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   DATE            246742 non-null  datetime64[ns]
 1   STORE_NBR       246742 non-null  int64         
 2   LYLTY_CARD_NBR  246742 non-null  int64         
 3   TXN_ID          246742 non-null  int64         
 4   PROD_NBR        246742 non-null  int64         
 5   PROD_NAME       246742 non-null  object        
 6   PROD_QTY        246742 non-null  int64         
 7   TOT_SALES       246742 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.9+ MB


The new number of rows is 246742.

In [29]:
transaction_data['PROD_NAME'].describe()

count                                 246742
unique                                   105
top       Kettle Mozzarella   Basil  Pesto g
freq                                    3304
Name: PROD_NAME, dtype: object

In [17]:
top_product=transaction_data['PROD_NAME'].value_counts().index[0]

In [18]:
top_product

'Kettle Mozzarella   Basil & Pesto 175g'

Hence, from the summary of PROD_NAME, we find that among the 246742 rows there are only 105 unique product names,
and the most frequent one, with 3304 occurrences, is Kettle Mozzarella Basil & Pesto 175g.

The product names also contain special characters and digits.
We don't need those, so we are going to remove them.

In [19]:
#Removing digits
transaction_data['PROD_NAME']=transaction_data['PROD_NAME'].str.replace(r'\d+','',regex=True)

# Removing any other special signs
transaction_data['PROD_NAME']=transaction_data['PROD_NAME'].str.replace(r'[^a-zA-Z\s]','', regex=True)
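
To see what the two replacements do, here is a small sketch on two of the product names listed earlier. The `pack_size` extraction is an optional extra that the notebook does not perform: it is shown only because stripping the digits discards the pack size, and capturing it first is one possible way to keep it.

```python
import pandas as pd

names = pd.Series(['Natural Chip        Compny SeaSalt175g',
                   'Smiths Chip Thinly  S/Cream&Onion 175g'])

# Optional and hypothetical: capture the first run of digits before removing it.
pack_size = names.str.extract(r'(\d+)', expand=False).astype(int)

# Same two replacements as in the cell above: digits first, then anything
# that is neither a letter nor whitespace.
cleaned = (names.str.replace(r'\d+', '', regex=True)
                .str.replace(r'[^a-zA-Z\s]', '', regex=True))

print(cleaned.tolist())
print(pack_size.tolist())
```

Only the letter g of the original unit survives in the cleaned names, which is why a trailing g shows up in the output that follows.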

In [20]:
transaction_data.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES
0,2018-10-17,1,1000,1,5,Natural Chip Compny SeaSaltg,2,6.0
1,2019-05-14,1,1307,348,66,CCs Nacho Cheese g,3,6.3
2,2019-05-20,1,1343,383,61,Smiths Crinkle Cut Chips Chicken g,2,2.9
3,2018-08-17,2,2373,974,69,Smiths Chip Thinly SCreamOnion g,5,15.0
4,2018-08-18,2,2426,1038,108,Kettle Tortilla ChpsHnyJlpno Chili g,3,13.8


Here I notice that PROD_NAME now ends with g (probably what is left of the unit of measurement that was originally in the name).
Let's first validate whether it is present in every row of the column.

In [21]:
count_ending_g=transaction_data['PROD_NAME'].str.endswith('g').sum()
count_ending_g



np.int64(237421)

In [22]:
total_value=len(transaction_data)
total_value

246742

So of the 246742 rows, 237421 end with that g.
Before performing any other manipulation, we first have to understand what the remaining rows look like.


Before that, let's first sort all the product names by their frequency.

In [23]:
frequency_map=transaction_data['PROD_NAME'].value_counts()
frequency_map

PROD_NAME
Kettle Mozzarella   Basil  Pesto g       3304
Kettle Tortilla ChpsHnyJlpno Chili g     3296
Cobs Popd SwtChlli SrCream Chips g       3269
Tyrrells Crisps     Ched  Chives g       3268
Cobs Popd Sea Salt  Chips g              3265
                                         ... 
Sunbites Whlegrn    Crisps FrchOnin g    1432
RRD Pc Sea Salt     g                    1431
NCC Sour Cream     Garden Chives g       1419
French Fries Potato Chips g              1418
WW Crinkle Cut      Original g           1410
Name: count, Length: 105, dtype: int64

In [24]:
transaction_data['PROD_freq']=transaction_data['PROD_NAME'].map(frequency_map)
transaction_data.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,PROD_freq
0,2018-10-17,1,1000,1,5,Natural Chip Compny SeaSaltg,2,6.0,1468
1,2019-05-14,1,1307,348,66,CCs Nacho Cheese g,3,6.3,1498
2,2019-05-20,1,1343,383,61,Smiths Crinkle Cut Chips Chicken g,2,2.9,1484
3,2018-08-17,2,2373,974,69,Smiths Chip Thinly SCreamOnion g,5,15.0,1473
4,2018-08-18,2,2426,1038,108,Kettle Tortilla ChpsHnyJlpno Chili g,3,13.8,3296
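
The frequency column can also be built in one step with groupby and transform. The sketch below compares both approaches on a made-up column; `sample`, `via_map` and `via_transform` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({'PROD_NAME': ['A', 'B', 'A', 'A', 'B', 'C']})

# Approach used above: build a name -> count lookup, then map it onto each row.
via_map = sample['PROD_NAME'].map(sample['PROD_NAME'].value_counts())

# Alternative: transform('count') broadcasts each group's size back to its rows.
via_transform = sample.groupby('PROD_NAME')['PROD_NAME'].transform('count')

print(via_map.tolist())
# [3, 2, 3, 3, 2, 1]
```

Both give the same per-row counts, so the choice is mostly a matter of taste.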


In [25]:
transaction_data_sorted=transaction_data.sort_values('PROD_freq', ascending=False)
transaction_data_sorted.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,PROD_freq
137549,2019-06-06,34,34057,31150,102,Kettle Mozzarella Basil Pesto g,2,10.8,3304
149057,2019-03-04,245,245223,247682,102,Kettle Mozzarella Basil Pesto g,2,10.8,3304
91915,2019-06-25,160,160226,161580,102,Kettle Mozzarella Basil Pesto g,2,10.8,3304
37807,2019-04-10,65,65122,62177,102,Kettle Mozzarella Basil Pesto g,1,5.4,3304
245585,2018-09-22,91,91070,89505,102,Kettle Mozzarella Basil Pesto g,2,10.8,3304


In [26]:
transaction_data.tail()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,PROD_freq
264831,2019-03-09,272,272319,270088,89,Kettle Sweet Chilli And Sour Cream g,2,10.8,3200
264832,2018-08-13,272,272358,270154,74,Tostitos Splash Of Lime g,1,4.4,3252
264833,2018-11-06,272,272379,270187,51,Doritos Mexicana g,2,8.8,3115
264834,2018-12-27,272,272379,270188,42,Doritos Corn Chip Mexican Jalapeno g,2,7.8,3204
264835,2018-09-22,272,272380,270189,74,Tostitos Splash Of Lime g,2,8.8,3252


In [27]:
transaction_data_sorted.drop('PROD_freq', axis=1)

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES
137549,2019-06-06,34,34057,31150,102,Kettle Mozzarella Basil Pesto g,2,10.8
149057,2019-03-04,245,245223,247682,102,Kettle Mozzarella Basil Pesto g,2,10.8
91915,2019-06-25,160,160226,161580,102,Kettle Mozzarella Basil Pesto g,2,10.8
37807,2019-04-10,65,65122,62177,102,Kettle Mozzarella Basil Pesto g,1,5.4
245585,2018-09-22,91,91070,89505,102,Kettle Mozzarella Basil Pesto g,2,10.8
...,...,...,...,...,...,...,...,...
168217,2019-04-20,225,225142,225521,72,WW Crinkle Cut Original g,2,3.4
76369,2019-03-26,160,160161,161150,72,WW Crinkle Cut Original g,2,3.4
230584,2018-11-14,53,53110,47364,72,WW Crinkle Cut Original g,1,1.7
65016,2019-04-20,178,178228,178998,72,WW Crinkle Cut Original g,2,3.4


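transaction_data_sorted is our new DataFrame, with rows ordered from the most frequent product name to the least frequent. Note that the result of the drop above is only displayed, not assigned, so transaction_data_sorted still carries the PROD_freq column.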
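
A short sketch of how drop behaves may help here: it returns a new DataFrame and leaves the original untouched unless the result is assigned. The frame and the `without_freq` name are made up for illustration.

```python
import pandas as pd

sample = pd.DataFrame({'PROD_NAME': ['A', 'B'], 'PROD_freq': [2, 1]})

# drop() returns a new DataFrame; `sample` itself is not modified.
without_freq = sample.drop(columns='PROD_freq')

print(list(sample.columns))
# ['PROD_NAME', 'PROD_freq']
print(list(without_freq.columns))
# ['PROD_NAME']
```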

---

Now let's see what the remaining rows, those not ending with g, look like.

In [40]:
not_ending_with_g= ~transaction_data['PROD_NAME'].str.endswith('g')

In [42]:
transaction_data[not_ending_with_g]['PROD_NAME'].unique()

array(['Grain Waves Sour    CreamChives G',
       'Red Rock Deli Sp    Salt  Truffle G',
       'Smiths Thinly       Swt ChliSCreamG', 'Kettle g Swt Pot Sea Salt'],
      dtype=object)

Of the product names that do not end with a lowercase 'g', only one, 'Kettle g Swt Pot Sea Salt', has no g at the end at all.
The other three end with an uppercase 'G', so the check has to be made case-insensitive.

In [44]:
Name_ending_with_g=transaction_data_sorted['PROD_NAME'].str.lower().str.endswith('g').sum()
Name_ending_with_g

np.int64(243485)

So of the 246742 rows, 243485 end with either 'G' or 'g'. Let's validate that the remaining rows are all 'Kettle g Swt Pot Sea Salt'.

In [47]:
(transaction_data_sorted['PROD_NAME']=='Kettle g Swt Pot Sea Salt').sum()

np.int64(3257)

In [48]:
243485+3257


246742

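Now let's remove the trailing g's from the product names. (The g inside 'Kettle g Swt Pot Sea Salt' is not at the end, so the pattern below leaves it in place.)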

In [51]:
transaction_data_sorted['PROD_NAME']=transaction_data_sorted['PROD_NAME'].str.replace(r'\s*[gG]$','',regex=True)
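
The effect of this pattern can be checked on a few simplified names (single spaces, for readability). The second replacement is a hypothetical extra step that is not part of this notebook: it shows one way a standalone g between two words could be removed as well.

```python
import pandas as pd

names = pd.Series(['CCs Nacho Cheese g',
                   'Natural Chip Compny SeaSaltg',
                   'Grain Waves Sour CreamChives G',
                   'Kettle g Swt Pot Sea Salt'])

# Same pattern as the cell above: optional whitespace plus a final g or G.
stripped = names.str.replace(r'\s*[gG]$', '', regex=True)

# Hypothetical extra step: collapse a standalone g/G between two words
# into a single space.
fully_stripped = stripped.str.replace(r'\s+[gG]\s+', ' ', regex=True)

print(stripped.tolist())
print(fully_stripped.tolist())
```

After the first replacement the Kettle name is unchanged; only the extra step turns it into 'Kettle Swt Pot Sea Salt'.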

In [55]:
transaction_data_sorted.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,PROD_freq
137549,2019-06-06,34,34057,31150,102,Kettle Mozzarella Basil Pesto,2,10.8,3304
149057,2019-03-04,245,245223,247682,102,Kettle Mozzarella Basil Pesto,2,10.8,3304
91915,2019-06-25,160,160226,161580,102,Kettle Mozzarella Basil Pesto,2,10.8,3304
37807,2019-04-10,65,65122,62177,102,Kettle Mozzarella Basil Pesto,1,5.4,3304
245585,2018-09-22,91,91070,89505,102,Kettle Mozzarella Basil Pesto,2,10.8,3304


---

### SORTED DATA SUMMARY, STATISTICS

In [56]:
transaction_data_sorted.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246742 entries, 137549 to 68668
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   DATE            246742 non-null  datetime64[ns]
 1   STORE_NBR       246742 non-null  int64         
 2   LYLTY_CARD_NBR  246742 non-null  int64         
 3   TXN_ID          246742 non-null  int64         
 4   PROD_NBR        246742 non-null  int64         
 5   PROD_NAME       246742 non-null  object        
 6   PROD_QTY        246742 non-null  int64         
 7   TOT_SALES       246742 non-null  float64       
 8   PROD_freq       246742 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(6), object(1)
memory usage: 18.8+ MB


In [58]:
transaction_data_sorted.describe()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_QTY,TOT_SALES,PROD_freq
count,246742,246742.0,246742.0,246742.0,246742.0,246742.0,246742.0,246742.0
mean,2018-12-30 01:19:01.211467520,135.051098,135531.0,135131.1,56.351789,1.908062,7.321322,2651.619359
min,2018-07-01 00:00:00,1.0,1000.0,1.0,1.0,1.0,1.7,1410.0
25%,2018-09-30 00:00:00,70.0,70015.0,67569.25,26.0,2.0,5.8,1516.0
50%,2018-12-30 00:00:00,130.0,130367.0,135183.0,53.0,2.0,7.4,3134.0
75%,2019-03-31 00:00:00,203.0,203084.0,202653.8,87.0,2.0,8.8,3174.0
max,2019-06-30 00:00:00,272.0,2373711.0,2415841.0,114.0,200.0,650.0,3304.0
std,,76.787096,80715.28,78147.72,33.695428,0.659831,3.077828,777.492767


#### Observation
1. Neither dataset has any null values.
2. No std is reported for DATE, as it does not make sense for a date column.
3. In PROD_QTY we observe a maximum of 200, which seems unusual for chips.
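
The first observation was read off the non-null counts in info(). A more direct check is to count missing values per column, sketched here on a made-up frame that deliberately contains one missing name so the output is not all zeros.

```python
import pandas as pd

# Hypothetical frame with one missing PROD_NAME, for illustration only.
sample = pd.DataFrame({'PROD_NAME': ['A', None, 'C'], 'PROD_QTY': [2, 1, 5]})

# Number of missing values in each column; all zeros would mean no nulls.
null_counts = sample.isna().sum()

print(null_counts.to_dict())
# {'PROD_NAME': 1, 'PROD_QTY': 0}
```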

In [61]:
transaction_data_sorted[transaction_data_sorted['PROD_QTY']==200.00]

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,PROD_freq
69762,2018-08-19,226,226000,226201,4,Dorito Corn Chp Supreme,200,650.0,3185
69763,2019-05-20,226,226000,226210,4,Dorito Corn Chp Supreme,200,650.0,3185


So there were two occasions where the maximum product quantity of 200 was sold.
In both cases the product was Dorito Corn Chp Supreme.

Interestingly, the LYLTY_CARD_NBR is the same and both purchases were made at the same store, so both were made by the same customer. Let's check whether this customer has any other transactions.

In [74]:
transaction_data_sorted[transaction_data_sorted['LYLTY_CARD_NBR']== 226000]

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,PROD_freq
69762,2018-08-19,226,226000,226201,4,Dorito Corn Chp Supreme,200,650.0,3185
69763,2019-05-20,226,226000,226210,4,Dorito Corn Chp Supreme,200,650.0,3185


No, this customer has made only these two purchases, both at the maximum quantity, and has not bought anything else. This does not look like an ordinary retail customer;
they are probably buying chips for commercial purposes, so we will remove this loyalty card number from further analysis.

In [76]:
transaction_data_sorted=transaction_data_sorted[transaction_data_sorted['LYLTY_CARD_NBR']!= 226000].copy()
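
If more than one such card had to be excluded, the same idea could be generalised with isin. This is only a sketch: the `QTY_THRESHOLD` value of 100 is an assumption made for illustration, whereas the notebook found the outlier by inspecting the maximum of PROD_QTY.

```python
import pandas as pd

# Hypothetical miniature of the transaction table.
sample = pd.DataFrame({
    'LYLTY_CARD_NBR': [226000, 1000, 226000, 1307],
    'PROD_QTY': [200, 2, 200, 3],
})

# Assumed cut-off for a bulk purchase; not a value taken from the notebook.
QTY_THRESHOLD = 100

# Cards with at least one bulk purchase.
bulk_cards = sample.loc[sample['PROD_QTY'] >= QTY_THRESHOLD, 'LYLTY_CARD_NBR'].unique()

# Remove every transaction made with those cards, not only the bulk rows.
filtered = sample[~sample['LYLTY_CARD_NBR'].isin(bulk_cards)].copy()

print(bulk_cards.tolist(), int(filtered['PROD_QTY'].max()))
# [226000] 3
```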

In [78]:
transaction_data_sorted.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246740 entries, 137549 to 68668
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   DATE            246740 non-null  datetime64[ns]
 1   STORE_NBR       246740 non-null  int64         
 2   LYLTY_CARD_NBR  246740 non-null  int64         
 3   TXN_ID          246740 non-null  int64         
 4   PROD_NBR        246740 non-null  int64         
 5   PROD_NAME       246740 non-null  object        
 6   PROD_QTY        246740 non-null  int64         
 7   TOT_SALES       246740 non-null  float64       
 8   PROD_freq       246740 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(6), object(1)
memory usage: 18.8+ MB


The two rows with the maximum purchase quantity are now removed.
Now let's check the summary statistics again.

In [81]:
transaction_data_sorted.describe()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_QTY,TOT_SALES,PROD_freq
count,246740,246740.0,246740.0,246740.0,246740.0,246740.0,246740.0,246740.0
mean,2018-12-30 01:18:58.448569600,135.050361,135530.3,135130.4,56.352213,1.906456,7.316113,2651.615036
min,2018-07-01 00:00:00,1.0,1000.0,1.0,1.0,1.0,1.7,1410.0
25%,2018-09-30 00:00:00,70.0,70015.0,67568.75,26.0,2.0,5.8,1516.0
50%,2018-12-30 00:00:00,130.0,130367.0,135181.5,53.0,2.0,7.4,3134.0
75%,2019-03-31 00:00:00,203.0,203083.2,202652.2,87.0,2.0,8.8,3174.0
max,2019-06-30 00:00:00,272.0,2373711.0,2415841.0,114.0,5.0,29.5,3304.0
std,,76.786971,80715.2,78147.6,33.695235,0.342499,2.474897,777.494435


Hence the max product quantity is now 5, which seems reasonable.

---

### Let's group the number of transactions by date