# Quantium Data Analytics #

#### Analysis of Customer Segments and Chip purchasing behaviour


Importing the necessary Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Loading the data into pandas DataFrames

In [2]:
transaction_data=pd.read_csv('QVI_transaction_data.csv')
purchase_behaviour_data=pd.read_csv('QVI_purchase_behaviour.csv')

Now let's look at both datasets and their summaries.

In [3]:
transaction_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   DATE            264836 non-null  int64  
 1   STORE_NBR       264836 non-null  int64  
 2   LYLTY_CARD_NBR  264836 non-null  int64  
 3   TXN_ID          264836 non-null  int64  
 4   PROD_NBR        264836 non-null  int64  
 5   PROD_NAME       264836 non-null  object 
 6   PROD_QTY        264836 non-null  int64  
 7   TOT_SALES       264836 non-null  float64
dtypes: float64(1), int64(6), object(1)
memory usage: 16.2+ MB


In [4]:
transaction_data.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES
0,43390,1,1000,1,5,Natural Chip Compny SeaSalt175g,2,6.0
1,43599,1,1307,348,66,CCs Nacho Cheese 175g,3,6.3
2,43605,1,1343,383,61,Smiths Crinkle Cut Chips Chicken 170g,2,2.9
3,43329,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion 175g,5,15.0
4,43330,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili 150g,3,13.8


Observation:
The transaction dataset contains 264836 rows and 8 columns.
The only apparent problem in this dataset is the format of DATE: it is stored as an integer but should be a date.

Converting Date from Integer to date format

In [5]:
# The values in the DATE column look like Excel serial dates, stored as the number of days since December 30, 1899.
transaction_data['DATE']=pd.to_datetime(transaction_data['DATE'], origin='1899-12-30', unit='D')
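As a quick sanity check on the origin used above, the sketch below converts the first serial value shown by `head()` (43390) on its own. The names `serial` and `converted` are illustrative and not part of the notebook.

```python
import pandas as pd

# 43390 is the DATE value of the first row shown by head() above.
serial = 43390

# Count whole days forward from 1899-12-30, the origin assumed in the notebook.
converted = pd.to_datetime(serial, origin="1899-12-30", unit="D")

print(converted.date())  # 2018-10-17
```

The result agrees with the first row of the converted table, which supports reading the column as Excel serial dates.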

In [6]:
transaction_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   DATE            264836 non-null  datetime64[ns]
 1   STORE_NBR       264836 non-null  int64         
 2   LYLTY_CARD_NBR  264836 non-null  int64         
 3   TXN_ID          264836 non-null  int64         
 4   PROD_NBR        264836 non-null  int64         
 5   PROD_NAME       264836 non-null  object        
 6   PROD_QTY        264836 non-null  int64         
 7   TOT_SALES       264836 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB


In [7]:
transaction_data.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES
0,2018-10-17,1,1000,1,5,Natural Chip Compny SeaSalt175g,2,6.0
1,2019-05-14,1,1307,348,66,CCs Nacho Cheese 175g,3,6.3
2,2019-05-20,1,1343,383,61,Smiths Crinkle Cut Chips Chicken 170g,2,2.9
3,2018-08-17,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion 175g,5,15.0
4,2018-08-18,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili 150g,3,13.8


In [8]:
transaction_data.describe()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_QTY,TOT_SALES
count,264836,264836.0,264836.0,264836.0,264836.0,264836.0,264836.0
mean,2018-12-30 00:52:12.879215616,135.08011,135549.5,135158.3,56.583157,1.907309,7.3042
min,2018-07-01 00:00:00,1.0,1000.0,1.0,1.0,1.0,1.5
25%,2018-09-30 00:00:00,70.0,70021.0,67601.5,28.0,2.0,5.4
50%,2018-12-30 00:00:00,130.0,130357.5,135137.5,56.0,2.0,7.4
75%,2019-03-31 00:00:00,203.0,203094.2,202701.2,85.0,2.0,9.2
max,2019-06-30 00:00:00,272.0,2373711.0,2415841.0,114.0,200.0,650.0
std,,76.78418,80579.98,78133.03,32.826638,0.643654,3.083226


---

In [9]:
purchase_behaviour_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   LYLTY_CARD_NBR    72637 non-null  int64 
 1   LIFESTAGE         72637 non-null  object
 2   PREMIUM_CUSTOMER  72637 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB


In [10]:
purchase_behaviour_data.head()

Unnamed: 0,LYLTY_CARD_NBR,LIFESTAGE,PREMIUM_CUSTOMER
0,1000,YOUNG SINGLES/COUPLES,Premium
1,1002,YOUNG SINGLES/COUPLES,Mainstream
2,1003,YOUNG FAMILIES,Budget
3,1004,OLDER SINGLES/COUPLES,Mainstream
4,1005,MIDAGE SINGLES/COUPLES,Mainstream


Observation:
The purchase behaviour data has 3 columns and 72637 rows. All of the columns appear to be in the correct format.

---

Now let's focus first on transaction_data.

Product Name

In [11]:
transaction_data['PROD_NAME'].describe()

count                                     264836
unique                                       114
top       Kettle Mozzarella   Basil & Pesto 175g
freq                                        3304
Name: PROD_NAME, dtype: object

In [12]:
transaction_data['PROD_NAME'].unique()

array(['Natural Chip        Compny SeaSalt175g',
       'CCs Nacho Cheese    175g',
       'Smiths Crinkle Cut  Chips Chicken 170g',
       'Smiths Chip Thinly  S/Cream&Onion 175g',
       'Kettle Tortilla ChpsHny&Jlpno Chili 150g',
       'Old El Paso Salsa   Dip Tomato Mild 300g',
       'Smiths Crinkle Chips Salt & Vinegar 330g',
       'Grain Waves         Sweet Chilli 210g',
       'Doritos Corn Chip Mexican Jalapeno 150g',
       'Grain Waves Sour    Cream&Chives 210G',
       'Kettle Sensations   Siracha Lime 150g',
       'Twisties Cheese     270g', 'WW Crinkle Cut      Chicken 175g',
       'Thins Chips Light&  Tangy 175g', 'CCs Original 175g',
       'Burger Rings 220g', 'NCC Sour Cream &    Garden Chives 175g',
       'Doritos Corn Chip Southern Chicken 150g',
       'Cheezels Cheese Box 125g', 'Smiths Crinkle      Original 330g',
       'Infzns Crn Crnchers Tangy Gcamole 110g',
       'Kettle Sea Salt     And Vinegar 175g',
       'Smiths Chip Thinly  Cut Original 175g', 'K

Here we can see that this dataset is not entirely about chips: some salsa products appear as well. Since we are only dealing with chips, any product whose name mentions salsa is removed.

In [13]:
transaction_data['SALSA']=transaction_data['PROD_NAME'].str.lower().str.contains('salsa', na=False)
transaction_data['SALSA'].sum()

np.int64(18094)

In [14]:
ss=len(transaction_data['PROD_NAME'])
ss

264836

Of the 264836 rows, 18094 have a PROD_NAME that mentions salsa, so we are going to remove them.

In [15]:
transaction_data=transaction_data[~transaction_data['SALSA']].copy()
transaction_data=transaction_data.drop('SALSA',axis=1)
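The same filter can be sketched on a small toy Series without creating a helper column. The first three names are taken from the output above; the last one is a made-up example added only to show case-insensitive matching.

```python
import pandas as pd

names = pd.Series([
    "CCs Nacho Cheese    175g",
    "Old El Paso Salsa   Dip Tomato Mild 300g",
    "Burger Rings 220g",
    "Example SALSA Dip 300g",  # hypothetical name, not from the dataset
])

# case=False makes the match case-insensitive, so .str.lower() is not needed.
is_salsa = names.str.contains("salsa", case=False, na=False)

# ~ negates the boolean mask, keeping only the non-salsa rows.
chips_only = names[~is_salsa]

print(int(is_salsa.sum()), "salsa rows dropped;", len(chips_only), "rows kept")
```

Matching on the word "salsa" is a heuristic: it would also drop a chip product that merely has salsa as a flavour, so the matched names are worth a quick look before filtering.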

In [16]:
transaction_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246742 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   DATE            246742 non-null  datetime64[ns]
 1   STORE_NBR       246742 non-null  int64         
 2   LYLTY_CARD_NBR  246742 non-null  int64         
 3   TXN_ID          246742 non-null  int64         
 4   PROD_NBR        246742 non-null  int64         
 5   PROD_NAME       246742 non-null  object        
 6   PROD_QTY        246742 non-null  int64         
 7   TOT_SALES       246742 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.9+ MB


The new number of rows is 246742.

In [17]:
transaction_data['PROD_NAME'].describe()

count                                     246742
unique                                       105
top       Kettle Mozzarella   Basil & Pesto 175g
freq                                        3304
Name: PROD_NAME, dtype: object

In [18]:
top_product=transaction_data['PROD_NAME'].value_counts().index[0]

In [19]:
top_product

'Kettle Mozzarella   Basil & Pesto 175g'

From the summary of PROD_NAME, the 246742 remaining rows contain only 105 unique product names,
and the most frequent of them, with 3304 rows, is Kettle Mozzarella Basil & Pesto 175g.

Here we noticed that the product names contain pack sizes (digits followed by a unit) along with some special characters.
We don't need the sizes inside the name, so we are going to remove them.

But let's first store the sizes in a separate column, as they might be useful later.

Before doing that, let's check whether any product name does not end with the unit g.

In [20]:
not_ending_with_g=~transaction_data['PROD_NAME'].str.endswith('g')

In [21]:
transaction_data[not_ending_with_g]['PROD_NAME'].unique()

array(['Grain Waves Sour    Cream&Chives 210G',
       'Red Rock Deli Sp    Salt & Truffle 150G',
       'Smiths Thinly       Swt Chli&S/Cream175G',
       'Kettle 135g Swt Pot Sea Salt'], dtype=object)

So there are 4 products that do not end with g: three of them end with an uppercase G, and the Kettle Sea Salt one has the size in the middle of the name, which makes things interesting.

In [22]:
transaction_data['SIZES']=transaction_data['PROD_NAME'].str.extract(r'(\d+(?:\.\d+)?\s*(?i:g|kg|ml|l|oz))')
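To see how this pattern behaves on the awkward names found above, here is a minimal sketch on three of them. The numeric conversion at the end is an extra step that is not in the notebook, and it assumes every pack size is given in grams.

```python
import pandas as pd

names = pd.Series([
    "Natural Chip        Compny SeaSalt175g",  # size glued to the last word
    "Grain Waves Sour    Cream&Chives 210G",   # uppercase unit
    "Kettle 135g Swt Pot Sea Salt",            # size in the middle of the name
])

# Same pattern as the notebook: digits, optional decimals, optional space, unit.
size_pattern = r"(\d+(?:\.\d+)?\s*(?i:g|kg|ml|l|oz))"

# expand=False returns a Series instead of a one-column DataFrame.
sizes = names.str.extract(size_pattern, expand=False)

# Assumption: all units are grams, so the leading digits can be read as grams.
size_grams = sizes.str.extract(r"(\d+)", expand=False).astype(int)

print(sizes.tolist())       # ['175g', '210G', '135g']
print(size_grams.tolist())  # [175, 210, 135]
```

A numeric column makes later comparisons by pack size simpler than the mixed-case strings `175g` and `175G`.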

In [23]:
transaction_data.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES
0,2018-10-17,1,1000,1,5,Natural Chip Compny SeaSalt175g,2,6.0,175g
1,2019-05-14,1,1307,348,66,CCs Nacho Cheese 175g,3,6.3,175g
2,2019-05-20,1,1343,383,61,Smiths Crinkle Cut Chips Chicken 170g,2,2.9,170g
3,2018-08-17,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion 175g,5,15.0,175g
4,2018-08-18,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili 150g,3,13.8,150g


In [24]:
transaction_data['SIZES']=transaction_data['SIZES'].copy()

---

---

Exception Case Handling

In [25]:
transaction_data[transaction_data['PROD_NAME']=='Kettle 135g Swt Pot Sea Salt']

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES
65,2019-05-20,83,83008,82099,63,Kettle 135g Swt Pot Sea Salt,2,8.4,135g
153,2019-05-17,208,208139,206906,63,Kettle 135g Swt Pot Sea Salt,1,4.2,135g
174,2018-08-20,237,237227,241132,63,Kettle 135g Swt Pot Sea Salt,2,8.4,135g
177,2019-05-17,243,243070,246706,63,Kettle 135g Swt Pot Sea Salt,1,4.2,135g
348,2018-10-26,7,7077,6604,63,Kettle 135g Swt Pot Sea Salt,2,8.4,135g
...,...,...,...,...,...,...,...,...,...
264564,2018-10-08,260,260240,259480,63,Kettle 135g Swt Pot Sea Salt,2,8.4,135g
264574,2019-06-12,261,261035,259860,63,Kettle 135g Swt Pot Sea Salt,2,8.4,135g
264725,2018-07-20,266,266413,264246,63,Kettle 135g Swt Pot Sea Salt,1,4.2,135g
264767,2019-06-08,269,269133,265839,63,Kettle 135g Swt Pot Sea Salt,2,8.4,135g


In [26]:
transaction_data[transaction_data['PROD_NAME']=='Smiths Thinly       Swt Chli&S/Cream175G']

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES
35,2018-08-19,51,51100,46803,37,Smiths Thinly Swt Chli&S/Cream175G,1,3.0,175G
421,2018-08-07,13,13010,11152,37,Smiths Thinly Swt Chli&S/Cream175G,2,6.0,175G
428,2019-06-25,13,13010,11159,37,Smiths Thinly Swt Chli&S/Cream175G,2,6.0,175G
527,2018-08-25,20,20329,17313,37,Smiths Thinly Swt Chli&S/Cream175G,1,3.0,175G
879,2018-07-13,45,45126,41112,37,Smiths Thinly Swt Chli&S/Cream175G,2,6.0,175G
...,...,...,...,...,...,...,...,...,...
263054,2019-04-25,195,195320,195230,37,Smiths Thinly Swt Chli&S/Cream175G,1,3.0,175G
263310,2019-03-17,205,205252,204325,37,Smiths Thinly Swt Chli&S/Cream175G,2,6.0,175G
263317,2018-07-01,205,205430,204503,37,Smiths Thinly Swt Chli&S/Cream175G,1,3.0,175G
263429,2019-03-03,213,213088,212425,37,Smiths Thinly Swt Chli&S/Cream175G,2,6.0,175G


These two cases show that the extraction pattern handles both an uppercase unit (175G) and a size in the middle of the name (135g).

---

---

Now we can remove the size from the product name.

In [27]:
# Use the same unit list as the extraction step (g, kg, ml, l, oz) so both patterns agree.
transaction_data['PROD_NAME']=transaction_data['PROD_NAME'].str.replace(r'\d+(?:\.\d+)?\s*(?i:g|kg|ml|l|oz)\b','',regex=True)
transaction_data['PROD_NAME']=transaction_data['PROD_NAME'].str.strip()
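The removal step can be checked on the two exception names with the sketch below, which uses the same unit alternation as the extraction step. Collapsing repeated spaces is an optional extra that the notebook does not perform; it is included only because removing a size from the middle of a name leaves a double space behind.

```python
import pandas as pd

names = pd.Series([
    "Kettle 135g Swt Pot Sea Salt",
    "Smiths Thinly       Swt Chli&S/Cream175G",
])

size_pattern = r"\d+(?:\.\d+)?\s*(?i:g|kg|ml|l|oz)\b"

cleaned = (
    names.str.replace(size_pattern, "", regex=True)  # drop the size token
         .str.replace(r"\s+", " ", regex=True)       # optional: collapse repeated spaces
         .str.strip()                                # trim leading and trailing spaces
)

print(cleaned.tolist())
# ['Kettle Swt Pot Sea Salt', 'Smiths Thinly Swt Chli&S/Cream']
```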

In [28]:
transaction_data.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES
0,2018-10-17,1,1000,1,5,Natural Chip Compny SeaSalt,2,6.0,175g
1,2019-05-14,1,1307,348,66,CCs Nacho Cheese,3,6.3,175g
2,2019-05-20,1,1343,383,61,Smiths Crinkle Cut Chips Chicken,2,2.9,170g
3,2018-08-17,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion,5,15.0,175g
4,2018-08-18,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili,3,13.8,150g


---

---

Exception Case Handling

In [29]:
transaction_data[transaction_data['PROD_NBR']==37]

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES
35,2018-08-19,51,51100,46803,37,Smiths Thinly Swt Chli&S/Cream,1,3.0,175G
421,2018-08-07,13,13010,11152,37,Smiths Thinly Swt Chli&S/Cream,2,6.0,175G
428,2019-06-25,13,13010,11159,37,Smiths Thinly Swt Chli&S/Cream,2,6.0,175G
527,2018-08-25,20,20329,17313,37,Smiths Thinly Swt Chli&S/Cream,1,3.0,175G
879,2018-07-13,45,45126,41112,37,Smiths Thinly Swt Chli&S/Cream,2,6.0,175G
...,...,...,...,...,...,...,...,...,...
263054,2019-04-25,195,195320,195230,37,Smiths Thinly Swt Chli&S/Cream,1,3.0,175G
263310,2019-03-17,205,205252,204325,37,Smiths Thinly Swt Chli&S/Cream,2,6.0,175G
263317,2018-07-01,205,205430,204503,37,Smiths Thinly Swt Chli&S/Cream,1,3.0,175G
263429,2019-03-03,213,213088,212425,37,Smiths Thinly Swt Chli&S/Cream,2,6.0,175G


In [30]:
transaction_data[transaction_data['PROD_NBR']==63]

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES
65,2019-05-20,83,83008,82099,63,Kettle Swt Pot Sea Salt,2,8.4,135g
153,2019-05-17,208,208139,206906,63,Kettle Swt Pot Sea Salt,1,4.2,135g
174,2018-08-20,237,237227,241132,63,Kettle Swt Pot Sea Salt,2,8.4,135g
177,2019-05-17,243,243070,246706,63,Kettle Swt Pot Sea Salt,1,4.2,135g
348,2018-10-26,7,7077,6604,63,Kettle Swt Pot Sea Salt,2,8.4,135g
...,...,...,...,...,...,...,...,...,...
264564,2018-10-08,260,260240,259480,63,Kettle Swt Pot Sea Salt,2,8.4,135g
264574,2019-06-12,261,261035,259860,63,Kettle Swt Pot Sea Salt,2,8.4,135g
264725,2018-07-20,266,266413,264246,63,Kettle Swt Pot Sea Salt,1,4.2,135g
264767,2019-06-08,269,269133,265839,63,Kettle Swt Pot Sea Salt,2,8.4,135g


This shows that removing the sizes from the product name also works correctly for both exception cases.

---

---

### Sorting Product by Frequency

In [31]:
frequency_map=transaction_data['PROD_NAME'].value_counts()
frequency_map

PROD_NAME
Kettle Mozzarella   Basil & Pesto       3304
Kettle Tortilla ChpsHny&Jlpno Chili     3296
Cobs Popd Swt/Chlli &Sr/Cream Chips     3269
Tyrrells Crisps     Ched & Chives       3268
Cobs Popd Sea Salt  Chips               3265
                                        ... 
Sunbites Whlegrn    Crisps Frch/Onin    1432
RRD Pc Sea Salt                         1431
NCC Sour Cream &    Garden Chives       1419
French Fries Potato Chips               1418
WW Crinkle Cut      Original            1410
Name: count, Length: 105, dtype: int64

In [32]:
transaction_data['PROD_freq']=transaction_data['PROD_NAME'].map(frequency_map)
transaction_data.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES,PROD_freq
0,2018-10-17,1,1000,1,5,Natural Chip Compny SeaSalt,2,6.0,175g,1468
1,2019-05-14,1,1307,348,66,CCs Nacho Cheese,3,6.3,175g,1498
2,2019-05-20,1,1343,383,61,Smiths Crinkle Cut Chips Chicken,2,2.9,170g,1484
3,2018-08-17,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion,5,15.0,175g,1473
4,2018-08-18,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili,3,13.8,150g,3296


In [33]:
transaction_data_sorted=transaction_data.sort_values('PROD_freq', ascending=False)
transaction_data_sorted.head()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES,PROD_freq
137549,2019-06-06,34,34057,31150,102,Kettle Mozzarella Basil & Pesto,2,10.8,175g,3304
149057,2019-03-04,245,245223,247682,102,Kettle Mozzarella Basil & Pesto,2,10.8,175g,3304
91915,2019-06-25,160,160226,161580,102,Kettle Mozzarella Basil & Pesto,2,10.8,175g,3304
37807,2019-04-10,65,65122,62177,102,Kettle Mozzarella Basil & Pesto,1,5.4,175g,3304
245585,2018-09-22,91,91070,89505,102,Kettle Mozzarella Basil & Pesto,2,10.8,175g,3304


In [34]:
transaction_data_sorted.tail()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES,PROD_freq
168217,2019-04-20,225,225142,225521,72,WW Crinkle Cut Original,2,3.4,175g,1410
76369,2019-03-26,160,160161,161150,72,WW Crinkle Cut Original,2,3.4,175g,1410
230584,2018-11-14,53,53110,47364,72,WW Crinkle Cut Original,1,1.7,175g,1410
65016,2019-04-20,178,178228,178998,72,WW Crinkle Cut Original,2,3.4,175g,1410
68668,2019-06-03,259,259013,257496,72,WW Crinkle Cut Original,2,3.4,175g,1410


In [35]:
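# Preview without PROD_freq; the result is not assigned, so the column stays in transaction_data_sorted.
transaction_data_sorted.drop('PROD_freq', axis=1)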

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES
137549,2019-06-06,34,34057,31150,102,Kettle Mozzarella Basil & Pesto,2,10.8,175g
149057,2019-03-04,245,245223,247682,102,Kettle Mozzarella Basil & Pesto,2,10.8,175g
91915,2019-06-25,160,160226,161580,102,Kettle Mozzarella Basil & Pesto,2,10.8,175g
37807,2019-04-10,65,65122,62177,102,Kettle Mozzarella Basil & Pesto,1,5.4,175g
245585,2018-09-22,91,91070,89505,102,Kettle Mozzarella Basil & Pesto,2,10.8,175g
...,...,...,...,...,...,...,...,...,...
168217,2019-04-20,225,225142,225521,72,WW Crinkle Cut Original,2,3.4,175g
76369,2019-03-26,160,160161,161150,72,WW Crinkle Cut Original,2,3.4,175g
230584,2018-11-14,53,53110,47364,72,WW Crinkle Cut Original,1,1.7,175g
65016,2019-04-20,178,178228,178998,72,WW Crinkle Cut Original,2,3.4,175g


transaction_data_sorted is our new DataFrame, with rows ordered from the most frequent product name to the least frequent.
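An alternative way to attach a per-product frequency and sort by it is sketched below on toy data. It uses `groupby(...).transform("size")` instead of mapping `value_counts()`; the product names A, B and C are placeholders.

```python
import pandas as pd

# Toy transactions with placeholder product names.
df = pd.DataFrame({"PROD_NAME": ["A", "B", "A", "C", "A", "B"]})

# transform("size") gives every row the size of the group it belongs to.
df["PROD_freq"] = df.groupby("PROD_NAME")["PROD_NAME"].transform("size")

# kind="stable" keeps the original row order among rows with equal frequency.
df_sorted = df.sort_values("PROD_freq", ascending=False, kind="stable")

print(df_sorted["PROD_NAME"].tolist())  # ['A', 'A', 'A', 'B', 'B', 'C']
```

If only the ranking of products is needed, `value_counts()` on its own is already sorted by frequency and avoids sorting the full table.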

---

### SORTED DATA SUMMARY, STATISTICS

In [36]:
transaction_data_sorted.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246742 entries, 137549 to 68668
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   DATE            246742 non-null  datetime64[ns]
 1   STORE_NBR       246742 non-null  int64         
 2   LYLTY_CARD_NBR  246742 non-null  int64         
 3   TXN_ID          246742 non-null  int64         
 4   PROD_NBR        246742 non-null  int64         
 5   PROD_NAME       246742 non-null  object        
 6   PROD_QTY        246742 non-null  int64         
 7   TOT_SALES       246742 non-null  float64       
 8   SIZES           246742 non-null  object        
 9   PROD_freq       246742 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(6), object(2)
memory usage: 20.7+ MB


In [37]:
transaction_data_sorted.describe()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_QTY,TOT_SALES,PROD_freq
count,246742,246742.0,246742.0,246742.0,246742.0,246742.0,246742.0,246742.0
mean,2018-12-30 01:19:01.211467520,135.051098,135531.0,135131.1,56.351789,1.908062,7.321322,2651.619359
min,2018-07-01 00:00:00,1.0,1000.0,1.0,1.0,1.0,1.7,1410.0
25%,2018-09-30 00:00:00,70.0,70015.0,67569.25,26.0,2.0,5.8,1516.0
50%,2018-12-30 00:00:00,130.0,130367.0,135183.0,53.0,2.0,7.4,3134.0
75%,2019-03-31 00:00:00,203.0,203084.0,202653.8,87.0,2.0,8.8,3174.0
max,2019-06-30 00:00:00,272.0,2373711.0,2415841.0,114.0,200.0,650.0,3304.0
std,,76.787096,80715.28,78147.72,33.695428,0.659831,3.077828,777.492767


#### Observation
1. None of the columns has any null values.
2. There is no std for DATE, as it does not make sense for a date.
3. The maximum PROD_QTY is 200, which seems unusual for a chip purchase.

In [38]:
transaction_data_sorted[transaction_data_sorted['PROD_QTY']==200.00]

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES,PROD_freq
69762,2018-08-19,226,226000,226201,4,Dorito Corn Chp Supreme,200,650.0,380g,3185
69763,2019-05-20,226,226000,226210,4,Dorito Corn Chp Supreme,200,650.0,380g,3185


So there were two occasions on which the maximum product quantity of 200 was sold.
In both cases it was the DORITO CORN CHP SUPREME.

Interestingly, the LYLTY_CARD_NBR is the same and both purchases were made at the same store, so both were made by the same customer. Let's check whether this customer has any other transactions.

In [39]:
transaction_data_sorted[transaction_data_sorted['LYLTY_CARD_NBR']== 226000]

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,SIZES,PROD_freq
69762,2018-08-19,226,226000,226201,4,Dorito Corn Chp Supreme,200,650.0,380g,3185
69763,2019-05-20,226,226000,226210,4,Dorito Corn Chp Supreme,200,650.0,380g,3185


No, this customer has made only these two purchases, both at the maximum quantity, and has not bought anything else. They do not look like an ordinary retail customer
and may be buying chips for commercial purposes, so we will remove this loyalty card number from further analysis.

In [40]:
transaction_data_sorted=transaction_data_sorted[transaction_data_sorted['LYLTY_CARD_NBR']!= 226000].copy()
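The same idea can be written without hard-coding the card number, as in the sketch below on toy data. The cut-off `qty_threshold` is a hypothetical value chosen for illustration, not something derived in the notebook.

```python
import pandas as pd

# Toy transactions; only the 226000 rows mirror the case found above.
df = pd.DataFrame({
    "LYLTY_CARD_NBR": [1000, 1307, 226000, 226000, 2373],
    "PROD_QTY": [2, 3, 200, 200, 5],
})

# Hypothetical cut-off: quantities above this are treated as bulk purchases.
qty_threshold = 100

# Collect every card that made at least one bulk purchase.
bulk_cards = df.loc[df["PROD_QTY"] > qty_threshold, "LYLTY_CARD_NBR"].unique()

# Drop all transactions from those cards, not only the bulk rows themselves.
retail_only = df[~df["LYLTY_CARD_NBR"].isin(bulk_cards)].copy()

print(len(bulk_cards), "card(s) removed; max quantity is now", retail_only["PROD_QTY"].max())
```

Filtering by card rather than by row removes any ordinary-sized purchases the same customer might have made, which matches the decision to exclude the customer entirely.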

In [41]:
transaction_data_sorted.info()

<class 'pandas.core.frame.DataFrame'>
Index: 246740 entries, 137549 to 68668
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   DATE            246740 non-null  datetime64[ns]
 1   STORE_NBR       246740 non-null  int64         
 2   LYLTY_CARD_NBR  246740 non-null  int64         
 3   TXN_ID          246740 non-null  int64         
 4   PROD_NBR        246740 non-null  int64         
 5   PROD_NAME       246740 non-null  object        
 6   PROD_QTY        246740 non-null  int64         
 7   TOT_SALES       246740 non-null  float64       
 8   SIZES           246740 non-null  object        
 9   PROD_freq       246740 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(6), object(2)
memory usage: 20.7+ MB


The two rows with the maximum purchase quantity are now removed.
Now let's check the summary statistics again.

In [42]:
transaction_data_sorted.describe()

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_QTY,TOT_SALES,PROD_freq
count,246740,246740.0,246740.0,246740.0,246740.0,246740.0,246740.0,246740.0
mean,2018-12-30 01:18:58.448569600,135.050361,135530.3,135130.4,56.352213,1.906456,7.316113,2651.615036
min,2018-07-01 00:00:00,1.0,1000.0,1.0,1.0,1.0,1.7,1410.0
25%,2018-09-30 00:00:00,70.0,70015.0,67568.75,26.0,2.0,5.8,1516.0
50%,2018-12-30 00:00:00,130.0,130367.0,135181.5,53.0,2.0,7.4,3134.0
75%,2019-03-31 00:00:00,203.0,203083.2,202652.2,87.0,2.0,8.8,3174.0
max,2019-06-30 00:00:00,272.0,2373711.0,2415841.0,114.0,5.0,29.5,3304.0
std,,76.786971,80715.2,78147.6,33.695235,0.342499,2.474897,777.494435


Hence the max product quantity is now 5, which seems reasonable.

---

### Let's group the number of transactions by date

In [43]:
transaction_by_date=transaction_data_sorted.groupby('DATE').size()
transaction_by_date

DATE
2018-07-01    663
2018-07-02    650
2018-07-03    674
2018-07-04    669
2018-07-05    660
             ... 
2019-06-26    657
2019-06-27    669
2019-06-28    673
2019-06-29    703
2019-06-30    704
Length: 364, dtype: int64

In [44]:
existing_dates=transaction_data_sorted['DATE'].sort_values().unique()

In [45]:
full_range=pd.date_range(start='2018-07-01',end='2019-06-30')

In [46]:
missing_dates=full_range.difference(existing_dates)
missing_dates

DatetimeIndex(['2018-12-25'], dtype='datetime64[ns]', freq='D')

Here we can see that 25 December 2018, which is Christmas Day, is missing from the data.
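The gap check above can be sketched on a tiny toy index. Here the full range is derived from the observed minimum and maximum rather than typed in, which is an assumption of this sketch and would not detect a gap at either end of the period.

```python
import pandas as pd

# Toy data: three observed days with a one-day gap.
observed = pd.to_datetime(["2018-12-24", "2018-12-26", "2018-12-27"])

# Every calendar day between the first and last observed date.
full_range = pd.date_range(start=observed.min(), end=observed.max(), freq="D")

# Days in the full range that never appear in the observed dates.
missing = full_range.difference(observed)

print(missing.strftime("%Y-%m-%d").tolist())  # ['2018-12-25']
```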

In [47]:
transaction_by_date=transaction_by_date.reset_index()
transaction_by_date.columns=['DATE','COUNTS']

In [48]:
transaction_by_date

Unnamed: 0,DATE,COUNTS
0,2018-07-01,663
1,2018-07-02,650
2,2018-07-03,674
3,2018-07-04,669
4,2018-07-05,660
...,...,...
359,2019-06-26,657
360,2019-06-27,669
361,2019-06-28,673
362,2019-06-29,703


In [49]:
transaction_by_date=transaction_by_date.set_index('DATE').reindex(full_range,fill_value=0)
transaction_by_date=transaction_by_date.reset_index()
transaction_by_date.columns=['DATE','COUNTS']
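The reindexing step can be seen in isolation in the following sketch, which fills a missing day with a zero count. The counts are toy numbers and the chained `rename_axis` call is just one possible way to name the date column.

```python
import pandas as pd

# Toy daily counts with 2018-12-25 absent from the index.
counts = pd.Series(
    [663, 650],
    index=pd.to_datetime(["2018-12-24", "2018-12-26"]),
    name="COUNTS",
)

full_range = pd.date_range("2018-12-24", "2018-12-26", freq="D")

daily = (
    counts.reindex(full_range, fill_value=0)  # add missing days with a count of 0
          .rename_axis("DATE")                # name the index so it becomes the DATE column
          .reset_index()
)

print(daily)
```

Using `fill_value=0` keeps COUNTS as an integer column; reindexing without it would insert NaN and turn the column into floats.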



In [50]:
transaction_by_date.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   DATE    365 non-null    datetime64[ns]
 1   COUNTS  365 non-null    int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 5.8 KB


---

Let's plot the number of transactions by date

In [51]:
import plotly.express as px


In [52]:
fig=px.line(transaction_by_date, x='DATE', y='COUNTS')

fig.update_layout(title='Transactions Over Time',yaxis_title='Number of Transactions',xaxis=dict(dtick="M1", tickformat="%b %Y"),hovermode='x unified')
fig.add_annotation(
    x='2018-12-25',
    y=0,
    text="No sales on Christmas Day",
    showarrow=True,
    arrowhead=2,
    arrowsize=1,
    arrowwidth=2,
    arrowcolor="red",
    ax=50,
    ay=-40,
    bgcolor="white",
    bordercolor="red",
    borderwidth=2
)
fig.show()


The steep drop observed here is Christmas Day (2018-12-25): there are no transactions for that date, presumably because the stores were closed.