#### <span style='color:Blue'> **Fetch Data Analyst Project Overview:**</span>

> **Data Source: [Fetch Commerce Data](Raw Data)**

> **Data Quality Issues:**
1. **Inappropriate Data Types**
    - The data types of columns in all three tables (User, Product, and Transaction) were incorrectly defined upon initial reading, requiring adjustments to match their logical meanings (e.g., dates, integers, and floats).

2. **Duplicate and Missing Data**
    - Rows with identical values across all columns were treated as erroneous duplicates and dropped. Rows with a missing User ID (User table) or Barcode (Product table) were also dropped.
    - Product Table:
        - 215 duplicate rows across all columns were dropped.
        - 4,022 rows were flagged as duplicates based on BARCODE. Rows with a missing barcode were dropped first, after which 54 rows with duplicated barcodes remained; these were resolved by keeping, for each barcode, the row with fewer missing values.
    - Transaction Table:
        - 171 rows that were identical across all columns were dropped.
        - Rows sharing the same SCAN_DATE, RECEIPT_ID, USER_ID, and BARCODE were then treated as duplicates of a single transaction line. Only the row with fewer missing or zero values was kept, leaving 24,795 rows.

3. **Inconsistent Labeling**
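    - User Table: GENDER used several spellings for the same category (e.g., `non_binary` and `Non-Binary`, `prefer_not_to_say` and `Prefer not to say`).
    - Product Table:
        - Brands were associated with multiple manufacturers.
        - Barcodes were labeled with multiple brand values.
    - Transaction Table: FINAL_QUANTITY used the string 'zero' instead of 0.0, and FINAL_SALE used a blank string for missing values.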

4. **Weak Common Keys for Mapping Across Tables**
    - The overlap between tables was unexpectedly low.
    - User Table and Transaction Table: Only 130 of the 24,795 cleaned transaction rows had a USER_ID that also appears in the User table.
    - Product Table and Transaction Table: 51.60% of transaction rows had no CATEGORY_1 after a left join on BARCODE, which limits how well the tables can be mapped to each other.

> **Data Assumptions For Cleaning:**
1. **Deduplication Logic**
    - In the User Table and Product Table, rows with identical values across all columns were assumed to be duplicates and were removed.
    - Each unique USER_ID in the User Table and each unique BARCODE in the Product Table should only be associated with one row.

2. **Consistency in Product Table**
    - Each BARCODE should be uniquely associated with a single, correct brand, and each brand should have a consistent manufacturer.
    - In cases of inconsistencies, the most frequently occurring brand value for each BARCODE was assumed to be correct. This was resolved through aggregation and frequency counts.

3. **Transaction Data Assumptions**
    - Each transaction (RECEIPT_ID) could involve multiple product barcodes, but only one USER_ID.
    - No two transactions should occur for the same user on the same scan date and time. Duplicate rows were dropped accordingly.
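
The transaction assumption that each RECEIPT_ID belongs to a single USER_ID can be checked before relying on it. The following is a minimal sketch on made-up data; `toy_tx` and its values are hypothetical, and only the column names come from the Transaction table.

```python
import pandas as pd

# Toy data: column names mirror the Transaction table, values are invented.
toy_tx = pd.DataFrame({
    "RECEIPT_ID": ["r1", "r1", "r2", "r3", "r3"],
    "USER_ID":    ["u1", "u1", "u2", "u3", "u4"],
    "BARCODE":    [111.0, 222.0, 333.0, 444.0, 555.0],
})

# Count distinct users per receipt; more than one violates the assumption.
users_per_receipt = toy_tx.groupby("RECEIPT_ID")["USER_ID"].nunique()
violating_receipts = users_per_receipt[users_per_receipt > 1].index.tolist()
print(violating_receipts)
```

An empty list on the real table would support the assumption; any receipt listed would need a closer look before deduplication.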

In [1]:
#basic packages 
import pandas as pd
import numpy as np
import random
import math
from scipy import stats
import pickle 
from itertools import combinations
from collections import Counter

#plot
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')
import plotly.express as px
import plotly.graph_objects as pg
from plotly import tools
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook_connected"

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
custom_template = {
    "layout": pg.Layout(
        font={
            "family": "Courier New",
            "size": 12,
            "color": "#707070",
        },
        title={
            "font": {
                "family": "Courier New",
                "size": 14.5,
                "color": "#1f1f1f",
            },
        },
        plot_bgcolor="#ffffff",
        paper_bgcolor="#ffffff",
        colorway=px.colors.qualitative.G10,
    )
}
def format_title(title, subtitle=None, subtitle_font_size=13):
    title = f'<b>{title}</b>'
    if not subtitle:
        return title
    subtitle = f'<span style="font-size: {subtitle_font_size}px;">{subtitle}</span>'
    return f'{title}<br>{subtitle}'

##### <span style='color:Blue'> **I. Data Cleaning & Processing** </span>

> <span style='color:Blue'> **I.1. Data Quality Check & Fix Inconsistent Labeling.**</span>

In [3]:
def check_and_drop_duplicates(df):
    """
    Function to check for duplicate rows in a DataFrame, drop them, and reset the index.
    """
    num_duplicates = df.duplicated().sum()
    if num_duplicates > 0:
        print(f"{num_duplicates} rows in the dataset are duplicated and will be dropped.")
        df.drop_duplicates(inplace=True)
        df.reset_index(drop=True, inplace=True)
        print(f"Shape after dropping duplicates: {df.shape}")
    else:
        print("No duplicated rows detected in the dataset.")
    return df

def check_duplicates_by_columns(df, columns):
    """
    Check for duplicate rows based on specific columns in a DataFrame and return them.
    """
    duplicate_rows = df[df.duplicated(subset=columns, keep=False)]  # keep=False returns all duplicates
    if not duplicate_rows.empty:
        print(f"Found {len(duplicate_rows)} duplicate rows based on columns {columns}.")
    else:
        print(f"No duplicated rows detected based on columns {columns}.")
    return duplicate_rows

In [4]:
User = pd.read_csv('Raw Data/USER_TAKEHOME.csv', delimiter=',', encoding='utf-8')
User.head(2)

Unnamed: 0,ID,CREATED_DATE,BIRTH_DATE,STATE,LANGUAGE,GENDER
0,5ef3b4f17053ab141787697d,2020-06-24 20:17:54.000 Z,2000-08-11 00:00:00.000 Z,CA,es-419,female
1,5ff220d383fcfc12622b96bc,2021-01-03 19:53:55.000 Z,2001-09-24 04:00:00.000 Z,PA,en,female


In [5]:
User.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   ID            100000 non-null  object
 1   CREATED_DATE  100000 non-null  object
 2   BIRTH_DATE    96325 non-null   object
 3   STATE         95188 non-null   object
 4   LANGUAGE      69492 non-null   object
 5   GENDER        94108 non-null   object
dtypes: object(6)
memory usage: 4.6+ MB


> User Table: Convert to Appropriate Datatype

In [6]:
datetime_columns = ['CREATED_DATE', 'BIRTH_DATE']
for col in datetime_columns:
    User[col] = pd.to_datetime(User[col], errors='coerce')

> User Table: Handle Duplicate Rows and Check for Duplicated User IDs

In [7]:
User=check_and_drop_duplicates(User)

No duplicated rows detected in the dataset.


In [8]:
dup_userID= check_duplicates_by_columns(User, ['ID'])

No duplicated rows detected based on columns ['ID'].


> User Table: Handle Inconsistent Labeling

In [9]:
pd.set_option('max_colwidth', None)
varType= User[['STATE','LANGUAGE','GENDER']]
varType_df=pd.DataFrame.from_records([(col, varType[col].dtype, varType[col].nunique(), varType[col].unique().tolist()) for col in varType.columns],
                          columns=['Column_Name','Data_Type', 'Num_Unique_Values', 'Unique Values']).sort_values(by=['Num_Unique_Values'])
varType_df

Unnamed: 0,Column_Name,Data_Type,Num_Unique_Values,Unique Values
1,LANGUAGE,object,2,"[es-419, en, nan]"
2,GENDER,object,11,"[female, nan, male, non_binary, transgender, prefer_not_to_say, not_listed, Non-Binary, unknown, not_specified, My gender isn't listed, Prefer not to say]"
0,STATE,object,52,"[CA, PA, FL, NC, NY, IN, nan, OH, TX, NM, PR, CO, AZ, RI, MO, NJ, MA, TN, LA, NH, WI, IA, GA, VA, DC, KY, SC, MN, WV, DE, MI, IL, MS, WA, KS, CT, OR, UT, MD, OK, NE, NV, AL, AK, AR, HI, ME, ND, ID, WY, MT, SD, VT]"


In [10]:
# Clean the GENDER column by mapping the values, fill missing values with 'Unknown'
gender_mapping = {'female': 'Female','male': 'Male',
    'non_binary': 'Non-Binary','Non-Binary': 'Non-Binary',
    'transgender': 'Transgender',
    'prefer_not_to_say': 'Prefer not to say','Prefer not to say': 'Prefer not to say',
    'not_listed': 'Not Listed',"My gender isn't listed": 'Not Listed',
    'unknown': 'Unknown','not_specified': 'Unknown',None: 'Unknown',float('nan'): 'Unknown'
}
User['GENDER'] = User['GENDER'].map(gender_mapping).fillna('Unknown')
User['STATE'] = User['STATE'].fillna('Unknown')
print("Unique values in GENDER column after cleaning:",User['GENDER'].unique())

Unique values in GENDER column after cleaning: ['Female' 'Unknown' 'Male' 'Non-Binary' 'Transgender' 'Prefer not to say'
 'Not Listed']


In [35]:
Product = pd.read_csv('Raw Data/PRODUCTS_TAKEHOME.csv', delimiter=',', encoding='utf-8')
Product.head(2)

Unnamed: 0,CATEGORY_1,CATEGORY_2,CATEGORY_3,CATEGORY_4,MANUFACTURER,BRAND,BARCODE
0,Health & Wellness,Sexual Health,Conductivity Gels & Lotions,,,,796494400000.0
1,Snacks,Puffed Snacks,Cheese Curls & Puffs,,,,23278010000.0


In [36]:
Product.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 845552 entries, 0 to 845551
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   CATEGORY_1    845441 non-null  object 
 1   CATEGORY_2    844128 non-null  object 
 2   CATEGORY_3    784986 non-null  object 
 3   CATEGORY_4    67459 non-null   object 
 4   MANUFACTURER  619078 non-null  object 
 5   BRAND         619080 non-null  object 
 6   BARCODE       841527 non-null  float64
dtypes: float64(1), object(6)
memory usage: 45.2+ MB
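
BARCODE is parsed as `float64` here, which drops any leading zeros and shows large codes in scientific notation. One possible alternative is to read the column as text. The sketch below uses a made-up in-memory CSV and assumes the raw file stores barcodes as plain digit strings, which has not been verified against the actual data.

```python
import io
import pandas as pd

# Made-up sample; only the BARCODE column name comes from the notebook.
sample_text = "BRAND,BARCODE\nA,012345678905\nB,796494371234\nC,\n"

# Default parsing infers a float column because of the missing value.
as_float = pd.read_csv(io.StringIO(sample_text))

# Forcing the pandas string dtype keeps the digits exactly as written.
as_text = pd.read_csv(io.StringIO(sample_text), dtype={"BARCODE": "string"})

print(as_float["BARCODE"].dtype, as_text["BARCODE"].dtype)
print(as_text["BARCODE"].tolist())
```

If this route were taken, BARCODE in the Transaction table would need the same dtype so that later merges on BARCODE still match.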


> Product Table: Handle Duplicate Rows and Drop Rows with a Missing BARCODE

In [37]:
Product=check_and_drop_duplicates(Product)
Product = Product.dropna(subset=['BARCODE'])

215 rows in the dataset are duplicated and will be dropped.
Shape after dropping duplicates: (845337, 7)


> Product Table: Inconsistent BARCODE Labeling Issue: BARCODE being linked to different BRAND or MANUFACTURER values.

In [38]:
# Group by 'BARCODE' and aggregate the required values
barcode_analysis = (Product.groupby('BARCODE')
    .agg(unique_brands_count=('BRAND', 'nunique'),  # Count of unique brands
        unique_brands_list=('BRAND', lambda x: list(x.unique()))).reset_index())
#barcodes where the number of unique brands is greater than 1
barcodes_with_issues = barcode_analysis[barcode_analysis['unique_brands_count'] > 1]
barcodes_with_issues

Unnamed: 0,BARCODE,unique_brands_count,unique_brands_list
188,404310.0,2,"[BRAND NOT KNOWN, M&M'S]"
377,701983.0,2,"[SUNRIDGE FARMS, TRADER JOE'S]"
657,1018158.0,2,"[PALMER'S SKIN & HAIR CARE, PALMER]"
930,3454503.0,2,"[DISNEY, ICE BREAKERS]"
959,3473009.0,2,"[BUBBLE YUM, REESE'S]"
976,3484708.0,2,"[ICE BREAKERS, DISNEY]"
2646,20733060.0,2,"[LIDL, PRIVATE LABEL]"
2848,40111220.0,2,"[BOUNTY, MARS]"
3209,80310170.0,2,"[KINDER, KINDER'S]"
41099,17000330000.0,2,"[SCHWARZKOPF, GÖT2B]"


> Product Table: Check for Duplicated Barcodes

In [39]:
dup_barcode= check_duplicates_by_columns(Product, ['BARCODE'])
dup_barcode.sort_values(by=['BARCODE'])

Found 54 duplicate rows based on columns ['BARCODE'].


Unnamed: 0,CATEGORY_1,CATEGORY_2,CATEGORY_3,CATEGORY_4,MANUFACTURER,BRAND,BARCODE
841016,Snacks,Candy,Chocolate Candy,,MARS WRIGLEY,M&M'S,404310.0
139113,Snacks,Candy,Chocolate Candy,,PLACEHOLDER MANUFACTURER,BRAND NOT KNOWN,404310.0
610568,Snacks,Nuts & Seeds,Snack Seeds,,SUNRIDGE FARMS,SUNRIDGE FARMS,701983.0
645146,Snacks,Chips,Crisps,,TRADER JOE'S,TRADER JOE'S,701983.0
681134,Snacks,Nuts & Seeds,Almonds,,TRADER JOE'S,TRADER JOE'S,969307.0
171005,Snacks,Nuts & Seeds,Covered Nuts,,TRADER JOE'S,TRADER JOE'S,969307.0
428195,Health & Wellness,Skin Care,Facial Lotion & Moisturizer,,"R.M. PALMER COMPANY, LLC",PALMER,1018158.0
123189,Health & Wellness,Skin Care,Lip Balms & Treatments,Medicated Lip Treatments,"E.T. BROWNE DRUG CO., INC.",PALMER'S SKIN & HAIR CARE,1018158.0
36017,Snacks,Candy,Candy Variety Pack,,THE HERSHEY COMPANY,HERSHEY'S,3422007.0
422750,Snacks,Candy,Chocolate Candy,,THE HERSHEY COMPANY,HERSHEY'S,3422007.0


In [40]:
# Row labels selected by manual review of the duplicated barcodes listed above
index_to_drop = [139113, 645146, 681134, 428195, 36017, 402333, 468650, 539824, 717296,
                 137242, 596671, 132540, 719868, 56987, 260669, 96435, 274674, 333739,
                 184561, 181892, 162, 216300, 300301, 37152, 379700, 303995, 709460]
Product = Product.drop(index=index_to_drop)
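
A hard-coded index list is tied to one specific load of the file. The rule "keep the row with fewer missing values" could also be applied programmatically. The helper below, `keep_most_complete`, is a hypothetical sketch shown on invented rows. It counts only true NaN values, so placeholders such as `BRAND NOT KNOWN` would not be treated as missing and those pairs would still need manual review.

```python
import pandas as pd

def keep_most_complete(df: pd.DataFrame, key_cols: list) -> pd.DataFrame:
    """For each key, keep the row with the fewest missing values (hypothetical helper)."""
    missing_per_row = df.isna().sum(axis=1)
    # A stable sort keeps the original order among rows with equal missing counts.
    order = missing_per_row.sort_values(kind="mergesort").index
    return df.loc[order].drop_duplicates(subset=key_cols, keep="first").sort_index()

# Invented rows; index labels are arbitrary.
toy = pd.DataFrame(
    {
        "BARCODE": [404310.0, 404310.0, 701983.0],
        "BRAND": [None, "M&M'S", "TRADER JOE'S"],
        "MANUFACTURER": [None, "MARS WRIGLEY", "TRADER JOE'S"],
    },
    index=[10, 11, 12],
)

deduped = keep_most_complete(toy, ["BARCODE"])
print(deduped)
```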

> Product Table: Handle Inconsistent Labeling Issue 

In [41]:
issue_category_2 = Product[
    Product[['CATEGORY_1', 'CATEGORY_2', 'CATEGORY_3']].notna().all(axis=1) &  
    Product.duplicated(subset=['CATEGORY_1', 'CATEGORY_3'], keep=False) & 
    ~Product.duplicated(subset=['CATEGORY_1', 'CATEGORY_3', 'CATEGORY_2'], keep=False)  # CATEGORY_2 differs
]

issue_category_3 = Product[
    Product[['CATEGORY_2', 'CATEGORY_3', 'CATEGORY_4']].notna().all(axis=1) &  
    Product.duplicated(subset=['CATEGORY_2', 'CATEGORY_4'], keep=False) & 
    ~Product.duplicated(subset=['CATEGORY_2', 'CATEGORY_4', 'CATEGORY_3'], keep=False)  # CATEGORY_3 differs
]
print("Row Length with category labeling issues for CATEGORY_2:", len(issue_category_2))
print("Row Length with category labeling issues for CATEGORY_3:", len(issue_category_3))

Row Length with category labeling issues for CATEGORY_2: 0
Row Length with category labeling issues for CATEGORY_3: 0
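
The check above flags a row only when its full category triple occurs exactly once, so a child category that appears under two different parents several times each may go unnoticed. A complementary check, sketched here on invented rows, counts the distinct CATEGORY_2 values per (CATEGORY_1, CATEGORY_3) pair. `toy` and `conflicts` are hypothetical names.

```python
import pandas as pd

# Invented rows: "Chocolate Candy" appears under both "Candy" and "Chips",
# twice each, so no category triple is unique.
toy = pd.DataFrame({
    "CATEGORY_1": ["Snacks", "Snacks", "Snacks", "Snacks"],
    "CATEGORY_2": ["Candy", "Candy", "Chips", "Chips"],
    "CATEGORY_3": ["Chocolate Candy", "Chocolate Candy", "Chocolate Candy", "Chocolate Candy"],
})

# Count how many distinct parents each child category has.
parents_per_child = (
    toy.dropna(subset=["CATEGORY_1", "CATEGORY_2", "CATEGORY_3"])
    .groupby(["CATEGORY_1", "CATEGORY_3"])["CATEGORY_2"]
    .nunique()
)

# Any pair with more than one parent is a labeling conflict.
conflicts = parents_per_child[parents_per_child > 1]
print(conflicts)
```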


In [42]:
# MANUFACTURER LABELING ISSUE IN PRODUCT DATASET
filtered_product = Product[Product[['BRAND', 'MANUFACTURER']].notna().all(axis=1)]

# Group by BRAND and MANUFACTURER and count occurrences of each combination
brand_manufacturer_count = filtered_product.groupby(['BRAND', 'MANUFACTURER']).size().reset_index(name='COUNT')

# find brands associated with more than one manufacturer
brands_with_multiple_manufacturers = brand_manufacturer_count.groupby('BRAND').filter(lambda x: len(x) > 1)
print("Brands with multiple manufacturers and their counts:")
brands_with_multiple_manufacturers

Brands with multiple manufacturers and their counts:


Unnamed: 0,BRAND,MANUFACTURER,COUNT
1317,CHAPSTICK,GLAXOSMITHKLINE,5
1318,CHAPSTICK,HALEON,1099
4129,LE PETIT MARSEILIAIS,J AND J CONSUMER PRODUCTS INC,5
4130,LE PETIT MARSEILIAIS,JOHNSON & JOHNSON,17
4131,LE PETIT MARSEILIAIS,SERVICE CONSOMMATEURS,1
7472,TYGAZ,TYGAZ,111
7473,TYGAZ,UNKNOWN,18


In [43]:
# Update the MANUFACTURER column in the Product dataframe
most_frequent_manufacturer = (brands_with_multiple_manufacturers.loc[
        brands_with_multiple_manufacturers.groupby('BRAND')['COUNT'].idxmax()
    ][['BRAND', 'MANUFACTURER']])
brand_to_manufacturer = dict(zip(most_frequent_manufacturer['BRAND'], most_frequent_manufacturer['MANUFACTURER']))
# Vectorized lookup: use the majority manufacturer where the brand is in the mapping, otherwise keep the existing value
Product['MANUFACTURER'] = Product['BRAND'].map(brand_to_manufacturer).fillna(Product['MANUFACTURER'])

print("Check Updated Product dataframe with corrected MANUFACTURER values:")
Product[Product['BRAND'].isin(['CHAPSTICK', 'TYGAZ','LE PETIT MARSEILIAIS'])].groupby('BRAND')['MANUFACTURER'].nunique()

Check Updated Product dataframe with corrected MANUFACTURER values:


BRAND
CHAPSTICK               1
LE PETIT MARSEILIAIS    1
TYGAZ                   1
Name: MANUFACTURER, dtype: int64

In [44]:
Transaction = pd.read_csv('Raw Data/TRANSACTION_TAKEHOME.csv', delimiter=',', encoding='utf-8')
Transaction.head(2)

Unnamed: 0,RECEIPT_ID,PURCHASE_DATE,SCAN_DATE,STORE_NAME,USER_ID,BARCODE,FINAL_QUANTITY,FINAL_SALE
0,0000d256-4041-4a3e-adc4-5623fb6e0c99,2024-08-21,2024-08-21 14:19:06.539 Z,WALMART,63b73a7f3d310dceeabd4758,15300010000.0,1.00,
1,0001455d-7a92-4a7b-a1d2-c747af1c8fd3,2024-07-20,2024-07-20 09:50:24.206 Z,ALDI,62c08877baa38d1a1f6c211a,,zero,1.49


In [45]:
Transaction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   RECEIPT_ID      50000 non-null  object 
 1   PURCHASE_DATE   50000 non-null  object 
 2   SCAN_DATE       50000 non-null  object 
 3   STORE_NAME      50000 non-null  object 
 4   USER_ID         50000 non-null  object 
 5   BARCODE         44238 non-null  float64
 6   FINAL_QUANTITY  50000 non-null  object 
 7   FINAL_SALE      50000 non-null  object 
dtypes: float64(1), object(7)
memory usage: 3.1+ MB


> Transaction Table: Fix Inconsistent Labeling for Quantity and Sale Fields

In [46]:
print(Transaction.FINAL_QUANTITY.unique().tolist())
print(Transaction.FINAL_SALE.unique().tolist())

['1.00', 'zero', '2.00', '3.00', '4.00', '4.55', '2.83', '2.34', '0.46', '7.00', '18.00', '12.00', '5.00', '2.17', '0.23', '8.00', '1.35', '0.09', '2.58', '1.47', '16.00', '0.62', '1.24', '1.40', '0.51', '0.53', '1.69', '6.00', '2.39', '2.60', '10.00', '0.86', '1.54', '1.88', '2.93', '1.28', '0.65', '2.89', '1.44', '2.75', '1.81', '276.00', '0.87', '2.10', '3.33', '2.54', '2.20', '1.93', '1.34', '1.13', '2.19', '0.83', '2.61', '0.28', '1.50', '0.97', '0.24', '1.18', '6.22', '1.22', '1.23', '2.57', '1.07', '2.11', '0.48', '9.00', '3.11', '1.08', '5.53', '1.89', '0.01', '2.18', '1.99', '0.04', '2.25', '1.37', '3.02', '0.35', '0.99', '1.80', '3.24', '0.94', '2.04', '3.69', '0.70', '2.52', '2.27']
[' ', '1.49', '3.49', '1.46', '3.59', '2.29', '10.99', '0.97', '7.48', '2.49', '5.25', '1.25', '2.92', '3.67', '1.39', '1.38', '4.18', '0.50', '2.68', '5.49', '1.44', '23.48', '4.88', '3.18', '0.18', '1.79', '1.54', '0.99', '2.39', '7.58', '3.00', '3.70', '6.47', '4.25', '5.48', '3.25', '0.88', '

In [47]:
Transaction['FINAL_QUANTITY'] = Transaction['FINAL_QUANTITY'].replace('zero', '0.0')
Transaction['FINAL_SALE'] = Transaction['FINAL_SALE'].replace(' ', np.nan)

> Transaction Table: Handle Duplicate Rows by Keeping Records with Fewer NaN Values

In [48]:
Transaction= check_and_drop_duplicates(Transaction)

171 rows in the dataset are duplicated and will be dropped.
Shape after dropping duplicates: (49829, 8)


In [49]:
dup_transaction= check_duplicates_by_columns(Transaction, ['SCAN_DATE','RECEIPT_ID','USER_ID','BARCODE'])
dup_transaction.sort_values(by=['RECEIPT_ID','USER_ID'])

Found 49829 duplicate rows based on columns ['SCAN_DATE', 'RECEIPT_ID', 'USER_ID', 'BARCODE'].


Unnamed: 0,RECEIPT_ID,PURCHASE_DATE,SCAN_DATE,STORE_NAME,USER_ID,BARCODE,FINAL_QUANTITY,FINAL_SALE
0,0000d256-4041-4a3e-adc4-5623fb6e0c99,2024-08-21,2024-08-21 14:19:06.539 Z,WALMART,63b73a7f3d310dceeabd4758,1.530001e+10,1.00,
41464,0000d256-4041-4a3e-adc4-5623fb6e0c99,2024-08-21,2024-08-21 14:19:06.539 Z,WALMART,63b73a7f3d310dceeabd4758,1.530001e+10,1.00,1.54
1,0001455d-7a92-4a7b-a1d2-c747af1c8fd3,2024-07-20,2024-07-20 09:50:24.206 Z,ALDI,62c08877baa38d1a1f6c211a,,0.0,1.49
39205,0001455d-7a92-4a7b-a1d2-c747af1c8fd3,2024-07-20,2024-07-20 09:50:24.206 Z,ALDI,62c08877baa38d1a1f6c211a,,1.00,1.49
2,00017e0a-7851-42fb-bfab-0baa96e23586,2024-08-18,2024-08-19 15:38:56.813 Z,WALMART,60842f207ac8b7729e472020,7.874223e+10,1.00,
...,...,...,...,...,...,...,...,...
28116,fffbb112-3cc5-47c2-b014-08db2f87e0c7,2024-07-30,2024-08-04 11:43:31.474 Z,WALMART,5eb59d6be7012d13941af5e2,8.180000e+11,1.00,4.88
24975,fffbfb2a-7c1f-41c9-a5da-628fa7fcc746,2024-07-28,2024-07-28 11:47:34.180 Z,WALMART,62a0c8f7d966665570351bb8,1.300001e+10,1.00,
31547,fffbfb2a-7c1f-41c9-a5da-628fa7fcc746,2024-07-28,2024-07-28 11:47:34.180 Z,WALMART,62a0c8f7d966665570351bb8,1.300001e+10,1.00,3.48
24976,fffe8012-7dcf-4d84-b6c6-feaacab5074a,2024-09-07,2024-09-08 08:21:25.648 Z,WALGREENS,5f53c62bd683c715b9991b20,7.432310e+10,0.0,2.98


In [50]:
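# Work on an explicit copy so adding a helper column does not trigger SettingWithCopyWarning
dup_transaction = dup_transaction.copy()
# Count missing values plus '0.0' quantities per row; a lower count means a more complete row
dup_transaction['nan_or_zero_count'] = dup_transaction.isna().sum(axis=1) + (dup_transaction == '0.0').sum(axis=1)
# Sort so the most complete row of each key group comes first, then keep only that row
dup_transaction_sorted = dup_transaction.sort_values(by=['nan_or_zero_count', 'RECEIPT_ID', 'USER_ID'])
rows_to_drop = dup_transaction_sorted[~dup_transaction_sorted.index.isin(
    dup_transaction_sorted.drop_duplicates(subset=['SCAN_DATE', 'RECEIPT_ID', 'USER_ID', 'BARCODE'], keep='first').index
)]
Transaction = Transaction.drop(index=rows_to_drop.index)
len(Transaction)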

24795

In [51]:
# Conversion dictionary mapping columns to target data types
conversion_map = {'PURCHASE_DATE': 'datetime','SCAN_DATE': 'datetime','FINAL_QUANTITY': 'float','FINAL_SALE': 'float'}

for col, dtype in conversion_map.items():
    if dtype == 'datetime':
        Transaction[col] = pd.to_datetime(Transaction[col], errors='coerce')
    elif dtype == 'float':
        Transaction[col] = pd.to_numeric(Transaction[col], errors='coerce')
print("\nUpdated Data Types:")
Transaction.dtypes


Updated Data Types:


RECEIPT_ID                     object
PURCHASE_DATE          datetime64[ns]
SCAN_DATE         datetime64[ns, UTC]
STORE_NAME                     object
USER_ID                        object
BARCODE                       float64
FINAL_QUANTITY                float64
FINAL_SALE                    float64
dtype: object

> Transaction Table: Check for Common rows based on Key column

In [52]:
check_UserID = Transaction.merge(User[['ID']], left_on='USER_ID', right_on='ID', how='inner', indicator=True)
print(f'Length of Common USER_ID in Transaction and User tables: {len(check_UserID)} out of {len(Transaction)}')

Length of Common USER_ID in Transaction and User tables: 130 out of 24795
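
The figure above counts matching transaction rows, not distinct users. The two can differ when a user has several transactions, so it may help to report both. This is a minimal sketch on invented IDs; `toy_users` and `toy_tx` are hypothetical stand-ins for the User and Transaction tables.

```python
import pandas as pd

# Invented IDs; only the column names come from the notebook.
toy_users = pd.DataFrame({"ID": ["u1", "u2", "u3"]})
toy_tx = pd.DataFrame({"USER_ID": ["u1", "u1", "u4", "u5"]})

# Row-level overlap: how many transaction rows have a known user.
matched_rows = int(toy_tx["USER_ID"].isin(toy_users["ID"]).sum())

# ID-level overlap: how many distinct transaction users are known.
tx_ids = set(toy_tx["USER_ID"])
common_ids = tx_ids & set(toy_users["ID"])

print(f"{matched_rows} of {len(toy_tx)} transaction rows match a user")
print(f"{len(common_ids)} of {len(tx_ids)} distinct USER_ID values match a user")
```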


In [53]:
check_barcode = Transaction.merge(Product[['BARCODE', 'CATEGORY_1']], on='BARCODE', how='left', indicator=True)
missing_category1_count = check_barcode['CATEGORY_1'].isnull().sum()
missing_percentage = (missing_category1_count /  len(check_barcode)) * 100
print(f"Percentage of rows with missing values in 'CATEGORY_1': {missing_percentage:.2f}%")

Percentage of rows with missing values in 'CATEGORY_1': 51.60%


In [54]:
#Left joined Transactions with User and Product table 
merged_df = pd.merge(Transaction, User, left_on='USER_ID', right_on='ID', how='left')
merged_df = pd.merge(merged_df, Product, on='BARCODE', how='left')
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24795 entries, 0 to 24794
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype              
---  ------          --------------  -----              
 0   RECEIPT_ID      24795 non-null  object             
 1   PURCHASE_DATE   24795 non-null  datetime64[ns]     
 2   SCAN_DATE       24795 non-null  datetime64[ns, UTC]
 3   STORE_NAME      24795 non-null  object             
 4   USER_ID         24795 non-null  object             
 5   BARCODE         21979 non-null  float64            
 6   FINAL_QUANTITY  24795 non-null  float64            
 7   FINAL_SALE      24795 non-null  float64            
 8   ID              130 non-null    object             
 9   CREATED_DATE    130 non-null    datetime64[ns, UTC]
 10  BIRTH_DATE      129 non-null    datetime64[ns, UTC]
 11  STATE           130 non-null    object             
 12  LANGUAGE        130 non-null    object             
 13  GENDER          130 non-null   

In [55]:
# Drop 'ID' column if it exists
if 'ID' in merged_df.columns:
    merged_df = merged_df.drop(columns=['ID']) 

In [56]:
missing_df = (merged_df.isnull().mean() * 100).reset_index()
missing_df.columns = ['Column', 'Missing Percentage']
fig = px.bar(missing_df[missing_df['Missing Percentage']>0],x='Missing Percentage',y='Column',orientation='h',
    title=format_title('Percentage of Missing Values per Column'),
    labels={'Missing Percentage': 'Percentage (%)'},
    text_auto='.2f',template=custom_template)
fig.add_vline(x=50, line_dash="dash", line_color="red",line_width=2,
    annotation_text="50% Benchmark", annotation_position="bottom left" )
# Horizontal bars: percentages run along the x-axis, column names along the y-axis
fig.update_layout(xaxis_title='Missing Percentage (%)', yaxis_title='Column',
                  title_x=0.5, margin=dict(l=200, r=0, t=30, b=0))
fig.show()

> Store the Three Tables and the Merged Table as Pickle Files

In [57]:
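import os
# Create the output folder if it does not exist yet, so to_pickle does not fail
os.makedirs("Cleaned Data", exist_ok=True)
Product.to_pickle("Cleaned Data/Product.pkl")
User.to_pickle("Cleaned Data/User.pkl")
Transaction.to_pickle("Cleaned Data/Transaction.pkl")
merged_df.to_pickle("Cleaned Data/merged_df.pkl")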