<a href="https://colab.research.google.com/github/Ted-star7/Liquor-store-data-segregation/blob/main/Liquor_store_data_segregation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Uzapoint Liquor Data Classification & Cleansing Framework
## Data Integrity, Taxonomy Standardization & POS Optimization

**Author:** Teddy Kibuthu  
**Organization:** Uzapoint  
**Objective:** Standardize liquor product taxonomy, improve data integrity, and design a clean classification structure for POS optimization.

---

## 1. Executive Summary

This project focuses on cleaning, standardizing, and restructuring liquor product data to improve classification accuracy, onboarding simplicity, and reporting reliability within the Uzapoint POS system.

The dataset currently contains product-level classification fields including:

- Category
- Subcategory
- Product ID
- Product Label
- Product Image

However, inconsistencies in naming, classification structure, duplication, and missing image references may reduce data quality and operational efficiency.

This notebook establishes:

1. A structured cleaning pipeline
2. A standardized liquor taxonomy framework
3. A segregation model for anomaly detection
4. Business intelligence insights for POS enhancement
5. A final clean dataset ready for Power BI and POS integration

---

## 2. Business Objectives

The primary goals of this analysis are:

- Improve product classification consistency
- Detect and segregate anomalies
- Simplify liquor store onboarding structure
- Enhance POS search and filtering performance
- Prepare a clean, BI-ready dataset
- Enable future intelligent automation within Uzapoint

---

## 3. Project Roadmap

This notebook will proceed in structured stages:

1. Data Loading
2. Initial Data Audit
3. Data Cleaning & Standardization
4. Segregation of Anomalies
5. Image Integrity Analysis
6. Taxonomy Restructuring
7. Business Intelligence Insights
8. POS Integration Opportunities
9. Final Clean Dataset Export

## 4. Data Loading

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


This section loads the liquor product dataset from Google Drive into the Colab environment.

The dataset contains the following core fields:

- Category
- Subcategory
- Product ID
- Product Label
- Product Image (URL)

The objective of this stage is to:

- Load the dataset
- Validate its structure
- Confirm row count
- Preview initial records

In [6]:
import pandas as pd

# Define the path to the CSV on the mounted Google Drive
file_path = "/content/drive/My Drive/Dataset/Liquor Store Products.csv"

# Load the CSV into a DataFrame
df = pd.read_csv(file_path)

# Display the first 5 rows
print("Successfully loaded data. Here are the first 5 rows:")
print(df.head())

Successfully loaded data. Here are the first 5 rows:
  Category Subcategory  Product ID           Product Label  \
0    BEERS        CANS       10327  Tusker Lager Can 500ml   
1    BEERS        CANS       10328   Tusker Malt Can 500ml   
2    BEERS        CANS       10329   Tusker Lite Can 500ml   
3    BEERS        CANS       10330  Tusker Cider Can 500ml   
4    BEERS        CANS       10331     Guinness Can 500 Ml   

                                       Product Image  
0  https://uzapointerp.uzahost.com/uploads/produc...  
1  https://uzapointerp.uzahost.com/uploads/produc...  
2  https://uzapointerp.uzahost.com/uploads/produc...  
3  https://uzapointerp.uzahost.com/uploads/produc...  
4  https://uzapointerp.uzahost.com/uploads/produc...  
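
The "validate its structure" step can also be expressed as an explicit check against the expected field list. The sketch below is one possible approach, not part of the original notebook: `validate_structure`, `EXPECTED_COLUMNS`, and the two-row `sample` frame are hypothetical names introduced for illustration. In the notebook itself the function would be called on `df`.

```python
import pandas as pd

# Column names taken from the field list in this section
EXPECTED_COLUMNS = [
    "Category",
    "Subcategory",
    "Product ID",
    "Product Label",
    "Product Image",
]


def validate_structure(frame, expected_columns):
    """Return (missing, unexpected) column names relative to the expected schema."""
    actual = list(frame.columns)
    missing = [c for c in expected_columns if c not in actual]
    unexpected = [c for c in actual if c not in expected_columns]
    return missing, unexpected


# Two-row stand-in for df, built from the preview above. The image URLs are
# truncated in that preview, so they are left as None here (assumption).
sample = pd.DataFrame({
    "Category": ["BEERS", "BEERS"],
    "Subcategory": ["CANS", "CANS"],
    "Product ID": [10327, 10328],
    "Product Label": ["Tusker Lager Can 500ml", "Tusker Malt Can 500ml"],
    "Product Image": [None, None],
})

missing, unexpected = validate_structure(sample, EXPECTED_COLUMNS)
print("Missing columns:", missing)
print("Unexpected columns:", unexpected)

# A deliberately broken copy, to show what a failed check reports
broken = sample.drop(columns=["Product Image"]).rename(columns={"Category": "category"})
broken_missing, broken_unexpected = validate_structure(broken, EXPECTED_COLUMNS)
print("Broken copy, missing:", broken_missing)
print("Broken copy, unexpected:", broken_unexpected)
```

Returning both lists rather than raising immediately makes it easier to log every schema problem in one pass before deciding whether to stop the pipeline.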


In [7]:
# Check the dataset shape (rows, columns)
print("Dataset shape:", df.shape)

# Check column names
print("\ncolumns")
print(df.columns)

# Check dtypes and non-null counts.
# df.info() prints its report directly and returns None, so it is not wrapped in print().
print("\nDataset info")
df.info()


Dataset shape: (47944, 5)

columns
Index(['Category', 'Subcategory', 'Product ID', 'Product Label',
       'Product Image'],
      dtype='object')

Dataset info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47944 entries, 0 to 47943
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Category       47944 non-null  object
 1   Subcategory    47944 non-null  object
 2   Product ID     47944 non-null  int64 
 3   Product Label  47944 non-null  object
 4   Product Image  7977 non-null   object
dtypes: int64(1), object(4)
memory usage: 1.8+ MB


### 4.1 Data Structure Observations

The dataset contains:

- **Total Rows:** 47,944
- **Total Columns:** 5
- All core classification fields are complete (no nulls)
- Product Image column has significant missing values

#### Image Completeness Insight

Only 7,977 of the 47,944 products (about 16.6%) have image URLs.

The remaining 39,967 products, approximately 83.4%, lack image references.

This has implications for:

- POS user interface experience
- Product identification accuracy
- Duplicate detection using image similarity
- Onboarding simplicity
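
The percentages above follow directly from the counts reported by `df.info()`. The short check below reproduces the arithmetic using only those two numbers; the variable names are illustrative.

```python
# Counts reported by df.info() in the previous cell
total_rows = 47_944
rows_with_image = 7_977

missing = total_rows - rows_with_image
coverage_pct = rows_with_image / total_rows * 100
missing_pct = missing / total_rows * 100

print(f"Rows without an image URL: {missing}")
print(f"Image coverage: {coverage_pct:.2f}%")
print(f"Missing share:  {missing_pct:.2f}%")
```

On the loaded DataFrame the same coverage figure could be obtained with `df['Product Image'].notna().mean() * 100`.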

The next phase will focus on conducting a structured Data Audit.

## 5. Initial Data Audit

This section evaluates:

- Duplicate Product IDs
- Duplicate Product Labels
- Category consistency
- Subcategory consistency
- Image completeness ratio
- Whitespace and formatting inconsistencies

In [8]:
# Count repeated Product IDs (occurrences beyond the first)
duplicates_id = df['Product ID'].duplicated().sum()

# Count repeated Product Labels (occurrences beyond the first)
duplicates_label = df['Product Label'].duplicated().sum()

# Count unique categories
unique_categories = df['Category'].nunique()

# Count unique subcategories
unique_subcategories = df['Subcategory'].nunique()

# Count missing images and compute image coverage
missing_images = df['Product Image'].isnull().sum()
image_coverage = (1 - (missing_images / len(df))) * 100

print("Duplicate Product IDs:", duplicates_id)
print("Duplicate Product Labels:", duplicates_label)
print("Unique Categories:", unique_categories)
print("Unique Subcategories:", unique_subcategories)
print("Missing Images:", missing_images)
print("Image Coverage %:", image_coverage)



Duplicate Product IDs: 1
Duplicate Product Labels: 15190
Unique Categories: 391
Unique Subcategories: 1045
Missing Images: 39967
Image Coverage %: 16.638161188052724
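
The counts above do not yet cover the whitespace and formatting item from the audit list, and `duplicated().sum()` reports only how many repeats exist, not which rows they are. The sketch below shows one way to follow up on both points. It runs on a small made-up frame (`sample`), and `normalize_text` is a hypothetical helper, so none of the values come from the real dataset. In the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for df; every value here is illustrative, not from the dataset
sample = pd.DataFrame({
    "Category": ["BEERS", "Beers ", "beers", "WINES", "Wines"],
    "Product ID": [1, 2, 2, 3, 4],
    "Product Label": ["Lager 500ml", "Lager  500ml", "Stout 500ml", "Red 750ml", "Red 750ml"],
})


def normalize_text(series):
    """Trim, collapse runs of whitespace to one space, and upper-case a text column."""
    return (
        series.str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.upper()
    )


# Rows whose Category carries leading or trailing whitespace
padded_rows = int((sample["Category"] != sample["Category"].str.strip()).sum())

# Distinct categories before and after normalization; a large drop suggests
# that many "categories" are only spelling or casing variants of each other
raw_unique = sample["Category"].nunique()
normalized_unique = normalize_text(sample["Category"]).nunique()

# keep=False flags every row of a duplicated group, so the rows can be inspected
dup_id_rows = sample[sample["Product ID"].duplicated(keep=False)]

print("Rows with padded Category:", padded_rows)
print("Unique categories (raw):", raw_unique)
print("Unique categories (normalized):", normalized_unique)
print(dup_id_rows)
```

Comparing `nunique()` before and after normalization gives a rough estimate of how much of the category count is caused by formatting alone, which is useful input for the cleaning stage.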
