# Dataset Validation

This notebook performs **preliminary data validation** on the dataset used for the **Café X Product Performance Analysis** project.

The following aspects will be checked:

- Data Consistency
- Data Integrity
- Data Relevance

In [None]:
# Install necessary packages

!pip install pandas
!pip install openpyxl





[notice] A new release of pip is available: 25.2 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip

In [None]:
# Import the necessary library

import pandas as pd

In [None]:
# Load the dataset

menu = pd.read_csv("<path-to-file>/dirty_cafe_sales.csv")

## Get general information about the dataset

In [None]:
# Display basic information about the DataFrame

menu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    10000 non-null  object
 1   Item              9667 non-null   object
 2   Quantity          9862 non-null   object
 3   Price Per Unit    9821 non-null   object
 4   Total Spent       9827 non-null   object
 5   Payment Method    7421 non-null   object
 6   Location          6735 non-null   object
 7   Transaction Date  9841 non-null   object
dtypes: object(8)
memory usage: 625.1+ KB


In [None]:
# View the first three rows

menu.head(3)

|   | Transaction ID | Item | Quantity | Price Per Unit | Total Spent | Payment Method | Location | Transaction Date |
|---|---|---|---|---|---|---|---|---|
| 0 | TXN_1961373 | Coffee | 2 | 2 | 4 | Credit Card | Takeaway | 08/09/2023 |
| 1 | TXN_4977031 | Cake | 4 | 3 | 12 | Cash | In-store | 16/05/2023 |
| 2 | TXN_4271903 | Cookie | 4 | 1 | ERROR | Credit Card | In-store | 19/07/2023 |


In [None]:
# View the last three rows

menu.tail(3)

|   | Transaction ID | Item | Quantity | Price Per Unit | Total Spent | Payment Method | Location | Transaction Date |
|---|---|---|---|---|---|---|---|---|
| 9997 | TXN_5255387 | Coffee | 4 | 2.0 | 8 | Digital Wallet | NaN | 02/03/2023 |
| 9998 | TXN_7695629 | Cookie | 3 | NaN | 3 | Digital Wallet | NaN | 02/12/2023 |
| 9999 | TXN_6170729 | Sandwich | 3 | 4.0 | 12 | Cash | In-store | 07/11/2023 |


`Transaction ID` is the dataset's unique identifier, with each value representing a specific transaction, so this column is checked for duplicates first.

In [None]:
# Cardinality Check

menu['Transaction ID'].nunique(dropna=False)

10000
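
A count of 10000 distinct values across 10000 rows means every ID is unique. As a complementary check, the sketch below uses `Series.is_unique` and `Series.duplicated` to list any repeated IDs directly. The `sample` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample with one repeated ID, for illustration only
sample = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_2", "TXN_3"],
    "Item": ["Coffee", "Cake", "Cake", "Tea"],
})

ids = sample["Transaction ID"]

# True only when every value in the column is distinct
all_unique = ids.is_unique

# keep=False flags every occurrence of a repeated ID, not just the later ones
repeated_rows = sample[ids.duplicated(keep=False)]

print(all_unique)
print(repeated_rows)
```

Applied to the real data, `menu['Transaction ID'].is_unique` should agree with the cardinality check above.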

With 10,000 unique IDs across 10,000 rows, no transaction has a duplicate entry. The remaining columns are validated next.

In [None]:
# Check for unexpected values and frequency counts

menu['Item'].value_counts(dropna=False)

Item
Juice       1171
Coffee      1165
Salad       1148
Cake        1139
Sandwich    1131
Smoothie    1096
Cookie      1092
Tea         1089
UNKNOWN      344
NaN          333
ERROR        292
Name: count, dtype: int64

In [None]:
# Check for unexpected values and frequency counts

menu['Payment Method'].value_counts(dropna=False)

Payment Method
NaN               2579
Digital Wallet    2291
Credit Card       2273
Cash              2258
ERROR              306
UNKNOWN            293
Name: count, dtype: int64

In [None]:
# Check for unexpected values and frequency counts

menu['Location'].value_counts(dropna=False)

Location
NaN         3265
Takeaway    3022
In-store    3017
ERROR        358
UNKNOWN      338
Name: count, dtype: int64

In [None]:
# Check for unexpected values and frequency counts

menu['Transaction Date'].value_counts(dropna=False)

Transaction Date
UNKNOWN       159
NaN           159
ERROR         142
16/06/2023     40
06/02/2023     40
             ... 
24/11/2023     15
30/07/2023     15
11/03/2023     14
22/07/2023     14
17/02/2023     14
Name: count, Length: 368, dtype: int64

In [None]:
# Check for unexpected values and frequency counts

menu['Quantity'].value_counts(dropna=False)

Quantity
5          2013
2          1974
4          1863
3          1849
1          1822
UNKNOWN     171
ERROR       170
NaN         138
Name: count, dtype: int64

In [None]:
# Check for unexpected values and frequency counts

menu['Price Per Unit'].value_counts(dropna=False)

Price Per Unit
3          2429
4          2331
2          1227
5          1204
1          1143
1.5        1133
ERROR       190
NaN         179
UNKNOWN     164
Name: count, dtype: int64

In [None]:
# Check for unexpected values and frequency counts

menu['Total Spent'].value_counts(dropna=False)

Total Spent
6          979
12         939
3          930
4          923
20         746
15         734
8          677
10         524
2          497
9          479
5          468
16         444
25         259
7.5        237
1          232
4.5        225
1.5        205
NaN        173
UNKNOWN    165
ERROR      164
Name: count, dtype: int64
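
The per-column counts above can also be condensed into a single table. The following is a minimal sketch, not part of the original notebook: `summarize_invalid` and `PLACEHOLDERS` are hypothetical names, and the `demo` frame is made-up data. It assumes `UNKNOWN` and `ERROR` are the only placeholder strings, as the outputs above suggest.

```python
import pandas as pd

# Placeholder strings observed in the value counts (assumed to be exhaustive)
PLACEHOLDERS = ["UNKNOWN", "ERROR"]


def summarize_invalid(df: pd.DataFrame) -> pd.DataFrame:
    """Count NaN and placeholder entries per column, worst column first."""
    summary = pd.DataFrame({
        "nan": df.isna().sum(),
        "placeholder": df.isin(PLACEHOLDERS).sum(),
    })
    summary["invalid_total"] = summary["nan"] + summary["placeholder"]
    summary["invalid_pct"] = (summary["invalid_total"] / len(df) * 100).round(1)
    return summary.sort_values("invalid_total", ascending=False)


# Tiny made-up frame, for illustration only
demo = pd.DataFrame({
    "Item": ["Coffee", "UNKNOWN", None, "Tea"],
    "Quantity": ["2", "ERROR", "3", "1"],
})

summary = summarize_invalid(demo)
print(summary)
```

Calling `summarize_invalid(menu)` on the loaded dataset should reproduce the sums of the `NaN`, `UNKNOWN`, and `ERROR` rows shown in each output above.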

## Findings

- Several fields do not contribute to the project’s objectives and add unnecessary noise.
- Column headers are in **Title Case** with spaces instead of the `snake_case` convention planned for the cleaned dataset.
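- Multiple columns contain missing or invalid values, with some having so many that they are no longer usable. The most affected are `Location` (3,961 of 10,000 rows are `NaN`, `ERROR`, or `UNKNOWN`) and `Payment Method` (3,178 of 10,000).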
- Missing-value indicators are not standardized (e.g., `UNKNOWN`, `ERROR`, blank cells).

## To-Do List for Dataset Cleaning

**1. Remove Irrelevant and Unusable Columns**
- Drop fields that do not contribute to the project objectives.

**2. Convert Column Headers to snake_case**
- Standardize all column names to lowercase with underscores.

**3. Handle Missing and Invalid Values**
- Supplement missing values where possible.
- Standardize missing-value representation.
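
Steps 2 and 3 could be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the project's actual cleaning code: the helper names (`to_snake_case`, `standardize`, `fill_total_spent`) are hypothetical, the `demo` frame is made-up data, and the fill rule assumes that `Total Spent` equals `Quantity` times `Price Per Unit`, which the sample rows shown earlier are consistent with but which should be verified on the full dataset.

```python
import pandas as pd

# Placeholder strings seen in the value counts (assumed to be exhaustive)
PLACEHOLDERS = ["UNKNOWN", "ERROR"]


def to_snake_case(name: str) -> str:
    """Lowercase a header and join its words with underscores."""
    return "_".join(name.strip().lower().split())


def standardize(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Rename headers, turn placeholders into NaN, and parse numeric columns."""
    out = df.rename(columns=to_snake_case)
    # mask() replaces every cell where the condition is True with NaN
    out = out.mask(out.isin(PLACEHOLDERS))
    for col in numeric_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


def fill_total_spent(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing totals, ASSUMING total = quantity * price per unit."""
    out = df.copy()
    computed = out["quantity"] * out["price_per_unit"]
    out["total_spent"] = out["total_spent"].fillna(computed)
    return out


# Tiny made-up frame, for illustration only
demo = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_3"],
    "Quantity": ["2", "ERROR", "4"],
    "Price Per Unit": ["2", "3", "1"],
    "Total Spent": ["4", "9", "UNKNOWN"],
})

clean = standardize(demo, ["quantity", "price_per_unit", "total_spent"])
clean = fill_total_spent(clean)
print(clean)
```

Using `mask` with `isin` keeps the placeholder handling in one place, so a newly discovered placeholder string only needs to be added to `PLACEHOLDERS`.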




