# Day 1 Exercise: Cleaning Messy Cafe Sales Data

**Name:** Anton Shestakov  
**Date:** October 8, 2025

---

## Objective

Transform a messy cafe sales dataset into a tidy format, designate and validate a primary key, and create summary tables.

## Dataset

**File:** `../data/day1/dirty_cafe_sales.csv`  
**Rows:** 10,000 cafe transactions  
**Data Dictionary:** See `../data/day1/README.md`

## Deliverable

This notebook should **"Restart & Run All"** successfully when you're done!

---

## Section 1: Setup and Data Loading

### TODO 1: Import libraries

In [91]:
# TODO 1: Import pandas and numpy
# Uncomment the lines below and run this cell:

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


### TODO 2: Load the data

In [92]:
# TODO 2: Load the data
# Uncomment the lines below and run this cell:

df = pd.read_csv('../data/day1/dirty_cafe_sales.csv')
print(f"✅ Data loaded: {len(df):,} rows")

✅ Data loaded: 10,000 rows


---

## Section 2: Initial Exploration

Before cleaning, let's understand what we have.

### TODO 3: Display basic information

In [93]:
# TODO 3: Display the shape of the dataframe
# Uncomment and run:

print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

Dataset shape: 10,000 rows × 8 columns


In [94]:
# TODO 3 (continued): Display the first 10 rows
# Uncomment and run:

df.head(10)

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
2,TXN_4271903,Cookie,4,1.0,ERROR,Credit Card,In-store,2023-07-19
3,TXN_7034554,Salad,2,5.0,10.0,UNKNOWN,UNKNOWN,2023-04-27
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11
5,TXN_2602893,Smoothie,5,4.0,20.0,Credit Card,,2023-03-31
6,TXN_4433211,UNKNOWN,3,3.0,9.0,ERROR,Takeaway,2023-10-06
7,TXN_6699534,Sandwich,4,4.0,16.0,Cash,UNKNOWN,2023-10-28
8,TXN_4717867,,5,3.0,15.0,,Takeaway,2023-07-28
9,TXN_2064365,Sandwich,5,4.0,20.0,,In-store,2023-12-31


In [95]:
# TODO 3 (continued): Display column names and types
# Uncomment and run:

print("Column Types:")
print(df.dtypes)

Column Types:
Transaction ID      object
Item                object
Quantity            object
Price Per Unit      object
Total Spent         object
Payment Method      object
Location            object
Transaction Date    object
dtype: object


### TODO 4: Check for missing values

In [96]:
# TODO 4: Count missing values (NaN) in each column
# Uncomment and run:

print("Missing Values (NaN) per column:")
print(df.isnull().sum())

Missing Values (NaN) per column:
Transaction ID         0
Item                 333
Quantity             138
Price Per Unit       179
Total Spent          173
Payment Method      2579
Location            3265
Transaction Date     159
dtype: int64


### TODO 5: Check for sentinel values

Look for "ERROR" and "UNKNOWN" in the data.

In [97]:
# TODO 5: Count "ERROR" values in each column
# Uncomment and run:

print("'ERROR' values per column:")
print((df == 'ERROR').sum())

'ERROR' values per column:
Transaction ID        0
Item                292
Quantity            170
Price Per Unit      190
Total Spent         164
Payment Method      306
Location            358
Transaction Date    142
dtype: int64


In [98]:
# TODO 5 (continued): Count "UNKNOWN" values in each column
# Uncomment and run:

print("'UNKNOWN' values per column:")
print((df == 'UNKNOWN').sum())

'UNKNOWN' values per column:
Transaction ID        0
Item                344
Quantity            171
Price Per Unit      164
Total Spent         165
Payment Method      293
Location            338
Transaction Date    159
dtype: int64


### Reflection: What Issues Did You Find?

**TODO:** Write 2-3 sentences describing the data quality issues you observed.

The data contains a large number of missing values. Depending on the variable, the share of "bad" values (NaN, "ERROR", or "UNKNOWN") reaches almost 40%, as it does for "Location". In addition, every column is stored with the `object` dtype, which is wrong for the columns that hold numeric values or dates.
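
As a rough illustration of how the share of "bad" values per column could be measured in one step, here is a minimal sketch. The five-row `sample` frame and the `SENTINELS` list are made up for the example; they only mimic the kinds of values seen in the outputs above.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row sample that mimics the bad values seen in the real data.
sample = pd.DataFrame({
    "Item": ["Coffee", "UNKNOWN", np.nan, "Cake", "ERROR"],
    "Location": ["Takeaway", np.nan, "UNKNOWN", "ERROR", "In-store"],
})

SENTINELS = ["ERROR", "UNKNOWN"]

# A cell counts as "bad" if it is NaN or holds one of the sentinel strings.
bad_mask = sample.isna() | sample.isin(SENTINELS)

# The mean of a boolean column is the share of True values in it.
bad_share = bad_mask.mean().mul(100).round(1)
print(bad_share)
```

Applied to the full dataframe, the same two lines would combine the three separate counts (NaN, "ERROR", "UNKNOWN") into a single percentage per column.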

---

## Section 3: Is This Data Tidy?

### TODO 6: Evaluate against tidy data principles

**The Three Rules:**
1. Each variable is a column
2. Each observation is a row
3. Each value is a cell

**Questions to answer in markdown:**

1. What is the unit of observation in this dataset? (What does each row represent?)

A single transaction.

2. Does each variable have its own column?

Yes

3. Is this dataset tidy? Why or why not?

By the three rules of tidy data, this dataset can be called tidy. Each variable is a separate column, each observation (a transaction, identified by its Transaction ID) is a row, and each value is a cell. However, given the large number of missing values, the dataset is tidy only structurally. To be ready for analysis, the 'bad' values (NaN, 'ERROR', 'UNKNOWN') still need to be cleaned up.

---

## Section 4: Identify and Validate Primary Key

### TODO 7: Identify the primary key candidate

In [99]:
# TODO 7: Check if 'Transaction ID' is unique
# Uncomment and run:

is_unique = df['Transaction ID'].is_unique
print(f"Is 'Transaction ID' unique? {is_unique}")
print(f"Total rows: {len(df):,}")
print(f"Unique Transaction IDs: {df['Transaction ID'].nunique():,}")

Is 'Transaction ID' unique? True
Total rows: 10,000
Unique Transaction IDs: 10,000


In [100]:
# TODO 7 (continued): Check for any NULL values in 'Transaction ID'
# Uncomment and run:

null_count = df['Transaction ID'].isnull().sum()
print(f"NULL Transaction IDs: {null_count}")

NULL Transaction IDs: 0


In [101]:
# TODO 7 (continued): If there are duplicates, find them
# Uncomment and run:

duplicates = df[df.duplicated(subset=['Transaction ID'], keep=False)]
print(f"Duplicate rows: {len(duplicates)}")
if len(duplicates) > 0:
     print("\nShowing first few duplicates:")
     display(duplicates.head())

Duplicate rows: 0


In [102]:
nan = df['Transaction ID'].isna().sum()
error = (df['Transaction ID'] == "ERROR").sum()

print(f"The number of NaN: {nan}")
print(f"The number of ERROR: {error}")

The number of NaN: 0
The number of ERROR: 0


### TODO 8: Write validation assertions

Once you've confirmed (or fixed) the primary key, write assertions to prove it.

In [103]:
# TODO 8: Add assertions to validate primary key
# Uncomment and run (these will error if checks fail):

assert df['Transaction ID'].is_unique, "❌ Duplicate transaction IDs found"
assert df['Transaction ID'].notna().all(), "❌ NULL transaction IDs found"
assert df['Transaction ID'].isna().sum() == 0, "❌ NA transaction IDs found"
assert (df['Transaction ID'] == "ERROR").sum() == 0, "❌ ERROR transaction IDs found"
print("✅ Transaction ID is a valid primary key")

✅ Transaction ID is a valid primary key


### Reflection: Primary Key

**TODO:** Explain what you found and any decisions you made.

_[Your reflection here: Is Transaction ID a good primary key? Did you find any issues? How did you handle them?]_

By definition, a primary key is a unique identifier, and Transaction ID fits this role well. First, we found no duplicates: every value in this column is unique. Second, there are no NULL/NaN values and no ERROR values (I added a new code cell to check for NaN/ERROR values and extended the validation assertions accordingly).
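
The three checks described here (missing values, sentinel values, duplicates) could also be bundled into one reusable helper. The sketch below is only an illustration: the function name `primary_key_problems` and the two tiny frames are hypothetical and not part of the exercise template.

```python
import pandas as pd

def primary_key_problems(frame, column, sentinels=("ERROR", "UNKNOWN")):
    """Return a list of problems found; an empty list means the key looks valid."""
    problems = []
    key = frame[column]
    if key.isna().any():
        problems.append("missing values")
    if key.isin(list(sentinels)).any():
        problems.append("sentinel values")
    if key.duplicated().any():
        problems.append("duplicates")
    return problems

# Hypothetical examples: one clean key column and one with every kind of problem.
good = pd.DataFrame({"Transaction ID": ["TXN_1", "TXN_2", "TXN_3"]})
bad = pd.DataFrame({"Transaction ID": ["TXN_1", "TXN_1", None, "ERROR"]})

print(primary_key_problems(good, "Transaction ID"))
print(primary_key_problems(bad, "Transaction ID"))
```

Returning a list instead of raising makes it possible to report all problems at once; an `assert not primary_key_problems(...)` line would turn it back into a hard check.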

---

## Section 5: Handle Missing Values

### TODO 9: Standardize missing value representations

Convert "ERROR", "UNKNOWN", and empty strings to NaN.

In [104]:
# TODO 9: Replace sentinel values with NaN
# Uncomment and run:

df = df.replace(['ERROR', 'UNKNOWN', ''], np.nan)
print("✅ Replaced 'ERROR', 'UNKNOWN', and empty strings with NaN")

✅ Replaced 'ERROR', 'UNKNOWN', and empty strings with NaN
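
An alternative worth knowing, sketched here under the assumption that the sentinel strings are known before loading: `pd.read_csv` accepts an `na_values` argument, so the same standardization can happen at read time. The inline CSV text below is a made-up three-row stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file, with one sentinel of each kind.
raw_csv = io.StringIO(
    "Transaction ID,Item,Quantity\n"
    "TXN_1,Coffee,2\n"
    "TXN_2,UNKNOWN,ERROR\n"
    "TXN_3,,3\n"
)

# na_values adds extra strings to the set that pandas already treats as missing.
sample = pd.read_csv(raw_csv, na_values=["ERROR", "UNKNOWN"])

print(sample.isna().sum())
print(sample.dtypes)
```

A side effect of this route is that numeric columns no longer contain stray strings, so pandas can parse them as numbers directly instead of falling back to `object`.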


### TODO 10: Decide how to handle missing values

**Options:**
- Drop rows with missing values in critical columns
- Fill with default values
- Keep as NaN (document impact on analysis)

**Your strategy:**

Strategy: I will drop rows with NaN in 'Item' or 'Quantity'. These two columns are key: without them, the rest of a transaction record loses its meaning. The analysis can proceed without supplementary data on location and payment method, but if we don't know which item was sold, the remaining information is no longer useful.

P.S. Naturally, we might be interested in location analysis, in which case we should keep this data. Here, however, I assume that we are interested primarily in the transactions themselves, so we need the core information on item and quantity.
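
To make the effect of this strategy concrete, here is a minimal sketch on a hypothetical four-row frame (`sample` and `critical` are illustration names only). It drops rows that lack a critical value and reports how many rows survive, while missing values in other columns are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: rows 1 and 2 each lack one critical value.
sample = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_3", "TXN_4"],
    "Item": ["Coffee", np.nan, "Cake", "Tea"],
    "Quantity": ["2", "1", np.nan, "3"],
    "Location": [np.nan, "Takeaway", "In-store", np.nan],
})

critical = ["Item", "Quantity"]

# dropna(subset=...) removes a row if ANY of the listed columns is missing.
kept = sample.dropna(subset=critical)
dropped = len(sample) - len(kept)

print(f"Kept {len(kept)} of {len(sample)} rows ({dropped} dropped)")
print(f"Missing Location values still present: {kept['Location'].isna().sum()}")
```

Reporting the number of dropped rows next to the drop itself keeps the cost of the decision visible to anyone reading the notebook.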

In [105]:
# TODO 9 (continued): Check missing values again after standardization
# Uncomment and run:

print("Missing values after standardization:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum():,}")

Missing values after standardization:
Transaction ID         0
Item                 969
Quantity             479
Price Per Unit       533
Total Spent          502
Payment Method      3178
Location            3961
Transaction Date     460
dtype: int64

Total missing values: 10,082


In [106]:
df.head()

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
2,TXN_4271903,Cookie,4,1.0,,Credit Card,In-store,2023-07-19
3,TXN_7034554,Salad,2,5.0,10.0,,,2023-04-27
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11


In [107]:
# TODO 10: Implement your missing value strategy
# This is a decision point - choose your approach!
# Chosen option: drop rows with NaN in the critical columns Item and Quantity
# (documented in the strategy above); NaN values in other columns are kept.
# The drop itself happens in the next cell.
print("✅ Missing value strategy: Dropping NaN values for critical indicators: Item and Quantity")

✅ Missing value strategy: Dropping NaN values for critical indicators: Item and Quantity


In [108]:
df = df.dropna(subset=['Quantity', 'Item'])
print("Missing values after dropping rows without Item or Quantity:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum():,}")

Missing values after dropping rows without Item or Quantity:
Transaction ID         0
Item                   0
Quantity               0
Price Per Unit       464
Total Spent          432
Payment Method      2732
Location            3418
Transaction Date     396
dtype: int64

Total missing values: 7,442


---

## Section 6: Fix Type Issues

### TODO 11: Convert Quantity to integer

#### The coercing conversion from TODO 11 is not needed because all NaN values in the Quantity column were dropped; a direct cast to `Int64` is used below

In [109]:
df.dtypes

Transaction ID      object
Item                object
Quantity            object
Price Per Unit      object
Total Spent         object
Payment Method      object
Location            object
Transaction Date    object
dtype: object

In [110]:
# TODO 11: Convert Quantity to integer
# Uncomment and run:

#df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce').astype('Int64')
#print("✅ Quantity converted to Int64 (allows NaN)")

In [111]:
df['Quantity'] = df['Quantity'].astype('Int64')
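
For comparison, here is a minimal sketch of the coercing route from the commented-out TODO 11 cell, which also works when missing or unparseable values are still present. The `raw_quantity` series is a hypothetical example, not data from the file.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column: numeric strings, a missing value, and a stray word.
raw_quantity = pd.Series(["2", "4", np.nan, "five"], name="Quantity")

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_number = pd.to_numeric(raw_quantity, errors="coerce")

# The nullable Int64 dtype stores whole numbers alongside <NA>.
as_int = as_number.astype("Int64")

print(as_int)
```

The direct cast used in the notebook is fine once the column contains only clean numeric strings; the coercing version is the safer default when that is not guaranteed.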

### TODO 12: Convert prices to float

In [112]:
# TODO 12: Convert 'Price Per Unit' to float
# Uncomment and run:

df['Price Per Unit'] = pd.to_numeric(df['Price Per Unit'], errors='coerce')
print("✅ 'Price Per Unit' converted to float64")

✅ 'Price Per Unit' converted to float64


In [113]:
# TODO 12 (continued): Convert 'Total Spent' to float
# Uncomment and run:

df['Total Spent'] = pd.to_numeric(df['Total Spent'], errors='coerce')
print("✅ 'Total Spent' converted to float64")

✅ 'Total Spent' converted to float64


### TODO 13: Convert Transaction Date to datetime

In [114]:
# TODO 13: Parse Transaction Date as datetime
# Uncomment and run:

df['Transaction Date'] = pd.to_datetime(df['Transaction Date'], errors='coerce')
print("✅ 'Transaction Date' converted to datetime64")

✅ 'Transaction Date' converted to datetime64


### TODO 14: Verify types

In [115]:
# TODO 14: Display dtypes to verify conversions worked
# Uncomment and run:

print("Updated Column Types:")
print(df.dtypes)

Updated Column Types:
Transaction ID              object
Item                        object
Quantity                     Int64
Price Per Unit             float64
Total Spent                float64
Payment Method              object
Location                    object
Transaction Date    datetime64[ns]
dtype: object


### TODO 15: Write type assertions

In [116]:
# TODO 15: Add assertions to validate types
# Uncomment and run:

assert df['Quantity'].dtype in ['int64', 'Int64'], "❌ Quantity should be integer"
assert df['Price Per Unit'].dtype == 'float64', "❌ Price should be float"
assert df['Transaction Date'].dtype == 'datetime64[ns]', "❌ Date should be datetime"
print("✅ All types are correct!")

✅ All types are correct!


---

## Section 7: Validate Data Integrity

### TODO 16: Check if Total Spent = Quantity × Price Per Unit

In [117]:
# TODO 16: Calculate expected total
# Uncomment and run:

df['Calculated Total'] = df['Quantity'] * df['Price Per Unit']
print("✅ Calculated expected totals")

✅ Calculated expected totals


In [118]:
# TODO 16 (continued): Compare with actual Total Spent
# This uses np.isclose() for float comparison (allows tiny rounding differences)
# Uncomment and run:

mask = df['Total Spent'].notna() & df['Calculated Total'].notna()
mismatches = ~np.isclose(
     df.loc[mask, 'Total Spent'], 
     df.loc[mask, 'Calculated Total'],
     rtol=1e-05  # Relative tolerance for floating point comparison
 )
print(f"Mismatches found: {mismatches.sum()} out of {mask.sum()} rows with data")

Mismatches found: 0 out of 7732 rows with data


### TODO 17: Check for impossible values

In [119]:
# TODO 17: Check for negative or zero prices
# Uncomment and run:

bad_prices = df[df['Price Per Unit'] <= 0]
print(f"Rows with price <= 0: {len(bad_prices)}")
if len(bad_prices) > 0:
     display(bad_prices[['Transaction ID', 'Item', 'Price Per Unit']].head())

Rows with price <= 0: 0


In [120]:
# TODO 17 (continued): Check for zero or negative quantities
# Uncomment and run:

bad_qty = df[df['Quantity'] <= 0]
print(f"Rows with quantity <= 0: {len(bad_qty)}")
if len(bad_qty) > 0:
     display(bad_qty[['Transaction ID', 'Item', 'Quantity']].head())

Rows with quantity <= 0: 0


### Reflection: Data Integrity

**TODO:** What did you find? How did you handle integrity issues?

We carried out standard data integrity checks. First, a new column, "Calculated Total", was created as the product of Quantity and Price Per Unit. Then we checked whether the existing column "Total Spent", which should equal that product, matches "Calculated Total" in every row where both values are present (7,732 rows). Next, we checked the "Price Per Unit" and "Quantity" columns for impossible values, i.e. whether all values are positive. In the end, we found no mismatches and no invalid values.
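
Since the real data produced no mismatches, it can be useful to see what the check reports when something is wrong. The sketch below uses a hypothetical four-row frame in which one total is deliberately incorrect and one is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 2 has a wrong total, row 3 has no total at all.
sample = pd.DataFrame({
    "Quantity": [2, 3, 4, 1],
    "Price Per Unit": [2.0, 1.5, 3.0, 5.0],
    "Total Spent": [4.0, 4.5, 10.0, np.nan],
})

expected = sample["Quantity"] * sample["Price Per Unit"]

# Only rows where both sides are known can be compared.
comparable = sample["Total Spent"].notna() & expected.notna()

# np.isclose tolerates tiny floating point differences; NaN never counts as close.
mismatch = comparable & ~np.isclose(sample["Total Spent"], expected, rtol=1e-05)

print(f"Mismatches: {mismatch.sum()} out of {comparable.sum()} comparable rows")
print(sample.loc[mismatch])
```

Keeping the `comparable` mask separate prevents rows with a missing total from being counted as mismatches.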

---

## Section 8: Create Summary Tables

Now that data is clean, answer some business questions!

### TODO 18: Total sales by payment method

In [121]:
df.shape

(8611, 9)

In [122]:
# TODO 18: Calculate total revenue and transaction count by payment method
# Uncomment and run (this one is fully worked as an example):

payment_summary = df.groupby('Payment Method').agg({
     'Total Spent': 'sum',
     'Transaction ID': 'count'
 }).round(2)
 
payment_summary.columns = ['Total Revenue', 'Transaction Count']
payment_summary = payment_summary.sort_values('Total Revenue', ascending=False)
 
print("Sales by Payment Method:")
display(payment_summary)

Sales by Payment Method:


Payment Method,Total Revenue,Transaction Count
Credit Card,16878.0,1951
Digital Wallet,16833.0,1970
Cash,16725.5,1958


### TODO 19: Most popular items

In [123]:
# TODO 19: Find most popular items by quantity sold
# Pattern: df.groupby('Column')['Metric'].sum().sort_values(ascending=False)
# Uncomment and adapt:

popular_items = df.groupby('Item')['Quantity'].sum().sort_values(ascending=False)
print("Most Popular Items (by quantity):")
display(popular_items.head(10))

Most Popular Items (by quantity):


Item
Juice       3373
Coffee      3368
Cake        3329
Salad       3310
Sandwich    3245
Smoothie    3221
Tea         3154
Cookie      3090
Name: Quantity, dtype: Int64

In [124]:
# TODO 19 (continued): Find highest revenue items
# Use the same pattern but with 'Total Spent' instead of 'Quantity'
# Uncomment and adapt:

revenue_items = df.groupby('Item')['Total Spent'].sum().sort_values(ascending=False).round(2)
print("Highest Revenue Items:")
display(revenue_items.head(10))

Highest Revenue Items:


Item
Salad       15810.0
Sandwich    12220.0
Smoothie    12096.0
Juice        9588.0
Cake         9516.0
Coffee       6448.0
Tea          4506.0
Cookie       2928.0
Name: Total Spent, dtype: float64

### TODO 20: Location comparison

In [125]:
# TODO 20: Compare transaction volume and average transaction value by location
# This uses .agg() with multiple functions (like TODO 18)
# Uncomment and run:

location_summary = df.groupby('Location').agg({
     'Transaction ID': 'count',
     'Total Spent': ['sum', 'mean']
 }).round(2)
 
location_summary.columns = ['Transaction Count', 'Total Revenue', 'Avg Transaction Value']
print("Sales by Location:")
display(location_summary)

Sales by Location:


Location,Transaction Count,Total Revenue,Avg Transaction Value
In-store,2602,22360.0,9.05
Takeaway,2591,21697.0,8.79
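
As a possible alternative to renaming the columns after `.agg()`, pandas also supports named aggregation, where each output column is declared together with its source column and function. The sketch below runs on a hypothetical four-row frame; the output column names are illustration choices.

```python
import pandas as pd

# Hypothetical sample: the last row has no Location.
sample = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_3", "TXN_4"],
    "Location": ["In-store", "Takeaway", "In-store", None],
    "Total Spent": [4.0, 10.0, 6.0, 8.0],
})

# Named aggregation: output_name=(source_column, function).
summary = sample.groupby("Location").agg(
    transaction_count=("Transaction ID", "count"),
    total_revenue=("Total Spent", "sum"),
    avg_transaction_value=("Total Spent", "mean"),
)

print(summary)
```

Note that `groupby` leaves out rows whose key is missing by default, so transactions without a Location do not appear in either group. The same applies to any summary grouped by a column that still contains NaN.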


---

## Section 9: Final Validation

### TODO 21: Run all validations

In [126]:
# TODO 21: Gather all your assertions in one cell to prove data quality
# Uncomment and run:

print("Running final validation...\n")

# Primary key
assert df['Transaction ID'].is_unique, "❌ Duplicate transaction IDs"
assert df['Transaction ID'].notna().all(), "❌ NULL transaction IDs"
print("✅ Primary key validated")

# Types
assert df['Quantity'].dtype in ['int64', 'Int64'], "❌ Quantity type wrong"
assert df['Price Per Unit'].dtype == 'float64', "❌ Price type wrong"
assert df['Transaction Date'].dtype == 'datetime64[ns]', "❌ Date type wrong"
print("✅ Types validated")

# Data ranges (only check non-null values)
assert (df['Quantity'].dropna() > 0).all(), "❌ Invalid quantities found"
assert (df['Price Per Unit'].dropna() > 0).all(), "❌ Invalid prices found"
print("✅ Data ranges validated")

print("\n✅ All validations passed!")

Running final validation...

✅ Primary key validated
✅ Types validated
✅ Data ranges validated

✅ All validations passed!


---

## Section 10: Documentation

### TODO 22: Document your data cleaning process

Write a brief summary (8-10 sentences) of:
1. What problems you found
2. What decisions you made
3. What the implications are for analysis
4. What a stakeholder should know about this data

---

## Data Cleaning Summary

The dataset was originally messy: missing values were encoded inconsistently (NaN, ERROR, UNKNOWN) and all variables were stored as object type. We standardized missing values to a single NaN format and converted variables to appropriate data types (e.g. numeric for Quantity and Price Per Unit, datetime for Transaction Date). Since Item and Quantity are critical for transactional analysis, rows with missing values in these columns were dropped, reducing the dataset from 10,000 to 8,611 observations.
A primary key based on Transaction ID was validated, ensuring uniqueness and completeness so that the data can be safely combined with other datasets. A consistency check confirmed that the product of Quantity and Price Per Unit matched the reported Total Spent wherever both were available. Although the data is now structurally tidy and validated, some variables such as Location and Payment Method still contain high rates of missing values. This means analyses relying on those categories may be less representative, while analyses focusing on items and quantities can use every remaining row. Stakeholders should know that the dataset is reliable for product-level sales insights but limited for location or payment-related breakdowns. To sum up, the data is sufficiently clean for core sales and revenue analysis, with clear limitations documented for context.

### Issues Found
- A large number of missing values in complementary variables such as Location (around 40%)
- Missing values in the critical columns Item and Quantity
- Inconsistent encodings of missing values: NaN, 'ERROR', 'UNKNOWN'
- Incorrect data types: all columns were stored as object

### Actions Taken
- Converted all missing value encodings to a single standardized form, NaN
- Dropped rows with NaN in the critical columns Item and Quantity, which reduced the dataset from 10,000 observations to 8,611
- Converted types: Quantity to integer, Price Per Unit and Total Spent to float, Transaction Date to datetime

### Assumptions Made
- The analysis focuses on transactional data, which requires non-null values for the item and the quantity sold. For this reason we dropped rows with NaN in these columns, reducing the total number of observations

### Implications for Analysis
- There is still a high rate of missing values in some categories, such as Location or Payment Method
- Data has a tidy structure
- No values were imputed; rows with missing values in the critical columns Item and Quantity were dropped instead
- Validation checks passed
- The Total Spent column contains no errors according to the Quantity × Price Per Unit consistency check

### Data Quality Assessment
- The data is mostly clean, but still contains a significant number of missing values in some columns. After dropping rows without Item or Quantity, the missing value counts are:
Transaction ID         0
Item                   0
Quantity               0
Price Per Unit       464
Total Spent          432
Payment Method      2732
Location            3418
Transaction Date     396

This means the completeness of the data ranges from 60.3% to 100%, depending on the type of analysis (taking into account that 8,611 observations remain after all modifications). For example, if the analysis only needs quantity by item, the data is fully ready for a high-quality analysis. Including the Price Per Unit column reduces the share of usable rows from 100% to 94.6% (464 of 8,611 values are missing).
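
The completeness figures can be reproduced with a few lines of arithmetic. The sketch below hard-codes the counts reported earlier in this notebook (8,611 rows from `df.shape` and the per-column missing counts after the drop), so it runs without the data file; the variable names are illustration choices.

```python
import pandas as pd

# Counts copied from the notebook outputs above, not recomputed from the file.
total_rows = 8611
missing = pd.Series({
    "Transaction ID": 0,
    "Item": 0,
    "Quantity": 0,
    "Price Per Unit": 464,
    "Total Spent": 432,
    "Payment Method": 2732,
    "Location": 3418,
    "Transaction Date": 396,
})

# Completeness = share of rows that have a value in the column, in percent.
completeness = (1 - missing / total_rows).mul(100).round(1)

print(completeness.sort_values())
```

With the dataframe in memory, `df.notna().mean().mul(100).round(1)` should give the same per-column percentages directly.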

---

## Congratulations!

You've successfully cleaned a real messy dataset using tidy data principles!

**Final check:** Can you **"Restart & Run All"** successfully? That's the gold standard!

**Reflection:** What was the hardest part? What did you learn?