# Cafe Sales – Dirty Data Cleaning Project

### - Data Source
The dataset is publicly available on Kaggle:  
🔗 [Kaggle: Cafe Sales - Dirty Data for Cleaning Training](https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training)

### - About the Dataset
This dataset contains 10,000 rows of synthetic cafe sales transactions. It was intentionally made "dirty", with missing values, inconsistent formatting, and other data quality issues, to simulate real-world data cleaning challenges. That makes it well suited to practicing data cleaning, preprocessing, and feature engineering.

### - Project Goal
The dataset is not analysis-ready. The goal of this project is to perform data cleaning and transformation to make the dataset suitable for further analysis. Key tasks include:
- Replacing invalid string values (e.g., `"ERROR"`, `"UNKNOWN"`, `"nan"`) with a proper missing-value indicator (`NaN`)
- Handling missing values appropriately
- Cleaning textual data (e.g., removing extra characters, standardizing values)
- Parsing and transforming date-related columns into structured features
- Converting column data types to appropriate formats
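
The tasks above can be sketched as one small function. This is a minimal illustration, not the project's final pipeline: the helper name `clean_cafe_sales`, the `PLACEHOLDERS` set, the derived `Month` and `Weekday` columns, and the rule that a missing `Total Spent` can be recovered as `Quantity * Price Per Unit` are all assumptions made for the example. The sample rows are taken from the `df.head()` output shown further down.

```python
import pandas as pd

# Assumption: placeholder tokens are matched case-insensitively after trimming.
PLACEHOLDERS = {"error", "unknown", "nan", ""}
NUMERIC_COLS = ["Quantity", "Price Per Unit", "Total Spent"]


def clean_cafe_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper covering the cleaning tasks listed above."""
    out = df.copy()

    # 1. Trim text and turn placeholder strings into real missing values.
    for col in out.columns:
        text = out[col].astype(str).str.strip()
        out[col] = text.mask(text.str.lower().isin(PLACEHOLDERS))

    # 2. Convert numeric columns; anything unparseable becomes NaN.
    for col in NUMERIC_COLS:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # 3. Assumed rule: recover a missing total from quantity and unit price.
    out["Total Spent"] = out["Total Spent"].fillna(
        out["Quantity"] * out["Price Per Unit"]
    )

    # 4. Parse the date and derive simple calendar features.
    out["Transaction Date"] = pd.to_datetime(out["Transaction Date"], errors="coerce")
    out["Month"] = out["Transaction Date"].dt.month
    out["Weekday"] = out["Transaction Date"].dt.day_name()
    return out


sample = pd.DataFrame(
    {
        "Transaction ID": ["TXN_1961373", "TXN_4271903", "TXN_7034554"],
        "Item": ["Coffee", "Cookie", "Salad"],
        "Quantity": ["2", "4", "2"],
        "Price Per Unit": ["2.0", "1.0", "5.0"],
        "Total Spent": ["4.0", "ERROR", "10.0"],
        "Payment Method": ["Credit Card", "Credit Card", "UNKNOWN"],
        "Location": ["Takeaway", "In-store", "UNKNOWN"],
        "Transaction Date": ["2023-09-08", "2023-07-19", "2023-04-27"],
    }
)
cleaned = clean_cafe_sales(sample)
print(cleaned.dtypes)
```

Whether to impute, recompute, or drop a missing value depends on the column, so step 3 in particular should be checked against the data before it is applied to the full dataset.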


In [26]:
import pandas as pd

# Load the raw dataset
data_path = "../data/raw/dirty_cafe_sales.csv"
df = pd.read_csv(data_path)

# Basic overview
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    10000 non-null  object
 1   Item              9667 non-null   object
 2   Quantity          9862 non-null   object
 3   Price Per Unit    9821 non-null   object
 4   Total Spent       9827 non-null   object
 5   Payment Method    7421 non-null   object
 6   Location          6735 non-null   object
 7   Transaction Date  9841 non-null   object
dtypes: object(8)
memory usage: 625.1+ KB


In [28]:
# Quick look at the first 5 rows
df.head()

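|   | Transaction ID | Item | Quantity | Price Per Unit | Total Spent | Payment Method | Location | Transaction Date |
|---|----------------|------|----------|----------------|-------------|----------------|----------|------------------|
| 0 | TXN_1961373 | Coffee | 2 | 2.0 | 4.0 | Credit Card | Takeaway | 2023-09-08 |
| 1 | TXN_4977031 | Cake | 4 | 3.0 | 12.0 | Cash | In-store | 2023-05-16 |
| 2 | TXN_4271903 | Cookie | 4 | 1.0 | ERROR | Credit Card | In-store | 2023-07-19 |
| 3 | TXN_7034554 | Salad | 2 | 5.0 | 10.0 | UNKNOWN | UNKNOWN | 2023-04-27 |
| 4 | TXN_3160411 | Coffee | 2 | 2.0 | 4.0 | Digital Wallet | In-store | 2023-06-11 |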


In [29]:
df.describe()

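|        | Transaction ID | Item | Quantity | Price Per Unit | Total Spent | Payment Method | Location | Transaction Date |
|--------|----------------|------|----------|----------------|-------------|----------------|----------|------------------|
| count  | 10000 | 9667 | 9862 | 9821 | 9827 | 7421 | 6735 | 9841 |
| unique | 10000 | 10 | 7 | 8 | 19 | 5 | 4 | 367 |
| top    | TXN_1961373 | Juice | 5 | 3.0 | 6.0 | Digital Wallet | Takeaway | UNKNOWN |
| freq   | 1 | 1171 | 2013 | 2429 | 979 | 2291 | 3022 | 159 |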
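
The summary above lists `UNKNOWN` as the most frequent `Transaction Date` value (159 occurrences), so the non-null counts from `df.info()` overstate how much usable data each column holds. One way to quantify this is to count placeholder tokens next to true nulls, column by column. The sketch below is illustrative: `placeholder_report` is a hypothetical helper, and the token list assumes that `ERROR` and `UNKNOWN`, the two tokens visible in the outputs above, are the only placeholders in the data.

```python
import pandas as pd

# Assumption: only these two tokens stand in for missing values.
PLACEHOLDER_TOKENS = ["ERROR", "UNKNOWN"]


def placeholder_report(df: pd.DataFrame, tokens=PLACEHOLDER_TOKENS) -> pd.DataFrame:
    """Count true nulls and placeholder tokens for each column."""
    report = pd.DataFrame({"null": df.isna().sum()})
    for token in tokens:
        # Exact, case-sensitive match against the raw cell value.
        report[token] = df.eq(token).sum()
    report["total_missing"] = report.sum(axis=1)
    return report


# Tiny illustrative frame; on the real data, call placeholder_report(df).
example = pd.DataFrame(
    {
        "Total Spent": ["4.0", "ERROR", None],
        "Location": ["Takeaway", "UNKNOWN", "UNKNOWN"],
    }
)
report = placeholder_report(example)
print(report)
```

Sorting the result by `total_missing` gives a quick ranking of which columns need the most attention during cleaning.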
