# 🧼 Data Cleaning & Wrangling Project
## 1. Project Overview
- **Dataset**: dirty_cafe_sales.csv
- **Source**: Kaggle Dataset - https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training/data
- **Goal**: The primary objective of this project was to identify and rectify data quality issues in the raw cafe sales data (missing values, placeholder entries such as `ERROR` and `UNKNOWN`, and incorrect data types), producing a clean, reliable dataset suitable for further analysis and visualization. The process involved initial data exploration, identification of quality problems, development and implementation of a cleaning strategy, and final validation of the results.



In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

In [None]:
# Define the menu items with their prices
menu_prices = {
    "Coffee": 2.0,
    "Tea": 1.5,
    "Sandwich": 4.0,
    "Salad": 5.0,
    "Cake": 3.0,
    "Cookie": 1.0,
    "Smoothie": 4.0,
    "Juice": 3.0
}

In [None]:
# Load Dataset
df = pd.read_csv('dirty_cafe_sales.csv')  # Adjust the path to your copy of the file
df.head()

## 2. Initial Data Exploration

In [None]:
print("Rows:", df.shape[0], "Columns:", df.shape[1])  # the dataset has 10000 rows and 8 columns
df.info()  # Transaction ID is the unique identifier of each entry

In [None]:
df.describe()

In [None]:
df.isnull().sum()
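
Note that `isnull()` only counts true missing values. Placeholder strings such as `ERROR` and `UNKNOWN` (handled later in Section 3.2) are not included in that count. The following is a minimal sketch of how both could be counted side by side; the `sample` frame and its values are illustrative assumptions, not rows from the real dataset.

```python
import pandas as pd

# Toy frame standing in for the raw data (values are illustrative only)
sample = pd.DataFrame({
    "Item": ["Coffee", "ERROR", "Tea", None],
    "Quantity": ["2", "UNKNOWN", "1", "3"],
    "Payment Method": ["Cash", "UNKNOWN", "ERROR", "Card"],
})

placeholders = ["ERROR", "UNKNOWN"]

# isnull() only sees true missing values, not placeholder strings
true_missing = sample.isnull().sum()

# isin() flags cells whose value is one of the placeholder tokens
placeholder_counts = sample.isin(placeholders).sum()

print(pd.DataFrame({"missing": true_missing, "placeholder": placeholder_counts}))
```

Running the same two counts on `df` before cleaning would give a fuller picture of how many cells need attention in each column.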

## 3. Data Cleaning Steps

### 3.1 Convert Data Types

In [None]:
# Transaction ID is the unique identifier, so no change is made to it.
# Item - remove leading and trailing whitespace
df['Item'] = df['Item'].str.strip()

# Quantity, Price Per Unit, Total Spent should be converted to numeric
df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce')
df['Price Per Unit'] = pd.to_numeric(df['Price Per Unit'], errors='coerce')
df['Total Spent'] = pd.to_numeric(df['Total Spent'], errors='coerce')

# Transaction Date should be converted to datetime
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'], errors='coerce')

df.head()
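
With `errors='coerce'`, any value that cannot be parsed is silently turned into `NaN` (or `NaT` for dates). The sketch below illustrates this on a small hand-made series; the values are assumptions chosen for the example and do not come from the dataset.

```python
import pandas as pd

# Illustrative raw values: valid numbers, placeholder strings, and a true missing value
raw = pd.Series(["2", "ERROR", "3.5", "UNKNOWN", None])

# Unparseable strings become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

# The same idea for dates: unparseable strings become NaT
dates = pd.to_datetime(pd.Series(["2023-01-15", "ERROR", None]), errors="coerce")

# How many values were turned into NaN by the conversion itself
newly_missing = numeric.isna().sum() - raw.isna().sum()
print("Values coerced to NaN:", newly_missing)
```

Comparing the missing-value count before and after conversion, as done here, is one way to see how many entries were invalid rather than originally empty.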

### 3.2 Handling Invalid/Inconsistent Values

In [None]:
# Treat the placeholder strings 'ERROR' and 'UNKNOWN' as missing values
df.replace(['ERROR', 'UNKNOWN'], np.nan, inplace=True)

In [None]:
# Fill categorical and date columns with the mode, and Quantity with the median
item_mode = df['Item'].mode()[0]
df['Item'] = df['Item'].fillna(item_mode)

payment_mode = df["Payment Method"].mode()[0]
df["Payment Method"] = df["Payment Method"].fillna(payment_mode)

location_mode = df["Location"].mode()[0]
df["Location"] = df["Location"].fillna(location_mode)

date_mode = df["Transaction Date"].mode()[0]
df["Transaction Date"] = df["Transaction Date"].fillna(date_mode)

quantity_median = df['Quantity'].median()
df['Quantity'] = df['Quantity'].fillna(quantity_median)

In [None]:
# Fill missing Price Per Unit values from the menu, based on the item
price_median = df['Price Per Unit'].median()  # fallback for items not on the menu

def impute_price(row):
    if pd.isna(row['Price Per Unit']):
        # Use the menu price if the item is known, otherwise the median price
        return menu_prices.get(row['Item'], price_median)
    return row['Price Per Unit']

df['Price Per Unit'] = df.apply(impute_price, axis=1)

# Recompute Total Spent so it is consistent with Quantity and Price Per Unit
df['Total Spent'] = df['Quantity'] * df['Price Per Unit']

df.info()
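
The row-wise `apply` above is easy to read but can be slow on larger data. As an alternative, here is a minimal vectorized sketch intended to give the same result: existing prices are kept, missing ones come from the menu, and anything still missing falls back to the median. The `sample` frame is an assumption for illustration, and "Muffin" is a hypothetical item that is deliberately not on the menu.

```python
import numpy as np
import pandas as pd

menu_prices = {"Coffee": 2.0, "Tea": 1.5, "Cake": 3.0}  # subset of the menu, for brevity

# Toy data; "Muffin" is a hypothetical item that is not on the menu
sample = pd.DataFrame({
    "Item": ["Coffee", "Tea", "Cake", "Muffin"],
    "Quantity": [2.0, 1.0, 3.0, 1.0],
    "Price Per Unit": [np.nan, 1.5, np.nan, np.nan],
})

# Compute the fallback once, from the prices that are actually present
price_median = sample["Price Per Unit"].median()

# Look up each item's menu price; items not on the menu map to NaN
menu_lookup = sample["Item"].map(menu_prices)

# Keep existing prices, then fall back to the menu price, then to the median
sample["Price Per Unit"] = (
    sample["Price Per Unit"].fillna(menu_lookup).fillna(price_median)
)
sample["Total Spent"] = sample["Quantity"] * sample["Price Per Unit"]

print(sample)
```

`Series.fillna` aligns on the index when given another series, so each missing price is filled from the lookup value of the same row.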

### 3.3 Remove Duplicates

In [None]:
initial_rows = len(df)
df.drop_duplicates(subset=['Transaction ID'], keep='first', inplace=True)
duplicates_removed = initial_rows - len(df)
if duplicates_removed > 0:
    print(f"Removed {duplicates_removed} duplicate rows based on 'Transaction ID'.")
else:
    print("No duplicate 'Transaction ID' found.")

## 4. Post-Cleaning Checks

In [None]:
df.isnull().sum()
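
Beyond the null count, a few further checks could confirm that the cleaned data is internally consistent. The sketch below is one possible set of checks, not part of the original workflow: `validate_cleaned` is a hypothetical helper, and the sample rows (including the transaction ID format) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def validate_cleaned(frame):
    """Return a dict mapping each check name to True (passed) or False (failed).

    Hypothetical helper for illustration only.
    """
    expected_total = frame["Quantity"] * frame["Price Per Unit"]
    return {
        # No missing values anywhere in the frame
        "no_missing_values": not frame.isnull().any().any(),
        # Each transaction appears exactly once
        "unique_transaction_ids": bool(frame["Transaction ID"].is_unique),
        # Total Spent matches Quantity * Price Per Unit (within float tolerance)
        "totals_consistent": bool(np.isclose(frame["Total Spent"], expected_total).all()),
    }

# Toy data standing in for the cleaned frame (values are illustrative only)
sample = pd.DataFrame({
    "Transaction ID": ["TXN_001", "TXN_002"],
    "Quantity": [2.0, 1.0],
    "Price Per Unit": [2.0, 1.5],
    "Total Spent": [4.0, 1.5],
})

results = validate_cleaned(sample)
print(results)
```

Calling `validate_cleaned(df)` on the cleaned frame and asserting `all(results.values())` would turn these checks into a simple pass/fail gate before saving.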

## 5. Summary & Next Steps
The data cleaning process addressed the quality issues identified in the original dirty_cafe_sales.csv dataset. Missing values, incorrect data types, and invalid entries (`ERROR`, `UNKNOWN`) were handled, and Total Spent was recomputed from Quantity and Price Per Unit to remove calculation inconsistencies. The post-cleaning check in Section 4 verifies that no missing values remain in the resulting dataset, which is saved as cleaned_cafe_sales.csv with appropriate data types.
**This cleaned dataset is now suitable for reliable downstream analysis, visualization, or machine learning tasks.**

In [None]:
# save cleaned data
df.to_csv("cleaned_cafe_sales.csv", index=False)