In [30]:
import pandas as pd
import numpy as np
from pathlib import Path

### Import Data
---------------

In this workout, we use a cafe sales dataset from [Kaggle](https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training) to show how data cleaning techniques can be applied to dirty data.

In [31]:
# Load the café sales dataset
data_path = Path("data/dirty_cafe_sales.csv")

# Check if the file exists
if data_path.exists():
    df = pd.read_csv(data_path)
    print(f"Data loaded successfully. {df.shape[0]} rows and {df.shape[1]} columns.")
else:
    raise FileNotFoundError(f"The file {data_path} does not exist.")

Data loaded successfully. 10000 rows and 8 columns.


### Inspect Data
----------------

In [32]:
# Take a first look
df.head()

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
2,TXN_4271903,Cookie,4,1.0,ERROR,Credit Card,In-store,2023-07-19
3,TXN_7034554,Salad,2,5.0,10.0,UNKNOWN,UNKNOWN,2023-04-27
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11


In [33]:
print("\nDataFrame Information:")
df.info()

print("\nMissing values in each column:")
df.isnull().sum()


DataFrame Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    10000 non-null  object
 1   Item              9667 non-null   object
 2   Quantity          9862 non-null   object
 3   Price Per Unit    9821 non-null   object
 4   Total Spent       9827 non-null   object
 5   Payment Method    7421 non-null   object
 6   Location          6735 non-null   object
 7   Transaction Date  9841 non-null   object
dtypes: object(8)
memory usage: 625.1+ KB

Missing values in each column:


Transaction ID         0
Item                 333
Quantity             138
Price Per Unit       179
Total Spent          173
Payment Method      2579
Location            3265
Transaction Date     159
dtype: int64

Inspecting the data shows that:
- some columns contain placeholder values, such as `ERROR` and `UNKNOWN`
- some columns contain null values
- every column was read as `object`, because the messy values prevent pandas from inferring the numeric and date types

Next, we deal with these problems through data cleaning.
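
Placeholder strings are not counted by `isnull()`, so they need a separate check. The following is a minimal sketch on a small, made-up frame (the `sample` data and the `placeholders` list are assumptions for illustration, not taken from the dataset) showing one way to count them per column:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the cafe data, for illustration only
sample = pd.DataFrame({
    "Item": ["Coffee", "UNKNOWN", np.nan, "Cake"],
    "Total Spent": ["4.0", "ERROR", "12.0", "UNKNOWN"],
})

# Assumed set of placeholder strings to look for
placeholders = ["ERROR", "UNKNOWN"]

# isin() flags cells that match a placeholder; sum() counts them per column
placeholder_counts = sample.isin(placeholders).sum()
print(placeholder_counts)
```

On the real data, the same pattern could be applied as `df.isin(placeholders).sum()`, alongside `df.isnull().sum()`.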

### Clean Data
--------------

#### Remove whitespace in strings

The `str.strip()` method is a useful tool for processing string values: it removes whitespace characters (spaces, tabs, newlines) from the beginning and end of each string. Note that `str.strip()` does not touch characters in the middle of a string, so whitespace there is not removed.

Because these columns hold mixed values, we first use `astype(str)` to force every value to a string. This also turns missing values into the literal string `"nan"`, which step 4 below converts back to real `NaN` values.

In [34]:
# --- 1. Remove whitespace in string columns ---
df['Transaction ID'] = df['Transaction ID'].astype(str).str.strip()
df['Item'] = df['Item'].astype(str).str.strip()
df['Payment Method'] = df['Payment Method'].astype(str).str.strip()
df['Location'] = df['Location'].astype(str).str.strip()
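
To make the behaviour of `str.strip()` concrete, here is a small sketch on made-up values (the `raw` series is hypothetical). It also shows one possible way to collapse repeated whitespace inside a string, which `str.strip()` alone leaves untouched:

```python
import pandas as pd

# Hypothetical values with leading, trailing and inner whitespace
raw = pd.Series(["  credit card ", "\tcash\n", "digital  wallet"])

# strip() removes whitespace only at the two ends of each string
stripped = raw.str.strip()
print(stripped.tolist())

# Optional extra step: collapse runs of inner whitespace to a single space
collapsed = stripped.str.replace(r"\s+", " ", regex=True)
print(collapsed.tolist())
```

Whether the extra collapsing step is needed depends on the data; the notebook itself only strips the ends.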

#### Lowercase string values

Making string values lowercase during cleaning is one of those “small steps, big impact” moves. Lowercasing avoids duplicate categories that differ only by case, such as `Cash` and `cash`. The `str.lower()` method does this for every value in a column.

In [35]:
# --- 2. Lowercase string values ---
df['Item'] = df['Item'].str.lower()
df['Payment Method'] = df['Payment Method'].str.lower()
df['Location'] = df['Location'].str.lower()

# Check the cleaned DataFrame
df.head()

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,coffee,2,2.0,4.0,credit card,takeaway,2023-09-08
1,TXN_4977031,cake,4,3.0,12.0,cash,in-store,2023-05-16
2,TXN_4271903,cookie,4,1.0,ERROR,credit card,in-store,2023-07-19
3,TXN_7034554,salad,2,5.0,10.0,unknown,unknown,2023-04-27
4,TXN_3160411,coffee,2,2.0,4.0,digital wallet,in-store,2023-06-11
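
The effect on category counts can be illustrated with a short sketch. The `payments` series below is made up for the example and is not taken from the dataset:

```python
import pandas as pd

# Hypothetical payment labels that differ only by letter case
payments = pd.Series(["Cash", "cash", "CASH", "Credit Card", "credit card"])

# Before lowercasing, every spelling counts as its own category
categories_before = payments.nunique()

# After lowercasing, spellings that differ only by case are merged
categories_after = payments.str.lower().nunique()

print(categories_before, categories_after)
```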


#### Correct column names

Some column names contain whitespace: `Transaction ID`, `Price Per Unit`, `Total Spent`, `Payment Method` and `Transaction Date`. To stay consistent across tools and databases, it is good practice to replace the whitespace in column names with an underscore `_`.

In [36]:
# --- 3. Correct column names ---
df.columns = df.columns.str.replace(" ", "_", regex=False)

# Check the cleaned DataFrame
df.head()

Unnamed: 0,Transaction_ID,Item,Quantity,Price_Per_Unit,Total_Spent,Payment_Method,Location,Transaction_Date
0,TXN_1961373,coffee,2,2.0,4.0,credit card,takeaway,2023-09-08
1,TXN_4977031,cake,4,3.0,12.0,cash,in-store,2023-05-16
2,TXN_4271903,cookie,4,1.0,ERROR,credit card,in-store,2023-07-19
3,TXN_7034554,salad,2,5.0,10.0,unknown,unknown,2023-04-27
4,TXN_3160411,coffee,2,2.0,4.0,digital wallet,in-store,2023-06-11
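
One practical benefit of underscore names is that they can be used directly in expression strings. The sketch below uses a small hypothetical frame (`frame` and its values are assumptions for illustration) to show the rename followed by a `DataFrame.query()` call:

```python
import pandas as pd

# Hypothetical two-row frame with spaces in the column names
frame = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2"],
    "Price Per Unit": [2.0, 3.0],
})

# Same replacement as in the notebook: spaces become underscores
frame.columns = frame.columns.str.replace(" ", "_", regex=False)
print(list(frame.columns))

# Underscore names can be referenced in query() without backtick quoting
expensive = frame.query("Price_Per_Unit > 2.5")
print(expensive)
```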


In [37]:
# --- 4. Replace "nan" strings with actual NaN values ---
df[['Item', 'Payment_Method', 'Location']] = df[['Item', 'Payment_Method', 'Location']].replace("nan", np.nan)
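
As a small illustration of this step, the sketch below runs the same kind of replacement on a made-up frame (the `frame` data is hypothetical). It shows that `replace("nan", np.nan)` matches whole cell values only, so a value such as `banana` is left alone:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where missing values appear as the text "nan"
frame = pd.DataFrame({
    "Item": ["coffee", "nan", "banana"],
    "Location": ["nan", "takeaway", "nan"],
})

cols = ["Item", "Location"]

# Replace cells that are exactly "nan" with a real missing value
frame[cols] = frame[cols].replace("nan", np.nan)

# The restored missing values are now visible to isna()
missing = frame.isna().sum()
print(missing)
```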