# Assignment 2: Data Cleaning - Part 1: Validity Checker
## Group 105
- Natasa Bolic (300241734)
- Brent Palmer (300193610)
## Imports

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

## Introduction

Paragraph here

## Dataset Description

**Url:** https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training <br>
**Name:** Cafe Sales - Dirty Data for Cleaning Training <br>
**Author:** Ahmed Mohamed <br>
**Purpose:** The dirty cafe sales dataset was fabricated to practice data cleaning, deliberately including missing data, inconsistencies, and errors. The Kaggle description specifies that the dataset "can be used to practice cleaning techniques, data wrangling, and feature engineering."<br>
**Shape:** There are 10,000 rows and 8 columns. (10000, 8)<br>
**Features:** 
- `Transaction ID` (categorical): A unique id assigned to each transaction.
- `Item` (categorical): The name of the purchased item.
- `Quantity` (numerical): The count of the purchased item.
- `Price Per Unit` (numerical): The price of one unit of the purchased item, measured in dollars.
- `Total Spent` (numerical): The total amount spent in the transaction, measured in dollars. (Quantity * Price Per Unit)
- `Payment Method` (categorical): The transaction's method of payment.
- `Location` (categorical): The location of the transaction.
- `Transaction Date` (numerical): The transaction date.

Note that every feature may contain missing or invalid values except `Transaction ID`, which is always present and unique.
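
The claim about `Transaction ID` can be confirmed directly with pandas. The snippet below is a minimal sketch on a hypothetical toy frame (the IDs and items are made up for illustration, not taken from the dataset); the same two expressions could be run against the real `sales` dataframe once it is loaded.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset
toy = pd.DataFrame({
    "Transaction ID": ["TXN_0000001", "TXN_0000002", "TXN_0000003"],
    "Item": ["Coffee", None, "UNKNOWN"],
})

# True only if no Transaction ID is missing
ids_present = bool(toy["Transaction ID"].notna().all())

# True only if no Transaction ID is repeated
ids_unique = bool(toy["Transaction ID"].is_unique)

print(ids_present, ids_unique)  # True True
```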

## Loading Dataset and Basic Exploration

In [2]:
# Read in the dataset from a public repository
url = "https://raw.githubusercontent.com/Natasa127/CSI4142-A2/main/dirty_cafe_sales.csv"
sales = pd.read_csv(url)
sales.head()

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
2,TXN_4271903,Cookie,4,1.0,ERROR,Credit Card,In-store,2023-07-19
3,TXN_7034554,Salad,2,5.0,10.0,UNKNOWN,UNKNOWN,2023-04-27
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11


In [3]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    10000 non-null  object
 1   Item              9667 non-null   object
 2   Quantity          9862 non-null   object
 3   Price Per Unit    9821 non-null   object
 4   Total Spent       9827 non-null   object
 5   Payment Method    7421 non-null   object
 6   Location          6735 non-null   object
 7   Transaction Date  9841 non-null   object
dtypes: object(8)
memory usage: 625.1+ KB


In [4]:
sales.shape

(10000, 8)

The following cell left-aligns the markdown tables included later in the notebook.

**Reference:** <br>
https://stackoverflow.com/questions/21892570/ipython-notebook-align-table-to-the-left-of-cell

In [5]:
%%html
<style>
table {float:left}
</style>

## Data Checks

### 1) Data Type Errors

This test checks the data type of an attribute whose entries should be numerical (either an integer or a float).

**References:** <br>
Converting to numeric: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html <br>
Setting the type: https://www.geeksforgeeks.org/python-pandas-dataframe-astype/ <br>
Selecting rows in one dataframe but not in another: https://discovery.cs.illinois.edu/guides/DataFrame-Row-Selection/dataframe-isin-selection/
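
As a rough illustration of the idea, the sketch below flags invalid entries with a boolean mask instead of filtering the dataframe. The helper name `invalid_type_mask` and the toy values are assumptions made for this example only; it relies on `pd.to_numeric` with `errors='coerce'`, which turns non-numeric entries into `NaN`.

```python
import pandas as pd

def invalid_type_mask(series, datatype):
    """Return a mask that is True where a value is not of the requested numeric type."""
    # Non-numeric strings and missing values both become NaN
    numeric = pd.to_numeric(series, errors="coerce")
    invalid = numeric.isna()
    if datatype == "int":
        # A numeric value with a fractional part is not a valid integer
        invalid = invalid | (numeric % 1 != 0)
    return invalid

# Hypothetical toy column mixing valid, invalid, and missing entries
toy_quantity = pd.Series(["2", "ERROR", "3.5", None, "4"])

int_mask = invalid_type_mask(toy_quantity, "int")
float_mask = invalid_type_mask(toy_quantity, "float")

print(int_mask.tolist())    # [False, True, True, True, False]
print(float_mask.tolist())  # [False, True, False, True, False]
```

Note that in this sketch missing values are flagged as invalid along with placeholder strings such as `ERROR`.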

In [6]:
# # Parameters to be edited by the user
# attributes = ['Quantity', 'Price Per Unit', 'Total Spent']
# datatypes = ['int', 'float']

# test_attribute = 'Quantity'
# test_datatype = 'int'


In [7]:
# # Error check
# def type_filter(df, col, datatype):
#     # Creates a copy so that the original dataset is not modified
#     df_filtered = df.copy()

#     # Converts numeric data to a numeric type and sets all other values to NaN
#     df_filtered[col] = pd.to_numeric(df_filtered[col], errors='coerce')
#     # Removes NaN values to leave only numerical values
#     df_filtered = df_filtered.dropna(subset=[col]).copy()
    
#     if datatype == 'int':
#         # Takes only the integer values
#         df_filtered = df_filtered[df_filtered[col] % 1 == 0].copy()

#         # Converts the type to integer (as opposed to float)
#         df_filtered[col] = df_filtered[col].astype(datatype)

#     # Returns the filtered dataset
#     return df_filtered

# checked_sales = type_filter(sales, test_attribute, test_datatype)
# checked_sales.info()


In [8]:
# # Accesses entries with invalid datatypes for the given column
# invalid_type = sales[~sales.index.isin(checked_sales.index)]
# # Obtains number of invalid entries
# print(len(invalid_type))
# # Displays 5 invalid entries
# invalid_type.head()

#### Results

There are 479 rows where `Quantity` is not an integer. This seems to occur when the value is unknown and has been replaced by a placeholder string such as 'UNKNOWN' or 'ERROR'. For example, see the two rows below:

<u>Transaction ID / Item / Quantity / Price Per Unit / Total Spent / Payment Method / Location / Transaction Date</u>

TXN_3522028 / Smoothie / ERROR / 4.0 / 20.0 / Cash / In-store / 2023-04-04

TXN_5522862 / Cookie / ERROR / 1.0 / 2.0 / Credit Card / Takeaway / 2023-03-19

We perform data type checks for the rest of the numerical attributes so that the columns have the correct datatype in subsequent checks.

In [9]:
# # Filter by type for the remaining numerical attributes
# checked_sales = type_filter(checked_sales, attributes[1], datatypes[1])
# checked_sales = type_filter(checked_sales, attributes[2], datatypes[1])
# checked_sales.info()


### 2) Range Errors

This test checks whether the value of a numerical variable lies between the minimum and maximum acceptable values for that attribute. The range check is inclusive, meaning that the provided minimum and maximum values are themselves accepted. Values with an invalid data type are also treated as out of range.

There are three parameters.
- `test_attribute`: The column to perform the range check on.
    - The three options are `Quantity`, `Price Per Unit`, and `Total Spent`, as these are the only numerical attributes.
- `minimum`: The minimum value of the range.
- `maximum`: The maximum value of the range.
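
For comparison, the same inclusive check can be sketched in a vectorized form. This is an alternative illustration, not the implementation used in this notebook: the helper name `invalid_range_mask` and the toy values are hypothetical, and it combines `pd.to_numeric` with `Series.between`, which is inclusive on both ends by default.

```python
import pandas as pd

def invalid_range_mask(series, minimum, maximum):
    """Return a mask that is True where a value falls outside [minimum, maximum]."""
    # Non-numeric strings and missing values become NaN
    numeric = pd.to_numeric(series, errors="coerce")
    # between() returns False for NaN, so invalid types count as out of range
    return ~numeric.between(minimum, maximum)

# Hypothetical toy column; the bounds mirror the example parameters (1 and 4)
toy_quantity = pd.Series(["2", "5", "ERROR", None, "4", "1"])
mask = invalid_range_mask(toy_quantity, 1, 4)

print(mask.tolist())    # [False, True, True, True, False, False]
print(int(mask.sum()))  # 3
```

The boundary values `1` and `4` are accepted, while `5`, `ERROR`, and the missing entry are flagged.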

In [70]:
# Parameters to be edited by the user

# Valid attributes for the range check
attributes = ['Quantity', 'Price Per Unit', 'Total Spent']

# Attribute selection
test_attribute = 'Quantity'

# Minimum value of the range
minimum = 1

# Maximum value of the range
maximum = 4

In [71]:
# Error check

# Evaluates a single value against a given range
def range_filter(value, minimum, maximum):
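    try:
        value = float(value)
    except (TypeError, ValueError):
        # Non-numeric values (e.g. 'ERROR') are treated as out of range
        return False
    # Missing values (NaN) fail both comparisons and are also treated as out of range
    return minimum <= value <= maximum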

# Apply the function to the test attribute, setting out of range values to True
invalid_range = sales[test_attribute].apply(
    lambda attribute: not range_filter(attribute, minimum, maximum)
)

# Save the invalid rows
invalid_range_df = sales.loc[invalid_range]

# Print the number of rows with a value outside of the given range for the designated attribute
print(f"Number of rows with invalid range: {invalid_range.sum()}\n")

# Display the first 3 rows with a value outside of the given range for the designated attribute
print("Example of three rows with an invalid value:")
invalid_range_df.head(3)

Number of rows with invalid range: 2492

Example of three rows with an invalid value:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
5,TXN_2602893,Smoothie,5,4.0,20.0,Credit Card,,2023-03-31
8,TXN_4717867,,5,3.0,15.0,,Takeaway,2023-07-28
9,TXN_2064365,Sandwich,5,4.0,20.0,,In-store,2023-12-31


#### Results

Below are the example results from running the range check on the `Quantity` attribute, with the minimum set to `1` and the maximum set to `4`.

There are 2492 rows where the `Quantity` value is out of range. For example, see the ten rows below:

| Transaction ID | Item      | Quantity | Price Per Unit | Total Spent | Payment Method | Location  | Transaction Date |
|---------------|----------|----------|---------------|-------------|---------------|-----------|------------------|
| TXN_2602893  | Smoothie | 5        | 4.0           | 20.0        | Credit Card   | NaN       | 2023-03-31       |
| TXN_4717867  | NaN      | 5        | 3.0           | 15.0        | NaN           | Takeaway  | 2023-07-28       |
| TXN_2064365  | Sandwich | 5        | 4.0           | 20.0        | NaN           | In-store  | 2023-12-31       |
| TXN_2548360  | Salad    | 5        | 5.0           | 25.0        | Cash          | Takeaway  | 2023-11-07       |
| TXN_9437049  | Cookie   | 5        | 1.0           | 5.0         | NaN           | Takeaway  | 2023-06-01       |
| TXN_8876618  | Cake     | 5        | 3.0           | 15.0        | Cash          | ERROR     | 2023-03-25       |
| TXN_3522028  | Smoothie | ERROR    | 4.0           | 20.0        | Cash          | In-store  | 2023-04-04       |
| TXN_9400181  | Sandwich | 5        | 4.0           | 20.0        | Cash          | In-store  | 2023-06-03       |
| TXN_5183041  | Cookie   | 5        | 1.0           | 5.0         | Credit Card   | In-store  | 2023-04-20       |
| TXN_8467949  | Smoothie | 5        | 4.0           | 20.0        | Credit Card   | NaN       | 2023-03-11       |

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>Note that wrong data types are also counted as out of range, which is why `ERROR` appears among the out-of-range values. We also acknowledge that a maximum of `4` does not necessarily make sense in this context (people can obviously buy five sandwiches if they wish); however, we chose this value to illustrate that the range check functions as intended.

### 3) Format Errors

The format check ensures data follows a pre-defined format. For example, this test can check that:
- Transaction ID is stored in the correct format (TXN_1234567)
- Dates are stored in the correct format (YYYY-MM-DD)

There is one parameter, `test_attribute`, which selects the column to perform the format check on. The two options are `Transaction ID` and `Transaction Date`, as these are the only columns that have a pre-defined format.

Regular expressions are used to assert the validity of the format.

**References:** <br>
Regex: https://www.w3schools.com/python/python_regex.asp
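
A regular expression such as `^\d{4}-\d{2}-\d{2}$` only checks the shape of a value, so a string like `2023-13-45` would pass. If calendar validity also matters, one possible extension (an assumption on our part, not something this check does) is to follow the regex with `datetime.strptime`. The helper name `is_valid_date` below is hypothetical.

```python
import re
from datetime import datetime

# Shape of a YYYY-MM-DD date; fullmatch() anchors the pattern at both ends
DATE_REGEX = r"\d{4}-\d{2}-\d{2}"

def is_valid_date(value):
    """Return True if value is a string shaped like YYYY-MM-DD and a real calendar date."""
    # Non-string values (e.g. NaN) and wrongly shaped strings fail immediately
    if not isinstance(value, str) or re.fullmatch(DATE_REGEX, value) is None:
        return False
    try:
        # strptime raises ValueError for impossible dates such as month 13
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

print(is_valid_date("2023-09-08"))  # True
print(is_valid_date("2023-13-45"))  # False
print(is_valid_date("ERROR"))       # False
```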

In [46]:
# Parameters to be edited by the user

# Valid attributes for the format check
attributes = ['Transaction ID', 'Transaction Date']

# Attribute selection
test_attribute = 'Transaction Date'

In [50]:
# Error Check

format_regex = r"^TXN_\d{7}$" if test_attribute == 'Transaction ID' else r"^\d{4}-\d{2}-\d{2}$"

# Evaluates a single value against a given regex format; non-string values (e.g. NaN) are invalid
def format_filter(value, format_regex):
    return isinstance(value, str) and re.search(format_regex, value) is not None

# Apply the function to the test attribute, setting invalid formats to True
invalid_format = sales[test_attribute].apply(
    lambda attribute: not format_filter(attribute, format_regex)
)

# Save the invalid rows
invalid_format_df = sales.loc[invalid_format]

# Print the number of rows with invalid formatting on the chosen test attribute
print(f"Number of rows with invalid format: {invalid_format.sum()}\n")

# Display the first 3 rows with invalid formatting on the chosen test attribute
print("Example of three rows with an invalid value:")
invalid_format_df.head(3)

Number of rows with invalid format: 460

Example of three rows with an invalid value:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
11,TXN_3051279,Sandwich,2,4.0,8.0,Credit Card,Takeaway,ERROR
29,TXN_7640952,Cake,4,3.0,12.0,Digital Wallet,Takeaway,ERROR
33,TXN_7710508,UNKNOWN,5,1.0,5.0,Cash,,ERROR


#### Results

Below are the example results from running the format check on the `Transaction Date` column.

There are 460 rows where the transaction date is in the wrong format. For example, see the ten rows below:

| Transaction ID | Item      | Quantity | Price Per Unit | Total Spent | Payment Method | Location  | Transaction Date |
|---------------|----------|----------|---------------|-------------|---------------|-----------|------------------|
| TXN_3051279  | Sandwich | 2        | 4.0           | 8.0         | Credit Card   | Takeaway  | ERROR            |
| TXN_7640952  | Cake     | 4        | 3.0           | 12.0        | Digital Wallet| Takeaway  | ERROR            |
| TXN_7710508  | UNKNOWN  | 5        | 1.0           | 5.0         | Cash          | NaN       | ERROR            |
| TXN_2091733  | Salad    | 1        | 5.0           | 5.0         | NaN           | In-store  | NaN              |
| TXN_7028009  | Cake     | 4        | 3.0           | 12.0        | NaN           | Takeaway  | ERROR            |
| TXN_7447872  | Juice    | 2        | NaN           | 6.0         | NaN           | NaN       | NaN              |
| TXN_1001832  | Salad    | 2        | 5.0           | 10.0        | Cash          | Takeaway  | UNKNOWN          |
| TXN_7943008  | Coffee   | 1        | 2.0           | 2.0         | Credit Card   | NaN       | ERROR            |
| TXN_1093800  | Sandwich | 3        | 4.0           | 12.0        | Cash          | Takeaway  | NaN              |
| TXN_6463132  | Cookie   | 5        | 1.0           | 5.0         | Credit Card   | Takeaway  | NaN              |