# Assignment 2: Data Cleaning - Part 1: Validity Checker
## Group 105
- Natasa Bolic (300241734)
- Brent Palmer (300193610)
## Imports

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

## Introduction

This notebook covers Part 1 of the assignment: a clean data checker that tests for 8 validity problems, as well as for exact duplicates and near-duplicates. The checks only identify rows that are potentially problematic; they do not clean the data, as per the assignment instructions. Each check is minimally parameterized, so users can specify which attributes and rules they would like to use. For each check, our program saves the entire set of invalid rows and automatically displays the first three. We also provide an example set of invalid rows from a sample run. We have chosen the `Cafe Sales - Dirty Data for Cleaning Training` dataset from Kaggle, which provides a set of 10,000 transactions at a coffee shop. The data is intentionally dirty, which makes it well suited to testing our clean data checker. Specifically, this notebook checks for the following errors:

1. Data Type Errors
2. Range Errors
3. Format Errors
4. Consistency Errors
5. Uniqueness Errors
6. Presence Errors
7. Length Errors
8. Look-up Errors
9. Exact Duplicate Errors
10. Near Duplicate Errors

The checks are completely independent of each other. To use the notebook, first run the imports and dataset loading cells, then choose any check and run the cells for that check sequentially to view the results.

## Dataset Description

**URL:** https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training <br>
**Name:** Cafe Sales - Dirty Data for Cleaning Training <br>
**Author:** Ahmed Mohamed <br>
**Purpose:** The dirty cafe sales dataset was fabricated to practice data cleaning, deliberately including missing data, inconsistencies, and errors. The Kaggle description specifies that the dataset "can be used to practice cleaning techniques, data wrangling, and feature engineering."<br>
**Shape:** There are 10,000 rows and 8 columns. (10000, 8)<br>
**Features:** 
- `Transaction ID` (categorical): A unique id assigned to each transaction.
- `Item` (categorical): The name of the purchased item.
- `Quantity` (numerical): The count of the purchased item.
- `Price Per Unit` (numerical): The price of one unit of the purchased item, measured in dollars.
- `Total Spent` (numerical): The total amount spent in the transaction, measured in dollars. (Quantity * Price Per Unit)
- `Payment Method` (categorical): The transaction's method of payment.
- `Location` (categorical): The location of the transaction.
- `Transaction Date` (numerical): The transaction date.

Note that all the features may contain missing or invalid values, except for transaction ID, which is always present and unique.

## Loading Dataset and Basic Exploration

In [2]:
# Read in the dataset from a public repository
url = "https://raw.githubusercontent.com/Natasa127/CSI4142-A2/main/dirty_cafe_sales.csv"
sales = pd.read_csv(url)
sales.head()

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
0,TXN_1961373,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
1,TXN_4977031,Cake,4,3.0,12.0,Cash,In-store,2023-05-16
2,TXN_4271903,Cookie,4,1.0,ERROR,Credit Card,In-store,2023-07-19
3,TXN_7034554,Salad,2,5.0,10.0,UNKNOWN,UNKNOWN,2023-04-27
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11


In [3]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    10000 non-null  object
 1   Item              9667 non-null   object
 2   Quantity          9862 non-null   object
 3   Price Per Unit    9821 non-null   object
 4   Total Spent       9827 non-null   object
 5   Payment Method    7421 non-null   object
 6   Location          6735 non-null   object
 7   Transaction Date  9841 non-null   object
dtypes: object(8)
memory usage: 625.1+ KB


In [4]:
sales.shape

(10000, 8)

The following cell is used to left-align the markdown tables included later in the notebook.

**Reference:** <br>
https://stackoverflow.com/questions/21892570/ipython-notebook-align-table-to-the-left-of-cell

In [5]:
%%html
<style>
table {float:left}
</style>

## Data Checks

### 1) Data Type Errors

The data type check ensures that an attribute's data is the correct data type.

There are 3 parameters:
- `method`: Which data type check you want to run. There are two integer options:
    - `1`: The first method checks if the data in the `test_attribute` column, in its current format, has the same data type as the `test_datatype`.
    - `2`: The second method can only be used to check for `float` and `int` data types, and checks if the data in the `test_attribute` column can be cast to the same data type as the `test_datatype`.
- `test_attribute`: The column to perform the data type check on.
- `test_datatype`: The desired data type of the chosen column.

`method 1` can be used as a quick check to verify whether the data in a column is stored as the correct data type. However, when a dataset is read in and a column contains at least one non-numeric string, pandas sets the column `Dtype` to `object` and stores every non-missing value as a string. So, even `integer` and `float` values are stored as strings, despite representing numerical values. As such, we implemented `method 2` as a more comprehensive check to account for this case.
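The behaviour described above can be reproduced on a tiny sample. The sketch below is illustrative only: the three-row CSV is made up rather than taken from the dataset, and the coercion with `pd.to_numeric` is a vectorized stand-in for `method 2`, not the implementation used in this notebook.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics a dirty numeric column
csv_text = "Price Per Unit\n2.0\nERROR\n1.5\n"
sample = pd.read_csv(io.StringIO(csv_text))

# A single non-numeric token means every value is read in as a Python str
value_types = sample["Price Per Unit"].map(type).unique().tolist()

# Cast-based check: values that cannot be parsed as numbers become NaN
as_numbers = pd.to_numeric(sample["Price Per Unit"], errors="coerce")
invalid_float = as_numbers.isna()
# For int, values with a fractional part are also flagged
invalid_int = as_numbers.isna() | (as_numbers % 1 != 0)

print(value_types)
print(invalid_float.tolist())
print(invalid_int.tolist())
```

A check on the stored type alone would flag all three sample values as non-`float`, whereas the cast-based check flags only `ERROR`.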

In [6]:
# Parameters to be edited by the user

# Choice of method
method = 2 # as described above, we suggest using method 2 for this dataset, as all the values are saved as strings

# Valid attributes for data type check
method_1_attributes = ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']
method_1_datatypes = [str, int, float] # in this dataset, everything is saved as an str
method_2_attributes = ['Quantity', 'Price Per Unit', 'Total Spent'] # method 2 only applicable to numerical columns
method_2_datatypes = [int, float]

# Attribute selection
test_attribute = "Price Per Unit"

# Datatype selection
test_datatype = float

In [7]:
# Error check

# Method 1: evaluates a single value's stored data type against the desired data type
def type_filter_method1(value, test_datatype):
    if pd.isna(value):
        return False
    return isinstance(value, test_datatype)

# Method 2: evaluates whether a single value can be cast to the desired data type (int or float)
def type_filter_method2(value, test_datatype):
    if pd.isna(value):
        return False
    if test_datatype == int:
        try:
            value = float(value)
            return value % 1 == 0
        except Exception as e:
            return False
    else:
        try:
            value = test_datatype(value)
            return True
        except Exception as e:
            return False

if method == 1:
    # Apply the function to the test attribute, setting values with an invalid data type to True
    invalid_datatype = sales[test_attribute].apply(
        lambda attribute: not type_filter_method1(attribute, test_datatype)
    )
else:
    # Apply the function to the test attribute, setting values with an invalid data type to True
    invalid_datatype = sales[test_attribute].apply(
        lambda attribute: not type_filter_method2(attribute, test_datatype)
    )

# Save the invalid rows
invalid_datatype_df = sales.loc[invalid_datatype]

# Print the number of rows where the test attribute value contains an incorrect datatype
print(f"Number of rows where the {test_attribute} value's data type is not {test_datatype}: {invalid_datatype.sum()}\n")

# Display the first 3 rows where the test attribute value contains an incorrect datatype
print(f"Examples of three rows where the {test_attribute} value's data type is not {test_datatype}:")
invalid_datatype_df.head(3)

Number of rows where the Price Per Unit value's data type is not <class 'float'>: 533

Examples of three rows where the Price Per Unit value's data type is not <class 'float'>:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
56,TXN_3578141,Cake,5,,15.0,,Takeaway,2023-06-27
65,TXN_4987129,Sandwich,3,,,,In-store,2023-10-20
68,TXN_8427104,Salad,2,ERROR,10.0,,In-store,2023-10-27


#### Results

There are 533 rows with a `Price Per Unit` that is not a `float`. This occurs when the value is a placeholder string such as `UNKNOWN` or `ERROR`, or when the value is missing. We will run the test again using `int` to illustrate that our check can distinguish decimal numbers from integers. The ten rows below are rows where the `Price Per Unit` value is not a `float` (from the initial test):

| Transaction ID | Item     | Quantity | Price Per Unit | Total Spent | Payment Method  | Location  | Transaction Date |
|---------------|---------|----------|---------------|-------------|----------------|-----------|------------------|
| TXN_3578141   | Cake    | 5        | NaN           | 15.0        | NaN            | Takeaway  | 2023-06-27       |
| TXN_4987129   | Sandwich | 3        | NaN           | NaN         | NaN            | In-store  | 2023-10-20       |
| TXN_8427104   | Salad   | 2        | ERROR         | 10.0        | NaN            | In-store  | 2023-10-27       |
| TXN_8035512   | Tea     | 3        | NaN           | 4.5         | Cash           | UNKNOWN   | 2023-10-29       |
| TXN_7447872   | Juice   | 2        | NaN           | 6.0         | NaN            | NaN       | NaN              |
| TXN_4633784   | ERROR   | 5        | NaN           | 15.0        | NaN            | In-store  | 2023-02-06       |
| TXN_2484241   | Cake    | 3        | UNKNOWN       | 9.0         | Digital Wallet | NaN       | 2023-07-19       |
| TXN_9336980   | Salad   | 4        | UNKNOWN       | 20.0        | Cash           | In-store  | 2023-06-06       |
| TXN_4031509   | NaN     | 4        | NaN           | 16.0        | Credit Card    | Takeaway  | 2023-01-04       |
| TXN_7965998   | Juice   | 1        | UNKNOWN       | 3.0         | Credit Card    | In-store  | 2023-11-02       |


In [8]:
# Same check, but with int instead of float to demonstrate the check distinguishes ints from floats - all values of 1.5 should now also be filtered out.

# Method selection
method = 2

# Attribute selection
test_attribute = "Price Per Unit"

# Datatype selection
test_datatype = int

if method == 1:
    # Apply the function to the test attribute, setting values with an invalid data type to True
    invalid_datatype = sales[test_attribute].apply(
        lambda attribute: not type_filter_method1(attribute, test_datatype)
    )
else:
    # Apply the function to the test attribute, setting values with an invalid data type to True
    invalid_datatype = sales[test_attribute].apply(
        lambda attribute: not type_filter_method2(attribute, test_datatype)
    )

# Save the invalid rows
invalid_datatype_df = sales.loc[invalid_datatype]

# Print the number of occurrences of "1.5" in Price Per Unit
print(f"Number of occurrences of 1.5 in Price Per Unit: {sales['Price Per Unit'].value_counts()['1.5']}\n")

# Print the number of rows where the test attribute value contains an incorrect datatype
print(f"Number of rows where the {test_attribute} value's data type is not {test_datatype}: {invalid_datatype.sum()}\n")

# Display the first 3 rows where the test attribute value contains an incorrect datatype
print(f"Examples of three rows where the {test_attribute} value's data type is not {test_datatype}:")
invalid_datatype_df.head(3)

Number of occurrences of 1.5 in Price Per Unit: 1133

Number of rows where the Price Per Unit value's data type is not <class 'int'>: 1666

Examples of three rows where the Price Per Unit value's data type is not <class 'int'>:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
14,TXN_8915701,ERROR,2,1.5,3.0,,In-store,2023-03-21
42,TXN_6650263,Tea,2,1.5,UNKNOWN,,Takeaway,2023-01-10
56,TXN_3578141,Cake,5,,15.0,,Takeaway,2023-06-27


#### Results (Using `int` Instead of `float`)

There are now 1666 results instead of 533, an increase of 1133, which is exactly how many occurrences of `1.5` there are in the `Price Per Unit` column. Note that `1.5` is the only decimal value, thus our check has successfully filtered out all of the decimal values when an `int` is requested. Below are three examples of invalid rows. The first two contain the decimal value `1.5` in the `Price Per Unit` column, while the third is invalid because the value is missing.

| Transaction ID | Item  | Quantity | Price Per Unit | Total Spent | Payment Method | Location  | Transaction Date |
|---------------|------|----------|---------------|-------------|---------------|-----------|------------------|
| TXN_8915701   | ERROR | 2        | 1.5           | 3.0         | NaN           | In-store  | 2023-03-21       |
| TXN_6650263   | Tea   | 2        | 1.5           | UNKNOWN     | NaN           | Takeaway  | 2023-01-10       |
| TXN_3578141   | Cake  | 5        | NaN           | 15.0        | NaN           | Takeaway  | 2023-06-27       |


### 2) Range Errors

The range check verifies that the value of a numerical attribute lies between the minimum and maximum acceptable values for that attribute. Our range check is inclusive, meaning the provided minimum and maximum values are themselves accepted. Note that we also consider invalid data types and missing values as out of range.

There are three parameters.
- `test_attribute`: The column to perform the range check on.
    - There are three options, including `Quantity`, `Price Per Unit`, and `Total Spent`, as these are the only numerical attributes. 
- `minimum`: The minimum value of the range.
- `maximum`: The maximum value of the range.
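As an illustration of the inclusive range rule, here is a minimal vectorized sketch. It assumes a small made-up `Quantity` column and uses `pd.to_numeric` with `Series.between`, which is an alternative formulation and not the row-by-row filter implemented in this notebook.

```python
import pandas as pd

# Hypothetical sample values, including a placeholder string and a missing value
quantity = pd.Series(["2", "5", "ERROR", None, "4"], name="Quantity")

minimum, maximum = 1, 4

# Unparseable and missing values become NaN
numeric_quantity = pd.to_numeric(quantity, errors="coerce")

# between() is inclusive on both ends by default and returns False for NaN,
# so invalid data types and missing values count as out of range
in_range = numeric_quantity.between(minimum, maximum)
invalid_range = ~in_range

print(invalid_range.tolist())
```

Here `"4"` is accepted because the bounds are inclusive, while `"5"`, `"ERROR"`, and the missing value are all flagged.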

In [9]:
# Parameters to be edited by the user

# Valid attributes for the range check
attributes = ['Quantity', 'Price Per Unit', 'Total Spent']

# Attribute selection
test_attribute = 'Quantity'

# Minimum value of the range
minimum = 1

# Maximum value of the range
maximum = 4

In [10]:
# Error check

# Evaluates a single value against a given range
def range_filter(value, minimum, maximum):
    try:
        value = float(value)
    except Exception as e:
        return False
    return minimum <= value <= maximum

# Apply the function to the test attribute, setting out of range values to True
invalid_range = sales[test_attribute].apply(
    lambda attribute: not range_filter(attribute, minimum, maximum)
)

# Save the invalid rows
invalid_range_df = sales.loc[invalid_range]

# Print the number of rows with a value outside of the given range for the designated attribute
print(f"Number of rows where the {test_attribute} value is outside of the defined range ({minimum}, {maximum}): {invalid_range.sum()}\n")

# Display the first 3 rows with a value outside of the given range for the designated attribute
print(f"Examples of three rows where the {test_attribute} value is outside of the defined range ({minimum}, {maximum}):")
invalid_range_df.head(3)

Number of rows where the Quantity value is outside of the defined range (1, 4): 2492

Examples of three rows where the Quantity value is outside of the defined range (1, 4):


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
5,TXN_2602893,Smoothie,5,4.0,20.0,Credit Card,,2023-03-31
8,TXN_4717867,,5,3.0,15.0,,Takeaway,2023-07-28
9,TXN_2064365,Sandwich,5,4.0,20.0,,In-store,2023-12-31


#### Results

Below are the example results from running the range check on the `Quantity` attribute, with the minimum set to `1` and the maximum set to `4`.

There are 2492 rows where the `Quantity` value is out of range. Note that we also treat wrong data types as out of range, hence the inclusion of `ERROR` in our list of values that are out of range. We acknowledge that a maximum of `4` does not necessarily make sense in this context (people can obviously buy five sandwiches if they wish); we chose this value only to illustrate that the range check functions as intended. For examples of invalid rows, see the ten rows below:

| Transaction ID | Item      | Quantity | Price Per Unit | Total Spent | Payment Method | Location  | Transaction Date |
|---------------|----------|----------|---------------|-------------|---------------|-----------|------------------|
| TXN_2602893  | Smoothie | 5        | 4.0           | 20.0        | Credit Card   | NaN       | 2023-03-31       |
| TXN_4717867  | NaN      | 5        | 3.0           | 15.0        | NaN           | Takeaway  | 2023-07-28       |
| TXN_2064365  | Sandwich | 5        | 4.0           | 20.0        | NaN           | In-store  | 2023-12-31       |
| TXN_2548360  | Salad    | 5        | 5.0           | 25.0        | Cash          | Takeaway  | 2023-11-07       |
| TXN_9437049  | Cookie   | 5        | 1.0           | 5.0         | NaN           | Takeaway  | 2023-06-01       |
| TXN_8876618  | Cake     | 5        | 3.0           | 15.0        | Cash          | ERROR     | 2023-03-25       |
| TXN_3522028  | Smoothie | ERROR    | 4.0           | 20.0        | Cash          | In-store  | 2023-04-04       |
| TXN_9400181  | Sandwich | 5        | 4.0           | 20.0        | Cash          | In-store  | 2023-06-03       |
| TXN_5183041  | Cookie   | 5        | 1.0           | 5.0         | Credit Card   | In-store  | 2023-04-20       |
| TXN_8467949  | Smoothie | 5        | 4.0           | 20.0        | Credit Card   | NaN       | 2023-03-11       |



### 3) Format Errors

The format check ensures data follows a pre-defined format. For example, this test can check that:
- Transaction ID is stored in the correct format (TXN_1234567)
- Dates are stored in the correct format (YYYY-MM-DD)

There is one parameter, `test_attribute`, which lets you select which column you would like to perform the format check on. The two options include `Transaction ID` and `Transaction Date`, as these are the only columns that have a pre-defined format.

Regular expressions are used to assert the validity of the format.
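Note that a regular expression validates only the shape of a value: a string such as `2023-13-45` has the `YYYY-MM-DD` shape even though it is not a real date. The sketch below shows one possible stricter variant, assuming calendar validity is also of interest. The helper names `has_date_shape` and `is_real_date` are hypothetical and are not part of this notebook's check.

```python
import re
from datetime import datetime

# Shape of a YYYY-MM-DD date; fullmatch() anchors the pattern at both ends
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")

def has_date_shape(value):
    """True when the value is a string shaped like YYYY-MM-DD."""
    return isinstance(value, str) and DATE_SHAPE.fullmatch(value) is not None

def is_real_date(value):
    """True when the value has the right shape and exists on the calendar."""
    if not has_date_shape(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

for candidate in ["2023-09-08", "2023-13-45", "ERROR", float("nan")]:
    print(candidate, has_date_shape(candidate), is_real_date(candidate))
```

The format check in this notebook corresponds to the shape test only; the calendar test is an optional extension.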

**References:** <br>
Regex: https://www.w3schools.com/python/python_regex.asp

In [11]:
# Parameters to be edited by the user

# Valid attributes for the format check
attributes = ['Transaction ID', 'Transaction Date']

# Attribute selection
test_attribute = 'Transaction Date'

In [12]:
# Error Check

# Select the necessary regex based on the chosen test attribute
format_regex = r"^TXN_\d{7}$" if test_attribute == 'Transaction ID' else r"^\d{4}-\d{2}-\d{2}$"

# Evaluates a single value against a given regex format
def format_filter(value, format_regex):
    return False if not isinstance(value, str) else bool(re.findall(format_regex, value))

# Apply the function to the test attribute, setting invalid formats to True
invalid_format = sales[test_attribute].apply(
    lambda attribute: not format_filter(attribute, format_regex)
)

# Save the invalid rows
invalid_format_df = sales.loc[invalid_format]

# Print the number of rows with invalid formatting on the chosen test attribute
print(f"Number of rows where the {test_attribute} value has an invalid format: {invalid_format.sum()}\n")

# Display the first 3 rows with invalid formatting on the chosen test attribute
print(f"Examples of three rows where the {test_attribute} value has an invalid format:")
invalid_format_df.head(3)

Number of rows where the Transaction Date value has an invalid format: 460

Examples of three rows where the Transaction Date value has an invalid format:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
11,TXN_3051279,Sandwich,2,4.0,8.0,Credit Card,Takeaway,ERROR
29,TXN_7640952,Cake,4,3.0,12.0,Digital Wallet,Takeaway,ERROR
33,TXN_7710508,UNKNOWN,5,1.0,5.0,Cash,,ERROR


#### Results

Below are the example results from running the format check on the `Transaction Date` column.

There are 460 rows where the transaction date is in the wrong format. For example, see the ten rows below:

| Transaction ID | Item      | Quantity | Price Per Unit | Total Spent | Payment Method | Location  | Transaction Date |
|---------------|----------|----------|---------------|-------------|---------------|-----------|------------------|
| TXN_3051279  | Sandwich | 2        | 4.0           | 8.0         | Credit Card   | Takeaway  | ERROR            |
| TXN_7640952  | Cake     | 4        | 3.0           | 12.0        | Digital Wallet| Takeaway  | ERROR            |
| TXN_7710508  | UNKNOWN  | 5        | 1.0           | 5.0         | Cash          | NaN       | ERROR            |
| TXN_2091733  | Salad    | 1        | 5.0           | 5.0         | NaN           | In-store  | NaN              |
| TXN_7028009  | Cake     | 4        | 3.0           | 12.0        | NaN           | Takeaway  | ERROR            |
| TXN_7447872  | Juice    | 2        | NaN           | 6.0         | NaN           | NaN       | NaN              |
| TXN_1001832  | Salad    | 2        | 5.0           | 10.0        | Cash          | Takeaway  | UNKNOWN          |
| TXN_7943008  | Coffee   | 1        | 2.0           | 2.0         | Credit Card   | NaN       | ERROR            |
| TXN_1093800  | Sandwich | 3        | 4.0           | 12.0        | Cash          | Takeaway  | NaN              |
| TXN_6463132  | Cookie   | 5        | 1.0           | 5.0         | Credit Card   | Takeaway  | NaN              |

### 4) Consistency Errors

The consistency check validates that the data in a row follows some designated rule that involves multiple columns. An example of a rule is that `Quantity` * `Price Per Unit` == `Total Spent`.

There is one parameter:
- `rule`: The rule providing the logic to check. This should be an expression that can be evaluated on the row. Note that the name of the Series in the expression should be `row`, as we will be evaluating one row at a time.

We acknowledge that another approach would be to let users pick from a set of predefined rules. This would simplify the input process, as users would not need to input their own rules. However, we wanted to keep the consistency check extensible, allowing users to input their own rules. To match the simplicity of the predefined rules approach, we will provide example rules that users can input; these are the presets we would have chosen. This way, our checker is just as easy to use as one with presets, as the user can simply paste in the example rules, while maintaining flexibility.

The following are some examples of parameters:
- **Example 1:** Asserting that the total amount spent corresponds to the quantity purchased and the price per unit.
    - `rule`: `'row["Quantity"] * row["Price Per Unit"] == row["Total Spent"]'`
- **Example 2:** Asserting that the price of each `Item` is accurate
    - `rule`: `'(row["Item"] == "Coffee" and row["Price Per Unit"] == 2) or (row["Item"] == "Tea" and row["Price Per Unit"] == 1.5) or (row["Item"] == "Sandwich" and row["Price Per Unit"] == 4) or (row["Item"] == "Salad" and row["Price Per Unit"] == 5) or (row["Item"] == "Cake" and row["Price Per Unit"] == 3) or (row["Item"] == "Cookie" and row["Price Per Unit"] == 1) or (row["Item"] == "Smoothie" and row["Price Per Unit"] == 4) or (row["Item"] == "Juice" and row["Price Per Unit"] == 3)'`
- **Example 3:** Asserting that `Takeaway` purchases can only be made with `Credit Card` or `Digital Wallet`, and that `In-store` purchases are made with `Credit Card`, `Digital Wallet`, or `Cash`.
    - `rule`: `'((row["Location"] == "In-store") and (row["Payment Method"] in ["Credit Card", "Digital Wallet", "Cash"])) or ((row["Location"] == "Takeaway") and (row["Payment Method"] in ["Credit Card", "Digital Wallet"]))'`

Note that we are interpreting the location `Takeaway` to mean that the order was placed online, hence not being able to use cash.
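For comparison, the first two example rules can also be written as plain Python functions instead of strings. This is only a sketch of that alternative: the names `ITEM_PRICES`, `to_number`, `total_matches`, and `price_matches` are hypothetical, and the comparison uses `math.isclose` rather than `==`, which is a deliberate substitution to tolerate floating-point rounding.

```python
import math

# Price list copied from Example 2 above; the dictionary name is hypothetical
ITEM_PRICES = {
    "Coffee": 2, "Tea": 1.5, "Sandwich": 4, "Salad": 5,
    "Cake": 3, "Cookie": 1, "Smoothie": 4, "Juice": 3,
}

def to_number(value):
    """Return the value as a float, or None when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def total_matches(row):
    """Example 1: Quantity * Price Per Unit equals Total Spent."""
    quantity = to_number(row.get("Quantity"))
    price = to_number(row.get("Price Per Unit"))
    total = to_number(row.get("Total Spent"))
    if quantity is None or price is None or total is None:
        return False
    # isclose tolerates rounding error; a NaN operand never compares close
    return math.isclose(quantity * price, total)

def price_matches(row):
    """Example 2: the listed price equals the expected price for the item."""
    expected = ITEM_PRICES.get(row.get("Item"))
    price = to_number(row.get("Price Per Unit"))
    if expected is None or price is None:
        return False
    return math.isclose(price, expected)

print(total_matches({"Quantity": "2", "Price Per Unit": "2.0", "Total Spent": "4.0"}))
print(price_matches({"Item": "Tea", "Price Per Unit": "2.0"}))
```

Because a pandas row also supports `.get`, such a function could be applied with something like `sales.apply(lambda row: not total_matches(row), axis=1)`. The trade-off is that rules become code rather than text, which avoids evaluating arbitrary strings but is less convenient to paste in as a parameter.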

**References:** <br>
Eval: https://docs.python.org/3/library/functions.html#eval

In [13]:
# Parameters to be edited by the user

# Valid attributes for the consistency check (for reference when creating the rule, this is not actually used)
attributes = ["Item", "Quantity", "Price Per Unit", "Total Spent", "Payment Method", "Location", "Transaction Date"]

# The expression to evaluate, representing the rule applied
# In the rule expression, access any attribute with the syntax row['Attribute'], since we process one row at a time
rule = 'row["Quantity"] * row["Price Per Unit"] == row["Total Spent"]'

In [14]:
# Error check

# Evaluates a single row against a given rule
def consistency_filter(row, rule):
    try:
        # Convert the row to a dictionary, since eval() can use a dictionary
        row = row.to_dict()

        # Convert the numeric attributes to float to allow for expression evaluation
        for key, value in row.items():
            try:
                row[key] = float(value)
            except Exception as e:
                pass

        # Return the result of the expression
        return eval(rule)
    except Exception as e:
        # If the eval fails (for example, 2.0 * "error" will throw an error), return False
        return False

# Apply the function to each row, setting inconsistent rows to True
invalid_consistency = sales.apply(
    lambda row: not consistency_filter(row, rule),
    axis=1
)

# Save the invalid rows
invalid_consistency_df = sales.loc[invalid_consistency]

# Print the number of inconsistent rows based on the provided rule
print(f"Number of rows where the rule {rule} is invalid: {invalid_consistency.sum()}\n")

# Display the first 3 inconsistent rows based on the provided rule
print(f"Examples of three rows where the rule {rule} is invalid:")
invalid_consistency_df.head(3)

Number of rows where the rule row["Quantity"] * row["Price Per Unit"] == row["Total Spent"] is invalid: 1456

Examples of three rows where the rule row["Quantity"] * row["Price Per Unit"] == row["Total Spent"] is invalid:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
2,TXN_4271903,Cookie,4,1.0,ERROR,Credit Card,In-store,2023-07-19
20,TXN_3522028,Smoothie,ERROR,4.0,20.0,Cash,In-store,2023-04-04
25,TXN_7958992,Smoothie,3,4.0,,UNKNOWN,UNKNOWN,2023-12-13


#### Results

Below are the example results from running the consistency check with the `rule` set to `row['Quantity'] * row['Price Per Unit'] == row['Total Spent']`.

There are 1456 rows that are not consistent with the provided rule. Note that the dataset was created in a way where this rule is only ever violated by invalid or missing values, never by the numbers themselves. To demonstrate that the rule checker also catches rows whose actual values break the rule, we add a few such rows to an altered copy of the dataset, stored in a new DataFrame called `altered_sales`. Below are the first ten rows of results from the test on the original dataset. As discussed, they all include either missing data or invalid data.

| Transaction ID | Item      | Quantity | Price Per Unit | Total Spent | Payment Method  | Location  | Transaction Date |
|---------------|----------|----------|---------------|-------------|----------------|-----------|------------------|
| TXN_4271903   | Cookie   | 4        | 1.0           | ERROR       | Credit Card    | In-store  | 2023-07-19       |
| TXN_3522028   | Smoothie | ERROR    | 4.0           | 20.0        | Cash           | In-store  | 2023-04-04       |
| TXN_7958992   | Smoothie | 3        | 4.0           | NaN         | UNKNOWN        | UNKNOWN   | 2023-12-13       |
| TXN_8927252   | UNKNOWN  | 2        | 1.0           | ERROR       | Credit Card    | ERROR     | 2023-11-06       |
| TXN_6650263   | Tea      | 2        | 1.5           | UNKNOWN     | NaN            | Takeaway  | 2023-01-10       |
| TXN_5522862   | Cookie   | ERROR    | 1.0           | 2.0         | Credit Card    | Takeaway  | 2023-03-19       |
| TXN_3578141   | Cake     | 5        | NaN           | 15.0        | NaN            | Takeaway  | 2023-06-27       |
| TXN_2080895   | Cake     | UNKNOWN  | 3.0           | 3.0         | Digital Wallet | In-store  | 2023-04-19       |
| TXN_4987129   | Sandwich | 3        | NaN           | NaN         | NaN            | In-store  | 2023-10-20       |
| TXN_8501819   | Juice    | NaN      | 3.0           | 6.0         | Cash           | NaN       | 2023-03-30       |


In [15]:
# Prepare the altered dataset
altered_sales = sales.copy()

# Create three new rows where Quantity * Price Per Unit != Total Spent
inconsistent_rows = pd.DataFrame({
    "Transaction ID": ["TXN_1993289", "TXN_8472252", "TXN_9250024"],
    "Item": ["Sandwich", "Smoothie", "Cookie"],
    "Quantity": ["2", "1", "2"],
    "Price Per Unit": ["4.0", "4.0", "1.0"],
    "Total Spent": ["4.0", "7.0", "2.5"],
    "Payment Method": [np.nan, np.nan, "Digital Wallet"],
    "Location": ["In-store", np.nan, np.nan],
    "Transaction Date": ["2023-04-18", "2023-02-04", "2023-03-21"]
})

# Append the inconsistent rows to the end of the altered dataset
altered_sales = pd.concat([altered_sales, inconsistent_rows], ignore_index=True)
altered_sales.tail()

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
9998,TXN_7695629,Cookie,3,,3.0,Digital Wallet,,2023-12-02
9999,TXN_6170729,Sandwich,3,4.0,12.0,Cash,In-store,2023-11-07
10000,TXN_1993289,Sandwich,2,4.0,4.0,,In-store,2023-04-18
10001,TXN_8472252,Smoothie,1,4.0,7.0,,,2023-02-04
10002,TXN_9250024,Cookie,2,1.0,2.5,Digital Wallet,,2023-03-21


In [16]:
# Error check on the altered dataset

# Set the rule
rule = "row['Quantity'] * row['Price Per Unit'] == row['Total Spent']"

# Apply the function to each row, setting inconsistent rows to True
invalid_consistency = altered_sales.apply(
    lambda row: not consistency_filter(row, rule),
    axis=1
)

# Save the invalid rows
invalid_consistency_df = altered_sales.loc[invalid_consistency]

# Print the number of inconsistent rows based on the provided rule
print(f"Number of rows where the rule {rule} is invalid: {invalid_consistency.sum()}\n")

# Display the last 3 inconsistent rows (the rows added to the altered dataset)
print(f"Examples of three rows where the rule {rule} is invalid:")
invalid_consistency_df.tail(3)

Number of rows where the rule row['Quantity'] * row['Price Per Unit'] == row['Total Spent'] is invalid: 1459

Examples of three rows where the rule row['Quantity'] * row['Price Per Unit'] == row['Total Spent'] is invalid:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
10000,TXN_1993289,Sandwich,2,4.0,4.0,,In-store,2023-04-18
10001,TXN_8472252,Smoothie,1,4.0,7.0,,,2023-02-04
10002,TXN_9250024,Cookie,2,1.0,2.5,Digital Wallet,,2023-03-21


#### Results on Altered Dataset

Below are the example results from running the consistency check on the altered dataset with the `rule` set to `row['Quantity'] * row['Price Per Unit'] == row['Total Spent']`.

As expected, there are now 1459 invalid rows, 3 more than on the original dataset. Below are the last three invalid rows, demonstrating that the check successfully filtered out the added rows that are inconsistent with the rule.

| Transaction ID | Item      | Quantity | Price Per Unit | Total Spent | Payment Method  | Location  | Transaction Date |
|---------------|----------|----------|---------------|-------------|----------------|-----------|------------------|
| TXN_1993289   | Sandwich | 2        | 4.0           | 4.0         | NaN            | In-store  | 2023-04-18       |
| TXN_8472252   | Smoothie | 1        | 4.0           | 7.0         | NaN            | NaN       | 2023-02-04       |
| TXN_9250024   | Cookie   | 2        | 1.0           | 2.5         | Digital Wallet | NaN       | 2023-03-21       |

### 5) Uniqueness Errors

The uniqueness check ensures that each value in a column is unique.

There is one parameter:
- `test_attribute`: The column to perform the uniqueness check on.
    - There is only one option, `Transaction ID`, since that is the only column that is meant to be unique.
 
Note that the uniqueness check runs on any column; we list only `Transaction ID` because it is the only attribute that is meant to be unique. Running the test on another attribute flags every row whose value occurs more than once in that column.

**References:** <br>
Accessing a Specific Value From Value Counts: https://stackoverflow.com/questions/35277075/python-pandas-counting-the-occurrences-of-a-specific-value

In [17]:
# Parameters to be edited by the user

# Valid attributes for the uniqueness check
# The uniqueness check would run successfully on any column, but this is the only column that should be unique, thus it is the only column included in our list of valid attributes.
attributes = ['Transaction ID']

# Attribute selection
test_attribute = 'Transaction ID'

In [18]:
# Error Check

# Store a series of the counts of each value in the chosen column
attribute_series_counts = sales[test_attribute].value_counts()

# Evaluates a single value, checking if it is unique in the chosen column
def uniqueness_filter(value, counts):
    if pd.isna(value):
        return False
    if counts[value] == 1:
        return True
    return False

# Apply the function to the test attribute, setting rows with non-unique values in the designated column to True
invalid_uniqueness = sales[test_attribute].apply(
    lambda attribute: not uniqueness_filter(attribute, attribute_series_counts)
)

# Save the invalid rows
invalid_uniqueness_df = sales.loc[invalid_uniqueness]

# Print the number of rows with a value that is not unique in the chosen column
print(f"Number of rows where the {test_attribute} value is not unique: {invalid_uniqueness.sum()}\n")

# Display the first 3 rows with a value that is not unique in the chosen column
print(f"Examples of three rows where the {test_attribute} value is not unique:")
invalid_uniqueness_df.head(3)

Number of rows where the Transaction ID value is not unique: 0

Examples of three rows where the Transaction ID value is not unique:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date


#### Results

The `Transaction ID` column contains no duplicates, so no rows fail the uniqueness check for this attribute.

We will add a few rows to an altered version of the dataset with duplicate `Transaction ID` values, in a new DataFrame called `altered_sales`, to demonstrate that the uniqueness checker properly identifies duplicates. 

**References:** <br>
Add Rows to DF: https://www.geeksforgeeks.org/how-to-add-one-row-in-an-existing-pandas-dataframe/

In [19]:
# Prepare the altered dataset
altered_sales = sales.copy()

# Create new rows with duplicate transaction ids
duplicate_transaction_ids = pd.DataFrame({
    "Transaction ID": ["TXN_1535311", "TXN_1222338", "TXN_6842808"],
    "Item": ["Coffee", "Cookie", "Sandwich"],
    "Quantity": ["2", "4", "2"],
    "Price Per Unit": ["2.0", "1.0", "4.0"],
    "Total Spent": ["4.0", "3.0", "8.0"],
    "Payment Method": ["Cash", "Cash", "Cash"],
    "Location": ["Takeaway", "Takeaway", "Takeaway"],
    "Transaction Date": ["2023-09-08", "2023-10-08", "2023-09-10"]
})

altered_sales = pd.concat([altered_sales, duplicate_transaction_ids], ignore_index=True)
altered_sales.tail()

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
9998,TXN_7695629,Cookie,3,,3.0,Digital Wallet,,2023-12-02
9999,TXN_6170729,Sandwich,3,4.0,12.0,Cash,In-store,2023-11-07
10000,TXN_1535311,Coffee,2,2.0,4.0,Cash,Takeaway,2023-09-08
10001,TXN_1222338,Cookie,4,1.0,3.0,Cash,Takeaway,2023-10-08
10002,TXN_6842808,Sandwich,2,4.0,8.0,Cash,Takeaway,2023-09-10


In [20]:
# Error check on altered dataset

# Store a series of the counts of each value in the chosen column
attribute_series_counts = altered_sales[test_attribute].value_counts()

# Apply the function to the test attribute, setting rows with non-unique values in the designated column to True
invalid_uniqueness = altered_sales[test_attribute].apply(
    lambda attribute: not uniqueness_filter(attribute, attribute_series_counts)
)

# Save the invalid rows
invalid_uniqueness_df = altered_sales.loc[invalid_uniqueness]

# Print the number of rows with a value that is not unique in the chosen column
print(f"Number of rows where the {test_attribute} value is not unique: {invalid_uniqueness.sum()}\n")

# Display the first 3 rows with a value that is not unique in the chosen column
print(f"Examples of three rows where the {test_attribute} value is not unique:")
invalid_uniqueness_df.head(3)

Number of rows where the Transaction ID value is not unique: 6

Examples of three rows where the Transaction ID value is not unique:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
116,TXN_1535311,Juice,3,3.0,9.0,Cash,,2023-03-16
306,TXN_1222338,Cookie,1,1.0,1.0,,In-store,2023-10-11
521,TXN_6842808,Tea,2,1.5,3.0,,UNKNOWN,2023-10-22


#### Results on Altered Dataset

Below are the example results from running the uniqueness check on the `Transaction ID` column on the altered dataset, which has three new rows with duplicate `Transaction ID` values.

As expected, there are 6 rows where the `Transaction ID` value is not unique: the three added rows plus the three original rows whose identifiers they reuse. Note that the check retrieves both the first occurrence and the duplicate occurrences of each value. See the invalid rows below.

| Transaction ID | Item     | Quantity | Price Per Unit | Total Spent | Payment Method | Location  | Transaction Date |
|---------------|---------|----------|---------------|-------------|---------------|-----------|------------------|
| TXN_1535311  | Juice   | 3        | 3.0           | 9.0         | Cash          | NaN       | 2023-03-16       |
| TXN_1222338  | Cookie  | 1        | 1.0           | 1.0         | NaN           | In-store  | 2023-10-11       |
| TXN_6842808  | Tea     | 2        | 1.5           | 3.0         | NaN           | UNKNOWN   | 2023-10-22       |
| TXN_1535311  | Coffee  | 2        | 2.0           | 4.0         | Cash          | Takeaway  | 2023-09-08       |
| TXN_1222338  | Cookie  | 4        | 1.0           | 3.0         | Cash          | Takeaway  | 2023-10-08       |
| TXN_6842808  | Sandwich | 2       | 4.0           | 8.0         | Cash          | Takeaway  | 2023-09-10       |
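
The behaviour noted above (flagging the first occurrence as well as the later ones) can also be obtained without counting values. The following is a minimal sketch on a hypothetical toy Series (`toy_ids` and `invalid_uniqueness_toy` are illustrative names); it uses `Series.duplicated(keep=False)` and, like `uniqueness_filter`, treats missing identifiers as invalid.

```python
import pandas as pd

# Hypothetical identifiers: one repeated value and one missing value
toy_ids = pd.Series(["TXN_1", "TXN_2", "TXN_1", None, "TXN_3"], name="Transaction ID")

# keep=False marks every member of a duplicated group, including the first occurrence
non_unique = toy_ids.duplicated(keep=False)

# Treat missing identifiers as invalid too, mirroring uniqueness_filter
invalid_uniqueness_toy = non_unique | toy_ids.isna()

print(invalid_uniqueness_toy.tolist())
```

With the default `keep="first"`, only the later copies would be flagged, so the first row in this toy example would pass the check.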


### 6) Presence Errors

The presence check validates that values are not missing in a designated column.

There is one parameter:
- `test_attribute`: The column to perform the presence check on.
    - Every column is valid for this check, as none of the columns should have missing values.
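
Since every column is a valid target, it can be handy to look at all of them at once before picking a `test_attribute`. The following is a minimal sketch on hypothetical toy data (`toy_sales`, `invalid_presence_toy`, and `missing_per_column` are names introduced here for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two columns
toy_sales = pd.DataFrame({
    "Item": ["Coffee", np.nan, "Tea"],
    "Location": [np.nan, np.nan, "In-store"],
})

# Boolean Series: True where the chosen column is missing
invalid_presence_toy = toy_sales["Item"].isna()

# Missing-value count for every column at once
missing_per_column = toy_sales.isna().sum()

print(missing_per_column)
```

Note that `isna()` only detects true missing values; placeholder strings such as `ERROR` and `UNKNOWN` are present values and are not counted here.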

In [21]:
# Parameters to be edited by the user

# Valid attributes for the presence check
attributes = ["Transaction ID", "Item", "Quantity", "Price Per Unit", "Total Spent", "Payment Method", "Location", "Transaction Date"]

# Attribute selection
test_attribute = "Item"

In [22]:
# Error check

# Apply pd.isna() to the test attribute, setting rows with missing values in the designated column to True
invalid_presence = sales[test_attribute].apply(lambda attribute: pd.isna(attribute))

# Save the invalid rows
invalid_presence_df = sales.loc[invalid_presence]

# Print the number of rows with a missing value in the chosen test attribute
print(f"Number of rows where the {test_attribute} value is missing: {invalid_presence.sum()}\n")

# Display the first 3 rows with a missing value in the chosen test attribute
print(f"Examples of three rows where the {test_attribute} value is missing:")
invalid_presence_df.head(3)

Number of rows where the Item value is missing: 333

Examples of three rows where the Item value is missing:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
8,TXN_4717867,,5,3.0,15.0,,Takeaway,2023-07-28
30,TXN_1736287,,5,2.0,10.0,Digital Wallet,,2023-06-02
61,TXN_8051289,,1,3.0,3.0,,In-store,2023-10-09


#### Results

Below are the example results from running the presence check on the `Item` attribute.

There are 333 rows where the `Item` value is missing. For examples of invalid rows, see the ten rows below:

| Transaction ID | Item | Quantity | Price Per Unit | Total Spent | Payment Method  | Location  | Transaction Date |
|---------------|------|----------|---------------|-------------|----------------|-----------|------------------|
| TXN_4717867  | NaN  | 5        | 3.0           | 15.0        | NaN            | Takeaway  | 2023-07-28       |
| TXN_1736287  | NaN  | 5        | 2.0           | 10.0        | Digital Wallet | NaN       | 2023-06-02       |
| TXN_8051289  | NaN  | 1        | 3.0           | 3.0         | NaN            | In-store  | 2023-10-09       |
| TXN_6044979  | NaN  | 1        | 1.0           | 1.0         | Cash           | In-store  | 2023-12-08       |
| TXN_4132730  | NaN  | 5        | 1.0           | 5.0         | NaN            | In-store  | 2023-03-12       |
| TXN_9517146  | NaN  | 5        | 5.0           | 25.0        | Cash           | Takeaway  | 2023-10-30       |
| TXN_4031509  | NaN  | 4        | NaN           | 16.0        | Credit Card    | Takeaway  | 2023-01-04       |
| TXN_3494565  | NaN  | 2        | 4.0           | 8.0         | ERROR          | NaN       | 2023-07-10       |
| TXN_5115080  | NaN  | 5        | 3.0           | 15.0        | Credit Card    | NaN       | 2023-02-18       |
| TXN_8964522  | NaN  | 3        | 3.0           | 9.0         | Credit Card    | Takeaway  | 2023-10-28       |


### 7) Length Errors

The length check validates that the number of characters in a value is within a designated range. Please note that our length check is inclusive, meaning that values whose length equals the provided minimum or maximum are accepted. Also note that invalid data types and missing values fail the length check.

There are three parameters:
- `test_attribute`: The column to perform the length check on.
    - There are five options, `Transaction ID`, `Item`, `Payment Method`, `Location`, and `Transaction Date`, as they are strings with a length that can be checked.
- `minimum`: The minimum accepted length.
- `maximum`: The maximum accepted length.

Note that some attributes, like `Transaction ID` and `Transaction Date`, should have an exact length. In these cases, simply set the `minimum` and `maximum` parameters to the exact length required. For example, a `Transaction ID` must be 11 characters, so set both `minimum` and `maximum` to 11.
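
The inclusive range and the handling of missing or non-string values described above can be illustrated with a vectorized sketch. This is not the notebook's implementation, only an equivalent formulation on a hypothetical toy Series (`toy_payment` and `invalid_length_toy` are illustrative names), using `Series.str.len()` and `Series.between()`.

```python
import numpy as np
import pandas as pd

# Hypothetical values: two valid strings, a missing value, a placeholder, and a non-string
toy_payment = pd.Series(
    ["Cash", "Digital Wallet", np.nan, "ERROR", 5],
    name="Payment Method",
    dtype=object,
)

minimum, maximum = 4, 13

# .str.len() yields NaN for missing values and for non-string entries such as the integer 5
lengths = toy_payment.str.len()

# between() is inclusive on both ends by default and returns False for NaN
invalid_length_toy = ~lengths.between(minimum, maximum)

print(invalid_length_toy.tolist())
```

Here `Cash` (4 characters) and `ERROR` (5 characters) pass, while `Digital Wallet` (14 characters), the missing value, and the integer all fail.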

In [23]:
# Parameters to be edited by the user

# Valid attributes for the length check
attributes = ["Transaction ID", "Item", "Payment Method", "Location", "Transaction Date"]

# Attribute selection
test_attribute = "Payment Method"

# Minimum value of the length
minimum = 4 # Cash

# Maximum value of the length
maximum = 14 # Digital Wallet

In [24]:
# Error check

# Evaluates a single value's length against the given valid range of lengths
def length_filter(value, minimum_length, maximum_length):
    if pd.isna(value) or not isinstance(value, str):
        return False
    return minimum_length <= len(value) <= maximum_length

# Apply the function to the test attribute, setting out of range length values to True
invalid_length = sales[test_attribute].apply(
    lambda attribute: not length_filter(attribute, minimum, maximum)
)

# Save the invalid rows
invalid_length_df = sales.loc[invalid_length]

# Print the number of rows with a length value outside of the given valid length range for the designated attribute
initial_invalid_count = invalid_length.sum()
print(f"Number of rows where the {test_attribute} value's length is outside of the defined range of valid lengths ({minimum}, {maximum}): {initial_invalid_count}\n")

# Display the first 3 rows with a length value outside of the given valid length range for the designated attribute
print(f"Examples of three rows where the {test_attribute} value's length is outside of the defined range of valid lengths ({minimum}, {maximum}):")
invalid_length_df.head(3)

Number of rows where the Payment Method value's length is outside of the defined range of valid lengths (4, 14): 2579

Examples of three rows where the Payment Method value's length is outside of the defined range of valid lengths (4, 14):


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
8,TXN_4717867,,5,3.0,15.0,,Takeaway,2023-07-28
9,TXN_2064365,Sandwich,5,4.0,20.0,,In-store,2023-12-31
13,TXN_9437049,Cookie,5,1.0,5.0,,Takeaway,2023-06-01


#### Results

Below are the example results from running the length check on the `Payment Method` attribute, with the minimum set to `4` (corresponding to `Cash`), and the maximum set to `14` (corresponding to `Digital Wallet`). 

There are 2579 rows where the length of the `Payment Method` value is outside the accepted range. Note that we chose the minimum and maximum from the lengths of the shortest and longest valid values in the column, and no string in the column falls outside this range: even the invalid values `ERROR` (5 characters) and `UNKNOWN` (7 characters) fall within it. Thus, the only values that fail this length check are missing values. To show that the check actually filters strings based on their length, and doesn't exclusively filter out missing values, we provide a second batch of results where the maximum length is set to `13`, which intentionally filters out every occurrence of `Digital Wallet`. Below are the first ten rows of results from the test using a minimum of `4` and a maximum of `14`. As discussed, the `Payment Method` values are all NaN.

| Transaction ID | Item      | Quantity | Price Per Unit | Total Spent | Payment Method | Location  | Transaction Date |
|---------------|----------|----------|---------------|-------------|---------------|-----------|------------------|
| TXN_4717867  | NaN      | 5        | 3.0           | 15.0        | NaN           | Takeaway  | 2023-07-28       |
| TXN_2064365  | Sandwich | 5        | 4.0           | 20.0        | NaN           | In-store  | 2023-12-31       |
| TXN_9437049  | Cookie   | 5        | 1.0           | 5.0         | NaN           | Takeaway  | 2023-06-01       |
| TXN_8915701  | ERROR    | 2        | 1.5           | 3.0         | NaN           | In-store  | 2023-03-21       |
| TXN_3765707  | Sandwich | 1        | 4.0           | 4.0         | NaN           | NaN       | 2023-06-10       |
| TXN_2616390  | Sandwich | 2        | 4.0           | 8.0         | NaN           | NaN       | 2023-09-18       |
| TXN_9677376  | Smoothie | 4        | 4.0           | 16.0        | NaN           | In-store  | 2023-08-15       |
| TXN_6855453  | UNKNOWN  | 4        | 3.0           | 12.0        | NaN           | In-store  | 2023-07-17       |
| TXN_2655815  | Smoothie | 4        | 4.0           | 16.0        | NaN           | Takeaway  | 2023-06-08       |
| TXN_2083138  | Smoothie | 3        | 4.0           | 12.0        | NaN           | In-store  | 2023-04-17       |


In [25]:
# Running the test with a maximum length of 13 to intentionally filter out Digital Wallet

# Minimum value of the length
minimum = 4 # Cash

# Maximum value of the length
maximum = 13 # Filter out Digital Wallet

# Apply the function to the test attribute, setting out of range length values to True
invalid_length = sales[test_attribute].apply(
    lambda attribute: not length_filter(attribute, minimum, maximum)
)

# Save the invalid rows
invalid_length_df = sales.loc[invalid_length]

# Print the number of rows with a length value outside of the given valid length range for the designated attribute
updated_invalid_count = invalid_length.sum()
print(f"Number of rows where the {test_attribute} value's length is outside of the defined range of valid lengths ({minimum}, {maximum}): {updated_invalid_count}")

# Display the number of occurrences of Digital Wallet
print(f"Number of occurrences of Digital Wallet: {sales['Payment Method'].value_counts()['Digital Wallet']}")
print(f"Note that the number of occurrences of Digital Wallet ({sales['Payment Method'].value_counts()['Digital Wallet']}) \
+ the previous number of values of invalid length ({initial_invalid_count}) = {updated_invalid_count}. Thus, the check \
successfully captures Digital Wallet due to its length.\n")

# Display the first 3 rows with a length value outside of the given valid length range for the designated attribute
print(f"Examples of three rows where the {test_attribute} value's length is outside of the defined range of valid lengths ({minimum}, {maximum})")
invalid_length_df.head(3)

Number of rows where the Payment Method value's length is outside of the defined range of valid lengths (4, 13): 4870
Number of occurrences of Digital Wallet: 2291
Note that the number of occurrences of Digital Wallet (2291) + the previous number of values of invalid length (2579) = 4870. Thus, the check successfully captures Digital Wallet due to its length.

Examples of three rows where the Payment Method value's length is outside of the defined range of valid lengths (4, 13)


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
4,TXN_3160411,Coffee,2,2.0,4.0,Digital Wallet,In-store,2023-06-11
8,TXN_4717867,,5,3.0,15.0,,Takeaway,2023-07-28
9,TXN_2064365,Sandwich,5,4.0,20.0,,In-store,2023-12-31


#### Results (Filtering Out Digital Wallet)

Below are the example results from running the length check on the `Payment Method` attribute, with the minimum set to `4` (corresponding to `Cash`), and the maximum set to `13` (filtering out `Digital Wallet` based on length). 

There are 4870 rows where the length of the `Payment Method` value is outside the accepted range. The count has increased by exactly the number of occurrences of `Digital Wallet` (2579 + 2291 = 4870), so the length check successfully filtered out `Digital Wallet` based on its length. Below are the first ten rows of results from the test using a minimum of `4` and a maximum of `13`. As discussed, they include `Digital Wallet`.

| Transaction ID | Item      | Quantity | Price Per Unit | Total Spent | Payment Method  | Location  | Transaction Date |
|---------------|----------|----------|---------------|-------------|----------------|-----------|------------------|
| TXN_3160411  | Coffee   | 2        | 2.0           | 4.0         | Digital Wallet | In-store  | 2023-06-11       |
| TXN_4717867  | NaN      | 5        | 3.0           | 15.0        | NaN            | Takeaway  | 2023-07-28       |
| TXN_2064365  | Sandwich | 5        | 4.0           | 20.0        | NaN            | In-store  | 2023-12-31       |
| TXN_9437049  | Cookie   | 5        | 1.0           | 5.0         | NaN            | Takeaway  | 2023-06-01       |
| TXN_8915701  | ERROR    | 2        | 1.5           | 3.0         | NaN            | In-store  | 2023-03-21       |
| TXN_3765707  | Sandwich | 1        | 4.0           | 4.0         | NaN            | NaN       | 2023-06-10       |
| TXN_5132361  | Sandwich | 3        | 4.0           | 12.0        | Digital Wallet | Takeaway  | 2023-12-01       |
| TXN_2616390  | Sandwich | 2        | 4.0           | 8.0         | NaN            | NaN       | 2023-09-18       |
| TXN_7640952  | Cake     | 4        | 3.0           | 12.0        | Digital Wallet | Takeaway  | ERROR            |
| TXN_1736287  | NaN      | 5        | 2.0           | 10.0        | Digital Wallet | NaN       | 2023-06-02       |


### 8) Look-up Errors

The look-up check ensures that a given value exists in a pre-defined finite set of values.

There are two parameters:
- `test_attribute`: The column to perform the look-up check on.
    - Valid options include `Item`, `Payment Method`, `Price Per Unit`, and `Location`, as these are the four features that must only have specific values.
- `look_up_table`: The list of acceptable values. The following points describe what the value of `look_up_table` should be depending on the selected `test_attribute`.
    - `Item`: `["Coffee", "Tea", "Sandwich", "Salad", "Cake", "Cookie", "Smoothie", "Juice"]`
    - `Price Per Unit`: `["2.0", "1.5", "4.0", "5.0", "3.0", "1.0"]`
    - `Payment Method`: `["Digital Wallet", "Credit Card", "Cash"]`
    - `Location`: `["Takeaway", "In-store"]`
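
For reference, the same membership test can be sketched with `Series.isin()`. This is a minimal sketch on a hypothetical toy Series (`toy_items` and `invalid_look_up_toy` are illustrative names), not a replacement for the cells below.

```python
import numpy as np
import pandas as pd

# Hypothetical values: valid items, placeholders, and a missing value
toy_items = pd.Series(["Coffee", "UNKNOWN", np.nan, "Tea", "ERROR"], name="Item")

look_up_table = ["Coffee", "Tea", "Sandwich", "Salad", "Cake", "Cookie", "Smoothie", "Juice"]

# isin() returns False for values that are absent from the table, including NaN
invalid_look_up_toy = ~toy_items.isin(look_up_table)

print(invalid_look_up_toy.tolist())
```

Whichever form is used, the entries of `look_up_table` need to have the same type as the column's values. The `Price Per Unit` table above uses strings such as `"2.0"`, which assumes the column is stored as strings rather than floats.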

In [26]:
# Parameters to be edited by the user

# Valid attributes for the look-up check
attributes = ["Item", "Price Per Unit", "Payment Method", "Location"]

# Attribute selection
test_attribute = "Item"

# Look-up table of valid values
look_up_table = ["Coffee", "Tea", "Sandwich", "Salad", "Cake", "Cookie", "Smoothie", "Juice"]

In [27]:
# Error check

# Checks if a single value is in the look-up table
def look_up_filter(value, look_up_table):
    return value in look_up_table

# Apply the function to the test attribute, setting rows whose value is not in the look-up table to True
invalid_look_up = sales[test_attribute].apply(
    lambda attribute: not look_up_filter(attribute, look_up_table)
)

# Save the invalid rows
invalid_look_up_df = sales.loc[invalid_look_up]

# Print the number of rows with a value that is not in the look-up table for the designated attribute
print(f"Number of rows where the {test_attribute} value is not in the look-up table ({look_up_table}): {invalid_look_up.sum()}\n")

# Display the first 3 rows with a value that is not in the look-up table for the designated attribute
print(f"Examples of three rows where the {test_attribute} value is not in the look-up table ({look_up_table}):")
invalid_look_up_df.head(3)

Number of rows where the Item value is not in the look-up table (['Coffee', 'Tea', 'Sandwich', 'Salad', 'Cake', 'Cookie', 'Smoothie', 'Juice']): 969

Examples of three rows where the Item value is not in the look-up table (['Coffee', 'Tea', 'Sandwich', 'Salad', 'Cake', 'Cookie', 'Smoothie', 'Juice']):


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
6,TXN_4433211,UNKNOWN,3,3.0,9.0,ERROR,Takeaway,2023-10-06
8,TXN_4717867,,5,3.0,15.0,,Takeaway,2023-07-28
14,TXN_8915701,ERROR,2,1.5,3.0,,In-store,2023-03-21


#### Results

Below are the example results from running the look-up check on the `Item` attribute, with the `look_up_table` set to `["Coffee", "Tea", "Sandwich", "Salad", "Cake", "Cookie", "Smoothie", "Juice"]`, as this is the set of valid `Item` values.

There are 969 rows where the `Item` value is not in the look-up table. For examples of invalid rows, see the ten rows below:

| Transaction ID | Item     | Quantity | Price Per Unit | Total Spent | Payment Method  | Location  | Transaction Date |
|---------------|---------|----------|---------------|-------------|----------------|-----------|------------------|
| TXN_4433211  | UNKNOWN | 3        | 3.0           | 9.0         | ERROR          | Takeaway  | 2023-10-06       |
| TXN_4717867  | NaN     | 5        | 3.0           | 15.0        | NaN            | Takeaway  | 2023-07-28       |
| TXN_8915701  | ERROR   | 2        | 1.5           | 3.0         | NaN            | In-store  | 2023-03-21       |
| TXN_1736287  | NaN     | 5        | 2.0           | 10.0        | Digital Wallet | NaN       | 2023-06-02       |
| TXN_8927252  | UNKNOWN | 2        | 1.0           | ERROR       | Credit Card    | ERROR     | 2023-11-06       |
| TXN_7710508  | UNKNOWN | 5        | 1.0           | 5.0         | Cash           | NaN       | ERROR            |
| TXN_6855453  | UNKNOWN | 4        | 3.0           | 12.0        | NaN            | In-store  | 2023-07-17       |
| TXN_8914892  | UNKNOWN | 5        | 5.0           | 25.0        | Digital Wallet | NaN       | 2023-03-15       |
| TXN_8051289  | NaN     | 1        | 3.0           | 3.0         | NaN            | In-store  | 2023-10-09       |
| TXN_9099694  | UNKNOWN | 3        | 5.0           | 15.0        | NaN            | Takeaway  | 2023-11-18       |

### 9) Exact Duplicate Errors

The exact duplicate check validates that there are no rows that are identical over all columns.

This check does not take any parameters, because it must check all columns by definition.

**References:** <br>
Exact Duplicates: https://uottawa.brightspace.com/d2l/le/content/490358/viewContent/6620388/View (Slide 27)

In [28]:
# This check does not take any parameters

In [29]:
# Error check

# Apply the .duplicated method to the DataFrame to create a Series, with exact duplicates set to True
duplicates = sales.duplicated()

# Save the invalid rows
invalid_exact_duplicate_df = sales.loc[duplicates]

# Print the number of rows that are exact duplicates
print(f"Number of duplicate rows: {duplicates.sum()}\n")

# Display the first 3 rows that are exact duplicates
print("Examples of three duplicate rows:")
invalid_exact_duplicate_df.head(3)

Number of duplicate rows: 0

Examples of three duplicate rows:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date


#### Results

There are no exact duplicates. This makes sense, since we have previously validated that the `Transaction ID` column is unique.

We will add a few rows that are exact duplicates of existing rows to an altered version of the dataset, in a new DataFrame called `altered_sales`, to demonstrate that the exact duplicates checker properly identifies duplicates. 

In [30]:
# Prepare the altered dataset
altered_sales = sales.copy()

# Create three new rows that are exact duplicates of existing rows
duplicate_transaction_ids = pd.DataFrame({
    "Transaction ID": ["TXN_1993289", "TXN_8472252", "TXN_9250024"],
    "Item": ["Sandwich", "Smoothie", "Cookie"],
    "Quantity": ["2", "1", "2"],
    "Price Per Unit": ["4.0", "4.0", "1.0"],
    "Total Spent": ["8.0", "4.0", "2.0"],
    "Payment Method": [np.nan, np.nan, "Digital Wallet"],
    "Location": ["In-store", np.nan, np.nan],
    "Transaction Date": ["2023-04-18", "2023-02-04", "2023-03-21"]
})

altered_sales = pd.concat([altered_sales, duplicate_transaction_ids], ignore_index=True)
altered_sales.tail()

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
9998,TXN_7695629,Cookie,3,,3.0,Digital Wallet,,2023-12-02
9999,TXN_6170729,Sandwich,3,4.0,12.0,Cash,In-store,2023-11-07
10000,TXN_1993289,Sandwich,2,4.0,8.0,,In-store,2023-04-18
10001,TXN_8472252,Smoothie,1,4.0,4.0,,,2023-02-04
10002,TXN_9250024,Cookie,2,1.0,2.0,Digital Wallet,,2023-03-21


In [31]:
# Error check on altered dataset

# Apply the .duplicated method to the DataFrame to create a Series, with exact duplicates set to True
duplicates = altered_sales.duplicated()

# Save the invalid rows
invalid_exact_duplicate_df = altered_sales.loc[duplicates]

# Print the number of rows that are exact duplicates
print(f"Number of duplicate rows: {duplicates.sum()}\n")

# Display the first 3 rows that are exact duplicates
print("Examples of three duplicate rows:")
invalid_exact_duplicate_df.head(3)

Number of duplicate rows: 3

Examples of three duplicate rows:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
10000,TXN_1993289,Sandwich,2,4.0,8.0,,In-store,2023-04-18
10001,TXN_8472252,Smoothie,1,4.0,4.0,,,2023-02-04
10002,TXN_9250024,Cookie,2,1.0,2.0,Digital Wallet,,2023-03-21


#### Results on Altered Dataset

Below are the example results from running the exact duplicates check on the altered dataset, which has three new duplicate rows.

As expected, there are 3 duplicate rows. Note that this only shows the rows that are duplicates, and not the first occurrence of the rows. See the invalid rows below.

| Transaction ID | Item      | Quantity | Price Per Unit | Total Spent | Payment Method  | Location  | Transaction Date |
|---------------|----------|----------|---------------|-------------|----------------|-----------|------------------|
| TXN_1993289  | Sandwich | 2        | 4.0           | 8.0         | NaN            | In-store  | 2023-04-18       |
| TXN_8472252  | Smoothie | 1        | 4.0           | 4.0         | NaN            | NaN       | 2023-02-04       |
| TXN_9250024  | Cookie   | 2        | 1.0           | 2.0         | Digital Wallet | NaN       | 2023-03-21       |
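
Two details of `DataFrame.duplicated()` explain the results above: missing values in the same position are treated as equal, and by default only the later copies are flagged. The following minimal sketch shows both on hypothetical toy data (`toy_sales`, `later_copies`, and `all_members` are illustrative names).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical, including a missing Payment Method
toy_sales = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_1"],
    "Payment Method": [np.nan, "Cash", np.nan],
})

# Default keep="first": only the later copy is flagged
later_copies = toy_sales.duplicated()

# keep=False: every member of a duplicated group is flagged, including the first occurrence
all_members = toy_sales.duplicated(keep=False)

print(later_copies.tolist())
print(all_members.tolist())
```

Passing `keep=False` would be one way to display the original rows alongside their copies, if that view is preferred.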

### 10) Near Duplicate Errors

The near duplicate check identifies rows that differ only by a synonym in a designated attribute. Such rows can result from different names being used for the same object.

There are two parameters:
- `test_attribute`: The column to perform the near-duplicate check on.
    - Valid options include `Item`, `Payment Method`, and `Location`, as these are the three features whose values could have synonyms.
- `synonym_dict`: A dictionary whose keys are the attribute values that may have synonyms, and whose values are the lists of corresponding synonyms. The following points give some examples of what could be included in `synonym_dict` based on the selected `test_attribute`.
    - `Item`: 
        - `"Smoothie": ["Shake", "Fruit Blend"]`
        - `"Cookie": ["Biscuit", "Wafer", "Shortbread"]`
    - `Payment Method`:
        - `"Digital Wallet": ["Apple Pay", "E-Wallet"]`
    - `Location`:
        - `"Takeaway": ["Online", "Take-out"]`
        - `"In-store": ["Home"]`

Note that we count exact duplicates as near duplicates.
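
The idea can be sketched end to end on a small example: invert the synonym dictionary, map every synonym back to its key, then look for exact duplicates. This is a minimal sketch on hypothetical toy data (`toy_sales`, `canonical`, and `near_duplicates_toy` are illustrative names), and it uses `Series.replace()` where the notebook's own cells use a helper function.

```python
import pandas as pd

# Hypothetical synonym dictionary: canonical value -> list of synonyms
synonym_dict = {"Cookie": ["Biscuit", "Wafer"], "Smoothie": ["Fruit Blend"]}

# Invert the mapping so each synonym points to its canonical value
synonym_to_key = {syn: key for key, syns in synonym_dict.items() for syn in syns}

toy_sales = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_1", "TXN_2", "TXN_2"],
    "Item": ["Cookie", "Biscuit", "Smoothie", "Shake"],
})

# Replace synonyms by their canonical value; values without an entry are left unchanged
canonical = toy_sales.copy()
canonical["Item"] = canonical["Item"].replace(synonym_to_key)

# Rows that become exact duplicates after the replacement are near duplicates
near_duplicates_toy = canonical.duplicated()

print(near_duplicates_toy.tolist())
```

In this toy example `Biscuit` is mapped to `Cookie`, so the second row is flagged, while `Shake` is not listed as a synonym here and the last row is left alone. Because unchanged rows are compared as well, exact duplicates are also caught.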

**References:** <br>
Dictionary access: https://www.w3schools.com/python/python_dictionaries_access.asp <br>
Reverse a dictionary: https://stackoverflow.com/questions/483666/reverse-invert-a-dictionary-mapping <br>
Merging a list of lists into one list: https://www.geeksforgeeks.org/merge-multiple-lists-into-one-list/

In [32]:
# Parameters to be edited by the user

# Valid attributes for the near duplicate check
attributes = ["Item", "Payment Method", "Location"]

# Attribute selection
test_attribute = "Item"

# Dictionary of values of the chosen attribute which may have synonyms along with their synonyms
synonym_dict = {
    "Smoothie": ["Shake", "Fruit Blend"],
    "Cookie": ["Biscuit", "Wafer", "Shortbread"]
}

In [33]:
# Error check

# Reverse the synonym dictionary so that we can access the key value from the synonym efficiently
synonym_to_key = {
    synonym: key for key, items in synonym_dict.items() for synonym in items
}

# Checks if a single value is in the list of all synonyms
# If so, returns the key value of the synonym in order to set all the synonyms to a baseline value
# If not, returns the original value
def synonym_replacer(value, synonym_to_key):
    if value in synonym_to_key.keys():
        return synonym_to_key[value]
    else:
        return value

# Create a copy of the dataframe where all the synonyms of the test attribute are replaced by their key value
sales_replaced_synonyms = sales.copy()
sales_replaced_synonyms[test_attribute] = sales[test_attribute].apply(
    lambda attribute: synonym_replacer(attribute, synonym_to_key)
)

# Check for exact duplicates in the modified dataset (using the method from section 9) to obtain the near duplicates
# Apply the .duplicated method to the DataFrame to create a Series, with near duplicates set to True
near_duplicates = sales_replaced_synonyms.duplicated()

# Save the invalid rows
invalid_near_duplicate_df = sales.loc[near_duplicates]

# Print the number of rows that are near duplicates
print(f"Number of duplicate rows: {near_duplicates.sum()}\n")

# Display the first 3 rows that are near duplicates
print("Examples of three near duplicate rows:")
invalid_near_duplicate_df.head(3)

Number of duplicate rows: 0

Examples of three near duplicate rows:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date


#### Results

There are no near duplicates. This is expected because the `Transaction ID` column is unique and the dataset does not contain any synonyms for the different items.
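
To illustrate why a unique `Transaction ID` prevents any row from being flagged, here is a minimal sketch of the same technique on a toy DataFrame. The data and the names `toy`, `toy_synonyms`, and `toy_synonym_to_key` are hypothetical and are not part of the cafe sales dataset.

```python
import pandas as pd

# Hypothetical toy data (not taken from the cafe sales dataset)
toy = pd.DataFrame({
    "Transaction ID": ["T1", "T2", "T1"],
    "Item": ["Cookie", "Cookie", "Biscuit"],
    "Quantity": ["1", "1", "1"],
})

# Same structure as synonym_dict: key value -> list of synonyms
toy_synonyms = {"Cookie": ["Biscuit", "Wafer"]}
toy_synonym_to_key = {
    synonym: key for key, items in toy_synonyms.items() for synonym in items
}

# Replace each synonym by its key value; values without an entry stay unchanged
normalized = toy.copy()
normalized["Item"] = toy["Item"].map(lambda v: toy_synonym_to_key.get(v, v))

# Rows 0 and 1 differ in Transaction ID, so they are not duplicates of each other.
# Row 2 matches row 0 in every column once "Biscuit" becomes "Cookie".
flags = normalized.duplicated()
print(flags.tolist())  # [False, False, True]
```

In this sketch, a row is only flagged when every column matches after the replacement, including `Transaction ID`.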

We will add a few rows that are near duplicates of existing rows to an altered version of the dataset, in a new DataFrame called `altered_sales`, to demonstrate that the near duplicates checker properly identifies near duplicates caused by synonyms.

More specifically, we add two copies of a row with the item `"Cookie"` and replace the item in each copy with a different synonym (`"Biscuit"` and `"Wafer"`). We add a copy of a row with the item `"Smoothie"` and replace the item with the synonym `"Fruit Blend"`. Finally, we add a copy of a row with the item `"Cookie"` and replace the item with `"Biscuit"`, but we also change the `Transaction Date` so that it is no longer a duplicate. We therefore expect our checker to identify the first three added rows as near duplicates, but not the fourth.

In [34]:
# Prepare the altered dataset
altered_sales = sales.copy()

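# Create four new rows: three copies of existing rows with the item replaced by a synonym,
# and one control row whose Transaction Date is also changed so that it is not a near duplicate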
near_duplicate_transaction_ids = pd.DataFrame({
    "Transaction ID": ["TXN_3779366", "TXN_3779366", "TXN_9989415", "TXN_2153529"],
    "Item": ["Biscuit", "Wafer", "Fruit Blend", "Biscuit"],
    "Quantity": ["1", "1", "5", "1"],
    "Price Per Unit": ["1.0", "1.0", "UNKNOWN", "1.0"],
    "Total Spent": ["1.0", "1.0", "20.0", "1.0"],
    "Payment Method": ["Digital Wallet", "Digital Wallet", "Credit Card", "Credit Card"],
    "Location": ["In-store", "In-store", "Takeaway", "UNKNOWN"],
    "Transaction Date": ["2023-04-15", "2023-04-15", "2023-05-26", "2023-10-25"]
})

altered_sales = pd.concat([altered_sales, near_duplicate_transaction_ids], ignore_index=True)
altered_sales.tail()

Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
9999,TXN_6170729,Sandwich,3,4.0,12.0,Cash,In-store,2023-11-07
10000,TXN_3779366,Biscuit,1,1.0,1.0,Digital Wallet,In-store,2023-04-15
10001,TXN_3779366,Wafer,1,1.0,1.0,Digital Wallet,In-store,2023-04-15
10002,TXN_9989415,Fruit Blend,5,UNKNOWN,20.0,Credit Card,Takeaway,2023-05-26
10003,TXN_2153529,Biscuit,1,1.0,1.0,Credit Card,UNKNOWN,2023-10-25


In [35]:
# Error check on altered dataset

# Create a copy of the DataFrame where all the synonyms of the test attribute are replaced by their key value
sales_replaced_synonyms = altered_sales.copy()
sales_replaced_synonyms[test_attribute] = altered_sales[test_attribute].apply(
    lambda attribute: synonym_replacer(attribute, synonym_to_key)
)

# Check for exact duplicates in the modified dataset (using the method from section 9) to obtain the near duplicates
# Apply the .duplicated method to the DataFrame to create a Series, with near duplicates set to True
near_duplicates = sales_replaced_synonyms.duplicated()

# Save the invalid rows
invalid_near_duplicate_df = altered_sales.loc[near_duplicates]

# Print the number of rows that are near duplicates
print(f"Number of duplicate rows: {near_duplicates.sum()}\n")

# Display the first 3 rows that are near duplicates
print("Examples of three near duplicate rows:")
invalid_near_duplicate_df.head(3)

Number of duplicate rows: 3

Examples of three near duplicate rows:


Unnamed: 0,Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
10000,TXN_3779366,Biscuit,1,1.0,1.0,Digital Wallet,In-store,2023-04-15
10001,TXN_3779366,Wafer,1,1.0,1.0,Digital Wallet,In-store,2023-04-15
10002,TXN_9989415,Fruit Blend,5,UNKNOWN,20.0,Credit Card,Takeaway,2023-05-26


#### Results on Altered Dataset

Below are the example results from applying the near duplicates check to the altered dataset, which contains three new near duplicate rows and one control row.

As expected, the checker identifies 3 near duplicate rows. It does not flag the fourth added row, which is not a true near duplicate because its `Transaction Date` differs. See the invalid rows below.

| Transaction ID | Item        | Quantity | Price Per Unit | Total Spent | Payment Method | Location | Transaction Date |
|----------------|-------------|----------|----------------|-------------|----------------|----------|------------------|
| TXN_3779366    | Biscuit     | 1        | 1.0            | 1.0         | Digital Wallet | In-store | 2023-04-15       |
| TXN_3779366    | Wafer       | 1        | 1.0            | 1.0         | Digital Wallet | In-store | 2023-04-15       |
| TXN_9989415    | Fruit Blend | 5        | UNKNOWN        | 20.0        | Credit Card    | Takeaway | 2023-05-26       |
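
Note that `DataFrame.duplicated()` uses `keep="first"` by default, so only the later occurrences are flagged, which is why the table lists the added rows and not the original rows they match. As a minimal sketch on hypothetical toy data, passing `keep=False` flags every member of a duplicate group, which may be useful when reviewing both sides of a match.

```python
import pandas as pd

# Hypothetical toy data; assume synonyms have already been replaced by their key value
toy = pd.DataFrame({
    "Transaction ID": ["T1", "T2", "T1"],
    "Item": ["Cookie", "Coffee", "Cookie"],
})

# Default behaviour: the first occurrence is kept, later occurrences are flagged
later_only = toy.duplicated()

# keep=False: every row that belongs to a duplicate group is flagged
all_members = toy.duplicated(keep=False)

print(later_only.tolist())   # [False, False, True]
print(all_members.tolist())  # [True, False, True]
```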

## Conclusion

Overall, we performed validity checks for the following 10 types of errors: data type, range, format, consistency, uniqueness, presence, length, look-up, exact duplicate, and near duplicate. Each check flagged the errors already present in the dataset, as well as the errors that we simulated to verify its behaviour. To further ensure the validity of the dataset, we could extend the near duplicate check so that the user can run the duplicate check on a subset of the columns, since two rows may describe the same transaction even when one attribute (such as `Transaction ID` in this case) differs. Another extension would be to investigate potential causes of the errors we identified, along with the damage that they can cause.
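
As a rough sketch of the subset-based extension, and not something implemented in this notebook, the `subset` parameter of `DataFrame.duplicated()` could be used to ignore selected columns. The toy data and the names `ignored_columns` and `compare_columns` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (not taken from the cafe sales dataset)
toy = pd.DataFrame({
    "Transaction ID": ["T1", "T2", "T3"],
    "Item": ["Cookie", "Cookie", "Tea"],
    "Quantity": ["1", "1", "2"],
    "Transaction Date": ["2023-04-15", "2023-04-15", "2023-04-15"],
})

# Hypothetical user parameter: columns to ignore when comparing rows
ignored_columns = ["Transaction ID"]
compare_columns = [col for col in toy.columns if col not in ignored_columns]

# Comparing all columns finds nothing, because every Transaction ID is unique
full_duplicates = toy.duplicated()

# Comparing only the remaining columns flags row 1 as a repeat of row 0
subset_duplicates = toy.duplicated(subset=compare_columns)

print(full_duplicates.tolist())    # [False, False, False]
print(subset_duplicates.tolist())  # [False, True, False]
```

Which columns are safe to ignore would depend on the dataset, so in this sketch the choice is left to the user.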

## References
Dataset: https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training <br>
Left-align Markdown Tables: https://stackoverflow.com/questions/21892570/ipython-notebook-align-table-to-the-left-of-cell <br>
Regex: https://www.w3schools.com/python/python_regex.asp <br>
Eval: https://docs.python.org/3/library/functions.html#eval <br>
Accessing a Specific Value From Value Counts: https://stackoverflow.com/questions/35277075/python-pandas-counting-the-occurrences-of-a-specific-value <br>
Add Rows to DF: https://www.geeksforgeeks.org/how-to-add-one-row-in-an-existing-pandas-dataframe/ <br>
Exact Duplicates: https://uottawa.brightspace.com/d2l/le/content/490358/viewContent/6620388/View (Slide 27) <br>
Dictionary access: https://www.w3schools.com/python/python_dictionaries_access.asp <br>
Reverse a dictionary: https://stackoverflow.com/questions/483666/reverse-invert-a-dictionary-mapping <br>
Merging a list of lists into one list: https://www.geeksforgeeks.org/merge-multiple-lists-into-one-list/