# Week 11: Data Validation ✅

## 1. What is Data Validation?

**Data validation** is the process of ensuring the **accuracy, completeness, and reliability** of data before it's used for analysis or decision-making. The main goal is to verify that the data meets specific predefined criteria and standards.

This is a crucial step for:
* **Maintaining data quality** by ensuring data is clean and useful.
* **Preventing errors** that could lead to faulty decisions.
* **Improving decision-making** by providing a foundation of high-quality, trustworthy data.

---

## 2. The Three Types of Data Validation

Data validation can be broken down into three main categories, each checking a different aspect of the data.


1.  **Structural Validation**: Checks if the data conforms to the required schema or model.
2.  **Content Validation**: Focuses on the accuracy and relevance of the data's content.
3.  **Logical Validation**: Ensures the data makes sense in its business context by checking it against business rules.

---

## 3. Structural Validation

**Structural validation** checks that data is organized and formatted correctly according to its specified structure. This is the first line of defense and is essential for automated processing and system integration.

### Key Structural Validation Tasks
* **Data Type Checks**: Ensures a field contains the correct data type (e.g., an integer, string, or date). For example, a `date` field shouldn't contain alphabetic characters.
* **Format Checks**: Verifies that data follows a specific format, like a phone number (`(xxx) xxx-xxxx`) or an email address.
* **Data Integrity Checks**: In relational databases, this ensures relationships are maintained, such as checking that a foreign key points to an existing record.
* **Required Fields Check**: Confirms that mandatory fields are not empty or null.
* **File Size and Type Validation**: Checks that uploaded files are of the expected type (e.g., `.csv`) and do not exceed size limits.

#### Example: Online Retail Transactions
Consider this dataset:

| Customer\_ID | Date\_of\_Purchase | Order\_Amount | Product\_ID | Customer\_Email |
| :--- | :--- | :--- | :--- | :--- |
| 123 | 2023-08-15 | 59.99 | P001 | john.doe@example.com |
| 124 | 15-08-2023 | 120.50 | P002 | jane.smith@example.com |
| 125 | 2023-08-16 | -75.00 | P003 | adam\_1@wrongformat |
| 126 | not available | 99.99 | NULL | eve.williams@example.com |
| | 2023-08-17 | 49.50 | P005 | mark.adams@example.com |

**Structural Errors Found:**
* **Data Type Error**: The `Date_of_Purchase` for customer 126 is a string ("not available") instead of a date.
* **Format Error**: The date for customer 124 ("15-08-2023") uses DD-MM-YYYY, while the other valid dates use YYYY-MM-DD.
* **Required Field Error**: The `Customer_ID` is missing in the last row.
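* **Integrity Error**: The `Product_ID` is NULL for customer 126. If every order must reference a product, the row breaks that relationship. In a relational database a foreign key column normally accepts NULL, so enforcing this rule also requires a `NOT NULL` constraint on the column.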
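The checks above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete validator: the rows are transcribed from the table (the email column is omitted), `None` stands in for an empty or NULL cell, and the choice of `Customer_ID` and `Product_ID` as required fields is an assumption made for the example.

```python
from datetime import datetime

# Rows transcribed from the table above; None marks an empty or NULL cell.
ROWS = [
    {"Customer_ID": "123", "Date_of_Purchase": "2023-08-15", "Order_Amount": "59.99", "Product_ID": "P001"},
    {"Customer_ID": "124", "Date_of_Purchase": "15-08-2023", "Order_Amount": "120.50", "Product_ID": "P002"},
    {"Customer_ID": "125", "Date_of_Purchase": "2023-08-16", "Order_Amount": "-75.00", "Product_ID": "P003"},
    {"Customer_ID": "126", "Date_of_Purchase": "not available", "Order_Amount": "99.99", "Product_ID": None},
    {"Customer_ID": None, "Date_of_Purchase": "2023-08-17", "Order_Amount": "49.50", "Product_ID": "P005"},
]

# Assumption for this sketch: these two fields are mandatory.
REQUIRED_FIELDS = ("Customer_ID", "Product_ID")


def is_iso_date(text):
    """Return True if text parses as a YYYY-MM-DD date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def structural_errors(rows):
    """Collect (row_number, field, problem) tuples for structural issues."""
    errors = []
    for number, row in enumerate(rows, start=1):
        # Required fields check
        for field in REQUIRED_FIELDS:
            if not row.get(field):
                errors.append((number, field, "missing required value"))
        # Data type / format check on the date
        if not is_iso_date(row.get("Date_of_Purchase")):
            errors.append((number, "Date_of_Purchase", "not a YYYY-MM-DD date"))
        # Data type check on the amount
        try:
            float(row["Order_Amount"])
        except (TypeError, ValueError):
            errors.append((number, "Order_Amount", "not a number"))
    return errors


for error in structural_errors(ROWS):
    print(error)
```

Note that this simple date check does not distinguish a wrong type ("not available") from a wrong format ("15-08-2023"); both are reported as failing the YYYY-MM-DD rule. The negative amount in row 3 passes here because it is still a number, which is why it is treated as a content issue rather than a structural one.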

---

## 4. Content Validation

While structural validation checks the *shape* of the data, **content validation** checks the *substance*. It verifies that the data entries themselves are accurate, relevant, and correct for their intended use.

### Key Content Validation Tasks
* **Range Checks**: Ensures data values fall within an acceptable range (e.g., an `age` field shouldn't have a negative number).
* **Uniqueness Checks**: Verifies that values in a field required to be unique (like `Customer_ID` or `email`) are not duplicated.
* **Referential Integrity**: Confirms that relationships between tables are consistent, like ensuring an `OrderID` in a `Shipments` table exists in the `Orders` table.
* **List and Set Validation**: Checks if a value is one of a predefined set of options (e.g., a `Country` field must contain a valid country name from a standard list).
* **Pattern Matching**: Uses techniques like regular expressions to validate complex formats, such as ensuring an email address is properly formed.
* **Cross-Field Validation**: Applies rules that depend on multiple fields. For example, a `Delivery_Date` must be later than the `Order_Date`.

#### Example: Online Retail Transactions (Revisited)
Using the same dataset:

**Content Errors Found:**
* **Range Check Error**: The `Order_Amount` for customer 125 is negative (-75.00), which is likely invalid.
* **Pattern Matching Error**: The `Customer_Email` for customer 125 ("adam\_1@wrongformat") is not a valid email format.
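* **Uniqueness Check (no error in this sample)**: Each non-empty `Customer_ID` in the table appears only once, so the check passes. If an ID such as 126 appeared in two rows, it would indicate a duplicate entry that needs to be resolved.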
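As a rough sketch, the range, pattern, and uniqueness checks could look like the following. The email pattern is intentionally simplistic (something, `@`, something, a dot, something) and is an assumption for illustration only; production email validation is considerably more involved. The rows are transcribed from the table, keeping only the fields these checks need.

```python
import re
from collections import Counter

# Simplified pattern for illustration: local@domain.tld with no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Rows transcribed from the table above; None marks an empty cell.
ROWS = [
    {"Customer_ID": "123", "Order_Amount": 59.99, "Customer_Email": "john.doe@example.com"},
    {"Customer_ID": "124", "Order_Amount": 120.50, "Customer_Email": "jane.smith@example.com"},
    {"Customer_ID": "125", "Order_Amount": -75.00, "Customer_Email": "adam_1@wrongformat"},
    {"Customer_ID": "126", "Order_Amount": 99.99, "Customer_Email": "eve.williams@example.com"},
    {"Customer_ID": None, "Order_Amount": 49.50, "Customer_Email": "mark.adams@example.com"},
]


def content_errors(rows):
    """Collect (row_number, field, problem) tuples for content issues."""
    errors = []
    # Count each non-empty ID once up front so duplicates are easy to spot.
    id_counts = Counter(row["Customer_ID"] for row in rows if row["Customer_ID"])
    for number, row in enumerate(rows, start=1):
        # Range check: amounts are assumed to be non-negative.
        if row["Order_Amount"] < 0:
            errors.append((number, "Order_Amount", "negative amount"))
        # Pattern matching on the email address.
        if not EMAIL_PATTERN.match(row["Customer_Email"]):
            errors.append((number, "Customer_Email", "malformed email"))
        # Uniqueness check on the customer ID.
        if id_counts[row["Customer_ID"]] > 1:
            errors.append((number, "Customer_ID", "duplicate ID"))
    return errors


for error in content_errors(ROWS):
    print(error)
```

On the five rows as transcribed, only row 3 is reported (negative amount and malformed email). The uniqueness check fires only when the same non-empty ID occurs in more than one row.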

---

## 5. Logical Validation

**Logical validation** is the most advanced type, focusing on whether the data adheres to specific **business rules and logic**. It verifies the contextual correctness of data based on its relationship with other data points.

### Key Logical Validation Tasks
* **Consistency Checks**: Verifies that data across different fields is consistent with business rules (e.g., a patient's treatment is appropriate for their diagnosis).
* **Sequential Validation**: Ensures data follows a logical sequence (e.g., a `Shipping_Date` must come after an `Order_Date`).
* **Dependency Checks**: Validates data relationships (e.g., an employee's manager must hold a more senior position).
* **Calculative Validation**: Checks that computed fields are derived correctly (e.g., `Total_Price` = `Unit_Price` * `Quantity` + `Tax`).
* **Conditional Validation**: Applies rules that are only triggered under specific conditions (e.g., a "senior discount" is only applied if the customer's age is over 65).
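Three of these rules (sequential, calculative, and conditional) translate fairly directly into code. The sketch below is one possible shape, not a reference implementation: the `check_order` function, the dictionary keys, the rounding tolerance, and the sample order are all hypothetical and chosen only to mirror the examples in the list.

```python
from datetime import date


def check_order(order, tolerance=0.01):
    """Return a list of business-rule violations for one order (hypothetical schema)."""
    problems = []

    # Sequential validation: shipping must come strictly after the order date.
    if order["shipping_date"] <= order["order_date"]:
        problems.append("shipping_date is not after order_date")

    # Calculative validation: total_price = unit_price * quantity + tax.
    expected_total = order["unit_price"] * order["quantity"] + order["tax"]
    if abs(order["total_price"] - expected_total) > tolerance:
        problems.append("total_price does not match unit_price * quantity + tax")

    # Conditional validation: a senior discount requires age over 65.
    if order["senior_discount"] and order["customer_age"] <= 65:
        problems.append("senior_discount applied to a customer aged 65 or under")

    return problems


# A made-up order that breaks all three rules.
bad_order = {
    "order_date": date(2023, 9, 1),
    "shipping_date": date(2023, 8, 30),
    "unit_price": 10.0,
    "quantity": 3,
    "tax": 2.4,
    "total_price": 35.0,
    "customer_age": 40,
    "senior_discount": True,
}

for problem in check_order(bad_order):
    print(problem)
```

The tolerance on the total is there because floating-point arithmetic rarely reproduces currency values exactly; a real system would more likely store money as integers or `Decimal` values.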

#### Example: Employee Timesheets
Consider this employee timesheet data:

| Employee\_ID | Date | Hours\_Worked | Hourly\_Rate | Department |
| :--- | :--- | :--- | :--- | :--- |
| 1001 | 2023-09-01 | 8 | 25 | HR |
| 1002 | 2023-09-01 | 9 | 30 | IT |
| 1003 | 2023-09-01 | 15 | 22 | Finance |
| 1004 | 2023-09-01 | **25** | 28 | Operations |
| 1005 | 2023-09-01 | 7 | **45** | HR |
| 1006 | 2023-09-01 | 12 | **120** | IT |

**Logical Errors Found:**
* **Consistency Error**: Employee 1004 worked **25 hours** in a single day, which violates the business rule that daily hours cannot exceed 24.
* **Consistency Error**: Employee 1006 in the IT department has an `Hourly_Rate` of **$120**, while employee 1002 in the same department has a rate of $30. This might be a logical error if there's a business rule defining a much narrower pay band for that role.
* **Consistency Error**: Employee 1005 in HR has an `Hourly_Rate` of **$45**, which is significantly higher than employee 1001 ($25) in the same department. This could be a logical error or require further investigation.
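These timesheet checks can be sketched as follows. The 24-hour limit comes from the rule stated above; the pay-spread threshold (`MAX_RATE_RATIO`) is a hypothetical value picked for illustration, since the notes do not define an actual pay band. Rows that trip the spread check are candidates for review, not confirmed errors.

```python
from collections import defaultdict

# (Employee_ID, Hours_Worked, Hourly_Rate, Department), transcribed from the table.
TIMESHEETS = [
    (1001, 8, 25, "HR"),
    (1002, 9, 30, "IT"),
    (1003, 15, 22, "Finance"),
    (1004, 25, 28, "Operations"),
    (1005, 7, 45, "HR"),
    (1006, 12, 120, "IT"),
]

MAX_DAILY_HOURS = 24   # business rule stated in the notes
MAX_RATE_RATIO = 1.5   # hypothetical: highest rate may be at most 1.5x the lowest


def over_hours(rows):
    """Return the IDs of employees whose daily hours exceed the limit."""
    return [emp_id for emp_id, hours, _, _ in rows if hours > MAX_DAILY_HOURS]


def departments_with_wide_rates(rows):
    """Return departments whose highest rate exceeds MAX_RATE_RATIO times the lowest."""
    rates = defaultdict(list)
    for _, _, rate, department in rows:
        rates[department].append(rate)
    return sorted(
        department
        for department, values in rates.items()
        if max(values) / min(values) > MAX_RATE_RATIO
    )


print(over_hours(TIMESHEETS))                   # [1004]
print(departments_with_wide_rates(TIMESHEETS))  # ['HR', 'IT']
```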

---

## 6. Error Handling

Once validation rules detect an error, you need a plan to manage it. **Error handling** is that plan: a set of strategies for resolving errors found during data processing. An effective strategy not only fixes current errors but also helps prevent future ones.

The process generally involves:
1.  **Detection**: Identifying the error.
2.  **Reporting**: Logging the error with clear details.
3.  **Assessment**: Understanding the error's impact.
4.  **Response & Resolution**: Taking action to correct the error (e.g., fix, flag, or reject the data).
5.  **Documentation & Review**: Recording the resolution.
6.  **Prevention**: Implementing measures to stop the error from happening again.
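The first steps of this process (detection, reporting, and response) can be illustrated with a small sketch. It assumes a "reject and keep for review" policy, which is only one of the possible responses; `handle_rows` and `amount_is_non_negative` are hypothetical helper names, and the two sample rows are made up for the demonstration.

```python
import logging

logger = logging.getLogger("validation")


def handle_rows(rows, validate):
    """Split rows into accepted and rejected, logging every rejection."""
    accepted, rejected = [], []
    for number, row in enumerate(rows, start=1):
        problems = validate(row)  # Detection
        if problems:
            # Reporting: log the row number and a readable description.
            logger.warning("row %d rejected: %s", number, "; ".join(problems))
            # Response: reject the row, but keep it with its problems for later review.
            rejected.append({"row": number, "data": row, "problems": problems})
        else:
            accepted.append(row)
    return accepted, rejected


def amount_is_non_negative(row):
    """Example rule: return a list of problems, empty if the row is fine."""
    return [] if row["Order_Amount"] >= 0 else ["Order_Amount is negative"]


accepted, rejected = handle_rows(
    [{"Order_Amount": 59.99}, {"Order_Amount": -75.00}],
    amount_is_non_negative,
)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Keeping the rejected rows together with their recorded problems supports the later steps: the list can be assessed for impact, documented, and mined for recurring patterns that suggest a preventive rule.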