Data cleaning involves several steps. We will go through them one by one in detail, with an example for each.

1. Schema and Type Validation

| Issue                                       | Description                                                        | Example                                       | Recommended Solution                                                                               |
| ------------------------------------------- | ------------------------------------------------------------------ | --------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| 1. **Wrong Data Type**                      | Data type mismatch between source and target schema.               | `"123"` (string) instead of `123` (int)       | Cast to correct type using safe parsing. Validate post-casting.                                    |
| 2. **Inconsistent Types in Same Column**    | Same column contains mixed types.                                  | `[123, "abc", null]`                          | Use `case when` or `try_cast` to isolate bad records.                                              |
| 3. **Corrupt or Malformed Values**          | Fields don't follow expected format or contain corrupt characters. | `"abc"` in a date field                       | Use regex, parsing functions, and try-except blocks. Flag or move bad records to quarantine.       |
| 4. **Missing or Extra Columns**             | Input dataset has fewer or more columns than expected.             | Source has 9 columns, schema expects 10       | Compare schema programmatically. Auto-correct where possible or log and quarantine.                |
| 5. **Incorrect or Mismatched Column Names** | Case mismatch or spelling errors in column names.                  | `"user_id"` vs `"UserId"`                     | Use mapping logic to rename before applying schema.                                                |
| 6. **Empty Strings Instead of Nulls**       | Fields contain `""` but are semantically null.                     | `""` in an `email` field                      | Normalize by converting empty strings to nulls.                                                    |
| 7. **Array/Object Type Errors**             | Arrays/structs nested incorrectly or inconsistent structure.       | JSON field has different schema than expected | Use schema inference + validation with expected schema. Flatten and re-parse if necessary.         |
| 8. **Date/Time Format Issues**              | Dates stored in multiple formats or invalid dates.                 | `"2025/01/31"`, `"31-01-2025"`                | Normalize all to ISO (`yyyy-MM-dd`). Use parsing libraries (e.g., `to_date`, `datetime.strptime`). |
| 9. **Precision Loss**                       | Implicit casting of decimals can lead to loss of precision.        | `123.4567` becomes `123.45`                   | Define explicit precision and scale in the schema, e.g. `DecimalType(10, 4)`.                      |
| 10. **Boolean Value Ambiguity**             | Values like "Yes", "No", 1, 0, "true" are mixed.                   | `["Yes", "No", 1, 0, true, false]`            | Standardize to true/false using UDF or mapping dictionary.                                         |
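
Several of the fixes in the table can be sketched in plain Python. This is a minimal illustration using only the standard library, not a Spark implementation of `try_cast` or `to_date`. The column names (`user_id`, `email`, `signup_date`, `is_active`), the lookup tables, and every helper function name are hypothetical and chosen only for this example.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical configuration; adjust to your own schema and data.
EXPECTED_COLUMNS = ("user_id", "email", "signup_date", "is_active")
COLUMN_ALIASES = {"userid": "user_id"}                  # issue 5
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y")     # issue 8
BOOLEAN_MAP = {"yes": True, "no": False, "true": True,
               "false": False, "1": True, "0": False}   # issue 10


def align_columns(raw):
    """Issues 4-5: rename known aliases, then report missing and extra columns."""
    renamed = {}
    for name, value in raw.items():
        key = name.strip().lower()
        renamed[COLUMN_ALIASES.get(key, key)] = value
    missing = [c for c in EXPECTED_COLUMNS if c not in renamed]
    extra = [c for c in renamed if c not in EXPECTED_COLUMNS]
    return renamed, missing, extra


def empty_to_none(value):
    """Issue 6: treat empty or whitespace-only strings as null."""
    if isinstance(value, str) and value.strip() == "":
        return None
    return value


def safe_int(value):
    """Issues 1-3: return an int, or None when the value cannot be parsed."""
    value = empty_to_none(value)
    if isinstance(value, bool):
        return None  # do not silently turn True/False into 1/0
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value.strip())
        except ValueError:
            return None
    return None


def normalize_date(value):
    """Issue 8: try each accepted format and return an ISO yyyy-MM-dd string."""
    value = empty_to_none(value)
    if not isinstance(value, str):
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def to_decimal(value, places=4):
    """Issue 9: parse via text and fix the scale instead of relying on floats."""
    try:
        return Decimal(str(value)).quantize(Decimal(10) ** -places)
    except (InvalidOperation, ValueError):
        return None


def normalize_bool(value):
    """Issue 10: map mixed boolean spellings to True/False, None if unknown."""
    if isinstance(value, bool):
        return value
    return BOOLEAN_MAP.get(str(value).strip().lower())


def clean_record(raw):
    """Clean one row; return it with the fields that should be quarantined."""
    cleaned = {
        "user_id": safe_int(raw.get("user_id")),
        "email": empty_to_none(raw.get("email")),
        "signup_date": normalize_date(raw.get("signup_date")),
        "is_active": normalize_bool(raw.get("is_active")),
    }
    # A field is an error when the raw value was present but parsing failed.
    errors = [name for name, value in cleaned.items()
              if value is None and empty_to_none(raw.get(name)) is not None]
    return cleaned, errors
```

Calling `clean_record` on a row returns the cleaned values together with a list of field names that failed parsing, so rows with a non-empty error list can be flagged or moved to quarantine rather than dropped. Note that the order of `DATE_FORMATS` decides how ambiguous inputs such as `01-02-2025` are read, so it should reflect what the source system actually emits.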
