# Week 7: Data Quality & Anomalies

## 1. Introduction to Data Quality

### 1.1. Definition
**Data quality** refers to the state or condition of data, evaluated based on factors like its **accuracy, completeness, reliability, relevance, and timeliness**. High-quality data is crucial for organizations to make informed decisions, enhance operational efficiency, and maintain a competitive advantage.

### 1.2. The Cost of Poor Data Quality
Poor data quality has significant financial repercussions.
* The estimated cost to the US economy in 2016 was **$3.1 trillion** (IBM).
* Companies lose an average of **12% of their revenue** due to bad data (Experian).
* The average annual cost for a company is **$14.2 million** (Gartner).

---

## 2. The Importance of High-Quality Data

High-quality data is fundamental to various aspects of an organization's success:
* **Enhanced Decision-Making**: Ensures decisions are based on accurate, factual information, increasing confidence and reducing the risk of mistakes.
* **Regulatory Compliance & Risk Management**: Helps organizations comply with regulations to avoid legal penalties and allows for proactive risk identification and mitigation.
* **Operational Efficiency**: Streamlines processes by reducing time spent on data corrections and verifications, leading to better resource allocation (e.g., Zara's responsive supply chain).
* **Customer Satisfaction**: Enables personalized experiences and service improvements by accurately understanding customer behavior (e.g., Netflix's recommendation system, which drives over 80% of viewed content).
* **Financial Health**: Leads to cost savings by reducing errors and can uncover new revenue opportunities through better targeting and market analysis.
* **Reputation and Trust**: Builds trust with stakeholders (customers, investors, partners) by demonstrating reliability and commitment to excellence.
* **Innovation and Growth**: Provides the necessary insights for business intelligence and analytics, offering a competitive edge by identifying trends and optimizing operations faster than competitors.

---

## 3. Case Studies of Data Quality Failures

### 3.1. UK Post Office Horizon IT Scandal
Hundreds of post office operators were wrongly convicted of theft and fraud because of errors produced by the faulty **Horizon IT system**. The system's data incorrectly showed money missing from their branches, highlighting how poor data quality can lead to devastating real-world consequences.

### 3.2. The Therac-25 Incident (1985-1987)
The **Therac-25**, a computer-controlled radiation therapy machine, delivered lethal radiation overdoses due to software bugs. This was a critical data quality failure stemming from:
* **Software Errors**: Bugs in the control software, including race conditions, could leave the machine in an unsafe configuration.
* **Lack of Data Validation**: The system lacked proper checks and limitations on treatment settings entered by operators.
* **Faulty Risk Assessments**: The manufacturer underestimated the probability of software malfunctions.

---

## 4. Data Quality Dimensions vs. Measures

It's important to distinguish between dimensions and measures.
* **Dimensions (The 'What')**: These are the **qualitative aspects** or characteristics that contribute to data's overall quality. They provide a framework for assessment.
    * Examples: **Accuracy, Completeness, Consistency, Timeliness, Relevance, Reliability**.
* **Measures (The 'How')**: These are the **quantitative metrics** or indicators used to evaluate data against the dimensions. They are the practical tools for assessment.
    * Examples: **Error Rate, Fill Rate, Duplicate Rate, Latency**.
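
As a minimal sketch of how a measure quantifies a dimension, the snippet below computes a fill rate (for Completeness) and a duplicate rate on a few made-up records. The records, field names, and the helpers `fill_rate` and `duplicate_rate` are illustrative assumptions, not part of any particular tool.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 3, "email": "c@example.com"},  # exact copy of the previous row
]


def fill_rate(rows, field):
    """Share of rows in which `field` is present (a Completeness measure)."""
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)


def duplicate_rate(rows):
    """Share of rows that repeat an earlier, identical row."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(rows)


email_fill = fill_rate(records, "email")
dup_rate = duplicate_rate(records)
print(f"fill rate (email): {email_fill:.2f}")  # 3 of 4 rows have an email
print(f"duplicate rate:    {dup_rate:.2f}")    # 1 of 4 rows is a repeat
```

Tracking such numbers over time turns a qualitative dimension into something that can be monitored against a threshold.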

---

## 5. Common Data Quality Challenges

Maintaining data quality is difficult due to several challenges:
* **Volume and Variety of Data**: The sheer amount and diversity of data make it hard to manage and ensure consistency.
* **Data Silos**: Data stored in isolated systems leads to inconsistencies and redundancy.
* **Evolving Data (Data Drift)**: Data changes over time due to new business processes or market conditions, which can degrade model accuracy.
* **Human Error**: Mistakes in data entry, interpretation, and management are common and can propagate through systems.
* **Lack of Data Governance**: Without a formal governance framework, it's difficult to enforce standards, policies, and procedures for data quality.
* **Complex Data Integration**: Combining data from various sources with different formats and standards often introduces errors.
* **Inadequate Tools**: Lack of effective tools hinders the ability to detect, correct, and prevent data quality issues.
* **Poor Awareness**: A lack of understanding across the organization about the importance of data quality leads to it not being prioritized.
* **Regulatory Compliance**: Adhering to an ever-changing landscape of data regulations is a constant challenge.
* **Resource Constraints**: Limited budget, time, and skilled personnel can impede data quality initiatives.

---

## 6. Understanding Data Anomalies

### 6.1. Definition and Importance
**Data anomalies** are irregularities, deviations, or unusual occurrences in data that differ from expected patterns. Identifying them is crucial for maintaining data quality and ensuring accurate analysis. Anomaly detection is vital in fields like fraud detection, healthcare monitoring, and predictive maintenance.

### 6.2. Types of Anomalies
There are three primary types of data anomalies:

#### 6.2.1. Point Anomalies
A single data point that significantly deviates from the rest of the dataset.
* **Example**: In a table of employee work hours, most values are between 4 and 10, but one entry is **500**. This is a clear point anomaly, likely a data entry error.
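
One possible way to flag such a value automatically is a robust outlier score. The sketch below uses the modified z-score based on the median and the median absolute deviation (MAD), so a single extreme entry does not distort the baseline. The sample hours, the function name `point_anomalies`, and the cutoff of 3.5 (a commonly used convention) are assumptions for illustration.

```python
from statistics import median

# Hypothetical work-hour entries; 500 is the suspicious one.
hours = [8, 7, 9, 6, 8, 10, 4, 500, 7, 8]


def point_anomalies(values, threshold=3.5):
    """Return values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (nearly) identical: the score is undefined, flag nothing.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


flagged = point_anomalies(hours)
print(flagged)  # [500]
```

A mean-and-standard-deviation rule would also work here, but the extreme value itself inflates both statistics, which is why a median-based score is often preferred for small tables.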

#### 6.2.2. Contextual Anomalies
A data point that is anomalous only within a specific context (e.g., time or location) but not otherwise.
* **Example**: A high energy usage value for a property at 3:00 AM might be anomalous, as usage is typically low at that time. However, the same value at 7:00 PM would be normal.
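
A contextual check compares a value against a baseline for its own context rather than against the whole dataset. The following sketch keeps a separate history per hour of day and flags a reading that lies more than `k` standard deviations from that hour's mean. The kWh figures, the `history` layout, and the choice of `k = 3` are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical past readings in kWh, grouped by hour of day.
history = {
    3: [0.4, 0.5, 0.3, 0.4, 0.5],    # 3:00 AM
    19: [3.8, 4.2, 4.0, 4.5, 3.9],   # 7:00 PM
}


def is_contextual_anomaly(hour, value, history, k=3.0):
    """True if `value` is far from the usual readings for this `hour`."""
    baseline = history[hour]
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(value - mu) > k * sigma


at_night = is_contextual_anomaly(3, 4.0, history)
in_evening = is_contextual_anomaly(19, 4.0, history)
print(at_night, in_evening)  # True False
```

The same reading of 4.0 kWh is flagged at 3:00 AM and accepted at 7:00 PM, which is exactly what distinguishes a contextual anomaly from a point anomaly.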

#### 6.2.3. Collective Anomalies
A collection of related data points that, as a group, deviate from the overall pattern, even though individual points may not be anomalous.
* **Example**: In credit card transactions, a single small purchase at a burger restaurant is normal. However, ten small purchases at the same restaurant in a few minutes could represent a **collective anomaly** indicating fraudulent activity, like testing a stolen card.
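
Because no single transaction is unusual, a collective check has to look at groups of events. A simple sketch is a sliding time window that counts purchases at one merchant. The limit of five purchases per five minutes and the timestamps are assumptions chosen for illustration, not a real fraud rule.

```python
from datetime import datetime, timedelta


def has_burst(timestamps, max_count=5, window=timedelta(minutes=5)):
    """True if more than `max_count` events fall inside any `window`."""
    ts = sorted(timestamps)
    start = 0
    for end in range(len(ts)):
        # Shrink the window from the left until it spans at most `window`.
        while ts[end] - ts[start] > window:
            start += 1
        if end - start + 1 > max_count:
            return True
    return False


base = datetime(2024, 1, 1, 12, 0)
# Ten small purchases 20 seconds apart vs. three purchases hours apart.
card_testing = [base + timedelta(seconds=20 * i) for i in range(10)]
normal_use = [base + timedelta(hours=3 * i) for i in range(3)]

print(has_burst(card_testing), has_burst(normal_use))  # True False
```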

---

## 7. A Framework for Data Quality Problems (Rahm & Do)

Data quality problems can be categorized based on their source and level.

### 7.1. Single-Source vs. Multi-Source Problems
* **Single-Source Problems**: Issues originating from a single database or file.
* **Multi-Source Problems**: Issues arising from the integration of data from multiple sources.

### 7.2. Schema-Level vs. Instance-Level Problems
* **Schema Level**: Problems related to the structure and design of the data model.
    * *Single-Source*: Missing integrity constraints (e.g., uniqueness, referential integrity) and poor schema design.
    * *Multi-Source*: Heterogeneous data models, naming conflicts (e.g., `Cid` vs. `Cno`), structural conflicts (e.g., one field for `Name` vs. `FirstName` and `LastName`).
* **Instance Level**: Problems related to the actual data values (the instances or records).
    * *Single-Source*: Data entry errors like misspellings, redundancy, contradictory values (e.g., `age` does not match `birth_date`).
    * *Multi-Source*: Overlapping, contradicting, or inconsistent data (e.g., different gender representations like "0/1" vs. "F/M").
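
To make the multi-source conflicts concrete, the sketch below maps rows from two hypothetical sources into one unified layout. It resolves the naming conflict (`Cid` vs. `Cno`), the structural conflict (one `Name` field vs. `FirstName` and `LastName`), and the gender encoding conflict. The column sets, the sample rows, and in particular the assumption that `0` means `F` and `1` means `M` are illustrative; in practice the code table must come from each source's documentation.

```python
# Assumed code table: which of 0/1 means F or M must be confirmed per source.
GENDER_CODES = {"0": "F", "1": "M", "F": "F", "M": "M"}


def from_customer(row):
    """Map a row of the hypothetical Customer source (single Name, Sex as 0/1)."""
    first, _, last = row["Name"].partition(" ")
    return {
        "source": "Customer",
        "source_id": row["Cid"],
        "first_name": first,
        "last_name": last,
        "gender": GENDER_CODES[str(row["Sex"])],
    }


def from_client(row):
    """Map a row of the hypothetical Client source (split names, Gender as F/M)."""
    return {
        "source": "Client",
        "source_id": row["Cno"],
        "first_name": row["FirstName"],
        "last_name": row["LastName"],
        "gender": GENDER_CODES[row["Gender"]],
    }


customer_row = {"Cid": 11, "Name": "Kristen Smith", "Sex": 0}
client_row = {"Cno": 493, "FirstName": "Kris", "LastName": "Smith", "Gender": "F"}

unified = [from_customer(customer_row), from_client(client_row)]
for record in unified:
    print(record)
```

Keeping `source` next to `source_id` matters because identifiers from different sources are not comparable on their own.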

---

## 8. Categorizing Issues by Anomaly Type

Data quality issues can also be grouped by the type of anomaly.

### 8.1. Syntactical Anomalies
These relate to the format and values of data.
* **Lexical Errors**: Typos and spelling mistakes (e.g., "Lipzig" instead of "Leipzig").
* **Domain Format Errors**: Inconsistent value formats (e.g., "Buntine, Wray Lindsay" vs. "Wray L. Buntine").
* **Irregularities**: Non-uniform use of units or abbreviations (e.g., salary in USD vs. EUR).
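
Lexical errors can often be caught by comparing values against a reference vocabulary. The sketch below uses `difflib.get_close_matches` from the Python standard library to suggest a correction. The city list, the helper `check_city`, and the similarity cutoff of 0.8 are assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical reference vocabulary of valid city names.
KNOWN_CITIES = ["Leipzig", "Berlin", "Hamburg", "Dresden"]


def check_city(value, vocabulary=KNOWN_CITIES, cutoff=0.8):
    """Classify `value` as ok, a likely typo (with suggestion), or unknown."""
    if value in vocabulary:
        return ("ok", value)
    matches = get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    if matches:
        return ("likely typo", matches[0])
    return ("unknown", None)


print(check_city("Lipzig"))  # ('likely typo', 'Leipzig')
print(check_city("Berlin"))  # ('ok', 'Berlin')
```

A suggestion like this is best treated as a candidate for review rather than an automatic correction, since a close match is not always the intended value.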

### 8.2. Semantic Anomalies
These relate to the meaning, comprehensiveness, and non-redundancy of data.
* **Integrity Constraint Violations**: A value falls outside its defined domain or range (e.g., `bdate=30.13.70`, where 13 is not a valid month).
* **Contradictions**: Violations of dependencies between attributes (e.g., `age` and `DOB` do not align).
* **Duplicates**: Multiple records representing the same real-world entity.
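
The first two semantic checks can be sketched with the standard library. The snippet assumes birth dates are stored as `DD.MM.YY` strings, as in the `bdate` example, and that a reference date is supplied for the age comparison. The function names are hypothetical.

```python
from datetime import date, datetime


def parse_bdate(text):
    """Return a date for 'DD.MM.YY' input, or None if it is not a valid date."""
    try:
        return datetime.strptime(text, "%d.%m.%y").date()
    except ValueError:
        return None


def age_on(birth, today):
    """Completed years between `birth` and `today`."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1  # birthday has not happened yet this year
    return years


def age_contradicts(age, birth, today):
    """True if the stored `age` disagrees with the stored birth date."""
    return age != age_on(birth, today)


invalid = parse_bdate("30.13.70")   # month 13 violates the date domain
valid = parse_bdate("30.12.70")
reference_day = date(2024, 6, 1)    # assumed reference date for the check

print(invalid)                                    # None
print(age_contradicts(40, valid, reference_day))  # True: stored age does not fit
```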

### 8.3. Coverage Anomalies
These are gaps or distortions in how well data represents its target population, space, or time.
* **Under-coverage**: Parts of a population are rarely or never recorded (e.g., a specific store has no sales data for Sundays).
* **Over-coverage**: Segments are overrepresented, often due to duplicates or system errors (e.g., a sensor sends the same reading multiple times).
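
Both coverage problems can be detected by counting what is present. In this sketch, `missing_weekdays` lists weekdays that never occur in a set of sales dates (possible under-coverage) and `repeated_readings` lists sensor readings delivered more than once (possible over-coverage). The data and the record layout are made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def missing_weekdays(dates):
    """Weekday names that never appear in `dates`."""
    seen = {d.weekday() for d in dates}
    return [name for i, name in enumerate(WEEKDAYS) if i not in seen]


def repeated_readings(readings):
    """(sensor, timestamp) keys that were delivered more than once."""
    counts = Counter((r["sensor"], r["ts"]) for r in readings)
    return {key: n for key, n in counts.items() if n > 1}


# Two weeks of hypothetical sales dates with every Sunday missing.
start = date(2024, 1, 1)  # a Monday
two_weeks = [start + timedelta(days=i) for i in range(14)]
sales_dates = [d for d in two_weeks if d.weekday() != 6]

readings = [
    {"sensor": "s1", "ts": "2024-01-01T00:00", "value": 21.5},
    {"sensor": "s1", "ts": "2024-01-01T00:00", "value": 21.5},  # resent
    {"sensor": "s1", "ts": "2024-01-01T00:05", "value": 21.7},
]

print(missing_weekdays(sales_dates))  # ['Sunday']
print(repeated_readings(readings))
```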

---

## 9. Manifestations of "Dirty Data"

Dirty data can be classified into three main categories:

### 9.1. Missing Data
* **Optional Missing Data**: Acceptable blanks where a value is not required (e.g., `middle_name`, `apartment_number`).
* **Erroneous Missing Data**: Unacceptable blanks where a value is mandatory for correctness or compliance (e.g., `order_id`, `transaction_timestamp`).
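
The distinction can be encoded as two field lists, so that only blanks in mandatory fields are reported as errors. The field sets below reuse the examples above; the helper names and the sample order are hypothetical.

```python
REQUIRED_FIELDS = {"order_id", "transaction_timestamp"}
OPTIONAL_FIELDS = {"middle_name", "apartment_number"}


def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def classify_missing(record):
    """Split a record's blank fields into erroneous and optional ones."""
    erroneous = sorted(f for f in REQUIRED_FIELDS if is_blank(record.get(f)))
    optional = sorted(f for f in OPTIONAL_FIELDS if is_blank(record.get(f)))
    return {"erroneous": erroneous, "optional": optional}


order = {
    "order_id": "A-1001",
    "transaction_timestamp": "",   # mandatory but blank
    "middle_name": None,           # acceptable blank
    "apartment_number": "4B",
}

report = classify_missing(order)
print(report)
```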

### 9.2. Not Missing But Wrong Data
The data exists, but it's incorrect.
* **Integrity Constraint Violations**: Violations of data type, uniqueness, or referential integrity.
* **Data Entry Errors**: Misspellings, erroneous entries, or data entered into the wrong fields.

### 9.3. Not Missing, Not Wrong, But Unusable Data
The data is technically correct but cannot be used effectively without processing.
* **Ambiguity**: Use of abbreviations (`Dr.` for Doctor or Drive) or incomplete context (`Sydney` in Australia or Canada).
* **Inconsistent Representation**: Different encoding formats, measurement units (currency, weight), or special characters (dashes in phone numbers) across systems make data hard to integrate.
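
Inconsistent representations usually become usable once they are normalized to one canonical form. The sketch below strips formatting characters from phone numbers and converts weights to kilograms. The canonical forms chosen (digits only, kilograms) and the unit table are assumptions; a real system would also need to handle country codes and unknown units.

```python
import re

# Conversion factors to kilograms (1 lb is defined as 0.45359237 kg).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def normalize_phone(raw):
    """Keep only the digits, dropping dashes, spaces, and parentheses."""
    return re.sub(r"\D", "", raw)


def weight_in_kg(value, unit):
    """Convert a weight given in `unit` to kilograms."""
    return value * TO_KG[unit]


phone_a = normalize_phone("555-123-4567")
phone_b = normalize_phone("(555) 123 4567")
print(phone_a == phone_b)        # True: same number, different formatting
print(weight_in_kg(2, "lb"))
```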

---

## 10. Data Quality Management Frameworks

Data quality management frameworks are structured approaches to ensuring that data is fit for use. They provide the principles, policies, and processes to manage data quality throughout its lifecycle.

### 10.1. Core Components
A robust framework typically includes:
* **Data Governance**: The overall management of data availability, usability, integrity, and security.
* **Data Quality Dimensions**: Defining what "quality" means for the organization.
* **Data Quality Standards**: Establishing clear rules and standards for data.
* **Data Quality Assessment**: Measuring and evaluating the current state of data quality.
* **Data Quality Improvement**: Processes for cleansing and correcting data.
* **Data Quality Monitoring**: Continuously tracking data quality over time.
* **Tools & Technologies**: The software used to implement the framework.
* **Training & Awareness**: Cultivating a data-quality-conscious culture.

---

## 11. The Role of Machine Learning (ML) in Data Quality

ML is increasingly vital for enhancing data quality by automating and refining processes.
* **Automated Error Detection**: ML models can learn normal data patterns to identify anomalies and errors automatically.
* **Data Cleansing**: Algorithms can suggest or perform corrections for identified issues.
* **Predictive Data Quality**: ML can predict where quality issues are likely to occur in the future.
* **Enhanced Data Matching**: ML-based matching identifies and merges duplicate records (entity resolution) with greater accuracy.
* **Natural Language Processing (NLP)**: Helps standardize and cleanse unstructured text data.
* **Data Enrichment**: Augments existing data by adding missing information from external sources.


---

## 12. Practice Tasks

### 12.1. Task 1: Anomalies in a Single Table
Identify the possible data anomalies in this table.

**Entry**: S010, Anddy Lee, level D, 500 work hours

**Anomalies**:
* **500 work hours**: A point anomaly, likely a data entry error.
* **Name**: "Anddy" could be a typo for "Andy".

### 12.2. Task 2: Problems in a Multi-Source Table

**Name conflicts**:
* `Customer` vs. `Client`
* `Cid` vs. `Cno`
* `Sex` vs. `Gender`

**Structural conflicts**:
* Different representations for names and addresses

**Data conflicts**:
* Different gender representations ("0"/"1" vs. "F"/"M")
* Duplicate record (Kristen Smith)
* `Cid`/`Cno` values not matchable between sources

**Summary**:
* **Schema level**: There are name conflicts (synonyms `Customer`/`Client`, `Cid`/`Cno`, `Sex`/`Gender`) and structural conflicts (different representations for names and addresses).
* **Instance level**: There are different gender representations ("0"/"1" vs. "F"/"M") and presumably a duplicate record (Kristen Smith). The latter observation also reveals that while `Cid` and `Cno` are both source-specific identifiers, their contents are not comparable between the sources: different numbers (11 and 493) may refer to the same person, while different persons can have the same number (24).
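
Since the identifiers cannot be compared across sources, a duplicate such as the Kristen Smith record has to be found from the attribute values instead. The sketch below applies a deliberately crude matching rule to records that are assumed to be already normalized to one layout. The rows are illustrative stand-ins, not the task's actual table, and `same_person` is a hypothetical helper; real entity resolution would use more attributes and a similarity score.

```python
def same_person(a, b, min_prefix=3):
    """Crude rule: same last name and gender, and one first name
    is a prefix (of at least `min_prefix` letters) of the other."""
    if a["last_name"].lower() != b["last_name"].lower():
        return False
    if a["gender"] != b["gender"]:
        return False
    first_a, first_b = a["first_name"].lower(), b["first_name"].lower()
    shorter, longer = sorted((first_a, first_b), key=len)
    return len(shorter) >= min_prefix and longer.startswith(shorter)


# Illustrative rows only; ids are source-specific and not comparable.
source_1 = [
    {"id": 11, "first_name": "Kristen", "last_name": "Smith", "gender": "F"},
    {"id": 24, "first_name": "Christian", "last_name": "Smith", "gender": "M"},
]
source_2 = [
    {"id": 24, "first_name": "Christoph", "last_name": "Smith", "gender": "M"},
    {"id": 493, "first_name": "Kris", "last_name": "Smith", "gender": "F"},
]

matches = [(a["id"], b["id"])
           for a in source_1 for b in source_2 if same_person(a, b)]
print(matches)  # [(11, 493)]
```

In this sketch the two rows with id 24 are not matched, while ids 11 and 493 are, which mirrors the observation that equal identifiers across sources say nothing about identity.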