# Week 10: Data Integration & Enrichment 🔗

This week, we're moving from transforming a single dataset to the more complex task of combining and enhancing data from multiple sources.

---

## 1. Defining the Concepts

While they are often done together, data integration and data enrichment are distinct processes with different goals.

### 1.1. What is Data Enrichment?
**Data enrichment** is the process of enhancing an existing dataset by **appending additional context or information** from external sources. The goal is to improve the quality, depth, and value of your data, making it more useful for analysis.

For example, you could enrich a customer list with demographic data, geographic details, or social media information.
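
As a minimal sketch of enrichment, the snippet below appends city and state fields to customer records using a lookup keyed by postcode. The records, the `postcode_info` lookup, and the `enrich` helper are all hypothetical and exist only for illustration; in practice the extra fields would come from an external source.

```python
# Hypothetical customer records and a hypothetical external lookup keyed by postcode.
customers = [
    {"customer_id": 1, "name": "Ana", "postcode": "3000"},
    {"customer_id": 2, "name": "Ben", "postcode": "2000"},
    {"customer_id": 3, "name": "Cy", "postcode": "9999"},
]
postcode_info = {
    "3000": {"city": "Melbourne", "state": "VIC"},
    "2000": {"city": "Sydney", "state": "NSW"},
}


def enrich(records, lookup, key):
    """Return copies of the records with the lookup's fields appended.

    Records with no entry in the lookup keep every original field and
    get None for each appended field.
    """
    extra_fields = sorted({field for info in lookup.values() for field in info})
    enriched = []
    for rec in records:
        info = lookup.get(rec[key], {})
        new_rec = dict(rec)  # copy so the original dataset is left untouched
        for field in extra_fields:
            new_rec[field] = info.get(field)
        enriched.append(new_rec)
    return enriched


enriched_customers = enrich(customers, postcode_info, "postcode")
```

Note that the number of rows does not change: enrichment adds columns to existing records rather than adding new records.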

### 1.2. What is Data Integration?
**Data integration** is the process of **combining data from different sources to create a single, unified view**. This is essential in any organization where data is stored in separate databases, spreadsheets, or systems. Key activities include merging different data structures (schemas) and identifying records that refer to the same real-world entity (entity resolution).

### 1.3. Enrichment vs. Integration: Key Differences

| Aspect | Data Enrichment | Data Integration |
| :--- | :--- | :--- |
| **Purpose** | To enhance a dataset's value by adding more detailed information. | To combine datasets to create a unified, consistent, and accessible whole. |
| **Output** | An enhanced dataset with new layers of information. | A single, consolidated dataset from multiple sources. |
| **Process** | Appending relevant data to existing records. | Merging and reconciling data, resolving structural and format conflicts. |

---

## 2. The Challenges of Data Integration

Combining data is rarely straightforward. Common challenges include:
* **Heterogeneous Data**: Sources are often developed independently, with different schemas and objectives.
* **Incompatible Taxonomies**: Sources may have different definitions for the same concept, like what constitutes a "customer".
* **Different Abstraction Levels**: Data might be provided at different granularities, like sales data at a suburb level versus a state level.
* **Data Quality Issues**: Combining data from multiple sources can amplify existing errors and inconsistencies.
* **Time Synchronisation**: Data may have been collected during different time windows, making direct comparison difficult.
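
One common way to handle different abstraction levels is to roll the finer-grained data up to the coarser level before comparing. The sketch below assumes each suburb-level row already carries its state; the figures and the `roll_up` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical suburb-level sales and state-level targets.
suburb_sales = [
    {"suburb": "Carlton", "state": "VIC", "sales": 120.0},
    {"suburb": "Richmond", "state": "VIC", "sales": 80.0},
    {"suburb": "Newtown", "state": "NSW", "sales": 50.0},
]
state_targets = {"VIC": 250.0, "NSW": 40.0}


def roll_up(rows, group_key, value_key):
    """Sum value_key over all rows that share the same group_key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


# Aggregate to state level so both datasets share one granularity.
state_sales = roll_up(suburb_sales, "state", "sales")
comparison = {
    state: {"sales": state_sales.get(state, 0.0), "target": target}
    for state, target in state_targets.items()
}
```

Going the other way (splitting state-level figures down to suburbs) is generally not possible without extra assumptions, which is why the roll-up direction is used here.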

---

## 3. The Integration Process: A Two-Level Approach

The data integration process can be broken down into two levels: **Schema Integration**, which reconciles the structure of the sources, and **Data-Level Integration**, which reconciles the values themselves.

---

## 4. Level 1: Schema Integration 🗺️

This level deals with the structure of the data. A **schema** is the blueprint for your data—it defines the tables, attributes, data types, and relationships. Schema integration involves merging these blueprints from different sources into one unified, mediated schema.

### 4.1. Schema Integration Problems
* **Structure Conflicts**: Inconsistencies in how data is structured (e.g., XML vs. relational database).
* **Naming Conflicts**:
    * **Synonyms**: Different names for the same thing (e.g., `CustomerID` vs. `ClientID`).
    * **Homonyms**: The same name used for different things (e.g., `ID` could be a customer ID or a product ID).
* **Entity Resolution Conflicts**:
    * **Different Units**: Temperature in Celsius vs. Fahrenheit.
    * **Value Heterogeneity**: Inconsistent use of abbreviations (`St.` vs. `Street`).
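
The sketch below shows one way a synonym conflict and a unit conflict might be resolved when mapping records into a mediated schema. The source names, the `TempC`/`TempF` columns, the `COLUMN_MAP` table, and the `to_mediated` helper are assumptions made for this example.

```python
# Hypothetical mapping from each source's column names to the mediated schema.
COLUMN_MAP = {
    "source_a": {"CustomerID": "customer_id", "TempC": "temp_c"},
    "source_b": {"ClientID": "customer_id", "TempF": "temp_f"},
}


def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


def to_mediated(record, source):
    """Rename columns to the mediated names, then convert units to Celsius."""
    mapping = COLUMN_MAP[source]
    # Synonym conflict: CustomerID and ClientID both become customer_id.
    out = {mapping.get(col, col): value for col, value in record.items()}
    # Unit conflict: store every temperature in Celsius.
    if "temp_f" in out:
        out["temp_c"] = round(fahrenheit_to_celsius(out.pop("temp_f")), 1)
    return out


row_a = to_mediated({"CustomerID": 7, "TempC": 21.5}, "source_a")
row_b = to_mediated({"ClientID": 7, "TempF": 212.0}, "source_b")
```

After the mapping, both rows use the same attribute names and the same unit, so they can be compared or merged directly.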

### 4.2. Schema Matching Techniques
**Schema matching** is the task of identifying which elements in one schema correspond to elements in another.
* **Name-Based Matching**: Compares the names of attributes, often using techniques like expanding abbreviations (`loc` to `location`) and identifying synonyms (`cost` and `price`).
* **Instance-Based Matching**: Looks at the actual data values (instances) to find matches. This can be done with handcrafted rules or by training a machine learning classifier to predict if two attributes are a match based on their content.
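
Name-based matching can be sketched as normalising each attribute name and then comparing the normalised forms. The abbreviation and synonym tables below are tiny, hand-written assumptions; a real matcher would need far larger dictionaries and would usually return similarity scores rather than exact matches.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"loc": "location", "qty": "quantity", "cust": "customer"}
SYNONYMS = {"cost": "price", "client": "customer"}


def normalise(name):
    """Turn an attribute name into a tuple of canonical lowercase tokens."""
    # Split camelCase (CustomerID -> Customer ID), then split on _ - and spaces.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    tokens = [t for t in re.split(r"[\s_\-]+", spaced.lower()) if t]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]  # expand abbreviations
    tokens = [SYNONYMS.get(t, t) for t in tokens]       # map synonyms to one term
    return tuple(tokens)


def match_schemas(schema_a, schema_b):
    """Map each attribute in schema_a to an attribute in schema_b with the same normalised name."""
    index = {}
    for attr in schema_b:
        index.setdefault(normalise(attr), attr)
    return {a: index[normalise(a)] for a in schema_a if normalise(a) in index}


matches = match_schemas(
    ["CustomerID", "loc", "cost"],
    ["client_id", "location", "price", "notes"],
)
```

Here `CustomerID` pairs with `client_id`, `loc` with `location`, and `cost` with `price`, while `notes` has no counterpart and is left unmatched.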

---

## 5. Level 2: Data-Level Integration

Once the schemas are aligned, you must integrate the actual data values. This focuses on the rows (tuples) and columns (attributes) themselves.

### 5.1. Attribute-Level Integration (Columns)
This involves handling redundancy and correlations between attributes.
* **Chi-Square ($\chi^2$) Test**: A statistical test of whether two **categorical variables** are associated (e.g., is gender independent of education level?). It compares the observed frequencies in a contingency table with the frequencies that would be expected if the variables were independent; a large gap between the two is evidence of an association.
* **Correlation Coefficient (r)**: Measures the strength and direction of a **linear relationship** between two **numerical variables**. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation), and values near 0 indicate little or no linear relationship. A value close to -1 or +1 suggests that one of the two attributes may be redundant.
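
Both measures can be computed directly from their textbook definitions. The sketch below uses only the standard library; the contingency table and the numeric lists are invented data, and the function names are chosen for this example. The chi-square statistic sums `(observed - expected)^2 / expected` over every cell, where the expected count is `row total * column total / grand total`.

```python
import math


def chi_square(observed):
    """Return (statistic, degrees of freedom) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented 2x2 table: rows are one categorical attribute, columns another.
stat, dof = chi_square([[30, 10], [10, 30]])

# Invented numeric attributes: ys is exactly 2 * xs, so they are redundant.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

The statistic is then compared with a critical value from a chi-square table for the given degrees of freedom (3.841 for 1 degree of freedom at the 0.05 significance level); a larger statistic leads to rejecting independence.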

### 5.2. Tuple-Level Integration (Rows)
This is about finding and merging rows (tuples) from different datasets that refer to the same real-world object.

#### String Matching
A core part of tuple integration is determining if two strings, like "Dave Smith" and "David D. Smith," are a match. Common methods include:
* **Sequence-Based Measures**: These calculate the "cost" of transforming one string into another. The best known is **Edit Distance** (Levenshtein distance), which counts the minimum number of single-character insertions, deletions, and substitutions required.
* **Set-Based Measures**: These treat strings as sets of words (tokens) and use measures like **TF-IDF** to calculate similarity, giving more weight to tokens that are rare across the dataset than to tokens that appear everywhere.
* **Phonetic Similarity**: Matches strings based on how they sound (e.g., "Smith" and "Smythe").
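
The first two families can be sketched in a few lines of standard-library Python. `edit_distance` is the usual dynamic-programming formulation of edit distance. The TF-IDF part uses a smoothed IDF formula, which is one of several common conventions and is an assumption of this sketch, as are the whitespace tokenisation and the sample names.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def tfidf_vectors(strings):
    """One sparse TF-IDF vector (a dict of token -> weight) per input string."""
    docs = [Counter(s.lower().split()) for s in strings]
    n = len(docs)
    df = Counter(tok for doc in docs for tok in doc)  # number of strings containing each token
    # Smoothed IDF (an assumed convention): rarer tokens get larger weights.
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in df}
    return [{tok: tf * idf[tok] for tok, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


distance = edit_distance("Smith", "Smythe")

names = ["Dave Smith", "David D. Smith", "Anna Lee"]
vectors = tfidf_vectors(names)
sim_same_surname = cosine(vectors[0], vectors[1])  # the two names share the token "smith"
sim_unrelated = cosine(vectors[0], vectors[2])     # the two names share no tokens
```

Edit distance works at the character level, so it tolerates typos, while the token-based measure tolerates reordered or missing words; the two are often combined for that reason.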

#### Data Matching Approaches
Beyond simple string matching, you can match entire records:
* **Rules-Based Matching**: Combines similarity scores from multiple attributes into a weighted sum (e.g., 30% name similarity + 30% phone similarity + ...) and declares two records a match when the total reaches a chosen threshold.
* **Learning-Based Matching**:
    * **Supervised Learning**: Trains a model like a decision tree on a labeled dataset of known matches and non-matches to learn a complex matching rule.
    * **Clustering**: Groups similar records together, assuming that all records within a single cluster represent the same entity.
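
A rules-based matcher can be sketched as a weighted sum of per-attribute similarities compared against a threshold. The weights, the threshold, the attribute names, and the sample records below are all hypothetical, and `difflib.SequenceMatcher` from the standard library stands in for whichever string similarity measure is actually chosen.

```python
from difflib import SequenceMatcher

# Hypothetical weights (summing to 1) and decision threshold.
WEIGHTS = {"name": 0.5, "phone": 0.3, "city": 0.2}
THRESHOLD = 0.8


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted combination of per-attribute similarity scores."""
    return sum(w * similarity(rec_a[field], rec_b[field]) for field, w in weights.items())


def is_match(rec_a, rec_b, threshold=THRESHOLD):
    """Declare a match when the weighted score reaches the threshold."""
    return match_score(rec_a, rec_b) >= threshold


# Invented records for illustration.
rec_1 = {"name": "Dave Smith", "phone": "555-0101", "city": "Melbourne"}
rec_2 = {"name": "David Smith", "phone": "555-0101", "city": "Melbourne"}
rec_3 = {"name": "Anna Lee", "phone": "555-0199", "city": "Sydney"}

same_person = is_match(rec_1, rec_2)
different_person = is_match(rec_1, rec_3)
```

In a learning-based approach, the same per-attribute similarity scores would typically become the input features, and the weights and threshold would be learned from labelled pairs instead of being set by hand.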