## Introduction to Data-Centric AI, MIT IAP 2023.

### Lecture 1: Data-Centric AI vs. Model-Centric AI

What is not data-centric AI:

Examples:

1. Hand-picking a bunch of data points you think will improve the model.
2. Doubling the size of your dataset and training an improved model on more data.

Data-centric version:

Examples:

1. Coreset selection: in supervised learning, produce a weighted subset of the data so that training on the subset alone achieves performance similar to training on the entire dataset. ***Look into testing it.***

2. Dataset augmentation.
3. Outlier detection and removal.
4. Feature selection.
5. Establishing consensus labels.
6. Active learning (selecting the most informative data to label next).
7. Curriculum learning.
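
As one illustration of the list above, active learning can be sketched with uncertainty sampling: score each unlabeled example by the entropy of the model's predicted probabilities and label the highest-entropy ones next. This is only a minimal sketch; the function name `select_most_uncertain` and the toy `pool_probs` array are made up for illustration, and entropy is just one of several possible informativeness scores.

```python
import numpy as np

def select_most_uncertain(pred_probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k examples with the highest predictive entropy."""
    eps = 1e-12  # avoids log(0)
    entropy = -np.sum(pred_probs * np.log(pred_probs + eps), axis=1)
    # sort ascending, reverse so the most uncertain comes first, keep the top k
    return np.argsort(entropy)[::-1][:k]

# Hypothetical predicted probabilities for a pool of 4 unlabeled examples, 3 classes.
pool_probs = np.array([
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
])
to_label = select_most_uncertain(pool_probs, k=2)
print(to_label)  # indices of the examples to send to annotators next
```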

### Lecture 2: Label Errors

**Finding label errors by sorting data by loss**

- estimate how much error is in the dataset.
- then, work out which examples are most likely mislabeled: sort by loss!
- what's the cutoff?
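
A minimal sketch of the sort-by-loss idea, assuming integer labels and a matrix of predicted probabilities (ideally out-of-sample). The helper `rank_by_loss`, the toy arrays, and the top-1 cutoff are all illustrative assumptions; choosing the cutoff is exactly the open question noted above.

```python
import numpy as np

def rank_by_loss(labels, pred_probs):
    """Return (indices sorted from highest to lowest loss, per-example losses)."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    # probability the model assigns to each example's given label
    p_given = pred_probs[np.arange(len(labels)), labels]
    losses = -np.log(np.clip(p_given, 1e-12, 1.0))  # cross-entropy per example
    order = np.argsort(-losses)
    return order, losses

labels = np.array([0, 1, 1, 0])
pred_probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.95, 0.05],  # given label is 1, but the model strongly prefers class 0
    [0.60, 0.40],
])
order, losses = rank_by_loss(labels, pred_probs)
suspects = order[:1]  # hypothetical cutoff: review only the single highest-loss example
print(order, suspects)
```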

**What is confident learning**

Confident learning is a framework of theory and algorithms for:

- Finding label errors in a dataset.
- Ranking data by likelihood of being a label issue.
- Learning with noisy labels.
- ***Complete characterization of label noise in a dataset*** (focus here!!).

key idea:

    With confident learning, you can use any model's predicted probabilities to find label errors (data-centric, model-agnostic).

**Where do noisy labels come from?**

- clicked the wrong button.
- mistakes.
- mismeasurement.
- incompetence.
- another ML model's bad predictions.

All of these result in label flips.

Examples of label flips:

- An image of a dog is labeled as a fox.
- The tweet "Hi Welcome to the team" is labeled as toxic language.

**How noisy labels are generated**

- uniform/symmetric class-conditional label noise.
- systematic/asymmetric class-conditional label noise.
- instance-dependent label noise.

**What's uncertainty**

It's the opposite of confidence (a lack of confidence). It depends on:

- the difficulty of an example (aleatoric - label noise: labels have been flipped to another class).
- misunderstanding the example (epistemic - model noise: erroneous predicted probabilities).

**CL assumes class-conditional label noise**

We assume labels are flipped according to an unknown transition matrix `P(ỹ | y*)`, where `ỹ` is the observed (noisy) label and `y*` is the latent true label. The matrix depends only on pairwise noise rates between classes, not on the data `x`. In other words, `class-conditional` label noise depends on the class, not the data.
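
To make the assumption concrete, here is a small simulation sketch. The function `flip_labels`, the 2-class `noise_matrix`, and its rates are hypothetical values chosen for illustration; the point is only that each noisy label is sampled from the column of its true class and the features `x` are never consulted.

```python
import numpy as np

def flip_labels(true_labels, noise_matrix, seed=0):
    """Sample noisy labels, where noise_matrix[i, j] = P(noisy = i | true = j)."""
    rng = np.random.default_rng(seed)
    true_labels = np.asarray(true_labels)
    num_classes = noise_matrix.shape[0]
    # each noisy label is drawn from the column of its true class only
    return np.array([rng.choice(num_classes, p=noise_matrix[:, y]) for y in true_labels])

# Hypothetical asymmetric noise: class 0 (dog) is often mislabeled as class 1 (fox),
# while class 1 is rarely mislabeled as class 0. Each column sums to 1.
noise_matrix = np.array([
    [0.70, 0.05],
    [0.30, 0.95],
])
true_labels = np.array([0] * 1000 + [1] * 1000)
noisy_labels = flip_labels(true_labels, noise_matrix)

flip_rate_0 = np.mean(noisy_labels[true_labels == 0] != 0)  # should be near 0.30
flip_rate_1 = np.mean(noisy_labels[true_labels == 1] != 1)  # should be near 0.05
print(flip_rate_0, flip_rate_1)
```

Setting all off-diagonal entries to the same value would give the uniform/symmetric case instead.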

Claims that `Deep learning is robust to label noise` usually assume uniformly random label noise, so they often don't apply to real-world settings.

**How does confident learning work**

Directly estimate the joint distribution of observed noisy labels and latent true labels.

__Key idea__: first, find a threshold `t_j` for each task/class `j` as a proxy for the model's average self-confidence in that class. (You can use cross-validation to get out-of-sample predicted probabilities, then average the predicted probabilities by class.)

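$$
  t_{j} = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \in X_{\tilde{y}=j}} \mathbb P(\tilde{y}=j; x, \theta)
$$

Here `ỹ` is the given (possibly noisy) label, `X_{ỹ=j}` is the set of examples whose given label is `j`, and `θ` are the model parameters.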
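
The threshold formula can be sketched in NumPy as below, together with a simplified count of the "confident joint" (given label vs. confidently predicted class). This is a hand-rolled illustration under stated assumptions, not cleanlab's implementation: the helper names and toy data are made up, every class is assumed to appear at least once in `labels`, and the calibration and normalization steps that turn the counts into a joint distribution are omitted.

```python
import numpy as np

def per_class_thresholds(labels, pred_probs):
    """t_j = mean predicted probability of class j over examples whose given label is j."""
    num_classes = pred_probs.shape[1]
    return np.array([pred_probs[labels == j, j].mean() for j in range(num_classes)])

def confident_joint(labels, pred_probs, thresholds):
    """counts[i, j] = number of examples with given label i confidently predicted as class j."""
    num_classes = pred_probs.shape[1]
    counts = np.zeros((num_classes, num_classes), dtype=int)
    for given, probs in zip(labels, pred_probs):
        confident = np.flatnonzero(probs >= thresholds)
        if confident.size == 0:
            continue  # no class clears its threshold, so the example is not counted
        likely_true = confident[np.argmax(probs[confident])]
        counts[given, likely_true] += 1
    return counts

labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.90, 0.10],
    [0.85, 0.15],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.30, 0.70],
    [0.90, 0.10],  # given label 1, but confidently predicted as class 0
])
thresholds = per_class_thresholds(labels, pred_probs)
joint_counts = confident_joint(labels, pred_probs, thresholds)
print(thresholds)
print(joint_counts)  # off-diagonal entries point at likely label errors
```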

```python
from cleanlab.filter import find_label_issues

# Works with any ML model: just pass in the model's predicted probabilities.
ordered_label_issues = find_label_issues(
    labels=labels,          # given (possibly noisy) integer labels
    pred_probs=pred_probs,  # out-of-sample predicted probabilities from any model
    return_indices_ranked_by='self_confidence',
)
```

**Ranking label errors**

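Self-confidence: the predicted probability of the given label `i`.

$$
  \mathbb P(\tilde{y}=i; x, \theta)
$$

Normalized margin: the self-confidence minus the largest predicted probability among the other classes.

$$
  \mathbb P(\tilde{y}=i; x, \theta) - \max_{j \neq i} \mathbb P(\tilde{y}=j; x, \theta)
$$

Then sort by the chosen score: the lowest scores are the most likely label errors.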
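
Both ranking scores can be sketched directly from predicted probabilities. The function names and toy arrays below are illustrative assumptions; library implementations may rescale or adjust these scores, so treat this only as the plain version of the two formulas.

```python
import numpy as np

def self_confidence(labels, pred_probs):
    """Predicted probability of each example's given label."""
    return pred_probs[np.arange(len(labels)), labels]

def normalized_margin(labels, pred_probs):
    """Self-confidence minus the largest predicted probability among the other classes."""
    idx = np.arange(len(labels))
    p_given = pred_probs[idx, labels]
    others = pred_probs.copy()
    others[idx, labels] = -np.inf  # exclude the given class from the max
    return p_given - others.max(axis=1)

labels = np.array([0, 1, 2])
pred_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],  # given label 1, but class 0 is more probable
    [0.1, 0.1, 0.8],
])
sc = self_confidence(labels, pred_probs)
nm = normalized_margin(labels, pred_probs)
ranked = np.argsort(nm)  # lowest margin first = most suspicious first
print(sc, nm, ranked)
```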

### Lecture 3: Dataset Creation and Curation

**Concerns when sourcing the data**

Key questions when sourcing training data:

1. How will the resulting ML model be used?
    - On what population will the model be making predictions, and when?

2. What are the hypothetical edge cases where we need the model to make the right predictions?
    - High-stakes scenarios, rare events.

spurious correlation: ML models = shortcut cheaters.

selection bias: training data distribution != distribution in deployment.

causes:

- time.
- overfitting.
- rare events.
- convenience.
- location.

`Validation data` should resemble the distribution in deployment (e.g., hold out the most recent data).
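
One way to follow this advice is a time-based split instead of a random one. The sketch below assumes each example has a sortable timestamp; `time_based_split` and the 20% validation fraction are hypothetical choices for illustration.

```python
import numpy as np

def time_based_split(timestamps, val_fraction=0.2):
    """Return (train_idx, val_idx) with the most recent examples held out for validation."""
    order = np.argsort(timestamps)  # oldest first
    n_val = max(1, int(round(len(order) * val_fraction)))
    return order[:-n_val], order[-n_val:]

# Toy timestamps (any sortable numeric encoding of time works here).
timestamps = np.array([5, 1, 9, 3, 7, 2, 8, 4, 6, 10])
train_idx, val_idx = time_based_split(timestamps, val_fraction=0.2)
print(timestamps[train_idx], timestamps[val_idx])
```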

__How much data should be collected?__

Goal: 95% accuracy.

One idea: plot a learning curve (x: dataset size, y: accuracy; one line per algorithm being validated).
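
A rough sketch of how the points for such a plot could be computed: train on growing subsets and measure validation accuracy each time. Everything here is an assumption for illustration: the synthetic two-blob data, the nearest-centroid classifier standing in for "every algorithm to validate", and the subset sizes. The resulting `(size, accuracy)` pairs can be passed to any plotting library.

```python
import numpy as np

def nearest_centroid_accuracy(X_train, y_train, X_val, y_val):
    """Fit class centroids on the training set and return validation accuracy."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_val[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[np.argmin(dists, axis=1)]
    return float(np.mean(preds == y_val))

def learning_curve(X_train, y_train, X_val, y_val, sizes, seed=0):
    """Return [(n, accuracy)] when training on the first n examples of a random shuffle."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_train))
    return [
        (n, nearest_centroid_accuracy(X_train[order[:n]], y_train[order[:n]], X_val, y_val))
        for n in sizes
    ]

# Synthetic data: two Gaussian blobs centered at (-2, -2) and (2, 2).
rng = np.random.default_rng(0)
def make_blobs(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 2.0, -2.0)
    return X, y

X_train, y_train = make_blobs(1000)
X_val, y_val = make_blobs(500)
curve = learning_curve(X_train, y_train, X_val, y_val, sizes=[10, 50, 200, 1000])
# smallest tried size that reaches the 95% goal, or None if no size does
needed = next((n for n, acc in curve if acc >= 0.95), None)
print(curve, needed)
```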

**Concerns when sourcing the labels**

Work with annotators (data annotation).

problems:

1. Low-accuracy annotators.
2. Copycat annotators.

https://newscatcherapi.com/blog/top-6-text-annotation-tools

Multi-annotator estimates:

- Consensus label = single best label for each example (simple baseline: majority vote, with a confidence score).
- Confidence in consensus = a score reflecting how likely the consensus label is wrong.
- Quality of annotator = overall accuracy of their labels.
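
The three estimates above can be sketched with a plain majority-vote baseline. This is an illustrative assumption, not the CrowdLab method: the helper names, the `-1` convention for "not labeled", and the toy `annotations` matrix are made up, the confidence score is simply the fraction of annotators who agree with the consensus, and annotator quality is measured as agreement with the consensus. It also assumes every example has at least one label and every annotator labeled at least one example.

```python
import numpy as np

def majority_vote(annotations, num_classes):
    """annotations: (num_examples, num_annotators) int array, -1 means not labeled."""
    consensus, agreement = [], []
    for row in annotations:
        votes = np.bincount(row[row >= 0], minlength=num_classes)
        consensus.append(int(np.argmax(votes)))      # ties go to the lowest class index
        agreement.append(votes.max() / votes.sum())  # fraction agreeing with the consensus
    return np.array(consensus), np.array(agreement)

def annotator_quality(annotations, consensus):
    """Fraction of each annotator's labels that match the consensus label."""
    quality = []
    for a in range(annotations.shape[1]):
        labeled = annotations[:, a] >= 0
        quality.append(float(np.mean(annotations[labeled, a] == consensus[labeled])))
    return np.array(quality)

# 4 examples, 3 annotators, 3 classes.
annotations = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, -1, 2],
    [0, 1, 1],
])
consensus, agreement = majority_vote(annotations, num_classes=3)
quality = annotator_quality(annotations, consensus)
print(consensus, agreement, quality)
```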

Better methodology: CrowdLab (its estimates depend on model performance, so it can be bad if your model has poor accuracy. Take care with it!)

https://cleanlab.ai/blog/multiannotator/

__Textbook__: Human-in-the-loop Machine Learning (Robert Monarch)