## Introduction to Data-Centric AI, MIT IAP 2023.

### Lecture 1: Data-Centric AI vs. Model-Centric AI

What is not data-centric AI:

Examples:

1. Hand-picking a bunch of data points you think will improve the model.
2. Doubling the size of your dataset and training an improved model on more data.

Data-centric versions:

Examples:

1. Coreset selection: in supervised learning, coreset selection produces a weighted subset of the data such that training only on the subset achieves performance similar to training on the entire dataset ***Look into testing it***.

2. Dataset augmentation.
3. Outlier detection and removal.
4. Feature selection.
5. Establishing consensus labels.
6. Active learning (selecting the most informative data to label next).
7. Curriculum learning.

### Lecture 2: Label Errors

**Finding label errors by sorting data by loss**

- Estimate how much error is in the dataset.
- Then, determine which labels are most likely to be wrong: sort by loss!
- What's the cut-off?

**What is confident learning**

Confident learning is a framework of theory and algorithms for:

- Finding label errors in a dataset.
- Ranking data by likelihood of being a label issue.
- Learning with noisy labels.
- ***Complete characterization of label noise in a dataset*** (focus here!!).

Key idea:

    With confident learning, you can use any model's predicted probabilities to find label errors (data-centric, model-agnostic).

**Where do noisy labels come from?**

- Clicked the wrong button.
- Mistakes.
- Mismeasurement.
- Incompetence.
- Another ML model's bad predictions.

All of these result in label flips.

Examples of label flips:

- An image of a dog is labeled as a fox.
- The tweet "Hi Welcome to the team" is labeled as toxic language.

**How noisy labels are generated**

- Uniform/symmetric class-conditional label noise.
- Systematic/asymmetric class-conditional label noise.
- Instance-dependent label noise.

**What's uncertainty**

It's the opposite of confidence (a lack of confidence). It depends on:

- The difficulty of an example (aleatoric - label noise: labels have been flipped to another class).
- Misunderstanding the example (epistemic - model noise: erroneous predicted probabilities).

**CL assumes class-conditional label noise**

We assume labels are flipped based on an unknown transition matrix $P(\tilde{y} \mid y^{*})$ that depends only on pairwise noise rates between classes, not on the data `x`. In other words, `class-conditional` label noise depends on the class, not the data.
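To make the class-conditional assumption concrete, here is a minimal sketch that simulates it. The function name `flip_labels`, the 3-class transition matrix, and the noise rates are all hypothetical choices for illustration, not values from the lecture.

```python
import random

def flip_labels(true_labels, transition, seed=0):
    """Sample a noisy label for each true label from row transition[y_true].

    transition[i][j] plays the role of P(noisy = j | true = i); rows sum to 1.
    The features x are never consulted, which is what class-conditional means.
    """
    rng = random.Random(seed)
    classes = list(range(len(transition)))
    return [rng.choices(classes, weights=transition[y])[0] for y in true_labels]

# Hypothetical asymmetric noise: class 0 ("dog") is labeled as class 1 ("fox")
# 20% of the time; the other classes are never flipped.
transition = [
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
true_labels = [0] * 1000 + [1] * 1000 + [2] * 1000
noisy_labels = flip_labels(true_labels, transition)

flipped = sum(t != n for t, n in zip(true_labels, noisy_labels))
print(f"{flipped} of {len(true_labels)} labels were flipped")
```

Instance-dependent noise would instead need a flip probability that is a function of `x`, which this sketch deliberately leaves out.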

`Deep learning is robust to label noise`: results supporting this claim assume uniformly random label noise and usually don't apply to real-world settings.

**How does confident learning work**

Directly estimate the joint distribution of observed noisy labels and latent true labels.

__Key idea__: first, find a threshold that serves as a proxy for the model's self-confidence, on average, for each task/class j. (You can apply cross-validation to get out-of-sample predicted probabilities, then average the predicted probabilities by class.)

$$
  t_{j} = \frac{1}{|X_{\hat{y}=j}|} \sum_{x \in X_{\hat{y}=j}} \mathbb P(\hat{y}=j; x, \theta)
$$
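A minimal pure-Python sketch of the thresholding step, under the assumption that `labels` holds the given (possibly noisy) labels and `pred_probs` holds out-of-sample predicted probabilities. The helper names and the toy data are hypothetical, and this is a simplification for intuition, not cleanlab's implementation.

```python
from collections import defaultdict

def class_thresholds(labels, pred_probs):
    """t_j: mean predicted probability of class j over examples labeled j."""
    sums, counts = defaultdict(float), defaultdict(int)
    for label, probs in zip(labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    return {j: sums[j] / counts[j] for j in counts}

def confident_class(probs, thresholds):
    """Most probable class among those whose probability reaches its threshold."""
    above = [j for j, t in thresholds.items() if probs[j] >= t]
    return max(above, key=lambda j: probs[j]) if above else None

# Toy data (hypothetical): 5 examples, 2 classes.
labels = [0, 0, 0, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model is confident it is class 1
    [0.2, 0.8],
    [0.3, 0.7],
]
thresholds = class_thresholds(labels, pred_probs)

# Candidate label issues: the confidently predicted class disagrees with the label.
candidates = [
    i for i, (label, probs) in enumerate(zip(labels, pred_probs))
    if confident_class(probs, thresholds) not in (None, label)
]
print(thresholds, candidates)
```

Only example 2 is flagged here. Example 4 is not assigned to any class because its highest probability (0.7) falls below the class-1 threshold.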

```python
from cleanlab.filter import find_label_issues

# Works with any ML model: just pass in the model's predicted probabilities.
# `labels` (given, possibly noisy labels) and `pred_probs` (an N x K array)
# are assumed to be defined already.
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # out-of-sample predicted probabilities from any model
    return_indices_ranked_by="self_confidence",
)
```

**Ranking label errors**

Self confidence.

$$
  \mathbb P(\hat{y}=i; x)
$$

Normalized margin: the probability of the given label minus the largest probability among the other classes. Then, sort by it!

$$
  \mathbb P(\hat{y}=i; x) - \max_{j \neq i} \mathbb P(\hat{y}=j; x)
$$
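Both scores can be sketched in a few lines. This assumes the margin is the probability of the given label minus the largest probability among the other classes; the function names and toy data are hypothetical.

```python
def self_confidence(label, probs):
    """Predicted probability of the given label."""
    return probs[label]

def normalized_margin(label, probs):
    """Probability of the given label minus the best competing class."""
    best_other = max(p for j, p in enumerate(probs) if j != label)
    return probs[label] - best_other

# Toy data (hypothetical): given labels and predicted probabilities.
labels = [0, 1, 2, 0]
pred_probs = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.90, 0.05],
]
margins = [normalized_margin(y, p) for y, p in zip(labels, pred_probs)]

# Lowest score first: the most likely label issues are at the top.
ranked = sorted(range(len(labels)), key=lambda i: margins[i])
print(ranked)  # [3, 1, 0, 2]
```

Sorting by `self_confidence` happens to give the same order on this toy data. The two scores can disagree when a low-probability label still beats every other class.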

### Lecture 3: Dataset Creation and Curation

**Concerns when sourcing the data**

Key questions when sourcing training data:

1. How will the resulting ML model be used?
    - On what population will the model be making predictions, and when?

2. Are there hypothetical edge cases where we need the model to make the right predictions?
    - High-stakes scenarios, rare events.

Spurious correlations: ML models are shortcut cheaters.

Selection bias: training data distribution != distribution in deployment.

Causes:

- Time.
- Overfitting.
- Rare events.
- Convenience.
- Location.

`Validation data` should be similar to distribution in deployment (use most recent data).

__How much data should be collected?__

Goal: 95% accuracy.

One idea: plot a learning curve (x: dataset size, y: accuracy; one line per algorithm being validated).

**Concerns when sourcing the labels**

Working with annotators (data annotation).

Problems:

1. Low-accuracy annotators.
2. Copycat annotators.

https://newscatcherapi.com/blog/top-6-text-annotation-tools

Multi-annotator estimate:

- Consensus label = single best label for each example (simple majority vote, with a confidence score).
- Confidence in consensus = how likely it is to be wrong.
- Quality of annotator = overall accuracy of their labels.
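The three estimates above can be sketched with plain majority voting. The annotation data and annotator names are hypothetical. "Confidence" here is just the agreement fraction, and annotator quality is measured against the consensus (not against ground truth, which we do not have).

```python
from collections import Counter

# Hypothetical annotations: annotations[example_id][annotator] = label.
annotations = {
    "ex1": {"ann_a": "cat", "ann_b": "cat", "ann_c": "dog"},
    "ex2": {"ann_a": "dog", "ann_b": "dog", "ann_c": "dog"},
    "ex3": {"ann_a": "cat", "ann_b": "dog", "ann_c": "dog"},
}

consensus, confidence = {}, {}
for example, votes in annotations.items():
    # most_common(1) breaks ties by first-seen order; real data needs a tie rule.
    label, count = Counter(votes.values()).most_common(1)[0]
    consensus[example] = label
    confidence[example] = count / len(votes)  # fraction of annotators agreeing

agree, total = Counter(), Counter()
for example, votes in annotations.items():
    for annotator, label in votes.items():
        total[annotator] += 1
        agree[annotator] += int(label == consensus[example])
quality = {annotator: agree[annotator] / total[annotator] for annotator in total}

print(consensus, confidence, quality)
```

Majority vote treats every annotator as equally reliable, which is the weakness that model-assisted methods try to address.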

Better methodology: CrowdLab (it should converge with model performance, so it could be bad if your model is poorly accurate. Take care with it!)

https://cleanlab.ai/blog/multiannotator/

__Textbook__: Human-in-the-loop Machine Learning (Robert Monarch)

### Lecture 4: Data-centric Evaluation of ML Models

A loss function evaluates model predictions for a new example against its given label.

The loss may be a function of:

1. The predicted class $\hat{y} \in \{1, 2, \ldots, k\}$ deemed most likely for x.

    Examples of such classification losses: accuracy, balanced accuracy, precision, recall, ...

2. The predicted probabilities $[p_1, p_2, \ldots, p_k] \in \mathbb{R}^{k}$ of each class for x.

    Examples of such classification losses: log loss, AUROC, calibration error, ...

Typical score = average of $Loss(M(x_{i}), y_{i})$ over many examples held out during training.

Alternatives:

- Average the loss over examples from each class separately (e.g., per-class accuracy).
- Report complete confusion matrix.
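A small sketch of why the alternatives matter, using hypothetical toy predictions on an imbalanced two-class problem. Per-class accuracy is computed from the rows of the confusion matrix.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts examples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [
        row[j] / sum(row) if sum(row) else float("nan")
        for j, row in enumerate(matrix)
    ]

# Hypothetical held-out data: 8 examples of class 0, 2 of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

matrix = confusion_matrix(y_true, y_pred, num_classes=2)
overall = sum(matrix[j][j] for j in range(2)) / len(y_true)
per_class = per_class_accuracy(matrix)
print(matrix, overall, per_class)
```

The overall accuracy is 0.9, yet the minority class is only right half the time, which the single average hides.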

Invest as much time thinking about evaluation as you do about:
- What models to apply.
- How to improve them.

Model evaluation has a huge impact in real applications.

__Common pitfalls when evaluating models__

1. Failing to use truly held-out data (data leakage).
2. Reporting only average loss can under-represent severe failure cases for rare examples/subpopulations (misspecified metric).
3. Validation data not representative of deployment setting (selection bias).
4. Some labels incorrect (annotation error).

__Underperforming Subpopulations__

Define `data slice`: a subset of the dataset that shares common characteristics.

`Model predictions should not depend on which slice a datapoint belongs to`

Look for improvements:

- Over-sample (up-weight) examples from minority subgroup that is receiving poor predictions.
- Collect additional data from the subgroup with poor performance.

To see if this has promise (most uncommon):

- Re-fit the model on many versions of the dataset with this subgroup down-sampled to varying degrees.
- Extrapolate the resulting model performance (overall and for subgroups) to what you would expect if you had more data from this subgroup.

__Discovering underperforming subpopulations__

1. Sort examples in the validation data by their loss value, and look at the examples with the highest loss, for which your model is making the worst predictions (error analysis).
2. Apply clustering to these examples with high loss to uncover clusters that share common themes amongst these examples.
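A minimal sketch of the error-analysis step. As a stand-in for clustering, it groups losses by a slice name that is assumed to be available as metadata. The validation records and slice names are hypothetical.

```python
import math
from collections import defaultdict

def log_loss(label, probs, eps=1e-12):
    """Negative log-probability assigned to the given label."""
    return -math.log(max(probs[label], eps))

# Hypothetical validation set: (slice name, given label, predicted probabilities).
validation = [
    ("daytime", 0, [0.90, 0.10]),
    ("daytime", 1, [0.20, 0.80]),
    ("night", 0, [0.40, 0.60]),
    ("night", 1, [0.70, 0.30]),
    ("daytime", 0, [0.85, 0.15]),
]
losses = [log_loss(label, probs) for _, label, probs in validation]

# Step 1: highest loss first, so the worst predictions are inspected first.
worst_first = sorted(range(len(validation)), key=lambda i: losses[i], reverse=True)

# Step 2 (simplified): average loss per known slice instead of clustering.
by_slice = defaultdict(list)
for (slice_name, _, _), loss in zip(validation, losses):
    by_slice[slice_name].append(loss)
mean_loss = {name: sum(vals) / len(vals) for name, vals in by_slice.items()}

print(worst_first, mean_loss)
```

When no slice metadata exists, the high-loss examples would instead be clustered on their features or embeddings.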

Look into `slice discovery methods`: algorithms that find slicing functions, which split a dataset into underperforming slices.

__Why did my model get a particular prediction wrong?__

1. The given label is incorrect, and our model actually gives the right prediction (recommended action: correct the label).

2. The example does not belong to any class, or is fundamentally not predictable (recommended action: toss this example from the dataset; consider adding an "other" class if there are many such examples).

3. The example is an outlier (recommended action: toss the example if similar examples would never be seen in deployment; alternatively, add synthetic data via data augmentation so the model becomes invariant to the difference that makes this outlier stand out from other examples).

4. The dataset has other examples with nearly identical features but a different label (recommended action: define classes more distinctly, or measure extra features to enrich the data).

There are cases where a `model-centric` approach is actually needed, as not all models can handle all the data patterns in the real world. (Otherwise, why look for better algorithms?)

`The key is to practice with influence functions`: influence reveals which datapoints have the greatest impact on the model. For instance, correcting the label of a mislabeled datapoint with high influence can produce a much larger model improvement than correcting a mislabeled datapoint with low influence.

### Lecture 5: Class Imbalance, Outliers, and Distribution Shift

__Evaluation metrics__

If you’re splitting a dataset into train/test splits, make sure to use stratified data splitting to ensure that the train distribution matches the test distribution (otherwise, you’re creating a distribution shift problem).
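A pure-Python sketch of stratified splitting: each class is shuffled and split separately so both splits keep the same class proportions. The function name and the 90/10 toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Hypothetical imbalanced dataset: 90 examples of class 0, 10 of class 1.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)

minority_in_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), minority_in_test)
```

In practice, scikit-learn's `train_test_split` offers the same behavior through its `stratify` argument.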

With imbalanced data, standard metrics like accuracy might not make sense. There is no one-size-fits-all solution for choosing an evaluation metric: the choice should depend on the problem.

__Training models on imbalanced data__

Once an evaluation metric has been chosen, you can try training a model in the standard way. If training a model on the true distribution works well, i.e., the model scores highly on the evaluation metric over a held-out test set that matches the real-world distribution, then you’re done!

If not, try:

- Sample weights.
- Over-sampling.
- Under-sampling.
- SMOTE.
- Balanced mini-batch training: for models trained with mini-batches, like neural networks, when assembling the random subset of data for each mini-batch, you can include datapoints from minority classes with higher probability, such that the mini-batch is balanced. This approach is similar to over-sampling, and it does not throw away data.
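Two of the options above, sample weights and balanced mini-batches, can be sketched together. The weighting formula `n / (k * count)` is one common convention and an assumption here. The function names and the 95/5 toy labels are hypothetical.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n / (k * count) so every class has equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def balanced_batch(labels, batch_size, seed=0):
    """Sample indices with replacement, favoring minority classes."""
    rng = random.Random(seed)
    class_weight = balanced_class_weights(labels)
    weights = [class_weight[label] for label in labels]
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)

# Hypothetical imbalanced dataset: 95 examples of class 0, 5 of class 1.
labels = [0] * 95 + [1] * 5
class_weight = balanced_class_weights(labels)

batch = balanced_batch(labels, batch_size=1000)
minority_share = sum(labels[i] for i in batch) / len(batch)
print(class_weight, round(minority_share, 2))
```

The batch size of 1000 is unrealistically large and is only used so the roughly 50/50 class balance is visible in one draw.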

__Identifying outliers__

Some techniques:

- Tukey’s fences. A simple method for scalar real-valued data. If $Q_1$ and $Q_3$ are the lower and upper quartiles, then this test says that any observation outside the range $[Q_1 - k(Q_3 - Q_1),\ Q_3 + k(Q_3 - Q_1)]$ is considered an outlier, where $k = 1.5$ is a common choice.

- Z-score: The Z-score is the number of standard deviations by which a value is above or below the mean. An outlier is a data point that has a high-magnitude Z-score. You can apply this technique to individual features as well.

- Isolation forest.
- KNN distance.
- Reconstruction-based methods. Autoencoders are generative models that are trained to compress high-dimensional data into a low-dimensional representation and then reconstruct the original data. If an autoencoder learns a data distribution, then it should be able to encode and then decode an in-distribution data point back into a data point that is close to the original input data. However, for out-of-distribution data, the reconstruction will be worse, so you can use reconstruction loss as a score for detecting outliers.
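The two simplest techniques in this list can be sketched with the standard library. The defaults `k=1.5` and a Z-score threshold of 3 are common conventions rather than values from the lecture, and the data is a hypothetical toy sample.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations away from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical scalar feature with a single extreme value.
values = [10, 11, 12, 12, 13, 13, 14, 15, 16] * 3 + [100]

print(tukey_outliers(values))   # [100]
print(zscore_outliers(values))  # [100]
```

One caveat about Z-scores: the outlier itself inflates the mean and standard deviation, so on very small samples an extreme value may not reach the threshold.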

__Distribution shift__

Distribution shift is a challenging problem that occurs when the joint distribution of inputs and outputs differs between training and test stages, i.e.,

$$
P_{train}(X,y) \neq P_{test}(X,y)
$$

This issue is present, to varying degrees, in nearly every practical ML application, in part because it is hard to perfectly reproduce testing conditions at training time.

__Types of distribution shift__

**Covariate shift / data shift**

Covariate shift occurs when $P(x)$ changes between train and test, but $P(y | x)$ does not. In other words, the distribution of inputs changes between train and test, but the relationship between inputs and outputs does not change.

https://dcai.csail.mit.edu/lectures/imbalance-outliers-shift/covariate-shift.svg

Examples of covariate shift:

- Self-driving car trained on the sunny streets of San Francisco and deployed in the snowy streets of Boston.
- Speech recognition model trained on native English speakers and then deployed for all English speakers.
- Diabetes prediction model trained on hospital data from Boston and deployed in India.

**Concept shift**

Concept shift occurs when $P(y | x)$ changes between train and test, but $P(x)$ does not.  In other words, the input distribution does not change, but the relationship between inputs and outputs does. This can be one of the most difficult types of distribution shift to detect and correct.

It is tricky to come up with real-world examples of concept shift where there is absolutely no change in $P(x)$. One example where concept shift dominates:

- Making purchase recommendations based on web browsing behavior, trained on pre-pandemic data and deployed in March 2020. Between the pre-pandemic period and the pandemic, the relationship between browsing behavior and purchases changed (e.g., someone who watched lots of travel videos on YouTube before the pandemic might buy plane or hotel tickets, while during the pandemic they might pay for nature documentary movies).

**Prior probability shift / label shift**

Prior probability shift occurs when $P(y)$ changes between train and test, but $P(x | y)$ does not. To understand it, consider the example of spam classification, where a commonly-used model is Naive Bayes. If the model is trained on a balanced dataset of 50% spam and 50% non-spam emails, and then it’s deployed in a real-world setting where 90% of emails are spam, that is an example of prior probability shift.

Another example is training a classifier to predict diagnoses given symptoms, where the relative prevalence of diseases changes over time. Prior probability shift (rather than covariate shift) is the appropriate assumption to make here, because diseases cause symptoms.
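For the spam example, a worked sketch of how predicted probabilities can be re-weighted once the deployment class prior is known. It assumes that only the class prior changes and that the model's probabilities are well calibrated for the training prior. The function name and the example prediction are hypothetical.

```python
def adjust_for_new_priors(probs, train_priors, deploy_priors):
    """Re-weight each class probability by deploy_prior / train_prior, then renormalize."""
    reweighted = [
        p * new / old for p, old, new in zip(probs, train_priors, deploy_priors)
    ]
    total = sum(reweighted)
    return [r / total for r in reweighted]

# Class order: [non-spam, spam].
train_priors = [0.5, 0.5]    # balanced training set
deploy_priors = [0.1, 0.9]   # 90% spam in deployment

# Hypothetical model output for one email, leaning towards non-spam.
probs = [0.6, 0.4]
adjusted = adjust_for_new_priors(probs, train_priors, deploy_priors)
print([round(p, 3) for p in adjusted])  # [0.143, 0.857]
```

The same email flips from "probably not spam" to "probably spam" purely because spam is far more common in deployment.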

**Detecting and addressing distribution shift**

Some ways you can detect distribution shift in deployments:

- Monitor the performance of your model. Monitor accuracy, precision, statistical measures, or other evaluation metrics. If these change over time, it may be due to distribution shift.
- Monitor your data. You can detect data shift by comparing statistical properties of training data and data seen in a deployment.
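One simple way to compare a feature in training data against deployment data is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs. This is a sketch with hypothetical data. The choice of KS is an assumption, and any alert threshold would have to be chosen per application.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for value in a + b:
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Hypothetical feature values seen in training and in two deployment windows.
train_feature = [0.1 * i for i in range(100)]
recent_similar = [0.1 * i + 0.05 for i in range(100)]
recent_shifted = [0.1 * i + 5.0 for i in range(100)]

drift_small = ks_statistic(train_feature, recent_similar)
drift_large = ks_statistic(train_feature, recent_shifted)
print(round(drift_small, 2), round(drift_large, 2))
```

SciPy's `scipy.stats.ks_2samp` computes the same statistic along with a p-value.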

At a high level, distribution shift can be addressed by fixing the data and re-training the model. In some situations, the best solution is to collect a better training set.

If unlabeled testing data are available while training, then one way to address covariate shift is to assign individual sample weights to training datapoints to weigh their feature-distribution such that the weighted distribution resembles the feature-distribution of test data. In this setting, even though test labels are unknown, label shift can similarly be addressed by employing shared sample weights for all training examples with the same class label, in order to make the weighted feature-distribution in training data resemble the feature distribution in the test data. However, concept shift cannot be addressed without knowledge of its form in this setting, because there is no way to quantify it from unlabeled test data.




### Lecture 8: Encoding Human Priors: Data Augmentation and Prompt Engineering

- Human priors. They are prior knowledge we have about the world, about the data, about the task. And we often take them for granted, like the rotated dog.

In the case of the rotated dog, it’s a special type of human prior that is particularly useful. That’s an invariance. Basically a change to the input data that doesn’t change its output. This is useful because we can find smart ways to encode them in the input training data without needing to gather more data.

- Encoding — what does that mean? It just means finding a function to represent the invariance. So for rotating the image, it’s just a function for rotation.

Specifically, we’re looking at adapting the data today. It’s a very effective place to be doing this and much easier than making architectural or loss function adaptations. It’s a common technique that we ML researchers and practitioners all do.

__Human Priors to Augment Training Data__

Data augmentation enables you to encode your human priors over invariances that you know about in your data, and to augment your dataset further, for example by applying flips and rotations to the dog pictures mentioned previously.
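A tiny sketch of encoding the flip and rotation invariances as functions. It uses a nested list as a stand-in for an image, so the "image", its label, and the function names are hypothetical. Real pipelines would use an image library.

```python
def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the grid a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]

# A 2x3 "image" of pixel values, labeled "dog".
image = [
    [1, 2, 3],
    [4, 5, 6],
]
label = "dog"

# Each transformed copy keeps the original label: that is the invariance.
augmented = [
    (image, label),
    (horizontal_flip(image), label),
    (rotate_90_clockwise(image), label),
]
for pixels, y in augmented:
    print(pixels, y)
```

Whether a transformation is truly label-preserving depends on the task. For example, flipping digits or text can change the class.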

Now, those are pretty simple. There are far more advanced methods, such as Mobius transformations (Zhou et al., 2020). If you have classes, you can also use an effective method called Mixup, where you can mix your different classes together to be used as interpolated examples in alpha space (Zhang et al., 2017). What does that mean? If you have dog pictures and cat pictures, you can overlay these images together (e.g. by varying the alpha or A parameter in RGBA). For example, you can change the alpha of a cat image to 60% and the dog image to 40%. You’d get a blended cat-dog, and as a human, you’d agree that there is a cat and dog in it. Then, you could actually change your class label to be 60% cat and 40% dog for your model to predict. You can vary this however you want across your data to produce more training examples with precise labels. This is actually a very effective technique and used pretty widely now.
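The 60% cat / 40% dog blend described above, as a minimal sketch. Images are flattened to short lists of pixel values and labels are one-hot vectors. The data is hypothetical, and sampling the mixing weight from a Beta distribution follows the usual Mixup recipe with an arbitrarily chosen `alpha`.

```python
import random

def mixup(x1, y1, x2, y2, lam):
    """Blend two examples and their one-hot labels with weight lam on the first."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Hypothetical flattened images and one-hot labels in the order [cat, dog].
cat_pixels, cat_label = [1.0, 0.0, 0.5, 0.0], [1.0, 0.0]
dog_pixels, dog_label = [0.0, 1.0, 0.5, 1.0], [0.0, 1.0]

# Fixed blend from the text: 60% cat, 40% dog.
mixed_pixels, mixed_label = mixup(cat_pixels, cat_label, dog_pixels, dog_label, lam=0.6)
print(mixed_pixels, mixed_label)

# In training, the weight is usually drawn at random for every pair.
rng = random.Random(0)
alpha = 0.2  # assumed hyperparameter
random_lam = rng.betavariate(alpha, alpha)
random_pixels, random_label = mixup(cat_pixels, cat_label, dog_pixels, dog_label, random_lam)
```

The soft label always sums to 1, so it can be used directly with a cross-entropy loss that accepts probability targets.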

Data augmentation can also be taken to the extreme of synthetic data augmentation. This means using the data you already have, you can even train a model to generate more of that kind or class of data. For this, you can train your own model or you can use foundation models, such as DALL-E or Stable Diffusion in the image scenario, to generate more data from them. Just know that you have to think about how this impacts your test set, if the foundation model has been trained on samples in your test set.

__Human Priors at Test-Time (LLMs)__

One popular method is called prompt engineering, and it is used for large language models (LLMs). It means changing the input at test time to elicit certain results in the output. For example, you can ask an LLM to write a letter of recommendation for a student, and it’ll write a letter that’s pretty average. But if you ask it to write a letter of recommendation for a student who gets into MIT, then it does much better, because the prompt implies the letter should be strong enough for that outcome.



### Appendix

- https://dcai.csail.mit.edu/lectures/
- https://towardsdatascience.com/outlier-detection-with-autoencoders-6c7ac3e2aa90
- https://pyod.readthedocs.io/en/latest/
- https://direct.mit.edu/books/edited-volume/3841/Dataset-Shift-in-Machine-Learning
- https://dcai.csail.mit.edu/lectures/interpretable-features/