# Data Collection

In [1]:
%load_ext autoreload
%autoreload 2
%pdb on

Automatic pdb calling has been turned ON


## Motivation

#### Re-introduce the problem to be solved

#### Do you already have a dataset? 

If yes, skip to next section.

#### List potential data sources to solve the described problem

#### Find & Document where you can get that data

#### Filter the list to include the most promising data sources

#### Check legal obligations & get authorization if necessary

#### Create a workspace with enough storage space

#### Acquire the raw data for each data source

#### Is there any preprocessing/cleaning/labeling of the raw data?

Check the following data preparation techniques for each raw dataset:
- Discretization or bucketing.
- Tokenization, part-of-speech tagging.
- SIFT feature extraction.
- Removal of instances.
- Processing of missing values.

If yes:
- Was the "raw" data saved in addition to the preprocessed data?
- Is the software used to preprocess/clean/label the instances available?
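
As an illustration of two of the techniques listed above (bucketing and processing of missing values), here is a minimal sketch using only the standard library. The `ages` column, the bucket boundaries, and the choice of median imputation are assumptions made for the example; the right strategy depends on your data.

```python
import bisect
import statistics


def bucketize(value, boundaries):
    """Return the index of the bucket that `value` falls into.

    `boundaries` must be sorted; bucket i covers [boundaries[i-1], boundaries[i]).
    """
    return bisect.bisect_right(boundaries, value)


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


ages = [23, None, 41, 67, None, 35]   # hypothetical raw column
age_boundaries = [30, 50]             # hypothetical buckets: <30, 30-49, >=50

clean_ages = impute_median(ages)      # [23, 38.0, 41, 67, 38.0, 35]
age_buckets = [bucketize(a, age_boundaries) for a in clean_ages]   # [0, 1, 1, 2, 1, 1]
```

Keeping `ages` untouched and writing the result to `clean_ages` mirrors the first question above: the raw data stays available next to the preprocessed version.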

#### Convert each raw dataset into a format you can easily manipulate

#### Check the size & type of data

Allocate enough compute for further processing.
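
One rough way to check size and type is to walk the workspace and total the file sizes per extension. This is only a sketch: `summarize_directory` is a hypothetical helper, and file extensions are just a proxy for the actual data type.

```python
from collections import Counter
from pathlib import Path


def summarize_directory(root):
    """Return (total_bytes, Counter of file extensions) for all files under `root`."""
    total_bytes = 0
    extensions = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            total_bytes += path.stat().st_size
            # Files without an extension are counted under "<none>".
            extensions[path.suffix.lower() or "<none>"] += 1
    return total_bytes, extensions
```

Calling `summarize_directory` on the directory that holds the raw data returns the total size in bytes and a count of files per extension, which can inform how much memory and disk to allocate.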

#### Create the data instances using the different data sets

## Composition

#### How was the data associated with each instance acquired?

#### What do the instances that comprise the dataset represent?

#### How many instances are there in total (of each type, if appropriate)?

#### Does the resulting dataset contain all instances or is it a sample from a larger set?

#### Is there a label or target associated with each instance? 

If so, please provide a description.

#### Is any information missing from the individual instances?

#### Are there any relationships between the instances?

#### How will you split the data (e.g. training, development/validation, testing)?
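
A minimal sketch of a random train/validation/test split is shown below. The 80/10/10 proportions and the fixed seed are assumptions for the example. A purely random split may be inappropriate when instances are related (for example, time-ordered records or several instances from the same user), in which case splitting by time or by group is usually safer.

```python
import random


def split_dataset(instances, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle a copy of `instances` and cut it into train/validation/test lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))   # 80 10 10
```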

#### Are there any errors, sources of noise, or redundancies in the dataset?
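
Exact duplicates are one kind of redundancy that is cheap to detect. The sketch below hashes a canonical serialization of each instance; it assumes instances are JSON-serializable dictionaries, and it will not catch near-duplicates (for example, texts that differ only in whitespace).

```python
import hashlib
import json
from collections import defaultdict


def find_exact_duplicates(instances):
    """Group indices of instances whose canonical JSON serialization is identical."""
    groups = defaultdict(list)
    for index, instance in enumerate(instances):
        # sort_keys makes the serialization independent of key order.
        canonical = json.dumps(instance, sort_keys=True).encode("utf-8")
        groups[hashlib.sha256(canonical).hexdigest()].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


records = [   # hypothetical instances
    {"text": "good product", "label": 1},
    {"label": 1, "text": "good product"},   # same content, different key order
    {"text": "bad product", "label": 0},
]
duplicates = find_exact_duplicates(records)   # [[0, 1]]
```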

#### Is the dataset self-contained, or does it rely on external resources?

#### Does the dataset contain data that might be considered confidential?

## Assembly

#### What are the procedures used to collect the data (e.g. sensor, curation, program, API)?

#### How would you sample the training data?

Consider the following options:
- Non-probability Sampling
    - Convenience: Samples of data are selected based on their availability.
    - Snowball: Future samples are selected based on existing samples.
    - Judgment: Experts decide what samples to include.
    - Quota: select samples based on quotas for slices of data with no randomization.
- Simple Random: give all samples equal probabilities of being selected.
- Stratified: divide data into groups of interest & sample from each group separately.
- Weighted: each sample is given a weight. Weights are used to sample.
- Reservoir: for streaming data. Involves 3 steps.
    1. Put the first *k* elements into the reservoir.
    2. For each incoming *n*th element, generate a random number *i* such that 1 ≤ *i* ≤ *n*.
    3. If 1 ≤ *i* ≤ *k*: replace the *i*th element in the reservoir with the *n*th element. Else, do nothing.
- Importance sampling for difficult-to-sample-from distributions.
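
The three reservoir sampling steps listed above can be sketched as follows. The function name and the optional `rng` argument are choices made for this example; the logic follows the steps as described.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k elements from an iterable of unknown length, each kept with equal probability."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, element in enumerate(stream, start=1):
        if n <= k:
            # Step 1: put the first k elements into the reservoir.
            reservoir.append(element)
        else:
            # Step 2: draw i uniformly with 1 <= i <= n (randint is inclusive).
            i = rng.randint(1, n)
            # Step 3: replace the i-th reservoir element if i <= k, else do nothing.
            if i <= k:
                reservoir[i - 1] = element
    return reservoir


sample = reservoir_sample(range(1000), k=10, rng=random.Random(42))
```

Because the stream is consumed one element at a time, memory use stays proportional to *k* regardless of how many elements arrive. If the stream has fewer than *k* elements, all of them are returned.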

## Preprocessing / Cleaning / Labeling

#### How do you intend to label the data?

- Hand Labels. Issues:
    - Slow labeling leads to slow iteration speed & makes your model less adaptive to changing environments and requirements.
    - Label multiplicity: multiple label “perspectives” that come from different annotators.
        - Solution: To minimize the disagreement among annotators, it’s important to first have a clear problem definition.
    - Data lineage: Indiscriminately using data from multiple sources, generated by different annotators, without examining their quality can cause your model to fail.
- Natural Labels.

#### Is there a lack of labels? If yes, how will you handle it?

Options:
- Weak supervision: Leverages (often noisy) heuristics to generate labels.
    - Approaches: *Keyword heuristic; Regular expressions; Database; model outputs.*
- Semi-supervision: Leverages structural assumptions to generate labels.
- Transfer learning: Leverages models pretrained on another task for your new task.
- Active learning: Labels data samples that are most useful to your model.
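
To make the weak supervision option more concrete, here is a minimal sketch of keyword and regular-expression heuristics combined by majority vote. The task (flagging refund requests), the label constants, and every labeling function are hypothetical; real heuristics come from domain knowledge and are expected to be noisy.

```python
import re
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1   # hypothetical label scheme


def lf_keyword(text):
    """Keyword heuristic: mentions of 'refund' suggest a refund request."""
    return POSITIVE if "refund" in text.lower() else ABSTAIN


def lf_money_back(text):
    """Regular-expression heuristic for the phrase 'money back'."""
    return POSITIVE if re.search(r"\bmoney\s+back\b", text, re.IGNORECASE) else ABSTAIN


def lf_thanks(text):
    """Regular-expression heuristic: thank-you messages are usually not requests."""
    return NEGATIVE if re.search(r"\bthank(s| you)\b", text, re.IGNORECASE) else ABSTAIN


def majority_label(text, labeling_functions):
    """Combine the non-abstaining votes; abstain when there are none or on a tie."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


lfs = [lf_keyword, lf_money_back, lf_thanks]
print(majority_label("I want my money back, please refund me", lfs))   # 1
print(majority_label("Thanks for the quick delivery", lfs))            # 0
print(majority_label("Where is my order?", lfs))                       # -1
```

Instances on which every heuristic abstains stay unlabeled and can be handled with one of the other options in the list.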

#### Is there a class imbalance? If yes, how will you handle it?

Steps:
1. Use the right evaluation metrics: precision, recall, F1, ROC curve.
2. Data-level methods: Resampling
3. Algorithm-level methods
    - Cost-sensitive learning
    - Class-balanced loss
    - Focal loss
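
The sketch below illustrates two of the steps above for a binary problem: computing precision, recall, and F1 for the minority class, and a per-example focal loss. The toy labels are made up for the example. The focal loss follows the commonly cited binary form, `-alpha_t * (1 - p_t)**gamma * log(p_t)`; the `gamma` and `alpha` defaults used here are assumptions and are normally tuned.

```python
import math


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def focal_loss(p_positive, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example; assumes 0 < p_positive < 1."""
    p_t = p_positive if label == 1 else 1.0 - p_positive
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)**gamma shrinks the loss of examples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Toy imbalanced dataset: 95 negatives, 5 positives, and a model that always predicts 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                              # 0.95
print(precision_recall_f1(y_true, y_pred))   # (0.0, 0.0, 0.0)
```

The always-negative model reaches 95% accuracy while never finding a positive instance, which is why accuracy alone is a poor metric under class imbalance.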

#### Do you need more data samples for training? If yes, how will you get/generate them?

Options:
- Simple Label-Preserving Transformations
- Input Perturbation
- Data Synthesis
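
As a small illustration of the first two options, the sketch below flips an image horizontally and adds Gaussian noise to a feature vector. Both helpers are hypothetical, and images are represented as nested lists to keep the example dependency-free. Whether a transformation actually preserves the label depends on the task: a horizontal flip is usually harmless for object photos but not for digits or text.

```python
import random


def horizontal_flip(image):
    """Simple transformation: mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def add_gaussian_noise(features, sigma=0.01, rng=None):
    """Input perturbation: add zero-mean Gaussian noise to each feature."""
    if rng is None:
        rng = random.Random()
    return [x + rng.gauss(0.0, sigma) for x in features]


image = [[1, 2, 3],
         [4, 5, 6]]
flipped = horizontal_flip(image)   # [[3, 2, 1], [6, 5, 4]]
noisy = add_gaussian_noise([0.5, 1.5, 2.5], sigma=0.01, rng=random.Random(0))
```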

## Uses

#### Has the dataset been used for any tasks already?

#### Is there a repository that links to any or all papers or systems that use the dataset?

#### What (other) tasks could the dataset be used for?

#### Is there anything about the composition of the dataset that might impact future uses?

#### Are there tasks for which the dataset should not be used?

## Intent to distribute

#### Will the dataset be distributed to third parties?

#### How will the dataset be distributed (e.g. tarball on website, API, GitHub)?

#### When will the dataset be distributed?

#### Will the dataset be distributed under a copyright or other IP license, or under terms of use (ToU)?

#### Have any third parties imposed IP restrictions on the data associated with the instances?

#### Do any regulatory restrictions apply to the dataset or to individual instances?

## Maintenance

#### Who is supporting/hosting/maintaining the dataset?

#### How can the owner/curator/manager of the dataset be contacted (e.g. email address)?

#### Will the dataset be updated (e.g. to correct labeling errors, add new instances, etc.)?

#### If the dataset relates to people, are there applicable limits on the retention of the data?

#### Will older versions of the dataset continue to be supported/hosted/maintained?

#### If others want to extend/augment/build on/contribute to the dataset, is it possible?

---