### Amazon SageMaker Ground Truth

- uses a combination of **human labelers** and an **active learning model** to label data efficiently.
- can reduce the time and cost of labeling datasets by up to 70%.

### K-Means Clustering

[Video](https://www.youtube.com/watch?v=Cf_LSDCEBzk)

You select k, a predetermined number of clusters that you want to form. Then k points (the centroids of the k clusters) are placed at random locations in feature space. The algorithm then proceeds as follows:

1. For each point in your training dataset, find the centroid that the point is closest to.
2. Assign that point to that centroid's cluster.
3. For each cluster, move its centroid to the center (the mean) of all the points that were assigned to that cluster in step 2.
4. Repeat steps 1 through 3 until you’ve either reached convergence (points no longer change cluster membership) or a specified number of iterations has been reached.
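
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation SageMaker uses. The function name `kmeans` and its parameters are made up for this example. It also assumes the initial centroids are drawn from the data points themselves, a common variation on "random locations in feature space".

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch: returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.full(len(points), -1)

    for _ in range(max_iters):
        # Find the closest centroid for every point and assign the point to it.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Convergence: no point changed cluster membership.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[j] = members.mean(axis=0)

    return centroids, labels
```

Calling `kmeans(data, k=3)` on an `(n_samples, n_features)` array returns the final centroids and one cluster index per row.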


### Choosing a "Good" K

One method for choosing a "good" k is to choose based on empirical data.

- A bad k would be one so high that each centroid has only one or two very close data points near it, and
- Another bad k would be one so low that data points are really far away from the centers.

You want to select a k such that data points in a single cluster are close together, but there are enough clusters to effectively separate the data. You can approximate this separation by measuring how close your data points are to their cluster centers: the average distance between each point and the centroid of its assigned cluster. As you try increasing values of k, this average distance typically reaches an "elbow" where it stops decreasing sharply, and the k at that elbow is usually a good choice.
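
A rough way to produce the numbers behind such an elbow plot is sketched below. The helper names (`fit_kmeans`, `average_centroid_distance`, `elbow_curve`) are hypothetical, and the compact k-means inside is only a stand-in for whatever clustering implementation you actually use.

```python
import numpy as np

def fit_kmeans(points, k, iters=50, seed=0):
    """Compact k-means used only to illustrate the elbow method."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    # Final assignment against the last centroid positions.
    dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
    return centroids, dists.argmin(axis=1)

def average_centroid_distance(points, k, seed=0):
    """Mean distance between each point and its assigned centroid."""
    points = np.asarray(points, dtype=float)
    centroids, labels = fit_kmeans(points, k, seed=seed)
    return float(np.linalg.norm(points - centroids[labels], axis=1).mean())

def elbow_curve(points, k_values, seed=0):
    """Map each candidate k to its average centroid distance."""
    return {k: average_centroid_distance(points, k, seed) for k in k_values}
```

Plotting the values of `elbow_curve(data, range(1, 11))` against k and looking for the point where the curve flattens gives a candidate for k. Because k-means depends on its random start, the curve can be slightly noisy from one seed to the next.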

### Data Dimensionality
One thing to note is that it’s often easiest to form clusters when you have **low-dimensional data**. For example, it can be difficult, and often noisy, to get good clusters from data that has over 100 features. In high-dimensional cases, there is often a dimensionality reduction step (like PCA) that takes place before data is analyzed by a clustering algorithm. 

### Normalization

[Source](https://en.wikipedia.org/wiki/Feature_scaling)

To make sure the feature measurements are consistent and comparable, you’ll scale all of the numerical features into a range between 0 and 1. This is a pretty typical normalization step.

1. Rescaling (min-max normalization)

data will be in [0, 1] range

$
x'={\frac  {x-{\text{min}}(x)}{{\text{max}}(x)-{\text{min}}(x)}}
$

2. Mean normalization 

centers data around the mean (zero mean), values fall within [-1, +1]

$
x'={\frac {x-{\text{average}}(x)}{{\text{max}}(x)-{\text{min}}(x)}}
$

3. Standardization (Z-score Normalization)

rescales data to zero mean and unit standard deviation

$
x' = \frac{x - \bar{x}}{\sigma}
$
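
The three formulas translate directly into NumPy. The function names below are chosen for this example only. The sketch assumes that no feature is constant (otherwise the denominators are zero) and uses the population standard deviation for $\sigma$.

```python
import numpy as np

def min_max_scale(x):
    """Rescaling (min-max normalization): maps each feature into [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def mean_normalize(x):
    """Mean normalization: zero mean, values within [-1, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def standardize(x):
    """Standardization (z-score): zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)
```

Each function works on a 1-D array or column-wise on an `(n_samples, n_features)` array, so every feature is scaled independently.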

### PCA

[Video](https://www.youtube.com/watch?v=uyl44T12yU8)

Principal Component Analysis (PCA) attempts to reduce the number of features within a dataset while retaining the “principal components”, which are defined as weighted combinations of existing features that:

1. Are uncorrelated with one another, so you can treat them as independent features, and
2. Account for the largest possible variability in the data!

So, depending on how many components we want to produce, the first one accounts for the largest share of the variability in our data, the second for the second-largest share, and so on. This is exactly what we want for clustering purposes!

**PCA is commonly used when you have data with a large number of features.**
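
To make the idea concrete, here is a minimal PCA sketch built on NumPy's singular value decomposition. It is an illustration under simple assumptions (dense data, mean-centering only, more than one sample), and the function name `pca` and its return values are chosen for this example rather than taken from SageMaker.

```python
import numpy as np

def pca(data, n_components):
    """Project data onto its top principal components.

    Returns (projected, components, explained_variance_ratio).
    """
    data = np.asarray(data, dtype=float)
    # PCA is defined on mean-centered data.
    centered = data - data.mean(axis=0)

    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]

    # Share of the total variance captured by each component.
    variance = singular_values ** 2 / (len(data) - 1)
    ratio = variance / variance.sum()

    # Each new feature is a weighted combination of the original features.
    projected = centered @ components.T
    return projected, components, ratio[:n_components]
```

The returned ratios show how much variability each component retains, which helps when deciding how many components to keep before clustering.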

# SageMaker example repository
https://github.com/aws/amazon-sagemaker-examples

## Defining a Custom Model
To define a custom model, you need to have the model itself and the following two scripts:

1. A training script that defines how the model accepts input data and trains. This script should also save the trained model parameters.
2. A predict script that defines how a trained model produces an output and in what format.
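
As a rough illustration of how the two scripts divide the work, the sketch below puts both halves in one file and uses only NumPy. The hook names `model_fn` and `predict_fn` and the `SM_MODEL_DIR` / `SM_CHANNEL_TRAIN` environment variables follow the convention used by SageMaker's framework containers, assuming the input channel is named `train`. The least-squares "model", the `train.csv` layout, and the `weights.npy` file name are hypothetical placeholders.

```python
import argparse
import os
import numpy as np

# --- Training side: read data, fit, save the trained parameters ---

def train(train_dir, model_dir):
    # Hypothetical layout: train.csv with the label in the first column.
    path = os.path.join(train_dir, "train.csv")
    data = np.loadtxt(path, delimiter=",", ndmin=2)
    labels, features = data[:, 0], data[:, 1:]
    # Toy model: linear least squares with a bias term.
    design = np.column_stack([features, np.ones(len(features))])
    weights, *_ = np.linalg.lstsq(design, labels, rcond=None)
    os.makedirs(model_dir, exist_ok=True)
    np.save(os.path.join(model_dir, "weights.npy"), weights)

def main(argv=None):
    parser = argparse.ArgumentParser()
    # Defaults come from environment variables set inside the container.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data"))
    args = parser.parse_args(argv)
    train(args.train, args.model_dir)

# --- Prediction side: load the saved parameters, produce an output ---

def model_fn(model_dir):
    """Load the trained parameters saved by the training step."""
    return np.load(os.path.join(model_dir, "weights.npy"))

def predict_fn(input_data, model):
    """Apply the loaded model to a batch of feature rows."""
    features = np.asarray(input_data, dtype=float)
    design = np.column_stack([features, np.ones(len(features))])
    return design @ model
```

In an actual training script, `main()` would be called under the usual `if __name__ == "__main__":` guard. A real predict script would also decide how requests are deserialized and how responses are formatted.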

### PyTorch
In PyTorch, you have the option of defining a neural network of your own design. These models do not come with any built-in predict scripts and so you have to write one yourself.

### SKLearn
The scikit-learn library, on the other hand, has many pre-defined models that come with train and predict functions attached!

You can define custom SKLearn models in much the same way as PyTorch models, except that you typically only have to define the training script; you can use the default predict function.