# Mark's Problem: Unsupervised Learning

Mark regularly gets handed files full of fashion images, labelled by category. He wants to know how he can use this to help keep up with the latest trends for the magazine.

For now, he's interested in producing a visualization of the various categories so that he can learn more about them. He's hoping these explorations will eventually help him speed up the process of sorting through what he gets sent to review every week.

But first, he has to put this data in a usable format.

In [None]:
from src.data import RawDataset, Dataset
from src.utils import list_dir
from src.paths import raw_data_path

When you are developing in a module, it's really handy to have these lines:

In [None]:
%load_ext autoreload
%autoreload 2

We want to be able to see debug-level logging in the notebook. Here's the incantation (the level starts at INFO; we switch to DEBUG when we need it):

In [None]:
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# More Datasets! Practice Makes Perfect. 
Actually, practice just makes permanent. **Perfect practice** makes perfect, but we digress.

## Adding and processing the Fashion-MNIST (FMNIST) Dataset


Recall that our approach to building a usable dataset is:

1. Assemble the raw data files. Generate (and record) hashes to ensure the validity of these files.
2. Add LICENSE and DESCR (description) metadata to make the raw data usable for other people.
3. Write a function to process the raw data into a usable format (for us, a `Dataset` object).
4. Write transformation functions on `Dataset` objects that fit our data munging into an automated, reproducible workflow.

In practice, that means:

* Create a `RawDataset`
    * `add_url()`: give instructions for how to `fetch` your data and add a `DESCR` and `LICENSE`
    * `add_process()`: add a function that knows how to process your specific dataset
* `workflow.add_raw_dataset()`: add the `RawDataset` to your `workflow`
* Transform your `Dataset`
    * (Optionally add a `transformer` function to the `workflow`)
    * `workflow.add_transformer()`: further transform your data. 
* Run `make data`

Looking at the FMNIST GitHub documentation, we see that the raw data is distributed as a set of 4 files. 

| Name  | Content | Examples | Size | Link | MD5 Checksum|
| --- | --- |--- | --- |--- |--- |
| `train-images-idx3-ubyte.gz`  | training set images  | 60,000|26 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz)|`8d4fb7e6c68d591d4c3dfef9ec88bf0d`|
| `train-labels-idx1-ubyte.gz`  | training set labels  |60,000|29 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz)|`25c81989df183df01b3e8a0aad5dffbe`|
| `t10k-images-idx3-ubyte.gz`  | test set images  | 10,000|4.3 MBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz)|`bef4ecab320f06d8554ea6380940ec79`|
| `t10k-labels-idx1-ubyte.gz`  | test set labels  | 10,000| 5.1 KBytes | [Download](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz)|`bb300cfdad3c16e7a12a480ee83cd310`|


Let's give our dataset a name.

In [None]:
dataset_name="f-mnist"

### Download and Check Hashes
Because Zalando are excellent data citizens, they have conveniently given us MD5 hashes that we can verify when we download this data.
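
To make the idea of hash checking concrete, here is a minimal sketch of what verifying an MD5 hash amounts to, using only the standard library. The helper names `md5_of_file` and `hash_matches` are hypothetical and are not part of `src.data`; `RawDataset.fetch()` does this work for us, and its internals may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fd:
        # Read fixed-size chunks so large files never sit fully in memory
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_matches(path, expected_md5):
    """Compare a file's MD5 digest against a published hex value."""
    return md5_of_file(path) == expected_md5.lower()


# Demo on a throwaway file (not the real F-MNIST download)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello world")
    expected = hashlib.md5(b"hello world").hexdigest()
    print(hash_matches(sample, expected))   # matching digest
    print(hash_matches(sample, "0" * 32))   # deliberately wrong digest
```

A mismatch usually means a truncated or corrupted download, which is exactly what we want to catch before processing.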

In [None]:
# Set the log level to DEBUG so we can see what's going on
logger.setLevel(logging.DEBUG)

In [None]:
# Specify the raw files  and their hashes
data_site = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com'
file_list = [
    ('train-images-idx3-ubyte.gz','8d4fb7e6c68d591d4c3dfef9ec88bf0d'),
    ('train-labels-idx1-ubyte.gz','25c81989df183df01b3e8a0aad5dffbe'),
    ('t10k-images-idx3-ubyte.gz', 'bef4ecab320f06d8554ea6380940ec79'),
    ('t10k-labels-idx1-ubyte.gz', 'bb300cfdad3c16e7a12a480ee83cd310'),
]

In [None]:
fmnist = RawDataset(dataset_name)
for file, hashval in file_list:
    url = f"{data_site}/{file}"
    fmnist.add_url(url=url, hash_type='md5', hash_value=hashval)
# Download and check the hashes
fmnist.fetch()

In [None]:
list_dir(raw_data_path)

### Don't forget the License and Description

In [None]:
# Easy case. Zalando are good data citizens, so their data license is directly available from
# their raw data repo on GitHub

# Notice we tag this data with the name `LICENSE`
fmnist.add_url(url='https://raw.githubusercontent.com/zalandoresearch/fashion-mnist/master/LICENSE',
            name='LICENSE', file_name=f'{dataset_name}.license')


In [None]:
# What does the raw data look like?
# Where did I get it from? 
# What format is it in?
# What should it look like when it's processed?
fmnist_readme = '''
Fashion-MNIST
=============

Notes
-----
Data Set Characteristics:
    :Number of Instances: 70000
    :Number of Attributes: 784
    :Attribute Information: 28x28 8-bit greyscale image
    :Missing Attribute Values: None
    :Creator: Zalando
    :Date: 2017

This is a copy of Zalando's Fashion-MNIST [F-MNIST] dataset:
https://github.com/zalandoresearch/fashion-mnist

Fashion-MNIST is a dataset of Zalando's article images—consisting of a
training set of 60,000 examples and a test set of 10,000
examples. Each example is a 28x28 grayscale image, associated with a
label from 10 classes. Fashion-MNIST is intended to serve as a direct
drop-in replacement for the original [MNIST] dataset for benchmarking
machine learning algorithms. It shares the same image size and
structure of training and testing splits.

References
----------
  - [F-MNIST] Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms.
    Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747
  - [MNIST] The MNIST Database of handwritten digits. Yann LeCun, Corinna Cortes,
    Christopher J.C. Burges. http://yann.lecun.com/exdb/mnist/
'''

fmnist.add_metadata(kind="DESCR", contents=fmnist_readme)

In [None]:
fmnist.fetch()

Recall, most unpacking can be handled automagically. Just run it.

In [None]:
fmnist.unpack()

## Converting a `RawDataset` into a usable `Dataset`

Recall that we need to write a processing function and add it to our `RawDataset`.

### Processing the raw data
Finally, we need to convert the raw data into usable `data` and `target` vectors.
The code at https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py tells us how to do that. Having a look at the sample code, we notice that we need numpy. How do we add this to the environment?
* Add it to `environment.yml`
* `make requirements`

Once we have done this, we can do the following processing and setup:

In [None]:
import numpy as np

unpack_path = fmnist.unpack()
kind = "train"

label_path = unpack_path / f"{kind}-labels-idx1-ubyte"
with open(label_path, 'rb') as fd:
    target = np.frombuffer(fd.read(), dtype=np.uint8, offset=8)
dataset_path = unpack_path / f"{kind}-images-idx3-ubyte"
with open(dataset_path, 'rb') as fd:
    data = np.frombuffer(fd.read(), dtype=np.uint8, offset=16).reshape(len(target), 784)

print(f'Data: {data.shape}, Target: {target.shape}')
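
The `offset=8` and `offset=16` values above come from the IDX file layout: each file starts with big-endian 32-bit header fields, two for labels (magic number, item count) and four for images (magic number, item count, rows, columns). The sketch below builds tiny synthetic IDX byte strings in memory and parses them the same way; the three fake images and their labels are made up for illustration and are not F-MNIST data.

```python
import struct

import numpy as np

# Synthetic stand-ins: 3 fake 28x28 images and their labels
n, rows, cols = 3, 28, 28
labels = np.array([9, 0, 3], dtype=np.uint8)
images = (np.arange(n * rows * cols) % 256).astype(np.uint8)

# IDX headers are big-endian unsigned 32-bit integers
label_bytes = struct.pack(">II", 2049, n) + labels.tobytes()
image_bytes = struct.pack(">IIII", 2051, n, rows, cols) + images.tobytes()

# Label header: magic + count = 2 x 4 bytes, hence offset=8
magic, count = struct.unpack(">II", label_bytes[:8])
target = np.frombuffer(label_bytes, dtype=np.uint8, offset=8)

# Image header: magic + count + rows + cols = 4 x 4 bytes, hence offset=16
magic_img, count_img, n_rows, n_cols = struct.unpack(">IIII", image_bytes[:16])
data = np.frombuffer(image_bytes, dtype=np.uint8, offset=16)
data = data.reshape(count_img, n_rows * n_cols)

print(magic, count, target)
print(magic_img, data.shape)
```

Reading the header instead of hard-coding `784` is a cheap sanity check when you are unsure whether a file really is an MNIST-compatible variant.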

### Building a `Dataset`

Time to build a processing function. Recall that a processing function produces a dictionary of kwargs that can be passed to the `Dataset` constructor:

In [None]:
from src.data import Dataset
help(Dataset.__init__)

Rewriting the sample code into the framework gives us this:

In [None]:
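#%%file -a ../src/data/localdata.py
#__all__ += ['process_mnist']
import numpy as np
from src.paths import interim_data_path  # assumed to live alongside raw_data_path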

def process_mnist(dataset_name='mnist', kind='train', metadata=None):
    '''
    Load the MNIST dataset (or a compatible variant; e.g. F-MNIST)

    dataset_name: {'mnist', 'f-mnist'}
        Which variant to load
    kind: {'train', 'test'}
        Dataset comes pre-split into training and test data.
        Indicates which dataset to load
    metadata: dict
        Additional metadata fields will be added to this dict.
        'subset': the subset that was loaded ('train', or 't10k' for test)
    '''
    if metadata is None:
        metadata = {}
        
    if kind == 'test':
        kind = 't10k'

    label_path = interim_data_path / dataset_name / f"{kind}-labels-idx1-ubyte"
    with open(label_path, 'rb') as fd:
        target = np.frombuffer(fd.read(), dtype=np.uint8, offset=8)
    dataset_path = interim_data_path / dataset_name / f"{kind}-images-idx3-ubyte"
    with open(dataset_path, 'rb') as fd:
        data = np.frombuffer(fd.read(), dtype=np.uint8,
                                       offset=16).reshape(len(target), 784)
    metadata['subset'] = kind
    
    dset_opts = {
        'dataset_name': dataset_name,
        'data': data,
        'target': target,
        'metadata': metadata,
    }
    return dset_opts


Now add this process function to the built-in workflow in order to automate `Dataset` creation.

In [None]:
from functools import partial
from src.data.localdata import process_mnist

In [None]:
fmnist.unpack(force=True)
fmnist.load_function = partial(process_mnist, dataset_name='f-mnist')
ds = fmnist.process(force=True)

In [None]:
ds.data.shape, ds.target.shape
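
If `functools.partial` is unfamiliar, the following sketch shows what it does for the load function: it pins `dataset_name` now and leaves `kind` to be supplied at call time. `toy_process` is a hypothetical stand-in for `process_mnist` that returns only a small dictionary, so the example runs without any data files.

```python
from functools import partial


def toy_process(dataset_name="mnist", kind="train", metadata=None):
    """Hypothetical stand-in for process_mnist: returns Dataset-style kwargs."""
    metadata = dict(metadata or {})
    metadata["subset"] = kind
    return {"dataset_name": dataset_name, "metadata": metadata}


# Pin dataset_name now; `kind` stays open for the caller
load_fmnist = partial(toy_process, dataset_name="f-mnist")

print(load_fmnist(kind="train"))
print(load_fmnist(kind="test"))
```

This is why a single processing function can serve both the train and test variants: the remaining keyword arguments are simply supplied later.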

## Add this Dataset to the master dataset list

In [None]:
from src import workflow

In [None]:
# Add the Raw Dataset to the master list of Raw Datasets
workflow.add_raw_dataset(fmnist)
workflow.available_raw_datasets()

In [None]:
# Create a pair of Datasets from this Raw Dataset, by specifying different options for the RawDataset creation
for kind in ['train', 'test']:
    workflow.add_transformer(from_raw=fmnist.name, raw_dataset_opts={'kind':kind}, 
                             output_dataset=f"{fmnist.name}_{kind}")

workflow.get_transformer_list()

Apply the transforms and save the resulting Datasets. This is the same as running `make data`.


In [None]:
logger.setLevel(logging.INFO)
workflow.make_data()

In [None]:
!cd .. && make data

Now we can load these datasets by name:


In [None]:
ds = Dataset.load("f-mnist_test")
print(f"Data:{ds.data.shape}, Target:{ds.target.shape}")

In [None]:
ds = Dataset.load("f-mnist_train")
print(f"Data:{ds.data.shape}, Target:{ds.target.shape}")
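
Each row of `ds.data` is a flattened 28x28 image and each entry of `ds.target` is a class index from 0 to 9. As a rough sketch of how Mark might get these back into a form suitable for viewing, the helpers below (hypothetical names, not part of `src.data`) reshape rows into images and map indices to the class names listed in the Fashion-MNIST README. Synthetic arrays stand in for `ds.data` and `ds.target` so the example runs on its own.

```python
import numpy as np

# Class names from the Fashion-MNIST README (label index -> name)
FMNIST_LABELS = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def to_images(data, side=28):
    """Reshape flat (n, side*side) rows back into (n, side, side) images."""
    return np.asarray(data).reshape(-1, side, side)


def label_names(target):
    """Map integer class indices to human-readable names."""
    return [FMNIST_LABELS[int(t)] for t in target]


# Synthetic stand-ins for ds.data and ds.target
fake_data = (np.arange(5 * 784) % 256).astype(np.uint8).reshape(5, 784)
fake_target = np.array([9, 0, 0, 3, 7], dtype=np.uint8)

print(to_images(fake_data).shape)
print(label_names(fake_target))
```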

### Don't forget: check in your changes using `git`

* Check the generated `raw_datasets.json` and `transformer_list.json` in to source control
* Do a `make data`
* Add tests if you haven't yet


## Summary
Mark is well on his way to doing data science on his fashion data. In this example, he:
* Created a `RawDataset` consisting of 4 raw data files
* Checked the hashes of these files against known (published) values
* Added license and description metadata
* Added a processing function to parse the contents of these raw data files into a usable format, and
* Created "test" and "train" variants of a `Dataset` object from this `RawDataset`


In [None]:
from functools import partial
from src.data.localdata import process_mnist

# Create a RawDataset from known hashes
fmnist = RawDataset('f-mnist')
data_site = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com'
file_list = [
    ('train-images-idx3-ubyte.gz','8d4fb7e6c68d591d4c3dfef9ec88bf0d'),
    ('train-labels-idx1-ubyte.gz','25c81989df183df01b3e8a0aad5dffbe'),
    ('t10k-images-idx3-ubyte.gz', 'bef4ecab320f06d8554ea6380940ec79'),
    ('t10k-labels-idx1-ubyte.gz', 'bb300cfdad3c16e7a12a480ee83cd310'),
]
for file, hashval in file_list:
    fmnist.add_url(url=f"{data_site}/{file}", hash_type='md5', hash_value=hashval)
# Add metadata and processing functions
fmnist.add_url(url='https://raw.githubusercontent.com/zalandoresearch/fashion-mnist/master/LICENSE',
               name='LICENSE', file_name=f'{dataset_name}.license')
fmnist.add_metadata(kind="DESCR", contents=fmnist_readme)
fmnist.load_function = partial(process_mnist, dataset_name='f-mnist')
workflow.add_raw_dataset(fmnist)
workflow.make_raw()

# Add Datasets (directly from raw)
for kind in ['train', 'test']:
    workflow.add_transformer(from_raw=fmnist.name, raw_dataset_opts={'kind':kind}, 
                             output_dataset=f"{fmnist.name}_{kind}")
workflow.make_data()

In [None]:
workflow.available_datasets()