# Example: Plotting a bin collection and subsample

This notebook gives a minimal example of the `plot` method of the `BinCollection` class, which plots a 3D histogram of the collection together with that of a subsample drawn from it.

In [None]:
%matplotlib notebook

import os
import yaml
from io import StringIO
from numpy.random import seed as npseed  # type: ignore
from pandas import read_csv

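# Change to the repository root so that the subsamplr package and the test
# fixtures resolve. The absolute path is machine-specific: adjust it as needed.
# os.chdir("..")
os.chdir("/Users/thobson/github/Living-with-machines/subsamplr")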
from subsamplr import BinCollection, UnitGenerator

The configuration parameters specify the subsampling variables of interest, together with their upper and lower bounds and bin sizes:

In [None]:
config_str = """
    # Subsampling dimensions
    variables:
        - {name: 'year', class: 'discrete', type: 'int', min: 1800,
            max: 1919, discretisation: 1, bin_size: 10}
        - {name: 'word_count', class: 'continuous', type: 'int',
            min: 0, max: 1000, bin_size: 100}
        - {name: 'ocr_quality_mean', class: 'continuous', type: 'float',
            min: 0.6, max: 1, bin_size: 0.1}
    """


In [None]:
# Read the YAML config.
config = yaml.safe_load(StringIO(config_str))
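
To make the role of `min`, `max` and `bin_size` concrete, here is a minimal sketch of how many bins each dimension would contain. It assumes each range is simply covered by consecutive bins of width `bin_size`; that rule, and the helper `bin_count`, are illustrative assumptions and not necessarily how `subsamplr` partitions the ranges internally.

```python
import math

# Variable bounds and bin sizes copied from the configuration above.
variables = [
    {"name": "year", "min": 1800, "max": 1919, "bin_size": 10},
    {"name": "word_count", "min": 0, "max": 1000, "bin_size": 100},
    {"name": "ocr_quality_mean", "min": 0.6, "max": 1, "bin_size": 0.1},
]


def bin_count(var):
    """Bins of width bin_size needed to cover [min, max] (assumed rule)."""
    ratio = (var["max"] - var["min"]) / var["bin_size"]
    # Round first so floating-point noise in the ratio cannot add a spurious bin.
    return math.ceil(round(ratio, 9))


counts = {var["name"]: bin_count(var) for var in variables}
total_bins = math.prod(counts.values())
print(counts, total_bins)
```

Under this assumption the three dimensions would form a grid of 12 × 10 × 4 = 480 cells.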

A `BinCollection` instance is constructed, with dimensions taken from the configuration parameters.

In [None]:
bc = BinCollection.construct(config)

Newspaper article data is taken from the test fixture containing 100,000 rows.

In [None]:
df = read_csv('tests/fixtures/articles_query_result_100000.csv', sep=",")
df.head()

Subsampling units are generated from the data and assigned to the `BinCollection`.

In [None]:
# Generate article units from the test fixture (data frame)
# and assign to the bin collection.
units = UnitGenerator.generate_units(
    df, unit_id="article_id", variables=bc.dimensions)

for unit, values in units:
    bc.assign_to_bin(unit, values)

Of the 100,000 rows in the data, 61,581 are assigned to bins. The others are excluded, either because they fall outside the configured bounds of the bin collection or because they contain missing values.

In [None]:
bc.count_units()
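
The two exclusion rules can be illustrated with a small sketch. The rows below are made-up stand-ins for the data frame, the helper `is_assignable` is hypothetical, and treating both bounds as inclusive is an assumption; `subsamplr` may handle edge values differently.

```python
# Bounds copied from the configuration; the rows are made-up illustrative data.
bounds = {
    "year": (1800, 1919),
    "word_count": (0, 1000),
    "ocr_quality_mean": (0.6, 1),
}

rows = [
    {"article_id": 1, "year": 1850, "word_count": 420, "ocr_quality_mean": 0.82},
    {"article_id": 2, "year": 1790, "word_count": 300, "ocr_quality_mean": 0.90},   # year below min
    {"article_id": 3, "year": 1901, "word_count": 150, "ocr_quality_mean": None},   # missing value
    {"article_id": 4, "year": 1875, "word_count": 2400, "ocr_quality_mean": 0.71},  # word_count above max
    {"article_id": 5, "year": 1912, "word_count": 88, "ocr_quality_mean": 0.65},
]


def is_assignable(row):
    """True if every variable is present and within its (assumed inclusive) bounds."""
    for name, (lower, upper) in bounds.items():
        value = row.get(name)
        if value is None or not lower <= value <= upper:
            return False
    return True


kept = [row["article_id"] for row in rows if is_assignable(row)]
print(kept)
```

Only articles 1 and 5 pass both checks; the other three would be left out of the bin collection.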

Draw a representative subsample of 1,000 articles.

In [None]:
k = 1000
seed = 14722
npseed(seed)
subsample = bc.select_units(k)
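
One way to obtain a subsample that mirrors the distribution of the full collection is proportional allocation: each bin contributes draws in proportion to its size. The sketch below uses only the standard library, with hypothetical bin labels and sizes. It illustrates the general idea and is not a description of what `select_units` does internally.

```python
import random


def proportional_allocation(bin_sizes, k):
    """Split k draws across bins in proportion to size (largest-remainder rounding)."""
    total = sum(bin_sizes.values())
    if not 0 <= k <= total:
        raise ValueError("k must be between 0 and the total number of units")
    exact = {b: k * n / total for b, n in bin_sizes.items()}
    alloc = {b: int(x) for b, x in exact.items()}
    # Hand the draws lost to rounding down to the bins with the largest remainders.
    remaining = k - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda b: exact[b] - alloc[b], reverse=True)
    for b in by_remainder[:remaining]:
        alloc[b] += 1
    return alloc


# Hypothetical bins "A", "B", "C" with made-up unit counts.
bin_sizes = {"A": 500, "B": 333, "C": 167}
alloc = proportional_allocation(bin_sizes, 10)

# Draw the allocated number of units from each bin with a seeded generator,
# so that the same seed reproduces the same subsample.
rng = random.Random(14722)
units = {b: [f"{b}{i}" for i in range(n)] for b, n in bin_sizes.items()}
sketch_subsample = [u for b, n in alloc.items() for u in rng.sample(units[b], n)]
print(alloc, len(sketch_subsample))
```

Fixing the seed before drawing, as the cell above does with `npseed(seed)`, serves the same purpose of making the selection repeatable.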

Finally, plot the full `BinCollection` and the subsample.

In [None]:
bc.plot(subsample=subsample)