In [2]:
CONN_STRING="postgresql://postgres:password1@localhost/discogs"
%load_ext sql
%sql $CONN_STRING

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


'Connected: postgres@discogs'

# Subsampling the data

The datasets are large, so we may want to work on a subsample to make our machine learning models train faster.

## Stratified Subsampling

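Stratified sampling draws the same proportion of rows from each stratum (here, each value of `genre`), so the sample roughly preserves the genre ratios of the source table.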

In [3]:
%%sql

DROP TABLE IF EXISTS dataset_sample;

SELECT madlib.stratified_sample(
                                'release_f',          -- source table
                                'dataset_sample',     -- output table
                                0.01,                 -- sample proportion
                                'genre');             -- strata definition

 * postgresql://postgres:***@localhost/discogs
Done.
1 rows affected.


stratified_sample


We can check the size of the new table.

In [4]:
%sql SELECT COUNT(*) FROM dataset_sample;

 * postgresql://postgres:***@localhost/discogs
1 rows affected.


count
5484


We can also check how many rows each genre contributes to the sample.

In [5]:
%sql SELECT genre, COUNT(*) FROM dataset_sample GROUP BY genre;

 * postgresql://postgres:***@localhost/discogs
15 rows affected.


genre,count
latin,4
funk soul,92
nonmusic,12
rock,480
childrens,1
brass military,1
pop,19
jazz,87
hip hop,271
stage screen,1
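
To make the idea concrete outside the database, here is a minimal pure-Python sketch of stratified sampling. It is not how MADlib implements `stratified_sample`; the function name, the toy genre counts, and the choice to keep at least one row per stratum are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, proportion, seed=0):
    """Draw roughly the same proportion of rows from every stratum."""
    rng = random.Random(seed)

    # Group the rows by the value of the strata column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    sample = []
    for members in strata.values():
        # Assumption of this sketch: keep at least one row per stratum
        # so that rare values do not disappear from the sample.
        k = max(1, round(len(members) * proportion))
        sample.extend(rng.sample(members, k))
    return sample


# Toy data with hypothetical genre counts (not the Discogs table).
rows = []
for genre, n in [("rock", 800), ("jazz", 150), ("latin", 50)]:
    rows.extend({"genre": genre} for _ in range(n))

sample = stratified_sample(rows, "genre", 0.1)
counts = Counter(row["genre"] for row in sample)
print(counts)
```

The genre ratios in `counts` match those of `rows`, which is the property we rely on when we sample `release_f` by `genre`.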


## Balanced Subsampling

Balanced sampling resamples the data so that every label (here, every `genre`) ends up with the same number of rows, regardless of how common it is in the source table.

In [10]:
%%sql

DROP TABLE IF EXISTS dataset_sample;

SELECT madlib.balance_sample(
                                'release_f',          -- source table
                                'dataset_sample',     -- output table
                                'genre',              -- class column to balance
                                'uniform',            -- class sizes: same number of rows per class
                                5000                  -- requested output table size
);

 * postgresql://postgres:***@localhost/discogs
Done.
1 rows affected.


balance_sample


We can check the size of the new table.

In [11]:
%sql SELECT COUNT(*) FROM dataset_sample;

 * postgresql://postgres:***@localhost/discogs
1 rows affected.


count
5010


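We can also check how many rows each genre contributes. Every genre has 334 rows, which is consistent with the requested 5000 rows being split across 15 genres and rounded up (15 × 334 = 5010).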

In [12]:
%sql SELECT genre, COUNT(*) FROM dataset_sample GROUP BY genre;

 * postgresql://postgres:***@localhost/discogs
15 rows affected.


genre,count
latin,334
funk soul,334
nonmusic,334
rock,334
childrens,334
brass military,334
pop,334
jazz,334
hip hop,334
stage screen,334
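
The same idea can be sketched in pure Python. This is an illustration under stated assumptions, not MADlib's implementation of `balance_sample`: the function name and the toy genre counts are hypothetical, and we assume that large classes are undersampled without replacement while small classes are oversampled with replacement.

```python
import math
import random
from collections import Counter, defaultdict


def balanced_sample(rows, key, output_size, seed=0):
    """Resample so that every class has the same number of rows."""
    rng = random.Random(seed)

    # Group the rows by class label.
    classes = defaultdict(list)
    for row in rows:
        classes[row[key]].append(row)

    # Split the requested size evenly and round up, so the result
    # can be slightly larger than output_size.
    per_class = math.ceil(output_size / len(classes))

    sample = []
    for members in classes.values():
        if len(members) >= per_class:
            # Large class: undersample without replacement.
            sample.extend(rng.sample(members, per_class))
        else:
            # Small class: oversample with replacement (assumption).
            sample.extend(rng.choices(members, k=per_class))
    return sample


# Toy data with hypothetical genre counts (not the Discogs table).
rows = []
for genre, n in [("rock", 800), ("jazz", 150), ("latin", 50)]:
    rows.extend({"genre": genre} for _ in range(n))

sample = balanced_sample(rows, "genre", 200)
counts = Counter(row["genre"] for row in sample)
print(counts, len(sample))
```

With three classes and a requested size of 200, each class gets 67 rows and the sample holds 201 rows, which mirrors the 5000 versus 5010 difference we saw above. Note that `latin` has only 50 source rows, so some of its rows necessarily appear more than once.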
