# `block_cdist` example

This notebook demonstrates `block_cdist` from the `synthimpute` package, using the `mpg` sample dataset.

## Setup

In [1]:
import synthimpute as si
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean
import math

In [2]:
mpg = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv')
# Drop the categorical columns and horsepower, which has some missing values.
mpg.drop(['origin', 'name', 'horsepower'], axis=1, inplace=True)

## Synthesize

In [3]:
synth = si.rf_synth(mpg, ['cylinders'], random_state=0)

Synthesizing feature 1 of 5: weight...
Synthesizing feature 2 of 5: displacement...
Synthesizing feature 3 of 5: mpg...
Synthesizing feature 4 of 5: model_year...
Synthesizing feature 5 of 5: acceleration...


Round `model_year` so that synthesized records are not dropped when we block on it. `cylinders` does not need rounding because it was used as the seed, for which exact values are sampled.

In [4]:
synth.model_year = synth.model_year.round()
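To illustrate why rounding matters, here is a minimal sketch on toy data. The frames `real` and `fake` and the helper `n_matched` are hypothetical, and it assumes blocking requires an exact match on the blocking column, as the note above implies. It is not taken from the `synthimpute` source.

```python
import pandas as pd

# Hypothetical toy data: real years are integers, synthesized years are floats.
real = pd.DataFrame({'model_year': [70, 71, 72]})
fake = pd.DataFrame({'model_year': [70.3, 70.8, 72.0]})

def n_matched(synth_df, real_df, key):
    """Count synthetic rows whose key value also appears in the real data."""
    return int(synth_df[key].isin(real_df[key]).sum())

# Before rounding, only the record that happens to be a whole number matches.
before = n_matched(fake, real, 'model_year')

# After rounding, every synthetic record falls into an existing block.
fake_rounded = fake.assign(model_year=fake.model_year.round())
after = n_matched(fake_rounded, real, 'model_year')

print(before, after)
```

Under exact-match blocking, any synthetic record counted as unmatched here would have no real records to be compared against.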

## `block_cdist`

In [5]:
b0 = si.block_cdist(synth, mpg, metric='euclidean')
b0.head()

| | dist | id1 | id2 |
|---|---|---|---|
| 0 | 4175.112094 | 0 | 0 |
| 1 | 4736.944359 | 1 | 0 |
| 2 | 4181.31889 | 2 | 0 |
| 3 | 4725.971649 | 3 | 0 |
| 4 | 4172.34757 | 4 | 0 |


### Compare to a direct calculation

In [6]:
math.isclose(b0.iloc[0].dist, euclidean(synth.iloc[0], mpg.iloc[0]))

True
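For intuition, the unblocked result can be approximated with `scipy.spatial.distance.cdist` reshaped to long format. This is a sketch on hypothetical toy frames, not the `synthimpute` implementation. The helper name `long_cdist` is made up here, and the row order (all `id1` values for each `id2`) is chosen only to mirror the output above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def long_cdist(XA, XB, metric='euclidean'):
    """All pairwise distances between rows of XA and XB, in long format.

    id1 and id2 are positional row numbers (usable with .iloc) in XA and XB.
    """
    # d has shape (len(XA), len(XB)); d[i, j] is the distance from XA row i to XB row j.
    d = cdist(XA.to_numpy(dtype=float), XB.to_numpy(dtype=float), metric=metric)
    n_a, n_b = d.shape
    return pd.DataFrame({
        'dist': d.T.ravel(),                     # iterate id2 slowest, id1 fastest
        'id1': np.tile(np.arange(n_a), n_b),
        'id2': np.repeat(np.arange(n_b), n_a),
    })

# Hypothetical toy data: two "synthetic" points and three "real" points.
toy_a = pd.DataFrame({'x': [0, 3], 'y': [0, 4]})
toy_b = pd.DataFrame({'x': [0, 0, 6], 'y': [0, 8, 8]})
pairs = long_cdist(toy_a, toy_b)
print(pairs)
```

The result has one row per pair, so its length is `len(toy_a) * len(toy_b)`.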

### Block on one variable

In [7]:
b1 = si.block_cdist(synth, mpg, ['cylinders'], metric='euclidean')
b1.head()

Running block 1 of 5...
Running block 2 of 5...
Running block 3 of 5...
Running block 4 of 5...
Running block 5 of 5...


| | dist | id1 | id2 |
|---|---|---|---|
| 0 | 3288.503643 | 0 | 14 |
| 1 | 3295.291033 | 2 | 14 |
| 2 | 3285.178833 | 4 | 14 |
| 3 | 3286.017737 | 6 | 14 |
| 4 | 3302.908089 | 7 | 14 |


In [8]:
row = b1.iloc[0]
math.isclose(row.dist, euclidean(synth.iloc[int(row.id1)], mpg.iloc[int(row.id2)]))

True
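One way to picture blocking is as a group-by: distances are computed only between records that share the same values in the blocking columns. The sketch below shows that idea on hypothetical toy data. The function `blocked_cdist` is an illustration written for this notebook, not the code `synthimpute` runs, and its row order may differ from the library's.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def blocked_cdist(XA, XB, block_cols, metric='euclidean'):
    """Pairwise distances restricted to rows that agree on block_cols."""
    # Reset indexes so labels equal positions and ids work with .iloc.
    A = XA.reset_index(drop=True)
    B = XB.reset_index(drop=True)
    b_groups = B.groupby(block_cols).groups  # block key -> row labels in B
    pieces = []
    for key, a_idx in A.groupby(block_cols).groups.items():
        if key not in b_groups:
            continue  # no counterpart in B, so these A rows produce no pairs
        b_idx = b_groups[key]
        d = cdist(A.loc[a_idx].to_numpy(dtype=float),
                  B.loc[b_idx].to_numpy(dtype=float), metric=metric)
        pieces.append(pd.DataFrame({
            'dist': d.ravel(),  # row-major: each A row against every B row in the block
            'id1': np.repeat(a_idx.to_numpy(), len(b_idx)),
            'id2': np.tile(b_idx.to_numpy(), len(a_idx)),
        }))
    if not pieces:
        return pd.DataFrame({'dist': [], 'id1': [], 'id2': []})
    return pd.concat(pieces, ignore_index=True)

# Hypothetical toy data: block on 'cyl', measure distance over all columns.
blk_a = pd.DataFrame({'cyl': [4, 4, 8], 'w': [0, 3, 10]})
blk_b = pd.DataFrame({'cyl': [4, 8, 8, 6], 'w': [4, 10, 13, 0]})
blocked = blocked_cdist(blk_a, blk_b, ['cyl'])
print(blocked)
```

Because records in a block agree on the blocking columns, those columns contribute nothing to the Euclidean distance within the block; the `cyl == 6` row in `blk_b` has no partner and yields no pairs.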

### Block on two variables

In [9]:
b2 = si.block_cdist(synth, mpg, ['cylinders', 'model_year'], metric='euclidean')
b2.head()

Running block 1 of 38...
Running block 2 of 38...
Running block 3 of 38...
Running block 4 of 38...
Running block 5 of 38...
Running block 6 of 38...
Running block 7 of 38...
Running block 8 of 38...
Running block 9 of 38...
Running block 10 of 38...
Running block 11 of 38...
Running block 12 of 38...
Running block 13 of 38...
Running block 14 of 38...
Running block 15 of 38...
Running block 16 of 38...
Running block 17 of 38...
Running block 18 of 38...
Running block 19 of 38...
Running block 20 of 38...
Running block 21 of 38...
Running block 22 of 38...
Running block 23 of 38...
Running block 24 of 38...
Running block 25 of 38...
Running block 26 of 38...
Running block 27 of 38...
Running block 28 of 38...
Running block 29 of 38...
Running block 30 of 38...
Running block 31 of 38...
Running block 32 of 38...
Running block 33 of 38...
Running block 34 of 38...
Running block 35 of 38...
Running block 36 of 38...
Running block 37 of 38...
Running block 38 of 38...


| | dist | id1 | id2 |
|---|---|---|---|
| 0 | 3001.983847 | 0 | 102 |
| 1 | 3008.554302 | 2 | 102 |
| 2 | 2973.168051 | 16 | 102 |
| 3 | 3007.085028 | 33 | 102 |
| 4 | 3009.621328 | 35 | 102 |


In [10]:
row = b2.iloc[0]
math.isclose(row.dist, euclidean(synth.iloc[int(row.id1)], mpg.iloc[int(row.id2)]))

True

### Compare blockings

Blocking on more columns reduces the number of comparisons, because distances are computed only between records that match on every blocking column.

In every specification, each `id1`/`id2` pair appears at most once.

In [11]:
pd.DataFrame({
    'blocks': ['none', 'cylinders', 'cylinders+model_year'],
    'rows': [b0.shape[0], b1.shape[0], b2.shape[0]],
    'rows_per_id1_id2': [b0.groupby(['id1', 'id2']).size().max(),
                         b1.groupby(['id1', 'id2']).size().max(),
                         b2.groupby(['id1', 'id2']).size().max()]
})

| | blocks | rows | rows_per_id1_id2 |
|---|---|---|---|
| 0 | none | 158404 | 1 |
| 1 | cylinders | 56851 | 1 |
| 2 | cylinders+model_year | 5336 | 1 |
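The unblocked count of 158404 is consistent with 398 × 398, one row per pair of synthesized and original records. With blocking, the expected count is the sum over blocks of (synthetic records in the block) × (original records in the block). The sketch below computes that on hypothetical toy data; `expected_rows` is an illustrative helper, not part of `synthimpute`, and it assumes exact-match blocking.

```python
import pandas as pd

def expected_rows(XA, XB, block_cols=None):
    """Number of pairs compared, with or without exact-match blocking."""
    if not block_cols:
        return len(XA) * len(XB)
    n_a = XA.groupby(block_cols).size()
    n_b = XB.groupby(block_cols).size()
    # Align block sizes on the block key; a block missing from one side counts as 0.
    return int(n_a.mul(n_b, fill_value=0).sum())

# Hypothetical toy data with two candidate blocking columns.
cnt_a = pd.DataFrame({'cyl': [4, 4, 8], 'year': [70, 71, 70]})
cnt_b = pd.DataFrame({'cyl': [4, 8, 8, 6], 'year': [70, 70, 71, 70]})

rows_none = expected_rows(cnt_a, cnt_b)
rows_one = expected_rows(cnt_a, cnt_b, ['cyl'])
rows_two = expected_rows(cnt_a, cnt_b, ['cyl', 'year'])
print(rows_none, rows_one, rows_two)
```

Each added blocking column can only split existing blocks further, so the count never increases.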
