# `nearest_synth_train_test` example

This example is for the `synthimpute` package and uses the `mpg` sample dataset.

## Setup

In [2]:
import synthimpute as si
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.model_selection import train_test_split

In [3]:
mpg = pd.read_csv(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
)
# Drop the non-numeric columns (origin, name) and horsepower, which has some missing values.
mpg.drop(["origin", "name", "horsepower"], axis=1, inplace=True)

In [4]:
train, test = train_test_split(mpg, test_size=0.5, random_state=0)
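# Without drop=True, reset_index keeps the old index as an `index` column,
# so `index` is treated as a feature in the steps below.
train.reset_index(inplace=True)
test.reset_index(inplace=True)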

## Synthesize

In [5]:
synth = si.rf_synth(train, ["cylinders"], random_state=0)

Synthesizing feature 1 of 6: index...
Synthesizing feature 2 of 6: displacement...
Synthesizing feature 3 of 6: weight...
Synthesizing feature 4 of 6: model_year...
Synthesizing feature 5 of 6: acceleration...
Synthesizing feature 6 of 6: mpg...


## `nearest_synth_train_test`

### Scaled

By default, `nearest_synth_train_test` scales the `train` and `test` sets with respect to `synth` before computing distances.

In [6]:
nearest = si.nearest_synth_train_test(synth, train, test)
nearest.head()

Calculating nearest records to training set...
Calculating nearest records to test set...


| | synth_id | train_id | train_dist | test_id | test_dist |
|---|---|---|---|---|---|
| 0 | 0 | 17 | 0.378928 | 79 | 0.927314 |
| 1 | 1 | 183 | 0.61198 | 153 | 0.188572 |
| 2 | 2 | 32 | 0.00458 | 138 | 0.392288 |
| 3 | 3 | 32 | 0.066896 | 138 | 0.360982 |
| 4 | 4 | 195 | 0.709552 | 169 | 0.58344 |
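
To illustrate why scaling with respect to `synth` can change which record is nearest, here is a minimal sketch on toy data. It assumes that scaling means standardizing each column by the synthetic set's mean and standard deviation; the exact method `synthimpute` uses is not shown in this notebook, and `scale_to_reference` and the toy arrays are hypothetical names introduced only for illustration.

```python
import numpy as np


def scale_to_reference(reference, *others):
    """Standardize arrays using the reference's column means and std devs.

    Hypothetical helper: an assumption about what "scaling with respect
    to synth" could mean, not synthimpute's actual implementation.
    """
    mean = reference.mean(axis=0)
    std = reference.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return [(x - mean) / std for x in (reference, *others)]


# Toy data (not mpg values): two features on very different scales.
synth_toy = np.array([[8.0, 3000.0], [4.0, 2000.0]])
train_toy = np.array([[8.0, 3100.0], [6.0, 3000.0]])

synth_s, train_s = scale_to_reference(synth_toy, train_toy)

# Distances from the first synthetic row to each training row.
raw = np.linalg.norm(train_toy - synth_toy[0], axis=1)
scaled = np.linalg.norm(train_s - synth_s[0], axis=1)

print("nearest (raw):", raw.argmin(), "nearest (scaled):", scaled.argmin())
```

In this toy case the large-valued second column dominates the raw distance, so the nearest training row changes once both columns are put on a comparable scale.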


### Unscaled

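Pass `scale=False` to compute distances on the raw values, which can be validated against SciPy's `euclidean()`. The matches can differ from the scaled version: in the rows shown, synthetic record 4 maps to training record 178 rather than 195, and every nearest test record changes.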

In [7]:
nearest_unscaled = si.nearest_synth_train_test(synth, train, test, scale=False)
nearest_unscaled.head()

Calculating nearest records to training set...
Calculating nearest records to test set...


| | synth_id | train_id | train_dist | test_id | test_dist |
|---|---|---|---|---|---|
| 0 | 0 | 17 | 29.731607 | 125 | 56.92533 |
| 1 | 1 | 183 | 54.347781 | 179 | 89.30349 |
| 2 | 2 | 32 | 0.519549 | 75 | 49.764004 |
| 3 | 3 | 32 | 1.433793 | 75 | 50.859762 |
| 4 | 4 | 178 | 36.664323 | 100 | 61.569039 |
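
One way to quantify how much the scaled and unscaled matches differ is to join the two results on `synth_id` and compare the matched ids. The sketch below uses only the five rows printed in the two outputs above, so the shares describe those rows rather than the full result; the frame and column names other than `synth_id`, `train_id`, and `test_id` are illustrative.

```python
import pandas as pd

# First five rows of each result, copied from the outputs above.
scaled_head = pd.DataFrame({
    "synth_id": [0, 1, 2, 3, 4],
    "train_id": [17, 183, 32, 32, 195],
    "test_id": [79, 153, 138, 138, 169],
})
unscaled_head = pd.DataFrame({
    "synth_id": [0, 1, 2, 3, 4],
    "train_id": [17, 183, 32, 32, 178],
    "test_id": [125, 179, 75, 75, 100],
})

# Align the two results by synthetic record and flag agreeing matches.
both = scaled_head.merge(
    unscaled_head, on="synth_id", suffixes=("_scaled", "_unscaled")
)
both["train_match"] = both["train_id_scaled"] == both["train_id_unscaled"]
both["test_match"] = both["test_id_scaled"] == both["test_id_unscaled"]

train_share = both["train_match"].mean()  # share with the same training match
test_share = both["test_match"].mean()    # share with the same test match
print(train_share, test_share)
```

In a live session the same merge could be applied to `nearest` and `nearest_unscaled` directly, assuming both carry the columns shown in their `head()` outputs.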


In [8]:
nearest_train = si.nearest_record(synth, train)
nearest_train.head()

| | id_A | id_B | dist |
|---|---|---|---|
| 0 | 0 | 17 | 29.731607 |
| 1 | 1 | 183 | 54.347781 |
| 2 | 2 | 32 | 0.519549 |
| 3 | 3 | 32 | 1.433793 |
| 4 | 4 | 178 | 36.664323 |


Verify that `dist` matches `euclidean()`.

In [9]:
euclidean(synth.loc[0], train.loc[int(nearest_train.iloc[0].id_B)])

29.73160656535383
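
The check above covers a single pair of rows. As a rough illustration of what an unscaled nearest-record lookup computes, the following brute-force sketch finds, for each row of one array, the closest row of another by Euclidean distance and cross-checks one result against the standard library. `nearest_record_sketch` and the toy arrays are hypothetical; this is not `synthimpute`'s implementation.

```python
import math

import numpy as np


def nearest_record_sketch(xa, xb):
    """For each row of xa, return the index of the nearest row in xb
    and the Euclidean distance to it (brute force, for illustration)."""
    # Pairwise differences via broadcasting: shape (len(xa), len(xb), n_features).
    diffs = xa[:, None, :] - xb[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    ids = dists.argmin(axis=1)
    return ids, dists[np.arange(len(xa)), ids]


# Toy data, not the mpg values.
xa = np.array([[0.0, 0.0], [3.0, 4.0]])
xb = np.array([[1.0, 1.0], [3.0, 5.0], [10.0, 10.0]])

ids, dist = nearest_record_sketch(xa, xb)

# Cross-check the first match against an independent distance function.
check = math.dist(xa[0].tolist(), xb[ids[0]].tolist())
print(ids, dist, check)
```

The broadcasting approach holds every pairwise difference in memory at once, which is fine for small frames like these but would need chunking or a tree-based search for large ones.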

## `nearest_synth_train_test_record`

Note that this uses the scaled result, `nearest`.

In [10]:
si.nearest_synth_train_test_record(nearest.iloc[0], synth, train, test)

Synthetic record 0 is closest to training record 17 (distance=0.38) and closest to test record 79 (distance=0.93).


| | train | synth | test |
|---|---|---|---|
| acceleration | 12.0 | 12.0 | 11.0 |
| cylinders | 8.0 | 8.0 | 8.0 |
| displacement | 302.0 | 302.0 | 350.0 |
| index | 166.0 | 136.285215 | 124.0 |
| model_year | 75.0 | 74.0 | 73.0 |
| mpg | 13.0 | 13.0 | 11.0 |
| weight | 3169.0 | 3169.0 | 3664.0 |
