# `nearest_synth_train_test` example

An example for the `synthimpute` package, using the `mpg` sample dataset.

## Setup

In [1]:
import synthimpute as si
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.model_selection import train_test_split

In [2]:
mpg = pd.read_csv(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
)
# Drop the categorical columns and `horsepower`, which is sometimes missing.
mpg.drop(["origin", "name", "horsepower"], axis=1, inplace=True)

In [3]:
train, test = train_test_split(mpg, test_size=0.5, random_state=0)
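# reset_index() without drop=True keeps the old index as an `index` column,
# which is why `index` is listed among the synthesized features below.
train.reset_index(inplace=True)
test.reset_index(inplace=True)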
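
The effect of `reset_index` on the column set can be seen on a toy frame. This is a minimal sketch with made-up values, not the `mpg` split itself; `drop=True` is shown only for comparison and is not what the notebook uses.

```python
import pandas as pd

# Toy frame standing in for a shuffled split (values are illustrative only).
df = pd.DataFrame({"mpg": [18.0, 15.0, 16.0]}, index=[7, 2, 5])

kept = df.reset_index()              # old index becomes a new "index" column
dropped = df.reset_index(drop=True)  # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))
```

With the default call, the original row labels travel along as a numeric feature, so any later step that treats all columns as features will include them.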

## Synthesize

In [4]:
synth = si.rf_synth(train, ["cylinders"], random_state=0)

Synthesizing feature 1 of 6: acceleration...
Synthesizing feature 2 of 6: index...
Synthesizing feature 3 of 6: weight...
Synthesizing feature 4 of 6: displacement...
Synthesizing feature 5 of 6: model_year...
Synthesizing feature 6 of 6: mpg...


## `nearest_synth_train_test`

### Scaled

By default, `nearest_synth_train_test` scales the `train` and `test` sets with respect to `synth` before calculating distances.

In [5]:
nearest = si.nearest_synth_train_test(synth, train, test)
nearest.head()

Calculating nearest records to training set...
Calculating nearest records to test set...


|  | synth_id | train_id | train_dist | test_id | test_dist | dist_diff | dist_ratio |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 43 | 0.763422 | 193 | 0.621526 | 0.141896 | 1.228303 |
| 1 | 1 | 51 | 0.252011 | 80 | 0.26938 | -0.017368 | 0.935524 |
| 2 | 2 | 158 | 0.379365 | 86 | 0.325907 | 0.053458 | 1.164029 |
| 3 | 3 | 44 | 0.398754 | 150 | 0.655776 | -0.257022 | 0.608065 |
| 4 | 4 | 67 | 0.282852 | 165 | 0.45186 | -0.169009 | 0.625971 |
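
One plausible reading of "scaling with respect to `synth`" is sketched below. It assumes that every dataset is standardized with the column means and standard deviations of the synthetic data and that distances are Euclidean; the package's actual scaling may differ. The toy values and variable names (`toy_synth`, `toy_train`) are hypothetical.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Toy data (illustrative only, not the mpg split above).
toy_synth = pd.DataFrame(
    {"weight": [3000.0, 2000.0, 4000.0], "mpg": [20.0, 30.0, 10.0]}
)
toy_train = pd.DataFrame({"weight": [2100.0, 3500.0], "mpg": [28.0, 14.0]})

# Assumption: standardize both sets with the synthetic data's statistics,
# so that each feature contributes on a comparable scale.
mu, sigma = toy_synth.mean(), toy_synth.std()
synth_z = (toy_synth - mu) / sigma
train_z = (toy_train - mu) / sigma

# Pairwise Euclidean distances: rows are synth records, columns are train records.
dists = cdist(synth_z, train_z)
toy_nearest = pd.DataFrame({
    "synth_id": toy_synth.index,
    "train_id": toy_train.index[dists.argmin(axis=1)],
    "train_dist": dists.min(axis=1),
})
print(toy_nearest)
```

Without scaling, a wide-ranging feature such as `weight` would dominate the distance; using one set of statistics for all datasets keeps the train and test distances comparable to each other.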


### Unscaled

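Pass `scale=False` to calculate distances on the raw feature values, so that they can be validated against `euclidean()`. Both the distances and some of the matched records differ from the scaled version: for synthetic record 0, the nearest training record is 26 here, versus 43 when scaled.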

In [6]:
nearest_unscaled = si.nearest_synth_train_test(synth, train, test, scale=False)
nearest_unscaled.head()

Calculating nearest records to training set...
Calculating nearest records to test set...


|  | synth_id | train_id | train_dist | test_id | test_dist | dist_diff | dist_ratio |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 26 | 49.461566 | 87 | 72.968374 | -23.506808 | 0.677849 |
| 1 | 1 | 185 | 32.474953 | 179 | 110.893846 | -78.418893 | 0.292847 |
| 2 | 2 | 158 | 39.166733 | 164 | 18.976923 | 20.18981 | 2.063914 |
| 3 | 3 | 44 | 29.236967 | 25 | 66.282123 | -37.045156 | 0.441099 |
| 4 | 4 | 67 | 1.017027 | 165 | 43.336294 | -42.319267 | 0.023468 |
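
The `dist_diff` and `dist_ratio` columns appear to be the difference and the ratio of `train_dist` and `test_dist`. This relationship is inferred from the printed values, not taken from the package source; the sketch below recomputes it for the first three unscaled rows, with `rows` as a hypothetical stand-in frame.

```python
import pandas as pd

# First three rows of the unscaled output above.
rows = pd.DataFrame({
    "train_dist": [49.461566, 32.474953, 39.166733],
    "test_dist": [72.968374, 110.893846, 18.976923],
})

# Inferred relationship between the columns.
rows["dist_diff"] = rows.train_dist - rows.test_dist
rows["dist_ratio"] = rows.train_dist / rows.test_dist

print(rows.round(6))
```

Under this reading, a negative `dist_diff` (equivalently, a `dist_ratio` below 1) means the synthetic record lies closer to its nearest training record than to its nearest test record.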


In [7]:
nearest_train = si.nearest_record(synth, train)
nearest_train.head()

|  | id_A | id_B | dist |
|---|---|---|---|
| 0 | 0 | 26 | 49.461566 |
| 1 | 1 | 185 | 32.474953 |
| 2 | 2 | 158 | 39.166733 |
| 3 | 3 | 44 | 29.236967 |
| 4 | 4 | 67 | 1.017027 |


Verify that `dist` for synthetic record 0 matches `euclidean()` applied to the two records directly.

In [8]:
euclidean(synth.loc[0], train.loc[int(nearest_train.iloc[0].id_B)])

49.46156593628014
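
The same check can be extended from one record to every row. The sketch below is self-contained and uses toy data; the brute-force search with `cdist` is a stand-in written for illustration and is not claimed to be how `nearest_record` works internally. Only the `id_A`, `id_B`, and `dist` column names are borrowed from the output above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, euclidean

# Toy data (illustrative only); both frames use the default 0..n-1 index,
# so positions returned by argmin are also valid row labels.
a = pd.DataFrame({"x": [0.0, 3.0], "y": [0.0, 4.0]})
b = pd.DataFrame({"x": [1.0, 3.0, 10.0], "y": [0.0, 2.0, 10.0]})

# Brute-force nearest record in b for each record in a.
dists = cdist(a, b)
pairs = pd.DataFrame({
    "id_A": a.index,
    "id_B": dists.argmin(axis=1),
    "dist": dists.min(axis=1),
})

# Recompute every distance with euclidean() and compare.
recomputed = [euclidean(a.loc[r.id_A], b.loc[r.id_B]) for r in pairs.itertuples()]
all_match = np.allclose(pairs.dist, recomputed)
print(all_match)
```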

## `nearest_synth_train_test_record`

Note that this uses `nearest`, the scaled result, so the distances it reports are scaled.

In [9]:
si.nearest_synth_train_test_record(nearest.iloc[0], synth, train, test)

Synthetic record 0 is closest to training record 43 (distance=0.76) and closest to test record 193 (distance=0.62).


|  | train | synth | test |
|---|---|---|---|
| acceleration | 13.2 | 13.311413 | 14.0 |
| cylinders | 8.0 | 8.0 | 8.0 |
| displacement | 318.0 | 303.188492 | 318.0 |
| index | 208.0 | 226.026085 | 215.0 |
| model_year | 76.0 | 77.0 | 76.0 |
| mpg | 13.0 | 13.0 | 13.0 |
| weight | 3940.0 | 3365.0 | 3755.0 |
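
A side-by-side table like this can be assembled from three records with plain pandas. The sketch below is only an illustration of the layout, using three of the values printed above; it is not the package's implementation, and the variable names are hypothetical.

```python
import pandas as pd

# Three features copied from the comparison above.
train_rec = pd.Series({"mpg": 13.0, "weight": 3940.0, "acceleration": 13.2})
synth_rec = pd.Series({"mpg": 13.0, "weight": 3365.0, "acceleration": 13.311413})
test_rec = pd.Series({"mpg": 13.0, "weight": 3755.0, "acceleration": 14.0})

# One column per source, one row per feature, sorted by feature name.
comparison = pd.DataFrame(
    {"train": train_rec, "synth": synth_rec, "test": test_rec}
).sort_index()
print(comparison)
```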
