# `nearest_synth_train_test` example

This example demonstrates `nearest_synth_train_test` from the `synthimpute` package, using the `mpg` sample dataset.

## Setup

In [1]:
import synthimpute as si
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.model_selection import train_test_split

In [2]:
mpg = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv')
# Drop the categorical columns and horsepower, which has missing values.
mpg.drop(['origin', 'name', 'horsepower'], axis=1, inplace=True)

In [3]:
train, test = train_test_split(mpg, test_size=0.5, random_state=0)
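# reset_index keeps the original row labels as an `index` column,
# so `index` is synthesized and compared like any other feature below.
train.reset_index(inplace=True)
test.reset_index(inplace=True)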

## Synthesize

In [4]:
synth = si.rf_synth(train, ['cylinders'], random_state=0)[train.columns]

Synthesizing feature 1 of 6: weight...
Synthesizing feature 2 of 6: index...
Synthesizing feature 3 of 6: displacement...
Synthesizing feature 4 of 6: acceleration...
Synthesizing feature 5 of 6: model_year...
Synthesizing feature 6 of 6: mpg...


## `nearest_synth_train_test`

### Scaled

By default, `nearest_synth_train_test` scales the `train` and `test` sets with respect to `synth` before computing distances.

In [5]:
nearest = si.nearest_synth_train_test(synth, train, test)
nearest.head()

Calculating nearest records to training set...
Calculating nearest records to test set...


| | synth_id | train_id | train_dist | test_id | test_dist | dist_diff | dist_ratio |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 92 | 0.197639 | 162 | 1.444015 | -1.246375 | 0.136868 |
| 1 | 1 | 84 | 0.002391 | 80 | 0.414749 | -0.412358 | 0.005765 |
| 2 | 2 | 177 | 0.498632 | 27 | 0.361559 | 0.137073 | 1.379116 |
| 3 | 3 | 121 | 0.303821 | 121 | 0.380731 | -0.07691 | 0.797993 |
| 4 | 4 | 48 | 0.274674 | 67 | 0.463127 | -0.188453 | 0.593085 |
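
To make "scaling with respect to `synth`" concrete, here is a minimal sketch on toy data. It assumes z-score standardization using the synthetic data's column means and standard deviations; the package's actual scaling may differ, and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative toy data (not the mpg results shown in this notebook).
synth = pd.DataFrame({"weight": [2000.0, 3000.0, 4000.0],
                      "mpg": [30.0, 20.0, 10.0]})
train = pd.DataFrame({"weight": [2100.0, 3900.0],
                      "mpg": [29.0, 12.0]})

# Assumption: z-score scaling based on the synthetic data's statistics.
mean, std = synth.mean(), synth.std()
synth_scaled = (synth - mean) / std
train_scaled = (train - mean) / std

# Euclidean distance from synthetic record 0 to each training record,
# measured in scaled units so that weight does not dominate mpg.
dists = np.sqrt(((train_scaled - synth_scaled.loc[0]) ** 2).sum(axis=1))
nearest_id = dists.idxmin()
print(nearest_id, dists[nearest_id])
```

Without scaling, a feature with large raw values such as `weight` contributes far more to the distance than a feature such as `mpg`.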


### Unscaled

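Pass `scale=False` to compute distances on the raw values, which makes them directly comparable to `euclidean()`. The results differ from the scaled version: in the first five rows, the nearest training record changes for synthetic records 2 and 3, and the nearest test record changes for all five.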

In [6]:
nearest_unscaled = si.nearest_synth_train_test(synth, train, test, scale=False)
nearest_unscaled.head()

Calculating nearest records to training set...
Calculating nearest records to test set...


| | synth_id | train_id | train_dist | test_id | test_dist | dist_diff | dist_ratio |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 92 | 6.097228 | 14 | 94.246709 | -88.149481 | 0.064694 |
| 1 | 1 | 84 | 2.026717 | 179 | 33.882159 | -31.855442 | 0.059817 |
| 2 | 2 | 74 | 3.501353 | 126 | 28.379739 | -24.878386 | 0.123375 |
| 3 | 3 | 74 | 21.737288 | 126 | 34.526932 | -12.789644 | 0.629575 |
| 4 | 4 | 48 | 18.631348 | 169 | 41.085382 | -22.454034 | 0.453479 |
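
The `dist_diff` and `dist_ratio` columns are not described here, but the printed values are consistent with `train_dist - test_dist` and `train_dist / test_dist`. The sketch below checks that assumed relationship against the first three unscaled rows above; the definitions are inferred from the output, not taken from the package source.

```python
import pandas as pd

# First three rows of the unscaled output above.
head = pd.DataFrame({
    "train_dist": [6.097228, 2.026717, 3.501353],
    "test_dist": [94.246709, 33.882159, 28.379739],
})

# Assumed definitions, inferred from the printed values.
head["dist_diff"] = head["train_dist"] - head["test_dist"]
head["dist_ratio"] = head["train_dist"] / head["test_dist"]
print(head.round(6))
```

Under these definitions, a negative `dist_diff` (equivalently, a `dist_ratio` below 1) means the synthetic record sits closer to its nearest training record than to its nearest test record.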


In [7]:
nearest_train = si.nearest_record(synth, train)
nearest_train.head()

| | id_A | id_B | dist |
|---|---|---|---|
| 0 | 0 | 92 | 6.097228 |
| 1 | 1 | 84 | 2.026717 |
| 2 | 2 | 74 | 3.501353 |
| 3 | 3 | 74 | 21.737288 |
| 4 | 4 | 48 | 18.631348 |


Verify that `dist` from `nearest_record` matches `euclidean()` for synthetic record 0 and its nearest training record.

In [8]:
euclidean(synth.loc[0], train.loc[int(nearest_train.iloc[0].id_B)])

6.097227834440599
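
For intuition, an unscaled nearest-record search can be sketched by brute force with `scipy.spatial.distance.cdist`. The helper name `nearest_record_sketch` and the toy data are hypothetical; this is not how `si.nearest_record` is necessarily implemented, only a small equivalent under the assumption of plain Euclidean distance.

```python
import pandas as pd
from scipy.spatial.distance import cdist, euclidean


def nearest_record_sketch(A, B):
    """Hypothetical helper: for each row of A, find the closest row of B."""
    # Pairwise Euclidean distances, shape (len(A), len(B)).
    dists = cdist(A.to_numpy(), B.to_numpy(), metric="euclidean")
    nearest_pos = dists.argmin(axis=1)
    return pd.DataFrame({
        "id_A": A.index,
        "id_B": B.index[nearest_pos],
        "dist": dists.min(axis=1),
    })


# Illustrative toy data.
A = pd.DataFrame({"x": [0.0, 5.0], "y": [0.0, 5.0]})
B = pd.DataFrame({"x": [3.0, 6.0, 0.0], "y": [4.0, 5.0, 1.0]})

result = nearest_record_sketch(A, B)
# Same style of check as above: recompute one distance with euclidean().
check = euclidean(A.loc[0], B.loc[result.id_B[0]])
print(result)
print(check)
```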

## `nearest_synth_train_test_record`

In [9]:
si.nearest_synth_train_test_record(nearest.iloc[0], synth, train, test)

Synthetic record 0 is closest to training record 92 (distance=0.2) and closest to test record 162 (distance=1.44).


| | train | synth | test |
|---|---|---|---|
| acceleration | 19.0 | 19.0 | 19.0 |
| cylinders | 8.0 | 8.0 | 6.0 |
| displacement | 260.0 | 260.0 | 258.0 |
| index | 222.0 | 222.0 | 162.0 |
| model_year | 77.0 | 77.0 | 75.0 |
| mpg | 17.0 | 15.5 | 15.0 |
| weight | 4060.0 | 4065.909838 | 3730.0 |
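
A side-by-side table like this one can be assembled with plain pandas. The sketch below uses three of the feature values shown above and `pd.concat`; it illustrates the layout only and is an assumption about the approach, not the package's implementation.

```python
import pandas as pd

# Values copied from the comparison above (three features only).
train_rec = pd.Series({"mpg": 17.0, "weight": 4060.0, "cylinders": 8.0})
synth_rec = pd.Series({"mpg": 15.5, "weight": 4065.909838, "cylinders": 8.0})
test_rec = pd.Series({"mpg": 15.0, "weight": 3730.0, "cylinders": 6.0})

# One column per record, one row per feature, sorted by feature name.
comparison = pd.concat(
    [train_rec, synth_rec, test_rec],
    axis=1,
    keys=["train", "synth", "test"],
).sort_index()
print(comparison)
```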
