# 2. Data Preparation
*Note: If you are prompted to select a kernel, please select SageMaker Jumpstart PyTorch 1.0.*

In this notebook, we prepare a sample dataset.

You can select Run->Run All Cells from the menu to run all cells in Studio (or Cell->Run All in a SageMaker Notebook Instance).

In [None]:
# !pip install scikit-learn

In [None]:
import os

from source.config import Config
from source.preprocessing import pivot_data, sample_dataset
from source.dataset import DatasetGenerator

In [None]:
config = Config(filename="config/config.yaml", fetch_sensor_headers=False)
config

In [None]:
# Make sure the output directory for the fleet dataset exists.
dirname = os.path.dirname(config.fleet_dataset_fn)
os.makedirs(dirname, exist_ok=True)

## Defining the dataset
You can supply your own dataset or use the provided scripts to generate a toy dataset. Set `should_generate_data` to `False` if you are supplying your own files.

In [None]:
should_generate_data = True

In [None]:
if should_generate_data:
    fleet_statistics_fn = "data/generation/fleet_statistics.csv"
    generator = DatasetGenerator(
        fleet_statistics_fn=fleet_statistics_fn,
        fleet_info_fn=config.fleet_info_fn,
        fleet_sensor_logs_fn=config.fleet_sensor_logs_fn,
        period_ms=config.period_ms,
    )
    generator.generate_dataset()

assert os.path.exists(config.fleet_info_fn), "Please copy your data to {}".format(config.fleet_info_fn)
assert os.path.exists(config.fleet_sensor_logs_fn), "Please copy your data to {}".format(config.fleet_sensor_logs_fn)
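
If you bring your own data, it can help to check all input files up front and report every missing path at once, rather than stopping at the first failed assertion. The following is a minimal sketch of that idea. `find_missing_files` is a hypothetical helper, not part of the `source` package, and the file names are placeholders standing in for `config.fleet_info_fn` and `config.fleet_sensor_logs_fn`.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return the paths that do not exist on disk (hypothetical helper)."""
    return [p for p in paths if not os.path.exists(p)]


# Demonstration with temporary placeholder files; in the notebook you would
# pass config.fleet_info_fn and config.fleet_sensor_logs_fn instead.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "fleet_info.csv")
    absent = os.path.join(tmp, "fleet_sensor_logs.csv")
    open(present, "w").close()  # create only one of the two files

    missing = find_missing_files([present, absent])
    missing_names = [os.path.basename(p) for p in missing]

print("Missing files:", missing_names)
```

In the notebook, a non-empty result would be the signal to copy your data to the listed locations before continuing.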

## Merge the sensor data and fleet vehicle data together

In [None]:
pivot_data(config)

In [None]:
sample_dataset(config)
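
The internals of `pivot_data` are not shown in this notebook. As a rough illustration of what pivoting and merging can mean here, the sketch below turns long-format sensor rows into one record per vehicle and timestamp, then attaches per-vehicle metadata. All column names, values, and the `pivot_and_merge` function are assumptions made for illustration, not the actual implementation or schema. It uses only the standard library so that it runs anywhere; a dataframe library would be a natural fit for real data.

```python
from collections import defaultdict

# Hypothetical long-format sensor log: one row per (vehicle, timestamp, sensor).
sensor_logs = [
    {"vehicle_id": "v1", "timestamp": 0, "sensor": "voltage", "value": 12.1},
    {"vehicle_id": "v1", "timestamp": 0, "sensor": "current", "value": 3.4},
    {"vehicle_id": "v2", "timestamp": 0, "sensor": "voltage", "value": 11.8},
    {"vehicle_id": "v2", "timestamp": 0, "sensor": "current", "value": 2.9},
]

# Hypothetical per-vehicle metadata, keyed by vehicle ID.
fleet_info = {"v1": {"make": "A"}, "v2": {"make": "B"}}


def pivot_and_merge(logs, info):
    """Pivot sensor rows to one record per (vehicle, timestamp), then add vehicle info."""
    wide = defaultdict(dict)
    for row in logs:
        key = (row["vehicle_id"], row["timestamp"])
        wide[key][row["sensor"]] = row["value"]  # each sensor becomes a column

    merged = []
    for (vehicle_id, timestamp), sensors in sorted(wide.items()):
        record = {"vehicle_id": vehicle_id, "timestamp": timestamp}
        record.update(sensors)
        record.update(info.get(vehicle_id, {}))  # vehicles without metadata keep sensor columns only
        merged.append(record)
    return merged


merged_rows = pivot_and_merge(sensor_logs, fleet_info)
for row in merged_rows:
    print(row)
```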

## Next Stage
Up next, we'll visualize the sample data. [Click here to continue.](./3_data_visualization.ipynb)