# Data Preparation at Scale Using Amazon SageMaker Data Wrangler and Processing

The smaller version of our dataset (covering 1 month) is about 5 GB. We can analyze it on a modern workstation without much difficulty. But what about the full dataset, which is closer to 500 GB? To prepare the full dataset, we need a horizontally scalable cluster computing framework. Furthermore, activities such as encoding categorical variables can take a long time with an inefficient processing framework.

## Exporting the flow

Data Wrangler is very handy when we want to quickly explore a dataset. But we can also export the results of a flow into Amazon SageMaker Feature Store, generate a SageMaker pipeline, create a Data Wrangler job, or generate Python code. We will not use these capabilities now, but feel free to experiment with them.

## Data preparation at scale with SageMaker Processing

Now let's turn our attention to preparing the entire dataset. At 500 GB, it's too large to process with scikit-learn on a single EC2 instance. We will write a SageMaker Processing job that uses Spark ML for data preparation. (Alternatively, you can use Dask, but at the time of writing, SageMaker Processing does not provide a Dask container out of the box.)
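As a rough illustration of what a Spark ML preparation script of this kind might contain, the sketch below indexes and one-hot encodes categorical columns. The Parquet file format, the `--input`, `--output`, and `--categorical` arguments, and all function names are assumptions made for illustration; this is not the notebook's actual script.

```python
import argparse


def parse_args(argv):
    """Parse the (hypothetical) arguments passed to the preparation script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)    # prefix holding Parquet files (assumed format)
    parser.add_argument("--output", required=True)   # prefix for the prepared data
    parser.add_argument("--categorical", nargs="+", required=True)  # columns to encode
    return parser.parse_args(argv)


def build_encoding_stages(categorical_cols):
    """Return Spark ML stages that index, then one-hot encode, each column."""
    # Imported lazily so the argument parsing can be checked without Spark installed.
    from pyspark.ml.feature import OneHotEncoder, StringIndexer

    indexers = [
        StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid="keep")
        for c in categorical_cols
    ]
    encoder = OneHotEncoder(
        inputCols=[f"{c}_idx" for c in categorical_cols],
        outputCols=[f"{c}_vec" for c in categorical_cols],
    )
    return indexers + [encoder]


def main(argv):
    from pyspark.ml import Pipeline
    from pyspark.sql import SparkSession

    args = parse_args(argv)
    # The session must exist before any Spark ML stage is constructed.
    spark = SparkSession.builder.appName("preprocess").getOrCreate()
    df = spark.read.parquet(args.input)
    model = Pipeline(stages=build_encoding_stages(args.categorical)).fit(df)
    model.transform(df).write.mode("overwrite").parquet(args.output)
    spark.stop()
```

To use this as a job script, one would save it under a name of one's choosing and end the file with `import sys` followed by `main(sys.argv[1:])`. Spark distributes both the fitting of the indexers and the transformation across the cluster, which is what makes this approach viable at sizes where a single-machine encoder would struggle.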
The Processing Job part of this chapter's notebook walks you through launching the processing job. Note that we'll use a cluster of 15 EC2 instances to run the job:

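```python
from sagemaker.spark.processing import PySparkProcessor

# `role` is the IAM execution role, assumed to be defined earlier in the notebook.
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocessor",
    framework_version="3.0",
    role=role,
    instance_count=15,  # cluster of 15 instances
    instance_type="ml.m5.4xlarge",
    max_runtime_in_seconds=7200,  # stop the job after 2 hours
)
```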
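Launching the job then comes down to calling the processor's `run` method with the script to submit. The following is a minimal sketch, assuming a hypothetical script path, hypothetical `--input` and `--output` flags, and placeholder S3 URIs. The helper takes the processor as a parameter so that it can be exercised without AWS credentials.

```python
def launch_preprocessing(processor, script, input_uri, output_uri, extra_args=()):
    """Submit a PySpark script through a SageMaker Spark processor.

    The flag names and URIs are hypothetical; they must match whatever
    the submitted script actually expects.
    """
    arguments = ["--input", input_uri, "--output", output_uri, *extra_args]
    processor.run(
        submit_app=script,    # local path or S3 URI of the PySpark script
        arguments=arguments,  # forwarded to the script on the command line
        logs=False,           # do not stream container logs to the notebook
    )
    return arguments
```

With a processor such as the one defined above, a call might look like `launch_preprocessing(spark_processor, "preprocess.py", "s3://<bucket>/raw/", "s3://<bucket>/prepared/")`, where the bucket and prefixes are placeholders to replace with your own.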


While there are many ways to use these tools, we recommend using Data Wrangler for interactive exploration of small to mid-sized datasets. For processing large datasets in their entirety, switch to running processing jobs programmatically with the Spark framework to take advantage of parallel processing. (At the time of writing, Data Wrangler does not support running on multiple instances, but you can run a processing job on multiple instances.) You can always export a Data Wrangler flow as a starting point.

If your dataset is many terabytes, consider running a Spark job directly in Amazon EMR or AWS Glue and invoking SageMaker through the SageMaker Spark SDK. EMR and Glue have optimized Spark runtimes and more efficient integration with S3 storage.