# Transform data by using Spark

This notebook transforms sales order data, converting it from CSV to Parquet format and splitting the customer name into two separate fields.

## Set variables

In [2]:
import uuid

# Variable for unique folder name
folderName = uuid.uuid4()


## Load source data

Let's start by loading some historical sales order data into a dataframe.

In [3]:
order_details = spark.read.csv('/data/*.csv', header=True, inferSchema=True)


## Transform the data structure

The source data includes a **CustomerName** field that contains the customer's first and last name. Modify the dataframe to split this field into separate **FirstName** and **LastName** fields.

In [4]:
from pyspark.sql.functions import split, col

# Create the new FirstName and LastName fields
transformed_df = (
    order_details
    .withColumn("FirstName", split(col("CustomerName"), " ").getItem(0))
    .withColumn("LastName", split(col("CustomerName"), " ").getItem(1))
)

# Remove the CustomerName field
transformed_df = transformed_df.drop("CustomerName")

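
To make the split logic concrete, here is a minimal plain-Python sketch (no Spark session required) that mirrors what the `split(...).getItem(0)` and `getItem(1)` calls do to a single value. The function name `split_customer_name` and the sample names are hypothetical and are not taken from the source data; returning `None` for a missing second token is an assumption made for illustration, and Spark's own behavior for an out-of-range index may depend on its configuration.

```python
def split_customer_name(customer_name):
    """Split a full name on single spaces and keep the first two tokens.

    Hypothetical helper for illustration only; it is not part of the notebook.
    """
    parts = customer_name.split(" ")
    first_name = parts[0]
    # Assumption: a missing second token is represented as None.
    last_name = parts[1] if len(parts) > 1 else None
    return first_name, last_name


# Hypothetical sample names (not from the sales order data).
for name in ["Ana Silva", "Mary Jo Smith", "Cher"]:
    print(name, "->", split_customer_name(name))
```

Note that a name with more than two space-separated parts keeps only its first two tokens, so anything after the second token is dropped. If the source data may contain such names, it is worth checking whether that is acceptable before removing **CustomerName**.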

## Save the transformed data

Now save the transformed dataframe in Parquet format to the folder named by the `folderName` variable, overwriting any data that already exists there.

In [5]:
transformed_df.write.mode("overwrite").parquet('/%s' % folderName)
print("Transformed data saved in %s!" % folderName)


Transformed data saved in 71b7a753-818c-4dda-b5b6-639a9ec9342b!
