# Transform data by using Spark

This notebook transforms sales order data, converting it from CSV to Parquet format and splitting the customer name into two separate fields.

## Set variables

In [None]:
import uuid

# Variable for unique folder name
folderName = uuid.uuid4()
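
To see what kind of path this variable produces when it is later formatted into a string, here is a small sketch in plain Python. The names `folder_name` and `output_path` are hypothetical and are not used elsewhere in this notebook.

```python
import uuid

# Generate a random (version 4) UUID, as the cell above does
folder_name = uuid.uuid4()

# %s formatting calls str() on the UUID, which gives the
# 36-character hyphenated form; the leading slash makes it a root-level path
output_path = '/%s' % folder_name  # hypothetical variable, for illustration only

print(output_path)
```

Because the UUID is random, each run of the notebook writes to a different folder.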

## Load source data

Let's start by loading some historical sales order data into a dataframe.

In [None]:
order_details = spark.read.csv('/data/*.csv', header=True, inferSchema=True)

## Transform the data structure

The source data includes a **CustomerName** field that contains the customer's first and last name. Modify the dataframe to split this field into separate **FirstName** and **LastName** fields.

In [None]:
from pyspark.sql.functions import split, col

# Create the new FirstName and LastName fields
name_parts = split(col("CustomerName"), " ")
transformed_df = (
    order_details
    .withColumn("FirstName", name_parts.getItem(0))
    .withColumn("LastName", name_parts.getItem(1))
)

# Remove the CustomerName field
transformed_df = transformed_df.drop("CustomerName")
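
The split keeps only the first two space-separated tokens, so it is worth checking how it treats names that are not exactly two words. The following is a rough plain-Python sketch of the same per-value logic, not Spark code: `split_name` is a hypothetical helper, the sample names are made up, and `None` stands in for a missing second token (how Spark itself reports an out-of-range index can depend on its ANSI settings).

```python
def split_name(customer_name):
    """Mimic split(col, " ").getItem(0) and .getItem(1) for a single value."""
    parts = customer_name.split(" ")
    first = parts[0]
    # Only the second token is taken; any later tokens are discarded.
    # None is an assumption standing in for a missing value.
    last = parts[1] if len(parts) > 1 else None
    return first, last


# Made-up sample names covering two, three, and one token
for name in ["Joan Coleman", "Mary Ann Smith", "Cher"]:
    print(name, "->", split_name(name))
```

If the real data contains middle names or single-word names, the **LastName** values produced by this approach may need a different rule.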

## Save the transformed data

Now save the transformed dataframe in Parquet format in the folder named by the `folderName` variable, overwriting any data that already exists there.

In [None]:
transformed_df.write.mode("overwrite").parquet('/%s' % folderName)
print("Transformed data saved in %s!" % folderName)