# Data Transformations Using Spark in Synapse Analytics

This notebook transforms the trips data: it converts the data from CSV to Parquet format, splits the customer name into separate first-name and last-name fields, and adds a year column derived from the trip date.


In [None]:
## Set variables

import uuid

# Name of the target folder for the transformed data
target_folderName = "transformed"
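
The cell above imports `uuid` but assigns a fixed folder name. If each run should write to its own folder, one option is to append a random UUID, as in this minimal sketch. The helper name `make_unique_folder_name` and the base name are hypothetical and not part of the original notebook.

```python
import uuid

def make_unique_folder_name(base_name="transformed"):
    # uuid4() returns a random UUID; its string form is 36 characters long
    return "%s-%s" % (base_name, uuid.uuid4())

unique_folder_name = make_unique_folder_name()
print(unique_folder_name)
```

With a unique name per run, the overwrite mode used when saving has nothing to replace, so older folders accumulate until they are cleaned up.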

## Load source data

Let's start by loading trips data into a dataframe.

In [None]:
%%pyspark
trips_df = spark.read.load('/trips/*.csv', format='csv', header=True, inferSchema=True)
display(trips_df.limit(10))

## Transform the data structure

The source data includes a **CustomerName** field that contains the customer's first and last name. Modify the dataframe to split this field into **FirstName** and **LastName** fields.

In [None]:
from pyspark.sql.functions import split, col

# Create the new FirstName and LastName fields
name_parts = split(col("customerName"), " ")
trips_df = (trips_df
    .withColumn("FirstName", name_parts.getItem(0))
    .withColumn("LastName", name_parts.getItem(1)))

# Remove the CustomerName field
trips_df = trips_df.drop("customerName")
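
Note that `getItem(1)` picks only the second space-separated token, so a name with more than two parts loses everything after that token. The plain-Python sketch below imitates that behaviour on hypothetical sample names (not taken from the trips data) and contrasts it with splitting only at the first space. The function names are illustrative, and the `None` result for a missing token assumes Spark's default non-ANSI behaviour of returning null.

```python
def split_by_index(full_name):
    # Imitates split(...).getItem(0) and getItem(1): first and second token only
    parts = full_name.split(" ")
    first = parts[0]
    last = parts[1] if len(parts) > 1 else None  # assumed: Spark yields null here
    return first, last

def split_at_first_space(full_name):
    # Alternative: split once and keep the remainder intact as the last name
    parts = full_name.split(" ", 1)
    first = parts[0]
    last = parts[1] if len(parts) > 1 else None
    return first, last

# Hypothetical sample names
for name in ["Ana Lopez", "Mary Ann Smith", "Cher"]:
    print(name, split_by_index(name), split_at_first_space(name))
```

In Spark 3.0 and later, `split` accepts an optional `limit` argument, so `split(col("customerName"), " ", 2)` would be one way to get the second behaviour in the notebook itself.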

Next, modify the dataframe to add a **year** column extracted from the **tripDate** field.

In [None]:
from pyspark.sql.functions import substring

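# substring(str, pos, len): pos is 1-based and len is a length, not an end index.
# Take the 4 characters starting at position 7 (assumes the year occupies positions 7-10, e.g. dd/MM/yyyy).
trips_df = trips_df.withColumn('year', substring('tripDate', 7, 4))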
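
The third argument of `substring` is a length rather than an end position, and the start position is 1-based. The plain-Python sketch below imitates that behaviour (for positive start positions only) to show why a length of 4 is the safer choice for a four-digit year. The helper name and the sample values are hypothetical; the actual format of **tripDate** is an assumption here.

```python
def spark_like_substring(text, pos, length):
    # Imitates substring(str, pos, len) for positive pos: pos is 1-based
    start = pos - 1
    return text[start:start + length]

date_only = "15/01/2019"        # hypothetical dd/MM/yyyy value
date_time = "15/01/2019 10:30"  # hypothetical value with a time part

print(spark_like_substring(date_only, 7, 4))   # 2019
print(spark_like_substring(date_only, 7, 10))  # 2019 (only 4 characters remain)
print(spark_like_substring(date_time, 7, 10))  # 2019 10:30
```

A length of 10 happens to work when nothing follows the year, but it would pull in trailing characters if the field also carried a time.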

## Save the transformed data

Now save the transformed dataframe in Parquet format to the folder named by the `target_folderName` variable, overwriting any data that already exists there.

In [None]:
trips_df.write.mode("overwrite").parquet('/%s' % target_folderName)
print ("Transformed data saved in %s!" % target_folderName)