This notebook reads the raw (bronze) Delta table for the Titanic dataset, fixes the column data types, engineers one new feature, and saves the cleaned result as a silver Delta table.

In [0]:
from pyspark.sql.types import IntegerType, DoubleType

In [0]:
#   Read the raw table and preview the first 10 rows
titanic_df = spark.table('bronze.raw_titanic')
titanic_df.limit(10).show()

In [0]:
#   The target column, Survived, has no null values, so no rows need to be removed
titanic_df.select('Survived').distinct().show()
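
The `distinct()` output is checked by eye. As a minimal sketch, a null count expresses the same check as a single number. `count_nulls` is a hypothetical helper, not part of the original notebook; it assumes only the DataFrame `filter` and `count` methods and the Column `isNull` method.

```python
def count_nulls(df, col_name):
    """Return the number of rows where `col_name` is null.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame.
    """
    # Keep only rows where the column is null, then count them
    return df.filter(df[col_name].isNull()).count()
```

In the notebook, `count_nulls(titanic_df, 'Survived')` would be expected to return 0 if the target column really has no nulls.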

## Fix data types

In [0]:
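#   All columns are currently strings, but several should be integers.
#   Caveat: if the raw Age values include fractions (e.g. '28.5'), an integer
#   cast does not preserve them; cast Age to double instead if that matters.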
int_columns = ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch']
for col_name in int_columns:
    titanic_df = titanic_df.withColumn(col_name, titanic_df[col_name].cast(IntegerType()))

In [0]:
#   Finally, Fare should be a double
titanic_df = titanic_df.withColumn('Fare', titanic_df['Fare'].cast(DoubleType()))
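
The casts in this section could also be driven by a single column-to-type mapping, which keeps the target schema in one place. This is only a sketch: `cast_columns` and `TITANIC_TYPES` are hypothetical names, and the mapping simply mirrors the types chosen in the cells of this section. It relies on `Column.cast` accepting a type-name string such as `"int"` or `"double"`.

```python
# Hypothetical mapping that mirrors the casts used in this notebook
TITANIC_TYPES = {
    "PassengerId": "int",
    "Survived": "int",
    "Pclass": "int",
    "Age": "int",
    "SibSp": "int",
    "Parch": "int",
    "Fare": "double",
}


def cast_columns(df, type_map):
    """Return `df` with every column in `type_map` cast to the named type.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame.
    """
    for col_name, type_name in type_map.items():
        # Replace the column with a version cast to the requested type
        df = df.withColumn(col_name, df[col_name].cast(type_name))
    return df
```

In the notebook this would be called as `titanic_df = cast_columns(titanic_df, TITANIC_TYPES)`, followed by `titanic_df.printSchema()` to confirm the result.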

## Engineer new features

Only one engineered feature is added here: the number of family members each passenger has aboard.

In [0]:
#   Number of family members aboard (not including the individual)
titanic_df = titanic_df.withColumn('FamilyAboard', titanic_df['SibSp'] + titanic_df['Parch'])
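
To make the row-level arithmetic concrete, here is a plain-Python illustration of what the column expression computes for one passenger. It is not Spark code, and `family_aboard` is a hypothetical name. It also mirrors the Spark SQL rule that arithmetic with a null operand yields null.

```python
def family_aboard(sibsp, parch):
    """Plain-Python illustration of SibSp + Parch for a single row."""
    # As in Spark SQL, a null (None) operand makes the result null
    if sibsp is None or parch is None:
        return None
    return sibsp + parch


# A passenger with 1 sibling/spouse and 2 parents/children aboard
print(family_aboard(1, 2))     # 3
# A missing count propagates as a missing result
print(family_aboard(None, 2))  # None
```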

## Save Delta table

In [0]:
#   Create the silver schema if it does not already exist, then write the Delta table
spark.sql("CREATE DATABASE IF NOT EXISTS silver")
(
    titanic_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("silver.titanic")
)
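
After the write, a quick read-back can confirm that the saved table contains the columns you expect. This is a minimal sketch under assumptions: `missing_columns` is a hypothetical helper, and it uses only `spark.table(...)` and the DataFrame `columns` attribute.

```python
def missing_columns(spark, table_name, expected):
    """Return the names in `expected` that are absent from the saved table.

    Hypothetical helper: `spark` is assumed to be the active SparkSession.
    """
    saved_cols = set(spark.table(table_name).columns)
    return [name for name in expected if name not in saved_cols]
```

In the notebook, `missing_columns(spark, "silver.titanic", ["Survived", "Fare", "FamilyAboard"])` would be expected to return an empty list.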