# 1. Implementation of an ETL process

This notebook runs an ETL process on the base data: it extracts the data from a CSV file, transforms it, and finally loads it into the PostgreSQL database.

## Extraction

First, the required modules are imported, the database access data is defined, the Spark session is initialised, and the data is read from the CSV file:

In [None]:
from pyspark.sql import SparkSession
# Import only the functions used in this notebook
from pyspark.sql.functions import regexp_replace, split, when

# PostgreSQL access data
host = "bda_gr4_database"
port = "5432"
database = "domainanalysis"
user = "postgres"
password = "postgres"
table = "domain"

# PostgreSQL connection url
connection = f"jdbc:postgresql://{host}:{port}/{database}"

# Create a Spark session
spark = SparkSession.builder \
    .appName("etl_domains") \
    .getOrCreate()

# Read csv file into Spark data frame
domains_df = spark.read.csv('../data/real_domains.csv', escape = "\"").toDF("top_level_domain", "mx_record", "a_record", "timestamp")

# Delete the timestamp column
domains_df = domains_df.drop('timestamp')

# Display the data frame
domains_df.show()
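
To make the role of the `escape` option easier to follow, here is a minimal sketch in plain Python (no Spark needed). It assumes that the record columns are stored as quoted, bracketed lists in which inner double quotes are doubled. The sample line and the helper `read_rows` are hypothetical and only serve as an illustration; the actual contents of `real_domains.csv` may differ.

```python
import csv
import io

# Hypothetical sample line (assumption); inner quotes are escaped by doubling them
sample = 'example.com,"[""mx1.example.com"",""mx2.example.com""]","[""192.0.2.1""]",2021-01-01\n'

# Same column names as passed to toDF() in the cell above
COLUMNS = ["top_level_domain", "mx_record", "a_record", "timestamp"]

def read_rows(text):
    """Parse CSV text and drop the timestamp column, mirroring the Spark cell."""
    reader = csv.reader(io.StringIO(text), quotechar='"', doublequote=True)
    rows = []
    for values in reader:
        row = dict(zip(COLUMNS, values))
        row.pop("timestamp", None)  # corresponds to domains_df.drop('timestamp')
        rows.append(row)
    return rows

rows = read_rows(sample)
print(rows[0])
```

After parsing, the record columns still contain brackets and quotes, which is why the transformation step below is needed.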

## Transformation

The second step is the transformation of the data frame. Each column is passed to a function that strips the special characters `[`, `]` and `"` from its values. Finally, empty strings are replaced by `None` (null), and the remaining MX and A records are split at the commas into arrays.

In [None]:
# Replace every match of the regex `to_delete` in `column` with `to_replace`
def clean_data(df, column, to_delete, to_replace):
    cleaned_df = df.withColumn(column, regexp_replace(column, to_delete, to_replace))
    return cleaned_df

In [None]:
# Save the column names
col_names = domains_df.schema.names

# Clean up each column
for column in col_names:
    domains_df = clean_data(domains_df, column, '\\[|\\]|\\"', "")

In [None]:
# Show first 5 rows
domains_df.head(5)

In [None]:
# Replace empty strings with None (null) and split the MX and A records into arrays
domains_df = domains_df \
    .withColumn('mx_record', when(domains_df['mx_record'] == '', None).otherwise(split(domains_df['mx_record'], ','))) \
    .withColumn('a_record', when(domains_df['a_record'] == '', None).otherwise(split(domains_df['a_record'], ',')))

# Display the data frame
domains_df.show()
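
The effect of the transformation on a single value can be sketched in plain Python. This is an illustration and not part of the Spark pipeline: the helpers `clean_value` and `to_records` are hypothetical, the sample values are assumptions, and the pattern is written to match the same characters as the one used in the notebook.

```python
import re

# Matches "[", "]" and double quotes, like the pattern passed to clean_data()
PATTERN = re.compile(r'\[|\]|"')

def clean_value(raw):
    """Remove brackets and double quotes from a raw string value."""
    return PATTERN.sub("", raw)

def to_records(raw):
    """Return None for an empty value, otherwise the list of comma-separated records."""
    cleaned = clean_value(raw)
    if cleaned == "":
        return None
    return cleaned.split(",")

# Hypothetical sample values
print(to_records('["mx1.example.com","mx2.example.com"]'))
print(to_records("[]"))
```

A value that consists only of brackets becomes an empty string after cleaning and therefore ends up as `None`, just like an originally empty value.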

## Loading

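The last step is to load the cleaned data frame into the PostgreSQL database. To speed up the write, the data frame is repartitioned into 8 partitions, which Spark can write in parallel, and `batchsize` is set to `10000` so that each round trip to the database inserts up to 10000 rows instead of the default 1000.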

In [None]:
# Write the data frame to the PostgreSQL database
domains_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()
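
To get a feeling for what these two settings mean, the following sketch estimates the number of rows and insert batches per partition. It assumes that the rows are spread evenly across the partitions. The function `plan_write` and the row count of one million are hypothetical and only serve as an illustration.

```python
import math

def plan_write(total_rows, num_partitions=8, batch_size=10000):
    """Estimate rows and insert batches per partition for an even distribution."""
    base, extra = divmod(total_rows, num_partitions)
    # The first `extra` partitions receive one additional row
    sizes = [base + (1 if i < extra else 0) for i in range(num_partitions)]
    # Each partition sends its rows in batches of at most batch_size rows
    batches = [math.ceil(size / batch_size) for size in sizes]
    return sizes, batches

# Hypothetical row count (assumption)
sizes, batches = plan_write(1_000_000)
print(sizes[0], batches[0])
```

With these assumed numbers, each of the 8 partitions holds 125000 rows and needs 13 batches. How much of this runs in parallel depends on the cores available to Spark and on what the database can handle.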