# Enhancement of existing data

This notebook enriches the data previously inserted into the database by deriving additional features from it.

## Setup

First, the required modules are imported, the Spark session is initialized, and the data is read from the PostgreSQL database:

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, udf
from pyspark.sql.types import IntegerType, BooleanType

# PostgreSQL access data
host = "bda_gr4_database"
port = "5432"
database = "domainanalysis"
user = "postgres"
password = "postgres"

# PostgreSQL connection url
connection = f"jdbc:postgresql://{host}:{port}/{database}"
    
# Initialize a new session
spark = SparkSession.builder \
    .appName("Enhancement_DomainAnalysis") \
    .getOrCreate()

# Read data from the database
domains_df = spark.read \
                .format("jdbc") \
                .option("url", connection) \
                .option("dbtable", "domain") \
                .option("user", user) \
                .option("password", password) \
                .load()
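
The credentials above are hardcoded, which is convenient in a notebook but easy to leak. A minimal sketch of one alternative follows, assuming the connection settings are exposed as environment variables. The names `PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER` and `PGPASSWORD` follow the libpq convention, and the helper `jdbc_settings` is hypothetical and not part of this project. It falls back to the notebook's values when a variable is unset:

```python
import os

# Defaults mirror the values used in this notebook.
DEFAULTS = {
    "PGHOST": "bda_gr4_database",
    "PGPORT": "5432",
    "PGDATABASE": "domainanalysis",
    "PGUSER": "postgres",
    "PGPASSWORD": "postgres",
}


def jdbc_settings(env=None):
    """Return (url, user, password), preferring values from `env` over DEFAULTS."""
    env = os.environ if env is None else env
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    url = f"jdbc:postgresql://{cfg['PGHOST']}:{cfg['PGPORT']}/{cfg['PGDATABASE']}"
    return url, cfg["PGUSER"], cfg["PGPASSWORD"]


# Hypothetical drop-in for the hardcoded variables.
connection, user, password = jdbc_settings()
```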

## Enhancement 

The second step derives the new features. First, two UDFs are defined: one counts the elements of an array (treating `None` as `0`), and the other checks whether the `mx_record` array (if present) contains `localhost`.

In [None]:
# UDF for counting the elements in an array; None counts as 0
def count_arr(arr):
    return 0 if arr is None else len(arr)

count_arr_udf = udf(count_arr, IntegerType())

# UDF for checking whether "localhost" occurs in the array of MX records
def uses_localhost(mx_records):
    return mx_records is not None and 'localhost' in mx_records

uses_localhost_udf = udf(uses_localhost, BooleanType())
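
Spark passes a SQL `NULL` to a Python UDF as `None`, which is why both functions guard against it. As a quick sanity check, the plain Python logic can be exercised without a Spark session. The sample values below are made up for illustration:

```python
def count_arr(arr):
    # Treat a missing array (None) as having zero elements.
    return 0 if arr is None else len(arr)


def uses_localhost(mx_records):
    # True only if the array exists and contains the exact string "localhost".
    return mx_records is not None and "localhost" in mx_records


# Illustrative inputs, not taken from the database.
print(count_arr(None))                             # 0
print(count_arr(["192.0.2.1", "192.0.2.2"]))       # 2
print(uses_localhost(None))                        # False
print(uses_localhost(["localhost"]))               # True
print(uses_localhost(["mail.localhost.example"]))  # False: list membership is an exact match
```

Because the check is an exact list-membership test, an entry that merely contains `localhost` as a substring is not flagged.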

These functions can now be applied, storing their results as new columns in a new DataFrame. The original array columns are dropped afterwards:

In [None]:
domains_df_enhanced = domains_df \
    .withColumn("a_record_count", count_arr_udf("a_record")) \
    .withColumn("mx_record_count", count_arr_udf("mx_record")) \
    .withColumn("mx_uses_localhost", uses_localhost_udf("mx_record")) \
    .drop("mx_record") \
    .drop("a_record")

Furthermore, two new DataFrames are created that contain the ten most frequent A and MX record values together with their occurrence counts:

In [None]:
# Count the occurrences of each A record
a_record_count = domains_df.withColumn('a_record', explode(col('a_record'))) \
        .groupBy('a_record') \
        .count()

# Count the occurrences of each MX record
mx_record_count = domains_df.withColumn('mx_record', explode(col('mx_record'))) \
        .groupBy('mx_record') \
        .count()

# Finally, create new data frames only containing the top 10 A-/MX-records
a_record_count_top_ten_df = a_record_count.orderBy(['count'], ascending = [False]).limit(10)
mx_record_count_top_ten_df = mx_record_count.orderBy(['count'], ascending = [False]).limit(10)
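
To make the aggregation concrete, the following plain-Python sketch mirrors what `explode`, `groupBy`, `count`, `orderBy` and `limit` compute together. The sample rows and the helper `top_records` are hypothetical. Like `explode`, the helper skips rows whose array is `None` or empty:

```python
from collections import Counter


def top_records(rows, column, n=10):
    """Count every array element in `column` across `rows` and return the n most frequent."""
    counts = Counter()
    for row in rows:
        # explode() emits no rows for a NULL or empty array, so such rows add nothing.
        counts.update(row.get(column) or [])
    return counts.most_common(n)


# Made-up sample rows using documentation IP ranges.
rows = [
    {"a_record": ["192.0.2.1", "192.0.2.2"]},
    {"a_record": ["192.0.2.1"]},
    {"a_record": None},
    {"a_record": []},
]
print(top_records(rows, "a_record", n=1))  # [('192.0.2.1', 2)]
```

One difference worth keeping in mind: when several values share the same count at the cut-off, Spark does not guarantee which of them end up in the top ten.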

## Store the data

Finally, the data is stored. Because of the chosen event-driven architecture, the results are written to tables other than the one used in the previous script, so that events are emitted only for the newly inserted data.

In [None]:
a_record_count_top_ten_df.write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "a_record_count_global") \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()

mx_record_count_top_ten_df.write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "mx_record_count_global") \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()

domains_df_enhanced.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "domain_enhanced_based_on_existing_data") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()
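
The three writes differ only in the target table and, for the large table, the batch size. A minimal sketch of a helper that builds the shared options once is shown below. `jdbc_write_options` is a hypothetical name, and the resulting dict is meant to be passed to `DataFrameWriter.options(**opts)`:

```python
def jdbc_write_options(url, table, user, password, batchsize=None):
    """Build the option dict for a JDBC write; `batchsize` is added only if given."""
    opts = {"url": url, "dbtable": table, "user": user, "password": password}
    if batchsize is not None:
        # Convert to a string so all option values share one type.
        opts["batchsize"] = str(batchsize)
    return opts


opts = jdbc_write_options(
    "jdbc:postgresql://bda_gr4_database:5432/domainanalysis",
    "a_record_count_global",
    "postgres",
    "postgres",
)
# With a DataFrame at hand, the write would then read:
# a_record_count_top_ten_df.write.format("jdbc").options(**opts).mode("append").save()
```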