# 1. Validation of A and MX records

This notebook checks whether the `A` and `MX` records stored in the base data are correct.

## Setup

First, the required imports are made, the Spark session is initialized, and the data is read from the Postgres database:

In [None]:
# Imports
from pyspark.sql import SparkSession
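# Explicit imports avoid shadowing Python built-ins such as sum, min and max
from pyspark.sql.functions import col, explode, udf
from pyspark.sql.types import ArrayType, IntegerType, StringType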

# PostgreSQL access data
host = "bda_gr4_database"
port = "5432"
database = "domainanalysis"
user = "postgres"
password = "postgres"

# PostgreSQL connection url
connection = f"jdbc:postgresql://{host}:{port}/{database}"

# Create a Spark session
spark = SparkSession.builder \
    .appName("domain_analysis") \
    .getOrCreate()

# Read data from the database
domains_df = spark.read \
                .format("jdbc") \
                .option("url", connection) \
                .option("dbtable", "domain") \
                .option("user", user) \
                .option("password", password) \
                .load()

# Display the data frame
domains_df.limit(15).toPandas()
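
The access data above is hardcoded in the notebook. As a minimal sketch of an alternative, the same settings could be read from environment variables, with the notebook's values as fallbacks. The function name `build_jdbc_settings` is hypothetical, and the variable names `PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER` and `PGPASSWORD` are borrowed from the libpq convention as an assumption; Spark does not read them on its own.

```python
import os


def build_jdbc_settings(env=os.environ):
    """Build JDBC settings from environment variables (hypothetical helper).

    Falls back to the values used in this notebook when a variable is unset.
    """
    host = env.get("PGHOST", "bda_gr4_database")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "domainanalysis")
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "user": env.get("PGUSER", "postgres"),
        "password": env.get("PGPASSWORD", "postgres"),
    }


settings = build_jdbc_settings()
```

The resulting dictionary could then be passed to the reader with `.options(**settings)` in place of the three separate `.option(...)` calls for `url`, `user` and `password`.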

## Checking the records

Second, the helper functions are imported from the notebook `Functions.ipynb` so that they can be wrapped as `UDFs`:

In [None]:
# Import all functions from Functions.ipynb
from ipynb.fs.full.Functions import *

# Create the UDFs
udf_getARecords = udf(getARecords, ArrayType(StringType()))
udf_getARecords_error = udf(getARecords_error, IntegerType())
udf_getMXRecords = udf(getMXRecords, ArrayType(StringType()))
udf_getMXRecords_error = udf(getMXRecords_error, IntegerType())
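
The contents of `Functions.ipynb` are not shown here. To illustrate the shape such helpers need to have (one returning a list of strings, one returning an integer code), the following is a hedged sketch using only the standard library. The names `get_a_records_sketch` and `get_a_records_error_sketch` and the error codes 0 and 1 are assumptions, not the notebook's actual implementation. The sketch also goes through the system resolver rather than issuing a raw DNS `A` query, and an `MX` lookup would need a DNS library, since the standard library offers none.

```python
import socket


def get_a_records_sketch(domain, resolve=socket.gethostbyname_ex):
    """Return the sorted IPv4 addresses for `domain`, or [] if the lookup fails."""
    try:
        # gethostbyname_ex returns (hostname, alias_list, ip_address_list)
        _name, _aliases, addresses = resolve(domain)
    except (OSError, UnicodeError):
        # socket.gaierror and socket.herror are subclasses of OSError
        return []
    return sorted(addresses)


def get_a_records_error_sketch(domain, resolve=socket.gethostbyname_ex):
    """Return 0 if the lookup succeeds and 1 if it fails (assumed codes)."""
    try:
        resolve(domain)
    except (OSError, UnicodeError):
        return 1
    return 0
```

The `resolve` parameter exists only so the functions can be exercised without network access; when wrapped with `udf(...)`, the default resolver would be used.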

Next, the A and MX records are checked by calling the UDFs and storing the returned results in new columns:

In [None]:
# Create the new columns with the results
domains_checked_df = domains_df.withColumn("a_record_checked", udf_getARecords("top_level_domain"))
domains_checked_df = domains_checked_df.withColumn("a_record_checked_error", udf_getARecords_error("top_level_domain"))
domains_checked_df = domains_checked_df.withColumn("mx_record_checked", udf_getMXRecords("top_level_domain"))
domains_checked_df = domains_checked_df.withColumn("mx_record_checked_error", udf_getMXRecords_error("top_level_domain"))

In preparation for writing to the database, the columns of the data frame are put into the desired order:

In [None]:
# Reorder the columns of the data frame
domains_checked_df = domains_checked_df.select(
    "top_level_domain",
    "a_record", "a_record_checked", "a_record_checked_error",
    "mx_record", "mx_record_checked", "mx_record_checked_error",
)

Last but not least, we check whether the records are unchanged, partially changed, or completely changed compared with the original data:
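
One possible way to tell these three cases apart is to compare the original and the checked records as sets. The sketch below is an assumption about how the comparison could look, not part of the notebook; the function name `classify_change` and the three labels are hypothetical, and it presumes both values are lists of strings (or `None`).

```python
def classify_change(original, checked):
    """Classify how the checked records differ from the original ones."""
    original_set = set(original or [])
    checked_set = set(checked or [])
    if original_set == checked_set:
        return "unchanged"
    if original_set & checked_set:
        # At least one record survived, but the sets differ
        return "partially changed"
    return "completely changed"


print(classify_change(["192.0.2.1", "192.0.2.2"], ["192.0.2.2", "192.0.2.3"]))
```

Wrapped as a UDF with `StringType()` as the return type, such a function could be applied to `a_record` and `a_record_checked` (and likewise to the MX columns), followed by a `groupBy` on the resulting label to count how many domains fall into each case.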

In [None]:
# TODO: Add "How many match?"

In [None]:
# Count the occurrences of checked A-records
a_record_checked_count = domains_checked_df.withColumn('a_record_checked', explode(col('a_record_checked'))) \
        .groupBy('a_record_checked') \
        .count()

# Count the occurrences of checked MX-records
mx_record_checked_count = domains_checked_df.withColumn('mx_record_checked', explode(col('mx_record_checked'))) \
        .groupBy('mx_record_checked') \
        .count()

# Finally, create new data frames only containing the top 10 A-/MX-records
a_record_count_top_ten_df = a_record_checked_count.orderBy(['count'], ascending = [False]).limit(10)
mx_record_count_top_ten_df = mx_record_checked_count.orderBy(['count'], ascending = [False]).limit(10)
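
To make the explode, group and count steps concrete, here is a small plain-Python sketch of the same logic on made-up rows. The function name `top_records` and the sample data are purely illustrative assumptions. As with Spark's `explode`, rows whose array is empty or null contribute nothing to the counts.

```python
from collections import Counter

# Made-up sample rows using documentation-only addresses
rows = [
    {"top_level_domain": "a.example", "a_record_checked": ["192.0.2.1", "192.0.2.2"]},
    {"top_level_domain": "b.example", "a_record_checked": ["192.0.2.1"]},
    {"top_level_domain": "c.example", "a_record_checked": []},
    {"top_level_domain": "d.example", "a_record_checked": None},
]


def top_records(rows, column, n=10):
    """Flatten the array column, count each value and return the n most common."""
    counts = Counter(value for row in rows for value in (row[column] or []))
    return counts.most_common(n)


print(top_records(rows, "a_record_checked"))
# → [('192.0.2.1', 2), ('192.0.2.2', 1)]
```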

## Store the data

Finally, the original columns `a_record` and `mx_record` are removed from the data frame, and the first 15 rows are displayed as a check before writing to the database:

In [None]:
# Remove the original record columns
domains_checked_df = domains_checked_df.drop("a_record", "mx_record")
domains_checked_df.limit(15).toPandas()

In [None]:
# Write the data frame to the PostgreSQL database
domains_checked_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "domain_records_checked") \
    .option("batchsize", 10000) \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()