# Domain analysis

This notebook performs a domain analysis based on the data in `real_domains.csv`. First, it checks the correctness of the specified `A records` and `MX records` by determining the current A and MX records of each top-level domain with the Python package `dnspython`. In the next step, the `redirects` and `HTTP status codes` of the top-level domains are obtained. Finally, a further analysis of the MX records determines their locations and the corresponding ASNs, using the free IP geolocation databases from MaxMind.

## Setup

First, the required packages are installed and imported, the Spark session is initialized and the base data is read from the PostgreSQL database:

In [None]:
# Required installations
!pip install ipynb
!pip install dnspython
!pip install geoip2

# Required imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# PostgreSQL access data
host = "bda_gr4_database"
port = "5432"
database = "domainanalysis"
user = "postgres"
password = "postgres"

# PostgreSQL connection url
connection = f"jdbc:postgresql://{host}:{port}/{database}"

# Create a Spark session
spark = SparkSession.builder \
    .appName("domain_analysis") \
    .getOrCreate()

# Read data from the database
domains_df = spark.read \
                .format("jdbc") \
                .option("url", connection) \
                .option("dbtable", "domain") \
                .option("user", user) \
                .option("password", password) \
                .load()

# Display the data frame
domains_df.limit(15).toPandas()
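
The same JDBC options are repeated for every read and write in this notebook. A minimal sketch of a helper that builds them once is shown below. The function name `jdbc_options` and its parameters are assumptions made for illustration; the default values are the access data defined above.

```python
from typing import Dict, Optional


def jdbc_options(
    table: str,
    batchsize: Optional[int] = None,
    host: str = "bda_gr4_database",
    port: str = "5432",
    database: str = "domainanalysis",
    user: str = "postgres",
    password: str = "postgres",
) -> Dict[str, str]:
    """Build the option dictionary for a Spark JDBC read or write (hypothetical helper)."""
    options = {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }
    # batchsize only matters for writes, so it is added on request
    if batchsize is not None:
        options["batchsize"] = str(batchsize)
    return options


# Possible usage with the session created above (not executed here):
# domains_df = spark.read.format("jdbc").options(**jdbc_options("domain")).load()
```

Both `DataFrameReader` and `DataFrameWriter` accept keyword options through `.options(**...)`, so the same dictionary could serve the write cells further down.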

## 1. Validation of A and MX records

In this chapter, the functions are imported from the notebook `Functions.ipynb` and wrapped as `UDFs`. They determine the current `A records` and `MX records` of each `top_level_domain`. If the records cannot be determined, the returned error codes are stored in separate columns:

In [None]:
# Import the required functions from Functions.ipynb
from ipynb.fs.full.Functions import getARecords, getARecords_error, getMXRecords, getMXRecords_error

# Create the UDFs
udf_getARecords = udf(getARecords, ArrayType(StringType()))
udf_getARecords_error = udf(getARecords_error, IntegerType())
udf_getMXRecords = udf(getMXRecords, ArrayType(StringType()))
udf_getMXRecords_error = udf(getMXRecords_error, IntegerType())
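
The contents of `Functions.ipynb` are not shown in this notebook. As a rough idea of what such a lookup and its error function could look like with `dnspython`, here is a minimal sketch. All names (`lookup`, `get_a_records`, `get_a_records_error`) and the numeric error codes are assumptions, not the actual implementation. The resolver is passed in as a parameter so the logic can be tried without network access.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical error codes; the real values are defined in Functions.ipynb.
ERROR_CODES = {
    "NXDOMAIN": 1,         # domain does not exist
    "NoAnswer": 2,         # domain exists but has no record of this type
    "NoNameservers": 3,    # no name server could answer
    "Timeout": 4,          # query timed out
    "LifetimeTimeout": 4,  # timeout class raised by newer dnspython versions
}
UNKNOWN_ERROR = 99


def _dns_resolve(domain: str, record_type: str) -> Iterable:
    # Imported lazily so that the sketch loads without dnspython installed
    import dns.resolver
    return dns.resolver.resolve(domain, record_type)


def lookup(
    domain: str, record_type: str, resolve: Callable = _dns_resolve
) -> Tuple[Optional[List[str]], Optional[int]]:
    """Return (records, None) on success or (None, error_code) on failure."""
    try:
        answer = resolve(domain, record_type)
        return sorted(rdata.to_text() for rdata in answer), None
    except Exception as exc:
        # Matching on the class name keeps the sketch free of a hard dependency
        return None, ERROR_CODES.get(type(exc).__name__, UNKNOWN_ERROR)


def get_a_records(domain: str, resolve: Callable = _dns_resolve) -> Optional[List[str]]:
    return lookup(domain, "A", resolve)[0]


def get_a_records_error(domain: str, resolve: Callable = _dns_resolve) -> Optional[int]:
    return lookup(domain, "A", resolve)[1]
```

The return types line up with `ArrayType(StringType())` and `IntegerType()` used for the UDFs above. An MX variant would call `lookup(domain, "MX", ...)` in the same way.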

Next, the A and MX records are checked by calling the UDFs and storing the returned results in new columns:

In [None]:
# Create the new columns with the results
domains_checked_df = domains_df.withColumn("a_record_checked", udf_getARecords("top_level_domain")) \
                            .withColumn("a_record_checked_error", udf_getARecords_error("top_level_domain")) \
                            .withColumn("mx_record_checked", udf_getMXRecords("top_level_domain")) \
                            .withColumn("mx_record_checked_error", udf_getMXRecords_error("top_level_domain"))

In preparation for writing to the database, the columns of the data frame are arranged in the required order:

In [None]:
# Changing the order of the data frame
domains_checked_df = domains_checked_df.select("top_level_domain", "a_record", "a_record_checked", "a_record_checked_error", "mx_record", "mx_record_checked", "mx_record_checked_error")

Last but not least, the A and MX records are counted to see which ones occur most often. This is a first step towards checking whether the records are unchanged, partially changed or completely changed compared to the original data:

In [None]:
# Count the occurrences of each A record (one row per array element after explode)
a_record_checked_count = domains_df.withColumn('a_record_checked', explode(col('a_record'))) \
        .groupBy('a_record_checked') \
        .count()

# Count the occurrences of each MX record (one row per array element after explode)
mx_record_checked_count = domains_df.withColumn('mx_record_checked', explode(col('mx_record'))) \
        .groupBy('mx_record_checked') \
        .count()

# Finally, create new data frames only containing the top 10 A-/MX-records
a_record_count_top_ten_df = a_record_checked_count.orderBy(['count'], ascending = [False]).limit(10)
mx_record_count_top_ten_df = mx_record_checked_count.orderBy(['count'], ascending = [False]).limit(10)
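
The cell above counts records but does not classify each domain as unchanged, partially changed or completely changed. A minimal sketch of such a classification is given below. The function name `classify_change` and the three labels are assumptions, and it presumes that both the original and the checked column hold lists of strings.

```python
from typing import Iterable, Optional


def classify_change(
    original: Optional[Iterable[str]], checked: Optional[Iterable[str]]
) -> str:
    """Compare the stored records with the freshly resolved ones (hypothetical helper)."""
    before = set(original or [])
    after = set(checked or [])
    if before == after:
        return "unchanged"
    if before & after:
        # At least one record survived, but the sets differ
        return "partially changed"
    return "completely changed"


print(classify_change(["192.0.2.1", "192.0.2.2"], ["192.0.2.2", "192.0.2.3"]))
```

Wrapped as `udf(classify_change, StringType())`, it could be applied to the pairs `a_record` / `a_record_checked` and `mx_record` / `mx_record_checked` before the original columns are dropped.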

Then the columns `a_record` and `mx_record` are removed from the `domains_checked_df` data frame. The first 15 rows of the data frame and the two top-ten data frames are displayed as a check before writing to the database:

In [None]:
# Remove columns and show the data frame
domains_checked_df = domains_checked_df.drop("a_record").drop("mx_record")
domains_checked_df.limit(15).toPandas()

In [None]:
# Display the data frame
a_record_count_top_ten_df.limit(10).toPandas()

In [None]:
# Display the data frame
mx_record_count_top_ten_df.limit(10).toPandas()

After the generated data frames have been checked, they are written to the PostgreSQL database for data visualisation. To speed up writing the large `domains_checked_df` data frame, `repartition` and `batchsize` are specified for it:

In [None]:
# Write data frames to the PostgreSQL database
a_record_count_top_ten_df.write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "a_record_checked_count_global") \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()

mx_record_count_top_ten_df.write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "mx_record_checked_count_global") \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()

domains_checked_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "domain_records_checked") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()
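
To make the effect of `repartition(8)` and `batchsize` more concrete: Spark writes each partition over its own JDBC connection, and `batchsize` sets how many rows are sent per insert batch (the Spark default is 1000). The arithmetic below is a small sketch. The helper name and the row count of one million are assumptions for illustration, and it presumes that rows are spread evenly across partitions.

```python
import math


def batches_per_partition(total_rows: int, partitions: int = 8, batchsize: int = 10000) -> int:
    """Estimate how many insert batches each partition sends (even distribution assumed)."""
    rows_per_partition = math.ceil(total_rows / partitions)
    return math.ceil(rows_per_partition / batchsize)


# Hypothetical table with one million rows
print(batches_per_partition(1_000_000, partitions=8, batchsize=10000))
print(batches_per_partition(1_000_000, partitions=8, batchsize=1000))
```

With the settings used above, each of the 8 parallel writers would need far fewer round trips to the database than with the default batch size.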

## 2. Determination of redirects and HTTP status codes

This chapter determines the `redirects` and `HTTP status codes` of the top-level domains. First, the functions are imported from the notebook `Functions.ipynb` and wrapped as `UDFs`:

In [None]:
# Import from Functions.ipynb
from ipynb.fs.full.Functions import *

# Create the UDFs
udf_getRedirectUrl = udf(getRedirectUrl, StringType())
udf_getStatusCodeUrl = udf(getStatusCodeUrl, IntegerType())
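
How `getRedirectUrl` and `getStatusCodeUrl` work internally is defined in `Functions.ipynb` and not shown here. One possible approach, sketched with the standard library only, is to send a single `HEAD` request over plain HTTP and read the status code and the `Location` header without following the redirect. The names `head_request`, `get_status_code` and `get_redirect_url`, the port and the timeout are assumptions.

```python
import http.client
from typing import Optional, Tuple


def head_request(
    host: str, port: int = 80, timeout: float = 5.0
) -> Tuple[Optional[int], Optional[str]]:
    """Return (status_code, redirect_target); both are None if the host is unreachable."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        response = conn.getresponse()
        # The Location header is only present on redirects, otherwise None
        return response.status, response.getheader("Location")
    except (OSError, http.client.HTTPException):
        # Covers DNS failures, refused connections and timeouts
        return None, None
    finally:
        conn.close()


def get_status_code(host: str, port: int = 80) -> Optional[int]:
    return head_request(host, port)[0]


def get_redirect_url(host: str, port: int = 80) -> Optional[str]:
    return head_request(host, port)[1]
```

The return types match `IntegerType()` and `StringType()` used for the UDFs. Note that this sketch issues one request per function call, so calling both functions for the same domain contacts the server twice.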

In this block, the `redirections` and the `HTTP status codes` are determined by calling the UDFs with the column `top_level_domain` as the parameter. The results are stored in the new columns `redirection` and `status_code`:

In [None]:
# Create the new columns with the results
domains_redirect_df = domains_df.withColumn("redirection", udf_getRedirectUrl("top_level_domain")) \
                                .withColumn("status_code", udf_getStatusCodeUrl("top_level_domain"))

Before the data is stored in the PostgreSQL database, the `a_record` and `mx_record` columns are removed from the `domains_redirect_df` data frame and the first 15 rows are displayed as a check:

In [None]:
# Remove columns and show the data frame
domains_redirect_df = domains_redirect_df.drop("a_record").drop("mx_record")
domains_redirect_df.limit(15).toPandas()

In [None]:
# Write the data frame to the PostgreSQL database
domains_redirect_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "domain_redirection") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()

## 3. MX record location and ASN determination

In this section, the `locations` and `providers` (based on the ASN) are identified for the MX records that have already been checked. For each MX record, the ISO country code, city, postal code, latitude, longitude and the organisation of the autonomous system are determined:

In [None]:
# Read data from the database
domain_records_checked_df = spark.read \
                .format("jdbc") \
                .option("url", connection) \
                .option("dbtable", "domain_records_checked") \
                .option("user", user) \
                .option("password", password) \
                .load()

# Drop columns
domain_records_checked_df = domain_records_checked_df.drop("a_record_checked").drop("a_record_checked_error").drop("mx_record_checked_error")

# Display the data frame
domain_records_checked_df.limit(20).toPandas()

Next, the functions needed for the `UDFs` are imported from the notebook `Functions.ipynb`. After that, `StructType` schemas made up of `StructField` entries are created to define the return types:

In [None]:
# Import all functions from Functions.ipynb
from ipynb.fs.full.Functions import *

schema_location = StructType([
    StructField("iso_code", StringType(), True),
    StructField("city", StringType(), True),
    StructField("postal", StringType(), True),
    StructField("latitude", StringType(), True),
    StructField("longitude", StringType(), True)])

schema_asn = StructType([
    StructField("autonomous_system_organization", StringType(), True)])

udf_getARecords = udf(getARecords, ArrayType(StringType()))
udf_getGeoLite2_Location = udf(getGeoLite2_Location, schema_location)
udf_getGeoLite2_ASN = udf(getGeoLite2_ASN, schema_asn)
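
The lookup functions themselves live in `Functions.ipynb`. As an illustration of how they might fill the two schemas with the `geoip2` package, here is a minimal sketch. The names `get_location` and `get_asn` are assumptions, and the reader object is passed in explicitly so the logic can be tried without the database files.

```python
from typing import Optional, Tuple

try:
    from geoip2.errors import AddressNotFoundError
except ImportError:
    # Fallback so the sketch can be read and tried without geoip2 installed
    class AddressNotFoundError(Exception):
        pass


def _to_str(value) -> Optional[str]:
    # The schema declares latitude and longitude as strings
    return None if value is None else str(value)


def get_location(ip: str, reader) -> Tuple[Optional[str], ...]:
    """Return (iso_code, city, postal, latitude, longitude) in schema order."""
    try:
        response = reader.city(ip)
    except (AddressNotFoundError, ValueError):
        # Address not in the database, or not a valid IP address
        return (None, None, None, None, None)
    return (
        response.country.iso_code,
        response.city.name,
        response.postal.code,
        _to_str(response.location.latitude),
        _to_str(response.location.longitude),
    )


def get_asn(ip: str, reader) -> Tuple[Optional[str]]:
    """Return a one-element tuple matching schema_asn."""
    try:
        response = reader.asn(ip)
    except (AddressNotFoundError, ValueError):
        return (None,)
    return (response.autonomous_system_organization,)


# Possible usage (file names are hypothetical, not executed here):
# import geoip2.database
# city_reader = geoip2.database.Reader("GeoLite2-City.mmdb")
# asn_reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")
# get_location("203.0.113.5", city_reader)
```

Inside a Spark UDF the reader would have to be opened on the executors rather than passed in from the driver; how the original functions handle this is not visible from this notebook.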


To perform the location and provider lookup, the IP addresses of the MX records must first be resolved. Based on these, the `locations` and the `providers` are determined and the data frame `domains_mx_record_geolite2_df` is generated:

In [None]:
# Create the new columns with the results
domains_mx_record_geolite2_df = domain_records_checked_df.select(domain_records_checked_df.top_level_domain,explode(domain_records_checked_df.mx_record_checked).alias('mx_record_checked'))
domains_mx_record_geolite2_df = domains_mx_record_geolite2_df.withColumn("mx_record_ip", udf_getARecords("mx_record_checked")) \
                            .withColumn('mx_record_ip', explode(col('mx_record_ip'))) \
                            .withColumn("location", udf_getGeoLite2_Location("mx_record_ip")) \
                            .withColumn("asn", udf_getGeoLite2_ASN("mx_record_ip")) \
                            .select("top_level_domain", "mx_record_checked", "mx_record_ip", "location.*", "asn.*")
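
The two `explode` calls turn one row per domain into one row per MX host and then one row per IP address. The plain-Python sketch below mirrors that flattening on a small dictionary. The function name `explode_mx_rows` and the sample data are assumptions for illustration.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple


def explode_mx_rows(
    mx_by_domain: Dict[str, Optional[List[str]]],
    resolve_ips: Callable[[str], Optional[List[str]]],
) -> Iterator[Tuple[str, str, str]]:
    """Yield one (domain, mx_host, ip) row per resolved IP address."""
    for domain, mx_hosts in mx_by_domain.items():
        # Like explode, a missing or empty list produces no rows at all
        for mx_host in mx_hosts or []:
            for ip in resolve_ips(mx_host) or []:
                yield (domain, mx_host, ip)


# Hypothetical sample data
sample = {"example.org": ["mx1.example.org", "mx2.example.org"], "no-mail.example": None}
ips = {"mx1.example.org": ["192.0.2.10", "192.0.2.11"], "mx2.example.org": []}

for row in explode_mx_rows(sample, lambda host: ips.get(host)):
    print(row)
```

As in the sketch, `explode` drops rows whose array is null or empty, so domains without MX records and MX hosts that do not resolve disappear from the result. If those rows should be kept with null values, `explode_outer` from `pyspark.sql.functions` could be used instead.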

Before the data frame is written to the PostgreSQL database, its first 15 rows are displayed as a check:

In [None]:
# Display the data frame
domains_mx_record_geolite2_df.limit(15).toPandas()

In [None]:
# Write the data frame to the PostgreSQL database
domains_mx_record_geolite2_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "domain_mx_record_geolite2") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()