# Domain analysis

This notebook performs a domain analysis based on the data in `real_domains.csv`. First, it checks the correctness of the specified `A records` and `MX records` by determining the current A and MX records of each domain in the `top_level_domain` column. In the next step, the `redirects` and `HTTP status codes` of these domains are obtained. The DNS lookups use the Python package `dnspython`. In addition, a further analysis of the MX records determines their locations and the corresponding ASNs, using the free IP geolocation databases from MaxMind.

## Setup

First, the required packages are installed and imported, the Spark session is initialized and the base data is read from the Postgres database:

In [1]:
# Required installations
!pip install ipynb
!pip install dnspython
!pip install geoip2

# Required imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# PostgreSQL access data
host = "bda_gr4_database"
port = "5432"
database = "domainanalysis"
user = "postgres"
password = "postgres"

# PostgreSQL connection url
connection = f"jdbc:postgresql://{host}:{port}/{database}"

# Create a Spark session
spark = SparkSession.builder \
    .appName("domain_analysis") \
    .getOrCreate()

# Read data from the database
domains_df = spark.read \
                .format("jdbc") \
                .option("url", connection) \
                .option("dbtable", "domain") \
                .option("user", user) \
                .option("password", password) \
                .load()

# Display the data frame
domains_df.limit(15).toPandas()

Collecting ipynb
  Downloading ipynb-0.5.1-py3-none-any.whl (6.9 kB)
Installing collected packages: ipynb
Successfully installed ipynb-0.5.1
Collecting dnspython
  Downloading dnspython-2.1.0-py3-none-any.whl (241 kB)
Installing collected packages: dnspython
Successfully installed dnspython-2.1.0
Collecting geoip2
  Downloading geoip2-4.2.0-py2.py3-none-any.whl (25 kB)
Collecting aiohttp<4.0.0,>=3.6.2
  Downloading aiohttp-3.7.4.post0-cp39-cp39-manylinux2014_x86_64.whl (1.4 MB)
Collecting maxminddb<3.0.0,>=2.0.0
  Downloading maxminddb-2.0.3.tar.gz (286 kB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-5.1.0-cp39-cp39-manylinux2014_x86_64.whl (151 kB)

21/07/03 17:47:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/07/03 17:47:55 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
21/07/03 17:47:55 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


Unnamed: 0,top_level_domain,mx_record,a_record
0,0-5-1.de,"[smtp-02.tld.t-online.de, smtp-01.tld.t-online...",[80.150.6.143]
1,0-24versicherung.de,,
2,0-1.de,[localhost],[91.195.241.137]
3,0-3.de,"[teller.ggeg.eu, smtp.ggeg.eu]",[185.163.116.240]
4,0-263475.de,,[176.9.76.101]
5,0-apps.de,[mx.leo.org],[80.190.158.32]
6,0-32.de,[smtpin.rzone.de],[81.169.145.66]
7,0-1000.de,[localhost],[91.195.241.137]
8,0-100kmh.de,[smtpin.rzone.de],[81.169.145.95]
9,0-2.de,[localhost],[127.0.0.1]


## 1. Validation of A and MX records

In this chapter, the functions from the notebook `Functions.ipynb` are imported and wrapped as `UDFs`. They determine the current `A records` and `MX records` for each entry in `top_level_domain`. If the records cannot be determined, the returned error codes are stored in separate columns:

In [2]:
# Import the lookup functions from Functions.ipynb
from ipynb.fs.full.Functions import getARecords, getARecords_error, getMXRecords, getMXRecords_error

# Create the UDFs
udf_getARecords = udf(getARecords, ArrayType(StringType()))
udf_getARecords_error = udf(getARecords_error, IntegerType())
udf_getMXRecords = udf(getMXRecords, ArrayType(StringType()))
udf_getMXRecords_error = udf(getMXRecords_error, IntegerType())

Next, the A and MX records are checked by calling the UDFs and storing the returned results in new columns:

In [3]:
# Create the new columns with the results
domains_checked_df = domains_df.withColumn("a_record_checked", udf_getARecords("top_level_domain")) \
                            .withColumn("a_record_checked_error", udf_getARecords_error("top_level_domain")) \
                            .withColumn("mx_record_checked", udf_getMXRecords("top_level_domain")) \
                            .withColumn("mx_record_checked_error", udf_getMXRecords_error("top_level_domain"))

In preparation for writing to the database, the data frame is put into the correct order:

In [4]:
# Changing the order of the data frame
domains_checked_df = domains_checked_df.select("top_level_domain", "a_record", "a_record_checked", "a_record_checked_error", "mx_record", "mx_record_checked", "mx_record_checked_error")

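Last but not least, the records are to be checked for whether they are unchanged, partially changed or completely changed compared with the original data. The following cell counts how often each A and MX record occurs and keeps the ten most frequent ones. Note that the counts are taken from the original `a_record` and `mx_record` columns of `domains_df`, although the result columns are named `a_record_checked` and `mx_record_checked`: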

In [5]:
# Count the occurrences of the A records (one row per record after explode)
a_record_checked_count = domains_df.withColumn('a_record_checked', explode(col('a_record'))) \
        .groupBy('a_record_checked') \
        .count()

# Count the occurrences of the MX records (one row per record after explode)
mx_record_checked_count = domains_df.withColumn('mx_record_checked', explode(col('mx_record'))) \
        .groupBy('mx_record_checked') \
        .count()

# Finally, create new data frames only containing the top 10 A-/MX-records
a_record_count_top_ten_df = a_record_checked_count.orderBy(['count'], ascending = [False]).limit(10)
mx_record_count_top_ten_df = mx_record_checked_count.orderBy(['count'], ascending = [False]).limit(10)
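
The comparison itself (unchanged, partially changed, completely changed) is not part of the cell above. A minimal sketch of how it might be done per domain in plain Python follows. The function names `normalise` and `classify_change` are hypothetical, and ignoring case and the trailing dot of fully qualified names is an assumption about what should count as equal.

```python
def normalise(records):
    """Lower-case the records and drop the trailing dot of fully qualified names."""
    return {r.rstrip(".").lower() for r in (records or [])}


def classify_change(original, checked):
    """Compare two record lists and describe how they differ."""
    before, after = normalise(original), normalise(checked)
    if before == after:
        return "unchanged"
    if before & after:
        # At least one record survived, at least one differs
        return "partially changed"
    return "completely changed"


# Values taken from the sample output of this notebook
print(classify_change(["80.150.6.143"], ["80.150.6.143"]))
print(classify_change(["teller.ggeg.eu", "smtp.ggeg.eu"],
                      ["teller.ggeg.eu.", "smtp.ggeg.eu."]))
print(classify_change(["91.195.241.137"], ["64.190.62.111"]))
```

Wrapped with `udf(classify_change, StringType())`, such a function could be applied to the column pairs `a_record`/`a_record_checked` and `mx_record`/`mx_record_checked` before the original columns are dropped.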

Finally, the columns `a_record` and `mx_record` are removed from the `domains_checked_df` data frame, and the first 15 rows are displayed as a check before writing to the database:

In [6]:
# Remove columns and show the data frame
domains_checked_df = domains_checked_df.drop("a_record").drop("mx_record")
domains_checked_df.limit(15).toPandas()

                                                                                

Unnamed: 0,top_level_domain,a_record_checked,a_record_checked_error,mx_record_checked,mx_record_checked_error
0,0-5-1.de,[80.150.6.143],0,"[smtp-01.tld.t-online.de., smtp-02.tld.t-onlin...",0
1,0-24versicherung.de,,1,,1
2,0-1.de,[64.190.62.111],0,[localhost.],0
3,0-3.de,[185.163.116.240],0,"[teller.ggeg.eu., smtp.ggeg.eu.]",0
4,0-263475.de,[176.9.76.101],0,,2
5,0-apps.de,[80.190.158.32],0,[mx.leo.org.],0
6,0-32.de,[81.169.145.66],0,[smtpin.rzone.de.],0
7,0-1000.de,[64.190.62.111],0,[localhost.],0
8,0-100kmh.de,[81.169.145.162],0,[smtpin.rzone.de.],0
9,0-2.de,[52.58.78.16],0,,2


In [7]:
# Display the data frame
a_record_count_top_ten_df.limit(10).toPandas()

                                                                                

Unnamed: 0,a_record_checked,count
0,91.195.241.137,6
1,80.150.6.143,3
2,134.209.177.247,1
3,213.239.200.93,1
4,172.67.208.86,1
5,85.214.69.88,1
6,85.17.4.21,1
7,185.53.178.13,1
8,81.169.145.66,1
9,104.24.103.72,1


In [8]:
# Display the data frame
mx_record_count_top_ten_df.limit(10).toPandas()

Unnamed: 0,mx_record_checked,count
0,localhost,7
1,smtp-01.tld.t-online.de,3
2,smtp-02.tld.t-online.de,3
3,mxf993.netcup.net,2
4,smtpin.rzone.de,2
5,mail.0-500.de,1
6,mail.0--2.de,1
7,smtp.ggeg.eu,1
8,mxlb.ispgateway.de,1
9,smx.bestcpanel.eu,1


After the generated data frames have been checked, they are written to the PostgreSQL database for data visualisation. To speed up the write of `domains_checked_df`, `repartition` and `batchsize` are specified for it:

In [9]:
# Write data frames to the PostgreSQL database
a_record_count_top_ten_df.write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "a_record_checked_count_global") \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()

mx_record_count_top_ten_df.write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "mx_record_checked_count_global") \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()

domains_checked_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "domain_records_checked") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()
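
The write cells in this notebook repeat the same JDBC options. As a sketch only, the options could be collected in a small helper. `jdbc_options` and `write_table` are hypothetical names that exist neither in the notebook nor in Spark, whereas `DataFrameWriter.options` and the `batchsize` option belong to Spark's JDBC data source.

```python
def jdbc_options(url, table, user, password, batchsize=None):
    """Collect the JDBC options that every write cell repeats."""
    options = {"url": url, "dbtable": table, "user": user, "password": password}
    if batchsize is not None:
        # Number of rows sent to the database per round trip
        options["batchsize"] = str(batchsize)
    return options


def write_table(df, options, partitions=None):
    """Append a Spark data frame to the database, optionally repartitioning first."""
    if partitions is not None:
        # Partitions are written in parallel, each over its own connection
        df = df.repartition(partitions)
    df.write.format("jdbc").options(**options).mode("append").save()


opts = jdbc_options(
    "jdbc:postgresql://bda_gr4_database:5432/domainanalysis",
    "domain_records_checked", "postgres", "postgres", batchsize=10000,
)
print(opts)
# In the notebook: write_table(domains_checked_df, opts, partitions=8)
```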

                                                                                

## 2. Determination of redirects and HTTP status codes

This chapter determines the `redirects` and `HTTP status codes` of the top-level domains. First, the functions are imported from the notebook `Functions.ipynb` and wrapped as `UDFs`:

In [10]:
# Import all functions from Functions.ipynb
from ipynb.fs.full.Functions import *

# Create the UDFs
udf_getRedirectUrl = udf(getRedirectUrl, StringType())
udf_getStatusCodeUrl = udf(getStatusCodeUrl, IntegerType())

In this block, the `redirections` and the `HTTP status codes` are determined by calling the UDFs with the column `top_level_domain` as parameter. The results are stored in new columns:

In [11]:
# Create the new columns with the results
domains_redirect_df = domains_df.withColumn("redirection", udf_getRedirectUrl("top_level_domain")) \
                                .withColumn("status_code", udf_getStatusCodeUrl("top_level_domain"))
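
The implementations of `getRedirectUrl` and `getStatusCodeUrl` live in `Functions.ipynb` and are not shown here. The following is only a sketch of how a redirect target and a status code can be obtained with the Python standard library. The names `host_from_url` and `fetch_redirect_and_status`, the plain `http://` scheme and the timeout are assumptions, not the notebook's actual method.

```python
import http.client
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def host_from_url(url):
    """Return the host part of a URL, e.g. 'example.org' for 'https://example.org/x'."""
    return urlsplit(url).hostname


def fetch_redirect_and_status(domain, timeout=5):
    """Request http://<domain> and return (final host, status code) or (None, None)."""
    try:
        with urllib.request.urlopen(f"http://{domain}", timeout=timeout) as response:
            # urlopen follows redirects, so geturl() is the final address
            return host_from_url(response.geturl()), response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx answers still carry the requested URL and a status code
        return host_from_url(error.filename), error.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout or malformed response
        return None, None


print(host_from_url("https://heijtech.de/index.html"))
```

Returning both values from one request avoids contacting each domain twice, once for the redirect and once for the status code.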

For storage in the PostgreSQL database, the `a_record` and `mx_record` columns are removed from the `domains_redirect_df` data frame, and the first 15 rows are displayed as a check before writing to the database:

In [12]:
# Remove columns and show the data frame
domains_redirect_df = domains_redirect_df.drop("a_record").drop("mx_record")
domains_redirect_df.limit(15).toPandas()

                                                                                

Unnamed: 0,top_level_domain,redirection,status_code
0,0-5-1.de,heijtech.de,403.0
1,0-24versicherung.de,,
2,0-1.de,0-1.de,403.0
3,0-3.de,,
4,0-263475.de,0-263475.de,200.0
5,0-apps.de,0-apps.de,404.0
6,0-32.de,0-32.de,200.0
7,0-1000.de,0-1000.de,403.0
8,0-100kmh.de,,
9,0-2.de,,


In [13]:
domains_redirect_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "domain_redirection") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()

                                                                                

## 3. MX record location and ASN determination

In this section, the `locations` and `providers` of the already checked MX records are identified, the providers via the ASN. For each MX record, the ISO country code, city, postal code, latitude, longitude and the organisation of the autonomous system are presented:

In [14]:
# Read data from the database
domain_records_checked_df = spark.read \
                .format("jdbc") \
                .option("url", connection) \
                .option("dbtable", "domain_records_checked") \
                .option("user", user) \
                .option("password", password) \
                .load()

# Drop columns
domain_records_checked_df = domain_records_checked_df.drop("a_record_checked").drop("a_record_checked_error").drop("mx_record_checked_error")

# Display the data frame
domain_records_checked_df.limit(20).toPandas()

Unnamed: 0,top_level_domain,mx_record_checked
0,0-0-0-1.de,"[smtp-01.tld.t-online.de., smtp-02.tld.t-onlin..."
1,0--2.de,"[mxf993.netcup.net., mail.0--2.de.]"
2,0-12.de,[localhost.]
3,0-24h.de,[mxlb.ispgateway.de.]
4,0-9a-z.de,"[mail.0-9a-z.de., mx2fb1.netcup.net.]"
5,0-263475.de,
6,0-100kmh.de,[smtpin.rzone.de.]
7,0-17.de,[localhost.]
8,0-2.de,
9,0-app.de,[mail.0-app.de.]


Next, the functions from the function notebook `Functions.ipynb` that are to be used for the definition of the `UDFs` are imported. After that, StructTypes of StructFields are created to define the return types:

In [15]:
# Import all functions from Functions.ipynb
from ipynb.fs.full.Functions import *

schema_location = StructType([
    StructField("iso_code", StringType(), True),
    StructField("city", StringType(), True),
    StructField("postal", StringType(), True),
    StructField("latitude", StringType(), True),
    StructField("longitude", StringType(), True)])

schema_asn = StructType([
    StructField("autonomous_system_organization", StringType(), True)])

udf_getARecords = udf(getARecords, ArrayType(StringType()))
udf_getGeoLite2_Location = udf(getGeoLite2_Location, schema_location)
udf_getGeoLite2_ASN = udf(getGeoLite2_ASN, schema_asn)
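
`getGeoLite2_Location` is defined in `Functions.ipynb` and not shown here. As a hedged sketch, a lookup with the `geoip2` package could be shaped as follows. The names `to_text`, `location_row` and `lookup_location` as well as the database path are assumptions. The attribute names on the response object are those of `geoip2`'s City response.

```python
from types import SimpleNamespace


def to_text(value):
    """schema_location declares every field as StringType, so convert to str."""
    return None if value is None else str(value)


def location_row(response):
    """Flatten a GeoLite2 City response into the five fields of schema_location."""
    return (
        to_text(response.country.iso_code),
        to_text(response.city.name),
        to_text(response.postal.code),
        to_text(response.location.latitude),
        to_text(response.location.longitude),
    )


def lookup_location(ip, db_path="GeoLite2-City.mmdb"):
    """Look up one IP address; db_path is an assumed location of the database file."""
    import geoip2.database  # imported lazily so location_row works without geoip2
    import geoip2.errors

    try:
        with geoip2.database.Reader(db_path) as reader:
            return location_row(reader.city(ip))
    except geoip2.errors.AddressNotFoundError:
        # The address is not contained in the database
        return None


# Stand-in object with the same attribute layout as a City response
fake = SimpleNamespace(
    country=SimpleNamespace(iso_code="DE"),
    city=SimpleNamespace(name="Duisburg"),
    postal=SimpleNamespace(code="47199"),
    location=SimpleNamespace(latitude=51.4328, longitude=6.752),
)
print(location_row(fake))
```

Opening the reader on every call keeps the sketch short. For many lookups, reusing one reader per worker would likely be faster.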


To perform the location and provider lookup, the IP addresses of the MX records must first be resolved. Based on these, the `locations` and the `providers` are determined and the data frame `domains_mx_record_geolite2_df` is generated:

In [16]:
# Create the new columns with the results
domains_mx_record_geolite2_df = domain_records_checked_df.select(domain_records_checked_df.top_level_domain,explode(domain_records_checked_df.mx_record_checked).alias('mx_record_checked'))
domains_mx_record_geolite2_df = domains_mx_record_geolite2_df.withColumn("mx_record_ip", udf_getARecords("mx_record_checked")) \
                            .withColumn('mx_record_ip', explode(col('mx_record_ip'))) \
                            .withColumn("location", udf_getGeoLite2_Location("mx_record_ip")) \
                            .withColumn("asn", udf_getGeoLite2_ASN("mx_record_ip")) \
                            .select("top_level_domain", "mx_record_checked", "mx_record_ip", "location.*", "asn.*")
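
The two `explode` calls produce one row per MX host and then one row per IP address. Spark's `explode` emits no row for a null or empty array, so domains without MX records or without a resolvable MX host drop out of the result. The plain-Python sketch below mimics this behaviour. `explode_rows` is a hypothetical helper and `fake_dns` is a made-up stand-in for the DNS lookup.

```python
def explode_rows(rows, column):
    """Mimic Spark's explode: one row per array element, none for null or empty arrays."""
    for row in rows:
        for element in row[column] or []:
            yield {**row, column: element}


# Made-up lookup table standing in for the resolution of MX host names
fake_dns = {
    "mxf993.netcup.net.": ["46.38.249.147"],
    "mail.0--2.de.": ["46.38.249.147"],
}

domains = [
    {"top_level_domain": "0--2.de",
     "mx_record_checked": ["mxf993.netcup.net.", "mail.0--2.de."]},
    {"top_level_domain": "0-2.de", "mx_record_checked": None},
]

# Step 1: one row per MX host
per_mx = list(explode_rows(domains, "mx_record_checked"))
# Step 2: attach the IP list of each host, then one row per IP address
with_ips = [{**row, "mx_record_ip": fake_dns.get(row["mx_record_checked"])}
            for row in per_mx]
per_ip = list(explode_rows(with_ips, "mx_record_ip"))

for row in per_ip:
    print(row)
```

If domains without MX records should stay in the table, Spark's `explode_outer` keeps such rows with a null value instead of dropping them.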

Before storage in the PostgreSQL database, the first 15 rows of the data frame are displayed as a check:

In [17]:
# Display the data frame
domains_mx_record_geolite2_df.limit(15).toPandas()

                                                                                

Unnamed: 0,top_level_domain,mx_record_checked,mx_record_ip,iso_code,city,postal,latitude,longitude,autonomous_system_organization
0,0-0-0-1.de,smtp-01.tld.t-online.de.,194.25.134.76,DE,Duisburg,47199.0,51.4328,6.752,Deutsche Telekom AG
1,0-0-0-1.de,smtp-02.tld.t-online.de.,194.25.134.12,DE,Duisburg,47199.0,51.4328,6.752,Deutsche Telekom AG
2,0--2.de,mxf993.netcup.net.,46.38.249.147,DE,,,51.2993,9.491,netcup GmbH
3,0--2.de,mail.0--2.de.,46.38.249.147,DE,,,51.2993,9.491,netcup GmbH
4,0-24h.de,mxlb.ispgateway.de.,80.67.18.126,DE,,,51.2993,9.491,Host Europe GmbH
5,0-9a-z.de,mx2fb1.netcup.net.,188.68.47.177,DE,Gifhorn,38518.0,52.4803,10.5526,netcup GmbH
6,0-100kmh.de,smtpin.rzone.de.,81.169.145.97,DE,Cologne,50739.0,50.9771,6.9186,Strato AG
7,0-app.de,mail.0-app.de.,91.184.49.169,NL,Amsterdam,1012.0,52.3759,4.8975,LeaseWeb Netherlands B.V.
8,0-500.de,call.c2.wtf.,5.9.98.91,DE,,,51.2993,9.491,Hetzner Online GmbH
9,0-3.de,teller.ggeg.eu.,85.209.51.51,DE,,,51.2993,9.491,netcup GmbH


In [18]:
# Write the data frame to the PostgreSQL database
domains_mx_record_geolite2_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "domain_mx_record_geolite2") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()

                                                                                

## 4. IPv6 and SOA information fetch

This section fetches IPv6 and nameserver information and creates a ranking of the top ten master nameservers and the companies behind them.
The following block wraps the required functions as user-defined functions.

In [19]:
udf_IPv6Record = udf(IPv6Record, BooleanType())
udf_IPv6Record_error = udf(IPv6Record_error, IntegerType())
udf_getSOAInformation = udf(getSOAInformation, ArrayType(StringType()))
udf_getSOAInformation_error = udf(getSOAInformation_error, IntegerType())
udf_getNameServers = udf(getNameServers, ArrayType(StringType()))
udf_getNameServers_error = udf(getNameServers_error, IntegerType())

### IPv6 information
The following code provides the IPv6 information: the availability of IPv6 (`ipv6_available`, boolean) and the error code of the request, if any (`ipv6_error`). For this purpose, the data frame `domains_ipv6_df` is created.

In [20]:
# Create df for ipv6 data
domains_ipv6_df = domains_df.withColumn("ipv6_available", udf_IPv6Record('top_level_domain'))\
                        .withColumn("ipv6_error", udf_IPv6Record_error("top_level_domain"))\
                        .drop('mx_record').drop('a_record')

domains_ipv6_df.limit(10).toPandas()

Unnamed: 0,top_level_domain,ipv6_available,ipv6_error
0,0-5-1.de,True,0
1,0-24versicherung.de,,1
2,0-1.de,False,2
3,0-3.de,False,2
4,0-263475.de,False,2
5,0-apps.de,False,2
6,0-32.de,True,0
7,0-1000.de,False,2
8,0-100kmh.de,True,0
9,0-2.de,True,0


In [21]:
# Write the data frame to the PostgreSQL database
domains_ipv6_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "ip_v6_information") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()

### SOA information
In addition to the SOA record, the names of the nameservers (`nameservers`, including the master server) are fetched in the following. The variable `nameservers_error` indicates whether problems occurred during the request, whereas `nameservers_count` contains the number of nameservers used per domain.

In [22]:
# Create df for SOA data and drop unnecessary columns
domains_soa_df = domains_df.withColumn("soa_information", udf_getSOAInformation("top_level_domain"))\
                        .withColumn("soa_information_error", udf_getSOAInformation_error("top_level_domain"))\
                        .withColumn("nameservers", udf_getNameServers("top_level_domain"))\
                        .withColumn("nameservers_error", udf_getNameServers_error("top_level_domain"))\
                        .drop('mx_record').drop('a_record')

# Add nameserver count variable
def count_arr(arr): return 0 if arr is None else len(arr)
count_arr_udf = udf(count_arr, IntegerType())
domains_soa_df = domains_soa_df.withColumn("nameservers_count", count_arr_udf('nameservers'))

The details concerning the master nameserver are provided as ArrayType. As this is inconvenient with regard to the data structure (atomicity), the column `soa_information` is converted into the string column `soa_infos_rep`, whose contents are then split into separate variables.
These are the name of the primary master nameserver (`soa_name`), the interval at which the secondary nameservers refresh the SOA record (`refresh`), and the 'Time to live' (TTL), which defines how long a secondary nameserver caches the requested SOA information (`minimum`).

In [23]:
# Change ArrayType<String> into String as preparation for information separation
domains_soa_df = domains_soa_df.withColumn("soa_infos_rep", concat_ws(" ", "soa_information"))

# Split SOA information into separate columns (all String)
split_col = split(domains_soa_df['soa_infos_rep'], ' ')
domains_soa_df = domains_soa_df.withColumn('soa_name', split_col.getItem(0))\
                        .withColumn('refresh', split_col.getItem(3))\
                        .withColumn('minimum', split_col.getItem(6))

# Helping function to catch empty entries concerning the master nameserver
def replace_empty_strings(x):
    return when(col(x) == "", None).otherwise(col(x))

domains_soa_df = domains_soa_df.withColumn("soa_name", replace_empty_strings("soa_name"))
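
The indices 0, 3 and 6 used above match the standard order of the SOA fields: primary nameserver, responsible mailbox, serial, refresh, retry, expire, minimum. Whether `getSOAInformation` returns exactly these seven fields in this order is an assumption, since the function is defined in `Functions.ipynb`. The sketch below uses a hypothetical helper `parse_soa` and made-up example values to show the mapping.

```python
SOA_FIELDS = ("mname", "rname", "serial", "refresh", "retry", "expire", "minimum")


def parse_soa(text):
    """Split SOA record data in presentation format into named fields."""
    parts = text.split()
    if len(parts) != len(SOA_FIELDS):
        # Empty or malformed input
        return None
    record = dict(zip(SOA_FIELDS, parts))
    # All fields after the two names are integers
    for name in SOA_FIELDS[2:]:
        record[name] = int(record[name])
    return record


# Made-up example values in the standard field order
soa = parse_soa("ns1.example.com. hostmaster.example.com. 2021070301 10800 3600 604800 3600")
print(soa["mname"], soa["refresh"], soa["minimum"])
```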

Some fetched entries contain unnecessary characters which need to be removed (dot at the end of the name).

In [24]:
# Remove the trailing dot of each SOA mname (the dot is escaped so that only a literal "." matches)
domains_soa_df = domains_soa_df.withColumn('soa_name', regexp_replace('soa_name', r'\.$', ''))

# Remove the trailing dot of each nameserver entry, leaving names without one untouched
lambda_dot_remove = lambda arr: [x[:-1] if x.endswith('.') else x for x in arr]
def fn_remove_dot(arr): return None if arr is None else lambda_dot_remove(arr)
udf_remove_last_char_in_array = udf(fn_remove_dot, ArrayType(StringType()))

domains_soa_df = domains_soa_df \
    .select("*") \
    .withColumn('nameservers', udf_remove_last_char_in_array(col('nameservers')))
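
When a trailing dot is stripped with a regular expression, the dot has to be escaped. Spark's `regexp_replace` takes a Java regular expression, in which an unescaped `.` matches any character, just as in Python's `re`. The following small sketch with Python's `re` shows the difference on two names, one with and one without a trailing dot.

```python
import re

names = ["ns1.sedoparking.com.", "root-dns.netcup.net"]

# Unescaped dot: matches any final character, so the second name loses its "t"
loose = [re.sub(r".$", "", name) for name in names]
print(loose)

# Escaped dot: only a literal trailing dot is removed
strict = [re.sub(r"\.$", "", name) for name in names]
print(strict)
```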

In [25]:
# Change data type of the time setting columns into int
domains_soa_df = domains_soa_df.withColumn("refresh", domains_soa_df["refresh"].cast(IntegerType()))\
                        .withColumn("minimum", domains_soa_df["minimum"].cast(IntegerType()))\
                        .drop('soa_information').drop('soa_infos_rep')

In [26]:
# Write the data frame to the PostgreSQL database
domains_soa_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "soa") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()

                                                                                

### Top ten master nameservers and related company information
This section creates a data frame of the top ten master nameservers and the companies behind them.

In [27]:
# Count the occurrences of the SOA names
soa_name_count_top_ten_df = domains_soa_df.withColumn('soa_name', (col('soa_name'))) \
        .groupBy('soa_name') \
        .count()

soa_name_count_top_ten_df = soa_name_count_top_ten_df.orderBy(['count'], ascending = [False]).limit(10)
soa_name_count_top_ten_df.limit(5).toPandas()

Unnamed: 0,soa_name,count
0,ns1.sedoparking.com,6
1,,4
2,dns01-tld.t-online.de,3
3,root-dns.netcup.net,2
4,brit.ns.cloudflare.com,1


The company information per master nameserver is added by resolving its IP addresses (`ipv4`). The IP addresses are then passed to the GeoLite2 functions to obtain the company information.

In [28]:
# Remove None entries (company information not available)
soa_name_count_top_ten_df = soa_name_count_top_ten_df.na.drop(subset=["soa_name"])

soa_name_count_top_ten_df = soa_name_count_top_ten_df.withColumn("ipv4", udf_getARecords("soa_name"))\
                      .withColumn('ipv4', explode(col('ipv4'))) \
                      .withColumn("location", udf_getGeoLite2_Location("ipv4")) \
                      .withColumn("asn", udf_getGeoLite2_ASN("ipv4")) \
                      .select("soa_name", "count", "ipv4", "location.*", "asn.*")
#soa_name_count_top_ten_df.limit(5).toPandas()
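
Because `limit(10)` is applied before the rows without a name are dropped, a null group that ranks among the first ten leaves fewer than ten named servers. The following pure-Python sketch with made-up names illustrates the effect of the two orders. `top_n_then_drop` and `drop_then_top_n` are hypothetical helpers, and `n` is set to 2 to keep the example small.

```python
from collections import Counter

# Made-up SOA names; None stands for domains without a resolvable SOA record
soa_names = ["a.example"] * 3 + [None] * 4 + ["b.example"] * 2 + ["c.example"]
counts = Counter(soa_names)


def top_n_then_drop(counts, n):
    """Order used in the notebook: take the top n, then remove the null group."""
    return [(name, c) for name, c in counts.most_common(n) if name is not None]


def drop_then_top_n(counts, n):
    """Alternative order: remove the null group first, then take the top n."""
    named = Counter({name: c for name, c in counts.items() if name is not None})
    return named.most_common(n)


print(top_n_then_drop(counts, 2))
print(drop_then_top_n(counts, 2))
```

In Spark terms, the second variant corresponds to calling `na.drop(subset=["soa_name"])` before `orderBy(...).limit(10)`.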

In [29]:
# Write the data frame to the PostgreSQL database
soa_name_count_top_ten_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "soa_top_ten") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()