# 1. Checking the redirections

This notebook determines the redirect target and the HTTP status code for each domain listed in the `top_level_domain` column.

## Setup

First, the required packages are installed and imported, the Spark session is initialized, and the data is loaded from the PostgreSQL database:

In [None]:
# Required installations
!pip install ipynb
!pip install dnspython
!pip install geoip2

# Imports
from pyspark.sql import SparkSession
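# Import only the names used below; a wildcard import would shadow builtins such as sum and max
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType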

# PostgreSQL access data
host = "bda_gr4_database"
port = "5432"
database = "domainanalysis"
user = "postgres"
password = "postgres"

# PostgreSQL connection url
connection = f"jdbc:postgresql://{host}:{port}/{database}"

# Create a Spark session
spark = SparkSession.builder \
    .appName("domain_analysis") \
    .getOrCreate()

# Read data from the database
domains_df = spark.read \
                .format("jdbc") \
                .option("url", connection) \
                .option("dbtable", "domain") \
                .option("user", user) \
                .option("password", password) \
                .load()

# Display the data frame
domains_df.limit(15).toPandas()
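
The access data above is hard-coded in the notebook. As a minimal sketch, the same values could be read from environment variables, with the current values as fallbacks. The variable names (`PG_HOST`, `PG_PORT`, `PG_DATABASE`, `PG_USER`, `PG_PASSWORD`) and the helper `build_jdbc_settings` are assumptions for illustration and do not appear in the original notebook:

```python
import os


def build_jdbc_settings(env=None):
    """Return (jdbc_url, properties) built from environment variables.

    The variable names are hypothetical; the defaults mirror the
    hard-coded values used in this notebook.
    """
    env = os.environ if env is None else env
    host = env.get("PG_HOST", "bda_gr4_database")
    port = env.get("PG_PORT", "5432")
    database = env.get("PG_DATABASE", "domainanalysis")
    properties = {
        "user": env.get("PG_USER", "postgres"),
        "password": env.get("PG_PASSWORD", "postgres"),
    }
    return f"jdbc:postgresql://{host}:{port}/{database}", properties


connection, properties = build_jdbc_settings()
```

The returned URL and the `user` and `password` entries can then be passed to the same `.option(...)` calls that the read and write steps already use.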

## Determination of redirects and HTTP status codes

Next, the helper functions are imported from the notebook `Functions.ipynb` and wrapped as `UDFs`:

In [None]:
# Import from Functions.ipynb
from ipynb.fs.full.Functions import *

# Create the UDFs
udf_getRedirectUrl = udf(getRedirectUrl, StringType())
udf_getStatusCodeUrl = udf(getStatusCodeUrl, StringType())
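
A UDF that raises an exception fails the whole Spark task, and an unreachable domain is an easy way to trigger that. Whether `getRedirectUrl` and `getStatusCodeUrl` already handle such errors cannot be seen here, because `Functions.ipynb` is not shown. As a hedged sketch, a small wrapper (the name `none_on_error` is hypothetical) could turn any exception into `None`, which Spark stores as a null value:

```python
import functools


def none_on_error(func):
    """Wrap func so that any exception yields None instead of propagating."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # A failed lookup becomes a null in the resulting column.
            return None
    return wrapper


# Stand-in for a lookup function, used only to demonstrate the wrapper.
def failing_lookup(domain):
    raise ConnectionError(f"cannot reach {domain}")


safe_lookup = none_on_error(failing_lookup)
result = safe_lookup("example.invalid")  # None, no exception raised
```

In the notebook, the wrapper would be applied before the UDF is created, for example `udf(none_on_error(getRedirectUrl), StringType())`.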

In this block, the redirects and HTTP status codes are determined by calling the UDFs with the `top_level_domain` column as the argument. The results are stored in the new columns `redirection` and `status_code`. Because Spark evaluates lazily, the lookups only run once an action is triggered, such as the display or the database write further down:

In [None]:
# Create the new columns with the results
domains_redirect_df = domains_df.withColumn("redirection", udf_getRedirectUrl("top_level_domain"))
domains_redirect_df = domains_redirect_df.withColumn("status_code", udf_getStatusCodeUrl("top_level_domain"))
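
Calling two separate UDFs on the same column presumably means that each domain is contacted twice, once per function, although this depends on how `Functions.ipynb` implements them. The sketch below shows one way to obtain both values from a single request. Everything in it is an assumption for illustration: the names `probe_domain` and `fetch_with_urllib`, the use of the standard library's `urllib`, the `http://` scheme, the timeout, and the choice to report a redirection only when the final URL differs from the starting URL. The fetch function is passed in as a parameter so that the logic can be tried without network access:

```python
import urllib.error
import urllib.request


def fetch_with_urllib(url, timeout=5.0):
    """Request url, following redirects; return (final_url, status_code)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.geturl(), response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses still carry a URL and a status code.
        return error.geturl(), error.code


def probe_domain(domain, fetch=fetch_with_urllib):
    """Return (redirection, status_code) as strings, or (None, None)."""
    if not domain:
        return None, None
    start_url = f"http://{domain}"
    try:
        final_url, status = fetch(start_url)
    except (OSError, ValueError):
        # DNS failures, refused connections, timeouts, malformed URLs.
        return None, None
    # Report a redirection only if the final URL differs from the start.
    redirection = final_url if final_url.rstrip("/") != start_url else None
    return redirection, str(status)


# Offline demonstration with a fake fetch function instead of the network.
def fake_fetch(url):
    return "https://www.example.com/", 200


print(probe_domain("example.com", fetch=fake_fetch))
# prints ('https://www.example.com/', '200')
```

The status code is returned as a string to match the `StringType()` used for the UDFs above. In Spark, a function like this could back a single UDF with a struct return type, so that both columns come from one call per domain.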

## Store the data

Finally, the columns `a_record` and `mx_record` are dropped from the data frame. The first 15 rows are then displayed as a check before the result is written to the database:

In [None]:
# Remove columns
domains_redirect_df = domains_redirect_df.drop("a_record", "mx_record")
domains_redirect_df.limit(15).toPandas()

In [None]:
# Write the data frame to the PostgreSQL database
domains_redirect_df.repartition(8).write \
    .format("jdbc") \
    .option("url", connection) \
    .option("dbtable", "domain_redirection") \
    .option("user", user) \
    .option("batchsize", 10000) \
    .option("password", password) \
    .mode("append") \
    .save()