# Summary

This notebook is a PySpark exercise: it transforms the input data into an expected output format.

The input data comes from [web_input.csv](../data/web_input.csv).
The expected output data is in [web_output.csv](../data/web_output.csv).

The cells below import the required libraries and apply the PySpark transformations needed to produce the expected output.

# Imports

In [None]:
import os
import sys

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

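# Point Spark at the Python interpreter that runs this notebook
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
# Spark also needs a JVM; this path is specific to a Homebrew install of OpenJDK 17
os.environ['JAVA_HOME'] = '/opt/homebrew/opt/openjdk@17'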

In [None]:
spark = SparkSession.builder.appName("web_input").getOrCreate()

# Read CSV

In [None]:
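# header=True takes column names from the first row; without inferSchema every column is read as a string
input_df = spark.read.csv('../data/web_input.csv', header=True)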

In [None]:
input_df.show()

In [None]:
input_df.printSchema()

# Create Output DF

## Date Generated, AllimID, Name

In [None]:
# DataFrames are immutable, so input_df is never modified; this just starts a separate name for the output pipeline
output_df = input_df.select('*')

In [None]:
output_df = output_df.withColumn('Date Generated', F.lit('2022-07-14 14:35:04')) \
    .withColumnsRenamed({'integerId': 'AllimID', 'name': 'Name'})

## Entity Type

In [None]:
help(F.concat)

In [None]:
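# Keep the part of 'type' after '#' and wrap it in double quotes (F.split_part requires Spark 3.5+)
output_df = output_df.withColumn('Entity Type', F.concat(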
    F.lit("\""),
    F.split_part(F.col('type'), F.lit('#'), F.lit(2)),
    F.lit("\"")
))
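
The `F.split_part` / `F.concat` combination above can be illustrated without a Spark session. This is a minimal plain-Python sketch of the semantics as I understand them from the Spark documentation (1-based part index, empty string when the requested part does not exist, nulls propagating through). The helper names and the sample values are hypothetical and are not taken from `web_input.csv`.

```python
def split_part(value, delimiter, part_num):
    """Plain-Python approximation of Spark's split_part (illustrative only)."""
    if value is None:
        return None  # Spark returns NULL when an input is NULL
    if part_num == 0:
        raise ValueError("part_num must not be 0")
    # An empty delimiter leaves the string unsplit
    parts = value.split(delimiter) if delimiter else [value]
    # Positive indexes count from the start (1-based), negative ones from the end
    index = part_num - 1 if part_num > 0 else len(parts) + part_num
    return parts[index] if 0 <= index < len(parts) else ""


def quoted_fragment(value):
    """Mirror the cell above: keep the text after '#' and wrap it in quotes."""
    fragment = split_part(value, "#", 2)
    # F.concat yields NULL if any of its arguments is NULL
    return None if fragment is None else f'"{fragment}"'


# Hypothetical sample values, not taken from web_input.csv
print(quoted_fragment("http://example.org/ontology#Organization"))
print(quoted_fragment("no-hash-here"))
```

Under these assumptions, a value without a `#` produces a pair of empty quotes rather than a null, which may be worth checking against the real data.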

## Risk Labels

In [None]:
# Staging column: the part of 'riskLabel' after '#', wrapped in double quotes
output_df = output_df.withColumn('risk_labels_stage', F.concat(
    F.lit("\""),
    F.split_part(F.col('riskLabel'), F.lit('#'), F.lit(2)),
    F.lit("\"")
))

In [None]:
# Collapse any staged label that contains 'CnForcedLabor' into the single label "CnForcedLabor"
output_df = output_df.withColumn('Risk Labels',
                                 F.when(
                                     F.contains(F.col('risk_labels_stage'), F.lit('CnForcedLabor')),
                                     F.lit('"CnForcedLabor"')
                                 ).otherwise(F.col('risk_labels_stage')))
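
The `F.when(...).otherwise(...)` expression above amounts to a simple conditional. The following plain-Python sketch mirrors it, assuming that a null label fails the `contains` condition and therefore passes through unchanged. The function name and the sample labels are hypothetical.

```python
def normalize_risk_label(staged):
    """Plain-Python mirror of the when/otherwise expression (illustrative only)."""
    if staged is not None and "CnForcedLabor" in staged:
        # Every variant that mentions CnForcedLabor maps to one canonical label
        return '"CnForcedLabor"'
    # Other labels, and NULLs, are returned unchanged
    return staged


# Hypothetical staged values, not taken from web_input.csv
for staged in ['"CnForcedLaborVariant"', '"OtherLabel"', None]:
    print(staged, "->", normalize_risk_label(staged))
```

Normalizing variants to one label before grouping means they collapse into a single entry when the labels are later collected into a set.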

## Primary City, Primary Country

In [None]:
output_df = (output_df
             # Nulls are kept as-is; wrap the column in F.coalesce(..., F.lit('')) to write empty strings instead
             .withColumn('Primary City', F.col('cityAndRegion'))
             .withColumnRenamed('country', 'Primary Country'))

# Group values

In [None]:
# One row per entity; collect_set gathers the distinct entity types and risk labels of each group
output_df_grouped = output_df.groupBy(['Date Generated', 'AllimID', 'Name', 'Primary City', 'Primary Country']).agg(
    F.collect_set('Entity Type').alias('Entity Type'),
    F.collect_set('Risk Labels').alias('Risk Labels')
).select(['Date Generated',
          'AllimID',
          'Name',
          'Entity Type',
          'Risk Labels',
          'Primary City',
          'Primary Country'])

# Cast list to str

The CSV data source cannot write columns of type `ARRAY<STRING>`, so `Entity Type` and `Risk Labels` are sorted and cast to strings before writing.

In [None]:
final_df_list_str = output_df_grouped.withColumns({
    'Entity Type': F.sort_array(F.col('Entity Type')).cast('string'),
    'Risk Labels': F.sort_array(F.col('Risk Labels')).cast('string')
})
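
For reference, here is a plain-Python sketch of the string that `sort_array(...).cast('string')` is expected to produce. It assumes the Spark 3.x rendering of arrays as a bracketed list separated by a comma and a space. The function name and the sample values are hypothetical, and the exact format should be confirmed against the Spark version in use.

```python
def array_to_string(values):
    """Approximate sort_array(col).cast('string') for an array of strings."""
    if values is None:
        return None
    # Sort ascending, then render as [a, b, c]
    return "[" + ", ".join(sorted(values)) + "]"


# Hypothetical set of already-quoted entity types
print(array_to_string({'"Vessel"', '"Organization"'}))
```

Because each element was wrapped in double quotes earlier, the resulting string resembles a JSON-style list of quoted values.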

# Write CSV

In [None]:
(
    final_df_list_str
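    # AllimID was read as a string, so this is a case-insensitive lexicographic sort
    .sort(F.lower(F.col('AllimID')))
    .write
    # Spark writes a directory named my_output.csv that contains one or more part files
    .mode('overwrite').csv('../data/my_output.csv', header=True, escape='"')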
)
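
Since Spark writes a directory of part files rather than a single CSV, comparing the result with the expected output takes a small extra step. The sketch below is one possible way to do it with the standard library only. The helper names `read_rows` and `same_rows` are hypothetical, and it assumes that every part file starts with a header row and that row order does not matter for the comparison.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read data rows from a single CSV file or from a Spark output directory."""
    path = Path(path)
    files = sorted(path.glob("part-*.csv")) if path.is_dir() else [path]
    rows = []
    for file in files:
        with file.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row of this file
            rows.extend(tuple(row) for row in reader)
    return rows


def same_rows(actual_path, expected_path):
    """Return True when both sources hold the same rows, ignoring row order."""
    return sorted(read_rows(actual_path)) == sorted(read_rows(expected_path))
```

In this notebook it could be called as `same_rows('../data/my_output.csv', '../data/web_output.csv')`. If the row order of the expected file matters, compare the two lists from `read_rows` directly instead.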