# Task 1 - Intro

The following tasks provide an initial orientation and introduce some basic PySpark functions.

In [1]:
# Setting up a Spark Session
from pyspark.sql import SparkSession
import pyspark
import pyspark.sql.functions as F
from pyspark.sql.types import NumericType

spark = SparkSession.builder \
    .appName("PySpark Intro Tasks") \
    .getOrCreate()

1. Read the dataset dataset1 from the data directory and check the schema. How many different datatypes appear in the dataset?

In [35]:
#Task 1 - Code here

<details>
<summary>Solution - 1</summary>
<code>
df = spark.read.csv("../data/GroceryDataset.csv", header=True, inferSchema=True)<br>
# Get datatypes - Option 1:
df.printSchema()
<br>
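# Get datatypes - Option 2 (df.dtypes is a list of (column, type) pairs, so collect the distinct type names):
distinct_dataTypes = {dtype for _, dtype in df.dtypes}
print(len(distinct_dataTypes))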
</code>
</details>
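
Counting the distinct types from `df.dtypes` can be sketched in plain Python, since `df.dtypes` is an ordinary list of `(column, type)` pairs. The helper name `distinct_dtypes` and the sample pairs below are illustrative assumptions, not values from the actual dataset.

```python
def distinct_dtypes(dtypes):
    """Return the set of distinct type names from (column, dtype) pairs."""
    return {dtype for _name, dtype in dtypes}

# Hypothetical sample in the shape of df.dtypes; the real columns and types may differ.
sample_dtypes = [("Title", "string"), ("Price", "string"), ("Quantity", "int")]
distinct_types = distinct_dtypes(sample_dtypes)
print(len(distinct_types), sorted(distinct_types))
```

With a real DataFrame the call would be `distinct_dtypes(df.dtypes)`.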

2. Display the first 5 and the last 5 rows of the dataset. What might be the reason that a PySparkValueError occurs when you try to show the last 5 rows?

In [3]:
#Task 2 - Code here

<details>
<summary>Solution - 2</summary>
<code>
#Display the first 5 rows:
spark.createDataFrame(df.head(5)).show()
<br>
#Display the last 5 rows:
spark.createDataFrame(df.tail(5)).show()
# This raises a PySparkValueError because the last 17 rows of the column "Rating" are all NULL, so Spark cannot infer the data type of this column from the 5 returned rows.
# To display the last rows this way, set n=18 instead of n=5 so that at least one non-NULL rating is included.
</code>
</details>
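
An alternative to increasing `n` is to skip type inference altogether by passing the schema of the original DataFrame to `createDataFrame`. The following is a minimal sketch under that assumption; the helper name `tail_as_dataframe` is hypothetical.

```python
def tail_as_dataframe(spark, df, n):
    """Rebuild the last n rows as a DataFrame, reusing df's schema instead of inferring one."""
    rows = df.tail(n)  # tail() collects the last n rows to the driver as a list of Row objects
    # With an explicit schema, all-NULL columns no longer block type inference.
    return spark.createDataFrame(rows, schema=df.schema)
```

Usage would then be `tail_as_dataframe(spark, df, 5).show()`.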

3. Count the total (distinct) number of records. Then, drop all rows containing 'NULL' values. Now, count the total (distinct) number of records again. <br> What is the problem if you simply drop all rows containing 'NULL' values?

In [44]:
#Task 3 - Code here

<details>
<summary>Solution - 3</summary>
<code>
total_number = df.distinct().count()
print(total_number)
total_number_noNULL = df.dropna().distinct().count()
print(total_number_noNULL)
<br># The dataset is significantly smaller after dropping all rows containing 'NULL' values. This can directly affect all further data analysis and machine learning training, because important rows might be dropped in the process.
</code>
</details>

4. Clean Dataset/Data Preparation - Part 1: Address missing values (Null-Values)
    * In the column 'Currency' replace all 'NULL' with the given Currency symbol.
    * In the column 'Discount' replace all 'NULL' with 'No Discount' instead.
    * In the column 'Title' replace all 'NULL' with 'No Title'.
    * In the column 'Feature' replace all 'NULL' with 'No Feature'.
    * In the column 'Product Description' replace all 'NULL' with 'No Description'.

In [30]:
#Task 4 - Code here

<details>
<summary>Solution - 4</summary>
<code>
df = df.withColumn('Currency', F.when(F.col('Currency').isNull(),'$').otherwise(F.col('Currency'))) \
.withColumn('Discount', F.when(F.col('Discount').isNull(),'No Discount').otherwise(F.col('Discount'))) \
.withColumn('Title',F.when(F.col('Title').isNull(),'No Title').otherwise(F.col('Title'))) \
.withColumn('Feature',F.when(F.col('Feature').isNull(),'No Feature').otherwise(F.col('Feature'))) \
.withColumn('Product Description',F.when(F.col('Product Description').isNull(),'No Description').otherwise(F.col('Product Description')))
</code>
</details>
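
The same replacements can also be expressed more compactly with `DataFrame.fillna`, which accepts a dictionary that maps column names to replacement values. This is a sketch that assumes all five columns are string-typed; the names `NULL_REPLACEMENTS` and `fill_missing_text` are illustrative.

```python
# Replacement values from the task description, keyed by column name.
NULL_REPLACEMENTS = {
    "Currency": "$",
    "Discount": "No Discount",
    "Title": "No Title",
    "Feature": "No Feature",
    "Product Description": "No Description",
}

def fill_missing_text(df, replacements=NULL_REPLACEMENTS):
    """Replace nulls column by column using DataFrame.fillna with a dict."""
    return df.fillna(replacements)
```

Usage would be `df = fill_missing_text(df)`.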

5. Clean Dataset/Data Preparation - Part 2:<br> In the column 'Rating' extract the rating that is written after "Rated". Use the regular expression "\d+(\.\d+)?" to extract only the number. If no number can be extracted, the value should be 0.

In [31]:
#Task 5 - Code here

<details>
<summary>Solution - 5</summary>
<code>
df = df.withColumn('Rating', F.regexp_extract(F.col('Rating'), r"\d+(\.\d+)?", 0)) \
.withColumn('Rating', 
            F.when(F.col('Rating').isNull(), 0)
            .when(F.col('Rating')=="",0)
            .otherwise(F.col('Rating')))
</code>
</details>
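
To see what the regular expression matches, the same logic can be tried with Python's `re` module outside of Spark. The rating strings below are made-up examples, and `extract_rating` is a hypothetical helper; for this simple pattern, Spark's `regexp_extract` with group index 0 should behave the same way.

```python
import re

RATING_PATTERN = re.compile(r"\d+(\.\d+)?")

def extract_rating(text):
    """Return the first number found in text as a string, or "0" if there is none."""
    if text is None:
        return "0"
    match = RATING_PATTERN.search(text)
    return match.group(0) if match else "0"

# Made-up rating strings for illustration
print(extract_rating("Rated 4.3 out of 5 stars based on 265 reviews."))  # 4.3
print(extract_rating("No Reviews"))                                       # 0
print(extract_rating(None))                                               # 0
```

Because the search returns the first match, the number directly after "Rated" is picked up and later numbers in the string are ignored.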

6. Clean Dataset/Data Preparation - Part 3:<br>Remove all non-numeric string values in the column 'Price'. Additionally, remove the $ symbol and cast the price to a numerical type. Then calculate the mean of the column 'Price', round the mean value to two decimals and replace all 'NULL' values in the column with the calculated mean.

In [34]:
# Task 6 - Code here

<details>
<summary>Solution - 6</summary>
<code>
# Keep only values that start with "$" and strip the symbol; everything else becomes NULL
df = df.withColumn('Price', F.when(F.col('Price').startswith("$"), F.substring_index(F.col('Price'), "$", -1))) \
.withColumn('Price', F.col('Price').cast('float'))
<br>
# Calculate the mean (NULL values are ignored by F.mean) and replace all NULL values with it
price_mean_value = round(df.agg(F.mean(F.col('Price'))).collect()[0][0], 2)
df = df.withColumn('Price', F.when(F.col('Price').isNull(), price_mean_value).otherwise(F.col('Price'))) \
.withColumn('Price', F.round(F.col('Price'), 2))
</code>
</details>
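
The order of the steps matters: if unparseable prices are turned into a placeholder such as 0 before the mean is computed, the placeholders pull the mean down. The plain-Python sketch below illustrates this with made-up values; `parse_price` and `fill_with_mean` are hypothetical helpers, not part of the notebook.

```python
def parse_price(raw):
    """Convert a string like "$56.99" to a float; return None for anything else."""
    if raw is None or not raw.startswith("$"):
        return None
    try:
        return float(raw[1:])
    except ValueError:
        return None

def fill_with_mean(raw_prices):
    """Parse prices, compute the mean over valid values only, and fill the gaps with it."""
    parsed = [parse_price(p) for p in raw_prices]
    valid = [p for p in parsed if p is not None]
    mean = round(sum(valid) / len(valid), 2) if valid else None
    return [mean if p is None else p for p in parsed], mean

# Made-up sample: two valid prices, one text value, one missing value
filled, mean = fill_with_mean(["$10.00", "$20.00", "through", None])
print(mean)    # 15.0
print(filled)  # [10.0, 20.0, 15.0, 15.0]
```

Had the two invalid entries been counted as 0, the mean of this sample would be 7.5 instead of 15.0.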

7. Filter the dataset for rows where condition X is fulfilled.

In [None]:
#Task 7 - Code here

<details>
<summary>Solution - 7</summary>
<code>
df = df.filter()
<br>
<br>
</code>
</details>

8. Aggregate the dataset: Calculate the sum and the mean of the columns x and y.

In [36]:
#Task 8 - Code here

<details>
<summary>Solution - 8</summary>
<code>
# Aggregate dataframe - Option 1:<br>
agg_df = df.agg(
    F.sum(F.col("numeric_column")),
    F.mean(F.col("another_numeric_column"))
)
<br>
<br>
# Aggregate dataframe - Option 2:<br>
agg_df = df.agg(
    {"numeric_column": "sum",
     "another_numeric_column": "mean"}
     )
</code>
</details>

9. Group the dataset by the categorical column x and calculate the sum of y.

In [37]:
#Task 9 - Code here

<details>
<summary>Solution - 9</summary>
<code>
<br>
<br>
</code>
</details>
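
Since the concrete columns are left open here, the following is only a minimal sketch of the grouping pattern. The helper name `sum_by_category` is hypothetical, and the column names are passed in as placeholders.

```python
def sum_by_category(df, category_col, value_col):
    """Group df by category_col and sum value_col within each group."""
    # GroupedData.sum names the result column "sum(<value_col>)".
    return df.groupBy(category_col).sum(value_col)
```

A call could look like `sum_by_category(df, "x", "y").show()`, with `"x"` and `"y"` replaced by a categorical and a numeric column of the dataset.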

10. Left Join the dataset with dataset2 based on a common key which you have to identify yourself.

In [38]:
#Task 10 - Code here

<details>
<summary>Solution - 10</summary>
<code>
<br>
<br>
</code>
</details>
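
The common key still has to be identified from the two datasets, so the sketch below only shows the shape of a left join. The helper name `left_join_on_key` and its `key` argument are placeholders.

```python
def left_join_on_key(left_df, right_df, key):
    """Keep every row of left_df and attach matching columns from right_df."""
    # Rows of left_df without a match in right_df get NULL in the right-hand columns.
    return left_df.join(right_df, on=key, how="left")
```

Passing the key as a column name (rather than an equality expression) keeps a single copy of the key column in the result.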

11. Write and save the processed dataset to an output file "GroceryDataset_solution" as a .parquet-file.

In [39]:
# Task 11 - Code here

<details>
<summary>Solution - 11</summary>
<code>
df.write.mode("overwrite").parquet("../data/GroceryDataset_solution.parquet")
</code>
</details>

Please note that this setup is solely for training purposes. In reality, the dataset is unevenly weighted and not suitable for tasks like Machine Learning. This training is intended to provide an initial understanding of how to use PySpark and perform basic data cleaning steps.